# Absenteeism at work

## Overview

The task is to build a model that predicts, from a set of inputs, how likely an employee is to be absent from work during normal working hours. To that end, data on employee characteristics and past absenteeism is provided.

This second notebook covers the modelling step: it takes the data that has already been cleaned and preprocessed and builds the regression model.

## Importing the relevant libraries

In [None]:
import numpy as np
import pandas as pd

In [None]:
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')

Since the goal is to predict whether or not an employee is expected to be absent from work, a logistic regression is a natural choice: the dependent variable (outcome) takes one of two values, 1 or 0, yes or no. In other words, the goal is not to predict how long someone will be absent, but how likely the absence is.
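The link between a linear score and a probability can be sketched as follows. This is a minimal illustration, not code from the notebook: the `sigmoid` helper and the `log_odds` values are hypothetical and chosen only to show the shape of the mapping.

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical log-odds values, chosen only for illustration
log_odds = np.array([-2.0, 0.0, 2.0])
probabilities = sigmoid(log_odds)

# A probability above 0.5 is assigned to class 1, otherwise to class 0
predicted_class = (probabilities > 0.5).astype(int)

print(probabilities.round(3))
print(predicted_class)
```

A log-odds value of 0 corresponds to a probability of exactly 0.5, and values symmetric around 0 give probabilities that sum to 1.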

The proposed approach is to split absenteeism into two groups: moderately absent and highly absent. To do so, we take the median of 'Absenteeism Time in Hours' as the cutoff and classify observations above it as highly absent (1 / yes) and observations at or below it as moderately absent (0 / no). This new variable is then used as the target to train the model.
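A small toy example may help to see how the median rule treats ties. The `hours` values below are hypothetical and are not taken from the dataset; only the thresholding logic mirrors the approach described here.

```python
import numpy as np
import pandas as pd

# Hypothetical absence durations in hours, for illustration only
hours = pd.Series([1, 2, 3, 3, 8, 40], name="Absenteeism Time in Hours")

# Median of six sorted values: the average of the two middle ones, (3 + 3) / 2
cutoff = hours.median()

# Strictly above the cutoff -> 1, everything else (including ties) -> 0
labels = np.where(hours > cutoff, 1, 0)

print(cutoff)
print(labels)
print(labels.mean())  # share of observations labelled 1
```

Because values equal to the median fall into class 0, the two classes are not guaranteed to be exactly balanced, so it is worth checking the share of 1s in the real data.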

Targets:

- Above the median -> 1
- At or below the median -> 0

In [None]:
# Defining the median as the cut-off line
cutoff = data_preprocessed['Absenteeism Time in Hours'].median()
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > cutoff, 1, 0)
data_preprocessed['Excessive Absenteeism'] = targets

# Dropping 'Absenteeism Time in Hours', since it will no longer be used.
data_preprocessed = data_preprocessed.drop(['Absenteeism Time in Hours'], axis=1)

In [None]:
# Creating a checkpoint
data_1 = data_preprocessed.copy()

## Standardizing the data

In [None]:
# Splitting the data into inputs and targets
unscaled_inputs = data_1.iloc[:,0:-1]
targets = data_1.iloc[:,-1]

In [None]:
# Standardize the inputs, except for the dummy variables.
# Standardizing the inputs is good practice and usually helps the model. However, if
# we apply it to dummy variables, the resulting weights become harder to interpret.
# Leaving the dummies unscaled may cost some accuracy, but makes the weights
# obtained later easier to explain.

from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin

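class CustomScaler(BaseEstimator, TransformerMixin):
    """Standardize only the given columns and leave the remaining ones untouched."""

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # StandardScaler takes keyword-only arguments in recent scikit-learn versions
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # Per-column mean and (population) variance of the scaled columns
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Keep the original index so that the concat below aligns row by row
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]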


In [None]:
# Extracting column names
unscaled_inputs.columns.values

array(['Reasons_1', 'Reasons_2', 'Reasons_3', 'Reasons_4', 'Week Day',
       'Month', 'Transportation Expense', 'Distance to Work',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Age_1', 'Age_2', 'Age_3', 'Age_4'],
      dtype=object)

In [None]:
# The dummies should not be standardized
columns_to_omit = ['Reasons_1', 'Reasons_2', 'Reasons_3', 'Reasons_4','Education', 'Age_1', 'Age_2', 'Age_3', 'Age_4']

In [None]:
# List comprehension selecting the features to scale
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [None]:
# Scaling process
scaler = CustomScaler(columns_to_scale)
scaler.fit(unscaled_inputs)
scaled_inputs = scaler.transform(unscaled_inputs);
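The effect of scaling only some columns can be checked on a small example. The sketch below is an assumption-laden illustration: the `toy` frame and its values are hypothetical, and it applies `StandardScaler` directly to the selected columns rather than going through `CustomScaler`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: one dummy column and two numeric columns
toy = pd.DataFrame({
    "Reasons_1": [0, 1, 0, 1],
    "Month": [1.0, 4.0, 7.0, 10.0],
    "Pets": [0.0, 2.0, 1.0, 1.0],
})
numeric_columns = ["Month", "Pets"]

# Scale only the numeric columns and write them back into a copy
toy_scaled = toy.copy()
toy_scaled[numeric_columns] = StandardScaler().fit_transform(toy[numeric_columns])

# Scaled columns have mean 0 and population standard deviation 1
print(toy_scaled[numeric_columns].mean().round(6))
print(toy_scaled[numeric_columns].std(ddof=0).round(6))
# The dummy column is left exactly as it was
print(toy_scaled["Reasons_1"].tolist())
```

Note that `StandardScaler` divides by the population standard deviation, which is why `ddof=0` is used in the check.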



In [None]:
scaled_inputs

| | Reasons_1 | Reasons_2 | Reasons_3 | Reasons_4 | Week Day | Month | Transportation Expense | Distance to Work | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Age_1 | Age_2 | Age_3 | Age_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | -0.683704 | 0.182726 | 1.005844 | 0.412816 | -0.806331 | 0.767431 | 0 | 0.880469 | 0.268487 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | -0.683704 | 0.182726 | -1.574681 | -1.141882 | -0.806331 | 1.002633 | 0 | -0.019280 | -0.589690 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 | -0.007725 | 0.182726 | -0.654143 | 1.426749 | -0.806331 | 1.002633 | 0 | -0.919030 | -0.589690 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0.668253 | 0.182726 | 0.854936 | -1.682647 | -0.806331 | -0.643782 | 0 | 0.880469 | -0.589690 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0.668253 | 0.182726 | 1.005844 | 0.412816 | -0.806331 | 0.767431 | 0 | 0.880469 | 0.268487 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 695 | 1 | 0 | 0 | 0 | -0.007725 | -0.388293 | -0.654143 | -0.533522 | -0.853789 | -1.114186 | 1 | 0.880469 | -0.589690 | 0 | 0 | 1 | 0 |
| 696 | 1 | 0 | 0 | 0 | -0.007725 | -0.388293 | 0.040034 | -0.263140 | -0.853789 | -0.643782 | 0 | -0.019280 | 1.126663 | 1 | 0 | 0 | 0 |
| 697 | 1 | 0 | 0 | 0 | 0.668253 | -0.388293 | 1.624567 | -0.939096 | -0.853789 | -0.408580 | 1 | -0.919030 | -0.589690 | 1 | 0 | 0 | 0 |
| 698 | 0 | 0 | 0 | 1 | 0.668253 | -0.388293 | 0.190942 | -0.939096 | -0.853789 | -0.408580 | 1 | -0.919030 | -0.589690 | 1 | 0 | 0 | 0 |


## Splitting train and test

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# An 80/20 train/test split is used, which is a common choice
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size = 0.8, random_state = 20)
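The role of `train_size` and `random_state` can be illustrated on a tiny array. This is a sketch on hypothetical data (`X` and `y` below are made up): it only shows that 80% of the rows go to training and that a fixed seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with 2 features and a binary target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Two calls with the same seed
split_a = train_test_split(X, y, train_size=0.8, random_state=20)
split_b = train_test_split(X, y, train_size=0.8, random_state=20)

x_tr, x_te, y_tr, y_te = split_a
print(x_tr.shape, x_te.shape)

# The same seed yields an identical split
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Fixing the seed keeps the reported train and test scores reproducible between runs of the notebook.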

## Building the model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
reg = LogisticRegression()
reg.fit(x_train,y_train);

LogisticRegression()

In [None]:
reg.score(x_train,y_train)

0.7589285714285714
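Since the stated goal is to estimate how likely an absence is, it may be useful to look at probabilities rather than only at accuracy. The sketch below uses synthetic, hypothetical data (not the absenteeism dataset) to show how `predict_proba` relates to `predict` for a fitted `LogisticRegression`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, synthetic data: one feature, larger values lean toward class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

new_points = np.array([[-1.5], [0.0], [1.5]])
proba = model.predict_proba(new_points)  # columns follow model.classes_
labels = model.predict(new_points)       # hard 0/1 decisions

print(model.classes_)
print(proba[:, 1].round(3))  # estimated probability of class 1
print(labels)
```

Each row of `proba` sums to 1, and `predict` returns class 1 whenever the estimated probability of class 1 exceeds 0.5.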

## Summary

In [None]:
feature_name = unscaled_inputs.columns.values

In [None]:
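# One row per feature, with its fitted coefficient
summary = pd.DataFrame(columns=['Feature'], data=feature_name)
summary['Coefficient'] = np.transpose(reg.coef_)
# Shift the index by one so that the intercept can be placed at position 0
summary.index = summary.index + 1
summary.loc[0] = ['Intercept', reg.intercept_[0]]
summary = summary.sort_index()
# The odds ratio is the exponential of the coefficient
summary['Odds Ratio'] = np.exp(summary['Coefficient'])
# To rank features by effect instead, use summary.sort_values('Odds Ratio', ascending=False)
summary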

| | Feature | Coefficient | Odds Ratio |
|---|---|---|---|
| 0 | Intercept | -1.677692 | 0.186805 |
| 1 | Reasons_1 | 2.924136 | 18.618124 |
| 2 | Reasons_2 | 1.020202 | 2.773754 |
| 3 | Reasons_3 | 3.082583 | 21.814682 |
| 4 | Reasons_4 | 0.967621 | 2.631677 |
| 5 | Week Day | -0.073165 | 0.929447 |
| 6 | Month | 0.177992 | 1.194815 |
| 7 | Transportation Expense | 0.8141 | 2.257143 |
| 8 | Distance to Work | -0.163289 | 0.849346 |
| 9 | Daily Work Load Average | -0.01663 | 0.983508 |


## Takeaways

The trained model returns one coefficient per input. Because the non-dummy inputs were standardized, the further a coefficient is from 0, the more weight that input carries in the prediction. We can see that day of the week, month, daily work load and even body mass index carry little weight.

Exponentiating a coefficient gives an odds ratio relative to the dummy baseline, which is the case where no reason for absenteeism is given. For example, if an employee has food poisoning (reason group 3), the odds of excessive absence are about 22 times higher than at the baseline. On its own, this information about reasons seems obvious, but more interesting observations can be drawn. A one-standard-deviation increase in transportation expense more than doubles the odds of excessive absence (odds ratio of about 2.26), so it is in the company's best interest to reduce this burden and thereby increase productivity.

Employees in the second age group (33-39 years) have a higher probability of being absent than the baseline (27 years). A possible explanation is that at this age it is very common to have young children who need constant care. This is consistent with the fact that having a child increases the odds of being absent. If that is the case, the company could invest in parenting aid, such as daycare assistance. It is also clear that younger employees are much less likely to be absent.

Grouping ages and reasons during the early preprocessing of the dataset proved to be an efficient way of drawing conclusions, since not doing so would have left a huge number of variables to analyse.
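As a numerical check of the odds-ratio reading used in these takeaways, the arithmetic can be reproduced by hand. The coefficients below are copied from the summary table; the `odds_to_probability` helper is a hypothetical name, and the baseline assumes all dummies at 0 and all standardized inputs at their mean.

```python
import math

# Values copied from the summary table
intercept = -1.677692
coef_reason_3 = 3.082583
coef_transport = 0.8141

# Odds ratio = exp(coefficient)
odds_ratio_reason_3 = math.exp(coef_reason_3)    # about 21.8
odds_ratio_transport = math.exp(coef_transport)  # about 2.26

def odds_to_probability(odds):
    """Convert odds p / (1 - p) back to the probability p."""
    return odds / (1.0 + odds)

# Baseline: all dummies at 0 and all standardized inputs at their mean (0)
baseline_odds = math.exp(intercept)
p_baseline = odds_to_probability(baseline_odds)

# Same baseline employee, but with a reason from group 3
p_reason_3 = odds_to_probability(baseline_odds * odds_ratio_reason_3)

print(round(p_baseline, 3), round(p_reason_3, 3))
```

Under these assumptions the probability rises from roughly 0.16 to roughly 0.80. Odds that are 22 times higher therefore do not mean a probability that is 22 times higher.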

## Testing

In [None]:
reg.score(x_test,y_test)

0.7214285714285714

The model achieved an accuracy of 72% on the test data, slightly below the 76% obtained on the training data. In other words, it correctly classified roughly 7 out of 10 test observations. This is a good result, considering the randomness of human behavior and the large number of other possible reasons for absence that are not covered in the dataset.
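Accuracy alone does not show which kind of error the model makes. A confusion matrix, available through the `metrics` module already imported in this notebook, splits the errors by class. The labels below are hypothetical and only illustrate the idea with a 7-out-of-10 case; they are not the model's actual predictions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions for 10 observations
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

accuracy = metrics.accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in label order 0, 1
cm = metrics.confusion_matrix(y_true, y_pred)

print(accuracy)
print(cm)
```

In this toy case 7 of 10 predictions are correct, with one false positive and two false negatives. Applying the same two calls to `y_test` and `reg.predict(x_test)` would give the corresponding breakdown for the real model.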