# Creating Logistic Regression to predict Absenteeism

In [1]:
# importing the relevant libraries
import numpy as np
import pandas as pd

In [2]:
# get the preprocessed data
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')

In [3]:
# check the top 5 values
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month_Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


### Create Targets

Logistic regression is a classification method, so the targets must be split into classes. Here we form two classes:
employees who are "excessively absent" and employees who are "moderately absent". We use the median of 'Absenteeism Time in Hours'
as the cut-off. Employees absent for more than the median number of hours are excessively absent (1), and those at or below the
median are moderately absent (0).

In [4]:
# getting the median of absence time 
absentee_time_median = data_preprocessed['Absenteeism Time in Hours'].median()

In [5]:
# the hours can be converted to 1 and 0 in several ways, for example with apply or map, as follows
data_preprocessed['Absenteeism Time in Hours'].apply(lambda x:1 if x>absentee_time_median else 0)

0      1
1      0
2      0
3      1
4      0
      ..
695    1
696    0
697    1
698    0
699    0
Name: Absenteeism Time in Hours, Length: 700, dtype: int64

In [6]:
data_preprocessed['Absenteeism Time in Hours'].map(lambda x:1 if x>absentee_time_median else 0)

0      1
1      0
2      0
3      1
4      0
      ..
695    1
696    0
697    1
698    0
699    0
Name: Absenteeism Time in Hours, Length: 700, dtype: int64

In [8]:
data_preprocessed['Absenteeism Time in Hours'].apply(lambda x:1 if x>absentee_time_median else 0).unique()

array([1, 0], dtype=int64)

In [9]:
# another option is NumPy's where function
targets = np.where(data_preprocessed['Absenteeism Time in Hours']>absentee_time_median,1,0)
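
The three approaches above should give the same encoding. A minimal sketch on made-up hours (the `hours` series and its values are assumptions for illustration, not the absenteeism data) shows this, and also that a value equal to the median is encoded as 0:

```python
import numpy as np
import pandas as pd

# Toy absence hours (made up); the median of these eight values is 2.5.
hours = pd.Series([4, 0, 2, 4, 2, 8, 3, 1])
median_hours = hours.median()

# Element-wise encoding with apply and map.
via_apply = hours.apply(lambda x: 1 if x > median_hours else 0)
via_map = hours.map(lambda x: 1 if x > median_hours else 0)

# Vectorized encoding; np.where returns a NumPy array, not a Series.
via_where = np.where(hours > median_hours, 1, 0)

print(via_where)
```

`np.where` works on the whole column at once, which is usually the faster choice on large data.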

In [10]:
# make a new column for encoded targets 
data_preprocessed["Excessive Absenteeism"] = targets

In [11]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month_Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


### Small Analysis of targets

In [12]:
# the sum of the targets (the number of 1s) divided by the number of observations gives the share of 1s, i.e. how the classes are distributed
targets.sum()/targets.shape[0]

0.45571428571428574

The above can also be checked with the pandas mean() method. It indicates that around 46% of the targets are 1 and 54% are 0. For logistic regression the target classes should ideally be balanced, although a 60-40 split also works. Our distribution meets this condition.

In [13]:

data_preprocessed['Excessive Absenteeism'].mean()

0.45571428571428574
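
As a small illustration of the balance check, the share of each class can also be read off directly. This is a sketch on a made-up target column (`toy_targets` is an assumption, not the real data):

```python
import pandas as pd

# Toy 0/1 target column standing in for 'Excessive Absenteeism' (values are made up).
toy_targets = pd.Series([1, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Excessive Absenteeism")

# normalize=True returns proportions instead of counts.
class_share = toy_targets.value_counts(normalize=True)
print(class_share)

# The mean of a 0/1 column equals the share of ones.
share_of_ones = toy_targets.mean()
print(share_of_ones)
```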

### Drop 'Absenteeism Time in Hours' and other unneeded columns

In [15]:
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours','Day of the Week','Distance to Work','Daily Work Load Average'],axis = 1)

In [16]:
data_with_targets.head(2)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month_Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0


### Select inputs for the logistic regression

In [17]:
data_with_targets.shape

(700, 12)

In [18]:
# take all the features as inputs, except the last column, 'Excessive Absenteeism', which is the target
unscaled_inputs = data_with_targets.iloc[ : ,:-1]
unscaled_inputs.head(2)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month_Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,289,33,30,0,2,1
1,0,0,0,0,7,118,50,31,0,1,0


## Standardize the data

We are going to standardize the data with sklearn. The dummy variables, however, should not be standardized: once scaled they are no longer 0/1 indicators, so the purpose of creating them is lost.

So we standardize only the features that are not dummies.

In [19]:
# importing the necessary class from sklearn
from sklearn.preprocessing import StandardScaler

In [20]:
absenteeism_scaler = StandardScaler()

In [21]:
unscaled_inputs.iloc[:,[4,5,6,7,9,10]]= absenteeism_scaler.fit_transform(unscaled_inputs.iloc[:,[4,5,6,7,9,10]])

In [22]:
inputs = unscaled_inputs
inputs.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month_Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.01928,-0.58969
2,0,0,0,1,0.182726,-0.654143,0.24831,1.002633,0,-0.91903,-0.58969
3,1,0,0,0,0.182726,0.854936,0.405184,-0.643782,0,0.880469,-0.58969
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487


In [24]:
inputs.shape

(700, 11)
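
The selective standardization above picks columns by integer position. A possible alternative, sketched here on a made-up frame, selects them by name and works on a copy so the unscaled data stays available. The names `toy_inputs`, `columns_to_scale` and `toy_scaler` are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with one dummy column and two numeric columns (values are made up).
toy_inputs = pd.DataFrame({
    "Reason_1": [0, 1, 0, 1],
    "Age": [33.0, 50.0, 38.0, 39.0],
    "Pets": [1.0, 0.0, 0.0, 0.0],
})

# Columns to scale are listed by name; the dummy column is simply left out.
columns_to_scale = ["Age", "Pets"]

toy_scaler = StandardScaler()
scaled = toy_inputs.copy()  # copy, so toy_inputs keeps the original values
scaled[columns_to_scale] = toy_scaler.fit_transform(toy_inputs[columns_to_scale])

print(scaled)
```

Selecting by name keeps working if the column order changes, which positional indices do not guarantee.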

### Split the data into train & test and shuffle

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
train_test_split(inputs,targets)

[     Reason_1  Reason_2  Reason_3  Reason_4  Month_Value  \
 446         0         0         0         0    -0.102784   
 620         0         0         0         1    -0.959313   
 439         1         0         0         0    -0.388293   
 48          0         0         0         1     0.753746   
 251         0         0         0         0     0.182726   
 ..        ...       ...       ...       ...          ...   
 640         0         0         0         1    -0.959313   
 314         1         0         0         0     1.039256   
 142         0         0         0         1    -1.244823   
 394         0         0         0         1    -0.959313   
 281         0         0         0         1     0.753746   
 
      Transportation Expense       Age  Body Mass Index  Education  Children  \
 446               -0.654143  0.248310         1.002633          0 -0.919030   
 620                0.085306  3.385799        -1.114186          0  0.880469   
 439                0.1909

In [27]:
x_train,x_test,y_train,y_test = train_test_split(inputs,targets)

In [28]:
print(x_train.shape,y_train.shape)

(525, 11) (525,)


The above shape shows that the training inputs have 525 observations across 11 features, and the training targets are a vector of length 525.

In [29]:
print(x_test.shape,y_test.shape)

(175, 11) (175,)


So by default sklearn splits the data into 75% training and 25% testing.

#### Generally a split of 80-20 or 90-10 is used. So we will do here a 80%-20% splitting

In [30]:
x_train,x_test,y_train,y_test = train_test_split(inputs,targets,train_size=0.8,random_state=20)

In [31]:
print(x_train.shape,y_train.shape)

(560, 11) (560,)


In [32]:
print(x_test.shape,y_test.shape)

(140, 11) (140,)
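
The split above fixes `random_state`, so it is reproducible. If the class shares should also be preserved in both parts, `train_test_split` accepts a `stratify` argument. The following is a sketch on synthetic data, not the split used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the inputs and targets (not the absenteeism data).
rng = np.random.default_rng(0)
toy_inputs = rng.normal(size=(100, 3))
toy_targets = np.array([1] * 45 + [0] * 55)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_inputs,
    toy_targets,
    train_size=0.8,
    random_state=20,       # fixed seed, so the same split is produced on every run
    stratify=toy_targets,  # keep the share of ones the same in both parts
)

print(x_tr.shape, x_te.shape, y_tr.mean(), y_te.mean())
```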


## Logistic Regression with Sklearn

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### Training the model

In [34]:
reg = LogisticRegression()

In [35]:
reg.fit(x_train,y_train)

LogisticRegression()

#### Check the accuracy of model

In [36]:
reg.score(x_train,y_train)

0.7732142857142857

### Checking the accuracy manually

In [37]:
# get the model outputs (predictions) for the training inputs
model_outputs = reg.predict(x_train)

In [38]:
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [39]:
y_train

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

In [40]:
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [41]:
np.sum(model_outputs==y_train)

433

In [42]:
observations = model_outputs.shape[0]
observations

560

In [43]:
np.sum(model_outputs==y_train)/observations

0.7732142857142857

So out of 560 observations, our model predicted correctly (prediction = target) for 433 of them. Hence we get the same number as reg.score: the accuracy is about 77%.
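
The manual check can be reproduced on a small made-up example, where the arithmetic is easy to follow by eye (the two arrays below are assumptions for illustration):

```python
import numpy as np

# Toy predictions and targets (made up for illustration).
predictions = np.array([1, 0, 1, 1, 0, 0, 1, 0])
true_targets = np.array([1, 0, 0, 1, 0, 1, 1, 0])

matches = predictions == true_targets  # boolean array, True where the model is right
n_correct = np.sum(matches)            # True counts as 1, so this is the number of hits
accuracy = np.mean(matches)            # same as n_correct / len(matches)

print(n_correct, accuracy)
```

Here 6 of the 8 predictions match, so the accuracy is 0.75.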

### Finding the intercepts & coefficients

In [44]:
reg.intercept_

array([-1.6474549])

In [45]:
reg.coef_

array([[ 2.80019733,  0.95188356,  3.11555338,  0.83900082,  0.1589299 ,
         0.60528415, -0.16989096,  0.27981088, -0.21053312,  0.34826214,
        -0.27739602]])

In [46]:
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month_Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [47]:
feature_name = unscaled_inputs.columns.values

In [48]:
summary_table = pd.DataFrame(data=feature_name,columns=['Feature Name'])
summary_table['Coefficient'] = np.transpose(reg.coef_)    #transpose is made to convert nd array(row) into column
summary_table

Unnamed: 0,Feature Name,Coefficient
0,Reason_1,2.800197
1,Reason_2,0.951884
2,Reason_3,3.115553
3,Reason_4,0.839001
4,Month_Value,0.15893
5,Transportation Expense,0.605284
6,Age,-0.169891
7,Body Mass Index,0.279811
8,Education,-0.210533
9,Children,0.348262


In [49]:
# the intercept has to be added to the table above, and we want it in the first row, so we shift the indices forward by 1
summary_table.index = summary_table.index + 1

In [50]:
summary_table.loc[0]= ['Intercept',reg.intercept_[0]]

In [51]:
summary_table

Unnamed: 0,Feature Name,Coefficient
1,Reason_1,2.800197
2,Reason_2,0.951884
3,Reason_3,3.115553
4,Reason_4,0.839001
5,Month_Value,0.15893
6,Transportation Expense,0.605284
7,Age,-0.169891
8,Body Mass Index,0.279811
9,Education,-0.210533
10,Children,0.348262


In [52]:
summary_table = summary_table.sort_index()

In [53]:
summary_table

Unnamed: 0,Feature Name,Coefficient
0,Intercept,-1.647455
1,Reason_1,2.800197
2,Reason_2,0.951884
3,Reason_3,3.115553
4,Reason_4,0.839001
5,Month_Value,0.15893
6,Transportation Expense,0.605284
7,Age,-0.169891
8,Body Mass Index,0.279811
9,Education,-0.210533


### Interpreting the coefficients

In [54]:
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)
summary_table

Unnamed: 0,Feature Name,Coefficient,Odds_ratio
0,Intercept,-1.647455,0.192539
1,Reason_1,2.800197,16.447892
2,Reason_2,0.951884,2.590585
3,Reason_3,3.115553,22.545903
4,Reason_4,0.839001,2.314054
5,Month_Value,0.15893,1.172256
6,Transportation Expense,0.605284,1.831773
7,Age,-0.169891,0.843757
8,Body Mass Index,0.279811,1.32288
9,Education,-0.210533,0.810152


We will sort the summary table by Odds_ratio in descending order, so that the features that most increase the odds of excessive absenteeism appear at the top.
A feature is not particularly important:
- if its coefficient is around 0
- if its odds ratio is around 1
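
To make the link between a coefficient, its odds ratio and a probability concrete, here is a small sketch using the Reason_1 coefficient and the intercept copied from the summary table. The `sigmoid` helper is defined here for illustration and is not part of the notebook:

```python
import numpy as np

# Values copied from the summary table above.
coef_reason_1 = 2.800197
intercept = -1.647455

# exp(coefficient) is the factor by which the odds are multiplied when
# Reason_1 goes from 0 to 1 and all other inputs stay the same.
odds_ratio = np.exp(coef_reason_1)

def sigmoid(z):
    """Map log-odds z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# All inputs at 0 (for standardized inputs, 0 is the mean) versus only Reason_1 set to 1.
p_baseline = sigmoid(intercept)
p_reason_1 = sigmoid(intercept + coef_reason_1)

print(odds_ratio, p_baseline, p_reason_1)
```

A coefficient of 0 gives an odds ratio of exp(0) = 1, which is why the two conditions in the list describe the same situation.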

In [55]:
summary_table.sort_values('Odds_ratio',ascending= False)

Unnamed: 0,Feature Name,Coefficient,Odds_ratio
3,Reason_3,3.115553,22.545903
1,Reason_1,2.800197,16.447892
2,Reason_2,0.951884,2.590585
4,Reason_4,0.839001,2.314054
6,Transportation Expense,0.605284,1.831773
10,Children,0.348262,1.416604
8,Body Mass Index,0.279811,1.32288
5,Month_Value,0.15893,1.172256
7,Age,-0.169891,0.843757
9,Education,-0.210533,0.810152


Note that Daily Work Load Average, Distance to Work and Day of the Week do not appear in the table above: they were dropped before the inputs were selected, because they are not important features for predicting absence, i.e. they do not make much difference.

In [56]:
reg.score(x_test,y_test)

0.75

So our model predicts excessive absenteeism with an accuracy of 75% on data it has not seen before, i.e. the test data.

Test accuracy is usually somewhat lower than train accuracy. If the test accuracy is noticeably higher, it is worth checking whether we made a mistake somewhere in the model pipeline.

Testing should be done only once. If we test repeatedly while modifying the parameters, the test set is effectively being used for training.
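
One common way to tune parameters without touching the test set is cross-validation on the training data. The sketch below uses synthetic data and an arbitrary list of `C` values. Both are assumptions for illustration, and this is not what the notebook does:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for the inputs and targets.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=20
)

# Compare candidate settings using the training data only.
cv_means = {}
for c_value in [0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c_value)
    scores = cross_val_score(model, X_train, y_train, cv=5)  # 5-fold accuracy
    cv_means[c_value] = scores.mean()

best_c = max(cv_means, key=cv_means.get)

# The test set is used once, at the very end.
final_model = LogisticRegression(C=best_c).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(best_c, test_accuracy)
```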

## Save the model

In [57]:
# import the relevant module
import pickle

In [58]:
# pickle the model file
with open('model.pkl', 'wb') as file:
    pickle.dump(reg, file)

In [59]:
# pickle the scaler file
with open('scaler','wb') as file:
    pickle.dump(absenteeism_scaler, file)
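
To check that a pickled model can be used later, it can be loaded back and compared with the original. The sketch below runs on its own with a small synthetic model and a temporary file. The names `toy_model` and `toy_model.pkl` are assumptions and do not refer to the files saved above:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic model so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] > 0).astype(int)
toy_model = LogisticRegression().fit(X, y)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as file:
        pickle.dump(toy_model, file)
    with open(path, "rb") as file:
        loaded_model = pickle.load(file)

# The loaded model should give the same predictions as the original.
same_predictions = np.array_equal(toy_model.predict(X), loaded_model.predict(X))
print(same_predictions)
```

When the saved model is reused on new data, the saved scaler has to be loaded as well and applied with `transform` to the same columns before calling `predict`.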