# Creating a Logistic Regression to predict Absenteeism

In [1]:
# importing the relevant libraries
import numpy as np
import pandas as pd

In [2]:
# get the preprocessed data
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')

In [3]:
# check the top 5 values
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month_Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


### Create Targets

Logistic regression is a classification method, so the targets must be classes. Here we form two classes:
employees who are "excessively absent" and employees who are "moderately absent". We use the median of the absenteeism
time as the cut-off. Employees absent for more than the median number of hours are labelled excessively absent (1), and
those at or below the median are labelled moderately absent (0).

In [4]:
# getting the median of absence time 
absentee_time_median = data_preprocessed['Absenteeism Time in Hours'].median()

In [5]:
# the time can be converted to 1 and 0 in several ways, e.g. with apply or map as follows
data_preprocessed['Absenteeism Time in Hours'].apply(lambda x:1 if x>absentee_time_median else 0)

0      1
1      0
2      0
3      1
4      0
      ..
695    1
696    0
697    1
698    0
699    0
Name: Absenteeism Time in Hours, Length: 700, dtype: int64

In [6]:
data_preprocessed['Absenteeism Time in Hours'].map(lambda x:1 if x>absentee_time_median else 0)

0      1
1      0
2      0
3      1
4      0
      ..
695    1
696    0
697    1
698    0
699    0
Name: Absenteeism Time in Hours, Length: 700, dtype: int64

In [7]:
data_preprocessed['Absenteeism Time in Hours'].apply(lambda x:1 if x>absentee_time_median else 0).unique()

array([1, 0], dtype=int64)

In [8]:
# another option is numpy's where
targets = np.where(data_preprocessed['Absenteeism Time in Hours']>absentee_time_median,1,0)
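The encodings above all use a strict `>` comparison, so a value exactly equal to the median falls into class 0. A minimal sketch on made-up hours (the `hours` values are hypothetical, not taken from the dataset) shows this, along with a third equivalent form using a boolean mask and `astype`:

```python
import numpy as np
import pandas as pd

# Toy absence hours (hypothetical values, not taken from the dataset)
hours = pd.Series([0, 2, 3, 3, 4, 8])
median_hours = hours.median()  # 3.0

# Three equivalent encodings: strictly above the median -> 1, otherwise 0
via_apply = hours.apply(lambda x: 1 if x > median_hours else 0)
via_where = np.where(hours > median_hours, 1, 0)
via_astype = (hours > median_hours).astype(int)

# Values equal to the median (the two 3s) land in class 0
print(via_astype.tolist())  # [0, 0, 0, 0, 1, 1]
```

Because many employees can share the median value, the two classes need not come out exactly 50-50.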

In [9]:
# make a new column for encoded targets 
data_preprocessed["Excessive Absenteeism"] = targets

In [10]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month_Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


### Small Analysis of targets

In [11]:
# the sum of the targets (the number of 1s) divided by the number of observations gives the share of 1s, i.e. how the classes are distributed
targets.sum()/targets.shape[0]

0.45571428571428574

The same figure can be obtained with the pandas mean() method. It indicates that around 45% of the targets are 1s and 55% are 0s. For logistic regression the target classes should be roughly balanced: ideally 50-50, though a 60-40 split is also workable. Our distribution meets this condition.

In [12]:

data_preprocessed['Excessive Absenteeism'].mean()

0.45571428571428574
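As a small sketch of the same balance check, `value_counts(normalize=True)` reports the share of both classes at once. The `toy_targets` values below are hypothetical and only stand in for the real target column:

```python
import numpy as np
import pandas as pd

# Hypothetical binary targets: 5 ones and 6 zeros
toy_targets = pd.Series(np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]))

# Share of each class; for 0/1 labels the share of ones equals the mean
class_share = toy_targets.value_counts(normalize=True)
print(class_share)
print(toy_targets.mean())
```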

### Drop 'Absenteeism Time in Hours' and other unused columns

In [13]:
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours','Day of the Week','Distance to Work','Daily Work Load Average'],axis = 1)

In [14]:
data_with_targets.head(2)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month_Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0


### Select inputs for the logistic regression

In [15]:
data_with_targets.shape

(700, 12)

In [17]:
# we take all the features as inputs, except the last one, Excessive Absenteeism, which is the target
inputs = data_with_targets.iloc[ : ,:-1]
inputs.head(2)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month_Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,289,33,30,0,2,1
1,0,0,0,0,7,118,50,31,0,1,0


In [18]:
inputs.shape

(700, 11)

### Split the data into train & test and shuffle

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
train_test_split(inputs,targets)

[     Reason_1  Reason_2  Reason_3  Reason_4  Month_Value  \
 517         0         0         0         1           10   
 203         0         0         0         0            4   
 142         0         0         0         1            2   
 413         0         0         0         1            4   
 113         0         0         0         1            1   
 ..        ...       ...       ...       ...          ...   
 252         0         0         1         0            8   
 385         0         0         0         1            2   
 501         0         0         0         1            9   
 336         0         0         0         0           11   
 438         0         0         0         1            5   
 
      Transportation Expense  Age  Body Mass Index  Education  Children  Pets  
 517                     369   31               25          0         3     0  
 203                     235   48               33          0         1     5  
 142                     2

In [21]:
x_train,x_test,y_train,y_test = train_test_split(inputs,targets)

In [22]:
print(x_train.shape,y_train.shape)

(525, 11) (525,)


The shapes above show that the training inputs have 525 observations across 11 features, and the training targets are a vector of length 525.

In [23]:
print(x_test.shape,y_test.shape)

(175, 11) (175,)


So by default sklearn splits the data into 75% training and 25% testing.

#### Generally a split of 80-20 or 90-10 is used. So here we will do an 80%-20% split

In [24]:
x_train,x_test,y_train,y_test = train_test_split(inputs,targets,train_size=0.8,random_state=20)

In [25]:
print(x_train.shape,y_train.shape)

(560, 11) (560,)


In [26]:
print(x_test.shape,y_test.shape)

(140, 11) (140,)
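The notebook does not use it, but `train_test_split` also accepts a `stratify` argument that keeps the class shares equal in the train and test parts. The sketch below uses synthetic data (`X_toy` and `y_toy` are hypothetical stand-ins for `inputs` and `targets`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the inputs/targets (hypothetical, 100 rows, 40% ones)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 40 + [0] * 60)

# 80/20 split; stratify keeps the share of ones the same in both parts,
# and random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=20, stratify=y_toy
)

print(X_tr.shape, X_te.shape)   # (80, 3) (20, 3)
print(y_tr.mean(), y_te.mean())
```

With only a few hundred observations, stratifying can make the train and test accuracies a little easier to compare, since neither part ends up with an unusual class mix.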


## Logistic Regression with Sklearn

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### Training the model

In [28]:
reg = LogisticRegression()

In [29]:
reg.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
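The warning above means the solver stopped before converging, and it suggests two remedies: scale the inputs or raise `max_iter`. A minimal sketch of both on synthetic data follows. The data and the names `scaled_reg` and `raw_reg` are assumptions for illustration, not the notebook's model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical data)
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.integers(0, 2, size=200),        # dummy-like column
    rng.normal(250, 60, size=200),       # expense-like column
])
y_toy = (X_toy[:, 0] + (X_toy[:, 1] > 250) + rng.normal(0, 0.5, 200) > 1).astype(int)

# Option 1: standardize the inputs before fitting
scaled_reg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_reg.fit(X_toy, y_toy)

# Option 2: keep raw inputs but allow more solver iterations
raw_reg = LogisticRegression(max_iter=1000)
raw_reg.fit(X_toy, y_toy)

print(scaled_reg.score(X_toy, y_toy), raw_reg.score(X_toy, y_toy))
```

Note that after standardizing, each coefficient describes a change of one standard deviation rather than one original unit, which changes how the odds ratios are read.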

#### Check the accuracy of model

In [30]:
reg.score(x_train,y_train)

0.75

### Checking the accuracy manually

In [31]:
# get the model outputs (predictions) on the training data
model_outputs = reg.predict(x_train)

In [32]:
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [33]:
y_train

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

In [34]:
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True, False,  True,
       False, False, False, False,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True, False,  True,  True, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [35]:
np.sum(model_outputs==y_train)

420

In [36]:
observations = model_outputs.shape[0]
observations

560

In [37]:
np.sum(model_outputs==y_train)/observations

0.75

So out of 560 observations the model predicted correctly (prediction = target) for 420. This gives the same figure as reg.score: an accuracy of 75%.
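The manual calculation can be cross-checked against `sklearn.metrics`, which the notebook imports but does not use. This is a small sketch on hypothetical arrays (`toy_pred` and `toy_true` are made up for illustration):

```python
import numpy as np
from sklearn import metrics

# Hypothetical predictions and targets for 8 observations
toy_pred = np.array([0, 1, 1, 0, 1, 0, 0, 1])
toy_true = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Manual accuracy: the mean of a boolean array is the share of True values
manual_acc = np.mean(toy_pred == toy_true)

# Same figure from sklearn, plus the breakdown of hits and misses
sk_acc = metrics.accuracy_score(toy_true, toy_pred)
cm = metrics.confusion_matrix(toy_true, toy_pred)

print(manual_acc, sk_acc)   # 0.75 0.75
print(cm)                   # rows: true class, columns: predicted class
```

The confusion matrix adds detail that a single accuracy figure hides, namely which class the errors fall in.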

### Finding the intercepts & coefficients

In [38]:
reg.intercept_

array([-0.83663708])

In [39]:
reg.coef_

array([[ 1.42562719, -0.04240364,  2.08854523, -0.39874278,  0.02761209,
         0.00623239, -0.04538864,  0.01736553, -0.80900801,  0.27784792,
        -0.21986337]])

In [54]:
inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month_Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [55]:
feature_name = inputs.columns.values

In [56]:
summary_table = pd.DataFrame(data=feature_name,columns=['Feature Name'])
summary_table['Coefficient'] = np.transpose(reg.coef_)    # reg.coef_ has shape (1, 11); transposing turns the row into a column
summary_table

Unnamed: 0,Feature Name,Coefficient
0,Reason_1,1.425627
1,Reason_2,-0.042404
2,Reason_3,2.088545
3,Reason_4,-0.398743
4,Month_Value,0.027612
5,Transportation Expense,0.006232
6,Age,-0.045389
7,Body Mass Index,0.017366
8,Education,-0.809008
9,Children,0.277848
10,Pets,-0.219863


In [43]:
# the intercept has to be added to the table above, and we want it in the first row, so we shift the indices forward by 1
summary_table.index = summary_table.index + 1

In [44]:
summary_table.loc[0]= ['Intercept',reg.intercept_[0]]

In [45]:
summary_table

Unnamed: 0,Feature Name,Coefficient
1,Reason_1,1.425627
2,Reason_2,-0.042404
3,Reason_3,2.088545
4,Reason_4,-0.398743
5,Month_Value,0.027612
6,Transportation Expense,0.006232
7,Age,-0.045389
8,Body Mass Index,0.017366
9,Education,-0.809008
10,Children,0.277848
11,Pets,-0.219863
0,Intercept,-0.836637


In [46]:
summary_table = summary_table.sort_index()

In [47]:
summary_table

Unnamed: 0,Feature Name,Coefficient
0,Intercept,-0.836637
1,Reason_1,1.425627
2,Reason_2,-0.042404
3,Reason_3,2.088545
4,Reason_4,-0.398743
5,Month_Value,0.027612
6,Transportation Expense,0.006232
7,Age,-0.045389
8,Body Mass Index,0.017366
9,Education,-0.809008
10,Children,0.277848
11,Pets,-0.219863
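The shift-index, assign and sort steps can also be collapsed into one construction. The sketch below is only an alternative illustration. The values of `intercept`, `coef` and `feature_names` are hypothetical placeholders for `reg.intercept_`, `reg.coef_` and `inputs.columns.values`:

```python
import numpy as np
import pandas as pd

# Hypothetical fitted values standing in for reg.intercept_ and reg.coef_
intercept = np.array([-0.8])
coef = np.array([[1.4, -0.04, 2.1]])
feature_names = ['Reason_1', 'Reason_2', 'Reason_3']

# Put the intercept first, then one row per feature
summary = pd.DataFrame({
    'Feature Name': ['Intercept', *feature_names],
    'Coefficient': np.concatenate([intercept, coef.ravel()]),
})
print(summary)
```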


### Interpreting the coefficients

In [48]:
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)
summary_table

Unnamed: 0,Feature Name,Coefficient,Odds_ratio
0,Intercept,-0.836637,0.433165
1,Reason_1,1.425627,4.160466
2,Reason_2,-0.042404,0.958483
3,Reason_3,2.088545,8.073162
4,Reason_4,-0.398743,0.671163
5,Month_Value,0.027612,1.027997
6,Transportation Expense,0.006232,1.006252
7,Age,-0.045389,0.955626
8,Body Mass Index,0.017366,1.017517
9,Education,-0.809008,0.4453
10,Children,0.277848,1.320285
11,Pets,-0.219863,0.802628


We will sort the summary table by Odds_ratio in descending order. This puts the features that most increase the odds of excessive absenteeism at the top and those that most decrease them at the bottom.
A feature is not particularly important:
- if its coefficient is around 0
- if its odds ratio is around 1
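To make the reading of an odds ratio concrete, here is a short worked sketch. The two coefficients are copied from the summary table above. The baseline odds of 0.5 are a hypothetical value chosen only for illustration:

```python
import numpy as np

# Coefficients copied from the summary table above
coef_reason_1 = 1.425627
coef_age = -0.045389

# exp(coefficient) is the multiplicative change in the odds for a one-unit
# increase in the feature, holding the others fixed
odds_ratio_reason_1 = np.exp(coef_reason_1)   # about 4.16
odds_ratio_age = np.exp(coef_age)             # about 0.956

# Hypothetical baseline odds of 0.5 (probability 1/3), chosen for illustration
baseline_odds = 0.5
new_odds = baseline_odds * odds_ratio_reason_1
new_prob = new_odds / (1 + new_odds)

print(round(odds_ratio_reason_1, 3), round(odds_ratio_age, 3), round(new_prob, 3))
```

So switching Reason_1 from 0 to 1 multiplies the odds by roughly 4, while one extra year of age multiplies them by about 0.96, which is close to no change.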

In [49]:
summary_table.sort_values('Odds_ratio',ascending= False)

Unnamed: 0,Feature Name,Coefficient,Odds_ratio
3,Reason_3,2.088545,8.073162
1,Reason_1,1.425627,4.160466
10,Children,0.277848,1.320285
5,Month_Value,0.027612,1.027997
8,Body Mass Index,0.017366,1.017517
6,Transportation Expense,0.006232,1.006252
2,Reason_2,-0.042404,0.958483
7,Age,-0.045389,0.955626
11,Pets,-0.219863,0.802628
4,Reason_4,-0.398743,0.671163
9,Education,-0.809008,0.4453
0,Intercept,-0.836637,0.433165


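Daily Work Load Average, Distance to Work and Day of the Week do not appear in the table above because they were dropped before training (In [13]); they were judged to make little difference to absenteeism. Among the remaining features, Month_Value, Body Mass Index, Transportation Expense, Reason_2 and Age have odds ratios close to 1. Note that the inputs were not standardized, so coefficients of features measured on different scales are not directly comparable.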

In [50]:
reg.score(x_test,y_test)

0.7214285714285714

So our model predicts excessive absenteeism with an accuracy of about 72% on data the model has not seen, i.e. the test data.

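Test accuracy is usually somewhat lower than train accuracy, because the model was fitted to the training data. If test accuracy is noticeably higher, it is worth checking for a mistake such as data leakage, although with a small test set it can also happen by chance.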

Testing should be done only once. If we test repeatedly while modifying parameters, the test set is effectively being used for training, and the test accuracy no longer reflects performance on unseen data.
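One common way to compare settings without touching the test set is cross-validation on the training part only. The sketch below is an illustration on synthetic data and is not part of the original notebook. The data, the candidate values of `C` and the choice of `C=1.0` are all assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for inputs/targets (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] - X_toy[:, 1] + rng.normal(0, 1, 300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=20
)

# Compare settings using only the training part (5-fold cross-validation)
for C in (0.1, 1.0, 10.0):
    cv_acc = cross_val_score(LogisticRegression(C=C), X_tr, y_tr, cv=5)
    print(C, cv_acc.mean())

# Touch the test set once; C=1.0 is assumed here to be the chosen setting
final_reg = LogisticRegression(C=1.0).fit(X_tr, y_tr)
test_acc = final_reg.score(X_te, y_te)
print(test_acc)
```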