# Absenteeism Prediction Model Testing
This project is designed to take input data, and predict instances of absenteeism for schedule planning purposes. This data will take in multiple parameters and estimate it based on an sklearn machine learning model built using logistic regression analysis paired with a custom scaler that is used to standardize specific parameters of the input data.

## Table of Contents
- 1.1 [Project Initialization](#Project_Initialization)

- 1.2 [Exploration and data processing begin](#Exploration)
    - 1.2.1 [Dropping the ID column](#ID_column)
    - 1.2.2 [Redefining reason for absence column](#Absence_column)
    - 1.2.3 [Creating the necassary date information](#Date_column)
    - 1.2.4 [Remapping the education column](#Education_column)
    - 1.2.5 [Organization](#Organization)
    
- 2.0 [Machine Learning](#Machine_learning)
    - 2.1 [Setting up the targets](#Targets)
    - 2.2 [Standardizing the data](#Standardization)
    - 2.3 [Building the Model](#Construction)
    - 2.4 [Evaluating the Model](#Evaluation)
    - 2.5 [Interpereting coefficients](#Interperation)
    - 2.6 [Optimizing the Model](#Optimization)
    - 2.7 [Testing model against new data](#Testing)

- 3.0 [Results](#Results)

## 1.1 Project Initialization <a class="anchor" id="Project_Initialization"></a>
This section is used in order to import neccassary modules as well as importing the file and opening the data and setting up the first checkpoint, and setting up the neccassary column lists in order to note any dropped or processed columns.


In [83]:
import os
import pandas as pd

# Below code is to set path and to set file name
data_path = '../data/'
data_file = 'absenteeism_training_data.csv'

# Sets path and opens file as a Pandas dataframe
os.chdir(data_path)
raw_csv_data = pd.read_csv(data_file)

#### Creates first checkpoint and defines columns to drop and displays full Dataframe

In [84]:
df = raw_csv_data.copy()

cols_to_drop = []

# Makes max display for columns and rows unlimited
pd.options.display.max_columns = None
pd.options.display.max_rows = None

## 1.2 Exploration and begin data processing <a class="anchor" id="Exploration"></a>

### 1.2.1 Dropping the ID column <a class="anchor" id="ID_column"></a>
This column is no longer needed as it is an ID related to individual identification in another table and will bnot be reused in our model training

In [85]:
cols_to_drop.append('ID')

### 1.2.2 Redefining reason for absence column <a class="anchor" id="Absence_column"></a>
The Reason for absence column is a list of 1 - 28 of all of the different reasons someone could list as absent from work. For training purposes the individual values are going to be changed to include: 
- 1-14 reason_disease as they are all pertaining to sickness or disease 
- 15-17 reason_maternity as they all have to do with pregnancy 
- 18-21 reason_external as they all have to due with medical reasons due to external influences such as poisoning or assault
- 22-28 reason_medical which includes medical appointments of patient follow up.

#### Exploring the information in the reason for absence column
Using the max and min functions in order to see the numbers included in this column. As demonstrated below the minimum is 0 (this means there was no reason given for the absence which is not shown in the table) and it ranges to 28. When the numbers were sorted below it is shown the only excuse code not used is reason 20 which is external causes of morbidity and mortality (hard to imagine a deceased person calling in a half day out of work).

In [86]:
df['Reason for Absence'].min()

0

In [87]:
df['Reason for Absence'].max()

28

In [88]:
df['Reason for Absence'].unique()

array([26,  0, 23,  7, 22, 19,  1, 11, 14, 21, 10, 13, 28, 18, 25, 24,  6,
       27, 17,  8, 12,  5,  9, 15,  4,  3,  2, 16])

In [89]:
sorted(df['Reason for Absence'].unique())

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28]

#### Creating and testing dummy table to replace the current reason for absence column

In [90]:
# This pulls all the dummy values and shows them in a table (the dtype=int is to convert the output to binary instead of boolean
reason_columns = pd.get_dummies(df['Reason for Absence'], dtype=int)
reason_columns.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


#### Creating a check column to verify that this table is setup properly

In [91]:
# Sums every row in the Check column to make sure each value is there
reason_columns['check'] = reason_columns.sum(axis=1)

reason_columns.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,check
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1


#### Verifying that there is only one unique variable in each row
This statement is used to show that there are exactly 700 rows of data with one number in each row and there are no unique values that could somehow add up to 700 they are all exactly 1.

In [92]:
# Counts every value in the check column to verify all 700 values are there
check_sum = reason_columns['check'].sum(axis=0)

# Verifies that there is only one unique value and there isn't just random variables that add to 700
check_unique = reason_columns['check'].unique()

print(f"The numbers in this column add up to {check_sum} and the unique values are {check_unique}")

The numbers in this column add up to 700 and the unique values are [1]


#### Removing the check column and 0's from the table

In [93]:
reason_columns = reason_columns.drop(['check'], axis = 1)

# This has the table drop the first column 0
# reason_columns = pd.get_dummies(df['Reason for Absence'], drop_first = True, dtype=int)

reason_columns.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


#### Creating reason columns and appending them to the table
The first section defines the parameters for the different reason codes and are then added to the end of the table. After that the columns are renamed so that they are legible and defined and the reason for absence column is added to columns to drop because it is now redundant.

In [94]:
# Sections each of the reasons into the four categories according to their table location
reason_disease = reason_columns.loc[:, 1:14].max(axis=1)
reason_maternity = reason_columns.loc[:, 15:17].max(axis=1)
reason_external = reason_columns.loc[:, 18:21].max(axis=1)
reason_medical = reason_columns.loc[:, 22:].max(axis=1)

# Concatenates the reason columns to the end of the df
df = pd.concat([df, reason_disease, reason_maternity, reason_external, reason_medical], axis = 1)

# Renames the columns because without this step they would just be displayed as 0-3 at the end of the table
df.columns = [
    'ID', 'Reason for Absence', 'Date', 'Transportation Expense',
    'Distance to Work', 'Age', 'Daily Work Load Average',
    'Body Mass Index', 'Education', 'Children', 'Pets',
    'Absenteeism Time in Hours', 'Reason_disease',
    'Reason_maternity', 'Reason_external', 'Reason_medical',
]

# Adds reason for absence to columns to drop
cols_to_drop.append('Reason for Absence')

df.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Reason_disease,Reason_maternity,Reason_external,Reason_medical
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1


#### Sets up 2nd checkpoint

In [95]:
df_reason_mod = df
df = df_reason_mod.copy()

### 1.2.3 Creating the necassary date information <a class="anchor" id="Date_column"></a>
It was determined that instead of date, it was more interesting to prove if certain months or days of the week may cause more abseenteeism than the date. Is it more likely that a Friday in May would have more abseenteeism than a Wednesday in February? That is the question this processing is aiming to answer.

#### First step is to convert the date to python date/time

In [96]:
# Checks the data type of the reason column
type(df['Date'][0])

str

In [97]:
# Converts the date column to a time stamp and the format at the end is to denote the format of
# the table so that it is processed properly
df['Date'] = pd.to_datetime(df_reason_mod['Date'], format = '%d/%m/%Y')

In [98]:
# Checks the data type of the reason column after the conversion
type(df['Date'][0])

pandas._libs.tslibs.timestamps.Timestamp

#### After converting to date time the columns can be outputted with the following functions
The date is converted to a month and dotw column and then data is added to columns to drop because it is now redundant

In [99]:
df['Month'] = ([df['Date'][x].month for x in range(len(df))])
df['DOTW'] = ([df['Date'][x].weekday() for x in range(len(df))])

# Adds date to columns to drop
cols_to_drop.append('Date')

df.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Reason_disease,Reason_maternity,Reason_external,Reason_medical,Month,DOTW
0,11,26,2015-07-07,289,36,33,239.554,30,1,2,1,4,0,0,0,1,7,1
1,36,0,2015-07-14,118,13,50,239.554,31,1,1,0,0,0,0,0,0,7,1
2,3,23,2015-07-15,179,51,38,239.554,31,1,0,0,2,0,0,0,1,7,2
3,7,7,2015-07-16,279,5,39,239.554,24,1,2,0,4,1,0,0,0,7,3
4,11,23,2015-07-23,289,36,33,239.554,30,1,2,1,2,0,0,0,1,7,3


#### Sets up 3rd checkpoint

In [100]:
df_date_mod = df
df = df_date_mod.copy()

### 1.2.4 Remapping the education column <a class="anchor" id="Education_column"></a>
The education column is the last column that will need to be processed during this portion of the project. This data is portioned into 1 as having a highschool diploma 2 as an associate's or trade school 3 as bachelor's or 3 as master's or higher

Since the highschool data so overwhelms the other data it is split into high school or higher and remapped as 0 or 1 for further processing.

In [101]:
# Displays the unique variables in the column and then counts them
df['Education'].value_counts()

Education
1    583
3     73
2     40
4      4
Name: count, dtype: int64

In [102]:
# Maps the education column as either having post-high school education or high school education and
# then is assigned below to education
degrees = df['Education'].map({1:0, 2:1, 3:1, 4:1})
degrees.value_counts()

Education
0    583
1    117
Name: count, dtype: int64

In [103]:
# Defines education column as the new remapping for degrees
df['Education'] = degrees

df.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Reason_disease,Reason_maternity,Reason_external,Reason_medical,Month,DOTW
0,11,26,2015-07-07,289,36,33,239.554,30,0,2,1,4,0,0,0,1,7,1
1,36,0,2015-07-14,118,13,50,239.554,31,0,1,0,0,0,0,0,0,7,1
2,3,23,2015-07-15,179,51,38,239.554,31,0,0,0,2,0,0,0,1,7,2
3,7,7,2015-07-16,279,5,39,239.554,24,0,2,0,4,1,0,0,0,7,3
4,11,23,2015-07-23,289,36,33,239.554,30,0,2,1,2,0,0,0,1,7,3


#### Sets up 4th checkpoint

In [104]:
df_education = df
df = df_education.copy()

### 1.2.5 Organization <a class="anchor" id="Organization"></a>
After completing the data cleaning and prep the columns will be reorganized and dropped from the columns to drop list.

In [105]:
# Drops the columns that are no longer neccassary
df.drop(cols_to_drop, axis = 1)

# Sets up the columns in the neccassary order
columns_reordered = [
    'Reason_disease', 'Reason_maternity', 'Reason_external',
    'Reason_medical', 'Month', 'DOTW', 'Transportation Expense',
    'Distance to Work', 'Age', 'Daily Work Load Average',
    'Body Mass Index', 'Education', 'Children', 'Pets',
    'Absenteeism Time in Hours',
]

# Re-orders the data frame
df = df[columns_reordered]

df.head()

Unnamed: 0,Reason_disease,Reason_maternity,Reason_external,Reason_medical,Month,DOTW,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


#### Converts preprocessed dataframe to csv to save the file

In [106]:
# Convert file to CSV and export to current folder
df.to_csv('absenteeism_preprocessed_new_notebook.csv', index=False)

## 2.0 Machine Learning <a class="anchor" id="Machine_learning"></a>
This is the portion of the project that will begin work on the machine learning model.

### Importing necassary packages

In [107]:
import numpy as np

### 2.1 Setting up the targets <a class="anchor" id="Targets"></a>
The first thing that needs to happen is to set up actual targets for this model to predict. The absenteeism time in hours is the target for this model. However it will not be built to predict the time. It will predict weather or not someone will be "excessively absent" and this will require some plan to define excessively absent.

#### Defining "excessive absenteeism"
The way excessive absenteeism will be defined in this case is to take the median for Absenteeism In Hours and either define it as over or under the median (1 or 0). Then it will be assessed whether the median will be a good candidate for the split. It is shown below that is splits the population nicely in two at 46%

In [108]:
# Takes the median and displays it
absenteeism_median = df['Absenteeism Time in Hours'].median()
absenteeism_median

3.0

In [109]:
# Defines an array that contains either 1 or 0 for excessively absent or not
targets = np.where(df['Absenteeism Time in Hours'] > absenteeism_median, 1, 0)
targets[0:9]

array([1, 0, 0, 1, 0, 0, 1, 1, 1])

In [110]:
# Checks and displays the percentage above and below the median as 46% which would be
# an acceptable balance point
targets.sum() / targets.shape

array([0.45571429])

This table is then added to the df and absenteeism time in hours column is removed.

In [111]:
# Defines excessive absenteeism column as the targets
df['Excessive Absenteeism'] = targets

# Drops absenteeism in hours as it is now redundant
df = df.drop(['Absenteeism Time in Hours'], axis = 1)

df.head()

Unnamed: 0,Reason_disease,Reason_maternity,Reason_external,Reason_medical,Month,DOTW,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0


#### Defines next checkpoint

In [112]:
df_with_targets = df
df = df_with_targets.copy()

## 2.2 Standardizing the data <a class="anchor" id="Standardization"></a>
At this point the decision needs to be made before training the data on whether or not the data should be standardized. It was chosen to standardize the data as the model will be slightly more accurate and the coefficients will be used to evaluate corelation between each parameter and the output.

#### Definining inputs to scale
Not all of the data will need to be standardized, the reasons as well as the education columns are binary values and do not need to be standardized so it will only need to be fed the rest of the data and this must be called out before transforming the data.

In [113]:
# Defines the parameters to inputs only it removed the dependent variable which is
# excessive absenteeism
unscaled_inputs = df.iloc[:,:-1]
unscaled_inputs.head()

Unnamed: 0,Reason_disease,Reason_maternity,Reason_external,Reason_medical,Month,DOTW,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1


In [114]:
unscaled_inputs.columns.values

# Sets up the columns that we need scaled for standardization
columns_to_omit = [
    'Reason_disease', 'Reason_maternity', 'Reason_external',
    'Reason_medical', 'Education',
]

columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

#### Transforming the data
The following class was defined in order to fit the inputs and transform them.

In [115]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

#
# Custom scaler class that sets up scaler to transform all the columns
# except the dummy data
#
class CustomScaler(BaseEstimator, TransformerMixin):
    
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        self.scaler = StandardScaler(copy=copy,with_mean=with_mean,with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self
    
    def transform(self, X, y=None, copy = None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [116]:
# Sets up and displays the columns after being standardized with custom scaler
absenteeism_scaler = CustomScaler(columns_to_scale)
scaled_inputs = absenteeism_scaler.fit(unscaled_inputs)
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)
scaled_inputs.head()

  return var(axis=axis, dtype=dtype, out=out, ddof=ddof, **kwargs)


Unnamed: 0,Reason_disease,Reason_maternity,Reason_external,Reason_medical,Month,DOTW,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,-0.683704,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-0.683704,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.01928,-0.58969
2,0,0,0,1,0.182726,-0.007725,-0.654143,1.426749,0.24831,-0.806331,1.002633,0,-0.91903,-0.58969
3,1,0,0,0,0.182726,0.668253,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.58969
4,0,0,0,1,0.182726,0.668253,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487


In [117]:
# This is just a check to make sure all of the rows and columns necassary are still there
scaled_inputs.shape

(700, 14)

### 2.3 Building the Model <a class="anchor" id="Construction"></a>
The following portion does the model construction from our training data that has been cleaned, prepped, and standardized thus far. The train test split command will randomly split the data into two portions of 80% training data and 20% testing data for the model. The random state modifier will call out the way that the data is randomly split for a more reproducable result. It is then checked for the size of the data in each portion and then is fed into our logistic regression for further evaluation.

In [118]:
from sklearn.model_selection import train_test_split

In [119]:
# Defines the training and testing data(.08 is an 80 / 20 split for the data, and random
# state defines a specific random shuffle), it then saves them in the above input and then
# is checked for the shape
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size = 0.8, random_state = 20)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(560, 14) (560,)
(140, 14) (140,)


In [120]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [121]:
# Sets up and runs the model
reg = LogisticRegression()
reg.fit(x_train,y_train)

### 2.4 Evaluating the Model <a class="anchor" id="Evaluation"></a>
The first portion of evaluation will depend on scoring the model for accuracy. The following command **reg.score** is capable of testing the model for the x training data against the y training data. In a perfect world the model would come up with 100% accuracy, meaning that the model could perfectly predict the output of the x training data (our input training data) against the y training data (excessive absenteeism targets). However it came close to 80% as the accuracy rating which is fair. The following steps were taken in order to manually ensure accuracy of our scoring function.

In [122]:
# Tests our model for accuracy automatically
reg.score(x_train,y_train)

0.775

Testing the model scoring manaully to see at the end that it does come up with the same value.

In [123]:
# Stores and prints the prediction of the training data
model_outputs = reg.predict(x_train)
model_outputs[0:10]

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1])

In [124]:
# Tests the model outputs against the training data
model_output_list = model_outputs == y_train
model_output_list[0:25]

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True])

In [125]:
# This code is taking the correct numbers comparing it to the size of the outputs
# and finding that it is exactly as accurate as the learning model said
np.sum((model_outputs == y_train)) / model_outputs.shape[0]

0.775

### 2.5 Interpereting coefficients <a class="anchor" id="Interperation"></a>

In [126]:
# Grabs the column names from unscaled inputs as once they become arrays with sklearn that data is lost
feature_name = unscaled_inputs.columns.values

# Defines the feature name as the features we pulled from unscaled inputs table
summary_table = pd.DataFrame (columns=['Feature name'], data = feature_name)

# Defines the coefficient column
summary_table['Coefficient'] = np.transpose(reg.coef_)

# Shifts the whole index down one in order to clear space for the intercept coefficient
summary_table.index += 1

# Inserts the intercept row into table
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

# Sorts this table to actually have the correct order of rows
summary_table = summary_table.sort_index()

# Sets up the odds ratio column
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)

# Re-orders the columns in descending order according to their odds ratio
summary_table.sort_values('Odds_ratio', ascending=False)

summary_table

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Intercept,-1.656109,0.19088
1,Reason_disease,2.800965,16.460523
2,Reason_maternity,0.934858,2.546851
3,Reason_external,3.095616,22.100858
4,Reason_medical,0.856587,2.35511
5,Month,0.166248,1.180866
6,DOTW,-0.08437,0.919091
7,Transportation Expense,0.612733,1.845467
8,Distance to Work,-0.007797,0.992233
9,Age,-0.165923,0.847112


Odds ratios describe the amount of change most likely to be expected for one unit of change in measurement. For example, the **Reason_disease** input is binary. It is either the reason that someone may call out of work or not.

Changing from 0 to 1 correlates to a **1646%** increase in the likelihood that someone will be considered excessively absent. This makes perfect sense! If someone is absent from work due to a severe illness, they will need time to seek treatment and recover, breaking the 3-hour median is a reasonable expectation.

The following variables will be removed as they do not have a significant statistical correlation to the absenteeism targets:
- **DOTW: 8%**
- **Distance to Work: >1%**
- **Daily Work Load Average: >1%**

These variables will have less than 10% change due to 1 unit of change in either direction and can be considered statistically insignificant in our model. Therefore, they are removed, and the model is rerun to improve accuracy and reduce complexity.

### 2.6 Optimizing Model <a class="anchor" id="Optimization"></a>

In [127]:
new_scaled_inputs = scaled_inputs.drop(['DOTW', 'Daily Work Load Average', 'Distance to Work'], axis = 1)
new_scaled_inputs.head()

Unnamed: 0,Reason_disease,Reason_maternity,Reason_external,Reason_medical,Month,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.01928,-0.58969
2,0,0,0,1,0.182726,-0.654143,0.24831,1.002633,0,-0.91903,-0.58969
3,1,0,0,0,0.182726,0.854936,0.405184,-0.643782,0,0.880469,-0.58969
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487


In [128]:
# Defines test split according to new scaled data
x_train, x_test, y_train, y_test = train_test_split(new_scaled_inputs, targets, train_size = 0.8, random_state = 20)

# Fits and trains model
reg = LogisticRegression()
reg.fit(x_train,y_train)

# Tests our model for accuracy automatically
reg.score(x_train,y_train)

0.7732142857142857

In [129]:
# Grabs the column names from unscaled inputs as once they become arrays with sklearn that data is lost
feature_name = new_scaled_inputs.columns.values

# Defines the feature name as the features we pulled from unscaled inputs table
summary_table = pd.DataFrame (columns=['Feature name'], data = feature_name)

# Defines the coefficient column
summary_table['Coefficient'] = np.transpose(reg.coef_)

# Shifts the whole index down one in order to clear space for the intercept coefficient
summary_table.index += 1

# Inserts the intercept row into table
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

# Sorts this table to actually have the correct order of rows
summary_table = summary_table.sort_index()

# Sets up the odds ratio column
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)

# Reorders the columns in descending order according to their odds ratio
summary_table.sort_values('Odds_ratio', ascending=False)

summary_table

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Intercept,-1.647455,0.192539
1,Reason_disease,2.800197,16.447892
2,Reason_maternity,0.951884,2.590585
3,Reason_external,3.115553,22.545903
4,Reason_medical,0.839001,2.314054
5,Month,0.15893,1.172256
6,Transportation Expense,0.605284,1.831773
7,Age,-0.169891,0.843757
8,Body Mass Index,0.279811,1.32288
9,Education,-0.210533,0.810152


Looking at the above table it is clear that some of the odds ratios have changed by removing the statistically insignificant parameters. The model has been reduced in complexity and accuracy has been preserved. With the current information that is presented this would be the better model to finalize and continue the project with.


### 2.7 Testing Model <a class="anchor" id="Testing"></a>

In [130]:
# Starts the testing of the model through test data
reg.score(x_test, y_test)

0.75

The function below will provide two numbers, the left side being the probability of the output being 0 (meaning not excessively absent) or on the right is the probability of 1 (excessively absent).

In [131]:
predicted_proba = reg.predict_proba(x_test)
predicted_proba[0:9]

array([[0.71340413, 0.28659587],
       [0.58724228, 0.41275772],
       [0.44020821, 0.55979179],
       [0.78159464, 0.21840536],
       [0.08410854, 0.91589146],
       [0.33487603, 0.66512397],
       [0.29984576, 0.70015424],
       [0.13103971, 0.86896029],
       [0.78625404, 0.21374596]])

## 3.0 Results <a class="anchor" id="Results"></a>

In [132]:
# Bases the table on x test paramers
model_summary_table = x_test.sort_index()

# Adds the predicted output probability of being excessively absent
model_summary_table['Predicted Absence Probability'] = predicted_proba[:,1]

output_binary = []
for x in predicted_proba[:,1]:
    if x > .5:
        output_binary.append(1)
    else:
        output_binary.append(0)

#Outputs binary predicted output
model_summary_table['Predicted Output'] = output_binary

# Adds the column on whether or not the instance was actually absent
model_summary_table['Real Absence Data'] = y_test

# Outputs accuracy data into summary table
model_summary_table['Prediction Accuracy'] = (model_summary_table['Predicted Output'] == model_summary_table['Real Absence Data'])

# Converted file to csv and exported to folder
model_summary_table.to_csv('Trained_Model_Predicted_Output.csv', index=False)

model_summary_table.head()

Unnamed: 0,Reason_disease,Reason_maternity,Reason_external,Reason_medical,Month,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Predicted Absence Probability,Predicted Output,Real Absence Data,Prediction Accuracy
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.01928,-0.58969,0.286596,0,0,True
7,0,0,0,1,0.182726,0.568211,-0.065439,-0.878984,0,2.679969,-0.58969,0.412758,0,1,False
8,0,0,1,0,0.182726,-1.016322,-0.379188,-0.40858,0,0.880469,-0.58969,0.559792,1,1,True
9,0,0,0,1,0.182726,0.190942,0.091435,0.532229,1,-0.01928,0.268487,0.218405,0,0,True
10,1,0,0,0,0.182726,0.568211,-0.065439,-0.878984,0,2.679969,-0.58969,0.915891,1,1,True


The above table is designed to show the x_test data(**scaled inputs**), next to **Excessive Absence Probability** the model predicted probability that the instance will be excessively absent, and the y_test data (**Absence Data**) which is the actual outcome of the instance being excessively absent or not. This data is out of order because it was randomly chosen when the test and train data was split.

### Saving model and scaler files for later reuse

In [134]:
import pickle

# Pickles and saves reg as the model we have trained
with open('model.pkl', 'wb') as file:
          pickle.dump(reg, file)

# Pickles and saves training data as scaler
with open('scaler.pkl', 'wb') as file:
          pickle.dump(absenteeism_scaler, file)