# Absenteeism at work

## Overview

The goal is to build a model that predicts how likely an employee is to be absent from work during normal working hours, given some inputs. To that end, the dataset provides characteristics of the employees along with their past absenteeism records.

This notebook covers the preprocessing stage: it produces a cleaned .csv file that will later be used in the modeling stage, where the model itself is built.

## Importing the relevant libraries

In [1]:
import pandas as pd

## Preprocessing

In [2]:
# Importing the data and creating a backup variable before manipulating
raw_data = pd.read_csv('Absenteeism-data.csv')
df = raw_data.copy()

The provided dataset has 12 columns in total. Ten of them are characteristics that seem relevant for predicting whether an employee is expected to be away from work, such as the reason for absence, distance to work, number of children, number of pets and others. There is also an ID column, which distinguishes one employee from another, and a final column recording how long the given employee was absent from work in the past.

In [3]:
# Checking for general info and missing values
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB
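
`df.info()` reports non-null counts, from which missing values have to be read off by eye. A minimal sketch of a more direct check is shown here, assuming a small made-up frame (`toy` is hypothetical and not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value (not the real dataset)
toy = pd.DataFrame({
    'Age': [27, 33, np.nan],
    'Pets': [0, 1, 2],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to `df`, the same call should return zero for every column, in line with the 700 non-null entries listed above.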


In [4]:
# Dropping ID, since it does not provide relevant information to the model
df = df.drop(['ID'],axis=1)

The feature 'Reason for Absence' has 28 possible values, each representing a different explanation for being away from work, and only one of them can apply to a given absence. It is therefore categorical data and requires dummy variables. Age will also be treated as categorical data in this problem, so that it can later be grouped into ranges.

In [5]:
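# Creating dummies for the categorical data
# Dropping the first column to avoid issues with multicollinearity
# (the lowest value of each feature becomes the baseline: reason 0 and the youngest age present)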
reasons_dummies = pd.get_dummies(df['Reason for Absence'], drop_first = True)
age_dummies = pd.get_dummies(df['Age'], drop_first = True)

# Dropping the original 'Reason for Absence' and 'Age' in the Data Frame, which will no longer be useful
df = df.drop(['Reason for Absence', 'Age'], axis = 1)
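
To illustrate what `drop_first` does, here is a minimal sketch on made-up reason codes. The `toy_reasons` series is hypothetical, and `dtype=int` is an addition used only to print 0/1 instead of booleans:

```python
import pandas as pd

# Hypothetical reason codes for five absences (not the real dataset)
toy_reasons = pd.Series([0, 1, 15, 1, 22], name='Reason for Absence')

# One indicator column per distinct code
all_dummies = pd.get_dummies(toy_reasons, dtype=int)

# drop_first=True removes the column of the smallest code (0 here),
# which then acts as the baseline category
reduced_dummies = pd.get_dummies(toy_reasons, drop_first=True, dtype=int)

print(list(all_dummies.columns))      # [0, 1, 15, 22]
print(list(reduced_dummies.columns))  # [1, 15, 22]
```

Without `drop_first`, the indicator columns always sum to 1 in every row, which is the redundancy that causes multicollinearity.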

## Classifying the data

The file 'reasons_for_absenteeism.png' describes the 28 possible reasons for absenteeism. It's evident that some of them are quite similar and could be grouped by different topics. It's a good idea to do so, thus avoiding creating too many dummy variables which would add unnecessary complexity to the model. The same idea also applies to the age of the employees. 
<br/><br/>
The groups for 'Reason for Absence' are the following:
<br/><br/>
0 -> Unknown reasons, which was dropped when creating the dummy variables and serves as the baseline for the model

1-14 -> Disease related

15-17 -> Pregnancy related

18-21 -> Poisoning and injuries 

22-28 -> Light diseases and medical appointments

<br/><br/>
The age groups proposed are:
<br/><br/>
27 - 32 years

33 - 39 years

40 - 47 years

48 - 58 years

In [6]:
# We only care whether at least one of the reasons (or ages) within a group is present,
# so we use .max(axis=1), which collapses the columns of a group into 1 if at
# least one of them is set for the given row, or 0 otherwise

reasons_group_1 = reasons_dummies.loc[:,1:14].max(axis=1)
reasons_group_2 = reasons_dummies.loc[:,15:17].max(axis=1)
reasons_group_3 = reasons_dummies.loc[:,18:21].max(axis=1)
reasons_group_4 = reasons_dummies.loc[:,22:28].max(axis=1)

age_group_1 = age_dummies.loc[:,27:32].max(axis=1)
age_group_2 = age_dummies.loc[:,33:39].max(axis=1)
age_group_3 = age_dummies.loc[:,40:47].max(axis=1)
age_group_4 = age_dummies.loc[:,48:58].max(axis=1)
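
The grouping step can be checked on a small example. The sketch below uses made-up reason codes (`toy_reasons` is hypothetical) and shows that the label-based slice still works when most labels in the range are absent, as long as the column labels are sorted:

```python
import pandas as pd

# Hypothetical reason codes (not the real dataset)
toy_reasons = pd.Series([0, 3, 16, 19, 23, 7])
toy_dummies = pd.get_dummies(toy_reasons, drop_first=True, dtype=int)

# .loc slices by label and includes both ends: this picks every column
# whose label lies in 1..14 (here only 3 and 7)
group_1 = toy_dummies.loc[:, 1:14].max(axis=1)
group_2 = toy_dummies.loc[:, 15:17].max(axis=1)

print(group_1.tolist())  # [0, 1, 0, 0, 0, 1]
print(group_2.tolist())  # [0, 0, 1, 0, 0, 0]
```

The first row has reason 0, so it is 0 in every group, which is what makes it the baseline.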

In [7]:
# Merging the created dummies with the dataset we're using
df_concat = pd.concat([df,reasons_group_1, reasons_group_2, reasons_group_3, reasons_group_4, 
                       age_group_1, age_group_2, age_group_3, age_group_4],axis =1)

In [8]:
# Renaming the dummy variable columns
columns_values = ['Date', 'Transportation Expense', 'Distance to Work',
                  'Daily Work Load Average', 'Body Mass Index', 'Education',
                  'Children', 'Pets', 'Absenteeism Time in Hours', 
                  'Reasons_1', 'Reasons_2', 'Reasons_3', 'Reasons_4', 
                  'Age_1', 'Age_2', 'Age_3', 'Age_4']
df_concat.columns = columns_values
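
Assigning a full list to `.columns` relies on the names being typed in exactly the order of the concatenated columns. One possible alternative, sketched on hypothetical toy objects rather than the notebook's variables, is to name each Series before concatenating:

```python
import pandas as pd

# Hypothetical stand-ins for the dataset and two grouped dummies
toy = pd.DataFrame({'Pets': [0, 1, 2]})
reasons_group = pd.Series([1, 0, 0])
age_group = pd.Series([0, 1, 0])

# Series.rename with a scalar sets the Series name, which pd.concat
# then uses as the column label
toy_concat = pd.concat(
    [toy, reasons_group.rename('Reasons_1'), age_group.rename('Age_1')],
    axis=1,
)

print(list(toy_concat.columns))  # ['Pets', 'Reasons_1', 'Age_1']
```

Both approaches give the same result here; the named version just keeps the existing column labels untouched.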

In [9]:
# Creating a checkpoint
df_checkpoint = df_concat.copy()

## Modifying the Date information

The provided dataset contains information on when the given employee was absent, which could also be a good predictor. Perhaps the odds of absenteeism could be greater during some days of the week or months.  To make use of this information, the date information needs to be split into these categories.

In [10]:
# Converting the 'Date' column into the suitable format
df_checkpoint['Date'] = pd.to_datetime(df_checkpoint['Date'],format = '%d/%m/%Y')

In [11]:
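def week_day(date_value):
    # Day of the week as an integer: Monday is 0, Sunday is 6
    return date_value.weekday()

def get_month(date_value):
    # Month as an integer from 1 (January) to 12 (December)
    return date_value.month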

In [12]:
# Extracting the day of the week and month into another column and dropping 'Date', which isn't useful anymore
df_checkpoint['Week Day'] = df_checkpoint['Date'].apply(week_day)
df_checkpoint['Month'] = df_checkpoint['Date'].apply(get_month)
df_checkpoint = df_checkpoint.drop(['Date'], axis=1)
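
As a quick check of the extraction, here is a minimal sketch on made-up dates (`toy` is hypothetical). It uses the pandas `.dt` accessor, a vectorised alternative that should give the same values as applying `week_day` and `get_month` row by row:

```python
import pandas as pd

# Hypothetical dates in day/month/year form (not the real dataset)
toy = pd.DataFrame({'Date': ['07/07/2015', '14/07/2015', '23/05/2018']})
toy['Date'] = pd.to_datetime(toy['Date'], format='%d/%m/%Y')

# Vectorised equivalents of week_day() and get_month()
toy['Week Day'] = toy['Date'].dt.weekday  # Monday is 0, Sunday is 6
toy['Month'] = toy['Date'].dt.month

print(toy['Week Day'].tolist())  # [1, 1, 2]
print(toy['Month'].tolist())     # [7, 7, 5]
```

The first two dates are Tuesdays and the third is a Wednesday, hence the values 1, 1 and 2.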

In [13]:
# Reordering the columns
columns_values = ['Reasons_1', 'Reasons_2', 'Reasons_3', 'Reasons_4','Week Day', 'Month',
                  'Transportation Expense', 'Distance to Work',
                  'Daily Work Load Average', 'Body Mass Index', 'Education',
                  'Children', 'Pets', 'Age_1', 'Age_2', 'Age_3', 'Age_4',
                  'Absenteeism Time in Hours']
df_checkpoint = df_checkpoint[columns_values]

In [21]:
# Creating a checkpoint
df_date_mod = df_checkpoint.copy()

## Approaching the 'Education' feature

We're checking if the education level could be a good predictor of absenteeism. There are four categories regarding education:


1 = High School 

2 = Graduate 

3 = Post Graduate 

4 = Master or Doctor

In [23]:
df_date_mod['Education'].value_counts()

1    583
3     73
2     40
4      4
Name: Education, dtype: int64

Each of the three higher education levels has too few samples on its own and probably wouldn't be relevant to the model individually. It makes sense to group them into a single category.

In [16]:
df_date_mod['Education'] = df_date_mod['Education'].map({1:0, 2:1 ,3:1, 4:1})

In [17]:
df_date_mod['Education'].value_counts()

0    583
1    117
Name: Education, dtype: int64
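
A small sketch of the same mapping on made-up values (`toy_education` is hypothetical) shows one thing worth checking: any value missing from the dictionary is turned into NaN by `.map`:

```python
import pandas as pd

education_map = {1: 0, 2: 1, 3: 1, 4: 1}

# Hypothetical education codes (not the real dataset)
toy_education = pd.Series([1, 3, 2, 1, 4])
binary_education = toy_education.map(education_map)
print(binary_education.tolist())  # [0, 1, 1, 0, 1]

# A code that is not in the dictionary (5 here) becomes NaN
with_unknown = pd.Series([1, 5]).map(education_map)
print(with_unknown.isna().tolist())  # [False, True]
```

In the notebook the counts before and after the mapping both add up to 700, which suggests no value was left unmapped.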

In [18]:
# Creating a checkpoint
df_preprocessed = df_date_mod.copy()

## Final checkpoint

In [19]:
# Uncomment the line below to write the preprocessed data to a .csv file
# df_preprocessed.to_csv('Absenteeism_preprocessed.csv', index=False)
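
To confirm that an export of this kind can be read back unchanged, a minimal round-trip sketch follows. The `toy` frame and the file name `toy_preprocessed.csv` are hypothetical, and a temporary directory is used so that nothing is left on disk:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the preprocessed data
toy = pd.DataFrame({'Education': [0, 1], 'Absenteeism Time in Hours': [4, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_preprocessed.csv')
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(list(restored.columns))
```

With `index=False`, the file contains only the data columns, so reading it back does not introduce an extra unnamed index column.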