
# 1- Absenteeism - Data Preprocessing
This project analyzes the factors that contribute to workplace absenteeism and develops a predictive model to identify employees at risk of high absenteeism. This notebook covers the first stage: data preprocessing.

---

###   **Input file**: `Absenteeism_data.csv`
   
###   **Output file**: `preprocessedd_1.csv`

---

#### Dataset Overview
- ID: Unique identifier for employees.
- Reason for Absence: Categorical variable indicating the reason for absenteeism.
- Date: Date of absenteeism.
- Transportation Expense: Cost incurred for transportation.
- Distance to Work: Distance between home and workplace.
- Age: Age of the employee.
- Daily Work Load Average: Average workload of the employee.
- Body Mass Index (BMI): BMI of the employee.
- Education: Education level (categorical).
- Children: Number of children the employee has.
- Pets: Number of pets the employee has.
- Absenteeism Time in Hours: Target variable indicating hours of absence.

---

#### Steps to Design the Project

1. **Problem Understanding**
   - Define the problem: Analyze absenteeism and its causes to mitigate its impact on productivity.
   - Target variable: `Absenteeism Time in Hours`.

2. **Data Preprocessing**
   - Handle missing values if present.
   - Transform the `Date` feature:
     - Extract `Year`, `Month`, `Day of the Week`, and `Holiday/Workday` indicators.
   - Group `Reason for Absence` into 4 broader categories:
     - Health-related issues, Family responsibilities, Transportation issues, Others.
   - Scale numerical features like `Transportation Expense`, `Distance to Work`, and `Daily Work Load Average`.
   - Encode categorical variables (`Education`, `Reason for Absence`).
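
The scaling step listed above is not carried out in the cells of this notebook. As a minimal sketch, assuming standardization (zero mean, unit variance) is the kind of scaling intended, and using a small made-up frame rather than the real data:

```python
import pandas as pd

# Toy rows: column names come from the dataset, values are illustrative only
toy = pd.DataFrame({
    "Transportation Expense": [289, 118, 179, 279],
    "Distance to Work": [36, 13, 51, 5],
    "Daily Work Load Average": [239.5, 205.9, 241.4, 253.4],
})

# Standardize each column: subtract its mean, divide by its population std
scaled = (toy - toy.mean()) / toy.std(ddof=0)
print(scaled.round(3))
```

In a modeling workflow the mean and standard deviation would normally be computed on the training split only and then reused for the test split.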


   


In [None]:
import pandas as pd

In [None]:
raw_csv_data=pd.read_csv('Absenteeism_data.csv')

In [None]:
#raw_csv_data
raw_csv_data.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [None]:
df=raw_csv_data.copy()
df.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [None]:
# Show every column and every row when a DataFrame is displayed
pd.options.display.max_columns=None
pd.options.display.max_rows=None
#display(df)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB
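
The `info()` output above reports 700 non-null values in every column, so there is nothing to impute here. For other data, a per-column count of gaps is a quick way to run the same check. A minimal sketch on a hypothetical frame (the `NaN` is planted only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one planted gap, only to show what the check reports
toy = pd.DataFrame({"Age": [33, 50, np.nan], "Pets": [1, 0, 0]})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(missing_per_column.any())
print(has_missing)
```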


In [None]:
# drop() returns a new DataFrame, so assign the result to actually remove ID from df
df = df.drop(['ID'], axis=1)
df.head()

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [None]:
df['Reason for Absence'].head()

Unnamed: 0,Reason for Absence
0,26
1,0
2,23
3,7
4,23


In [None]:
print("Min= ",df['Reason for Absence'].min())
print("Max= ",df['Reason for Absence'].max())

Min=  0
Max=  28


In [None]:
# pd.unique(df['Reason for Absence'])
df['Reason for Absence'].unique()

array([26,  0, 23,  7, 22, 19,  1, 11, 14, 21, 10, 13, 28, 18, 25, 24,  6,
       27, 17,  8, 12,  5,  9, 15,  4,  3,  2, 16])

In [None]:
print("len= ",len(df['Reason for Absence'].unique()))
sorted(df['Reason for Absence'].unique())
# Reason code 20 never occurs in the data, so there are 28 distinct values in the range 0-28

len=  28


[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28]

In [None]:
# Request integer 0/1 indicators explicitly; recent pandas versions default to booleans
reason_columns=pd.get_dummies(df['Reason for Absence'], dtype='uint8')
reason_columns

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [None]:
reason_columns['check']=reason_columns.sum(axis=1)
reason_columns

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,21,22,23,24,25,26,27,28,check
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
696,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
697,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


In [None]:
reason_columns['check'].sum(axis=0)

700

In [None]:

reason_columns['check'].unique()

array([1], dtype=int64)
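
The `check` column confirms that every row has exactly one reason flagged. The same sanity check can be written as a single assertion, sketched here on the first five reason codes shown earlier (the names `dummies` and `row_totals` are illustrative, not from the notebook):

```python
import pandas as pd

# First five reason codes from the dataset preview
reasons = pd.Series([26, 0, 23, 7, 23])

# One indicator column per distinct code, stored as integers
dummies = pd.get_dummies(reasons, dtype=int)

# Each row should have exactly one indicator set
row_totals = dummies.sum(axis=1)
assert (row_totals == 1).all(), "every row must map to exactly one reason"
print(dummies)
```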

In [None]:
reason_columns=reason_columns.drop(['check'], axis=1)
reason_columns

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
696,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
697,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [None]:
df.columns.values

array(['Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours'], dtype=object)

In [None]:
reason_columns.columns.values

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 21, 22, 23, 24, 25, 26, 27, 28], dtype=object)

In [None]:
df=df.drop(['Reason for Absence'],axis=1)
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [None]:
reason_columns.loc[:,15:17].head()

Unnamed: 0,15,16,17
0,0,0,0
1,0,0,0
2,0,0,0
3,0,0,0
4,0,0,0


In [None]:
reason_columns.loc[:,15:17].max(axis=1).head()

0    0
1    0
2    0
3    0
4    0
dtype: uint8

In [None]:
reason_type_1=reason_columns.loc[:,1:14].max(axis=1)
reason_type_2=reason_columns.loc[:,15:17].max(axis=1)
reason_type_3=reason_columns.loc[:,18:21].max(axis=1)
reason_type_4=reason_columns.loc[:,22:].max(axis=1)

In [None]:
reason_type_4.head()

0    1
1    0
2    1
3    0
4    1
dtype: uint8
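
The four groups are defined purely by code ranges (1-14, 15-17, 18-21, 22-28), with code 0 falling into none of them. An equivalent way to express the grouping, sketched on the first five reason codes and using a hypothetical helper `reason_group` that is not part of the notebook:

```python
import pandas as pd

# First five reason codes from the dataset preview
reasons = pd.Series([26, 0, 23, 7, 23], name="Reason for Absence")

def reason_group(code):
    """Map a raw reason code to its group number (0 means no group)."""
    if 1 <= code <= 14:
        return 1
    if 15 <= code <= 17:
        return 2
    if 18 <= code <= 21:
        return 3
    if code >= 22:
        return 4
    return 0

groups = reasons.map(reason_group)

# One 0/1 indicator column per group, named like the final columns
indicators = pd.DataFrame(
    {f"reason_{g}": (groups == g).astype(int) for g in range(1, 5)}
)
print(indicators)
```

Leaving code 0 without its own indicator keeps it as the baseline category, which is what the slicing approach in the cell above does implicitly by starting at column 1.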

In [None]:
df=pd.concat([df,reason_type_1,reason_type_2,reason_type_3,reason_type_4], axis=1)
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,0,1,2,3
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1


In [None]:
df.columns.values

array(['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 0, 1, 2, 3],
      dtype=object)

In [None]:
column_names=['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours','reason_1','reason_2','reason_3','reason_4']

In [None]:
df.columns=column_names

In [None]:
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,reason_1,reason_2,reason_3,reason_4
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1


In [None]:
column_names_reordered=['reason_1','reason_2','reason_3','reason_4','Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']

In [None]:
df=df[column_names_reordered]
df.head()

Unnamed: 0,reason_1,reason_2,reason_3,reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [None]:
df_reason_mod=df.copy()
df_reason_mod.head()

Unnamed: 0,reason_1,reason_2,reason_3,reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [None]:
df_reason_mod['Date'].head()

0    07/07/2015
1    14/07/2015
2    15/07/2015
3    16/07/2015
4    23/07/2015
Name: Date, dtype: object

In [None]:
type(df_reason_mod['Date'])

pandas.core.series.Series

In [None]:
type(df_reason_mod['Date'][0])

str

In [None]:
# Dates are stored day-first (dd/mm/yyyy); give the format explicitly so that
# days 1-12 are never mistaken for months
df_reason_mod['Date']=pd.to_datetime(df_reason_mod['Date'], format='%d/%m/%Y')
df_reason_mod['Date'].head()


0   2015-07-07
1   2015-07-14
2   2015-07-15
3   2015-07-16
4   2015-07-23
Name: Date, dtype: datetime64[ns]

In [None]:
# The column is already datetime at this point; the format string must include the slashes
df_reason_mod['Date']=pd.to_datetime(df_reason_mod['Date'], format='%d/%m/%Y')
df_reason_mod['Date']

0     2015-07-07
1     2015-07-14
2     2015-07-15
3     2015-07-16
4     2015-07-23
         ...    
695   2018-05-23
696   2018-05-23
697   2018-05-24
698   2018-05-24
699   2018-05-31
Name: Date, Length: 700, dtype: datetime64[ns]
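
Why the explicit format matters: a string whose day is 12 or lower is ambiguous, and the two possible readings give different months. A minimal sketch with a made-up date string (not taken from the dataset):

```python
import pandas as pd

ambiguous = "05/06/2016"  # illustrative value only

# Day-first reading, which matches how this dataset stores dates
day_first = pd.to_datetime(ambiguous, format="%d/%m/%Y")

# Month-first reading of the very same string
month_first = pd.to_datetime(ambiguous, format="%m/%d/%Y")

print(day_first.month, month_first.month)
```

Any `Month Value` derived from a wrongly parsed date would be silently wrong, so it is worth fixing the format once at parsing time.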

In [None]:
df_reason_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   reason_1                   700 non-null    uint8         
 1   reason_2                   700 non-null    uint8         
 2   reason_3                   700 non-null    uint8         
 3   reason_4                   700 non-null    uint8         
 4   Date                       700 non-null    datetime64[ns]
 5   Transportation Expense     700 non-null    int64         
 6   Distance to Work           700 non-null    int64         
 7   Age                        700 non-null    int64         
 8   Daily Work Load Average    700 non-null    float64       
 9   Body Mass Index            700 non-null    int64         
 10  Education                  700 non-null    int64         
 11  Children                   700 non-null    int64         
 12  Pets    

In [None]:
df_reason_mod['Date'][0]

Timestamp('2015-07-07 00:00:00')

In [None]:
df_reason_mod['Date'][0].month

7

In [None]:
list_months=[]
list_months

[]

In [None]:
df_reason_mod.shape

(700, 14)

In [None]:
# Collect the month of every date; preview only the first few values
for i in range(df_reason_mod.shape[0]):
    list_months.append(df_reason_mod['Date'][i].month)
list_months[:5]

[7, 7, 7, 7, 7]

In [None]:
len(list_months)

700

In [None]:
df_reason_mod['Month Value']=list_months
df_reason_mod.head(2)

Unnamed: 0,reason_1,reason_2,reason_3,reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,1,1,0,0,7


In [None]:
df_reason_mod['Date'][699].weekday()

3

In [None]:
df_reason_mod['Date'][699]

Timestamp('2018-05-31 00:00:00')

In [None]:
def date_to_weekday(date_value):
    return date_value.weekday()

In [None]:
df_reason_mod['Day of the week']=df_reason_mod['Date'].apply(date_to_weekday)

In [None]:
df_reason_mod.head()

Unnamed: 0,reason_1,reason_2,reason_3,reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the week
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,1,1,0,0,7,1
2,0,0,0,1,2015-07-15,179,51,38,239.554,31,1,0,0,2,7,2
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,1,2,0,4,7,3
4,0,0,0,1,2015-07-23,289,36,33,239.554,30,1,2,1,2,7,3
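
Both date features can also be extracted without a Python loop or `apply`, through the `.dt` accessor of a datetime column. A short sketch on the first five dates of the dataset (the frame name `toy` is illustrative):

```python
import pandas as pd

# First five dates from the dataset, parsed day-first
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["07/07/2015", "14/07/2015", "15/07/2015", "16/07/2015", "23/07/2015"],
        format="%d/%m/%Y",
    )
})

# Vectorized equivalents of the loop and of date_to_weekday
toy["Month Value"] = toy["Date"].dt.month
toy["Day of the week"] = toy["Date"].dt.weekday  # Monday=0 ... Sunday=6
print(toy)
```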


In [None]:
df_reason_date_mod=df_reason_mod.copy()

In [None]:
df_reason_date_mod = df_reason_date_mod.drop(['Date'], axis=1)
df_reason_date_mod.head()

Unnamed: 0,reason_1,reason_2,reason_3,reason_4,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the week
0,0,0,0,1,289,36,33,239.554,30,1,2,1,4,7,1
1,0,0,0,0,118,13,50,239.554,31,1,1,0,0,7,1
2,0,0,0,1,179,51,38,239.554,31,1,0,0,2,7,2
3,1,0,0,0,279,5,39,239.554,24,1,2,0,4,7,3
4,0,0,0,1,289,36,33,239.554,30,1,2,1,2,7,3


In [None]:
df_reason_date_mod.columns.values

array(['reason_1', 'reason_2', 'reason_3', 'reason_4',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Month Value',
       'Day of the week'], dtype=object)

In [None]:
column_names_reordered_2=['reason_1', 'reason_2', 'reason_3', 'reason_4','Month Value',
       'Day of the week','Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']

In [None]:
df_reason_date_mod=df_reason_date_mod[column_names_reordered_2]
df_reason_date_mod.head(2)

Unnamed: 0,reason_1,reason_2,reason_3,reason_4,Month Value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0


In [None]:
df_reason_date_mod['Education'].unique()

array([1, 3, 2, 4], dtype=int64)

In [None]:
df_reason_date_mod['Education'].value_counts()

1    583
3     73
2     40
4      4
Name: Education, dtype: int64

In [None]:
df_reason_date_mod['Education']=df_reason_date_mod['Education'].map({1:0,2:1,3:1,4:1})

In [None]:
df_reason_date_mod['Education'].unique()

array([0, 1], dtype=int64)

In [None]:
df_reason_date_mod['Education'].value_counts()

0    583
1    117
Name: Education, dtype: int64
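
The mapping collapses `Education` into two classes: level 1 becomes 0 and levels 2 to 4 become 1. Because the codes are ordered, a comparison gives the same result; a small sketch on made-up values showing that both forms agree:

```python
import pandas as pd

# Illustrative education codes covering all four levels
education = pd.Series([1, 3, 2, 4, 1], name="Education")

# Dictionary mapping, as used in the notebook
via_map = education.map({1: 0, 2: 1, 3: 1, 4: 1})

# Equivalent comparison: anything above level 1 becomes 1
via_compare = (education > 1).astype(int)

print(via_map.equals(via_compare))
```

The dictionary form has the advantage of turning any unexpected code into `NaN`, which makes bad input visible rather than silently assigning it to class 1.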

In [None]:
df_preprocessed_1=df_reason_date_mod.copy()
df_preprocessed_1.head(2)

Unnamed: 0,reason_1,reason_2,reason_3,reason_4,Month Value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0


In [None]:
df_preprocessed_1.to_csv('preprocessedd_1.csv', index=False)
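
Before handing the file to the next stage, it can be useful to confirm that the frame survives a write and read cycle unchanged. A minimal sketch, using an in-memory buffer instead of a real file and a two-row frame built from a few of the final columns:

```python
import io

import pandas as pd

# Two rows with a subset of the preprocessed columns
toy = pd.DataFrame({
    "reason_4": [1, 0],
    "Month Value": [7, 7],
    "Age": [33, 50],
    "Absenteeism Time in Hours": [4, 0],
})

# Write without the index, exactly as in the export cell, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Raises an AssertionError if columns, dtypes, or values differ
pd.testing.assert_frame_equal(restored, toy)
print(restored.shape)
```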