## Employee Burnout Prediction

Given *data about employees*, let's try to predict the **burn rate** of each employee (the `Burn Rate` column, a score between 0 and 1).

We will train thirteen regression models with default hyperparameters and compare their R^2 scores on a held-out test set.

Data source: https://www.kaggle.com/datasets/blurredmachine/are-your-employees-burning-out

### Importing Libraries

In [60]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import LinearSVR, SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

import warnings
warnings.filterwarnings(action='ignore')

In [39]:
data = pd.read_csv("archive/train.csv")
data

,Employee ID,Date of Joining,Gender,Company Type,WFH Setup Available,Designation,Resource Allocation,Mental Fatigue Score,Burn Rate
0,fffe32003000360033003200,2008-09-30,Female,Service,No,2.0,3.0,3.8,0.16
1,fffe3700360033003500,2008-11-30,Male,Service,Yes,1.0,2.0,5.0,0.36
2,fffe31003300320037003900,2008-03-10,Female,Product,Yes,2.0,,5.8,0.49
3,fffe32003400380032003900,2008-11-03,Male,Service,Yes,1.0,1.0,2.6,0.20
4,fffe31003900340031003600,2008-07-24,Female,Service,No,3.0,7.0,6.9,0.52
...,...,...,...,...,...,...,...,...,...
22745,fffe31003500370039003100,2008-12-30,Female,Service,No,1.0,3.0,,0.41
22746,fffe33003000350031003800,2008-01-19,Female,Product,Yes,3.0,6.0,6.7,0.59
22747,fffe390032003000,2008-11-05,Male,Service,Yes,3.0,7.0,,0.72
22748,fffe33003300320036003900,2008-01-10,Female,Service,No,2.0,5.0,5.9,0.52


In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22750 entries, 0 to 22749
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Employee ID           22750 non-null  object 
 1   Date of Joining       22750 non-null  object 
 2   Gender                22750 non-null  object 
 3   Company Type          22750 non-null  object 
 4   WFH Setup Available   22750 non-null  object 
 5   Designation           22750 non-null  float64
 6   Resource Allocation   21369 non-null  float64
 7   Mental Fatigue Score  20633 non-null  float64
 8   Burn Rate             21626 non-null  float64
dtypes: float64(4), object(5)
memory usage: 1.6+ MB


### Preprocessing

In [41]:
df = data.copy()   

In [42]:
# Drop Employee ID column
df = df.drop('Employee ID', axis=1)
df

,Date of Joining,Gender,Company Type,WFH Setup Available,Designation,Resource Allocation,Mental Fatigue Score,Burn Rate
0,2008-09-30,Female,Service,No,2.0,3.0,3.8,0.16
1,2008-11-30,Male,Service,Yes,1.0,2.0,5.0,0.36
2,2008-03-10,Female,Product,Yes,2.0,,5.8,0.49
3,2008-11-03,Male,Service,Yes,1.0,1.0,2.6,0.20
4,2008-07-24,Female,Service,No,3.0,7.0,6.9,0.52
...,...,...,...,...,...,...,...,...
22745,2008-12-30,Female,Service,No,1.0,3.0,,0.41
22746,2008-01-19,Female,Product,Yes,3.0,6.0,6.7,0.59
22747,2008-11-05,Male,Service,Yes,3.0,7.0,,0.72
22748,2008-01-10,Female,Service,No,2.0,5.0,5.9,0.52


In [43]:
df.isna().sum()

Date of Joining            0
Gender                     0
Company Type               0
WFH Setup Available        0
Designation                0
Resource Allocation     1381
Mental Fatigue Score    2117
Burn Rate               1124
dtype: int64

In [44]:
# Drop rows with missing target values
df = df.dropna(subset=['Burn Rate']).reset_index(drop=True)
df

,Date of Joining,Gender,Company Type,WFH Setup Available,Designation,Resource Allocation,Mental Fatigue Score,Burn Rate
0,2008-09-30,Female,Service,No,2.0,3.0,3.8,0.16
1,2008-11-30,Male,Service,Yes,1.0,2.0,5.0,0.36
2,2008-03-10,Female,Product,Yes,2.0,,5.8,0.49
3,2008-11-03,Male,Service,Yes,1.0,1.0,2.6,0.20
4,2008-07-24,Female,Service,No,3.0,7.0,6.9,0.52
...,...,...,...,...,...,...,...,...
21621,2008-12-30,Female,Service,No,1.0,3.0,,0.41
21622,2008-01-19,Female,Product,Yes,3.0,6.0,6.7,0.59
21623,2008-11-05,Male,Service,Yes,3.0,7.0,,0.72
21624,2008-01-10,Female,Service,No,2.0,5.0,5.9,0.52


In [45]:
df.isna().sum()

Date of Joining            0
Gender                     0
Company Type               0
WFH Setup Available        0
Designation                0
Resource Allocation     1278
Mental Fatigue Score    1945
Burn Rate                  0
dtype: int64

In [46]:
# Fill remaining missing values with column means
for column in ['Resource Allocation', 'Mental Fatigue Score']:
    df[column] = df[column].fillna(df[column].mean())
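
The loop above computes each mean over the full dataset, before the train/test split that follows, so test rows contribute to the fill values. A minimal sketch of one alternative, assuming scikit-learn's `SimpleImputer` and a small made-up frame (`toy` and its values are illustrative, not taken from the dataset), fits the means on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up stand-in for the two columns that contain gaps.
toy = pd.DataFrame({
    "Resource Allocation": [3.0, 2.0, np.nan, 1.0, 7.0, 5.0, np.nan, 6.0],
    "Mental Fatigue Score": [3.8, 5.0, 5.8, 2.6, 6.9, np.nan, 6.7, 5.9],
})

# Split first, so the test rows never influence the fill values.
toy_train, toy_test = train_test_split(toy, train_size=0.75, random_state=1)

# Learn the column means from the training rows only.
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)

# Apply the same training means to both splits.
toy_train_filled = pd.DataFrame(
    imputer.transform(toy_train), index=toy_train.index, columns=toy_train.columns
)
toy_test_filled = pd.DataFrame(
    imputer.transform(toy_test), index=toy_test.index, columns=toy_test.columns
)
```

With only a few percent of values missing the effect on the scores is probably small, but this ordering keeps the test set fully unseen.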

In [47]:
df.isna().sum()

Date of Joining         0
Gender                  0
Company Type            0
WFH Setup Available     0
Designation             0
Resource Allocation     0
Mental Fatigue Score    0
Burn Rate               0
dtype: int64

In [48]:
df

,Date of Joining,Gender,Company Type,WFH Setup Available,Designation,Resource Allocation,Mental Fatigue Score,Burn Rate
0,2008-09-30,Female,Service,No,2.0,3.000000,3.800000,0.16
1,2008-11-30,Male,Service,Yes,1.0,2.000000,5.000000,0.36
2,2008-03-10,Female,Product,Yes,2.0,4.483831,5.800000,0.49
3,2008-11-03,Male,Service,Yes,1.0,1.000000,2.600000,0.20
4,2008-07-24,Female,Service,No,3.0,7.000000,6.900000,0.52
...,...,...,...,...,...,...,...,...
21621,2008-12-30,Female,Service,No,1.0,3.000000,5.729851,0.41
21622,2008-01-19,Female,Product,Yes,3.0,6.000000,6.700000,0.59
21623,2008-11-05,Male,Service,Yes,3.0,7.000000,5.729851,0.72
21624,2008-01-10,Female,Service,No,2.0,5.000000,5.900000,0.52


In [49]:
# Extract date features
df['Date of Joining'] = pd.to_datetime(df['Date of Joining'])
df['Join Month'] = df['Date of Joining'].dt.month
df['Join Day'] = df['Date of Joining'].dt.day
df = df.drop('Date of Joining', axis=1)
df

,Gender,Company Type,WFH Setup Available,Designation,Resource Allocation,Mental Fatigue Score,Burn Rate,Join Month,Join Day
0,Female,Service,No,2.0,3.000000,3.800000,0.16,9,30
1,Male,Service,Yes,1.0,2.000000,5.000000,0.36,11,30
2,Female,Product,Yes,2.0,4.483831,5.800000,0.49,3,10
3,Male,Service,Yes,1.0,1.000000,2.600000,0.20,11,3
4,Female,Service,No,3.0,7.000000,6.900000,0.52,7,24
...,...,...,...,...,...,...,...,...,...
21621,Female,Service,No,1.0,3.000000,5.729851,0.41,12,30
21622,Female,Product,Yes,3.0,6.000000,6.700000,0.59,1,19
21623,Male,Service,Yes,3.0,7.000000,5.729851,0.72,11,5
21624,Female,Service,No,2.0,5.000000,5.900000,0.52,1,10


In [50]:
{column: len(df[column].unique()) for column in df.columns}

{'Gender': 2,
 'Company Type': 2,
 'WFH Setup Available': 2,
 'Designation': 6,
 'Resource Allocation': 11,
 'Mental Fatigue Score': 102,
 'Burn Rate': 101,
 'Join Month': 12,
 'Join Day': 31}

In [51]:
# Binary Encoding
df['Gender'] = df['Gender'].replace({'Female': 0, 'Male': 1})
df['Company Type'] = df['Company Type'].replace({'Service': 0, 'Product': 1})
df['WFH Setup Available'] = df['WFH Setup Available'].replace({'No': 0, 'Yes': 1})
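
`replace` leaves any label that is missing from the dictionary unchanged, so an unexpected category would stay in the column as a string. As a hedged sketch on a made-up frame (`toy` and `mappings` are illustrative names), `Series.map` turns unmapped labels into `NaN`, which makes a quick completeness check possible:

```python
import pandas as pd

# Made-up rows; "Maybe" is a deliberately unexpected label.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "WFH Setup Available": ["No", "Yes", "Maybe"],
})

mappings = {
    "Gender": {"Female": 0, "Male": 1},
    "WFH Setup Available": {"No": 0, "Yes": 1},
}

encoded = toy.copy()
for column, mapping in mappings.items():
    # Labels absent from the mapping become NaN instead of passing through.
    encoded[column] = encoded[column].map(mapping)

# Count the labels that were not covered by each mapping.
unmapped_counts = encoded.isna().sum()
```

A non-zero count in `unmapped_counts` points to a category the mapping did not anticipate.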

In [52]:
df

,Gender,Company Type,WFH Setup Available,Designation,Resource Allocation,Mental Fatigue Score,Burn Rate,Join Month,Join Day
0,0,0,0,2.0,3.000000,3.800000,0.16,9,30
1,1,0,1,1.0,2.000000,5.000000,0.36,11,30
2,0,1,1,2.0,4.483831,5.800000,0.49,3,10
3,1,0,1,1.0,1.000000,2.600000,0.20,11,3
4,0,0,0,3.0,7.000000,6.900000,0.52,7,24
...,...,...,...,...,...,...,...,...,...
21621,0,0,0,1.0,3.000000,5.729851,0.41,12,30
21622,0,1,1,3.0,6.000000,6.700000,0.59,1,19
21623,1,0,1,3.0,7.000000,5.729851,0.72,11,5
21624,0,0,0,2.0,5.000000,5.900000,0.52,1,10


In [53]:
# Split df into X and y
y = df['Burn Rate']
X = df.drop('Burn Rate', axis=1)

In [54]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

In [55]:
X_train

,Gender,Company Type,WFH Setup Available,Designation,Resource Allocation,Mental Fatigue Score,Join Month,Join Day
8275,0,1,0,3.0,6.0,6.6,8,10
21284,1,0,0,4.0,7.0,7.8,12,11
16802,1,0,0,2.0,6.0,6.5,11,19
3271,1,1,0,4.0,9.0,8.9,7,30
5302,0,1,0,2.0,4.0,6.6,9,12
...,...,...,...,...,...,...,...,...
10955,0,0,0,2.0,6.0,7.2,3,3
17289,0,0,1,3.0,4.0,4.8,6,22
5192,0,0,1,3.0,5.0,3.6,10,16
12172,1,1,1,0.0,1.0,3.5,8,1


In [56]:
y_train

8275     0.61
21284    0.81
16802    0.62
3271     0.73
5302     0.43
         ... 
10955    0.58
17289    0.39
5192     0.24
12172    0.18
235      0.00
Name: Burn Rate, Length: 15138, dtype: float64

In [57]:
# Scale X
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
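
The scaler is fit on the training split and then applied to both splits, which keeps test statistics out of the scaling. The same pattern can be bundled with an estimator. The sketch below assumes scikit-learn's `make_pipeline` and uses synthetic data (`X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = 0.1 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.01, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, shuffle=True, random_state=1
)

# fit() learns the scaling statistics from X_tr only, then fits the model;
# score() reuses those statistics to transform X_te before predicting.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)
demo_r2 = pipeline.score(X_te, y_te)
```

A pipeline like this also makes it harder to forget the scaling step when predicting on new data.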

In [58]:
X_train

,Gender,Company Type,WFH Setup Available,Designation,Resource Allocation,Mental Fatigue Score,Join Month,Join Day
8275,-0.954022,1.379211,-1.087295,0.725025,0.768001,0.475128,0.433442,-0.649693
21284,1.048194,-0.725052,-1.087295,1.604608,1.270205,1.131455,1.596251,-0.536187
16802,1.048194,-0.725052,-1.087295,-0.154557,0.768001,0.420434,1.305549,0.371860
3271,1.048194,1.379211,-1.087295,1.604608,2.274612,1.733089,0.142739,1.620424
5302,-0.954022,1.379211,-1.087295,-0.154557,-0.236406,0.475128,0.724144,-0.422682
...,...,...,...,...,...,...,...,...
10955,-0.954022,-0.725052,-1.087295,-0.154557,0.768001,0.803292,-1.020070,-1.444234
17289,-0.954022,-0.725052,0.919713,0.725025,-0.236406,-0.509363,-0.147963,0.712377
5192,-0.954022,-0.725052,0.919713,0.725025,0.265797,-1.165690,1.014847,0.031342
12172,1.048194,1.379211,0.919713,-1.913723,-1.743017,-1.220384,0.433442,-1.671246


### Training

In [61]:
models = {
    "Linear Regression": LinearRegression(),
    "Linear Regression (L2 Regularization)": Ridge(),
    "Linear Regression (L1 Regularization)": Lasso(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Neural Network": MLPRegressor(),
    "Support Vector Machine (Linear Kernel)": LinearSVR(),
    "Support Vector Machine (RBF Kernel)": SVR(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "XGBoost": XGBRegressor(),
    "LightGBM": LGBMRegressor(),
    "CatBoost": CatBoostRegressor(verbose=0)
}
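
All of these estimators are created with library defaults. For `Lasso`, the default `alpha=1.0` is a strong penalty when the target lies between 0 and 1: it can shrink every coefficient to zero, leaving a model that predicts the training mean. That would be consistent with the near-zero R^2 reported for it in the results. A sketch on synthetic data (the data and the value `alpha=0.001` are assumptions for illustration, not tuned settings):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic standardized features and a target on a 0-1 scale.
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(500, 4)))
y_demo = np.clip(
    0.45 + 0.15 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.05, size=500),
    0.0,
    1.0,
)

# Default penalty (alpha=1.0): large relative to the scale of the target.
default_lasso = Lasso().fit(X_demo, y_demo)

# A much weaker penalty, chosen arbitrarily for this demo.
weak_lasso = Lasso(alpha=0.001).fit(X_demo, y_demo)

default_r2 = default_lasso.score(X_demo, y_demo)
weak_r2 = weak_lasso.score(X_demo, y_demo)
```

On this toy data the default model keeps no features, while the weaker penalty recovers most of the signal. Whether a smaller `alpha` helps on the real dataset would need to be checked, for example with `LassoCV`.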

In [62]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

Linear Regression trained.
Linear Regression (L2 Regularization) trained.
Linear Regression (L1 Regularization) trained.
K-Nearest Neighbors trained.
Neural Network trained.
Support Vector Machine (Linear Kernel) trained.
Support Vector Machine (RBF Kernel) trained.
Decision Tree trained.
Random Forest trained.
Gradient Boosting trained.
XGBoost trained.
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000190 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 176
[LightGBM] [Info] Number of data points in the train set: 15138, number of used features: 8
[LightGBM] [Info] Start training from score 0.451776
LightGBM trained.
CatBoost trained.


### Results

In [63]:
for name, model in models.items():
    print(name + " R^2 Score: {:.5f}".format(model.score(X_test, y_test)))

Linear Regression R^2 Score: 0.87075
Linear Regression (L2 Regularization) R^2 Score: 0.87075
Linear Regression (L1 Regularization) R^2 Score: -0.00001
K-Nearest Neighbors R^2 Score: 0.85608
Neural Network R^2 Score: 0.86956
Support Vector Machine (Linear Kernel) R^2 Score: 0.86767
Support Vector Machine (RBF Kernel) R^2 Score: 0.88430
Decision Tree R^2 Score: 0.81752
Random Forest R^2 Score: 0.89832
Gradient Boosting R^2 Score: 0.90257
XGBoost R^2 Score: 0.90357
LightGBM R^2 Score: 0.90912
CatBoost R^2 Score: 0.90822
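
R^2 is unitless, so it does not say how far off a typical prediction is in burn-rate units. A sketch of two complementary error metrics, using the first five `Burn Rate` values shown in the tables above and made-up predictions (`y_pred_demo` is illustrative, not model output):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# First five Burn Rate values from the tables above.
y_true_demo = np.array([0.16, 0.36, 0.49, 0.20, 0.52])
# Made-up predictions, for illustration only.
y_pred_demo = np.array([0.20, 0.33, 0.50, 0.25, 0.55])

# Mean absolute error: average size of the miss, in burn-rate units.
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# Root mean squared error: like MAE, but penalizes large misses more.
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

# R^2 on the same pairs, for comparison with the scores printed above.
r2 = r2_score(y_true_demo, y_pred_demo)
```

In the notebook, the same functions could be applied to `y_test` and `model.predict(X_test)` inside the results loop.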
