## EMPLOYEE ATTRITION PREDICTION MODEL

This machine learning model predicts whether an employee is likely to leave the company, based on HR attributes such as overtime, marital status, tenure, job level and monthly income.

### Workflow
1. Get the data ready
   - Clean and Transform the data
   - Convert String Columns to Numbers
   - Filter Important Features
2. Choose the Model, Train the Model and Predict
3. Evaluate the Model and Improve the Model (if score is low)
4. Save and Load the Model

### 1. Get the data ready

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# import the data
df_attrition = pd.read_excel("data/hr_attrition.xlsx")
df_attrition.head()

Unnamed: 0,Attrition,Business Travel,CF_age band,CF_attrition label,Department,Education Field,emp no,Employee Number,Gender,Job Role,...,Performance Rating,Relationship Satisfaction,Standard Hours,Stock Option Level,Total Working Years,Work Life Balance,Years At Company,Years In Current Role,Years Since Last Promotion,Years With Curr Manager
0,Yes,Travel_Rarely,35 - 44,Ex-Employees,Sales,Life Sciences,STAFF-1,1,Female,Sales Executive,...,3,1,80,0,8,1,6,4,0,5
1,No,Travel_Frequently,45 - 54,Current Employees,R&D,Life Sciences,STAFF-2,2,Male,Research Scientist,...,4,4,80,1,10,3,10,7,1,7
2,Yes,Travel_Rarely,35 - 44,Ex-Employees,R&D,Other,STAFF-4,4,Male,Laboratory Technician,...,3,2,80,0,7,3,0,0,0,0
3,No,Travel_Frequently,25 - 34,Current Employees,R&D,Life Sciences,STAFF-5,5,Female,Research Scientist,...,3,3,80,0,8,3,8,7,3,0
4,No,Travel_Rarely,25 - 34,Current Employees,R&D,Medical,STAFF-7,7,Male,Laboratory Technician,...,3,4,80,1,6,3,2,2,2,2


In [3]:
# set options to see all columns
pd.set_option('display.max_columns', 40)

In [4]:
df_attrition.head(3)

Unnamed: 0,Attrition,Business Travel,CF_age band,CF_attrition label,Department,Education Field,emp no,Employee Number,Gender,Job Role,Marital Status,Over Time,Over18,Training Times Last Year,Age,CF_current Employee,Daily Rate,Distance From Home,Education,Employee Count,Environment Satisfaction,Hourly Rate,Job Involvement,Job Level,Job Satisfaction,Monthly Income,Monthly Rate,Num Companies Worked,Percent Salary Hike,Performance Rating,Relationship Satisfaction,Standard Hours,Stock Option Level,Total Working Years,Work Life Balance,Years At Company,Years In Current Role,Years Since Last Promotion,Years With Curr Manager
0,Yes,Travel_Rarely,35 - 44,Ex-Employees,Sales,Life Sciences,STAFF-1,1,Female,Sales Executive,Single,Yes,Y,0,41,0,1102,1,Associates Degree,1,2,94,3,2,4,5993,19479,8,11,3,1,80,0,8,1,6,4,0,5
1,No,Travel_Frequently,45 - 54,Current Employees,R&D,Life Sciences,STAFF-2,2,Male,Research Scientist,Married,No,Y,3,49,1,279,8,High School,1,3,61,2,2,2,5130,24907,1,23,4,4,80,1,10,3,10,7,1,7
2,Yes,Travel_Rarely,35 - 44,Ex-Employees,R&D,Other,STAFF-4,4,Male,Laboratory Technician,Single,Yes,Y,3,37,0,1373,2,Associates Degree,1,4,92,2,1,3,2090,2396,6,15,3,2,80,0,7,3,0,0,0,0


In [5]:
# see columns info
df_attrition.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 39 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Attrition                   1470 non-null   object
 1   Business Travel             1470 non-null   object
 2   CF_age band                 1470 non-null   object
 3   CF_attrition label          1470 non-null   object
 4   Department                  1470 non-null   object
 5   Education Field             1470 non-null   object
 6   emp no                      1470 non-null   object
 7   Employee Number             1470 non-null   int64 
 8   Gender                      1470 non-null   object
 9   Job Role                    1470 non-null   object
 10  Marital Status              1470 non-null   object
 11  Over Time                   1470 non-null   object
 12  Over18                      1470 non-null   object
 13  Training Times Last Year    1470 non-null   int64 
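Besides `info()`, it can help to count missing values per column and to look at how the target classes are distributed before modelling. The sketch below uses a small made-up frame (the `None` in `Age` is inserted on purpose for illustration); on the real data the same two calls would be applied to `df_attrition`.

```python
import pandas as pd

# Toy frame for illustration only; the missing Age value is made up.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "Age": [41, 49, 37, None, 27],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Share of each target class; a strongly skewed share means accuracy
# alone can be a misleading metric later on.
class_share = toy["Attrition"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```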

In [6]:
# drop unnecessary columns;
# all pay rate columns are also removed, except Monthly Income

df_attrition = df_attrition.drop(
    ['CF_attrition label', 'emp no', 'Employee Number', 'Over18',
     'CF_current Employee', 'Daily Rate', 'Employee Count',
     'Hourly Rate', 'Monthly Rate'],
    axis = 1)
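Some of the dropped columns (for example `Over18` and `Employee Count`) show the same value in every previewed row. A minimal sketch of how such constant columns could be detected programmatically, rather than listed by hand, is shown below. The toy frame reuses a few values from the preview above, and the name `constant_columns` is introduced only for this example.

```python
import pandas as pd

# Toy frame with a few values taken from the preview above.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Over18": ["Y", "Y", "Y"],
    "Employee Count": [1, 1, 1],
    "Monthly Income": [5993, 5130, 2090],
})

# A column with a single distinct value carries no signal for a classifier.
constant_columns = [
    col for col in toy.columns if toy[col].nunique(dropna=False) <= 1
]
toy_reduced = toy.drop(columns=constant_columns)

print(constant_columns)
print(toy_reduced.columns.tolist())
```

Whether a column is constant across all 1470 rows would still need to be checked on the full data; the preview alone does not prove it.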

In [7]:
df_attrition.head(3)

Unnamed: 0,Attrition,Business Travel,CF_age band,Department,Education Field,Gender,Job Role,Marital Status,Over Time,Training Times Last Year,Age,Distance From Home,Education,Environment Satisfaction,Job Involvement,Job Level,Job Satisfaction,Monthly Income,Num Companies Worked,Percent Salary Hike,Performance Rating,Relationship Satisfaction,Standard Hours,Stock Option Level,Total Working Years,Work Life Balance,Years At Company,Years In Current Role,Years Since Last Promotion,Years With Curr Manager
0,Yes,Travel_Rarely,35 - 44,Sales,Life Sciences,Female,Sales Executive,Single,Yes,0,41,1,Associates Degree,2,3,2,4,5993,8,11,3,1,80,0,8,1,6,4,0,5
1,No,Travel_Frequently,45 - 54,R&D,Life Sciences,Male,Research Scientist,Married,No,3,49,8,High School,3,2,2,2,5130,1,23,4,4,80,1,10,3,10,7,1,7
2,Yes,Travel_Rarely,35 - 44,R&D,Other,Male,Laboratory Technician,Single,Yes,3,37,2,Associates Degree,4,2,1,3,2090,6,15,3,2,80,0,7,3,0,0,0,0


In [8]:
# convert the Yes/No columns to 1 and 0
df_attrition['Employee_Attrition'] = df_attrition['Attrition'].apply(lambda x: 1 if x=="Yes" else 0)
df_attrition['Over_Time'] = df_attrition['Over Time'].apply(lambda x: 1 if x=="Yes" else 0)

In [9]:
# preview conversions
df_attrition[['Attrition','Employee_Attrition','Over Time','Over_Time']].head()

Unnamed: 0,Attrition,Employee_Attrition,Over Time,Over_Time
0,Yes,1,Yes,1
1,No,0,No,0
2,Yes,1,Yes,1
3,No,0,Yes,1
4,No,0,No,0


In [10]:
# drop the original Attrition and Over Time columns
df_attrition = df_attrition.drop(['Attrition', 'Over Time'], axis = 1)
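The Yes/No conversion above could also be written with `Series.map`, as in the sketch below. One possible advantage, under the assumption that unexpected values might appear in the data, is that anything other than "Yes" or "No" becomes `NaN` instead of being silently turned into 0. The toy frame and the `yes_no` mapping are illustrative only.

```python
import pandas as pd

# Toy frame for illustration; the real data lives in df_attrition.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes"],
    "Over Time": ["Yes", "No", "Yes"],
})

yes_no = {"Yes": 1, "No": 0}
toy["Employee_Attrition"] = toy["Attrition"].map(yes_no)
toy["Over_Time"] = toy["Over Time"].map(yes_no)

# Values outside the mapping would show up here as NaN.
unexpected = int(toy[["Employee_Attrition", "Over_Time"]].isna().sum().sum())

toy = toy.drop(columns=["Attrition", "Over Time"])
print(toy)
print("unmapped values:", unexpected)
```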

We use pandas `get_dummies` to one-hot encode the remaining string columns.

In [11]:
# display string columns
df_attrition.select_dtypes(include = 'object').head(3)

Unnamed: 0,Business Travel,CF_age band,Department,Education Field,Gender,Job Role,Marital Status,Education
0,Travel_Rarely,35 - 44,Sales,Life Sciences,Female,Sales Executive,Single,Associates Degree
1,Travel_Frequently,45 - 54,R&D,Life Sciences,Male,Research Scientist,Married,High School
2,Travel_Rarely,35 - 44,R&D,Other,Male,Laboratory Technician,Single,Associates Degree


In [12]:
# get string columns and convert to numbers
string_columns = df_attrition.select_dtypes(include = 'object').columns
string_columns

Index(['Business Travel', 'CF_age band', 'Department', 'Education Field',
       'Gender', 'Job Role', 'Marital Status', 'Education'],
      dtype='object')

In [13]:
# convert string columns to numbers using pandas dummies
dummies = pd.get_dummies(df_attrition[string_columns])
dummies.head()

Unnamed: 0,Business Travel_Non-Travel,Business Travel_Travel_Frequently,Business Travel_Travel_Rarely,CF_age band_25 - 34,CF_age band_35 - 44,CF_age band_45 - 54,CF_age band_Over 55,CF_age band_Under 25,Department_HR,Department_R&D,Department_Sales,Education Field_Human Resources,Education Field_Life Sciences,Education Field_Marketing,Education Field_Medical,Education Field_Other,Education Field_Technical Degree,Gender_Female,Gender_Male,Job Role_Healthcare Representative,Job Role_Human Resources,Job Role_Laboratory Technician,Job Role_Manager,Job Role_Manufacturing Director,Job Role_Research Director,Job Role_Research Scientist,Job Role_Sales Executive,Job Role_Sales Representative,Marital Status_Divorced,Marital Status_Married,Marital Status_Single,Education_Associates Degree,Education_Bachelor's Degree,Education_Doctoral Degree,Education_High School,Education_Master's Degree
0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0
1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0
2,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0
3,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1
4,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0


In [14]:
# drop string columns from df attrition
df_attrition = df_attrition.drop(string_columns, axis=1)

In [15]:
# merge the two transformed dataframes - df_attrition and dummies
transformed_df = df_attrition.join(dummies)
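The encode, drop and join steps above can also be collapsed into a single `pd.get_dummies` call by passing the `columns` argument. The sketch below shows this on a toy frame; `dtype=int` is passed because recent pandas versions return boolean dummy columns by default, whereas the output above shows 0/1 integers.

```python
import pandas as pd

# Toy frame for illustration, using a few values from the preview above.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Department": ["Sales", "R&D", "R&D"],
    "Gender": ["Female", "Male", "Male"],
})

# Encode the listed columns in place: the originals are removed and the
# dummy columns are appended after the untouched numeric columns.
encoded = pd.get_dummies(toy, columns=["Department", "Gender"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```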

In [16]:
transformed_df.head(3)

Unnamed: 0,Training Times Last Year,Age,Distance From Home,Environment Satisfaction,Job Involvement,Job Level,Job Satisfaction,Monthly Income,Num Companies Worked,Percent Salary Hike,Performance Rating,Relationship Satisfaction,Standard Hours,Stock Option Level,Total Working Years,Work Life Balance,Years At Company,Years In Current Role,Years Since Last Promotion,Years With Curr Manager,...,Education Field_Technical Degree,Gender_Female,Gender_Male,Job Role_Healthcare Representative,Job Role_Human Resources,Job Role_Laboratory Technician,Job Role_Manager,Job Role_Manufacturing Director,Job Role_Research Director,Job Role_Research Scientist,Job Role_Sales Executive,Job Role_Sales Representative,Marital Status_Divorced,Marital Status_Married,Marital Status_Single,Education_Associates Degree,Education_Bachelor's Degree,Education_Doctoral Degree,Education_High School,Education_Master's Degree
0,0,41,1,2,3,2,4,5993,8,11,3,1,80,0,8,1,6,4,0,5,...,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0
1,3,49,8,3,2,2,2,5130,1,23,4,4,80,1,10,3,10,7,1,7,...,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0
2,3,37,2,4,2,1,3,2090,6,15,3,2,80,0,7,3,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0


In [17]:
# rank features by absolute correlation with the target (positive or negative)
# and keep the top 10; the slice is [:11] because the target itself ranks first
transformed_df.corr()['Employee_Attrition'].abs().sort_values(ascending=False)[:11]

Employee_Attrition               1.000000
Over_Time                        0.246118
Marital Status_Single            0.175419
Total Working Years              0.171063
Job Level                        0.169105
CF_age band_Under 25             0.166623
Years In Current Role            0.160545
Monthly Income                   0.159840
Age                              0.159205
Job Role_Sales Representative    0.157234
Years With Curr Manager          0.156199
Name: Employee_Attrition, dtype: float64

In [18]:
top_corr_features = transformed_df.corr()['Employee_Attrition'].abs().sort_values(ascending=False)[:11].index
top_corr_features

Index(['Employee_Attrition', 'Over_Time', 'Marital Status_Single',
       'Total Working Years', 'Job Level', 'CF_age band_Under 25',
       'Years In Current Role', 'Monthly Income', 'Age',
       'Job Role_Sales Representative', 'Years With Curr Manager'],
      dtype='object')

In [19]:
# top features
top_corr_df = transformed_df[top_corr_features]
top_corr_df.head(3)

Unnamed: 0,Employee_Attrition,Over_Time,Marital Status_Single,Total Working Years,Job Level,CF_age band_Under 25,Years In Current Role,Monthly Income,Age,Job Role_Sales Representative,Years With Curr Manager
0,1,1,1,8,2,0,4,5993,41,0,5
1,0,0,0,10,2,0,7,5130,49,0,7
2,1,1,1,7,1,0,0,2090,37,0,0
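The correlation-based selection above can be wrapped in a small helper that excludes the target explicitly, so no `[:11]` offset is needed. This is a sketch on synthetic data; `top_correlated_features` is a hypothetical helper name, not part of pandas. In a stricter setup the ranking would be computed on the training split only, so that the test rows do not influence which features are chosen.

```python
import numpy as np
import pandas as pd


def top_correlated_features(frame, target, k):
    """Return the k columns with the largest absolute Pearson correlation to target."""
    correlations = frame.corr()[target].drop(target).abs()
    return correlations.sort_values(ascending=False).head(k).index.tolist()


# Synthetic data: "strong" tracks the label closely, "weak" loosely,
# and "noise" is unrelated to it.
rng = np.random.default_rng(0)
n = 500
label = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "strong": label + rng.normal(0, 0.5, size=n),
    "weak": label + rng.normal(0, 1.5, size=n),
    "noise": rng.normal(0, 1.0, size=n),
    "label": label,
})

selected = top_correlated_features(toy, "label", k=2)
print(selected)
```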


### 2. Choose the Model, Train the Model and Run Predictions

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# fix the random seed first, so that both the split and the forest are reproducible
np.random.seed(42)

# Split the data into train and test data
X = top_corr_df.drop(['Employee_Attrition'], axis = 1)
y = top_corr_df['Employee_Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2)

# apply feature scaling (fit on the training data only, then reuse on the test data)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# instantiate classifier, and train model
clf = RandomForestClassifier()
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)
y_pred[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=int64)
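The split, scale and fit steps can also be expressed as a scikit-learn pipeline, which keeps the scaler and the classifier together. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration, not the HR data) and passes `random_state` and `stratify` explicitly so the split is reproducible and keeps the class ratio in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the attrition data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.84, 0.16], random_state=42
)

# stratify keeps the class ratio the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training data only and
# applies the same transformation automatically at predict time.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_tr, y_tr)
demo_preds = model.predict(X_te)
print(demo_preds[:10])
```

Tree-based models such as random forests are generally insensitive to feature scaling, so the scaler is kept here mainly to mirror the cell above.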

### 3. Evaluate the Model

In [21]:
# accuracy on the training data; a perfect score here alongside a lower test score suggests overfitting
clf.score(X_train_scaled,y_train)

1.0

**How well did the model predict?**

In [23]:
# Accuracy Score
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.8367346938775511
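When one class dominates, accuracy can look good even for a model that always predicts the majority class. A hedged way to put the score above in context is to compare it with a `DummyClassifier` baseline and to look at recall and the confusion matrix. The sketch below does this on synthetic imbalanced data, so its numbers say nothing about the real HR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the attrition data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

baseline_preds = baseline.predict(X_te)
forest_preds = forest.predict(X_te)

baseline_acc = accuracy_score(y_te, baseline_preds)
forest_acc = accuracy_score(y_te, forest_preds)

# Recall on the positive class: the share of actual leavers that are found.
baseline_recall = recall_score(y_te, baseline_preds)
forest_recall = recall_score(y_te, forest_preds)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_te, forest_preds)

print("baseline accuracy:", baseline_acc, "recall:", baseline_recall)
print("forest accuracy:", forest_acc, "recall:", forest_recall)
print(matrix)
```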

In [31]:
# Cross validation score
from sklearn.model_selection import cross_val_score

# per-fold accuracy (5 folds by default), then the mean across folds
cross_val_scores = cross_val_score(clf, X, y)
clf_cross_val_score = np.mean(cross_val_scores)
clf_cross_val_score

0.8476190476190476
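For the "improve the model" part of the workflow, one common option is a small hyperparameter search with cross-validation. The sketch below uses `GridSearchCV` on synthetic data; the grid values, the `balanced_accuracy` scoring choice and the fold count are illustrative assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the attrition data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.84, 0.16], random_state=42
)

# Small illustrative grid; shallower trees and larger leaves limit overfitting.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=cv,
    scoring="balanced_accuracy",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```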

### 4. Save and Load the Model

In [25]:
import pickle

# save the model; the context manager closes the file even if pickling fails
with open("model/employee-attrition-predict.pk1", "wb") as model_file:
    pickle.dump(clf, model_file)

In [26]:
# load model
with open("model/employee-attrition-predict.pk1", "rb") as model_file:
    attrition_model = pickle.load(model_file)

In [27]:
# run a prediction
attrition_y_preds = attrition_model.predict(X_test_scaled)
attrition_y_preds[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=int64)

In [28]:
attrition_model.score(X_test_scaled, y_test)

0.8367346938775511
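The saved classifier expects inputs that were transformed by the fitted `StandardScaler`, which is not part of the pickle file above. One possible way to avoid losing the scaler is to pickle a pipeline that contains both steps. The sketch below demonstrates the round trip on synthetic data; the file name and temporary directory are hypothetical and chosen only for this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the attrition features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# Scaler and classifier travel together in one object.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipeline.fit(X_demo, y_demo)

# Hypothetical path inside a temporary directory, used only for this demo.
model_path = os.path.join(tempfile.mkdtemp(), "attrition-pipeline-demo.pkl")

with open(model_path, "wb") as model_file:
    pickle.dump(pipeline, model_file)

with open(model_path, "rb") as model_file:
    restored = pickle.load(model_file)

# The restored pipeline accepts raw (unscaled) features directly.
same_predictions = np.array_equal(pipeline.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.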