# IBM Employee Attrition Prediction 

The goal of this notebook is to predict employee attrition and extract meaningful insights into IBM's employee status.

## Table of contents

1. Importing libraries 
2. Data reading 
3. Data cleaning 
4. Data transformation
5. Data exploration (A minor analysis is provided for each element's findings)
    - Jobs 
    - Travel
    - Money 
    - Experience/Education 
    - Emotion 
    - Time 
    - General info 
6. Splitting dataset for prediction
7. Feature scaling data
8. Attrition Prediction 
     - Using mean of the 5 important elements for employees who left 
     - Using classification algorithm and K-fold validation 
     - Model evaluation 
 

## **Importing Libraries**

In [1]:
import pandas as pd 
import numpy as np 
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Loading dataset

In [2]:
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
pd.set_option('display.max_columns', None) # Show max columns of dataset 

## **Data reading**

In [None]:
# Columns and their content
df.nunique()
#Total column and rows
df.shape
# Data type 
df.info()
# Data description 
df.describe()
# Number of null values 
df.isnull().sum()
#Number of duplicate values
df.duplicated().sum()
# Max values of the data
df.max()

## **Data Cleaning**

Since 'Over18', 'EmployeeCount' and 'StandardHours' have the same value for every employee, they carry no information for this analysis, so I drop them.

In [3]:
# Removing columns that carry no information
df.drop(['Over18','EmployeeCount','StandardHours'], axis = 1, inplace= True)
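
The three columns above were picked out by inspecting `df.nunique()`. As a minimal sketch, assuming a toy frame with made-up values in place of the real dataset, constant columns could also be detected programmatically:

```python
import pandas as pd

# Toy frame standing in for the HR data; the values are made up for illustration.
toy = pd.DataFrame({
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "Age": [41, 49, 37],
})

# A column with a single unique value cannot distinguish one employee from another.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Drop the constant columns without modifying the original frame.
cleaned = toy.drop(columns=constant_cols)
print(constant_cols)
print(list(cleaned.columns))
```

This avoids hard-coding column names, at the cost of also dropping any column that happens to be constant in a given extract.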

## **Data Transformation**

We separate the data into employees who left and employees who stayed, and split the columns into 7 categories for easy reference:
**'Job info', 'Travel', 'Money', 'Education/Experience', 'Emotion', 'Time', 'Additional info'**

In [4]:
# Changing the Attrition value to numerical 
df['Attrition']= df['Attrition'].replace({"Yes": 1, "No": 0,}) 
# Splitting data for comparison (employees who left and stayed)
left = df[df['Attrition'] == 1] # Employees who left 
stayed = df[df['Attrition'] == 0] # Employees who stayed

# Job info
jobsinfo_left = left[['EmployeeNumber','JobLevel', 'JobInvolvement', 'Department','JobRole','JobSatisfaction','PerformanceRating']]
jobsinfo_stayed = stayed[['EmployeeNumber','JobLevel', 'JobInvolvement', 'Department','JobRole','JobSatisfaction','PerformanceRating']]

# Travel
travel_left = left[['EmployeeNumber','BusinessTravel','DistanceFromHome']]
travel_stayed = stayed[['EmployeeNumber','BusinessTravel','DistanceFromHome']]

# Money
money_left = left[['EmployeeNumber','DailyRate','HourlyRate','MonthlyRate','MonthlyIncome','PercentSalaryHike']]
money_stayed = stayed[['EmployeeNumber','DailyRate','HourlyRate','MonthlyRate','MonthlyIncome','PercentSalaryHike']]

# Experience / Education
eduexp_left = left[['EmployeeNumber','Education','EducationField','NumCompaniesWorked','TotalWorkingYears', 'TrainingTimesLastYear']]
eduexp_stayed = stayed[['EmployeeNumber','Education','EducationField','NumCompaniesWorked','TotalWorkingYears', 'TrainingTimesLastYear']]

# Emotions
emotions_left = left[['EmployeeNumber','EnvironmentSatisfaction','OverTime','RelationshipSatisfaction']]
emotions_stayed = stayed[['EmployeeNumber','EnvironmentSatisfaction','OverTime','RelationshipSatisfaction']]

# Time
time_left =left[['EmployeeNumber','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']]
time_stayed =stayed[['EmployeeNumber','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']]

# Additional Info
additionalinfo_left = left[['EmployeeNumber','Age','Gender','MaritalStatus','WorkLifeBalance']]
additionalinfo_stayed= stayed[['EmployeeNumber','Age','Gender','MaritalStatus','WorkLifeBalance']]
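
The assignments above write each column list twice, once for `left` and once for `stayed`. One possible alternative, sketched here on a made-up toy frame with a hypothetical helper `split_by_category`, keeps each category's columns in a single dictionary:

```python
import pandas as pd

# Toy data with a few of the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3, 4],
    "Attrition": [1, 0, 0, 1],
    "BusinessTravel": ["Travel_Rarely", "Non-Travel", "Travel_Rarely", "Travel_Frequently"],
    "DistanceFromHome": [2, 8, 3, 24],
    "MonthlyIncome": [2090, 5130, 2909, 3468],
    "PercentSalaryHike": [15, 23, 11, 12],
})

# One place to define which columns belong to which category.
categories = {
    "travel": ["BusinessTravel", "DistanceFromHome"],
    "money": ["MonthlyIncome", "PercentSalaryHike"],
}

def split_by_category(frame, categories, key="EmployeeNumber"):
    """Return {category name: sub-frame with the key column plus that category's columns}."""
    return {name: frame[[key] + cols] for name, cols in categories.items()}

left_views = split_by_category(toy[toy["Attrition"] == 1], categories)
stayed_views = split_by_category(toy[toy["Attrition"] == 0], categories)
print(left_views["travel"])
```

Adding a column to a category then means editing one list instead of two lines.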

## **Data Exploration**

We first explore the characteristics of the employees and compare them between those who left and those who stayed.

People who left and stayed

In [36]:
# Number of people who left and stayed
pd.DataFrame(([[left['Attrition'].count(),stayed['Attrition'].count()]]),index= ['Number of people'], columns= ['People who left','People who stayed'])

Unnamed: 0,People who left,People who stayed
Number of people,237,1233


**Benchmark settings** (average score for each element)

In [5]:
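# numeric_only=True skips the text columns (Department, JobRole), which recent pandas versions refuse to average
jobinfo_benchmark = round(pd.DataFrame(df[['JobInvolvement', 'Department','JobRole','JobSatisfaction','PerformanceRating']].mean(numeric_only=True),columns=['Benchmark Score']),ndigits=2)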
travelinfo_benchmark = round(pd.DataFrame(df[['DistanceFromHome']].mean(),columns=['Benchmark Score']),ndigits=2)
moneyinfo_benchmark = round(pd.DataFrame(df[['DailyRate','HourlyRate','MonthlyRate','MonthlyIncome','PercentSalaryHike']].mean(),columns=['Benchmark Score']),ndigits=2)
eduexpinfo_benchmark = round(pd.DataFrame(df[['Education','NumCompaniesWorked','TotalWorkingYears', 'TrainingTimesLastYear']].mean(),columns=['Benchmark Score']),ndigits=2)
emotion_benchmark = round(pd.DataFrame(df[['EnvironmentSatisfaction','OverTime','RelationshipSatisfaction']].mean(numeric_only=True),columns=['Benchmark Score']),ndigits=2)
time_benchmark = round(pd.DataFrame(df[['YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']].mean(),columns=['Benchmark Score']),ndigits=2)

**Jobs Information**



Exploration of JobInvolvement, JobSatisfaction and PerformanceRating	

In [139]:
# Comparing average score to benchmark
ass = round(pd.DataFrame(jobsinfo_stayed.mean(numeric_only=True),columns=['Average Score Stayed']),ndigits = 2).iloc[2:,:]
asl = round(pd.DataFrame(jobsinfo_left.mean(numeric_only=True),columns=['Average Score Left']),ndigits = 2).iloc[2:,:]
pd.concat([ass,asl,jobinfo_benchmark], axis=1).iloc[:,:]

Unnamed: 0,Average Score Stayed,Average Score Left,Benchmark Score
JobInvolvement,2.77,2.52,2.73
JobSatisfaction,2.78,2.47,2.73
PerformanceRating,3.15,3.16,3.15


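**Analysis**: On average, people who left scored lower on job involvement (2.52 vs 2.77) and job satisfaction (2.47 vs 2.78) than those who stayed, while performance ratings are nearly identical (3.16 vs 3.15). Lower involvement may go hand in hand with lower satisfaction, although these averages alone do not establish cause.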

Number of people

In [17]:
# Counting the number of people at or above the benchmark
# (benchmarks from the table above: 2.73, 2.73 and 3.15)
jil = (jobsinfo_left['JobInvolvement']>= 2.73).value_counts()
jis =(jobsinfo_stayed['JobInvolvement']>= 2.73).value_counts()
jsl =(jobsinfo_left['JobSatisfaction']>= 2.73).value_counts()
jss =(jobsinfo_stayed['JobSatisfaction']>= 2.73).value_counts()
prl =(jobsinfo_left['PerformanceRating']>= 3.15).value_counts()
prs =(jobsinfo_stayed['PerformanceRating']>= 3.15).value_counts()

In [8]:
# Manual list creation for dataframe 
job_people_above_benchmark = [138,874,125,776,37,189]
job_people_below_benchmark = [99,359,112,457,200,1044]
pd.DataFrame([job_people_above_benchmark,job_people_below_benchmark],index=['Number of people above benchmark','Number of people below benchmark']
             ,columns=['Job Involvement~left','Job Involvement~stayed','Job Satisfaction~left'
                       ,'Job Satisfaction~stayed','Performance Rating~left','Performance Rating~stayed'])

Unnamed: 0,Job Involvement~left,Job Involvement~stayed,Job Satisfaction~left,Job Satisfaction~stayed,Performance Rating~left,Performance Rating~stayed
Number of people above benchmark,138,874,125,776,37,189
Number of people below benchmark,99,359,112,457,200,1044
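
The counts in this table were typed in by hand from the `value_counts()` results. A hedged sketch of computing them directly, on made-up toy data and with a hypothetical helper `benchmark_counts`, might look like this:

```python
import pandas as pd

# Toy data; the scores are made up and only mimic the 1-4 rating scale.
toy = pd.DataFrame({
    "Attrition":       [1, 1, 1, 0, 0, 0, 0],
    "JobInvolvement":  [2, 3, 1, 3, 4, 2, 3],
    "JobSatisfaction": [1, 2, 4, 3, 3, 4, 1],
})

def benchmark_counts(frame, columns, target="Attrition"):
    """Count, per group, how many rows are at/above or below each column's overall mean."""
    benchmark = frame[columns].mean()                       # one benchmark per column
    at_or_above = frame[columns].ge(benchmark, axis="columns")
    group = frame[target].map({1: "left", 0: "stayed"})     # readable group labels
    return pd.concat({
        "at or above benchmark": at_or_above.groupby(group).sum(),
        "below benchmark": (~at_or_above).groupby(group).sum(),
    })

counts = benchmark_counts(toy, ["JobInvolvement", "JobSatisfaction"])
print(counts)
```

Because the benchmark is recomputed from the data, the thresholds and the counts cannot drift apart the way hand-copied numbers can.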


Exploration of Department

In [13]:
# Number of people in each department, for those who left and those who stayed
departmentl =left['Department'].value_counts()
departments = stayed['Department'].value_counts()
pd.DataFrame([departmentl,departments],index=['People who left','People who stayed'])

Unnamed: 0,Research & Development,Sales,Human Resources
People who left,133,92,12
People who stayed,828,354,51


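**Analysis:** Department has 3 unique values: Research & Development, Sales and Human Resources. Relative to headcount, Sales (92 of 446, about 21%) and Human Resources (12 of 63, about 19%) lost a larger share of their people than Research & Development (133 of 961, about 14%). Managers of each department might want to take a closer look at why their colleagues are leaving.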

Exploration of Job Role

In [14]:
# Job roles of people who left and stayed
jobrolel =left['JobRole'].value_counts()
jobroles = stayed['JobRole'].value_counts()
pd.DataFrame([jobrolel,jobroles],index=['People who left','People who stayed'])

Unnamed: 0,Laboratory Technician,Sales Executive,Research Scientist,Sales Representative,Human Resources,Manufacturing Director,Healthcare Representative,Manager,Research Director
People who left,62,57,47,33,12,10,9,5,2
People who stayed,197,269,245,50,40,135,122,97,78


**Analysis:** By count, most of the people who left were Laboratory Technicians (62), Sales Executives (57) and Research Scientists (47). Sales Representatives lost the largest share of their role (33 of 83).

# ===============================================================

**Travel info**

Distance From Home Exploration

In [12]:
tras = round(pd.DataFrame(travel_stayed.mean(numeric_only=True),columns=['Average Score Stayed']),ndigits = 2).iloc[1:,:]
tral = round(pd.DataFrame(travel_left.mean(numeric_only=True),columns=['Average Score Left']),ndigits = 2).iloc[1:,:]
pd.concat([tras,tral,travelinfo_benchmark], axis=1).iloc[:,:]

Unnamed: 0,Average Score Stayed,Average Score Left,Benchmark Score
DistanceFromHome,8.92,10.63,9.19


**Analysis:** On average, an employee travels 9.19 km to reach work. Employees who left lived farther away on average (10.63) than those who stayed (8.92).

Business Travel Exploration

In [86]:
btl =travel_left['BusinessTravel'].value_counts()
bts = travel_stayed['BusinessTravel'].value_counts()

In [92]:
pd.DataFrame([btl,bts],index=['People who left','People who stayed'])

Unnamed: 0,Travel_Rarely,Travel_Frequently,Non-Travel
People who left,156,69,12
People who stayed,887,208,138


# ================================================================

**Money Info**

Average Pay Rate

In [93]:
# Comparing average score to benchmark
ms = round(pd.DataFrame(money_stayed.mean(),columns=['Average Rate Stayed']),ndigits = 2)
ml = round(pd.DataFrame(money_left.mean(),columns=['Average Rate Left']),ndigits = 2)
pd.concat([ms,ml,moneyinfo_benchmark], axis=1).dropna()

Unnamed: 0,Average Rate Stayed,Average Rate Left,Benchmark Score
DailyRate,812.5,750.36,802.49
HourlyRate,65.95,65.57,65.89
MonthlyIncome,6832.74,4787.09,6502.93
MonthlyRate,14265.78,14559.31,14313.1
PercentSalaryHike,15.23,15.1,15.21


**Analysis:** For this analysis, we consider the monthly income of our employees. People who left earned well below the benchmark on average (4,787.09 vs 6,502.93), whereas the daily, hourly and monthly rates barely differ between the two groups.

People who got paid above the average 

In [6]:
drl = (money_left['DailyRate']>= 802.49).value_counts()
drs =(money_stayed['DailyRate']>=802.49).value_counts()
hrl =(money_left['HourlyRate']>= 65.89).value_counts()
hrs =(money_stayed['HourlyRate']>= 65.89).value_counts()
mil =(money_left['MonthlyIncome']>= 6502.93).value_counts()
mis =(money_stayed['MonthlyIncome']>= 6502.93).value_counts()
mrl =(money_left['MonthlyRate']>= 14313.10).value_counts()
mrs =(money_stayed['MonthlyRate']>= 14313.10).value_counts()
pshl =(money_left['PercentSalaryHike']>= 15.21).value_counts()
pshs =(money_stayed['PercentSalaryHike']>= 15.21).value_counts()

In [24]:
# Manual list creation for dataframe 
money_people_above_benchmark = [104,630,119,627,52,441,122,608,87,464]
money_people_below_benchmark = [133,603,118,606,185,792,115,625,150,769]
money_benchmark = [802.49,802.49,65.89,65.89,6502.93,6502.93,14313.10,14313.10,15.21,15.21]
pd.DataFrame([money_people_above_benchmark,money_people_below_benchmark,money_benchmark],
             index=['Number of people above benchmark','Number of people below benchmark','Benchmark']
             ,columns=['DailyRate~left','DailyRate~stayed','HourlyRate~left'
                       ,'HourlyRate~stayed','MonthlyIncome~left','MonthlyIncome~stayed',
                      'MonthlyRate~left','MonthlyRate~stayed','PercentSalaryHike~left','PercentSalaryHike~stayed'])

Unnamed: 0,DailyRate~left,DailyRate~stayed,HourlyRate~left,HourlyRate~stayed,MonthlyIncome~left,MonthlyIncome~stayed,MonthlyRate~left,MonthlyRate~stayed,PercentSalaryHike~left,PercentSalaryHike~stayed
Number of people above benchmark,104.0,630.0,119.0,627.0,52.0,441.0,122.0,608.0,87.0,464.0
Number of people below benchmark,133.0,603.0,118.0,606.0,185.0,792.0,115.0,625.0,150.0,769.0
Benchmark,802.49,802.49,65.89,65.89,6502.93,6502.93,14313.1,14313.1,15.21,15.21


Job Role Average pay

In [30]:
# Calculating the mean monthly income of each job role
combine_jrmi = df[['JobRole', 'MonthlyIncome']]

# Order in which the job roles are displayed
role_order = ['Sales Executive','Research Scientist','Laboratory Technician',
              'Manufacturing Director','Healthcare Representative',
              'Manager','Sales Representative',
              'Research Director','Human Resources']

# Creating the dataset for average income based on job roles
avg_income = (combine_jrmi.groupby('JobRole')['MonthlyIncome'].mean()
              .reindex(role_order)
              .round(decimals=2)
              .to_frame('Average Monthly Income')
              .rename_axis(None))
avg_income

Unnamed: 0,Average Monthly Income
Sales Executive,6924.28
Research Scientist,3239.97
Laboratory Technician,3237.17
Manufacturing Director,7295.14
Healthcare Representative,7528.76
Manager,17181.68
Sales Representative,2626.0
Research Director,16033.55
Human Resources,4235.75


Stats of the person who got paid the most

In [25]:
# Row(s) of the employee(s) with the highest monthly income
df[df['MonthlyIncome'] == df['MonthlyIncome'].max()]

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
190,52,0,Travel_Rarely,699,Research & Development,1,4,Life Sciences,259,3,Male,65,2,5,Manager,3,Married,19999,5678,0,No,14,3,1,1,34,5,3,33,18,11,9


# ================================================================

**Experience and Education Info**

In [31]:
# Comparing average score to benchmark
eds = round(pd.DataFrame(eduexp_stayed.mean(numeric_only=True),columns=['Average Score Stayed']),ndigits = 2).iloc[1:,:]
edl = round(pd.DataFrame(eduexp_left.mean(numeric_only=True),columns=['Average Score Left']),ndigits = 2).iloc[1:,:]
pd.concat([eds,edl,eduexpinfo_benchmark], axis=1)

Unnamed: 0,Average Score Stayed,Average Score Left,Benchmark Score
Education,2.93,2.84,2.91
NumCompaniesWorked,2.65,2.94,2.69
TotalWorkingYears,11.86,8.24,11.28
TrainingTimesLastYear,2.83,2.62,2.8


**Analysis:** 
It seems that people with more work experience tend to stay: those who stayed average 11.86 total working years, against 8.24 for those who left.


Number of people above the average score

In [41]:
# Counting the number of people at or above the benchmark
el = (eduexp_left['Education']>= 2.91).value_counts()
es =(eduexp_stayed['Education']>= 2.91).value_counts()
ncwl =(eduexp_left['NumCompaniesWorked']>= 2.69).value_counts()
ncws =(eduexp_stayed['NumCompaniesWorked']>= 2.69).value_counts()
twyl =(eduexp_left['TotalWorkingYears']>= 11.28).value_counts()
twys =(eduexp_stayed['TotalWorkingYears']>= 11.28).value_counts()
ttlyl =(eduexp_left['TrainingTimesLastYear']>= 2.8).value_counts()
ttlys =(eduexp_stayed['TrainingTimesLastYear']>= 2.8).value_counts()

In [32]:
# Manual list creation for dataframe 
edu_people_above_benchmark = [162,856,100,506,48,463,115,683]
edu_people_below_benchmark = [75,377,137,727,189,770,122,550]
pd.DataFrame([edu_people_above_benchmark,edu_people_below_benchmark],index=['Number of people above benchmark','Number of people below benchmark']
             ,columns=['Education for people who left','Education for people who stayed',
                                                           'Companies Worked~left','Companies Worked~stayed','Work Experience~left','Work Experience~stayed',
                                                           'Training times for last year~left','Training times for last year~stayed'])

Unnamed: 0,Education for people who left,Education for people who stayed,Companies Worked~left,Companies Worked~stayed,Work Experience~left,Work Experience~stayed,Training times for last year~left,Training times for last year~stayed
Number of people above benchmark,162,856,100,506,48,463,115,683
Number of people below benchmark,75,377,137,727,189,770,122,550


People who have the most years of working experience

In [35]:
# Row(s) of the employee(s) with the most total working years
df[df['TotalWorkingYears'] == df['TotalWorkingYears'].max()]

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
126,58,1,Travel_Rarely,147,Research & Development,23,4,Medical,165,4,Female,94,3,3,Healthcare Representative,4,Married,10312,3465,1,No,12,3,4,1,40,3,2,40,10,15,6
595,58,1,Travel_Rarely,286,Research & Development,2,4,Life Sciences,825,4,Male,31,3,5,Research Director,2,Single,19246,25761,7,Yes,12,3,4,0,40,2,3,31,15,13,8


# ================================================================

**Emotion Info**

In [37]:
# Comparing average score to benchmark
emos = round(pd.DataFrame(emotions_stayed.mean(numeric_only=True),columns=['Average Score Stayed']),ndigits = 2).iloc[1:,:]
emol = round(pd.DataFrame(emotions_left.mean(numeric_only=True),columns=['Average Score Left']),ndigits = 2).iloc[1:,:]
pd.concat([emos,emol,emotion_benchmark], axis=1)

Unnamed: 0,Average Score Stayed,Average Score Left,Benchmark Score
EnvironmentSatisfaction,2.77,2.46,2.72
RelationshipSatisfaction,2.73,2.6,2.71


**Analysis:** 
People who left were less satisfied with their work environment (2.46 vs 2.77) and with their relationships with colleagues (2.60 vs 2.73). However, the differences are small.


In [None]:
# Counting the number of people at or above the benchmark
# (benchmarks from the table above: 2.72 and 2.71)
esl = (emotions_left['EnvironmentSatisfaction']>= 2.72).value_counts()
ess =(emotions_stayed['EnvironmentSatisfaction']>= 2.72).value_counts()
resl =(emotions_left['RelationshipSatisfaction']>= 2.71).value_counts()
ress =(emotions_stayed['RelationshipSatisfaction']>= 2.71).value_counts()

pd.DataFrame([esl,ess,resl,ress])

Number of people above the average score

In [45]:
# Manual list creation for dataframe 
emotion_people_above_benchmark = [122,777,135,756]
emotion_people_below_benchmark = [115,456,102,477]
pd.DataFrame([emotion_people_above_benchmark,emotion_people_below_benchmark],index=['Number of people above benchmark','Number of people below benchmark']
             ,columns=['Environment Satisfaction~left','Environment Satisfaction~stayed','Relationship Satisfaction~left'
                       ,'Relationship Satisfaction~stayed'])

Unnamed: 0,Environment Satisfaction~left,Environment Satisfaction~stayed,Relationship Satisfaction~left,Relationship Satisfaction~stayed
Number of people above benchmark,122,777,135,756
Number of people below benchmark,115,456,102,477


# ===========================================================

**Time Info**

Comparing average score to benchmark

In [59]:
# Comparing average score to benchmark
ts = round(pd.DataFrame(time_stayed.mean(),columns=['Average Years Stayed']),ndigits = 2).iloc[1:,:]
tl = round(pd.DataFrame(time_left.mean(),columns=['Average Years Left']),ndigits = 2).iloc[1:,:]
pd.concat([ts,tl,time_benchmark], axis=1)

Unnamed: 0,Average Years Stayed,Average Years Left,Benchmark Score
YearsAtCompany,7.37,5.13,7.01
YearsInCurrentRole,4.48,2.9,4.23
YearsSinceLastPromotion,2.23,1.95,2.19
YearsWithCurrManager,4.37,2.85,4.12


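**Analysis:** The data suggests that employees who leave tend to be newer: on average they have spent fewer years at the company (5.13 vs 7.37), in their current role (2.9 vs 4.48) and with their current manager (2.85 vs 4.37).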

In [None]:
# Counting the number of people at or above the benchmark
yacl = (time_left['YearsAtCompany']>= 7.01).value_counts()
yacs =(time_stayed['YearsAtCompany']>= 7.01).value_counts()
yicrl =(time_left['YearsInCurrentRole']>= 4.23).value_counts()
yicrs =(time_stayed['YearsInCurrentRole']>= 4.23).value_counts()
yslpl = (time_left['YearsSinceLastPromotion']>= 2.19).value_counts()
yslps =(time_stayed['YearsSinceLastPromotion']>= 2.19).value_counts()
ywcml =(time_left['YearsWithCurrManager']>= 4.12).value_counts()
ywcms =(time_stayed['YearsWithCurrManager']>= 4.12).value_counts()

pd.DataFrame([yacl,yacs,yicrl,yicrs,yslpl,yslps,ywcml,ywcms])

Number of people above average years

In [61]:
# Manual list creation for dataframe 
time_people_above_benchmark = [55,473,54,504,51,322,61,486]
time_people_below_benchmark = [182,760,183,729,186,911,176,747]
pd.DataFrame([time_people_above_benchmark,time_people_below_benchmark],index=['Number of people above average years','Number of people below average years']
             ,columns=['Years At Company~left','Years At Company~stayed','Years In Current Role~left'
                       ,'Years In Current Role~stayed','Years Since Last Promotion~left','Years Since Last Promotion~stayed',
                       'Years With Current Manager~left'
                       ,'Years With Current Manager~stayed'])

Unnamed: 0,Years At Company~left,Years At Company~stayed,Years In Current Role~left,Years In Current Role~stayed,Years Since Last Promotion~left,Years Since Last Promotion~stayed,Years With Current Manager~left,Years With Current Manager~stayed
Number of people above average years,55,473,54,504,51,322,61,486
Number of people below average years,182,760,183,729,186,911,176,747


Overtime (number and percentage of people who worked overtime)

In [62]:
# Total number of people who worked overtime
df['OverTime'].value_counts()

No     1054
Yes     416
Name: OverTime, dtype: int64

In [63]:
# Percentage of people who worked overtime
yes = 416
no = 1054
total = yes+no
percentage = yes/total *100
round(percentage, ndigits=2)

28.3
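
The same percentage can be obtained without typing the counts in by hand. A small sketch, rebuilding a Series from the counts printed above (in the notebook itself, `df['OverTime']` would be used directly):

```python
import pandas as pd

# Rebuild the OverTime column from the counts shown above: 416 "Yes" and 1054 "No".
overtime = pd.Series(["Yes"] * 416 + ["No"] * 1054, name="OverTime")

# normalize=True returns proportions instead of counts; scale to percent and round.
share = overtime.value_counts(normalize=True).mul(100).round(2)
print(share)
```

`value_counts(normalize=True)` keeps the percentage tied to the data, so it stays correct if rows are filtered or added later.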

Employees who are the most loyal (top 10)

In [75]:
# Employees who stayed, sorted by years at the company (longest first)
stayed.sort_values(['YearsAtCompany'],ascending = False).head(10)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
98,58,0,Travel_Rarely,682,Sales,10,4,Medical,131,4,Male,37,3,4,Sales Executive,3,Single,13872,24409,0,No,13,3,3,0,38,1,2,37,10,1,8
270,55,0,Travel_Rarely,452,Research & Development,1,3,Medical,374,4,Male,81,3,5,Manager,1,Single,19045,18938,0,Yes,14,3,3,0,37,2,3,36,10,4,13
1116,55,0,Travel_Rarely,685,Sales,26,5,Marketing,1578,3,Male,60,2,5,Manager,4,Married,19586,23037,1,No,21,4,3,1,36,3,3,36,6,2,13
561,52,0,Travel_Rarely,621,Sales,3,4,Marketing,776,3,Male,31,2,4,Manager,1,Married,16856,10084,1,No,11,3,1,0,34,3,4,34,6,1,16
962,51,0,Travel_Rarely,770,Human Resources,5,3,Life Sciences,1352,3,Male,84,3,4,Manager,2,Divorced,14026,17588,1,Yes,11,3,2,1,33,2,3,33,9,0,10
914,55,0,Non-Travel,177,Research & Development,8,1,Medical,1278,4,Male,37,2,4,Healthcare Representative,2,Divorced,13577,25592,1,Yes,15,3,4,1,34,3,3,33,9,15,0
190,52,0,Travel_Rarely,699,Research & Development,1,4,Life Sciences,259,3,Male,65,2,5,Manager,3,Married,19999,5678,0,No,14,3,1,1,34,5,3,33,18,11,9
237,52,0,Non-Travel,771,Sales,2,4,Life Sciences,329,1,Male,79,2,5,Manager,3,Single,19068,21030,1,Yes,18,3,4,0,33,2,4,33,7,15,12
477,50,0,Travel_Frequently,1246,Human Resources,3,3,Medical,644,1,Male,99,3,5,Manager,2,Married,18200,7999,1,No,11,3,3,1,32,2,3,32,5,10,7
1086,50,0,Travel_Frequently,333,Research & Development,22,5,Medical,1539,3,Male,88,1,4,Research Director,4,Single,14411,24450,1,Yes,13,3,4,0,32,2,3,32,6,13,9


**Analysis:** These employees are considered the most loyal because they have stayed with the company the longest.

**General Info**

Status of people that left  

In [126]:
additionalinfo_left['Gender'].value_counts()

Male      150
Female     87
Name: Gender, dtype: int64

In [137]:
additionalinfo_left['Age'].min()

18

In [13]:
additionalinfo_left['Age'].max()

58

In [138]:
additionalinfo_left['MaritalStatus'].value_counts()

Single      120
Married      84
Divorced     33
Name: MaritalStatus, dtype: int64

**Summary / Report**



## Employee Attrition Prediction

Now that we have explored the attributes of our employees, we will look at why they leave the company. According to a few sources online, the main reasons employees leave are:

1. Lack of engagement with company values and personal growth
2. Dissatisfaction with pay
3. Conflict with Co-workers 
4. Work life balance 

For that reason we are going to compare (**RelationshipSatisfaction, JobInvolvement, JobSatisfaction, MonthlyIncome, WorkLifeBalance**) between the employees who left and the ones who stayed. We first compare them manually against their benchmarks, and then use classification models to relate these variables to the dependent variable (Attrition) and build a churn prediction.

**Classification algorithms used:** 

1. Logistic Regression
2. K-Nearest Neighbors
3. Support Vector Machines
4. Naive Bayes classifier
5. Decision Tree
6. Random Forest


**Self algorithm implementation:**

Formula:
Predicted employee attrition = employees who stayed whose score is at or below the leavers' mean on every one of (RelationshipSatisfaction, JobInvolvement, JobSatisfaction, MonthlyIncome, WorkLifeBalance), where each mean is calculated over the people who left.

In [7]:
# Mean calculation
wb0 = left['WorkLifeBalance'].mean()
ji0 = left['JobInvolvement'].mean()
js0 = left['JobSatisfaction'].mean()
mi0 = left['MonthlyIncome'].mean()
rs0 = left['RelationshipSatisfaction'].mean()
print(wb0,ji0,js0,mi0,rs0)

2.6582278481012658 2.518987341772152 2.4683544303797467 4787.0928270042195 2.5991561181434597


In [8]:
# Predicted employee attrition: employees who stayed but are at or below the leavers' mean on all five metrics
stayed.loc[(stayed['WorkLifeBalance'] <= wb0) & (stayed['JobInvolvement'] <= ji0) & (stayed['JobSatisfaction'] <= js0) & (stayed['MonthlyIncome'] <= mi0) & (stayed['RelationshipSatisfaction'] <= rs0)]

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
157,58,0,Travel_Rarely,1145,Research & Development,9,3,Medical,214,2,Female,75,2,1,Research Scientist,2,Married,3346,11873,4,Yes,20,4,2,1,9,3,2,1,0,0,0
248,37,0,Travel_Rarely,1017,Research & Development,1,2,Medical,340,3,Female,83,2,1,Research Scientist,1,Married,3920,18697,2,No,14,3,1,1,17,2,2,3,1,0,2
734,22,0,Travel_Rarely,217,Research & Development,8,1,Life Sciences,1019,2,Male,94,1,1,Laboratory Technician,1,Married,2451,6881,1,No,15,3,1,1,4,3,2,4,3,1,1
819,28,0,Travel_Rarely,1451,Research & Development,2,1,Life Sciences,1136,1,Male,67,2,1,Research Scientist,2,Married,3201,19911,0,No,17,3,1,0,6,2,1,5,3,0,4
1027,34,0,Travel_Rarely,401,Research & Development,1,3,Life Sciences,1447,4,Female,86,2,1,Laboratory Technician,2,Married,3294,3708,5,No,17,3,1,1,7,2,2,5,4,0,2
1141,30,0,Travel_Rarely,241,Research & Development,7,3,Medical,1609,2,Male,48,2,1,Research Scientist,2,Married,2141,5348,1,No,12,3,2,1,6,3,2,6,4,1,1
1188,29,0,Travel_Rarely,991,Sales,5,3,Medical,1669,1,Male,43,2,2,Sales Executive,2,Divorced,4187,3356,1,Yes,13,3,2,1,10,3,2,10,0,0,9
1256,38,0,Travel_Frequently,594,Research & Development,2,2,Medical,1760,3,Female,75,2,1,Laboratory Technician,2,Married,2468,15963,4,No,14,3,2,1,9,4,2,6,1,0,5
1391,38,0,Travel_Rarely,1404,Sales,1,3,Life Sciences,1961,1,Male,59,2,1,Sales Representative,1,Single,2858,11473,4,No,14,3,1,0,20,3,2,1,0,0,0
1460,29,0,Travel_Rarely,468,Research & Development,28,4,Medical,2054,4,Female,73,2,1,Research Scientist,1,Single,3785,8489,1,No,14,3,2,0,5,3,1,5,4,0,4



For each of these 10 employees, all 5 selected metrics are at or below the mean of the employees who left. Their profile is similar to, or worse than, that of a typical leaver, so they may have a higher chance of leaving the company in the near future.
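
The rule above can be wrapped in a small function so that the thresholds always come from the data. This is only a sketch on made-up toy values; `flag_at_risk` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

# Toy data with two of the five metrics; the values are made up for illustration.
toy = pd.DataFrame({
    "Attrition":       [1, 1, 0, 0, 0],
    "JobSatisfaction": [2, 1, 1, 4, 3],
    "MonthlyIncome":   [3000, 2500, 2600, 9000, 2700],
})
features = ["JobSatisfaction", "MonthlyIncome"]

def flag_at_risk(frame, features, target="Attrition"):
    """Return employees who stayed but are at or below the leavers' mean on every feature."""
    left_means = frame.loc[frame[target] == 1, features].mean()
    stayed = frame.loc[frame[target] == 0]
    # Compare each feature with its own threshold, then require all comparisons to hold.
    below_all = stayed[features].le(left_means, axis="columns").all(axis="columns")
    return stayed[below_all]

at_risk = flag_at_risk(toy, features)
print(at_risk)
```

Requiring every metric to be below the mean is a strict rule, which is one reason it flags only a handful of people; relaxing it to "most metrics" would be a different design choice.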


**Model preparation**

Metrics selected will be the same (RelationshipSatisfaction , JobInvolvement, JobSatisfaction, MonthlyIncome, WorkLifeBalance)

In [60]:
X = df[['WorkLifeBalance','JobInvolvement','JobSatisfaction','MonthlyIncome','RelationshipSatisfaction']] 
y = df['Attrition']

**Splitting dataset into training and test sets**

In [61]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
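
Only about 16% of the employees left (237 of 1,470), so a purely random split can give the training and test sets slightly different attrition rates. A hedged sketch of a stratified split, on synthetic data with a similar imbalance (the arrays `X_toy` and `y_toy` are assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 rows, 5 features, 16% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = np.array([1] * 48 + [0] * 252)

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=1/3, random_state=0, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to passing `stratify=y` to the existing call.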

**Feature scaling**

In [62]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# sc_y = StandardScaler()
# y_train = sc_y.fit_transform(y_train.values.reshape(-1,1))
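
Because the scaler above is fitted on the whole training set, the k-fold cross-validation used later sees validation folds that already influenced the scaling. One common way around this, shown as a sketch on synthetic data (all names below are assumptions), is to put the scaler and the model in a pipeline so that scaling is refitted inside each fold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features; the fourth column is on a much larger scale, like MonthlyIncome.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5)) * [1, 1, 1, 5000, 1]
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The pipeline refits StandardScaler on the training part of every fold.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
scores = cross_val_score(model, X_toy, y_toy, cv=10)
print(round(scores.mean() * 100, 2))
```

With only five features the effect on the scores is probably small, but the pipeline also makes it harder to forget the scaling step when predicting on new data.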

**Logistic Regression** 

In [91]:
from sklearn.linear_model import LogisticRegression
classifierlr = LogisticRegression(random_state = 0)
classifierlr.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifierlr.predict(X_test)

# Accuracy on the training set (the test-set predictions in y_pred are not scored here)
lr_accuracy = round(classifierlr.score(X_train, y_train) * 100, 2)
lr_accuracy

84.08

In [117]:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifierlr, X = X_train, y = y_train, cv = 10)
lrk = round(accuracies.mean() *100,ndigits=2)
lrk

83.88

**K-Nearest Neighbour**

In [97]:
from sklearn.neighbors import KNeighborsClassifier
classifierknn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2) #choosing the metric to select the distance. P = 2 means the distance selected is the euclidean distance
classifierknn.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifierknn.predict(X_test)

# Accuracy on the training set (the test-set predictions in y_pred are not scored here)
knn_accuracy = round(classifierknn.score(X_train, y_train) * 100, 2)
knn_accuracy

85.71

In [116]:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifierknn, X = X_train, y = y_train, cv = 10)
knnk = round(accuracies.mean() *100,ndigits=2)
knnk

81.94

**Support Vector Machine**

In [102]:
# Fitting SVM to the Training set
from sklearn.svm import SVC
classifiersvc = SVC(kernel = 'linear', random_state = 0) #choose the kernel
classifiersvc.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifiersvc.predict(X_test)

# Accuracy on the training set (the test-set predictions in y_pred are not scored here)
svm_accuracy = round(classifiersvc.score(X_train, y_train) * 100, 2)
svm_accuracy

83.88

In [115]:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifiersvc, X = X_train, y = y_train, cv = 10)
svmk = round(accuracies.mean() *100,ndigits=2)
svmk

83.88

**Naive Bayes**

In [105]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifiernb = GaussianNB()
classifiernb.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifiernb.predict(X_test)

# Accuracy on the training set (the test-set predictions in y_pred are not scored here)
nb_accuracy = round(classifiernb.score(X_train, y_train) * 100, 2)
nb_accuracy

83.98

In [114]:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifiernb, X = X_train, y = y_train, cv = 10)
nbk = round(accuracies.mean() *100,ndigits=2)
nbk

83.47

**Decision Tree**

In [107]:
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifierdt = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifierdt.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifierdt.predict(X_test)


# Accuracy on the training set (the test-set predictions in y_pred are not scored here)
dt_accuracy = round(classifierdt.score(X_train, y_train) * 100, 2)
dt_accuracy

100.0

In [113]:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifierdt, X = X_train, y = y_train, cv = 10)
dtk = round(accuracies.mean() *100,ndigits=2)
dtk

73.05
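
The gap between 100% on the training set and about 73% under cross-validation is what an unpruned tree that memorises its training data typically looks like. A sketch on synthetic data (not the notebook's data) comparing an unlimited tree with one capped at `max_depth=3`, a value chosen only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification problem.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 5))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=400) > 1).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=1/3, random_state=0)

# Record (training accuracy, test accuracy) for an unlimited and a shallow tree.
scores = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in scores.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

The unlimited tree fits the training rows perfectly; whether the shallow tree does better on unseen rows depends on the data, which is why the held-out or cross-validated score is the one to compare.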

**Random Forest**

In [109]:
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifierrf = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifierrf.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifierrf.predict(X_test)

# Accuracy on the training set (the test-set predictions in y_pred are not scored here)
rf_accuracy = round(classifierrf.score(X_train, y_train) * 100, 2)
rf_accuracy

97.86

In [112]:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifierrf, X = X_train, y = y_train, cv = 10)
rfk = round(accuracies.mean() *100,ndigits=2)
rfk

80.71

**Model Evaluation**

In [121]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'K-Nearest Neighbour', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Decision Tree'],
    'Score': [svm_accuracy, knn_accuracy, lr_accuracy, 
              rf_accuracy, nb_accuracy, dt_accuracy],
    # K-Fold scores listed in the same order as 'Model'
    'K-Fold' :[svmk, knnk, lrk, rfk, nbk, dtk]})
models.sort_values(by='Score', ascending=False)

Unnamed: 0,Model,Score,K-Fold
5,Decision Tree,100.0,73.05
3,Random Forest,97.86,80.71
1,K-Nearest Neighbour,85.71,81.94
2,Logistic Regression,84.08,83.88
4,Naive Bayes,83.98,83.47
0,Support Vector Machines,83.88,83.88
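
Keeping each model's scores in one record avoids pairing a score with the wrong model name when separate lists are written in different orders. A sketch using the values printed in the model cells above:

```python
import pandas as pd

# One record per model: training accuracy and 10-fold accuracy as reported in each model's cells.
results = {
    "Logistic Regression":     {"Training accuracy": 84.08, "K-Fold": 83.88},
    "K-Nearest Neighbour":     {"Training accuracy": 85.71, "K-Fold": 81.94},
    "Support Vector Machines": {"Training accuracy": 83.88, "K-Fold": 83.88},
    "Naive Bayes":             {"Training accuracy": 83.98, "K-Fold": 83.47},
    "Decision Tree":           {"Training accuracy": 100.0, "K-Fold": 73.05},
    "Random Forest":           {"Training accuracy": 97.86, "K-Fold": 80.71},
}

# orient="index" turns the outer keys into row labels.
summary = pd.DataFrame.from_dict(results, orient="index").sort_values("K-Fold", ascending=False)
print(summary)
```

Sorting by the cross-validated column puts the models in the order that matters for unseen data.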


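**Analysis:** The Score column is accuracy on the training set, so the decision tree's 100% (and the random forest's 97.86%) mainly reflects overfitting: the decision tree's 10-fold cross-validated accuracy, reported in its own section, is 73.05%, the lowest of the six models. On cross-validated accuracy, logistic regression and the support vector machine tie for best at 83.88%. That figure equals the share of employees who stayed (1,233 of 1,470), so these models may be doing little more than predicting the majority class.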
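
Accuracy alone can be misleading when only about 16% of employees left (237 of 1,470). As a sketch on synthetic data with a similar imbalance (every name and number below is an assumption for illustration), a model can be compared with a majority-class baseline on the held-out test set, using recall for the leavers alongside accuracy:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 16% positives, with features shifted for the positive class.
rng = np.random.default_rng(0)
n = 600
y_toy = (rng.random(n) < 0.16).astype(int)
X_toy = rng.normal(size=(n, 3)) + y_toy[:, None] * 0.8

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=1/3, random_state=0, stratify=y_toy
)

# Baseline that always predicts the most frequent class ("stayed").
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
# class_weight="balanced" makes the model pay more attention to the rare class.
clf = LogisticRegression(class_weight="balanced", random_state=0).fit(X_tr, y_tr)

results = {}
for name, model in [("baseline", baseline), ("logistic", clf)]:
    pred = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "recall": recall_score(y_te, pred),   # share of actual leavers that were caught
    }
    print(name, results[name])
    print(confusion_matrix(y_te, pred))
```

The baseline reaches high accuracy while catching none of the leavers, which is the pattern to look for before trusting an accuracy figure close to the majority-class share.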