# Predicting IBM Employee Attrition using Machine Learning

This paper builds a machine learning classification model that predicts employee attrition from factors such as distance from home to the workplace, marital status, and monthly income.

Predicting employee attrition matters because attrition has significant consequences for organizations, including:

*   Decreased productivity;
*   Loss of skilled employees;
*   Loss of time and resources for training new employees;
*   Ultimately, reduced profits.

Therefore, it is essential to identify the factors that contribute to employee attrition and take proactive measures to address them.

The findings of the paper can be useful for HR professionals, managers, and executives responsible for managing human resources in their organizations.

The original dataset consists of fictional data and is provided by IBM on [Kaggle](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset).

In [105]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Overview

First, examine the data to understand how to work with it throughout the paper.

Consider the following questions:

*   How many observations and variables are in the dataset?
*   How many missing values?
*   What types of data are we dealing with?
*   What is the output variable and how to interpret it?
*   How balanced is the dataset?




In [None]:
data = pd.read_csv("./input/ibm.csv")

data.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


The dataset includes **1470 observations** and **35 variables**:

In [None]:
data.shape

(1470, 35)

There are **no missing values**:

In [None]:
data.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince
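With 35 columns, the per-column listing is long to scan. A more compact check collapses it into a single total and lists only the columns that actually contain gaps. The sketch below is a minimal illustration on a hypothetical toy frame (`toy`, with one missing value planted on purpose), not on the IBM data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value;
# in the notebook the same calls would be applied to `data`.
toy = pd.DataFrame({
    "Age": [41, 49, np.nan],
    "Attrition": ["Yes", "No", "Yes"],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()

# Keep only the columns that actually contain gaps
columns_with_gaps = missing_per_column[missing_per_column > 0]

# Single number summarising the whole frame
total_missing = int(missing_per_column.sum())

print(columns_with_gaps)
print("Total missing values:", total_missing)
```

For a dataset without gaps, `columns_with_gaps` is empty and `total_missing` is zero.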

The dataset contains **only categorical (object) and integer (int64)** data types:

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                
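The same split can be obtained programmatically, which is handy later when text and numeric columns need different treatment. The following is a small sketch on a hypothetical four-column frame (`toy`) that mimics a few columns of the dataset; `select_dtypes` is the pandas call assumed to do the work here:

```python
import pandas as pd

# Hypothetical toy frame mimicking a few columns of the dataset
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "Department": ["Sales", "Research & Development", "Research & Development"],
    "DailyRate": [1102, 279, 1373],
})

# Columns holding numbers (integers in this dataset)
integer_columns = toy.select_dtypes(include="number").columns.tolist()

# Everything else is text, i.e. categorical
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

print("Integer columns:", integer_columns)
print("Text columns:", text_columns)
```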

**Attrition is the target variable**, indicating whether an employee has left the company ("Yes") or not ("No").

Moreover, the dataset is **imbalanced**: only 237 of the 1470 employees (about 16%) left the company:

In [None]:
data["Attrition"].value_counts()

No     1233
Yes     237
Name: Attrition, dtype: int64
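Raw counts are easier to reason about as shares. The sketch below rebuilds a target column with the class counts reported above (1233 "No", 237 "Yes") and converts them to proportions; the series `attrition` is a stand-in for `data["Attrition"]`:

```python
import pandas as pd

# Stand-in for data["Attrition"], reproducing the reported class counts
attrition = pd.Series(["No"] * 1233 + ["Yes"] * 237, name="Attrition")

# normalize=True returns class shares instead of counts
class_share = attrition.value_counts(normalize=True)
print(class_share.round(3))

# Majority-to-minority ratio
imbalance_ratio = (attrition == "No").sum() / (attrition == "Yes").sum()
print(f"Imbalance ratio (No : Yes) = {imbalance_ratio:.1f} : 1")
```

With roughly five "No" cases per "Yes" case, a model that always predicts "No" would already reach about 84% accuracy, so accuracy alone is a weak yardstick for this problem.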

## Data Analysis

The dataset needs careful study to find patterns and detect outliers.

In [None]:
data.describe()

| | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.10694 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 1.0 | 491.25 | 2.0 | 48.0 | 2.0 | 1.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 1.0 | 1020.5 | 3.0 | 66.0 | 3.0 | 2.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 1555.75 | 4.0 | 83.75 | 3.0 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 2068.0 | 4.0 | 100.0 | 4.0 | 5.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |
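One common way to turn these quartiles into an outlier check is the 1.5 × IQR rule. This rule is an assumption introduced here for illustration; the paper does not state which outlier criterion it uses. For "YearsAtCompany", the summary above gives quartiles of 3 and 9, so the upper fence would be 9 + 1.5 × 6 = 18, well below the observed maximum of 40. The sketch below applies the rule to a small made-up series (`toy_years`) chosen to have the same quartiles; `iqr_bounds` is a hypothetical helper:

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences of the k * IQR rule (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up values whose quartiles (3 and 9) match the YearsAtCompany summary
toy_years = pd.Series([0, 1, 3, 5, 5, 7, 9, 10, 40], name="YearsAtCompany")

lower, upper = iqr_bounds(toy_years)

# Values outside the fences are flagged as candidate outliers
outliers = toy_years[(toy_years < lower) | (toy_years > upper)]

print(f"Fences: [{lower}, {upper}]")
print(outliers)
```

Flagged values are only candidates: a tenure of 40 years is unusual but plausible, so whether to cap, transform, or keep such points remains a modelling decision.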


Count the unique values in each column:

In [None]:
data.nunique()

Age                           43
Attrition                      2
BusinessTravel                 3
DailyRate                    886
Department                     3
DistanceFromHome              29
Education                      5
EducationField                 6
EmployeeCount                  1
EmployeeNumber              1470
EnvironmentSatisfaction        4
Gender                         2
HourlyRate                    71
JobInvolvement                 4
JobLevel                       5
JobRole                        9
JobSatisfaction                4
MaritalStatus                  3
MonthlyIncome               1349
MonthlyRate                 1427
NumCompaniesWorked            10
Over18                         1
OverTime                       2
PercentSalaryHike             15
PerformanceRating              2
RelationshipSatisfaction       4
StandardHours                  1
StockOptionLevel               4
TotalWorkingYears             40
TrainingTimesLastYear          7
WorkLifeBa

"EmployeeCount", "Over18", and "StandardHours" each have only one unique value, while "EmployeeNumber" has 1470 unique values, one per row. These features carry no predictive information and should be dropped:

In [None]:
data.drop(["EmployeeCount", "EmployeeNumber", "Over18", "StandardHours"], axis="columns", inplace=True)
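The same selection can be derived from the unique-value counts instead of being typed by hand. The sketch below is one possible way to do this on a hypothetical toy frame (`toy`); the names `constant_columns` and `identifier_columns` are illustrative:

```python
import pandas as pd

# Hypothetical toy frame: one informative column, two constants, one row ID
toy = pd.DataFrame({
    "Age": [41, 49, 37, 41],
    "EmployeeCount": [1, 1, 1, 1],
    "EmployeeNumber": [1, 2, 4, 5],
    "Over18": ["Y", "Y", "Y", "Y"],
})

unique_counts = toy.nunique()

# A single unique value means the column is constant
constant_columns = unique_counts[unique_counts == 1].index.tolist()

# As many unique values as rows suggests a row identifier
identifier_columns = unique_counts[unique_counts == len(toy)].index.tolist()

# Return a new frame rather than modifying the original in place
cleaned = toy.drop(columns=constant_columns + identifier_columns)

print("Constant:", constant_columns)
print("Identifier-like:", identifier_columns)
print("Remaining:", cleaned.columns.tolist())
```

The identifier test is only a heuristic: a continuous measurement can also be unique in every row, so columns flagged this way should be reviewed before dropping.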

**Next, look deeper into the data.**

In [110]:
# Collect low-cardinality text columns and print their value counts.
categorical_columns = []

for column in data.columns:
    if data[column].dtype == object and data[column].nunique() <= 30:
        categorical_columns.append(column)
        print(column)
        print(data[column].value_counts())
        print("====================================")

# "Attrition" is the target variable, not a feature.
categorical_columns.remove("Attrition")

Attrition
No     1233
Yes     237
Name: Attrition, dtype: int64
BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: BusinessTravel, dtype: int64
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64
EducationField
Life Sciences       606
Medical             464
Marketing           159
Technical Degree    132
Other                82
Human Resources      27
Name: EducationField, dtype: int64
Gender
Male      882
Female    588
Name: Gender, dtype: int64
JobRole
Sales Executive              326
Research Scientist           292
Laboratory Technician        259
Manufacturing Director       145
Healthcare Representative    131
Manager                      102
Sales Representative          83
Research Director             80
Human Resources               52
Name: JobRole, dtype: int64
MaritalStatus
Married     673
Single      470
Divorced    327
Name: MaritalStatus, dtyp
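Value counts show how large each category is, but not how it relates to the target. One way to relate the two is a row-normalised cross-tabulation, which gives the share of leavers within each category. The sketch below uses a hypothetical six-row frame (`toy`) with invented values, so the resulting rates say nothing about the real dataset:

```python
import pandas as pd

# Invented rows for illustration only; not taken from the IBM data
toy = pd.DataFrame({
    "MaritalStatus": ["Single", "Single", "Married", "Married", "Married", "Single"],
    "Attrition": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# normalize="index" makes each row sum to 1,
# so the "Yes" column is the attrition rate per category
attrition_rates = pd.crosstab(
    toy["MaritalStatus"], toy["Attrition"], normalize="index"
)

print(attrition_rates.round(2))
```

In the notebook, the same call could be looped over `categorical_columns` to compare attrition rates across all categorical features.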

Some numerical features are ordinal codes with the following interpretation:

*   Education:
    *   Below College, 1;
    *   College, 2;
    *   Bachelor, 3;
    *   Master, 4;
    *   Doctor, 5.

*   Environment Satisfaction:
    *   Low, 1;
    *   Medium, 2;
    *   High, 3;
    *   Very High, 4.

*   Job Involvement:
    *   Low, 1;
    *   Medium, 2;
    *   High, 3;
    *   Very High, 4.

*   Job Level:
    *   Staff, 1;
    *   Senior Staff, 2;
    *   First Level Management, 3;
    *   Middle Management, 4;
    *   Senior Management, 5.

*   Job Satisfaction:
    *   Low, 1;
    *   Medium, 2;
    *   High, 3;
    *   Very High, 4.

*   Performance Rating:
    *   Low, 1;
    *   Good, 2;
    *   Excellent, 3;
    *   Outstanding, 4.

*   Relationship Satisfaction:
    *   Low, 1;
    *   Medium, 2;
    *   High, 3;
    *   Very High, 4.

*   Work-Life Balance:
    *   Bad, 1;
    *   Good, 2;
    *   Better, 3;
    *   Best, 4.
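For plots and tables it can help to attach these labels to the integer codes. The sketch below shows one possible approach for "Education" using a plain dictionary and an ordered categorical, so that the labels keep their rank; the column name `EducationLabel` and the frame `toy` are hypothetical:

```python
import pandas as pd

# Code-to-label mapping taken from the Education list above
education_labels = {
    1: "Below College",
    2: "College",
    3: "Bachelor",
    4: "Master",
    5: "Doctor",
}

# Hypothetical toy frame; in the notebook this would be `data`
toy = pd.DataFrame({"Education": [2, 1, 4, 3, 5]})

# Translate the codes into labels
labels = toy["Education"].map(education_labels)

# An ordered categorical preserves the ranking Below College < ... < Doctor
toy["EducationLabel"] = pd.Categorical(
    labels, categories=list(education_labels.values()), ordered=True
)

print(toy)
print("Highest level present:", toy["EducationLabel"].max())
```

The integer codes themselves are kept, since most models work with the numeric column directly and the labels are needed only for presentation.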

In [113]:
# Collect low-cardinality numeric columns (fewer than 30 distinct values)
# and print the values each one takes.
numerical_columns = []

for column in data.columns:
    if data[column].dtype != object and data[column].nunique() < 30:
        numerical_columns.append(column)
        print(column)
        print(data[column].unique())
        print("====================================")

DistanceFromHome
[ 1  8  2  3 24 23 27 16 15 26 19 21  5 11  9  7  6 10  4 25 12 18 29 22
 14 20 28 17 13]
Education
[2 1 4 3 5]
EnvironmentSatisfaction
[2 3 4 1]
JobInvolvement
[3 2 4 1]
JobLevel
[2 1 3 4 5]
JobSatisfaction
[4 2 3 1]
NumCompaniesWorked
[8 1 6 9 0 4 5 2 7 3]
PercentSalaryHike
[11 23 15 12 13 20 22 21 17 14 16 18 19 24 25]
PerformanceRating
[3 4]
RelationshipSatisfaction
[1 4 2 3]
StockOptionLevel
[0 1 3 2]
TrainingTimesLastYear
[0 3 2 5 1 4 6]
WorkLifeBalance
[1 3 2 4]
YearsInCurrentRole
[ 4  7  0  2  5  9  8  3  6 13  1 15 14 16 11 10 12 18 17]
YearsSinceLastPromotion
[ 0  1  3  2  7  4  8  6  5 15  9 13 12 10 11 14]
YearsWithCurrManager
[ 5  7  0  2  6  8  3 11 17  1  4 12  9 10 15 13 16 14]


## Data Preprocessing

*To be continued...*