# **Employee Attrition Prediction**

## The goal is to build a classification model that predicts whether an employee will stay with or leave the organization.

Importing necessary libraries.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tabulate import tabulate
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.model_selection import cross_val_score
from google.colab import drive

drive.mount('/content/gdrive/', force_remount=True)
%cd gdrive/MyDrive/

Mounted at /content/gdrive/
/content/gdrive/MyDrive


Reading dataset from CSV.

In [3]:
attrition_df = pd.read_csv("/content/gdrive/MyDrive/kaggle.csv")
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


Some columns hold a single value for every observation, and others are not needed for prediction, so we drop them.

In [4]:
attrition_df.drop(columns = ['EmployeeCount','EmployeeNumber','Over18','StandardHours','DailyRate','HourlyRate','MonthlyRate'], inplace = True)
list(attrition_df.columns)

['Age',
 'Attrition',
 'BusinessTravel',
 'Department',
 'DistanceFromHome',
 'Education',
 'EducationField',
 'EnvironmentSatisfaction',
 'Gender',
 'JobInvolvement',
 'JobLevel',
 'JobRole',
 'JobSatisfaction',
 'MaritalStatus',
 'MonthlyIncome',
 'NumCompaniesWorked',
 'OverTime',
 'PercentSalaryHike',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']
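
Single-valued columns can also be found programmatically rather than by inspection. The following is a minimal sketch, assuming a small made-up frame: `toy_df` and its values are illustrative and are not rows from the dataset. It uses `DataFrame.nunique()` to list the columns that have exactly one distinct value.

```python
import pandas as pd

# Toy frame standing in for attrition_df; the values are illustrative assumptions.
toy_df = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "Over18": ["Y", "Y", "Y"],
})

# A column with exactly one distinct value cannot help a classifier separate classes.
n_unique = toy_df.nunique()
constant_columns = n_unique[n_unique == 1].index.tolist()
print(constant_columns)
```

The resulting list could then be passed to `drop(columns=...)` together with any columns excluded for other reasons.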

Encoding categorical variables as integer labels.

In [5]:
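# LabelEncoder maps each distinct string in a column to an integer code.
label_encoder = preprocessing.LabelEncoder()
columns = ['BusinessTravel','Department','EducationField','Gender','JobRole','MaritalStatus','OverTime','Attrition']

# Refit the encoder on each column; the target 'Attrition' becomes No -> 0, Yes -> 1.
for c in columns:
  attrition_df[c] = label_encoder.fit_transform(attrition_df[c])
attrition_df.head()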

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1,2,2,1,2,1,2,0,3,...,3,1,0,8,0,1,6,4,0,5
1,49,0,1,1,8,1,1,3,1,2,...,4,4,1,10,3,3,10,7,1,7
2,37,1,2,1,2,2,4,4,1,2,...,3,2,0,7,3,3,0,0,0,0
3,33,0,1,1,3,4,1,4,0,3,...,3,3,0,8,3,3,8,7,3,0
4,27,0,2,1,2,1,3,1,1,3,...,3,4,1,6,3,3,2,2,2,2
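
`LabelEncoder` sorts the distinct values of a column and numbers them from 0, which is why `Travel_Frequently` becomes 1 and `Travel_Rarely` becomes 2 above. Reusing one encoder across columns keeps only the last column's classes, so the sketch below uses a separate encoder per column and records each mapping. It is an illustration on toy values; the frame `toy` and the dictionary `mappings` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn import preprocessing

# Toy categorical data; the values are illustrative assumptions.
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
    "OverTime": ["Yes", "No", "Yes"],
})

mappings = {}
for c in toy.columns:
    encoder = preprocessing.LabelEncoder()  # one encoder per column
    toy[c] = encoder.fit_transform(toy[c])
    # classes_ holds the original labels in the order of their integer codes.
    mappings[c] = {label: code for code, label in enumerate(encoder.classes_)}

print(mappings)
```

Keeping the mapping makes it possible to translate encoded predictions back into readable labels later.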


Helper function that fits a given model, then returns its accuracy on a held-out test split and its mean accuracy under 5-fold cross-validation.

In [6]:
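def run_model(model, x, y):
  # Hold out 33% of the rows as a test set; the fixed seed keeps the split reproducible.
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

  clf = model
  clf.fit(x_train, y_train)
  y_pred = clf.predict(x_test)

  # Accuracy = correct predictions (diagonal of the 2x2 confusion matrix) / all predictions.
  results = confusion_matrix(y_test, y_pred)
  accuracy = (results[0,0] + results[1,1]) / results.sum()
  # cross_val_score clones the model and refits it on each of the 5 folds of the full data.
  cv_accuracy = cross_val_score(clf, x, y, cv=5)

  return accuracy, cv_accuracy.mean()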
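
The accuracy formula in `run_model` can be checked with a small worked example. The sketch below uses made-up labels (`y_true` and `y_pred` are assumptions for illustration) and compares the confusion-matrix calculation with scikit-learn's `accuracy_score`, which computes the same quantity for a binary problem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 5 negatives and 3 positives, with two mistakes in the predictions.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
results = confusion_matrix(y_true, y_pred)
manual_accuracy = (results[0, 0] + results[1, 1]) / results.sum()
library_accuracy = accuracy_score(y_true, y_pred)

print(results)
print(manual_accuracy, library_accuracy)  # both are 6 correct out of 8 = 0.75
```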

Defining various classification models and storing the results.

In [7]:
x = attrition_df.drop('Attrition', axis=1)
y = attrition_df['Attrition']
models = {
          "Logistic Regression" : LogisticRegression(max_iter = 1500),
          "Naive Bayes": GaussianNB(),
          "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=8),
          "Decision Tree Classifier": DecisionTreeClassifier(max_depth=3),
          "SVM": svm.SVC()
        }
tabulate_results = []

for name, model in models.items():
  accuracy, cv_accuracy = run_model(model, x, y)
  tabulate_results.append([name, accuracy, cv_accuracy])

Printing the results in a table.

In [9]:
results_header = ['Model', 'Accuracy without Cross Validation', 'Accuracy with Cross Validation']
print(tabulate(tabulate_results, headers=results_header))

Model                       Accuracy without Cross Validation    Accuracy with Cross Validation
------------------------  -----------------------------------  --------------------------------
Logistic Regression                                  0.872428                          0.866667
Naive Bayes                                          0.761317                          0.789116
K Nearest Neighbors                                  0.843621                          0.829252
Decision Tree Classifier                             0.864198                          0.853061
SVM                                                  0.855967                          0.838776


Logistic Regression performs best of the five models on both measures: 0.872 accuracy on the held-out test split and 0.867 with cross-validation. The Decision Tree Classifier comes close, at 0.864 and 0.853.

Since the data contains no missing values and combines numerical and categorical features, we will use Logistic Regression to predict whether an employee will stay with or leave the organization.
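
Using the chosen model for prediction amounts to fitting it on training rows and calling `predict`, or `predict_proba` when a probability of leaving is wanted. The following is a minimal sketch, assuming synthetic data from `make_classification` stands in for the encoded features; `X_demo`, `y_demo`, `final_model`, and the other names are hypothetical and do not come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded attrition features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_fit, X_new, y_fit, y_new = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Same estimator settings as the notebook's Logistic Regression entry.
final_model = LogisticRegression(max_iter=1500)
final_model.fit(X_fit, y_fit)

predictions = final_model.predict(X_new)                   # hard 0/1 labels
leave_probability = final_model.predict_proba(X_new)[:, 1]  # P(class 1) per row

print(predictions[:5])
print(leave_probability[:5].round(3))
```

With the real data, `X_fit` and `y_fit` would be replaced by the encoded feature columns and the `Attrition` column.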