# Attrition Prediction Model Using Naïve Bayes Classifier

Naive Bayes classification: using Age, BusinessTravel, MonthlyIncome and JobSatisfaction to predict Attrition.

In [1]:
#1) Import necessary library 
import pandas as pd

In [2]:
#2) Import dataset (Employees data)
ds=pd.read_csv('dataset.csv')
ds.head()

Unnamed: 0,Age,BusinessTravel,MonthlyIncome,JobSatisfaction,Bonus,Department,DistanceFromHome,Education,EducationField,EmployeeCount,...,JobRole,MaritalStatus,PerformanceRating,StockOptionLevel,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsSinceLastPromotion,OverTime,Attrition
0,41,Travel_Rarely,5993,4,17979,Sales,1,2,Life Sciences,1,...,Sales Executive,Single,3,0,0,1,6,0,Yes,Yes
1,49,Travel_Frequently,5130,2,20520,Research & Development,8,1,Life Sciences,1,...,Research Scientist,Married,4,1,3,3,10,1,No,No
2,37,Travel_Rarely,2090,3,6270,Research & Development,2,2,Other,1,...,Laboratory Technician,Single,3,0,3,3,0,0,Yes,Yes
3,33,Travel_Frequently,2909,3,8727,Research & Development,3,4,Life Sciences,1,...,Research Scientist,Married,3,0,3,3,8,3,Yes,No
4,27,Travel_Rarely,3468,2,10404,Research & Development,2,1,Medical,1,...,Laboratory Technician,Married,3,1,3,3,2,2,No,No


In [3]:
#check for missing/null data

ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Age                      1470 non-null   int64 
 1   BusinessTravel           1470 non-null   object
 2   MonthlyIncome            1470 non-null   int64 
 3   JobSatisfaction          1470 non-null   int64 
 4   Bonus                    1470 non-null   int64 
 5   Department               1470 non-null   object
 6   DistanceFromHome         1470 non-null   int64 
 7   Education                1470 non-null   int64 
 8   EducationField           1470 non-null   object
 9   EmployeeCount            1470 non-null   int64 
 10  EmployeeNumber           1470 non-null   int64 
 11  EnvironmentSatisfaction  1470 non-null   int64 
 12  Gender                   1470 non-null   object
 13  JobLevel                 1470 non-null   int64 
 14  JobRole                  1470 non-null  

Not every column in the dataset is needed to train the machine learning model, so we focus on four attributes: Age, BusinessTravel, MonthlyIncome and JobSatisfaction.

In [4]:
#3) Allocate the relevant attributes as input and output

x=ds.iloc[:,:4].values   # first four columns: Age, BusinessTravel, MonthlyIncome, JobSatisfaction
y=ds.iloc[:,-1].values   # last column: Attrition
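
Positional slicing with `iloc` depends on the column order of the CSV. A minimal sketch of selecting the same four attributes by name instead is shown here. The small `demo` frame is an illustrative stand-in built from the first rows of the `ds.head()` output, and the names `demo`, `feature_cols`, `x_demo` and `y_demo` are hypothetical.

```python
import pandas as pd

# Illustrative stand-in for ds (assumption: values copied from the ds.head() output).
demo = pd.DataFrame({
    "Age": [41, 49, 37],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely"],
    "MonthlyIncome": [5993, 5130, 2090],
    "JobSatisfaction": [4, 2, 3],
    "Bonus": [17979, 20520, 6270],
    "Attrition": ["Yes", "No", "Yes"],
})

# Select inputs and output by column name, so a reordered or extra column cannot shift them.
feature_cols = ["Age", "BusinessTravel", "MonthlyIncome", "JobSatisfaction"]
x_demo = demo[feature_cols].values
y_demo = demo["Attrition"].values

print(x_demo.shape)
print(y_demo)
```

With the real data, the same selection would be applied to `ds` rather than `demo`.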

In [5]:
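#4) Encode the categorical BusinessTravel column (label encoding, then one-hot encoding)
from sklearn.preprocessing import LabelEncoder , OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_x = LabelEncoder()
x[:,1] = labelencoder_x.fit_transform (x[:,1])   # map each travel category to an integer code
# One-hot encode column 1; the remaining columns pass through unchanged, placed after the one-hot columns
ct = ColumnTransformer([("BusinessTravel", OneHotEncoder(), [1])], remainder = 'passthrough')
x = ct.fit_transform(x)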
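
Since `OneHotEncoder` can take string categories directly in recent scikit-learn releases, the intermediate `LabelEncoder` pass on the input column may not be required. The sketch below illustrates this on three rows taken from the `ds.head()` output. The names `x_demo`, `ct_demo` and `encoded` are hypothetical, and only the two travel categories visible in that output are used.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative rows: Age, BusinessTravel, MonthlyIncome, JobSatisfaction.
x_demo = np.array([
    [41, "Travel_Rarely", 5993, 4],
    [49, "Travel_Frequently", 5130, 2],
    [37, "Travel_Rarely", 2090, 3],
], dtype=object)

# One-hot encode column 1 straight from its string values.
ct_demo = ColumnTransformer(
    [("BusinessTravel", OneHotEncoder(), [1])],
    remainder="passthrough",
)
encoded = ct_demo.fit_transform(x_demo)

# Categories are sorted, so the one-hot columns follow that order,
# followed by Age, MonthlyIncome and JobSatisfaction.
print(ct_demo.named_transformers_["BusinessTravel"].categories_)
print(encoded)
```

The one-hot columns come first in the result, which is why the column positions of the numeric features change after this step.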

In [6]:
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [7]:
#5) Split data into training and test sets with the appropriate proportions
from sklearn.model_selection import train_test_split
X_train, X_test , y_train , y_test = train_test_split (x, y, test_size = 0.2, random_state = 0)

In [8]:
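#6) Standardize the data using StandardScaler
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Note: the scaled array X is not used in the cells below; the classifier is
# trained and evaluated on the unscaled X_train / X_test from step 5.
X = sc_X.fit_transform(x)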
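
In the cell above, the scaler is fitted on the full matrix `x` and its output `X` is never passed to the classifier. A common alternative is to fit the scaler on the training split only and reuse its statistics on the test split, so that no test-set information enters preprocessing. The sketch below shows that pattern on synthetic data. The arrays and the `_demo` names are assumptions for illustration, not the employee dataset, and results on the real data are not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features (assumption: not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[35.0, 6500.0], scale=[9.0, 4700.0], size=(200, 2))
y_demo = rng.integers(0, 2, size=200)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test_demo)        # apply the same mean/std to test rows

# Training columns end up with mean close to 0 and standard deviation close to 1.
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

The classifier would then be fitted on `X_train_scaled` and evaluated on `X_test_scaled`.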

In [9]:
#7) Fit and predict results using the Classifier
from sklearn.naive_bayes import GaussianNB
classifier= GaussianNB()
classifier.fit(X_train,y_train)

GaussianNB()

In [10]:
#8) Evaluate the results
y_pred=classifier.predict(X_test)
y_test,y_pred

(array([0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 1, 0, 0]),
 arr

In [11]:
#Performance Evaluation: using confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)
cm

array([[226,  19],
       [ 38,  11]], dtype=int64)

In [12]:
#calculate the prediction accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8061224489795918

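The model classified 80.61% of the test-set employees correctly (237 of 294). This figure is overall accuracy, not the share of leavers found: of the 49 employees who actually left the company, only 11 were predicted to leave.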
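
The confusion matrix from cell In [11] supports more than overall accuracy. The sketch below derives recall, precision and a majority-class baseline from those four counts. It assumes class 1 corresponds to Attrition = Yes, because `LabelEncoder` sorts labels alphabetically. The variable names are illustrative.

```python
# Confusion matrix from cell In [11]: rows = actual class, columns = predicted class.
# Assumption: class 0 = stayed ("No"), class 1 = left ("Yes").
cm = [[226, 19],
      [38, 11]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all test employees classified correctly
recall = tp / (tp + fn)        # share of actual leavers the model identified
precision = tp / (tp + fp)     # share of predicted leavers who actually left
baseline = (tn + fp) / total   # accuracy of always predicting "stayed"

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {recall:.4f}")
print(f"precision: {precision:.4f}")
print(f"baseline:  {baseline:.4f}")
```

Because only 49 of the 294 test employees left, always predicting "stayed" already yields a higher accuracy than the model, which is why recall and precision on the attrition class are the more informative figures here.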