---

![](https://media.eadbox.com/system/uploads/saas/devise_logo/5ee2b68e48b56200102e2258/newebac_logo_black.png)

---

Student: [Victor Chicati](https://www.linkedin.com.br/in/victorchicati)

# Semantix project with EBAC - Employee Turnover prediction

# 1 | Introduction

### What to Expect
The purpose of this analysis is to predict which of a company's employees are most likely to resign and to identify the factors that contribute to resignation.


### Dataset
This Employee Turnover dataset is a real dataset shared on Edward Babushkin's blog, where it is used to predict an employee's risk of quitting (with a survival analysis model).

#### Column Attributes
* **stag** - Experience (time)
* **event** - Employee turnover
* **gender** - Employee's gender, female(f), or male(m)
* **age** - Employee's age (year)
* **industry** - Employee's Industry
* **profession** - Employee's profession
* **traffic** - The pipeline through which the employee came to the company:
    * **advert** - You contacted the company directly (after learning from advertising, knowing the company's brand, etc.)
    * **recNErab** - You contacted the company directly on the recommendation of a friend who is NOT an employee of this company
    * **referal** - You contacted the company directly on the recommendation of a friend who is an employee of this company
    * **youjs** - You applied for a vacancy on the job site
    * **KA** - The recruiting agency brought you to the employer
    * **friends** - Invited by the employer, who knew you before the employment
    * **rabrecNErab** - The employer contacted you on the recommendation of a person who knows you
    * **empjs** - The employer reached you through your resume on the job site
* **coach** - Presence of a coach (training) on probation
* **head_gender** - head (supervisor) gender
* **greywage** - Whether the salary is fully declared to the tax authorities (`white`) or not (`grey`). Greywage in Russia or Ukraine means that the employer (company) pay
* **way** - Employee's way of transportation
* **extraversion** - Extraversion score
* **independ** - Independence score
* **selfcontrol** - Self-control score
* **anxiety** - Anxiety score
* **novator** - Novator score

# 2 | Data Validation

### Importing Libraries

In [50]:
!pip install pycaret



In [51]:
import pandas as pd
import numpy as np

from plotly.offline import iplot, init_notebook_mode
import plotly.express as px

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split

from pycaret.classification import *

import warnings
warnings.filterwarnings('ignore')

In [52]:
df = pd.read_csv('turnover.csv', encoding = 'ISO-8859-1')
df.head()

Unnamed: 0,stag,event,gender,age,industry,profession,traffic,coach,head_gender,greywage,way,extraversion,independ,selfcontrol,anxiety,novator
0,7.030801,1,m,35.0,Banks,HR,rabrecNErab,no,f,white,bus,6.2,4.1,5.7,7.1,8.3
1,22.965092,1,m,33.0,Banks,HR,empjs,no,m,white,bus,6.2,4.1,5.7,7.1,8.3
2,15.934292,1,f,35.0,PowerGeneration,HR,rabrecNErab,no,m,white,bus,6.2,6.2,2.6,4.8,8.3
3,15.934292,1,f,35.0,PowerGeneration,HR,rabrecNErab,no,m,white,bus,5.4,7.6,4.9,2.5,6.7
4,8.410678,1,m,32.0,Retail,Commercial,youjs,yes,f,white,bus,3.0,4.1,8.0,7.1,3.7


### Check for missing values and the column data types

In [53]:
display(df.isnull().values.any())

False

Good, there is no null data.
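
`isnull().values.any()` returns a single flag for the whole frame. A per-column view of data types and missing values could look like the minimal sketch below. The helper name `summarize_columns` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count and number of distinct values per column."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'n_missing': frame.isnull().sum(),
        'n_unique': frame.nunique(),
    })

# Toy frame standing in for the turnover data (hypothetical values)
toy = pd.DataFrame({'age': [35.0, None, 32.0], 'gender': ['m', 'f', 'f']})
summary = summarize_columns(toy)
print(summary)
```

On the real data, `summarize_columns(df)` would show at a glance which columns are still of type `object` and need encoding.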

### Basic descriptive statistics of the numerical columns

In [54]:
df.describe()

Unnamed: 0,stag,event,age,extraversion,independ,selfcontrol,anxiety,novator
count,1129.0,1129.0,1129.0,1129.0,1129.0,1129.0,1129.0,1129.0
mean,36.627526,0.505757,31.066965,5.592383,5.478034,5.597254,5.665633,5.879628
std,34.096597,0.500188,6.996147,1.851637,1.703312,1.980101,1.709176,1.904016
min,0.394251,0.0,18.0,1.0,1.0,1.0,1.7,1.0
25%,11.728953,0.0,26.0,4.6,4.1,4.1,4.8,4.4
50%,24.344969,1.0,30.0,5.4,5.5,5.7,5.6,6.0
75%,51.318275,1.0,36.0,7.0,6.9,7.2,7.1,7.5
max,179.449692,1.0,58.0,10.0,10.0,10.0,10.0,10.0


# 3 | Data preprocessing



In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1129 entries, 0 to 1128
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   stag          1129 non-null   float64
 1   event         1129 non-null   int64  
 2   gender        1129 non-null   object 
 3   age           1129 non-null   float64
 4   industry      1129 non-null   object 
 5   profession    1129 non-null   object 
 6   traffic       1129 non-null   object 
 7   coach         1129 non-null   object 
 8   head_gender   1129 non-null   object 
 9   greywage      1129 non-null   object 
 10  way           1129 non-null   object 
 11  extraversion  1129 non-null   float64
 12  independ      1129 non-null   float64
 13  selfcontrol   1129 non-null   float64
 14  anxiety       1129 non-null   float64
 15  novator       1129 non-null   float64
dtypes: float64(7), int64(1), object(8)
memory usage: 141.2+ KB


### Encoding categorical columns as numerical columns
As `df.info()` shows, several columns have the `object` dtype. We'll convert them to numeric codes so that we can feed them to the models later on.

In [56]:
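from sklearn.preprocessing import LabelEncoder

print('Categorical columns: ')
for col in df.columns:
    if df[col].dtype == 'object':
        print(str(col))
        # Category frequencies before encoding (used only to order the printout)
        counts = df[col].value_counts()

        # Fit one encoder per column and replace the strings with integer codes
        label = LabelEncoder()
        label = label.fit(df[col].astype(str))
        df[col] = label.transform(df[col].astype(str))

        # Read the category -> code mapping from the fitted encoder itself,
        # instead of pairing two value_counts() orderings, which breaks on ties
        codes = dict(zip(label.classes_, range(len(label.classes_))))
        value_dict = {key: codes[key] for key in counts.index}
        print(value_dict)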

Categorical columns: 
gender
{'f': 0, 'm': 1}
industry
{'Retail': 10, 'manufacture': 14, 'IT': 5, 'Banks': 2, 'etc': 13, 'Consult': 4, 'State': 11, 'Building': 3, 'PowerGeneration': 8, 'transport': 15, 'Telecom': 12, 'Mining': 6, 'Pharma': 7, 'Agriculture': 1, 'RealEstate': 9, ' HoReCa': 0}
profession
{'HR': 6, 'IT': 7, 'Sales': 11, 'etc': 13, 'Marketing': 9, 'BusinessDevelopment': 1, 'Consult': 3, 'Commercial': 2, 'manage': 14, 'Finanñe': 5, 'Engineer': 4, 'Teaching': 12, 'Accounting': 0, 'Law': 8, 'PR': 10}
traffic
{'youjs': 7, 'empjs': 2, 'rabrecNErab': 4, 'friends': 3, 'referal': 6, 'KA': 0, 'recNErab': 5, 'advert': 1}
coach
{'no': 1, 'my head': 0, 'yes': 2}
head_gender
{'m': 1, 'f': 0}
greywage
{'white': 1, 'grey': 0}
way
{'bus': 0, 'car': 1, 'foot': 2}


In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1129 entries, 0 to 1128
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   stag          1129 non-null   float64
 1   event         1129 non-null   int64  
 2   gender        1129 non-null   int64  
 3   age           1129 non-null   float64
 4   industry      1129 non-null   int64  
 5   profession    1129 non-null   int64  
 6   traffic       1129 non-null   int64  
 7   coach         1129 non-null   int64  
 8   head_gender   1129 non-null   int64  
 9   greywage      1129 non-null   int64  
 10  way           1129 non-null   int64  
 11  extraversion  1129 non-null   float64
 12  independ      1129 non-null   float64
 13  selfcontrol   1129 non-null   float64
 14  anxiety       1129 non-null   float64
 15  novator       1129 non-null   float64
dtypes: float64(7), int64(9)
memory usage: 141.2 KB


Cool, now all our `object` columns have become numeric, and they are still non-null.
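
Label encoding assigns arbitrary integer codes (for example, `Retail` becomes 10 and `IT` becomes 5 above), which models such as `SVC` or `LogisticRegression` may interpret as an ordering. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a toy frame with hypothetical values; it is not what the original notebook does.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (hypothetical values)
toy = pd.DataFrame({
    'way': ['bus', 'car', 'foot', 'bus'],
    'age': [35.0, 33.0, 32.0, 29.0],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['way'], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix: each category becomes its own column.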

# 4 | Exploratory Data Analysis
Analyzing the data and understanding it better using data visualization

### Heatmap

In [58]:
df_corr = df.corr()
fig = px.imshow(df_corr, color_continuous_scale='RdBu_r')
fig.update_xaxes(tickangle=45)
fig.show()

The heatmap shows pairwise linear (Pearson) correlations. No pair of variables is strongly correlated, but this does not rule out non-linear relationships between them.
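
To illustrate how a relationship can exist without linear correlation, the sketch below compares the Pearson coefficient with mutual information from `sklearn.feature_selection.mutual_info_classif`. The data are synthetic (an assumption made for illustration), not the turnover dataset.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic example: the target depends on |x|, so the relationship
# is symmetric around zero and therefore invisible to Pearson correlation
x = np.linspace(-1, 1, 201)
target = (np.abs(x) > 0.505).astype(int)

pearson = np.corrcoef(x, target)[0, 1]
mi = mutual_info_classif(x.reshape(-1, 1), target, random_state=0)[0]

print(f'Pearson correlation: {pearson:.3f}')
print(f'Mutual information:  {mi:.3f}')
```

A Pearson value near zero together with a clearly positive mutual information indicates a dependence that is not linear.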

### Let's check the distribution of employees who resigned and those who did not

In [59]:
fig = px.histogram(df, x='event', color='event')
fig.show()

In [60]:
fig = px.pie(df, "event", color='event', hole=.5)
fig.show()


The dataset contains an almost equal number of employees who resigned and employees who did not.
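
The balance can also be quantified. The counts below are reconstructed from the `describe()` output (a mean of 0.505757 for `event` over 1129 rows corresponds to 571 resignations), so treat them as derived rather than directly reported. The share of the majority class is also the accuracy a model would reach by always predicting that class, which gives a baseline for the models trained later.

```python
import pandas as pd

# Reconstructed counts: 571 rows with event == 1, 558 rows with event == 0
event = pd.Series([1] * 571 + [0] * 558, name='event')

# Relative frequency of each class
shares = event.value_counts(normalize=True)
baseline_accuracy = shares.max()

print(shares.round(4))
print(f'Majority-class baseline accuracy: {baseline_accuracy:.4f}')
```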

### Checking whether experience (time) affects employee resignation

In [61]:
fig = px.histogram(df, x="stag", color='event', marginal='box', barmode='group')
fig.show()

Looking at the graph above, experience (time) has only a very weak relationship with quitting, so we can say that experience (time) is not a major factor in employee resignation.

### Checking whether age affects employee resignation

In [62]:
fig = px.histogram(df, x="age", color='event', marginal='box', barmode='group')
fig.show()

As expected, age behaves like experience (time): neither has a strong relationship with employee resignation.

### Checking whether gender affects employee resignation

In [63]:
fig = px.histogram(df, x="gender", color='event', barmode='group')
fig.show()

Across the graphs above, all of these variables show a very weak or no relationship with employee resignation.
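
Comparing histograms is a visual judgment. For a categorical feature such as `gender`, one possible numerical check is a chi-square test of independence on the contingency table. The sketch below uses `scipy.stats.chi2_contingency` on made-up counts (an assumption for illustration, not the real data).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: the resignation rate is 50% for both genders
toy = pd.DataFrame({
    'gender': ['f'] * 40 + ['m'] * 20,
    'event':  [0, 1] * 20 + [0, 1] * 10,
})

# Contingency table of gender versus event, then the independence test
table = pd.crosstab(toy['gender'], toy['event'])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f'chi2 = {chi2:.3f}, p-value = {p_value:.3f}')
```

A large p-value means the data give no evidence that the feature and `event` are dependent; a small one would suggest an association.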

# 5 | Preparing Data for Modelling
Preparing the data by assigning X and y, splitting it into training and test sets, and trying out different models

### Assigning columns as features (X) and target (y)

In [64]:
X = df.drop(columns=['event'])
y = df['event']

### Split our data into training and test sets, with 20% for testing

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.2)
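
The call above draws a different random split on every run, so the accuracies reported below will vary between executions. A minimal sketch of a reproducible, stratified split follows; `random_state=42` and the synthetic data are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0, 1] * 50)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape, y_te.mean())
```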

### Trying out different models and using the best one

In [66]:
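models = {}  # model name -> accuracy on the test set

def train_validate_predict(classifier, x_train, y_train, x_test, y_test, name):
    """Fit the classifier, predict on the test set and store its accuracy."""
    classifier.fit(x_train, y_train)

    y_pred = classifier.predict(x_test)

    # accuracy_score is the fraction of correct predictions (it is not R^2)
    models[name] = accuracy_score(y_test, y_pred)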

In [67]:
model_names = ['SVC', 'DecisionTreeClassifier', 'AdaBoostClassifier', 'RandomForestClassifier', 'ExtraTreesClassifier', 'LogisticRegression', 'GradientBoostingClassifier']
model_list = [SVC, DecisionTreeClassifier, AdaBoostClassifier, RandomForestClassifier, ExtraTreesClassifier, LogisticRegression, GradientBoostingClassifier]

# Train and score each classifier with its default hyperparameters
for name, classifier_cls in zip(model_names, model_list):
    train_validate_predict(classifier_cls(), X_train, y_train, X_test, y_test, name)

In [68]:
sorted_models = sorted(models.items(), key=lambda x: x[1], reverse=True)
sorted_models

[('ExtraTreesClassifier', 0.6592920353982301),
 ('RandomForestClassifier', 0.6504424778761062),
 ('DecisionTreeClassifier', 0.6415929203539823),
 ('GradientBoostingClassifier', 0.6327433628318584),
 ('AdaBoostClassifier', 0.5575221238938053),
 ('LogisticRegression', 0.5221238938053098),
 ('SVC', 0.5132743362831859)]

ExtraTreesClassifier achieved the highest accuracy, so let's use it as our model.
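
The ranking above comes from a single 226-row test split, and the top models are separated by only one or two percentage points, so the order may change with a different split. One way to get a more stable comparison is k-fold cross-validation with `cross_val_score`, which the notebook imports but does not use. The sketch below runs on synthetic data from `make_classification` (an assumption), not on the turnover dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem standing in for X, y
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    'ExtraTreesClassifier': ExtraTreesClassifier(n_estimators=50, random_state=0),
    'RandomForestClassifier': RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_results = {}
for name, estimator in candidates.items():
    # 5-fold cross-validated accuracy: mean and spread across folds
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_results[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

If the difference between two means is smaller than the spread across folds, the ranking between those models should not be read as meaningful.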

# 6 | Training and Evaluation
Now we will train an ExtraTreesClassifier model, since it achieved the highest accuracy

In [69]:
model = ExtraTreesClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

Let's create a classification report to examine the performance of our model

In [70]:
clfr = classification_report(y_test, y_pred, output_dict=True)

In [71]:
df_classification_report = pd.DataFrame(clfr).transpose()
display(df_classification_report)

Unnamed: 0,precision,recall,f1-score,support
0,0.712871,0.615385,0.66055,117.0
1,0.64,0.733945,0.683761,109.0
accuracy,0.672566,0.672566,0.672566,0.672566
macro avg,0.676436,0.674665,0.672156,226.0
weighted avg,0.677725,0.672566,0.671745,226.0
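
The report can be cross-checked against a confusion matrix. The counts below are reconstructed from the supports, precisions and recalls in the table (72 of 117 employees who stayed and 80 of 109 who resigned were classified correctly). They are derived values, since the notebook does not print the matrix itself.

```python
import numpy as np

# Rows: true class (0, 1); columns: predicted class (0, 1)
cm = np.array([[72, 45],
               [29, 80]])

tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # of the predicted resignations, how many were real
recall_1 = tp / (tp + fn)      # of the real resignations, how many were caught
accuracy = (tp + tn) / cm.sum()

print(f'precision(1) = {precision_1:.6f}')
print(f'recall(1)    = {recall_1:.6f}')
print(f'accuracy     = {accuracy:.6f}')
```

In the notebook itself, calling `confusion_matrix(y_test, y_pred)` (already imported) would give the matrix for the actual run directly.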


Out of curiosity, we will also use the PyCaret library with cross-validation to check whether a more accurate model can be achieved.

In [74]:
# Load the data
data = pd.read_csv('turnover.csv', encoding = 'ISO-8859-1')

# Set up the PyCaret environment with cross-validation
clf1 = setup(data, target='event', session_id=123, fold=10)

# Compare and select the best model
best_model = compare_models()

# Evaluate the model
evaluate_model(best_model)

# Make predictions on the test data set itself
# PyCaret automatically splits the data into training and test and returns the predictions
predictions = predict_model(best_model)
print(predictions)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,event
2,Target type,Binary
3,Original data shape,"(1129, 16)"
4,Transformed data shape,"(1129, 56)"
5,Transformed train set shape,"(790, 56)"
6,Transformed test set shape,"(339, 56)"
7,Numeric features,7
8,Categorical features,8
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.6975,0.7623,0.7125,0.6989,0.7047,0.3946,0.3955,0.424
lightgbm,Light Gradient Boosting Machine,0.6734,0.7288,0.6875,0.6768,0.6798,0.3465,0.3486,0.628
xgboost,Extreme Gradient Boosting,0.6684,0.7248,0.685,0.6703,0.6757,0.3364,0.3387,0.426
et,Extra Trees Classifier,0.6646,0.7628,0.695,0.6633,0.677,0.3285,0.331,0.395
dt,Decision Tree Classifier,0.6241,0.6242,0.6125,0.6364,0.6224,0.2482,0.2498,0.181
lda,Linear Discriminant Analysis,0.6177,0.6524,0.615,0.6263,0.6189,0.2354,0.2365,0.197
lr,Logistic Regression,0.6165,0.6551,0.615,0.6243,0.6181,0.2329,0.234,1.267
ridge,Ridge Classifier,0.6165,0.6535,0.6125,0.6246,0.6168,0.233,0.2339,0.327
gbc,Gradient Boosting Classifier,0.6127,0.6554,0.6275,0.6199,0.6222,0.2249,0.226,0.475
ada,Ada Boost Classifier,0.5962,0.6345,0.5925,0.608,0.5984,0.1924,0.1938,0.397



Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.646,0.7193,0.6316,0.6545,0.6429,0.2922,0.2924


            stag gender   age     industry profession      traffic coach  \
46      9.100616      f  33.0       Retail         HR        empjs    no   
947     7.852156      f  41.0       Retail         HR  rabrecNErab    no   
716    13.930184      f  27.0  manufacture         HR        empjs    no   
1012   74.743324      f  25.0       Retail         HR        youjs    no   
110    13.174538      m  28.0      Telecom         IT  rabrecNErab    no   
...          ...    ...   ...          ...        ...          ...   ...   
834    24.082136      f  42.0    transport         HR        empjs    no   
453    23.983572      f  29.0       Mining         HR  rabrecNErab    no   
490   106.841888      f  22.0      Telecom         HR      referal    no   
474    99.778236      f  42.0          etc         HR        youjs    no   
278    68.271049      f  24.0          etc         PR      referal    no   

     head_gender greywage   way  extraversion  independ  selfcontrol  anxiety  \
46    

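In cross-validation, PyCaret's best model (Random Forest Classifier) reached an accuracy of 0.6975, slightly above the 0.6726 of our ExtraTreesClassifier. On PyCaret's own hold-out set, however, it scored 0.646, so the difference in this case is negligible.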

# 7 | Conclusion

Looking at our classification report, the result is not great, but it is not bad either. It is somewhat surprising that we achieved this accuracy, considering that in the correlation heatmap `event` does not correlate strongly with any column of our dataset.
With this model, we achieved an accuracy of approximately 67% on the test set (about 70% in PyCaret's cross-validation). Depending on the organization and the needs of the predictive model, this information can help reduce employee turnover.
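
The introduction also asks which factors contribute to turnover, which the classification report alone does not answer. Tree ensembles such as `ExtraTreesClassifier` expose impurity-based importances through `feature_importances_`. The sketch below shows how they could be ranked, using synthetic data and made-up column names (both assumptions). Impurity-based importances tend to favor high-cardinality features, so they are a rough guide rather than a causal statement.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: only 'signal' determines the target, 'noise' is irrelevant
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise': rng.normal(size=500),
})
y_demo = (X_demo['signal'] > 0).astype(int)

model_demo = ExtraTreesClassifier(n_estimators=100, random_state=0)
model_demo.fit(X_demo, y_demo)

# Importances sum to 1; a higher value means the feature reduced impurity more
importances = pd.Series(model_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Applied to the fitted `model` and the columns of `X` from the notebook, the same two lines would rank the turnover features.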
