# Predicting the Risk of Employees’ Attrition with Classification Models

By: Evelyn Peh Ting Yu, Koh Pei Ling, Song Yu Xiang, Wong Jie Peng, Guo Peng Yuan

The US department of labour mentioned that “it could cost up to 11k in direct training expenses and lost productivity to replace an experienced employee earning an annual salary of $33,000.” This statement itself implies that employee attrition is costly. When an employee leaves the organization, the company is not only losing its valuable employees, but the company also loses on the amount that it has spent to recruit and select those employees and to train them for their respective jobs. On the other hand, the organization needs to invest more and more in recruitment, training, and development of new staff to fill up their vacant positions. Due to these reasons, every organization wants to control the attrition rate and retain its employees through more satisfactory company policies and work environments. 

This project, therefore, aims to predict the likelihood of an employee leaving the company by understanding the main drivers of employee churn. Such information can be valuable in helping management take possible actions to potentially retain these employees thereby reducing the high cost incurred in employee turnover. 

### 1. Import Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score,precision_score,recall_score,accuracy_score

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

#model
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier # pip install xgboost

from imblearn.over_sampling import SMOTE # !pip install imblearn

#PyTorch
import torch.nn as nn
import torch

#Save and Load Model
import joblib
import pickle

import warnings
warnings.filterwarnings('ignore')

<hr>

### 2. Import Dataset

https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

<b>35 attributes, 1460 observations</b> <br><br>
<b>Attributes:</b> <br>
<b>Dependent Variables:</b> Attrition <br>
<b>Independent Variables:</b>
<ul>
    <li><b><i>Basic Information (6 attributes)</i></b>: Age, Education, Education Field, Gender, Marital Status, Distance From Home
    <li><b><i>Work Information (8 attributes)</i></b>: Department, Job Role, Job Level, Over Time, Business Travel, Performance Rating, Stock Option Level,
          Job Involvement
    <li><b><i>Satisfaction (4 attributes)</i></b>: Work-Life Balance, Job Satisfaction, Relationship Satisfaction, Enviroment Satisfaction
    <li><b><i>Salary Related (5 attributes)</i></b>: Monthly Income, Monthly Rate, Daily Rate, Hourly Rate, Percent Salary Hike
    <li><b><i>Time Related (7 attributes)</i></b>: Total Working Years, Training Time Last Year, Years At Company, Years In Current Role, Years Since Last Promotion, 
           Years With Current Manager, Num Companies Worked
</ul>

In [5]:
df = pd.read_csv("dataset.csv") 
df.shape #1470 rows and 35 columns

(1470, 35)

In [6]:
df.head() #top 5 records, ensure data loaded properly

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


<hr>

### 3. Exploratory Data Analysis (EDA)

refer to "Project - Exploratory Data Analysis" Notebook

<hr>

### 4. Feature Pre-processing

###### Dropping columns with constant value and identifier columns

In [7]:
col_to_drop = ['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber']

In [8]:
df.drop(columns=col_to_drop, inplace=True) 

In [9]:
df.shape #35 - 4 = 31 remaining cols 

(1470, 31)

###### Encoding the categorical type data 


In [10]:
le = LabelEncoder()

In [11]:
# Label Encoding will be used for columns with 2 or less unique values
le_count = 0
for col in df.columns[1:]:
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
print('{} columns were label encoded.'.format(le_count))

3 columns were label encoded.


In [12]:
# convert rest of categorical variable into dummy
df = pd.get_dummies(df, drop_first=True)

###### Standardisation

In [13]:
# import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 5))
HR_col = list(df.columns)
HR_col.remove('Attrition')
for col in HR_col:
    df[col] = df[col].astype(float)
    df[[col]] = scaler.fit_transform(df[[col]])
df['Attrition'] = pd.to_numeric(df['Attrition'], downcast='float')
df.head()

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,Education,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single
0,2.738095,1.0,3.579098,0.0,1.25,1.666667,0.0,4.571429,3.333333,1.25,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,5.0
1,3.690476,0.0,0.6335,1.25,0.0,3.333333,5.0,2.214286,1.666667,1.25,...,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,5.0,0.0
2,2.261905,1.0,4.549034,0.178571,1.25,5.0,5.0,4.428571,1.666667,0.0,...,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
3,1.785714,0.0,4.617037,0.357143,3.75,5.0,0.0,1.857143,3.333333,0.0,...,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,5.0,0.0
4,1.071429,0.0,1.750179,0.178571,0.0,0.0,5.0,0.714286,3.333333,0.0,...,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 45 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Age                                1470 non-null   float64
 1   Attrition                          1470 non-null   float32
 2   DailyRate                          1470 non-null   float64
 3   DistanceFromHome                   1470 non-null   float64
 4   Education                          1470 non-null   float64
 5   EnvironmentSatisfaction            1470 non-null   float64
 6   Gender                             1470 non-null   float64
 7   HourlyRate                         1470 non-null   float64
 8   JobInvolvement                     1470 non-null   float64
 9   JobLevel                           1470 non-null   float64
 10  JobSatisfaction                    1470 non-null   float64
 11  MonthlyIncome                      1470 non-null   float

In [15]:
# Separate input features and target
y = df["Attrition"]
X = df.loc[:, df.columns != 'Attrition']

<hr>

In [16]:
# setting up testing and training sets
original_X_train, original_X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)


### 5. Handling imbalance data with SMOTE

In [17]:
# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)


sm = SMOTE(random_state = 0)
X_train, y_train = sm.fit_resample(X_train, y_train)

In [18]:
y_train.value_counts()

0.0    913
1.0    913
Name: Attrition, dtype: int64

<hr>

### 6. Handling imbalance data with GAN

https://realpython.com/generative-adversarial-networks/

In [19]:
# setting up testing and training sets
X_gantrain, X_gantest, y_gantrain, y_gantest = train_test_split(X, y, test_size=0.25, random_state=123)

In [20]:
process = X_gantrain.copy()
process['target'] = y_gantrain.copy()

In [21]:
process.target.value_counts()

0.0    913
1.0    189
Name: target, dtype: int64

In [22]:
X_forgenerate = process.query("target == 1").iloc[:,:-1].values
X_forgenerate = torch.tensor(X_forgenerate).type(torch.FloatTensor)

X_non_attrition = process.query('target == 0').iloc[:,:-1].values #without target col
n_generate = X_non_attrition.shape[0] - X_forgenerate.shape[0]

print(f'Have to generate {n_generate} record of attrition = 1 to balance data')

Have to generate 724 record of attrition = 1 to balance data


In [83]:
#implmentation of generator
G = nn.Sequential(                     
    nn.Linear(28, 128),            
    nn.ReLU(),
    nn.Linear(128, 44), 
)

#implementation of discriminator
D = nn.Sequential(                     
    nn.Linear(44, 128),     
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(64, 1),
    nn.Sigmoid()
)


In [88]:
#Before training the models, need to set up some parameters to use during training:
BATCH_SIZE = 50
lr = 0.001                   

# optimizator
opt_D = torch.optim.Adam(D.parameters(),lr=lr)
opt_G = torch.optim.Adam(G.parameters(),lr=lr)
num_epochs = 3000

for step in range(num_epochs):
    #Randomly select real samples of defined batch size with label 1
    chosen_data = np.random.choice((X_forgenerate.shape[0]),size=(BATCH_SIZE),replace=False)
    artist_paintings = X_forgenerate[chosen_data,:]
    
    # Training the discriminator
    D.zero_grad()
    output_discriminator  = D(artist_paintings)
    
    # Generate fake samples using generators
    G.zero_grad()        
    generated_samples  = G(torch.randn(BATCH_SIZE, 28, requires_grad=True))                  
    output_discriminator_generated = D(generated_samples ) 
    
    # loss
    G_loss = torch.mean(torch.log(1. - output_discriminator_generated ))
    G_loss.backward()
    opt_G.step()
    
    output_discriminator_generated  = D(G_paintings.detach())
    
    # loss
    D_loss = - torch.mean(torch.log(output_discriminator) + torch.log(1. - output_discriminator_generated))
    D_loss.backward(retain_graph=True)
    opt_D.step()
    
    
    

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

In [89]:
#GAN fake data
fake_data = G(torch.randn(n_generate,28)).detach().numpy()

X_default = pd.DataFrame(np.concatenate([X_forgenerate,fake_data]))
X_default['target'] = 1

X_non_attrition = pd.DataFrame(X_non_attrition)
X_non_attrition['target'] = 0
train_data_gan = pd.concat([X_default,X_non_attrition])

X_gan = train_data_gan.iloc[:,:-1]
y_gan = train_data_gan.iloc[:,-1]

print(X_gan.shape,y_gan.shape)

(1826, 44) (1826,)


In [90]:
X_gan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1826 entries, 0 to 912
Data columns (total 44 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       1826 non-null   float64
 1   1       1826 non-null   float64
 2   2       1826 non-null   float64
 3   3       1826 non-null   float64
 4   4       1826 non-null   float64
 5   5       1826 non-null   float64
 6   6       1826 non-null   float64
 7   7       1826 non-null   float64
 8   8       1826 non-null   float64
 9   9       1826 non-null   float64
 10  10      1826 non-null   float64
 11  11      1826 non-null   float64
 12  12      1826 non-null   float64
 13  13      1826 non-null   float64
 14  14      1826 non-null   float64
 15  15      1826 non-null   float64
 16  16      1826 non-null   float64
 17  17      1826 non-null   float64
 18  18      1826 non-null   float64
 19  19      1826 non-null   float64
 20  20      1826 non-null   float64
 21  21      1826 non-null   float64
 22  2

<hr>

### 7. Testing SMOTE and GAN on Baseline Modelling

In [91]:
# set table to table to populate with performance results
col = ['Algorithm','Train Accuracy Mean','Test Accuracy', 'Recall']

def runmodel(Xtrain, ytrain):
    train_acc_results = []
    test_acc_results = []
    names = []
    
    df_results = pd.DataFrame(columns=col)
    
    i = 0
    # evaluate each model using cross-validation
    for name, model in models:
        kfold = KFold(n_splits=10)  # 10-fold cross-validation

        # cv accuracy scoring
        cv_acc_results = cross_val_score(model, Xtrain, ytrain, cv=kfold, scoring='accuracy')
        train_acc_results.append(cv_acc_results)

        #test accuracy scoring
        model.fit(Xtrain,ytrain)
        
        #y_train_pred = model.predict(Xtrain) 
        #train_accuracy = accuracy_score(ytrain, y_train_pred)
        #test_acc_results.append(train_accuracy)
        
        pred = model.predict(X_test)
        test_accuracy = accuracy_score(y_test, pred)
        test_acc_results.append(test_accuracy)
        
        #Confusion Matrix
        cm = confusion_matrix(pred,y_test)
        tn = cm[0][0]
        fn = cm[1][0]
        fp = cm[0][1]
        tp = cm[1][1]
        precision = round(tp/(tp+fp),4)
        recall = round(tp/(tp+fn),4)
        f1 = round((2*precision*recall)/(precision + recall),4)

        names.append(name)
        df_results.loc[i] = [name, round(cv_acc_results.mean()*100, 2), round(test_accuracy*100,2), round(recall*100,2)]

        i += 1
        
    return df_results

In [75]:
# selection of algorithms to consider and set performance measure
models = []
models.append(('Logistic Regression', LogisticRegression(random_state=345)))
models.append(('Random Forest', RandomForestClassifier(random_state=345)))
models.append(('SVM', SVC(probability=True)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('Decision Tree Classifier',DecisionTreeClassifier(random_state=345)))
models.append(('Gaussian NB', GaussianNB()))
models.append(('Adaboost', AdaBoostClassifier(random_state=345)))
models.append(("Gradientboost", GradientBoostingClassifier(random_state=345)))
models.append(("BaggingClassifier", BaggingClassifier(random_state=345)))
models.append(("ExtremeGradientBoost", XGBClassifier(random_state=345)))
models.append(("ExtraTreesClassifier", ExtraTreesClassifier(random_state=345)))

###### original

In [63]:
original_result = runmodel(original_X_train, original_y_train)

In [64]:
original_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

Unnamed: 0,Algorithm,Train Accuracy Mean,Test Accuracy,Recall
0,Logistic Regression,86.84,91.3,72.22
2,SVM,85.48,91.03,94.12
6,Adaboost,85.48,90.49,69.7
1,Random Forest,84.57,90.22,92.86
7,Gradientboost,84.66,90.22,75.0
10,ExtraTreesClassifier,85.02,89.95,100.0
9,ExtremeGradientBoost,85.48,89.67,65.62
3,KNN,83.66,87.23,52.94
8,BaggingClassifier,84.93,87.23,51.61
4,Decision Tree Classifier,76.58,77.72,24.24


###### GAN

In [76]:
gan_result = runmodel(X_gan, y_gan)

In [77]:
gan_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

Unnamed: 0,Algorithm,Train Accuracy Mean,Test Accuracy,Recall
2,SVM,88.57,90.76,88.89
9,ExtremeGradientBoost,86.98,90.76,79.17
10,ExtraTreesClassifier,88.9,90.22,100.0
7,Gradientboost,87.8,90.22,73.08
0,Logistic Regression,87.53,90.22,65.79
1,Random Forest,88.96,89.95,86.67
8,BaggingClassifier,87.75,87.5,55.56
3,KNN,87.58,87.23,52.94
5,Gaussian NB,89.56,86.96,
6,Adaboost,86.98,86.41,47.83


###### SMOTE

In [67]:
smote_result = runmodel(X_train, y_train)

In [68]:
smote_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

Unnamed: 0,Algorithm,Train Accuracy Mean,Test Accuracy,Recall
7,Gradientboost,91.42,90.76,75.0
10,ExtraTreesClassifier,96.56,90.22,75.0
1,Random Forest,93.88,88.86,65.22
9,ExtremeGradientBoost,92.73,88.04,55.26
8,BaggingClassifier,90.87,86.96,50.0
6,Adaboost,89.23,85.33,44.83
2,SVM,90.59,81.79,36.99
4,Decision Tree Classifier,85.34,79.08,31.65
0,Logistic Regression,76.07,74.73,30.43
3,KNN,79.94,60.6,18.71


#### Analysis

> 

<hr>

### 8. Feature Selection

In [None]:
# selection of algorithms to consider and set performance measure
fs_models = []
fs_models.append(('Logistic Regression', LogisticRegression(random_state=345)))
fs_models.append(('Random Forest', RandomForestClassifier(random_state=345)))
fs_models.append(('SVM', SVC(probability=True)))
fs_models.append(('KNN', KNeighborsClassifier()))
fs_models.append(('Decision Tree Classifier',DecisionTreeClassifier(random_state=345)))
fs_models.append(('Gaussian NB', GaussianNB()))
fs_models.append(('Adaboost', AdaBoostClassifier(random_state=345)))
fs_models.append(("Gradientboost", GradientBoostingClassifier(random_state=345)))
fs_models.append(("BaggingClassifier", BaggingClassifier(random_state=345)))
fs_models.append(("ExtremeGradientBoost", XGBClassifier(random_state=345)))
fs_models.append(("ExtraTreesClassifier", ExtraTreesClassifier(random_state=345)))

In [None]:
def feature_selection_model(model, xtrain,ytrain, xtest):
    
    rfecv = RFECV(estimator=model, cv=StratifiedKFold(10, random_state=123, shuffle=True), scoring="accuracy")
    rfecv.fit(xtrain, ytrain)
    
    return xtrain.iloc[:, rfecv.support_], xtest.iloc[:, rfecv.support_]

In [None]:
# set table to table to populate with performance results
col = ['Algorithm','Train Accuracy Mean', 'Test Accuracy', "Recall"]

def runfsmodel(X_train_importance, y_train_importance, x_test_importance):
    train_acc_results = []
    test_acc_results = []
    names = []
    
    df_results = pd.DataFrame(columns=col)
    
    i = 0
    # evaluate each model using cross-validation
    for name, model in fs_models:
        kfold = KFold(n_splits=10)  # 10-fold cross-validation

        # cv accuracy scoring
        cv_acc_results = cross_val_score(model, X_train_importance, y_train_importance, cv=kfold, scoring='accuracy')
        train_acc_results.append(cv_acc_results)

        #test accuracy scoring
        model.fit(X_train_importance,y_train_importance)
        pred = model.predict(x_test_importance)
        test_accuracy = accuracy_score(y_test, pred)
        test_acc_results.append(test_accuracy)

        #Confusion Matrix
        cm = confusion_matrix(pred,y_test)
        tn = cm[0][0]
        fn = cm[1][0]
        fp = cm[0][1]
        tp = cm[1][1]
        precision = round(tp/(tp+fp),4)
        recall = round(tp/(tp+fn),4)
        f1 = round((2*precision*recall)/(precision + recall),4)

        names.append(name)
        df_results.loc[i] = [name, round(cv_acc_results.mean()*100, 2), round(test_accuracy*100,2), round(recall*100,2)]

        i += 1
        
    return df_results

###### Logistic Regression


In [None]:
base_lr_model = fs_models[0][1]

In [None]:
LR_fs = feature_selection_model(base_lr_model, X_train, y_train, X_test)
LR_X_train_importance = LR_fs[0]
LR_X_test_importance = LR_fs[1]

print(f'Number of columns selected: {LR_X_train_importance.shape[1]}')

In [None]:
LR_fs_result = runfsmodel(LR_X_train_importance, y_train, LR_X_test_importance)

In [None]:
LR_fs_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

###### Decision Tree

In [None]:
base_dt_model = fs_models[4][1]

In [None]:
DT_fs = feature_selection_model(base_dt_model, X_train, y_train, X_test)
DT_X_train_importance = DT_fs[0]
DT_X_test_importance = DT_fs[1]

print(f'Number of columns selected: {DT_X_train_importance.shape[1]}')

In [None]:
DT_fs_result = runfsmodel(DT_X_train_importance, y_train, DT_X_test_importance)

In [None]:
DT_fs_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

###### Random Forest

In [None]:
base_rf_model = fs_models[1][1]

In [None]:
RF_fs = feature_selection_model(base_rf_model, X_train, y_train, X_test)
RF_X_train_importance = RF_fs[0]
RF_X_test_importance = RF_fs[1]

print(f'Number of columns selected: {RF_X_train_importance.shape[1]}')

In [None]:
RF_fs_result = runfsmodel(RF_X_train_importance, y_train, RF_X_test_importance)

In [None]:
RF_fs_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

#### Analysis

>

<hr>

## 9. Top 3 Model Tuning

In [None]:
col = ["Accuracy_Score","Recall"]
df_performance = pd.DataFrame(columns=col)

In [None]:
#To evaluate performances of all the models
def performance(name, p, ytest):
    
    accuracy = np.round(accuracy_score(p,ytest),4)
    
    cm = confusion_matrix(p,ytest)
    tn = cm[0][0]
    fn = cm[1][0]
    fp = cm[0][1]
    tp = cm[1][1]
    precision = round(tp/(tp+fp),4)
    recall = round(tp/(tp+fn),4)
    f1 = round((2*precision*recall)/(precision + recall),4)
    
    df_performance.loc[name] = [round(accuracy*100,2),round(recall*100,2)]


###### 1. Gradient Boosting

In [None]:
gb_params = {
    'n_estimators':[100,200, 300],
    'learning_rate': [0.1,0.2,0.3],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
    "random_state": [345]
}


performance gridserach on original train dataset

In [None]:
gb_gs = GridSearchCV(GradientBoostingClassifier(),gb_params, scoring=make_scorer(accuracy_score),cv=5)
gb_gs.fit(X_train, y_train)

In [None]:
print(f'Best Params: {gb_gs.best_params_}')
print(f'Best Estimator: {gb_gs.best_estimator_}')
print(f'Best Score: {gb_gs.best_score_}')

In [None]:
y_pred_gb = gb_gs.predict(X_test)

performance("GradientBoosting_Tuned",y_pred_gb, y_test)

perform gridsearch on feature selection train dataset 

In [None]:
gb_gs_fs = GridSearchCV(GradientBoostingClassifier(),gb_params, scoring=make_scorer(accuracy_score),cv=5)
gb_gs_fs.fit(RF_X_train_importance, y_train)

In [None]:
print(f'Best Params: {gb_gs_fs.best_params_}')
print(f'Best Estimator: {gb_gs_fs.best_estimator_}')
print(f'Best Score: {gb_gs_fs.best_score_}')

In [None]:
y_important_pred = gb_gs_fs.predict(RF_X_test_importance)

In [None]:

performance("GradientBoosting_FS_Tuned", y_important_pred, y_test)

In [None]:
df_performance.head()

###### Analysis

>

###### 2. ExtraTreesClassifier

In [None]:
et_params = {"max_depth": [None,2,6,10],
             'max_features':["sqrt", "log2", None],
             "n_estimators": [100,300,500],
             "criterion": ["gini", "entropy", "log_loss"],
             "random_state": [345]
            }

performance gridsearch on original train dataset

In [None]:
et_gs = GridSearchCV(ExtraTreesClassifier(),et_params, scoring=make_scorer(accuracy_score),cv=5)
et_gs.fit(X_train, y_train)

In [None]:
print(f'Best Params: {et_gs.best_params_}')
print(f'Best Estimator: {et_gs.best_estimator_}')
print(f'Best Score: {et_gs.best_score_}')

In [None]:
y_pred_et = et_gs.predict(X_test)

performance("ExtraTreeClassifier_Tuned",y_pred_et, y_test)

perform gridsearch on feature selection train dataset

In [None]:
et_gs_fs = GridSearchCV(ExtraTreesClassifier(),et_params, scoring=make_scorer(accuracy_score),cv=5)
et_gs_fs.fit(RF_X_train_importance, y_train)

In [None]:
print(f'Best Params: {et_gs_fs.best_params_}')
print(f'Best Estimator: {et_gs_fs.best_estimator_}')
print(f'Best Score: {et_gs_fs.best_score_}')

In [None]:
y_important_pred = et_gs_fs.predict(RF_X_test_importance)

In [None]:
performance("ExtraTreeClassifier_FS_Tuned", y_important_pred, y_test)

In [None]:
df_performance.head()

###### 3. Random Forest

In [None]:
rf_params={
        'n_estimators':[100, 300, 500],
        'criterion':['gini','entropy'],
        'max_depth':[None, 3,5,7],
        'max_features':["sqrt", "log2", None],
        "random_state": [345]
       }

In [None]:
rf_gs=GridSearchCV(RandomForestClassifier(),rf_params,scoring=make_scorer(accuracy_score),cv=5)
rf_gs.fit(X_train, y_train)

In [None]:
print(f'Best Params: {rf_gs.best_params_}')
print(f'Best Estimator: {rf_gs.best_estimator_}')
print(f'Best Score: {rf_gs.best_score_}')

In [None]:
y_pred_rf = rf_gs.predict(X_test)

performance("RandomForest_Tuned",y_pred_rf, y_test)

perform gridsearch on feature selection train dataset

In [None]:
rf_gs_fs=GridSearchCV(RandomForestClassifier(),rf_params,scoring=make_scorer(accuracy_score),cv=5)
rf_gs_fs.fit(RF_X_train_importance, y_train)

In [None]:
print(f'Best Params: {rf_gs_fs.best_params_}')
print(f'Best Estimator: {rf_gs_fs.best_estimator_}')
print(f'Best Score: {rf_gs_fs.best_score_}')

In [None]:
y_important_pred = rf_gs_fs.predict(RF_X_test_importance)

In [None]:
performance("RandomForest_FS_Tuned", y_important_pred, y_test)

In [None]:
df_performance.head(6)

### 10. Final Evaluation

In [None]:
rename_index={'Gradientboost': 'Gradientboost_Baseline','ExtraTreesClassifier': 'ExtraTreesClassifier_Baseline','Random Forest': 'RandomForest_Baseline' }

In [None]:
smote_baseline = smote_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False).iloc[:3].set_index(['Algorithm']).drop(['Train Accuracy Mean'], axis=1)

In [None]:
smote_baseline.rename(columns={'Test Accuracy': 'Accuracy_Score'},index = rename_index, inplace=True)

In [None]:
rename_index={'Gradientboost': 'Gradientboost_FS_Baseline','ExtraTreesClassifier': 'ExtraTreesClassifier_FS_Baseline','Random Forest': 'RandomForest_FS_Baseline' }

In [None]:
df_rs_result = RF_fs_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False).iloc[:3].set_index(['Algorithm']).drop(['Train Accuracy Mean'], axis=1)

In [None]:
df_rs_result.rename(columns={'Test Accuracy': 'Accuracy_Score'}, index = rename_index, inplace=True)

In [None]:
pd.concat([df_performance, smote_baseline,df_rs_result]).sort_index()

<hr>

### 11. Save Model

In [None]:
#baseline model w original dataset
filename_gb_baseline = 'finalized_baseline_gbmodel.joblib'
filename_et_baseline = 'finalized_baseline_etmodel.joblib'
filename_rf_baseline = 'finalized_baseline_rfmodel.joblib'

#baseline model w fs dataset
filename_gb_fs_baseline = 'finalized_baseline_fs_gbmodel.joblib'
filename_et_fs_baseline = 'finalized_baseline_fs_etmodel.joblib'
filename_rf_fs_baseline = 'finalized_baseline_fs_rfmodel.joblib'

#tuned w original datset
filename_gb = 'finalized_tuned_gbmodel.joblib'
filename_et = 'finalized_tuned_etmodel.joblib'
filename_rf = 'finalized_tuned_rfmodel.joblib'

#tuned w fs dataset
filename_gb_fs = 'finalized_tunedfs_gbmodel.joblib'
filename_et_fs = 'finalized_tunedfs_etmodel.joblib'
filename_rf_fs = 'finalized_tunedfs_rfmodel.joblib'


In [None]:
joblib.dump(models[7][1], filename_gb_baseline)
joblib.dump(models[10][1], filename_et_baseline)
joblib.dump(models[1][1], filename_rf_baseline)

joblib.dump(fs_models[7][1], filename_gb_fs_baseline)
joblib.dump(fs_models[10][1], filename_et_fs_baseline)
joblib.dump(fs_models[1][1], filename_rf_fs_baseline)

joblib.dump(gb_gs.best_estimator_, filename_gb)
joblib.dump(et_gs.best_estimator_, filename_et)
joblib.dump(rf_gs.best_estimator_, filename_rf)

joblib.dump(gb_gs_fs.best_estimator_, filename_gb_fs)
joblib.dump(et_gs_fs.best_estimator_, filename_et_fs)
joblib.dump(rf_gs_fs.best_estimator_, filename_rf_fs)

###### baseline w original dataset

In [None]:
gb = joblib.load(filename_gb_baseline)
et = joblib.load(filename_et_baseline)
rf = joblib.load(filename_rf_baseline)

###### baseline w fs dataset

In [None]:
gb_fs = joblib.load(filename_gb_fs_baseline)
et_fs = joblib.load(filename_et_fs_baseline)
rf_fs = joblib.load(filename_rf_fs_baseline)

##### tuned w original dataset

In [None]:
gb_gs = joblib.load(filename_gb)
et_gs = joblib.load(filename_et)
rf_gs = joblib.load(filename_rf)

###### tuned w fs dataset

In [None]:
gb_gs_fs = joblib.load(filename_gb_fs)
et_gs_fs = joblib.load(filename_et_fs)
rf_gs_fs = joblib.load(filename_rf_fs)

<hr>

### 12. Voting Classifier

In [None]:
# group / ensemble of best models
estimator = []

estimator.append(('gb', gb))
estimator.append(('et', et))
estimator.append(('rf', rf))
estimator.append(('gb_gs', gb_gs))
estimator.append(('et_gs', et_gs))
estimator.append(('rf_gs', rf_gs))

# Voting Classifier with hard voting
vot_hard = VotingClassifier(estimators = estimator, voting ='hard')
vot_hard.fit(X_train, y_train)
y_pred = vot_hard.predict(X_test)

# using accuracy_score metric to predict accuracy
score = accuracy_score(y_test, y_pred)

cm = confusion_matrix(y_pred,y_test)

tn = cm[0][0]
fn = cm[1][0]
fp = cm[0][1]
tp = cm[1][1]

recall = round(tp/(tp+fn),4)

print(f'Accuracy Score: {round(score*100,2)}%')
print(f'Recall: {recall}')

In [None]:
# group / ensemble of best models
estimator = []

estimator.append(('gb_fs', gb_fs))
estimator.append(('et_fs', et_fs))
estimator.append(('rf_fs', rf_fs))
estimator.append(('gb_gs_fs', gb_gs_fs))
estimator.append(('et_gs_fs', et_gs_fs))
estimator.append(('rf_gs_fs', rf_gs_fs))

# Voting Classifier with hard voting
vot_hard = VotingClassifier(estimators = estimator, voting ='hard')
vot_hard.fit(RF_X_train_importance, y_train)
y_pred = vot_hard.predict(RF_X_test_importance)

# using accuracy_score metric to predict accuracy
score = accuracy_score(y_test, y_pred)
print(f'Accuracy Score:  {round(score*100,2)}%')

cm = confusion_matrix(y_pred,y_test)
tn = cm[0][0]
fn = cm[1][0]
fp = cm[0][1]
tp = cm[1][1]

recall = round(tp/(tp+fn),4)
print(f'Recall: {recall}')