# Predicting the Risk of Employees’ Attrition with Classification Models

By: Evelyn Peh Ting Yu, Koh Pei Ling, Song Yu Xiang, Wong Jie Peng, Guo Peng Yuan

The U.S. Department of Labor mentioned that “it could cost up to 11k in direct training expenses and lost productivity to replace an experienced employee earning an annual salary of $33,000.” Employee attrition is therefore costly. When an employee leaves, the company loses not only a valuable employee but also the money it spent recruiting, selecting, and training that person. It must then invest again in recruiting, training, and developing new staff to fill the vacant position. For these reasons, every organization wants to control its attrition rate and retain its employees through more satisfactory company policies and work environments.

This project therefore aims to predict the likelihood of an employee leaving the company by understanding the main drivers of employee churn. Such information can help management act to retain at-risk employees, thereby reducing the high cost of employee turnover.

### 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold

#model
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier # pip install xgboost

from imblearn.over_sampling import SMOTE # !pip install imbalanced-learn

#PyTorch
import torch.nn as nn
import torch

#Save and Load Model
import joblib
import pickle

import warnings
warnings.filterwarnings('ignore')

<hr>

### 2. Import Dataset

https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

<b>35 attributes, 1470 observations</b> <br><br>
<b>Attributes:</b> <br>
<b>Dependent Variable:</b> Attrition <br>
<b>Independent Variables:</b>
<ul>
    <li><b><i>Basic Information (6 attributes)</i></b>: Age, Education, Education Field, Gender, Marital Status, Distance From Home
    <li><b><i>Work Information (8 attributes)</i></b>: Department, Job Role, Job Level, Over Time, Business Travel, Performance Rating, Stock Option Level,
          Job Involvement
    <li><b><i>Satisfaction (4 attributes)</i></b>: Work-Life Balance, Job Satisfaction, Relationship Satisfaction, Environment Satisfaction
    <li><b><i>Salary Related (5 attributes)</i></b>: Monthly Income, Monthly Rate, Daily Rate, Hourly Rate, Percent Salary Hike
    <li><b><i>Time Related (7 attributes)</i></b>: Total Working Years, Training Time Last Year, Years At Company, Years In Current Role, Years Since Last Promotion, 
           Years With Current Manager, Num Companies Worked
</ul>

In [None]:
df = pd.read_csv("dataset.csv") 
df.shape #1470 rows and 35 columns

In [None]:
df.head() #top 5 records, ensure data loaded properly

<hr>

### 3. Exploratory Data Analysis (EDA)

Refer to the "Project - Exploratory Data Analysis" notebook.

<hr>

### 4. Feature Pre-processing

###### Dropping columns with constant value and identifier columns

In [None]:
col_to_drop = ['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber']

In [None]:
df.drop(columns=col_to_drop, inplace=True) 

In [None]:
df.shape #35 - 4 = 31 remaining cols 

###### Encoding the categorical type data 


In [None]:
le = LabelEncoder()

In [None]:
# Label encoding is used for categorical columns with 2 or fewer unique values
le_count = 0
for col in df.columns[1:]:
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
print('{} columns were label encoded.'.format(le_count))

In [None]:
# convert the remaining categorical variables into dummy variables
df = pd.get_dummies(df, drop_first=True)

###### Min-max scaling

In [None]:
# rescale every feature to the range [0, 5] with MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 5))
HR_col = list(df.columns)
HR_col.remove('Attrition')
for col in HR_col:
    df[col] = df[col].astype(float)
    df[[col]] = scaler.fit_transform(df[[col]])
df['Attrition'] = pd.to_numeric(df['Attrition'], downcast='float')
df.head()
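
A dependency-free sketch of the rescaling that `MinMaxScaler(feature_range=(0, 5))` applies to each column may help here. The helper name `min_max_scale` and the sample ages are illustrative assumptions, not part of the notebook, and mapping a constant column to the lower bound is a choice made for this sketch.

```python
def min_max_scale(values, feature_range=(0.0, 5.0)):
    """Linearly rescale numbers so the minimum maps to `low` and the maximum to `high`."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant column: there is no spread to rescale, so use the lower bound.
        return [low for _ in values]
    return [low + (v - v_min) / span * (high - low) for v in values]


# Illustrative values only, not taken from the dataset.
ages = [18, 30, 39, 60]
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

Because the minimum and maximum are learned from whatever data is passed in, the same bounds have to be reused when new records are scaled later.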

In [None]:
df.info()

In [None]:
# Separate input features and target
y = df["Attrition"]
X = df.loc[:, df.columns != 'Attrition']

<hr>

In [None]:
# setting up testing and training sets
original_X_train, original_X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)


### 5. Handling imbalanced data with SMOTE

In [None]:
# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)


sm = SMOTE(random_state = 0)
X_train, y_train = sm.fit_resample(X_train, y_train)

In [None]:
y_train.value_counts()
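
The core idea behind SMOTE can be sketched without any library: a synthetic minority sample is placed at a random position on the line segment between a minority point and one of its nearest minority neighbours. The function `smote_like_sample`, the toy points, and the neighbour count `k=2` are assumptions for illustration; the imbalanced-learn implementation used above handles neighbour search and sample counts itself.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating towards a near minority neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy two-feature minority class (hypothetical values).
minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Every synthetic point lies between two real minority points, which is why SMOTE never produces feature values outside the range already observed in the minority class.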

<hr>

### 6. Handling imbalanced data with GAN

https://realpython.com/generative-adversarial-networks/

In [None]:
# setting up testing and training sets
X_gantrain, X_gantest, y_gantrain, y_gantest = train_test_split(X, y, test_size=0.25, random_state=123)

In [None]:
process = X_gantrain.copy()
process['target'] = y_gantrain.copy()

In [None]:
process.target.value_counts()

In [None]:
X_forgenerate = process.query("target == 1").iloc[:,:-1].values
X_forgenerate = torch.tensor(X_forgenerate).type(torch.FloatTensor)

X_non_attrition = process.query('target == 0').iloc[:,:-1].values #without target col
n_generate = X_non_attrition.shape[0] - X_forgenerate.shape[0]

print(f'Have to generate {n_generate} records with attrition = 1 to balance the data')

In [None]:
'''
Implementation of the generator.
It is a model with a 44-dimensional input, which receives random noise vectors z,
and a 44-dimensional output that must produce samples resembling the attrition rows of the training data.
'''
G = nn.Sequential(      
    nn.Linear(44, 16),
    nn.ReLU(),
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 44)
)

'''
Implementation of the discriminator.
The discriminator is a model with a 44-dimensional input and a one-dimensional output.
It receives a sample from the real data or from the generator
and outputs the probability that the sample belongs to the real training data.
'''

D = nn.Sequential(                     
    nn.Linear(44, 128),     
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.3), # dropout after the first and second hidden layers helps to avoid overfitting
    nn.Linear(64, 1), #The output is composed of a single neuron with sigmoidal activation to represent a probability.
    nn.Sigmoid()
)


In [None]:
#Before training the models, need to set up some parameters to use during training:
BATCH_SIZE = 25
lr = 0.001                   

# optimizers
opt_D = torch.optim.Adam(D.parameters(),lr=lr)
opt_G = torch.optim.Adam(G.parameters(),lr=lr)
num_epochs = 3000

for step in range(num_epochs):
    #Randomly select real samples of defined batch size with label 1
    chosen_data = np.random.choice((X_forgenerate.shape[0]),size=(BATCH_SIZE),replace=False)
    samples = X_forgenerate[chosen_data,:]
    
    # Generate fake samples using generators
    G.zero_grad()        
    generated_samples  = G(torch.randn(BATCH_SIZE, 44))                  
    output_discriminator_generated = D(generated_samples) # feed the generator’s output into the discriminator 
    
    # loss
    G_loss = torch.mean(torch.log(1. - output_discriminator_generated))
    G_loss.backward()
    opt_G.step()
    
    # Training the discriminator
    D.zero_grad()
    output_discriminator = D(samples)
    output_discriminator_generated = D(generated_samples.detach())
    
    # loss
    D_loss = - torch.mean(torch.log(output_discriminator) + torch.log(1. - output_discriminator_generated))
    D_loss.backward(retain_graph=True)
    opt_D.step()
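
The two loss expressions in the loop above can be checked by hand with a few numbers. The sketch below recomputes them in plain Python for made-up discriminator outputs; the function names and the probability values are assumptions for illustration only.

```python
import math


def discriminator_loss(d_real, d_fake):
    """-mean(log D(x) + log(1 - D(G(z)))) over paired real and generated batches."""
    terms = [math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake)]
    return -sum(terms) / len(terms)


def generator_loss(d_fake):
    """mean(log(1 - D(G(z)))): it decreases as the discriminator is fooled more often."""
    return sum(math.log(1.0 - f) for f in d_fake) / len(d_fake)


# Hypothetical discriminator outputs (probability that a sample is real).
confident_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
confused_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
weak_g = generator_loss([0.1, 0.2])    # generated samples are easily spotted
strong_g = generator_loss([0.8, 0.9])  # generated samples mostly pass as real

print(confident_d, confused_d, weak_g, strong_g)
```

A discriminator that cannot tell the two sources apart outputs 0.5 everywhere, which gives a loss of 2·ln 2 ≈ 1.386. That value is a useful reference point when monitoring `D_loss` during training.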

In [None]:
#GAN fake data
fake_data = G(torch.randn(n_generate,44)).detach().numpy()

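# keep the original feature names so the columns match X_test at prediction time
X_default = pd.DataFrame(np.concatenate([X_forgenerate.numpy(), fake_data]), columns=X_gantrain.columns)
X_default['target'] = 1

X_non_attrition = pd.DataFrame(X_non_attrition, columns=X_gantrain.columns)
X_non_attrition['target'] = 0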
train_data_gan = pd.concat([X_default,X_non_attrition])

X_gan = train_data_gan.iloc[:,:-1]
y_gan = train_data_gan.iloc[:,-1]

print(X_gan.shape,y_gan.shape)

In [None]:
X_gan.info()

<hr>

### 7. Testing SMOTE and GAN on Baseline Modelling

In [None]:
# set up the table to populate with performance results
col = ['Algorithm','Train Accuracy Mean','Test Accuracy', 'Recall']

def runmodel(Xtrain, ytrain):
    train_acc_results = []
    test_acc_results = []
    names = []
    
    df_results = pd.DataFrame(columns=col)
    
    i = 0
    # evaluate each model using cross-validation
    for name, model in models:
        kfold = KFold(n_splits=10)  # 10-fold cross-validation

        # cv accuracy scoring
        cv_acc_results = cross_val_score(model, Xtrain, ytrain, cv=kfold, scoring='accuracy')
        train_acc_results.append(cv_acc_results)

        #test accuracy scoring
        model.fit(Xtrain,ytrain)
                
        pred = model.predict(X_test)
        test_accuracy = accuracy_score(y_test, pred)
        test_acc_results.append(test_accuracy)
        
        #Confusion Matrix (sklearn expects y_true first: rows = actual, columns = predicted)
        cm = confusion_matrix(y_test, pred)
        tn = cm[0][0]
        fn = cm[1][0]
        fp = cm[0][1]
        tp = cm[1][1]
        precision = round(tp/(tp+fp),4)
        recall = round(tp/(tp+fn),4)
        f1 = round((2*precision*recall)/(precision + recall),4)

        names.append(name)
        df_results.loc[i] = [name, round(cv_acc_results.mean()*100, 2), round(test_accuracy*100,2), round(recall*100,2)]

        i += 1
        
    return df_results
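
Since the results table ranks models by recall, it helps to spell out how precision, recall, and F1 follow from the four confusion-matrix counts. The sketch below does this in plain Python, with attrition (label 1) as the positive class. The helper names and the ten demo labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for a binary classification problem."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # leaver correctly flagged
        elif actual == positive:
            fn += 1  # leaver missed
        elif predicted == positive:
            fp += 1  # stayer wrongly flagged
        else:
            tn += 1  # stayer correctly ignored
    return tn, fp, fn, tp


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 4 leavers and 6 stayers.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true_demo, y_pred_demo)
precision, recall, f1 = precision_recall_f1(y_true_demo, y_pred_demo)
print(counts, precision, recall, f1)
```

In this toy case accuracy is 70%, yet only half of the leavers are caught. This is why recall is reported alongside accuracy for an imbalanced target such as attrition.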

In [None]:
# selection of algorithms to consider and set performance measure
models = []
models.append(('Logistic Regression', LogisticRegression(random_state=345)))
models.append(('Random Forest', RandomForestClassifier(random_state=345)))
models.append(('SVM', SVC(probability=True)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('Decision Tree Classifier',DecisionTreeClassifier(random_state=345)))
models.append(('Gaussian NB', GaussianNB()))
models.append(('Adaboost', AdaBoostClassifier(random_state=345)))
models.append(("Gradientboost", GradientBoostingClassifier(random_state=345)))
models.append(("BaggingClassifier", BaggingClassifier(random_state=345)))
models.append(("ExtremeGradientBoost", XGBClassifier(random_state=345)))
models.append(("ExtraTreesClassifier", ExtraTreesClassifier(random_state=345)))

###### GAN

In [None]:
gan_result = runmodel(X_gan, y_gan)

In [None]:
gan_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)
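
One detail is worth keeping in mind when reading the cross-validation column for the GAN data: `KFold(n_splits=10)` does not shuffle by default, and `train_data_gan` is built by concatenating the attrition rows before the non-attrition rows. The plain-Python sketch below mimics contiguous, unshuffled folds on class-sorted labels. The helper `contiguous_folds` and the 20 toy labels are assumptions for illustration.

```python
def contiguous_folds(n_samples, n_splits):
    """Index lists of unshuffled k-fold test folds (fold sizes differ by at most one)."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Class-sorted toy labels: all positives first, then all negatives.
labels = [1] * 10 + [0] * 10
classes_per_fold = [
    sorted({labels[i] for i in fold})
    for fold in contiguous_folds(len(labels), 5)
]
print(classes_per_fold)
```

Most test folds contain a single class, so fold accuracy behaves more like a per-class hit rate. Shuffling the rows first, or using `StratifiedKFold(n_splits=10, shuffle=True, random_state=...)` as the feature-selection step already does, would be one way to avoid this.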

###### SMOTE

In [None]:
smote_result = runmodel(X_train, y_train)

In [None]:
smote_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

<hr>

### 8. Feature Selection

In [None]:
# selection of algorithms to consider and set performance measure
fs_models = []
fs_models.append(('Logistic Regression', LogisticRegression(random_state=345)))
fs_models.append(('Random Forest', RandomForestClassifier(random_state=345)))
fs_models.append(('SVM', SVC(probability=True)))
fs_models.append(('KNN', KNeighborsClassifier()))
fs_models.append(('Decision Tree Classifier',DecisionTreeClassifier(random_state=345)))
fs_models.append(('Gaussian NB', GaussianNB()))
fs_models.append(('Adaboost', AdaBoostClassifier(random_state=345)))
fs_models.append(("Gradientboost", GradientBoostingClassifier(random_state=345)))
fs_models.append(("BaggingClassifier", BaggingClassifier(random_state=345)))
fs_models.append(("ExtremeGradientBoost", XGBClassifier(random_state=345)))
fs_models.append(("ExtraTreesClassifier", ExtraTreesClassifier(random_state=345)))

In [None]:
def feature_selection_model(model, xtrain,ytrain, xtest):
    
    rfecv = RFECV(estimator=model, cv=StratifiedKFold(10, random_state=123, shuffle=True), scoring="accuracy")
    rfecv.fit(xtrain, ytrain)
    
    return xtrain.iloc[:, rfecv.support_], xtest.iloc[:, rfecv.support_]
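
The idea behind recursive feature elimination can be sketched in a few lines: rank the current features, drop the weakest one, re-score, and keep the subset that scored best. This is a simplified sketch, not the `RFECV` algorithm itself, which repeats the elimination inside each cross-validation fold. The feature weights, the scoring rule, and the `noise_a`/`noise_b` columns are hypothetical.

```python
def recursive_elimination(features, importance_fn, score_fn):
    """Drop the least important feature one at a time; return the best-scoring subset."""
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > 1:
        importances = importance_fn(current)  # re-ranked on the remaining features
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
        score = score_fn(current)
        if score >= best_score:  # prefer the smaller subset on ties
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Hypothetical importances and signal strengths, for illustration only.
toy_importance = {"OverTime": 0.50, "MonthlyIncome": 0.30, "Age": 0.15,
                  "noise_a": 0.03, "noise_b": 0.02}
toy_signal = {"OverTime": 0.10, "MonthlyIncome": 0.08, "Age": 0.05,
              "noise_a": 0.0, "noise_b": 0.0}


def toy_importance_fn(subset):
    return {f: toy_importance[f] for f in subset}


def toy_score_fn(subset):
    # Each feature adds its signal but costs a small complexity penalty.
    return sum(toy_signal[f] for f in subset) - 0.02 * len(subset)


selected, selected_score = recursive_elimination(
    list(toy_importance), toy_importance_fn, toy_score_fn
)
print(selected, round(selected_score, 2))
```

In the notebook the ranking comes from the fitted estimator (coefficients for logistic regression, feature importances for tree models) and the score is the cross-validated accuracy.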

In [None]:
# set up the table to populate with performance results
col = ['Algorithm','Train Accuracy Mean', 'Test Accuracy', "Recall"]

def runfsmodel(X_train_importance, y_train_importance, x_test_importance):
    train_acc_results = []
    test_acc_results = []
    names = []
    
    df_results = pd.DataFrame(columns=col)
    
    i = 0
    # evaluate each model using cross-validation
    for name, model in fs_models:
        kfold = KFold(n_splits=10)  # 10-fold cross-validation

        # cv accuracy scoring
        cv_acc_results = cross_val_score(model, X_train_importance, y_train_importance, cv=kfold, scoring='accuracy')
        train_acc_results.append(cv_acc_results)

        #test accuracy scoring
        model.fit(X_train_importance,y_train_importance)
        pred = model.predict(x_test_importance)
        test_accuracy = accuracy_score(y_test, pred)
        test_acc_results.append(test_accuracy)

        #Confusion Matrix (sklearn expects y_true first: rows = actual, columns = predicted)
        cm = confusion_matrix(y_test, pred)
        tn = cm[0][0]
        fn = cm[1][0]
        fp = cm[0][1]
        tp = cm[1][1]
        precision = round(tp/(tp+fp),4)
        recall = round(tp/(tp+fn),4)
        f1 = round((2*precision*recall)/(precision + recall),4)

        names.append(name)
        df_results.loc[i] = [name, round(cv_acc_results.mean()*100, 2), round(test_accuracy*100,2), round(recall*100,2)]

        i += 1
        
    return df_results

###### Logistic Regression


In [None]:
base_lr_model = fs_models[0][1]

In [None]:
LR_fs = feature_selection_model(base_lr_model, X_train, y_train, X_test)
LR_X_train_importance = LR_fs[0]
LR_X_test_importance = LR_fs[1]

print(f'Number of columns selected: {LR_X_train_importance.shape[1]}')

In [None]:
LR_fs_result = runfsmodel(LR_X_train_importance, y_train, LR_X_test_importance)

In [None]:
LR_fs_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

###### Decision Tree

In [None]:
base_dt_model = fs_models[4][1]

In [None]:
DT_fs = feature_selection_model(base_dt_model, X_train, y_train, X_test)
DT_X_train_importance = DT_fs[0]
DT_X_test_importance = DT_fs[1]

print(f'Number of columns selected: {DT_X_train_importance.shape[1]}')

In [None]:
DT_fs_result = runfsmodel(DT_X_train_importance, y_train, DT_X_test_importance)

In [None]:
DT_fs_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

###### Random Forest

In [None]:
base_rf_model = fs_models[1][1]

In [None]:
RF_fs = feature_selection_model(base_rf_model, X_train, y_train, X_test)
RF_X_train_importance = RF_fs[0]
RF_X_test_importance = RF_fs[1]

print(f'Number of columns selected: {RF_X_train_importance.shape[1]}')

In [None]:
RF_fs_result = runfsmodel(RF_X_train_importance, y_train, RF_X_test_importance)

In [None]:
RF_fs_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False)

<hr>

### 9. Top 3 Model Tuning

In [None]:
col = ["Accuracy_Score","Recall"]
df_performance = pd.DataFrame(columns=col)

In [None]:
#To evaluate performances of all the models
def performance(name, p, ytest):
    
    accuracy = np.round(accuracy_score(p,ytest),4)
    
    cm = confusion_matrix(ytest, p)  # y_true first: rows = actual, columns = predicted
    tn = cm[0][0]
    fn = cm[1][0]
    fp = cm[0][1]
    tp = cm[1][1]
    precision = round(tp/(tp+fp),4)
    recall = round(tp/(tp+fn),4)
    f1 = round((2*precision*recall)/(precision + recall),4)
    
    df_performance.loc[name] = [round(accuracy*100,2),round(recall*100,2)]


###### 1. Gradient Boosting

In [None]:
# Use RandomizedSearchCV first to find a promising region of the hyperparameter space
from sklearn.model_selection import RandomizedSearchCV
param = {
    'n_estimators':[100,200, 300,400,500,600],
    'learning_rate': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8],
    "max_leaf_nodes": [2, 5, 10, 20,30,40, 50,60, 100],
    "random_state":[345]
}
rnd_search = RandomizedSearchCV(GradientBoostingClassifier(), param, n_iter =10, cv=9, random_state=200)
rnd_search.fit(X_train,y_train)
rnd_search.best_params_
# rnd_search.best_score_

In [None]:
# Narrow the grid to values just above and below the best random-search parameters
gb_params = {
    'n_estimators':[300, 400,500],
    'learning_rate': [0.2,0.3,0.4],
    "max_leaf_nodes": [30,40,50],
    "random_state":[345]
}
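
The two-stage approach (a broad random search, then a narrow grid search) keeps the number of model fits manageable. The sketch below counts the candidate combinations and cross-validation fits for the Gradient Boosting search spaces, using the `n_iter`, `cv`, and parameter values from this section. The helper name `n_combinations` is an assumption.

```python
from itertools import product

# Broad space sampled by the random search.
random_space = {
    "n_estimators": [100, 200, 300, 400, 500, 600],
    "learning_rate": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "max_leaf_nodes": [2, 5, 10, 20, 30, 40, 50, 60, 100],
    "random_state": [345],
}
# Narrowed space explored exhaustively by the grid search.
grid_space = {
    "n_estimators": [300, 400, 500],
    "learning_rate": [0.2, 0.3, 0.4],
    "max_leaf_nodes": [30, 40, 50],
    "random_state": [345],
}


def n_combinations(space):
    """Number of distinct parameter settings in a search space."""
    return len(list(product(*space.values())))


random_total = n_combinations(random_space)  # size of the full broad grid
grid_total = n_combinations(grid_space)      # size of the narrowed grid
random_fits = 10 * 9                         # n_iter=10 candidates, cv=9 folds
grid_fits = grid_total * 5                   # every candidate, cv=5 folds

print(random_total, grid_total, random_fits, grid_fits)
```

An exhaustive search over the broad space with 9 folds would need 432 × 9 = 3888 fits. The random search samples only 10 of the 432 settings, so the refined grid is a heuristic rather than a guarantee of finding the best setting.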

Perform grid search on the original training dataset (all features, SMOTE-resampled)

In [None]:
gb_gs = GridSearchCV(GradientBoostingClassifier(),gb_params, scoring=make_scorer(accuracy_score),cv=5)
gb_gs.fit(X_train, y_train)

In [None]:
print(f'Best Params: {gb_gs.best_params_}')
print(f'Best Estimator: {gb_gs.best_estimator_}')
print(f'Best Score: {gb_gs.best_score_}')

In [None]:
y_pred_gb = gb_gs.predict(X_test)

performance("GradientBoosting_Tuned",y_pred_gb, y_test)

In [None]:
df_performance.head()

###### 2. ExtraTreesClassifier

In [None]:
# Use RandomizedSearchCV first to find a promising region of the hyperparameter space
from sklearn.model_selection import RandomizedSearchCV
param = {
    "max_depth": [None,2,3,4,5,6,7,8,9,10,11,12,13],
    'max_features':["sqrt", "log2", None],
    "n_estimators": [100,200,300,400,500,600],
    "criterion": ["gini", "entropy", "log_loss"],
    "random_state": [345]

}
rnd_search = RandomizedSearchCV(ExtraTreesClassifier(), param, n_iter =10, cv=9, random_state=200)
rnd_search.fit(X_train,y_train)
rnd_search.best_params_
# rnd_search.best_score_

In [None]:
# Narrow the ranges for max_depth and n_estimators, while keeping all options for max_features and criterion
et_params = {"max_depth": [11,12,13],
             'max_features':["sqrt", "log2", None],
             "n_estimators": [100,200,300],
             "criterion": ["gini", "entropy", "log_loss"]}

In [None]:
et_gs = GridSearchCV(ExtraTreesClassifier(),et_params, scoring=make_scorer(accuracy_score),cv=5)
et_gs.fit(X_train, y_train)

In [None]:
print(f'Best Params: {et_gs.best_params_}')
print(f'Best Estimator: {et_gs.best_estimator_}')
print(f'Best Score: {et_gs.best_score_}')

In [None]:
y_pred_et = et_gs.predict(X_test)

performance("ExtraTreeClassifier_Tuned",y_pred_et, y_test)

In [None]:
df_performance.head()

###### 3. Random Forest

In [None]:
# Use RandomizedSearchCV first to find a promising region of the hyperparameter space
from sklearn.model_selection import RandomizedSearchCV
param = {
        'n_estimators':[100,200, 300, 400, 500,600,700],
        'criterion':['gini','entropy'],
        'max_depth':[None, 3,4,5,6,7,8,9],
        'max_features':["sqrt", "log2", None],
        "random_state": [345]
}
rnd_search = RandomizedSearchCV(RandomForestClassifier(), param, n_iter =10, cv=9, random_state=200)
rnd_search.fit(X_train,y_train)
rnd_search.best_params_
# rnd_search.best_score_

In [None]:
rf_params={
        'n_estimators':[500,600, 700],
        'criterion':['gini','entropy'],
        'max_depth':[None,7,8,9],
        'max_features':["sqrt", "log2", None]
       }

In [None]:
rf_gs=GridSearchCV(RandomForestClassifier(),rf_params,scoring=make_scorer(accuracy_score),cv=5)
rf_gs.fit(X_train, y_train)

In [None]:
print(f'Best Params: {rf_gs.best_params_}')
print(f'Best Estimator: {rf_gs.best_estimator_}')
print(f'Best Score: {rf_gs.best_score_}')

In [None]:
y_pred_rf = rf_gs.predict(X_test)

performance("RandomForest_Tuned",y_pred_rf, y_test)

In [None]:
df_performance.head(6)

### 10. Final Evaluation

In [None]:
rename_index={'Gradientboost': 'Gradientboost_Baseline','ExtraTreesClassifier': 'ExtraTreesClassifier_Baseline','Random Forest': 'RandomForest_Baseline' }

In [None]:
smote_baseline = smote_result.sort_values(by=['Test Accuracy', 'Recall'], ascending=False).iloc[:3].set_index(['Algorithm']).drop(['Train Accuracy Mean'], axis=1)

In [None]:
smote_baseline.rename(columns={'Test Accuracy': 'Accuracy_Score'},index = rename_index, inplace=True)

In [None]:
pd.concat([df_performance, smote_baseline]).sort_index()

<hr>

### 11. Save Model

In [None]:
#baseline models with original dataset
filename_gb_baseline = 'finalized_baseline_gbmodel.joblib'
filename_et_baseline = 'finalized_baseline_etmodel.joblib'
filename_rf_baseline = 'finalized_baseline_rfmodel.joblib'

#tuned models with original dataset
filename_gb = 'finalized_tuned_gbmodel.joblib'
filename_et = 'finalized_tuned_etmodel.joblib'
filename_rf = 'finalized_tuned_rfmodel.joblib'

In [None]:
joblib.dump(models[7][1], filename_gb_baseline)
joblib.dump(models[10][1], filename_et_baseline)
joblib.dump(models[1][1], filename_rf_baseline)

joblib.dump(gb_gs.best_estimator_, filename_gb)
joblib.dump(et_gs.best_estimator_, filename_et)
joblib.dump(rf_gs.best_estimator_, filename_rf)


###### Baseline models with original dataset

In [None]:
gb = joblib.load(filename_gb_baseline)
et = joblib.load(filename_et_baseline)
rf = joblib.load(filename_rf_baseline)

###### Tuned models with original dataset

In [None]:
gb_gs = joblib.load(filename_gb)
et_gs = joblib.load(filename_et)
rf_gs = joblib.load(filename_rf)

<hr>

### 12. Voting Classifier

###### Voting Classifier with hard voting (Model trained with original SMOTE dataset)

In [None]:
# group / ensemble of best models
estimator = []

estimator.append(('gb', gb))
estimator.append(('et', et))
estimator.append(('rf', rf))
estimator.append(('gb_gs', gb_gs))
estimator.append(('et_gs', et_gs))
estimator.append(('rf_gs', rf_gs))

# Voting Classifier with hard voting
vot_hard = VotingClassifier(estimators = estimator, voting ='hard')
vot_hard.fit(X_train, y_train)
y_pred = vot_hard.predict(X_test)

# use accuracy_score to measure test accuracy
score = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # y_true first: rows = actual, columns = predicted
tn = cm[0][0]
fn = cm[1][0]
fp = cm[0][1]
tp = cm[1][1]
recall = round(tp/(tp+fn),4)
print(f'Accuracy Score (Hard voting for original SMOTE dataset): {round(score*100,2)}%')
print(f'Recall (Hard voting for original SMOTE dataset): {round(recall*100,2)}%')

###### Voting Classifier with soft voting (Model trained with original SMOTE dataset)

In [None]:
vot_soft = VotingClassifier(estimators = estimator, voting ='soft')
vot_soft.fit(X_train, y_train)
y_pred = vot_soft.predict(X_test)
# use accuracy_score to measure test accuracy
score = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # y_true first: rows = actual, columns = predicted
tn = cm[0][0]
fn = cm[1][0]
fp = cm[0][1]
tp = cm[1][1]
recall = round(tp/(tp+fn),4)

print(f'Accuracy Score (Soft voting for original SMOTE dataset): {round(score*100,2)}%')
print(f'Recall (Soft voting for original SMOTE dataset): {round(recall*100,2)}%')
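
The difference between the two voting modes can be seen on a single employee. Hard voting counts predicted labels, while soft voting averages predicted probabilities before thresholding. The sketch below is a plain-Python illustration for the binary case. The function names and the three probabilities are assumptions, and ties are resolved towards the smaller label.

```python
def hard_vote(labels):
    """Majority class among per-model label predictions (ties go to the smaller label)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(sorted(counts), key=lambda c: counts[c])


def soft_vote(positive_probas):
    """Average P(attrition) across models, then predict 1 if the mean exceeds 0.5."""
    mean_p = sum(positive_probas) / len(positive_probas)
    return 1 if mean_p > 0.5 else 0


# Hypothetical P(attrition) from three models for one employee.
probas = [0.45, 0.40, 0.95]
labels = [1 if p > 0.5 else 0 for p in probas]  # each model's own decision

hard_decision = hard_vote(labels)
soft_decision = soft_vote(probas)
print(labels, hard_decision, soft_decision)
```

Two mildly negative models outvote one very confident positive model under hard voting, while soft voting lets that confidence carry the decision. This is one reason the two ensembles can report different recall on the same test set.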