# Predicting Employee Churn
Authored by: Ananya Kambhampati

## Description of the analysis:

In this notebook I use a fictional dataset provided by IBM (https://ieee-dataport.org/documents/ibm-hr-analytics-employee-attrition-performance) to predict an employee's risk of quitting based on various factors. The dataset has 35 columns: the target, Attrition, and 34 features: Age, BusinessTravel, DailyRate, Department, DistanceFromHome, Education, EducationField, EmployeeCount, EmployeeNumber, EnvironmentSatisfaction, Gender, HourlyRate, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, MonthlyRate, NumCompaniesWorked, Over18, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion and YearsWithCurrManager.

## Preliminary (Business) Problem Scoping

When predicting employee churn, recall carries more weight in terms of cost. If the model falsely predicts that an employee is not going to churn (a false negative), we take no proactive measures and the company eventually loses good talent. It generally costs more to hire new talent than to retain good talent. If, on the other hand, the model falsely predicts that an employee is going to churn (a false positive) and management takes proactive measures to retain that employee, the company incurs an extra cost, but this cost is lower than the cost of acquiring new talent. Additionally, proactive efforts directed at employees who are not likely to churn might also improve employee morale.
Hence I will tune the hyperparameters of my models to achieve a high recall score.
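
To make the cost argument concrete, here is a minimal sketch that compares two hypothetical models. The unit costs and the confusion counts are made-up assumptions for illustration only; they are not taken from the dataset or from any model in this notebook.

```python
# Hypothetical per-employee costs (assumptions, not taken from the dataset)
COST_FALSE_NEGATIVE = 50_000  # missed churner: cost of replacing the employee
COST_FALSE_POSITIVE = 5_000   # unnecessary retention effort


def misclassification_cost(fn, fp):
    """Total cost of the model's mistakes under the assumed unit costs."""
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE


def recall(tp, fn):
    """Share of actual churners that the model catches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Two hypothetical models evaluated on the same 61 actual churners
cautious = {"tp": 20, "fn": 41, "fp": 10}   # few alarms, many missed churners
sensitive = {"tp": 50, "fn": 11, "fp": 60}  # more alarms, few missed churners

for name, m in [("cautious", cautious), ("sensitive", sensitive)]:
    print(name, round(recall(m["tp"], m["fn"]), 3),
          misclassification_cost(m["fn"], m["fp"]))
```

Under these assumed costs the higher-recall model is cheaper overall even though it raises six times as many false alarms, which is the trade-off described above.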

## Importing Packages

In [None]:
import pandas as pd
import numpy as np
from sklearn.tree import plot_tree
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import model_selection, tree, linear_model, neighbors, naive_bayes, ensemble 
from imblearn.over_sampling import RandomOverSampler
import seaborn as sns
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
import warnings
import joblib
import tensorflow as tf
from tensorflow import keras

# ignore all warnings
warnings.filterwarnings("ignore")

# Preprocessing the Data

## Loading and Exploring the dataset

### Importing the dataset

In [None]:

from google.colab import drive 
drive.mount('/content/gdrive')

emp_data=pd.read_csv('gdrive/My Drive/WA_Fn-UseC_-HR-Employee-Attrition.csv')


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Viewing the Head of the Data (first 5 rows)

In [None]:

emp_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


### Getting the basic information and description of the data

In [None]:
emp_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [None]:
emp_data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Age,1470.0,,,,36.92381,9.135373,18.0,30.0,36.0,43.0,60.0
Attrition,1470.0,2.0,No,1233.0,,,,,,,
BusinessTravel,1470.0,3.0,Travel_Rarely,1043.0,,,,,,,
DailyRate,1470.0,,,,802.485714,403.5091,102.0,465.0,802.0,1157.0,1499.0
Department,1470.0,3.0,Research & Development,961.0,,,,,,,
DistanceFromHome,1470.0,,,,9.192517,8.106864,1.0,2.0,7.0,14.0,29.0
Education,1470.0,,,,2.912925,1.024165,1.0,2.0,3.0,4.0,5.0
EducationField,1470.0,6.0,Life Sciences,606.0,,,,,,,
EmployeeCount,1470.0,,,,1.0,0.0,1.0,1.0,1.0,1.0,1.0
EmployeeNumber,1470.0,,,,1024.865306,602.024335,1.0,491.25,1020.5,1555.75,2068.0


### Checking the balance of the target variable

In [None]:
emp_data['Attrition'].value_counts()

No     1233
Yes     237
Name: Attrition, dtype: int64

### Creating a list of the categorical columns

In [None]:
categorical=[data for data in emp_data.columns if emp_data[data].dtype=='object']
categorical

['Attrition',
 'BusinessTravel',
 'Department',
 'EducationField',
 'Gender',
 'JobRole',
 'MaritalStatus',
 'Over18',
 'OverTime']

### From our data exploration we can conclude the following

*   We have no null values.
*   We have categorical data that we need to encode as numerical data.
*   We have columns that are not relevant for predicting churn, so we will drop them.
*   Our target variable is not balanced (1233 "No" vs. 237 "Yes"), so we need to balance the data to get better results.
*   Features such as age, daily rate, monthly income, and percent salary hike are measured on different scales, so we need to standardize the data.

We will drop the unnecessary columns and encode the categorical variables before splitting the data into training and validation sets.
We will standardize the data after splitting, fitting the scaler on the training set only to avoid data leakage, and we will oversample only the training set so that the validation set keeps the original class distribution.
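
One way to keep every fitted preprocessing step confined to the training data is to wrap the steps in a scikit-learn `Pipeline`. The sketch below is an illustration under assumptions, not the approach used in this notebook: the toy DataFrame, its values, and the names `toy`, `X_tr`, `X_va` and `clf` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the HR data (all values are made up)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(18, 60, n),
    "MonthlyIncome": rng.integers(1000, 20000, n),
    "OverTime": rng.choice(["Yes", "No"], n),
    "Attrition": rng.choice([0, 1], n, p=[0.84, 0.16]),
})

X_toy = toy.drop("Attrition", axis=1)
y_toy = toy["Attrition"]
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Scaling and encoding are fitted inside the pipeline, on training rows only
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "MonthlyIncome"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["OverTime"]),
])
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

clf.fit(X_tr, y_tr)            # statistics are learned from X_tr only
val_pred = clf.predict(X_va)   # X_va is transformed with the training statistics
print(len(val_pred))
```

When such a pipeline is passed to a cross-validated search, the scaler and encoder are refitted on each training fold, so no fold sees statistics computed from its own validation part.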

## Data Cleaning
Removing the unwanted columns

In [None]:
emp_data=emp_data.drop(['EmployeeCount','EmployeeNumber','Over18','StandardHours','EnvironmentSatisfaction'],axis=1)
emp_data

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,Gender,HourlyRate,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,Female,94,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,Male,61,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,Male,92,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,Female,56,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,Male,40,...,3,4,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,Male,41,...,3,3,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,Male,42,...,3,1,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,Male,87,...,4,2,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,Sales,2,3,Medical,Male,63,...,3,4,0,17,3,2,9,6,0,8


## Splitting the data into features (X) and target variable (y)

In [None]:
X = emp_data.drop("Attrition", axis=1)
y = emp_data["Attrition"]

## Encoding the categorical variables 

In [None]:
# performing one-hot encoding on the nominal categorical data

cat_features = ["BusinessTravel", "Department", "EducationField", "Gender", "JobRole", "MaritalStatus", "OverTime"]
enc = OneHotEncoder(handle_unknown="ignore")
X_encoded = pd.DataFrame(enc.fit_transform(X[cat_features]).toarray(), columns=enc.get_feature_names_out(cat_features))
X = pd.concat([X.drop(cat_features, axis=1), X_encoded], axis=1)

In [None]:
# performing label encoding on the ordinal categorical data
le = LabelEncoder()

# label encode DistanceFromHome
X['DistanceFromHome'] = le.fit_transform(X['DistanceFromHome'])

# label encode Education
X['Education'] = le.fit_transform(X['Education'])

# label encode JobInvolvement
X['JobInvolvement'] = le.fit_transform(X['JobInvolvement'])

# label encode JobLevel
X['JobLevel'] = le.fit_transform(X['JobLevel'])

# label encode JobSatisfaction
X['JobSatisfaction'] = le.fit_transform(X['JobSatisfaction'])

In [None]:
#encoding the target variable
le=LabelEncoder()

y=le.fit_transform(y)
y=pd.DataFrame(y)
y=y.rename(columns={0:'Attrition'})
y

Unnamed: 0,Attrition
0,1
1,0
2,1
3,0
4,0
...,...
1465,0
1466,0
1467,0
1468,0


In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)


## Standardizing the numeric variables

In [None]:
num_features = ["Age", "DailyRate", "DistanceFromHome", "Education", 
                "HourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "MonthlyIncome", 
                "MonthlyRate", "NumCompaniesWorked", "PercentSalaryHike", "PerformanceRating", 
                "RelationshipSatisfaction", "StockOptionLevel", "TotalWorkingYears", 
                "TrainingTimesLastYear", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", 
                "YearsSinceLastPromotion", "YearsWithCurrManager"]
scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_val[num_features] = scaler.transform(X_val[num_features])

## Data Balancing 
I use SMOTE to balance the data because, in our case, the minority class is the important one. Unlike undersampling, SMOTE does not discard any data, and it reduces the bias towards the majority class.

In [None]:
sm = SMOTE(random_state=42)

# resample the training data using SMOTE
X_res, y_res = sm.fit_resample(X_train, y_train)

# check the class distribution of the resampled data
y_res['Attrition'].value_counts()

0    853
1    853
Name: Attrition, dtype: int64
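
To show the idea behind SMOTE, here is a simplified NumPy sketch of its core step: a synthetic minority point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration under assumptions, not imbalanced-learn's implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        # pick a random minority sample
        x = X_minority[rng.integers(len(X_minority))]
        # find its k nearest minority neighbours (position 0 is the sample itself)
        dists = np.linalg.norm(X_minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = X_minority[rng.choice(neighbours)]
        # interpolate at a random position on the segment x -> neighbour
        lam = rng.random()
        synthetic[i] = x + lam * (neighbour - x)
    return synthetic


minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority, n_new=5)
print(new_points.shape)  # (5, 2)
```

Because every synthetic point lies between two existing minority samples, the new rows stay inside the region already occupied by the minority class instead of duplicating rows exactly.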

## Creating a DataFrame to save the Predictive Model results

In [None]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

# Predictive Modeling

## Logistic Regression

### Random Search

In [None]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'max_iter':np.arange(10,500),
    'penalty': ['none','l1','l2','elasticnet'],
    'solver':['saga','liblinear']
}

log_reg_model = LogisticRegression()
rand_search = RandomizedSearchCV(estimator = log_reg_model, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best recall score is 0.41984126984126985
... with parameters: {'solver': 'saga', 'penalty': 'none', 'max_iter': 131}


### Grid Search

In [None]:
score_measure = "recall"
kfolds = 5
max_iter = rand_search.best_params_['max_iter']
penalty = rand_search.best_params_['penalty']
solver = rand_search.best_params_['solver']

param_grid = {
    'max_iter': np.arange(max_iter-5,max_iter+5),  
    'penalty': [penalty],
    'solver': [solver]
}

log_reg_model = LogisticRegression()
grid_search = GridSearchCV(estimator = log_reg_model, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallLogistic = grid_search.best_estimator_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
The best recall score is 0.41984126984126985
... with parameters: {'max_iter': 126, 'penalty': 'none', 'solver': 'saga'}


### Confusion Matrix

In [None]:
c_matrix = confusion_matrix(y_val, grid_search.predict(X_val))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

# add model performance to performance dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
performance = pd.concat([performance, pd.DataFrame({'model': "Logistic Regression", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                    })])

In [None]:
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Logistic Regression,0.863946,0.511628,0.360656,0.423077
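
The four metrics are recomputed by hand in every confusion-matrix cell, so a small helper can make the formulas explicit and guard against division by zero. This is a sketch; the function name `metrics_from_counts` is hypothetical, and the counts below are the ones implied by the Logistic Regression row and the 441-row validation set (380 negatives, 61 positives).

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 for the positive (churn) class."""
    total = tp + tn + fp + fn
    return {
        "Accuracy": (tp + tn) / total if total else 0.0,
        "Precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "Recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "F1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


# Counts consistent with the Logistic Regression row above
row = metrics_from_counts(tp=22, tn=359, fp=21, fn=39)
print({name: round(value, 6) for name, value in row.items()})
# {'Accuracy': 0.863946, 'Precision': 0.511628, 'Recall': 0.360656, 'F1': 0.423077}
```

Read this way, the model catches only 22 of the 61 employees who actually left, which is the number that matters most for this business problem.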


## Support Vector Machine


### Random Search 

In [None]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'C': np.arange(10,100),   
    'gamma': ['scale','auto'],
    'kernel':['linear','rbf','poly']
}

svm = SVC()
rand_search = RandomizedSearchCV(estimator = svm, param_distributions=param_grid, cv=kfolds, n_iter=150,
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 150 candidates, totalling 750 fits
The best recall score is 0.44238095238095243
... with parameters: {'kernel': 'rbf', 'gamma': 'auto', 'C': 22}


### Grid Search 

In [None]:
score_measure = "recall"
kfolds = 5

C = rand_search.best_params_['C']
gamma = rand_search.best_params_['gamma']
kernel = rand_search.best_params_['kernel']

param_grid = {
    'C': np.arange(C-2,C+2),  
    'gamma': [gamma],
    'kernel': [kernel]
    
}

svm1 = SVC()
grid_search = GridSearchCV(estimator = svm1, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallSVM = grid_search.best_estimator_  # the search is scored on recall

Fitting 5 folds for each of 4 candidates, totalling 20 fits
The best recall score is 0.44238095238095243
... with parameters: {'C': 20, 'gamma': 'auto', 'kernel': 'rbf'}


### Confusion Matrix

In [None]:
c_matrix = confusion_matrix(y_val, grid_search.predict(X_val))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"SVM", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [None]:
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Logistic Regression,0.863946,0.511628,0.360656,0.423077
0,SVM,0.852608,0.452381,0.311475,0.368932


## Decision Tree

### Random Search

In [None]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(10,250),  
    'min_samples_leaf': np.arange(5,250),
    'min_impurity_decrease': np.arange(0.00001, 0.0001, 0.001),  # step exceeds the range, so this yields only [1e-05]
    'max_leaf_nodes': np.arange(10, 250), 
    'max_depth': np.arange(5,100), 
    'criterion': ['gini', 'entropy'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator=dtree, param_distributions=param_grid, cv=kfolds, n_iter=1000,
                                 scoring=score_measure, verbose=1, n_jobs=-1, # n_jobs=-1 will utilize all available CPUs 
                                 return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTreeRandom = rand_search.best_estimator_  # the search is scored on recall


Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
The best recall score is 0.3753968253968254
... with parameters: {'min_samples_split': 20, 'min_samples_leaf': 18, 'min_impurity_decrease': 1e-05, 'max_leaf_nodes': 100, 'max_depth': 44, 'criterion': 'entropy'}
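
The reported `min_impurity_decrease` of `1e-05` is the only value the search could have picked: with a step larger than the interval, `np.arange` returns just the start value. The sketch below shows this, together with one possible alternative using `np.linspace`. The choice of five candidates is my assumption about the intent, not something taken from the original search.

```python
import numpy as np

# Same call as in the search space above: the step (0.001) is larger than
# the interval (0.00001 to 0.0001), so only the start value is produced
single = np.arange(0.00001, 0.0001, 0.001)
print(len(single), single)

# One possible alternative (an assumption about the intent): five evenly
# spaced candidates between 1e-05 and 1e-04, both endpoints included
candidates = np.linspace(0.00001, 0.0001, num=5)
print(len(candidates), candidates)
```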


### Grid Search

In [None]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(8,20),  
    'min_samples_leaf': np.arange(2,30),
    'min_impurity_decrease': np.arange( 0.0001, 0.001, 0.01),  # step exceeds the range, so this yields only [0.0001]
    'max_leaf_nodes': [40,100], 
    'max_depth': [9,15], 
    'criterion': ['entropy']
}


dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallDecisionTree = grid_search.best_estimator_  # the search is scored on recall

Fitting 5 folds for each of 1344 candidates, totalling 6720 fits
The best recall score is 0.40349206349206346
... with parameters: {'criterion': 'entropy', 'max_depth': 15, 'max_leaf_nodes': 100, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 5, 'min_samples_split': 17}


In [None]:
c_matrix = confusion_matrix(y_val, grid_search.predict(X_val))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [None]:
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Logistic Regression,0.863946,0.511628,0.360656,0.423077
0,SVM,0.852608,0.452381,0.311475,0.368932
0,Decision Tree,0.845805,0.410256,0.262295,0.32


## Neural Networks

### Random Search

In [None]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [ (50,), (70,),(50,30), (40,20), (60,40, 20), (70,50,40)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0, .2, .5, .7, 1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init': [0.001, 0.01, 0.1, 0.2, 0.5],
    'max_iter': [5000]
}

ann = MLPClassifier()
rand_search = RandomizedSearchCV(estimator = ann, param_distributions=param_grid, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

bestRecallMLP = rand_search.best_estimator_  # best MLP found by the random search

print(rand_search.best_params_)



Fitting 5 folds for each of 100 candidates, totalling 500 fits
{'solver': 'adam', 'max_iter': 5000, 'learning_rate_init': 0.5, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (70,), 'alpha': 1, 'activation': 'tanh'}


In [None]:
y_pred = bestRecallMLP.predict(X_val)

print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

           0       0.86      1.00      0.93       380
           1       0.00      0.00      0.00        61

    accuracy                           0.86       441
   macro avg       0.43      0.50      0.46       441
weighted avg       0.74      0.86      0.80       441



### Grid Search

In [None]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [(50,30)],
    'activation': ['tanh', 'relu'],
    'solver': ['adam'],
    'alpha': [.3, .7],
    'learning_rate': ['adaptive', 'invscaling'],
    'learning_rate_init': [0.5],
    'max_iter': [5000]
}

ann = MLPClassifier()
grid_search = GridSearchCV(estimator=ann, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

best_recall_nn = grid_search.best_estimator_

Fitting 5 folds for each of 8 candidates, totalling 40 fits


In [None]:
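y_pred = best_recall_nn.predict(X_val)

# NOTE: the 'weighted avg' entries average each metric over both classes, weighted by support.
# Weighted-average recall always equals accuracy, so these values are not comparable
# with the churn-class (label 1) metrics stored for the other models.
report = classification_report(y_val, y_pred, output_dict=True)
precision = report['weighted avg']['precision']
recall = report['weighted avg']['recall']
f1 = report['weighted avg']['f1-score']

# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
performance = pd.concat([performance, pd.DataFrame({'model': ['Neural Network Grid Search'], 
                                                    'Accuracy': [accuracy_score(y_val, y_pred)], 
                                                    'Precision': [precision], 
                                                    'Recall': [recall], 
                                                    'F1': [f1]})], ignore_index=True)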

print(performance)

                        model  Accuracy  Precision    Recall        F1
0         Logistic Regression  0.863946   0.511628  0.360656  0.423077
1                         SVM  0.852608   0.452381  0.311475  0.368932
2               Decision Tree  0.845805   0.410256  0.262295  0.320000
3  Neural Network Grid Search  0.861678   0.742489  0.861678  0.797656
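
The Neural Network row is built from weighted averages, so it helps to check what those averages look like for a degenerate model. The sketch below assumes a classifier that predicts "no churn" for every validation employee and uses the class sizes from the classification report above (380 and 61). That this describes the grid-search MLP is an inference from the matching numbers, not something the notebook prints directly.

```python
# Validation class sizes taken from the classification report above
n_neg, n_pos = 380, 61
total = n_neg + n_pos

# Assumed degenerate model: predicts "no churn" (0) for everyone
recall_neg, recall_pos = 1.0, 0.0
precision_neg, precision_pos = n_neg / total, 0.0  # class 1 is never predicted

accuracy = n_neg / total
# Support-weighted averages, as in the 'weighted avg' row of classification_report
weighted_recall = (n_neg * recall_neg + n_pos * recall_pos) / total
weighted_precision = (n_neg * precision_neg + n_pos * precision_pos) / total

print(round(accuracy, 6), round(weighted_recall, 6), round(weighted_precision, 6))
# 0.861678 0.861678 0.742489
```

These three numbers match the Accuracy, Recall and Precision in the Neural Network Grid Search row, while the churn-class recall of such a model is 0.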


## Deep Neural Network

In [None]:
import tensorflow.keras.backend as K

# callback that computes recall on the validation set after every epoch
class Metrics(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        self.recall = []

    def on_epoch_end(self, epoch, logs=None):
        y_pred = self.model.predict(X_val)
        y_pred = np.round(y_pred)
        _recall = recall_score(y_val, y_pred)
        self.recall.append(_recall)
        print("val_recall:",_recall)

# batch-wise recall metric that Keras reports during training
def recall(y_test, y_pred):
    true_positives = K.sum(K.round(K.clip(y_test * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_test, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall


# create model structure
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=(50,)))  # 50 input features after one-hot encoding
model.add(keras.layers.Dense(10, activation='relu',kernel_initializer= tf.keras.initializers.GlorotNormal()))
model.add(keras.layers.Dense(10, activation='relu', kernel_initializer= tf.keras.initializers.GlorotNormal()))
model.add(keras.layers.Dense(10, activation='relu', kernel_initializer= tf.keras.initializers.GlorotNormal()))
model.add(keras.layers.Dense(1, activation='sigmoid')) 

# compile the model with the custom recall metric
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[recall])

# fit the model, printing validation recall after each epoch
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, batch_size=100, callbacks=[Metrics()])


Epoch 1/20
 1/11 [=>............................] - ETA: 10s - loss: 0.5174 - recall: 0.0000e+00

KeyboardInterrupt: ignored

In [None]:
from keras.models import Sequential
from keras.layers import Dense

# Create a Keras model
model = Sequential()
model.add(Dense(16, input_dim=50, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model and get the history
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, batch_size=100)

# Get the predictions and calculate the performance metrics
y_pred = model.predict(X_val)
y_pred = (y_pred > 0.5)

c_matrix = confusion_matrix(y_val, y_pred)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Keras Deep Neural Network", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)],  
                                                    'Recall': [TP/(TP+FN)], 
                                                    'Precision': [TP/(TP+FP)],
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
print(performance)

# Saving the best performing model 

In [None]:
sorted_perf = performance.sort_values(by=['Recall'], ascending=False)

In [None]:
sorted_perf

Unnamed: 0,model,Accuracy,Precision,Recall,F1
3,Neural Network Grid Search,0.861678,0.742489,0.861678,0.797656
0,Logistic Regression,0.863946,0.511628,0.360656,0.423077
1,SVM,0.852608,0.452381,0.311475,0.368932
2,Decision Tree,0.845805,0.410256,0.262295,0.32
0,Keras Deep Neural Network,0.882086,0.714286,0.245902,0.365854


In [None]:
best_model_name = sorted_perf.iloc[0]['model']

In [None]:
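# NOTE: best_model_name is a string, so this call persists only the model's name,
# not the fitted estimator itself
joblib.dump(best_model_name, 'best_model.joblib')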

['best_model.joblib']
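
To persist a model that can later make predictions, the fitted estimator object has to be dumped rather than its name. The sketch below shows one way to do that with a name-to-estimator lookup. The registry `models_by_name`, the `DummyClassifier` stand-in and the temporary path are hypothetical choices for illustration, not objects from this notebook.

```python
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier

# Hypothetical registry mapping the names in the results table to fitted estimators
fitted = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [0, 0, 1])
models_by_name = {"Example Model": fitted}

best_name = "Example Model"  # in the notebook this would come from sorted_perf
path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")

# Persist the estimator object itself, then load it back
joblib.dump(models_by_name[best_name], path)
restored = joblib.load(path)

print(restored.predict([[5]]))
```

The restored object exposes `predict` like the original, which a dumped string cannot do.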

# Loading the best performing model

In [None]:
# loaded_model = joblib.load('best_model.joblib')

# Analysis

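According to our business problem, we need a model with a high recall score, that is, a model with few false negatives. None of the models achieves a good recall on the churn class: the best, Logistic Regression, reaches only 0.36. Note that every model above was fitted on `X_train` and `y_train` rather than on the SMOTE-resampled `X_res` and `y_res`, so the models never saw the balanced data, which is a likely contributor to the low recall. We therefore need to improve the models by improving the data quality, training on the balanced data, choosing better hyperparameter ranges, or using ensembling techniques.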
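
Another inexpensive lever for recall, which this notebook does not try, is lowering the decision threshold of a probabilistic classifier. The sketch below illustrates the effect on synthetic data generated with `make_classification`. The data, the 0.3 threshold and the names `X_demo`, `X_va_demo` and `proba` are assumptions for illustration, so the numbers say nothing about the HR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 84% negatives, 16% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_tr_demo, X_va_demo, y_tr_demo, y_va_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr_demo, y_tr_demo)
proba = model.predict_proba(X_va_demo)[:, 1]  # estimated probability of class 1

# Default threshold of 0.5 versus a lower, more sensitive threshold
recall_default = recall_score(y_va_demo, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_va_demo, (proba >= 0.3).astype(int))
print(recall_default, recall_lowered)
```

Lowering the threshold can only keep or increase recall, at the price of more false positives, which matches the cost trade-off set out in the problem scoping.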

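# MLP and Keras models performance

The Neural Network Grid Search row appears to lead the table with a recall of 0.86, but that figure is not comparable with the other rows. It is the weighted-average recall from the classification report, which always equals accuracy (0.861678), whereas the other rows report recall for the churn class only (0.36 for Logistic Regression, 0.31 for SVM, 0.26 for the Decision Tree and 0.25 for the Keras model). Its weighted-average precision (0.742489) equals 0.861678 squared, which is exactly what a model that predicts "no churn" for all 441 validation employees would produce, and the classification report for the random-search MLP shows a churn-class recall of 0.00. The MLP therefore most likely identifies no churners at all.

The Keras model has the highest accuracy (0.88) and a precision of 0.71, but its churn-class recall of 0.25 is the lowest among the models evaluated on that basis. Because we need a model with few false negatives, that is, a model with a high recall score, neither the MLP nor the Keras model is a good choice for our business problem.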