<a href="https://colab.research.google.com/github/farrukhtaba/farrukhtaba/blob/main/INN_Learner_Notebook_Full_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><font size=6> Bank Churn Prediction </font></center>

## Problem Statement

### Context

Service businesses such as banks have to contend with customer churn, i.e. customers leaving for another service provider. Understanding which aspects of the service influence a customer's decision to leave lets management focus its improvement efforts on those priorities.

### Objective

As a data scientist at the bank, you need to build a neural-network-based classifier that predicts whether a customer will leave the bank within the next 6 months.

### Data Dictionary

* CustomerId: Unique ID which is assigned to each customer

* Surname: Last name of the customer

* CreditScore: It defines the credit history of the customer.
  
* Geography: A customer’s location
   
* Gender: It defines the Gender of the customer
   
* Age: Age of the customer
    
* Tenure: Number of years for which the customer has been with the bank

* NumOfProducts: refers to the number of products that a customer has purchased through the bank.

* Balance: Account balance

* HasCrCard: A categorical variable indicating whether the customer has a credit card or not.

* EstimatedSalary: Estimated salary

* IsActiveMember: A categorical variable indicating whether the customer is an active member of the bank or not (active in the sense of using bank products regularly, making transactions, etc.)

* Exited: Whether or not the customer left the bank within six months. It can take two values:
    * 0 = No (customer did not leave the bank)
    * 1 = Yes (customer left the bank)


In [None]:
# Import drive to access the files on My Drive
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Library to split data
from sklearn.model_selection import train_test_split
# library to import to standardize the data
from sklearn.preprocessing import StandardScaler

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

#To import different metrics
from sklearn import metrics
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score, mean_absolute_error

# Importing the callbacks API
from keras import callbacks

# Importing tensorflow library
import tensorflow as tf

# importing different functions to build models
from tensorflow.keras.layers import Dense, Dropout,InputLayer
from tensorflow.keras.models import Sequential

# Importing Batch Normalization
from keras.layers import BatchNormalization

# Importing backend
from tensorflow.keras import backend

# Importing shuffle and the model checkpoint callback
from random import shuffle
from keras.callbacks import ModelCheckpoint

# Importing optimizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers import SGD

# Library to avoid the warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Import and read the data set
data = pd.read_csv('/content/drive/My Drive/AIML Colab files/Project 7/bank.csv')

In [None]:
# Display 5 sample rows
data.sample(5)

In [None]:
#create a copy of the data
df = data.copy()

In [None]:
# drop the first 3 columns (row index, customer ID and surname)
df.drop(['RowNumber','CustomerId','Surname'], axis=1, inplace=True)
# display the data frame columns to confirm the drop command was successful
df.columns

The first two columns, 'RowNumber' and 'CustomerId', are unique for every customer, so they carry no predictive information for our classification problem. 'Surname' is shared between customers only by chance and is not a meaningful predictor either. Therefore, all three columns are dropped.

In [None]:
#check if the dataset has null values
df.isnull().sum()

In [None]:
#check if the dataset has duplicated values
df.duplicated().sum()

In [None]:
#display the data types of all variables
df.dtypes

In [None]:
#Derive the 5 point summary of the data
df.describe(include='all')

There are two categorical variables:

<ul><li>Geography: Contains 3 unique values, with "France" being the most common, representing 50% of the customers.</li></ul>
<ul><li>Gender: Includes 2 unique values, with "Male" being the most frequent.</li></ul>

Additional categorical indicators:

<ul><li>HasCrCard: Most customers hold a credit card.</li></ul>
<ul><li>IsActiveMember: About half of the customers are active members.</li></ul>
<ul><li>Exited: This is the target variable, with at least 75% of customers remaining with the bank.</li></ul>

The remaining five variables are numerical:

<ul><li>CreditScore: Normally distributed, ranging from 350 to 850, with an average of 650.</li></ul>
<ul><li>Age: Right-skewed, with values from 18 to 92.</li></ul>
<ul><li>Tenure: Normally distributed, with values from 0 to 10.</li></ul>
<ul><li>Balance: Left-skewed, ranging from 0 to 250,898, and an average of 97,198.</li></ul>
<ul><li>EstimatedSalary: Left-skewed, with values from 11.58 to 199,992 and a mean of 100,193.</li></ul>

These observations will be validated through Exploratory Data Analysis (EDA).




<b>Exploratory Data Analysis</b>

1. Univariate Analysis<br>

EDA for the numerical variables<br>
We will create a function that plots a histogram and a box plot for each numerical variable.

In [None]:
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle marker will indicate the mean value of the column
    # Histogram: use the requested number of bins, otherwise fall back to discrete (unit-width) bins
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, discrete=True)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# create a list of numerical columns
numerical = ['CreditScore', 'Age', 'Balance','EstimatedSalary']

# plotting the numerical variables
for i in numerical:
    histogram_boxplot(df, i, kde=True, figsize=(8, 4))

<b>Observations</b><br>

<ul><li>Credit Score: The box plot indicates that the mean and median are aligned, though some outliers appear beyond the left whisker. This raises the question of whether these outliers represent customers who leave the bank, which will be explored in the bivariate analysis. The outliers are consistent with the dataset and don’t require removal.</ul>

<ul><li>Age: The distribution is right-skewed, with several outliers beyond the right whisker. These outliers are also consistent with the data and do not need to be removed.</ul>

<ul><li>Balance: The box plot shows a right-skewed distribution with no outliers. However, the histogram lacks visual clarity, so it will be re-plotted separately for more detailed analysis.</ul>

<ul><li>Estimated Salary: The box plot suggests an almost perfect normal distribution, with mean and median alignment and whiskers of roughly equal length. Since the histogram is not visually clear, it will be re-plotted on a logarithmic scale for better insight.</ul>

In [None]:
# plotting the Balance histogram separately (linear scale)
plt.hist(df['Balance']);
plt.title('Balance');

The balance histogram is right-skewed and shows a significant number of customers (approx. 3500) with a balance of 0. The remaining customers have almost normally distributed balances. This indicates there are at least two different customer clusters. It will be interesting to study in the bivariate analysis how these clusters vary with respect to our target variable.

Estimated Salary distribution on a logarithmic scale:

In [None]:
#plotting the data on a logarithmic scale
plt.hist(np.log(df['EstimatedSalary']));
plt.title('log(EstimatedSalary)');

In [None]:
plt.boxplot(np.log(df['EstimatedSalary']));

On a logarithmic scale, the Estimated Salary is heavily left-skewed, with a high number of outliers beyond the left whisker. The outliers still seem realistic, as some customers may be older and retired with no salary income, or may not be working.
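
Part of this left skew may come from the transform itself: the logarithm compresses large values and stretches small ones, so even a roughly uniform variable looks left-skewed after `np.log`. The sketch below uses synthetic data only; the uniform range is an assumption chosen to resemble the salary range quoted earlier, not the actual column.

```python
import math
import random
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

random.seed(1)
# Synthetic stand-in for a roughly uniform salary column (illustrative only)
salary = [random.uniform(11.58, 199_992) for _ in range(10_000)]
log_salary = [math.log(s) for s in salary]

raw_skew = skewness(salary)      # close to 0 for a uniform sample
log_skew = skewness(log_salary)  # clearly negative: a long left tail

print(f"skew(raw) = {raw_skew:.2f}, skew(log) = {log_skew:.2f}")
```

If the real column behaves similarly, the "outliers" on the log scale would say more about the transform than about unusual customers.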

<b>EDA For the categorical variables</b><br>
We will create a function that plots a labelled bar plot for each categorical variable.

In [None]:
# function to plot a labelled barplot
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [None]:
# create a list of categorical columns
categorical = ['Geography', 'Gender', 'NumOfProducts', 'HasCrCard', 'IsActiveMember','Exited']
# plotting the categorical variables
for i in categorical:
    labeled_barplot(df, i, perc=True)

<b>Observations</b><br>
<ul><li>Geography: 50% of customers are from France, 25% from Germany, and 25% from Spain.</li></ul>

<ul><li>Gender: Approximately 55% of customers are male, with the remainder female.</li></ul>

<ul><li>NumOfProducts: Half of the customers use only one product, around 46% use two products, and the remaining 3.3% use three to four products.</li></ul>

<ul><li>HasCrCard: Most customers (70%) have credit cards.</li></ul>

<ul><li>IsActiveMember: Slightly more than half of the customers are active members.</li></ul>

<ul><li>Exited: Around 20% of customers have exited, while 80% have not. This indicates an imbalance in the target.</li></ul>
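
To see why this imbalance matters for evaluation, consider a trivial classifier that always predicts "did not exit". The sketch below uses illustrative labels with the roughly 80/20 split described above, not the real data.

```python
# Illustrative labels with an 80/20 split (not the real data)
y_true = [0] * 80 + [1] * 20
y_pred = [0] * len(y_true)  # always predict "did not exit"

# Accuracy looks respectable even though no exiting customer is found
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the exiting class: true positives / all actual positives
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(t == 1 for t in y_true)

print(accuracy, recall)
```

Accuracy alone would therefore be a misleading yardstick for this target.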

<b>2. Bivariate analysis</b><br><br>
<ul><li>We use visualizations to identify patterns and insights in how the variables relate to each other and to the target.</li></ul>
<ul><li>We start by plotting the pair plot and heatmap to observe whether there are any correlations between variables and/or clear clustering.</li></ul>

In [None]:
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(df, hue="Exited", diag_kind='kde')

In [None]:
plt.figure(figsize=(10,10))
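# numeric_only=True skips the string columns (Geography, Gender), which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True),vmax=1,vmin=-1,annot=True)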

<b>Observations:</b><br>
<ul><li>The pair plot and heat map reveal no notable correlations between the variables.</li></ul>
<ul><li>The pair plot indicates distinct clustering patterns in certain variables: Balance (2 clusters), Num of Products (3-4 clusters), HasCrCards (2 clusters), and IsActiveMember (2 clusters).</li></ul>
<ul><li>Both exited and non-exited customers are represented across all clusters. However, due to class imbalance, exited customers are significantly fewer than non-exited customers.</li></ul>

In [None]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of variable for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=False,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of variable for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=False,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],

        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

<b>Observations:</b><br>
<ul><li>CreditScore: The distribution of this variable is similar across both classes of the target variable, and the presence of outliers does not impact the distribution. Therefore, CreditScore is a poor predictor.</li></ul>
<ul><li>Age: The median age is higher for customers who exited the bank compared to those who stayed, with more outliers among the exited customers. This suggests that customers around the median age of 45 are more likely to leave the bank. Age is thus considered a weak predictor of the target variable.</li></ul>
<ul><li>Balance: There is a high concentration of customers with a balance of zero among those who exited. Additionally, the interquartile range (IQR) for balance is narrower for exited customers compared to those who stayed, and the median balance is slightly higher for those who exited. Consequently, Balance is regarded as a very weak predictor.</li></ul>
<ul><li>EstimatedSalary: This variable has a nearly uniform distribution across both classes, indicating little predictive power.</li></ul>

In [None]:
for i in numerical:
  distribution_plot_wrt_target(df, i, 'Exited')

In [None]:
# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    # place the legend outside the plot area (a single call; a second legend call would override the first)
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

<b>Observations</b><br>
<ul><li>Geography: The highest proportion of exited customers comes from Germany, while Spain and France show almost identical exit ratios. This variable is considered a reasonably good predictor.</li></ul>
<ul><li>Gender: A greater proportion of exited customers are female, making Gender a fairly good predictor.</li></ul>
<ul><li>NumOfProducts: All customers with 4 products, as well as a large proportion of those with 3 products, have exited. The lowest exit ratio is among customers with 2 products, indicating strong predictive power for this variable.</li></ul>
<ul><li>HasCrCard: The exit ratio is nearly the same for customers with and without credit cards, making this a very poor predictor.</li></ul>
<ul><li>IsActiveMember: A higher proportion of non-active customers have exited compared to active ones, making this a fairly good predictor.</li></ul>
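
The exit ratios compared above are row-normalized shares: within each category, the number of exited customers divided by the category total, which is what `pd.crosstab(..., normalize="index")` computes in `stacked_barplot`. A minimal sketch of that arithmetic, using hypothetical category labels and rows:

```python
from collections import Counter

# Hypothetical (category, exited) pairs, for illustration only
rows = [("A", 0), ("A", 0), ("A", 1), ("B", 0), ("B", 1), ("B", 1), ("B", 1), ("C", 0)]

totals = Counter(cat for cat, _ in rows)
exits = Counter(cat for cat, exited in rows if exited == 1)

# Share of exited customers within each category
exit_ratio = {cat: exits[cat] / totals[cat] for cat in totals}

# Sort categories from highest to lowest exit ratio, as the stacked bars are ordered
ranked = sorted(exit_ratio.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Because each category is normalized separately, a small category can show a very high exit ratio from only a handful of customers, so the raw counts printed by the function are worth reading alongside the chart.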

In [None]:
categorical = ['Geography', 'Gender', 'NumOfProducts', 'HasCrCard', 'IsActiveMember']

# plotting the categorical variables
for i in categorical:
    stacked_barplot(df, i, 'Exited')

<b>EDA Insights</b><br><br>
From the Univariate Analysis we conclude:

<ul><li>All bank customers are based in France, Germany or Spain, and their ages vary from 18 to 92 years.</li></ul>
<ul><li>The majority of customers are male, from France, have credit cards and have not exited the bank.</li></ul>
From the Bivariate Analysis we conclude:

<ul><li>There is no collinearity or notable correlation observed between features.</li></ul>
<ul><li>The fairly good predictors are: Age, Geography, Gender, NumOfProducts and IsActiveMember.</li></ul>
<ul><li>Features of customers who are most likely to exit the bank are:<br><br>
<ul><li>Age: Ranges from 28 to 52, with a median of 45</li>
<li>Location: Germany</li>
<li>Gender: Female</li>
<li>Number of Products: 3 to 4</li>
<li>Active Member: No</li></ul>
</li></ul>

<b>Data Pre-processing</b><br><br>
Splitting the target variable and predictors

In [None]:
X = df.drop('Exited',axis=1)
y = df['Exited']

Applying one-hot-encoding for the categorical features

In [None]:
dummy_cat = ['Geography','Gender']
X = pd.get_dummies(X,columns=dummy_cat,drop_first= True)
X.head()

3. Scaling the data using z-scores so that every variable has a mean of 0 and a standard deviation of 1.
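
Applying `zscore` to the full dataset before splitting lets the test rows influence the mean and standard deviation. A common alternative, sketched here under the assumption that the split happens first, learns the statistics from the training rows only and reuses them for the test rows. The values and helper names below are illustrative.

```python
import statistics

def fit_zscore(train_values):
    """Return the mean and population standard deviation of the training values."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_zscore(values, mean, std):
    """Scale values with statistics that were learned elsewhere (the training set)."""
    return [(v - mean) / std for v in values]

train_age = [25, 35, 45, 55]  # illustrative training values
test_age = [40, 65]           # illustrative test values

mean, std = fit_zscore(train_age)
train_scaled = apply_zscore(train_age, mean, std)
test_scaled = apply_zscore(test_age, mean, std)  # test rows never affect mean/std

print(train_scaled, test_scaled)
```

`StandardScaler`, which is already imported in this notebook, follows the same pattern: `fit` on the training data, then `transform` both sets.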


In [None]:
# to scale the data
from scipy.stats import zscore
X_Scaled = X.apply(zscore)
X_Scaled.head()

Splitting the data into train and test sets

In [None]:
# Split the non-scaled data into train and test
X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

In [None]:
# Split the scaled data into train and test
X_train, X_test, y_train, y_test = train_test_split(X_Scaled, y, test_size=0.2, random_state=1, stratify=y)

In [None]:
#displaying the shape of train and test data sets to confirm the split was successful
print(f''' X_train shape: {X_train.shape}
 X_test shape: {X_test.shape}
 y_train shape: {y_train.shape}
 y_test shape: {y_test.shape}''')

4. Data Oversampling: To balance the target class ratio, we prepare an oversampled training set using the SMOTE algorithm. Only the training data is resampled; note that it is already z-score scaled.
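
The core idea behind SMOTE is interpolation: a synthetic minority row is placed at a random point on the segment between a real minority row and one of its nearest minority neighbors. The sketch below is a simplified illustration of that single step, not imbalanced-learn's implementation; the neighbor search is omitted and the feature values are made up.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Synthetic point on the segment between a minority sample and one of its neighbors."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(1)
minority_point = [0.2, -1.0]    # illustrative scaled features
nearest_neighbor = [0.6, -0.4]  # assumed to be one of its k nearest minority neighbors

synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Since synthetic rows are combinations of existing minority rows, they add no genuinely new information; they only change how much weight the minority region receives during training.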

In [None]:
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)

X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After UpSampling, the shape of X_train: {}".format(X_train_over.shape))
print("After UpSampling, the shape of y-train: {} \n".format(y_train_over.shape))

5. Data Undersampling: To balance the target class ratio, we prepare an undersampled training set using the Random Under Sampler. Only the training data is resampled; note that it is already z-score scaled.
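
A float `sampling_strategy` is the desired minority-to-majority ratio after resampling. The arithmetic can be sketched as follows; the class counts are hypothetical and the helper names are made up for illustration.

```python
def undersampled_majority_count(n_minority, sampling_strategy):
    """Majority rows kept so that minority / majority equals sampling_strategy."""
    return int(n_minority / sampling_strategy)

def oversampled_minority_count(n_majority, sampling_strategy):
    """Minority rows after oversampling so that minority / majority equals sampling_strategy."""
    return int(n_majority * sampling_strategy)

n_majority, n_minority = 6400, 1600  # hypothetical training counts

# Undersampling with 0.5 keeps two majority rows per minority row
kept_majority = undersampled_majority_count(n_minority, 0.5)

# Oversampling with 1 grows the minority class to the majority size
grown_minority = oversampled_minority_count(n_majority, 1)

print(kept_majority, grown_minority)
```

With a strategy of 0.5 the undersampled set is therefore still imbalanced (one third positives), whereas the SMOTE set with a strategy of 1 is fully balanced.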

In [None]:
# applying the random under sampler using sampling strategy 0.5
random_us = RandomUnderSampler(random_state=1, sampling_strategy = 0.5)
X_train_un, y_train_un = random_us.fit_resample(X_train, y_train)

print("After UnderSampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))


print("After UnderSampling, the shape of X_train: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of y-train: {} \n".format(y_train_un.shape))

The dataset is now ready for modeling. The oversampled and undersampled data will be used only if required to improve model performance.

<b>Model building</b><br><br>
Our key performance metric is Recall, because we aim to reduce False Negatives: we want to correctly identify the customers who are most likely to exit the bank so that early action can be taken to retain them. We will start with a very simple model using rule-of-thumb values as initial parameters, observe its performance, and then improve it.
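
For reference, Recall is the share of actual exiting customers that the model catches, while Precision is the share of predicted exits that are real. A small sketch with hypothetical counts:

```python
def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn)

def precision(tp, fp):
    """True positives divided by all predicted positives."""
    return tp / (tp + fp)

# Hypothetical counts: 100 customers actually exit, the model flags 80 customers in total
tp, fn, fp = 60, 40, 20

model_recall = recall(tp, fn)        # 60 of the 100 exiting customers are caught
model_precision = precision(tp, fp)  # 60 of the 80 flagged customers really exit

print(model_recall, model_precision)
```

Every missed exiting customer (a false negative) lowers Recall, which is why it is the metric tracked throughout the model comparisons.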

In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
import random
random.seed(1)
tf.random.set_seed(1)

<b>Model_01:</b><br>
<ul><li>Hidden layers = 2</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output Activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = Adam</li></ul>
<ul><li>epochs = 20</li></ul>
<ul><li>Data is scaled using zscore</li></ul>
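
As a sanity check on the model summary, the number of trainable parameters in a stack of dense layers can be computed by hand: each layer has one weight per input-unit pair plus one bias per unit. The layer sizes below are an assumption taken from the later model definitions in this notebook (11 input features, hidden layers of 64 and 32 units, 1 output node).

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Assumed sizes: input features, first hidden layer, second hidden layer, output
layer_sizes = [11, 64, 32, 1]

params_per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)
```

If the printed summary reports a different total, the actual architecture differs from the sizes assumed here.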

In [None]:
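# Model definition (assumed to mirror model_02 below), followed by the model summary
model_01 = Sequential([Dense(64, activation='relu', input_dim=11),
                       Dense(32, activation='relu'), Dense(1, activation='sigmoid')])
model_01.summary()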

In [None]:
# Compiling the model with 'cross entropy' as loss function, 'Adam' Optimizer and 'accuracy' metrics
model_01.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

# Fitting the model on train and validation with 20 epochs
fitted_model_01 = model_01.fit(X_train, y_train, validation_split=0.2,epochs=20)

In [None]:
#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_01.history['loss'])
plt.plot(fitted_model_01.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

<b>Observation:</b><br> <br>The gap between the validation and training loss is increasing, which indicates overfitting that needs to be handled. We will examine the model's performance on other metrics before deciding on the next step.<br>
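
One common way to act on a widening gap is early stopping: keep the epoch with the lowest validation loss and stop once it has not improved for a set number of epochs. The sketch below mimics that idea in plain Python with hypothetical loss values; it is not Keras's implementation, although `callbacks.EarlyStopping` with a `patience` argument works along the same lines.

```python
def best_epoch(val_losses, patience=3):
    """Index of the best epoch, stopping after `patience` epochs without improvement."""
    best_idx, best_loss, waited = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, waited = i, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stalled for `patience` epochs
    return best_idx

# Hypothetical validation losses: improving at first, then drifting upward
val_loss = [0.45, 0.40, 0.37, 0.36, 0.365, 0.37, 0.38, 0.39]
stop_at = best_epoch(val_loss, patience=3)

print(stop_at)
```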

Deriving more metrics

In [None]:
# defining a function to display the confusion matrix
def make_confusion_matrix(cf,
                          group_names=None,
                          categories='auto',
                          count=True,
                          percent=True,
                          cbar=True,
                          xyticks=True,
                          xyplotlabels=True,
                          sum_stats=True,
                          figsize=None,
                          cmap='Blues',
                          title=None):
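    '''
    Plot an sklearn confusion matrix `cf` as a Seaborn heatmap.

    cf: confusion matrix (2D array) to plot
    group_names: optional labels for the cells, in row-major order
    categories: tick labels for the axes (default 'auto')
    count / percent: whether to show counts / percentages in each cell
    '''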


    # CODE TO GENERATE TEXT INSIDE EACH SQUARE
    blanks = ['' for i in range(cf.size)]

    if group_names and len(group_names)==cf.size:
        group_labels = ["{}\n".format(value) for value in group_names]
    else:
        group_labels = blanks

    if count:
        group_counts = ["{0:0.0f}\n".format(value) for value in cf.flatten()]
    else:
        group_counts = blanks

    if percent:
        group_percentages = ["{0:.2%}".format(value) for value in cf.flatten()/np.sum(cf)]
    else:
        group_percentages = blanks

    box_labels = [f"{v1}{v2}{v3}".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]
    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])


    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS
    if sum_stats:
        #Accuracy is sum of diagonal divided by total observations
        accuracy  = np.trace(cf) / float(np.sum(cf))



    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS
    if figsize is None:
        #Get default figure size if not set
        figsize = plt.rcParams.get('figure.figsize')

    if xyticks==False:
        #Do not show categories if xyticks is False
        categories=False


    # MAKE THE HEATMAP VISUALIZATION
    plt.figure(figsize=figsize)
    sns.heatmap(cf,annot=box_labels,fmt="",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)


    if title:
        plt.title(title)

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_01.predict(X_test)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
from sklearn import metrics
cr=metrics.classification_report(y_test,y_pred)
print(cr)

The macro average of Recall is considerably low. Since this is only our first model, we can surely improve on it. Let us observe the confusion matrix.

In [None]:
#displaying the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
# sklearn orders the flattened 2x2 matrix as TN, FP, FN, TP for labels [0, 1]
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

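In the confusion matrix, 4.55% of the test customers fall in the top-right cell and 10.5% in the bottom-right cell. With sklearn's layout (rows are actual classes, columns are predicted classes), these cells are the false positives (customers predicted to exit who stayed) and the true positives (exiting customers correctly identified); the false negatives, which Recall penalizes, are in the bottom-left cell. The imbalance between the target classes is expected to affect these figures. Our next step is to use the oversampled training set to train the same neural network created in Model_01 and observe whether performance improves.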
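
The cell layout is easy to misread, so here is a minimal pure-Python sketch of how a 2x2 confusion matrix is filled when rows are actual classes and columns are predicted classes, the same convention `sklearn.metrics.confusion_matrix` uses. The labels are made up for illustration.

```python
def confusion_2x2(y_true, y_pred):
    """2x2 matrix with actual classes on rows and predicted classes on columns."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Illustrative labels: 0 = not exiting, 1 = exiting
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1]

cm = confusion_2x2(y_true, y_pred)
(tn, fp), (fn, tp) = cm  # row-major order: TN, FP, FN, TP

recall_exiting = tp / (tp + fn)
print(cm, recall_exiting)
```

Reading the cells in this order keeps the false negatives (bottom-left) tied to the Recall figure in the classification report.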

<b>Model Performance Improvement</b><br><br>
Going forward, multiple models will be created using different combinations of hyperparameters, including:

<ul><li>Type of Optimizer</li></ul>
<ul><li>Number of Layers</li></ul>
<ul><li>Number of Neurons in a layer</li></ul>
<ul><li>Weight initialization</li></ul>
<ul><li>Regularization techniques (Dropout and Batch Normalization)</li></ul>
<ul><li>Learning Rate</li></ul>
<ul><li>Batch Size</li></ul><br>
Also, we will use GridSearchCV and Keras Tuner to tune some of the above hyperparameters.<br>

Finally, we will keep in mind that the target classes are imbalanced, hence we will use the oversampled and undersampled data prepared in the data pre-processing step whenever required.<br>

For each new model, the hyperparameters will be stated and the model performance will be evaluated on several metrics, including our most important metric, Recall.
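
A grid search simply enumerates every combination of the candidate values. The sketch below shows how quickly the number of models grows; the search space is hypothetical and its values are placeholders, not the ones tuned in this notebook.

```python
from itertools import product

# Hypothetical search space (placeholder values)
grid = {
    "optimizer": ["adam", "sgd"],
    "hidden_layers": [2, 3],
    "dropout": [0.0, 0.2],
    "batch_size": [32, 64],
}

# One dict per combination, in the order the grid would be traversed
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))
print(combinations[0])
```

Four hyperparameters with two values each already require 16 training runs, which is why tuning is usually limited to a few hyperparameters at a time.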

<b>Model_02:</b><br><br>
<ul><li>Hidden layers = 2</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output Activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = Adam</li></ul>
<ul><li>epochs = 20</li></ul>
<ul><li>Data is scaled using zscore</li></ul>
<ul><li>Training the model on the oversampled train data set (using SMOTE)</li></ul>

In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
# Initializing the model
model_02 = Sequential()
# Adding the first hidden layer with 64 neurons, relu as activation function
model_02.add(Dense(64, activation='relu', input_dim=11))
# Adding the second hidden layer with 32 neurons, relu as activation function
model_02.add(Dense(32, activation='relu'))
# Adding the output layer with one neuron and Sigmoid as activation
model_02.add(Dense(1, activation='sigmoid'))

In [None]:
# Compiling the model with 'cross entropy' as loss function, 'Adam' Optimizer and 'accuracy' metrics
model_02.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

# Fitting the model on train and validation with 20 epochs
fitted_model_02 = model_02.fit(X_train_over, y_train_over, validation_split=0.2,epochs=20)

#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_02.history['loss'])
plt.plot(fitted_model_02.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

There is very significant noise in the validation loss, and the gap between validation and training loss is still significantly high. Hence, it appears that oversampling has not helped the model train better. Let us calculate the metrics and the confusion matrix.
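
One possible contributor to the noisy validation curve is how the validation rows are chosen. Keras takes the `validation_split` fraction from the end of the arrays it receives, before shuffling, and SMOTE output typically lists the synthetic rows after the original ones. Under those assumptions, and with hypothetical class counts, the validation set can end up consisting only of synthetic minority rows:

```python
n_majority, n_minority = 6400, 1600  # hypothetical training counts before SMOTE

# Original rows first, then the synthetic minority rows (assumed to be appended last)
n_synthetic = n_majority - n_minority
is_synthetic = [False] * (n_majority + n_minority) + [True] * n_synthetic

validation_split = 0.2
split_at = int(len(is_synthetic) * (1 - validation_split))
val_rows = is_synthetic[split_at:]  # the last 20% of the rows

synthetic_share = sum(val_rows) / len(val_rows)
print(len(val_rows), synthetic_share)
```

If this applies, shuffling the resampled data first, or passing an explicit `validation_data` set taken before resampling, would give a more representative validation loss.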

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_02.predict(X_test)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
# sklearn orders the flattened 2x2 matrix as TN, FP, FN, TP for labels [0, 1]
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

<b>Observation:</b><br><br>

<ul><li>Training the model on oversampled data has not improved performance significantly and has added noise to the validation loss. Although the Recall has improved from 0.73 to 0.75, the percentage in the top-right cell of the confusion matrix (false positives in sklearn's layout) has increased and the overfitting has not improved.</li></ul><br><br>
<b>Conclusion:</b><br>

<ul><li>The NN requires hyperparameter tuning, which will be our next step.</li></ul>
<b>Notes:</b><br>

<ul><li>A trial using the undersampled training data with the model_02 parameters resulted in a lower Recall of 0.70, hence it was not included in this notebook.</li></ul>
<ul><li>A trial using 50 epochs on the oversampled training data with the same model_02 parameters resulted in a lower Recall of 0.73, hence it was not included in this notebook.</li></ul>

<b>Model_03:</b><br><br>
In this model we will use the basic training data (not oversampled) and change only the optimizer to SGD instead of Adam.

<ul><li>Hidden layers = 2</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output Activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = SGD</li></ul>
<ul><li>epochs = 20</li></ul>
<ul><li>Data is scaled using zscore</li></ul>
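
For context on the optimizer change: plain SGD moves every weight against its gradient using one fixed learning rate, whereas Adam rescales each step using running averages of past gradients. A minimal sketch of a single SGD update follows; the weights and gradients are hypothetical, and the learning rate of 0.01 is assumed to match the Keras default for `'SGD'`.

```python
def sgd_step(weights, grads, learning_rate=0.01):
    """One plain SGD update: move each weight against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]

weights = [0.5, -0.3]  # hypothetical current weights
grads = [0.2, -0.1]    # hypothetical gradients of the loss

new_weights = sgd_step(weights, grads)
print(new_weights)
```

With such small, uniform steps, 20 epochs of SGD typically move the weights less than 20 epochs of Adam, which tends to give smoother loss curves but slower fitting.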

In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
# Initializing the model
model_03 = Sequential()
# Adding the first hidden layer with 64 neurons, relu as activation function
model_03.add(Dense(64, activation='relu', input_dim=11))
# Adding the second hidden layer with 32 neurons, relu as activation function
model_03.add(Dense(32, activation='relu'))
# Adding the output layer with one neuron and Sigmoid as activation
model_03.add(Dense(1, activation='sigmoid'))

In [None]:
# Compiling the model with 'cross entropy' as loss function, 'SGD' Optimizer and 'accuracy' metrics
model_03.compile(loss='binary_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])


# Fitting the model on train and validation with 20 epochs
fitted_model_03 = model_03.fit(X_train, y_train, validation_split=0.2,epochs=20)

In [None]:
#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_03.history['loss'])
plt.plot(fitted_model_03.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

The training and validation losses are very close and the curves are much smoother (less noise). Let us explore how the performance metrics vary.



In [None]:
# Obtain the y_predict from the X_test
y_pred=model_03.predict(X_test)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

The recall has actually decreased when compared to the previous models. Let us display the confusion matrix and observe how the FN count changed.



In [None]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Positive','False Negative','False Positive','True Negative']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

The FN count has decreased yet the target variable imbalance is still affecting the figures. Let us train the same model on the oversampled data and observe the difference.
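
For reference when reading the matrix: scikit-learn's `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns, so with classes 0 (Not Exiting) and 1 (Exiting) the flattened cell order is TN, FP, FN, TP, and recall for the Exiting class is TP / (TP + FN). The sketch below reproduces that layout in plain Python on made-up labels (the data and the helper `confusion_counts` are hypothetical). It can be handy for checking that the group names passed to a plotting helper line up with the cells.

```python
def confusion_counts(y_true, y_pred):
    # rows = true class, columns = predicted class, classes ordered 0 then 1
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# hypothetical labels: 6 customers who stayed (0) and 4 who exited (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

cm = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = cm

recall = tp / (tp + fn)        # share of exiting customers that were caught
precision = tp / (tp + fp)     # share of predicted exits that were real
print(cm, recall, precision)
```

On a fixed test set, recall for the Exiting class and the FN count move in opposite directions, since TP + FN is constant.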



<b>Model_04:</b><br><br>
In this model we will use the oversampled train data and keep the optimizer as SGD instead of Adam

<ul><li>Hidden layers = 2</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = SGD</li></ul>
<ul><li>epochs = 20</li></ul>
<ul><li>Data is scaled using zscore</li></ul>
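
How `X_train_over` and `y_train_over` were built is not shown in this part of the notebook. As a rough illustration of the idea only, the sketch below performs simple random oversampling (duplicating minority rows until the classes are balanced) in plain Python. The function name and toy data are hypothetical, and the notebook may well have used a different technique such as SMOTE.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=1):
    # duplicate randomly chosen rows of each smaller class until all classes
    # have as many rows as the largest class; new rows are appended at the end
    rng = random.Random(seed)
    counts = Counter(y)
    majority = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(majority - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

# toy data: 8 rows of class 0 and 2 rows of class 1
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
print(Counter(y), Counter(y_over))
```

Only the training data should be resampled; the test set keeps its original class balance so that the reported metrics reflect real conditions.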

In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
import random
random.seed(1)
tf.random.set_seed(1)

In [None]:
# Initializing the model
model_04 = Sequential()
# Adding the first hidden layer with 64 neurons, relu as activation function
model_04.add(Dense(64, activation='relu', input_dim=11))
# Adding the second hidden layer with 32 neurons, relu as activation function
model_04.add(Dense(32, activation='relu'))
# Adding the output layer with one neuron and Sigmoid as activation
model_04.add(Dense(1, activation='sigmoid'))

# Compiling the model with 'cross entropy' as loss function, 'SGD' Optimizer and 'accuracy' metrics
model_04.compile(loss='binary_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])

# Fitting the model on train and validation with 20 epochs
fitted_model_04 = model_04.fit(X_train_over, y_train_over, validation_split=0.2,epochs=20)

#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_04.history['loss'])
plt.plot(fitted_model_04.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

There is a large gap between the error on the train and validation sets (a sign of an overfit model), and noise is observed on the validation set. Let us calculate the remaining metrics to see if there is any improvement in the recall.

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_04.predict(X_test)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

The recall has increased from 0.69 to 0.75 which is an advantage of using the oversampled data over the non-sampled training data. Let us finally explore the confusion matrix.

In [None]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Positive','False Negative','False Positive','True Negative']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

The FN Count has increased significantly from 3% to 10%.<br><br>

<b>Conclusion:</b><br><br>

Using oversampled data enhances the recall of the model, yet it adds a lot of noise and increases the overfitting and the FN count.
The SGD optimizer still shows less noise and less overfitting in the model, hence it is preferred over the Adam optimizer.


<b>Model_05:</b><br><br>
In this model we will add 2 more hidden layers to have a total of 4 hidden layers (160, 64, 32 and 16 neurons respectively) and also add weight initialization using the He technique (since we are using the ReLU activation function).<br>
<ul><li>Hidden layers = 4</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Weight initialization technique = He</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = SGD</li></ul>
<ul><li>epochs = 20</li></ul>
<ul><li>Data is scaled using zscore</li></ul>
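
As a small illustration of what He uniform initialization does, the sketch below draws weights from a uniform distribution on [-limit, limit] with limit = sqrt(6 / fan_in), which gives a weight variance of 2 / fan_in. It is a plain-Python approximation for intuition only; the helper name `he_uniform` and the seed are choices made for this example, not the Keras internals.

```python
import math
import random

def he_uniform(fan_in, fan_out, seed=1):
    # He uniform: U(-limit, limit) with limit = sqrt(6 / fan_in)
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / fan_in)
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, limit

# first hidden layer of model_05: 11 inputs feeding 160 neurons
weights, limit = he_uniform(fan_in=11, fan_out=160)

flat = [w for row in weights for w in row]
variance = sum(w * w for w in flat) / len(flat)
print(limit, variance, 2.0 / 11)
```

The scale depends only on the number of inputs to the layer, which is meant to keep the variance of ReLU activations roughly stable from layer to layer.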

In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
# Initializing the model
model_05 = Sequential()
# Adding the first hidden layer, relu as activation function and, he_uniform as weight initializer.
model_05.add(Dense(160, activation='relu', input_dim=11, kernel_initializer='he_uniform'))
# Adding the second hidden layer , relu as activation function and, he_uniform as weight initializer
model_05.add(Dense(64, activation='relu', kernel_initializer='he_uniform'))
# Adding the third hidden layer , relu as activation function and, he_uniform as weight initializer
model_05.add(Dense(32, activation='relu', kernel_initializer='he_uniform'))
# Adding the fourth hidden layer , relu as activation function and, he_uniform as weight initializer
model_05.add(Dense(16, activation='relu', kernel_initializer='he_uniform'))

# Adding the output layer with one neuron and Sigmoid as activation
model_05.add(Dense(1, activation='sigmoid', kernel_initializer='he_uniform'))

# Compiling the model with 'cross entropy' as loss function, 'SGD' Optimizer and 'accuracy' metrics
model_05.compile(loss='binary_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])

# Fitting the model on train and validation with 20 epochs
fitted_model_05 = model_05.fit(X_train_over, y_train_over, validation_split=0.2,epochs=20)

#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_05.history['loss'])
plt.plot(fitted_model_05.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_05.predict(X_test)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

#Calculating the confusion matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Positive','False Negative','False Positive','True Negative']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

<b>Observation:</b><br><br>

The recall has decreased while the FN Ratio has stayed almost the same. The weight initialization does not seem to add any value in enhancing the model performance.

<b>Model_06:</b><br>

In this model we will increase the number of epochs to 50 in order to decrease the error values, use the oversampled data and remove the weight initializer.<br><br>
<ul><li>Hidden layers = 4</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = SGD</li></ul>
<ul><li>epochs = 50</li></ul>
<ul><li>Data is scaled using zscore</li></ul>

In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
# Initializing the model
model_06 = Sequential()
# Adding the first hidden layer, relu as activation function
model_06.add(Dense(160, activation='relu', input_dim=11))
# Adding the second hidden layer , relu as activation function
model_06.add(Dense(64, activation='relu'))
# Adding the third hidden layer , relu as activation function
model_06.add(Dense(32, activation='relu'))
# Adding the fourth hidden layer , relu as activation function
model_06.add(Dense(16, activation='relu'))
# Adding the output layer with one neuron and Sigmoid as activation
model_06.add(Dense(1, activation='sigmoid'))

# Compiling the model with 'cross entropy' as loss function, 'SGD' Optimizer and 'accuracy' metrics
model_06.compile(loss='binary_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])

# Fitting the model on train and validation with 50 epochs
fitted_model_06 = model_06.fit(X_train_over, y_train_over, validation_split=0.2,epochs=50)

#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_06.history['loss'])
plt.plot(fitted_model_06.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

The error value has decreased and seems to stabilize a bit for the validation set as the epochs increase, yet the noise observed on the validation set is higher with more epochs. It seems that 50 epochs is the right value to stop at, yet we need to reduce the noise and overfitting, hence we will apply some regularization techniques. Let us observe the performance metrics first.

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_06.predict(X_test)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Positive','False Negative','False Positive','True Negative']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

The recall has decreased and the FN ratio has decreased slightly. Yet the model is still overfitting, and the noise on the validation set requires further reduction.

<b>Model_07:</b><br><br>
In this model we will apply the Dropout Regularization technique while keeping all other hyperparameters and NN Structure fixed as in model_06<br><br>
<ul><li>Hidden layers = 4</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = SGD</li></ul>
<ul><li>epochs = 50</li></ul>
<ul><li>Data is scaled using zscore</li></ul>
<ul><li>Regularization technique: Dropout</li></ul>
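
To show what a dropout rate of 20% means in practice, here is a minimal plain-Python sketch of inverted dropout: during training each activation is zeroed with probability `rate` and the survivors are scaled by 1 / (1 - rate), while at inference the values pass through unchanged. The function name and the fixed seed are choices made for this illustration, not part of the Keras layer.

```python
import random

def dropout(activations, rate=0.2, training=True, seed=1):
    # at inference time dropout is a no-op
    if not training:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    # keep each unit with probability `keep`, scaling survivors by 1 / keep
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0] * 1000
dropped = dropout(acts, rate=0.2)

zero_fraction = sum(1 for a in dropped if a == 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(zero_fraction, mean_activation)
```

Because of the rescaling, the expected activation stays the same with and without dropout. It also means the training loss is computed with dropout active while the validation loss is not, so the two curves are not measured under identical conditions.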


In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
# Initializing the model
model_07 = Sequential()
# Adding the first hidden layer with 160 neurons, relu as activation function
model_07.add(Dense(160, activation='relu', input_dim=11))
# Adding dropout of 20%
model_07.add(Dropout(0.2))
# Adding the second hidden layer with 64 neurons, relu as activation function
model_07.add(Dense(64, activation='relu'))
# Adding dropout of 20%
model_07.add(Dropout(0.2))
# Adding the third hidden layer with 32 neurons, relu as activation function
model_07.add(Dense(32, activation='relu'))
# Adding dropout of 20%
model_07.add(Dropout(0.2))
# Adding the fourth hidden layer with 16 neurons, relu as activation function
model_07.add(Dense(16, activation='relu'))
# Adding dropout of 20%
model_07.add(Dropout(0.2))
# Adding the output layer with one neuron and Sigmoid as activation
model_07.add(Dense(1, activation='sigmoid'))

In [None]:
# Display the Model summary to make sure all layers are well arranged
model_07.summary()

In [None]:
# Compiling the model with 'cross entropy' as loss function, 'SGD' Optimizer and 'accuracy' metrics
model_07.compile(loss='binary_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])

# Fitting the model on train and validation with 50 epochs
fitted_model_07 = model_07.fit(X_train_over, y_train_over, validation_split=0.2,epochs=50)

#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_07.history['loss'])
plt.plot(fitted_model_07.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

The loss curve shows a significant reduction in noise on the validation set, yet the curve is still showing overfitting. Let us compute the performance metrics.

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_07.predict(X_test)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Positive','False Negative','False Positive','True Negative']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

The recall has increased to 0.75, with a small increase in the FN ratio. Let us explore the AUC-ROC curve for a better threshold.

In [None]:
from sklearn.metrics import roc_curve
from matplotlib import pyplot

# predict probabilities
yhat1 = model_07.predict(X_test)
# keep probabilities for the positive outcome only
yhat1 = yhat1[:, 0]
# calculate roc curves
fpr, tpr, thresholds1 = roc_curve(y_test, yhat1)

# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--')
pyplot.plot(fpr, tpr, marker='.')

# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')

# show the plot
pyplot.show()

As we are searching for a threshold that gives a good TPR value while still maintaining a good FPR value, we will aim for the threshold at a TPR value close to 0.80.

In [None]:
# we will hunt for the threshold at TPR 0.8
x = 0.80
print("Value to which nearest element is to be found: ", x)

# calculate the difference array
difference_array = np.absolute(tpr-x)

# find the index of minimum element from the array
index = difference_array.argmin()
print("Nearest element to the given values is : ", tpr[index])
print("Index of nearest value is : ", index)
print("Threshold at this index is:", thresholds1[index])

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_07.predict(X_test)

#set the threshold at tpr=0.8 (roc_curve counts scores >= threshold as positive)
y_pred = (y_pred >= thresholds1[index])

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Positive','False Negative','False Positive','True Negative']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

By applying a very low threshold of 0.26, which achieves a TPR of approx. 0.8, we get the same recall value of 0.75 but a much higher FN ratio, which is a drawback. Note that at a threshold of 0.5 the recall value was the same (0.75), yet the FN ratio was approx. 10%; hence the threshold of 0.5 is preferred over the threshold obtained from the AUC-ROC curve.
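
The trade-off behind threshold selection can be seen on a small worked example. The sketch below, written in plain Python on made-up scores, computes TPR and FPR at a given threshold and searches the observed scores for the threshold whose TPR is closest to a target. The helper names `rates_at` and `threshold_for_tpr` are hypothetical, and a score counts as positive when it is greater than or equal to the threshold.

```python
def rates_at(y_true, scores, threshold):
    # a sample is predicted positive when its score is >= threshold
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

def threshold_for_tpr(y_true, scores, target):
    # try every observed score, highest first, and keep the one whose TPR
    # is closest to the target (ties go to the higher threshold)
    candidates = sorted(set(scores), reverse=True)
    return min(candidates,
               key=lambda th: abs(rates_at(y_true, scores, th)[0] - target))

# hypothetical predicted probabilities for 6 stayers (0) and 5 leavers (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.15, 0.35, 0.55, 0.70, 0.90]

th = threshold_for_tpr(y_true, scores, target=0.80)
tpr_low, fpr_low = rates_at(y_true, scores, th)
tpr_half, fpr_half = rates_at(y_true, scores, 0.5)
print(th, (tpr_low, fpr_low), (tpr_half, fpr_half))
```

Lowering the threshold can only keep or raise both TPR and FPR, so a higher recall is paid for with more customers wrongly flagged as exiting.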

<b>Model_08:</b><br><br>
In model_07 the Dropout regularization showed a slight decrease in overfitting and a noticeable decrease in noise. The recall has not improved significantly, so in this model we will apply the Batch Normalization technique. With that, we will use the non-scaled data X_train_ns, X_test_ns, y_train_ns and y_test_ns.<br><br>

<ul><li>Hidden layers = 4</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Weight initialization technique = None</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = SGD</li></ul>
<ul><li>epochs = 50</li></ul>
<ul><li>Data is not scaled</li></ul>
<ul><li>Regularization technique: Batch Normalization</li></ul>
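
For intuition about what a batch normalization layer computes during training, the sketch below standardizes one batch of values to zero mean and unit variance and then applies a scale `gamma` and shift `beta`. It is a plain-Python illustration on made-up balance values. The epsilon of 1e-3 is assumed to match the Keras default, and at inference time the real layer uses moving averages rather than per-batch statistics.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    # training-mode arithmetic: statistics come from the current batch
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# hypothetical unscaled account balances in one batch
balances = [0.0, 83807.86, 159660.80, 125510.82]
normed = batch_norm(balances)

normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
print(normed, normed_mean, normed_var)
```

Note that in the model below the first `BatchNormalization` layer is placed after the first `Dense` layer, so that first layer still receives the raw, unscaled inputs.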

In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
# Initializing the model
model_08 = Sequential()
# Adding the first hidden layer with relu as activation function
model_08.add(Dense(160, activation='relu', input_dim=11))
# Adding Batch Normalization
model_08.add(BatchNormalization())
# Adding the second hidden layer with relu as activation function
model_08.add(Dense(64, activation='relu'))
# Adding Batch Normalization
model_08.add(BatchNormalization())
# Adding the third hidden layer with relu as activation function
model_08.add(Dense(32, activation='relu'))
# Adding Batch Normalization
model_08.add(BatchNormalization())
# Adding the fourth hidden layer with relu as activation function
model_08.add(Dense(16, activation='relu'))
# Adding Batch Normalization
model_08.add(BatchNormalization())
# Adding the output layer with one neuron and Sigmoid as activation
model_08.add(Dense(1, activation='sigmoid'))

# Compiling the model with 'cross entropy' as loss function, 'SGD' Optimizer and 'accuracy' metrics
model_08.compile(loss='binary_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])

# Fitting the model on train and validation with 50 epochs
fitted_model_08 = model_08.fit(X_train_ns, y_train_ns, validation_split=0.2,epochs=50)

#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_08.history['loss'])
plt.plot(fitted_model_08.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

The loss curve has dropped very rapidly and the error difference between the validation and training sets has improved, yet not significantly. The error difference between the training and validation sets is stabilizing as well. There is also some noise observed. Let us compute the performance metrics.

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_08.predict(X_test_ns)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
cr=metrics.classification_report(y_test_ns,y_pred)
print(cr)

#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test_ns, y_pred)
labels = ['True Positive','False Negative','False Positive','True Negative']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

The recall has decreased significantly and the FN and TN have completely vanished. Let us plot the ROC-AUC curve and see if changing the threshold can enhance this model.

The ROC-AUC curve is reflecting a poor model, hence this explains the reduced recall and noise. Since, we have observed better performing models, we will ignore this one.

<b>Model_09 with grid search:</b><br><br>
In this model we will utilize the GridSearchCV to tune the learning rate and batch size. We will use the non-sampled training set to train the model and the below parameters:

<ul><li>Hidden layers = 5</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = SGD</li></ul>
<ul><li>epochs = 50</li></ul>
<ul><li>Data is scaled using zscore</li></ul>
<ul><li>Regularization technique: Dropout</li></ul>

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:

backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
def create_model(lr,batch_size):
    np.random.seed(1)
    model = Sequential()
    model.add(Dense(256,activation='relu',input_dim = 11))
    model.add(Dropout(0.2))
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(65,activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(32,activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(16,activation='relu',kernel_initializer='he_uniform'))
    model.add(Dropout(0.3))
    model.add(Dense(1, activation='sigmoid'))

    #compile model
    optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
    model.compile(optimizer = optimizer,loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model

In [None]:
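# Note: this wrapper module is deprecated and has been removed from recent TensorFlow releases;
# SciKeras is the recommended replacement if the import below fails
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

keras_estimator = KerasClassifier(build_fn=create_model, verbose=1)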
# define the grid search parameters
param_grid = {
    'batch_size':[64,32, 128, 256, 560, 1000],
    "lr":[0.001,0.0015,0.0020,0.0025],}

kfold_splits = 3
grid = GridSearchCV(estimator=keras_estimator,
                    verbose=1,
                    cv=kfold_splits,
                    param_grid=param_grid,n_jobs=-1)
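
To get a sense of the cost of this search, the sketch below enumerates the same parameter grid in plain Python and counts the cross-validation fits. `GridSearchCV` trains every combination once per fold, plus one final refit on the whole training set when `refit` is left at its default. The enumeration is only an illustration of the counting, not of how scikit-learn schedules the work.

```python
from itertools import product

# same values as the grid defined above
param_grid = {
    'batch_size': [64, 32, 128, 256, 560, 1000],
    'lr': [0.001, 0.0015, 0.0020, 0.0025],
}
kfold_splits = 3

names = list(param_grid)
combos = [dict(zip(names, values))
          for values in product(*(param_grid[n] for n in names))]

total_fits = len(combos) * kfold_splits
print(len(combos), total_fits)
print(combos[0], combos[-1])
```

Since no `scoring` argument is given, the candidates are ranked by the estimator's own score, which for a classifier is typically accuracy. If recall is the metric of interest, passing `scoring='recall'` to `GridSearchCV` is one option to consider.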

In [None]:
import time

# store starting time
begin = time.time()

grid_result = grid.fit(X_train, y_train,validation_split=0.2,verbose=1)

# store end time right after fitting so that only the search itself is timed
end = time.time()

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

# total time taken (in seconds)
print(f"Total runtime of the grid search is {end - begin:.1f} seconds")

In [None]:
# create the structure of the NN
model_09=create_model(batch_size=grid_result.best_params_['batch_size'],lr=grid_result.best_params_['lr'])

model_09.summary()

In [None]:
# prepare the optimizer for compiling
optimizer = tf.keras.optimizers.SGD(grid_result.best_params_['lr'])
model_09.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])

# Fit the model to the train data using the batch size selected by the grid search
fitted_model_09=model_09.fit(X_train, y_train, epochs=50, batch_size=grid_result.best_params_['batch_size'], verbose=1,validation_split=0.2)

In [None]:
#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_09.history['loss'])
plt.plot(fitted_model_09.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

The validation loss curve is much smoother and the overfit is less. Also the average loss value that the curves converge to is within a lower range than observed in the previous model.

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_09.predict(X_test)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Positive','False Negative','False Positive','True Negative']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

Unfortunately the recall has dropped to 0.54 and the FN and TN have reduced significantly in the confusion matrix, which makes this model unacceptable.

Note: The GridSearchCV was trialed with the over sampled training set and the overfit was too high, hence the model was not included in this notebook.


<b>Model_10</b><br><br>
In this model we will use the Keras Tuner to guide us through the best combination of :
<ul><li>No. of layers</ul></li>
<ul><li>No. of neurons per layer</ul></li>
<ul><li>Learning rate</ul></li>
We will keep the other parameters as they are, except that the regularization is removed:

<ul><li>Hidden layers = 4</li></ul>
<ul><li>Activation function for hidden layers = ReLU</li></ul>
<ul><li>Output layer with 1 node</li></ul>
<ul><li>Output activation function = Sigmoid</li></ul>
<ul><li>Output loss function = Cross Entropy (since this is a classification problem)</li></ul>
<ul><li>Optimizer = SGD</li></ul>
<ul><li>epochs = 50</li></ul>
<ul><li>Data is scaled using zscore</li></ul>

In [None]:
## First we install Keras Tuner
!pip install keras-tuner

In [None]:
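# import the libraries
from tensorflow import keras
from tensorflow.keras import layers
# recent keras-tuner releases expose the package as `keras_tuner` (the older `kerastuner` name is deprecated)
from keras_tuner import RandomSearch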

In [None]:
backend.clear_session()
np.random.seed(1)
import random
random.seed(1)
tf.random.set_seed(1)

In [None]:
def build_model(h):
    model = Sequential()
    for i in range(h.Int('num_layers', 2, 10)):
        model.add(layers.Dense(units=h.Int('units_' + str(i),
                                            min_value=32,
                                            max_value=256,
                                            step=32),
                               activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(
        optimizer=SGD(h.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model
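
To put `max_trials` into perspective, the sketch below counts the configurations in the search space defined by `build_model` and draws a few random ones in plain Python. It is an approximation for illustration: Keras Tuner manages its hyperparameters differently internally, and the names `UNIT_CHOICES`, `LAYER_CHOICES`, `LR_CHOICES` and `sample_config` are hypothetical.

```python
import random

UNIT_CHOICES = list(range(32, 257, 32))   # 32, 64, ..., 256 (step 32)
LAYER_CHOICES = list(range(2, 11))        # 2 to 10 layers, inclusive
LR_CHOICES = [1e-2, 1e-3, 1e-4]

def sample_config(rng):
    # draw the number of layers, then one width per layer, then a learning rate
    n = rng.choice(LAYER_CHOICES)
    return {
        'num_layers': n,
        'units': [rng.choice(UNIT_CHOICES) for _ in range(n)],
        'learning_rate': rng.choice(LR_CHOICES),
    }

# every depth n contributes len(UNIT_CHOICES) ** n width combinations
search_space_size = sum(len(UNIT_CHOICES) ** n for n in LAYER_CHOICES) * len(LR_CHOICES)

rng = random.Random(1)
trials = [sample_config(rng) for _ in range(5)]
print(search_space_size)
for t in trials:
    print(t)
```

With only a handful of trials drawn from a space this large, the best configuration found is one reasonable candidate rather than an optimum, and small differences in validation accuracy between trials should be read with caution.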

In [None]:
# Initialize a tuner (here, RandomSearch). We use objective to specify the objective to select the best models,
#  and we use max_trials to specify the number of different models to try

tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=3,
    project_name='Job_')

tuner.search_space_summary()

In [None]:
### Searching the best model on X and y train
tuner.search(X_train, y_train,
             epochs=5,
             validation_split = 0.2)

In [None]:
# Print the best models with their hyperparameters
tuner.results_summary()

It seems that the accuracy is almost the same for 8 and 9 layers. It is also the same for learning rates of 0.01, 0.001 and 0.0001. Hence, we choose the simpler model with 8 layers and the medium learning rate of 0.001 to structure our model and observe the different metrics.

In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
# Initializing the model
model_10 = Sequential()
# Adding the first hidden layer with relu as activation function
model_10.add(Dense(160, activation='relu', input_dim=11))
# Adding the second hidden layer with relu as activation function
model_10.add(Dense(224, activation='relu'))
# Adding the third hidden layer with relu as activation function
model_10.add(Dense(160, activation='relu'))
# Adding the fourth hidden layer with relu as activation function
model_10.add(Dense(32, activation='relu'))
# Adding the fifth hidden layer with relu as activation function
model_10.add(Dense(192, activation='relu'))
# Adding the sixth hidden layer with relu as activation function
model_10.add(Dense(96, activation='relu'))
# Adding the seventh hidden layer with relu as activation function
model_10.add(Dense(192, activation='relu'))
# Adding the eighth hidden layer with relu as activation function
model_10.add(Dense(160, activation='relu'))
# Adding the output layer with one neuron and Sigmoid as activation
model_10.add(Dense(1, activation='sigmoid'))

In [None]:
model_10.summary()

In [None]:
optimizer = SGD(0.001)
model_10.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [None]:
# Fitting the model on train and validation with 50 epochs
fitted_model_10 = model_10.fit(X_train, y_train, validation_split=0.2,epochs=50)

In [None]:
#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_10.history['loss'])
plt.plot(fitted_model_10.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

The validation and training loss curves are showing the smoothest trend in this tuning method. The noise is minimal and the overfitting is much less compared to the Random grid search. Let us compute the performance metrics and observe if the model performance has enhanced as well.

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_10.predict(X_test)

#set the threshold to 0.202 from trialing different thresholds
y_pred = (y_pred > 0.202)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Positive','False Negative','False Positive','True Negative']
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

The recall is reduced to 0.69 and the FN ratio has increased significantly to 25%. This shows that the tuned NN structure has not enhanced the model performance, and that the oversampled data introduces noise and overfitting yet gives a higher recall.

<b>Model_11</b><br><br>
Same as model_10, but with dropout.

In [None]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
def build_model_w_dropout(h):
    model = Sequential()
    for i in range(h.Int('num_layers', 2, 10)):
        model.add(layers.Dense(units=h.Int('units_' + str(i),
                                            min_value=32,
                                            max_value=256,
                                            step=32),
                               activation='relu'))
        model.add(Dropout(0.2))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(
        optimizer=SGD(h.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model

In [None]:
# Initialize a tuner (here, RandomSearch). We use objective to specify the objective to select the best models,
#  and we use max_trials to specify the number of different models to try

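tuner_2 = RandomSearch(
    build_model_w_dropout,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=3,
    # a separate project name (illustrative) so that this tuner does not reload
    # the trials already saved by the first tuner under 'Job_'
    project_name='Job_dropout')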

In [None]:
tuner_2.search_space_summary()

In [None]:
### Searching the best model on X and y train
tuner_2.search(X_train, y_train,
             epochs=5,
             validation_split = 0.2)

In [None]:
# Summarize the results of the completed search
tuner_2.results_summary()
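The next cell rebuilds the tuned architecture by hand. As a hedged alternative, the layer sizes can be read from the best trial's hyperparameters. The sketch below assumes the values come from `tuner_2.get_best_hyperparameters(num_trials=1)[0].values`, which KerasTuner exposes as a plain dictionary. The helper `layer_units` and the example dictionary are hypothetical. The dictionary may also hold `units_i` entries left over from deeper trials, so only the first `num_layers` entries are used.

```python
def layer_units(hp_values):
    """Return the unit counts of the hidden layers the model actually uses."""
    n_layers = hp_values["num_layers"]
    return [hp_values[f"units_{i}"] for i in range(n_layers)]


# Hypothetical values for illustration; in the notebook they would come from
# tuner_2.get_best_hyperparameters(num_trials=1)[0].values
example_values = {
    "num_layers": 3,
    "units_0": 224,
    "units_1": 160,
    "units_2": 32,
    "units_3": 192,  # left over from a deeper trial, unused when num_layers is 3
    "learning_rate": 0.001,
}
print(layer_units(example_values))  # [224, 160, 32]
```

Reading the sizes programmatically avoids copy errors between the tuner summary and the hand-written model definition.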

In [None]:
# Initializing the model and adding drop out
model_11 = Sequential()
# Adding the first hidden layer with 224 neurons, relu as activation function
model_11.add(Dense(224, activation='relu', input_dim=11))
# Adding dropout of 20%
model_11.add(Dropout(0.2))
# Adding the second hidden layer with 160 neurons, relu as activation function
model_11.add(Dense(160, activation='relu'))
# Adding dropout of 20%
model_11.add(Dropout(0.2))
# Adding the third hidden layer with 32 neurons, relu as activation function
model_11.add(Dense(32, activation='relu'))
# Adding dropout of 20%
model_11.add(Dropout(0.2))
# Adding the fourth hidden layer with 192 neurons, relu as activation function
model_11.add(Dense(192, activation='relu'))
# Adding dropout of 20%
model_11.add(Dropout(0.2))
# Adding the fifth hidden layer with 96 neurons, relu as activation function
model_11.add(Dense(96, activation='relu'))
# Adding dropout of 20%
model_11.add(Dropout(0.2))
# Adding the sixth hidden layer with 192 neurons, relu as activation function
model_11.add(Dense(192, activation='relu'))
# Adding dropout of 20%
model_11.add(Dropout(0.2))
# Adding the seventh hidden layer with 160 neurons, relu as activation function
model_11.add(Dense(160, activation='relu'))
# Adding dropout of 20%
model_11.add(Dropout(0.2))
# Adding the eighth hidden layer with 160 neurons, relu as activation function
model_11.add(Dense(160, activation='relu'))
# Adding dropout of 20%
model_11.add(Dropout(0.2))
# Adding the output layer with one neuron and Sigmoid as activation
model_11.add(Dense(1, activation='sigmoid'))

optimizer = SGD(0.001)
model_11.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [None]:
# Fitting the model on train and validation with 50 epochs
fitted_model_11 = model_11.fit(X_train, y_train, validation_split=0.2,epochs=50)

#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_11.history['loss'])
plt.plot(fitted_model_11.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_11.predict(X_test)

#set the threshold to 0.202
y_pred = (y_pred > 0.202)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
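# confusion_matrix returns [[TN, FP], [FN, TP]] for labels [0, 1]; this order
# assumes make_confusion_matrix assigns group_names row by row
labels = ['True Negative','False Positive','False Negative','True Positive']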
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

The model performance has actually worsened in both recall and FN ratio. Hence, we will apply oversampling (SMOTE) to the training set and observe whether it improves the model performance.
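For intuition, the core idea of SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the implementation used to build `X_train_over` (which presumably comes from a library such as imbalanced-learn). The function name `smote_like_sample` and the toy points are hypothetical.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(1)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to base
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy minority-class points (hypothetical, two features each)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(1)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies between two existing minority samples, oversampling adds no new information about the class; it only rebalances the training signal, which is consistent with the noisier validation curves discussed in this notebook.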

<b>Model_12:</b><br><br>
Because the Keras Tuner model with dropout showed less overfitting, we will train it on the training set oversampled with the SMOTE algorithm. All other hyperparameters are kept fixed:

In [None]:
backend.clear_session()
# Fixing the seeds of the random number generators so that we receive the same output every time
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
def build_model_smote(h):
    model = keras.Sequential()
    for i in range(h.Int('num_layers', 2, 10)):
        model.add(layers.Dense(units=h.Int('units_' + str(i),
                                            min_value=32,
                                            max_value=256,
                                            step=32),
                               activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(
        optimizer=keras.optimizers.Adam(
            h.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model

In [None]:
tuner_3 = RandomSearch(
    build_model_smote,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=3,
    project_name='Job_Switch')

In [None]:
tuner_3.search_space_summary()

In [None]:
tuner_3.search(X_train_over, y_train_over,
             epochs=5,
             validation_split = 0.2)

In [None]:
tuner_3.results_summary()

In [None]:
backend.clear_session()
# Fixing the seeds of the random number generators so that we receive the same output every time
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
# Building the NN based on the hyperparameters recommended by the Keras Tuner
model_12 = Sequential()
# Adding the first hidden layer with 160 neurons, relu as activation function
model_12.add(Dense(160, activation='relu', input_dim=11))
# Adding the second hidden layer with 224 neurons, relu as activation function
model_12.add(Dense(224, activation='relu'))
# Adding the third hidden layer with 160 neurons, relu as activation function
model_12.add(Dense(160, activation='relu'))
# Adding the fourth hidden layer with 32 neurons, relu as activation function
model_12.add(Dense(32, activation='relu'))
# Adding the fifth hidden layer with 192 neurons, relu as activation function
model_12.add(Dense(192, activation='relu'))
# Adding the sixth hidden layer with 96 neurons, relu as activation function
model_12.add(Dense(96, activation='relu'))
# Adding the seventh hidden layer with 192 neurons, relu as activation function
model_12.add(Dense(192, activation='relu'))
# Adding dropout of 20% (the only active dropout layer in this model)
model_12.add(Dropout(0.2))
# Adding the eighth hidden layer with 160 neurons, relu as activation function
model_12.add(Dense(160, activation='relu'))
# Adding the output layer with one neuron and Sigmoid as activation
model_12.add(Dense(1, activation='sigmoid'))

In [None]:
# Compiling the model with binary cross entropy as loss function, the 'SGD' optimizer and 'accuracy' as metric
model_12.compile(loss='binary_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])

In [None]:
fitted_model_12 = model_12.fit(X_train_over, y_train_over, validation_split=0.2,epochs=50)

In [None]:
#Plotting Train Loss vs Validation Loss
plt.plot(fitted_model_12.history['loss'])
plt.plot(fitted_model_12.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

The noise in the validation loss is still quite high (expected, since we are using the oversampled data), yet the overfitting is lower than in earlier models. The validation loss curve also fluctuates around a lower average loss than in earlier models.
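The visual reading of the loss curves can be backed by two simple numbers: the average gap between validation and training loss (a rough overfitting indicator) and the spread of the validation loss (a rough noise indicator). The sketch below assumes a dictionary shaped like `fitted_model_12.history`. The helper `summarize_history` and the loss values are hypothetical.

```python
from statistics import mean, pstdev


def summarize_history(history, last_n=10):
    """Summarize the last `last_n` epochs of a Keras-style history dict."""
    loss = history["loss"][-last_n:]
    val_loss = history["val_loss"][-last_n:]
    return {
        # Positive gap: validation loss sits above training loss
        "gap": mean([v - t for v, t in zip(val_loss, loss)]),
        # Spread of the validation loss around its own mean
        "val_noise": pstdev(val_loss),
        "val_mean": mean(val_loss),
    }


# Hypothetical loss values for illustration only
example_history = {
    "loss": [0.60, 0.50, 0.45, 0.42],
    "val_loss": [0.62, 0.58, 0.50, 0.54],
}
summary = summarize_history(example_history, last_n=4)
print({name: round(value, 4) for name, value in summary.items()})
```

Computing the same summary for each model would make statements such as "less overfitting" and "lower average loss" directly comparable across models.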

In [None]:
# Obtain the y_predict from the X_test
y_pred=model_12.predict(X_test)

#set the threshold to 0.5
y_pred = (y_pred > 0.5)

# Display the classification report
cr=metrics.classification_report(y_test,y_pred)
print(cr)

#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
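# confusion_matrix returns [[TN, FP], [FN, TP]] for labels [0, 1]; this order
# assumes make_confusion_matrix assigns group_names row by row
labels = ['True Negative','False Positive','False Negative','True Positive']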
categories = [ 'Not Exiting','Exiting']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')

As expected, recall is higher than for the model trained on the non-oversampled data. The FN ratio has also decreased to 10%, an improvement over the previous model.
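Note that Model_11 and Model_12 are evaluated at different cutoffs (0.202 and 0.5), and the cutoff itself shifts recall. The sketch below shows this effect on a handful of made-up probabilities. The helper `recall_and_fp` and all values are hypothetical, and the notebook does not state how 0.202 was chosen.

```python
def recall_and_fp(y_true, y_prob, threshold):
    """Return recall and the false-positive count at a given cutoff."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for t, p in pairs if t == 1 and p > threshold)
    fn = sum(1 for t, p in pairs if t == 1 and p <= threshold)
    fp = sum(1 for t, p in pairs if t == 0 and p > threshold)
    return tp / (tp + fn), fp


# Hypothetical labels and predicted probabilities for illustration only
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.10, 0.30, 0.25, 0.60, 0.45, 0.55, 0.80, 0.15]

for threshold in (0.202, 0.5):
    recall, fp = recall_and_fp(y_true, y_prob, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, false positives={fp}")
```

A lower cutoff catches more exiting customers at the cost of more false alarms, so models are easiest to compare when they are scored at the same cutoff or at a cutoff selected by the same rule.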

<b>Conclusion and key takeaways</b><br><br>
<b>Key observations</b><br>
<ul>
<li>Most models gave a recall between 0.69 and 0.75, while a few dropped to a recall of about 0.5.</li>
<li>The FN ratio was very hard to reduce: in most models it moved against recall (as recall improved, the FN ratio worsened).</li>
<li>Using the oversampled data introduced noise in the validation loss curve, yet it helped the model reach slightly better performance.</li>
<li>The EDA showed that the target classes in the data set are imbalanced.</li>
<li>The EDA also showed that most features are in fact very poor predictors.</li>
</ul>

<b>Key Conclusion</b><br><br>
<ul>
<li>The best-performing model is Model_07, which gives a recall of 0.75 with slight overfitting, acceptable noise and an acceptable FN ratio of 7%.</li>
<li>Because of the poor predictive power of the features and the strong imbalance in the target variable, the NN models reached a best recall of 0.75, with slight overfitting.</li>
<li>Using this model will help the bank identify customers who are likely to exit, so that early action plans can be put in place to retain them (please see the EDA Insights section, where the common features of exiting customers are stated).</li>
</ul>

The curves are still smooth and show less overfitting; let us explore the performance metrics.

* Hidden layers = 4
* Activation function for hidden layers = ReLU
* Weight initialization technique = He
* Output layer with 1 node
* Output activation function = Sigmoid
* Output loss function = Cross Entropy (since this is a classification problem)
* Optimizer = SGD
* Epochs = 20
* Data is scaled using z-score


<font size=6 color='blue'>Power Ahead</font>
___