This case study uses an open-source dataset from Kaggle.

Link to the Kaggle project site:

https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling

Given a bank customer, can we build a neural-network classifier that predicts whether they will leave the bank?

 

Case file: 


bank.csv

 

The points distribution for this case is as follows:

1. Read the dataset
2. Drop the columns that are unique to each customer, such as IDs
3. Separate the features from the target
4. Split the data set into train and test sets
5. Normalize the train and test data (2.5 points)
6. Initialize & build the model (10 points)
7. Optimize the model (5 points)
9. Predict the results using 0.5 as a threshold (5 points)

10. Print the Accuracy score and confusion matrix (2.5 points)

In [1]:
import warnings

import numpy as np
import pandas as pd
import tensorflow as tf

# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

%matplotlib inline

# Clear leftover Keras/TensorFlow state - needed only when re-running cells in a Jupyter notebook
tf.keras.backend.clear_session()

warnings.filterwarnings('ignore')



In [2]:
bankdata = pd.read_csv('bank.csv')

In [3]:
bankdata.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2,134603.88,1,1,1,71725.73,0


In [4]:
# Dropping RowNumber, CustomerId and Surname
bankdata.drop(["RowNumber","CustomerId","Surname"],axis=1, inplace=True)
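
Step 2 of the task asks for dropping columns that are unique per customer. The cell above hard-codes the names; a minimal sketch of how such columns could be detected instead is shown here. The frame `toy` and the helper `id_like_columns` are made up for illustration and are not part of the notebook. A uniqueness check only produces candidates to review: a continuous column can be all-unique by chance, and a column such as Surname may repeat across customers and so would not necessarily be flagged.

```python
import pandas as pd


def id_like_columns(df):
    """Return the columns whose values are all distinct (one value per row)."""
    return [col for col in df.columns if df[col].nunique() == len(df)]


# Toy data invented for illustration (not rows from bank.csv).
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4],
    "CustomerId": [101, 102, 103, 104],
    "Geography": ["France", "Spain", "France", "Germany"],
    "Exited": [1, 0, 1, 0],
})

candidates = id_like_columns(toy)
print(candidates)

# Drop the candidates without mutating the original frame.
cleaned = toy.drop(columns=candidates)
print(cleaned.columns.tolist())
```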

In [5]:
bankdata.head(10)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
5,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1
6,822,France,Male,50,7,0.0,2,1,1,10062.8,0
7,376,Germany,Female,29,4,115046.74,4,1,0,119346.88,1
8,501,France,Male,44,4,142051.07,2,0,1,74940.5,0
9,684,France,Male,27,2,134603.88,1,1,1,71725.73,0


In [6]:
bankdata1 = pd.get_dummies(bankdata, prefix='Geo', columns=['Geography'])

In [7]:
bankdata1.head(10)

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geo_France,Geo_Germany,Geo_Spain
0,619,Female,42,2,0.0,1,1,1,101348.88,1,1,0,0
1,608,Female,41,1,83807.86,1,0,1,112542.58,0,0,0,1
2,502,Female,42,8,159660.8,3,1,0,113931.57,1,1,0,0
3,699,Female,39,1,0.0,2,0,0,93826.63,0,1,0,0
4,850,Female,43,2,125510.82,1,1,1,79084.1,0,0,0,1
5,645,Male,44,8,113755.78,2,1,0,149756.71,1,0,0,1
6,822,Male,50,7,0.0,2,1,1,10062.8,0,1,0,0
7,376,Female,29,4,115046.74,4,1,0,119346.88,1,0,1,0
8,501,Male,44,4,142051.07,2,0,1,74940.5,0,1,0,0
9,684,Male,27,2,134603.88,1,1,1,71725.73,0,1,0,0


In [8]:
bankdata1 = pd.get_dummies(bankdata1, prefix='Gender', columns=['Gender'])

In [9]:
bankdata1.head(10)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geo_France,Geo_Germany,Geo_Spain,Gender_Female,Gender_Male
0,619,42,2,0.0,1,1,1,101348.88,1,1,0,0,1,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,0,1,1,0
2,502,42,8,159660.8,3,1,0,113931.57,1,1,0,0,1,0
3,699,39,1,0.0,2,0,0,93826.63,0,1,0,0,1,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,0,1,1,0
5,645,44,8,113755.78,2,1,0,149756.71,1,0,0,1,0,1
6,822,50,7,0.0,2,1,1,10062.8,0,1,0,0,0,1
7,376,29,4,115046.74,4,1,0,119346.88,1,0,1,0,1,0
8,501,44,4,142051.07,2,0,1,74940.5,0,1,0,0,0,1
9,684,27,2,134603.88,1,1,1,71725.73,0,1,0,0,0,1
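
The two `get_dummies` calls above could be combined into one. For a two-valued column such as Gender, the pair `Gender_Female` / `Gender_Male` carries the same information twice. The sketch below is an alternative, not what the notebook does: it uses a made-up frame `toy` and passes `drop_first=True` so that one level per categorical column is dropped. Applied to the notebook's data, this would presumably leave 11 feature columns instead of 13, so the model's input size would have to change accordingly.

```python
import pandas as pd

# Toy data invented for illustration (not rows from bank.csv).
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Female", "Male"],
})

# Encode both categorical columns in one call; drop_first removes the
# first (alphabetically sorted) level of each column.
encoded = pd.get_dummies(
    toy,
    columns=["Geography", "Gender"],
    prefix=["Geo", "Gender"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```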


In [10]:
from sklearn.preprocessing import StandardScaler

# Standardize the continuous columns to zero mean and unit variance
scaler = StandardScaler()

bumpy_features = ["CreditScore", "Age", "Balance", "EstimatedSalary"]

# Work on a copy so that bankdata1 keeps the unscaled values
df_scaled = bankdata1.copy()
df_scaled[bumpy_features] = scaler.fit_transform(bankdata1[bumpy_features])
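
The cell above fits the scaler on the full data set, before the train/test split in In [16], so statistics from the test rows influence the scaling. A minimal sketch of the alternative order follows: split first, fit the scaler on the training rows only, then apply it to the test rows. The data is random and the names `toy` and `numeric_cols` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy data invented for illustration (not bank.csv).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "CreditScore": rng.integers(350, 850, size=100).astype(float),
    "Age": rng.integers(18, 90, size=100).astype(float),
    "Exited": rng.integers(0, 2, size=100),
})
numeric_cols = ["CreditScore", "Age"]

# Split before scaling so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="Exited"), toy["Exited"], test_size=0.3, random_state=0
)
X_train = X_train.copy()
X_test = X_test.copy()

scaler = StandardScaler()
# fit_transform learns mean and std from the training rows only ...
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
# ... and transform reuses those statistics on the test rows.
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train[numeric_cols].mean().round(6).tolist())
```

With this order the training columns have mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized, which is expected.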

In [11]:
#Divide the data into features and target
features = df_scaled.drop(['Exited'],axis=1)
Target = df_scaled.Exited

In [12]:
features.head(10)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geo_France,Geo_Germany,Geo_Spain,Gender_Female,Gender_Male
0,-0.326221,0.293517,2,-1.225848,1,1,1,0.021886,1,0,0,1,0
1,-0.440036,0.198164,1,0.11735,1,0,1,0.216534,0,0,1,1,0
2,-1.536794,0.293517,8,1.333053,3,1,0,0.240687,1,0,0,1,0
3,0.501521,0.007457,1,-1.225848,2,0,0,-0.108918,1,0,0,1,0
4,2.063884,0.388871,2,0.785728,1,1,1,-0.365276,0,0,1,1,0
5,-0.057205,0.484225,8,0.597329,2,1,0,0.86365,0,0,1,0,1
6,1.774174,1.056346,7,-1.225848,2,1,1,-1.565487,1,0,0,0,1
7,-2.840488,-0.946079,4,0.618019,4,1,0,0.334854,0,1,0,1,0
8,-1.547141,0.484225,4,1.05082,2,0,1,-0.437329,1,0,0,0,1
9,0.346319,-1.136786,2,0.931463,1,1,1,-0.49323,1,0,0,0,1


In [14]:
Target.head(10)

0    1
1    0
2    1
3    0
4    0
5    1
6    0
7    1
8    0
9    0
Name: Exited, dtype: int64

In [15]:
#split the data into train and test data

In [16]:
x_train,x_test,y_train,y_test = train_test_split(features,Target,test_size=0.3)

In [17]:
print("Train data shape",x_train.shape,y_train.shape)

Train data shape (7000, 13) (7000,)


In [18]:
print("Test data shape",x_test.shape,y_test.shape)

Test data shape (3000, 13) (3000,)
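
The split above sets neither `random_state` nor `stratify`, so the shapes are fixed but the rows assigned to each side, and therefore the reported accuracy, can change from run to run. The sketch below shows a reproducible, stratified split on a made-up target with 20% positives. The seed `42` and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with 20 positives out of 100 rows, invented for illustration.
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)

# random_state makes the split repeatable; stratify=y keeps the class
# ratio the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(y_tr.mean(), y_te.mean())
```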


In [19]:
#Initializing and building the model

In [20]:
#Initialize Sequential model
model = tf.keras.models.Sequential()

model.add(tf.keras.layers.Reshape((13,),input_shape=(13,)))
model.add(tf.keras.layers.BatchNormalization())

#Add hidden layers
model.add(tf.keras.layers.Dense(60, activation='sigmoid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(60, activation='sigmoid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(60, activation='sigmoid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

#Create optimizer with non-default learning rate (`lr` is the deprecated name of `learning_rate`)
sgd_optimizer = tf.keras.optimizers.SGD(learning_rate=0.03)

#Compile the model
model.compile(optimizer=sgd_optimizer, loss='binary_crossentropy', metrics=['accuracy'])

#Model Summary
model.summary()

#Train the model
model.fit(x_train, y_train, validation_data=(x_test, y_test),epochs=10,batch_size=10)

pred = model.predict(x_test)

print("Accuracy Score using threshold of .5 using round function:", accuracy_score(y_test, pred.round()))
print(confusion_matrix(y_test, pred.round()))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
reshape (Reshape)            (None, 13)                0         
_________________________________________________________________
batch_normalization (BatchNo (None, 13)                52        
_________________________________________________________________
dense (Dense)                (None, 60)                840       
_________________________________________________________________
batch_normalization_1 (Batch (None, 60)                240       
_________________________________________________________________
dense_1 (Dense)              (None, 60)                3660      
_________________________________________________________________
batch_normalization_2 (Batch (None, 60)                240       
_________________________________________________________________
dense_2 (Dense)              (None, 60)                3660      
__________
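
Step 9 asks for predictions at a 0.5 threshold, which the cell above implements with `pred.round()`. Below is a sketch of the explicit comparison, using made-up probabilities in the `(n, 1)` shape that `model.predict` returns for a single sigmoid output. NumPy rounds halves to the nearest even number, so `round()` maps a probability of exactly 0.5 to 0 while `>= 0.5` maps it to 1. The two approaches differ only at that boundary.

```python
import numpy as np

# Made-up probabilities for illustration, shaped like a Keras prediction.
probs = np.array([[0.12], [0.50], [0.73], [0.49]])
threshold = 0.5

# Explicit thresholding: the cut-off is visible and easy to change.
labels = (probs >= threshold).astype(int).ravel()

# The notebook's approach: round to the nearest integer (half to even).
rounded = probs.round().astype(int).ravel()

print(labels.tolist())
print(rounded.tolist())
```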

In [21]:
from keras.models import Sequential
from keras.layers import Dense

clf = Sequential()

clf.add(Dense(units = 60, kernel_initializer = "uniform", activation= "relu", input_dim=13))
clf.add(Dense(units = 60, kernel_initializer = "uniform", activation= "relu"))
clf.add(Dense(units = 60, kernel_initializer = "uniform", activation= "relu"))
clf.add(Dense(units = 1, kernel_initializer = "uniform", activation= "sigmoid"))

clf.compile(optimizer="adam", loss = "binary_crossentropy", metrics=["accuracy"])

clf.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size = 10, epochs=10)

pred = clf.predict(x_test)
print("Accuracy Score using threshold of .5 using round function:", accuracy_score(y_test, pred.round()))
print(confusion_matrix(y_test, pred.round()))

Using TensorFlow backend.


Train on 7000 samples, validate on 3000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy Score using threshold of .5 using round function: 0.8486666666666667
[[2292  103]
 [ 351  254]]
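
scikit-learn's `confusion_matrix` puts the true labels on rows and the predicted labels on columns, with class 0 first. The printed matrix can therefore be unpacked into per-class metrics. The short calculation below reuses only the four numbers printed above; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = np.array([[2292, 103],
               [351, 254]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()    # share of all predictions that are correct
precision = tp / (tp + fp)         # predicted churners who actually churned
recall = tp / (tp + fn)            # actual churners the model caught
baseline = (tn + fp) / cm.sum()    # accuracy of always predicting "stays"

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(baseline, 4))
```

The accuracy of about 0.849 sits only a few points above the roughly 0.798 obtained by always predicting that a customer stays, and the recall of about 0.42 shows that more than half of the customers who left were missed at this threshold.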
