# NNDL - Project 1: Bank Customer Churn Prediction



This case study uses an open-source dataset from Kaggle.

Link to the Kaggle project site:
https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling
 
Given a bank customer, can we build a neural-network classifier that predicts whether they will leave?
 
Case file: bank.csv

The points distribution for this case is as follows:
1. Read the dataset

In [161]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [162]:
bank_df = pd.read_csv("Churn_Modelling.csv")
bank_df.shape

(10000, 14)

In [163]:
bank_df.head()

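| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |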


2. Drop the columns which are unique for all users like IDs (2.5 points)

In [164]:
df = bank_df.drop(columns=['RowNumber','CustomerId','Surname'])
df.head()

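| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |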


In [165]:
df['Geography'].value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

In [166]:
df['Gender'].value_counts()

Male      5457
Female    4543
Name: Gender, dtype: int64

In [167]:
df['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64
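
The counts above show an imbalanced target: roughly one customer in five exited. As a rough reference point (a sketch, not part of the original notebook), the accuracy of a classifier that always predicts the majority class can be computed directly from these counts. The `counts` array below simply hard-codes the values printed above.

```python
import numpy as np

# Class counts copied from the value_counts() output above (0 = stayed, 1 = exited).
counts = np.array([7963, 2037])

# Fraction of each class in the full dataset.
class_share = counts / counts.sum()

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_share.max()

print(class_share)
print(majority_baseline)
```

Any accuracy reported for a trained model is worth reading against this baseline of about 0.80.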

In [168]:
df.dtypes

CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [169]:
# Categorical boolean mask
categorical_feature_mask = df.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = df.columns[categorical_feature_mask].tolist()

In [170]:
# import labelencoder
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()

In [171]:
# apply le on categorical feature columns
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
df[categorical_cols].head(10)

| | Geography | Gender |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 2 | 0 |
| 5 | 2 | 1 |
| 6 | 0 | 1 |
| 7 | 1 | 0 |
| 8 | 0 | 1 |
| 9 | 0 | 1 |


In [172]:
df.head()

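| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 0 | 0 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | 2 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | 0 | 0 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | 0 | 0 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | 2 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |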
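
`LabelEncoder` assigns integer codes in alphabetical order (France = 0, Germany = 1, Spain = 2), and a dense layer will treat those codes as magnitudes even though the countries have no natural order. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a small hypothetical frame (`demo` is an illustration, not the notebook's `df`).

```python
import pandas as pd

# Hypothetical toy frame with the same two categorical columns as the notebook.
demo = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Male", "Male"],
})

# One indicator column per country; no ordering is implied between them.
geo = pd.get_dummies(demo["Geography"], prefix="Geography", dtype="int64")

# Gender is binary, so a single 0/1 column is enough (drop_first=True).
gender = pd.get_dummies(demo["Gender"], prefix="Gender", drop_first=True, dtype="int64")

encoded = pd.concat([geo, gender], axis=1)
print(encoded)
```

With this encoding Geography expands from one column to three, so the feature count would become 12 and any `input_shape=(10,)` used later would need to change accordingly.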


3. Distinguish the feature and target set (2.5 points)


In [173]:
y = df['Exited']
X = df.drop(columns ='Exited')

In [174]:
labels = np.array(y).astype('float32')
features = np.array(X).astype('float32')
labels

array([1., 0., 1., ..., 1., 1., 0.], dtype=float32)

4. Divide the data set into Train and test sets

In [175]:
from sklearn.model_selection import train_test_split, GridSearchCV
from scipy.stats import zscore
from scipy import stats

In [176]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=1)


5. Normalize the train and test data (2.5 points)


In [177]:
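# NOTE: with the default axis=None, l2_normalize divides the whole tensor by a
# single global L2 norm; it does not rescale each feature column separately.
# Train and test are also normalized independently of each other here.
X_train_z = tf.math.l2_normalize(X_train)
X_test_z  = tf.math.l2_normalize(X_test)
X_train_z.shape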


TensorShape([7500, 10])
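
Called without `axis`, `tf.math.l2_normalize` scales the entire matrix by one constant, so large-valued columns such as Balance and EstimatedSalary dominate the norm and every entry ends up very small. A widely used alternative is per-feature standardization, with the mean and standard deviation estimated on the training split only and then reused for the test split. The following is a minimal NumPy sketch of that idea; `X_train_demo` and `X_test_demo` are hypothetical stand-ins for `X_train` and `X_test`, and scikit-learn's `StandardScaler` offers the same fit/transform pattern.

```python
import numpy as np

# Hypothetical stand-ins for X_train / X_test: two features on very different scales.
X_train_demo = np.array([[619.0, 0.0],
                         [608.0, 83807.86],
                         [502.0, 159660.8]], dtype="float32")
X_test_demo = np.array([[699.0, 0.0],
                        [850.0, 125510.82]], dtype="float32")

# Per-feature statistics are estimated on the training split only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns

# The same training statistics are applied to both splits.
X_train_std = (X_train_demo - mean) / std
X_test_std = (X_test_demo - mean) / std

# For comparison: one global L2 norm over the whole matrix (the axis=None behaviour).
X_train_global = X_train_demo / np.sqrt((X_train_demo ** 2).sum())

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(np.abs(X_train_global).max())
```

After standardization each training column has mean close to 0 and standard deviation close to 1, whereas the globally normalized matrix keeps the original scale differences between columns.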

In [179]:
#trainY = tf.keras.utils.to_categorical(y_train, num_classes =1)
#testY = tf.keras.utils.to_categorical(y_test, num_classes=1)
trainY = tf.convert_to_tensor(y_train)
testY = tf.convert_to_tensor(y_test)
testY

<tf.Tensor: id=2210908, shape=(2500,), dtype=float32, numpy=array([0., 0., 0., ..., 1., 0., 0.], dtype=float32)>

6. Initialize & build the model (7.5 points)

In [180]:
tf.random.set_seed(1)
#Initialize Sequential model
model = tf.keras.models.Sequential()

#Add OUTPUT layer
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

#Compile the model
model.compile(optimizer='sgd', loss='binary_crossentropy',metrics=['accuracy'])
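
The first model has no hidden layer: a single `Dense(1, activation='sigmoid')` unit computes `sigmoid(x·w + b)`, which is the same functional form as logistic regression. The NumPy sketch below illustrates that forward pass with hypothetical random weights and inputs (`w`, `b`, and `x` are made up for the illustration and are not the trained parameters).

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a layer with 10 inputs and 1 output unit.
rng = np.random.default_rng(0)
w = rng.normal(size=(10, 1))   # one weight per input feature
b = np.zeros(1)                # a single bias term

# Hypothetical batch of 4 customers with 10 features each.
x = rng.normal(size=(4, 10))

# Forward pass of Dense(1, activation='sigmoid'): one churn probability per row.
p = sigmoid(x @ w + b)

print(p.shape)           # (4, 1)
print(w.size + b.size)   # 11 trainable parameters
```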

In [181]:
model.fit(X_train_z,trainY,          
          validation_data=(X_test_z,testY),
          epochs=100,
          batch_size=10)

Train on 7500 samples, validate on 2500 samples
Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x1a750818908>

7. Optimize the model (5 points)


In [186]:
tf.random.set_seed(1)
#Initialize Sequential model
model2 = tf.keras.models.Sequential()

#Add 1st Hidden layer
model2.add(tf.keras.layers.Dense(6,input_shape=(10,),activation='relu'))

#Add 2nd Hidden layer
model2.add(tf.keras.layers.Dense(6, activation='relu'))


#Add OUTPUT layer
model2.add(tf.keras.layers.Dense(1,  activation='sigmoid'))

#Compile the model
model2.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])

In [187]:
model2.fit(X_train_z,trainY,          
          validation_data=(X_test_z,testY),
          epochs=100,
          batch_size=10)

Train on 7500 samples, validate on 2500 samples
Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x1a74ef44400>

In [195]:
''' Trial for GridCV or RandomizedCV on model evaluation
def build_model3():
    #Initialize Sequential model
    model3 = tf.keras.models.Sequential()
    #Add 1st Hidden layer
    model3.add(tf.keras.layers.Dense(6,input_shape=(10,),activation='relu'))
    #Add 2nd Hidden layer
    model3.add(tf.keras.layers.Dense(6, activation='relu'))
    #Add OUTPUT layer
    model3.add(tf.keras.layers.Dense(1,  activation='sigmoid'))
    #Compile the model
    model3.compile(optimizer='rmsprop', loss='binary_crossentropy',metrics=['accuracy'])
    return model3
'''

8. Predict the results using 0.5 as a threshold (5 points) 


In [190]:
y_pred = model2.predict(X_test_z)
y_pred = (y_pred > 0.5)

In [191]:
print(y_pred)

[[False]
 [False]
 [False]
 ...
 [False]
 [False]
 [False]]
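
`model2.predict` returns an array of shape `(n_samples, 1)`, so thresholding it yields a column of booleans as printed above. When a flat vector of 0/1 labels is more convenient, for example to compare against `y_test`, the thresholded result can be cast and flattened. A small sketch with hypothetical probabilities (`probs` is illustrative, not model output):

```python
import numpy as np

# Hypothetical sigmoid outputs with the (n_samples, 1) shape that Keras predict() returns.
probs = np.array([[0.12], [0.50], [0.73], [0.49]], dtype="float32")

# Strictly greater than 0.5, as in the notebook; a value of exactly 0.5 maps to class 0.
labels_pred = (probs > 0.5).astype("int64").ravel()

print(labels_pred)  # [0 0 1 0]
```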


9. Print the Accuracy score and confusion matrix (5 points)

In [193]:
# Confusion matrix for Model 2 
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(testY, y_pred)
print(cm)

[[1980    0]
 [ 520    0]]


In [194]:
accuracy = (cm[0][0]+cm[1][1])/(cm[0][0]+cm[0][1]+cm[1][0]+cm[1][1])
accuracy

0.792
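
The second column of the confusion matrix above is empty: the model never predicts class 1, so the accuracy of 0.792 is simply the share of non-churners in the test set (1980 of 2500). Per-class metrics make this visible. The sketch below hard-codes the matrix printed above and assumes the scikit-learn convention of rows for true classes and columns for predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[1980, 0],
               [520, 0]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# Recall for the churn class: share of actual churners that were flagged.
recall = tp / (tp + fn)

# Precision is undefined when nothing is predicted positive; report 0.0 in that case.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(accuracy, recall, precision)
```

A recall of zero for the churn class means none of the 520 customers who left were identified, which the accuracy figure alone does not reveal.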

In [208]:
# Accuracy for Model 1
results1 = model.evaluate(x=X_test_z, y= testY, batch_size=10)
print('Model1 test loss, test acc:', results1)

# Accuracy for Model 2
results2 = model2.evaluate(x=X_test_z, y= testY, batch_size=10)
print('Model2 test loss, test acc:', results2)



Model1 test loss, test acc: [0.5111983622908592, 0.792]


Model2 test loss, test acc: [0.5107066378593444, 0.792]
