# NNDL - Project 1: Bank Customer Churn Prediction



The case study is from an open source dataset from Kaggle. 

Link to the Kaggle project site:
https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling
 
Given a bank customer, can we build a neural-network classifier that predicts whether they will leave?
 
Case file: bank.csv

The points distribution for this case is as follows:
1. Read the dataset

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [2]:
bank_df = pd.read_csv("Churn_Modelling.csv")
bank_df.shape

(10000, 14)

In [3]:
bank_df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


2. Drop the columns which are unique for all users like IDs (2.5 points)

In [4]:
df = bank_df.drop(columns=['RowNumber','CustomerId','Surname'])
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [5]:
df['Geography'].value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

In [6]:
df['Gender'].value_counts()

Male      5457
Female    4543
Name: Gender, dtype: int64

In [7]:
df['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64
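
The counts above show that the two classes are imbalanced, which gives a useful reference point for the accuracy figures reported later. As a small arithmetic sketch (the variable names are illustrative and not part of the notebook), this is the accuracy a trivial classifier would reach by always predicting `0`:

```python
# Class counts copied from the value_counts() output above.
stayed = 7963   # Exited == 0
churned = 2037  # Exited == 1

total = stayed + churned

# Accuracy of a hypothetical baseline that always predicts "stayed" (0).
baseline_accuracy = stayed / total
# Share of customers who actually churned.
churn_rate = churned / total

print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"churn rate: {churn_rate:.4f}")
```

Any model accuracy should be read against this baseline rather than against 50%.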

In [8]:
df.dtypes

CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [9]:
# Categorical boolean mask
categorical_feature_mask = df.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = df.columns[categorical_feature_mask].tolist()

In [10]:
# import labelencoder
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()

In [11]:
# apply le on categorical feature columns
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
df[categorical_cols].head(10)

Unnamed: 0,Geography,Gender
0,0,0
1,2,0
2,0,0
3,0,0
4,2,0
5,2,1
6,0,1
7,1,0
8,0,1
9,0,1
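
Label encoding maps `Geography` to the integers 0, 1 and 2, which a neural network may read as an ordered numeric scale. A common alternative is one-hot encoding. The following is a minimal NumPy sketch on a hypothetical four-row sample, not the notebook's own preprocessing:

```python
import numpy as np

# Hypothetical sample of the Geography column.
geography = np.array(["France", "Spain", "France", "Germany"])

# np.unique sorts the categories and returns an integer code per row,
# which mirrors the integer labels produced by label encoding.
categories, codes = np.unique(geography, return_inverse=True)

# One-hot encoding: one indicator column per category, no implied order.
one_hot = np.eye(len(categories), dtype=np.float32)[codes]

print(categories)
print(codes)
print(one_hot)
```

With one-hot columns the input dimension of the network would grow, so `input_dim` would need to change accordingly.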


In [12]:
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,0,0,42,2,0.0,1,1,1,101348.88,1
1,608,2,0,41,1,83807.86,1,0,1,112542.58,0
2,502,0,0,42,8,159660.8,3,1,0,113931.57,1
3,699,0,0,39,1,0.0,2,0,0,93826.63,0
4,850,2,0,43,2,125510.82,1,1,1,79084.1,0


3. Distinguish the feature and target set (2.5 points)


In [13]:
y = df['Exited']
X = df.drop(columns ='Exited')

In [14]:
labels = np.array(y).astype('float32')
features = np.array(X).astype('float32')
print(labels)
print(features)

[1. 0. 1. ... 1. 1. 0.]
[[6.1900000e+02 0.0000000e+00 0.0000000e+00 ... 1.0000000e+00
  1.0000000e+00 1.0134888e+05]
 [6.0800000e+02 2.0000000e+00 0.0000000e+00 ... 0.0000000e+00
  1.0000000e+00 1.1254258e+05]
 [5.0200000e+02 0.0000000e+00 0.0000000e+00 ... 1.0000000e+00
  0.0000000e+00 1.1393157e+05]
 ...
 [7.0900000e+02 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00
  1.0000000e+00 4.2085578e+04]
 [7.7200000e+02 1.0000000e+00 1.0000000e+00 ... 1.0000000e+00
  0.0000000e+00 9.2888523e+04]
 [7.9200000e+02 0.0000000e+00 0.0000000e+00 ... 1.0000000e+00
  0.0000000e+00 3.8190781e+04]]


4. Divide the data set into Train and test sets

In [16]:
from sklearn.model_selection import train_test_split, GridSearchCV
from scipy.stats import zscore
from scipy import stats

In [17]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=1)


5. Normalize the train and test data (2.5 points)


In [45]:
X_train_z = tf.math.l2_normalize(X_train) 
X_test_z  = tf.math.l2_normalize(X_test)
X_train_z.shape
X_train_norm = stats.zscore(X_train)
X_test_norm = stats.zscore(X_test)
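
The cell above z-scores the training and test sets independently, so each split is scaled with its own mean and standard deviation (the `l2_normalize` results are not used by the later cells shown here). A common alternative is to estimate the statistics on the training split only and reuse them for the test split. The sketch below uses hypothetical random data and hypothetical variable names to show the idea:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-ins for X_train and X_test (rows = customers).
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Column-wise statistics estimated on the training split only.
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population std (ddof=0), as in scipy.stats.zscore

# Apply the same training statistics to both splits.
X_train_scaled = (X_train_demo - train_mean) / train_std
X_test_scaled = (X_test_demo - train_mean) / train_std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

The test columns will generally not have exactly zero mean and unit variance under this scheme, which is expected.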

In [22]:
trainY = tf.keras.utils.to_categorical(y_train)
testY = tf.keras.utils.to_categorical(y_test)
#trainY = tf.convert_to_tensor(y_train)
#testY = tf.convert_to_tensor(y_test)
print(trainY)
testY

[[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [0. 1.]
 [1. 0.]
 [0. 1.]]


array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [0., 1.],
       [1., 0.],
       [1., 0.]], dtype=float32)

6. Initialize & build the model (7.5 points)

In [28]:
tf.random.set_seed(1)
#Initialize Sequential model
model = tf.keras.models.Sequential()

#Input Layer
model.add(tf.keras.layers.Dense(10, input_dim = 10, activation='relu'))

#Add OUTPUT layer
model.add(tf.keras.layers.Dense(2, activation='sigmoid'))

#Compile the model
model.compile(optimizer='sgd', loss='binary_crossentropy',metrics=['accuracy'])
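
With two sigmoid outputs and `binary_crossentropy`, each output column is scored as an independent binary problem and the loss is averaged over both columns. The following NumPy sketch writes the formula out for hypothetical targets and predictions. The clipping constant `eps` is an assumption added here to avoid `log(0)`:

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps is an assumed clipping constant."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return losses.mean()

# Hypothetical one-hot targets and sigmoid outputs for three customers.
y_true = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_prob = np.array([[0.9, 0.1], [0.6, 0.3], [0.2, 0.7]])

loss = binary_crossentropy(y_true, y_prob)
print(f"binary cross-entropy: {loss:.4f}")
```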

In [50]:
model.fit(X_train_norm,trainY,          
          validation_data=(X_test_norm,testY),
          epochs=30,
          batch_size=10)

Train on 7500 samples, validate on 2500 samples
Epoch 1/30
...
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x23fb0e0d7f0>

In [24]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 10)                110       
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 22        
Total params: 132
Trainable params: 132
Non-trainable params: 0
_________________________________________________________________
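
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch, where the helper `dense_params` is a hypothetical name introduced for illustration:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(10, 10)  # 10 input features -> 10 hidden units
output = dense_params(10, 2)   # 10 hidden units -> 2 output units

print(hidden, output, hidden + output)
```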


7. Optimize the model (5 points)


In [25]:
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import Nadam
from keras.optimizers import sgd
from keras.layers import Dropout
from keras.constraints import maxnorm

Using TensorFlow backend.


Let's first find the best optimizer among 'SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax' and 'Nadam'.


In [61]:
# Function to create model, required for KerasClassifier
def create_model(optimizer='adam'):
    #Initialize Sequential model
    model2 = Sequential()
  
    #Input Layer
    model2.add(Dense(10, input_dim = 10, activation='relu'))
  
    #Add 2nd Hidden layer
    model2.add(Dense(6, activation='relu'))

    #Add Dense layer which provides 1 output after applying sigmoid (Output Layer)
    model2.add(Dense(1, activation='sigmoid'))
  

    
    #Compile the model
    model2.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
  
    return model2

model2 = KerasClassifier(build_fn=create_model, epochs=50, batch_size=10, verbose=0)

In [62]:
# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)

grid = GridSearchCV(estimator=model2, param_grid=param_grid, n_jobs=-1, scoring="accuracy", cv=2)
grid_result = grid.fit(X_train_norm,y_train)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.856667 using {'optimizer': 'SGD'}
0.856667 (0.001467) with: {'optimizer': 'SGD'}
0.853067 (0.004000) with: {'optimizer': 'RMSprop'}
0.832933 (0.011333) with: {'optimizer': 'Adagrad'}
0.855333 (0.000400) with: {'optimizer': 'Adadelta'}
0.853467 (0.001733) with: {'optimizer': 'Adam'}
0.854133 (0.001067) with: {'optimizer': 'Adamax'}
0.851067 (0.001733) with: {'optimizer': 'Nadam'}


# Observations:

The best optimizer found is SGD, with a cross-validated accuracy of 85.67%.

There is no discernible increase in the accuracy of the model.
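
The printed grid search results can be ranked directly to see which optimizer scored highest and how close the others are. A small sketch using the numbers copied from the output above (the dictionary name is illustrative):

```python
# Mean and std of cross-validated accuracy, copied from the grid search output.
cv_results = {
    "SGD": (0.856667, 0.001467),
    "RMSprop": (0.853067, 0.004000),
    "Adagrad": (0.832933, 0.011333),
    "Adadelta": (0.855333, 0.000400),
    "Adam": (0.853467, 0.001733),
    "Adamax": (0.854133, 0.001067),
    "Nadam": (0.851067, 0.001733),
}

# Optimizer with the highest mean accuracy.
best_name = max(cv_results, key=lambda name: cv_results[name][0])
best_mean, best_std = cv_results[best_name]

# Rank all optimizers and show how far each is from the best mean.
ranked = sorted(cv_results.items(), key=lambda kv: kv[1][0], reverse=True)
for name, (mean, std) in ranked:
    gap = best_mean - mean
    print(f"{name:<9} mean={mean:.4f} std={std:.4f} gap_to_best={gap:.4f}")
```

Apart from Adagrad, every optimizer is within about 0.6 percentage points of the best score, which with only two folds is a small margin.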

Note: scikit-learn and Keras represent multiclass targets differently, so we do not apply the categorical transformation to the target variable for this grid search. Using the categorically transformed target raises the error *"ValueError: Classification metrics can't handle a mix of multilabel-indicator and binary targets"*. With GridSearchCV we therefore pass the target variable without the categorical transformation.

Let's find the best learning rate.

In [65]:
# Tune Learning Rate
from tensorflow.keras.optimizers import SGD

In [66]:
# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01):
    #Initialize Sequential model
    model4 = Sequential()
    #Input Layer
    model4.add(Dense(10, input_dim = 10, activation='relu'))
    #Add 2nd Hidden layer
    model4.add(Dense(6, activation='relu'))
    #Add Dense layer which provides 2 outputs after applying sigmoid (Output Layer)
    model4.add(Dense(2, activation='sigmoid'))
    #Compile the model
    optimizer = SGD(lr=learn_rate)
    model4.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model4

# create model
model4 = KerasClassifier(build_fn=create_model, epochs=50, batch_size=20, verbose=0)


In [67]:
# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learn_rate=learn_rate)

grid2 = GridSearchCV(estimator=model4, param_grid=param_grid, n_jobs=1, cv=2)
grid_result2 = grid2.fit(X_train_norm, trainY)

# summarize results
print("Best: %f using %s" % (grid_result2.best_score_, grid_result2.best_params_))
means = grid_result2.cv_results_['mean_test_score']
stds = grid_result2.cv_results_['std_test_score']
params = grid_result2.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.854733 using {'learn_rate': 0.2}
0.797867 (0.001467) with: {'learn_rate': 0.001}
0.834133 (0.016400) with: {'learn_rate': 0.01}
0.853667 (0.003133) with: {'learn_rate': 0.1}
0.854733 (0.000333) with: {'learn_rate': 0.2}
0.851200 (0.003467) with: {'learn_rate': 0.3}


### Observations:

The best learning rate found is 0.2, with a cross-validated accuracy of 85.47%.

This is a slight decrease from the 85.67% obtained in the optimizer search.
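
To build some intuition for why the smallest learning rate scores worst in the grid, here is a toy sketch of plain gradient descent on the one-dimensional loss f(w) = w². This is an illustrative assumption, not the network trained above, and the function name is hypothetical:

```python
def gradient_descent(learn_rate, steps=50, w0=1.0):
    """Minimize the toy loss f(w) = w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w              # derivative of w**2
        w = w - learn_rate * grad   # SGD-style update
    return w

# Same candidate learning rates as in the grid above.
for lr in [0.001, 0.01, 0.1, 0.2, 0.3]:
    print(f"learn_rate={lr}: |w| after 50 steps = {abs(gradient_descent(lr)):.6f}")
```

In this toy setting the smallest rate barely moves from the starting point within 50 steps, while the larger rates get close to the minimum at 0.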

Let's put together the final model.


In [71]:
#Initialize Sequential model
modelF = tf.keras.models.Sequential()

#Input Layer
modelF.add(tf.keras.layers.Dense(10, input_dim = 10, activation='relu'))

#Add OUTPUT layer
modelF.add(tf.keras.layers.Dense(2, activation='sigmoid'))

#Compile the model
modelF.compile(optimizer='sgd', loss='binary_crossentropy',metrics=['accuracy'])
 
modelF.fit(X_train_norm, trainY, 
        validation_data=(X_test_norm, testY), 
        epochs=50,
        batch_size=10)

Train on 7500 samples, validate on 2500 samples
Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x23fbb4ad400>

8. Predict the results using 0.5 as a threshold (5 points) 


In [74]:
y_pred = modelF.predict(X_test_norm)


In [75]:
print(" Prediction: ",y_pred[:10])

 Prediction:  [[0.96068287 0.04164359]
 [0.8932359  0.11107728]
 [0.9180343  0.07989183]
 [0.9247933  0.06762668]
 [0.8731804  0.12428898]
 [0.99569976 0.0046187 ]
 [0.65578705 0.32330742]
 [0.94420373 0.0561364 ]
 [0.76438737 0.24550492]
 [0.95779085 0.04308677]]


In [76]:
y_pred_threshold = modelF.predict_proba(X_test_norm) > 0.5

In [77]:
print(" Prediction with threshold: ",y_pred_threshold[:10])

 Prediction with threshold:  [[ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]
 [ True False]]


Observations:
    
We have generated predictions both as raw outputs and with an explicit 0.5 threshold.

Let's check the accuracy score and confusion matrix for both.

9. Print the Accuracy score and confusion matrix (5 points)

In [78]:
# Accuracy score for predictions without threshold

from sklearn import metrics
print("Accuracy score for predictions with no specified threshold: ", metrics.accuracy_score(testY, y_pred.round()))
print("Accuracy score for predictions with specified threshold 0.5: ", metrics.accuracy_score(testY, y_pred_threshold.round()))

Accuracy score for predictions with no specified threshold:  0.8588
Accuracy score for predictions with specified threshold 0.5:  0.8588
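
When `accuracy_score` receives two-column indicator arrays such as `testY` and `y_pred.round()`, scikit-learn treats them as multilabel input and reports subset accuracy: a row counts as correct only if both columns match. The NumPy sketch below reproduces that rule on hypothetical arrays and contrasts it with comparing a single class label per row:

```python
import numpy as np

# Hypothetical one-hot targets and rounded two-column predictions.
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_hat = np.array([[1, 0], [0, 0], [0, 1], [1, 1]])

# Subset accuracy: a row is correct only if every column matches.
subset_accuracy = np.all(y_true == y_hat, axis=1).mean()

# Label accuracy: compare a single class label per row instead.
# Here a customer is labelled as churned when column 1 is 1.
label_accuracy = (y_true[:, 1] == y_hat[:, 1]).mean()

print(f"subset accuracy: {subset_accuracy:.2f}")
print(f"label accuracy:  {label_accuracy:.2f}")
```

The two numbers can differ whenever the two sigmoid outputs disagree with each other, which is one reason the accuracy score and the `argmax`-based confusion matrix need not line up exactly.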


In [79]:
print ("Confusion Matrix for predictions with no specified threshold")
pd.DataFrame(metrics.confusion_matrix(testY.argmax(axis=1), y_pred.argmax(axis=1)),
                 columns=['pred_neg', 'pred_pos'], index=['neg', 'pos'])

Confusion Matrix for predictions with no specified threshold


Unnamed: 0,pred_neg,pred_pos
neg,1911,69
pos,280,240


In [80]:
print ("Confusion Matrix for predictions with specified threshold 0.5")
pd.DataFrame(metrics.confusion_matrix(testY.argmax(axis=1), y_pred_threshold.argmax(axis=1)),
                 columns=['pred_neg', 'pred_pos'], index=['neg', 'pos'])

Confusion Matrix for predictions with specified threshold 0.5


Unnamed: 0,pred_neg,pred_pos
neg,1920,60
pos,280,240
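
The two confusion matrices above are built with `argmax`, once on the raw outputs and once on the boolean thresholded array. Because the model has two independent sigmoid outputs, the columns of a row need not sum to 1, and the two approaches can disagree. A sketch with hypothetical output rows:

```python
import numpy as np

# Hypothetical two-column sigmoid outputs; the columns need not sum to 1.
probs = np.array([
    [0.96, 0.04],  # clearly class 0
    [0.20, 0.85],  # clearly class 1
    [0.40, 0.45],  # neither column exceeds 0.5
    [0.55, 0.60],  # both columns exceed 0.5
])

# argmax on raw outputs picks the larger column.
labels_from_probs = probs.argmax(axis=1)

# argmax on a boolean array picks the first True, or index 0 if none is True.
labels_from_mask = (probs > 0.5).argmax(axis=1)

# Thresholding only the churn column avoids the ambiguity.
labels_from_churn_column = (probs[:, 1] > 0.5).astype(int)

print(labels_from_probs, labels_from_mask, labels_from_churn_column)
```

On the boolean array, ambiguous rows fall back to class 0, which is consistent with the second matrix showing slightly more predicted negatives.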


In [81]:
from sklearn.metrics import classification_report
print ("Classification Report for predictions with no specified threshold")
print(classification_report(testY, y_pred.round()))

Classification Report for predictions with no specified threshold
              precision    recall  f1-score   support

           0       0.87      0.96      0.92      1980
           1       0.79      0.46      0.58       520

   micro avg       0.86      0.86      0.86      2500
   macro avg       0.83      0.71      0.75      2500
weighted avg       0.86      0.86      0.85      2500
 samples avg       0.86      0.86      0.86      2500





In [82]:
from sklearn.metrics import classification_report
print ("Classification Report for predictions with specified threshold 0.5")
print(classification_report(testY, y_pred_threshold))

Classification Report for predictions with specified threshold 0.5
              precision    recall  f1-score   support

           0       0.87      0.96      0.92      1980
           1       0.79      0.46      0.58       520

   micro avg       0.86      0.86      0.86      2500
   macro avg       0.83      0.71      0.75      2500
weighted avg       0.86      0.86      0.85      2500
 samples avg       0.86      0.86      0.86      2500
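
Precision, recall and F1 for the churn class can also be derived by hand from confusion matrix counts. The sketch below uses the counts from the first confusion matrix above. Small differences from the printed report are possible, because the report is computed from the rounded per-column outputs rather than from `argmax`:

```python
# Counts copied from the first confusion matrix (rows = actual, columns = predicted).
tn, fp = 1911, 69
fn, tp = 280, 240

# Of the customers predicted to churn, how many actually churned.
precision = tp / (tp + fp)
# Of the customers who actually churned, how many were caught.
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The low recall relative to precision shows that the model misses more than half of the customers who actually leave.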



Observations:

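For binary classification the default threshold is 0.5, so `y_pred.round()` and the explicit `> 0.5` comparison give the same accuracy score (0.8588) and the same classification report. Only the confusion matrices differ slightly (1911/69 versus 1920/60 in the first row), because they are built with `argmax`: on the boolean thresholded array, `argmax` returns class 0 whenever neither output or both outputs exceed 0.5.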