
# Task Description

Fit an MLP classifier to the Universal Bank Data

Clean and prepare the data (you’ve done this a few times now)

Fit an MLP classifier to the data

Experiment with different combinations of neurons and hidden layers.

Show at least 3

Analyze performance of each of the three models on validation data

Fit a k-nn model

•use GridSearchCV to find optimal k

•analyze performance of model on validation data

 Discuss findings 

•How does the predictive performance of k-nn model compare to the MLP classifier?

•Which model takes less time to train? Which model is faster at prediction?



In this assignment, the task is to fit an MLP classifier to the Universal Bank data. The goal is to correctly predict which customers are 'personal loan' customers for the bank.

# Import required packages 

In [None]:
!pip install dmba

Collecting dmba
  Downloading https://files.pythonhosted.org/packages/75/06/15c89846de47be5f3522ef0bebd61c62e14afe167877778094a335df5fe4/dmba-0.0.18-py3-none-any.whl (11.8MB)
Installing collected packages: dmba
Successfully installed dmba-0.0.18


In [None]:

import pandas as pd
from sklearn import preprocessing 
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import time
from dmba import classificationSummary
from sklearn.metrics import classification_report
import numpy as np


no display found. Using non-interactive Agg backend


# Load the data direct from GitHub

In [None]:

bank_df = pd.read_csv('https://github.com/timcsmith/MIS536-Public/raw/master/Data/UniversalBank.csv')

Quickly explore the data and fix any obvious problems


Here, I look at the number of columns, check what the column names look like (stripping whitespace and renaming where that makes them easier to work with), and check for occurrences of NaN (missing values).

In [None]:
bank_df.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

# let's replace any spaces in the column names with underscore

In [None]:

bank_df.columns = [s.strip().replace(' ','_') for s in bank_df.columns] 

# Let's check to see if there is a problem with missing values

In [None]:

bank_df.isnull().sum()

ID                    0
Age                   0
Experience            0
Income                0
ZIP_Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64


There are no issues with missing values in this data.

We can also look at the proportion of each class of the target variable. About 9.6% of the records are positive (`Personal_Loan` = 1), but since the sample is fairly large (5,000 records), there are many records of both classes, so this should be fine.



In [None]:
bank_df['Personal_Loan'].value_counts(normalize=True)

0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
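
With only about 9.6% positives, plain accuracy can look good even for a useless model. The sketch below illustrates this on a synthetic label vector with the same 904/96 balance (an assumption for illustration, not the real data): a model that always predicts 0 reaches about 90.4% accuracy while its recall is zero. The variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with the same class balance as Personal_Loan
# (assumption: 1000 rows, 96 positives); this is not the real data.
y_true = np.array([1] * 96 + [0] * 904)

# A "model" that always predicts the majority class (0)
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_recall = recall_score(y_true, y_majority)

print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall:", baseline_recall)
```

This is why the later discussion looks at recall and not only at accuracy.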

# Drop the ID and zip code columns

In [None]:
bank_df = bank_df.drop(columns=['ID', 'ZIP_Code'])

# Create dummy variables for Education

In [None]:
bank_df['Education'] = bank_df['Education'].astype('category')
bank_df = pd.get_dummies(bank_df, prefix_sep='_', drop_first=False)
bank_df.head()

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Mortgage,Personal_Loan,Securities_Account,CD_Account,Online,CreditCard,Education_1,Education_2,Education_3
0,25,1,49,4,1.6,0,0,1,0,0,0,1,0,0
1,45,19,34,3,1.5,0,0,1,0,0,0,1,0,0
2,39,15,11,1,1.0,0,0,0,0,0,0,1,0,0
3,35,9,100,1,2.7,0,0,0,0,0,0,0,1,0
4,35,8,45,4,1.0,0,0,0,0,0,1,0,1,0
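
To make the dummy-variable step easier to follow, here is a minimal sketch on a tiny made-up frame (the values are illustrative, not taken from the bank data). It shows that only the categorical column is expanded and that numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column
toy = pd.DataFrame({"Income": [49, 34, 100], "Education": [1, 1, 2]})

# Marking the column as categorical tells get_dummies to expand it
toy["Education"] = toy["Education"].astype("category")

# One indicator column is created per category level
encoded = pd.get_dummies(toy, prefix_sep="_", drop_first=False)
print(list(encoded.columns))
```

With `drop_first=True` the first level would be dropped, which avoids redundant columns for linear models; for KNN and MLP keeping all levels, as done here, is a reasonable choice.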


# Split dataset into training (70%) and validation (30%) sets

In [None]:

target = 'Personal_Loan'
predictors = list(bank_df.columns)
predictors.remove(target)
X = bank_df[predictors]
y = bank_df[target]
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=.3, random_state=1)
print('Training set:', train_X.shape, 'Validation set:', valid_y.shape)
print("train_X", train_X.shape)
print("train_Y", train_y.shape)
print("valid_X", valid_X.shape)
print("valid_y", valid_y.shape)



Training set: (3500, 13) Validation set: (1500,)
train_X (3500, 13)
train_Y (3500,)
valid_X (1500, 13)
valid_y (1500,)
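
Because the positive class is rare, one optional refinement is a stratified split, which keeps the class balance roughly equal in both partitions. The sketch below uses synthetic data (an assumption, not the bank data) and the `stratify` argument of `train_test_split`; the notebook itself uses a plain random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 9.6% positives (illustrative only)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 96 + [0] * 904)

# stratify=y_demo preserves the class proportions in both partitions
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print("Positive share in training set:", y_tr.mean())
print("Positive share in validation set:", y_va.mean())
```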


# Fit MLPClassifier (Multi-Layer Perceptron Classifier)

In [None]:
param_grid_rand = {
'hidden_layer_sizes': [(450,350,250), (700,350,200,80), (250,275,400)], # this is a list of different tuples
'activation': ['tanh', 'relu'],
'solver': ['sgd', 'adam'],
}

In [None]:
%%time
# combination 1: randomized search over param_grid_rand
start = time.time()

model = MLPClassifier()
randomSearch = RandomizedSearchCV(estimator = model, param_distributions  = param_grid_rand,  cv = 3, verbose=2, n_jobs = -1)
randomSearch.fit(train_X, train_y)

bestgridmodel_mlp_1 = randomSearch.best_estimator_
print('Best parameters found: ', randomSearch.best_params_)


end = time.time()
print("Total Time", end - start)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 12.4min finished


Best parameters found:  {'solver': 'adam', 'hidden_layer_sizes': (450, 350, 250), 'activation': 'tanh'}
Total Time 802.3188631534576
CPU times: user 1min 26s, sys: 37.7 s, total: 2min 4s
Wall time: 13min 22s
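
The log reports 10 candidates because `RandomizedSearchCV` samples `n_iter=10` parameter settings by default. As a quick check, the sketch below counts how many combinations the grid actually contains, using `ParameterGrid` from scikit-learn.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as param_grid_rand above
param_grid_rand = {
    "hidden_layer_sizes": [(450, 350, 250), (700, 350, 200, 80), (250, 275, 400)],
    "activation": ["tanh", "relu"],
    "solver": ["sgd", "adam"],
}

# ParameterGrid enumerates every combination: 3 * 2 * 2
n_combinations = len(ParameterGrid(param_grid_rand))
print("Combinations in the grid:", n_combinations)
```

Since the grid holds 12 combinations, the randomized search with its default settings evaluates at most 10 of them, and without a fixed `random_state` the sampled subset can differ between runs.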


In [None]:
%%time
start = time.time()

sample_X = valid_X.iloc[1]
sample_X = sample_X.values.reshape(1, -1)
print(sample_X)
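# Predict with the fitted search object: `model` itself was never fitted
# (the search fits clones of it), so model.predict raised NotFittedError.
print(randomSearch.predict(sample_X))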

end = time.time()
print("Total Time", end - start)

[[ 35.    9.   45.    3.    0.9 101.    1.    0.    0.    0.    1.    0.
    0. ]]


NotFittedError: ignored
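
The `NotFittedError` comes from calling `predict` on the estimator that was passed into the search. The search objects in scikit-learn fit clones of that estimator, so the original stays unfitted. A minimal sketch on synthetic data, with a small KNN model standing in for the MLP to keep it fast (both are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.validation import check_is_fitted

# Synthetic data, only for demonstration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

base_model = KNeighborsClassifier()
search = GridSearchCV(base_model, {"n_neighbors": [3, 5]}, cv=3)
search.fit(X_demo, y_demo)

# The estimator handed to the search is cloned internally and stays unfitted
try:
    check_is_fitted(base_model)
    base_is_fitted = True
except NotFittedError:
    base_is_fitted = False

print("Original estimator fitted:", base_is_fitted)

# Both of these use the refitted best model
pred_search = search.predict(X_demo[:1])
pred_best = search.best_estimator_.predict(X_demo[:1])
print(pred_search, pred_best)
```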

In [None]:
%%time

start = time.time()
# Use the fitted search object; the bare `model` was never fitted
validation_predictions = randomSearch.predict(valid_X)
end = time.time()
print("Total Time per prediction =", (end - start)/len(valid_X))

print('confusion_matrix:\n ', confusion_matrix(valid_y, validation_predictions))
print('Accuracy Score: ', accuracy_score(valid_y, validation_predictions))
#print('Precision Score: ', precision_score(valid_y, validation_predictions))
print('Recall Score: ', recall_score(valid_y, validation_predictions))

In [None]:
param_grid = {
'hidden_layer_sizes': [(250,350,50), (500,250,100,40), (150,175,100)], # this is a list of different tuples
'activation': ['tanh', 'relu'],
'solver': ['sgd', 'adam'],
}

In [None]:
%%time
# combination 2: grid search over param_grid
start = time.time()

model = MLPClassifier()
gridSearch = GridSearchCV(estimator = model, param_grid = param_grid,  cv = 3, verbose=2, n_jobs = -1)
gridSearch.fit(train_X, train_y)

bestgridmodel_mlp_2 = gridSearch.best_estimator_  # separate name so combination 1 is not overwritten
print('Best parameters found: ', gridSearch.best_params_)

end = time.time()
print("Total Time", end - start)

In [None]:
%%time
start = time.time()

sample_X = valid_X.iloc[1]
sample_X = sample_X.values.reshape(1, -1)
print(sample_X)
print(gridSearch.predict(sample_X))

end = time.time()
print("Total Time", end - start)

In [None]:
%%time

start = time.time()
validation_predictions = gridSearch.predict(valid_X)
end = time.time()
print("Total Time per prediction =", (end - start)/len(valid_X))

print('confusion_matrix:\n ', confusion_matrix(valid_y, validation_predictions))
print('Accuracy Score: ', accuracy_score(valid_y, validation_predictions))
#print('Precision Score: ', precision_score(valid_y, validation_predictions))
print('Recall Score: ', recall_score(valid_y, validation_predictions))

In [None]:

param_grid2 = {
'hidden_layer_sizes': [(110,50,20), (260,200,140, 120), (360,325,300)], 
'activation': ['tanh'],
'solver': ['adam'],
}

In [None]:

%%time
# combination 3: grid search over param_grid2
start = time.time()

model = MLPClassifier()
gridSearch = GridSearchCV(estimator = model, param_grid = param_grid2,  cv = 3, verbose=2, n_jobs = -1)
gridSearch.fit(train_X, train_y)

bestgridmodel_mlp_3 = gridSearch.best_estimator_  # separate name so earlier models are not overwritten
print('Best parameters found: ', gridSearch.best_params_)

end = time.time()
print("Total Time", end - start)

In [None]:
%%time
start = time.time()

sample_X = valid_X.iloc[1]
sample_X = sample_X.values.reshape(1, -1)
print(sample_X)
print(gridSearch.predict(sample_X))

end = time.time()
print("Total Time", end - start)

In [None]:
%%time

start = time.time()
validation_predictions_2 = gridSearch.predict(valid_X)
end = time.time()
print("Total Time per prediction =", (end - start)/len(valid_X))

print('confusion_matrix:\n ', confusion_matrix(valid_y, validation_predictions_2))
print('Accuracy Score: ', accuracy_score(valid_y, validation_predictions_2))
#print('Precision Score: ', precision_score(valid_y, validation_predictions_2))
print('Recall Score: ', recall_score(valid_y, validation_predictions_2))

# KNN

In [None]:

# Create a standard scaler and fit it to the training predictors for the KNN model
scaler = preprocessing.StandardScaler()
scaler.fit(train_X)

# Transform the predictors of training and validation sets
train_X_knn = scaler.transform(train_X)
train_y_knn = train_y
valid_X_knn = scaler.transform(valid_X)
valid_y_knn = valid_y
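
An alternative to scaling by hand is to put the scaler and the classifier into a `Pipeline`, so that during cross-validation the scaler is fitted only on each training fold. The sketch below shows the idea on synthetic data (an assumption); the step names `scale` and `knn` are arbitrary labels chosen here. The same pattern could be applied to the MLP, which the scikit-learn documentation also describes as sensitive to feature scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, only for demonstration
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

# Scaler and classifier are fitted together inside each CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
print("Best k:", best_k)
print("Mean CV accuracy:", search.best_score_)
```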

In [None]:

# let's explore the performance of 5-NN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_X_knn, train_y_knn)
knn_prediction_output = knn.predict(valid_X_knn)
confusion = confusion_matrix(valid_y_knn, knn_prediction_output)
confusion

In [None]:
%%time
start = time.time()
#List Hyperparameters that we want to tune.
leaf_size = list(range(1,500))
n_neighbors = list(range(1,300))
p=[1,2]
#Convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
#Create new KNN object
knn_1 = KNeighborsClassifier()
#Use randomized search (samples 10 parameter combinations by default)
clf = RandomizedSearchCV(knn_1, hyperparameters, cv=10)
#Fit the model
best_model = clf.fit(train_X_knn,train_y_knn)
#Print The value of best Hyperparameters

bestRandomModel = best_model.best_estimator_
print('Best parameters found: ', clf.best_params_)

end = time.time()
print("Total Time", end - start)
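
A note on the search space: according to the scikit-learn documentation, `leaf_size` influences the speed and memory use of the tree-based neighbor search, not which neighbors are found. The sketch below checks this on synthetic data (an assumption) by comparing predictions for two very different leaf sizes with the `kd_tree` algorithm fixed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, only for demonstration
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_fit, y_fit, X_new = X_demo[:300], y_demo[:300], X_demo[300:]

# Fit the same 5-NN model with a small and a large leaf size
preds = {}
for leaf_size in (5, 200):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree", leaf_size=leaf_size)
    knn.fit(X_fit, y_fit)
    preds[leaf_size] = knn.predict(X_new)

same_predictions = np.array_equal(preds[5], preds[200])
print("Identical predictions:", same_predictions)
```

If that holds for the bank data as well, searching over `leaf_size` mainly multiplies the number of fits without changing predictive performance, so a grid over `n_neighbors` (and possibly `p`) alone would be much cheaper.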

In [None]:
%%time

start = time.time()
validation_predictions_knn_1 = bestRandomModel.predict(valid_X_knn)
end = time.time()
print("Total Time per prediction =", (end - start)/len(valid_X_knn))


print('Confusion Matrix: \n', confusion_matrix(valid_y_knn, validation_predictions_knn_1))
print('Accuracy: ', accuracy_score(valid_y_knn, validation_predictions_knn_1))
print('Precision: ', precision_score(valid_y_knn, validation_predictions_knn_1))
print('Recall: ', recall_score(valid_y_knn, validation_predictions_knn_1))

In [None]:
%%time
start = time.time()

#List Hyperparameters that we want to tune.
leaf_size = list(range(350,390))
n_neighbors = list(range(80,120))
p=[1,2]
#Convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
#Create new KNN object
knn_3 = KNeighborsClassifier()
#Use GridSearch
clf_3 = GridSearchCV(knn_3, hyperparameters, cv=10)
#Fit the model
best_model_3 = clf_3.fit(train_X_knn,train_y_knn)
#Print The value of best Hyperparameters
bestRandomModel_3 = best_model_3.best_estimator_
print('Best parameters found: ', clf_3.best_params_)
end = time.time()
print("Total Time", end - start)

In [None]:
%%time

start = time.time()
validation_predictions_knn_3 = bestRandomModel_3.predict(valid_X_knn)
end = time.time()
print("Total Time per prediction =", (end - start)/len(valid_X_knn))


print('Confusion Matrix: \n', confusion_matrix(valid_y_knn, validation_predictions_knn_3))
print('Accuracy: ', accuracy_score(valid_y_knn, validation_predictions_knn_3))
print('Precision: ', precision_score(valid_y_knn, validation_predictions_knn_3))
print('Recall: ', recall_score(valid_y_knn, validation_predictions_knn_3))

In [None]:
%%time
start = time.time()

#List Hyperparameters that we want to tune.
leaf_size = list(range(350,380))
n_neighbors = list(range(60,80))
p=[1,2]
#Convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
#Create new KNN object
knn_4 = KNeighborsClassifier()
#Use GridSearch
clf_4 = GridSearchCV(knn_4, hyperparameters, cv=10)
#Fit the model
best_model_4 = clf_4.fit(train_X_knn,train_y_knn)

#Print The value of best Hyperparameters
bestRandomModel_4 = best_model_4.best_estimator_
print('Best parameters found: ', clf_4.best_params_)
end = time.time()
print("Total Time", end - start)



In [None]:
%%time

start = time.time()
validation_predictions_knn_4 = bestRandomModel_4.predict(valid_X_knn)
end = time.time()
print("Total Time per prediction =", (end - start)/len(valid_X_knn))


print('Confusion Matrix: \n', confusion_matrix(valid_y_knn, validation_predictions_knn_4))
print('Accuracy: ', accuracy_score(valid_y_knn, validation_predictions_knn_4))
print('Precision: ', precision_score(valid_y_knn, validation_predictions_knn_4))
print('Recall: ', recall_score(valid_y_knn, validation_predictions_knn_4))

## Discussion

I tried three hyperparameter searches for each model (neural network and KNN): one with RandomizedSearchCV and two with GridSearchCV.

1. Which model takes less time to train? Which model is faster at prediction?


> Time taken by each model is listed below:
>
> **Neural Network**
>
> Time taken to train:
> - Random search - total: 1min 5s
> - Grid Search 1 - total: 48.9 s
> - Grid Search 2 - total: 10.2 s
>
> Time taken for prediction:
> - Random search - total: 81.7 ms
> - Grid Search 1 - total: 34.8 ms
> - Grid Search 2 - total: 54.1 ms
>
> **KNN**
>
> Time taken to train:
> - Random search - total: 5.67 s
> - Grid Search 1 - total: 27min 9s
> - Grid Search 2 - total: 9min 22s
>
> Time taken for prediction:
> - Random search - total: 206 ms
> - Grid Search 1 - total: 196 ms
> - Grid Search 2 - total: 179 ms


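KNN takes considerably longer to predict than the neural network (roughly 180 to 206 ms versus 35 to 82 ms), so the neural network is faster at prediction. The two KNN grid searches also took far longer than any of the MLP searches; the KNN random search, at 5.67 s, was the exception. Note that these training times cover the whole hyperparameter search, so they also reflect how many candidates each search evaluated, not only the cost of fitting a single model. On these figures, I conclude that the MLP classifier is faster than KNN.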
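
To separate the cost of a single model from the cost of a whole search, fit time and prediction time can be measured directly. The sketch below is one possible way to do that on synthetic data (an assumption); `time_fit_and_predict` is a hypothetical helper, and the small MLP settings are chosen only to keep the example quick.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic data, only for demonstration
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)


def time_fit_and_predict(estimator, X, y):
    """Return (fit_seconds, predict_seconds) for one estimator."""
    start = time.perf_counter()
    estimator.fit(X, y)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    estimator.predict(X)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds


timings = {
    "knn": time_fit_and_predict(KNeighborsClassifier(n_neighbors=5), X_demo, y_demo),
    "mlp": time_fit_and_predict(
        MLPClassifier(hidden_layer_sizes=(20,), max_iter=50, random_state=0),
        X_demo,
        y_demo,
    ),
}

for name, (fit_s, pred_s) in timings.items():
    print(f"{name}: fit {fit_s:.4f} s, predict {pred_s:.4f} s")
```

KNN does almost no work at fit time and defers the neighbor search to prediction, whereas an MLP spends its time in training, so measuring the two phases separately gives a fairer comparison than total search time.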

# 2. How does the predictive performance of k-nn model compare to the MLP classifier?

# 2.1 MLP Classifier
    Random Search 
    confusion_matrix:
      [[1332   19]
      [  20  129]]
    Accuracy Score:  0.974
    Recall Score:  0.8657718120805369
With random search, the MLP classifier does a fine job, with a recall of about 86.6%. Our aim is to raise this score so that the model produces fewer false negatives (FN) and the business does not lose potential customers through wrong predictions.
We will choose the model with the highest recall, since a high recall means the model rarely misjudges a potential personal-loan customer as a non-customer.
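
As a worked check, the scores above can be recomputed directly from the random-search confusion matrix. The sketch assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Random-search MLP confusion matrix from above: [[TN, FP], [FN, TP]]
cm = np.array([[1332, 19],
               [20, 129]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
recall = tp / (tp + fn)           # share of actual loan customers that are found
precision = tp / (tp + fp)        # share of predicted loan customers that are real

print("Accuracy:", accuracy)
print("Recall:", recall)
print("Precision:", precision)
```

The 20 false negatives are the potential loan customers the model misses, which is the quantity recall penalizes.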

    Grid Search 1
    confusion_matrix:
      [[1339   12]
      [  31  118]]
    Accuracy Score:  0.9713333333333334
    Recall Score:  0.7919463087248322
With this grid search, both the accuracy and the recall are lower than with the random search, so this model should not be preferred over the random-search model above.

    Grid Search 2
    confusion_matrix:
      [[1340   11]
      [  36  113]]
    Accuracy Score:  0.9686666666666667
    Recall Score:  0.7583892617449665
Again, this search does not improve the model's performance, so I stopped tuning further. I kept the random-search MLP model and compared it with KNN models tuned by random search and grid search.

# 2.2 KNN 
I initially fitted and ran a KNN model with `n_neighbors=5` (the scikit-learn default), and the confusion matrix came out as:
    Confusion Matrix:
    array([[1347,    4],
       [  62,   87]])

This gives a recall of 87/149 ≈ 0.584, and keeping false negatives low (that is, a high recall) is the requirement of the business model. That is still well below the random-search MLP classifier, so I investigated further with random search and grid search to compare the performance of both classifiers and tuning methods.

#    Random search 

    Confusion Matrix: 
      [[1351    0]
      [ 132   17]]
    Accuracy:  0.912
    Precision:  1.0
    Recall:  0.11409395973154363


#    Grid Search 1

    Confusion Matrix: 
    [[1350    1]
    [ 117   32]]
    Accuracy:  0.9213333333333333
    Precision:  0.9696969696969697
    Recall:  0.21476510067114093
# Grid Search 2

    Confusion Matrix: 
    [[1350    1]
    [ 105   44]]
    Accuracy:  0.9293333333333333
    Precision:  0.9777777777777777
    Recall:  0.2953020134228188

After comparing the confusion matrices for the random search and grid searches 1 and 2 of the KNN classifier, I conclude that this hyperparameter tuning did not improve the KNN model: recall stayed between about 0.11 and 0.30, below the 0.584 of the 5-NN model. One likely reason is that the searches used the default accuracy scoring and that the grid searches only covered large values of k (60 to 119). A grid search over a wider range of parameters might improve the model further, but that is time-consuming, so there is a trade-off between search time and a better recall score.
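
Since recall is the metric of interest, one option is to let the search optimize recall directly through the `scoring` argument and to include small values of k. The sketch below shows this on imbalanced synthetic data (an assumption); the k range is illustrative, not a recommendation for the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Imbalanced synthetic data (about 10% positives), only for demonstration
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Search odd k from 1 to 29 and rank candidates by recall instead of accuracy
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31, 2))},
    scoring="recall",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print("Best k by recall:", best_k)
print("Mean CV recall:", search.best_score_)
```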

Therefore, based on the current investigation, the MLP classifier with random-search parameters performed best. Its training and prediction times were also lower than those of KNN. I therefore recommend the random-search MLP classifier for this business case, where the model should identify potential customers with as little chance as possible of missing one, i.e., with a high recall score.

