<a href="https://colab.research.google.com/github/deepa2909/Best-Classifier---RandomForest-Neural-Network-/blob/main/Wine%20classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description of Wine Dataset and Problem

* The dataset has 13 predictor attributes and 1 target, 'class'.
* The dataset is small: shape (178, 14).
* Models compared: Random Forest and MLP Classifier.
* Accuracy is used as the performance measure because the business context does not make clear how recall and precision would affect the business.

# Import the necessary packages and libraries

In [None]:
import pandas as pd
from sklearn import preprocessing 
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import time
from sklearn.metrics import classification_report
import numpy as np
from sklearn import metrics

# Load the dataset

In [None]:
data_df = pd.read_csv('https://raw.githubusercontent.com/timcsmith/MIS536-Public/master/Data/wine.csv')

data_df


Unnamed: 0,class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
2,1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,3,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740
174,3,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750
175,3,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835
176,3,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840


# Quickly explore the data and fix any obvious problems

Here, I explore the number of columns, see what the column names look like (removing whitespace and renaming where that makes them easier to work with), and check for occurrences of NaN (missing values).

In [None]:
data_df.columns
data_df.shape

(178, 14)

# Let's replace any spaces in the column names with underscores

In [None]:
data_df.columns = [s.strip().replace(' ','_') for s in data_df.columns] 
data_df.columns

Index(['class', 'Alcohol', 'Malic_acid', 'Ash', 'Alcalinity_of_ash',
       'Magnesium', 'Total_phenols', 'Flavanoids', 'Nonflavanoid_phenols',
       'Proanthocyanins', 'Color_intensity', 'Hue',
       'OD280/OD315_of_diluted_wines', 'Proline'],
      dtype='object')

# Let's check to see if there is a problem with missing values

In [None]:
data_df.isnull().sum()

class                           0
Alcohol                         0
Malic_acid                      0
Ash                             0
Alcalinity_of_ash               0
Magnesium                       0
Total_phenols                   0
Flavanoids                      0
Nonflavanoid_phenols            0
Proanthocyanins                 0
Color_intensity                 0
Hue                             0
OD280/OD315_of_diluted_wines    0
Proline                         0
dtype: int64

The data looks clean: no column has null values.
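Beyond missing values, it can help to confirm the column types and how many samples each class has. The following is a minimal sketch, assuming scikit-learn's bundled copy of the wine data (`load_wine`) as an offline stand-in for the CSV loaded above; its target is labelled 0 to 2 rather than 1 to 3, and its column names differ slightly from the CSV's.

```python
from sklearn.datasets import load_wine

# Assumption: load_wine() stands in for the CSV used in this notebook.
wine = load_wine(as_frame=True)
df = wine.frame  # 13 feature columns plus a 'target' column

# Column dtypes: every feature should be numeric
print(df.dtypes.value_counts())

# Class balance: number of samples per class
class_counts = df['target'].value_counts().sort_index()
print(class_counts)

# Feature ranges differ by orders of magnitude (compare proline with hue)
print(df.describe().loc[['min', 'max']].T)
```

The class counts are worth knowing before splitting, and the very different feature ranges matter later for models that are sensitive to scale.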

# Split dataset into training (60%) and validation (40%) sets

In [None]:
# construct datasets for analysis
target = 'class'
predictors = [ 'Alcohol', 'Malic_acid', 'Ash', 'Alcalinity_of_ash',
       'Magnesium', 'Total_phenols', 'Flavanoids', 'Nonflavanoid_phenols',
       'Proanthocyanins', 'Color_intensity', 'Hue',
       'OD280/OD315_of_diluted_wines', 'Proline']
X = data_df[predictors]
y = data_df[target]
print(X)
print(y)

     Alcohol  Malic_acid   Ash  ...   Hue  OD280/OD315_of_diluted_wines  Proline
0      14.23        1.71  2.43  ...  1.04                          3.92     1065
1      13.20        1.78  2.14  ...  1.05                          3.40     1050
2      13.16        2.36  2.67  ...  1.03                          3.17     1185
3      14.37        1.95  2.50  ...  0.86                          3.45     1480
4      13.24        2.59  2.87  ...  1.04                          2.93      735
..       ...         ...   ...  ...   ...                           ...      ...
173    13.71        5.65  2.45  ...  0.64                          1.74      740
174    13.40        3.91  2.48  ...  0.70                          1.56      750
175    13.27        4.28  2.26  ...  0.59                          1.56      835
176    13.17        2.59  2.37  ...  0.60                          1.62      840
177    14.13        4.10  2.74  ...  0.61                          1.60      560

[178 rows x 13 columns]
0  

In [None]:
# create the training set and the validation set
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size=0.4, random_state=1)
print(train_X)
print(valid_X)

     Alcohol  Malic_acid   Ash  ...   Hue  OD280/OD315_of_diluted_wines  Proline
168    13.58        2.58  2.69  ...  0.74                          1.80      750
175    13.27        4.28  2.26  ...  0.59                          1.56      835
118    12.77        3.43  1.98  ...  0.70                          2.12      372
75     11.66        1.88  1.92  ...  1.23                          2.14      428
21     12.93        3.80  2.65  ...  1.03                          3.52      770
..       ...         ...   ...  ...   ...                           ...      ...
133    12.70        3.55  2.36  ...  0.78                          1.29      600
137    12.53        5.51  2.64  ...  0.82                          1.69      515
72     13.49        1.66  2.24  ...  0.98                          2.78      472
140    12.93        2.81  2.70  ...  0.77                          2.31      600
37     13.05        1.65  2.55  ...  1.12                          2.51     1105

[106 rows x 13 columns]
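The split above does not pass `stratify`, so the class proportions in the validation set are left to chance. As a hedged sketch (again assuming `load_wine` as a stand-in for the CSV, with hypothetical variable names), a stratified version of the same 60/40 split could look like this:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: load_wine() stands in for the CSV used in this notebook.
X_demo, y_demo = load_wine(return_X_y=True, as_frame=True)

# Same 60/40 split, but stratified on the class label
tr_X, va_X, tr_y, va_y = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=1, stratify=y_demo)

print(len(tr_X), len(va_X))

# Class proportions in the full data and in the validation set should be close
full_props = y_demo.value_counts(normalize=True).sort_index()
valid_props = va_y.value_counts(normalize=True).sort_index()
print(full_props.round(3))
print(valid_props.round(3))
```

With only 178 rows, stratifying keeps each class represented in roughly the same share in both sets.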
   

# Create an initial 'wide' range of possible hyperparameter values

Here we create a wide range of possible values for each hyperparameter of the random forest model.

In [None]:

# Number of trees in random forest; default is 100
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)]

# Criterion used to guide data splits
criterion = ['gini', 'entropy']

# Maximum number of levels in tree. If None, then nodes are expanded until all leaves are pure or until all 
# leaves contain less than min_samples_split samples.
# default = None
max_depth = [int(x) for x in np.linspace(8, 120, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
# default is 2
min_samples_split = [2, 5, 13]

# Minimum number of samples required at each leaf node
# default = 1 
min_samples_leaf = [1, 2, 5]

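# Number of features to consider at every split
# default was 'auto' (equivalent to 'sqrt') in the scikit-learn version used here;
# 'auto' was removed in scikit-learn 1.3, where 'sqrt' should be used instead
max_features = ['auto']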

# max_leaf_nodes  - Grow trees with max_leaf_nodes in best-first fashion.
# If None then unlimited number of leaf nodes.
# default=None 
max_leaf_nodes = [None]

# min_impurity_decrease - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
# default=0.0
min_impurity_decrease = [0.001, 0.005, 0.01, 0.05, 0.02]

# Method of selecting samples for training each tree
# default = True,  If False, the whole dataset is used to build each tree.
bootstrap = [True]

# Create the random grid
param_grid_random = {'n_estimators': n_estimators,
                      'criterion': criterion,
                      'max_depth': max_depth,
                      'min_samples_split': min_samples_split,
                      'min_samples_leaf' : min_samples_leaf,
                      'max_features': max_features,
                      'max_leaf_nodes' : max_leaf_nodes,
                      'min_impurity_decrease' : min_impurity_decrease,
                      'bootstrap': bootstrap,
                     }
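To see why a randomized search is a reasonable first pass, it helps to count how many combinations the grid above contains. A small sketch, with the per-parameter counts copied by hand from the lists above (the dictionary name is hypothetical):

```python
from math import prod

# Number of candidate values per hyperparameter, mirroring the lists defined above
grid_sizes = {
    'n_estimators': 10,           # 100, 200, ..., 1000
    'criterion': 2,               # gini, entropy
    'max_depth': 12,              # 11 values plus None
    'min_samples_split': 3,
    'min_samples_leaf': 3,
    'max_features': 1,
    'max_leaf_nodes': 1,
    'min_impurity_decrease': 5,
    'bootstrap': 1,
}

total_combinations = prod(grid_sizes.values())
n_iter = 250  # value passed to RandomizedSearchCV in the next cell

print(total_combinations)                    # 10800
print(f"{n_iter / total_combinations:.1%}")  # 2.3%
```

An exhaustive search over 10,800 combinations with 3-fold cross-validation would need 32,400 fits, so sampling 250 candidates first keeps the rough search tractable.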

# Use Randomized Search to narrow the possible range of parameter values

In [None]:
# Use the param_grid_random for an initial "rough" search using Randomized search
rand_f = RandomForestClassifier()

randomSearch = RandomizedSearchCV(estimator = rand_f, param_distributions = param_grid_random, n_iter = 250, cv = 3, verbose=2, random_state=24, n_jobs = -1)
# Fit the random search model
randomSearch.fit(train_X, train_y)
bestRandomModel = randomSearch.best_estimator_
print('Best parameters found: ', randomSearch.best_params_)

Fitting 3 folds for each of 250 candidates, totalling 750 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   28.8s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 644 tasks      | elapsed:  7.2min
[Parallel(n_jobs=-1)]: Done 750 out of 750 | elapsed:  8.3min finished


Best parameters found:  {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.005, 'max_leaf_nodes': None, 'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy', 'bootstrap': True}


# Test the performance of the selected parameters

In [None]:
validation_predictions = bestRandomModel.predict(valid_X)
print('Accuracy Score: ', metrics.accuracy_score(valid_y, validation_predictions))
print('Precision Score: ', metrics.precision_score(valid_y, validation_predictions, average ='weighted'))
print('Recall Score: ', metrics.recall_score(valid_y, validation_predictions, average ='weighted'))
print('Confusion Matrix: \n ', metrics.confusion_matrix(valid_y, validation_predictions))



Accuracy Score:  0.9861111111111112
Precision Score:  0.9865900383141764
Recall Score:  0.9861111111111112
Confusion Matrix: 
  [[28  0  0]
 [ 1 26  0]
 [ 0  0 17]]


# Use knowledge gained from the above two steps to create a new 'narrow' range of possible hyperparameter values

In [None]:
# let's take the best parameters from the random search, and use them as a base for the grid search
param_grid = {'n_estimators': [170, 180, 200, 210, 220],
              'min_samples_split': [2, 3, 5, 7],  
              'min_samples_leaf': [1, 2],
              'min_impurity_decrease': [0.000, 0.005, 0.001, 0.002],
              'max_leaf_nodes': [None], 
              'max_features': ['auto'], 
              'criterion': ['entropy'],
              'bootstrap': [True]}

# Use Grid (exhaustive) search to refine the model

In [None]:
# refine our search using param_grid
rand_f2 = RandomForestClassifier()
# Exhaustive grid search of parameters, using 2-fold cross validation,
# across all 160 combinations in param_grid, using all available cores
gridSearch = GridSearchCV(estimator = rand_f2, param_grid = param_grid,  cv = 2, verbose=2,  n_jobs = -1)
# Fit the exhaustive search model
gridSearch.fit(train_X, train_y)
bestGridModel = gridSearch.best_estimator_
print('Best parameters found: ', gridSearch.best_params_)

Fitting 2 folds for each of 160 candidates, totalling 320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:   37.5s
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:  1.3min finished


Best parameters found:  {'bootstrap': True, 'criterion': 'entropy', 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 210}


# Performance of the model using identified parameters

In [None]:
validation_predictions = bestGridModel.predict(valid_X)
print('Accuracy Score: ', metrics.accuracy_score(valid_y, validation_predictions))
print('Precision Score: ', metrics.precision_score(valid_y, validation_predictions, average ='weighted'))
print('Recall Score: ', metrics.recall_score(valid_y, validation_predictions, average ='weighted'))
print('Confusion Matrix: \n ', metrics.confusion_matrix(valid_y, validation_predictions))

Accuracy Score:  0.9861111111111112
Precision Score:  0.9865900383141764
Recall Score:  0.9861111111111112
Confusion Matrix: 
  [[28  0  0]
 [ 1 26  0]
 [ 0  0 17]]


The exhaustive grid search did not improve on the randomized search: both tuned random forest models produce identical validation scores. Therefore, I am going to stop hyperparameter tuning here and model a neural network classifier.
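With only 72 validation rows, a single misclassified sample moves accuracy by about 1.4 percentage points, so identical scores from two similar forests are not surprising. One way to get a finer comparison is cross-validation. The sketch below is an illustration under stated assumptions, not the procedure used in this notebook: it uses `load_wine` as a stand-in for the CSV, runs on the full dataset, uses hypothetical variable names, and replaces `'auto'` with `'sqrt'` so it also runs on recent scikit-learn versions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: load_wine() stands in for the CSV used in this notebook.
X_demo, y_demo = load_wine(return_X_y=True)

candidates = {
    'default': RandomForestClassifier(random_state=0),
    'grid-search params': RandomForestClassifier(
        n_estimators=210, criterion='entropy', min_samples_split=2,
        min_samples_leaf=1, min_impurity_decrease=0.0,
        max_features='sqrt',  # what 'auto' meant for classifiers
        random_state=0),
}

# Stratified 5-fold CV gives five accuracy estimates per model instead of one
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_means = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='accuracy')
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the two means differ by less than their spread across folds, the tuned parameters are not demonstrably better than the defaults.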

# Neural Network (testing at least a 1-layer, a 2-layer, and a 3-layer model)

# Fit MLPClassifier (Multi-Layer Perceptron Classifier)

* Layer **1**

In [None]:
#combination 1
%%time
start = time.time()

model = MLPClassifier(hidden_layer_sizes=(500,), solver='adam', max_iter=200, verbose=True)
model.fit(train_X, train_y)

end = time.time()
print("Total Time", end - start)

Iteration 1, loss = 27.18164751
Iteration 2, loss = 31.80458214
Iteration 3, loss = 22.90783404
Iteration 4, loss = 9.34850699
Iteration 5, loss = 8.93523795
Iteration 6, loss = 11.45633320
Iteration 7, loss = 13.52499487
Iteration 8, loss = 11.71919024
Iteration 9, loss = 12.10607874
Iteration 10, loss = 9.70387311
Iteration 11, loss = 9.29349660
Iteration 12, loss = 6.04319036
Iteration 13, loss = 2.93023602
Iteration 14, loss = 2.98738302
Iteration 15, loss = 5.59564225
Iteration 16, loss = 4.35205516
Iteration 17, loss = 4.00805677
Iteration 18, loss = 4.37322189
Iteration 19, loss = 2.82970261
Iteration 20, loss = 0.91691556
Iteration 21, loss = 3.06049700
Iteration 22, loss = 2.43984060
Iteration 23, loss = 3.03621676
Iteration 24, loss = 1.67305430
Iteration 25, loss = 1.44117045
Iteration 26, loss = 2.27509120
Iteration 27, loss = 2.10757029
Iteration 28, loss = 1.80738349
Iteration 29, loss = 1.93764862
Iteration 30, loss = 0.90927767
Iteration 31, loss = 0.71584948
Iteration 



In [None]:
%%time
start = time.time()

sample_X = valid_X.iloc[1]
sample_X = sample_X.values.reshape(1, -1)
print(sample_X)
print(model.predict(sample_X))

end = time.time()
print("Total Time", end - start)

[[1.242e+01 1.610e+00 2.190e+00 2.250e+01 1.080e+02 2.000e+00 2.090e+00
  3.400e-01 1.610e+00 2.060e+00 1.060e+00 2.960e+00 3.450e+02]]
[2]
Total Time 0.0037384033203125
CPU times: user 4.13 ms, sys: 5 µs, total: 4.14 ms
Wall time: 3.83 ms


# Test the performance of the selected parameters

In [None]:
%%time
start = time.time()
validation_predictions = model.predict(valid_X)
end = time.time()
print("Total Time per prediction =", (end - start)/len(valid_X))
print('confusion_matrix:\n ', confusion_matrix(valid_y, validation_predictions))
print('Accuracy Score: ', metrics.accuracy_score(valid_y, validation_predictions))
print('Precision Score: ', metrics.precision_score(valid_y, validation_predictions, average ='weighted'))
print('Recall Score: ', metrics.recall_score(valid_y, validation_predictions, average ='weighted'))

Total Time per prediction = 6.091263559129503e-05
confusion_matrix:
  [[27  1  0]
 [ 0 27  0]
 [ 0  2 15]]
Accuracy Score:  0.9583333333333334
Precision Score:  0.9624999999999999
Recall Score:  0.9583333333333334
CPU times: user 13.6 ms, sys: 2.01 ms, total: 15.6 ms
Wall time: 12.7 ms


* Layer 2

In [None]:
#combination 2 
%%time
start = time.time()

model2 = MLPClassifier(hidden_layer_sizes=(500, 250), solver='adam', max_iter=200, verbose=True)
model2.fit(train_X, train_y)

end = time.time()
print("Total Time", end - start)

Iteration 1, loss = 9.67171205
Iteration 2, loss = 52.53655669
Iteration 3, loss = 85.87874096
Iteration 4, loss = 70.43160295
Iteration 5, loss = 35.40925870
Iteration 6, loss = 20.78792628
Iteration 7, loss = 17.57679717
Iteration 8, loss = 28.89047312
Iteration 9, loss = 29.56181524
Iteration 10, loss = 25.26194169
Iteration 11, loss = 30.19915918
Iteration 12, loss = 22.83668427
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Total Time 0.15098047256469727
CPU times: user 180 ms, sys: 82.2 ms, total: 262 ms
Wall time: 151 ms


In [None]:
%%time
start = time.time()

sample_X = valid_X.iloc[1]
sample_X = sample_X.values.reshape(1, -1)
print(sample_X)
print(model2.predict(sample_X))

end = time.time()
print("Total Time", end - start)

[[1.242e+01 1.610e+00 2.190e+00 2.250e+01 1.080e+02 2.000e+00 2.090e+00
  3.400e-01 1.610e+00 2.060e+00 1.060e+00 2.960e+00 3.450e+02]]
[3]
Total Time 0.01052403450012207
CPU times: user 5.41 ms, sys: 5.12 ms, total: 10.5 ms
Wall time: 10.7 ms


# Test the performance of the selected parameters

In [None]:
%%time
start = time.time()
validation_predictions = model2.predict(valid_X)
end = time.time()
print("Total Time per prediction =", (end - start)/len(valid_X))
print('confusion_matrix:\n ', confusion_matrix(valid_y, validation_predictions))
print('Accuracy Score: ', metrics.accuracy_score(valid_y, validation_predictions))
print('Precision Score: ', metrics.precision_score(valid_y, validation_predictions, average ='weighted'))
print('Recall Score: ', metrics.recall_score(valid_y, validation_predictions, average ='weighted'))

Total Time per prediction = 0.0001948972543080648
confusion_matrix:
  [[28  0  0]
 [17 10  0]
 [15  2  0]]
Accuracy Score:  0.5277777777777778
Precision Score:  0.4939814814814814
Recall Score:  0.5277777777777778
CPU times: user 14.8 ms, sys: 9.26 ms, total: 24.1 ms
Wall time: 26.1 ms


  _warn_prf(average, modifier, msg_start, len(result))
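The `_warn_prf` line above is most likely the tail of scikit-learn's warning that precision is ill-defined for a class that was never predicted: in the confusion matrix, the third column is all zeros. As a sketch, the label vectors can be rebuilt from that confusion matrix to show where the reported weighted precision comes from (the variable names are hypothetical; `zero_division=0` makes the 0/0 case explicit and silences the warning):

```python
import numpy as np
from sklearn.metrics import precision_score

# Confusion matrix reported above for the 2-layer model (rows = true, cols = predicted)
cm = np.array([[28, 0, 0],
               [17, 10, 0],
               [15, 2, 0]])
labels = np.array([1, 2, 3])

# Rebuild label vectors that reproduce this confusion matrix
y_true = np.concatenate([np.full(cm[i].sum(), labels[i]) for i in range(3)])
y_pred = np.concatenate([np.repeat(labels, cm[i]) for i in range(3)])

# Per-class precision = correct predictions / all predictions of that class.
# Class 3 has no predictions at all, so its precision is 0/0 and is set to 0.
per_class = precision_score(y_true, y_pred, average=None, zero_division=0)
weighted = precision_score(y_true, y_pred, average='weighted', zero_division=0)

print(per_class)  # 28/60, 10/12, and 0 for class 3
print(weighted)   # matches the Precision Score printed above
```

The model never predicts class 3 and assigns most samples to class 1, which is why accuracy drops to about 0.53.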


* Layer 3

In [None]:
#combination 3 
%%time
start = time.time()

model3 = MLPClassifier(hidden_layer_sizes=(30, 50, 75), solver='adam', max_iter=200, verbose=True)
model3.fit(train_X, train_y)

end = time.time()
print("Total Time", end - start)

Iteration 1, loss = 51.62472700
Iteration 2, loss = 40.60244922
Iteration 3, loss = 35.08389358
Iteration 4, loss = 27.20211813
Iteration 5, loss = 24.38486320
Iteration 6, loss = 18.52900915
Iteration 7, loss = 12.42475267
Iteration 8, loss = 8.92262055
Iteration 9, loss = 5.94895009
Iteration 10, loss = 12.21553959
Iteration 11, loss = 12.41050907
Iteration 12, loss = 8.38865275
Iteration 13, loss = 6.62582523
Iteration 14, loss = 7.22136306
Iteration 15, loss = 5.14765131
Iteration 16, loss = 4.57836204
Iteration 17, loss = 6.05699979
Iteration 18, loss = 3.18943434
Iteration 19, loss = 3.39479086
Iteration 20, loss = 3.83756658
Iteration 21, loss = 3.22484965
Iteration 22, loss = 3.02838375
Iteration 23, loss = 3.18170157
Iteration 24, loss = 1.72145301
Iteration 25, loss = 2.68705062
Iteration 26, loss = 2.11719221
Iteration 27, loss = 1.57866503
Iteration 28, loss = 2.36816531
Iteration 29, loss = 2.66339037
Iteration 30, loss = 1.99575325
Iteration 31, loss = 1.13172635
Iteratio

In [None]:
%%time
start = time.time()

sample_X = valid_X.iloc[1]
sample_X = sample_X.values.reshape(1, -1)
print(sample_X)
print(model3.predict(sample_X))

end = time.time()
print("Total Time", end - start)

[[1.242e+01 1.610e+00 2.190e+00 2.250e+01 1.080e+02 2.000e+00 2.090e+00
  3.400e-01 1.610e+00 2.060e+00 1.060e+00 2.960e+00 3.450e+02]]
[3]
Total Time 0.002755880355834961
CPU times: user 3.11 ms, sys: 1.03 ms, total: 4.14 ms
Wall time: 2.83 ms


# Test the performance of the selected parameters

In [None]:
%%time
start = time.time()
validation_predictions = model3.predict(valid_X)
end = time.time()
print("Total Time per prediction =", (end - start)/len(valid_X))
print('confusion_matrix:\n ', confusion_matrix(valid_y, validation_predictions))
print('Accuracy Score: ', metrics.accuracy_score(valid_y, validation_predictions))
print('Precision Score: ', metrics.precision_score(valid_y, validation_predictions, average ='weighted'))
print('Recall Score: ', metrics.recall_score(valid_y, validation_predictions, average ='weighted'))

Total Time per prediction = 0.00012489491038852267
confusion_matrix:
  [[20  7  1]
 [ 0 22  5]
 [ 0  6 11]]
Accuracy Score:  0.7361111111111112
Precision Score:  0.7773809523809524
Recall Score:  0.7361111111111112
CPU times: user 10 ms, sys: 8.15 ms, total: 18.2 ms
Wall time: 16.4 ms


# Condition 4 - GridSearchCV Hyperparameter tuning

In [None]:

param_grid2 = {
'hidden_layer_sizes': [(110,50,20), (260,200,140, 120), (360,325,300)], 
'activation': ['tanh'],
'solver': ['adam'],
}

In [None]:
#combination 4

%%time
start = time.time()

model = MLPClassifier()
gridSearch = GridSearchCV(estimator = model, param_grid = param_grid2,  cv = 3, verbose=2, n_jobs = -1)
gridSearch.fit(train_X, train_y)

bestgridmodel_mlp_1 = gridSearch.best_estimator_
print('Best parameters found: ', gridSearch.best_params_)

end = time.time()
print("Total Time", end - start)

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:   10.7s finished


Best parameters found:  {'activation': 'tanh', 'hidden_layer_sizes': (360, 325, 300), 'solver': 'adam'}
Total Time 13.401542901992798
CPU times: user 3.65 s, sys: 1.76 s, total: 5.42 s
Wall time: 13.4 s


In [None]:
%%time
start = time.time()

sample_X = valid_X.iloc[1]
sample_X = sample_X.values.reshape(1, -1)
print(sample_X)
print(gridSearch.predict(sample_X))

end = time.time()
print("Total Time", end - start)

[[1.242e+01 1.610e+00 2.190e+00 2.250e+01 1.080e+02 2.000e+00 2.090e+00
  3.400e-01 1.610e+00 2.060e+00 1.060e+00 2.960e+00 3.450e+02]]
[2]
Total Time 0.00531315803527832
CPU times: user 4.94 ms, sys: 1.81 ms, total: 6.74 ms
Wall time: 5.44 ms


# Performance measures

In [None]:

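# Predict with the grid-searched MLP; without this, validation_predictions still holds model3's output
validation_predictions = bestgridmodel_mlp_1.predict(valid_X)
print('confusion_matrix:\n ', confusion_matrix(valid_y, validation_predictions))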
print('Accuracy Score: ', metrics.accuracy_score(valid_y, validation_predictions))
print('Precision Score: ', metrics.precision_score(valid_y, validation_predictions, average ='weighted'))
print('Recall Score: ', metrics.recall_score(valid_y, validation_predictions, average ='weighted'))

confusion_matrix:
  [[20  7  1]
 [ 0 22  5]
 [ 0  6 11]]
Accuracy Score:  0.7361111111111112
Precision Score:  0.7773809523809524
Recall Score:  0.7361111111111112
CPU times: user 8.51 ms, sys: 1.76 ms, total: 10.3 ms
Wall time: 11.8 ms


# Discussion

# Random Forest with RandomizedSearchCV hyperparameter tuning

Validation performance after the randomized search for parameters:

    Accuracy Score:  0.9861111111111112
    Precision Score:  0.9865900383141764
    Recall Score:  0.9861111111111112
    Confusion Matrix: 
      [[28  0  0]
      [ 1 26  0]
      [ 0  0 17]]

The parameters found by randomized search give strong performance for the random forest model. The confusion matrix shows that the model misclassifies only one validation sample (a class 2 wine predicted as class 1) and predicts the rest correctly. Further hyperparameter tuning should therefore be adopted only if it performs better than this.
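The reported scores can be checked directly from the confusion matrix. A short sketch using only NumPy (the variable names are hypothetical):

```python
import numpy as np

# Random forest confusion matrix from above (rows = true class, cols = predicted class)
cm = np.array([[28, 0, 0],
               [1, 26, 0],
               [0, 0, 17]])

accuracy = np.trace(cm) / cm.sum()        # 71 correct out of 72
support = cm.sum(axis=1)                  # true samples per class: 28, 27, 17
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted, per class
recall = np.diag(cm) / support            # correct / actual, per class

# 'weighted' averaging weights each class by its number of true samples
weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)

print(accuracy, weighted_precision, weighted_recall)
```

Support-weighted recall always equals accuracy, which is why the two values printed above coincide.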

# Random Forest with GridSearchCV hyperparameter tuning

With GridSearchCV, the best parameters come out to be:


Best parameters found:  {'bootstrap': True, 'criterion': 'entropy', 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 210}

The performance of the selected parameters:

    Accuracy Score:  0.9861111111111112
    Precision Score:  0.9865900383141764
    Recall Score:  0.9861111111111112
    Confusion Matrix: 
      [[28  0  0]
      [ 1 26  0]
      [ 0  0 17]]

The performance of the random forest model did not change after exhaustive hyperparameter tuning: the accuracy is 98.61% at both stages. Since the business context for this model is not clear (for example, how a high precision score would benefit the business, or how the recall score would affect it), I am going to use accuracy as the performance measure and recommend the model with the higher accuracy.

# MLP Classifier 

* Condition 1, Layer - 1
    Total Time per prediction = 6.091263559129503e-05
    confusion_matrix:
      [[27  1  0]
      [ 0 27  0]
      [ 0  2 15]]
    Accuracy Score:  0.9583333333333334
    Precision Score:  0.9624999999999999
    Recall Score:  0.9583333333333334
    
* Condition 2, Layer - 2
  
    confusion_matrix:
      [[28  0  0]
      [17 10  0]
      [15  2  0]]
    Accuracy Score:  0.5277777777777778
    Precision Score:  0.4939814814814814
    Recall Score:  0.5277777777777778

* Condition 3, Layer - 3
    confusion_matrix:
        [[20  7  1]
        [ 0 22  5]
        [ 0  6 11]]
    Accuracy Score:  0.7361111111111112
    Precision Score:  0.7773809523809524
    Recall Score:  0.7361111111111112

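* Condition 4 - GridSearchCV Hyperparameter tuning

 Performance of the selected parameters (these figures are identical to Condition 3, which suggests they were computed from `model3`'s predictions rather than from the grid-searched model; they should be regenerated from `bestgridmodel_mlp_1.predict(valid_X)` before being read as the tuned MLP's result):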
    confusion_matrix:
      [[20  7  1]
      [ 0 22  5]
      [ 0  6 11]]
    Accuracy Score:  0.7361111111111112
    Precision Score:  0.7773809523809524
    Recall Score:  0.7361111111111112

Comparing the current performance of Random Forest and the MLP classifier, I conclude that Random Forest with randomized hyperparameter tuning does the better job: it gives the highest accuracy, 98.61%.

Although the MLP Classifier usually does a good job, and further tuning starting from the best parameters found in Condition 4 ({'activation': 'tanh', 'hidden_layer_sizes': (360, 325, 300), 'solver': 'adam'}) might yield a higher accuracy, there is a trade-off between time and accuracy. Here, the random forest with randomized parameter search reaches a higher accuracy with less time spent than further tuning of the MLP Classifier would require.
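One inexpensive thing to try before a larger MLP search is feature scaling. The MLPs above were trained on raw features whose ranges differ widely (Proline is in the hundreds to thousands, Hue is close to 1), and multi-layer perceptrons are sensitive to feature scale; `preprocessing` is imported in this notebook but never used. The following is a sketch under stated assumptions, not a result from this notebook: it uses `load_wine` as a stand-in for the CSV, and the hidden layer size of 50 and `max_iter=1000` are arbitrary illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine() stands in for the CSV used in this notebook.
X_demo, y_demo = load_wine(return_X_y=True)
tr_X, va_X, tr_y, va_y = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=1)

# The pipeline fits the scaler on the training rows only,
# then applies the same scaling to the validation rows at predict time.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), solver='adam',
                  max_iter=1000, random_state=1),
)
scaled_mlp.fit(tr_X, tr_y)

scaled_accuracy = scaled_mlp.score(va_X, va_y)
print('Validation accuracy with scaling:', scaled_accuracy)
```

Wrapping the scaler and the classifier in one pipeline also means the same object can be passed to `GridSearchCV` without leaking validation statistics into the scaling step.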

Therefore, I recommend the random forest model with the RandomizedSearchCV parameters.



