# Evaluate ***Scikit-Learn*** classifiers using benchmark datasets.

We shall evaluate several ***Scikit-Learn*** classifiers on benchmark datasets using *k*-fold cross-validation.

*Benchmark datasets used*:
- iris:  https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
- digits: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits
- wine: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html
- breast_cancer: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
- people: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html


*Classifiers evaluated (https://scikit-learn.org/stable/)*:

- Multinomial Naive Bayes: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

- Gaussian Naive Bayes: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

- Decision Tree: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

- Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- ExtraTrees Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

- Logistic Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

- Support Vector Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

- K Nearest Neighbors Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

- Multilayer Perceptron: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html


# 1\. Import libraries

In [1]:
import numpy as np # for computation
import pandas as pd # for data handling
from time import time # to time runs
from sklearn.model_selection import cross_val_score # for model evaluation

# scikit-learn classifiers evaluated (change as desired)
from sklearn.naive_bayes import MultinomialNB, GaussianNB 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# To load sklearn data sets (change as desired)
from sklearn.datasets import load_iris, load_digits, load_wine, load_breast_cancer # toy datasets
from sklearn.datasets import fetch_lfw_people

# 2\. Load datasets
We shall load the benchmark *sklearn* datasets and keep them in a dictionary *DATASETS*. The toy datasets ship with *sklearn*; the *people* dataset (LFW) is downloaded on first use.

In [2]:
# Dictionary of loaded sklearn datasets (each loader is called when the dictionary is built)
DATASETS = {
    'iris': load_iris(),
    'digits': load_digits(),
    'wine': load_wine(),
    'breast_cancer': load_breast_cancer(),
    'people': fetch_lfw_people(min_faces_per_person=70, resize=0.4)}
print('Available data sets: ', ', '.join([d for d in DATASETS]))

Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976006
Downloading LFW data (~200MB): https://ndownloader.figshare.com/files/5976015


Available data sets:  iris, digits, wine, breast_cancer, people
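Each entry in *DATASETS* is a dictionary-like `Bunch` object. As a minimal sketch, using iris only for illustration, the snippet below shows the attributes that the evaluation loop relies on (`data` and `target`) and the `return_X_y=True` shortcut that returns the same arrays directly.

```python
from sklearn.datasets import load_iris

iris = load_iris()  # a Bunch: dict-like container with attribute access
X, y = iris.data, iris.target  # feature matrix and integer class labels
print(X.shape)  # (number of samples, number of input features)
print(y.shape)  # one label per sample
print(list(iris.target_names))  # human-readable class names

# Shortcut when only the arrays are needed
X2, y2 = load_iris(return_X_y=True)
```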


# 3\. Specify classifiers
We shall specify *sklearn* classifiers with desired hyper-parameters in the dictionary *CLASSIFIERS*. Default values of hyper-parameters are used if they are not specified.

In [4]:
# Specify Scikit-Learn models to use
CLASSIFIERS = {} # dictionary of Scikit-Learn classifiers with non-default parameters
CLASSIFIERS['GNB'] = GaussianNB() 
CLASSIFIERS['MNB'] = MultinomialNB()
CLASSIFIERS['DT'] = DecisionTreeClassifier()
CLASSIFIERS['RF'] = RandomForestClassifier()
CLASSIFIERS['ET'] = ExtraTreesClassifier()
CLASSIFIERS['KNN'] = KNeighborsClassifier(algorithm='brute')
CLASSIFIERS['LRM'] = LogisticRegression(max_iter=10000)
CLASSIFIERS['SVM'] = SVC()
CLASSIFIERS['MLP'] = MLPClassifier(max_iter=10000)

print('Available sklearn classifiers:')
for c in CLASSIFIERS:
    print(f'{c} : {CLASSIFIERS[c].__class__.__name__}')

Available sklearn classifiers:
GNB : GaussianNB
MNB : MultinomialNB
DT : DecisionTreeClassifier
RF : RandomForestClassifier
ET : ExtraTreesClassifier
KNN : KNeighborsClassifier
LRM : LogisticRegression
SVM : SVC
MLP : MLPClassifier
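To check which hyper-parameters were set explicitly and which were left at their defaults, every *sklearn* estimator exposes `get_params()`. The short sketch below inspects two of the classifiers configured above; the variable names `lrm` and `knn` are illustrative only.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

lrm = LogisticRegression(max_iter=10000)
knn = KNeighborsClassifier(algorithm='brute')

# get_params() returns a dict of all constructor arguments and their current values
print(lrm.get_params()['max_iter'])     # set explicitly above
print(knn.get_params()['algorithm'])    # set explicitly above
print(knn.get_params()['n_neighbors'])  # not specified, so the default is used
```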


# 4\. Evaluate classifiers on datasets

In [5]:
result = [] # list containing evaluation results

for d in DATASETS: # for each dataset
    
    # Get dataset and print summary
    X, y = DATASETS[d].data, DATASETS[d].target # input features, labels
    nS, nF = X.shape # number of samples, number of input features
    nC = len(np.unique(y)) # number of output classes
    print(80*'=') # print dataset summary
    print(f'Dataset: {d}')
    print(f'Number of samples = {nS}, Number of input features = {nF}, Number of classes = {nC}')
    print(80*'=') # end dataset summary

    for c in CLASSIFIERS: # for each classifier
        model = CLASSIFIERS[c] # get classifier object (cross_val_score clones it for each fold)
        st = time() # start time for training and validation
        score = cross_val_score(model, X, y).mean() # mean CV accuracy (default cv: 5-fold in scikit-learn >= 0.22)
        t = time() - st # time taken for k-fold cross-validation
        result.append([d, c, nS, nF, nC, score, t]) # record results
        # print results for classifier
        print(f'Model: {c}, CV accuracy = {score:0.3f}, Time={t:0.3f} seconds')
    
    print(80*'='+'\n') # done with dataset


Dataset: iris
Number of samples = 150, Number of input features = 4, Number of classes = 3
Model: GNB, CV accuracy = 0.953, Time=0.026 seconds
Model: MNB, CV accuracy = 0.953, Time=0.023 seconds
Model: DT, CV accuracy = 0.967, Time=0.016 seconds
Model: RF, CV accuracy = 0.967, Time=2.250 seconds
Model: ET, CV accuracy = 0.953, Time=1.192 seconds
Model: KNN, CV accuracy = 0.973, Time=0.031 seconds
Model: LRM, CV accuracy = 0.973, Time=0.399 seconds
Model: SVM, CV accuracy = 0.967, Time=0.017 seconds
Model: MLP, CV accuracy = 0.980, Time=4.013 seconds

Dataset: digits
Number of samples = 1797, Number of input features = 64, Number of classes = 10
Model: GNB, CV accuracy = 0.807, Time=0.029 seconds
Model: MNB, CV accuracy = 0.870, Time=0.022 seconds
Model: DT, CV accuracy = 0.787, Time=0.174 seconds
Model: RF, CV accuracy = 0.936, Time=3.232 seconds
Model: ET, CV accuracy = 0.958, Time=2.213 seconds
Model: KNN, CV accuracy = 0.963, Time=0.274 seconds
Model: LRM, CV accuracy = 0.914, Time=
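The loop above relies on the default splitting of `cross_val_score`. As a hedged alternative, not the method used in this notebook, the sketch below makes the number of folds explicit with `StratifiedKFold` and uses `cross_validate`, which also reports per-fold fit and scoring times. The choice of iris, `GaussianNB`, shuffling, and `random_state=0` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Explicit 5-fold stratified splitter; shuffling with a fixed seed makes runs repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cross_validate returns a dict with per-fold 'test_score', 'fit_time' and 'score_time'
res = cross_validate(GaussianNB(), X, y, cv=cv, scoring='accuracy')

print(f"CV accuracy = {res['test_score'].mean():0.3f}")
print(f"Fit time = {res['fit_time'].sum():0.3f} s, score time = {res['score_time'].sum():0.3f} s")
```

Separating fit time from score time can be useful for classifiers such as KNN, where most of the work happens at prediction time.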

# 5\. Comparative performance of classifiers

In [7]:
# Evaluation results maintained as pandas dataframe
cols = ['dataset', 'classifier', 'nbr_samples', 'nbr_features', 'nbr_Classes', 'cv_score', 'time'] # column headers
result_df = pd.DataFrame(result, columns=cols).round(3) # create pandas dataframe (round values to 3 decimal places)
result_df.to_csv('model_evaluation.csv', index=False) # save results in CSV file
result_df # show results

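| | dataset | classifier | nbr_samples | nbr_features | nbr_Classes | cv_score | time |
|---|---|---|---|---|---|---|---|
| 0 | iris | GNB | 150 | 4 | 3 | 0.953 | 0.026 |
| 1 | iris | MNB | 150 | 4 | 3 | 0.953 | 0.023 |
| 2 | iris | DT | 150 | 4 | 3 | 0.967 | 0.016 |
| 3 | iris | RF | 150 | 4 | 3 | 0.967 | 2.25 |
| 4 | iris | ET | 150 | 4 | 3 | 0.953 | 1.192 |
| 5 | iris | KNN | 150 | 4 | 3 | 0.973 | 0.031 |
| 6 | iris | LRM | 150 | 4 | 3 | 0.973 | 0.399 |
| 7 | iris | SVM | 150 | 4 | 3 | 0.967 | 0.017 |
| 8 | iris | MLP | 150 | 4 | 3 | 0.98 | 4.013 |
| 9 | digits | GNB | 1797 | 64 | 10 | 0.807 | 0.029 |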


## Show best 3 models for each dataset

In [8]:
show_cols = ['classifier', 'cv_score', 'time'] # columns to display
for d in DATASETS: # for each dataset
    print(d + 75*'=') # dataset header; best 3 classifiers ranked by cross-validation accuracy, ties broken by shorter time
    df = result_df[result_df.dataset==d].sort_values(by=['cv_score', 'time'], ascending=[False, True])[show_cols]
    print(df.iloc[:3]) # show best 3 models

  classifier  cv_score   time
8        MLP     0.980  4.013
5        KNN     0.973  0.031
6        LRM     0.973  0.399
   classifier  cv_score   time
14        KNN     0.963  0.274
16        SVM     0.963  0.766
13         ET     0.958  2.213
   classifier  cv_score   time
22         ET     0.983  1.029
21         RF     0.972  1.758
18        GNB     0.966  0.028
   classifier  cv_score   time
31         ET     0.963  0.755
30         RF     0.960  1.458
33        LRM     0.951  7.135
   classifier  cv_score     time
42        LRM     0.836  210.208
43        SVM     0.759   27.884
39         RF     0.649   16.519
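The same ranking can be produced without an explicit loop. The sketch below is one possible alternative, assuming a frame with the same `dataset`, `classifier`, `cv_score` and `time` columns. It is built here from a hand-copied subset of the results shown above, so that the example is self-contained. The names `rows`, `df` and `top3` are illustrative.

```python
import pandas as pd

# Subset of the results reported above (not the full result_df)
rows = [
    ['iris', 'MLP', 0.980, 4.013],
    ['iris', 'KNN', 0.973, 0.031],
    ['iris', 'LRM', 0.973, 0.399],
    ['iris', 'DT', 0.967, 0.016],
    ['digits', 'KNN', 0.963, 0.274],
    ['digits', 'SVM', 0.963, 0.766],
    ['digits', 'ET', 0.958, 2.213],
    ['digits', 'RF', 0.936, 3.232],
]
df = pd.DataFrame(rows, columns=['dataset', 'classifier', 'cv_score', 'time'])

# Sort by accuracy (descending), then time (ascending), and keep the first 3 rows per dataset
top3 = (df.sort_values(by=['cv_score', 'time'], ascending=[False, True])
          .groupby('dataset', sort=False)
          .head(3))
print(top3)
```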


# 6\. Next steps

- Identify a shortlist of the better-performing classifiers for each dataset.
- Try to improve the performance of these classifiers by searching for a good set of hyper-parameters for each model.
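
As a starting point for the second step, the sketch below runs a small grid search with `GridSearchCV`. The choice of KNN on iris and the values in `param_grid` are assumptions made for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical search space: candidate values chosen only for illustration
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
}

# Evaluate every combination with 5-fold cross-validation and keep the best one
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print('Best hyper-parameters:', search.best_params_)
print(f'Best mean CV accuracy: {search.best_score_:0.3f}')
```

Note that the best cross-validation score found by a search is an optimistic estimate; a held-out test set or nested cross-validation gives a fairer comparison between tuned models.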