## Week 2 - Different classifiers exercise solutions

Here, we look at building models using the different classifiers (discussed in the lectures) from scikit-learn.

#### Import packages

In [1]:
import numpy as np
import pandas as pd

from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB   
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# model utils
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector as selector

In [2]:
# Input file - this can be obtained by running 'Tutorial 2A - Introduction to machine-learning with scikit-learn.ipynb'
file_adult_census = 'adult_census_normalized_encoded.csv'

#### Load data

In [3]:
df = pd.read_csv(file_adult_census)
df.head()

Unnamed: 0,age,education-num,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,...,native-region_Asia_East,native-region_Central_America,native-region_Europe_East,native-region_Europe_West,native-region_North_America,native-region_South_America,capital-gain,capital-loss,hours-per-week,income>50K
0,0.30137,13,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0
1,0.452055,13,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
2,0.287671,9,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0
3,0.493151,7,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0
4,0.150685,13,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0


Here, we use the feature columns to predict the outcome of the target variable, `income>50K`.

In [4]:
df.dtypes

age                                     float64
education-num                             int64
workclass_Federal-gov                   float64
workclass_Local-gov                     float64
workclass_Never-worked                  float64
workclass_Private                       float64
workclass_Self-emp-inc                  float64
workclass_Self-emp-not-inc              float64
workclass_State-gov                     float64
workclass_Without-pay                   float64
marital-status_Divorced                 float64
marital-status_Married-AF-spouse        float64
marital-status_Married-civ-spouse       float64
marital-status_Married-spouse-absent    float64
marital-status_Never-married            float64
marital-status_Separated                float64
marital-status_Widowed                  float64
occupation_Adm-clerical                 float64
occupation_Armed-Forces                 float64
occupation_Craft-repair                 float64
occupation_Exec-managerial              

In [5]:
# Select the columns to treat as categorical data
categorical_columns = df[['education-num', 'capital-gain','capital-loss','hours-per-week']].columns
# These columns were chosen as categories because they hold
# discrete values rather than continuous measurements

# Convert data type to category for these columns
for column in categorical_columns:
    df[column] = df[column].astype('category')  

Converted *'education-num'*, *'capital-gain'*, *'capital-loss'* & *'hours-per-week'* to categorical type.

In [6]:
target = df.pop('income>50K')
target.head()

0    0
1    0
2    0
3    0
4    0
Name: income>50K, dtype: int64

In [7]:
df.head()

Unnamed: 0,age,education-num,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,...,native-region_Asia_Central,native-region_Asia_East,native-region_Central_America,native-region_Europe_East,native-region_Europe_West,native-region_North_America,native-region_South_America,capital-gain,capital-loss,hours-per-week
0,0.30137,13,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.452055,13,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.287671,9,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.493151,7,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,0.150685,13,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


#### Train-test split

### 1. Changing the train-test split proportion

In [8]:
# splitting the data as 60% training set and the remaining 40% as testing set
X_train, X_test, y_train, y_test = train_test_split(df, target, train_size=0.6, random_state=1)

#### *k* Nearest Neighbour

The default distance metric for k-NN in scikit-learn is Euclidean distance (Minkowski distance with `p=2`).  
In some circumstances other metrics (or measures) will be more appropriate, for instance correlation.
Refer to the <a href=https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification>documentation</a>.
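
As a rough illustration of swapping the distance metric, the sketch below compares three values of the `metric` parameter of `KNeighborsClassifier`. It runs on synthetic data from `make_classification`, which is an assumption made so that the snippet is self-contained; the names `X_demo`, `y_demo` and `metric_scores` are hypothetical and the scores say nothing about the adult census data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the adult census dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.6, random_state=1)

# Fit one k-NN model per distance metric and record its test accuracy
metric_scores = {}
for metric in ['euclidean', 'manhattan', 'cosine']:
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    clf.fit(Xd_train, yd_train)
    metric_scores[metric] = accuracy_score(yd_test, clf.predict(Xd_test))
    print(metric, metric_scores[metric])
```

Which metric works best depends on the data, so it is usually worth treating it as one more hyper-parameter to compare.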

In [9]:
knn_clf = KNeighborsClassifier() 
# fit the KNN model on the adult census training data
knn_clf.fit(X_train, y_train)

KNeighborsClassifier()

In [10]:
# predict on the training dataset 
y_predicted = knn_clf.predict(X_train)
accuracy_score(y_train, y_predicted)

0.8721435668156979

In [11]:
# predict whether an individual's income exceeds 50K on the test data using the trained KNN model
y_predicted = knn_clf.predict(X_test)
accuracy_score(y_test, y_predicted)

0.8175298062593145

#### Decision Trees

Decision trees are non-parametric models: they are not controlled by a mathematical decision function and do not have weights or an intercept to be optimized.
Decision trees partition the feature space by considering a single feature at a time. Refer to the <a href=https://scikit-learn.org/stable/modules/tree.html>documentation</a>.
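
To make the "one feature at a time" behaviour concrete, the sketch below fits a shallow tree and prints its rules with `export_text`. The data is synthetic and the feature names `f0` to `f3` are hypothetical placeholders; only the shape of the output matters here.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four features (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical names

# A shallow tree keeps the printed rules short
tree_demo = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
tree_demo.fit(X_demo, y_demo)

# Each printed split tests exactly one feature against a threshold
rules = export_text(tree_demo, feature_names=feature_names)
print(rules)
```

Every non-leaf line of the printed rules has the form `feature <= threshold` or `feature > threshold`, so each split cuts the space along a single axis.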

In [12]:
dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3)
# fit the Decision Tree model on the adult census training data
dtc.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=3)

In [13]:
# predict on the training dataset 
y_predicted = dtc.predict(X_train)
accuracy_score(y_train, y_predicted)

0.8114133134624938

In [14]:
# predict whether an individual's income exceeds 50K on the test data using the Decision Tree model
y_predicted = dtc.predict(X_test)
accuracy_score(y_test, y_predicted)

0.8139903129657228

#### Naive Bayes

A ranking classifier is a classifier that can rank a test set in order of confidence for a given classification outcome.  
Naive Bayes is a ranking classifier because the predicted ‘probability’ can be used as a confidence measure for ranking. Refer to the <a href=https://scikit-learn.org/stable/modules/naive_bayes.html>documentation</a>.
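
One possible way to obtain such a ranking is to sort the test instances by the class probability returned by `predict_proba`. The sketch below does this on synthetic data; the dataset and the variable names (`nb_demo`, `scores`, `ranking`) are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption, not the adult census dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.6, random_state=1)

nb_demo = GaussianNB().fit(Xd_train, yd_train)

# Probability assigned to class 1 for every test instance
pos_col = list(nb_demo.classes_).index(1)
scores = nb_demo.predict_proba(Xd_test)[:, pos_col]

# Indices of the test instances, most confident "class 1" first
ranking = np.argsort(-scores)
print('Five highest-ranked scores:', scores[ranking[:5]])
```

Note that Naive Bayes probabilities are often poorly calibrated, so they are more trustworthy as a ranking than as literal probabilities.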

In [15]:
gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)

GaussianNB()

In [16]:
# predict on the training dataset 
y_predicted = gnb_clf.predict(X_train)
accuracy_score(y_train, y_predicted)

0.5703551912568307

In [17]:
# predict whether an individual's income exceeds 50K on the test data using the Naive Bayes model
y_predicted = gnb_clf.predict(X_test)
accuracy_score(y_test, y_predicted)

0.5576564828614009

**NOTE:** The performance of a model changes when the train-test split of the dataset changes, so we will learn another technique for evaluating model performance in the coming weeks.
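
The effect of the split can be seen by repeating the experiment with different `train_size` and `random_state` values. This is a minimal sketch on synthetic data (an assumption, as are the names `X_demo`, `y_demo` and `split_scores`); the exact numbers will differ on the adult census data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the adult census dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Same model, different splits: record the test accuracy for each split
split_scores = {}
for train_size in [0.6, 0.8]:
    for seed in [1, 2, 3]:
        Xd_train, Xd_test, yd_train, yd_test = train_test_split(
            X_demo, y_demo, train_size=train_size, random_state=seed)
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        clf.fit(Xd_train, yd_train)
        split_scores[(train_size, seed)] = accuracy_score(yd_test, clf.predict(Xd_test))

for (train_size, seed), acc in split_scores.items():
    print('train_size =', train_size, 'random_state =', seed, 'accuracy =', acc)
```

The spread of the printed accuracies gives a feel for how much a single train-test split can mislead.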

### 2. Change the model's parameters / hyper-parameters / type

#### a. *k* Nearest Neighbour

In [18]:
for n in [2, 5, 10, 12, 15]:
    print('\nn_neighbors = ', n)
    knn_clf = KNeighborsClassifier(n_neighbors=n) 
    # fit the KNN model on the adult census training data
    knn_clf.fit(X_train, y_train)

    # predict on the training dataset 
    y_predicted = knn_clf.predict(X_train)
    print('Prediction performance on training set: ', accuracy_score(y_train, y_predicted))
        
    # predict on the test dataset 
    y_predicted = knn_clf.predict(X_test)
    print('Prediction performance on testing set: ', accuracy_score(y_test, y_predicted))


n_neighbors =  2
Prediction performance on training set:  0.8901515151515151
Prediction performance on testing set:  0.7969448584202683

n_neighbors =  5
Prediction performance on training set:  0.8721435668156979
Prediction performance on testing set:  0.8175298062593145

n_neighbors =  10
Prediction performance on training set:  0.8554396423248882
Prediction performance on testing set:  0.8243293591654247

n_neighbors =  12
Prediction performance on training set:  0.8524590163934426
Prediction performance on testing set:  0.8265648286140089

n_neighbors =  15
Prediction performance on training set:  0.8516517635370094
Prediction performance on testing set:  0.8295454545454546


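**NOTE:** Choosing a different number of neighbours changes the accuracy of the model on both the training and testing datasets. In this run, as `n_neighbors` increases from 2 to 15, the training accuracy falls (from about 0.890 to 0.852) while the testing accuracy rises (from about 0.797 to 0.830).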

#### b. Decision Trees

In [19]:
for depth in [2, 5, 7]:
    print('\nMax depth = ', depth)
    for samples_in_leaf in [2, 3, 5, 10]:
        print('\nMin samples leaf = ', samples_in_leaf)
        dtc = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=samples_in_leaf) 
        # fit the decision tree model on the adult census training data
        dtc.fit(X_train, y_train)

        # predict on the training dataset 
        y_predicted = dtc.predict(X_train)
        print('Prediction performance on training set: ', accuracy_score(y_train, y_predicted))

        # predict on the test dataset 
        y_predicted = dtc.predict(X_test)
        print('Prediction performance on testing set: ', accuracy_score(y_test, y_predicted))


Max depth =  2

Min samples leaf =  2
Prediction performance on training set:  0.8111649279682066
Prediction performance on testing set:  0.8145491803278688

Min samples leaf =  3
Prediction performance on training set:  0.8111649279682066
Prediction performance on testing set:  0.8145491803278688

Min samples leaf =  5
Prediction performance on training set:  0.8111649279682066
Prediction performance on testing set:  0.8145491803278688

Min samples leaf =  10
Prediction performance on training set:  0.8111649279682066
Prediction performance on testing set:  0.8145491803278688

Max depth =  5

Min samples leaf =  2
Prediction performance on training set:  0.8292970690511674
Prediction performance on testing set:  0.8258196721311475

Min samples leaf =  3
Prediction performance on training set:  0.8292970690511674
Prediction performance on testing set:  0.8257265275707899

Min samples leaf =  5
Prediction performance on training set:  0.8292970690511674
Prediction performance on testin

**NOTE:** When `max_depth` is 2, changing `min_samples_leaf` has no impact on the accuracy of the model on either the training or the testing dataset in this case. However, increasing `max_depth` improves the accuracy of the model on the training dataset, while the accuracy on the testing dataset changes only slightly (about 0.815 at depth 2 versus about 0.826 at depth 5 in the output shown).

#### c. Naive Bayes

In [20]:
for classifier in [GaussianNB(), BernoulliNB(), MultinomialNB()]:
    print('\nModel = ', classifier.__class__.__name__)
    # fit the Naive Bayes model on the adult census training data
    classifier.fit(X_train, y_train)

    # predict on the training dataset 
    y_predicted = classifier.predict(X_train)
    print('Prediction performance on training set: ', accuracy_score(y_train,  y_predicted))
        
    # predict on the test dataset 
    y_predicted = classifier.predict(X_test)
    print('Prediction performance on testing set: ', accuracy_score(y_test,  y_predicted))


Model =  GaussianNB
Prediction performance on training set:  0.5703551912568307
Prediction performance on testing set:  0.5576564828614009

Model =  BernoulliNB
Prediction performance on training set:  0.7455911574764034
Prediction performance on testing set:  0.7449701937406855

Model =  MultinomialNB
Prediction performance on training set:  0.7826626924987581
Prediction performance on testing set:  0.7857675111773472


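**NOTE:** Multinomial Naive Bayes achieves the highest accuracy of the three Naive Bayes classifiers on this dataset (about 0.786 on the testing set, compared with about 0.745 for Bernoulli and 0.558 for Gaussian).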

In this tutorial, we learned that:
- On the same dataset, different classifiers have different accuracy scores. 
- On the same dataset and with the same classifier, changing the parameters, hyper-parameters or type of the classifier may or may not improve the performance of the model. 