# GTI770 - Systèmes Intelligents et Apprentissage Machine

### Alessandro L. Koerich

## Notebook Jupyter - 11_Bagging_AdaBoost_UppercaseHandwriting_26Classes

#### July 2018

In [59]:
from sklearn.preprocessing import MinMaxScaler
from sklearn import tree
import numpy as np

In [60]:
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

In [61]:
# Load data from file
# NIST Train 26 Classes Uppercase Handwritten Characters
# 37,440 samples for training
# 12,092 samples for validation
# 11,941 samples for testing
# 108-dimensional feature vectors
# 26 classes (A-Z uppercase characters)

TrainData = np.loadtxt('CSV_Files/NISTUpperHandwritten_train.csv', delimiter=' ', dtype=np.str)
ValidData = np.loadtxt('CSV_Files/NISTUpperHandwritten_valid.csv', delimiter=' ', dtype=np.str)
TestData  = np.loadtxt('CSV_Files/NISTUpperHandwritten_test.csv' , delimiter=' ', dtype=np.str)

Xtrain =TrainData[0:37439,0:108].astype(np.float)
Ytrain =TrainData[0:37439,108:134].astype(np.int)

Xvalid = ValidData[0:12091,0:108].astype(np.float)
Yvalid = ValidData[0:12091,108:134].astype(np.int)

Xtest  = TestData[0:11940,0:108].astype(np.float)
Ytest  = TestData[0:11940,108:134].astype(np.int)

In [62]:
from numpy import argmax
Ytrain2 = argmax(Ytrain, axis=1)
Yvalid2 = argmax(Yvalid, axis=1)
Ytest2  = argmax(Ytest , axis=1)

In [63]:
# normalize the data
scaler = MinMaxScaler(feature_range=(0, 1))

Xtrain = scaler.fit_transform(Xtrain)
Xvalid = scaler.fit_transform(Xvalid)
Xtest  = scaler.fit_transform(Xtest)

In [64]:
Ytrain2

array([22, 22, 20, ..., 10,  9, 25])

In [65]:
num_classes = Ytrain.shape[1]
input_dim   = Xtrain.shape[1]

In [66]:
def DT_model():
    print("Decision Tree\n")
    # create model
    model = tree.DecisionTreeClassifier(criterion='entropy', 
                                        max_depth=10, min_samples_leaf=10, 
                                        min_samples_split=20 )
    return model

In [67]:
# Build the model
# Choose one at each time
model = DT_model()

Decision Tree



In [68]:
# Fit the model (TRAIN)
model.fit(Xtrain, Ytrain2)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=20,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [69]:
# Use the model to predict the class of samples
# Notice that we are testing on the 3 data splits

Ytrain_pred = model.predict(Xtrain)
Yvalid_pred = model.predict(Xvalid)
Ytest_pred  = model.predict(Xtest)

In [70]:
# You can also predict the probability of each class

Ytrain_pred_prob = model.predict_proba(Xtrain)
Yvalid_pred_prob = model.predict_proba(Xvalid)
Ytest_pred_prob  = model.predict_proba(Xtest)

In [71]:
# Evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# Final evaluation of the model (On the Training, Validation or Test dataset)
scores = accuracy_score(Ytrain2, Ytrain_pred )
scores2 = accuracy_score(Yvalid2, Yvalid_pred )
scores3 = accuracy_score(Ytest2, Ytest_pred )


print("Correct classification rate for the training dataset   = "+str(scores*100)+"%")
print("Correct classification rate for the validation dataset = "+str(scores2*100)+"%")
print("Correct classification rate for the test dataset       = "+str(scores3*100)+"%")

Correct classification rate for the training dataset   = 83.8617484441358%
Correct classification rate for the validation dataset = 77.85956496567695%
Correct classification rate for the test dataset       = 74.69011725293132%


In [72]:
from sklearn.metrics import classification_report
target_names = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
print( classification_report(Yvalid2, Yvalid_pred, target_names=target_names))
# This works, but we have labels with no predicted samples

              precision    recall  f1-score   support

           A       0.71      0.68      0.69       449
           B       0.72      0.63      0.67       439
           C       0.87      0.88      0.87       525
           D       0.75      0.77      0.76       466
           E       0.74      0.85      0.79       390
           F       0.80      0.91      0.85       428
           G       0.85      0.75      0.80       425
           H       0.67      0.86      0.75       421
           I       0.94      0.77      0.85       740
           J       0.80      0.90      0.85       429
           K       0.65      0.76      0.70       411
           L       0.81      0.90      0.85       500
           M       0.74      0.58      0.65       450
           N       0.58      0.68      0.62       471
           O       0.86      0.75      0.80       472
           P       0.93      0.74      0.82       465
           Q       0.86      0.72      0.79       450
           R       0.62    

- HYPERPARAMETER OPTIMIZATION FOR DECISION TREE

OK, but we didn't optimize the parameters of the Decision Tree, such as:

1) max_depth

2) max_leaf_nodes

3) min_impurity_decrease

4) min_impurity_split

5) min_samples_leaf

6) min_samples_split

7) Etc...

But now, we already have a pre-defined VALIDATION dataset! So, we don't need to split the dataset and use cross-validation.

We will use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, etc. as well.

https://pypi.org/project/hypopt/1.0.0/

In [73]:
model.get_params().keys()

dict_keys(['class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'presort', 'random_state', 'splitter'])

In [74]:
# Set the parameters by cross-validation
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch

tuned_parameters = [{'max_depth': [15, 30, 60], 
                     'min_samples_leaf': [5, 10], 
                     'min_samples_split': [5, 10, 20]}]

In [76]:
# Grid-search all parameter combinations using a validation set.

tuned_model = GridSearch(model = tree.DecisionTreeClassifier(criterion='entropy'),param_grid = tuned_parameters)
tuned_model.fit(Xtrain, Ytrain2, Xvalid, Yvalid2)

print("Best parameters set found on validation set:")
print()
print('Test Score for Optimized Parameters:', tuned_model.score(Xvalid, Yvalid2))

Best parameters set found on validation set:

Test Score for Optimized Parameters: 0.8038210239020759


In [77]:
print('We can view the best performing parameters and their scores.')
for z in tuned_model.get_param_scores()[:2]:
    p, s = z
    print(p)
    print('Score:', s)
print()
print('Verify that the lowest scoring parameters make sense.')
for z in tuned_model.get_param_scores()[-2:]:
    p, s = z
    print(p)
    print('Score:', s)

We can view the best performing parameters and their scores.
{'max_depth': 15, 'min_samples_leaf': 5, 'min_samples_split': 5}
Score: 0.8038210239020759
{'max_depth': 15, 'min_samples_leaf': 5, 'min_samples_split': 10}
Score: 0.8038210239020759

Verify that the lowest scoring parameters make sense.
{'max_depth': 60, 'min_samples_leaf': 5, 'min_samples_split': 20}
Score: 0.799106773633281
{'max_depth': 30, 'min_samples_leaf': 5, 'min_samples_split': 20}
Score: 0.799106773633281


In [78]:
# Use the tunes model to predict the class of samples

Ytrain_pred = tuned_model.predict(Xtrain)
Yvalid_pred = tuned_model.predict(Xvalid)
Ytest_pred  = tuned_model.predict(Xtest)

In [79]:
# You can also predict the probability of each class

Ytrain_pred_prob = tuned_model.predict_proba(Xtrain)
Yvalid_pred_prob = tuned_model.predict_proba(Xvalid)
Ytest_pred_prob  = tuned_model.predict_proba(Xtest)

In [80]:
# Final evaluation of the model (On the Training, Validation or Test dataset)
scores_tuned = accuracy_score(Ytrain2, Ytrain_pred )
print("Correct classification rate for the training dataset (first model) = "+str(scores*100)+"%")
print("Correct classification rate for the training dataset (best model)  = "+str(scores_tuned*100)+"%")
print()
scores_tuned = accuracy_score(Yvalid2, Yvalid_pred )
print("Correct classification rate for the validation dataset (first model) = "+str(scores2*100)+"%")
print("Correct classification rate for the validation dataset (best model)  = "+str(scores_tuned*100)+"%")
print()
scores_tuned = accuracy_score(Ytest2, Ytest_pred )
print("Correct classification rate for the test dataset (first model) = "+str(scores3*100)+"%")
print("Correct classification rate for the test dataset (best model)  = "+str(scores_tuned*100)+"%")

Correct classification rate for the training dataset (first model) = 83.8617484441358%
Correct classification rate for the training dataset (best model)  = 92.54787788135367%

Correct classification rate for the validation dataset (first model) = 77.85956496567695%
Correct classification rate for the validation dataset (best model)  = 80.38210239020759%

Correct classification rate for the test dataset (first model) = 74.69011725293132%
Correct classification rate for the test dataset (best model)  = 76.44891122278057%


- BAGGING with DECISION TREES

Bagging methods are offered as a unified BaggingClassifier meta-estimator, taking as input a user-specified base estimator along with parameters specifying the strategy to draw random subsets.

In particular, to control the size of the subsets (in terms of samples and features):

- max_samples and

- max_features

and

- bootstrap and

- bootstrap_features

control whether samples and features are drawn with or without replacement.

When using a subset of the available samples the generalization accuracy can be estimated with the out-of-bag samples by setting:

- oob_score=True.

More details in: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier

In [81]:
from sklearn.ensemble import BaggingClassifier

In [82]:
# Using a bagging of Decision Trees

bagging = BaggingClassifier(tree.DecisionTreeClassifier(criterion='entropy'),
                            max_samples  = 0.5,
                            # max_features = 0.5,
                            n_estimators = 10,
                            n_jobs = 8,
                            bootstrap = False)

In [83]:
# Fit the model (TRAIN)
bagging.fit(Xtrain, Ytrain2)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=False, bootstrap_features=False, max_features=1.0,
         max_samples=0.5, n_estimators=10, n_jobs=8, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [84]:
# Use the model to predict the class of samples

Ytrain_pred_bagging = bagging.predict(Xtrain)
Yvalid_pred_bagging = bagging.predict(Xvalid)
Ytest_pred_bagging  = bagging.predict(Xtest)

In [85]:
# Final evaluation of the model (On the Training, Validation or Test dataset)

scores_bagging  = accuracy_score(Ytrain2, Ytrain_pred_bagging )
scores2_bagging = accuracy_score(Yvalid2, Yvalid_pred_bagging )
scores3_bagging = accuracy_score(Ytest2, Ytest_pred_bagging )

print("Correct classification rate for the training dataset   = "+str(scores_bagging*100)+"%")
print("Correct classification rate for the validation dataset = "+str(scores2_bagging*100)+"%")
print()
print("Correct classification rate for the test dataset (first model) = "+str(scores3*100)+"%")
print("Correct classification rate for the test dataset (best model)  = "+str(scores_tuned*100)+"%")
print("Correct classification rate for the test dataset (bagging)     = "+str(scores3_bagging*100)+"%")

Correct classification rate for the training dataset   = 99.10254013194798%
Correct classification rate for the validation dataset = 87.98279712182615%

Correct classification rate for the test dataset (first model) = 74.69011725293132%
Correct classification rate for the test dataset (best model)  = 76.44891122278057%
Correct classification rate for the test dataset (bagging)     = 85.80402010050251%


 - ADABOOST with DECISION TREES

The number of weak learners is controlled by the parameter n_estimators.

The learning_rate parameter controls the contribution of the weak learners in the final combination.

By default, weak learners are decision stumps.

Different weak learners can be specified through the base_estimator parameter.

The main parameters to tune to obtain good results are n_estimators and the complexity of the base estimators (e.g., its depth max_depth or minimum required number of samples at a leaf min_samples_leaf in case of decision trees).

More details in: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
# Using a AdaBoosting of Decision Trees

AdaB= AdaBoostClassifier(tree.DecisionTreeClassifier(criterion='entropy'), 
                         n_estimators = 30, 
                         learning_rate = 0.5)

In [None]:
# Fit the model to the AdaB (TRAIN)

AdaB.fit(Xtrain, Ytrain2)

In [None]:
# Use the model to predict the class of samples
# Notice that we are testing the train dataset

Ytrain_pred_adaboost = AdaB.predict(Xtrain)
Yvalid_pred_adaboost = AdaB.predict(Xvalid)
Ytest_pred_adaboost  = AdaB.predict(Xtest)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix



In [None]:
# Final evaluation of the model (On the Training, Validation or Test dataset)

scores_adaboost  = accuracy_score(Ytrain2, Ytrain_pred_adaboost )
scores2_adaboost = accuracy_score(Yvalid2, Yvalid_pred_adaboost )
scores3_adaboost = accuracy_score(Ytest2, Ytest_pred_adaboost )

print("Correct classification rate for the training dataset   = "+str(scores_adaboost*100)+"%")
print("Correct classification rate for the validation dataset = "+str(scores2_adaboost*100)+"%")
print()
print("Correct classification rate for the test dataset (first model) = "+str(scores3*100)+"%")
print("Correct classification rate for the test dataset (best model)  = "+str(scores_tuned*100)+"%")
print("Correct classification rate for the test dataset (bagging)     = "+str(scores3_bagging*100)+"%")
print("Correct classification rate for the test dataset (adaboost)    = "+str(scores3_adaboost*100)+"%")

In [None]:
print("Notebook ended")