In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
import gzip  # used below; imported explicitly rather than relying on the wildcard import
import time
from mnist import *
import numpy as np
import warnings

warnings.simplefilter("ignore")

training_set_path = "D:\\Projects\\ml-experiments\\datasets\\mnist\\train-images-idx3-ubyte.gz"
train_labels_path = "D:\\Projects\\ml-experiments\\datasets\\mnist\\train-labels-idx1-ubyte.gz"

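# Open each archive in a context manager so the file handles are closed after parsing
with gzip.open(training_set_path) as f_train:
    training_set = parse_idx(f_train)

with gzip.open(train_labels_path) as f_train_labels:
    training_labels = parse_idx(f_train_labels)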

training_set_tr = training_set.reshape((60000, 784))

**A utility function reused throughout the experiment**

In [9]:
def get_random_digit(training_set, labels, digit):
    indexes = np.where(labels == digit)[0]
    # randint's upper bound is exclusive, so len(indexes) keeps the last match reachable
    return training_set[indexes[np.random.randint(0, len(indexes))]]

**Each instance weight $w^{(i)}$ is initially set to $\frac{1}{m}$, where $m$ is the number of instances, so all instances start with equal weight. A first predictor is trained and its weighted error rate $r_1$ is computed on the training set. For the $j$th predictor: $r_j = \frac{\sum_{i=1,\; y_j^{(i)} \neq y^{(i)}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}$, where $y_j^{(i)}$ is the $j$th predictor's prediction for instance $i$ and $y^{(i)}$ is its true label.**

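**Next, the predictor's weight is computed: $\alpha_j = \eta \log \frac{1-r_j}{r_j}$, where $\eta$ is the learning rate. In machine learning, a logarithm or exponential written without a base usually means the natural one (base $e$). With this formula, the lower the error rate, the larger the predictor's weight, and a predictor with $r_j = 0.5$ gets a weight of zero.**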

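**Next, the instance weights are updated so that the misclassified instances are boosted. For $i = 1, \dots, m$: if $y_j^{(i)} = y^{(i)}$ then $w^{(i)} \leftarrow w^{(i)}$, otherwise $w^{(i)} \leftarrow w^{(i)} \exp(\alpha_j)$. The weights are then usually normalized by dividing each one by $\sum_{i=1}^{m} w^{(i)}$.**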
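
**As a worked illustration of one boosting round, the sketch below applies the three formulas to a hypothetical toy problem: five instances, of which the predictor misclassifies only the last. The arrays `y_true` and `y_pred`, the value `eta = 1.0`, and the final normalization step are assumptions made for the example, not part of the experiment above.**

```python
import numpy as np

# Hypothetical toy data: the predictor gets only the last instance wrong.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0])
eta = 1.0  # learning rate, chosen arbitrarily for this illustration

m = len(y_true)
w = np.full(m, 1.0 / m)            # every instance starts with weight 1/m
miss = y_pred != y_true            # mask of misclassified instances

r = w[miss].sum() / w.sum()        # weighted error rate r_j
alpha = eta * np.log((1 - r) / r)  # predictor weight alpha_j (natural log)

w_new = w.copy()
w_new[miss] *= np.exp(alpha)       # boost only the misclassified instances
w_new /= w_new.sum()               # normalize so the weights sum to 1 (assumed step)

print(f"r = {r:.3f}, alpha = {alpha:.3f}")
print("updated weights:", w_new)
```

**Here one instance in five is wrong, so $r = 0.2$ and $\alpha = \log 4$. After normalization the single misclassified instance carries half of the total weight, which is what pushes the next predictor to focus on it.**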

**Finally, a new predictor is trained using the updated weights, and the whole process is repeated (the new predictor’s weight is computed, the instance weights are updated, then another predictor is trained, and so on). The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found. To make predictions, AdaBoost simply computes the predictions of all the predictors and weighs them using the predictor weights $\alpha_j$. The predicted class is the one that receives the majority of weighted votes.**
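
**The whole procedure can be sketched in a few lines on top of Scikit-Learn decision stumps. This is a minimal illustration of the steps described above, not Scikit-Learn's own implementation: the function names `fit_adaboost` and `predict_adaboost`, the synthetic dataset from `make_classification`, the guard against a zero error rate, and the early stop when a stump is no better than chance are all assumptions made for the example.**

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_adaboost(X, y, n_estimators=10, eta=0.5):
    """Train decision stumps one after another, reweighting instances each round."""
    m = len(y)
    w = np.full(m, 1.0 / m)              # initial instance weights 1/m
    stumps, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        r = w[miss].sum() / w.sum()      # weighted error rate r_j
        if r >= 0.5:                     # no better than chance: stop (assumed rule)
            break
        # Predictor weight alpha_j; max() guards against division by zero
        alpha = eta * np.log((1 - r) / max(r, 1e-10))
        stumps.append(stump)
        alphas.append(alpha)
        if not miss.any():               # perfect predictor found: stop
            break
        w[miss] *= np.exp(alpha)         # boost the misclassified instances
        w /= w.sum()                     # normalize the weights
    return stumps, np.array(alphas)


def predict_adaboost(stumps, alphas, X, classes):
    """Return, for each row of X, the class with the largest weighted vote."""
    votes = np.zeros((len(X), len(classes)))
    for stump, alpha in zip(stumps, alphas):
        pred = stump.predict(X)
        for k, c in enumerate(classes):
            votes[pred == c, k] += alpha  # each stump votes with weight alpha_j
    return classes[np.argmax(votes, axis=1)]


# Synthetic binary problem, used only to exercise the sketch
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
stumps, alphas = fit_adaboost(X, y)
y_pred = predict_adaboost(stumps, alphas, X, np.unique(y))
print(f"{len(stumps)} stumps, training accuracy {np.mean(y_pred == y):.3f}")
```

**The sketch passes the instance weights to each stump through the `sample_weight` argument of `fit`, which is one way to make a new predictor pay more attention to previously misclassified instances.**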

In [10]:
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=10,
    algorithm="SAMME.R", learning_rate=0.5)

start_time = time.time()
ada_clf.fit(training_set_tr, training_labels)
elapsed = time.time() - start_time

print(f"Training an AdaBoostClassifier with 10 stumps took {elapsed} seconds")

Training an AdaBoostClassifier with 10 stumps took 12.799814701080322 seconds


**Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function). Because the Decision Tree predictor can estimate class probabilities (it exposes a `predict_proba` method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally performs better.**
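
**To make the multiclass difference concrete: in the SAMME formulation the predictor weight gains an extra $\log(K-1)$ term, where $K$ is the number of classes, so that a predictor only needs to beat random guessing among $K$ classes to receive a positive weight. The helper `samme_alpha` below is a hypothetical name introduced for illustration, and the error rates passed to it are made-up values, not measurements from the classifier above.**

```python
import numpy as np


def samme_alpha(r, n_classes, eta=1.0):
    """Predictor weight under SAMME: the binary formula plus log(K - 1)."""
    return eta * (np.log((1 - r) / r) + np.log(n_classes - 1))


# With two classes the extra term is log(1) = 0, so the binary formula is recovered
print(samme_alpha(0.3, n_classes=2))

# With ten classes (as in MNIST), an 80% error rate still beats chance (90%)
print(samme_alpha(0.8, n_classes=10))

# A predictor at chance level for ten classes gets a weight of zero
print(samme_alpha(0.9, n_classes=10))
```

**This matters for MNIST because a single stump is wrong on most digits, yet it can still contribute as long as its weighted error stays below 90%.**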

**A Decision Tree with max_depth=1, that is, a tree composed of a single decision node plus two leaf nodes, is called a Decision Stump.**
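
**The structure of a stump can be checked directly on a fitted tree. The snippet below is a small sketch that uses the Iris dataset bundled with Scikit-Learn as a stand-in (an assumption made so the example runs without the MNIST files) and inspects the node and leaf counts.**

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris has three classes, so it also shows the limits of a single stump
X_iris, y_iris = load_iris(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_iris, y_iris)

n_nodes = stump.tree_.node_count           # decision node + leaves
n_leaves = stump.get_n_leaves()            # terminal nodes only
n_predicted = len(set(stump.predict(X_iris)))  # distinct classes the stump can output

print(f"depth={stump.get_depth()}, nodes={n_nodes}, leaves={n_leaves}")
print(f"distinct predicted classes: {n_predicted}")
```

**With only two leaves, a single stump can output at most two distinct classes, which is why boosting many stumps is needed for a ten-class problem such as MNIST.**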

In [11]:
an_eight = get_random_digit(training_set_tr, training_labels, 8)

print(f"AdaBoostClassifier prediction is: {ada_clf.predict([an_eight])}")

AdaBoostClassifier prediction is: [8]
