# Programming for Data Science and Artificial Intelligence

## Classification - AdaBoost

### Readings:
- [GERON] Ch7
- [VANDER] Ch5
- [HASTIE] Ch16
- https://scikit-learn.org/stable/modules/ensemble.html

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
import numpy as np
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Boosting

Boosting is a general strategy for learning classifiers by combining simpler ones. The idea is to take a "weak classifier" (that is, any classifier that does at least slightly better than chance) and use it to build a much better classifier, thereby boosting the performance of the weak classification algorithm. This is done by taking a weighted sum of the outputs of a collection of weak classifiers. The common form of the hypothesis function for boosting is:

$$
\begin{aligned}
H(x) & = \alpha_1h_1(x) + \alpha_2h_2(x) + \cdots + \alpha_Sh_S(x) \\
& = \Sigma_{s=1}^{S}\alpha_sh_s(x)
\end{aligned}
$$

where $S$ is the number of classifiers and $\alpha_s$ is the weight associated with classifier $h_s$.

One of the first, and therefore most popular, boosting algorithms is **AdaBoost**, so called because it is *adaptive*.

AdaBoost is extremely simple to use and implement (far simpler than SVMs), and often gives very effective results. There is also tremendous flexibility in the choice of weak classifier. In practice, a Decision Tree with max_depth=1 and max_leaf_nodes=2 (also known as a **stump**) is often used.

Suppose we are given training data $\{(\mathbf{x_i}, y_i)\}$, $i = 1, 2, \cdots, m$, where $\mathbf{x_i} \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$. Suppose we are also given a (potentially large) number $S$ of weak classifiers, denoted $h_s(x) \in \{-1, 1\}$ where $s = 1, 2, \cdots, S$, and for each classifier we define $\alpha_s$ as the *voting power* of the classifier $h_s(x)$. Then the hypothesis function is the sign of a linear combination of the weak classifiers:

$$
\begin{aligned}
H(x) & = \text{sign}\big(\alpha_1h_1(x) + \alpha_2h_2(x) + \cdots + \alpha_Sh_S(x)\big) \\
& = \text{sign}\big(\Sigma_{s=1}^{S}\alpha_sh_s(x)\big)
\end{aligned}
$$
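
As a small illustration of the weighted vote above, the sketch below combines three weak classifiers on a single sample. The predictions in `h` and the voting powers in `alpha` are made-up values chosen only for illustration; they do not come from a fitted model.

```python
import numpy as np

# Hypothetical outputs h_s(x) of three weak classifiers for one sample x
h = np.array([1, -1, 1])
# Hypothetical voting powers alpha_s (larger means more trusted)
alpha = np.array([0.2, 0.9, 0.4])

score = np.sum(alpha * h)   # weighted sum of the votes
H = int(np.sign(score))     # final prediction in {-1, 1}
print(score, H)
```

Two of the three classifiers vote $+1$, but the single classifier with the largest voting power outweighs them, so the weighted sum is negative and $H(x) = -1$.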

Our job is to find the optimal $\alpha_s$, so that classifiers with relatively better accuracy receive more weight (i.e., are believed more) in the hypothesis function. To get this $\alpha_s$, we first need to define what a "good" classifier is. This is simple: a good classifier is one that classifies the maximum number of samples correctly:

$$ \max \big( \Sigma_{i=1}^m I(h_s(x_i) = y_i) \big)$$

or, written equivalently as a minimization:

$$ \min \big( \Sigma_{i=1}^m I(h_s(x_i) \neq y_i) \big)$$

Aside from this "weighted" wisdom of the crowd, AdaBoost has one more capability: each subsequent classifier tries to correct the errors made by the previous one. In other words, whatever samples the previous classifier misclassified should be prioritized by the next classifier. To realize this, we increase the penalty for getting those previously misclassified samples wrong. To do so, we first initialize a weight for each sample, applied when fitting the first classifier $h_1(x)$:

$$ w_i^{(s)} = \frac{1}{m} ; s = 1;  i = 1, 2, \cdots, m$$

Then, after the first classifier is fitted, we readjust the weights by increasing them for misclassified samples and decreasing them for correctly classified samples. To make sure each classifier is chosen on the basis of these weighted errors, we revise our definition of a "good" classifier as follows:

$$ \min \big( \frac{\Sigma_{i=1}^m w_i^{(s)}I(h_s(x_i) \neq y_i)}{\Sigma_{i=1}^m w_i^{(s)}} \big) $$

Note that the denominator simply normalizes the weights so that they sum to 1.

Thus, each subsequent classifier is chosen as the one with the smallest weighted error.
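
To see why the weights matter, here is a minimal sketch comparing two candidate classifiers that each misclassify exactly one sample. The labels, predictions, and weights are made-up values, and `weighted_error` is a hypothetical helper name, not part of any library.

```python
import numpy as np

y      = np.array([ 1,  1, -1, -1,  1])        # true labels
yhat_a = np.array([ 1, -1, -1, -1,  1])        # candidate A: wrong on the 2nd sample
yhat_b = np.array([ 1,  1, -1, -1, -1])        # candidate B: wrong on the 5th sample
w      = np.array([0.1, 0.5, 0.1, 0.1, 0.2])   # sample weights (sum to 1)

def weighted_error(w, y, yhat):
    """Sum of the weights on misclassified samples, divided by the total weight."""
    return np.sum(w * (yhat != y)) / np.sum(w)

err_a = weighted_error(w, y, yhat_a)
err_b = weighted_error(w, y, yhat_b)
print(err_a, err_b)
```

Both candidates have the same unweighted error (one mistake out of five), but candidate A's mistake falls on the heavily weighted sample, so its weighted error is 0.5 against 0.2 for candidate B. Under the weighted criterion, B is the better classifier.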

Let's put everything into the AdaBoost algorithm as follows:

define $S$

**for** $i$ from 1 to m {

$$w_i^{(1)} = \frac{1}{m}$$ 

make sure $y \in \{-1, 1\}$

}

**for** $s$ = 1 to $S$ {

Looping through all features and thresholds, identify the best stump, i.e., the one that minimizes this objective function:

$$\epsilon_s = \Sigma_{i=1}^m w_i^{(s)}I(h_s(x_i) \neq y_i) $$

where $I$ is the indicator function: $I(h_s(x_i) \neq y_i) = 1$ if $h_s(x_i) \neq y_i$ and 0 otherwise

Then calculate the voting power of the weak classifier, denoted $\alpha_s$:

$$\alpha_s = \frac{1}{2}\ln\frac{1-\epsilon_s}{\epsilon_s}$$

Before fitting the next stump, we exaggerate the weights of *incorrectly* classified samples so that the next stump is chosen based on the new weighted objective function. The denominator renormalizes the updated weights so that they again sum to 1:

Then **for** all $i$ { $$w_i^{(s+1)} = \frac{w_i^{(s)}e^{ -\alpha_sh_s(\mathbf{x_i}) y_i}}{\Sigma_{j=1}^m w_j^{(s)}e^{ -\alpha_sh_s(\mathbf{x_j}) y_j}} $$}

}

This means that misclassified samples will get larger weights and correctly classified samples will get smaller weights.
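
The reason is the sign of the exponent: for a misclassified sample $h_s(\mathbf{x_i}) y_i = -1$, so its weight is multiplied by $e^{\alpha_s}$, while for a correct one it is multiplied by $e^{-\alpha_s}$ (with $\alpha_s > 0$ whenever $\epsilon_s < \frac{1}{2}$). The sketch below works through one boosting round by hand on five made-up samples; the predictions are hypothetical rather than the output of a fitted stump.

```python
import numpy as np

y    = np.array([ 1,  1, -1, -1,  1])   # true labels
yhat = np.array([ 1,  1, -1, -1, -1])   # hypothetical stump output: last sample is wrong
w    = np.full(5, 1 / 5)                # initial weights w_i^(1) = 1/m

err   = np.sum(w * (yhat != y))                 # weighted error epsilon_s
alpha = 0.5 * np.log((1 - err) / err)           # voting power alpha_s

w_new = w * np.exp(-alpha * y * yhat)           # up-weight mistakes, down-weight hits
w_new = w_new / w_new.sum()                     # renormalize so the weights sum to 1
print(err, alpha, w_new)
```

With $\epsilon_s = 0.2$ we get $\alpha_s = \frac{1}{2}\ln 4 = \ln 2$. After normalization, the single misclassified sample carries weight 0.5 and each correctly classified sample carries 0.125, so the next stump cannot ignore the mistake.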

To predict, we simply take the weighted sum of all predictors and take its sign.  Recall that $S$ is the number of stumps/predictors:

$$ 
  H(x) = \text{sign}\big(\Sigma_{s=1}^{S}\alpha_sh_s(x)\big)
$$

A stopping criterion is important for AdaBoost; otherwise, adding too many classifiers can lead to overfitting.  We can fix the number of iterations, stop when a certain level of accuracy is reached, or perform early stopping by using a validation set to detect the iteration at which overfitting starts.
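
One way to approximate the validation-set approach is sketched below using scikit-learn's `AdaBoostClassifier` and its `staged_predict` method, which yields the ensemble's predictions after each boosting round. This is only a sketch under assumptions: the dataset, the split, and `n_estimators=100` are arbitrary choices, and the number of rounds is picked after training a full ensemble rather than by halting training early.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, split into a training set and a held-out validation set
X, y = make_classification(n_samples=500, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = AdaBoostClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)

# Validation accuracy of the ensemble truncated after 1, 2, ... rounds
val_acc = [accuracy_score(y_val, pred) for pred in clf.staged_predict(X_val)]

# First round at which validation accuracy peaks
best_rounds = int(np.argmax(val_acc)) + 1
print(best_rounds, val_acc[best_rounds - 1])
```

A model refitted with `n_estimators=best_rounds` would then use only as many weak classifiers as the validation set supports.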

## AdaBoost

### Scratch

In [2]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=1)
y = np.where(y==0,-1,1)  #change our y to be -1 if it is 0, otherwise 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [3]:
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

m = X_train.shape[0]
S = 20
stump_params = {'max_depth': 1, 'max_leaf_nodes': 2}
models = [DecisionTreeClassifier(**stump_params) for _ in range(S)]

#initially, we set our weight to 1/m
W = np.full(m, 1/m)

#keep collection of a_j
a_js = np.zeros(S)

for j, model in enumerate(models):
    
    #train weak learner
    model.fit(X_train, y_train, sample_weight = W)
    
    #compute the weighted error (W sums to 1, so no division is needed)
    yhat = model.predict(X_train)
    err = W[yhat != y_train].sum()

    #compute the predictor weight a_j
    #if predictor is doing well, a_j will be big
    a_j = np.log((1 - err) / err) / 2
    a_js[j] = a_j

    #update sample weights, then normalize so they sum to 1
    W = W * np.exp(-a_j * y_train * yhat)
    W = W / W.sum()
    
        
#make weighted predictions
Hx = 0
for i, model in enumerate(models):
    yhat = model.predict(X_test)
    Hx += a_js[i] * yhat
    
yhat = np.sign(Hx)

print(classification_report(y_test, yhat))

              precision    recall  f1-score   support

          -1       0.96      0.97      0.97        79
           1       0.97      0.96      0.96        71

    accuracy                           0.97       150
   macro avg       0.97      0.97      0.97       150
weighted avg       0.97      0.97      0.97       150



### Sklearn 

Sklearn implements AdaBoost using SAMME, which stands for Stagewise Additive Modeling using a Multiclass Exponential loss function.

The following code trains an AdaBoost classifier based on 200 decision stumps.  A decision stump is basically a Decision Tree with max_depth=1.  This is the default base estimator of the AdaBoostClassifier class:

In [4]:
from sklearn.ensemble import AdaBoostClassifier

#SAMME.R - a variant of SAMME which relies on class probabilities
#rather than predictions and generally performs better; it was the
#default in older scikit-learn releases and has since been deprecated
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred = ada_clf.predict(X_test)
print("Ada score: ", accuracy_score(y_test, y_pred))

Ada score:  0.9666666666666667


### ===Task===

Your work: Let's modify the above scratch code:
- Notice that if <code>err</code> = 0, then $\alpha$ is undefined.  Fix this by adding a very small value to the denominator
- Notice that the sklearn version of AdaBoost has a parameter <code>learning_rate</code>, which plays the role of the $\frac{1}{2}$ in front of the $\alpha$ calculation.  Change this $\frac{1}{2}$ into a parameter called <code>eta</code>, try different values, and see whether accuracy improves.  Note that sklearn defaults this value to 1.
- Observe that we are actually using sklearn's DecisionTreeClassifier.  Looking at it closely, it uses the weighted Gini index rather than the weighted errors we learned above.  Write your own <code>class Stump</code> that uses weighted errors instead of the weighted Gini index
- Put everything into a class