# AdaBoost

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
import numpy as np
import matplotlib.pyplot as plt

<img src = "../figures/ada2.png" />

## Hypothesis function

$$
\begin{aligned}
h(\mathbf{x}^{(i)}) & = \text{sign}\big(\alpha_1h_1(\mathbf{x}^{(i)}) + \alpha_2h_2(\mathbf{x}^{(i)}) + \cdots + \alpha_Sh_S(\mathbf{x}^{(i)})\big) \\
& = \text{sign}\big(\sum_{s=1}^{S}\alpha_sh_s(\mathbf{x}^{(i)})\big)
\end{aligned}
$$
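
As a minimal sketch of this weighted vote, the snippet below combines the predictions of three weak classifiers on four samples. The arrays `weak_preds` and `alphas` are made-up values chosen for illustration only; they do not come from a trained model.

```python
import numpy as np

# Hypothetical predictions of S = 3 weak classifiers on m = 4 samples.
# Rows are classifiers, columns are samples, entries are in {-1, +1}.
weak_preds = np.array([
    [ 1, -1,  1,  1],
    [ 1,  1, -1,  1],
    [-1, -1,  1,  1],
])

# Hypothetical classifier weights alpha_s, one per weak classifier.
alphas = np.array([0.9, 0.3, 0.5])

# Weighted vote: sum over s of alpha_s * h_s(x^(i)), for every sample i.
scores = alphas @ weak_preds

# The final prediction is the sign of the weighted vote.
h = np.sign(scores)

print(scores)
print(h)
```

Note that the second classifier disagrees with the other two on the third sample, but it is outvoted because its alpha is the smallest.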



## Define what is a good classifier


Our job is to find the optimal $\alpha_s$, which tells us how much weight to give each classifier (i.e., how much to trust it).  To get this alpha, we first need to define what a "good" classifier is.  This is simple: a good classifier is one that minimizes the weighted error:

$$\epsilon_s = \frac{\sum_{i=1}^m w_s^{(i)}I(h_s(\mathbf{x}^{(i)}) \neq y^{(i)})}{\sum_{i=1}^m w_s^{(i)}}$$

where

$$\text{range}(\epsilon_s) = [0, 1]$$

in which the weights for the first classifier are initialized as

$$w_1^{(i)} = \frac{1}{m}$$

where

$$\sum_{i=1}^m w_s^{(i)} = 1$$

For example, take $h(\mathbf{x})$ as <code>yhat</code> and the true labels as <code>y</code>.

We can then calculate the weighted error from the samples where the two disagree.

If we make the weight of the first sample larger, the weighted error increases when that sample is misclassified.  (Don't worry about why the weights became 0.7 or 0.05; this is just an example.)
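
The calculation can be sketched as follows. The labels, predictions, and the number of samples are made up for illustration, and `weighted_error` is a hypothetical helper name; only the weight values 0.7 and 0.05 are taken from the text above.

```python
import numpy as np

def weighted_error(y, yhat, w):
    """Sum of the weights on misclassified samples, divided by the sum of all weights."""
    miss = (yhat != y)          # indicator I(h(x) != y)
    return np.sum(w * miss) / np.sum(w)

# Hypothetical true labels and predictions for m = 7 samples.
y    = np.array([ 1,  1, -1, -1,  1, -1,  1])
yhat = np.array([-1,  1,  1, -1,  1, -1,  1])   # the first and third samples are wrong

# Uniform weights 1/m: the error is just the fraction of mistakes.
w_uniform = np.full(len(y), 1 / len(y))
err_uniform = weighted_error(y, yhat, w_uniform)

# Put most of the weight on the first (misclassified) sample.
w_skewed = np.array([0.7, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])
err_skewed = weighted_error(y, yhat, w_skewed)

print(err_uniform)   # 2 mistakes out of 7
print(err_skewed)    # larger, because a misclassified sample now carries weight 0.7
```

With uniform weights the error is 2/7; with the skewed weights it rises to 0.7 + 0.05 = 0.75, even though the predictions did not change.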

## Updating the weights

Once we have the error, we want to emphasize the incorrectly classified samples, so that the next classifier focuses on getting them right.  For this we need a weight update rule.  The formula is as follows:

$$w_{s+1}^{(i)} = w_s^{(i)}e^{ -\alpha_sh_s(\mathbf{x^{(i)}}) y^{(i)}}$$

which we then renormalize

$$w_{s+1}^{(i)} \leftarrow \frac{w_{s+1}^{(i)}}{{\displaystyle\sum_{j=1}^m w_{s+1}^{(j)}}} $$

so that

$$\sum_{i=1}^m w_{s+1}^{(i)} = 1$$

Here $\alpha_s$ is:

$$\alpha_s = \frac{1}{2}\ln\frac{1-\epsilon_s}{\epsilon_s}$$

where 

$$\text{range}(\alpha_s) = (-\infty, \infty)$$


## Relationship between alpha and errors

Here, the higher the error, the lower the alpha, which means we trust that classifier less, and vice versa.  If $\epsilon_s$ is close to 0 (the classifier performs well), alpha is large and positive, so the classifier's predictions have more influence on the final ensemble.  If $\epsilon_s$ is close to 0.5 (no better than random guessing), alpha is close to 0, so the classifier has almost no influence.  If $\epsilon_s$ is above 0.5 (worse than random guessing), alpha is negative, so the classifier's predictions are "flipped" in the ensemble.  At the extremes $\epsilon_s = 0$ and $\epsilon_s = 1$ the formula diverges, so such a classifier cannot be used with this formula as is.

First, to see why this formula works, let's plot alpha against errors:
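
A minimal sketch of such a plot is given below. The helper name `alpha_from_error` and the grid of error values are assumptions made for illustration; the grid stays strictly inside $(0, 1)$ because the formula diverges at both endpoints.

```python
import numpy as np
import matplotlib.pyplot as plt

def alpha_from_error(eps):
    """alpha = 0.5 * ln((1 - eps) / eps), valid for 0 < eps < 1."""
    return 0.5 * np.log((1 - eps) / eps)

# Error values strictly between 0 and 1.
errors = np.linspace(0.01, 0.99, 99)
alphas = alpha_from_error(errors)

plt.plot(errors, alphas)
plt.axhline(0, color="gray", linewidth=0.5)     # alpha = 0
plt.axvline(0.5, color="gray", linewidth=0.5)   # error of random guessing
plt.xlabel("weighted error")
plt.ylabel("alpha")
plt.title("alpha as a function of the weighted error")
plt.show()
```

The curve crosses zero at an error of 0.5, is positive to the left of it, and is negative to the right of it.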

## How does this weight update rule work?

$$w_{s+1}^{(i)} = w_s^{(i)}e^{ -\alpha_sh_s(\mathbf{x^{(i)}}) y^{(i)}}$$

Let's first find the alpha.  Recall that:

$$\alpha_s = \frac{1}{2}\ln\frac{1-\epsilon_s}{\epsilon_s}$$

where

$$\epsilon_s = \frac{\sum_{i=1}^m w_s^{(i)}I(h_s(\mathbf{x}^{(i)}) \neq y^{(i)})}{\sum_{i=1}^m w_s^{(i)}}$$

After we find the initial alpha, let's plug everything into the weight update rule, starting from this component:

$$h_s(\mathbf{x^{(i)}}) y^{(i)} $$

To see what this product means, recall that both the prediction and the label take values in $\{-1, +1\}$.  The product is $+1$ when the prediction is correct and $-1$ when it is wrong.

We multiply by negative alpha so that, for a positive alpha, incorrectly classified samples get a larger value.

Why do we want incorrectly classified samples to have a larger value?  Because we want them to carry more weight, so that the next classifier focuses on them.

$$-\alpha_sh_s(\mathbf{x^{(i)}}) y^{(i)} $$

Next, since we will multiply this by the weight, we apply the exponential to make sure that the resulting factor is positive.

$$e^{ -\alpha_sh_s(\mathbf{x^{(i)}}) y^{(i)}} $$

Last, we put everything together.

$$w_{s+1}^{(i)} = w_s^{(i)}e^{ -\alpha_sh_s(\mathbf{x^{(i)}}) y^{(i)}}$$

which we then renormalize

$$w_{s+1}^{(i)} \leftarrow \frac{w_{s+1}^{(i)}}{{\displaystyle\sum_{j=1}^m w_{s+1}^{(j)}}} $$

**So what do these numbers mean?**

Notice that the incorrectly classified samples are #1 and #3, and both end up with larger weights than the others.  This makes sure the next classifier focuses on getting them right.
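
The steps above can be traced on a small example. The labels and predictions below are made up for illustration, and reading "#1 and #3" as the first and third samples is an assumption.

```python
import numpy as np

# Hypothetical labels and predictions for m = 5 samples.
y    = np.array([ 1,  1, -1, -1,  1])
yhat = np.array([-1,  1,  1, -1,  1])   # the first and third samples are wrong

# Start from uniform weights 1/m.
w = np.full(len(y), 1 / len(y))

# Weighted error and the resulting alpha.
eps = np.sum(w * (yhat != y)) / np.sum(w)   # 2 of 5 equally weighted samples are wrong
alpha = 0.5 * np.log((1 - eps) / eps)

# +1 where the prediction is correct, -1 where it is wrong.
agreement = yhat * y

# Update: wrong samples are scaled by exp(+alpha), correct ones by exp(-alpha).
w_new = w * np.exp(-alpha * agreement)

# Renormalize so that the weights sum to 1 again.
w_new = w_new / np.sum(w_new)

print(eps, alpha)
print(w_new)
```

After renormalization the two misclassified samples each carry weight 0.25 and the three correct ones 1/6 each, so the misclassified samples together hold exactly half of the total weight. The classifier that was just trained would therefore score a weighted error of 0.5 on the new weights, which pushes the next classifier to do something different.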

## 1. Scratch

In [15]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=1)
y = np.where(y == 0, -1, 1)  # relabel class 0 as -1 so that labels are in {-1, +1}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)