# Ensemble Learning: Boosting

Take a bunch of simple rules and combine them into a more complex rule that works very well. As with decision trees, we build the combined rule step by step.

- Learn over a subset 1 of data -> generate rule  
- Learn over a subset 2 of data -> generate different rule   
...   
...  
- Learn over a subset n of data -> generate different rule   
- Finally combine all rules into a **complex rule**   

Simplest approach:  
1. Choosing subsets: select data uniformly at random, then apply a learner to each subset  
2. Combine: weight every learner equally and take the mean of their outputs   

## Example: Bagging  
---
<img src="../images/bagging.png" width=600 align="left"/>  
5 subsets of 5 examples, chosen randomly and with replacement   

Each learns a 3rd order polynomial   

Combine by average (red line)  

Compared to 4th order polynomial regression (blue line)  

The 4th order regression does better on the training set, but bagging does better on the test set (in this case)  

Bagging does better because it overfits less. Averaging over random subsets works much like cross validation: the result doesn't get trapped by a few points that are wrong.
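
The bagging recipe above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and synthetic data; the function names `bagged_polyfit` and `bagged_predict` and the sine-shaped toy data are hypothetical and not taken from the figure.

```python
import numpy as np

def bagged_polyfit(x, y, n_subsets=5, subset_size=5, degree=3, seed=0):
    """Fit one polynomial per random subset (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_subsets):
        # Sampling with replacement: the same point may appear more than once.
        idx = rng.integers(0, len(x), size=subset_size)
        models.append(np.polyfit(x[idx], y[idx], degree))
    return models

def bagged_predict(models, x_new):
    """Combine by taking the plain mean of the individual predictions."""
    preds = np.array([np.polyval(coeffs, x_new) for coeffs in models])
    return preds.mean(axis=0)

# Toy data (an assumption for illustration only): a noisy sine curve.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

models = bagged_polyfit(x, y, n_subsets=5, subset_size=10, degree=3)
y_hat = bagged_predict(models, x)
```

With very small subsets, duplicate draws can leave fewer distinct points than the polynomial has coefficients, in which case `np.polyfit` warns that the fit is poorly conditioned; the sketch uses 10 points per subset to sidestep that.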

## Boosting  
---
- Instead of picking subsets randomly, take advantage of what we are learning as we go along and pick subsets containing the hardest examples  
- Combine using weighted mean  

### Error  
- Regression: squared difference between predicted and actual  
- Classification: total number of mismatches over the total number of examples  

But every example may not be of equal importance.  
We should also consider the probability of encountering the example.  

$\textrm{P} \left (h(x) \ne c(x) \right)$  
Error: the probability, given the underlying distribution, that the hypothesis $h$ disagrees with the true concept $c$ on some instance $x$   

Some instances may be rare.   
Think about the fraction of the time you will be wrong rather than the number of distinct mistakes.
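
To make the difference concrete, here is a small sketch comparing the plain mismatch rate with the error under a distribution. The four-example arrays and both function names are made up for illustration.

```python
import numpy as np

def mismatch_rate(h_pred, c_true):
    """Fraction of examples where the hypothesis disagrees with the concept."""
    return float(np.mean(np.asarray(h_pred) != np.asarray(c_true)))

def weighted_error(h_pred, c_true, dist):
    """P_D(h(x) != c(x)): total probability mass on the misclassified examples."""
    wrong = np.asarray(h_pred) != np.asarray(c_true)
    return float(np.asarray(dist, dtype=float)[wrong].sum())

h = np.array([+1, +1, -1, -1])   # hypothesis outputs
c = np.array([+1, -1, -1, +1])   # true labels; h is wrong on the 2nd and 4th

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.40, 0.05, 0.50, 0.05]   # the two misclassified examples are rare

rate = mismatch_rate(h, c)
err_uniform = weighted_error(h, c, uniform)
err_skewed = weighted_error(h, c, skewed)
```

Half of the distinct examples are wrong, yet under the skewed distribution the hypothesis is wrong only about 10% of the time.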

### Weak Learner
- A learner that always does better than chance, no matter what the distribution over the data is
- Error rate will always be less than half  
- Learner will always learn something   

$\forall_D \; \textrm{P}_D ( h(x) \ne c(x) ) \le 1/2 - \epsilon$ for some $\epsilon > 0$
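
A small sketch of why the "for all distributions" part matters. It assumes a finite, hypothetical set of hypotheses described only by which examples each one gets wrong; the `mistakes` matrix and the function name are illustrative, not from the notes.

```python
import numpy as np

def best_weighted_error(mistakes, dist):
    """Lowest weighted error over a finite set of hypotheses.

    mistakes[j][i] is True when hypothesis j misclassifies example i;
    dist[i] is the probability of example i.
    """
    mistakes = np.asarray(mistakes, dtype=float)
    dist = np.asarray(dist, dtype=float)
    return float((mistakes @ dist).min())

# Two hypothetical hypotheses over four examples:
# h1 is wrong only on x1, h2 is wrong only on x2.
two = [[True, False, False, False],
       [False, True, False, False]]

uniform = [0.25, 0.25, 0.25, 0.25]
hard = [0.5, 0.5, 0.0, 0.0]   # all the mass sits on the two troublesome examples

err_uniform = best_weighted_error(two, uniform)    # better than chance
err_hard = best_weighted_error(two, hard)          # exactly chance

# Adding h3 (wrong only on x3) gives a hypothesis that is perfect under `hard`.
three = two + [[False, False, True, False]]
err_hard_three = best_weighted_error(three, hard)
```

Under the uniform distribution the two-hypothesis set looks fine, but the `hard` distribution pushes its best error to 1/2, so it is not a weak learner. Checking a handful of distributions can only show that a set fails the condition; it cannot prove that the condition holds for every distribution.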

### Pseudocode 
Given training set: $\{(x_i, y_i)\}$ with $y_i \in \{-1, +1\}$ (binary classification)  

For each time step $t = 1, \dots, T$:
- construct distribution: $D_t$
- find weak classifier, $h_t(x)$
    - with small error
    - $\epsilon_t = \textrm{P}_{D_t} ( h_t(x_i) \ne y_i ) $

output final hypothesis: $H_{\textrm{final}}$
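
The loop above can be sketched in Python. The pseudocode leaves open how $D_t$ is built and how the weak classifiers are combined, so this sketch assumes the standard AdaBoost choices ($\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, exponential reweighting, sign of the weighted vote) and uses decision stumps as the weak learner. All function names are hypothetical.

```python
import numpy as np

def stump_predict(X, feature, thr, pol):
    """Predict pol where X[:, feature] <= thr, and -pol elsewhere."""
    return np.where(X[:, feature] <= thr, pol, -pol)

def fit_stump(X, y, dist):
    """Exhaustively pick the stump with the lowest error under dist."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = dist[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T=10):
    """Return a list of (alpha, feature, threshold, polarity) tuples."""
    n = len(y)
    dist = np.full(n, 1.0 / n)                  # D_1 is uniform
    ensemble = []
    for _ in range(T):
        err, j, thr, pol = fit_stump(X, y, dist)
        err = np.clip(err, 1e-10, 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)   # assumed AdaBoost weight
        pred = stump_predict(X, j, thr, pol)
        dist = dist * np.exp(-alpha * y * pred) # raise weight of mistakes
        dist = dist / dist.sum()                # renormalize to a distribution
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Final hypothesis: sign of the alpha-weighted vote."""
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.sign(score)

# Toy 1-D problem that no single stump can solve: + + - - + +
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([1, 1, -1, -1, 1, 1])
ensemble = adaboost(X, y, T=3)
y_hat = predict(ensemble, X)
```

On this toy problem each individual stump misclassifies at least two points, while the weighted vote of three stumps classifies all six correctly.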

#### Distribution Construction
<img src="../images/boosting.png" width=800 align="left"/>  
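
In text form, and assuming the figure shows the standard AdaBoost update (treat this as a hedged reconstruction rather than a transcription), the distribution starts uniform and is reweighted after each round:

$$
D_1(i) = \frac{1}{n}, \qquad
D_{t+1}(i) = \frac{D_t(i) \, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}, \qquad
\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}
$$

Here $n$ is the number of training examples and $Z_t$ is whatever constant makes $D_{t+1}$ sum to 1. When $h_t(x_i)$ agrees with $y_i$ the exponent is negative and the weight shrinks; when they disagree it grows. For example, with $n = 6$ and $\epsilon_1 = 1/3$ (two mistakes under the uniform $D_1$), $\alpha_1 = \frac{1}{2} \ln 2 \approx 0.35$, so each correctly classified example goes from $1/6$ to $0.125$ and each misclassified one from $1/6$ to $0.25$.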

#### Final Hypothesis
<img src="../images/boosting_final.png" width=800 align="left"/>

#### Example
<img src="../images/boosting_example1.png" width=450 align="left"/>
<img src="../images/boosting_example2.png" width=450 align="left"/>

https://github.com/dmlc/xgboost/tree/master/demo   
https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d  

### Overfitting
<img src="../images/boosting_error.png" width=600 align="left"/>   
Need to think about confidence, not just right or wrong (error).   
Confidence: how strongly you believe in the answer.

As you add more weak learners, the +'s and -'s move away from the boundary: the classifier gets more confident while the error stays the same. This effectively creates a bigger and bigger margin.   

Large margins tend to minimize overfitting.  

Boosting tends to overfit if:
- the weak learner is an artificial neural network (ANN) with many layers and nodes
- more generally, if the weak learner overfits and keeps overfitting, so will boosting  

Pink noise (uniform noise): boosting may overfit in this case  
White noise: Gaussian noise  

In discussing boosting's impact on *overfitting*, we've been ignoring some information. 

What we normally keep track of is:
- *error* (i.e., the probability that you will come up with an answer that disagrees with your training set). 

However, we are also keeping track of *confidence*. 

$H_{\textrm{final}}(x) = \text{sgn} \left( \sum_t \alpha_t h_t(x) \right)$

which simply outputs the sign of the weighted sum of the weak hypotheses: +1, -1, or 0.

Take the sum above and divide it by the sum of the weights we used, normalizing the output:

$$ 
H_{\textrm{final}}(x) = 
\text{sgn} 
\left ( 
\frac{\sum_t \alpha_t h_t(x) }{ \sum_t \alpha_t}
\right )
$$

This normalization keeps the quantity inside the sign function between -1 and +1, so its magnitude can be read as confidence. As more weak learners are added, the margin (i.e., the distance between the examples and the decision boundary) increases. This increases the confidence while the actual error remains the same.
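
The normalized score can be computed directly. The sketch below takes the margin of an example to be its label times the normalized score, a common definition that these notes don't state explicitly; the three weak hypotheses and their weights are hypothetical values chosen for illustration.

```python
import numpy as np

def normalized_margins(alphas, weak_preds, y):
    """Return y_i * (sum_t alpha_t h_t(x_i)) / (sum_t alpha_t) for each example.

    alphas:     shape (T,)   weights of the weak hypotheses
    weak_preds: shape (T, n) outputs h_t(x_i), each -1 or +1
    y:          shape (n,)   true labels, each -1 or +1
    """
    alphas = np.asarray(alphas, dtype=float)
    weak_preds = np.asarray(weak_preds, dtype=float)
    score = alphas @ weak_preds / alphas.sum()   # always within [-1, +1]
    return np.asarray(y) * score

# Hypothetical ensemble of three weak hypotheses on six examples.
alphas = 0.5 * np.log([2.0, 3.0, 5.0])
weak_preds = [[+1, +1, -1, -1, -1, -1],
              [-1, -1, -1, -1, +1, +1],
              [+1, +1, +1, +1, +1, +1]]
y = [+1, +1, -1, -1, +1, +1]

margins = normalized_margins(alphas, weak_preds, y)
```

A positive margin means the example is classified correctly, and a margin closer to 1 means the vote was closer to unanimous. Two ensembles with the same training error can therefore still differ in how large their margins are.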

Boosting tends to overfit also if the underlying weak learner tends to overfit, as in the case of an artificial neural network.