# Boosting

Boosting is an ensemble method that combines the outputs of
many *weak* learners into a single *strong* learner with respect
to some objective function. In this context a *weak* learner is
only slightly correlated with the true classification, i.e. it
performs marginally better than random guessing.
(NB: boosting can also be extended to regression.)

# AdaBoost

The AdaBoost algorithm solves a two-class problem with 
features $\boldsymbol{x}_{i}\in\mathbb{R}^{d}$ and outputs $y_{i}\in\{-1,1\}$, for $i=1,2,\ldots,N$. We consider a 
weak classifier $G(\boldsymbol{x})$ and its error rate
$$
E(G,\boldsymbol{x},y) = \frac{\sum_{i=1}^{N}\bigl(1-\delta_{y_{i},G(\boldsymbol{x}_{i})}\bigr)}{N}
$$
where $\delta$ is the Kronecker delta, so the numerator counts the misclassified instances.
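
As a minimal sketch of the error rate above (plain Python, chosen here only for illustration), the function below counts the misclassified instances and divides by $N$. The toy data and the `threshold_classifier` rule are hypothetical and not part of the original text.

```python
def error_rate(classifier, xs, ys):
    """Fraction of instances with classifier(x) != y (labels in {-1, +1})."""
    mistakes = sum(1 for x, y in zip(xs, ys) if classifier(x) != y)
    return mistakes / len(ys)


# Hypothetical toy data: one feature per instance, labels in {-1, +1}.
xs = [[0.1], [0.4], [0.6], [0.9]]
ys = [-1, 1, 1, 1]


def threshold_classifier(x):
    """Hypothetical weak rule: predict +1 when the first feature exceeds 0.5."""
    return 1 if x[0] > 0.5 else -1


rate = error_rate(threshold_classifier, xs, ys)
print(rate)  # one of the four instances is misclassified
```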

The algorithm sequentially applies this classifier to reweighted
versions of the training data. This produces the sequence 
$\{G_{m}(\boldsymbol{x})\}_{m=1}^{M}$ of classifiers. The data 
modification, at step $m$, is a weight applied to each instance 
$(\boldsymbol{x}_{i},y_{i})$ depending on whether the previous 
classifier, $G_{m-1}(\boldsymbol{x})$, correctly or incorrectly classified the 
instance. If the classification was incorrect the weight increases, so that 
correctly classified instances carry relatively less weight in the next fit. The initial weight is $\frac{1}{N}$ for every instance. Additionally, a weight
$\alpha_{m}$ is computed based on the weighted error 
$E_{m}$ of $G_{m}$, which determines the contribution of 
$G_{m}$ to the final classifier.
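
To make the reweighting concrete, here is a small worked round in plain Python. It assumes five instances of which only the last is misclassified (a hypothetical outcome), and uses the classifier weight $\log\frac{1-E}{E}$ and the $e^{\alpha}$ scaling from the detailed algorithm. The final normalisation is added only for readability.

```python
import math

# Five instances with the initial uniform weights 1/N.
weights = [0.2] * 5
# Hypothetical outcome of one round: only the last instance is misclassified.
misclassified = [False, False, False, False, True]

# Weighted error of this round's classifier.
err = sum(w for w, miss in zip(weights, misclassified) if miss) / sum(weights)
# Classifier weight: large when the weighted error is small.
alpha = math.log((1 - err) / err)
# Misclassified instances are scaled by exp(alpha); the rest are unchanged.
new_weights = [w * math.exp(alpha) if miss else w
               for w, miss in zip(weights, misclassified)]
# Normalise to show relative weights (the weighted error divides by the sum anyway).
total = sum(new_weights)
relative = [w / total for w in new_weights]
print(alpha, relative)
```

In this sketch the single misclassified instance ends up with half of the total weight, so the next classifier is pushed hard towards getting it right.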

In detail, the algorithm is as follows:

1. Set weights to $w_{i}=\frac{1}{N}$ $\forall$ $i$. 
2. For $m=1$ to $M$:
    2. Fit $G_{m}$ to the training data using the weights $w_{i}$.
    2. Compute the weighted error 
    $$
    E_{m} = \frac{\sum_{i=1}^{N}w_{i}(1-\delta_{y_{i},G_{m}(\boldsymbol{x}_{i})})}{\sum_{i=1}^{N}w_{i}}
    $$
    2. Compute the classifier weight 
    $$
        \alpha_{m} = \log\frac{1-E_{m}}{E_{m}}
    $$
    2. Update the weight for the instances
    $$
        w_{i} \leftarrow w_{i}e^{\alpha_{m}(1-\delta_{y_{i},G_{m}(\boldsymbol{x}_{i})})}, i=1,2,\ldots,N
    $$
3. Output the final classifier
    $$
        G(x) = \text{sgn}\biggl(\sum_{m=1}^{M}\alpha_{m}G_{m}(x)\biggr)
    $$
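
The steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the weak learner `fit_weak` simply picks the best rule from a hypothetical pool of 1-D threshold rules (`POOL`), the toy data is made up, and the early-stopping guard for $E_{m}=0$ or $E_{m}\geq\frac{1}{2}$ is an addition that the listing does not specify.

```python
import math


def adaboost(xs, ys, fit_weak, rounds):
    """fit_weak(xs, ys, w) must return a callable mapping x to -1 or +1."""
    n = len(ys)
    w = [1.0 / n] * n                          # step 1: uniform weights
    ensemble = []                              # pairs (alpha_m, G_m)
    for _ in range(rounds):                    # step 2
        g = fit_weak(xs, ys, w)
        miss = [g(x) != y for x, y in zip(xs, ys)]
        err = sum(wi for wi, m in zip(w, miss) if m) / sum(w)
        if err == 0 or err >= 0.5:
            # Assumption (not in the listing): stop when alpha would be
            # undefined or non-positive; keep a perfect classifier if alone.
            if err == 0 and not ensemble:
                ensemble.append((1.0, g))
            break
        alpha = math.log((1 - err) / err)
        # Only misclassified instances are rescaled, by exp(alpha).
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]
        ensemble.append((alpha, g))

    def strong(x):                             # step 3: sign of the weighted vote
        vote = sum(a * g(x) for a, g in ensemble)
        return 1 if vote >= 0 else -1

    return strong


# Hypothetical pool of weak rules on a single feature: both polarities
# of "x < t" for thresholds t = 0.5, 1.5, ..., 8.5.
POOL = []
for t in [k + 0.5 for k in range(9)]:
    POOL.append(lambda x, t=t: 1 if x < t else -1)
    POOL.append(lambda x, t=t: -1 if x < t else 1)


def fit_weak(xs, ys, w):
    """Pick the rule in POOL with the smallest weighted number of mistakes."""
    def weighted_mistakes(g):
        return sum(wi for x, y, wi in zip(xs, ys, w) if g(x) != y)
    return min(POOL, key=weighted_mistakes)


# Made-up 1-D data that no single threshold rule separates.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = fit_weak(xs, ys, [0.1] * 10)
single_error = sum(single(x) != y for x, y in zip(xs, ys)) / len(ys)
strong = adaboost(xs, ys, fit_weak, rounds=3)
boosted_error = sum(strong(x) != y for x, y in zip(xs, ys)) / len(ys)
print(single_error, boosted_error)
```

On this toy data the best single rule still misclassifies three of the ten points, while the weighted vote of three rules fits the training set exactly. Training error alone says nothing about generalisation, so this only illustrates the mechanics.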

## AdaBoost for decision trees

In the decision tree context an example of a weak classifier is
the decision stump, which consists of one rule at the root node
and two resulting leaf nodes. In the simulated example of [1], a 
single decision stump yields a test error rate of $\sim 45\%$, 
but applying AdaBoost (with $G$ the decision stump) improves 
this to $\sim 6\%$ after 400 iterations.
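
A weighted decision stump could be fitted along these lines. This is a minimal plain-Python sketch, not the method used by any particular library: the exhaustive search over midpoint thresholds, the function names `fit_stump` and `predict_stump`, and the toy data are all assumptions made for illustration.

```python
def fit_stump(X, y, w):
    """Search every feature and threshold for the smallest weighted error.

    Returns ((feature, threshold, left_label), weighted_error). The stump
    predicts left_label when x[feature] <= threshold and -left_label otherwise.
    """
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        values = sorted(set(x[j] for x in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for left in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if (left if x[j] <= t else -left) != yi)
                if err < best_err:
                    best, best_err = (j, t, left), err
    return best, best_err / sum(w)


def predict_stump(stump, x):
    """Apply the single root rule: one comparison, two leaves."""
    j, t, left = stump
    return left if x[j] <= t else -left


# Hypothetical data: the second feature separates the classes, the first does not.
X = [[2.0, 0.1], [1.0, 0.2], [3.0, 0.8], [0.5, 0.9]]
y = [-1, -1, 1, 1]
w = [0.25] * 4

stump, err = fit_stump(X, y, w)
print(stump, err)
```

Because the search minimises the weighted error, passing the AdaBoost instance weights as `w` steers the stump towards the instances that earlier rounds got wrong.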

## Julia Implementations

### Libraries
    
- [DecisionTree.jl](https://github.com/bensadeghi/DecisionTree.jl) (decision stumps)

### References

[1] T. Hastie, R. Tibshirani and J. Friedman, *The Elements of Statistical Learning* (Ch 10.1)