# Intro to ensemble methods

**Ensemble methods** are supervised learning models that combine the predictions of multiple smaller models to improve predictive power and generalization.

The smaller models that combine to make the ensemble model are referred to as **base models**. Ensemble methods often result in considerably higher performance than any of the individual base models could achieve.

![ensemble](./images/Ensemble.png) 

## When to use ensembles

- Medical diagnoses
- Predicting disease outbreaks and natural disasters
- Stock market predictions
- AI

Ensembles also suit any other case where the highest performance is desired at the expense of model interpretability.

## Two popular families of ensemble methods

---

**BAGGING**

Several estimators are built independently on subsets of the data and their predictions are averaged. The combined estimator is usually better than any single base estimator.

**Bagging can reduce variance with little to no effect on bias.**

    ex: Random Forests

---

**BOOSTING**

Base estimators are built sequentially. Each subsequent estimator focuses on the weaknesses of the previous estimators. In essence, several weak models "team up" to produce a powerful ensemble model. (We will discuss these later this week.)

**Boosting can reduce bias without incurring higher variance.**

    ex: Gradient Boosted Trees, AdaBoost
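
The sequential idea can be sketched in a few lines. This is a minimal illustration, not a library implementation: it assumes a 1-D regression problem, squared-error loss, and depth-1 "stumps" as the weak base models. The names `fit_stump`, `predict_stump`, and `boost`, and the `learning_rate` value, are hypothetical choices made for this sketch. Each round fits a new stump to the residuals (the current ensemble's mistakes) and adds a damped copy of it to the running prediction. Only training error is tracked here.

```python
import math

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def fit_stump(xs, targets):
    """Greedy search for the single threshold that minimizes squared error."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    """A stump predicts one constant on each side of its threshold."""
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean

def boost(xs, ys, n_rounds=40, learning_rate=0.5):
    """Fit stumps sequentially, each one to the current residuals."""
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean
    history = [mse(ys, preds)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return history

xs = [0.1 * i for i in range(63)]
ys = [math.sin(x) for x in xs]
history = boost(xs, ys)
print(f"training MSE: start={history[0]:.4f}, end={history[-1]:.4f}")
```

No single stump can follow a sine curve, but the sum of many small corrections can, which is the sense in which weak models "team up".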

## Potential deficiencies of base models

There are three categories of weaknesses in which "base models" can fail or produce poor results:

1. Statistical problems
2. Computational problems
3. "Representational" problems

Ensemble methods are designed to address any or all of the three.

---

Let

### $$ \begin{aligned} \text{true function of data} &= f() \\ \text{model function of data} &= h() \end{aligned}$$

Where $h()$ can be a classifier or a regression model.

### Statistical problem

**The amount of training data available is small**. A single base classifier $h()$ will have difficulty converging to the true function $f()$.

![statistical](./images/statistical.png)

---

A bagging ensemble model, for example, mitigates this problem by "averaging out" base classifier predictions to improve convergence on the true function.

[Paper describing in-depth reason for this.](http://web.cs.iastate.edu/~jtian/cs573/Papers/Dietterich-ensemble-00.pdf)

### Computational problem

There is sufficient training data, but it is computationally intractable to find the best model $h()$.

For example, if the base classifier is a decision tree, an exhaustive search of the hypothesis space of all possible trees is intractable (finding the optimal tree is NP-complete).

This is why decision trees choose their splits with a heuristic at each node (greedy search).

![computational](./images/computational.png)

---

Ensembles composed of several simpler base models that use different starting points can converge faster to a good approximation of $f()$.

### Representational problem

Suppose we use a decision tree as a base classifier. A decision tree works by forming a "rectilinear" partition of the feature space, **i.e., it always cuts at a fixed value along a feature.**

But what if $f()$ is best modeled by a diagonal line?

A diagonal line cannot be represented exactly by a finite number of rectilinear segments, so the decision tree classifier cannot obtain the true decision boundary.

![dtcut](./images/dtcut.png)

**A representational problem occurs when $f()$ cannot be expressed in terms of our hypothesis space at all.**

Yet it may still be possible to approximate $f()$ by expanding the space of representable functions using ensemble methods!

![representational](./images/representational.png)
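
As a rough illustration, the sketch below compares one axis-aligned stump with a vote over several stumps on a diagonal boundary. Everything here is an assumption made for the example: the true function is taken to be $x + y > 1$ on the unit square, and the stump thresholds are hand-picked on an even grid rather than learned. The names `single_stump` and `make_stump_vote` are hypothetical.

```python
import random

def true_label(x, y):
    """Assumed true function: a diagonal decision boundary."""
    return x + y > 1.0

def single_stump(x, y):
    """One axis-aligned cut: the best a single depth-1 tree can do here."""
    return x > 0.5

def make_stump_vote(n):
    """Vote over 2n stumps: n cuts along x and n cuts along y."""
    thresholds = [(i + 0.5) / n for i in range(n)]

    def predict(x, y):
        votes = (sum(x > t for t in thresholds)
                 + sum(y > t for t in thresholds))
        # Predict positive when at least half of the 2n stumps vote positive.
        return votes >= n

    return predict

def accuracy(predict, points):
    """Fraction of points where the prediction matches the true label."""
    hits = sum(predict(x, y) == true_label(x, y) for x, y in points)
    return hits / len(points)

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(5000)]

single_acc = accuracy(single_stump, points)
vote_acc = accuracy(make_stump_vote(10), points)
print(f"single stump: {single_acc:.3f}, vote of 20 stumps: {vote_acc:.3f}")
```

The vote produces a staircase that hugs the diagonal. Each stump is still a fixed-value cut, but the combined function lies outside what any one stump can express.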

## Conditions for ensembles to outperform base models

For an ensemble method to perform better than a base classifier, it must meet these two criteria:

1. **Accuracy:** each base classifier must outperform random guessing.
2. **Diversity:** base models must not be identical in classification/regression estimates.
    - [Description of diversity.](https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/node2.html)
    - [Paper on measures of diversity.](http://staff.ustc.edu.cn/~ketang/papers/TangSuganYao_ML06.pdf)
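
A small calculation shows why both criteria matter. It rests on an idealized assumption that is rarely met exactly in practice: the base classifiers make their errors independently, each with the same probability of being correct. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """P(majority is correct) for n independent classifiers (n odd)."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct ** k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

for p in (0.4, 0.5, 0.7):
    print(f"base accuracy {p}: ensemble of 21 -> "
          f"{majority_vote_accuracy(21, p):.3f}")
```

When each base model beats random guessing, the vote is more accurate than any member. At exactly 0.5 nothing is gained, and below 0.5 voting makes things worse. If the 21 models were identical (no diversity), the vote would simply reproduce the base accuracy.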

## Bagging

The ensemble method we will be using today is called **bagging**, which is short for **bootstrap aggregating**.

Bagging builds multiple base models on **training data resampled with replacement.** We train $k$ base classifiers on $k$ different bootstrap samples of the training data. Training each base model on a different random sample promotes diversity among the base models.
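
The procedure can be sketched with the standard library alone. This is a toy setup under stated assumptions, not how Random Forests are implemented: the data is a noisy sine curve, and the base model is a 1-nearest-neighbor regressor, chosen only because it is a short, high-variance learner. The helper names (`bootstrap_sample`, `bagged_predict`, and so on) and the value `k = 25` are hypothetical.

```python
import math
import random

def true_f(x):
    """Assumed true function, used to generate data and score models."""
    return math.sin(x)

def nearest_neighbor_predict(train, x):
    """1-NN regression: return the y of the closest training x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def bootstrap_sample(train, rng):
    """Draw len(train) points from train with replacement."""
    return rng.choices(train, k=len(train))

def bagged_predict(samples, x):
    """Average the predictions of one base model per bootstrap sample."""
    preds = [nearest_neighbor_predict(s, x) for s in samples]
    return sum(preds) / len(preds)

rng = random.Random(0)
train = []
for _ in range(300):
    x = rng.uniform(0, 2 * math.pi)
    train.append((x, true_f(x) + rng.gauss(0, 0.3)))

k = 25  # number of base models
samples = [bootstrap_sample(train, rng) for _ in range(k)]
grid = [2 * math.pi * i / 199 for i in range(200)]

def mse_against_truth(predict):
    """Mean squared distance from the true function over the grid."""
    return sum((predict(x) - true_f(x)) ** 2 for x in grid) / len(grid)

single_mse = mse_against_truth(lambda x: nearest_neighbor_predict(train, x))
bagged_mse = mse_against_truth(lambda x: bagged_predict(samples, x))
print(f"single model MSE: {single_mse:.4f}, bagged MSE: {bagged_mse:.4f}")
```

Each bootstrap sample omits some training points and repeats others, so the $k$ base models disagree near noisy points, and averaging smooths those disagreements out.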

Random Forests, which "bag" decision trees, can achieve very high classification accuracy.

## Bagging's magic decrease of model variance 

One of the biggest advantages of Random Forests is that they **decrease variance with little or no increase in bias**. Essentially, you can get a better model without having to trade off between bias and variance.

---

**VARIANCE DECREASE**

Base model estimates are averaged together, so variability of model predictions (across hypothetical samples) is lower.
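
A standard identity makes this precise. It is stated here under a simplifying assumption: the $k$ base model predictions $h_1(x), \dots, h_k(x)$ each have variance $\sigma^2$ and every pair has correlation $\rho$ (these symbols are introduced only for this illustration).

$$ \operatorname{Var}\left(\frac{1}{k}\sum_{i=1}^{k} h_i(x)\right) = \frac{1}{k^2}\left[k\sigma^2 + k(k-1)\rho\sigma^2\right] = \rho\sigma^2 + \frac{1-\rho}{k}\sigma^2 $$

The second term shrinks as $k$ grows. The first term is a floor set by how correlated the base models are, which is one reason diversity among base models matters.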

---

**NO/LITTLE BIAS INCREASE**

The bias remains roughly the same as the bias of the individual base models. The ensemble is still able to model the "true function" because the base models' complexity is unrestricted (low bias).
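
In symbols, assuming the $k$ base model predictions $h_1(x), \dots, h_k(x)$ are identically distributed (an idealization of training on bootstrap samples), linearity of expectation gives:

$$ \mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k} h_i(x)\right] = \frac{1}{k}\sum_{i=1}^{k} \mathbb{E}\left[h_i(x)\right] = \mathbb{E}\left[h_1(x)\right] $$

The expected prediction of the average equals that of a single base model, so its distance from $f(x)$, the bias, does not change.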
