# Why

A single decision tree often has low bias but high variance. In other words, it overfits the dataset, even with constraints such as `min_samples_leaf`, `min_samples_split`, or `max_depth`.

# Intuition

<img src="random-forests.png">

Fitting many base learners, in this case decision trees that each consider only a random sample of features, and bagging them to aggregate their outputs vastly reduces the variance.

# Math

## Motivation for averaging

Consider a set of uncorrelated random variables $\{Y_i\}_{i=1}^{n}$, each with mean $\mu$ and variance $\sigma^2$, where $Y_i$ can be treated as the prediction from model $i$. The expectation and the variance of their average are:
$$E[\frac{1}{n}\sum_{i=1}^{n}Y_i] = \frac{1}{n}\sum_{i=1}^{n}E[Y_i] = \frac{1}{n} \cdot n \cdot \mu = \mu$$

$$Var[\frac{1}{n}\sum_{i=1}^{n}Y_i] = \frac{1}{n^2}\sum_{i=1}^{n}Var[Y_i] = \frac{1}{n^2} \cdot n \cdot \sigma^2 = \frac{\sigma^2}{n}$$

Hence the mean stays at $\mu$, so averaging does not add bias, while the variance shrinks by a factor of $n$.

In practice, the models in an ensemble are never fully uncorrelated. Assume instead that $\forall i \neq j, Corr(Y_{i}, Y_{j}) = \rho$, so that $Cov(Y_{i}, Y_{j}) = \rho\sigma^2$. Then the variance of the average becomes:
$$
Var[\frac{1}{n}\sum_{i=1}^{n}Y_i] = \frac{1}{n^2}Var[\sum_{i=1}^{n}Y_i] = \frac{1}{n^2}(\sum_{i=1}^{n}Var[Y_i] + \mathop{\sum\sum}_{i \neq j}Cov(Y_{i}, Y_{j})) = \frac{\sigma^2}{n} + \frac{n(n-1)\sigma^2\rho}{n^2} = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2
$$
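As $n$ grows, the second term vanishes and the variance approaches $\rho\sigma^2$. Averaging therefore still reduces variance, from $\sigma^2$ down to roughly $\rho\sigma^2$, and the lower the correlation between models, the larger the reduction.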
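To make the formula concrete, here is a minimal simulation sketch in plain Python. It assumes Gaussian predictions built from one shared noise term plus independent per-model noise, which is just one convenient way to obtain a pairwise correlation of $\rho$. The function names `ensemble_variance` and `simulate_average_variance` are illustrative and do not come from any library.

```python
import random
import statistics


def ensemble_variance(sigma2, rho, n):
    """Closed form: rho * sigma^2 + (1 - rho) / n * sigma^2."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n


def simulate_average_variance(sigma2, rho, n, trials=20000, seed=0):
    """Empirical variance of the average of n equicorrelated predictions.

    Assumption for illustration: Y_i = sigma * (sqrt(rho) * Z + sqrt(1 - rho) * E_i),
    with Z shared by all models and E_i independent standard normals, so that
    Var(Y_i) = sigma^2 and Corr(Y_i, Y_j) = rho for i != j.
    """
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    shared_weight = rho ** 0.5
    own_weight = (1.0 - rho) ** 0.5
    averages = []
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)  # noise shared by every model
        ys = [
            sigma * (shared_weight * z + own_weight * rng.gauss(0.0, 1.0))
            for _ in range(n)
        ]
        averages.append(sum(ys) / n)
    return statistics.pvariance(averages)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, ensemble_variance(1.0, 0.3, n), simulate_average_variance(1.0, 0.3, n, trials=2000))
```

With $\rho = 0$ the closed form reduces to $\sigma^2 / n$, the uncorrelated case. With $\rho > 0$ the simulated variance should level off near $\rho\sigma^2$ no matter how many models are averaged.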

## Bagging (Bootstrap AGGregatING)

Given a training set of size $n$, generate $T$ random samples, each of size $n'$, by sampling with replacement.

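When $n' = n$, each bootstrap sample is expected to contain about $63\%$ of the distinct training points, while the remaining $37\%$ are OOB (out-of-bag) samples. The probability that a given point is never drawn in $n$ draws is:

$$(1 - \frac{1}{n})^n \approx \lim_{n \to \infty}(1 - \frac{1}{n})^n = \frac{1}{e} \approx 0.368$$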
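The out-of-bag fraction can be checked empirically. The sketch below uses only the Python standard library and draws one bootstrap sample of row indices. The helper names `bootstrap_indices`, `oob_fraction`, and `expected_oob_fraction` are hypothetical, chosen for this illustration.

```python
import random


def bootstrap_indices(n, n_samples, rng):
    """Draw n_samples row indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n_samples)]


def oob_fraction(n, seed=0):
    """Fraction of the n training points absent from one bootstrap sample of size n."""
    rng = random.Random(seed)
    in_bag = set(bootstrap_indices(n, n, rng))  # distinct points that were drawn
    return 1.0 - len(in_bag) / n


def expected_oob_fraction(n):
    """Probability that a given point is never picked in n draws: (1 - 1/n)^n."""
    return (1.0 - 1.0 / n) ** n


if __name__ == "__main__":
    n = 10_000
    print("simulated OOB fraction:", oob_fraction(n))
    print("(1 - 1/n)^n:", expected_oob_fraction(n))
```

Repeating `bootstrap_indices` $T$ times, once per tree, would give the $T$ bagged training sets. The points left out of each sample can then serve as a held-out set for the tree trained on it.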

# Implementation