# Decision Trees, Random Forests, Etc

Mostly based on **ESL** and **ISLR**

`R` package `tree`, useful functions:

* `cv.tree()` performs cross-validation in order to determine the complexity of the tree.
* `predict()`
* `prune.tree()`

`R` package `randomForest`:

* `importance()` - variable importance
* `varImpPlot()` - plot variable importance

`R` package `gbm` for boosting trees.

# Decision Trees

One major problem with trees is their high variance.

## Tree Pruning

Rather than stopping the growth of a tree early, a better strategy is to grow a very large tree $T_0$ and then **prune** it back to obtain a subtree with the lowest test error.

**Cost complexity pruning**, a.k.a. **weakest link pruning**: consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$. For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that:

$$ \sum_{m=1}^{|T|} \sum_{x_i \in R_m} \big(y_i - \hat{y}_{R_m} \big)^2 + \alpha |T| $$

is as small as possible. 

* $|T|$ indicates the number of terminal nodes of the tree $T$. 
* $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the $m^{th}$ terminal node.
* $\hat{y}_{R_m}$ is the predicted response associated with $R_m$

$\alpha$ controls a trade-off between the subtree's complexity and its fit to the training data. When $\alpha = 0$ the subtree is simply $T_0$; as $\alpha$ increases from zero, branches get pruned from the tree in a nested and predictable fashion. The value of $\alpha$ can be chosen using cross-validation. See Algorithm 8.1 on page 309 of **ISLR**.
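
The trade-off can be illustrated with a minimal sketch in plain Python. It is not the API of the `tree` package; the list of subtrees and their RSS values are made-up numbers, and `best_subtree` is a hypothetical helper that simply evaluates the penalized criterion for each candidate.

```python
def cost_complexity(rss, n_leaves, alpha):
    """Penalized criterion: training RSS plus alpha times the number of terminal nodes."""
    return rss + alpha * n_leaves


def best_subtree(subtrees, alpha):
    """Return the (rss, n_leaves) pair that minimizes the penalized criterion."""
    return min(subtrees, key=lambda t: cost_complexity(t[0], t[1], alpha))


# Hypothetical nested subtrees of T_0, given as (training RSS, number of terminal nodes).
subtrees = [(100.0, 1), (60.0, 2), (45.0, 3), (40.0, 5), (38.0, 8)]

for alpha in (0.0, 2.0, 10.0, 50.0):
    rss, leaves = best_subtree(subtrees, alpha)
    print(f"alpha={alpha:5.1f} -> {leaves} terminal nodes, RSS={rss}")
```

With these numbers, larger values of $\alpha$ select progressively smaller subtrees, ending with the single-node tree.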

# Bagging

Bootstrap aggregation, or bagging, builds $B$ regression trees on $B$ bootstrapped training sets and averages the resulting predictions. For classification, a majority vote is used.

The trees built are grown deep, and not pruned. **Hence each tree has high variance but low bias**. Averaging the $B$ trees reduces the variance.

## Out-of-Bag Error Estimate

It can be shown that with bootstrapping, on average, each bagged tree makes use of around 2/3 of the observations. The remaining 1/3 of the data is not used to fit the tree, hence these are referred to as **out-of-bag** (OOB) observations. 
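
The "around 2/3" figure comes from the probability that a given observation appears at least once in a bootstrap sample of size $n$, which is $1 - (1 - 1/n)^n$ and tends to $1 - 1/e \approx 0.632$. A small check, where `prob_in_bootstrap` is a name chosen here for illustration:

```python
import math


def prob_in_bootstrap(n):
    """Probability that a given observation appears at least once in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


limit = 1.0 - math.exp(-1.0)  # limiting value as n grows, about 0.632

for n in (10, 100, 1000):
    print(n, round(prob_in_bootstrap(n), 4))
print("limit", round(limit, 4))
```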

We can predict the response for the $i^{th}$ observation using each of the trees in which the observation was OOB. This yields around $B/3$ predictions for the $i^{th}$ observation. We can use the average of these (for regression) or a majority vote (for classification) to get a single OOB prediction for this observation.

It can be shown that when B is sufficiently large, OOB error is virtually equivalent to leave-one-out cross-validation error.
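
The OOB procedure can be sketched in plain Python. This is an illustration under simplifying assumptions, not the `randomForest` implementation: each "tree" is a one-split stump on a single predictor, the data are a synthetic noisy step function, and `fit_stump`, `predict_stump` and `oob_predictions` are hypothetical helpers.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on one predictor by minimizing RSS."""
    mean = sum(ys) / len(ys)
    # (rss, threshold, left mean, right mean); the default predicts the overall mean.
    best = (float("inf"), float("inf"), mean, mean)
    values = sorted(set(xs))
    for a, b in zip(values, values[1:]):
        s = (a + b) / 2.0  # candidate threshold halfway between neighbouring values
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if rss < best[0]:
            best = (rss, s, ml, mr)
    return best[1:]


def predict_stump(stump, x):
    s, ml, mr = stump
    return ml if x < s else mr


def oob_predictions(xs, ys, B, rng):
    """Bag B stumps; return the averaged OOB prediction and OOB count per observation."""
    n = len(xs)
    collected = [[] for _ in range(n)]
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample, drawn with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # observation i was not used to fit this stump
                collected[i].append(predict_stump(stump, xs[i]))
    preds = [sum(c) / len(c) if c else None for c in collected]
    counts = [len(c) for c in collected]
    return preds, counts


rng = random.Random(0)
n, B = 50, 200
xs = [i / n for i in range(n)]
ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0.0, 0.1) for x in xs]

preds, counts = oob_predictions(xs, ys, B, rng)
pairs = [(p, y) for p, y in zip(preds, ys) if p is not None]
oob_mse = sum((p - y) ** 2 for p, y in pairs) / len(pairs)
mean_count = sum(counts) / n
print(f"OOB MSE: {oob_mse:.4f}, mean OOB predictions per observation: {mean_count:.1f}")
```

The mean number of OOB predictions per observation should come out at roughly a third of $B$, matching the argument above.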

# Variable Importance

**Regression**: record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees. A **large** value indicates an important predictor. **ISLR** p319.

$$ RSS = \sum_{j=1}^{J} \sum_{i \in R_j} \big(y_i - \hat{y}_{R_j} \big)^2 $$
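
As a rough sketch of the bookkeeping (plain Python, not the `importance()` function from `randomForest`): the decrease in RSS from one split is the parent's RSS minus the RSS of the two children, and the importance of a predictor is the sum of those decreases over its splits, averaged over trees. The split records below are made-up values, and `rss_decrease` and `importance_by_predictor` are names invented for this example.

```python
def rss(ys):
    """Residual sum of squares around the mean of ys."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def rss_decrease(parent, left, right):
    """Reduction in RSS obtained by splitting parent into left and right."""
    return rss(parent) - rss(left) - rss(right)


def importance_by_predictor(trees):
    """Total RSS decrease per predictor, averaged over all trees."""
    totals = {}
    for splits in trees:
        for name, decrease in splits:
            totals[name] = totals.get(name, 0.0) + decrease
    return {name: total / len(trees) for name, total in totals.items()}


# Hypothetical record of splits for B = 2 trees: (predictor, RSS decrease of that split).
trees = [
    [("x1", 30.0), ("x2", 5.0)],
    [("x1", 20.0), ("x1", 4.0), ("x3", 1.0)],
]

print(rss_decrease([1.0, 1.0, 3.0, 3.0], [1.0, 1.0], [3.0, 3.0]))
print(importance_by_predictor(trees))
```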

**Classification**: add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all B trees. 

$$ G = \sum_{k=1}^{K} \hat{p}_{mk}(1-\hat{p}_{mk}) $$

Where $\hat{p}_{mk}$ represents the proportion of training observations in the $m^{th}$ region that are from the $k^{th}$ class.

An alternative to the Gini index is **cross-entropy**, given by:

$$ D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} $$
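
Both measures are small when a node is nearly pure. A minimal sketch that evaluates the two formulas for a vector of class proportions (the function names are chosen here for illustration, and terms with $\hat{p}_{mk} = 0$ are treated as contributing zero to the cross-entropy):

```python
import math


def gini(p):
    """Gini index for class proportions p (assumed to sum to 1)."""
    return sum(pk * (1.0 - pk) for pk in p)


def cross_entropy(p):
    """Cross-entropy for class proportions p; zero proportions contribute nothing."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


for p in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(p, round(gini(p), 4), round(cross_entropy(p), 4))
```

A pure node gives zero for both measures, and an evenly mixed two-class node gives the maximum of each.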

# Random Forest

For bagged trees, strong predictors will result in correlated trees, therefore the reduction in variance would be limited. 

Random forest overcomes this issue by considering only a **randomly chosen** subset of $m$ predictors as split candidates at each split of each tree. Typically, $m \approx \sqrt{p}$, where $p$ is the total number of predictors.

On average $(p - m)/p$ of the splits in a tree will not even consider the strong predictor, so the other predictors have a chance.
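
The $(p - m)/p$ fraction can be checked with a short simulation. This sketch only mimics the random choice of split candidates; it does not grow any trees. The values $p = 16$ and $m = 4$, the index of the "strong" predictor, and the helper `candidate_predictors` are assumptions made for the example.

```python
import math
import random


def candidate_predictors(p, m, rng):
    """Draw the m predictors (out of p) that are eligible at one split."""
    return rng.sample(range(p), m)


p = 16
m = math.isqrt(p)  # m = sqrt(p) = 4
strong = 0         # index of a hypothetical strong predictor
rng = random.Random(0)

n_splits = 10000
excluded = sum(strong not in candidate_predictors(p, m, rng) for _ in range(n_splits))
frac = excluded / n_splits

print(f"simulated: {frac:.3f}, expected (p - m) / p = {(p - m) / p:.3f}")
```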

Some measures of confidence for Random Forest have been developed, such as: 

* [forestci](https://github.com/scikit-learn-contrib/forest-confidence-interval) package for `scikit-learn`
* [randomForestCI](https://github.com/swager/randomForestCI) for `R` and the associated paper.

See this [notebook](https://github.com/ianozsvald/data_science_delivered/blob/master/ml_explain_regression_prediction.ipynb) for examples. 



# Boosting

Boosting does not use bootstrapping; instead, trees are grown **sequentially**, each using information from the previously grown trees. Boosting **learns slowly**: each tree can be rather small, with just a few terminal nodes, determined by the parameter $d$ (the number of splits in each tree).

Given the current model, we fit a tree to the residuals from the model, then we add this tree into the fitted function in order to update the residuals. 

A shrinkage parameter $\lambda$ slows down learning further. 

The third tuning parameter is the number of trees, $B$.
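
The three parameters fit together as in the following sketch of boosting for regression. It is plain Python rather than the `gbm` package, it assumes $d = 1$ (each tree is a stump) and a single predictor, and the data are a synthetic sine curve; `fit_stump`, `boost` and `train_mse` are hypothetical helpers, and `lam` stands for $\lambda$.

```python
import math


def fit_stump(xs, rs):
    """Fit a one-split regression tree (d = 1) to the current residuals rs."""
    mean = sum(rs) / len(rs)
    best = (float("inf"), float("inf"), mean, mean)  # (rss, threshold, left mean, right mean)
    values = sorted(set(xs))
    for a, b in zip(values, values[1:]):
        s = (a + b) / 2.0
        left = [r for x, r in zip(xs, rs) if x < s]
        right = [r for x, r in zip(xs, rs) if x >= s]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if rss < best[0]:
            best = (rss, s, ml, mr)
    return best[1:]


def predict_stump(stump, x):
    s, ml, mr = stump
    return ml if x < s else mr


def boost(xs, ys, B, lam):
    """Grow B stumps sequentially, each fitted to the residuals of the current model."""
    residuals = list(ys)  # the model starts at f(x) = 0, so the residuals start as y
    stumps = []
    for _ in range(B):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken version of the new tree, then update the residuals.
        residuals = [r - lam * predict_stump(stump, x) for x, r in zip(xs, residuals)]
    return stumps


def predict_boosted(stumps, lam, x):
    return lam * sum(predict_stump(s, x) for s in stumps)


def train_mse(xs, ys, B, lam):
    stumps = boost(xs, ys, B, lam)
    return sum((y - predict_boosted(stumps, lam, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 40 for i in range(40)]
ys = [math.sin(2 * math.pi * x) for x in xs]

mse_one = train_mse(xs, ys, B=1, lam=0.1)
mse_many = train_mse(xs, ys, B=200, lam=0.1)
print(f"training MSE with B=1: {mse_one:.4f}, with B=200: {mse_many:.4f}")
```

The training error keeps falling as $B$ grows, which is why $B$ is usually chosen by cross-validation rather than by training fit.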

