## Random Forest

While a decision tree's easily represented set of rules is powerful for modeling and for conveying that model to a general audience, its high variance and propensity for overfitting are serious problems.

__Random Forest:__
- Instead of one tree, make a 'forest' of many trees
- Each tree gets to vote on the outcome for a given observation
- Lower variance than a single tree, usually with higher accuracy
- Classification Random Forest uses the mode of votes for prediction
- Regression RF uses the mean of the trees' predictions
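
The two aggregation rules above can be sketched in a few lines. This is a minimal illustration, not any library's implementation; the function names `classify_by_vote` and `regress_by_mean` and the sample votes are made up for the example.

```python
from collections import Counter
from statistics import mean


def classify_by_vote(tree_votes):
    """Return the most common class label among the trees' votes (the mode)."""
    # On a tie, Counter returns the label that was encountered first.
    return Counter(tree_votes).most_common(1)[0][0]


def regress_by_mean(tree_predictions):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs from individual trees for one observation
class_votes = ["spam", "ham", "spam", "spam", "ham"]
value_votes = [10.0, 12.0, 11.0, 15.0]

print(classify_by_vote(class_votes))  # spam
print(regress_by_mean(value_votes))   # 12.0
```

Note that a single outlying tree shifts the regression mean but cannot change the classification result unless it flips the majority.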

## Parameters

Set parameters for both the tree and the forest
- Trees have same parameters as before: depth, number of features, how the tree is built (entropy, [gini impurity](https://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria))
- Get to set the number of estimators (trees) to generate in the forest
- Number of trees is a tradeoff between predictive performance and computational cost
- Tune by increasing number of trees until the additional learning from another tree approaches zero
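
One way to express the "stop when another tree adds almost nothing" rule is sketched below. The helper `pick_n_trees`, the `min_gain` threshold, and the scores are all hypothetical; in practice the scores would come from validating forests of each size on held-out data.

```python
def pick_n_trees(scores_by_n, min_gain=0.001):
    """Return the first forest size whose next-larger forest gains < min_gain.

    scores_by_n: list of (n_trees, validation_score) pairs sorted by n_trees.
    """
    for (n, score), (_, next_score) in zip(scores_by_n, scores_by_n[1:]):
        if next_score - score < min_gain:
            return n
    # Scores never plateaued: fall back to the largest size tried.
    return scores_by_n[-1][0]


# Made-up validation accuracies, for illustration only
scores = [(10, 0.850), (50, 0.900), (100, 0.915), (200, 0.9155), (400, 0.9156)]
print(pick_n_trees(scores))  # 100
```

Here going from 100 to 200 trees gains less than `min_gain`, so 100 is chosen even though larger forests score marginally higher.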

__Entropy vs. Gini:__
- Gini intended for continuous attributes, entropy for attributes that occur in classes (e.g. colors)
- Gini tends to find the largest class, entropy tends to find groups of classes that make up ~50% of the data
- Gini to minimize misclassification, entropy for exploratory analysis
- The two criteria choose a different split less than 2% of the time
- Entropy can be slower to compute
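
For reference, both criteria are simple functions of the class proportions at a node: Gini impurity is $1 - \sum_i p_i^2$ and entropy is $-\sum_i p_i \log_2 p_i$. The sketch below computes both (function names are illustrative); the logarithm in entropy is the reason it can be slower to compute.

```python
from math import log2


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * log2(p) for p in proportions if p > 0)


# A perfectly mixed two-class node is the worst case for both measures
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0

# A mostly pure node scores lower on both
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))
```

Both measures are 0 for a pure node and peak when the classes are evenly mixed, which is why they usually rank candidate splits the same way.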

## Bagging and Random Subspace

RFs use these two methods to generate trees that differ from one another. Without them, building every tree from the same data could produce very similar or identical trees, with highly predictive features dominating every tree and biasing the forest's predictions.

__Bagging:__ each tree builds its training set by sampling observations with replacement (the same observation can be chosen more than once)
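
A bootstrap sample of this kind can be drawn as sketched below. The helper name `bootstrap_sample` is made up for the example, and it assumes the sample is the same size as the original data, which is a common default rather than a requirement.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) observations with replacement.

    Returns (in_bag, out_of_bag): the sampled rows, and the rows never picked.
    """
    n = len(rows)
    picked = [rng.randrange(n) for _ in range(n)]
    picked_set = set(picked)
    in_bag = [rows[i] for i in picked]
    out_of_bag = [rows[i] for i in range(n) if i not in picked_set]
    return in_bag, out_of_bag


rng = random.Random(0)  # seeded so the example is repeatable
rows = list(range(10))  # stand-in for 10 observations
in_bag, out_of_bag = bootstrap_sample(rows, rng)
print(in_bag)      # some rows appear more than once
print(out_of_bag)  # rows this tree never saw
```

For large datasets roughly 37% of observations (about $1/e$) end up out of bag for any given tree, so each tree sees a noticeably different slice of the data.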

__Random subspace:__ use a random subset of features for each split
- Each time a tree performs a split (generates a rule), it considers only a random subset of the features as candidates for that rule
- Helps avoid correlated trees because trees are built with different available features

__General rule:__ for a dataset with x features
- Classifiers use $\sqrt{x}$ features
- Regressors use $x/3$ features
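
The rule of thumb and the per-split feature draw can be sketched together. The function names are hypothetical, and rounding down with a floor of one feature is an assumption of this sketch, not something the rule itself specifies.

```python
import math
import random


def n_features_per_split(n_features, task):
    """Rule-of-thumb subset size: sqrt(x) for classification, x/3 for regression."""
    if task == "classification":
        k = math.sqrt(n_features)
    elif task == "regression":
        k = n_features / 3
    else:
        raise ValueError(f"unknown task: {task!r}")
    # Assumption: round down, but always keep at least one feature.
    return max(1, int(k))


def candidate_features(n_features, task, rng):
    """Pick the feature indices a tree may consider at one split."""
    k = n_features_per_split(n_features, task)
    return rng.sample(range(n_features), k)  # sampled without replacement


print(n_features_per_split(16, "classification"))  # 4
print(n_features_per_split(30, "regression"))      # 10

rng = random.Random(0)
print(candidate_features(16, "classification", rng))  # a fresh subset per split
```

Calling `candidate_features` again for the next split yields a different subset, which is what keeps a single strong feature from appearing at the top of every tree.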

## Advantages and Disadvantages

__Advantages:__
- Strong performer
- Accurate in many situations

__Disadvantages:__
- Will not predict outside of the sample: a classifier only returns classes it has seen, and a regressor averages training targets, so it cannot predict beyond the range seen in training
- Can grow very large, taxing system resources
- Lack of transparency: 'black box' model
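
The out-of-sample limitation can be demonstrated with a toy forest. The sketch below is a deliberate simplification: each "tree" is a one-split regression stump fit on a bootstrap sample, and all function names are made up for the example. The same bound holds for deeper trees, since every leaf predicts a mean of training targets.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # every x in the sample is identical: a single leaf
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]  # (threshold, left leaf mean, right leaf mean)


def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean


def fit_forest(xs, ys, n_trees, rng):
    """Fit each stump on its own bootstrap sample of the data."""
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    return mean(predict_stump(s, x) for s in forest)


xs = list(range(10))
ys = [2.0 * x for x in xs]  # true relationship y = 2x; largest training target is 18
forest = fit_forest(xs, ys, 25, random.Random(0))

# The true value at x = 100 is 200, but the forest cannot exceed max(ys)
print(predict_forest(forest, 100), "<=", max(ys))
```

A linear model fit to the same data would extrapolate to 200, which is why forests are a poor fit for trends that continue beyond the training range.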

__Black box:__ gives an output with little insight into how the output was achieved
- Can't see the rules being applied
- Can't see which variables are prioritized or how
- Unable to represent in a simple visual form