# Random Forest
<br><br>
Random Forest is an ensembble method that uses bagging (bootstrap aggregating). Various CART models are built and they are averaged, or perhaps majority voting is used to arrive at the estimate. 

It helps to first consider what are the pros and cons of Decision Trees:<br>
- Pros
  - Adaptable to different functions. f(x) does not have to be a continuous and differentiable function
  - Useful for classification and regression
  - Can give feature importance/feature selection
  - It is tractable. You can show someone the decision tree and why a certain sample ended up at a particular leaf
- Cons
  - Give a piecewise constant approximation that is not differentiable (refer to Murphy Figure 16.1b)
  - Can easily overfit if not careful
  - Unstable to slight changes in the data. A slight change in data can completely alter the 'path'
<br>
<br>

Random Forest reduces the variance of the estimates and this is accomplished by two aspects of RF that are described below.<br>

- Draw many data sets D_training with replacement
- Before splitting region $R_m$ select a random subset of features, d, where d < D (D is the full set of features)
  - select the best feature in d to threshold
    - Here the similarity between trees is small
    - When the trees are averaged or voted on, this low similarity reduces variance considerably
  - This gives rise to a single CART, $T_m, m \in [1,M]$, and an estimate $\hat{f_m}(x)$  
- Final step averages all of the trees to arrive at an estimate. Voting can also be used instead of averaging.
  - For regression the result is $$\hat{f}(x) = \frac{1}{M}\sum_{m=1}^M \hat{f_m}(x) $$
    - This is referred to as bagging 
    - This averaging is one of the ways the Random Forests reduce variance
  - For classification the result is $$\hat{y}(\vec{x}) = \underset{c}{\mathrm{argmax}}\quad\sum_{b=1}^B(\hat{y}^{(b)} = c)$$
<br>where the classification of a region is done by vote, $\hat{y}$ is the winning class<br><br>

In [1]:
'''
In this example we are reading in a house description and sale dataset. For this classification we are going to 
estimate whether a house will sell(and with what probability) within 90 days of being put on the market.
'''
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# this data has already been cleaned up, standardized, one hot encoded and vetted
df = pd.read_csv("classification_house_sale_px_data.csv", parse_dates=True, sep=',', header=0)
df_labels = pd.read_csv("classification_house_sale_px_labels.csv", parse_dates=True, sep=',', header=0)

# split data into training and test sets
train, test, y_train, y_test = train_test_split(df, df_labels, train_size=.6, test_size=.4, shuffle=True)

# run the classifier on the training data
clf = RandomForestClassifier(n_estimators=10, max_depth=5)
clf.fit(train, list(y_train.label.values))
# make prediction on the test data
#predicted = clf.predict(test)
print("Random Forest: Test set accuracy (% correct) when max_depth = 5: {0:.3f}".format(clf.score(test, y_test.label.values)))
# run the classifier on the training data
clf = RandomForestClassifier(n_estimators=10, max_depth=5)
clf.fit(train, list(y_train.label.values))
print("Random Forest: Test set accuracy (% correct) when max_depth = 50: {0:.3f}".format(clf.score(test, y_test.label.values)))

Random Forest: Test set accuracy (% correct) when max_depth = 5: 0.607
Random Forest: Test set accuracy (% correct) when max_depth = 50: 0.596


<br>
Note how the RF is betwer than the Decision Tree, albeit by a small margin. Even with a higher depth the results are better than the Decision Tree. 
<br>
# Take away
- Random forest is an ensemble method that averages the results of many decision trees (CART)
- Random forest reduces variance due to the averaging and bagging calculations