<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Build a (Basic) Random Forest From Scratch

_Authors: Kiefer Katovich (SF)_

---

> **Note**: This lab is intended to be completed in a group or as a code-along with the instructor.


## What is a Random Forest?

---

Random forests are among the most widely used classifiers. They perform well and are relatively simple to use because they have few hyperparameters to tune.

As we have seen, decision trees are powerful machine learning models, but they also have some critical limitations. In particular, trees that are grown deep tend to learn highly irregular patterns and therefore overfit their training sets. Bagging helps mitigate this problem by exposing different trees to different subsamples of the whole training set.

Random forests take this idea further: they combine (by averaging or voting) multiple deep decision trees, each trained on a different bootstrap sample of the training set, with the goal of reducing variance. This comes at the cost of a small increase in bias and some loss of interpretability, but it generally improves the performance of the final model considerably.

### Feature Bagging

Random forests differ from decision tree bagging in only one way: They use a modified tree-learning algorithm that selects a random subset of the features at each candidate split in the learning process. This process is sometimes called feature bagging. 

Feature bagging reduces the correlation between trees that are fit on ordinary bootstrap samples. Normally, if one or a few features are strong predictors of the response variable (target output), those features will be selected in many of the bagged base trees. This causes the trees themselves to become correlated, because they tend to split on the same features and so make similar errors. By selecting a random subset of features at each split, we reduce this correlation between base trees and strengthen the overall model.
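To make the role of correlation concrete, here is a standard result, stated under simplifying assumptions that are not part of this lab: suppose the ensemble averages $B$ trees $T_1, \dots, T_B$ whose predictions at a point $x$ each have variance $\sigma^2$ and share a pairwise correlation $\rho$ (all three symbols are introduced here for illustration). Then:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. The first term, $\rho\,\sigma^2$, stays no matter how large $B$ gets, which is why lowering the correlation between trees matters as much as growing more of them.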

**For a problem with `p` features, it is typical to use:**

- `sqrt(p)` features (rounded down) at each split for a classification problem.
- `p/3` features (rounded down) at each split for a regression problem, with a minimum node size of five as the default.
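As a rough illustration of these rules of thumb, the sketch below computes the subset size and then draws a fresh random subset of column indices for a few splits. The helper names `features_per_split` and `sample_feature_indices` are hypothetical, and clamping the result to at least one feature is an assumption added here so that very small `p` still works.

```python
import math

import numpy as np


def features_per_split(p, task="classification"):
    """Rule-of-thumb number of features to consider at each split."""
    if task == "classification":
        # floor(sqrt(p)); the max(1, ...) clamp is an assumption of this sketch.
        return max(1, math.isqrt(p))
    # floor(p / 3) for regression, with the same clamp.
    return max(1, p // 3)


def sample_feature_indices(p, k, rng):
    """Draw k distinct column indices out of p, without replacement."""
    return rng.choice(p, size=k, replace=False)


rng = np.random.default_rng(0)
p = 10
k = features_per_split(p)  # 3 features per split when p = 10

# A new subset is drawn at every candidate split, not once per tree.
for split in range(3):
    print(split, sorted(sample_feature_indices(p, k, rng).tolist()))
```

Because a new subset is drawn at each split, a strong predictor is unavailable at many splits, which gives other features a chance to be used.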

## Lab Instructions

---

**A random forest classifier is built to satisfy these conditions:**

1) Multiple internal decision tree classifiers are built as the base models.
- For each base model, the data is resampled with replacement (this is what is referred to as bootstrapping).
- Each decision tree is fit on one of the bootstrapped samples of the original data.
- To predict, each internal base model is passed the new data and makes its own predictions. The final output for each observation is the class chosen by a majority vote across the base models.
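The two mechanical pieces in the conditions above, bootstrapping and voting, can be sketched with NumPy alone. The function names `bootstrap_sample` and `majority_vote` are hypothetical, and breaking ties in favor of the smallest label is an assumption of this sketch rather than a requirement of the lab.

```python
import numpy as np


def bootstrap_sample(X, y, rng):
    """Resample rows of X and y with replacement, keeping the same size."""
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)  # indices may repeat
    return X[idx], y[idx]


def majority_vote(predictions):
    """Pick the most common label per sample.

    `predictions` has shape (n_estimators, n_samples). Ties go to the
    smallest label, because np.unique returns labels in sorted order.
    """
    predictions = np.asarray(predictions)
    voted = []
    for column in predictions.T:  # one column per sample
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)


rng = np.random.default_rng(42)
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
X_boot, y_boot = bootstrap_sample(X, y, rng)

# Three hypothetical base models voting on three samples.
votes = majority_vote([[0, 1, 1],
                       [0, 0, 1],
                       [1, 0, 1]])
```

In a bootstrap sample some rows appear several times and others not at all, which is what makes each base tree see a slightly different data set.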

**Your custom random forest classifier must:**

1) Accept the hyperparameters `max_features`, `n_estimators`, and `max_depth`.
2) Implement a `fit` method.
3) Implement a `predict` method.
4) Implement a `score` method.
5) Satisfy the conditions for random forest classifiers listed above.
6) **For the sake of simplicity, you will not be implementing feature bagging in this lab.**

**Test your random forest classifier on the pre-cleaned Titanic data set and compare the performance to a single decision tree classifier.**

> *Note: You're allowed to use the `DecisionTreeClassifier` class in scikit-learn for the internal base estimators. This lab is about building the random forest ensemble estimator, not building a decision tree class!*
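One way to set up the comparison is sketched below. Because the columns of the Titanic file are not shown here, the sketch uses synthetic data from `make_classification` as a placeholder, and scikit-learn's own `RandomForestClassifier` as a stand-in for the custom class; both would be swapped for the real data and your own estimator once they are ready. The sample sizes and `n_estimators=50` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: replace with the features and target from the Titanic file.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Baseline: a single, fully grown decision tree.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Stand-in ensemble: replace with your custom random forest class.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

tree_acc = single_tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print("single tree accuracy:", tree_acc)
print("forest accuracy:     ", forest_acc)
```

Scoring both models on the same held-out split keeps the comparison fair; cross-validation would give a more stable estimate if time allows.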

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
titanic = pd.read_csv('./datasets/titanic_clean.csv')

In [3]:
from sklearn.tree import DecisionTreeClassifier

In [4]:
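# This will be a custom, basic version of a random forest in the style of
# scikit-learn's models.
class RandomForest(object):

    def __init__(self, n_estimators, max_features, max_depth):
        # Store the hyperparameters; the base trees are created in `fit`.
        pass

    def fit(self, X, y):
        # Fit one decision tree per bootstrapped sample of (X, y).
        pass

    def predict(self, X):
        # Collect each tree's predictions and return the majority vote.
        pass

    def score(self, X, y):
        # Return the accuracy of `predict(X)` against `y`.
        pass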
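For reference while working through the lab, here is one possible way the skeleton could be filled in. It is a minimal sketch, not the official solution: the class name `SketchRandomForest`, the default values, and the `random_state` argument are all assumptions added for illustration. Note that `max_features` is simply passed through to `DecisionTreeClassifier`, which then handles any per-split feature subsetting itself, so no feature bagging is implemented here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class SketchRandomForest:
    """Bagged decision trees combined by majority vote (illustrative only)."""

    def __init__(self, n_estimators=10, max_features=None, max_depth=None,
                 random_state=None):
        # Defaults and random_state are assumptions of this sketch.
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.max_depth = max_depth
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        n = X.shape[0]
        self.estimators_ = []
        for _ in range(self.n_estimators):
            # Bootstrap: draw n row indices with replacement.
            idx = rng.integers(0, n, size=n)
            tree = DecisionTreeClassifier(
                max_features=self.max_features,
                max_depth=self.max_depth,
                random_state=int(rng.integers(0, 2**31 - 1)),
            )
            tree.fit(X[idx], y[idx])
            self.estimators_.append(tree)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Shape (n_estimators, n_samples): one row of labels per tree.
        all_preds = np.array([tree.predict(X) for tree in self.estimators_])
        voted = []
        for column in all_preds.T:
            labels, counts = np.unique(column, return_counts=True)
            voted.append(labels[np.argmax(counts)])  # ties: smallest label
        return np.array(voted)

    def score(self, X, y):
        # Plain accuracy: fraction of predictions that match y.
        return float(np.mean(self.predict(X) == np.asarray(y)))


# Usage on small synthetic data (not the Titanic set).
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = SketchRandomForest(n_estimators=15, max_depth=4, random_state=0)
model.fit(X_demo, y_demo)
demo_preds = model.predict(X_demo)
demo_acc = model.score(X_demo, y_demo)
```

Returning `self` from `fit` mirrors the scikit-learn convention and allows chained calls such as `SketchRandomForest().fit(X, y).score(X, y)`.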