# Build A (Basic) Random Forest from Scratch

---

> **Note**: this is intended to be a group work lab or a codealong with the instructor.


## What is a Random Forest?

---

Random Forests are among the most widely used classifiers. They are relatively simple to use because they have few hyperparameters to set, and they usually perform well with little tuning. As we have seen, Decision Trees are very powerful machine learning models.

However, Decision Trees have some critical limitations. In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets. Bagging helps mitigate this problem by training each tree on a different bootstrap sample of the training set.

Random forests are a further way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but it usually improves the performance of the final model substantially.
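The variance argument above can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "tree" is modeled as an independent noisy estimate of the same true value, which real trees are not. The numbers (25 trees, 10,000 trials, unit noise) are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each row is one trial; each column is one hypothetical "tree" whose
# prediction is the true value 1.0 plus independent unit-variance noise.
n_trials, n_trees = 10_000, 25
preds = rng.normal(loc=1.0, scale=1.0, size=(n_trials, n_trees))

# Variance of a single tree's prediction across trials.
single_var = preds[:, 0].var()

# Variance of the ensemble average across trials.
# For independent estimates, theory gives single variance / n_trees.
ensemble_var = preds.mean(axis=1).var()

print(f"single tree variance: {single_var:.3f}")
print(f"ensemble variance:    {ensemble_var:.3f}")
```

In practice, trees fit on bootstrap samples of the same data are correlated, so the reduction is smaller than this idealized case suggests.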

### Feature bagging

Random forests differ from bagged decision trees in only one way: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called feature bagging.

The reason for doing this is that trees built on ordinary bootstrap samples tend to be correlated: if one or a few features are very strong predictors of the response variable (target output), these features will be selected in many of the bagged base trees, making the trees similar to one another. By selecting a random subset of the features at each split, we reduce this correlation between base trees, which strengthens the overall model.
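As a rough illustration of feature bagging, the sketch below draws a fresh random subset of feature indices for each candidate split. The helper name `sample_features` and the values of `n_features` and `m` are hypothetical and chosen only for the example; a real tree learner would then search for the best split among the sampled columns only.

```python
import numpy as np


def sample_features(n_features, m, rng):
    """Return m distinct feature indices drawn at random (sorted for readability)."""
    return np.sort(rng.choice(n_features, size=m, replace=False))


rng = np.random.default_rng(42)
n_features = 9   # hypothetical total number of features
m = 3            # hypothetical number of features considered per split

# Each candidate split gets its own independent draw of features.
subset = sample_features(n_features, m, rng)
for split in range(3):
    print(f"split {split}: candidate features {sample_features(n_features, m, rng)}")
```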

**For a problem with p features, it is typical to use:**
- `p^(1/2)` (rounded down) features in each split for a classification problem.
- `p/3` (rounded down) with a minimum node size of 5 as the default for a regression problem.
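These rules of thumb are easy to compute. The sketch below uses a hypothetical helper, `default_max_features`, and adds a floor of one feature so that very small `p` still yields a usable value; that floor is an assumption of this example, not part of the rule stated above.

```python
import math


def default_max_features(p, task="classification"):
    """Typical number of features to consider per split for p total features."""
    if task == "classification":
        # Floor of the square root of p.
        return max(1, math.isqrt(p))
    # Regression: floor of p / 3.
    return max(1, p // 3)


for p in (4, 10, 16, 100):
    print(p, default_max_features(p), default_max_features(p, task="regression"))
```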

## Lab Instructions

---

**A Random Forest classifier is built satisfying these conditions:**
1. Multiple internal decision tree classifiers will be built as the base models.
2. For each base model, the data will be resampled with replacement (bootstrapping).
3. Each decision tree will be fit on one of the bootstrapped samples of the original data.
4. To predict, each internal base model will be passed the new data and make its own predictions. The final output will be a vote across the base models for the class.
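The two mechanical pieces of these conditions, bootstrapping and voting, can be sketched with NumPy alone. The helper names `bootstrap_indices` and `majority_vote` are hypothetical, and the vote assumes class labels are non-negative integers; ties go to the lowest class label, which is a choice made for this example.

```python
import numpy as np


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement."""
    return rng.integers(0, n_samples, size=n_samples)


def majority_vote(predictions):
    """Most common label per sample.

    predictions has shape (n_estimators, n_samples) and holds
    non-negative integer class labels.
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # counts[c, j] = number of estimators that predicted class c for sample j
    counts = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)


rng = np.random.default_rng(0)
idx = bootstrap_indices(10, rng)   # some rows repeat, some are left out
print(idx)

# Three hypothetical base models voting on three samples.
votes = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 1]])
final = majority_vote(votes)
print(final)
```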

**Your custom random forest classifier must:**
1. Accept hyperparameters `max_features`, `n_estimators`, and `max_depth`.
2. Implement a `fit` method.
3. Implement a `predict` method.
4. Implement a `score` method.
5. Satisfy the conditions for random forest classifiers listed above!
6. **You will not be implementing feature bagging in this lab, for the sake of simplicity!**

**Test your random forest classifier on the (pre-cleaned) Titanic dataset and compare the performance to a single decision tree classifier!**

> *Note: you are allowed to use the `DecisionTreeClassifier` class in sklearn for the internal base estimators. This lab is about building the random forest ensemble estimator, not building a decision tree class!*

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
titanic = pd.read_csv('../data/titanic_clean.csv')

In [3]:
# numpy was already imported as np in the first cell
from sklearn.tree import DecisionTreeClassifier

In [4]:
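# This will be the custom, basic version of a Random Forest in the style of
# sklearn's models. The method bodies are left as `pass` for you to fill in.
class RandomForest:

    def __init__(self, n_estimators=10, max_depth=None, max_features=None):
        # Store the hyperparameters (the default values here are placeholders)
        pass

    def fit(self, X, y):
        # Fit one decision tree per bootstrapped sample of (X, y)
        pass

    def predict(self, X):
        # Collect each tree's predictions and return the majority vote
        pass

    def score(self, X, y):
        # Return the accuracy of predict(X) against the true labels y
        pass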
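For reference after you have attempted the lab, here is one possible way to fill in the skeleton. It is a minimal sketch, not the intended solution: the class name `RandomForestSketch`, the `random_state` argument, and the default values are assumptions of this example. It also passes `max_features` straight through to sklearn's `DecisionTreeClassifier`, which then handles any per-split feature sampling itself. Because the Titanic CSV is not available here, the usage lines run on synthetic data from `make_classification`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


class RandomForestSketch:
    """Bootstrapped decision trees combined by majority vote."""

    def __init__(self, n_estimators=10, max_depth=None, max_features=None,
                 random_state=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        self.classes_ = np.unique(y)
        self.estimators_ = []
        for _ in range(self.n_estimators):
            # Bootstrap: sample row indices with replacement.
            idx = rng.integers(0, len(X), size=len(X))
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features=self.max_features,
                random_state=int(rng.integers(0, 2**31 - 1)),
            )
            tree.fit(X[idx], y[idx])
            self.estimators_.append(tree)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Shape (n_estimators, n_samples): one row of predictions per tree.
        all_preds = np.array([tree.predict(X) for tree in self.estimators_])
        # votes[c, j] = number of trees that predicted classes_[c] for sample j
        votes = np.array([(all_preds == c).sum(axis=0) for c in self.classes_])
        return self.classes_[votes.argmax(axis=0)]

    def score(self, X, y):
        # Accuracy: fraction of predictions that match the true labels.
        return float(np.mean(self.predict(X) == np.asarray(y)))


# Synthetic stand-in data (an assumption; the lab itself uses the Titanic set).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestSketch(n_estimators=25, max_depth=5, random_state=0)
forest.fit(X_train, y_train)
single_tree = DecisionTreeClassifier(max_depth=5, random_state=0)
single_tree.fit(X_train, y_train)

forest_acc = forest.score(X_test, y_test)
tree_acc = single_tree.score(X_test, y_test)
print(f"forest accuracy:      {forest_acc:.3f}")
print(f"single tree accuracy: {tree_acc:.3f}")
```

To run the same comparison on the Titanic data, replace the synthetic `X` and `y` with the feature columns and target from the `titanic` DataFrame loaded earlier.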