<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">

## Build a (basic) random forest by hand

Week 7 | 3.2

---

The Random Forest model is a popular bagging ensemble method. It combines many decision tree classifiers or regressors as the "base models" to make predictions.

By building one from scratch, you can get a feel for exactly what is going on inside a bagging ensemble model.

![rf](./images/randomforests_viz.png)

---

### Construction of the RF

The Random Forest classifier is built such that:

1. Multiple internal decision tree classifiers are built as the base models.
2. For each base model, the data is resampled with replacement, as in bootstrapping.
3. Each decision tree is fit on its own bootstrapped sample of the data.
4. To predict, each base model is passed the new data and makes its own prediction. The final output is a vote across the base models for the class.
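
The resampling and voting steps above can be sketched with NumPy alone, before any decision trees are involved. This is a minimal illustration, not the class built later in this lesson: the toy arrays, the `votes` matrix, and the helper names `bootstrap_sample` and `majority_vote` are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Toy data (hypothetical): 6 rows, 1 feature, binary labels.
X = np.arange(6).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

def bootstrap_sample(X, y, rng):
    """Draw n rows with replacement, where n is the number of rows in X."""
    idx = rng.integers(0, X.shape[0], size=X.shape[0])
    return X[idx], y[idx]

def majority_vote(votes):
    """votes has shape (n_models, n_samples) with 0/1 entries.

    Returns 1 for a sample when at least half of the models voted 1.
    """
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Same size as the original data, but some rows repeat and some are left out.
X_boot, y_boot = bootstrap_sample(X, y, rng)

# Three hypothetical base models voting on four new samples.
votes = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 0, 0]])
final = majority_vote(votes)
print(final)  # [0 1 0 0]
```

Because each bootstrap sample repeats some rows and omits others, every tree sees a slightly different dataset, which is what makes the vote across trees more stable than any single tree.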

---



In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [6]:
titanic = pd.read_csv('/Users/austinwhaley/Desktop/DSI-SF-4-austinmwhaley/datasets/titanic/titanic_clean.csv')

In [5]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

In [7]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S


In [57]:
import patsy

y, X = patsy.dmatrices('Survived ~ C(Pclass) + Sex + Age + C(SibSp) + C(Parch) + Fare', data=titanic)

y = np.ravel(y)

In [59]:
from sklearn.model_selection import train_test_split

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

In [61]:
print(X_train.shape, X_test.shape)

(356, 17) (356, 17)


In [67]:
dtc = DecisionTreeClassifier(max_depth=None)
dtc.fit(X_train, y_train)
dtc.score(X_test, y_test)

0.7668539325842697

In [68]:
# Baseline: accuracy from always predicting the majority class (0, did not survive)
1. - np.mean(y_test)

array(0.6179775280898876)

In [72]:
rf = RandomForestClass(n_estimators=1000)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

array(0.7865168539325843)

In [70]:
from sklearn.base import clone

class RandomForestClass(object):
    """A basic bagged ensemble of decision trees for binary (0/1) targets."""

    def __init__(self, n_estimators=10, max_depth=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        # max_features=None lets every split consider all features, so the
        # trees differ only through their bootstrap samples.
        self.base_estimator = DecisionTreeClassifier(max_depth=self.max_depth, max_features=None)
        # Fitted trees live on the instance. A class-level list would be
        # shared by every RandomForestClass object.
        self.container = []

    def bootstrap_data(self, X, y):
        # Sample row indices with replacement, keeping the original sample size.
        indices = np.arange(X.shape[0])
        random_indices = np.random.choice(indices, replace=True, size=X.shape[0])
        return X[random_indices, :], y[random_indices]

    def fit(self, X, y):
        if isinstance(X, pd.DataFrame):
            X = X.values
        # Positional indexing in bootstrap_data needs a plain array for y.
        y = np.asarray(y)
        # Start from an empty list so refitting does not keep old trees.
        self.container = []
        for i in range(self.n_estimators):
            estimator = clone(self.base_estimator)
            X_boot, y_boot = self.bootstrap_data(X, y)
            estimator.fit(X_boot, y_boot)
            self.container.append(estimator)

    def predict_proba(self, X):
        # Each tree casts a hard 0/1 vote; the mean over trees is the
        # fraction of trees voting for class 1.
        votes = np.array([estimator.predict(X) for estimator in self.container])
        self.predicted_probabilities = np.mean(votes, axis=0)
        return self.predicted_probabilities

    def predict(self, X, threshold=0.5):
        # Predict class 1 when the share of trees voting 1 reaches the threshold.
        pp = self.predict_proba(X)
        self.predictions = (pp >= threshold).astype(int)
        return self.predictions

    def score(self, X, y_true):
        # Accuracy: the fraction of predictions that match the true labels.
        predictions = self.predict(X)
        return np.mean(predictions == y_true)

In [50]:
rf = RandomForestClass(n_estimators=100)

In [51]:
X = np.array([[1,2], [3,4], [5,6], [7,8]])
y = np.array([0,1,0,1])

In [52]:
rf.fit(X, y)

In [53]:
rf.predict_proba(X)

array([ 0.27,  0.66,  0.29,  0.74])

In [54]:
rf.predict(X, threshold=0.5)

array([0, 1, 0, 1])

In [55]:
rf.score(X, y)

1.0