# From Decision Trees to Random Forests

```
Authors: Alexandre Gramfort
         Thomas Moreau
```

## Bagging classifiers

We saw that increasing the depth of a tree eventually leads to an over-fitted model. One way to avoid having to choose a specific depth is to combine several trees.

Let's start by training several trees on slightly different data. Each slightly different dataset can be generated by randomly sampling the original data with replacement. In statistics, this is called a bootstrap sample. We will use the iris dataset to create such an ensemble, keeping some data for training and some left-out data for testing.

In [1]:
import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=29)

Before training several decision trees, we will train a single tree. However, instead of training this tree on `X_train`, we want to train it on a bootstrap sample. You can use the `np.random.choice` function to sample indices with replacement. One option is to turn these indices into a `sample_weight` vector and pass it to the `fit` method of the `DecisionTreeClassifier`. The notebook cells that follow (`bootstrap_idx` and `bootstrap_sample`) instead index `X_train` and `y_train` directly with the sampled indices.
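The weighting approach described above can be sketched as follows. This is only an illustration under stated assumptions: the name `generate_sample_weight` comes from the text, but its signature and body here are hypothetical, as is the use of `np.random.default_rng` with a fixed seed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def generate_sample_weight(n_samples, rng):
    """Hypothetical helper: count how often each row is drawn in a bootstrap."""
    indices = rng.choice(n_samples, size=n_samples, replace=True)
    # bincount gives, for every row index, the number of times it was sampled.
    # Rows that were never drawn get a weight of 0.
    return np.bincount(indices, minlength=n_samples)


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=29
)

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
sample_weight = generate_sample_weight(X_train.shape[0], rng)

# Weighting a row by k is roughly the same as repeating it k times.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train, sample_weight=sample_weight)
test_accuracy = tree.score(X_test, y_test)
print(f"Test accuracy of the single bootstrapped tree: {test_accuracy:.3f}")
```

The weights always sum to the number of training rows, since each of the draws lands on exactly one row.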

In [2]:
def bootstrap_idx(X):
    indices = np.random.choice(
        np.arange(X.shape[0]), size=X.shape[0], replace=True
    )
    return indices

In [3]:
bootstrap_idx(X_train)

array([ 23,  20,  53,  14,  11,  99, 101,  58,  21,  29,  40,  21,  90,
        89,  60,  94,  64,  35,   8, 109,  76,  66,  79,  21,  60,  34,
        24, 106,  30,  96,  35,  90,  36,  67,   7, 106,   5,  76,  44,
       101,  23,   2,  16, 108,  54,  23,  78,  77,  79,  43,  34,  14,
        15,  62,  34,   7,  24,  47,  43, 101,  29,  81,  67,  26,  83,
        37,  62,  63,  86, 100,  41,  59,  53,  45,   5,  90,   7,  59,
        85, 110,  55, 100,  53,   8,  19,  72,  74,  85,  35,  43,  42,
        30,  25,   9,  82,  17, 102,  32,  99,  26,  73,  19,  99, 102,
        21,  33,  89,  42, 108,  89,  63,  47])

In [4]:
from collections import Counter
Counter(bootstrap_idx(X_train))

Counter({16: 4,
         18: 4,
         37: 4,
         15: 3,
         91: 3,
         50: 3,
         39: 3,
         51: 3,
         102: 3,
         77: 3,
         110: 3,
         41: 2,
         2: 2,
         22: 2,
         7: 2,
         48: 2,
         24: 2,
         65: 2,
         30: 2,
         104: 2,
         74: 2,
         88: 2,
         52: 2,
         20: 2,
         31: 2,
         26: 2,
         17: 2,
         85: 2,
         97: 2,
         54: 2,
         33: 2,
         72: 2,
         105: 1,
         69: 1,
         0: 1,
         66: 1,
         70: 1,
         107: 1,
         63: 1,
         4: 1,
         73: 1,
         93: 1,
         14: 1,
         61: 1,
         96: 1,
         35: 1,
         71: 1,
         68: 1,
         6: 1,
         8: 1,
         28: 1,
         36: 1,
         78: 1,
         99: 1,
         56: 1,
         87: 1,
         92: 1,
         79: 1,
         60: 1,
         59: 1,
         47: 1,
         103: 1,
        

In [5]:
def bootstrap_sample(X, y):
    indices = bootstrap_idx(X)
    return X[indices], y[indices]

In [6]:
X_train_bootstrap, y_train_bootstrap = bootstrap_sample(X_train, y_train)

In [7]:
print(f'Classes distribution in the original data: {Counter(y_train)}')
print(f'Classes distribution in the bootstrap: {Counter(y_train_bootstrap)}')

Classes distribution in the original data: Counter({0: 38, 1: 37, 2: 37})
Classes distribution in the bootstrap: Counter({2: 39, 0: 39, 1: 34})
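Because rows are drawn with replacement, some appear several times and others not at all, as the `Counter` output above shows. The short simulation below estimates which fraction of distinct rows ends up in a bootstrap sample and compares it with the value $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ for large $n$. The seed and the number of repetitions are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 112  # number of rows in X_train in this notebook
n_repeats = 1000

unique_fractions = []
for _ in range(n_repeats):
    indices = rng.choice(n_samples, size=n_samples, replace=True)
    # Fraction of distinct original rows present in this bootstrap sample.
    unique_fractions.append(np.unique(indices).size / n_samples)

mean_unique_fraction = np.mean(unique_fractions)
# Probability that a given row is drawn at least once in n_samples draws.
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples
print(f"simulated: {mean_unique_fraction:.3f}, expected: {expected_fraction:.3f}")
```

Each tree of the ensemble therefore sees only about two thirds of the distinct training rows, which is part of what makes the trees differ from one another.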


<div class="alert alert-success">
    <b>EXERCISE: Create a bagging classifier</b>:<br>
    <br>
    A bagging classifier will train several decision tree classifiers, each of them on a different bootstrap sample.
     <ul>
      <li>
      Create several <code>DecisionTreeClassifier</code> and store them in a Python list;
      </li>
      <li>
      Loop over these trees and <code>fit</code> each of them on a bootstrap sample generated with the <code>bootstrap_sample</code> function;
      </li>
      <li>
      To predict with this ensemble of trees on new data (testing set), you can provide the same set to each tree and call the <code>predict</code> method. Aggregate all predictions in a NumPy array;
      </li>
      <li>
      Once the predictions are available, you need to provide a single prediction: you can retain the class that was predicted most often, which is called a majority vote;
      </li>
      <li>
      Finally, check the accuracy of your model.
      </li>
    </ul>
</div>

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

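n_trees = 10
dtc_list = [DecisionTreeClassifier() for _ in range(n_trees)]
# Fit each tree on its own bootstrap sample (fit returns the fitted tree).
dtc_fitted = [tree.fit(*bootstrap_sample(X_train, y_train)) for tree in dtc_list]
# One row of predictions per tree, one column per test sample.
dtc_predict = [tree.predict(X_test) for tree in dtc_fitted]
# Majority vote: most frequent label in each column (ties go to the smallest label).
predictions = pd.DataFrame(dtc_predict).mode().values[0]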

accuracy_score(y_test, predictions)

0.9473684210526315
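The majority vote can also be written with NumPy alone. The sketch below assumes that the class labels are non-negative integers (as in iris) and that the predictions are stacked in an array of shape `(n_trees, n_samples)`. The helper name `majority_vote` and the toy array are made up for illustration.

```python
import numpy as np


def majority_vote(all_predictions):
    """Return the most frequent label per column of an (n_trees, n_samples) array."""
    all_predictions = np.asarray(all_predictions)
    n_classes = all_predictions.max() + 1
    # Count, for each sample (column), how many trees voted for each class.
    votes = np.apply_along_axis(
        np.bincount, axis=0, arr=all_predictions, minlength=n_classes
    )
    # votes has shape (n_classes, n_samples); argmax picks the winning class
    # and breaks ties in favor of the smallest label.
    return votes.argmax(axis=0)


# Three hypothetical trees voting on four samples.
toy_predictions = np.array([
    [0, 1, 2, 2],
    [0, 1, 1, 2],
    [1, 1, 2, 0],
])
print(majority_vote(toy_predictions))  # prints [0 1 2 2]
```

On the real data, `majority_vote(dtc_predict)` could stand in for the pandas expression, with the same tie-breaking toward the smallest label.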

<div class="alert alert-success">
    <b>EXERCISE: using scikit-learn</b>:
    <br>
    After implementing your own bagging classifier, use a <code>BaggingClassifier</code> from scikit-learn to fit the above data.
</div>

In [38]:
from sklearn.ensemble import BaggingClassifier

bagged = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10)
preds = bagged.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, preds)

0.9473684210526315

## Random Forests

A very famous classifier is the random forest classifier. It is similar to the bagging classifier. In addition to the bootstrap, the random forest considers only a randomly selected subset of features when searching for the best split at each node.
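To make the feature subsampling concrete, the sketch below draws the candidate features for a few hypothetical splits. It only illustrates the sampling step, not a full tree builder. The subset size $\sqrt{F}$, the seed, and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 4  # the iris dataset has four features
n_candidates = int(np.sqrt(n_features))  # sqrt(F) features per split

candidate_sets = []
for split in range(3):
    # A fresh subset is drawn, without replacement, for every split.
    candidates = rng.choice(n_features, size=n_candidates, replace=False)
    candidate_sets.append(sorted(candidates.tolist()))
    print(f"split {split}: search the best threshold among features {candidate_sets[-1]}")
```

Because every split sees a different subset, the trees are less correlated with each other than in plain bagging.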

<div class="alert alert-success">
    <b>EXERCISE: Create a random forest classifier</b>:
    <br>
    Reuse your previous code, which generated several <code>DecisionTreeClassifier</code>. Check the list of options of this classifier and modify one of the parameters such that only $\sqrt{F}$ features are considered for each split. $F$ represents the number of features in the dataset.
</div>

In [73]:
import math

n_trees = 10
dtc_list = [DecisionTreeClassifier(max_features=int(math.sqrt(X_train.shape[1]))) for _ in range(n_trees)]
dtc_fitted = [tree.fit(*bootstrap_sample(X_train, y_train)) for tree in dtc_list]
dtc_predict = [tree.predict(X_test) for tree in dtc_fitted]
predictions = pd.DataFrame(dtc_predict).mode().values[0]

accuracy_score(y_test, predictions)

0.9473684210526315

<div class="alert alert-success">
    <b>EXERCISE: using scikit-learn</b>:
    <br>
    After implementing your own random forest classifier, use a <code>RandomForestClassifier</code> from scikit-learn to fit the above data.
</div>

In [74]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier().fit(X_train, y_train)
preds = rfc.predict(X_test)

accuracy_score(y_test, preds)

0.9473684210526315
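To check that the scikit-learn forest matches the hand-written one, you can inspect the fitted trees stored in `estimators_`. The sketch below sets `n_estimators` and `max_features` explicitly rather than relying on defaults, which may differ between scikit-learn versions. The `random_state` value is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=29
)

rfc = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
).fit(X_train, y_train)

# Each fitted tree stores the resolved number of features tried per split.
n_features_per_split = rfc.estimators_[0].max_features_
print(f"Features considered at each split: {n_features_per_split}")
print(f"Number of trees: {len(rfc.estimators_)}")
```

With four features, `max_features="sqrt"` resolves to two features per split, the same value as `int(math.sqrt(X_train.shape[1]))` in the manual version.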

In [75]:
from figures import plot_forest_interactive
plot_forest_interactive()
