# Random forests

In this notebook, we present random forest models and show how they differ
from bagging ensembles.

Random forests are a popular model in machine learning. They are a
modification of the bagging algorithm. In bagging, any classifier or regressor
can be used. In random forests, the base classifier or regressor is always a
decision tree.

Random forests have another particularity: when training a tree, the search
for the best split is done only on a subset of the original features taken at
random. The random subsets are different for each split node. The goal is to
inject additional randomization into the learning procedure to try to
decorrelate the prediction errors of the individual trees.

Therefore, random forests use **randomization on both axes of the data
matrix**:

- by **bootstrapping samples** for **each tree** in the forest;
- by randomly selecting a **subset of features** at **each node** of the tree.
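
The two sources of randomness listed above can be sketched with NumPy alone. This is only an illustration of the sampling steps, not how scikit-learn implements them internally; the sizes and variable names (`n_samples`, `n_candidates`, and so on) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 6, 9  # hypothetical toy dimensions

# Row axis: each tree is trained on a bootstrap sample, that is, row indices
# drawn with replacement (some rows repeat, others are left out).
bootstrap_indices = rng.integers(0, n_samples, size=n_samples)

# Column axis: at each split node, a fresh random subset of features is
# drawn (without replacement) and only those are searched for the best split.
n_candidates = int(np.sqrt(n_features))  # square-root heuristic: 3 out of 9
features_node_1 = rng.choice(n_features, size=n_candidates, replace=False)
features_node_2 = rng.choice(n_features, size=n_candidates, replace=False)

print("bootstrap sample indices:", bootstrap_indices)
print("candidate features at node 1:", features_node_1)
print("candidate features at node 2:", features_node_2)
```

The bootstrap sample is drawn once per tree, whereas the feature subset is redrawn for every split node of that tree.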

## A look at random forests

We will illustrate the usage of a random forest classifier on the adult census
dataset.

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
target_name = "class"
data = adult_census.drop(columns=[target_name, "education-num"])
target = adult_census[target_name]

In [2]:
data.head()

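|   | age | workclass | education    | marital-status     | occupation        | relationship | race  | sex    | capital-gain | capital-loss | hours-per-week | native-country |
|---|-----|-----------|--------------|--------------------|-------------------|--------------|-------|--------|--------------|--------------|----------------|----------------|
| 0 | 25  | Private   | 11th         | Never-married      | Machine-op-inspct | Own-child    | Black | Male   | 0            | 0            | 40             | United-States  |
| 1 | 38  | Private   | HS-grad      | Married-civ-spouse | Farming-fishing   | Husband      | White | Male   | 0            | 0            | 50             | United-States  |
| 2 | 28  | Local-gov | Assoc-acdm   | Married-civ-spouse | Protective-serv   | Husband      | White | Male   | 0            | 0            | 40             | United-States  |
| 3 | 44  | Private   | Some-college | Married-civ-spouse | Machine-op-inspct | Husband      | Black | Male   | 7688         | 0            | 40             | United-States  |
| 4 | 18  | ?         | Some-college | Never-married      | ?                 | Own-child    | White | Female | 0            | 0            | 30             | United-States  |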


In [4]:
target.unique()

array([' <=50K', ' >50K'], dtype=object)

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>


The adult census dataset contains some categorical data, and we encode the
categorical features using an `OrdinalEncoder` since tree-based models can
work very efficiently with such a naive representation of categorical
variables.

Since there are rare categories in this dataset, we need to specifically
encode unknown categories at prediction time in order to be able to use
cross-validation. Otherwise, some rare categories could be present only on the
validation side of the cross-validation split, and the `OrdinalEncoder` would
raise an error when calling its `transform` method with the data points of the
validation set.
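
To make this failure mode concrete, here is a minimal sketch on a made-up two-row validation set. The tiny `train` and `valid` frames are hypothetical and only mimic the `workclass` column; they are not taken from the adult census data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy split: "Never-worked" only appears on the validation side.
train = pd.DataFrame({"workclass": ["Private", "Local-gov", "Private"]})
valid = pd.DataFrame({"workclass": ["Private", "Never-worked"]})

# With the default settings, an unseen category makes `transform` fail.
strict_encoder = OrdinalEncoder().fit(train)
try:
    strict_encoder.transform(valid)
except ValueError as error:
    print("default encoder failed:", error)

# Mapping unknown categories to a dedicated value avoids the error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)
encoded_valid = encoder.transform(valid)
print(encoded_valid)
```

The known category keeps the integer code learned on the training side, while the unseen category is mapped to `-1`.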


We will first give a simple example where we will train a single decision tree
classifier and check its generalization performance via cross-validation.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

numerical_column_selector = selector(dtype_exclude=object)
categorical_column_selector = selector(dtype_include=object)
numerical_columns = numerical_column_selector(data)
categorical_columns = categorical_column_selector(data)

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

preprocessor = ColumnTransformer(
    [
        ("categorical-preprocessor", categorical_preprocessor, categorical_columns)
    ],
    remainder="passthrough"
)

decision_tree = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", DecisionTreeClassifier(random_state=0))
    ]
)

cv_results = cross_validate(decision_tree, data, target)
scores = cv_results['test_score']
print(
    "Decision tree classifier: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

Decision tree classifier: 0.819 ± 0.005



Similarly to what was done in the previous notebook, we construct a
`BaggingClassifier` with a decision tree classifier as base model. In
addition, we need to specify how many models we want to combine. Note that
we also need to preprocess the data and thus use a scikit-learn pipeline.

In [None]:
from sklearn.ensemble import BaggingClassifier

bagged_tree = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("bagged_tree", BaggingClassifier(
            estimator=DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0, n_jobs=-1
        ))
    ]
)

cv_results = cross_validate(bagged_tree, data, target)
scores = cv_results['test_score']
print(
    "Bagged decision tree classifier: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)


Bagged decision tree classifier: 0.846 ± 0.005


Note that the generalization performance of the bagged trees is already much
better than the performance of a single tree.

Now, we will use a random forest. You will observe that we do not need to
specify any `estimator` because the estimator is forced to be a decision tree.
Thus, we just specify the desired number of trees in the forest.

In [20]:
from sklearn.ensemble import RandomForestClassifier

randomforest = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("randomforest", RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0))
    ]
)

cv_results = cross_validate(randomforest, data, target)
scores = cv_results['test_score']
print(
    "Random forest classifier: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)


Random forest classifier: 0.851 ± 0.003


The random forest seems to perform slightly better than the bagged trees,
possibly because the randomized selection of features decorrelates the
prediction errors of the individual trees and, as a consequence, makes the
averaging step more efficient at reducing overfitting.

## Details about default hyperparameters

For random forests, it is possible to control the amount of randomness for
each split by setting the value of the `max_features` hyperparameter:

- `max_features=0.5` means that 50% of the features are considered at each
  split;
- `max_features=1.0` means that all features are considered at each split
  which effectively disables feature subsampling.

By default, `RandomForestRegressor` disables feature subsampling
(`max_features=1.0`) while `RandomForestClassifier` uses
`max_features="sqrt"`, that is, `sqrt(n_features)` candidate features at each
split. These default values reflect good practices given in the scientific
literature.

However, `max_features` is one of the hyperparameters to consider when tuning
a random forest:

- too much randomness in the trees can lead to underfitted base models and can
  be detrimental for the ensemble as a whole,
- too little randomness in the trees leads to more correlation of the
  prediction errors and, as a result, reduces the benefits of the averaging
  step in terms of overfitting control.
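
One simple way to explore this trade-off is to cross-validate a few values of `max_features`. The sketch below uses a small synthetic dataset from `make_classification` rather than the adult census data, so the scores only illustrate the procedure; the candidate values and dataset sizes are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset with 16 features, 4 of them informative.
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=4, random_state=0
)

mean_scores = {}
# 2 features, sqrt(16) = 4 features, 50% = 8 features, 100% = 16 features.
for max_features in [2, "sqrt", 0.5, 1.0]:
    forest = RandomForestClassifier(
        n_estimators=30, max_features=max_features, random_state=0
    )
    mean_scores[max_features] = cross_val_score(forest, X, y, cv=3).mean()

for max_features, score in mean_scores.items():
    print(f"max_features={max_features!r}: mean accuracy {score:.3f}")
```

In practice, the same search would typically be run on the full pipeline with a tool such as `GridSearchCV`, and the best value depends on the dataset.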

In scikit-learn, the bagging classes also expose a `max_features` parameter.
However, `BaggingClassifier` and `BaggingRegressor` are agnostic with respect
to their base model and therefore random feature subsampling can only happen
once before fitting each base model instead of several times per base model as
is the case when adding splits to a given tree.
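
The difference between the two strategies can be observed on fitted models. The following sketch uses a synthetic dataset with 8 features; the dataset and the choice of 3 estimators are arbitrary, and the base estimator is passed positionally so that the code does not depend on the name of that parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Bagging: one feature subset is drawn per base model, before fitting it.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=3,
    max_features=0.5,
    random_state=0,
).fit(X, y)
for index, feature_indices in enumerate(bagging.estimators_features_):
    print(f"bagged tree {index} only sees features {sorted(feature_indices)}")

# Random forest: each tree sees all features, but every split only
# considers `max_features_` randomly drawn candidates.
forest = RandomForestClassifier(
    n_estimators=3, max_features="sqrt", random_state=0
).fit(X, y)
first_tree = forest.estimators_[0]
print("features seen by a forest tree:", first_tree.n_features_in_)
print("candidate features per split:", first_tree.max_features_)
```

Each bagged tree is restricted to its own 4 features for all of its splits, whereas a forest tree can use any of the 8 features but draws a new candidate subset at every node.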

We summarize these details in the following table:

| Ensemble model class     | Base model class          | Default value for `max_features`   | Features subsampling strategy |
|--------------------------|---------------------------|------------------------------------|-------------------------------|
| `BaggingClassifier`      | User specified (flexible) | `n_features` (no&nbsp;subsampling) | Model level                   |
| `RandomForestClassifier` | `DecisionTreeClassifier`  | `sqrt(n_features)`                 | Tree node level               |
| `BaggingRegressor`       | User specified (flexible) | `n_features` (no&nbsp;subsampling) | Model level                   |
| `RandomForestRegressor`  | `DecisionTreeRegressor`   | `n_features` (no&nbsp;subsampling) | Tree node level               |