# **SW08: Random forests**

Random forests are an ensemble learning method that combines multiple decision trees to improve predictive accuracy and reduce overfitting. Each tree in a random forest is trained on a bootstrap sample of the data (drawn at random with replacement) and considers only a random subset of the features at each split, which makes the trees differ from one another. During prediction, the random forest aggregates the outputs of all trees, either by majority voting (for classification) or by averaging (for regression). Because the errors of the individual trees partly cancel out, random forests are more robust and less prone to overfitting than individual decision trees.
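To make the mechanism concrete, here is a minimal sketch that builds a small forest by hand from scikit-learn's `DecisionTreeClassifier`. The names `X_demo`, `y_demo`, `trees`, and `forest_pred` are chosen for this illustration only, and the number of trees (25) is an arbitrary assumption. Note that the sketch uses hard majority voting, as described above, whereas scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
n_samples = X_demo.shape[0]
n_classes = len(np.unique(y_demo))
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Bootstrapping: draw n_samples row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # max_features="sqrt": each split considers only a random feature subset
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Collect the votes of all trees: array of shape (n_trees, n_samples)
all_votes = np.stack([tree.predict(X_demo) for tree in trees]).astype(int)

# Majority vote: for each sample, pick the class predicted most often
forest_pred = np.array(
    [np.bincount(votes, minlength=n_classes).argmax() for votes in all_votes.T]
)

manual_accuracy = (forest_pred == y_demo).mean()
print("Training accuracy of the hand-made forest: {:.2f}".format(manual_accuracy))
```

In practice you would use `RandomForestClassifier` directly, as done in the rest of this tutorial; the loop above only illustrates the two sources of randomness and the aggregation step.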

In this tutorial, we will once more use the iris dataset to classify the different species of iris flowers.

---

## **Setup**



In [None]:
# Basic imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sklearn imports
from sklearn.datasets import load_iris

# Some Jupyter magic for nicer output
%config InlineBackend.figure_formats = ["svg"]   # Enable vectorized graphics

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

# Adjust the default settings for plots
import sys
sys.path.append("..")
import ml
ml.setup_plotting()

In [None]:
# Load the iris dataset
data = load_iris(as_frame=True)
X = data.data
y = data.target

display(X)

---

## **Basic example**

Let's apply a random forest directly to see how it performs on the iris dataset.

The relevant scikit-learn classes are 
[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) 
and 
[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(
    n_estimators=100,       # Number of decision trees (size of ensemble)
    max_depth=None,         # Depth of each tree (None means "unlimited")
    random_state=0,         # Random seed for reproducibility
    criterion='gini',       # How to measure the quality of a split
    min_samples_split=2,    # Number of samples required to split a node
)
clf.fit(X, y)
y_pred = clf.predict(X)

# Note: this is the accuracy on the training data, so it is optimistic
print("Accuracy: {:.2f}".format(accuracy_score(y, y_pred)))
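
The accuracy printed above is computed on the same data the model was trained on. Because each tree only sees a bootstrap sample, a random forest can also estimate its accuracy from the samples each tree did *not* see (the "out-of-bag" samples). The following sketch uses the `oob_score=True` option of `RandomForestClassifier` for this; the variable names `X_oob`, `y_oob`, and `oob_clf` are chosen for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_oob, y_oob = load_iris(return_X_y=True)

oob_clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,     # Default; out-of-bag scoring requires bootstrapping
    oob_score=True,     # Score each sample using only trees that did not see it
    random_state=0,
)
oob_clf.fit(X_oob, y_oob)

train_acc = oob_clf.score(X_oob, y_oob)   # Accuracy on the training data
oob_acc = oob_clf.oob_score_              # Out-of-bag accuracy estimate

print("Training accuracy:   {:.2f}".format(train_acc))
print("Out-of-bag accuracy: {:.2f}".format(oob_acc))
```

The out-of-bag estimate is typically lower than the training accuracy and closer to what a held-out test set would report. It does not replace a proper train-test split, but it is a cheap sanity check.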

In [None]:
########################
###    EXERCISE 1    ###
########################

# Part 1:
# -------
# Play around with the parameters of the RandomForestClassifier. What happens
# if you increase the number of trees? What happens if you increase the depth
# of the trees? ...

# Part 2:
# -------
# Remember: It is recommended to assess the model's performance on a separate
# test set, rather than the training set. Can you update the code such that it 
# uses a train-test split?

---

## **Feature importance**

Feature importance in random forests is a measure of how valuable each feature is in predicting the target outcome. The impurity-based importance reported by scikit-learn is computed as follows: for each feature, the reductions in node impurity (e.g., Gini impurity or entropy) achieved by all splits on that feature are added up, weighted by the fraction of samples reaching the node, and the result is averaged over all trees. Features that are used often and produce good splits are therefore assigned higher importance scores. This helps identify which features are most influential in the model, providing insights into the underlying patterns in the data.

We can easily extract the feature importances from a trained random forest
classifier by looking at the attribute `feature_importances_`. We can then
visualize the importances in a bar plot.
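As a small illustration of the averaging step, the sketch below checks that the forest's `feature_importances_` agree with the mean of the per-tree importances, which are available through the fitted trees in `estimators_`. The variable names (`fi_clf`, `per_tree`, `mean_importance`) are chosen for this example only; the values themselves are deliberately not printed, since inspecting them is part of the exercise below.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
fi_clf = RandomForestClassifier(n_estimators=100, random_state=0)
fi_clf.fit(iris.data, iris.target)

# One row per tree, one column per feature
per_tree = np.array([tree.feature_importances_ for tree in fi_clf.estimators_])

# Average over the trees and normalize so that the importances sum to 1
mean_importance = per_tree.mean(axis=0)
mean_importance /= mean_importance.sum()

importances_match = np.allclose(mean_importance, fi_clf.feature_importances_)
print("Shape of per-tree importances:", per_tree.shape)
print("Forest importances equal the per-tree average:", importances_match)
```

The spread of the rows in `per_tree` (for example `per_tree.std(axis=0)`) gives a rough idea of how much the trees disagree about each feature.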

In [None]:
########################
###    EXERCISE 2    ###
########################

# Train a random forest classifier on the entire iris dataset. Then,
# visualize the feature importances using a bar plot. Which feature(s) 
# are the most important according to the random forest?

# Train a random forest classifier (feel free to adjust the hyperparameters)
clf = RandomForestClassifier(n_estimators=100, 
                             max_depth=4, 
                             random_state=0)
...

---

## **Decision boundaries**

In a previous tutorial (on decision trees), we used the function 
`plot_decision_boundary()` to visualize the decision boundary of a classifier 
using two features. Let's use this function again to visualize the decision
boundary, this time of a random forest classifier. For visualization purposes,
we will only use two features of the iris dataset.


```python
def plot_decision_boundary(clf, X, y, 
                          n_steps=1000, 
                          data=None, 
                          ax=None):
    """
    Visualize the decision boundary of an arbitrary classifier.

    clf:  The classifier to plot.
    X:    The features of the dataset.
    y:    The labels of the dataset.
    n_steps: Parameter controlling the resolution of the plot.
    data: (optional) Data structure provided by sklearn.datasets.load_iris().
    ax:   (optional) The axis to plot on. If None, a new figure is created.
    """
    ...
```
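
For intuition, the general idea behind such a plot can be sketched as follows: evaluate the classifier on a dense grid of points and color each grid point by its predicted class. This is not the implementation from the `ml` module; `sketch_decision_regions` is a hypothetical helper written for this illustration, and the grid margin of 0.5 and the resolution of 200 steps are arbitrary assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def sketch_decision_regions(clf, X2, y, n_steps=200, ax=None):
    """Hypothetical helper: color the plane by the class predicted by clf.

    X2 must contain exactly two feature columns.
    Returns the axis and the grid of predicted classes.
    """
    X2 = np.asarray(X2)
    if ax is None:
        _, ax = plt.subplots()

    # Build a regular grid that covers the data with a small margin
    x_min, x_max = X2[:, 0].min() - 0.5, X2[:, 0].max() + 0.5
    y_min, y_max = X2[:, 1].min() - 0.5, X2[:, 1].max() + 0.5
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, n_steps),
        np.linspace(y_min, y_max, n_steps),
    )

    # Predict the class of every grid point, then reshape back to the grid
    grid = np.column_stack([xx.ravel(), yy.ravel()])
    zz = clf.predict(grid).reshape(xx.shape)

    # Filled contours show the decision regions; the scatter shows the data
    ax.contourf(xx, yy, zz, alpha=0.3)
    ax.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor="k", s=20)
    return ax, zz


iris = load_iris()
X2_demo = iris.data[:, :2]   # First two features: sepal length and width
forest_2d = RandomForestClassifier(n_estimators=50, random_state=0)
forest_2d.fit(X2_demo, iris.target)

ax, zz = sketch_decision_regions(forest_2d, X2_demo, iris.target)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
```

A larger `n_steps` gives a smoother boundary at the cost of more predictions, since the classifier is evaluated on `n_steps * n_steps` grid points.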

In [None]:
from ml import plot_decision_boundary

# Note: Choose your own two features here... According to the feature
# importances reported above, the separation is best for petal length and
# petal width. (We pick here the more difficult case, to see that a random
# forest is able to learn relatively complex decision boundaries.)
features = ["sepal length (cm)", "sepal width (cm)"]
X_2d = X[features]

In [None]:
########################
###    EXERCISE 3    ###
########################

# Use the function plot_decision_boundary() to see how the decision boundary
# changes when using different configurations of the RandomForestClassifier.
# Just play around with the parameters and see what happens.
