# Random Forest

These notes cover the random forest algorithm and its use in Python. Random forest is a powerful yet simple machine learning algorithm that handles both regression and classification, including classification with more than two classes.

## 0. Load Libraries

In [None]:
# Basic libraries for working with data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Library including linear/logistic regression
import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import accuracy_score

## 1. The Random Forest Model

Instead of a single decision tree, a random forest uses *many* different decision trees, often hundreds, working together.

Each tree in the forest generates a single prediction, e.g. "cat" or "dog", as shown in the image below. Then a vote is taken, and the majority class becomes the prediction of the model. If the model is a regression model, the prediction is instead the average of the trees' outputs.

<center>
<img src="https://images.datacamp.com/image/upload/v1718113325/image_7f309c633f.png" width="600px">

Image taken from [Datacamp](https://www.datacamp.com/tutorial/random-forests-classifier-python)</center>
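
To make the voting and averaging concrete, here is a minimal sketch using made-up tree outputs. The arrays `tree_votes` and `tree_outputs` and the helper `majority_vote` are hypothetical and only illustrate the idea. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is a simplification.

```python
import numpy as np

# Hypothetical predictions from five trees (rows) for three samples (columns).
tree_votes = np.array([
    ["cat", "dog", "dog"],
    ["cat", "dog", "cat"],
    ["dog", "dog", "cat"],
    ["cat", "cat", "dog"],
    ["cat", "dog", "dog"],
])

def majority_vote(votes):
    """Return the most common label for each sample (one column per sample)."""
    winners = []
    for column in votes.T:
        labels, counts = np.unique(column, return_counts=True)
        winners.append(str(labels[np.argmax(counts)]))
    return winners

forest_labels = majority_vote(tree_votes)
print(forest_labels)  # ['cat', 'dog', 'dog']

# Regression: hypothetical outputs from three trees for two samples.
tree_outputs = np.array([
    [0.80, 0.91],
    [0.76, 0.95],
    [0.84, 0.90],
])
forest_values = tree_outputs.mean(axis=0)  # average across trees
print(forest_values)
```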

One might ask why the trees are different. If they were all trained on the same data in the same way, they would all be the same. The trick to the random forest is that each tree is trained on only a random portion of the rows and columns of the training data. (In scikit-learn, the rows are drawn as a bootstrap sample and a random subset of the columns is considered at each split.) The overall effect of this is twofold:

* This reduces the overfitting of a single decision tree. Because each tree sees a different random slice of the data, the trees make different errors. Since their predictions are roughly independent, they average together to produce a [better result](https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability12.html) than any single tree.

* We lose the elegant explainability of the single decision tree. Our predictions are better, often *much* better, but it is now much more difficult to understand why we get particular predictions. Check out the [eli5](https://eli5.readthedocs.io/en/latest/overview.html) library for some tools that help with this.
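
The random rows and columns idea can be sketched by hand with ordinary decision trees. This is a simplified illustration on synthetic data, not how scikit-learn implements it: here each tree gets one fixed random subset of columns, whereas `RandomForestClassifier` picks a new random subset of features at every split. The names `n_trees`, `trees`, and `all_votes` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote cannot tie
n_rows, n_cols = X_train.shape
trees = []

for i in range(n_trees):
    rows = rng.integers(0, n_rows, size=n_rows)       # bootstrap: sample rows with replacement
    cols = rng.choice(n_cols, size=4, replace=False)  # random subset of the columns
    tree = DecisionTreeClassifier(random_state=i).fit(X_train[rows][:, cols], y_train[rows])
    trees.append((tree, cols))

# One row of predictions per tree, each using only the columns that tree was trained on.
all_votes = np.array([tree.predict(X_test[:, cols]) for tree, cols in trees])

# Labels are 0/1, so the majority vote is "more than half of the trees said 1".
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

tree_accuracies = [(votes == y_test).mean() for votes in all_votes]
print(f"Mean single-tree accuracy: {np.mean(tree_accuracies):.2f}")
print(f"Voted forest accuracy:     {(forest_pred == y_test).mean():.2f}")
```

On most datasets the voted accuracy should come out at or above the typical single-tree accuracy, though the exact numbers depend on the data and the random seed.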

## 2. Fitting a Model in Python

Let's start with the bone mineral density dataset, and try to predict whether a patient has a fracture or not.

In [None]:
bmd = pd.read_csv("https://www.dropbox.com/s/c6mhgatkotuze8o/bmd.csv?dl=1")

bmd.head()

| | id | age | sex | fracture | weight_kg | height_cm | medication | waiting_time | bmd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 469 | 57.052768 | F | no fracture | 64.0 | 155.5 | Anticonvulsant | 18 | 0.8793 |
| 1 | 8724 | 75.741225 | F | no fracture | 78.0 | 162.0 | No medication | 56 | 0.7946 |
| 2 | 6736 | 70.7789 | M | no fracture | 73.0 | 170.5 | No medication | 10 | 0.9067 |
| 3 | 24180 | 78.247175 | F | no fracture | 60.0 | 148.0 | No medication | 14 | 0.7112 |
| 4 | 17072 | 54.191877 | M | no fracture | 55.0 | 161.0 | No medication | 20 | 0.7909 |


In [None]:
# Select the X and y variables
X = bmd[["age", "sex", "weight_kg", "height_cm", "medication", "bmd"]]
y = bmd["fracture"]

X = pd.get_dummies(X)
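
The `pd.get_dummies` call is needed because scikit-learn's random forest expects numeric inputs, while `sex` and `medication` are text columns. A small sketch on a made-up frame (`toy` is hypothetical, with values loosely based on the rows above) shows what the encoding does: numeric columns pass through unchanged and each category becomes its own indicator column.

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and one text column.
toy = pd.DataFrame({"age": [57.0, 70.8, 78.2], "sex": ["F", "M", "F"]})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # ['age', 'sex_F', 'sex_M']
print(encoded)
```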

In [None]:
# Perform a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=314159)

Fitting the model itself is very simple, as it is for all scikit-learn models.

In [None]:
model = RandomForestClassifier().fit(X_train, y_train)
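
The call above uses the default settings. As a hedged sketch on synthetic data (the names `X_demo`, `y_demo`, and `forest` are made up here), two commonly adjusted parameters are `n_estimators`, the number of trees (100 by default in recent scikit-learn versions), and `random_state`, which makes the random row and column choices reproducible. The fitted model also exposes `feature_importances_`, one simple window into which columns the forest relies on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# 200 trees, with a fixed seed so that refitting gives the same forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_demo, y_demo)

print(len(forest.estimators_))               # number of fitted trees
print(forest.feature_importances_.round(2))  # one importance per column, summing to 1
```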

### 2.1. Model Evaluation

As usual for a machine learning classifier, we examine the following:
* Training Accuracy
* Testing Accuracy
* Null Accuracy

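Ideally, the testing accuracy is much higher than the null accuracy, and not much lower than the training accuracy.

With a training accuracy of 1.00 against a testing accuracy of 0.93, this model looks a bit overfit, but it is still quite good: it is well above the null accuracy of 0.72.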

In [None]:
# Predict on both splits
yhat_train = model.predict(X_train)
yhat_test = model.predict(X_test)

# accuracy_score expects (y_true, y_pred); null accuracy is the share of the most common test class
print(f"Train Accuracy: {accuracy_score(y_train, yhat_train):.2f}")
print(f"Test Accuracy:  {accuracy_score(y_test, yhat_test):.2f}")
print(f"Null Accuracy:  {y_test.value_counts(normalize=True).iloc[0]:.2f}")

Train Accuracy: 1.00
Test Accuracy:  0.93
Null Accuracy:  0.72
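
The null accuracy is the accuracy of a model that always predicts the most common class. One alternative way to compute it, sketched here on made-up labels, is scikit-learn's `DummyClassifier`. The toy series below are hypothetical, and note one difference from the cell above: `DummyClassifier` takes the majority class from the *training* labels, whereas the cell above uses the most common class in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 7 of 10 training rows and 3 of 4 test rows are "no fracture".
y_toy_train = pd.Series(["no fracture"] * 7 + ["fracture"] * 3)
y_toy_test = pd.Series(["no fracture"] * 3 + ["fracture"] * 1)

# The "most_frequent" strategy ignores the features, so placeholder zeros suffice.
X_toy_train = np.zeros((len(y_toy_train), 1))
X_toy_test = np.zeros((len(y_toy_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy_train, y_toy_train)
null_accuracy = accuracy_score(y_toy_test, baseline.predict(X_toy_test))
print(f"Null Accuracy: {null_accuracy:.2f}")  # Null Accuracy: 0.75
```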


If instead this had been a regression model, we would check the residuals of the testing data with a residual plot, and also compare their spread with that of the training residuals.
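
A minimal sketch of that regression check, assuming synthetic data from `make_regression` rather than the bmd dataset (all variable names here are made up for the example):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only for illustration.
X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=0
)

reg = RandomForestRegressor(random_state=0).fit(Xr_train, yr_train)

# Residual = actual value minus predicted value
pred_test = reg.predict(Xr_test)
resid_train = yr_train - reg.predict(Xr_train)
resid_test = yr_test - pred_test

# Compare the spread of the residuals on the two splits
print(f"Train residual std: {resid_train.std():.2f}")
print(f"Test residual std:  {resid_test.std():.2f}")

# Residual plot for the testing data: look for points scattered evenly around zero
fig, ax = plt.subplots()
ax.scatter(pred_test, resid_test, alpha=0.6)
ax.axhline(0, color="black", linewidth=1)
ax.set_xlabel("Predicted value")
ax.set_ylabel("Residual (actual - predicted)")
ax.set_title("Test residuals")
```

A test spread that is much larger than the training spread suggests overfitting, in the same way a large gap between training and testing accuracy does for a classifier.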