# Random Forests for Classification

## Titanic data

<img src="https://media4.giphy.com/media/XOY5y7YXjTD7q/giphy.gif?cid=ecf05e473tvjfnpeburx7eq75c81fxuxc7dtrn89jo61ftih&ep=v1_gifs_search&rid=giphy.gif&ct=g" style="width=400;height=300">


We are going to use Kaggle's Titanic dataset to illustrate binary classification using a **random forest model**. You can get more information on the dataset [here](https://www.kaggle.com/competitions/titanic/data).

In [15]:
import pandas as pd
from IPython.display import IFrame # Used to display a webpage
from IPython.display import Image # Used to display an image
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data (the cells below rely on `df` being defined)
df = pd.read_csv('titanic.csv')
df.head()

Here's a description of the variables included:

In [None]:
Image("data/titanic_codebook.png")

### Data preprocessing

A dataset is rarely "perfect" when you load it in for analysis -- some preprocessing is almost always necessary. First, it's important to note that `sklearn`'s random forest classifier has historically not handled missing data, so we either need to remove `NA`s prior to analysis (i.e., list-wise deletion) or impute missing values. If we look at the output of the `.describe()` method for our `pandas` dataframe, we see that `Age` has missing data (its `count` is lower than that of the other columns):

In [None]:
df.describe()

In [None]:
missing_values_count = df['Age'].isnull().sum()
print(f"Number of missing values in Age column: {missing_values_count}")

Let's use "mean imputation" to fill in these values:

In [None]:
# fill in missing values for age
mean_age = df['Age'].mean()

# Fill missing values in the Age column with the mean
df['Age'] = df['Age'].fillna(mean_age)

In [None]:
df.describe()
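
To make the two options mentioned above concrete, here is a minimal sketch comparing list-wise deletion and mean imputation. It uses a small made-up DataFrame (`toy` is a hypothetical name, not the Titanic data), so the numbers are for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages (not the Titanic data)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, np.nan, 38.0],
    'Survived': [0, 1, 1, 0, 1],
})

# Option 1: list-wise deletion drops every row that has a missing value
toy_dropped = toy.dropna()

# Option 2: mean imputation keeps every row and fills gaps with the column mean
toy_imputed = toy.copy()
toy_imputed['Age'] = toy_imputed['Age'].fillna(toy_imputed['Age'].mean())

print(len(toy_dropped), 'rows after deletion')
print(len(toy_imputed), 'rows after imputation')
print(toy_imputed['Age'].tolist())
```

Deletion throws away information from the other columns, while mean imputation keeps the rows but shrinks the variance of `Age`. Which tradeoff is acceptable depends on the analysis.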

The `Sex` of the passenger is also likely to be an important variable, and it is currently saved as a `string` (or an `object` in `pandas` speak):

In [None]:
df.dtypes

In [None]:
df['Sex'].dtypes

We can create a **dummy variable** for whether a passenger is female by using the following code:

In [None]:
df['female'] = (df['Sex'] == 'female').astype(int)

That's all the preprocessing that we will do for now. 

In [None]:
df.describe()

### Splitting into training and testing sets

As per usual, we are interested in out-of-sample performance, not in-sample fit. Let's split our data into **training** and **testing** sets before going any further:

In [None]:
# split data into training and test sets
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
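
As a quick sanity check on what the split does, the sketch below runs `train_test_split` on a made-up 10-row DataFrame (`toy` is a hypothetical stand-in) and confirms that the two pieces have the expected sizes and share no rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy DataFrame with 10 rows
toy = pd.DataFrame({'x': range(10), 'label': [0, 1] * 5})

# Same arguments as above: hold out 20% of the rows for testing
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

print(len(toy_train), len(toy_test))

# No row should appear in both sets
overlap = set(toy_train.index) & set(toy_test.index)
print('Rows in both sets:', overlap)
```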

### Feature selection

Feature selection is just a fancy name for deciding what variables to include in the model. Note that we could just throw in every variable included in the dataset (aka, the "kitchen sink"), but we'll select a smaller subset of the available variables just to keep things simple:

In [None]:
# select features
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'female']
y = 'Survived'

## The components of a random forest: decision trees and bootstrapping

As described in lecture, **random forests** are constructed by combining multiple **decision trees** using **bootstrap aggregation**. Note that `sklearn` will carry out all of these steps for you when using the `RandomForestClassifier()` class; however, let's take a closer look at the two most important components: **decision trees** and **bootstrapping**.


### Decision tree classification in `sklearn`

Fitting an individual decision tree in `sklearn` follows the same syntax as fitting any other model:

In [None]:
from sklearn.tree import DecisionTreeClassifier

# create a decision tree classifier
clf = DecisionTreeClassifier(criterion='entropy')

# train the classifier
clf.fit(df_train[features], df_train[y])

Note that we can visualize the actual decision tree constructed using the `DecisionTreeClassifier` class by typing the following:

In [None]:
from sklearn.tree import plot_tree
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=features, filled=True)
plt.show()

Good thing that we don't need to actually calculate this by hand! Once we have our model, we can use it to `predict` new data and assess out-of-sample performance:

In [None]:
# make predictions
y_pred = clf.predict(df_test[features])

# evaluate the model
from sklearn.metrics import accuracy_score
accuracy_score(df_test[y], y_pred)
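
For a sense of what `criterion='entropy'` is measuring at each split, here is a small sketch that computes entropy and information gain by hand. The labels and the candidate split are made up for illustration, and `entropy` is a hypothetical helper, not an `sklearn` function.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Hypothetical node: 4 survivors (1) and 4 non-survivors (0)
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Hypothetical candidate split of that node into two children
left = np.array([1, 1, 1, 0])   # mostly survivors
right = np.array([1, 0, 0, 0])  # mostly non-survivors

# Information gain = parent entropy minus the size-weighted child entropy
weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child

print('Parent entropy:', round(entropy(parent), 3))
print('Information gain:', round(info_gain, 3))
```

A tree-growing algorithm evaluates many candidate splits like this one and keeps the split with the largest gain, which is why doing it by hand gets tedious quickly.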

Lastly, one of the best features (no pun intended) of decision trees (and, by extension, random forests) is that they are easy to interpret. For example, we can see which of our features are the most important to the overall prediction by accessing the `.feature_importances_` [attribute](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html):

In [None]:
print(clf.feature_importances_)
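
The raw array is hard to read without the feature names attached. One possible way to label it is sketched below, using synthetic data from `make_classification` and hypothetical column names (`f0` to `f3`) in place of the Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the column names are hypothetical
X, target = make_classification(n_samples=200, n_features=4, n_informative=2,
                                n_redundant=0, random_state=42)
cols = ['f0', 'f1', 'f2', 'f3']

tree = DecisionTreeClassifier(criterion='entropy', random_state=42).fit(X, target)

# Pair each importance with its feature name and sort from most to least important
importances = pd.Series(tree.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)

# The importances are normalized, so they should sum to 1
print('Sum of importances:', importances.sum())
```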

We can also plot our feature importance for easy visualization:

In [None]:
# plot feature importances
plt.figure(figsize=(10, 6))
plt.bar(features, clf.feature_importances_)
plt.title('Feature Importances')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.show()

Or use a slightly better visualisation:

In [None]:
sorted_features = sorted(zip(features, clf.feature_importances_), key=lambda x: x[1], reverse=True)

# Unzip the sorted features and their importances
sorted_features, sorted_importances = zip(*sorted_features)

# Plot the sorted feature importances
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, sorted_importances)
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.gca().invert_yaxis()  # Ensure the most important feature is at the top
plt.show()

### Bootstrapping "by hand" with `pandas`

Again, the `RandomForestClassifier()` class will do all the "bootstrap aggregating" for you. However, the concept of generating bootstrap random samples is important beyond random forest models, and it's good to take a quick look at bootstrapping in `Python`. There are many different ways to bootstrap in `Python`, but we'll do it "by hand" using `pandas` to illustrate the main concepts.

Say, for instance, we want to understand variation in the survival rate of passengers and we don't want to use standard formulas. How could we use bootstrapping instead?

Let's start by defining how many samples we want and a list to hold our results:

In [None]:
# Number of bootstrap samples to generate
num_bootstrap_samples = 10000

# Initialize a list to store the survival proportion of each bootstrap sample
survival_proportions = []

Now we need to repeat the following procedure `num_bootstrap_samples` times:
1. Sample from our original data (`df`) with **replacement**.
2. Calculate the survival proportion.
3. Store the result in `survival_proportions`.

In [None]:
# Perform bootstrapping
for i in range(num_bootstrap_samples):
    # Randomly sample with replacement from the original DataFrame
    sample = df.sample(n=len(df), replace=True)
    
    # Add the sample estimate to the list of sample estimates
    survival_proportions.append(sample['Survived'].mean())

print('Here are the first 10 survival proportions:')
print(survival_proportions[:10])

In [None]:
# Create a histogram to visualize the bootstrap estimates of the mean
plt.figure(figsize=(10, 6))
sns.histplot(survival_proportions, kde=True)
plt.xlabel('Bootstrap Sample Mean')
plt.ylabel('Frequency')
plt.title('Bootstrap Estimates of the Mean')
plt.show()
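
To turn a bootstrap distribution like this into a statement about variation, one common approach is to take its standard deviation as a standard error and its percentiles as a confidence interval. The sketch below does this with `numpy` on simulated 0/1 outcomes; the sample size and survival probability are made up and stand in for `df['Survived']`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 0/1 outcomes standing in for df['Survived']
survived = rng.binomial(n=1, p=0.4, size=500)

# Resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(survived, size=len(survived), replace=True).mean()
    for _ in range(2000)
])

# Bootstrap standard error and a 95% percentile interval
boot_se = boot_means.std(ddof=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f'SE = {boot_se:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})')
```

The percentile interval is only one of several bootstrap interval constructions; it is used here because it is the simplest to read off the histogram.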

If I were a super crabby professor, I would make you combine the above bootstrapping procedure and the `DecisionTreeClassifier()` class to create a random forest classifier from scratch. However, I'm not a crabby professor. Onwards to the `RandomForestClassifier()`!
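
For the curious, here is a rough sketch of what that combination might look like. It is a simplification under stated assumptions, not how `sklearn` implements its forest: it uses synthetic data from `make_classification`, 25 trees, and a plain majority vote over 0/1 predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic data)
X, target = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
trees = []
for i in range(25):
    # Bootstrap: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features='sqrt' considers a random subset of features at each split
    tree = DecisionTreeClassifier(criterion='entropy', max_features='sqrt', random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregate: majority vote across trees (an odd number of trees avoids ties)
votes = np.array([tree.predict(X_test) for tree in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print('Accuracy:', (forest_pred == y_test).mean())
```

The two sources of randomness, bootstrapped rows and a random subset of features at each split, are what make the individual trees differ from one another.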

## Random forest classification in `sklearn`

At the risk of sounding like a broken record, we fit a `RandomForestClassifier()` in `sklearn` using the same syntax as any other model:

In [None]:
# import the necessary class
from sklearn.ensemble import RandomForestClassifier

# create a random forest classifier
clf = RandomForestClassifier(n_estimators=100, criterion='entropy')

# train the classifier
clf.fit(df_train[features], df_train[y])

We can also examine the most important features for our random forest model:

In [None]:
sorted_features = sorted(zip(features, clf.feature_importances_), key=lambda x: x[1], reverse=True)

# Unzip the sorted features and their importances
sorted_features, sorted_importances = zip(*sorted_features)

# Plot the sorted feature importances
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, sorted_importances)
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.gca().invert_yaxis()  # Ensure the most important feature is at the top
plt.show()

And we can `predict` new data and assess out-of-sample performance in the usual way:

In [None]:
# make predictions
y_pred = clf.predict(df_test[features])

# evaluate the model
accuracy_score(df_test[y], y_pred)

## Classification performance

With our predictions in hand, we can now calculate a range of **performance metrics**. All of these metrics start with what's referred to as a **confusion matrix**:

<img src="https://www.researchgate.net/publication/337071081/figure/fig2/AS:941941982236673@1601587877948/A-confusion-matrix-for-binary-classification.png">

From this matrix, we can define the following:

\begin{equation}
  Accuracy = \frac{TP + TN}{TP + FP + FN + TN}
\end{equation}

\begin{equation}
  Precision = \frac{TP}{TP + FP}
\end{equation}

\begin{equation}
  Recall = \frac{TP}{TP + FN}
\end{equation}

\begin{equation}
  Specificity = \frac{TN}{TN + FP}
\end{equation}

  Or if you are a visual learner:

  <img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" style="width=300;height=400">
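
To connect the formulas to code, the sketch below builds a confusion matrix from a handful of made-up labels and computes each metric from the four cell counts. The labels are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

print('TP, FP, FN, TN =', tp, fp, fn, tn)
print('accuracy =', accuracy)
print('precision =', precision)
print('recall =', recall)
print('specificity =', round(specificity, 3))
```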

#### Why do we need alternative performance metrics?

The most commonly employed metric is **accuracy**. However, we need to be very careful when using accuracy alone. For example, consider Dallas Raines, the most accurate meteorologist in history...

<img src="http://farm6.static.flickr.com/5260/5516412091_06fea7fdb8.jpg" style="width=400;height=300">
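
One way to see the problem is to score a "forecaster" who predicts no rain every single day in a dry climate. The sketch below uses made-up numbers (18 rainy days in a 365-day year) purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical year of weather: 1 = rain (rare), 0 = no rain
y_true = np.array([1] * 18 + [0] * 347)

# A "forecaster" who predicts no rain every single day
y_hat = np.zeros_like(y_true)

print('accuracy =', round(accuracy_score(y_true, y_hat), 3))
print('recall   =', recall_score(y_true, y_hat))
```

The accuracy looks excellent even though the forecaster never identifies a single rainy day, which is exactly what recall picks up.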

If accuracy falls apart for "imbalanced classes," then what are our alternatives? The most common alternative (and what I tend to use in my own work) is some combination of **precision** and **recall**.

Specifically, these measures are formally combined in the **F1-score**:

\begin{equation}
  F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}
\end{equation}

The F1-score takes into account the tradeoff between precision and recall.
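
Because the formula is a harmonic mean, it punishes a lopsided pair of values much more than a simple average would. A quick sketch with made-up precision and recall values (`f1` here is a hypothetical helper, not the `sklearn` function):

```python
def f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical values: both pairs have the same simple average (0.5)
balanced = f1(0.5, 0.5)
lopsided = f1(0.9, 0.1)

print('balanced:', round(balanced, 3))
print('lopsided:', round(lopsided, 3))
```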

All of these metrics (and others as well) are easy to calculate in `sklearn`:

In [None]:
# import necessary functions
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# calculate precision
print('precision = ', precision_score(df_test[y], y_pred))

# calculate recall
print('recall = ', recall_score(df_test[y], y_pred))

# calculate f1-score
print('F1 = ', f1_score(df_test[y], y_pred))


## Assessing out-of-sample performance with cross-validation

So far, we've randomly split our data into a single training and testing set using the `train_test_split` function. The realized split, however, is just one of many that we could have drawn. Here, we will look at **k-fold cross-validation**. What do we mean by cross-validation? Take a look at the following:

<img src="https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg">

Once again, cross-validation is easy in `sklearn`:

In [None]:
from sklearn.model_selection import cross_val_score

# estimate cross-validation accuracy
cv_scores = cross_val_score(clf, df_train[features], df_train[y], cv=5)
print('CV accuracy scores:', cv_scores)
print('CV accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))

# estimate cross-validation precision
cv_scores = cross_val_score(clf, df_train[features], df_train[y], cv=5, scoring='precision')
print('CV precision scores:', cv_scores)
print('CV precision: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))

# estimate cross-validation recall
cv_scores = cross_val_score(clf, df_train[features], df_train[y], cv=5, scoring='recall')
print('CV recall scores:', cv_scores)
print('CV recall: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))

# estimate cross-validation f1-score
cv_scores = cross_val_score(clf, df_train[features], df_train[y], cv=5, scoring='f1')
print('CV f1 scores:', cv_scores)
print('CV f1: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))
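
To show what is happening inside `cross_val_score`, here is a sketch of the same idea written as an explicit loop with `KFold`. It uses synthetic data from `make_classification` and a smaller forest to keep it quick. As far as I know, `cross_val_score` uses stratified folds for classifiers when `cv` is an integer, so this plain `KFold` loop is an approximation of the idea rather than an exact replica.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (not the Titanic data)
X, target = make_classification(n_samples=300, n_features=6, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Fit on four folds, then score on the one held-out fold
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X[train_idx], target[train_idx])
    fold_scores.append(accuracy_score(target[val_idx], model.predict(X[val_idx])))

print('Fold accuracies:', np.round(fold_scores, 3))
print('Mean accuracy: %.3f +/- %.3f' % (np.mean(fold_scores), np.std(fold_scores)))
```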


## Hyperparameter tuning

We haven't really addressed so-called "hyperparameters" yet, but "tuning" these parameters can often really help your model's performance. What's a **hyperparameter**? According to [Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_machine_learning),

> In machine learning, a hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are derived via training.

Here's a list of all of the **hyperparameters** for our `RandomForestClassifier()`:

In [None]:
IFrame('https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html', width=700, height=350)
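
If the embedded page does not render, the same list can be pulled from the estimator itself. A minimal sketch using `get_params()`, which every `sklearn` estimator provides:

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns every hyperparameter and its current (here: default) value
params = RandomForestClassifier().get_params()

for name in sorted(params):
    print(f'{name} = {params[name]}')
```

The defaults printed depend on the installed `sklearn` version, so the values may differ from those shown in the documentation page.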

The obvious next question: which of these **hyperparameters** actually matter? It turns out that the `n_estimators` and the `max_features` hyperparameters can really impact learning. Let's start with `n_estimators`.

There are two main ways of "optimizing hyperparameters" in `sklearn`: `GridSearchCV` and `RandomizedSearchCV`. Let's start by looking at `GridSearchCV`:

In [None]:
from sklearn.model_selection import GridSearchCV

# Save a list of dicts with the hyperparameters to test
param_grid = [
  {'n_estimators': list(range(100, 1000, 100))}
 ]

# create a random forest classifier
clf = RandomForestClassifier(criterion='entropy', random_state=42)

# create grid search object
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='f1', return_train_score=True, n_jobs=4)

# run the grid search
grid_search.fit(df_train[features], df_train[y])

print('Best parameters:\n', grid_search.best_params_)
print('\nBest score = ', grid_search.best_score_)

Note that we can actually use the fitted `grid_search` object to make predictions using the "best" model determined from our grid search:

In [None]:
# make predictions
y_pred = grid_search.predict(df_test[features])

# evaluate the model
f1_score(df_test[y], y_pred)
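
Beyond the single "best" model, it is often useful to see how every candidate performed. The sketch below reads the fitted search's `cv_results_` into a DataFrame. It uses synthetic data and a deliberately tiny grid (assumptions made to keep it fast), and `search` is a hypothetical name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic data)
X, target = make_classification(n_samples=200, n_features=6, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [10, 25, 50]},  # tiny grid, for illustration only
    cv=3, scoring='f1',
)
search.fit(X, target)

# One row per candidate, ordered by rank on the mean validation score
results = pd.DataFrame(search.cv_results_)
cols = ['param_n_estimators', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

If the top candidates are within a standard deviation of each other, the choice among them probably matters less than the ranking suggests.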

We can add additional hyperparameters to assess by adding more entries to our `param_grid` list of dicts:

In [None]:
param_grid = [
  {'n_estimators': list(range(100, 1000, 100)),
   'max_features': ['log2', 'sqrt']}
 ]

clf = RandomForestClassifier(criterion='entropy', random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='f1', return_train_score=True, n_jobs=4)
grid_search.fit(df_train[features], df_train[y])

print('Best parameters:\n', grid_search.best_params_)
print('\nBest score = ', grid_search.best_score_)

# make predictions
y_pred = grid_search.predict(df_test[features])

# evaluate the model
print('\nTesting F1-score = ', f1_score(df_test[y], y_pred))

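The problem with `GridSearchCV` is that it is only practical for small parameter spaces: it fits a model for every combination in the grid, so the cost multiplies with each hyperparameter added. If we need to search lots of different hyperparameters, then we can turn to `RandomizedSearchCV`, which evaluates only a random sample of the combinations: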

In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
  {'n_estimators': list(range(100, 1000, 100)),
   'max_features': ['log2', 'sqrt'],
   'criterion': ['gini', 'entropy', 'log_loss']
   }
 ]

clf = RandomForestClassifier(criterion='entropy', random_state=42)
grid_search = RandomizedSearchCV(clf, param_grid, cv=5, scoring='f1', return_train_score=True, n_jobs=4)
grid_search.fit(df_train[features], df_train[y])

print('Best parameters:\n', grid_search.best_params_)
print('\nBest score = ', grid_search.best_score_)

# make predictions
y_pred = grid_search.predict(df_test[features])

# evaluate the model
print('\nTesting F1-score = ', f1_score(df_test[y], y_pred))
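
`RandomizedSearchCV` can also sample from distributions rather than fixed lists, and `n_iter` controls how many combinations are tried. Here is a sketch under assumed settings: synthetic data, `scipy.stats.randint` for `n_estimators`, and only 5 iterations to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the Titanic data)
X, target = make_classification(n_samples=200, n_features=6, random_state=42)

# Distributions are sampled from; lists are drawn from uniformly
param_distributions = {
    'n_estimators': randint(10, 100),  # integers from 10 to 99
    'max_features': ['log2', 'sqrt'],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,  # evaluate only 5 sampled combinations
    cv=3, scoring='f1', random_state=42,
)
search.fit(X, target)

print('Best parameters:', search.best_params_)
print('Best score =', search.best_score_)
```

The cost is now set by `n_iter` rather than by the size of the search space, which is the main reason to prefer this approach when many hyperparameters are in play.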

That's it for now. We will discuss an additional approach to hyperparameter tuning -- i.e., "Bayesian optimization" -- in later weeks, but this is everything that you need to get started.

<img src="https://media0.giphy.com/media/iPNq9rFkAIVgs/giphy.webp?cid=ecf05e477kqr72weerxpbg9w032pzy30vzrlui4wnxt4xlrh&ep=v1_gifs_search&rid=giphy.webp&ct=g" style="width=400;height=300">