# Random Forest

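A random forest trains many randomized decision trees and combines their predictions: an average for regression, a vote (or averaged class probabilities) for classification.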

Typically 500-1000 trees are grown.

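Trees are normally grown deterministically. To randomize this process, each tree is built on a *bootstrap sample*: rows drawn at random, with replacement, from the training data. The number of trees and the size of the bootstrap samples are hyperparameters (`n_estimators` and `max_samples` in scikit-learn).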
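The row sampling described above can be sketched with NumPy alone. This is a minimal illustration, not scikit-learn's internal code; the dataset size and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for repeatability
n_rows = 10_000                 # hypothetical dataset size

# A bootstrap sample draws n_rows row indices *with replacement*,
# so some rows appear several times and others not at all.
sample_idx = rng.integers(0, n_rows, size=n_rows)

# Rows that were never drawn are "out-of-bag" (OOB) for this tree.
in_bag = np.unique(sample_idx)
unique_fraction = in_bag.size / n_rows
oob_fraction = 1.0 - unique_fraction

print(f"distinct rows in the sample: {unique_fraction:.3f}")
print(f"out-of-bag rows:             {oob_fraction:.3f}")
```

For large samples the distinct fraction approaches 1 - 1/e, roughly 0.632, so each tree leaves out about a third of the rows. Those left-out rows are what `oob_score=True` uses to estimate accuracy without a separate validation set.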

In addition, at each split, only a random subset of the features is considered (`max_features` in scikit-learn).
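Both sources of randomness can be combined in a small hand-rolled forest. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the Titanic set, only 25 trees, and a plain majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (an assumption, not the Titanic set).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # kept small and odd so the sketch is fast and votes cannot tie

trees = []
for i in range(n_trees):
    # Row randomness: a bootstrap sample of the training rows.
    idx = rng.integers(0, len(X), size=len(X))
    # Feature randomness: each split considers only 2 randomly chosen features.
    tree = DecisionTreeClassifier(max_features=2, random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote across trees (classes are 0/1, so the mean vote works).
votes = np.stack([tree.predict(X) for tree in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

accuracy = (forest_pred == y).mean()
print(f"training accuracy of the hand-rolled forest: {accuracy:.3f}")
```

`RandomForestClassifier` differs in one detail worth knowing: it averages the trees' predicted class probabilities instead of counting hard votes.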

[Book](https://www.kaggle.com/code/hamelg/python-for-data-30-random-forests/notebook)

[Video](https://www.youtube.com/watch?v=wNIix0YRoOY)

In [7]:
import numpy as np
import pandas as pd
import os

In [14]:
df = pd.read_csv('data/titanic/train.csv')

In [19]:
titanic_train = df.copy()
# Impute the median age (28) for NA values
new_age_var = np.where(titanic_train.Age.isnull(),
                       28,
                       titanic_train.Age)

titanic_train.Age = new_age_var

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing

In [20]:
# Seed NumPy's global RNG so the forest is reproducible
np.random.seed(12)

# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()

# Convert some variables to numeric
titanic_train["Sex"] = label_encoder.fit_transform(titanic_train["Sex"])

# Initialize the model
rf_model = RandomForestClassifier(n_estimators=1000, # Number of trees, 500-1000 is typical
                                  max_features=2,    # Features considered at each split
                                  oob_score=True)    # Score the model on out-of-bag rows

features = ["Sex", "Pclass", "SibSp", "Age", "Fare"]

# Train the model
rf_model.fit(X=titanic_train[features], 
             y=titanic_train['Survived'])

print(f'OOB accuracy: {rf_model.oob_score_}')

OOB accuracy: 0.819304152637486


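A random forest can give us an idea of which variables are more important. scikit-learn's `feature_importances_` measures how much each feature reduces impurity in the splits that use it, averaged over all trees and normalized to sum to 1.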

In [21]:
for feature, imp in zip(features, rf_model.feature_importances_):
    print(f'{feature=}\t{imp}')

feature='Sex'	0.2734664424750403
feature='Pclass'	0.09002593001585939
feature='SibSp'	0.048686858870217925
feature='Age'	0.27668347382949054
feature='Fare'	0.3111372948093918
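To make the importance numbers easier to read, they can be sorted, and the forest-level values can be checked against the per-tree values they are averaged from. This is a sketch on synthetic data with hypothetical feature names (`x0` to `x4`), not a rerun of the Titanic model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (not the Titanic columns).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
names = [f"x{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, max_features=2,
                               random_state=0).fit(X, y)

# Each fitted tree carries its own impurity-based importances;
# averaging them reproduces the forest-level attribute.
per_tree = np.array([t.feature_importances_ for t in model.estimators_])
averaged = per_tree.mean(axis=0)

# Sort from most to least important for easier reading.
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, imp in ranked:
    print(f"{name}\t{imp:.3f}")
```

Impurity-based importances tend to favor features with many distinct values, which may partly explain why continuous columns such as `Fare` and `Age` rank so high in the output above.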


In [22]:
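# NOTE: this is a copy of the training data, not a separate test file
titanic_test = df.copy()

# Impute mean Age for NA Age values (the training frame above used the median, 28)
new_age_var = np.where(titanic_test["Age"].isnull(),
                       titanic_test["Age"].mean(),
                       titanic_test["Age"])

titanic_test["Age"] = new_age_var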

# Fill missing Fare with 50
new_fare_var = np.where(titanic_test["Fare"].isnull(),
                       50,                      
                       titanic_test["Fare"])      

titanic_test["Fare"] = new_fare_var 

# Convert Sex to numeric, reusing the encoder already fitted on the training data
titanic_test["Sex"] = label_encoder.transform(titanic_test["Sex"])

In [23]:
# Make test set predictions
test_preds = rf_model.predict(X = titanic_test[features])

# Create a submission for Kaggle
submission = pd.DataFrame({"PassengerId":titanic_test["PassengerId"],
                           "Survived":test_preds})

In [26]:
from sklearn import metrics

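# titanic_test is a copy of the training data, so this is a training-set
# confusion matrix and it overstates accuracy compared with the OOB score
metrics.confusion_matrix(y_true=titanic_train.Survived, y_pred=test_preds)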

array([[543,   6],
       [ 35, 307]])
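The matrix above is computed on rows the model was trained on, which is why it looks better (850 of 891 correct) than the OOB accuracy of about 0.82. A minimal sketch of a fairer comparison holds out some rows before fitting. It assumes synthetic data and an arbitrary 75/25 split, so the numbers it prints say nothing about the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption; swap in real features and labels as needed).
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, max_features=2,
                               oob_score=True, random_state=0)
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
train_cm = confusion_matrix(y_true=y_train, y_pred=model.predict(X_train))
hold_cm = confusion_matrix(y_true=y_hold, y_pred=model.predict(X_hold))

train_acc = train_cm.trace() / train_cm.sum()
hold_acc = hold_cm.trace() / hold_cm.sum()

print(f"training accuracy: {train_acc:.3f}")
print(f"held-out accuracy: {hold_acc:.3f}")
print(f"OOB accuracy:      {model.oob_score_:.3f}")
```

The held-out and OOB figures are the ones to trust as estimates of performance on unseen data; training accuracy for fully grown trees is usually close to perfect regardless of how well the model generalizes.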