# Random Forest

This example shows how to use [scikit-learn](https://scikit-learn.org/stable/) to train a Random Forest ensemble model on the Titanic dataset. The data is preprocessed to increase the accuracy of the model. For a more detailed explanation of what a Random Forest is, see [Random Forest](../doc/random_forest.md).

## Imports

In [None]:
import os
import polars as pl
import seaborn as sns

from typing import Any

from matplotlib.axes import Axes
from numpy import ndarray
from polars import LazyFrame
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score
from seaborn import heatmap

## Access the Preprocessed Data

The data is preprocessed in the [Data Preprocessing](./data_preprocessing.ipynb) notebook.

In [None]:
train_Xs: LazyFrame = pl.scan_csv("../data/train_Xs.csv")
train_ys: LazyFrame = pl.scan_csv("../data/train_ys.csv")
test_Xs: LazyFrame = pl.scan_csv("../data/test_Xs.csv")

## Train the Random Forest

In [None]:
Xs: ndarray[Any, Any] = train_Xs.collect().to_numpy()
ys: ndarray[Any, Any] = train_ys.collect().to_numpy()

X_train, X_validate, y_train, y_validate = train_test_split(Xs, ys, test_size=0.2, random_state=73)
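
As a rough illustration of what this split produces, the sketch below runs `train_test_split` on a small synthetic array instead of the Titanic data. The `demo_*` names, the synthetic data, and the `stratify` argument are assumptions added for illustration and are not part of the notebook. With `test_size=0.2`, one fifth of the rows are held out for validation, and `stratify` keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and labels (hypothetical data).
rng = np.random.default_rng(73)
demo_Xs = rng.normal(size=(100, 4))
demo_ys = np.array([0] * 60 + [1] * 40)

# Same ratio and seed as the notebook; stratify is an optional extra used here.
demo_X_train, demo_X_validate, demo_y_train, demo_y_validate = train_test_split(
    demo_Xs, demo_ys, test_size=0.2, random_state=73, stratify=demo_ys
)

print(demo_X_train.shape, demo_X_validate.shape)
print(f"Positive share in validation set: {demo_y_validate.mean():.2f}")
```

Whether stratification helps on the real data depends on how unbalanced the `Survived` column is after preprocessing.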

#### Find the Best Parameters (Optional)

Use a grid search to find the best parameters for the Random Forest model. I found that the result is sub-optimal and that a simple `DecisionTreeClassifier(max_depth=3)` works best, as the grid-searched model seems to over-fit on the training and validation data.

Running the search remains an interesting exercise, even though the parameters it finds are not the best overall.

In [None]:
parameter_grid: dict[str, list[int]] = {
    "n_estimators": [500, 1000, 2000],
    "max_features": [4, 5, 6],
    "max_depth": [3, 4, 5],
    "min_samples_split": [2, 3],
    # "min_samples_leaf": [2, 5, 10],
    # "random_state": [37, 53, 73],
}

template_rfc = RandomForestClassifier()
grid_search = GridSearchCV(template_rfc, param_grid=parameter_grid, cv=10, scoring="accuracy", n_jobs=16)

# The flag must stay set while the parallel workers run, so remove it only after fit().
os.environ["POLARS_ALLOW_FORKING_THREAD"] = "1"
grid_search.fit(X_train, y_train.ravel())
del os.environ["POLARS_ALLOW_FORKING_THREAD"]

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")
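
To make the mechanics of the search easier to follow, here is a minimal sketch on synthetic data with a deliberately tiny grid, so that it runs in a few seconds. The `demo_*` names, the dataset, and the grid values are assumptions for illustration only. Setting `return_train_score=True` lets you compare training accuracy with cross-validated accuracy, which is one rough way to spot the kind of over-fitting mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (hypothetical stand-in for the Titanic features).
demo_X, demo_y = make_classification(n_samples=200, n_features=6, random_state=73)

# Tiny grid: 2 x 2 = 4 combinations, each scored with 3-fold cross-validation.
demo_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=73),
    param_grid=demo_grid,
    cv=3,
    scoring="accuracy",
    return_train_score=True,
)
demo_search.fit(demo_X, demo_y)

# A large gap between train and cross-validation accuracy hints at over-fitting.
demo_results = demo_search.cv_results_
for params, train_score, cv_score in zip(
    demo_results["params"],
    demo_results["mean_train_score"],
    demo_results["mean_test_score"],
):
    print(params, f"train={train_score:.3f}", f"cv={cv_score:.3f}")

print(f"Best parameters: {demo_search.best_params_}")
```

Because `refit=True` is the default, `demo_search.best_estimator_` is already fitted on all the data passed to `fit` and can be used for prediction directly.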

#### Fit the Model

In [None]:
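# GridSearchCV already refits the best estimator on the training data (refit=True by default),
# so the explicit fit below is optional and simply repeats that step.
best_rfc: RandomForestClassifier = grid_search.best_estimator_
rfc: RandomForestClassifier = best_rfc.fit(X_train, y_train.ravel())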

### Evaluate the Model

In [None]:
# Evaluate the model
y_pred: ndarray = rfc.predict(X_validate)
accuracy: float = accuracy_score(y_validate, y_pred)
precision: float = precision_score(y_validate, y_pred)
recall: float = recall_score(y_validate, y_pred)
f1: float = f1_score(y_validate, y_pred)

print(f"Accuracy: {100 * accuracy:.2f}%")
print(f"Precision: {100 * precision:.2f}%")
print(f"Recall: {100 * recall:.2f}%")
print(f"F1: {100 * f1:.2f}%")
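
For reference, the four scores can be derived from the counts of true and false positives and negatives. The sketch below recomputes them by hand on a made-up label list. The helper `binary_metrics` and the demo labels are hypothetical, and returning `0.0` when a denominator is zero is a choice made for this sketch.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
demo_true = [1, 0, 1, 1, 0, 0, 1, 0]
demo_pred = [1, 0, 0, 1, 0, 1, 1, 0]
demo_metrics = binary_metrics(demo_true, demo_pred)

for name, value in demo_metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Precision answers "of the passengers predicted to survive, how many did?", while recall answers "of the passengers who survived, how many were found?".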

### Plot the Confusion Matrix

In [None]:
# set_style applies the rc overrides globally; axes_style alone only returns them.
sns.set_style(rc={"xtick.top": True, "axes.spines.top": True})

confusion: ndarray = confusion_matrix(y_validate, y_pred)

plot: Axes = heatmap(
    confusion, annot=True, fmt="d", xticklabels=["Foundered", "Survived"], yticklabels=["Foundered", "Survived"]
)
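
When reading the heatmap, it helps to know how `confusion_matrix` lays out its result: rows are the true labels and columns are the predicted labels, both in sorted label order. The short sketch below shows this on made-up labels (the `demo_*` names and values are assumptions for illustration).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = did not survive, 1 = survived.
demo_true = [0, 0, 0, 1, 1, 1, 1]
demo_pred = [0, 1, 0, 1, 1, 0, 1]

demo_confusion = confusion_matrix(demo_true, demo_pred)

# For binary labels the flattened order is: TN, FP, FN, TP.
tn, fp, fn, tp = demo_confusion.ravel()

print(demo_confusion)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

In the plot above, this means the vertical axis shows the actual outcome and the horizontal axis shows the model's prediction.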

### Generate Prediction List

In [None]:
predictions: ndarray = rfc.predict(test_Xs.collect().to_numpy())
prediction_list = pl.DataFrame(
    {
        # Kaggle's Titanic test set covers PassengerId 892 through 1309 (418 passengers).
        "PassengerId": pl.Series(range(892, 1310)),
        "Survived": pl.Series(predictions),
    }
)
prediction_list.write_csv("../data/random_forest_predictions.csv")

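### Compare the Predictions with the Gender-Based Baseline

Note that `gender_submission.csv` is Kaggle's sample submission, which assumes that only female passengers survived. It is a baseline rather than the true labels, so the percentage below measures disagreement with that baseline, not the error rate of the model.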

In [None]:
source = pl.read_csv("../data/random_forest_predictions.csv")
target = pl.read_csv("../data/gender_submission.csv")

y_source = source["Survived"]
y_target = target["Survived"]

num_differences = (y_source != y_target).sum()
num_difference_percentage = (num_differences / len(y_source)) * 100
num_difference_percentage
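
The comparison above is element-wise by position, so it only makes sense when both files list the passengers in the same order and have the same length. A minimal sketch of the same calculation on plain Python lists, with an explicit length check, could look like this (the helper `disagreement_percentage` and the demo values are hypothetical):

```python
def disagreement_percentage(a: list[int], b: list[int]) -> float:
    """Return the share of positions where two label lists differ, in percent."""
    if len(a) != len(b):
        raise ValueError("Both label lists must have the same length.")
    if not a:
        raise ValueError("The label lists must not be empty.")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return 100 * differences / len(a)


# Made-up labels: the lists differ at 2 of 5 positions.
demo_source = [0, 1, 1, 0, 1]
demo_target = [0, 1, 0, 0, 0]
demo_percentage = disagreement_percentage(demo_source, demo_target)

print(f"Predictions differ on {demo_percentage:.1f}% of passengers")
```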