# Random Forest

Our second model is a Random Forest classifier. This notebook takes around 20 minutes to run, mainly because of the grid search.

### Importing Libraries

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, balanced_accuracy_score, make_scorer, classification_report
import pyprojroot
import pickle

We set a random seed for reproducibility.

In [2]:
RANDOM_STATE = 11

## 1. Data Preprocessing

In [5]:
DATA_PATH = pyprojroot.here().joinpath('data', 'fetal_health.csv')
df = pd.read_csv(DATA_PATH)

In [6]:
X = df[df.columns.difference(['fetal_health', 'fetal_health_label'])]
y = df['fetal_health']

In [7]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, stratify = y, random_state=RANDOM_STATE)
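The cells above fit the scaler on the full dataset before splitting, so statistics from the test rows influence the scaling. Tree-based models are largely insensitive to feature scaling, so the effect here is probably small, but one way to avoid it is to wrap the scaler and the model in a `Pipeline` and fit only on the training split. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the class weights are illustrative assumptions, not the fetal health dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 11

# Synthetic stand-in for the real data (assumption: 3 imbalanced classes)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3,
    weights=[0.78, 0.14, 0.08], random_state=RANDOM_STATE,
)

# Split first, so the test rows never reach the scaler during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=RANDOM_STATE
)

# The pipeline fits the scaler on the training data only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE)),
])
pipe.fit(X_tr, y_tr)

# At prediction time the stored training statistics are reused
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.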

## Defining our Metrics

In [24]:
metrics = {
    'accuracy': accuracy_score,
    'balanced_accuracy': balanced_accuracy_score,
    'precision': lambda y_true, y_pred: precision_score(y_true, y_pred, average='macro', zero_division=0),
    'f1_score': lambda y_true, y_pred: f1_score(y_true, y_pred, average='macro', zero_division=0),
}

for key in metrics.keys():
    metrics[key] = make_scorer(metrics[key])
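As a side note, `make_scorer` forwards extra keyword arguments to the metric function, so the lambdas above could be avoided. The sketch below shows this variant and checks that a scorer built this way returns the same value as calling the metric directly. The toy data and the decision tree are assumptions made only for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer, precision_score
from sklearn.tree import DecisionTreeClassifier

# Keyword arguments after the metric are passed on to it at scoring time
precision_macro = make_scorer(precision_score, average="macro", zero_division=0)
f1_macro = make_scorer(f1_score, average="macro", zero_division=0)

# Toy 3-class problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=11
)
clf = DecisionTreeClassifier(random_state=11).fit(X_demo, y_demo)

# A scorer is called as scorer(estimator, X, y) and predicts internally
score_from_scorer = f1_macro(clf, X_demo, y_demo)
score_direct = f1_score(y_demo, clf.predict(X_demo), average="macro", zero_division=0)
print(score_from_scorer == score_direct)
```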

## 2. Model Training

We tune our hyperparameters with GridSearchCV, which conducts an exhaustive search over specified parameter values of an estimator. It optimizes the specified metric, balanced accuracy in this case, with cross-validation.

In [48]:
# We define our parameters dictionary
params = { 
    'n_estimators': [25, 50, 100, 400], 
    'max_features': [None], 
    'max_depth': [80, 90, 100, 110, None],  
    'min_samples_leaf': [3, 4],
    'min_samples_split': [3, 6, 8, 10],
    'random_state': [RANDOM_STATE]
}
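To get a feeling for the cost of the search, we can count the parameter combinations in this grid before running it. The sketch below uses `ParameterGrid` from scikit-learn and assumes the 3-fold split used in the grid search cell.

```python
from sklearn.model_selection import ParameterGrid

RANDOM_STATE = 11

params = {
    'n_estimators': [25, 50, 100, 400],
    'max_features': [None],
    'max_depth': [80, 90, 100, 110, None],
    'min_samples_leaf': [3, 4],
    'min_samples_split': [3, 6, 8, 10],
    'random_state': [RANDOM_STATE],
}

# Number of candidates: 4 * 1 * 5 * 2 * 4 * 1 = 160
n_candidates = len(ParameterGrid(params))

# Each candidate is fitted once per fold (3 folds assumed here)
n_fits = n_candidates * 3

print(n_candidates, n_fits)  # 160 480
```

With the default `refit=True`, the grid search performs one additional fit of the best candidate on the whole training set.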

In [49]:
# Execute Grid Search
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=params,
    scoring=metrics["balanced_accuracy"],
    cv=KFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE),
)
grid_search.fit(X_train, y_train)

# Save results in dataframe for visualization
grid_df = pd.DataFrame(grid_search.cv_results_)

In [50]:
# Each row of the results corresponds to one parameter combination tried by the grid search.
# Here we keep only the columns that hold the parameters of that combination
# and its mean test score.
sum_cols = [f"param_{param}" for param in list(params.keys())]
sum_cols.append('mean_test_score')
grid_df[sum_cols].sort_values(by="mean_test_score", ascending=False).head(30)

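| | param_n_estimators | param_max_features | param_max_depth | param_min_samples_leaf | param_min_samples_split | param_random_state | mean_test_score |
|---|---|---|---|---|---|---|---|
| 102 | 100 | None | 110.0 | 3 | 6 | 11 | 0.88925 |
| 130 | 100 | None | None | 3 | 3 | 11 | 0.88925 |
| 66 | 100 | None | 100.0 | 3 | 3 | 11 | 0.88925 |
| 134 | 100 | None | None | 3 | 6 | 11 | 0.88925 |
| 98 | 100 | None | 110.0 | 3 | 3 | 11 | 0.88925 |
| 34 | 100 | None | 90.0 | 3 | 3 | 11 | 0.88925 |
| 38 | 100 | None | 90.0 | 3 | 6 | 11 | 0.88925 |
| 70 | 100 | None | 100.0 | 3 | 6 | 11 | 0.88925 |
| 6 | 100 | None | 80.0 | 3 | 6 | 11 | 0.88925 |
| 2 | 100 | None | 80.0 | 3 | 3 | 11 | 0.88925 |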


Next, we take the best estimator found by the grid search and evaluate it more rigorously over a larger set of metrics.

In [52]:
forest_clf = grid_search.best_estimator_
forest_clf

## 3. Evaluation

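We use 10-fold cross-validation so that the reported metrics do not depend on a single random data partition. Note that `cross_validate` clones the estimator and refits it on each training fold, so the scores below come from models with the tuned hyperparameters trained on folds of the held-out test set.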

In [53]:
cv_result = cross_validate(
    forest_clf,
    X_test,
    y_test,
    scoring=metrics,
    cv=KFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE),
    return_estimator=True,
)
cv_result_df = pd.DataFrame(cv_result)

In [54]:
cv_result_df = cv_result_df.drop('estimator', axis=1)
cv_result_df.mean()

fit_time                  0.335880
score_time                0.018012
test_accuracy             0.882669
test_balanced_accuracy    0.718183
test_precision            0.844558
test_f1_score             0.751041
dtype: float64

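Averaged over the 10 folds, the Random Forest reaches an accuracy of about 0.88 but a balanced accuracy of only about 0.72. Since balanced accuracy is the mean of the per-class recalls, the gap suggests that the minority classes are recognized less reliably than the overall accuracy implies.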

In [63]:
y_pred = grid_search.predict(X_test)

In [64]:
labels = [1, 2, 3]
target_names = ['Normal', 'Suspect', 'Pathological']

In [65]:
print(classification_report(y_test, y_pred, labels=labels, target_names=target_names, zero_division=0))

              precision    recall  f1-score   support

      Normal       0.95      0.95      0.95       332
     Suspect       0.72      0.73      0.72        59
Pathological       0.94      0.86      0.90        35

    accuracy                           0.91       426
   macro avg       0.87      0.85      0.86       426
weighted avg       0.91      0.91      0.91       426



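The report breaks the metrics down per class. Normal and Pathological are predicted well (F1 of 0.95 and 0.90), while Suspect is the hardest class, with an F1 of 0.72.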
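The macro-average recall in the report is the same quantity as balanced accuracy, which scikit-learn defines as the mean of the per-class recalls. The small example below checks this on hand-made labels; `y_true_demo` and `y_pred_demo` are invented for the illustration and are not taken from the model above.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Invented labels: class 1 is the majority, classes 2 and 3 are rare
y_true_demo = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred_demo = [1, 1, 1, 1, 1, 2, 2, 1, 3, 3]

# Per-class recall: class 1 -> 5/6, class 2 -> 1/2, class 3 -> 2/2
per_class_recall = recall_score(y_true_demo, y_pred_demo, average=None)

# Balanced accuracy is the unweighted mean of those recalls (about 0.778)
balanced = balanced_accuracy_score(y_true_demo, y_pred_demo)
macro_recall = recall_score(y_true_demo, y_pred_demo, average="macro")

# Plain accuracy is 8/10 = 0.8 because the majority class dominates
accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(per_class_recall, balanced, accuracy)
```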

## 4. Saving the Model

We're using the _pickle_ library for saving our models.

In [66]:
MODEL_PATH = pyprojroot.here().joinpath('models', 'forest_clf.pkl')

# Save the model
with open(MODEL_PATH,'wb') as f:
    pickle.dump(grid_search.best_estimator_, f)
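Loading the model back works the same way with `pickle.load`. The sketch below performs a save and load round trip on a small forest trained on synthetic data and writes to a temporary directory, so the file name and data are assumptions for the illustration only. Keep in mind that the estimator saved above was trained on standardized features, so new inputs would need the same scaling before prediction.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model (illustrative, not the tuned forest from this notebook)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=11
)
model = RandomForestClassifier(n_estimators=20, random_state=11).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "forest_clf.pkl"

    # Save the model
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.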