Introduction to Model Evaluation

When building machine learning models, it’s essential to assess how well your model generalizes to unseen data.

Cross-Validation Explained

Cross-validation is a technique that splits the data into multiple subsets.

The model trains on some of these subsets and validates on the remaining one, rotating so that each subset is used for validation exactly once.
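To make the rotation concrete, here is a minimal sketch of the loop that cross_val_score runs for you. It uses a small synthetic dataset and a LinearRegression model, both assumptions chosen only to keep the example fast; they are not the data or model used in the rest of this notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset (an assumption for illustration, not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()
kfold = KFold(n_splits=5)  # 5 contiguous folds, no shuffling

manual_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_model.predict(X_demo[val_idx])
    manual_scores.append(r2_score(y_demo[val_idx], preds))

manual_scores = np.array(manual_scores)

# The helper should give the same per-fold scores as the manual loop
helper_scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring='r2')

print("Manual fold scores:", manual_scores)
print("cross_val_score   :", helper_scores)
```

Each fold serves as the validation set once, so five folds produce five scores. Their mean and spread give a steadier estimate than a single train/test split.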

In [14]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the dataset
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# Define a pipeline for scaling and the model
pipeline = make_pipeline(StandardScaler(),
                         RandomForestRegressor(random_state=42))

# Perform 5-fold cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')

print("Cross validation R2 scores:", cv_scores)
print("Mean R2 score :", cv_scores.mean())

Cross validation R2 scores: [0.51454242 0.70386991 0.74208135 0.63632938 0.68265475]
Mean R2 score : 0.6558955642815314


Hyperparameter Tuning Introduction

Hyperparameter tuning helps us find the best hyperparameters for our model, that is, settings chosen before training rather than learned from the data. Examples include the number of trees in a Random Forest or the maximum depth of each tree.
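When the model sits inside a pipeline, its hyperparameters are addressed as the step name, two underscores, then the parameter name. The sketch below builds a throwaway pipeline (named demo_pipeline here only to avoid clashing with other cells) and lists the names a search could tune.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=42))

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)

# Nested hyperparameters are addressed as <step name>__<parameter name>
forest_params = sorted(
    key for key in demo_pipeline.get_params()
    if key.startswith('randomforestregressor__')
)
print(forest_params)

# set_params changes a hyperparameter without rebuilding the pipeline
demo_pipeline.set_params(randomforestregressor__n_estimators=200)
```

These are the key names a search object expects when it tunes a pipeline rather than a bare estimator.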

GridSearchCV

GridSearchCV performs an exhaustive search over all parameter combinations. It’s powerful, but the number of model fits equals the number of combinations multiplied by the number of folds, so it can be slow with large datasets or many parameters.
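To estimate the cost before launching a search, you can count the combinations with ParameterGrid. The grid below, with three values for each of three parameters, is an illustrative assumption, as is the fold count of 3.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: three candidate values for each of three hyperparameters
demo_grid = {
    'randomforestregressor__n_estimators': [50, 100, 200],
    'randomforestregressor__max_depth': [None, 10, 20],
    'randomforestregressor__min_samples_split': [2, 5, 10],
}

n_candidates = len(ParameterGrid(demo_grid))  # 3 * 3 * 3 combinations
n_folds = 3
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Adding a fourth parameter with three values would triple the count again. With the default refit=True, GridSearchCV also fits the best candidate once more on the full dataset.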

In [15]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'randomforestregressor__n_estimators': [50, 100, 200],
    'randomforestregressor__max_depth': [None, 10, 20],
    'randomforestregressor__min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=3,
    scoring='r2',
    n_jobs=-1,  # use all CPU cores
    verbose=3   # show detailed progress
)

grid_search.fit(X, y)
print("Best parameters:", grid_search.best_params_)
print("Best R2 score :", grid_search.best_score_)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV 3/3] END randomforestregressor__max_depth=None, randomforestregressor__min_samples_split=2, randomforestregressor__n_estimators=50;, score=0.631 total time=  15.0s
[CV 2/3] END randomforestregressor__max_depth=None, randomforestregressor__min_samples_split=2, randomforestregressor__n_estimators=50;, score=0.720 total time=  15.1s
[CV 1/3] END randomforestregressor__max_depth=None, randomforestregressor__min_samples_split=2, randomforestregressor__n_estimators=50;, score=0.622 total time=  15.8s
[CV 1/3] END randomforestregressor__max_depth=None, randomforestregressor__min_samples_split=2, randomforestregressor__n_estimators=100;, score=0.620 total time=  30.7s
[CV 3/3] END randomforestregressor__max_depth=None, randomforestregressor__min_samples_split=2, randomforestregressor__n_estimators=100;, score=0.631 total time=  27.9s
[CV 2/3] END randomforestregressor__max_depth=None, randomforestregressor__min_samples_split=2, r

RandomizedSearchCV Explained

If GridSearchCV feels slow, RandomizedSearchCV is a faster alternative.

It samples a fixed number of parameter combinations from the grid, reducing computation time.
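The sampling step can be previewed on its own with ParameterSampler, without fitting any model. The search space and the choice of 10 draws below are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Illustrative search space: 3 * 3 * 3 = 27 possible combinations
demo_space = {
    'randomforestregressor__n_estimators': [50, 100, 200],
    'randomforestregressor__max_depth': [None, 10, 20],
    'randomforestregressor__min_samples_split': [2, 5, 10],
}

# Draw 10 combinations; random_state makes the draw reproducible
sampled = list(ParameterSampler(demo_space, n_iter=10, random_state=42))

print(len(sampled), "of", len(ParameterGrid(demo_space)), "combinations sampled")
for params in sampled[:3]:
    print(params)
```

Because every parameter here is given as a list, the combinations are drawn without replacement, so no candidate is evaluated twice. The trade-off is that the best combination in the full grid may not be among the ones sampled.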

In [16]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define the hyperparameter search space (here the same values as the grid above)
param_dist = {
    'randomforestregressor__n_estimators': [50, 100, 200],
    'randomforestregressor__max_depth': [None, 10, 20],
    'randomforestregressor__min_samples_split': [2, 5, 10]
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=10,  # number of parameter combinations to sample
    cv=3,
    scoring='r2',
    n_jobs=-1,
    verbose=3,
    random_state=42
)

random_search.fit(X, y)

print("Best parameters:", random_search.best_params_)
print("Best R2 score :", random_search.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END randomforestregressor__max_depth=10, randomforestregressor__min_samples_split=5, randomforestregressor__n_estimators=100;, score=0.618 total time=  20.8s
[CV 2/3] END randomforestregressor__max_depth=10, randomforestregressor__min_samples_split=5, randomforestregressor__n_estimators=100;, score=0.717 total time=  16.9s
[CV 3/3] END randomforestregressor__max_depth=None, randomforestregressor__min_samples_split=10, randomforestregressor__n_estimators=200;, score=0.628 total time=  51.1s
[CV 2/3] END randomforestregressor__max_depth=None, randomforestregressor__min_samples_split=10, randomforestregressor__n_estimators=200;, score=0.725 total time=  51.3s
[CV 1/3] END randomforestregressor__max_depth=None, randomforestregressor__min_samples_split=10, randomforestregressor__n_estimators=200;, score=0.617 total time=  51.9s
[CV 3/3] END randomforestregressor__max_depth=10, randomforestregressor__min_samples_split=5, r