# Hyperparameter Optimization

Our previous model, a `RandomForestRegressor`, achieved an RMSE of $47,328. Let's tune its hyperparameters to try to reduce this error further.

## Preliminary Steps

### Importing the Preprocessing Pipeline

The preprocessing pipeline developed in previous notebooks is imported from the shared module [`utils/housing_preprocessing.py`](utils/housing_preprocessing.py). We use a low default value for `n_clusters` since we will be tuning this hyperparameter.

In [None]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

from utils.housing_preprocessing import get_preprocessing_pipeline

preprocessing = get_preprocessing_pipeline(n_clusters=10)  # Low default, will be tuned

In [None]:
full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])

### Data Loading

The data loading with stratified train/test split is imported from [`utils/load_california.py`](utils/load_california.py).

In [None]:
from utils.load_california import load_housing_data
X_train, X_test, y_train, y_test = load_housing_data()

## Viewing Hyperparameters

In [None]:
full_pipeline

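To list the names of the hyperparameters that can be tuned, call `get_params()` on the pipeline. Nested parameters are addressed as `<step name>__<parameter name>`, with a double underscore between levels: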

In [None]:
for param in sorted(full_pipeline.get_params().keys()):
    print(param)
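A minimal sketch of how these names map onto pipeline steps. The two-step `toy_pipeline` below is hypothetical and unrelated to the housing pipeline; it only illustrates that a name such as `ridge__alpha` points at the `alpha` parameter of the step called `ridge`.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline, used only to illustrate parameter naming
toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])

# Nested parameters are addressed as "<step name>__<parameter name>"
toy_pipeline.set_params(ridge__alpha=0.5, scaler__with_mean=False)

params = toy_pipeline.get_params()
print(params["ridge__alpha"])       # value now stored on the Ridge step
print(params["scaler__with_mean"])  # value now stored on the scaler step
```

The same convention chains across levels, which is why the housing pipeline exposes names like `preprocessing__geo__n_clusters`.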

## Grid Search

Rather than manually adjusting a model's hyperparameters until we find values that work well, we can list the values we want to try for each hyperparameter and let `GridSearchCV` evaluate every combination using cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        # Number of clusters for the geo transformer
        'preprocessing__geo__n_clusters': [5, 8, 10],
        # Number of features to consider when looking for the best split
        'random_forest__max_features': [4, 6, 8],
    },
    {
        'preprocessing__geo__n_clusters': [10, 15],
        'random_forest__max_features': [6, 8, 10],
    },
]
grid_search = GridSearchCV(
    estimator=full_pipeline,
    param_grid=param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
)

_ = grid_search.fit(X_train, y_train)

The `param_grid` parameter is a list of dictionaries, each containing the hyperparameter values we want to test. The first dictionary tests 3 values for the number of clusters and 3 for the number of features considered at each split; the second tests 2 values for the number of clusters and 3 for the number of features. In total, we test 3×3 + 2×3 = 15 hyperparameter combinations. Combinations are not deduplicated across dictionaries, so (10, 6) and (10, 8), which appear in both, are evaluated twice. With `cv=3`, each combination is trained 3 times, for 45 model fits in total.
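The count above can be checked with a short sketch that expands the grid by hand using only the standard library. This is an illustration of the arithmetic, not how `GridSearchCV` is implemented; scikit-learn's `ParameterGrid` performs an equivalent expansion.

```python
from itertools import product

param_grid = [
    {'preprocessing__geo__n_clusters': [5, 8, 10],
     'random_forest__max_features': [4, 6, 8]},
    {'preprocessing__geo__n_clusters': [10, 15],
     'random_forest__max_features': [6, 8, 10]},
]

# Expand each dictionary into the Cartesian product of its value lists
combinations = []
for grid in param_grid:
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        combinations.append(tuple(zip(names, values)))

n_folds = 3  # matches cv=3 in the search above
print(len(combinations))            # 15 candidates evaluated
print(len(set(combinations)))       # 13 distinct candidates
print(len(combinations) * n_folds)  # 45 model fits during cross-validation
```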

Additionally, the `n_jobs` parameter parallelizes the search by setting the number of jobs to run concurrently; a value of -1 uses all available processors. `RandomForestRegressor` accepts the same parameter to parallelize tree construction, but be careful when combining the two: each search job fits a model, and if that model also spawns its own workers, the number of concurrent workers multiplies. As a rule of thumb, the `n_jobs` of `RandomForestRegressor` multiplied by the `n_jobs` of the search should not exceed the number of physical cores on the machine; otherwise the workers compete for CPU time and the search can slow down. In general, it's better to parallelize the hyperparameter search rather than tree construction. In this case, we therefore leave `RandomForestRegressor` at its default `n_jobs=None`, which means a single job (the trees are built sequentially), and use all processors for `GridSearchCV` with `n_jobs=-1`.
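The rule of thumb can be expressed as a small budget calculation. The helper `max_inner_jobs` below is hypothetical, not part of scikit-learn, and note that `os.cpu_count()` reports logical cores, which may exceed the number of physical cores.

```python
import os

def max_inner_jobs(outer_jobs, total_cores):
    """Largest inner n_jobs keeping outer_jobs * inner_jobs within total_cores (at least 1)."""
    return max(1, total_cores // outer_jobs)

# os.cpu_count() reports logical cores, which can exceed physical cores
total_cores = os.cpu_count() or 1

# When the search already uses every core, the forest is left with a single job
print(max_inner_jobs(outer_jobs=total_cores, total_cores=total_cores))  # 1

# With the search limited to 2 jobs on a hypothetical 8-core machine
print(max_inner_jobs(outer_jobs=2, total_cores=8))  # 4
```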

We can now see the best hyperparameters found.

In [None]:
grid_search.best_params_

We can see that the best model has 15 clusters. Since this is the highest value tested, the optimum may lie beyond the grid, so it would make sense to run new tests with larger values.
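One way such a follow-up search could be set up is sketched below. The cluster values beyond 15 are illustrative guesses, not values taken from this notebook, and the `max_features` values simply reuse those from the second dictionary above.

```python
# Hypothetical follow-up grid: the cluster values beyond 15 are illustrative only
follow_up_grid = {
    'preprocessing__geo__n_clusters': [15, 20, 25, 30],
    'random_forest__max_features': [6, 8, 10],
}

# Number of candidates is the product of the list lengths
n_candidates = 1
for values in follow_up_grid.values():
    n_candidates *= len(values)
print(n_candidates)  # 12
```

The dictionary would be passed as `param_grid=follow_up_grid` to a new `GridSearchCV`.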

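The search also exposes the best estimator found. With the default `refit=True`, it has been retrained on the full training set using the best hyperparameters: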

In [None]:
grid_search.best_estimator_

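We can also see the result of each hyperparameter combination tested during the search. Because scikit-learn scorers follow a higher-is-better convention, the scores are negated RMSE values, so we flip the sign before displaying them: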

In [None]:
cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)

# Select the columns we want to display
cv_res = cv_res[["param_preprocessing__geo__n_clusters",
                 "param_random_forest__max_features", "split0_test_score",
                 "split1_test_score", "split2_test_score", "mean_test_score"]]

# Rename columns for simplicity
score_cols = ["split0", "split1", "split2", "mean_test_rmse"]
cv_res.columns = ["n_clusters", "max_features"] + score_cols
# Clean up the score metric
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)
cv_res.head()

## Randomized Search

Instead of testing all possible hyperparameter combinations, RandomizedSearchCV allows testing a specified number of random combinations. This provides certain advantages:

- Computational efficiency: The "curse of dimensionality" makes Grid Search computationally infeasible very quickly. With more than 3 or 4 hyperparameters with a few options each, the total number of combinations to test explodes. Randomized Search allows setting a computational budget (number of iterations) independent of the number of hyperparameters, making it feasible for complex problems.

- Effectiveness in high dimensions: For many objective functions (such as model performance), only a few hyperparameters have a significant impact. By randomly sampling combinations, there's a higher probability of testing diverse values in the important dimensions, while Grid Search wastes much effort systematically testing values in dimensions that barely affect the result.

- Handling continuous parameters: Randomized Search naturally handles continuous parameters by sampling from a distribution (e.g., uniform, log-uniform). Grid Search requires discretizing the range, which is artificial and can easily miss the actual optimal value if it falls between grid points.

For these reasons, RandomizedSearchCV is generally more efficient than GridSearchCV when there are many hyperparameters to tune or when we don't have a clear idea of the range of values each one should take.
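The second advantage can be illustrated with a toy comparison. Assume two hypothetical hyperparameters in the range [0, 1] and a budget of 9 candidates: a 3×3 grid only ever tries 3 different values of each hyperparameter, while random sampling tries 9.

```python
import random

rng = random.Random(42)
budget = 9  # same number of candidates for both strategies

# Grid search: a 3 x 3 grid over two hypothetical hyperparameters in [0, 1]
grid_axis = [0.0, 0.5, 1.0]
grid_points = [(a, b) for a in grid_axis for b in grid_axis]

# Randomized search: each candidate draws both values independently
random_points = [(rng.random(), rng.random()) for _ in range(budget)]

def distinct_values(points, dim):
    """Count how many different values were tried along one dimension."""
    return len({point[dim] for point in points})

print(distinct_values(grid_points, 0))    # 3
print(distinct_values(random_points, 0))  # 9
```

If only the first hyperparameter matters, the grid has effectively run 3 experiments and the random search 9, for the same cost.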

In [None]:
# Values that randint(low=3, high=50) can draw: 3 to 49 (high is exclusive)
list(range(3, 50))

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'preprocessing__geo__n_clusters': randint(low=3, high=50),
    'random_forest__max_features': randint(low=2, high=20),
}

rnd_search = RandomizedSearchCV(
    full_pipeline,
    param_distributions=param_distribs,
    n_iter=10,  # number of random combinations to try
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=42,
    n_jobs=-1,
)

_ = rnd_search.fit(X_train, y_train)

`scipy.stats.randint()` returns a frozen discrete uniform distribution over the integers from `low` (inclusive) to `high` (exclusive). RandomizedSearchCV calls the distribution's `rvs()` method to sample hyperparameter values at random.
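A short sketch of how this distribution behaves when sampled directly. The sample size of 1000 is arbitrary and chosen only so that the observed minimum and maximum are informative.

```python
from scipy.stats import randint

n_clusters_dist = randint(low=3, high=50)

# Draw reproducible samples; `high` is exclusive, so values lie in 3..49
samples = n_clusters_dist.rvs(size=1000, random_state=42)
print(samples.min(), samples.max())

# Every integer in the range has the same probability, 1 / (50 - 3);
# the upper bound itself has probability 0
print(n_clusters_dist.pmf(3), n_clusters_dist.pmf(50))
```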

The `n_iter` parameter is the number of random combinations to sample. In this case, we test 10 hyperparameter combinations, each evaluated with 3-fold cross-validation, for 30 model fits in total.
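To preview the kind of candidates a randomized search draws without fitting any model, one option is scikit-learn's `ParameterSampler`. This is a sketch: it redefines `param_distribs` so that it runs on its own, and it is meant to show the sampling step only.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_distribs = {
    'preprocessing__geo__n_clusters': randint(low=3, high=50),
    'random_forest__max_features': randint(low=2, high=20),
}

# Draw 10 candidates from the same distributions used by the search
candidates = list(ParameterSampler(param_distribs, n_iter=10, random_state=42))
for candidate in candidates[:3]:
    print(candidate)

n_folds = 3  # matches cv=3
print(len(candidates) * n_folds)  # 30 model fits during cross-validation
```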

In [None]:
cv_res = pd.DataFrame(rnd_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_res = cv_res[["param_preprocessing__geo__n_clusters",
                 "param_random_forest__max_features", "split0_test_score",
                 "split1_test_score", "split2_test_score", "mean_test_score"]]
score_cols = ["split0", "split1", "split2", "mean_test_rmse"]
cv_res.columns = ["n_clusters", "max_features"] + score_cols
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)
cv_res.head()

We've managed to improve our model: the best candidate uses 45 clusters and considers 9 features at each split, reducing the cross-validated RMSE to $42,560.

## Evaluating the Final Model on the Test Set

In [None]:
from sklearn.metrics import root_mean_squared_error

final_predictions = rnd_search.best_estimator_.predict(X_test)

final_rmse = root_mean_squared_error(y_test, final_predictions)
print(final_rmse)
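`root_mean_squared_error` was added in scikit-learn 1.4. On an older installation, the same metric can be computed as the square root of the mean squared error. The sketch below uses tiny made-up arrays rather than the housing data, purely to show the calculation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Tiny hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0])
y_pred = np.array([1.0, 5.0])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # sqrt((4 + 0) / 2) = sqrt(2)
```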