In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ipynb.fs.full.project_part_1 import strat_train_set, strat_test_set
from ipynb.fs.full.project_part_2 import housing
from ipynb.fs.full.project_part_3 import housing_prepared, housing_labels, full_pipeline, cat_attribs, num_attribs
from sklearn.metrics import mean_squared_error

# Fine Tune Your Model

Let’s assume that you now have a shortlist of promising models. You now need to fine-tune them. Let’s look at a few ways you can do that.

## Grid Search

One option would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.

Instead, you should get Scikit-Learn’s **`GridSearchCV`** to search for you. All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will use cross-validation to evaluate all the possible combinations of hyperparameter values. For example, the following code searches for the best combination of hyperparameter values for the **`RandomForestRegressor`**:

In [2]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(housing_prepared, housing_labels)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jo

This **`param_grid`** tells Scikit-Learn to first evaluate all $3 × 4 = 12$ combinations of **`n_estimators`** and **`max_features`** hyperparameter values specified in the first **`dict`**, then try all $2 × 3 = 6$ combinations of hyperparameter values in the second dict, but this time with the **`bootstrap`** hyperparameter set to **`False`** instead of **`True`** (which is the default value for this hyperparameter).
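To check the counting, you can expand the grid by hand. The following is a minimal sketch using only the standard library; `expand_grid` is a hypothetical helper written for this illustration, not part of Scikit-Learn (which offers `sklearn.model_selection.ParameterGrid` for the same purpose).

```python
from itertools import product

# Same grid as in the cell above.
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

def expand_grid(grid):
    """Yield one dict per hyperparameter combination (hypothetical helper)."""
    for sub_grid in grid:
        names = sorted(sub_grid)
        # Cartesian product of the value lists of this sub-grid.
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))

combinations = list(expand_grid(param_grid))
per_dict = [len(list(expand_grid([sub_grid]))) for sub_grid in param_grid]

print(per_dict)           # [12, 6]
print(len(combinations))  # 18
```

Each sub-grid is expanded independently, which is why the second dict never combines `bootstrap=False` with the values listed in the first dict.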

The grid search will explore $12 + 6 = 18$ combinations of **`RandomForestRegressor`** hyperparameter values, and it will train each model 5 times (since we are using **five-fold cross-validation**). In other words, all in all, there will be $18 × 5 = 90$ rounds of training! It may take quite a long time, but when it is done you can get the best combination of parameters like this:

In [3]:
grid_search.best_params_

{'max_features': 6, 'n_estimators': 30}

Since 30 is the maximum value of **`n_estimators`** that was evaluated, you should probably try searching again with higher values; the score may continue to improve.

You can also get the best estimator directly:

In [4]:
grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features=6, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=30, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

If **`GridSearchCV`** is initialized with **`refit=True`** (which is the default), then once it finds the best estimator using cross-validation, it retrains it on the whole training set. This is usually a good idea, since feeding it more data will likely improve its performance.

And of course the evaluation scores are also available:

In [5]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)

64209.54153821484 {'max_features': 2, 'n_estimators': 3}
55177.36964723384 {'max_features': 2, 'n_estimators': 10}
52486.73148832081 {'max_features': 2, 'n_estimators': 30}
59457.507679278584 {'max_features': 4, 'n_estimators': 3}
52786.020287589454 {'max_features': 4, 'n_estimators': 10}
50472.29212297196 {'max_features': 4, 'n_estimators': 30}
59182.43398848768 {'max_features': 6, 'n_estimators': 3}
51982.072567783594 {'max_features': 6, 'n_estimators': 10}
50034.32002521204 {'max_features': 6, 'n_estimators': 30}
59440.43024002387 {'max_features': 8, 'n_estimators': 3}
52331.76280751217 {'max_features': 8, 'n_estimators': 10}
50318.9512193438 {'max_features': 8, 'n_estimators': 30}
62118.48449041704 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54169.294811143795 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59999.955656212034 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52504.53747311246 {'bootstrap': False, 'max_features': 3, 'n_estimators'

In this example, we obtain the best solution by setting the **`max_features`** hyperparameter to $6$ and the **`n_estimators`** hyperparameter to $30$. The RMSE score for this combination is about $50,034$, which is slightly better than the score you got earlier using the default hyperparameter values (which was $50,182$). Because no **`random_state`** is set, the exact numbers will vary from run to run. Congratulations, you have successfully fine-tuned your best model!
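The `np.sqrt(-mean_score)` step in the loop above may look odd at first. With `scoring='neg_mean_squared_error'`, the scores are negated MSE values, so that a greater score always means a better model. The sketch below illustrates the conversion with made-up scores; the numbers are hypothetical and do not come from the grid search above.

```python
import math

# Hypothetical mean_test_score values, as produced with
# scoring='neg_mean_squared_error' (negated MSE, so higher is better).
neg_mse_scores = [-4.0e9, -2.5e9, -3.6e9]

# The best score is the least negative one, i.e. the smallest MSE.
best_score = max(neg_mse_scores)

# Flip the sign to recover the MSE, then take the square root for the RMSE.
rmse_values = [math.sqrt(-score) for score in neg_mse_scores]
best_rmse = math.sqrt(-best_score)

print(best_rmse)  # 50000.0
```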

## Analyze the Best Models and Their Errors

You will often gain good insights into the problem by inspecting the best models. For example, the **`RandomForestRegressor`** can indicate the relative importance of each attribute for making accurate predictions:

In [7]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

array([8.22347587e-02, 7.68471270e-02, 4.17196690e-02, 1.72223868e-02,
       1.73319267e-02, 1.61726339e-02, 1.67440689e-02, 3.36487642e-01,
       6.33676269e-02, 1.08598532e-01, 5.94544626e-02, 1.29574267e-02,
       1.42813317e-01, 1.17878662e-04, 2.55370290e-03, 5.37683986e-03])

Let’s display these importance scores next to their corresponding attribute names:

In [12]:
extra_attribs = ['rooms_per_hhold', 'pop_per_hhold', 'bedrooms_per_room']
cat_encoder = full_pipeline.named_transformers_['cat']
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

[(0.33648764224919064, 'median_income'),
 (0.14281331699728692, 'INLAND'),
 (0.10859853200493033, 'pop_per_hhold'),
 (0.08223475868455118, 'longitude'),
 (0.07684712701741128, 'latitude'),
 (0.06336762692455762, 'rooms_per_hhold'),
 (0.05945446264413106, 'bedrooms_per_room'),
 (0.04171966899932164, 'housing_median_age'),
 (0.01733192669444712, 'total_bedrooms'),
 (0.01722238684188461, 'total_rooms'),
 (0.016744068858809205, 'households'),
 (0.016172633943864183, 'population'),
 (0.01295742671474508, '<1H OCEAN'),
 (0.005376839858201243, 'NEAR OCEAN'),
 (0.0025537029043222596, 'NEAR BAY'),
 (0.00011787866234574327, 'ISLAND')]

With this information, you may want to try dropping some of the less useful features (e.g., apparently only one **`ocean_proximity`** category is really useful, so you could try dropping the others).
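One way to act on this is to keep only the $k$ most important columns. The following is a rough sketch, not a definitive implementation: the importances are the values from the output above rounded to four decimals, and `top_k_indices` is a hypothetical helper introduced for this illustration.

```python
# Attribute names in column order, with importances rounded from the output above.
attributes = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
              'total_bedrooms', 'population', 'households', 'median_income',
              'rooms_per_hhold', 'pop_per_hhold', 'bedrooms_per_room',
              '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
importances = [0.0822, 0.0768, 0.0417, 0.0172, 0.0173, 0.0162, 0.0167, 0.3365,
               0.0634, 0.1086, 0.0595, 0.0130, 0.1428, 0.0001, 0.0026, 0.0054]

def top_k_indices(importances, k):
    """Return the column indices of the k largest importances, in column order."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])

keep = top_k_indices(importances, k=5)
kept_names = [attributes[i] for i in keep]

print(keep)        # [0, 1, 7, 9, 12]
print(kept_names)  # ['longitude', 'latitude', 'median_income', 'pop_per_hhold', 'INLAND']
```

Assuming `housing_prepared` is a dense NumPy array, `housing_prepared[:, keep]` would then select those columns. Whether a reduced feature set actually helps is something to verify with cross-validation.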

You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem (adding extra features or getting rid of uninformative ones, cleaning up outliers, etc.).
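A simple starting point is to rank the instances by absolute error and look at the worst ones. The sketch below uses made-up labels and predictions (the values are hypothetical, chosen only for illustration); in the notebook you would use your model's predictions and the corresponding labels instead.

```python
# Hypothetical labels and predictions (in dollars), for illustration only.
y_true = [452600.0, 358500.0, 120000.0, 500001.0, 87500.0]
y_pred = [410000.0, 365000.0, 180000.0, 390000.0, 90000.0]

# Signed error: positive means the model overestimated the price.
errors = [pred - true for pred, true in zip(y_pred, y_true)]

# Indices of the instances, ordered from largest to smallest absolute error.
worst = sorted(range(len(errors)), key=lambda i: abs(errors[i]), reverse=True)

for i in worst[:3]:
    print(i, y_true[i], y_pred[i], errors[i])
```

Once you know which instances are worst, you can go back to their original rows and look for a common pattern, such as a shared category or an extreme feature value.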

## Evaluate Your System on the Test Set

After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set. There is nothing special about this process; just get the predictors and the labels from your test set, run your **`full_pipeline`** to transform the data (call **`transform()`**, not **`fit_transform()`**—you do not want to fit the test set!), and evaluate the final model on the test set:

In [13]:
final_model = grid_search.best_estimator_

In [21]:
X_test = strat_test_set.drop('median_house_value', axis=1)
y_test = strat_test_set['median_house_value'].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

47762.314603153536
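To see why the test set goes through `transform()` rather than `fit_transform()`, consider a toy scaler. This is only a sketch: `TinyStandardScaler` is a hypothetical stand-in for a fitted preprocessing step, not the actual **`full_pipeline`**, and the data is made up.

```python
import statistics

class TinyStandardScaler:
    """Hypothetical stand-in for one fitted preprocessing step."""

    def fit(self, values):
        # Learn the statistics from the data passed in.
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Reuse the statistics learned in fit(); nothing is re-estimated.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 12.0]

scaler = TinyStandardScaler()
scaler.fit_transform(train)            # statistics come from the training data
test_scaled = scaler.transform(test)   # test data scaled with training statistics

# Refitting on the test data hides the fact that it lies far from the training range.
refit_on_test = TinyStandardScaler().fit_transform(test)

print(test_scaled)    # roughly [4.95, 6.36]
print(refit_on_test)  # [-1.0, 1.0]
```

With `transform()`, the test values keep their position relative to the training distribution, which is what the model saw during training. Refitting would silently re-center them and give the model inputs on a different scale.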

In some cases, such a point estimate of the generalization error will not be quite enough to convince you to launch: what if it is just 0.1% better than the model currently in production? You might want to have an idea of how precise this estimate is. For this, you can compute a 95% confidence interval for the generalization error using **`scipy.stats.t.interval()`**:

In [23]:
from scipy import stats

confidence = 0.95

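# Per-instance squared errors on the test set
squared_errors = (final_predictions - y_test) ** 2

# t-interval around the mean squared error; the square root maps it to RMSE units
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))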

array([45755.20680557, 49688.41356575])
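To make the computation less of a black box, here is a sketch of the same idea done by hand with the standard library. Note that it swaps in a normal (z-score) approximation rather than the Student's t-distribution used above; the two agree closely only when the test set is large. The squared errors are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist, fmean, stdev

# Hypothetical squared errors; in the notebook these would be
# (final_predictions - y_test) ** 2.
squared_errors = [1.0e9, 2.5e9, 4.0e9, 0.5e9, 3.0e9, 2.0e9, 1.5e9, 3.5e9]

confidence = 0.95
n = len(squared_errors)
mean_se = fmean(squared_errors)                  # the MSE
std_err = stdev(squared_errors) / math.sqrt(n)   # standard error of the mean

# Two-sided critical value of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf((1 + confidence) / 2)

# Interval around the MSE, mapped to RMSE units by taking square roots.
lower = math.sqrt(mean_se - z * std_err)
upper = math.sqrt(mean_se + z * std_err)

print(lower, math.sqrt(mean_se), upper)
```

With only eight values, as here, the t-interval would be noticeably wider than this approximation, so treat the example as a way to read the formula rather than as a substitute for `scipy.stats.t.interval()`.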