# 📝 Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the
generalization performance of a model, as estimated by cross-validation on the
training set.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

In this exercise, we progressively define the regression pipeline and later
tune its hyperparameters.

Start by defining a pipeline that:
* uses a `StandardScaler` to normalize the numerical data;
* uses a `sklearn.neighbors.KNeighborsRegressor` as a predictive model.

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("regressor", KNeighborsRegressor())
    ]
)
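The step names given to the `Pipeline` matter for the tuning that follows: each hyperparameter is addressed as `<step name>__<parameter name>`. The sketch below rebuilds the same two-step pipeline under the illustrative name `pipeline` (not a name used in the exercise) and lists the tunable names it exposes.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two-step pipeline as above, rebuilt so this snippet runs on its own.
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("regressor", KNeighborsRegressor()),
    ]
)

# Nested hyperparameters are exposed as "<step name>__<parameter name>".
tunable_names = sorted(name for name in pipeline.get_params() if "__" in name)
print(tunable_names)

# The same naming scheme works for setting a value directly.
pipeline.set_params(regressor__n_neighbors=10, scaler__with_mean=False)
print(pipeline.get_params()["regressor__n_neighbors"])
```

Renaming a step (for instance `"regressor"` to `"knn"`) changes every corresponding key, so the search dictionary has to follow the pipeline definition.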

Use `RandomizedSearchCV` with `n_iter=20` and
`scoring="neg_mean_absolute_error"` to tune the following hyperparameters
of the `model`:

- the parameter `n_neighbors` of the `KNeighborsRegressor` with values
  `np.logspace(0, 3, num=10).astype(np.int32)`;
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values `True`
  or `False`.

The `scoring` function is expected to return higher values for better models,
since grid/random search objects **maximize** it. Because of that, error
metrics like `mean_absolute_error` must be negated (using the `neg_` prefix)
to work correctly (remember lower errors represent better models).
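The sign convention can be checked on a toy problem. The following sketch uses small synthetic data (an assumption made for illustration, not the housing dataset) and compares `mean_absolute_error` with the value returned by the `"neg_mean_absolute_error"` scorer.

```python
import numpy as np
from sklearn.metrics import get_scorer, mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic regression problem (illustrative data, not the housing set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

regressor = KNeighborsRegressor(n_neighbors=5).fit(X[:150], y[:150])

# Plain error metric: lower is better.
mae = mean_absolute_error(y[150:], regressor.predict(X[150:]))

# Scorer used by the search objects: the same quantity with its sign flipped,
# so that higher is better.
scorer = get_scorer("neg_mean_absolute_error")
neg_mae = scorer(regressor, X[150:], y[150:])

print(f"MAE: {mae:.3f}, neg MAE score: {neg_mae:.3f}")
```

The two printed numbers have the same magnitude and opposite signs, which is why scores reported by the search are negative.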

Notice that in the notebook "Hyperparameter tuning by randomized-search" we
pass distributions to be sampled by the `RandomizedSearchCV`. In this case we
define a fixed grid of hyperparameters to be explored. Using a `GridSearchCV`
instead would explore all the possible combinations on the grid, which can be
costly to compute for large grids, whereas the parameter `n_iter` of the
`RandomizedSearchCV` controls the number of different random combinations that
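are evaluated. Notice that setting `n_iter` larger than the number of possible
combinations in a grid (in this case 10 x 2 x 2 = 40) brings no benefit: when
every hyperparameter is given as a list of values, combinations are sampled
without replacement, so scikit-learn raises a warning and evaluates each of the
40 combinations only once.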
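The grid size quoted above can be checked with `ParameterGrid`, and `ParameterSampler` gives a preview of the kind of subset a randomized search draws. This is a minimal sketch: the name `grid` and the choice `random_state=0` are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

grid = {
    "regressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "scaler__with_mean": (True, False),
    "scaler__with_std": (True, False),
}

# Exhaustive enumeration, as GridSearchCV would do it.
n_grid_candidates = len(ParameterGrid(grid))
print(n_grid_candidates)

# Random subset of the grid, similar to what RandomizedSearchCV draws with
# n_iter=20. random_state=0 is an arbitrary choice to make the draw
# reproducible.
sampled = list(ParameterSampler(grid, n_iter=20, random_state=0))
print(len(sampled))
print(sampled[0])
```

With `n_iter=20`, the randomized search evaluates half as many candidates as the exhaustive grid would.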

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.

In [14]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

In [15]:
param_distributions = {
    "regressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "scaler__with_mean": (True, False),
    "scaler__with_std": (True, False)
}

In [16]:
model_randomized_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,
    scoring="neg_mean_absolute_error"
)

In [17]:
model_randomized_search.fit(data_train, target_train)

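`RandomizedSearchCV`

| Parameter | Value |
|---|---|
| estimator | `Pipeline(step...Regressor())])` |
| param_distributions | `{'regressor__n_neighbors': array([ 1, ... dtype=int32), 'scaler__with_mean': (True, ...), 'scaler__with_std': (True, ...)}` |
| n_iter | `20` |
| scoring | `'neg_mean_absolute_error'` |
| n_jobs | `None` |
| refit | `True` |
| cv | `None` |
| verbose | `0` |
| pre_dispatch | `'2*n_jobs'` |
| random_state | `None` |

`scaler`: `StandardScaler`

| Parameter | Value |
|---|---|
| copy | `True` |
| with_mean | `True` |
| with_std | `True` |

`regressor`: `KNeighborsRegressor`

| Parameter | Value |
|---|---|
| n_neighbors | `np.int32(10)` |
| weights | `'uniform'` |
| algorithm | `'auto'` |
| leaf_size | `30` |
| p | `2` |
| metric | `'minkowski'` |
| metric_params | `None` |
| n_jobs | `None` |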


### The best parameters are:

In [18]:
model_randomized_search.best_params_

{'scaler__with_std': True,
 'scaler__with_mean': True,
 'regressor__n_neighbors': np.int32(10)}
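Once the best combination is known, a natural follow-up is to compare the cross-validated error with the error on the held-out test set. The sketch below does this on synthetic data from `make_regression` so that it runs without downloading anything. The dataset, the shortened list of `n_neighbors` values, `n_iter=10` and `random_state=0` are all illustrative assumptions rather than the exercise's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data so the snippet runs offline.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

search = RandomizedSearchCV(
    Pipeline([("scaler", StandardScaler()), ("regressor", KNeighborsRegressor())]),
    param_distributions={
        "regressor__n_neighbors": [1, 2, 4, 10, 21, 46],
        "scaler__with_mean": (True, False),
        "scaler__with_std": (True, False),
    },
    n_iter=10,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validated score of the best candidate; negate
# it to read it as a mean absolute error.
cv_mae = -search.best_score_

# With refit=True (the default), predict uses the best pipeline refitted on
# the whole training set.
test_mae = mean_absolute_error(y_test, search.predict(X_test))

print(search.best_params_)
print(f"CV MAE: {cv_mae:.2f}, test MAE: {test_mae:.2f}")
```

Applied to the exercise, the same two lines would use `model_randomized_search`, `data_test` and `target_test` in place of the synthetic objects.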