<div style="height:100px">

<div style="display:inline-block; width:77%; vertical-align:middle;">
    <div>
        <b>Author</b>: <a href="http://pages.di.unipi.it/castellana/">Daniele Castellana</a>
    </div>
    <div>
        PhD student at the University of Pisa and member of the Computational Intelligence & Machine Learning Group (<a href="http://www.di.unipi.it/groups/ciml/">CIML</a>)
    </div>
    <div>
        <b>Mail</b>: <a href="mailto:daniele.castellana@di.unipi.it">daniele.castellana@di.unipi.it</a>
    </div>
</div>

<div style="display:inline-block; width: 10%; vertical-align:middle;">
    <img align="right" width="100%" src="https://upload.wikimedia.org/wikipedia/it/7/72/Stemma_unipi.png">
</div>

<div style="display:inline-block; width: 10%; vertical-align:middle;">
    <img align="right" width="100%" src="http://www.di.unipi.it/groups/ciml/Home_files/loghi/logo_ciml-restyling2018.svg">
</div>
</div>

# Model Selection
Model selection determines appropriate values for the hyper-parameters of a model by choosing the configuration that achieves the lowest validation error.

## Dataset splitting
The dataset should be split into three sets as follows:
- **Training set**: used to update the weights; the patterns in this set are presented repeatedly in random order, and the weight update equations are applied after a certain number of patterns
- **Validation set**: used to decide when to stop training (by monitoring the validation error only) and to select the best model configuration
- **Test set**: used to assess the final performance of the neural network. It should not be used at any point during the development of the network or the model selection cycle
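
The three-way split described above can be sketched with the standard library alone. The 60/20/20 proportions, the seed, and the helper name `three_way_split` are assumptions chosen for illustration; in practice, applying `sklearn.model_selection.train_test_split` twice has a similar effect.

```python
import random

def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Return disjoint index lists for the training, validation and test sets."""
    indices = list(range(n_samples))
    # Shuffle once so that each subset is a random sample of the data.
    random.Random(seed).shuffle(indices)

    n_test = round(n_samples * test_fraction)
    n_val = round(n_samples * val_fraction)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The test indices are set aside first, so nothing that happens during training or model selection can touch them.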


In [1]:
from keras.datasets import boston_housing
from sklearn.preprocessing import StandardScaler


(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

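# rescale values: fit the scaler on the training set only, then apply the
# same transformation to the test set so that both share the same scale
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)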

print("The training set is a matrix of size {}.\n"
      "{} is the number of samples and {} is the number of features.".format(x_train.shape, x_train.shape[0], x_train.shape[1]))

Using TensorFlow backend.


The training set is a matrix of size (404, 13).
404 is the number of samples and 13 is the number of features.


We define a function which takes the model hyper-parameters as input and returns a compiled Keras model.

In [2]:
from keras.models import Sequential
from keras.layers import Dense

def build_model(n_layers=2, h_dim=64, activation='relu', optimizer='adagrad'):
    # define the model
    model = Sequential()

    n_feature = x_train.shape[1]
    
    model.add(Dense(h_dim, activation=activation, input_shape=(n_feature,)))
    for i in range(n_layers-1):
        model.add(Dense(h_dim, activation=activation))
    # output layer with linear activation (regression)
    model.add(Dense(1))

    #compile the model
    model.compile(optimizer=optimizer,
                  loss='mse',
                  metrics=['mae'])
    
    return model

# Validation using Scikit-learn
We use the scikit-learn library to perform the validation. A Keras model can be passed to scikit-learn through a wrapper found in keras.wrappers.scikit_learn (see [doc page](https://keras.io/scikit-learn-api/)). Also, [this page](https://scikit-learn.org/stable/modules/grid_search.html) contains a user guide for model selection using scikit-learn.

## Grid definition
The first step is to define the grid of hyper-parameters which should be tested.

In [3]:
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV

# create the grid
n_layers = [1, 2, 3]
h_dim = [32, 64, 128]
activation = ['relu', 'tanh']
optimizer = ['adagrad', 'adam']
param_grid = dict(optimizer=optimizer, n_layers=n_layers, h_dim=h_dim, activation=activation)

## Grid Search Cross-Validation
The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter.

The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset, all the possible combinations of parameter values are evaluated and the best combination is retained (see [doc page](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)).
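
Conceptually, fitting a grid search with cross-validation amounts to two nested loops: one over the configurations and one over the folds. The code below is an illustrative sketch only, not scikit-learn's implementation; `kfold_indices`, `grid_search_cv` and the toy `toy_evaluate` callback are hypothetical names, and the toy error function stands in for "train on the training folds and measure the error on the held-out fold".

```python
from itertools import product

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def grid_search_cv(param_grid, evaluate, n_samples, k=3):
    """Return (best_params, best_error), averaging the validation error over k folds."""
    names = sorted(param_grid)
    best_params, best_error = None, float("inf")
    # Outer loop: every combination of hyper-parameter values.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Inner loop: one evaluation per fold.
        errors = [evaluate(params, tr, va) for tr, va in kfold_indices(n_samples, k)]
        mean_error = sum(errors) / len(errors)
        if mean_error < best_error:
            best_params, best_error = params, mean_error
    return best_params, best_error

# Toy stand-in for training and validating a model: by construction the
# error is smallest for h_dim=64 and n_layers=2, whatever the fold.
def toy_evaluate(params, train_idx, val_idx):
    return abs(params["h_dim"] - 64) + abs(params["n_layers"] - 2)

best_params, best_error = grid_search_cv(
    {"n_layers": [1, 2, 3], "h_dim": [32, 64, 128]}, toy_evaluate, n_samples=404)
print(best_params, best_error)
```

Replacing `toy_evaluate` with a function that really trains a model shows where the cost comes from: the model is fitted once per configuration and per fold.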

In [None]:
if False:
    model = KerasRegressor(build_fn=build_model)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
    grid_result = grid.fit(x_train, y_train, epochs=100, batch_size=10, verbose=0)

**The grid search is too expensive** since we must evaluate 3 * 3 * 2 * 2 = 36 configurations. Each of them is validated through a 3-fold cross-validation, for a total of 36 * 3 = 108 fits.
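
The count can be checked by enumerating the grid explicitly. This is a small sketch using only the standard library; the grid is copied from the cell above and `n_folds` mirrors the `cv=3` argument.

```python
from itertools import product

param_grid = {
    "n_layers": [1, 2, 3],
    "h_dim": [32, 64, 128],
    "activation": ["relu", "tanh"],
    "optimizer": ["adagrad", "adam"],
}
n_folds = 3

# Cartesian product of all the value lists: one tuple per configuration.
configurations = list(product(*param_grid.values()))
n_configurations = len(configurations)
n_fits = n_configurations * n_folds

print(n_configurations)  # 36
print(n_fits)            # 108
```

With the default `refit=True`, one additional fit is performed at the end to retrain the best configuration on the whole training set.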

## Random Search Cross-Validation
When the search space is too big and an exhaustive grid search is too expensive, we can perform a random search over the hyper-parameter grid.

RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. 

Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for GridSearchCV. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the **n_iter** parameter. For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified (see [doc page](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)).
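
When every parameter is given as a list of discrete choices, as in this notebook, the random search amounts to drawing **n_iter** distinct configurations from the grid. A minimal sketch with the standard library, under that assumption; the seed and the name `sample_configurations` are illustrative, and this is not scikit-learn's code:

```python
import random
from itertools import product

def sample_configurations(param_grid, n_iter, seed=0):
    """Draw n_iter distinct configurations uniformly from a discrete grid."""
    names = sorted(param_grid)
    all_configs = [dict(zip(names, values))
                   for values in product(*(param_grid[name] for name in names))]
    # random.sample draws without replacement, so no configuration repeats.
    return random.Random(seed).sample(all_configs, n_iter)

param_grid = {
    "n_layers": [1, 2, 3],
    "h_dim": [32, 64, 128],
    "activation": ["relu", "tanh"],
    "optimizer": ["adagrad", "adam"],
}

sampled = sample_configurations(param_grid, n_iter=10)
print(len(sampled))
for config in sampled[:3]:
    print(config)
```

With 10 sampled configurations and 3 folds, the budget drops from 108 fits to 30, at the price of possibly missing the best point of the grid.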

In [8]:
# we can use a random search instead
from sklearn.model_selection import RandomizedSearchCV
model = KerasRegressor(build_fn=build_model)
# scikit-learn will randomly choose 10 configurations out of the 36 available
randSearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, cv=3)
rand_result = randSearch.fit(x_train, y_train, epochs=100, batch_size=10, verbose=0)






Now we can print the results of the random search.

In [9]:
print("Best: %f using %s" % (-rand_result.best_score_, rand_result.best_params_))
means = rand_result.cv_results_['mean_test_score']
stds = rand_result.cv_results_['std_test_score']
params = rand_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (-mean, stdev, param))

Best: 11.013070 using {'optimizer': 'adam', 'n_layers': 3, 'h_dim': 128, 'activation': 'relu'}
16.403987 (5.876551) with: {'optimizer': 'rmsprop', 'n_layers': 2, 'h_dim': 128, 'activation': 'tanh'}
11.776609 (2.446961) with: {'optimizer': 'rmsprop', 'n_layers': 1, 'h_dim': 128, 'activation': 'relu'}
16.614111 (4.449182) with: {'optimizer': 'adam', 'n_layers': 2, 'h_dim': 128, 'activation': 'tanh'}
19.048151 (3.786665) with: {'optimizer': 'adam', 'n_layers': 1, 'h_dim': 128, 'activation': 'tanh'}
12.622468 (2.156957) with: {'optimizer': 'rmsprop', 'n_layers': 2, 'h_dim': 128, 'activation': 'relu'}
12.403694 (2.396414) with: {'optimizer': 'rmsprop', 'n_layers': 2, 'h_dim': 32, 'activation': 'relu'}
15.940225 (5.744311) with: {'optimizer': 'rmsprop', 'n_layers': 3, 'h_dim': 32, 'activation': 'tanh'}
15.532436 (3.004790) with: {'optimizer': 'adam', 'n_layers': 1, 'h_dim': 32, 'activation': 'relu'}
11.510838 (1.913685) with: {'optimizer': 'adam', 'n_layers': 2, 'h_dim': 128, 'activation': '

Finally, we can test the best configuration on the test set.

The best estimator can be obtained directly from the **model selection object**.

**Moreover, the best estimator has already been retrained on the whole training set**.

This behaviour can be disabled by setting the parameter **refit=False**.

In [10]:
best_model = rand_result.best_estimator_.model

test_mse, test_mae = best_model.evaluate(x_test, y_test)

print('The MSE on the test set is {:.4f}.\n'
      'The MAE on the test set is {:.4f}.'.format(test_mse, test_mae))

The MSE on the test set is 14.3577.
The MAE on the test set is 2.6032.
