# Hyperparameter Optimization

In this notebook we will look at three approaches to hyperparameter optimization, compare them, and build different models.

## Dataset
We will use our old friend MNIST for its simplicity. 

<font color=red><b>Load the dataset and preprocess it. 
</font>

In [None]:
import os
import time
import pandas as pd
import wrangle as wr
from numpy import nan
import tensorflow as tf

# Enable memory growth only when a GPU is actually available
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
tf.keras.backend.clear_session()
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist

# the data, split between train and test sets
...

print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

## Model Architecture
As we will have to create it more than once, let's build a function for creating the model. The function will receive the following parameters:
- first_neuron
- activation
- kernel_initializer
- dropout_rate
- optimizer

And the model will consist of:
- A dense layer with first_neuron units, using the given activation and initialized with the kernel initializer
- A dropout layer with the given dropout rate
- A dense layer with as many units as there are classes, softmax activated and initialized with the kernel initializer
- The given optimizer, with categorical crossentropy as the loss function and accuracy added to the metrics

<font color=red><b> Build the model function
</font>

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

# Function to create model, required for KerasClassifier
def create_model(first_neuron=9,
                 activation='relu',
                 kernel_initializer='uniform',
                 dropout_rate=0,
                 optimizer='Adam'):
    ...

We will build the model using the KerasClassifier wrapper, which lets us pass these parameters to the model function:

In [None]:
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
# Create the model
model = KerasClassifier(build_fn=create_model)

## Grid Search
The first technique we are going to use is brute force: grid search. Here's the workflow:

- Define a grid on n dimensions, where each dimension maps to a hyperparameter, e.g. n = (learning_rate, dropout_rate, batch_size)
- For each dimension, define the range of possible values, e.g. batch_size = [4, 8, 16, 32, 64, 128, 256]
- Evaluate all the possible configurations and wait for the results to establish the best one, e.g. C1 = (0.1, 0.3, 4) -> acc = 92%, C2 = (0.1, 0.35, 4) -> acc = 92.3%, etc.

The real pain point of this approach is known as the curse of dimensionality: the number of configurations is the product of the number of values in each dimension, so the more dimensions we add, the more the search explodes in time complexity (exponentially in the number of dimensions), ultimately making this strategy infeasible!
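To make the growth concrete, here is a small sketch that counts the configurations of a grid. The search space and its values are made up for illustration; only the counting logic matters.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout_rate": [0.0, 0.3, 0.5],
    "batch_size": [4, 8, 16, 32, 64, 128, 256],
}

# Grid search evaluates every combination of values.
configurations = list(product(*grid.values()))
n_configs = len(configurations)
print(n_configs)  # 3 * 3 * 7 = 63

# Adding one more dimension with 4 values multiplies the cost by 4.
grid["first_neuron"] = [8, 16, 32, 64]
n_configs_4d = len(list(product(*grid.values())))
print(n_configs_4d)  # 63 * 4 = 252
```

With 3 cross-validation folds, every configuration is trained 3 times, so the 252 configurations above would already mean 756 training runs.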

<font color=red><b>Define the range of values you want to experiment with. Don't go crazy with many values because this is computationally intensive. Finally, create a dictionary with all the components
</font>

In [None]:
# Model Design Components
...

# Hyperparameters
epochs = [10] 
batch_size = [1024]
dropout_rate = [0.0]

# Prepare the Grid
param_grid = ...

Let's perform the grid search.

<font color=red><b> Generate the grid and fit it. Use 1 job, 3 CV folds and set verbose to 2.
</font>

In [None]:
# Perform the Search!
from sklearn.model_selection import GridSearchCV
...

In [None]:
# Show results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

## Random search
This is our second strategy. The only real difference between Grid Search and Random Search is in step 1 of the strategy cycle: Random Search picks the points randomly from the configuration space, so the parameters can vary over a wider range than a fixed grid allows. Take two hyperparameters and 9 trained models. With the grid layout, it's easy to notice that, even though we have trained 9 models, we have used only 3 values per variable! With the random layout, it's extremely unlikely that we will select the same value more than once, so we end up having trained 9 models using 9 different values for each variable.

It is not guaranteed to find the best hyperparameters, but it works well in high-dimensional spaces and often gives better results in fewer iterations.
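The coverage argument can be checked with a few lines of plain Python. This is only a sketch: the two hyperparameters, their ranges, and the helper `distinct_per_dimension` are assumptions chosen for illustration.

```python
import random
from itertools import product

random.seed(0)  # fixed seed so the sketch is reproducible

# Grid layout: 3 values per hyperparameter -> 9 configurations.
dropout_values = [0.0, 0.25, 0.5]
lr_values = [0.001, 0.01, 0.1]
grid_configs = list(product(dropout_values, lr_values))

# Random layout: 9 configurations drawn from continuous ranges
# (learning rate sampled on a log scale between 1e-3 and 1e-1).
random_configs = [
    (random.uniform(0.0, 0.5), 10 ** random.uniform(-3, -1))
    for _ in range(9)
]

def distinct_per_dimension(configs):
    """Count how many different values were tried for each hyperparameter."""
    return [len(set(values)) for values in zip(*configs)]

grid_distinct = distinct_per_dimension(grid_configs)
random_distinct = distinct_per_dimension(random_configs)
print("grid:  ", grid_distinct)
print("random:", random_distinct)
```

Both layouts cost 9 training runs, but the grid only ever tries 3 distinct values per hyperparameter, while the random draws try a different value in every run.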


<font color=red><b>Define the distribution of values you want to experiment with. Finally, create a dictionary with all the components
    <br> Hint: use randint for integers in a range or uniform for continuous values
</font>

In [None]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform
# Model Design Components
...
# Hyperparameters
epochs = [10] 
batch_size = [1024] 
...


# Prepare for the Search
param_dist = ...

Let's perform the random search.

<font color=red><b> Generate the randomized search and fit it. Use 1 job, 3 CV folds and set verbose to 2. Use a max iterations value of 8.
</font>

In [None]:
# Perform the Search!
from sklearn.model_selection import RandomizedSearchCV

# run randomized search
...

In [None]:


# Show results
print("Best: %f using %s" % (random_search.best_score_, random_search.best_params_))
means = random_search.cv_results_['mean_test_score']
stds = random_search.cv_results_['std_test_score']
params = random_search.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



## Bayesian Optimization
This search strategy builds a surrogate model that tries to predict the metrics we care about from the hyperparameter configuration.

Since the objective function is unknown, the Bayesian strategy is to treat it as a random function and place a prior over it. The prior captures beliefs about the behavior of the function. After gathering the function evaluations, which are treated as data, the prior is updated to form the posterior distribution over the objective function. The posterior distribution, in turn, is used to construct an acquisition function (often also referred to as infill sampling criteria) that determines the next query point. 


1. Build a model

**Iterate and generate a better estimation of P(validation_metric|hyperparameters):**
2. Select Hyperparameters
3. Training/Evaluate
4. Update the model

It usually needs fewer training runs than the other two strategies, because it uses the results gathered so far to steer the search toward promising parameters.
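The build, select, evaluate, update loop can be sketched in plain Python. Everything here is an assumption made for illustration: the one-dimensional `objective` stands in for an expensive training run, and the nearest-neighbour `surrogate` is a deliberately crude replacement for the probabilistic models (Gaussian processes, TPE) that real libraries use. It is not a proper posterior, only something with a prediction and an uncertainty.

```python
def objective(x):
    """Hypothetical validation loss as a function of one hyperparameter."""
    return (x - 0.3) ** 2

def surrogate(x, observations):
    """Toy surrogate: predict the loss of the nearest evaluated point and
    use the distance to it as a crude uncertainty estimate."""
    nearest_x, nearest_loss = min(observations, key=lambda obs: abs(obs[0] - x))
    return nearest_loss, abs(nearest_x - x)

def acquisition(x, observations, kappa=1.0):
    """Lower confidence bound: low when the predicted loss is low
    or when the point is far from anything evaluated so far."""
    mean, uncertainty = surrogate(x, observations)
    return mean - kappa * uncertainty

# 1. Build the model from a couple of initial evaluations.
observations = [(x, objective(x)) for x in (0.0, 1.0)]

candidates = [i / 100 for i in range(101)]
for _ in range(10):
    # 2. Select the hyperparameter value that minimizes the acquisition.
    x_next = min(candidates, key=lambda x: acquisition(x, observations))
    # 3. Train/evaluate (here: just call the cheap toy objective).
    loss = objective(x_next)
    # 4. Update the model with the new observation.
    observations.append((x_next, loss))

best_x, best_loss = min(observations, key=lambda obs: obs[1])
print(best_x, best_loss)
```

The `kappa` parameter trades exploration against exploitation: larger values push the search toward unexplored regions, smaller values keep it close to the best point found so far.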



<font color=red><b> This strategy needs a function to define the data. Build it:
</font>

In [None]:
def data():
    
    ...
    return x_train, y_train, x_test, y_test

We will now build the function for creating the model. This case is special: hyperas expects the hyperparameters to be searched to be written as templates inside double curly brackets, dropped in wherever a value is needed. The function's return value has to be a valid Python dictionary with the following keys:
- loss: Specify a numeric evaluation metric to be minimized
- status: Just use STATUS_OK and see hyperopt documentation if not feasible
- model: specify the model just created so that we can later use it again.
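Because the loss is minimized, a metric that should be maximized, such as accuracy, has to be negated. A minimal sketch of that convention follows; the helper `to_result` and the accuracy values are hypothetical, and the status is written as the plain string 'ok' on the assumption that this is what hyperopt's STATUS_OK constant holds (in the notebook, import the constant instead).

```python
def to_result(accuracy, model=None):
    """Build a result dictionary with the three expected keys."""
    return {"loss": -accuracy, "status": "ok", "model": model}

# Made-up accuracies from three imaginary trials.
results = [to_result(acc) for acc in (0.91, 0.95, 0.89)]

# Minimizing the negated accuracy selects the most accurate trial.
best = min(results, key=lambda result: result["loss"])
print(best["loss"])  # -0.95
```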

In [None]:
from hyperas.distributions import choice, uniform
from hyperopt import Trials, STATUS_OK, tpe

def model(x_train, y_train, x_test, y_test):
    model = Sequential()
    
    # L1
    model.add(Dense({{choice([8,9,10])}}, 
                    input_shape=(784,),
                    kernel_initializer={{choice(['uniform', 'normal'])}}, 
                    activation={{choice(['relu', 'elu'])}}))
    # Dropout
    model.add(Dropout({{uniform(0, 1)}}))
    # L2: one unit per class, softmax activated
    model.add(Dense(10, 
                    kernel_initializer={{choice(['uniform', 'normal'])}}, 
                    activation='softmax'))
    # Compile model with categorical crossentropy, as in the architecture above
    model.compile(loss='categorical_crossentropy', 
                  optimizer={{choice(['nadam', 'adam', 'sgd'])}}, 
                  metrics=['accuracy'])
    
    model.fit(x_train, y_train,
              batch_size=1024,
              epochs=10,
              verbose=2,
              validation_data=(x_test, y_test))
    
    score, acc = model.evaluate(x_test, y_test, verbose=0)
    print('Test accuracy:', acc)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model}

<font color=red><b> Run 5 iterations using the Tree-structured Parzen Estimator (TPE) algorithm provided with hyperopt.
    <br> Hint: use the minimize function from the optim module in hyperas and the tpe.suggest algorithm
</font>



In [None]:
from hyperas import optim
...

In [None]:
x_train, y_train, x_test, y_test = data()
print("Evaluation of best performing model:")
print(best_model.evaluate(x_test, y_test))
print("Best performing model chosen hyper-parameters:")
print(best_run)
