# How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
From [https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/](https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/) by Jason Brownlee on August 9, 2016

In [1]:
# Use scikit-learn to grid search the batch size and epochs
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier

from sklearn.model_selection import GridSearchCV

Using TensorFlow backend.


In [2]:
np.random.seed(7)

In [3]:
# load dataset
dataset = np.loadtxt("data/pima-indians-diabetes.csv", delimiter=",")

In [4]:
dataset

array([[   6.   ,  148.   ,   72.   , ...,    0.627,   50.   ,    1.   ],
       [   1.   ,   85.   ,   66.   , ...,    0.351,   31.   ,    0.   ],
       [   8.   ,  183.   ,   64.   , ...,    0.672,   32.   ,    1.   ],
       ..., 
       [   5.   ,  121.   ,   72.   , ...,    0.245,   30.   ,    0.   ],
       [   1.   ,  126.   ,   60.   , ...,    0.349,   47.   ,    1.   ],
       [   1.   ,   93.   ,   70.   , ...,    0.315,   23.   ,    0.   ]])

In [5]:
dataset.shape

(768, 9)

In [6]:
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

## How to Tune Batch Size and Number of Epochs

In this first simple example, we look at tuning the batch size and number of epochs used when fitting the network.

The batch size in iterative gradient descent is the number of patterns shown to the network before the weights are updated. It is also an optimization in the training of the network, defining how many patterns to read at a time and keep in memory.

The number of epochs is the number of times that the entire training dataset is shown to the network during training. Some networks are sensitive to the batch size, such as LSTM recurrent neural networks and Convolutional Neural Networks.

Here we will evaluate mini-batch sizes of 10, 20, 40, 60, 80 and 100, each combined with 10, 50 and 100 epochs.
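
To make the interaction between batch size and epochs concrete, the following minimal sketch counts weight updates for each configuration. It assumes that each fit trains on roughly two thirds of the 768 rows (the 3-fold cross-validation noted in the tips at the end of this post) and that a final partial batch still triggers an update. The helper name `updates_per_epoch` is introduced here purely for illustration.

```python
import math

N_ROWS = 768                 # rows in the dataset loaded above
N_TRAIN = N_ROWS * 2 // 3    # assumption: 3-fold CV, so each fit sees about 2/3 of the rows

def updates_per_epoch(n_samples, batch_size):
    """Weight updates in one pass over the data (the last batch may be partial)."""
    return math.ceil(n_samples / batch_size)

batch_sizes = [10, 20, 40, 60, 80, 100]
epoch_options = [10, 50, 100]

for batch_size in batch_sizes:
    per_epoch = updates_per_epoch(N_TRAIN, batch_size)
    # Total updates for each epoch setting in the grid
    totals = [per_epoch * epochs for epochs in epoch_options]
    print("batch_size=%3d: %2d updates/epoch, totals for %s epochs: %s"
          % (batch_size, per_epoch, epoch_options, totals))
```

Small batches with few epochs and large batches with many epochs can end up with a similar number of updates, which is one reason the two parameters are searched together.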

In [7]:
# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [8]:
# create model
model = KerasClassifier(build_fn=create_model, verbose=0)
# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
param_grid = dict(batch_size=batch_size, epochs=epochs)

In [9]:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)

In [10]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.687500 using {'batch_size': 80, 'epochs': 100}
0.433594 (0.143773) with: {'batch_size': 10, 'epochs': 10}
0.558594 (0.112719) with: {'batch_size': 10, 'epochs': 50}
0.678385 (0.016053) with: {'batch_size': 10, 'epochs': 100}
0.466146 (0.137566) with: {'batch_size': 20, 'epochs': 10}
0.661458 (0.012890) with: {'batch_size': 20, 'epochs': 50}
0.664062 (0.033603) with: {'batch_size': 20, 'epochs': 100}
0.600260 (0.065700) with: {'batch_size': 40, 'epochs': 10}
0.667969 (0.044993) with: {'batch_size': 40, 'epochs': 50}
0.636719 (0.020915) with: {'batch_size': 40, 'epochs': 100}
0.617188 (0.047628) with: {'batch_size': 60, 'epochs': 10}
0.639323 (0.031304) with: {'batch_size': 60, 'epochs': 50}
0.644531 (0.037603) with: {'batch_size': 60, 'epochs': 100}
0.558594 (0.017758) with: {'batch_size': 80, 'epochs': 10}
0.670573 (0.041626) with: {'batch_size': 80, 'epochs': 50}
0.687500 (0.024910) with: {'batch_size': 80, 'epochs': 100}
0.548177 (0.049855) with: {'batch_size': 100, 'epochs':

We can see that a batch size of 80 with 100 epochs achieved the best result of about 69% accuracy.

## How to Tune the Training Optimization Algorithm

Keras offers a suite of different state-of-the-art optimization algorithms.

In this example, we tune the optimization algorithm used to train the network, each with default parameters.

This is an odd example, because often you will choose one approach a priori and instead focus on tuning its parameters on your problem (e.g. see the next example).

Here we will evaluate the suite of optimization algorithms supported by the Keras API.

In [11]:
def create_model(optimizer='adam'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

In [12]:
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)

In [13]:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)

In [14]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.679688 using {'optimizer': 'Adagrad'}
0.657552 (0.031948) with: {'optimizer': 'SGD'}
0.595052 (0.149507) with: {'optimizer': 'RMSprop'}
0.679688 (0.014616) with: {'optimizer': 'Adagrad'}
0.657552 (0.017566) with: {'optimizer': 'Adadelta'}
0.649740 (0.026557) with: {'optimizer': 'Adam'}
0.454427 (0.173211) with: {'optimizer': 'Adamax'}
0.626302 (0.133969) with: {'optimizer': 'Nadam'}


The results suggest that the Adagrad optimization algorithm is the best with a score of about 68% accuracy.

## How to Tune Learning Rate and Momentum

It is common to pre-select an optimization algorithm to train your network and tune its parameters.

By far the most common optimization algorithm is plain old Stochastic Gradient Descent (SGD) because it is so well understood. In this example, we will look at optimizing the SGD learning rate and momentum parameters.

Learning rate controls how much to update the weight at the end of each batch and the momentum controls how much to let the previous update influence the current weight update.
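
The roles of the two parameters can be sketched on a toy problem. The code below uses the textbook (classical) momentum update on the loss f(w) = w², which is an assumption made for illustration; the exact update inside Keras may differ in detail. The function names are hypothetical and not part of any library.

```python
def sgd_momentum_step(weight, velocity, gradient, learn_rate, momentum):
    """One classical momentum update: carry over part of the previous update,
    then add the new gradient step scaled by the learning rate."""
    velocity = momentum * velocity - learn_rate * gradient
    return weight + velocity, velocity

def minimise_quadratic(learn_rate, momentum, steps=50):
    """Minimise the toy loss f(w) = w**2 (gradient 2*w), starting from w = 1.0."""
    weight, velocity = 1.0, 0.0
    for _ in range(steps):
        gradient = 2.0 * weight
        weight, velocity = sgd_momentum_step(weight, velocity, gradient,
                                             learn_rate, momentum)
    return weight

for momentum in [0.0, 0.4, 0.9]:
    final = minimise_quadratic(learn_rate=0.01, momentum=momentum)
    print("momentum=%.1f -> |w| after 50 steps: %.4f" % (momentum, abs(final)))
```

With momentum set to 0.0 the update reduces to plain gradient descent; larger values let earlier steps keep pushing the weight in the same direction.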

We will try a suite of small standard learning rates and momentum values from 0.2 to 0.8 in steps of 0.2, as well as 0.0 (no momentum) and 0.9 (because it is a popular value in practice).

Generally, it is a good idea to also include the number of epochs in an optimization like this as there is a dependency between the amount of learning per batch (learning rate), the number of updates per epoch (batch size) and the number of epochs.

In [19]:
# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    optimizer = keras.optimizers.SGD(lr=learn_rate, momentum=momentum)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

In [20]:
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)

In [21]:
# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
param_grid = dict(learn_rate=learn_rate, momentum=momentum)

In [22]:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)

In [23]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.675781 using {'learn_rate': 0.001, 'momentum': 0.4}
0.669271 (0.007366) with: {'learn_rate': 0.001, 'momentum': 0.0}
0.666667 (0.035564) with: {'learn_rate': 0.001, 'momentum': 0.2}
0.675781 (0.024910) with: {'learn_rate': 0.001, 'momentum': 0.4}
0.667969 (0.006379) with: {'learn_rate': 0.001, 'momentum': 0.6}
0.553385 (0.121882) with: {'learn_rate': 0.001, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.001, 'momentum': 0.9}
0.666667 (0.037240) with: {'learn_rate': 0.01, 'momentum': 0.0}
0.653646 (0.027498) with: {'learn_rate': 0.01, 'momentum': 0.2}
0.651042 (0.024774) with: {'learn_rate': 0.01, 'momentum': 0.4}
0.544271 (0.146518) with: {'learn_rate': 0.01, 'momentum': 0.6}
0.651042 (0.024774) with: {'learn_rate': 0.01, 'momentum': 0.8}
0.572917 (0.134575) with: {'learn_rate': 0.01, 'momentum': 0.9}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.0}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.2}
0.572917 (0.134575) with: {'learn_rate':

We can see that SGD does relatively poorly on this problem. Nevertheless, the best results were achieved using a learning rate of 0.001 and a momentum of 0.4, with an accuracy of about 68%.

## How to Tune Network Weight Initialization

Neural network weight initialization used to be simple: use small random values.

Now there is a suite of different techniques to choose from. Keras provides a laundry list.

In this example, we will look at tuning the selection of network weight initialization by evaluating all of the available techniques.

We will use the same weight initialization method on each layer. Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. In the example below we use rectifier for the hidden layer. We use sigmoid for the output layer because the predictions are binary.
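
To give a feel for how some of these schemes differ, here is a minimal pure-Python sketch of the sampling ranges used by three of the uniform variants. The formulas follow the commonly documented definitions of the Glorot, He and LeCun uniform initializers; the function names are hypothetical, and the plain `uniform`, `normal` and `zero` options are not covered.

```python
import math
import random

def uniform_limit(scheme, fan_in, fan_out):
    """Half-width of the sampling interval [-limit, limit] for a uniform scheme."""
    if scheme == "glorot_uniform":
        return math.sqrt(6.0 / (fan_in + fan_out))
    if scheme == "he_uniform":
        return math.sqrt(6.0 / fan_in)
    if scheme == "lecun_uniform":
        return math.sqrt(3.0 / fan_in)
    raise ValueError("unknown scheme: %s" % scheme)

def sample_weights(scheme, fan_in, fan_out, seed=7):
    """Draw a fan_in x fan_out weight matrix from the scheme's interval."""
    rng = random.Random(seed)
    limit = uniform_limit(scheme, fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

# Hidden layer of the network in this post: 8 inputs -> 12 neurons
for scheme in ["glorot_uniform", "he_uniform", "lecun_uniform"]:
    print("%-15s limit = %.4f" % (scheme, uniform_limit(scheme, 8, 12)))
```

The schemes differ mainly in how the range scales with the number of incoming and outgoing connections, which is why the best choice can depend on the activation function.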

In [24]:
# Function to create model, required for KerasClassifier
def create_model(init_mode='uniform'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer=init_mode, activation='relu'))
    model.add(Dense(1, kernel_initializer=init_mode, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [25]:
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)

In [26]:
# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(init_mode=init_mode)

In [27]:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)

In [28]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.729167 using {'init_mode': 'uniform'}
0.729167 (0.016053) with: {'init_mode': 'uniform'}
0.695313 (0.016573) with: {'init_mode': 'lecun_uniform'}
0.718750 (0.015947) with: {'init_mode': 'normal'}
0.651042 (0.024774) with: {'init_mode': 'zero'}
0.705729 (0.008027) with: {'init_mode': 'glorot_normal'}
0.712240 (0.021710) with: {'init_mode': 'glorot_uniform'}
0.718750 (0.019918) with: {'init_mode': 'he_normal'}
0.569010 (0.135854) with: {'init_mode': 'he_uniform'}


We can see that the best results were achieved with a uniform weight initialization scheme achieving a performance of about 73%.

## How to Tune the Neuron Activation Function

The activation function controls the non-linearity of individual neurons and when to fire.

Generally, the rectifier activation function is the most popular. The sigmoid and tanh functions used to be the default choices, and they may still be more suitable for some problems.

In this example, we will evaluate the suite of different activation functions available in Keras. We will only use these functions in the hidden layer, as we require a sigmoid activation function in the output for the binary classification problem.
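
For reference, several of the activation functions searched below can be written in a few lines of plain Python. This is only a sketch of the standard scalar definitions, not the Keras implementations, and it leaves out `softmax` and `hard_sigmoid`.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # Smooth approximation of relu
    return math.log(1.0 + math.exp(x))

def softsign(x):
    return x / (1.0 + abs(x))

def linear(x):
    # Identity: the neuron passes its weighted sum through unchanged
    return x

activations = {
    "relu": relu,
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "softplus": softplus,
    "softsign": softsign,
    "linear": linear,
}

# Evaluate each function at a negative, zero and positive input
for name, fn in activations.items():
    values = ", ".join("%7.4f" % fn(x) for x in [-2.0, 0.0, 2.0])
    print("%-9s %s" % (name, values))
```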

Generally, it is a good idea to prepare data to the range of the different transfer functions, which we will not do in this case.
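
As a rough illustration of what such preparation could look like, the sketch below linearly rescales one input column to a target range. The helper `rescale` is hypothetical, and the six values are simply the second-column entries visible in the dataset preview earlier in this post. In practice the minimum and maximum should be computed on the training folds only.

```python
def rescale(values, low=0.0, high=1.0):
    """Linearly map values so the smallest becomes `low` and the largest `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column has no spread; map every entry to the midpoint
        return [(low + high) / 2.0 for _ in values]
    scale = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * scale for v in values]

# Second input column for the six rows shown in the dataset preview
column = [148.0, 85.0, 183.0, 121.0, 126.0, 93.0]

print(rescale(column))              # target range [0, 1], like sigmoid outputs
print(rescale(column, -1.0, 1.0))   # target range [-1, 1], like tanh outputs
```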

In [29]:
# Function to create model, required for KerasClassifier
def create_model(activation='relu'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation=activation))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [30]:
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)

In [31]:
# define the grid search parameters
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
param_grid = dict(activation=activation)

In [32]:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)

In [33]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.734375 using {'activation': 'softplus'}
0.657552 (0.016053) with: {'activation': 'softmax'}
0.734375 (0.027251) with: {'activation': 'softplus'}
0.683594 (0.020915) with: {'activation': 'softsign'}
0.713542 (0.017566) with: {'activation': 'relu'}
0.684896 (0.008027) with: {'activation': 'tanh'}
0.694010 (0.012890) with: {'activation': 'sigmoid'}
0.678385 (0.010253) with: {'activation': 'hard_sigmoid'}
0.710938 (0.024080) with: {'activation': 'linear'}


Surprisingly (to me at least), the ‘softplus’ activation function achieved the best results with an accuracy of about 73%.

## How to Tune Dropout Regularization

In this example, we will look at tuning the dropout rate for regularization in an effort to limit overfitting and improve the model’s ability to generalize.

To get good results, dropout is best combined with a weight constraint such as the max norm constraint.

This involves tuning both the dropout rate and the weight constraint. We will try dropout rates between 0.0 and 0.9 (1.0 does not make sense) and maxnorm weight constraint values between 1 and 5.
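
The two techniques can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions (inverted dropout applied to a list of activations, and a max-norm rescaling of one neuron's incoming weights); the function names are hypothetical and the Keras layers may differ in detail.

```python
import math
import random

def max_norm(weights, max_value):
    """Rescale a weight vector so that its L2 norm is at most max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    return [w * max_value / norm for w in weights]

def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(7)
print(max_norm([3.0, 4.0], max_value=1.0))     # norm 5 is scaled down to norm 1
print(dropout([1.0] * 10, rate=0.5, rng=rng))  # surviving units are scaled to 2.0
```

The sketch also shows why a rate of 1.0 is excluded: every activation would be dropped and the rescaling factor would be undefined.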

In [34]:
# Function to create model, required for KerasClassifier
def create_model(dropout_rate=0.0, weight_constraint=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='linear', kernel_constraint=keras.constraints.maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [35]:
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)

In [36]:
# define the grid search parameters
weight_constraint = [1, 2, 3, 4, 5]
dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
param_grid = dict(dropout_rate=dropout_rate, weight_constraint=weight_constraint)

In [37]:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)

In [38]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.734375 using {'dropout_rate': 0.0, 'weight_constraint': 5}
0.726562 (0.016877) with: {'dropout_rate': 0.0, 'weight_constraint': 1}
0.710938 (0.012758) with: {'dropout_rate': 0.0, 'weight_constraint': 2}
0.709635 (0.004872) with: {'dropout_rate': 0.0, 'weight_constraint': 3}
0.718750 (0.027621) with: {'dropout_rate': 0.0, 'weight_constraint': 4}
0.734375 (0.016573) with: {'dropout_rate': 0.0, 'weight_constraint': 5}
0.721354 (0.020505) with: {'dropout_rate': 0.1, 'weight_constraint': 1}
0.699219 (0.013902) with: {'dropout_rate': 0.1, 'weight_constraint': 2}
0.692708 (0.023510) with: {'dropout_rate': 0.1, 'weight_constraint': 3}
0.712240 (0.021236) with: {'dropout_rate': 0.1, 'weight_constraint': 4}
0.707031 (0.000000) with: {'dropout_rate': 0.1, 'weight_constraint': 5}
0.725260 (0.006639) with: {'dropout_rate': 0.2, 'weight_constraint': 1}
0.717448 (0.040133) with: {'dropout_rate': 0.2, 'weight_constraint': 2}
0.717448 (0.023073) with: {'dropout_rate': 0.2, 'weight_constraint': 

We can see that a dropout rate of 0.0 (that is, no dropout) and a maxnorm weight constraint of 5 resulted in the best accuracy of about 73%.

## How to Tune the Number of Neurons in the Hidden Layer

The number of neurons in a layer is an important parameter to tune. Generally the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.

Also, generally, a large enough single layer network can approximate any other neural network, at least in theory.

In this example, we will look at tuning the number of neurons in a single hidden layer. We will try values from 1 to 30 in steps of 5.

A larger network requires more training and at least the batch size and number of epochs should ideally be optimized with the number of neurons.
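
One simple way to see how capacity grows with the hidden layer is to count trainable parameters. The sketch below assumes the architecture used in this post (8 inputs, one dense hidden layer, one output neuron, biases on both layers); `parameter_count` is a hypothetical helper, not a Keras function.

```python
def parameter_count(neurons, n_inputs=8, n_outputs=1):
    """Trainable parameters in a one-hidden-layer dense network (weights plus biases)."""
    hidden = n_inputs * neurons + neurons        # hidden-layer weights + biases
    output = neurons * n_outputs + n_outputs     # output-layer weights + biases
    return hidden + output

for neurons in [1, 5, 10, 15, 20, 25, 30]:
    print("neurons=%2d -> %3d trainable parameters"
          % (neurons, parameter_count(neurons)))
```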

In [39]:
# Function to create model, required for KerasClassifier
def create_model(neurons=1):
    # create model
    model = Sequential()
    model.add(Dense(neurons, input_dim=8, kernel_initializer='uniform', activation='linear', kernel_constraint=keras.constraints.maxnorm(4)))
    model.add(Dropout(0.2))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [40]:
# create model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, verbose=0)

In [41]:
# define the grid search parameters
neurons = [1, 5, 10, 15, 20, 25, 30]
param_grid = dict(neurons=neurons)

In [42]:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)

In [43]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.716146 using {'neurons': 25}
0.695313 (0.015947) with: {'neurons': 1}
0.703125 (0.022326) with: {'neurons': 5}
0.714844 (0.022097) with: {'neurons': 10}
0.714844 (0.003189) with: {'neurons': 15}
0.708333 (0.010253) with: {'neurons': 20}
0.716146 (0.013279) with: {'neurons': 25}
0.701823 (0.011201) with: {'neurons': 30}


We can see that the best results were achieved with a network with 25 neurons in the hidden layer, with an accuracy of about 72%.

## Tips for Hyperparameter Optimization

This section lists some handy tips to consider when tuning hyperparameters of your neural network.

- **k-fold Cross Validation**. You can see that the results from the examples in this post show some variance. A default cross-validation of 3 was used, but perhaps k=5 or k=10 would be more stable. Carefully choose your cross validation configuration to ensure your results are stable.


- **Review the Whole Grid**. Do not just focus on the best result, review the whole grid of results and look for trends to support configuration decisions.


- **Parallelize**. Use all your cores if you can (for example, via the `n_jobs` argument of `GridSearchCV`, which is set to 1 in this post). Neural networks are slow to train and we often want to try a lot of different parameters. Consider spinning up a lot of AWS instances.


- **Use a Sample of Your Dataset**. Because networks are slow to train, try training them on a smaller sample of your training dataset, just to get an idea of general directions of parameters rather than optimal configurations.


- **Start with Coarse Grids**. Start with coarse-grained grids and zoom into finer grained grids once you can narrow the scope.


- **Do not Transfer Results**. Results are generally problem specific. Try to avoid favorite configurations on each new problem that you see. It is unlikely that optimal results you discover on one problem will transfer to your next project. Instead look for broader trends like number of layers or relationships between parameters.


- **Reproducibility is a Problem**. Although we set the seed for the random number generator in NumPy, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped Keras models than is presented in this post.
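
As a small illustration of the tip about reviewing the whole grid, the sketch below takes the optimizer results printed earlier in this post and lists every configuration whose mean plus one standard deviation reaches the best mean. This is a crude, hypothetical heuristic for spotting results that may be indistinguishable from the winner given only three folds, not a formal statistical test.

```python
# (mean accuracy, standard deviation, optimizer), copied from the
# optimizer grid search output earlier in this post
results = [
    (0.657552, 0.031948, "SGD"),
    (0.595052, 0.149507, "RMSprop"),
    (0.679688, 0.014616, "Adagrad"),
    (0.657552, 0.017566, "Adadelta"),
    (0.649740, 0.026557, "Adam"),
    (0.454427, 0.173211, "Adamax"),
    (0.626302, 0.133969, "Nadam"),
]

def within_noise_of_best(results):
    """Names whose mean plus one standard deviation reaches the best mean,
    ordered from highest to lowest mean."""
    best_mean = max(mean for mean, _, _ in results)
    ranked = sorted(results, reverse=True)
    return [name for mean, std, name in ranked if mean + std >= best_mean]

print(within_noise_of_best(results))
```

Configurations with a low mean but a large standard deviation also show up in this list, which is itself a hint that more folds or repeated runs may be needed before drawing conclusions.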