Lambda School Data Science

*Unit 4, Sprint 2, Module 4*

---

# Neural Networks & GPUs (Prepare)
*aka Hyperparameter Tuning*

*aka Big Servers for Big Problems*

## Learning Objectives
* <a href="#p1">Part 1</a>: Describe the major hyperparameters to tune
* <a href="#p2">Part 2</a>: Implement an experiment tracking framework
* <a href="#p3">Part 3</a>: Search the hyperparameter space using RandomSearch (Optional)

In [1]:
wandb_group = "ds8"
wandb_project = "inclass"

# Hyperparameter Options (Learn)
<a id="p1"></a>

## Overview

Hyperparameter tuning matters much more for neural networks than for any of the other models we have considered up to this point. Other supervised learning models might have a couple of hyperparameters, but neural networks can have dozens. These can substantially affect the accuracy of our models, and although tuning them can be time consuming, it is a necessary step when working with neural networks.

Hyperparameter tuning comes with a challenge. How can we compare models specified with different hyperparameters if our model's final error metric can vary somewhat erratically? How do we avoid just getting unlucky and selecting the wrong hyperparameter? This is a problem that to a certain degree we just have to live with as we test and test again. However, we can reduce it somewhat by pairing our experiments with Cross Validation, which lowers the variance of our final accuracy values.
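To make the cross-validation idea concrete, here is a minimal standard-library sketch. The helper names (`kfold_indices`, `cv_mean_score`) and the noisy `fake_evaluate` function are hypothetical stand-ins for "train a model and score it", not part of this notebook's code.

```python
import random
import statistics

def kfold_indices(n_samples, k, seed=0):
    """Shuffle the row indices once, then deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cv_mean_score(evaluate, n_samples, k=5):
    """Average a score over k train/validation splits.

    `evaluate(train_idx, val_idx)` is a hypothetical callback that trains on
    train_idx and returns a validation score.
    """
    folds = kfold_indices(n_samples, k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, val_idx))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy stand-in for a real training run: a made-up score with random noise.
noise = random.Random(1)

def fake_evaluate(train_idx, val_idx):
    return 0.70 + noise.uniform(-0.05, 0.05)

mean_score, spread = cv_mean_score(fake_evaluate, n_samples=404, k=5)
print(f"mean={mean_score:.3f} stdev={spread:.3f}")
```

Comparing the mean across folds (and keeping an eye on the spread) is less sensitive to one lucky or unlucky split than comparing single runs.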

### Load Boston Housing Data

In [2]:
from tensorflow.keras.datasets import boston_housing

(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

### Normalizing Input Data

It's not 100% necessary to normalize/scale your input data before feeding it to a neural network. The network can learn appropriate weights for data on any scale, as long as it is represented numerically. Scaling is still recommended because it can **make training faster** and **reduce the chances that gradient descent gets stuck in a local optimum**.

<https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network>
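As a rough sketch of what standardization does to a single feature column, the snippet below learns the mean and standard deviation from training values only and reuses them for the test values. The helper names and the toy numbers are illustrative assumptions.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    # Guard against constant columns, where the standard deviation is zero.
    return mean, (std if std > 0 else 1.0)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]

train = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy training feature
test = [3.0, 6.0]                   # toy test feature

mean, std = fit_scaler(train)       # fit on the training data only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuse the training statistics
print(train_scaled, test_scaled)
```

Fitting on the training data and only transforming the test data keeps information about the test set from leaking into training.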

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
print(x_train[:10])

[[-0.27224633 -0.48361547 -0.43576161 -0.25683275 -0.1652266  -0.1764426
   0.81306188  0.1166983  -0.62624905 -0.59517003  1.14850044  0.44807713
   0.8252202 ]
 [-0.40342651  2.99178419 -1.33391162 -0.25683275 -1.21518188  1.89434613
  -1.91036058  1.24758524 -0.85646254 -0.34843254 -1.71818909  0.43190599
  -1.32920239]
 [ 0.1249402  -0.48361547  1.0283258  -0.25683275  0.62864202 -1.82968811
   1.11048828 -1.18743907  1.67588577  1.5652875   0.78447637  0.22061726
  -1.30850006]
 [-0.40149354 -0.48361547 -0.86940196 -0.25683275 -0.3615597  -0.3245576
  -1.23667187  1.10717989 -0.51114231 -1.094663    0.78447637  0.44807713
  -0.65292624]
 [-0.0056343  -0.48361547  1.0283258  -0.25683275  1.32861221  0.15364225
   0.69480801 -0.57857203  1.67588577  1.5652875   0.78447637  0.3898823
   0.26349695]
 [-0.37502238 -0.48361547 -0.54747912 -0.25683275 -0.54935658 -0.78865126
   0.18954148  0.48371503 -0.51114231 -0.71552978  0.51145832  0.38669063
  -0.13812828]
 [ 0.58963463 -0.48361547

### Model Validation using an automatic verification Dataset

Instead of writing a separate train/test split step, you can use a really nice Keras feature: pass the `validation_data` argument when fitting your model and Keras will evaluate the model on that data at the end of every epoch. Here we pass the test set as the validation dataset.

In [4]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Important Hyperparameters
inputs = x_train.shape[1]
epochs = 75
batch_size = 10


# Create Model
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(inputs,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))

# Compile Model
model.compile(optimizer='adam', loss='mse', metrics=['mse', 'mae'])

# Fit Model
model.fit(x_train, y_train, 
          validation_data=(x_test,y_test), 
          epochs=epochs, 
          batch_size=batch_size
         )

Train on 404 samples, validate on 102 samples
Epoch 1/75
...
Epoch 75/75


<tensorflow.python.keras.callbacks.History at 0x1d4f57b0780>

### Hyperparameter Tuning Approaches:

#### 1) Babysitting AKA "Grad Student Descent".

If you fiddled with any hyperparameters yesterday, this is basically what you did. This approach is 100% manual and is pretty common among researchers where finding that 1 exact specification that jumps your model to a level of accuracy never seen before is the difference between publishing and not publishing a paper. Of course the professors don't do this themselves, that's grunt work. This is also known as the fiddle with hyperparameters until you run out of time method.

#### 2) Grid Search

Grid Search is the Grad Student galaxy brain realization of: why don't I just specify all the experiments I want to run and let the computer try every possible combination of them while I go and grab lunch? This has a specific downside: if I specify 5 hyperparameters with 5 options each, then I've just created 5^5 combinations of hyperparameters to check, which means that I have to train 3,125 different versions of my model. If I then use 5-fold Cross Validation on top of that, my model has to be trained 15,625 times. This is the brute-force method of hyperparameter tuning, but it can be very profitable if done wisely.
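The arithmetic is easy to check by enumerating a grid. The grid below is a made-up example with 5 hyperparameters and 5 candidate values each; the specific values are assumptions for illustration only.

```python
from itertools import product

# Hypothetical grid: 5 hyperparameters with 5 candidate values each.
grid = {
    "batch_size": [10, 20, 40, 80, 160],
    "epochs": [20, 40, 60, 80, 100],
    "learning_rate": [0.001, 0.003, 0.01, 0.03, 0.1],
    "units": [16, 32, 64, 128, 256],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.5],
}

combinations = list(product(*grid.values()))  # every possible combination
cv_folds = 5
total_fits = len(combinations) * cv_folds     # one model fit per fold

print(len(combinations), total_fits)  # prints: 3125 15625
```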

When using Grid Search here's what I suggest: don't use it to test combinations of different hyperparameters, only use it to test different specifications of **a single** hyperparameter. It's rare that combinations between different hyperparameters lead to big performance gains. You'll get 90-95% of the way there if you just Grid Search one parameter and take the best result, then retain that best result while you test another, and then retain the best specification from that while you train another. This at least makes the situation much more manageable and leads to pretty good results. 
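Here is one way the "tune one hyperparameter at a time" strategy could be sketched. `toy_score` is a made-up objective standing in for a real cross-validated model score, and it is deliberately separable (each hyperparameter contributes independently), which is the flattering case for this strategy.

```python
def one_at_a_time_search(grid, score, start):
    """Tune each hyperparameter in turn, keeping the best value found so far."""
    best = dict(start)
    evaluations = 0
    for name, candidates in grid.items():
        best_value, best_score = None, float("-inf")
        for value in candidates:
            trial = {**best, name: value}   # vary only this hyperparameter
            s = score(trial)
            evaluations += 1
            if s > best_score:
                best_value, best_score = value, s
        best[name] = best_value             # retain the winner before moving on
    return best, evaluations

grid = {
    "batch_size": [10, 20, 40, 80, 160],
    "learning_rate": [0.001, 0.003, 0.01, 0.03, 0.1],
    "units": [16, 32, 64, 128, 256],
}

def toy_score(params):
    # Made-up objective with its peak at batch_size=40, learning_rate=0.01, units=64.
    return -(abs(params["batch_size"] - 40) / 40
             + abs(params["learning_rate"] - 0.01) * 10
             + abs(params["units"] - 64) / 64)

start = {name: values[0] for name, values in grid.items()}
best, evaluations = one_at_a_time_search(grid, toy_score, start)
print(best, evaluations)
```

This run needs 15 evaluations, where the full grid over the same values would need 5 * 5 * 5 = 125. When hyperparameters interact strongly, the result can depend on the order in which you tune them.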

#### 3) Random Search

Do Grid Search for a couple of hours and you'll say to yourself - "There's got to be a better way." Enter Random Search. For Random search you specify a hyperparameter space and it picks specifications from that randomly, tries them out, gives you the best results and says - That's going to have to be good enough, go home and spend time with your family. 

Grid Search treats every hyperparameter as if it were equally important, but this just isn't the case: some are known to move the needle a lot more than others (we'll talk about that in a minute). Because Random Search draws a fresh value for every hyperparameter on every trial, it tries many more distinct values of the important hyperparameters than a grid with the same budget, instead of repeating a handful of values across the less important dimensions. The downside of Random Search is that it isn't guaranteed to find the absolute best hyperparameters, but it is much less costly to perform than Grid Search.
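A bare-bones version of random search might look like the sketch below. The search space, the `sample_params` and `random_search` helpers, and the `toy_score` objective are all assumptions for illustration; in practice `toy_score` would be replaced by training and validating a model.

```python
import math
import random

def sample_params(rng):
    """Draw one configuration from a hypothetical search space."""
    return {
        # Log-uniform: spreads learning rates evenly across orders of magnitude.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "batch_size": rng.choice([10, 20, 40, 80, 160]),
        "units": rng.randint(16, 256),
    }

def random_search(score, n_trials, seed=0):
    """Sample n_trials configurations and return the best one plus all trials."""
    rng = random.Random(seed)
    trials = [sample_params(rng) for _ in range(n_trials)]
    return max(trials, key=score), trials

def toy_score(params):
    # Made-up objective that is most sensitive to the learning rate.
    return (-abs(math.log10(params["learning_rate"]) + 2)
            - abs(params["units"] - 64) / 640)

best, trials = random_search(toy_score, n_trials=20)
print(best)
```

With 20 trials you have tried 20 different learning rates, whereas a grid with the same budget would revisit the same few values many times.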

#### 4) Bayesian Methods

One thing that can make more manual methods like babysitting and Grid Search effective is that the experimenter sees results as they come in and can update future searches to take past specifications into account. If only we could hyperparameter tune our hyperparameter tuning. Well, we kind of can. Enter Bayesian Optimization. Neural Networks are like an optimization problem within an optimization problem, and Bayesian Optimization is a search strategy that tries to take into account the results of past searches in order to improve future ones. This is the most advanced method and can be a little bit tricky to implement, but there are some early steps with `hyperas`, which is a Bayesian optimization wrapper for `keras`.

## What Hyperparameters are there to test?

- batch_size
- training epochs
- optimization algorithms
- learning rate
- momentum
- activation functions
- dropout regularization
- number of neurons in the hidden layer

There are more, but these are the most important.

## Follow Along

## Batch Size

Batch size determines how many observations the model is shown before it calculates loss/error and updates the model weights via gradient descent. You're looking for a sweet spot here: enough observations per batch that you have enough information to update the weights, but not such a large batch size that you get very few weight updates in a given epoch. Feed-forward Neural Networks aren't as sensitive to `batch_size` as other networks, but it is still an important hyperparameter to tune. Smaller batch sizes will also take longer to train.
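The trade-off is easy to see by counting weight updates per epoch. This small sketch assumes the 404 training rows from the Boston housing fit shown earlier; the `updates_per_epoch` helper is a hypothetical name.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Each batch triggers one weight update; the last batch may be partial."""
    return math.ceil(n_samples / batch_size)

n_train = 404  # training rows reported by the earlier Boston housing fit
for batch_size in [10, 20, 40, 60, 80, 100]:
    print(f"batch_size={batch_size:>3}: "
          f"{updates_per_epoch(n_train, batch_size)} weight updates per epoch")
```

Going from a batch size of 10 to 100 cuts the number of updates per epoch from 41 to 5.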

In [5]:
import numpy
import pandas as pd
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
url ="https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

dataset = pd.read_csv(url, header=None).values

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# create model
model = KerasClassifier(build_fn=create_model, verbose=0)

# define the grid search parameters
param_grid = {'batch_size': [10, 20, 40, 60, 80, 100],
              'epochs': [20]}

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}") 

Best: 0.6327985763549805 using {'batch_size': 40, 'epochs': 20}
Means: 0.6185213565826416, Stdev: 0.02140310524956398 with: {'batch_size': 10, 'epochs': 20}
Means: 0.5912061929702759, Stdev: 0.04617203845989743 with: {'batch_size': 20, 'epochs': 20}
Means: 0.6327985763549805, Stdev: 0.04319084957341726 with: {'batch_size': 40, 'epochs': 20}
Means: 0.6106188058853149, Stdev: 0.04054461984753422 with: {'batch_size': 60, 'epochs': 20}
Means: 0.5664544761180877, Stdev: 0.08582089515646346 with: {'batch_size': 80, 'epochs': 20}
Means: 0.59128258228302, Stdev: 0.05490354871211439 with: {'batch_size': 100, 'epochs': 20}


## Epochs

The number of training epochs has a large and direct effect on accuracy. However, more epochs is almost always going to be better than fewer epochs. This means that if you tune this parameter at the beginning and try to keep the same value throughout your training, you're going to be waiting a long time for each iteration of Grid Search. I suggest picking a fixed, moderate number of epochs for all of your other tuning and then Grid Searching this parameter at the very end.

In [6]:
# define the grid search parameters
param_grid = {'batch_size': [20],
              'epochs': [20, 40, 60,200]}

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}")

Best: 0.718767523765564 using {'batch_size': 20, 'epochs': 200}
Means: 0.6471182465553283, Stdev: 0.05630154056879198 with: {'batch_size': 20, 'epochs': 20}
Means: 0.686232078075409, Stdev: 0.03317398545272534 with: {'batch_size': 20, 'epochs': 40}
Means: 0.703157639503479, Stdev: 0.03783015339042304 with: {'batch_size': 20, 'epochs': 60}
Means: 0.718767523765564, Stdev: 0.018724013634204385 with: {'batch_size': 20, 'epochs': 200}


## Optimizer

Remember that there are many different [optimizers](https://keras.io/optimizers/). At some point, take some time to read up on them a little bit. "adam" usually gives the best results. The thing to know about choosing an optimizer is that different optimizers have different hyperparameters, like learning rate, momentum, etc. So depending on the optimizer you choose, you might also have to tune its learning rate and momentum afterwards.

## Learning Rate

Remember that the learning rate is a hyperparameter that is specific to your gradient-descent-based optimizer. A learning rate that is too high will cause divergent behavior, while a learning rate that is too low may fail to converge in the number of epochs you give it. Again, you're looking for the sweet spot. I would start out tuning learning rates across a wide range, stepping by orders of magnitude at the low end: [.001, .01, .1, .2, .3, .5], etc. I wouldn't go above .5, but you can try it and see what the behavior is like.
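A toy example shows both failure modes. The sketch below runs plain gradient descent on f(w) = w², which is an assumption chosen for simplicity; the thresholds at which it converges or diverges are specific to this function and do not transfer to a real network.

```python
def gradient_descent(learning_rate, steps=50, w=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w and whose minimum is at w = 0."""
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

for lr in [0.001, 0.01, 0.1, 0.5, 1.1]:
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.4g}")
```

The smallest rate has barely moved away from the starting point after 50 steps, the middle rates get close to the minimum, and the largest rate overshoots further on every step and blows up.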

Once you have narrowed it down, make the window even smaller and try again. If after running the above specification your model reports that .1 is the best learning rate, then you should probably try values like [.05, .08, .1, .12, .15] to narrow it down further.

It can also be good to tune the number of epochs in combination with the learning rate, since the number of epochs determines whether the optimizer has had enough iterations at that learning rate to converge to the minimum.

## Momentum

Momentum is a hyperparameter that is more commonly associated with Stochastic Gradient Descent. SGD is a common optimizer because it's what people understand and know, but I doubt it will get you the best results. You can try hyperparameter tuning its attributes and see if you can beat the performance from adam. Momentum is a property that decides the willingness of an optimizer to overshoot the minimum. Imagine a ball rolling down one side of a bowl and then up the opposite side a little bit before settling back to the bottom. The purpose of momentum is to try to escape local minima.
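The rolling-ball picture can be reproduced on a one-dimensional bowl. This is a minimal sketch of the classical momentum update on f(w) = w²; the function, starting point, and step counts are assumptions chosen for illustration.

```python
def sgd_momentum(momentum, learning_rate=0.1, steps=30, w=5.0):
    """Classical momentum on f(w) = w**2; returns the path taken by w."""
    velocity = 0.0
    path = [w]
    for _ in range(steps):
        grad = 2 * w
        # Velocity remembers a fraction of the previous step.
        velocity = momentum * velocity - learning_rate * grad
        w = w + velocity
        path.append(w)
    return path

plain = sgd_momentum(momentum=0.0)   # ordinary gradient descent
heavy = sgd_momentum(momentum=0.9)   # the ball keeps rolling past the bottom

# Count how many times the path crosses the minimum at w = 0.
overshoots = sum(1 for a, b in zip(heavy, heavy[1:]) if a * b < 0)
print(f"plain stays on one side: {all(x > 0 for x in plain)}")
print(f"momentum=0.9 crosses the minimum {overshoots} times")
```

Without momentum, w creeps toward 0 from one side. With momentum 0.9 it rolls past the bottom and oscillates before settling, which is the same behavior that can carry an optimizer out of a shallow local minimum.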

## Activation Functions

We've talked about this a little bit. Typically you'll want to use ReLU for hidden layers and either Sigmoid or Softmax for the output layers of binary and multi-class classification models respectively, but try other activation functions and see if you can get better results with sigmoid or tanh or something else. There are a lot of activation functions that we haven't really talked about. Maybe you'll get good results with them. Maybe you won't. :) <https://keras.io/activations/>

## Network Weight Initialization

You saw how big of an effect the way that we initialize our network's weights can have on our results. There are **a lot** of what are called initialization modes. I don't understand all of them, but they can have a big effect on your model's initial accuracy. Your model will get further with fewer epochs if you initialize it with weights that are well suited to the problem you're trying to solve.

`init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']`
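To give a feel for how some of these modes differ, the sketch below computes the sampling range for three of the uniform variants from a layer's fan-in and fan-out. The formulas follow the standard definitions of these schemes; the helper names are hypothetical and this is plain Python rather than the Keras implementation.

```python
import math
import random

def uniform_limit(mode, fan_in, fan_out):
    """Half-width of the uniform range each scheme samples weights from."""
    if mode == "glorot_uniform":
        return math.sqrt(6.0 / (fan_in + fan_out))
    if mode == "he_uniform":
        return math.sqrt(6.0 / fan_in)
    if mode == "lecun_uniform":
        return math.sqrt(3.0 / fan_in)
    raise ValueError(f"unknown mode: {mode}")

def init_weights(mode, fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out weight matrix as nested lists."""
    rng = random.Random(seed)
    limit = uniform_limit(mode, fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

# First hidden layer of the Boston model: 13 inputs feeding 64 units.
for mode in ["glorot_uniform", "he_uniform", "lecun_uniform"]:
    print(f"{mode}: weights drawn from +/-{uniform_limit(mode, 13, 64):.4f}")
```

The schemes mostly differ in how widely the initial weights are spread, which changes how large the activations and gradients are at the start of training.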

## Dropout Regularization and the Weight Constraint

The Dropout Regularization value is the fraction of neurons that you want to be randomly deactivated during training. The weight constraint is a second regularization parameter that works in tandem with dropout regularization. You should tune these two values at the same time.

Using dropout on visible (input) layers vs hidden layers might have a different effect. It might have no noticeable effect on one and a substantial effect on the other, so it is worth trying both. You don't necessarily need to use dropout unless you see that your model has overfitting and generalizability problems.
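Both ideas are small enough to sketch in plain Python. The snippet assumes "inverted" dropout (surviving activations are rescaled so their expected value is unchanged) and a max-norm weight constraint; the helper names and toy inputs are hypothetical.

```python
import math
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

def max_norm(weights, max_value):
    """Rescale a weight vector so its L2 norm does not exceed max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    return [w * max_value / norm for w in weights]

rng = random.Random(0)
activations = [1.0] * 1000                     # toy layer output
dropped = dropout(activations, rate=0.2, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(f"fraction deactivated: {zero_fraction:.2f}")

constrained = max_norm([3.0, 4.0], max_value=1.0)  # norm 5 is scaled down to 1
print(constrained)
```

Dropout applies only during training. The weight constraint keeps individual weight vectors from growing very large, which is why the two are usually tuned together.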

## Neurons in Hidden Layer 

Remember that when we only had a single perceptron, our model was only able to fit linearly separable data, but as we have added layers and nodes to those layers, our network has become a powerhouse at fitting nonlinearity in data. Generally, the larger the network and the more nodes it has, the stronger its capacity to fit nonlinear patterns in data. The more nodes and layers, the longer it will take to train a network and the higher the probability of overfitting. The larger your network gets, the more you'll need dropout regularization or other regularization techniques to keep it in check.

Typically depth (more layers) is more important than width (more nodes) for neural networks. This is part of why Deep Learning is so highly touted. Certain deep learning architectures have truly been huge breakthroughs for certain machine learning tasks. 
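One quick way to compare architectures is to count trainable parameters. The sketch below does this for stacks of fully connected layers, using the 13-input Boston housing layout from this notebook and a deeper variant with five hidden layers of 64 units; the `dense_param_count` helper is a hypothetical name.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

small = dense_param_count([13, 64, 64, 1])              # two hidden layers
deep = dense_param_count([13, 64, 64, 64, 64, 64, 1])   # five hidden layers

print(small, deep)  # prints: 5121 17601
```

More parameters means more capacity, but also more to fit with only a few hundred training rows, which is where regularization comes in.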

You might borrow ideas from other network architectures. For example, if I were doing image recognition and I wasn't taking cues from state-of-the-art architectures like ResNet, AlexNet, GoogLeNet, etc., then I would probably have to do a lot more experimentation on my own before finding something that works.

There are some heuristics, but I am highly skeptical of them. I think you're better off experimenting on your own and forming your own intuition for these kinds of problems. 

- https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

## Challenge
You will be expected to tune several hyperparameters in today's module project. 

# Experiment Tracking Framework (Learn)
<a id="p2"></a>

## Overview

You will notice quickly that managing the results of all the experiments you are running becomes challenging. Which set of parameters did the best? Are my results today different from my results yesterday? Although we use IPython Notebooks to work, the format is not well suited to logging experimental results. Enter experiment tracking frameworks like [Comet.ml](https://comet.ml) and [Weights and Biases](https://wandb.ai/).

Those tools will help you track your experiments and store both the results and the code associated with those experiments. Experimental results can also be readily visualized to see changes in performance across any metric you care about. Data is sent to the tool as each epoch is completed, so you can also see if your model is converging. Let's check out Weights & Biases today.
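To see what such a tool is doing at its core, here is a by-hand version using only the standard library: each run is appended to a JSON-lines file, and the best run is looked up afterwards. The helper names, the file name, and the metric values are all made up for illustration.

```python
import json
import os
import tempfile

def log_run(path, params, metrics):
    """Append one experiment record as a JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"params": params, "metrics": metrics}) + "\n")

def best_run(path, metric, higher_is_better=True):
    """Read every logged run and return the one with the best metric."""
    with open(path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda run: run["metrics"][metric])

log_path = os.path.join(tempfile.mkdtemp(), "runs.jsonl")

# Made-up results standing in for real training runs.
log_run(log_path, {"batch_size": 10, "epochs": 20}, {"val_mae": 3.1})
log_run(log_path, {"batch_size": 40, "epochs": 20}, {"val_mae": 2.7})

print(best_run(log_path, "val_mae", higher_is_better=False)["params"])
```

A tracking framework adds the parts that are tedious to build yourself: per-epoch metrics, code snapshots, and charts that compare runs.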

## Follow Along

Make sure you log in to `wandb` in the terminal before running the next cell.

In [7]:
import wandb
from wandb.keras import WandbCallback

In [10]:
# Replace YOUR_API_KEY with your own key; never commit a real key to a shared notebook.
!wandb login YOUR_API_KEY

'wandb' is not recognized as an internal or external command,
operable program or batch file.


In [None]:
wandb.init(project="boston", entity="lambda-ds7")  # Initializes an experiment

# Important Hyperparameters
X =  x_train
y =  y_train

inputs = X.shape[1]
wandb.config.epochs = 50
wandb.config.batch_size = 10

# Create Model
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(inputs,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))
# Compile Model
model.compile(optimizer='adam', loss='mse', metrics=['mse', 'mae'])

# Fit Model
model.fit(X, y, 
          validation_split=0.33, 
          epochs=wandb.config.epochs, 
          batch_size=wandb.config.batch_size, 
          callbacks=[WandbCallback()]
         )

## Challenge

You will be expected to use Weights & Biases to try to tune your model during your module assignment today. 

# Hyperparameters with Random Search (Learn)
<a id="p3"></a>

## Overview

Basically, `GridSearchCV` takes forever. You'll want to adopt a slightly more sophisticated strategy.

## Follow Along

In [9]:
sweep_config = {
    'method': 'random',
    'parameters': {
        'learning_rate': {'distribution': 'normal'},
        'epochs': {'distribution': 'uniform',
                    'min': 100,
                    'max': 1000},
        'batch_size': {'distribution': 'uniform',
            'min': 10,
            'max': 400}
    }
}
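One thing to watch with the configuration above: a `normal` distribution can produce negative or very large learning rates, and `uniform` samples floats where Keras expects whole numbers for epochs and batch size. The variant below is a sketch of one way to keep the sampled values in a sensible range. The name `sweep_config_bounded` and the learning-rate bounds are assumptions, and this is not the configuration used for the run logged in this notebook.

```python
# Alternative sweep configuration (a sketch with assumed bounds).
sweep_config_bounded = {
    'method': 'random',
    'parameters': {
        # Keep learning rates positive and within a plausible range.
        'learning_rate': {'distribution': 'uniform',
                          'min': 0.0001,
                          'max': 0.1},
        # int_uniform samples whole numbers, which Keras expects here.
        'epochs': {'distribution': 'int_uniform',
                   'min': 100,
                   'max': 1000},
        'batch_size': {'distribution': 'int_uniform',
                       'min': 10,
                       'max': 400}
    }
}
```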

In [13]:
sweep_id = wandb.sweep(sweep_config)

Create sweep with ID: huau0u9r
Sweep URL: https://app.wandb.ai/lambda-ds7/boston/sweeps/huau0u9r


In [11]:
import wandb
from wandb.keras import WandbCallback
# train() initializes a new experiment (run) each time the sweep agent calls it

from tensorflow.keras.optimizers import Adam

# Important Hyperparameters
X =  x_train
y =  y_train

inputs = X.shape[1]

def train():
    
    wandb.init(project="boston", entity="lambda-ds7") 
    
    config = wandb.config

    # Create Model
    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(inputs,)))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(1))

    # Optimizer 
    adam = Adam(learning_rate=config.learning_rate)

    # Compile Model
    model.compile(optimizer=adam, loss='mse', metrics=['mse', 'mae'])

    # Fit Model
    model.fit(X, y, 
              validation_split=0.33, 
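              epochs=int(config.epochs),          # the sweep's 'uniform' distribution samples floats;
              batch_size=int(config.batch_size),  # Keras expects whole numbers for both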
              callbacks=[WandbCallback()]
             )

In [None]:
wandb.agent(sweep_id, function=train)

wandb: Agent Starting Run: 2g77kp6k with config:
	batch_size: 308.503347845309
	epochs: 704.9395850579006
	learning_rate: 1.480005523005428
wandb: Agent Started Run: 2g77kp6k


## Challenge

Try to apply random search (for example scikit-learn's `RandomizedSearchCV` or a Weights & Biases sweep) to your module project today.

# Review
* <a href="#p1">Part 1</a>: Describe the major hyperparameters to tune
    - Activation Functions
    - Optimizer
    - Number of Layers
    - Number of Neurons
    - Batch Size
    - Dropout Regularization
    - Learning Rate
    - Number of Epochs
    - and many more
* <a href="#p2">Part 2</a>: Implement an experiment tracking framework
    - Weights & Biases
    - Comet.ml
    - By Hand / GridSearch
* <a href="#p3">Part 3</a>: Search the hyperparameter space using RandomSearch
    - Sklearn still useful (haha)
    - Integration with Weights & Biases
* <a href="#p4">Part 4</a>: Discuss emerging hyperparameter tuning strategies
    - Bayesian Optimization
    - Hyperopt
    - Genetic Evolution

# Sources

## Additional Reading
- https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
- https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/
- https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/
- https://machinelearningmastery.com/introduction-to-weight-constraints-to-reduce-generalization-error-in-deep-learning/
- https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/