# Learning Rate study

Deep learning neural networks are trained using the stochastic gradient descent optimization algorithm. The learning rate is a hyperparameter that controls how much the model weights change in response to the estimated error at each update. Choosing it is challenging: a value that is too small can make training very slow or leave it stuck, whereas a value that is too large can converge too quickly to a sub-optimal set of weights or make training unstable.
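
To make this trade-off concrete, here is a minimal sketch that is not part of the original exercise. It runs plain gradient descent on the toy objective f(w) = w², whose minimum is at w = 0. The function name `gradient_descent`, the toy objective, the starting point and the step count are all assumptions chosen for illustration.

```python
def gradient_descent(lrate, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - lrate * grad  # the learning rate scales every update
    return w


# Compare a very small, a moderate, and a too-large learning rate.
for lrate in (1e-3, 1e-1, 1.1):
    print(f"lrate={lrate}: final w = {gradient_descent(lrate):.4g}")
```

With the smallest rate the weight barely moves away from its starting value, with the moderate rate it gets close to the minimum, and with the largest rate each step overshoots and the iterate grows in magnitude.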


In this notebook, we will cover the effects of the learning rate, momentum, and learning rate schedules.

  
    

# The data

We will use a synthetic dataset that we generate ourselves.

## Multi-Class Classification Problem

We will use a small multi-class classification problem as the basis to demonstrate the effect of learning rate on model performance.

The scikit-learn library provides the make_blobs() function, which can be used to create a multi-class classification problem with a prescribed number of samples, input variables, and classes, and a prescribed variance of samples within each class.

<font color=red><b>Generate a dataset composed of 1000 samples with 3 classes and 2 features. Plot the result.
<br>Hint: use the blobs maker from sklearn and the given parameters</b>
</font>

In [None]:
# scatter plot of blobs dataset
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

cluster_std = 2 # force classes to be not linearly separable
random_state = 2 

# generate 2d classification dataset
X, y = ...
# scatter plot for each class value
for class_value in range(3):
    # scatter plot for points with a different color
    plt.scatter(...)
# show plot
plt.show()

## Effect of Learning Rate

In this section, we will develop a model to address the blobs classification problem and investigate the effect of different learning rates.

<font color=red><b>Generate a function for creating training and testing datasets. Remember that the labels must be in a categorical (one-hot) format.
<br>Hint: use the imported functions </b>
</font>
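
As a reminder of what "categorical format" means, the following minimal sketch one-hot encodes integer labels using only plain Python. The helper `one_hot` is hypothetical and exists only for illustration; in the exercise itself the imported `to_categorical` function is expected to do this job and returns an array rather than nested lists.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows (lists of 0.0 and 1.0)."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows


encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)
```

Each row has exactly one entry set to 1.0, at the index of the class it represents, which is the target layout that a softmax output layer with categorical crossentropy expects.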

In [None]:
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
# prepare train and test dataset
def prepare_data():
    # generate 2d classification dataset
    X, y = ...
    # one hot encode output variable
    y = ...
    # split into train and test. Is order ok?
    n_train = 500
    trainX, testX = ...
    trainy, testy = ...
    return trainX, trainy, testX, testy

In [None]:
# For easy reset of notebook state.
# study of learning rate on accuracy for blobs problem
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD

from matplotlib import pyplot

<font color=red><b>Create a function for generating a model with the following configuration:
    - dense layer with 50 units, relu-activated.
    - dense layer with as many units as there are categories, softmax-activated.
<br>Use stochastic gradient descent as the optimizer and compile using categorical crossentropy as the loss and accuracy as the metric. Store the result of the fit as a history for plotting.
<br>Hint: use the imported functions </b>
</font>
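
For intuition about the output layer and the loss, here is a small standard-library sketch of what softmax and categorical crossentropy compute for a single sample. The function names and the example scores are assumptions made for illustration; this is not how Keras implements them internally.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true (one-hot) class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


probs = softmax([2.0, 1.0, 0.1])                         # scores for 3 classes
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)  # true class is 0
print(probs, loss)
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is the quantity that gradient descent reduces at each weight update.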

In [None]:
# fit a model and plot learning curve
def fit_model(trainX, trainy, testX, testy, lrate):
    # define model
    model = Sequential()
    ...
    # compile model
    ...
    # fit model
    history = ...
    # plot learning curves
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.title('lrate='+str(lrate), pad=-50)

<font color=red><b> Prepare the data and train with different learning rates. Show the results and discuss them.
<br>Hint: use the functions you just built </b>
</font>

In [None]:
# prepare dataset
trainX, trainy, testX, testy = ...
# create learning curves for different learning rates
learning_rates = [1E-0, 1E-1, 1E-2, 1E-3, 1E-4, 1E-5, 1E-6, 1E-7]
for i in range(len(learning_rates)):
    # determine the plot number
    plot_no = 420 + (i+1)
    plt.subplot(plot_no)
    # fit model and plot learning curves for a learning rate
    ...
# show learning curves
plt.show()

## Momentum Dynamics

Momentum can smooth the progression of the learning algorithm, which in turn can accelerate the training process.

We will use the learning rate of 0.01, which in the previous section converged to a reasonable solution but required more epochs than the learning rate of 0.1.
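
The acceleration can be illustrated without a neural network. The sketch below applies the classical momentum update (a velocity that accumulates past gradients) to the toy objective f(w) = w². The function `sgd_momentum`, the toy objective and the step count are assumptions for illustration only; the learning rate of 0.01 mirrors the value used in this section.

```python
def sgd_momentum(lrate, momentum, steps=50, w0=1.0):
    """Minimize f(w) = w**2 using gradient descent with classical momentum."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        grad = 2.0 * w                                  # derivative of w**2
        velocity = momentum * velocity - lrate * grad   # accumulate past steps
        w = w + velocity                                # move along the velocity
    return w


for momentum in (0.0, 0.5, 0.9):
    final_w = sgd_momentum(lrate=0.01, momentum=momentum)
    print(f"momentum={momentum}: |w| after 50 steps = {abs(final_w):.4g}")
```

With the same learning rate and the same number of steps, the runs with momentum end closer to the minimum at w = 0 than the run without it, because consecutive gradients point in the same direction and their contributions add up.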

<font color=red><b> Define the same model architecture, but this time modify the momentum of the optimizer. Plot the results.
<br>Hint: Add the momentum to the optimizer definition</b>
</font>

In [None]:
# fit a model and plot learning curve
def fit_model_momentum(trainX, trainy, testX, testy, momentum):
    # define model
    ...
    # compile model
    ...
    # fit model
    history = ...
    # plot learning curves
    ...
    plt.title('momentum='+str(momentum), pad=-80) 


momentums = [0.0, 0.5, 0.9, 0.99]
for i in range(len(momentums)):
    # determine the plot number
    plot_no = 220 + (i+1)
    plt.subplot(plot_no)
    # fit model and plot learning curves for a momentum
    ...
# show learning curves
plt.show()

## Effect of Learning Rate Schedules

We will look at two learning rate schedules in this section.

The first is the decay built into the SGD class and the second is the ReduceLROnPlateau callback.
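
The idea behind reducing the learning rate on a plateau can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras callback: the function `reduce_on_plateau`, its defaults and the loss values are hypothetical, and the real callback offers further options such as `min_delta`, `cooldown` and `min_lr`.

```python
def reduce_on_plateau(losses, lrate, factor=0.1, patience=2):
    """Return the learning rate used at each epoch under a simple plateau rule."""
    best = float("inf")
    wait = 0
    schedule = []
    for loss in losses:
        schedule.append(lrate)   # rate in effect during this epoch
        if loss < best:          # improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:                    # no improvement: count the stalled epoch
            wait += 1
            if wait >= patience:
                lrate *= factor  # shrink the rate once the plateau is confirmed
                wait = 0
    return schedule


losses = [1.0, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7]  # made-up monitored losses
schedule = reduce_on_plateau(losses, lrate=0.01)
print(schedule)
```

The rate stays constant while the monitored loss keeps improving and is multiplied by `factor` only after `patience` consecutive epochs without improvement.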

## Learning Rate Decay

We will now look at the effect of adding decay to the learning rate. Decay makes the learning rate smaller as training progresses.

$$lr = \frac{lr_0}{1+ decay\cdot t} $$

Where *t* is the iteration number and *decay* is the parameter we add.
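
The formula can be evaluated directly to see how quickly each decay value shrinks the learning rate. In this minimal sketch the helper `decayed_lr` and the initial rate of 0.01 are assumptions for illustration; the decay values are the ones explored later in this section.

```python
def decayed_lr(lr0, decay, t):
    """Time-based decay: lr = lr0 / (1 + decay * t)."""
    return lr0 / (1.0 + decay * t)


for decay in (1e-1, 1e-2, 1e-3, 1e-4):
    rates = [decayed_lr(0.01, decay, t) for t in (0, 100, 1000)]
    print(f"decay={decay}: " + ", ".join(f"{r:.6f}" for r in rates))
```

At t = 0 every setting starts from the initial rate. Larger decay values cut the rate sharply within the first few hundred updates, while the smallest values leave it almost unchanged over the same span.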

<font color=red><b> Define the same model architecture, but this time add a learning rate decay. Plot the results.
<br>Hint: Add the decay to the optimizer definition</b>
</font>

In [None]:
# fit a model and plot learning curve
def fit_model_decay(trainX, trainy, testX, testy, decay):
    # define model
    ...
    # compile model
    ...
    # fit model
    history = ...
    # plot learning curves
    ...
    plt.title('decay='+str(decay), pad=-80)

# prepare dataset
trainX, trainy, testX, testy = prepare_data()
# create learning curves for different decay rates
decay_rates = [1E-1, 1E-2, 1E-3, 1E-4]
for i in range(len(decay_rates)):
    # determine the plot number
    plot_no = 220 + (i+1)
    plt.subplot(plot_no)
    # fit model and plot learning curves for a decay rate
    ...
# show learning curves
plt.show()