# Lecture 3.13: Learning Better Pt. 3

[**Lecture Slides**](https://docs.google.com/presentation/d/1gCJQZkepnwXhu-IUAYsWZJrD4eFJDNwuhqyhQMs1P4w/edit?usp=sharing)

In this lecture, we experiment with batch sizes, learning rates, and optimizers in keras to better understand neural network optimization.

**Learning goals:**
- examine the effect of batch size on training
- compare the effect of learning rates on training
- contrast the choice of optimizers
- visualize loss curves vs epochs, batches, and time

## 1. Setup

This notebook uses the keras and tensorflow deep learning libraries. If you haven't already, please follow the setup steps in notebook 3.11 to correctly install these dependencies.

## 2. Data Munging

We'll use the same banknote authentication dataset to explore neural network optimization hyperparameters. 💸 We load it into a `DataFrame`:

In [None]:
import pandas as pd

df = pd.read_csv('bank_note.csv')
df.head()

Our features are scaled and ready to go! 🏋️‍♀️ We'll use all 4 features and put them in a feature matrix:

In [None]:
X = df[['feature_1', 'feature_2', 'feature_3', 'feature_4']].values
y = df['is_fake'].values

## 3. Batch Size

### 3.1 loss vs epochs
Now that we understand what `batch_size` means, let's test different flavours of gradient descent on our neural network training.

We'll stick to the 2 hidden layers of 6 neurons with ReLU activation from last lecture. Let's wrap the neural network creation and training in one helper function called `train_neural_network()`. That way, we can iterate through hyperparameters quickly to compare their effects.

The function takes a feature matrix, `X`, a label vector, `y`, and optimization hyperparameters. It returns the loss `history` and the training `time`:

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from timeit import default_timer as timer
import numpy as np
import tensorflow as tf

def train_neural_network(X, y, optimizer='adam', **kwargs):
    # training reproducibility: seed before the model is created,
    # because the initial weights are drawn when the layers are built
    np.random.seed(1337)
    tf.random.set_seed(666)

    # create model
    model = Sequential([
        Dense(6, activation='relu', input_dim=4),
        Dense(6, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer=optimizer)
    
    # train and time model
    start = timer()
    history = model.fit(X, y, **kwargs)
    end = timer()
    
    time = end - start
    return history, time


ℹ️ Note how `**kwargs` saves us from explicitly listing all the `.fit()` arguments: any extra keyword argument given to our helper is passed straight on to `.fit()`.
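As a minimal sketch of this forwarding pattern, here is a toy version that does not need keras. The functions `fit` and `train` are hypothetical stand-ins invented for this illustration, not the real API:

```python
def fit(X, y, batch_size=32, epochs=1, callbacks=None):
    """Hypothetical stand-in for model.fit(): it only reports what it received."""
    return {'batch_size': batch_size, 'epochs': epochs, 'callbacks': callbacks}

def train(X, y, optimizer='adam', **kwargs):
    """Collect any extra keyword arguments and pass them on untouched."""
    return fit(X, y, **kwargs)

received = train([[0.0]], [0], batch_size=8, epochs=20)
print(received)
```

Arguments that are not passed, such as `callbacks` here, simply keep the default of the inner function.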

Now we can train neural networks with different `batch_size` to compare vanilla gradient descent, stochastic gradient descent, and various sizes of mini-batch gradient descent.
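To recall how the batch size selects the flavour of gradient descent, here is a rough NumPy sketch on a toy linear regression. Everything in it (the function `minibatch_gradient_descent`, the toy data, the learning rate) is an assumption made for illustration and is unrelated to the banknote network:

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size, epochs=50, learning_rate=0.1, seed=0):
    """Fit y ≈ X @ w by gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        order = rng.permutation(n_examples)           # reshuffle every epoch
        for start in range(0, n_examples, batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ w - y[batch]
            gradient = 2 * X[batch].T @ error / len(batch)
            w -= learning_rate * gradient              # one gradient descent step
        losses.append(np.mean((X @ w - y) ** 2))       # full-data loss after each epoch
    return w, losses

# toy data: 200 examples, 2 features, known weights plus a little noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 2))
true_w = np.array([1.5, -0.5])
y_toy = X_toy @ true_w + 0.01 * rng.normal(size=200)

toy_results = {}
for batch_size in (1, 32, len(X_toy)):   # stochastic, mini-batch, vanilla
    w, losses = minibatch_gradient_descent(X_toy, y_toy, batch_size)
    toy_results[batch_size] = (w, losses)
    print(f'batch_size={batch_size:>3}: final loss {losses[-1]:.5f}, weights {np.round(w, 2)}')
```

The only thing that changes between the three runs is how many examples contribute to each gradient estimate.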

This is the first time that we will meet the excellent `namedtuple`: Python is flexible and fast, and therefore has a tendency to end up cluttered with hundreds of complicated nested dictionaries 🐍. One solution is to create custom _classes_ to hold this data. However, it's not very "pythonic" to make hundreds of dedicated tiny classes for every single data object. Instead, we can use a `namedtuple`. It implements the `tuple` interface, and thus can be instantiated and unpacked easily. However, it also has named fields, so it is readable and safe like a real class. More details about this handy object [here](https://pymotw.com/2/collections/namedtuple.html).

We want to group _settings_ and _results_ together, and this happens often in machine learning experiments. `namedtuple` offers a terse and immutable alternative to dictionaries. We therefore create two named tuples:
- `Setting`, with fields `batch_size` and `epochs`
- `Result`, with fields `batch_size`, `history`, and `time`
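As a quick sketch of the behaviour described above (named access, tuple unpacking, immutability), using a `Setting` tuple with the same fields and made-up values:

```python
from collections import namedtuple

# Field names mirror the `Setting` tuple described above.
Setting = namedtuple('Setting', ['batch_size', 'epochs'])

s = Setting(batch_size=32, epochs=100)

print(s.batch_size)          # access by name
batch_size, epochs = s       # unpack like a regular tuple

try:
    s.epochs = 200           # fields cannot be reassigned
except AttributeError:
    immutable = True
else:
    immutable = False

# _replace returns a modified copy instead of mutating in place
longer = s._replace(epochs=200)
print(s, longer, immutable)
```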

We also change the `epochs` for each `batch_size` so that the training procedure doesn't take too long. Small batch sizes have more steps per epoch, and can last a while. The setting pairs chosen below aren't special in any way: they were chosen retroactively to keep the total training time under control 👮🏻.
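To get a feel for the numbers, the sketch below counts gradient updates for the mini-batch settings. It assumes the dataset holds 1372 examples (the figure quoted later in this notebook) and that the last, smaller batch of an epoch is kept. The helper `steps_per_epoch` is a name made up for this illustration:

```python
import math

n_examples = 1372  # assumed dataset size, taken from the text further down

def steps_per_epoch(n_examples, batch_size):
    """Number of gradient updates in one epoch (the last batch may be smaller)."""
    return math.ceil(n_examples / batch_size)

step_counts = {}
for batch_size, epochs in [(1, 10), (2, 20), (8, 80), (32, 100), (128, 200)]:
    steps = steps_per_epoch(n_examples, batch_size)
    step_counts[batch_size] = steps
    print(f'batch_size={batch_size:>3}: {steps:>4} steps/epoch, {steps * epochs:>5} steps in total')
```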

Let's train some neural networks!

In [None]:
from collections import namedtuple

Setting = namedtuple('Setting', ['batch_size', 'epochs'])
settings = [Setting(1, 10), 
            Setting(2, 20), 
            Setting(8, 80), 
            Setting(32, 100), 
            Setting(128, 200), 
            Setting(len(X), 1500)]

Result = namedtuple('Result', ['batch_size', 'history', 'time'])

results = []

for s in settings:
    history, time = train_neural_network(X, y, batch_size=s.batch_size, epochs=s.epochs)
    results.append(Result(s.batch_size, history, time))

The losses of each neural network are stored under `results`. We can iterate through the list to plot them on the same graph:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
colors = sns.dark_palette("palegreen")

fig = plt.figure(figsize=(8, 6), dpi=120)
ax = fig.add_subplot(111)

for i, r in enumerate(results):
    ax.plot(r.history.history['loss'], label=f'batch_size={r.batch_size}', c=colors[i], lw=3, alpha=0.8)
    
ax.set_xlabel('epochs')
ax.set_ylabel('loss')
ax.set_xlim(left= -5, right=100)
ax.legend(loc = 'upper right', ncol=3);

🤤 That's a lot of information! Small batch sizes take fewer epochs to converge. In our case, this is mostly because they have more gradient updates per epoch, even if each step is less precise than with bigger batches.

🧠 How many steps are in 1 epoch of vanilla gradient descent?

🧠 The smaller the batch size, the lower the starting loss value. Why?

We learned that stochastic gradient descent has a "bouncy" loss 🏀, but here it is smooth. This is because we are plotting _epochs_ on the x-axis, not individual _steps_. Let's try it out!

### 3.2 loss vs steps

By default, keras saves the loss at the end of each _epoch_. If we want it for each _batch_, we need to write a custom [callback](https://keras.io/api/callbacks/). Remember that callbacks are objects whose methods keras calls at set points during training, which lets us extend training functionality.

We create a `LossPerBatch` callback, which stores the neural network loss at the end of each batch. We're extending the [`Callback`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/Callback) abstract base class, meaning we only have to override the methods we're interested in:

In [None]:
from keras.callbacks import Callback

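class LossPerBatch(Callback):
    """Store the training loss reported at the end of each batch."""

    def on_train_begin(self, logs=None):
        self.history = {'loss': []}

    def on_batch_end(self, batch, logs=None):
        # keras passes the batch metrics in `logs`; avoid a mutable default argument
        logs = logs or {}
        self.history['loss'].append(logs.get('loss'))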

We can use the callback by passing it as an argument to `.fit()`. We can repeat the previous experiments, this time recording the loss at each gradient descent step:

In [None]:
settings = [Setting(1, 10), 
            Setting(2, 20), 
            Setting(8, 80), 
            Setting(32, 100), 
            Setting(128, 200), 
            Setting(len(X), 1500)]

results_per_batch = []

for s in settings:
    history = LossPerBatch()
    _, time = train_neural_network(X, y, batch_size=s.batch_size, epochs=s.epochs, callbacks=[history])
    results_per_batch.append(Result(s.batch_size, history, time))

The `history` field stored in each `Result` is now the `LossPerBatch` history, so let's visualize it:

In [None]:
fig = plt.figure(figsize=(8, 6), dpi=120)
ax = fig.add_subplot(111)

for i, r in enumerate(results_per_batch):
    ax.plot(r.history.history['loss'], label=f'batch_size={r.batch_size}', alpha=0.6, lw=1, c=colors[i])
    
ax.set_xlabel('steps')
ax.set_ylabel('loss')
ax.legend(loc = 'upper right', ncol=3);

This is quite a confusing graph 😟 The stochasticity of some of the loss curves (looking at you `batch_size=1` 😡) has rendered the graph illegible. Let's pick a few batch sizes to compare side by side:

In [None]:
def compare_batch_sizes(results):
    height = len(results) * 4
    fig = plt.figure(figsize=(8, height), dpi=120)
    for i, r in enumerate(results):
        ax = fig.add_subplot(len(results), 1, i+1)
        ax.plot(r.history.history['loss'], label=f'batch_size={r.batch_size}', alpha=0.8, lw=1, c=colors[3])
        ax.set_xlabel('steps')
        ax.set_ylabel('loss')
        ax.legend()
    
compare_batch_sizes([results_per_batch[0], results_per_batch[3], results_per_batch[5]])


- The single example updates of stochastic gradient descent create a lot of noise, which seems to slow down its convergence. It struggles to stay close to the global minimum even after reaching it.
- Vanilla gradient descent's updates are precise, but the smoothness might get it stuck in local minima for non-convex loss surfaces.
- Mini-batch gradient descent has just enough stochasticity to be robust yet fast.

The per-step efficiency of vanilla gradient descent is still the best here, and it seems to converge to the same final loss value as the other batch sizes. This suggests that the loss surface isn't very "rough" and that bad local minima are only a minor concern for this dataset and neural architecture.
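When per-step curves are too noisy to read, one common option is to smooth them with a moving average before plotting. The sketch below is only a suggestion: the helper `moving_average`, the window size, and the synthetic curve are all assumptions made for illustration:

```python
import numpy as np

def moving_average(values, window=50):
    """Average each run of `window` consecutive values (output is shorter by window - 1)."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')

# synthetic noisy curve standing in for a per-step loss history
rng = np.random.default_rng(0)
steps = np.arange(1000)
noisy_loss = np.exp(-steps / 200) + 0.2 * rng.normal(size=steps.size)

smooth_loss = moving_average(noisy_loss, window=50)
print(len(noisy_loss), len(smooth_loss))
```

With the results above, something like `ax.plot(moving_average(r.history.history['loss']))` would draw the smoothed curve, at the cost of hiding the very stochasticity we set out to observe.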

### 3.3 loss vs time

We have shown which batch sizes are most efficient per epoch and per step. However, we described earlier that batch size influences the _time_ spent on each step. ⏱ Let's find out which batch size is the _fastest_ to converge. 

Let's assume that each epoch takes roughly the same amount of time. Since we know the _total_ training time, we can convert the _epochs_ unit to _seconds_. An easy way of converting the units of a regularly spaced vector is to rescale its values using `np.linspace()`.
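As a small illustration of this rescaling, with a made-up training time and made-up loss values:

```python
import numpy as np

total_time = 12.0                        # hypothetical training time in seconds
loss = [0.9, 0.6, 0.4, 0.3, 0.25]        # hypothetical loss after each of 5 epochs

# spread the epochs evenly between 0 and total_time
x_time = np.linspace(0, total_time, len(loss))
print(x_time)

# alternative: treat each loss as logged at the END of its epoch
x_time_end = np.linspace(total_time / len(loss), total_time, len(loss))
print(x_time_end)
```

The first variant starts the curve at 0 seconds. The second shifts every point to the end of its epoch, which is arguably closer to when the loss was recorded. For long runs the difference is negligible.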

In [None]:
fig = plt.figure(figsize=(8, 6), dpi=120)
ax = fig.add_subplot(111)

for i, r in enumerate(results):
    loss = r.history.history['loss']
    x_time = np.linspace(0, r.time, len(loss))
    ax.plot(x_time, loss, label=f'batch_size={r.batch_size}', lw=2, alpha=0.8)
    
ax.set_xlabel('time / s')
ax.set_ylabel('loss')
ax.set_xlim(left= -0.5, right=6)
ax.legend(loc = 'upper right', ncol=3);

Interestingly, the fastest batch size is neither 1 nor 1372. It seems that 128 examples per gradient update strikes the right balance between step speed and step accuracy, and so converges the quickest.

Of course speed isn't the only factor, and one should also pick hyperparameters which find 'good' minima. In this example however, all these settings seem to converge well.

ℹ️ Next lecture we will cover GPU & parallelisation in machine learning. This will speed up training even further and change the relationship between batch size and training time. Remember to choose the best and fastest settings for a given dataset, task, and _hardware_.

## 4. Learning Rates

Now that we have a feel for which batch size is best for our optimization problem, let's investigate the effects of the _learning rate_ on training. We reuse the `train_neural_network()` helper function to iterate through a range of learning rates. The learning rate must be set on an [`Optimizer`](https://keras.io/api/optimizers/), which is then provided as an argument to the model compilation step.

We use `SGD` here, because we want to showcase the effects of a constant learning rate on gradient descent. Adaptive learning rate optimizers like `adam` will change the learning rate throughout the training, and muddle our results.

We modify our `Result` namedtuple to hold the `learning_rate` instead of the `batch_size`:

In [None]:
from keras.optimizers import SGD

learning_rates = [0.001, 0.01, 0.1, 1, 10]
Result = namedtuple('Result', ['learning_rate', 'history', 'time'])

results = []

for lr in learning_rates:
    optimizer = SGD(learning_rate=lr)
    history, time = train_neural_network(X, y, optimizer=optimizer, batch_size=32, epochs=100)
    results.append(Result(lr, history, time))

We can now compare our results:

In [None]:
fig = plt.figure(figsize=(8, 6), dpi=120)
ax = fig.add_subplot(111)

for i, r in enumerate(results):
    ax.plot(r.history.history['loss'], label=r.learning_rate, alpha=0.8, lw=3, c=colors[i])
    
ax.set_xlabel('epochs')
ax.set_ylabel('loss')
ax.legend(loc = 'upper right', ncol=6);

This resembles the graph shown in the lecture slides:
- 0.001 is too slow
- 10 is too high, and fails to find the loss surface minimum
- 1 minimizes the loss the fastest  

These profiles are typical: one step size hits the right balance between convergence quality and speed.

ℹ️ Note however that these _values_ aren't very common. For more non-convex loss surfaces, learning rates $< 0.001$ are more the norm. However, just like batch size, this is a _hyperparameter_ that must be tuned to the task at hand.

Keep in mind that these experiments were run with the same `batch_size` and `epochs`. Therefore the time spent per epoch is the same for all these curves (see keras logs). The steepest curve on this graph is therefore also the fastest in seconds.
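The three regimes described above (too slow, well tuned, too high) can be reproduced on a toy problem. The sketch below minimizes $f(w) = w^2$ with a constant step size. The function `gradient_descent` and the three learning rates are chosen for this illustration only and do not correspond to the values used on the neural network:

```python
def gradient_descent(learning_rate, w=1.0, steps=20):
    """Minimize f(w) = w**2 with a constant step size; returns the loss after each step."""
    losses = []
    for _ in range(steps):
        gradient = 2 * w                 # derivative of w**2
        w = w - learning_rate * gradient
        losses.append(w ** 2)
    return losses

final_losses = {}
for lr in (0.001, 0.4, 1.1):
    losses = gradient_descent(lr)
    final_losses[lr] = losses[-1]
    print(f'learning_rate={lr}: loss after 20 steps = {losses[-1]:.3g}')
```

Each update multiplies `w` by `1 - 2 * learning_rate`, so the loss shrinks only while that factor stays strictly between -1 and 1. The smallest rate barely moves, the middle one converges quickly, and the largest one overshoots further at every step.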

## 5. Optimizers

In the lecture slides, we understood the principles of momentum and adaptive learning rates behind the ✨**adam**✨ optimizer. We also explained that there are _many_ other optimizers, some of which might be better suited to our learning task. Let's try _all_ of keras' optimizers and see which minimizes our loss function the best.
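To make the two ingredients concrete, here is a minimal pure-Python sketch of the Adam update rule for a single parameter. It follows the commonly published form of the algorithm with typical default constants. It is an illustration rather than the keras implementation, and the function name `adam_minimize` is made up:

```python
import math

def adam_minimize(gradient_fn, w, learning_rate=0.1, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-7, steps=500):
    """Minimize a 1-D function with the Adam update rule."""
    m = 0.0   # running mean of gradients (momentum)
    v = 0.0   # running mean of squared gradients (adaptive scaling)
    for t in range(1, steps + 1):
        g = gradient_fn(w)
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g ** 2
        m_hat = m / (1 - beta_1 ** t)     # bias correction for the zero start
        v_hat = v / (1 - beta_2 ** t)
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w

w_final = adam_minimize(lambda w: 2 * w, w=5.0)   # gradient of f(w) = w**2
print(w_final)
```

The momentum term `m` smooths the direction of travel, while dividing by the square root of `v` rescales the step according to the recent gradient magnitude.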


We'll want to peek at the loss per epoch and per step. We therefore update our `Result` namedtuple to hold both, so we don't have to run the experiments twice. As pointed out last lecture, the `optimizer` is given to the `.compile()` method.

In [None]:
optimizers = ['sgd', 'rmsprop', 'adam', 'adadelta', 'adagrad', 'adamax', 'nadam']
Result = namedtuple('Result', ['optimizer', 'history', 'history_per_batch', 'time'])

results = []

for o in optimizers:
    history_per_batch = LossPerBatch()
    history, time = train_neural_network(X, y, optimizer=o, batch_size=32, epochs=100, callbacks=[history_per_batch])
    results.append(Result(o, history, history_per_batch, time))

In [None]:
fig = plt.figure(figsize=(8, 6), dpi=120)
ax = fig.add_subplot(111)

for r in results:
    ax.plot(r.history.history['loss'], label=r.optimizer, alpha=0.8)
    
ax.set_xlabel('epochs')
ax.set_ylabel('loss')
ax.legend(loc = 'upper right', ncol=3);

There is a wide range of convergence speeds. 
- the constant learning rate `sgd` is far behind, not having converged by 100 epochs
- `nadam` leads with an impressive ~ 30 epochs to convergence 🏎 
- `adam` also seems to be a solid optimizer in 3rd place

We'd like to see the loss per-step, but we anticipate those curves to be noisy. So let's pick three to compare side by side:
- regular ol' `sgd` (mini-batch gradient descent with constant learning rate) 👴
- the messiah, `adam` 😇
- the prophet's nemesis, `nadam` 😈

In [None]:
def compare_optimizers(results):
    height = len(results) * 4
    fig = plt.figure(figsize=(8, height), dpi=120)
    for i, r in enumerate(results):
        ax = fig.add_subplot(len(results), 1, i+1)
        ax.plot(r.history_per_batch.history['loss'], label=r.optimizer, alpha=0.8, lw=1, c=colors[3])
        ax.set_xlabel('steps')
        ax.set_ylabel('loss')
        ax.legend()
    
compare_optimizers([results[0], results[2], results[6]])


`adam` might have lost this race, but we can clearly see that advanced optimizers speed up loss function minimization. 🎊

In fact, notice that the adaptive learning rate methods also reduce the stochasticity later in the training, meaning they also improve the _quality_ of the optimization.


The analysis above only shows that `nadam` is the fastest optimizer per _epoch_ and per _step_. Indeed, one of these methods could take longer to _calculate_, and result in a longer total training _time_.

💪💪 Plot the loss curves of each optimizer versus training time.
- see section 3.3 for an example
- the graph is the unit test 🙃

In [None]:
# INSERT YOUR CODE HERE

🧠 Is this what you expected? Does this confirm that `nadam` is the fastest optimizer for our problem?

🧠🧠 How can you tell that `nadam` is the slowest _calculation_ per gradient descent update?

## 6. Learning Rate Scheduling

Learning rate schedules are mostly used for their _restarts_ and _warmups_. Recall that these "bumps" in learning rate can help the descent jump out of bad minima or escape flat areas of the loss surface. 🤸‍♂️

Our optimization problem has shown no issues with bad local minima. There is therefore no point trying fancy learning rate schedules, as they probably wouldn't improve anything. 😔

If you are interested however, learning rate schedules can be implemented in keras using the [LearningRateScheduler](https://keras.io/api/callbacks/learning_rate_scheduler/) class, or by writing a custom [callback](https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/callbacks/cyclical_learning_rate.py).
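For the curious, here is a rough sketch of what a schedule with restarts could look like as a plain Python function. The name `cosine_with_restarts` and all its default values are assumptions made for illustration. The function takes `(epoch, lr)` so that it matches the shape of schedule function that `LearningRateScheduler` expects, but it is not connected to a model here:

```python
import math

def cosine_with_restarts(epoch, lr=None, lr_max=0.1, lr_min=0.001, cycle_length=10):
    """Cosine decay from lr_max towards lr_min, restarting every `cycle_length` epochs."""
    position = (epoch % cycle_length) / cycle_length      # progress within the cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * position))

schedule = [cosine_with_restarts(epoch) for epoch in range(30)]
print([round(lr, 4) for lr in schedule[:12]])

# To use it during training (assumption: not run here), the function could be
# wrapped as keras.callbacks.LearningRateScheduler(cosine_with_restarts) and
# passed to .fit() through the callbacks argument.
```

Each "bump" back up to `lr_max` is a restart. The current learning rate `lr` is accepted but ignored, since this schedule depends only on the epoch number.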

## 7. Summary

Today, we learned about **advanced optimization** algorithms. First, we described the struggles of vanilla gradient descent with **non-convex** functions. We then identified the benefits of **stochasticity**, and how reducing the **batch size** can help navigate bumpy neural network losses. We recognised **mini-batch gradient descent** as the most common gradient descent flavour today, and defined an **epoch** as a complete pass through the training data. We then highlighted the importance of the **learning rate**, i.e. the gradient descent step size. We saw how **learning rate scheduling** can help find a better minimum, faster. We also learned that advanced optimizers such as **adam** use momentum and adaptive learning rates to improve optimization. We recognized that neural network training is **complicated**, and that the best methods vary with tasks and datasets. Finally, we investigated these techniques ourselves by testing optimization hyperparameter combinations on the banknote authentication dataset.

### Core Resources

- [**Slides**](https://docs.google.com/presentation/d/1gCJQZkepnwXhu-IUAYsWZJrD4eFJDNwuhqyhQMs1P4w/edit?usp=sharing)
- [Ruder on better gradient descent](https://ruder.io/optimizing-gradient-descent/)  
Classic blog post with clear explanations of the main neural network optimizers

### Additional Resources

- [Animating gradient descent](http://louistiao.me/notes/visualizing-and-animating-optimization-algorithms-with-matplotlib/)  
Matplotlib visualization of gradient descent algorithms
- [Loss function visualization](http://www.telesens.co/loss-landscape-viz/viewer.html)  
Interactive app to visualize common CV loss landscapes
- [More Ruder on better gradient descent](https://ruder.io/deep-learning-optimization-2017/)  
Follow up to the classic blog post
- [Recent gradient descent algorithms](https://johnchenresearch.github.io/demon/)  
Even more recent follow up about advanced optimization methods
- [Improving the way we work with learning rates](https://techburst.io/improving-the-way-we-work-with-learning-rate-5e99554f163b)  
Blogpost focusing on learning rate scheduling and restarts
- [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)  
Classic paper going deep into backpropagation and gradient descent algorithms for neural networks
- [Practical recommendations for gradient-based training of deep architectures](https://arxiv.org/pdf/1206.5533v2.pdf)  
Also classic detailed paper on how to get your neural networks to actually work
- [Loss functions tumblr](https://lossfunctions.tumblr.com/)  
Some fun loss curves
- [Loss landscapes](https://losslandscape.com/)  
Beautiful renderings of neural loss landscapes