<a href="https://colab.research.google.com/github/erodola/ML-s2-2024/blob/main/labs/03_SGD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning

# Tutorial 3: Stochastic Gradient Descent

In this tutorial, we will cover:

- Features and feature scaling
- Multinomial regression
- Logistic regression with scikit-learn
- SGD and momentum

Authors:

- Prof. Emanuele Rodolà

Course:

- Lectures and notebooks at https://github.com/erodola/ML-s2-2024/

# Imports and utilities

In [None]:
# @title import dependencies

from typing import Mapping, Union, Optional, Tuple
import argparse

import numpy as np

from tqdm.notebook import tqdm

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import matplotlib.pyplot as plt

from PIL import Image

import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

In [None]:
# @title reproducibility stuff

import random
np.random.seed(42)
random.seed(0)

# Logistic Regression

## Binary model

In theory class, we saw the logistic regression model as a **binary classifier**. If we have a bunch of data points (e.g. images) $\mathbf{x}_i$, the classifier outputs a number $p_i \in [0,1]$ for each $\mathbf{x}_i$, which we interpret as the probability that $\mathbf{x}_i$ belongs to class $1$:

$$ p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b) $$

To find the optimal $\mathbf{w}$, we need to solve an optimization problem where $\mathbf{w}$ is the unknown, and $(\mathbf{x}_i, y_i)$ are the data; note that $y_i$ are the true labels, while $p_i$ are our model's predictions. We define the loss:

$$ \ell(\mathbf{w}) = -\sum_i ( y_i\log p_i + (1-y_i)\log (1-p_i)) $$
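
To make the two formulas concrete, here is a minimal NumPy sketch that evaluates the predictions $p_i$ and the loss on a tiny made-up dataset. The helper names (`sigmoid`, `binary_cross_entropy`) and the toy numbers are illustrative assumptions, not part of the course code.

```python
import numpy as np

def sigmoid(z):
    # logistic function, maps any real number to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(w, b, X, y):
    # p_i = sigma(w^T x_i + b), one prediction per row of X
    p = sigmoid(X @ w + b)
    # summed (not averaged) loss, as in the formula above
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy data: 4 samples with 2 features each (made-up values)
X_toy = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [-0.2, -1.2]])
y_toy = np.array([1, 1, 0, 0])

loss_zero = binary_cross_entropy(np.zeros(2), 0.0, X_toy, y_toy)
loss_good = binary_cross_entropy(np.array([2.0, 1.0]), 0.0, X_toy, y_toy)
print(loss_zero, loss_good)
```

With all-zero parameters every prediction is 0.5, so the loss equals $4 \log 2$; weights that separate the two classes give a lower value.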

Let's solve this with Scikit-learn!

> **EXERCISE**: Check scikit-learn's docs on binary logistic regression. Two additional terms appear: $s_i$ and $r(w)$. What are they?

## Data loading

To get familiar with the problem, we are going to solve it using a black-box approach and then interpret the results. After this, we will write our own solver!

First, let's load some data:

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

digits.data.shape  # there are 1797 images, each with 8x8=64 pixels

Our training data is not just the images; we also need their ground-truth labels:

In [None]:
digits.target.shape

Always visualize the data:

In [None]:
random_idx = np.random.randint(0, digits.data.shape[0], 5)
plt.figure(figsize=(6,2))
for i, idx in enumerate(random_idx):
  plt.subplot(1, 5, i + 1)
  plt.imshow(np.reshape(digits.data[idx], (8,8)), cmap=plt.cm.gray)
  plt.title(f"label: {digits.target[idx]}")
  plt.axis("off")

> **EXERCISE**: Only visualize the **7** digits. Plot 16 of them in a 4x4 grid.

## Multinomial logistic regression

Hang on, we know how to solve a _binary_ classification task, but now we have 10 classes, one per digit. This makes it a _multiclass_ problem, and we need a new learning model to deal with it. Enter **multinomial classification**.

Here's the good old binary logistic regression model:

$$ p = \sigma(\mathbf{w}^\top \mathbf{x}_i + b) $$

In the multinomial case, we now have:

$$ \mathbf{p} = \text{softmax} (\mathbf{W}^\top \mathbf{x}_i + \mathbf{b}) \,.$$

where $\mathbf{p}$ represents the _probabilities_ over 10 classes.

For example, if $\mathbf{p}=(0, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0.1)$, it means that $\mathbf{x}_i$ belongs to the second class with probability 0.8, to the third class with probability 0.1, and so on. Also observe that the **weight matrix** $\mathbf{W}$ has as many columns as the number of classes (one weight vector per class, since we compute $\mathbf{W}^\top \mathbf{x}_i$), and that the **bias** is now a vector containing one value per class.

What is that _softmax_ function? Formally, it is defined as follows:

$$\text{softmax}(\mathbf{x}) = \left( \frac{\exp(x_0)}{\sum_{j}\exp(x_j)}, \frac{\exp(x_1)}{\sum_{j}\exp(x_j)}, \dots, \frac{\exp(x_9)}{\sum_{j}\exp(x_j)} \right)$$

By construction, $\text{softmax}(\mathbf{x})$ sums up to one and each element is in $[0,1]$; therefore we can treat the softmax vector as a discrete probability distribution over the set of classes.



The following plot gives a useful intuition of how softmax behaves; we'll do this by considering $\exp(\alpha x)$ with different $\alpha$:

In [None]:
x = np.random.rand(40)

In [None]:
# @title Softmax: crank up the alpha!  { run: "auto" }

import plotly.graph_objects as go

alpha = 1  #@param {type:"slider", min:1, max:50, step:1}

sx = np.exp(alpha*x)
sx /= sx.sum()

fig = go.Figure()
# fig.add_trace(go.Bar(y=x, name='x', marker_color='blue'))
fig.add_trace(go.Bar(y=sx, name='sx', marker_color='red'))
fig.update_layout(barmode='group', title='Softmax of a random vector', width=800, height=300)
fig.show()

In [None]:
# sums up to one
print(f"{np.sum(sx):.1f}")

# all values are in [0,1]
np.all((sx >= 0) & (sx <= 1))

Do you now have an idea of why it's called soft**max**? One useful way to think about the softmax is as a smooth approximation of the indicator of the largest entry: as $\alpha$ grows, almost all the mass concentrates on the maximum.
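
The same behaviour can be checked numerically, without a plot. The sketch below uses a hypothetical helper `softmax_alpha` and made-up scores; subtracting the maximum before exponentiating is an extra safeguard against overflow and does not change the result.

```python
import numpy as np

def softmax_alpha(v, alpha):
    # softmax of alpha * v; subtracting the max avoids overflow for large alpha
    z = alpha * (v - np.max(v))
    e = np.exp(z)
    return e / e.sum()

v = np.array([0.1, 0.7, 0.3, 0.65])  # made-up scores
for alpha in [1, 10, 100]:
    print(alpha, np.round(softmax_alpha(v, alpha), 3))

# one-hot vector marking the position of the maximum
one_hot_max = np.eye(len(v))[np.argmax(v)]
```

As `alpha` increases, the printed vectors get closer and closer to `one_hot_max`.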

## Logits

Another bit of terminology:

$$ \text{softmax}(\underbrace{\mathbf{W}^\top \mathbf{x}_i + \mathbf{b}}_{\mathbf{z}_i}) $$

The $\mathbf{z}_i$ are also called the **logits** in ML lingo. Familiarize yourself with this term, as it occurs often!

You can think of the logits as raw class scores that get converted into probabilities by the softmax.


## Pixels as features

So far we have been treating each image $\mathbf{x}_i$ as _flattened_: it's a $k$-dimensional vector, where $k$ is the number of pixels. In ML, we call $\mathbf{x}_i$ the **feature vector** of image $i$. In this vector, **each dimension is a feature**.

In [None]:
xi = digits.data[17]
print(f"{len(xi)} features")

plt.figure(figsize=(8, 2))
_ = plt.bar(range(len(xi)), xi)

Clearly, in this case each feature is a pixel. From this plot we also learned that the maximum value (corresponding to a white pixel) is 16.

Other features are possible. For instance, a possible (but not very informative) feature vector for a grayscale image might be its intensity histogram:

In [None]:
hist, _ = np.histogram(xi, bins='auto')

plt.figure(figsize=(8, 2))
_ = plt.bar(range(len(hist)), hist)

Let's stay _raw_ for now, and use pixel features as input to our learning algorithm.

## Training with scikit-learn

Now that we have pinned down the terminology, we are ready to set up our learning model for logistic regression and train it:

In [None]:
from sklearn.linear_model import LogisticRegression

reg = LogisticRegression(max_iter=10)
_ = reg.fit(digits.data, digits.target)

Hang on, what are those warnings? `lbfgs` failed to converge, meaning that the solver ([LBFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) in this case, but we will implement SGD) didn't reach a minimum. Indeed, we specified `max_iter=10`, so it didn't have enough iterations to get there. Increase that value to 100 or 1000 and see what happens!

## Feature scaling

Apparently, even with `max_iter=1000` we keep getting a warning. A possible problem is that **our data is not normalized**.

Normalizing the data means (i) centering it around zero, and (ii) rescaling it to have unit standard deviation:

$$ x \mapsto \frac{x - \mu}{\sigma} $$

Here, $x$ is a **feature** and $\mu$ is the mean of that feature over the entire training set. Similarly, $\sigma$ is the standard deviation.

For example, if the feature is a pixel at position $(12,8)$, we must take the mean over all the pixels at position $(12,8)$ in the entire dataset. In other words, centering and scaling must be done _independently on each feature_.

Feature scaling is also referred to as **standardization**.

> **EXERCISE**: Create a `standard_scale` function implementing feature scaling, and apply it to the input data.

In [None]:
# ✏️ your code here

#X = standard_scale(digits.data)

In [None]:
# @title 👀 Solution

def standard_scale(X):
  means = np.mean(X, axis=0)  # one mean per feature, computed across all samples
  stds = np.std(X, axis=0)
  stds[stds < 1e-6] = 1.  # avoid division by very small values
  X_scaled = (X - means) / stds
  return X_scaled

X = standard_scale(digits.data)
# X = sklearn.preprocessing.scale(digits.data, axis=0)  # scikit-learn version

> **EXERCISE:** Is it possible that a feature has zero standard deviation? When does it happen with our current data?

Standardization is important for numerical stability: if a feature has a variance that is orders of magnitude larger than others, it might dominate the loss and make the training unable to learn from other features.

In [None]:
reg = LogisticRegression(max_iter=100)
_ = reg.fit(X, digits.target)

The warnings are gone, and the optimizer converged!

## Evaluation

We can now evaluate the quality of our model. By the way, did you notice that we have as many weights as we have features (i.e. image pixels in this case)? It's obvious if we look at the model expressions:

$$ \sigma(\mathbf{w}^\top\mathbf{x}_i + b) \quad , \quad
\text{softmax} (\mathbf{W}^\top \mathbf{x}_i + \mathbf{b})$$

This suggests that we can reshape the set of weights (just the $\mathbf{w}$ or $\mathbf{W}$) as an image, and have a look at them!


In [None]:
plt.figure(figsize=(10,2))
for i in range(10):
  plt.subplot(1, 10, i + 1)
  plt.imshow(reg.coef_[i].reshape(8, 8), cmap=plt.cm.viridis)
  plt.title(f"{i}")
  plt.axis("off")

Not so easy to read, but look at the weights responsible for classifying the **0** digit, which are perhaps clearer. That image is a "0" classifier. It will not always be possible to get such a clear intuition just by looking at the learned weights, but studying them is a valid research direction in itself!

To get a _quantitative_ evaluation, we can compute the hit rate, i.e., the percentage of correctly classified images:

In [None]:
preds = reg.predict(X)

# calculate the fraction of correctly classified images
accuracy = np.sum(preds == digits.target) / len(digits.target)

print(f"{100*accuracy:.2f}%")

If you did everything correctly, you should see an accuracy larger than 99%. Is it a perfect classifier? To answer this question, we must test how well the model generalizes to unseen data. But we don't have any new data to use!

Let's restart from scratch, but this time we'll split the available images into 75% **training** data and use the remaining 25% as **validation** data.

In [None]:
X = digits.data.copy()
labels = digits.target.copy()

def shuffle(X_, y_):
  shuffled_idx = np.random.permutation(X_.shape[0])
  return X_[shuffled_idx], y_[shuffled_idx]

# shuffle the data to avoid any bias
X, labels = shuffle(X, labels)

# prepare training and validation sets
n_train = int(0.75 * X.shape[0])
X_train = X[:n_train]
X_valid = X[n_train:]

# center and scale
X_train = standard_scale(X_train)
X_valid = standard_scale(X_valid)

⚠️ **Leakage warning:** A common mistake is to rescale data *before* splitting into training and validation sets. _This will bias the model_ because statistics of the _validation set_ are carried over to the training set.
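
The cell above scales each split with its own statistics. A common alternative, sketched below under stated assumptions, is to compute the mean and standard deviation on the training split only and reuse them for the validation split. The helpers `fit_scaler` and `apply_scaler` are hypothetical names, and the data is a synthetic stand-in rather than the digits.

```python
import numpy as np

def fit_scaler(X_fit):
    # per-feature statistics, computed on the training data only
    means = X_fit.mean(axis=0)
    stds = X_fit.std(axis=0)
    stds[stds < 1e-6] = 1.0  # constant features: avoid dividing by ~0
    return means, stds

def apply_scaler(X, means, stds):
    # reuse the same statistics for any split
    return (X - means) / stds

rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # synthetic stand-in data
X_tr, X_va = X_all[:150], X_all[150:]

means, stds = fit_scaler(X_tr)
X_tr_scaled = apply_scaler(X_tr, means, stds)
X_va_scaled = apply_scaler(X_va, means, stds)
```

With this variant the validation features are not exactly zero-mean, which is expected: they are measured on the scale learned from the training data. scikit-learn's `StandardScaler` follows the same pattern with its `fit` and `transform` methods.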

Let's retrain our model, and test it on the validation set:

In [None]:
reg = LogisticRegression(max_iter=100)
_ = reg.fit(X_train, labels[:n_train])

preds = reg.predict(X_valid)
accuracy = np.sum(preds == labels[n_train:]) / (X.shape[0] - n_train)

print(f"{100*accuracy:.2f}%")

Lower than before, but still good! Let's look at the **misclassified images**, i.e. the images from the validation set that did _not_ get a correct prediction:

In [None]:
wrong_idx = np.where(preds != labels[n_train:])[0]

plt.figure(figsize=(6,2))
for i, idx in enumerate(wrong_idx[:5]):
  plt.subplot(1, 5, i + 1)
  plt.imshow(np.reshape(X_valid[idx], (8,8)), cmap=plt.cm.gray)
  plt.title(f"predict: {preds[idx]}")
  plt.axis("off")

Hmm, these digits are not so easy to read (especially after normalization!). Let's move to a more readable dataset that provides a standard baseline in ML and DL: the **MNIST dataset**.

## Classifying MNIST digits

MNIST is a dataset consisting of 70,000 handwritten digits, and it's typically used to prototype ideas and new methods. We will download it directly from the web, as it is not among the small datasets bundled with scikit-learn. The code below already takes care of the training/validation splitting:

In [None]:
!wget https://s3.amazonaws.com/img-datasets/mnist.npz

def load_data_impl():
    # file retrieved by:
    #   wget https://s3.amazonaws.com/img-datasets/mnist.npz -O code/dlgo/nn/mnist.npz
    # code based on:
    #   site-packages/keras/datasets/mnist.py
    path = 'mnist.npz'
    f = np.load(path)
    x_train, y_train = f['x_train'].reshape(-1, 784), f['y_train']
    x_test, y_test = f['x_test'].reshape(-1, 784), f['y_test']
    f.close()
    return (x_train.astype(np.float32), y_train), (x_test.astype(np.float32), y_test)

(x_train, y_train), (x_valid, y_valid) = load_data_impl()

# bring pixel values to [0, 1]
x_train /= 255
x_valid /= 255

Each MNIST image is 28x28 (=784 pixels).

> **EXERCISE:** Visualize a few MNIST images from the training set, _before_ normalization and _after_ normalization. Make sure to normalize all the data before proceeding to the next cells.

In [None]:
# ✏️ your code here


In [None]:
# @title 👀 Solution

plt.figure(figsize=(6,2))
for i in range(5):
  plt.subplot(1, 5, i + 1)
  plt.imshow(x_train[i].reshape(28, 28), cmap=plt.cm.gray)
  plt.axis("off")

x_train = standard_scale(x_train)

plt.figure(figsize=(6,2))
for i in range(5):
  plt.subplot(1, 5, i + 1)
  plt.imshow(x_train[i].reshape(28, 28), cmap=plt.cm.gray)
  plt.axis("off")

x_valid = standard_scale(x_valid)

> **EXERCISE:** You know what to do: train a classifier and evaluate it. Can you get **above 95%** accuracy?

In [None]:
# ✏️ your code here


It's always a good idea to have a look at the misclassified samples:

In [None]:
# ✏️ your code here


# Stochastic Gradient Descent (SGD)

In our experiments so far, we trusted scikit-learn's implementation of the logistic regression model, together with its training (the `fit()` function). We are now going to implement our own training procedure via gradient descent:

$$\mathbf{W}^{(t+1)} = \mathbf{W}^{(t)} - \alpha \nabla \ell(\mathbf{W}^{(t)})$$
with
$$\nabla \ell(\mathbf{W}^{(t)}) = \frac{1}{m} \sum_{i=1}^m \nabla \ell_{\{\mathbf{x}_i, y_i\}}(\mathbf{W}^{(t)}) $$

Here $\alpha$ is the learning rate, $m$ is the number of samples used to compute the gradient, and $n$ is the size of the training set. For $m \ll n$ we get **stochastic** gradient descent, as we have seen in theory class.

_Note:_ Remember that we also have the update equations for the bias $\mathbf{b}$; we are not writing them here for brevity.
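
To see in what sense a mini-batch gradient approximates the full one, here is a small sketch on a toy least-squares problem. This is an assumption made for brevity: it is not the loss used in this notebook, and all names and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f, m = 1000, 3, 50                # n samples, f features, batch size m
X_toy = rng.normal(size=(n, f))
y_toy = rng.normal(size=n)
w = rng.normal(size=f)

def grad_mse(Xb, yb, w):
    # gradient of the mean of 0.5 * (x_i^T w - y_i)^2 over the batch
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad_mse(X_toy, y_toy, w)

# one mini-batch gradient: a noisy estimate of full_grad
idx = rng.choice(n, size=m, replace=False)
mini_grad = grad_mse(X_toy[idx], y_toy[idx], w)

# averaging over a partition of the data into equal-size batches recovers full_grad
perm = rng.permutation(n)
batch_grads = [grad_mse(X_toy[perm[s:s + m]], y_toy[perm[s:s + m]], w)
               for s in range(0, n, m)]
avg_grad = np.mean(batch_grads, axis=0)
print(np.linalg.norm(mini_grad - full_grad), np.linalg.norm(avg_grad - full_grad))
```

A single mini-batch gives a cheap but noisy direction, while the average over all the mini-batches of an epoch matches the full gradient at the same point.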

Let's first prepare some data:

In [None]:
# we'll use the small digit data again
from sklearn.datasets import load_digits
digits = load_digits()

X = digits.data.copy()
labels = digits.target.copy()
n_classes = 10

# shuffle the data to avoid any bias
X, labels = shuffle(X, labels)

# prepare training and validation sets
n_train = int(0.75 * X.shape[0])
X_train = X[:n_train]
y_train = labels[:n_train]
X_valid = X[n_train:]
y_valid = labels[n_train:]

# center and scale
X_train = standard_scale(X_train)
X_valid = standard_scale(X_valid)

## The loss

Before computing the gradient, we still need to define a loss! We have cross-entropy for the binary setting, but what about the multinomial case?

Here's our learning model:

$$ \mathbf{p} = \text{softmax} (\mathbf{W}^\top \mathbf{x}_i + \mathbf{b}) $$

The first thing we do is modify it slightly, to output the _log-probabilities_ instead of the probabilities, namely:

$$ \log(\mathbf{p}) = \log(\text{softmax} (\mathbf{W}^\top \mathbf{x}_i + \mathbf{b}) ) $$

> **EXERCISE:** _(Pen-and-paper)_ Rewrite the formula for $\log(\mathbf{p})$ by explicitly applying the $\log$ to $\text{softmax}$.

You can use the result from the simplified formula to solve the next exercise.

> **EXERCISE:** Write a function `log_softmax()` that takes as input a matrix `t` of logits (one row per sample) and returns its row-wise log(softmax).

In [None]:
# ✏️ your code here


In [None]:
# @title 👀 Solution

def log_softmax(t):
  return t - np.log(np.sum(np.exp(t), axis=1))[:, None]
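
The solution above is fine for the small logits of this notebook, but `np.exp` overflows for large inputs. A common remedy, sketched here with a hypothetical function name, is to subtract the row-wise maximum before exponentiating; this leaves the result mathematically unchanged.

```python
import numpy as np

def log_softmax_stable(t):
    # t: (batch, classes) matrix of logits
    # subtracting the row-wise max leaves the result unchanged
    # but keeps np.exp from overflowing
    shifted = t - np.max(t, axis=1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

small = np.array([[1.0, 2.0, 3.0]])
large = small + 1000.0  # same differences between logits, much larger values
print(log_softmax_stable(small))
print(log_softmax_stable(large))
```

Both calls print the same values, because the log-softmax only depends on the differences between logits.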

Let's compute some predictions on a minibatch of 50 images, using random weights and biases:

In [None]:
def model(xb):
  return log_softmax(xb @ weights + bias)  # WARNING: this is using the _global_ variables 'weights' and 'bias'!
                                           #          not a good coding style, but ok for this notebook.

m = 50
batch = X_train[:m]

weights = np.random.rand(X_train.shape[1], n_classes)
bias = np.random.rand(1, n_classes)

preds = model(batch)
preds[17]  # the predicted log-probabilities for image 17

For the loss itself we'll use the **negative log-likelihood** (NLL). For an image $\mathbf{x}_i$ with true class label $c_i$, it is defined as:

$$ -\log (p_{c_i}) $$

where $p_{c_i}$ is the model's predicted probability that $\mathbf{x}_i$ belongs to its true class $c_i$. The loss makes sense: higher probabilities for the correct class yield lower losses.

We can rewrite this as:

$$ -\log(\text{softmax}(\mathbf{W}^\top \mathbf{x}_i + \mathbf{b})_{c_i}) $$

Again, $\text{softmax}(\mathbf{W}^\top \mathbf{x}_i + \mathbf{b})_{c_i}$ is the predicted probability that the $i$-th sample belongs to class $c_i$.

For a batch of $m$ images, we simply average their NLL loss to make it independent of the batch size:

$$ \ell(\mathbf{W}, \mathbf{b}) = - \frac{1}{m}  \sum_{i=1}^m \sum_{c=1}^n \mathbf{y}_{i,c} \log(\text{softmax}(\mathbf{W}^\top \mathbf{x}_i + \mathbf{b})_c)\,, $$

where $\mathbf{y}_i$ is a one-hot encoding of the correct class for sample $i$, and $n$ denotes here the number of classes.

> **EXERCISE:** _(Pen and paper)_ Show how the cross-entropy loss of binary logistic regression relates to the formula above. Are they equivalent?

Writing the loss as a Python function is a one-liner, and we'll need it to test our learning algorithms.

> **EXERCISE:** Write a function `nll()` that computes the NLL loss, taking as input a matrix of log-probabilities and a vector of ground-truth labels.

In [None]:
# ✏️ your code here


In [None]:
# @title 👀 Solution

def nll(log_prob, target):
  batch_size = log_prob.shape[0]
  loss = -np.mean(log_prob[np.arange(batch_size), target])
  return loss

nll(preds, y_train[:m])
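
The indexing used in `nll()` is just a shortcut for the double sum with one-hot labels. The sketch below checks this on random made-up data; all variable names are local to the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_cls = 6, 4
logits = rng.normal(size=(n_samples, n_cls))
# log-probabilities (row-wise log-softmax)
log_prob = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
target = rng.integers(0, n_cls, size=n_samples)

# double-sum form: the one-hot matrix Y picks the correct class in every row
Y = np.eye(n_cls)[target]
loss_one_hot = -np.sum(Y * log_prob) / n_samples

# indexing form, as in the nll() solution
loss_indexed = -np.mean(log_prob[np.arange(n_samples), target])
print(loss_one_hot, loss_indexed)
```

The two values coincide, since multiplying by a one-hot row and summing keeps exactly one entry per sample.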

Take a note of this loss value -- we are going to bring it down!

## The gradient

It's the dreaded time to compute the gradient 😩. Here's our complete loss from before, maximizing for each sample the probability of the correct class:

$$ \ell(\mathbf{W}, \mathbf{b}) = - \frac{1}{m}  \sum_{i=1}^m \sum_{c=1}^n \mathbf{y}_{i,c} \log(\text{softmax}(\mathbf{W}^\top \mathbf{x}_i + \mathbf{b})_c)\,, $$

We are expected to compute its gradient with respect to $\mathbf{W}$ and $\mathbf{b}$.

Lo and behold, this is actually quite easy! If we have $m$ samples per minibatch, $f$ features, and $c$ classes, we get the compact expressions:

$$ \nabla_\mathbf{W} \ell = \frac{1}{m} \mathbf{X}^\top( \text{softmax}(\mathbf{Z}) - \mathbf{Y}) $$

$$ \nabla_\mathbf{b} \ell = \frac{1}{m} \mathbf{1}^\top( \text{softmax}(\mathbf{Z}) - \mathbf{Y}) $$

where $\mathbf{X}$ is an $m \times f$ matrix of features, $\mathbf{Z}$ is an $m \times c$ matrix of logits (the softmax is applied to each row), $\mathbf{Y}$ is an $m \times c$ matrix containing as rows the one-hot representations of the correct classes, one row per sample, and $\mathbf{1}$ is a column vector of $m$ ones. The factor $\frac{1}{m}$ comes from the average over the batch in the loss.


> **EXERCISE:** What are the dimensions of $\mathbf{W}$ and $\mathbf{b}$? What about $\nabla_\mathbf{W} \ell$ and $\nabla_\mathbf{b} \ell$?

You know what to do: write down the code for computing the gradient over a batch, and then use it in a (stochastic) gradient descent loop. Let's start with the gradient:

> **EXERCISE:** Write a `grad` function that takes as input a minibatch `batch`, together with its ground-truth labels `target`, and returns a tuple `W_grad, b_grad` with the gradients.

In [None]:
# ✏️ your code here


In [None]:
# @title 👀 Solution

def grad(batch, target):

  Z = batch @ weights + bias  # logits
  sZ = np.exp(Z)
  sZ /= np.sum(sZ, axis=1)[:, None]
  Y = np.zeros_like(sZ)
  Y[np.arange(batch.shape[0]), target] = 1

  W_grad = batch.transpose() @ (sZ - Y) / batch.shape[0]
  b_grad = np.sum(sZ - Y, axis=0)[None, :] / batch.shape[0]  # same 1/m averaging as W_grad

  return W_grad, b_grad
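
A standard way to gain confidence in a hand-derived gradient is to compare it with central finite differences. The sketch below is self-contained and uses hypothetical helper names and random made-up data; it assumes that both gradients carry the $\frac{1}{m}$ factor of the averaged loss.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)    # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loss_fn(W, b, Xb, yb):
    # mean negative log-likelihood over the batch
    P = softmax_rows(Xb @ W + b)
    return -np.mean(np.log(P[np.arange(len(yb)), yb]))

def analytic_grad(W, b, Xb, yb):
    P = softmax_rows(Xb @ W + b)
    Y = np.eye(W.shape[1])[yb]               # one-hot labels
    m = len(yb)
    return Xb.T @ (P - Y) / m, np.sum(P - Y, axis=0, keepdims=True) / m

def numeric_grad(fun, A, eps=1e-6):
    # central finite differences, perturbing one entry of A at a time (in place)
    G = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        up = fun()
        A[idx] = old - eps
        down = fun()
        A[idx] = old                         # restore the original value
        G[idx] = (up - down) / (2 * eps)
    return G

rng = np.random.default_rng(0)
Xb = rng.normal(size=(8, 5))                 # 8 samples, 5 features
yb = rng.integers(0, 3, size=8)              # 3 classes
W = rng.normal(size=(5, 3))
b = rng.normal(size=(1, 3))

W_grad, b_grad = analytic_grad(W, b, Xb, yb)
W_num = numeric_grad(lambda: loss_fn(W, b, Xb, yb), W)
b_num = numeric_grad(lambda: loss_fn(W, b, Xb, yb), b)
print(np.abs(W_grad - W_num).max(), np.abs(b_grad - b_num).max())
```

Both printed differences should be tiny; a large value would point to a mistake in the derivation or in the code.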

## The algorithm

We now have all the ingredients to write down our own version of SGD! Let's start with standard gradient descent, meaning that we only have one batch of data corresponding to the entire training set.

Let's initialize our model with random parameters:

In [None]:
weights = np.random.rand(X_train.shape[1], n_classes)
bias = np.random.rand(1, n_classes)

> **EXERCISE:** Write the gradient descent algorithm, training with the entire dataset `X_train` and ground-truth labels `y_train`. Pick your own learning rate and maximum number of iterations, and plot the loss vs. the number of iterations.

In [None]:
# ✏️ your code here


In [None]:
# @title 👀 Solution

lr = 0.01
max_iter = 1000

losses = []

for it in range(max_iter):  # 'it' avoids shadowing the built-in iter()

  W_grad, b_grad = grad(X_train, y_train)

  weights -= lr * W_grad
  bias -= lr * b_grad

  loss = nll(model(X_train), y_train)
  losses.append(loss)

plt.figure(figsize=(6, 3))
plt.plot(losses)
plt.xlabel('iterations')
plt.ylabel('training loss')
plt.show()

> **EXERCISE:** Try to play with the learning rate a bit. You might notice that for some choices like `lr = 1.0` you get an overshooting effect. Why is that? Shouldn't a unit learning rate guarantee that the descent step has exactly the same magnitude as the gradient?
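
As a hint, it helps to look at the simplest possible case. The sketch below runs gradient descent on a one-dimensional quadratic (an assumption made for illustration, not the loss of this notebook) with a made-up curvature `a`.

```python
import numpy as np

def gd_on_quadratic(lr, a=4.0, w0=1.0, steps=20):
    # minimize f(w) = 0.5 * a * w**2, whose gradient is a * w
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * a * w      # each step multiplies w by (1 - lr * a)
        history.append(w)
    return np.array(history)

for lr in [0.1, 0.4, 0.6]:
    print(lr, np.abs(gd_on_quadratic(lr))[-1])
```

Here the iterates shrink only when $|1 - \text{lr} \cdot a| < 1$: with `lr = 0.4` they jump across the minimum at every step but still converge, while with `lr = 0.6` they diverge. Whether a step is too long depends on the curvature of the loss, not on the learning rate alone.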

Make sure you understood the details of this part -- we are now going for our final algorithm! Stochastic gradient descent differs from the standard algorithm in one key aspect: **we must use _mini-batches_ of data**, rather than all the training data at once.

So the first thing we need is a way to break down our training data into mini-batches of prescribed size $m$.

In [None]:
m = 50  # what happens if m does not divide n_train exactly?
n_batches = int(np.ceil(n_train / m))
n_batches
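
One possible way to walk through the mini-batches, including a last one that may be smaller, is sketched below. The generator `iterate_minibatches` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def iterate_minibatches(n_samples, m):
    # yield (start, end) index pairs covering range(n_samples) in chunks of at most m
    for start in range(0, n_samples, m):
        yield start, min(start + m, n_samples)

n_samples, m = 1347, 50   # 1347 = int(0.75 * 1797), the training-set size used here
bounds = list(iterate_minibatches(n_samples, m))
sizes = [end - start for start, end in bounds]
print(len(bounds), sizes[-1])
```

With these numbers there are 27 batches, the last one holding 47 samples. NumPy slices that run past the end of an array are simply truncated, so `X_train[start:start + m]` gives the same batches without the explicit `min`.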

Before we implement SGD: if we now compute the loss over all the mini-batches composing our training data, what do you expect to see?

> **EXERCISE:** Show a plot of the loss computed over all the mini-batches. Put the batch number along the x-axis.

Recall that we are evaluating the model at a fixed point of the loss landscape (the one given by the current weights and biases), so all the variation comes from the choice of mini-batch. If the loss is this noisy even before we start the optimization, it stands to reason that it won't go down as smoothly as with standard gradient descent. Let's test this!

> **EXERCISE:** Write the entire SGD algorithm, and plot the training loss vs. the number of epochs. In the same figure, also plot standard GD and draw your (hopefully enthusiastic) conclusions.

In [None]:
# ✏️ your code here

# re-initialize with random parameters
weights_init = np.random.rand(X_train.shape[1], n_classes)
bias_init = np.random.rand(1, n_classes)

lr = 0.01
max_epochs = 1000

# ...

In [None]:
# @title 👀 Solution

lr = 0.01
max_epochs = 1000

weights_init = np.random.rand(X_train.shape[1], n_classes)
bias_init = np.random.rand(1, n_classes)

plt.figure(figsize=(6, 3))

for m in [50, n_train]:  # SGD, GD

  weights = weights_init.copy()  # we don't want shallow copies
  bias = bias_init.copy()

  n_batches = int(np.ceil(n_train / m))

  losses = np.zeros(max_epochs)
  for epoch in range(max_epochs):

    # reshuffle the training set at each epoch
    X_train, y_train = shuffle(X_train, y_train)

    for i in range(n_batches):

      start = i * m
      end = start + m
      batch = X_train[start:end]
      target = y_train[start:end]

      W_grad, b_grad = grad(batch, target)

      weights -= lr * W_grad
      bias -= lr * b_grad

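    # store the loss on the last mini-batch of the epoch: a cheap, noisy proxy for the
    # training loss (nll(model(X_train), y_train) would give the exact value)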
    losses[epoch] = nll(model(batch), target)

  if m < n_train:
    plt.plot(losses, color='red', label='SGD')
  else:
    plt.plot(losses, color='black', label='GD')

plt.xlabel('epochs')
plt.ylabel('training loss')
plt.legend()
plt.show()

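Not only is SGD faster, it is also more accurate 🪄 Keep in mind that, within one epoch, SGD updates the parameters once per mini-batch, while GD updates them only once.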

Experiment with different choices of batch size, learning rate, and number of epochs to see how the loss goes down differently. There is no recipe for choosing the right hyper-parameters, and their impact on the so-called **training dynamics** is still an open research direction.

### On the optimal batch size

As we have seen in theory class, we don't really want to reach the global minimum of the loss. At the global minimum, an overparametrized model like a deep neural network would overfit big time.

Quoting the work of Keskar et al. [*On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima*](https://ar5iv.org/abs/1609.04836):

>The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say 32-512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

Nevertheless, the optimal size of a mini-batch is still debated and many interesting works have been published in recent years, for instance:

- [*Don't Decay the Learning Rate, Increase the Batch Size*](https://ar5iv.org/abs/1711.00489)
- [*Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour*](https://arxiv.org/abs/1706.02677)

## Evaluation

We are not done yet, because we still need to evaluate the performance of our trained model on the validation set.

> **EXERCISE:** Test your SGD-trained model on the validation set of the small digit data. In particular, produce the following plots:
> - Figure 1: plot the loss on training and validation data (two different curves) across the epochs.
> - Figure 2: plot the validation _accuracy_ across the epochs.

In [None]:
# ✏️ your code here


Can you reach the same accuracy as the one we obtained with Scikit-learn?
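
Since `model()` returns log-probabilities, the predicted class is the position of the largest entry in each row. A minimal sketch of an accuracy helper (a hypothetical name, tested here on made-up numbers) could look like this:

```python
import numpy as np

def accuracy_from_log_probs(log_prob, target):
    # predicted class = index of the largest log-probability in each row
    preds = np.argmax(log_prob, axis=1)
    return np.mean(preds == target)

# made-up probabilities for 4 samples and 3 classes, converted to log-probabilities
log_prob_toy = np.log(np.array([[0.7, 0.2, 0.1],
                                [0.1, 0.8, 0.1],
                                [0.3, 0.3, 0.4],
                                [0.5, 0.4, 0.1]]))
target_toy = np.array([0, 1, 2, 1])
acc = accuracy_from_log_probs(log_prob_toy, target_toy)
print(f"{100 * acc:.2f}%")
```

In this toy example three out of four samples are classified correctly.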

## EXERCISE: Classifying MNIST digits

You now have all the skills needed to set up, configure, and solve a classification problem using SGD as a training algorithm. Importantly, **you coded a functioning learning algorithm from start to finish, all by yourself**, using existing libraries only for computing matrix products and plotting. Congratulations! 🤝

Let's finish this notebook by applying all we learned to classify MNIST digits. **Can you do better than Scikit-learn?**

Perhaps by implementing **momentum**?
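
As a starting point, here is a sketch of one common momentum variant (heavy-ball: keep a velocity that accumulates past gradients) on a toy ill-conditioned quadratic. The problem, the function names, and the hyper-parameters are all made up for illustration; adapting the update to `weights` and `bias` is left to you.

```python
import numpy as np

A = np.diag([1.0, 50.0])          # ill-conditioned quadratic: f(w) = 0.5 * w^T A w

def f(w):
    return 0.5 * w @ A @ w

def grad_f(w):
    return A @ w

def run(lr, beta, steps=100):
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)          # velocity: running accumulation of past gradients
    for _ in range(steps):
        v = beta * v + grad_f(w)  # beta = 0 recovers plain gradient descent
        w = w - lr * v
    return f(w)

loss_plain = run(lr=0.02, beta=0.0)
loss_momentum = run(lr=0.02, beta=0.9)
print(loss_plain, loss_momentum)
```

On this particular toy problem and with these settings, the momentum run ends at a lower loss after the same number of steps, because the velocity keeps pushing along the flat direction where plain gradient descent crawls.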

In [None]:
# ✏️ your Scikit-learn beating code here


# Coda

Can you think of any classification problem that you would like to solve in your own everyday life?

Maybe something that recognizes faces of your friends from your photos 🧑? Or monsters in Pokemon 👾? Or written text from the sound of a laptop keyboard 💻🎵? All these are solvable with what you learned today, so go out there and have fun!