# COMP534 Lab session 5

In this session, we will:

**1.** Compare Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent (MBGD);

**2.** Analyze the hyperparameters that can be tuned when training a neural network;

**3.** Perform Fashion MNIST classification with convolutional layers, batch normalization and max pooling.

## 5-1 (Batch) Gradient Descent

Gradient descent is a very common method in deep learning. It is used to optimize the parameters of a model by computing the gradients of an objective function w.r.t. those parameters and moving the parameters against the gradient.

To find an optimal solution, you could in principle use exhaustive search, divide and conquer, or a greedy strategy. Gradient descent is a greedy algorithm: at every iteration it updates the parameters in the direction in which the loss decreases fastest (the negative gradient), so it usually reaches a local optimum quickly.
If the function is convex, any local optimum is also the global optimum. If the function is non-convex, as is typical for neural networks, the iteration may get stuck in a local optimum or at a saddle point, where the gradient is 0 and the updates stop.
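
The difference between the convex and the non-convex case can be illustrated with a minimal plain-Python sketch. The helper `gradient_descent` and the two example functions are illustrative assumptions, not part of the lab code; the step sizes and iteration counts are simply values that are small and large enough for these two functions.

```python
def gradient_descent(grad, x0, step_size, n_iter):
    """Start at x0 and repeatedly take a step against the gradient."""
    x = x0
    for _ in range(n_iter):
        x = x - step_size * grad(x)
    return x

# Convex case: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and one minimum, at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=-10.0, step_size=0.1, n_iter=100)

# Non-convex case: f(x) = x^4 - 3x^2 + x has gradient 4x^3 - 6x + 1 and two local minima.
def grad_nonconvex(x):
    return 4 * x ** 3 - 6 * x + 1

left = gradient_descent(grad_nonconvex, x0=-2.0, step_size=0.01, n_iter=500)
right = gradient_descent(grad_nonconvex, x0=2.0, step_size=0.01, n_iter=500)

print(x_min)        # close to 3
print(left, right)  # two different local minima, depending on the starting point
```

In the non-convex example only the left minimum is the global one; a run that starts at `x0=2.0` settles in the other minimum and never finds it.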

Here we have some artificially generated data and want to train a model to approximate the function that generated it. For the case where this function is linear, we show the iterative update process of batch gradient descent.

In [None]:
# Batch Gradient Descent (BGD)
import numpy as np
import matplotlib.pyplot as plt
import torch

# produce the data point with linear function
X = torch.arange(-5, 5, 0.1).view(-1, 1)
func = 6 * X  # the actual function that we want to estimate through Gradient Descent
# Gaussian noise is added to create the variable Y
Y = func + 0.4 * torch.randn(X.size())

# plot and visualize the data points
fig = plt.figure(figsize=(20, 10))

ax1 = fig.add_subplot(121)

ax1.plot(X, Y, 'b*', label='Data Points')
ax1.plot(X, func, 'r', label='Function')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.legend()
ax1.grid(True, color='y')  # pass the boolean True rather than the string 'True'

Here, we learn how to approximate the red line/function from the previous graph using GD.

In [None]:
# define the forward function
def forward(x):
    return w * x + b  # simple linear regression with slope (w) and intercept (b)

# define loss function with Mean Square Error (MSE)
def criterion(y_pred, y):
    return torch.mean((y_pred - y) ** 2)

#  initial parameters w (slope) and b (intercept)
# w and b are the parameters we want to learn
w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)

#  other parameters
step_size = 0.1
loss_BGD = []
n_iter = 20

# Initial predictions
print('Predict before training with BGD: x=' + str(4) + ' y=' + str(4*6) + ' prediction=' + str(forward(4.0)))

for i in range(n_iter):
    # forward pass: make predictions for the whole dataset
    Y_pred = forward(X)
    # compute the loss between the predictions and the noisy targets
    loss = criterion(Y_pred, Y)
    # store the loss so it can be plotted later
    loss_BGD.append(loss.item())
    # backward pass: compute the gradients of the loss w.r.t. the learnable parameters
    loss.backward()

    # update the parameters once per iteration (no_grad keeps the update out of the autograd graph)
    with torch.no_grad():
        w -= step_size * w.grad
        b -= step_size * b.grad

    # zero the gradients, otherwise backward() would accumulate them across iterations
    w.grad.zero_()
    b.grad.zero_()

    # print some values to follow the training
    if i % 5 == 0:
        print('iteration: {}, \t loss: {}, \t weight: {}, \t bias: {}'.format(i, loss.item(), w.item(), b.item()))

#Predict y after updating w
print('Predict after training with BGD: x=' + str(4) + ' y=' + str(4*6) + ' prediction=' + str(forward(4.0)))
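
The `loss.backward()` call in the training loop computes the gradients of the MSE with respect to `w` and `b`. For this linear model they also have a simple closed form: dL/dw = 2 * mean((w*x + b - y) * x) and dL/db = 2 * mean(w*x + b - y). The following sketch checks that autograd returns the same numbers. It rebuilds the data with a fixed seed (an assumption made only for reproducibility) instead of reusing the variables from the cells above.

```python
import torch

torch.manual_seed(0)  # assumed seed, only to make the check reproducible
X = torch.arange(-5, 5, 0.1).view(-1, 1)
Y = 6 * X + 0.4 * torch.randn(X.size())

w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)

# gradients computed by autograd
loss = torch.mean((w * X + b - Y) ** 2)
loss.backward()

# closed-form gradients of the MSE for the linear model y = w * x + b
with torch.no_grad():
    residual = w * X + b - Y
    grad_w = 2 * torch.mean(residual * X)
    grad_b = 2 * torch.mean(residual)

print(w.grad.item(), grad_w.item())
print(b.grad.item(), grad_b.item())
```

Each printed pair should agree up to floating-point rounding.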

Let's plot the cost vs the iteration step. We can clearly see how the loss has decreased over time.

In [None]:
# plot the figure (loss_BGD)
plt.plot(loss_BGD, label="Batch Gradient Descent")
plt.xlabel('Epoch')
plt.ylabel('Cost/Total loss')
plt.legend()
plt.show()

Let's now compare the real function and the learned one. You will see that both lines (the real and the learned one) are very similar to each other, showing that the learning was, in fact, effective.

In [None]:
# extract the learned parameters as plain Python floats for plotting
W = w.item()
B = b.item()

# plot and visualize the data points
fig = plt.figure(figsize=(20, 10))

ax1 = fig.add_subplot(121)

ax1.plot(X, Y, 'b*', label='Data Points')
ax1.plot(X, func, 'r', alpha=0.5, label='Real Function')
ax1.plot(X, X*W+B, 'gray', alpha=0.5, label='Learned Function')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.legend()
ax1.grid(True, color='y')  # pass the boolean True rather than the string 'True'

## **2. Stochastic Gradient Descent**
Instead of averaging the gradient over **all** samples for each update (as in batch gradient descent), SGD uses a single sample per update: in each iteration one of the N samples is selected at random, the gradient of the loss on that sample alone is computed, and the weight w is updated with it. Because these single-sample gradients are noisy, they can help the algorithm escape from a saddle point or a local minimum and keep moving towards a better solution.
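
The random selection described above can be sketched by visiting the samples in a fresh random order every epoch. This is a minimal sketch under stated assumptions: it uses `torch.randperm` for the shuffling, a fixed seed, a step size of 0.01 and 20 epochs, none of which are prescribed by the lab.

```python
import torch

torch.manual_seed(0)  # assumed seed, only for reproducibility
X = torch.arange(-5, 5, 0.1).view(-1, 1)
Y = 6 * X + 0.4 * torch.randn(X.size())

w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)
step_size = 0.01  # assumed value, small enough for single-sample updates on this data

for epoch in range(20):
    # draw a new random ordering of the sample indices for this epoch
    for idx in torch.randperm(len(X)):
        # loss on one randomly chosen sample
        loss = ((w * X[idx] + b - Y[idx]) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            w -= step_size * w.grad
            b -= step_size * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())  # should end up near the true slope 6 and intercept 0
```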

We will use the same data and methods defined before, changing only how the parameters are updated.

In [None]:
#  initial parameters w and b
w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)

# other parameters
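# a single-sample step scales that sample's residual by 1 - 2*step_size*(x**2 + 1),
# so 0.1 overshoots for |x| > 3 on this data; 0.01 keeps every update stable
step_size = 0.01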
n_iter = 50
loss_SGD = []

# Initial predictions
print('Predict before training with SGD: x=' + str(4) + ' y=' + str(4*6) + ' prediction=' + str(forward(4.0)))

for i in range(n_iter):
    # full-dataset loss, recorded only for the plot below (NOT needed for training)
    with torch.no_grad():
        loss_SGD.append(criterion(forward(X), Y).item())
    for x, y in zip(X, Y):  # one update per sample (samples are visited in dataset order here)
        # forward pass: make a prediction for a single sample
        y_hat = forward(x)
        # compute the loss between the prediction and the target
        loss = criterion(y_hat, y)
        # backward pass: compute the gradients of the loss w.r.t. the learnable parameters
        loss.backward()
        # update the parameters after every sample
        with torch.no_grad():
            w -= step_size * w.grad
            b -= step_size * b.grad
        # zero the gradients after every update
        w.grad.zero_()
        b.grad.zero_()
    # print some values to follow the training (loss of the last sample in the epoch)
    if i % 10 == 0:
        print('iteration: {}, \t loss: {}, \t weight: {}, \t bias: {}'.format(i, loss.item(), w.item(), b.item()))

#Predict y after updating w
print('Predict after training with SGD: x=' + str(4) + ' y=' + str(4*6) + ' prediction=' + str(forward(4.0)))

In [None]:
# plot the figure (loss_SGD)
plt.plot(loss_SGD, label="Stochastic Gradient Descent")
plt.xlabel('Epoch')
plt.ylabel('Cost/Total loss')
plt.legend()
plt.show()

Stochastic gradient descent converges noisily because every update is based on a single data point. That is why you observe fluctuations in the SGD loss curve.
In batch gradient descent, the parameters are updated once after all training samples have been processed, whereas stochastic gradient descent updates the parameters after every single training sample.
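
One practical consequence of updating on a single sample is that the usable step size depends on the largest inputs. For the linear model and squared loss used in this lab, one single-sample step changes that sample's residual e = w*x + b - y into e * (1 - 2 * step_size * (x^2 + 1)). The arithmetic can be sketched as follows; `residual_factor` is a hypothetical helper introduced only for this illustration.

```python
def residual_factor(x, step_size):
    """Factor by which one single-sample step scales that sample's residual."""
    return 1 - 2 * step_size * (x ** 2 + 1)

for step_size in (0.1, 0.01):
    for x in (0.0, 2.0, 5.0):
        print(step_size, x, residual_factor(x, step_size))
```

A factor with magnitude above 1 means the step overshoots and makes the error on that sample larger, which is one reason single-sample updates usually need a smaller step size than full-batch updates.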



## **3. Mini-Batch Gradient Descent**
Instead of a single sample or the whole dataset, mini-batch gradient descent computes each update from a small batch of samples. For a dataset of 100 samples and a batch size of 4, there are 25 batches, so the parameters are updated 25 times per epoch.
For more mathematical details, please refer to this [website](https://www.baeldung.com/cs/gradient-stochastic-and-mini-batch#:~:text=Mini%20Batch%20Gradient%20Descent%20is,the%20gradients%20for%20each%20batch.&text=is%20a%20hyperparameter%20that%20denotes%20the%20size%20of%20a%20single%20batch.).
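
The batch arithmetic above can be checked with PyTorch's `TensorDataset` and `DataLoader`, which are one common way to produce (optionally shuffled) mini-batches. This is only a sketch: the 100 random samples are placeholder data, and the lab code itself slices the batches by hand rather than using a `DataLoader`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(100, 1)  # placeholder data: 100 samples with one feature each
Y = 6 * X

# batch_size=4 over 100 samples gives 25 batches, i.e. 25 updates per epoch
loader = DataLoader(TensorDataset(X, Y), batch_size=4, shuffle=True)
print(len(loader))

# every iteration over the loader yields one mini-batch of inputs and targets
batch_X, batch_Y = next(iter(loader))
print(batch_X.shape, batch_Y.shape)
```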

Again, we will use the same data and methods defined before, changing only how we optimize the model.

In [None]:
#  initial parameters w and b
w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)

#  other parameters
step_size = 0.01
loss_MBGD = []
n_iter = 20

#Initial predictions
print('Predict before training with MBGD: x=' + str(4) + ' y=' + str(4*6) + ' prediction=' + str(forward(4.0)))

print("Size of training set", len(X))
batch_size = 5
n_batches = len(X) // batch_size  # integer division: number of full batches
print("Amount of batches in total", n_batches)

for epoch in range(n_iter):  # epochs
    batch_loss = []
    for n_b in range(n_batches):  # iterate over the batches
        # slice out the current mini-batch
        start, end = batch_size * n_b, batch_size * (n_b + 1)
        batch_X, batch_y = X[start:end], Y[start:end]
        # forward pass and loss on the current batch
        Ybatch_pred = forward(batch_X)
        loss = criterion(Ybatch_pred, batch_y)
        # store the batch loss in the list
        batch_loss.append(loss.item())
        # backward pass: compute the gradients of the loss w.r.t. the learnable parameters
        loss.backward()
        # update the parameters after every batch
        with torch.no_grad():
            w -= step_size * w.grad
            b -= step_size * b.grad
        # zero the gradients after every update
        w.grad.zero_()
        b.grad.zero_()
    # average batch loss of this epoch, used for the plot below
    loss_MBGD.append(np.mean(batch_loss))
    # the printed loss is the loss of the last batch in the epoch
    print('epoch: {}, \t loss: {}, \t weight: {}, \t bias: {}'.format(epoch, loss.item(), w.item(), b.item()))

#Predict y after updating w
print('Predict after training with MBGD: x=' + str(4) + ' y=' + str(4*6) + ' prediction=' + str(forward(4.0)))

In [None]:
# plot the figure (loss_MBGD)
plt.plot(loss_MBGD, label="Mini Batch Gradient Descent")
plt.xlabel('Epoch')
plt.ylabel('Cost/Total loss')
plt.legend()
plt.show()

Now, let's compare the convergence curves.

In [None]:
# plot the figure
fig = plt.figure(figsize=(10, 10))

plt.plot(loss_MBGD, 'b', label="Mini Batch Gradient Descent")
# plt.plot(loss_SGD[:20], 'r',label="Stochastic Gradient Descent")
plt.plot(loss_BGD, 'g',label="Batch Gradient Descent")
plt.xlabel('Epoch')
plt.ylabel('Cost/Total loss')
plt.legend()

plt.show()

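Mini-batch gradient descent is the practical compromise: every update uses a batch of the data, so the gradient is less noisy than in SGD while the parameters are still updated several times per epoch.
Compared with batch gradient descent, which updates the parameters only once per epoch, mini-batch gradient descent can therefore converge in fewer epochs.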
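
The three variants differ only in how many samples feed each update, so they can be sketched as one loop with a `batch_size` argument. `train` is a hypothetical helper written for this comparison; the shared step size of 0.01, the 20 epochs, the fixed seed and the use of `torch.linspace` for the inputs are assumptions chosen so that all three runs are stable and directly comparable.

```python
import torch

def train(X, Y, batch_size, step_size, n_epochs):
    """Gradient descent on y = w * x + b; batch_size selects the variant."""
    w = torch.tensor(-10.0, requires_grad=True)
    b = torch.tensor(-20.0, requires_grad=True)
    n_updates = 0
    for _ in range(n_epochs):
        for start in range(0, len(X), batch_size):
            xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
            loss = torch.mean((w * xb + b - yb) ** 2)
            loss.backward()
            with torch.no_grad():
                w -= step_size * w.grad
                b -= step_size * b.grad
            w.grad.zero_()
            b.grad.zero_()
            n_updates += 1
    # loss on the full dataset after training
    with torch.no_grad():
        final_loss = torch.mean((w * X + b - Y) ** 2).item()
    return n_updates, final_loss

torch.manual_seed(0)
X = torch.linspace(-5, 5, 100).view(-1, 1)
Y = 6 * X + 0.4 * torch.randn(X.size())

# batch_size = N is BGD, batch_size = 1 is SGD, anything in between is MBGD
results = {}
for name, bs in [("BGD", len(X)), ("MBGD", 5), ("SGD", 1)]:
    results[name] = train(X, Y, batch_size=bs, step_size=0.01, n_epochs=20)
    print(name, results[name])
```

With the same step size and the same number of epochs, BGD performs 20 updates, MBGD 400 and SGD 2000, which is why the variants with smaller batches get closer to the optimum within the same number of epochs.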
