# Training and testing the neural network

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import gzip
import numpy as np
import tensorflow as tf
from typing import Tuple

You've already seen how to train a neural network using Keras in [module 24](/module24/lecture/train.ipynb). In this notebook, we'll re-implement the training loop in TensorFlow. Doing so will help you understand what goes on under the hood, let you customize the training loop if you want, and make it easier to debug.

We'll start by including code that gives us the datasets and model that we'll use in the remainder of this notebook. We will use the same FashionMNIST dataset and data loading code as in [module 24](/module24/lecture/train.ipynb), so feel free to revisit that module if something is not clear, or take a look [at the source code](https://github.com/MicrosoftDocs/tensorflow-learning-path/blob/main/intro-tf/tintro.py).

In [2]:
import wget

if not os.path.exists('tintro.py'):
    wget.download('https://raw.githubusercontent.com/MicrosoftDocs/tensorflow-learning-path/main/intro-tf/tintro.py', 'tintro.py') 

from tintro import *

As we mentioned in module 1, the goal of training the neural network is to find parameters $W$ and $b$ that minimize the loss function, which measures the difference between the actual and predicted labels. We also mentioned that we can think of the neural network, together with its loss, as the function $\ell$ below, and that we use an optimization algorithm to find the parameters $W$ and $b$ that minimize this function.

$$
\mathrm{loss} = \ell(X, y, W, b)
$$

Let's now dig deeper into what this optimization algorithm might look like. There are many types of optimization algorithms, but in this tutorial we'll cover only the simplest one: gradient descent. To implement gradient descent, we iteratively improve our estimates of $W$ and $b$ according to the update formulas below, until the gradients are smaller than a pre-defined threshold $\epsilon$ (or until we reach a pre-defined number of iterations):

$$
\begin{align}
  W &:= W - \alpha \frac{\partial \ell}{\partial W} \\
  b &:= b - \alpha \frac{\partial \ell}{\partial b}
\end{align}
$$

The parameter $\alpha$ is typically referred to as the "learning rate," and will be defined later in the code. 
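
To make the update rule and its two stopping conditions concrete, here is a minimal sketch on a toy one-parameter loss, $\ell(w) = (w - 3)^2$. The function `gradient_descent`, the toy loss, and the values chosen for `alpha`, `epsilon`, and `max_steps` are illustrative assumptions, not part of this notebook's model.

```python
def gradient_descent(grad_fn, w, alpha=0.1, epsilon=1e-6, max_steps=1000):
    """Repeat w := w - alpha * gradient until |gradient| < epsilon or max_steps is reached."""
    steps = 0
    while steps < max_steps:
        grad = grad_fn(w)
        if abs(grad) < epsilon:   # gradient is small enough: stop
            break
        w = w - alpha * grad      # the update formula from above
        steps += 1
    return w, steps

# Toy loss l(w) = (w - 3)^2 has derivative 2 * (w - 3) and its minimum at w = 3.
w_final, steps_taken = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
print(f'w = {w_final:.6f} after {steps_taken} steps')
```

With these settings the loop stops because the gradient falls below the threshold, well before the step limit. A larger `alpha` takes bigger steps but can overshoot the minimum.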

During training, we pass a mini-batch of data as input, perform a sequence of calculations to obtain the loss, then propagate back through the network to calculate the derivatives used in the gradient descent formulas above. Once we have the derivatives, we can update the values of the network's parameters $W$ and $b$ according to the formulas. This sequence of steps is the backpropagation algorithm. By repeating these calculations over many mini-batches, we update the parameters again and again, which typically lowers the loss and improves the predictions.
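
The same sequence (forward pass, loss, derivatives, update) can be written out by hand for a model small enough to differentiate on paper. The sketch below uses a toy linear model with a mean squared error loss and NumPy, which is an assumption made for illustration; the network in this notebook is a classifier and TensorFlow computes its derivatives automatically. The data, `alpha`, and the helper `forward_and_loss` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 8 items with 3 features each, and scalar targets.
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3

W = np.zeros(3)   # weights
b = 0.0           # bias
alpha = 0.05      # learning rate

def forward_and_loss(X, y, W, b):
    y_prime = X @ W + b                  # forward pass
    loss = np.mean((y_prime - y) ** 2)   # mean squared error
    return y_prime, loss

losses = []
for _ in range(200):
    y_prime, loss = forward_and_loss(X, y, W, b)
    losses.append(loss)

    # Backward pass: derivatives of the loss with respect to W and b (chain rule).
    error = y_prime - y
    grad_W = 2.0 * X.T @ error / len(X)
    grad_b = 2.0 * error.mean()

    # Gradient descent step.
    W = W - alpha * grad_W
    b = b - alpha * grad_b

print(f'first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}')
```

Here the same mini-batch is reused on every iteration to keep the example short; in real training each iteration would see a different mini-batch.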

In Keras, when we called the function [`fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit), the backpropagation algorithm was executed several times. Here, we'll start by understanding the code that reflects a single pass of the backpropagation algorithm:

- A forward pass through the model to compute the predicted value, `y_prime = model(X, training=True)`
- A calculation of the loss using a loss function, `loss = loss_fn(y, y_prime)`
- A backward pass from the loss function through the model to calculate derivatives, `grads = tape.gradient(loss, model.trainable_variables)`
- A gradient descent step to update $W$ and $b$ using the derivatives calculated in the backward pass, `optimizer.apply_gradients(zip(grads, model.trainable_variables))`

Here's the complete code:

In [3]:
def fit_one_batch(X, y, model, loss_fn, optimizer) -> Tuple[tf.Tensor, tf.Tensor]:
  with tf.GradientTape() as tape:
    y_prime = model(X, training=True)
    loss = loss_fn(y, y_prime)

  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))

  return (y_prime, loss)

Notice that the code above ensures that the forward calculations are within the `GradientTape`'s scope, just as we saw in the previous notebook. This makes it possible for us to ask the tape for the gradients. 

The code above works for a single mini-batch, which is typically much smaller than the full set of data (in this sample we use a mini-batch of size 64, out of 60,000 training data items). But we want to execute the backpropagation algorithm for the full set of data. We can do so by iterating through the `Dataset` we created earlier, which, as we saw in module 1, returns a mini-batch per iteration. There are two critical lines in the code below: the `for` loop and the call to the `fit_one_batch` function. The rest of the code just prints the accuracy and loss as the model is being trained. 
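
As a quick check on the numbers above, the following sketch computes how many mini-batches one pass over the training data produces. It assumes the `Dataset` keeps the final, smaller batch rather than dropping it, which is consistent with the training output shown later in this notebook.

```python
import math

item_count = 60_000   # FashionMNIST training items
batch_size = 64

# 60,000 is not a multiple of 64, so the last batch is smaller than the rest.
batch_count = math.ceil(item_count / batch_size)
last_batch_size = item_count - (batch_count - 1) * batch_size

print(batch_count, last_batch_size)  # 938 32
```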

In [4]:
def fit(dataset: tf.data.Dataset, model: tf.keras.Model, loss_fn: tf.keras.losses.Loss,
        optimizer: tf.optimizers.Optimizer) -> None:
  batch_count = len(dataset)
  loss_sum = 0
  correct_item_count = 0
  current_item_count = 0
  print_every = 100

  for batch_index, (X, y) in enumerate(dataset):
    (y_prime, loss) = fit_one_batch(X, y, model, loss_fn, optimizer)

    y = tf.cast(y, tf.int64)
    correct_item_count += (tf.math.argmax(y_prime, axis=1) == y).numpy().sum()

    batch_loss = loss.numpy()
    loss_sum += batch_loss
    current_item_count += len(X)

    if ((batch_index + 1) % print_every == 0) or ((batch_index + 1) == batch_count):
      # Accuracy is cumulative over the epoch so far; loss is for the current batch only.
      accuracy = correct_item_count / current_item_count * 100
      print(f'[Batch {batch_index + 1:>3d} - {current_item_count:>5d} items] accuracy: {accuracy:>0.1f}%, loss: {batch_loss:>7f}')
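
The accuracy bookkeeping in `fit` can be checked by hand on a tiny batch. The sketch below uses NumPy instead of TensorFlow and made-up model outputs, so the arrays are assumptions for illustration only. The predicted class for each item is the index of its largest output, and an item counts as correct when that index equals its label.

```python
import numpy as np

# Hypothetical model outputs (logits) for a batch of 4 items and 3 classes.
y_prime = np.array([[2.0, 0.1, -1.0],
                    [0.3, 0.2, 1.5],
                    [0.0, 3.0, 0.5],
                    [1.2, 0.9, 0.1]])
y = np.array([0, 2, 1, 1])  # true class indices

predicted = np.argmax(y_prime, axis=1)        # [0, 2, 1, 0]
correct_item_count = (predicted == y).sum()   # 3 of 4 items match
accuracy = correct_item_count / len(y) * 100

print(f'accuracy: {accuracy:.1f}%')  # accuracy: 75.0%
```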

A complete iteration over all mini-batches in the dataset is called an "epoch." In this sample, we restrict the code to just five epochs for quick execution, but in a real project you would want to set it to a much higher number (to achieve better predictions). The code below also shows the creation of the loss function and optimizer, which we discussed in module 1.
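
The loss function used in this notebook is a sparse categorical cross-entropy computed from logits, meaning the model outputs raw scores and the loss applies the softmax itself. As a rough sketch of what that computes, here is a NumPy version. The function name is hypothetical, and averaging over the batch is assumed to match the Keras default reduction.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(y, logits):
    """Mean cross-entropy for integer labels y and unnormalized scores (logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for each item, then average.
    return -log_probs[np.arange(len(y)), y].mean()

# With all-zero logits every one of the 10 classes gets probability 1/10,
# so the loss equals ln(10), whatever the labels are.
logits = np.zeros((2, 10))
y = np.array([3, 7])
loss = sparse_categorical_crossentropy_from_logits(y, logits)
print(loss, np.log(10))
```

A loss near $\ln(10) \approx 2.3$ is therefore what an untrained 10-class model that guesses uniformly would produce, which gives a useful baseline for reading the losses printed during training.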

In [6]:
learning_rate = 0.1
batch_size = 64
epochs = 5

(train_dataset, test_dataset) = get_data(batch_size)

model = NeuralNetwork()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.optimizers.SGD(learning_rate)

print('\nFitting:')
for epoch in range(epochs):
  print(f'\nEpoch {epoch + 1}\n-------------------------------')
  fit(train_dataset, model, loss_fn, optimizer)




Fitting:

Epoch 1
-------------------------------
[Batch 100 -  6400 items] accuracy: 60.2%, loss: 0.864982
[Batch 200 - 12800 items] accuracy: 66.7%, loss: 0.611843
[Batch 300 - 19200 items] accuracy: 70.3%, loss: 0.590243
[Batch 400 - 25600 items] accuracy: 72.4%, loss: 0.725874
[Batch 500 - 32000 items] accuracy: 74.1%, loss: 0.492054
[Batch 600 - 38400 items] accuracy: 75.2%, loss: 0.538996
[Batch 700 - 44800 items] accuracy: 76.1%, loss: 0.396010
[Batch 800 - 51200 items] accuracy: 76.8%, loss: 0.518657
[Batch 900 - 57600 items] accuracy: 77.3%, loss: 0.575013
[Batch 938 - 60000 items] accuracy: 77.5%, loss: 0.433798

Epoch 2
-------------------------------
[Batch 100 -  6400 items] accuracy: 83.0%, loss: 0.322842
[Batch 200 - 12800 items] accuracy: 83.0%, loss: 0.511460
[Batch 300 - 19200 items] accuracy: 82.8%, loss: 0.320734
[Batch 400 - 25600 items] accuracy: 83.0%, loss: 0.387147
[Batch 500 - 31968 items] accuracy: 82.9%, loss: 0.395419
[Batch 600 - 38368 items] accuracy: 83