<div><img style="float: right; width: 120px; vertical-align:middle" src="https://www.upm.es/sfs/Rectorado/Gabinete%20del%20Rector/Logos/EU_Informatica/ETSI%20SIST_INFORM_COLOR.png" alt="ETSISI logo" />

# Training Process Evaluation<a id="top"></a>

<i><small>Authors: Alberto Díaz Álvarez<br>Last update: 2023-04-09</small></i></div> 

***

## Introduction

It is important to monitor the training process of a model in order to evaluate and improve its performance. Training a neural network requires tuning numerous hyperparameters, such as the learning rate, the batch size, the number of epochs, the topology of the network, etc.

TensorBoard is a useful tool for monitoring the training process, since it lets us visualize the metrics collected during training, such as the evolution of the loss function, the accuracy of the network on training and validation data, and the features the network is learning, among others.

In addition, TensorBoard can display the structure of the neural network and the distributions of the weights and activations of its layers, which can help to identify possible problems in the network architecture.

## Goals

The general objective of this notebook is to show how to use the **TensorBoard** tool to evaluate the training of neural network models.

After showing how to start it inside our notebook (it is an external tool that can also be run on its own from the terminal), we will try to detect _exploding gradients_ and _vanishing gradients_ problems.

## Libraries and configuration

Next we will import the libraries that will be used throughout the notebook.

In [None]:
import datetime

import tensorflow as tf

***

## Our sample model

For this example, we will create a model to solve the MNIST problem that we already saw in the previous exercise.

In [None]:
# Load MNIST: 28x28 grayscale digit images with integer labels from 0 to 9
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Scale pixel values to [0, 1] and one-hot encode the labels
x_train, x_test = x_train / 255, x_test / 255
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
y_test = tf.keras.utils.to_categorical(y_test, num_classes=10)

# A small fully connected classifier with a single hidden layer
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='sigmoid'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=['accuracy'])
model.summary()

## Loading TensorBoard

TensorBoard is a **web-based tool** provided by TensorFlow. It is used **to visualize and debug the training process of machine learning models in TensorFlow**.

It provides several features, including:

- **Graph Visualization**: TensorBoard can visualize the computational graph of the model, which is helpful in understanding the structure of the model.
- **Scalar Dashboard**: This feature allows users to visualize scalar values such as accuracy, loss, and learning rate over time during training. It helps in monitoring the progress of the model and detecting overfitting or underfitting.
- **Histogram Dashboard**: TensorBoard can also display histograms of activations and weight distributions in the model.
- **Projector Dashboard**: This feature is used to visualize high-dimensional data using dimensionality reduction techniques such as PCA or t-SNE. It is helpful in understanding how the model is clustering the data.
- **Profile Dashboard**: This feature provides a detailed analysis of the execution time and memory usage of different operations in the model. It helps in identifying performance bottlenecks in the model.

Overall, TensorBoard is a powerful tool for understanding and debugging machine learning models in TensorFlow. To use it from a Python notebook, simply load its notebook extension:

In [None]:
%reload_ext tensorboard

Then, we have to create a callback that will write the training metrics to the log directory as the model is trained.

In [None]:
log_dir = f'logs/{datetime.datetime.now().strftime("%Y%m%d-%H%M%S")}'
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

Once the module is loaded and the _callback_ that `tensorboard` will use is created, we start it.

In [None]:
%tensorboard --logdir $log_dir

At this point, TensorBoard is watching the log directory (where the _loss_ and the other metrics will be stored) and refreshing the dashboard every 30 seconds.

We can now train and see how the training of our model is evolving in `tensorboard`:

In [None]:
history = model.fit(x_train, y_train, epochs=250, validation_split=0.1, verbose=0, callbacks=[tb_callback])
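Besides the dashboard, the object returned by `fit` keeps the per-epoch metrics in `history.history`, a dictionary of lists. The sketch below works on a hand-written dictionary shaped like that attribute; the helper name `best_epoch` is hypothetical and the numbers are made up for illustration, not results from this notebook.

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return the 1-based epoch with the lowest value of the given metric."""
    values = history_dict[metric]
    return min(range(len(values)), key=values.__getitem__) + 1


# Made-up values shaped like history.history; NOT real training results.
example_history = {
    "loss": [0.90, 0.60, 0.45, 0.35, 0.28],
    "val_loss": [0.95, 0.70, 0.58, 0.61, 0.66],
}

epoch = best_epoch(example_history)
print(f"lowest validation loss at epoch {epoch}")
```

A training loss that keeps falling while the validation loss starts to rise, as in these invented numbers, is the pattern usually read as overfitting in the scalar dashboard.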

## About the components

We will now look at the main components that are available in TensorBoard.

### Graph

Graphs become very hard to follow as the complexity of the network increases.

![An example graph](Images/graph.png "Full view of the graph of our example")

The way to get a useful view is to do some preliminary cleanup work to structure it.

Graphs can help us enormously both to understand the model we are working with and to detect errors in its topology.

Some keys to interpret the graph:

1. Nodes with the same color have the same structure. Gray nodes, however, each have a unique structure.
2. Clicking on a node shows more details.
3. There is a button that allows us to see dependencies of any node: `trace_input`.

### Summaries

Summaries are a special type of TensorFlow operator. Just as there are operators for algebraic operations (e.g. additions, subtractions, ...), there are operators that take a tensor of the graph as input and provide a set of "summarized" data as output.

Some summary operators are created automatically by default (practically every chart that appears in TensorBoard, apart from the network graph, comes from an operator of this type), although we can create as many as we need. Once created, their output is dumped into the logs, which are then read by TensorBoard.

Now we will see some of the most common operators:

* `tf.summary.scalar`: Records individual values such as accuracy, loss, etc., and displays them as a line chart.
* `tf.summary.image`: Displays an image, which is very useful to check whether the inputs are correct or whether a generative model is producing images as expected.
* `tf.summary.audio`: Similar to the previous operator, but for sound.
* `tf.summary.histogram`: Useful for plotting the histogram of a non-scalar tensor, which shows how the distribution of the tensor values changes over time. In DNNs it is commonly used to check the distribution of weights and biases, helping to detect irregular behavior in the network parameters.
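To make the idea behind the histogram summary more concrete, here is a plain-Python sketch of the kind of information it records at each step: bucket counts over the values of a tensor. This is not how `tf.summary.histogram` is implemented; the bucket layout, the function name and the sample weights are all assumptions made for illustration.

```python
def histogram(values, num_buckets=5):
    """Count how many values fall into each of num_buckets equal-width buckets."""
    low, high = min(values), max(values)
    width = (high - low) / num_buckets or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * num_buckets
    for v in values:
        index = min(int((v - low) / width), num_buckets - 1)  # clamp the maximum value
        counts[index] += 1
    return counts


# Hypothetical weights of one layer at two different epochs.
weights_epoch_1 = [-0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.8, 1.0]
weights_epoch_9 = [-0.1, -0.05, 0.0, 0.0, 0.01, 0.02, 0.05, 0.1]

print(histogram(weights_epoch_1))
print(histogram(weights_epoch_9))
```

Comparing such counts (and the range they cover) across epochs is what lets us spot weights that collapse towards zero or drift to very large values.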

## Vanishing gradients

The gradient is a measure of the direction and magnitude of change in the loss function of the neural network, which is used to adjust the weights of the network connections during the training process.
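As a rough illustration of how the gradient drives a weight update, the sketch below applies plain gradient descent to a one-parameter toy loss. The loss function, the learning rate and the helper names are assumptions made for this example and are unrelated to the models of this notebook.

```python
def loss(w):
    """Toy quadratic loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Analytical derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=50):
    """Repeatedly move w against the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


w_final = gradient_descent(w=0.0)
print(f"w after training: {w_final:.4f}, loss: {loss(w_final):.6f}")
```

The size of each step is proportional to the gradient, which is why gradients that are too small or too large translate directly into updates that are too small or too large.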

If the gradient is too small, we face the _vanishing gradients_ problem: the weights of the network connections are updated in such small steps that the network can stall at a local minimum and fail to learn more complex patterns.
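A minimal numeric sketch of why this tends to happen in deep stacks of saturating activations: by the chain rule, the gradient that reaches an early layer is a product with one factor per layer, and each factor includes the derivative of the activation, which for `tanh` never exceeds 1 and is much smaller when the unit saturates. The pre-activation values and the unit weights used here are assumed for illustration only.

```python
import math


def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh(z)^2, always in (0, 1]."""
    return 1.0 - math.tanh(z) ** 2


def backprop_scale(pre_activations, weight=1.0):
    """Multiply the per-layer chain-rule factors (weight * activation derivative)."""
    scale = 1.0
    for z in pre_activations:
        scale *= weight * tanh_derivative(z)
    return scale


# Five layers whose units are fairly saturated (assumed pre-activation of 2.0).
saturated = backprop_scale([2.0] * 5)
# Five layers operating near zero, where tanh is almost linear.
linear_regime = backprop_scale([0.1] * 5)

print(f"gradient scale, saturated units: {saturated:.2e}")
print(f"gradient scale, near-zero units: {linear_regime:.2e}")
```

In TensorBoard, this kind of shrinkage usually shows up as weight histograms of the first layers that barely move from one epoch to the next.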

Let's do an exercise to see whether our model suffers from a _vanishing gradients_ problem. We will first terminate the `tensorboard` process launched above.

In [None]:
!kill $(ps -e | grep 'tensorboard' | awk '{print $1}')

Now we will create a different model and launch a new tensorboard to evaluate the training.

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(5, activation='tanh'),
    tf.keras.layers.Dense(5, activation='tanh'),
    tf.keras.layers.Dense(5, activation='tanh'),
    tf.keras.layers.Dense(5, activation='tanh'),
    tf.keras.layers.Dense(5, activation='tanh'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=['accuracy'])
model.summary()

log_dir = f'logs/{datetime.datetime.now().strftime("%Y%m%d-%H%M%S")}'
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

%tensorboard --logdir $log_dir

Let's run the training to see how it evolves.

In [None]:
history = model.fit(x_train, y_train, epochs=25, validation_split=0.1, verbose=0, callbacks=[tb_callback])

## Exploding gradients

If the gradient is too large, this can lead to _exploding gradients_ problems, where the weights of the network connections are updated in large jumps that can make the network unable to converge to an optimal solution.
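A minimal sketch of this growth: by the chain rule, the gradient that reaches an early layer is a product with one factor per layer. For an active `relu` unit the derivative of the activation is exactly 1, so in this simplified one-unit-per-layer chain the scale of the backpropagated gradient is governed by the weights alone. The weight magnitudes below are assumptions chosen to make the effect visible.

```python
def backprop_scale(weights):
    """Product of per-layer factors for a chain of active relu units (derivative 1)."""
    scale = 1.0
    for w in weights:
        scale *= w
    return scale


depth = 5
for w in (0.5, 1.0, 3.0):
    scale = backprop_scale([w] * depth)
    print(f"weight {w}: gradient scale after {depth} layers = {scale}")
```

With weights larger than 1 the product grows exponentially with depth, which is why the problem is more common in deep networks than in shallow ones.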

We will start by terminating the previous tensorboard process so that we can subsequently launch a new one.

In [None]:
!kill $(ps -e | grep 'tensorboard' | awk '{print $1}')

Since this problem is rare in shallow networks (although it does happen), it is difficult to reproduce in our current example. We will try to provoke it by scaling the inputs and the targets so that they are larger than they should be, forcing the values that travel through the network to be very high.

In [None]:
x_train, y_train = x_train * 10, y_train * 10
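To see why scaling the targets also inflates the gradients, the sketch below computes the gradient of the categorical cross-entropy with respect to the logits of a softmax layer, $\partial L / \partial z_j = p_j \sum_i y_i - y_j$, for a one-hot target and for the same target multiplied by 10. It assumes the loss is evaluated as $-\sum_i y_i \log p_i$ with no renormalization of the targets; the logits are arbitrary example values and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(targets, logits):
    """Gradient of -sum(y_i * log p_i) with respect to each logit."""
    probs = softmax(logits)
    target_sum = sum(targets)
    return [p * target_sum - y for p, y in zip(probs, targets)]


logits = [0.2, -1.0, 0.5]             # arbitrary example logits
one_hot = [0.0, 0.0, 1.0]             # regular one-hot target
scaled = [10.0 * y for y in one_hot]  # the same target multiplied by 10

grad_one_hot = cross_entropy_grad(one_hot, logits)
grad_scaled = cross_entropy_grad(scaled, logits)
print(grad_one_hot)
print(grad_scaled)
```

Under that assumption, every component of the gradient at the output layer is 10 times larger, and this factor is carried backwards through the rest of the network.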

We will now train a model on these scaled inputs and targets:

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(5, activation='relu', kernel_initializer='he_uniform'),
    tf.keras.layers.Dense(5, activation='relu', kernel_initializer='he_uniform'),
    tf.keras.layers.Dense(5, activation='relu', kernel_initializer='he_uniform'),
    tf.keras.layers.Dense(5, activation='relu', kernel_initializer='he_uniform'),
    tf.keras.layers.Dense(5, activation='relu', kernel_initializer='he_uniform'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=['accuracy'])
model.summary()

log_dir = f'logs/{datetime.datetime.now().strftime("%Y%m%d-%H%M%S")}'
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
%tensorboard --logdir $log_dir

We will now train our model.

In [None]:
history = model.fit(x_train, y_train, epochs=25, validation_split=0.1, verbose=0, callbacks=[tb_callback])

## Conclusions

This notebook has been intense, but in it we have seen the main differences between the two types of problems that we will encounter in deep learning: classification and regression. The models developed for them are very similar, differing basically in the output and in how its error is calculated.

Also, for the evaluation of these models we have presented some measures, some specific for classification and others for regression. There are some that we have not explained (e.g. cross-entropy) but we have preferred to stay with the most common ones. One good thing is that practically all frameworks include these implementations, probably much better than we can implement them ourselves. However, it is very important to know how we are measuring and what those measurements mean.

***

<div><img style="float: right; width: 120px; vertical-align:top" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" alt="Creative Commons by-nc-sa logo" />

[Back to top](#top)

</div>