# A deeper look into TensorFlow training

This tutorial will focus on some advanced functions in TensorFlow. These allow for a more fine-grained control of training ops and significant training boosts.

In [1]:
!pip install --upgrade tensorflow
import tensorflow as tf
assert tf.__version__[0] == '2', 'this tutorial is for tensorflow versions of 2 or higher'

import numpy as np
import time

Requirement already up-to-date: tensorflow in /usr/local/lib/python3.6/dist-packages (2.1.0)


## Computational Graph

If we have a sequence of operations, we can represent these as nodes in a graph. In this representation, the edges would be the tensors that flow from operation to operation.

For example, a fully connected layer that performs the operation:

$$
z = f(W \cdot X + b)
$$

This series of operations would be represented as a computational graph as follows.

xxxxxxxxxxxx

It wouldn't be hard to imagine how a whole Neural Network, along with its loss function would be represented in this fashion. Now we can picture an alternate training process, where we would first define the full computation graph and then run the actual computations.

An analogous to this would be a regular function:

```python
# 1) Eagerly compute the result:
z = f(w * x + b)

# 2) Lazily compute the result:
def fc(x):
  return f(w * x + b)

# ... 
# at this point we have defined which operations
# we want to run, i.e. the "computation graph"
# ...

z = fc(x)  # run the actual computation
```

This computational model has several advantages over eagerly executing the operations.

- Computation graphs can be **simplified**. E.g. the operation `c = a + b - b` could be simplified into `c = a` even before we receive the actual values of `a` and `b`.

- We can add to each node two operation one for the **regular** path and one for the **reverse** path. This way our computation graph can be bidirectional. This is very useful for **backpropagation**, where we want the input tensors to flow in the forward path and the gradients in the backward path.

- We don't need to know the exact shape of the input when creating the graph. In practice this translates to **not** needing to specify the **batch size** when defining the model's architecture.

- There are several **optimization** tricks that TensorFlow can make to computation graphs, e.g. parallel processing.

It has its disadvantages also:

- **Complexity**. Imagine if you had to define a function every time you needed to do a simple addition in python!

- **Error Handling**. If we make a mistake when defining the graph, we'll know much later when actually running the graph.

These disadvantages have led to eager execution being the default view of TensorFlow in version 2. However, in order to tap into the potential of this framework we'll need to have a basic understanding of how this works.

Luckily, TensorFlow provides an awesome tool to make our life easier: [AutoGraph](https://www.tensorflow.org/api_docs/python/tf/autograph). This is involked through the [`tf.function`](https://www.tensorflow.org/api_docs/python/tf/function) decorator and converts any regular python syntax into a TensorFlow computation graph!

In [2]:
# Regular keras FC layer
fc = tf.keras.layers.Dense(100)

# Same operation converted to a TensorFlow graph
@tf.function
def fc_graph(x):
  return fc(x)

# Create dummy data
X = tf.random.normal(shape=(1000, 100))

# Run each operation once for warmup
# (we have an initial "cost" of having to construct the graph, to evaluate which
# of the two is faster we need to take this out of the equation.)
fc(X); fc_graph(X)

# Run a forward pass on both keras and our tf-wrapped layer
t1 = time.time()
z1 = fc(X)
t2 = time.time()
z2 = fc_graph(X)
t3 = time.time()

# Assert that we performed the same operation and we got the same result
print('Same result:', np.array_equal(z1, z2))

# Print the results
print('Eager time: {:.2f}ms'.format((t2 - t1) * 1000))
print('Graph time: {:.2f}ms'.format((t3 - t2) * 1000))

Same result: True
Eager time: 0.41ms
Graph time: 0.37ms


This difference magnifies in other types of layers. The following example was taken from the [official TensorFlow guide on tf.functions](https://www.tensorflow.org/guide/function#the_tffunction_decorator),

In [3]:
# Keras LSTM layer
lstm_cell = tf.keras.layers.LSTMCell(10)

# Convert to tf Graph
@tf.function
def lstm_fn(input, state):
  return lstm_cell(input, state)

# Generate "dummy" input data
inp = tf.zeros([10, 10])
state = [tf.zeros([10, 10])] * 2

# Warmup
lstm_cell(inp, state); lstm_fn(inp, state)

# Run the benchmark
t1 = time.time()
z1 = lstm_cell(inp, state)
t2 = time.time()
z2 = lstm_fn(inp, state)
t3 = time.time()

print('Same result:', np.array_equal(z1, z2))

print('Eager time: {:.2f}ms'.format((t2 - t1) * 1000))
print('Graph time: {:.2f}ms'.format((t3 - t2) * 1000))

Same result: False
Eager time: 0.83ms
Graph time: 0.49ms


Here the forward pass takes up almost **half the time** when run in graph mode!

Some other things to note about the `tf.function` wrapper are that:

- If we have a nested structure we just need to apply the decorator to the outer function. It will work on its own in the inner function as well. For example:

```python
# no need for @tf.function here
def a(x):
    return 2 * x

@tf.function
def b(x):
    return 3 * a(x)
```

- AutoGraph inhenetly handles any python control structures by converting the native python ops to TensorFlow graph ops. For example, python's `while` will become [`tf.while_loop`](https://www.tensorflow.org/api_docs/python/tf/while_loop), `if` will become [`tf.cond`](https://www.tensorflow.org/api_docs/python/tf/cond), etc.

- AutoGraph is enabled **by default** for all non-dynamic keras models.

- The first time a `tf.function` decorated function is involked, the AutoGraph module is called to construct the graph. This procedure is called [tracing](https://www.tensorflow.org/guide/function) and takes time. For this reason reason avoid applying `tf.function` to low-level local function; rather try to decorate module-level functions and class methods.  *Note: this is the reason we used a "warmup" when timing the execution before.*  A more in-depth analysis of this can be found [here](https://www.tensorflow.org/guide/function#re-tracing).

- Graphs were the "default view" of TensorFlow in versions `< 2.0`. TensorFlow still allows for the [direct (i.e. manual) construction of a graph](https://www.tensorflow.org/api_docs/python/tf/Graph#using_graphs_directly_deprecated). However, this approach is **deprecated** and should be avoided.

- Asynchronous batch prefetching (i.e. the `tf.Dataset` feature we saw in the previous tutorial, where batches were prepared while the model was training) is [only available](https://www.tensorflow.org/guide/effective_tf2#combine_tfdatadatasets_and_tffunction) when running in graph mode.

- Several **limitations** of AutoGraph (along with their workarounds) can be found [here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/g3doc/reference/limitations.md). It is worthwile to go through them if you're planning to use the `tf.function` decorator.

- TensorFlow uses a tool called [Grappler] for **optimizing** its graphs. These optimizations include *pruning* redundant nodes, *stripping* debugging operations, *low-level memory mapping*, *parallelization* and many more. More info about these can be found [here](https://web.stanford.edu/class/cs245/slides/TFGraphOptimizationsStanford.pdf). Most of these are **enabled by default**, however Grappler also allows for a more [fine-grained control](https://www.tensorflow.org/api_docs/python/tf/config/optimizer/set_experimental_options).  

## Custom Training Loop

In the previous tutorials we saw how we can easily train keras models through the `.fit()` and `.fit_generator()` methods. One feature of these methods is that they intentionally **hide the training loop**.

What would normally be

```python
for e in range(epochs):
  for x, y in zip(x_train, y_train):
    # ...
    # train on batch
    # ...
```

becomes

```python
model.fit(x_train, y_train, epochs=epochs)
```

This is done for the sake of simplicity, but in the expense of **control**. To affect some aspects of the training process (e.g. learning rate) or to monitor variables, we need to use one of the existing or write a custom callback.

Still there are limitations to what we can affect or monitor through the use of callbacks! For this reason, in some cases we might want to train the model in a custom loop, where we have full control.

The fundamental steps we need to perform **at each iteration** are:

1. Generate the training **batch** (samples + labels).
2. Perform the **forward pass** and get the model's **predictions** for this batch.
3. Calculate the **loss** of these predictions compared to the actual labels.
4. Compute the loss' **gradients** w.r.t the model's trainable parameters (i.e. **backprop**).
5. **Update** the model's parameters according to the gradients (i.e. **optimizer**).

Let's try by defining our dataset, like we saw in the precious tutorial. This time we'll use the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset, which consists of $60000$, $32 \times 32$, RGB images, evenly distributed into $10$ classes.

In [0]:
def preprocess_data(x, y):
  '''
  Preprocess a single image-label pair. 
  '''
  x = tf.image.convert_image_dtype(x, tf.float32)
  y = tf.one_hot(tf.cast(tf.squeeze(y), tf.int32), 10)
  return x, y

def generate_cifar():
  '''
  Generate the train and test set datasets for CIFAR-10.
  '''
  (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

  train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
  test = tf.data.Dataset.from_tensor_slices((x_test, y_test))

  train = train.map(preprocess_data)
  test = test.map(preprocess_data)
  
  train = train.shuffle(50000)
  test = test.shuffle(10000)

  train = train.repeat()
  test = test.repeat()
  
  train = train.batch(256)
  test = test.batch(256)
  
  train.prefetch(1)
  test.prefetch(1)

  return train, test

train_set, test_set = generate_cifar()

Now, let's build a simple 4-layer keras CNN.

In [5]:
def make_cnn(input_shape=(32, 32, 3)):
  inp = tf.keras.layers.Input(input_shape)
  c1 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu')(inp)
  c2 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu')(c1)
  c3 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu')(c2)
  fl = tf.keras.layers.Flatten()(c3)
  out = tf.keras.layers.Dense(10, activation='softmax')(fl)

  model = tf.keras.models.Model(inp, out)

  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

  return model

model = make_cnn()

model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 32, 32, 3)]       0         
_________________________________________________________________
conv2d (Conv2D)              (None, 30, 30, 32)        896       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 64)        18496     
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 26, 26, 128)       73856     
_________________________________________________________________
flatten (Flatten)            (None, 86528)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                865290    
Total params: 958,538
Trainable params: 958,538
Non-trainable params: 0
_______________________________________________________

Let train the model through keras' `.fit()` API as a baseline.

In [6]:
start_time = time.time()
model.fit(train_set, epochs=5, steps_per_epoch=(50000//128), 
          validation_data=test_set, validation_steps=(10000//128))
print('Time elapsed: {:.2f}'.format(time.time() - start_time))

Train for 390 steps, validate for 78 steps
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Time elapsed: 64.77


Now let's create a custom training loop to do the same thing.

For simplicity we'll first create a function that performs a single training step on the data.

In [0]:
# Define the same loss function, the optimizer and the metric we used previsouly
loss_function = tf.keras.losses.categorical_crossentropy
optimizer = tf.keras.optimizers.Adam()
accuracy = tf.keras.metrics.Accuracy()


# We'll now define a function that performs a single training step. We don't
# need to apply the tf.function decorator here, because this function isn't 
# intended to be a top-level function (i.e. it will be called from another).
def train_on_batch(x, y):
  '''
  Will train 'model' as defined in the global scope for a single training batch.
  The optimizer and loss function will also be taken from the global scope.
  '''
  # Start monitoring the operations in order to compute the gradient:
  with tf.GradientTape() as tape:
  
    # Generate the model's prediction:
    y_hat = model(x)

    # Calculate its loss:
    loss = loss_function(y, y_hat)

  # Compute the gradient of the loss w.r.t the model's parameters
  grads = tape.gradient(loss, model.trainable_variables)

  # Update the model's parameters through the optimizer
  optimizer.apply_gradients(zip(grads, model.trainable_variables))

  # Store the batch's accuracy
  acc = accuracy(tf.argmax(y, axis=1), tf.argmax(y_hat, axis=1))

  # Get the batch's loss so that we can plot it later
  return tf.reduce_mean(loss), tf.reduce_mean(acc)

Now let's write a function that uses the previous to train the model for a number of epochs.

In [0]:
train_loss = tf.keras.metrics.Mean()
train_acc = tf.keras.metrics.Mean()
valid_acc = tf.keras.metrics.Mean()

@tf.function
def train(train_data, epochs, validation_data=None):
  '''
  Train 'model' as defined in the global scope for a number of epochs, on a 
  given dataset. 
  '''

  for e in range(epochs):

    tf.print('Epoch {} of {}'.format(e+1, epochs))
    
    i = 0
    loss = 0.
    acc = 0.

    # Training epoch:
    for x, y in train_data:

      i += 1

      # Perform a single iteration
      loss, acc = train_on_batch(x, y)

      # We need to manually keep track of the iteration,
      # because this generator will loop forever 
      if i >= 50000 // 128 + 1:
        break

      if i % 100 == 0:
        tf.print('    Iteration:', i)
        tf.print('        Training loss:', loss)
        tf.print('        Training acc:', acc)

    if validation_data:

      i = 0
      valid_acc = 0.

      # Iterate over the validation set:
      for x, y in validation_data:

        i += 1

        # Make a prediction and compute the mean of the accuracy and the loss
        preds = model(x)
        
        # Compute the mean accuracy of each batch and add them to 'valid_acc'
        valid_acc += accuracy(tf.argmax(y, axis=1), tf.argmax(preds, axis=1))
        
        # Again keep track of the iteration in order to terminate the loop
        if i >= 10000 // 128 + 1:
          break

      tf.print('    Validation accuracy:', valid_acc / tf.cast(i, tf.float32))

We can finally train our model now.

In [9]:
model = make_cnn()
start_time = time.time()
train(train_set, epochs=5, validation_data=test_set)
print('Time elapsed: {:.2f}sec'.format(time.time() - start_time))

Epoch 1 of 5
    Iteration: 100
        Training loss: 1.33137321
        Training acc: 0.398554683
    Iteration: 200
        Training loss: 1.29405034
        Training acc: 0.469453126
    Iteration: 300
        Training loss: 1.08646286
        Training acc: 0.516992211
    Validation accuracy: 0.554237247
Epoch 2 of 5
    Iteration: 100
        Training loss: 0.895894587
        Training acc: 0.582689166
    Iteration: 200
        Training loss: 0.765978038
        Training acc: 0.599661827
    Iteration: 300
        Training loss: 0.777137578
        Training acc: 0.616878033
    Validation accuracy: 0.631168425
Epoch 3 of 5
    Iteration: 100
        Training loss: 0.60319674
        Training acc: 0.64668721
    Iteration: 200
        Training loss: 0.44415462
        Training acc: 0.657404721
    Iteration: 300
        Training loss: 0.566433847
        Training acc: 0.670649588
    Validation accuracy: 0.679828286
Epoch 4 of 5
    Iteration: 100
        Training loss: 0.4218583

In the next tutorial we'll see a visualization tool TensorFlow offers to debug our model's graph and inspect its performance during training.