### Structure of Deep Learning Frameworks: Computational Graph, Autodiff, and Optimizers



![Graph](https://i.stack.imgur.com/mCBrs.gif "graph")



# The benefits of graphs
With a graph, you have a great deal of flexibility. You can use your TensorFlow graph in environments that don't have a Python interpreter, like mobile applications, embedded devices, and backend servers. TensorFlow uses graphs as the format for saved models when it exports them from Python.

Graphs are also easily optimized, allowing the compiler to do transformations like:

- Statically infer the value of tensors by folding constant nodes in your computation ("constant folding").
- Separate sub-parts of a computation that are independent and split them between threads or devices.
- Simplify arithmetic operations by eliminating common subexpressions.

There is an entire optimization system, Grappler, to perform this and other speedups.
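
To make the first of these transformations concrete, here is a minimal sketch of constant folding on a toy expression graph. The `Const`, `Var`, and `Op` node classes and the `fold_constants` function are hypothetical names invented for this illustration. They are not part of TensorFlow or Grappler, whose real implementation is far more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Op:
    kind: str      # "add" or "mul" in this toy graph
    left: object
    right: object


def fold_constants(node):
    """Recursively replace operations whose inputs are all constants."""
    if not isinstance(node, Op):
        return node
    left = fold_constants(node.left)
    right = fold_constants(node.right)
    if isinstance(left, Const) and isinstance(right, Const):
        if node.kind == "add":
            return Const(left.value + right.value)
        if node.kind == "mul":
            return Const(left.value * right.value)
    # At least one input is only known at run time: keep the operation.
    return Op(node.kind, left, right)


# (2 * 3) + x: the multiplication can be evaluated before the graph ever runs.
graph = Op("add", Op("mul", Const(2.0), Const(3.0)), Var("x"))
folded = fold_constants(graph)
print(folded)
```

After folding, the graph contains a single addition of the constant `6.0` and the variable `x`, so the multiplication never has to be executed at run time.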

In short, graphs are extremely useful and let your TensorFlow run fast, run in parallel, and run efficiently on multiple devices.

However, you still want to define your machine learning models (or other computations) in Python for convenience, and then automatically construct graphs when you need them.

## Seeing the speed up

For complicated computations, graphs can provide a significant speedup. This is because graphs reduce Python-to-device communication and allow graph-level optimizations to be applied.

This code times a few runs on some small dense layers.

In [1]:
import tensorflow as tf
import timeit
from datetime import datetime
# Create a subclassed Keras model to classify pictures
class SequentialModel(tf.keras.Model):
  def __init__(self, **kwargs):
    super(SequentialModel, self).__init__(**kwargs)
    self.flatten = tf.keras.layers.Flatten(input_shape=(28, 28))
    self.dense_1 = tf.keras.layers.Dense(128, activation="relu")
    self.dropout = tf.keras.layers.Dropout(0.2)
    self.dense_2 = tf.keras.layers.Dense(10)

  def call(self, x):
    x = self.flatten(x)
    x = self.dense_1(x)
    x = self.dropout(x)
    x = self.dense_2(x)
    return x

input_data = tf.random.uniform([60, 28, 28])

eager_model = SequentialModel()
graph_model = tf.function(eager_model)

print("Eager time:", timeit.timeit(lambda: eager_model(input_data), number=10000))
print("Graph time:", timeit.timeit(lambda: graph_model(input_data), number=10000))

Eager time: 7.697414857335389
Graph time: 4.414290325250477


# TensorFlow 1.x

Typical TensorFlow 1.x code relies on the following patterns:

- Using `placeholder` to set up graph inputs
- Executing graphs with `Session.run`
- Initializing variables manually
- Accessing collections explicitly
- Accessing collections implicitly with methods like `global_variables` and `losses.get_regularization_loss`

In [7]:
import time
t1 = time.time()
import tensorflow as tf
tf.compat.v1.disable_eager_execution()
print(f'TF executing eagerly is {tf.executing_eagerly()}')

in_a = tf.compat.v1.placeholder(dtype=tf.float32, shape=(2))
in_b = tf.compat.v1.placeholder(dtype=tf.float32, shape=(2))

def forward(x):
  with tf.compat.v1.variable_scope("matmul", reuse=tf.compat.v1.AUTO_REUSE):
    W = tf.compat.v1.get_variable("W", initializer=tf.ones(shape=(2,2)),
                        regularizer=tf.keras.regularizers.L2(0.04))
    b = tf.compat.v1.get_variable("b", initializer=tf.zeros(shape=(2)))
    return W * x + b

out_a = forward(in_a)
out_b = forward(in_b)

reg_loss=tf.compat.v1.losses.get_regularization_loss(scope="matmul")

with tf.compat.v1.Session() as sess:
  sess.run(tf.compat.v1.global_variables_initializer())
  outs = sess.run([out_a, out_b, reg_loss],
                feed_dict={in_a: [1, 0], in_b: [0, 1]})
  print(outs)
print(f'Time took is {time.time()-t1} seconds')

TF executing eagerly is False
[array([[1., 0.],
       [1., 0.]], dtype=float32), array([[0., 1.],
       [0., 1.]], dtype=float32), 0.16]
Time took is 0.029334306716918945 seconds


# TensorFlow 2.x (à la PyTorch)

- The variables are local Python objects.
- The `Session.run` call is replaced with a call to `forward`.
- The optional **tf.function** decorator can be added for performance.
- The regularization losses are calculated manually, without referring to any global collection.
- No sessions or placeholders.

In [2]:
import time
t1 = time.time()
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
print(f'TF executing eagerly is {tf.executing_eagerly()}')

W = tf.Variable(tf.ones(shape=(2,2)), name="W")
b = tf.Variable(tf.zeros(shape=(2)), name="b")

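@tf.function
def forward(x):
  return W * x + b

out_b = forward([0,1])
out_a = forward([1,0])

# Compute the regularization loss manually instead of reading it from a global collection
regularizer = tf.keras.regularizers.l2(0.04)
reg_loss = regularizer(W)

print(out_a)
print(out_b)
print(reg_loss)
print(f'Time took is {time.time()-t1} seconds')

TF executing eagerly is True
tf.Tensor(
[[1. 0.]
 [1. 0.]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[0. 1.]
 [0. 1.]], shape=(2, 2), dtype=float32)
tf.Tensor(0.16, shape=(), dtype=float32)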
Time took is 0.03977394104003906 seconds


# Computing gradients
To differentiate automatically, TensorFlow needs to remember what operations happen in what order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.
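
The idea of recording operations and replaying them in reverse can be sketched in a few lines of plain Python. The `Tape` and `Value` classes and the `gradient` function below are hypothetical names for a toy illustration that supports only scalar addition and multiplication. It is not how TensorFlow is implemented internally.

```python
class Tape:
    """Stores one record per operation, in the order the operations ran."""

    def __init__(self):
        self.records = []  # (output, inputs, local_gradients)


class Value:
    """A scalar that logs every operation applied to it on a shared tape."""

    def __init__(self, data, tape):
        self.data = data
        self.tape = tape

    def __mul__(self, other):
        out = Value(self.data * other.data, self.tape)
        # d(out)/d(self) = other.data and d(out)/d(other) = self.data
        self.tape.records.append((out, (self, other), (other.data, self.data)))
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, self.tape)
        self.tape.records.append((out, (self, other), (1.0, 1.0)))
        return out


def gradient(tape, target, sources):
    """Walk the tape backwards, accumulating chain-rule products."""
    grads = {id(target): 1.0}
    for out, inputs, local_grads in reversed(tape.records):
        upstream = grads.get(id(out), 0.0)
        for inp, local in zip(inputs, local_grads):
            grads[id(inp)] = grads.get(id(inp), 0.0) + upstream * local
    return [grads.get(id(source), 0.0) for source in sources]


tape = Tape()
x = Value(3.0, tape)
y = x * x                       # forward pass: y = x^2 is recorded on the tape
(dy_dx,) = gradient(tape, y, [x])
print(dy_dx)                    # 6.0, since dy/dx = 2x at x = 3
```

Because `x` appears twice as an input of the multiplication, its gradient receives two contributions of `3.0`, which sum to the expected `6.0`.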

# Gradient tapes
TensorFlow provides the `tf.GradientTape` API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually `tf.Variable`s. TensorFlow "records" relevant operations executed inside the context of a `tf.GradientTape` onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse-mode differentiation.

Here is a simple example:

In [3]:
x = tf.Variable(3.0)

with tf.GradientTape() as tape:
  y = x**2

Once you've recorded some operations, use `GradientTape.gradient(target, sources)` to calculate the gradient of some target (often a loss) relative to some source (often the model's variables).

In [4]:
# dy = 2x * dx
dy_dx = tape.gradient(y, x)
dy_dx.numpy()

6.0

The above example uses scalars, but `tf.GradientTape` works as easily on any tensor:

In [6]:
w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = [[1., 2., 3.]]

with tf.GradientTape() as tape:
  y = x @ w + b
  loss = tf.reduce_mean(y**2)

To get the gradient of `loss` with respect to both variables, you can pass both as sources to the `gradient` method. The tape is flexible about how sources are passed: it accepts any nested combination of lists or dictionaries and returns the gradient structured the same way.

In [7]:
[dl_dw, dl_db] = tape.gradient(loss, [w, b])
print(w.shape)
print(dl_dw.shape)

(3, 2)
(3, 2)
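
Each gradient has the same shape as its source. As a sanity check, the gradients for this small example can also be derived by hand: with `loss = mean(y**2)` and `y = x @ w + b`, the chain rule gives `dl/dy = 2*y/n`, `dl/dw = x^T @ dl/dy`, and `dl/db = dl/dy`. The plain-Python sketch below compares that formula with a finite-difference estimate. The fixed weights are an assumption made for reproducibility (the cell above draws `w` at random), and the helper names are hypothetical.

```python
def forward(w, b, x):
    """Compute y = x @ w + b for one input row, then loss = mean(y**2)."""
    y = [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    return sum(v * v for v in y) / len(y)


def analytic_grads(w, b, x):
    """Chain rule: dl/dy = 2*y/n, dl/dw[i][j] = x[i] * dl/dy[j], dl/db = dl/dy."""
    y = [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    dl_dy = [2.0 * v / len(y) for v in y]
    dl_dw = [[x[i] * dl_dy[j] for j in range(len(b))] for i in range(len(x))]
    return dl_dw, list(dl_dy)


def numeric_dl_dw(w, b, x, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grads = [[0.0] * len(b) for _ in x]
    for i in range(len(x)):
        for j in range(len(b)):
            w[i][j] += eps
            up = forward(w, b, x)
            w[i][j] -= 2 * eps
            down = forward(w, b, x)
            w[i][j] += eps  # restore the original weight
            grads[i][j] = (up - down) / (2 * eps)
    return grads


w = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]  # illustrative values, shape (3, 2)
b = [0.0, 0.0]
x = [1.0, 2.0, 3.0]

dl_dw, dl_db = analytic_grads(w, b, x)
approx = numeric_dl_dw(w, b, x)
max_error = max(abs(dl_dw[i][j] - approx[i][j]) for i in range(3) for j in range(2))
print(len(dl_dw), len(dl_dw[0]))  # 3 2, the same shape as w
print(max_error < 1e-6)
```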


# Gradients with respect to a model
It's common to collect `tf.Variable`s into a `tf.Module` or one of its subclasses (`layers.Layer`, `keras.Model`) for checkpointing and exporting.

In most cases, you will want to calculate gradients with respect to a model's trainable variables. Since all subclasses of `tf.Module` aggregate their variables in the `Module.trainable_variables` property, you can calculate these gradients in a few lines of code:



In [8]:
layer = tf.keras.layers.Dense(2, activation='relu')
x = tf.constant([[1., 2., 3.]])

with tf.GradientTape() as tape:
  # Forward pass
  y = layer(x)
  loss = tf.reduce_mean(y**2)

# Calculate gradients with respect to every trainable variable
grad = tape.gradient(loss, layer.trainable_variables)

for var, g in zip(layer.trainable_variables, grad):
  print(f'{var.name}, shape: {g.shape}')
  

dense/kernel:0, shape: (3, 2)
dense/bias:0, shape: (2,)


# Optimization
GPUs and TPUs can radically reduce the time required to execute a single training step. Achieving peak performance requires an efficient input pipeline that delivers data for the next step before the current step has finished. The `tf.data` API helps to build flexible and efficient input pipelines. 

## The dataset
Define a class inheriting from `tf.data.Dataset` called `ArtificialDataset`. This dataset:

- generates num_samples samples (default is 3)
- sleeps for some time before the first item to simulate opening a file
- sleeps for some time before producing each item to simulate reading data from a file



In [9]:
class ArtificialDataset(tf.data.Dataset):
    def _generator(num_samples):
        # Opening the file
        time.sleep(0.03)
        
        for sample_idx in range(num_samples):
            # Reading data (line, record) from the file
            time.sleep(0.015)
            
            yield (sample_idx,)
    
    def __new__(cls, num_samples=3):
        return tf.data.Dataset.from_generator(
            cls._generator,
            output_types=tf.dtypes.int64,
            output_shapes=(1,),
            args=(num_samples,)
        )

### The training loop
Let's say we have a dummy training loop that measures how long it takes to iterate over a dataset. Training time is simulated.

In [11]:
def benchmark(dataset, num_epochs=2):
    start_time = time.perf_counter()
    for epoch_num in range(num_epochs):
        for sample in dataset:
            # Performing a training step
            time.sleep(0.01)
    tf.print("Execution time:", time.perf_counter() - start_time)

### Optimize performance
To show how performance can be optimized, the following sections improve the performance of the `ArtificialDataset` pipeline step by step.

### The naive approach
Start with a naive pipeline using no tricks, iterating over the dataset as-is.

In [12]:
benchmark(ArtificialDataset())

Execution time: 0.27469153702259064


Under the hood, this is how your execution time was spent:

![Naive](https://www.tensorflow.org/guide/images/data_performance/naive.svg)

You can see that performing a training step involves:

- opening a file if it hasn't been opened yet,
- fetching a data entry from the file,
- using the data for training.

However, in a naive synchronous implementation like the one here, while your pipeline is fetching the data, your model is sitting idle.
Conversely, while your model is training, the input pipeline is sitting idle.
The training step time is thus the sum of the opening, reading, and training times.

The next sections build on this input pipeline to illustrate how to design performant TensorFlow input pipelines.

### Prefetching

Prefetching overlaps the preprocessing and model execution of a training step.
While the model is executing training step `s`, the input pipeline is reading the data for step `s+1`.
Doing so reduces the step time to the maximum (as opposed to the sum) of the training time and the time it takes to extract the data.
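
The "sum versus maximum" argument can be checked with a back-of-the-envelope model built from the sleep durations used in `ArtificialDataset` and `benchmark`. This is an idealized sketch: it ignores all framework overhead and assumes that prefetching restarts at every epoch, so the measured times in this section are expected to be somewhat higher.

```python
OPEN_S = 0.03    # sleep before the first item (opening the file)
READ_S = 0.015   # sleep before each item (reading a record)
TRAIN_S = 0.01   # sleep per training step in the benchmark loop
NUM_SAMPLES = 3
NUM_EPOCHS = 2


def naive_epoch():
    # Synchronous pipeline: every step pays for reading AND training.
    return OPEN_S + NUM_SAMPLES * (READ_S + TRAIN_S)


def prefetched_epoch():
    # The first read cannot overlap with anything. After that, each step
    # costs max(read, train), and the final training step runs alone.
    return OPEN_S + READ_S + (NUM_SAMPLES - 1) * max(READ_S, TRAIN_S) + TRAIN_S


naive_total = NUM_EPOCHS * naive_epoch()
prefetched_total = NUM_EPOCHS * prefetched_epoch()
print(f"naive lower bound:      {naive_total:.3f} s")
print(f"prefetched lower bound: {prefetched_total:.3f} s")
```

Under these assumptions the naive pipeline needs at least about 0.21 s and the prefetched one about 0.17 s for two epochs.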

The `tf.data` API provides the `tf.data.Dataset.prefetch` transformation.
It can be used to decouple the time when data is produced from the time when data is consumed.
In particular, the transformation uses a background thread and an internal buffer to prefetch elements from the input dataset ahead of the time they are requested.
The number of elements to prefetch should be equal to (or possibly greater than) the number of batches consumed by a single training step.
You could either manually tune this value, or set it to `tf.data.experimental.AUTOTUNE` which will prompt the
`tf.data` runtime to tune the value dynamically at runtime.

Note that the prefetch transformation provides benefits any time there is an opportunity to overlap the work of a "producer" with the work of a "consumer."
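
The producer/consumer overlap can be illustrated without TensorFlow. The sketch below uses a background thread and a bounded queue, which is the general mechanism described above. The `prefetch`, `slow_reader`, and `consume` helpers are hypothetical names for this illustration, and the sketch omits error handling (an exception in the producer thread is not propagated to the consumer).

```python
import queue
import threading
import time


def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable`, produced ahead of time by a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)   # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def slow_reader(num_samples, read_s=0.015):
    for i in range(num_samples):
        time.sleep(read_s)     # simulated read from a file
        yield i


def consume(stream, train_s=0.01):
    start = time.perf_counter()
    seen = []
    for item in stream:
        time.sleep(train_s)    # simulated training step
        seen.append(item)
    return seen, time.perf_counter() - start


naive_items, naive_time = consume(slow_reader(10))
prefetched_items, prefetched_time = consume(prefetch(slow_reader(10)))
print(f"naive: {naive_time:.3f} s, prefetched: {prefetched_time:.3f} s")
```

Both runs see the same items in the same order. On an otherwise idle machine the prefetched run is typically faster, because reading item `i+1` overlaps with the simulated training step on item `i`.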

In [13]:
benchmark(
    ArtificialDataset()
    .prefetch(tf.data.experimental.AUTOTUNE)
)

Execution time: 0.2035039570182562


![Prefetched](https://www.tensorflow.org/guide/images/data_performance/prefetched.svg)

This time you can see that while the training step is running for sample 0, the input pipeline is reading the data for sample 1, and so on.

### Parallelizing data extraction

In a real-world setting, the input data may be stored remotely (for example, NFS or HDFS).
A dataset pipeline that works well when reading data locally might become bottlenecked on I/O when reading data remotely because of the following differences between local and remote storage:

*   **Time-to-first-byte:** Reading the first byte of a file from remote storage can take orders of magnitude longer than from local storage.
*   **Read throughput:** While remote storage typically offers large aggregate bandwidth, reading a single file might only be able to utilize a small fraction of this bandwidth.

In addition, once the raw bytes are loaded into memory, it may also be necessary to deserialize and/or decrypt the data, which requires additional computation.
This overhead is present irrespective of whether the data is stored locally or remotely, but can be worse in the remote case if data is not prefetched effectively.

To mitigate the impact of the various data extraction overheads, the `tf.data.Dataset.interleave` transformation can be used to parallelize the data loading step, interleaving the contents of other datasets (such as data file readers).
The number of datasets to overlap can be specified by the `cycle_length` argument, while the level of parallelism can be specified by the `num_parallel_calls` argument. Similar to the `prefetch` transformation, the `interleave` transformation supports `tf.data.experimental.AUTOTUNE` which will delegate the decision about what level of parallelism to use to the `tf.data` runtime.

#### Sequential interleave

The default arguments of the `tf.data.Dataset.interleave` transformation make it interleave single samples from two datasets sequentially.

![Sequential interleave](https://www.tensorflow.org/guide/images/data_performance/sequential_interleave.svg)

This plot shows the behavior of the `interleave` transformation, which fetches samples alternately from the two available datasets.
However, this yields no performance improvement.
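
The ordering produced by a sequential interleave can be sketched in plain Python. The `interleave` function below is a hypothetical, simplified helper: its `cycle_length` and `block_length` parameters mirror the names used by `tf.data`, but it is not guaranteed to reproduce the exact element order of `tf.data.Dataset.interleave` in every edge case.

```python
from itertools import islice


def interleave(sources, cycle_length=2, block_length=1):
    """Round-robin over up to `cycle_length` sources, `block_length` items at a time."""
    pending = iter(sources)
    active = [iter(s) for s in islice(pending, cycle_length)]
    while active:
        for it in list(active):
            block = list(islice(it, block_length))
            yield from block
            if len(block) < block_length:
                # This source is exhausted: swap in the next pending one, if any.
                idx = active.index(it)
                nxt = next(pending, None)
                if nxt is None:
                    active.pop(idx)
                else:
                    active[idx] = iter(nxt)


two_files = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]
order = list(interleave(two_files))
print(order)  # ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
```

Everything here runs on a single thread, so the samples alternate but the total time is unchanged, which matches the observation that sequential interleaving alone brings no speedup.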



In [22]:
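benchmark(
    tf.data.Dataset.range(5)
    # interleave calls ArtificialDataset(element), so each element of
    # range(5) is passed to the dataset as num_samples
    .interleave(ArtificialDataset)
)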

Execution time: 0.9094312000088394


#### Parallel interleave

Now use the `num_parallel_calls` argument of the `interleave` transformation.
This loads multiple datasets in parallel, reducing the time waiting for the files to be opened.

In [23]:
benchmark(
    tf.data.Dataset.range(5)
    .interleave(
        ArtificialDataset,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
)

Execution time: 0.7138526570051908


![Parallel interleave](https://www.tensorflow.org/guide/images/data_performance/parallel_interleave.svg)

This time, the datasets are read in parallel, reducing the overall data processing time.

Further optimizations are possible, such as:

- Mapping
- Caching

# Model Performance
- Distributed Training
> This afternoon
- Mixed Precision
> To be discussed in Hands-On
- XLA

## XLA: Optimizing Compiler for Machine Learning

- XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization.

```
def model_fn(x, y, z):
  return tf.reduce_sum(x + y * z)
```
- XLA can optimize the graph so that it computes the result in a single kernel launch. It does this by "fusing" the addition, multiplication and reduction into a single GPU kernel.
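
The effect of fusion can be pictured with a plain-Python analogy, in which each list comprehension stands in for a separate kernel that writes its result to memory. This is only an illustration of the idea, using hypothetical `unfused` and `fused` helpers. It says nothing about how XLA actually generates code.

```python
def unfused(x, y, z):
    prod = [b * c for b, c in zip(y, z)]       # "kernel" 1: multiply, writes a temporary
    total = [a + p for a, p in zip(x, prod)]   # "kernel" 2: add, writes another temporary
    return sum(total)                          # "kernel" 3: reduce


def fused(x, y, z):
    acc = 0.0
    for a, b, c in zip(x, y, z):               # single pass, no intermediate buffers
        acc += a + b * c
    return acc


x, y, z = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]
print(unfused(x, y, z), fused(x, y, z))        # 128.0 128.0
```

Both versions compute the same value, but the fused one never materializes the intermediate results of `y * z` and `x + y * z`.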

In [1]:
import tensorflow as tf
import timeit
from datetime import datetime

tf.keras.backend.clear_session()
tf.config.optimizer.set_jit(False) # Start with XLA disabled.

def load_data():
  (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
  x_train = x_train.astype('float32') / 256
  x_test = x_test.astype('float32') / 256

  # Convert class vectors to binary class matrices.
  y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
  y_test = tf.keras.utils.to_categorical(y_test, num_classes=10)
  return ((x_train, y_train), (x_test, y_test))

(x_train, y_train), (x_test, y_test) = load_data()


# Create a model to classify pictures
def generate_model():
  return tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), padding='same', input_shape=x_train.shape[1:]),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Conv2D(32, (3, 3)),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),

    tf.keras.layers.Conv2D(64, (3, 3), padding='same'),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Conv2D(64, (3, 3)),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),

    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Activation('softmax')
  ])

model = generate_model()


def compile_model(model):
  # `lr` is a deprecated alias; `learning_rate` is the current argument name
  opt = tf.keras.optimizers.RMSprop(learning_rate=0.0001, decay=1e-6)
  model.compile(loss='categorical_crossentropy',
                optimizer=opt,
                metrics=['accuracy'])
  return model

model = compile_model(model)

def train_model(model, x_train, y_train, x_test, y_test, epochs=5):
  model.fit(x_train, y_train, batch_size=256, epochs=epochs, validation_data=(x_test, y_test), shuffle=True)


%time train_model(model, x_train, y_train, x_test, y_test)

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz