# Deep Learning Frameworks - An Introduction

## 1. Choosing a framework

A wide variety of deep learning frameworks is available:

-   [TensorFlow](https://tensorflow.org) (comprehensive, widely used)
-   [PyTorch](https://pytorch.org/) (very flexible, good for recurrent
    networks)
-   [Caffe](http://caffe.berkeleyvision.org/) (well-established,
    comprehensive) and [Caffe2](https://caffe2.ai/)
-   [Flux](https://fluxml.ai/) (generic package for differentiable
    computing)
-   [MatConvNet](http://www.vlfeat.org/matconvnet/), [Matlab Neural
    Network
    Toolbox](https://www.mathworks.com/products/neural-network.html)
    (for Matlab users)
-   [Microsoft Cognitive
    Toolkit](https://www.microsoft.com/en-us/cognitive-toolkit/) (for
    Windows users)
-   [Theano](http://www.deeplearning.net/software/theano/),
    [MXNet](https://mxnet.apache.org/), [Chainer](https://chainer.org/),
    [PaddlePaddle](http://www.paddlepaddle.org/), many others

### Some common trends

-   Emergence of common frontends (e.g., [Keras](https://keras.io))
-   Emergence of exchange formats (e.g., [ONNX](https://onnx.ai/))
-   Move towards dynamic computation, eager execution, tight Python
    integration

### The main contenders

|               | TensorFlow                | PyTorch                      |
|---------------|:--------------------------|:-----------------------------|
| open source   | yes                       | yes                          |
| backed by     | Google Brain              | Facebook                     |
| user base     | large, including industry | growing, mainly research     |
| visualization | TensorBoard               | TensorBoard plugin           |
| deployment    | TensorFlow Serving        | some support via TorchScript |
| advantages    | large feature set         | good Python integration      |
|               | usage-specific interfaces | good for dynamic networks    |

-   Recently, both frameworks have grown more similar to each other
    -   TensorFlow includes dynamic graphs & eager execution
    -   PyTorch includes graph compilation

## 2. A quick tour of TensorFlow

-   Main frontend for TensorFlow is Python
-   Import and use it as a Python module, similar to `numpy`

In [None]:
import tensorflow as tf

### TensorFlow as a calculator

-   Most basic level: **tensors** (multidimensional arrays) and
    **operations** on them

In [None]:
x = tf.random.normal(shape=(2,2))
print("x:", x)

y = tf.linspace(0.0, 1.0, 4)
print("y:", y.numpy())

y = tf.reshape(y, shape=(2,2))
xy = tf.matmul(x, y)
print("x * y (matrix op):", xy.numpy().squeeze())

-   Results are available immediately (**eager execution**)
-   No **graph** is built at this point: each operation is dispatched to
    the device (GPU, if available) as soon as it is called
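
To make the difference between element-wise and matrix operations concrete, here is a small sketch with hand-picked values. The tensors `a` and `b` are illustrative and not part of the cells above.

```python
import tensorflow as tf

# two small 2x2 tensors with hand-picked values
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[0.0, 1.0], [1.0, 0.0]])

elementwise = a * b             # multiplies matching entries
matrix_prod = tf.matmul(a, b)   # matrix product

# results are concrete values right away and convert to NumPy arrays
print("element-wise:", elementwise.numpy())
print("matrix product:", matrix_prod.numpy())
print("shape:", matrix_prod.shape, "dtype:", matrix_prod.dtype.name)
```

Note that `*` on tensors is element-wise, as in `numpy`; the matrix product needs `tf.matmul`.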

### Compiling a computation

-   Eager execution: run tensor operations on device
-   Compiled graph: run everything on device, including glue code
-   To compile, annotate with `@tf.function`

In [None]:
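# plain Python function: every operation is dispatched eagerly
def function1(x, y):
    z = tf.matmul(x, y)
    for i in range(100):
        z += tf.matmul(x, y)
    return z

# same computation, traced once and then run as a compiled graph
@tf.function
def function2(x, y):
    z = tf.matmul(x, y)
    for i in range(100):
        z += tf.matmul(x, y)
    return z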

In [None]:
x = tf.random.normal(shape=(100, 1000))
y = tf.random.normal(shape=(1000, 500))

%timeit -n 5 -r 10 function1(x, y)
%timeit -n 5 -r 10 function2(x, y)
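
One way to see what `@tf.function` does is to record how often the Python body actually runs. The following is a minimal sketch, assuming TensorFlow 2.x; the names `trace_log` and `accumulate` are made up for this illustration.

```python
import tensorflow as tf

trace_log = []  # plain Python list, only touched while the function is traced

@tf.function
def accumulate(x, y):
    trace_log.append("traced")   # Python side effect: runs during tracing only
    z = tf.matmul(x, y)
    for _ in range(3):           # Python loop is unrolled into the graph
        z += tf.matmul(x, y)
    return z

x = tf.ones((2, 3))
y = tf.ones((3, 2))
first = accumulate(x, y)    # traces the function and builds the graph
second = accumulate(x, y)   # same shapes and dtypes: reuses the graph
print("number of traces:", len(trace_log))
```

The Python body runs once to build the graph; later calls with matching input shapes and types run the stored graph, which is why per-call Python overhead goes away.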

### Grouping code in modules

-   Core building block for new code: `tf.Module`
    -   roughly corresponds to a node in the computational graph
    -   provides tools for managing graph (names, variables, submodules)
-   New code should subclass `tf.Module`

In [None]:
class LinearRegressor(tf.Module):
    def __init__(self, input_size, output_size, name=None):
        super().__init__(name=name)
        # trainable parameters: weight matrix and bias vector
        self.w = tf.Variable(tf.random.normal([input_size, output_size]), name='w')
        self.b = tf.Variable(tf.zeros([output_size]), name='b')

    def __call__(self, x):
        # x has shape [batch, input_size], result has shape [batch, output_size]
        y = tf.matmul(x, self.w) + self.b
        return y

In [None]:
# instantiate regressor with 5 inputs and 1 output
r = LinearRegressor(5, 1)
# apply the regressor to a batch of 10 inputs, each of size 5
r(tf.random.normal([10, 5]))
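
The variable management that `tf.Module` provides can be inspected directly. Below is a small sketch with a hypothetical module `Scale` (not part of the cells above) that holds one trainable and one non-trainable variable.

```python
import tensorflow as tf

class Scale(tf.Module):
    """Hypothetical module computing x * gain + offset."""
    def __init__(self, size, name=None):
        super().__init__(name=name)
        self.gain = tf.Variable(tf.ones([size]), name="gain")
        # excluded from training on purpose, to show the difference below
        self.offset = tf.Variable(tf.zeros([size]), name="offset", trainable=False)

    def __call__(self, x):
        return x * self.gain + self.offset

s = Scale(3, name="scale")
out = s(tf.ones([2, 3]))

# tf.Module collects variables assigned as attributes automatically
print("module name:", s.name)
print("all variables:", len(s.variables))
print("trainable variables:", len(s.trainable_variables))
```

Collections such as `trainable_variables` are convenient when passing parameters to a gradient computation.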

Workflow:

-   experiment in **eager mode**, put together computation
-   group code in `tf.Module`s
-   encapsulate in `@tf.function`s to compute efficiently
-   use high-level interfaces (e.g., **Keras**) if possible
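
The first three workflow steps might look as follows. This is a sketch under assumptions: the module `Affine` and the function `predict` are hypothetical names, and the data is random.

```python
import numpy as np
import tensorflow as tf

class Affine(tf.Module):
    """Hypothetical module computing x @ w + b."""
    def __init__(self, input_size, output_size, name=None):
        super().__init__(name=name)
        self.w = tf.Variable(tf.random.normal([input_size, output_size]), name="w")
        self.b = tf.Variable(tf.zeros([output_size]), name="b")

    def __call__(self, x):
        return tf.matmul(x, self.w) + self.b

model = Affine(5, 1)
batch = tf.random.normal([10, 5])

# steps 1 and 2: call the module eagerly while experimenting
eager_out = model(batch)

# step 3: wrap the finished computation in a tf.function
@tf.function
def predict(x):
    return model(x)

graph_out = predict(batch)
print("same result:", np.allclose(eager_out.numpy(), graph_out.numpy(), atol=1e-5))
```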

### Automatic differentiation

Typical (supervised) deep learning approach:

-   Model as function with parameters $\hat{y} = f_{model}(x | \theta)$
    -   optimize parameters $\theta$ to minimize loss $L(y, \hat{y})$
        between ground truth $y$ and prediction $\hat{y}$
    -   this requires the derivative $\frac{dL}{d\theta}$
-   To find derivatives, TensorFlow offers automatic differentiation
    with `tf.GradientTape`

In [None]:
# create some dummy data
x = tf.random.normal([1, 5])
y = tf.constant([1.0])
print("Initial prediction:", r(x).numpy())

In [None]:
# record the forward pass so that gradients can be computed
with tf.GradientTape() as t:
    y_hat = r(x)
    loss = tf.square(y - y_hat)
# gradients of the loss with respect to weights and bias
dw, db = t.gradient(loss, [r.w, r.b])

# apply one gradient descent step (learning rate 0.1) to improve the regressor
r.w.assign_sub(0.1 * dw)
r.b.assign_sub(0.1 * db)
print("Prediction after one step:", r(x).numpy())
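
A single update is usually not enough; in practice the gradient step is repeated many times. The sketch below uses made-up, noise-free data (`y = 3x + 1`), an assumed learning rate of 0.1, and a mean squared error loss; none of these values come from the cells above.

```python
import tensorflow as tf

# made-up, noise-free data: 20 points on the line y = 3x + 1
x = tf.reshape(tf.linspace(-1.0, 1.0, 20), (20, 1))
y = 3.0 * x + 1.0

# parameters of a linear model, initialised to zero
w = tf.Variable(tf.zeros([1, 1]), name="w")
b = tf.Variable(tf.zeros([1]), name="b")

learning_rate = 0.1  # assumed value for this example
for step in range(200):
    with tf.GradientTape() as tape:
        y_hat = tf.matmul(x, w) + b
        loss = tf.reduce_mean(tf.square(y - y_hat))
    dw, db = tape.gradient(loss, [w, b])
    # plain gradient descent: move against the gradient
    w.assign_sub(learning_rate * dw)
    b.assign_sub(learning_rate * db)

print("w:", w.numpy().item(), "b:", b.numpy().item(), "loss:", float(loss))
```

A new tape is opened in every iteration because a non-persistent `tf.GradientTape` releases its resources after one call to `gradient()`.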

### Exercise: Build a small neural network

-   Implement a small neural network in TensorFlow.
-   Write a `tf.Module` named `Layer` that implements the computation of
    a neural network layer.
-   Write a `tf.Module` named `Network` that stacks several layers and
    executes them in order.
-   Train it using a `tf.GradientTape` on the regression data given
    below (predict `y` from `x`).

In [None]:
x = tf.reshape(tf.linspace(0., 10., 20), (1, 20))
y = 5 - 0.5 * x + tf.random.normal(shape=(20,))

Each layer should have two sets of trainable parameters: weights `w` and
biases `b`. Given an input vector `x`, it should compute a matrix
multiplication `w * x` and add `b` to the result. It should then apply a
non-linearity (use `tf.nn.sigmoid()`). The output of each layer should
be used as input for the next. The last layer should not have a
non-linearity (or use `tf.identity()`).

For both classes, `Layer` and `Network`, implement the appropriate
`__init__()` and `__call__()` functions.

As a loss function for the training, you should use the mean squared
error between the network’s prediction and the ground truth.
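
As a hint, the mean squared error can be written in one line. The helper name `mean_squared_error` and the example values are chosen for illustration only.

```python
import tensorflow as tf

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences over all elements."""
    return tf.reduce_mean(tf.square(y_true - y_pred))

# hand-picked values: differences are 0, 2 and -2
y_true = tf.constant([[1.0, 2.0, 3.0]])
y_pred = tf.constant([[1.0, 0.0, 5.0]])
mse = mean_squared_error(y_true, y_pred)
print("MSE:", mse.numpy())
```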

In [None]:
# ADD CODE HERE

### High-level interfaces: Keras

-   [Keras](https://keras.io) - a common frontend for several frameworks
    -   common neural network layers included
    -   simple model construction
    -   wrappers for training, evaluation, prediction, etc.
-   In TensorFlow: `tf.keras`

Building a model:

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Flatten, Dense

net = Sequential([
    Input([28, 28, 1]),
    Flatten(),
    Dense(1024, activation="elu"),
    Dense(1024, activation="elu"),
    Dense(10, activation="softmax")
])

In [None]:
net.summary()
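
The parameter counts reported by `net.summary()` can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. A short pure-Python sketch (the helper `dense_params` is hypothetical):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

n_flat = 28 * 28 * 1                    # Flatten turns each image into 784 values
layer_sizes = [n_flat, 1024, 1024, 10]  # input and the three Dense layers
counts = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [803840, 1049600, 10250] 1863690
```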

Getting data:

In [None]:
from utils import mnist_imgs, mnist_lbls

from matplotlib import pyplot as plt
plt.imshow(mnist_imgs[0, :, :, 0], cmap="gray")
plt.show()

Compiling and training the model:

In [None]:
net.compile(optimizer=tf.keras.optimizers.Adam(),
    loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
net.fit(mnist_imgs, mnist_lbls, batch_size=64, epochs=5)
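
Training metrics alone do not show how well a model generalizes. `fit` can hold back part of the data for validation, as in the following sketch. It uses random stand-in data and a tiny hypothetical model rather than MNIST, so the reported numbers carry no meaning.

```python
import numpy as np
import tensorflow as tf

# random stand-in data: 200 samples, 4 features, 3 classes (integer labels)
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4)).astype("float32")
labels = rng.integers(0, 3, size=200)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="elu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
# integer class labels go with the "sparse" variant of the cross-entropy
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# hold back the last 20% of the samples for validation
history = model.fit(features, labels, batch_size=32, epochs=2,
                    validation_split=0.2, verbose=0)
print(sorted(history.history.keys()))
```

The returned `history` object stores one value per epoch for each metric, for both the training and the validation part.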

Save and load trained models:

In [None]:
net.save('trained_model.h5')
reloaded = tf.keras.models.load_model('trained_model.h5')

In [None]:
import numpy as np

p1 = net.predict(mnist_imgs)
p2 = reloaded.predict(mnist_imgs)
# compare with a tolerance rather than exact floating-point equality
np.allclose(p1, p2)

### Exercise: Convolutional networks

-   On image data such as MNIST, convolutional networks tend to perform
    better than dense networks like the one above.
-   Implement
    [LeNet-5](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf), one
    of the standard architectures for MNIST.

LeNet-5 comprises the following layers:

-   normalizing the input (you may do this outside of your network)
-   convolution with 6 filters of size 5-by-5, zero-padded such that the
    output keeps the size 28x28
-   max-pooling with size 2 and stride 2
-   convolution with 16 filters of size 5-by-5
    -   the original paper only connected some of the input features to
        each output map; ignore this for now
    -   here, we use valid convolution, such that the output size is
        10-by-10
-   max-pooling with size 2 and stride 2
-   a convolution layer with 120 features and kernel size 5-by-5, again
    using valid convolution (i.e., getting a 1-by-1 output)
-   a fully connected layer with 84 units
-   the output layer with 10 units

Diverging a little from the original formulation of the network, use
ReLU non-linearities and a softmax cross-entropy loss. LeNet-5 achieves
around 99% test accuracy on MNIST.
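
The feature map sizes listed above can be verified with a few lines of arithmetic. This sketch assumes 28x28 MNIST inputs, stride-1 convolutions, and zero padding ("same") in the first convolution only; the helpers `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel, padding):
    """Output width of a stride-1 convolution."""
    if padding == "same":
        return size               # zero padding keeps the size
    return size - kernel + 1      # "valid": no padding

def pool_out(size, pool=2, stride=2):
    """Output width of max-pooling without padding."""
    return (size - pool) // stride + 1

sizes = [28]                                   # MNIST images are 28x28
sizes.append(conv_out(sizes[-1], 5, "same"))   # first convolution
sizes.append(pool_out(sizes[-1]))              # first max-pooling
sizes.append(conv_out(sizes[-1], 5, "valid"))  # second convolution
sizes.append(pool_out(sizes[-1]))              # second max-pooling
sizes.append(conv_out(sizes[-1], 5, "valid"))  # third convolution
print(sizes)  # [28, 28, 14, 10, 5, 1]
```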

In [None]:
# ADD CODE HERE

## 3. Visualization with TensorBoard

-   **TensorBoard**: visualization toolkit shipped with TensorFlow
-   Extremely useful - other deep learning frameworks now offer plugins
-   Requires log files:
    -   for low-level interfaces, use `tf.summary.create_file_writer()`
        and `tf.summary.trace_export()`
    -   for Keras, use `tf.keras.callbacks.TensorBoard()` callback
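
For the low-level route, writing scalar summaries might look like the sketch below. It assumes TensorFlow 2.x, where a writer is created with `tf.summary.create_file_writer()`; the temporary directory and the logged values are made up for this example.

```python
import os
import tempfile
import tensorflow as tf

log_dir = tempfile.mkdtemp()  # stand-in for /path/to/logs
writer = tf.summary.create_file_writer(log_dir)

with writer.as_default():
    for step in range(10):
        fake_loss = 1.0 / (step + 1)   # made-up values, just to have a curve
        tf.summary.scalar("loss", fake_loss, step=step)
writer.flush()  # make sure everything is written to disk

print("event files:", os.listdir(log_dir))
```

Pointing TensorBoard at `log_dir` should then show a `loss` curve under the scalars tab.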

### Try it out

-   Train the LeNet-5 model from above and pass a TensorBoard callback.
-   Start TensorBoard from the command line:
    `tensorboard --logdir=/path/to/logs`
-   Open `http://localhost:6006` in a browser.

## Conclusions

-   Many different deep learning frameworks exist; choose what works for
    your problem
    -   e.g., Keras for standard components
    -   e.g., TensorFlow’s low-level interfaces for more detailed
        control
    -   e.g., PyTorch for dynamic graphs