# Gradient Tape
This notebook is for training and understanding purposes only. All algorithms and credits go to pyimagesearch.com by Adrian Rosebrock, specifically https://www.pyimagesearch.com/2020/03/23/using-tensorflow-and-gradienttape-to-train-a-keras-model/ (A wonderful source and inspiration for Computer Vision and Deep Learning) and Chi-Feng Wang (https://towardsdatascience.com/automatic-differentiation-explained-b4ba8e60c2ad) for explaining autodifferentiation.


Because this notebook is for training and understanding purposes, the code is typed out by hand rather than downloaded right away, in order to build "muscle memory". Author-readable comments will appear from time to time.

In TF 2.0 and above, `tf.GradientTape` records the operations executed inside its context so that their derivatives can be computed afterwards. The derivatives are calculated by a technique known as **Automatic Differentiation**.

To understand **Automatic Differentiation**, one needs to understand the difference between **Symbolic** and **Numerical differentiation**.

![image.png](https://www5a.wolframalpha.com/Calculate/MSP/MSP96671c7cie55de2e85e100003fgf62adf4dbb3b9?MSPStoreType=image/gif&s=53)

Differentiation is a calculus technique, usually first studied in high school, for finding how a function changes when one of its variables changes. There is a set of differentiation rules to abide by (e.g. sum rule, constant rule, power rule, chain rule). **Symbolic differentiation** applies these rules to produce the derivative of a function as a formula. <br>
<br>
(e.g. for $y = x_1^2 + 2x_2^2$, the derivative with respect to $x_1$ is $\dfrac{dy}{dx_1} = 2x_1$ and the derivative with respect to $x_2$ is $\dfrac{dy}{dx_2} = 4x_2$).
<br>
This is the de facto way to produce a general derivative formula. However, in the world of DL, functions become deeply nested compositions and the symbolic expression for the derivative can grow very large, so deriving a closed-form formula for every scenario is not efficient.

**Numerical differentiation** relies on the fact that over a very small interval (say between $x_1 = 1.99$ and $x_1 = 2.00$), the function $y$ is approximately linear. Therefore, the change in $y$ due to a change in $x_1$ can be approximated numerically as follows:<br>
<br>
$\dfrac{dy}{dx_1} \approx \dfrac{y(x_1 = 2.00, x_2 = a) - y(x_1 = 1.99, x_2 = a)}{2.00 - 1.99}$ <br>
<br>
The accuracy of the derivative improves as the interval gets smaller, but only up to a point: with a very small interval the two values of $y$ are almost identical, and floating-point round-off (and, where multiplication or division is in play, underflow or overflow) can spoil the result. It also takes longer to compute, since every variable needs its own extra evaluations of $y$ (not really a problem nowadays, but we still want to get things done faster).
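As a rough illustration of the numerical approach, the sketch below applies the difference quotient above to $y = x_1^2 + 2x_2^2$ at $x_1 = 2$. The helper names (`backward_difference`, `central_difference`) and the choice $x_2 = 3$ are made up for this example, and the symmetric (central) variant is an addition for comparison, not something taken from the text above.

```python
def y(x1, x2):
    """The example function y = x1^2 + 2*x2^2."""
    return x1 ** 2 + 2 * x2 ** 2


def backward_difference(f, x1, x2, h):
    """Approximate dy/dx1 from the points x1 and x1 - h (as in the formula above)."""
    return (f(x1, x2) - f(x1 - h, x2)) / h


def central_difference(f, x1, x2, h):
    """Approximate dy/dx1 from the points x1 + h and x1 - h (symmetric variant)."""
    return (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)


exact = 2 * 2.0  # symbolic result dy/dx1 = 2*x1, evaluated at x1 = 2

# compare both approximations against the symbolic result for several step sizes
for h in (1e-2, 1e-5, 1e-8, 1e-12):
    bwd = backward_difference(y, 2.0, 3.0, h)
    ctr = central_difference(y, 2.0, 3.0, h)
    print(f"h={h:.0e}  backward error={abs(bwd - exact):.2e}  central error={abs(ctr - exact):.2e}")
```

With `h = 0.01` the one-sided quotient gives roughly 3.99 instead of 4. Shrinking `h` helps at first, but for the smallest step sizes the printed errors typically stop improving because of round-off.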

#### In comes:
### Automatic Differentiation
**Automatic Differentiation** leverages the fact that every computation is carried out as a sequence of elementary arithmetic operations and functions. Before we can differentiate an expression, it has to be converted into a computational graph. We will use a different function from those above, specifically <br>
<br>
<center>$cost = y_{target} - y_{predicted}$</center><br>
<center>$cost = y_{target} - (w \cdot x + b)$ such that $w$, $x$ and $b$ are vectors </center>

![image.png](https://miro.medium.com/max/399/1*W6-39saZm_QqL-wQvGESGQ.png)
<br>
We would like to calculate how our cost function changes with respect to each parameter (here $w$ and $b$). From the graph above, we can see that the computation steps are executed sequentially, and each step only needs the values of its inputs to calculate its output. At each computation step, the partial derivatives can be calculated as follows:

![image.png](https://miro.medium.com/max/935/1*sBhdw3Dycs6hV7HhHrtBWg.png)

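And if we are interested in the partial derivative of the final output with respect to an earlier node, we can just apply the chain rule, multiplying the local derivatives along the path from the output back to that node (this is reverse-mode autodifferentiation). How so?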

![image.png](https://miro.medium.com/max/1219/1*53HDeNScHx2zwkLPZ1vEhA.png)
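To make the chain-rule idea concrete, here is a minimal sketch of reverse-mode automatic differentiation in plain Python. It is not how TensorFlow is implemented. The `Node` class and `backward` function are hypothetical names for this example, and scalars are used in place of vectors to keep it short.

```python
class Node:
    """A value in the computational graph plus a record of how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value      # result of the forward pass
        self.parents = parents  # tuples of (parent_node, local_derivative)
        self.grad = 0.0         # d(output)/d(this node), filled in by backward()

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value, ((self, other.value), (other, self.value)))


def backward(output):
    """Reverse sweep: propagate d(output)/d(node) from the output to the inputs."""
    order, seen = [], set()

    def visit(node):
        # post-order traversal: inputs are listed before the nodes that use them
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # walk from the output backwards so every node is complete before its inputs
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule


w, x, b, y_target = Node(3.0), Node(2.0), Node(1.0), Node(10.0)
y_predicted = w * x + b
cost = y_target - y_predicted
backward(cost)
print(cost.value, w.grad, b.grad)  # 3.0 -2.0 -1.0
```

The forward pass stores each local derivative while computing the values, and the backward pass multiplies them along every path from `cost` to the inputs, which gives $\partial cost / \partial w = -x$ and $\partial cost / \partial b = -1$.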

## Let's go to the implementation of GradientTape!
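Before the full training script, here is a minimal sketch of `tf.GradientTape` on its own, applied to the earlier example $y = x_1^2 + 2x_2^2$. The evaluation point ($x_1 = 2$, $x_2 = 3$) is an arbitrary choice for illustration.

```python
import tensorflow as tf

# trainable variables are watched by the tape automatically
x1 = tf.Variable(2.0)
x2 = tf.Variable(3.0)

# operations executed inside the context are recorded on the tape
with tf.GradientTape() as tape:
    y = x1 ** 2 + 2.0 * x2 ** 2

# replay the tape backwards to get dy/dx1 and dy/dx2
dy_dx1, dy_dx2 = tape.gradient(y, [x1, x2])
print(float(dy_dx1), float(dy_dx2))  # 4.0 12.0
```

The results match the symbolic derivatives $2x_1$ and $4x_2$ at that point.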

In [3]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist
import tensorflow as tf
import numpy as np
import time
import sys

def build_model(width, height, depth, classes):
    # we build a sequential model with the following architecture
    # 1 x 16-filter (3x3) convolutional layer (followed by ReLU activation, normalization and max pooling)
    # 2 x 32-filter (3x3) convolutional layers (each followed by ReLU activation and normalization), then max pooling
    # 3 x 64-filter (3x3) convolutional layers (each followed by ReLU activation and normalization), then max pooling
    # 1 x fully connected dense layer (followed by ReLU activation, normalization and 50% dropout)
    # 1 x dense layer with one unit per class and softmax activation
    inputShape = (height, width, depth)
    chanDim = -1 # channels-last ordering: the channel axis is the last axis
    
    model = Sequential([
        
        # CONV => RELU => BN => POOL layer set
        Conv2D(16, (3, 3), padding="same", input_shape=inputShape),
        Activation("relu"),
        BatchNormalization(axis=chanDim),
        MaxPooling2D(pool_size=(2, 2)),
        
        # (CONV => RELU => BN) * 2 => POOL layer set
        Conv2D(32, (3, 3), padding="same"),
        Activation("relu"),
        BatchNormalization(axis=chanDim),
        Conv2D(32, (3, 3), padding="same"),
        Activation("relu"),
        BatchNormalization(axis=chanDim),    
        MaxPooling2D(pool_size=(2, 2)),
        
        # (CONV => RELU => BN) * 3 => POOL layer set
        Conv2D(64, (3, 3), padding="same"),
        Activation("relu"),
        BatchNormalization(axis=chanDim),
        Conv2D(64, (3, 3), padding="same"),
        Activation("relu"),
        BatchNormalization(axis=chanDim),
        Conv2D(64, (3, 3), padding="same"),
        Activation("relu"),
        BatchNormalization(axis=chanDim),
        MaxPooling2D(pool_size=(2, 2)),
        
        # first (and only) set of FC => RELU layers
        Flatten(),
        Dense(256),
        Activation("relu"),
        BatchNormalization(),
        Dropout(0.5),
        
        # softmax classifier (softmax turns the outputs into class probabilities)
        Dense(classes),
        Activation("softmax")
    ])
    
    return model

def step(X, y):
    # keep track of our gradients in a context manager
    with tf.GradientTape() as tape:
        # make a forward pass using the model; training=True keeps
        # BatchNormalization and Dropout in training mode
        pred = model(X, training=True)
        # calculate the cross-entropy loss on the softmax probabilities;
        # cross-entropy heavily penalizes predictions that are confident and wrong
        loss = categorical_crossentropy(y, pred)
        
    # calculate the gradients using our tape and then update the model weights
    # THE MOST IMPORTANT PART OF THE TRAINING PROCESS
    # passing a different list of variables (instead of model.trainable_variables)
    # would update only specific layers
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    
    # optionally return the loss so the caller can track it
    # return loss
    
# initialize the number of epochs to train for, batch size, and initial learning rate
EPOCHS = 1
BS = 64
INIT_LR = 1e-3

# load the MNIST dataset
print("[INFO] loading MNIST dataset...")
((trainX, trainY), (testX, testY)) = mnist.load_data()

# add a channel dimension to every image in the dataset, then scale the pixel intensities to the range [0, 1]
trainX = np.expand_dims(trainX, axis=-1)
testX = np.expand_dims(testX, axis=-1)
trainX = trainX.astype("float32") / 255.0
testX = testX.astype("float32") / 255.0

# one-hot encode the labels
trainY = to_categorical(trainY, 10)
testY = to_categorical(testY, 10)

# build our model and initialize our optimizer
print("[INFO] creating model...")
model = build_model(28, 28, 1, 10)
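# note: `lr` is a deprecated alias of `learning_rate`; recent TensorFlow/Keras releases may also
# reject `decay`, in which case a learning-rate schedule or the legacy optimizer would be needed
opt = Adam(learning_rate=INIT_LR, decay=INIT_LR / EPOCHS)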

# compute the number of batch updates per epoch
numUpdates = int(trainX.shape[0] / BS)

# loop over the number of epochs
# instead of using model.fit to do training, we step through the batches with GradientTape and apply the gradients once per batch
for epoch in range(0, EPOCHS):
    # show the current epoch number
    print("[INFO] starting epoch {}/{}...".format(epoch + 1, EPOCHS), end="")
    sys.stdout.flush()
    epochStart = time.time()
    
    # loop over the data in batch size increments
    for i in range(0, numUpdates):
        # determine starting and ending slice indexes for the current batch
        start = i * BS
        end = start + BS
        # take a step
        step(trainX[start:end], trainY[start:end])
        
    # show timing information for the epoch
    epochEnd = time.time()
    elapsed = (epochEnd - epochStart) / 60.0
    print("took {:.4} minutes".format(elapsed))
    
# in order to calculate accuracy using Keras' functions we first need to compile the model
model.compile(optimizer=opt, loss=categorical_crossentropy, metrics=["acc"])
# now that the model is compiled we can compute the accuracy
(loss, acc) = model.evaluate(testX, testY)
print("[INFO] test accuracy: {:.4f}".format(acc))

[INFO] loading MNIST dataset...
[INFO] creating model...
[INFO] starting epoch 1/1...took 1.57 minutes


[INFO] test accuracy: 0.9876
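The comment in `step` above notes that a different list of variables can be passed instead of `model.trainable_variables`. As a hedged sketch of that idea, the example below updates only the last layer of a small stand-in model. The names `toy_model`, `toy_opt`, `toyX`, `toyY` and the random data are assumptions for this illustration and are not part of the notebook.

```python
import numpy as np
import tensorflow as tf

# hypothetical toy data: 32 samples, 4 features, 3 classes
rng = np.random.default_rng(0)
toyX = rng.random((32, 4)).astype("float32")
toyY = tf.keras.utils.to_categorical(rng.integers(0, 3, size=32), 3)

toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
toy_opt = tf.keras.optimizers.Adam(learning_rate=1e-2)

# only the variables of the last Dense layer will be updated
head_vars = toy_model.layers[-1].trainable_variables

# keep copies of the weights to compare after the update
hidden_before = [w.copy() for w in toy_model.layers[0].get_weights()]
head_before = [w.copy() for w in toy_model.layers[-1].get_weights()]

with tf.GradientTape() as tape:
    pred = toy_model(toyX, training=True)
    loss = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(toyY, pred))

# gradients are requested (and applied) for the chosen variables only
grads = tape.gradient(loss, head_vars)
toy_opt.apply_gradients(zip(grads, head_vars))

hidden_after = toy_model.layers[0].get_weights()
head_after = toy_model.layers[-1].get_weights()
hidden_unchanged = all(np.array_equal(a, b) for a, b in zip(hidden_before, hidden_after))
head_changed = any(not np.array_equal(a, b) for a, b in zip(head_before, head_after))
print(hidden_unchanged, head_changed)
```

The first layer stays exactly as it was, because its variables were never handed to the optimizer.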


In [5]:
import matplotlib.pyplot as plt

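# model.history is only set by model.fit(); since we trained with a custom
# GradientTape loop, fit() was never called, so the attribute does not exist
# (after fit(), the per-epoch losses would be in model.history.history["loss"])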
plt.plot([0,1],model.history["loss"])

AttributeError: 'Sequential' object has no attribute 'history'
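One way around the error above is to record the loss ourselves inside the custom loop. The sketch below assumes a small stand-in model and random data so it runs quickly. `toy_step`, `toy_model`, `toyX`, `toyY` and the `history` dictionary are hypothetical names, not part of the original notebook.

```python
import numpy as np
import tensorflow as tf

# hypothetical toy data: 256 samples, 4 features, 3 classes
rng = np.random.default_rng(0)
toyX = rng.random((256, 4)).astype("float32")
toyY = tf.keras.utils.to_categorical(rng.integers(0, 3, size=256), 3)

toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
toy_opt = tf.keras.optimizers.Adam(learning_rate=1e-2)


def toy_step(X, y):
    """One gradient update; returns the mean loss of the batch as a Python float."""
    with tf.GradientTape() as tape:
        pred = toy_model(X, training=True)
        # average the per-sample losses into one scalar for this batch
        loss = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y, pred))
    grads = tape.gradient(loss, toy_model.trainable_variables)
    toy_opt.apply_gradients(zip(grads, toy_model.trainable_variables))
    return float(loss)


TOY_EPOCHS, TOY_BS = 3, 64
history = {"loss": []}  # plays the role of Keras' History for this loop
for epoch in range(TOY_EPOCHS):
    batch_losses = [
        toy_step(toyX[start:start + TOY_BS], toyY[start:start + TOY_BS])
        for start in range(0, len(toyX), TOY_BS)
    ]
    # store the average batch loss of the epoch
    history["loss"].append(sum(batch_losses) / len(batch_losses))

print(history["loss"])
```

The list can then be plotted with, for example, `plt.plot(range(1, TOY_EPOCHS + 1), history["loss"])`.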