# Gradient Tape

As the name suggests, it's a *tape* that records the operations applied to *watched values*, so that their *gradients* can be computed afterwards.

As you should know by now, learning is the process of finding the weights that minimize the loss function, and we use the optimizer to update the weights to achieve that. Optimizers usually use gradients (the derivatives of the loss function w.r.t. the weights) to move the weights in the best direction. To compute those gradients, we need to store a lot of intermediate values from the forward pass, and we also need to remember the functions that produced them so we can calculate their derivatives.

That's what you would normally do if you were using numpy to implement everything step by step. `GradientTape` is an object that stores the intermediate values and the functions used to compute them, and it can then calculate derivatives with respect to any watched value. In other words, it does all the numpy bookkeeping for you in an elegant way.
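
To make the "by hand" version concrete, here is a minimal NumPy sketch of a forward and backward pass for a single linear unit. The toy data, the starting values of `w` and `b`, and the mean squared error loss are assumptions chosen only for illustration; they are not taken from the cells in this notebook.

```python
import numpy as np

# Toy data and parameters (illustrative values only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y_true = 2.0 * x - 1.0
w, b = 0.1, 0.1

# Forward pass: keep every intermediate value, the backward pass needs them.
z = w * x + b              # prediction
err = z - y_true           # residual
loss = np.mean(err ** 2)   # mean squared error (assumed loss)

# Backward pass: apply the chain rule by hand, one recorded step at a time.
dloss_derr = 2.0 * err / err.size   # d(mean(err^2)) / d(err)
dloss_dz = dloss_derr               # err = z - y_true, so d(err)/d(z) = 1
dloss_dw = np.sum(dloss_dz * x)     # z = w * x + b, so d(z)/d(w) = x
dloss_db = np.sum(dloss_dz)         # d(z)/d(b) = 1

print(f"dloss/dw = {dloss_dw}")
print(f"dloss/db = {dloss_db}")
```

Every line of the backward pass mirrors one line of the forward pass, which is exactly the bookkeeping a tape automates.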

`GradientTape` is used as a context manager in a `with` block. On entering the block, its `__enter__` method is called; it starts recording the operations that involve watched values (trainable `tf.Variable`s are watched automatically) and returns the `GradientTape` object. When the block exits, the `__exit__` method is called and recording stops. After that, by default you can call `gradient()` **only once**: the resources held by the tape are released as soon as that call returns. If you need more than one call, pass `persistent=True` to the constructor.

In [1]:
import tensorflow as tf
import numpy as np

In [3]:
with tf.GradientTape(persistent=True) as g:
  x = tf.Variable(2.0)
  y = tf.Variable(3.0)
  z = x * y

dz_dx = g.gradient(z, x) # dz/dx = y = 3
dz_dy = g.gradient(z, y) # dz/dy = x = 2

print(f"dz/dx = {dz_dx.numpy()}")
print(f"dz/dy = {dz_dy.numpy()}")

dz/dx = 3.0
dz/dy = 2.0
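
For contrast, the sketch below uses a default (non-persistent) tape. It also uses `tape.watch`, which is not covered in the text above, to track a plain constant tensor, since only trainable variables are watched automatically. The names `c`, `tape`, and `second_call_failed` are illustrative.

```python
import tensorflow as tf

x = tf.Variable(2.0)   # trainable variables are watched automatically
c = tf.constant(3.0)   # plain tensors are not watched unless we ask

with tf.GradientTape() as tape:   # persistent defaults to False
    tape.watch(c)                 # explicitly watch the constant
    z = x * c

# Both gradients in a single call, so one use of the tape is enough.
dz_dx, dz_dc = tape.gradient(z, [x, c])

# A second gradient() call on a non-persistent tape raises RuntimeError.
try:
    tape.gradient(z, x)
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

print(f"dz/dx = {dz_dx.numpy()}")
print(f"dz/dc = {dz_dc.numpy()}")
print(f"second call failed: {second_call_failed}")
```

Passing a list of sources to `gradient()` is often enough to avoid needing `persistent=True` at all.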


# With weights and bias

Let's see how this works for a single training step of a single layer with weights `w` and bias `b`.

In [4]:
# prepare data
x = np.arange(10)
y = 2 * x - 1 # w --> 2, b --> -1

In [14]:
w = tf.Variable(0.1, name="w", trainable=True, dtype=tf.float32)
b = tf.Variable(0.1, name="b", trainable=True, dtype=tf.float32)

def simple_loss(y_true, y_pred):
  return tf.abs(y_true - y_pred)

LEARNING_RATE = 0.001

# train for 1000 epochs
for i in range(1000):
  with tf.GradientTape(persistent=True) as g:
    z = w * x + b
    loss = simple_loss(y, z)

  dl_dw = g.gradient(loss, w)
  dl_db = g.gradient(loss, b) 

  w.assign_sub(dl_dw * LEARNING_RATE)
  b.assign_sub(dl_db * LEARNING_RATE)
  del g # release the tape's resources; a persistent tape holds them until it is deleted

print(f"w = {w.numpy()}") # --> 2.0
print(f"b = {b.numpy()}") # --> -1.0

w = 2.00075364112854
b = -0.9990085959434509
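
As a variation on the loop above, the two `g.gradient` calls can be merged into one by passing a list of sources, which removes the need for `persistent=True` and `del g`. This is a sketch under two assumptions: the data is converted to `float32` tensors up front, and the per-sample losses are summed explicitly with `tf.reduce_sum`, which matches what `gradient()` does implicitly when the target is not a scalar.

```python
import numpy as np
import tensorflow as tf

# Same toy problem as above: y = 2x - 1.
x = tf.constant(np.arange(10), dtype=tf.float32)
y = 2.0 * x - 1.0

w = tf.Variable(0.1, name="w")
b = tf.Variable(0.1, name="b")
LEARNING_RATE = 0.001

for _ in range(1000):
    with tf.GradientTape() as g:   # non-persistent: one gradient() call per step
        z = w * x + b
        loss = tf.reduce_sum(tf.abs(y - z))

    # One call returns the gradients for both variables.
    dl_dw, dl_db = g.gradient(loss, [w, b])

    w.assign_sub(LEARNING_RATE * dl_dw)
    b.assign_sub(LEARNING_RATE * dl_db)

print(f"w = {w.numpy()}")
print(f"b = {b.numpy()}")
```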


And now you might think that you would need to have [`w`, `b`] values for each layer, and loop over them to do the gradient calculations.

But it actually gets even easier: you only need `model.trainable_variables`, and you can compute the gradients for all of them in one go using

```python
grads = tape.gradient(loss, model.trainable_variables)
```

With that, you have all the gradients.

Do we now have to loop over them to apply the updates? Still no.

Updating the weights from the gradients is the optimizer's job, so you simply pass the gradients to the optimizer and let it take care of the rest.

Your training step should then look something like this:

```python
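# Assumes `loss_fn`, `optimizer`, and `loss_history` are defined in the enclosing scope.
def train_step(model, data, labels):
    # Not persistent, since tape.gradient() is called only once.
    with tf.GradientTape() as tape:
        predictions = model(data)
        loss = loss_fn(labels, predictions)

    loss_history.append(loss.numpy().mean())

    # Gradients for every trainable variable in one call, applied by the optimizer.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))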
```

This is simpler and cleaner than the loop above, and it works for any model.
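
For completeness, here is one way the pieces around `train_step` could be wired together. The one-unit `Dense` model, the `MeanSquaredError` loss, the `SGD` optimizer, the learning rate, and the number of steps are all assumptions picked for illustration, and `labels` is passed to the step explicitly.

```python
import numpy as np
import tensorflow as tf

# Toy data: y = 2x - 1, shaped (batch, features) as Keras layers expect.
data = tf.constant(np.arange(10).reshape(-1, 1), dtype=tf.float32)
labels = 2.0 * data - 1.0

# Illustrative choices, not prescribed by the text above.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_history = []

def train_step(model, data, labels):
    # Record the forward pass so the tape can differentiate the loss.
    with tf.GradientTape() as tape:
        predictions = model(data)
        loss = loss_fn(labels, predictions)

    # One gradient per trainable variable, in the same order.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    loss_history.append(float(loss))

for _ in range(200):
    train_step(model, data, labels)

print(f"first loss = {loss_history[0]}")
print(f"last loss = {loss_history[-1]}")
```

Because the weights are initialized randomly, the exact loss values differ between runs, but the last loss should be lower than the first.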

# Higher-order derivatives

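Higher-order derivatives are computed by nesting tapes: the inner tape computes the first derivative while the outer tape records that computation, so the outer tape can differentiate it again. The next cell shows an example: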

In [15]:
x = tf.Variable(2.0)

f = lambda x: x ** 3

with tf.GradientTape() as g:
  with tf.GradientTape() as g2:
    y = f(x)
  dy_dx = g2.gradient(y, x) # --> 3 * x ** 2 = 3 * 4 = 12
d2y_dx2 = g.gradient(dy_dx, x) # --> 6 * x = 12

print(f"dy/dx = {dy_dx.numpy()}")
print(f"d2y/dx2 = {d2y_dx2.numpy()}")

dy/dx = 12.0
d2y/dx2 = 12.0
