<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#What-are-gradient-tapes?" data-toc-modified-id="What-are-gradient-tapes?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>What are gradient tapes?</a></span></li><li><span><a href="#A-first-simple-example" data-toc-modified-id="A-first-simple-example-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>A first simple example</a></span></li><li><span><a href="#Which-tensors-are-watched?" data-toc-modified-id="Which-tensors-are-watched?-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Which tensors are watched?</a></span></li><li><span><a href="#Computing-the-derivative-for-more-than-one-tensor" data-toc-modified-id="Computing-the-derivative-for-more-than-one-tensor-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Computing the derivative for more than one tensor</a></span></li><li><span><a href="#Computing-the-gradient-for-the-function-from-the-lecture" data-toc-modified-id="Computing-the-gradient-for-the-function-from-the-lecture-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Computing the gradient for the function from the lecture</a></span></li><li><span><a href="#Fine-grained-control-over-which-variables-are-watched" data-toc-modified-id="Fine-grained-control-over-which-variables-are-watched-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Fine grained control over which variables are watched</a></span></li><li><span><a href="#Automatic-differentiation-for-programs" data-toc-modified-id="Automatic-differentiation-for-programs-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Automatic differentiation for programs</a></span></li></ul></div>

# Introduction

If you build and train a neural network in TensorFlow 2 without relying on Keras and its `fit()` method, you need to use "gradient tapes" directly to compute the gradients that are used to adjust the parameters of your model.

So it is a good idea to first understand what gradient tapes are.

First, make sure you have the right TensorFlow version (>=2.0):

In [1]:
import tensorflow as tf
print(tf.__version__)

2.0.0


# What are gradient tapes?

The TensorFlow website

https://www.tensorflow.org/tutorials/customization/autodiff

describes gradient tapes as follows:

    TensorFlow provides the tf.GradientTape API for automatic differentiation - computing the gradient of a computation with respect to its input variables. Tensorflow "records" all operations executed inside the context of a tf.GradientTape onto a "tape". Tensorflow then uses that tape and the gradients associated with each recorded operation to compute the gradients of a "recorded" computation using reverse mode differentiation.
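
To make the idea of "recording" and "replaying backwards" more concrete, here is a toy sketch in plain Python. It is not how TensorFlow implements its tape; the class `ToyTape` and its methods are hypothetical names made up for this illustration, and it only handles scalar addition and multiplication.

```python
# Toy illustration only: a hypothetical, minimal "tape" for scalar values.
# It mimics the idea of recording ops and replaying them in reverse.

class ToyTape:
    def __init__(self):
        # Each record: (output_id, [(input_id, local_derivative), ...])
        self.records = []
        self._next_id = 0

    def new_value(self, value):
        """Register a value and return it as a (node_id, value) pair."""
        node = (self._next_id, value)
        self._next_id += 1
        return node

    def mul(self, a, b):
        """Compute a*b and record d(out)/da = b and d(out)/db = a."""
        out = self.new_value(a[1] * b[1])
        self.records.append((out[0], [(a[0], b[1]), (b[0], a[1])]))
        return out

    def add(self, a, b):
        """Compute a+b and record d(out)/da = 1 and d(out)/db = 1."""
        out = self.new_value(a[1] + b[1])
        self.records.append((out[0], [(a[0], 1.0), (b[0], 1.0)]))
        return out

    def gradient(self, target, source):
        """Walk the records backwards, accumulating d(target)/d(node)."""
        grads = {target[0]: 1.0}
        for out_id, inputs in reversed(self.records):
            upstream = grads.get(out_id, 0.0)
            for in_id, local in inputs:
                grads[in_id] = grads.get(in_id, 0.0) + upstream * local
        return grads.get(source[0], 0.0)


tape = ToyTape()
x = tape.new_value(3.0)
y = tape.mul(x, x)                 # y = x^2
z = tape.mul(y, y)                 # z = y^2 = x^4
toy_dy_dx = tape.gradient(y, x)    # 2*x at x=3
toy_dz_dx = tape.gradient(z, x)    # 4*x^3 at x=3
print(toy_dy_dx, toy_dz_dx)
```

The backward walk multiplies each recorded local derivative by the gradient already accumulated for the op's output, which is the chain rule applied in reverse order.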

# A first simple example

In the following, we create a TensorFlow op `y` that computes `y=x^2=x*x`.

Let us assume we want to compute the derivative of `y` with respect to the input variable `x` (the result should be `2*x`, right?)

So if we compute the derivative dy/dx at x=3.0, the result should be 6.0.

In [2]:
x = tf.constant(3.0)
with tf.GradientTape() as t:
  t.watch(x)
  y = x * x
dy_dx = t.gradient(y, x) # Will compute to 6.0

In [3]:
print(dy_dx)

tf.Tensor(6.0, shape=(), dtype=float32)


In [4]:
dy_dx.numpy()

6.0

Note that the statement `t.watch(x)` is important. If we omit it, the tape does not record the operations on `x`, so it has no information from which to compute the gradient dy/dx:

In [5]:
x = tf.constant(3.0)
with tf.GradientTape() as t:
  #t.watch(x)  # <-- we now omit this statement
  y = x * x
dy_dx = t.gradient(y, x) # Returns None, since x is not being watched

In [6]:
print(dy_dx)

None


# Which tensors are watched?

The TensorFlow website says:

    Trainable variables (created by tf.Variable or tf.compat.v1.get_variable, where trainable=True is default in both cases) are automatically watched. Tensors can be manually watched by invoking the watch method on this context manager.
    
Let's try this:

In [7]:
x = tf.Variable(3.0)
with tf.GradientTape() as t:
  #t.watch(x)  # <-- we now omit this statement
  y = x * x
dy_dx = t.gradient(y, x) # Will compute to 6.0

In [8]:
print(dy_dx)

tf.Tensor(6.0, shape=(), dtype=float32)


In [9]:
dy_dx.numpy()

6.0

So the statement is right! Though we did not explicitly tell the tape to watch `x`, we now have the desired information on the tape to compute the gradient dy/dx.
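
Automatic watching of variables is what makes a manual training step short. The following is a minimal sketch of one gradient descent update, assuming a toy "loss" `x*x` and an arbitrarily chosen `learning_rate` of 0.1; neither is taken from the lecture.

```python
import tensorflow as tf

x = tf.Variable(3.0)          # trainable, so the tape watches it automatically
learning_rate = 0.1           # arbitrary value, for illustration only

with tf.GradientTape() as t:
    loss = x * x              # toy "loss" with its minimum at x = 0

grad = t.gradient(loss, x)            # d(loss)/dx = 2*x
x.assign_sub(learning_rate * grad)    # x <- x - learning_rate * grad
new_x = float(x.numpy())
print(new_x)                          # close to 3.0 - 0.1 * 6.0
```

Repeating the `with` block and the `assign_sub` call in a loop moves `x` step by step toward the minimum.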

# Computing the derivative for more than one tensor

In the following example, we compute not only dz/dx but also dy/dx.

In [10]:
x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as t:
  t.watch(x)
  y = x * x # y=x^2
  z = y * y # z=y^2=(x^2)^2=x^4

dz_dx = t.gradient(z, x)  # 108.0 (4*x^3 at x=3 --> 4*27=108)
dy_dx = t.gradient(y, x)  # 6.0 (2*x at x=3 --> 6)

In [11]:
dz_dx

<tf.Tensor: id=32, shape=(), dtype=float32, numpy=108.0>

In [12]:
dy_dx

<tf.Tensor: id=36, shape=(), dtype=float32, numpy=6.0>

Did you notice the flag `persistent=True`?

Here is the description of why this flag needs to be set:

    By default, the resources held by a GradientTape are released as soon as GradientTape.gradient() method is called. To compute multiple gradients over the same computation, create a persistent gradient tape.
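
As a small sketch of what happens without the flag: on a non-persistent tape the first call to `gradient()` works, while a second call raises a `RuntimeError`. The variable name `second_call_failed` is only used here for illustration.

```python
import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape() as t:      # persistent defaults to False
    t.watch(x)
    y = x * x
    z = y * y

dz_dx = t.gradient(z, x)          # first call works

try:
    t.gradient(y, x)              # second call on a non-persistent tape
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print("Second call failed:", err)
```

A persistent tape keeps its resources until it is garbage collected, so it is common to drop the reference with `del t` once all gradients have been computed.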

We could also ask the tape to give us the derivative of the output with respect to some *intermediate* node (op, tensor), e.g. dz/dy = 2*y

For x=3.0, y=x*x=9.0, so dz/dy=2*y at y=9 will be 18.0, right?

Let's check this:

In [15]:
x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as t:
  t.watch(x)
  y = x * x # y=x^2
  z = y * y # z=y^2=(x^2)^2=x^4

dz_dy = t.gradient(z,y)

In [16]:
print(dz_dy)

tf.Tensor(18.0, shape=(), dtype=float32)
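
The three derivatives from this section are linked by the chain rule, dz/dx = dz/dy * dy/dx. A quick check in plain Python (no TensorFlow needed), using the hand-derived formulas from above:

```python
x = 3.0
y = x * x                      # 9.0

dy_dx = 2 * x                  # derivative of x^2
dz_dy = 2 * y                  # derivative of y^2 with respect to y
dz_dx_chain = dz_dy * dy_dx    # chain rule: dz/dx = dz/dy * dy/dx
dz_dx_direct = 4 * x ** 3      # derivative of x^4, computed directly

print(dz_dx_chain, dz_dx_direct)
```

Both routes give the value that the tape returned for dz/dx.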


# Computing the gradient for the function from the lecture

In the lecture we considered the function

f(x1,x2) := x1*x2 * sin(x1)

Now let us compute the gradient of f which is (df/dx1, df/dx2) using a gradient tape:

In [17]:
x1 = tf.constant(1.0)
x2 = tf.constant(2.0)
with tf.GradientTape(persistent=True) as t:
    t.watch(x1)
    t.watch(x2)
    w3 = x1*x2
    w4 = tf.math.sin(x1)
    w5 = w3*w4

dw5_dx1 = t.gradient(w5, x1)
dw5_dx2 = t.gradient(w5, x2)

In [18]:
dw5_dx1

<tf.Tensor: id=67, shape=(), dtype=float32, numpy=2.7635465>

In [19]:
dw5_dx2

<tf.Tensor: id=72, shape=(), dtype=float32, numpy=0.84147096>

Compare this with the results we got on slide 126. They are exactly the same!
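
The tape's results can also be checked by hand. The sketch below applies the product rule to the computation as coded in the cell above, `w5 = w3*w4` with `w3 = x1*x2` and `w4 = sin(x1)`, using only Python's `math` module. The names `df_dx1` and `df_dx2` are introduced here for illustration.

```python
import math

x1, x2 = 1.0, 2.0

# Product rule for w5 = (x1*x2) * sin(x1):
#   d(w5)/dx1 = x2*sin(x1) + x1*x2*cos(x1)
#   d(w5)/dx2 = x1*sin(x1)
df_dx1 = x2 * math.sin(x1) + x1 * x2 * math.cos(x1)
df_dx2 = x1 * math.sin(x1)

print(df_dx1)  # approximately 2.76355
print(df_dx2)  # approximately 0.84147
```

The values agree with the tape's float32 results up to rounding.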

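# Fine grained control over which variables are watched

By default, a tape watches every trainable variable that is accessed inside its context. Passing `watch_accessed_variables=False` turns this off, so that only the tensors and variables passed explicitly to `t.watch()` are recorded: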

In [20]:
variable_a = tf.Variable(2.0)
variable_b = tf.Variable(4.0)
with tf.GradientTape(watch_accessed_variables=False, persistent=True) as t:
  t.watch(variable_a)
  y = variable_a ** 2  # Gradients will be available for `variable_a`.
  z = variable_b ** 3  # No gradients will be available since `variable_b` is
                       # not being watched.

In [21]:
print( t.gradient(y,variable_a) )

tf.Tensor(4.0, shape=(), dtype=float32)


In [22]:
print( t.gradient(z,variable_b) )

None


# Automatic differentiation for programs

Tapes record operations as they are executed. This makes it possible to compute the derivative of functions that use "normal" Python control flow structures such as if statements and while loops.

Here is a simple example:

In [23]:
def foo(x, y):
  if y >= 7:
    output = x + x + x
    return output
  output = 1.0
  for i in range(y):
    output = output * x
  return output

def compute_gradient(x, y):
  with tf.GradientTape() as t:
    t.watch(x)
    out = foo(x, y)
  return t.gradient(out, x)

x = tf.constant(2.0)

In [24]:
compute_gradient(x,3) # x^3 derived is 3*x^2. With x=2.0, we get 3*4=12

<tf.Tensor: id=113, shape=(), dtype=float32, numpy=12.0>

In [25]:
compute_gradient(x,12) # for foo(x,12)
                       # the if statement is true
                       # and thus the function value foo(x,12)=x+x+x=3x.
                       # The derivative is therefore 3.

<tf.Tensor: id=117, shape=(), dtype=float32, numpy=3.0>

In [26]:
compute_gradient(x,6) # x^6 derived is 6*x^5. With x=2.0, we get 6*(2^5) = 6*32=192

<tf.Tensor: id=138, shape=(), dtype=float32, numpy=192.0>
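
One way to convince yourself that these three results are right is to compare them with a numerical estimate. The sketch below re-implements `foo` on plain Python floats and uses a central finite difference. The helper `central_difference` and the step size `h=1e-4` are choices made for this illustration, not part of TensorFlow.

```python
def foo(x, y):
    """Same logic as foo above, but on plain Python floats."""
    if y >= 7:
        return x + x + x
    output = 1.0
    for _ in range(y):
        output = output * x
    return output


def central_difference(f, x, h=1e-4):
    """Approximate f'(x) by (f(x+h) - f(x-h)) / (2*h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Estimate d foo(x, y) / dx at x = 2.0 for the three values of y used above
estimates = {y: central_difference(lambda v: foo(v, y), 2.0) for y in (3, 12, 6)}
for y, estimate in estimates.items():
    print(y, estimate)
```

The estimates should lie close to 12, 3 and 192. Small deviations are expected, because a finite difference is only an approximation, whereas the tape computes derivatives exactly up to floating-point rounding.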