# Automatic differentiation

Automatic differentiation (autodiff) is a tool used in deep learning and other optimization-based learning algorithms to compute gradients efficiently.

In the first part of this notebook, we introduce the mechanics of autodiff in TensorFlow. In the second part, we use autodiff to implement a simple linear regression. While simple, this example is representative of the kind of optimization problems that are solved in deep learning.


We start by importing the necessary libraries. In addition to TensorFlow, we will use NumPy for some basic numeric operations and Matplotlib for plotting.

In [None]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

The goal of automatic differentiation is to compute the gradient of a function. In the context of machine learning, the function is typically the loss function.
We start, however, with a much simpler function, namely the function $f(x) = x^2$. From calculus, we know that the derivative of this function is $f'(x) = 2x$. We will use autodiff to compute this derivative.

When we compute with normal values in Python, as in the computation 
```python
x = 3
y = x**2
```
the value of `y` is computed and stored in memory. Python does not keep track of the operations that were used to compute `y`. Consequently, we cannot compute the derivative of `y` with respect to `x` using Python. 

To make automatic differentiation possible, the operations used to compute a value must be recorded. In TensorFlow, the values we want to differentiate with respect to are stored in a `tf.Variable`. The operations applied to them are recorded when they are executed within a `tf.GradientTape` context, which by default watches every trainable `tf.Variable` that is used inside it. The example below illustrates this.

In [None]:
x = tf.Variable(7.0)

with tf.GradientTape() as tape:
  y = x * x

It is now possible to compute the gradient of `y` with respect to `x`. Since `x` is 7, the expected result is $2x = 14$.

In [None]:
grad = tape.gradient(y, x)
print(grad)
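
To make the idea of "keeping track of operations" more concrete, the following is a minimal sketch of reverse-mode autodiff in plain Python. The `Value` class and its methods are hypothetical teaching constructs, not part of TensorFlow, and TensorFlow's actual implementation is considerably more elaborate. Each `Value` stores the inputs it was computed from together with the local derivatives, and `backward` applies the chain rule from the output back to the inputs.

```python
class Value:
    """A number that remembers how it was computed (hypothetical teaching class)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        # Pairs (parent, d(self)/d(parent)) recorded when the value is created.
        self.parents = parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Order the recorded values so that each one comes after its inputs.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for parent, _ in v.parents:
                    visit(parent)
                order.append(v)

        visit(self)
        self.grad = 1.0
        # Chain rule, applied from the output back to the inputs.
        for v in reversed(order):
            for parent, local_derivative in v.parents:
                parent.grad += local_derivative * v.grad


x = Value(7.0)
y = x * x
y.backward()
print(x.grad)  # 14.0
```

The printed value agrees with the analytic derivative $2x$ at $x = 7$.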

#### Exercises

- Change the value of `x` to a different value and re-run the cell. Is the computation correct?
- Experiment with different functions, such as `y = x**3` or `y = x**2 + 3*x`. Can you compute the gradient of these functions using autodiff? 
- What happens if you try to compute the gradient of a function that is not a function of a `tf.Variable`, i.e. if you change `x` to be a normal Python variable?

The computation of gradients is not restricted to tracking only one variable. We can compute the gradient of a function with respect to multiple variables. In the example below, we compute the gradient of the function $f(x, y) = x^2 + y$ with respect to both $x$ and $y$. The expected result is $2x$ if we differentiate with respect to $x$ and $1$ if we differentiate with respect to $y$.

In [None]:
x = tf.Variable(1, dtype=tf.float32)
y = tf.Variable(2, dtype=tf.float32)

# We use persistent=True so that we can call gradient twice on the same tape
with tf.GradientTape(persistent=True) as tape:
    z = x * x + y       

dz_dx = tape.gradient(z, x)
dz_dy = tape.gradient(z, y)

print(dz_dx)
print(dz_dy)
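
As an alternative to a persistent tape, `tape.gradient` also accepts a list of variables and returns the gradients in the same order. The sketch below repeats the example with a single call, in which case a non-persistent tape suffices.

```python
import tensorflow as tf

x = tf.Variable(1.0)
y = tf.Variable(2.0)

# gradient is called only once, so the tape does not need to be persistent.
with tf.GradientTape() as tape:
    z = x * x + y

# One gradient is returned per entry of the list, in the same order.
dz_dx, dz_dy = tape.gradient(z, [x, y])

print(dz_dx.numpy(), dz_dy.numpy())  # 2.0 1.0
```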

In most machine learning models we have many parameters that we need to optimize. We would not explicitly create a `tf.Variable` for each parameter. Instead, we would typically use a `tf.Variable` to store all parameters at once. We call such a variable a vector (or 1d tensor). 

In [None]:
x = tf.Variable([1, 2], dtype=tf.float32)

# A single call to tape.gradient suffices here, so the tape does not need to be persistent
with tf.GradientTape() as tape:
    z = x[0] * x[0] + 5 * x[1]       

dz_dx = tape.gradient(z, x)

print(dz_dx)

We see that a single call to `tape.gradient` now returns both partial derivatives: $2x_0 = 2$ in the first entry and $5$ in the second. The result is returned as a vector, with one entry for each component of `x`. This vector is called the gradient vector of $z$ and often written as $\nabla z$ or $\text{grad}\, z$.

Working with vectors is more efficient and more convenient than working with individual variables, especially when we have many parameters, as is the case in deep learning models.
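
With vectors, the function itself is usually written with vectorized operations rather than by indexing individual entries. As a minimal sketch, assume the function $z = \sum_i x_i^2$ (chosen here only for illustration), whose gradient is $2x$:

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])

with tf.GradientTape() as tape:
    # x * x squares every entry; reduce_sum adds the entries up to a scalar.
    z = tf.reduce_sum(x * x)

grad = tape.gradient(z, x)
print(grad.numpy())  # [2. 4. 6.]
```

The same code works unchanged for a vector of any length.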


### Regression 

Now that we have seen how to compute gradients with autodiff, we will use this tool to implement a simple linear regression. 

We start by generating some synthetic data. We will generate a dataset with 100 samples. Each sample consists of a single feature $x$ and a target value $y$. 

In [None]:
b_true = 2 # true y-intercept
w_true = 3 # true slope

x = np.random.rand(100).astype(np.float32)
noise = np.random.normal(scale=0.1, size=len(x)).astype(np.float32)  # same dtype as x
y = w_true * x + b_true + noise

plt.scatter(x, y)

To set up the regression, we define the variables, the model and the loss function. 

In [None]:

def loss_fn(y, y_pred):
    # y and y_pred are vectors (1d tensors), and so is their difference y - y_pred.
    # reduce_mean computes the mean over all elements of its argument.
    return tf.reduce_mean(tf.square(y - y_pred))

def model(w, b, x):
    return w * x + b   # x is a vector of inputs. The operation leads to a vector of predictions
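
Written out for $n$ samples, the loss implemented by `loss_fn` together with `model` is the mean squared error

$$L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (w x_i + b)\bigr)^2 .$$

Differentiating with the chain rule gives the partial derivatives that autodiff will compute for us:

$$\frac{\partial L}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} \bigl(y_i - (w x_i + b)\bigr)\, x_i, \qquad \frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \bigl(y_i - (w x_i + b)\bigr).$$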

Let's make some predictions with our model. We plot the predictions against the true values and also compute the loss. 

In [None]:
# initialize the weights and bias
w = tf.Variable(0.0)
b = tf.Variable(0.0)

predictions = model(w, b, x)
plt.scatter(x, predictions, label='Predictions', c='r')
plt.scatter(x, y, label='True', c='b')
plt.legend()
loss = loss_fn(y, predictions)

print("the loss before training is", loss)

Now we are ready to optimize the model. We will compute gradients and do a gradient descent step to update the weights of the model. We will iterate over the dataset 100 times and update the weights of the model after each iteration. In each iteration, we record the loss, and we print it every 10 iterations.
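
For reference, the same procedure can be written without autodiff when the gradients are derived by hand. The sketch below is a NumPy version that assumes the analytic gradients of the mean squared error. It generates its own data with a fixed seed (an assumption made only for reproducibility), so its numbers will not match the TensorFlow run exactly.

```python
import numpy as np

# Synthetic data mirroring the notebook; the seed is an added assumption.
rng = np.random.default_rng(0)
x = rng.random(100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)


def mse(w, b):
    # Mean squared error of the linear model w * x + b on the data above.
    return np.mean((y - (w * x + b)) ** 2)


w, b = 0.0, 0.0
lr = 0.1  # learning rate
initial_loss = mse(w, b)

for step in range(100):
    residual = y - (w * x + b)
    # Hand-derived gradients of the mean squared error.
    dw = -2.0 * np.mean(residual * x)
    db = -2.0 * np.mean(residual)
    # Gradient descent step.
    w -= lr * dw
    b -= lr * db

final_loss = mse(w, b)
print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.4f}")
```

The following cell performs the same steps, but obtains `dw` and `db` from `tape.gradient` instead of from hand-derived formulas.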

In [None]:
w = tf.Variable(0.0)
b = tf.Variable(0.0)

lr = 0.1  # learning rate
number_of_iterations = 100

history = []
for iter in range(number_of_iterations):
    with tf.GradientTape() as tape:
        y_pred = model(w, b, x)
        loss = loss_fn(y, y_pred)
    
    dw, db = tape.gradient(loss, [w, b])

    # To update the weights we cannot write w = w - lr * dw: that would rebind
    # w to a plain tensor, which the tape does not watch automatically.
    # w.assign(...) updates the value in place, so w remains a tf.Variable.
    w.assign(w - lr * dw) # update the weights
    b.assign(b - lr * db) # update the bias
    
    history.append(loss.numpy())

    if iter % 10 == 0: # we print the loss every 10 iterations
        print(f'Iteration {iter}: Loss: {loss.numpy()}')
print(f'Final weights: {w.numpy()} and bias: {b.numpy()}')

plt.plot(history)
plt.xlabel('Iteration')
plt.ylabel('Loss')
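
The difference between rebinding a name and calling `assign` can be checked in isolation. The following sketch uses hypothetical names (`w_rebound`, `grad_rebound`, `grad_variable`) and compares what a tape returns in the two cases.

```python
import tensorflow as tf

w = tf.Variable(1.0)

# Arithmetic on a variable yields a plain tensor, not a tf.Variable.
w_rebound = w - 0.1
print(isinstance(w_rebound, tf.Variable))  # False

# A tape does not watch plain tensors automatically, so no gradient is found.
with tf.GradientTape() as tape:
    loss = w_rebound * w_rebound
grad_rebound = tape.gradient(loss, w_rebound)
print(grad_rebound)  # None

# assign updates the value in place; w is still a tf.Variable and is watched.
w.assign(w - 0.1)
with tf.GradientTape() as tape:
    loss = w * w
grad_variable = tape.gradient(loss, w)
print(grad_variable is not None)  # True
```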

Let's do the predictions again and plot them. We should see that the predictions are much closer to the true values now. 

In [None]:
predictions = model(w, b, x)
plt.scatter(x, predictions, label='Predictions', c='r')
plt.scatter(x, y, label='True', c='b')
plt.legend()


### Exercises

- What do you observe towards the end of the training? Does the loss decrease in each iteration?
- Experiment with different learning rates. What happens if you set the learning rate to a very high value, such as 1.0? What happens if you set it to a very low value, such as 0.0001?

### Logistic regression (Exercise)

In this section, you will adapt the code to implement logistic regression. Recall that it differs from linear regression in two ways: the model passes the logits (the linear predictions) through a sigmoid function, and the loss function is the binary cross-entropy loss.
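
As a reminder, in the notation used for linear regression above, the two ingredients can be written as follows. The model is

$$\hat{y}_i = \sigma(w x_i + b), \qquad \sigma(t) = \frac{1}{1 + e^{-t}},$$

and the binary cross-entropy loss over $n$ samples with labels $y_i \in \{0, 1\}$ is

$$L(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\bigr].$$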

The following code generates synthetic data:

In [None]:
# generate classification data for a simple logistic regression model

n = 100
x = np.random.rand(n, ) - 0.5
b_true = 0.0
w_true = 3.0
noise_std = 0.1  # standard deviation of the noise added to the class probabilities
y = np.round(1 / (1 + np.exp(-(b_true + x * w_true))) + np.random.normal(0, noise_std, size=n))

plt.scatter(x, y)

You can use `tf.nn.sigmoid` for the sigmoid function and `tf.math.log` for computing the logarithm. With these functions you should be able to implement the loss function and the model.

In [None]:
def loss_fn_lr(y, y_pred):
    return None # Replace None with the correct loss


In [None]:
def model_lr(w, b, x):
    return None # Replace None with the correct model

In [None]:
# create some predictions with the true weights to make sure the model is working
# Also test the loss function

w = ...
b = ...
predictions = ...
loss = ...
print("the loss before training is", loss)

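The training loop remains essentially unchanged. Only the model, the loss function, the learning rate, and the number of iterations differ. The updates use `assign_sub`, which is equivalent to the `assign` calls used before.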

In [None]:
# initialize the weights and bias
w = tf.Variable(0.0)
b = tf.Variable(0.0)

# define the learning rate
lr = 0.3
number_of_iterations = 1000

history = []
for iter in range(number_of_iterations):
    with tf.GradientTape() as tape:
        y_pred = model_lr(w, b, x)
        loss = loss_fn_lr(y, y_pred)
    dw, db = tape.gradient(loss, [w, b])
    w.assign_sub(lr * dw)
    b.assign_sub(lr * db)

    history.append(loss.numpy())
    if iter % 100 == 0:
        print(f'Iteration {iter}: Loss: {loss.numpy()}')
print(f'Final weights: {w.numpy()} and bias: {b.numpy()}')

plt.plot(history)
plt.xlabel('Iteration')
plt.ylabel('Loss')

Make some predictions again and plot them. You should see that the model is able to separate the two classes.

In [None]:
predictions = ... 
plt.scatter(x, predictions, label='Predictions', c='r')
plt.scatter(x, y, label='True', c='b')
