# Gradient and backpropagation

### Intro

This notebook looks at how gradients are computed, the core ingredient of backpropagation.

I will use the following function as an example: f = (x + y) * z.

We can treat it as a scalar-valued function that takes a 3-dimensional vector as input: [x, y, z].

Calculating the vector of partial derivatives (the gradient) is a critical part of the training process for neural networks.

### Why

The main reason for calculating the gradient of the loss function with respect to the model's parameters is to train the neural network, i.e. to estimate a set of parameters that minimises the loss function. The gradient points in the direction of steepest local increase of the loss, so a small step in the opposite direction is the step that decreases the loss fastest, at least locally. The algorithm that repeats such steps is called **gradient descent**.
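
To make the update rule concrete, here is a minimal sketch of gradient descent in plain Python. The quadratic `loss`, its hand-derived `loss_gradient`, the starting point and the `learning_rate` are all assumptions chosen for illustration; they are not part of the example function used in the rest of this notebook.

```python
# Hypothetical loss with a known minimum at w = (3, -1)
def loss(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def loss_gradient(w):
    # Partial derivatives of the loss above, worked out by hand
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w = [0.0, 0.0]        # arbitrary starting point
learning_rate = 0.1   # step size, chosen for illustration only

for step in range(200):
    grad = loss_gradient(w)
    # Step against the gradient: the direction of steepest local decrease
    w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]

print(w, loss(w))
```

After enough steps `w` should end up very close to the minimiser (3, -1). In a real network the gradient would come from backpropagation rather than from a hand-written formula.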



In [78]:
import os
from typing import Callable

# Suppress TensorFlow log messages; this must be set before TensorFlow is imported
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import numpy as np
from sympy import symbols, diff, lambdify
import tensorflow as tf
import torch

## Computing the gradient

There are two ways to compute the gradient:

- *Numerical gradient* (finite-difference methods). The numerical gradient is computed by perturbing each parameter of the function slightly and observing the corresponding change in the function's output. The drawback is that it needs at least one extra function evaluation per parameter, so its cost grows linearly with the number of parameters, which makes it inefficient for large models. It is also only an approximation.
- *Analytical gradient* (using calculus). The gradient is derived exactly by differentiating the function with respect to its parameters, for example with symbolic differentiation.
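
For reference, the finite-difference idea can be written as the forward-difference approximation below. The symbols are introduced here for illustration: $e_i$ denotes the unit vector along the $i$-th parameter and $h$ is a small step size.

$$
\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h\,e_i) - f(x)}{h}
$$

Computing all $n$ partial derivatives this way takes $n + 1$ evaluations of $f$: one at $x$ and one per perturbed parameter.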

### Numerical gradient

In [80]:
# The following function is taken from the course notes https://cs231n.github.io by Andrej Karpathy

def eval_numerical_gradient(f: Callable, x: np.ndarray) -> np.ndarray:
    """
    
    Computes numerical gradient of f at x
    
    f - some scalar-valued function
    x - np.array, the data point to evaluate gradient at

    Logic is the following:
    - compute value of the function
    - iterate over the input parameters of the function, increment each
      by a tiny margin h and note the effect of change in this parameter
      on the function value (i.e. partial derivative)
    - return the vector of partial derivatives (gradient)
    """
    fx = f(x)
    grad = np.zeros(x.shape)
    # Small step size used to perturb each parameter
    h = 0.00001

    # Efficient multi-dimensional iterator object to iterate over arrays
    iterator = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])

    while not iterator.finished:

        # Evaluate function at x + h
        ix = iterator.multi_index
        old_value = x[ix]
        print('old value of parameter:', old_value)

        # Increment by h
        x[ix] = old_value + h

        print('new value of parameter:', old_value + h)
        fxh = f(x)
        print('new value of the function:', fxh)
        # Restore to old value
        x[ix] = old_value
        # Compute partial derivative
        grad[ix] = (fxh - fx) / h
        iterator.iternext()
    return grad

In [55]:
def f(params: np.ndarray) -> float:
    x, y, z = params
    return (x + y) * z

In [82]:
# Value of the function at this point
x = np.array([-2.0, 5.0, -4.0])
f(x)

-12.0

In [76]:
eval_numerical_gradient(f, x)

old value of parameter: -2.0
new value of parameter: -1.99999
new value of the function: -12.00004
old value of parameter: 5.0
new value of parameter: 5.00001
new value of the function: -12.000039999999998
old value of parameter: -4.0
new value of parameter: -3.99999
new value of the function: -11.99997


array([-4., -4.,  3.])
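
A common use of the numerical gradient is as a sanity check for an analytical one. The sketch below does this in plain Python with a centered difference, which is usually more accurate than the one-sided difference used above. The names `f_scalar` and `central_difference_gradient` are made up for this illustration, and the analytical gradient `[z, z, x + y]` is worked out by hand for this particular function.

```python
def f_scalar(params):
    x, y, z = params
    return (x + y) * z


def central_difference_gradient(func, point, h=1e-5):
    """Approximate the gradient of func at point with centered differences."""
    grad = []
    for i in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[i] += h    # perturb parameter i upwards
        minus[i] -= h   # perturb parameter i downwards
        grad.append((func(plus) - func(minus)) / (2.0 * h))
    return grad


point = [-2.0, 5.0, -4.0]
numerical = central_difference_gradient(f_scalar, point)

# Analytical gradient of (x + y) * z is [z, z, x + y]
analytical = [point[2], point[2], point[0] + point[1]]

max_error = max(abs(n - a) for n, a in zip(numerical, analytical))
print(numerical, max_error)
```

If the two gradients disagree by more than a small tolerance, the analytical derivation (or its implementation) is the first thing to suspect.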

### Analytical gradient & backpropagation

The analytical approach uses calculus to compute the partial derivatives exactly.

Backpropagation is an algorithm for efficiently computing gradients in composite functions by applying the chain rule of calculus. It involves propagating the error backward through the "computational graph" to calculate the gradients with respect to each parameter of the function.
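
As a worked instance of the chain rule for the example function, split $f = (x + y)\,z$ into an intermediate node $q = x + y$ and the product $f = q\,z$:

$$
\frac{\partial f}{\partial q} = z, \qquad
\frac{\partial f}{\partial z} = q, \qquad
\frac{\partial q}{\partial x} = \frac{\partial q}{\partial y} = 1
$$

$$
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\,\frac{\partial q}{\partial x} = z, \qquad
\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q}\,\frac{\partial q}{\partial y} = z, \qquad
\frac{\partial f}{\partial z} = x + y
$$

At the point $(x, y, z) = (-2, 5, -4)$ this gives the gradient $(-4, -4, 3)$.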

Here is an excellent blog post from Chris Olah on the topic: [Calculus on Computational Graphs: Backpropagation](https://colah.github.io/posts/2015-08-Backprop/)

#### Backprop by hand: an example from cs231n

In [47]:

# f = (x + y) * z
# inputs
x, y, z = -2, 5, -4

# break f into q * z
q = x + y
print(f'q: {q}')
f = q * z
print(f'f: {f}')

# partial derivatives of f with respect to z and q
dfdz = q
dfdq = z

# partial derivatives of q with respect to x and y
dqdx = 1
dqdy = 1

# Application of chain rule
dfdx = dqdx * dfdq
dfdy = dqdy * dfdq
print([dfdx, dfdy, dfdz])

q: 3
f: -12
[-4, -4, 3]
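
The manual bookkeeping above can be automated. The following is a minimal sketch of reverse-mode differentiation for scalars. The `Node` class and its methods are hypothetical names invented for this illustration, not part of any library used in this notebook, and only `+` and `*` are supported.

```python
class Node:
    """Hypothetical scalar node that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Pairs of (parent node, local derivative d(self)/d(parent))
        self.parents = parents

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    def backward(self):
        # Order the graph so that a node is processed only after
        # every node that depends on it
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # derivative of the output with respect to itself
        for node in reversed(order):
            for parent, local in node.parents:
                # Chain rule: local derivative times upstream gradient
                parent.grad += local * node.grad


x, y, z = Node(-2.0), Node(5.0), Node(-4.0)
out = (x + y) * z
out.backward()
print(out.value, [x.grad, y.grad, z.grad])
```

Each operation stores its local derivatives during the forward pass, and `backward` multiplies them along the graph from the output to the inputs. This is the same idea that the TensorFlow and PyTorch examples below rely on, although their actual implementations are far more general.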


#### Same with symbolic differentiation

In [84]:
x, y, z = symbols('x y z')
f = (x + y) * z
df_dx = diff(f, x)
df_dy = diff(f, y)
df_dz = diff(f, z)

df_dx_numeric = lambdify((x, y, z), df_dx, 'numpy')
df_dy_numeric = lambdify((x, y, z), df_dy, 'numpy')
df_dz_numeric = lambdify((x, y, z), df_dz, 'numpy')

v = [-2.0, 5.0, -4.0]
print("df/dx =", df_dx_numeric(*v))
print("df/dy =", df_dy_numeric(*v))
print("df/dz =", df_dz_numeric(*v))



df/dx = -4.0
df/dy = -4.0
df/dz = 3.0


#### Same with TensorFlow

In [29]:

x = tf.Variable(-2.0)
y = tf.Variable(5.0)
z = tf.Variable(-4.0)

with tf.GradientTape(persistent=True) as tape:
    f = (x + y) * z

# persistent=True allows several gradient() calls on the same tape
df_dx = tape.gradient(f, x)
df_dy = tape.gradient(f, y)
df_dz = tape.gradient(f, z)

# Print the results
print("df/dx =", df_dx.numpy())
print("df/dy =", df_dy.numpy())
print("df/dz =", df_dz.numpy())

del tape

df/dx = -4.0
df/dy = -4.0
df/dz = 3.0


#### Same with PyTorch

In [26]:
x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

f = (x + y) * z
f.backward()

# Access the partial derivatives
df_dx = x.grad
df_dy = y.grad
df_dz = z.grad

# Print the results
print("df/dx =", df_dx.item())
print("df/dy =", df_dy.item())
print("df/dz =", df_dz.item())

df/dx = -4.0
df/dy = -4.0
df/dz = 3.0


### Conclusion

The next step is to apply this approach to a vector-valued function, obtain its Jacobian, and use it to perform the parameter update.