In [85]:
# Using numpy as the underlying tensor library so that this impl is "interview realistic."
import numpy as np

See `autodiff.ipynb` for a general-purpose autodiff engine implementation.

This notebook is a study reference for an interview which asks you to implement backprop for a feed-forward network. Here, we will skip the autodiff engine and implement only what we need to pass such an interview.

# Operations

For each operation below, we want to provide a `grad` function. Each `grad` function computes a vector-Jacobian product (VJP).

Suppose we have a function $h = g\circ f$ with $f : \mathbb{R}^n\to\mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}$. In the context of a neural network, $f$ would be a particular layer (or an operation within a layer, such as affine or relu/sigmoid), and $g$ would be the rest of the "stack" sitting on top of $f$, which goes through however many layers and then through the loss function, resulting in a final scalar loss. Let $x\in \mathbb{R}^n$. We are interested in the gradient $\nabla_x h(x) = (\partial h/\partial x_1,\dots,\partial h/\partial x_n)$, which, by the chain rule, is

$$
\nabla h = \nabla g J[f]
$$

where $z = f(x)$ and $\nabla g = (\partial g/\partial z_1,\dots,\partial g/\partial z_m)$ is a row vector, and

$$J[f] = \begin{bmatrix}\frac{\partial f_1}{\partial x_1}&\dots&\frac{\partial f_1}{\partial x_n} \\ \vdots&\ddots&\vdots \\ \frac{\partial f_m}{\partial x_1}&\dots&\frac{\partial f_m}{\partial x_n}\end{bmatrix}$$

is a matrix of partial derivatives called the Jacobian. Hence, $\nabla h$ is a vector-Jacobian product, which is what each `grad` function computes.

If there are layers of the network below $f$, we can just repeat this process until we get to the bottom. In other words, suppose $h' = g\circ f\circ f'$. Then we can let $g' = g\circ f$ and $\nabla g' = \nabla h$, so that $\nabla h' = \nabla g'J[f']$.
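To make the repeated vector-Jacobian products concrete, here is a minimal numerical sketch. The toy functions `f_prime_demo`, `f_demo`, `g_demo_fn` and the matrices `A_demo`, `B_demo` are made up for illustration and are not part of the network built later. The result is checked against central finite differences.

```python
import numpy as np

# Hypothetical toy maps: f' : R^2 -> R^3 and f : R^3 -> R^2 are linear,
# so their Jacobians are simply the matrices themselves.
A_demo = np.array([[1., 2.], [0., -1.], [3., 1.]])   # J[f'], shape (3, 2)
B_demo = np.array([[1., 0., 2.], [-1., 1., 0.]])     # J[f],  shape (2, 3)

def f_prime_demo(x):
    return A_demo @ x

def f_demo(u):
    return B_demo @ u

def g_demo_fn(z):
    # g : R^2 -> R, a sum of squares, so dg/dz = 2z
    return (z ** 2).sum()

def h_prime_demo(x):
    return g_demo_fn(f_demo(f_prime_demo(x)))

x_demo = np.array([0.5, -1.0])
z_demo = f_demo(f_prime_demo(x_demo))

grad_g = 2 * z_demo                 # row vector, shape (2,)
grad_h = grad_g @ B_demo            # VJP through f,  shape (3,)
grad_h_prime = grad_h @ A_demo      # VJP through f', shape (2,)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (h_prime_demo(x_demo + eps * e) - h_prime_demo(x_demo - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(grad_h_prime)
print(np.allclose(grad_h_prime, numeric, atol=1e-6))
```

Note that each step only needs the row vector coming from above and the Jacobian of the current function; nothing about the layers further up is revisited.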

## Deriving grad Functions
$\newcommand{\a}{\alpha}\newcommand{\b}{\beta}$

We don't want to pass around Jacobians explicitly because they can be large and sparse. Instead, we pass around a `grad` function which calculates $\nabla g J[f]$ directly, performing whatever optimizations it can under the hood.
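As a small illustration of why the explicit Jacobian is wasteful, consider an elementwise operation such as sigmoid: its Jacobian is diagonal, so the vector-Jacobian product collapses to an elementwise multiply. The names `sigmoid_fn`, `x_el`, and `gy_el` below are hypothetical and the values are arbitrary.

```python
import numpy as np

def sigmoid_fn(x):
    return 1 / (1 + np.exp(-x))

x_el = np.array([-1.0, 0.0, 2.0])
gy_el = np.array([0.5, -1.0, 3.0])   # made-up upstream gradient (row vector)

y_el = sigmoid_fn(x_el)
dy_el = y_el * (1 - y_el)            # elementwise derivative of sigmoid

# Explicit route: materialize the full n x n Jacobian, which is diagonal.
J_el = np.diag(dy_el)
vjp_explicit = gy_el @ J_el

# Implicit route: never build the Jacobian, just multiply elementwise.
vjp_implicit = gy_el * dy_el

print(J_el.shape)
print(np.allclose(vjp_explicit, vjp_implicit))
```

The explicit route stores $n^2$ numbers, of which only $n$ are nonzero; the implicit route stores $n$.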

Let's derive the `grad` function for the affine operation.

Let $N$ be the batch size, $K_i$ be the number of input features, and $K_o$ be the number of output features.  
Let $x \in \mathbb{R}^{N\times K_i}$, $w \in \mathbb{R}^{K_i\times K_o}$ and $b \in \mathbb{R}^{1\times K_o}$ all be matrices.

The affine operation is $$f(x, w, b) = xw+b$$

Let $J_x[f], J_w[f]$ and $J_b[f]$ be the Jacobians w.r.t. $x$, $w$ and $b$.   
These Jacobians are 4-tensors, with
$$J_x[f]^{i,j}_{\a, \b} = \frac{\partial f_{i,j}}{\partial x_{\a, \b}}$$
etc., where  
$J_x[f]$ has shape $(N, K_o \mid N, K_i)$,  
$J_w[f]$ has shape $(N, K_o \mid K_i, K_o)$ and  
$J_b[f]$ has shape $(N, K_o \mid 1, K_o)$,  
with $(\text{output_shape} \mid \text{input_shape})$ denoting the dimensions of the raised followed by lowered tensor indices.

Suppose we have a "stack" on top of $y = f(x,w,b)$ which outputs a scalar loss $\mathcal{L} \in \mathbb{R}$. Let $g = \nabla_y \mathcal{L} \in \mathbb{R}^{N\times K_o}$ be the matrix (2-tensor) with entries $g_{i,j} = (\nabla_y \mathcal{L})_{i,j} = \partial \mathcal{L}/\partial y_{i,j}$.

Then

$$
\text{grad}_x(g) = \sum_{i, j} g_{i, j} J_x[f]^{i, j}
$$

and

$$
\text{grad}_w(g) = \sum_{i, j} g_{i, j} J_w[f]^{i, j}
$$

etc.

Let's find $J_x[f]$ and $J_w[f]$ explicitly. Since they are both 4-tensors, we can't easily visualize them in their entirety as grids. I find that, in deriving the `grad` function, it is easiest to work with a single entry $J[f]^{i,j}_{\a, \b}$.

So

$$\begin{aligned}
J_x[f]^{i,j}_{\a,\b} &= \frac{\partial}{\partial x_{\a,\b}}(xw+b)_{i,j} \\
  &= \frac{\partial}{\partial x_{\a,\b}} (x_i \cdot (w^T)_j + (b^T)_j) \\
  &= \begin{cases}w_{\b,j} & i = \a \\ 0 & \text{otherwise} \end{cases}
\end{aligned}$$

and so

$$
\text{grad}_x(g)_{\a,\b} = \sum_{i, j} g_{i, j} J_x[f]^{i, j}_{\a,\b} = \sum_{j} g_{\a,j} w_{\b,j}
$$

which we can rewrite as the matrix multiply,

$$\text{grad}_x(g) = gw^T$$
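The result above can be checked numerically by building the 4-tensor $J_x[f]$ explicitly from its case formula and contracting it with the upstream gradient. This is only a sketch for verification: the sizes and the random arrays `w_chk` and `g_chk` are arbitrary, and in practice we never materialize this tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K_i, K_o = 2, 3, 4                       # arbitrary small sizes
w_chk = rng.normal(size=(K_i, K_o))
g_chk = rng.normal(size=(N, K_o))           # stand-in for the upstream gradient

# J_x[i, j, a, b] = w[b, j] if i == a else 0, with shape (N, K_o | N, K_i).
# The identity matrix supplies the "i == a" condition.
J_x = np.einsum('ia,bj->ijab', np.eye(N), w_chk)

# Contract the upstream gradient against the raised (output) indices i, j.
grad_x_explicit = np.einsum('ij,ijab->ab', g_chk, J_x)

print(J_x.shape)
print(np.allclose(grad_x_explicit, g_chk @ w_chk.T))
```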

Likewise we have,

$$\begin{aligned}
J_w[f]^{i,j}_{\a,\b} &= \frac{\partial}{\partial w_{\a,\b}}(xw+b)_{i,j} \\
  &= \frac{\partial}{\partial w_{\a,\b}} (x_i \cdot (w^T)_j + (b^T)_j) \\
  &= \begin{cases}x_{i,\a} & j = \b \\ 0 & \text{otherwise} \end{cases}
\end{aligned}$$

giving us

$$
\text{grad}_w(g)_{\a,\b} = \sum_{i, j} g_{i, j} J_w[f]^{i, j}_{\a,\b} = \sum_{i} g_{i,\b} x_{i,\a}
$$

which we can rewrite as the matrix multiply,

$$\text{grad}_w(g) = x^T g$$
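The bias gradient follows the same pattern. Since $b$ is added to every row of $xw$, each $b_{1,\beta}$ affects $f_{i,\beta}$ for every $i$:

$$\begin{aligned}
J_b[f]^{i,j}_{1,\beta} &= \frac{\partial}{\partial b_{1,\beta}}(xw+b)_{i,j}
  = \begin{cases}1 & j = \beta \\ 0 & \text{otherwise} \end{cases} \\
\text{grad}_b(g)_{1,\beta} &= \sum_{i, j} g_{i, j} J_b[f]^{i, j}_{1,\beta} = \sum_{i} g_{i,\beta}
\end{aligned}$$

That is, $\text{grad}_b(g)$ is the sum of $g$ over the batch axis, which is what `gy.sum(axis=0)` computes in the code.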

In code, for any variable `v`, we will use `gv` to represent the gradient of the loss `L` with respect to `v`.

I.e. `gv` = $\nabla_{v}L = \frac{\partial{L}}{\partial{v}}$

In [86]:
# Define some operations which will let us construct a feed-forward network and loss.
# Each operation returns an output and a `grad` function which can be called to calculate a backward-pass through the same op.

# Naming convention: for any variable `v`, `gv` is the gradient of the loss `L` with respect to `v`,
# i.e. `gv` = dL/dv.

def affine(x, w, b):
  y = x @ w + b

  def grad(gy):
    # w.r.t. x
    gx = gy @ w.T

    # w.r.t. w
    gw = x.T @ gy

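    # w.r.t. b: b is broadcast across the batch, so sum gy over the batch axis.
    # Note the result has shape (K_o,), while b itself has shape (1, K_o).
    gb = gy.sum(axis=0)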

    return gx, gw, gb

  return y, grad


def sigmoid(x):
  y = 1/(1+np.exp(-x))

  def grad(gy):
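    # sigma'(x) = sigma(x) * (1 - sigma(x)), which is algebraically equal to
    # np.exp(-x) / (1 + np.exp(-x))**2.
    # The form below reuses the forward output `y` and avoids the inf/inf = nan
    # that the exp form produces when np.exp(-x) overflows (very negative x).
    dy = y * (1 - y)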
    return gy*dy

  return y, grad


def relu(x):
  y = np.clip(x, a_min=0, a_max=None)

  def grad(gy):
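    # Subgradient convention at x == 0: this implementation passes the gradient
    # through (uses 1). Note that torch.relu uses 0 there, so results can differ
    # if a pre-activation is exactly 0.
    return gy * (x >= 0).astype(float)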

  return y, grad


def square_error_loss(x, y):
  diff = x - y
  l = (diff**2).sum(axis=-1)

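  def grad(gl):
    # Gradient w.r.t. x only (y is treated as a constant target).
    # gl has shape (N,), one entry per example; add a trailing axis so it scales
    # each row of diff rather than broadcasting against the feature axis.
    return np.expand_dims(gl, axis=-1) * 2 * diff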

  return l, grad


def tensor_sum(x):
  y = x.sum()

  def grad(gy):
    return gy * np.ones_like(x)

  return y, grad
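The two forms of the sigmoid derivative mentioned in `sigmoid`'s `grad` are algebraically equal, but they behave differently at extreme inputs. The sketch below compares them in float64; the helper `sigmoid_grad_forms` and the test inputs are made up for illustration. Overflow warnings are suppressed on purpose so the resulting values can be inspected.

```python
import numpy as np

def sigmoid_grad_forms(x):
    """Return the sigmoid derivative computed two algebraically equal ways."""
    with np.errstate(over='ignore', invalid='ignore'):
        e = np.exp(-x)                    # overflows to inf for very negative x
        y = 1 / (1 + e)
        from_exp = e / (1 + e) ** 2       # closed form in exp(-x)
        from_output = y * (1 - y)         # reuses the forward output
    return from_exp, from_output

moderate = sigmoid_grad_forms(np.array([-2.0, 0.0, 3.0]))
extreme = sigmoid_grad_forms(np.array([-1000.0]))

print(np.allclose(*moderate))   # the two forms agree for moderate inputs
print(extreme)                  # exp form gives nan, output form gives 0.0
```

For the extreme input, `np.exp(1000.0)` is `inf`, so the exp form evaluates `inf / inf`, while the output form evaluates `0.0 * 1.0`. The forward pass itself still overflows inside `np.exp`; this sketch only addresses the derivative.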

# Feed-forward network

In [87]:
# build a simple NN with toy data

# data - shape == (batch, features)
x = np.array([[1,2,3], [4,5,6]], dtype=float)
y = np.array([[1, 0], [0, 1]], dtype=float)

# params
w1 = np.array([[1, 1], [-1, 1], [-2, 2]], dtype=float)
b1 = np.array([[0, 1]], dtype=float)
w2 = np.array([[.2, .5], [.5, -.5]], dtype=float)
b2 = np.array([[-1, .5]], dtype=float)
w3 = np.array([[.7, -.5], [-.2, .3]], dtype=float)
b3 = np.array([[.3, -.2]], dtype=float)
params = [w1, b1, w2, b2, w3, b3]

# build NN
h0 = x
s1, s1_grad = affine(h0, w1, b1)
h1, h1_grad = sigmoid(s1)
s2, s2_grad = affine(h1, w2, b2)
h2, h2_grad = relu(s2)
s3, s3_grad = affine(h2, w3, b3)
Le, Le_grad = square_error_loss(s3, y=y)
L, L_grad = tensor_sum(Le)
L  # view the output

np.float64(2.060075596202683)

In [88]:
# perform backprop
g = L_grad(1)
g = Le_grad(g)
g, g_w3, g_b3 = s3_grad(g)
g = h2_grad(g)
g, g_w2, g_b2 = s2_grad(g)
g = h1_grad(g)
_, g_w1, g_b1 = s1_grad(g)

my_grads = [g_w1, g_b1, g_w2, g_b2, g_w3, g_b3]
my_grads  # view our grads

[array([[ 6.90769472e-05, -3.63401821e-06],
        [ 1.41001896e-04, -7.26838789e-06],
        [ 2.12926844e-04, -1.09027576e-05]]),
 array([ 7.19249484e-05, -3.63436968e-06]),
 array([[ 0.00000000e+00,  1.43982798e-04],
        [ 0.00000000e+00, -6.79882637e-01]]),
 array([ 0.        , -0.67987537]),
 array([[ 0.        ,  0.        ],
        [-0.00066893, -0.00019387]]),
 array([-0.80019174, -2.79971239])]
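Gradients like these can also be sanity-checked without an external library by using central finite differences. The sketch below is an illustration, not part of the original notebook: `affine_demo` is a copy of `affine` so the block runs on its own, and `numeric_grad`, `loss_fd`, and the `_fd` arrays are hypothetical names with random values.

```python
import numpy as np

def affine_demo(x, w, b):
    """Copy of the notebook's affine op so this block is self-contained."""
    y = x @ w + b

    def grad(gy):
        return gy @ w.T, x.T @ gy, gy.sum(axis=0)

    return y, grad

def numeric_grad(fn, v, eps=1e-6):
    """Central-difference gradient of the scalar fn() w.r.t. array v.

    v is perturbed in place one entry at a time and restored afterwards.
    """
    out = np.zeros_like(v)
    for idx in np.ndindex(v.shape):
        orig = v[idx]
        v[idx] = orig + eps
        hi = fn()
        v[idx] = orig - eps
        lo = fn()
        v[idx] = orig
        out[idx] = (hi - lo) / (2 * eps)
    return out

rng = np.random.default_rng(0)
x_fd = rng.normal(size=(2, 3))
w_fd = rng.normal(size=(3, 2))
b_fd = rng.normal(size=(1, 2))

def loss_fd():
    # Scalar "stack" on top of the op: sum of squared outputs.
    y, _ = affine_demo(x_fd, w_fd, b_fd)
    return (y ** 2).sum()

y_fd, grad_fd = affine_demo(x_fd, w_fd, b_fd)
gx_fd, gw_fd, gb_fd = grad_fd(2 * y_fd)   # d(sum y^2)/dy = 2y

checks = [
    np.allclose(gx_fd, numeric_grad(loss_fd, x_fd), atol=1e-6),
    np.allclose(gw_fd, numeric_grad(loss_fd, w_fd), atol=1e-6),
    # gb_fd has shape (2,) and broadcasts against the (1, 2) numeric gradient.
    np.allclose(gb_fd, numeric_grad(loss_fd, b_fd), atol=1e-6),
]
print(checks)
```

Finite differences cost two forward passes per parameter entry, so this is only practical for small test cases like this one.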

# Test

In [89]:
# Compare our gradients against gradients calculated by PyTorch.
!pip3 install torch
import torch



In [90]:
def sigmoid_torch(x):
  y = 1/(1+(-x).exp())
  return y, None


def relu_torch(x):
  # y = torch.maximum(x, torch.zeros_like(x))
  y = torch.clamp(x, min=0)
  return y, None

In [91]:
w1t = torch.tensor(w1, requires_grad=True)
b1t = torch.tensor(b1, requires_grad=True)
w2t = torch.tensor(w2, requires_grad=True)
b2t = torch.tensor(b2, requires_grad=True)
w3t = torch.tensor(w3, requires_grad=True)
b3t = torch.tensor(b3, requires_grad=True)
torch_params = [w1t, b1t, w2t, b2t, w3t, b3t]

h0 = torch.tensor(x)
s1, _ = affine(h0, w1t, b1t)
h1, _ = sigmoid_torch(s1)
s2, _ = affine(h1, w2t, b2t)
h2, _ = relu_torch(s2)
s3, _ = affine(h2, w3t, b3t)
Le, _ = square_error_loss(s3, y=torch.tensor(y))
L, _ = tensor_sum(Le)
L  # view the output

tensor(2.0601, dtype=torch.float64, grad_fn=<SumBackward0>)

In [92]:
# Compare to gradients calculated by torch.
# This is for debugging purposes.

def get_torch_grads(target, params):
  # zero out previously accumulated gradients
  for p in params:
    if p.grad is not None:
      p.grad.zero_()
  # backward pass; each param's .grad accumulates the new gradients
  target.backward(torch.ones_like(target), retain_graph=True)
  return [p.grad for p in params]


# compare to torch grads
torch_grads = get_torch_grads(L, torch_params)
print('matches:', [torch.allclose(torch.tensor(my_g), tc_g) for my_g, tc_g in zip(my_grads, torch_grads)])
# If all are True then we've succeeded

matches: [True, True, True, True, True, True]
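One caveat when reading this result: the bias gradients printed earlier have shape `(2,)`, while the bias parameters have shape `(1, 2)`, yet the comparison still reports `True` because the inputs are broadcast against each other. `np.allclose` broadcasts in the same way, so the sketch below uses NumPy only. The helper `same_shape_and_close` and the arrays are hypothetical.

```python
import numpy as np

def same_shape_and_close(a, b):
    """Stricter comparison: shapes must match exactly, then values must be close."""
    a, b = np.asarray(a), np.asarray(b)
    return a.shape == b.shape and bool(np.allclose(a, b))

gb_like = np.array([0.1, -0.2])       # shape (2,), like gy.sum(axis=0)
b_like = np.array([[0.1, -0.2]])      # shape (1, 2), like the bias parameter

print(np.allclose(gb_like, b_like))                           # True: broadcasting hides the mismatch
print(same_shape_and_close(gb_like, b_like))                  # False
print(same_shape_and_close(gb_like.reshape(1, -1), b_like))   # True
```

Whether the shape difference matters depends on how the gradients are used afterwards; an in-place update such as `b -= lr * gb` would still broadcast correctly here.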
