# Neural Networks Layers Lab

Welcome to the Neural Networks Layers lab! By the end of this lab, you will have:

- Implemented Affine, ReLU, and Squared Loss layers
- Built a modular implementation of a one-hidden-layer perceptron model with a squared loss

Let's get started!

---

# Neural Network Layers

A neural network *layer* is a unit of computation that knows how to

- Compute a *forward* pass where it calculates its output(s) from its input(s)
- Compute a *backward* pass where it calculates its input gradient(s) from its output gradient(s) 

A neural network *layer* can be implemented in many ways. One reasonable choice is a Python class that conforms to the following interface.

In [2]:
class Layer:
    def forward(self, inputs):
        raise NotImplementedError('Forward pass not implemented!')
        
    def backward(self, dout):
        raise NotImplementedError('Backward pass not implemented!')

# Examples

Let's start off simple with a Plus layer.

![Plus Layer](images/Plus.png)

In [13]:
class Plus(Layer):
    def forward(self, a, b):
        c = a + b
        return c
    
    def backward(self, dc):
        da, db = 1*dc, 1*dc
        return da, db

As can be seen, the Plus layer computes its output `c` given its inputs `a` and `b`. The Plus layer also computes its input gradients `da` and `db` given its output gradient `dc`.

Notice that `Plus.backward()` does not need to know the value of `a` or `b` (i.e. its inputs). Such a layer is called *stateless* because it doesn't need to remember anything from its forward pass.

In contrast, some layers are *stateful*. A stateful layer requires knowledge of values that were computed during its forward pass in order to compute its backward pass. The Square layer is an example of a stateful layer.

![Square Layer](images/Square.png)
Notice the Square layer is stateful because $\dfrac{\partial y}{\partial x}$ is a function of $x$.

In [14]:
class Square(Layer):
    def forward(self, x):
        y = x**2
        self.cache = locals()
        return y
    
    def backward(self, dy):
        x = self.cache['x']
        dx = 2*x * dy
        return dx

To retain knowledge of values computed during the forward pass of the Square layer, we call the Python built-in `locals()` function right before exiting. The `locals()` function returns a `dict` containing all of the local variables in the current scope. It's a convenient way to quickly record everything that's been computed in the forward pass. We save it to an attribute `self.cache` so that we can retrieve it in the backward pass.
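
To make the contents of the cache concrete, here is a minimal sketch that inspects what `locals()` records inside a method. The class name `Probe` is hypothetical and exists only for this illustration; it is not part of the lab. Note that the snapshot contains `self` as well as the values computed in the forward pass.

```python
class Probe:
    """Hypothetical class used only to inspect what locals() records."""

    def forward(self, x):
        y = x**2
        # Snapshot every local name defined so far, including `self`.
        self.cache = locals()
        return y


probe = Probe()
result = probe.forward(3)
cached_names = sorted(probe.cache.keys())

print(result)        # 9
print(cached_names)  # ['self', 'x', 'y']
```

Because the whole local scope is stored, the backward pass can look up any intermediate value by name, e.g. `probe.cache['x']`.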

Once we have layers, we can chain them together to form more interesting computational graphs. Here's an example of chaining together a `Plus` layer and a `Square` layer.

![Pipeline](images/Pipeline.png)

In [29]:
plus, square = Plus(), Square()

a, b = 3, 2

c = plus.forward(a, b)
y = square.forward(c)

dy = 1
dc = square.backward(dy)
da, db = plus.backward(dc)

da, db, dc, dy

(10, 10, 10, 1)

As you can see, an invocation of a computational graph consists of four steps.

1. Instantiate the layers of your computational graph
2. Define the values of the inputs to the graph
3. Perform the forward pass
4. Perform the backward pass (i.e. backpropagation)
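
A common way to sanity-check a backward pass is to compare it against a numerical estimate. The sketch below is an assumption-laden illustration rather than part of the lab: the helper names `pipeline` and `central_difference` are hypothetical, and the graph is rewritten as a plain function so the block stands alone. It estimates the gradients of the Plus-then-Square graph at `a, b = 3, 2` with central differences.

```python
def pipeline(a, b):
    """Same computation as the Plus -> Square graph: y = (a + b)**2."""
    return (a + b) ** 2


def central_difference(f, args, index, eps=1e-6):
    """Estimate the partial derivative of f with respect to args[index]."""
    up = list(args)
    down = list(args)
    up[index] += eps
    down[index] -= eps
    return (f(*up) - f(*down)) / (2 * eps)


a, b = 3, 2
da_numeric = central_difference(pipeline, (a, b), 0)
db_numeric = central_difference(pipeline, (a, b), 1)

print(round(da_numeric, 4), round(db_numeric, 4))  # 10.0 10.0
```

Both estimates agree with the `da` and `db` values produced by backpropagation in the example above.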

Enough examples. It's your turn to implement some layers!

---

# Affine Layer

### Tasks

- Implement an Affine layer which computes the function

$$
\text{Affine}(x, w, b) = wx + b
$$

and hence corresponds to the computational graph

![Affine Black Box](images/Affine%20Abstraction%20Black%20Box.png)
### Requirements

- Use the exact variable names as used in the computational graph
- Use the variable naming convention `d`$\cdot = \overset{\longleftarrow}{\nabla_\cdot}$. For example, $\overset{\longleftarrow}{\nabla_r}$ gets the variable name `dr`.

### Hints

- Implement the Affine layer in terms of operations which have simple local gradients (i.e. are easy to backpropagate through) as in the computational graph

![Affine White Box](images/Affine%20Abstraction%20White%20Box.png)

### Question

- Why do you think we compute $\nabla_x$? Recall that in the previous lab we only computed $\nabla_w$ and $\nabla_b$.

### Answer

- In the previous lab, $x$ was assumed to be our observed, fixed data. Hence there was no use in computing $\nabla_x$ because we could not change $x$. However, we can now have affine layers whose $x$ input is an activation from a previous layer. Computing $\nabla_x$ is then needed to chain the gradient from the loss back to the parameters of earlier layers during backpropagation.
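
The answer above can be illustrated with two chained affine maps. This is a minimal sketch using plain functions; the names `affine_forward` and `affine_backward` and the numeric values are assumptions chosen for illustration, not part of the lab. The gradient with respect to the second map's `x` input is exactly what the first map needs as its upstream gradient.

```python
def affine_forward(x, w, b):
    """Compute a = w*x + b for scalars."""
    return w * x + b


def affine_backward(x, w, da):
    """Return (dx, dw, db) for a = w*x + b given the upstream gradient da."""
    return w * da, x * da, 1 * da


x, w1, b1, w2, b2 = 2.0, 3.0, 1.0, -4.0, 0.5

# Forward: the output of the first affine map is the x input of the second.
a1 = affine_forward(x, w1, b1)
a2 = affine_forward(a1, w2, b2)

# Backward, starting from an upstream gradient of 1 on a2.
# The "dx" of the second map is the gradient with respect to a1.
da1, dw2, db2 = affine_backward(a1, w2, 1.0)
_, dw1, db1 = affine_backward(x, w1, da1)

print(dw1, db1)  # -8.0 -4.0
```

Without `da1`, there would be no way to obtain `dw1` and `db1`, since the first map's parameters influence `a2` only through `a1`.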

In [1]:
class Affine(Layer):
    def forward(self, x, w, b):
        z = w*x
        a = z + b
        self.cache = locals()
        return a

    def backward(self, da):
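        x, w = self.cache['x'], self.cache['w']
        # a = z + b, so the upstream gradient flows unchanged to z and b
        dz, db = 1*da, 1*da
        # z = w*x, so each factor's gradient is the other factor times dz
        dx, dw = w*dz, x*dz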
        return dx, dw, db


## ReLU Layer

### Tasks

- Implement a Rectified Linear Unit (ReLU) layer which computes the function

$$
\text{ReLU}(a) = \begin{cases} 0 & \text{if } a < 0 \\ a & \text{otherwise} \end{cases}
$$

and hence corresponds to the computational graph

![ReLU Layer Black Box](images/ReLU%20Layer%20Black%20Box.png)
### Requirements

- Use the exact variable names as used in the computational graph
- Use the variable naming convention `d`$\cdot = \overset{\longleftarrow}{\nabla_\cdot}$. For example, $\overset{\longleftarrow}{\nabla_r}$ gets the variable name `dr`.

### Hints

- Implement the ReLU layer in terms of operations which have simple local gradients (i.e. are easy to backpropagate through) as in the computational graph

![ReLU Layer White Box](images/ReLU%20Layer%20White%20Box.png)

In [21]:
class ReLU(Layer):
    def forward(self, a):
        h = max(a, 0)
        self.cache = locals()
        return h
    
    def backward(self, dh):
        a = self.cache['a']
        da = (a>0) * dh
        return da
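
To see how the gradient behaves on either side of zero, here is a small sketch with plain functions (the names `relu_forward` and `relu_backward` are hypothetical stand-ins for the layer methods). ReLU is not differentiable at `a = 0`; the `(a > 0)` test used in the code above is one common convention, and it assigns a gradient of 0 there.

```python
def relu_forward(a):
    """Scalar ReLU: pass positive inputs through, clamp the rest to 0."""
    return max(a, 0)


def relu_backward(a, dh):
    """Gradient of ReLU with respect to its input, given upstream dh."""
    # (a > 0) is a bool; multiplication treats True as 1 and False as 0.
    return (a > 0) * dh


# Probe a negative input, the kink at zero, and a positive input.
results = [(a, relu_forward(a), relu_backward(a, 1.0)) for a in (-2.0, 0.0, 3.0)]

for a, h, da in results:
    print(f"a={a}, h={h}, da={da}")
```

The gradient is blocked for the negative input and at zero, and passes through unchanged for the positive input.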

## Squared Loss Layer

### Tasks

- Implement a SquaredLoss layer which computes the function

$$
\text{SquaredLoss}(\hat{y}, y) = (\hat{y} - y)^2
$$

and hence corresponds to the computational graph

![Squared Loss Black Box](images/Squared%20Loss%20Black%20Box.png)
### Requirements

- Use the exact variable names as used in the computational graph
- Use the variable naming convention `d`$\cdot = \overset{\longleftarrow}{\nabla_\cdot}$. For example, $\overset{\longleftarrow}{\nabla_r}$ gets the variable name `dr`.

### Hints

- Implement the SquaredLoss layer in terms of operations which have simple local gradients (i.e. are easy to backpropagate through) as in the computational graph

![Squared Loss White Box](images/Squared%20Loss%20White%20Box.png)

In [18]:
class SquaredLoss(Layer):
    def forward(self, y_hat, y):
        r = y_hat - y
        l = r**2
        self.cache = locals()
        return l

    def backward(self, dl):
        r = self.cache['r']
        dr = 2*r * dl
        dy_hat, dy = 1*dr, -1*dr
        return dy_hat
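
The backward pass above follows from the chain rule applied to the two forward steps $r = \hat{y} - y$ and $\ell = r^2$. As a short worked derivation in the lab's $\overset{\longleftarrow}{\nabla}$ notation:

\begin{align*}
\overset{\longleftarrow}{\nabla_r} &= \frac{\partial \ell}{\partial r} \, \overset{\longleftarrow}{\nabla_\ell} = 2r \, \overset{\longleftarrow}{\nabla_\ell} \\
\overset{\longleftarrow}{\nabla_{\hat{y}}} &= \frac{\partial r}{\partial \hat{y}} \, \overset{\longleftarrow}{\nabla_r} = \overset{\longleftarrow}{\nabla_r} \\
\overset{\longleftarrow}{\nabla_y} &= \frac{\partial r}{\partial y} \, \overset{\longleftarrow}{\nabla_r} = -\overset{\longleftarrow}{\nabla_r}
\end{align*}

These three lines correspond to `dr`, `dy_hat`, and `dy` in the code. Only `dy_hat` is returned, since the target `y` is fixed data.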

# One-Hidden-Layer Neural Network Pipeline

Recall from the last lab that a one-hidden-layer neural network model takes the form

$$
g(x, w_1, b_1, w_2, b_2) = \max(w_1 x + b_1, 0)w_2 + b_2.
$$

Rewriting $g$ in terms of the layers we have defined yields

$$
g(x, w_1, b_1, w_2, b_2) = \text{Affine}(\text{ReLU}(\text{Affine}(x, w_1, b_1)), w_2, b_2).
$$

Applying a squared loss to $g$ yields the loss function

\begin{align*}
\mathcal{L}(x, y, w_1, b_1, w_2, b_2)
&= \text{SquaredLoss}(\text{Affine}(\text{ReLU}(\text{Affine}(x, w_1, b_1)), w_2, b_2), y)
\end{align*}

for a given $(x, y)$ training pair and parameters $(w_1, b_1, w_2, b_2)$.

## Forward Pass

### Tasks

- Compute $\mathcal{L}(2, 1, -1, 1, -2, 1.5)$ as corresponding to the computational graph

![MLP Layers Forward Numeric](images/MLP%20Layers%20Numeric%20Forward.png)
### Requirements

- Use only the layers you have defined in this lab

In [31]:
affine1, relu, affine2, squared_loss = Affine(), ReLU(), Affine(), SquaredLoss()

x, y = 2, 1
w1, b1, w2, b2 = -1, 1, -2, 1.5

a = affine1.forward(x, w1, b1)
h = relu.forward(a)
y_hat = affine2.forward(h, w2, b2)
l = squared_loss.forward(y_hat, y)

a, h, y_hat, l

(-1, 0, 1.5, 0.25)
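
The output above can be checked by hand, one layer at a time:

\begin{align*}
a &= w_1 x + b_1 = (-1)(2) + 1 = -1 \\
h &= \max(a, 0) = \max(-1, 0) = 0 \\
\hat{y} &= w_2 h + b_2 = (-2)(0) + 1.5 = 1.5 \\
\ell &= (\hat{y} - y)^2 = (1.5 - 1)^2 = 0.25
\end{align*}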

## Backward Pass

### Tasks

- Compute $\nabla_{w_1}$, $\nabla_{b_1}$, $\nabla_{w_2}$, and $\nabla_{b_2}$ as corresponding to the computational graph

![MLP Layers Backward Numeric](images/MLP%20Layers%20Numeric%20Backward.png)
### Requirements

- Use the variable naming convention `d`$\cdot = \overset{\longleftarrow}{\nabla_\cdot}$. For example, $\overset{\longleftarrow}{\nabla_r}$ gets the variable name `dr`.

### Hints

- $\overset{\longleftarrow}{\nabla_\ell}$ = 1 will get you started

In [27]:
dl = 1
dy_hat = squared_loss.backward(dl)
dh, dw2, db2 = affine2.backward(dy_hat)
da = relu.backward(dh)
dx, dw1, db1 = affine1.backward(da)

dx, dw1, db1, dw2, db2

(0.0, -0.0, -0.0, 0.0, 1.0)
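
The parameter gradients above can be cross-checked numerically. The following is a sketch under stated assumptions: the loss is rewritten as a plain scalar function, and the helper names `loss` and `numeric_gradient` are hypothetical rather than part of the lab. It uses central differences at the same parameter values.

```python
def loss(w1, b1, w2, b2, x=2, y=1):
    """SquaredLoss(Affine(ReLU(Affine(x, w1, b1)), w2, b2), y) for scalars."""
    h = max(w1 * x + b1, 0)
    y_hat = w2 * h + b2
    return (y_hat - y) ** 2


def numeric_gradient(f, params, eps=1e-6):
    """Central-difference estimate of the gradient of f at params."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((f(*up) - f(*down)) / (2 * eps))
    return grads


params = [-1, 1, -2, 1.5]  # w1, b1, w2, b2
dw1, db1, dw2, db2 = numeric_gradient(loss, params)

print([round(g, 4) for g in (dw1, db1, dw2, db2)])  # [0.0, 0.0, 0.0, 1.0]
```

The zeros are expected at this point: the hidden pre-activation is $a = -1$, so the ReLU is inactive, $h = 0$, and no gradient reaches $w_1$, $b_1$, or $w_2$. Only $b_2$ affects the loss here.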

### Question

- Did you need to make a call to `locals()` to cache anything during the neural network forward pass pipeline? Why or why not?

### Answer

- No, because all of the caching is taken care of by the layers that we are hooking together.

### Bonus Tasks

- Implement a Sigmoid layer
- Implement a Softmax layer
- Implement a hinge loss layer
- Implement a vectorized Affine layer
- Implement a vectorized ReLU layer