# Neural Networks Layers Lab

Welcome to the Neural Networks Layers lab! By the end of this lab, you will have

- Implemented Affine, ReLU, and Squared Loss layers
- Built a modular implementation of a one-hidden-layer perceptron model with a squared loss

Let's get started!

---

# Neural Network Layers

A neural network *layer* is a unit of computation that knows how to

- Perform a *forward* pass to compute its output(s) from its input(s)
- Perform a *backward* pass to compute its input gradient(s) from its output gradient(s)

A neural network *layer* can be implemented in many ways. One reasonable choice is a Python class that conforms to the following interface.

In [None]:
class Layer:
    def forward(self, inputs):
        raise NotImplementedError('Forward pass not implemented!')
        
    def backward(self, dout):
        raise NotImplementedError('Backward pass not implemented!')

# Examples

Let's start off simple with a Plus layer.

<img src="images/Plus.svg" alt="Plus Layer" style="width: 400px;"/>

In [None]:
class Plus(Layer):
    def forward(self, a, b):
        c = a + b
        return c
    
    def backward(self, dc):
        # The local gradient of c = a + b with respect to both a and b is 1
        da, db = 1*dc, 1*dc
        return da, db

As can be seen, the Plus layer computes its output `c` given its inputs `a` and `b`. The Plus layer also computes its input gradients `da` and `db` given its output gradient `dc`.

Notice that `Plus.backward()` does not need to know the value of either `a` or `b` (i.e. its inputs). Such a layer is called *stateless* because it doesn't need to remember anything from its forward pass.

In contrast, some layers are *stateful*. A stateful layer requires knowledge of values that were computed during its forward pass in order to compute its backward pass. The Square layer is an example of a stateful layer.

<img src="images/Square.svg" alt="Square Layer" style="width: 400px;"/>

Notice the Square layer is stateful because $\dfrac{\partial y}{\partial x}$ is a function of $x$.
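To spell out the reasoning, here is a short worked derivation of the Square layer's backward pass. It assumes scalar $x$ and writes the code variables `dx` and `dy` for the input and output gradients.

$$
y = x^2
\quad\Longrightarrow\quad
\dfrac{\partial y}{\partial x} = 2x,
\qquad
\text{dx} = \dfrac{\partial y}{\partial x} \cdot \text{dy} = 2x \cdot \text{dy}.
$$

Because the factor $2x$ depends on the input, the backward pass can only be computed if $x$ was remembered from the forward pass.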

In [None]:
class Square(Layer):
    def forward(self, x):
        y = x**2
        # Cache every local variable (x, y, and self) for the backward pass
        self.cache = locals()
        return y
    
    def backward(self, dy):
        x = self.cache['x']
        # Chain rule: local gradient dy/dx = 2x, scaled by the output gradient
        dx = 2*x * dy
        return dx

To retain the values computed during the forward pass of the Square layer, we call Python's built-in `locals()` function right before returning. `locals()` returns a `dict` containing all of the local variables in the current scope (including `self`), which makes it a convenient way to record everything computed in the forward pass. We save it to the attribute `self.cache` so that we can retrieve it in the backward pass.
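To make the caching behaviour concrete, the following minimal sketch inspects what `locals()` captures inside a forward pass. The class name `Demo` and the variable `cache_keys` are hypothetical and exist only for this illustration.

```python
class Demo:
    def forward(self, x):
        y = x**2
        # locals() records every local name in scope here, including `self`
        self.cache = locals()
        return y

demo = Demo()
demo.forward(3)

cache_keys = sorted(demo.cache)
print(cache_keys)        # ['self', 'x', 'y']
print(demo.cache['x'])   # 3
```

Note that the cache is overwritten on every call to `forward()`, so a backward pass always sees the values from the most recent forward pass.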

Once we have layers, we can chain them together to form more interesting computational graphs. Here's an example of chaining together a `Plus` layer and a `Square` layer.

<img src="images/Pipeline.svg" alt="Pipeline" style="width: 600px;"/>

In [None]:
plus, square = Plus(), Square()

a, b = 3, 2

c = plus.forward(a, b)
y = square.forward(c)

dy = 1
dc = square.backward(dy)
da, db = plus.backward(dc)

da, db, dc, dy

As you can see, an invocation of a computational graph consists of four steps.

1. Instantiate the layers of your computational graph
2. Define the values of the inputs to the graph
3. Perform the forward pass
4. Perform the backward pass (i.e. backpropagation)
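One way to sanity-check a backward pass like the one above is to compare it against a finite-difference approximation. The sketch below does this for the Plus-then-Square pipeline, written as a plain function `f`. The helper `central_difference` and the step size `h` are assumptions made for this illustration, not part of the lab's required interface.

```python
def f(a, b):
    # The same function the Plus -> Square pipeline computes
    return (a + b) ** 2

def central_difference(func, args, index, h=1e-5):
    """Approximate the partial derivative of func with respect to args[index]."""
    hi = list(args)
    lo = list(args)
    hi[index] += h
    lo[index] -= h
    return (func(*hi) - func(*lo)) / (2 * h)

a, b = 3, 2
da_numeric = central_difference(f, (a, b), 0)
db_numeric = central_difference(f, (a, b), 1)

# Analytic values from backpropagation with dy = 1:
# dc = 2 * (a + b) * dy, and da = db = dc
da_analytic = db_analytic = 2 * (a + b) * 1

print(abs(da_numeric - da_analytic) < 1e-6)  # True
print(abs(db_numeric - db_analytic) < 1e-6)  # True
```

The same kind of check can be pointed at any layer you implement later in this lab, as long as you can express its forward computation as a function of its inputs.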

Enough examples. It's your turn to implement some layers!

---

# Affine Layer

### Tasks

- Implement an Affine layer which computes the function

$$
\text{Affine}(x, w, b) = wx + b
$$

and hence corresponds to the computational graph

<img src="images/Affine Abstraction Black Box.svg" alt="Affine Black Box" style="width: 600px;"/>

### Requirements

- Use the exact variable names as used in the computational graph
- Use the variable naming convention `d`$\cdot = \overset{\longleftarrow}{\nabla_\cdot}$. For example, $\overset{\longleftarrow}{\nabla_r}$ gets the variable name `dr`.

### Hints

- Implement the Affine layer in terms of operations which have simple local gradients (i.e. are easy to backpropagate through) as in the computational graph

<img src="images/Affine Abstraction White Box.svg" alt="Affine White Box" style="width: 600px;"/>

### Questions

- Why do you think we compute $\nabla_x$? Recall in the previous lab, we only computed $\nabla_w$ and $\nabla_b$.

## ReLU Layer

### Tasks

- Implement a Rectified Linear Unit (ReLU) layer which computes the function

$$
\text{ReLU}(a) = \begin{cases} 0 & \text{if } a < 0 \\ a & \text{otherwise} \end{cases}
$$

and hence corresponds to the computational graph

<img src="images/ReLU Layer Black Box.svg" alt="ReLU Layer Black Box" style="width: 600px;"/>

### Requirements

- Use the exact variable names as used in the computational graph
- Use the variable naming convention `d`$\cdot = \overset{\longleftarrow}{\nabla_\cdot}$. For example, $\overset{\longleftarrow}{\nabla_r}$ gets the variable name `dr`.

### Hints

- Implement the ReLU layer in terms of operations which have simple local gradients (i.e. are easy to backpropagate through) as in the computational graph

<img src="images/ReLU Layer White Box.svg" alt="ReLU Layer White Box" style="width: 600px;"/>

## Squared Loss Layer

### Tasks

- Implement a SquaredLoss layer which computes the function

$$
\text{SquaredLoss}(\hat{y}, y) = (\hat{y} - y)^2
$$

and hence corresponds to the computational graph

<img src="images/Squared Loss Black Box.svg" alt="Squared Loss Black Box" style="width: 600px;"/>

### Requirements

- Use the exact variable names as used in the computational graph
- Use the variable naming convention `d`$\cdot = \overset{\longleftarrow}{\nabla_\cdot}$. For example, $\overset{\longleftarrow}{\nabla_r}$ gets the variable name `dr`.

### Hints

- Implement the SquaredLoss layer in terms of operations which have simple local gradients (i.e. are easy to backpropagate through) as in the computational graph

<img src="images/Squared Loss White Box.svg" alt="Squared Loss White Box" style="width: 600px;"/>

# Layered One-Hidden-Layer Neural Network

Recall from the last lab that a one-hidden-layer neural network model takes the form

$$
g(x, w_1, b_1, w_2, b_2) = w_2 \max(w_1 x + b_1, 0) + b_2.
$$

Rewriting $g$ in terms of the layers we have defined yields

$$
g(x, w_1, b_1, w_2, b_2) = \text{Affine}(\text{ReLU}(\text{Affine}(x, w_1, b_1)), w_2, b_2).
$$

Applying a squared loss to $g$ yields the loss function

\begin{align*}
\mathcal{L}(x, y, w_1, b_1, w_2, b_2)
&= \text{SquaredLoss}(\text{Affine}(\text{ReLU}(\text{Affine}(x, w_1, b_1)), w_2, b_2), y)
\end{align*}

for a given $(x, y)$ training pair and parameters $(w_1, b_1, w_2, b_2)$.
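As a reference point, the composition above can be sketched with plain scalar functions. This is not the layered implementation the lab asks for: the lower-case names `affine`, `relu`, `squared_loss`, and `loss` are hypothetical helpers, and the input values are arbitrary ones chosen for illustration.

```python
def affine(x, w, b):
    # Affine(x, w, b) = wx + b
    return w * x + b

def relu(a):
    # ReLU(a) = 0 if a < 0, otherwise a
    return max(a, 0)

def squared_loss(y_hat, y):
    # SquaredLoss(y_hat, y) = (y_hat - y)^2
    return (y_hat - y) ** 2

def loss(x, y, w1, b1, w2, b2):
    hidden = relu(affine(x, w1, b1))
    y_hat = affine(hidden, w2, b2)
    return squared_loss(y_hat, y)

# Arbitrary example: affine -> 3, relu -> 3, affine -> 3, loss -> (3 - 2)^2
reference = loss(1, 2, 2, 1, 1, 0)
print(reference)  # 1
```

A function like this only produces forward values. It is useful for checking the output of a layered forward pass, but it cannot replace the layers when it comes to backpropagation.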

## Forward Pass

### Tasks

- Compute $\mathcal{L}(2, 1, -1, 1, -2, 1.5)$ as corresponding to the computational graph

<img src="images/MLP Layers Numeric Forward.svg" alt="MLP Layers Forward Numeric" style="width: 1000px;"/>

### Requirements

- Use only the layers you have defined in this lab

## Backward Pass

### Tasks

- Compute $\nabla_{w_1}$, $\nabla_{b_1}$, $\nabla_{w_2}$, and $\nabla_{b_2}$ as corresponding to the computational graph

<img src="images/MLP Layers Numeric Backward.svg" alt="MLP Layers Backward Numeric" style="width: 1000px;"/>

### Requirements

- Use the variable naming convention `d`$\cdot = \overset{\longleftarrow}{\nabla_\cdot}$. For example, $\overset{\longleftarrow}{\nabla_r}$ gets the variable name `dr`.

### Hints

- Setting $\overset{\longleftarrow}{\nabla_\ell} = 1$ will get you started

### Question

- Did you need to make a call to `locals()` to cache anything during the forward pass? Why or why not?

### Bonus Tasks

- Implement a Sigmoid layer
- Implement a Softmax layer
- Implement a hinge loss layer
- Implement a vectorized Affine layer
- Implement a vectorized ReLU layer