# 4 - Introduction to Neural Networks
Given a function $f(x)$ where $x$ is a vector of inputs, what is the gradient of $f$ at $x$?

In other words, compute $\nabla_x f$.

In neural networks, $f$ is the loss function $L$, while $x$ consists of the training data (images) and the weights.
***

# Computational graph

![image.png](https://raw.githubusercontent.com/alstjgg/alstjgg.github.io/master/CS231n/04.computational_graph.PNG)

A **computational graph** can represent any mathematical function visually.

Each **node** represents one step of the computation, such as a multiplication or an addition. Here, $x$, $y$, and $z$ are the **inputs** of the function, $q$ is an intermediate node, and $f$ is the final result.

### Advantage
- **Backpropagation** is possible, allowing fast and easy computation of the gradient
- Useful for complex mathematical functions used in models such as convolutional networks

# Back propagation
**Forward Propagation** computes the result of the function and any intermediate values.

**Back Propagation** is a method of computing gradients of expressions through recursive application of the **chain rule**.

The objective is to compute the gradient of $f$ with respect to each input variable.

### For a Single node
Every node of a computational graph is aware only of its immediate surroundings: its **local input values** and its **local output value**. This is enough to compute the **local gradients** easily. During the backward pass the node also receives the **upstream gradient** from the node that consumed its output.

![image.png](https://raw.githubusercontent.com/alstjgg/alstjgg.github.io/master/CS231n/04.back_propagation.PNG)

- input values : $x$, $y$
- function : $f$
- output value : $z$
- local gradients : $\frac{\partial z}{\partial x}$, $\frac{\partial z}{\partial y}$
- upstream gradient : $\frac{\partial L}{\partial z}$
    - This has already been computed in an earlier step of the backward pass
- objective: compute the gradient of $L$ with respect to every input variable

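By the chain rule, the gradient passed to each input is the upstream gradient multiplied by the local gradient: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial x}$ and $\frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial y}$. Repeating this from the output back to the inputs gives the gradient with respect to every intermediate value.

Assigning actual values to each variable (node) of the graph shows that the computational graph and backpropagation make the gradient of $f$ much easier to compute than deriving one large expression by hand.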
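
The backward pass described above can be traced by hand on a small graph. The sketch below assumes the graph computes $f = (x + y)z$ with $q = x + y$ as the intermediate node; the function and the input values are illustrative assumptions, not taken from the figure.

```python
# Assumed example function: f = (x + y) * z, with intermediate q = x + y.
# The input values are arbitrary and chosen only for illustration.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute and keep every intermediate value.
q = x + y            # 3.0
f = q * z            # -12.0

# Backward pass: start at the output and apply the chain rule node by node.
df_df = 1.0          # gradient of f with respect to itself
df_dq = z * df_df    # mul gate: local gradient df/dq = z
df_dz = q * df_df    # mul gate: local gradient df/dz = q
df_dx = 1.0 * df_dq  # add gate: local gradient dq/dx = 1
df_dy = 1.0 * df_dq  # add gate: local gradient dq/dy = 1

print(f, df_dx, df_dy, df_dz)  # -12.0 -4.0 -4.0 3.0
```

Each line of the backward pass uses only a local gradient and the gradient that arrived from the node above it.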

#### Some observations
- Although nodes can be the simplest mathematical functions (addition, multiplication, etc.), **functions can be grouped together** to form more complex nodes.

    ![image.png](https://raw.githubusercontent.com/alstjgg/alstjgg.github.io/master/CS231n/04.sigmoid_function.PNG)

    For example, we can create a sigmoid gate by grouping.
    
- **add** gate : gradient distributor (the upstream gradient is passed unchanged to every branch)
- **max** gate : gradient router (the upstream gradient is passed to the branch with the larger value, while the other branch receives a gradient of 0)
- **mul** gate : gradient switcher/scaler (the upstream gradient passed to each branch is scaled by the value of the other branch)
- When a value feeds several branches, the gradients flowing back from those branches are added together, so a change in one node can affect the whole graph.
- Each gradient represents how sensitive the function is to that particular element.
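
The gate patterns above can be written as small backward functions. This is a minimal sketch; the function names are hypothetical, and the grouped sigmoid gate uses the standard identity $\frac{d\sigma}{dx} = (1 - \sigma(x))\sigma(x)$.

```python
import math

def add_backward(upstream):
    # add gate: every input receives the upstream gradient unchanged
    return upstream, upstream

def max_backward(x, y, upstream):
    # max gate: only the larger input receives the gradient
    # (ties are routed to x in this sketch)
    return (upstream, 0.0) if x >= y else (0.0, upstream)

def mul_backward(x, y, upstream):
    # mul gate: each input's gradient is scaled by the other input's value
    return y * upstream, x * upstream

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_backward(x, upstream):
    # grouped gate: one local gradient replaces the chain of
    # negate, exp, add-one, and reciprocal nodes
    s = sigmoid(x)
    return (1.0 - s) * s * upstream

print(add_backward(2.0))                     # (2.0, 2.0)
print(max_backward(4.0, -1.0, 2.0))          # (2.0, 0.0)
print(mul_backward(3.0, -4.0, 2.0))          # (-8.0, 6.0)
print(round(sigmoid_backward(1.0, 1.0), 2))  # 0.2
```

Grouping is a trade-off: a larger gate needs a hand-derived local gradient, but the graph gets shorter.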

### For vectorized code
- In **vectorized code**, the variables are vectors rather than scalars
- The local gradient becomes the **Jacobian matrix**, whose entry $(i, j)$ is the partial derivative of the $i$-th output element with respect to the $j$-th input element

#### Some observations
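- The gradient of $L$ with respect to a variable has the same shape as that variable. The Jacobian matrix itself has one row per output element and one column per input element.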
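
To make the shapes concrete, here is a minimal sketch in plain Python lists. It assumes an elementwise $\max(0, x)$ as the node's function, which is an illustrative choice and not something these notes specify. All function names are hypothetical.

```python
def relu_jacobian(x):
    # Entry (i, j) is d(out_i)/d(x_j). For an elementwise function the
    # Jacobian is diagonal: out_i depends only on x_i.
    n = len(x)
    return [[1.0 if (i == j and x[i] > 0) else 0.0 for j in range(n)]
            for i in range(n)]

def backward_with_jacobian(jacobian, upstream):
    # Chain rule: dL/dx_j = sum_i dL/dout_i * d(out_i)/d(x_j)
    n_out, n_in = len(jacobian), len(jacobian[0])
    return [sum(upstream[i] * jacobian[i][j] for i in range(n_out))
            for j in range(n_in)]

def relu_backward(x, upstream):
    # Shortcut that never builds the Jacobian: pass the upstream gradient
    # where the input was positive, and 0 elsewhere.
    return [u if xi > 0 else 0.0 for xi, u in zip(x, upstream)]

x = [1.0, -2.0, 3.0]
upstream = [0.5, 0.7, -1.0]       # dL/dout, same shape as the output

J = relu_jacobian(x)              # 3 x 3: outputs by inputs
full = backward_with_jacobian(J, upstream)
short = relu_backward(x, upstream)
print(full, short)                # [0.5, 0.0, -1.0] [0.5, 0.0, -1.0]
```

Both results have the same shape as `x`. Because the Jacobian of an elementwise function is diagonal, the shortcut gives the same answer without storing the full matrix.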

# Modularized implementation
The computation process can be seen as a forward and backward API.

The **forward** API computes the values of each node, while the **backward** API computes the gradients of each node.

```python
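class ComputationalGraph(object):
    # Pseudocode: `self.graph`, `loss`, and `inputs_gradients` stand in for
    # details that this sketch leaves out.
    def forward(self, inputs):
        # 1. pass `inputs` to the input nodes (omitted)
        # 2. run each node after all of the nodes it depends on
        for node in self.graph.nodes_topologically_sorted():
            node.forward()
        return loss    # the final node outputs the loss

    def backward(self):
        # visit nodes in reverse order so each upstream gradient is ready
        for node in reversed(self.graph.nodes_topologically_sorted()):
            node.backward()    # chain rule at a single node
        return inputs_gradients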
```
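
The pseudocode above leaves the graph structure open. One way to make it runnable is sketched below, under stated assumptions: `Value`, `AddNode`, `MulNode`, and `TinyGraph` are hypothetical names, the node list is assumed to be given in topological order already, and the example graph is the assumed function $f = (x + y)z$.

```python
class Value:
    """Hypothetical holder for one value in the graph and its gradient."""
    def __init__(self):
        self.data = 0.0
        self.grad = 0.0

class AddNode:
    def __init__(self, a, b, out):
        self.a, self.b, self.out = a, b, out
    def forward(self):
        self.out.data = self.a.data + self.b.data
    def backward(self):
        # += so gradients from several branches accumulate
        self.a.grad += self.out.grad
        self.b.grad += self.out.grad

class MulNode:
    def __init__(self, a, b, out):
        self.a, self.b, self.out = a, b, out
    def forward(self):
        self.out.data = self.a.data * self.b.data
    def backward(self):
        self.a.grad += self.b.data * self.out.grad
        self.b.grad += self.a.data * self.out.grad

class TinyGraph:
    def __init__(self, nodes, inputs, output):
        self.nodes = nodes      # assumed to be in topological order
        self.inputs = inputs
        self.output = output

    def forward(self, values):
        for holder, value in zip(self.inputs, values):
            holder.data = value
        for node in self.nodes:
            node.forward()
        return self.output.data

    def backward(self):
        for node in self.nodes:             # clear gradients from earlier calls
            for v in (node.a, node.b, node.out):
                v.grad = 0.0
        self.output.grad = 1.0              # d(output)/d(output)
        for node in reversed(self.nodes):
            node.backward()
        return [holder.grad for holder in self.inputs]

x, y, z, q, f = (Value() for _ in range(5))
graph = TinyGraph([AddNode(x, y, q), MulNode(q, z, f)], inputs=[x, y, z], output=f)
loss = graph.forward([-2.0, 5.0, -4.0])
grads = graph.backward()
print(loss, grads)  # -12.0 [-4.0, -4.0, 3.0]
```

Storing values and gradients in shared `Value` objects is only one possible design. It keeps each node's `forward()` and `backward()` free of arguments, matching the calls in the pseudocode.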

### Computation node implementation

```python
class MultiplyNode(object):
    def forward(self, x, y):
        z = x * y
        # cache the inputs; backward() needs them for the local gradients
        self.x = x
        self.y = y
        return z

    def backward(self, dz):    # upstream gradient dz = dL/dz
        dx = self.y * dz    # dL/dx = dz/dx * dL/dz, local gradient dz/dx = y
        dy = self.x * dz    # dL/dy = dz/dy * dL/dz, local gradient dz/dy = x
        return [dx, dy]
```
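
A short usage sketch for the node above follows. The input values and the finite-difference comparison are illustrative additions, meant only as a sanity check that `backward` returns the upstream gradient scaled by the local gradient.

```python
class MultiplyNode(object):
    def forward(self, x, y):
        # cache the inputs for the backward pass
        self.x = x
        self.y = y
        return x * y

    def backward(self, dz):
        # upstream gradient times local gradient for each input
        return [self.y * dz, self.x * dz]

node = MultiplyNode()
z = node.forward(3.0, -4.0)   # -12.0
dx, dy = node.backward(2.0)   # upstream gradient dL/dz = 2.0
print(z, dx, dy)              # -12.0 -8.0 6.0

# Sanity check: estimate dz/dx and dz/dy with centered differences,
# then scale by the same upstream gradient.
h = 1e-5
num_dx = (MultiplyNode().forward(3.0 + h, -4.0)
          - MultiplyNode().forward(3.0 - h, -4.0)) / (2 * h) * 2.0
num_dy = (MultiplyNode().forward(3.0, -4.0 + h)
          - MultiplyNode().forward(3.0, -4.0 - h)) / (2 * h) * 2.0
print(abs(num_dx - dx) < 1e-6, abs(num_dy - dy) < 1e-6)  # True True
```

Note that `backward` must be called after `forward`, since it relies on the cached inputs.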