# Neural Networks - Graph view

After this notebook you should:
* know about the different equivalent formulations of a neural network
* understand the computational graph abstraction and how it relates to backprop

### Recap: Feed-forward Neural Network

A neural network can be formalized both algebraically and graphically. We can think of a feed-forward NN as a function $NN(\mathbf{x})$: $$\mathbf{y}= NN(\mathbf{x}) $$

with:

* input: $\mathbf{x}$ (vector with $d_{in}$ dimensions)
* output: $\mathbf{y}$ (vector with $d_{out}$ dimensions, one per class)

For example, a simple feedforward neural network can be visualized (and formalized) as:
<img src="pics/nn.png" width=300> 

$$NN_{MLP1}(\mathbf{x})=g(\mathbf{xW^1+b^1})\mathbf{W^2}+\mathbf{b^2}$$
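The formula above can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random initialization, and the choice of `tanh` for the non-linearity $g$ are assumptions made for the example, not something fixed by the formula.

```python
import numpy as np

def mlp1(x, W1, b1, W2, b2, g=np.tanh):
    """One-hidden-layer feed-forward network: g(x W1 + b1) W2 + b2."""
    h = g(x @ W1 + b1)   # hidden layer, shape (d_hidden,)
    return h @ W2 + b2   # output scores, shape (d_out,)

# Illustrative sizes and random parameters (assumptions for this sketch)
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 3, 2
x = rng.normal(size=d_in)
W1 = rng.normal(size=(d_in, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_out))
b2 = np.zeros(d_out)

y = mlp1(x, W1, b1, W2, b2)
print(y.shape)  # (2,)
```

Note that $\mathbf{x}$ is treated as a row vector here, so the weights are applied from the right, matching the order $\mathbf{xW^1}$ in the formula.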



### Recap: Where do the weights come from?

It's an **optimization** problem. We want to find the weights that "work best". 

<img src="pics/mountains_at_home.jpg" width=500>

### Training a Neural Network: Ingredients

* we need to **define what "works best" means**
* we need **a way to change the model (parameters)** to get closer to a good model (hint: SGD)


### Defining what works best ~ how close we are: Loss

The loss measures how far off the prediction $\mathbf{\hat{y}}$ is from the true solution $\mathbf{y}$:

$$L(\mathbf{\hat{y}},\mathbf{y})$$

For multi-class classification the **cross-entropy** is a commonly used loss function: 

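$$L_{crossentropy}(\mathbf{\hat{y}},\mathbf{y})= - \log(\hat{y}_i)$$

where $i$ is the index of the correct class (assuming $\mathbf{y}$ is a one-hot vector) and $\hat{y}_i$ is the probability the model assigns to that class.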
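As a rough sketch, the cross-entropy for a single example could be computed as follows. The `softmax` step is an assumption added here to turn raw scores into probabilities; the scores and class indices are made-up values for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    shifted = scores - np.max(scores)   # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(y_hat, gold):
    """Negative log-probability assigned to the correct class index `gold`."""
    return -np.log(y_hat[gold])

scores = np.array([2.0, 1.0, 0.1])        # hypothetical network outputs
y_hat = softmax(scores)

loss_correct = cross_entropy(y_hat, 0)    # gold class is the top-scoring one
loss_wrong = cross_entropy(y_hat, 2)      # gold class has the lowest score
print(loss_correct < loss_wrong)          # True
```

The loss is small when the model puts high probability on the correct class and grows as that probability approaches zero.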


### How to get closer to a good model


**Strategy 1:** random guessing

**Strategy 2:** start with some random initial parameters (weights), and randomly adjust them

**Strategy 3:** follow the gradient: analytical method to find the best direction along which we should change our weight vector: **gradient descent** (see first days of this class)

<img src="https://upload.wikimedia.org/wikipedia/commons/6/6d/Error_surface_of_a_linear_neuron_with_two_input_weights.png">

### To sum up: Ingredients for training a Neural Network

* we need to define what "works best" means 
    $\rightarrow$ minimize some **loss**
* we need a way to change the parameters to get closer to a good model
    $\rightarrow$ **minimize loss using a gradient-based method: gradient descent**



Intuitively, training a neural network involves the following steps:

* compute the gradient of the loss function with respect to the parameters
* move the parameters in the negative direction of the gradient to decrease the loss

#### Skeleton of gradient descent:
    
**Input**: training set, loss function $L$

Repeat for number of iterations (**epochs**): 
 
* compute loss on data: $L(X,Y)$
* compute gradients: $\mathbf{g} = \nabla_w L(X,Y)$, the gradient of the loss with respect to $w$
* move parameters in the direction of the negative gradient: $w \leftarrow w - \eta \mathbf{g}$, where $\eta$ is the learning rate
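The skeleton can be made concrete on a toy problem. The sketch below assumes a linear model with a mean squared error loss (chosen because its gradient is easy to write by hand, not because it is the loss used elsewhere in this notebook); the data, the learning rate `eta`, and the number of epochs are illustrative choices.

```python
import numpy as np

# Toy data: targets generated by a known weight vector (an assumption for this sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
Y = X @ true_w

def loss(w, X, Y):
    """Mean squared error of the linear model X @ w."""
    return np.mean((X @ w - Y) ** 2)

def gradient(w, X, Y):
    """Gradient of the mean squared error with respect to w."""
    return 2 * X.T @ (X @ w - Y) / len(Y)

w = np.zeros(2)   # initial parameters
eta = 0.1         # learning rate
losses = []
for epoch in range(200):
    losses.append(loss(w, X, Y))   # compute loss on data
    g = gradient(w, X, Y)          # compute gradient w.r.t. w
    w = w - eta * g                # step in the negative gradient direction

print(losses[0] > losses[-1])  # True
```

For a neural network the loop looks the same; only the `loss` and `gradient` functions become more involved, which is where backpropagation comes in.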

## Computational Graph

We want to get an intuitive understanding of the **backpropagation** algorithm. Backprop is a way of computing **gradients** of expressions by applying the **chain rule**. A powerful way to compute these gradients is to view the network as a **computational graph**.


#### Why computational graph?

It makes explicit how values flow through the model, from the input to the output.

### In a computational graph:
* nodes are operations
* gray boxes are parameters

The neural network from before:

$$NN_{MLP1}(\mathbf{x})=g(\mathbf{xW^1+b^1})\mathbf{W^2}+\mathbf{b^2}$$

is represented as computational graph:

<img src="pics/compgraph1.png" alt="See Goldberg (2015) primer for more details" width=200>

### Complete computational graph

The computational graph from before is not complete. It has no input and no output. A complete graph needs to define what the **input** looks like, and needs to specify a loss.

* loss: $L(\mathbf{\hat{y}},\mathbf{y})$
* input: 4 words embedded into a 50-dimensional word embedding space and *concatenated* 
<img src="pics/compgraphcomplete.png">
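To illustrate the input described above, the following sketch looks up four words in an embedding matrix and concatenates the resulting vectors. The vocabulary, the example words, and the randomly initialized embedding matrix `E` are hypothetical; only the sizes (4 words, 50 dimensions each) come from the description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and randomly initialized embedding matrix
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3, "dog": 4}
emb_dim = 50
E = rng.normal(size=(len(vocab), emb_dim))   # one 50-dim row per word

words = ["the", "cat", "sat", "down"]
# Look up each word's embedding and concatenate into one input vector
x = np.concatenate([E[vocab[w]] for w in words])
print(x.shape)  # (200,)
```

The resulting input vector therefore has $d_{in} = 4 \times 50 = 200$ dimensions.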

Why is the computational graph handy?

* It allows us to easily compute the predictions of a network from the input in a **forward pass** (follow the arrows)
* In the **backward pass** (derivatives, backprop) the computational graph helps to compute the gradients of a scalar loss with respect to the parameters, by applying the chain rule node by node in reverse order. This is what most deep learning toolkits do.
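The two passes can be sketched by hand for a small one-hidden-layer network. This is a minimal illustration under several assumptions: `tanh` as the non-linearity, a softmax followed by cross-entropy as the loss, and made-up sizes and random parameters. Toolkits automate exactly this bookkeeping; the finite-difference check at the end is only there to confirm the hand-derived gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 3, 2          # illustrative sizes
x = rng.normal(size=d_in)
gold = 1                               # index of the correct class
W1 = rng.normal(size=(d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(size=(d_hid, d_out)); b2 = np.zeros(d_out)

def forward(W1, b1, W2, b2):
    """Forward pass: follow the arrows from input to loss."""
    a = x @ W1 + b1                    # first affine node
    h = np.tanh(a)                     # non-linearity node
    s = h @ W2 + b2                    # second affine node
    p = np.exp(s - s.max())
    p /= p.sum()                       # softmax node
    loss = -np.log(p[gold])            # cross-entropy node
    return a, h, s, p, loss

a, h, s, p, loss = forward(W1, b1, W2, b2)

# Backward pass: visit the nodes in reverse order, applying the chain rule
ds = p.copy()
ds[gold] -= 1.0                        # dL/ds for softmax + cross-entropy
dW2 = np.outer(h, ds)                  # dL/dW2
db2 = ds                               # dL/db2
dh = W2 @ ds                           # dL/dh
da = dh * (1 - h ** 2)                 # through tanh: d tanh(a)/da = 1 - tanh(a)^2
dW1 = np.outer(x, da)                  # dL/dW1
db1 = da                               # dL/db1

# Finite-difference check of one entry of dW1
eps = 1e-6
W1_plus = W1.copy(); W1_plus[0, 0] += eps
W1_minus = W1.copy(); W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[-1]
           - forward(W1_minus, b1, W2, b2)[-1]) / (2 * eps)
print(np.isclose(dW1[0, 0], numeric, atol=1e-5))  # True
```

Each line of the backward pass only needs the gradient flowing in from the node after it and the values stored during the forward pass, which is why the graph view makes the computation mechanical.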

## To recap, when working with a neural network
You need to:

* specify what the model looks like (e.g., feedforward, RNN, etc.)
* define a loss function (e.g., cross-entropy, cosine)
* pick an optimization procedure (e.g., SGD, Adam)

## References

* Yoav Goldberg's primer chapter 6: [A Primer on Neural Network Models for Natural Language Processing](http://arxiv.org/abs/1510.00726)