# Neural Networks - Graph view

After this lecture you should:
* understand why we need non-linearities
* understand the computational graph abstraction and how it relates to backprop
* know how to define a feedforward neural network in `DyNet`

### Recap: Feed-forward Neural Network

We have seen how a neural network can be formalized, both algebraically and graphically. We can think of a feed-forward NN as a function $NN(\mathbf{x})$: $$\mathbf{y}= NN(\mathbf{x}) $$

with:
input: $\mathbf{x}$ (vector with $d_{in}$ dimensions)

output: $\mathbf{y}$ (vector with $d_{out}$ dimensions, one per class)

Recall our network: <img src="pics/nn.png"> 
$$NN_{MLP1}(\mathbf{x})=g(\mathbf{xW^1+b^1})\mathbf{W^2}+\mathbf{b^2}$$



Recall that you can also see it as vector-matrix operations:

* In the first layer, the input of 4 dimensions ($x_{1x4}$) is transformed into a vector of 5 dimensions:

$$\textbf{x}_{1x4} \cdot \textbf{W}^1_{4x5} \rightarrow \textbf{v}_{1x5}$$

to which the bias vector is added, and the whole is sent through the activation function to calculate the hidden layer values:

$$\textbf{v}_{1x5} + \textbf{b}_{1x5} \rightarrow \mbox{apply } g \rightarrow \textbf{h}_{1x5}$$

and so forth until the end:


$$\mathbf{h1}=g(\mathbf{xW^1+b^1})$$
$$NN_{MLP1}(\mathbf{x})=\mathbf{h1}\mathbf{W^2}+\mathbf{b^2}$$

What sizes do $\mathbf{W^2}$ and $\mathbf{b^2}$ have?

Answer: $\mathbf{W^2}$ is 5x1 and $\mathbf{b^2}$ is 1x1.

What we just did (calculate the output from the input) is also called the **forward pass** in a neural network.


In [1]:
import numpy as np
## sigmoid activation function
f = lambda x: 1.0/(1.0 + np.exp(-x)) # activation function (here we use sigmoid)

# define input
x = np.random.randn(4) # random input vector of four numbers (1x4)
print(x)

# model parameters
W1 = np.zeros((4,5))   # Weights W1 (4x5)
W2 = np.zeros((5,1))   # Weights W2 (5x1)
b1 = np.zeros((1,5))
b2 = np.zeros((1,1))

# forward-pass of the feedforward neural network:
h1=f(np.dot(x,W1)+b1) # calculate the activations of the first hidden layer (1x5) - linear transformation followed by non-linearity!
out=f(np.dot(h1,W2)+b2) # calculate output (1x1)
print(out)
print("the network has no weights yet..")

[-0.43393611  0.5452903  -0.166863    0.69230561]
[[ 0.5]]
the network has no weights yet..


All $\mathbf{W}$ and $\mathbf{b}$ are the **parameters** (or **weights**) of the model.

### Where do the weights come from?

### Intuition
We want to adjust the weights so that the model improves. For this to work, *a small change in the weights should have a small effect on the output*.


In the case of the simplest model, a single perceptron, a small change can have a large effect. Remember that the decision function of the perceptron is a threshold, which can be seen as a **step function**: <img src="https://upload.wikimedia.org/wikipedia/commons/a/ac/HardLimitFunction.png" width=300> 

It is 0 for every input below 0 and 1 for positive inputs. If the input is already close to the threshold, a small change can flip the output from 0 to 1.
<img src="http://neuralnetworksanddeeplearning.com/images/tikz8.png">

For another reason that we will see later, we will not use simple thresholding, i.e., a **step function**, but rather a smoother function like the **sigmoid** function.
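To make the difference concrete, here is a minimal sketch comparing how the two functions react to the same tiny change around the threshold. The helper names `step` and `sigmoid` and the values `-0.01` and `0.01` are chosen for illustration only.

```python
import numpy as np

def step(z):
    # hard threshold: 0 for z <= 0, 1 for z > 0
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    # smooth alternative to the step function
    return 1.0 / (1.0 + np.exp(-z))

# two inputs that lie very close together, on either side of the threshold
z_before = np.array(-0.01)
z_after = np.array(0.01)

step_change = float(step(z_after) - step(z_before))
sigmoid_change = float(sigmoid(z_after) - sigmoid(z_before))

print("change in step output:   ", step_change)
print("change in sigmoid output:", sigmoid_change)
```

The step output jumps by a full unit, while the sigmoid output moves only slightly.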

<img src="https://what-if.xkcd.com/imgs/whatif-logo.png">

### What if all the non-linearities in an NN suddenly vanished?

<small>(CREDITS: The following slide has been taken from AJ and ZA's tutorial):</small>


For now, let's simply ignore the bias term to simplify our notation. 

A neural network with an input layer, a middle layer, and an output layer computes the following:

$$\mathbf{y} = g(W^{(2)}g(W^{(1)}g(W^{(0)}\mathbf{x})))$$

$g$ is a non-linearity, which could be different for each layer.

If we change $g$ to a linear function (e.g. a scaling factor), it can simply be multiplied into the weight matrices. Below we assume a scaling factor of 1, i.e. $g(z) = z$, which allows us to simplify the expression:

$$\mathbf{y} = (W^{(2)}(W^{(1)}(W^{(0)}\mathbf{x})))$$

Since matrix multiplication is associative:

$$A(BC) = (AB)C,$$

we can get rid of the brackets altogether:

$$\mathbf{y} = W^{(2)}W^{(1)}W^{(0)}\mathbf{x}.$$

The series of linear transformations can be summarized in a single transformation matrix:

$$T = W^{(2)}W^{(1)}W^{(0)}.$$

And so the prediction of the neural network becomes:

$$\mathbf{y} = T\mathbf{x}.$$

The effective number of parameters in the now purely linear neural network is $|\mathbf{y}| \times |\mathbf{x}|$, which is precisely the same as a standard linear model.

i.e., **the non-linearities are crucial**!
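The collapse can be checked numerically. The sketch below assumes three randomly initialised weight matrices with illustrative names (`W0`, `W1`, `W2`) and illustrative sizes (4 inputs, two layers of 5 units, 3 outputs), and ignores the bias terms as above.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(4)         # input with 4 dimensions
W0 = rng.standard_normal((5, 4))   # first layer:  4 -> 5
W1 = rng.standard_normal((5, 5))   # second layer: 5 -> 5
W2 = rng.standard_normal((3, 5))   # third layer:  5 -> 3

# "deep" network without non-linearities, applied layer by layer
y_layers = W2 @ (W1 @ (W0 @ x))

# the same computation with a single matrix T
T = W2 @ W1 @ W0
y_single = T @ x

print(np.allclose(y_layers, y_single))  # True
print(T.shape)                          # (3, 4), i.e. |y| x |x|

# with a non-linearity between the layers the network is no longer just T
y_nonlinear = W2 @ np.tanh(W1 @ np.tanh(W0 @ x))
print(np.allclose(y_nonlinear, y_single))
```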

### Commonly-used activation functions
Tanh: <img src="http://cs231n.github.io/assets/nn1/tanh.jpeg">
Sigmoid: <img src="http://cs231n.github.io/assets/nn1/sigmoid.jpeg">
ReLU: <img src="http://cs231n.github.io/assets/nn1/relu.jpeg">
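For reference, the three functions can be written in a few lines of `numpy`. This is only a sketch; the helper names `sigmoid` and `relu` are our own, and `np.tanh` is the built-in.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # keeps positive values, sets negative values to 0
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print("tanh:   ", np.tanh(z))    # range (-1, 1)
print("sigmoid:", sigmoid(z))    # range (0, 1)
print("relu:   ", relu(z))       # range [0, inf)
```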


### So, where do the weights come from?

It's an **optimization** problem. We want to find the weights that "work best". 

<img src="pics/mountains_at_home.jpg">

### Training a Neural Network: Ingredients

* we need to **define what "works best" means**
* we need **a way to change the model (parameters)** to get closer to a good model


### Defining what works best ~ how close we are: Loss

Measures how 'far off' we are from the true solution:

$$L(\mathbf{\hat{y}},\mathbf{y})$$



### How to get closer to a good model


**Strategy 1:** random guessing

**Strategy 2:** start with some random initial parameters (weights), and randomly adjust them

**Strategy 3:** follow the gradient: analytical method to find the best direction along which we should change our weight vector: **gradient descent**

<img src="https://upload.wikimedia.org/wikipedia/commons/6/6d/Error_surface_of_a_linear_neuron_with_two_input_weights.png">

### To sum up: Ingredients for training a Neural Network

* we need to define what "works best" means 
    $\rightarrow$ minimize some **loss**
* we need a way to change the parameters to get closer to a good model
    $\rightarrow$ **minimize loss using a gradient-based method: gradient descent**



Intuitively, training a neural network involves the following steps:

* compute the gradient of the loss function with respect to the parameters
* move the parameters in the negative direction of the gradient to decrease the loss

#### Skeleton of gradient descent:
    
**Input**: training set, loss function $L$

Repeat for number of iterations (**epochs**): 
 
* compute loss on data: $L(X,Y)$
* compute gradients: $\mathbf{g} = \nabla_w L(X,Y)$, the gradient of the loss with respect to $w$
* move parameters in negative direction of the gradient: $w = w - \eta \mathbf{g}$
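The skeleton can be sketched in plain `numpy`. To keep it short, the example assumes a linear model with a mean squared error loss on made-up, noise-free data; the data, the learning rate `eta = 0.1` and the number of epochs are illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy training set: targets are generated by a known weight vector
X = rng.standard_normal((50, 2))
true_w = np.array([2.0, -1.0])
Y = X @ true_w

w = np.zeros(2)   # initial parameters
eta = 0.1         # learning rate
losses = []

for epoch in range(200):
    pred = X @ w
    # compute loss on data: mean squared error
    loss = np.mean((pred - Y) ** 2)
    losses.append(loss)
    # compute gradients of the loss with respect to w
    g = 2.0 * X.T @ (pred - Y) / len(X)
    # move parameters in negative direction of the gradient
    w = w - eta * g

print("first loss:", losses[0], "last loss:", losses[-1])
print("learned w:", w)
```

The loss shrinks over the epochs and `w` moves towards the weights that generated the data.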

## Loss functions

For **multi-class** classification the **cross-entropy** is a commonly used loss function: 

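$$L_{crossentropy}(\mathbf{\hat{y}},\mathbf{y})= - \log(\hat{y}_i)$$

where $i$ is the index of the correct class and $\hat{y}_i$ is the probability the model assigns to it.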
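As a rough illustration of the formula, the loss can be computed by hand with `numpy`. The helper names `softmax` and `cross_entropy` and the example scores are assumptions made for this sketch. It takes unnormalised scores and the index of the correct class as input.

```python
import numpy as np

def softmax(scores):
    # subtract the maximum for numerical stability
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def cross_entropy(scores, gold):
    # negative log probability assigned to the correct class
    return -np.log(softmax(scores)[gold])

scores = np.array([2.0, 0.5, -1.0])       # unnormalised scores for 3 classes

loss_correct = cross_entropy(scores, 0)   # gold class has the highest score
loss_wrong = cross_entropy(scores, 2)     # gold class has the lowest score

print(loss_correct, loss_wrong)
```

The loss is small when the model puts most of its probability on the correct class and large otherwise.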

In `DyNet` this is implemented as:

In [4]:
import dynet as dy
dy.pickneglogsoftmax

<function _dynet.pickneglogsoftmax>

For **binary** classification you can use the binary log loss. In DyNet:

In [6]:
dy.binary_log_loss

<function _dynet.binary_log_loss>

## Computational Graph & DyNet

We want to get an intuitive understanding of the **backpropagation** algorithm. Backprop is a way of computing **gradients** of expressions by applying the **chain rule**. But before we get into the details of gradients, let's introduce the **computational graph abstraction**.


Imagine you have a task where you classify an input of 150 dimensions into 17 classes (let us use the illustration from before, except for assuming that you have 150 input nodes, not 4 as illustrated). And let us use a simple feedforward neural network with one hidden layer defined as follows:

$$NN_{MLP1}(\mathbf{x})=g(\mathbf{xW^1+b^1})\mathbf{W^2}+\mathbf{b^2}$$

<img src="pics/nn.png" width=300> 

## How can we represent this in DyNet?

We need to define a model and all "pieces" of the model (its parameters - which need to be learned).

In [8]:
model = dy.Model()

input_size = 150
hidden_size = 100
num_labels = 17

W1 = model.add_parameters((input_size, hidden_size)) # weights 1
b1 = model.add_parameters((hidden_size))             # bias 

W2 = model.add_parameters((hidden_size, num_labels)) # weights 2
b2 = model.add_parameters((num_labels))             # bias



So far so good. But these are just the components. We still need to define the full model by connecting the pieces:


In [13]:
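def score_input(x):
    dy.renew_cg()
    
    # everything in DyNet needs to be a DyNet expression (e.g., inputVector, parameter, etc)
    input_vec = dy.inputVector(x)
    # DyNet vectors are column vectors, so xW is computed here as transpose(W) * x
    a_1 = dy.transpose(dy.parameter(W1)) * input_vec + dy.parameter(b1)
    h = dy.tanh(a_1)
    a_2 = dy.transpose(dy.parameter(W2)) * h + dy.parameter(b2)
    return dy.softmax(a_2)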


We can represent our neural network as a **computational graph**: nodes are operations, gray boxes are parameters.

**IMPORTANT**: Explain the code snippet above to your neighbor, by relating it to the computational graph shown below. 

<img src="pics/yg-compgraph1.png">

#### Why computational graph?

It helps us to understand how values flow through the model. 

- for the forward pass (to calculate the output, e.g. softmax or sigmoid/logistic output)
- but also for the backward pass, i.e., the calculation of the gradients to update the parameters
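One way to read the graph is as a sequence of named intermediate values, one per node. The sketch below is a `numpy` stand-in for the forward pass of the 150-100-17 network, not DyNet code. The node names (`m1`, `a1`, ...) and the random initial weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, num_labels = 150, 100, 17

# input and parameters (randomly initialised for this sketch)
x = rng.standard_normal(input_size)
W1 = rng.standard_normal((input_size, hidden_size)) * 0.01
b1 = np.zeros(hidden_size)
W2 = rng.standard_normal((hidden_size, num_labels)) * 0.01
b2 = np.zeros(num_labels)

# forward pass: every line corresponds to one operation node in the graph
m1 = x @ W1                  # matrix multiplication
a1 = m1 + b1                 # addition of the bias
h = np.tanh(a1)              # non-linearity
m2 = h @ W2                  # matrix multiplication
a2 = m2 + b2                 # addition of the bias
e = np.exp(a2 - a2.max())    # softmax, numerator (shifted for stability)
probs = e / e.sum()          # softmax, normalisation

print(probs.shape, probs.sum())
```

The backward pass visits the same nodes in reverse order, which is why keeping the intermediate values around is useful.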


#### Backward step

What do we need to compute? the gradient of the loss function with respect to the parameters.

**What's a gradient?**: A vector of partial derivatives. 

The following slides will illustrate this. In DyNet we have automatic gradient computation, thus we get it for free!

The following piece of code does the gradient computation and the update of the parameters for us. 

In [None]:
## assuming we have defined the loss and a specific trainer
loss.backward()
trainer.update()

### Exercise time!

A neural network is a big network of many 'neurons'. A single neuron could be a perceptron, *or*, a model that you already saw yesterday, a logistic regression node! (with sigmoid/logistic output)

Now take the starter code in `exercise2.tar.gz` and answer the questions. It implements a logistic regression classifier (for binary classification) in `DyNet`.


### More details (optional)

Recall: the **derivative** 

A derivative gives us a linear approximation of the function at a specific point. Intuitively, the derivative indicates the rate of change of a function $f$ with respect to a variable $x$ in a very small region (of width $h$) around a point.
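This intuition can be turned into a small numerical approximation. The sketch below uses a central difference with a small step `h`. The helper name `numerical_derivative` and the example function `square` are our own choices.

```python
def numerical_derivative(f, x, h=1e-5):
    # rate of change of f in a small region of width 2h around x
    return (f(x + h) - f(x - h)) / (2 * h)

def square(x):
    return x ** 2

# the analytical derivative of x^2 is 2x, so we expect a value close to 6 at x = 3
approx = numerical_derivative(square, 3.0)
print(approx)
```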

    
<img src="http://www.intuitive-calculus.com/images/what-is-a-derivative-4.gif">

### Gradient

We are interested in finding the gradient, i.e., all partial derivatives, since our functions are not just functions of single parameters, but of a lot of parameters. 


### Example: gradient

Let's take a simple example of a function: $$f(x, y) = (x * y)$$ (or simply): $$f(x, y)=xy$$

We want to calculate the gradient, the vector of partial derivatives (how much does the function change wrt the parameters x and y): $$\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}]$$


The partial derivatives of this function are:

$$f(x,y) = x y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = y \hspace{0.5in} \frac{\partial f}{\partial y} = x$$

In [2]:
## f(x, y) = x * y
# let's take some numbers 
x = 4
y = -3

In [3]:
## forward pass (function application)
f = x * y
print(f)

-12


In [4]:
## the derivative of each variable tells us the sensitivity of the whole expression on its value. 
## for instance, take the partial derivative of f wrt y:
df_dy = x  # it's simply x
print(df_dy) # this means if we increase y by a tiny amount h, the whole function would increase by 4 * h.

4


In [5]:
## similarly, the partial derivative of f w.r.t. x is:
df_dx = y
print(df_dx)  # increasing x by some small amount would make the whole expression decrease (negative sign)

-3


**the derivative on each variable tells you the sensitivity of the whole expression on its value**
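A quick sanity check, sketched under the assumption that a small finite step `h = 1e-6` is accurate enough: nudge one variable at a time and compare the observed change with the analytical partial derivatives. The `_approx` names are illustrative.

```python
def f(x, y):
    return x * y

x, y = 4.0, -3.0
h = 1e-6   # tiny nudge

# nudge x while keeping y fixed, and vice versa
df_dx_approx = (f(x + h, y) - f(x, y)) / h
df_dy_approx = (f(x, y + h) - f(x, y)) / h

print(df_dx_approx)   # close to y
print(df_dy_approx)   # close to x
```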

### Example 2: computational graph - compound expression

(Thanks to lecture notes by Fei-Fei, Karpathy and Johnson, cf: http://cs231n.github.io/optimization-2/)

Let's take the function: $$f(x, y, z) = (x + y) z$$ 

Let's represent it as a computational graph (green: forward pass values):
<img src="pics/k1.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k2.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k3.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k4.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k5.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k6.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k7.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k8.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k9.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k10.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k11.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/k13.png"> Slide by Fei-Fei, Karpathy and Johnson

<img src="pics/graph.png"> Slide by Fei-Fei, Karpathy and Johnson

#### Using the chain rule

* we can compute the gradients of the loss along the backward path in our computational graph
* once we know the gradients: we know how much we should change our parameters (in negative direction of gradients, as we want to minimize the loss): $w \leftarrow w - \eta \mathbf{g}$
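As a worked sketch of the chain rule, here is the backward pass for the example from the cs231n notes, $f(x, y, z) = (x + y) z$, with the values $x=-2$, $y=5$, $z=-4$ used there. The intermediate name `q` follows those notes; the `df_d...` names are our own.

```python
# inputs
x, y, z = -2.0, 5.0, -4.0

# forward pass: compute and remember the intermediate values
q = x + y          # q = 3
f = q * z          # f = -12

# backward pass: apply the chain rule from the output back to the inputs
df_df = 1.0                # gradient of the output with respect to itself
df_dq = z * df_df          # multiplication node: d(q*z)/dq = z
df_dz = q * df_df          # multiplication node: d(q*z)/dz = q
df_dx = 1.0 * df_dq        # addition node: dq/dx = 1
df_dy = 1.0 * df_dq        # addition node: dq/dy = 1

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```

Each node only needs its local derivative and the gradient arriving from above, which is what makes the procedure easy to automate.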





###### Gradient descent

Repeat for number of iterations (**epochs**): 
* compute loss on data: $L(X,Y)$
* compute gradients: $\mathbf{g} = \nabla_w L(X,Y)$, the gradient of the loss with respect to $w$
* move parameters in opposite direction of gradient: $w \leftarrow w - \eta \mathbf{g}$


#### In practice:

* **stochastic gradient descent** (online learning)
* **mini-batches** (use a small subset of training instances) (minibatch size)
* **further hyperparameters**: learning rate $\eta$ (how big a step we take), number of epochs (how often we go over training data)
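These pieces can be put together in a minimal mini-batch sketch. As a simplifying assumption it again uses a linear model with squared error on made-up, noise-free data; the values of `batch_size`, `eta` and `num_epochs` are illustrative hyperparameter choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy training set generated by a known weight vector
X = rng.standard_normal((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w

w = np.zeros(3)
eta = 0.05          # learning rate: how big a step we take
batch_size = 20     # mini-batch size
num_epochs = 30     # how often we go over the training data

for epoch in range(num_epochs):
    # shuffle the training instances at the start of every epoch
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], Y[idx]
        # gradient of the mean squared error on this mini-batch only
        g = 2.0 * xb.T @ (xb @ w - yb) / len(idx)
        w = w - eta * g

final_loss = np.mean((X @ w - Y) ** 2)
print("final loss:", final_loss)
print("learned w:", w)
```

Setting `batch_size = 1` gives stochastic gradient descent in the online sense, and `batch_size = len(X)` gives back full-batch gradient descent.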

## References

* Yoav Goldberg's primer chapter 6: [A Primer on Neural Network Models for Natural Language Processing](http://arxiv.org/abs/1510.00726)
* Fei-Fei, Karpathy and Johnson's lecture notes: http://cs231n.github.io/optimization-2/