# Neural Networks in `DyNet`


After this lecture you will:

* know how to implement a neural network in DyNet
* understand the pros and cons of static vs. dynamic neural network libraries


Language is discrete and structured. Structure is ubiquitous in Natural Language Processing.

- Sequences, trees, graphs

<img src="pics/motivation.png">

Neural networks, in contrast, represent everything as continuous vectors (we will see this in the next notebook).

- Poor “native support” for structure

## Two software modes

### Static declaration

1. Define an architecture

2. Run a bunch of data through it to train the model and/or make predictions

### Dynamic declaration

The graph is defined implicitly as the forward computation is executed, so each input can produce a different graph.
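A toy illustration of dynamic declaration in plain Python (this is not DyNet; the `Node` class and the `encode` and `graph_size` functions are hypothetical names invented for this sketch). Every addition records a node as it runs, so the graph for an input is simply whatever the forward code happened to execute:

```python
class Node:
    """A value plus a record of the nodes it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents

    def __add__(self, other):
        # Executing the addition is what creates the graph node.
        return Node(self.value + other.value, parents=(self, other))


def graph_size(node):
    """Count the distinct nodes reachable from `node`."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if id(current) not in seen:
            seen.add(id(current))
            stack.extend(current.parents)
    return len(seen)


def encode(sequence):
    """Sum a sequence of any length; the loop length shapes the graph."""
    total = Node(0.0)
    for x in sequence:
        total = total + Node(float(x))
    return total


short = encode([1, 2])
longer = encode([1, 2, 3, 4, 5])
print(short.value, graph_size(short))
print(longer.value, graph_size(longer))
```

The two calls run the same code but end up with graphs of different sizes, because the loop ran a different number of times.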



### Static declaration

#### Pros

- The graph is known in advance, so it can be optimized offline, which is powerful
- Limits on operations mean better hardware support

#### Cons

- Structured data (even simple structures like sequences), and even plain variable-sized data, is ugly to handle

Examples: TensorFlow, Theano, Torch

### Dynamic declaration

#### Pros

- Easier, more flexible modeling of structured problems
- Easier to work with variable-sized data (less ugly)

#### Cons

- If the graph is the same for every example, rebuilding it each time is wasted effort
- Less room for offline optimization

Examples: DyNet, PyTorch, Chainer
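To see why structured inputs are comfortable with dynamic declaration, here is a small plain-Python sketch (not DyNet) in which a tree-shaped model is just a recursive function. The `EMBED` table, `combine`, and `encode_tree` are hypothetical names, and the vectors are made-up numbers rather than learned embeddings:

```python
import math

# Hypothetical 2-dimensional "embeddings" for a tiny vocabulary.
EMBED = {
    "the": [0.1, 0.3],
    "cat": [0.5, -0.2],
    "sleeps": [-0.4, 0.6],
}


def combine(left, right):
    """Merge two child vectors; a real model would use learned weights here."""
    return [math.tanh(l + r) for l, r in zip(left, right)]


def encode_tree(tree):
    """Recursively encode a binary tree given as nested tuples of words."""
    if isinstance(tree, str):
        # Leaf: look up the word vector.
        return EMBED[tree]
    # Internal node: encode both children, then merge them.
    left, right = tree
    return combine(encode_tree(left), encode_tree(right))


# Two inputs with different shapes need no padding or special casing.
flat = encode_tree(("cat", "sleeps"))
nested = encode_tree((("the", "cat"), "sleeps"))
print(flat)
print(nested)
```

The computation follows the shape of each input, which is the kind of modeling that a fixed, predeclared graph makes awkward.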

`DyNet` offers flexibility in modeling your problem with a dynamic neural network library.

As Chris, Yoav and Graham put it: "*Things that are cumbersome / hard / ugly in other
frameworks*" [and that's where DyNet shines...] ;)

We will use the Python wrapper of `DyNet`. DyNet itself is implemented as a C++ library built on Eigen (which TensorFlow also uses).


## Neural Networks in `DyNet`

### The major ingredients that we need:

- the computational graph abstraction

- expressions

- parameters

- a model (i.e. a collection of parameters)

- a trainer




In [7]:
import dynet as dy
dy.renew_cg() # create a new computation graph
v1 = dy.inputVector([1,2,3,4])
v2 = dy.inputVector([5,6,7,8])
# v1 and v2 are expressions
v3 = v1 + v2
v4 = v3 * 2
v5 = v1 + 1
v6 = dy.concatenate([v1,v2,v3,v5])
print(v6) # is a DyNet expression
print(v6.npvalue()) # access its value 

expression 5/1
[  1.   2.   3.   4.   5.   6.   7.   8.   6.   8.  10.  12.   2.   3.   4.
   5.]


In [10]:
print(v6.value()) #  alternative way to access its value (simple list, not numpy array)

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 6.0, 8.0, 10.0, 12.0, 2.0, 3.0, 4.0, 5.0]


You create basic expressions, and combine them using *operations*. 

Expressions represent **symbolic computations**: creating one adds a node to the graph, but nothing is computed yet.

To perform the actual computation, call one of:

```
.value()
.npvalue()
.forward()
```
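The idea of "symbolic until asked" can be sketched in a few lines of plain Python. This is a toy model of the behaviour, not how DyNet is implemented; the `Expr` class, its `evaluations` counter, and `input_vector` are hypothetical names:

```python
class Expr:
    """A symbolic expression: stores how to compute a value, not the value."""

    evaluations = 0  # how many times any expression was actually computed

    def __init__(self, compute):
        self._compute = compute

    def __add__(self, other):
        # Build a new expression; the element-wise sum is deferred.
        return Expr(lambda: [a + b for a, b in zip(self.value(), other.value())])

    def value(self):
        Expr.evaluations += 1
        return self._compute()


def input_vector(values):
    """Wrap a list of numbers as a leaf expression."""
    return Expr(lambda: list(values))


v1 = input_vector([1, 2, 3, 4])
v2 = input_vector([5, 6, 7, 8])
v3 = v1 + v2                 # builds a node; nothing is computed yet
before = Expr.evaluations    # number of evaluations so far
result = v3.value()          # the computation happens here
print(before, result)
```

Building `v3` leaves the counter untouched; only the call to `value()` walks the expression and does the arithmetic.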

### Models and Parameters

- **Parameters** are the things that we optimize over (vectors, matrices).
- A **model** is a collection of parameters.
- Parameters **outlive** the computation graph: `dy.renew_cg()` starts a fresh graph, but the parameters stay in the model.

In [11]:
model = dy.Model()
pW = model.add_parameters((20,4))
pb = model.add_parameters(20)
dy.renew_cg()
x = dy.inputVector([1,2,3,4])
W = dy.parameter(pW) # convert params to expression
b = dy.parameter(pb) # and add to the graph
y = W * x + b 

In [15]:
# different initializations

model = dy.Model()
pW = model.add_parameters((4,4)) # default initializer
pW2 = model.add_parameters((4,4), init=dy.GlorotInitializer())
pW3 = model.add_parameters((4,4), init=dy.NormalInitializer(0,1)) # mean 0, variance 1


## Trainers and Backprop

- Initialize a Trainer with a given model.
- Compute gradients by calling `expr.backward()`
from a scalar node.
- Call `trainer.update()` to update the model
parameters using the gradients.
- There are many different training algorithms available (check the docs)

In [17]:
model = dy.Model()
trainer = dy.SimpleSGDTrainer(model)
p_v = model.add_parameters(10)
for i in range(10):
    dy.renew_cg()
    v = dy.parameter(p_v)
    v2 = dy.dot_product(v,v)
    v2.forward()
    v2.backward() # compute gradients
    trainer.update()
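To make explicit what one forward/backward/update cycle does for this loss, the same loop can be written by hand in plain Python. The loss is $v \cdot v$, its gradient is $2v$, and plain SGD subtracts the gradient scaled by a learning rate. The starting vector and the `LEARNING_RATE` value below are assumptions chosen for this sketch, not values taken from DyNet:

```python
LEARNING_RATE = 0.1  # assumed step size for this sketch

v = [1.0, -2.0, 0.5]  # assumed starting point (DyNet would initialize randomly)
history = []
for step in range(10):
    loss = sum(x * x for x in v)                          # forward: v . v
    grad = [2.0 * x for x in v]                           # backward: d(v . v)/dv = 2v
    v = [x - LEARNING_RATE * g for x, g in zip(v, grad)]  # SGD update
    history.append(loss)

print(history[0], history[-1])
print(v)
```

With these settings every update multiplies `v` by `1 - 2 * LEARNING_RATE`, so the loss shrinks at each step and `v` moves toward the zero vector, the minimum of $v \cdot v$.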


### Training with `DyNet`

- Create model, add parameters, create trainer.
- For each training example:
    - create computation graph for the loss
    - run forward (compute the loss)
    - run backward (compute the gradients)
    - update parameters

## Example: MLP for NAND

Model: $$\hat{y} = \sigma( V \tanh( W x + b ) )$$

Data:

```
x1 x2  y
0 0  1
0 1  1
1 0  1
1 1  0
```
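To make the shapes concrete, here is a plain-Python forward pass of a one-hidden-layer network of the form $\sigma(V \tanh(W x + b))$, the form the training code in this section uses. The weights are hypothetical, hand-picked values for a single hidden unit (not the result of training), chosen so that the output reproduces the NAND table:

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical hand-picked weights for one hidden unit (not trained values).
W = [[10.0, 10.0]]  # shape: hidden x input
b = [-15.0]         # shape: hidden
V = [[-10.0]]       # shape: output x hidden


def predict(x):
    """Compute sigmoid(V * tanh(W * x + b)) for a 2-element input."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
         for row, bi in zip(W, b)]
    return sigmoid(sum(v * hi for v, hi in zip(V[0], h)))


nand_table = [([0, 0], 1), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
for x, y in nand_table:
    print(x, y, round(predict(x), 3))
```

The hidden unit only becomes positive when both inputs are 1, and the negative output weight turns that into a prediction close to 0.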

Loss (binary log loss): 


if $y=1$ then loss = $-\log \hat{y}$

if $y=0$ then loss = $-\log (1-\hat{y})$
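The two cases can be folded into the single expression $-(y \log \hat{y} + (1-y) \log(1-\hat{y}))$. A quick plain-Python check of the arithmetic (the function name `log_loss` is a hypothetical helper for this sketch, not a DyNet call):

```python
import math


def log_loss(yhat, y):
    """Return -log(yhat) when y == 1 and -log(1 - yhat) when y == 0."""
    return -(y * math.log(yhat) + (1 - y) * math.log(1 - yhat))


confident_right = log_loss(0.9, 1)  # predicted 0.9, gold label 1
confident_wrong = log_loss(0.9, 0)  # predicted 0.9, gold label 0
print(confident_right, confident_wrong)
```

A confident correct prediction costs roughly 0.105, while the same confidence on the wrong label costs roughly 2.303, so the loss punishes confident mistakes heavily.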


In [6]:
import dynet as dy
import random
data =[ ([0,1],1),
        ([1,0],1),
        ([0,0],1),
        ([1,1],0) ]
model = dy.Model()

HIDDEN_SIZE = 8
INPUT_SIZE = 2
pU = model.add_parameters((HIDDEN_SIZE,INPUT_SIZE))
pb = model.add_parameters(HIDDEN_SIZE)
pv = model.add_parameters((1, HIDDEN_SIZE))

trainer = dy.SimpleSGDTrainer(model)

for ITER in range(1000):
    closs = 0.0
    random.shuffle(data)
    for x,y in data:
        dy.renew_cg()
        W = dy.parameter(pU)
        b = dy.parameter(pb)
        V = dy.parameter(pv)
        x = dy.inputVector(x)
        # predict
        h = dy.tanh((W*x) + b)
        yhat = dy.logistic((V*h))
        # loss
        if y == 0:
            loss = -dy.log(1 - yhat)
        elif y == 1:
            loss = -dy.log(yhat)

        closs += loss.scalar_value() # forward
        loss.backward()
        trainer.update()
        
    if ITER > 0 and ITER % 100 == 0:
        print("Iter:",ITER,"loss:", closs/4)

Iter: 100 loss: 0.03673703083768487
Iter: 200 loss: 0.012332652062468696
Iter: 300 loss: 0.0069259492611308815
Iter: 400 loss: 0.004703131926362403
Iter: 500 loss: 0.003518811388858012
Iter: 600 loss: 0.0027918846099055372
Iter: 700 loss: 0.002303714075424068
Iter: 800 loss: 0.0019548175596355577
Iter: 900 loss: 0.001693963927209552


Let us organize the code a bit.

In [10]:
import dynet as dy
import random
data =[ ([0,1],1),
        ([1,0],1),
        ([0,0],1),
        ([1,1],0) ]
model = dy.Model()

HIDDEN_SIZE = 8
INPUT_SIZE = 2
pU = model.add_parameters((HIDDEN_SIZE,INPUT_SIZE))
pb = model.add_parameters(HIDDEN_SIZE)
pv = model.add_parameters((1, HIDDEN_SIZE))

trainer = dy.SimpleSGDTrainer(model)

def calc_score(x):
    W = dy.parameter(pU)
    b = dy.parameter(pb)
    V = dy.parameter(pv)
    x = dy.inputVector(x)
    # predict
    h = dy.tanh((W*x) + b)
    return dy.logistic((V*h))

def my_loss_function(yhat, y):
    if y == 0:
        loss = -dy.log(1 - yhat)
    elif y == 1:
        loss = -dy.log(yhat)
    return loss

for ITER in range(1000):
    closs = 0.0
    random.shuffle(data)
    for x,y in data:
        dy.renew_cg()
        # predict
        yhat = calc_score(x)
        # loss
        loss = my_loss_function(yhat, y)

        closs += loss.scalar_value() # record loss
        loss.backward() 
        trainer.update()
        
    if ITER > 0 and ITER % 100 == 0:
        print("Iter:",ITER,"loss:", closs/4)

Iter: 100 loss: 0.04411190759856254
Iter: 200 loss: 0.01389462522638496
Iter: 300 loss: 0.0076934145981795155
Iter: 400 loss: 0.005188286460906966
Iter: 500 loss: 0.0038655066655337578
Iter: 600 loss: 0.0030569497484975727
Iter: 700 loss: 0.0025156566480291076
Iter: 800 loss: 0.0021298859664966585
Iter: 900 loss: 0.0018419274238112848


Using a built-in loss function:

In [11]:
import dynet as dy
import random
data =[ ([0,1],1),
        ([1,0],1),
        ([0,0],1),
        ([1,1],0) ]
model = dy.Model()
HIDDEN_SIZE = 8
INPUT_SIZE = 2
pU = model.add_parameters((HIDDEN_SIZE,INPUT_SIZE))
pb = model.add_parameters(HIDDEN_SIZE)
pv = model.add_parameters((1, HIDDEN_SIZE))
trainer = dy.SimpleSGDTrainer(model)

def calc_score(x):
    W = dy.parameter(pU)
    b = dy.parameter(pb)
    V = dy.parameter(pv)
    x = dy.inputVector(x)
    # predict
    h = dy.tanh((W*x) + b)
    return dy.logistic((V*h))

for ITER in range(1000):
    closs = 0.0
    random.shuffle(data)
    for x,y in data:
        dy.renew_cg()
        # predict
        yhat = calc_score(x)
        # loss
        loss = dy.binary_log_loss(yhat, dy.scalarInput(y))

        closs += loss.scalar_value() # record loss
        loss.backward() 
        trainer.update()
        
    if ITER > 0 and ITER % 100 == 0:
        print("Iter:",ITER,"loss:", closs/4)

Iter: 100 loss: 0.035967539937701076
Iter: 200 loss: 0.012440204518497922
Iter: 300 loss: 0.0070557887866016245
Iter: 400 loss: 0.004808427196621778
Iter: 500 loss: 0.0036034677605130128
Iter: 600 loss: 0.00286049308397196
Iter: 700 loss: 0.002360385213250993
Iter: 800 loss: 0.002002454174544255
Iter: 900 loss: 0.0017345331739306857


## References

This notebook is heavily based on:
- Chris Dyer, Yoav Goldberg, Graham Neubig: [DyNet tutorial - part 1](http://demo.clab.cs.cmu.edu/cdyer/emnlp2016-dynet-tutorial-part1.pdf) and [part 2](http://demo.clab.cs.cmu.edu/cdyer/emnlp2016-dynet-tutorial-part2.pdf) [awesome resource!]