# Day 6: Sequence Models in Deep Learning

### Exercise 6.1 
Convince yourself that an RNN is just a feed-forward (FF) network unfolded in time. Run the NumpyRNN code. Set breakpoints and compare with what you learned about backpropagation on the previous day.
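The unfolding idea can be sketched in plain NumPy, independently of the NumpyRNN class. The weight names `W_x` and `W_h`, the toy sizes and the tanh non-linearity are assumptions made for illustration; the lab code may use different names and a different non-linearity.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hid, n_steps = 4, 3, 5

# Hypothetical shared weights and a random input sequence
W_x = rng.randn(n_hid, n_in) * 0.1
W_h = rng.randn(n_hid, n_hid) * 0.1
inputs = rng.randn(n_steps, n_in)

def rnn_loop(inputs, W_x, W_h):
    """Recurrent view: one loop that reuses the same weights at every step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x.dot(x_t) + W_h.dot(h))
        states.append(h)
    return np.array(states)

def unfolded_layers(inputs, W_x, W_h):
    """Feed-forward view: one explicit layer per time step with tied weights."""
    layers = [(W_x.copy(), W_h.copy()) for _ in range(len(inputs))]
    h = np.zeros(W_h.shape[0])
    states = []
    for (layer_W_x, layer_W_h), x_t in zip(layers, inputs):
        h = np.tanh(layer_W_x.dot(x_t) + layer_W_h.dot(h))
        states.append(h)
    return np.array(states)

states_loop = rnn_loop(inputs, W_x, W_h)
states_ff = unfolded_layers(inputs, W_x, W_h)
print(np.allclose(states_loop, states_ff))
```

Both views produce the same hidden states because every layer of the unfolded network holds a copy of the same weights.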

Start by loading the part-of-speech data and configuring it for the exercises.

### WSJ Data

In [None]:
# Load Part-of-Speech data 
from lxmls.readers.pos_corpus import PostagCorpusData
data = PostagCorpusData()

Model configuration

In [None]:
import lxmls.deep_learning.numpy_models.rnn as rnns
reload(rnns)

In [None]:
# RNN configuration
embedding_size = 50   # Size of word embeddings
hidden_size = 20     # size of hidden layer

In [None]:
model = rnns.NumpyRNN(
    input_size=data.input_size,
    embedding_size=embedding_size,
    hidden_size=hidden_size,
    output_size=data.output_size
)

In [None]:
train_batches = data.batches('train', batch_size=1)

#### Milestone 1:

Check the backpropagation gradients against an empirical estimate obtained by perturbing a single weight.

In [None]:
from lxmls.deep_learning.rnn import rnn_parameter_handlers
from lxmls.deep_learning.numpy_models.rnn import cross_entropy_loss, backpropagation
batch = data.batches('train', batch_size=1)[0]

In [None]:
gradient = backpropagation(batch['input'], batch['output'], model.parameters)

In [None]:
# Select one parameter from the network
layer_index = -1 # 0
row = 1
column = 1

In [None]:
# Get functions to get and set values of a particular parameter of the model
get_parameter, set_parameter = rnn_parameter_handlers(
    layer_index=layer_index,
    row=row, 
    column=column
)

In [None]:
import numpy as np
from copy import deepcopy

scale = 30
resolution = 1000
perturbations = np.linspace(-scale, scale, resolution)

# Get study weight value and loss
study_loss = cross_entropy_loss(batch['input'], batch['output'], model.parameters)

# Compute the loss when varying the study weight
parameters = deepcopy(model.parameters)
study_weight = float(get_parameter(parameters))
loss_range = []
for perturbation in perturbations:
    perturbated_parameters = set_parameter(parameters, study_weight + perturbation)
    perturbated_loss = cross_entropy_loss(batch['input'], batch['output'], perturbated_parameters)
    loss_range.append(perturbated_loss)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
# Plot the empirical loss as the study weight varies
weight_range = study_weight + perturbations
plt.plot(weight_range, loss_range)
plt.plot(study_weight, study_loss, 'xr')
# Plot the tangent line predicted by the backpropagation gradient
h = plt.plot(
    weight_range,
    get_parameter(gradient)*(weight_range - study_weight) + study_loss, 
    'r--'
)
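The visual comparison above can also be done numerically with a central finite difference. The sketch below uses a toy loss with a known gradient instead of the RNN, so every function name in it is hypothetical and it does not depend on the lxmls helpers.

```python
import numpy as np

def toy_loss(w):
    """Toy scalar loss whose gradient is known in closed form."""
    return np.sum(w ** 2) + np.sin(w[0])

def toy_analytic_gradient(w):
    """Exact gradient of toy_loss, playing the role of backpropagation."""
    grad = 2.0 * w
    grad[0] += np.cos(w[0])
    return grad

def numerical_gradient(loss_fn, w, index, epsilon=1e-5):
    """Central difference estimate of d loss / d w[index]."""
    w_plus = w.copy()
    w_minus = w.copy()
    w_plus[index] += epsilon
    w_minus[index] -= epsilon
    return (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * epsilon)

w = np.array([0.5, -1.0, 2.0])
exact = toy_analytic_gradient(w)
estimates = np.array(
    [numerical_gradient(toy_loss, w, index) for index in range(len(w))]
)
print(np.allclose(exact, estimates, atol=1e-6))
```

A small `epsilon` keeps the estimate local, which is why the tangent line in the plot only matches the empirical curve near the study weight.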

### Exercise 6.2
Understand the basics of scan with these examples. Scan allows you to build computation graphs with a variable number of nodes. It acts like a Python "for" loop, but it is symbolic. The following example should help you understand the basic scan functionality. It generates a sequence of a given length. Run it and modify it. Try to make it raise an error.

In [None]:
import numpy as np
import theano
import theano.tensor as T
theano.config.optimizer='None'

In [None]:
def square(x): 
    return x**2 

# Python
def np_square_n_steps(nr_steps):
    out = []
    for n in np.arange(nr_steps):
        out.append(square(n))
    return np.array(out)
    
# Theano
nr_steps = T.lscalar('nr_steps')
h, _ = theano.scan(fn=square, sequences=T.arange(nr_steps))
th_square_n_steps = theano.function([nr_steps], h)

In [None]:
print np_square_n_steps(10)
print th_square_n_steps(10)
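To make the calling convention of scan more concrete, here is a simplified, eager stand-in written in plain Python. The function `simple_scan` is hypothetical and only mimics the argument order (sequence element first, previous output second); it is not how Theano implements scan, which builds a symbolic graph instead of running the loop immediately.

```python
import numpy as np

def simple_scan(fn, sequences=None, outputs_info=None, n_steps=None):
    """Eager toy version of scan: loop over steps and collect the outputs."""
    if sequences is not None:
        n_steps = len(sequences)
    carry = outputs_info
    outputs = []
    for t in range(n_steps):
        args = []
        if sequences is not None:
            args.append(sequences[t])   # current element of the sequence
        if carry is not None:
            args.append(carry)          # output of the previous step
        out = fn(*args)
        if outputs_info is not None:
            carry = out                 # feed the output to the next step
        outputs.append(out)
    return np.array(outputs)

# Iterating over a sequence, as in the squares example
squares = simple_scan(lambda n: n ** 2, sequences=np.arange(5))
# Passing a value from one step to the next, starting from 1
doubling = simple_scan(lambda prev: 2 * prev, outputs_info=1, n_steps=4)

print(squares.tolist())   # [0, 1, 4, 9, 16]
print(doubling.tolist())  # [2, 4, 8, 16]
```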

The following example should help you understand matrix multiplications and how to pass values from one iteration to the next. At each step we multiply the output of the previous step by a matrix A, starting from an initial vector s0. The matrix is random, with its columns normalized to sum to one, and s0 puts all the probability mass on the first state, so the result is a Markov chain.

In [None]:
# Configuration
nr_states = 3
nr_steps = 5

# Transition matrix
A = np.abs(np.random.randn(nr_states, nr_states))
A = A/A.sum(0, keepdims=True)
# Initial state
s0 = np.zeros(nr_states)
s0[0] = 1

In [None]:
# Numpy version
def np_markov_step(s_tm1): 
    s_t = np.dot(s_tm1, A.T)
    return s_t 

def np_markov_chain(nr_steps, A, s0):
    # Pre-allocate space
    s = np.zeros((nr_steps+1, nr_states))
    s[0, :] = s0
    for t in np.arange(nr_steps):
        s[t+1, :] = np_markov_step(s[t, :])
    return  s           

In [None]:
np_markov_chain(nr_steps, A, s0)

In [None]:
# Theano version
# Store variables as shared variables
th_A = theano.shared(A, name='A', borrow=True)
th_s0 = theano.shared(s0, name='s0', borrow=True)
# Symbolic variable for the number of steps
th_nr_steps = T.lscalar('nr_steps')

def th_markov_step(s_tm1): 
    s_t = T.dot(s_tm1, th_A.T)
    # Remember to name variables
    s_t.name = 's_t'
    return s_t 

s, _ = theano.scan(th_markov_step, 
                   outputs_info=[dict(initial=th_s0)], 
                   n_steps=th_nr_steps)
th_markov_chain = theano.function([th_nr_steps], T.concatenate((th_s0[None, :], s), 0))

In [None]:
th_markov_chain(nr_steps)
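Two properties of this chain are easy to check in NumPy: every state vector remains a probability distribution, and the last state equals the transition matrix raised to the number of steps applied to the initial state. The sketch below rebuilds a small chain with its own `toy_` variables (assumed names, chosen so the notebook variables are left untouched).

```python
import numpy as np

rng = np.random.RandomState(0)
toy_nr_states, toy_nr_steps = 3, 5

# Random transition matrix with columns normalized to sum to one
toy_A = np.abs(rng.randn(toy_nr_states, toy_nr_states))
toy_A = toy_A / toy_A.sum(0, keepdims=True)
# Initial state: all probability mass on the first state
toy_s0 = np.zeros(toy_nr_states)
toy_s0[0] = 1

# Same recurrence as the examples above: s_t = dot(s_tm1, A.T)
chain = [toy_s0]
for _ in range(toy_nr_steps):
    chain.append(np.dot(chain[-1], toy_A.T))
chain = np.array(chain)

# Every row is still a probability distribution
rows_sum_to_one = np.allclose(chain.sum(1), 1.0)
# The last state equals A^nr_steps applied to s0
closed_form = np.linalg.matrix_power(toy_A, toy_nr_steps).dot(toy_s0)
matches_closed_form = np.allclose(chain[-1], closed_form)

print(rows_sum_to_one)
print(matches_closed_form)
```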

### Exercise 6.3
Complete the Theano code for an RNN inside lxmls/deep_learning/rnn.py. Use Exercise 6.1 as a NumPy example and Exercise 6.2 to learn how to handle scan. Keep in mind that you only need to implement the forward pass! Theano will handle backpropagation for us.

In [None]:
rnn = rnns.RNN(nr_words, emb_size, hidden_size, nr_tags, seed=SEED)

In [None]:
# Compile theano function
x = T.ivector('x')
th_forward = theano.function([x], rnn._forward(x).T)

When working with Theano, it is more difficult to locate the source of errors. It is therefore extremely important to work step by step and test the code frequently. To debug, we suggest implementing and compiling the forward pass first. You can use this code for testing. If it raises no error, you are good to go.

In [None]:
assert np.allclose(th_forward(x0), np_rnn.forward(x0)), \
    "Numpy and Theano forward pass differ!"

Once you are confident the forward pass is working, you can test the gradients.

In [None]:
# Compile function returning the list of gradients
x = T.ivector('x')     # Input words
y = T.ivector('y')     # gold tags 
p_y = rnn._forward(x)
cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])
grads_fun = theano.function([x, y], [T.grad(cost, par) for par in rnn.param])
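The indexing in the cost expression selects, for each position in the sentence, the log-probability of the gold tag. A NumPy sketch of the same indexing on made-up numbers, assuming the probability matrix has one row per position and one column per tag (the `toy_` names are hypothetical):

```python
import numpy as np

# Made-up tag probabilities for a 3-word sentence and 3 tags
toy_p_y = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
toy_y = np.array([0, 1, 2])   # gold tag index for each word

# Pair each row index with its gold tag index
gold_log_probs = np.log(toy_p_y)[np.arange(toy_y.shape[0]), toy_y]
toy_cost = -np.mean(gold_log_probs)

# The same quantity written as an explicit loop over positions
loop_cost = -sum(
    np.log(toy_p_y[t, toy_y[t]]) for t in range(len(toy_y))
) / len(toy_y)

print(np.isclose(toy_cost, loop_cost))
```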

In [None]:
# Compare numpy and theano gradients
theano_rnn_gradients = grads_fun(x0, y0)
for n in range(len(theano_rnn_gradients)): 
    assert np.allclose(numpy_rnn_gradients[n], theano_rnn_gradients[n]), \
    "Numpy and Theano gradients differ for parameter %d" % n

Finally, it's time to test our network! For this, let's first compile a function that makes predictions.

In [None]:
rnn_prediction = theano.function([x], T.argmax(p_y, 1))
# Let's test the predictions
def test_model(sample_seq, rnn_prediction):
    words = [train_seq.word_dict[wrd] for wrd in sample_seq.x]
    tags = [train_seq.tag_dict[pred] for pred in rnn_prediction(sample_seq.x)]
    print ["/".join([word, tag]) for word , tag in zip(words, tags)]

In [None]:
test_model(train_seq[0], rnn_prediction) 

Now let's define the optimization parameters and compile a batch update function.

In [None]:
lrate = 0.5
n_iter = 5

In [None]:
# Get list of SGD batch update rule for each parameter
updates = [(par, par - lrate*T.grad(cost, par)) for par in rnn.param]
# compile
rnn_batch_update = theano.function([x, y], cost, updates=updates)
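Each pair in the update list replaces a parameter by itself minus the learning rate times its gradient. The effect of that rule can be seen on a toy quadratic loss in NumPy; the loss, the target and the `toy_` names below are assumptions for illustration only.

```python
import numpy as np

def quadratic_loss(w, target):
    """Toy loss: half the squared distance to the target."""
    return 0.5 * np.sum((w - target) ** 2)

def quadratic_grad(w, target):
    """Gradient of quadratic_loss with respect to w."""
    return w - target

toy_lrate = 0.5
toy_target = np.array([1.0, -2.0])
toy_w = np.zeros(2)

losses = []
for _ in range(5):
    losses.append(quadratic_loss(toy_w, toy_target))
    # Same rule as in the update list: par <- par - lrate * grad
    toy_w = toy_w - toy_lrate * quadratic_grad(toy_w, toy_target)

print(all(later < earlier for earlier, later in zip(losses, losses[1:])))
```

With this learning rate the distance to the target halves at every step, so the loss shrinks by a factor of four per update.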

Finally, it is time to run SGD. You can use the following code for this purpose.

In [None]:
import sys

# Word counts used to normalize the accuracies
nr_train_words = sum([len(seq.x) for seq in train_seq])
nr_dev_words = sum([len(seq.x) for seq in dev_seq])
for i in range(n_iter):
    
    # Training
    total_cost = 0
    errors = 0
    for n, seq in enumerate(train_seq):
        total_cost += rnn_batch_update(seq.x, seq.y)
        errors += sum(rnn_prediction(seq.x) != seq.y)
    acc_train = 100*(1-errors*1./nr_train_words) 
    print "Epoch %d: Train cost %2.2f Acc %2.2f %%" % (i+1, total_cost, acc_train), 
    
    # Evaluation (normalized by the number of dev words)
    errors = 0
    for n, seq in enumerate(dev_seq):
        errors += sum(rnn_prediction(seq.x) != seq.y)  
    acc_dev = 100*(1-errors*1./nr_dev_words) 
    print " Devel Acc %2.2f %%" % acc_dev
    sys.stdout.flush()

Test the effect of using pre-trained embeddings. Run the following code to download the embeddings, reset the layer parameters and initialize the embedding layer with the pre-trained embeddings. Then run the training code above.

In [None]:
# Embeddings Path
EMBEDDINGS = "../../data/senna_50"
import lxmls.deep_learning.embeddings as emb
import os
reload(emb)
if not os.path.isfile(EMBEDDINGS):
    emb.download_embeddings('senna_50', EMBEDDINGS)
E = emb.extract_embeddings(EMBEDDINGS, train_seq.x_dict) 
# Reset model to remove the effect of training
rnn = rnns.reset_model(rnn, seed=SEED)
# Set the embedding layer to the pre-trained values
rnn.param[0].set_value(E.astype(theano.config.floatX)) 

### Exercise 6.4
Convince yourself that LSTMs are just more complex RNNs. Run them, play around with the hyperparameters and compare the RNN and LSTM classes.
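As a point of comparison, a single LSTM step can be sketched in NumPy next to a simple RNN step. The sketch follows the common textbook formulation with input, forget and output gates; the stacked weight layout and all names are assumptions, and the LSTM class in the lab code may be organized differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_tm1, W_x, W_h):
    """Simple RNN: the hidden state is the only memory."""
    return np.tanh(W_x.dot(x_t) + W_h.dot(h_tm1))

def lstm_step(x_t, h_tm1, c_tm1, W, U, b):
    """LSTM: same interface plus a cell state controlled by gates.

    W, U and b stack the input gate, forget gate, output gate and
    candidate parameters, in that (assumed) order.
    """
    n_hid = h_tm1.shape[0]
    z = W.dot(x_t) + U.dot(h_tm1) + b
    i = sigmoid(z[0:n_hid])               # input gate
    f = sigmoid(z[n_hid:2 * n_hid])       # forget gate
    o = sigmoid(z[2 * n_hid:3 * n_hid])   # output gate
    g = np.tanh(z[3 * n_hid:])            # candidate cell values
    c_t = f * c_tm1 + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.RandomState(0)
n_in, n_hid = 4, 3
W = rng.randn(4 * n_hid, n_in) * 0.1
U = rng.randn(4 * n_hid, n_hid) * 0.1
b = np.zeros(4 * n_hid)

h_rnn = np.zeros(n_hid)
h_lstm, c_lstm = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.randn(6, n_in):
    # Reuse the candidate block of W and U as the RNN weights
    h_rnn = rnn_step(x_t, h_rnn, W[3 * n_hid:], U[3 * n_hid:])
    h_lstm, c_lstm = lstm_step(x_t, h_lstm, c_lstm, W, U, b)

print(h_rnn.shape == h_lstm.shape)
```

Both steps map an input and a previous hidden state to a new hidden state of the same size; the LSTM additionally carries the cell state `c_t` from step to step.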

In [None]:
lstm = rnns.LSTM(nr_words, emb_size, hidden_size, nr_tags)
lstm_prediction = theano.function([x], 
                                  T.argmax(lstm._forward(x), 1))
lstm_cost = -T.mean(T.log(lstm._forward(x))[T.arange(y.shape[0]), y])

In [None]:
# Get list of SGD batch update rule for each parameter
lstm_updates = [(par, par - lrate*T.grad(lstm_cost, par)) for par in lstm.param]
# compile
lstm_batch_update = theano.function([x, y], lstm_cost, 
                                    updates=lstm_updates)

In [None]:
import sys

# Word counts used to normalize the accuracies
nr_train_words = sum([len(seq.x) for seq in train_seq])
nr_dev_words = sum([len(seq.x) for seq in dev_seq])
for i in range(n_iter):
    
    # Training
    total_cost = 0
    errors = 0
    for n, seq in enumerate(train_seq):
        total_cost += lstm_batch_update(seq.x, seq.y)
        errors += sum(lstm_prediction(seq.x) != seq.y)
    acc_train = 100*(1-errors*1./nr_train_words) 
    print "Epoch %d: Train cost %2.2f Acc %2.2f %%" % (i+1, total_cost, acc_train), 
    
    # Evaluation (normalized by the number of dev words)
    errors = 0
    for n, seq in enumerate(dev_seq):
        errors += sum(lstm_prediction(seq.x) != seq.y)  
    acc_dev = 100*(1-errors*1./nr_dev_words) 
    print " Devel Acc %2.2f %%" % acc_dev
    sys.stdout.flush()