We want to train an LSTM model over a large vocabulary.

We care less about the perplexity / probability than about the representation itself, but we want to (pre)train it in an unsupervised way, so we will use the corpus itself (the previous words) as our training signal.

In [3]:
# we assume that the pycnn module is in your path.
# we also assume that LD_LIBRARY_PATH includes a pointer to where libcnn_shared.so is.
from pycnn import *

## An LSTM/RNN overview:

A (1-layer) RNN can be thought of as a sequence of cells, $h_1,...,h_k$, where the index $i$ of $h_i$ indicates the time dimension.

Each cell $h_i$ has an input $x_i$ and an output $r_i$. In addition to $x_i$, cell $h_i$ also receives $r_{i-1}$ as input.
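
In symbols, one minimal way to write this recurrence, assuming a generic cell function $f$ (a name introduced here only for illustration; its exact form depends on the RNN type) and some fixed initial value $r_0$:

$$r_i = f(x_i, r_{i-1}), \qquad i = 1,...,k$$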

In a deep (multi-layer) RNN, we don't have a sequence, but a grid. That is, we have several layers of sequences:

* $h_1^3,...,h_k^3$ 
* $h_1^2,...,h_k^2$ 
* $h_1^1,...,h_k^1$

Let $r_i^j$ be the output of cell $h_i^j$. Then:

The input to $h_i^1$ is $x_i$ and $r_{i-1}^1$.

The input to $h_i^2$ is $r_i^1$ and $r_{i-1}^2$,
and so on.
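
The grid can be sketched in plain Python, independently of pycnn. This is only an illustration under stated assumptions: `make_cell` and `run_deep_rnn` are hypothetical names, the cell is a textbook `tanh` recurrence with fixed toy weights, and all initial outputs are taken to be zero vectors.

```python
import math

def make_cell(in_dim, hid_dim, scale=0.1):
    # Hypothetical simple-RNN cell: r = tanh(W x + U r_prev), with fixed toy weights.
    W = [[scale * ((i + j) % 3 - 1) for j in range(in_dim)] for i in range(hid_dim)]
    U = [[scale * ((i * j) % 3 - 1) for j in range(hid_dim)] for i in range(hid_dim)]

    def cell(x, r_prev):
        return [math.tanh(sum(W[i][j] * x[j] for j in range(in_dim)) +
                          sum(U[i][j] * r_prev[j] for j in range(hid_dim)))
                for i in range(hid_dim)]
    return cell

def run_deep_rnn(cells, xs, hid_dim):
    # r_prev[j] holds the previous output of layer j+1 (zeros before the first step).
    r_prev = [[0.0] * hid_dim for _ in cells]
    top_outputs = []
    for x in xs:                       # time steps i = 1..k
        inp = x                        # layer 1 reads x_i
        r_cur = []
        for j, cell in enumerate(cells):
            r = cell(inp, r_prev[j])   # inputs: output from below, own previous output
            r_cur.append(r)
            inp = r                    # the next layer up reads this output
        r_prev = r_cur
        top_outputs.append(r_cur[-1])  # output of the top layer at this time step
    return top_outputs

IN_DIM, HID_DIM = 4, 3
cells = [make_cell(IN_DIM, HID_DIM), make_cell(HID_DIM, HID_DIM)]  # 2 layers
xs = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
outputs = run_deep_rnn(cells, xs, HID_DIM)
```

Each entry of `outputs` is the top-layer output for one time step; the first layer is the only one that sees the raw inputs.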







## The LSTM (RNN) Interface

RNN / LSTM / GRU follow the same interface. We have a "builder" which is in charge of defining the parameters for the sequence.

In [4]:
model = Model()
NUM_LAYERS=2
INPUT_DIM=50
HIDDEN_DIM=10
builder = LSTMBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, model)
# or:
# builder = SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, model)

Note that when we create the builder, it adds the internal RNN parameters to the `model`.
We do not need to care about them, but they will be optimized together with the rest of the network's parameters.

In [5]:
s0 = builder.initial_state()

In [6]:
x1 = vecInput(INPUT_DIM)

In [7]:
s1=s0.add_input(x1)
y1 = s1.output()
# here, we add x1 to the RNN, and the output we get from the top is y1 (a HIDDEN_DIM-dim vector)

In [8]:
y1.npvalue().shape

(10,)

In [9]:
s2=s1.add_input(x1) # we can add another input
y2=s2.output()

If our LSTM/RNN was one layer deep, y2 would be equal to the hidden state. However, since it is 2 layers deep, y2 is only the hidden state (= output) of the last layer.

If we want access to all the hidden states (the outputs of both the first and the last layers), we can use the `.h()` method, which returns a list of expressions, one for each layer:

In [10]:
print s2.h()

(exprssion 54/0, exprssion 66/0)


The interface we have seen so far for the LSTM also holds for the Simple RNN:

In [11]:
# create a simple rnn builder
rnnbuilder=SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, model)

# start a new sequence (in the current computation graph)
rs0 = rnnbuilder.initial_state()

# add inputs
rs1 = rs0.add_input(x1)
ry1 = rs1.output()
# note: this prints the layers of the LSTM state s1; use rs1.h() for the simple-RNN state
print "all layers:", s1.h()

all layers: (exprssion 32/0, exprssion 42/0)


In [12]:
print s1.s()

(exprssion 28/0, exprssion 38/0, exprssion 32/0, exprssion 42/0)


To summarize, when we call `.add_input(x)` on an `RNNState`, the state creates a new RNN/LSTM column, passing it:
1. the state of the current RNN column
2. the input `x`

The new state is then returned, and we can call its `output()` method to get the output `y`, which is the output at the top of the column. We can access the outputs of all the layers (not only the last one) using the `.h()` method of the state.

**`.s()`** The internal state of the RNN may be more involved than just the outputs $h$. This is the case for the LSTM, which keeps an extra "memory" cell that is used when calculating $h$ and is also passed to the next column. To access the entire hidden state, we use the `.s()` method.

The output of `.s()` differs by the type of RNN being used. For the simple-RNN, it is the same as `.h()`. For the LSTM, it is more involved.


In [13]:
rnn_h  = rs1.h()
rnn_s  = rs1.s()
print "RNN h:", rnn_h
print "RNN s:", rnn_s


lstm_h = s1.h()
lstm_s = s1.s()
print "LSTM h:", lstm_h
print "LSTM s:", lstm_s


RNN h: (exprssion 74/0, exprssion 76/0)
RNN s: (exprssion 74/0, exprssion 76/0)
LSTM h: (exprssion 32/0, exprssion 42/0)
LSTM s: (exprssion 28/0, exprssion 38/0, exprssion 32/0, exprssion 42/0)


As we can see, the LSTM has two extra state expressions (one for each hidden layer) before the outputs h.
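
To make the difference between the two views concrete, here is a minimal pure-Python sketch of a two-layer LSTM column with scalar states. It assumes the standard textbook LSTM equations, which are not necessarily the exact variant pycnn implements; `lstm_step`, `column` and `toy_w` are hypothetical names. The tuple layout (memory cells first, then outputs) mirrors the ordering seen in the printout above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    # Scalar (1-dimensional) LSTM cell, for illustration only.
    # w maps a gate name to (input weight, recurrent weight, bias).
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b
    i = sigmoid(pre("i"))      # input gate
    f = sigmoid(pre("f"))      # forget gate
    o = sigmoid(pre("o"))      # output gate
    g = math.tanh(pre("g"))    # candidate memory
    c = f * c_prev + i * g     # memory cell: passed on to the next column
    h = o * math.tanh(c)       # output: passed upward and to the next column
    return h, c

def column(x, hs_prev, cs_prev, layer_weights):
    # One column of a multi-layer LSTM: each layer reads the output of the layer below.
    hs, cs, inp = [], [], x
    for w, h_prev, c_prev in zip(layer_weights, hs_prev, cs_prev):
        h, c = lstm_step(inp, h_prev, c_prev, w)
        hs.append(h)
        cs.append(c)
        inp = h
    return hs, cs

toy_w = {"i": (0.5, 0.1, 0.0), "f": (0.3, 0.2, 1.0),
         "o": (0.4, -0.1, 0.0), "g": (0.8, 0.3, 0.0)}
hs, cs = column(1.0, [0.0, 0.0], [0.0, 0.0], [toy_w, toy_w])

h_view = tuple(hs)               # analogous to .h(): one output per layer
s_view = tuple(cs) + tuple(hs)   # analogous to .s(): memory cells, then outputs
```

For a simple RNN there is no memory cell, so the two views would coincide.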

## Extra options in the RNN/LSTM interface

**Stack LSTM** The RNNs are shaped as a stack: we can remove the top and continue from the previous state.
This is done either by remembering the previous state and continuing it with a new `.add_input()`, or by
accessing the previous state of a given state using its `.prev()` method.
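
The stack behaviour follows from states being immutable and linked to their predecessor. A minimal sketch of that idea, using a hypothetical `ToyState` class that only records inputs (it is not the pycnn `RNNState` and performs no computation):

```python
class ToyState(object):
    """Hypothetical stand-in for an RNN state: immutable, linked to its predecessor."""

    def __init__(self, prev=None, inp=None):
        self._prev = prev
        self._inp = inp

    def add_input(self, x):
        # Never mutates self: returns a new state that points back to this one.
        return ToyState(prev=self, inp=x)

    def prev(self):
        return self._prev

    def inputs(self):
        # Walk back to the initial state to recover the sequence of inputs.
        seq, state = [], self
        while state._prev is not None:
            seq.append(state._inp)
            state = state._prev
        return seq[::-1]

s0 = ToyState()
s1 = s0.add_input("a")
s2 = s1.add_input("b")
s3 = s2.add_input("c")          # sequence: a, b, c
s4 = s2.add_input("d")          # branch from s2: a, b, d
s5 = s3.prev().add_input("e")   # "pop" c, then continue: a, b, e
```

Because `add_input` never modifies an existing state, `s3` is still valid after `s4` and `s5` are created, and all three sequences share the prefix `s0, s1, s2`.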

**Initializing a new sequence with a given state** When we call `builder.initial_state()`, we assume that the state has a random / 0 initialization. If we want, we can specify a list of expressions that will serve as the initial state. The expected format is the same as the result of a call to `.final_s()`. TODO: this is not supported yet.

In [14]:
s2=s1.add_input(x1)
s3=s2.add_input(x1)
s4=s3.add_input(x1)

# let's continue s3 with a new input.
s5=s3.add_input(x1)

# we now have two different sequences:
# s0,s1,s2,s3,s4
# s0,s1,s2,s3,s5
# the two sequences share parameters.

assert(s5.prev() == s3)
assert(s4.prev() == s3)

s6=s3.prev().add_input(x1)
# we now have an additional sequence:
# s0,s1,s2,s6

In [15]:
s6.h()

(exprssion 184/0, exprssion 196/0)

In [16]:
s6.s()

(exprssion 180/0, exprssion 192/0, exprssion 184/0, exprssion 196/0)

## Character-level LSTM

Now that we know the basics of RNNs, let's build a character-level LSTM language-model.
We have a sequence LSTM that, at each step, gets as input a character, and needs to predict the next character.
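
Before the pycnn version, here is a pycnn-free sketch of the training objective: the sentence is padded with `<EOS>`, every symbol is paired with its successor, and the loss is the sum of negative log-probabilities of the correct next symbols. The names `training_pairs`, `sentence_loss`, `score_fn` and `uniform` are hypothetical, and `score_fn` here sees only the current character, whereas the RNN conditions on the whole prefix.

```python
import math

characters = list("abcdefghijklmnopqrstuvwxyz ") + ["<EOS>"]
char2int = {c: i for i, c in enumerate(characters)}

def training_pairs(sentence):
    # Pad with <EOS> on both sides, then pair every symbol with its successor.
    symbols = ["<EOS>"] + list(sentence) + ["<EOS>"]
    ids = [char2int[c] for c in symbols]
    return list(zip(ids, ids[1:]))

def softmax(scores):
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_loss(sentence, score_fn):
    # Sum over positions of -log p(next_char), with p given by softmax(score_fn(char)).
    loss = 0.0
    for char, next_char in training_pairs(sentence):
        probs = softmax(score_fn(char))
        loss += -math.log(probs[next_char])
    return loss

def uniform(char):
    # A model that knows nothing: equal scores for every symbol.
    return [0.0] * len(characters)

pairs = training_pairs("dog")
loss = sentence_loss("dog", uniform)
```

Under the uniform model each position costs $\log 28$, so a 42-character sentence (43 predictions including the final `<EOS>`) costs roughly 143, which is a useful reference point for the loss of an untrained network.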

In [17]:
import random
from collections import defaultdict
from itertools import count
import sys

LAYERS = 2
INPUT_DIM = 50 
HIDDEN_DIM = 50  

characters = list("abcdefghijklmnopqrstuvwxyz ")
characters.append("<EOS>")

int2char = list(characters)
char2int = {c:i for i,c in enumerate(characters)}

VOCAB_SIZE = len(characters)



In [18]:
model = Model()


srnn = SimpleRNNBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, model)
lstm = LSTMBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, model)

model.add_lookup_parameters("lookup", (VOCAB_SIZE, INPUT_DIM))
model.add_parameters("R", (VOCAB_SIZE, HIDDEN_DIM))
model.add_parameters("bias", (VOCAB_SIZE))

# compute and return the loss of the RNN for one sentence
def do_one_sentence(rnn, sentence):
    # setup the sentence
    renew_cg()
    s0 = rnn.initial_state()
    
    
    R = parameter(model["R"])
    bias = parameter(model["bias"])
    lookup = model["lookup"]
    sentence = ["<EOS>"] + list(sentence) + ["<EOS>"]
    sentence = [char2int[c] for c in sentence]
    s = s0
    loss = []
    for char,next_char in zip(sentence,sentence[1:]):
        s = s.add_input(lookup[char])
        probs = softmax(R*s.output() + bias)
        loss.append( -log(pick(probs,next_char)) )
    loss = esum(loss)
    return loss
 

# generate from model:
def generate(rnn):
    def sample(probs):
        rnd = random.random()
        for i,p in enumerate(probs):
            rnd -= p
            if rnd <= 0: break
        return i
    
    # setup the sentence
    renew_cg()
    s0 = rnn.initial_state()
    
    R = parameter(model["R"])
    bias = parameter(model["bias"])
    lookup = model["lookup"]
    
    s = s0.add_input(lookup[char2int["<EOS>"]])
    out=[]
    while True:
        probs = softmax(R*s.output() + bias)
        probs = probs.vec_value()
        next_char = sample(probs)
        out.append(int2char[next_char])
        if out[-1] == "<EOS>": break
        s = s.add_input(lookup[next_char])
    return "".join(out[:-1]) # strip the <EOS>
        

# train, and generate a sample every 5 iterations
def train(rnn, sentence):
    trainer = SimpleSGDTrainer(model)
    for i in xrange(200):
        loss = do_one_sentence(rnn, sentence)
        loss_value = loss.value()
        loss.backward()
        trainer.update()
        if i % 5 == 0: 
            print loss_value,
            print generate(rnn)
    

Notice that:
1. We pass the same rnn-builder to `do_one_sentence` over and over again.
We must re-use the same rnn-builder, as this is where the shared parameters are kept.
2. We `renew_cg()` before each sentence -- because we want to have a new graph (new network) for this sentence.
The parameters will be shared through the model and the shared rnn-builder.
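
The `sample` helper inside `generate` draws an index from a probability vector by walking the cumulative sum. A standalone sketch of the same technique, with an explicit random generator added as a hypothetical `rng` parameter so that the run is repeatable:

```python
import random
from collections import Counter

def sample(probs, rng=random):
    # Draw index i with probability probs[i] by subtracting each probability
    # from a uniform random number until it drops to zero or below.
    rnd = rng.random()
    for i, p in enumerate(probs):
        rnd -= p
        if rnd <= 0:
            break
    return i

rng = random.Random(0)            # fixed seed (an arbitrary choice)
probs = [0.1, 0.6, 0.3]
N = 10000
counts = Counter(sample(probs, rng) for _ in range(N))
freqs = [counts[i] / float(N) for i in range(len(probs))]
```

With enough draws, `freqs` should approach `probs`. If rounding makes the probabilities sum to slightly less than 1, the loop simply falls through and returns the last index.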

In [19]:
sentence = "a quick brown fox jumped over the lazy dog"
train(srnn, sentence)

159.679000854 lmt nx ygtflbjxpoq lwy f
92.4982833862 afjvta ltvopebstkfdo rlo yqnopb  ifpqbzi
61.3959312439 aavleiq po  rvtzeakqyg zur
37.0580444336 vnlbclj fjo j mamlndfe  dixnjhgdtof bdzhe ddvyuwa ohryltop fvek cemouaa dcg ndeilci  cxgphmzthauhxzcl
19.9309463501 iwomtno iwv nwe
7.24577760696 a quicknbrown  vx tuxpcy ouxr she hazy dog
2.65018558502 a quick brown fox jumped over tue lazy dog
0.954704046249 a quick brown fox jhmped over the lazy dog
0.626812160015 a quick brown fox jumped over the lazy dog
0.466860800982 a quick brown fox jumped over the lazy dog
0.37179967761 a quick brown fox jumped over the lazy dog
0.308676809072 a quick brown fox jumped over the lazy dog
0.263797819614 a quick brown fox jumped over the lazy dog
0.230166554451 a quick brown fox jumped over the lazx dog
0.204109430313 a quick brown fox jumped over the lazy dog
0.183262810111 a nuick browk fox jumped over the lazy dog
0.166249394417 a quick brown fox jumped over the lazy dog
0.152071535587 a quick bro

In [20]:
sentence = "a quick brown fox jumped over the lazy dog"
train(lstm, sentence)


141.691223145  uvybot firnqqci 
130.66217041  nud  s ddco  e u  sqo ryastfig scoxoupa se  mz jmxr  v
126.464065552 uuipfaa gyoe pmftgnidnvonf  uyddorehj zxbuatudorecdaneoo oepvq uox
118.422286987 s l whurp yypde beyn  ex kavk  x  uzdlboxc tuq ac aupztide rdmfua qe e wm jxu
105.712097168 e ck  ohelfl e akmfer oewe gp os
90.3664779663 gzh dofuth lxjuaz thke
77.7215118408 a eckn vv
64.3503112793 jr vrof
55.7599029541 d zcsn ver fe zezz doj dor em ed ama xuei ob
47.3434677124 gcz bbovrw vunmpjd ofg
39.2717132568  auil bkbgor ffhg tde abgea oirn ffxgz die berr tref efhe hde oei 
29.5673542023 uqyk qvog tvm for  oxx jtme dxgk tve laue doy qfx dzz do rh thl dze do ex uuee don frx jamm dur eohn fom pdo jhe duz qzg qwor hote l oe joy jox fmqme doch vinn tte ta luzc do
23.5196590424 
17.6167182922 k iock nogr the lzz upiek vohe ttea loa pdogw theplddoy bdor joa muuped over the lazy doger ohen ohe amzy dor
13.1281785965 b dugckk howr fo
8.73806285858 p quxpd over the laza doo
6.34239006042 a quik

The model seems to learn the sentence quite well.

Somewhat surprisingly, the Simple-RNN model learns more quickly than the LSTM!

How can that be?

The answer is that we are cheating a bit. Most of the letters in the sentence
we are trying to learn occur only once, so for those letters the previous character
alone determines the next one. This means that even a simple bigram model can
memorize much of it.

Try it out with more complex sequences.
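
One rough way to judge how hard a sequence is for a bigram-like model is to count the positions where the current character has more than one possible successor. The sketch below does this; `successor_table` and `ambiguous_fraction` are hypothetical helper names, and the measure is only a heuristic.

```python
from collections import defaultdict

def padded(sentence):
    return ["<EOS>"] + list(sentence) + ["<EOS>"]

def successor_table(sentence):
    # Map each symbol to the set of symbols that follow it in the padded sentence.
    symbols = padded(sentence)
    table = defaultdict(set)
    for cur, nxt in zip(symbols, symbols[1:]):
        table[cur].add(nxt)
    return table

def ambiguous_fraction(sentence):
    # Fraction of positions whose current symbol has more than one distinct successor,
    # i.e. positions that the previous character alone cannot resolve.
    symbols = padded(sentence)
    table = successor_table(sentence)
    pairs = list(zip(symbols, symbols[1:]))
    ambiguous = sum(1 for cur, _ in pairs if len(table[cur]) > 1)
    return ambiguous / float(len(pairs))

fox = ambiguous_fraction("a quick brown fox jumped over the lazy dog")
pretzels = ambiguous_fraction("these pretzels are making me thirsty")
```

By this measure the first sentence has a smaller fraction of ambiguous positions than the second, so the second is a somewhat harder test of whether the model uses more than the previous character.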



In [21]:
train(srnn, "these pretzels are making me thirsty")

316.674530029 a quick brown fox jumped over the lazy dog
105.89151001 a quick brown fox jumped over the lazy dog
47.5085983276 a quick browitleh aheclazy dog
14.6825256348 thellazy dog
2.44217705727 the ere makqne mn thi ste
0.339166790247 thele pretzels are making me thirsty
0.185528486967 these pretzels are making me thirsty
0.136738568544 these mretsels are making me thirsty
0.109439112246 these pretzels are making me thirsty
0.0916935577989 these pretzels are making me thirsty
0.0791544318199 these pretzels are making me thirsty
0.0697760358453 these pretzels are making me thirsty
0.0624843500555 these pretzels are making me thirsty
0.0566432401538 these pretzels are making me thirsty
0.0518313497305 these pretzels are making me thirsty
0.0478418208659 these pretzels are making me thirsty
0.0444259755313 these pretzels are making me thirsty
0.0414918512106 these pretzels are making me thirsty
0.0389476977289 these pretzels are making me thirsty
0.0366978682578 these pretzels are ma