Four flavours of a CharRNN implementation in Lasagne
===

In [6]:
from __future__ import print_function

import sys
import os
import time

import numpy as np
import theano
import theano.tensor as T
import lasagne

In [2]:
print(theano.__version__)  #should be at least 0.8.0.dev
print(lasagne.__version__) #should be at least 0.2.dev1

0.8.0.dev-a34dec55bfd6bd84e92a97346b5665f685b83a44
0.2.dev1


## A note on RNN model building in lasagne
#### Input layer
The convention used in lasagne is that sequential data is presented in the shape `(batch_size, n_time_steps, n_features_1, n_features_2, ...)`. Because not all sequences in a minibatch will have the same length, all recurrent layers in lasagne accept a separate mask input of shape `(batch_size, n_time_steps)`, populated such that (with zero-based indices) `mask[i, j] = 1` when `j < (length of sequence i)` and `mask[i, j] = 0` when `j >= (length of sequence i)`. When no mask is provided, all sequences in the minibatch are assumed to be of length `n_time_steps`. Finally, as is true of the first `(batch_size)` dimension, the `n_time_steps` dimension can be set to None, which means that it can vary from batch to batch. The network can therefore take in minibatches with an arbitrary number of sequences of arbitrary length, which is very convenient!
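
As a minimal NumPy sketch of this mask convention (the helper name `build_mask` is hypothetical and not part of lasagne; with zero-based time indices, step `j` of sequence `i` is valid when `j` is smaller than the sequence length):

```python
import numpy as np

def build_mask(lengths, n_time_steps):
    """Return a (batch_size, n_time_steps) mask: 1 for valid steps, 0 for padding."""
    lengths = np.asarray(lengths)
    steps = np.arange(n_time_steps)
    # Broadcasting compares every time index with every sequence length.
    return (steps[None, :] < lengths[:, None]).astype('float32')

# Three sequences of lengths 3, 5 and 1, padded to 5 time steps.
mask = build_mask([3, 5, 1], n_time_steps=5)
```

Each row of `mask` sums to the length of the corresponding sequence, which is a quick way to check a mask before feeding it to a recurrent layer.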

#### LSTM layer
The de facto method for training recurrent networks is backpropagation through time, which simply unrolls the network across timesteps and treats it as a network which is repeated for each time step. This can result in an incredibly "deep" and computationally expensive network, so it's common practice to truncate the number of unrolled sequence steps. This can be controlled with the `gradient_steps` argument; when it's -1 (the default), this means "don't truncate". A common method for mitigating the exponentially growing gradients often found when "unrolling" recurrent networks through time and backpropagating is simply to prevent them from being larger than a pre-set value. In recurrent layers, this can be achieved by passing in a float (rather than False) to `grad_clipping`. Some of the dot products computed in recurrent layers are non-recursive, which means they can be computed ahead of time in one big dot product. Since one big dot product is more efficient than lots of little dot products, lasagne does it by default. However, it imposes an additional memory requirement, so if you're running out of memory, set `precompute_input` to False.
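
To get a feeling for why truncation and clipping matter, here is a toy sketch. It assumes a scalar linear recurrence `h_t = w * h_{t-1}` rather than an LSTM, and the function name `backprop_gradient` is made up for illustration; it is not how lasagne implements `grad_clipping` or `gradient_steps` internally.

```python
import numpy as np

def backprop_gradient(w, n_steps, clip=None):
    """Gradient of h_T w.r.t. h_0 for the toy recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(n_steps):
        grad *= w  # one application of the chain rule per unrolled step
        if clip is not None:
            # Keep the gradient message within [-clip, clip].
            grad = float(np.clip(grad, -clip, clip))
    return grad

unclipped = backprop_gradient(1.5, n_steps=20)           # grows as 1.5 ** 20
clipped = backprop_gradient(1.5, n_steps=20, clip=50.0)  # saturates at the clip value
truncated = backprop_gradient(1.5, n_steps=5)            # fewer unrolled steps
```

With `w > 1` the unclipped gradient grows exponentially in the number of steps, while clipping bounds it and truncation shortens the product.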

> ##### LSTM params
The GRULayer and LSTMLayer utilize the Gate class, which is essentially just a container for parameter initializers. The LSTMLayer initializer accepts four Gate instances - one for the input gate, one for the forget gate, one for the cell, and one for the output gate. From the lasagne code base: the bias of the forget gate is often initialized to a large positive value to encourage the layer to initially remember the cell value; see e.g. _"Learning to forget: Continual prediction with LSTM."_ page 15.
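
The effect of the forget gate bias can be estimated with a back-of-the-envelope calculation. The sketch below assumes that the forget gate pre-activation equals its bias, i.e. it ignores the input and recurrent contributions, so it is only indicative of the behaviour at initialization; `retained_fraction` is a hypothetical helper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def retained_fraction(forget_bias, n_steps):
    """Fraction of the initial cell value left after n_steps, assuming the
    forget gate activation is sigmoid(forget_bias) at every step."""
    return sigmoid(forget_bias) ** n_steps

with_zero_bias = retained_fraction(0.0, 20)  # 0.5 ** 20, almost nothing is kept
with_bias_five = retained_fraction(5.0, 20)  # roughly 0.87 of the cell value is kept
```

Under this simplification, a bias of 5 lets most of the cell value survive 20 time steps, whereas a bias of 0 wipes it out almost completely.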

#### Fully-connected layer
As mentioned above, recurrent layers and feed-forward layers expect different input shapes. The output of `l_lstm` will be of shape `(n_batch, n_time_steps, opts.hidden_size)`. If we fed this into a non-recurrent layer, it would think that the n_time_steps dimension was a "feature" dimension, when in fact it's a "sample" dimension. That is, each index in the second dimension should be treated as a different sample, and a non-recurrent lasagne layer would instead treat them as different feature values, which would be incorrect. Fortunately, the ReshapeLayer makes combining these conventions very convenient - we just combine the first and second dimension so that there are essentially `n_batch*n_time_steps` individual samples before using any non-recurrent layers, then (optionally) reshape the output back to the original shape. Note that when only the output of the network at the end of the sequence is needed, this can also be done using a SliceLayer (as in the recurrent.py example included with lasagne, and in the second flavour below), which is a bit more efficient. In the first flavour we need the output at every time step, so we use the ReshapeLayer.
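
The reshape trick can be checked in plain NumPy. In this sketch the random arrays `lstm_out` and `W` are stand-ins for the recurrent layer output and the dense layer weights (bias and softmax are left out); the sizes are arbitrary.

```python
import numpy as np

n_batch, n_time_steps, n_hidden, n_out = 2, 3, 4, 5
rng = np.random.RandomState(0)
lstm_out = rng.randn(n_batch, n_time_steps, n_hidden)  # stand-in for the recurrent output
W = rng.randn(n_hidden, n_out)                         # stand-in for the dense weights

# Merge batch and time so that every time step becomes its own sample.
flat = lstm_out.reshape(n_batch * n_time_steps, n_hidden)
dense_out = flat.dot(W)                                # shape (n_batch * n_time_steps, n_out)

# Optionally restore the (batch, time, features) layout.
restored = dense_out.reshape(n_batch, n_time_steps, n_out)
```

Row `i * n_time_steps + t` of the flattened array corresponds to time step `t` of sequence `i`, which is also the order in which the targets have to be flattened.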

## Flavour 1 - seq2seq training, seq2sample prediction

In the first flavour of the CharRNN we adopt the following procedures. In the training phase we employ a sequence 2 sequence procedure. That is, for every input character, we predict the subsequent character. At test time, to sample from our model, we start with a bootstrap text sequence for which we predict the single next character. After that, the input for the next time step is `bootstrap_text[1:]+predicted_char`, for which we predict the following character etc.
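
The sliding-window sampling scheme can be sketched independently of the network. Here `predict_proba` is a hypothetical callable that maps a text window to a probability distribution over the vocabulary, and `uniform_model` is a dummy stand-in for a trained model.

```python
import numpy as np

def sample_sliding_window(predict_proba, bootstrap_text, vocab, length, rng):
    """Generate `length` characters, always feeding the most recent window."""
    window = bootstrap_text
    generated = []
    for _ in range(length):
        probs = predict_proba(window)            # distribution over vocab
        next_char = vocab[rng.choice(len(vocab), p=probs)]
        generated.append(next_char)
        window = window[1:] + next_char          # drop the oldest char, append the new one
    return ''.join(generated)

vocab = ['a', 'b', 'c']

def uniform_model(window):
    # Hypothetical stand-in for a trained network: ignores its input.
    return np.ones(len(vocab)) / len(vocab)

text = sample_sliding_window(uniform_model, 'abcab', vocab, length=10,
                             rng=np.random.RandomState(0))
```

The window keeps a constant length, so the model always sees the same number of time steps as during training.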

In lasagne every input sequence is treated separately by default. That is, the hidden and cell states get reset for every new input (so no state is transferred to the following input; this will be the subject of one of the following CharRNN flavours). We will use an input sequence length of 25 chars and we backpropagate the gradient for 20 timesteps, so five timesteps are used to "bootstrap" the hidden and cell states. With the data offset of 15 set in the options below, we divide our data in sequences as follows: data[0:25], data[15:40], data[30:55] etc. So there will be overlap in the input sequences.
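
The way the data is cut into overlapping sequences can be made explicit with a small helper. This is a sketch using `seq_len = 25` and `data_offset = 15`, the values set in the `opts` class; the function `window_bounds` and the data length of 100 are made up for illustration.

```python
def window_bounds(data_len, seq_len, data_offset):
    """Return the (start, end) slice bounds of the overlapping input sequences."""
    bounds = []
    start = 0
    # Each window also needs one extra character as the target of its last step.
    while start + seq_len + 1 <= data_len:
        bounds.append((start, start + seq_len))
        start += data_offset
    return bounds

bounds = window_bounds(data_len=100, seq_len=25, data_offset=15)
overlap = bounds[0][1] - bounds[1][0]  # characters shared by consecutive windows
```

Consecutive windows share `seq_len - data_offset` characters, so a smaller offset yields more (and more strongly overlapping) training sequences.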

In [3]:
# Hold options as static element in the opts class
class opts():
    hidden_size = 50
    seq_len = 25         # Data sequence length
    gradient_steps = 20  # Truncated BPTT length
    data_offset = 15     # Offset for every new input sequence
    batch_size = 50
    n_epochs = 100
    lr = 0.1

In [4]:
def build_rnn_1(input_var, dim):
    # By setting the first and second dimensions to None, we allow arbitrary minibatch
    # sizes with arbitrary sequence lengths.
    # In this case, we know the sequence length beforehand.
    # We leave the batch size undetermined, so that the model can be used for single value prediction.
    # The dim variable will be equal to the size of the vocabulary (one-hot char representation).
    # ----- INPUT LAYER -----
    l_in = lasagne.layers.InputLayer(shape=(None, opts.seq_len, dim), input_var=input_var)
    
    # The following input can be used to provide the network with masks.
    # Masks are expected to be matrices of shape (n_batch, n_time_steps);
    # Since all our sequences will be of equal length, we don't need the masks here.
    # l_mask = lasagne.layers.InputLayer(shape=(None, None))
    
    # The convention is that gates use the standard sigmoid nonlinearity,
    # which is the default for the Gate class.
    io_gate_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        b=lasagne.init.Constant(0.))
    
    forget_gate_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        b=lasagne.init.Constant(5.))
    
    cell_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        # Setting W_cell to None denotes that no cell connection will be used.
        W_cell=None, b=lasagne.init.Constant(0.),
        # By convention, the cell nonlinearity is tanh in an LSTM.
        nonlinearity=lasagne.nonlinearities.tanh)
    
    # ----- LSTM LAYER -----
    l_lstm = lasagne.layers.recurrent.LSTMLayer(
        l_in, opts.hidden_size,
        # We need to specify a separate input for masks (not needed here)
        # mask_input=None,
        # Here, we supply the gate parameters for each gate
        ingate=io_gate_parameters, forgetgate=forget_gate_parameters,
        cell=cell_parameters, outgate=io_gate_parameters,
        # We'll learn the initialization and use gradient clipping
        learn_init=True, grad_clipping=50., gradient_steps=opts.gradient_steps)
    
    # ----- FC LAYER -----
    # First reshape so that output at every time step is treated as separate sample
    # Since batch size is unknown, we have to determine it first.
    batch_size, _, _ = l_in.input_var.shape
    l_reshape = lasagne.layers.ReshapeLayer(l_lstm, (batch_size * opts.seq_len, opts.hidden_size))
    l_dense = lasagne.layers.DenseLayer(
        l_reshape, num_units=dim, nonlinearity=lasagne.nonlinearities.softmax)
    # The output size of the network is thus (batch_size * opts.seq_len, dim)
    
    return l_dense

#### Data processing.
Remember, the convention used in lasagne is that sequential data is presented in the shape `(batch_size, n_time_steps, n_features_1, n_features_2, ...)`. Also remember how we will construct our input sequences with the proper offset.

In [5]:
data = open('tinyshakespeare.txt', 'r').read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print ('Vocabulary size = ' + str(vocab_size) + '; total data size = ' + str(data_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# Define function to get batches of preprocessed data.
def get_batch(b):
    if (b+1)*opts.batch_size*opts.data_offset - opts.data_offset + opts.seq_len + 1 >= len(data):
        return None, None
    X = np.zeros((opts.batch_size, opts.seq_len, vocab_size), dtype=theano.config.floatX)
    y = np.zeros((opts.batch_size, opts.seq_len, vocab_size), dtype=np.int8)
    
    for i in xrange(opts.batch_size):
        c = b*opts.data_offset*opts.batch_size + opts.data_offset*i
        for j in xrange(opts.seq_len):
            X[i, j, char_to_ix[data[c]]] = 1.0
            y[i, j, char_to_ix[data[c+1]]] = 1.0
            c += 1
    
    return X, y.reshape((opts.batch_size*opts.seq_len, vocab_size))

Vocabulary size = 65; total data size = 1115394
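
The encoding performed inside `get_batch()` can be illustrated on a short string. This sketch uses a toy text and sorts the character set so that the index order is reproducible; it encodes a single sequence rather than a batch.

```python
import numpy as np

text = 'hello'
chars = sorted(set(text))                  # sorted for a reproducible index order
char_to_ix = {ch: i for i, ch in enumerate(chars)}

inputs, targets = text[:-1], text[1:]      # 'hell' -> 'ello'
X = np.zeros((len(inputs), len(chars)), dtype='float32')
y = np.zeros((len(inputs), len(chars)), dtype='int8')
for j, (ch_in, ch_out) in enumerate(zip(inputs, targets)):
    X[j, char_to_ix[ch_in]] = 1.0          # one-hot input character
    y[j, char_to_ix[ch_out]] = 1           # one-hot next character

# Taking the argmax of every row recovers the original characters.
decoded_inputs = ''.join(chars[i] for i in X.argmax(axis=1))
decoded_targets = ''.join(chars[i] for i in y.argmax(axis=1))
```

The targets are simply the inputs shifted by one character, which is exactly the sequence 2 sequence setup of this flavour.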


Let's test the `get_batch()` function. The following code should print the same output twice.

In [6]:
X,y = get_batch(10)
for i in xrange(opts.data_offset, opts.seq_len):
    sys.stdout.write(ix_to_char[np.argmax(X[4][i])])
print("\n--")
for i in xrange(opts.data_offset, opts.seq_len):
    sys.stdout.write(ix_to_char[np.argmax(X[5][i-opts.data_offset])])

 you curs,
--
 you curs,

#### Main program.

In [7]:
# Saving and reading model parameter functions
import cPickle
def save_params(network, filename):
    params = lasagne.layers.get_all_param_values(network)
    with open(filename, 'wb') as f:
        cPickle.dump(params, f)

def load_params(network, filename):
    params = None
    with open(filename, 'rb') as f:
        params = cPickle.load(f)
    lasagne.layers.set_all_param_values(network, params)

Create the necessary tensor variables, create the network, define the loss function, define the parameters to be learned, and compile the train and sample functions.

In [8]:
input_var = T.tensor3('inputs')
output_var = T.bmatrix('outputs') # the outputs will be flattened over 1st and 2nd dimension to reflect
                                 # dense layer output

network = build_rnn_1(input_var, vocab_size)

# Now predict the cost of batch in terms of a loss function
network_output = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(network_output, output_var).mean()

# Retrieve all network params
all_params = lasagne.layers.get_all_params(network)

# Compute adam updates for training
updates = lasagne.updates.adam(loss, all_params)

# Theano function for training and computing cost
train = theano.function(
    [input_var, output_var],
    loss, updates=updates)

# Theano function to sample from the RNN; we only keep the last row of the output,
# i.e. the prediction at the final time step when the batch size is 1
sample = theano.function(
    [input_var], network_output[-1,:])

We also need a function to sample text from the RNN in order to babysit the learning process.

In [9]:
def sample_text(length=200):
    # First take a random piece of bootstrap text
    start = np.random.randint(0, len(data)-opts.seq_len)
    s = data[start:start+opts.seq_len]
    
    # Convert to proper input data shape (here, batch size = 1)
    s_np = np.zeros((1, opts.seq_len, vocab_size), dtype=theano.config.floatX)
    for i in xrange(opts.seq_len):
        s_np[0, i, char_to_ix[s[i]]] = 1.0
    
    # Start sampling loop
    res = ''
    for k in xrange(length):
        # Predict next character
        predict = sample(s_np)
        predict_i = np.random.choice(range(vocab_size), p=predict.ravel())
        res += ix_to_char[predict_i]
        
        # Update s_np
        s_np[0, 0:-1, :] = s_np[0, 1:, :]
        s_np[0, -1, :] = 0.0
        s_np[0, -1, predict_i] = 1.0
    
    return res

In [10]:
# Let's test the function on an untrained RNN
print(sample_text(200))

Hxjb.R,;Q;elhTNqI.NT,ErKprJADOM'LDuRdD ySbOZl$Qh-ZdZRDTEFl$'Dixp?3QDZ;O:OTQgqmGjV!WQv?cPbvl:hu3p?lapjKYMJ-&LY&VUdknGIyYqTnJSb-Gy&bEOmt-DBcZBd?U:3zA
iklhkOArFPsl!eTRcE;$EQ:uPbxSnspkwsPOKAO:a;hXMoEa:x?d


Time to train the RNN. After every epoch, we will also sample a text from the RNN.

In [84]:
# Train procedure
print("Start training RNN...")
for epoch in range(opts.n_epochs):
    cost  = 0.0
    b = 0.0
    while True:
        X, y = get_batch(int(b))
        if X is None or y is None:
            break
        cost += train(X, y)
        b += 1.0
    print("\nEpoch {} with {} batches, cost = {}\n".format(epoch + 1, int(b), cost / b))
    #print("Saving...")
    #save_params(network, "params_"+str(epoch + 1)+".dump")
    # Sampling
    print(sample_text(100))

Start training RNN...


KeyboardInterrupt: 

## Flavour 2 - seq2sample training, seq2sample prediction

In the second flavour of the CharRNN we use a sequence 2 sample training procedure. In this case, we predict only the final next character for a given input sequence. At test time, we proceed in the same way as in the first flavour.

We keep the same options as before, but in this case it is reasonable to set the data_offset to 1.

In [1]:
# Hold options as static element in the opts class
class opts():
    hidden_size = 50
    seq_len = 25         # Data sequence length
    gradient_steps = 20  # Truncated BPTT length
    data_offset = 1      # Offset for every new input sequence
    batch_size = 50
    n_epochs = 100
    lr = 0.1

In [2]:
def build_rnn_2(input_var, dim):
    # ----- INPUT LAYER -----
    l_in = lasagne.layers.InputLayer(shape=(None, opts.seq_len, dim), input_var=input_var)

    io_gate_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        b=lasagne.init.Constant(0.))
    
    forget_gate_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        b=lasagne.init.Constant(5.))
    
    cell_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        W_cell=None, b=lasagne.init.Constant(0.),
        nonlinearity=lasagne.nonlinearities.tanh)
    
    # ----- LSTM LAYER -----
    l_lstm = lasagne.layers.recurrent.LSTMLayer(
        l_in, opts.hidden_size,
        ingate=io_gate_parameters, forgetgate=forget_gate_parameters,
        cell=cell_parameters, outgate=io_gate_parameters,
        learn_init=True, grad_clipping=50., gradient_steps=opts.gradient_steps)
    
    # ----- SLICE LAYER -----
    # We only need the final output of the LSTM layer
    # Output of this layer now has shape (batch_size, opts.hidden_size)
    l_slice = lasagne.layers.SliceLayer(l_lstm, indices=-1, axis=1)
    
    # ----- FC LAYER -----
    l_dense = lasagne.layers.DenseLayer(
        l_slice, num_units=dim, nonlinearity=lasagne.nonlinearities.softmax)
    # The output size of the network is thus (batch_size, dim)
    
    return l_dense

#### Data processing.

In [67]:
data = open('tinyshakespeare.txt', 'r').read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print ('Vocabulary size = ' + str(vocab_size) + '; total data size = ' + str(data_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# Define function to get batches of preprocessed data.
def get_batch(data, b, b_size):
    if (b+1)*b_size*opts.data_offset - opts.data_offset + opts.seq_len + 1 >= len(data):
        return None, None
    X = np.zeros((b_size, opts.seq_len, vocab_size), dtype=theano.config.floatX)
    y = np.zeros((b_size, vocab_size), dtype=np.int8)
    
    for i in xrange(b_size):
        c = b*opts.data_offset*b_size + opts.data_offset*i
        for j in xrange(opts.seq_len):
            X[i, j, char_to_ix[data[c]]] = 1.0
            c += 1
        y[i, char_to_ix[data[c]]] = 1.0
    
    return X, y

Vocabulary size = 65; total data size = 1115394


Test the `get_batch()` function, which seems OK.

In [70]:
X,y = get_batch(data, 10, opts.batch_size)
for i in xrange(opts.seq_len):
    sys.stdout.write(ix_to_char[np.argmax(X[4][i])])
print("")
print(ix_to_char[np.argmax(y[4])])
print("--")
for i in xrange(opts.seq_len):
    sys.stdout.write(ix_to_char[np.argmax(X[5][i])])
print("")
print(ix_to_char[np.argmax(y[5])])

izens, the patricians goo
d
--
zens, the patricians good
.


#### Main program.

In [12]:
input_var = T.tensor3('inputs')
output_var = T.bmatrix('outputs') # one one-hot target row per input sequence, matching the
                                 # dense layer output of shape (batch_size, vocab_size)

network = build_rnn_2(input_var, vocab_size)

# Now predict the cost of batch in terms of a loss function
network_output = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(network_output, output_var).mean()

# Retrieve all network params
all_params = lasagne.layers.get_all_params(network)

# Compute adam updates for training
updates = lasagne.updates.adam(loss, all_params)

# Theano function for training and computing cost
train = theano.function(
    [input_var, output_var],
    loss, updates=updates)

# Theano function to sample from the RNN; we only keep the prediction for the last
# sequence in the batch (sampling uses a batch size of 1)
sample = theano.function(
    [input_var], network_output[-1,:])

In [51]:
def sample_text(data, length=200):
    # First take a random piece of bootstrap text
    start = np.random.randint(0, len(data)-opts.seq_len)
    s = data[start:start+opts.seq_len]
    
    # Convert to proper input data shape (here, batch size = 1)
    s_np = np.zeros((1, opts.seq_len, vocab_size), dtype=theano.config.floatX)
    for i in xrange(opts.seq_len):
        s_np[0, i, char_to_ix[s[i]]] = 1.0
    
    # Start sampling loop
    res = ''
    for k in xrange(length):
        # Predict next character
        predict = sample(s_np)
        predict_i = np.random.choice(range(vocab_size), p=predict.ravel())
        res += ix_to_char[predict_i]
        
        # Update s_np
        s_np[0, 0:-1, :] = s_np[0, 1:, :]
        s_np[0, -1, :] = 0.0
        s_np[0, -1, predict_i] = 1.0
    
    return res

In [52]:
# Let's test again the sample_text function on this very untrained RNN
print(sample_text(data, 200))

had de andes wrat wy he fnavert mard kinhe gsend is I ubmes yat ay to s fot cond dat mill hinc
OQ an mr thans, sheaperof suifor porell.

BKOANLARTA:
AZd dency thul thas !y, rold in af Greard to thult 


Now train the flavour-2 RNN.

In [18]:
# Train procedure
print("Start training RNN...")
counter = 0
for epoch in range(opts.n_epochs):
    cost  = 0.0
    b = 0.0
    while True:
        X, y = get_batch(data, int(b), opts.batch_size)
        if X is None or y is None:
            break
        cost += train(X, y)
        b += 1.0
        counter += 1   
        if counter % 1000 == 0:
            print(sample_text(data, 100))
    print("\nEpoch {} with {} batches, cost = {}\n".format(epoch + 1, int(b), cost / b))

Start training RNN...
,
Dhit ipsendvime indese?

Th ttaout, syold fi, hin, tho'r ons hompcone
or fo su wars;
kne!e, the!;

dt porefou woven as,
MI I bace coand sfity amyou: heronl,,
CCalergollI hafom.

CORIOLANUNh:
Mef
Oy a
sgt:
Fer ceanr poristins wor omn;inhalg fise briplne,
Wo besir:
Fa niss-iend hadr comet tan toos sol
d afand baclat lact ford por:
indest farer lstere lave npuellom hurner costhou mothe.
For tree fore 


KeyboardInterrupt: 

## Perplexity

Before moving on to the third flavour, let's evaluate our model.
Language models are often evaluated through a measure called perplexity. For a word language model, perplexity is defined as follows:
$$
\mathrm{Perplexity} = \exp\left(\frac{-\sum_{k=1}^{N}\log \mathrm{P}(w_k|w_{1:k-1})}{N} \right)
$$
... in which $N$ is the sequence length. In our case we don't use words, but characters. However, perplexity can still be calculated on a separate test set. In the case of sequence2sample prediction, we can simply use the Theano function `sample()` we compiled before.

In [75]:
def raw_perplexity(input_data, input_targets):
    # Assume input_data.shape = (batch_size, seq_len, vocab_size)
    # Assume input_targets.shape = (batch_size, vocab_size)
    batch_size = input_data.shape[0]
    
    num = 0
    counter = 0
    for b in xrange(batch_size):
        sample_output = sample(np.expand_dims(input_data[b], axis=0))
        p = np.sum(sample_output * input_targets[b])
        num -= np.log(p)
        counter += 1
    
    return num, counter

Let's test it for a sample batch. Lower perplexity is always better.

In [76]:
ba_X, ba_y = get_batch(data, 10, opts.batch_size)
num, den = raw_perplexity(ba_X, ba_y)
print(np.exp(num / den))

11.050741546
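
As a sanity check on the scale of such numbers, perplexity can be computed directly from the probabilities a model assigns to the true characters. The helper `perplexity` below is a hypothetical NumPy sketch of the formula above, not the `raw_perplexity()` function itself.

```python
import numpy as np

def perplexity(target_probs):
    """exp of the average negative log-probability assigned to the true symbols."""
    target_probs = np.asarray(target_probs, dtype='float64')
    return float(np.exp(-np.mean(np.log(target_probs))))

uniform = perplexity(np.full(10, 1.0 / 65))  # uniform guess over 65 characters
perfect = perplexity(np.ones(10))            # every character predicted with certainty
```

A model that guesses uniformly over the 65-character vocabulary has a perplexity of 65, and a perfect model has a perplexity of 1, so useful models fall somewhere in between.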


We now add perplexity evaluation during model training. We will train on a specified train set, and evaluate perplexity on the test set every 500 batches. The output below shows that the perplexity on the test set indeed decreases.

In [78]:
train_portion = 0.99
train_data = data[0:int(train_portion*len(data))]
test_data = data[int(train_portion*len(data)):]

# Train procedure
print("Start training RNN...")
counter = 0
for epoch in range(opts.n_epochs):
    cost  = 0.0
    b = 0.0
    while True:
        X, y = get_batch(train_data, int(b), opts.batch_size)
        if X is None or y is None:
            break
        cost += train(X, y)
        b += 1.0
        counter += 1   
        if counter % 500 == 0:
            # Sample using test_data
            print(sample_text(test_data, 100))
            # Evaluate perplexity
            num, den = 0.0, 0.0
            tb = 0
            while True:
                Xt, yt = get_batch(test_data, tb, 1000)
                if Xt is None or yt is None:
                    break
                n2, d2 = raw_perplexity(Xt, yt)
                num += n2
                den += d2
                tb += 1
            print("Perplexity: " + str(np.exp(num/den)) + '\n')
    print("\nEpoch {} with {} batches, cost = {}\n".format(epoch + 1, int(b), cost / b))

Start training RNN...
le quy beand prascirsely thoupc; ints pavinsd were, linor omemud
Thice poor butuut cruc!
Thu efos 'p
Perplexity: 10.6966303961

g borfons wer.

HENENIUS:
Hot shane baod torst! comelforisef tho reTporgis you han, ard be lut to pf
Perplexity: 10.4050989616



KeyboardInterrupt: 

## Flavour 3 - seq2seq training, sample2sample prediction

In the third flavour of the CharRNN we again use the sequence 2 sequence training procedure. To sample text from the model, we proceed sample by sample, keeping track of the hidden and cell states of the LSTM layer in between samples. This is probably the most logical method to sample from an RNN.
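
Why carrying the state between samples works can be shown with a toy recurrent network. This sketch assumes a plain tanh RNN with random weights, a simplification of the LSTM (which additionally carries a cell state); all names are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, hidden_size, seq_len = 5, 4, 6
W_in = rng.randn(vocab_size, hidden_size) * 0.5
W_hid = rng.randn(hidden_size, hidden_size) * 0.5

def rnn_step(x, h):
    # One step of a plain tanh RNN (an LSTM would also update a cell state).
    return np.tanh(x.dot(W_in) + h.dot(W_hid))

def run_sequence(xs, h0):
    h = h0
    for x in xs:
        h = rnn_step(x, h)
    return h

xs = np.eye(vocab_size)[rng.randint(0, vocab_size, size=seq_len)]  # one-hot inputs
h0 = np.zeros(hidden_size)

# Whole sequence in a single call ...
h_full = run_sequence(xs, h0)

# ... versus one character per call, carrying the state between calls.
h_carried = h0
for x in xs:
    h_carried = run_sequence([x], h_carried)
```

Both routes end in the same hidden state, so feeding one character at a time loses nothing as long as the state is passed along.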

Implementing a sample2sample prediction in lasagne is not straightforward. The standard implementation of the `get_output_for()` function in the `LSTMLayer` always starts with the initial (learnt or given) hidden and cell states and only returns the final hidden state to be used in the next layer. This final hidden state is however not stored, nor is the final cell state. In other words, if we sample from the model, we will always start with the initial hidden and cell states. To work around this, we will use extra input layers for these initial states. Furthermore, the `get_output_for()` function only returns the final hidden state, while we also need the final cell state. For this purpose, I created an adapted version of the `LSTMLayer`, called `LSTMLayer_v2`, in which I copied the `get_output_for()` function and changed it so that the cell state is returned as well. The consequence is that we cannot use the helper function `get_output()` anymore, so we will have to go through the network manually...

In [9]:
import LSTMLayer_v2
LSTMLayer_v2 = reload(LSTMLayer_v2)

In [10]:
# Hold options as static element in the opts class
class opts():
    hidden_size = 50
    seq_len = 25         # Data sequence length
    gradient_steps = 20  # Truncated BPTT length
    data_offset = 15     # Offset for every new input sequence
    batch_size = 50
    n_epochs = 100
    lr = 0.1

To train the RNN we will use the `build_rnn_1()` function from before. To sample from the RNN, we will use a different model with the `LSTMLayer_v2` (and we will load the parameters from the train model at prediction time). We explicitly turn off the `learn_init` argument, and we will use extra input layers to provide the initial hidden and cell states.

In [13]:
def build_test_rnn_1(input_var, hid_var, cell_var, dim):
    # ----- INPUT LAYER -----
    l_in = lasagne.layers.InputLayer(shape=(1, 1, dim), input_var=input_var)
    
    io_gate_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        b=lasagne.init.Constant(0.))
    
    forget_gate_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        b=lasagne.init.Constant(5.))
    
    cell_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        W_cell=None, b=lasagne.init.Constant(0.),
        nonlinearity=lasagne.nonlinearities.tanh)
    
    # ----- LSTM LAYER -----
    # Here we initialize extra input layers for the initial hidden and cell states
    l_hid = lasagne.layers.InputLayer(shape=(1, opts.hidden_size), input_var=hid_var)
    l_cell = lasagne.layers.InputLayer(shape=(1, opts.hidden_size), input_var=cell_var)
    l_lstm = LSTMLayer_v2.LSTMLayer_v2(
        l_in, opts.hidden_size,
        ingate=io_gate_parameters, forgetgate=forget_gate_parameters,
        cell=cell_parameters, outgate=io_gate_parameters,
        hid_init=hid_var, cell_init=cell_var,
        learn_init=False, grad_clipping=50., gradient_steps=opts.gradient_steps)
    
    # ----- SLICE LAYER -----
    l_slice = lasagne.layers.SliceLayer(l_lstm, indices=0, axis=1)
    
    # ----- FC LAYER -----
    l_dense = lasagne.layers.DenseLayer(
        l_slice, num_units=dim, nonlinearity=lasagne.nonlinearities.softmax)
    
    return l_dense, l_lstm, l_hid, l_cell

#### Data processing.

In [12]:
data = open('tinyshakespeare.txt', 'r').read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print ('Vocabulary size = ' + str(vocab_size) + '; total data size = ' + str(data_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# Define function to get batches of preprocessed data.
def get_batch(data, b, b_size):
    if (b+1)*b_size*opts.data_offset - opts.data_offset + opts.seq_len + 1 >= len(data):
        return None, None
    X = np.zeros((b_size, opts.seq_len, vocab_size), dtype=theano.config.floatX)
    y = np.zeros((b_size, opts.seq_len, vocab_size), dtype=np.int8)
    
    for i in xrange(b_size):
        c = b*opts.data_offset*b_size + opts.data_offset*i
        for j in xrange(opts.seq_len):
            X[i, j, char_to_ix[data[c]]] = 1.0
            y[i, j, char_to_ix[data[c+1]]] = 1.0
            c += 1
    
    return X, y.reshape((b_size*opts.seq_len, vocab_size))

Vocabulary size = 65; total data size = 1115394


#### Main program.

In [92]:
input_var = T.tensor3('inputs')
output_var = T.bmatrix('outputs') # the outputs will be flattened over 1st and 2nd dimension to reflect
                                 # dense layer output

network = build_rnn_1(input_var, vocab_size)
# Now predict the cost of batch in terms of a loss function
network_output = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(network_output, output_var).mean()
# Retrieve all network params
all_params = lasagne.layers.get_all_params(network)
# Compute adam updates for training
updates = lasagne.updates.adam(loss, all_params)
# Theano function for training and computing cost
train = theano.function(
    [input_var, output_var],
    loss, updates=updates)

# hold the initial hidden state of LSTM layer
hid_var = theano.shared(np.zeros((1,opts.hidden_size), dtype=theano.config.floatX))
# hold the initial cell state of LSTM layer
cell_var = theano.shared(np.zeros((1,opts.hidden_size), dtype=theano.config.floatX))

test_network, test_lstm_layer, test_hid, test_cell = build_test_rnn_1(input_var, hid_var, cell_var, vocab_size)
test_network_output = lasagne.layers.get_output(test_network)

# Theano function to sample from the RNN at test/prediction time
# Make sure that also the final hidden and cell states are returned
from collections import OrderedDict
hid_out, cell_out = test_lstm_layer.get_hid_cell_for([input_var, test_hid.input_var, test_cell.input_var])
# The following is essential, as it updates the hidden and cell state variables
# to the hidden and cell outputs after the prediction.
sample_updates = OrderedDict()
sample_updates[hid_var] = hid_out[-1, :, :]
sample_updates[cell_var] = cell_out[-1, :, :]
sample = theano.function(
    [input_var],
    test_network_output[-1,:],
    updates=sample_updates)

In [105]:
def sample_text(data, length=200):
    # First take a random piece of bootstrap text
    start = np.random.randint(0, len(data)-opts.seq_len)
    s = data[start:start+opts.seq_len]
    
    # Convert to proper input data shape (here, batch size = 1)
    s_np = np.zeros((1, opts.seq_len, vocab_size), dtype=theano.config.floatX)
    for i in xrange(opts.seq_len):
        s_np[0, i, char_to_ix[s[i]]] = 1.0
        
    # Feed into network sequentially and ignore all but last sample (bootstrapping hidden/cell states)
    for char in xrange(opts.seq_len):
        predict = sample(s_np[:, [char], :])
    
    # Start sampling loop
    res = ''
    predict_i = np.random.choice(range(vocab_size), p=predict.ravel())
    s_np = np.zeros((1, 1, vocab_size), dtype=theano.config.floatX)
    s_np[0, 0, predict_i] = 1.0
    for k in xrange(length):
        # Predict next character
        predict = sample(s_np)
        predict_i = np.random.choice(range(vocab_size), p=predict.ravel())
        res += ix_to_char[predict_i]
        
        # Update s_np
        s_np[0, 0, :] = 0.0
        s_np[0, 0, predict_i] = 1.0
    
    return res

In [106]:
# Let's test again the sample_text function on an untrained RNN
print(sample_text(data, 200))

iddsb$wYsJOG!J!bLjTni;3phGdCc Y-sqeQomoJ;jXvSuZvSvv'ezT-;ir3g MDla
RfWaY3iIw:eItmgv vYtzKALtTR3Ysi dSCCMd iDmGsyCmG
$SOosKq: UDDCm:T.r$symVOidDo:z'u!mOs-sTzVXjov:ZDgC 'pFFiTMa
GijstoMDMWSD.TQ?!!$'FmD 


We now have two networks, one to train and one to sample from. Prior to sampling it is important to set the parameters of the test network to the parameter values of the train network. This can be done through `lasagne.layers.set_all_param_values(test_network, lasagne.layers.get_all_param_values(network))`. This will also set the `hid_init` and `cell_init` values - which are learnt in the train network - correctly in the test network! This is demonstrated below.

In [95]:
hid_var.set_value(np.ones((1,50))*5.0)
print(hid_var.get_value())
lasagne.layers.set_all_param_values(test_network, lasagne.layers.get_all_param_values(network))
print(hid_var.get_value())

[[ 5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.
   5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.
   5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.  5.]]
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]


Let's combine training and sampling.

In [97]:
train_portion = 0.99
train_data = data[0:int(train_portion*len(data))]
test_data = data[int(train_portion*len(data)):]

# Train procedure
print("Start training RNN...")
for epoch in range(opts.n_epochs):
    cost  = 0.0
    b = 0.0
    while True:
        X, y = get_batch(train_data, int(b), opts.batch_size)
        if X is None or y is None:
            break
        cost += train(X, y)
        b += 1.0
    print("\nEpoch {} with {} batches, cost = {}\n".format(epoch + 1, int(b), cost / b))
    # Update parameter values
    lasagne.layers.set_all_param_values(test_network, lasagne.layers.get_all_param_values(network))
    # Sample using test_data
    print(sample_text(test_data, 100))

Start training RNN...

Epoch 1 with 1472 batches, cost = 2.66165176926

owI AIO:
euh gherethes fwodh sasthi'tong;
Bus andl'btd.

TPrORI:
y what noo wirint, nor wiwe the tha

Epoch 2 with 1472 batches, cost = 2.27936997562

ald t.
NKVAnITLE:
Soot hhat mave gith:
Whind hasl, o cime whas ald steparet you teppoirtn tobe sinnt

Epoch 3 with 1472 batches, cost = 2.15341372483

urd.

LUTENTBANO:
Why lith the wipine pen's tase lothat will bot;
What ship me brast belt rammel som


KeyboardInterrupt: 

## Flavour 4 - seq2seq training with state remembrance, sample2sample prediction

In the final flavour of the CharRNN we again use the sample2sample prediction from above; that is, we keep track of the hidden and cell states while sampling each new character. The training procedure now follows the same logic. We still train sequence to sequence, e.g. `data[0:25] -> data[1:26]`, but we now remember the hidden state at time step 1. Next, starting from this hidden state, we train `data[1:26] -> data[2:27]` and remember hidden state 2, and so on. The data offset between consecutive sequences can of course be made larger than 1. Note that this training procedure is generally only applicable with a batch size of 1. Of the four flavours, it is the one most closely related to Karpathy's charRNN gist. The advantage is that we can safely propagate the gradient through the entire input sequence, since we no longer need to bootstrap the hidden states.
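
To make the indexing concrete, the following minimal sketch lists the input and target windows that this scheme visits. The helper name `training_windows` is hypothetical and not part of the notebook; it only mirrors the slicing described above, with `data_offset` controlling how far the window advances between training steps.

```python
def training_windows(data, seq_len, data_offset):
    """Yield (input, target) string pairs; hypothetical helper for illustration.

    Each target is the input shifted one character to the right, and
    consecutive windows start `data_offset` characters apart.
    """
    start = 0
    # Stop once the target window would run past the end of the data.
    while start + seq_len + 1 <= len(data):
        yield data[start:start + seq_len], data[start + 1:start + seq_len + 1]
        start += data_offset


text = "hello world"
# Offset 1: windows overlap, like data[0:25] -> data[1:26], data[1:26] -> data[2:27], ...
overlapping = list(training_windows(text, seq_len=4, data_offset=1))
# Offset equal to seq_len: windows tile the text without overlap.
tiled = list(training_windows(text, seq_len=4, data_offset=4))
```

With `data_offset` equal to `seq_len`, as in the options below, the state reached after the last time step of one window is the state needed at the start of the next window.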

We again use our `LSTMLayer_v2` so that the hidden and cell state vectors can be updated, this time in the train network as well. The initial states are therefore not learnt this time (`learn_init=False`).
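
Why carrying the state across training steps works can be checked on a toy recurrence. The sketch below uses a plain tanh RNN in NumPy rather than `LSTMLayer_v2` (a deliberate simplification; the weights, sizes, and the function name `run_rnn` are made up for illustration). It compares one pass over a full sequence with a chunked pass in which the final hidden state of each chunk initialises the next chunk.

```python
import numpy as np

def run_rnn(x, h0, W_in, W_hid):
    """Run a plain tanh RNN over x of shape (n_steps, n_in), starting from h0."""
    h = h0
    for x_t in x:
        h = np.tanh(x_t.dot(W_in) + h.dot(W_hid))
    return h

rng = np.random.RandomState(0)
n_in, n_hid, n_steps, chunk = 5, 3, 12, 4   # illustrative sizes
W_in = rng.randn(n_in, n_hid) * 0.5
W_hid = rng.randn(n_hid, n_hid) * 0.5
x = rng.randn(n_steps, n_in)
h0 = np.zeros(n_hid)

# One pass over the full sequence.
h_full = run_rnn(x, h0, W_in, W_hid)

# Chunked pass: the state left by one chunk initialises the next one.
h_carried = h0
for start in range(0, n_steps, chunk):
    h_carried = run_rnn(x[start:start + chunk], h_carried, W_in, W_hid)

states_match = np.allclose(h_full, h_carried)
```

Only the forward state is compared here; the sketch says nothing about how gradients behave.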

In [98]:
import LSTMLayer_v2
LSTMLayer_v2 = reload(LSTMLayer_v2)

In [1]:
# Hold the options as class attributes of opts
class opts():
    hidden_size = 50
    seq_len = 25         # Data sequence length
    gradient_steps = 25  # Truncated BPTT length
    data_offset = 25     # Offset for every new input sequence
    batch_size = 1
    n_epochs = 100
    lr = 0.1

In [126]:
def build_rnn_4(input_var, hid_var, cell_var, dim):
    # ----- INPUT LAYER -----
    l_in = lasagne.layers.InputLayer(shape=(1, opts.seq_len, dim), input_var=input_var)
    
    io_gate_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        b=lasagne.init.Constant(0.))
    
    forget_gate_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        b=lasagne.init.Constant(5.))
    
    cell_parameters = lasagne.layers.recurrent.Gate(
        W_in=lasagne.init.Orthogonal(), W_hid=lasagne.init.Orthogonal(),
        # Setting W_cell to None denotes that no cell connection will be used.
        W_cell=None, b=lasagne.init.Constant(0.),
        # By convention, the cell nonlinearity is tanh in an LSTM.
        nonlinearity=lasagne.nonlinearities.tanh)
    
    # ----- LSTM LAYER -----
    l_hid = lasagne.layers.InputLayer(shape=(1, opts.hidden_size), input_var=hid_var)
    l_cell = lasagne.layers.InputLayer(shape=(1, opts.hidden_size), input_var=cell_var)
    l_lstm = LSTMLayer_v2.LSTMLayer_v2(
        l_in, opts.hidden_size,
        ingate=io_gate_parameters, forgetgate=forget_gate_parameters,
        cell=cell_parameters, outgate=io_gate_parameters,
        cell_init=l_cell, hid_init=l_hid,
        learn_init=False, grad_clipping=50., gradient_steps=opts.gradient_steps)
    
    # ----- FC LAYER -----
    l_reshape = lasagne.layers.ReshapeLayer(l_lstm, (1 * opts.seq_len, opts.hidden_size))
    l_dense = lasagne.layers.DenseLayer(
        l_reshape, num_units=dim, nonlinearity=lasagne.nonlinearities.softmax)
    # The output size of the network is thus (opts.seq_len, dim)
    
    return l_dense, l_lstm, l_hid, l_cell

#### Data processing.

In [127]:
with open('tinyshakespeare.txt', 'r') as f:
    data = f.read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('Vocabulary size = ' + str(vocab_size) + '; total data size = ' + str(data_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# Define function to get batches of preprocessed data.
def get_batch(data, b, b_size):
    if (b+1)*b_size*opts.data_offset - opts.data_offset + opts.seq_len + 1 >= len(data):
        return None, None
    X = np.zeros((b_size, opts.seq_len, vocab_size), dtype=theano.config.floatX)
    y = np.zeros((b_size, opts.seq_len, vocab_size), dtype=np.int8)
    
    for i in xrange(b_size):
        c = b*opts.data_offset*b_size + opts.data_offset*i
        for j in xrange(opts.seq_len):
            X[i, j, char_to_ix[data[c]]] = 1.0
            y[i, j, char_to_ix[data[c+1]]] = 1.0
            c += 1
    
    return X, y.reshape((b_size*opts.seq_len, vocab_size))

Vocabulary size = 65; total data size = 1115394
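
As a sanity check on the bound at the top of `get_batch`, the sketch below counts how many batches the function hands out before it returns `None`. The helper `count_batches` is hypothetical; it copies only the stopping condition, not the one-hot encoding. With the Flavour 4 settings (`seq_len = 25`, `data_offset = 25`, batch size 1) and the first 99% of the 1115394 characters, it gives 44169 batches, the number that appears in the training log further down.

```python
def count_batches(n_chars, seq_len, data_offset, batch_size):
    """Count batches served before the get_batch stopping condition triggers.

    Hypothetical helper: it reuses the bound from get_batch and nothing else.
    """
    b = 0
    while (b + 1) * batch_size * data_offset - data_offset + seq_len + 1 < n_chars:
        b += 1
    return b


n_train = int(0.99 * 1115394)  # length of the 99% training split of this corpus
n_batches = count_batches(n_train, seq_len=25, data_offset=25, batch_size=1)
```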


#### Main program.

In [128]:
input_var = T.tensor3('inputs')
output_var = T.bmatrix('outputs') # the outputs will be flattened over 1st and 2nd dimension to reflect
                                 # dense layer output

# hold the initial hidden state of LSTM layer
hid_var = theano.shared(np.zeros((1,opts.hidden_size), dtype=theano.config.floatX))
# hold the initial cell state of LSTM layer
cell_var = theano.shared(np.zeros((1,opts.hidden_size), dtype=theano.config.floatX))

network, l_lstm, l_hid, l_cell = build_rnn_4(input_var, hid_var, cell_var, vocab_size)
# Compute the cost of a batch with a loss function
network_output = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(network_output, output_var).mean()
# Retrieve all network params
all_params = lasagne.layers.get_all_params(network, trainable=True)
# Compute adam updates for training
updates = lasagne.updates.adam(loss, all_params)
# Add extra state updates: store the states reached after data_offset steps,
# so that the next sequence starts from them
hid_out, cell_out = l_lstm.get_hid_cell_for([input_var, l_hid.input_var, l_cell.input_var])
updates[hid_var] = hid_out[-1, [opts.data_offset-1], :]
updates[cell_var] = cell_out[-1, [opts.data_offset-1], :]
# Theano function for training and computing cost
train = theano.function(
    [input_var, output_var],
    loss, updates=updates)

test_network, test_lstm_layer, test_hid, test_cell = build_test_rnn_1(input_var, hid_var, cell_var, vocab_size)
test_network_output = lasagne.layers.get_output(test_network)

# Theano function to sample from the RNN at test/prediction time
# Make sure that also the final hidden and cell states are returned
from collections import OrderedDict
hid_out, cell_out = test_lstm_layer.get_hid_cell_for([input_var, test_hid.input_var, test_cell.input_var])
# The following is essential, as it updates the hidden and cell state variables
# to the hidden and cell outputs after the prediction.
sample_updates = OrderedDict()
sample_updates[hid_var] = hid_out[-1, :, :]
sample_updates[cell_var] = cell_out[-1, :, :]
sample = theano.function(
    [input_var],
    test_network_output[-1,:],
    updates=sample_updates)

We use the same `sample_text()` function as above. Let's train and sample. _Do not forget to reset the initial hidden and cell states to zero at the start of each epoch. Also do not forget to store the states before sampling and to restore them afterwards, because sampling overwrites them!_
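
The store-and-restore step can also be wrapped in a small context manager. This is a sketch and not part of the notebook: `preserved_states` is a hypothetical helper, and `FakeShared` is a stand-in so that the snippet runs without Theano. The helper relies only on `get_value()` and `set_value()`, which Theano shared variables provide.

```python
from contextlib import contextmanager

@contextmanager
def preserved_states(*shared_vars):
    """Restore the values of the given variables when the block exits."""
    saved = [v.get_value() for v in shared_vars]
    try:
        yield
    finally:
        for v, value in zip(shared_vars, saved):
            v.set_value(value)


class FakeShared(object):
    """Stand-in for a Theano shared variable (illustration only)."""
    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value


hid = FakeShared([0.1, 0.2])
cell = FakeShared([0.3, 0.4])
with preserved_states(hid, cell):
    # Sampling would overwrite the states here.
    hid.set_value([9.0, 9.0])
    cell.set_value([9.0, 9.0])
```

In the training loop this could be used as `with preserved_states(hid_var, cell_var): print(sample_text(test_data, 100))`.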

In [129]:
train_portion = 0.99
train_data = data[0:int(train_portion*len(data))]
test_data = data[int(train_portion*len(data)):]

# Train procedure
print("Start training RNN...")
for epoch in range(opts.n_epochs):
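    # Reset the states at the start of each epoch (cast to floatX so that
    # set_value also works when floatX is float32)
    hid_var.set_value(np.zeros((1, opts.hidden_size), dtype=theano.config.floatX))
    cell_var.set_value(np.zeros((1, opts.hidden_size), dtype=theano.config.floatX))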
    cost  = 0.0
    b = 0.0
    counter = 0
    while True:
        X, y = get_batch(train_data, int(b), 1)
        if X is None or y is None:
            break
        cost += train(X, y)
        b += 1.0
        counter += 1
        
        if counter % 5000 == 0:
            # Store hidden and cell states
            states = [hid_var.get_value(), cell_var.get_value()]
            # Update parameter values
            lasagne.layers.set_all_param_values(
                test_network,
                lasagne.layers.get_all_param_values(network, trainable=True),
                trainable=True)
            # Sample using test_data
            print(sample_text(test_data, 100) + '\n')
            # Restore states
            hid_var.set_value(states[0])
            cell_var.set_value(states[1])
            
    print("\nEpoch {} with {} batches, cost = {}\n".format(epoch + 1, int(b), cost / b))

Start training RNN...
'ek:
ME: orme, hore'loe con het lall 'nsine clond bdi bee.

Holevicold,
He asy lende I athe the. ur 

 in ge;
Wicher konet Cosed anoud,y
Hatet the ko I Snenge
And ang lowstato Cheilgrasces hivce Richarg

BaWDY:
rt sist Ons wans thin colloug, wind ad frar, coulen, amcne that plore cotith to cort firel wa

t wisk higho
n the the menit menblen, kals the a wrtor, whaslled, Lall worayow,
And the then you dhe


Anks; pnod my nilis Qake not four the knig, the an tis my lobke we elle wiar, thy me Hank mus, that

The, for, what for,
I nouls, quene shees?
Comss.

A LARLEPLAULIULIO:
Gor.

PEONWAN:

TAULEO:
Hangore

hermare no nof ad, stay your us dowht,
Phate wink beiten, and ary, if thoood tell youruesrad?
Firy y


ANAneeas to no weatio haong in.

FRIMIO:
Tiby tol,
Ad it af ar at a To to ksar.

ETUMIO:
Bear
As? I


Epoch 1 with 44169 batches, cost = 2.16966171264

 he cool ablod
Themale nare,
Nome beepadly :
I, all
Threm on he sharler'tn spean: be.

SINIANIUS:
No



KeyboardInterrupt: 