# Using MXNet RNN API

## Introduction

To follow this tutorial:
- Clone this git repository.
- Run the "jupyter notebook" command in the folder that contains this tutorial.
- Execute the cells in order.
    

You will accomplish the following:

- Build a language model using MXNet
- Learn to use BucketSentenceIter to iterate through and bucket variable length sentences
- Learn to use BucketingModule for optimization


## Prerequisites

To complete this tutorial, you need:

- MXNet
- Familiarity with basic MXNet functionalities (iterators, optimizers)

In [1]:
import sys

sys.path.insert(0, "../../python")
import numpy as np
import mxnet as mx

from bucket_io import BucketSentenceIter, default_build_vocab

def Perplexity(label, pred):
    # collapse the time, batch dimension
    label = label.reshape((-1,))
    pred = pred.reshape((-1, pred.shape[-1]))

    loss = 0.
    for i in range(pred.shape[0]):
        loss += -np.log(max(1e-10, pred[i][int(label[i])]))
    return np.exp(loss / label.size)
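
As a quick sanity check on what this metric measures, the sketch below recomputes the same quantity in plain Python for a hand-made batch. The helper name `toy_perplexity` and the probability values are illustrative assumptions, not part of the tutorial's code. A model that spreads its probability uniformly over V words should score a perplexity of about V, and a confident, correct model should approach 1.

```python
import math

def toy_perplexity(labels, probs, floor=1e-10):
    """Exponentiated mean negative log-probability of the correct labels.

    labels: list of int class indices, one per (time step, sample) position.
    probs:  list of per-position probability rows.
    """
    total = 0.0
    for label, row in zip(labels, probs):
        # Clip tiny probabilities so log() never sees zero, as Perplexity does.
        total += -math.log(max(floor, row[label]))
    return math.exp(total / len(labels))

# A uniform guess over 4 words gives a perplexity of about 4.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
print(toy_perplexity([0, 2, 3], uniform))

# A confident, correct model approaches the minimum perplexity of 1.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
print(toy_perplexity([0, 0, 0], confident))
```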




## Preparing the data

Here, we use BucketSentenceIter to iterate through our sentences. It groups sentences of similar length into buckets. A sentence whose length falls between two bucket lengths is padded with 0s up to the length of the smallest bucket that can hold it.
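
The bucketing idea can be sketched in a few lines of plain Python. This is not the BucketSentenceIter implementation: the function name `assign_to_buckets`, the right-side padding, and the choice to discard sentences longer than the largest bucket are assumptions made for this illustration.

```python
import bisect

def assign_to_buckets(sentences, buckets, pad_id=0):
    """Group token-id lists by the smallest bucket that fits, padding on the right."""
    buckets = sorted(buckets)
    grouped = {size: [] for size in buckets}
    dropped = 0
    for sent in sentences:
        # Index of the first bucket whose length is >= len(sent).
        idx = bisect.bisect_left(buckets, len(sent))
        if idx == len(buckets):
            dropped += 1  # longer than the largest bucket; this sketch discards it
            continue
        size = buckets[idx]
        grouped[size].append(sent + [pad_id] * (size - len(sent)))
    return grouped, dropped

sentences = [[5, 8], [7, 1, 4, 9], [6, 2, 2, 8, 1, 4, 9]]
grouped, dropped = assign_to_buckets(sentences, buckets=[3, 5])
print(grouped)  # {3: [[5, 8, 0]], 5: [[7, 1, 4, 9, 0]]}
print(dropped)  # 1
```

Every batch drawn from one bucket then has a single, fixed sequence length, which is what lets the network be unrolled once per bucket.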

In [2]:
N = 4
batch_size = 32 * N
buckets = [10, 20, 30, 40, 50, 60]

In [3]:
vocab = default_build_vocab("./data/ptb.train.txt")
data_train = BucketSentenceIter("./data/ptb.train.txt", vocab,
                                    buckets, batch_size, [])
data_val = BucketSentenceIter("./data/ptb.eval.txt", vocab,
                                  buckets, batch_size, [])


bucket of len  10 : 19479 samples
bucket of len  20 : 19336 samples
bucket of len  30 : 12208 samples
bucket of len  40 : 3962 samples
bucket of len  50 : 845 samples
bucket of len  60 : 160 samples
bucket of len  10 : 19479 samples
bucket of len  20 : 19336 samples
bucket of len  30 : 12208 samples
bucket of len  40 : 3962 samples
bucket of len  50 : 845 samples
bucket of len  60 : 160 samples


## Defining the model symbol

When working with RNNs, we cannot define a single static symbol and optimize it, because the computation graph depends on the length of the input sequence. Instead of defining a fixed symbol, we write a function that takes the sequence length as its input and generates the matching symbol. We then pass this function, in place of a symbol, to the optimization module.
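
To make the idea concrete without MXNet, here is a toy generator, under the assumption of a single recurrent layer. The function `toy_sym_gen`, the operation strings, and the parameter names are all hypothetical. It shows that the graph grows with the sequence length while the set of parameters stays the same.

```python
def toy_sym_gen(seq_len):
    """Return (ops, param_names) for a single-layer recurrent graph of length seq_len."""
    ops = ["embed(data)"]
    param_names = {"embed_weight", "rnn_weight", "rnn_bias", "pred_weight", "pred_bias"}
    for t in range(seq_len):
        # Every time step applies the same cell, so it reuses rnn_weight and rnn_bias.
        ops.append("rnn_cell(step=%d)" % t)
    ops.append("pred(outputs)")
    return ops, param_names

ops_10, params_10 = toy_sym_gen(10)
ops_20, params_20 = toy_sym_gen(20)
print(len(ops_10), len(ops_20))  # 12 22: the graphs differ in size
print(params_10 == params_20)    # True: the parameter names do not
```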

The model is structured as follows:

1. The BucketSentenceIter iterator provides input words with shape (batch_size, seq_length), where each value is the index of a word in the vocabulary.
2. This data is fed into the variable named "data".
3. mx.sym.Embedding looks up the embedding of each word, giving an output of shape (batch_size, seq_length, embedding_size).
4. To feed the data into the RNN, we slice it along the sequence axis, so that the ith slice holds the ith word of every sentence in the batch.
5. Next we define the structure of the RNN cell. Our implementation stacks multiple LSTM layers.
6. The RNN is then unrolled to the length of the sequence. Since we are doing language modelling, an output is produced at each time step.
7. The outputs of all time steps are concatenated and then reshaped into a single 2-D array with one row per time step per sentence.
8. Each row is passed through a fully connected layer to produce an output the size of the vocabulary.
9. Finally, we apply a softmax over the vocabulary at each time step to get the distribution over the next word.
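
The steps above can be traced as a sequence of array shapes. The helper `trace_shapes` is a hypothetical illustration and runs no MXNet code. The batch size of 128 and the sizes of 200 match the settings used in this tutorial, while the vocabulary size of 10000 is a placeholder assumption.

```python
def trace_shapes(batch_size, seq_len, embedding_size, num_hidden, vocab_size):
    """List (stage, shape) pairs for the language model described above."""
    return [
        ("data", (batch_size, seq_len)),
        ("embedding", (batch_size, seq_len, embedding_size)),
        ("one time-step slice", (batch_size, embedding_size)),
        ("one LSTM output", (batch_size, num_hidden)),
        ("concatenated outputs", (batch_size, seq_len, num_hidden)),
        ("reshaped outputs", (batch_size * seq_len, num_hidden)),
        ("fully connected / softmax", (batch_size * seq_len, vocab_size)),
    ]

for stage, shape in trace_shapes(batch_size=128, seq_len=10, embedding_size=200,
                                 num_hidden=200, vocab_size=10000):
    print("%-26s %s" % (stage, shape))
```

Only the entries that contain seq_len change from bucket to bucket. The shapes of the learned weights do not depend on it.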
    

In [4]:
import logging

head = '%(asctime)-15s %(message)s'
logging.basicConfig(level=logging.DEBUG, format=head)


# buckets = [32]
num_hidden = 200
embedding_size = 200
num_lstm_layer = 2

num_epoch = 1
learning_rate = 0.01
momentum = 0.0
# contexts = [mx.context.gpu(i) for i in range(N)]
contexts = mx.cpu()





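def sym_gen(seq_len):
    # Inputs: word indices of shape (batch_size, seq_len) and their labels.
    data = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')
    # Look up an embedding vector for every word index.
    embed = mx.sym.Embedding(data=data, input_dim=len(vocab), output_dim=embedding_size, name='embed')
    # Split along the sequence axis into seq_len slices, one per time step.
    wordvec = mx.sym.SliceChannel(data=embed, num_outputs=seq_len, squeeze_axis=1)

    # Stack num_lstm_layer LSTM cells and unroll them over the sequence.
    stack = mx.rnn.SequentialRNNCell()
    for i in range(num_lstm_layer):
        stack.add(mx.rnn.LSTMCell(num_hidden=num_hidden, prefix='lstm_l%d_' % i))
    outputs, states = mx.rnn.rnn_unroll(stack, seq_len, inputs=wordvec)

    # Re-join the per-step outputs and flatten to (batch_size * seq_len, num_hidden).
    outputs = [mx.sym.expand_dims(x, axis=1) for x in outputs]
    pred = mx.sym.Concat(*outputs, dim=1)
    pred = mx.sym.Reshape(pred, shape=(-1, num_hidden))
    # Project every row onto the vocabulary.
    pred = mx.sym.FullyConnected(data=pred, num_hidden=len(vocab), name='pred')

    # Flatten the labels to match, then apply the softmax output layer.
    label = mx.sym.Reshape(label, shape=(-1,))
    pred = mx.sym.SoftmaxOutput(data=pred, label=label, name='softmax')

    # Return the symbol plus the names of its data and label inputs.
    return pred, ('data',), ('softmax_label',)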


## Model Optimization

Optimizing this model is similar to optimizing MXNet models in general, with one difference. Because the length of the input can vary across training examples, the symbol we optimize varies with the sequence length. To handle this, we use the bucketing module, which optimizes the symbol that matches each batch's sequence length while sharing the weights across the different symbols, so that all of them update the same parameters.
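
The following toy class sketches that behaviour in plain Python. It is not how mx.mod.BucketingModule is implemented: `ToyBucketingModule`, `describe_graph`, the single scalar weight, and the hand-picked gradients are all assumptions chosen to show one graph per bucket key and one shared parameter store.

```python
class ToyBucketingModule:
    """Caches one 'graph' per bucket key while keeping a single parameter store."""

    def __init__(self, sym_gen):
        self.sym_gen = sym_gen
        self.graphs = {}               # bucket key -> generated graph
        self.params = {"weight": 0.0}  # shared by every bucket

    def switch_bucket(self, bucket_key):
        # Generate the graph only the first time a bucket key is seen.
        if bucket_key not in self.graphs:
            self.graphs[bucket_key] = self.sym_gen(bucket_key)
        return self.graphs[bucket_key]

    def train_step(self, bucket_key, grad, learning_rate=0.5):
        self.switch_bucket(bucket_key)
        # Whichever bucket produced the gradient, the same parameter is updated.
        self.params["weight"] -= learning_rate * grad


def describe_graph(seq_len):
    # Hypothetical stand-in for sym_gen: just records the unrolled length.
    return "graph unrolled for %d steps" % seq_len


module = ToyBucketingModule(describe_graph)
for bucket_key, grad in [(10, 1.0), (20, 1.0), (10, 2.0)]:
    module.train_step(bucket_key, grad)

print(sorted(module.graphs))    # [10, 20]
print(module.params["weight"])  # -2.0
```

Three updates from two different buckets all land on the same weight, and the graph for bucket 10 is generated only once.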

In [1]:
model = mx.mod.BucketingModule(
        sym_gen=sym_gen,
        default_bucket_key=data_train.default_bucket_key,
        context=contexts)

print("Fitting...")

model.fit(
        train_data=data_train,
        eval_data=data_val,
        eval_metric=mx.metric.np(Perplexity),
        optimizer_params={'learning_rate': learning_rate,
                          'momentum': momentum,
                          'wd': 0.00001},
        initializer=mx.init.Xavier(factor_type="in", magnitude=2.34),
        num_epoch=num_epoch,
        batch_end_callback=mx.callback.Speedometer(batch_size, 200))
print("Done.")

Fitting...
Done.
