# Language Modeling using NLP Toolkit

In this notebook, we use the Gluon NLP Toolkit to build a data pipeline for language modeling, then train a standard LSTM language model using one of the toolkit's pre-defined architectures.

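We train the model using truncated [back-propagation-through-time (BPTT)](https://en.wikipedia.org/wiki/Backpropagation_through_time). Instead of back-propagating through the entire corpus, the gradient is propagated through at most `bptt` time steps at a time.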

![bptt](https://upload.wikimedia.org/wikipedia/commons/e/ee/Unfold_through_time.png)

## Preparation

### Load gluonnlp

In [1]:
import warnings
warnings.filterwarnings('ignore')

import time
import math

import mxnet as mx
from mxnet import gluon, autograd

import gluonnlp as nlp

### Set environment

In [2]:
num_gpus = 1
context = [mx.gpu(i) for i in range(num_gpus)] if num_gpus else [mx.cpu()]
log_interval = 100

### Set hyperparameters

In [3]:
batch_size = 80 * len(context)
lr = 20
epochs = 3
bptt = 35
grad_clip = 0.2
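
The `grad_clip` value is used later in the training loop as the maximum global gradient norm passed to `gluon.utils.clip_global_norm`. As a rough illustration of the idea, here is a pure-Python sketch. The function name `clip_global_norm_sketch` and the toy gradients are assumptions made for this example, and the code is not the library's implementation.

```python
import math

def clip_global_norm_sketch(grads, max_norm):
    """Rescale a list of gradient vectors in place so that their
    joint L2 norm is at most max_norm. Returns the norm before clipping."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for vec in grads:
            for i in range(len(vec)):
                vec[i] *= scale
    return total_norm

# Hypothetical gradients for two parameters; their joint norm is 5.0.
grads = [[3.0, 0.0], [0.0, 4.0]]
norm_before = clip_global_norm_sketch(grads, max_norm=0.2)
norm_after = math.sqrt(sum(g * g for vec in grads for g in vec))
print(norm_before, norm_after)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is limited.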

### Load dataset, extract vocabulary, numericalize, and batchify for truncated BPTT

In [4]:
dataset_name = 'wikitext-2'
train_dataset, val_dataset, test_dataset = [nlp.data.WikiText2(segment=segment,
                                                               bos=None, eos='<eos>',
                                                               skip_empty=False)
                                            for segment in ['train', 'val', 'test']]

vocab = nlp.Vocab(nlp.data.Counter(train_dataset), padding_token=None, bos_token=None)

bptt_batchify = nlp.data.batchify.LanguageModelBPTT(vocab, batch_size, bptt, last_batch='discard')
train_data, val_data, test_data = [bptt_batchify(x)
                                   for x in [train_dataset, val_dataset, test_dataset]]
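
To make the batching step more concrete, the following pure-Python sketch shows one plausible way a flat token stream can be cut into `(data, target)` pairs for truncated BPTT. It is an assumption-laden illustration: `bptt_batches` is a hypothetical helper, and the exact layout produced by `LanguageModelBPTT` may differ in detail.

```python
def bptt_batches(token_ids, batch_size, bptt):
    """Yield (data, target) pairs, each a list of `bptt` rows
    with `batch_size` token ids per row (time-major layout)."""
    # Split the stream into batch_size contiguous columns, dropping the remainder.
    n_steps = len(token_ids) // batch_size
    columns = [token_ids[b * n_steps:(b + 1) * n_steps] for b in range(batch_size)]
    # Transpose so that each row holds one time step for every column.
    rows = [[columns[b][t] for b in range(batch_size)] for t in range(n_steps)]
    for start in range(0, n_steps - 1, bptt):
        if n_steps - 1 - start < bptt:
            break  # incomplete final segment is discarded
        data = rows[start:start + bptt]
        target = rows[start + 1:start + 1 + bptt]  # next-token targets
        yield data, target

batches = list(bptt_batches(list(range(14)), batch_size=2, bptt=3))
for data, target in batches:
    print(data, target)
```

Each target row is the data row shifted by one time step, which is what lets the model learn next-token prediction.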

### Load pre-defined language model architecture

In [5]:
model_name = 'standard_lstm_lm_200'
model, vocab = nlp.model.get_model(model_name, vocab=vocab, dataset_name=None)
print(model)
print(vocab)

StandardRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
    (1): Dropout(p = 0.2, axes=())
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
  (decoder): HybridSequential(
    (0): Dense(200 -> 33278, linear)
  )
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")


In [6]:
model.initialize(mx.init.Xavier(), ctx=context)
trainer = gluon.Trainer(model.collect_params(), 'sgd',
                        {'learning_rate': lr,
                         'momentum': 0,
                         'wd': 0})
loss = gluon.loss.SoftmaxCrossEntropyLoss()

## Training

Now that everything is ready, we can start training the model.

### Detach gradients on states for truncated BPTT

In [7]:
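def detach(hidden):
    """Recursively detach hidden states from the computation graph.

    The state values are kept, but gradients stop flowing into earlier batches.
    """
    if isinstance(hidden, (tuple, list)):
        hidden = [detach(i) for i in hidden]
    else:
        hidden = hidden.detach()
    return hidden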
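
To see what detaching the state changes, here is a small pure-Python sketch (not MXNet code). It assumes a hypothetical scalar recurrence `h_t = w * h_{t-1} + x_t` and tracks the derivative of the final state with respect to `w` by hand. Resetting the tracked derivative at each segment boundary plays the role of `detach`.

```python
def final_state_and_grad(xs, w, bptt=None):
    """Return (h_T, dh_T/dw) for h_t = w * h_{t-1} + x_t.

    If bptt is given, the derivative is truncated at every segment
    boundary, mimicking a detached hidden state."""
    h, dh_dw = 0.0, 0.0
    for t, x in enumerate(xs):
        if bptt is not None and t % bptt == 0:
            dh_dw = 0.0  # "detach": treat the incoming state as a constant
        dh_dw = h + w * dh_dw  # product rule applied to w * h_{t-1}
        h = w * h + x
    return h, dh_dw

xs = [1.0, 1.0, 1.0, 1.0]
h_full, g_full = final_state_and_grad(xs, w=0.5)
h_trunc, g_trunc = final_state_and_grad(xs, w=0.5, bptt=2)
print(h_full, g_full)
print(h_trunc, g_trunc)
```

The forward values agree in both cases, because the state is still carried across segments. Only the gradient differs, since contributions from before the last boundary are dropped.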

### Evaluation

In [8]:
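def evaluate(model, data_source, ctx):
    """Return the average per-token cross-entropy loss over data_source."""
    total_L = 0.0
    ntotal = 0
    # Start from a zero state; it is carried across consecutive batches
    hidden = model.begin_state(batch_size, func=mx.nd.zeros, ctx=ctx)
    for i, (data, target) in enumerate(data_source):
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        output, hidden = model(data, hidden)
        # Merge the time and batch axes so that each row is one token prediction
        L = loss(output.reshape(-3, -1),
                 target.reshape(-1))
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal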
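
The value returned by `evaluate` is the mean cross-entropy per token, and the training loop reports perplexity as `math.exp` of that value. A minimal sketch of the relationship follows; the helper name `perplexity` is an assumption for this example. A model that assigns uniform probability to every word in a vocabulary of size `V` has perplexity `V`.

```python
import math

def perplexity(token_losses):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(sum(token_losses) / len(token_losses))

vocab_size = 33278  # vocabulary size printed for WikiText-2 in this notebook
uniform_loss = -math.log(1.0 / vocab_size)  # loss of a uniform guess
ppl = perplexity([uniform_loss] * 10)
print(ppl)
```

This gives a useful reference point: any perplexity far below the vocabulary size means the model is doing much better than guessing uniformly.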

### Training loop

In [9]:
def train(model, train_data, val_data, test_data, epochs, lr):
    best_val = float("Inf")
    start_train_time = time.time()
    parameters = model.collect_params().values()
    for epoch in range(epochs):
        total_L, n_total = 0.0, 0
        start_epoch_time = time.time()
        start_log_interval_time = time.time()
        hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx) 
                   for ctx in context]
        for i, (data, target) in enumerate(train_data):
            data_list = gluon.utils.split_and_load(data, context, 
                                                   batch_axis=1, even_split=True)
            target_list = gluon.utils.split_and_load(target, context, 
                                                     batch_axis=1, even_split=True)
            hiddens = detach(hiddens)
            L = 0
            Ls = []
            with autograd.record():
                for j, (X, y, h) in enumerate(zip(data_list, target_list, hiddens)):
                    output, h = model(X, h)
                    batch_L = loss(output.reshape(-3, -1), y.reshape(-1))
                    L = L + batch_L.as_in_context(context[0]) / X.size
                    Ls.append(batch_L)
                    hiddens[j] = h
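            L.backward()
            # Clip the global norm of the gradients on all devices to grad_clip
            grads = [p.grad(x.context) for p in parameters for x in data_list]
            gluon.utils.clip_global_norm(grads, grad_clip)

            # L is already normalized by the token count, so no rescaling by batch size
            trainer.step(1)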

            total_L += sum([mx.nd.sum(l).asscalar() for l in Ls])
            n_total += data.size

            if i % log_interval == 0 and i > 0:
                cur_L = total_L / n_total
                print('[Epoch %d Batch %d/%d] loss %.2f, ppl %.2f, '
                      'throughput %.2f samples/s'%(
                    epoch, i, len(train_data), cur_L, math.exp(cur_L), 
                    batch_size * log_interval / (time.time() - start_log_interval_time)))
                total_L, n_total = 0.0, 0
                start_log_interval_time = time.time()

        mx.nd.waitall()

        print('[Epoch %d] throughput %.2f samples/s'%(
                    epoch, len(train_data)*batch_size / (time.time() - start_epoch_time)))
        val_L = evaluate(model, val_data, context[0])
        print('[Epoch %d] time cost %.2fs, valid loss %.2f, valid ppl %.2f'%(
            epoch, time.time()-start_epoch_time, val_L, math.exp(val_L)))

        if val_L < best_val:
            best_val = val_L
            test_L = evaluate(model, test_data, context[0])
            model.save_parameters('{}_{}-{}.params'.format(model_name, dataset_name, epoch))
            print('test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
        else:
            lr = lr*0.25
            print('Learning rate now %f'%(lr))
            trainer.set_learning_rate(lr)

    print('Total training throughput %.2f samples/s'%(
                            (batch_size * len(train_data) * epochs) / 
                            (time.time() - start_train_time)))
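
The learning-rate schedule inside `train` can be isolated as follows: whenever the validation loss fails to improve on the best value seen so far, the learning rate is multiplied by 0.25. This is a standalone sketch with made-up validation losses. The function name `anneal` and the loss values are assumptions for this example, not results from this notebook.

```python
def anneal(val_losses, lr, factor=0.25):
    """Replay the schedule: keep lr when validation improves, else shrink it."""
    best_val = float('inf')
    history = []
    for val_loss in val_losses:
        if val_loss < best_val:
            best_val = val_loss
        else:
            lr = lr * factor
        history.append(lr)
    return best_val, history

# Hypothetical per-epoch validation losses
best, lrs = anneal([6.0, 5.5, 5.6, 5.4], lr=20)
print(best, lrs)
```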

### Train and evaluate

In [10]:
train(model, train_data, val_data, test_data, epochs, lr)

[Epoch 0 Batch 100/745] loss 8.03, ppl 3065.94, throughput 1715.26 samples/s
[Epoch 0 Batch 200/745] loss 7.25, ppl 1403.39, throughput 1772.06 samples/s
[Epoch 0 Batch 300/745] loss 6.94, ppl 1033.97, throughput 1785.40 samples/s
[Epoch 0 Batch 400/745] loss 6.67, ppl 787.56, throughput 1786.35 samples/s
[Epoch 0 Batch 500/745] loss 6.48, ppl 649.69, throughput 1786.77 samples/s
[Epoch 0 Batch 600/745] loss 6.31, ppl 550.47, throughput 1779.51 samples/s
[Epoch 0 Batch 700/745] loss 6.20, ppl 492.56, throughput 1776.04 samples/s
[Epoch 0] throughput 1774.25 samples/s
[Epoch 0] time cost 36.75s, valid loss 5.94, valid ppl 378.13
test loss 5.86, test ppl 352.40
[Epoch 1 Batch 100/745] loss 6.08, ppl 439.13, throughput 1771.13 samples/s
[Epoch 1 Batch 200/745] loss 5.99, ppl 399.40, throughput 1791.90 samples/s
[Epoch 1 Batch 300/745] loss 5.93, ppl 377.05, throughput 1790.47 samples/s
[Epoch 1 Batch 400/745] loss 5.88, ppl 356.80, throughput 1791.31 samples/s
[Epoch 1 Batch 500/745] loss

## Use your own dataset

In [11]:
!./get_ptb_data.sh
ptb_dataset = !ls ptb.*.txt
print(ptb_dataset)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4982k  100 4982k    0     0  22.4M      0 --:--:-- --:--:-- --:--:-- 22.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  390k  100  390k    0     0  4337k      0 --:--:-- --:--:-- --:--:-- 4386k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  439k  100  439k    0     0  4529k      0 --:--:-- --:--:-- --:--:-- 4529k
['ptb.test.txt', 'ptb.train.txt', 'ptb.valid.txt']


In [12]:
import nltk
nltk.download(['perluniprops', 'nonbreaking_prefixes', 'punkt'])
moses_tokenizer = nlp.data.NLTKMosesTokenizer()

ptb_val = nlp.data.CorpusDataset('ptb.valid.txt',
                                    flatten=True,
                                    sample_splitter=nltk.tokenize.sent_tokenize,
                                    tokenizer=moses_tokenizer, eos='<eos>')

ptb_val_data = nlp.data.batchify.LanguageModelBPTT(vocab, batch_size, bptt, last_batch='discard')(ptb_val)
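
Conceptually, the cell above turns a raw text file into one flat stream of tokens with an `<eos>` marker after each sample, which is then mapped to ids with the existing vocabulary. The sketch below mimics this in pure Python under simplifying assumptions: whitespace splitting stands in for the Moses tokenizer, lines stand in for sentences, and `flatten_corpus`, `numericalize`, and `toy_vocab` are hypothetical names.

```python
def flatten_corpus(text, eos='<eos>'):
    """Tokenize each non-empty line and append an end-of-sample marker."""
    tokens = []
    for line in text.splitlines():
        words = line.split()  # stand-in for a real tokenizer
        if not words:
            continue
        tokens.extend(words)
        tokens.append(eos)
    return tokens

def numericalize(tokens, vocab, unk='<unk>'):
    """Map tokens to ids; tokens missing from the vocabulary become <unk>."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

toy_vocab = {'<unk>': 0, '<eos>': 1, 'the': 2, 'cat': 3, 'sat': 4}
tokens = flatten_corpus("the cat sat\nthe dog ran\n")
ids = numericalize(tokens, toy_vocab)
print(tokens)
print(ids)
```

Because the vocabulary was built from a different corpus, words that appear only in the new dataset are mapped to the unknown token, which is one reason a loss measured on new data can differ from the loss on the original dataset.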

[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/ubuntu/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/ubuntu/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
ptb_L = evaluate(model, ptb_val_data, context[0])
print('PTB validation loss %.2f, validation ppl %.2f'%(ptb_L, math.exp(ptb_L)))

PTB validation loss 6.48, validation ppl 653.59


In [14]:
train(model, ptb_val_data, ptb_val_data, ptb_val_data, epochs=3, lr=20)

[Epoch 0] throughput 1409.67 samples/s
[Epoch 0] time cost 2.49s, valid loss 5.80, valid ppl 329.53
test loss 5.80, test ppl 329.53
[Epoch 1] throughput 1764.51 samples/s
[Epoch 1] time cost 2.13s, valid loss 5.75, valid ppl 313.96
test loss 5.75, test ppl 313.96
[Epoch 2] throughput 1763.15 samples/s
[Epoch 2] time cost 2.13s, valid loss 4.91, valid ppl 135.13
test loss 4.91, test ppl 135.13
Total training throughput 608.55 samples/s


## Conclusion

- Gluon NLP Toolkit provides high-level APIs that can greatly simplify developing models for NLP tasks.
- The low-level APIs in the toolkit enable easy customization.

Documentation can be found at http://gluon-nlp.mxnet.io/index.html