# Sentiment Analysis (SA) with pretrained Language Model (LM)

In this notebook, we are going to build a sentiment analysis model based on the pretrained
language model. We are focusing on the best usability to support traditional nlp tasks in a simple fashion. The building process is really simple in three steps. Let us get started now.

## Preparation and settings

### Load mxnet and gluonnlp

In [1]:
import time
import multiprocessing as mp
import numpy as np

import mxnet as mx
from mxnet import nd, gluon, autograd

import gluonnlp as nlp
from gluonnlp import datasets, Vocab

### Hyperparameters

In [2]:
dropout = 0.5
language_model_name = 'standard_lstm_lm_200'
pretrained = True
num_gpus = 4
learning_rate = 0.01 * num_gpus
batch_size = 20 * num_gpus
bucket_num = 10
bucket_ratio = 0.5
epochs = 15
grad_clip = 0.25
log_interval = 100

In [3]:
context = [mx.gpu(i) for i in range(num_gpus)] if num_gpus else [mx.cpu()]

## Sentiment analysis model with pre-trained language model encoder

The model architecture is based on pretrained LM:

![sa-model](samodel-v3.png)

In [4]:
class SentimentNet(gluon.Block):
    def __init__(self, embedding_block, encoder_block, dropout, prefix=None, params=None):
        super(SentimentNet, self).__init__(prefix=prefix, params=params)
        with self.name_scope():
            self.embedding = embedding_block
            self.encoder = encoder_block
            self.out_layer = gluon.nn.HybridSequential()
            with self.out_layer.name_scope():
                self.out_layer.add(gluon.nn.Dropout(dropout))
                self.out_layer.add(gluon.nn.Dense(1, flatten=False))

    def forward(self, data, valid_length):
        encoded = self.encoder(nd.Dropout(self.embedding(data), 0.6, axes=(0,)))  # Shape(T, N, C)
        masked_encoded = nd.SequenceMask(encoded,
                                        sequence_length=valid_length,
                                        use_sequence_length=True)
        agg_state = nd.broadcast_div(nd.sum(masked_encoded, axis=0),
                                     nd.expand_dims(valid_length, axis=1))
        out = self.out_layer(agg_state)
        return out

lm_model, vocab = nlp.models.get_model(name=language_model_name,
                                       pretrained=pretrained,
                                       ctx=context,
                                       dropout=dropout,
                                       prefix='sent_net_')
net = SentimentNet(embedding_block=lm_model.embedding, encoder_block=lm_model.encoder,
                   dropout=dropout, prefix='sent_net_')
net.initialize(mx.init.Xavier(), ctx=context)
net.hybridize()
print(net)

SentimentNet(
  (encoder): LSTM(200 -> 800, TNC, num_layers=2, dropout=0.5)
  (out_layer): HybridSequential(
    (0): Dropout(p = 0.5, axes=())
    (1): Dense(None -> 1, linear)
  )
  (embedding): HybridSequential(
    (0): Embedding(33279 -> 200, float32)
    (1): Dropout(p = 0.5, axes=())
  )
)


  "Set force_reinit=True to re-initialize."%self.name)
  "Set force_reinit=True to re-initialize."%self.name)
  "Set force_reinit=True to re-initialize."%self.name)
  "Set force_reinit=True to re-initialize."%self.name)
  "Set force_reinit=True to re-initialize."%self.name)
  "Set force_reinit=True to re-initialize."%self.name)
  "Set force_reinit=True to re-initialize."%self.name)
  "Set force_reinit=True to re-initialize."%self.name)
  "Set force_reinit=True to re-initialize."%self.name)


## Data pipeline

### Load sentiment analysis dataset -- IMDB reviews

In [5]:
train_dataset, test_dataset = [datasets.IMDB(root='data/imdb', segment=segment) for segment in ('train', 'test')]
print("Tokenize using spaCy...")
tokenizer = nlp.data.SpacyTokenizer('en')
length_clip = nlp.data.ClipSequence(500)

def preprocess(x):
    data, label = x
    label = int(label > 5)
    data = vocab[length_clip(tokenizer(data))]
    return data, label, float(len(data))

def get_length(x):
    return x[2]

def preprocess_dataset(dataset):
    start = time.time()
    with mp.Pool(32) as pool:
        dataset = gluon.data.SimpleDataset(pool.map(preprocess, dataset))
        lengths = gluon.data.SimpleDataset(pool.map(get_length, dataset))
    end = time.time()
    print('Done! Tokenizing Time={:.2f}s, #Sentences={}'.format(end - start, len(dataset)))
    return dataset, lengths

train_dataset, train_data_lengths = preprocess_dataset(train_dataset)
test_dataset, test_data_lengths = preprocess_dataset(test_dataset)

Tokenize using spaCy...
Done! Tokenizing Time=6.03s, #Sentences=25000
Done! Tokenizing Time=7.48s, #Sentences=25000



## Training

### Evaluation using loss and accuracy

In [6]:
def evaluate(net, dataloader, context):
    loss = gluon.loss.SigmoidBCELoss()
    total_L = 0.0
    total_sample_num = 0
    total_correct_num = 0
    start_log_interval_time = time.time()
    print('Begin Testing...')
    for i, (data, label, valid_length) in enumerate(dataloader):
        data = mx.nd.transpose(data.as_in_context(context))
        valid_length = valid_length.as_in_context(context).astype(np.float32)
        label = label.as_in_context(context)
        output = net(data, valid_length)
        L = loss(output, label)
        pred = (output > 0.5).reshape(-1)
        total_L += L.sum().asscalar()
        total_sample_num += label.shape[0]
        total_correct_num += (pred == label).sum().asscalar()
        if (i + 1) % log_interval == 0:
            print('[Batch {}/{}] elapsed {:.2f} s'.format(
                i + 1, len(dataloader), time.time() - start_log_interval_time))
            start_log_interval_time = time.time()
    avg_L = total_L / float(total_sample_num)
    acc = total_correct_num / float(total_sample_num)
    return avg_L, acc

In [7]:
def train(net, context, epochs):
    trainer = gluon.Trainer(net.collect_params(), 'adagrad', {'learning_rate': learning_rate,
                                                              'wd': 0.001})
    loss = gluon.loss.SigmoidBCELoss()

    # Construct the DataLoader
    batchify_fn = nlp.data.batchify.Wrap(nlp.data.batchify.Pad(axis=0), 
                                         nlp.data.batchify.Stack(),
                                         nlp.data.batchify.Stack())  # Pad data, stack label and lengths
    batch_sampler = nlp.data.sampler.FixedBucketSampler(train_data_lengths,
                                                        batch_size=batch_size,
                                                        num_buckets=bucket_num,
                                                        ratio=bucket_ratio,
                                                        shuffle=True)
    print(batch_sampler.stats())
    train_dataloader = gluon.data.DataLoader(dataset=train_dataset,
                                             batch_sampler=batch_sampler,
                                             batchify_fn=batchify_fn)
    test_dataloader = gluon.data.DataLoader(dataset=test_dataset,
                                            batch_size=batch_size,
                                            shuffle=False,
                                            batchify_fn=batchify_fn)
    parameters = net.collect_params().values()

    # Training/Testing
    for epoch in range(epochs):
        # Epoch training stats
        start_epoch_time = time.time()
        epoch_L = 0.0
        epoch_sent_num = 0
        epoch_wc = 0
        # Log interval training stats
        start_log_interval_time = time.time()
        log_interval_wc = 0
        log_interval_sent_num = 0
        log_interval_L = 0.0

        for i, (data, label, length) in enumerate(train_dataloader):
            if data.shape[0] > len(context):
                data_list = gluon.utils.split_and_load(data, context, batch_axis=0, even_split=False)
                label_list = gluon.utils.split_and_load(label, context, batch_axis=0, even_split=False)
                length_list = gluon.utils.split_and_load(length, context, batch_axis=0, even_split=False)
            else:
                data_list = [data.as_in_context(context[0])]
                label_list = [label.as_in_context(context[0])]
                length_list = [length.as_in_context(context[0])]
            L = 0
            wc = length.sum().asscalar()
            log_interval_wc += wc
            epoch_wc += wc
            log_interval_sent_num += data.shape[1]
            epoch_sent_num += data.shape[1]
            for data, label, valid_length in zip(data_list, label_list, length_list):
                valid_length = valid_length
                with autograd.record():
                    output = net(data.T, valid_length)
                    L = L + loss(output, label).mean().as_in_context(context[0])
            L.backward()
            # Clip gradient
            if grad_clip:
                gluon.utils.clip_global_norm([p.grad(x.context) for p in parameters for x in data_list],
                                             grad_clip)
            # Update parameter
            trainer.step(1)
            log_interval_L += L.asscalar()
            epoch_L += L.asscalar()
            if (i + 1) % log_interval == 0:
                print('[Epoch {} Batch {}/{}] elapsed {:.2f} s, avg loss {:.6f}, throughput {:.2f}K wps'.format(
                    epoch, i + 1, len(train_dataloader), time.time() - start_log_interval_time,
                    log_interval_L / log_interval_sent_num,
                    log_interval_wc / 1000 / (time.time() - start_log_interval_time)))
                # Clear log interval training stats
                start_log_interval_time = time.time()
                log_interval_wc = 0
                log_interval_sent_num = 0
                log_interval_L = 0
        end_epoch_time = time.time()
        test_avg_L, test_acc = evaluate(net, test_dataloader, context[0])
        print('[Epoch {}] train avg loss {:.6f}, test acc {:.2f}, test avg loss {:.6f}, throughput {:.2f}K wps'.format(
            epoch, epoch_L / epoch_sent_num,
            test_acc, test_avg_L, epoch_wc / 1000 / (end_epoch_time - start_epoch_time)))

In [8]:
train(net, context, epochs)

FixedBucketSampler:
  sample_num=25000, batch_num=265
  key=[14, 68, 122, 176, 230, 284, 338, 392, 446, 500]
  cnt=[5, 976, 2353, 6662, 4470, 2661, 1836, 1385, 1012, 3640]
  batch_size=[1428, 294, 163, 113, 86, 80, 80, 80, 80, 80]
[Epoch 0 Batch 100/265] elapsed 20.86 s, avg loss 0.009846, throughput 109.40K wps
[Epoch 0 Batch 200/265] elapsed 19.95 s, avg loss 0.007972, throughput 124.55K wps
Begin Testing...
[Batch 100/313] elapsed 12.09 s
[Batch 200/313] elapsed 9.88 s
[Batch 300/313] elapsed 10.35 s
[Epoch 0] train avg loss 0.008628, test acc 0.71, test avg loss 0.534941, throughput 115.87K wps
[Epoch 1 Batch 100/265] elapsed 18.95 s, avg loss 0.005504, throughput 158.88K wps
[Epoch 1 Batch 200/265] elapsed 18.32 s, avg loss 0.008851, throughput 102.06K wps
Begin Testing...
[Batch 100/313] elapsed 12.08 s
[Batch 200/313] elapsed 11.75 s
[Batch 300/313] elapsed 11.34 s
[Epoch 1] train avg loss 0.007137, test acc 0.75, test avg loss 0.480933, throughput 126.94K wps
[Epoch 2 Batch 100

In [9]:
net(mx.nd.reshape(mx.nd.array(vocab[['This', 'movie', 'is', 'amazing']], ctx=context[0]), shape=(-1, 1)),
    mx.nd.array([4], ctx=context[0])).sigmoid()


[[0.9164194]]
<NDArray 1x1 @gpu(0)>

## Conclusion

In summary, we have built a SA model using gluonnlp. It is:

1) easy to use.

2) simple to customize.

3) fast to build the NLP prototype.

Gluonnlp documentation is here: http://gluon-nlp.s3-accelerate.dualstack.amazonaws.com/index.html