# 1 - Updated Sentiment Analysis

In the previous notebook, we got the fundamentals down for sentiment analysis. In this notebook, we'll actually get decent results.

We will use:
- packed padded sequences
- pre-trained word embeddings
- a different RNN architecture
- bidirectional RNN
- multi-layer RNN
- regularization
- a different optimizer

This will allow us to achieve ~90% test accuracy.

## Preparing Data

As before, we'll set the seed, define the `Fields` and get the train/valid/test splits.

We'll be using *packed padded sequences*, which will make our RNN only process the non-padded elements of our sequence, and for any padded element the `output` will be a zero tensor. To use packed padded sequences, we have to tell the RNN how long the actual sequences are. We do this by setting `include_lengths = True` for our `TEXT` field. This will cause `batch.text` to now be a tuple with the first element being our sentence (a numericalized tensor that has been padded) and the second element being the actual lengths of our sentences.
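To make the shape of that tuple concrete, the following plain-Python sketch mimics what a batch looks like once `include_lengths = True` is set. It is an illustration only, not torchtext's implementation: `pad_batch`, the toy token indices and `PAD_IDX = 1` are all assumptions made for the example.

```python
# Illustrative only: a plain-Python stand-in for what `include_lengths = True` yields.
# PAD_IDX and the token indices below are made-up values, not the real vocabulary.
PAD_IDX = 1

def pad_batch(sequences, pad_idx=PAD_IDX):
    """Pad every sequence to the longest one and return (padded, lengths)."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

batch = [[12, 7, 45, 3], [8, 19], [5, 6, 9]]
text, text_lengths = pad_batch(batch)

print(text)          # padded index lists, all of length 4
print(text_lengths)  # true lengths before padding

# A packed RNN only steps over the real tokens, so it processes this many
# positions instead of len(text) * max_len.
real_steps = sum(text_lengths)
```

In this toy batch the RNN would process 9 real tokens rather than all 12 padded positions.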

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import torch
from torch import nn
from torchtext import data
from torchtext import datasets
from poutyne.framework import Model
from poutyne.framework.callbacks import ModelCheckpoint
from pathlib import Path
import numpy as np
import json
from mat6115.model import RNN
from mat6115.dataset import dataset_factory, TEXT, LABEL, SEED
from mat6115.train import acc, custom_loss

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


We use the dataset factory to get the IMDb dataset, pre-split into train, validation and test iterators. Many pre-trained word vectors are available:

- charngram.100d
- fasttext.en.300d
- fasttext.simple.300d
- glove.42B.300d
- glove.840B.300d
- glove.twitter.27B.25d
- glove.twitter.27B.50d
- glove.twitter.27B.100d
- glove.twitter.27B.200d
- glove.6B.50d
- glove.6B.100d
- glove.6B.200d
- glove.6B.300d

> Note: these vectors are over 800MB, so watch out if you have a limited internet connection.

In [3]:
train_iter, valid_iter, test_iter = dataset_factory("imdb", embedding="glove.6B.100d")

In [4]:
kwargs = dict(
    vocab_size = len(TEXT.vocab),
    embedding_dim = 100,
    hidden_dim = 256,
    output_dim = 1,
    n_layers = 1,
    dropout = 0.5,
    pad_idx = TEXT.vocab.stoi[TEXT.pad_token],
    null_idx = TEXT.vocab.stoi[TEXT.null_token],
    rnn_type = 'gru'
)
model = RNN(**kwargs)



In [5]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,775,501 trainable parameters
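
The reported total can be reproduced by hand. The breakdown below is a sketch that assumes `RNN` consists of an embedding layer, a single unidirectional GRU layer and a linear output layer. The source does not show the class, but this layout matches the printed count exactly when combined with the vocabulary size of 25,003 that the embedding shape check prints.

```python
# Assumed layout (not confirmed by the source):
# Embedding -> 1-layer unidirectional GRU -> Linear
vocab_size, embedding_dim, hidden_dim, output_dim = 25003, 100, 256, 1

# One vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# A GRU has 3 gates; each has input weights, hidden weights
# and two bias vectors (PyTorch's convention)
gru_params = 3 * hidden_dim * (embedding_dim + hidden_dim) + 2 * 3 * hidden_dim

# Linear layer: weights plus bias
linear_params = hidden_dim * output_dim + output_dim

total = embedding_params + gru_params + linear_params
print(f'{embedding_params:,} + {gru_params:,} + {linear_params:,} = {total:,}')
```

Under this assumption, about 90% of the trainable parameters sit in the embedding layer.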


The final addition is copying the pre-trained word embeddings we loaded earlier into the `embedding` layer of our model.

We retrieve the embeddings from the field's vocab, and check they're the correct size, _**[vocab size, embedding dim]**_ 

In [6]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25003, 100])


We then replace the initial weights of the `embedding` layer with the pre-trained embeddings.

**Note**: this should always be done on the `weight.data` and not the `weight`!

In [7]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [-0.4611, -0.0639, -1.3667,  ...,  1.6309, -0.0847,  1.0844],
        ...,
        [ 0.2766, -0.2792, -0.1118,  ..., -0.8459, -0.5144, -0.1363],
        [ 0.2785, -0.5053,  0.7145,  ..., -0.2789, -0.4981,  0.5644],
        [ 0.6178, -0.2024,  0.0910,  ...,  0.3046, -0.6263,  0.5276]])

As our `<unk>`, `<pad>` and `<null>` tokens aren't in the pre-trained vocabulary, they were initialized using `unk_init` (an $\mathcal{N}(0,1)$ distribution) when building our vocab. It is preferable to initialize all three to zeros to explicitly tell our model that, initially, they are irrelevant for determining sentiment.

We do this by manually setting their rows in the embedding weights matrix to zeros. We get each row by finding the index of the corresponding token. The `<null>` token is special: it should never be trained. It is used to study fixed points or quasi-fixed points ($h \approx F(h, 0)$).
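
As a toy illustration of the fixed-point relation $h \approx F(h, 0)$, the sketch below iterates a scalar recurrent update with a zero input until the state stops changing. The update rule `F` and the weights `w`, `u`, `b` are hypothetical and far simpler than the GRU used in this notebook.

```python
import math

# Hypothetical scalar "RNN" cell: h_next = tanh(w * h + u * x + b)
w, u, b = 0.5, 1.0, 0.2

def F(h, x):
    return math.tanh(w * h + u * x + b)

# Feed a zero input repeatedly, which is the role an all-zero
# `<null>` embedding would play for the real model
h = 0.0
for _ in range(200):
    h = F(h, 0.0)

# At a fixed point, applying F once more leaves h unchanged
residual = abs(F(h, 0.0) - h)
print(f'h* = {h:.6f}, |F(h*, 0) - h*| = {residual:.2e}')
```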

> Note: like initializing the embeddings, this should be done on the `weight.data` and not the `weight`!

In [8]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
NULL_IDX = TEXT.vocab.stoi[TEXT.null_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(kwargs['embedding_dim'])
model.embedding.weight.data[PAD_IDX] = torch.zeros(kwargs['embedding_dim'])
model.embedding.weight.data[NULL_IDX] = torch.zeros(kwargs['embedding_dim'])

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 0.2766, -0.2792, -0.1118,  ..., -0.8459, -0.5144, -0.1363],
        [ 0.2785, -0.5053,  0.7145,  ..., -0.2789, -0.4981,  0.5644],
        [ 0.6178, -0.2024,  0.0910,  ...,  0.3046, -0.6263,  0.5276]])


We can now see that the first three rows of the embedding weights matrix have been set to zeros. As we passed the index of the pad token to the `padding_idx` of the embedding layer, its row will remain zeros throughout training; the `<unk>` token embedding, however, will be learned.

## Train the Model

Now to training the model.

The only change we'll make here is changing the optimizer from `SGD` to `Adam`. SGD updates all parameters with the same learning rate and choosing this learning rate can be tricky. `Adam` adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates. More information about `Adam` (and other optimizers) can be found [here](http://ruder.io/optimizing-gradient-descent/index.html).

To change `SGD` to `Adam`, we simply change `optim.SGD` to `optim.Adam`. Note that we do not have to provide an initial learning rate for Adam, as PyTorch specifies a sensible default.
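
To see what adapting the learning rate for each parameter means in practice, here is a minimal single-parameter sketch of the Adam update rule in plain Python. It is not PyTorch's implementation: `adam_step` is a hypothetical helper, and its hyperparameters are set to the defaults of `optim.Adam` (`lr=1e-3`, `betas=(0.9, 0.999)`, `eps=1e-8`).

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters whose gradients differ by four orders of magnitude
small, _, _ = adam_step(param=0.0, grad=0.01, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(param=0.0, grad=100.0, m=0.0, v=0.0, t=1)

print(small, large)  # both steps have a magnitude close to lr
```

With plain SGD at the same learning rate, these two steps would have sizes 1e-5 and 0.1 respectively, which is why a single learning rate is harder to choose there.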

In [9]:
import torch.optim as optim

# No learning rate is passed, so Adam's default is used
optimizer = optim.Adam(model.parameters())

Rather than writing a training function by hand, we wrap our model in a Poutyne `Model`, which runs the training and validation loops for us.

As we have set `include_lengths = True`, our `batch.text` is now a tuple with the first element being the numericalized tensor and the second element being the actual lengths of each sequence. We separate these into their own variables, `text` and `text_lengths`, before passing them to the model.

**Note**: as we are now using dropout, we must remember to use `model.train()` to ensure the dropout is "turned on" while training.

Finally, we train our model...

In [10]:
N_EPOCHS = 10
SAVE_PATH = Path(f'{kwargs["rnn_type"]}_{kwargs["n_layers"]}layer')
SAVE_PATH.mkdir(parents=True, exist_ok=True)
with open(SAVE_PATH / 'kwargs.json', 'w') as kwargs_file:
    json.dump(kwargs, kwargs_file)
    
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

poutyne_model = Model(
    network=model, 
    optimizer=optimizer, 
    loss_function=custom_loss, 
    batch_metrics=[acc]
)
poutyne_model.to(device)

history = poutyne_model.fit_generator(
    train_generator=train_iter, 
    valid_generator=valid_iter, 
    epochs=N_EPOCHS,
    callbacks=[
        ModelCheckpoint(
            filename=str(SAVE_PATH / "model.pkl"),
            save_best_only=True,
            restore_best=True
        )
    ]
)

Epoch 1/10 7.54s Step 274/274: loss: 0.636729, acc: 0.627600, val_loss: 0.701215, val_acc: 0.660800
Epoch 2/10 7.48s Step 274/274: loss: 0.421709, acc: 0.811314, val_loss: 0.322023, val_acc: 0.866000
Epoch 3/10 7.49s Step 274/274: loss: 0.271525, acc: 0.888400, val_loss: 0.252827, val_acc: 0.899200
Epoch 4/10 7.46s Step 274/274: loss: 0.212121, acc: 0.917429, val_loss: 0.241873, val_acc: 0.906533
Epoch 5/10 7.46s Step 274/274: loss: 0.174293, acc: 0.933543, val_loss: 0.232285, val_acc: 0.909733
Epoch 6/10 7.49s Step 274/274: loss: 0.141388, acc: 0.948343, val_loss: 0.256034, val_acc: 0.907333
Epoch 7/10 7.47s Step 274/274: loss: 0.115610, acc: 0.959257, val_loss: 0.257615, val_acc: 0.910400
Epoch 8/10 7.48s Step 274/274: loss: 0.096256, acc: 0.965829, val_loss: 0.271390, val_acc: 0.897067
Epoch 9/10 7.45s Step 274/274: loss: 0.081667, acc: 0.971314, val_loss: 0.315175, val_acc: 0.905600
Epoch 10/10 7.48s Step 274/274: loss: 0.071324, acc: 0.974571, val_loss: 0.314905, val_acc: 0.909200
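
With `save_best_only=True` and `restore_best=True`, the weights kept at the end are those of the best epoch rather than the last one. The sketch below picks that epoch from the validation losses logged above, assuming the checkpoint monitors `val_loss` with lower being better. We believe this is Poutyne's default, although the call above does not set it explicitly.

```python
# Validation losses copied from the training log above (epochs 1 to 10)
val_losses = [0.701215, 0.322023, 0.252827, 0.241873, 0.232285,
              0.256034, 0.257615, 0.271390, 0.315175, 0.314905]

# Assumption: the checkpoint keeps the epoch with the lowest val_loss
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
print(f'Best epoch: {best_epoch} (val_loss = {val_losses[best_epoch - 1]:.6f})')
```

After that epoch the training loss keeps falling while the validation loss rises, the usual sign of overfitting.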

In [11]:
test_loss, test_acc, y_pred, y_true = poutyne_model.evaluate_generator(
    generator=test_iter, 
    return_pred=True, 
    return_ground_truth=True
)
print(f"Test Loss: {test_loss:.4f}, Test Binary Accuracy: {test_acc:.4f}")

Test Loss: 0.2569, Test Binary Accuracy: 0.8988


## User Input

We can now use our model to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

When using a model for inference it should always be in evaluation mode. If this tutorial is followed step-by-step then it should already be in evaluation mode (from doing `evaluate` on the test set), however we explicitly set it to avoid any risk.

Our `predict_sentiment` function does a few things:
- sets the model to evaluation mode
- tokenizes the sentence, i.e. splits it from a raw string into a list of tokens
- indexes the tokens by converting them into their integer representation from our vocabulary
- gets the length of our sequence
- converts the indexes, which are a Python list, into a PyTorch tensor
- adds a batch dimension by `unsqueeze`ing
- converts the length into a tensor
- squashes the output prediction from a real number to a value between 0 and 1 with the `sigmoid` function
- returns the result as a single-element NumPy array

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.
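
The squashing step is just the logistic function. The following sketch, using made-up logits rather than real model outputs, shows how raw scores map into the range 0 to 1.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up raw model outputs (logits), not values produced by the trained model
for logit in (-3.0, 0.0, 3.0):
    print(f'{logit:+.1f} -> {sigmoid(logit):.4f}')
```

Strongly negative logits land near 0, strongly positive ones near 1, and a logit of 0 maps to exactly 0.5.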

In [18]:
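import spacy
nlp = spacy.load('en')

def predict_sentiment(model, sentence):
    # Tokenize the raw string and map each token to its vocabulary index
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    # Build a [1, seq_len] batch and the matching length tensor
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    length_tensor = torch.LongTensor(length)
    prediction = model.predict((tensor, length_tensor), batch_size=1)
    # Sigmoid turns the raw output into the probability of the label with index 1
    prob = 1 / (1 + np.exp(-prediction[0]))
    # Flip when 'neg' has index 1, so the result is always the positive probability
    return 1 - prob if LABEL.vocab.stoi['neg'] == 1 else prob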

An example negative review...

In [19]:
predict_sentiment(poutyne_model, "This film is terrible")

array([0.09787267], dtype=float32)

An example positive review...

In [20]:
predict_sentiment(poutyne_model, "This film is great")

array([0.7957635], dtype=float32)

## Next Steps

We've now built a decent sentiment analysis model for movie reviews! In the next notebook we'll implement a model that gets comparable accuracy with far fewer parameters and trains much, much faster.