# Word embedding and RNN for sentiment analysis

The goal of this notebook is to predict whether a written movie review
is positive or negative. To that end, we try three models: a simple
linear model on the word embeddings, a recurrent neural network and a
CNN.

## Preliminaries

### Libraries and Imports

First some imports are needed.

In [1]:
from timeit import default_timer as timer
from typing import List

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence
from torch.optim import Adam, Optimizer
from torch.utils.data import DataLoader
from datasets import load_dataset
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers

### Global variables

First let’s define a few variables. `EMBEDDING_DIM` is the dimension of
the vector space used to embed all the words of the vocabulary.
`SEQ_LENGTH` is the maximum length of a sequence, `BATCH_SIZE` is the
size of the batches used in stochastic optimization algorithms and
`NUM_EPOCHS` is the number of passes through the entire training set
during the training phase.

In [2]:
EMBEDDING_DIM = 8
SEQ_LENGTH = 64
BATCH_SIZE = 512
NUM_EPOCHS = 10

## The `IMDb` dataset

We use the `datasets` library to load the `IMDb` dataset.

In [3]:
dataset = load_dataset("imdb")
train_set = dataset['train']
test_set = dataset['test']

train_set[0]

print(f"Number of training examples: {len(train_set)}")
print(f"Number of testing examples: {len(test_set)}")

Number of training examples: 25000
Number of testing examples: 25000


### Building a vocabulary out of `IMDb` from a tokenizer

We first need a tokenizer that takes a text and returns a list of tokens.
There are many tokenizers available from other libraries. Here we use
the `tokenizers` library.

In [4]:
# Use a word-level tokenizer in lower case
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

Then we need to define the set of words that will be understood by the
model: this is the vocabulary. We build it from the training set.

In [5]:
train_texts = train_set['text']
test_texts = test_set['text']

trainer = trainers.WordLevelTrainer(vocab_size=10000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(train_texts, trainer)

vocab = tokenizer.get_vocab()

UNK_IDX, PAD_IDX = vocab["[UNK]"], vocab["[PAD]"]
VOCAB_SIZE = len(vocab)

tokenizer.encode("All your base are belong to us").tokens
tokenizer.encode("All your base are belong to us").ids

vocab['plenty']

988

## The training loop

The training loop is decomposed into 3 different functions:

-   `train_epoch`
-   `evaluate`
-   `train`

### Collate function

The collate function maps raw samples coming from the dataset to padded
tensors of numericalized tokens ready to be fed to the model.

In [6]:
def collate_fn(batch: List):
    def collate(text):
        """Turn a text into a tensor of integers."""
        ids = tokenizer.encode(text).ids[:SEQ_LENGTH]
        return torch.LongTensor(ids)

    src_batch = [collate(sample["text"]) for sample in batch]

    # Pad list of tensors using `pad_sequence`
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)

    # Define the labels tensor
    tgt_batch = torch.Tensor([sample["label"] for sample in batch])

    return src_batch, tgt_batch
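
To make the shape convention concrete, here is a small sketch of what `pad_sequence` does with its default arguments. The toy id lists and the padding id `toy_pad_idx` are illustrative assumptions, not values produced by the trained tokenizer.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy token ids of different lengths (illustrative values only)
toy_batch = [
    torch.LongTensor([5, 8, 2]),
    torch.LongTensor([7]),
    torch.LongTensor([4, 9]),
]
toy_pad_idx = 1  # assumed padding id for this sketch

# With the default `batch_first=False`, the result is of size
# `max_length` * `batch_size`: each column is one sample
padded = pad_sequence(toy_batch, padding_value=toy_pad_idx)
print(padded.shape)
print(padded)
```

Each column holds one sample, which is why the models below treat the first dimension as the sequence dimension.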

### The `accuracy` function

We need to implement an accuracy function to be used in the
`train_epoch` function (see below).

In [7]:
def accuracy(predictions, labels):
    # `predictions` (raw scores) and `labels` are both tensors of same length

    # A sample is predicted positive when its sigmoid probability exceeds 0.5
    predicted = torch.sigmoid(predictions) > 0.5
    return torch.sum(predicted == (labels > 0.5)).item() / len(predictions)

assert accuracy(torch.Tensor([1, -2, 3]), torch.Tensor([1, 0, 1])) == 1
assert accuracy(torch.Tensor([1, -2, -3]), torch.Tensor([1, 0, 1])) == 2 / 3
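
Since the models output raw scores (logits), thresholding the sigmoid at 0.5 amounts to checking the sign of the score. The following sketch, on made-up scores, illustrates this equivalence.

```python
import torch

logits = torch.tensor([1.0, -2.0, 3.0, 0.0])  # made-up raw scores

# Thresholding the probability at 0.5 ...
from_probabilities = torch.sigmoid(logits) > 0.5
# ... gives the same decisions as thresholding the raw score at 0
from_logits = logits > 0

print(from_probabilities)
print(torch.equal(from_probabilities, from_logits))
```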


### The `train_epoch` function

def train_epoch(model: nn.Module, optimizer: Optimizer):
    #model.to(device)

    # Training mode
    model.train()

    loss_fn = nn.BCEWithLogitsLoss()

    train_dataloader = DataLoader(
        train_set, batch_size=BATCH_SIZE, collate_fn=collate_fn, shuffle=True
    )

    matches = 0
    losses = 0
    for sequences, labels in train_dataloader:
        #sequences, labels = sequences.to(device), labels.to(device)

        # Implement a step of the algorithm:
        #
        # - set gradients to zero
        # - forward propagate examples in `batch`
        # - compute `loss` with chosen criterion
        # - back-propagate gradients
        # - gradient step
        optimizer.zero_grad()
        predictions = model(sequences)
        loss = loss_fn(predictions, labels)
        loss.backward()
        optimizer.step()

        acc = accuracy(predictions, labels)
        matches += len(predictions) * acc
        losses += loss.item()

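    # Note: `loss.item()` is already a mean over the batch, so dividing the
    # sum of batch losses by the number of samples reports a value roughly
    # `BATCH_SIZE` times smaller than the mean per-sample loss.
    return losses / len(train_set), matches / len(train_set)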
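
`nn.BCEWithLogitsLoss` averages over the batch by default, so a per-sample average over an epoch needs each batch loss weighted by its batch size. Below is a minimal sketch of one way to do that, using random toy tensors (the names `toy_logits` and `toy_labels` are hypothetical) rather than the notebook's data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_logits = torch.randn(10)
toy_labels = torch.randint(0, 2, (10,)).float()

loss_fn = nn.BCEWithLogitsLoss()  # default reduction is "mean"

# Reference: mean loss over all 10 samples at once
full_mean = loss_fn(toy_logits, toy_labels).item()

# Same quantity accumulated over uneven batches of size 4, 4 and 2:
# weight each batch mean by its number of samples
total, count = 0.0, 0
for start in range(0, 10, 4):
    batch_logits = toy_logits[start:start + 4]
    batch_labels = toy_labels[start:start + 4]
    batch_loss = loss_fn(batch_logits, batch_labels).item()
    total += batch_loss * len(batch_logits)
    count += len(batch_logits)

print(full_mean, total / count)
```

The two printed values should agree up to floating-point error.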

### The `evaluate` function

In [8]:
def evaluate(model: nn.Module):
    #model.to(device)
    model.eval()

    loss_fn = nn.BCEWithLogitsLoss()

    val_dataloader = DataLoader(
        test_set, batch_size=BATCH_SIZE, collate_fn=collate_fn
    )

    losses = 0
    matches = 0
    # Gradients are not needed for evaluation
    with torch.no_grad():
        for sequences, labels in val_dataloader:
            #sequences, labels = sequences.to(device), labels.to(device)

            predictions = model(sequences)
            loss = loss_fn(predictions, labels)
            acc = accuracy(predictions, labels)
            matches += len(predictions) * acc
            losses += loss.item()

    return losses / len(test_set), matches / len(test_set)

### The `train` function

In [9]:
def train(model, optimizer):
    for epoch in range(1, NUM_EPOCHS + 1):
        start_time = timer()
        train_loss, train_acc = train_epoch(model, optimizer)
        end_time = timer()
        val_loss, val_acc = evaluate(model)
        print(
            f"Epoch: {epoch}, "
            f"Train loss: {train_loss:.3f}, "
            f"Train acc: {train_acc:.3f}, "
            f"Val loss: {val_loss:.3f}, "
            f"Val acc: {val_acc:.3f}, "
            f"Epoch time = {(end_time - start_time):.3f}s"
        )

### Helper function to predict from a character string

In [10]:
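def predict_sentiment(model, sentence):
    "Predict sentiment of given sentence according to model"

    # `collate_fn` expects samples with "text" and "label" keys; the label
    # is a dummy value that is ignored here
    tensor, _ = collate_fn([{"text": sentence, "label": 0}])
    model.eval()
    with torch.no_grad():
        prediction = model(tensor)
    pred = torch.sigmoid(prediction)
    return pred.item()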

## Models

### Training a linear classifier with an embedding

We first test a simple linear classifier on the word embeddings.

In [11]:
class EmbeddingNet(nn.Module):
    def __init__(self, vocab_size, embedding_dim, seq_length):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.seq_length = seq_length
        self.vocab_size = vocab_size

        # Define an embedding of `vocab_size` words into a vector space
        # of dimension `embedding_dim`.
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)

        # Define a linear layer from dimension `seq_length` *
        # `embedding_dim` to 1.
        self.l1 = nn.Linear(self.seq_length * self.embedding_dim, 1)

    def forward(self, x):
        # `x` is of size `seq_length` * `batch_size`

        # Compute the embedding `embedded` of the batch `x`. `embedded` is
        # of size `seq_length` * `batch_size` * `embedding_dim`
        embedded = self.embedding(x)

        # Flatten the embedded words and feed it to the linear layer. `flatten`
        # must be of size `batch_size` * (`seq_length` * `embedding_dim`). You
        # might need to use `permute` first.
        flatten = embedded.permute(1, 0, 2).reshape(-1, self.seq_length * self.embedding_dim)

        # Apply the linear layer and return a squeezed version
        # `l1` is of size `batch_size`
        return self.l1(flatten).squeeze()
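
The `permute` before `reshape` matters: without it, each flattened row would mix words from different samples. The sketch below uses a fake embedded batch (small, made-up dimensions) in which every entry stores the index of its sample, so the mixing is easy to see.

```python
import torch

seq_length, batch_size, embedding_dim = 3, 2, 4  # small made-up sizes

# Fake embedded batch of size `seq_length` * `batch_size` * `embedding_dim`
# where every entry stores the index of the sample it belongs to
embedded = torch.zeros(seq_length, batch_size, embedding_dim)
embedded[:, 1, :] = 1

# Correct: put the batch dimension first, then flatten each sample
good = embedded.permute(1, 0, 2).reshape(-1, seq_length * embedding_dim)

# Incorrect: reshaping directly mixes the two samples in each row
bad = embedded.reshape(-1, seq_length * embedding_dim)

print(good)
print(bad)
```

In `good`, the first row contains only zeros and the second only ones, whereas each row of `bad` contains both values.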

In [12]:
embedding_net = EmbeddingNet(VOCAB_SIZE, EMBEDDING_DIM, SEQ_LENGTH)
print(sum(torch.numel(e) for e in embedding_net.parameters()))

device = "cuda:0" if torch.cuda.is_available() else "cpu"
device = "cpu"

optimizer = Adam(embedding_net.parameters())
train(embedding_net, optimizer)

80513
Epoch: 1, Train loss: 0.001, Train acc: 0.500, Val loss: 0.001, Val acc: 0.504, Epoch time = 9.691s
Epoch: 2, Train loss: 0.001, Train acc: 0.533, Val loss: 0.001, Val acc: 0.512, Epoch time = 16.191s
Epoch: 3, Train loss: 0.001, Train acc: 0.556, Val loss: 0.001, Val acc: 0.519, Epoch time = 18.647s
Epoch: 4, Train loss: 0.001, Train acc: 0.571, Val loss: 0.001, Val acc: 0.527, Epoch time = 16.734s
Epoch: 5, Train loss: 0.001, Train acc: 0.585, Val loss: 0.001, Val acc: 0.540, Epoch time = 16.241s
Epoch: 6, Train loss: 0.001, Train acc: 0.595, Val loss: 0.001, Val acc: 0.552, Epoch time = 16.997s
Epoch: 7, Train loss: 0.001, Train acc: 0.612, Val loss: 0.001, Val acc: 0.568, Epoch time = 17.188s
Epoch: 8, Train loss: 0.001, Train acc: 0.628, Val loss: 0.001, Val acc: 0.582, Epoch time = 15.599s
Epoch: 9, Train loss: 0.001, Train acc: 0.642, Val loss: 0.001, Val acc: 0.597, Epoch time = 16.189s
Epoch: 10, Train loss: 0.001, Train acc: 0.658, Val loss: 0.001, Val acc: 0.612, Epoch
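
The parameter count printed at the top of this output can be checked by hand. The sketch below assumes that the trained vocabulary reaches the requested 10000 entries; the other values come from the global variables.

```python
# `vocab_size` assumes the trained vocabulary has the requested 10000 entries
vocab_size = 10000
embedding_dim = 8
seq_length = 64

embedding_params = vocab_size * embedding_dim       # embedding matrix
linear_params = seq_length * embedding_dim * 1 + 1  # weights + bias

print(embedding_params + linear_params)
```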

### Training a linear classifier with a pretrained embedding

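Instead of learning the embedding from scratch, we load a pretrained
GloVe embedding (`glove-twitter-25`, of dimension 25) and train only the
linear layer on top of it.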

In [13]:
import gensim.downloader
glove_vectors = gensim.downloader.load('glove-twitter-25')

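# Use the mean GloVe vector for tokens missing from GloVe
unknown_vector = np.mean(glove_vectors.vectors, axis=0)
# Row `i` must hold the vector of token id `i`: iterate over tokens sorted by id
tokens_by_id = sorted(vocab, key=vocab.get)
vocab_vectors = torch.tensor(np.stack([glove_vectors[e] if e in glove_vectors else unknown_vector for e in tokens_by_id]))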
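
Because `nn.Embedding` looks rows up by token id, row `i` of the embedding matrix has to hold the vector of the token whose id is `i`. Here is a small NumPy-only sketch of that alignment, with a hypothetical toy vocabulary `toy_vocab` and made-up pretrained vectors `toy_pretrained` standing in for the real ones.

```python
import numpy as np

# Hypothetical toy vocabulary (token -> id) and made-up pretrained vectors
toy_vocab = {"good": 2, "[UNK]": 0, "movie": 3, "[PAD]": 1}
toy_pretrained = {
    "good": np.array([1.0, 0.0], dtype=np.float32),
    "movie": np.array([0.0, 1.0], dtype=np.float32),
}

# Fallback for tokens missing from the pretrained vectors
fallback = np.mean(list(toy_pretrained.values()), axis=0)

# Visit tokens by increasing id so that row `i` is the vector of token `i`
tokens_by_id = sorted(toy_vocab, key=toy_vocab.get)
matrix = np.stack([toy_pretrained.get(tok, fallback) for tok in tokens_by_id])

print(tokens_by_id)
print(matrix)
```

Sorting by id makes the result independent of the order in which the dictionary happens to list its tokens.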

In [14]:
class GloVeEmbeddingNet(nn.Module):
    def __init__(self, seq_length, vocab_vectors, freeze=True):
        super().__init__()
        self.seq_length = seq_length

        # Define `embedding_dim` from vocabulary and the pretrained `embedding`.
        self.embedding_dim = vocab_vectors.size(1)
        self.embedding = nn.Embedding.from_pretrained(vocab_vectors, freeze=freeze)

        self.l1 = nn.Linear(self.seq_length * self.embedding_dim, 1)

    def forward(self, x):
        # Same forward as in `EmbeddingNet`
        # `x` is of size `seq_length` * `batch_size`
        embedded = self.embedding(x)
        flatten = embedded.permute(1, 0, 2).reshape(-1, self.seq_length * self.embedding_dim)

        # L1 is of size batch_size
        return self.l1(flatten).squeeze()


glove_embedding_net1 = GloVeEmbeddingNet(SEQ_LENGTH, vocab_vectors, freeze=True)
print(sum(torch.numel(e) for e in glove_embedding_net1.parameters()))

optimizer = Adam(glove_embedding_net1.parameters())
train(glove_embedding_net1, optimizer)

251601
Epoch: 1, Train loss: 0.001, Train acc: 0.520, Val loss: 0.001, Val acc: 0.529, Epoch time = 16.558s
Epoch: 2, Train loss: 0.001, Train acc: 0.574, Val loss: 0.001, Val acc: 0.543, Epoch time = 16.344s
Epoch: 3, Train loss: 0.001, Train acc: 0.593, Val loss: 0.001, Val acc: 0.549, Epoch time = 15.202s
Epoch: 4, Train loss: 0.001, Train acc: 0.603, Val loss: 0.001, Val acc: 0.548, Epoch time = 16.315s
Epoch: 5, Train loss: 0.001, Train acc: 0.610, Val loss: 0.001, Val acc: 0.548, Epoch time = 16.547s
Epoch: 6, Train loss: 0.001, Train acc: 0.610, Val loss: 0.001, Val acc: 0.550, Epoch time = 16.566s
Epoch: 7, Train loss: 0.001, Train acc: 0.616, Val loss: 0.001, Val acc: 0.549, Epoch time = 15.145s
Epoch: 8, Train loss: 0.001, Train acc: 0.616, Val loss: 0.001, Val acc: 0.550, Epoch time = 17.041s
Epoch: 9, Train loss: 0.001, Train acc: 0.620, Val loss: 0.001, Val acc: 0.551, Epoch time = 15.868s
Epoch: 10, Train loss: 0.001, Train acc: 0.620, Val loss: 0.001, Val acc: 0.550, Epo

### Fine-tuning the pretrained embedding

In [15]:
# Define model and don't freeze embedding weights
glove_embedding_net2 = GloVeEmbeddingNet(SEQ_LENGTH, vocab_vectors, freeze=False)
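
The effect of the `freeze` argument can be seen by counting the parameters that receive gradient updates. This is a minimal sketch on a made-up embedding matrix; `count_trainable` is a hypothetical helper, not part of the notebook.

```python
import torch
import torch.nn as nn

toy_vectors = torch.randn(5, 3)  # 5 hypothetical tokens, dimension 3

frozen = nn.Embedding.from_pretrained(toy_vectors, freeze=True)
tunable = nn.Embedding.from_pretrained(toy_vectors, freeze=False)


def count_trainable(module: nn.Module) -> int:
    """Number of parameters that receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


print(count_trainable(frozen))
print(count_trainable(tunable))
```

With `freeze=False` the whole embedding matrix is updated during training, so the fine-tuned model has many more trainable parameters than the frozen one.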

### Recurrent neural network with frozen pretrained embedding

In [16]:
class RNN(nn.Module):
    def __init__(self, hidden_size, vocab_vectors, freeze=True):
        super(RNN, self).__init__()

        # Define pretrained embedding
        self.embedding = nn.Embedding.from_pretrained(vocab_vectors, freeze=freeze)

        # Size of input `x_t` from `embedding`
        self.embedding_size = self.embedding.embedding_dim
        self.input_size = self.embedding_size

        # Size of hidden state `h_t`
        self.hidden_size = hidden_size

        # Define a GRU
        self.gru = nn.GRU(input_size=self.input_size, hidden_size=self.hidden_size)

        # Linear layer on last hidden state
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, x, h0=None):
        # `x` is of size `seq_length` * `batch_size` and `h0` is of size 1
        # * `batch_size` * `hidden_size`

        # Define first hidden state if not provided
        if h0 is None:
            # Get batch size and define `h0`, which is of size 1 *
            # `batch_size` * `hidden_size` (the leading dimension is
            # `num_layers` * `num_directions`, here 1)
            batch_size = x.size(1)
            h0 = torch.zeros(1, batch_size, self.hidden_size)

        # `embedded` is of size `seq_length` * `batch_size` *
        # `embedding_dim`
        embedded = self.embedding(x)

        # Define `output` and `hidden` returned by GRU:
        #
        # - `output` is of size `seq_length` * `batch_size` * `hidden_size`
        #   and gathers all the hidden states along the sequence.
        # - `hidden` is of size 1 * `batch_size` * `hidden_size` and is the
        #   last hidden state.
        output, hidden = self.gru(embedded, h0)

        # Apply a linear layer on the last hidden state to have a score tensor
        # of size 1 * `batch_size` * 1, and return a one-dimensional tensor of
        # size `batch_size`.
        return self.linear(hidden).squeeze()


rnn = RNN(hidden_size=100, vocab_vectors=vocab_vectors)
print("Number of parameters for RNN model:", sum(torch.numel(e) for e in rnn.parameters() if e.requires_grad))

optimizer = optim.Adam(filter(lambda p: p.requires_grad, rnn.parameters()), lr=0.001)
train(rnn, optimizer)

Number of parameters for RNN model: 38201
Epoch: 1, Train loss: 0.001, Train acc: 0.513, Val loss: 0.001, Val acc: 0.515, Epoch time = 24.432s
Epoch: 2, Train loss: 0.001, Train acc: 0.528, Val loss: 0.001, Val acc: 0.521, Epoch time = 25.827s
Epoch: 3, Train loss: 0.001, Train acc: 0.550, Val loss: 0.001, Val acc: 0.568, Epoch time = 25.694s
Epoch: 4, Train loss: 0.001, Train acc: 0.576, Val loss: 0.001, Val acc: 0.560, Epoch time = 26.953s
Epoch: 5, Train loss: 0.001, Train acc: 0.586, Val loss: 0.001, Val acc: 0.585, Epoch time = 27.790s
Epoch: 6, Train loss: 0.001, Train acc: 0.598, Val loss: 0.001, Val acc: 0.579, Epoch time = 26.066s
Epoch: 7, Train loss: 0.001, Train acc: 0.608, Val loss: 0.001, Val acc: 0.592, Epoch time = 23.791s
Epoch: 8, Train loss: 0.001, Train acc: 0.621, Val loss: 0.001, Val acc: 0.617, Epoch time = 26.056s
Epoch: 9, Train loss: 0.001, Train acc: 0.634, Val loss: 0.001, Val acc: 0.629, Epoch time = 26.283s
Epoch: 10, Train loss: 0.001, Train acc: 0.647, V
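
The shapes returned by `nn.GRU` and the parameter count printed above can be checked on a random input. The sketch below assumes the same sizes as the model (input dimension 25 from `glove-twitter-25`, hidden size 100) and a short made-up sequence length.

```python
import torch
import torch.nn as nn

seq_length, batch_size, input_size, hidden_size = 6, 2, 25, 100

gru = nn.GRU(input_size=input_size, hidden_size=hidden_size)
x = torch.randn(seq_length, batch_size, input_size)

output, hidden = gru(x)
print(output.shape)  # `seq_length` * `batch_size` * `hidden_size`
print(hidden.shape)  # 1 * `batch_size` * `hidden_size`

# For a single-layer, unidirectional GRU the last output is the final
# hidden state
print(torch.allclose(output[-1], hidden[0]))

# 3 gates, each with input weights, hidden weights and two bias vectors,
# plus the final linear layer (`hidden_size` weights and 1 bias)
gru_params = 3 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
print(gru_params + hidden_size + 1)
```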

### CNN based text classification

In [17]:
class CNN(nn.Module):
    def __init__(self, vocab_vectors, freeze=True):
        super().__init__()

        self.embedding = nn.Embedding.from_pretrained(vocab_vectors, freeze=freeze)
        self.embedding_dim = self.embedding.embedding_dim

        self.conv_0 = nn.Conv2d(
            in_channels=1, out_channels=100, kernel_size=(3, self.embedding_dim)
        )
        self.conv_1 = nn.Conv2d(
            in_channels=1, out_channels=100, kernel_size=(4, self.embedding_dim)
        )
        self.conv_2 = nn.Conv2d(
            in_channels=1, out_channels=100, kernel_size=(5, self.embedding_dim)
        )
        self.linear = nn.Linear(3 * 100, 1)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Input `x` is of size `seq_length` * `batch_size` and contains integers
        embedded = self.embedding(x)

        # The tensor `embedded` is of size `seq_length` * `batch_size` *
        # `embedding_dim` and should be of size `batch_size` * (`n_channels`=1)
        # * `seq_length` * `embedding_dim` for the convolutional layers. You can
        # use `transpose` and `unsqueeze` to make the transformation.

        embedded = embedded.transpose(0, 1).unsqueeze(1)

        # Tensor `embedded` is now of size `batch_size` * 1 *
        # `seq_length` * `embedding_dim` before convolution and should
        # be of size `batch_size` * (`out_channels` = 100) *
        # (`seq_length` - `kernel_size[0]` + 1) after convolution and
        # squeezing.
        # Implement the three parallel convolutions
        conved_0 = self.conv_0(embedded).squeeze(3)
        conved_1 = self.conv_1(embedded).squeeze(3)
        conved_2 = self.conv_2(embedded).squeeze(3)

        # Non-linearity step, we use ReLU activation
        conved_0_relu = F.relu(conved_0)
        conved_1_relu = F.relu(conved_1)
        conved_2_relu = F.relu(conved_2)

        # Max-pooling layer: pooling along whole sequence
        # Implement max pooling
        seq_len_0 = conved_0_relu.shape[2]
        pooled_0 = F.max_pool1d(conved_0_relu, kernel_size=seq_len_0).squeeze(2)

        seq_len_1 = conved_1_relu.shape[2]
        pooled_1 = F.max_pool1d(conved_1_relu, kernel_size=seq_len_1).squeeze(2)

        seq_len_2 = conved_2_relu.shape[2]
        pooled_2 = F.max_pool1d(conved_2_relu, kernel_size=seq_len_2).squeeze(2)

        # Dropout on concatenated pooled features
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))

        # Linear layer
        return self.linear(cat).squeeze()

cnn = CNN(vocab_vectors)
optimizer = optim.Adam(cnn.parameters())

print(sum(torch.numel(e) for e in cnn.parameters() if e.requires_grad))

print(
    (3 * cnn.embedding_dim + 1) * 100  # Conv1
    + (4 * cnn.embedding_dim + 1) * 100  # Conv2
    + (5 * cnn.embedding_dim + 1) * 100  # Conv3
    + 3 * 100 + 1  # Linear
)
train(cnn, optimizer)

30601
30601
Epoch: 1, Train loss: 0.001, Train acc: 0.516, Val loss: 0.001, Val acc: 0.572, Epoch time = 24.160s
Epoch: 2, Train loss: 0.001, Train acc: 0.565, Val loss: 0.001, Val acc: 0.602, Epoch time = 23.874s
Epoch: 3, Train loss: 0.001, Train acc: 0.589, Val loss: 0.001, Val acc: 0.605, Epoch time = 24.312s
Epoch: 4, Train loss: 0.001, Train acc: 0.614, Val loss: 0.001, Val acc: 0.620, Epoch time = 24.264s
Epoch: 5, Train loss: 0.001, Train acc: 0.628, Val loss: 0.001, Val acc: 0.625, Epoch time = 22.354s
Epoch: 6, Train loss: 0.001, Train acc: 0.641, Val loss: 0.001, Val acc: 0.632, Epoch time = 22.503s
Epoch: 7, Train loss: 0.001, Train acc: 0.650, Val loss: 0.001, Val acc: 0.644, Epoch time = 24.703s
Epoch: 8, Train loss: 0.001, Train acc: 0.662, Val loss: 0.001, Val acc: 0.649, Epoch time = 23.930s
Epoch: 9, Train loss: 0.001, Train acc: 0.677, Val loss: 0.001, Val acc: 0.656, Epoch time = 24.183s
Epoch: 10, Train loss: 0.001, Train acc: 0.691, Val loss: 0.001, Val acc: 0.654
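
To follow the tensor shapes through one branch of the CNN, here is a sketch on a random batch. It assumes the notebook's sizes (`seq_length` of 64, embedding dimension 25), a made-up batch size of 2, and a single convolution with kernel height 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

seq_length, batch_size, embedding_dim = 64, 2, 25

embedded = torch.randn(seq_length, batch_size, embedding_dim)
# `batch_size` * 1 * `seq_length` * `embedding_dim`
images = embedded.transpose(0, 1).unsqueeze(1)

conv = nn.Conv2d(in_channels=1, out_channels=100, kernel_size=(3, embedding_dim))

# `batch_size` * 100 * (`seq_length` - 3 + 1)
conved = F.relu(conv(images).squeeze(3))
# `batch_size` * 100
pooled = F.max_pool1d(conved, kernel_size=conved.shape[2]).squeeze(2)

print(images.shape, conved.shape, pooled.shape)

# Max pooling over the whole sequence is the same as a max over dimension 2
print(torch.equal(pooled, conved.max(dim=2).values))
```

Because the kernel spans the full embedding dimension, each filter slides only along the sequence and acts like a detector for 3-word patterns; the max pooling then keeps its strongest response.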