
In [162]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import torch
import torch.nn as nn
import torchtext
from collections import Counter, OrderedDict

In [60]:
url_train = "https://raw.githubusercontent.com/fcoelhomrc/MachineLearning/main/TAAPC_Assignments/imdb_train.csv"
url_test = "https://raw.githubusercontent.com/fcoelhomrc/MachineLearning/main/TAAPC_Assignments/imdb_test.csv"

imdb_train = pd.read_csv(url_train)
imdb_test = pd.read_csv(url_test, names=imdb_train.columns)

In [61]:
imdb_test.head()

Unnamed: 0,review,sentiment
0,"I saw the film many times, and every time I am...",negative
1,I loved KOLCHAK: THE NIGHT STALKER since I saw...,positive
2,This feels as if it is a Czech version of Pear...,positive
3,"When, oh, when will someone like Anchor Bay or...",positive
4,"""Just before dawn "" is one of the best slasher...",positive


In [62]:
print(imdb_train.shape, imdb_test.shape)
print(imdb_train.columns, imdb_test.columns)

(45000, 2) (5000, 2)
Index(['review', 'sentiment'], dtype='object') Index(['review', 'sentiment'], dtype='object')


The outline of this project is:
1. Pre-process the data: define a vocabulary and convert each review into a sequence of integer token indices (a sparse representation).
2. Pre-train an embedding layer to transform the indices into a dense representation.
3. Use each embedded sequence to warm up a GRU cell, then return a single binary prediction (many-to-one model).

The reviews have different lengths, which we have to handle while mini-batching. [This post](https://towardsdatascience.com/taming-lstms-variable-sized-mini-batches-and-why-pytorch-is-good-for-your-health-61d35642972e) explains the PyTorch utilities for doing that.
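As a rough illustration of the utilities mentioned above, the sketch below packs a toy padded batch, runs it through an `nn.GRU`, and unpacks the result. The tensor sizes and the use of `nn.GRU` (rather than the `nn.GRUCell` used later in this notebook) are assumptions made only for this example.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy batch: 3 sequences padded to length 5, feature dimension 4 (illustrative sizes)
batch = torch.randn(3, 5, 4)
lengths = torch.tensor([5, 3, 2])  # true lengths before padding

gru = nn.GRU(input_size=4, hidden_size=6, batch_first=True)

# Packing lets the GRU skip the padded time steps
packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = gru(packed)

# Unpack into a padded tensor again; positions past each length are zero-filled
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.shape)  # (batch, longest sequence in batch, hidden)
print(h_n.shape)  # (num_layers, batch, hidden)
```

Here `h_n` holds, for each sequence, the hidden state at its last real (non-padded) step, which is the quantity a many-to-one model needs.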

Generate the vocabulary

In [135]:
tokenizer = torchtext.data.get_tokenizer("basic_english")

def get_vocab_freq(data):
    counter = Counter()
    for sentence in data["review"]:
        tokenized_sentence = tokenizer(sentence)
        counter.update(tokenized_sentence)
    return OrderedDict(counter)

imdb_data = pd.concat([imdb_train, imdb_test], axis=0, ignore_index=True) # fetch vocab considering complete data

vocab_freq = get_vocab_freq(imdb_data)

pad_token = "<pad>" # Token representing padding
vocab = torchtext.vocab.vocab(vocab_freq, min_freq=50,
                              specials=[pad_token], special_first=True)
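# Unknown words return -1. Note that nn.Embedding does not accept negative
# indices, so these entries must be remapped or dropped before the embedding layer.
vocab.set_default_index(-1)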

In [136]:
len(vocab)

11264
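Since unknown words map to -1 here, one common alternative is to reserve a dedicated `<unk>` index so that every token maps to a valid embedding row. The sketch below shows the idea in plain Python; `build_vocab` and `encode` are hypothetical helpers written for this illustration, not part of torchtext, and the toy corpus is made up.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, specials=("<pad>", "<unk>")):
    """Map each sufficiently frequent token to an index; specials come first."""
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials)
    itos += [tok for tok, n in counter.items()
             if n >= min_freq and tok not in specials]
    return {tok: i for i, tok in enumerate(itos)}

def encode(tokens, stoi, unk_token="<unk>"):
    """Convert tokens to indices, sending out-of-vocabulary tokens to <unk>."""
    unk = stoi[unk_token]
    return [stoi.get(tok, unk) for tok in tokens]

corpus = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "movie"]]
stoi = build_vocab(corpus, min_freq=2)
encoded = encode(["a", "terrible", "movie"], stoi)

print(stoi)     # {'<pad>': 0, '<unk>': 1, 'a': 2, 'movie': 3}
print(encoded)  # [2, 1, 3]
```

With this layout, the embedding layer would need one extra row for `<unk>` compared with the vocabulary built above.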

In [137]:
sequence_len = [len(tokenizer(seq)) for seq in imdb_data["review"]]
PADDED_LENGTH = max(sequence_len)
PADDED_LENGTH

2752

Create a custom `Dataset` and `DataLoader` to automatically handle mini-batching.

The `IMDBDataset` object should store each review as a sequence of vocabulary indices (a sparse representation) and map the sentiment into binary categories: `{0: "negative", 1: "positive"}`.

It also pads each review with `<pad>` tokens up to a fixed length and stores the original length. The padding will be handled later by the `pack_padded_sequence` and `pad_packed_sequence` functions, which need those lengths.

In [138]:
class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, data, vocab, tokenizer, max_len, pad_token):
        self.inputs = []
        self.targets = []
        self.lengths = []

        print("Loading data...")
        print("Padding inputs...")
        for sentence in data["review"]:
            tokenized_sentence = tokenizer(sentence)
            padded_sentence = [vocab[tk] for tk in tokenized_sentence]
            self.lengths.append(len(padded_sentence)) # Save length before padding
            padding_size = max_len - len(padded_sentence)
            padded_sentence.extend([vocab[pad_token]]*padding_size)
            self.inputs.append(padded_sentence)
        print("Finished padding inputs. Processing labels...")
        for label in data["sentiment"]:
            self.targets.append(0 if label == "negative" else 1)
        print("Done.")

    def __len__(self):
        return len(self.targets)
    
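    def __getitem__(self, idx):
        # Return the indices as a tensor so the default collate function stacks
        # them into a (batch, max_len) tensor instead of transposing a list.
        # Must also return original length before padding
        return (torch.tensor(self.inputs[idx], dtype=torch.long),
                self.targets[idx], self.lengths[idx])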
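Padding every review to the global maximum length makes each batch as long as the longest review in the corpus. A possible alternative, sketched below, is to store unpadded sequences and pad per batch inside a custom `collate_fn`. This is not what `IMDBDataset` does: `collate_batch`, `PAD_IDX`, and the toy samples are assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumed index of "<pad>" (it is the first special token)

def collate_batch(batch):
    """batch: list of (token_ids, target, length) with UNPADDED token_ids."""
    seqs, targets, lengths = zip(*batch)
    seqs = [torch.as_tensor(s, dtype=torch.long) for s in seqs]
    # Pad only up to the longest sequence in this batch
    padded = pad_sequence(seqs, batch_first=True, padding_value=PAD_IDX)
    targets = torch.tensor(targets, dtype=torch.float32)
    lengths = torch.tensor(lengths, dtype=torch.long)
    return padded, targets, lengths

# Toy samples standing in for tokenized reviews
toy = [([5, 8, 2], 1, 3), ([7, 4], 0, 2), ([9], 1, 1)]
loader = torch.utils.data.DataLoader(toy, batch_size=3, collate_fn=collate_batch)
inputs, targets, lengths = next(iter(loader))

print(inputs)
print(targets, lengths)
```

The targets are returned as floats because `nn.BCEWithLogitsLoss` expects floating-point targets.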

Instantiate the `DataLoader`s, splitting the training data into train/validation sets. For the train/validation split, refer to [this post](https://clay-atlas.com/us/blog/2021/08/25/pytorch-en-random-split-data-set/).

In [139]:
train_dataset = IMDBDataset(imdb_train, vocab, tokenizer, PADDED_LENGTH, pad_token)
test_dataset = IMDBDataset(imdb_test, vocab, tokenizer, PADDED_LENGTH, pad_token)

train_split = int(0.9 * len(train_dataset))
validation_split = len(train_dataset) - train_split
train_dataset, validation_dataset = torch.utils.data.random_split(train_dataset,
                                                                  [train_split, validation_split])

BATCH_SIZE = 64
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=False) # no need to shuffle for evaluation
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

Loading data...
Padding inputs...
Finished padding inputs. Processing labels...
Done.
Loading data...
Padding inputs...
Finished padding inputs. Processing labels...
Done.
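The split above draws from the global random state, so it changes between runs. If a reproducible split is wanted, `random_split` accepts a seeded generator. The sketch below uses a stand-in `TensorDataset` and an arbitrary seed of 42, both chosen only for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with 100 samples (the notebook uses IMDBDataset instead)
dataset = TensorDataset(torch.arange(100))

n_train = int(0.9 * len(dataset))
n_val = len(dataset) - n_train

# A seeded generator makes the split identical across runs
generator = torch.Generator().manual_seed(42)
train_set, val_set = random_split(dataset, [n_train, n_val], generator=generator)

print(len(train_set), len(val_set))  # 90 10
```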


Now we need to define our model:

[![](https://mermaid.ink/img/pako:eNpVz8tqAzEMBdBfMVq1kPzAFArNBLrpKiGruAt3dDsRHWtcPxYlyb9XeVCoVgadK8lHGmYGdTTmkA7ubfPk1Vm97D2JplY9vbvl8vnkqaSQCzyd3MqaiB9gFh0N3CKru2PojfXGXje7HtP0h_o7korsJuhDwXeDDni8Jv6jImOchS-dtc1KGSxDlVmv42hBETkGYTv_eEl6qgdE293Zk0P-8uT1bC60Om9_dKCu5oYFtcShYi3Bfh2p-wxTwfkXhtJVaQ)](https://mermaid.live/edit#pako:eNpVz8tqAzEMBdBfMVq1kPzAFArNBLrpKiGruAt3dDsRHWtcPxYlyb9XeVCoVgadK8lHGmYGdTTmkA7ubfPk1Vm97D2JplY9vbvl8vnkqaSQCzyd3MqaiB9gFh0N3CKru2PojfXGXje7HtP0h_o7korsJuhDwXeDDni8Jv6jImOchS-dtc1KGSxDlVmv42hBETkGYTv_eEl6qgdE293Zk0P-8uT1bC60Om9_dKCu5oYFtcShYi3Bfh2p-wxTwfkXhtJVaQ)

Some comments:
- Unlike the Word2Vec model implemented in class, we don't need an `nn.Linear` layer to map from embedding space to vocabulary space.
- We could use `nn.LSTMCell` instead of `nn.GRUCell`. The GRU is the newer architecture, and the two cells are used in nearly the same way (an LSTM cell additionally carries a cell state alongside the hidden state).
- We can later consider stacking several recurrent cells (to get a deep RNN), but for a first approach we should stick to this simpler model.
- We don't need to include `nn.Sigmoid` explicitly, since our loss function (`nn.BCEWithLogitsLoss`) applies it for us (with more numerical stability, [cf. the documentation](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)).
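To illustrate the last point, the sketch below checks on made-up logits that `nn.BCEWithLogitsLoss` gives the same value as applying a sigmoid followed by `nn.BCELoss`. The numbers are arbitrary and only serve as an example.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[-2.0], [0.5], [3.0]])  # raw model outputs, shape (batch, 1)
targets = torch.tensor([[0.0], [1.0], [1.0]])  # binary labels as floats

loss_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_manual = nn.BCELoss()(torch.sigmoid(logits), targets)
print(loss_logits.item(), loss_manual.item())

# Probabilities and hard predictions are only needed at evaluation time
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long()
print(preds.squeeze(1).tolist())  # [0, 1, 1]
```

Note that the targets must be floating-point tensors with the same shape as the logits.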

In [163]:
class GRUPredictor(nn.Module):

    def __init__(self, vocab_size, hidden_size, embed_dim, embed_max_norm=None):
        super().__init__()
        self.hidden_size = hidden_size # needed to build the initial hidden state
        self.embedding = nn.Embedding(vocab_size, embed_dim,
                                      max_norm=embed_max_norm)
        self.gru = torch.nn.GRUCell(embed_dim, hidden_size, bias=True)
        self.linear = torch.nn.Linear(hidden_size, 1)

    def forward(self, inputs):
        X = self.embedding(inputs) # batch, step -> batch, step, dim
        X = X.swapaxes(0, 1) # batch, step, dim -> step, batch, dim
        h = torch.zeros((X.shape[1], self.hidden_size),
                         device=X.device)
        # warm-up: feed the sequence one time step at a time
        T = X.shape[0]
        for t in range(T):
            h = self.gru(X[t], h)
        # prediction: one logit per sequence, from the final hidden state
        output = self.linear(h) # batch, 1
        return output
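The warm-up loop above also steps through the padded positions. One possible way to ignore padding when using `nn.GRUCell` is to freeze the hidden state of each sequence once its true length has been reached. The helper `run_gru_cell_masked` below is a hypothetical sketch of that idea, with toy sizes chosen for illustration; it is not part of the model defined above.

```python
import torch
import torch.nn as nn

def run_gru_cell_masked(cell, X, lengths):
    """X: (batch, time, dim) embedded inputs; lengths: (batch,) true lengths."""
    batch, T, _ = X.shape
    h = torch.zeros(batch, cell.hidden_size, device=X.device)
    for t in range(T):
        h_new = cell(X[:, t], h)
        # Update only the sequences that still have real tokens at step t
        active = (lengths > t).unsqueeze(1)  # (batch, 1)
        h = torch.where(active, h_new, h)
    return h

torch.manual_seed(0)
cell = nn.GRUCell(4, 6)
X = torch.randn(2, 5, 4)         # 2 sequences padded to 5 steps
lengths = torch.tensor([5, 2])   # the second sequence has only 2 real steps
h = run_gru_cell_masked(cell, X, lengths)
print(h.shape)  # (batch, hidden)
```

The returned hidden state can then be passed to the final linear layer exactly as in `forward`.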