# Lab 4. Text Classification with CNN

In this lab, we are finally going to put all the previous knowledge to use and train our first neural NLP model. In particular, we are going to read the dataset, preprocess it, and train a convolutional neural network that takes pretrained word vectors as inputs.

Since we are going to build a deep network and train it on 25,000 texts, it's recommended that you run this notebook on a CUDA GPU. If you don't have one at your disposal, you can run this notebook on Google Colab.

To do so, you just need to visit https://colab.research.google.com/, upload this notebook, and run it there. Google Colab will allocate a GPU for you for about twelve hours or until you leave it inactive for some period of time.

Also, if you are running this notebook on Google Colab, don't forget to go to `Runtime -> Change runtime type` and set `Runtime type` to `Python 3` and `Hardware acceleration` to `GPU`.

This lab is based on [this tutorial](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb), so you can always visit it to get more information. However, we changed the data loading part to make it more explicit and easier to reuse in other models. The original tutorial uses the `torchtext` package to load the data, but it is too high-level, which makes it difficult to understand what is going on under the hood. Additionally, a custom data loader gives us more flexibility in adapting to different tasks and datasets.

With the setup out of the way, we can proceed to building our classifier.

## Text classification

Text classification is one of the most popular NLP tasks. It can be used to predict the genre of a document or to establish the authorship of a text. In this lab, we are going to predict the sentiment of a text, i.e. try to guess whether a review is positive or negative.

Here is the description of a CNN classifier from the tutorial above:

> Traditionally, CNNs are used to analyse images and are made up of one or more convolutional layers, followed by one or more linear layers. The convolutional layers use filters (also called kernels or receptive fields) which scan across an image and produce a processed version of the image. This processed version of the image can be fed into another convolutional layer or a linear layer. Each filter has a shape, e.g. a 3x3 filter covers a 3 pixel wide and 3 pixel high area of the image, and each element of the filter has a weight associated with it, the 3x3 filter would have 9 weights. In traditional image processing these weights were specified by hand by engineers, however the main advantage of the convolutional layers in neural networks is that these weights are learned via backpropagation.

> The intuitive idea behind learning the weights is that your convolutional layers act like feature extractors, extracting parts of the image that are most important for your CNN's goal, e.g. if using a CNN to detect faces in an image, the CNN may be looking for features such as the existence of a nose, mouth or a pair of eyes in the image.

> So why use CNNs on text? In the same way that a 3x3 filter can look over a patch of an image, a 1x2 filter can look over a 2 sequential words in a piece of text, i.e. a bi-gram. In the previous tutorial we looked at the FastText model which used bi-grams by explicitly adding them to the end of a text, in this CNN model we will instead use multiple filters of different sizes which will look at the bi-grams (a 1x2 filter), tri-grams (a 1x3 filter) and/or n-grams (a 1x$n$ filter) within the text.

> The intuition here is that the appearance of certain bi-grams, tri-grams and n-grams within the review will be a good indication of the final sentiment.
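
To make the n-gram intuition concrete, here is a minimal NumPy sketch (not the model used later in this lab) that slides a single bi-gram filter over toy word vectors and then max-pools the result. The vectors, the filter weights, and the helper name `conv_over_words` are made up for illustration.

```python
import numpy as np

def conv_over_words(embedded, kernel):
    """Slide an (n, emb_dim) filter over a (sent_len, emb_dim) matrix."""
    n = kernel.shape[0]
    sent_len = embedded.shape[0]
    # One output value per window of n consecutive words
    return np.array([
        np.sum(embedded[i:i + n] * kernel)
        for i in range(sent_len - n + 1)
    ])

# Toy sentence: 5 words, each a 4-dimensional vector (made-up values)
embedded = np.arange(20, dtype=np.float32).reshape(5, 4)
# A bi-gram filter covers 2 words x 4 dimensions (here all weights are 1)
kernel = np.ones((2, 4), dtype=np.float32)

features = conv_over_words(embedded, kernel)
pooled = features.max()  # max-over-time pooling keeps the strongest bi-gram

print(features.shape)  # (4,), i.e. sent_len - n + 1
print(pooled)          # 124.0
```

A sentence of 5 words yields 4 bi-gram positions, and pooling reduces them to one number per filter, regardless of the sentence length.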

## Data

To train the classifier, we are going to use the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/). It contains 25,000 reviews for training and another 25,000 for testing. Both training and test sets contain 12,500 positive reviews and 12,500 negative reviews.

In the next steps, we are going to build our own custom dataloader to load, preprocess, split, and batch the data.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.sampler import SubsetRandomSampler
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import random
import numpy as np

from pathlib import Path
import time

# from: https://spacy.io/api/tokenizer
from spacy.lang.en import English
nlp = English()

# Use the default English tokenizer, including punctuation rules and exceptions.
# `nlp.tokenizer` is used because `Defaults.create_tokenizer` is no longer available in spaCy 3.
tokenizer = nlp.tokenizer

# Check if we are running on a CPU or GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device 

Run the cell below if you want to save the files and trained models on your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Let's download the vector file and unpack it into the `vector_cache/` folder. You can skip this step if you have already done it yourself.

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

In [None]:
!unzip wiki-news-300d-1M.vec.zip -d vector_cache/

Let's download the dataset and unpack it into the `data/` folder. You can skip this step if you have already done it yourself.

In [None]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

In [None]:
!mkdir data/
!tar -xzf aclImdb_v1.tar.gz -C data/

We are going to define some variables that we are going to need later. 

We will need the `<PAD>` and `<UNK>` symbols. `<PAD>` is needed to make the sentences in one batch have the same length: we are going to append this symbol to the end of each shorter sentence to equalize the lengths. `<UNK>` is needed to replace the words for which we don't have a pretrained vector.

We are also going to define the paths for our vector file and data folder, as well as the maximum number of vectors that we want to store.

In [None]:
PAD = '<PAD>'
PAD_ID = 0
UNK = '<UNK>'
UNK_ID = 1
VOCAB_PREFIX = [PAD, UNK]

VEC_PATH = Path('vector_cache') /  'wiki-news-300d-1M.vec'
DATA_PATH = Path('data') / 'aclImdb'
MAX_VOCAB = 25000

batch_size = 64
validation_split = .3
shuffle_dataset = True
random_seed = 42

First, let's prepare a vocabulary for our pretrained vectors. Since the input to our model should be an index of a word, we need to build it to map from words to indices.

Below, we define a `PretrainedWordVocab` class that is going to take a list of words and build a vocab based on it. We also define some methods that we are going to use:

- `normalize_unit()` to convert the word to lowercase if the `lower` argument is set to `True`.
- `unit2id()` to return the index of a word in the vocab or the `<UNK>` index otherwise.
- `id2unit()` to return a word given its index in the vocab.
- `map()` to return a list of indices given a list of words.
- `build_vocab()` to initialize the vocab.

In [None]:
class PretrainedWordVocab:
    def __init__(self, data, lower=False):
        self.data = data
        self.lower = lower
        self.build_vocab()
        
    def normalize_unit(self, unit):
        if self.lower:
            return unit.lower()
        else:
            return unit
        
    def unit2id(self, unit):
        unit = self.normalize_unit(unit)
        if unit in self._unit2id:
            return self._unit2id[unit]
        else:
            return self._unit2id[UNK]
    
    def id2unit(self, id):
        return self._id2unit[id]
    
    def map(self, units):
        return [self.unit2id(unit) for unit in units]
        
    def build_vocab(self):
        # self._id2unit - id to unit (add PAD and UNK)
        # self._unit2id - unit to id 
        ...

        
    def __len__(self):
        return len(self._unit2id)
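
The body of `build_vocab()` is left open above. One possible way to fill it in, shown here as a standalone sketch rather than the intended solution, is to put `<PAD>` and `<UNK>` first so that their positions match `PAD_ID` and `UNK_ID`, and then number the remaining words in order. The function name `build_vocab_sketch` and the toy word list are hypothetical.

```python
PAD, UNK = '<PAD>', '<UNK>'
VOCAB_PREFIX = [PAD, UNK]

def build_vocab_sketch(words, lower=False):
    """Return (id2unit, unit2id) with PAD at index 0 and UNK at index 1."""
    if lower:
        words = [w.lower() for w in words]
    id2unit = VOCAB_PREFIX + list(words)
    unit2id = {}
    for idx, unit in enumerate(id2unit):
        # Keep the first index if a word appears twice (e.g. after lowercasing)
        unit2id.setdefault(unit, idx)
    return id2unit, unit2id

id2unit, unit2id = build_vocab_sketch(['the', 'movie', 'great'])
# Unknown words fall back to the <UNK> index, as in unit2id() above
ids = [unit2id.get(w, unit2id[UNK]) for w in ['the', 'movie', 'awful']]
print(ids)  # [2, 3, 1]
```

Keeping the word order unchanged matters here, because row `i + 2` of the embedding matrix is expected to belong to word `i` of the list.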

Next, we need to create the `Pretrain` class to store the pretrained vectors and the vocab that we defined above. The vectors are going to be stored as a NumPy array.

In [None]:
class Pretrain:
    def __init__(self, vec_filename, max_vocab=-1):
        self._vec_filename = vec_filename
        self._max_vocab = max_vocab
        
    @property
    def vocab(self):
        if not hasattr(self, '_vocab'):
            self._vocab, self._emb = self.read()
        return self._vocab
    
    @property
    def emb(self):
        if not hasattr(self, '_emb'):
            self._vocab, self._emb = self.read()
        return self._emb
        
    def read(self):
        if self._vec_filename is None:
            raise Exception("Vector file is not provided.")
        print(f"Reading pretrained vectors from {self._vec_filename}...")
        
        words, emb, failed = self.read_from_file(self._vec_filename, open_func=open)
        
        if failed > 0: # recover failure
            emb = emb[:-failed]
        if len(emb) - len(VOCAB_PREFIX) != len(words):
            raise Exception("Loaded number of vectors does not match number of words.")
            
        # Use a fixed vocab size
        if self._max_vocab > len(VOCAB_PREFIX) and self._max_vocab < len(words):
            words = words[:self._max_vocab - len(VOCAB_PREFIX)]
            emb = emb[:self._max_vocab]
                
        vocab = PretrainedWordVocab(words, lower=True)
        
        return vocab, emb
        
    def read_from_file(self, filename, open_func=open):
        """
        Open a vector file using the provided function and read from it.
        """
        first = True
        words = []
        failed = 0
        with open_func(filename, 'rb') as f:
            for i, line in enumerate(f):
                try:
                    line = line.decode()
                except UnicodeDecodeError:
                    failed += 1
                    continue
                if first:
                    # the first line contains the number of word vectors and the dimensionality
                    first = False
                    line = line.strip().split(' ')
                    rows, cols = [int(x) for x in line]
                    emb = np.zeros((rows + len(VOCAB_PREFIX), cols), dtype=np.float32)
                    continue

                line = line.rstrip().split(' ')
                emb[i+len(VOCAB_PREFIX)-1-failed, :] = [float(x) for x in line[-cols:]]
                words.append(' '.join(line[:-cols]))
        return words, emb, failed
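
`read_from_file()` expects the plain-text `.vec` layout: a header line with the number of vectors and their dimensionality, followed by one word per line with its values. The sketch below parses a tiny made-up file held in memory in the same way. The three-line sample and the function name `parse_vec` are illustrative assumptions, and the two leading all-zero rows stand for `<PAD>` and `<UNK>`.

```python
import io
import numpy as np

N_SPECIAL = 2  # rows reserved for <PAD> and <UNK>

def parse_vec(stream):
    """Parse a text .vec stream into (words, embedding matrix)."""
    rows, cols = (int(x) for x in stream.readline().decode().split())
    emb = np.zeros((rows + N_SPECIAL, cols), dtype=np.float32)
    words = []
    for i, raw in enumerate(stream):
        parts = raw.decode().rstrip().split(' ')
        # The last `cols` fields are numbers; everything before them is the word
        emb[i + N_SPECIAL] = [float(x) for x in parts[-cols:]]
        words.append(' '.join(parts[:-cols]))
    return words, emb

sample = b"2 3\nfilm 0.1 0.2 0.3\nbad -0.5 0.0 0.5\n"
words, emb = parse_vec(io.BytesIO(sample))
print(words)      # ['film', 'bad']
print(emb.shape)  # (4, 3)
```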

Finally, we need to define the dataset class `IMDBDataSet` that is going to load and preprocess our data files. Inside the data folder, we have `train` and `test` folders, each with `neg` and `pos` folders inside of them. Each of these folders stores every review as a separate file.

We are going to iterate through each file inside these folders, read the text, tokenize it with the [spaCy tokenizer](https://spacy.io/api/tokenizer) and replace the words with their indices using the `PretrainedWordVocab` that we created earlier.

We also need our custom class to inherit from the `torch.utils.data.Dataset` class. Finally, we need to define the `__len__()` method to report how big our dataset is and the `__getitem__()` method to get one sample at a given index.

In [None]:
class IMDBDataSet(Dataset):
    def __init__(self, pretrain, data_folder='.data', test=False):
        self.pretrain_vocab = pretrain.vocab
        self.label_vocab = {'neg': 0, 'pos': 1}
        
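        # Accept both str and Path so that the `/` operator below always works
        data_folder = Path(data_folder)
        if test:
            self.data_folder = data_folder / 'test'
        else:
            self.data_folder = data_folder / 'train'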
            
        self.data = []
        
        if self.data_folder.exists():
            self.load()
        else:
            raise ValueError("Data path doesn't exist!")
        
    def load(self):
        for label in ['pos', 'neg']:
            print(f'Reading {label} sentences...')
            ...
            
                
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]
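
The body of `load()` is elided above. As a rough illustration of the folder walk described earlier, the sketch below builds a tiny fake `train/pos` and `train/neg` tree in a temporary directory and reads it back as `(tokens, label)` pairs. The sample files, the use of `str.split()` in place of the spaCy tokenizer, and the name `load_reviews` are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

LABELS = {'neg': 0, 'pos': 1}

def load_reviews(folder, tokenize=str.split):
    """Read every *.txt file under folder/pos and folder/neg."""
    samples = []
    for label in ['pos', 'neg']:
        for path in sorted((folder / label).glob('*.txt')):
            text = path.read_text(encoding='utf-8')
            samples.append((tokenize(text), LABELS[label]))
    return samples

with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / 'train'
    for label, text in [('pos', 'a great film'), ('neg', 'a dull film')]:
        (train / label).mkdir(parents=True)
        (train / label / '0_1.txt').write_text(text, encoding='utf-8')
    samples = load_reviews(train)

print(samples)  # [(['a', 'great', 'film'], 1), (['a', 'dull', 'film'], 0)]
```

In the lab itself, the tokens would additionally be mapped to indices with the pretrained vocab before being stored in `self.data`.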

Additionally, we need to define a function that pads all the sentences in a batch to the same length. To do this, we first find the longest sequence in the batch and use its length to create a torch tensor of size `(batch_size, max_len)` filled with `0`, which is our padding id. Then, we copy each sequence into the beginning of the corresponding row of this new batch tensor. Don't forget that the `nn.Embedding` layer that we are going to use later requires indices to be of type `long`.

We are also going to put the labels, with `0` corresponding to `negative` and `1` to `positive`, into a `labels` tensor of length `batch_size`. To be able to use them to calculate the loss, each label must be of type `float`.

Finally, don't forget to move all the tensors to the current device with `.to(device)`.

In [None]:
def pad_sequences(batch):
  ...
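
A possible shape for this function is sketched below. It assumes that every dataset item is a `(token_ids, label)` pair, which the source does not specify because `load()` is elided; the name `pad_sequences_sketch` is likewise hypothetical.

```python
import torch

PAD_ID = 0
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def pad_sequences_sketch(batch):
    """Collate a list of (token_ids, label) pairs into two padded tensors."""
    max_len = max(len(ids) for ids, _ in batch)
    # Start from a tensor full of the padding id, then copy each sequence in
    padded = torch.full((len(batch), max_len), PAD_ID, dtype=torch.long)
    for row, (ids, _) in enumerate(batch):
        padded[row, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    labels = torch.tensor([label for _, label in batch], dtype=torch.float)
    return padded.to(device), labels.to(device)

texts, labels = pad_sequences_sketch([([5, 6, 7], 1), ([8], 0)])
print(texts.tolist())   # [[5, 6, 7], [8, 0, 0]]
print(labels.tolist())  # [1.0, 0.0]
```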


Now, we can finally load our data and pretrained vectors. It will take some time...

In [None]:
pretrain = Pretrain(VEC_PATH, MAX_VOCAB)

In [None]:
train_data = IMDBDataSet(pretrain, DATA_PATH)

In [None]:
test_data = IMDBDataSet(pretrain, DATA_PATH, test=True)

The last step in our data preparation is to define the train and validation splits. We are going to use the validation set to see how the model performs during the training. It is important to be able to see if the model is overfitting or not.

To do that, we will just create a list of indices from `0` up to the size of the training data. Then, we are going to define the index at which we split the data. Optionally, we can shuffle our indices before splitting.

With these indices for train and validation datasets, we are going to create two corresponding `torch.utils.data.SubsetRandomSampler` objects that we are going to pass to the `torch.utils.data.DataLoader` objects in the next step.

In [None]:
# Creating data indices for training and validation splits:
dataset_size = len(train_data)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)
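
On a toy dataset of 10 items, the same arithmetic gives 3 validation indices and 7 training indices. The sketch below only re-runs the split logic on that made-up size to show that the two index lists are disjoint and together cover the whole dataset.

```python
import numpy as np

dataset_size = 10          # toy size, not the real 25,000
validation_split = 0.3
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))  # 3

np.random.seed(42)
np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

print(len(train_indices), len(val_indices))  # 7 3
# Every index lands in exactly one of the two lists
print(sorted(train_indices + val_indices) == list(range(dataset_size)))  # True
```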

Here, for each set, we are going to create a `DataLoader` object that is going to create a batch iterator for us. We will pass it our `IMDBDataSet` object as the source of data and the batch size as the `batch_size` argument. To specify the train and validation splits, we are going to pass the corresponding `SubsetRandomSampler` objects as the `sampler` argument of the two loaders built on the training data. Finally, we need to pass our `pad_sequences()` function as the `collate_fn` argument to tell the data loader how to prepare the batches so that all sequences in a batch have the same length.

In [None]:
train_loader = DataLoader() 
validation_loader = DataLoader()
test_loader = DataLoader()
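
The three `DataLoader()` calls above are left empty. A sketch of how the arguments described in this section fit together is shown below on a toy list-based dataset. The dataset contents, the `collate` helper, and the names with the `toy_` prefix are assumptions for illustration; in the lab you would pass `train_data`, `test_data`, and `pad_sequences` instead.

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

# Toy dataset: a plain list of (token_ids, label) pairs acts as a map-style dataset
toy_data = [([i + 2] * (i % 3 + 1), i % 2) for i in range(10)]

def collate(batch):
    """Pad each batch to its longest sequence with id 0."""
    max_len = max(len(ids) for ids, _ in batch)
    texts = torch.zeros(len(batch), max_len, dtype=torch.long)
    for row, (ids, _) in enumerate(batch):
        texts[row, :len(ids)] = torch.tensor(ids)
    labels = torch.tensor([float(label) for _, label in batch])
    return texts, labels

toy_train_sampler = SubsetRandomSampler(list(range(3, 10)))
toy_valid_sampler = SubsetRandomSampler(list(range(0, 3)))

# `sampler` selects the split; `collate_fn` builds the padded batch
toy_train_loader = DataLoader(toy_data, batch_size=4,
                              sampler=toy_train_sampler, collate_fn=collate)
toy_valid_loader = DataLoader(toy_data, batch_size=4,
                              sampler=toy_valid_sampler, collate_fn=collate)

n_train = sum(texts.shape[0] for texts, _ in toy_train_loader)
n_valid = sum(texts.shape[0] for texts, _ in toy_valid_loader)
print(n_train, n_valid)  # 7 3
```

The test set is not split, so its loader would need only the dataset, `batch_size`, and `collate_fn`.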

The model description and the code below are taken from [the Build the Model section of this tutorial](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb). Please refer to it for the necessary details.

In [None]:
class CNN(nn.Module):
    def __init__(self, pretrain, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, 
                 dropout, pad_idx):
        
        super().__init__()
                
        self.embedding = nn.Embedding.from_pretrained(
            torch.from_numpy(pretrain.emb), 
            padding_idx=pad_idx, 
            freeze=True
        )
        
        self.convs = nn.ModuleList([
                                    nn.Conv2d(in_channels = 1, 
                                              out_channels = n_filters, 
                                              kernel_size = (fs, embedding_dim)) 
                                    for fs in filter_sizes
                                    ])
        
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):           
        #text = [batch size, sent len]

        embedded = self.embedding(text)     
        #embedded = [batch size, sent len, emb dim]
        
        embedded = embedded.unsqueeze(1)  
        #embedded = [batch size, 1, sent len, emb dim]
        
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]    
        #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
                
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]     
        #pooled_n = [batch size, n_filters]
        
        cat = self.dropout(torch.cat(pooled, dim = 1))
        #cat = [batch size, n_filters * len(filter_sizes)]
            
        return self.fc(cat)

In [None]:
INPUT_DIM = len(pretrain.vocab)
EMBEDDING_DIM = pretrain.emb.shape[1]
N_FILTERS = 100
FILTER_SIZES = [3,4,5]
OUTPUT_DIM = 1
DROPOUT = 0.5

model = CNN(pretrain, INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_ID)

In [None]:
optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
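
`nn.BCEWithLogitsLoss` applies a sigmoid to the raw model output and then computes binary cross-entropy, which is why `OUTPUT_DIM` is `1` and the labels have to be floats. The plain-Python sketch below re-computes that formula for single values to show its behaviour; `bce_with_logits` is an illustrative helper, not part of PyTorch.

```python
import math

def bce_with_logits(logit, target):
    """-[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))] for one example."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

# A logit of 0 means "50/50", so the loss is log(2) for either label
print(round(bce_with_logits(0.0, 1.0), 4))  # 0.6931
# A confident correct prediction is cheap, a confident wrong one is expensive
print(bce_with_logits(3.0, 1.0) < bce_with_logits(3.0, 0.0))  # True
```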

In [None]:
def binary_accuracy(preds, y): 
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        ...
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0

    ...
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
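
The loop bodies of `train()` and `evaluate()` are left open above. The sketch below shows one common way such loops are written in PyTorch. The function names `train_sketch` and `evaluate_sketch`, the assumption that each batch is a `(texts, labels)` pair, and the toy linear model used to exercise them are not from the source. Accuracy tracking is omitted to keep the example short.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_sketch(model, iterator, optimizer, criterion):
    """One pass over the data with gradient updates; returns the mean batch loss."""
    epoch_loss = 0.0
    model.train()  # enable dropout
    for texts, labels in iterator:
        optimizer.zero_grad()
        predictions = model(texts).squeeze(1)  # [batch, 1] -> [batch]
        loss = criterion(predictions, labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def evaluate_sketch(model, iterator, criterion):
    """One pass without gradient updates; returns the mean batch loss."""
    epoch_loss = 0.0
    model.eval()  # disable dropout
    with torch.no_grad():
        for texts, labels in iterator:
            predictions = model(texts).squeeze(1)
            epoch_loss += criterion(predictions, labels).item()
    return epoch_loss / len(iterator)

# Toy stand-ins: a linear model on 4 random features and two fixed batches
torch.manual_seed(0)
toy_model = nn.Linear(4, 1)
toy_batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,)).float())
               for _ in range(2)]
toy_optimizer = optim.Adam(toy_model.parameters())
toy_criterion = nn.BCEWithLogitsLoss()

before = evaluate_sketch(toy_model, toy_batches, toy_criterion)
for _ in range(20):
    train_sketch(toy_model, toy_batches, toy_optimizer, toy_criterion)
after = evaluate_sketch(toy_model, toy_batches, toy_criterion)
print(before, after)
```

The only structural difference between the two functions is that evaluation switches to `model.eval()` and wraps the loop in `torch.no_grad()`, so no gradients are computed or applied.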

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_loader, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validation_loader, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'imdb_cnn_classifier.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

In [None]:
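# Evaluate the checkpoint with the best validation loss saved during training
model.load_state_dict(torch.load('imdb_cnn_classifier.pt'))
start_time = time.time()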
test_loss, test_acc = evaluate(model, test_loader, criterion)
end_time = time.time()
epoch_mins, epoch_secs = epoch_time(start_time, end_time)

print(f'Epoch: test | Epoch Time: {epoch_mins}m {epoch_secs}s')
print(f'\tTest Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

## User input

Once we have trained our model, we can try to predict the sentiment of our own input. We are going to define the `predict_sentiment()` function that is going to take our trained model and a sentence as arguments.

First, we need to switch the model to evaluation mode by calling `model.eval()` on it. Then, we are going to tokenize the sentence the same way as we tokenized the training data. If the sentence is shorter than the `min_len` parameter, we are going to add padding symbols to it so that our model doesn't throw an error. After that, we turn the words into indices with the same vocabulary that we built for training. Finally, we transform the result into a tensor and add an extra dimension at the beginning, imitating a batch of size 1.

As we remember from the training part, `0` was a negative sentiment and `1` was positive. Thus, the closer our prediction is to `0`, the more negative the sentiment, and the closer it is to `1`, the more positive.

In [None]:
def predict_sentiment(model, sentence, min_len = 5):
    ...
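
The preprocessing steps described above can be sketched in plain Python as follows. The toy vocabulary, the whitespace tokenizer standing in for spaCy, and the name `prepare_indices` are assumptions. In the lab, the resulting list would then be wrapped in a `long` tensor, given a batch dimension with `unsqueeze(0)`, and passed through the model and `torch.sigmoid`.

```python
PAD, UNK = '<PAD>', '<UNK>'
toy_vocab = {PAD: 0, UNK: 1, 'this': 2, 'film': 3, 'is': 4, 'great': 5}

def prepare_indices(sentence, vocab, min_len=5, tokenize=str.split):
    """Tokenize, pad up to min_len, and map tokens to vocabulary indices."""
    tokens = [tok.lower() for tok in tokenize(sentence)]
    if len(tokens) < min_len:
        # The largest filter spans 5 words, so shorter inputs must be padded
        tokens += [PAD] * (min_len - len(tokens))
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

print(prepare_indices('This film rocks', toy_vocab))  # [2, 3, 1, 0, 0]
```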

In [None]:
predict_sentiment(model, "This film is so bad that I had to wash my eyes with bleach after watching it")

In [None]:
predict_sentiment(model, "After watching this movie, I felt that I'm in heaven")

## References

- [Github tutorial on CNN](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb)
- [Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics.](https://dl.acm.org/doi/10.5555/2002472.2002491)
- [Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016). FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.](https://arxiv.org/abs/1612.03651)
