<img src="https://drive.google.com/uc?id=1dFgNX9iQUfmBOdmUN2-H8rPxL3SLXmxn" width="400"/>


---


# NLP: Sentiment Analysis with GloVe

### Exercise objectives:
- Learn how to embed data with GloVe (and similar embeddings such as Word2Vec)
- Build a data pipeline to prepare text data
- Train a simple LSTM model for sentiment analysis

<hr>


# The data

Today we will use the IMDB movie review dataset. It is a classic dataset that can be downloaded straight from PyTorch (through torchtext). It contains reviews of different movies, as well as a target: 1 for a bad review, 2 for a good review. The goal is to train an LSTM model to predict whether a review is good or bad.

# Installing Dependencies

But first, let's install a few libraries that you are unlikely to have if you are running this on a Colab instance:

In [None]:
!pip install torchdata

In [None]:
!pip install nltk

## Download NLTK files

We also need to download a few key files for NLTK (this could take a minute or two to run):

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Importing torch 

Let's import torch and a few common dependencies, and make sure we set our device to the GPU when one is available (the code falls back to the CPU otherwise):

In [None]:
import numpy as np
import pandas as pd

In [None]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

# Importing and visualizing the data

Running the cell below will import the IMDB movie reviews into your notebook, and we can then visualize some of them. We can import the reviews straight from torchtext's datasets module.

In [None]:
from torchtext import data,datasets
from sklearn.model_selection import train_test_split
import torchdata

train_dataset, test_dataset  = datasets.IMDB(root = '.data', split = ('train', 'test'))

# Split the original test set 80/20 into a true test set and a true validation set (both as lists):
test_dataset, val_dataset = train_test_split(list(test_dataset), train_size=.8)

## Let's see the first 5 reviews

Note that each item is a (label, text) tuple: 1 means a bad review, 2 means a good review.

In [None]:
val_dataset[:5]

## Choosing one example

We will need one example of text to prepare it. So, let's take the first review of your validation set:

In [None]:
example = val_dataset[0][1]
example

# Time to play with GloVe

We want to embed our text with a pre-trained model. In the lecture, we talked about Word2Vec, which is one of the best embeddings. But Word2Vec is harder to implement in PyTorch because it is not offered as a pre-trained layer. Instead, we will play with GloVe, which is very similar to Word2Vec in concept and is readily available in PyTorch.

## Download GloVe

There are different versions of GloVe, depending on the size of the vector you want to use for your embedding, and the size of your vocabulary. Here, to make things as fast as possible, let's download the version with vectors embedded in 50 dimensions, and keep only the first 20000 vectors. This is still a large file (862 MB), so on my system this took almost 3 minutes to run: be patient!

In [None]:
from torchtext.vocab import GloVe

glove = GloVe(dim='50', name='6B', max_vectors=20000)

## Words to vectors

Now that we have GloVe downloaded, we can see how it works. We can, for instance, very easily find the vector for any word. Here is how we can find the representation of the word 'king':

In [None]:
glove["king"]

## GloVe tokens (index)

Importantly, we will need to obtain the index (or token) of a word in GloVe. This is because we will transform our sentence into integer tokens before passing it to the network, and these integer tokens will need to correspond to the indices of the vectors in our embedding layer.

There are two useful functions for us:
- string to integer (stoi)
- integer to string (itos)

They do pretty much what is written on the tin:

In [None]:
glove.stoi['king']

In [None]:
glove.itos[691]
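
To make the relationship between the two lookups concrete, here is a minimal sketch with a toy vocabulary. The five words and their positions are made up for illustration and are not the real GloVe indices; the sketch only assumes that `itos` behaves like a list of words and `stoi` like the inverse dictionary.

```python
# Toy stand-in for the GloVe vocabulary (hypothetical words, not the real indices).
itos = ["the", "movie", "was", "great", "king"]

# stoi is simply the inverse mapping: word -> position in itos.
stoi = {word: index for index, word in enumerate(itos)}

index = stoi["king"]
print(index)        # 4
print(itos[index])  # king

# Words outside the vocabulary have no entry, so look them up defensively.
print(stoi.get("zyzzyva"))  # None
```

The last line matters for the preprocessing exercise: a plain `stoi[word]` lookup on a dictionary raises a `KeyError` for unknown words, so checking membership first (or using `.get`) avoids that.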

## Feel free to play with vectors!

Explore the embedding, and how each word is represented...

In [None]:
glove['queen']
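
One common way to explore an embedding is to compare vectors with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors rather than real GloVe values, so the numbers themselves mean nothing; it only illustrates the calculation.

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, invented for this example.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "pizza": [0.1, 0.0, 0.9],
}

royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
food = cosine_similarity(toy_vectors["king"], toy_vectors["pizza"])
print(f"king vs queen: {royal:.3f}")
print(f"king vs pizza: {food:.3f}")
```

With the real embedding you could try the same comparison on tensors, for example with `torch.nn.functional.cosine_similarity(glove['king'], glove['queen'], dim=0)`.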

# Exercise 1: Text preprocessing

Your first exercise will be to build a pre-processing pipeline. You can use the code in the lecture to help you do that. Also, I have already created the function signatures below to help you understand what is needed. Each function is applied to a single review, not a batch of reviews (you will see this below).

The transformations you need to do are the following:
1. Text needs to be all lower case (all words in GloVe are lower case only)
2. You need to remove numbers - they are not helpful for sentiment analysis
3. Remove punctuation (also not needed)
4. Transform the sentence into word tokens using NLTK. Now your sentence becomes a list of words.
5. Remove stopwords using NLTK
6. Get the index of the word in GloVe. If the word exists in the dictionary, keep it in the list. If not, just don't add it. This is a little bit more challenging, so consult the solution if you are unsure.
7. We also need to pad our sentence to MAX_LEN (the maximum length of the sentences). Notice below that I set this to be 100 words: we don't really need more than that, and we are getting good results (well, decent results) with 100 words. But all tensors need to be the same length, so pad_sentence is here to add zeros to those sentences that are shorter than 100 words. Again, consult the solution if you struggle with this one.

I have left the last function in for you: transform_text calls each of the other functions one after another, and transforms the text.

Make sure that you obtain a tensor of length 100, with all the right tokens, when you call transform_text on your example sentence from above. Then continue with the exercise.
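
If steps 2, 3, 6 and 7 feel abstract, the following sketch walks through them on plain Python lists. It is not the course solution: the helper names, the tiny vocabulary, and `TOY_MAX_LEN` are all hypothetical, `str.split` stands in for NLTK's `word_tokenize`, and truncating long reviews is an assumption on my part (the text above only mentions padding short ones).

```python
import re
import string

TOY_MAX_LEN = 6  # hypothetical; the notebook uses MAX_LEN = 100
toy_stoi = {"the": 0, "movie": 1, "great": 2, "acting": 3}  # hypothetical vocabulary

def strip_numbers(txt):
    # Drop every run of digit characters.
    return re.sub(r"\d+", "", txt)

def strip_punctuation(txt):
    # Delete all ASCII punctuation characters.
    return txt.translate(str.maketrans("", "", string.punctuation))

def lookup_indices(words, stoi):
    # Keep only the words the vocabulary knows about.
    return [stoi[w] for w in words if w in stoi]

def pad_indices(indices, max_len):
    # Truncate long sequences (assumption), right-pad short ones with zeros.
    indices = indices[:max_len]
    return indices + [0] * (max_len - len(indices))

text = "Great movie, 10/10 acting!".lower()
words = strip_punctuation(strip_numbers(text)).split()
print(words)    # ['great', 'movie', 'acting']

indices = pad_indices(lookup_indices(words, toy_stoi), TOY_MAX_LEN)
print(indices)  # [2, 1, 3, 0, 0, 0]
```

In the notebook itself, `pad_sentence` receives a tensor rather than a list, so `torch.nn.functional.pad` (already imported below as `pad`) is the natural tool for the padding step.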


In [None]:
MAX_LEN = 100

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer
import string 
from torch.nn.functional import pad 

def remove_numbers(txt):
    pass

def remove_punctuation(txt):
    pass

def tokenize(txt):
    pass

def remove_stopwords(word_tokens):
    pass


def get_index(txt, vocab=glove):
    pass

def pad_sentence(txt):
    pass
        

def transform_text(txt):
    txt = txt.lower()
    txt = remove_numbers(txt)
    txt = remove_punctuation(txt)
    txt = tokenize(txt)
    txt = remove_stopwords(txt)
    txt = torch.tensor(get_index(txt)).long()
    return pad_sentence(txt)

In [None]:
transform_text(example)

# Preparing the dataset

Now that you have written the helper functions, we can focus on preparing the dataset. First, I will create the inputs and the labels for the train, test, and validation splits using list comprehensions. These are plain tensors, not dataloaders, and they will be useful to assess the loss and the accuracy of each one of our splits. But they won't be served in batches.

Run the code below. Beware, due to the size of the dataset this can be a bit of a lengthy cell to run!

In [None]:
# Labels come as 1/2, so we subtract 1 to obtain 0 (bad) / 1 (good)
train_y = torch.tensor([item[0] for item in list(train_dataset)])-1
train_x = torch.stack([transform_text(item[1]) for item in list(train_dataset)])

# The validation tensors must be built from val_dataset, not test_dataset
val_y = torch.tensor([item[0] for item in list(val_dataset)])-1
val_x = torch.stack([transform_text(item[1]) for item in list(val_dataset)])

test_y = torch.tensor([item[0] for item in list(test_dataset)])-1
test_x = torch.stack([transform_text(item[1]) for item in list(test_dataset)])


## Dataloader

To be able to run batches on the GPU, we will need to save the dataset in a DataLoader class. We will also create a function called vectorize_batch, that will allow us to vectorise our batches on the go (i.e. batch by batch). Take note of how I wrote the train_loader: I use the dataset, define the batch size (256), and collate function (my vectorizer), and (very important!) I set shuffle to 'True'.

If shuffle is not set to True, the batches are served in the original dataset order rather than randomly, which leads to poor performance during training:

In [None]:
from torchtext.data import to_map_style_dataset
from torch.utils.data import DataLoader

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    
    X_embedded = torch.stack([transform_text(txt) for txt in X])
    
    return X_embedded, torch.tensor(Y).long()-1 

train_dataset = to_map_style_dataset(train_dataset)

train_loader = DataLoader(train_dataset, batch_size=256, collate_fn=vectorize_batch, shuffle=True)
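
The first line of `vectorize_batch` can look cryptic, so here is what the unzip step does on its own. The three reviews below are a hypothetical mini-batch in the same (label, text) layout as the IMDB tuples.

```python
# Hypothetical mini-batch in the same (label, text) layout as the IMDB tuples.
batch = [(1, "awful film"), (2, "loved it"), (2, "a classic")]

# zip(*batch) transposes the list of pairs into one tuple of labels and one of texts.
labels, texts = zip(*batch)
print(labels)  # (1, 2, 2)
print(texts)   # ('awful film', 'loved it', 'a classic')

# Shift the labels from {1, 2} to {0, 1}, which is what a binary loss expects.
targets = [label - 1 for label in labels]
print(targets)  # [0, 1, 1]
```

The DataLoader calls the collate function once per batch, which is why the text transformation happens on the fly rather than up front.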

Let's quickly check that the dataloader works. You should obtain a size of [256, 100] for X, and [256] for Y:

In [None]:
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break

# The embedding layer

I have written a function for you that creates an embedding layer. We will pass to it the vectors of our GloVe object, and from this, we will know the number of embeddings (20000) and the embedding dimensions (50). 

We then pass the weights of the GloVe vocabulary to our embedding layer. By default (non_trainable=True) the weights are frozen, so we won't try to relearn them for this task. But we can also choose to start with the GloVe weights and then update them to suit our vocabulary by passing non_trainable=False, which is what the network below does.

This is a classic example of transfer learning.

In [None]:
from torch import nn

def create_emb_layer(weights_matrix, non_trainable=True):
    # weights_matrix has shape (num_embeddings, embedding_dim)
    num_embeddings, embedding_dim = weights_matrix.size()
    emb_layer = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
    emb_layer.load_state_dict({'weight': weights_matrix})
    if non_trainable:
        emb_layer.weight.requires_grad = False

    return emb_layer, num_embeddings, embedding_dim
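
For comparison, PyTorch also provides `nn.Embedding.from_pretrained`, which loads a weight matrix and controls trainability through its `freeze` argument. The sketch below uses a made-up 4 x 3 weight matrix instead of the GloVe vectors; it is an alternative illustration, not a replacement for the function above.

```python
import torch
from torch import nn

# Hypothetical 4-word vocabulary with 3-dimensional vectors (GloVe here is 20000 x 50).
toy_weights = torch.tensor([
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
])

# freeze=True keeps the pre-trained vectors fixed during training.
frozen = nn.Embedding.from_pretrained(toy_weights, freeze=True, padding_idx=0)
# freeze=False starts from the same vectors but lets the optimizer update them.
tunable = nn.Embedding.from_pretrained(toy_weights.clone(), freeze=False, padding_idx=0)

tokens = torch.tensor([[1, 3, 0]])  # one sentence of three token indices
print(frozen(tokens).shape)  # torch.Size([1, 3, 3])
print(frozen.weight.requires_grad, tunable.weight.requires_grad)  # False True
```

Each token index simply selects one row of the weight matrix, so a batch of shape (batch, seq_len) becomes (batch, seq_len, embedding_dim).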

# Network architecture

We will use an LSTM model, as you did on Friday. But this time, we will use the LSTM layer from nn.LSTM: yes, you do not need to write it from scratch!

In fact, you don't need to write anything. I wrote the network for you, as this is very time consuming and we only have half a day for theory and practice today.

Note the use of the embedding layer, as well as the fact that I use 2 LSTM layers.

In [None]:
from torch import nn

class LSTM(nn.Module):
    def __init__(self, hid_dim, output_dim):
        super(LSTM, self).__init__()
        
        self.hid_dim = hid_dim
        
        self.embedding, num_embeddings, embedding_dim = create_emb_layer(glove.vectors, False)
        
        n_layers = 2

        self.lstm = nn.LSTM(embedding_dim, hid_dim, n_layers,dropout=0, batch_first=True)
        self.linear = nn.Linear(hid_dim,100)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(100, output_dim)
        self.dropout = nn.Dropout(.5)
        
        self.reset_parameters()
        
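    def reset_parameters(self):
        std= 1.0 / np.sqrt(self.hid_dim)
        
        # Skip the embedding so the pre-trained GloVe vectors are not overwritten
        for name, w in self.named_parameters():
            if not name.startswith('embedding'):
                w.data.uniform_(-std, std)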
        

    def forward(self, text):

        # (batch, seq_len) token indices -> (batch, seq_len, embedding_dim)
        embedded = self.embedding(text)

        # outputs holds the top-layer hidden state at every time step
        outputs, (hidden, cell) = self.lstm(embedded)

        # Keep only the last time step: (batch, hid_dim)
        outputs = outputs[:, -1]
        
        prediction = self.fc(self.dropout(self.relu(self.linear(outputs))))

        return prediction
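
To see which shapes flow through the LSTM part of the network, here is a small standalone sketch. It reuses the dimensions from this notebook (sequence length 100, embedding size 50, 20 hidden units, 2 layers) but feeds random numbers in place of real embeddings, and the batch size of 4 is arbitrary.

```python
import torch
from torch import nn

batch_size, seq_len, embedding_dim, hid_dim = 4, 100, 50, 20

lstm = nn.LSTM(embedding_dim, hid_dim, num_layers=2, batch_first=True)
embedded = torch.randn(batch_size, seq_len, embedding_dim)  # stand-in for self.embedding(text)

outputs, (hidden, cell) = lstm(embedded)
print(outputs.shape)  # torch.Size([4, 100, 20]): top layer, every time step
print(hidden.shape)   # torch.Size([2, 4, 20]): last time step, every layer

last_step = outputs[:, -1]
print(last_step.shape)  # torch.Size([4, 20])

# For a unidirectional LSTM, the last time step of the top layer equals hidden[-1].
print(torch.allclose(last_step, hidden[-1]))  # True
```

This is why `outputs[:, -1]` can be passed straight to a linear layer with `hid_dim` input features.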

# Training and validation functions

Below are my training and validation functions. Take a moment to study them, and then run the code.

In [None]:
from tqdm import tqdm
from sklearn.metrics import accuracy_score
import torch.nn.functional as F
import gc

def CalcValLossAndAccuracy(model, loss_fn, val_X, val_Y):
    
    losses = []
    accuracies = []
    title = 'Validation'
    model.eval()
    with torch.no_grad():
        X = val_X.to(device)
        Y = val_Y.to(device)
            
        # The model returns logits of shape (N, 1); squeeze to (N,)
        outputs = model(X).squeeze()
        loss = loss_fn(outputs, Y.float())
            
        # Logits -> probabilities -> hard 0/1 predictions
        preds = [1 if p>=.5 else 0 for p in torch.sigmoid(outputs)]
        accuracy = accuracy_score(Y.detach().cpu().numpy().tolist(),preds)
            
        accuracies.append(accuracy)
        # Store a plain float rather than a tensor living on the device
        losses.append(loss.item())

        
        print(f'{title} Loss : {loss:.3f}')
        print(f"{title} Accuracy  : {accuracy:.3f}")
    
    return losses, accuracies


def TrainModel(model, loss_fn, optimizer, train_loader, epochs=10):
    train_losses = []
    train_accuracy = []
    val_losses = []
    val_accuracy = []
    
    for i in range(1, epochs+1):
        
        print('-'*100)
        print(f'EPOCH {i}')
        print('-'*100)
        
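        epoch_losses = []
        epoch_correct = 0
        epoch_total = 0

        model.train()

        
        for X, Y in tqdm(train_loader, colour='BLUE'):

            X = X.to(device)
            Y = Y.to(device)
            
            Y_preds = model(X).squeeze()
            loss = loss_fn(Y_preds, Y.float())
            
            epoch_losses.append(loss.item())

            # Running training accuracy (approximate: dropout is active in train mode)
            epoch_correct += ((torch.sigmoid(Y_preds) >= .5).long() == Y).sum().item()
            epoch_total += Y.numel()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        train_loss = torch.tensor(epoch_losses).mean().item()
        print("Train Loss : {:.3f}".format(train_loss))
        
        losses, acc = CalcValLossAndAccuracy(model, loss_fn, val_x, val_y)
        # Keep training and validation metrics in their own lists
        train_losses.append(train_loss)
        train_accuracy.append(epoch_correct / epoch_total)
        val_losses.append(losses[0])
        val_accuracy.append(acc[0])
        
    return train_losses, val_losses, train_accuracy, val_accuracy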
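
Both functions rely on the same chain: the model outputs a raw score (a logit), the sigmoid turns it into a probability, and a 0.5 threshold turns that into a 0/1 prediction. The sketch below reproduces that chain, and the binary cross-entropy it is trained with, in plain Python on three made-up logits. It is the textbook formula only; as far as I know `nn.BCEWithLogitsLoss` computes the same quantity in a more numerically stable way.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    # Mean of -[y * log(p) + (1 - y) * log(1 - p)] with p = sigmoid(logit).
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

logits = [2.0, -1.0, 0.3]  # hypothetical model outputs
targets = [1, 0, 0]        # 1 = good review, 0 = bad review

probabilities = [sigmoid(z) for z in logits]
predictions = [1 if p >= 0.5 else 0 for p in probabilities]
accuracy = sum(p == y for p, y in zip(predictions, targets)) / len(targets)

print(predictions)  # [1, 0, 1]
print(f"accuracy: {accuracy:.3f}")
print(f"loss: {bce_with_logits(logits, targets):.3f}")
```

The third example is misclassified because its logit is slightly positive, which also makes it the largest contributor to the loss.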

# Training the model

We will now train the model using the RMSprop optimizer, and 20 hidden units in our model. We will run for only 8 epochs for now:

In [None]:
from torch.optim import RMSprop

epochs = 8
learning_rate = 1e-3


loss_fn = nn.BCEWithLogitsLoss()
text_classifier = LSTM(20,1).to(device)

optimizer = RMSprop(text_classifier.parameters(), lr=learning_rate)

print("STARTING TRAINING")
print("MODEL ARCHITECTURE:")
print(text_classifier)
print(" ")


TrainModel(text_classifier, loss_fn, optimizer, train_loader, epochs)

# Evaluating the model

Let's look at the first 3 reviews in our test set, and the predictions from our model! Do these make sense?

In [None]:
test_dataset[:3]

In [None]:
text_classifier.eval()
with torch.no_grad():
    print(torch.sigmoid(text_classifier(test_x[:3].to(device))))

# Optional Exercise 2

Here are a few things you can do if you want:

1. Try to calculate the accuracy for the test set
2. The model is decent (about 80% accuracy). Can you improve on it?

There are no given solutions for this exercise: it is up to you to play with the model if you want to.

Hope you enjoyed your first try at NLP!!!!

In [None]:
# Feel free to play around...