<img src="https://drive.google.com/uc?id=1dFgNX9iQUfmBOdmUN2-H8rPxL3SLXmxn" width="400"/>


---


# Sentiment analysis with GloVe (Solution)

### Exercise objectives:
- Learn how to embed data with GloVe (and similar embedding like Word2Vec)
- Build a data pipeline to prepare text data
- Train a simple LSTM model for sentiment analysis

<hr>


# The data

Today we will use the IMDB movie review dataset. It is a classic data set that can be downloaded staight from Pytorch. It contains reviews of different movies, as well as a target: 1 for bad review, 2 for a good review. The goal is to train an LSTM model to predict whether a review is good, or bad.

# Installing Dependencies

But first, let's install a few libraries that you are unlikely to have if you are running this on your colab instance:

In [1]:
!pip install torchdata

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchdata
  Downloading torchdata-0.5.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)
[K     |████████████████████████████████| 4.5 MB 4.7 MB/s 
[?25hCollecting portalocker>=2.0.0
  Downloading portalocker-2.6.0-py2.py3-none-any.whl (15 kB)
Collecting urllib3>=1.25
  Downloading urllib3-1.26.13-py2.py3-none-any.whl (140 kB)
[K     |████████████████████████████████| 140 kB 53.2 MB/s 
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
[K     |████████████████████████████████| 127 kB 63.3 MB/s 
Installing collected packages: urllib3, portalocker, torchdata
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
Successfully installed portalocker-2.6.0 torchdata-0.5.0 urllib3-1.25.11


In [2]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Download NLTK files

We also need to download a few key files for NLTK (this could take a minute or two o run)

In [3]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Importing torch 

Let's import torch and a few common dependencies, and make sure we set our device to the GPU:

In [4]:
import numpy as np
import pandas as pd

In [5]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

# Importing and visualizing the data

Running the cell below will import that IMDB movie reviews in your notebook, and we can then visualize some results. We can import the reviews straight from datasets.

In [6]:
from torchtext import data,datasets
from sklearn.model_selection import train_test_split
import torchdata

train_dataset, test_dataset  = datasets.IMDB(root = '.data', split = ('train', 'test'))

# Let's create a true test data and a true validation data as lists:
test_dataset, val_dataset = train_test_split(list(test_dataset), train_size=.8)

## Let's see the 5 first reviews

Note that they are tuples: 1 means a bad review, 2 means a good review.

In [7]:
val_dataset[:5]

[(1,
  'I was looking for a documentary of the same journalistic quality as Frontline or "Fog of War" (by Errol Morris). Instead I was appalled by this shallow and naive account of a very complex and disturbing man and his regime: Alberto Fujimori. This movie should be called "The return of Fujimori". The director presumes she made a "perfect" movie because alienates both pro and anti-Fujimori factions when in fact it is a very biased and unprofessional piece of work. <br /><br />The movie has few crucial facts wrong: <br /><br />1) She uses the so called "landslide" election of 1995 in which Fujimori was re-elected with 65% of the vote, as an example of the massive popular support of Fujimori. But we all now know to be the fruit of a very organized electoral fraud.<br /><br />2) The movie states that Sendero Luminoso (Shining Path) killed 60,000 people. In fact, the Truth Commission\'s final report states that there were 69,280 deaths due to political violence in Peru. 33% of those we

## Choosing one example

We will need one example of text to prepare it. So, let's take the first review of your validation set:

In [8]:
example = val_dataset[0][1]
example

'I was looking for a documentary of the same journalistic quality as Frontline or "Fog of War" (by Errol Morris). Instead I was appalled by this shallow and naive account of a very complex and disturbing man and his regime: Alberto Fujimori. This movie should be called "The return of Fujimori". The director presumes she made a "perfect" movie because alienates both pro and anti-Fujimori factions when in fact it is a very biased and unprofessional piece of work. <br /><br />The movie has few crucial facts wrong: <br /><br />1) She uses the so called "landslide" election of 1995 in which Fujimori was re-elected with 65% of the vote, as an example of the massive popular support of Fujimori. But we all now know to be the fruit of a very organized electoral fraud.<br /><br />2) The movie states that Sendero Luminoso (Shining Path) killed 60,000 people. In fact, the Truth Commission\'s final report states that there were 69,280 deaths due to political violence in Peru. 33% of those were caus

# Time to play with GloVe

We want to embed our text with a pre-trained model. In the lecture, we talked about Word2Vec, which is one of the best embeddings. But Word2Vec is harder to implement in PyTorch because it is not offered as a pre-trained layer. Instead, we will play with GloVe, which is very similar to Word2Vec in concept and is readily available in PyTorch.

## Download GloVe

There are different versions of GloVe, depending on the size of the vector you want to use for your embedding, and the size of your vocabulary. Here, to make things as fast as possible, let's download the version with vectors embedded in 50 dimensions, and the smallest vocabulary possible. This is still a large file (862 Mb) so on my system this took almost 3 minutes to run: be patient!

In [9]:
from torchtext.vocab import GloVe

glove = GloVe(dim='50', name='6B', max_vectors=20000)

.vector_cache/glove.6B.zip: 862MB [02:39, 5.41MB/s]                           
100%|█████████▉| 19999/20000 [00:00<00:00, 43242.41it/s]


## Words to vectors

Now that we have GloVe downloaded, we can see how it works. We can, for instance, very easily find the vector for any word. Here is how we can find the representation of the word 'king':

In [10]:
glove["king"]

tensor([ 0.5045,  0.6861, -0.5952, -0.0228,  0.6005, -0.1350, -0.0881,  0.4738,
        -0.6180, -0.3101, -0.0767,  1.4930, -0.0342, -0.9817,  0.6823,  0.8172,
        -0.5187, -0.3150, -0.5581,  0.6642,  0.1961, -0.1349, -0.1148, -0.3034,
         0.4118, -2.2230, -1.0756, -1.0783, -0.3435,  0.3350,  1.9927, -0.0423,
        -0.6432,  0.7113,  0.4916,  0.1675,  0.3434, -0.2566, -0.8523,  0.1661,
         0.4010,  1.1685, -1.0137, -0.2158, -0.1515,  0.7832, -0.9124, -1.6106,
        -0.6443, -0.5104])

## GloVe tokens (index)

Importantly, we will need to obtain the index (or token) of a word in GloVe. This is because we will transform our sentence into integer token before passing it to the network, and these integer tokens will need to correspond to the index of the vectors for our embedding layer.

There are two useful functions for us:
- string to integer (stoi)
- integer to string (itos)

They do pretty much what is written on the tin:

In [11]:
glove.stoi['king']

691

In [12]:
glove.itos[691]

'king'

## Feel free to play with vectors!

Explore the embedding, and how each words are represented...

In [13]:
glove['queen']

tensor([ 0.3785,  1.8233, -1.2648, -0.1043,  0.3583,  0.6003, -0.1754,  0.8377,
        -0.0568, -0.7580,  0.2268,  0.9859,  0.6059, -0.3142,  0.2888,  0.5601,
        -0.7746,  0.0714, -0.5741,  0.2134,  0.5767,  0.3868, -0.1257,  0.2801,
         0.2813, -1.8053, -1.0421, -0.1926, -0.5537, -0.0545,  1.5574,  0.3930,
        -0.2475,  0.3425,  0.4536,  0.1624,  0.5246, -0.0703, -0.8374, -1.0326,
         0.4595,  0.2530, -0.1784, -0.7340, -0.2002,  0.2347, -0.5609, -2.2839,
         0.0093, -0.6028])

# Exercise 1: Text preprocessing

Your first exercise will be to build a pre-processing pipeline. You can use the code in the lecture to help you do that. Also, I already created the functions signature below to help you understand what is needed. Each function is applied to a single review, not a batch of reviews (you will see this below).

The transformations you need to do are the following:
1. Text needs to be all lower case (all words in GloVe are lower case only)
2. You need to remove numbers - they are not helpful for sentiment analysis
3. Remove punctuation (also not needed)
4. Transform the sentence into word tokens using NLTK. Now your sentence becomes a list of words.
5. Remove stopwords using NLTK
6. Get the index of the word in GloVe. If the word exists in the dictionary, keep it in the list. If not, just don't add it. This is a little bit more challenging, so consult the solution if you are unsure.
7. We also need to pad our sentence to MAX_LEN (the maximum length of the sentences). Notice below that I set this to be 100 words: we don't really need more than that, and we are getting good results (well, decent results) with 100 words. But all tensors need to be the same length, so pad_sentence is here to add zeros to those sentences that are shorted than 100 words. Again, consult the solution if you struggle with this one.

I have left the last function in for you: transform_text calls each of the other functions one after another, and transforms the text.

Make sure that you obtain a tensor of dimension 100, with all the right token, when you call transform_text on your example sentence from above. Then continue with the exercise.


In [14]:
MAX_LEN = 100

In [15]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string 
from torch.nn.functional import pad 

def remove_numbers(txt):
    txt = ''.join(word for word in txt if not word.isdigit())
    return txt

def remove_punctuation(txt):
    
    for punctuation in string.punctuation:
        txt = txt.replace(punctuation, '') 
    
    return txt

def tokenize(txt):
    word_tokens = word_tokenize(txt) 
    return word_tokens

def remove_stopwords(word_tokens):
    stop_words = set(stopwords.words('english')) 
    word_tokens = [w for w in word_tokens if not w in stop_words] 
    return word_tokens


def get_index(txt, vocab=glove):
    embedded_text = []
    
    for word in txt:
        try:
            embedded_text.append(glove.stoi[word])
        except:
            pass
    
    return embedded_text

def pad_sentence(txt):
    if txt.shape[0]>=MAX_LEN:
        return txt[:MAX_LEN]
    else:
        return pad(txt, (0, MAX_LEN-txt.shape[0]), 'constant',0).long()
        

def transform_text(txt):
    txt = txt.lower()
    txt = remove_numbers(txt)
    txt = remove_punctuation(txt)
    txt = tokenize(txt)
    txt = remove_stopwords(txt)
    txt = torch.tensor(get_index(txt)).long()
    return pad_sentence(txt)

In [16]:
transform_text(example)

tensor([  862,  3830, 17940,  1506, 16469,  9667,   136,  4239,   773, 19827,
         8966, 14427,  1530,  1688,  9392,   300,  1872,  6922,  9274,  1005,
          175,   498,  9274,   369,   116,  2615,  1005,  1396,  5104,   853,
        14222,  2365,   161,  1005,  2892,  4490,  1797,  2054,   175,  8036,
          367,  9274, 19283,   538,   880,  1971,   814,   280,  9274,   346,
         4138,  2112,  2799,  1005,   112, 13001,  2818,   256,    69,   853,
         2745,  8631,   294,   255,   112,  1933,   445,   209,   714,  4068,
         1098,  2200,  1281,   142,   178,   503,   853,  2054, 10747,   419,
         9274,   622,    82,   170,   880,  5510,  1005,   899,   631,  8340,
         1378,   299,  9274,  2056,  1395,  1872,   372,   590,   149, 14875])

# Preparing the dataset

Now that you have written the helper functions, we can focus on preparing the dataset. First, I will create the target and the label for the train, test, and validation splits using a list comprehension. This will be a dataset not saved as a dataloader, and it will be useful to assess the loss and the accuracy of each one of our splits. But it won't be used as a batch.

Run the code below. Beware, due to the size of the dataset this can be a bit of a lengthy cell to run!

In [17]:
train_y = torch.tensor([item[0] for item in list(train_dataset)])-1
train_x = torch.stack([transform_text(item[1]) for item in list(train_dataset)])

val_y = torch.tensor([item[0] for item in list(test_dataset)])-1
val_x = torch.stack([transform_text(item[1]) for item in list(test_dataset)])

test_y = torch.tensor([item[0] for item in list(test_dataset)])-1
test_x = torch.stack([transform_text(item[1]) for item in list(test_dataset)])


## Dataloader

To be able to run batches on the GPU, we will need to save the dataset in a DataLoader class. We will also create a function called vectorize_batch, that will allow us to vectorise our batches on the go (i.e. batch by batch). Take note of how I wrote the train_loader: I use the dataset, define the batch size (256), and collate function (my vectorizer), and (very important!) I set shuffle to 'True'.

Not setting shuffle to True results in the batch not being randomly shuffled, and a poor performance during training:

In [18]:
from torchtext.data import to_map_style_dataset
from torch.utils.data import DataLoader

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    
    X_embedded = torch.stack([transform_text(txt) for txt in X])
    
    return X_embedded, torch.tensor(Y).long()-1 

train_dataset=  to_map_style_dataset(train_dataset)

train_loader = DataLoader(train_dataset, batch_size=256, collate_fn=vectorize_batch, shuffle=True)

Let's quickly check that the dataloader works. You should obtain a size of [256, 100] for X, and [256] for Y:

In [19]:
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break

torch.Size([256, 100]) torch.Size([256])


# The embedding layer

I have written a function for you that creates an embedding layer. We will pass to it the vectors of our GloVe object, and from this, we will now the number of embeddings (20000) and the embedding dimensions (50). 

We then pass the weights of the GloVe vocabulary to our embedding layer, and we choose to set the weight to 'not trainable'. This way, we won't try to relearn the correct weights for this task. But we can also choose to start with the GloVe weights, and then update them to suite our vocabulary.

This is a classic example of transfer learning.

In [20]:
def create_emb_layer(weights_matrix, non_trainable=True):
    num_embeddings, embedding_dim = weights_matrix.size()
    emb_layer = nn.Embedding(num_embeddings, embedding_dim,padding_idx=0)
    emb_layer.load_state_dict({'weight': weights_matrix})
    if non_trainable:
        emb_layer.weight.requires_grad = False

    return emb_layer, num_embeddings, embedding_dim

# Network architecture

We will use an LSTM model, as you did on Friday. But this time, we will use the LSTM layer from nn.LSTM: yes, you do not need to write it from scratch!

In fact, you don't need to write anything. I wrote the network for you, as this is very time consuming and we only have half a day for theory and practice today.

Note the use of the embedding layer, as well as the fact that I use 2 LSTM layers.

In [21]:
from torch import nn

class LSTM(nn.Module):
    def __init__(self, hid_dim, output_dim):
        super(LSTM, self).__init__()
        
        self.hid_dim = hid_dim
        
        self.embedding, num_embeddings, embedding_dim = create_emb_layer(glove.vectors, False)
        
        n_layers = 2

        self.lstm = nn.LSTM(embedding_dim, hid_dim, n_layers,dropout=0, batch_first=True)
        self.linear = nn.Linear(hid_dim,100)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(100, output_dim)
        self.dropout = nn.Dropout(.5)
        
        self.reset_parameters()
        
    def reset_parameters(self):
        std= 1.0 / np.sqrt(self.hid_dim)
        
        for w in self.parameters():
            w.data.uniform_(-std, std)
        

    def forward(self, text):

        embedded = self.embedding(text)


        batch_size, seq_len,  _ = embedded.shape
        hid_dim = self.lstm.hidden_size
            
        outputs, (hidden, cell) = self.lstm(embedded)

        outputs = outputs[:, -1]
        
        prediction = self.fc(self.dropout(self.relu(self.linear(outputs))))


        return prediction

# Training and validation functions

Below are my training and validation functions. Take a moment to study them, and then run the code.

In [22]:
from tqdm import tqdm
from sklearn.metrics import accuracy_score
import torch.nn.functional as F
import gc

def CalcValLossAndAccuracy(model, loss_fn, val_X, val_Y):
    
    #print(f'Calculating Epoch Loss and Accuracy:')
    
    losses = []
    accuracies = []
    model.eval()
    with torch.no_grad():
        X, Y, title = (val_x, val_y,'Validation')
        X = val_X.to(device)
        Y = val_Y.to(device)
            
        outputs = model(X).squeeze()
        loss = loss_fn(outputs, Y.float())
            
        preds = [1 if p>=.5 else 0 for p in torch.sigmoid(outputs)]
        accuracy = accuracy_score(Y.detach().cpu().numpy().tolist(),preds)
            
        accuracies.append(accuracy)
        losses.append(loss)

        
        print(f'{title} Loss : {loss:.3f}')
        print(f"{title} Accuracy  : {accuracy:.3f}")
    
    return losses, accuracies


def TrainModel(model, loss_fn, optimizer, train_loader, epochs=10):
    train_losses = []
    train_accuracy = []
    val_losses = []
    val_accuracy = []
    
    for i in range(1, epochs+1):
        
        print('-'*100)
        print(f'EPOCH {i}')
        print('-'*100)
        
        epoch_losses = []

        model.train()

        
        for X, Y in tqdm(train_loader, colour='BLUE'):

            X = X.to(device)
            Y = Y.to(device)
            
            Y_preds = model(X).squeeze()
            loss = loss_fn(Y_preds, Y.float())
            
            epoch_losses.append(loss.item())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print("Train Loss : {:.3f}".format(torch.tensor(epoch_losses).mean()))
        
        losses, acc = CalcValLossAndAccuracy(model, loss_fn, val_x, val_y)
        train_losses.append(losses[0])
        train_accuracy.append(losses[0])
        val_losses.append(acc[0])
        val_accuracy.append(acc[0])
        
    return train_losses, val_losses, train_accuracy, val_accuracy

# Training the model

We will now train the model using the RMSprop optimizer, and 20 hidden units in our model. We will run for only 8 epochs for now:

In [23]:
from torch.optim import RMSprop

epochs = 8
learning_rate = 1e-3


loss_fn = nn.BCEWithLogitsLoss()
text_classifier = LSTM(20,1).to(device)

optimizer = RMSprop(text_classifier.parameters(), lr=learning_rate)

print("STARTING TRAINING")
print("MODEL ARCHITECTURE:")
print(text_classifier)
print(" ")


TrainModel(text_classifier, loss_fn, optimizer, train_loader, epochs)

STARTING TRAINING
MODEL ARCHITECTURE:
LSTM(
  (embedding): Embedding(20000, 50, padding_idx=0)
  (lstm): LSTM(50, 20, num_layers=2, batch_first=True)
  (linear): Linear(in_features=20, out_features=100, bias=True)
  (relu): ReLU()
  (fc): Linear(in_features=100, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
 
----------------------------------------------------------------------------------------------------
EPOCH 1
----------------------------------------------------------------------------------------------------


100%|[34m██████████[0m| 98/98 [00:29<00:00,  3.33it/s]


Train Loss : 0.694
Validation Loss : 0.692
Validation Accuracy  : 0.502
----------------------------------------------------------------------------------------------------
EPOCH 2
----------------------------------------------------------------------------------------------------


100%|[34m██████████[0m| 98/98 [00:29<00:00,  3.30it/s]


Train Loss : 0.638
Validation Loss : 0.559
Validation Accuracy  : 0.748
----------------------------------------------------------------------------------------------------
EPOCH 3
----------------------------------------------------------------------------------------------------


100%|[34m██████████[0m| 98/98 [00:28<00:00,  3.43it/s]


Train Loss : 0.512
Validation Loss : 0.499
Validation Accuracy  : 0.787
----------------------------------------------------------------------------------------------------
EPOCH 4
----------------------------------------------------------------------------------------------------


100%|[34m██████████[0m| 98/98 [00:28<00:00,  3.41it/s]


Train Loss : 0.403
Validation Loss : 0.481
Validation Accuracy  : 0.788
----------------------------------------------------------------------------------------------------
EPOCH 5
----------------------------------------------------------------------------------------------------


100%|[34m██████████[0m| 98/98 [00:28<00:00,  3.43it/s]


Train Loss : 0.331
Validation Loss : 0.439
Validation Accuracy  : 0.817
----------------------------------------------------------------------------------------------------
EPOCH 6
----------------------------------------------------------------------------------------------------


100%|[34m██████████[0m| 98/98 [00:28<00:00,  3.41it/s]


Train Loss : 0.289
Validation Loss : 0.421
Validation Accuracy  : 0.828
----------------------------------------------------------------------------------------------------
EPOCH 7
----------------------------------------------------------------------------------------------------


100%|[34m██████████[0m| 98/98 [00:28<00:00,  3.45it/s]


Train Loss : 0.251
Validation Loss : 0.416
Validation Accuracy  : 0.827
----------------------------------------------------------------------------------------------------
EPOCH 8
----------------------------------------------------------------------------------------------------


100%|[34m██████████[0m| 98/98 [00:28<00:00,  3.43it/s]


Train Loss : 0.227
Validation Loss : 0.470
Validation Accuracy  : 0.820


([tensor(0.6922, device='cuda:0'),
  tensor(0.5587, device='cuda:0'),
  tensor(0.4986, device='cuda:0'),
  tensor(0.4810, device='cuda:0'),
  tensor(0.4394, device='cuda:0'),
  tensor(0.4208, device='cuda:0'),
  tensor(0.4159, device='cuda:0'),
  tensor(0.4702, device='cuda:0')],
 [0.50205, 0.7477, 0.78665, 0.78765, 0.8174, 0.82765, 0.82715, 0.81965],
 [tensor(0.6922, device='cuda:0'),
  tensor(0.5587, device='cuda:0'),
  tensor(0.4986, device='cuda:0'),
  tensor(0.4810, device='cuda:0'),
  tensor(0.4394, device='cuda:0'),
  tensor(0.4208, device='cuda:0'),
  tensor(0.4159, device='cuda:0'),
  tensor(0.4702, device='cuda:0')],
 [0.50205, 0.7477, 0.78665, 0.78765, 0.8174, 0.82765, 0.82715, 0.81965])

# Evaluating the model

Let's look at the 3 first reviews in our test set, and the predictions from our model! Do these make sense?

In [24]:
test_dataset[:3]

[(2,
  'Focus is an engaging story told in urban, WWII-era setting. William Macy portrays everyman who is taken out of his personal circumstances and challenged with decisions testing his values affecting the community. Laura Dern, Macy and David Paymer give good performances, so also the good supporting ensemble.'),
 (2,
  'Director John Schlesinger\'s tense and frantic film tells the true story of Christopher Boyce and Andrew Daulton Lee, two young men who sold United States government secrets to the Soviet Union in the early 1970\'s.<br /><br />Timothy Hutton plays Christopher Boyce very competently. He is a young man very disillusioned by the CIA\'s underhanded activities in allied Australia. Sean Penn, as the doped-up, drug running Andrew Daulton Lee, is outstanding.<br /><br />The competent and professional direction of Schlesinger, along with some very good acting, make "The Falcon and the Snowman" an espionage thriller not to be missed.<br /><br />Tuesday, February 4, 1992 - Vi

In [25]:
text_classifier.eval()
with torch.no_grad():
    print(torch.sigmoid(text_classifier(test_x[:3].to(device))))

tensor([[0.9450],
        [0.9456],
        [0.9506]], device='cuda:0')


# Optional Exercise 2

Here are a few things you can do if you want:

1. Try to calculate the accuracy for the test set
2. The model is decent (about 80% accuracy). Can you improve on it?

There are no given solutions for this exercise: it is up to you to play with the model if you want to.

Hope you enjoyed your first try at NLP!!!!