<a href="https://colab.research.google.com/github/bubulkopetro/NYU_AI/blob/main/Advanced_Lab_4_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stanford Sentiment Classification Dataset

In this lab, we will learn how to classify a sentence as expressing positive or negative sentiment. We will use the [Stanford Sentiment Treebank (SST-2)](https://nlp.stanford.edu/sentiment/index.html), which is part of the [GLUE Benchmark](https://gluebenchmark.com/tasks), a collection of language understanding tasks for evaluating machine learning models.

First, let's download SST-2 from GLUE Benchmark and unzip it.
We can use the [`wget`](https://www.gnu.org/software/wget/manual/wget.html) command for downloading a file.

In [None]:
!wget https://dl.fbaipublicfiles.com/glue/data/SST-2.zip
!unzip SST-2.zip

--2022-01-08 14:11:21--  https://dl.fbaipublicfiles.com/glue/data/SST-2.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 172.67.9.4, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7439277 (7.1M) [application/zip]
Saving to: ‘SST-2.zip’


2022-01-08 14:11:22 (30.2 MB/s) - ‘SST-2.zip’ saved [7439277/7439277]

Archive:  SST-2.zip
   creating: SST-2/
  inflating: SST-2/dev.tsv           
   creating: SST-2/original/
  inflating: SST-2/original/README.txt  
  inflating: SST-2/original/SOStr.txt  
  inflating: SST-2/original/STree.txt  
  inflating: SST-2/original/datasetSentences.txt  
  inflating: SST-2/original/datasetSplit.txt  
  inflating: SST-2/original/dictionary.txt  
  inflating: SST-2/original/original_rt_snippets.txt  
  inflating: SST-2/original/sentiment_labels.txt  
  inflating: SST-2/test.tsv          
  inflating: SST-2/trai

We can view the content of the downloaded SST-2 folder using the `ls` command.
We can see that it contains the train, dev, and test data in `tsv` format.


In [None]:
!ls SST-2

dev.tsv  original  test.tsv  train.tsv


Let's explore the content of `train.tsv`. We can use the `head -n` command to quickly read the first `n` lines of any file.

In [None]:
!head -5 SST-2/train.tsv

sentence	label
hide new secretions from the parental units 	0
contains no wit , only labored gags 	0
that loves its characters and communicates something rather beautiful about human nature 	1
remains utterly satisfied to remain the same throughout 	0


As we can see, the first line holds the column names (`sentence` and `label`), and each remaining line is one example from the dataset. SST-2 is a binary classification dataset: label 0 corresponds to negative sentiment and label 1 to positive sentiment.
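
Before modeling, it can help to check how the two labels are distributed. The following is a minimal sketch using only the standard library; `sample_tsv` is an in-memory stand-in holding the four rows printed above, and the names `sample_tsv`, `label_counts`, and `label_names` are introduced here for illustration.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for SST-2/train.tsv: the header plus the four rows shown above.
sample_tsv = (
    "sentence\tlabel\n"
    "hide new secretions from the parental units \t0\n"
    "contains no wit , only labored gags \t0\n"
    "that loves its characters and communicates something rather beautiful about human nature \t1\n"
    "remains utterly satisfied to remain the same throughout \t0\n"
)

# Parse the tab-separated rows and count how often each label occurs.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
label_counts = Counter(int(row["label"]) for row in reader)

label_names = {0: "negative", 1: "positive"}
for label, count in sorted(label_counts.items()):
    print(f"{label_names[label]}: {count}")
```

To run the same count on the real file, `io.StringIO(sample_tsv)` could be swapped for an open file handle on `SST-2/train.tsv`.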

Now, we will read the `tsv` file and explore the data in Python.
We will use the `pandas` library for that.

In [None]:
import pandas as pd
data = pd.read_csv("SST-2/train.tsv", sep='\t')
data[:5]

Unnamed: 0,sentence,label
0,hide new secretions from the parental units,0
1,"contains no wit , only labored gags",0
2,that loves its characters and communicates som...,1
3,remains utterly satisfied to remain the same t...,0
4,on the worst revenge-of-the-nerds clichés the ...,0


In [None]:
print(data["sentence"][:2].tolist())
print(data["label"][:2].tolist())

['hide new secretions from the parental units ', 'contains no wit , only labored gags ']
[0, 0]


# Preprocessing Data

We are going to use a simple Bag-of-Words classifier model. 
Before we can train the model, we need to prepare the input to the model.

We need to define a Python function to load the dataset files.  
We will then perform the following **preprocessing steps**: 

- tokenizing the sentences in the dataset into words
- removing stop words (words that do not add much meaning to a sentence, such as “the”, “a”, “an”, “in”) and punctuation
- lower-casing all the characters.

**Note:** we normally do not remove stop words and punctuation when we use more recent, powerful models such as Google's BERT.


First, we need to install the `spacy` library which will be used in preprocessing.

In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting spacy
  Downloading spacy-3.2.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collect

We will define a `preprocess_sent` function that tokenizes, converts to lowercase, and removes punctuation and stop words from a sentence using the `spacy` tokenizer.

In [None]:
import string
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# Create a blank Tokenizer with just the English vocab
nlp_en = English()
tokenizer = Tokenizer(nlp_en.vocab)

# string containing the ASCII punctuation characters
punctuations = string.punctuation
print(punctuations)

# tokenize, lowercase, and remove punctuation and stop words
def preprocess_sent(sent):
    tokens = tokenizer(sent)
    # lower-case first, so that capitalized stop words (e.g. "The") are removed as well
    lowered = [token.text.lower() for token in tokens]
    # keep only tokens that are neither punctuation nor stop words
    return [tok for tok in lowered if tok not in punctuations and tok not in STOP_WORDS]


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Let's try to preprocess a sentence using the `preprocess_sent` function we defined above.

In [None]:
# Example
tokens = preprocess_sent(u'Apple is looking at buying U.K. startup for $1 billion')
print(tokens)

['apple', 'looking', 'buying', 'u.k.', 'startup', '$1', 'billion']
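
A stop-word filter is sensitive to whether lower-casing happens before or after the membership check, because stop-word lists are usually stored in lower case. The sketch below illustrates the difference with plain Python; `TOY_STOP_WORDS` is a hypothetical miniature set (spaCy's `STOP_WORDS` is much larger), and both helper functions are invented for this illustration.

```python
# Hypothetical miniature stop-word set, for illustration only.
TOY_STOP_WORDS = {"the", "a", "an", "in", "is", "at", "for"}

def filter_case_sensitive(words):
    # Checks the raw token, so "The" passes the check and is lower-cased afterwards.
    return [w.lower() for w in words if w not in TOY_STOP_WORDS]

def filter_case_insensitive(words):
    # Lower-cases first, then checks membership.
    lowered = [w.lower() for w in words]
    return [w for w in lowered if w not in TOY_STOP_WORDS]

words = "The movie is a triumph".split()
print(filter_case_sensitive(words))    # ['the', 'movie', 'triumph']
print(filter_case_insensitive(words))  # ['movie', 'triumph']
```

The SST-2 sentences shown above are already lower-case, so the distinction rarely matters for this dataset, but it does for text with ordinary capitalization.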


Now, we will define a function that reads the `tsv` files from the data paths and returns a dictionary of datasets.

The returned dictionary will look like this:

```python
datasets = {
    "train" : (train_sentence_list, train_labels),
    "dev": (dev_sentence_list, dev_labels),
    "test": (test_sentence_list, None),
}
```

Note that the SST-2 dataset distributed with GLUE does not contain labels for the test set. Therefore, our test labels are `None`.

In [None]:
import os
def load_datasets(data_paths):
    """
        for each path in data_paths:
          load data from the tsv file path 
          pre-process each sentence and label in data
          and store them in datasets dictionary
    """
    datasets = {}
    for name, path in data_paths.items():
        print("Loading {} dataset from {}".format(name, path))
        tmp_data = pd.read_csv(os.path.join("SST-2", path), sep='\t')
        sentences  = [preprocess_sent(sent) for sent in tmp_data['sentence']] 
        if 'label' in tmp_data:
            labels = tmp_data['label']
        else:
            labels = None # Test dataset has no label
        datasets[name] = (sentences, labels)
    return datasets

Now, we will load the datasets using the `load_datasets` function defined above.

In [None]:
data_paths = {
                "train": "train.tsv",
                "dev": "dev.tsv",
                "test": "test.tsv"
              }

datasets = load_datasets(data_paths)

Loading train dataset from train.tsv
Loading dev dataset from dev.tsv
Loading test dataset from test.tsv


Let's check how some examples in the preprocessed train dataset look.

In [None]:
print ("Train dataset size is {}".format(len(datasets['train'][0])))
print ("Val dataset size is {}".format(len(datasets['dev'][0])))
print ("Test dataset size is {}".format(len(datasets['test'][0])))

Train dataset size is 67349
Val dataset size is 872
Test dataset size is 1821


In [None]:
for ex in datasets['train'][0][:5]:
  print(ex)

['hide', 'new', 'secretions', 'parental', 'units']
['contains', 'wit', 'labored', 'gags']
['loves', 'characters', 'communicates', 'beautiful', 'human', 'nature']
['remains', 'utterly', 'satisfied', 'remain']
['worst', 'revenge-of-the-nerds', 'clichés', 'filmmakers', 'dredge']


Now, we will build a vocabulary from the tokens (words, in this case) in the training data. We will define a `token2id` dictionary and an `id2token` list.  
`token2id` maps a token to its index in the vocabulary and is used to convert word tokens into numbers. `id2token` is the reverse mapping, from index to token.

e.g. `token2id['hi'] = 3`.  `id2token[3] = 'hi'`



First, we will define a **pad token** (`<pad>`), which is used to pad the sentences in a batch to the same length, and an **unknown token** (`<unk>`) for words that do not appear in the vocabulary.
Then, we will define a maximum vocabulary size (`max_vocab_size`). If the number of unique tokens in our train dataset is larger than `max_vocab_size`, we will keep only the `max_vocab_size` most frequent tokens in our vocabulary.

In [None]:
from collections import Counter

max_vocab_size = 10000
# reserve index 0 for pad and 1 for unk
PAD_IDX = 0
UNK_IDX = 1

def build_vocab(train_tokens):
    # Returns:
    # token2id: dictionary mapping each token to its index
    # id2token: list of tokens, where id2token[i] is the token with index i
    all_tokens = [tokens for token_list in train_tokens for tokens in token_list]
    token_counter = Counter(all_tokens)
    vocab, count = zip(*token_counter.most_common(max_vocab_size))
    id2token = list(vocab)
    token2id = dict(zip(vocab, range(2,2+len(vocab)))) 
    id2token = ['<pad>', '<unk>'] + id2token
    token2id['<pad>'] = PAD_IDX 
    token2id['<unk>'] = UNK_IDX
    return token2id, id2token

token2id, id2token = build_vocab(datasets['train'][0])

Let's check `token2id` and `id2token` by loading a random token from it.

In [None]:
import random
random_token_id = random.randint(0, len(id2token)-1)
random_token = id2token[random_token_id]

print ("Token id {} ; token {}".format(random_token_id, id2token[random_token_id]))
print ("Token {}; token id {}".format(random_token, token2id[random_token]))


Token id 4645 ; token tuned
Token tuned; token id 4645
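
To see the vocabulary construction on a scale small enough to inspect by hand, here is a minimal sketch with three toy sentences. The function `build_toy_vocab` and the data are invented for this illustration; it follows the same idea as `build_vocab` (reserve two indices, then keep the most frequent tokens) but is not the notebook's function.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

def build_toy_vocab(token_lists, max_size):
    # Count every token across all sentences.
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Indices 0 and 1 are reserved; the max_size most frequent tokens follow.
    id2tok = [PAD, UNK] + [tok for tok, _ in counts.most_common(max_size)]
    tok2id = {tok: i for i, tok in enumerate(id2tok)}
    return tok2id, id2tok

toy_sentences = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
tok2id, id2tok = build_toy_vocab(toy_sentences, max_size=2)
print(id2tok)           # ['<pad>', '<unk>', 'good', 'movie']
print(tok2id["movie"])  # 3
```

With `max_size=2`, the rarer tokens "bad" and "plot" fall outside the vocabulary and would later be mapped to `<unk>`.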


Now, we will convert the tokens in our datasets into numerical indices based on `token2id`.

In [None]:
# convert token to id in the dataset
def token2index_dataset(tokens_data):
    indices_data = []
    # for each token list in tokens_data, convert to indices
    # WRITE YOUR OWN CODE
    for tokens in tokens_data:
        index_list = [token2id[token] if token in token2id else UNK_IDX for token in tokens]
        indices_data.append(index_list)
      
    return indices_data

train_data_indices, train_targets = token2index_dataset(datasets['train'][0]), datasets['train'][1]
val_data_indices, val_targets = token2index_dataset(datasets['dev'][0]), datasets['dev'][1]
test_data_indices, test_targets = token2index_dataset(datasets['test'][0]), datasets['test'][1]

# double checking
print ("Train dataset size is {}".format(len(train_data_indices)))
print ("Val dataset size is {}".format(len(val_data_indices)))
print ("Test dataset size is {}".format(len(test_data_indices)))

Train dataset size is 67349
Val dataset size is 872
Test dataset size is 1821


In [None]:
print(datasets['train'][1])

0        0
1        0
2        1
3        0
4        0
        ..
67344    1
67345    0
67346    1
67347    1
67348    0
Name: label, Length: 67349, dtype: int64


In [None]:
print("Token2ID: ", train_data_indices[0])
print("Recovering tokens from train_data_indices: ", [id2token[idx] for idx in train_data_indices[0]])
print("original tokens in sentence: ", datasets['train'][0][0])

Token2ID:  [4193, 23, 1, 6981, 8715]
Recovering tokens from train_data_indices:  ['hide', 'new', '<unk>', 'parental', 'units']
original tokens in sentence:  ['hide', 'new', 'secretions', 'parental', 'units']


We can see that the word "secretions" is replaced with `<unk>` token as it is not in the most frequent `max_vocab_size` number of tokens.
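
The fallback to the unknown token can be written compactly with `dict.get`, and the share of unknown tokens gives a rough sense of how much information the vocabulary cut-off discards. This is a small sketch under assumed toy data; `toy_token2id`, `encode`, and `oov_rate` are names introduced here.

```python
UNK_ID = 1
# Hypothetical miniature vocabulary, for illustration only.
toy_token2id = {"<pad>": 0, "<unk>": 1, "good": 2, "movie": 3}

def encode(tokens, vocab):
    # dict.get returns the fallback id for tokens missing from the vocabulary.
    return [vocab.get(tok, UNK_ID) for tok in tokens]

sentence = ["good", "movie", "secretions"]
ids = encode(sentence, toy_token2id)
oov_rate = ids.count(UNK_ID) / len(ids)
print(ids)                # [2, 3, 1]
print(f"{oov_rate:.2f}")  # 0.33
```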

Note that, unlike image data, which is made up of (continuous) pixel values, natural language data is discrete: each token takes one value from a fixed set of distinct values. If we change a single token in a sentence, the meaning of the entire sentence can change.  
e.g. "I love apple" --> "I hate apple"

Now, we will create a `Dataset` object and `DataLoader` object, as well as a function for batching.

In [None]:
import numpy as np
import torch
from torch.utils.data import Dataset

class SSTDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch
    Note that this class inherits torch.utils.data.Dataset
    """
    def __init__(self, data_list, target_list, MAX_SENTENCE_LENGTH=200):
        """
        @param data_list: list of token-index lists, one per sentence 
        @param target_list: list of sentiment labels, or None for unlabeled data 

        """
        self.data_list = data_list

        if target_list is not None:
            self.target_list = target_list
        else:
            # if target list is None, create a dummy target list
            self.target_list = [0] * len(data_list)
        self.MAX_SENTENCE_LENGTH = MAX_SENTENCE_LENGTH
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)
        
    def __getitem__(self, key):
        """
        Triggered when you call dataset[i]
        """
        token_idx = self.data_list[key][:self.MAX_SENTENCE_LENGTH]
        label = self.target_list[key]
        return [token_idx, label]



def generate_batch(batch):
    """
    Customized function for DataLoader that dynamically pads the batch so that all 
    data have the same length
    """
    labels = torch.LongTensor([entry[1] for entry in batch])
    data_list = []
    # padding
    max_len = max(len(entry[0]) for entry in batch)
    for entry in batch:
        padded_vec = np.pad(np.array(entry[0]),
                            pad_width=((0,max_len-len(entry[0]))), 
                            mode="constant", constant_values=token2id['<pad>'])
        data_list.append(padded_vec)
    return [torch.from_numpy(np.array(data_list)).long(), labels]


In [None]:
BATCH_SIZE = 32
train_dataset = SSTDataset(train_data_indices, train_targets)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=generate_batch,
                                           shuffle=True)

val_dataset = SSTDataset(val_data_indices, val_targets)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=generate_batch,
                                           shuffle=True)

test_dataset = SSTDataset(test_data_indices, test_targets)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=generate_batch,
                                           shuffle=False)
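
The dynamic padding performed by `generate_batch` can be illustrated without PyTorch or NumPy. The sketch below pads each sequence on the right up to the longest sequence in the batch; `pad_batch` and the example batch are invented for illustration (the first row reuses the indices of the first training sentence shown above).

```python
PAD_ID = 0  # plays the same role as PAD_IDX in the notebook

def pad_batch(sequences, pad_id=PAD_ID):
    # Pad every sequence on the right up to the longest length in this batch.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [[4193, 23, 1, 6981, 8715], [7, 8], [5, 6, 9]]
padded = pad_batch(batch)
for row in padded:
    print(row)
# [4193, 23, 1, 6981, 8715]
# [7, 8, 0, 0, 0]
# [5, 6, 9, 0, 0]
```

Padding per batch, rather than to one global maximum length, keeps batches of short sentences small.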


# Word2Vec: Pre-trained Word Vectors

We will initialize the embedding layer of our model with pre-trained Word2Vec vectors. To do this, we will use the [`gensim`](https://radimrehurek.com/gensim/index.html) library to download and load the word vectors. `gensim` also provides other built-in functionality, such as computing the similarity between word vectors and topic modeling.

First, let's install `gensim`.

In [None]:
!pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.1.2


We will now download `word2vec-google-news-300`, a set of pretrained 300-dimensional word vectors trained on the Google News dataset. (This can take a few minutes.)

In [None]:
import gensim.downloader as api
pretrained_word_vectors = api.load('word2vec-google-news-300')



Let's print out the vector for the word "student".

In [None]:
pretrained_word_vectors['student']

array([ 0.03686523,  0.0201416 ,  0.22167969,  0.15527344,  0.17871094,
        0.03149414,  0.31445312, -0.03369141,  0.15429688, -0.375     ,
        0.05102539, -0.13183594, -0.11962891, -0.13867188, -0.02026367,
        0.01318359, -0.06738281, -0.06591797, -0.02502441, -0.140625  ,
        0.02160645,  0.17382812, -0.00177765, -0.09179688, -0.09765625,
       -0.4921875 , -0.13671875, -0.00570679,  0.16992188,  0.10107422,
        0.09423828, -0.10986328, -0.08496094,  0.05419922, -0.06542969,
       -0.0168457 ,  0.11230469,  0.13964844, -0.08300781,  0.22265625,
       -0.23828125,  0.11767578, -0.04614258,  0.0859375 ,  0.17089844,
       -0.06884766,  0.04003906, -0.10351562,  0.15917969,  0.04956055,
       -0.10888672, -0.15039062,  0.01507568, -0.05419922,  0.25      ,
       -0.09521484, -0.11816406,  0.11132812,  0.20507812, -0.10009766,
        0.0168457 , -0.09521484, -0.00308228, -0.01348877,  0.09277344,
       -0.08447266,  0.08496094, -0.05541992,  0.15820312,  0.10

Let's use the built-in `most_similar` function to print out the top 5 words that are most similar to the word "car".

In [None]:
print(pretrained_word_vectors.most_similar(positive=['car'], topn=5))

[('vehicle', 0.7821096181869507), ('cars', 0.7423831224441528), ('SUV', 0.7160962224006653), ('minivan', 0.6907036900520325), ('truck', 0.6735789775848389)]


Using the `most_similar` function of gensim, explore a few words of your choice and their most similar words returned by gensim.

In [None]:
# WRITE YOUR OWN CODE
print(pretrained_word_vectors.most_similar(positive=['happy'], topn=5))

[('glad', 0.7408890724182129), ('pleased', 0.6632170677185059), ('ecstatic', 0.6626912355422974), ('overjoyed', 0.6599286794662476), ('thrilled', 0.6514049172401428)]
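
The scores returned by `most_similar` are cosine similarities between word vectors. The following sketch computes cosine similarity by hand and ranks neighbours of a query word; the three-dimensional `toy_vectors` are invented for illustration and are not taken from the pretrained model.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "vehicle": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

query = "car"
# Score every other word against the query and sort from most to least similar.
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0][0])  # vehicle
```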


Now, we will create a `word2vec_vectors` embedding matrix, built from `pretrained_word_vectors`, whose rows follow the same vocabulary indices as `token2id`. This matrix will be used to initialize the `Embedding` layer of our Bag-of-Words model.

In [None]:
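import torch
word2vec_vectors = []
W2V_SIZE = 300
# Iterate over id2token (not token2id.items()) so that row i of the matrix
# holds the vector of the token whose index is i.
for token in id2token:
    if token in pretrained_word_vectors:
        word2vec_vectors.append(torch.FloatTensor(pretrained_word_vectors[token]))
    else:
        # tokens without a pretrained vector get a random vector
        word2vec_vectors.append(torch.rand(W2V_SIZE))
word2vec_vectors=torch.stack(word2vec_vectors)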

  


In [None]:
print("size of word2vec_vectors: ", word2vec_vectors.size())

size of word2vec_vectors:  torch.Size([10002, 300])
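
For the embedding lookup to be meaningful, row `i` of the matrix has to hold the vector of the token whose index is `i`. The sketch below shows one way to guarantee this, by walking the index-ordered token list; `toy_id2token` and `toy_pretrained` are hypothetical stand-ins for `id2token` and the pretrained vectors, and plain lists replace tensors.

```python
import random

random.seed(0)
DIM = 3  # toy dimension; the notebook uses 300

# Hypothetical stand-ins for id2token and the pretrained vectors.
toy_id2token = ["<pad>", "<unk>", "good", "movie"]
toy_pretrained = {"good": [0.1, 0.2, 0.3], "movie": [0.4, 0.5, 0.6]}

embedding_rows = []
for token in toy_id2token:  # index order, so row i belongs to token i
    if token in toy_pretrained:
        embedding_rows.append(toy_pretrained[token])
    else:
        # No pretrained vector available: fall back to a random one.
        embedding_rows.append([random.random() for _ in range(DIM)])

print(embedding_rows[toy_id2token.index("movie")])  # [0.4, 0.5, 0.6]
```

Iterating over a dictionary instead would follow insertion order, which only matches the index order if the tokens were inserted in that order.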


# Model

Now we will build a class for a simple Bag-of-Words model.
It consists of:
- an embedding layer
- a linear layer.

Our Bag-of-Words model represents a sentence as the average of the embeddings of its words, and uses this representation as the input to a linear classification layer.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class TextBoW(nn.Module):
    def __init__(self, pre_trained_emb, num_class):
        super(TextBoW, self).__init__()
        # load pre-trained embeddings to Embedding layer
        self.embedding = nn.Embedding.from_pretrained(pre_trained_emb)
        self.embedding.weight.requires_grad=True
        embed_dim = pre_trained_emb.size(1)
        # a fully-connected linear layer for classification
        # WRITE YOUR OWN CODE
        self.fc = nn.Linear(embed_dim, num_class) 
        self.init_weights()

    def init_weights(self):
        # initialize weights for fc layer
        initrange = 0.05
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, x):
        # WRITE YOUR OWN CODE
        # forward input x into embedding layer
        # average the output of embedding layer and forward it to linear fc layer
        x = self.embedding(x)
        x = torch.mean(x, 1)
        x = self.fc(x)
        return x
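
The forward pass (embed, average, apply a linear layer) can be traced by hand on tiny numbers. This is a plain-Python sketch, not the PyTorch implementation; the vectors, `weight`, and `bias` are hypothetical values chosen so the arithmetic is easy to follow.

```python
def mean_pool(vectors):
    # Average a list of equal-length vectors dimension by dimension.
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def linear(x, weight, bias):
    # weight has one row per class; returns one score (logit) per class.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

# Hypothetical 2-dimensional embeddings for a 3-token sentence.
sentence_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
sentence_repr = mean_pool(sentence_vectors)   # [1.0, 1.0]

weight = [[1.0, -1.0], [0.5, 0.5]]            # 2 classes x 2 dimensions
bias = [0.0, 0.0]
logits = linear(sentence_repr, weight, bias)  # [0.0, 1.0]
predicted_class = logits.index(max(logits))   # 1
print(sentence_repr, logits, predicted_class)
```

One thing to be aware of: `torch.mean(x, 1)` averages over every position in the padded batch, so pad positions contribute to the average of shorter sentences. Averaging only over real tokens (a masked mean) is a common refinement, though it is not part of this notebook.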


We will initialize the `TextBoW` model we just created.

In [None]:
EMBED_DIM = 300
VOCAB_SIZE = len(token2id)
NUM_CLASS = 2  # we have 2 label classes (pos and neg)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
# Train on GPU device if there's cuda (GPU), else use CPU device

model = TextBoW(word2vec_vectors, NUM_CLASS).to(device)


Let's check how many learnable parameters our TextBoW model has.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 3,001,202 trainable parameters
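
The parameter count can be verified by hand from the shapes printed earlier: the embedding matrix has one 300-dimensional row per vocabulary entry, and the linear layer has one weight per (dimension, class) pair plus one bias per class.

```python
vocab_size = 10002   # rows of word2vec_vectors, as printed above
embed_dim = 300
num_classes = 2

embedding_params = vocab_size * embed_dim               # embedding matrix
linear_params = embed_dim * num_classes + num_classes   # weights + biases
total = embedding_params + linear_params
print(f"{total:,}")  # 3,001,202
```

Almost all of the parameters sit in the embedding layer; the classifier itself has only 602.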


Now we will create functions for training and evaluation of our model.

`train_func` will:
- loop through the batches in the training `data_loader`
- forward each batch through the model
- compute the loss and accuracy of the model after each forward step
- backpropagate the loss and update the model parameters
- compute the accuracy and total loss to keep track of the performance throughout the training

`eval_func` will:
- loop through the batches in a provided `data_loader`
- forward each batch through the model
- outside test mode, compute the loss and accuracy for each batch and accumulate them to track performance
- in test mode, collect the model's predictions instead, because the test set has no labels and loss and accuracy cannot be computed.



In [None]:
def train_func(model, train_loader, optimizer, criterion):

    # Train the model
    train_loss = 0
    train_acc = 0
    num_examples = 0
    model.train()

    # for each batch in train data loader
    for idx, batch in enumerate(train_loader):
        # move the batch to the same device as the model
        input_text, labels = batch[0].to(device), batch[1].to(device)

        # WRITE YOUR OWN CODE
        # clear optimizer gradient
        # forward input_text through model
        # compute loss
        # backpropagate loss
        # take a step with optimizer
        optimizer.zero_grad()
        output = model(input_text)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        # Compute total loss and accuracy
        train_loss += loss.item()
        train_acc += (output.argmax(1) == labels).sum().item()
        num_examples += labels.size(0)
        
    return train_loss / num_examples, train_acc / num_examples


def eval_func(data_loader, test_mode=False):
    eval_loss = 0
    eval_acc = 0
    num_examples = 0
    predictions = []
    model.eval()
    for batch in data_loader:
        # move the batch to the same device as the model
        input_text, labels = batch[0].to(device), batch[1].to(device)

        # As we are not doing training, 
        # we don't need to keep track of gradients
        with torch.no_grad():
            output = model(input_text)

            if not test_mode:
                loss = criterion(output, labels)
                eval_loss += loss.item()
                eval_acc += (output.argmax(1) == labels).sum().item()
                num_examples += labels.size(0)
            else:
                predictions.extend(output.argmax(1).tolist())
                
    if test_mode:
        return predictions
    return eval_loss / num_examples, eval_acc / num_examples
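
A note on reading the loss values these functions return. `CrossEntropyLoss` averages over the batch by default, and the functions above add up those per-batch means before dividing by the number of examples, so the reported loss is roughly the per-example loss divided by the batch size. The sketch below works through the arithmetic for an untrained binary classifier; `cross_entropy` is a hand-written helper for illustration, not the PyTorch function.

```python
import math

def cross_entropy(logits, label):
    # Negative log-probability of the true class under a softmax over the logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# An untrained binary classifier outputs roughly equal logits for both classes.
per_example = cross_entropy([0.0, 0.0], label=1)
print(round(per_example, 4))  # 0.6931, i.e. ln(2)

batch_size = 32
num_examples = 64
# One mean loss per batch, summed, then divided by the number of examples.
batch_means = [per_example] * (num_examples // batch_size)
reported = sum(batch_means) / num_examples
print(round(reported, 4))     # 0.0217
```

Multiplying each batch loss by its batch size before accumulating would give a true per-example average instead.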


# Time to train!

Now, we can start training our model.

First, we will define:
- the number of epochs (`N_EPOCHS`) which is the number of times we will loop through the training dataset using `train_func`
- criterion for computing the loss function (In our case, `CrossEntropyLoss` for classification)
- an optimizer and learning rate: We use the simple SGD optimizer with learning rate 0.5.

We will then train the model for `N_EPOCHS`.

For each epoch, 
- we will run `train_func` to train the model
- we will evaluate the model on the validation dataset using `eval_func` and save a model checkpoint whenever the validation accuracy improves on the best value so far. 

(The validation dataset is used to tune or select the best model. Note that you should not use the test dataset to tune the model. It will produce a biased model overfitted to the test data.)


In [None]:
import time
N_EPOCHS = 20
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
best_val_acc = 0


for epoch in range(N_EPOCHS):
    start_time = time.time()

    # train model on train_loader
    train_loss, train_acc = train_func(model, train_loader, optimizer, criterion)
    # Evaluate the performance on validation dataset
    valid_loss, valid_acc = eval_func(val_loader)

    # If current validation_acc is better than best_val_acc,
    # update best_val_acc and save model
    if valid_acc > best_val_acc:
        best_val_acc = valid_acc
        torch.save(model.state_dict(), './saved_model')

    # this is just to compute how long the whole process take for an epoch
    secs = int(time.time() - start_time)
    mins = secs / 60
    secs = secs % 60

    # Print out the loss and accuracy for this epoch
    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')


Epoch: 1  | time in 0 minutes, 48 seconds
	Loss: 0.0214(train)	|	Acc: 56.4%(train)
	Loss: 0.0213(valid)	|	Acc: 58.1%(valid)
Epoch: 2  | time in 0 minutes, 23 seconds
	Loss: 0.0208(train)	|	Acc: 59.7%(train)
	Loss: 0.0207(valid)	|	Acc: 60.8%(valid)
Epoch: 3  | time in 0 minutes, 22 seconds
	Loss: 0.0203(train)	|	Acc: 62.5%(train)
	Loss: 0.0200(valid)	|	Acc: 69.2%(valid)
Epoch: 4  | time in 0 minutes, 23 seconds
	Loss: 0.0199(train)	|	Acc: 63.6%(train)
	Loss: 0.0217(valid)	|	Acc: 57.3%(valid)
Epoch: 5  | time in 0 minutes, 25 seconds
	Loss: 0.0194(train)	|	Acc: 65.4%(train)
	Loss: 0.0183(valid)	|	Acc: 72.6%(valid)
Epoch: 6  | time in 0 minutes, 21 seconds
	Loss: 0.0191(train)	|	Acc: 66.7%(train)
	Loss: 0.0202(valid)	|	Acc: 60.4%(valid)
Epoch: 7  | time in 0 minutes, 20 seconds
	Loss: 0.0187(train)	|	Acc: 67.6%(train)
	Loss: 0.0177(valid)	|	Acc: 74.9%(valid)
Epoch: 8  | time in 0 minutes, 20 seconds
	Loss: 0.0184(train)	|	Acc: 68.9%(train)
	Loss: 0.0176(valid)	|	Acc: 77.1%(valid)
Epoch: 9
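
The checkpointing rule (save only on a strict improvement in validation accuracy) can be replayed on the validation accuracies from the first eight epochs of the log above. This sketch only mimics the bookkeeping; nothing is actually saved, and `checkpoint_epochs` is a name introduced here.

```python
# Validation accuracies (%) for epochs 1-8, copied from the log above.
valid_accs = [58.1, 60.8, 69.2, 57.3, 72.6, 60.4, 74.9, 77.1]

best_val_acc = 0.0
checkpoint_epochs = []
for epoch, acc in enumerate(valid_accs, start=1):
    if acc > best_val_acc:  # a strict improvement would trigger a save
        best_val_acc = acc
        checkpoint_epochs.append(epoch)

print(checkpoint_epochs)  # [1, 2, 3, 5, 7, 8]
print(best_val_acc)       # 77.1
```

Epochs 4 and 6 are skipped because validation accuracy dropped there, so the saved checkpoint is not overwritten by a worse model.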

When training is complete, we will load the saved checkpoint (`./saved_model`) with the best validation accuracy and generate the model's predictions on the test dataset.

In [None]:
model.load_state_dict(torch.load('./saved_model'))
predictions = eval_func(test_loader, test_mode=True)
print(predictions[:10])

[0, 0, 0, 1, 1, 1, 1, 1, 0, 0]


# Fun with GPT-2

Now you know how to build a simple text sentiment classifier. This is just an introduction to NLP, and there are many exciting things in store as you study further. For example, you can write a language generator using state-of-the-art NLP models like GPT-2. Head over to [this page](https://transformer.huggingface.co/doc/gpt2-large) to try out a language generator with GPT-2.

## Using Hugging Face Datasets and Transformers

Next, we will look at some more modern libraries for deep learning NLP models: Hugging Face's `datasets` and `transformers`.

We will use the pretrained `BERT` model as a starting point to train a sentiment classification model on SST-2.

**Note**: In this section, it is recommended that you switch your Colab runtime to using a GPU hardware accelerator. This will reset your Colab runtime. You can do so by going through the menu: "Runtime" --> "Change runtime type" --> Select "GPU".

In [None]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-1.17.0-py3-none-any.whl (306 kB)

In [None]:
import datasets
import transformers
import numpy as np
import torch
import torch.nn as nn
import itertools
from torch.utils.data import Dataset
from tqdm.auto import tqdm

# cuda:0 means that we are using the first (we only have one) GPU
DEVICE = "cuda:0"

Instead of downloading and parsing the files by hand as we did above, we will use the `datasets` library to download our SST-2 data.

In this case, SST-2 is a sub-dataset of the [GLUE](https://gluebenchmark.com/) benchmark. 

In [None]:
sst_dataset = datasets.load_dataset("glue", "sst2")

Downloading:   0%|          | 0.00/7.78k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.47k [00:00<?, ?B/s]

Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, post-processed: Unknown size, total: 11.90 MiB) to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading:   0%|          | 0.00/7.44M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

The resulting `sst_dataset` object contains data for all three splits of the dataset.

In [None]:
sst_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

We can take a look at a single example in the dataset.

In [None]:
example = sst_dataset["train"][4]
example

{'idx': 4,
 'label': 0,
 'sentence': 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up '}

Next, we will use the `transformers` library to prepare our model. As above, we need both a tokenizer and a model. Let's load up our tokenizer first.

In [None]:
tokenizer = transformers.BertTokenizerFast.from_pretrained("bert-base-cased")

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/426k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Let's tokenize a single input sentence.

Notice that it breaks up certain rare words into sub-word tokens, e.g. 

> clichés --> ["c", "##lich", "##és"]

In [None]:
print(example["sentence"])
tokens = tokenizer.tokenize(example["sentence"])
print(tokens)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 
['on', 'the', 'worst', 'revenge', '-', 'of', '-', 'the', '-', 'ne', '##rds', 'c', '##lich', '##és', 'the', 'filmmakers', 'could', 'd', '##red', '##ge', 'up']
[1113, 1103, 4997, 7972, 118, 1104, 118, 1103, 118, 24928, 15093, 172, 16879, 10051, 1103, 18992, 1180, 173, 4359, 2176, 1146]
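To make the sub-word splitting above more concrete, here is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece. The `wordpiece` function and `toy_vocab` are hypothetical illustrations, not the library's implementation; the real tokenizer also handles punctuation, casing, and a much larger vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example only).
toy_vocab = {"the", "c", "##lich", "##és", "d", "##red", "##ge"}

for word in ["the", "clichés", "dredge"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

Frequent words such as "the" are in the vocabulary as whole tokens, while rarer words fall back to several shorter pieces.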


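Calling the `tokenizer` directly on the input sentence does the string-to-token-ID conversion all at once. It also adds the special `[CLS]` (101) and `[SEP]` (102) tokens at the start and end of the sequence.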

(You can ignore the `token_type_ids` and `attention_mask` fields for now.)

In [None]:
tokenized = tokenizer(example["sentence"])
for k, v in tokenized.items():
    print(k, v)

input_ids [101, 1113, 1103, 4997, 7972, 118, 1104, 118, 1103, 118, 24928, 15093, 172, 16879, 10051, 1103, 18992, 1180, 173, 4359, 2176, 1146, 102]
token_type_ids [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
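Comparing this output with the token IDs printed earlier, the only difference in `input_ids` is the extra 101 at the start and 102 at the end. The sketch below rebuilds the three fields by hand for a single unpadded sentence; `encode_single` is a hypothetical helper written for illustration, not part of `transformers`.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] in the bert-base-cased vocabulary


def encode_single(token_ids):
    """Wrap one sentence's token IDs the way the output above is laid out."""
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        # Single-sentence input: every position belongs to segment 0.
        "token_type_ids": [0] * len(input_ids),
        # No padding yet: every position is a real token.
        "attention_mask": [1] * len(input_ids),
    }


# Token IDs printed by convert_tokens_to_ids in the earlier cell.
token_ids = [1113, 1103, 4997, 7972, 118, 1104, 118, 1103, 118, 24928, 15093,
             172, 16879, 10051, 1103, 18992, 1180, 173, 4359, 2176, 1146]

encoded = encode_single(token_ids)
for k, v in encoded.items():
    print(k, v)
```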


Next, we will load our pretrained BERT model. In this case, we want to use BERT to tackle a 2-class classification problem (positive/negative sentiment). 

Moreover, we want to run this model on the GPU, so we move it to `DEVICE` (`cuda:0`, from above).

In [None]:
model = transformers.BertForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=2,
).to(DEVICE)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

If we run the model on our tokenized input (with some preprocessing to move the input to GPU as well), we see that it outputs logits corresponding to our two classes.

In [None]:
model(
    input_ids=torch.LongTensor([tokenized["input_ids"]]).to(DEVICE),
    attention_mask=torch.LongTensor([tokenized["attention_mask"]]).to(DEVICE),
)

SequenceClassifierOutput([('logits',
                           tensor([[1.0703, 0.5167]], device='cuda:0', grad_fn=<AddmmBackward0>))])
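Logits are unnormalized scores; a softmax turns them into class probabilities. The sketch below applies a plain-Python softmax to the two logits printed above. Since the classification head has not been fine-tuned at this point, the resulting probabilities are not yet meaningful predictions.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0703, 0.5167]  # values copied from the output above
probs = softmax(logits)
print(f"class 0: {probs[0]:.3f}, class 1: {probs[1]:.3f}")
```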

Next, let's set up our SST dataset, similar to the above.

For efficiency reasons, we will pad/truncate all our inputs to the same number of tokens (64), which we do in the `SSTDataset` object's `__getitem__` method.

Our `generate_batch` function simply stacks the inputs together into batches.

In [None]:
class SSTDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch
    Note that this class inherits torch.utils.data.Dataset
    """
    def __init__(self, phase_dataset, tokenizer, max_seq_length=64):
        """
        @param phase_dataset: dataset object for a single phase
        @param tokenizer: tokenizer
        @param max_seq_length: max sequence length

        """
        self.phase_dataset = phase_dataset
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length

    def __len__(self):
        return len(self.phase_dataset)
        
    def __getitem__(self, key):
        """
        Triggered when you call dataset[i]
        """
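        # Use the tokenizer stored on the instance rather than the global one
        tokenized = self.tokenizer(self.phase_dataset[key]["sentence"])
        diff = len(tokenized["input_ids"]) - self.max_seq_length
        if diff >= 0:
            # Truncate if >= max_seq_length (this can cut off the final [SEP] token)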
            input_ids = tokenized["input_ids"][:self.max_seq_length]
            attention_mask = tokenized["attention_mask"][:self.max_seq_length]
        else:
            # Pad if < max_seq_length
            input_ids = tokenized["input_ids"] + [self.tokenizer.pad_token_id] * (-diff)
            attention_mask = tokenized["attention_mask"] + [0] * (-diff)
        assert len(input_ids) == self.max_seq_length
        assert len(attention_mask) == self.max_seq_length
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "label": self.phase_dataset[key]["label"]
        }

def generate_batch(batch):
    """
    Collate function for DataLoader: stacks the already padded/truncated
    examples from SSTDataset into batched LongTensors
    """
    return {
        "input_ids": torch.LongTensor([entry["input_ids"] for entry in batch]),
        "attention_mask": torch.LongTensor([entry["attention_mask"] for entry in batch]),
        "label": torch.LongTensor([entry["label"] for entry in batch]),
    }
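The pad/truncate logic in `__getitem__` can be tried in isolation on plain lists. The sketch below is a standalone restatement of that branch; `pad_or_truncate` is a hypothetical helper, and `pad_token_id=0` assumes BERT's `[PAD]` ID.

```python
def pad_or_truncate(input_ids, attention_mask, max_seq_length, pad_token_id=0):
    """Return (input_ids, attention_mask), each exactly max_seq_length long."""
    diff = len(input_ids) - max_seq_length
    if diff >= 0:
        # Too long (or exactly right): keep only the first max_seq_length positions.
        return input_ids[:max_seq_length], attention_mask[:max_seq_length]
    # Too short: append padding IDs, and mask them out with zeros.
    padded_ids = input_ids + [pad_token_id] * (-diff)
    padded_mask = attention_mask + [0] * (-diff)
    return padded_ids, padded_mask


short_ids, short_mask = pad_or_truncate([101, 1113, 102], [1, 1, 1], max_seq_length=5)
long_ids, long_mask = pad_or_truncate([101, 1113, 1103, 4997, 102], [1] * 5, max_seq_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

In the truncated case the final `[SEP]` (102) is dropped. The tokenizer can also pad and truncate by itself, for example `tokenizer(sentence, padding="max_length", truncation=True, max_length=64)`, which keeps `[SEP]` at the end when truncating.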

Now, let's set up our datasets and dataloaders.

In [None]:
train_dataset = SSTDataset(sst_dataset["train"], tokenizer)
val_dataset = SSTDataset(sst_dataset["validation"], tokenizer)

train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset, 
    batch_size=32,
    collate_fn=generate_batch,
    shuffle=False)

val_loader = torch.utils.data.DataLoader(
    dataset=val_dataset, 
    batch_size=32,
    collate_fn=generate_batch,
    shuffle=False)

We can see what a single batch from our dataloader looks like.

In [None]:
batch = next(iter(val_loader))
for k, v in batch.items():
    print(k, tuple(v.shape))

input_ids (32, 64)
attention_mask (32, 64)
label (32,)


Now, we'll make an optimizer object to perform gradient descent to train our model. We will use the `AdamW` optimizer, which is gradient descent with some extra features (momentum, per-parameter adaptive step sizes, and weight-decay regularization), but the idea is very similar.

To set up our optimizer, we need to provide the model parameters (which we will optimize over) and the learning rate.

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
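To show what those "extra features" amount to, here is a rough sketch of a single AdamW update for one scalar parameter, written in plain Python. The `adamw_step` function is a hypothetical illustration; the default hyperparameters mirror those of `torch.optim.AdamW` except for `lr`, which is set to the `1e-4` used above. The toy objective `f(theta) = theta**2` is an assumption chosen only to produce a gradient.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    # Decoupled weight decay: shrink the parameter directly.
    theta = theta * (1 - lr * weight_decay)
    # Momentum: running average of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    # Running average of the squared gradient, used to scale the step.
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy problem: minimize f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adamw_step(theta, grad=2 * theta, m=m, v=v, t=t)
    print(f"step {t}: theta = {theta:.6f}")
```

Because the step is divided by the root of the squared-gradient average, each update moves the parameter by roughly `lr` regardless of the gradient's raw scale.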

As above, we will set up our `train_func` and `eval_func` functions.

To run this lab a little faster, we will only train our model for a small number of steps, rather than full epochs on the SST-2 training dataset. Feel free to adjust the number of steps!

In [None]:
def train_func(model, train_loader, optimizer, steps=100):

    # Train the model
    train_loss = 0
    train_acc = 0
    num_examples = 0
    model.train()
    criterion = nn.CrossEntropyLoss()

    for i, batch in tqdm(
            zip(range(steps), itertools.cycle(train_loader)), total=steps):
        optimizer.zero_grad()
        output = model(
            input_ids=batch["input_ids"].to(DEVICE),
            attention_mask=batch["attention_mask"].to(DEVICE),
        ).logits
        loss = criterion(output, batch["label"].to(DEVICE))
        loss.backward()
        optimizer.step()

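        # Accumulate the batch-mean loss and the count of correct predictions.
        # Note: dividing the summed batch means by num_examples (below) reports
        # roughly the per-example loss divided by the batch size.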
        train_loss += loss.item()
        train_acc += (output.argmax(1) == batch["label"].to(DEVICE)).sum().item()
        num_examples += batch["label"].to(DEVICE).size(0)
        
    return {
        "loss": train_loss / num_examples,
        "accuracy": train_acc / num_examples
    }


def eval_func(model, data_loader, test_mode=False):
    eval_loss = 0
    eval_acc = 0
    num_examples = 0
    predictions = []
    model.eval()
    criterion = nn.CrossEntropyLoss()
    for batch in tqdm(data_loader):

        with torch.no_grad():
            output = model(
                input_ids=batch["input_ids"].to(DEVICE),
                attention_mask=batch["attention_mask"].to(DEVICE),
            ).logits

            if not test_mode:
                loss = criterion(output, batch["label"].to(DEVICE))
                eval_loss += loss.item()
                eval_acc += (output.argmax(1) == batch["label"].to(DEVICE)).sum().item()
                num_examples += batch["label"].to(DEVICE).size(0)
            else:
                predictions.extend(output.argmax(1).tolist())
                
    if test_mode:
        return predictions
    return {
        "loss": eval_loss / num_examples,
        "accuracy": eval_acc / num_examples,
    }
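The `zip(range(steps), itertools.cycle(train_loader))` pattern in `train_func` is what limits training to a fixed number of steps. The sketch below shows the same pattern on a plain list standing in for a data loader; `limited_batches` and `toy_loader` are hypothetical names used only for this illustration.

```python
import itertools


def limited_batches(loader, steps):
    """Yield (step_index, batch) pairs for exactly `steps` steps.

    itertools.cycle restarts from the first batch once the loader is
    exhausted; zip stops as soon as range(steps) runs out.
    """
    return zip(range(steps), itertools.cycle(loader))


toy_loader = ["batch_a", "batch_b", "batch_c"]  # stand-in for a DataLoader
seen = [batch for _, batch in limited_batches(toy_loader, steps=5)]
print(seen)
# ['batch_a', 'batch_b', 'batch_c', 'batch_a', 'batch_b']
```

One caveat: `itertools.cycle` stores the items from its first pass and replays them, so if `steps` exceeds one epoch, the later passes reuse the same batches in the same order.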


Now, let's go ahead and train our model!

In [None]:
train_func(model, train_loader, optimizer, steps=300)

{'accuracy': 0.8453125, 'loss': 0.011458817155410845}

We can then evaluate our model on the validation set. In the run shown here, the model reaches about 88.6% validation accuracy after only 300 training steps. This is the benefit of pretrained models!

In [None]:
eval_func(model, val_loader)

{'accuracy': 0.8864678899082569, 'loss': 0.009296531353248368}
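The `loss` values reported by `train_func` and `eval_func` look small because `nn.CrossEntropyLoss()` already averages over each batch, and the sum of those batch means is then divided by the number of examples. The sketch below contrasts that with a true per-example mean on made-up numbers; both function names and the toy losses are assumptions for illustration.

```python
def summed_means_over_examples(batch_losses):
    """What train_func/eval_func report: sum of batch means / number of examples."""
    total = sum(sum(batch) / len(batch) for batch in batch_losses)
    num_examples = sum(len(batch) for batch in batch_losses)
    return total / num_examples


def per_example_mean(batch_losses):
    """Plain average of the individual example losses."""
    flat = [loss for batch in batch_losses for loss in batch]
    return sum(flat) / len(flat)


# Two toy batches of two per-example losses each (made-up numbers).
toy_losses = [[0.2, 0.4], [0.6, 0.8]]
reported = summed_means_over_examples(toy_losses)
true_mean = per_example_mean(toy_losses)
print(f"reported: {reported:.3f}, per-example mean: {true_mean:.3f}")
```

With equal-sized batches, the reported value is the per-example mean divided by the batch size (2 in this toy case, 32 in the lab), so the accuracy is the more directly interpretable number here.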

Now, try to apply our fine-tuned BERT sentiment classification model to a new sentence!

In [None]:
sentence = "The chicken pot pie was delicious!"

# WRITE YOUR OWN CODE
tokenized = tokenizer(sentence)
out = model(
    input_ids=torch.LongTensor([tokenized["input_ids"]]).to(DEVICE),
    attention_mask=torch.LongTensor([tokenized["attention_mask"]]).to(DEVICE),
)
neg, pos = torch.softmax(out["logits"], dim=-1)[0]

print(f"Negative sentiment: {neg:.1%}")
print(f"Positive sentiment: {pos:.1%}")

Negative sentiment: 3.0%
Positive sentiment: 97.0%
