In [None]:
#!pip install portalocker sentencepiece sacremoses transformers

## 6.4 Lab 4 / Case 4: Sentiment Analysis

In this lab, you'll fine-tune an encoder-based model to perform sentiment analysis on the Stanford Sentiment Treebank (SST2) dataset. You'll load RoBERTa's sibling, XLM-RoBERTa, use its prescribed transformations to preprocess text in the SST2 dataset, and fine-tune (train) it for one epoch.

### 6.4.1 Model

You'll use Torchtext's `XLMR_BASE_ENCODER` in this lab. Create an instance of a classification head (`RobertaClassificationHead`) to perform binary classification (we have two classes, "positive" and "negative" sentiment), matching the input dimensions to the embeddings generated by the base model, and then load the model with the head attached to it.

In [None]:
import torchtext

xlmr_base = torchtext.models.XLMR_BASE_ENCODER
xlmr_base

In [None]:
classifier_head = ...
# Tip: you can call a method from xlmr_base to load the model with the head
model = ...
model

### 6.4.2 Dataset

Now, you will load Torchtext's ["Stanford Sentiment Treebank (SST2)"](https://pytorch.org/text/stable/datasets.html#sst2) dataset. This dataset uses Torchdata's `DataPipe`s instead of traditional `Dataset`s. It is already split into `train`, `dev` (validation), and `test` sets. You only need to specify it in the `split` argument in the constructor of `SST2`.

In [None]:
from torchtext.datasets import SST2

datapipes = {}
datapipes['train'] = ...
datapipes['val'] = ...

Let's take a look at one data point from the SST2 dataset:

In [None]:
row = next(iter(datapipes['train']))
text, label = row
text, label

Each data point is a tuple containing a line of text and the corresponding label: the sentiment (0 for negative, 1 for positive).

### 6.4.3 Transforms

You already know the drill: you must preprocess the input (the text) using the prescribed transformation for the model you're using, so it gets tokenized, converted into token ids, and prepended/appended with the appropriate special tokens.

Retrieve the transformation function/model from the XLM-RoBERTa model, and write a function that takes a tuple of `(text, label)` and returns another tuple of `(list of tokens ids, label)`.

In [None]:
transform_fn = ...
transform_fn(text)

In [None]:
def apply_transform(row):
    text, label = row
    # write your code here
    ...

Let's apply your function to our data point to see if it is working as expected:

In [None]:
apply_transform(row)

Did you notice the transformation is returning a regular Python list of token ids, not a PyTorch tensor? Remember, we cannot make a tensor out of lists of different lengths (see section 2.9.3). The solution? Padding the shorter sentences, so they all have the same length.
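
To make the idea of padding concrete, here is a minimal sketch in plain Python (no PyTorch involved). The helper `pad_sequences` is a hypothetical name used only for illustration, and both the token ids and the padding value of `1` are made-up stand-ins, not values taken from the model's vocabulary.

```python
def pad_sequences(sequences, padding_value):
    """Right-pad every list of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

# Two hypothetical sequences of token ids with different lengths
batch = [[0, 581, 2], [0, 581, 1346, 83, 2]]

# After padding, every sequence has the same length,
# so the batch could be turned into a rectangular tensor
padded = pad_sequences(batch, padding_value=1)
lengths = [len(seq) for seq in padded]
```

Only the shorter sequence changes: it gets as many padding ids appended as needed to match the longest sequence in the batch.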

In [None]:
padding_idx = transform_fn[1].vocab.lookup_indices(['<pad>'])[0]

We'll be padding the sentences and building tensors out of them during pipeline creation, so it is more streamlined, but you could also do it inside the training loop, after they're returned by the data loader and before they're sent as inputs to the model.

Write a function that takes a batch of (transformed) data points, pads the sequences (using `to_tensor` and the padding id provided above), and converts the labels into a tensor as well.

In [None]:
import torch
from torchtext.functional import to_tensor

def tensor_batch(batch):
    tokens = batch['token_ids']
    labels = batch['labels']
    # write your code here
    ...

You're probably wondering: how could the pipeline pad the sequences if it is transforming data points individually? Isn't the data loader's role to produce mini-batches? And, where did the dictionary come from?!

Yes, it is usually the data loader's role to produce mini-batches, but it turns out we can also make batches inside the data pipe already, so we can pad the sequences appropriately.

We'll be using two methods to accomplish this:
- `batch()`: it takes the number of data points that will make up the mini-batch
- `rows2columnar()`: it "transposes" the data so that a list of tuples becomes a tuple of lists. For example, let's say we have two data points `(f1, l1)` and `(f2, l2)`. A mini-batch of two would be a list of tuples `[(f1, l1), (f2, l2)]` but, if we make it columnar, it will become a tuple of lists `([f1, f2], [l1, l2])` or, better yet, a dictionary where the keys are the column names passed as arguments: `{'col1': [f1, f2], 'col2': [l1, l2]}`.

And that should answer the question "where did the dictionary come from."
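
The two steps can be mimicked in plain Python to see what the data looks like at each stage. This is only a sketch of the idea: `make_batches` and `rows_to_columnar` are hypothetical helpers written for this illustration, not the data pipe methods themselves, and the string placeholders stand in for real features and labels.

```python
def make_batches(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size elements."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def rows_to_columnar(rows, column_names):
    """Turn a list of tuples into a dictionary of lists, one list per column."""
    columns = zip(*rows)  # "transpose": one tuple per column
    return {name: list(col) for name, col in zip(column_names, columns)}

# Three hypothetical data points, each a (features, label) tuple
rows = [("f1", "l1"), ("f2", "l2"), ("f3", "l3")]

batches = make_batches(rows, batch_size=2)
columnar = [rows_to_columnar(b, ["token_ids", "labels"]) for b in batches]
```

In this sketch the last batch is smaller than the others because three data points do not divide evenly into batches of two.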

Now, let's line up all these steps:

In [None]:
for k in datapipes.keys():
    datapipes[k] = datapipes[k].map(apply_transform)
    datapipes[k] = datapipes[k].batch(16)
    datapipes[k] = datapipes[k].rows2columnar(['token_ids', 'labels'])
    datapipes[k] = datapipes[k].map(tensor_batch)

If we fetch from our data pipe, it should return a tuple of two tensors, each tensor containing as many rows as the mini-batch size.

In [None]:
dp_out = next(iter(datapipes['train']))
dp_out

Now, create a data loader for each data pipe. Since the batches are already defined inside the data pipe, the batch size should be `None`. It is still OK to shuffle the training set, though.

In [None]:
from torch.utils.data import DataLoader

dataloaders = {}
dataloaders['train'] = ...
dataloaders['val'] = ...

Now, let's fetch a mini-batch from our data loader:

In [None]:
dl_out = next(iter(dataloaders['train']))
dl_out

Do you see any difference between the two outputs, from the (batched) datapipe and the data loader? The former returns a tuple while the latter returns a list, but the contents are the same: a mini-batch of features and a mini-batch of labels. The length of the features may differ depending on how long the longest sequence in a given mini-batch is.

In [None]:
dp_out[0].shape, dl_out[0].shape # features

In [None]:
dp_out[1].shape, dl_out[1].shape # labels

This means that it is possible to use data pipes directly in the training loop.

### 6.4.4 Training

Now, it is time to write a training loop to fine-tune your XLM-RoBERTa model on the SST2 dataset. This is a large model, and the training set has over 60,000 data points, so you can train it over a single epoch, that is, looping over the mini-batches from the datapipe (or data loader) only once. For the sake of speed, keep the evaluation for the end only.

Although `Adam` is the optimizer of choice, we suggest you try out `AdamW`, a modified version that is also commonly used.

Sentiment analysis is a classification task, so we need to use the appropriate loss function for the task. Even though it is a binary classification, RoBERTa's classification head is actually producing two logits instead of one, so you have to use `CrossEntropyLoss` (which handles two or more logits, using the softmax function to convert them into probabilities).

***

**Classification Losses Showdown**

Honestly, I always feel this whole thing is a bit confusing, especially for someone who's learning it for the first time. 

Which loss functions take logits as inputs? Should I add a (log)softmax layer or not? Can I use the `weight` argument to handle imbalanced datasets? Too many questions, right?

So, here is a table to help you figure out the landscape of loss functions for classification problems, both binary and multiclass:

| | BCE Loss | BCE With Logits Loss | NLL Loss | Cross-Entropy Loss |
| --- | --- | --- | --- | --- |
| Classification | binary | binary | multiclass | multiclass |
| Input (each data point) | probability | logit | array of log probabilities | array of logits |
| Label (each data point) | float (0.0 or 1.0) | float (0.0 or 1.0) | long (class index) | long (class index) |
| Model's last layer | Sigmoid | - | LogSoftmax | - |
| `weight` argument | **not** class weights | **not** class weights | class weights | class weights |
| `pos_weight` argument | n/a | "weighted" loss | n/a | n/a |

***
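
To see how the binary and multiclass columns of the table relate, here is a minimal sketch in plain Python (using only the `math` module instead of PyTorch's loss classes). It computes the cross-entropy of two logits and compares it with a binary cross-entropy computed from a single logit, taken here as the difference between the two. The function names and the logit values are made up for this illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that add up to one."""
    largest = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log of the softmax probability of the target class index."""
    return -math.log(softmax(logits)[target])

def bce_with_logit(logit, target):
    """Binary cross-entropy from a single logit and a float label."""
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

logits = [-0.5, 1.5]  # hypothetical logits for (negative, positive)
ce = cross_entropy(logits, target=1)
bce = bce_with_logit(logits[1] - logits[0], target=1.0)
```

Both values match because a softmax over two logits gives the positive class the same probability as a sigmoid applied to their difference. That is why a two-logit head paired with cross-entropy is a valid setup for a binary task.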

In [None]:
import torch.optim as optim
import torch.nn as nn

optimizer = optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

#### 6.4.4.1 TensorBoard

Yes, TensorBoard is that good! So good that we’ll be using a tool from the competing framework, TensorFlow :-) Jokes aside, TensorBoard is a very useful tool, and PyTorch provides classes and methods so that we can integrate it with our model.

First, we need to load TensorBoard’s extension for Jupyter, and then we can run TensorBoard using the newly available magic:

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

The magic above tells TensorBoard to look for logs inside the folder specified by the logdir argument: `runs`. So, there must be a runs folder in the same location as the notebook you’re using to train the model.

If you want to know more about running TensorBoard inside notebooks, check this official [guide](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks).

It all starts with the creation of a `SummaryWriter`: since we told TensorBoard to look for logs inside the runs folder, it makes sense to actually log to that folder. Moreover, to be able to distinguish between different experiments or models, we should also specify a sub-folder: `test`.

In [None]:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('runs/test')

What about sending the loss values to TensorBoard? We can use the `add_scalars()` method to send multiple scalar values at once; we'll be using three of its arguments:
- `main_tag`: the parent name of the tags, or the "group tag," if you will
- `tag_scalar_dict`: the dictionary containing the key: value pairs for the scalars you want to keep track of (for example, training and validation losses)
- `global_step`: step value; that is, the index you’re associating with the values you’re sending in the dictionary; the index of the mini-batch comes to mind in our case, as losses are computed for each mini-batch

As training progresses, you can go back to the cell where TensorBoard was loaded, click on its refresh button on the top right, and observe the current loss level.

In [None]:
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.to(device)

batch_losses = []

## Training
for i, (batch_features, batch_targets) in tqdm(enumerate(datapipes['train'])):
    # Set the model's mode
    # write your code here
    ...
    
    # Send batch features and targets to the device
    # write your code here
    ...
    
    # Step 1 - forward pass
    predictions = ...

    # Step 2 - computing the loss
    loss = ...

    # Step 3 - computing the gradients
    # Tip: it requires a single method call to backpropagate gradients
    # write your code here
    ...

    batch_losses.append(loss.item())
    
    writer.add_scalars(main_tag='loss',
                       tag_scalar_dict={'training': loss.item()},
                       global_step=i)    

    # Step 4 - updating parameters and zeroing gradients
    # Tip: it takes two calls to optimizer's methods
    # write your code here
    ...


writer.close()
    
## Validation   
with torch.inference_mode():
    val_losses = []

    for i, (val_features, val_targets) in enumerate(dataloaders['val']):
        # Set the model's mode
        # write your code here
        ...

        # Send batch features and targets to the device
        # write your code here
        ...

        # Step 1 - forward pass
        predictions = ...

        # Step 2 - computing the loss
        loss = ...
        
        val_losses.append(loss.item())

By the end of it, your losses on TensorBoard should look more or less like this (if you drag the slider on the right to the maximum level of smoothing):

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch6/tensorboard.png)

### 6.4.5 Inference

Write a function that takes some text (a sequence of words), a model, its prescribed transformations, and a list of target categories for the classification, and returns the most likely category and the corresponding probability.

Since you're handling a single sequence, there's no need for any padding, but you still need to provide a tensor containing a mini-batch (of one) as input to the model.

The model returns two logits, one for each class, so you must use the softmax function to convert them into probabilities.
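
The arithmetic behind this last step can be sketched in plain Python, leaving out the model and the tensors. The helper `top_category` is a hypothetical name, and the logits are made-up values, not the output of the fine-tuned model; in your own function you'd use PyTorch's operations instead.

```python
import math

def top_category(logits, categories):
    """Return the most likely category and its softmax probability."""
    largest = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability
    best = max(range(len(probs)), key=probs.__getitem__)
    return {'label': categories[best], 'value': probs[best]}

# Hypothetical logits for a single sentence: (negative, positive)
result = top_category([-1.0, 2.0], ['negative', 'positive'])
```

The larger logit always ends up with the larger probability, so the softmax does not change which category wins; it only makes the score interpretable as a probability.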

In [None]:
def predict(sequence, model, transforms_fn, categories):        
    # Build a tensor of token ids out of the input sequence
    # write your code here
    ...

    # Set the model to the appropriate mode
    # write your code here
    ...

    device = next(iter(model.parameters())).device
    
    # Use the model to make predictions/logits
    # Tip: Don't forget to send the input to the same device as the model
    # Tip: Don't forget models take mini-batches as inputs, not single data points
    pred = ...
    
    # Compute the probabilities corresponding to the logits
    # and return the top value and index
    
    probabilities = ...
    values, indices = ...
    
    return [{'label': categories[i], 'value': v.item()} for i, v in zip(indices, values)]

Now, try out your prediction function and fine-tuned model:

In [None]:
categories = ['negative', 'positive']
text = "I am really liking this course"
predict(text, model, xlmr_base.transform(), categories)

In [None]:
text = "This course is too complicated!"
predict(text, model, xlmr_base.transform(), categories)

That's cool, but what if we could perform sentiment analysis out-of-the-box?