**Outline**

- [Fine-tuning a BERT model in PyTorch](#Fine-tuning-a-BERT-model-in-PyTorch)
  - [Loading the IMDb movie review dataset](#Loading-the-IMDb-movie-review-dataset)
  - [Tokenizing the dataset](#Tokenizing-the-dataset)
  - [Loading and fine-tuning a pre-trained BERT model](#Loading-and-fine-tuning-a-pre-trained-BERT-model)
  - [Fine-tuning a transformer more conveniently using the Trainer API](#Fine-tuning-a-transformer-more-conveniently-using-the-Trainer-API)
- [Summary](#Summary)

---

Quote from https://huggingface.co/transformers/custom_datasets.html:

> DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than bert-base-uncased, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language understanding benchmark.

---

In [1]:
from IPython.display import Image

## Fine-tuning a BERT model in PyTorch

In this section, we learn how to fine-tune a BERT model for sentiment classification in PyTorch.

Note that pre-training a BERT model from scratch is costly and usually unnecessary,
because the `transformers` Python package provided by Hugging Face
includes many pre-trained models that are ready for fine-tuning.

We see how to prepare and tokenize the IMDb movie review dataset and fine-tune the distilled BERT model to perform sentiment classification.

### Loading the IMDb movie review dataset


In [2]:
import gzip
import shutil
import time

import pandas as pd
import requests
import torch
import torch.nn.functional as F
import torchtext

import transformers
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

**General Settings**

In [3]:
torch.backends.cudnn.deterministic = True
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

NUM_EPOCHS = 3

**Download Dataset**

The following cells download the IMDb movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification as a CSV-formatted file:

In [4]:
url = "https://github.com/rasbt/machine-learning-book/raw/main/ch08/movie_data.csv.gz"
filename = url.split("/")[-1]

r = requests.get(url)
r.raise_for_status()  # fail early instead of saving an HTTP error page

with open(filename, "wb") as f:
    f.write(r.content)

with gzip.open('movie_data.csv.gz', 'rb') as f_in:
    with open('movie_data.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
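
The decompression step can be tried in isolation without downloading anything. The following is a minimal sketch that writes a tiny gzip file and restores it with the same `gzip.open` and `shutil.copyfileobj` pattern; the file names and the two-line CSV content are made up for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "toy.csv.gz"   # hypothetical file names
    csv_path = Path(tmp) / "toy.csv"
    original = b"review,sentiment\nGreat movie,1\n"

    # Create a small compressed file to stand in for movie_data.csv.gz
    with gzip.open(gz_path, "wb") as f:
        f.write(original)

    # Stream-decompress it; copyfileobj copies in chunks, so the whole
    # file never has to fit in memory at once
    with gzip.open(gz_path, "rb") as f_in, open(csv_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

    restored = csv_path.read_bytes()

print(restored == original)
```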

Check that the dataset looks okay:

In [5]:
df = pd.read_csv('movie_data.csv')
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |


In [6]:
df.shape

(50000, 2)

**Split Dataset into Train/Validation/Test**

Here, we use 70 percent of the reviews for the training set, 10 percent for the validation set, and the remaining 20 percent for testing.

In [7]:
train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values

valid_texts = df.iloc[35000:40000]['review'].values
valid_labels = df.iloc[35000:40000]['sentiment'].values

test_texts = df.iloc[40000:]['review'].values
test_labels = df.iloc[40000:]['sentiment'].values
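
As a quick sanity check, the slice boundaries can be turned back into fractions of the 50,000 reviews. This is a small sketch; the `splits` dictionary is a hypothetical helper that simply restates the `iloc` boundaries used above. Positional slices like these only yield representative splits if the rows are in random order, which is an assumption worth verifying for any new dataset.

```python
n_total = 50_000

# (start, stop) row positions, mirroring the iloc slices
splits = {
    "train": (0, 35_000),
    "valid": (35_000, 40_000),
    "test": (40_000, n_total),
}

fractions = {
    name: (stop - start) / n_total
    for name, (start, stop) in splits.items()
}

for name, frac in fractions.items():
    print(f"{name}: {frac:.0%}")
```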

### Tokenizing the dataset

Now, we tokenize the texts into individual (sub)word tokens using the tokenizer implementation inherited
from the pre-trained model class.

In [8]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [9]:
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)

In [10]:
train_encodings[0]

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

**Dataset Class and Loaders**

Let us pack everything into a class called `IMDbDataset` and create the corresponding data loaders.
Such a self-defined dataset class lets us customize all the related features and functions for our custom movie review dataset, which we loaded as a `DataFrame`.

In [11]:
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = IMDbDataset(train_encodings, train_labels)
valid_dataset = IMDbDataset(valid_encodings, valid_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

In [12]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)

Let us look at the `item` variable in the `__getitem__` method. 

The encodings we produced previously store a lot of information about the tokenized texts. The dictionary comprehension that builds `item` extracts only the most relevant information for the example at index `idx`.

For instance, the resulting dictionary entries include `input_ids` (unique integers from the vocabulary corresponding to the tokens), `labels` (the class labels), and `attention_mask`. Here, `attention_mask` is a tensor with binary values (0s and 1s) that denotes which tokens the model should attend to.

In particular, 0s correspond to tokens used for padding the sequence to equal lengths and are ignored by the model; the 1s correspond to the actual text tokens.
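
To make the relationship between padding and the attention mask concrete, here is a minimal pure-Python sketch of what `truncation=True, padding=True` conceptually does to a batch. The helper `pad_batch`, the short `max_length`, and the token ids are illustrative assumptions, not the tokenizer's actual implementation; for instance, the real tokenizer keeps the closing special token when truncating, whereas this sketch simply cuts the list.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one."""
    sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)

    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)

    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative token ids for three "reviews" of different lengths
batch = pad_batch(
    [
        [101, 2023, 3185, 102],
        [101, 2307, 102],
        [101, 1045, 2001, 2025, 7622, 102],
    ],
    max_length=5,
)

for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, and the mask has a 0 exactly where a padding id was appended.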

### Loading and fine-tuning a pre-trained BERT model

Once we have taken care of the data preparation, we load the pre-trained DistilBERT model and fine-tune it using the dataset we just created.

In [None]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(DEVICE)
model.train()

optim = torch.optim.Adam(model.parameters(), lr=5e-5)

`DistilBertForSequenceClassification` specifies the downstream task we want to fine-tune the model
on, which is sequence classification in this case. 

As mentioned before, `'distilbert-base-uncased'` is a lightweight version of a BERT uncased base model with manageable size and good performance. 

Note that “uncased” means that the model does not distinguish between upper- and lower-case letters.
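
The effect of an uncased vocabulary can be approximated with a few lines of standard-library Python. This is a rough sketch of the text normalization (lowercasing plus stripping accent marks) applied before tokenization in uncased BERT-style models; the function name `uncased_normalize` is hypothetical, and the real tokenizer performs additional cleanup steps.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase the text and drop combining accent marks."""
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining marks produced by NFD decomposition
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("Great MOVIE"))
print(uncased_normalize("Café"))
```

After this step, "Great", "GREAT", and "great" all map to the same token.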

**Train Model -- Manual Training Loop**

First, we need to define an accuracy function to evaluate the model performance. Note that this accuracy function computes the conventional classification accuracy.

Here, we need to load the dataset batch by batch to work around RAM or GPU memory (VRAM) limitations when working with a large deep learning model.

In [15]:
def compute_accuracy(model, data_loader, device):
    with torch.no_grad():
        correct_pred, num_examples = 0, 0

        for batch_idx, batch in enumerate(data_loader):

            ### Prepare data
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            ### Forward pass (labels are not passed, so no loss is computed)
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs['logits']

            ### Count correct predictions
            predicted_labels = torch.argmax(logits, 1)
            num_examples += labels.size(0)
            correct_pred += (predicted_labels == labels).sum()

        return correct_pred.float()/num_examples * 100


In the `compute_accuracy` function, we load a given batch and then obtain the predicted labels from the outputs. 

While doing this, we keep track of the total number of examples via `num_examples`. Similarly,
we keep track of the number of correct predictions via the `correct_pred` variable. 

Finally, after we
iterate over the complete dataset, we compute the accuracy as the proportion of correctly predicted
labels.

Overall, via the `compute_accuracy` function, we can already get a glimpse at how we can use the
transformer model to obtain the class labels. 


That is, we feed the model the `input_ids` along with the `attention_mask` information that, here, denotes whether a token is an actual text token or a token for padding the sequences to equal length.

The model call then returns the outputs, which is a `transformers` library-specific `SequenceClassifierOutput` object. From this object, we then obtain the logits that we
convert into class labels via the `argmax` function.
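
The bookkeeping in `compute_accuracy` can be traced by hand on a toy example. The sketch below uses plain Python lists in place of tensors and a small hypothetical `argmax` helper in place of `torch.argmax`; the logits and labels are made up.

```python
# Two hypothetical batches from a binary classifier: one row of logits
# per example, one column per class
batches = [
    {"logits": [[2.0, -1.0], [0.3, 0.9]], "labels": [0, 1]},
    {"logits": [[-0.5, 0.1], [1.2, 1.1]], "labels": [0, 0]},
]


def argmax(row):
    """Index of the largest value, i.e. the predicted class label."""
    return max(range(len(row)), key=row.__getitem__)


correct_pred, num_examples = 0, 0
for batch in batches:
    predicted_labels = [argmax(row) for row in batch["logits"]]
    num_examples += len(batch["labels"])
    correct_pred += sum(
        pred == true for pred, true in zip(predicted_labels, batch["labels"])
    )

accuracy = correct_pred / num_examples * 100
print(f"{accuracy:.2f}%")  # 75.00%
```

Three of the four predictions match their labels; only the first example of the second batch is misclassified.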

In [None]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    
    model.train()
    
    for batch_idx, batch in enumerate(train_loader):
        
        ### Prepare data
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        ### Forward
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss, logits = outputs['loss'], outputs['logits']
        
        ### Backward
        optim.zero_grad()
        loss.backward()
        optim.step()
        
        ### Logging
        if not batch_idx % 250:
            print (f'Epoch: {epoch+1:04d}/{NUM_EPOCHS:04d} | '
                   f'Batch {batch_idx:04d}/{len(train_loader):04d} | '
                   f'Loss: {loss:.4f}')
            
    model.eval()

    with torch.set_grad_enabled(False):
        print(f'Training accuracy: '
              f'{compute_accuracy(model, train_loader, DEVICE):.2f}%'
              f'\nValid accuracy: '
              f'{compute_accuracy(model, valid_loader, DEVICE):.2f}%')
        
    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')
    
print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

In this code, we iterate over multiple epochs. In each epoch, we perform the following steps (steps 1 to 3 for every batch, step 4 once per epoch):
1. Load the input onto the device we are working on (GPU or CPU)
2. Compute the model output and loss
3. Adjust the weight parameters by backpropagating the loss
4. Evaluate the model performance on both the training and validation set
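
The steps above can be sketched without a GPU or a pre-trained model. The following stand-in replaces DistilBERT with a one-feature logistic regression and replaces autograd with a hand-derived gradient, so it only illustrates the loop structure (mini-batches, forward pass, gradient step, per-epoch evaluation); the toy data and all names in it are assumptions.

```python
import math
import random

random.seed(123)

# Toy data: one feature x, label 1 if x > 0 (stand-in for encoded reviews)
data = [(x, 1 if x > 0 else 0)
        for x in (random.uniform(-1, 1) for _ in range(200))]
train, valid = data[:160], data[160:]

w, b = 0.0, 0.0                      # model parameters
lr, batch_size, num_epochs = 0.5, 16, 3


def predict_proba(x):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def accuracy(examples):
    correct = sum((predict_proba(x) > 0.5) == bool(y) for x, y in examples)
    return 100.0 * correct / len(examples)


for epoch in range(num_epochs):
    random.shuffle(train)
    for start in range(0, len(train), batch_size):
        batch = train[start:start + batch_size]

        # Forward pass and gradient of the mean cross-entropy loss
        grad_w = grad_b = 0.0
        for x, y in batch:
            error = predict_proba(x) - y
            grad_w += error * x / len(batch)
            grad_b += error / len(batch)

        # Parameter update (what optim.step() does for the real model)
        w -= lr * grad_w
        b -= lr * grad_b

    # Per-epoch evaluation on training and validation data
    print(f"Epoch {epoch + 1}: train {accuracy(train):.1f}% | "
          f"valid {accuracy(valid):.1f}%")
```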

After three epochs, accuracy on the test
dataset reaches around 93 percent, which is a substantial improvement compared to the 85 percent
test accuracy of the RNN.

In [None]:
del model # free memory