# 02 - Finetuning
By Jan Christian Blaise B. Cruz

In this notebook, we'll learn how to finetune a transformer language model into a text classifier. First, let's do the imports.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CyclicLR, CosineAnnealingLR
from torch.utils.data import TensorDataset, DataLoader

from pytorch_pretrained_bert import BertTokenizer, cached_path
from models import TransformerForClassification

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True

We can finetune to either the IMDB Sentiment Classification task, or a subset of the Text Retrieval Conference (TREC) document classification task. For this example, we'll use the TREC dataset. The datasets are available directly in the ```/data``` folder so there's no need to download them separately. Like in the pretraining notebook, we'll use WordPiece tokenization and we'll use BERT's ready tokenizer for this purpose. 

Again, do note that if you'll want to use a different language, you'll have to train your own BERT WordPiece tokenizer (we have Filipino ones available, contact us if you need them!)

In [2]:
# Load data
task = 'trec'
df = pd.read_csv('data/' + task + '.csv')
text, labels = list(df['text']), list(df['labels'])

# Instantiate tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

We'll preprocess the data like we did in the pretraining notebook.

In [3]:
max_num_pos = 256
batch_size = 32

# Trim the dataset and pad
data = []
for line in tqdm(text):
    line = tokenizer.tokenize(line)[:max_num_pos - 1] + ['[CLS]']
    if len(line) < max_num_pos:
        line = line + ['[PAD]' for _ in range(max_num_pos - len(line))]
    tokens = tokenizer.convert_tokens_to_ids(line)
    data.append(tokens)
X = np.array(data)

# Build labels
label_list = list(set(labels))
y = np.array([label_list.index(y) for y in labels])

# Build dataset and loader
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_data = TensorDataset(torch.LongTensor(X_train), torch.LongTensor(y_train))
test_data = TensorDataset(torch.LongTensor(X_test), torch.LongTensor(y_test))

train_loader = DataLoader(train_data, batch_size=batch_size)
test_loader = DataLoader(test_data, batch_size=batch_size)

100%|██████████| 5452/5452 [00:00<00:00, 6318.37it/s]


To finetune, we'll use a GPT-2 Transformer model with a classification head on top (see the ```models.py``` file for the complete code listing). We'll again use the Adam (Kingma & Ba, 2014) optimizer to finetune our model, and we'll use Cosine Annealing as out learning rate schedule. We'll give it a maximum epoch setting of 3 since we'll only take that long to finetune our classifier.

Likewise, we'll initialize the weights and biases. Afterwhich, we'll load pretrained Transformer weights trained on the WikiText-103 dataset. The script downloads it for you automatically, so there's no need to process it yourself. A nice thing to note here is that ```TransformerForClassification``` is a subclass of ```TransformerForLanguageModeling``` and so loading weights for the superclass works for the subclass. No need to adjust weights and encodings unlike in ULMFiT (Howard & Ruder, 2018).

In [4]:
model = TransformerForClassification(num_classes=len(label_list), embed_dim=410, hidden_dim=2100, num_embeddings=len(tokenizer.vocab), 
                                     num_max_positions=256, num_heads=10, num_layers=16, dropout=0.1).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=6e-5)
scheduler = CosineAnnealingLR(optimizer, 3)

def init_weights(m):
    if isinstance(m, (nn.Linear, nn.Embedding, nn.LayerNorm)):
        m.weight.data.normal_(mean=0.0, std=0.02)
    if isinstance(m, (nn.Linear, nn.LayerNorm)) and m.bias is not None:
        m.bias.data.zero_()

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model.apply(init_weights)
print("The model has {:,} trainable parameters".format(count_parameters(model)))

# Download weights and load them
path = cached_path('https://s3.amazonaws.com/models.huggingface.co/naacl-2019-tutorial/model_checkpoint.pth')
state_dict = torch.load(open(path, 'rb'), map_location='cpu')
incompatible_keys = model.load_state_dict(state_dict, strict=False)

The model has 50,398,826 trainable parameters


We'll finetune for just three epochs. We'll use a max norm of 0.25 for gradient clipping.

Do note that since the base model is a language model, our classifier will likewise anticipate an input of shape ```[seq_len, batch_size]```. We'll transpose accordingly to accomodate the need. Also, we're going to have to manually mask the special tokens ```[CLS]``` and ```[PAD]```.

In [5]:
epochs = 3
max_norm = 0.25
train_loss = 0
train_acc = 0
test_loss = 0
test_acc = 0

for i in range(epochs):
    model.train()
    for batch in tqdm(train_loader):
        x, y = batch

        inputs = x.transpose(1, 0).contiguous().to(device)
        targets = y.to(device)
        clf_mask = (inputs == tokenizer.vocab['[CLS]']).to(device)
        pad_mask = (x == tokenizer.vocab['[PAD]']).to(device)

        logits = model(inputs, clf_tokens_mask=clf_mask, padding_mask=pad_mask)
        loss = criterion(logits, targets)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()

        train_loss += loss.item()
        train_acc += torch.sum(torch.argmax(logits, dim=1) == targets).item() / len(targets)
    train_loss /= len(train_loader)
    train_acc /= len(train_loader)

    model.eval()
    with torch.no_grad():
        for batch in tqdm(test_loader):
            x, y = batch

            inputs = x.transpose(1, 0).contiguous().to(device)
            targets = y.to(device)
            clf_mask = (inputs == tokenizer.vocab['[CLS]']).to(device)
            pad_mask = (x == tokenizer.vocab['[PAD]']).to(device)

            logits = model(inputs, clf_tokens_mask=clf_mask, padding_mask=pad_mask)
            loss = criterion(logits, targets)

            test_loss += loss.item()
            test_acc += torch.sum(torch.argmax(logits, dim=1) == targets).item() / len(targets)
        test_loss /= len(test_loader)
        test_acc /= len(test_loader)

    scheduler.step()
    print("Train Loss {:.4f} | Train Acc {:.4f} | Test Loss {:.4f} | Test Acc {:.4f}".format(train_loss, train_acc, test_loss, test_acc))

100%|██████████| 128/128 [00:39<00:00,  3.51it/s]
100%|██████████| 43/43 [00:04<00:00, 10.71it/s]
  0%|          | 0/128 [00:00<?, ?it/s]

Train Loss 0.8024 | Train Acc 0.6962 | Test Loss 0.3232 | Test Acc 0.8912


100%|██████████| 128/128 [00:38<00:00,  3.54it/s]
100%|██████████| 43/43 [00:04<00:00, 10.72it/s]
  0%|          | 0/128 [00:00<?, ?it/s]

Train Loss 0.2756 | Train Acc 0.9143 | Test Loss 0.2851 | Test Acc 0.9376


100%|██████████| 128/128 [00:38<00:00,  3.50it/s]
100%|██████████| 43/43 [00:04<00:00, 10.71it/s]

Train Loss 0.1256 | Train Acc 0.9673 | Test Loss 0.2800 | Test Acc 0.9445





And we got a final test set accuracy of 94.45% in just three epochs of finetuning!