# 01 - Pretraining
By Jan Christian Blaise B. Cruz

In this notebook we'll see how to pretrain a Transformer (Vaswani, et al., 2017) language model so that it can later be transferred to a downstream task. First, let's do some imports.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import TensorDataset, DataLoader

from pytorch_pretrained_bert import BertTokenizer
from models import TransformerForLanguageModeling

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True

For the sake of demonstration, we'll use the WikiText-2 language modeling dataset, which has about 2M tokens in total. In practice, language models for transfer learning are often trained on the larger WikiText-103 (about 103M tokens) together with a variety of other large corpora. We won't do that here because it is resource-intensive: training a Transformer on WikiText-103 alone takes a couple of days on an NVIDIA Tesla V100 to achieve robust results.

In [2]:
with open('data/wikitext2_train.txt', 'r') as f:
    train_text = [l.strip().replace('<unk>', '[UNK]') for l in f]
with open('data/wikitext2_valid.txt', 'r') as f:
    valid_text = [l.strip().replace('<unk>', '[UNK]') for l in f]

For tokenization, we'll use a scheme called "WordPiece." WordPiece is a form of subword tokenization in the same family as Byte-Pair Encoding (Sennrich, et al., 2016): it chunks words into smaller "pieces of words" by repeatedly merging frequent symbol sequences, much like a compression algorithm driven by n-gram frequencies (more recent iterations use a language model to choose which chunks make it into the final vocab list). This has two advantages. First, it allows us to represent out-of-vocabulary words, since small chunks (single letters included) can be combined to form practically any word. Second, this scheme is arguably closer to how language is actually structured than word-based tokenization, since frequent pieces often line up with "sounds" or "morphemes" that we can now embed directly instead of embedding only whole words.
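
To make the merging idea concrete, here is a minimal sketch of BPE-style merge learning on a toy word-frequency table. It is illustrative only: the helper names (`count_pairs`, `merge_pair`, `learn_merges`) and the toy corpus are made up for this example, and the WordPiece vocabulary that ships with BERT was built with a different selection criterion and far more data.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(word_freqs, num_merges):
    """Greedily merge the most frequent adjacent pair, `num_merges` times."""
    # Start from single characters as the initial symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Toy corpus (hypothetical): word -> frequency
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_merges(toy_corpus, num_merges=3)
print(merges)
print(list(vocab))
```

After a few merges, frequent endings such as "est" become single symbols, while rarer words remain spelled out from smaller pieces. That is what lets a subword vocabulary cover words it has never seen whole.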

I won't go into detail about how BPE works (the paper is well written). To save us the hassle of training a WordPiece vocabulary, we'll use the one trained with BERT (Devlin, et al., 2018), another Transformer-based language model. 


*Note: If you want to work with a language other than English, you'd have to train your own WordPiece vocabulary (usually alongside a BERT model). We happen to have Filipino ones available (BERT models included), so let us know if you need them.*

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

We'll prepare the dataset for training.

Note that our Transformer, unlike Recurrent Neural Network (RNN) based models, takes in sequences of a fixed length. This is actually an advantage: a whole sequence can be processed in one pass instead of token by token (sequential processing limits an RNN's ability to be parallelized, crippling its speed). We set a "maximum number of positions" (analogous to the BPTT length), which is the number of positions the Transformer can attend to at one time using Multihead Attention.

We'll write a function to do the processing for us. We truncate long sequences to the maximum number of positions minus one in order to append a special "end of sequence" token (we reuse ```[CLS]``` for this, since it already exists in BERT's vocabulary). We also pad the shorter sequences with ```[PAD]```. Since we're doing language modeling, we'll train the model to predict the next token, so the target vector is the input shifted one position to the left.

In [4]:
def prepare_dataset(text, max_num_pos):
    X_set, y_set = [], []
    for line in tqdm(text):
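        tokens = tokenizer.tokenize(line)
        # Inputs drop the last token and targets drop the first, so y[i] is the token after x[i].
        # Both are truncated to max_num_pos - 1 to leave room for the closing [CLS] token.
        x = tokens[:-1][:max_num_pos - 1] + ['[CLS]']
        y = tokens[1:][:max_num_pos - 1] + ['[CLS]']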

        if len(x) < max_num_pos:
            x = x + ['[PAD]' for _ in range(max_num_pos - len(x))]
        if len(y) < max_num_pos:
            y = y + ['[PAD]' for _ in range(max_num_pos - len(y))]

        x = tokenizer.convert_tokens_to_ids(x)
        y = tokenizer.convert_tokens_to_ids(y)

        X_set.append(x)
        y_set.append(y)
    X_set = torch.LongTensor(X_set)
    y_set = torch.LongTensor(y_set)
    data = TensorDataset(X_set, y_set)
    
    return data
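
To see what the slicing in `prepare_dataset` produces, here is a small sketch that applies the same truncate, shift and pad steps to a line that is already tokenized. The helper `make_pair` and the toy sentence are hypothetical and exist only for this illustration. No tokenizer is involved, so the tokens stay as readable strings instead of ids.

```python
def make_pair(tokens, max_num_pos, eos="[CLS]", pad="[PAD]"):
    """Mirror the slicing in prepare_dataset on an already-tokenized line."""
    # Input: everything but the last token; target: everything but the first.
    x = tokens[:-1][:max_num_pos - 1] + [eos]
    y = tokens[1:][:max_num_pos - 1] + [eos]
    # Pad both sides up to the fixed length.
    x = x + [pad] * (max_num_pos - len(x))
    y = y + [pad] * (max_num_pos - len(y))
    return x, y

x, y = make_pair(["the", "cat", "sat", "down"], max_num_pos=6)
for inp, tgt in zip(x, y):
    print(f"{inp:>6} -> {tgt}")
```

At every non-padded position the target is the token that follows the input token. The padded positions are later skipped by the loss, because the criterion is created with `ignore_index` set to the id of `[PAD]`.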

We'll use a maximum of 256 positions (the standard configuration for the Transformer-XL (Dai, et al., 2019) model) and a batch size of 32 (so that it fits on a reasonably sized GPU). We construct the dataloaders to facilitate batching.

In [5]:
max_num_pos = 256
batch_size = 32

train_data = prepare_dataset(train_text, max_num_pos)
valid_data = prepare_dataset(valid_text, max_num_pos)
train_loader = DataLoader(train_data, batch_size)
valid_loader = DataLoader(valid_data, batch_size)

100%|██████████| 36718/36718 [00:27<00:00, 1351.82it/s]
100%|██████████| 3760/3760 [00:02<00:00, 1407.03it/s]


We'll use a Transformer with a language modeling head on top for our task. Although we are using the GPT-2 (Radford, et al., 2019) architecture, the hyperparameters are akin to the settings used by the Transformer-XL. We'll use 10 attention heads, each attending over the sequence with its own learned projections, through 16 layers of Transformer blocks. Again, I won't go into detail about how Transformers work, but the aforementioned papers are good resources for learning about modern iterations of them.
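
As a rough illustration of what a single head does, the sketch below computes scaled dot-product attention weights for one query over a handful of positions. It assumes the usual convention that the embedding is split evenly across heads (410 / 10 = 41 dimensions each). The toy vectors and the helper names `softmax` and `attention_weights` are made up for this example, and the causal mask that a language model applies is left out for brevity.

```python
import math

embed_dim, num_heads = 410, 10
head_dim = embed_dim // num_heads  # dimensions handled by each head

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Toy head with 2-dimensional queries/keys and 4 positions (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
weights = attention_weights([1.0, 0.0], keys)

print("dimensions per head:", head_dim)
print("weights over positions:", [round(w, 3) for w in weights])
```

The weights form a distribution over every position in the sequence, so one head already looks at all positions at once. Having several heads gives the model several such distributions, each computed from a different learned projection.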

We'll train the model using Adam (Kingma & Ba, 2014) with a standard cross entropy objective, ignoring the padding token while computing the loss. We'll also use cosine annealing to steadily decrease our learning rate. We initialize the weights of the network from a normal distribution with a mean of 0 and a standard deviation of 0.02, akin to the settings of the Transformer-XL.
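
For intuition about the schedule, here is a small sketch of the closed-form cosine annealing curve, using the starting learning rate (2.5e-4) and period (10 epochs) from this notebook and assuming a minimum learning rate of 0. The function `cosine_annealing_lr` is written just for this illustration and is not the PyTorch scheduler itself.

```python
import math

def cosine_annealing_lr(epoch, base_lr=2.5e-4, t_max=10, eta_min=0.0):
    """Learning rate after `epoch` scheduler steps on a half-cosine curve."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)

for epoch in range(11):
    print(epoch, f"{cosine_annealing_lr(epoch):.2e}")
```

The rate starts at its full value, passes through half of it at the midpoint, and reaches the minimum after `t_max` steps. It drops slowly at first, fastest in the middle, and slowly again near the end.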

In [6]:
model = TransformerForLanguageModeling(embed_dim=410, hidden_dim=2100, num_embeddings=len(tokenizer.vocab), 
                                       num_max_positions=256, num_heads=10, num_layers=16, dropout=0.1).to(device)
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.vocab['[PAD]'])
optimizer = optim.Adam(model.parameters(), lr=2.5e-4)
scheduler = CosineAnnealingLR(optimizer, 10)

def init_weights(m):
    if isinstance(m, (nn.Linear, nn.Embedding, nn.LayerNorm)):
        m.weight.data.normal_(mean=0.0, std=0.02)
    if isinstance(m, (nn.Linear, nn.LayerNorm)) and m.bias is not None:
        m.bias.data.zero_()

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model.apply(init_weights)
print("The model has {:,} trainable parameters".format(count_parameters(model)))

The model has 50,396,360 trainable parameters
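
As a sanity check on that number, here is one back-of-the-envelope breakdown that reproduces it. The `models` module isn't shown in this notebook, so the layout below is an assumption: token and position embeddings, 16 blocks that each hold a multi-head attention module (query, key, value and output projections with biases), a two-layer feed-forward network and two layer norms, plus a language modeling head that shares its weights with the token embedding. The vocabulary size of 28,996 (the `bert-base-cased` vocabulary) is also hard-coded as an assumption.

```python
embed_dim, hidden_dim = 410, 2100
num_layers, num_max_positions = 16, 256
vocab_size = 28996  # assumption: size of the bert-base-cased vocabulary

def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

# Assumed contents of one Transformer block.
attention = 4 * linear(embed_dim, embed_dim)          # query, key, value, output
feed_forward = linear(embed_dim, hidden_dim) + linear(hidden_dim, embed_dim)
layer_norms = 2 * (2 * embed_dim)                     # scale and shift, twice
per_layer = attention + feed_forward + layer_norms

# Token and position embeddings; the tied output head adds nothing extra.
embeddings = vocab_size * embed_dim + num_max_positions * embed_dim

total = embeddings + num_layers * per_layer
print(f"{per_layer:,} parameters per block")
print(f"{total:,} parameters in total")
```

Under these assumptions, roughly three quarters of the parameters sit in the Transformer blocks, and most of the remainder is the token embedding table.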


We'll train for only 10 epochs as a demonstration. Normally, you'd want to train for around 200 epochs on a much larger dataset (like WikiText-103).

In [7]:
epochs = 10
max_norm = 0.25
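
for i in range(epochs):
    # Reset the running totals at the start of every epoch; otherwise the
    # previous epoch's averages leak into the current epoch's sums.
    train_loss, train_ppl = 0, 0
    test_loss, test_ppl = 0, 0
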
    model.train()
    for batch in tqdm(train_loader):
        x, y = batch
        x = x.to(device)
        y = y.to(device)
        
        out = model(x)
        loss = criterion(out.flatten(0, 1), y.flatten())

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()

        train_loss += loss.item()
        train_ppl += torch.exp(loss).item()
    train_loss /= len(train_loader)
    train_ppl /= len(train_loader)

    model.eval()
    with torch.no_grad():
        for batch in tqdm(valid_loader):
            x, y = batch
            x = x.to(device)
            y = y.to(device)

            out = model(x)
            loss = criterion(out.flatten(0, 1), y.flatten())
            
            test_loss += loss.item()
            test_ppl += torch.exp(loss).item()
        test_loss /= len(valid_loader)
        test_ppl /= len(valid_loader)

    scheduler.step()
    print("Train Loss {:.4f} | Train Ppl {:.4f} | Test Loss {:.4f} | Test Ppl {:.4f}".format(train_loss, train_ppl, test_loss, test_ppl))
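
A note on the perplexity numbers: the loop above averages the per-batch perplexities, while a common alternative is to exponentiate the average loss. The two are not the same quantity. The sketch below uses made-up per-batch losses (not results from this notebook) to show the gap.

```python
import math

# Hypothetical per-batch cross-entropy values, for illustration only.
batch_losses = [5.2, 4.8, 6.1, 5.0]

mean_loss = sum(batch_losses) / len(batch_losses)

# Option 1: exponentiate the average loss.
ppl_from_mean_loss = math.exp(mean_loss)

# Option 2 (what the loop above does): average the per-batch perplexities.
mean_of_batch_ppl = sum(math.exp(l) for l in batch_losses) / len(batch_losses)

print(f"exp(mean loss)    = {ppl_from_mean_loss:.1f}")
print(f"mean of exp(loss) = {mean_of_batch_ppl:.1f}")
```

Because the exponential is convex, the mean of per-batch perplexities is never smaller than the exponential of the mean loss, and a single bad batch can inflate it noticeably. Keep this in mind when comparing the printed perplexities against numbers reported elsewhere.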




After training, we can use the pretrained language model as a starting point and fine-tune it on a downstream task. In the next notebook, we'll use a pretrained Transformer to perform text classification.