# Fine-tuning BERT-style models

Transformer-based large language models have been all the hype since about 2018, when BERT was first published.  Fortunately, it's not too hard to fine-tune a model (or, at least, to do some quick-and-dirty fine-tuning; optimizing the fine-tuning process can be pretty time-intensive).  The `transformers` library from Hugging Face is the single best tool for working with transformer-based language models.  Couple it with PyTorch (or, increasingly, JAX+Flax) and you've got a pretty easy-to-use toolbox--as long as you have the GPU compute.  (You do NOT want to try to run this on a CPU-only machine; it'll just be too slow.)

We'll use PyTorch to fine-tune a DistilBERT model on the Amazon review data, and we'll mostly follow the recommended settings from Hugging Face.  DistilBERT is a "distilled" version of BERT that's about 40% smaller and about 60% faster, while keeping about 97% of the accuracy.  This code successfully ran on an NVIDIA Quadro T1000 card (4GB VRAM) that was not being used for anything else, e.g. not being used for graphics rendering.  You _could_ run this code purely on CPU and worry less about the memory overhead, but it'll be a lot slower.

## A note about pretraining and fine-tuning

One of the big reasons why transformers have become all the rage is the "pretrain-finetune" paradigm, which is essentially a form of transfer learning.  The model is first _pre-trained_ on a self-supervised language task, usually a _masked language task_.  Some strategy is used to hide tokens in the input texts, and the model has to predict which token has been hidden.  Then, to fine-tune the model, you add a small densely connected layer right at the very end, feed labeled example sentences through, and update the dense layer along with some of the pretrained model's layers (often just the last few; in this notebook we keep things simple and let the optimizer update all of them).
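To make the masking step concrete, here is a minimal sketch in plain Python.  It is a simplification, not how any particular library does it: the function name `mask_tokens`, the whitespace tokenization, and the masking rates are assumptions for illustration, and real pre-training pipelines work on subword IDs and use more elaborate replacement rules.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens (hypothetical helper for illustration).

    Returns (masked_tokens, targets).  `targets` holds the original token
    at every masked position and None everywhere else, so a model would
    only be scored on the positions it could not see.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(targets)
```

The point of the sketch is that the labels come for free from the text itself, which is why pre-training can use huge unlabeled corpora.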

Conceptually, there are two ways to think about this:
1. The pre-training step lets the model learn some general representation of the target language (e.g. English).  I.e., it imbues the model with some information that answers the question: "what does English generally look like?"  Then, the fine-tuning step takes this general representation of a language, and hones it to be really good at one specific task.  Once a model knows what English generally looks like, it can then learn a more specialized representation.

2. The pre-training step is a way to learn a really good, really general-purpose initialization of the neural network, which serves as a good initialization for a wide range of downstream tasks.  Compare this to, e.g., a random initialization of the network weights.  Fine-tuning is taking this initialization and building the "real" model on top of it.

Pre-training is an extraordinarily time- and compute-intensive process, so you'll probably never do that yourself.  Pretraining models can take weeks or months of continuous runtime on fairly large servers/clusters.  Fine-tuning, by contrast, is relatively quick (though it can still be slow, since these are still quite large neural network models).

In [1]:
# requirements
# !conda install --yes tqdm pandas scikit-learn

# NOTE: go to https://pytorch.org/get-started/locally/ and replace the next line
# with the installation instructions for your platform.
# !conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# !conda install --yes -c huggingface huggingface
# !python -m pip install -U transformers

In [2]:
# tqdm is a magic library that gives you progress bars when iterating
# through things.
from tqdm.notebook import tqdm

In [3]:
import pandas as pd

# load the data
train = pd.read_csv("../../data/train.csv")
test = pd.read_csv("../../data/test.csv")
val = pd.read_csv("../../data/validation.csv")

# the Transformer models will expect our labels to be numeric
# indices starting from 0; just subtract 1 from our stars and
# we're good.  (and add 1 to the final predicted number of stars
# to convert back).
train["stars"] -= 1
test["stars"] -= 1
val["stars"] -= 1
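As a small illustration of that label shift, here is a plain-Python sketch of the round trip from stars to label indices and back.  The helper names and the example logits are made up for illustration; the only thing taken from the notebook is the "subtract 1 going in, add 1 coming out" convention.

```python
def stars_to_label(stars):
    """Map a 1-5 star rating to a 0-4 class index (hypothetical helper)."""
    return stars - 1

def label_to_stars(label):
    """Map a 0-4 class index back to a 1-5 star rating (hypothetical helper)."""
    return label + 1

def predict_stars(logits):
    """Pick the class with the largest score and convert it back to stars."""
    best_label = max(range(len(logits)), key=lambda i: logits[i])
    return label_to_stars(best_label)

# made-up scores for the five classes; index 2 is the largest
example_logits = [-1.2, 0.3, 2.1, 0.8, -0.5]
print(predict_stars(example_logits))  # prints 3
```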

In [4]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In Hugging Face, we generally need to manually run the model's tokenizer over our data.  Transformer-based models have learnable/trainable tokenizers that, essentially, learn how to tokenize an input text; this means that different models' tokenizers behave differently.  It also means that the tokenization is not necessarily human-understandable after it's done.

To get both the full model (which we'll fine-tune in a moment) and the tokenizer (which we'll use to tokenize our data), we just use the `.from_pretrained()` method on the `Auto*` classes.  Here, we're loading the `distilbert-base-uncased` model to start with; this is a transformer model that's designed to be smaller and faster than BERT, but without giving up much accuracy.  (DistilBERT has ~66M parameters, compared to BERT's ~110M).  We're using DistilBERT purely for the speed and low memory footprint--feel free to swap it out for a larger model, like `bert-base-uncased`, if you've got a decent GPU and want to run this yourself.

_Note:_ to use a different base model, like `bert-base-uncased`, `roberta-base`, etc, just replace the name of the model being loaded.  The rest of your code remains unchanged.

In [5]:
# the model, which we'll fine-tune for our classification task.
# it will be downloaded+cached if not already available locally.
MODEL_NAME = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    
    # tell it how many labels we have--this will add the dense
    # layer at the end of the model.
    num_labels=5,
    
    # load the model in the default 32-bit floating point format.
    # Loading in 16-bit float format instead would roughly halve the
    # memory use, and on some more recent NVIDIA GPUs with
    # highly-optimized 16-bit float operations it can provide about a
    # 2x speed increase; it will NOT provide a speedup on all GPUs.
    # But in the interest of compatibility--and accuracy--we'll leave it
    # at float32 and just eat the speed cost.
    #
    # However: note that not all models are necessarily compatible with
    # half-precision (16-bit float) on all hardware.  So if you want to
    # play with half-precision floats for the extra speed, it'll involve
    # some experimentation.
    torch_dtype=torch.float32,
)

# the tokenizer, which we'll use to manually preprocess our data.
# it will be downloaded+cached if not already available locally.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classi

We can ignore those warnings.

The tokenizer and model are both callable objects.  Calling the tokenizer on some strings will tokenize them; calling the model on some tokenized texts will run the inputs through the model and generate final predictions.  But, we're not going to tokenize our dataset right now; doing that would use up a huge amount of RAM.  Instead, we'll make a custom PyTorch `Dataset` class that handles iteration through our raw strings, and we'll tokenize them on-the-fly during the training loop.  This sacrifices some speed, but it'll save us a lot of RAM usage.

In [6]:
from torch.utils.data import Dataset, DataLoader

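class TextDataset(Dataset):
    """Pairs each raw review string with its integer label."""
    def __init__(self, x, y):
        # convert to plain lists so that integer positions always work,
        # regardless of the index on the source DataFrame.
        self.x = list(x)
        self.y = list(y)
        
    def __len__(self):
        return len(self.x)
    
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]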
    
def tokenize(texts, tokenizer):
    """Tokenize the texts, so the model can take them as inputs."""
    return tokenizer(
        # the texts must be a list--not an "array-like," but an actual
        # list--if we're passing multiple texts at once.
        list(texts),

        # pad short texts out to the max sequence length for the model.
        padding="max_length",

        # truncate long texts to the model's max length.  Truncation is
        # rarely an issue for texts that are just a bit longer than the
        # model's max length, but it can introduce errors for texts that
        # are very long.  (very long texts are a pathological problem for
        # transformers).
        truncation=True,

        # return PyTorch tensors, since we're going to use PyTorch models
        # and PyTorch training loops.
        # (other options are "tf" for Tensorflow tensors, or "np" for
        # Numpy ndarrays).
        return_tensors="pt",
    )

def make_dataset(df, batch_size=8):
    return DataLoader(
        TextDataset(df["review_body"], df["stars"]),
        batch_size=batch_size,
        shuffle=True,
    )

train_dataset = make_dataset(train, batch_size=8)
test_dataset = make_dataset(test, batch_size=32)
val_dataset = make_dataset(val, batch_size=32)

As a quick sidebar, the tokenizers return a Python `dict` that we'll pass to the model using `**kwargs` syntax:

In [7]:
for (k,v) in tokenize(["This is a sentence."], tokenizer).items():
    print(k)
    print(v[:, :10])
    print()

input_ids
tensor([[ 101, 2023, 2003, 1037, 6251, 1012,  102,    0,    0,    0]])

attention_mask
tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])
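To show how those two fields relate, here is a plain-Python sketch of padding and attention-mask construction.  It is not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the pad ID of 0 is taken from the zeros in the output above.

```python
def pad_batch(batch_ids, max_length=None, pad_id=0):
    """Pad (and truncate) lists of token IDs to a common length.

    Hypothetical helper: if `max_length` is None, pad to the longest
    sequence in the batch; otherwise pad/truncate to `max_length`.
    """
    target = max_length if max_length is not None else max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:target]                 # truncate long sequences
        n_pad = target - len(ids)          # how many pad tokens to append
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# token IDs for "This is a sentence." copied from the output above
sentence_ids = [101, 2023, 2003, 1037, 6251, 1012, 102]
padded = pad_batch([sentence_ids], max_length=10)
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```

With `max_length=None`, the sketch pads only to the longest sequence in the batch.  That is the idea behind the tokenizer's `padding=True` (or `"longest"`) option, which is worth trying if the reviews are mostly short, since shorter padded batches generally mean less compute per step.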



Now, we can start up our PyTorch training loop.
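Before the training cell, a quick sketch of what a linear learning rate schedule does.  This is a plain-Python re-derivation written for illustration, not the library's code; the function name `linear_lr` is made up, and the 25,000 steps match the batch count shown in the progress bar output further down.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay.

    Illustrative re-implementation: ramps from 0 up to `base_lr` over the
    warmup steps, then decays linearly to 0 at `num_training_steps`.
    """
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

# with no warmup, the rate starts at base_lr and shrinks steadily to zero
for step in (0, 6_250, 12_500, 25_000):
    print(step, linear_lr(step, base_lr=1e-5, num_training_steps=25_000))
```

One consequence worth keeping in mind: if early stopping ends training after only a fraction of the planned steps, the learning rate will have barely decayed by then.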

In [8]:
from transformers import get_scheduler

torch.cuda.empty_cache()

# AdamW is a standard optimizer for transformers.
# Sometimes you'll see regular Adam or something else,
# but usually the optimizers aren't anything super
# exotic for transformer models.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Learning rate scheduler--also commonly used for fine-tuning transformers.
# This will linearly decrease the learning rate after each training batch.
# For this demo we'll only train for a single epoch over the training data.
n_epochs = 1
n_train_steps = n_epochs * len(train_dataset)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=n_train_steps
)

# use CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

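# training loop!
# yes, the indents move about half a mile to the right, and I'm sorry
# about that.
model.to(device)
# `from_pretrained()` returns the model in evaluation mode, so switch
# to training mode (which enables dropout) before fine-tuning.
model.train()
batchnum = 0
best_val_loss = float("inf")
early_stopping_patience = 0
val_losses = []
val_batches = []
train_losses = []
for (x, y) in tqdm(train_dataset, desc="Training"):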

    # tokenize the strings and move them to the GPU, if a GPU is available.
    batch = {
        k: v.to(device)
        for (k,v) in tokenize(x, tokenizer).items()
    }
    batch["labels"] = y.to(device)

    # transformers compute their own loss when the `batch` dict has
    # a `labels` field.  We could use our own loss calculation, but
    # this is fine.
    preds = model(**batch)
    loss = preds.loss
    loss.backward()

    # optimization and learning rate steps
    optimizer.step()
    lr_scheduler.step()

    # reset the accumulated gradients before the next batch
    optimizer.zero_grad()

    batchnum += 1

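    # evaluate on the validation data every 250 batches
    if batchnum % 250 == 0:
        val_loss = 0.0
        # disable dropout while evaluating.
        model.eval()
        with torch.no_grad():
            for (val_x, val_y) in val_dataset:
                val_batch = {
                    k: v.to(device)
                    for (k,v) in tokenize(val_x, tokenizer).items()
                }
                val_batch["labels"] = val_y.to(device)
                # .item() converts the loss to a plain Python float.
                val_loss += model(**val_batch).loss.item()
        # back to training mode for the next training batch.
        model.train()
        val_loss = val_loss / len(val_dataset)
        print(f"Batch {batchnum:,} - validation loss={val_loss}")
        val_losses.append(val_loss)
        val_batches.append(batchnum)
        # `loss` still holds the most recent *training* batch loss, since
        # the validation loop above no longer overwrites it.
        train_losses.append(loss.item())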
        
        # simple early stopping criterion--stop as soon as the validation
        # loss doesn't decrease between validation rounds.  I've left the
        # code here for doing a more patient approach to early stopping--just
        # change the `1` in `if early_stopping_patience >= 1` to be however
        # many validation rounds with no improvement you want to wait before
        # stopping training.
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            early_stopping_patience = 0
        else:
            early_stopping_patience += 1
        
        if early_stopping_patience >= 1:
            break

Training:   0%|          | 0/25000 [00:00<?, ?it/s]

Batch 250 - validation loss=1.1785367727279663
Batch 500 - validation loss=1.12093186378479
Batch 750 - validation loss=1.0768646001815796
Batch 1,000 - validation loss=1.0647107362747192
Batch 1,250 - validation loss=1.0456185340881348
Batch 1,500 - validation loss=1.0505222082138062
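To see why training stopped at batch 1,500, here is the early stopping rule from the loop pulled out into a small standalone sketch and replayed over the validation losses logged above (rounded to four decimals).  The `EarlyStopper` class is a hypothetical refactoring for illustration, not something the training cell uses.

```python
class EarlyStopper:
    """Stop once `patience` validation rounds in a row fail to improve.

    Hypothetical helper that mirrors the counter logic in the training loop.
    """
    def __init__(self, patience=1):
        self.patience = patience
        self.best = float("inf")
        self.bad_rounds = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            # new best: remember it and reset the counter
            self.best = val_loss
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience

# validation losses from the log above, one per 250 training batches
logged = [1.1785, 1.1209, 1.0769, 1.0647, 1.0456, 1.0505]

stopper = EarlyStopper(patience=1)
for round_num, val_loss in enumerate(logged, start=1):
    if stopper.should_stop(val_loss):
        print(f"stop after validation round {round_num}")  # round 6
        break
```

The sixth round (batch 1,500) is the first one where the loss goes up, so a patience of 1 stops right there.  A larger patience would have kept training through that bump, at the cost of more runtime.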


You can see from the `tqdm` timer bar how long this took--a bit under an hour on my computer.  Granted, I ran this on a fairly basic GPU (NVIDIA Quadro T1000--a pretty entry-level, general-purpose compute GPU for laptops), but even on higher-end hardware this will still take some time.

Just for reference, here are some other speed benchmarks from running this same code on a few different hardware configurations:
- CPU-only (Intel Xeon E-2276M, 2.8GHz base clock): ~10-12s per batch.
- CPU-only (AMD Threadripper 2990WX, 3.0GHz base clock): ~2.5s per batch.
- NVIDIA Quadro T1000: ~1s per batch.
- NVIDIA Titan V: ~6.5 batches per second (~0.15s per batch).
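To put those per-batch numbers in perspective, here is a back-of-the-envelope estimate of how long one full pass over the 25,000 training batches would take on each configuration.  It is rough arithmetic only: it ignores the periodic validation passes, uses 11s as a midpoint for the Xeon's 10-12s range, and the helper name `epoch_hours` is made up.

```python
def epoch_hours(n_batches, seconds_per_batch):
    """Hours needed to process `n_batches` at a fixed per-batch cost."""
    return n_batches * seconds_per_batch / 3600

n_batches = 25_000  # training batches of size 8, per the progress bar

seconds_per_batch = {
    "Xeon E-2276M (CPU)": 11.0,        # assumed midpoint of 10-12s
    "Threadripper 2990WX (CPU)": 2.5,
    "Quadro T1000": 1.0,
    "Titan V": 1 / 6.5,                # 6.5 batches per second
}

for name, spb in seconds_per_batch.items():
    print(f"{name}: ~{epoch_hours(n_batches, spb):.1f} hours per epoch")
```

A full epoch works out to roughly 7 hours on the T1000 versus about 1 hour on the Titan V.  The run above finished in under an hour only because early stopping ended it after 1,500 batches.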

So, yes, using a good GPU is a _huge_ must, if you have one available, and if you have a lot of data.  And a GPU with more VRAM means you can run bigger models, too, which can often give you more accuracy (at the cost of longer training times).

Anyways. Let's check the final accuracy/F1 scores of our model now.

In [9]:
import numpy as np
from sklearn import metrics

torch.cuda.empty_cache()
# make sure the model is on the right device and in evaluation mode
# (dropout disabled) before predicting.
model.to(device)
model.eval()
with torch.no_grad():
    predicted = []
    ys = []
    for (x, y) in tqdm(test_dataset):
        batch = {k:v.to(device) for k,v in tokenize(x, tokenizer).items()}
        preds = model(**batch)
        predicted.append(preds["logits"].argmax(axis=1).detach().cpu().numpy())
        ys.append(y.detach().numpy())

predicted = np.hstack(predicted)
ys = np.hstack(ys)

  0%|          | 0/157 [00:00<?, ?it/s]

In [10]:
acc = np.mean(predicted == ys)
# scikit-learn's convention is (y_true, y_pred).
f1 = metrics.f1_score(ys, predicted, average="macro")

print(f"Accuracy: {acc:.2%}")
print(f"Macro F1: {f1:.4f}")

Accuracy: 53.98%
Macro F1: 0.5271
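For readers who want to see what those two numbers measure, here is a plain-Python sketch of accuracy and macro-averaged F1.  It is written for illustration (the function names and the toy labels are made up) and is not a replacement for the scikit-learn call.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# toy labels, not the notebook's data
toy_true = [0, 0, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2]
print(f"Accuracy: {accuracy(toy_true, toy_pred):.2%}")
print(f"Macro F1: {macro_f1(toy_true, toy_pred):.4f}")
```

Because macro F1 weights every star rating equally, it drops when the model does poorly on some classes, even if overall accuracy looks fine.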


Notice how the code we wrote--minus the PyTorch training loop part--was about as complex as the other notebooks.  But also note that our total runtime and hardware requirements went up considerably, and our accuracy went up by a noticeable, but not overwhelming, margin.  This goes back to what I said in the first notebook: there will rarely be a situation where transformers have awesome performance, but simpler models have garbage performance.  (There are a few, but they're rare, and usually deal with extremely abstract, latent linguistic constructs).