## Fine-Tuning BERT (GLUE - MRPC)

This notebook demonstrates the process of fine-tuning [BERT-base (Bidirectional Encoder Representations from Transformers)](https://arxiv.org/abs/1810.04805) for the Microsoft Research Paraphrase Corpus (MRPC) task, part of the [General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) benchmark. BERT-base is a transformer model pre-trained on a large corpus of English text using self-supervised learning. Its pre-training involves two key tasks: **Masked Language Modeling (MLM)**, where it predicts randomly masked words in a sentence, and **Next Sentence Prediction (NSP)**, where it determines if two sentences are consecutive in the original text. This approach allows BERT to learn bidirectional representations of language, capturing complex contextual relationships.

BERT's pre-training provides general-purpose language representations, but the model must be fine-tuned before it is useful on a downstream task. It is best suited to tasks that use the whole sentence (potentially masked) to make a decision, such as sequence classification, token classification, question answering, and, as in this notebook, paraphrase identification. Fine-tuning adapts BERT's general language understanding to the specifics of the MRPC task, which involves determining whether two given sentences are paraphrases of each other.

In this notebook, we'll walk through the steps of preparing the MRPC dataset (including tokenization and dynamic padding), training the model with [Hugging Face Transformers](https://huggingface.co/docs/transformers/index), and tracking its performance on the paraphrase identification task with [Weights & Biases](https://wandb.ai/site). Note that more advanced techniques for fine-tuning will be covered in another notebook.

In [None]:
%pip install transformers datasets evaluate --quiet | tail -n 1

We will use [Weights & Biases](https://wandb.ai/site) to track and visualize our fine-tuning process:

In [None]:
%pip install wandb --quiet | tail -n 1

In [None]:
import wandb
wandb.login()

### 1. Preparing Dataset

#### 1.1 Loading MRPC

The MRPC dataset is one of the datasets that make up the [GLUE benchmark](https://gluebenchmark.com/), an academic benchmark that measures the performance of ML models across nine sentence and sentence-pair language understanding tasks (ten datasets when the diagnostic set is counted).

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In the training example at index 42, `sentence1` and `sentence2` are labeled as equivalent (`label == 1`):

In [None]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[42]

{'sentence1': 'The company has said it plans to restate its earnings for 2000 through 2002 .',
 'sentence2': 'The company had announced in January that it would have to restate earnings for 2002 , 2001 and perhaps 2000 .',
 'label': 1,
 'idx': 47}

In [None]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

#### 1.2 Tokenization

Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the tokenization strategy.

For BERT, we use WordPiece, a subword tokenization method that handles out-of-vocabulary words by breaking them into known subword units. In this section, we use `AutoTokenizer` from the Transformers library to tokenize our input sentences. The tokenizer is initialized from the same checkpoint as our model (`bert-base-uncased`) to ensure compatibility.
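To make the idea of subword splitting concrete, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece. The function name `wordpiece_split` and the tiny `toy_vocab` are made up for illustration; the pretrained tokenizer uses the vocabulary shipped with the checkpoint and also performs normalization steps that are not shown here.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word (toy sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink it until it matches.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word maps to the unknown token
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary, chosen only to demonstrate the splitting behaviour.
toy_vocab = {"the", "re", "##state", "##s", "earn", "##ings"}

print(wordpiece_split("restates", toy_vocab))  # ['re', '##state', '##s']
print(wordpiece_split("earnings", toy_vocab))  # ['earn', '##ings']
print(wordpiece_split("xyz", toy_vocab))       # ['[UNK]']
```

The notebook itself relies on the pretrained tokenizer, which is loaded in the next cell.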

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(x):
    # defer padding to batch creation (however, with TPUs it is more efficient to apply padding here!)
    # in bert, max_sequence_length = 512
    return tokenizer(x["sentence1"], x["sentence2"], truncation=True)

The `tokenize` function processes both sentences of each example together, applying truncation to fit BERT's maximum sequence length of 512 tokens. The `map` method then applies this tokenization to the entire dataset; with `batched=True`, examples are processed in batches rather than one at a time, which is faster.

In [None]:
tokenized_datasets = raw_datasets.map(tokenize, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
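The three new columns can be read as follows: `input_ids` holds the vocabulary indices of the tokens, `token_type_ids` marks which of the two sentences each token belongs to, and `attention_mask` marks real tokens as opposed to padding. The sketch below lays out a sentence pair in the `[CLS] A [SEP] B [SEP]` format that BERT expects. It is only an illustration: `encode_pair` is a hypothetical helper, and token strings stand in for the integer `input_ids`.

```python
def encode_pair(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment ids and mask."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding has been added yet, so every position is a real token.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


enc = encode_pair(["the", "company", "restated"], ["earnings", "were", "restated"])
for key, value in enc.items():
    print(key, value)
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens` can be applied to an example's `input_ids` to inspect the same layout.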

#### 1.3 Dynamic Padding

Dynamic padding is an efficient technique used to handle variable-length sequences within a batch. Instead of padding all sequences to a fixed maximum length, which can waste computational resources, dynamic padding adds padding only up to the length of the longest sequence in each batch. This approach optimizes memory usage and computation time.

In our BERT fine-tuning process, we'll use the `DataCollatorWithPadding` class from the Transformers library, which automatically pads the inputs in each batch to the maximum length in that specific batch.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

In [None]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
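As a rough illustration of what dynamic padding saves, the following pure-Python sketch pads a toy batch with the same sequence lengths as the eight samples above, once to the longest sequence in the batch and once to a fixed length of 512. The helper `pad_batch` and the padding id `0` are assumptions made for this example; in the notebook the padding is done by `DataCollatorWithPadding`.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad sequences to max_length, or to the longest sequence if max_length is None."""
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (target - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return input_ids, attention_mask


# Sequence lengths of the eight samples shown above; the token values are placeholders.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
toy_batch = [[1] * n for n in lengths]

dynamic_ids, dynamic_mask = pad_batch(toy_batch)
fixed_ids, fixed_mask = pad_batch(toy_batch, max_length=512)

real_tokens = sum(lengths)
dynamic_padding = sum(len(row) for row in dynamic_ids) - real_tokens
fixed_padding = sum(len(row) for row in fixed_ids) - real_tokens

print(len(dynamic_ids[0]))  # 67, the longest sequence in the batch
print(dynamic_padding)      # 110 padded positions
print(fixed_padding)        # 3670 padded positions
```

For this batch, padding to the batch maximum adds 110 padded positions, whereas padding every sequence to 512 would add 3670.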

### 2. Training and Evaluation

In this section, we set up the training and evaluation pipeline for our BERT model on the MRPC task. We define a `compute_metrics` function that uses the `evaluate` library to calculate performance metrics specific to the MRPC task.

#### 2.1 Initialize the Weights & Biases Project

In [None]:
run = wandb.init(project="bert-uncased-finetuned-mrpc")

#### 2.2 Evaluate

The `Trainer` class does not compute task metrics on its own: without a `compute_metrics` function, evaluation reports only the validation loss. The `evaluate` library provides dozens of metrics for different domains, including one for the GLUE MRPC task. For this subset it reports:

- `accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1, and

- `f1`: the harmonic mean of precision and recall. Its range is 0 to 1: the lowest possible value is 0, reached if either precision or recall is 0, and the highest possible value is 1.0, which means perfect precision and recall.

In [None]:
import numpy as np
import evaluate

# Load the GLUE/MRPC metric once instead of on every evaluation call
metric = evaluate.load("glue", "mrpc")

def compute_metrics(eval_preds):
    # eval_preds unpacks into the model logits and the reference labels
    logits, labels = eval_preds
    # the predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

#### 2.3 Training Hyperparameters

We then initialize the `TrainingArguments` with basic settings, including the output directory and evaluation strategy. In our case, the `compute_metrics` function will be called at the end of each epoch. This class contains many more hyperparameters that can be tuned as well as flags for activating different training options:

In [None]:
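from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification

training_args = TrainingArguments(
    output_dir="results",
    evaluation_strategy="epoch",  # evaluate after every epoch (recent transformers releases call this `eval_strategy`)
    save_strategy="epoch",        # must match the evaluation strategy when load_best_model_at_end=True
    load_best_model_at_end=True,  # reload the best checkpoint (lowest validation loss by default) after training
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.1,
    push_to_hub=False,
)

# Pre-trained encoder plus a freshly initialized 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,  # recent transformers releases call this argument `processing_class`
    compute_metrics=compute_metrics,
)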

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.432218,0.803922,0.851852
2,No log,0.369652,0.857843,0.898246
3,0.429800,0.413086,0.867647,0.906574


TrainOutput(global_step=690, training_loss=0.3674315521682518, metrics={'train_runtime': 104.2954, 'train_samples_per_second': 105.508, 'train_steps_per_second': 6.616, 'total_flos': 428577075854640.0, 'train_loss': 0.3674315521682518, 'epoch': 3.0})
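The step count and the `No log` entries in the table can be checked with a little arithmetic. The sketch below assumes training on a single device without gradient accumulation, and that `logging_steps` was left at the `Trainer` default of 500; neither is stated explicitly in the notebook.

```python
import math

num_train_examples = 3668  # size of the MRPC training split
batch_size = 16            # per_device_train_batch_size, assuming a single device
num_epochs = 3
logging_steps = 500        # assumed: the default when logging_steps is not set

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Epoch during which the training loss is logged for the first time
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch)     # 230
print(total_steps)         # 690
print(first_logged_epoch)  # 3
```

Under these assumptions, 230 steps per epoch over 3 epochs gives the reported `global_step=690`, and the first training-loss log at step 500 falls in epoch 3, which is consistent with `No log` appearing for epochs 1 and 2.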

### 3. Results

![wandb results](./static/fine_tuning_bert_wandb.png)
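For reference, the two tracked metrics can be computed by hand. The following is a minimal pure-Python sketch of accuracy and F1 for binary labels, with label 1 (`equivalent`) treated as the positive class. The function name and the toy predictions are invented for illustration; the notebook itself obtains these values from the `evaluate` library.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and F1 for binary labels (illustrative sketch)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy data: 4 of 6 predictions are correct, with one false positive and one false negative.
toy_predictions = [1, 1, 0, 1, 0, 0]
toy_references = [1, 0, 0, 1, 1, 0]
scores = accuracy_and_f1(toy_predictions, toy_references)
print(scores)
```

In this toy case precision and recall are both 2/3, so F1 is also 2/3, and accuracy is 4/6.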

In [None]:
wandb.finish()