<a href="https://colab.research.google.com/github/chineidu/NLP-Tutorial/blob/main/notebook/06_Transformers/tasks/02_masked_language_modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Masked Language Modelling (MLM)](https://huggingface.co/learn/nlp-course/chapter7/3?fw=pt)

- For NLP tasks with Transformer models, you can use pretrained models from Hugging Face and fine-tune them on your data.
- Transfer learning works well if the pretraining and fine-tuning corpora are similar.
- However, in cases like `legal` or `scientific text`, domain-specific words may be treated as `rare` tokens.
- Fine-tuning the language model on in-domain data can improve downstream task performance.
- This process is called `domain adaptation`, popularized by ULMFiT in 2018.
- We'll perform a similar process with `Transformers` instead of LSTMs.

<br>

## Benefits of MLM

- **Improved Generalization**: By exposing the model to various masking patterns, MLM enhances its ability to generalize to unseen data and perform well on downstream tasks.

- **Effective Pre-training for Diverse Tasks**: MLM has been shown to be effective in pre-training language models for a wide range of NLP tasks, including text generation, machine translation, and question answering.

In [1]:
!pip install rich
!pip install transformers[torch]
!pip install torch datasets evaluate

Collecting accelerate>=0.20.3 (from transformers[torch])
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.24.1
Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)

In [2]:
# Built-in library
import re
import json
from typing import Any, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
import pandas as pd
from rich import print
import torch

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# # Black code formatter (Optional)
# %load_ext lab_black

# # auto reload imports
# %load_ext autoreload
# %autoreload 2

In [3]:
from transformers import AutoModelForMaskedLM


model_checkpoint: str = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [4]:
distilbert_num_parameters: float = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

In [5]:
# Let’s see what kinds of tokens this model predicts:
text: str = "This is a great [MASK]."

### Comment

- As humans, we can imagine many possibilities for the [MASK] token, such as “day”, “ride”, or “painting”.
- For pretrained models, the predictions depend on the corpus the model was trained on, since it learns to pick up the statistical patterns present in the data.
- Like BERT, DistilBERT was pretrained on the [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [BookCorpus datasets](https://huggingface.co/datasets/bookcorpus), so we expect the predictions for [MASK] to reflect these domains.
- To predict the mask we need DistilBERT’s tokenizer to produce the inputs for the model, so let’s download that from the Hub as well:

In [6]:
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
inputs: dict[str, Any] = tokenizer(text, return_tensors="pt")
token_logits: torch.Tensor = model(**inputs).logits
token_logits

tensor([[[ -5.5882,  -5.5868,  -5.5958,  ...,  -4.9448,  -4.8174,  -2.9905],
         [-11.9031, -11.8872, -12.0622,  ..., -10.9570, -10.6464,  -8.6323],
         [-11.9604, -12.1520, -12.1279,  ..., -10.0218,  -8.6074,  -8.0971],
         ...,
         [ -4.8228,  -4.6268,  -5.1041,  ...,  -4.2771,  -5.0184,  -3.9428],
         [-11.2944, -11.2388, -11.3857,  ...,  -9.2063,  -9.3411,  -6.1505],
         [ -9.5213,  -9.4632,  -9.5022,  ...,  -8.6561,  -8.4908,  -4.6903]]],
       grad_fn=<ViewBackward0>)

In [7]:
print(f'input_ids: {inputs["input_ids"]}')
print(f"mask_token_id: {tokenizer.mask_token_id}")

print(inputs["input_ids"].flatten() == tokenizer.mask_token_id)

In [8]:
raw_tokens: list[str] = tokenizer.tokenize(text)
token_ids: list[int] = tokenizer(text).get("input_ids")

print(raw_tokens, token_ids)

In [9]:
print(token_logits.shape)

token_logits

tensor([[[ -5.5882,  -5.5868,  -5.5958,  ...,  -4.9448,  -4.8174,  -2.9905],
         [-11.9031, -11.8872, -12.0622,  ..., -10.9570, -10.6464,  -8.6323],
         [-11.9604, -12.1520, -12.1279,  ..., -10.0218,  -8.6074,  -8.0971],
         ...,
         [ -4.8228,  -4.6268,  -5.1041,  ...,  -4.2771,  -5.0184,  -3.9428],
         [-11.2944, -11.2388, -11.3857,  ...,  -9.2063,  -9.3411,  -6.1505],
         [ -9.5213,  -9.4632,  -9.5022,  ...,  -8.6561,  -8.4908,  -4.6903]]],
       grad_fn=<ViewBackward0>)

In [10]:
token_logits[0, 5, :]

tensor([-4.8228, -4.6268, -5.1041,  ..., -4.2771, -5.0184, -3.9428],
       grad_fn=<SliceBackward0>)

In [11]:
# Find the location of [MASK] and extract its logits
mask_token_index: torch.Tensor = torch.where(
    inputs["input_ids"].flatten() == tokenizer.mask_token_id
)[0]
mask_token_logits: torch.Tensor = token_logits[0, mask_token_index, :]

print(f"mask_token_index: {mask_token_index}")

mask_token_logits

tensor([[-4.8228, -4.6268, -5.1041,  ..., -4.2771, -5.0184, -3.9428]],
       grad_fn=<IndexBackward0>)

In [12]:
# Pick the [MASK] candidates with the highest logits
k: int = 5
top_5_tokens = torch.topk(mask_token_logits, k, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
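The cell above ranks raw logits. As a minimal sketch of the same idea with probabilities attached, the snippet below applies a softmax to a handful of made-up logits before ranking them. The names `toy_vocab` and `toy_logits` and their values are hypothetical and are not taken from DistilBERT's output.

```python
import math

# Hypothetical toy vocabulary and logits for a single [MASK] position
toy_vocab: list[str] = ["day", "ride", "painting", "deal", "idea"]
toy_logits: list[float] = [2.0, 0.5, 0.1, 3.0, 1.0]


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities (subtract the max for numerical stability)."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits: list[float], k: int) -> list[tuple[int, float]]:
    """Return (index, probability) pairs for the k largest logits."""
    probs = softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


for index, prob in top_k(toy_logits, k=3):
    print(f">>> This is a great {toy_vocab[index]}. (p={prob:.3f})")
```

Because softmax is monotonic, ranking by probability gives the same order as ranking by logit; the probabilities only make the gap between candidates easier to read.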

### Load the Dataset

- To showcase domain adaptation, we’ll use the famous [Large Movie Review Dataset (or IMDb for short)](https://huggingface.co/datasets/imdb), which is a corpus of movie reviews that is often used to benchmark sentiment analysis models.
- By fine-tuning DistilBERT on this corpus, we expect the language model will adapt its vocabulary from the factual data of Wikipedia that it was pretrained on to the more subjective elements of movie reviews.
- We can get the data from the Hugging Face Hub with the load_dataset() function from 🤗 Datasets:

In [13]:
from datasets import load_dataset, Dataset, DatasetDict


data_path: str = "imdb"
imdb_dataset: DatasetDict = load_dataset(data_path)

imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [14]:
# Preview a small sample
RANDOM_STATE: int = 42
sample = imdb_dataset.get("train").shuffle(seed=RANDOM_STATE).select(range(3))

for row in sample:
    print(f">>> Review: {row.get('text')}")
    print(f">>> Label: {row.get('label')}")

In [15]:
# Check the number of unique labels
imdb_dataset.get("train").unique("label")

[0, 1]

In [16]:
# Preview a small sample of the unsupervised/unlabelled data
sample = imdb_dataset.get("unsupervised").shuffle(seed=RANDOM_STATE).select(range(3))

for row in sample:
    print(f">>> Review: {row.get('text')}")
    print(f">>> Label: {row.get('label')}")

In [17]:
# Check the number of unique labels
imdb_dataset.get("unsupervised").unique("label")

[-1]

### Comment

- For both `auto-regressive` and `masked language modeling`, a common preprocessing step is to `concatenate` all the examples and then split the whole corpus into chunks of equal size.
- This is quite different from our usual approach, where we simply tokenize individual examples. Why concatenate everything together?
- The reason is that individual examples might get truncated if they’re too long, and that would result in losing information that might be useful for the language modeling task!

- To get started, we’ll first tokenize our corpus as usual, but without setting the `truncation=True` option in our tokenizer.
- We’ll also grab the word IDs if they are available (which they will be if we’re using a fast tokenizer, as described in Chapter 6), as we will need them later on to do whole word masking.
- We’ll wrap this in a simple function, and while we’re at it we’ll remove the `text` and `label` columns since we don’t need them any longer:

In [18]:
def tokenize_function(examples: dict[str, Any]) -> dict[str, Any]:
    """This is used to tokenize the texts."""
    result: dict[str, Any] = tokenizer(examples.get("text"))

    if tokenizer.is_fast:
        result["word_ids"] = [
            result.word_ids(idx) for idx in range(len(result["input_ids"]))
        ]
    return result

In [19]:
# Use batched=True to activate fast multithreading!
tokenized_datasets: DatasetDict = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

### Input_ids Vs. Word_ids

[![image.png](https://i.postimg.cc/MptyPP3S/image.png)](https://postimg.cc/MnMMXD8P)
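To make the distinction concrete, here is a small sketch that assumes a made-up tokenization: `input_ids` has one entry per token, while `word_ids` says which word each token came from, with `None` for special tokens. The token strings and indices below are illustrative only and were not produced by the DistilBERT tokenizer.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical tokenization of "bromwell high is great" (illustrative values only)
tokens: list[str] = ["[CLS]", "bro", "##mwell", "high", "is", "great", "[SEP]"]
word_ids: list[Optional[int]] = [None, 0, 0, 1, 2, 3, None]

# Group token positions by the word they belong to; special tokens map to None
word_to_positions: dict[int, list[int]] = defaultdict(list)
for position, word_id in enumerate(word_ids):
    if word_id is not None:
        word_to_positions[word_id].append(position)

for word_id, positions in word_to_positions.items():
    pieces = [tokens[p] for p in positions]
    print(f"word {word_id}: positions {positions} -> {pieces}")
```

Word 0 covers two token positions, which is the information whole word masking needs in order to mask both subword pieces together.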

### Comment

- Now that we’ve tokenized our movie reviews, the next step is to group them all together and split the result into chunks.
- But how big should these chunks be? This will ultimately be determined by the amount of GPU memory that you have available, but a good starting point is to see what the model’s maximum context size is.
- This can be inferred by inspecting the model_max_length attribute of the tokenizer:

In [20]:
# Model's context size
tokenizer.model_max_length

512

In [21]:
# To run our experiments on GPUs like those found on Google Colab, choose a smaller size that can fit in memory:
chunk_size: int = 128

# Slicing produces a list of lists for each feature
tokenized_samples: dict[str, list[Any]] = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

In [22]:
tokenized_samples.keys()

dict_keys(['input_ids', 'attention_mask', 'word_ids'])

In [23]:
# sum(list_of_lists, []) concatenates the inner lists into a single flat list
sum([[10, 4, 5]], [])

[10, 4, 5]

In [24]:
# Concatenate all these examples with a simple dictionary comprehension:
concatenated_examples: dict[str, Any] = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length: int = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

In [25]:
concatenated_examples.keys()

dict_keys(['input_ids', 'attention_mask', 'word_ids'])

In [26]:
# Chunk the data
chunks: dict[str, Any] = {
    key: [tokens[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for key, tokens in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

### Comment

- As you can see the last chunk will generally be smaller than the maximum chunk size.
- There are two main strategies for dealing with this:
  - Drop the last chunk if it’s smaller than chunk_size.
  - Pad the last chunk until its length equals chunk_size.
- We’ll take the first approach here, so let’s wrap all of the above logic in a single function that we can apply to our tokenized datasets:
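For comparison, the second strategy (padding) could look roughly like the sketch below. It is not used in this notebook. The helper name `chunk_and_pad` is hypothetical, and the pad id of `0` is an assumption that matches BERT-style vocabularies; in practice it would come from `tokenizer.pad_token_id`.

```python
from typing import Any

PAD_TOKEN_ID: int = 0  # assumption: 0 is the [PAD] id, as in BERT-style vocabularies
IGNORE_INDEX: int = -100  # label value skipped by PyTorch's cross-entropy loss


def chunk_and_pad(input_ids: list[int], chunk_size: int) -> dict[str, Any]:
    """Split input_ids into chunks and pad the last one up to chunk_size."""
    result: dict[str, Any] = {"input_ids": [], "attention_mask": [], "labels": []}
    for start in range(0, len(input_ids), chunk_size):
        chunk = input_ids[start : start + chunk_size]
        n_pad = chunk_size - len(chunk)
        result["input_ids"].append(chunk + [PAD_TOKEN_ID] * n_pad)
        # Padding positions are hidden from attention and excluded from the loss
        result["attention_mask"].append([1] * len(chunk) + [0] * n_pad)
        result["labels"].append(chunk + [IGNORE_INDEX] * n_pad)
    return result


padded = chunk_and_pad(list(range(101, 111)), chunk_size=4)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Padding keeps every token at the cost of some wasted computation on the final chunk, whereas dropping discards fewer than `chunk_size` tokens per mapped batch.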

In [28]:
def group_texts(examples: dict[str, Any]) -> dict[str, Any]:
    """This is used to concatenate the input_ids and chunk the data."""

    # Concatenate all texts
    concatenated_examples = {key: sum(examples[key], []) for key in examples.keys()}
    # Compute length of concatenated texts
    # Select the 1st item in the list and calculate the length
    total_length: int = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length: int = (total_length // chunk_size) * chunk_size
    # Chunk the data
    chunks: dict[str, Any] = {
        key: [tokens[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for key, tokens in concatenated_examples.items()
    }

    # Create a new labels column
    chunks["labels"] = chunks["input_ids"].copy()

    return chunks

### Comment

- Note that in the last step of group_texts() we create a new `labels` column which is a copy of the input_ids one.
- As we’ll see shortly, that’s because in masked language modeling the objective is to predict randomly masked tokens in the input batch, and by creating a labels column we provide the ground truth for our language model to learn from.
- Let’s now apply group_texts() to our tokenized datasets using `Dataset.map()` function:

In [29]:
lm_datasets: DatasetDict = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

### Comment

- Grouping and then chunking the texts has produced many more examples than the original 25,000 for the train and test splits.
- That’s because we now have examples involving contiguous tokens that span across multiple examples from the original corpus.
- You can see this explicitly by looking for the special `[SEP]` and `[CLS]` tokens in one of the chunks:

In [30]:
# Decode the tokenized texts
print(tokenizer.decode(lm_datasets["train"][2]["input_ids"]))

In [31]:
# Decode the tokenized labels
print(tokenizer.decode(lm_datasets["train"][2]["labels"]))

### Comment

- As expected from our `group_texts()` function above, this looks identical to the decoded input_ids — but then how can our model possibly learn anything?
- We’re missing a key step: inserting [MASK] tokens at random positions in the inputs!
- Let’s see how we can do this on the fly during fine-tuning using a special data collator.

<br><hr>

## Fine-tuning DistilBERT with the Trainer API

- Fine-tuning a masked language model is almost identical to fine-tuning a sequence classification model.
- The only difference is that we need a special data collator that can randomly mask some of the tokens in each batch of texts.
- Fortunately, 🤗 Transformers comes prepared with a dedicated DataCollatorForLanguageModeling for just this task.
- We just have to pass it the tokenizer and an `mlm_probability` argument that specifies what fraction of the tokens to mask.
- We’ll use `15%`, which is the amount used for `BERT` and a common choice in the literature:

In [32]:
from transformers import DataCollatorForLanguageModeling


data_collator: DataCollatorForLanguageModeling = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15
)

In [33]:
# To see how the random masking works, let’s feed a few examples to the data collator.
# Since it expects a list of dicts, where each dict represents a single chunk of contiguous text,
# we first iterate over the dataset before feeding the batch to the collator. We remove the "word_ids" key
# for this data collator as it does not expect it:
samples: list[dict[str, Any]] = [lm_datasets["train"][i] for i in range(2)]

for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


### Comment

- Nice, it worked! We can see that the [MASK] token has been randomly inserted at various locations in our text.
- These will be the tokens which our model will have to predict during training — and the beauty of the data collator is that it will randomize the [MASK] insertion with every batch!
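By default, `DataCollatorForLanguageModeling` follows the BERT recipe: of the positions selected for prediction, about 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged, while unselected positions get the label `-100`. The pure-Python sketch below mimics that rule for a single sequence. It is a simplification (for instance, it does not skip special tokens as the real collator does), and the constants and the `mask_tokens` helper are assumptions made for illustration.

```python
import random
from typing import Optional

MASK_TOKEN_ID: int = 103  # assumption: [MASK] id in the BERT uncased vocabulary
VOCAB_SIZE: int = 30_522  # assumption: size of the BERT uncased vocabulary
IGNORE_INDEX: int = -100  # label value that the loss function skips


def mask_tokens(
    input_ids: list[int],
    mlm_probability: float = 0.15,
    rng: Optional[random.Random] = None,
) -> tuple[list[int], list[int]]:
    """Return (masked_input_ids, labels) for one sequence."""
    rng = rng or random.Random()
    masked: list[int] = list(input_ids)
    labels: list[int] = [IGNORE_INDEX] * len(input_ids)

    for position, token_id in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected: input unchanged, label ignored
        labels[position] = token_id  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[position] = MASK_TOKEN_ID  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[position] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token in the input

    return masked, labels


example_ids: list[int] = [2023, 2003, 1037, 2307, 3185, 1012]  # hypothetical ids
masked_ids, labels = mask_tokens(example_ids, rng=random.Random(42))
print(masked_ids)
print(labels)
```

Passing a seeded `random.Random` makes the masking reproducible, which is the same idea used later to stabilise evaluation.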

In [34]:
# Replace the tokenizer.decode() method with tokenizer.convert_ids_to_tokens() to see that
# sometimes a single token from a given word is masked, and not the others.

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.convert_ids_to_tokens(chunk)}'")

### Comment

- One side effect of random masking is that our evaluation metrics will not be deterministic when using the Trainer, since we use the same data collator for the training and test sets.
- We’ll see later, when we look at fine-tuning with 🤗 Accelerate, how we can use the flexibility of a custom evaluation loop to freeze the randomness.
- When training models for masked language modeling, one technique that can be used is to `mask whole words together`, not just individual tokens.
- This approach is called `whole word masking`. If we want to use whole word masking, we will need to build a data collator ourselves.
- A data collator is just a function that takes a list of samples and converts them into a batch, so let’s do this now!
- We’ll use the word IDs computed earlier to make a map between word indices and the corresponding tokens, then randomly decide which words to mask and apply that mask on the inputs.
- Note that the labels are all `-100` except for the ones corresponding to masked words, because the loss ignores positions labelled `-100`.
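The role of `-100` can be sketched without any deep learning library. PyTorch's cross-entropy loss uses `-100` as its default `ignore_index`, so only masked positions contribute to the average. The helper below is a hypothetical pure-Python stand-in for that behaviour, not the function the model actually calls; note that it returns `0.0` when every label is ignored, which is a choice made for this sketch.

```python
import math

IGNORE_INDEX: int = -100


def masked_cross_entropy(logits: list[list[float]], labels: list[int]) -> float:
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total: float = 0.0
    count: int = 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # unmasked position: no contribution to the loss
        # log-sum-exp with the max subtracted for numerical stability
        max_logit = max(row)
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        total += log_norm - row[label]  # negative log-probability of the target
        count += 1
    # Sketch-only convention: return 0.0 if nothing was masked
    return total / count if count else 0.0


# Three positions over a 3-token toy vocabulary; only the middle one is masked
toy_logits: list[list[float]] = [[9.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 9.0, 0.0]]
toy_labels: list[int] = [IGNORE_INDEX, 1, IGNORE_INDEX]
print(f">>> Loss: {masked_cross_entropy(toy_logits, toy_labels):.4f}")
```

The first and third rows have no effect on the result, however confident or wrong their logits are.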

In [35]:
import collections
from transformers import default_data_collator


wwm_probability: float = 0.2


def whole_word_masking_data_collator(features: list[Any]) -> dict[str, Any]:
    """This is used for whole word masking."""
    for feature in features:
        word_ids: list[int] = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index: int = -1
        current_word: Union[int, None] = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word: int = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask: np.ndarray = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids: list[Union[int, None]] = feature["input_ids"]
        labels: list[int] = feature["labels"]
        new_labels: list[int] = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

In [36]:
# Next, we can try it on the same samples as before:
samples: list[Any] = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

### Comment

- Now that we have two data collators, the rest of the fine-tuning steps are standard.
- Training can take a while on Google Colab if you’re not lucky enough to score a mythical P100 GPU 😭, so we’ll first downsample the size of the training set to a few thousand examples.
- Don’t worry, we’ll still get a pretty decent language model!
- A quick way to downsample a dataset in 🤗 Datasets is via the `Dataset.train_test_split()` function that we saw in Chapter 5:

In [37]:
train_size: int = 10_000
test_size: int = int(0.1 * train_size)

downsampled_dataset: DatasetDict = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=RANDOM_STATE
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [38]:
from huggingface_hub import notebook_login


notebook_login()

In [39]:
from transformers import TrainingArguments


batch_size: int = 64
# Show the training loss with every epoch
logging_steps: int = len(downsampled_dataset["train"]) // batch_size
model_name: str = model_checkpoint.split("/")[-1]
learning_rate: float = 2e-5
weight_decay: float = 0.01

training_args: TrainingArguments = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,  # Requires a CUDA GPU
    logging_steps=logging_steps,
)

In [None]:
from transformers import Trainer


trainer: Trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

## Evaluation Metric

### Perplexity

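- Perplexity is the exponential of the average cross-entropy loss, so a lower value means a better language model.
- We can calculate the `perplexity` of our pretrained model by using the `Trainer.evaluate()` function to compute the cross-entropy loss on the test set and then taking the exponential of the result.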
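As a small worked example of that relationship, the sketch below computes perplexity from a list of per-token losses. The `perplexity` helper and the toy loss values are hypothetical; the overflow guard follows the same pattern as the custom evaluation loop later in this notebook.

```python
import math


def perplexity(token_losses: list[float]) -> float:
    """Exponential of the mean cross-entropy loss (in nats)."""
    mean_loss = sum(token_losses) / len(token_losses)
    try:
        return math.exp(mean_loss)
    except OverflowError:
        return float("inf")


# A model that assigns probability 0.25 to every correct token has a loss of
# -log(0.25) per token, which corresponds to a perplexity of about 4:
# it is as uncertain as a uniform choice among four tokens.
losses: list[float] = [-math.log(0.25)] * 3
print(f">>> Perplexity: {perplexity(losses):.2f}")
```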

### Training Code Block

```python
import math

# Perplexity before training
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

trainer.train()

# Perplexity after training
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```

## Fine-tuning DistilBERT with 🤗 Accelerate

- Fine-tuning a masked language model is very similar to the text classification example from Chapter 3. In fact, the only subtlety is the use of a special data collator, and we’ve already covered that earlier in this section!
- However, we saw that `DataCollatorForLanguageModeling` also applies random masking with each evaluation, so we’ll see some fluctuations in our perplexity scores with each training run.
- One way to eliminate this source of randomness is to apply the masking once on the whole test set, and then use the default data collator in 🤗 Transformers to collect the batches during evaluation.
- To see how this works, let’s implement a simple function that applies the masking on a batch, similar to our first encounter with DataCollatorForLanguageModeling:

In [None]:
def insert_random_mask(batch: dict[str, Any]) -> dict[str, Any]:
    """Apply random masking once to a batch of test examples.

    The masking uses the MLM `data_collator` defined earlier. Because the masks
    are stored in the dataset, evaluation can later use the default data
    collator and see the same masks on every run."""

    # Convert the dict of columns into a list of per-example dicts
    features: list[dict[str, Any]] = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}

In [None]:
downsampled_dataset: DatasetDict = downsampled_dataset.remove_columns(["word_ids"])

# Apply random masking
eval_dataset: Dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

In [None]:
# Setup the dataloaders
from torch.utils.data import DataLoader
from transformers import default_data_collator


batch_size: int = 64
train_dataloader: DataLoader = DataLoader(
    dataset=downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader: DataLoader = DataLoader(
    dataset=eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [None]:
from accelerate import Accelerator
from torch.optim import AdamW


# Load a fresh version of the pretrained model
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Specify the learning rate
learning_rate: float = 5e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Prepare everything for training with the Accelerator object
accelerator: Accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
from transformers import get_scheduler

# Specify the learning rate scheduler
num_train_epochs: int = 3
num_update_steps_per_epoch: int = len(train_dataloader)
num_training_steps: int = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
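The `"linear"` schedule scales the base learning rate by a factor that rises linearly during warmup (none here) and then decays linearly to zero at the last step. The sketch below reproduces that rule in plain Python to show the values it yields. The function name `linear_schedule_factor` is hypothetical, and the step counts assume the settings used above (10,000 training chunks, batch size 64, 3 epochs) on a single process.

```python
import math


def linear_schedule_factor(
    step: int, num_warmup_steps: int, num_training_steps: int
) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr: float = 5e-5
steps_per_epoch: int = math.ceil(10_000 / 64)  # assumed single-process dataloader
total_steps: int = 3 * steps_per_epoch

for step in (0, total_steps // 2, total_steps):
    factor = linear_schedule_factor(step, num_warmup_steps=0, num_training_steps=total_steps)
    print(f">>> step {step}: lr = {base_lr * factor:.2e}")
```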

In [None]:
from huggingface_hub import get_full_repo_name


# Create a model repository on the Hugging Face Hub. First, use the 🤗 Hub library to
# generate the full name of our repo
model_name: str = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name: str = get_full_repo_name(model_name)
repo_name

'chineidu/distilbert-base-uncased-finetuned-imdb-accelerate'

In [None]:
from huggingface_hub import Repository


# Create and clone the repository using the Repository class from 🤗 Hub:
output_dir: str = model_name
repo: Repository = Repository(output_dir, clone_from=repo_name)

### Full Training and Evaluation Loop


In [None]:
from tqdm.auto import tqdm
import torch
import math

progress_bar: tqdm = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):

    # Training
    model.train()
    for batch in train_dataloader:
        # Forward prop
        outputs: dict[str, Any] = model(**batch)
        loss = outputs.loss

        # Backprop
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses: list[torch.Tensor] = []

    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            # Forward prop
            outputs: dict[str, Any] = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses: torch.Tensor = torch.cat(losses)
    losses = losses[: len(eval_dataset)]

    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
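The `loss.repeat(batch_size)` and `losses[: len(eval_dataset)]` steps in the evaluation loop can look odd at first. A rough way to read them, assuming a single process: each batch's mean loss is repeated once per example, and the truncation removes the extra copies contributed by a smaller final batch, so the overall mean is weighted by batch size. The sketch below replays that arithmetic with made-up numbers; `weighted_eval_loss` is a hypothetical helper, not part of 🤗 Accelerate.

```python
def weighted_eval_loss(
    batch_losses: list[float], batch_size: int, dataset_size: int
) -> float:
    """Mean loss after repeating each batch loss and truncating to dataset_size."""
    repeated: list[float] = []
    for loss in batch_losses:
        repeated.extend([loss] * batch_size)  # mirrors loss.repeat(batch_size)
    repeated = repeated[:dataset_size]  # mirrors losses[: len(eval_dataset)]
    return sum(repeated) / len(repeated)


# Toy setup: 10 examples with batch_size=4 gives batches of 4, 4 and 2 examples
toy_batch_losses: list[float] = [1.0, 2.0, 4.0]
weighted = weighted_eval_loss(toy_batch_losses, batch_size=4, dataset_size=10)
naive = sum(toy_batch_losses) / len(toy_batch_losses)

print(f">>> Weighted mean loss: {weighted:.3f}")
print(f">>> Naive mean of batch losses: {naive:.3f}")
```

The naive mean over-weights the short final batch, while the repeat-and-truncate version counts it only for the two examples it contains. Each batch loss is itself an average over masked tokens rather than over examples, so this weighting is still an approximation.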