# [Masked Language Modelling (MLM)](https://huggingface.co/learn/nlp-course/chapter7/3?fw=pt)

- For NLP tasks with Transformer models, you can use pretrained models from Hugging Face and fine-tune them on your data. 
- Transfer learning works well if the pretraining and fine-tuning corpora are similar. 
- However, for specialized corpora such as legal or scientific text, the model may treat domain-specific words as rare tokens, and performance can suffer. 
- Fine-tuning the language model on in-domain data can improve downstream task performance. 
- This process is called `domain adaptation`, popularized by ULMFiT in 2018. 
- We'll perform a similar process with `Transformers` instead of LSTMs.

<br>

## Benefits of MLM 

- **Improved Generalization**: By exposing the model to various masking patterns, MLM enhances its ability to generalize to unseen data and perform well on downstream tasks.

- **Effective Pre-training for Diverse Tasks**: MLM has been shown to be effective for pre-training language models that are later adapted to a wide range of NLP tasks, including text generation, machine translation, and question answering.

In [1]:
# Built-in library
import re
import json
from typing import Any, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
import pandas as pd
from rich import print
import torch

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
%load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

In [2]:
from transformers import AutoModelForMaskedLM


model_checkpoint: str = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [3]:
distilbert_num_parameters: float = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print("'>>> BERT number of parameters: 110M'")

In [4]:
# Let’s see what kinds of tokens this model predicts:
text: str = "This is a great [MASK]."

### Comment

- As humans, we can imagine many possibilities for the [MASK] token, such as “day”, “ride”, or “painting”. 
- For pretrained models, the predictions depend on the corpus the model was trained on, since it learns to pick up the statistical patterns present in the data. 
- Like BERT, DistilBERT was pretrained on the [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [BookCorpus datasets](https://huggingface.co/datasets/bookcorpus), so we expect the predictions for [MASK] to reflect these domains. 
- To predict the mask we need DistilBERT’s tokenizer to produce the inputs for the model, so let’s download that from the Hub as well:

In [5]:
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
inputs: dict[str, Any] = tokenizer(text, return_tensors="pt")
token_logits: torch.Tensor = model(**inputs).logits
token_logits

tensor([[[ -5.5882,  -5.5868,  -5.5958,  ...,  -4.9448,  -4.8174,  -2.9905],
         [-11.9031, -11.8872, -12.0623,  ..., -10.9570, -10.6464,  -8.6324],
         [-11.9604, -12.1520, -12.1279,  ..., -10.0218,  -8.6074,  -8.0971],
         ...,
         [ -4.8228,  -4.6268,  -5.1041,  ...,  -4.2771,  -5.0184,  -3.9428],
         [-11.2945, -11.2388, -11.3857,  ...,  -9.2063,  -9.3411,  -6.1505],
         [ -9.5213,  -9.4632,  -9.5022,  ...,  -8.6561,  -8.4908,  -4.6903]]],
       grad_fn=<ViewBackward0>)

In [21]:
print(f'input_ids: {inputs["input_ids"]}')
print(f"mask_token_id: {tokenizer.mask_token_id}")

print(inputs["input_ids"].flatten() == tokenizer.mask_token_id)

In [38]:
raw_tokens: list[str] = tokenizer.tokenize(text)
token_ids: list[int] = tokenizer(text).get("input_ids")

print(raw_tokens, token_ids)

In [40]:
print(token_logits.shape)


In [42]:
token_logits[0, 5, :]

tensor([-4.8228, -4.6268, -5.1041,  ..., -4.2771, -5.0184, -3.9428],
       grad_fn=<SliceBackward0>)

In [43]:
# Find the location of [MASK] and extract its logits
mask_token_index: torch.Tensor = torch.where(
    inputs["input_ids"].flatten() == tokenizer.mask_token_id
)[0]
mask_token_logits: torch.Tensor = token_logits[0, mask_token_index, :]

print(f"mask_token_index: {mask_token_index}")

mask_token_logits

tensor([[-4.8228, -4.6268, -5.1041,  ..., -4.2771, -5.0184, -3.9428]],
       grad_fn=<IndexBackward0>)

In [47]:
# Pick the [MASK] candidates with the highest logits
k: int = 5
top_5_tokens = torch.topk(mask_token_logits, k, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
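
The logits ranked above are unnormalised scores. If probabilities are easier to read, a softmax over the vocabulary dimension converts them. The sketch below uses plain Python on a made-up five-word vocabulary with made-up logits (both are illustrative assumptions, not DistilBERT outputs) to show the same top-k selection with probabilities attached.

```python
import math

# Hypothetical toy vocabulary and logits, for illustration only
toy_vocab: list[str] = ["deal", "day", "idea", "movie", "banana"]
toy_logits: list[float] = [2.0, 1.5, 1.0, 0.5, -3.0]


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores to probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: list[float], k: int) -> list[tuple[int, float]]:
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


toy_probs = softmax(toy_logits)
for idx, prob in top_k(toy_probs, k=3):
    print(f"{toy_vocab[idx]}: {prob:.3f}")
```

With the real model, `torch.softmax(mask_token_logits, dim=-1)` should play the role of the hand-written `softmax` here.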

### Load the Dataset

- To showcase domain adaptation, we’ll use the famous [Large Movie Review Dataset (or IMDb for short)](https://huggingface.co/datasets/imdb), which is a corpus of movie reviews that is often used to benchmark sentiment analysis models. 
- By fine-tuning DistilBERT on this corpus, we expect the language model will adapt its vocabulary from the factual data of Wikipedia that it was pretrained on to the more subjective elements of movie reviews. 
- We can get the data from the Hugging Face Hub with the `load_dataset()` function from 🤗 Datasets:

In [48]:
from datasets import load_dataset, DatasetDict


data_path: str = "imdb"
# load_dataset() without a `split` argument returns a DatasetDict (one Dataset per split)
imdb_dataset: DatasetDict = load_dataset(data_path)

imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [52]:
# Preview a small sample
RANDOM_STATE: int = 42
sample = imdb_dataset.get("train").shuffle(seed=RANDOM_STATE).select(range(3))

for row in sample:
    print(f">>> Review: {row.get('text')}")
    print(f">>> Label: {row.get('label')}")

In [62]:
# Check the number of unique labels
imdb_dataset.get("train").unique("label")

[0, 1]

In [63]:
# Preview a small sample of the unsupervised/unlabelled data
sample = imdb_dataset.get("unsupervised").shuffle(seed=RANDOM_STATE).select(range(3))

for row in sample:
    print(f">>> Review: {row.get('text')}")
    print(f">>> Label: {row.get('label')}")

In [64]:
# Check the number of unique labels
imdb_dataset.get("unsupervised").unique("label")

[-1]

### Comment

- For both `auto-regressive` and `masked language modeling`, a common preprocessing step is to `concatenate` all the examples and then split the whole corpus into chunks of equal size. 
- This is quite different from our usual approach, where we simply tokenize individual examples. Why concatenate everything together? 
- The reason is that individual examples might get truncated if they’re too long, and that would result in losing information that might be useful for the language modeling task!
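
The difference between the two preprocessing strategies can be made concrete with a toy example. The sketch below uses hand-made token-id lists and a tiny maximum length of 4 (all values are illustrative assumptions) to count how many tokens survive per-example truncation versus concatenation followed by chunking.

```python
# Hypothetical token-id lists standing in for three tokenized reviews
toy_reviews: list[list[int]] = [
    list(range(0, 5)),  # 5 tokens
    list(range(5, 17)),  # 12 tokens
    list(range(17, 21)),  # 4 tokens
]
toy_max_length: int = 4  # stand-in for the model's context size

# Strategy 1: truncate each example independently
truncated = [review[:toy_max_length] for review in toy_reviews]
kept_by_truncation = sum(len(review) for review in truncated)

# Strategy 2: concatenate everything, then cut fixed-size chunks
flat = [token for review in toy_reviews for token in review]
usable = (len(flat) // toy_max_length) * toy_max_length  # drop the ragged tail
toy_chunks = [flat[i : i + toy_max_length] for i in range(0, usable, toy_max_length)]
kept_by_chunking = sum(len(chunk) for chunk in toy_chunks)

print(f"total tokens: {len(flat)}")
print(f"kept by truncation: {kept_by_truncation}")
print(f"kept by chunking: {kept_by_chunking}")
```

In this toy setting, truncation discards most of the long review, while chunking loses only the final partial chunk.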

- To get started, we’ll first tokenize our corpus as usual, but without setting the `truncation=True` option in our tokenizer. 
- We’ll also grab the word IDs if they are available (which they will be if we’re using a fast tokenizer, as described in Chapter 6), as we will need them later on to do whole word masking. 
- We’ll wrap this in a simple function, and while we’re at it we’ll remove the `text` and `label` columns since we don’t need them any longer:

In [65]:
def tokenize_function(examples: dict[str, Any]) -> dict[str, Any]:
    result: dict[str, Any] = tokenizer(examples.get("text"))

    if tokenizer.is_fast:
        result["word_ids"] = [
            result.word_ids(idx) for idx in range(len(result["input_ids"]))
        ]
    return result

In [66]:
# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

### `input_ids` vs. `word_ids`

[![image.png](https://i.postimg.cc/MptyPP3S/image.png)](https://postimg.cc/MnMMXD8P)
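
To make the distinction concrete: `input_ids` holds one vocabulary index per token, whereas `word_ids` records which word each token came from, with `None` for special tokens added by the tokenizer. The sketch below mimics that mapping on hand-written WordPiece-style tokens. The token split and the helper `toy_word_ids` are illustrative assumptions, not the tokenizer's real output or implementation.

```python
from typing import Optional

# Hand-written WordPiece-style tokens (an illustrative assumption, not real tokenizer output)
toy_tokens: list[str] = ["[CLS]", "un", "##believ", "##able", "movie", "[SEP]"]
special_tokens: set[str] = {"[CLS]", "[SEP]"}


def toy_word_ids(tokens: list[str]) -> list[Optional[int]]:
    """Assign each token the index of the word it belongs to (None for special tokens)."""
    word_ids: list[Optional[int]] = []
    current_word = -1
    for token in tokens:
        if token in special_tokens:
            word_ids.append(None)
        elif token.startswith("##"):
            # Continuation piece: same word as the previous token
            word_ids.append(current_word)
        else:
            # A token without the "##" prefix starts a new word
            current_word += 1
            word_ids.append(current_word)
    return word_ids


print(toy_word_ids(toy_tokens))
```

Tokens that share a word index are the ones that whole word masking would later mask together.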

### Comment

- Now that we’ve tokenized our movie reviews, the next step is to group them all together and split the result into chunks. 
- But how big should these chunks be? This will ultimately be determined by the amount of GPU memory that you have available, but a good starting point is to see what the model’s maximum context size is. 
- This can be inferred by inspecting the `model_max_length` attribute of the tokenizer:

In [108]:
# Model's context size
tokenizer.model_max_length

512

In [109]:
# To run our experiments on GPUs like those found on Google Colab, choose a smaller size that can fit in memory:
chunk_size: int = 128

# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

In [113]:
tokenized_samples.keys()

dict_keys(['input_ids', 'attention_mask', 'word_ids'])

In [117]:
# Concatenate lists
sum([[10, 4], [5]], [])

[10, 4, 5]

In [118]:
# Concatenate all these examples with a simple dictionary comprehension:
concatenated_examples: dict[str, Any] = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length: int = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

In [133]:
concatenated_examples.keys()

dict_keys(['input_ids', 'attention_mask', 'word_ids'])

In [120]:
# Chunk the data
chunks: dict[str, Any] = {
    key: [tokens[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for key, tokens in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

### Comment

- As you can see, the last chunk will generally be smaller than the maximum chunk size. 
- There are two main strategies for dealing with this:
  - Drop the last chunk if it’s smaller than chunk_size.
  - Pad the last chunk until its length equals chunk_size.
- We’ll take the first approach here, so let’s wrap all of the above logic in a single function that we can apply to our tokenized datasets:

In [134]:
def group_texts(examples: dict[str, Any]) -> dict[str, Any]:
    # Concatenate all texts
    concatenated_examples = {key: sum(examples[key], []) for key in examples.keys()}
    # Compute length of concatenated texts
    # Select the 1st item in the list and calculate the length
    total_length: int = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Chunk the data
    chunks: dict[str, Any] = {
        key: [tokens[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for key, tokens in concatenated_examples.items()
    }

    # Create a new labels column
    chunks["labels"] = chunks["input_ids"].copy()

    return chunks
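
For comparison, the second strategy (padding the last chunk instead of dropping it) could be sketched as follows. The helper `pad_last_chunk` and the default `pad_token_id=0` are illustrative assumptions; in practice the value would come from `tokenizer.pad_token_id`.

```python
def pad_last_chunk(
    token_ids: list[int], chunk_size: int, pad_token_id: int = 0
) -> tuple[list[list[int]], list[list[int]]]:
    """Split token_ids into chunks and pad the final one up to chunk_size.

    Returns the chunks together with matching attention masks
    (1 for real tokens, 0 for padding).
    """
    chunks: list[list[int]] = []
    masks: list[list[int]] = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start : start + chunk_size]
        n_pad = chunk_size - len(chunk)
        chunks.append(chunk + [pad_token_id] * n_pad)
        masks.append([1] * len(chunk) + [0] * n_pad)
    return chunks, masks


# Hypothetical ids: 10 tokens split into chunks of 4
toy_chunks, toy_masks = pad_last_chunk(list(range(101, 111)), chunk_size=4)
print(toy_chunks[-1])
print(toy_masks[-1])
```

Padding keeps every token at the cost of some wasted positions. If this route were taken, the labels at padded positions would also need to be excluded from the loss.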

### Comment

- Note that in the last step of group_texts() we create a new `labels` column which is a copy of the input_ids one. 
- As we’ll see shortly, that’s because in masked language modeling the objective is to predict randomly masked tokens in the input batch, and by creating a labels column we provide the ground truth for our language model to learn from.
- Let’s now apply `group_texts()` to our tokenized datasets using the `Dataset.map()` function:

In [138]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})
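
The row counts above can be sanity-checked with a little arithmetic: each row is one chunk of `chunk_size` tokens, so multiplying gives the number of tokens kept, and dividing by the original number of reviews gives a rough average review length. This is a back-of-the-envelope sketch. Because `Dataset.map()` processes the data in batches and `group_texts()` drops the tail of each batch, the figures slightly undercount the true totals.

```python
chunk_size = 128  # same value as used earlier in this notebook

# Row counts copied from the DatasetDict outputs in this notebook
original_rows = {"train": 25_000, "test": 25_000, "unsupervised": 50_000}
chunk_rows = {"train": 61_291, "test": 59_904, "unsupervised": 122_957}

for split, n_chunks in chunk_rows.items():
    tokens_kept = n_chunks * chunk_size
    avg_tokens = tokens_kept / original_rows[split]
    print(f"{split}: {tokens_kept:,} tokens kept, about {avg_tokens:.0f} per review")
```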

### Comment

- Grouping and then chunking the texts has produced many more examples than the original 25,000 for the train and test splits. 
- That’s because we now have examples involving contiguous tokens that span across multiple examples from the original corpus. 
- You can see this explicitly by looking for the special `[SEP]` and `[CLS]` tokens in one of the chunks:

In [139]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [None]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

### Comment

- As expected from our `group_texts()` function above, this looks identical to the decoded `input_ids`. But then how can our model possibly learn anything? 
- We’re missing a key step: inserting [MASK] tokens at random positions in the inputs! 
- Let’s see how we can do this on the fly during fine-tuning using a special data collator.
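
Before turning to the library's collator, the core idea of random masking can be sketched in plain Python. The function `toy_mask_tokens` below is a simplified illustration and an assumption of this write-up, not the library's implementation: it replaces each token with the mask id with probability `mlm_probability` and keeps the original id as the label only at masked positions. The real collator does more, for example it avoids masking special tokens and does not always substitute `[MASK]` for a selected token.

```python
import random


def toy_mask_tokens(
    input_ids: list[int],
    mask_token_id: int,
    mlm_probability: float = 0.15,
    seed: int = 0,
) -> tuple[list[int], list[int]]:
    """Randomly replace tokens with mask_token_id.

    Returns (masked_inputs, labels). Labels hold the original id at masked
    positions and -100 elsewhere, the index that PyTorch's cross-entropy
    loss ignores by default.
    """
    rng = random.Random(seed)
    masked_inputs: list[int] = []
    labels: list[int] = []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            masked_inputs.append(mask_token_id)
            labels.append(token_id)
        else:
            masked_inputs.append(token_id)
            labels.append(-100)
    return masked_inputs, labels


# Hypothetical ids; 103 is used as the mask id purely for illustration
toy_ids = list(range(1000, 1020))
masked, labels = toy_mask_tokens(toy_ids, mask_token_id=103)
print(masked)
print(labels)
```

Because only masked positions carry a real label, the loss is computed solely on the tokens the model has to reconstruct.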

<br><br>

## Fine-tuning DistilBERT with the Trainer API

- Fine-tuning a masked language model is almost identical to fine-tuning a sequence classification model.
- The only difference is that we need a special data collator that can randomly mask some of the tokens in each batch of texts. 
- Fortunately, 🤗 Transformers comes prepared with a dedicated `DataCollatorForLanguageModeling` for just this task. 
- We just have to pass it the tokenizer and an `mlm_probability` argument that specifies what fraction of the tokens to mask. 
- We’ll use `15%`, which is the amount used for `BERT` and a common choice in the literature: