<a href="https://colab.research.google.com/github/dimitarpg13/transformer_examples/blob/main/notebooks/bert/Masked_Language_Modeling_with_DistilBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Masked Language Modeling

Masked language modeling predicts which words best fit the blanked-out words in a given sentence. These models take sentences with masked text as input and output the possible values for each mask. They can attend to tokens bidirectionally, which means the model has full access to the tokens on both the left and the right of the mask. Masked language modeling is typically used before fine-tuning a model for a specific task. For example, if you need to use a model in a specific domain, models like BERT will treat the domain-specific words as rare tokens. If you first train the masked language model on a corpus from that domain and then fine-tune it on a downstream task, you will generally end up with a better-performing model, that is, a model with higher accuracy for a given amount of training time and data. With regard to evaluation metrics, there is no single correct answer for a masked position, so we evaluate the distribution the model predicts over the masked values. Common metrics are cross-entropy loss and perplexity.
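
To make the two metrics concrete, here is a minimal sketch in plain Python. The probabilities are invented values standing in for what a model might assign to the correct token at each masked position; they are not results from this notebook.

```python
import math

# Hypothetical probabilities a model assigns to the *correct* token
# at four masked positions (illustrative values only).
true_token_probs = [0.60, 0.25, 0.90, 0.05]

# Cross-entropy loss: mean negative log-likelihood of the correct tokens.
cross_entropy = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

# Perplexity is the exponential of the cross-entropy loss.
perplexity = math.exp(cross_entropy)

print(f"cross-entropy: {cross_entropy:.3f}, perplexity: {perplexity:.3f}")
```

A perplexity of 1 would mean the model always assigns probability 1 to the correct token; larger values mean the model is less certain.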

We can use any plain text dataset and tokenize the text to mask the data.

Next we will fine-tune [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://facebookresearch.github.io/ELI5/) dataset.

We will start by loading the first 5,000 examples of the [ELI5-Category](https://huggingface.co/datasets/rexarski/eli5_category) dataset using the Datasets library. But first we take care of installing the necessary libraries:



In [1]:
%%capture --no-stderr

%pip install --quiet transformers datasets evaluate

%pip install --quiet datasets==2.16.0


In [2]:
from huggingface_hub import notebook_login

notebook_login()


In [3]:
from datasets import load_dataset

eli5 = load_dataset("eli5_category", split="train[:5000]")

In [4]:
eli5 = eli5.train_test_split(test_size=0.2)

In [5]:
eli5["train"][0]

{'q_id': '74aqnr',
 'title': 'Are there planets in interstellar/intergalactic space?',
 'selftext': 'So, here\'s where I\'m going with this. As I understand it, when a solar system is formed, dust turns into clumps, clumps turn into bigger clumps and so on until the sun and planets are formed. There\'s also a lot of collisions. Could a planet size object (or bigger) get knocked into interstellar space or intergalactic space? I may be using the wrong terminology here.. I mean the space between solar systems and the space between galaxies. If so, is it possible to detect one (or have we done that already)? Lastly, (assuming this possible) with no sun, could there be life on them (for example, I\'ve heard it\'s possible one of the moons around Jupiter could have liquid water because Jupiter\'s gravity is "flexing" it, or could the planet itself simply have a hot enough molten core for liquid water)? -Thanks!',
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id':

## Processing a dataset for masked language modeling

Example:

`[My] [name] [MASK] [Sylvain] [.]`
               
                  |
                  V
                 [is]

`[I] [MASK] [at] [Hug] [##ging] [Face] [.]`

          |
          V
        [work]

The task is to fill in each `[MASK]` position.

```python
from datasets import load_dataset

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
raw_datasets["train"]

# Output:
# Dataset({
#   features: ['text'],
#   num_rows: 36718
# })
```

Gather all of the text into one column of your dataset. Before masking, we need to make all of the texts the same length so that they can be batched together. The first way to do this is the one used in text classification tasks: pad the short texts and truncate the long ones.

Example:

`[CLS] [My] [name] [is] [Sylvain] [.] [SEP]`

`[CLS] [I] [MASK] [at] [Hug] [##ging] [SEP]`

`[CLS] [Short] [text] [SEP] [PAD] [PAD] [PAD]`

As with text classification, this can be done with padding and truncation in the tokenizer call:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")
raw_datasets = raw_datasets.remove_columns("label")

model_checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
context_length = 128

def tokenize_pad_and_truncate(texts):
  # `padding="max_length"` pads every example up to `max_length` tokens.
  return tokenizer(texts["text"], truncation=True, padding="max_length", max_length=context_length)

tokenized_datasets = raw_datasets.map(tokenize_pad_and_truncate, batched=True)

```
With these arguments, the tokenizer pads and truncates automatically.
However, with this approach we are going to lose a lot of text if the documents are very long compared to the context length we have picked.
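
The effect of padding and truncation can be sketched in plain Python. The helper `pad_or_truncate`, the pad id `0`, and the toy token ids are all hypothetical, and special tokens are ignored for simplicity.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:length]
    return clipped + [pad_id] * (length - len(clipped))

context_length = 8                # tiny value so the effect is easy to see
long_doc = list(range(1, 21))     # 20 hypothetical token ids
short_doc = [1, 2, 3]

print(pad_or_truncate(long_doc, context_length))   # [1, 2, 3, 4, 5, 6, 7, 8]
print(pad_or_truncate(short_doc, context_length))  # [1, 2, 3, 0, 0, 0, 0, 0]
```

In this toy case 12 of the 20 tokens of the long document are simply discarded, which is the loss the paragraph above warns about.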

![Figure: chunking on context length pieces](https://github.com/dimitarpg13/transformer_examples/blob/main/images/chunking_on_context_length_pieces.png?raw=1)

We can chunk the text into pieces of length equal to the context length instead of discarding everything after the first chunk. We may end up with a remainder, which we can either pad or ignore.

We can implement this in practice with the following code, which sets `return_overflowing_tokens` to `True` in the `tokenizer` call:

```python
def tokenize_and_chunk(texts):
  return tokenizer(
     texts["text"], truncation=True, max_length=context_length,
     return_overflowing_tokens=True
  )

tokenized_datasets = raw_datasets.map(
  tokenize_and_chunk, batched=True, remove_columns=["text"]
)

len(raw_datasets["train"]), len(tokenized_datasets["train"])

# Output: (36718, 47192)
```
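
As a rough illustration of why the number of rows grows, here is a plain-Python sketch of the chunking idea. The helper `chunk` and the toy ids are hypothetical; the real tokenizer also adds special tokens to each piece, which this sketch omits.

```python
def chunk(token_ids, context_length):
    """Split a token list into consecutive pieces of at most `context_length`."""
    return [
        token_ids[start : start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]

doc = list(range(10))       # 10 hypothetical token ids
pieces = chunk(doc, 4)
print(pieces)               # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

One input document becomes three rows here, and the last row is the shorter remainder.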

This way of chunking is ideal if all of your texts are very long. It does not work as well if the texts vary widely in length. In that case the best option is to concatenate all of your texts into one big string, with a special token (depicted in orange) marking where we pass from one document to the next.

![Figure: concatenating all texts in one big string](https://github.com/dimitarpg13/transformer_examples/blob/main/images/concatenate_in_one_big_string.png?raw=1)

This is how this can be done in code:

```python
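def tokenize_and_chunk(texts):
  # Concatenate all documents into one long list of token ids,
  # appending the end-of-sequence token after each document.
  # Assumes the tokenizer defines `eos_token_id` (not every tokenizer does).
  all_input_ids = []
  for input_ids in tokenizer(texts["text"])["input_ids"]:
    all_input_ids.extend(input_ids)
    all_input_ids.append(tokenizer.eos_token_id)

  # Split the long list into pieces of `context_length` tokens.
  chunks = []
  for idx in range(0, len(all_input_ids), context_length):
    chunks.append(all_input_ids[idx: idx + context_length])
  return {"input_ids": chunks}

tokenized_datasets = raw_datasets.map(tokenize_and_chunk, batched=True, remove_columns=["text"])

len(raw_datasets["train"]), len(tokenized_datasets["train"])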
```

The masking itself is done in a `DataCollator` instance:

```python
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

```
or, to return TensorFlow tensors:
```python
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
   tokenizer, mlm_probability=0.15, return_tensors="tf"
)
```
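
To give a feel for what the collator does, here is a simplified plain-Python sketch of BERT-style masking: each position is selected with probability `mlm_probability`; a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. Unselected positions get the label `-100` so the loss ignores them. The function name, `mask_id=4`, the vocabulary size, and the token ids are hypothetical, and the sketch does not protect special tokens the way the library does.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Sketch of BERT-style masking; returns (masked_inputs, labels)."""
    rng = random.Random(seed)
    inputs, labels = list(input_ids), []
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            labels.append(-100)          # position ignored by the loss
            continue
        labels.append(token)             # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id          # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

ids = list(range(1000, 1040))            # 40 hypothetical token ids
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=30000)
print(sum(label != -100 for label in labels), "of", len(ids), "positions selected")
```

Because the selection is random, calling the function with a different seed masks different positions, which mirrors how the collator produces a fresh masking each time a batch is built.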



Returning to the ELI5 dataset, the next step is to load a DistilRoBERTa tokenizer to process the `text` subfield:

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")

Notice from the example above that the `text` field is actually nested inside `answers`. This means we need to extract the `text` subfield from its nested structure with the `flatten` method:

In [7]:
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '74aqnr',
 'title': 'Are there planets in interstellar/intergalactic space?',
 'selftext': 'So, here\'s where I\'m going with this. As I understand it, when a solar system is formed, dust turns into clumps, clumps turn into bigger clumps and so on until the sun and planets are formed. There\'s also a lot of collisions. Could a planet size object (or bigger) get knocked into interstellar space or intergalactic space? I may be using the wrong terminology here.. I mean the space between solar systems and the space between galaxies. If so, is it possible to detect one (or have we done that already)? Lastly, (assuming this possible) with no sun, could there be life on them (for example, I\'ve heard it\'s possible one of the moons around Jupiter could have liquid water because Jupiter\'s gravity is "flexing" it, or could the planet itself simply have a hot enough molten core for liquid water)? -Thanks!',
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['d

Each subfield is now a separate column, as indicated by the `answers` prefix, and the `text` field is now a list. Instead of tokenizing each sentence separately, convert the list to a string so you can tokenize them jointly.

Here is a preprocessing function that joins the list of strings for each example and tokenizes the result:

In [8]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])
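
The flattening and joining steps can be illustrated on a toy record without any libraries. The `flatten` helper below is a hypothetical stand-in for what the Datasets method does to nested fields, and the record contents are invented.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

# Hypothetical record shaped like an ELI5 example (contents invented).
record = {
    "title": "Why is the sky blue?",
    "answers": {
        "a_id": ["a1", "a2"],
        "text": ["Rayleigh scattering.", "Blue light scatters more."],
    },
}

flat = flatten(record)
print(sorted(flat))                     # ['answers.a_id', 'answers.text', 'title']
print(" ".join(flat["answers.text"]))   # Rayleigh scattering. Blue light scatters more.
```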

To apply this preprocessing function over the entire dataset, we use the Hugging Face Datasets [map](https://huggingface.co/docs/datasets/v4.0.0/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once, and by increasing the number of processes with `num_proc`:

In [9]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (619 > 512). Running this sequence through the model will result in indexing errors

This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

We will use a second preprocessing function to

a) concatenate all of the sequences

b) split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough to fit into GPU RAM

In [10]:
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result
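
To see what this grouping does, here is the same logic applied to a tiny batch with a block size of 4. The function is renamed `group_into_blocks` and takes `block_size` as a parameter purely so the sketch is self-contained; the token ids are toy values.

```python
def group_into_blocks(examples, block_size):
    """Concatenate each column, drop the incomplete tail, split into blocks."""
    concatenated = {
        k: [tok for seq in seqs for tok in seq] for k, seqs in examples.items()
    }
    total = len(next(iter(concatenated.values())))
    if total >= block_size:
        total = (total // block_size) * block_size   # drop the remainder
    return {
        k: [t[i : i + block_size] for i in range(0, total, block_size)]
        for k, t in concatenated.items()
    }

batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}   # 10 toy ids
print(group_into_blocks(batch, block_size=4))
# {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}
```

Note that block boundaries ignore document boundaries, and the last two ids (9 and 10) are dropped because they do not fill a whole block.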

Apply the `group_texts` function over the entire dataset:

In [11]:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Now create a batch of examples using [DataCollatorForLanguageModeling](https://huggingface.co/docs/transformers/v4.53.3/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling). It is more efficient to _dynamically pad_ the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

With PyTorch, we use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time we iterate over the data.

In [12]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
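
Dynamic padding itself is a simple idea, sketched here in plain Python. The helper `pad_batch`, the pad id `0`, and the toy ids are hypothetical; the collator does the equivalent work on tensors.

```python
def pad_batch(batch, pad_id):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask

batch = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]     # toy token ids
input_ids, attention_mask = pad_batch(batch, pad_id=0)
print(input_ids)        # [[5, 6, 7, 0], [8, 9, 0, 0], [10, 11, 12, 13]]
print(attention_mask)   # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```

The batch is padded only to length 4, the longest sequence it contains, rather than to a global maximum length.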

Now we are going to fine-tune our model. For this purpose, load DistilRoBERTa with [AutoModelForMaskedLM](https://huggingface.co/docs/transformers/v4.53.3/en/model_doc/auto#transformers.AutoModelForMaskedLM):

In [13]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The remaining three steps for fine-tuning the model are:

1) Define the training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/v4.53.3/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save the model.

2) Pass the training arguments to the [Trainer](https://huggingface.co/docs/transformers/v4.53.3/en/main_classes/trainer#transformers.Trainer) along with the model, datasets and the data collator.

3) Call [train()](https://huggingface.co/docs/transformers/v4.53.3/en/main_classes/trainer#transformers.Trainer.train) to fine-tune the model.

In [14]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="dimitar_eli5_mlm_model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 2.2569 | 2.060303 |
| 2 | 2.1609 | 2.053981 |
| 3 | 2.1267 | 2.018985 |


TrainOutput(global_step=3984, training_loss=2.191538921800483, metrics={'train_runtime': 292.9975, 'train_samples_per_second': 108.718, 'train_steps_per_second': 13.597, 'total_flos': 1056133756795392.0, 'train_loss': 2.191538921800483, 'epoch': 3.0})
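
As a rough sanity check on the `TrainOutput` above, the step count can be related to the dataset size. This sketch assumes a single device and the `Trainer` default per-device batch size of 8, since no batch size was set in the training arguments; the number of training blocks is inferred from the rounded throughput figures, so it is approximate.

```python
import math

# Values reported in the TrainOutput above.
train_runtime = 292.9975            # seconds
samples_per_second = 108.718
num_epochs = 3

# Assumption: one device and Trainer's default per-device batch size of 8.
batch_size = 8

samples_per_epoch = round(train_runtime * samples_per_second / num_epochs)
steps_per_epoch = math.ceil(samples_per_epoch / batch_size)

print(samples_per_epoch, steps_per_epoch, steps_per_epoch * num_epochs)
```

Under these assumptions the estimate comes to roughly 10,600 training blocks per epoch and reproduces the reported `global_step` of 3984.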

Evaluate the model and compute its perplexity:

In [15]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 7.43


Push the model to the Hub:

In [16]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/dimitarpg13/dimitar_eli5_mlm_model/commit/52b3f3f3e215313d7ab27d9e93b8c3e619044508', commit_message='End of training', commit_description='', oid='52b3f3f3e215313d7ab27d9e93b8c3e619044508', pr_url=None, repo_url=RepoUrl('https://huggingface.co/dimitarpg13/dimitar_eli5_mlm_model', endpoint='https://huggingface.co', repo_type='model', repo_id='dimitarpg13/dimitar_eli5_mlm_model'), pr_revision=None, pr_num=None)

### Inference

Use the special `<mask>` token to indicate the blank:

In [17]:
text = "The Milky Way is a <mask> galaxy."

Run the fine-tuned model for inference using [pipeline()](https://huggingface.co/docs/transformers/v4.53.3/en/main_classes/pipelines#transformers.pipeline). Instantiate a pipeline for fill-mask with your model, and pass your text to it. If you like, you can use the `top_k` parameter to specify how many predictions to return:

In [19]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", "dimitarpg13/dimitar_eli5_mlm_model")
mask_filler(text, top_k=3)

Device set to use cuda:0


[{'score': 0.626653254032135,
  'token': 21300,
  'token_str': ' spiral',
  'sequence': 'The Milky Way is a spiral galaxy.'},
 {'score': 0.0651693046092987,
  'token': 2232,
  'token_str': ' massive',
  'sequence': 'The Milky Way is a massive galaxy.'},
 {'score': 0.03233642503619194,
  'token': 30794,
  'token_str': ' dwarf',
  'sequence': 'The Milky Way is a dwarf galaxy.'}]
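
Each `score` above can be read as a probability: the pipeline applies a softmax to the model's logits at the masked position and returns the `top_k` entries. The plain-Python sketch below shows that computation on a made-up five-token vocabulary; the helper name and the logits are hypothetical.

```python
import math

def top_k_softmax(logits, k):
    """Return the k highest-probability (index, probability) pairs."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits over a five-token vocabulary at the masked position.
logits = [2.0, 0.5, -1.0, 3.5, 0.0]
for index, prob in top_k_softmax(logits, k=3):
    print(index, round(prob, 3))
```

The three scores in the output above add up to about 0.72; the remaining probability mass is spread over the rest of the vocabulary.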