## Masked Language Modeling

Masked language modeling (MLM) predicts which words best fit the blanked-out positions in a sentence. These models take text with masked tokens as input and output the possible values for each mask. They can attend to tokens bidirectionally, which means the model has full access to the tokens on both the left and the right of a mask. Masked language modeling is often used before fine-tuning a model on the specific task at hand. For example, if you need to use a model in a specific domain, models like BERT will treat the domain-specific words as rare tokens. If you first train the masked language model on a corpus from that domain and then fine-tune it on a downstream task, you typically end up with a better-performing model, that is, a model with higher accuracy for a given amount of training time and data. As for evaluation metrics, there is no single correct answer for a masked position, so we evaluate the predicted distribution over the masked values. Common metrics are cross-entropy loss and perplexity.
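To make the two metrics concrete, here is a minimal sketch of how perplexity relates to cross-entropy. The probabilities are hypothetical values, not the output of a real model, and the helper names `cross_entropy` and `perplexity` are chosen only for this illustration.

```python
import math

def cross_entropy(probs_of_true_tokens):
    """Mean negative log-likelihood of the correct tokens at the masked positions."""
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)

def perplexity(probs_of_true_tokens):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(probs_of_true_tokens))

# Hypothetical probabilities a model assigned to the correct token
# at three masked positions.
probs = [0.5, 0.25, 0.125]
loss = cross_entropy(probs)  # 2 * ln(2), roughly 1.386
ppl = perplexity(probs)      # 4.0
```

A perplexity of 4 can be read as the model being, on average, as uncertain as if it were choosing uniformly among four candidate tokens.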

We can use any plain text dataset and tokenize the text to mask the data.

Next we will fine-tune [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://facebookresearch.github.io/ELI5/) dataset.

We will start by loading the first 5,000 examples of the [ELI5-Category](https://huggingface.co/datasets/rexarski/eli5_category) dataset using the Datasets library. But first we install the necessary libraries:



In [1]:
%%capture --no-stderr

%pip install --quiet transformers datasets evaluate

%pip install --quiet datasets==2.16.0


In [2]:
from huggingface_hub import notebook_login

notebook_login()


In [3]:
from datasets import load_dataset

eli5 = load_dataset("eli5_category", split="train[:5000]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [4]:
eli5 = eli5.train_test_split(test_size=0.2)

In [5]:
eli5["train"][0]

{'q_id': '73o3ot',
 'title': 'Why is it that when we eat something too much we start to dislike it?',
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dnrv3rw', 'dns8vdn'],
  'text': ["We have an innate desire of a healthy, varied diet with lots of different nutrients, vitamins, etc., to keep our bodies supplied with all the items required by our metabolism. Short term, this means that when you eat a whole lot of a single meal in a short amount of time, say a lot of candy, your body is giving a response that it's had enough of that nurtrient, and is likely reaching unhealthy levels where it will have to waste resources instead of use it. Long term, it's a bit more psychological. Our desire to try out new recipes, new foods and change from the old ones is, when you have a steady and healthy diet, likely because you have acces to information about new food: such as new recipes, new information about certain nutrients, cooking shows, adver

## Processing a dataset for masked language modeling

Example:

`[My] [name] [MASK] [Sylvain] [.]`
               
                  |
                  V
                 [is]

`[I] [MASK] [at] [Hug] [##ging] [Face] [.]`

          |
          V
        [work]

The model's task is to fill in the masks.

```python
from datasets import load_dataset

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
raw_datasets["train"]
# Output:
# Dataset({
#   features: ['text'],
#   num_rows: 36718
# })
```

Gather all of the text into a single column of your dataset. Before we start the masking process, we need to make all of the texts the same length so that they can be batched together. The first way to do this is the one used in text classification tasks: pad the short texts and truncate the long ones.

Example:

`[CLS] [My] [name] [is] [Sylvain] [.] [SEP]`

`[CLS] [I] [MASK] [at] [Hug] [##ging] [SEP]`

`[CLS] [Short] [text] [SEP] [PAD] [PAD] [PAD]`

As we have seen when preparing data for text classification, this can be done as follows:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")
raw_datasets = raw_datasets.remove_columns("label")

model_checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
context_length = 128

def tokenize_pad_and_truncate(texts):
  return tokenizer(texts["text"], truncation=True, padding="max_length", max_length=context_length)

tokenized_datasets = raw_datasets.map(tokenize_pad_and_truncate, batched=True)

```
The tokenizer loaded with `AutoTokenizer` handles the padding and truncation for us.
However, with truncation we are going to lose a lot of text if the texts in the dataset are very long compared to the context length we have picked.

![Figure: chunking on context length pieces](https://github.com/dimitarpg13/transformer_examples/blob/main/images/chunking_on_context_length_pieces.png?raw=1)

Instead of discarding everything after the first chunk, we can split each text into pieces of length equal to the context length. We may end up with a shorter remainder at the end, which we can either keep and pad or ignore.
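The idea can be sketched in plain Python on a list of token ids, independently of any tokenizer. The helper `chunk_ids` and its `pad_id` parameter are hypothetical names introduced only for this illustration.

```python
def chunk_ids(input_ids, context_length, pad_id=None):
    """Split a list of token ids into pieces of length context_length.

    If pad_id is None, a shorter remainder is dropped;
    otherwise the remainder is padded up to context_length.
    """
    chunks = [
        input_ids[i:i + context_length]
        for i in range(0, len(input_ids), context_length)
    ]
    if chunks and len(chunks[-1]) < context_length:
        if pad_id is None:
            chunks.pop()  # ignore the remainder
        else:
            missing = context_length - len(chunks[-1])
            chunks[-1] = chunks[-1] + [pad_id] * missing  # keep and pad
    return chunks

ids = list(range(1, 11))  # ten hypothetical token ids
dropped = chunk_ids(ids, context_length=4)
# [[1, 2, 3, 4], [5, 6, 7, 8]]
padded = chunk_ids(ids, context_length=4, pad_id=0)
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
```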

We can implement this in practice with the following code, which sets `return_overflowing_tokens` to `True` in the `tokenizer` call:

```python
def tokenize_and_chunk(texts):
  return tokenizer(
     texts["text"], truncation=True, max_length=context_length,
     return_overflowing_tokens=True
  )

tokenized_datasets = raw_datasets.map(
  tokenize_and_chunk, batched=True, remove_columns=["text"]
)

len(raw_datasets["train"]), len(tokenized_datasets["train"])

# Output: (36718, 47192)
```

This way of chunking is ideal if all of your text is very long. But this won't work nicely if there is a variety of lengths in the text. In this case the best option is to concatenate all of your text in one big string with a special token (depicted in orange) indicating when we pass from one document to another.

![Figure: concatenating all texts into one big string](https://github.com/dimitarpg13/transformer_examples/blob/main/images/concatenate_in_one_big_string.png?raw=1)

This is how this can be done in code:

```python
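def tokenize_and_chunk(texts):
  # Concatenate the token ids of every text, separated by an end-of-sequence token.
  # Assumes the tokenizer defines `eos_token_id` (RoBERTa-style tokenizers do;
  # BERT-style tokenizers may leave it as None, in which case `sep_token_id` can be used).
  all_input_ids = []
  for input_ids in tokenizer(texts["text"])["input_ids"]:
    all_input_ids.extend(input_ids)
    all_input_ids.append(tokenizer.eos_token_id)

  # Split the concatenated ids into chunks of `context_length` tokens.
  chunks = []
  for idx in range(0, len(all_input_ids), context_length):
    chunks.append(all_input_ids[idx: idx + context_length])
  return {"input_ids": chunks}

tokenized_datasets = raw_datasets.map(tokenize_and_chunk, batched=True, remove_columns=["text"])

len(raw_datasets["train"]), len(tokenized_datasets["train"])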
```

The masking itself is done in a `DataCollator` instance:

```python
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

```
or, to return TensorFlow tensors:
```python
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
   tokenizer, mlm_probability=0.15, return_tensors="tf"
)
```



So the next step is to load a DistilRoBERTa tokenizer to process the `text` subfield:

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")


Notice from the example above that the `text` field is actually nested inside `answers`. This means we need to extract the `text` subfield from its nested structure with the `flatten` method:

In [7]:
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '73o3ot',
 'title': 'Why is it that when we eat something too much we start to dislike it?',
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dnrv3rw', 'dns8vdn'],
 'answers.text': ["We have an innate desire of a healthy, varied diet with lots of different nutrients, vitamins, etc., to keep our bodies supplied with all the items required by our metabolism. Short term, this means that when you eat a whole lot of a single meal in a short amount of time, say a lot of candy, your body is giving a response that it's had enough of that nurtrient, and is likely reaching unhealthy levels where it will have to waste resources instead of use it. Long term, it's a bit more psychological. Our desire to try out new recipes, new foods and change from the old ones is, when you have a steady and healthy diet, likely because you have acces to information about new food: such as new recipes, new information about certain nutrients, cooking shows, ad

Each subfield is now a separate column, as indicated by the `answers` prefix, and the `text` field is now a list. Instead of tokenizing each sentence separately, convert the list to a string so you can tokenize them jointly.

Here is a preprocessing function which we will use to join the list of strings for each example and tokenize the result:

In [8]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])
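As a quick illustration of the joining step alone, without the tokenizer, the list comprehension turns each example's list of answers into a single string. The `batch` dictionary below is a hypothetical stand-in for the batch that `map` would pass to the function.

```python
# Hypothetical batch with two examples: the first has two answers, the second has one.
batch = {"answers.text": [["First answer.", "Second answer."], ["Only answer."]]}

# Join the answers of each example into one string, separated by a space.
joined = [" ".join(answers) for answers in batch["answers.text"]]
# ['First answer. Second answer.', 'Only answer.']
```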

To apply this preprocessing function over the entire dataset, we use the Hugging Face Datasets [map](https://huggingface.co/docs/datasets/v4.0.0/en/package_reference/main_classes#datasets.Dataset.map) method.