<a href="https://colab.research.google.com/github/dimitarpg13/transformer_examples/blob/main/notebooks/bert/Masked_Language_Modeling_with_DistilBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Masked Language Modeling

Masked language modeling predicts which words best fit the blanked-out (masked) positions in a sentence. These models take text with masked tokens as input and output the likely values for each mask. Because they attend to tokens bidirectionally, they have full access to the tokens on both the left and the right of a mask. Masked language modeling is typically used as a pretraining or domain-adaptation step before fine-tuning a model on a specific task. For example, if you need a model for a specific domain, a general-purpose model like BERT will treat the domain-specific words as rare tokens. Training the masked language model on a corpus from that domain first, and then fine-tuning it on the downstream task, generally yields a better-performing model, that is, one with higher accuracy for a given amount of training time and data. With regard to evaluation, there is no single correct answer for a mask, so we evaluate the predicted distribution over the masked values. Common metrics are cross-entropy loss and perplexity.
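
Perplexity is the exponential of the mean cross-entropy loss (measured in nats). The sketch below illustrates the relationship; the list `token_probs` is a made-up set of probabilities assigned to the correct token at each masked position, not real model output.

```python
import math

# Hypothetical probabilities a model assigned to the correct token
# at each masked position (illustrative values only).
token_probs = [0.5, 0.25, 0.125]

# Cross-entropy loss: mean negative log-likelihood of the correct tokens.
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the cross-entropy loss.
perplexity = math.exp(cross_entropy)

print(f"cross-entropy: {cross_entropy:.4f}, perplexity: {perplexity:.4f}")
```

A perplexity of 4 here can be read as the model being, on average, as uncertain as if it were choosing uniformly among four candidate tokens.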

We can use any plain text dataset and tokenize the text to mask the data.

Next we will fine-tune [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://facebookresearch.github.io/ELI5/) dataset.

We will start by loading the first 5,000 examples from the [ELI5-Category](https://huggingface.co/datasets/rexarski/eli5_category) dataset using the Datasets library. But first we install the necessary libraries:



In [1]:
!pip install transformers datasets evaluate
!pip install -U datasets



In [2]:
from datasets import load_dataset

eli5 = load_dataset("eli5_category", split="train[:5000]")



In [3]:
eli5 = eli5.train_test_split(test_size=0.2)

In [4]:
eli5["train"][0]

{'q_id': '5p3u5z',
 'title': 'What is the difference between low level programming language and high level language?',
 'selftext': 'What is the difference between low level programming language and high level language? I have no knowledge of coding/computer language at all so please keep that in mind. Also examples of both would be great. Thanks!',
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dco5c3p', 'dcob0eu', 'dco5qd6'],
  'text': ['It\'s basically how close it is to the "real" operations the computer is doing. For example, take a list of names you need to alphabetically sort. Using plain English to make it simpler, in a high level language you basically just say: Computer, sort the list alphabetically and the rest is handled by the in built functions of the language that convert this to machine code. In a low level language you need to be more specific, e.g.: Take the first name and compare it to the second. If the first letter is lower the

## Processing a dataset for masked language modeling

Example:

`[My] [name] [MASK] [Sylvain] [.]`
               
                  |
                  V
                 [is]

`[I] [MASK] [at] [Hug] [##ging] [Face] [.]`

          |
          V
        [work]

The model's task is to fill in the masks.

```python
from datasets import load_dataset

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
raw_datasets["train"]

# Output:
# Dataset({
#   features: ['text'],
#   num_rows: 36718
# })
```

Gather all of the text into a single column of your dataset. Before we start the masking process, we need to make all of the texts the same length so they can be batched together. The first way to do this is the one used in text classification tasks: pad the short texts and truncate the long ones.

Example:

`[CLS] [My] [name] [is] [Sylvain] [.] [SEP]`

`[CLS] [I] [MASK] [at] [Hug] [##ging] [SEP]`

`[CLS] [Short] [text] [SEP] [PAD] [PAD] [PAD]`

This is the same approach we use when preparing data for text classification:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")
raw_datasets = raw_datasets.remove_columns("label")

model_checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
context_length = 128

def tokenize_pad_and_truncate(texts):
  # Pad short texts and truncate long ones to exactly context_length tokens.
  return tokenizer(texts["text"], truncation=True, padding="max_length", max_length=context_length)

tokenized_datasets = raw_datasets.map(tokenize_pad_and_truncate, batched=True)

```
The tokenizer handles this padding and truncation for us.
However, with truncation we are going to lose a lot of text if the documents are very long compared to the context length we have picked.

![Figure: chunking on context length pieces](https://github.com/dimitarpg13/transformer_examples/blob/main/images/chunking_on_context_length_pieces.png?raw=1)

Instead of discarding everything after the first chunk, we can split each text into pieces of length equal to the context length. We may end up with a remainder, which we can either pad to full length or ignore.
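
As a toy illustration of the two ways to handle the remainder, here is a minimal pure-Python sketch. The helper `chunk_ids`, its `pad_id` parameter, and the token ids are hypothetical and only mimic what a tokenizer would produce.

```python
def chunk_ids(input_ids, context_length, pad_id=None):
    """Split a list of token ids into pieces of length context_length.

    If pad_id is given, a short final piece is padded to full length;
    otherwise the short final piece is dropped.
    """
    chunks = [
        input_ids[i : i + context_length]
        for i in range(0, len(input_ids), context_length)
    ]
    if chunks and len(chunks[-1]) < context_length:
        if pad_id is None:
            chunks.pop()  # ignore the remainder
        else:
            missing = context_length - len(chunks[-1])
            chunks[-1] = chunks[-1] + [pad_id] * missing  # pad the remainder
    return chunks


ids = list(range(1, 11))  # ten made-up token ids
padded = chunk_ids(ids, context_length=4, pad_id=0)
dropped = chunk_ids(ids, context_length=4)

print(padded)   # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
print(dropped)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```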

We can implement this in practice with the following code, which sets `return_overflowing_tokens` to `True` in the `tokenizer` call:

```python
def tokenize_and_chunk(texts):
  return tokenizer(
     texts["text"], truncation=True, max_length=context_length,
     return_overflowing_tokens=True
  )

tokenized_datasets = raw_datasets.map(
  tokenize_and_chunk, batched=True, remove_columns=["text"]
)

len(raw_datasets["train"]), len(tokenized_datasets["train"])

# Output: (36718, 47192)
```

This way of chunking is ideal if all of your texts are very long, but it won't work nicely if the texts vary widely in length. In that case the best option is to concatenate all of your texts into one big string, with a special token (depicted in orange) indicating where one document ends and the next begins.

![Figure: concatenating all texts in one big string](https://github.com/dimitarpg13/transformer_examples/blob/main/images/concatenate_in_one_big_string.png?raw=1)

This is how this can be done in code:

```python
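def tokenize_and_chunk(texts):
  # Concatenate the token ids of all texts into one long list,
  # marking each document boundary with a separator token.
  # Assumes the tokenizer defines an EOS token (RoBERTa-style tokenizers do);
  # for BERT-style tokenizers eos_token_id is None, so use sep_token_id instead.
  all_input_ids = []
  for input_ids in tokenizer(texts["text"])["input_ids"]:
    all_input_ids.extend(input_ids)
    all_input_ids.append(tokenizer.eos_token_id)

  # Split the long list into chunks of context_length tokens
  # (the last chunk may be shorter).
  chunks = []
  for idx in range(0, len(all_input_ids), context_length):
    chunks.append(all_input_ids[idx: idx + context_length])
  return {"input_ids": chunks}

tokenized_datasets = raw_datasets.map(tokenize_and_chunk, batched=True, remove_columns=["text"])

len(raw_datasets["train"]), len(tokenized_datasets["train"])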
```
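
To see why concatenation suits text of mixed lengths, the toy comparison below counts the sequences and padding each strategy produces for a few made-up documents. The token ids, `SEP_ID`, and document lengths are illustrative assumptions; no tokenizer is involved.

```python
context_length = 8
SEP_ID = 0  # stand-in for the document-separator token id

# Made-up tokenized documents of very different lengths.
docs = [[1] * 3, [2] * 5, [3] * 2, [4] * 6]

# Strategy 1: chunk each document on its own. Every short piece becomes
# its own sequence and must be padded up to context_length.
per_doc_chunks = [
    doc[i : i + context_length]
    for doc in docs
    for i in range(0, len(doc), context_length)
]
padding_needed = sum(context_length - len(c) for c in per_doc_chunks)

# Strategy 2: concatenate everything with a separator, then chunk.
stream = []
for doc in docs:
    stream.extend(doc)
    stream.append(SEP_ID)
concat_chunks = [
    stream[i : i + context_length]
    for i in range(0, len(stream), context_length)
]

print(len(per_doc_chunks), padding_needed)                  # 4 16
print(len(concat_chunks), [len(c) for c in concat_chunks])  # 3 [8, 8, 4]
```

In this toy case, per-document chunking yields four sequences that need 16 padding tokens in total, while concatenation yields three sequences of which only the last is short.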

The masking itself is done in a `DataCollator` instance:

```python
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

```
or, to get TensorFlow tensors:
```python
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
   tokenizer, mlm_probability=0.15, return_tensors="tf"
)
```
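
By default, `DataCollatorForLanguageModeling` selects each token with probability `mlm_probability`; of the selected tokens, 80% are replaced with the mask token, 10% with a random token, and 10% are left unchanged. Labels are set to `-100` at unselected positions so the loss ignores them. The pure-Python sketch below mimics that logic on a plain list of ids. It is not the library's implementation: `mask_tokens`, `MASK_ID`, and `VOCAB_SIZE` are hypothetical, and the special-token handling of the real collator is left out.

```python
import random

MASK_ID = 103        # hypothetical id of the [MASK] token
VOCAB_SIZE = 1000    # hypothetical vocabulary size
IGNORE_INDEX = -100  # label value that the loss skips


def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token_id in input_ids:
        if rng.random() >= mlm_probability:
            # Token not selected: keep it and exclude it from the loss.
            masked.append(token_id)
            labels.append(IGNORE_INDEX)
            continue
        # Token selected: the model must predict the original id.
        labels.append(token_id)
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_ID)                    # 80%: mask token
        elif roll < 0.9:
            masked.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
        else:
            masked.append(token_id)                   # 10%: unchanged
    return masked, labels


ids = list(range(200, 240))  # forty made-up token ids
masked_ids, labels = mask_tokens(ids, rng=random.Random(0))
print(masked_ids)
print(labels)
```

Because masking happens in the collator, a different random mask is drawn each time a batch is built, rather than being fixed once during preprocessing.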



Returning to the ELI5 dataset, the next step is to load a DistilRoBERTa tokenizer to process the `text` subfield:

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")

Notice from the example above that the `text` field is actually nested inside `answers`. This means we need to extract the `text` subfield from its nested structure with the `flatten` method:

In [6]:
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '5p3u5z',
 'title': 'What is the difference between low level programming language and high level language?',
 'selftext': 'What is the difference between low level programming language and high level language? I have no knowledge of coding/computer language at all so please keep that in mind. Also examples of both would be great. Thanks!',
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dco5c3p', 'dcob0eu', 'dco5qd6'],
 'answers.text': ['It\'s basically how close it is to the "real" operations the computer is doing. For example, take a list of names you need to alphabetically sort. Using plain English to make it simpler, in a high level language you basically just say: Computer, sort the list alphabetically and the rest is handled by the in built functions of the language that convert this to machine code. In a low level language you need to be more specific, e.g.: Take the first name and compare it to the second. If the first letter is lower 
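
Conceptually, flattening lifts each nested field to the top level under a dotted name, which is why `answers` becomes `answers.a_id` and `answers.text` in the output above. A minimal pure-Python sketch of the idea on a single record follows; the helper `flatten_record` and the sample record are hypothetical and do not reflect how the Datasets library implements `flatten`.

```python
def flatten_record(record, parent_key=""):
    """Lift nested dict fields to the top level using dotted key names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested fields, carrying the dotted prefix along.
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


# Made-up record with the same shape as an ELI5 example.
example = {
    "q_id": "abc123",
    "answers": {"a_id": ["x1", "x2"], "text": ["first answer", "second answer"]},
}
flat = flatten_record(example)
print(list(flat))  # ['q_id', 'answers.a_id', 'answers.text']
```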