<a href="https://colab.research.google.com/github/dimitarpg13/transformer_examples/blob/main/notebooks/bert/Masked_Language_Model_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Tutorial on Masked Language Modeling

Masked language modeling predicts masked tokens in a sequence, and the model can attend to the tokens on both sides of each mask. This makes it well suited to tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
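To make the objective concrete, here is a minimal, standard-library-only sketch of how masked inputs and their labels can be built. It is a simplification under stated assumptions, not the procedure used by any particular library: the `<mask>` string, the `-100` ignore value, and the helper name `mask_tokens` are illustrative, and BERT's refinement of sometimes substituting a random token or leaving the chosen token unchanged is omitted.

```python
import random

MASK_TOKEN = "<mask>"   # assumed mask string, for illustration only
IGNORE_INDEX = -100     # assumed "skip this position" label value


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a list of string tokens.

    Each token is masked independently with probability mask_prob.
    Labels hold the original token at masked positions and
    IGNORE_INDEX everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(IGNORE_INDEX)
    return masked, labels


tokens = "the electrons at the hotter end are moving faster".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The model sees `masked` as input and is scored only on the positions whose label is not `IGNORE_INDEX`, so it can use context from both directions to recover the hidden tokens.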

In this tutorial DistilRoBERTa will be fine-tuned on a 5,000-example subset of the [ELI5-Category](https://huggingface.co/datasets/rexarski/eli5_category/blob/main/README.md) dataset.

In [1]:
%pip install transformers datasets evaluate

%pip install datasets==2.16.0

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.5
Collecting datasets==2.16.0
  Downloading datasets-2.16.0-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow-hotfix (from datasets==2.16.0)
  Downloading pyarrow_hotfix-0.7-py3-none-any.whl.metadata (3.6 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets==2.16.0)
  Downloading dill-0.3.7-py3-none-any.whl.metadata (9.9 kB)
Collecting fsspec<=2023.10.0,>=2023.1.0 (from fsspec[http]<=2023.10.0,>=2023.1.0->datasets==2.16.0)
  Downloading fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
Collecting multiprocess

Log in to your Hugging Face account so you can upload and share your model with the community:

In [2]:
from huggingface_hub import notebook_login

notebook_login()



### Load ELI5 dataset

Load the first 5000 examples from the [ELI5-Category](https://huggingface.co/datasets/rexarski/eli5_category/blob/main/README.md) dataset with the HF Datasets library. This smaller subset is used for initial experimentation before moving on to the full dataset.

In [3]:
from datasets import load_dataset

# 'eli5_category' is loaded from a dataset script, which newer releases of
# `datasets` no longer support. The `datasets==2.16.0` pin above keeps
# script-based loading available.
eli5 = load_dataset("eli5_category", split="train[:5000]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/4.17k [00:00<?, ?B/s]

Downloading readme: 0.00B [00:00, ?B/s]

Downloading data:   0%|          | 0.00/62.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/5.00M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.76M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.85M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/91772 [00:00<?, ? examples/s]

Generating validation1 split:   0%|          | 0/5446 [00:00<?, ? examples/s]

Generating validation2 split:   0%|          | 0/2375 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5411 [00:00<?, ? examples/s]

Split the dataset into train and test sets using the [train_test_split](https://huggingface.co/docs/datasets/v4.0.0/en/package_reference/main_classes#datasets.Dataset.train_test_split) method of Datasets:

In [4]:
eli5 = eli5.train_test_split(test_size=0.2)
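As a rough illustration of what a shuffled split with `test_size=0.2` does to 5000 rows, the following standard-library sketch shuffles row indices and cuts them into two parts. The helper `split_indices` and the seed value are hypothetical, and this is not the Datasets implementation.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(5000, test_size=0.2)
print(len(train_idx), len(test_idx))  # 4000 1000
```

Fixing the seed makes the split repeatable across runs. `train_test_split` also accepts a `seed` argument, which may be worth setting if you want the same train and test rows each time the notebook is executed.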

Inspect the first record of the train set:

In [5]:
eli5["train"][0]

{'q_id': '78949z',
 'title': 'How does a Thermoelectric Generator work?',
 'selftext': '',
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dosd73s'],
  'text': ["What is really being asked is how the thermoelectric effect works, so I'll try and explain that. Imagine you had a metal wire that has either end held at a different temperature. The electrons in the metal act similar to a gas, where the electrons at the hotter end are moving faster and spreading out more. This causes a higher concentration of electrons at the cold end, which causes a voltage difference between the two ends of the wire. Note that different materials will generate different voltages, even under identical thermal conditions. A thermocouple or thermoelectric generator uses two dissimilar materials, with the hot ends attached together. This guarantees that there is a voltage difference between the two cold ends, which can either be used in power production or as a measurement s

We are only interested in the `text` field when fine-tuning a masked language model (such as BERT). No separate labels are needed in this case, because the original tokens at the masked positions serve as the labels.
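Because `text` is nested inside `answers` and holds a list of strings, it has to be pulled out before tokenization. The sketch below shows one way to do that on a single record. The record is a hypothetical placeholder that only mirrors the structure shown above, and `join_answer_texts` is an illustrative helper name.

```python
# Hypothetical record with the same nesting as the dataset rows;
# all values are placeholders, not real data.
record = {
    "q_id": "example",
    "title": "Example question?",
    "answers": {
        "a_id": ["a1", "a2"],
        "text": ["First answer text.", "Second answer text."],
    },
}


def join_answer_texts(example):
    """Concatenate every answer string of one record into a single string."""
    return " ".join(example["answers"]["text"])


print(join_answer_texts(record))  # First answer text. Second answer text.
```

Joining the answers into one string per record is a design choice: it keeps one input per question, at the cost of mixing answers from different authors in a single sequence.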

Load a DistilRoBERTa tokenizer to process the `text` subfield:

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]