<a href="https://colab.research.google.com/github/dimitarpg13/transformer_examples/blob/main/notebooks/bert/Masked_Language_Model_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Tutorial on Masked Language Modeling

Masked language modeling predicts masked tokens in a sequence, and the model can attend to the tokens on both sides of each mask (bidirectionally). This makes it well suited to tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
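
To make the masking idea concrete, here is a minimal plain-Python sketch. It is not the procedure used by any particular library: the helper name `mask_tokens` is hypothetical, whitespace splitting stands in for a real tokenizer, and the 15% masking rate follows the value commonly used for BERT. The full BERT recipe also sometimes swaps a selected token for a random one or leaves it unchanged, which this sketch omits.

```python
import random

MASK_TOKEN = "<mask>"  # assumption: RoBERTa-style mask string; BERT uses "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hypothetical helper: hide each token with probability `mask_prob`.

    Returns (masked_tokens, labels). A label is the original token at a
    masked position and None everywhere else, so only masked positions
    would contribute to the training loss.
    """
    rng = random.Random(seed)  # local RNG keeps the example reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Whitespace splitting is a stand-in for a real subword tokenizer.
tokens = "the electrons at the hotter end are moving faster".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The model sees `masked` as input and is trained to recover the non-`None` entries of `labels`, using context from both directions.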

In this tutorial DistilRoBERTa will be fine-tuned on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/sentence-transformers/eli5) dataset.

In [None]:
%pip install transformers datasets evaluate

%pip install datasets==2.16.0

Log in to your Hugging Face account so you can upload and share your model with the community:

In [2]:
from huggingface_hub import notebook_login

notebook_login()



### Load ELI5 dataset

Load the first 5000 examples from the [ELI5-Category](https://huggingface.co/datasets/rexarski/eli5_category/blob/main/README.md) dataset with the Hugging Face Datasets library. Working with a small slice first allows quick experimentation before moving on to the full dataset.

In [None]:
from datasets import load_dataset

# 'eli5_category' is loaded from a dataset script, which recent releases of
# `datasets` no longer support. The `datasets==2.16.0` pin above is assumed to
# be what keeps this call working; verify against your installed version.
eli5 = load_dataset("eli5_category", split="train[:5000]")

Split the dataset into train and test sets using the Datasets [train_test_split](https://huggingface.co/docs/datasets/v4.0.0/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [4]:
eli5 = eli5.train_test_split(test_size=0.2)

Inspect the first record of the train set:

In [5]:
eli5["train"][0]

{'q_id': '78949z',
 'title': 'How does a Thermoelectric Generator work?',
 'selftext': '',
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dosd73s'],
  'text': ["What is really being asked is how the thermoelectric effect works, so I'll try and explain that. Imagine you had a metal wire that has either end held at a different temperature. The electrons in the metal act similar to a gas, where the electrons at the hotter end are moving faster and spreading out more. This causes a higher concentration of electrons at the cold end, which causes a voltage difference between the two ends of the wire. Note that different materials will generate different voltages, even under identical thermal conditions. A thermocouple or thermoelectric generator uses two dissimilar materials, with the hot ends attached together. This guarantees that there is a voltage difference between the two cold ends, which can either be used in power production or as a measurement s

We are only interested in the `text` field when fine-tuning a masked language model (such as BERT). No separate labels are needed, because the masked tokens themselves serve as the labels.
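
Note that `text` is nested inside `answers` and holds a list of strings, so it has to be pulled out before tokenization. The following is a minimal plain-Python sketch of that step, assuming a shortened, hypothetical record with the same nesting as the one above; the helper name `extract_answer_text` is illustrative and not part of any library. The Datasets library also offers a `flatten` method on datasets that un-nests such fields.

```python
# Shortened, hypothetical record with the same nesting as the example above.
record = {
    "q_id": "78949z",
    "title": "How does a Thermoelectric Generator work?",
    "answers": {
        "a_id": ["dosd73s"],
        "text": ["What is really being asked is how the thermoelectric effect works."],
    },
}


def extract_answer_text(example):
    """Join all answer strings of one record into a single training string."""
    return " ".join(example["answers"]["text"])


print(extract_answer_text(record))
```

Joining the answers with a space is one simple choice; keeping each answer as its own training example would work as well.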

Load a DistilRoBERTa tokenizer to process the `text` subfield:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")