# Fine-tune DistilRoBERTa on a subset of the ELI5-Category dataset to fill in the blank in a sentence

## Introduction
Masked language modeling predicts a masked token in a sequence. The model attends to tokens bidirectionally, so it has full access to the context on both the left and the right of the mask. This makes masked language modeling well suited to tasks that require contextual understanding of an entire sequence. BERT is an example of a masked language model, and DistilRoBERTa, the model fine-tuned here, is another.
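To make "bidirectional" concrete, the sketch below contrasts the attention pattern of a left-to-right (causal) model with that of a masked language model. It is a toy illustration only: `causal_mask` and `bidirectional_mask` are hypothetical helpers written for this example, not functions from `transformers`.

```python
def causal_mask(n):
    # Row i lists which positions token i may attend to: only positions <= i.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Every token may attend to every position, left and right.
    return [[1] * n for _ in range(n)]

tokens = ["The", "quick", "brown", "<mask>", "jumps"]
masked_pos = tokens.index("<mask>")

# Context visible from the masked position under each attention pattern
visible_causal = [
    t for t, v in zip(tokens, causal_mask(len(tokens))[masked_pos]) if v
]
visible_bidirectional = [
    t for t, v in zip(tokens, bidirectional_mask(len(tokens))[masked_pos]) if v
]

print(visible_causal)
print(visible_bidirectional)
```

Under the causal pattern the masked position cannot see "jumps", while under the bidirectional pattern it sees the whole sequence.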


## Setup

In [1]:
import torch
torch.cuda.is_available()

True

In [None]:
!pip install datasets transformers evaluate gradio

In [3]:
# Log in to the Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()


## Load the ELI5 dataset

In [4]:
# Load the first 5000 examples from the ELI5-Category Dataset
from datasets import load_dataset

eli5 = load_dataset("eli5_category", split="train[:5000]")
eli5

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The repository for eli5_category contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/eli5_category.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/91772 [00:00<?, ? examples/s]

Generating validation1 split:   0%|          | 0/5446 [00:00<?, ? examples/s]

Generating validation2 split:   0%|          | 0/2375 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5411 [00:00<?, ? examples/s]

Dataset({
    features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
    num_rows: 5000
})

In [5]:
# Split the train dataset into a train and test set
eli5 = eli5.train_test_split(test_size=0.2)
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 1000
    })
})

In [6]:
# Take a look at a sample
eli5["train"][0]

{'q_id': '769x3x',
 'title': "If an infinitely fast car was on a finite loop, (ignoring physics laws which throw it off the track) wouldn't it just ram into itself? If not, why?",
 'selftext': '',
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['docetxe', 'docf1k8', 'doceubh'],
  'text': ['It\'s a truly meaningless question. An infinitely fast thing is everywhere at once, defying any notion of "where it is" or of "hitting or not hitting" anything.',
   'The length of the track minus length of the car will always be the distance the car has to travel to hit itself regardless of speed. So no.',
   "Speed is distance over time. So to get back to the starting point with no time passing you would have to divide by zero, which is undefined in math. So the answer is we don't know."],
  'score': [11, 3, 3],
  'text_urls': [[], [], []]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

## Data Preprocessing

In [7]:
# Load a DistilRoBERTa tokenizer to process the text subfield
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")

In [8]:
# Extract the text subfield from its nested structure
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '769x3x',
 'title': "If an infinitely fast car was on a finite loop, (ignoring physics laws which throw it off the track) wouldn't it just ram into itself? If not, why?",
 'selftext': '',
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['docetxe', 'docf1k8', 'doceubh'],
 'answers.text': ['It\'s a truly meaningless question. An infinitely fast thing is everywhere at once, defying any notion of "where it is" or of "hitting or not hitting" anything.',
  'The length of the track minus length of the car will always be the distance the car has to travel to hit itself regardless of speed. So no.',
  "Speed is distance over time. So to get back to the starting point with no time passing you would have to divide by zero, which is undefined in math. So the answer is we don't know."],
 'answers.score': [11, 3, 3],
 'answers.text_urls': [[], [], []],
 'title_urls': ['url'],
 'selftext_urls': ['url']}

In [9]:
# Join each example's list of answer strings into one string, then tokenize it
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

In [10]:
# Apply the preprocessing over entire dataset
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (3079 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1147 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1806 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2072 > 512). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (515 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (529 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (522 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (847 > 512). Running this sequence through the model will result in indexing errors


In [11]:
# Concatenate all tokenized sequences, then split them into chunks of block_size tokens
block_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder that does not fill a whole block
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result
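A small worked example may help show what this chunking does. The sketch below repeats the same logic with `block_size` passed as a parameter so that a tiny value can be used; `group_texts_demo` and the toy token ids are hypothetical and exist only for illustration.

```python
def group_texts_demo(examples, block_size):
    # Same logic as group_texts, with block_size as an argument
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Three toy "tokenized" examples holding 10 tokens in total
toy = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}

chunks = group_texts_demo(toy, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The last two tokens (9 and 10) are dropped because they do not fill a whole block. Note that chunks can span example boundaries, and that a batch shorter than `block_size` is kept as a single short chunk rather than dropped.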

In [12]:
# Apply the group text function over the entire dataset
lm_dataset = tokenized_eli5.map(
    group_texts,
    batched=True,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [13]:
# Create a data collator that batches examples and randomly masks 15% of the
# tokens (mlm_probability=0.15) each time a batch is built; it also pads each
# batch to its longest sequence when lengths differ
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=0.15
)
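The following is a simplified, pure-Python sketch of the kind of masking this collator applies, assuming its default scheme: each selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The function `mask_for_mlm` and the token ids used are hypothetical stand-ins for illustration; the real collator works on tensors and reads the special tokens from the tokenizer.

```python
import random

def mask_for_mlm(input_ids, mask_id, vocab_size, special_ids=(),
                 mlm_probability=0.15, seed=None):
    rng = random.Random(seed)
    inputs = list(input_ids)
    # -100 marks positions the loss should ignore (PyTorch's default ignore_index)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; select other tokens with mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with random token
        # remaining 10%: keep the original token
    return inputs, labels

# Illustrative ids only: 0 and 2 stand in for the start/end special tokens
ids = [0, 713, 16, 10, 1296, 3645, 2]
masked_inputs, labels = mask_for_mlm(
    ids, mask_id=50264, vocab_size=50265, special_ids=(0, 2), seed=0
)
print(masked_inputs)
print(labels)
```

Because the selection is random, a different subset of tokens is masked each time a batch is built, so the model sees varied masks across epochs.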

## Train the Model

In [14]:
# Load DistilRoBERTa model
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
# Define training hyperparameters
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilroberta-finetuned-eli5",
    eval_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

In [16]:
# Define trainer with training args, model, datasets and data collator
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [17]:
# Finetune the model
trainer.train()

| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 2.2493 | 2.06609 |
| 2 | 2.1761 | 2.030004 |
| 3 | 2.1281 | 2.022738 |


TrainOutput(global_step=3996, training_loss=2.190221410375219, metrics={'train_runtime': 894.9001, 'train_samples_per_second': 35.712, 'train_steps_per_second': 4.465, 'total_flos': 1059615079218432.0, 'train_loss': 2.190221410375219, 'epoch': 3.0})

## Evaluate the Model

In [18]:
# Evaluate the model and compute its perplexity (the exponential of the evaluation loss)
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 7.52
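Perplexity is the exponential of the mean cross-entropy loss, so it can be read as the effective number of equally likely tokens the model is choosing between at each masked position. The sketch below checks this on a uniform distribution and then inverts the reported value; `perplexity` is a hypothetical helper defined only for this example.

```python
import math

def perplexity(mean_cross_entropy):
    # Perplexity is exp of the average negative log-likelihood per token
    return math.exp(mean_cross_entropy)

# A uniform guess over 10 tokens has loss ln(10), hence perplexity 10
vocab_size = 10
uniform_loss = math.log(vocab_size)
print(round(perplexity(uniform_loss), 6))  # 10.0

# Loss implied by the reported perplexity of 7.52
implied_loss = math.log(7.52)
print(f"{implied_loss:.2f}")  # 2.02
```

The implied loss of about 2.02 is close to, but not identical to, the final validation loss logged during training. One plausible reason is that the data collator draws a fresh random mask on every evaluation pass.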


In [19]:
# Push the model to the Hub
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/ashaduzzaman/distilroberta-finetuned-eli5/commit/d37c5e3c08c02ee1a1c7053959d466fc81368b4d', commit_message='End of training', commit_description='', oid='d37c5e3c08c02ee1a1c7053959d466fc81368b4d', pr_url=None, pr_revision=None, pr_num=None)

## Run inference to fill in the blank

In [20]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask", model="ashaduzzaman/distilroberta-finetuned-eli5"
)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [26]:
text = "The quick brown <mask> jumps over the lazy dog."
fill_mask(text, top_k=3)

[{'score': 0.18019390106201172,
  'token': 324,
  'token_str': 'ie',
  'sequence': 'The quick brownie jumps over the lazy dog.'},
 {'score': 0.10272006690502167,
  'token': 2173,
  'token_str': ' guy',
  'sequence': 'The quick brown guy jumps over the lazy dog.'},
 {'score': 0.05183292552828789,
  'token': 23602,
  'token_str': ' fox',
  'sequence': 'The quick brown fox jumps over the lazy dog.'}]
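The top prediction reads "brownie" because its `token_str` has no leading space, so it attaches to the preceding word, whereas ' guy' and ' fox' carry their own leading space and start a new word. The sketch below reproduces the `sequence` strings from the `token_str` values above. It is an illustration of this convention only: `fill` is a hypothetical helper, and the pipeline itself builds the sequences by decoding token ids.

```python
text = "The quick brown <mask> jumps over the lazy dog."

# token_str and sequence values copied from the pipeline output above
predictions = [
    {"token_str": "ie",
     "sequence": "The quick brownie jumps over the lazy dog."},
    {"token_str": " guy",
     "sequence": "The quick brown guy jumps over the lazy dog."},
    {"token_str": " fox",
     "sequence": "The quick brown fox jumps over the lazy dog."},
]

def fill(text, token_str, mask_token="<mask>"):
    # Replace the mask together with the space before it; token_str supplies
    # its own leading space when the prediction starts a new word
    return text.replace(" " + mask_token, token_str, 1)

for p in predictions:
    print(repr(p["token_str"]), "->", fill(text, p["token_str"]))
```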