# Fine-tuning a Model for Masked Language Modeling (MLM) Exam

In this exam, you will preprocess a dataset and fine-tune a model for a masked language modeling task. Complete each step carefully according to the instructions provided.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `bert-base-uncased` for both the model and tokenizer.
- **Dataset**: You will be using the `CUTD/math_df` dataset. Make sure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [1]:
# Student names: Feras Alsayigh, Abdullah Aloufi, Taher Mutanbak, Saud Abusarrah, Abdulah Hafiz

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [3]:
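import pandas as pd
from transformers import AutoTokenizer
import numpy as np

from datasets import load_dataset

# Data source: https://huggingface.co/datasets/CUTD/math_df/tree/main
# Keep the first 80% of the rows (8,000 of the 10,000 examples), then hold out
# 20% of that subset for testing, which gives 6,400 training and 1,600 test rows.
dataset = load_dataset('CUTD/math_df', split="train[:80%]")
dataset = dataset.train_test_split(test_size=0.2)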

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


math_df.csv:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [4]:
dataset['train']['text']

['A human rights lawyer who works alongside the pacifist to uphold justice and fight for peace',
 'An entertainment lawyer specializing in endorsement and brand deals for athletes',
 'An entrepreneur and crowdfunding enthusiast who is passionate about the democratization of financial opportunities and an advocate for small businesses.',
 "A visually impaired student who benefits from the AI system's text-to-speech features",
 "A graduate student studying film theory and fascinated by the socio-cultural aspects of Mira Nair's movies",
 "A aspiring political analyst who was inspired by the professor's teaching, but now presents differing viewpoints",
 'A senior Linux administrator with a focus on system security. She has been working with Linux distributions like RHEL, CentOS, and Ubuntu for over a decade and knows the ins and outs of handling package installations, system configurations, and troubleshooting pc issues.',
 'A sports reporter assigned to cover the journey and achievements 

In [5]:
dataset['train'].column_names

['Unnamed: 0', 'text']
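
The row counts that appear later in the notebook (6400 training and 1600 test examples) follow from the two-stage split in cell 3: a slice of the first 80% of the rows, then a 20% hold-out from that slice. The arithmetic can be sketched in plain Python. The helper name `split_sizes` is hypothetical, and the rounding used here is an assumption that holds for these round numbers, not a description of how `datasets` rounds in general.

```python
def split_sizes(total_rows, keep_fraction, test_fraction):
    """Return (kept, train, test) row counts for a slice followed by a split."""
    kept = round(total_rows * keep_fraction)  # rows selected by split="train[:80%]"
    test = round(kept * test_fraction)        # rows held out by test_size=0.2
    train = kept - test
    return kept, train, test


kept, train, test = split_sizes(10_000, 0.80, 0.20)
print(kept, train, test)  # 8000 6400 1600

# Splitting all 10,000 rows instead would hold out 20% of the whole dataset.
print(split_sizes(10_000, 1.0, 0.20))  # (10000, 8000, 2000)
```

Passing a fixed `seed` to `train_test_split` would additionally make the shuffle reproducible between runs.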

## Step 2: Load the Pretrained Model and Tokenizer

Use a pre-trained model and tokenizer for this task. Initialize both in this step.

In [6]:
from transformers import AutoTokenizer
from transformers import AutoModelForMaskedLM
# I chose this checkpoint based on the Hugging Face docs, since I think it is suitable for masked language modeling:
# https://huggingface.co/docs/transformers/tasks/masked_language_modeling

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Step 3: Preprocess the Dataset

Define a preprocessing function that tokenizes the text data and prepares the inputs for the model. Ensure that you truncate the sequences to a maximum length of 512 tokens and pad them appropriately.

**Bonus**: Bonus points are awarded for more comprehensive preprocessing, such as removing links, converting text to lowercase, or applying additional preprocessing techniques.

In [7]:
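def preprocess_function(examples):
    # Note: each entry of examples["text"] is already a plain string, so
    # " ".join(x) puts a space between every character of that text.
    # Passing examples["text"] directly would tokenize whole words instead.
    return tokenizer([" ".join(x) for x in examples["text"]])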
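
A quick check of what `" ".join(x)` does to a plain string may be useful here, because the `text` column holds strings, not lists of strings. The sketch below first mirrors the list comprehension from cell 7, then shows one possible alternative that tokenizes each text as-is with the 512-token truncation the step asks for. `join_characters` and `preprocess_direct` are hypothetical names, and `tokenizer` is assumed to be a Hugging Face tokenizer (or any callable accepting the same keyword arguments).

```python
def join_characters(texts):
    """Mirror of the list comprehension in cell 7, applied to plain strings."""
    return [" ".join(x) for x in texts]


# Joining a string iterates over its characters, so every character is spaced out.
print(join_characters(["abc def"]))  # ['a b c   d e f']


def preprocess_direct(examples, tokenizer, max_length=512):
    """Hypothetical alternative: tokenize each text unchanged, truncated to max_length."""
    return tokenizer(examples["text"], truncation=True, max_length=max_length)
```

With `dataset.map`, the tokenizer could be bound beforehand, for example with `functools.partial(preprocess_direct, tokenizer=tokenizer)`. Switching to this version would change the token counts and losses reported in the rest of the notebook.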

In [8]:
tokenized_ds = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=dataset["train"].column_names,
)


Map (num_proc=4):   0%|          | 0/6400 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1600 [00:00<?, ? examples/s]

In [9]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 6400
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1600
    })
})

In [10]:
block_size = 512

# From the doc to concatenate and group
#https://huggingface.co/docs/transformers/tasks/masked_language_modeling
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result

lm_dataset = tokenized_ds.map(group_texts, batched=True, num_proc=4)
# lm_dataset_test = tokenized_ds.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/6400 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1600 [00:00<?, ? examples/s]
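
To see what the grouping step does, it can help to run the same logic on a toy batch with a small block size. The sketch below repeats the logic of `group_texts` from cell 10, with `block_size` turned into a parameter so the example is self-contained. The toy token IDs are made up for illustration.

```python
def group_texts(examples, block_size):
    """Concatenate all sequences in the batch, then cut them into fixed-size blocks."""
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1]],
}
grouped = group_texts(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Nine tokens yield two blocks of four, and the ninth token is dropped. Block boundaries ignore the original example boundaries, so a single block can contain text from several rows.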

## Step 4: Define Training Arguments

Set up the training configuration, including parameters like learning rate, batch size, number of epochs, and weight decay.

In [11]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="my_awesom_model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)


## Step 5: Initialize the Trainer

Initialize the Trainer using the model, training arguments, and datasets (both training and evaluation).

In [12]:
from transformers import DataCollatorForLanguageModeling

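# Reuse the end-of-sequence token for padding. The blocks produced by group_texts
# all have length block_size, so little or no padding is expected in practice.
tokenizer.pad_token = tokenizer.eos_token
# Randomly select 15% of the tokens in each batch as prediction targets.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)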
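
The effect of `mlm_probability=0.15` can be illustrated with a simplified, pure-Python sketch of the masking rule: each token is selected with probability 0.15, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. This follows the default behaviour of the collator as I understand it. The function `mask_tokens`, the toy IDs, and the choice `mask_id=4` are assumptions for illustration, and the real collator also avoids masking special tokens, which this sketch skips.

```python
import random

IGNORE_INDEX = -100  # label value that the loss function skips


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence using the 80/10/10 rule."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in input_ids:
        if rng.random() < mlm_probability:
            labels.append(token)  # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(mask_id)                    # 80%: mask token
            elif roll < 0.9:
                masked.append(rng.randrange(vocab_size))  # 10%: random token
            else:
                masked.append(token)                      # 10%: unchanged
        else:
            labels.append(IGNORE_INDEX)  # position is not scored
            masked.append(token)
    return masked, labels


toy_ids = list(range(100, 120))
masked_ids, labels = mask_tokens(toy_ids, mask_id=4, vocab_size=1000, rng=random.Random(0))
print(masked_ids)
print(labels)
```

Because the selection is random, the evaluation set is masked differently on each pass, so validation losses can vary slightly between runs.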

In [13]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    data_collator=data_collator,
    eval_dataset=lm_dataset["test"],
)

## Step 6: Fine-tune the Model

Run the training process using the initialized Trainer to fine-tune the model on the masked language modeling task.

In [14]:
lm_dataset['train']

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 1408
})

In [15]:
lm_dataset['test']

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 348
})
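
The number of optimizer steps can be estimated from these row counts. The sketch below assumes the `TrainingArguments` defaults that cell 11 leaves untouched: a per-device batch size of 8, a single device, no gradient accumulation, and a logging interval of 500 steps. The helper `training_steps` is a hypothetical name.

```python
import math


def training_steps(num_rows, batch_size, num_epochs):
    """Return (steps per epoch, total steps) for a simple single-device run."""
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = training_steps(num_rows=1408, batch_size=8, num_epochs=3)
print(steps_per_epoch, total_steps)  # 176 528

# With a logging interval of 500 steps, the first training-loss log
# falls in the epoch that contains step 500.
logging_steps = 500
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(first_logged_epoch)  # 3
```

Under these assumptions the total matches the `global_step=528` reported by `trainer.train()`, and the logging interval would explain why the training-loss column shows `No log` for the first two epochs.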

In [16]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,1.225745
2,No log,1.108353
3,1.348100,1.066714


TrainOutput(global_step=528, training_loss=1.3379427880951853, metrics={'train_runtime': 104.1387, 'train_samples_per_second': 40.561, 'train_steps_per_second': 5.07, 'total_flos': 560194511044608.0, 'train_loss': 1.3379427880951853, 'epoch': 3.0})
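
Validation loss for a language model is often reported as perplexity, the exponential of the mean cross-entropy loss. A minimal sketch using the three validation losses from the table above:

```python
import math

# Validation losses copied from the training log above, keyed by epoch.
validation_loss = {1: 1.225745, 2: 1.108353, 3: 1.066714}

perplexity = {epoch: math.exp(loss) for epoch, loss in validation_loss.items()}
for epoch, value in perplexity.items():
    print(f"epoch {epoch}: perplexity = {value:.2f}")
```

Perplexity falls from about 3.4 after the first epoch to about 2.9 after the third. Both loss and perplexity were still decreasing at the end of training, so additional epochs might improve the model further.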

## Step 7: Inference

Use the fine-tuned model for inference. Create a pipeline for masked language modeling and test it with a sample sentence.

In [17]:
dataset['test']['text'][0:3]

["A granddaughter inspired by the nurse's compassion, who aspires to follow in their footsteps and become a geriatric nurse",
 'An aerospace engineer researching biomimicry and aerodynamics',
 'A data scientist who applies mathematical models to analyze patterns in abstract art']

In [31]:
# I'll choose the third one and mask the word "patterns"

In [32]:
text = "A data scientist who applies mathematical models to analyze <mask> in abstract art"
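
The mask placeholder is specific to the tokenizer: RoBERTa-family checkpoints such as the one loaded in Step 2 use `<mask>`, while `bert-base-uncased` uses `[MASK]`. Reading it from `tokenizer.mask_token` avoids hard-coding it. The sketch below builds the masked sentence from the original test sentence. `mask_word` is a hypothetical helper, and the mask strings are passed in explicitly so the example runs without loading a tokenizer.

```python
def mask_word(text, word, mask_token):
    """Replace the first occurrence of `word` in `text` with `mask_token`."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the text")
    return text.replace(word, mask_token, 1)


sentence = "A data scientist who applies mathematical models to analyze patterns in abstract art"

print(mask_word(sentence, "patterns", "<mask>"))  # RoBERTa-style mask token
print(mask_word(sentence, "patterns", "[MASK]"))  # BERT-style mask token
```

In the notebook this would be called as `mask_word(sentence, "patterns", tokenizer.mask_token)`.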

In [33]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", model=model, tokenizer=tokenizer)
mask_filler(text, top_k=3)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'score': 0.07663597166538239,
  'token': 8117,
  'token_str': ' patterns',
  'sequence': 'A data scientist who applies mathematical models to analyze patterns in abstract art'},
 {'score': 0.03671473264694214,
  'token': 9126,
  'token_str': ' errors',
  'sequence': 'A data scientist who applies mathematical models to analyze errors in abstract art'},
 {'score': 0.030674902722239494,
  'token': 5550,
  'token_str': ' differences',
  'sequence': 'A data scientist who applies mathematical models to analyze differences in abstract art'}]
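
The `score` values above are probabilities over the whole vocabulary at the masked position. Conceptually, the model produces one logit per vocabulary entry, a softmax turns the logits into probabilities, and the `top_k` highest are returned. The sketch below shows that ranking step in plain Python. The five-word vocabulary and the logits are made up for illustration and are not taken from the model, and `top_k_from_logits` is a hypothetical helper.

```python
import math


def top_k_from_logits(logits, vocab, k=3):
    """Apply a softmax to the logits and return the k most probable tokens."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"score": probs[i], "token": i, "token_str": vocab[i]} for i in ranked]


toy_vocab = [" patterns", " errors", " differences", " art", " the"]
toy_logits = [2.0, 1.3, 1.1, 0.2, -0.5]

predictions = top_k_from_logits(toy_logits, toy_vocab, k=3)
for prediction in predictions:
    print(prediction)
```

With a real vocabulary of tens of thousands of tokens the probability mass is spread much more thinly, which is why a top score of about 0.077 can still be the model's first choice.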