# Fine-tuning BERT with Masked Language Modelling

## Setting Up the Environment and Preparing the Dataset

The first step is to install the necessary libraries:

In [1]:
!pip install datasets -q

Next, we import the required libraries:

In [2]:
from datasets import load_dataset
import torch
from tqdm.auto import tqdm
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of PyTorch's
from transformers import BertTokenizer, BertForMaskedLM
import pandas as pd
import warnings

torch.manual_seed(42)
warnings.filterwarnings("ignore")

Now, we load the model we want to fine-tune along with the corresponding tokenizer:

In [3]:
model = BertForMaskedLM.from_pretrained("bert-base-uncased", return_dict=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

The dataset used in this tutorial is andjela-r/mlm-harry-potter. It contains all seven books, with the rows modified to ensure that none exceed 512 tokens, which is the maximum input length for this model. The cell below loads the first 10% of the training split and converts it to a pandas DataFrame.

In [4]:
data = load_dataset("andjela-r/mlm-harry-potter", split="train[:10%]").to_pandas()
data.head(10)

| | text |
|---|---|
| 0 | Harry Potter and the Sorcerer's Stone |
| 1 | CHAPTER ONE |
| 2 | THE BOY WHO LIVED |
| 3 | Mr. and Mrs. Dursley, of number four, Privet D... |
| 4 | Mr. Dursley was the director of a firm called ... |
| 5 | The Dursleys had everything they wanted, but t... |
| 6 | When Mr. and Mrs. Dursley woke up on the dull,... |
| 7 | None of them noticed a large, tawny owl flutte... |
| 8 | At half past eight, Mr. Dursley picked up his ... |
| 9 | It was on the corner of the street that he not... |


## Preparing the Dataset for Training

The next step is to preprocess the dataset for training. This is done in a few steps:

In [5]:
prep_data = data['text'].astype(str).tolist()

Here, we convert the text column into a plain Python list of strings, one entry per dataset row, which the tokenizer accepts directly.

In [6]:
# Try augmenting the dataset
# prep_data = prep_data * 5

Now, we tokenize the text:

In [7]:
inputs = tokenizer(
    prep_data, max_length=512, truncation=True, padding=True, return_tensors='pt'
)

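Our inputs dictionary contains several keys:
* input_ids – The integer IDs of the tokens in the model's vocabulary
* token_type_ids – Segment IDs that tell the first sequence from the second in paired inputs; all zeros here because each row is a single sequence
* attention_mask – 1 for real tokens and 0 for padding, so the model knows which positions to attend to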
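To make these keys concrete, here is a minimal sketch of what padding does to a batch. It uses plain Python lists and a hypothetical helper, `pad_batch`, rather than the real tokenizer, and the token IDs are illustrative only.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should skip
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    # Single-sequence inputs use segment 0 at every position
    token_type_ids = [[0] * max_len for _ in sequences]
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# Two already-tokenized rows of different lengths (illustrative IDs)
batch = pad_batch([[101, 3127, 2028, 102], [101, 1996, 2879, 2040, 2973, 102]])
for key, rows in batch.items():
    print(key, rows)
```

The shorter row is extended with zeros in input_ids, and the matching positions in attention_mask are 0, which is the same pattern visible in the tensors printed further down.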

We now add an additional key, labels, which represents what the model should predict:

In [8]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [9]:
inputs['labels'] = inputs['input_ids'].detach().clone()

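This creates an independent copy of input_ids. The labels tensor keeps the original token IDs as prediction targets, so they stay intact when we later overwrite some positions in input_ids with [MASK].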
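The difference between a clone and a plain assignment can be seen on a toy tensor. This is a small sketch with made-up values, not the notebook's data:

```python
import torch

input_ids = torch.tensor([[101, 4302, 10693, 102]])

alias = input_ids                    # second name for the same storage
labels = input_ids.detach().clone()  # independent copy

input_ids[0, 1] = 103  # overwrite one position, as the masking step does later

print(alias[0, 1].item())   # the alias sees the change
print(labels[0, 1].item())  # the clone still holds the original ID
```

Without the clone, masking input_ids would also erase the targets the model is supposed to recover.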

In [10]:
inputs

{'input_ids': tensor([[  101,  4302, 10693,  ...,     0,     0,     0],
        [  101,  3127,  2028,  ...,     0,     0,     0],
        [  101,  1996,  2879,  ...,     0,     0,     0],
        ...,
        [  101,  2067,  2006,  ...,     0,     0,     0],
        [  101, 11867, 22494,  ...,     0,     0,     0],
        [  101,  2134,  1005,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[  101,  4302, 10693,  ...,     0,     0,     0],
        [  101,  3127,  2028,  ...,     0,     0,     0],
        [  101,  1996, 

Here, we can notice a few things:
* 101 – The special [CLS] token, placed at the start of every input sequence.
* 102 – The special [SEP] token, which marks the end of a sequence and separates the two segments in paired inputs.
* 0 – The padding token, added so that all sequences in the batch have the same length.

## Creating and Applying the Mask

First, we generate a random tensor of the same shape as inputs['input_ids']. This tensor will contain floating-point values between 0 and 1, which we will later use to determine which tokens should be masked.

In [11]:
random_tensor = torch.rand(inputs['input_ids'].shape)

In [12]:
random_tensor

tensor([[0.3652, 0.2347, 0.4906,  ..., 0.3116, 0.2113, 0.3886],
        [0.8105, 0.5732, 0.5176,  ..., 0.2954, 0.4166, 0.2893],
        [0.8042, 0.9128, 0.8691,  ..., 0.5848, 0.3568, 0.2125],
        ...,
        [0.3909, 0.7462, 0.1341,  ..., 0.3357, 0.0088, 0.0567],
        [0.4014, 0.7726, 0.3555,  ..., 0.5459, 0.1148, 0.6641],
        [0.7116, 0.8562, 0.6483,  ..., 0.8978, 0.5627, 0.1566]])

In [13]:
inputs['input_ids'].shape, random_tensor.shape

(torch.Size([5530, 307]), torch.Size([5530, 307]))

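Now, we create the mask. A position is selected when its random value falls below 0.15, so roughly 15% of the regular tokens are chosen. However, we must ensure that special tokens such as [CLS] (token ID 101), [SEP] (token ID 102), and padding tokens (token ID 0) are not masked. We achieve this by multiplying the boolean tensors together, which acts as an element-wise logical AND and filters those positions out: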

In [14]:
masked_tensor = (random_tensor < 0.15) * (inputs['input_ids'] != 101 ) * (inputs['input_ids'] != 102) * (inputs["input_ids"] != 0 )

In [15]:
masked_tensor

tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False,  True,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]])
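The same selection can be written with explicit boolean operators. The sketch below runs on a toy batch and uses `torch.isin` to collect the special-token check into one call; the variable names and token IDs other than 101, 102 and 0 are placeholders.

```python
import torch

torch.manual_seed(0)

# Toy batch: 101 = [CLS], 102 = [SEP], 0 = padding, the rest are regular tokens
input_ids = torch.tensor([
    [101, 2000, 2001, 2002, 2003, 102, 0, 0],
    [101, 2004, 2005, 102, 0, 0, 0, 0],
])
special_ids = torch.tensor([101, 102, 0])

random_tensor = torch.rand(input_ids.shape)
is_special = torch.isin(input_ids, special_ids)

# Selected for masking: random value below 0.15 AND not a special token
mask = (random_tensor < 0.15) & ~is_special
print(mask)

# On a large tensor the selected fraction settles near the threshold
fraction = (torch.rand(1000, 100) < 0.15).float().mean().item()
print(f"selected fraction: {fraction:.3f}")
```

Because each position is drawn independently, the share of masked tokens in any single short row can differ noticeably from 15%; only the average over many tokens approaches it.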

Finally, we collect the positions of all nonzero elements in the masked_tensor, which correspond to the tokens that have been selected for masking. We do this by iterating through each row and extracting the indices of nonzero values:

In [16]:
nonzero_indices = []
for i in range(len(inputs['input_ids'])):
  nonzero_indices.append(torch.flatten(masked_tensor[i].nonzero()).tolist())

In [17]:
len(nonzero_indices)

5530
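To see what the loop above produces, here is the same extraction on a small hand-written mask. The last two lines show `nonzero(as_tuple=True)`, which returns all positions at once as row and column index tensors; it is offered as an alternative, not as what this notebook uses.

```python
import torch

mask = torch.tensor([
    [False, True, False, True],
    [False, False, False, False],
    [True, False, False, False],
])

# Per-row extraction, as in the cell above
per_row = [torch.flatten(row.nonzero()).tolist() for row in mask]
print(per_row)  # [[1, 3], [], [0]]

# All positions at once as (row indices, column indices)
rows, cols = mask.nonzero(as_tuple=True)
print(rows.tolist(), cols.tolist())  # [0, 0, 2] [1, 3, 0]
```

A row with nothing selected yields an empty list, which is why the outer list always has one entry per sequence (5530 here).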

Once we have identified which tokens should be masked using nonzero_indices, we replace them with the [MASK] token, which has the token ID 103 in the bert-base-uncased vocabulary (also available as tokenizer.mask_token_id):

In [18]:
for i in range(len(inputs['input_ids'])):
  inputs['input_ids'][i, nonzero_indices[i]] = 103

In [19]:
inputs['input_ids']

tensor([[  101,  4302, 10693,  ...,     0,     0,     0],
        [  101,  3127,  2028,  ...,     0,     0,     0],
        [  101,  1996,  2879,  ...,     0,     0,     0],
        ...,
        [  101,  2067,   103,  ...,     0,     0,     0],
        [  101, 11867, 22494,  ...,     0,     0,     0],
        [  101,  2134,  1005,  ...,     0,     0,     0]])
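In this notebook, labels holds the original ID at every position, so the loss the model returns also covers unmasked tokens and padding. A common alternative, which the notebook does not use, is to set the label to -100 everywhere except at the masked positions. The Hugging Face documentation for the labels argument describes -100 as ignored, and it is the default `ignore_index` of PyTorch's cross-entropy. The sketch below uses a toy vocabulary and random logits in place of a real model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 10                                  # toy vocabulary, not BERT's 30522
original = torch.tensor([[1, 5, 6, 7, 2, 0]])    # toy IDs; 0 plays the role of padding
mask = torch.tensor([[False, True, False, True, False, False]])

input_ids = original.clone()
input_ids[mask] = 3                              # 3 plays the role of the [MASK] ID here

labels = original.clone()
labels[~mask] = -100                             # ignore every position that was not masked

logits = torch.randn(1, 6, vocab_size)           # stand-in for the model's output
loss_masked_only = F.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)

# Cross-check: average the per-token loss over the two masked positions by hand
per_token = F.cross_entropy(
    logits.view(-1, vocab_size), original.view(-1), reduction="none"
)
manual = per_token[mask.view(-1)].mean()
print(loss_masked_only.item(), manual.item())
```

Both numbers agree, which confirms that only the masked positions contribute once the other labels are set to -100.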

We now define a Dataset class, HPDataset, which allows us to handle our tokenized inputs properly when training using PyTorch's DataLoader.

In [20]:
class HPDataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  def __len__(self):
    return len(self.encodings['input_ids'])
  def __getitem__(self, index):
    return {key: val[index] for key, val in self.encodings.items()}

In [21]:
dataset = HPDataset(inputs)

In [22]:
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=16, # Each batch contains 16 sequences
    shuffle=True # Shuffle the data to improve training
)
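A quick way to check what the DataLoader yields is to run it over fake encodings and look at the batch shapes. `ToyDataset` below repeats the structure of HPDataset so the example runs on its own; the random IDs are placeholders, not real text.

```python
import torch

class ToyDataset(torch.utils.data.Dataset):
    """Same structure as HPDataset, redefined so this example is self-contained."""
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings['input_ids'])
    def __getitem__(self, index):
        return {key: val[index] for key, val in self.encodings.items()}

# 10 fake sequences of length 8
encodings = {
    'input_ids': torch.randint(1000, 2000, (10, 8)),
    'attention_mask': torch.ones(10, 8, dtype=torch.long),
}
encodings['labels'] = encodings['input_ids'].clone()

loader = torch.utils.data.DataLoader(ToyDataset(encodings), batch_size=4, shuffle=True)
shapes = [tuple(batch['input_ids'].shape) for batch in loader]
print(shapes)  # [(4, 8), (4, 8), (2, 8)]
```

The last batch is smaller because 10 is not a multiple of 4. By the same arithmetic, 5530 rows at a batch size of 16 give 346 batches per epoch.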

We check whether a GPU is available and select the device accordingly. The model itself is moved to this device at the start of the Training section:

In [23]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [24]:
device

device(type='cuda')

## Training

In [25]:
model.to(device) # Move the model to the device ("cpu" or "cuda")

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

In [26]:
epochs = 3 # The model will train for 3 full passes over the dataset.
optimizer = AdamW(model.parameters(), lr=1e-5)

In [27]:
model.train()

for epoch in range(epochs):
    loop = tqdm(dataloader)
    for batch in loop:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        loop.set_description("Epoch: {}".format(epoch))
        loop.set_postfix(loss=loss.item())

In [28]:
model.eval() # Puts the model in evaluation mode

# Example corpus
# Feel free to add your own sentences
test_corpus = [
    "Harry [MASK] is a wizzard.",

    "He pulled out the letter and read: \
    HOGWARTS SCHOOL of [MASK] and WIZARDRY Headmaster: ALBUS DUMBLEDORE (Order of Merlin, First Class, Grand Sorc., Chf. Warlock, \
    Supreme Mugwump, International Confed. of Wizards) Dear Mr. Potter, We are pleased to inform you that you have been accepted at Hogwarts \
    School of Witchcraft and Wizardry.",

    'I know that," said [MASK] McGonagall irritably. "But that\'s no reason to lose our heads. People are being downright \
    careless, out on the streets in broad daylight, not even dressed in Muggle clothes, swapping rumors.',

    "I'm sorry... You think that He-[MASK]-Must-Not-Be-Named is still alive, then?"
]

# Loop through each example sentence
for sentence in test_corpus:
    inputs = tokenizer(sentence, return_tensors='pt', max_length=512, truncation=True, padding=True)
    inputs = {key: val.to(device) for key, val in inputs.items()}

    masked_index = torch.where(inputs['input_ids'][0] == tokenizer.mask_token_id)[0].item()

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    predicted_token_id = logits[0, masked_index].argmax().item()
    predicted_token = tokenizer.decode([predicted_token_id])

    print(f"Original sentence: {sentence}")
    print(f"Predicted token: {predicted_token}")
    print("-" * 50)

Original sentence: Harry [MASK] is a wizzard.
Predicted token: potter
--------------------------------------------------
Original sentence: He pulled out the letter and read:     HOGWARTS SCHOOL of [MASK] and WIZARDRY Headmaster: ALBUS DUMBLEDORE (Order of Merlin, First Class, Grand Sorc., Chf. Warlock,     Supreme Mugwump, International Confed. of Wizards) Dear Mr. Potter, We are pleased to inform you that you have been accepted at Hogwarts     School of Witchcraft and Wizardry.
Predicted token: witchcraft
--------------------------------------------------
Original sentence: I know that," said [MASK] McGonagall irritably. "But that's no reason to lose our heads. People are being downright     careless, out on the streets in broad daylight, not even dressed in Muggle clothes, swapping rumors.
Predicted token: professor
--------------------------------------------------
Original sentence: I'm sorry... You think that He-[MASK]-Must-Not-Be-Named is still alive, then?
Predicted token: who
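The loop above reports only the single most likely token. To inspect the runners-up as well, the logits at the masked position can be passed through a softmax and `torch.topk`. The sketch below uses a hypothetical six-word vocabulary and hand-written logits so that it runs without the model.

```python
import torch

# Hypothetical vocabulary and logits for one [MASK] position
vocab = ["potter", "weasley", "granger", "malfoy", "dumbledore", "snape"]
mask_logits = torch.tensor([4.0, 2.5, 1.0, 0.5, 2.0, -1.0])

probs = torch.softmax(mask_logits, dim=-1)
top = torch.topk(probs, k=3)

for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{vocab[idx]}: {p:.3f}")
```

With the real model, `logits[0, masked_index]` would take the place of `mask_logits`, and `tokenizer.convert_ids_to_tokens` would map the returned indices back to tokens.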


## Calculating Accuracy

In [29]:
correct = 0
total = 0

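for sentence in prep_data:
    # Note: prep_data is the same text the model was fine-tuned on, so this is not a held-out score
    # Replace a random token with [MASK] and store the original token
    tokens = tokenizer.encode(sentence, return_tensors='pt')[0]
    # The index is drawn from the whole sequence, so it can also land on [CLS] or [SEP]
    masked_index = torch.randint(0, len(tokens), (1,)).item()
    original_token = tokens[masked_index].item()
    tokens[masked_index] = tokenizer.mask_token_id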

    inputs = {'input_ids': tokens.unsqueeze(0).to(device)}

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    predicted_token_id = logits[0, masked_index].argmax().item()

    if predicted_token_id == original_token:
        correct += 1
    total += 1

accuracy = correct / total
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 61.52%
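Because the evaluation cell picks the masked position from the whole sequence, some of the trials mask [CLS] or [SEP] rather than a word. One possible refinement is to restrict the draw to the positions in between. `pick_maskable_index` below is a hypothetical helper, not part of the notebook, and it assumes every encoded row starts with [CLS] and ends with [SEP].

```python
import torch

torch.manual_seed(0)

def pick_maskable_index(tokens):
    """Return a random position strictly between the first and last token.

    Hypothetical helper: assumes the sequence starts with [CLS], ends with
    [SEP], and has at least one token in between.
    """
    if len(tokens) < 3:
        raise ValueError("need at least one token between [CLS] and [SEP]")
    # torch.randint samples from [low, high), so high = len - 1 excludes [SEP]
    return torch.randint(1, len(tokens) - 1, (1,)).item()

tokens = torch.tensor([101, 3127, 2028, 102])  # illustrative IDs
picks = {pick_maskable_index(tokens) for _ in range(200)}
print(sorted(picks))
```

Swapping this in for the `torch.randint(0, len(tokens), (1,))` call would change the measured accuracy, so the 61.52% figure above applies only to the original cell.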
