# Introduction

This notebook explores how to fine-tune the BERT core model.

# Fine-Tuning with MLM

In [26]:
# Import Standard Libraries
import torch
from torch.optim import AdamW  # the AdamW re-export in transformers is deprecated
from tqdm import tqdm
from transformers import BertTokenizer, BertForMaskedLM

## Retrieve Loss (Theory)

In [2]:
# Instantiate the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Sample text with masks
text = ("After Abraham Lincoln won the November 1860 presidential [MASK] on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces [MASK] Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

# Get tokens and feed forward in the model
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
outputs.keys()

odict_keys(['logits'])

The model returns only the logits: we are using just the MLM head and have not passed any labels yet.
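
To make the shape of that output concrete: for `BertForMaskedLM` the logits hold one score per vocabulary entry for every position, i.e. shape `(batch_size, sequence_length, vocab_size)`. The sketch below uses a toy five-word vocabulary and hand-written scores (both invented for illustration, not real model output) to show how a prediction for a `[MASK]` position can be read off with an argmax.

```python
# Toy illustration only: the vocabulary and scores are made up,
# they are not produced by bert-base-uncased.
toy_vocab = ["[PAD]", "[MASK]", "election", "attacked", "platform"]

# One sequence of three positions, each with one score per vocabulary entry
toy_logits = [
    [0.1, 0.0, 0.3, 0.2, 2.5],   # position 0
    [0.0, 0.1, 4.2, 0.5, 0.3],   # position 1 (pretend this was [MASK])
    [0.2, 0.0, 0.4, 3.1, 0.1],   # position 2 (pretend this was [MASK])
]
masked_positions = [1, 2]


def predict_token(scores, vocab):
    """Return the vocabulary entry with the highest score (argmax)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return vocab[best_index]


predictions = {pos: predict_token(toy_logits[pos], toy_vocab) for pos in masked_positions}
print(predictions)
```

With the real model the same idea would be an argmax over the last dimension of `outputs.logits` at each masked position, followed by converting the resulting id back to a token with the tokenizer.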

In [4]:
inputs.input_ids

tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,   103,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,   103,  3481,  7680,  3334,  1999,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055, 17331,
          1012,   102]])

Let's proceed with an example of fine tuning by masking around 15% of the `inputs.input_ids`.

In [5]:
# Draw a random number in [0, 1) for each token and mask the tokens whose number is below 0.15
rand = torch.rand(inputs.input_ids.shape)

# Do not mask the special tokens [CLS] (101) and [SEP] (102)
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102)
mask_arr

tensor([[False, False,  True, False,  True, False,  True, False, False, False,
         False, False, False, False,  True, False, False, False,  True,  True,
         False, False, False, False, False, False,  True, False, False, False,
         False, False, False, False, False, False,  True, False, False,  True,
         False,  True, False, False,  True, False, False, False, False,  True,
          True, False, False, False, False, False, False, False, False, False,
          True, False]])

In [6]:
# Extract tokens to mask
selection = torch.flatten((mask_arr[0]).nonzero()).tolist()
selection

[2, 4, 6, 14, 18, 19, 26, 36, 39, 41, 44, 49, 50, 60]
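
Because every candidate token is masked independently with probability 0.15, the number of masked tokens varies from run to run. The arithmetic below is a small sanity check on the run above (62 tokens, 14 selected). It treats the count as binomial, which is an assumption that follows from the independent draws; the variable names are illustrative.

```python
import math

num_tokens = 62          # length of the sequence shown above
num_special = 2          # [CLS] (101) and [SEP] (102) are never masked
mask_probability = 0.15
observed_masked = 14     # length of `selection` in the run above

# Every non-special token is a candidate for masking
candidates = num_tokens - num_special

# Mean and standard deviation of a binomial(candidates, mask_probability)
expected = candidates * mask_probability
std_dev = math.sqrt(candidates * mask_probability * (1 - mask_probability))

print(f"candidates={candidates}, expected={expected:.1f}, std={std_dev:.2f}")
print(f"observed share={observed_masked / candidates:.1%}")
```

On a single short sequence the masked share can therefore differ noticeably from 15%; over a whole dataset it averages out.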

In [7]:
# Copy the token ids before masking so that the originals are kept as labels
inputs['labels'] = inputs.input_ids.detach().clone()
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,   103,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,   103,  3481,  7680,  3334,  1999,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055, 17331,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([

In [8]:
# Mask the selected tokens by replacing their ids with the [MASK] token id (103)
inputs.input_ids[0, selection] = 103
inputs

{'input_ids': tensor([[  101,  2044,   103,  5367,   103,  1996,   103,  7313,  4883,   103,
          2006,  2019,  3424,  1011,   103,  4132,  1010,  2019,   103,   103,
          6658,  2163,  4161,  2037, 22965,  2013,   103,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,   103,  2258,  6863,   103,
         22965,   103,  2749,   103,   103,  7680,  3334,  1999,  2148,   103,
           103,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055, 17331,
           103,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([

In [9]:
outputs = model(**inputs)
outputs.keys()

odict_keys(['loss', 'logits'])

Once the `labels` key is added to the input, the model returns not only the logits but also the loss, computed as the cross-entropy between the predicted logits and the `labels` values.
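
As a rough illustration of what that loss measures, here is a pure-Python sketch of a mean cross-entropy over token positions. It assumes the convention used by the Hugging Face MLM heads that positions labelled `-100` are skipped. The toy logits, labels and helper names are invented for illustration. In this notebook the labels are a full copy of `input_ids`, so every position contributes (the `all_labels` case); setting the labels of non-masked positions to `-100` is a common variation that restricts the loss to the masked tokens.

```python
import math

IGNORE_INDEX = -100  # labels with this value are skipped by the loss


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    highest = max(scores)
    log_norm = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_norm for s in scores]


def mean_cross_entropy(logits, labels):
    """Average negative log-likelihood over positions whose label is not ignored."""
    losses = [
        -log_softmax(position_scores)[label]
        for position_scores, label in zip(logits, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)


# Toy data: 3 positions, vocabulary of 4 ids (values invented for illustration)
toy_logits = [
    [5.0, 0.0, 0.0, 0.0],   # unmasked position, confidently correct
    [1.0, 1.0, 1.0, 1.0],   # masked position, model is unsure
    [0.0, 0.0, 6.0, 0.0],   # unmasked position, confidently correct
]
all_labels = [0, 3, 2]                           # labels copied from input_ids
masked_only = [IGNORE_INDEX, 3, IGNORE_INDEX]    # keep only the masked position

loss_all = mean_cross_entropy(toy_logits, all_labels)
loss_masked = mean_cross_entropy(toy_logits, masked_only)
print(f"all positions: {loss_all:.4f}, masked only: {loss_masked:.4f}")
```

The easy, unmasked positions pull the average down, which is why a loss computed over all positions looks lower than one computed over the masked positions alone.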

In [10]:
outputs.loss

tensor(1.7191, grad_fn=<NllLossBackward0>)

## Perform the Fine-Tuning

In [11]:
# Retrieve data for fine-tuning
with open('./../datasets/bert_fine_tuning/clean.txt', 'r') as file:
    text = file.read().split('\n')

In [12]:
text[:5]

['From my grandfather Verus I learned good morals and the government of my temper.',
 'From the reputation and remembrance of my father, modesty and a manly character.',
 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.',
 'From my great-grandfather, not to have frequented public schools, and to have had good teachers at home, and to know that on such things a man should spend liberally.',
 "From my governor, to be neither of the green nor of the blue party at the games in the Circus, nor a partizan either of the Parmularius or the Scutarius at the gladiators' fights; from him too I learned endurance of labour, and to want little, and to work with my own hands, and not to meddle with other people's affairs, and not to be ready to listen to slander."]

In [13]:
# Instantiate the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
# Get tokens
# NOTE: use max_length, truncation and padding because we now tokenize many sentences of different lengths
inputs = tokenizer(text, 
                   return_tensors='pt', 
                   max_length=512, 
                   truncation=True, 
                   padding='max_length')

In [15]:
# 507 sentences with 512 tokens each
inputs.input_ids.shape

torch.Size([507, 512])
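
Every row has the same length because the tokenizer truncates long inputs and pads short ones. The sketch below mimics that behaviour on plain lists of ids. `encode_fixed_length` is a hypothetical helper written for illustration, not the tokenizer's implementation; the ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids of bert-base-uncased


def encode_fixed_length(content_ids, max_length):
    """Wrap ids in [CLS]/[SEP], truncate to max_length, then pad with [PAD].

    Hypothetical helper for illustration; the real work is done by the tokenizer.
    """
    # Truncation: leave room for the two special tokens
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]

    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(input_ids)

    # Padding: fill up to max_length
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding
    return input_ids, attention_mask


short_ids, short_mask = encode_fixed_length([2044, 8181, 5367], max_length=8)
long_ids, long_mask = encode_fixed_length(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padded positions carry id `0`, which is why the masking step also excludes `inputs.input_ids != 0`.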

In [16]:
# Define the labels
inputs['labels'] = inputs.input_ids.detach().clone()

In [17]:
# Create the mask for some of the input tokens for training
rand = torch.rand(inputs.input_ids.shape)
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102) * (inputs.input_ids != 0)

In [18]:
# Retrieve the tokens that have to be masked
selection = []

# Loop through sentences
for i in range(inputs.input_ids.shape[0]):

    # Append the list of tokens to be masked for the sentence 'i'
    selection.append(
        torch.flatten(mask_arr[i].nonzero()).tolist()
    )

In [19]:
# Apply the mask
for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = 103

Now the training dataset is prepared with the masked input and the labels. We're going to use a Data Loader in order to feed the data during the training process.

In [20]:
class MeditationsDataset(torch.utils.data.Dataset):
    """
    Dataset class
    """
    def __init__(self, encodings):
        # Encodings are our "inputs" tensors
        self.encodings = encodings

    def __getitem__(self, idx):
        # Get the sentence at index 'idx'
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    
    def __len__(self):
        return len(self.encodings.input_ids)

In [21]:
# Initialise the dataset and create the PyTorch Data Loader from that
dataset = MeditationsDataset(inputs)
loader = torch.utils.data.DataLoader(dataset, batch_size=126, shuffle=True)
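
To see what this loader produces per epoch, the sketch below reproduces the batching arithmetic in plain Python: shuffle the sample indices, then cut them into chunks of `batch_size`. It is only a model of the behaviour, assuming the `DataLoader` default of keeping the last, smaller batch; the seed is arbitrary.

```python
import math
import random

num_samples = 507   # number of sequences in the dataset
batch_size = 126    # value passed to the DataLoader

# Shuffle the sample indices, then cut them into consecutive chunks
rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
indices = list(range(num_samples))
rng.shuffle(indices)
batches = [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]

batch_sizes = [len(batch) for batch in batches]
print(len(batches), batch_sizes)
```

With these numbers an epoch consists of five steps, four full batches and a final batch of three sequences.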

In [22]:
# Use CUDA if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_a

In [23]:
# Activate training mode
model.train()

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_a

In [25]:
# Initialize optimizer
optim = AdamW(model.parameters(), lr=5e-5)



In [27]:
epochs = 2

for epoch in range(epochs):

    # Use tqdm for a progress bar
    loop = tqdm(loader, leave=True)

    for batch in loop:

        # Reset gradients
        optim.zero_grad()

        # Retrieve data for training and load them to the chosen device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Feed forward
        outputs = model(input_ids, 
                        attention_mask=attention_mask,
                        labels=labels)
        # Extract loss
        loss = outputs.loss

        # Compute gradients
        loss.backward()

        # Update weights
        optim.step()

        # Print information on loading bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
  0%|          | 0/5 [01:54<?, ?it/s]


KeyboardInterrupt: 

# Hugging Face Trainer

In [11]:
# Import Standard Libraries
from transformers import BertTokenizer, BertForMaskedLM, TrainingArguments, Trainer
import torch



In [2]:
# Instantiate the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
# Retrieve data for fine-tuning
with open('./../datasets/bert_fine_tuning/clean.txt', 'r') as file:
    text = file.read().split('\n')

In [4]:
# Get tokens
# NOTE: use max_length, truncation and padding because we now tokenize many sentences of different lengths
inputs = tokenizer(text, 
                   return_tensors='pt', 
                   max_length=512, 
                   truncation=True, 
                   padding='max_length')

In [5]:
# Define the labels
inputs['labels'] = inputs.input_ids.detach().clone()

In [6]:
# Create the mask for some of the input tokens for training
rand = torch.rand(inputs.input_ids.shape)
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102) * (inputs.input_ids != 0)

# Retrieve the tokens that have to be masked
selection = []

# Loop through sentences
for i in range(inputs.input_ids.shape[0]):

    # Append the list of tokens to be masked for the sentence 'i'
    selection.append(
        torch.flatten(mask_arr[i].nonzero()).tolist()
    )
    
# Apply the mask
for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = 103

In [7]:
class MeditationsDataset(torch.utils.data.Dataset):
    """
    Dataset class
    """
    def __init__(self, encodings):
        # Encodings are our "inputs" tensors
        self.encodings = encodings

    def __getitem__(self, idx):
        # Get the sentence at index 'idx'
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    
    def __len__(self):
        return len(self.encodings.input_ids)

In [8]:
# Initialise the dataset
dataset = MeditationsDataset(inputs)

In [10]:
# Initialise object for model training arguments
args = TrainingArguments(
    output_dir='out',
    per_device_train_batch_size=16,
    num_train_epochs=2
)

In [12]:
# Initialise trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset
)

In [None]:
# Fit the model
trainer.train()

# Fine-Tuning with NSP

## Retrieve Loss (Theory)

In [26]:
# Import Standard Libraries
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch
import random

In [14]:
# Instantiate the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

In [15]:
# Define input texts
text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy.")
text2 = ("War broke out in April 1861 when secessionist forces attacked Fort "
         "Sumter in South Carolina, just over a month after Lincoln's "
         "inauguration.")

In [16]:
# Tokenize
inputs = tokenizer(text, text2, return_tensors='pt')

In [17]:
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

The two sentences are separated by the special `[SEP]` token (id `102`) in `input_ids`, and `token_type_ids` marks sentence A with `0` and sentence B with `1`.
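
The relationship between the two keys can be sketched in a few lines: everything up to and including the first `[SEP]` belongs to segment 0, everything after it to segment 1. `segment_ids` is a hypothetical helper for illustration; the tokenizer computes `token_type_ids` itself.

```python
SEP_ID = 102  # id of [SEP] in bert-base-uncased


def segment_ids(input_ids):
    """Assign 0 to sentence A (up to and including its [SEP]) and 1 to sentence B."""
    segments = []
    current_segment = 0
    for token_id in input_ids:
        segments.append(current_segment)
        # After the first [SEP], every following token belongs to sentence B
        if token_id == SEP_ID and current_segment == 0:
            current_segment = 1
    return segments


# Shortened toy pair: [CLS] a a a [SEP] b b [SEP]
toy_input_ids = [101, 2044, 8181, 5367, 102, 2162, 3631, 102]
print(segment_ids(toy_input_ids))
```

This matches the pattern in the output above, where the zeros run through the first `102` and the ones cover the second sentence and its closing `102`.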

In [18]:
# Define labels: 0 means sentence B follows sentence A (IsNext), 1 means it does not (NotNext)
labels = torch.LongTensor([0])

In [21]:
# Feed forward
outputs = model(**inputs, labels=labels)
outputs.keys()

odict_keys(['loss', 'logits'])

In [20]:
outputs.loss

tensor(3.2186e-06, grad_fn=<NllLossBackward0>)
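
A loss this small means the model is almost certain that the second sentence follows the first. For a single example the cross-entropy loss is the negative log-probability of the labelled class, so the probability can be recovered by exponentiating. The snippet below is plain arithmetic on the printed value, with illustrative variable names.

```python
import math

nsp_loss = 3.2186e-06   # value of outputs.loss printed above

# For one example, loss = -log(p_label), so p_label = exp(-loss)
probability_is_next = math.exp(-nsp_loss)
print(f"P(IsNext) = {probability_is_next:.7f}")
```

The recovered probability is roughly 0.999997 for label `0` (IsNext).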

## Perform the Fine-Tuning

In [22]:
# Retrieve data for fine-tuning
with open('./../datasets/bert_fine_tuning/clean.txt', 'r') as file:
    text = file.read().split('\n')

For each paragraph in the text, we need to construct sentences.

In [24]:
# Split on dot to retrieve single sentences
text[51].split('.')

['Body, soul, intelligence: to the body belong sensations, to the soul appetites, to the intelligence principles',
 ' To receive the impressions of forms by means of appearances belongs even to animals; to be pulled by the strings of desire belongs both to wild beasts and to men who have made themselves into women, and to a Phalaris and a Nero: and to have the intelligence that guides to the things which appear suitable belongs also to those who do not believe in the gods, and who betray their country, and do their impure deeds when they have shut the doors',
 ' If then everything else is common to all that I have mentioned, there remains that which is peculiar to the good man, to be pleased and content with what happens, and with the thread which is spun for him; and not to defile the divinity which is planted in his breast, nor disturb it by a crowd of images, but to preserve it tranquil, following it obediently as a god, neither saying anything contrary to the truth, nor doing anyth

To construct the dataset, we let 50% of the sentences be followed by their actual next sentence and 50% by a random sentence, so that the label distribution is balanced.

In [25]:
# Create a bag of all sentences from which to draw the random Sentence B for 50% of the dataset
bag = [item for sentence in text for item in sentence.split('.') if item != '']
bag_size = len(bag)

In [27]:
# Let's construct our dataset
sentence_a = []  # List of Sentences A
sentence_b = []  # List of Sentences B paired with each Sentence A
label = []  # List of labels: 0 = IsNextSentence, 1 = NotNextSentence

# Fetch each paragraph in the text
for paragraph in text:

    # Retrieve the sentences
    sentences = [
        sentence for sentence in paragraph.split('.') if sentence != ''
    ]
    # Compute the number of sentences for that paragraph
    num_sentences = len(sentences)

    # Discard paragraphs with just one sentence
    if num_sentences > 1:

        # Choose a random sentence to start with (exclude the last one so there is always a Sentence B to pick)
        start = random.randint(0, num_sentences-2)

        # Random choice between the actual Sentence B and a random one from the bag
        if random.random() >= 0.5:

            # This is IsNextSentence
            sentence_a.append(sentences[start])
            sentence_b.append(sentences[start+1])
            label.append(0)
        else:

            # Pick random sentence index from the bag
            index = random.randint(0, bag_size-1)
            
            # This is NotNextSentence
            sentence_a.append(sentences[start])
            sentence_b.append(bag[index])
            label.append(1)
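
One subtlety of drawing from the bag is that the random sentence can happen to be the true next sentence, which would give a NotNextSentence label to a genuine continuation. The sketch below is a variation on the loop above, not the notebook's code: `make_nsp_pairs` is a hypothetical helper that strips whitespace, uses a local seeded generator, and excludes the true continuation from the random draw. The toy paragraphs are invented for illustration.

```python
import random


def make_nsp_pairs(paragraphs, seed=0):
    """Build (sentence_a, sentence_b, label) triples; 0 = IsNext, 1 = NotNext.

    Hypothetical helper: a variation on the loop above, not the notebook's code.
    """
    rng = random.Random(seed)  # local generator so results are reproducible

    # Split every paragraph into stripped, non-empty sentences
    split = [[s.strip() for s in p.split('.') if s.strip()] for p in paragraphs]
    bag = [sentence for sentences in split for sentence in sentences]

    pairs = []
    for sentences in split:
        if len(sentences) < 2:
            continue  # a pair needs at least two sentences
        start = rng.randint(0, len(sentences) - 2)  # randint is inclusive
        true_next = sentences[start + 1]
        if rng.random() >= 0.5:
            pairs.append((sentences[start], true_next, 0))
        else:
            # Draw only from sentences that are not the true continuation
            candidates = [s for s in bag if s != true_next]
            pairs.append((sentences[start], rng.choice(candidates), 1))
    return pairs


toy_paragraphs = [
    "Alpha one. Alpha two. Alpha three.",
    "Beta one. Beta two.",
    "Gamma only",
    "Delta one. Delta two. Delta three. Delta four.",
]
pairs = make_nsp_pairs(toy_paragraphs)
for sentence_a, sentence_b, label in pairs:
    print(label, '|', sentence_a, '|', sentence_b)
```

The single-sentence paragraph is skipped, so the four toy paragraphs yield three pairs.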

In [28]:
# Check a few training points
for i in range(3):
    print(label[i])
    print(sentence_a[i] + '\n---')
    print(sentence_b[i] + '\n')

1
From Maximus I learned self-government, and not to be led aside by anything; and cheerfulness in all circumstances, as well as in illness; and a just admixture in the moral character of sweetness and dignity, and to do what was set before me without complaining
---
 To that place then we must remove, where there are so many great orators, and so many noble philosophers, Heraclitus, Pythagoras, Socrates; so many heroes of former days, and so many generals after them, and tyrants; besides these, Eudoxus, Hipparchus, Archimedes, and other men of acute natural talents, great minds, lovers of labour, versatile, confident, mockers even of the perishable and ephemeral life of man, as Menippus and such as are like him

1
 He took a reasonable care of his body's health, not as one who was greatly attached to life, nor out of regard to personal appearance, nor yet in a careless way, but so that, through his own attention, he very seldom stood in need of the physician's art or of medicine or ex

In [29]:
# Tokenize the dataset
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt', max_length=512, truncation=True, padding='max_length')