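# Fine-tuning a Masked Language Model
* A masked language model (MLM) predicts the tokens hidden behind `[MASK]`: it fills in the blanks of a sentence using the context on both sides, rather than predicting the next word.
* MLM is commonly used as the objective when fine-tuning a pretrained model on domain-specific data.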

In [1]:
!pip install git+https://github.com/huggingface/transformers
!pip install datasets

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-s4bbpw80
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-s4bbpw80
  Resolved https://github.com/huggingface/transformers to commit 39c3c0a72af6fbda5614dde02ff236069bb79827
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"                                    # model checkpoint
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)                  # loading for masking



In [3]:
#@ Comparing parameters
# DistilBERT is produced by knowledge distillation from a larger model, i.e. BERT
distilbert_num_parameters = model.num_parameters() / 1_000_000                                          # DistilBERT parameters, in millions
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")                    # round and print
print(f"'>>> BERT number of parameters: 110M'")                                                         # BERT, the teacher model used for distillation

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


In [4]:
text = "This is a great [MASK]."                                                                       # sample text for masking

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)                                            # loading tokenizer

In [6]:
import torch
inputs = tokenizer(text, return_tensors =  "pt")
token_logits = model(**inputs).logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]                         # find the position of [MASK]
mask_token_logits = token_logits[0, mask_token_index, :]                                                  # extract the logits at that position
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()                                # pick the [MASK] candidates with the highest logits
for token in top_5_tokens:                                                                                # iterate over the top 5 tokens
  print(f"'>>>{text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")                          # substitute each candidate for [MASK] in the text

'>>>This is a great deal.'
'>>>This is a great success.'
'>>>This is a great adventure.'
'>>>This is a great idea.'
'>>>This is a great feat.'
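
The cell above ranks candidates by raw logits. To read those scores as probabilities, a softmax can be applied first. The following is a minimal sketch in plain Python; the five-word vocabulary and the logit values are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a tiny 5-word vocabulary (illustrative values only).
vocab = ["deal", "success", "adventure", "idea", "banana"]
mask_logits = [6.2, 5.9, 5.7, 5.6, -1.0]

probs = softmax(mask_logits)

# Keep the 3 most probable candidates, highest first.
top_k = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)[:3]
for word, prob in top_k:
    print(f"{word}: {prob:.3f}")
```

Because softmax preserves ordering, the ranking is the same as with `torch.topk` on the logits; only the scale of the scores changes.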


# Domain Adaptation: Making DistilBERT more specific to movies by fine-tuning on the IMDb dataset

In [7]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")                                # loading the IMDb dataset
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [8]:
# Inspecting a few random samples
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))
for row in sample:
  print(f"Review : {row['text']}")
  print(f"Label : {row['label']}")

Review : There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
Label : 1
Review : This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on th

In [9]:
# Inspecting the unsupervised split (labels are -1 because these reviews are unlabeled)
samples = imdb_dataset["unsupervised"].shuffle(seed=42).select(range(3))
for row in samples:
  print(f"review : {row['text']}")
  print(f"label : {row['label']}" )

review : If you've seen the classic Roger Corman version starring Vincent Price it's hard to put it out of your head, but you probably should do because this one is totally different. Subtlety has been abandoned in favour of gross-out horror - nudity, gore and all-round unpleasantness. OK it's ridiculous, trashy, sensationalised and historically dubious (did any members of the Inquisition really wear horn-rimmed glasses?), but despite all this it is strangely compelling. I literally couldn't tear myself away from the screen until the end of the movie. If there's a bigger compliment you can pay to a film I don't know what it is.
label : -1
review : For me, this was the most moving film of the decade. Samira Makhmalbaf shows pure bravery and vision in the making. She has an intelligence and gift for speaking to the people, regardless of their nationality or beliefs. I am inspired and touched by her humanity and can only hope that she has touched many people the same way. Her message in t

In [10]:
#@ Preprocessing the data
# tokenizing datasets
def tokenize_function(examples):
    result = tokenizer(examples["text"])                                                     # tokenize the texts
    if tokenizer.is_fast:                                                                    # word IDs are only available with a fast tokenizer
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]   # word index of each token, used later for whole word masking
    return result



tokenized_datasets = imdb_dataset.map(
    tokenize_function,                                                                      # tokenizing function
    batched=True,                                                                           # Use batched=True to activate fast multithreading!
    remove_columns=["text", "label"]                                                        # remove unnecessary columns
)
tokenized_datasets

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (532 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

In [11]:
!pip install sentencepiece



In [12]:
# checking the maximum context length of the model
print(tokenizer.model_max_length)



512


In [13]:
# selecting a chunk size small enough to fit in GPU memory
chunk_size = 128

In [14]:
# printing the number of tokens per review for a few training samples
tokenized_samples = tokenized_datasets["train"][:3]
for idx,sample in enumerate(tokenized_samples["input_ids"]):
  print(f"'>>>Review {idx} length : {len(sample)}'")

'>>>Review 0 length : 363'
'>>>Review 1 length : 304'
'>>>Review 2 length : 133'


In [15]:
# concatenating the samples into one long sequence so that we can split it into chunks of equal length
concatenated_samples = {k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()}
total_length = len(concatenated_samples["input_ids"])
print(total_length)

800


In [16]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]                 # slicing the sequence into chunks with a list comprehension
    for k,t in concatenated_samples.items()
}
for chunk in chunks["input_ids"]:                                                          # printing result
  print(f">>Chunk length : {len(chunk)}")

>>Chunk length : 128
>>Chunk length : 128
>>Chunk length : 128
>>Chunk length : 128
>>Chunk length : 128
>>Chunk length : 128
>>Chunk length : 32
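
The chunk counts above follow directly from the review lengths. As a quick check of the arithmetic, using the three lengths printed earlier and `chunk_size = 128`:

```python
# Token counts of the three sample reviews and the chunk size chosen earlier.
lengths = [363, 304, 133]
chunk_size = 128

total_length = sum(lengths)                                   # tokens after concatenation
full_chunks, remainder = divmod(total_length, chunk_size)     # full chunks and leftover tokens

print(f"total tokens : {total_length}")
print(f"full chunks  : {full_chunks}")
print(f"leftover     : {remainder}")
```

The leftover tokens form the short final chunk. It can either be padded up to `chunk_size` or dropped so that every training example has the same length.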


In [17]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
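
To see what this function does without running it on the full dataset, here is a small self-contained sketch. `group_texts_demo` is a hypothetical copy of the logic above that takes `chunk_size` as an argument, and the toy batch is invented for illustration.

```python
def group_texts_demo(examples, chunk_size):
    """Same logic as group_texts, with chunk_size passed in explicitly."""
    # Concatenate every column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of chunk_size, which drops the short tail.
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # For MLM the labels start out as a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

# Three toy "reviews" with 3, 4 and 3 tokens (10 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}

grouped = group_texts_demo(toy_batch, chunk_size=4)
print(grouped["input_ids"])
print(grouped["labels"])
```

With 10 tokens and a chunk size of 4, two full chunks are kept and the last two tokens are discarded. Note that chunks can span review boundaries, and that with `batched=True` the tail is dropped once per batch rather than once for the whole dataset.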

In [18]:
# grouping the tokenized datasets into chunks of equal length
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [19]:
#@ Finetuning distilbert
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)                          # mlm_probability is the fraction of tokens to mask

In [20]:
samples = [lm_datasets["train"][i] for i in range(2)]                                   # select a few text chunks
for sample in samples:
  _ = sample.pop("word_ids")                                                            # remove word IDs, which this collator does not expect
for chunk in data_collator(samples)["input_ids"]:                                       # randomly mask about 15% of the tokens in each chunk
  print(f">>>{tokenizer.decode(chunk)}")                                                # decode the masked IDs back to text

>>>[CLS] i rented i [MASK] curious - [MASK] from my video store because of all the controversy that surrounded it when it was first released in 1967.ل also heard that at first it was seized by u [MASK] s. customs if it [MASK] [MASK] to enter this country, therefore being [MASK] fan of films considered " controversial " i really had to see this for myself [MASK] < br / > < [MASK] / > the plot is centered around a young swedish drama student named ᵍ who [MASK] to postseason everything she [MASK] about [MASK]. in particular [MASK] wants to focus her attentions to making [MASK] sort of documentary on what the average swede thought about certain political issues [MASK]
>>>as the vietnam war [MASK] race issues in the united states. in [MASK] asking politicians [MASK] ordinary denizens of stockholm about their opinions on [MASK], she has sex with her drama teacher, classmates, and married men. < br / > < br / [MASK] what kills me about i am curious - yellow [MASK] that 40 [MASK] ago [MASK] th
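
Not every selected token becomes `[MASK]`. By default `DataCollatorForLanguageModeling` replaces about 80% of the selected tokens with `[MASK]`, swaps about 10% for a random token, and leaves about 10% unchanged, which is why stray tokens such as "postseason" appear in the output above. The sketch below imitates that rule in plain Python. It is a simplified illustration, not the library's implementation: it does not skip special tokens, and `MASK_ID` and `VOCAB_SIZE` are assumed values for the `distilbert-base-uncased` vocabulary.

```python
import random

MASK_ID = 103          # assumed id of [MASK] in the uncased BERT vocabulary
VOCAB_SIZE = 30522     # assumed vocabulary size
IGNORE_INDEX = -100    # label value ignored by the cross-entropy loss

def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) following the 80/10/10 rule."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue                                  # token not selected for prediction
        labels[i] = token_id                          # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                       # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)     # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

token_ids = list(range(1000, 1200))                   # 200 dummy token ids
masked_ids, labels = mask_tokens(token_ids, rng=random.Random(0))

selected = sum(label != IGNORE_INDEX for label in labels)
print(f"selected {selected} of {len(token_ids)} tokens for prediction")
```

Only the selected positions carry a real label, so the loss is computed on roughly 15% of the tokens in each chunk.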

In [37]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2                                                           # mask 20% of the words


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")                                      # remove word IDs before collating

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
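
The word-to-token mapping is the core of whole word masking. Because chunks can span two reviews, the raw word IDs may restart from 0 in the middle of a chunk, which is why the collator renumbers words with its own counter. A minimal sketch of that step, on an invented `word_ids` list (the values are hypothetical, chosen to show a review boundary):

```python
import collections

# Hypothetical word IDs for one chunk: the end of one review (words 5 and 6),
# two special tokens (None), then the start of the next review (words 0 and 1).
word_ids = [5, 5, 6, None, None, 0, 0, 1]

mapping = collections.defaultdict(list)
current_word_index = -1
current_word = None
for idx, word_id in enumerate(word_ids):
    if word_id is not None:
        if word_id != current_word:
            # A new word starts here, so give it the next running index.
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)

for word_index, token_positions in mapping.items():
    print(f"word {word_index} -> token positions {token_positions}")
```

Masking word 0 in this example would mask token positions 0 and 1 together, so a word split into several subword tokens is always hidden as a unit.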

In [22]:
# downsample the dataset to a smaller size
train_size = 10_000                                                             # select only 10k training chunks
test_size = int(0.1 * train_size)                                               # test set is 10% of the train size

downsampled_dataset = lm_datasets["train"].train_test_split(                    # creating train test split
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset                                                             # inspecting results

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [23]:
# logging in to the Hugging Face Hub

In [24]:
from huggingface_hub import notebook_login

notebook_login()


In [25]:
!pip install accelerate
!pip install torch



In [26]:
import torch
import accelerate

In [27]:
#@ Creating training arguments
from transformers import TrainingArguments
batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)
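
The step counts implied by these settings can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default of 3 epochs, since `num_train_epochs` is not set above.

```python
import math

train_examples = 10_000     # size of the downsampled training split
batch_size = 64
num_epochs = 3              # assumed: default when num_train_epochs is not given

logging_steps = train_examples // batch_size                 # value passed to TrainingArguments
steps_per_epoch = math.ceil(train_examples / batch_size)     # the last, smaller batch still counts as a step
total_steps = steps_per_epoch * num_epochs

print(f"logging_steps   : {logging_steps}")
print(f"steps per epoch : {steps_per_epoch}")
print(f"total steps     : {total_steps}")
```

The total agrees with the `global_step=471` reported by the training run in this notebook.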

In [28]:
#@ preparing the trainer for the model
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [29]:
# starting training loop
trainer.train()

Epoch,Training Loss,Validation Loss
1,2.6959,2.54404
2,2.5693,2.463611
3,2.5434,2.424867


TrainOutput(global_step=471, training_loss=2.6019686235237525, metrics={'train_runtime': 178.5298, 'train_samples_per_second': 168.039, 'train_steps_per_second': 2.638, 'total_flos': 994208670720000.0, 'train_loss': 2.6019686235237525, 'epoch': 3.0})

In [30]:
#@ evaluating with perplexity, which is the exponential of the cross-entropy loss
import math
eval_results = trainer.evaluate()
print(f">>> Perplexity : {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity : 11.60
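
Applying the same formula to the validation losses logged during training shows how perplexity falls from epoch to epoch. The loss values are copied from the training table in this notebook.

```python
import math

# Validation loss per epoch, copied from the training log.
validation_losses = {1: 2.54404, 2: 2.463611, 3: 2.424867}

# Perplexity is exp(cross-entropy loss).
perplexities = {epoch: math.exp(loss) for epoch, loss in validation_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: loss {validation_losses[epoch]:.4f} -> perplexity {ppl:.2f}")
```

The epoch 3 value will not match the 11.60 printed above exactly. A likely reason is that the data collator draws a fresh random mask on every evaluation pass, so `eval_loss` varies slightly between runs; masking the evaluation set once and reusing it would make the metric reproducible.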


In [None]:
# pushing the model to the Hub
trainer.push_to_hub("finetuned MLM on IMDb dataset")

In [53]:
#@ Using finetuned model
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="Utshav/distilbert-base-uncased-finetuned-imdb"
)

In [55]:
preds = mask_filler("I love this [MASK]")

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> i love this.
>>> i love this!
>>> i love this song
>>> i love this movie
>>> i love this one
