## Fine-tuning a masked language model (domain adaptation)
There are a few cases where you'll want to first fine-tune the language model on your data before training a task-specific head. For example, if your dataset contains legal contracts or scientific articles, a vanilla Transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. By fine-tuning the language model on in-domain data you can boost the performance of many downstream tasks, which means you usually only have to do this step once!

This process of fine-tuning a pretrained language model on in-domain data is usually called domain adaptation. It was popularized in 2018 by ULMFiT, which was one of the first neural architectures (based on LSTMs) to make transfer learning really work for NLP.

In [1]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)


In [2]:
# Check the number of parameters in the model
distilbert_num_parameters = model.num_parameters()
print(f"DistilBERT number of parameters: {round(distilbert_num_parameters / 1_000_000)}M")
print("BERT number of parameters: 110M")

DistilBERT number of parameters: 67M
BERT number of parameters: 110M


In [3]:
model.device

device(type='cpu')

In [4]:
text = "This is a great [MASK]."

In [5]:
tokenizer(text)

{'input_ids': [101, 2023, 2003, 1037, 2307, 103, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
tokenizer.decode(tokenizer(text)["input_ids"])

'[CLS] this is a great [MASK]. [SEP]'

In [7]:
import torch

# Use `inputs` rather than `input` to avoid shadowing the Python builtin
inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
token_logits.shape

torch.Size([1, 8, 30522])

In [8]:
# Find the position of [MASK] in the sequence and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
mask_token_logits.shape


torch.Size([1, 30522])

In [9]:
mask_token_logits[0]

tensor([-4.8228, -4.6268, -5.1041,  ..., -4.2771, -5.0184, -3.9428],
       grad_fn=<SelectBackward0>)

In [10]:
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f" {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

 This is a great deal.
 This is a great success.
 This is a great adventure.
 This is a great idea.
 This is a great feat.


In [11]:
torch.topk(mask_token_logits, 5, dim=1)

torch.return_types.topk(
values=tensor([[7.0727, 6.6514, 6.6425, 6.2530, 5.8618]], grad_fn=<TopkBackward0>),
indices=tensor([[3066, 3112, 6172, 2801, 8658]]))
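
Logits are unnormalized scores, so they are easier to compare after a softmax. The sketch below is only an illustration: it renormalizes the five logit values printed above among themselves, so the resulting weights are relative to these five candidates and are not probabilities over the full 30,522-token vocabulary. The names `top_5` and `relative_weights` are hypothetical.

```python
import math

# Top-5 logits and the words they decode to, copied from the outputs above.
top_5 = {
    "deal": 7.0727,
    "success": 6.6514,
    "adventure": 6.6425,
    "idea": 6.2530,
    "feat": 5.8618,
}

# Numerically stable softmax restricted to these five candidates:
# subtract the maximum before exponentiating, then normalize.
max_logit = max(top_5.values())
exp_scores = {word: math.exp(value - max_logit) for word, value in top_5.items()}
total = sum(exp_scores.values())
relative_weights = {word: score / total for word, score in exp_scores.items()}

for word, weight in relative_weights.items():
    print(f"{word}: {weight:.3f}")
```

To get true probabilities, the softmax would have to be applied to the full `mask_token_logits` row instead of only the top five values.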

In [12]:
len(tokenizer.get_vocab().keys())

30522

In [13]:
# Load the IMDb dataset

from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [14]:
imdb_dataset["unsupervised"].features

{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}

In [15]:
sample_data =  imdb_dataset["train"].shuffle(seed=42).select(range(4))
for x in sample_data:
    print(f"Review: {x['text']}")
    print(f"Sentiment: {x['label']}")
    print("\n\n")

Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
Sentiment: 1



Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe o

In [16]:
# Sample from the unlabeled ("unsupervised") split, where every label is -1
sample_data =  imdb_dataset["unsupervised"].shuffle(seed=42).select(range(4))
for x in sample_data:
    print(f"Review: {x['text']}")
    print(f"Sentiment: {x['label']}")
    print("\n\n")

Review: If you've seen the classic Roger Corman version starring Vincent Price it's hard to put it out of your head, but you probably should do because this one is totally different. Subtlety has been abandoned in favour of gross-out horror - nudity, gore and all-round unpleasantness. OK it's ridiculous, trashy, sensationalised and historically dubious (did any members of the Inquisition really wear horn-rimmed glasses?), but despite all this it is strangely compelling. I literally couldn't tear myself away from the screen until the end of the movie. If there's a bigger compliment you can pay to a film I don't know what it is.
Sentiment: -1



Review: For me, this was the most moving film of the decade. Samira Makhmalbaf shows pure bravery and vision in the making. She has an intelligence and gift for speaking to the people, regardless of their nationality or beliefs. I am inspired and touched by her humanity and can only hope that she has touched many people the same way. Her message 

In [17]:
# tokenization
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

In [18]:
print(tokenizer.model_max_length)

512


In [19]:
chunk_size = 128

In [20]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'


In [21]:
len(tokenized_datasets["train"][0]["input_ids"])

363

In [22]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 800'


In [23]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'


In [24]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
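
To see what the concatenate-and-chunk step does, here is a minimal pure-Python sketch on toy data. It mirrors the logic of `group_texts` but takes `chunk_size` as an argument; the function name `group_texts_toy` and the token ids in `toy_batch` are made up for illustration.

```python
def group_texts_toy(examples, chunk_size):
    # Concatenate the per-review lists of each column into one long list
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing remainder that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start out as a copy of the inputs; masking happens later
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Three toy "reviews" with 4 + 3 + 5 = 12 tokens in total
toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]],
    "attention_mask": [[1] * 4, [1] * 3, [1] * 5],
}

grouped = group_texts_toy(toy_batch, chunk_size=5)
print(grouped["input_ids"])  # [[101, 7, 8, 102, 101], [9, 102, 101, 5, 6]]
```

With 12 tokens and a chunk size of 5, two full chunks are kept and the last 2 tokens are discarded. Note that a chunk can span the boundary between two reviews.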

In [25]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [26]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it ' s not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [27]:
tokenizer.decode(lm_datasets["train"][1]["labels"]) == tokenizer.decode(lm_datasets["train"][1]["input_ids"])

True

## Fine-tuning DistilBERT with the Trainer API

In [28]:
# Randomly select tokens for masking in each 128-token chunk; each token is chosen with probability 0.15 (15%)
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
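
By default, `DataCollatorForLanguageModeling` does not replace every selected token with `[MASK]`: following the BERT recipe, about 80% of the selected tokens become `[MASK]`, about 10% are swapped for a random token, and about 10% are left unchanged. Positions that were not selected get the label `-100`, which the loss ignores. The following is a simplified pure-Python sketch of that rule, not the library's implementation; `mask_tokens` and the constants are hypothetical names, and the ids 101, 102 and 103 are taken from the tokenizer output earlier in this notebook.

```python
import random

MASK_ID = 103             # [MASK]
SPECIAL_IDS = {101, 102}  # [CLS] and [SEP] are never masked
VOCAB_SIZE = 30522


def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 means "ignored by the loss"
    for i, token in enumerate(input_ids):
        # Skip special tokens and tokens that are not selected
        if token in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


ids = [101, 2023, 2003, 1037, 2307, 3185, 1012, 102]
masked_ids, labels = mask_tokens(ids, mlm_probability=0.5, rng=random.Random(0))
print(masked_ids)
print(labels)
```

The random-replacement branch is a likely reason why unexpected words can show up in a collated batch in place of the original ones.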

In [29]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i rented i am curious - yellow from my video store because of all the controversy that surrounded it when itwg [MASK] released in 1967. i [MASK] heard [MASK] at first it was seized [MASK] u. s. customs if it ever tried to enter this country, therefore being a fan of films considered " [MASK] " i really had to see this for [MASK]. [MASK] br [MASK] > < br / [MASK] the plot [MASK] centered around a young swedish drama student named [MASK] who wants to [MASK] everything she can about life laughter in particular [MASK] [MASK] to focus her attentions to making some sort [MASK] documentary on what [MASK] average swede thought about certain political issues such'

'>>> as the vietnam war and race issues in the united states. in between asking politicians [MASK] ordinary denizens of stockholm about their opinions on politics, she [MASK] sex with her drama marlene, classmates [MASK] and married [MASK]. < br / > < br / > what kills [MASK] about i am curious [MASK] yellow is [MASK] 40 

In [30]:
tokenizer.decode(lm_datasets["train"][0]["input_ids"])

'[CLS] i rented i am curious - yellow from my video store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at first it was seized by u. s. customs if it ever tried to enter this country, therefore being a fan of films considered " controversial " i really had to see this for myself. < br / > < br / > the plot is centered around a young swedish drama student named lena who wants to learn everything she can about life. in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such'

In [31]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
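
The key step in the collator above is grouping token positions by word, so that all subword pieces of a word are masked together. A minimal sketch of just that grouping, with made-up `word_ids` lists (the helper name `build_word_mapping` is hypothetical):

```python
import collections


def build_word_mapping(word_ids):
    """Map a running word index to the token positions that belong to it."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        if word_id != current_word:  # a new word starts here
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)


# A word split into three subword tokens (positions 2, 3 and 4)
print(build_word_mapping([None, 0, 1, 1, 1, 2, None]))
# {0: [1], 1: [2, 3, 4], 2: [5]}

# A chunk that spans two reviews: word ids restart at 0 after [SEP] [CLS]
print(build_word_mapping([5, 5, None, None, 0, 0]))
# {0: [0, 1], 1: [4, 5]}
```

Re-indexing with `current_word_index` instead of using `word_id` directly keeps words from different reviews apart when a chunk crosses a review boundary.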

In [32]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i rented i am curious - yellow from [MASK] video store because [MASK] all the controversy that surrounded it [MASK] [MASK] was first [MASK] in [MASK] [MASK] [MASK] also heard that [MASK] first [MASK] was seized by [MASK]. s [MASK] customs if it [MASK] tried to enter this country, therefore [MASK] [MASK] fan of [MASK] considered " controversial " i really had [MASK] see this for myself. < br [MASK] > < br / > the plot is centered [MASK] a young swedish drama [MASK] named lena who wants to learn everything [MASK] can about life. in particular [MASK] wants to focus her attentions to making [MASK] sort of [MASK] on [MASK] the average swede [MASK] [MASK] certain political issues [MASK]'

'>>> as the vietnam war and race [MASK] in the [MASK] [MASK]. in between asking politicians and ordinary [MASK] [MASK] [MASK] of stockholm [MASK] their opinions on [MASK], [MASK] [MASK] sex with [MASK] drama teacher, classmates, and married men. < br [MASK] > < [MASK] / > what kills me about i [

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [34]:
from transformers import TrainingArguments

batch_size = 32
# Show the training loss with every epoch
logging_steps = len(lm_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir="distilbert-imdb_mask_model",
    overwrite_output_dir=True,
    eval_strategy="epoch",
    num_train_epochs=5,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)
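
As a quick sanity check of the step arithmetic, the sketch below recomputes `logging_steps` and the total number of optimizer steps from the dataset size reported earlier. It assumes a single device, no gradient accumulation, and that the last incomplete batch is kept; the variable names other than `batch_size` and `logging_steps` are hypothetical.

```python
import math

num_train_chunks = 61291  # rows in lm_datasets["train"], from the output above
batch_size = 32
num_train_epochs = 5

# Floor division, as in the cell above
logging_steps = num_train_chunks // batch_size
# The final, smaller batch still counts as a step
steps_per_epoch = math.ceil(num_train_chunks / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(logging_steps, steps_per_epoch, total_steps)  # 1915 1916 9580
```

Under these assumptions, the training loss is logged one step before the end of each epoch, and the total agrees with the `global_step=9580` reported in this notebook's training output.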

In [36]:
from transformers import Trainer,TrainerCallback
import math

# Set up the Trainer; perplexity for the language model is reported by the callback below

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
    processing_class=tokenizer,
)


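class PerplexityCallback(TrainerCallback):
    """Add perplexity (the exponential of the evaluation loss) to the metrics."""

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Only report perplexity when an evaluation loss is available;
        # printing outside this check would raise a KeyError otherwise.
        if metrics and "eval_loss" in metrics:
            try:
                metrics["perplexity"] = math.exp(metrics["eval_loss"])
            except OverflowError:
                metrics["perplexity"] = float("inf")
            print(f"Perplexity: {metrics['perplexity']:.2f}")
        return control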


trainer.add_callback(PerplexityCallback)

In [37]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 22.89
>>> Perplexity: 22.89


In [38]:
trainer.train()

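| Epoch | Training Loss | Validation Loss | Model Preparation Time |
|-------|---------------|-----------------|------------------------|
| 1 | 2.5249 | 2.344039 | 0.0016 |
| 2 | 2.3985 | 2.291346 | 0.0016 |
| 3 | 2.3441 | 2.256899 | 0.0016 |
| 4 | 2.3079 | 2.23279 | 0.0016 |
| 5 | 2.2869 | 2.227116 | 0.0016 |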


Perplexity: 10.42
Perplexity: 9.89
Perplexity: 9.55
Perplexity: 9.33
Perplexity: 9.27
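
The perplexity values printed by the callback are simply the exponential of the validation loss (the mean cross-entropy, in nats, over the masked tokens). A small check using the validation losses from the training log above; the variable names are hypothetical:

```python
import math

# Validation losses for epochs 1 to 5, copied from the training log
validation_losses = [2.344039, 2.291346, 2.256899, 2.23279, 2.227116]

perplexities = [math.exp(loss) for loss in validation_losses]

for epoch, (loss, ppl) in enumerate(zip(validation_losses, perplexities), start=1):
    print(f"Epoch {epoch}: loss={loss:.4f} -> perplexity={ppl:.2f}")
```

A lower perplexity means the model is, on average, less "surprised" by the masked tokens in the evaluation set.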


TrainOutput(global_step=9580, training_loss=2.3724066278381986, metrics={'train_runtime': 2822.7932, 'train_samples_per_second': 108.564, 'train_steps_per_second': 3.394, 'total_flos': 1.015600727284992e+16, 'train_loss': 2.3724066278381986, 'epoch': 5.0})

In [39]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 9.25
>>> Perplexity: 9.25


In [40]:
model.push_to_hub("azherali/distilbert-imdb_mask_model")

CommitInfo(commit_url='https://huggingface.co/azherali/distilbert-imdb_mask_model/commit/e72caf0b430f0724b70f7c9d6e2b605432cfef53', commit_message='Upload DistilBertForMaskedLM', commit_description='', oid='e72caf0b430f0724b70f7c9d6e2b605432cfef53', pr_url=None, repo_url=RepoUrl('https://huggingface.co/azherali/distilbert-imdb_mask_model', endpoint='https://huggingface.co', repo_type='model', repo_id='azherali/distilbert-imdb_mask_model'), pr_revision=None, pr_num=None)

In [41]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="azherali/distilbert-imdb_mask_model")

Device set to use cuda:0


In [56]:
text = "This movie was absolutely [MASK] and the performances were stunning."
for x in pipe(text):
    print(x["sequence"])

this movie was absolutely fantastic and the performances were stunning.
this movie was absolutely stunning and the performances were stunning.
this movie was absolutely beautiful and the performances were stunning.
this movie was absolutely brilliant and the performances were stunning.
this movie was absolutely wonderful and the performances were stunning.


In [59]:
import torch
from transformers import AutoModelForMaskedLM,AutoTokenizer

model_checkpoint = "azherali/distilbert-imdb_mask_model"

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

text ="This movie was absolutely [MASK] and the performances were stunning."

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This movie was absolutely fantastic and the performances were stunning.'
'>>> This movie was absolutely stunning and the performances were stunning.'
'>>> This movie was absolutely beautiful and the performances were stunning.'
'>>> This movie was absolutely brilliant and the performances were stunning.'
'>>> This movie was absolutely wonderful and the performances were stunning.'
