# Domain adaptation for MLMs: How to teach a BERT-family model about the *movies* domain via MLM fine-tuning

In this notebook we'll perform masked language modeling (MLM) fine-tuning of `DistilBERT`, a pre-trained BERT-family model. Specifically, we want to adjust the `DistilBERT` weights to favor the *movies* domain.

We'll do this by performing MLM training on the IMDB Sentiment Analysis dataset.  The notebook demonstrates the process of preparing data, training, evaluating, and
sharing the fine-tuned model, using Hugging Face Transformers, Datasets,
and Accelerate libraries.



Install the Transformers, Datasets, Evaluate, and Accelerate libraries (plus `git-lfs`) to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs

### 1. Create an environment for training and sharing language models within the Hugging Face ecosystem

In [None]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

In [None]:

from huggingface_hub import notebook_login

notebook_login()

### 2. Load the BERT-family model and its tokenizer; then investigate the model

Import the `AutoModelForMaskedLM` class from the transformers library. This class is used to load pre-trained models specifically designed for MLM tasks.



*   Specify the name of the pre-trained model we want. In this case, it's `DistilBERT`, a smaller and faster version of `BERT`.
*   Call the `AutoModelForMaskedLM.from_pretrained()` method to download and load the pre-trained `DistilBERT` model specified by `model_checkpoint`. The model is then stored in the `model` variable and ready for MLM tasks, such as filling in missing words in a sentence.



In [None]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)



*   `model.num_parameters()`: This method call returns the total number of parameters in the loaded `DistilBERT` model.
*   `/ 1_000_000`: This division converts the parameter count to millions (M) for easier readability.

In [None]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")



*   `[MASK]`: This special token `[MASK]` is a placeholder. In MLM, we hide certain words in a sentence and train a model to predict those hidden words. The `[MASK]` token represents the hidden word the model needs to predict.

* We'll use this input to probe the model's predictions before fine-tuning.

In [None]:
text = "This is a great [MASK]."

`AutoTokenizer`  automatically loads the appropriate tokenizer based on the specified pre-trained model.

We'll be covering tokenizers in a later class, but in brief, a tokenizer handles the following:
 * **Tokenization**: It breaks down text into smaller units called tokens, which could be words, subwords, or even characters, depending on the tokenizer.
 * **Vocabulary**: It has a vocabulary of known tokens and their corresponding numerical IDs.
 * **Encoding**: It converts text into numerical representations (input IDs) that can be fed into the pre-trained model.
 * **Decoding**: It converts numerical representations back into human-readable text.
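To make these four roles concrete, here is a minimal sketch of a toy whitespace tokenizer. The vocabulary, the IDs, and the helper names (`tokenize`, `encode`, `decode`) are made up for illustration; the real `DistilBERT` tokenizer uses WordPiece subwords and a much larger vocabulary.

```python
# Toy vocabulary: token -> numerical ID (hypothetical values, for illustration only)
vocab = {"[UNK]": 0, "[MASK]": 1, "this": 2, "is": 3, "a": 4, "great": 5, "movie": 6, ".": 7}
id_to_token = {i: t for t, i in vocab.items()}


def tokenize(text):
    # Tokenization: split off periods, then split on whitespace.
    # Special tokens such as [MASK] are kept as-is; everything else is lowercased.
    pieces = text.replace(".", " .").split()
    return [p if p in vocab else p.lower() for p in pieces]


def encode(text):
    # Encoding: map each token to its ID, falling back to [UNK] for unknown tokens
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokenize(text)]


def decode(ids):
    # Decoding: map IDs back to tokens and join them into a string
    return " ".join(id_to_token[i] for i in ids)


ids = encode("This is a great [MASK].")
print(ids)
print(decode(ids))
```

Unknown words map to `[UNK]` in this toy version; subword tokenizers largely avoid that by splitting rare words into smaller known pieces.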

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

* Tokenize the input text:
 * Use the previously loaded tokenizer to process the text (which contains the `[MASK]` token).
 * `return_tensors="pt"`: This ensures the output is a PyTorch tensor.
 * The result (`inputs`) is a dictionary-like object containing the tokenized input IDs and other information needed by the model.

* Get model predictions:
 * Pass the tokenized input to the loaded model (model).
 * `.logits`: This extracts the raw predictions (logits) from the model's output. Logits are unnormalized scores over the vocabulary; a higher logit means the model considers that token more likely.

* Locate the `[MASK]` token index
  
* Extract logits for the `[MASK]` token

* Get the top 5 most likely tokens (from `mask_token_logits`) to replace the `[MASK]`

What do these completions tell us about `DistilBERT`'s pre-trained weights?
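As a rough illustration of the "logits to top candidates" step, the sketch below applies a softmax to a handful of made-up logits and picks the highest-scoring tokens. The candidate tokens and their scores are hypothetical, not actual model output; the real computation runs over the full vocabulary with `torch.topk`.

```python
import math

# Hypothetical logits for a few candidate tokens at the [MASK] position
candidate_logits = {
    "deal": 2.1,
    "movie": 1.3,
    "idea": 1.9,
    "day": 1.6,
    "film": 0.9,
    "banana": -3.0,
}


def softmax(scores):
    # Subtract the max for numerical stability, then normalize to probabilities
    highest = max(scores.values())
    exps = {tok: math.exp(s - highest) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


probs = softmax(candidate_logits)
# Rank tokens by probability; ranking by raw logits gives the same order
top_3 = sorted(probs, key=probs.get, reverse=True)[:3]

for tok in top_3:
    print(f"{tok}: {probs[tok]:.3f}")
```

Because softmax preserves the ordering of the logits, taking the top-k directly on logits (as the notebook does) selects the same tokens.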


In [None]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

In [None]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

### 3. Load the IMDB Sentiment Analysis dataset

In [None]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")

In [None]:
tokenizer.model_max_length

In [None]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

### 4. Chunk the data

Chunking is used to handle long sequences of text more efficiently, as BERT-family models have a limit on the input length they can process. In the cells below, the dictionary `chunks` has the same keys as `concatenated_examples`, but each value is a list of fixed-size chunks rather than one long concatenated list.

In [None]:
chunk_size = 128

In [None]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

In [None]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

In [None]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

Group and prepare text data for processing: concatenate all text examples, calculate the total length, and adjust it to ensure that the text can be divided into complete chunks of the desired chunk_size.
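The length adjustment can be sketched on toy data as follows. The token IDs and the tiny `chunk_size` of 4 are made up for illustration (the notebook uses 128), and `usable_length` is a name introduced here only for clarity.

```python
chunk_size = 4  # tiny value for illustration; the notebook uses 128

# Two made-up "tokenized reviews" of lengths 5 and 6
examples = {"input_ids": [[101, 7, 8, 9, 102], [101, 5, 6, 102, 11, 12]]}

# Concatenate all examples into one long list per key
concatenated = {k: sum(v, []) for k, v in examples.items()}
total_length = len(concatenated["input_ids"])

# Round down to a multiple of chunk_size, dropping the incomplete tail
usable_length = (total_length // chunk_size) * chunk_size

chunks = {
    k: [t[i : i + chunk_size] for i in range(0, usable_length, chunk_size)]
    for k, t in concatenated.items()
}

print(f"total={total_length}, usable={usable_length}")
print(chunks["input_ids"])
```

Note that a chunk can span the boundary between two reviews, which is acceptable for MLM training since the objective only needs local context.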

In [None]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Decode a sequence of token IDs back into human-readable text using the tokenizer.

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

### 5. Prepare the data for the MLM task

`DataCollatorForLanguageModeling` is designed to prepare batches of data specifically for masked language modeling (MLM) tasks.

* `tokenizer`: The tokenizer to use for masking and padding the input data.
* `mlm_probability=0.15`: This sets the probability of selecting a token for masking during data preparation. In this case, each token has a 15% chance of being selected. By default, most selected tokens are replaced with `[MASK]`, while a small share are replaced with a random token or left unchanged. Masking is the core principle of MLM: the model is trained to predict the masked tokens based on the surrounding context.
* Padding: We pad sequences to the same length within a batch to ensure consistent input shapes for the model.
* Labels: We create the "labels" for the MLM task, which are the original token IDs of the masked tokens; all other positions are set to `-100` so the loss ignores them. These labels are used to calculate the loss during training.
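The masking and label logic can be sketched in plain Python as below. This is a simplified illustration of the BERT-style recipe (select tokens with probability 0.15, then replace 80% of them with `[MASK]`, 10% with a random token, and leave 10% unchanged), not the library's actual implementation. The function name `mask_tokens` is hypothetical, the constants assume the BERT uncased vocabulary, and special tokens are not skipped here as a real collator would do.

```python
import random

MASK_ID = 103        # assumed [MASK] id (BERT uncased vocabulary)
VOCAB_SIZE = 30522   # assumed vocabulary size (BERT uncased)
IGNORE_INDEX = -100  # label value that the loss function ignores


def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected: input unchanged, label ignored
        labels[i] = token_id  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return masked, labels


rng = random.Random(0)
input_ids = list(range(1000, 2000))  # made-up token IDs
masked, labels = mask_tokens(input_ids, rng=rng)
num_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"selected {num_selected} of {len(input_ids)} tokens")
```

Because masks are drawn randomly on each call, a collator of this kind produces different masks every time a batch is built, which is why the training inputs vary from epoch to epoch.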

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

In [None]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

Create a smaller, downsampled version of the original training dataset to speed up experimentation.

In [None]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

### 6. Set up MLM training

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Batch size and logging:

* `batch_size`: Sets the batch size for training, which determines how many examples are processed at once.

* `logging_steps`: Sets how often (in training steps) the training loss is logged. Here it is computed as the number of training examples divided by the batch size (integer division), so the loss is logged roughly once per epoch (one full pass through the training data).
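For the sizes used in this notebook, the arithmetic works out as in this small sketch. It assumes the 10,000-example downsampled training split, a batch size of 64, and a single device; `steps_per_epoch` is a name introduced here for illustration.

```python
import math

num_train_examples = 10_000  # train_size of the downsampled dataset
batch_size = 64

# Integer division, as in the notebook's logging_steps computation
logging_steps = num_train_examples // batch_size

# Actual number of batches per epoch on one device (the last batch is partial)
steps_per_epoch = math.ceil(num_train_examples / batch_size)

print(f"logging_steps={logging_steps}, steps_per_epoch={steps_per_epoch}")
```

The two values differ by one because the final, smaller batch still counts as a step, so logging lands just before the end of each epoch.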

Training arguments:

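* `evaluation_strategy`: Sets the frequency of evaluation during training (`"epoch"` means evaluate after each epoch). It is commented out in the cell below, so the default applies; recent Transformers releases name this argument `eval_strategy`.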
* `learning_rate`: Sets the learning rate for the optimizer, which controls how much the model's weights are adjusted during each training step.

In [None]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    #overwrite_output_dir=True,
    #evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    logging_steps=logging_steps,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    #tokenizer=tokenizer,
)

### 7. Use the perplexity metric to determine current model quality

* `evaluate()` method runs the model on the evaluation dataset and returns a dictionary containing evaluation metrics, including the evaluation loss (eval_loss).
* Perplexity is the exponential of the cross-entropy loss. For MLM it reflects how well the model predicts the masked tokens, with lower perplexity indicating better performance.
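As a small numeric illustration of the `math.exp(eval_loss)` computation used in this section, the sketch below derives perplexity from made-up probabilities that a model assigns to the correct tokens. The helper name `perplexity` and the probability values are hypothetical.

```python
import math


def perplexity(true_token_probs):
    # Cross-entropy loss: mean negative log-probability of the correct tokens
    losses = [-math.log(p) for p in true_token_probs]
    mean_loss = sum(losses) / len(losses)
    # Perplexity is the exponential of that mean loss
    return math.exp(mean_loss)


# A model guessing uniformly among 4 options
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A more confident (and correct) model
confident = perplexity([0.9, 0.8, 0.95])

print(f"uniform: {uniform:.2f}, confident: {confident:.2f}")
```

One way to read the number: a perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.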

In [None]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

In [None]:
trainer.train()

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

In [None]:
trainer.push_to_hub() #need to do this to access the model in a later cell; make sure your HF API token has write access

### 8. Try to improve things further by performing random masking

In [None]:
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}

In [None]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

In [None]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

Use the *AdamW* optimizer. In general, an optimizer in a DL and NNs context is responsible for updating the model's parameters during training to minimize the loss function and improve the model's performance.

*AdamW* is a variant of *Adam* that incorporates weight decay. Weight decay helps to prevent overfitting by shrinking the parameters toward zero at each update, which discourages them from becoming too large. In *AdamW* this decay is applied directly to the weights, decoupled from the gradient-based update, rather than as a penalty term added to the loss function.
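To show where weight decay enters the update, here is a minimal single-parameter sketch of one AdamW step in plain Python. It is an illustration under stated assumptions, not PyTorch's implementation: the function name `adamw_step` is hypothetical, and the default values for the betas, epsilon, and `weight_decay=0.01` are assumed to mirror common AdamW defaults.

```python
import math


def adamw_step(param, grad, m, v, step, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Decoupled weight decay: shrink the weight directly, independent of the gradient
    param = param * (1 - lr * weight_decay)
    # Adam moment estimates (running means of the gradient and squared gradient)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized moments
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Adam update
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient, only the weight decay changes the parameter
decayed, _, _ = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, step=1)
# With a positive gradient, the parameter moves further down
updated, _, _ = adamw_step(1.0, grad=1.0, m=0.0, v=0.0, step=1)

print(decayed, updated)
```

The zero-gradient case makes the decoupling visible: the weight still shrinks by a factor of `1 - lr * weight_decay` even though the loss contributes nothing.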

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Initial hyperparameters:

* `num_train_epochs`: Sets the number of training epochs to 3. An epoch is one full pass through the training data.
* `num_update_steps_per_epoch`: Gets the number of update steps per epoch, which is determined by the length of the train_dataloader (the number of batches in the training data).
* `num_training_steps`: Calculates the total number of training steps by multiplying the number of epochs by the number of update steps per epoch.

Create a Learning Rate Scheduler:

* `"linear"`: Specifies the type of learning rate schedule to use. In this case, it's a linear schedule, which means the learning rate will decrease linearly from its initial value to 0 over the course of training.

* `optimizer=optimizer`: Provides the optimizer that will be used during training. The scheduler will adjust the learning rate of this optimizer.

* `num_warmup_steps=0`: Warmup steps are an initial period where the learning rate is gradually increased to its initial value. In this case, there's no warmup period.

* `num_training_steps=num_training_steps`: Specifies the total number of training steps, which was calculated earlier.
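The shape of a linear schedule with optional warmup can be sketched as a multiplier applied to the base learning rate. This is an illustrative re-implementation, not the library code; the function name `linear_schedule_multiplier` is hypothetical, and the step count of 471 assumes 3 epochs over 157 batches (10,000 examples, batch size 64, one device).

```python
def linear_schedule_multiplier(step, num_training_steps, num_warmup_steps=0):
    # During warmup, ramp linearly from 0 up to 1
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    # Afterwards, decay linearly from 1 down to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
num_training_steps = 471  # assumed: 3 epochs x 157 batches per epoch

for step in (0, 235, 471):
    lr = base_lr * linear_schedule_multiplier(step, num_training_steps)
    print(f"step {step}: lr = {lr:.2e}")
```

With `num_warmup_steps=0`, as in this notebook, training starts at the full learning rate and the multiplier only ever decreases.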

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [None]:
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

In [None]:
from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

Define the training loop:

* `loss = outputs.loss`: Extracts the loss value from the model's outputs.
* `accelerator.backward(loss)`: Performs backpropagation using the accelerator to calculate gradients of the loss with respect to the model's parameters. This is done in a distributed manner if using multiple devices.
* `optimizer.step()`: Updates the model's parameters based on the calculated gradients using the optimizer.
* `lr_scheduler.step()`: Updates the learning rate according to the learning rate schedule defined by the lr_scheduler.
* `optimizer.zero_grad()`: Resets the gradients to zero before the next batch to avoid accumulating gradients from previous batches.

In [None]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        #repo.push_to_hub(
        #    commit_message=f"Training in progress epoch {epoch}", blocking=False
        #)

In [None]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="huggingface-course/distilbert-base-uncased-finetuned-imdb"
)

### 9. Check to see if the new predictions for `this is a great [MASK]` reflect the movies domain

In [None]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")