<a target="_blank" href="https://colab.research.google.com/github/cswamy/pytorch/blob/main/notebooks/MLM_finetuned_distilbert_imdb.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### **Notes**

Notebook to fine-tune a distilbert-base-uncased model for masked language modelling on the IMDB dataset.

App: https://huggingface.co/spaces/cswamy/masked_language_model

Resources:

- Hugging Face checkpoint: https://huggingface.co/distilbert-base-uncased
- Original DistilBERT paper: https://arxiv.org/abs/1910.01108
- IMDB dataset: https://huggingface.co/datasets/imdb
- Inspired by the Hugging Face tutorial: https://huggingface.co/learn/nlp-course/chapter7/3?fw=pt

### **Setup**

In [1]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [2]:
try:
  import transformers
  print("[INFO] Hugging face transformers imported successfully!")
except ImportError:
  !pip install -q transformers
  import transformers
  print("[INFO] Hugging face transformers installed and imported successfully!")

[INFO] Hugging face transformers installed and imported successfully!


In [3]:
try:
  import datasets
  print("[INFO] Hugging face datasets imported successfully!")
except ImportError:
  !pip install -q datasets
  import datasets
  print("[INFO] Hugging face datasets installed and imported successfully!")

[INFO] Hugging face datasets installed and imported successfully!


### **Define tokenizer**

In [4]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

### **Download dataset**

In [5]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

### **Pre-process dataset**

#### Define tokenization function

In [6]:
def tokenize_function(examples):
  result = tokenizer(examples["text"])
  result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
  return result

#### Tokenize full dataset

In [7]:
tokenized_datasets = imdb_dataset.map(tokenize_function,
                                      batched=True,
                                      remove_columns=["text", "label"])

tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

#### Split dataset into chunks

In [8]:
def create_chunks(examples,
                  chunk_size:int=128):
  # Concatenate all examples
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  # Get length of concatenated text
  total_length = len(concatenated_examples[list(examples.keys())[0]])
  # Drop last chunk if smaller than chunk_size
  total_length = (total_length // chunk_size) * chunk_size
  # Split chunks
  result = {
      k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
      for k, t in concatenated_examples.items()
  }
  # Create new labels column
  result["labels"] = result["input_ids"].copy()
  return result

In [9]:
lm_datasets = tokenized_datasets.map(create_chunks, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})
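
As a rough illustration of the chunking step, the sketch below applies the same concatenate, trim and slice logic to a toy batch with `chunk_size=4`. The name `create_chunks_toy` and the token ids are hypothetical and exist only for this example; the logic is meant to mirror `create_chunks` above.

```python
def create_chunks_toy(examples, chunk_size=4):
    # Concatenate every list in the batch, column by column
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    # Trim to a multiple of chunk_size so the last partial chunk is dropped
    total_length = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    # Slice each column into fixed-size chunks
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start out as a copy of the inputs; masking happens later
    result["labels"] = result["input_ids"].copy()
    return result


# Hypothetical toy batch: three "reviews" of 3, 5 and 2 token ids
toy_batch = {
    "input_ids": [[101, 7, 102], [101, 8, 9, 10, 102], [101, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
chunks = create_chunks_toy(toy_batch)
print(chunks["input_ids"])  # [[101, 7, 102, 101], [8, 9, 10, 102]]
```

Ten token ids become two chunks of four and the trailing two ids are dropped. Chunks can span review boundaries, which is why the row counts above no longer match the number of reviews. Because `map(..., batched=True)` calls the function once per batch, the remainder is dropped per batch rather than once per split.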

### **Prepare data collator for random masking**

In [10]:
# Initialise data collator
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm_probability=0.15)

In [11]:
# Test data collator on few samples
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
  _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
  print(f"{tokenizer.decode(chunk)}\n")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[CLS] i rented [MASK] am curious - yellow from my video store because of all the controversy that surrounded it [MASK] it was first released in 1967. i [MASK] heard that at first [MASK] was seized by u opportunities s. customs if it ever tried [MASK] enter this country, therefore being a [MASK] of films considered " controversial [MASK] i really [MASK] to [MASK] this for myself [MASK] < br / > < [MASK] / > the plot is centered around a young swedish drama student named lena who wants to learn [MASK] she canpiece [MASK]. in particular she [MASK] to focus her attentions to making some sort [MASK] documentary on what the average swede thought about certain political issues such

as [MASK] vietnam war [MASK] race [MASK] in the united states. in between asking politicians [MASK] ordinary [MASK] [MASK]ns of stockholm about their [MASK] on politics, she boom sex with her drama teacher, classmates, and married [MASK]. < br [MASK] > < br / > what [MASK] [MASK] about i am curious - yellow is [MA
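
The sketch below is a simplified, pure-Python picture of the BERT-style masking rule that the collator is understood to apply: each token is selected with probability `mlm_probability`; of the selected tokens roughly 80% become `[MASK]`, 10% become a random token and 10% stay unchanged, while unselected positions get the label `-100` so the loss skips them. The helper `mask_tokens` and the constants are assumptions for illustration (the real collator works on tensors and also avoids masking special tokens). Stray words such as "opportunities" in the decoded output above would be consistent with the random-replacement case.

```python
import random

MASK_ID = 103        # assumed id of [MASK] in the uncased BERT vocabulary
VOCAB_SIZE = 30522   # assumed vocabulary size
IGNORE_INDEX = -100  # label value that the loss ignores


def mask_tokens(input_ids, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) following the 80/10/10 rule."""
    rng = random.Random(seed)
    masked, labels = list(input_ids), []
    for pos, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)  # not selected: no loss here
            continue
        labels.append(token)             # selected: predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


tokens = list(range(1000, 1200))  # hypothetical token ids
masked, labels = mask_tokens(tokens)
selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{selected} of {len(tokens)} positions selected for prediction")
```

Because masking is drawn at collation time, the same chunk is masked differently every time it is batched, including during evaluation.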

### **Prepare dataloaders**

#### Downsample data for faster training

In [24]:
train_size = 30000
val_size = int(0.1 * train_size)

downsampled_datasets = lm_datasets["train"].train_test_split(train_size=train_size,
                                                             test_size=val_size,
                                                             seed=42)

# Also remove word_ids since collator does not expect this column
downsampled_datasets = downsampled_datasets.remove_columns("word_ids")

downsampled_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3000
    })
})

#### Create dataloaders

In [25]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(dataset=downsampled_datasets["train"],
                              batch_size=64,
                              shuffle=True,
                              collate_fn=data_collator)

val_dataloader = DataLoader(dataset=downsampled_datasets["test"],
                            batch_size=64,
                            shuffle=False,
                            collate_fn=data_collator)

len(train_dataloader), len(val_dataloader)

(469, 47)
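
The dataloader lengths above follow from ceiling division of the row counts by the batch size, since the last partial batch is kept unless `drop_last=True`. A small check, with `num_batches` as a hypothetical helper written for this example:

```python
import math


def num_batches(num_rows, batch_size, drop_last=False):
    # With drop_last the final partial batch is discarded, otherwise it is kept
    if drop_last:
        return num_rows // batch_size
    return math.ceil(num_rows / batch_size)


print(num_batches(30000, 64), num_batches(3000, 64))  # 469 47
```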

### **Train and eval**

#### Setup training

In [26]:
from transformers import AutoModelForMaskedLM

# Define model
model = AutoModelForMaskedLM.from_pretrained(checkpoint).to(device)

# Define optimiser
optimiser = torch.optim.AdamW(params=model.parameters(),
                              lr=2e-5)

# Setup scheduler
lr_scheduler = torch.optim.lr_scheduler.LinearLR(optimizer=optimiser)
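
For reference, `LinearLR` with no extra arguments is documented to use `start_factor=1/3`, `end_factor=1.0` and `total_iters=5`. The pure-Python sketch below reproduces that multiplier under those assumed defaults; `linear_lr_factor` is a hypothetical helper, not part of PyTorch.

```python
def linear_lr_factor(step, start_factor=1.0 / 3, end_factor=1.0, total_iters=5):
    # Interpolate linearly for total_iters steps, then hold at end_factor
    progress = min(step, total_iters) / total_iters
    return start_factor + (end_factor - start_factor) * progress


base_lr = 2e-5
for step in (0, 1, 5, 100):
    print(step, base_lr * linear_lr_factor(step))
```

Under these defaults the learning rate ramps from about 6.7e-6 to 2e-5 over the first five optimiser steps and then stays constant, so it acts as a very short warm-up rather than a decay. A schedule that decays linearly to zero over all training steps, as in the Hugging Face tutorial, would need different settings.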

#### Setup perplexity metric

In [27]:
try:
  from torchmetrics.text import Perplexity
  print("Torchmetrics perplexity imported!")
except ImportError:
  !pip install -q torchmetrics
  from torchmetrics.text import Perplexity
  print("Torchmetrics perplexity installed and imported!")

Torchmetrics perplexity imported!


In [39]:
perp_fn = Perplexity(ignore_index=-100).to(device)
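
Perplexity is the exponential of the mean negative log-likelihood over the scored tokens (positions labelled `-100` are ignored). The sketch below computes it by hand on made-up probabilities and shows that averaging per-batch perplexities, as the loops in this notebook do, generally gives a slightly larger number than the perplexity of the pooled tokens. The function `perplexity` and the probabilities are hypothetical.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Hypothetical probabilities the model assigned to the correct tokens
batch_a = [0.5, 0.25]
batch_b = [0.1, 0.05]

mean_of_batch_ppl = (perplexity(batch_a) + perplexity(batch_b)) / 2
pooled_ppl = perplexity(batch_a + batch_b)
print(f"mean of batch perplexities: {mean_of_batch_ppl:.3f}")
print(f"perplexity of pooled tokens: {pooled_ppl:.3f}")
```

This would be consistent with the validation figures reported further down: exp(2.2480) is roughly 9.47, a little under the reported perplexity of 9.5069.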

#### Training loop

In [40]:
from tqdm.auto import tqdm
EPOCHS = 5

for epoch in tqdm(range(EPOCHS)):
  # Reset running totals so each epoch reports its own average
  train_loss, train_perp = 0, 0
  for batch in train_dataloader:
    # Send batch to device
    batch = {k: v.to(device) for k, v in batch.items()}

    # Forward pass
    outputs = model(**batch)

    # Accumulate loss as a Python float so the autograd graph is not retained
    loss = outputs.loss
    train_loss += loss.item()

    # Zero grad optimiser
    optimiser.zero_grad()

    # Backpropagate loss
    loss.backward()

    # Step optimiser
    optimiser.step()

    # Step scheduler
    lr_scheduler.step()

    # Calculate perplexity
    perp = perp_fn(outputs.logits, batch["labels"])
    train_perp += perp.item()

  # Average loss and perplexity across batches
  train_loss /= len(train_dataloader)
  train_perp /= len(train_dataloader)

  # Print progress
  print(f"Epoch: {epoch+1} | Training loss: {train_loss:.4f} | Training perplexity: {train_perp:.4f}")

Epoch: 1 | Training loss: 2.4615 | Training perplexity: 11.8237
Epoch: 2 | Training loss: 2.3473 | Training perplexity: 10.4794
Epoch: 3 | Training loss: 2.2965 | Training perplexity: 9.9533
Epoch: 4 | Training loss: 2.2606 | Training perplexity: 9.6058
Epoch: 5 | Training loss: 2.2225 | Training perplexity: 9.2448


#### Eval loop

In [42]:
val_loss, val_perp = 0, 0

model.eval()
with torch.inference_mode():
  for batch in val_dataloader:
    # Send batch to device
    batch = {k: v.to(device) for k, v in batch.items()}

    # Forward pass
    outputs = model(**batch)

    # Calculate and accumulate loss
    val_loss += outputs.loss

    # Calculate and accumulate perplexity
    perp = perp_fn(outputs.logits, batch["labels"])
    val_perp += perp.item()

  # Average loss and perplexity across batches
  val_loss /= len(val_dataloader)
  val_perp /= len(val_dataloader)

# Print outputs
print(f"Validation loss: {val_loss:.4f} | Validation perplexity: {val_perp:.4f}")

Validation loss: 2.2480 | Validation perplexity: 9.5069


### **Save model**

In [43]:
!git clone https://github.com/cswamy/pytorch

Cloning into 'pytorch'...
remote: Enumerating objects: 68, done.
remote: Counting objects: 100% (68/68), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 68 (delta 21), reused 11 (delta 2), pack-reused 0
Receiving objects: 100% (68/68), 38.91 KiB | 2.99 MiB/s, done.
Resolving deltas: 100% (21/21), done.


In [45]:
from pytorch.scripts import utils

utils.save_model(model=model,
                 target_dir="models",
                 model_name="distilbert_finetuned_imdb.pth")

[INFO] Saving model to: models/distilbert_finetuned_imdb.pth


### **Make predictions**

#### Sample and predict on test dataset

In [47]:
# Remove word_ids column before creating dataloader
lm_datasets_test = lm_datasets["test"].remove_columns("word_ids")

# Create dataloader
test_dataloader = DataLoader(dataset=lm_datasets_test,
                             batch_size=64,
                             shuffle=False,
                             collate_fn=data_collator)

In [48]:
test_loss, test_perp = 0, 0

model.eval()
with torch.inference_mode():
  for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}

    # Forward pass
    outputs = model(**batch)

    # Calculate test loss
    test_loss += outputs.loss

    # Calculate perplexity
    perp = perp_fn(outputs.logits, batch["labels"])
    test_perp += perp.item()

  # Average loss and perplexity across batches
  test_loss /= len(test_dataloader)
  test_perp /= len(test_dataloader)

# Print output
print(f"Test loss: {test_loss:.4f} | Test perplexity: {test_perp:.4f}")

Test loss: 2.2759 | Test perplexity: 9.8043


#### Predict on new text

In [78]:
# Define predict function
def pred_mask(text:str):
  """
  Function returns top 5 candidates for a MASK.
  Args:
    text(str): text with MASK at end to complete.
  Returns:
    List of top 5 candidate tokens for the MASK.
  """
  inputs = tokenizer(text, return_tensors="pt").to(device)
  # Inference only, so skip gradient tracking
  with torch.inference_mode():
    output_logits = model(**inputs).logits

  # Find location of [MASK] and extract its logits
  mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
  mask_token_logits = output_logits[0, mask_token_index, :]

  # Pick the [MASK] candidates with highest logits
  top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

  # Convert tokens to sentences
  options_list = []
  for token in top_5_tokens:
    option = f"<<< {text.replace(tokenizer.mask_token, tokenizer.decode([token]))} >>>"
    options_list.append(option)

  return options_list

In [79]:
text = "This is a great [MASK]."
options_list = pred_mask(text)

options_list

['<<< This is a great film. >>>',
 '<<< This is a great movie. >>>',
 '<<< This is a great idea. >>>',
 '<<< This is a great comedy. >>>',
 '<<< This is a great show. >>>']
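
The ranking step in `pred_mask` orders vocabulary entries by their logit at the `[MASK]` position. As a small illustration of the same idea without PyTorch, the sketch below applies a softmax to a made-up logit vector and keeps the top candidates with their probabilities. The vocabulary, the logits and the helper `top_k_with_probs` are all hypothetical.

```python
import math


def top_k_with_probs(logits, vocab, k=5):
    # Softmax with max-subtraction for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank vocabulary indices by logit, highest first
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(vocab[i], probs[i]) for i in ranked]


vocab = ["film", "movie", "idea", "banana"]  # hypothetical tiny vocabulary
logits = [3.0, 2.5, 1.0, -2.0]               # hypothetical logits at [MASK]
top3 = top_k_with_probs(logits, vocab, k=3)
for token, prob in top3:
    print(f"This is a great {token}.  (p={prob:.3f})")
```

Softmax preserves the ordering of the logits, so ranking by logit (as the notebook does) and ranking by probability pick the same candidates; the probabilities are only useful if a confidence value should be displayed.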

### **Deploy to Hugging Face**

In [60]:
from pathlib import Path

# Create folders
demo_path = Path("demos/distilbert_mlm")
demo_path.mkdir(parents=True, exist_ok=True)

In [61]:
# Move model to demo folder
!mv models/distilbert_finetuned_imdb.pth demos/distilbert_mlm

In [62]:
%%writefile demos/distilbert_mlm/model.py
from transformers import AutoModelForMaskedLM, AutoTokenizer

def create_distilbert_mlm():
  """
  Initializes model and tokenizer for distilbert checkpoint.
  """
  checkpoint = "distilbert-base-uncased"
  tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  model = AutoModelForMaskedLM.from_pretrained(checkpoint)

  return model, tokenizer

Writing demos/distilbert_mlm/model.py


In [63]:
%%writefile demos/distilbert_mlm/app.py
import torch
import gradio as gr

from model import create_distilbert_mlm

# Setup model and tokenizer
model, tokenizer = create_distilbert_mlm()

# Load state dict from model
model.load_state_dict(
    torch.load(
        f="distilbert_finetuned_imdb.pth",
        map_location=torch.device("cpu")
    ))

# Predict function
def predict(text:str):

  # Tokenize inputs and get model outputs (inference only, no gradients)
  inputs = tokenizer(text, return_tensors="pt")
  with torch.inference_mode():
    output_logits = model(**inputs).logits

  # Find location of [MASK] and extract its logits
  mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
  mask_token_logits = output_logits[0, mask_token_index, :]

  # Pick the [MASK] candidates with highest logits
  top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

  # Convert tokens to sentences
  options_list = []
  for token in top_5_tokens:
    option = f"<<< {text.replace(tokenizer.mask_token, tokenizer.decode([token]))} >>>"
    options_list.append(option)

  return options_list

# Create examples list
examples_list = ["This is a great [MASK]."]

# Create gradio app
title = "Masked Language Model 🤿"
description = "DistilBERT model finetuned for masked language modelling on the IMDB dataset."

demo = gr.Interface(fn=predict,
                    inputs=gr.inputs.Textbox(label="Input",
                                             placeholder="Enter sentence here..."),
                    outputs="text",
                    examples=examples_list,
                    title=title,
                    description=description)

# Launch gradio
demo.launch()

Writing demos/distilbert_mlm/app.py


In [64]:
%%writefile demos/distilbert_mlm/requirements.txt
torch==1.12.0
gradio==3.1.4
transformers==4.33.1

Writing demos/distilbert_mlm/requirements.txt


In [65]:
!cd demos/distilbert_mlm && zip -r ../distilbert_mlm.zip *

  adding: app.py (deflated 53%)
  adding: distilbert_finetuned_imdb.pth (deflated 8%)
  adding: model.py (deflated 47%)
  adding: requirements.txt (deflated 17%)


In [66]:
try:
  from google.colab import files
  files.download("demos/distilbert_mlm.zip")
except:
  print(f"Download failed!")

For further instructions on uploading to Hugging Face Spaces, see: https://www.learnpytorch.io/09_pytorch_model_deployment/#117-deploying-our-foodvision-big-app-to-huggingface-spaces