# Fine-tuning a masked language model (DistilBERT) on the IMDb dataset to autocomplete sentences (PyTorch)

## Introduction

Masked language modeling predicts masked tokens in a sequence, and the model can attend to tokens bidirectionally: it has full access to the tokens on both the left and the right of each mask. This makes masked language modeling well suited to tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.



## Setup

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
!apt install git-lfs


In [None]:
# Setup git
!git config --global user.email "ashaduzzaman2505@gmail.com"
!git config --global user.name "ashaduzzaman-sarker"

In [None]:
# Log in to the Hugging Face Hub
from huggingface_hub import notebook_login
notebook_login()


## Picking a pretrained model (DistilBERT) for masked language modeling

[`DistilBERT`](https://huggingface.co/distilbert/distilbert-base-uncased) was trained using a technique called `knowledge distillation`, where a large “teacher model” like BERT guides the training of a “student model” that has far fewer parameters.
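To make the teacher/student idea concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It is a generic illustration and not DistilBERT's actual training recipe: the logits, the temperature value, and the helper names `softmax` and `distillation_loss` are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's soft targets and the student's prediction."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

# Hypothetical logits over a 4-token vocabulary
teacher_logits = [4.0, 1.0, 0.5, -2.0]
close_student = [3.5, 1.2, 0.4, -1.5]
far_student = [-2.0, 0.5, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher_logits)
loss_far = distillation_loss(far_student, teacher_logits)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

A student whose distribution is close to the teacher's gets a small loss, and the loss is zero when the two distributions match exactly.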

In [None]:
# Download DistilBERT using the AutoModelForMaskedLM class
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)


In [None]:
# Check the number of parameters in this model
distilbert_num_parameters = model.num_parameters() / 1_000_000

print(f"DistilBERT number of parameters: {round(distilbert_num_parameters)}M")
print("BERT number of parameters: 110M")

DistilBERT number of parameters: 67M
BERT number of parameters: 110M


In [None]:
# Load DistilBERT’s tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [None]:
# Let's pass a text example to the model
import torch

text = "This is a great [MASK]."

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"{text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

This is a great deal.
This is a great success.
This is a great adventure.
This is a great idea.
This is a great feat.


## Load the Dataset

The dataset used for domain adaptation is the [**Large Movie Review Dataset (IMDb)**](https://huggingface.co/datasets/stanfordnlp/imdb), a well-known corpus for benchmarking sentiment analysis models. By fine-tuning DistilBERT on this dataset, the model adapts from its Wikipedia-based pretraining to the subjective language of movie reviews.

In [None]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
# Let’s take a look at a few samples
sample = imdb_dataset["train"].shuffle(seed=42).select(range(5))

# 0 : negative review, 1: positive one.
for row in sample:
    print(f"Review: {row['text']}")
    print(f"Label: {row['label']}")
    print("\n")

Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
Label: 1


Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on the

## Preprocessing the data

The reviews need to end up as a single column of text. Rather than truncating each review, we tokenize the whole corpus, concatenate the results, and split them into chunks of equal size.

### Tokenizing our movie reviews

In [None]:
# First tokenize our corpus without setting the truncation=True
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

### Group them all together and split the result into chunks

In [None]:
# Inspect the model_max_length/context size attribute of the tokenizer
tokenizer.model_max_length

512

In [None]:
# Pick smaller chunk size/context size to run our experiments on Google Colab GPUs
chunk_size = 128

In [None]:
# Let's print the token lengths of a few reviews

# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"Review {idx} length: {len(sample)}")

Review 0 length: 363
Review 1 length: 304
Review 2 length: 133


In [None]:
# We can then concatenate all these examples with a simple dictionary comprehension
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}

total_length = len(concatenated_examples["input_ids"])
print(f"The concatenated reviews are {total_length} tokens long")

The concatenated reviews are 800 tokens long


In [None]:
# Let's split the concatenated reviews into chunks
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"A chunk has {len(chunk)} tokens")

A chunk has 128 tokens
A chunk has 128 tokens
A chunk has 128 tokens
A chunk has 128 tokens
A chunk has 128 tokens
A chunk has 128 tokens
A chunk has 32 tokens


In [None]:
# Drop the last chunk as it’s smaller than chunk_size
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
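The same concatenate-then-chunk logic is easier to follow on toy data. The sketch below is a standalone variant that takes `chunk_size` as a parameter; the function name `group_into_chunks` and the token ids are made up for illustration.

```python
def group_into_chunks(examples, chunk_size):
    """Concatenate each column, drop the ragged tail, and split into fixed-size chunks."""
    concatenated = {key: sum(values, []) for key, values in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of chunk_size so every chunk is full
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        key: [tokens[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for key, tokens in concatenated.items()
    }
    # Labels start out as a copy of the inputs; masking happens later
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result

# Three made-up "reviews" with 5, 3, and 2 token ids (10 tokens in total)
toy_batch = {
    "input_ids": [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10]],
    "attention_mask": [[1] * 5, [1] * 3, [1] * 2],
}

grouped = group_into_chunks(toy_batch, chunk_size=4)
print(grouped["input_ids"])
print(grouped["labels"])
```

With a chunk size of 4, the 10 tokens yield two full chunks and the last two tokens are dropped. Note that chunks can span review boundaries, which is expected with this approach.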

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [None]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

> At this point the `labels` and `input_ids` columns created by `group_texts()` are identical. We will insert `[MASK]` tokens at random positions in the inputs on the fly during fine-tuning, using a special data collator.
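As a rough picture of what random masking involves, the sketch below applies the BERT-style recipe in plain Python: each position is selected with probability `mlm_probability`, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The constants `MASK_ID` and `VOCAB_SIZE` and the token ids are assumptions for this toy example, and unlike the real collator this sketch does not skip special tokens.

```python
import random

MASK_ID = 103        # assumed id for [MASK] in this toy vocabulary
VOCAB_SIZE = 1000    # assumed vocabulary size
IGNORE_INDEX = -100  # label value that the loss function skips

def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    if rng is None:
        rng = random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected: input untouched, label ignored
        labels[i] = token_id  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

toy_ids = list(range(200, 240))  # 40 made-up token ids
masked_ids, labels = mask_tokens(toy_ids, rng=random.Random(0))
num_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{num_selected} of {len(toy_ids)} positions selected for prediction")
```

Only the selected positions carry a real label, so the loss is computed on those positions alone.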

## Fine-tuning DistilBERT with the Trainer API

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [None]:
# Let's feed a few examples to the data collator to see the random masking
samples = [lm_datasets["train"][i] for i in range(3)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(tokenizer.decode(chunk))
    print("\n")

[CLS] i rented i am [MASK] - yellow [MASK] my video store because of all the controversy that surrounded it [MASK] it was first released in 1967. i also heard that at first it was seized [MASK] u. s. customs if it ever [MASK] to enter this [MASK], [MASK] being a fan [MASK] films considered " controversial [MASK] i really had [MASK] see this for myself. < br / > [MASK] br / > the plot is centered around a young swedish drama [MASK] [MASK] lena who wants [MASK] learn everything she can about [MASK] [MASK] in particular she wants to focus her attention [MASK] to [MASK] some sort of documentary on what the average swede [MASK] about certain political issues such


as the vietnam war and race issues in the united states. prostate between asking politicians and [MASK] [MASK] [MASK]ns of stockholm about their opinions on [MASK], she has sex with her drama [MASK], classmates, and married men. < br / > < br / > what kills me about i am [MASK] - yellow is that 40 [MASK] ago, [MASK] was considere

In [None]:
# Let's build a custom data collator that masks whole words instead of individual tokens
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
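The word-to-token mapping is the core of whole word masking, so it helps to see it in isolation. The sketch below extracts that step into a hypothetical helper, `build_word_to_token_map`, and runs it on a made-up tokenization in which one word is split into three subword tokens.

```python
import collections

def build_word_to_token_map(word_ids):
    """Map each word index to the positions of the tokens that spell it."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # special tokens such as [CLS] and [SEP] have no word
        if word_id != current_word:
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)

# Hypothetical tokenization of "the denizens of stockholm"
tokens = ["[CLS]", "the", "den", "##ize", "##ns", "of", "stockholm", "[SEP]"]
word_ids = [None, 0, 1, 1, 1, 2, 3, None]

word_to_tokens = build_word_to_token_map(word_ids)
for word_index, positions in word_to_tokens.items():
    print(word_index, [tokens[p] for p in positions])
```

Masking word 1 then means masking positions 2, 3, and 4 together, so the model never sees part of a word as a hint for the rest.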

In [None]:
# Let's try it on the same samples as before
samples = [lm_datasets["train"][i] for i in range(3)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"{tokenizer.decode(chunk)}")
    print(f"\n")

[CLS] i rented i am curious [MASK] yellow [MASK] [MASK] video [MASK] [MASK] of all the controversy that surrounded it when [MASK] was first [MASK] [MASK] 1967. i also heard that at first it was seized [MASK] u [MASK] [MASK]. customs [MASK] it ever tried to enter this country, [MASK] being [MASK] fan of films [MASK] " controversial " i really had to see this for myself. [MASK] br [MASK] > < [MASK] / > [MASK] plot is centered around a [MASK] swedish [MASK] student named lena [MASK] wants [MASK] learn everything she can about life. in particular she wants to focus her attentions [MASK] making some [MASK] of documentary on what the average [MASK] [MASK] thought about certain political issues such


as the vietnam war and race [MASK] in the united states. in between asking [MASK] and ordinary denizens [MASK] stockholm [MASK] their opinions on politics, she has sex with her [MASK] [MASK], [MASK], and married men. < br / > < br / > what kills me about i [MASK] curious - yellow is that 40 [MAS

In [None]:
# Downsample the dataset for training on Google Colab GPUs
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [None]:
# The arguments for the Trainer
from transformers import TrainingArguments

batch_size = 32

# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
      output_dir=f"{model_name}-finetuned-imdb",
      overwrite_output_dir=True,
      evaluation_strategy="epoch",
      learning_rate=2e-5,
      weight_decay=0.01,
      per_device_train_batch_size=batch_size,
      per_device_eval_batch_size=batch_size,
      push_to_hub=True,
      fp16=True, # Use mixed precision
      logging_steps=logging_steps, # track the training loss with each epoch
)



In [None]:
# Instantiate the Trainer
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
# Perplexity for our language model
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 23.36


> A lower perplexity score means a better language model. We can lower it by fine-tuning.

In [None]:
# Let's run the training loop
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.6728        | 2.456296        |
| 2     | 2.5551        | 2.448899        |
| 3     | 2.5099        | 2.445534        |


TrainOutput(global_step=939, training_loss=2.579288341493779, metrics={'train_runtime': 209.4912, 'train_samples_per_second': 143.204, 'train_steps_per_second': 4.482, 'total_flos': 994208670720000.0, 'train_loss': 2.579288341493779, 'epoch': 3.0})
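The step counts in this output can be reproduced by hand. The sketch below assumes a single device and no gradient accumulation, and relies on `TrainingArguments` defaulting to 3 epochs, which matches the three rows in the training log.

```python
import math

train_examples = 10_000
batch_size = 32
num_epochs = 3  # default number of epochs in TrainingArguments

steps_per_epoch = math.ceil(train_examples / batch_size)  # the last batch is smaller
logging_steps = train_examples // batch_size
total_steps = steps_per_epoch * num_epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"logging_steps:   {logging_steps}")
print(f"total steps:     {total_steps}")
```

The total agrees with `global_step=939` in the `TrainOutput`.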

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 11.44
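Perplexity as computed here is simply the exponential of the mean cross-entropy loss. As a small worked example, the sketch below applies that formula to the validation losses from the training log; the helper name `perplexity` is an assumption for this illustration.

```python
import math

# Validation losses copied from the training log above (epochs 1 to 3)
validation_losses = [2.456296, 2.448899, 2.445534]

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    try:
        return math.exp(cross_entropy_loss)
    except OverflowError:
        return float("inf")

for epoch, loss in enumerate(validation_losses, start=1):
    print(f"Epoch {epoch}: loss={loss:.4f} -> perplexity={perplexity(loss):.2f}")
```

The value for the last epoch lands close to, but not exactly on, the 11.44 reported by the final `trainer.evaluate()` call. A likely reason is that the data collator draws new random masks on every evaluation pass, so the loss varies slightly between runs.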


In [None]:
# Push the model card with the training information to the Hub
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/Ashaduzzaman/distilbert-base-uncased-finetuned-imdb/commit/ae2fafbe2ed9544fc825039616aef1f742864440', commit_message='End of training', commit_description='', oid='ae2fafbe2ed9544fc825039616aef1f742864440', pr_url=None, pr_revision=None, pr_num=None)

## Fine-tuning DistilBERT with 🤗 Accelerate

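`DataCollatorForLanguageModeling` applies new random masks on every evaluation pass, so perplexity scores fluctuate between runs. To implement custom logic, such as masking the evaluation set only once, we can write our own training loop with 🤗 Accelerate.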

In [None]:
# Let's implement a simple function that applies masking on a batch
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
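The first line of `insert_random_mask` converts the column-oriented batch that `map` provides into the list of per-example dictionaries that a data collator expects. The sketch below isolates that idiom on a toy batch; the token ids are made up.

```python
# A batched `map` call passes columns: one list per feature
batch = {
    "input_ids": [[101, 2023, 102], [101, 2307, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# zip(*batch.values()) walks the columns in parallel, one example at a time;
# zip(batch, row) then pairs each column name with that example's value.
features = [dict(zip(batch, row)) for row in zip(*batch.values())]

for feature in features:
    print(feature)
```

Each element of `features` is now a single example with its own `input_ids` and `attention_mask`.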

In [None]:
# Apply this function to our test set and drop the unmasked columns
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)

eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
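# Set up the dataloaders for training and evaluation
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
# Training batches need fresh random masks, so they go through data_collator;
# with default_data_collator the inputs would contain no [MASK] tokens at all
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)

# The evaluation set was already masked once, so the default collator is enough
eval_dataloader = DataLoader(
    eval_dataset,
    collate_fn=default_data_collator,
    batch_size=batch_size,
)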

In [None]:
# Load the pretrained model
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [None]:
# Set the Optimizer
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
# Create Accelerator object
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
# Define the learning rate scheduler
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
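To see what the `"linear"` schedule does to the learning rate, here is a plain-Python sketch of linear warmup followed by linear decay to zero. The function name `linear_lr` is hypothetical; the base learning rate is the one passed to `AdamW` in this notebook, and 471 total steps assumes 3 epochs of 157 batches on a single device.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

base_lr = 5e-5            # learning rate passed to AdamW in this notebook
num_training_steps = 471  # 3 epochs x 157 batches of 64 from 10,000 chunks

for step in (0, 157, 314, 471):
    print(f"step {step:3d}: lr = {linear_lr(step, base_lr, num_training_steps):.2e}")
```

With `num_warmup_steps=0`, the rate starts at its full value and shrinks by the same amount on every step until it reaches zero at the end of training.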

In [None]:
# Create a model repository on the Hugging Face Hub
from huggingface_hub import create_repo, get_full_repo_name

repo_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
create_repo(repo_name)

In [None]:
model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'Ashaduzzaman/distilbert-base-uncased-finetuned-imdb-accelerate'

In [None]:
# Create and clone the repository using the Repository class from 🤗 Hub
from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/Ashaduzzaman/distilbert-base-uncased-finetuned-imdb-accelerate into local empty directory.


In [None]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

  0%|          | 0/471 [00:00<?, ?it/s]

>>> Epoch 0: Perplexity: 1067.3815833110061
>>> Epoch 1: Perplexity: 1067.3815833110061
>>> Epoch 2: Perplexity: 1067.3815833110061


## Using our fine-tuned model

In [None]:
text = "The movie was an absolute [MASK], leaving the audience in tears."

In [None]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask",
    model="Ashaduzzaman/distilbert-base-uncased-finetuned-imdb-accelerate"
)

preds = mask_filler(text)

for pred in preds:
    print(f"{pred['sequence']}")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


the movie was an absolute success, leaving the audience in tears.
the movie was an absolute disaster, leaving the audience in tears.
the movie was an absolute failure, leaving the audience in tears.
the movie was an absolute flop, leaving the audience in tears.
the movie was an absolute disappointment, leaving the audience in tears.


## Gradio interface to interact with our Hugging Face model

In [None]:
!pip install gradio


In [None]:
import gradio as gr
from transformers import pipeline

# Initialize the mask-filling pipeline with GPU support
mask_filler = pipeline(
    "fill-mask",
    model="Ashaduzzaman/distilbert-base-uncased-finetuned-imdb-accelerate",
    device=0  # Use GPU
)

def fill_mask(text):
    predictions = mask_filler(text)
    # Join the candidate sentences so the output Textbox shows one per line
    return "\n".join(pred["sequence"] for pred in predictions)

# Define the Gradio interface
iface = gr.Interface(
    fn=fill_mask,
    inputs=gr.Textbox(label="Input Text", lines=2, placeholder="Enter a sentence with [MASK]"),
    outputs=gr.Textbox(label="Predictions"),
    title="Mask Filling Model",
    description="Enter a sentence with [MASK] to get predictions."
)

# Launch the interface
iface.launch()





Here are a few sample sentences with `[MASK]` that can be used to test our mask-filling model:

1. **"The weather today is [MASK], perfect for a picnic."**
2. **"She couldn’t believe the [MASK] of the final score."**
3. **"He looked at the [MASK] and realized he had left it at home."**
4. **"The book was a real [MASK], I couldn’t put it down."**
5. **"Their new project is expected to [MASK] great success."**