## Hugging Face Transformers Tutorial


See the Hugging Face documentation for more details: https://huggingface.co/docs/transformers/index


In [1]:
!pip install datasets
!pip install -U "transformers>=4.40.0"



In [15]:
import transformers
from transformers import TrainingArguments
import inspect, os

print("Transformers version:", transformers.__version__)
print("Transformers module path:", transformers.__file__)
print("TrainingArguments module:", TrainingArguments.__module__)
print("TrainingArguments signature:", inspect.signature(TrainingArguments.__init__))


Transformers version: 4.57.3
Transformers module path: /usr/local/lib/python3.12/dist-packages/transformers/__init__.py
TrainingArguments module: transformers.training_args


In [2]:
from collections import defaultdict, Counter
import json

from matplotlib import pyplot as plt
import numpy as np
import torch

def print_encoding(model_inputs, indent=4):
    indent_str = " " * indent
    print("{")
    for k, v in model_inputs.items():
        print(indent_str + k + ":")
        print(indent_str + indent_str + str(v))
    print("}")

## Part 2: Finetuning

### 2.1 Loading in a dataset

In [3]:
from datasets import load_dataset, DatasetDict
from torch.utils.data import DataLoader

imdb_dataset = load_dataset("imdb")

# Just take the first 50 whitespace-separated words for speed
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:50]),
        'label': example['label']
    }

# Create a small dataset
small_imdb_dataset = DatasetDict(
    train=imdb_dataset['train']
        .shuffle(seed=1111)
        .select(range(128))
        .map(truncate),

    val=imdb_dataset['train']
        .shuffle(seed=1111)
        .select(range(128, 160))
        .map(truncate),)

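The `truncate` helper above slices on whitespace, so it keeps 50 words, not 50 model tokens; the tokenizer will usually split those words into more than 50 subword pieces. A minimal sketch of the same slicing on a made-up review (the function name `truncate_words` and the sample text are assumptions for illustration, not part of the notebook):

```python
def truncate_words(text, max_words=50):
    """Keep the first `max_words` whitespace-separated words of `text`."""
    return " ".join(text.split()[:max_words])

# Hypothetical 80-word "review", standing in for an IMDB example.
sample = " ".join(f"word{i}" for i in range(80))
short = truncate_words(sample)

print(len(sample.split()), "->", len(short.split()))
```

Because `split()` with no argument collapses runs of whitespace, the truncated text also loses any original line breaks.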


In [4]:
small_imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 128
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 32
    })
})

In [5]:
small_imdb_dataset['train'][:10]

{'text': ["Probably Jackie Chan's best film in the 1980s, and the one that put him on the map. The scale of this self-directed police drama is evident from the opening and closing scenes, during which a squatters' village and shopping mall are demolished. There are, clearly, differences between the original Chinese",
  'A wonderful movie! Anyone growing up in an Italian family will definitely see themselves in these characters. A good family movie with sadness, humor, and very good acting from all. You will enjoy this movie!! We need more like it.',
  'HORRENDOUS! Avoid like the plague. I would rate this in the top 10 worst movies ever. Special effects, acting, mood, sound, etc. appear to be done by day care students...wait, I have seen programs better than this. Opens like a soft porn show with a blurred nude female doing a',
  'And I absolutely adore Isabelle Blais!!! She was so cute in this movie, and far different from her role in "Quebec-Montreal" where she was more like a man-eat

In [6]:
from transformers import DistilBertTokenizer, DistilBertTokenizerFast, AutoTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased")

In [7]:
# Prepare the dataset - this tokenizes the dataset in batches of 16 examples.
small_tokenized_dataset = small_imdb_dataset.map(
    lambda example: tokenizer(example['text'], padding=True, truncation=True),
    batched=True,
    batch_size=16
)

small_tokenized_dataset = small_tokenized_dataset.remove_columns(["text"])
small_tokenized_dataset = small_tokenized_dataset.rename_column("label", "labels")
small_tokenized_dataset.set_format("torch")
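With `batched=True`, `batch_size=16` and `padding=True`, each group of 16 examples is padded to the length of its own longest sequence, so different groups can end up with different lengths. The sketch below mimics that behaviour on toy token ids in plain Python; `pad_batch` is a hypothetical helper, and `pad_id=0` is chosen to match the zeros visible in the tokenized output of this notebook.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy ids, not real tokenizer output.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Note that a `DataLoader` can only stack these pre-padded examples if it uses the same batch size and does not shuffle, so that every loaded batch coincides with one padded group.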

In [8]:
small_tokenized_dataset['train'][0:2]

{'labels': tensor([1, 1]),
 'input_ids': tensor([[  101, 10109,  9662, 10185,   112,   188,  1436,  1273,  1107,  1103,
           3011,   117,  1105,  1103,  1141,  1115,  1508,  1140,  1113,  1103,
           4520,   119,  1109,  3418,  1104,  1142,  2191,   118,  2002,  2021,
           3362,  1110, 10238,  1121,  1103,  2280,  1105,  5134,  4429,   117,
           1219,  1134,   170,  4816,  6718, 18899,   112,  1491,  1105,  6001,
           8796,  1132,  6515,   119,  1247,  1132,   117,  3817,   117,  5408,
           1206,  1103,  1560,  1922,   102,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0],
         [  101,   138,  7310,  2523,   106, 15859,  2898,  1146,  1107,  1126,
           2169,  1266,  1209,  5397,  1267,  2310,  1107,  1292,  2650,   119,
            138,  1363,  1266,  2523,  1114, 12928,   1

In [9]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_tokenized_dataset['train'], batch_size=16)
eval_dataloader = DataLoader(small_tokenized_dataset['val'], batch_size=16)

### 2.2 Training

In [10]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup, DistilBertForSequenceClassification
from tqdm.notebook import tqdm
import torch


In [11]:
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-cased',
    num_labels=2
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
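To make the schedule concrete, here is a small sketch of the multiplier that a linear warmup plus linear decay schedule applies to the base learning rate at each step. It is written from the documented behaviour of `get_linear_schedule_with_warmup` and is not the library's code; `linear_schedule_factor` is a hypothetical name. The step count of 24 assumes 128 training examples in batches of 16 for 3 epochs, as configured above.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
total_steps = 3 * (128 // 16)  # num_epochs * batches per epoch

for step in (0, 12, 24):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With `num_warmup_steps=0` the learning rate starts at its full value and falls linearly to zero at the last step.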


In [12]:
import os

os.makedirs("checkpoints", exist_ok=True)


In [13]:
loss = 0

best_val_loss = float("inf")
progress_bar = tqdm(range(num_training_steps))


for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")
    # ------------------------ TRAIN -----------------------
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        progress_bar.update(1)

    # ---------------------- VALIDATION ---------------------
    model.eval()
    val_loss = 0
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)
            val_loss += outputs.loss.item()

    avg_val_loss = val_loss / len(eval_dataloader)
    print(f"Validation loss: {avg_val_loss}")

    if avg_val_loss < best_val_loss:
        print("Saving checkpoint!")
        best_val_loss = avg_val_loss
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_loss': best_val_loss,
        }, f"checkpoints/epoch_{epoch}.pt")




Epoch 1/3
Validation loss: 0.6600549817085266
Saving checkpoint!

Epoch 2/3
Validation loss: 0.6099148988723755
Saving checkpoint!

Epoch 3/3
Validation loss: 0.5518483519554138
Saving checkpoint!


In [14]:
imdb_dataset = load_dataset("imdb")

small_imdb_dataset = DatasetDict(
    train=imdb_dataset['train'].shuffle(seed=1111).select(range(128)).map(truncate),
    val=imdb_dataset['train'].shuffle(seed=1111).select(range(128, 160)).map(truncate),
)

small_tokenized_dataset = small_imdb_dataset.map(
    lambda example: tokenizer(example['text'], truncation=True),
    batched=True,
    batch_size=16
)

In [19]:
from transformers import TrainingArguments, Trainer, DistilBertForSequenceClassification
import numpy as np  # make sure this is imported

model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-cased',
    num_labels=2
)

arguments = TrainingArguments(
    output_dir="sample_hf_trainer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",          # evaluate at the end of every epoch
    save_strategy="epoch",          # must match eval_strategy when load_best_model_at_end=True
    learning_rate=2e-5,
    load_best_model_at_end=True,
    seed=224
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": np.mean(predictions == labels)}

trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=small_tokenized_dataset['train'],
    eval_dataset=small_tokenized_dataset['val'],
    processing_class=tokenizer,  # this argument was named `tokenizer` before transformers 4.46
    compute_metrics=compute_metrics
)



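The `compute_metrics` function above takes the argmax over the two logits of each example and compares it with the label. The same computation in plain Python, on made-up logits and labels (all values and the name `accuracy_from_logits` are hypothetical):

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the labelled class index."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy values: one row of two class logits per example.
toy_logits = [[0.2, -0.1], [-0.3, 0.4], [0.1, 0.0], [-0.2, 0.3]]
toy_labels = [0, 1, 1, 1]

print(accuracy_from_logits(toy_logits, toy_labels))
```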


In [22]:
import json
import os
from transformers import TrainerCallback, EarlyStoppingCallback

class LoggingCallback(TrainerCallback):
    def __init__(self, log_path):
        self.log_path = log_path
        # ensure directory exists
        # (fall back to the current directory when log_path has no directory part)
        os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is None:
            return

        # remove very large unnecessary field
        logs.pop("total_flos", None)

        # only the main process writes (important in distributed setups)
        if state.is_local_process_zero:
            with open(self.log_path, "a") as f:
                f.write(json.dumps(logs) + "\n")


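Since the callback appends one JSON object per line, the log can be read back with the standard library alone. The sketch below writes two hypothetical records to a temporary file in the same one-object-per-line format and then picks the entry with the lowest `eval_loss`; the record contents and the helper name `read_jsonl` are assumptions for illustration, not real training logs.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical records standing in for what the callback would append.
records = [{"eval_loss": 0.67, "epoch": 1.0}, {"eval_loss": 0.64, "epoch": 2.0}]

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "log.jsonl")
    with open(log_path, "a") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    loaded = read_jsonl(log_path)

best = min(loaded, key=lambda r: r["eval_loss"])
print(best)
```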


In [23]:
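import os
# Disable Weights & Biases logging. Set this before creating the Trainer; for a
# Trainer that already exists, the callbacks are removed explicitly below.
# Passing report_to="none" to TrainingArguments is a more direct alternative.
os.environ["WANDB_DISABLED"] = "true"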

In [25]:
from transformers import EarlyStoppingCallback
from transformers.integrations import WandbCallback, TensorBoardCallback

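# Remove callbacks we do not want. remove_callback is a no-op when the callback
# is absent and drops only the first matching instance, so re-running this cell
# can leave duplicates behind (see the warnings below).
trainer.remove_callback(EarlyStoppingCallback)
trainer.remove_callback(WandbCallback)
trainer.remove_callback(TensorBoardCallback)

# Then add our own callbacks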
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=0.0))
trainer.add_callback(LoggingCallback("sample_hf_trainer/log.jsonl"))


You are adding a <class 'transformers.trainer_callback.EarlyStoppingCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
NotebookProgressCallback
LoggingCallback
EarlyStoppingCallback
LoggingCallback
You are adding a <class '__main__.LoggingCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
NotebookProgressCallback
LoggingCallback
EarlyStoppingCallback
LoggingCallback
EarlyStoppingCallback
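To see what `early_stopping_patience=1` and `early_stopping_threshold=0.0` mean in practice, here is a simplified sketch of patience-based early stopping on a list of validation losses. It illustrates the idea only and is not the library's implementation; `first_stop` is a hypothetical helper.

```python
def first_stop(val_losses, patience=1, threshold=0.0):
    """Return the 1-based evaluation at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(val_losses, start=1):
        improved = best is None or (loss < best and abs(loss - best) > threshold)
        if improved:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
        if bad_evals >= patience:
            return i
    return None

# Validation losses reported by trainer.train() in this notebook: always improving.
print(first_stop([0.672181, 0.645001, 0.629983]))

# Hypothetical run where the third evaluation is worse than the second.
print(first_stop([0.70, 0.65, 0.66, 0.60]))
```

With a patience of 1, a single evaluation without improvement ends training, which is why the hypothetical run never reaches its better fourth epoch.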


In [26]:
# train the model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.672181        | 0.65625  |
| 2     | No log        | 0.645001        | 0.71875  |
| 3     | No log        | 0.629983        | 0.71875  |




TrainOutput(global_step=24, training_loss=0.6422349611918131, metrics={'train_runtime': 4.5056, 'train_samples_per_second': 85.226, 'train_steps_per_second': 5.327, 'train_loss': 0.6422349611918131, 'epoch': 3.0})

In [27]:
# evaluating the model is very easy

# results = trainer.evaluate()                           # just gets evaluation metrics
results = trainer.predict(small_tokenized_dataset['val']) # also gives you predictions

In [28]:
results

PredictionOutput(predictions=array([[ 0.00601948, -0.04014441],
       [-0.044009  , -0.02139598],
       [-0.0555438 , -0.04180663],
       [ 0.01569425, -0.02921834],
       [-0.29210055,  0.20153485],
       [-0.06077717, -0.05975012],
       [ 0.06373959, -0.21708521],
       [-0.01478907, -0.02640947],
       [-0.26968148,  0.14232491],
       [-0.15853502,  0.00260534],
       [-0.06190501, -0.00064673],
       [ 0.1206478 , -0.1706069 ],
       [ 0.03898401, -0.10488792],
       [-0.04095368, -0.05324618],
       [ 0.10470027, -0.10749435],
       [ 0.12740274, -0.2312723 ],
       [-0.01505886, -0.07913213],
       [-0.1990701 ,  0.05572536],
       [-0.10914148, -0.06973588],
       [-0.2867682 ,  0.20510046],
       [ 0.14426945, -0.20019588],
       [-0.01767553, -0.03617218],
       [ 0.01329763, -0.03574487],
       [-0.09857135,  0.00263446],
       [-0.07223912, -0.09185578],
       [-0.05607162, -0.00138547],
       [-0.20094395,  0.10381635],
       [-0.17394114,  0.00

In [30]:
from transformers import AutoModelForSequenceClassification
# To load our saved model, we can pass the path to the checkpoint into the `from_pretrained` method:
test_str = "I enjoyed the movie!"

finetuned_model = AutoModelForSequenceClassification.from_pretrained("sample_hf_trainer/checkpoint-24")
model_inputs = tokenizer(test_str, return_tensors="pt")
prediction = torch.argmax(finetuned_model(**model_inputs).logits)
print(["NEGATIVE", "POSITIVE"][prediction])

POSITIVE
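The model returns raw logits, and `argmax` only tells us which class wins. To get class probabilities, apply a softmax. A standard-library sketch using the first row of `results.predictions` printed earlier (the label order follows the `["NEGATIVE", "POSITIVE"]` list used above):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.00601948, -0.04014441]  # first row of results.predictions
probs = softmax(logits)
label = ["NEGATIVE", "POSITIVE"][probs.index(max(probs))]

print(label, [round(p, 3) for p in probs])
```

Logits this close to each other give probabilities near 0.5, which is what one would expect from a model trained on only 128 examples.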


Here are some practical tips for fine-tuning:

**Good default hyperparameters.**

* Epochs: {2, 3, 4} (larger datasets need fewer epochs)
* Batch size: bigger is better, so make it as large as memory allows
* Optimizer: AdamW
* AdamW learning rate: {2e-5, 5e-5}
* Learning rate scheduler: linear warmup over the first {0, 100, 500} training steps
* Weight decay (`weight_decay`, AdamW's decoupled counterpart to L2 regularization): {0, 0.01, 0.1}
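One way to use these defaults is as a small search grid. The sketch below enumerates every combination with `itertools.product`; the dictionary keys are chosen to match `TrainingArguments` field names, but how each configuration is trained and compared is left out and would be up to you.

```python
from itertools import product

# Candidate values taken from the list above.
search_space = {
    "num_train_epochs": [2, 3, 4],
    "learning_rate": [2e-5, 5e-5],
    "warmup_steps": [0, 100, 500],
    "weight_decay": [0.0, 0.01, 0.1],
}

# One dict per combination, in the order product() yields them.
configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]

print(len(configs))
print(configs[0])
```

A full grid grows quickly, so in practice it is common to fix most values and vary only the learning rate and the number of epochs.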


## Part 3: Generation


In [31]:
from transformers import AutoModelForCausalLM

gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

gpt2 = AutoModelForCausalLM.from_pretrained('distilgpt2')
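# Intended to silence the pad-token warning. generate() still logs it in the output
# below, likely because it reads gpt2.generation_config rather than gpt2.config.
gpt2.config.pad_token_id = gpt2.config.eos_token_id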


In [32]:
prompt = "Once upon a time"

tokenized_prompt = gpt2_tokenizer(prompt, return_tensors="pt")

for i in range(10):
    output = gpt2.generate(**tokenized_prompt,
                  max_length=50,
                  do_sample=True,
                  top_p=0.9)

    print(f"{i + 1}) {gpt2_tokenizer.batch_decode(output)[0]}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1) Once upon a time when people were going through this difficult work that I felt like it was a necessity I made sure the entire world was safe.
As a result I went on hiatus from the games and tried to live my dream of playing Minecraft at


2) Once upon a time of day, the city was in a state of mourning. People mourned in their grief and mourned in grief, but not in their grief.

In all likelihood, these words would probably have been used to describe what


3) Once upon a time of great change, there is a change in our way of thinking. A new philosophy might be coming about, or maybe an idea has been invented. Or perhaps it's an idea.


The answer is yes. Some


4) Once upon a time when an average human knows about a common disease, a doctor tells you that you've not yet been diagnosed with the disease, not your normal person, and your doctor says that you've not been diagnosed with the disease, not your


5) Once upon a time of the day, people might start getting nervous. It might not be all that unusual, for example, when I first saw my first video for the documentary, ‏The First Day in Paris,‏ I thought: �


6) Once upon a time of great difficulty, it seems impossible for a person to perform an act of skill that is not possible in all situations. But once we have done that, we can be sure that the skill which is required to perform the skill is


7) Once upon a time-frame the data to be retrieved and processed is stored in a CSV form that has a unique identifier.



In a new paper published this week in the journal Nature, researchers from the University of Florida (UF


8) Once upon a time when it's almost impossible to get back to a full-time work or work place or a job, people are always asking for money to pay the bill, but once you have enough to make the extra time, you may be


9) Once upon a time in this world, what happened is what we don't know or trust," she said. "We don't know what we knew."<|endoftext|>
10) Once upon a time, we had been taught to think like the "hobbits." And as I explained to you in my book on learning to work in the business, we are taught to think like the "hobbits" (and if
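The `top_p=0.9` argument used above selects nucleus sampling: at each step the model samples only from the smallest set of tokens whose probabilities add up to at least 0.9. Here is a toy sketch of that filtering step over a hand-written distribution. The tokens and probabilities are invented, `top_p_filter` is a hypothetical helper, and the library's own implementation works on logits and differs in detail.

```python
import random

def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Invented next-token distribution after "Once upon a time".
toy_probs = {"there": 0.5, "in": 0.3, "when": 0.15, "zebra": 0.05}
nucleus = top_p_filter(toy_probs, top_p=0.9)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(sorted(nucleus), sampled)
```

The unlikely token is cut off before sampling, which is how nucleus sampling keeps the variety seen in the ten outputs above while avoiding the least plausible continuations.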


## Defining Custom Datasets


In [34]:
# Option 1: Load into Hugging Face Datasets

# Kaggle download: https://www.kaggle.com/datasets/mexwell/the-e2e-challenge-dataset
import pandas as pd
from datasets import Dataset

df = pd.read_csv("e2e-dataset/trainset.csv")
custom_dataset = Dataset.from_pandas(df)

In [35]:
import csv
from torch.utils.data import Dataset, DataLoader  # note: shadows `datasets.Dataset` imported above

class E2EDataset(Dataset):
    """Tokenize data when we call __getitem__"""
    def __init__(self, path, tokenizer):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            self.data = [{"source": row[0], "target": row[1]} for row in reader]
        self.tokenizer = tokenizer

    def __len__(self):
        # Needed so that DataLoader and samplers know the dataset size
        return len(self.data)

    def __getitem__(self, i):
        inputs = self.tokenizer(self.data[i]['source'])
        labels = self.tokenizer(self.data[i]['target'])
        inputs['labels'] = labels.input_ids
        return inputs


In [36]:
bart_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-base')


In [37]:
dataset = E2EDataset("e2e-dataset/trainset.csv", bart_tokenizer)

In [39]:
import torch

src_texts = ["This is the first test.", "This is the second test."]
tgt_texts = ["Target 1", "Target 2"]

batch = bart_tokenizer(
    src_texts,
    text_target=tgt_texts,
    max_length=128,
    truncation=True,
    padding=True,
    return_tensors="pt",
)

batch  # contains input_ids, attention_mask, labels


{'input_ids': tensor([[   0,  713,   16,    5,   78, 1296,    4,    2],
        [   0,  713,   16,    5,  200, 1296,    4,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[    0, 41858,   112,     2],
        [    0, 41858,   132,     2]])}

In [40]:
dataset[0]

{'input_ids': [0, 13650, 10975, 133, 48596, 7479, 3529, 40118, 10975, 22478, 7479, 425, 43430, 10975, 4321, 87, 984, 541, 7479, 2111, 691, 10975, 245, 66, 9, 195, 7479, 583, 10975, 347, 2001, 1140, 1614, 1069, 5183, 742, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [0, 133, 48596, 8881, 583, 22450, 1614, 1069, 5183, 34, 10, 195, 999, 691, 4, 1437, 14614, 386, 23, 984, 541, 4, 2]}
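Each item of `E2EDataset` is a dict of plain Python lists whose lengths differ from example to example, so a `DataLoader` needs a collate function that pads them before they can be batched. The sketch below shows one way to do that in plain Python. It assumes a padding id of 1 for inputs (BART's `<pad>` token) and -100 for labels (the index PyTorch's cross-entropy loss ignores by default); `collate` is a hypothetical helper and the ids are toy values. The `transformers` library provides `DataCollatorForSeq2Seq` for the same purpose.

```python
def collate(examples, pad_id=1, label_pad_id=-100):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_in = max(len(ex["input_ids"]) for ex in examples)
    max_lab = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_in = max_in - len(ex["input_ids"])
        n_lab = max_lab - len(ex["labels"])
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_in)
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_in)
        # Padded label positions are excluded from the loss.
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_lab)
    return batch

# Toy examples shaped like the items returned by E2EDataset.__getitem__.
examples = [
    {"input_ids": [0, 713, 2], "attention_mask": [1, 1, 1], "labels": [0, 41858, 112, 2]},
    {"input_ids": [0, 713, 16, 5, 2], "attention_mask": [1, 1, 1, 1, 1], "labels": [0, 41858, 2]},
]
batch = collate(examples)
print(batch)
```

To feed the result to a model, wrap each list in `torch.tensor(...)` and pass the function as `collate_fn` to the `DataLoader`.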

## Pipelines

In [41]:
from transformers import pipeline

sentiment_analysis = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")


Device set to use cuda:0


You can run the pipeline by just calling it on a string:

In [42]:
sentiment_analysis("Hugging Face Transformers is really cool!")

[{'label': 'POSITIVE', 'score': 0.998448371887207}]

Or on a list of strings:

In [43]:
sentiment_analysis(["I didn't know if I would like Hákarl, but it turned out pretty good.",
                    "I didn't know if I would like Hákarl, and it was just as bad as I'd heard."])

[{'label': 'POSITIVE', 'score': 0.9988769888877869},
 {'label': 'NEGATIVE', 'score': 0.9994940757751465}]

You can find more information on pipelines (including which ones are available) [here](https://huggingface.co/docs/transformers/main_classes/pipelines).

## Masked Language Modeling

In [44]:
from transformers import AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
bert = AutoModelForMaskedLM.from_pretrained("bert-base-cased")


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [45]:
prompt = "I am [MASK] to learn about HuggingFace!"
fill_mask = pipeline("fill-mask", "bert-base-cased")  # separate name, so the fine-tuned `model` is not overwritten
fill_mask(prompt)

Device set to use cuda:0


[{'score': 0.3552880585193634,
  'token': 7215,
  'token_str': 'excited',
  'sequence': 'I am excited to learn about HuggingFace!'},
 {'score': 0.15621043741703033,
  'token': 1280,
  'token_str': 'going',
  'sequence': 'I am going to learn about HuggingFace!'},
 {'score': 0.07893389463424683,
  'token': 9582,
  'token_str': 'eager',
  'sequence': 'I am eager to learn about HuggingFace!'},
 {'score': 0.03559933230280876,
  'token': 1303,
  'token_str': 'here',
  'sequence': 'I am here to learn about HuggingFace!'},
 {'score': 0.035240538418293,
  'token': 17261,
  'token_str': 'delighted',
  'sequence': 'I am delighted to learn about HuggingFace!'}]

In [46]:
inputs = tokenizer(prompt, return_tensors="pt")
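# Locate the [MASK] position with torch.where, so everything stays a tensor
mask_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)
with torch.no_grad():  # inference only, no gradients needed
    outputs = bert(**inputs)
top_5_predictions = torch.softmax(outputs.logits[mask_index], dim=1).topk(5)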

print(prompt)
for i in range(5):
    prediction = tokenizer.decode(top_5_predictions.indices[0, i])
    prob = top_5_predictions.values[0, i]
    print(f"  {i+1}) {prediction}\t{prob:.3f}")

I am [MASK] to learn about HuggingFace!
  1) excited	0.355
  2) going	0.156
  3) eager	0.079
  4) here	0.036
  5) delighted	0.035
