# Introduction

This notebook is inteded to experiment with Masked Language Modeling task.

# BERT

In [1]:
# Import Standard Libraries
from transformers import BertTokenizer, BertForPreTraining
import torch

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Initialize the tokenizer for preprocessing
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 32.6kB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 2.06MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 4.71MB/s]
config.json: 100%|██████████| 570/570 [00:00<00:00, 1.64MB/s]


In [3]:
# Initialize the model (NOTE: This is a model version for further fine-tuning BERT)
model = BertForPreTraining.from_pretrained('bert-base-uncased')

model.safetensors: 100%|██████████| 440M/440M [00:10<00:00, 42.2MB/s] 


In [4]:
# Define input text
text = ("After Abraham Lincoln won the November 1860 presidential [MASK] on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces [MASK] Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

In [5]:
# Compute tokens and feed them into the model
tokens = tokenizer(text, return_tensors='pt')
outputs = model(**tokens)

In [6]:
outputs.keys()

odict_keys(['prediction_logits', 'seq_relationship_logits'])

- **prediction_logits** - It is the output of the MLM head
- **seq_relationship_logits** - It is the output of the NSP head

In [7]:
outputs.prediction_logits

tensor([[[ -7.6192,  -7.5433,  -7.6124,  ...,  -6.7155,  -6.7375,  -4.6122],
         [-12.5489, -12.3772, -12.6500,  ..., -11.8643, -11.4446,  -9.1151],
         [ -6.2346,  -6.3590,  -5.9091,  ...,  -6.1258,  -6.2720,  -5.0268],
         ...,
         [ -2.2497,  -2.1352,  -2.1812,  ...,  -1.7201,  -1.2728,  -7.8301],
         [-14.2654, -14.3100, -14.2294,  ..., -11.4669, -11.7212, -10.3129],
         [-11.5071, -12.0389, -11.6046,  ..., -11.2875,  -9.1655,  -9.1733]]],
       grad_fn=<ViewBackward0>)

In [8]:
outputs.prediction_logits.shape

torch.Size([1, 62, 30522])

The value `62` is the number of tokens that has been created from the initial text, while `30522` is the vocabulary size.

In [9]:
outputs.seq_relationship_logits

tensor([[ 2.8256, -1.6897]], grad_fn=<AddmmBackward0>)

Given the fact that our input is not set to work for NSP, the meaning of the `seq_relationship_logits` does not make much sense.

<br>

Now let's retrieve the predicted tokens from MLM

In [10]:
# Get the mapping from token IDs to the Vocabulary Index (Token-ID)
token2idx = tokenizer.get_vocab()

In [14]:
# Show mapping example
token2idx['hello']

7592

In [15]:
# Create an inverted dictionary ID-Token
idx2token = {value:key for key, value in token2idx.items()}

In [17]:
# Show mapping example
idx2token[7592]

'hello'

In [19]:
# Let's map the second token of the outputs
outputs.prediction_logits[0][2].shape

torch.Size([30522])

In [22]:
# Convert logits into probabilities
softmax = torch.nn.functional.softmax(outputs.prediction_logits[0][2], dim=-1)

# Retrieve the max probability
argmax = torch.argmax(softmax)

In [23]:
# Find predicted token
idx2token[argmax.item()]

'abraham'

In [27]:
# Retrieve all predicted tokens
# Convert logits into probabilities
softmax = torch.nn.functional.softmax(outputs.prediction_logits[0], dim=0)

# Retrieve the max probability
argmax = torch.argmax(softmax, dim=1)

for index in argmax:
    print(idx2token[index.item()], end=' ')

##ecin although abraham lincolnshire won 1948 november 1860 presidential primaries on his anti - slavery platform , an initial seven tributary states declare their independence from the country to form ##ici confederacy ##yre war broke out in april 1861 when ##oya ##ist forces occupied fort sum ##mer for south carolina ##trip just over a month before grant ' s inauguration ; ##tson 