Getting to know BERT a bit better.

In [21]:
import torch
import transformers
# Load the pretrained masked-LM model and its matching tokenizer
model = transformers.BertForMaskedLM.from_pretrained("bert-base-cased")
tokenizer = transformers.BertTokenizerFast.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Use pipeline

In [9]:
fill_mask = transformers.pipeline("fill-mask", model=model, tokenizer=tokenizer)

In [3]:
# perform predictions with only one mask
example = "It is known that [MASK] is the capital of Germany"
for prediction in fill_mask(example):
  print(prediction)

{'score': 0.10833366960287094, 'token': 9529, 'token_str': 'Frankfurt', 'sequence': 'It is known that Frankfurt is the capital of Germany'}
{'score': 0.0837741494178772, 'token': 11212, 'token_str': 'Leipzig', 'sequence': 'It is known that Leipzig is the capital of Germany'}
{'score': 0.06412549316883087, 'token': 1122, 'token_str': 'it', 'sequence': 'It is known that it is the capital of Germany'}
{'score': 0.055242475122213364, 'token': 3206, 'token_str': 'Berlin', 'sequence': 'It is known that Berlin is the capital of Germany'}
{'score': 0.05167430639266968, 'token': 8339, 'token_str': 'Hamburg', 'sequence': 'It is known that Hamburg is the capital of Germany'}


In [5]:
# perform predictions with two masks
example = "It is known that [MASK] is [MASK] capital of Germany"
for prediction in fill_mask(example):
  print(prediction)

[{'score': 0.11870907247066498, 'token': 9529, 'token_str': 'Frankfurt', 'sequence': '[CLS] It is known that Frankfurt is [MASK] capital of Germany [SEP]'}, {'score': 0.09645532071590424, 'token': 11212, 'token_str': 'Leipzig', 'sequence': '[CLS] It is known that Leipzig is [MASK] capital of Germany [SEP]'}, {'score': 0.07038077712059021, 'token': 17339, 'token_str': 'Bremen', 'sequence': '[CLS] It is known that Bremen is [MASK] capital of Germany [SEP]'}, {'score': 0.05996522679924965, 'token': 8339, 'token_str': 'Hamburg', 'sequence': '[CLS] It is known that Hamburg is [MASK] capital of Germany [SEP]'}, {'score': 0.053893789649009705, 'token': 3206, 'token_str': 'Berlin', 'sequence': '[CLS] It is known that Berlin is [MASK] capital of Germany [SEP]'}]
[{'score': 0.9693134427070618, 'token': 1103, 'token_str': 'the', 'sequence': '[CLS] It is known that [MASK] is the capital of Germany [SEP]'}, {'score': 0.008970746770501137, 'token': 170, 'token_str': 'a', 'sequence': '[CLS] It is kno

In [8]:
for prediction in fill_mask(example):
    for pre in prediction:
        print(pre)
    print("="*50)

{'score': 0.11870907247066498, 'token': 9529, 'token_str': 'Frankfurt', 'sequence': '[CLS] It is known that Frankfurt is [MASK] capital of Germany [SEP]'}
{'score': 0.09645532071590424, 'token': 11212, 'token_str': 'Leipzig', 'sequence': '[CLS] It is known that Leipzig is [MASK] capital of Germany [SEP]'}
{'score': 0.07038077712059021, 'token': 17339, 'token_str': 'Bremen', 'sequence': '[CLS] It is known that Bremen is [MASK] capital of Germany [SEP]'}
{'score': 0.05996522679924965, 'token': 8339, 'token_str': 'Hamburg', 'sequence': '[CLS] It is known that Hamburg is [MASK] capital of Germany [SEP]'}
{'score': 0.053893789649009705, 'token': 3206, 'token_str': 'Berlin', 'sequence': '[CLS] It is known that Berlin is [MASK] capital of Germany [SEP]'}
{'score': 0.9693134427070618, 'token': 1103, 'token_str': 'the', 'sequence': '[CLS] It is known that [MASK] is the capital of Germany [SEP]'}
{'score': 0.008970746770501137, 'token': 170, 'token_str': 'a', 'sequence': '[CLS] It is known that 

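With two masks, the pipeline returns one list of candidates per [MASK], and each candidate fills only its own mask while the other stays in place. So it seems that BERT predicts one mask at a time... but the wrapped pipeline does not reveal many details.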
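To make the nested structure concrete, here is a minimal sketch that picks the top candidate for each mask and stitches them into one sentence. The data is copied (rounded, top entries only) from the output above; `best_per_mask` and `fill_masks` are hypothetical helpers, not part of `transformers`. Combining independent top picks is only a heuristic, not a joint prediction.

```python
# Nested output copied from the two-mask run above (top entries only, scores rounded).
two_mask_output = [
    [  # candidates for the first [MASK]
        {"score": 0.1187, "token_str": "Frankfurt"},
        {"score": 0.0965, "token_str": "Leipzig"},
        {"score": 0.0704, "token_str": "Bremen"},
    ],
    [  # candidates for the second [MASK]
        {"score": 0.9693, "token_str": "the"},
        {"score": 0.0090, "token_str": "a"},
    ],
]


def best_per_mask(nested):
    """Return the highest-scoring token string for each [MASK], in order."""
    return [max(cands, key=lambda c: c["score"])["token_str"] for cands in nested]


def fill_masks(text, tokens, mask_token="[MASK]"):
    """Replace the mask tokens in `text`, left to right, with the given tokens."""
    for token in tokens:
        text = text.replace(mask_token, token, 1)
    return text


best = best_per_mask(two_mask_output)
filled = fill_masks("It is known that [MASK] is [MASK] capital of Germany", best)
print(best)
print(filled)
```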

# Step by step
Let's look at the details. I follow this [YouTube tutorial](https://www.youtube.com/watch?v=q9NS5WpfkrU&t=1s); its follow-up videos could be helpful as well.

[BERT 101](https://huggingface.co/blog/bert-101) from Hugging Face may also be useful.

In [120]:
example = "BERT, short for Bidirectional Encoder Representations from Transformers, is a Machine Learning (ML) model for natural language processing."
inputs = tokenizer(example, return_tensors="pt")

inputs

{'input_ids': tensor([[  101,   139,  9637,  1942,   117,  1603,  1111,   139,  2386,  5817,
         17264, 13832, 13775,  1197, 20777,  4894, 20936,  1116,  1121, 25267,
           117,  1110,   170,  7792,  9681,   113,   150,  2162,   114,  2235,
          1111,  2379,  1846,  6165,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [121]:
print(inputs.keys())
print(inputs["input_ids"])
print(inputs["token_type_ids"])

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
tensor([[  101,   139,  9637,  1942,   117,  1603,  1111,   139,  2386,  5817,
         17264, 13832, 13775,  1197, 20777,  4894, 20936,  1116,  1121, 25267,
           117,  1110,   170,  7792,  9681,   113,   150,  2162,   114,  2235,
          1111,  2379,  1846,  6165,   119,   102]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])


In [122]:
inputs['labels'] = inputs['input_ids'].clone()
print(inputs['labels'])

tensor([[  101,   139,  9637,  1942,   117,  1603,  1111,   139,  2386,  5817,
         17264, 13832, 13775,  1197, 20777,  4894, 20936,  1116,  1121, 25267,
           117,  1110,   170,  7792,  9681,   113,   150,  2162,   114,  2235,
          1111,  2379,  1846,  6165,   119,   102]])


In [123]:
rand = torch.rand(inputs['labels'].shape)
print(rand)
print(rand.shape)

tensor([[0.3833, 0.4233, 0.6101, 0.8474, 0.4840, 0.1935, 0.3732, 0.0623, 0.6702,
         0.5749, 0.4297, 0.1022, 0.3344, 0.0406, 0.2400, 0.6945, 0.1701, 0.1315,
         0.1819, 0.0654, 0.5040, 0.2206, 0.6792, 0.5919, 0.5604, 0.7918, 0.0377,
         0.9316, 0.7125, 0.7666, 0.9865, 0.4611, 0.7818, 0.9955, 0.2688, 0.1095]])
torch.Size([1, 36])


In [124]:
# Mask roughly 15% of the tokens, skipping the special tokens [CLS] (id 101) and [SEP] (id 102)
mask_arr = (rand < 0.15) * (inputs['input_ids'] != 101) * (inputs['input_ids'] != 102)
mask_arr

tensor([[False, False, False, False, False, False, False,  True, False, False,
         False,  True, False,  True, False, False, False,  True, False,  True,
         False, False, False, False, False, False,  True, False, False, False,
         False, False, False, False, False, False]])

In [125]:
# nonzero() returns each selected index wrapped in its own list, so the result needs to be flattened
mask_arr[0].nonzero().tolist()

[[7], [11], [13], [17], [19], [26]]

In [126]:
slection = torch.flatten(mask_arr[0].nonzero()).tolist()
print(slection)

[7, 11, 13, 17, 19, 26]


In [127]:
# Apply masks (token 103) to the inputs
inputs.input_ids[0, slection] = 103
inputs.input_ids

tensor([[  101,   139,  9637,  1942,   117,  1603,  1111,   103,  2386,  5817,
         17264,   103, 13775,   103, 20777,  4894, 20936,   103,  1121,   103,
           117,  1110,   170,  7792,  9681,   113,   103,  2162,   114,  2235,
          1111,  2379,  1846,  6165,   119,   102]])
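The selection and masking steps above can be replayed without torch. This sketch uses the `input_ids` and `rand` values exactly as printed in the cells above (so `rand` is rounded to four decimals) and rebuilds the same index list with plain lists; the constant names are my own.

```python
# Token ids and random draws copied from the cell outputs above.
input_ids = [101, 139, 9637, 1942, 117, 1603, 1111, 139, 2386, 5817,
             17264, 13832, 13775, 1197, 20777, 4894, 20936, 1116, 1121, 25267,
             117, 1110, 170, 7792, 9681, 113, 150, 2162, 114, 2235,
             1111, 2379, 1846, 6165, 119, 102]
rand = [0.3833, 0.4233, 0.6101, 0.8474, 0.4840, 0.1935, 0.3732, 0.0623, 0.6702,
        0.5749, 0.4297, 0.1022, 0.3344, 0.0406, 0.2400, 0.6945, 0.1701, 0.1315,
        0.1819, 0.0654, 0.5040, 0.2206, 0.6792, 0.5919, 0.5604, 0.7918, 0.0377,
        0.9316, 0.7125, 0.7666, 0.9865, 0.4611, 0.7818, 0.9955, 0.2688, 0.1095]

CLS_ID, SEP_ID, MASK_ID = 101, 102, 103  # special token ids in bert-base-cased
MASK_PROB = 0.15

# A position is selected when its draw is below 15% and it is not [CLS] or [SEP].
selection = [
    i for i, (tok, r) in enumerate(zip(input_ids, rand))
    if r < MASK_PROB and tok not in (CLS_ID, SEP_ID)
]

# Replace the selected positions with the [MASK] id, leaving the rest untouched.
selected = set(selection)
masked_ids = [MASK_ID if i in selected else tok for i, tok in enumerate(input_ids)]

print(selection)
print(masked_ids)
```

Note that the last position also drew a value below 0.15 (0.1095), but it holds [SEP] and is therefore skipped.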

In [128]:
outputs = model(**inputs)
outputs.keys()

odict_keys(['loss', 'logits'])

In [129]:
for key, value in outputs.items():
    print(key, ":", value.shape)

loss : torch.Size([])
logits : torch.Size([1, 36, 28996])


In [130]:
outputs.logits[0, :, :].argmax(dim=-1)

tensor([  119,   139,  9637,  1942,   117,  1603,  1111,   139,  2386,  5817,
        17264, 13832, 13775,  6828, 20777,  4894, 20936,   117,  1121,  1483,
          117,  1110,   170,  7792,  9681,   113,   150,  2162,   114,  2235,
         1111,  2379,  1846,  6165,   119,   119])

In [131]:
# transform the predicted token ids to token strings
predicted_text = tokenizer.decode(
    token_ids=outputs.logits[0, :, :].argmax(dim=-1)
    )
print(predicted_text)

. BERT, short for Bidirectional Encode Language Representation, from English, is a Machine Learning ( ML ) model for natural language processing..


The model actually computes logits for every input token, not only for the masked ones. Each of the 36 positions gets 28996 logits (one per vocabulary entry); a softmax over them gives the probability of each candidate token at that position. This matches the mechanism of self-attention, i.e. the model sees the whole input at once. That resolves the confusion I had about how BERT predicts the masked tokens: it simply predicts every position in a single forward pass.
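As a small illustration of how one position's logits turn into a prediction, here is a plain-Python softmax over a hypothetical 5-token vocabulary. The logit values are made up for the example; the real model does the same thing over 28996 entries.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one position over a 5-token toy vocabulary.
toy_logits = [2.0, -1.0, 0.5, 4.0, 0.0]
probs = softmax(toy_logits)

# The predicted token id is the index with the highest probability.
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(predicted_id, [round(p, 4) for p in probs])
```

Because softmax preserves the ordering of the scores, taking `argmax` directly on the logits (as the cell above does) gives the same token id.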

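According to [the original paper](https://arxiv.org/abs/1810.04805), only the masked tokens' logits contribute to the loss function. Note that `labels` above is a full copy of `input_ids`, so the `loss` returned by this call is averaged over all 36 positions rather than over the masked ones only.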
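One way to restrict the loss to the masked positions is to set every other label to -100, which is the label value the Hugging Face masked-LM loss ignores. The sketch below shows the idea in plain Python on a toy example; `build_labels` and `masked_cross_entropy` are hypothetical helpers, and the 3-token vocabulary and logits are made up.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss


def build_labels(original_ids, masked_positions):
    """Keep the true id at masked positions and IGNORE_INDEX everywhere else."""
    masked = set(masked_positions)
    return [tok if i in masked else IGNORE_INDEX for i, tok in enumerate(original_ids)]


def masked_cross_entropy(logits, labels):
    """Average negative log-likelihood over positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[label])
    if not losses:
        raise ValueError("no masked positions to compute the loss on")
    return sum(losses) / len(losses)


# Toy sequence of 4 tokens over a hypothetical 3-token vocabulary; position 2 is masked.
toy_ids = [0, 2, 1, 0]
toy_labels = build_labels(toy_ids, masked_positions=[2])
toy_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 1.0, 0.0], [5.0, 0.0, 0.0]]
loss = masked_cross_entropy(toy_logits, toy_labels)
print(toy_labels, loss)
```

Only position 2 contributes here; the confident logits at the other positions do not affect the result.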

In [None]:
print("Predicted masked token ids:")
for predictions in outputs.logits[0, slection, :]:
    print(predictions.argmax().item())

print("="*50)

print("Actual masked token ids:")
for label in inputs['labels'][0, slection]:
    print(label.item())

Predicted masked token ids:
139
13832
20936
6828
2379
119
Actual masked token ids:
139
13832
20936
25267
2379
119


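The masked accuracy in this output is 5/6. Not bad! Since the mask is drawn at random, this output comes from a different draw than the `slection` printed earlier, and the ids will differ from run to run.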
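The 5/6 figure can be checked directly from the two id lists printed above. A minimal sketch, with the ids copied from that output:

```python
# Ids copied from the printed output above.
predicted = [139, 13832, 20936, 6828, 2379, 119]
actual = [139, 13832, 20936, 25267, 2379, 119]

# Count the positions where the predicted id equals the original id.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(f"{correct}/{len(actual)} = {accuracy:.3f}")
```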