# Masked Language Modeling (MLM)

https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForMaskedLM

In [4]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

## 1. Load Tokenizer, Model classes

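Note: the warning below can be ignored for now. It lists the pooler and next-sentence-prediction weights stored in the checkpoint, which the masked language modeling head does not use.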

In [5]:
# change this to try out other models
model_name = "bert-base-uncased"

# create an instance of the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# create the model instance
model = AutoModelForMaskedLM.from_pretrained(model_name)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## 2. Prepare the input

In [6]:
mask_token = tokenizer.mask_token

# example text
# note the use of the model-specific mask token (for BERT, "[MASK]")
input_text = "The cat sat on the " + mask_token + " and watched the birds outside"



# prepare the input
inputs = tokenizer(input_text, return_tensors="pt")

# find the index of the mask token in the tensor
# we need it because the prediction for that position is read from the output
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

print("Input text : ", input_text)
print("input_ids : ", inputs['input_ids'])
print("Mask token index : ", mask_token_index.tolist()[0])

Input text :  The cat sat on the [MASK] and watched the birds outside
input_ids :  tensor([[ 101, 1996, 4937, 2938, 2006, 1996,  103, 1998, 3427, 1996, 5055, 2648,
          102]])
Mask token index :  6
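
To see what torch.where returns here, the sketch below rebuilds the input_ids tensor printed above by hand, so it runs without downloading a model. The value 103 is read off that output as the [MASK] id for bert-base-uncased, and the variable names are illustrative.

```python
import torch

# input_ids copied from the printed output above (shape [1, 13])
input_ids = torch.tensor([[101, 1996, 4937, 2938, 2006, 1996, 103,
                           1998, 3427, 1996, 5055, 2648, 102]])
mask_token_id = 103  # assumed [MASK] id, read off the output above

# On a 2-D boolean tensor, torch.where returns (row_indices, column_indices)
rows, cols = torch.where(input_ids == mask_token_id)

print(rows.tolist())  # batch position of each mask -> [0]
print(cols.tolist())  # sequence position of each mask -> [6]
```

Indexing the result with [1], as the cell above does, keeps only the sequence positions. With several masks in the text, cols would hold one entry per mask.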


## 3. Predict

In [7]:
# Call the model; gradients are not needed for inference
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.size(), "   Vocab Size = ", tokenizer.vocab_size)



torch.Size([1, 13, 30522])    Vocab Size =  30522
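
The three dimensions are batch size, sequence length, and vocabulary size: the model returns one score per vocabulary entry at every input position. A minimal sketch, assuming a tiny randomly initialised BertForMaskedLM (the config values are made up so that nothing is downloaded), shows the same layout:

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Hypothetical tiny configuration, chosen only to keep the example fast
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertForMaskedLM(config)  # random weights, not a trained model
tiny_model.eval()

# One sequence of 13 arbitrary token ids
dummy_ids = torch.randint(0, config.vocab_size, (1, 13))

with torch.no_grad():
    tiny_logits = tiny_model(input_ids=dummy_ids).logits

# Layout is [batch_size, sequence_length, vocab_size]
print(tiny_logits.shape)
```

Because the weights are random, the scores themselves are meaningless; only the shape carries over to the real model.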


## 4. Interpret logits to generate output

In [9]:

# Extract the logits at the mask position (shape [1, vocab_size])
mask_token_logits = logits[0, mask_token_index, :]

# Use argmax to get the token with max likelihood
# https://pytorch.org/docs/stable/generated/torch.argmax.html
# top_tokens = torch.argmax(..)

# get the top-k tokens with the highest probability
# https://pytorch.org/docs/stable/generated/torch.topk.html
k = 5
top_tokens = torch.topk(mask_token_logits, k, dim=1).indices[0].tolist()

print("Tokens :", top_tokens)

# Replace the mask in the string
print("Full sentences : ")
for token in top_tokens:
    print(input_text.replace(tokenizer.mask_token, tokenizer.decode([token])))

Tokens : [2598, 2723, 5568, 6411, 7424]
Full sentences : 
The cat sat on the ground and watched the birds outside
The cat sat on the floor and watched the birds outside
The cat sat on the grass and watched the birds outside
The cat sat on the couch and watched the birds outside
The cat sat on the porch and watched the birds outside
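
Logits are unnormalised scores. If probabilities are wanted, a softmax over the vocabulary dimension converts them; the ranking does not change, so topk returns the same tokens either way. The sketch below uses a made-up five-entry vocabulary with made-up scores rather than the real model output.

```python
import torch

# Hypothetical scores for one masked position over a 5-token vocabulary
toy_logits = torch.tensor([[1.0, 3.0, 0.5, 2.0, -1.0]])

# softmax over the vocabulary dimension turns scores into probabilities
toy_probs = torch.softmax(toy_logits, dim=1)

# top-k on probabilities picks the same indices as top-k on logits
top = torch.topk(toy_probs, k=3, dim=1)
top_ids = top.indices[0].tolist()

# argmax is the k = 1 special case
best_id = torch.argmax(toy_logits, dim=1).item()

print(top_ids, best_id)
for token_id, p in zip(top_ids, top.values[0].tolist()):
    print(token_id, round(p, 3))
```

With the real model, the same softmax call on mask_token_logits would attach a probability to each of the five candidate words.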


# Exercise

Use BertForNextSentencePrediction (loaded here through AutoModelForNextSentencePrediction) to predict whether the second sentence follows the first.

## 1. Import the classes

In [21]:
from transformers import AutoModelForNextSentencePrediction, AutoTokenizer
import torch

## 2. Create the model for NSP

In [22]:
# change this to try out other models
model_name = "bert-base-uncased"

# create an instance of the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# create the model instance
model = AutoModelForNextSentencePrediction.from_pretrained(model_name)

## 3. Carry out next sentence prediction

In [23]:
# example text: an unrelated sentence pair
input_text_1 = "The cat sat on the rug."
input_text_2 = "Clouds are white"


# prepare the input
inputs = tokenizer(input_text_1, input_text_2, return_tensors="pt")

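# labels are only needed to compute a loss, so they are omitted for prediction
outputs = model(**inputs)

# class 0: sentence 2 follows sentence 1; class 1: it does not
prediction = outputs.logits.argmax(dim=1)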

if prediction == 0:
    print("2nd sentence is a continuation of the 1st sentence.")
else:
    print("2nd sentence is NOT a continuation of the 1st sentence.")

2nd sentence is NOT a continuation of the 1st sentence.


In [24]:
# example text: a related sentence pair
input_text_1 = "The cat sat on the rug."
input_text_2 = "It watched the birds outside."

# prepare the input
inputs = tokenizer(input_text_1, input_text_2, return_tensors="pt")

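# no labels are passed, because only the logits are needed for prediction
outputs = model(**inputs)

# class 0: sentence 2 follows sentence 1; class 1: it does not
prediction = outputs.logits.argmax(dim=1)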

if prediction == 0:
    print("2nd sentence is a continuation of the 1st sentence.")
else:
    print("2nd sentence is NOT a continuation of the 1st sentence.")

2nd sentence is a continuation of the 1st sentence.
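
The argmax hides how confident the model is. A softmax over the two logits gives a probability for each class; index 0 means the second sentence follows the first and index 1 means it does not, matching the check in the cells above. The logits below are made-up values for illustration, not output from the model.

```python
import torch

# Hypothetical NSP logits for one sentence pair: [is_next, not_next]
nsp_logits = torch.tensor([[4.0, -2.0]])

# Normalise over the two classes
nsp_probs = torch.softmax(nsp_logits, dim=1)[0]
p_is_next = nsp_probs[0].item()
p_not_next = nsp_probs[1].item()

print(f"P(continuation)     = {p_is_next:.3f}")
print(f"P(not continuation) = {p_not_next:.3f}")
```

In the notebook, replacing nsp_logits with outputs.logits would report the same two probabilities for the real sentence pair.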
