# BERT in HuggingFace - specific model tasks
* Notebook by Adam Lang
* Date: 6/20/2024
* We will demonstrate several use cases of BERT models from Hugging Face.

# Use Cases we will demo with BERT
1. Masked token prediction ('fill in the blank')
2. Next Sentence Prediction
3. Question Answering

In [19]:
# install transformers
!pip install transformers



## BERT Tokenizer

In [20]:
# import and instantiate the BERT tokenizer
from transformers import BertTokenizer

#tokenizer - use bert-base-uncased model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Note about the encoding below
* `padding = 'max_length'` is set without an explicit `max_length` value.
* `max_length` is often set explicitly, for example to 128 or 512.
* When it is omitted, the tokenizer pads to the maximum length the model accepts, which is 512 tokens for `bert-base-uncased`.
* `encode_plus` returns a dictionary of values rather than a plain list of token ids.
  * The great thing about `encode_plus` is that it returns more information, such as:
    * `input_ids`
    * `token_type_ids`
    * `attention_mask`
  * This allows you to retrieve each part of the encoding by its key.
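
To make the effect of `padding = 'max_length'` more concrete, here is a minimal pure-Python sketch of padding and of the matching attention mask. It is an illustration only, not the tokenizer's actual implementation: the helper name `pad_to_max_length` is hypothetical, and the defaults assume a maximum length of 512 and a `[PAD]` id of 0, as for `bert-base-uncased`. The token ids are the 17 non-padding ids from the encoded sentence printed in this notebook.

```python
def pad_to_max_length(token_ids, max_length=512, pad_id=0):
    """Right-pad (or naively truncate) token ids and build the attention mask.

    Simplification: a real tokenizer keeps the final [SEP] token when it
    truncates; this sketch simply cuts the list.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# [CLS] ... [SEP] ids for the example sentence (17 tokens)
token_ids = [101, 1996, 3731, 23279, 2180, 1996, 16798, 2549, 6452, 2088,
             2528, 2058, 1996, 5759, 28330, 1012, 102]

input_ids, attention_mask = pad_to_max_length(token_ids)
print(len(input_ids), sum(attention_mask))
```

The number of 1s in the mask equals the number of real tokens, which is how the model knows where the sentence ends and the padding begins.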

In [21]:
# define random text for experiment
text = 'The Boston Celtics won the 2024 NBA World Championship over the Dallas Mavericks.'

# encode text + attention_mask
encoding = tokenizer.encode_plus(text, add_special_tokens = True, truncation = True, padding = 'max_length',
                                return_attention_mask = True, return_tensors = 'pt')



## Now we can print out the encoded text

In [22]:
encoding

{'input_ids': tensor([[  101,  1996,  3731, 23279,  2180,  1996, 16798,  2549,  6452,  2088,
          2528,  2058,  1996,  5759, 28330,  1012,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

### Retrieve specific part of encoding

In [23]:
# pull out the token ids (named input_ids to avoid shadowing Python's built-in input)
input_ids = encoding['input_ids'][0]
attention_mask = encoding['attention_mask'][0]

# print attention_mask
attention_mask

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## Retrieve Mask Tokens (MLM)

In [24]:
tokenizer.mask_token

'[MASK]'

# 1. Task 1: Predict masked token
* Masked language modeling (MLM) is the task of predicting tokens that were replaced with `[MASK]` in the input to the transformer encoder.
* Basic concept: **"Fill in the blanks"** --> similar to auto-encoding

In [25]:
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch

In [26]:
## tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [27]:
## BERT model
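## return_dict = True returns an output object, so the logits can be read as output.logits (this is already the default in recent transformers versions)
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict = True)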

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Note:
* The core concept of masked language modeling is seen below. The model outputs logits, which are raw, unnormalized scores over the vocabulary for every position. We apply softmax to turn them into probabilities and read off the most likely tokens at the position of `tokenizer.mask_token`.
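
As a rough illustration of the logits-to-probabilities step, here is a small pure-Python sketch of softmax followed by a top-k selection. The five-word vocabulary and the logit values are made up for the example (they are not real BERT outputs), and the helper names `softmax` and `top_k` are hypothetical. The notebook cells below do the same thing with `F.softmax` and `torch.topk` over BERT's full vocabulary.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and made-up logits for the [MASK] position
vocab = ["sydney", "melbourne", "the", "paris", "city"]
mask_logits = [6.0, 4.5, 2.0, 0.5, 1.0]

probs = softmax(mask_logits)
for index, prob in top_k(probs, 3):
    print(f"{vocab[index]}: {prob:.3f}")
```

Softmax does not change the ranking of the candidates, so the top token is the same whether it is picked from the logits or from the probabilities. The probabilities are just easier to interpret.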

In [35]:
# concatenation with the masked tokens -- the masked token "fills in the blanks"
text = "The Opera House in Australia is in, " +  tokenizer.mask_token + ' city'

# invoke tokenizer
input = tokenizer.encode_plus(text, return_tensors = 'pt')



In [36]:
# find the position of the [MASK] token in the input ids
mask_index = torch.where(input['input_ids'][0] == tokenizer.mask_token_id)

In [37]:
## model output
output = model(**input)

# logits from output (raw, unnormalized scores, not yet probabilities)
logits = output.logits

In [38]:
## apply the softmax function to the logits - gives a probability distribution over BERT's vocabulary at each position
softmax = F.softmax(logits, dim=-1)

In [39]:
## select the probability distribution at the [MASK] position

mask_word = softmax[0, mask_index, :]

In [40]:
## token ids of the 10 most probable candidates for the [MASK] position
top_10 = torch.topk(mask_word, 10, dim=1)[1][0]

In [41]:
## iterate through the top 10 token ids and substitute each decoded word for the mask token
for token in top_10:
  word = tokenizer.decode([token]) # decode tokens
  new_sentence = text.replace(tokenizer.mask_token, word)
  print(new_sentence)

The Opera House in Australia is in, sydney city
The Opera House in Australia is in, melbourne city
The Opera House in Australia is in, brisbane city
The Opera House in Australia is in, adelaide city
The Opera House in Australia is in, the city
The Opera House in Australia is in, canberra city
The Opera House in Australia is in, auckland city
The Opera House in Australia is in, hobart city
The Opera House in Australia is in, griffith city
The Opera House in Australia is in, hume city


Summary:
* The highest-probability prediction for the masked token was "sydney", which is correct.

In [42]:
## if you ONLY wanted the top output not the top 10

# activation function
softmax = F.softmax(logits, dim=-1)

# mask token index
mask_word = softmax[0, mask_index, :]

top_word = torch.argmax(mask_word, dim=1) # index (token id) of the highest-probability token

print(tokenizer.decode(top_word))

sydney


# 2. Task 2: Next Sentence Prediction
* Next sentence prediction in BERT is the task of predicting whether one sentence follows another.
* **Next sentence prediction (NSP):**
  * During pretraining, the model receives two masked sentences concatenated as a single input.
  * Sometimes they correspond to sentences that were next to each other in the original text, sometimes not.
  * The model then has to predict whether the second sentence followed the first or not.
* Hugging Face docs: https://huggingface.co/google-bert/bert-base-cased

* The model will return:
  * logits: a `torch.FloatTensor` of shape `(batch_size, 2)`
  * These are the next sentence prediction scores before softmax. Index 0 scores "sentence B continues sentence A" and index 1 scores "sentence B is a random sentence".

In [43]:
## imports
from transformers import BertTokenizer, BertForNextSentencePrediction

# tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# model
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

## Prompt and Next sentence
* We are going to test whether the next sentence makes sense for the initial prompt.

In [51]:
## test cases - prompt and then the next sentence to predict

prompt = "I drove to the grocery store to get food."

next_sentence = "I purchased milk and bread at the store."

Below we are using `encode_plus` which returns a full dictionary.

In [52]:
## encoding text
encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors= 'pt')

# logits (first element of the model output)
outputs = model(**encoding)[0]

#activation function
softmax = F.softmax(outputs, dim=1)

# print softmax
print(softmax)

tensor([[9.9999e-01, 7.6648e-06]], grad_fn=<SoftmaxBackward0>)


Summary
* We obtained the probabilities from the logits using the softmax function.
* The probabilities describe the relationship between the prompt and the next sentence.
* The tokenizer inserts a `[SEP]` token between the two sentences.
* The model outputs one pair of logits for the sentence pair, which softmax converts to probabilities.
* Unlike MLM, where softmax runs over the whole vocabulary, here softmax runs over only 2 classes.

Output interpretation
* Index 0 is the probability that the 2nd sentence follows the first, and index 1 is the probability that it is a random, unrelated sentence.
* In this case the first probability (about 0.99999) is far larger than the second (about 7.7e-06), so the model predicts that the 2nd sentence follows the first.
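
The interpretation above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the helper name `nsp_decision` is hypothetical and the logit values are made up rather than taken from the model. The label convention (index 0 means "B continues A", index 1 means "B is random") follows the Hugging Face `BertForNextSentencePrediction` documentation.

```python
import math


def nsp_decision(logits):
    """Map the two NSP logits to a label and the probability of 'is next'.

    logits[0] scores "sentence B continues sentence A";
    logits[1] scores "sentence B is a random sentence".
    """
    shift = max(logits)  # for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    p_is_next = exps[0] / sum(exps)
    label = "is next" if p_is_next >= 0.5 else "not next"
    return label, p_is_next


# Made-up logits for a pair the model considers consecutive
label, p_is_next = nsp_decision([6.0, -5.8])
print(label, round(p_is_next, 5))

# Made-up logits for a pair the model considers unrelated
print(nsp_decision([-2.0, 3.0])[0])
```

Because there are only two classes, the two probabilities always sum to 1, so reading index 0 alone is enough to make the decision.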

# 3. Task 3: Question Answering
* We can use a specific BERT model for this task: `BertForQuestionAnswering`
* We will use this checkpoint, a cased BERT base model fine-tuned on the SQuAD 2.0 dataset: https://huggingface.co/deepset/bert-base-cased-squad2

In [57]:
from transformers import BertTokenizer, BertForQuestionAnswering
#from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# tokenizer
tokenizer = BertTokenizer.from_pretrained("deepset/bert-base-cased-squad2")

# model
model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Question-Answer Task with BERT
* We need to give the model example text and an example question.

In [58]:
## prompts
example_text = "GPT-3 came out in 2020"

example_question = "When did GPT-3 come out"

In [59]:
# tokenize inputs
tokenized_inputs = tokenizer(example_question, example_text, return_tensors='pt')

# print tokenized inputs
tokenized_inputs

{'input_ids': tensor([[  101,  1332,  1225, 15175,  1942,   118,   124,  1435,  1149,   102,
         15175,  1942,   118,   124,  1338,  1149,  1107, 12795,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Summary of the tokenizer output:
* input_ids
  * indices of the input tokens in the tokenizer's vocabulary.
* token_type_ids
  * also called "segment ids"
    * A binary mask identifying which of the 2 sequences each token belongs to: 0 for the first sequence (here the question) and 1 for the second (here the text that contains the answer).
  * Some tasks, such as question answering, take a pair of sequences as a single input.
  * `[CLS] SEQUENCE A [SEP] SEQUENCE B [SEP]` --> these are the 2 input sequences, separated by the `[SEP]` token.
* attention_mask
  * 1 for real tokens and 0 for padding tokens. It is all 1s here because no padding was added.
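
As a hedged sketch of how a sentence pair is laid out, the snippet below rebuilds the three fields by hand from the token ids printed above. The helper name `build_pair_inputs` is hypothetical. The ids 101 and 102 are the `[CLS]` and `[SEP]` ids visible in the printed output. The real tokenizer does this internally (and also handles truncation and padding).

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids as seen in the printed input_ids


def build_pair_inputs(ids_a, ids_b):
    """Lay out two sequences as [CLS] A [SEP] B [SEP] with segment ids."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # no padding, so every token is real
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Token ids of the question and of the text, without special tokens
question_ids = [1332, 1225, 15175, 1942, 118, 124, 1435, 1149]
text_ids = [15175, 1942, 118, 124, 1338, 1149, 1107, 12795]

pair = build_pair_inputs(question_ids, text_ids)
print(pair["token_type_ids"])
```

The result has ten 0s followed by nine 1s, matching the `token_type_ids` tensor printed by the tokenizer.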

## Testing the BERT question answering model.
* We will give the model a longer input passage and a question about it.

In [61]:
text = "The Vatican Apostolic Library more commonly known as the Vatican Library or informally as the Vat, is the library of the Holy See, located in Vatican City, and is the city-state's national library. It was formally established in 1475, although it is much older—it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. The Vatican Library is a research library for history, law, philosophy, science, and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. Pope Nicholas V (1447–1455) envisioned a new Rome, with extensive public works to lure pilgrims and scholars to the city to begin its transformation. Nicolas wanted to create a 'public library' for Rome that was meant to be seen as an institution for humanist scholarship. His death prevented him from carrying out his plan, but his successor Pope Sixtus IV (1471–1484) established what is now known as the Vatican Library. In March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. The Vatican Apostolic Archive was separated from the library at the beginning of the 17th century; it contains another 150,000 items."

question = "When was the Vat formally opened?"

In [62]:
## tokenizer and model
tokenizer = BertTokenizer.from_pretrained("deepset/bert-base-cased-squad2")

# model
model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [63]:
## redefine tokenized inputs for new inputs
tokenized_inputs = tokenizer(question, text, return_tensors='pt')

# disable gradient tracking, since we are only running inference
with torch.no_grad():
  outputs = model(**tokenized_inputs)


# most likely start and end token positions of the answer
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()


# predict answer tokens - input_ids from dict
predict_answer_tokens = tokenized_inputs.input_ids[0, answer_start_index : answer_end_index + 1]

# decode to see answer
tokenizer.decode(predict_answer_tokens)

'1475'

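Summary:
* The model's answer, 1475, is correct.
* BERT is an encoder-only model, so it does not generate the answer. It attends over the question and the passage together, scores every token as a possible start and end of the answer, and the highest-scoring span is extracted from the passage.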
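
The span extraction step can be sketched without a model. The cell above takes the argmax of the start and end logits independently, which can in principle produce an end position before the start. The pure-Python sketch below uses a slightly different, commonly used variant instead: it searches for the pair with the highest combined score subject to `start <= end`. The helper name `best_span`, the `max_answer_len` parameter, the toy tokens and the logit values are all assumptions made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return the (start, end) pair with the highest start + end score.

    Only spans with start <= end and at most max_answer_len tokens are
    considered, so the result is always a valid slice.
    """
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Toy tokens and made-up logits (not real model outputs)
tokens = ["[CLS]", "when", "?", "[SEP]", "established", "in", "1475", "[SEP]"]
start_logits = [0.1, -1.0, -2.0, -1.5, 0.3, 0.2, 5.0, -1.0]
end_logits = [0.0, -1.2, -2.1, -1.0, 0.1, 0.4, 4.8, -0.5]

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

When the model is confident, as in the Vatican Library example, both approaches pick the same span. The constrained search only matters for ambiguous inputs.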