# BERT Embeddings and Vectors
* Notebook by Adam Lang
* Date: 6/11/2024
* We will go through the intricacies of BERT transformer models.

In [1]:
# install transformers
!pip install transformers -q

In [2]:
# imports
from transformers import BertModel, BertTokenizer
import torch

## BERT-base-uncased
* Model pretrained on English text using a masked language modeling (MLM) objective. It was introduced in the BERT paper (Devlin et al., 2018) and first released in the google-research/bert repository. **This model is uncased: it treats "english" and "English" as the same.**
* model card: https://huggingface.co/google-bert/bert-base-uncased

In [3]:
# instantiate the model from huggingface
model = BertModel.from_pretrained('bert-base-uncased')


In [34]:
## test sentence
sentence = "She is a Machine Learning Engineer and works in California."

## BERT Tokenizer
* Define tokenizer from huggingface
* Then apply it to the text.

In [35]:
## tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


#### Tokenize sentence and obtain tokens

In [36]:
tokens = tokenizer.tokenize(sentence)

#### Print out the tokens we obtained

In [37]:
tokens

['she',
 'is',
 'a',
 'machine',
 'learning',
 'engineer',
 'and',
 'works',
 'in',
 'california',
 '.']

In [38]:
print(tokens)

['she', 'is', 'a', 'machine', 'learning', 'engineer', 'and', 'works', 'in', 'california', '.']


In [39]:
print(len(tokens))

11


#### Now add the [CLS] token at beginning and [SEP] token at end of tokens list:
* CLS is the classification token.
* SEP is the separator token usually at the end of a sentence.

In [40]:
tokens = ['[CLS]'] + tokens + ['[SEP]']

#### Let's see the updated tokens list

In [41]:
print(tokens)

['[CLS]', 'she', 'is', 'a', 'machine', 'learning', 'engineer', 'and', 'works', 'in', 'california', '.', '[SEP]']


In [42]:
print(len(tokens))

13


Summary:
* We can see the token list grew in length by 2 after adding `[CLS]` and `[SEP]`.
* However, BERT processes inputs in batches, and every sequence in a batch must have the same length.
* Therefore shorter sequences are filled with padding (`[PAD]`) tokens.
  * Let's say for this example we want a fixed length of 15 tokens, so we add 2 `[PAD]` tokens.

In [43]:
tokens = tokens + ['[PAD]'] + ['[PAD]']

#### Print updated tokens list with PADs

In [44]:
print(tokens)

['[CLS]', 'she', 'is', 'a', 'machine', 'learning', 'engineer', 'and', 'works', 'in', 'california', '.', '[SEP]', '[PAD]', '[PAD]']


In [45]:
## len of tokens
print(len(tokens))

15


We can see that the length has grown by 2 again, this time from adding the `[PAD]` tokens.
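
The manual padding above can be generalized with a small helper. This is a minimal sketch, not part of the `transformers` API: `pad_tokens` and its parameters are hypothetical names chosen for illustration, and the target length of 15 is the one used in this notebook.

```python
def pad_tokens(tokens, max_length, pad_token="[PAD]"):
    """Return a copy of `tokens` right-padded with `pad_token` up to `max_length`."""
    if len(tokens) > max_length:
        raise ValueError(
            f"sequence has {len(tokens)} tokens, more than max_length={max_length}"
        )
    # Append as many pad tokens as are needed to reach the target length.
    return tokens + [pad_token] * (max_length - len(tokens))


tokens_with_special = ['[CLS]', 'she', 'is', 'a', 'machine', 'learning', 'engineer',
                       'and', 'works', 'in', 'california', '.', '[SEP]']
padded = pad_tokens(tokens_with_special, max_length=15)
print(len(padded))   # 15
print(padded[-2:])   # ['[PAD]', '[PAD]']
```

A sequence longer than `max_length` would need truncation instead; the sketch raises an error in that case rather than silently dropping tokens.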

## Attention Mask
* An attention mask is a binary mask that designates which tokens should be attended to (assigned non-zero weights such as 1) and which should be ignored (assigned zero weights).
* By applying this mask, the model can selectively attend to specific tokens while disregarding others.

In [46]:
attention_mask = [1 if token != '[PAD]' else 0 for token in tokens]

In [47]:
print(attention_mask)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]


Summary
* We see 1 where there are actual tokens and 0 where there are `[PAD]` tokens, which are not actual tokens.
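
To make "ignored" concrete, here is a simplified sketch of one common way a mask is applied inside attention: a large negative number is added to the scores of masked positions before the softmax, so their weights come out as zero. The function name `masked_softmax` and the example scores are assumptions for illustration; this handles a single row of scores in plain Python and is not the library's actual implementation.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over `scores` where positions with mask 0 receive zero weight."""
    # Push masked positions far down so exp() sends them to 0.
    masked = [s if m == 1 else s - 1e9 for s, m in zip(scores, attention_mask)]
    max_score = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores for 5 positions; the last two are padding.
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
mask = [1, 1, 1, 0, 0]
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Even though the padded positions have the highest raw scores here, they end up with zero weight, and the remaining weights still sum to 1.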

## Unique Token ID
* The concept is to map each token to a unique token ID.
* Why? The model works on integers, not strings. The BERT base model has a fixed vocabulary of 30,522 tokens, and each unique token string maps to a unique integer ID. The input IDs are these integer IDs for each token in a sequence.
* We do this with the method:
  * `convert_tokens_to_ids()`
* Further reading on this: https://medium.com/@alexmriggio/bert-for-sequence-classification-from-scratch-code-and-theory-fb88053800fa

In [48]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)

In [49]:
# print token_ids
print(token_ids)

[101, 2016, 2003, 1037, 3698, 4083, 3992, 1998, 2573, 1999, 2662, 1012, 102, 0, 0]
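
Conceptually, this conversion is a dictionary lookup from token string to integer. The sketch below rebuilds it with a toy vocabulary that contains only the IDs visible in the output above, plus `[UNK]` (ID 100 in `bert-base-uncased`) as the fallback for unknown tokens. The names `toy_vocab` and `toy_convert_tokens_to_ids` are hypothetical; the real tokenizer loads the full 30,522-entry vocabulary.

```python
# Toy vocabulary: only the entries seen in this notebook's output, plus [UNK].
toy_vocab = {
    '[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102,
    'she': 2016, 'is': 2003, 'a': 1037, 'machine': 3698, 'learning': 4083,
    'engineer': 3992, 'and': 1998, 'works': 2573, 'in': 1999,
    'california': 2662, '.': 1012,
}


def toy_convert_tokens_to_ids(tokens, vocab, unk_token='[UNK]'):
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


example_tokens = ['[CLS]', 'she', 'is', 'a', 'machine', 'learning', 'engineer',
                  'and', 'works', 'in', 'california', '.', '[SEP]', '[PAD]', '[PAD]']
example_ids = toy_convert_tokens_to_ids(example_tokens, toy_vocab)
print(example_ids)
```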


Summary:
* Each token in the sequence is replaced by its token ID, in the same position.
* Why is this important?
  * Each unique token ID maps to one vector (one row) in the model's embedding matrix.
  * With a vocabulary of 30,522 tokens, the embedding matrix has 30,522 rows; only the rows for the token IDs in the input are looked up and fed into the BERT model.
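
The lookup itself is plain row indexing into a table. Below is a minimal sketch with random numbers standing in for the learned weights; `TOY_HIDDEN_SIZE = 4` is an assumption to keep the example small (the real hidden size is 768), and the variable names are illustrative only. In the real model, position and segment embeddings are also added to these token vectors.

```python
import random

VOCAB_SIZE = 30522    # vocabulary size of bert-base-uncased
TOY_HIDDEN_SIZE = 4   # assumption for readability; the real model uses 768

rng = random.Random(0)
# One row per vocabulary entry (random stand-ins for learned weights).
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(TOY_HIDDEN_SIZE)]
                    for _ in range(VOCAB_SIZE)]

example_ids = [101, 2016, 2003, 1037, 3698, 4083, 3992, 1998,
               2573, 1999, 2662, 1012, 102, 0, 0]

# Each token ID selects one row of the table.
token_vectors = [embedding_matrix[token_id] for token_id in example_ids]

print(len(embedding_matrix))                      # 30522 rows in the table
print(len(token_vectors), len(token_vectors[0]))  # 15 4
```

Note that the two `[PAD]` positions share ID 0, so they select the same row.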

#### Next steps
* Both the `token_ids` and the `attention_mask` need the following:
    * Convert to a tensor with `torch.tensor()`
    * Apply `.unsqueeze(0)` --> adds a batch dimension at position 0, so the shape goes from `[15]` to `[1, 15]`

In [50]:
token_ids = torch.tensor(token_ids).unsqueeze(0)

attention_mask = torch.tensor(attention_mask).unsqueeze(0)
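
A quick shape check shows what `.unsqueeze(0)` does. This sketch uses separate variable names (`ids`, `batched`) so it does not overwrite the notebook's tensors; the IDs are the ones printed earlier.

```python
import torch

ids = torch.tensor([101, 2016, 2003, 1037, 3698, 4083, 3992, 1998,
                    2573, 1999, 2662, 1012, 102, 0, 0])
print(ids.shape)                 # torch.Size([15])

# Add a leading batch dimension: one sequence of 15 tokens.
batched = ids.unsqueeze(0)
print(batched.shape)             # torch.Size([1, 15])

# squeeze(0) is the inverse and removes that dimension again.
print(batched.squeeze(0).shape)  # torch.Size([15])
```

The model expects input of shape `[batch_size, sequence_length]`, which is why a single sentence still needs the extra dimension.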

## Feed token_ids + attention_masks to model

In [51]:
output = model(token_ids, attention_mask=attention_mask)

In [52]:
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0655,  0.0770, -0.6389,  ..., -0.2221,  0.4882,  0.2644],
         [ 0.2520, -0.3882, -0.8195,  ...,  0.1664,  0.3918, -0.2453],
         [-0.1106,  0.2444, -0.0539,  ..., -0.4896,  0.1578,  0.3824],
         ...,
         [-0.1378,  0.3008, -0.9749,  ...,  0.3532, -0.0635, -0.4823],
         [ 0.1346,  0.2880, -0.2892,  ...,  0.2452,  0.1708, -0.0909],
         [ 0.0195,  0.3129, -0.2731,  ...,  0.1813,  0.1968, -0.3638]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.9452, -0.5548, -0.9395,  0.8934,  0.8746, -0.3437,  0.9400,  0.5532,
         -0.8653, -1.0000, -0.7528,  0.9677,  0.9864,  0.5806,  0.9565, -0.8545,
         -0.2569, -0.7218,  0.4626, -0.7341,  0.7854,  1.0000, -0.0434,  0.4335,
          0.5901,  0.9967, -0.8644,  0.9565,  0.9661,  0.8603, -0.8072,  0.4597,
         -0.9935, -0.2990, -0.9463, -0.9957,  0.6134, -0.8441, -0.0728, -0.1589,
         -0.9138,  0.5703,  1.00

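Summary of output
1. last_hidden_state
  * One vector per token, with shape `[batch_size, sequence_length, hidden_size]`; 768 is the hidden size, as we can see below
2. pooler_output -- the final hidden state of the `[CLS]` token passed through a linear layer and a tanh activation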

In [56]:
# hidden_state output
output[0].shape

torch.Size([1, 15, 768])

In [55]:
# pooler output
output[1].shape

torch.Size([1, 768])
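
The relationship between the two outputs can be sketched as follows: the pooler takes the hidden state at position 0 (the `[CLS]` token), applies a linear layer, then a tanh. The tensors below are random stand-ins, and `dense` is a freshly initialized layer rather than the model's learned pooler weights, so only the shapes and value range are meaningful here.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size = 768

# Random stand-in for the model's last_hidden_state: [batch, tokens, hidden].
last_hidden_state = torch.randn(1, 15, hidden_size)

# Stand-in for the pooler's learned linear layer (assumption: random weights).
dense = nn.Linear(hidden_size, hidden_size)

cls_hidden = last_hidden_state[:, 0]     # hidden state of [CLS]: [1, 768]
pooled = torch.tanh(dense(cls_hidden))   # tanh keeps values within [-1, 1]

print(cls_hidden.shape)  # torch.Size([1, 768])
print(pooled.shape)      # torch.Size([1, 768])
```

The tanh explains why every value in the `pooler_output` printed earlier lies between -1 and 1.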

# Summary
* We went over the core structure of a BERT model and BERT embeddings.
* Let's review the steps again:
1. load model from HF
2. Tokenizer
	* instantiate
	* apply to text
3. Tokens
	* view the tokens
	* add [CLS] classification + [SEP] separator tokens
	* Add [PAD] tokens to standardize model input

4. Attention Mask
	* 1 for real tokens
	* 0 for non-real tokens (e.g. PADs)

5. Unique token ids
	* This helps with vectorization + embeddings
	* Each token gets a unique ID in the vocabulary, which is used to look up its embedding vector

6. Feed tokens + attention_masks to model

7. Evaluate output
	* hidden state
	* pooler output (derived from the [CLS] token)
