<a href="https://colab.research.google.com/github/agnxsh/token-embeddings-BERT/blob/main/token_embeddings_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [45]:
!pip install transformers -q

In [46]:
from transformers import BertModel, BertTokenizer
import torch

In [47]:
model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [48]:
sentence = "She is a MachineLearning Engineer and works in California"

In [49]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [50]:
tokens = tokenizer.tokenize(sentence)

In [51]:
print(tokens)

['she', 'is', 'a', 'machine', '##lea', '##rn', '##ing', 'engineer', 'and', 'works', 'in', 'california']
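The split of `machinelearning` into `machine`, `##lea`, `##rn`, `##ing` comes from WordPiece, which greedily takes the longest vocabulary entry that matches at each position and marks continuation pieces with `##`. Below is a minimal sketch of that matching step only. The helper name `wordpiece` and the tiny `toy_vocab` are made up for illustration; the real tokenizer also lowercases and splits punctuation first, and uses the full bert-base-uncased vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for this example.
toy_vocab = {"machine", "##lea", "##rn", "##ing", "engineer"}

print(wordpiece("machinelearning", toy_vocab))
print(wordpiece("engineer", toy_vocab))
```

With this toy vocabulary the first call reproduces the four pieces shown above, and a word with no matching piece falls back to `[UNK]`.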


####Adding the [CLS] and [SEP] tokens so that BERT can identify the beginning and the end of the sentence.

In [52]:
tokens = ['[CLS]'] + tokens + ['[SEP]']

In [53]:
print(tokens)

['[CLS]', 'she', 'is', 'a', 'machine', '##lea', '##rn', '##ing', 'engineer', 'and', 'works', 'in', 'california', '[SEP]']


##Adding [PAD] tokens so that every sequence has the same length (16 tokens here)

In [54]:
tokens = tokens + ['[PAD]'] + ['[PAD]']

In [55]:
print(tokens)

['[CLS]', 'she', 'is', 'a', 'machine', '##lea', '##rn', '##ing', 'engineer', 'and', 'works', 'in', 'california', '[SEP]', '[PAD]', '[PAD]']


In [56]:
print(len(tokens))

16


###The model must ignore the [PAD] tokens, since they are not part of the sentence. An attention mask does this: 1 marks a real token and 0 marks padding

In [57]:
attention_mask = [1 if token != '[PAD]' else 0 for token in tokens]

In [58]:
print(attention_mask)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
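The padding and masking steps above can be wrapped in one small helper so they work for any target length. This is only a sketch: `pad_and_mask` and its parameters are names chosen here for illustration and are not part of the transformers API.

```python
def pad_and_mask(tokens, max_length, pad_token="[PAD]"):
    """Pad a token list to max_length and build the matching attention mask."""
    if len(tokens) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(tokens)
    padded = tokens + [pad_token] * n_pad
    # 1 for every real token (including [CLS] and [SEP]), 0 for every pad.
    mask = [1] * len(tokens) + [0] * n_pad
    return padded, mask


unpadded = ['[CLS]', 'she', 'is', 'a', 'machine', '##lea', '##rn', '##ing',
            'engineer', 'and', 'works', 'in', 'california', '[SEP]']

padded_tokens, mask = pad_and_mask(unpadded, max_length=16)
print(padded_tokens)
print(mask)
```

Building the mask from the lengths, rather than by comparing strings with `'[PAD]'`, keeps the mask correct even if the text itself happens to contain the pad token.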


##Converting each token to its unique token ID

In [59]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)

####Let's now have a look at the token ids:

In [60]:
print(token_ids)

[101, 2016, 2003, 1037, 3698, 19738, 6826, 2075, 3992, 1998, 2573, 1999, 2662, 102, 0, 0]
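To see which ID belongs to which token, the two lists can be lined up side by side. A small sketch, with both lists copied from the outputs printed above rather than recomputed (the names `example_tokens`, `example_ids` and `token_to_id` are introduced here only for illustration):

```python
example_tokens = ['[CLS]', 'she', 'is', 'a', 'machine', '##lea', '##rn', '##ing',
                  'engineer', 'and', 'works', 'in', 'california', '[SEP]',
                  '[PAD]', '[PAD]']
example_ids = [101, 2016, 2003, 1037, 3698, 19738, 6826, 2075,
               3992, 1998, 2573, 1999, 2662, 102, 0, 0]

# Print each token next to its ID.
for token, token_id in zip(example_tokens, example_ids):
    print(f"{token:<12} {token_id}")

# A lookup table built from the same pairs.
token_to_id = dict(zip(example_tokens, example_ids))
print(token_to_id['[CLS]'], token_to_id['[SEP]'], token_to_id['[PAD]'])
```

The special tokens stand out in this view: `[CLS]` is 101, `[SEP]` is 102 and `[PAD]` is 0.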


In [62]:
# The model expects a batch, so unsqueeze(0) adds a batch dimension: shape [16] becomes [1, 16]
token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)

##Now we feed the token_ids and attention_mask to the pre-trained BERT model and get the embedding

In [65]:
output = model(token_ids, attention_mask=attention_mask)

In [66]:
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1925,  0.1684, -0.4252,  ..., -0.2599,  0.3736,  0.0529],
         [ 0.2417, -0.2748, -0.4909,  ...,  0.1372,  0.3408, -0.4655],
         [-0.0871,  0.0837,  0.2605,  ..., -0.4635, -0.0462,  0.2621],
         ...,
         [ 0.6711, -0.0076, -0.3847,  ..., -0.1289, -0.5171, -0.8002],
         [-0.2731,  0.1098, -0.5440,  ...,  0.0314,  0.4467, -0.3448],
         [-0.2387,  0.0119, -0.4760,  ...,  0.4656,  0.5837, -0.3774]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.9531, -0.4914, -0.8872,  0.9035,  0.8174, -0.2919,  0.9511,  0.4982,
         -0.7595, -1.0000, -0.6996,  0.9459,  0.9890,  0.4754,  0.9723, -0.8460,
         -0.1423, -0.7209,  0.4428, -0.7905,  0.7822,  1.0000,  0.2119,  0.4066,
          0.5813,  0.9923, -0.8380,  0.9670,  0.9746,  0.8324, -0.8227,  0.4136,
         -0.9931, -0.2821, -0.8860, -0.9961,  0.5261, -0.8722, -0.0915, -0.0950,
         -0.9237,  0.5106,  1.00

###The last hidden state is the output to notice: it holds one contextual embedding per token, and it is usually the layer on top of which we add a classifier or another model

In [68]:
output[0].shape
# output[0] is the last hidden state, with shape [batch_size, sequence_length, hidden_size]

torch.Size([1, 16, 768])

###pooler_output is a single vector per sentence, computed from the final hidden state of the '[CLS]' token (passed through a linear layer and a tanh). Its shape is batch_size x hidden_size, so it is 2D

In [69]:
# output[1], the second element of the output, is the pooler_output

In [70]:
output[1].shape

torch.Size([1, 768])
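The relationship between the two outputs can be sketched without loading the model. The example below uses a random tensor as a stand-in for the last hidden state and a randomly initialised linear layer plus tanh as a stand-in for BERT's pooler, so the numbers are meaningless; only the indexing and the shapes are the point. The names `fake_hidden`, `cls_embedding`, `california_embedding` and `toy_pooler` are assumptions made for this illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size = 768

# Stand-in for output[0]: [batch_size=1, sequence_length=16, hidden_size=768]
fake_hidden = torch.randn(1, 16, hidden_size)

# Position 0 is '[CLS]'; position 12 is 'california' in the token list above.
cls_embedding = fake_hidden[:, 0]            # shape [1, 768]
california_embedding = fake_hidden[0, 12]    # shape [768]

# Stand-in pooler: a dense layer followed by tanh, applied to the [CLS] vector.
toy_pooler = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
with torch.no_grad():
    pooled = toy_pooler(cls_embedding)

print(cls_embedding.shape, california_embedding.shape, pooled.shape)
```

The tanh is why every value in the real pooler_output lies between -1 and 1.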

#BERT-base has 12 encoder layers, and the last hidden state is the output of the topmost (12th) layer of the BERT transformer
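If the outputs of the lower layers are needed as well, the model can be asked for them with `output_hidden_states=True`. The sketch below assumes a tiny, randomly initialised BERT built from a `BertConfig` (the sizes and the name `tiny_model` are made up here) so that nothing has to be downloaded; with the pre-trained model the same argument applies, only the hidden size becomes 768.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical small configuration: 12 encoder layers, but a narrow hidden size.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=12,
                    num_attention_heads=4, intermediate_size=64)
tiny_model = BertModel(config)   # random weights, not the pre-trained checkpoint
tiny_model.eval()

# Arbitrary example IDs inside the toy vocabulary; the last position is padding.
input_ids = torch.tensor([[1, 5, 6, 7, 2, 0]])
mask = torch.tensor([[1, 1, 1, 1, 1, 0]])

with torch.no_grad():
    out = tiny_model(input_ids, attention_mask=mask, output_hidden_states=True)

# hidden_states holds the embedding output followed by one tensor per encoder layer.
print(len(out.hidden_states))
print(out.hidden_states[-1].shape)
print(torch.allclose(out.hidden_states[-1], out.last_hidden_state))
```

The last entry of `hidden_states` is the same tensor as `last_hidden_state`, and earlier entries give the outputs of the lower layers.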