## **Download and Load the Pretrained Model and Tokenizer**

In [1]:
from transformers import AutoTokenizer, AutoModel

model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')




In [2]:
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [3]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

## **Preprocess the Input**

In [4]:
sentence = "I am working as a GenAI Engineer."

tokens = tokenizer.tokenize(sentence)

print("Tokens:\n", tokens)

Tokens:
 ['i', 'am', 'working', 'as', 'a', 'gen', '##ai', 'engineer', '.']
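
The split of "genai" into 'gen' and '##ai' comes from WordPiece's greedy longest-match-first lookup. The sketch below illustrates the idea in plain Python. It is not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the four-entry vocabulary stands in for the real 30522-entry one.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # nothing matched: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption): just enough entries to reproduce the split above.
toy_vocab = {"gen", "##ai", "engineer", "working"}

print(wordpiece_tokenize("genai", toy_vocab))
print(wordpiece_tokenize("engineer", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Words found whole in the vocabulary stay as one token. Everything else is broken into the longest known pieces from left to right.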


In [5]:
token_ids = tokenizer(sentence)

print("Token Ids:\n", token_ids)

Token Ids:
 {'input_ids': [101, 1045, 2572, 2551, 2004, 1037, 8991, 4886, 3992, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [6]:
# decode() converts the token IDs back to text, which lets us check
# what the tokenizer's final output looks like
print("Decoded Text Output:\n", tokenizer.decode(token_ids["input_ids"]))

Decoded Text Output:
 [CLS] i am working as a genai engineer. [SEP]
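
To see why 'gen' and '##ai' come back as "genai" and why the space before the period disappears, here is a simplified sketch of the joining step. `decode_wordpieces` is a hypothetical helper, not the tokenizer's own code, and its punctuation clean-up covers only a few marks.

```python
def decode_wordpieces(tokens):
    """Join WordPiece tokens into text (simplified sketch of the decode step)."""
    text = " ".join(tokens)
    text = text.replace(" ##", "")  # glue continuation pieces onto the previous piece
    # Simplified clean-up (assumption): drop the space before common punctuation.
    for mark in (".", ",", "?", "!"):
        text = text.replace(" " + mark, mark)
    return text


tokens = ["[CLS]", "i", "am", "working", "as", "a", "gen", "##ai", "engineer", ".", "[SEP]"]
decoded = decode_wordpieces(tokens)
print(decoded)  # [CLS] i am working as a genai engineer. [SEP]
```

The original casing ("GenAI") is not recovered, because the uncased tokenizer lowercases the text before splitting it.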


## **Generating Embeddings**

In [7]:
token_ids = tokenizer(sentence, padding=True, truncation=True, max_length=15, return_tensors="pt")

print(token_ids)

{'input_ids': tensor([[ 101, 1045, 2572, 2551, 2004, 1037, 8991, 4886, 3992, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
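
With a single sentence, padding=True changes nothing, since there is no longer sequence to pad to. That is why the attention mask above is all ones. The plain-Python sketch below shows what padding and the attention mask look like for a batch of two sequences. `pad_batch` is a hypothetical helper, the second sequence reuses the IDs for "i am" from the output above, and the truncation is simplified: it just cuts the list, whereas the real tokenizer keeps the closing [SEP].

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Truncate each sequence to max_length, then pad to the longest in the batch."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)  # [PAD] has ID 0 in this vocabulary
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 1045, 2572, 2551, 2004, 1037, 8991, 4886, 3992, 1012, 102],  # full sentence
    [101, 1045, 2572, 102],  # "[CLS] i am [SEP]"
]
padded = pad_batch(batch, pad_id=0, max_length=15)
for row, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(row, mask)
```

Padding goes on the right here, matching the padding_side='right' setting shown in the tokenizer's configuration.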


In [8]:
# Pass the preprocessed inputs directly to the model.
# The ** operator unpacks the dictionary into keyword arguments
# (input_ids, token_type_ids, attention_mask)
outputs = model(**token_ids)

**Note** that the model returns an output object, which can also be indexed like a tuple. We use two of its fields here. The first, last_hidden_state, is stored as hidden_rep and holds the representation of every token from the final encoder layer (layer 12). The second, pooler_output, is stored as cls_head and holds the final-layer representation of the [CLS] token after it has passed through an additional linear layer and a tanh activation:

In [12]:
hidden_rep = outputs.last_hidden_state
cls_head = outputs.pooler_output

**hidden_rep**  
hidden_rep contains the embedding of every token in our input, including the special [CLS] and [SEP] tokens.

Its shape is [batch_size, sequence_length, embedding_size].

**cls_head**  
cls_head holds a single vector for the whole sentence. It is derived from the [CLS] token and is commonly used as an aggregated representation of the sentence.

Its shape is [batch_size, embedding_size].

In [13]:
print(hidden_rep.shape)

torch.Size([1, 11, 768])


In [14]:
print(cls_head.shape)

torch.Size([1, 768])
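
The relationship between the two outputs can be sketched in plain Python: the pooled vector is the first token's final hidden state passed through a linear layer and tanh. Everything below is a toy assumption for illustration: a hidden size of 4 instead of 768, three tokens, and random weights in place of the model's learned pooler parameters.

```python
import math
import random

random.seed(0)
hidden_size = 4  # toy size (assumption); bert-base-uncased uses 768
seq_len = 3      # toy sequence length (assumption)

# Stand-in for hidden_rep with the batch dimension dropped: [seq_len][hidden_size]
last_hidden_state = [
    [random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(seq_len)
]

# Stand-in pooler parameters (random here; learned in the real model)
weight = [
    [random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)
]
bias = [0.0] * hidden_size

cls_vector = last_hidden_state[0]  # [CLS] is always the first token

# pooled = tanh(W @ cls_vector + b), computed one output unit at a time
pooled = [
    math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
    for row, b in zip(weight, bias)
]

print(len(pooled))  # one value per hidden unit
print(pooled)
```

Because of the tanh, every value in the pooled vector lies between -1 and 1, while the raw [CLS] row of hidden_rep is not bounded in this way.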
