# Transformers - Embeddings

This notebook experiments with embeddings using the Hugging Face Transformers library.

# Setup Notebook

## Imports

In [None]:
# Import standard and third-party libraries
import json
from transformers import AutoTokenizer, AutoModel
import torch

# Experiments

## AutoTokenizer with BERT

### Tokenization

In [3]:
# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

The tokenizer converts raw text into tokens.

**The Process:**
1) Split the text with WordPiece tokenization (illustratively, "playing" → ["play", "##ing"]; words already in the vocabulary stay whole)
2) Map the tokens to numerical IDs (based on BERT's vocabulary)
3) Add the special tokens (`[CLS]` and `[SEP]`)
4) Pad or truncate the sequence to fit the model's input length
5) Create the attention mask
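The steps above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the vocabulary `TOY_VOCAB`, the helper names `wordpiece` and `encode`, and the `max_length` value are all hypothetical, and the real tokenizer also handles punctuation, casing rules, and batching.

```python
# Toy vocabulary: hypothetical, for illustration only (not BERT's real vocabulary).
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "love": 5, "play": 6, "##ing": 7, "data": 8,
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length=8):
    """Tokenize one sentence and return tokens, padded IDs, and attention mask."""
    # WordPiece step: lowercase, split on whitespace, split each word into pieces
    tokens = [p for w in text.lower().split() for p in wordpiece(w, vocab)]
    # Special tokens step: truncate so [CLS] and [SEP] still fit
    tokens = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    # ID step: look each token up in the vocabulary
    input_ids = [vocab[t] for t in tokens]
    # Attention mask step: real tokens get 1
    attention_mask = [1] * len(input_ids)
    # Padding step: fill up to max_length; padding positions get mask 0
    pad = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    attention_mask += [0] * pad
    return tokens, input_ids, attention_mask


tokens, input_ids, attention_mask = encode("I love playing", TOY_VOCAB)
print(tokens)          # ['[CLS]', 'i', 'love', 'play', '##ing', '[SEP]']
print(input_ids)       # [2, 4, 5, 6, 7, 3, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The mask lets the model ignore the two padding positions at the end of the sequence.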

In [4]:
text = "I love Data Science"
tokens = tokenizer(text, return_tensors="pt")  # Convert to PyTorch tensors

print('Tokens:', tokens.input_ids)
print('Attention Mask:', tokens.attention_mask)

Tokens: tensor([[ 101, 1045, 2293, 2951, 2671,  102]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1]])


- 101 = `[CLS]` token (start of sequence)
- 1045 = "i", 2293 = "love", 2951 = "data", 2671 = "science" (the uncased model lowercases the input)
- 102 = `[SEP]` token (end of sequence)

### Embeddings

The model processes the tokenized input sequence and maps each token to a dense vector representation called an "embedding".



In [5]:
# Instantiate the model
model = AutoModel.from_pretrained("bert-base-uncased")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [6]:
# Pass the tokenized input through the model (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**tokens)

# Extract the last hidden state (one contextual embedding per token)
embeddings = outputs.last_hidden_state

print(embeddings.shape)  # (batch_size, sequence_length, hidden_size)

torch.Size([1, 6, 768])


Each of the 6 tokens (the four words plus `[CLS]` and `[SEP]`) gets a 768-dimensional vector that captures its meaning in context.
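To compare whole sentences, the per-token vectors have to be reduced to one vector per sentence. One common way (an assumption here, not something this notebook runs) is to average the token vectors while skipping padding, then compare sentences by cosine similarity. The sketch below uses plain lists and made-up 3-dimensional vectors so the arithmetic stays visible; `mean_pool` and `cosine_similarity` are hypothetical helper names.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding rows."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors (real BERT vectors have 768 dimensions).
sentence_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0], [9.0, 9.0, 9.0]]  # last row is padding
mask_a = [1, 1, 0]
sentence_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
mask_b = [1, 1]

vec_a = mean_pool(sentence_a, mask_a)  # padding row is excluded from the average
vec_b = mean_pool(sentence_b, mask_b)
print(vec_a, vec_b)
print(cosine_similarity(vec_a, vec_b))  # close to 1.0: both pooled vectors point the same way
```

With real model outputs, the same idea would be applied to `last_hidden_state` and `attention_mask` using tensor operations instead of lists.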

## Sentence Transformers

In [4]:
# Define the sentences
sentences = [
    "I took my dog for a walk",
    "Today is going to rain",
    "I took my cat for a walk"
]

In [5]:
# Define tokenizer and the model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

RuntimeError: Failed to import transformers.models.bert.modeling_bert because of the following error (look up to see its traceback):
No module named '_lzma'

In [6]:
import lzma

ModuleNotFoundError: No module named '_lzma'
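
The `_lzma` error above usually means the Python interpreter itself was built without liblzma (xz) support, so the standard-library `lzma` module cannot load its C extension; it is not a fault in `transformers`. A minimal check, assuming nothing beyond the standard library (the helper name `lzma_available` is hypothetical):

```python
import importlib
import sys


def lzma_available():
    """Return True if the standard-library lzma module can be imported."""
    try:
        importlib.import_module("lzma")
        return True
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return False


if lzma_available():
    print("lzma is available in", sys.executable)
else:
    print("lzma is missing: this interpreter was likely built without liblzma support")
```

If the module is missing, the usual remedy is to install the system's xz/liblzma development package and then rebuild or reinstall the Python interpreter; the exact package name depends on the platform.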