# Embeddings in Transformer Models

Embeddings are a critical component of transformer models such as BERT and DistilBERT. They convert tokenized text into numerical vectors that capture semantic meaning, which the rest of the model then processes.

## What Are Embeddings?

After tokenization, each token is mapped to a **vector of numbers** in a continuous vector space. These vectors are called **embeddings**.

- Tokens with similar meanings or usage receive similar vectors.
- Relationships between tokens become geometric: the model can compare tokens by the distance or angle between their vectors.
- Embeddings are learned parameters, trained via backpropagation. Pretrained models like BERT and DistilBERT already have embeddings learned from large corpora, which can then be **fine-tuned** for specific tasks.
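
To make the "learned parameters" point concrete, here is a minimal sketch using PyTorch's `nn.Embedding`. The vocabulary size, vector size, and token IDs are made-up toy values, not anything taken from BERT.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes: a 10-token vocabulary with 4-dimensional vectors.
vocab_size, embed_dim = 10, 4
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([2, 5, 2])  # made-up token IDs; ID 2 appears twice
vectors = embedding(token_ids)       # row lookup in the table, shape (3, 4)

# The lookup table is an ordinary trainable parameter.
is_trainable = embedding.weight.requires_grad

# After one backward pass, only the rows that were looked up have a gradient.
vectors.sum().backward()
row_grad_magnitude = embedding.weight.grad.abs().sum(dim=1)
rows_with_grad = row_grad_magnitude.nonzero().flatten().tolist()

print(vectors.shape)
print(is_trainable)
print(rows_with_grad)
```

The same token ID always returns the same row, and backpropagation updates only the rows that took part in the forward pass.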

## Why We Need Embeddings

Neural networks cannot operate on raw text; they require numerical inputs. Embeddings serve as a bridge:

- Translate tokens into numbers
- Represent semantic meaning
- Preserve similarity: similar words are close in vector space
- Provide dense, fixed-length representations for each token

Using pretrained embeddings is a form of **transfer learning**, analogous to reusing pretrained feature extractors in computer vision: we start from representations learned on large corpora instead of training from scratch.
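
As a small illustration of the "dense, fixed-length" point in the list above, the sketch below contrasts a sparse one-hot vector with a dense embedding row. The table `E` is filled with random numbers as a stand-in for learned values, and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 6-token vocabulary, 3-dimensional embeddings.
vocab_size, embed_dim = 6, 3
E = rng.normal(size=(vocab_size, embed_dim))  # stand-in for a learned table

token_id = 4

# Sparse representation: one slot per vocabulary entry, a single 1.
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Dense representation: multiplying the one-hot vector by the table
# selects one row, which is the same as indexing the table directly.
dense_via_matmul = one_hot @ E
dense_via_lookup = E[token_id]

print(one_hot.shape, dense_via_lookup.shape)
print(np.allclose(dense_via_matmul, dense_via_lookup))
```

The one-hot vector grows with the vocabulary, while the embedding keeps the same length regardless of how many tokens exist.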

## Conceptual Example of Embeddings

Imagine a 3-dimensional embedding space (simplified for visualization):

- **Dimension 1:** Wings
- **Dimension 2:** Engine
- **Dimension 3:** Sky

Tokens representing different entities can be mapped into this space:

| Token      | Wings | Engine | Sky |
| ---------- | ----- | ------ | --- |
| Bee        | 2     | 0      | 4   |
| Eagle      | 3     | 0      | 3   |
| Goose      | 3     | 0      | 2   |
| Jet        | 1     | 1      | 1   |
| Helicopter | 0     | 4      | 2   |
| Drone      | 0     | 3      | 3   |
| Rocket     | 0     | 2      | 4   |

- Tokens with similar properties are **close together** in the embedding space (e.g., bee, eagle, and goose).
- Tokens with distinct characteristics are further apart (e.g., jet vs. bee).
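
"Close together" can be checked numerically. The sketch below copies the toy vectors from the table and compares a few pairs with Euclidean distance and cosine similarity. The helper names `euclidean` and `cosine` are introduced here purely for illustration.

```python
import numpy as np

# Toy vectors copied from the table: (wings, engine, sky).
toy = {
    "bee": [2, 0, 4],
    "eagle": [3, 0, 3],
    "goose": [3, 0, 2],
    "jet": [1, 1, 1],
    "helicopter": [0, 4, 2],
    "drone": [0, 3, 3],
    "rocket": [0, 2, 4],
}
vecs = {name: np.array(v, dtype=float) for name, v in toy.items()}


def euclidean(a, b):
    """Straight-line distance between two tokens (smaller = closer)."""
    return float(np.linalg.norm(vecs[a] - vecs[b]))


def cosine(a, b):
    """Cosine of the angle between two tokens (closer to 1 = more similar)."""
    num = vecs[a] @ vecs[b]
    den = np.linalg.norm(vecs[a]) * np.linalg.norm(vecs[b])
    return float(num / den)


pairs = [("eagle", "goose"), ("bee", "eagle"), ("jet", "bee"), ("eagle", "helicopter")]
for a, b in pairs:
    print(f"{a:>6} vs {b:<10} distance={euclidean(a, b):.2f} cosine={cosine(a, b):.2f}")
```

With these numbers, eagle and goose are 1.0 apart, while jet and bee are about 3.32 apart, matching the intuition in the bullets above.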

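> Note: Real transformer embeddings are much larger: BERT-base and DistilBERT use **768 dimensions** per token, and those learned dimensions generally do not correspond to named features such as "wings" or "engine". The 3D example is purely illustrative.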

## How Embeddings Are Used

1. **Input tokenization:** Raw text is split into tokens.
2. **Token IDs:** Each token is converted into a numerical index from the vocabulary.
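3. **Embedding layer:** Token IDs are mapped to high-dimensional vectors through a lookup table. In BERT, position and segment (token type) embeddings are added to each token vector so the model knows token order.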
4. **Transformer layers:** Embeddings are fed into the model's transformer layers for further processing, such as attention and contextual encoding.

These embeddings capture **semantic relationships** between tokens, which are crucial for tasks like classification, translation, or question answering.
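
The four steps can be sketched end to end without downloading a pretrained model. Everything below is a toy stand-in: the whitespace tokenizer, the five-entry `vocab`, the hidden size of 8, and the single randomly initialized encoder layer are all assumptions chosen for illustration, and the sketch omits position embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Step 1: tokenization. A naive whitespace split; real models use subword tokenizers.
text = "i love mathematics !"
tokens = text.split()

# Step 2: token IDs from a hypothetical toy vocabulary.
vocab = {"[UNK]": 0, "i": 1, "love": 2, "mathematics": 3, "!": 4}
token_ids = torch.tensor([[vocab.get(t, vocab["[UNK]"]) for t in tokens]])  # (1, 4)

# Step 3: embedding layer. Each ID selects one row of a learned table.
hidden_size = 8
embedding = nn.Embedding(len(vocab), hidden_size)
static_vectors = embedding(token_ids)  # (1, 4, 8)

# Step 4: one transformer encoder layer mixes information across positions.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=2,
    dim_feedforward=16,
    dropout=0.0,
    batch_first=True,
)
encoder_layer.eval()
with torch.no_grad():
    contextual_vectors = encoder_layer(static_vectors)  # (1, 4, 8)

print(token_ids.tolist())
print(static_vectors.shape, contextual_vectors.shape)
```

The shape is unchanged by the encoder layer; what changes is that each output vector now depends on the other tokens in the sequence.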


```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentence = "I love mathematics!"
```

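```python
# Split the sentence into WordPiece tokens, then map each token to its vocabulary ID.
# Note: this manual path does not add the [CLS] and [SEP] special tokens;
# calling tokenizer(sentence) would add them.
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)
```

```
Tokens: ['i', 'love', 'mathematics', '!']
Token IDs: [1045, 2293, 5597, 999]
```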


```python
# Convert token IDs to a tensor and add a batch dimension: shape (1, seq_len)
input_ids = torch.tensor([token_ids])

# Run the full model. last_hidden_state holds the contextual vectors produced by
# the final transformer layer, not the raw lookup from the embedding layer.
with torch.no_grad():  # No gradients needed for inference
    outputs = model(input_ids)
    embeddings = outputs.last_hidden_state  # shape: (batch_size, seq_len, hidden_size)

print("Embeddings shape:", embeddings.shape)

# Print each token's vector (first 10 dimensions for brevity)
for token, emb in zip(tokens, embeddings[0]):
    print(f"Token: {token}")
    print(f"Embedding (first 10 dims): {emb[:10]}")
```

```
Embeddings shape: torch.Size([1, 4, 768])
Token: i
Embedding (first 10 dims): tensor([ 0.0648,  0.3697,  0.0214,  0.0242, -0.2595,  0.0748,  0.3394,  0.0936,
        -0.2900, -0.3007])
Token: love
Embedding (first 10 dims): tensor([ 0.4542,  0.9776,  0.3335, -0.0131, -0.0401, -0.1888,  0.4279, -0.0346,
        -0.2702, -0.5418])
Token: mathematics
Embedding (first 10 dims): tensor([ 0.0373,  1.0036,  0.0657, -0.0548,  0.0196, -0.1038,  0.2634,  0.1846,
        -0.4612, -0.2633])
Token: !
Embedding (first 10 dims): tensor([ 0.1830,  0.9067,  0.3583, -0.0487, -0.2368, -0.2107,  0.6332, -0.0442,
        -0.4583, -0.3439])
```
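
The vectors printed above come from `last_hidden_state`, so they depend on the whole sentence. The sketch below contrasts that with the static lookup table, using a tiny, randomly initialized `BertModel` built from a `BertConfig` so that nothing needs to be downloaded. The config sizes and the token IDs are made up, and the random weights mean the numbers carry no linguistic meaning; only the same-versus-different comparison is of interest.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Hypothetical tiny configuration, far smaller than bert-base-uncased.
config = BertConfig(
    vocab_size=50,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)  # random weights, no pretrained checkpoint
model.eval()

# Two made-up ID sequences that share token 7 at position 1
# but differ at position 2.
ids_a = torch.tensor([[3, 7, 11, 4]])
ids_b = torch.tensor([[3, 7, 25, 4]])

# Static embedding: a plain table lookup, independent of context.
lookup = model.get_input_embeddings()
static_a = lookup(ids_a)[0, 1]
static_b = lookup(ids_b)[0, 1]

# Contextual embedding: output of the final transformer layer.
with torch.no_grad():
    ctx_a = model(input_ids=ids_a).last_hidden_state[0, 1]
    ctx_b = model(input_ids=ids_b).last_hidden_state[0, 1]

same_static = torch.equal(static_a, static_b)
same_contextual = torch.allclose(ctx_a, ctx_b, atol=1e-6)
print("static vectors identical:", same_static)
print("contextual vectors identical:", same_contextual)
```

The static vector for token 7 is the same in both sequences, while its contextual vector shifts once a neighbouring token changes.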
