# Embeddings

Convert token IDs into dense vectors and add positional information.

In this notebook, you'll learn how embeddings work by implementing them step-by-step!

## Imports

In [None]:
import torch
import torch.nn as nn
import math
from typing import Optional

## Token Embedding

Maps token indices to dense vectors. Each token gets its own learnable d_model-dimensional vector.
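
To make the lookup concrete, here is a minimal sketch that indexes a tiny `nn.Embedding` with a batch of token IDs. The sizes (`vocab_size = 10`, `d_model = 4`) and the token IDs are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary toy sizes, chosen only for this illustration
vocab_size, d_model = 10, 4
embedding = nn.Embedding(vocab_size, d_model)

# A batch of 2 sequences, each holding 3 token IDs
token_ids = torch.tensor([[1, 5, 9], [0, 5, 2]])
vectors = embedding(token_ids)

print(vectors.shape)  # torch.Size([2, 3, 4])

# Token ID 5 selects row 5 of the weight matrix, wherever it appears
same_row = torch.equal(vectors[0, 1], embedding.weight[5])
same_token = torch.equal(vectors[0, 1], vectors[1, 1])
print(same_row, same_token)  # True True
```

The output gains one trailing dimension of size `d_model`, and a given token ID maps to the same vector regardless of its position in the batch.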

### Step 1: Initialize TokenEmbedding

In [None]:
class TokenEmbedding(nn.Module):
    """
    Token embedding layer that maps token indices to dense vectors.
    """
    
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

### 🎯 Practice: Implement your own `__init__`

Try creating the token embedding yourself!
- Hint: Use `nn.Embedding(vocab_size, d_model)`
- Store `d_model` for scaling later

In [None]:
# Your implementation here
# class MyTokenEmbedding(nn.Module):
#     def __init__(self, vocab_size: int, d_model: int):
#         # Your code here
#         pass

### Step 2: Forward Pass

In [None]:
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Convert token indices to embeddings.
    Scale by sqrt(d_model) to maintain variance.
    """
    return self.embedding(x) * math.sqrt(self.d_model)

# Add forward method to the class
TokenEmbedding.forward = forward

### 🎯 Practice: Implement your own `forward`

Steps:
1. Get embeddings from the embedding layer
2. Scale by `sqrt(d_model)` (helps with training stability)
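
To see what the scaling does numerically, here is a small sketch. It assumes the default `nn.Embedding` initialization (weights drawn from a standard normal distribution) and reuses the sizes from the test cell at the end of this notebook; the random seed is arbitrary.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, d_model = 1000, 512  # same sizes as the test cell in this notebook
embedding = nn.Embedding(vocab_size, d_model)  # default init: standard normal

token_ids = torch.randint(0, vocab_size, (2, 32))
raw = embedding(token_ids)
scaled = raw * math.sqrt(d_model)

raw_std = raw.std().item()
scaled_std = scaled.std().item()
ratio = scaled_std / raw_std  # equals sqrt(d_model) up to float error

print(f"raw std:    {raw_std:.2f}")
print(f"scaled std: {scaled_std:.2f}")
print(f"ratio:      {ratio:.3f}  (sqrt(d_model) = {math.sqrt(d_model):.3f})")
```

Under this default initialization the raw embeddings already have a standard deviation near 1, so after scaling they are roughly `sqrt(d_model)` times larger than the sinusoidal encodings, whose values stay within [-1, 1].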

In [None]:
# Your implementation here
# def my_forward(self, x):
#     # Your code here
#     pass

## Positional Encoding

Adds position information using sinusoidal functions:
- PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
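
As a worked check of these formulas, the sketch below evaluates them directly for a toy configuration and compares the result with the `exp`/`log` form of the divisor used in the `__init__` in Step 1. The sizes (`d_model = 8`, `max_seq_len = 4`) are arbitrary.

```python
import math
import torch

d_model, max_seq_len = 8, 4  # arbitrary toy sizes

# Direct form of the formula: pos / 10000^(2i / d_model)
position = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1)  # (max_seq_len, 1)
two_i = torch.arange(0, d_model, 2, dtype=torch.float)                # 2i = 0, 2, 4, 6
angles = position / (10000.0 ** (two_i / d_model))                    # (max_seq_len, d_model / 2)

pe = torch.zeros(max_seq_len, d_model)
pe[:, 0::2] = torch.sin(angles)  # even dimensions
pe[:, 1::2] = torch.cos(angles)  # odd dimensions

# Equivalent form: 1 / 10000^(2i / d_model) == exp(2i * -ln(10000) / d_model)
div_term = torch.exp(two_i * (-math.log(10000.0) / d_model))
matches = torch.allclose(angles, position * div_term)

print(matches)  # True
print(pe[0])    # position 0: sin(0) = 0 in even slots, cos(0) = 1 in odd slots
```

Both forms give the same angles; the `exp`/`log` version is simply a common way to write the power.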

### Step 1: Initialize PositionalEncoding

In [None]:
class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding from 'Attention is All You Need'.
    """
    
    def __init__(self, d_model: int, max_seq_len: int = 512, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Create positional encoding matrix
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        
        # Compute divisor term
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        
        # Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Slice div_term so an odd d_model (one fewer odd index) also works
        pe[:, 1::2] = torch.cos(position * div_term[: d_model // 2])
        
        # Add batch dimension
        pe = pe.unsqueeze(0)
        
        # Register as buffer (not a parameter)
        self.register_buffer('pe', pe)

### 🎯 Practice: Implement your own `__init__`

Steps:
1. Create a position column vector (0 to max_seq_len - 1)
2. Compute divisor term for different frequencies
3. Apply sin/cos to even/odd dimensions
4. Register as buffer (not trainable)
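
For step 4, the sketch below shows how a registered buffer differs from a parameter. `BufferDemo` is a hypothetical class written only for this illustration.

```python
import torch
import torch.nn as nn


class BufferDemo(nn.Module):  # hypothetical class, for illustration only
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(3))  # trainable parameter
        self.register_buffer("pe", torch.ones(3))   # fixed tensor, not trainable


demo = BufferDemo()
param_names = [name for name, _ in demo.named_parameters()]
buffer_names = [name for name, _ in demo.named_buffers()]
state_keys = sorted(demo.state_dict().keys())

print(param_names)             # ['weight']
print(buffer_names)            # ['pe']
print(state_keys)              # ['pe', 'weight']
print(demo.pe.requires_grad)   # False
```

A buffer is skipped by the optimizer because it is not in `parameters()`, but it is still saved in `state_dict()` and moves with the module when you call `.to(device)`.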

In [None]:
# Your implementation here
# class MyPositionalEncoding(nn.Module):
#     def __init__(self, d_model: int, max_seq_len: int = 512, dropout: float = 0.1):
#         # Your code here
#         pass

### Step 2: Forward Pass

In [None]:
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Add positional encoding to input embeddings.
    """
    seq_len = x.size(1)
    x = x + self.pe[:, :seq_len, :]
    return self.dropout(x)

# Add forward method to the class
PositionalEncoding.forward = forward

### 🎯 Practice: Implement your own `forward`

Steps:
1. Get sequence length from input
2. Add positional encoding (slice to match seq_len)
3. Apply dropout
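
The slicing and addition in steps 1 and 2 rely on broadcasting over the batch dimension. Here is a minimal shape-only sketch; the tensors are stand-ins (zeros for the embeddings, random values for the stored encoding) and the sizes are arbitrary.

```python
import torch

torch.manual_seed(0)

batch_size, seq_len, d_model, max_seq_len = 2, 5, 8, 16  # arbitrary toy sizes

x = torch.zeros(batch_size, seq_len, d_model)  # stand-in for token embeddings
pe = torch.randn(1, max_seq_len, d_model)      # stand-in for the stored `pe` buffer

pe_slice = pe[:, :seq_len, :]  # keep only the first seq_len positions
out = x + pe_slice             # the size-1 batch dim broadcasts across the batch

print(pe_slice.shape)  # torch.Size([1, 5, 8])
print(out.shape)       # torch.Size([2, 5, 8])

# Every sequence in the batch receives the same positional values
same_across_batch = torch.equal(out[0], out[1])
print(same_across_batch)  # True
```

Note that the slice can only supply up to `max_seq_len` positions, so a longer input sequence would produce a shape mismatch at the addition.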

In [None]:
# Your implementation here
# def my_forward(self, x):
#     # Your code here
#     pass

## Transformer Embedding

Combines token embeddings with positional encoding.

### Step 1: Initialize TransformerEmbedding

In [None]:
class TransformerEmbedding(nn.Module):
    """
    Combined embedding layer for transformers.
    """
    
    def __init__(
        self,
        vocab_size: int,
        d_model: int,
        max_seq_len: int = 512,
        dropout: float = 0.1,
        use_learned_pos: bool = False
    ):
        super().__init__()
        
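        # Note: use_learned_pos is accepted but not used here;
        # positions are always encoded with the sinusoidal PositionalEncoding.
        self.token_embedding = TokenEmbedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_len, dropout)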

### 🎯 Practice: Implement your own `__init__`

Combine token embedding and positional encoding!

In [None]:
# Your implementation here
# class MyTransformerEmbedding(nn.Module):
#     def __init__(self, vocab_size, d_model, max_seq_len=512, dropout=0.1):
#         # Your code here
#         pass

### Step 2: Forward Pass

In [None]:
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Convert token indices to full embeddings with position information.
    """
    tok_emb = self.token_embedding(x)
    return self.positional_encoding(tok_emb)

# Add forward method to the class
TransformerEmbedding.forward = forward

### 🎯 Practice: Implement your own `forward`

Steps:
1. Get token embeddings
2. Add positional encoding

In [None]:
# Your implementation here
# def my_forward(self, x):
#     # Your code here
#     pass

## Test Embeddings

In [None]:
d_model = 512
vocab_size = 1000
embed = TransformerEmbedding(vocab_size, d_model)
x = torch.randint(0, vocab_size, (2, 32))  # Batch 2, Seq 32
y = embed(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")  # Should be (2, 32, 512)
print("\n✅ Embeddings work!")
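
One thing to keep in mind when testing: a freshly constructed module is in training mode, so the dropout inside `PositionalEncoding` is active and repeated calls to `embed(x)` will differ. The sketch below isolates that behavior with a standalone `nn.Dropout`, using the same rate (0.1) and output shape as the test above; the input of ones is a stand-in chosen to make the effect easy to read.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.1)  # same rate as the default in this notebook
x = torch.ones(2, 32, 512)   # stand-in for an embedding output

dropout.train()              # training mode: elements are randomly zeroed
train_out = dropout(x)

dropout.eval()               # evaluation mode: dropout is the identity
eval_out = dropout(x)

zero_fraction = (train_out == 0).float().mean().item()
kept_values = train_out[train_out != 0]

print(f"zeroed in train mode: {zero_fraction:.3f}")  # close to 0.1
print(f"kept values scaled to: {kept_values[0].item():.4f}")  # 1 / (1 - 0.1)
print(torch.equal(eval_out, x))  # True
```

Calling `embed.eval()` before comparing outputs makes the embedding layer deterministic in the same way.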