# Embeddings Playground

This notebook explores how tokens (the output of tokenization) are converted into embeddings: fixed-length numerical vectors that capture semantic meaning.

## Key Concepts:
- **Tokens** → **Embeddings** → **Model Processing**
- Embeddings are dense vector representations of tokens
- Similar tokens have similar embeddings (in vector space)
- Fixed dimensionality regardless of input length

In [None]:
# Import required libraries
import tiktoken
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import openai
import os

# For visualization
plt.style.use('default')
sns.set_palette("husl")

## From Tokens to Embeddings: The Journey

Let's trace how text becomes embeddings:

1. **Text** → "Hello world" 
2. **Tokenization** → [9906, 1917] (from notebook 01)
3. **Embedding Lookup** → Each token gets a fixed-size vector
4. **Result** → Matrix of embeddings
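
The lookup in step 3 can be sketched with a plain NumPy array. This is a minimal illustration, not a real model's table: `vocab_size`, `embedding_dim`, and the random values are assumptions chosen to keep the example small, whereas real tables are learned during training and are much larger.

```python
import numpy as np

# Hypothetical sizes, chosen only to keep the example small
vocab_size = 10_000
embedding_dim = 8

# Stand-in for a learned embedding table: one row per token ID
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Token IDs for "Hello world" from step 2
token_ids = [9906, 1917]

# Embedding lookup is row indexing: one row per token
embedding_matrix = embedding_table[token_ids]
print(embedding_matrix.shape)  # (2, 8): 2 tokens, 8 dimensions each
```

The result has one row per token, so a longer text produces a taller matrix while each row keeps the same width.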

In [None]:
# Step 1: Tokenize text (connecting to notebook 01)
encoding = tiktoken.get_encoding("cl100k_base")

texts = [
    "Hello world",
    "Hello universe", 
    "Goodbye world",
    "Python programming",
    "Machine learning"
]

print("🔤 TOKENIZATION STEP:")
print("=" * 40)
for text in texts:
    tokens = encoding.encode(text)
    print(f"'{text}' → {tokens}")
    
    # Show the actual token pieces
    pieces = []
    for token in tokens:
        piece = encoding.decode([token])
        pieces.append(f"'{piece}'")
    print(f"  Token pieces: {' + '.join(pieces)}")
    print()

## What Are Embeddings?

**Embeddings** are dense numerical vectors that represent tokens in a high-dimensional space where:
- Similar meanings → Similar vectors
- Each embedding is a **fixed-length vector** (e.g., OpenAI's text-embedding-ada-002 returns 1536 dimensions per input)
- Mathematical operations can capture semantic relationships
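
As a rough illustration of the last point, the sketch below uses hand-crafted 3D vectors whose axes (royalty, gender, animal) are invented for this example. Learned embeddings do not have labeled axes like this, and the analogy only holds approximately in practice.

```python
import numpy as np

# Hand-crafted vectors: [royalty, gender, animal]. Invented for illustration only.
vectors = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "cat":   np.array([0.0,  0.0, 1.0]),
}

# "king" - "man" + "woman": remove the male direction, add the female one
target = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Find the word whose vector points most nearly in the target's direction
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```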

### Conceptual Example:
If embeddings were 3D instead of 1536D:
- "king" → [0.2, 0.8, 0.1]
- "queen" → [0.3, 0.9, 0.2] 
- "cat" → [0.7, 0.1, 0.9]

Notice: "king" and "queen" are closer to each other than to "cat"
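
"Closer" can be made precise with cosine similarity, which measures the angle between two vectors. The snippet below is a small sketch that applies it to the three toy vectors above; `cosine_sim` is a helper written for this example.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between a and b: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king = np.array([0.2, 0.8, 0.1])
queen = np.array([0.3, 0.9, 0.2])
cat = np.array([0.7, 0.1, 0.9])

sim_king_queen = cosine_sim(king, queen)
sim_king_cat = cosine_sim(king, cat)
print(f"king vs queen: {sim_king_queen:.3f}")  # about 0.993
print(f"king vs cat:   {sim_king_cat:.3f}")    # about 0.326
```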

In [None]:
# Simulate what embeddings look like conceptually
# (This is just for illustration - real embeddings are much more complex)

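# Create some fake embeddings for demonstration
np.random.seed(42)  # For reproducible results

# Each cluster gets its own direction in 5D space. Cosine similarity compares
# directions, so clusters must point different ways to look dissimilar
# (shifting every dimension by the same amount, e.g. 0.5 vs 0.8, would not).
greeting = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
noun = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
tech = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

def noisy(center):
    """Return the cluster center plus small Gaussian noise."""
    return np.random.normal(center, 0.1)

# Simulate 5-dimensional embeddings for some tokens
token_embeddings = {
    "Hello": noisy(greeting),      # Greeting cluster
    "Hi": noisy(greeting),         # Greeting cluster
    "Goodbye": noisy(-greeting),   # Farewell cluster (opposite direction, for illustration only)
    "Bye": noisy(-greeting),       # Farewell cluster
    "world": noisy(noun),          # Noun cluster
    "universe": noisy(noun),       # Noun cluster
    "Python": noisy(tech),         # Programming cluster
    "programming": noisy(tech)     # Programming cluster
}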

print("🧮 SIMULATED TOKEN EMBEDDINGS (5D):")
print("=" * 50)
for token, embedding in token_embeddings.items():
    print(f"'{token}': [{', '.join(f'{x:.2f}' for x in embedding)}]")

print("\n💡 Notice: Similar tokens have similar vector values!")

In [None]:
# Calculate cosine similarity between token embeddings
print("🔍 SIMILARITY ANALYSIS:")
print("=" * 40)

# Compare some pairs
comparisons = [
    ("Hello", "Hi"),           # Both greetings
    ("Hello", "Goodbye"),      # Opposite meanings  
    ("world", "universe"),     # Similar concepts
    ("Python", "programming"), # Related concepts
    ("Hello", "Python")        # Unrelated
]

for token1, token2 in comparisons:
    emb1 = token_embeddings[token1].reshape(1, -1)
    emb2 = token_embeddings[token2].reshape(1, -1)
    similarity = cosine_similarity(emb1, emb2)[0][0]
    
    print(f"'{token1}' ↔ '{token2}': {similarity:.3f}")
    if similarity > 0.8:
        print("  → Very similar! 🎯")
    elif similarity > 0.3:
        print("  → Somewhat similar")
    elif similarity < -0.3:
        print("  → Opposite meanings! ↔️")
    else:
        print("  → Not very related")
    print()

## Real OpenAI Embeddings

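Now let's get real embeddings from OpenAI's API. These come from a dedicated embedding model (`text-embedding-ada-002`), not from the internal token-embedding table of a chat model like GPT-4, but they illustrate the same idea: text mapped to vectors whose geometry reflects meaning.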

**Note**: You'll need an OpenAI API key. Set it as an environment variable or paste it below.
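
The embeddings endpoint also accepts a list of strings, so several texts can be sent in one request. The helper below is a sketch that assumes a client with the `openai` v1 interface (`client.embeddings.create`, where each returned item carries `index` and `embedding`). `embed_batch` is a hypothetical name, not part of the library.

```python
import numpy as np

def embed_batch(client, texts, model="text-embedding-ada-002"):
    """Embed several texts in one request and return an (n_texts, dim) array.

    `client` is expected to behave like openai.OpenAI(); this helper is
    illustrative and not part of the openai package.
    """
    response = client.embeddings.create(input=list(texts), model=model)
    # Sort by index so rows line up with the order of `texts`
    items = sorted(response.data, key=lambda item: item.index)
    return np.array([item.embedding for item in items])
```

With a configured client, `embed_batch(client, ["king", "queen"])` would return an array with one row per input text.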

In [None]:
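# Setup OpenAI API (you'll need your API key)
# Option 1: Set the environment variable OPENAI_API_KEY before starting the notebook
# Option 2: Uncomment the line below and add your key
#           (the client created below reads the key from the environment)
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Check if an API key is configured (this does not verify that the key is valid)
try:
    client = openai.OpenAI()  # Will use OPENAI_API_KEY env var
    print("✅ OpenAI API key found!")
    api_available = True
except openai.OpenAIError:
    # The client constructor raises this when no API key is configured
    print("❌ OpenAI API key not found.")
    print("Set OPENAI_API_KEY environment variable or uncomment the line above.")
    api_available = False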

In [None]:
# Get real embeddings from OpenAI
if api_available:
    def get_embedding(text, model="text-embedding-ada-002"):
        """Get embedding for a text using OpenAI's API"""
        response = client.embeddings.create(
            input=text,
            model=model
        )
        return np.array(response.data[0].embedding)
    
    # Test words for embedding analysis
    test_words = [
        "king", "queen", "man", "woman",
        "cat", "dog", "animal",
        "happy", "joyful", "sad",
        "Python", "programming", "code"
    ]
    
    print("🌐 GETTING REAL OPENAI EMBEDDINGS...")
    print("=" * 50)
    
    real_embeddings = {}
    for word in test_words:
        try:
            embedding = get_embedding(word)
            real_embeddings[word] = embedding
            print(f"✅ '{word}': {len(embedding)}D vector")
        except Exception as e:
            print(f"❌ Error getting embedding for '{word}': {e}")
            break
    
    if real_embeddings:
        sample_word = list(real_embeddings.keys())[0]
        sample_embedding = real_embeddings[sample_word]
        print(f"\n📊 Embedding details:")
        print(f"Dimensions: {len(sample_embedding)}")
        print(f"Sample values for '{sample_word}': [{', '.join(f'{x:.4f}' for x in sample_embedding[:5])}...]")
        print(f"Value range: {sample_embedding.min():.4f} to {sample_embedding.max():.4f}")
else:
    print("⚠️  Skipping real embeddings - API key not available")
    print("We'll continue with simulated examples")

In [None]:
# Analyze real embedding similarities
if api_available and real_embeddings:
    print("🔍 REAL EMBEDDING SIMILARITIES:")
    print("=" * 50)
    
    # Interesting word pairs to compare
    word_pairs = [
        ("king", "queen"),        # Gender relationship
        ("man", "woman"),         # Gender relationship
        ("cat", "dog"),           # Both animals
        ("happy", "joyful"),      # Synonyms
        ("happy", "sad"),         # Opposites
        ("Python", "programming"), # Related concepts
        ("cat", "programming"),   # Unrelated
    ]
    
    for word1, word2 in word_pairs:
        if word1 in real_embeddings and word2 in real_embeddings:
            emb1 = real_embeddings[word1].reshape(1, -1)
            emb2 = real_embeddings[word2].reshape(1, -1)
            similarity = cosine_similarity(emb1, emb2)[0][0]
            
            print(f"'{word1}' ↔ '{word2}': {similarity:.4f}")
            
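            # Interpret the similarity (rough, arbitrary thresholds; scores from this
            # model often sit in a narrow, high range, so compare pairs to each other)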
            if similarity > 0.8:
                print("  → Extremely similar! 🎯")
            elif similarity > 0.6:
                print("  → Very similar! ✨")
            elif similarity > 0.4:
                print("  → Moderately similar 📊")
            elif similarity > 0.2:
                print("  → Somewhat similar 🔍")
            else:
                print("  → Not very similar 🔀")
            print()
else:
    print("⚠️ Skipping real similarity analysis - embeddings not available")

## Visualizing Embeddings

Since embeddings are high-dimensional (1536D), we can't visualize them directly. But we can use **dimensionality reduction** to project them into 2D space for visualization.
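
To show what such a projection does, here is a minimal NumPy sketch of PCA computed through the singular value decomposition. The data is random stand-in data, not real embeddings, and the result should match scikit-learn's `PCA` only up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 6 "embeddings" with 10 dimensions each (random, for illustration)
X = rng.normal(size=(6, 10))

# 1. Center each dimension at zero
X_centered = X - X.mean(axis=0)

# 2. SVD: rows of Vt are the principal directions, ordered by variance explained
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project onto the first two principal directions
X_2d = X_centered @ Vt[:2].T

# Fraction of the total variance captured by each component
explained_ratio = S**2 / np.sum(S**2)
print(X_2d.shape)  # (6, 2)
print(f"Variance kept by 2 components: {explained_ratio[:2].sum():.1%}")
```

Whatever variance the first two components do not capture is lost in the plot, which is why distances in 2D are only an approximation of distances in the original space.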

In [None]:
# Visualize embeddings using PCA (Principal Component Analysis)
if api_available and real_embeddings and len(real_embeddings) > 2:
    print("📊 VISUALIZING EMBEDDINGS IN 2D:")
    print("=" * 40)
    
    # Prepare data for PCA
    words = list(real_embeddings.keys())
    embeddings_matrix = np.array([real_embeddings[word] for word in words])
    
    # Reduce from 1536D to 2D
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embeddings_matrix)
    
    # Create the plot
    plt.figure(figsize=(12, 8))
    
    # Define colors for different categories
    colors = {
        'royal': ['king', 'queen'],
        'gender': ['man', 'woman'], 
        'animals': ['cat', 'dog', 'animal'],
        'emotions': ['happy', 'joyful', 'sad'],
        'tech': ['Python', 'programming', 'code']
    }
    
    color_map = {}
    color_list = ['red', 'blue', 'green', 'orange', 'purple']
    for i, (category, word_list) in enumerate(colors.items()):
        for word in word_list:
            if word in words:
                color_map[word] = color_list[i % len(color_list)]
    
    # Plot points
    for i, word in enumerate(words):
        x, y = embeddings_2d[i]
        color = color_map.get(word, 'gray')
        plt.scatter(x, y, c=color, s=100, alpha=0.7)
        plt.annotate(word, (x, y), xytext=(5, 5), textcoords='offset points', fontsize=10)
    
    plt.title('OpenAI Embeddings Visualized in 2D Space\n(Using PCA Dimensionality Reduction)', fontsize=14)
    plt.xlabel(f'First Principal Component ({pca.explained_variance_ratio_[0]:.1%} variance)')
    plt.ylabel(f'Second Principal Component ({pca.explained_variance_ratio_[1]:.1%} variance)')
    plt.grid(True, alpha=0.3)
    
    # Add legend
    for i, (category, word_list) in enumerate(colors.items()):
        plt.scatter([], [], c=color_list[i % len(color_list)], label=category.capitalize(), s=100, alpha=0.7)
    plt.legend()
    
    plt.tight_layout()
    plt.show()
    
    print(f"\n💡 Interpretation:")
    print(f"- Words that are close together tend to have similar meanings (the 2D projection is approximate)")
    print(f"- The PCA captures {pca.explained_variance_ratio_[0] + pca.explained_variance_ratio_[1]:.1%} of the variance")
    print(f"- This is a 2D projection of {embeddings_matrix.shape[1]}D space!")
    
else:
    print("⚠️ Skipping visualization - need embeddings from OpenAI API")

## How Embeddings Work in Language Models

### The Complete Pipeline:

1. **Text Input**: "Hello world"
2. **Tokenization**: [9906, 1917] (from notebook 01)
3. **Embedding Lookup**: Each token → fixed-size vector (the width depends on the model)
4. **Model Processing**: Attention, transformations, etc.
5. **Output**: Generated text, classifications, etc.

### Key Insights:
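- **Fixed Size**: Every token gets the same size vector (the width is model-specific; text-embedding-ada-002 outputs 1536D vectors)
- **Learned**: Embeddings are learned during model training
- **Semantic**: Similar tokens have similar embeddings
- **Context-Independent**: In a model's lookup table, each token ID maps to one vector (context is mixed in by later layers). The embeddings API used above is different: it returns one vector for the whole input, so that vector does depend on the surrounding words.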
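
The fixed-size and context-independent points can be checked on a toy lookup table. The sketch below uses an invented four-word vocabulary and random weights. The mean pooling at the end is one simple way to turn a variable number of token vectors into a single fixed-size vector, shown here as an assumption for illustration and not as a description of how any particular embedding model works.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy vocabulary and table (random stand-in for learned weights)
vocab = {"river": 0, "bank": 1, "account": 2, "the": 3}
table = rng.normal(size=(len(vocab), 4))

def lookup(words):
    """Return one row of the table per word: shape (len(words), 4)."""
    return table[[vocab[w] for w in words]]

sent_a = lookup(["the", "river", "bank"])
sent_b = lookup(["the", "bank", "account", "the"])

# Context-independent: "bank" gets the identical vector in both sentences
print(np.array_equal(sent_a[2], sent_b[1]))  # True

# Fixed size: mean pooling maps any number of token vectors to one 4D vector
pooled_a = sent_a.mean(axis=0)
pooled_b = sent_b.mean(axis=0)
print(pooled_a.shape, pooled_b.shape)  # (4,) (4,)
```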

In [None]:
# Demonstrate the complete token-to-embedding pipeline
print("🔄 COMPLETE PIPELINE DEMONSTRATION:")
print("=" * 50)

sample_text = "Hello beautiful world"
print(f"📝 Input Text: '{sample_text}'")
print()

# Step 1: Tokenization
tokens = encoding.encode(sample_text)
print(f"🔤 Step 1 - Tokenization:")
print(f"  Tokens: {tokens}")
for i, token in enumerate(tokens):
    piece = encoding.decode([token])
    print(f"  Token {i+1}: {token} → '{piece}'")
print()

# Step 2: Embedding lookup (conceptual)
print(f"🧮 Step 2 - Embedding Lookup:")
print(f"  Inside a model, each token ID maps to one fixed-size vector")
for i, token in enumerate(tokens):
    piece = encoding.decode([token])
    print(f"  '{piece}' → [fixed-size vector]")
print()

# Step 3: The embeddings API returns ONE vector for the whole text, not one per token
print(f"🌐 Step 3 - Text Embedding (API):")
if api_available:
    try:
        full_embedding = get_embedding(sample_text)
        print(f"  Full text embedding shape: {full_embedding.shape}")
        print(f"  Sample values: [{', '.join(f'{x:.4f}' for x in full_embedding[:5])}...]")
    except Exception as e:
        print(f"  (Skipped - API call failed: {e})")
else:
    print("  (Skipped - API key not available; would be a single 1536D vector)")

print()
print(f"📊 Result: {len(tokens)} tokens → {len(tokens)} token vectors inside a model; the API returns 1 vector for the whole text")

## Next Steps & Key Takeaways

### What We've Learned:
1. **Tokens → Embeddings**: Each token becomes a fixed-size vector
2. **Semantic Similarity**: Similar tokens have similar embeddings
3. **High Dimensional**: text-embedding-ada-002 embeddings are 1536D (much richer than our toy examples)
4. **Foundation**: Embeddings are the foundation for all model processing

### The Journey So Far:
- **Notebook 01**: Text → Tokens (discrete IDs)
- **Notebook 02**: Tokens → Embeddings (dense vectors)
- **Next**: How models process these embeddings (attention, transformers, etc.)

### Key Questions Answered:
- ✅ How do models convert tokens to numbers they can work with?
- ✅ Why do similar words have similar representations?
- ✅ What does "fixed-length" mean in the context of variable-length text?

### Experiment Ideas:
1. Try getting embeddings for different languages
2. Explore embeddings for code vs. natural language
3. Test how embeddings change with context (a token's lookup vector stays fixed, but the API embedding of a whole sentence changes when the surrounding words change)
4. Calculate embedding similarity for your own word pairs
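
For the last experiment, a small helper that compares every pair at once can be a convenient starting point. This is a sketch with toy vectors. `similarity_matrix` is a name made up for this example, and a dictionary of real vectors such as `real_embeddings` could be passed in the same way.

```python
import numpy as np

def similarity_matrix(embeddings):
    """Return (words, matrix) where matrix[i, j] is the cosine similarity of words i and j."""
    words = list(embeddings)
    X = np.array([embeddings[w] for w in words], dtype=float)
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale each row to length 1
    return words, X_unit @ X_unit.T

# Toy vectors for illustration only
toy = {
    "happy": np.array([0.9, 0.1, 0.0]),
    "joyful": np.array([0.8, 0.2, 0.1]),
    "code": np.array([0.0, 0.1, 0.9]),
}
words, sims = similarity_matrix(toy)
for i, w in enumerate(words):
    print(w, np.round(sims[i], 2))
```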