# Module 2, Lesson 1: Vectors as Meaning

**Course:** Foundations of Attention  
**Module:** The Geometry of Intelligence  
**Learning Objectives:**
1. Define high-dimensional vector spaces and their relevance to semantic meaning (Concept)
2. Calculate the Dot Product to measure directional alignment (Skill)
3. Interpret the Dot Product as a metric for similarity/attention (Concept)

---

## Introduction: Why Geometry Matters for AI

Before we can build attention mechanisms, we need to understand a fundamental question:

**How do computers measure similarity between words?**

The answer lies in **geometry**. By representing words as vectors in high-dimensional space, we transform the problem of "meaning" into a problem of "distance" and "direction."

This lesson will teach you the mathematical foundation that makes modern AI possible.

## 1. From Strings to Numbers: The Embedding Revolution

### The Problem
Computers do not process language directly. To a program, the string `"cat"` is nothing but a sequence of character codes:

```
"cat" = [99, 97, 116]  # ASCII values
```

To perform mathematical operations on language (like computing attention), we must first convert these discrete symbols into continuous numbers.

### The Old Way: One-Hot Encoding

Historically, we represented words as sparse vectors with a single `1`:

```
Cat:    [1, 0, 0, 0, 0]
Dog:    [0, 1, 0, 0, 0]
King:   [0, 0, 1, 0, 0]
Queen:  [0, 0, 0, 1, 0]
Apple:  [0, 0, 0, 0, 1]
```

**The Problem:** These vectors are mutually **orthogonal** (90 degrees apart). The dot product between any two different words is 0, which mathematically implies they share *zero* similarity. This is clearly wrong: "Cat" and "Dog" are both animals!
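
As a quick check of the orthogonality claim, here is a minimal sketch that builds the five one-hot vectors as the rows of an identity matrix and computes every pairwise dot product at once. The variable names (`one_hot`, `gram`) are illustrative choices, not part of the lesson's code.

```python
import numpy as np

words = ["Cat", "Dog", "King", "Queen", "Apple"]
one_hot = np.eye(len(words))  # row i is the one-hot vector for words[i]

# Entry (i, j) of this matrix is the dot product one_hot[i] · one_hot[j]
gram = one_hot @ one_hot.T

cat_dog = gram[words.index("Cat"), words.index("Dog")]
cat_apple = gram[words.index("Cat"), words.index("Apple")]
print(f"Cat · Dog   = {cat_dog}")
print(f"Cat · Apple = {cat_apple}")
```

Both values are 0: under one-hot encoding, "Cat" is exactly as unrelated to "Dog" as it is to "Apple".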

### The Modern Way: Dense Embeddings

An **embedding** is a dense vector of floating-point numbers that represents the *semantic meaning* of a word:

```
Cat:   [0.82, -0.13, 0.51, -0.74, ...]  # 768 dimensions
Dog:   [0.79, -0.09, 0.48, -0.71, ...]  # Similar to Cat!
King:  [-0.21, 0.88, 0.05, 0.43, ...]
Queen: [-0.19, 0.85, 0.08, 0.47, ...]  # Similar to King!
```

In this high-dimensional space:
- Words with **similar meanings** point in **similar directions**
- The **distance** between vectors encodes **semantic difference**
- The **dot product** measures **semantic similarity**

This is the foundation of modern language models, including Transformer-based models such as GPT and BERT.
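
To make "similar directions" concrete, the sketch below treats only the first four components shown above as toy 4-dimensional vectors. This is an assumption for illustration: the numbers do not come from a trained model, and real embeddings have hundreds of dimensions. `cosine_similarity` is a helper defined here, not a library function.

```python
import numpy as np

# Toy 4-dimensional vectors: only the leading components listed in the lesson
toy = {
    "Cat":   np.array([0.82, -0.13, 0.51, -0.74]),
    "Dog":   np.array([0.79, -0.09, 0.48, -0.71]),
    "King":  np.array([-0.21, 0.88, 0.05, 0.43]),
    "Queen": np.array([-0.19, 0.85, 0.08, 0.47]),
}

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat_dog = cosine_similarity(toy["Cat"], toy["Dog"])
king_queen = cosine_similarity(toy["King"], toy["Queen"])
cat_king = cosine_similarity(toy["Cat"], toy["King"])

print(f"Cat vs Dog:    {cat_dog:.3f}")
print(f"King vs Queen: {king_queen:.3f}")
print(f"Cat vs King:   {cat_king:.3f}")
```

The related pairs score close to 1, while the unrelated pair (Cat, King) scores below zero for these toy numbers.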

## 2. Visualization: The Geometry of Semantics

Let's visualize how embeddings create semantic clusters. We'll generate two groups of words:

1. **Fruits:** (Apple, Banana, Orange, Grape, Pear)
2. **Vehicles:** (Car, Truck, Bus, Bike, Train)

Watch how words with similar meanings cluster together in 3D space.

```python
import numpy as np
import plotly.graph_objects as go

# Set random seed for reproducibility
np.random.seed(42)

# Generate fruit vectors: Centered around [1, 1, 1]
fruits = np.random.normal(loc=[1, 1, 1], scale=0.2, size=(5, 3))
fruit_labels = ["Apple", "Banana", "Orange", "Grape", "Pear"]

# Generate vehicle vectors: Centered around [-1, -1, -1]
vehicles = np.random.normal(loc=[-1, -1, -1], scale=0.2, size=(5, 3))
vehicle_labels = ["Car", "Truck", "Bus", "Bike", "Train"]

# Create 3D scatter plot
fig = go.Figure()

fig.add_trace(go.Scatter3d(
    x=fruits[:,0], y=fruits[:,1], z=fruits[:,2],
    mode='markers+text',
    text=fruit_labels,
    textposition="top center",
    name='Fruits',
    marker=dict(size=10, color='red', opacity=0.8)
))

fig.add_trace(go.Scatter3d(
    x=vehicles[:,0], y=vehicles[:,1], z=vehicles[:,2],
    mode='markers+text',
    text=vehicle_labels,
    textposition="top center",
    name='Vehicles',
    marker=dict(size=10, color='blue', opacity=0.8)
))

fig.update_layout(
    title="Semantic Clusters in 3D Embedding Space",
    scene=dict(
        xaxis_title="Dimension 1",
        yaxis_title="Dimension 2",
        zaxis_title="Dimension 3"
    ),
    width=800,
    height=600
)

fig.show()

print("✅ Notice how Fruits cluster together, far from Vehicles.")
print("This spatial distance is how the model knows that an Apple is not a Car.")
```

### 🧠 Thought Exercise

**Question:** In the visualization above, we only used 3 dimensions for display. Real word embeddings (like in GPT or BERT) use 768 or even 1536 dimensions. Why?

<details>
<summary>Click to reveal answer</summary>

**Answer:** Higher dimensions allow the model to capture more subtle relationships. In 3D, we can only separate a few concepts. But with 768 dimensions, we can encode:
- Synonyms (happy ≈ joyful)
- Antonyms (hot ≠ cold)
- Analogies (king - man + woman ≈ queen)
- Grammar (singular vs plural)
- Context ("bank" in "river bank" vs "savings bank")

Think of it like this: In 2D, you can only draw a few non-overlapping circles. In 768D, you can fit billions of separate concepts!
</details>
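
One way to get a feel for the extra "room" in high dimensions is to look at random directions. The sketch below draws pairs of random Gaussian vectors and measures how far from perpendicular they are on average. It says something about random vectors only, not about any trained model, and `mean_abs_cosine` is a helper name chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=2000):
    """Average |cos(angle)| between pairs of random Gaussian vectors."""
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(np.abs(dots / norms)))

low_dim = mean_abs_cosine(3)
high_dim = mean_abs_cosine(768)

print(f"Average |cosine| in 3 dimensions:   {low_dim:.3f}")
print(f"Average |cosine| in 768 dimensions: {high_dim:.3f}")
```

The 768-dimensional average should come out far smaller than the 3-dimensional one: random directions in high dimensions are almost perpendicular to each other, which leaves many nearly independent directions available for different concepts.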

## 3. The Dot Product: Measuring Similarity

Now comes the critical question: **How do we measure if two words are similar?**

The answer is the **dot product** (also called the **inner product**). This is the *single most important operation* in attention mechanisms.

### Definition

Given two vectors $\vec{a}$ and $\vec{b}$:

$$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i \times b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$

### Example

```python
a = [1, 2, 3]
b = [4, 5, 6]

# a · b = (1×4) + (2×5) + (3×6) = 4 + 10 + 18 = 32
dot = sum(x * y for x, y in zip(a, b))
print(dot)  # 32
```

### Geometric Interpretation

The dot product can also be written as:

$$\vec{a} \cdot \vec{b} = ||\vec{a}|| \, ||\vec{b}|| \, \cos(\theta)$$

Where:
- $||\vec{a}||$ is the magnitude (length) of vector $\vec{a}$
- $\theta$ is the angle between the vectors

**Key Insight:**
- If vectors point in the **same direction** ($\theta = 0^\circ$), then $\cos(\theta) = 1$ → **high dot product** → **high similarity**
- If vectors are **perpendicular** ($\theta = 90^\circ$), then $\cos(\theta) = 0$ → **zero dot product** → **no similarity**
- If vectors point in **opposite directions** ($\theta = 180^\circ$), then $\cos(\theta) = -1$ → **negative dot product** → **opposite meaning**

This is the core idea behind attention: token pairs with high dot products receive more attention weight, so they "attend" to each other more strongly.
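
The two formulas for the dot product can be checked against each other numerically. The following sketch uses 2D vectors of length 2 and 3 at a few known angles; `unit_vector` and the chosen lengths are assumptions made for this example.

```python
import numpy as np

def unit_vector(angle_deg):
    """2D unit vector at the given angle (in degrees) from the x-axis."""
    theta = np.radians(angle_deg)
    return np.array([np.cos(theta), np.sin(theta)])

a = 2.0 * unit_vector(0)  # length 2, pointing along the x-axis

dots = {}
for angle in (0, 60, 90, 180):
    b = 3.0 * unit_vector(angle)  # length 3, rotated by `angle` degrees
    algebraic = np.dot(a, b)      # sum of component-wise products
    geometric = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(np.radians(angle))
    dots[angle] = (algebraic, geometric)
    print(f"angle={angle:3d}°  algebraic={algebraic:+.4f}  geometric={geometric:+.4f}")
```

Both columns agree up to floating-point rounding: the value is largest at 0°, essentially zero at 90°, and most negative at 180°.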

### 📊 Interactive Example: Dot Product vs Angle

Let's calculate the dot product between pairs of words from our semantic clusters:

```python
# Using our fruit and vehicle vectors from before
apple_vec = fruits[0]  # Apple
banana_vec = fruits[1]  # Banana
car_vec = vehicles[0]   # Car

# Calculate dot products
apple_banana_dot = np.dot(apple_vec, banana_vec)
apple_car_dot = np.dot(apple_vec, car_vec)

# Calculate magnitudes
apple_mag = np.linalg.norm(apple_vec)
banana_mag = np.linalg.norm(banana_vec)
car_mag = np.linalg.norm(car_vec)

# Calculate angles (in degrees)
# np.clip guards against rounding errors pushing the cosine slightly outside [-1, 1]
cos_apple_banana = np.clip(apple_banana_dot / (apple_mag * banana_mag), -1.0, 1.0)
cos_apple_car = np.clip(apple_car_dot / (apple_mag * car_mag), -1.0, 1.0)
angle_apple_banana = np.degrees(np.arccos(cos_apple_banana))
angle_apple_car = np.degrees(np.arccos(cos_apple_car))

print("=" * 60)
print("Dot Product Analysis")
print("=" * 60)
print(f"\nApple · Banana = {apple_banana_dot:.4f}")
print(f"Angle between Apple and Banana: {angle_apple_banana:.2f}°")
print("✅ Both are fruits → Similar meaning → Small angle → HIGH dot product")

print(f"\nApple · Car = {apple_car_dot:.4f}")
print(f"Angle between Apple and Car: {angle_apple_car:.2f}°")
print("✅ Fruit vs Vehicle → Different meaning → Large angle → LOW (or negative) dot product")

print("\n" + "=" * 60)
print("🎯 Key Takeaway:")
print("The dot product quantifies semantic similarity!")
print("This is the foundation of the attention mechanism.")
print("=" * 60)
```

## 4. Vector Arithmetic: Encoding Relationships

Because embeddings are just numbers, we can do arithmetic on them. The most famous example is:

$$\vec{\text{King}} - \vec{\text{Man}} + \vec{\text{Woman}} \approx \vec{\text{Queen}}$$

### Why This Works

Think of it geometrically:
- The vector $\vec{\text{King}} - \vec{\text{Man}}$ represents the "direction" from Man to King
- This direction encodes the concept of **"Royalty"**
- Adding that same royalty direction to $\vec{\text{Woman}}$ should land you on $\vec{\text{Queen}}$!

Let's simulate this:

```python
# Simulate word vectors (in reality, these come from trained models)
np.random.seed(100)

# Create vectors where the relationship holds approximately
man = np.array([0.5, 0.1, 0.3])
woman = np.array([0.5, 0.9, 0.3])  # Similar to man, but different in one dimension
king = np.array([0.9, 0.1, 0.8])   # Similar to man in gender dimension, different in royalty
queen = np.array([0.9, 0.9, 0.8])  # Similar to woman in gender, similar to king in royalty

# Perform the arithmetic
result = king - man + woman

print("Vector Arithmetic Demonstration")
print("=" * 60)
print(f"King:   {king}")
print(f"Man:    {man}")
print(f"Woman:  {woman}")
print(f"\nKing - Man + Woman = {result}")
print(f"Queen (actual):      {queen}")
print(f"\nDistance from result to Queen: {np.linalg.norm(result - queen):.4f}")
print("\n✅ The result is very close to Queen!")
print("\n🎯 This shows embeddings encode semantic relationships as geometric directions.")
```
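
With real embeddings the arithmetic rarely lands exactly on a word, so the usual approach is to search the vocabulary for the nearest vector. Below is a minimal sketch of that lookup using the same hand-built toy vectors plus one made-up distractor word. `nearest_word` is a hypothetical helper, and leaving the three input words out of the search is a common convention assumed here.

```python
import numpy as np

vocab = {
    "man":   np.array([0.5, 0.1, 0.3]),
    "woman": np.array([0.5, 0.9, 0.3]),
    "king":  np.array([0.9, 0.1, 0.8]),
    "queen": np.array([0.9, 0.9, 0.8]),
    "apple": np.array([0.1, 0.5, 0.0]),  # hypothetical distractor, not from the lesson
}

def nearest_word(target, vocab, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `target`."""
    best_word, best_score = None, -np.inf
    for word, vec in vocab.items():
        if word in exclude:
            continue
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

target = vocab["king"] - vocab["man"] + vocab["woman"]
answer = nearest_word(target, vocab, exclude=("king", "man", "woman"))
print(f"king - man + woman is closest to: {answer}")
```

For these toy vectors the search returns `queen`, because the vectors were constructed so that the relationship holds.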

### 📚 Recommended Resources

To dive deeper into word embeddings:
- [Illustrated Word2Vec](http://jalammar.github.io/illustrated-word2vec/) by Jay Alammar: A beautiful visual guide to how embeddings are trained
- [3Blue1Brown: Vectors](https://www.youtube.com/watch?v=fNk_zzaMoSs): The essence of linear algebra, visually explained
- [TensorFlow Embedding Projector](https://projector.tensorflow.org/): Explore real embeddings in an interactive 3D space

## 5. Knowledge Check Quiz

Test your understanding before moving to the programming assignment.

### Question 1: Conceptual
If two words have very similar meanings (e.g., "Happy" and "Joyful"), what should be true about their embedding vectors?

<details>
<summary>Click to reveal answer</summary>

**Answer:** They should point in roughly the same direction, meaning:
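1. Their **dot product** should be large relative to the product of their magnitudes (cosine similarity close to 1)
2. The **angle between them** should be small (close to 0°)
3. Their **Euclidean distance** (the L2 norm of their difference) should be small, assuming the vectors have similar magnitudes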

This is the geometric signature of semantic similarity!
</details>
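
The three criteria in the answer to Question 1 are tightly linked once vectors are scaled to unit length: in that case the squared Euclidean distance equals $2 - 2\cos(\theta)$, so a cosine near 1 forces a distance near 0. The sketch below checks this identity numerically on two random vectors, which are placeholders rather than real word embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(16)
b = rng.standard_normal(16)

# Normalize to unit length so that only direction matters
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

cosine = np.dot(a_hat, b_hat)
squared_distance = np.sum((a_hat - b_hat) ** 2)

print(f"cosine similarity:          {cosine:.6f}")
print(f"squared Euclidean distance: {squared_distance:.6f}")
print(f"2 - 2 * cosine:             {2 - 2 * cosine:.6f}")
```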

### Question 2: Mathematical
Given vectors $\vec{a} = [2, 3, 1]$ and $\vec{b} = [1, 0, 4]$, calculate their dot product.

<details>
<summary>Click to reveal answer</summary>

**Answer:** 
$$\vec{a} \cdot \vec{b} = (2 \times 1) + (3 \times 0) + (1 \times 4) = 2 + 0 + 4 = 6$$
</details>

### Question 3: Attention Preview
In the attention mechanism, we compute dot products between a "Query" vector and multiple "Key" vectors. What do you think a high dot product represents?

<details>
<summary>Click to reveal answer</summary>

**Answer:** A high dot product between a Query and a Key means the Query's token should "pay attention" to the Key's token: the Key is judged relevant to what the Query is looking for, so it receives a larger attention weight after the softmax.

This is the core insight: **Attention = Dot Product Similarity**

In Module 4, you'll implement the full formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The $QK^T$ part is computing all dot products between queries and keys!
</details>

## 6. Programming Assignment

Now it's time to implement what you've learned. This assignment builds toward the final project where you'll implement the full attention mechanism.

### Task Overview

In a real Transformer, the embedding layer is a matrix of shape `(vocab_size, d_model)` where:
- `vocab_size` = number of unique words (e.g., 50,000)
- `d_model` = embedding dimension (e.g., 768)

Each row is a word's embedding vector.
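
To illustrate what "each row is a word's embedding vector" means in practice, here is a small sketch of looking up a short sequence of token IDs. The sizes and the IDs are deliberately tiny and hypothetical, and the matrix is random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4        # tiny sizes chosen for illustration
E = rng.standard_normal((vocab_size, d_model))

token_ids = np.array([3, 7, 3])    # hypothetical IDs for a 3-token input
embedded = E[token_ids]            # integer-array indexing selects whole rows

print(embedded.shape)              # (sequence length, d_model)
```

The same ID always maps to the same row, so the first and third vectors in `embedded` are identical.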

### Your Tasks

1. Create a random embedding matrix `E` of shape `(100, 16)`
2. Extract vectors for two different words
3. Calculate the dot product between them
4. Calculate the L2 norm (magnitude) of each vector
5. Calculate the cosine similarity

### Implementation

```python
import numpy as np

def embedding_assignment():
    """
    Complete this function to implement basic embedding operations.

    Returns:
        E: Embedding matrix of shape (100, 16)
        vec1: Embedding vector for word at index 42
        vec2: Embedding vector for word at index 17
        dot_product: Dot product between vec1 and vec2
        norm1: L2 norm of vec1
        norm2: L2 norm of vec2
        cosine_sim: Cosine similarity between vec1 and vec2
    """
    # Set seed for reproducibility
    np.random.seed(42)

    # 1. TODO: Create a random embedding matrix E using np.random.randn
    # Shape should be (100, 16) representing 100 words with 16-dimensional embeddings
    vocab_size = 100
    d_model = 16
    E = None  # Replace with your implementation

    # 2. TODO: Extract embedding vectors for words at indices 42 and 17
    vec1 = None  # Replace with your implementation
    vec2 = None  # Replace with your implementation

    # 3. TODO: Calculate the dot product between vec1 and vec2
    # Hint: Use np.dot() or the @ operator
    dot_product = None  # Replace with your implementation

    # 4. TODO: Calculate the L2 norm (magnitude) of each vector
    # Hint: Use np.linalg.norm() or calculate manually: sqrt(sum(x^2))
    norm1 = None  # Replace with your implementation
    norm2 = None  # Replace with your implementation

    # 5. TODO: Calculate cosine similarity
    # Formula: cosine_sim = (vec1 · vec2) / (||vec1|| * ||vec2||)
    # This normalizes the dot product to be between -1 and 1
    cosine_sim = None  # Replace with your implementation

    return E, vec1, vec2, dot_product, norm1, norm2, cosine_sim

# Run your implementation
try:
    E, vec1, vec2, dot_prod, norm1, norm2, cos_sim = embedding_assignment()

    # Display results
    print("=" * 60)
    print("Assignment Results")
    print("=" * 60)
    print(f"Embedding matrix shape: {E.shape if E is not None else 'Not implemented'}")
    print(f"\nVector 1 (first 5 dims): {vec1[:5] if vec1 is not None else 'Not implemented'}")
    print(f"Vector 2 (first 5 dims): {vec2[:5] if vec2 is not None else 'Not implemented'}")
    print(f"\nDot product: {dot_prod}")
    print(f"Norm of vector 1: {norm1}")
    print(f"Norm of vector 2: {norm2}")
    print(f"Cosine similarity: {cos_sim}")
    print("=" * 60)

except Exception as e:
    print(f"❌ Error: {e}")
    print("Make sure to replace all None values with your implementation.")
```

### Local Tests (Run These Before Submission)

These tests check your implementation locally. You should pass all of these before submitting to the server.

```python
def test_embedding_assignment():
    """
    Local tests for the embedding assignment.
    Run this to verify your implementation before submitting to the server.
    """
    print("Running local tests...\n")

    try:
        E, vec1, vec2, dot_prod, norm1, norm2, cos_sim = embedding_assignment()

        passed = 0
        total = 7

        # Test 1: Shape of embedding matrix
        if E is not None and E.shape == (100, 16):
            print("✅ Test 1 PASSED: Embedding matrix has correct shape (100, 16)")
            passed += 1
        else:
            print(f"❌ Test 1 FAILED: Expected shape (100, 16), got {E.shape if E is not None else 'None'}")

        # Test 2: vec1 shape
        if vec1 is not None and vec1.shape == (16,):
            print("✅ Test 2 PASSED: Vector 1 has correct shape (16,)")
            passed += 1
        else:
            print(f"❌ Test 2 FAILED: Expected shape (16,), got {vec1.shape if vec1 is not None else 'None'}")

        # Test 3: vec2 shape
        if vec2 is not None and vec2.shape == (16,):
            print("✅ Test 3 PASSED: Vector 2 has correct shape (16,)")
            passed += 1
        else:
            print(f"❌ Test 3 FAILED: Expected shape (16,), got {vec2.shape if vec2 is not None else 'None'}")

        # Test 4: Dot product correctness
        if dot_prod is not None and vec1 is not None and vec2 is not None:
            expected_dot = np.dot(vec1, vec2)
            if np.isclose(dot_prod, expected_dot):
                print("✅ Test 4 PASSED: Dot product calculated correctly")
                passed += 1
            else:
                print("❌ Test 4 FAILED: Dot product mismatch")
        else:
            print("❌ Test 4 FAILED: Dot product not calculated")

        # Test 5: norm1 correctness
        if norm1 is not None and vec1 is not None:
            expected_norm = np.linalg.norm(vec1)
            if np.isclose(norm1, expected_norm):
                print("✅ Test 5 PASSED: Norm 1 calculated correctly")
                passed += 1
            else:
                print("❌ Test 5 FAILED: Norm 1 mismatch")
        else:
            print("❌ Test 5 FAILED: Norm 1 not calculated")

        # Test 6: norm2 correctness
        if norm2 is not None and vec2 is not None:
            expected_norm = np.linalg.norm(vec2)
            if np.isclose(norm2, expected_norm):
                print("✅ Test 6 PASSED: Norm 2 calculated correctly")
                passed += 1
            else:
                print("❌ Test 6 FAILED: Norm 2 mismatch")
        else:
            print("❌ Test 6 FAILED: Norm 2 not calculated")

        # Test 7: Cosine similarity correctness
        if cos_sim is not None and norm1 is not None and norm2 is not None and dot_prod is not None:
            expected_cos = dot_prod / (norm1 * norm2)
            if np.isclose(cos_sim, expected_cos):
                print("✅ Test 7 PASSED: Cosine similarity calculated correctly")
                passed += 1
            else:
                print("❌ Test 7 FAILED: Cosine similarity mismatch")
        else:
            print("❌ Test 7 FAILED: Cosine similarity not calculated")

        print(f"\n{'=' * 60}")
        print(f"Score: {passed}/{total} tests passed")
        print("=" * 60)

        if passed == total:
            print("\n🎉 All tests passed! You're ready to submit to the server.")
            print("\nNext step: Submit your code to receive your completion key.")
            return True
        else:
            print("\n⚠️  Some tests failed. Review your implementation and try again.")
            return False

    except Exception as e:
        print(f"❌ Error running tests: {e}")
        return False

# Run the tests
test_embedding_assignment()
```

## 7. Submission Instructions

Once you pass all local tests:

1. **Review your code** to ensure it follows best practices
2. **Submit to the API server** (instructions will be provided separately)
3. If all hidden tests pass, you'll **receive a unique completion key**
4. **Save your key**: you'll need all 4 module keys to generate your certificate

### What's Next?

In the next lesson, you'll learn about:
- **Matrix Multiplication** as linear transformations
- How to generate **Query**, **Key**, and **Value** matrices
- The geometric interpretation of transforming embedding space

These concepts will directly lead to implementing the attention formula in Module 4!

## 8. Summary & Key Takeaways

### What You Learned

1. **Embeddings** represent words as dense vectors in high-dimensional space
2. **Semantic similarity** is encoded as geometric proximity (distance and direction)
3. The **dot product** measures how aligned two vectors are:
   - High dot product = Similar meaning
   - Low/zero dot product = Unrelated meaning
   - Negative dot product = Opposite meaning
4. **Vector arithmetic** (e.g., King - Man + Woman ≈ Queen) works because embeddings encode semantic relationships as geometric directions
5. **Cosine similarity** normalizes the dot product to be between -1 and 1
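
One practical consequence of the last point is that the raw dot product grows with vector length, while cosine similarity depends only on direction. The sketch below shows this on two arbitrary example vectors; the values and the `cosine_similarity` helper are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])
b_scaled = 10.0 * b  # same direction as b, ten times longer

dot_ab = float(np.dot(a, b))
dot_ab_scaled = float(np.dot(a, b_scaled))
cos_ab = cosine_similarity(a, b)
cos_ab_scaled = cosine_similarity(a, b_scaled)

print(f"dot(a, b)       = {dot_ab:.3f}   cosine = {cos_ab:.3f}")
print(f"dot(a, 10 * b)  = {dot_ab_scaled:.3f}   cosine = {cos_ab_scaled:.3f}")
```

Scaling `b` multiplies the dot product by ten but leaves the cosine unchanged.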

### Connection to Attention

Everything you learned today builds toward the attention mechanism:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The $QK^T$ term computes dot products between:
- **Q** (Query): "What am I looking for?"
- **K** (Key): "What information do I have?"

Words with high dot products "attend" to each other!
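
As a preview, the formula above can be sketched in a few lines of NumPy. This is a minimal single-sequence version under stated assumptions: no masking, no batching, no learned projection matrices, and random inputs in place of real queries, keys, and values. The function names are this sketch's own, not the course's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2D arrays Q, K, V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # all query-key dot products, scaled
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))  # 2 queries, d_k = 8
K = rng.standard_normal((5, 8))  # 5 keys,    d_k = 8
V = rng.standard_normal((5, 4))  # 5 values,  d_v = 4

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # one output vector per query
print(weights.shape)  # one weight per (query, key) pair
```

Each row of `weights` is a probability distribution over the keys, so every output row is a weighted average of the value vectors.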

### Practice Problems

Before moving on, try these challenges:
1. Modify the visualization to show 3 semantic clusters instead of 2
2. Implement cosine similarity from scratch without using `np.linalg.norm`
3. Create a function that finds the top-K most similar words to a given word

### Additional Resources

- [3Blue1Brown: Dot Products](https://www.youtube.com/watch?v=LyGKycYT2v0): Visual intuition for dot products
- [NumPy dot documentation](https://numpy.org/doc/stable/reference/generated/numpy.dot.html): Official docs
- [Dot Product in ML (Towards Data Science)](https://towardsdatascience.com/dot-product-in-machine-learning-49e756e8c5a0): ML applications

---

**Ready to continue?** Move on to Module 2, Lesson 2: Matrix Multiplication and Linear Transformations.