# Module 2, Lesson 1: Vectors as Meaning

**Course:** Foundations of Attention  
**Module:** The Geometry of Intelligence  
**Learning Objective:** Define high-dimensional vector spaces and their relevance to semantic meaning (Concept)

---

## Introduction: The Bridge Between Language and Math

How do computers understand language?

The answer is: **they don't**, at least not the way humans do. But they can work with **geometry**, and that is the key insight behind modern language models:

> **We can represent meaning as geometry.**

In this lesson, you'll learn how words become vectors, and why high-dimensional space is the secret to encoding semantic meaning.

## 1. What is a Vector?

A **vector** is simply a list of numbers. That's it.

```python
v = [0.5, 0.3, -0.2]
```

But here's the powerful part: we can interpret those numbers as **coordinates in space**.

### Geometric Interpretation

A vector can be visualized as:
1. **A point** in space (the coordinates)
2. **An arrow** from the origin to that point (direction and magnitude)

For example:
- `[2, 3]` is a point in 2D space
- `[1, 2, 3]` is a point in 3D space
- `[0.1, 0.2, 0.3, ..., 0.768]` is a point in 768D space
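The point-and-arrow view can be made concrete with NumPy. The sketch below uses an illustrative vector `[3, 4]` (chosen for easy arithmetic, not taken from the lesson) and computes the arrow's length and its direction as a unit vector.

```python
import numpy as np

# Illustrative 2D vector, chosen so the arithmetic is easy to check
v = np.array([3.0, 4.0])

# Magnitude: length of the arrow from the origin to the point (3, 4)
magnitude = np.linalg.norm(v)

# Direction: the same arrow rescaled to length 1
direction = v / magnitude

print(magnitude)   # 5.0
print(direction)   # [0.6 0.8]
```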

### Why Vectors Matter for AI

In machine learning, we use vectors to represent **anything**:
- A word: `"cat" → [0.12, -0.45, 0.78, ...]`
- An image: `cat.jpg → [0.23, 0.67, -0.12, ...]`
- A user: `user_123 → [0.34, 0.89, -0.56, ...]`

Once everything is a vector, we can use **geometry** to measure relationships.

## 2. What is a Vector Space?

A **vector space** is the mathematical "universe" where vectors live. Think of it as:
- A canvas where every possible vector has a unique location
- A coordinate system with axes (dimensions)

### Dimensions

The **dimension** of a vector space is how many numbers you need to specify a location:

| Space | Dimensions | Example Vector |
|-------|------------|----------------|
| 2D (plane) | 2 | `[3, 4]` |
| 3D (physical space) | 3 | `[1, 2, 5]` |
| 768D (GPT-2 embeddings) | 768 | `[0.12, -0.34, ..., 0.56]` |
| 1536D (OpenAI `text-embedding-ada-002`) | 1536 | Even more numbers! |

### Key Properties

In a vector space, you can:
1. **Add vectors**: `[1, 2] + [3, 4] = [4, 6]`
2. **Scale vectors**: `2 × [1, 2] = [2, 4]`
3. **Measure distance** between vectors
4. **Measure direction** (which we'll explore in the next lesson)

These operations let us do math on meaning!
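As a quick check, the first three operations can be reproduced in NumPy. This is a minimal sketch using the same small vectors as the list above; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

total = a + b                      # element-wise addition: [4, 6]
scaled = 2 * a                     # scalar multiplication: [2, 4]
distance = np.linalg.norm(a - b)   # straight-line distance: sqrt(8)

print(total, scaled, distance)
```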

## 3. From Words to Vectors: The Embedding

### The Problem: Computers Can't Read

Computers see text as raw bytes:
```
"cat" = [99, 97, 116]  # ASCII codes
```

There's no semantic meaning in these numbers. "cat" and "dog" are just different byte sequences.
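To see this for yourself, the byte values can be inspected directly in Python. The sketch below also encodes "car" and "dog" (extra words chosen for illustration) to show that byte-level closeness has nothing to do with meaning.

```python
# Encode each word as its ASCII byte values
codes = list("cat".encode("ascii"))
print(codes)                          # [99, 97, 116]

# "car" differs from "cat" by a single byte, yet the meanings are unrelated
print(list("car".encode("ascii")))    # [99, 97, 114]

# "dog" is semantically closer to "cat" but shares no bytes with it
print(list("dog".encode("ascii")))    # [100, 111, 103]
```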

### The Old Solution: One-Hot Encoding

Historically, we represented each word as a sparse vector:

```
Vocabulary: ["cat", "dog", "king", "queen", "apple"]

cat   = [1, 0, 0, 0, 0]
dog   = [0, 1, 0, 0, 0]
king  = [0, 0, 1, 0, 0]
queen = [0, 0, 0, 1, 0]
apple = [0, 0, 0, 0, 1]
```

**Problem:** Every word is exactly the same distance (√2) from every other word, so the representation carries no notion of similarity.
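The equal-distance claim is easy to verify. The following sketch builds the one-hot vectors for the five-word vocabulary above with `np.eye` and measures every pairwise Euclidean distance.

```python
import numpy as np
from itertools import combinations

vocab = ["cat", "dog", "king", "queen", "apple"]
one_hot = np.eye(len(vocab))   # row i is the one-hot vector for vocab[i]

# Euclidean distance for every unordered pair of words
distances = {
    (vocab[i], vocab[j]): np.linalg.norm(one_hot[i] - one_hot[j])
    for i, j in combinations(range(len(vocab)), 2)
}

for pair, d in distances.items():
    print(pair, round(d, 4))   # every pair prints 1.4142 (the square root of 2)
```

Whichever two words you pick, the vectors differ in exactly two positions, so the distance is always the same.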

### The Modern Solution: Dense Embeddings

An **embedding** is a learned dense vector that captures semantic meaning:

```python
# These vectors are learned by neural networks
cat   = [0.82, -0.13, 0.51, -0.74, 0.23, ...] # 768 numbers
dog   = [0.79, -0.09, 0.48, -0.71, 0.19, ...] # Similar to cat!
king  = [-0.21, 0.88, 0.05, 0.43, -0.67, ...] 
queen = [-0.19, 0.85, 0.08, 0.47, -0.65, ...] # Similar to king!
apple = [0.34, 0.12, -0.89, 0.45, 0.91, ...] # Different from cat/dog!
```

**Key Insight:** Words with similar meanings are **close together** in this high-dimensional space.
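A rough check of this claim, assuming we keep only the five components printed above and ignore the elided ones (so the numbers are illustrative, not real model output):

```python
import numpy as np

# First five components only; the remaining dimensions are elided in the lesson
cat   = np.array([0.82, -0.13, 0.51, -0.74, 0.23])
dog   = np.array([0.79, -0.09, 0.48, -0.71, 0.19])
king  = np.array([-0.21, 0.88, 0.05, 0.43, -0.67])
queen = np.array([-0.19, 0.85, 0.08, 0.47, -0.65])
apple = np.array([0.34, 0.12, -0.89, 0.45, 0.91])

d_cat_dog = np.linalg.norm(cat - dog)          # about 0.08
d_king_queen = np.linalg.norm(king - queen)    # about 0.06
d_cat_apple = np.linalg.norm(cat - apple)      # about 2.03

print(f"cat-dog:    {d_cat_dog:.4f}")
print(f"king-queen: {d_king_queen:.4f}")
print(f"cat-apple:  {d_cat_apple:.4f}")
```

The related pairs sit more than an order of magnitude closer together than the unrelated pair.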

## 4. Visualizing High-Dimensional Space

We can't visualize 768 dimensions, but we can visualize 2D or 3D projections.
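One common way to obtain such a projection is principal component analysis (PCA). The sketch below is an illustration under assumptions, not the method used later in this lesson: it builds two synthetic 768-dimensional clusters as stand-ins for real embeddings and projects them to 2D with PCA computed via `np.linalg.svd`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic 768-dimensional clusters (stand-ins for real embeddings)
group_a = rng.normal(loc=1.0, scale=0.2, size=(5, 768))
group_b = rng.normal(loc=-1.0, scale=0.2, size=(5, 768))
X = np.vstack([group_a, group_b])

# PCA via SVD: center the data, then project onto the top 2 principal directions
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T

print(X_2d.shape)   # (10, 2)
print(X_2d[:, 0])   # first coordinate: one sign for group_a, the opposite for group_b
```

Even after discarding 766 of the 768 dimensions, the first projected coordinate still separates the two groups.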

Let's create a toy embedding space with two semantic clusters:
1. **Fruits** (Apple, Banana, Orange, Grape, Pear)
2. **Vehicles** (Car, Truck, Bus, Bike, Train)

Watch how words with similar meanings cluster together.

In [None]:
import numpy as np
import plotly.graph_objects as go

# Set random seed for reproducibility
np.random.seed(42)

# Generate fruit embeddings: Centered around [1, 1, 1]
fruits = np.random.normal(loc=[1, 1, 1], scale=0.2, size=(5, 3))
fruit_labels = ["Apple", "Banana", "Orange", "Grape", "Pear"]

# Generate vehicle embeddings: Centered around [-1, -1, -1]
vehicles = np.random.normal(loc=[-1, -1, -1], scale=0.2, size=(5, 3))
vehicle_labels = ["Car", "Truck", "Bus", "Bike", "Train"]

# Create 3D scatter plot
fig = go.Figure()

fig.add_trace(go.Scatter3d(
    x=fruits[:,0], y=fruits[:,1], z=fruits[:,2],
    mode='markers+text',
    text=fruit_labels,
    textposition="top center",
    name='Fruits',
    marker=dict(size=10, color='red', opacity=0.8)
))

fig.add_trace(go.Scatter3d(
    x=vehicles[:,0], y=vehicles[:,1], z=vehicles[:,2],
    mode='markers+text',
    text=vehicle_labels,
    textposition="top center",
    name='Vehicles',
    marker=dict(size=10, color='blue', opacity=0.8)
))

fig.update_layout(
    title="Semantic Clusters in 3D Embedding Space",
    scene=dict(
        xaxis_title="Dimension 1",
        yaxis_title="Dimension 2",
        zaxis_title="Dimension 3"
    ),
    width=800,
    height=600
)

fig.show()

print("\n✅ Observation: Fruits cluster together (red), far from Vehicles (blue).")
print("This spatial separation is how the model 'knows' an Apple is not a Car.")

### 🧠 Thought Exercise

**Question:** We visualized 3D space above. Real embeddings often use hundreds or thousands of dimensions (768 in BERT-base and GPT-2 small, for example). Why do we need so many dimensions?

<details>
<summary>Click to reveal answer</summary>

**Answer:** More dimensions = more capacity to encode different types of meaning.

Think about what embeddings need to capture:
- **Synonyms**: happy ≈ joyful
- **Antonyms**: hot ≠ cold
- **Categories**: cat, dog → animals
- **Grammar**: walk, walked, walking
- **Context**: "bank" (river) vs "bank" (money)
- **Analogies**: king - man + woman ≈ queen

In 3D, you can only separate a few concepts cleanly. With 768 dimensions, you can encode thousands of subtle relationships simultaneously!

**Analogy**: Think of dimensions as "channels" in your brain. More channels = more nuanced understanding.
</details>

## 5. Understanding Distance in High-Dimensional Space

If similar words are close together, how do we measure "closeness"?

### Euclidean Distance

The most intuitive measure is **Euclidean distance** (straight-line distance):

$$d(\vec{a}, \vec{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

In 2D: $d([1,2], [4,6]) = \sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = 5$
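The worked example can be confirmed in code. This sketch computes the distance twice, once by following the formula term by term and once with `np.linalg.norm`.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([4, 6])

# Follow the formula: square the differences, sum them, take the square root
manual = np.sqrt(np.sum((a - b) ** 2))

# Same result from NumPy's built-in norm
builtin = np.linalg.norm(a - b)

print(manual, builtin)   # 5.0 5.0
```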

Let's calculate distances in our toy embedding space:

In [None]:
# Calculate distances between words
apple = fruits[0]
banana = fruits[1]
car = vehicles[0]

# Distance between two fruits (similar words)
dist_apple_banana = np.linalg.norm(apple - banana)

# Distance between fruit and vehicle (different words)
dist_apple_car = np.linalg.norm(apple - car)

print("=" * 60)
print("Distance Analysis in Embedding Space")
print("=" * 60)
print(f"\nDistance(Apple, Banana): {dist_apple_banana:.4f}")
print("  → Both are fruits → SMALL distance → Similar meaning")

print(f"\nDistance(Apple, Car): {dist_apple_car:.4f}")
print("  → Fruit vs Vehicle → LARGE distance → Different meaning")

print("\n" + "=" * 60)
print("🎯 Key Insight: Semantic similarity = Geometric proximity")
print("=" * 60)

## 6. The Curse and Blessing of Dimensionality

### The Blessing
High-dimensional spaces give us:
- **More room** to separate concepts
- **More capacity** to encode complex relationships
- **More expressiveness** for subtle meanings

### The Curse
But there's a catch: in very high dimensions, distances between points tend to **concentrate**. The nearest and farthest neighbors of a point end up almost equally far away, so raw distance alone becomes less informative.

**Solution:** Instead of just measuring distance, we also measure **direction** (which we'll learn about in the next lesson using dot products).

This is why modern AI uses **both**:
- Distance (how far apart?)
- Direction (do they point the same way?)
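The curse can be observed with a small experiment. The sketch below makes some assumptions: points are drawn uniformly from the unit hypercube, and `relative_contrast` is a helper defined here for illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=200):
    """Return (max - min) / min over distances from one query point to random points."""
    points = rng.random((n_points, dim))   # uniform samples in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast = {dim: relative_contrast(dim) for dim in (2, 16, 768)}
for dim, value in contrast.items():
    print(f"dim={dim:4d}  relative contrast={value:.3f}")
```

The contrast shrinks sharply as the dimension grows: in 768 dimensions the nearest and farthest points are at nearly the same distance from the query.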

## 7. How Are Embeddings Created?

You might wonder: where do these magic vectors come from?

### Training Process (Simplified)

Embeddings are **learned** by neural networks using one of these approaches:

1. **Word2Vec** (2013): Predict nearby words
   - "The cat sat on the ___" → predict "mat"
   - Words that appear in similar contexts get similar vectors

2. **Transformer Models** (2017+): Learn from massive text
   - GPT, BERT, etc. learn embeddings as part of their architecture
   - Updated continuously during training

### The Key Principle

> **"You shall know a word by the company it keeps."** (J.R. Firth, 1957)

Words that appear in similar **contexts** (nearby words, similar sentences) end up with similar **vectors**.
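This principle can be illustrated without any neural network. The sketch below is a plain count-based co-occurrence model, not Word2Vec itself, run on a tiny made-up corpus: each word's vector simply counts which words appear within two positions of it.

```python
import numpy as np

# Tiny made-up corpus, for illustration only
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the king ruled the land",
    "the queen ruled the land",
]

vocab = sorted({w for sentence in corpus for w in sentence.split()})
index = {w: i for i, w in enumerate(vocab)}

# counts[i, j] = how often word j appears within `window` positions of word i
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    words = sentence.split()
    for pos, word in enumerate(words):
        lo, hi = max(0, pos - window), min(len(words), pos + window + 1)
        for ctx in words[lo:pos] + words[pos + 1:hi]:
            counts[index[word], index[ctx]] += 1

def dist(a, b):
    """Euclidean distance between the context-count vectors of two words."""
    return np.linalg.norm(counts[index[a]] - counts[index[b]])

print(dist("cat", "dog"))    # 0.0 (identical contexts)
print(dist("cat", "king"))   # 2.0 (different contexts)
```

Learned embeddings are dense and far richer than raw counts, but the underlying signal is the same: shared context pulls vectors together.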

### 📚 Recommended Resources

To understand how embeddings are trained:
- [Illustrated Word2Vec](http://jalammar.github.io/illustrated-word2vec/) by Jay Alammar
- [3Blue1Brown: Vectors](https://www.youtube.com/watch?v=fNk_zzaMoSs): visual introduction to vectors
- [TensorFlow Embedding Projector](https://projector.tensorflow.org/): explore real embeddings interactively

## 8. Knowledge Check Quiz

Test your understanding before the programming assignment.

### Question 1
What is the main advantage of representing words as dense vectors (embeddings) instead of one-hot vectors?

<details>
<summary>Click to reveal answer</summary>

**Answer:** Dense embeddings capture **semantic relationships** through geometric proximity. Words with similar meanings are close together in vector space, whereas one-hot vectors treat all words as equally different (all pairs are the same distance apart).
</details>

### Question 2
If a word embedding has 768 dimensions, what does that number represent?

<details>
<summary>Click to reveal answer</summary>

**Answer:** 768 dimensions means each word is represented by a vector with 768 numbers. Each dimension can be thought of as capturing a different aspect of meaning (though they're not interpretable individually). More dimensions = more capacity to encode complex semantic relationships.
</details>

### Question 3
In our fruit/vehicle visualization, why do fruits cluster together?

<details>
<summary>Click to reveal answer</summary>

**Answer:** In this toy example, the fruits cluster together because we generated them that way: each fruit vector was sampled near the same center, `[1, 1, 1]`. The plot mimics what a trained model does. In a real model, fruits end up near each other because they share semantic properties (edible, grows on plants, often sweet) and therefore appear in similar contexts in the training text.
</details>

## 9. Programming Assignment

Now it's your turn to work with embedding spaces.

### Task Overview

In a real Transformer model, embeddings are stored in a matrix:
- Shape: `(vocab_size, d_model)`
- `vocab_size` = number of unique words (e.g., 50,000)
- `d_model` = embedding dimension (e.g., 768)

Each **row** is one word's embedding vector.
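Looking up a word's embedding is therefore just row indexing. The sketch below is illustrative: it uses a hypothetical four-word vocabulary and an untrained matrix of random values.

```python
import numpy as np

# Hypothetical 4-word vocabulary mapped to row indices
vocab = {"cat": 0, "dog": 1, "king": 2, "queen": 3}
d_model = 3

rng = np.random.default_rng(0)
E = rng.standard_normal((len(vocab), d_model))   # untrained, random values

king_vec = E[vocab["king"]]                      # single row lookup
pair = E[[vocab["cat"], vocab["dog"]]]           # several rows at once

print(E.shape, king_vec.shape, pair.shape)       # (4, 3) (3,) (2, 3)
```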

### Your Tasks

1. Create a random embedding matrix `E` of shape `(100, 16)`
2. Extract embedding vectors for three different "words" (indices)
3. Calculate the Euclidean distance between pairs of words
4. Calculate the magnitude (L2 norm) of a vector
5. Identify which pair of words is most similar (smallest distance)

In [None]:
import numpy as np

def embedding_assignment():
    """
    Complete this function to work with embedding vectors and vector spaces.
    
    Returns:
        E: Embedding matrix of shape (100, 16)
        word1: Embedding vector for word at index 42
        word2: Embedding vector for word at index 17
        word3: Embedding vector for word at index 89
        dist_12: Euclidean distance between word1 and word2
        dist_13: Euclidean distance between word1 and word3
        dist_23: Euclidean distance between word2 and word3
        magnitude: L2 norm of word1
    """
    # Set seed for reproducibility
    np.random.seed(42)
    
    # 1. TODO: Create embedding matrix E of shape (100, 16)
    # Use np.random.randn to create random embeddings
    vocab_size = 100
    d_model = 16
    E = None  # Replace this
    
    # 2. TODO: Extract embedding vectors for three words
    # Get the rows at indices 42, 17, and 89
    word1 = None  # index 42
    word2 = None  # index 17
    word3 = None  # index 89
    
    # 3. TODO: Calculate Euclidean distances between word pairs
    # Hint: Use np.linalg.norm(vec_a - vec_b)
    # or calculate manually: sqrt(sum((a - b)^2))
    dist_12 = None  # Distance between word1 and word2
    dist_13 = None  # Distance between word1 and word3
    dist_23 = None  # Distance between word2 and word3
    
    # 4. TODO: Calculate the L2 norm (magnitude) of word1
    # Hint: Use np.linalg.norm(word1)
    # or calculate manually: sqrt(sum(x^2))
    magnitude = None  # Replace this
    
    return E, word1, word2, word3, dist_12, dist_13, dist_23, magnitude

# Run your implementation
try:
    E, w1, w2, w3, d12, d13, d23, mag = embedding_assignment()
    
    print("=" * 60)
    print("Assignment Results")
    print("=" * 60)
    print(f"Embedding matrix shape: {E.shape if E is not None else 'Not implemented'}")
    print(f"\nWord 1 (first 5 dims): {w1[:5] if w1 is not None else 'Not implemented'}")
    print(f"Word 2 (first 5 dims): {w2[:5] if w2 is not None else 'Not implemented'}")
    print(f"Word 3 (first 5 dims): {w3[:5] if w3 is not None else 'Not implemented'}")
    print(f"\nDistance(word1, word2): {d12}")
    print(f"Distance(word1, word3): {d13}")
    print(f"Distance(word2, word3): {d23}")
    print(f"\nMagnitude of word1: {mag}")
    
    if all(x is not None for x in [d12, d13, d23]):
        min_dist = min(d12, d13, d23)
        if min_dist == d12:
            print("\n✅ Most similar pair: word1 and word2")
        elif min_dist == d13:
            print("\n✅ Most similar pair: word1 and word3")
        else:
            print("\n✅ Most similar pair: word2 and word3")
    print("=" * 60)
    
except Exception as e:
    print(f"❌ Error: {e}")
    print("Make sure to replace all None values with your implementation.")

### Local Tests

Run these tests before submitting to the server.

In [None]:
def test_embedding_assignment():
    """
    Local tests for the embedding assignment.
    """
    print("Running local tests...\n")
    
    try:
        E, w1, w2, w3, d12, d13, d23, mag = embedding_assignment()
        
        passed = 0
        total = 8
        
        # Test 1: Embedding matrix shape
        if E is not None and E.shape == (100, 16):
            print("✅ Test 1 PASSED: Embedding matrix has correct shape (100, 16)")
            passed += 1
        else:
            print(f"❌ Test 1 FAILED: Expected shape (100, 16), got {E.shape if E is not None else 'None'}")
        
        # Test 2-4: Vector shapes
        for i, (word, name) in enumerate([(w1, "word1"), (w2, "word2"), (w3, "word3")], start=2):
            if word is not None and word.shape == (16,):
                print(f"✅ Test {i} PASSED: {name} has correct shape (16,)")
                passed += 1
            else:
                print(f"❌ Test {i} FAILED: {name} expected shape (16,), got {word.shape if word is not None else 'None'}")
        
        # Test 5-7: Distance calculations
        if all(x is not None for x in [w1, w2, w3]):
            expected_d12 = np.linalg.norm(w1 - w2)
            expected_d13 = np.linalg.norm(w1 - w3)
            expected_d23 = np.linalg.norm(w2 - w3)
            
            for i, (dist, expected, name) in enumerate([
                (d12, expected_d12, "dist_12"),
                (d13, expected_d13, "dist_13"),
                (d23, expected_d23, "dist_23")
            ], start=5):
                if dist is not None and np.isclose(dist, expected):
                    print(f"✅ Test {i} PASSED: {name} calculated correctly")
                    passed += 1
                else:
                    print(f"❌ Test {i} FAILED: {name} incorrect")
        else:
            print("❌ Tests 5-7 SKIPPED: Vectors not extracted")
        
        # Test 8: Magnitude
        if mag is not None and w1 is not None:
            expected_mag = np.linalg.norm(w1)
            if np.isclose(mag, expected_mag):
                print("✅ Test 8 PASSED: Magnitude calculated correctly")
                passed += 1
            else:
                print("❌ Test 8 FAILED: Magnitude incorrect")
        else:
            print("❌ Test 8 FAILED: Magnitude not calculated")
        
        print(f"\n{'=' * 60}")
        print(f"Score: {passed}/{total} tests passed")
        print("=" * 60)
        
        if passed == total:
            print("\n🎉 All tests passed! Ready to submit to the server.")
            return True
        else:
            print("\n⚠️  Some tests failed. Review your implementation.")
            return False
            
    except Exception as e:
        print(f"❌ Error running tests: {e}")
        return False

# Run the tests
test_embedding_assignment()

## 10. Submission Instructions

Once you pass all local tests:

1. **Submit your code** to the API server (instructions provided separately)
2. **Server runs hidden tests** to validate your implementation
3. **Receive your completion key** if all tests pass
4. **Save the key**: you'll need all 4 module keys for your certificate

### What's Next?

In **Lesson 2**, you'll learn about:
- **The Dot Product**: Measuring similarity between vectors
- How direction (not just distance) encodes semantic relationships
- The connection to the attention mechanism

This will answer the question: *How do we know which words should "pay attention" to each other?*

## 11. Summary & Key Takeaways

### What You Learned

1. **Vectors** are lists of numbers that can represent points in space
2. **Vector spaces** are mathematical "universes" where vectors live
3. **High-dimensional spaces** (768D, 1536D) give us capacity to encode complex semantic relationships
4. **Embeddings** transform words into vectors where semantic similarity = geometric proximity
5. **Euclidean distance** measures how far apart two vectors are
6. Embeddings are **learned** by neural networks from massive text data

### The Big Picture

You've taken the first step toward understanding attention:

```
Words → Vectors → Geometric Space → Similarity Measures → Attention
  ^                                                          ^
 (This lesson)                                      (Module 4)
```

### Connection to Attention

In the attention mechanism, we need to know **which words are related**. By representing words as vectors:
- We can **measure** semantic relationships mathematically
- We can **compute** which words should attend to each other
- We can **learn** these representations from data

Next lesson: How to measure similarity using the **dot product**!

### Additional Resources

- [Illustrated Word2Vec](http://jalammar.github.io/illustrated-word2vec/): how embeddings are trained
- [3Blue1Brown: Vectors](https://www.youtube.com/watch?v=fNk_zzaMoSs): visual introduction to linear algebra
- [TensorFlow Embedding Projector](https://projector.tensorflow.org/): explore real embeddings in 3D

---

**Next:** Module 2, Lesson 2 - The Dot Product: Measuring Similarity