# Tutorial 5: Embeddings as Functor Values

**Course 3: Document Functors (Lorren Dray)**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/category-theory-document-functors/blob/main/notebooks/05_embeddings_as_functor_values.ipynb)

---

## Overview

In Year 942, Dray made a crucial connection: **document embeddings are the numerical values that functors assign**. Each embedding dimension corresponds to a probe, and the values are the functor's response to that probe.

### Learning Goals

1. See embeddings as numerical functor values
2. Understand probes as access methods
3. Connect categorical structure to vector space structure
4. Build intuition for why embeddings work

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load datasets
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/densworld-datasets/main/data/"

embeddings = pd.read_csv(BASE_URL + "embedding_correspondences.csv")
print(f"Loaded {len(embeddings)} embedding observations")
embeddings.head()

## Part 1: From Sets to Numbers

In earlier tutorials, a document functor D assigned *sets* to access methods:
```
D(subject_catalog) = {"boundaries", "surveys", "SW-sector"}
```

Dray's insight: we can assign *numbers* instead:
```
D("relevance to boundary studies") = 0.85
```

These numerical values form an **embedding** — a vector representation of the document.
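
The shift from sets to numbers can be sketched in plain Python. This is only an illustration: the two dictionaries reuse the example values shown above, and none of it is read from the dataset.

```python
# Set-valued assignment: an access method returns a set of labels.
set_valued = {"subject_catalog": {"boundaries", "surveys", "SW-sector"}}

# Number-valued assignment: a probe returns a single number.
number_valued = {"relevance to boundary studies": 0.85}

# A set-valued response supports membership tests ...
has_boundaries = "boundaries" in set_valued["subject_catalog"]

# ... while a numerical response supports ordering and arithmetic.
is_relevant = number_valued["relevance to boundary studies"] > 0.5

print(has_boundaries, is_relevant)
```

The practical difference is what you can do with the answer: sets can be intersected and tested for membership, while numbers can be compared, averaged, and stacked into vectors.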

In [None]:
# Look at embeddings for DOC-001
doc_001_emb = embeddings[embeddings['document_id'] == 'DOC-001']

print("DOC-001 (Boundary Survey Report SW-6) as a numerical functor:\n")
for _, row in doc_001_emb.iterrows():
    print(f"  D({row['probe_name']}) = {row['numerical_value']:.2f}")
    print(f"    (Dimension: {row['embedding_dimension']}, Level: {row['functor_value']})")
    print()

## Part 2: Probes as Access Methods

Each **probe** is a question we can ask about a document:
- "How relevant is this document to boundary studies?"
- "How relevant is this document to category theory?"
- "Was this authored by Kell?"

The document's response to each probe is a number between 0 and 1.
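
One way to picture a probe is as a function from a document to a number in [0, 1]. The sketch below is hypothetical: the document record, its field names, and both scoring rules are assumptions made for illustration, not how the values in `embedding_correspondences.csv` were produced.

```python
# Hypothetical document record; the field names are assumptions.
example_doc = {
    "title": "Example document",
    "author": "Kell",
    "keywords": {"boundaries", "surveys"},
}

def authored_by_kell(doc):
    """Binary probe: 1.0 if the author is Kell, otherwise 0.0."""
    return 1.0 if doc["author"] == "Kell" else 0.0

def boundary_relevance(doc):
    """Toy graded probe: share of boundary-related terms found in the keywords."""
    terms = {"boundaries", "surveys", "borders", "maps"}  # assumed term list
    return len(doc["keywords"] & terms) / len(terms)

responses = {
    "authored_by_kell": authored_by_kell(example_doc),
    "boundary_relevance": boundary_relevance(example_doc),
}
print(responses)
```

Both probes return a value in [0, 1], so their responses can sit side by side as coordinates of the same vector.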

In [None]:
# Get all unique probes
probes = embeddings[['probe_id', 'probe_name', 'probe_description']].drop_duplicates()

print("Available Probes (Access Methods for Numerical Functors):\n")
for _, probe in probes.iterrows():
    print(f"  {probe['probe_id']}: {probe['probe_name']}")
    print(f"    Description: {probe['probe_description']}")
    print()

## Part 3: Documents as Vectors

When we apply a document functor to all probes, we get a **vector**:

```
D = [D(probe_1), D(probe_2), ..., D(probe_n)]
  = [0.85, 0.12, 0.45, 0.08, 0.92]
```

This vector is the document's **embedding**.
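
A vector only makes sense if every document lists its probes in the same order. Here is a minimal sketch of that idea, assuming a fixed probe order; `author_kell` and all of the numbers are made up for illustration.

```python
# Assumed probe order: the position in this list is the embedding dimension.
PROBE_ORDER = ["topic_boundaries", "topic_categories", "author_kell"]

# Made-up functor values for one document, stored in arbitrary order.
doc_values = {
    "topic_categories": 0.12,
    "author_kell": 0.45,
    "topic_boundaries": 0.85,
}

def to_vector(values, probe_order):
    """Return one value per probe, arranged in probe_order."""
    return [values[p] for p in probe_order]

vector = to_vector(doc_values, PROBE_ORDER)
print(vector)  # [0.85, 0.12, 0.45]
```

Fixing the order once and reusing it for every document is what makes coordinate `i` mean the same probe everywhere.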

In [None]:
def document_to_vector(doc_id, embeddings_df):
    """
    Convert a document's functor values to a vector.
    """
    doc_emb = embeddings_df[embeddings_df['document_id'] == doc_id]
    
    # Sort by embedding dimension
    doc_emb = doc_emb.sort_values('embedding_dimension')
    
    probes = doc_emb['probe_name'].values
    values = doc_emb['numerical_value'].values
    
    return probes, values

# Get vectors for several documents
unique_docs = embeddings['document_id'].unique()[:5]

print("Documents as Vectors (Functor Values):\n")
for doc_id in unique_docs:
    probes, values = document_to_vector(doc_id, embeddings)
    title = embeddings[embeddings['document_id'] == doc_id]['document_title'].iloc[0]
    print(f"{doc_id}: {title}")
    print(f"  Vector: [{', '.join(f'{v:.2f}' for v in values)}]")
    print()

In [None]:
# Visualize embeddings as a heatmap
# Create a pivot table: documents × probes
pivot = embeddings.pivot_table(
    index='document_title',
    columns='probe_name',
    values='numerical_value',
    aggfunc='first'
)

fig, ax = plt.subplots(figsize=(14, 8))
sns.heatmap(pivot, annot=True, fmt='.2f', cmap='YlOrRd', 
            linewidths=0.5, ax=ax, cbar_kws={'label': 'Functor Value'})

ax.set_title('Document Embeddings as Functor Values\nD(probe) = numerical response', fontsize=12)
ax.set_xlabel('Probe (Access Method)')
ax.set_ylabel('Document')

plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

## Part 4: The Connection to Transformers

In modern transformers, documents (or tokens) are represented as embedding vectors:

| Dray's Framework | Transformer Component |
|------------------|----------------------|
| Probe | Embedding dimension |
| Functor value | Embedding coordinate |
| Document functor | Document embedding |
| Applying probe | Projecting onto dimension |

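By analogy, each dimension of the embedding acts as a "probe": a learned question the model asks about the input. In trained models, individual dimensions are rarely this cleanly interpretable, so treat the correspondence as a way of thinking about embeddings rather than a literal description of what each coordinate means.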
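
The last table row, "applying probe" as "projecting onto dimension", can be made concrete: reading coordinate `i` of a vector is the same as taking its dot product with the `i`-th standard basis vector. The sketch below uses plain Python and the example vector from Part 3; the helper names are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def basis_vector(i, n):
    """Standard basis vector of length n: 1.0 at position i, 0.0 elsewhere."""
    return [1.0 if k == i else 0.0 for k in range(n)]

embedding = [0.85, 0.12, 0.45, 0.08, 0.92]  # example vector from Part 3

# Applying probe 2 = projecting onto dimension 2.
probe_2 = basis_vector(2, len(embedding))
response = dot(embedding, probe_2)
print(response)  # 0.45
```

Seen this way, a probe does not have to be a single coordinate: any direction in the space defines a question, and the dot product is the document's answer.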

In [None]:
# Plot documents in a 2D slice of the embedding space
fig, ax = plt.subplots(figsize=(10, 10))

# Use just two dimensions for visualization
x_probe = 'topic_boundaries'
y_probe = 'topic_categories'

# Get values for these probes
x_values = embeddings[embeddings['probe_name'] == x_probe].set_index('document_id')['numerical_value']
y_values = embeddings[embeddings['probe_name'] == y_probe].set_index('document_id')['numerical_value']

# Find documents that have both probes
common_docs = set(x_values.index) & set(y_values.index)

for doc_id in common_docs:
    x = x_values.loc[doc_id]
    y = y_values.loc[doc_id]
    
    ax.scatter(x, y, s=200, alpha=0.7)
    
    # Get short title
    title = embeddings[embeddings['document_id'] == doc_id]['document_title'].iloc[0]
    short_title = title[:25] + '...' if len(title) > 25 else title
    ax.annotate(short_title, (x + 0.02, y + 0.02), fontsize=8)

ax.set_xlabel(f'D({x_probe}): Relevance to Boundary Studies', fontsize=11)
ax.set_ylabel(f'D({y_probe}): Relevance to Category Theory', fontsize=11)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.set_title('Documents in 2D Embedding Space\n(Two Functor Values as Coordinates)', fontsize=12)

plt.tight_layout()
plt.show()

## Part 5: Dray's Key Insight

> "Document embeddings are the numerical values that functors assign — each dimension corresponds to a probe. When we embed a document, we are computing its response to a battery of questions."
> — Lorren Dray, *Documents as Numerical Functors* (Year 945)

In [None]:
# Compare two documents by their functor values
doc1_id = 'DOC-001'  # Boundary Survey Report
doc2_id = 'DOC-002'  # On the Intensity of Passages (Vance)

doc1_emb = embeddings[embeddings['document_id'] == doc1_id].set_index('probe_name')['numerical_value']
doc2_emb = embeddings[embeddings['document_id'] == doc2_id].set_index('probe_name')['numerical_value']

# Find common probes
common_probes = sorted(set(doc1_emb.index) & set(doc2_emb.index))

fig, ax = plt.subplots(figsize=(10, 6))

x = np.arange(len(common_probes))
width = 0.35

bars1 = ax.bar(x - width/2, [doc1_emb.loc[p] for p in common_probes], width, label='Boundary Survey (DOC-001)', color='steelblue')
bars2 = ax.bar(x + width/2, [doc2_emb.loc[p] for p in common_probes], width, label='Intensity of Passages (DOC-002)', color='coral')

ax.set_xlabel('Probe')
ax.set_ylabel('Functor Value')
ax.set_title('Comparing Document Functor Values\nSame probes, different responses', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels([p.replace('topic_', '') for p in common_probes], rotation=45, ha='right')
ax.legend()
ax.set_ylim(0, 1)
ax.grid(True, axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## Part 6: Similarity Through Functor Values

Two documents are "similar" if their functor values are similar across probes.

This is why **cosine similarity** works for document comparison: it measures how aligned two embedding vectors are, which corresponds to how similarly the documents respond to the same set of probes.
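
Cosine similarity is the dot product of two vectors divided by the product of their lengths: cos(u, v) = (u · v) / (‖u‖ ‖v‖). The small worked example below uses only the standard library so the arithmetic stays visible; the three two-probe vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [0.9, 0.1]    # responds strongly to probe 1
b = [0.45, 0.05]  # same direction as a, half the magnitude
c = [0.1, 0.9]    # responds strongly to probe 2 instead

sim_ab = cosine(a, b)
sim_ac = cosine(a, c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

`a` and `b` score close to 1 even though `b` is weaker on every probe, because cosine similarity compares the pattern of responses and ignores overall magnitude. `a` and `c` score much lower because their strongest responses fall on different probes.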

In [None]:
from numpy.linalg import norm

def cosine_similarity(v1, v2):
    """Compute cosine similarity between two vectors (NaN if either has zero norm)."""
    denom = norm(v1) * norm(v2)
    return np.dot(v1, v2) / denom if denom > 0 else np.nan

# Build a documents × probes matrix so each column always refers to the same probe
doc_matrix = embeddings.pivot_table(
    index='document_id',
    columns='probe_id',
    values='numerical_value',
    aggfunc='first'
)

# Only include documents with enough probes
doc_matrix = doc_matrix[doc_matrix.notna().sum(axis=1) >= 3]

# Compute similarity matrix
doc_ids = list(doc_matrix.index)
n = len(doc_ids)
sim_matrix = np.zeros((n, n))

for i, doc_i in enumerate(doc_ids):
    for j, doc_j in enumerate(doc_ids):
        v1, v2 = doc_matrix.loc[doc_i], doc_matrix.loc[doc_j]
        # Compare only on probes that both documents have values for.
        # (Truncating to the shorter vector would pair up unrelated probes.)
        shared = v1.notna() & v2.notna()
        sim_matrix[i, j] = cosine_similarity(v1[shared].values, v2[shared].values)

# Visualize similarity matrix
fig, ax = plt.subplots(figsize=(10, 8))

# Get short titles
short_titles = []
for doc_id in doc_ids:
    title = embeddings[embeddings['document_id'] == doc_id]['document_title'].iloc[0]
    short_titles.append(title[:20] + '...' if len(title) > 20 else title)

sns.heatmap(sim_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            xticklabels=short_titles, yticklabels=short_titles,
            ax=ax, cbar_kws={'label': 'Cosine Similarity'})

ax.set_title('Document Similarity Matrix\n(Similarity = Alignment of Functor Values)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

## Summary

In this tutorial, we've seen:

1. **Numerical functors**: Documents can assign numbers (not just sets) to probes
2. **Probes as dimensions**: Each probe corresponds to an embedding dimension
3. **Embeddings as vectors**: A document's embedding is its vector of functor values
4. **Similarity**: Documents are similar when their functor values align

### The Key Equation

```
Embedding[i] = D(probe_i)
```

Each embedding dimension is a functor value — the document's numerical response to a probe.

### Next Tutorial

In Tutorial 6, we'll explore the **representable perspective** — special functors Hom(A, -) that capture the view from a single access point.

---

*Part of the [Category Theory & LLMs Series](https://github.com/buildLittleWorlds)*