# Tutorial 1: Introduction to Document Structure

**Course 3: Document Functors (Lorren Dray)**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/category-theory-document-functors/blob/main/notebooks/01_introduction_document_structure.ipynb)

---

## Overview

In this tutorial, we meet **Lorren Dray** and encounter the problem that motivated her life's work: the **multiple classification problem**.

### The Year 918 Observation

Working as a junior cataloger at the Capital Archives, Dray noticed something puzzling. The same document seemed to belong to multiple categories:

- A boundary survey report was simultaneously:
  - A geographic record (in the location index)
  - A personnel document (in the author index)
  - A technical manual (in the methodology index)
  - A historical source (in the date registry)

Which category was "correct"? None of them — and all of them. The document's identity depended on *how you accessed it*.

### Learning Goals

By the end of this tutorial, you will:

1. Understand the multiple classification problem
2. See how documents yield different observations through different access methods
3. Begin thinking about documents as *functions* that respond to inquiry

---

## Part 1: The Multiple Classification Problem

> "I have encountered a puzzling phenomenon in my cataloging work. The same document appears to belong to multiple categories depending on how one approaches it."
> — Lorren Dray, letter to her father (Year 920)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load datasets from Densworld repository
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/densworld-datasets/main/data/"

# Load document functor examples
documents = pd.read_csv(BASE_URL + "document_functor_examples.csv")
print(f"Loaded {len(documents)} document-access observations")
documents.head()
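
Every later cell reads specific columns from this table, so it can help to confirm they are present before going further. The following is a minimal sketch, not part of the dataset's tooling: the column list is an assumption inferred from the code cells in this tutorial, and `missing_columns` is a hypothetical helper name.

```python
# Columns the code cells in this tutorial read from document_functor_examples.csv.
# Assumption: inferred from the tutorial's own code, not from a published schema.
REQUIRED_COLUMNS = (
    "document_id",
    "document_title",
    "access_method",
    "observation_type",
    "observation_value",
    "functor_notation",
)

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in declaration order."""
    present = set(columns)
    return [name for name in required if name not in present]

# Stand-in header for demonstration only.
sample_header = ["document_id", "access_method", "observation_value"]
print(missing_columns(sample_header))
```

With the real data, `missing_columns(documents.columns)` should return an empty list if the assumed columns are all there.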

### Examining a Single Document Through Multiple Access Methods

Let's look at the first document in our dataset: the **Boundary Survey Report SW-6** — the famous report documenting the breathing phenomenon.

In [None]:
# Filter to just DOC-001
doc_001 = documents[documents['document_id'] == 'DOC-001']

print("Boundary Survey Report SW-6 as seen through different access methods:\n")
for _, row in doc_001.iterrows():
    print(f"Access Method: {row['access_method']}")
    print(f"  Observation Type: {row['observation_type']}")
    print(f"  Value: {row['observation_value']}")
    print()

### The Same Document, Four Different Faces

Notice how the same document yields completely different information depending on which access method we use:

| Access Method | What We See |
|---------------|-------------|
| `subject_catalog` | Topic keywords: boundaries, surveys, SW-sector |
| `author_index` | Contributor: torvun_kell |
| `date_registry` | Temporal range: 847-01-03 to 847-12-31 |
| `location_index` | Spatial reference: denstone_sectors.SW |

This is the **multiple classification problem**: which of these is the "true" identity of the document?

In [None]:
# Visualize access methods as a pie chart of "perspectives"
fig, ax = plt.subplots(1, 1, figsize=(8, 8))

access_methods = doc_001['access_method'].values
observation_values = doc_001['observation_value'].values

colors = plt.cm.Set3(np.linspace(0, 1, len(access_methods)))

# Equal wedges - each access method is equally valid
wedges, texts, autotexts = ax.pie(
    [1] * len(access_methods), 
    labels=access_methods,
    autopct=lambda pct: '',
    colors=colors,
    startangle=90
)

# Add observation values as annotations at the middle of each wedge
for wedge, obs in zip(wedges, observation_values):
    angle = (wedge.theta1 + wedge.theta2) / 2
    x = 0.6 * np.cos(np.radians(angle))
    y = 0.6 * np.sin(np.radians(angle))
    # Coerce to str (a value may be missing or non-string), then truncate long values
    obs = str(obs)
    obs_short = obs[:20] + '...' if len(obs) > 20 else obs
    ax.annotate(obs_short, xy=(x, y), ha='center', va='center', fontsize=8)

ax.set_title('DOC-001: Boundary Survey Report SW-6\nFour Access Methods, One Document', fontsize=12)
plt.tight_layout()
plt.show()

## Part 2: Dray's Insight

Dray's father, the logician Kellen Dray, responded to her puzzlement with a crucial observation:

> "Perhaps categories are not containers but perspectives."

This reframing was the seed of document functor theory. Instead of asking "which category is correct?", Dray began asking:

**What if a document IS the function that maps access methods to observations?**

In [None]:
# Load correspondence to see this exchange
correspondence = pd.read_csv(BASE_URL + "dray_correspondence.csv")

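# Find the early father-daughter exchange (letters dated Year 920)
# astype(str) keeps the filter working if dates were parsed as numbers or contain gaps
early_letters = correspondence[correspondence['date'].astype(str).str.startswith('920')]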

for _, letter in early_letters.iterrows():
    print(f"Date: {letter['date']}")
    print(f"From: {letter['sender']} → To: {letter['recipient']}")
    print(f"Subject: {letter['subject']}")
    print(f"\n\"{letter['excerpt']}\"\n")
    print("-" * 60 + "\n")

### The Functor Interpretation Preview

We can preview Dray's eventual insight by thinking of a document as a *function*:

```
Document: AccessMethod → Observations

DOC-001(subject_catalog) = {"boundaries", "surveys", "SW-sector"}
DOC-001(author_index) = {"torvun_kell"}
DOC-001(date_registry) = {"847-01-03 to 847-12-31"}
DOC-001(location_index) = {"denstone_sectors.SW"}
```

The document *is* this mapping. It has no single "true" category — only responses to different inquiries.
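
The mapping above can be sketched in plain Python without the dataset. This is an illustration only: `make_document` and `doc_001_fn` are hypothetical names, and the observation values are copied from the mapping shown above.

```python
from typing import Callable, Dict, FrozenSet, Optional

Observation = FrozenSet[str]

def make_document(table: Dict[str, Observation]) -> Callable[[str], Optional[Observation]]:
    """Wrap a lookup table so the document itself can be called like a function."""
    def document(access_method: str) -> Optional[Observation]:
        # An access method with no recorded observation yields None.
        return table.get(access_method)
    return document

# Values copied from the DOC-001 mapping in this section.
doc_001_fn = make_document({
    "subject_catalog": frozenset({"boundaries", "surveys", "SW-sector"}),
    "author_index": frozenset({"torvun_kell"}),
    "date_registry": frozenset({"847-01-03 to 847-12-31"}),
    "location_index": frozenset({"denstone_sectors.SW"}),
})

print(sorted(doc_001_fn("subject_catalog")))
print(doc_001_fn("methodology_index"))  # no observation recorded in this sketch
```

Here the document is nothing more than its responses: there is no field holding a "real" category, only a value for each access method that has one.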

In [None]:
def document_functor(doc_df, access_method):
    """
    Apply a document-as-functor to an access method.

    Returns the first observation recorded for that access method,
    or None if the document has no observation for it.
    """
    result = doc_df[doc_df['access_method'] == access_method]
    if result.empty:
        return None  # No observation for this access method
    return result['observation_value'].iloc[0]

# Test the functor interpretation
print("DOC-001 as a functor:")
print()

for method in ['subject_catalog', 'author_index', 'date_registry', 'location_index']:
    observation = document_functor(doc_001, method)
    print(f"DOC-001({method}) = {observation}")
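
`document_functor` returns a single value, while the preview notation earlier writes each observation as a set. If one access method can have several rows, a set-valued lookup may be closer to that notation. The sketch below assumes rows shaped like the output of `doc_001.to_dict('records')`; `all_observations` is a hypothetical helper, and the sample rows are illustrative rather than copied from the CSV.

```python
def all_observations(rows, access_method):
    """Collect every observation value recorded for one access method.

    `rows` is an iterable of dicts with 'access_method' and
    'observation_value' keys, e.g. doc_001.to_dict('records').
    """
    return {
        row["observation_value"]
        for row in rows
        if row["access_method"] == access_method
    }

# Illustrative rows only: the real CSV may store all keywords in a single row.
sample_rows = [
    {"access_method": "subject_catalog", "observation_value": "boundaries"},
    {"access_method": "subject_catalog", "observation_value": "surveys"},
    {"access_method": "author_index", "observation_value": "torvun_kell"},
]

print(sorted(all_observations(sample_rows, "subject_catalog")))
print(all_observations(sample_rows, "date_registry"))
```

An access method with no rows gives an empty set instead of `None`, so callers can treat every result uniformly.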

## Part 3: Comparing Multiple Documents

The functor interpretation becomes more useful when we compare multiple documents: each document is a *different function* on the same set of access methods, so any two documents can be compared one access method at a time.

In [None]:
# Get unique documents
unique_docs = documents['document_id'].unique()[:5]  # First 5 documents

# Compare how they respond to subject_catalog access
print("How different documents respond to subject_catalog access:\n")

for doc_id in unique_docs:
    doc_df = documents[documents['document_id'] == doc_id]
    doc_title = doc_df['document_title'].iloc[0]
    observation = document_functor(doc_df, 'subject_catalog')
    print(f"{doc_id}: {doc_title}")
    print(f"  → {observation}")
    print()

In [None]:
# Heatmap: documents × access methods
# Create a pivot table showing which documents have which access methods

pivot = documents.pivot_table(
    index='document_title',
    columns='access_method',
    values='observation_value',
    aggfunc='first'
)

# Create presence/absence matrix (1 if observation exists, 0 otherwise)
presence = pivot.notna().astype(int)

fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(presence, cmap='YlGnBu', cbar_kws={'label': 'Observation Present'},
            linewidths=0.5, ax=ax)

ax.set_title('Document × Access Method Matrix\n(Which documents have which access method observations?)', fontsize=12)
ax.set_xlabel('Access Method')
ax.set_ylabel('Document')

plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
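
The heatmap shows at a glance which columns are fully filled. The same question can be answered directly: which access methods does every document respond to? The sketch below is one possible approach using plain dictionaries and sets. The function names are hypothetical, and the rows use made-up identifiers (`DOC-A`, `DOC-B`) that are not taken from the dataset.

```python
from collections import defaultdict

def access_methods_by_document(rows):
    """Map each document_id to the set of access methods it has observations for."""
    methods = defaultdict(set)
    for row in rows:
        methods[row["document_id"]].add(row["access_method"])
    return dict(methods)

def shared_access_methods(rows):
    """Return the access methods that every document responds to."""
    per_document = list(access_methods_by_document(rows).values())
    if not per_document:
        return set()
    return set.intersection(*per_document)

# Hypothetical rows for illustration; DOC-A and DOC-B are not dataset identifiers.
matrix_rows = [
    {"document_id": "DOC-A", "access_method": "subject_catalog"},
    {"document_id": "DOC-A", "access_method": "author_index"},
    {"document_id": "DOC-A", "access_method": "date_registry"},
    {"document_id": "DOC-B", "access_method": "subject_catalog"},
    {"document_id": "DOC-B", "access_method": "date_registry"},
]

print(sorted(shared_access_methods(matrix_rows)))
```

The shared access methods are the ones on which all documents can be compared as functions; any other method is defined for only some of them.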

## Part 4: The Notation

Dray developed a notation for document functors that would become standard in the Capital Archives:

```
F(A) = observations when document F is accessed via method A
```

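The argument `A` can also name a specific entry within an index, such as a subject, an author, or a year. For example: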
- `F('Boundaries')` — what the document reveals about boundaries
- `F('Kell')` — what the document reveals about Kell
- `F('Year 847')` — what the document reveals about Year 847

In [None]:
# Show the functor notation from the dataset
print("Dray's functor notation for DOC-001:\n")

for _, row in doc_001.iterrows():
    print(f"{row['functor_notation']} = {row['observation_value']}")

## Summary

In this tutorial, we've seen:

1. **The Multiple Classification Problem**: A document has no single true category
2. **Access-Dependent Observations**: The same document yields different information through different access methods
3. **Documents as Functions**: A document can be thought of as a function mapping access methods to observations
4. **Dray's Notation**: F(A) represents what document F reveals when accessed via method A

### Key Quote

> "A document is not a thing but a way of responding to inquiry. Change the method of access, and the document reveals different aspects of itself."
> — Lorren Dray

### Next Tutorial

In Tutorial 2, we'll develop the categorical structure of the Archive itself — seeing access methods as objects and document flows as morphisms.

---

*Part of the [Category Theory & LLMs Series](https://github.com/buildLittleWorlds)*