# Tutorial 8: The Language of Dens
## Language as a Category

---

### The Linguistic Turn

*To the research assistant:*

*In Year 899, Tessery Vold turned her relational framework toward an unexpected domain: language itself.*

*Vold observed that words and phrases form a natural category. Objects are linguistic expressions—words, phrases, sentences. Morphisms are containment relationships: "stakdur" is contained in "the stakdur," which is contained in "the stakdur hunts."*

*This seemingly simple observation has profound implications. If language is a category, then linguistic meaning can be understood categorically. A word's meaning is not an intrinsic property—it is determined by how the word relates to other words. This is Vold's Probing Theorem applied to language.*

*The Capital linguists dismissed this as reductionist. How can meaning arise from mere containment? But Vold countered: "You know a word by the company it keeps. This is not metaphor. This is mathematics."*

*Your task: Analyze linguistic containment data to understand the categorical structure of Densworld language. Determine whether meaning can indeed be captured by containment relationships.*

—*Archive Review Committee, Year 934*

---

## What You Will Learn

In this tutorial, you will learn to:

1. Model **language as a category** with words as objects and containment as morphisms
2. Understand **substring relations** and their categorical properties
3. Build **phrase graphs** representing linguistic structure
4. Connect linguistic categories to **distributional semantics**
5. See how this relates to modern NLP and language models

By the end, you will understand:
- Why language has natural categorical structure
- How "meaning is use" connects to the Yoneda perspective
- The foundation of word embeddings in categorical terms

---

## The Language Category

Consider language as a category **L**:

- **Objects**: Strings (words, phrases, sentences)
- **Morphisms**: Containment relations
  - There is a morphism s → t if s is a substring of t
  - Or more generally: s appears in the context of t
- **Identity**: Every string contains itself
- **Composition**: If s ⊆ t and t ⊆ u, then s ⊆ u

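This is a **pre-order category** (also called a *thin* category): there is at most one morphism between any two objects. For literal substring containment the relation is also antisymmetric, since two strings that contain each other must be equal, so it is in fact a partial order.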
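The axioms listed above can be checked directly on a handful of strings. The following is a minimal sketch, assuming literal substring containment; the phrase list and the helper name `has_morphism` are hypothetical and are not part of the Densworld dataset.

```python
from itertools import product

# Hypothetical toy phrases; not drawn from the Densworld dataset.
phrases = ["stakdur", "the stakdur", "the stakdur hunts", "passage"]

def has_morphism(s, t):
    """There is a morphism s -> t exactly when s is a substring of t."""
    return s in t

# Identity: every phrase contains itself.
reflexive = all(has_morphism(p, p) for p in phrases)

# Composition: s -> t and t -> u together imply s -> u.
transitive = all(
    has_morphism(s, u)
    for s, t, u in product(phrases, repeat=3)
    if has_morphism(s, t) and has_morphism(t, u)
)

# Antisymmetry: mutual containment forces equality (a partial order).
antisymmetric = all(
    s == t
    for s, t in product(phrases, repeat=2)
    if has_morphism(s, t) and has_morphism(t, s)
)

print(reflexive, transitive, antisymmetric)
```

Note that Python's `in` tests character-level containment, so "form" would count as contained in "information". A word-level notion of containment would need to compare token sequences instead.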

### Vold's Interpretation

> *"The word 'stakdur' means nothing in isolation. But place it in contexts: 'the stakdur,' 'stakdur territory,' 'the stakdur hunts'—and meaning emerges. The word is defined by what contains it."*
> — Tessery Vold, "The Language of Dens," Year 899

---

## Part 1: Loading Linguistic Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from collections import defaultdict

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('deep')

print("Libraries loaded. Ready to study the Language of Dens.")

In [None]:
# Load linguistic containment data
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/densworld-datasets/main/data/"

containment = pd.read_csv(BASE_URL + "linguistic_containment.csv")

print(f"Linguistic containment records: {len(containment)}")
print(f"\nColumns: {list(containment.columns)}")

In [None]:
# View sample data
containment.head(10)

Each row represents a **morphism** in the language category:
- `shorter_phrase` → `longer_phrase` (containment)
- `containment_type`: How the phrase is extended (prefix, compound, modifier, etc.)
- `frequency_score`: How often this containment appears in the corpus
- `context_probability`: Probability of the longer phrase given the shorter
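Before building the category, it can help to sanity-check rows against the column meanings above. The sketch below uses plain dictionaries with invented values; the rows, the `loose_context` label, and the helper names are all hypothetical. It flags rows where the shorter phrase is not a literal substring of the longer one (which the more general "appears in the context of" reading would allow) and rows whose probability falls outside [0, 1].

```python
# Hypothetical rows mirroring the column layout described above.
# All values are invented for illustration and are not from the dataset.
rows = [
    {"shorter_phrase": "passage", "longer_phrase": "passage diagram",
     "containment_type": "compound_formation",
     "frequency_score": 12, "context_probability": 0.40},
    {"shorter_phrase": "stakdur", "longer_phrase": "apex stakdur",
     "containment_type": "modifier_prefix",
     "frequency_score": 5, "context_probability": 0.25},
    {"shorter_phrase": "stakdur", "longer_phrase": "the apex hunter",
     "containment_type": "loose_context",  # hypothetical label
     "frequency_score": 2, "context_probability": 0.10},
]

def is_literal_substring(row):
    """True when the shorter phrase literally occurs inside the longer one."""
    return row["shorter_phrase"] in row["longer_phrase"]

def has_valid_probability(row):
    """True when the probability lies in the closed interval [0, 1]."""
    return 0.0 <= row["context_probability"] <= 1.0

non_literal = [r for r in rows if not is_literal_substring(r)]
out_of_range = [r for r in rows if not has_valid_probability(r)]

print(f"Rows that are not literal substrings: {len(non_literal)}")
print(f"Rows with out-of-range probability: {len(out_of_range)}")
```

The same two checks could be applied to the loaded `containment` DataFrame row by row; whether non-literal rows are errors or intended "context" morphisms depends on how the dataset was built.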

---

## Part 2: Building the Language Category

Let's construct the category from the containment data.

In [None]:
# Extract all phrases (objects in the category)
shorter_phrases = set(containment['shorter_phrase'])
longer_phrases = set(containment['longer_phrase'])
all_phrases = shorter_phrases | longer_phrases

print(f"Shorter phrases: {len(shorter_phrases)}")
print(f"Longer phrases: {len(longer_phrases)}")
print(f"Total unique phrases (objects): {len(all_phrases)}")

In [None]:
# Build the containment graph (morphisms)
G_lang = nx.DiGraph()

# Add nodes
G_lang.add_nodes_from(all_phrases)

# Add edges (morphisms): shorter → longer
for _, row in containment.iterrows():
    G_lang.add_edge(
        row['shorter_phrase'], 
        row['longer_phrase'],
        containment_type=row['containment_type'],
        frequency=row['frequency_score'],
        probability=row['context_probability']
    )

print(f"Language category constructed:")
print(f"  Objects (phrases): {G_lang.number_of_nodes()}")
print(f"  Morphisms (containments): {G_lang.number_of_edges()}")

In [None]:
# Verify pre-order property: at most one morphism between any pair.
# A DiGraph silently merges repeated edges, so the check must use the raw rows.
pair_counts = containment.groupby(['shorter_phrase', 'longer_phrase']).size()
multi_edges = pair_counts[pair_counts > 1]

if len(multi_edges) > 0:
    print(f"Warning: {len(multi_edges)} pairs appear in multiple rows")
else:
    print("Pre-order verified: at most one morphism between any pair")

---

## Part 3: Analyzing Containment Structure

Let's explore how words relate to their contexts.

In [None]:
# For each word, find all phrases that contain it
def get_containing_phrases(word, G):
    """Get all phrases that contain the given word (morphisms from word)."""
    return list(G.successors(word))

# Analyze key words
key_words = ['stakdur', 'passage', 'form', 'archive', 'morphism']

print("Containing phrases for key words:")
print("=" * 50)

for word in key_words:
    if word in G_lang:
        containers = get_containing_phrases(word, G_lang)
        print(f"\n'{word}' is contained in {len(containers)} phrases:")
        for phrase in containers[:5]:
            print(f"    → '{phrase}'")
        if len(containers) > 5:
            print(f"    ... and {len(containers) - 5} more")

In [None]:
# Containment type distribution
type_counts = containment['containment_type'].value_counts()

fig, ax = plt.subplots(figsize=(12, 5))
type_counts.plot(kind='barh', ax=ax, color='steelblue', alpha=0.7)
ax.set_xlabel('Count')
ax.set_ylabel('Containment Type')
ax.set_title('Distribution of Containment Types in the Language Category')
plt.tight_layout()
plt.show()

Different containment types reveal how language builds structure:
- **compound_formation**: "passage" → "passage diagram"
- **modifier_prefix**: "stakdur" → "apex stakdur"
- **prefix_context**: "stakdur" → "the stakdur"

---

## Part 4: The Yoneda Perspective on Language

Applying the Probing Theorem: a word is determined by what contains it.

Hom(word, -) = {phrases that contain word}
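In a pre-order, Hom(word, -) amounts to the set of objects reachable from the word, including the word itself via the identity. The sketch below computes that set on a toy edge list and checks that no two objects share the same profile, which is the order-theoretic form of the claim that a word is determined by what contains it. The edge list and the helper name `hom_from` are hypothetical.

```python
# Hypothetical toy containment edges (shorter -> longer), for illustration only.
edges = [
    ("stakdur", "the stakdur"),
    ("the stakdur", "the stakdur hunts"),
    ("stakdur", "stakdur territory"),
    ("passage", "passage diagram"),
]

def hom_from(word, edges):
    """Objects reachable from `word`, including itself (the identity morphism)."""
    reached = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for src, dst in edges:
            if src == current and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return reached

objects = {x for edge in edges for x in edge}
profiles = {obj: frozenset(hom_from(obj, edges)) for obj in objects}

# In a partial order, distinct objects never share the same profile.
distinct = len(set(profiles.values())) == len(objects)

print(sorted(profiles["stakdur"]))
print(distinct)
```

`G.successors(word)` in the notebook returns only the direct containers recorded in the data, whereas this profile also follows chains of containment.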

In [None]:
def language_embedding(word, G, containment_df):
    """
    Build a 'contextual embedding' for a word based on what contains it.
    Returns: (container_phrases, containment_types, probabilities)
    """
    if word not in G:
        return [], [], []
    
    containers = []
    types = []
    probs = []
    
    for _, row in containment_df[containment_df['shorter_phrase'] == word].iterrows():
        containers.append(row['longer_phrase'])
        types.append(row['containment_type'])
        probs.append(row['context_probability'])
    
    return containers, types, probs

# Compare embeddings for different words
words_to_compare = ['stakdur', 'grimslew', 'passage', 'form']

print("Language embeddings (what contains each word):")
print("=" * 60)

for word in words_to_compare:
    containers, types, probs = language_embedding(word, G_lang, containment)
    print(f"\n'{word}':")
    for c, t, p in zip(containers[:4], types[:4], probs[:4]):
        print(f"    → '{c}' ({t}, p={p:.2f})")

In [None]:
# Build feature vectors for words based on containment types
# Loosely analogous to a word embedding, but hand-built from containment types rather than learned

all_types = list(containment['containment_type'].unique())
words_with_contexts = list(shorter_phrases)

# Feature matrix: rows = words, columns = containment types
# Value = average probability for that type
n_words = len(words_with_contexts)
n_types = len(all_types)

feature_matrix = np.zeros((n_words, n_types))

for i, word in enumerate(words_with_contexts):
    word_data = containment[containment['shorter_phrase'] == word]
    for j, ctype in enumerate(all_types):
        type_data = word_data[word_data['containment_type'] == ctype]
        if len(type_data) > 0:
            feature_matrix[i, j] = type_data['context_probability'].mean()

print(f"Language embedding matrix: {feature_matrix.shape}")
print(f"  Rows (words): {n_words}")
print(f"  Columns (containment types): {n_types}")

In [None]:
# Compute similarity between words based on their containment patterns
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity normalizes each row itself; all-zero rows get similarity 0
word_sim = cosine_similarity(feature_matrix)

# Find most similar word pairs
similar_pairs = []
for i in range(n_words):
    for j in range(i+1, n_words):
        if word_sim[i, j] > 0.1:
            similar_pairs.append((words_with_contexts[i], words_with_contexts[j], word_sim[i, j]))

similar_pairs.sort(key=lambda x: -x[2])

print("Most similar words (by containment pattern):")
print("=" * 50)
for w1, w2, sim in similar_pairs[:10]:
    print(f"  '{w1}' ↔ '{w2}': {sim:.3f}")

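Words with similar containment patterns tend to be semantically related. This is the distributional hypothesis stated in categorical terms. Note that the features here record containment *types*, not the specific containing phrases, so a high score means two words are extended in similar ways rather than by the same phrases.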
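To make the similarity score concrete, here is a minimal sketch of cosine similarity on three toy feature vectors. The vectors and the column order are hypothetical. It illustrates that cosine similarity compares direction only: two words with the same mix of containment types score close to 1 even when their probabilities differ in scale, and words with no types in common score 0.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical columns: [prefix_context, modifier_prefix, compound_formation]
word_a = [0.6, 0.3, 0.0]
word_b = [0.3, 0.15, 0.0]  # same mix of types as word_a, half the magnitude
word_c = [0.0, 0.0, 0.8]   # only compound formation

print(f"a vs b: {cosine(word_a, word_b):.3f}")
print(f"a vs c: {cosine(word_a, word_c):.3f}")
```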

---

## Part 5: Visualizing the Language Category

In [None]:
# Focus on a subset: the "stakdur" context graph
stakdur_related = set(['stakdur'])

# Add immediate neighbors
for phrase in list(stakdur_related):
    if phrase in G_lang:
        stakdur_related.update(G_lang.successors(phrase))
        stakdur_related.update(G_lang.predecessors(phrase))

G_stakdur = G_lang.subgraph(stakdur_related).copy()

print(f"Stakdur context subgraph: {G_stakdur.number_of_nodes()} nodes, {G_stakdur.number_of_edges()} edges")

In [None]:
# Visualize
fig, ax = plt.subplots(figsize=(14, 10))

pos = nx.spring_layout(G_stakdur, k=2, iterations=50, seed=42)

# Color nodes by type
node_colors = ['gold' if node == 'stakdur' else 'lightblue' for node in G_stakdur.nodes()]

nx.draw_networkx_nodes(G_stakdur, pos, node_color=node_colors, node_size=2000, ax=ax)
nx.draw_networkx_labels(G_stakdur, pos, font_size=9, ax=ax)
nx.draw_networkx_edges(G_stakdur, pos, edge_color='gray', arrows=True, 
                        arrowsize=15, connectionstyle='arc3,rad=0.1', ax=ax)

ax.set_title('Language Category: Stakdur Context Graph\n(Edges = containment)', fontsize=12)
ax.axis('off')
plt.tight_layout()
plt.show()

The graph shows how "stakdur" is embedded in larger linguistic structures. Each edge is a morphism in the language category.

---

## Part 6: Composition and Transitivity

In a category, morphisms compose. In the language category:
- If "stakdur" ⊆ "stakdur territory" and "stakdur territory" ⊆ "the stakdur territory"
- Then "stakdur" ⊆ "the stakdur territory" (composition)

In [None]:
# Find composition chains (paths with at least two edges)
def find_composition_chains(G, max_length=3):
    """Find all simple paths with 2+ edges (composable morphisms)."""
    chains = []
    for source in G.nodes():
        for target in G.nodes():
            if source != target:
                # all_simple_paths yields nothing when no path exists
                for path in nx.all_simple_paths(G, source, target, cutoff=max_length):
                    if len(path) > 2:  # More than just source → target
                        chains.append(path)
    return chains

# Find chains in the stakdur subgraph
chains = find_composition_chains(G_stakdur, max_length=3)

print(f"Found {len(chains)} composition chains:")
print("=" * 50)
for chain in chains[:10]:
    print(f"  {' → '.join(chain)}")

In [None]:
# Compute transitive closure
G_closure = nx.transitive_closure(G_lang, reflexive=False)

print(f"Original language category: {G_lang.number_of_edges()} morphisms")
print(f"Transitive closure: {G_closure.number_of_edges()} morphisms")
print(f"Implicit morphisms (from composition): {G_closure.number_of_edges() - G_lang.number_of_edges()}")

The transitive closure reveals all the implicit containment relationships. These are morphisms that exist through composition.
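The example from the start of this part can be worked through by hand. The following sketch computes a transitive closure in plain Python on just those two edges, skipping self-loops to mirror `reflexive=False`. The function name `transitive_closure` here is a local helper, not the networkx function.

```python
# The two containments from the example at the start of this part.
edges = {
    ("stakdur", "stakdur territory"),
    ("stakdur territory", "the stakdur territory"),
}

def transitive_closure(edges):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

closure = transitive_closure(edges)
implicit = closure - edges  # morphisms that exist only through composition

print(implicit)
```

Here the only implicit morphism is the composite from "stakdur" to "the stakdur territory".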

---

## Part 7: Context Probability and Meaning

The `context_probability` column captures how likely a longer phrase (the context) is, given the shorter phrase it contains. Conditional probabilities of this kind are the foundation of statistical language modeling.
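One common way to obtain such a conditional probability is to normalize counts per shorter phrase. The sketch below shows that calculation on invented counts. Whether the dataset's `context_probability` was actually derived from `frequency_score` in this way is an assumption, not something the data description confirms.

```python
from collections import defaultdict

# Hypothetical (shorter, longer, count) triples; values are invented.
counts = [
    ("stakdur", "the stakdur", 6),
    ("stakdur", "apex stakdur", 3),
    ("stakdur", "stakdur territory", 1),
    ("passage", "passage diagram", 4),
]

# Total count of all contexts observed for each shorter phrase.
totals = defaultdict(int)
for shorter, _, n in counts:
    totals[shorter] += n

# P(longer | shorter) = count(shorter, longer) / total count for shorter.
cond_prob = {
    (shorter, longer): n / totals[shorter]
    for shorter, longer, n in counts
}

for (shorter, longer), p in cond_prob.items():
    print(f"P('{longer}' | '{shorter}') = {p:.2f}")
```

Under this construction the probabilities for each shorter phrase sum to 1, which is a property worth checking against the real column before relying on it.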

In [None]:
# Analyze context probabilities
print("Context probability statistics:")
print(containment['context_probability'].describe())

In [None]:
# Which containments have highest probability?
high_prob = containment.nlargest(10, 'context_probability')

print("Highest probability containments:")
print("=" * 50)
for _, row in high_prob.iterrows():
    print(f"  '{row['shorter_phrase']}' → '{row['longer_phrase']}' (p={row['context_probability']:.2f})")

In [None]:
# Visualize probability distribution by containment type
fig, ax = plt.subplots(figsize=(12, 6))

containment.boxplot(column='context_probability', by='containment_type', ax=ax)
ax.set_title('Context Probability by Containment Type')
ax.set_xlabel('Containment Type')
ax.set_ylabel('Context Probability')
plt.suptitle('')  # Remove automatic title
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Different containment types can have different probability distributions. If the technical terms (Vold's terminology) show consistently high probabilities in the plot, that indicates they reliably appear in the same fixed phrases.

---

## Part 8: The ML Connection — Distributional Semantics

The categorical view of language directly connects to:

### 1. Word2Vec
"Predict a word from its context" (CBOW) or "predict the context from a word" (skip-gram) = Learn the containment structure

### 2. BERT
"Predict masked words" = Learn Hom(word, context) relationships

### 3. GPT
"Predict next word" = Learn morphisms in the language category

In [None]:
# Simulate word2vec-style "context prediction"
# Given a word, predict its probable contexts

def predict_contexts(word, containment_df, top_k=5):
    """Predict most likely contexts for a word."""
    word_data = containment_df[containment_df['shorter_phrase'] == word]
    if len(word_data) == 0:
        return []
    
    sorted_data = word_data.sort_values('context_probability', ascending=False)
    return list(sorted_data[['longer_phrase', 'context_probability']].head(top_k).itertuples(index=False, name=None))

# Test predictions
test_words = ['passage', 'stakdur', 'theorem', 'archive']

print("Context prediction (like word2vec):")
print("=" * 50)

for word in test_words:
    predictions = predict_contexts(word, containment)
    print(f"\n'{word}' → predicted contexts:")
    for context, prob in predictions:
        print(f"    '{context}' (p={prob:.2f})")

In [None]:
# Build a simple "language model" that predicts containment probability
# This is a categorical language model!

def categorical_lm_probability(word, context, containment_df):
    """
    P(context | word) in the language category.
    Returns the probability if the containment exists, 0 otherwise.
    """
    match = containment_df[
        (containment_df['shorter_phrase'] == word) & 
        (containment_df['longer_phrase'] == context)
    ]
    if len(match) > 0:
        return match['context_probability'].values[0]
    return 0.0

# Test the categorical LM
test_pairs = [
    ('passage', 'passage diagram'),
    ('passage', 'creature passage'),
    ('stakdur', 'the stakdur'),
    ('stakdur', 'apex stakdur'),
    ('theorem', 'probing theorem')
]

print("Categorical language model probabilities:")
print("=" * 50)
for word, context in test_pairs:
    prob = categorical_lm_probability(word, context, containment)
    print(f"  P('{context}' | '{word}') = {prob:.2f}")

This is a minimal categorical language model: it looks up the probability of a morphism (containment) in the language category and assigns 0 to any pair not recorded in the data.

---

## Part 9: From Containment to Meaning

Vold's insight: meaning emerges from containment patterns.

In [None]:
# Cluster words by their containment profiles
from sklearn.cluster import KMeans

# Use the feature matrix we built earlier
n_clusters = 4

if feature_matrix.sum() > 0:  # Check for non-empty features
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(feature_matrix)
    
    print(f"Clustered {len(words_with_contexts)} words into {n_clusters} groups:")
    print("=" * 50)
    
    for i in range(n_clusters):
        cluster_words = [w for w, c in zip(words_with_contexts, clusters) if c == i]
        print(f"\nCluster {i}: {cluster_words}")
else:
    print("Feature matrix is empty; skipping clustering")

In [None]:
# Analyze what distinguishes clusters
if 'clusters' in globals():  # only defined if the clustering cell ran
    print("Cluster characteristics (average probability by containment type):")
    print("=" * 60)
    
    for i in range(n_clusters):
        mask = clusters == i
        cluster_features = feature_matrix[mask].mean(axis=0)
        
        print(f"\nCluster {i}:")
        top_types = sorted(zip(all_types, cluster_features), key=lambda x: -x[1])[:3]
        for ctype, avg in top_types:
            if avg > 0:
                print(f"    {ctype}: {avg:.2f}")

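Words group by their containment patterns. Because k-means returns `n_clusters` groups for any input, inspect the groups to judge whether they are semantically coherent. Coherent groups would be the semantic clustering from categorical structure that Vold predicted.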

---

## Exercises

### Exercise 1: Morphism Chains

Find the longest containment chain starting from the word "form". What does this reveal about how "form" is used in Densworld texts?

In [None]:
# Your code here
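# Hint: nx.dag_longest_path searches a whole DAG, so apply it to the subgraph
# induced by 'form' and nx.descendants(G_lang, 'form'), or find paths manually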

### Exercise 2: Context Overlap

Find pairs of words that share the same containing phrases. These words may be semantically related.

In [None]:
# Your code here
# Hint: Compare successor sets in the graph

### Exercise 3: Probability Prediction

Build a simple model that predicts context probability from containment type. Does the type strongly predict the probability?

In [None]:
# Your code here
# Hint: Group by containment_type and compute mean probability

### Exercise 4: The Vold Vocabulary

Vold introduced specific terminology (passage, morphism, coherent shift, etc.). Analyze how her terms cluster compared to everyday Densworld words.

In [None]:
# Your code here
# Hint: Filter for Vold-related terms and compare their embeddings

---

## Discussion Questions

1. Vold claims meaning emerges from containment. But doesn't containment presuppose meaning? Can the categorical approach really ground semantics?

2. Modern language models (GPT, etc.) learn statistical patterns over text. Is this fundamentally categorical? What would Vold say about transformer architectures?

3. The language category is a pre-order (at most one morphism between any pair). What structure is lost compared to a richer category?

---

## Summary

In this tutorial, you learned:

| Concept | What You Learned |
|---------|------------------|
| Language category | Objects = phrases, morphisms = containment |
| Pre-order structure | At most one morphism between any pair |
| Contextual meaning | Words are defined by what contains them |
| Categorical LM | Probability of morphisms as language model |
| Distributional semantics | "Know a word by the company it keeps" |

| Skill | Code Pattern |
|-------|--------------|
| Build containment graph | `nx.DiGraph()`, add edges for containment |
| Find contexts | `G.successors(word)` |
| Compute word similarity | Cosine similarity on containment features |
| Predict contexts | Sort by probability |

---

## Next Tutorial

In **Tutorial 9: Weighted Passages**, you will learn about **enriched categories**:

- Morphisms decorated with values (probabilities, costs, weights)
- How enrichment changes categorical reasoning
- Connection to weighted graphs and probabilistic models
- The categorical foundation of neural attention

> *"A passage is not merely a yes or no. Some passages are likely, others rare. Some paths are well-worn, others barely visible. To capture this, we must enrich our category with degrees of passability."*
> — Tessery Vold, "Weighted Passages," Year 901

---

## Credits

**Source Material:** Tai-Danae Bradley, "Category Theory and Language Models" (Cartesian Cafe)

**Densworld Integration:** The Relational Foundations course applies categorical concepts through the framework of Tessery Vold.

**Learn more:** [buildLittleWorlds](https://github.com/buildLittleWorlds)

---

> *"The linguists say: 'You know a word by the company it keeps.' But I say: you know a word by the passages it admits. What phrases contain it? What sentences extend it? The word IS the pattern of its containments. This is not metaphor. This is the mathematics of meaning."*
> — Tessery Vold, "The Language of Dens," Year 899