# Tutorial 9: The Map of Ideas

## The Capital Archives — A Course in Natural Language Processing

---

*The archive is vast and disorganized. Manuscripts are shelved by acquisition date, not by subject. Finding all texts about a particular topic requires searching through thousands of documents.*

*"What if," the Chief asks, "we could create a map? Not a physical map, but an intellectual one. A way to see how ideas cluster, how topics connect, how the archive's holdings relate to each other?"*

*This is the problem of **topic modeling**: discovering the hidden thematic structure in a collection of documents.*

---

In this tutorial, you will learn:
- Topic modeling concepts
- Latent Dirichlet Allocation (LDA)
- Interpreting and visualizing topics
- Using topics to understand document collections

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab
# Skip this cell if running locally

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    
    # Install/download NLTK data
    import nltk
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('stopwords', quiet=True)
    print("✓ Repository cloned and NLTK data downloaded!")
else:
    print("✓ Running locally - no setup needed")

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# NLP
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Topic modeling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded.")

In [None]:
# Load corpus
manuscripts = pd.read_csv('manuscripts.csv')
texts = pd.read_csv('manuscript_texts.csv')

corpus = texts.groupby('manuscript_id').agg(
    text=('text', ' '.join)
).reset_index()

corpus = corpus.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre']],
    on='manuscript_id', how='left'
)

print(f"Loaded {len(corpus)} documents")

## 9.1 What is Topic Modeling?

**Topic modeling** is an unsupervised technique for discovering abstract "topics" in a collection of documents.

### Key Concepts

- A **topic** is a distribution over words (some words are more likely than others)
- A **document** is a mixture of topics
- The algorithm discovers both the topics AND how they're mixed in each document
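
These ideas can be combined into one formula. As a sketch, using notation that is assumed here rather than taken from the tutorial ($\theta_{d,k}$ for the share of topic $k$ in document $d$, $\beta_{k,w}$ for the probability of word $w$ under topic $k$, and $K$ for the number of topics), the probability of seeing word $w$ in document $d$ is a weighted average over topics:

$$
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\beta_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w} \beta_{k,w} = 1 .
$$

For example, if a document is 70% topic A and 30% topic B, and the word "river" has probability 0.10 under topic A and 0.02 under topic B, then $p(\text{river} \mid d) = 0.7 \times 0.10 + 0.3 \times 0.02 = 0.076$.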

### LDA (Latent Dirichlet Allocation)

LDA assumes:
1. There are K topics (you choose K)
2. Each document is a mixture of these K topics
3. Each word occurrence in a document is drawn from exactly one topic, chosen according to that document's mixture
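
The assumptions above describe a recipe for *generating* documents, and LDA works by running that recipe in reverse. The following is a minimal sketch of the forward direction only. The vocabulary, the Dirichlet parameter `0.5`, and the function name `generate_document` are all invented for illustration and are not part of the archive corpus or of scikit-learn.

```python
import numpy as np

def generate_document(topic_word, doc_topic, n_words, rng):
    """Sample word indices for one document following LDA's generative story.

    topic_word: (K, V) array, each row is one topic's distribution over V words.
    doc_topic:  (K,) array, this document's mixture over the K topics.
    """
    n_topics, vocab_size = topic_word.shape
    words = []
    for _ in range(n_words):
        k = rng.choice(n_topics, p=doc_topic)        # choose a topic for this word
        w = rng.choice(vocab_size, p=topic_word[k])  # choose a word from that topic
        words.append(int(w))
    return words

rng = np.random.default_rng(0)
vocab = ["river", "water", "stone", "wall", "debate", "argument"]  # hypothetical vocabulary

# Draw 3 topics (rows sum to 1) and one document's topic mixture from Dirichlet priors
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=3)
doc_topic = rng.dirichlet(np.full(3, 0.5))

doc = generate_document(topic_word, doc_topic, n_words=12, rng=rng)
print("Topic mixture:", np.round(doc_topic, 2))
print("Generated words:", [vocab[i] for i in doc])
```

Fitting LDA means going the other way: given only the observed words, estimate `topic_word` and each document's `doc_topic`.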

## 9.2 Preparing Data for Topic Modeling

In [None]:
# Create document-term matrix using CountVectorizer
# LDA works with raw counts, not TF-IDF

vectorizer = CountVectorizer(
    max_df=0.95,           # Ignore terms in >95% of documents
    min_df=2,              # Ignore terms in <2 documents
    max_features=1000,     # Keep only the 1000 most frequent terms in the corpus
    stop_words='english',  # Remove stopwords
    ngram_range=(1, 1)     # Unigrams only
)

doc_term_matrix = vectorizer.fit_transform(corpus['text'])
feature_names = vectorizer.get_feature_names_out()

print(f"Document-term matrix shape: {doc_term_matrix.shape}")
print(f"  {doc_term_matrix.shape[0]} documents")
print(f"  {doc_term_matrix.shape[1]} terms")

## 9.3 Running LDA

In [None]:
# Choose number of topics
n_topics = 5  # Start small, adjust based on results

# Fit LDA model
lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=20,
    learning_method='online'
)

doc_topics = lda.fit_transform(doc_term_matrix)

print(f"Fitted LDA with {n_topics} topics")
print(f"Document-topic matrix shape: {doc_topics.shape}")

In [None]:
# Display the topics
def display_topics(model, feature_names, n_top_words=10):
    """
    Display the top words for each topic.
    """
    topics = []
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]
        topics.append(top_words)
        print(f"\nTopic {topic_idx}:")
        print(f"  {', '.join(top_words)}")
    return topics

topic_words = display_topics(lda, feature_names)

## 9.4 Interpreting Topics

The model describes each topic only as a weighted list of words; it does not name the topics. Deciding what each one means is up to us.

In [None]:
# Let's try to name the topics based on their top words
# (This is subjective - you might interpret differently)

# After looking at the words, assign labels
# For now, use generic labels
topic_labels = [f"Topic {i}" for i in range(n_topics)]

# You can update these after inspecting the topics:
# topic_labels = ['Philosophy of Language', 'Expeditions', 'Debates', 'Mirado/Water', 'Stone School']

print("Topic labels (update based on your interpretation):")
for i, (label, words) in enumerate(zip(topic_labels, topic_words)):
    print(f"  {i}: {label} - {words[:5]}")

## 9.5 Document-Topic Distributions

In [None]:
# Add topic proportions to corpus
for i in range(n_topics):
    corpus[f'topic_{i}'] = doc_topics[:, i]

# Find dominant topic for each document
corpus['dominant_topic'] = doc_topics.argmax(axis=1)

# Show example
print("Topic distributions (first 10 documents):")
topic_cols = [f'topic_{i}' for i in range(n_topics)]
print(corpus[['title', 'author', 'dominant_topic'] + topic_cols].head(10).to_string())
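
A dominant topic is only the largest entry in a row, so it says little about documents that are spread evenly across several topics. The sketch below, on a made-up document-topic matrix, flags such documents. The 0.5 threshold is an arbitrary choice for illustration, not a recommended value.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents x 3 topics, each row sums to 1
toy_doc_topics = np.array([
    [0.90, 0.05, 0.05],  # clearly about topic 0
    [0.40, 0.35, 0.25],  # mixed: topic 0 leads only narrowly
    [0.10, 0.10, 0.80],  # clearly about topic 2
])

dominant = toy_doc_topics.argmax(axis=1)     # index of the largest proportion per row
dominant_share = toy_doc_topics.max(axis=1)  # how large that proportion is
is_mixed = dominant_share < 0.5              # arbitrary illustrative threshold

for d in range(len(toy_doc_topics)):
    note = "mixed" if is_mixed[d] else "focused"
    print(f"Document {d}: dominant topic {dominant[d]} ({dominant_share[d]:.0%}, {note})")
```

Documents 0 and 1 share the same dominant topic, yet only the first is strongly about it.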

In [None]:
# Which documents are most representative of each topic?
print("\nMost representative documents per topic:")
for topic_id in range(n_topics):
    top_docs = corpus.nlargest(3, f'topic_{topic_id}')
    print(f"\nTopic {topic_id} ({topic_labels[topic_id]}):")
    for _, row in top_docs.iterrows():
        # str() guards against missing titles (NaN) left by the left merge
        print(f"  {str(row['title'])[:50]}... ({row[f'topic_{topic_id}']:.2%})")

## 9.6 Visualizing Topics

In [None]:
# Distribution of documents across topics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Documents per dominant topic
topic_counts = corpus['dominant_topic'].value_counts().sort_index()
axes[0].bar(topic_counts.index, topic_counts.values, color='steelblue')
axes[0].set_xlabel('Topic')
axes[0].set_ylabel('Number of Documents')
axes[0].set_title('Documents per Dominant Topic')
axes[0].set_xticks(range(n_topics))

# Average topic proportions
avg_topics = corpus[topic_cols].mean()
axes[1].bar(range(n_topics), avg_topics.values, color='steelblue')
axes[1].set_xlabel('Topic')
axes[1].set_ylabel('Average Proportion')
axes[1].set_title('Average Topic Proportions Across All Documents')
axes[1].set_xticks(range(n_topics))

plt.tight_layout()
plt.show()

In [None]:
# Topic distribution by genre
genre_topics = corpus.groupby('genre')[topic_cols].mean()

fig, ax = plt.subplots(figsize=(12, 6))
genre_topics.plot(kind='bar', ax=ax, width=0.8)
ax.set_xlabel('Genre')
ax.set_ylabel('Average Topic Proportion')
ax.set_title('Topic Distribution by Genre')
ax.legend(title='Topic', bbox_to_anchor=(1.02, 1))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Topic distribution by author
author_counts = corpus['author'].value_counts()
top_authors = author_counts[author_counts >= 2].index[:8]

author_topics = corpus[corpus['author'].isin(top_authors)].groupby('author')[topic_cols].mean()

fig, ax = plt.subplots(figsize=(12, 6))
author_topics.plot(kind='bar', ax=ax, width=0.8)
ax.set_xlabel('Author')
ax.set_ylabel('Average Topic Proportion')
ax.set_title('Topic Distribution by Author')
ax.legend(title='Topic', bbox_to_anchor=(1.02, 1))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 9.7 Finding Similar Documents by Topic

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate similarity based on topic distributions
topic_similarity = cosine_similarity(doc_topics)

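def find_similar_by_topic(doc_idx, n=5):
    """
    Find the n documents whose topic distributions are most similar to doc_idx.
    """
    similarities = topic_similarity[doc_idx]
    # Rank by descending similarity, then drop the query document explicitly:
    # when scores are tied it is not guaranteed to be ranked first
    ranked = similarities.argsort()[::-1]
    similar_idx = ranked[ranked != doc_idx][:n]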
    
    results = []
    for idx in similar_idx:
        results.append({
            'title': corpus.iloc[idx]['title'],
            'author': corpus.iloc[idx]['author'],
            'similarity': similarities[idx]
        })
    return pd.DataFrame(results)

# Example: find documents similar to first document
print(f"Documents similar to '{str(corpus.iloc[0]['title'])[:40]}...':")
print(find_similar_by_topic(0))
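
Cosine similarity compares the direction of two topic vectors: it is their dot product divided by the product of their lengths. The following NumPy-only sketch reproduces that calculation on an invented document-topic matrix. The helper name `cosine_sim_matrix` is hypothetical and is not a scikit-learn function.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # length of each row vector
    unit = X / norms                                  # scale every row to length 1
    return unit @ unit.T                              # dot products of unit vectors

# Hypothetical topic proportions for 3 documents over 2 topics
toy_doc_topics = np.array([
    [0.8, 0.2],
    [0.2, 0.8],
    [0.8, 0.2],  # same mixture as document 0
])

toy_sim = cosine_sim_matrix(toy_doc_topics)
print(np.round(toy_sim, 3))
```

Documents 0 and 2 have identical mixtures, so their similarity is 1.0, the same score each document gets when compared with itself. Ties like this are common when many documents are dominated by a single topic.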

## 9.8 Choosing the Number of Topics

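There is no single correct number of topics. In practice the choice is guided by:
- Domain knowledge about the collection
- Interpretability of the resulting topics
- Quantitative metrics such as perplexity (how well the model predicts the word counts) or coherence (how often a topic's top words appear together in documents)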
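
scikit-learn's LDA reports perplexity but has no built-in coherence score. One common variant, often called UMass coherence, can be computed directly from document co-occurrence counts. The function below is a hedged sketch of that idea: the name `umass_coherence` and the toy count matrix are assumptions made for illustration, and dedicated libraries may differ in details such as smoothing.

```python
import numpy as np

def umass_coherence(top_word_ids, doc_term_counts):
    """UMass-style coherence for one topic.

    top_word_ids:    term indices of the topic's top words, most probable first.
    doc_term_counts: dense (n_docs, n_terms) array of counts; every listed term
                     is assumed to occur in at least one document.
    """
    present = np.asarray(doc_term_counts) > 0  # does each document contain each term?
    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            w_m, w_l = top_word_ids[m], top_word_ids[l]
            co_docs = np.sum(present[:, w_m] & present[:, w_l])  # docs containing both words
            docs_l = np.sum(present[:, w_l])                     # docs containing the higher-ranked word
            score += np.log((co_docs + 1) / docs_l)
    return float(score)

# Hypothetical counts: 4 documents x 4 terms
toy_counts = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
])

coherent = umass_coherence([0, 1], toy_counts)    # terms 0 and 1 co-occur in two documents
incoherent = umass_coherence([0, 2], toy_counts)  # terms 0 and 2 never co-occur
print(f"Co-occurring pair: {coherent:.3f}")
print(f"Non-co-occurring pair: {incoherent:.3f}")
```

Scores are zero or negative in this formulation, and values closer to zero indicate top words that tend to appear in the same documents.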

In [None]:
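# Try different numbers of topics and compare perplexity
# Note: perplexity is computed on the same documents used for fitting,
# so treat the curve as a rough guide rather than a definitive answer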
topic_range = range(3, 10)
perplexities = []

for n in topic_range:
    lda_temp = LatentDirichletAllocation(
        n_components=n, random_state=42, max_iter=10, learning_method='online'
    )
    lda_temp.fit(doc_term_matrix)
    perplexity = lda_temp.perplexity(doc_term_matrix)
    perplexities.append(perplexity)
    print(f"  {n} topics: perplexity = {perplexity:.2f}")

# Plot
plt.figure(figsize=(8, 5))
plt.plot(list(topic_range), perplexities, 'o-')
plt.xlabel('Number of Topics')
plt.ylabel('Perplexity (lower is better)')
plt.title('LDA Perplexity vs. Number of Topics')
plt.show()
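
Perplexity measured on the documents a model was fitted to can flatter larger models. A common alternative is to hold some documents out and score the model on those. The sketch below shows the pattern on an invented eight-document corpus. For simplicity it builds the vocabulary from all documents before splitting, which is an assumption of this example and not something the tutorial prescribes.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus with two themes
toy_docs = [
    "river water flood bridge water",
    "water river canal flood",
    "bridge river water canal",
    "flood canal water river",
    "stone wall quarry mason stone",
    "quarry stone mason wall",
    "wall mason stone quarry",
    "mason quarry wall stone",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)

# Keep 25% of the documents aside for evaluation
train_counts, test_counts = train_test_split(toy_counts, test_size=0.25, random_state=0)

held_out = {}
for k in (2, 3):
    model = LatentDirichletAllocation(n_components=k, random_state=0).fit(train_counts)
    held_out[k] = model.perplexity(test_counts)  # scored on unseen documents only
    print(f"{k} topics: train = {model.perplexity(train_counts):.2f}, held-out = {held_out[k]:.2f}")
```

On a corpus this small the numbers are not meaningful in themselves; the point is the pattern of fitting on one subset and evaluating on another.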

## 9.9 Summary

In this tutorial, you learned:

1. **Topic modeling concepts**: Documents as mixtures of topics
2. **LDA**: Latent Dirichlet Allocation for topic discovery
3. **Topic interpretation**: Examining top words and representative documents
4. **Visualization**: Topic distributions across documents, genres, authors
5. **Model selection**: Choosing the number of topics

### The Map Takes Shape

Topic modeling reveals the intellectual geography of the archive:
- What topics dominate the collection
- How topics relate to genres and authors
- Which documents share thematic concerns

---

*The Chief examines your topic map with interest. "So the archive has structure after all," she says. "Ideas cluster. Themes recur. Perhaps we can finally reorganize the shelves by topic rather than acquisition date." You suspect this reorganization will take years.*

## Exercises

### Exercise 9.1: Topic Labeling
Examine the topic words carefully. Can you assign meaningful labels to each topic? What themes do they represent?

In [None]:
# YOUR CODE HERE - update topic_labels based on your interpretation


### Exercise 9.2: NMF Comparison
Try Non-negative Matrix Factorization (NMF) instead of LDA. How do the topics differ?

In [None]:
# YOUR CODE HERE
