# Tutorial 6: Sorting the Schools

## The Capital Archives — A Course in Natural Language Processing

---

*Many manuscripts in the archive are anonymous or have disputed attributions. "Who wrote this?" is often less certain than we'd like. But perhaps the text itself contains clues. Perhaps documents from the same philosophical school share something—a vocabulary, a style, an emphasis—that allows us to group them together.*

*The Chief Archivist suspects some 'anonymous' manuscripts can be sorted into schools by style alone.*

---

In this tutorial, you will learn:
- TF-IDF: Term Frequency-Inverse Document Frequency
- Document similarity with cosine similarity
- Document clustering
- Visualizing document relationships

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab
# Skip this cell if running locally

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    print("✓ Repository cloned and ready!")
else:
    print("✓ Running locally - no setup needed")

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Machine learning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded.")

In [None]:
# Load corpus
manuscripts = pd.read_csv('data/manuscripts.csv')
texts = pd.read_csv('data/manuscript_texts.csv')
scholars = pd.read_csv('data/scholars.csv')

corpus = texts.groupby('manuscript_id').agg(
    text=('text', ' '.join)
).reset_index()

corpus = corpus.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre', 'authenticity_status']],
    on='manuscript_id', how='left'
)

# Get philosophical school for each author
author_school = dict(zip(scholars['name'], scholars['philosophical_school']))
corpus['school'] = corpus['author'].map(author_school).fillna('unknown')

print(f"Loaded {len(corpus)} documents")
print(f"\nDocuments by school:")
print(corpus['school'].value_counts())

## 6.1 From Counts to TF-IDF

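Raw word counts have two problems: long documents contain more of every word, and common words dominate every document.

**TF-IDF** addresses the second problem by reweighting each count:
- **TF (Term Frequency)**: How often does this word appear in this document?
- **IDF (Inverse Document Frequency)**: How rare is this word across all documents?

TF-IDF = TF × IDF

Words that appear frequently in one document but rarely elsewhere get high scores. The first problem, document length, is handled by normalization: by default, `TfidfVectorizer` scales each document vector to unit length (`norm='l2'`), so long and short documents become comparable.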
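
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand using only the standard library. The three toy documents and the helper name `tfidf_vectors` are made up for illustration and are not part of the archive corpus. The sketch assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents as the default behaviour of `TfidfVectorizer`; it skips stopword removal, n-grams, and the `min_df`/`max_df` filters.

```python
import math
from collections import Counter

# Three tiny, made-up "documents" (hypothetical, not from the archive corpus).
docs = [
    "water flows water rests",
    "stone rests",
    "water shapes stone",
]

def tfidf_vectors(documents):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)                       # raw term counts
        raw = [tf[t] * idf[t] for t in vocab]      # TF x IDF
        norm = math.sqrt(sum(w * w for w in raw))  # L2 length of the vector
        vectors.append([w / norm for w in raw])    # scale to unit length
    return vocab, vectors

vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:>7}: {weight:.3f}")
```

In the first toy document, "flows" outweighs "rests" even though each occurs once, because "flows" appears in no other document. "water" still scores highest because it occurs twice.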

In [None]:
# Create TF-IDF matrix
tfidf = TfidfVectorizer(
    max_features=1000,      # Use top 1000 terms
    min_df=2,               # Ignore terms in fewer than 2 documents
    max_df=0.95,            # Ignore terms in more than 95% of documents
    stop_words='english',   # Remove English stopwords
    ngram_range=(1, 2)      # Include unigrams and bigrams
)

tfidf_matrix = tfidf.fit_transform(corpus['text'])

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"  {tfidf_matrix.shape[0]} documents")
print(f"  {tfidf_matrix.shape[1]} features (terms)")

In [None]:
# Get the feature names (terms)
feature_names = tfidf.get_feature_names_out()

print("Sample terms in vocabulary:")
print(feature_names[:30])

In [None]:
# What terms have the highest TF-IDF scores for a specific document?
def get_top_tfidf_terms(doc_index, n=15):
    """
    Get the top TF-IDF terms for a document.
    """
    doc_vector = tfidf_matrix[doc_index].toarray().flatten()
    top_indices = doc_vector.argsort()[-n:][::-1]
    
    return [(feature_names[i], doc_vector[i]) for i in top_indices]

# Example: top terms for first document
doc_idx = 0
print(f"Top TF-IDF terms for '{corpus.iloc[doc_idx]['title'][:50]}...'")
print(f"Author: {corpus.iloc[doc_idx]['author']}")
print()
for term, score in get_top_tfidf_terms(doc_idx):
    print(f"  {term}: {score:.3f}")

## 6.2 Document Similarity

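With TF-IDF vectors, we can measure how similar two documents are using **cosine similarity**: the cosine of the angle between their vectors. Because TF-IDF weights are non-negative, the score ranges from 0 (no shared terms) to 1 (identical term proportions).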
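
As a rough illustration of what cosine similarity measures, the sketch below implements it for plain Python lists. The function name `cosine` and the three example vectors are hypothetical. Returning 0.0 for an all-zero vector is a convention chosen here, not something the tutorial specifies.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for a document with no vocabulary terms
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors over a 3-term vocabulary.
doc_a = [2.0, 1.0, 0.0]
doc_b = [4.0, 2.0, 0.0]   # same proportions as doc_a, twice as long
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

print(f"a vs b: {cosine(doc_a, doc_b):.3f}")
print(f"a vs c: {cosine(doc_a, doc_c):.3f}")
```

The score depends only on direction, not length: `doc_b` is a scaled copy of `doc_a`, so the pair scores as maximally similar, while `doc_a` and `doc_c` share no terms and score zero.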

In [None]:
# Calculate pairwise similarities
similarity_matrix = cosine_similarity(tfidf_matrix)

print(f"Similarity matrix shape: {similarity_matrix.shape}")

In [None]:
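def find_similar_documents(doc_index, n=5):
    """
    Find the n documents most similar to a given document.
    """
    similarities = similarity_matrix[doc_index]
    # Rank by descending similarity, then drop the query document itself.
    # (Assuming it is always ranked first fails when another document ties at 1.0.)
    ranked = similarities.argsort()[::-1]
    similar_indices = ranked[ranked != doc_index][:n]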
    
    results = []
    for idx in similar_indices:
        results.append({
            'title': corpus.iloc[idx]['title'],
            'author': corpus.iloc[idx]['author'],
            'similarity': similarities[idx]
        })
    return pd.DataFrame(results)

# Find documents similar to a Grigsu text
grigsu_docs = corpus[corpus['author'] == 'Grigsu Haldo']
if len(grigsu_docs) > 0:
    doc_idx = grigsu_docs.index[0]
    print(f"Documents similar to '{corpus.iloc[doc_idx]['title'][:50]}...'")
    print(f"by {corpus.iloc[doc_idx]['author']}\n")
    print(find_similar_documents(doc_idx))

## 6.3 Clustering Documents

Can we automatically group documents into clusters based on their TF-IDF vectors?
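
K-means answers this by alternating two steps: assign every document to its nearest centroid, then move each centroid to the mean of its members. The following is a minimal pure-Python sketch of that loop (Lloyd's algorithm) on made-up 2D points. The function signature is an assumption for illustration: it takes the starting centroids as an argument so the result is deterministic, whereas scikit-learn's `KMeans` chooses them itself (k-means++ by default) and can repeat the run several times via `n_init`.

```python
import math

def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2D "document vectors": two loose groups.
points = [(0.9, 0.1), (1.0, 0.0), (0.8, 0.2),
          (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]

labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Because the outcome depends on where the centroids start, a poor starting choice can give a worse grouping. That is the reason the cell that follows sets `n_init=10` and fixes `random_state`.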

In [None]:
# Cluster documents using K-means
n_clusters = 4  # Let's try 4 clusters

kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
corpus['cluster'] = kmeans.fit_predict(tfidf_matrix)

print("Documents per cluster:")
print(corpus['cluster'].value_counts().sort_index())

In [None]:
# What characterizes each cluster?
def get_cluster_terms(cluster_id, n=10):
    """
    Get the terms with the highest mean TF-IDF weight in a cluster.
    """
    # Positional row numbers of the cluster's documents (safe even if the
    # DataFrame index is not 0..n-1)
    cluster_rows = np.flatnonzero(corpus['cluster'].to_numpy() == cluster_id)
    cluster_vectors = tfidf_matrix[cluster_rows].toarray()
    mean_vector = cluster_vectors.mean(axis=0)
    
    top_indices = mean_vector.argsort()[-n:][::-1]
    return [(feature_names[i], mean_vector[i]) for i in top_indices]

# Show top terms for each cluster
for cluster_id in range(n_clusters):
    print(f"\nCluster {cluster_id}:")
    for term, score in get_cluster_terms(cluster_id):
        print(f"  {term}: {score:.3f}")

In [None]:
# How do clusters align with philosophical schools?
cluster_school = pd.crosstab(corpus['cluster'], corpus['school'])
print("Clusters vs Philosophical Schools:")
print(cluster_school)

## 6.4 Visualizing Document Space

The TF-IDF space has one dimension per term (up to 1,000 here), so we can't plot it directly. Instead, we can project it down to 2D.

In [None]:
# Reduce to 2D using PCA
pca = PCA(n_components=2, random_state=42)
coords_pca = pca.fit_transform(tfidf_matrix.toarray())

corpus['pca_x'] = coords_pca[:, 0]
corpus['pca_y'] = coords_pca[:, 1]

print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")

In [None]:
# Plot documents colored by school
fig, ax = plt.subplots(figsize=(12, 8))

schools = corpus['school'].unique()
colors = plt.cm.tab10(np.linspace(0, 1, len(schools)))

for school, color in zip(schools, colors):
    mask = corpus['school'] == school
    ax.scatter(corpus.loc[mask, 'pca_x'], 
               corpus.loc[mask, 'pca_y'],
               label=school, alpha=0.7, s=100, c=[color])

ax.set_xlabel('PCA Component 1')
ax.set_ylabel('PCA Component 2')
ax.set_title('Documents in TF-IDF Space (PCA)')
ax.legend(loc='best')

plt.tight_layout()
plt.show()

In [None]:
# Try t-SNE for potentially better separation
# t-SNE is better at preserving local structure
if len(corpus) > 10:  # t-SNE needs enough points
    tsne = TSNE(n_components=2, random_state=42, perplexity=min(5, len(corpus)-1))
    coords_tsne = tsne.fit_transform(tfidf_matrix.toarray())
    
    corpus['tsne_x'] = coords_tsne[:, 0]
    corpus['tsne_y'] = coords_tsne[:, 1]
    
    # Plot
    fig, ax = plt.subplots(figsize=(12, 8))
    
    for school, color in zip(schools, colors):
        mask = corpus['school'] == school
        ax.scatter(corpus.loc[mask, 'tsne_x'], 
                   corpus.loc[mask, 'tsne_y'],
                   label=school, alpha=0.7, s=100, c=[color])
    
    ax.set_xlabel('t-SNE 1')
    ax.set_ylabel('t-SNE 2')
    ax.set_title('Documents in TF-IDF Space (t-SNE)')
    ax.legend(loc='best')
    
    plt.tight_layout()
    plt.show()

## 6.5 Investigating Suspicious Documents

Let's look at documents with suspected forgery status. Where do they cluster?

In [None]:
# Find suspected forgeries
suspicious = corpus[corpus['authenticity_status'] == 'suspected_forgery']

print(f"Suspected forgeries in corpus: {len(suspicious)}")
if len(suspicious) > 0:
    print(suspicious[['manuscript_id', 'title', 'author', 'cluster']])

In [None]:
# If we have suspected forgeries attributed to Grigsu, 
# do they cluster with authentic Grigsu or with other schools?

if len(suspicious) > 0:
    # Check what cluster the suspicious docs are in
    print("\nSuspicious documents by cluster:")
    print(suspicious[['title', 'author', 'cluster']])
    
    # For each suspect, list its nearest neighbours and check whether
    # they share the attributed author
    for _, sus_doc in suspicious.iterrows():
        claimed_author = sus_doc['author']
        print(f"\n{sus_doc['title'][:50]}...")
        print(f"  Attributed to: {claimed_author}")
        print(f"  Cluster: {sus_doc['cluster']}")
        
        # Find most similar documents
        doc_idx = corpus[corpus['manuscript_id'] == sus_doc['manuscript_id']].index[0]
        similar = find_similar_documents(doc_idx, n=3)
        print(f"  Most similar to:")
        for _, row in similar.iterrows():
            print(f"    - {row['title'][:40]}... by {row['author']} (sim={row['similarity']:.3f})")

## 6.6 Summary

In this tutorial, you learned:

1. **TF-IDF**: Representing documents as weighted term vectors
2. **Cosine similarity**: Measuring document similarity
3. **K-means clustering**: Grouping documents automatically
4. **Dimensionality reduction**: PCA and t-SNE for visualization

### Key Findings

- Documents can be represented as vectors in term space
- Similar documents cluster together based on vocabulary
- Clusters may align with philosophical schools, genres, or authors
- Suspected forgeries may cluster with unexpected groups

---

*The visualization reveals patterns invisible to the naked eye. Documents cluster by philosophical affinity, and some 'anonymous' texts now reveal their likely origins. But what of the suspected forgeries? If they cluster with the wrong school, that would be evidence of inauthenticity...*

## Exercises

### Exercise 6.1: Optimal Clusters
Try different values of k for K-means clustering. Use the elbow method or silhouette score to find the optimal number of clusters.

In [None]:
# YOUR CODE HERE
from sklearn.metrics import silhouette_score

# Try k from 2 to 8
# Calculate inertia and silhouette score for each

### Exercise 6.2: Genre Clustering
Do documents cluster better by genre than by philosophical school? Create a visualization colored by genre and compare.

In [None]:
# YOUR CODE HERE


### Exercise 6.3: Document Similarity Search
Build a simple search function: given a query text (not in the corpus), find the most similar documents.

In [None]:
# YOUR CODE HERE
def search_similar(query_text, top_n=5):
    """
    Find documents most similar to a query.
    """
    # Transform query using the fitted TF-IDF vectorizer
    query_vector = tfidf.transform([query_text])
    
    # Calculate similarity to all documents
    # YOUR CODE HERE
    pass