# Tutorial 3: The Word Counters

## The Capital Archives — A Course in Natural Language Processing

---

*The stone-school scholars believe that counting words reveals hidden patterns in reality. "The primes appear everywhere," Grigsu wrote, "in Yeller's movements, in the counting songs, in the very structure of language itself." Whether or not words are permanent, counting them is certainly useful.*

*The Chief Archivist wants frequency reports on the collection. Which words appear most often? How do different authors and genres differ in their word usage?*

---

In this tutorial, you will learn:
- Tokenization: breaking text into words
- Word frequency analysis
- Stopword removal
- Comparing word frequencies across documents
- Visualizing word distributions

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab
# Skip this cell if running locally

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    
    # Install/download NLTK data
    import nltk
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('stopwords', quiet=True)
    print("✓ Repository cloned and NLTK data downloaded!")
else:
    print("✓ Running locally - no setup needed")

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import re
from collections import Counter
import matplotlib.pyplot as plt

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

# Download required NLTK data (run once)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("Libraries loaded.")

In [None]:
# Load our data
manuscripts = pd.read_csv('manuscripts.csv')
texts = pd.read_csv('manuscript_texts.csv')

# Create a combined corpus
corpus = texts.groupby('manuscript_id').agg(
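    # Skip missing text rows so ' '.join never receives a NaN (float)
    text=('text', lambda parts: ' '.join(parts.dropna().astype(str)))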
).reset_index()

corpus = corpus.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre', 'authenticity_status']],
    on='manuscript_id',
    how='left'
)

print(f"Loaded {len(corpus)} documents")

## 3.1 What is Tokenization?

**Tokenization** is the process of breaking text into smaller units called **tokens**. Usually, tokens are words, but they can also be sentences, paragraphs, or even individual characters.
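
To make the idea of token granularity concrete, here is a minimal sketch using only Python's standard library. The regular expressions are rough stand-ins chosen for illustration, not what NLTK does internally, and the sample sentence is invented.

```python
import re

text = "Words are hard. Words dissolve!"

# Character-level tokens: every character, including spaces and punctuation
char_tokens = list(text)

# Word-level tokens: runs of ASCII letters (a rough stand-in for a real tokenizer)
word_tokens = re.findall(r"[A-Za-z]+", text)

# Sentence-level tokens: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(len(char_tokens), "characters")
print(word_tokens)
print(sentence_tokens)
```

The same text yields very different token counts depending on the unit chosen, which is why a frequency report should state how its tokens were defined.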

In [None]:
# Simple tokenization: split on whitespace
sample_text = "Grigsu argued that words are hard, like stones. Yasho disagreed—she believed in dissolution."

simple_tokens = sample_text.split()
print("Simple split():")
print(simple_tokens)

In [None]:
# Problem: punctuation is attached to words!
# 'hard,' and 'stones.' include punctuation
# 'disagreed—she' is treated as one token

# Better: use NLTK's word_tokenize
nltk_tokens = word_tokenize(sample_text)
print("\nNLTK word_tokenize():")
print(nltk_tokens)

In [None]:
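# NLTK separates most punctuation from words,
# but it keeps the punctuation marks as separate tokens.
# We usually want to remove them before counting.

# Filter to keep only alphabetic tokens
# Caveat: isalpha() also drops any token that mixes letters with other
# characters, such as hyphenated words ('stone-school') or contraction
# pieces ("n't").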
word_tokens = [token for token in nltk_tokens if token.isalpha()]
print("\nAlphabetic tokens only:")
print(word_tokens)

### Building a Tokenization Function

Let's create a reusable tokenization function for our corpus.

In [None]:
def tokenize(text, lowercase=True, remove_punctuation=True):
    """
    Tokenize text into words.
    
    Parameters:
    -----------
    text : str
        The text to tokenize
    lowercase : bool
        Convert tokens to lowercase
    remove_punctuation : bool
        Remove non-alphabetic tokens
        
    Returns:
    --------
    list : List of tokens
    """
    if pd.isna(text):
        return []
    
    # Tokenize
    tokens = word_tokenize(str(text))
    
    # Lowercase
    if lowercase:
        tokens = [t.lower() for t in tokens]
    
    # Remove punctuation
    if remove_punctuation:
        tokens = [t for t in tokens if t.isalpha()]
    
    return tokens

# Test it
print(tokenize(sample_text))
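
One limitation of the `isalpha()` filter in `tokenize` is that it discards any token containing a hyphen or apostrophe, so a term like "stone-school" would vanish from the counts if the tokenizer keeps it as a single token (which `word_tokenize` typically does). The sketch below shows one possible alternative based on a regular expression. The name `regex_tokenize` and the pattern are assumptions made for illustration, and the pattern only covers ASCII letters.

```python
import re

# Letters, optionally joined by single internal hyphens or apostrophes
# (ASCII letters only; an assumption made to keep the sketch short)
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

def regex_tokenize(text, lowercase=True):
    """Hypothetical alternative tokenizer that keeps 'stone-school' whole."""
    tokens = WORD_PATTERN.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens

# \u2014 is an em dash, which the pattern treats as a separator
example = "The stone-school scholars disagreed\u2014she didn't."
print(regex_tokenize(example))
```

Whether hyphenated compounds should count as one word or two is a design choice. Either is defensible as long as it is applied consistently across the corpus.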

In [None]:
# Tokenize our entire corpus
corpus['tokens'] = corpus['text'].apply(tokenize)
corpus['token_count'] = corpus['tokens'].apply(len)

print(f"Total tokens in corpus: {corpus['token_count'].sum():,}")
print(f"Average tokens per document: {corpus['token_count'].mean():.0f}")

## 3.2 Word Frequency Analysis

Now let's count how often each word appears.
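
Before running this on the archive, it may help to see how `Counter` behaves on a toy example. The two token lists below are invented. The sketch also shows `Counter.update`, which accumulates counts document by document without first building one large list.

```python
from collections import Counter

# Two tiny invented "documents", already tokenized
docs = [
    ["stone", "words", "stone"],
    ["water", "words"],
]

freq = Counter()
for tokens in docs:
    freq.update(tokens)  # add this document's counts to the running total

print(freq.most_common(2))
print(freq["pool"])  # a word that never occurred counts as 0, no KeyError
```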

In [None]:
# Combine all tokens from all documents
all_tokens = []
for tokens in corpus['tokens']:
    all_tokens.extend(tokens)

print(f"Total tokens: {len(all_tokens):,}")
print(f"Unique tokens (vocabulary): {len(set(all_tokens)):,}")

In [None]:
# Count word frequencies
word_freq = Counter(all_tokens)

print("25 most common words:")
for word, count in word_freq.most_common(25):
    print(f"  {word}: {count}")

### Observation

The most common words are **function words** like "the", "is", "of", "and". These words appear frequently in all English text—they don't tell us much about the specific content of our manuscripts.

These are called **stopwords**.

## 3.3 Stopword Removal

**Stopwords** are common words that carry little semantic meaning. Removing them helps us focus on the content words that distinguish one text from another.
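
As a small illustration of the filtering step, the sketch below uses a hand-picked stopword set. It is an assumption for demonstration only and much shorter than NLTK's list. Storing stopwords in a `set` keeps each membership test fast.

```python
# Hand-picked toy stopword set (not NLTK's list)
toy_stopwords = {"the", "of", "is", "and", "a"}

tokens = ["the", "permanence", "of", "stone", "is", "a", "matter", "of", "dispute"]

# Keep only tokens that are not in the stopword set
content = [t for t in tokens if t not in toy_stopwords]
removed = len(tokens) - len(content)

print(content)
print(f"{removed} of {len(tokens)} tokens removed")
```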

In [None]:
# NLTK's English stopwords
english_stopwords = set(stopwords.words('english'))

print(f"Number of English stopwords: {len(english_stopwords)}")
print(f"\nSample stopwords: {sorted(english_stopwords)[:30]}")

In [None]:
# Remove stopwords from our word frequency
content_words = {word: count for word, count in word_freq.items() 
                 if word not in english_stopwords}

print("25 most common content words (stopwords removed):")
for word, count in Counter(content_words).most_common(25):
    print(f"  {word}: {count}")

Now we see more interesting words! Words like "words", "stone", "dissolution", "meaning" tell us about the actual content of the philosophical debates in our archive.

In [None]:
# Let's create a function that tokenizes with stopword removal
def tokenize_content(text, extra_stopwords=None):
    """
    Tokenize text and remove stopwords.
    
    Parameters:
    -----------
    text : str
        The text to tokenize
    extra_stopwords : set
        Additional stopwords to remove
        
    Returns:
    --------
    list : List of content tokens
    """
    stop_words = set(stopwords.words('english'))
    if extra_stopwords:
        stop_words.update(extra_stopwords)
    
    tokens = tokenize(text)  # Our earlier function
    content_tokens = [t for t in tokens if t not in stop_words]
    
    return content_tokens

# Apply to corpus
corpus['content_tokens'] = corpus['text'].apply(tokenize_content)
corpus['content_count'] = corpus['content_tokens'].apply(len)

print(f"Tokens removed by stopword filtering: {corpus['token_count'].sum() - corpus['content_count'].sum():,}")

## 3.4 Comparing Word Frequencies Across Groups

The real power of word frequency analysis comes from comparison. Do different authors use different words? What about different genres?

In [None]:
def get_word_freq(documents, column='content_tokens'):
    """
    Get word frequencies from a set of documents.
    
    Parameters:
    -----------
    documents : DataFrame
        Documents with a tokens column
    column : str
        Name of the column containing token lists
        
    Returns:
    --------
    Counter : Word frequency counts
    """
    all_tokens = []
    for tokens in documents[column]:
        all_tokens.extend(tokens)
    return Counter(all_tokens)

In [None]:
# Compare word frequencies by author
# Let's look at two major philosophers: Grigsu (stone-school) and Yasho (water-school)

grigsu_docs = corpus[corpus['author'] == 'Grigsu Haldo']
yasho_docs = corpus[corpus['author'] == 'Yasho Krent']

print(f"Grigsu documents: {len(grigsu_docs)}")
print(f"Yasho documents: {len(yasho_docs)}")

In [None]:
grigsu_freq = get_word_freq(grigsu_docs)
yasho_freq = get_word_freq(yasho_docs)

print("Grigsu's top 15 words:")
for word, count in grigsu_freq.most_common(15):
    print(f"  {word}: {count}")

print("\nYasho's top 15 words:")
for word, count in yasho_freq.most_common(15):
    print(f"  {word}: {count}")

In [None]:
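# Find the words that are most distinctive to each of two groups
def compare_frequencies(freq1, freq2, name1="Group 1", name2="Group 2", min_count=3):
    """
    Compare word frequencies between two groups.

    Words whose combined count is below min_count are skipped.
    name1 and name2 are descriptive labels only; they do not affect the result.

    Returns a list of (word, count1, count2, rate1, rate2, ratio) tuples,
    sorted from most distinctive of group 1 to most distinctive of group 2.
    Rates are occurrences per 1,000 tokens.
    """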
    # Normalize by total counts
    total1 = sum(freq1.values())
    total2 = sum(freq2.values())
    
    all_words = set(freq1.keys()) | set(freq2.keys())
    
    ratios = []
    for word in all_words:
        count1 = freq1.get(word, 0)
        count2 = freq2.get(word, 0)
        
        if count1 + count2 < min_count:
            continue
        
        # Normalize to rate per 1000 words
        rate1 = (count1 / total1) * 1000 if total1 > 0 else 0
        rate2 = (count2 / total2) * 1000 if total2 > 0 else 0
        
        # Ratio (add smoothing to avoid division by zero)
        ratio = (rate1 + 0.1) / (rate2 + 0.1)
        
        ratios.append((word, count1, count2, rate1, rate2, ratio))
    
    ratios.sort(key=lambda x: x[5], reverse=True)
    return ratios

comparison = compare_frequencies(grigsu_freq, yasho_freq, "Grigsu", "Yasho")

print("Words MORE common in Grigsu than Yasho:")
for word, c1, c2, r1, r2, ratio in comparison[:10]:
    print(f"  {word}: Grigsu={c1}, Yasho={c2}, ratio={ratio:.2f}")

print("\nWords MORE common in Yasho than Grigsu:")
# Reverse the tail so the most Yasho-distinctive word is listed first
for word, c1, c2, r1, r2, ratio in reversed(comparison[-10:]):
    print(f"  {word}: Grigsu={c1}, Yasho={c2}, ratio={ratio:.2f}")
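
To see why `compare_frequencies` normalizes counts before taking a ratio, consider a toy case with invented numbers in which one author wrote ten times as much text as the other. The helper names `rate_per_1000` and `smoothed_ratio` are hypothetical and simply mirror the arithmetic used in the function above.

```python
from collections import Counter

# Invented counts: author A has 1,000 tokens in total, author B only 100
freq_a = Counter({"stone": 20, "water": 5, "other": 975})
freq_b = Counter({"stone": 4, "water": 6, "other": 90})

def rate_per_1000(freq, word):
    """Occurrences of word per 1,000 tokens in freq."""
    total = sum(freq.values())
    return 1000 * freq[word] / total if total else 0.0

def smoothed_ratio(word, smoothing=0.1):
    """Ratio of A's rate to B's rate, smoothed to avoid division by zero."""
    return (rate_per_1000(freq_a, word) + smoothing) / (rate_per_1000(freq_b, word) + smoothing)

for word in ("stone", "water"):
    print(word, freq_a[word], freq_b[word], round(smoothed_ratio(word), 2))
```

By raw count, "stone" looks like author A's word (20 versus 4). Per 1,000 tokens, however, B uses it twice as often (40 versus 20), so the ratio falls below 1.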

## 3.5 Visualizing Word Frequencies

Visualizations help us understand word distributions at a glance.

In [None]:
# Bar chart of top words
fig, ax = plt.subplots(figsize=(12, 6))

top_words = Counter(content_words).most_common(20)
words, counts = zip(*top_words)

ax.barh(range(len(words)), counts, color='steelblue')
ax.set_yticks(range(len(words)))
ax.set_yticklabels(words)
ax.invert_yaxis()  # Most common at top
ax.set_xlabel('Frequency')
ax.set_title('Most Common Content Words in the Archive')

plt.tight_layout()
plt.show()

In [None]:
# Word cloud (if wordcloud is installed)
try:
    from wordcloud import WordCloud
    
    wc = WordCloud(width=800, height=400, background_color='white', 
                   max_words=100, colormap='viridis')
    wc.generate_from_frequencies(content_words)
    
    plt.figure(figsize=(14, 7))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud of Archive Content')
    plt.show()
    
except ImportError:
    print("WordCloud not installed. Run: pip install wordcloud")

In [None]:
# Compare two authors visually
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Grigsu
grigsu_top = grigsu_freq.most_common(15)
words_g, counts_g = zip(*grigsu_top)
axes[0].barh(range(len(words_g)), counts_g, color='darkred')
axes[0].set_yticks(range(len(words_g)))
axes[0].set_yticklabels(words_g)
axes[0].invert_yaxis()
axes[0].set_xlabel('Frequency')
axes[0].set_title('Grigsu Haldo (Stone-School)')

# Yasho
yasho_top = yasho_freq.most_common(15)
words_y, counts_y = zip(*yasho_top)
axes[1].barh(range(len(words_y)), counts_y, color='darkblue')
axes[1].set_yticks(range(len(words_y)))
axes[1].set_yticklabels(words_y)
axes[1].invert_yaxis()
axes[1].set_xlabel('Frequency')
axes[1].set_title('Yasho Krent (Water-School)')

plt.tight_layout()
plt.show()

## 3.6 Zipf's Law

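**Zipf's Law** is the empirical observation that, in most natural-language corpora, a word's frequency is roughly inversely proportional to its frequency rank. The most common word appears about twice as often as the second most common, three times as often as the third, and so on. On a log-log plot of frequency against rank, this relationship shows up as a roughly straight line with a slope near -1.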

In [None]:
# Let's check if our corpus follows Zipf's Law
sorted_freq = sorted(word_freq.values(), reverse=True)
ranks = range(1, len(sorted_freq) + 1)

# Plot on log-log scale
fig, ax = plt.subplots(figsize=(10, 6))
ax.loglog(ranks, sorted_freq, 'b.', markersize=3, alpha=0.5)
ax.set_xlabel('Rank (log scale)')
ax.set_ylabel('Frequency (log scale)')
ax.set_title("Zipf's Law: Word Frequency vs. Rank")
ax.grid(True, alpha=0.3)

# If it follows Zipf's Law, this should be roughly a straight line
plt.show()
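
Rather than judging the plot only by eye, one could estimate the slope of the log-log relationship. The sketch below fits a least-squares line in pure Python. The function name `loglog_slope` is hypothetical, and the input is synthetic, perfectly Zipfian data rather than the archive's counts.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Synthetic frequencies that follow f(rank) = 1000 / rank exactly
ideal = [1000 / r for r in range(1, 51)]
print(round(loglog_slope(ideal), 3))
```

On the archive, the call would be `loglog_slope(word_freq.values())`. Real corpora usually deviate from the ideal line at the very top and in the long tail of rare words, so a slope only loosely near -1 is still consistent with Zipf's Law.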

## 3.7 Genre-Specific Vocabulary

Let's compare vocabulary across different genres in the archive.

In [None]:
# What genres do we have?
print("Documents by genre:")
print(corpus['genre'].value_counts())

In [None]:
# Get word frequencies for different genres
genre_freqs = {}
for genre in corpus['genre'].unique():
    if pd.notna(genre):
        genre_docs = corpus[corpus['genre'] == genre]
        if len(genre_docs) > 0:
            genre_freqs[genre] = get_word_freq(genre_docs)

# Show top words for each genre
for genre, freq in genre_freqs.items():
    print(f"\n{genre.upper()} (top 10 words):")
    for word, count in freq.most_common(10):
        print(f"  {word}: {count}")

## 3.8 Searching for Specific Terms

Sometimes we want to know how specific terms are distributed across the corpus.

In [None]:
def search_term(corpus_df, term, token_col='content_tokens'):
    """
    Find all documents containing a specific term.
    
    Returns:
    --------
    DataFrame : Documents containing the term, with count
    """
    term = term.lower()
    
    results = []
    for _, row in corpus_df.iterrows():
        count = row[token_col].count(term)
        if count > 0:
            results.append({
                'manuscript_id': row['manuscript_id'],
                'title': row['title'],
                'author': row['author'],
                'count': count
            })
    
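    # Name the columns explicitly so an empty result still sorts without a KeyError
    columns = ['manuscript_id', 'title', 'author', 'count']
    return pd.DataFrame(results, columns=columns).sort_values('count', ascending=False)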

# Search for key terms
print("Documents mentioning 'dissolution':")
print(search_term(corpus, 'dissolution'))

print("\nDocuments mentioning 'stone':")
print(search_term(corpus, 'stone'))

In [None]:
# Track multiple terms
# Note: matching is exact, so 'word' does not also count 'words'
key_terms = ['stone', 'water', 'word', 'meaning', 'dissolution', 'permanent', 'pool']

term_distribution = {}
for term in key_terms:
    term_counts = []
    for tokens in corpus['content_tokens']:
        term_counts.append(tokens.count(term.lower()))
    term_distribution[term] = sum(term_counts)

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
terms, counts = zip(*sorted(term_distribution.items(), key=lambda x: -x[1]))
ax.bar(terms, counts, color='steelblue')
ax.set_xlabel('Term')
ax.set_ylabel('Total Occurrences')
ax.set_title('Key Term Frequencies in the Archive')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 3.9 Summary

In this tutorial, you learned:

1. **Tokenization**: Breaking text into words using `word_tokenize()`
2. **Word frequency**: Counting words with `Counter`
3. **Stopword removal**: Filtering out common function words
4. **Frequency comparison**: Finding words distinctive to different groups
5. **Visualization**: Bar charts, word clouds, and a log-log plot to check Zipf's Law

### Key Insights from the Archive

Our word frequency analysis has already revealed important patterns:
- The archive is dominated by philosophical vocabulary ("word", "meaning", "stone", "water")
- Different authors have distinctive vocabularies
- Genre affects word choice (treatises vs. expedition reports)

---

*The stone-school archivist looks at your frequency reports with satisfaction. "You see?" he says. "The patterns are there, waiting to be counted. The primes in the word frequencies. The structures in the silence between words." You're not sure you see primes in your bar charts, but you nod politely.*

## Exercises

### Exercise 3.1: Hapax Legomena
A **hapax legomenon** is a word that appears only once in a corpus. Find all hapax legomena in our archive. What proportion of the vocabulary are hapax legomena? What kinds of words are they?

In [None]:
# YOUR CODE HERE
hapax = [word for word, count in word_freq.items() if count == 1]
print(f"Hapax legomena: {len(hapax)} words ({100*len(hapax)/len(word_freq):.1f}% of vocabulary)")
print(f"\nSample hapax legomena: {hapax[:20]}")

### Exercise 3.2: Vocabulary Richness
Calculate the **type-token ratio** (TTR) for each author. TTR = unique words / total words. Which authors have the richest vocabulary?
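
As a starting point, here is a minimal sketch of the TTR formula on invented toy tokens. The function name `type_token_ratio` is hypothetical. Grouping by author is left for the exercise.

```python
def type_token_ratio(tokens):
    """Unique tokens (types) divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

short = ["stone", "water", "word", "pool"]
longer = short * 5  # same vocabulary, five times the length

print(type_token_ratio(short))
print(type_token_ratio(longer))
```

The second value is lower even though the vocabulary is identical. TTR tends to fall as a text gets longer, so comparisons between authors with very different amounts of text should be read with caution.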

In [None]:
# YOUR CODE HERE


### Exercise 3.3: Custom Stopwords
Some words specific to our archive (like "manuscript", "archive", "scholar") might be too common to be useful. Create a custom stopword list for the archive and see how it changes the top content words.

In [None]:
# YOUR CODE HERE
archive_stopwords = {'manuscript', 'archive', 'scholar', 'text', 'page', 'document'}
# Add to existing stopwords and re-analyze...