# Tutorial 4: Patterns in the Stone

## The Capital Archives — A Course in Natural Language Processing

---

*"The patterns are in the stone," Grigsu wrote in his final, fragmentary notes from the Yeller Quarry expedition. "Not in single words but in their combinations. The sequences that repeat. The phrases that persist."*

*The Chief wants you to find recurring phrases in the archive—patterns that appear again and again across different manuscripts.*

---

In this tutorial, you will learn:
- N-grams: sequences of words
- Collocations: words that appear together
- Concordance: viewing words in context
- Using NLTK's text analysis tools

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab
# Skip this cell if running locally

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    
    # Install/download NLTK data
    import nltk
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('stopwords', quiet=True)
    print("✓ Repository cloned and NLTK data downloaded!")
else:
    print("✓ Running locally - no setup needed")

In [None]:
# Standard imports
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt

# NLP libraries
import nltk
from nltk import ngrams, bigrams, trigrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
from nltk.text import Text

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("Libraries loaded.")

In [None]:
# Load and prepare corpus
manuscripts = pd.read_csv('data/manuscripts.csv')
texts = pd.read_csv('data/manuscript_texts.csv')

# Create corpus
corpus = texts.groupby('manuscript_id').agg(
    text=('text', ' '.join)
).reset_index()

corpus = corpus.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre']],
    on='manuscript_id', how='left'
)

# Combine all text
all_text = ' '.join(corpus['text'])
print(f"Total characters in corpus: {len(all_text):,}")

## 4.1 What Are N-grams?

An **n-gram** is a contiguous sequence of n items (words, in our case) from a text.

- **Unigrams** (n=1): individual words
- **Bigrams** (n=2): pairs of consecutive words
- **Trigrams** (n=3): triples of consecutive words
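
To make the definition concrete, here is a minimal plain-Python sketch of n-gram extraction as a sliding window. The helper name `make_ngrams` is ours, introduced only for illustration; it is not part of NLTK, whose `bigrams` and `trigrams` functions are used in the cells that follow.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    # zip over n staggered copies of the list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest copy, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

sample_tokens = ["words", "are", "hard", "like", "stones"]
print(make_ngrams(sample_tokens, 2))
print(make_ngrams(sample_tokens, 3))
```

A list of k tokens yields k - n + 1 n-grams, so the five tokens above give four bigrams and three trigrams.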

In [None]:
# Example sentence
sample = "Words are hard like stones in the village."
tokens = word_tokenize(sample.lower())

print("Tokens (unigrams):")
print(tokens)

print("\nBigrams:")
print(list(bigrams(tokens)))

print("\nTrigrams:")
print(list(trigrams(tokens)))

In [None]:
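# Create n-grams for the whole corpus
all_tokens = word_tokenize(all_text.lower())
# Keep alphabetic tokens only. Because punctuation is dropped and all
# manuscripts are joined into one string, some n-grams below span
# sentence or manuscript boundaries.
all_tokens = [t for t in all_tokens if t.isalpha()]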

# Get bigrams and trigrams
corpus_bigrams = list(bigrams(all_tokens))
corpus_trigrams = list(trigrams(all_tokens))

print(f"Total bigrams: {len(corpus_bigrams):,}")
print(f"Total trigrams: {len(corpus_trigrams):,}")

In [None]:
# Count bigram frequencies
bigram_freq = Counter(corpus_bigrams)

print("Most common bigrams:")
for bg, count in bigram_freq.most_common(20):
    print(f"  {' '.join(bg)}: {count}")

### Observation

Many common bigrams are just combinations of stopwords ("of the", "in the", "to the"). Let's filter those out.

In [None]:
# Filter bigrams: require at least one content word
stop_words = set(stopwords.words('english'))

content_bigrams = [(w1, w2) for w1, w2 in corpus_bigrams 
                   if w1 not in stop_words or w2 not in stop_words]

content_bigram_freq = Counter(content_bigrams)

print("Most common content bigrams:")
for bg, count in content_bigram_freq.most_common(20):
    print(f"  {' '.join(bg)}: {count}")

In [None]:
# Even stricter: require both words to be content words
strict_content_bigrams = [(w1, w2) for w1, w2 in corpus_bigrams 
                          if w1 not in stop_words and w2 not in stop_words]

strict_bigram_freq = Counter(strict_content_bigrams)

print("Most common bigrams (both words content):")
for bg, count in strict_bigram_freq.most_common(20):
    print(f"  {' '.join(bg)}: {count}")

## 4.2 Trigrams and Beyond

In [None]:
# Trigram analysis
trigram_freq = Counter(corpus_trigrams)

print("Most common trigrams:")
for tg, count in trigram_freq.most_common(15):
    print(f"  {' '.join(tg)}: {count}")

In [None]:
# Content trigrams (at least 2 content words)
def count_content_words(ngram, stopwords_set):
    return sum(1 for w in ngram if w not in stopwords_set)

content_trigrams = [tg for tg in corpus_trigrams 
                    if count_content_words(tg, stop_words) >= 2]

content_trigram_freq = Counter(content_trigrams)

print("Most common content trigrams:")
for tg, count in content_trigram_freq.most_common(15):
    print(f"  {' '.join(tg)}: {count}")

## 4.3 Collocations: Statistically Significant Pairs

Simple frequency counting finds common phrases, but some of those are common just because the individual words are common.

**Collocations** are word combinations that occur together more often than chance would predict. NLTK provides statistical measures to find them.
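
As a rough illustration of "more often than chance would predict", the sketch below computes pointwise mutual information (PMI) by hand on a made-up toy sentence. The helper `pmi_scores` and the toy tokens are assumptions for this example only; every probability is simply estimated as a count divided by the number of tokens.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score each adjacent word pair by PMI (illustrative helper, not NLTK's API)."""
    total = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), pair_count in pair_counts.items():
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with each P = count / total,
        # which simplifies to log2(pair_count * total / (count(w1) * count(w2)))
        scores[(w1, w2)] = math.log2(
            pair_count * total / (word_counts[w1] * word_counts[w2])
        )
    return scores

# Toy data invented for this example, not taken from the archive
toy_tokens = "the yeller quarry and the old stone of the yeller quarry".split()
toy_scores = pmi_scores(toy_tokens)
for pair in [("yeller", "quarry"), ("the", "yeller")]:
    print(pair, round(toy_scores[pair], 3))
```

Both pairs occur twice, but "yeller quarry" scores higher: "the" also appears elsewhere, so its pairing with "yeller" is less surprising.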

In [None]:
# Find collocations using PMI (Pointwise Mutual Information)
# PMI measures how much more often two words occur together than we would
# expect if they were independent of each other

bigram_finder = BigramCollocationFinder.from_words(all_tokens)

# Keep only bigrams that appear at least 3 times (PMI tends to overrate very rare pairs)
bigram_finder.apply_freq_filter(3)

# Find best collocations by PMI
bigram_measures = BigramAssocMeasures()

print("Top collocations by PMI:")
for colloc in bigram_finder.nbest(bigram_measures.pmi, 20):
    print(f"  {' '.join(colloc)}")

In [None]:
# Also try the likelihood ratio, which is less biased toward rare pairs than PMI
print("Top collocations by likelihood ratio:")
for colloc in bigram_finder.nbest(bigram_measures.likelihood_ratio, 20):
    print(f"  {' '.join(colloc)}")

In [None]:
# Trigram collocations
trigram_finder = TrigramCollocationFinder.from_words(all_tokens)
trigram_finder.apply_freq_filter(3)

trigram_measures = TrigramAssocMeasures()

print("Top trigram collocations by likelihood ratio:")
for colloc in trigram_finder.nbest(trigram_measures.likelihood_ratio, 15):
    print(f"  {' '.join(colloc)}")

## 4.4 Concordance: Words in Context

A **concordance** shows every occurrence of a word in context—the words that appear before and after it. This helps us understand how a word is used.
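
The idea can be sketched in a few lines of plain Python. The helper `kwic` (short for "keyword in context") and the toy sentence are made up for this illustration; NLTK's own `concordance` method, used in the cells that follow, handles the formatting for you.

```python
def kwic(tokens, target, window=3):
    """Return (left context, keyword, right context) for each hit (hypothetical helper)."""
    rows = []
    for i, token in enumerate(tokens):
        if token == target:
            # Up to `window` tokens on each side, clipped at the ends of the list
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows

# Toy sentence invented for this example
toy_tokens = "the stone remembers what the water forgets and the stone waits".split()
for left, keyword, right in kwic(toy_tokens, "stone", window=2):
    # Right-align the left context so the keywords line up in a column
    print(f"{left:>20} [{keyword}] {right}")
```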

In [None]:
# Create an NLTK Text object for concordance
nltk_text = Text(all_tokens)

# Show concordance for a key term
print("Concordance for 'stone':")
nltk_text.concordance('stone', width=75, lines=15)

In [None]:
print("Concordance for 'dissolution':")
nltk_text.concordance('dissolution', width=75, lines=15)

In [None]:
print("Concordance for 'yeller':")
nltk_text.concordance('yeller', width=75, lines=15)

## 4.5 Similar Words

NLTK can find words that appear in similar contexts—they may be synonyms or related concepts.
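
One way to think about "similar contexts" is to record, for each word, the pairs of words that surround it, and then look for other words that share those pairs. The sketch below is a rough illustration of that idea under our own simplifying assumptions, not NLTK's actual implementation; `context_index`, `similar_words`, and the toy sentence are hypothetical.

```python
from collections import Counter, defaultdict

def context_index(tokens):
    """Map each word to a Counter of its (previous word, next word) contexts."""
    contexts = defaultdict(Counter)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word][(prev, nxt)] += 1
    return contexts

def similar_words(tokens, target, top_n=5):
    """Rank other words by how many distinct contexts they share with target."""
    contexts = context_index(tokens)
    target_contexts = set(contexts.get(target, {}))
    scores = Counter()
    for word, word_contexts in contexts.items():
        if word == target:
            continue
        shared = target_contexts & set(word_contexts)
        if shared:
            scores[word] = len(shared)
    return [word for word, _ in scores.most_common(top_n)]

# Toy sentence invented for this example
toy_tokens = "the stone is hard the water is cold the stone is old".split()
print(similar_words(toy_tokens, "stone"))
print(similar_words(toy_tokens, "hard"))
```

Here "stone" and "water" both occur between "the" and "is", so each is reported as similar to the other even though they never appear side by side.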

In [None]:
# Find words used in similar contexts to 'word'
print("Words similar to 'word':")
nltk_text.similar('word')

In [None]:
print("Words similar to 'stone':")
nltk_text.similar('stone')

In [None]:
# Common contexts: what contexts do two words share?
print("Common contexts of 'stone' and 'water':")
nltk_text.common_contexts(['stone', 'water'])

## 4.6 Author-Specific N-grams

Do different authors have different recurring phrases?
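
One simple way to answer this, sketched here under our own assumptions, is to count n-grams per author and keep those that recur for one author but never appear for the other. The helper `distinctive_ngrams` and the toy counts are invented for illustration; in practice the two `Counter` objects would be built from each author's tokens.

```python
from collections import Counter

def distinctive_ngrams(freq_a, freq_b, min_count=2):
    """N-grams seen at least min_count times in A and never in B (hypothetical helper)."""
    return [(ngram, count) for ngram, count in freq_a.most_common()
            if count >= min_count and ngram not in freq_b]

# Toy counts invented for this example, not real archive results
toy_a = Counter({("stone", "school"): 4, ("old", "quarry"): 2, ("hard", "words"): 1})
toy_b = Counter({("old", "quarry"): 3, ("water", "school"): 5})
print(distinctive_ngrams(toy_a, toy_b))
```

Exact absence is a blunt criterion. With a larger corpus, comparing relative frequencies would be a gentler alternative.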

In [None]:
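def get_author_ngrams(corpus_df, author, n=2, min_freq=2, num_results=15):
    """
    Get the most common n-grams for a specific author.

    Stopwords (from the global stop_words set) and non-alphabetic tokens are
    removed before the n-grams are built, so an n-gram may join words that
    were not adjacent in the original text.

    Returns a list of (ngram, count) pairs, most frequent first.
    """
    author_docs = corpus_df[corpus_df['author'] == author]
    if len(author_docs) == 0:
        return []

    author_text = ' '.join(author_docs['text'])
    tokens = word_tokenize(author_text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

    author_ngrams = list(ngrams(tokens, n))
    freq = Counter(author_ngrams)

    # Take the top results first, then drop any that fall below min_freq
    return [(ng, count) for ng, count in freq.most_common(num_results)
            if count >= min_freq]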

In [None]:
# Compare bigrams between authors
authors_to_compare = ['Grigsu Haldo', 'Yasho Krent']

for author in authors_to_compare:
    print(f"\n{author}'s common bigrams:")
    author_bigrams = get_author_ngrams(corpus, author, n=2)
    for bg, count in author_bigrams:
        print(f"  {' '.join(bg)}: {count}")

## 4.7 Searching for Specific Patterns

Sometimes we want to find specific types of phrases—for example, phrases containing "Yeller" or phrases about philosophical concepts.

In [None]:
# Find all bigrams containing a specific word
def bigrams_containing(word, bigram_counter):
    """
    Find all bigrams containing a specific word.
    """
    word = word.lower()
    matching = [(bg, count) for bg, count in bigram_counter.items()
                if word in bg]
    return sorted(matching, key=lambda x: -x[1])

# Bigrams containing 'yeller'
print("Bigrams containing 'yeller':")
for bg, count in bigrams_containing('yeller', bigram_freq)[:15]:
    print(f"  {' '.join(bg)}: {count}")

In [None]:
# Bigrams containing 'meaning'
print("Bigrams containing 'meaning':")
for bg, count in bigrams_containing('meaning', bigram_freq)[:15]:
    print(f"  {' '.join(bg)}: {count}")

## 4.8 Visualizing N-gram Patterns

In [None]:
# Visualize top bigrams
top_bigrams = strict_bigram_freq.most_common(20)

fig, ax = plt.subplots(figsize=(12, 8))

labels = [' '.join(bg) for bg, _ in top_bigrams]
counts = [count for _, count in top_bigrams]

ax.barh(range(len(labels)), counts, color='steelblue')
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.invert_yaxis()
ax.set_xlabel('Frequency')
ax.set_title('Most Common Content Bigrams in the Archive')

plt.tight_layout()
plt.show()

## 4.9 Summary

In this tutorial, you learned:

1. **N-grams**: Sequences of n consecutive words
2. **Collocations**: Word combinations that occur together more often than chance would predict, ranked with PMI and the likelihood ratio
3. **Concordance**: Viewing words in their surrounding context
4. **Similar words**: Finding words used in similar contexts
5. **Pattern search**: Finding n-grams containing specific terms

### Key Patterns Discovered

Our analysis has revealed recurring phrases in the archive:
- Philosophical terms often appear in specific combinations
- Different authors have distinctive phrase patterns
- Key concepts like "stone", "water", and "words" appear in characteristic contexts

---

*"There," the stone-school archivist says, pointing at your bigram chart. "Do you see? The patterns persist. The combinations recur. This is not random. This is structure." You're beginning to think he might have a point.*

## Exercises

### Exercise 4.1: Four-grams
Extract four-grams from the corpus. What are the most common four-word phrases?

In [None]:
# YOUR CODE HERE


### Exercise 4.2: Genre-Specific Collocations
Compare the collocations in "treatise" documents versus "debate_transcript" documents. What phrases are distinctive to each genre?

In [None]:
# YOUR CODE HERE


### Exercise 4.3: Concordance for Key Figures
Create concordances for the names of key scholars (Grigsu, Yasho, Bagbu). How are they discussed in the archive?

In [None]:
# YOUR CODE HERE
