# Tutorial 10: The Forger's Hand

## The Capital Archives — A Course in Natural Language Processing

## Capstone Investigation

---

*The Chief Archivist calls you into her office. The door closes. She places three manuscripts on the desk.*

*"These appeared over the past decade," she says. "Each claims to be a lost work by Grigsu Haldo. Each was 'discovered' under suspicious circumstances. And each rewrites the history of the stone-school."*

*She taps the manuscripts. "The Senate is interested. If these are genuine, they upend everything we thought we knew about Grigsu's final philosophy. If they're forgeries, someone has been systematically deceiving us. I need proof. Not suspicions—evidence."*

*"Use everything you've learned. Find the forger's hand."*

---

In this capstone, you will:
- Combine techniques from all previous tutorials
- Build a comprehensive analysis of suspected forgeries
- Present evidence for or against authenticity
- Draw conclusions about who may have created the forgeries

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab
# Skip this cell if running locally

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    
    # Install/download NLTK data (this notebook uses all NLTK features)
    import nltk
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('stopwords', quiet=True)
    nltk.download('averaged_perceptron_tagger', quiet=True)
    nltk.download('averaged_perceptron_tagger_eng', quiet=True)
    nltk.download('vader_lexicon', quiet=True)
    print("✓ Repository cloned and NLTK data downloaded!")
else:
    print("✓ Running locally - no setup needed")

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import re

# NLP
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Machine learning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('vader_lexicon', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded. Investigation ready to begin.")

## 10.1 The Evidence

Let's examine what we have.

In [None]:
# Load all data
manuscripts = pd.read_csv('manuscripts.csv')
texts = pd.read_csv('manuscript_texts.csv')
scholars = pd.read_csv('scholars.csv')

# Load forgery evidence if available
try:
    forgery_evidence = pd.read_csv('forgery_evidence.csv')
    print(f"Loaded forgery evidence: {len(forgery_evidence)} records")
except FileNotFoundError:
    forgery_evidence = None
    print("No forgery evidence file found")

# Create corpus
corpus = texts.groupby('manuscript_id').agg(
    text=('text', ' '.join)
).reset_index()

corpus = corpus.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre', 'authenticity_status',
                 'date_composed', 'date_archived']],
    on='manuscript_id', how='left'
)

print(f"\nTotal documents in corpus: {len(corpus)}")

In [None]:
# Identify the suspected forgeries
suspected = corpus[corpus['authenticity_status'] == 'suspected_forgery']

print(f"Suspected forgeries: {len(suspected)}")
print("\nSuspected manuscripts:")
for _, row in suspected.iterrows():
    print(f"  {row['manuscript_id']}: {row['title']}")
    print(f"    Attributed to: {row['author']}")
    print(f"    Claimed date: {row['date_composed']}")
    print()

In [None]:
# Find verified Grigsu documents for comparison
grigsu_verified = corpus[
    (corpus['author'] == 'Grigsu Haldo') & 
    (corpus['authenticity_status'] == 'verified')
]

print(f"Verified Grigsu documents: {len(grigsu_verified)}")
for _, row in grigsu_verified.iterrows():
    print(f"  {row['manuscript_id']}: {row['title'][:50]}...")

## 10.2 Vocabulary Analysis

First line of investigation: Do the suspected forgeries use vocabulary consistent with authentic Grigsu?
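
One caveat before comparing type-token ratios: TTR tends to fall as a text gets longer, so groups with very different word counts are not directly comparable. The sketch below shows one possible workaround, averaging TTR over equal-sized chunks. The function name `chunked_ttr`, the chunk size, and the toy token lists are illustrative assumptions, not part of the course code.

```python
def chunked_ttr(tokens, chunk_size=50):
    """Mean type-token ratio over consecutive chunks of equal size.

    Trailing tokens that do not fill a whole chunk are dropped so every
    chunk has the same length. Returns None if there is no full chunk.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    return sum(ratios) / len(ratios) if ratios else None


# Toy data (hypothetical): the same 10-word vocabulary, repeated.
vocab = ["stone", "holds", "the", "word", "and", "water", "lets", "it", "go", "slowly"]
short_text = vocab * 5      # 50 tokens
long_text = vocab * 50      # 500 tokens

# Plain TTR shrinks with length even though the vocabulary is identical.
plain_short = len(set(short_text)) / len(short_text)
plain_long = len(set(long_text)) / len(long_text)
print(f"plain TTR:   short={plain_short:.3f}, long={plain_long:.3f}")

# Chunked TTR is computed on equal-sized windows, so length matters less.
print(f"chunked TTR: short={chunked_ttr(short_text):.3f}, long={chunked_ttr(long_text):.3f}")
```

In the notebook, the same function could be applied to the alphabetic tokens of each group before reading much into a TTR gap.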

In [None]:
def get_vocabulary_stats(texts_series, label=""):
    """
    Get vocabulary statistics for a set of texts.
    """
    all_text = ' '.join(texts_series)
    tokens = word_tokenize(all_text.lower())
    tokens = [t for t in tokens if t.isalpha()]
    
    word_freq = Counter(tokens)
    total_words = len(tokens)
    unique_words = len(word_freq)
    
    return {
        'label': label,
        'total_words': total_words,
        'unique_words': unique_words,
        'ttr': unique_words / total_words if total_words > 0 else 0,
        'word_freq': word_freq
    }

In [None]:
# Analyze vocabulary for different groups
grigsu_auth_vocab = get_vocabulary_stats(grigsu_verified['text'], "Authentic Grigsu")
suspected_vocab = get_vocabulary_stats(suspected['text'], "Suspected Forgeries")

# Also compare to known water-school authors
yasho_docs = corpus[corpus['author'] == 'Yasho Krent']
yasho_vocab = get_vocabulary_stats(yasho_docs['text'], "Yasho (Water-School)")

print("Vocabulary comparison:")
print(f"\n{'Group':<25} {'Words':<10} {'Unique':<10} {'TTR':<10}")
print("-" * 55)
for v in [grigsu_auth_vocab, suspected_vocab, yasho_vocab]:
    print(f"{v['label']:<25} {v['total_words']:<10} {v['unique_words']:<10} {v['ttr']:.3f}")

In [None]:
# Key term analysis: water-school vs stone-school vocabulary
stone_school_terms = ['stone', 'permanent', 'persist', 'hard', 'casting', 'village', 'grandmother']
water_school_terms = ['dissolution', 'dissolve', 'pool', 'flow', 'residue', 'collective', 'water']

def count_term_frequency(word_freq, terms, total_words):
    """Count frequency of terms per 1000 words."""
    count = sum(word_freq.get(t, 0) for t in terms)
    return count / total_words * 1000 if total_words > 0 else 0

print("\nSchool-specific vocabulary (per 1000 words):")
print(f"\n{'Group':<25} {'Stone-school':<15} {'Water-school':<15}")
print("-" * 55)

for v in [grigsu_auth_vocab, suspected_vocab, yasho_vocab]:
    stone_freq = count_term_frequency(v['word_freq'], stone_school_terms, v['total_words'])
    water_freq = count_term_frequency(v['word_freq'], water_school_terms, v['total_words'])
    print(f"{v['label']:<25} {stone_freq:<15.2f} {water_freq:<15.2f}")

In [None]:
# Detailed term analysis
print("\nDetailed term frequencies (per 1000 words):")
print(f"\n{'Term':<15} {'Auth. Grigsu':<15} {'Suspected':<15} {'Yasho':<15}")
print("-" * 60)

key_terms = ['dissolution', 'stone', 'pool', 'persist', 'word', 'meaning', 'flow']
for term in key_terms:
    g_freq = grigsu_auth_vocab['word_freq'].get(term, 0) / grigsu_auth_vocab['total_words'] * 1000 if grigsu_auth_vocab['total_words'] > 0 else 0
    s_freq = suspected_vocab['word_freq'].get(term, 0) / suspected_vocab['total_words'] * 1000 if suspected_vocab['total_words'] > 0 else 0
    y_freq = yasho_vocab['word_freq'].get(term, 0) / yasho_vocab['total_words'] * 1000 if yasho_vocab['total_words'] > 0 else 0
    print(f"{term:<15} {g_freq:<15.2f} {s_freq:<15.2f} {y_freq:<15.2f}")

## 10.3 Stylometric Analysis

Second line of investigation: Do the suspected forgeries match Grigsu's writing style?
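
A difference in a style metric only means something relative to how much authentic texts vary among themselves. As a rough sketch of that idea, the snippet below expresses a suspect's value as a z-score against the authentic group. The function name `z_score` and all numbers are hypothetical, and with only a handful of verified documents the spread estimate is itself unreliable.

```python
from statistics import mean, stdev


def z_score(value, reference_values):
    """How many sample standard deviations `value` lies from the reference mean.

    Returns None when fewer than two reference values exist or when they
    are all identical, since the spread cannot be estimated then.
    """
    if len(reference_values) < 2:
        return None
    spread = stdev(reference_values)
    if spread == 0:
        return None
    return (value - mean(reference_values)) / spread


# Hypothetical average sentence lengths, for illustration only.
authentic_lengths = [18.0, 20.0, 22.0]
suspect_length = 26.0

z = z_score(suspect_length, authentic_lengths)
print(f"z = {z:.2f}")
```

A large absolute z-score on several independent metrics is more persuasive than a large gap on any single one.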

In [None]:
# Reuse stylometric features from Tutorial 7
FUNCTION_WORDS = ['the', 'a', 'an', 'and', 'or', 'but', 'if', 'that', 'which', 
                   'is', 'are', 'was', 'were', 'be', 'been', 'have', 'has', 'had',
                   'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may',
                   'might', 'must', 'shall', 'can', 'to', 'of', 'in', 'for', 'on',
                   'with', 'at', 'by', 'from', 'as', 'into', 'through', 'during',
                   'before', 'after', 'above', 'below', 'between', 'under', 'again',
                   'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
                   'how', 'all', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
                   'no', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
                   'just', 'now', 'i', 'we', 'you', 'he', 'she', 'it', 'they', 'this']

def extract_style_features(text):
    """Extract stylometric features from text."""
    features = {}
    
    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())
    words_alpha = [w for w in words if w.isalpha()]
    
    if len(words_alpha) == 0:
        return features
    
    # Sentence features
    sent_lengths = [len(word_tokenize(s)) for s in sentences]
    features['avg_sentence_length'] = np.mean(sent_lengths) if sent_lengths else 0
    features['std_sentence_length'] = np.std(sent_lengths) if sent_lengths else 0
    
    # Word features
    word_lengths = [len(w) for w in words_alpha]
    features['avg_word_length'] = np.mean(word_lengths)
    
    # Vocabulary
    features['type_token_ratio'] = len(set(words_alpha)) / len(words_alpha)
    
    # Function words
    word_freq = Counter(words_alpha)
    for fw in FUNCTION_WORDS[:20]:  # First 20 entries of FUNCTION_WORDS (list order, not ranked by frequency)
        features[f'fw_{fw}'] = word_freq.get(fw, 0) / len(words_alpha) * 100
    
    return features

In [None]:
# Extract features for all relevant documents
def get_group_features(df, label):
    features_list = []
    for _, row in df.iterrows():
        feats = extract_style_features(row['text'])
        feats['manuscript_id'] = row['manuscript_id']
        feats['group'] = label
        features_list.append(feats)
    return pd.DataFrame(features_list)

auth_grigsu_features = get_group_features(grigsu_verified, 'Authentic Grigsu')
suspected_features = get_group_features(suspected, 'Suspected Forgery')
yasho_features = get_group_features(yasho_docs, 'Yasho')

all_features = pd.concat([auth_grigsu_features, suspected_features, yasho_features])

In [None]:
# Compare key style metrics
style_metrics = ['avg_sentence_length', 'avg_word_length', 'type_token_ratio', 
                 'fw_the', 'fw_that', 'fw_is', 'fw_and']

style_comparison = all_features.groupby('group')[style_metrics].mean().round(3)
print("\nStylometric comparison:")
print(style_comparison)

In [None]:
# Visualize style differences
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

metrics_to_plot = ['avg_sentence_length', 'avg_word_length', 'type_token_ratio', 'fw_the']
titles = ['Average Sentence Length', 'Average Word Length', 'Type-Token Ratio', 'Frequency of "the"']

for ax, metric, title in zip(axes.flat, metrics_to_plot, titles):
    data = [all_features[all_features['group'] == g][metric].values for g in all_features['group'].unique()]
    ax.boxplot(data, labels=all_features['group'].unique())
    ax.set_title(title)
    ax.tick_params(axis='x', rotation=45)

plt.suptitle('Stylometric Comparison', y=1.02)
plt.tight_layout()
plt.show()

## 10.4 Document Similarity Analysis

Third line of investigation: Which documents are the suspected forgeries most similar to?
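
A top-five list is easier to weigh next to one number per group: the mean similarity between a suspect and each author's documents. The sketch below illustrates that averaging with raw term counts and a hand-written cosine function so that it runs without extra libraries; it is not the TF-IDF pipeline used in this section. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_similarity_by_group(suspect, grouped_docs):
    """Average cosine similarity of `suspect` to each group's documents."""
    suspect_vec = Counter(suspect.split())
    result = {}
    for group, docs in grouped_docs.items():
        sims = [cosine(suspect_vec, Counter(d.split())) for d in docs]
        result[group] = sum(sims) / len(sims) if sims else 0.0
    return result


# Toy corpus (hypothetical text, not from the archive data).
grouped = {
    "stone": ["stone words persist", "the stone holds the word"],
    "water": ["words dissolve in the pool", "meaning flows into the pool"],
}
suspect_doc = "the word must dissolve into the pool"

scores = mean_similarity_by_group(suspect_doc, grouped)
closest = max(scores, key=scores.get)
print(scores, "-> closest group:", closest)
```

In the notebook, the same per-group average could be taken over the relevant entries of a suspect's row in the `similarity` matrix.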

In [None]:
# Create TF-IDF for similarity analysis
tfidf = TfidfVectorizer(max_features=500, stop_words='english')
tfidf_matrix = tfidf.fit_transform(corpus['text'])

# Calculate similarity matrix
similarity = cosine_similarity(tfidf_matrix)

In [None]:
# For each suspected forgery, find most similar documents
print("Similarity analysis of suspected forgeries:")
print("=" * 70)

for _, row in suspected.iterrows():
    doc_idx = corpus[corpus['manuscript_id'] == row['manuscript_id']].index[0]
    sims = similarity[doc_idx]
    
    # Get top similar docs (excluding self by index, so tied scores cannot drop the wrong row)
    top_indices = [i for i in sims.argsort()[::-1] if i != doc_idx][:5]
    
    print(f"\n{row['manuscript_id']}: {row['title'][:50]}...")
    print(f"  Attributed to: {row['author']}")
    print(f"  Most similar to:")
    
    for idx in top_indices:
        sim_doc = corpus.iloc[idx]
        print(f"    {sims[idx]:.3f} - {sim_doc['title'][:40]}... by {sim_doc['author']}")

## 10.5 Visualizing the Evidence

Let's create a visualization showing where the suspected forgeries fall in document space.
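
A scatter plot invites eyeballing, so it can help to back the visual impression with a number. The sketch below measures how far a point lies from each group's centroid in 2D. The coordinates are hypothetical stand-ins for PCA output, and the helper names are assumptions. Because two components may capture only part of the variance, these distances are approximate.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def nearest_group(point, groups):
    """Return the group whose centroid is closest to `point`, plus all distances."""
    distances = {name: math.dist(point, centroid(pts)) for name, pts in groups.items()}
    return min(distances, key=distances.get), distances


# Hypothetical 2D coordinates standing in for PCA output.
groups = {
    "Authentic Grigsu": [(-1.0, 0.0), (-0.8, 0.2), (-1.2, -0.2)],
    "Yasho (Water-School)": [(1.0, 0.0), (0.8, -0.2), (1.2, 0.2)],
}
suspect_point = (0.6, 0.1)

label, dists = nearest_group(suspect_point, groups)
for name, d in dists.items():
    print(f"{name}: {d:.3f}")
print("Nearest centroid:", label)
```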

In [None]:
# Reduce to 2D using PCA
pca = PCA(n_components=2)
coords = pca.fit_transform(tfidf_matrix.toarray())

corpus['pca_x'] = coords[:, 0]
corpus['pca_y'] = coords[:, 1]

# Categorize documents
def categorize_doc(row):
    if row['authenticity_status'] == 'suspected_forgery':
        return 'Suspected Forgery'
    elif row['author'] == 'Grigsu Haldo':
        return 'Authentic Grigsu'
    elif row['author'] == 'Yasho Krent':
        return 'Yasho (Water-School)'
    else:
        return 'Other'

corpus['category'] = corpus.apply(categorize_doc, axis=1)

In [None]:
# Plot
fig, ax = plt.subplots(figsize=(12, 8))

colors = {'Authentic Grigsu': 'green', 'Suspected Forgery': 'red', 
          'Yasho (Water-School)': 'blue', 'Other': 'gray'}
sizes = {'Authentic Grigsu': 150, 'Suspected Forgery': 200, 
         'Yasho (Water-School)': 150, 'Other': 50}

for cat in ['Other', 'Yasho (Water-School)', 'Authentic Grigsu', 'Suspected Forgery']:
    mask = corpus['category'] == cat
    ax.scatter(corpus.loc[mask, 'pca_x'], corpus.loc[mask, 'pca_y'],
               c=colors[cat], s=sizes[cat], label=cat, alpha=0.7)

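# Add labels for suspected forgeries
# (filter `corpus` here: `suspected` was created before the PCA columns were added)
for _, row in corpus[corpus['category'] == 'Suspected Forgery'].iterrows():
    ax.annotate(row['manuscript_id'], (row['pca_x'], row['pca_y']),
                xytext=(5, 5), textcoords='offset points', fontsize=8)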

ax.set_xlabel('PCA Component 1')
ax.set_ylabel('PCA Component 2')
ax.set_title('Document Space: Where Do the Suspected Forgeries Cluster?')
ax.legend()

plt.tight_layout()
plt.show()

## 10.6 Examining the Forgery Evidence File

If available, let's examine the pre-compiled evidence.

In [None]:
if forgery_evidence is not None:
    print("Forgery evidence from the archives:")
    print("=" * 70)
    
    # Show evidence for suspected documents
    for ms_id in suspected['manuscript_id'].values:
        ms_evidence = forgery_evidence[forgery_evidence['manuscript_id'] == ms_id]
        if len(ms_evidence) > 0:
            print(f"\n{ms_id}:")
            for _, row in ms_evidence.iterrows():
                print(f"  [{row['evidence_type']}] {row['description']}")
else:
    print("No forgery evidence file available.")

## 10.7 Building the Case

Let's summarize all the evidence.
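
Most of the findings printed by the summary cell are fixed strings, so they read the same whatever the analysis returned. As a sketch of an alternative, a finding can be generated from the measured values themselves. The function name `describe_rate` and the rates below are hypothetical.

```python
def describe_rate(label, authentic, suspected, reference):
    """Say which baseline the suspected rate sits closer to."""
    gap_to_authentic = abs(suspected - authentic)
    gap_to_reference = abs(suspected - reference)
    if gap_to_authentic < gap_to_reference:
        closer = "authentic Grigsu"
    elif gap_to_reference < gap_to_authentic:
        closer = "the water-school reference"
    else:
        closer = "neither (equidistant)"
    return (f"{label}: authentic={authentic:.2f}, suspected={suspected:.2f}, "
            f"reference={reference:.2f} -> closer to {closer}")


# Hypothetical per-1000-word rates, for illustration only.
line = describe_rate("Water-school terms", authentic=1.5, suspected=9.0, reference=11.0)
print(line)
```

Feeding the function the rates computed in 10.2 would tie the summary to the actual output rather than to an expected result.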

In [None]:
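# Compile evidence summary
# Note: most statements below are fixed strings, not computed results.
# Check each one against your output from sections 10.2-10.4 before relying on it.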
print("="*70)
print("INVESTIGATION SUMMARY: THE SUSPECTED FORGERIES")
print("="*70)

print("\n1. VOCABULARY EVIDENCE")
print("-"*40)
print("The suspected forgeries show vocabulary patterns inconsistent with")
print("authentic Grigsu and more consistent with water-school writings.")
print("\n   Key finding: Water-school terminology appears at higher rates")
print("   in the suspected texts than in authentic Grigsu texts (see 10.2).")

print("\n2. STYLOMETRIC EVIDENCE")
print("-"*40)
print("The writing style of suspected forgeries differs from authentic Grigsu")
print("in measurable ways:")
if len(auth_grigsu_features) > 0 and len(suspected_features) > 0:
    auth_sent_len = auth_grigsu_features['avg_sentence_length'].mean()
    sus_sent_len = suspected_features['avg_sentence_length'].mean()
    print(f"   - Average sentence length: Authentic={auth_sent_len:.1f}, Suspected={sus_sent_len:.1f}")

print("\n3. SIMILARITY EVIDENCE")
print("-"*40)
print("The suspected forgeries cluster closer to water-school documents")
print("than to authentic Grigsu texts in document space.")

print("\n4. CONCLUSION")
print("-"*40)
print("Multiple lines of evidence suggest the suspected manuscripts are")
print("NOT authentic works of Grigsu Haldo. They appear to have been")
print("written by someone familiar with water-school philosophy who")
print("attempted to attribute stone-school recantations to Grigsu.")

## 10.8 Your Investigation Report

Write your own investigation report based on the evidence you've gathered.

### Investigation Report Template

**Date:** [Today's date]

**Investigator:** [Your name]

**Subject:** Authenticity of Manuscripts MS-0156, MS-0157, MS-0158, MS-0159

---

#### Executive Summary

[Write 2-3 sentences summarizing your conclusions]

#### Evidence Summary

**Vocabulary Analysis:**
- [List key findings]

**Stylometric Analysis:**
- [List key findings]

**Document Similarity:**
- [List key findings]

#### Conclusion

[Are the documents authentic? Why or why not?]

#### Possible Forger

[Based on the evidence, who might have created these documents?]

---

## 10.9 Conclusion

In this capstone investigation, you:

1. Used **vocabulary analysis** to detect school-inappropriate terminology
2. Applied **stylometric analysis** to compare writing patterns
3. Measured **document similarity** to see where texts cluster
4. Synthesized **multiple lines of evidence** into a coherent argument

Forensic linguists use techniques like these to detect forgeries, attribute authorship, and authenticate disputed documents.

---

*The Chief reads your report. She nods slowly. "Good work," she says. "The Senate will want to see this. The Mink estate will probably protest. There will be a hearing."*

*She taps the manuscripts. "But the evidence is clear. These are not Grigsu's words. Someone tried to rewrite history."*

*She looks at you. "Welcome to the archives. The work continues."*