# Tutorial 7: The Voice of Grigsu

## The Capital Archives — A Course in Natural Language Processing

---

*Several manuscripts claim Grigsu Haldo as their author. Some were found among his personal papers after his death. Others appeared years later, "discovered" by various archivists. The Chief is skeptical of the later discoveries. "A scholar's voice is distinctive," she says. "Can you identify the imposters by measuring the voice?"*

*This is the problem of **authorship attribution**: determining who wrote an anonymous or disputed text.*

---

In this tutorial, you will learn:
- Stylometric features for authorship
- Function words as author fingerprints
- Building a classifier for authorship
- Evaluating and interpreting results

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab
# Skip this cell if running locally

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    
    # Install/download NLTK data
    import nltk
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('averaged_perceptron_tagger', quiet=True)
    nltk.download('averaged_perceptron_tagger_eng', quiet=True)
    print("✓ Repository cloned and NLTK data downloaded!")
else:
    print("✓ Running locally - no setup needed")

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# NLP
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag

# Machine learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('punkt_tab', quiet=True)

import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded.")

In [None]:
# Load corpus
manuscripts = pd.read_csv('data/manuscripts.csv')
texts = pd.read_csv('data/manuscript_texts.csv')

corpus = texts.groupby('manuscript_id').agg(
    text=('text', ' '.join)
).reset_index()

corpus = corpus.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre', 'authenticity_status']],
    on='manuscript_id', how='left'
)

print(f"Loaded {len(corpus)} documents")

## 7.1 What is Stylometry?

**Stylometry** is the quantitative study of writing style. The key insight: 

> Authors have unconscious habits that are difficult to fake. The words they choose, the sentence structures they prefer, even how they use common words—all create a distinctive "fingerprint."

### Key Stylometric Features

1. **Sentence length** (average, variation)
2. **Word length** (average, distribution)
3. **Vocabulary richness** (type-token ratio)
4. **Function word frequencies** (the, and, of, to, ...)
5. **POS tag distributions**
6. **Punctuation patterns**
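
To make the first three features concrete, here is a minimal sketch on a toy passage. It uses regular expressions in place of a real tokenizer, which is a simplification, and the helper name `toy_style_features` is hypothetical rather than part of the course code.

```python
import re
from statistics import mean

def toy_style_features(text):
    """Compute a few stylometric features with regex tokenization (a simplification)."""
    # Split on sentence-final punctuation; real tokenizers handle abbreviations better
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    return {
        # Mean number of alphabetic words per sentence
        'avg_sentence_length': mean(len(re.findall(r"[a-z]+", s.lower())) for s in sentences),
        # Mean number of characters per word
        'avg_word_length': mean(len(w) for w in words),
        # Distinct words divided by total words
        'type_token_ratio': len(set(words)) / len(words),
    }

sample = "The river is old. The river is slow and the river is wide."
features = toy_style_features(sample)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

The passage repeats "the river is" three times, so its type-token ratio is low even though the text is short. The sketch assumes non-empty input; an empty string would raise an error.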

In [None]:
# Let's define stylometric features

# Common function words (author fingerprint)
FUNCTION_WORDS = [
    'the', 'a', 'an', 'and', 'or', 'but', 'if', 'then', 'because', 'as',
    'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
    'between', 'into', 'through', 'during', 'before', 'after', 'above',
    'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
    'under', 'again', 'further', 'once', 'here', 'there', 'when',
    'where', 'why', 'how', 'all', 'each', 'few', 'more', 'most', 'other',
    'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
    'than', 'too', 'very', 'can', 'will', 'just', 'should', 'now', 'i',
    'we', 'you', 'he', 'she', 'it', 'they', 'what', 'which', 'who', 'whom',
    'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were',
    'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
    'did', 'doing', 'would', 'could', 'ought', 'might', 'must', 'shall'
]
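
Function words make useful fingerprints because they are frequent and depend little on topic. The sketch below compares two made-up sentences using a tiny illustrative subset of the list; `SMALL_FUNCTION_WORDS` and `function_word_rates` are hypothetical names, and the regex tokenizer is a stand-in for NLTK.

```python
import re
from collections import Counter

# A tiny illustrative subset; the tutorial's FUNCTION_WORDS list is much longer
SMALL_FUNCTION_WORDS = ['the', 'of', 'and', 'to', 'that']

def function_word_rates(text, function_words=SMALL_FUNCTION_WORDS):
    """Return occurrences per 100 words for each function word."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {fw: counts[fw] / len(words) * 100 for fw in function_words}

# Two invented sentences with different habits
text_a = "The nature of the soul and the order of the world are known to the wise."
text_b = "I say that we ought to doubt, and that to doubt is to begin."

rates_a = function_word_rates(text_a)
rates_b = function_word_rates(text_b)
for fw in SMALL_FUNCTION_WORDS:
    print(f"{fw:<6} A: {rates_a[fw]:5.1f}   B: {rates_b[fw]:5.1f}")
```

Text A leans on "the" and "of", while text B never uses either and prefers "to" and "that". Single sentences are far too short for reliable rates; the point is only to show the shape of the feature vector.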

In [None]:
def extract_stylometric_features(text):
    """
    Extract stylometric features from a text.
    
    Returns:
    --------
    dict : Feature name -> value
    """
    features = {}
    
    # Tokenize
    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())
    words_alpha = [w for w in words if w.isalpha()]
    
    if len(words_alpha) == 0:
        return {}
    
    # --- Sentence-level features ---
    sentence_lengths = [len(word_tokenize(s)) for s in sentences]
    if len(sentence_lengths) > 0:
        features['avg_sentence_length'] = np.mean(sentence_lengths)
        features['std_sentence_length'] = np.std(sentence_lengths)
    else:
        features['avg_sentence_length'] = 0
        features['std_sentence_length'] = 0
    
    # --- Word-level features ---
    word_lengths = [len(w) for w in words_alpha]
    features['avg_word_length'] = np.mean(word_lengths)
    features['std_word_length'] = np.std(word_lengths)
    
    # Vocabulary richness
    features['type_token_ratio'] = len(set(words_alpha)) / len(words_alpha)
    
    # --- Function word frequencies ---
    word_freq = Counter(words_alpha)
    total_words = len(words_alpha)
    
    for fw in FUNCTION_WORDS:
        features[f'fw_{fw}'] = word_freq.get(fw, 0) / total_words * 100
    
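    # --- POS tag distribution ---
    # Tags are matched exactly, so 'NN' excludes 'NNS' and 'VB' excludes 'VBD', 'VBZ', etc.
    tagged = pos_tag(words_alpha[:1000])  # First 1000 words only, for speed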
    pos_freq = Counter(tag for _, tag in tagged)
    total_tags = len(tagged)
    
    for pos in ['NN', 'VB', 'JJ', 'RB', 'IN', 'DT', 'PRP']:
        features[f'pos_{pos}'] = pos_freq.get(pos, 0) / total_tags * 100
    
    # --- Punctuation ---
    # Marks per sentence; the +1 in the denominator guards against division by zero
    features['question_ratio'] = text.count('?') / (len(sentences) + 1)
    features['exclamation_ratio'] = text.count('!') / (len(sentences) + 1)
    
    return features
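
One caveat about `type_token_ratio`: it tends to fall as a text gets longer, because common words keep repeating while new words become rarer. Documents of very different lengths are therefore hard to compare on this feature. A possible workaround, sketched here under the assumption that a fixed-window average is acceptable, is to compute the ratio over equal-sized chunks. The helper `windowed_ttr` is hypothetical and not part of the course code.

```python
def type_token_ratio(words):
    """Distinct words divided by total words."""
    return len(set(words)) / len(words)

def windowed_ttr(words, window=50):
    """Mean TTR over consecutive non-overlapping windows of `window` words."""
    chunks = [words[i:i + window] for i in range(0, len(words) - window + 1, window)]
    if not chunks:
        return type_token_ratio(words)  # text shorter than one window
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

short_words = "the archive holds many scrolls and the scrolls hold many names".split()
long_words = short_words * 20  # same vocabulary, twenty times the length (artificial)

print("Plain TTR:   ", type_token_ratio(short_words), type_token_ratio(long_words))
print("Windowed TTR:", windowed_ttr(short_words, window=11), windowed_ttr(long_words, window=11))
```

The plain ratio collapses for the longer text even though the vocabulary is identical, while the windowed version stays the same. Repeating a sentence is an artificial way to lengthen a text, so treat this as an illustration of the effect rather than a measurement.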

In [None]:
# Test feature extraction
sample_doc = corpus.iloc[0]
features = extract_stylometric_features(sample_doc['text'])

print(f"Features for '{sample_doc['title'][:40]}...':")
for name, value in list(features.items())[:15]:
    print(f"  {name}: {value:.3f}")
print(f"  ... and {len(features) - 15} more features")

In [None]:
# Extract features for all documents
all_features = []
for _, row in corpus.iterrows():
    feats = extract_stylometric_features(row['text'])
    feats['manuscript_id'] = row['manuscript_id']
    feats['author'] = row['author']
    feats['authenticity'] = row['authenticity_status']
    all_features.append(feats)

features_df = pd.DataFrame(all_features)
print(f"Extracted {len(features_df.columns) - 3} features for {len(features_df)} documents")

## 7.2 Comparing Authors' Styles

In [None]:
# Compare key features across authors
key_features = ['avg_sentence_length', 'avg_word_length', 'type_token_ratio',
                'fw_the', 'fw_and', 'fw_that', 'fw_is']

author_styles = features_df.groupby('author')[key_features].mean()

# Show top authors
top_authors = features_df['author'].value_counts().head(8).index
print("Style comparison (top authors):")
print(author_styles.loc[top_authors].round(2))

In [None]:
# Visualize style differences
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Filter to authors with enough documents
author_counts = features_df['author'].value_counts()
multi_doc_authors = author_counts[author_counts >= 2].index[:8]
subset = features_df[features_df['author'].isin(multi_doc_authors)]

# Plot 1: Sentence length
subset.boxplot(column='avg_sentence_length', by='author', ax=axes[0, 0])
axes[0, 0].set_title('Average Sentence Length')
axes[0, 0].tick_params(axis='x', rotation=45)

# Plot 2: Word length
subset.boxplot(column='avg_word_length', by='author', ax=axes[0, 1])
axes[0, 1].set_title('Average Word Length')
axes[0, 1].tick_params(axis='x', rotation=45)

# Plot 3: Type-token ratio
subset.boxplot(column='type_token_ratio', by='author', ax=axes[1, 0])
axes[1, 0].set_title('Type-Token Ratio (Vocabulary Richness)')
axes[1, 0].tick_params(axis='x', rotation=45)

# Plot 4: Function word 'the'
subset.boxplot(column='fw_the', by='author', ax=axes[1, 1])
axes[1, 1].set_title('Frequency of "the" (%)')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.suptitle('Stylometric Profiles by Author', y=1.02)
plt.tight_layout()
plt.show()

## 7.3 Building an Authorship Classifier

In [None]:
# Prepare data for classification
# Use only verified documents, and only authors with at least two of them.
# Counting after the authenticity filter ensures every author in the
# training set really has multiple verified documents.
verified_df = features_df[features_df['authenticity'] == 'verified']
author_counts = verified_df['author'].value_counts()
valid_authors = author_counts[author_counts >= 2].index.tolist()

train_df = verified_df[verified_df['author'].isin(valid_authors)].copy()

print(f"Training on {len(train_df)} verified documents from {len(valid_authors)} authors")

In [None]:
# Prepare features and labels
feature_cols = [c for c in features_df.columns 
                if c not in ['manuscript_id', 'author', 'authenticity']]

X = train_df[feature_cols].fillna(0).values
y = train_df['author'].values

print(f"Feature matrix shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")

In [None]:
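# Scale features (random forests do not need scaling, but it keeps the
# matrix ready for scale-sensitive models such as SVC)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train classifier with cross-validation
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Use cross-validation if we have enough data.
# Stratified folds work best when every author has at least n_splits documents
# (scikit-learn raises an error if no author does), so cap the fold count
# at the smallest class size.
if len(train_df) >= 10:
    min_class_size = int(pd.Series(y).value_counts().min())
    n_folds = max(2, min(5, min_class_size))
    cv_scores = cross_val_score(clf, X_scaled, y, cv=n_folds)
    print(f"Cross-validation accuracy ({n_folds}-fold): {cv_scores.mean():.2%} (+/- {cv_scores.std()*2:.2%})")
else:
    print("Not enough data for cross-validation")

# Train final model on all data
clf.fit(X_scaled, y)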
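
A note on evaluation: fitting the scaler on the full matrix before cross-validation lets each test fold influence the scaling statistics. For a random forest this makes little practical difference, but for scale-sensitive models it can inflate scores. The sketch below shows one way to avoid it by wrapping the scaler and classifier in a scikit-learn pipeline. The data is synthetic and the names `X_demo`, `y_demo`, and the `author_*` labels are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 3 "authors", 10 documents each
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(10, 5))
    for center in (0.0, 2.0, 4.0)
])
y_demo = np.repeat(['author_a', 'author_b', 'author_c'], 10)

# Inside cross_val_score the whole pipeline is refit per fold,
# so the scaler only ever sees that fold's training documents
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(model, X_demo, y_demo, cv=folds)

print(f"Fold accuracies: {np.round(demo_scores, 2)}")
print(f"Mean accuracy: {demo_scores.mean():.2%}")
```

The synthetic classes are well separated by construction, so the scores here say nothing about the archive data. The pattern is the useful part: pass the pipeline, not a pre-scaled matrix, to `cross_val_score`.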

In [None]:
# Which features are most important?
importances = pd.Series(clf.feature_importances_, index=feature_cols)
top_features = importances.nlargest(15)

fig, ax = plt.subplots(figsize=(10, 6))
top_features.plot(kind='barh', ax=ax, color='steelblue')
ax.set_xlabel('Feature Importance')
ax.set_title('Most Important Features for Authorship Attribution')
plt.tight_layout()
plt.show()

## 7.4 Testing Disputed Documents

In [None]:
# Find suspected forgeries
disputed = features_df[features_df['authenticity'] == 'suspected_forgery'].copy()

print(f"Suspected forgeries: {len(disputed)}")
if len(disputed) > 0:
    print(disputed[['manuscript_id', 'author']].to_string())

In [None]:
if len(disputed) > 0:
    # Predict authorship of disputed documents
    X_disputed = disputed[feature_cols].fillna(0).values
    X_disputed_scaled = scaler.transform(X_disputed)
    
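    # Get predictions and class probabilities.
    # The classifier can only choose among the authors it was trained on,
    # so a forger outside that set is still assigned to some known author.
    predictions = clf.predict(X_disputed_scaled)
    probabilities = clf.predict_proba(X_disputed_scaled)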
    
    print("\nAuthorship analysis of suspected forgeries:")
    print("="*70)
    
    for i, (_, row) in enumerate(disputed.iterrows()):
        print(f"\nDocument: {row['manuscript_id']}")
        print(f"  Claimed author: {row['author']}")
        print(f"  Predicted author: {predictions[i]}")
        print(f"  Confidence: {max(probabilities[i]):.1%}")
        
        # Show top 3 most likely authors
        prob_ranking = sorted(zip(clf.classes_, probabilities[i]), 
                             key=lambda x: -x[1])[:3]
        print("  Top candidates:")
        for author, prob in prob_ranking:
            marker = " <- claimed" if author == row['author'] else ""
            print(f"    {author}: {prob:.1%}{marker}")
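
When reading this output, keep in mind that the classifier always names one of its known authors, even when the true writer is someone else. One cautious way to interpret the probabilities, sketched below, is to treat low top probabilities as inconclusive instead of as an attribution. The function `assess_attribution`, the cutoff `min_prob=0.5`, the names "Author B" and "Author C", and all the numbers are hypothetical.

```python
def assess_attribution(claimed_author, class_probs, min_prob=0.5):
    """
    Summarize one prediction.

    class_probs : dict mapping author name -> predicted probability
    min_prob    : hypothetical cutoff below which the result is inconclusive
    """
    if claimed_author not in class_probs:
        # The model has no verified examples of this author to compare against
        return 'claimed author not in training set'
    predicted, top_prob = max(class_probs.items(), key=lambda item: item[1])
    if top_prob < min_prob:
        return 'inconclusive'
    return 'consistent with claim' if predicted == claimed_author else 'contradicts claim'

# Made-up probabilities for illustration only
strong_match = {'Grigsu Haldo': 0.80, 'Author B': 0.15, 'Author C': 0.05}
strong_mismatch = {'Grigsu Haldo': 0.10, 'Author B': 0.75, 'Author C': 0.15}
weak_result = {'Grigsu Haldo': 0.40, 'Author B': 0.35, 'Author C': 0.25}

for label, probs in [('strong match', strong_match),
                     ('strong mismatch', strong_mismatch),
                     ('weak result', weak_result)]:
    print(f"{label}: {assess_attribution('Grigsu Haldo', probs)}")
```

A "contradicts claim" result suggests the document resembles another known author more than the claimed one. It does not prove that author wrote it.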

## 7.5 Comparing Authentic vs. Disputed

In [None]:
# If we have disputed Grigsu documents, compare them to authentic Grigsu
if len(disputed) > 0:
    # Get any documents attributed to Grigsu (authentic or disputed)
    grigsu_authentic = features_df[
        (features_df['author'] == 'Grigsu Haldo') & 
        (features_df['authenticity'] == 'verified')
    ]
    
    grigsu_disputed = features_df[
        (features_df['author'].str.contains('Grigsu', na=False)) & 
        (features_df['authenticity'] == 'suspected_forgery')
    ]
    
    if len(grigsu_authentic) > 0 and len(grigsu_disputed) > 0:
        print(f"Authentic Grigsu documents: {len(grigsu_authentic)}")
        print(f"Disputed 'Grigsu' documents: {len(grigsu_disputed)}")
        
        # Compare key features
        compare_features = ['avg_sentence_length', 'avg_word_length', 
                           'type_token_ratio', 'fw_the', 'fw_that', 'fw_is']
        
        print("\nFeature comparison:")
        print(f"{'Feature':<25} {'Authentic':<12} {'Disputed':<12} {'Difference':<12}")
        print("-" * 60)
        
        for feat in compare_features:
            auth_mean = grigsu_authentic[feat].mean()
            disp_mean = grigsu_disputed[feat].mean()
            diff = disp_mean - auth_mean
            print(f"{feat:<25} {auth_mean:<12.3f} {disp_mean:<12.3f} {diff:+.3f}")
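
Raw differences are hard to compare across features, because sentence length is measured in words while type-token ratio lies between 0 and 1. One option, sketched below with made-up numbers, is to express each difference in units of the authentic documents' standard deviation. The helper `standardized_difference` is hypothetical, and with only a handful of authentic documents the standard deviation is itself a rough estimate.

```python
from statistics import mean, stdev

def standardized_difference(authentic_values, disputed_value):
    """How many authentic standard deviations the disputed value lies from the authentic mean."""
    spread = stdev(authentic_values)  # sample standard deviation; needs at least 2 values
    if spread == 0:
        return float('nan')  # no variation among authentic documents
    return (disputed_value - mean(authentic_values)) / spread

# Made-up values for illustration only
authentic_sentence_lengths = [20.0, 22.0, 24.0]
authentic_ttr = [0.40, 0.42, 0.44]

z_sentence = standardized_difference(authentic_sentence_lengths, 26.0)
z_ttr = standardized_difference(authentic_ttr, 0.46)

print(f"avg_sentence_length: raw diff = {26.0 - mean(authentic_sentence_lengths):+.2f}, standardized = {z_sentence:+.2f}")
print(f"type_token_ratio:    raw diff = {0.46 - mean(authentic_ttr):+.2f}, standardized = {z_ttr:+.2f}")
```

The raw differences differ by a factor of 100, yet both disputed values sit about two standard deviations above the authentic mean. On the standardized scale they are equally unusual.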

## 7.6 Summary

In this tutorial, you learned:

1. **Stylometric features**: Sentence length, word length, vocabulary richness, function words, POS tags, punctuation
2. **Author profiling**: Comparing stylistic features across authors
3. **Classification**: Training a model to identify authors
4. **Forgery detection**: Applying the model to disputed documents

### The Evidence So Far

Our stylometric analysis helps answer three questions about each disputed manuscript:
- Do they match the claimed author's style?
- Which author do they actually resemble?
- What specific features differ?

This is one piece of evidence. Combined with vocabulary analysis (Tutorial 6), n-gram patterns (Tutorial 4), and historical evidence, we can build a case for or against authenticity.

---

*"Every writer has a voice," the Chief says, examining your analysis. "The way they breathe between sentences. The words they reach for without thinking. The forger can copy ideas, but copying the breath? That's much harder."*

## Exercises

### Exercise 7.1: Additional Features
Add more stylometric features: contraction usage, sentence starters, comma frequency. Do they improve classification?

In [None]:
# YOUR CODE HERE


### Exercise 7.2: Different Classifiers
Try different classification algorithms (SVM, Naive Bayes, Neural Network). Which performs best?

In [None]:
# YOUR CODE HERE


### Exercise 7.3: School Attribution
Instead of author, try to classify documents by philosophical school. Is it easier or harder than author attribution?

In [None]:
# YOUR CODE HERE
