# Tutorial 5: The Shape of Arguments

## The Capital Archives — A Course in Natural Language Processing

---

*The water-school scholars believe that the structure of an argument reveals its maker's temperament. "A scholar who uses many nouns," Yasho wrote, "thinks in objects, in discrete entities. A scholar who favors verbs thinks in processes, in transformation. The grammar is the philosophy."*

*The Chief Archivist wants you to analyze the grammatical structure of the debate transcripts. Who uses more nouns? More verbs? More questions?*

---

In this tutorial, you will learn:
- Part-of-speech (POS) tagging
- Sentence structure analysis
- Comparing grammatical patterns across texts
- Using spaCy for advanced NLP

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab
# Skip this cell if running locally

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    
    # Install/download NLTK data
    import nltk
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('averaged_perceptron_tagger', quiet=True)
    nltk.download('averaged_perceptron_tagger_eng', quiet=True)
    print("✓ Repository cloned and NLTK data downloaded!")
else:
    print("✓ Running locally - no setup needed")

In [None]:
# Standard imports
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt

# NLP libraries
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("Libraries loaded.")

In [None]:
# Try to load spaCy (optional but recommended)
try:
    import spacy
    nlp = spacy.load('en_core_web_sm')
    SPACY_AVAILABLE = True
    print("spaCy loaded successfully.")
except (ImportError, OSError):  # spaCy not installed, or the model is not downloaded
    SPACY_AVAILABLE = False
    print("spaCy not available. Using NLTK only.")
    print("To install: pip install spacy && python -m spacy download en_core_web_sm")

In [None]:
# Load corpus
manuscripts = pd.read_csv('data/manuscripts.csv')
texts = pd.read_csv('data/manuscript_texts.csv')

corpus = texts.groupby('manuscript_id').agg(
    text=('text', ' '.join)
).reset_index()

corpus = corpus.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre']],
    on='manuscript_id', how='left'
)

print(f"Loaded {len(corpus)} documents")

## 5.1 What is Part-of-Speech Tagging?

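**Part-of-speech (POS) tagging** assigns a grammatical category to each token in a text. NLTK's default English tagger uses the Penn Treebank tag set, in which related tags share a prefix:
- **NN** = noun (also NNS, NNP, NNPS)
- **VB** = verb (also VBD, VBG, VBN, VBP, VBZ)
- **JJ** = adjective (also JJR, JJS)
- **RB** = adverb (also RBR, RBS)
- Other tags cover function words, for example **IN** (preposition), **DT** (determiner), and **PRP** (personal pronoun)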
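The Penn Treebank tags fall into families that share a prefix (`NN*` for nouns, `VB*` for verbs, and so on), and the later cells rely on this when they sum tags into word classes. A minimal sketch of that grouping follows; the helper name `coarse_class` is an assumption for illustration and is not part of NLTK.

```python
def coarse_class(tag):
    """Map a Penn Treebank tag to a coarse word class (hypothetical helper)."""
    prefixes = {
        'NN': 'noun',       # NN, NNS, NNP, NNPS
        'VB': 'verb',       # VB, VBD, VBG, VBN, VBP, VBZ
        'JJ': 'adjective',  # JJ, JJR, JJS
        'RB': 'adverb',     # RB, RBR, RBS
    }
    for prefix, label in prefixes.items():
        if tag.startswith(prefix):
            return label
    # Function words, punctuation, modals, and everything else
    return 'other'

# Hand-picked tags for illustration, not tagger output
for tag in ['NNP', 'VBD', 'JJS', 'RBR', 'IN']:
    print(f"{tag:4} -> {coarse_class(tag)}")
```

Note that modal verbs (`MD`) land in `other` here, which matches how the cells in this tutorial count verbs.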

In [None]:
# Basic POS tagging with NLTK
sample = "Grigsu argued that words persist like stones in the darkness."
tokens = word_tokenize(sample)
tagged = pos_tag(tokens)

print("POS-tagged sentence:")
for word, tag in tagged:
    print(f"  {word}: {tag}")

In [None]:
# POS tag reference
pos_explanations = {
    'NN': 'noun, singular',
    'NNS': 'noun, plural',
    'NNP': 'proper noun, singular',
    'NNPS': 'proper noun, plural',
    'VB': 'verb, base form',
    'VBD': 'verb, past tense',
    'VBG': 'verb, gerund/present participle',
    'VBN': 'verb, past participle',
    'VBP': 'verb, present, not 3rd person singular',
    'VBZ': 'verb, present, 3rd person singular',
    'JJ': 'adjective',
    'JJR': 'adjective, comparative',
    'JJS': 'adjective, superlative',
    'RB': 'adverb',
    'RBR': 'adverb, comparative',
    'RBS': 'adverb, superlative',
    'IN': 'preposition/subordinating conjunction',
    'DT': 'determiner',
    'PRP': 'personal pronoun',
}

print("Common POS tags:")
for tag, explanation in pos_explanations.items():
    print(f"  {tag}: {explanation}")

## 5.2 POS Tagging a Document

In [None]:
def get_pos_distribution(text):
    """
    Get the distribution of POS tags in a text.
    
    Returns:
    --------
    Counter : Counts of each POS tag
    """
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    tags = [tag for _, tag in tagged]
    return Counter(tags)

# Test on one document
sample_doc = corpus[corpus['author'] == 'Grigsu Haldo'].iloc[0]
pos_dist = get_pos_distribution(sample_doc['text'])

print(f"POS distribution in '{sample_doc['title'][:40]}...':")
for tag, count in pos_dist.most_common(15):
    print(f"  {tag}: {count}")
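Raw tag counts grow with document length, so they are hard to compare across manuscripts of different sizes. One way to make them comparable, sketched here with made-up counts and a hypothetical helper `to_percentages`, is to convert each count to a share of all tokens:

```python
from collections import Counter

def to_percentages(tag_counts):
    """Convert a Counter of tag counts to percentages of all tokens."""
    total = sum(tag_counts.values())
    if total == 0:
        return {}
    return {tag: count / total * 100 for tag, count in tag_counts.items()}

# Made-up counts for illustration, not taken from the corpus
example_counts = Counter({'NN': 50, 'VBD': 25, 'IN': 15, 'DT': 10})
for tag, pct in to_percentages(example_counts).items():
    print(f"  {tag}: {pct:.1f}%")
```

The same function accepts the `Counter` returned by `get_pos_distribution`.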

In [None]:
# Calculate noun/verb ratio
def noun_verb_ratio(text):
    """
    Calculate the ratio of nouns to verbs.
    Higher ratio = more noun-heavy ("thingness")
    Lower ratio = more verb-heavy ("process")
    """
    pos_dist = get_pos_distribution(text)
    
    nouns = sum(pos_dist.get(tag, 0) for tag in ['NN', 'NNS', 'NNP', 'NNPS'])
    verbs = sum(pos_dist.get(tag, 0) for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
    
    if verbs == 0:
        return float('inf')
    return nouns / verbs

# Test
print(f"Noun/verb ratio: {noun_verb_ratio(sample_doc['text']):.2f}")
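To see what the ratio measures, it helps to work one sentence by hand. The sketch below uses tags assigned manually to the sample sentence from section 5.1; they are an assumption for illustration, and the NLTK tagger may label some tokens differently.

```python
# Tags assigned by hand for illustration; a real tagger may differ
hand_tagged = [
    ('Grigsu', 'NNP'), ('argued', 'VBD'), ('that', 'IN'),
    ('words', 'NNS'), ('persist', 'VBP'), ('like', 'IN'),
    ('stones', 'NNS'), ('in', 'IN'), ('the', 'DT'),
    ('darkness', 'NN'), ('.', '.'),
]

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}
VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

nouns = sum(1 for _, tag in hand_tagged if tag in NOUN_TAGS)
verbs = sum(1 for _, tag in hand_tagged if tag in VERB_TAGS)

# Guard against division by zero for verbless fragments
ratio = nouns / verbs if verbs else float('nan')
print(f"{nouns} nouns / {verbs} verbs = {ratio:.2f}")
```

With these hand-assigned tags the sentence has four nouns and two verbs, so it leans toward the noun-heavy side of the scale.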

## 5.3 Comparing Authors

In [None]:
# Calculate POS statistics for each document
def analyze_document_grammar(text):
    """
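    Compute the share of nouns, verbs, adjectives, and adverbs in a text
    (as percentages of all tagged tokens, punctuation included) plus the
    noun/verb ratio. Returns an empty dict for empty text.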
    """
    pos_dist = get_pos_distribution(text)
    total = sum(pos_dist.values())
    
    if total == 0:
        return {}
    
    nouns = sum(pos_dist.get(tag, 0) for tag in ['NN', 'NNS', 'NNP', 'NNPS'])
    verbs = sum(pos_dist.get(tag, 0) for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
    adjectives = sum(pos_dist.get(tag, 0) for tag in ['JJ', 'JJR', 'JJS'])
    adverbs = sum(pos_dist.get(tag, 0) for tag in ['RB', 'RBR', 'RBS'])
    
    return {
        'noun_pct': nouns / total * 100,
        'verb_pct': verbs / total * 100,
        'adj_pct': adjectives / total * 100,
        'adv_pct': adverbs / total * 100,
        'noun_verb_ratio': nouns / verbs if verbs > 0 else np.nan
    }

In [None]:
# Analyze all documents
grammar_stats = []
for _, row in corpus.iterrows():
    stats = analyze_document_grammar(row['text'])
    stats['manuscript_id'] = row['manuscript_id']
    stats['author'] = row['author']
    stats['genre'] = row['genre']
    grammar_stats.append(stats)

grammar_df = pd.DataFrame(grammar_stats)
grammar_df.head(10)

In [None]:
# Compare authors
author_grammar = grammar_df.groupby('author').agg({
    'noun_pct': 'mean',
    'verb_pct': 'mean',
    'adj_pct': 'mean',
    'noun_verb_ratio': 'mean'
}).round(2)

print("Grammatical profiles by author:")
print(author_grammar.sort_values('noun_verb_ratio', ascending=False).head(15))

In [None]:
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Noun vs verb percentages by author
top_authors = grammar_df.groupby('author').size().nlargest(10).index
author_subset = grammar_df[grammar_df['author'].isin(top_authors)]

author_means = author_subset.groupby('author')[['noun_pct', 'verb_pct']].mean()
author_means.plot(kind='bar', ax=axes[0], color=['steelblue', 'coral'])
axes[0].set_xlabel('Author')
axes[0].set_ylabel('Percentage of tokens')
axes[0].set_title('Noun and Verb Usage by Author')
axes[0].legend(['Nouns', 'Verbs'])
axes[0].tick_params(axis='x', rotation=45)

# Noun/verb ratio distribution
axes[1].hist(grammar_df['noun_verb_ratio'].dropna(), bins=20, color='steelblue', edgecolor='white')
axes[1].set_xlabel('Noun/Verb Ratio')
axes[1].set_ylabel('Number of Documents')
axes[1].set_title('Distribution of Noun/Verb Ratios')

plt.tight_layout()
plt.show()

## 5.4 Sentence-Level Analysis

In [None]:
def analyze_sentences(text):
    """
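    Compute sentence-level features: sentence count, length statistics
    (in tokens, punctuation included), and the share of sentences that
    end in a question mark or exclamation mark.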
    """
    sentences = sent_tokenize(text)
    
    if len(sentences) == 0:
        return {}
    
    sentence_lengths = [len(word_tokenize(s)) for s in sentences]
    
    # Count questions
    questions = sum(1 for s in sentences if s.strip().endswith('?'))
    
    # Count exclamations
    exclamations = sum(1 for s in sentences if s.strip().endswith('!'))
    
    return {
        'num_sentences': len(sentences),
        'avg_sentence_length': np.mean(sentence_lengths),
        'std_sentence_length': np.std(sentence_lengths),
        'max_sentence_length': max(sentence_lengths),
        'min_sentence_length': min(sentence_lengths),
        'question_ratio': questions / len(sentences),
        'exclamation_ratio': exclamations / len(sentences)
    }
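The `endswith('?')` check misses a question whose final character is a closing quote or bracket, which is common in transcripts that quote speakers. A possible refinement, sketched with a hypothetical helper `ends_with_mark`, strips those trailing characters first. Whether a closing quote stays attached to its sentence depends on the sentence tokenizer, so treat this as an assumption to verify on the corpus.

```python
# Closing quotes and brackets that may follow the final punctuation mark
CLOSERS = '"\'\u201d\u2019)]'

def ends_with_mark(sentence, mark):
    """True if the sentence ends with `mark`, ignoring trailing quotes/brackets."""
    return sentence.strip().rstrip(CLOSERS).endswith(mark)

# Made-up sentences for illustration
sentences = [
    'Do words persist?',
    '"Do they dissolve?"',
    'They persist.',
    '(Is that so?)',
]

plain_count = sum(1 for s in sentences if s.strip().endswith('?'))
tolerant_count = sum(1 for s in sentences if ends_with_mark(s, '?'))
print(f"Plain check: {plain_count}, tolerant check: {tolerant_count}")
```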

In [None]:
# Analyze sentences for all documents
sentence_stats = []
for _, row in corpus.iterrows():
    stats = analyze_sentences(row['text'])
    stats['manuscript_id'] = row['manuscript_id']
    stats['author'] = row['author']
    stats['genre'] = row['genre']
    sentence_stats.append(stats)

sentence_df = pd.DataFrame(sentence_stats)
sentence_df.head(10)

In [None]:
# Who writes the longest sentences?
print("Average sentence length by author:")
author_sentences = sentence_df.groupby('author')['avg_sentence_length'].mean().sort_values(ascending=False)
print(author_sentences.head(10))

In [None]:
# Which genres use more questions?
print("\nQuestion ratio by genre:")
genre_questions = sentence_df.groupby('genre')['question_ratio'].mean().sort_values(ascending=False)
print(genre_questions)

## 5.5 Using spaCy for Advanced Analysis

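spaCy goes beyond POS tags: a single pipeline call also yields a dependency parse and named entities. Note that `token.pos_` uses the coarse Universal POS tag set (`NOUN`, `VERB`, ...), whereas the NLTK tags above are Penn Treebank tags; spaCy exposes its fine-grained tags as `token.tag_`.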

In [None]:
if SPACY_AVAILABLE:
    # Process a sample text with spaCy
    sample = "Grigsu argued that words persist like stones, but Yasho believed they dissolve."
    doc = nlp(sample)
    
    print("spaCy analysis:")
    for token in doc:
        print(f"  {token.text:12} {token.pos_:6} {token.dep_:10} {token.head.text}")
else:
    print("spaCy not available. Install it for advanced analysis.")

In [None]:
if SPACY_AVAILABLE:
    # Named entity recognition
    sample = "Grigsu traveled from the Capital to Yeller Quarry with Yasho and Bagbu in 869."
    doc = nlp(sample)
    
    print("Named entities:")
    for ent in doc.ents:
        print(f"  {ent.text}: {ent.label_}")

In [None]:
if SPACY_AVAILABLE:
    def spacy_pos_analysis(text, max_length=50000):
        """
        Analyze POS distribution using spaCy.
        """
        # Truncate if too long
        if len(text) > max_length:
            text = text[:max_length]
        
        doc = nlp(text)
        pos_counts = Counter(token.pos_ for token in doc)
        return pos_counts
    
    # Test
    sample_doc = corpus.iloc[0]
    pos_counts = spacy_pos_analysis(sample_doc['text'])
    
    print("POS distribution (spaCy):")
    for pos, count in pos_counts.most_common(10):
        print(f"  {pos}: {count}")
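Truncating at a fixed character offset can cut the last word in half, and the leftover fragment is then tagged as if it were a real word. A small sketch of one alternative, which backs up to the previous whitespace, is shown below. The helper name `truncate_at_word` is an assumption, not part of spaCy.

```python
def truncate_at_word(text, max_length=50000):
    """Cut text to at most max_length characters without splitting a word."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    # The cut already falls on a word boundary
    if text[max_length].isspace() or cut[-1].isspace():
        return cut.rstrip()
    # Otherwise drop the partial last word
    parts = cut.rsplit(None, 1)
    # A single unbroken run of characters has no boundary to back up to
    return parts[0] if len(parts) == 2 else cut

sample_text = "words persist like stones"
print(repr(truncate_at_word(sample_text, 10)))
print(repr(truncate_at_word(sample_text, 13)))
```

Its result could be passed to `nlp(...)` in place of the plain slice.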

## 5.6 Combining Features for Author Profiles

In [None]:
# Combine grammar and sentence stats
full_stats = grammar_df.merge(sentence_df, on=['manuscript_id', 'author', 'genre'])

# Create author profiles
author_profiles = full_stats.groupby('author').agg({
    'noun_pct': 'mean',
    'verb_pct': 'mean',
    'adj_pct': 'mean',
    'noun_verb_ratio': 'mean',
    'avg_sentence_length': 'mean',
    'question_ratio': 'mean'
}).round(2)

print("Author grammatical profiles:")
print(author_profiles.head(10))
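The profile columns sit on very different scales: sentence length is measured in tokens, while `question_ratio` is a fraction between 0 and 1. If the features are to be compared or combined, one common option is to standardize each column to z-scores. The sketch below uses made-up values for three hypothetical authors and only the standard library.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no author stands out on this feature
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Made-up profile values for three hypothetical authors
avg_sentence_length = [10.0, 20.0, 30.0]
question_ratio = [0.02, 0.04, 0.06]

length_z = z_scores(avg_sentence_length)
question_z = z_scores(question_ratio)
for i, (lz, qz) in enumerate(zip(length_z, question_z)):
    print(f"author {i}: length z={lz:+.2f}, question z={qz:+.2f}")
```

After standardizing, a value of +1 means "one standard deviation above the average author" for every feature, regardless of its original unit.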

## 5.7 Summary

In this tutorial, you learned:

1. **POS tagging**: Assigning grammatical categories to words
2. **POS distribution analysis**: Counting different word types
3. **Noun/verb ratio**: A measure of "thingness" vs "process"
4. **Sentence analysis**: Length, questions, exclamations
5. **Author profiling**: Combining features to characterize writers

### What the Grammar Reveals

Different authors have distinct grammatical signatures:
- Some favor nouns (emphasizing objects, concepts)
- Some favor verbs (emphasizing actions, processes)
- Sentence length varies by author and genre

These patterns can help with authorship attribution and genre classification.

---

*Yasho's hypothesis seems to hold: the grammar does reveal something about the writer. Grigsu, with his emphasis on permanence and stone, uses more nouns—the grammatical category for things that persist. The water-school writers favor verbs—the category for change and motion.*

## Exercises

### Exercise 5.1: Adjective Analysis
Which authors use the most adjectives? Extract the actual adjectives used by each author and compare.

In [None]:
# YOUR CODE HERE


### Exercise 5.2: Debate Questions
Analyze the debate transcripts specifically. Who asks more questions? Is there a correlation between questions asked and debate outcomes?

In [None]:
# YOUR CODE HERE


### Exercise 5.3: Sentence Complexity
Use spaCy (if available) to analyze dependency structures. Which authors write more complex sentences (deeper dependency trees)?

In [None]:
# YOUR CODE HERE
