# Tutorial 2: Cleaning the Stacks

## The Capital Archives — A Course in Natural Language Processing

---

*The water-school archivists are fastidious. "A pristine text," they say, "is like the seventy pools of Mirado—no dirt, no contamination, pure meaning flowing without obstruction." The stone-school disagrees, of course. They believe every smudge and error is part of the text's history, its hardness, its permanence.*

*But for computational analysis, we need clean data. The Chief has assigned you to standardize the collection.*

---

In this tutorial, you will learn:
- String manipulation in Python and pandas
- Regular expressions for pattern matching
- Common text cleaning operations
- Handling missing and inconsistent data
- Building a reusable text preprocessing pipeline

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab
# Skip this cell if running locally

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    print("✓ Repository cloned and ready!")
else:
    print("✓ Running locally - no setup needed")

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import re
from collections import Counter

# Load our data
manuscripts = pd.read_csv('manuscripts.csv')
texts = pd.read_csv('manuscript_texts.csv')

print(f"Loaded {len(manuscripts)} manuscript records")
print(f"Loaded {len(texts)} text sections")

## 2.1 Understanding Your Text Data

Before cleaning, we need to understand what we're working with. What kinds of issues might exist in our texts?

In [None]:
# Let's examine a sample of our texts
sample_texts = texts.sample(5, random_state=42)

for _, row in sample_texts.iterrows():
    print(f"\n{'='*60}")
    print(f"Manuscript: {row['manuscript_id']}, Section: {row['section']}")
    print(f"{'='*60}")
    print(row['text'][:300] + "...")

In [None]:
# Check for missing values
print("Missing values in texts:")
print(texts.isnull().sum())

print("\nMissing values in manuscripts:")
print(manuscripts.isnull().sum())
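
The counts above only report missing values; the cells that follow assume every `text` entry is a string. As a minimal sketch of how missing entries behave under the `.str` accessor, and two ways to deal with them, consider the small made-up table below (the `demo` frame and its values are illustrative, not taken from the corpus):

```python
import pandas as pd

# Hypothetical stand-in for the texts table, with one missing entry
demo = pd.DataFrame({
    "manuscript_id": ["MS-0001", "MS-0002", "MS-0003"],
    "text": ["Words are hard.", None, "  Words dissolve.  "],
})

# .str methods skip missing values and leave them missing
lengths = demo["text"].str.len()
print(lengths.isna().sum())

# Option A: drop rows that have no text at all
with_text = demo.dropna(subset=["text"])
print(len(with_text))

# Option B: keep every row but substitute an empty string
filled = demo["text"].fillna("").str.strip()
print(filled.tolist())
```

Which option fits depends on whether a row with no text still carries useful metadata; dropping is simpler, filling keeps row counts stable.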

## 2.2 Basic String Operations

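Python strings come with built-in methods such as `.lower()`, `.upper()`, and `.replace()`. pandas exposes the same operations for a whole column through the `.str` accessor. We start with a single string, then apply the methods column-wide.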

In [None]:
# Create a working copy of one text
sample_text = texts.iloc[0]['text']
print("Original text (first 300 chars):")
print(sample_text[:300])

In [None]:
# Lowercase conversion
lower_text = sample_text.lower()
print("Lowercased:")
print(lower_text[:300])

In [None]:
# Uppercase conversion
upper_text = sample_text.upper()
print("Uppercased:")
print(upper_text[:300])

In [None]:
# String replacement
# Let's replace line breaks with spaces
cleaned = sample_text.replace('\n', ' ')
print("With line breaks replaced by spaces:")
print(cleaned[:300])

In [None]:
# Applying string methods to a whole column
# Count the length of each text
texts['char_count'] = texts['text'].str.len()
texts['word_count'] = texts['text'].str.split().str.len()

print("Text statistics:")
print(texts[['manuscript_id', 'char_count', 'word_count']].head(10))

## 2.3 Regular Expressions: The Archivist's Pattern-Finder

Regular expressions (regex) let us find and manipulate patterns in text. They're essential for text cleaning.

In [None]:
# Basic regex patterns
test_text = "Grigsu wrote MS-0012 in the year 869. He also wrote MS-0034, MS-0089, and MS-0093."

# Find all manuscript IDs (pattern: MS- followed by digits)
manuscript_ids = re.findall(r'MS-\d+', test_text)
print(f"Manuscript IDs found: {manuscript_ids}")

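# Find all years (3- or 4-digit numbers).
# \b alone treats the hyphen in MS-0012 as a word boundary, so the digits of
# each ID would match too; the (?<!-) lookbehind skips digits that follow a hyphen.
years = re.findall(r'(?<!-)\b\d{3,4}\b', test_text)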
print(f"Years found: {years}")
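
When a pattern contains a capturing group, `re.findall` returns only the captured part, which is handy for pulling out the numeric portion of an ID. A short sketch on the same sentence (the variable names are illustrative):

```python
import re

test_text = "Grigsu wrote MS-0012 in the year 869. He also wrote MS-0034, MS-0089, and MS-0093."

# With one capturing group, findall returns the group rather than the whole match
id_digits = re.findall(r'MS-(\d+)', test_text)
print(id_digits)

# Convert to integers for sorting or range checks
id_numbers = [int(d) for d in id_digits]
print(id_numbers)

# re.search returns a match object; .group(1) is the first captured group
first_year = re.search(r'year (\d{3,4})', test_text)
print(first_year.group(1))
```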

In [None]:
# Common regex patterns for text cleaning

# Remove extra whitespace
messy_text = "This   has    extra   spaces    and\n\nnewlines."
# Stored as tidy_text so the name does not clash with the clean_text() function in section 2.4
tidy_text = re.sub(r'\s+', ' ', messy_text).strip()
print(f"Before: '{messy_text}'")
print(f"After: '{tidy_text}'")
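
Collapsing every run of whitespace also erases paragraph breaks. If paragraphs matter for a later analysis, one possible approach is to split on blank lines first and normalize inside each paragraph. A sketch with a made-up string:

```python
import re

messy = "First  paragraph,\nstill first.\n\n\n\nSecond   paragraph."

# Split on blank lines first, then collapse whitespace inside each paragraph
paragraphs = [re.sub(r'\s+', ' ', p).strip() for p in re.split(r'\n\s*\n', messy)]
# Drop any empty pieces left by leading or trailing blank lines
paragraphs = [p for p in paragraphs if p]
print(paragraphs)
```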

In [None]:
# Remove punctuation (keeping apostrophes for contractions)
text_with_punct = "Hello! How are you? I'm fine, thanks."
no_punct = re.sub(r"[^\w\s']", '', text_with_punct)
print(f"Before: '{text_with_punct}'")
print(f"After: '{no_punct}'")
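
The pattern above also deletes hyphens, so a compound such as "water-school" collapses into a single word, and it keeps underscores because `\w` includes them. If compounds matter for your analysis, one option is to add the hyphen to the set of kept characters. A sketch with a made-up sentence:

```python
import re

sentence = "The water-school's view: words dissolve!"

# Original pattern: keeps word characters, whitespace, and apostrophes
no_hyphens = re.sub(r"[^\w\s']", '', sentence)
print(no_hyphens)

# Variant: a hyphen placed last in the character class is literal, so it is kept
keep_hyphens = re.sub(r"[^\w\s'-]", '', sentence)
print(keep_hyphens)
```

Note that the variant keeps every hyphen, including stray ones between spaces, so it is a trade-off rather than a strict improvement.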

In [None]:
# Find all words that start with capital letters (potential names/places)
sample = "Grigsu traveled from the Capital to Yeller Quarry with Yasho and Bagbu."
capitalized = re.findall(r'\b[A-Z][a-z]+\b', sample)
print(f"Capitalized words: {capitalized}")
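
The pattern above returns "Yeller" and "Quarry" as two separate hits. A possible variant treats consecutive capitalized words as one name. This is only a sketch: it still picks up ordinary words at the start of a sentence, and it misses names containing letters outside A-Z.

```python
import re

sample = "Grigsu traveled from the Capital to Yeller Quarry with Yasho and Bagbu."

# One capitalized word, optionally followed by more capitalized words
name_pattern = r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b'
names = re.findall(name_pattern, sample)
print(names)
```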

### Regex Reference Card

| Pattern | Meaning | Example |
|---------|---------|----------|
| `\d` | Any digit | `\d+` matches "123" |
| `\w` | Any word character (letter, digit, or underscore) | `\w+` matches "hello" |
| `\s` | Any whitespace | `\s+` matches spaces/tabs/newlines |
| `.` | Any character except a newline | `a.c` matches "abc", "aXc" |
| `*` | Zero or more | `ab*` matches "a", "ab", "abbb" |
| `+` | One or more | `ab+` matches "ab", "abbb" but not "a" |
| `?` | Zero or one | `ab?` matches "a" or "ab" |
| `[]` | Character class | `[aeiou]` matches any vowel |
| `^` | Start of string | `^Hello` matches "Hello world" |
| `$` | End of string | `world$` matches "Hello world" |
| `\b` | Word boundary | `\bcat\b` matches "cat" but not "catalog" |
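
A few rows of the card have edge cases that are easy to trip over. The checks below are a small sketch using made-up strings; the flags `re.DOTALL` and `re.MULTILINE` are part of Python's standard `re` module.

```python
import re

# \b: matches the whole word only
whole_word = re.findall(r'\bcat\b', "cat catalog concat")
print(whole_word)

# \b also sees a hyphen as a boundary, so the digits inside an ID match on their own
digit_runs = re.findall(r'\b\d+\b', "MS-0012 from 869")
print(digit_runs)

# . does not match a newline unless re.DOTALL is set
dot_default = re.search(r'a.c', "a\nc")
dot_all = re.search(r'a.c', "a\nc", flags=re.DOTALL)
print(dot_default, dot_all is not None)

# ^ anchors to the start of the string unless re.MULTILINE is set
two_lines = "Hello world\nHello stone"
start_only = re.findall(r'^Hello', two_lines)
each_line = re.findall(r'^Hello', two_lines, flags=re.MULTILINE)
print(start_only, each_line)
```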

### Exercise 2.1: Practice with Regex

Use regular expressions to extract information from this text:

In [None]:
practice_text = """
The debate took place on Day 15 of the Third Month, 869. 
Present were Grigsu, Yasho, Bagbu, and Mink. 
They discussed MS-0012, MS-0008, and MS-0045.
Grigsu argued for 2 hours. Yasho spoke for 3 hours.
The final vote was 7 to 5 in favor of the water-school.
"""

# Worked example: find all manuscript IDs
manuscript_ids = re.findall(r'MS-\d+', practice_text)
print(f"Manuscript IDs: {manuscript_ids}")

# YOUR CODE HERE: Find all numbers

# YOUR CODE HERE: Find all capitalized words (names)

## 2.4 Building a Text Cleaning Function

Let's create a reusable function that applies standard cleaning operations to any text.

In [None]:
def clean_text(text, lowercase=True, remove_punctuation=False, 
               normalize_whitespace=True, remove_numbers=False):
    """
    Clean a text string with various options.
    
    Parameters:
    -----------
    text : str
        The text to clean
    lowercase : bool
        Convert to lowercase
    remove_punctuation : bool
        Remove punctuation marks
    normalize_whitespace : bool
        Replace multiple spaces/newlines with single space
    remove_numbers : bool
        Remove numeric digits
        
    Returns:
    --------
    str : The cleaned text
    """
    if pd.isna(text):
        return ""
    
    # Convert to string if needed
    text = str(text)
    
    # Lowercase
    if lowercase:
        text = text.lower()

    # Remove numbers
    if remove_numbers:
        text = re.sub(r'\d+', '', text)

    # Remove punctuation (but keep apostrophes)
    if remove_punctuation:
        text = re.sub(r"[^\w\s']", '', text)

    # Normalize whitespace last, so gaps left by the removals above are
    # collapsed too. Skipped entirely when normalize_whitespace=False.
    if normalize_whitespace:
        text = re.sub(r'\s+', ' ', text).strip()
    
    return text

In [None]:
# Test the cleaning function
test = """ON THE PERMANENCE OF THE UTTERED
By Grigsu Haldo

A word is not a soft thing like water... It is HARD, harder than stone!"""

print("Original:")
print(test)
print("\n" + "="*50 + "\n")

print("Cleaned (default settings):")
print(clean_text(test))
print("\n" + "="*50 + "\n")

print("Cleaned (with punctuation removed):")
print(clean_text(test, remove_punctuation=True))

In [None]:
# Apply cleaning to our entire corpus
texts['clean_text'] = texts['text'].apply(clean_text)

# Compare original and cleaned
sample_idx = 0
print("Original (first 200 chars):")
print(texts.iloc[sample_idx]['text'][:200])
print("\nCleaned (first 200 chars):")
print(texts.iloc[sample_idx]['clean_text'][:200])
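
Option flags are one way to build a reusable pipeline. Another is to compose small single-purpose steps, which makes the order of operations explicit. The sketch below is an alternative design, not a replacement for `clean_text`; all function names in it are illustrative.

```python
import re
from functools import reduce

# Each step takes a string and returns a string
def collapse_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def strip_punctuation(text):
    return re.sub(r"[^\w\s']", '', text)

def make_pipeline(*steps):
    """Return a function that applies the given steps from left to right."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

light_clean = make_pipeline(str.lower, collapse_whitespace)
heavy_clean = make_pipeline(str.lower, strip_punctuation, collapse_whitespace)

raw = "It is HARD,\n  harder than stone!"
light_result = light_clean(raw)
heavy_result = heavy_clean(raw)
print(light_result)
print(heavy_result)
```

Because each pipeline is an ordinary function, it can be passed to `.apply()` on a text column in the same way as `clean_text`.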

## 2.5 Handling Author Name Variations

The manuscripts have inconsistent author names. Let's standardize them.

In [None]:
# What author names do we have?
print("Unique authors:")
for author in sorted(manuscripts['author'].unique()):
    print(f"  {author}")

In [None]:
# Let's look at the scholars table for canonical names
scholars = pd.read_csv('scholars.csv')
print("\nScholars with alternative names:")
print(scholars[['name', 'also_known_as']].dropna(subset=['also_known_as']).head(10))

In [None]:
def standardize_author(author_name, scholars_df):
    """
    Try to match an author name to a canonical scholar name.
    
    Returns the canonical name if found, otherwise returns the original.
    """
    if pd.isna(author_name):
        return 'Unknown'
    
    # Check for exact match
    if author_name in scholars_df['name'].values:
        return author_name
    
    # Check also_known_as column
    for _, row in scholars_df.iterrows():
        if pd.notna(row['also_known_as']):
            aliases = [a.strip() for a in str(row['also_known_as']).split(',')]
            if author_name in aliases:
                return row['name']
    
    # No match found, return original
    return author_name

# Test it
print(standardize_author('Bagbu', scholars))  # Should return canonical name
print(standardize_author('Grigsu', scholars))  # Should return Grigsu Haldo
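
`standardize_author` scans every scholar row for each name it checks, and it matches exact spellings only. For a larger collection, one option is to build an alias dictionary once and normalize case and spacing before the lookup. The sketch below uses plain Python; the two records and their aliases are invented for illustration and only mirror the `name` and `also_known_as` columns. Missing aliases are `None` here, whereas values read by pandas would be NaN and need a `pd.isna` check instead.

```python
# Invented records mirroring the name / also_known_as columns
scholar_records = [
    {"name": "Grigsu Haldo", "also_known_as": "Grigsu, G. Haldo"},
    {"name": "Yasho", "also_known_as": None},
]

def normalize_key(name):
    """Lowercase and collapse internal whitespace for tolerant matching."""
    return " ".join(name.lower().split())

def build_alias_map(records):
    """Map every normalized name and alias to its canonical name."""
    alias_map = {}
    for record in records:
        canonical = record["name"]
        alias_map[normalize_key(canonical)] = canonical
        for alias in (record["also_known_as"] or "").split(","):
            if alias.strip():
                alias_map[normalize_key(alias)] = canonical
    return alias_map

alias_map = build_alias_map(scholar_records)

def lookup_author(name):
    """Return the canonical name, or the input unchanged if no alias matches."""
    return alias_map.get(normalize_key(name), name)

print(lookup_author("grigsu"))
print(lookup_author("  G.  Haldo "))
print(lookup_author("Mink"))
```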

## 2.6 Creating a Clean Corpus

Let's combine all our texts into a single, clean corpus ready for analysis.

In [None]:
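# Aggregate texts by manuscript (combining all sections).
# Missing sections are dropped before joining, because ' '.join fails on NaN.
corpus = texts.groupby('manuscript_id').agg(
    full_text=('text', lambda s: ' '.join(s.dropna().astype(str))),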
    clean_text=('clean_text', ' '.join),
    num_sections=('section', 'count')
).reset_index()

# Add word counts
corpus['word_count'] = corpus['clean_text'].str.split().str.len()

print(f"Corpus contains {len(corpus)} documents")
print(f"Total words: {corpus['word_count'].sum():,}")

In [None]:
# Merge with metadata
corpus = corpus.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre', 'authenticity_status']],
    on='manuscript_id',
    how='left'
)

corpus.head()

## 2.7 Sentence Segmentation

For some analyses, we need to work with sentences rather than whole documents.

In [None]:
def split_into_sentences(text):
    """
    Split text into sentences using regex.
    This is a simple approach - for better results, use nltk or spacy.
    """
    # Split on whitespace that follows sentence-ending punctuation (. ! ?).
    # The lookbehind keeps the punctuation attached to its sentence.
    # Abbreviations that end in a period will be split incorrectly.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    # Filter out empty strings
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

# Test it
test_para = """Grigsu believed words were hard like stones. 
Yasho disagreed completely! She argued that words dissolve. 
Who was right? The debate continues to this day."""

sentences = split_into_sentences(test_para)
for i, s in enumerate(sentences, 1):
    print(f"{i}. {s}")
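
The simple rule breaks after every period, including those in abbreviations. A slightly stricter variant splits only when the next word starts with a capital letter. The sketch below compares the two on a made-up sentence; both function names are illustrative. The stricter rule still fails on cases like a title followed by a name, which is where nltk or spacy become worthwhile.

```python
import re

def split_simple(text):
    """Break after . ! ? followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def split_before_capital(text):
    """Break only when the next sentence starts with an uppercase ASCII letter."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+(?=[A-Z])', text) if s.strip()]

tricky = "See vol. 3 for details. It is long."
simple_result = split_simple(tricky)
strict_result = split_before_capital(tricky)
print(simple_result)
print(strict_result)
```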

In [None]:
# Count sentences in each document
corpus['num_sentences'] = corpus['full_text'].apply(lambda x: len(split_into_sentences(x)))
# Empty documents have zero sentences; use NaN there to avoid dividing by zero
corpus['avg_sentence_length'] = corpus['word_count'] / corpus['num_sentences'].replace(0, np.nan)

print("Sentence statistics:")
print(corpus[['manuscript_id', 'author', 'num_sentences', 'avg_sentence_length']].head(10))

In [None]:
# Who writes the longest sentences?
author_sentence_length = corpus.groupby('author')['avg_sentence_length'].mean().sort_values(ascending=False)

print("Average sentence length by author:")
print(author_sentence_length.head(15))

## 2.8 Saving Your Clean Data

Let's save our cleaned corpus for use in future tutorials.

In [None]:
# Save the cleaned corpus
corpus.to_csv('corpus_cleaned.csv', index=False)
print(f"Saved cleaned corpus with {len(corpus)} documents")

# Also save the text sections with clean text
texts[['manuscript_id', 'section', 'text', 'clean_text']].to_csv('texts_cleaned.csv', index=False)
print(f"Saved {len(texts)} cleaned text sections")

## 2.9 Summary

In this tutorial, you learned:

1. **String operations**: `.lower()`, `.upper()`, `.replace()`, `.strip()`, and the pandas `.str` accessor
2. **Regular expressions**: Pattern matching with `re.findall()`, `re.sub()`
3. **Text cleaning pipeline**: Building reusable cleaning functions
4. **Name standardization**: Matching variations to canonical forms
5. **Sentence segmentation**: Breaking text into sentences

### Key Takeaways

- **Clean early, clean consistently**: Apply the same cleaning to all texts
- **Keep the original**: Always preserve the original text alongside cleaned versions
- **Document your choices**: Text cleaning involves decisions that affect analysis
- **Test your functions**: Verify cleaning works as expected on diverse inputs
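
As one way to act on the last takeaway, a cleaning step can be checked against a small table of inputs and expected outputs. The sketch below tests a stand-alone whitespace normalizer (a simplified stand-in for one step of `clean_text`; the function name and test cases are illustrative).

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

# Small table of (input, expected) pairs covering ordinary and edge cases
cases = [
    ("a  b", "a b"),
    ("  leading and trailing  ", "leading and trailing"),
    ("line\nbreak\t and tab", "line break and tab"),
    ("", ""),
    ("   ", ""),
]

for raw, expected in cases:
    result = normalize_whitespace(raw)
    assert result == expected, f"{raw!r} -> {result!r}, expected {expected!r}"
    # Cleaning already-clean text should change nothing (idempotence)
    assert normalize_whitespace(result) == result

print(f"{len(cases)} cases passed")
```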

---

*The water-school archivist inspects your work. "Acceptable," she says, her tone suggesting this is high praise. "The texts are clean, the metadata standardized. Now you may begin to count the words themselves."*

## Exercises

### Exercise 2.2: Custom Cleaning
Write a function that removes all text within square brackets [like this], which often indicates editorial insertions.

In [None]:
def remove_brackets(text):
    """Remove all text within square brackets, including the brackets."""
    # YOUR CODE HERE
    pass

# Test
test = "Grigsu said [standing] that words are hard [the audience murmured]."
print(remove_brackets(test))  # Should print: "Grigsu said that words are hard."

### Exercise 2.3: Find All Dates
Write a regex to find all dates in the format "Day X of the Y Month, YEAR" from the texts.

In [None]:
# YOUR CODE HERE
date_pattern = r''  # Fill in the pattern (an empty pattern matches at every position)

# Test on the debate transcripts
debate_texts = texts[texts['manuscript_id'].str.startswith('MS-01')]['text']
for text in debate_texts:
    dates = re.findall(date_pattern, text)
    if dates:
        print(f"Found dates: {dates}")

### Exercise 2.4: Character Cleanup
Some texts contain unusual characters from transcription errors. Write code to:
1. Find all unique non-ASCII characters in the corpus
2. Create a mapping to replace them with ASCII equivalents

In [None]:
# YOUR CODE HERE
all_text = ' '.join(corpus['full_text'])

# Find non-ASCII characters
non_ascii = set(c for c in all_text if ord(c) > 127)
print(f"Non-ASCII characters found: {non_ascii}")