# Tutorial 1: Opening the Archive

## The Capital Archives — A Course in Natural Language Processing

---

**Welcome, Junior Archivist.**

You have been assigned to the archives district of the Capital, that labyrinthine network of licensed shops near the Senate-house. Each shop has its single round window—no walk-in entrance—where archivists dispense manuscripts, maps, expedition reports, and philosophical treatises to those who know what to ask for.

The Chief Archivist has handed you a catalog and a simple task: take inventory. Before you can do anything sophisticated with these texts, you need to understand what you have.

In this tutorial, you will learn:
- How to load and inspect tabular data with pandas
- How to load and inspect text data
- Basic string operations and counting
- How to explore a corpus before analyzing it

---

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab.
# Running it locally is harmless: it only prints a message.

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    print("✓ Repository cloned and ready!")
else:
    print("✓ Running locally - no setup needed")

## 1.1 Setting Up Your Workspace

Every archivist needs tools. We'll start by importing the libraries we need.

In [None]:
# Core libraries for data handling
import pandas as pd
import numpy as np

# For displaying data nicely
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

# Verify everything is loaded
print("Workspace ready. Tools loaded.")
print(f"pandas version: {pd.__version__}")

## 1.2 Loading the Manuscript Catalog

The Chief has given you access to the central catalog—a record of every manuscript in the district's collections. Let's load it and see what we're working with.

In [None]:
# Load the manuscript catalog
manuscripts = pd.read_csv('data/manuscripts.csv')

# How many manuscripts do we have?
print(f"Total manuscripts in the catalog: {len(manuscripts)}")

# What information do we have about each one?
print("\nColumns in the catalog:")
for col in manuscripts.columns:
    print(f"  - {col}")

In [None]:
# Let's look at the first few entries
manuscripts.head(10)

### Understanding the Catalog

Each row represents a manuscript in the archives. The columns tell us:
- **manuscript_id**: Unique identifier (MS-0001, MS-0002, etc.)
- **title**: The name of the work
- **author**: Who wrote it (sometimes "Anonymous" or "attributed to...")
- **genre**: What kind of text it is (treatise, debate transcript, expedition report, etc.)
- **condition**: Physical state of the manuscript
- **authenticity_status**: Whether scholars believe it's genuine

Let's explore each of these.
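
Before counting anything, it can help to check what type each column has and where values are missing. The following is a minimal sketch on a toy table, not the real catalog: the rows are invented, and only the column names are borrowed from the list above.

```python
import pandas as pd

# Toy catalog standing in for data/manuscripts.csv (all values are invented)
toy_catalog = pd.DataFrame({
    "manuscript_id": ["MS-0001", "MS-0002", "MS-0003"],
    "author": ["Anonymous", None, "Grigsu"],
    "genre": ["treatise", "debate transcript", "treatise"],
})

# Column types: text columns usually load as object (or string) dtype
print(toy_catalog.dtypes)

# Missing values per column
missing = toy_catalog.isna().sum()
print(missing)

# Distinct values per column (missing values are not counted)
distinct = toy_catalog.nunique()
print(distinct)
```

The same three calls should work unchanged on `manuscripts` once it is loaded.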

## 1.3 Counting and Grouping: What Do We Have?

The first task of any archivist is inventory. Let's count what kinds of manuscripts we have.

In [None]:
# Count manuscripts by genre
genre_counts = manuscripts['genre'].value_counts()

print("Manuscripts by genre:")
print(genre_counts)

In [None]:
# Let's visualize this
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
genre_counts.plot(kind='barh', ax=ax, color='steelblue')
ax.set_xlabel('Number of Manuscripts')
ax.set_ylabel('Genre')
ax.set_title('Archive Holdings by Genre')
plt.tight_layout()
plt.show()

### Exercise 1.1: Explore Other Categories

The genre distribution tells us what kinds of texts dominate the collection. But there are other ways to slice the data. Try counting manuscripts by:
1. Condition (how many are damaged?)
2. Authenticity status (how many are disputed?)
3. Author (who wrote the most?)
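
Two options of `value_counts()` are often useful for counts like these: `dropna=False` to see missing values, and `normalize=True` to get proportions. Here is a small sketch on an invented Series; the labels are placeholders and may not match the real `condition` values.

```python
import pandas as pd

# Invented condition labels; the real column may use different values
condition = pd.Series(["good", "damaged", "good", None, "fragmentary", "good"])

# Default: missing values are excluded from the counts
counts = condition.value_counts()

# dropna=False gives missing values their own row
with_missing = condition.value_counts(dropna=False)

# normalize=True returns proportions of the non-missing values
shares = condition.value_counts(normalize=True)

print(counts)
print(with_missing)
print(shares)
```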

In [None]:
# Worked example: count manuscripts by condition
condition_counts = manuscripts['condition'].value_counts()
print("Manuscripts by condition:")
print(condition_counts)

In [None]:
# YOUR CODE HERE: Count manuscripts by authenticity_status


In [None]:
# YOUR CODE HERE: Who are the most prolific authors?
# Hint: Use value_counts() and maybe .head(10) to see the top authors


## 1.4 Loading the Actual Texts

The catalog tells us *about* manuscripts, but to do text analysis, we need the texts themselves. The archive has transcribed many of the most important works.

In [None]:
# Load the transcribed texts
texts = pd.read_csv('data/manuscript_texts.csv')

print(f"Total text sections: {len(texts)}")
print(f"Columns: {list(texts.columns)}")

In [None]:
# Each manuscript may have multiple sections
# Let's see how many unique manuscripts have been transcribed
unique_manuscripts = texts['manuscript_id'].nunique()
print(f"Unique manuscripts with text: {unique_manuscripts}")

In [None]:
# Let's look at one example text
sample = texts[texts['manuscript_id'] == 'MS-0012'].iloc[0]
print(f"Manuscript: {sample['manuscript_id']}")
print(f"Section: {sample['section']}")
print("\nText (first 500 characters):")
print(sample['text'][:500])

## 1.5 Basic Text Statistics

Before sophisticated analysis, we need basic statistics about our texts. How long are they? How many words?
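
What counts as a "word" depends on how the text is split. As a rough illustration, the sketch below compares a plain whitespace split with a simple regular expression on an invented sentence. The pattern is one possible choice, not a standard tokenizer.

```python
import re

# Invented sample sentence, not taken from the archive
sentence = "The uttered word, once spoken, does not fade; it endures."

# Whitespace split: punctuation stays attached to the neighboring word
whitespace_tokens = sentence.split()

# Regex: keep only runs of letters and apostrophes, lowercased
regex_tokens = re.findall(r"[a-z']+", sentence.lower())

print(whitespace_tokens[:3])
print(regex_tokens[:3])
print(len(whitespace_tokens), len(regex_tokens))
```

In this sentence the two counts agree, but the tokens differ (`word,` versus `word`). That difference matters once we start counting how often each word appears.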

In [None]:
# Calculate word counts for each section
# A simple approach: split on whitespace and count
texts['word_count'] = texts['text'].str.split().str.len()

print("Word count statistics (per section):")
print(texts['word_count'].describe())

In [None]:
# Now let's aggregate by manuscript (some have multiple sections)
manuscript_stats = texts.groupby('manuscript_id').agg(
    total_words=('word_count', 'sum'),
    num_sections=('section', 'count'),
    full_text=('text', ' '.join)  # Combine all sections
).reset_index()

print(f"\nManuscripts with text: {len(manuscript_stats)}")
print(f"Total words in collection: {manuscript_stats['total_words'].sum():,}")

In [None]:
# Which manuscripts are the longest?
print("Longest manuscripts:")
print(manuscript_stats.nlargest(10, 'total_words')[['manuscript_id', 'total_words', 'num_sections']])

In [None]:
# Which are the shortest?
print("Shortest manuscripts:")
print(manuscript_stats.nsmallest(10, 'total_words')[['manuscript_id', 'total_words', 'num_sections']])

## 1.6 Joining Data: Connecting Texts to Metadata

The real power comes when we connect our texts to their metadata. Who wrote the longest works? Which genres have the most words?
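
A left merge keeps every row of the left table and fills in `NaN` where the right table has no match. The sketch below uses two invented toy tables to show one way of spotting unmatched IDs with `indicator=True`; the values are placeholders.

```python
import pandas as pd

# Toy tables with invented values; MS-0003 has text but no catalog entry
toy_stats = pd.DataFrame({
    "manuscript_id": ["MS-0001", "MS-0002", "MS-0003"],
    "total_words": [1200, 450, 980],
})
toy_catalog = pd.DataFrame({
    "manuscript_id": ["MS-0001", "MS-0002"],
    "genre": ["treatise", "expedition report"],
})

# indicator=True adds a _merge column recording where each row came from
merged = toy_stats.merge(toy_catalog, on="manuscript_id", how="left", indicator=True)
print(merged)

# Rows found only in the left table have no metadata
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["manuscript_id"].tolist())
```

If the unmatched list is not empty on the real data, the affected manuscripts will silently drop out of any later grouping by author or genre, because `groupby` skips missing keys by default.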

In [None]:
# Merge the manuscript stats with the catalog
# This gives us metadata for each text we have
texts_with_metadata = manuscript_stats.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre', 'condition', 'authenticity_status']],
    on='manuscript_id',
    how='left'
)

texts_with_metadata.head()

In [None]:
# Now we can ask: which authors wrote the most words in our transcribed collection?
author_words = texts_with_metadata.groupby('author')['total_words'].sum().sort_values(ascending=False)

print("Words by author (transcribed texts only):")
print(author_words.head(15))

In [None]:
# Average manuscript length by genre
genre_length = texts_with_metadata.groupby('genre')['total_words'].mean().sort_values(ascending=False)

print("Average manuscript length by genre:")
print(genre_length)

## 1.7 Your First Text Exploration

Let's pick a single manuscript and explore it in detail. This is the kind of close reading you'll do throughout the course—getting to know individual texts before applying computational methods.

In [None]:
# Pick a manuscript to explore
# Let's look at one of the philosophical treatises
target_id = 'MS-0012'  # This is Grigsu's "On the Permanence of the Uttered"

# Get metadata
meta = manuscripts[manuscripts['manuscript_id'] == target_id].iloc[0]
print(f"Title: {meta['title']}")
print(f"Author: {meta['author']}")
print(f"Genre: {meta['genre']}")
print(f"Condition: {meta['condition']}")
print(f"Authenticity: {meta['authenticity_status']}")

In [None]:
# Get the full text
target_text = ' '.join(texts[texts['manuscript_id'] == target_id]['text'].tolist())

# Basic stats
words = target_text.split()
print(f"\nWord count: {len(words)}")
print(f"Character count: {len(target_text)}")
print(f"Average word length: {sum(len(w) for w in words) / len(words):.1f} characters")

In [None]:
# What are the most common words?
from collections import Counter

# Simple word frequency (later tutorials refine this)
word_freq = Counter(word.lower() for word in words)

print("Most common words:")
for word, count in word_freq.most_common(20):
    print(f"  {word}: {count}")

Notice that the most common words are mostly "function words" like *the*, *is*, *a*, *to*. These are common in all English text. In later tutorials, we'll learn to filter these out to find the "content words" that distinguish one text from another.
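
As a preview of that filtering, here is a minimal sketch using a tiny hand-picked stopword set on an invented sentence. The set is for illustration only; real stopword lists are much longer.

```python
import re
from collections import Counter

# Invented sample text, not taken from the archive
sample = "The word is the mark of the speaker, and the mark is a thing that endures."

# A tiny hand-picked stopword set, for illustration only
stopwords = {"the", "is", "a", "of", "and", "that", "to"}

# Lowercase and keep only runs of letters and apostrophes
tokens = re.findall(r"[a-z']+", sample.lower())

# Count only the tokens that are not stopwords
content_freq = Counter(t for t in tokens if t not in stopwords)
print(content_freq.most_common(3))
```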

## 1.8 Exercise: Explore Another Manuscript

Choose a different manuscript and perform the same exploration. Pick something that interests you—maybe a debate transcript, an expedition report, or a text by a different author.

In [None]:
# First, let's see what manuscripts have text available
available = texts_with_metadata[['manuscript_id', 'title', 'author', 'genre', 'total_words']]
print(available.to_string())

In [None]:
# YOUR CODE HERE: Choose a manuscript_id and explore it
my_target_id = 'MS-0008'  # Change this to your chosen manuscript

# Get metadata
my_meta = manuscripts[manuscripts['manuscript_id'] == my_target_id].iloc[0]
print(f"Title: {my_meta['title']}")
print(f"Author: {my_meta['author']}")

# Get text and calculate stats
# YOUR CODE HERE


## 1.9 Loading Other Archive Data

The archive contains more than just manuscripts. Let's load the other data files we'll use throughout this course.

In [None]:
# Load all the archive data
scholars = pd.read_csv('data/scholars.csv')
debates = pd.read_csv('data/debates.csv')
shops = pd.read_csv('data/archivist_shops.csv')
expeditions = pd.read_csv('data/expeditions.csv')
word_index = pd.read_csv('data/word_index.csv')

print("Archive data loaded:")
print(f"  Scholars: {len(scholars)} records")
print(f"  Debates: {len(debates)} records")
print(f"  Archive Shops: {len(shops)} records")
print(f"  Expeditions: {len(expeditions)} records")
print(f"  Word Index: {len(word_index)} terms")

In [None]:
# Let's look at the scholars
print("Scholars by philosophical school:")
print(scholars['philosophical_school'].value_counts())

In [None]:
# The word index is particularly interesting for NLP
# It shows how different schools define the same terms
word_index.head()

## 1.10 Summary and Next Steps

In this tutorial, you learned:

1. **Loading data with pandas**: `pd.read_csv()` to load tabular data
2. **Basic exploration**: `.head()`, `.describe()`, `.value_counts()`, `.nunique()`
3. **Counting and grouping**: Finding patterns in categorical data
4. **Text statistics**: Word counts, character counts, basic frequency
5. **Joining data**: Connecting texts to their metadata with `.merge()`

### The Archive Awaits

You now have a basic map of the archive's holdings. You know:
- What kinds of documents we have
- Who wrote them
- How long they are
- Which ones have been transcribed

In the next tutorial, we'll dig into **cleaning and standardizing** the texts. Many of these manuscripts have inconsistencies, errors, and quirks that need to be addressed before serious analysis can begin.

---

*The Chief Archivist nods approvingly at your inventory report. "Good," she says. "Now you know what we have. Tomorrow, we'll teach you to make sense of it."*

## Exercises

Before moving on, complete these exercises to solidify your understanding:

### Exercise 1.2: Suspicious Manuscripts
Find all manuscripts with `authenticity_status` of "suspected_forgery". What do they have in common? Who are they attributed to?
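
As a hint, the general pattern is a boolean filter followed by a count. The sketch below uses a toy table and a different column (`condition`), with invented values, so the exercise is still yours to solve.

```python
import pandas as pd

# Toy table with invented values
toy = pd.DataFrame({
    "manuscript_id": ["MS-0001", "MS-0002", "MS-0003", "MS-0004"],
    "author": ["Anonymous", "Grigsu", "Anonymous", "Grigsu"],
    "condition": ["good", "damaged", "damaged", "good"],
})

# Keep only the rows where the comparison is True
damaged = toy[toy["condition"] == "damaged"]

# Then summarize the subset, for example by author
by_author = damaged["author"].value_counts()
print(by_author)

# Combine conditions with & (and) or | (or), each wrapped in parentheses
damaged_anon = toy[(toy["condition"] == "damaged") & (toy["author"] == "Anonymous")]
print(damaged_anon)
```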

In [None]:
# YOUR CODE HERE


### Exercise 1.3: The Scholars
Using the `scholars.csv` file, find all scholars belonging to the "stone_school". What are their names? What manuscripts did they write? (Hint: the `major_works` column contains manuscript IDs)
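
If a single cell holds several manuscript IDs, one way to get one row per ID is to split the string and then call `.explode()`. The sketch below assumes the IDs are separated by `;`, which is a guess about the file format; the `name` column and all rows are invented too. Inspect a few real `major_works` values before adapting it.

```python
import pandas as pd

# Invented rows; the ";" delimiter and the "name" column are assumptions
toy_scholars = pd.DataFrame({
    "name": ["Scholar A", "Scholar B"],
    "major_works": ["MS-0001;MS-0004", "MS-0002"],
})

# Split each string into a list, then give every list element its own row
exploded = (
    toy_scholars
    .assign(manuscript_id=toy_scholars["major_works"].str.split(";"))
    .explode("manuscript_id")
    .drop(columns="major_works")
    .reset_index(drop=True)
)
print(exploded)
```

The resulting `manuscript_id` column can then be merged with the catalog in the same way as in section 1.6.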

In [None]:
# YOUR CODE HERE


### Exercise 1.4: Word Count Distribution
Create a histogram showing the distribution of manuscript lengths (total_words). Are most manuscripts short or long? Are there outliers?
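
Alongside the histogram, a numeric check can make "outlier" more concrete. The sketch below applies the common 1.5 × IQR rule of thumb to invented lengths. This is one convention among several, not the only definition of an outlier.

```python
import pandas as pd

# Invented manuscript lengths, in words
lengths = pd.Series([300, 420, 450, 500, 520, 610, 4800], name="total_words")

# Interquartile range: the spread of the middle half of the data
q1 = lengths.quantile(0.25)
q3 = lengths.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = lengths[(lengths < lower) | (lengths > upper)]

print(f"median: {lengths.median()}, mean: {lengths.mean():.1f}")
print(outliers.tolist())
```

A mean far above the median, as in this toy data, is a quick sign that a few long texts are pulling the average up.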

In [None]:
# YOUR CODE HERE
