# Tutorial 8: Sentiment in the Margins

## The Capital Archives — A Course in Natural Language Processing

---

*The margins of manuscripts tell their own stories. Readers leave notes: "Brilliant!" "This is wrong." "See also MS-0034." "The author contradicts himself." These annotations reveal what readers thought of these texts—a reception history written in hasty ink.*

*The Chief wants to understand how different scholars and texts were received. What did readers think of Grigsu? Of Yasho? Did opinions change over time?*

---

In this tutorial, you will learn:
- Sentiment analysis basics
- Using VADER for sentiment scoring
- Analyzing opinion and emotion in text
- Tracking sentiment across documents and time

In [None]:
# ============================================
# COLAB SETUP - Run this cell first!
# ============================================
# This cell sets up the environment for Google Colab
# Skip this cell if running locally

import os

# Clone the repository if running in Colab
if 'google.colab' in str(get_ipython()):
    if not os.path.exists('capital-archives-nlp'):
        !git clone https://github.com/buildLittleWorlds/capital-archives-nlp.git
    os.chdir('capital-archives-nlp')
    
    # Install/download NLTK data
    import nltk
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('vader_lexicon', quiet=True)
    print("✓ Repository cloned and NLTK data downloaded!")
else:
    print("✓ Running locally - no setup needed")

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# NLP
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("Libraries loaded.")

In [None]:
# Load corpus
manuscripts = pd.read_csv('manuscripts.csv')
texts = pd.read_csv('manuscript_texts.csv')

corpus = texts.groupby('manuscript_id').agg(
    text=('text', ' '.join)
).reset_index()

corpus = corpus.merge(
    manuscripts[['manuscript_id', 'title', 'author', 'genre']],
    on='manuscript_id', how='left'
)

print(f"Loaded {len(corpus)} documents")

## 8.1 What is Sentiment Analysis?

**Sentiment analysis** estimates the emotional tone of a text, typically sorting it into one of three classes:
- **Positive**: Happy, good, excellent, wonderful
- **Negative**: Bad, terrible, wrong, hate
- **Neutral**: Factual, objective, neither positive nor negative

### VADER (Valence Aware Dictionary and sEntiment Reasoner)

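VADER is a rule-based sentiment analyzer. It was tuned for social media text, but it also works reasonably well on general text. For each input it returns four scores:
- **neg**: Proportion of the text rated negative (0 to 1)
- **neu**: Proportion of the text rated neutral (0 to 1)
- **pos**: Proportion of the text rated positive (0 to 1)
- **compound**: Normalized overall score (-1 is most negative, +1 is most positive)

The `neg`, `neu`, and `pos` values sum to approximately 1, while `compound` is computed separately from the summed word valences.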
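The `compound` range of -1 to 1 comes from a normalization step. The sketch below reproduces that step as I understand it from the NLTK VADER source: the summed word valences are divided by the square root of the squared sum plus a constant of 15. The function name `normalize_compound` and the raw sums are illustrative assumptions, not NLTK output.

```python
import math

def normalize_compound(raw_sum, alpha=15):
    """Squash an unbounded valence sum into [-1, 1] (sketch of VADER's normalization)."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp defensively so rounding can never push the result out of range
    return max(-1.0, min(1.0, score))

# Made-up raw sums: a short sentence, a paragraph, a long document
for raw in (0.5, 2.0, 8.0, 40.0):
    print(f"raw sum {raw:5.1f} -> compound {normalize_compound(raw):.3f}")
```

Because the raw sum grows with text length, long documents tend to land very close to -1 or +1. That is worth keeping in mind when comparing whole-document compound scores.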

In [None]:
# Initialize VADER
sia = SentimentIntensityAnalyzer()

# Test on some sample sentences
test_sentences = [
    "Grigsu's argument is brilliant and convincing.",
    "This theory is completely wrong and foolish.",
    "The expedition departed on the third day of the month.",
    "The stone-school has been thoroughly discredited!",
    "A nuanced and thoughtful analysis of word theory."
]

print("Sentiment analysis examples:")
for sentence in test_sentences:
    scores = sia.polarity_scores(sentence)
    print(f"\n'{sentence}'")
    print(f"  Positive: {scores['pos']:.3f}")
    print(f"  Negative: {scores['neg']:.3f}")
    print(f"  Neutral: {scores['neu']:.3f}")
    print(f"  Compound: {scores['compound']:.3f}")
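A compound score is often turned into a coarse label by leaving a small neutral band around zero. The sketch below assumes the same ±0.05 cutoffs that `analyze_sentiment` uses in section 8.2. `label_from_compound` is a hypothetical helper, not part of NLTK, and the values are invented for illustration.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a compound score into a coarse label (hypothetical helper)."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    # Anything inside the band [-threshold, threshold] counts as neutral
    return "neutral"

# Illustrative compound values, not real VADER output
for value in (0.62, -0.48, 0.0, 0.05):
    print(f"{value:+.2f} -> {label_from_compound(value)}")
```

The band width is a judgment call. A wider band labels more sentences neutral, which may suit a corpus of mostly factual prose.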

## 8.2 Analyzing Document Sentiment

In [None]:
def analyze_sentiment(text):
    """
    Analyze sentiment of a text.
    
    Returns dict with overall scores and sentence-level analysis.
    """
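    # Overall document sentiment. VADER's compound score tends to saturate
    # near -1 or +1 on long texts, so read it alongside the sentence average.
    overall = sia.polarity_scores(text)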
    
    # Sentence-level sentiment
    sentences = sent_tokenize(text)
    sentence_scores = [sia.polarity_scores(s)['compound'] for s in sentences]
    
    return {
        'compound': overall['compound'],
        'positive': overall['pos'],
        'negative': overall['neg'],
        'neutral': overall['neu'],
        'num_sentences': len(sentences),
        'avg_sentence_sentiment': np.mean(sentence_scores) if sentence_scores else 0,
        'std_sentence_sentiment': np.std(sentence_scores) if sentence_scores else 0,
        'positive_sentences': sum(1 for s in sentence_scores if s > 0.05),
        'negative_sentences': sum(1 for s in sentence_scores if s < -0.05),
        'neutral_sentences': sum(1 for s in sentence_scores if -0.05 <= s <= 0.05)
    }

In [None]:
# Analyze sentiment for all documents
sentiment_data = []
for _, row in corpus.iterrows():
    sent = analyze_sentiment(row['text'])
    sent['manuscript_id'] = row['manuscript_id']
    sent['author'] = row['author']
    sent['genre'] = row['genre']
    sent['title'] = row['title']
    sentiment_data.append(sent)

sentiment_df = pd.DataFrame(sentiment_data)
print(f"Analyzed {len(sentiment_df)} documents")

In [None]:
# View summary
print("\nSentiment summary:")
print(sentiment_df[['compound', 'positive', 'negative', 'neutral']].describe())

In [None]:
# Most positive and most negative documents
print("\nMost positive documents:")
print(sentiment_df.nlargest(5, 'compound')[['title', 'author', 'compound']])

print("\nMost negative documents:")
print(sentiment_df.nsmallest(5, 'compound')[['title', 'author', 'compound']])

## 8.3 Sentiment by Author and Genre

In [None]:
# Average sentiment by author
author_sentiment = sentiment_df.groupby('author').agg({
    'compound': 'mean',
    'positive': 'mean',
    'negative': 'mean',
    'manuscript_id': 'count'
}).rename(columns={'manuscript_id': 'num_docs'}).round(3)

print("Sentiment by author (sorted by compound):")
print(author_sentiment.sort_values('compound', ascending=False).head(15))

In [None]:
# Average sentiment by genre
genre_sentiment = sentiment_df.groupby('genre').agg({
    'compound': 'mean',
    'positive': 'mean',
    'negative': 'mean',
    'std_sentence_sentiment': 'mean',
    'manuscript_id': 'count'
}).rename(columns={'manuscript_id': 'num_docs'}).round(3)

print("\nSentiment by genre:")
print(genre_sentiment.sort_values('compound', ascending=False))

In [None]:
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Genre sentiment
genre_sentiment.sort_values('compound')['compound'].plot(
    kind='barh', ax=axes[0], color='steelblue'
)
axes[0].axvline(x=0, color='gray', linestyle='--')
axes[0].set_xlabel('Compound Sentiment')
axes[0].set_title('Sentiment by Genre')

# Sentiment distribution
sentiment_df['compound'].hist(bins=20, ax=axes[1], color='steelblue', edgecolor='white')
axes[1].axvline(x=0, color='red', linestyle='--', label='Neutral')
axes[1].set_xlabel('Compound Sentiment')
axes[1].set_ylabel('Number of Documents')
axes[1].set_title('Distribution of Document Sentiment')

plt.tight_layout()
plt.show()

## 8.4 Analyzing Debates

Debate transcripts should show distinctive sentiment patterns: arguments for and against a position, and emotionally charged exchanges.

In [None]:
# Focus on debate transcripts
debates = sentiment_df[sentiment_df['genre'] == 'debate_transcript']

print(f"Debate transcripts: {len(debates)}")
if len(debates) > 0:
    print("\nDebates by sentiment:")
    print(debates[['title', 'compound', 'positive_sentences', 'negative_sentences']].sort_values('compound'))

In [None]:
# Analyze sentiment variation within debates
if len(debates) > 0:
    print("\nSentiment variability in debates:")
    print(debates[['title', 'std_sentence_sentiment']].sort_values('std_sentence_sentiment', ascending=False).head())
    
    # High variability = contentious debate with swings between positive and negative
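To see why the standard deviation matters alongside the mean, consider two documents whose sentences average out to the same score. This sketch uses invented sentence-level compound scores, and the names `calm_report` and `heated_debate` are hypothetical.

```python
from statistics import mean, pstdev

# Made-up sentence-level compound scores for two hypothetical documents
calm_report = [0.10, 0.05, 0.00, 0.12, 0.08]
heated_debate = [0.80, -0.70, 0.65, -0.75, 0.35]

for name, scores in [("calm_report", calm_report), ("heated_debate", heated_debate)]:
    # pstdev is the population standard deviation, matching np.std's default
    print(f"{name}: mean={mean(scores):+.3f}  std={pstdev(scores):.3f}")
```

Both lists have the same mean, so an average alone would not separate them. The spread is what flags the second document as contentious.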

## 8.5 Sentiment About Specific Entities

What do texts say about specific scholars or concepts?

In [None]:
def sentiment_around_entity(text, entity, window=50):
    """
    Analyze sentiment in text windows around mentions of an entity.
    
    Parameters:
    -----------
    text : str
        The document text
    entity : str
        The entity to search for (case-insensitive)
    window : int
        Number of characters of context to keep on each side of a mention
        (words at the window edges may be cut off)
        
    Returns:
    --------
    list : List of (context, compound_score) tuples, one per mention
    """
    entity_lower = entity.lower()
    text_lower = text.lower()
    
    results = []
    start = 0
    while True:
        pos = text_lower.find(entity_lower, start)
        if pos == -1:
            break
        
        # Extract context window
        context_start = max(0, pos - window)
        context_end = min(len(text), pos + len(entity) + window)
        context = text[context_start:context_end]
        
        # Analyze sentiment
        sentiment = sia.polarity_scores(context)['compound']
        results.append((context, sentiment))
        
        start = pos + 1
    
    return results

In [None]:
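# Analyze sentiment around mentions of key figures
# (documents are joined into one string, so a window near a document
# boundary can include text from the neighboring document)
all_text = ' '.join(corpus['text'])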

entities_to_check = ['Grigsu', 'Yasho', 'stone-school', 'water-school', 'dissolution']

for entity in entities_to_check:
    mentions = sentiment_around_entity(all_text, entity)
    if mentions:
        avg_sentiment = np.mean([s for _, s in mentions])
        print(f"\n'{entity}': {len(mentions)} mentions, average sentiment: {avg_sentiment:.3f}")
        
        # Show most positive and negative mentions
        sorted_mentions = sorted(mentions, key=lambda x: x[1])
        if sorted_mentions:
            print(f"  Most negative: {sorted_mentions[0][1]:.3f}")
            print(f"  Most positive: {sorted_mentions[-1][1]:.3f}")
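Character windows can cut words in half and mix in unrelated clauses. One alternative is to score whole sentences that mention the entity. The sketch below is one possible variant under stated assumptions: `sentences_mentioning` and `toy_score` are hypothetical names, the regex splitter is a naive stand-in for `sent_tokenize`, and the scorer is passed in as a function so the example runs without NLTK data.

```python
import re

def sentences_mentioning(text, entity, score_fn):
    """Score each sentence that mentions `entity` (case-insensitive)."""
    # Naive splitter: break after ., ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    needle = entity.lower()
    return [(s, score_fn(s)) for s in sentences if needle in s.lower()]

def toy_score(sentence):
    """Stand-in scorer for this demo only; not a real sentiment model."""
    lowered = sentence.lower()
    return 1.0 * ("brilliant" in lowered) - 1.0 * ("wrong" in lowered)

sample = "Grigsu is brilliant. The river rose early. Yasho says Grigsu is wrong!"
for sentence, score in sentences_mentioning(sample, "Grigsu", toy_score):
    print(f"{score:+.1f}  {sentence}")
```

In the notebook, you could pass `lambda s: sia.polarity_scores(s)['compound']` as `score_fn` and replace the regex split with `sent_tokenize`.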

## 8.6 Sentiment and Authenticity

Do suspected forgeries have different sentiment patterns than authentic documents?

In [None]:
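# Add authenticity status to sentiment data
# (skip the merge if this cell has already run, to avoid duplicate columns)
if 'authenticity_status' not in sentiment_df.columns:
    sentiment_df = sentiment_df.merge(
        manuscripts[['manuscript_id', 'authenticity_status']],
        on='manuscript_id', how='left'
    )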

# Compare authentic vs suspected
auth_comparison = sentiment_df.groupby('authenticity_status').agg({
    'compound': 'mean',
    'positive': 'mean',
    'negative': 'mean',
    'std_sentence_sentiment': 'mean',
    'manuscript_id': 'count'
}).rename(columns={'manuscript_id': 'num_docs'}).round(3)

print("Sentiment by authenticity status:")
print(auth_comparison)

## 8.7 Summary

In this tutorial, you learned:

1. **Sentiment analysis basics**: Positive, negative, neutral, compound scores
2. **VADER**: A rule-based sentiment analyzer
3. **Document-level analysis**: Overall and sentence-level sentiment
4. **Entity-level analysis**: Sentiment around specific mentions
5. **Comparative analysis**: Sentiment by author, genre, authenticity

### Insights from the Archive

Sentiment analysis of the archive can show:
- Which genres tend to be more emotionally charged
- How different scholars are discussed
- Whether disputed documents have unusual sentiment patterns

---

*The marginal annotations reveal much: readers who praised and readers who condemned, opinions that shifted over time, reputations that rose and fell. The ink has faded, but the sentiments persist.*

## Exercises

### Exercise 8.1: Custom Sentiment Lexicon
VADER is general-purpose. Create a custom lexicon for the archive's philosophical vocabulary (e.g., "dissolution" might be positive for water-school, negative for stone-school).

In [None]:
# YOUR CODE HERE


### Exercise 8.2: Sentiment Trajectory
For longer documents, plot how sentiment changes from beginning to end. Do arguments build to a climax?

In [None]:
# YOUR CODE HERE
