# Sentence-Level Analysis — *War and Peace* & *Anna Karenina*

This notebook applies the **sentence types** analysis (declarative, interrogative, and exclamative sentences, plus dialogue detection)
to two Tolstoy novels: *War and Peace* and *Anna Karenina*. The structure follows the professor's Session 3 notes
and reuses the project folder layout from your Session 2 notebook (data files in `../data/`).

**What this notebook does**
- Load the raw text files from `../data/`
- Clean Gutenberg headers/footers (simple heuristics)
- Split text into sentences and classify sentence types
- Produce summary tables and visualizations


In [1]:

# Imports and setup
import re
from pathlib import Path
import nltk
from nltk.tokenize import sent_tokenize
from collections import defaultdict, Counter
import pandas as pd
import matplotlib.pyplot as plt

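# Ensure the Punkt sentence tokenizer data is available.
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt'; requesting both is harmless.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)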


True

In [2]:

# Paths (works when run from notebooks/ or from the project root)
cwd = Path.cwd()
DATA_DIR = cwd.parent / "data" if cwd.name == "notebooks" else cwd / "data"
RESULTS_DIR = cwd.parent / "results" if cwd.name == "notebooks" else cwd / "results"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

FILE_ANNA = "The Project Gutenberg eBook of Anna Karenina, by Leo Tolstoy.txt"
FILE_WAR  = "The Project Gutenberg eBook of War and Peace, by Leo Tolstoy.txt"

path_anna = DATA_DIR / FILE_ANNA
path_war  = DATA_DIR / FILE_WAR

# Check files
missing = []
for p in (path_anna, path_war):
    if not p.exists():
        missing.append(str(p))
if missing:
    raise FileNotFoundError("Missing required text files. Please place them in the data/ folder:\n" + "\n".join(missing))

# Load raw texts preserving punctuation and newlines
raw_anna = path_anna.read_text(encoding='utf-8', errors='ignore')
raw_war  = path_war.read_text(encoding='utf-8', errors='ignore')

# Trim Gutenberg header/footer if present
def trim_gutenberg(text):
    start_m = re.search(r"\*\*\* START OF .*?\*\*\*", text, re.IGNORECASE)
    end_m   = re.search(r"\*\*\* END OF .*?\*\*\*", text, re.IGNORECASE)
    if start_m and end_m and start_m.end() < end_m.start():
        return text[start_m.end():end_m.start()]
    return text

raw_anna = trim_gutenberg(raw_anna)
raw_war  = trim_gutenberg(raw_war)

# For compatibility with cells expecting 'clean_anna'/'clean_war', set them to raw strings (not token lists)
clean_anna = raw_anna
clean_war = raw_war

print('Loaded texts. Lengths (characters):', len(clean_anna), len(clean_war))


Loaded texts. Lengths (characters): 1963463 3203449
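
The trimming step can be checked in isolation. The sketch below copies `trim_gutenberg` from the cell above and runs it on a made-up miniature file (the `toy` string is hypothetical, not taken from either novel), to show that only the text between the two `***` marker lines survives and that a text without markers is returned unchanged.

```python
import re

def trim_gutenberg(text):
    """Same logic as the notebook cell: keep only the text between the START and END markers."""
    start_m = re.search(r"\*\*\* START OF .*?\*\*\*", text, re.IGNORECASE)
    end_m = re.search(r"\*\*\* END OF .*?\*\*\*", text, re.IGNORECASE)
    if start_m and end_m and start_m.end() < end_m.start():
        return text[start_m.end():end_m.start()]
    return text

# Hypothetical miniature "ebook" used only for this illustration
toy = (
    "Licence preamble\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK TOY ***\n"
    "Chapter 1. Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK TOY ***\n"
    "Licence footer\n"
)
trimmed = trim_gutenberg(toy).strip()
print(trimmed)

# Without both markers the function falls through and returns the input as-is
no_markers = "Just text without markers."
print(trim_gutenberg(no_markers) == no_markers)
```

Because `.` does not match newlines by default, each marker has to sit on a single line for the pattern to match; if a file splits a marker across lines, the function silently returns the untrimmed text.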


In [3]:

# Sentence tokenization and classification functions
def safe_sent_tokenize(text):
    """Tokenize into sentences robustly. Accepts string or bytes. Returns list of sentences."""
    if text is None:
        return []
    if isinstance(text, bytes):
        try:
            text = text.decode('utf-8', errors='ignore')
        except Exception:
            text = str(text)
    if not isinstance(text, str):
        text = str(text)
    # Use nltk.sent_tokenize (punkt) for best results; fallback to regex if it fails
    try:
        sents = sent_tokenize(text)
        return [s.strip() for s in sents if s.strip()]
    except Exception:
        fallback = re.split(r'(?<=[\.\?\!])\s+', text)
        return [s.strip() for s in fallback if s.strip()]

def classify_sentences(sentences):
    """Classify sentences into declarative, interrogative, exclamative, and dialogue."""
    groups = defaultdict(list)
    for s in sentences:
        ss = s.strip()
        if not ss:
            continue
        # Dialogue heuristic, checked first: any sentence containing a quotation character or
        # starting with a dash/apostrophe counts as dialogue, even if it ends in '?' or '!'
        if re.search(r'["“”«»„‟]', ss) or ss.startswith(('—', '–', '-', "'")):
            groups['dialogue'].append(ss)
        elif ss.endswith('?'):
            groups['interrogative'].append(ss)
        elif ss.endswith('!'):
            groups['exclamative'].append(ss)
        else:
            groups['declarative'].append(ss)
    total = sum(len(v) for v in groups.values())
    counts = {
        'total': total,
        'declarative': len(groups['declarative']),
        'interrogative': len(groups['interrogative']),
        'exclamative': len(groups['exclamative']),
        'dialogue': len(groups['dialogue']),
    }
    return counts, groups
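
Because `classify_sentences` tests for dialogue first, a quoted question is counted as `dialogue` and never as `interrogative`. One way to see how much the two properties overlap is to record them independently. The following is a minimal sketch, not part of the original analysis: `cross_tabulate` and `QUOTE_RE` are hypothetical names, the sample sentences are invented, and the quote test is simplified to quotation characters only (it skips the leading-dash check).

```python
import re
from collections import Counter

# Quotation characters only; a simplification of the notebook's dialogue heuristic
QUOTE_RE = re.compile(r'["“”«»„‟]')

def cross_tabulate(sentences):
    """Count (is_quoted, sentence_type) pairs so the two properties are not mutually exclusive."""
    table = Counter()
    for s in sentences:
        ss = s.strip()
        if not ss:
            continue
        quoted = bool(QUOTE_RE.search(ss))
        # Ignore trailing closing quotes when reading the terminal punctuation
        core = ss.rstrip('"”’\'»')
        if core.endswith('?'):
            kind = 'interrogative'
        elif core.endswith('!'):
            kind = 'exclamative'
        else:
            kind = 'declarative'
        table[(quoted, kind)] += 1
    return table

# Invented sample sentences, for illustration only
sample = [
    '“What did he answer?”',
    'They are late again!',
    'The room was quiet.',
    '“Well,” she said.',
]
for key, n in sorted(cross_tabulate(sample).items()):
    print(key, n)
```

Applied to the output of `safe_sent_tokenize`, a table like this would show how many questions and exclamations sit inside quoted speech, which the four mutually exclusive categories cannot reveal.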


In [4]:

# Run analysis on both books and build DataFrame df
texts = {
    'War and Peace': clean_war,
    'Anna Karenina': clean_anna
}

results = {}
groups_store = {}
for name, text in texts.items():
    sents = safe_sent_tokenize(text)
    counts, groups = classify_sentences(sents)
    results[name] = counts
    groups_store[name] = groups

df = pd.DataFrame.from_dict(results, orient='index').fillna(0).astype(int)
# compute percentages columns
for col in ['declarative','interrogative','exclamative','dialogue']:
    df[col + '_pct'] = (df[col] / df['total']).fillna(0) * 100

# Reorder columns
df = df[['total','declarative','declarative_pct','interrogative','interrogative_pct','exclamative','exclamative_pct','dialogue','dialogue_pct']]
df.index.name = 'book'

# Save results CSV
df.to_csv(RESULTS_DIR / 'sentence_type_counts_percentages.csv')
print('Analysis complete. DataFrame "df" created with shape:', df.shape)
df


Analysis complete. DataFrame "df" created with shape: (2, 9)


book,total,declarative,declarative_pct,interrogative,interrogative_pct,exclamative,exclamative_pct,dialogue,dialogue_pct
War and Peace,26431,15650,59.210775,704,2.663539,761,2.879195,9316,35.246491
Anna Karenina,16807,9319,55.447135,420,2.498959,286,1.701672,6782,40.352234
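
As a quick arithmetic check on the table, the percentages can be recomputed from the counts alone. The sketch below hard-codes the counts printed above (it does not reload the texts) and also reports the difference between the two books in percentage points; the variable names are illustrative.

```python
# Counts copied from the results table (not recomputed from the texts)
counts = {
    'War and Peace': {'declarative': 15650, 'interrogative': 704, 'exclamative': 761, 'dialogue': 9316},
    'Anna Karenina': {'declarative': 9319, 'interrogative': 420, 'exclamative': 286, 'dialogue': 6782},
}

shares = {}
for book, c in counts.items():
    total = sum(c.values())  # the four categories are mutually exclusive, so they sum to the total
    shares[book] = {cat: 100 * n / total for cat, n in c.items()}
    print(f"{book}: total={total}")
    for cat, pct in shares[book].items():
        print(f"  {cat:<14}{pct:6.2f}%")

# Difference in percentage points (Anna Karenina minus War and Peace)
diff = {
    cat: shares['Anna Karenina'][cat] - shares['War and Peace'][cat]
    for cat in counts['War and Peace']
}
for cat, d in diff.items():
    print(f"{cat:<14}{d:+.2f} pp")
```

Comparing shares rather than raw counts matters here because *War and Peace* has roughly 1.6 times as many sentences as *Anna Karenina*.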


In [5]:

# Print human-friendly summaries and examples (3 examples per class)
def show_summary(name, groups, counts, show_n=3):
    print(f"\n=== {name} ===")
    print("Counts:", counts)
    for cat in ['declarative','interrogative','exclamative','dialogue']:
        items = groups.get(cat, [])
        if not items:
            print(f"-- {cat} (showing 0) --")
            continue
        n_show = len(items) if show_n is None else min(show_n, len(items))
        print(f"-- {cat} (showing {n_show}) --")
        for i in range(n_show):
            s = items[i].replace('\n', ' ').strip()
            if len(s) > 300:
                s = s[:300] + " …[truncated]"
            print("  ", s)

# Display for both books
for name in ['War and Peace', 'Anna Karenina']:
    show_summary(name, groups_store[name], results[name], show_n=3)



=== War and Peace ===
Counts: {'total': 26431, 'declarative': 15650, 'interrogative': 704, 'exclamative': 761, 'dialogue': 9316}
-- declarative (showing 3) --
   With these words she greeted Prince Vasíli Kurágin, a man of high rank and importance, who was the first to arrive at her reception.
   Anna Pávlovna had had a cough for some days.
   She was, as she said, suffering from la grippe; grippe being then a new word in St. Petersburg, used only by the elite.
-- interrogative (showing 3) --
   But how do you do?
   Well, and what has been decided about Novosíltsev’s dispatch?
   What answer did Novosíltsev get?
-- exclamative (showing 3) --
   But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself!
   She is betraying us!
   That is the one t

In [6]:

# Visualization: save three images (counts, percentages, pies)
import numpy as np

cats = ['declarative', 'interrogative', 'exclamative', 'dialogue']

def plot_counts(df, out_path):
    labels = cats
    x = np.arange(len(labels))
    width = 0.35
    books = df.index.tolist()
    if len(books) < 2:
        book1 = books[0]
        book2 = None
    else:
        book1, book2 = books[0], books[1]
    vals1 = [int(df.loc[book1, c]) for c in labels]
    if book2:
        vals2 = [int(df.loc[book2, c]) for c in labels]
    else:
        vals2 = [0]*len(labels)
        book2 = 'Other'
    plt.figure(figsize=(10,6))
    plt.bar(x - width/2, vals1, width, label=book1)
    plt.bar(x + width/2, vals2, width, label=book2)
    plt.xticks(x, labels, rotation=30)
    plt.ylabel('Counts')
    plt.title('Sentence Type Counts')
    plt.legend()
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()

def plot_percentages(df, out_path):
    labels = cats
    x = np.arange(len(labels))
    width = 0.35
    books = df.index.tolist()
    if len(books) < 2:
        book1 = books[0]
        book2 = None
    else:
        book1, book2 = books[0], books[1]
    vals1 = [float(df.loc[book1, c + '_pct']) for c in labels]
    if book2:
        vals2 = [float(df.loc[book2, c + '_pct']) for c in labels]
    else:
        vals2 = [0.0]*len(labels)
        book2 = 'Other'
    plt.figure(figsize=(10,6))
    plt.bar(x - width/2, vals1, width, label=book1)
    plt.bar(x + width/2, vals2, width, label=book2)
    plt.xticks(x, labels, rotation=30)
    plt.ylabel('Percentage (%)')
    plt.title('Sentence Type Percentages')
    plt.legend()
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()

def plot_pies(df, out_path):
    books = df.index.tolist()
    if len(books) < 2:
        book1 = books[0]
        book2 = None
    else:
        book1, book2 = books[0], books[1]
    vals1 = [int(df.loc[book1, c]) for c in cats]
    if book2:
        vals2 = [int(df.loc[book2, c]) for c in cats]
    else:
        vals2 = [0]*len(cats)
        book2 = 'Other'
    labels = ['Declarative','Interrogative','Exclamative','Dialogue']
    fig, axes = plt.subplots(1, 2, figsize=(12,6))
    axes[0].pie(vals1, labels=labels, autopct=lambda p: f'{p:.1f}%' if p>0 else '', startangle=90)
    axes[0].set_title(book1)
    axes[1].pie(vals2, labels=labels, autopct=lambda p: f'{p:.1f}%' if p>0 else '', startangle=90)
    axes[1].set_title(book2)
    plt.suptitle('Sentence Type Distributions (Pie Charts)')
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()

# Save images
plot_counts(df, RESULTS_DIR / 'sentence_type_counts.png')
plot_percentages(df, RESULTS_DIR / 'sentence_type_percentages.png')
plot_pies(df, RESULTS_DIR / 'sentence_type_pies.png')

print('Saved 3 images to', RESULTS_DIR)


Saved 3 images to c:\Users\Omen\Documents\GitHub\ThirdProject\results


In [12]:

# Save example sentences per category (first 200 each, in text order) to CSV files for manual inspection
for book, groups in groups_store.items():
    rows = []
    for cat in ['declarative','interrogative','exclamative','dialogue']:
        items = groups.get(cat, [])[:200]  # up to 200 examples per category
        for i, sent in enumerate(items):
            rows.append({'category': cat, 'example_index': i+1, 'sentence': sent})
    pd.DataFrame(rows).to_csv(RESULTS_DIR / f"{book.replace(' ', '_')}_sentence_examples.csv", index=False)

print('Saved sentence example CSVs to', RESULTS_DIR)


Saved sentence example CSVs to c:\Users\Omen\Documents\GitHub\ThirdProject\results


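**What the results mean**

- High declarative share (59.2% in *War and Peace*, 55.4% in *Anna Karenina*) → Both novels are strongly narrative-driven, with much of the text given to exposition, description, and the progression of events.

- Low interrogative share (about 2.5–2.7%) → This counts only questions that contain no quotation marks, because the classifier assigns any sentence with a quote character to `dialogue` before it checks the final punctuation. It therefore understates how often characters ask questions and should not be read as evidence that conversations are monologue-like.

- Substantial dialogue share (35.2% and 40.4%) → Dialogue is prominent but not dominant. The figure is approximate: the heuristic flags any sentence containing a quotation mark and misses continuation sentences inside a long quotation (the interrogative examples printed above, such as "But how do you do?", appear to be spoken lines).

- Rare exclamative sentences (2.9% and 1.7%) → Unquoted exclamations are uncommon, which is consistent with a restrained, measured narrative tone. Exclamations inside quoted speech are counted under `dialogue`, so this figure is a lower bound.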

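**Overall interpretation**
Under this heuristic, *War and Peace* reads like a historical epic with measured storytelling: its sentence profile is consistent with a calm, expansive narrative style that favors description and reflection over rapid conversational exchange. *Anna Karenina* shows the same overall pattern, with a somewhat higher dialogue share (40.4% vs. 35.2%) and fewer unquoted exclamations (1.7% vs. 2.9%).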