# The Lazy Book Report

Your professor has assigned a book report on "The Red-Headed League" by Arthur Conan Doyle. 

You haven't read the book. And out of stubbornness, you won't.

But you *have* learned NLP. Let's use it to answer the professor's questions without reading.

## Setup

First, let's fetch the text from Project Gutenberg and prepare it for analysis.

In [1]:
# Fetch and prepare text - RUN THIS CELL FIRST
import os
import urllib.request
import re

os.makedirs("output", exist_ok=True)

url = 'https://www.gutenberg.org/files/1661/1661-0.txt'
req = urllib.request.Request(url, headers={'User-Agent': 'Python-urllib'})
with urllib.request.urlopen(req, timeout=30) as resp:
    text = resp.read().decode('utf-8')

# Strip Gutenberg boilerplate
text = text.split('*** START OF')[1].split('***')[1]
text = text.split('*** END OF')[0]

# Extract "The Red-Headed League" story (it's the second story in the collection)
matches = list(re.finditer(r'THE RED-HEADED LEAGUE', text, re.IGNORECASE))
story_start = matches[1].end()
story_text = text[story_start:]
story_end = re.search(r'\n\s*III\.\s*\n', story_text)
story_text = story_text[:story_end.start()] if story_end else story_text

# Keep the first 4000 words and split them into 3 sections of roughly equal size
words = story_text.split()[:4000]
section_size = len(words) // 3
sections = [
    ' '.join(words[:section_size]),
    ' '.join(words[section_size:2*section_size]),
    ' '.join(words[2*section_size:])
]

print(f"Story loaded: {len(words)} words in {len(sections)} sections")
print(f"Section sizes: {[len(s.split()) for s in sections]}")

Story loaded: 4000 words in 3 sections
Section sizes: [1333, 1333, 1334]


## Professor's Questions

Your professor wants you to answer 5 questions about the story. Let's use NLP to find the answers.

---

## Question 1: Writing Style

> "This text is from the 1890s. What makes it different from modern writing?"

**NLP Method:** Use preprocessing to compute text statistics. Tokenize the text and calculate:
- Vocabulary richness (unique words / total words)
- Average sentence length
- Average word length

**Hint:** Formal, literary writing typically shows higher vocabulary richness and longer sentences than modern casual text.
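One caveat when comparing vocabulary richness across texts: the unique/total ratio tends to fall as a text gets longer, because common words keep repeating. Below is a minimal sketch of one workaround, averaging the ratio over fixed-size windows so texts of different lengths are compared on equal footing. The helper name `windowed_ttr` and the window size are illustrative assumptions, not part of the assignment.

```python
def windowed_ttr(tokens, window=100):
    """Mean unique/total ratio over consecutive, non-overlapping windows."""
    # Only full windows are used; a trailing partial window is dropped.
    chunks = [tokens[i:i + window]
              for i in range(0, len(tokens) - window + 1, window)]
    if not chunks:
        # Text shorter than one window: fall back to the plain ratio.
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    return sum(len(set(chunk)) / len(chunk) for chunk in chunks) / len(chunks)


short_tokens = "the cat sat on the mat".split()
long_tokens = short_tokens * 50  # same vocabulary, 50 times the length

# The plain ratio punishes the longer text even though the vocabulary is identical.
plain_short = len(set(short_tokens)) / len(short_tokens)
plain_long = len(set(long_tokens)) / len(long_tokens)

# The windowed ratio gives both texts the same score.
windowed_short = windowed_ttr(short_tokens, window=6)
windowed_long = windowed_ttr(long_tokens, window=6)

print(f"plain:    {plain_short:.3f} vs {plain_long:.3f}")
print(f"windowed: {windowed_short:.3f} vs {windowed_long:.3f}")
```

If you compare the story against a modern sample, trimming both to the same number of words before computing the plain ratio is a simpler alternative.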

In [2]:
# Your code here: compute text statistics
# You'll need: import string, import re
# - Tokenize: remove punctuation, lowercase
# - Sentences: split on sentence-ending punctuation
# Calculate vocab_richness, avg_sentence_length, avg_word_length
import string 
import re 

def compute_text_statistics(text):
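    # Tokenize: lowercase and strip punctuation. string.punctuation is ASCII-only,
    # so also strip the curly quotes that Gutenberg's UTF-8 text may contain, and
    # turn em-dashes (\u2014) into spaces so dash-joined words stay separate tokens.
    translator = str.maketrans('', '', string.punctuation + '\u201c\u201d\u2018\u2019')
    tokens = text.replace('\u2014', ' ').lower().translate(translator).split()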
    
    # Unique words
    unique_words = set(tokens)
    vocab_richness = len(unique_words) / len(tokens) if tokens else 0
    
    # Sentences
    sentences = re.split(r'[.!?]+', text)
    sentences = [s for s in sentences if s.strip()]
    avg_sentence_length = len(tokens) / len(sentences) if sentences else 0
    
    # Average word length
    total_word_length = sum(len(word) for word in tokens)
    avg_word_length = total_word_length / len(tokens) if tokens else 0
    
    return {
        'vocab_richness': vocab_richness,
        'avg_sentence_length': avg_sentence_length,
        'avg_word_length': avg_word_length
    }
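The sentence splitter above breaks on every period, so a title like "Mr." counts as a sentence end and pulls the average sentence length down. Here is a minimal sketch of one way to soften that, assuming a short, hand-picked abbreviation list. The list and the helper name `split_sentences` are hypothetical and far from exhaustive.

```python
import re

# Hypothetical, non-exhaustive list of abbreviations likely to appear in the story.
ABBREVIATIONS = ("Mr", "Mrs", "Dr", "St")


def split_sentences(text):
    """Split on ., ! and ? while ignoring periods after known abbreviations."""
    # Temporarily rewrite "Mr." as "Mr<DOT>" so the period is not seen as a sentence end.
    pattern = r'\b(' + '|'.join(ABBREVIATIONS) + r')\.'
    protected = re.sub(pattern, r'\1<DOT>', text)
    parts = re.split(r'[.!?]+', protected)
    # Restore the protected periods and drop empty fragments.
    return [p.replace('<DOT>', '.').strip() for p in parts if p.strip()]


sample = "Mr. Jabez Wilson laughed heavily. He came to see Mr. Holmes!"
naive = [s for s in re.split(r'[.!?]+', sample) if s.strip()]
careful = split_sentences(sample)

print(len(naive), "fragments with the naive split")
print(len(careful), "sentences with abbreviations protected")
```

A trained sentence segmenter (for example the one built into spaCy pipelines) would handle more cases, at the cost of an extra dependency for this question.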


---

## Question 2: Main Characters

> "Who are the main characters in this story?"

**NLP Method:** Use Named Entity Recognition (NER) to extract PERSON entities.

**Hint:** Use spaCy's `en_core_web_sm` model. Process the text and filter entities where `ent.label_ == 'PERSON'`. Count how often each name appears.

In [3]:
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


0

In [4]:
# Your code here: extract PERSON entities using spaCy NER
# You'll need: import spacy, nlp = spacy.load("en_core_web_sm")

# When done, save your findings:
# with open("output/characters.txt", "w") as f:
#     for name in your_character_list:
#         f.write(f"{name}\n")

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_person_entities(text):
    doc = nlp(text)
    persons = set()
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            persons.add(ent.text)
    return sorted(persons)  

# Process each section
for i, section in enumerate(sections):
    stats = compute_text_statistics(section)
    characters = extract_person_entities(section)
    
    # Save statistics
    with open(f"output/section_{i+1}_stats.txt", "w") as f:
        for key, value in stats.items():
            f.write(f"{key}: {value}\n")
    
    # Save character names
    with open(f"output/section_{i+1}_characters.txt", "w") as f:
        for name in characters:
            f.write(f"{name}\n")

# Characters over all sections
all_characters = set()
for section in sections:
    all_characters.update(extract_person_entities(section))
with open("output/characters.txt", "w") as f:
    for name in sorted(all_characters):
        f.write(f"{name}\n")
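The hint for this question asks how often each name appears, which a set cannot record. Below is a minimal sketch of the counting step using `collections.Counter`. To keep it runnable without a model, it takes plain `(text, label)` pairs, which you could build with `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `count_person_mentions` and the surname-merging heuristic are assumptions for illustration, and the sample pairs are hand-written, not real model output.

```python
from collections import Counter


def count_person_mentions(entities, merge_on_surname=True):
    """Count PERSON mentions from (text, label) pairs.

    With merge_on_surname=True, "Jabez Wilson" and "Wilson" are counted
    together by keying on the last word of each name (a rough heuristic).
    """
    counts = Counter()
    for text, label in entities:
        if label != "PERSON":
            continue
        name = text.strip()
        if not name:
            continue
        key = name.split()[-1] if merge_on_surname else name
        counts[key] += 1
    return counts


# Hand-written stand-in for [(ent.text, ent.label_) for ent in doc.ents]
entities = [
    ("Sherlock Holmes", "PERSON"),
    ("Holmes", "PERSON"),
    ("Jabez Wilson", "PERSON"),
    ("London", "GPE"),
    ("Wilson", "PERSON"),
    ("Holmes", "PERSON"),
]

counts = count_person_mentions(entities)
for name, n in counts.most_common():
    print(name, n)
```

Names with the most mentions are the likeliest main characters. The surname heuristic will wrongly merge two different people who share a last name, so check the merged list by eye.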

---

## Question 3: Story Locations

> "Where does the story take place?"

**NLP Method:** Use Named Entity Recognition (NER) to extract location entities (GPE and LOC).

**Hint:** Filter entities where `ent.label_` is 'GPE' (geopolitical entity) or 'LOC' (location).

In [5]:
# Your code here: extract GPE and LOC entities using spaCy NER

# When done, save your findings:
# with open("output/locations.txt", "w") as f:
#     for place in your_locations_list:
#         f.write(f"{place}\n")

def extract_location_entities(text):
    doc = nlp(text)
    locations = set()
    for ent in doc.ents:
        if ent.label_ in {"GPE", "LOC"}:
            locations.add(ent.text)
    return sorted(locations)

# Process each section for locations
for i, section in enumerate(sections):
    locations = extract_location_entities(section)
    
    # Save location names
    with open(f"output/section_{i+1}_locations.txt", "w") as f:
        for place in locations:
            f.write(f"{place}\n")

# Locations over all sections
all_locations = set()
for section in sections:
    all_locations.update(extract_location_entities(section))
with open("output/locations.txt", "w") as f:
    for place in sorted(all_locations):
        f.write(f"{place}\n")

---

## Question 4: Wilson's Business

> "What is Wilson's business?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's business.

**Hint:** Create a TF-IDF vectorizer, fit it on the 3 sections, then transform your query using the same vectorizer (`.transform()`, not `.fit_transform()` - you want to use the vocabulary learned from the sections). Find which section has the highest cosine similarity and read it to find the answer.
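To make the fit-versus-transform distinction concrete, here is a from-scratch sketch of TF-IDF with cosine similarity. It is not scikit-learn's implementation: it uses a simplified tokenizer and a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1 with L2 normalization, modeled on the defaults described in scikit-learn's documentation. The toy documents and all helper names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer: lowercase runs of two or more letters.
    return re.findall(r"[a-z]{2,}", text.lower())


def fit_idf(docs):
    """'fit': learn the vocabulary and IDF weights from the documents only."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def vectorize(text, idf):
    """'transform': weight known terms and L2-normalize; unseen terms are ignored."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


docs = [
    "the baker sells bread and cakes in a small shop",
    "the clerk copies pages every morning at the office",
    "the bank keeps gold coins in a vault",
]

idf = fit_idf(docs)                            # learned from the documents
doc_vecs = [vectorize(d, idf) for d in docs]
query_vec = vectorize("baker shop", idf)       # reuses the learned vocabulary

scores = [cosine(query_vec, v) for v in doc_vecs]
best = max(range(len(docs)), key=scores.__getitem__)
print("best match:", best, [round(s, 3) for s in scores])
```

Because the weights come from the documents alone, a query word that never occurs in them contributes nothing to the score instead of reshaping the vocabulary.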

In [6]:
# Your code here: use TF-IDF similarity to find the relevant section
# You'll need: from sklearn.feature_extraction.text import TfidfVectorizer
#              from sklearn.metrics.pairwise import cosine_similarity

# When done, save your findings:
# with open("output/business.txt", "w") as f:
#     f.write("Wilson's business is: ...")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

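def find_relevant_section(sections, query):
    """Return the index of the section most similar to the query under TF-IDF."""
    vectorizer = TfidfVectorizer()
    # Learn the vocabulary and IDF weights from the sections only
    section_matrix = vectorizer.fit_transform(sections)
    # Project the query into that same vocabulary; unseen words are ignored
    query_vector = vectorizer.transform([query])
    cosine_similarities = cosine_similarity(query_vector, section_matrix)
    # argmax over the single row of scores; cast NumPy's integer to a plain int
    return int(cosine_similarities.argmax())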

query = "Wilson's business"
relevant_index = find_relevant_section(sections, query)
business_section = sections[relevant_index]
with open("output/business.txt", "w") as f:
    f.write(f"Wilson's business is described in section {relevant_index + 1}:\n\n")
    f.write(business_section)   

---

## Question 5: Wilson's Work Routine

> "What is Wilson's daily work routine for the League?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's work routine.

**Hint:** Similar to Question 4 - use TF-IDF to find the section that best matches your query about work routine. The answer includes what Wilson had to do and what eventually happened.

In [7]:
# Your code here: use TF-IDF similarity to find the relevant section

# When done, save your findings:
# with open("output/routine.txt", "w") as f:
#     f.write("Wilson's work routine: ...\n")
#     f.write("What happened: ...\n")
query = "Wilson's work routine and what happened"
relevant_index = find_relevant_section(sections, query)
routine_section = sections[relevant_index]
with open("output/routine.txt", "w") as f:
    f.write(f"Wilson's work routine and what happened is described in section {relevant_index + 1}:\n\n")
    f.write(routine_section)
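Each of the three sections runs to more than a thousand words, so "reading the best section" is still a fair amount of reading. Here is a minimal sketch of one optional refinement, assuming you are free to chunk the text more finely: cut the word list into overlapping windows and rank those instead. The helper name `sliding_windows` and the sizes are hypothetical choices, not part of the assignment.

```python
def sliding_windows(words, size=200, step=100):
    """Return overlapping word windows as strings; the last one may be shorter."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    for start in range(0, len(words), step):
        windows.append(' '.join(words[start:start + size]))
        # Stop once a window has reached the end of the text.
        if start + size >= len(words):
            break
    return windows


# Toy word list standing in for the story's words
toy_words = [f"w{i}" for i in range(10)]
chunks = sliding_windows(toy_words, size=4, step=2)
for chunk in chunks:
    print(chunk)
```

The resulting list of strings can be passed to a TF-IDF ranking function in place of the three sections, so the best match is a short passage. The overlap keeps a sentence that straddles a boundary intact in at least one window.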