# The Lazy Book Report

Your professor has assigned a book report on "The Red-Headed League" by Arthur Conan Doyle. 

You haven't read the book. And out of stubbornness, you won't.

But you *have* learned NLP. Let's use it to answer the professor's questions without reading.

## Setup

First, let's fetch the text from Project Gutenberg and prepare it for analysis.

In [6]:
# Fetch and prepare text - RUN THIS CELL FIRST
import os
import urllib.request
import re
import spacy
from collections import Counter

os.makedirs("output", exist_ok=True)

url = 'https://www.gutenberg.org/files/1661/1661-0.txt'
req = urllib.request.Request(url, headers={'User-Agent': 'Python-urllib'})
with urllib.request.urlopen(req, timeout=30) as resp:
    text = resp.read().decode('utf-8')

# Strip Gutenberg boilerplate
text = text.split('*** START OF')[1].split('***')[1]
text = text.split('*** END OF')[0]

# Extract "The Red-Headed League" story (it's the second story in the collection)
matches = list(re.finditer(r'THE RED-HEADED LEAGUE', text, re.IGNORECASE))
story_start = matches[1].end()
story_text = text[story_start:]
story_end = re.search(r'\n\s*III\.\s*\n', story_text)
story_text = story_text[:story_end.start()] if story_end else story_text

# Split into 3 sections by word count
words = story_text.split()[:4000]
section_size = len(words) // 3
sections = [
    ' '.join(words[:section_size]),
    ' '.join(words[section_size:2*section_size]),
    ' '.join(words[2*section_size:])
]

print(f"Story loaded: {len(words)} words in {len(sections)} sections")
print(f"Section sizes: {[len(s.split()) for s in sections]}")

Story loaded: 4000 words in 3 sections
Section sizes: [1333, 1333, 1334]


## Professor's Questions

Your professor wants you to answer 5 questions about the story. Let's use NLP to find the answers.

---

## Question 1: Writing Style

> "This text is from the 1890s. What makes it different from modern writing?"

**NLP Method:** Use preprocessing to compute text statistics. Tokenize the text and calculate:
- Vocabulary richness (unique words / total words)
- Average sentence length
- Average word length

**Hint:** Formal, literary writing typically shows higher vocabulary richness and longer sentences than modern casual text.

In [7]:
# Your code here: compute text statistics
# You'll need: import string, import re
import string

#nltk.download('punkt_tab')
# - Tokenize: remove punctuation, lowercase
abbreviations = ['Mr.', 'Mrs.', 'Dr.', 'St.', 'etc.']
text_flag = story_text
for abbr in abbreviations:
    text_flag = text_flag.replace(abbr, abbr.replace('.', 'PROT'))

# Split on sentence-ending punctuation
sentences = re.findall(r'[^.!?]+[.!?]', text_flag)
# Restore abbreviation periods and strip whitespace
sentences_vector = [
    s.replace('PROT', '.').strip()
    for s in sentences
    if s.strip()
]

clean_text = story_text.translate(str.maketrans('', '', string.punctuation)).lower()
tokens = [word for s in sentences_vector for word in re.findall(r'[a-z]+', s)]

words_only = [t for t in tokens if t.isalpha()]

vocab_richness = len(set(words_only)) / len(words_only)
avg_sentence_length = len(tokens) / len(sentences)
avg_word_length = sum(len(w) for w in tokens) / len(tokens)

print(f"Vocabulary richness: {vocab_richness:.3f}")
print(f"Average sentence length: {avg_sentence_length:.2f} words")
print(f"Average word length: {avg_word_length:.2f} chars")



Vocabulary richness: 0.082
Average sentence length: 15.24 words
Average word length: 4.08 chars


---

## Question 2: Main Characters

> "Who are the main characters in this story?"

**NLP Method:** Use Named Entity Recognition (NER) to extract PERSON entities.

**Hint:** Use spaCy's `en_core_web_sm` model. Process the text and filter entities where `ent.label_ == 'PERSON'`. Count how often each name appears.

In [11]:
# Your code here: extract PERSON entities using spaCy NER
# You'll need: import spacy, nlp = spacy.load("en_core_web_sm")

nlp = spacy.load('en_core_web_sm')
doc = nlp(story_text)
characters = sorted({
    ent.text for ent in doc.ents if ent.label_ == 'PERSON'
})

character_counts = Counter(
    ent.text for ent in doc.ents if ent.label_ == "PERSON"
)
for name, count in character_counts.most_common():
    print(f"{name}: {count}")

# When done, save your findings:
# with open("output/characters.txt", "w") as f:
#     for name in your_character_list:
#         f.write(f"{name}\n")

with open("output/characters.txt", "w") as f:
    for name, count in character_counts.most_common():
        f.write(f"{name}\t{count}\n")

Holmes: 392
Watson: 71
Lestrade: 38
Rucastle: 33
McCarthy: 32
Arthur: 20
Hunter: 19
Frank: 18
Sherlock Holmes: 15
Wilson: 13
Hosmer Angel: 13
Windibank: 13
Merryweather: 12
Holder: 12
Peterson: 11
Roylott: 11
Miss Stoner: 11
Jones: 10
Turner: 10
Horner: 10
Mary: 10
Alice: 9
Henry Baker: 9
Jabez Wilson: 8
Hatherley: 8
Horsham: 8
Neville St. Clair: 8
Simon: 8
Duncan Ross: 7
Bradstreet: 7
Oakshott: 7
Hosmer: 6
Angel: 6
Ross: 6
James: 6
Lee: 6
Baker: 6
Stoke Moran: 6
Lysander Stark: 6
George Burnwell: 6
Mary Sutherland: 5
John Clay: 5
James Windibank: 5
Surrey: 5
Alpha: 5
Grimesby Roylott: 5
Stoper: 5
Fowler: 5
Vincent Spaulding: 4
James McCarthy: 4
John Openshaw: 4
Hudson: 4
Waterloo: 4
Pondicherry: 4
Swandam Lane: 4
Aloysius Doran: 4
Flora Millar: 4
Toller: 4
Albert: 3
Irene Adler: 3
William Crowder: 3
John: 3
Openshaw: 3
Kate: 3
Kent: 3
Hugh Boone: 3
Jem: 3
Ferguson: 3
Backwater: 3
Robert St. Simon: 3
Hatty Doran: 3
Doran: 3
Flora: 3
I. ‘: 2
William Morris: 2
Jump: 2
Hardy: 2
Miss Suthe

---

## Question 3: Story Locations

> "Where does the story take place?"

**NLP Method:** Use Named Entity Recognition (NER) to extract location entities (GPE and LOC).

**Hint:** Filter entities where `ent.label_` is 'GPE' (geopolitical entity) or 'LOC' (location).

In [12]:
# Your code here: extract GPE and LOC entities using spaCy NER

locations = sorted({
    ent.text
    for ent in doc.ents
    if ent.label_ in ("GPE", "LOC")
})

# When done, save your findings:
# with open("output/locations.txt", "w") as f:
#     for place in your_locations_list:
#         f.write(f"{place}\n")

with open("output/locations.txt", "w") as f:
    for place in locations:
        f.write(f"{place}\n")


---

## Question 4: Wilson's Business

> "What is Wilson's business?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's business.

**Hint:** Create a TF-IDF vectorizer, fit it on the 3 sections, then transform your query using the same vectorizer (`.transform()`, not `.fit_transform()` - you want to use the vocabulary learned from the sections). Find which section has the highest cosine similarity and read it to find the answer.

In [None]:
# Your code here: use TF-IDF similarity to find the relevant section
# You'll need: from sklearn.feature_extraction.text import TfidfVectorizer
#              from sklearn.metrics.pairwise import cosine_similarity

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = re.findall(r'[^.!?]+[.!?]', story_text)
query = "What is Wilson's business?"

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sentences + [query])

similarities = cosine_similarity(tfidf[-1], tfidf[:-1])[0]
best_sentence = sentences[similarities.argmax()].strip()


# When done, save your findings:
# with open("output/business.txt", "w") as f:
#     f.write("Wilson's business is: ...")

with open("output/business.txt", "w") as f:
    f.write(f"Wilson's business is: {best_sentence}")

---

## Question 5: Wilson's Work Routine

> "What is Wilson's daily work routine for the League?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's work routine.

**Hint:** Similar to Question 4 - use TF-IDF to find the section that best matches your query about work routine. The answer includes what Wilson had to do and what eventually happened.

In [None]:
# Your code here: use TF-IDF similarity to find the relevant section

# When done, save your findings:
# with open("output/routine.txt", "w") as f:
#     f.write("Wilson's work routine: ...\n")
#     f.write("What happened: ...\n")

