# The Lazy Book Report

Your professor has assigned a book report on "The Red-Headed League" by Arthur Conan Doyle. 

You haven't read the book. And out of stubbornness, you won't.

But you *have* learned NLP. Let's use it to answer the professor's questions without reading.

## Setup

First, let's fetch the text from Project Gutenberg and prepare it for analysis.

In [1]:
# Fetch and prepare text - RUN THIS CELL FIRST
import os
import urllib.request
import re

os.makedirs("output", exist_ok=True)

url = 'https://www.gutenberg.org/files/1661/1661-0.txt'
req = urllib.request.Request(url, headers={'User-Agent': 'Python-urllib'})
with urllib.request.urlopen(req, timeout=30) as resp:
    text = resp.read().decode('utf-8')

# Strip Gutenberg boilerplate: keep only the text between the START and END markers
text = text.split('*** START OF')[1].split('***', 1)[1]
text = text.split('*** END OF')[0]

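# Extract "The Red-Headed League" (the second story in the collection).
# The first match is expected to be the table-of-contents entry, so take the second one.
matches = list(re.finditer(r'THE RED-HEADED LEAGUE', text, re.IGNORECASE))
story_start = matches[1].end()
story_text = text[story_start:]
# Cut the story at the next heading ("III."). If the pattern is not found,
# story_text keeps everything up to the end of the collection.
story_end = re.search(r'\n\s*III\.\s*\n', story_text)
story_text = story_text[:story_end.start()] if story_end else story_text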

# Keep the first 4000 words and split them into 3 sections of roughly equal size
words = story_text.split()[:4000]
section_size = len(words) // 3
sections = [
    ' '.join(words[:section_size]),
    ' '.join(words[section_size:2*section_size]),
    ' '.join(words[2*section_size:])
]

print(f"Story loaded: {len(words)} words in {len(sections)} sections")
print(f"Section sizes: {[len(s.split()) for s in sections]}")

Story loaded: 4000 words in 3 sections
Section sizes: [1333, 1333, 1334]


## Professor's Questions

Your professor wants you to answer 5 questions about the story. Let's use NLP to find the answers.

---

## Question 1: Writing Style

> "This text is from the 1890s. What makes it different from modern writing?"

**NLP Method:** Use preprocessing to compute text statistics. Tokenize the text and calculate:
- Vocabulary richness (unique words / total words)
- Average sentence length
- Average word length

**Hint:** Formal, literary writing typically shows higher vocabulary richness and longer sentences than modern casual text.
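
As a quick illustration of the three statistics, here is a minimal sketch on a made-up two-sentence sample. The sample text and the `text_stats` helper are assumptions for illustration, not part of the assignment. It splits sentences naively on `.`, `!`, and `?`, so abbreviations such as "Mr." would need extra handling on the real story.

```python
import re
import string


def text_stats(text):
    """Return (vocabulary richness, avg sentence length, avg word length)."""
    # Strip ASCII punctuation and lowercase so "Cat." and "cat" count as one word.
    cleaned = text.translate(str.maketrans('', '', string.punctuation)).lower()
    tokens = cleaned.split()
    # Naive sentence split: a run of non-terminators followed by ., ! or ?
    sentences = [s for s in re.findall(r'[^.!?]+[.!?]', text) if s.strip()]
    richness = len(set(tokens)) / len(tokens)
    avg_sentence_len = len(tokens) / len(sentences)
    avg_word_len = sum(len(t) for t in tokens) / len(tokens)
    return richness, avg_sentence_len, avg_word_len


# Hypothetical sample text, not taken from the story.
sample = "The dog saw the cat. The cat ran!"
richness, avg_sentence_len, avg_word_len = text_stats(sample)
print(richness, avg_sentence_len, avg_word_len)
```

The sample has 8 tokens, 5 of them unique, spread over 2 sentences, which makes the three ratios easy to verify by hand before running the same logic on the full text.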

In [2]:
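# Standard-library helpers for punctuation stripping and regex matching
import string
import re

# Tokenize: strip punctuation, lowercase, split on whitespace.
# Note: string.punctuation covers ASCII only, so curly quotes stay attached to words.
words_tokenized = story_text.translate(str.maketrans('', '', string.punctuation)).lower().split()

# Sentences
# Formal, literary writing uses many titles (Mr., Dr., ...), so splitting on every
# period would cut sentences short. Protect known abbreviations before splitting.
# Initial check used to find candidate abbreviations:
#   set(re.findall(r'\b([A-Z][a-z]{0,3}\.)\s+\w', story_text))
find_nuisance_words = set(re.findall(r'\b(Mr\.|Mrs\.|Dr\.|St\.|etc\.)', story_text))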
text_flag = story_text
for nw in find_nuisance_words:
    text_flag = text_flag.replace(nw, nw.replace('.', 'PROT'))
sentences = re.findall(r'[^.!?]+[.!?]', text_flag)
sentences = [s.replace('PROT', '.').strip() for s in sentences if s.strip()]

# Calculate vocab_richness, avg_sentence_length, avg_word_length
vocab_richness = len(set(words_tokenized)) / len(words_tokenized)
avg_sentence_length = len(words_tokenized) / len(sentences)
avg_word_length = sum(len(word) for word in words_tokenized) / len(words_tokenized)

# Output Results
print(f"Vocabulary Richness: {vocab_richness:.4f}")
print(f"Average Sentence Length: {avg_sentence_length:.2f} words")
print(f"Average Word Length: {avg_word_length:.2f} characters")

Vocabulary Richness: 0.0997
Average Sentence Length: 15.53 words
Average Word Length: 4.19 characters


---

## Question 2: Main Characters

> "Who are the main characters in this story?"

**NLP Method:** Use Named Entity Recognition (NER) to extract PERSON entities.

**Hint:** Use spaCy's `en_core_web_sm` model. Process the text and filter entities where `ent.label_ == 'PERSON'`. Count how often each name appears.
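
The filter-and-count step can be sketched without loading the model. The example below uses hypothetical `(text, label)` pairs standing in for `(ent.text, ent.label_)` from `doc.ents`; the entity list is made up, and real spaCy output on the story will differ.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for (ent.text, ent.label_);
# these are illustrative values, not actual model output.
entities = [
    ("Jabez Wilson", "PERSON"),
    ("London", "GPE"),
    ("Holmes", "PERSON"),
    ("Jabez Wilson", "PERSON"),
    ("Saturday", "DATE"),
    ("Holmes", "PERSON"),
    ("Holmes", "PERSON"),
]

# Keep only PERSON entities, then count how often each name appears.
person_counts = Counter(text for text, label in entities if label == "PERSON")

# most_common() sorts by frequency, so likely main characters come first.
for name, count in person_counts.most_common():
    print(f"{name}: {count}")
```

Sorting by frequency is a design choice: names that the model tags only once or twice are more likely to be minor characters or tagging mistakes.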

In [3]:
# Extract PERSON entities using spaCy NER
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(story_text)
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Count how often each name appears
characters = Counter(people)

# Save findings, most frequent names first
with open("output/characters.txt", "w", encoding="utf-8") as f:
    for name, count in characters.most_common():
        f.write(f"{name}: {count}\n")

---

## Question 3: Story Locations

> "Where does the story take place?"

**NLP Method:** Use Named Entity Recognition (NER) to extract location entities (GPE and LOC).

**Hint:** Filter entities where `ent.label_` is 'GPE' (geopolitical entity) or 'LOC' (location).

In [4]:

# Extract GPE and LOC entities using spaCy NER (a set removes duplicates)
locations = {ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")}

# Save findings in alphabetical order
with open("output/locations.txt", "w", encoding="utf-8") as f:
    for place in sorted(locations):
        f.write(f"{place}\n")

---

## Question 4: Wilson's Business

> "What is Wilson's business?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's business.

**Hint:** Create a TF-IDF vectorizer, fit it on the 3 sections, then transform your query using the same vectorizer (`.transform()`, not `.fit_transform()` - you want to use the vocabulary learned from the sections). Find which section has the highest cosine similarity and read it to find the answer.
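
To see why the query is transformed with the vocabulary learned from the sections, here is a hand-rolled sketch of the fit/transform split. It is a simplified illustration, not scikit-learn's implementation: the `tokenize`, `fit`, `transform`, and `cosine` helpers, the smoothed IDF formula, and the three sample documents are all assumptions, and the scores are not meant to match `TfidfVectorizer`'s numbers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters; a simplification for illustration.
    return re.findall(r"[a-z]+", text.lower())


def fit(docs):
    """Learn an IDF weight for every word that appears in the documents."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(tokenize(doc)))
    # Smoothed IDF (an assumption): words found in fewer documents weigh more.
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def transform(text, idf):
    """Weight term counts by the learned IDF; out-of-vocabulary words are dropped."""
    counts = Counter(w for w in tokenize(text) if w in idf)
    return {w: c * idf[w] for w, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up stand-ins for the three story sections.
docs = [
    "the league paid four pounds a week",
    "wilson runs a small pawnbroker business",
    "holmes watched the bank cellar at night",
]
idf = fit(docs)                                  # vocabulary comes from the sections only
doc_vecs = [transform(d, idf) for d in docs]
query_vec = transform("wilson's business", idf)  # reuse the learned vocabulary
scores = [cosine(query_vec, v) for v in doc_vecs]
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Fitting on the query instead would build a vocabulary from the query alone, so the section vectors and the query vector would no longer share the same dimensions or weights.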

In [5]:
# Use TF-IDF similarity to find the section most relevant to the query

# Import libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit the vectorizer on the three sections
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sections)

# Transform the query with the vocabulary learned from the sections
looking_for = "Wilson's business"
looking_vec = vectorizer.transform([looking_for])
cosine = cosine_similarity(looking_vec, X).flatten()
print(cosine)

# Pick the section with the highest cosine similarity and split it into sentences
highest_similarity = cosine.argmax()
section = sections[highest_similarity]
sentences_section = re.findall(r'[^.!?]+[.!?]', section)  # naive split: text up to the next ., ! or ?

# Print the first sentence in that section that mentions "business"
for sentence in sentences_section:
    sentence = sentence.strip()
    if "business" in sentence.lower():
        print(sentence)
        break

# Save findings (UTF-8, because the answer contains a curly apostrophe)
with open("output/business.txt", "w", encoding="utf-8") as f:
    f.write("Wilson's business is: a small pawnbroker’s business at Coburg Square")

[0.03360193 0.06577618 0.04978973]
Sherlock Holmes,” said Jabez Wilson, mopping his forehead; “I have a small pawnbroker’s business at Coburg Square, near the City.


---

## Question 5: Wilson's Work Routine

> "What is Wilson's daily work routine for the League?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's work routine.

**Hint:** Similar to Question 4 - use TF-IDF to find the section that best matches your query about work routine. The answer includes what Wilson had to do and what eventually happened.
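
In dialogue, the answer often sits in the sentence right after the one that matches the query, so it can help to return each match together with its follow-up. The sketch below shows one way to do that safely; the `matches_with_context` helper and the sample sentences are hypothetical and not quoted from the story.

```python
def matches_with_context(sentences, keywords):
    """Return (sentence, following sentence) pairs for sentences containing a keyword."""
    results = []
    for i, sentence in enumerate(sentences):
        if any(k in sentence.lower() for k in keywords):
            # The last sentence has no follow-up, so fall back to an empty string.
            following = sentences[i + 1] if i + 1 < len(sentences) else ""
            results.append((sentence, following))
    return results


# Made-up sentences for illustration only.
sentences = [
    "What about the work?",
    "It is to copy a book.",
    "The pay was good.",
    "I liked the work.",
]
pairs = matches_with_context(sentences, ["work"])
for sentence, following in pairs:
    print(sentence, "->", following)
```

Note that the check is a plain substring match, so a keyword such as "work" also matches "working" or "workshop".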

In [6]:
# Use TF-IDF similarity to find the relevant section

# Query describing Wilson's daily work routine for the League
query = "Wilson daily work routine League"

# Follow the same steps as in Question 4, reusing the fitted vectorizer
query_vec = vectorizer.transform([query])
cosine = cosine_similarity(query_vec, X).flatten()
print(cosine)
highest_similarity = cosine.argmax()
section = sections[highest_similarity]
sentences_in_section = re.findall(r'[^.!?]+[.!?]', section)

# Lowercase the query words for case-insensitive matching (no words are filtered out)
query_words = [w.lower() for w in query.split()]
print(f"Searching for: {query_words}")

# Print every sentence containing a query word, plus the sentence after it,
# because a question such as 'and the work?' is answered in the follow-up.
for i, sentence in enumerate(sentences_in_section):
    sentence = sentence.strip().lower()
    if any(word in sentence for word in query_words):
        # Guard against a match in the last sentence, which has no follow-up
        following = sentences_in_section[i + 1] if i + 1 < len(sentences_in_section) else ''
        print(sentence, following, '\n')

# Save findings (UTF-8, because the title contains "æ")
with open("output/routine.txt", "w", encoding="utf-8") as f:
    f.write("Wilson's work routine: Is to copy out the _Encyclopædia Britannica_\n")
    f.write("What happened: the red-headed league is dissolved\n")

[0.03942181 0.0336671  0.04271273]
Searching for: ['wilson', 'daily', 'work', 'routine', 'league']
jabez wilson,’ said my assistant, ‘and he is willing to fill a vacancy in the league. ’ “‘And he is admirably suited for it,’ the other answered. 

wilson?  Have you a family? 

wilson! ’ said Vincent Spaulding. 

’ “‘and the work? ’ “‘Is purely nominal. 

’ “‘and the work? ’ “‘Is to copy out the _Encyclopædia Britannica_. 

jabez wilson, and let me congratulate you once more on the important position which you have been fortunate enough to gain. ’ He bowed me out of the room and I went home with my assistant, hardly knowing what to say or do, I was so pleased at my own good fortune. 

duncan ross was there to see that i got fairly to work.  He started me off upon the letter A, and then he left me; but he would drop in from time to time to see that all was right with me. 

holmes, and on saturday the manager came in and planked down four golden sovereigns for my week’s work.  It was the s