# The Lazy Book Report

Your professor has assigned a book report on "The Red-Headed League" by Arthur Conan Doyle. 

You haven't read the book. And out of stubbornness, you won't.

But you *have* learned NLP. Let's use it to answer the professor's questions without reading.

## Setup

First, let's fetch the text from Project Gutenberg and prepare it for analysis.

In [1]:
# Fetch and prepare text - RUN THIS CELL FIRST
import os
import urllib.request
import re

os.makedirs("output", exist_ok=True)

url = 'https://www.gutenberg.org/files/1661/1661-0.txt'
req = urllib.request.Request(url, headers={'User-Agent': 'Python-urllib'})
with urllib.request.urlopen(req, timeout=30) as resp:
    text = resp.read().decode('utf-8')

# Strip Gutenberg boilerplate: keep only the text between the START and END marker lines
text = text.split('*** START OF', 1)[1].split('***', 1)[1]
text = text.split('*** END OF', 1)[0]

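# Extract "The Red-Headed League" (the second story in the collection).
# matches[1] is the second occurrence of the title; this assumes the first is the table-of-contents entry.
matches = list(re.finditer(r'THE RED-HEADED LEAGUE', text, re.IGNORECASE))
story_start = matches[1].end()
story_text = text[story_start:]
# The story should end at the "III." heading of the next story.
# If that pattern is not found, story_text runs to the end of the collection.
story_end = re.search(r'\n\s*III\.\s*\n', story_text)
story_text = story_text[:story_end.start()] if story_end else story_text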

# Keep the first 4000 words and split them into 3 roughly equal sections
words = story_text.split()[:4000]
section_size = len(words) // 3
sections = [
    ' '.join(words[:section_size]),
    ' '.join(words[section_size:2*section_size]),
    ' '.join(words[2*section_size:])
]

print(f"Story loaded: {len(words)} words in {len(sections)} sections")
print(f"Section sizes: {[len(s.split()) for s in sections]}")

Story loaded: 4000 words in 3 sections
Section sizes: [1333, 1333, 1334]


## Professor's Questions

Your professor wants you to answer 5 questions about the story. Let's use NLP to find the answers.

---

## Question 1: Writing Style

> "This text is from the 1890s. What makes it different from modern writing?"

**NLP Method:** Use preprocessing to compute text statistics. Tokenize the text and calculate:
- Vocabulary richness (unique words / total words)
- Average sentence length
- Average word length

**Hint:** Formal, literary writing typically shows higher vocabulary richness and longer sentences than modern casual text.
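
As a small worked example of the three statistics, here is a minimal sketch that uses only the standard library on a made-up two-sentence sample. The helper name `text_stats` and the sample text are illustrative assumptions, and the regex sentence splitter is a rough stand-in for a real tokenizer (it will over-split on abbreviations such as "Mr.").

```python
import re
import string

def text_stats(text):
    """Return (vocab_richness, avg_sentence_length, avg_word_length) for a text."""
    # Sentences: split on runs of '.', '!' or '?' (rough rule, see caveat above)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Tokens: lowercase, strip ASCII punctuation, split on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    vocab_richness = len(set(tokens)) / len(tokens)
    avg_sentence_length = len(tokens) / len(sentences)
    avg_word_length = sum(len(t) for t in tokens) / len(tokens)
    return vocab_richness, avg_sentence_length, avg_word_length

# Hypothetical sample text, not taken from the story
sample = "The cat sat. The cat ran away!"
richness, sent_len, word_len = text_stats(sample)

# Repeating the same text doubles the token count but adds no new words,
# so vocabulary richness is halved.
doubled_richness, _, _ = text_stats(sample + " " + sample)

print(f"{richness:.3f} {sent_len:.1f} {word_len:.3f} {doubled_richness:.3f}")
```

The last comparison shows that vocabulary richness falls as a text gets longer, so it is only meaningful to compare texts of similar length.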

In [7]:
# Your code here: compute text statistics
# You'll need: import string, import re
# - Tokenize: remove punctuation, lowercase
# - Sentences: split on sentence-ending punctuation
# Calculate vocab_richness, avg_sentence_length, avg_word_length

import string
import re
import nltk
nltk.download('punkt_tab')

# Tokenize words on lowercased, punctuation-free text; split sentences on the original text
clean_text = story_text.lower().translate(str.maketrans('', '', string.punctuation))
words_in_story = nltk.word_tokenize(clean_text)
sentences_in_story = nltk.sent_tokenize(story_text)

# Vocabulary richness: unique words / total words
vocab_richness = len(set(words_in_story)) / len(words_in_story)

# Average sentence length in words
avg_sentence_length = len(words_in_story) / len(sentences_in_story)

# Average word length in characters
avg_word_length = sum(len(w) for w in words_in_story) / len(words_in_story)

print(f"Vocab Richness: {vocab_richness:.2f}")
print(f"Avg Sentence Length: {avg_sentence_length:.2f} words")
print(f"Avg Word Length: {avg_word_length:.2f} letters")




Vocab Richness: 0.08
Avg Sentence Length: 24.07 words
Avg Word Length: 3.92 letters


---

## Question 2: Main Characters

> "Who are the main characters in this story?"

**NLP Method:** Use Named Entity Recognition (NER) to extract PERSON entities.

**Hint:** Use spaCy's `en_core_web_sm` model. Process the text and filter entities where `ent.label_ == 'PERSON'`. Count how often each name appears.
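
The counting step can be tried without loading a model. The sketch below assumes a hand-written list of `(text, label)` pairs as a stand-in for `[(ent.text, ent.label_) for ent in doc.ents]`; the list contents, the `aliases` table, and the helper name `count_people` are all hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the (text, label) pairs a spaCy doc would yield
entities = [
    ("Sherlock Holmes", "PERSON"),
    ("Holmes", "PERSON"),
    ("Watson", "PERSON"),
    ("London", "GPE"),
    ("Wilson", "PERSON"),
    ("Holmes", "PERSON"),
]

# Hypothetical alias table mapping short forms to one canonical name
aliases = {
    "Holmes": "Sherlock Holmes",
    "Watson": "John Watson",
    "Wilson": "Jabez Wilson",
}

def count_people(entities, aliases):
    """Count PERSON mentions, merging aliases into a canonical name."""
    counts = Counter()
    for text, label in entities:
        if label != "PERSON":
            continue  # skip places, organisations, and so on
        counts[aliases.get(text, text)] += 1
    return counts

people = count_people(entities, aliases)
for name, count in people.most_common():
    print(f"{name}: {count}")
```

Merging aliases before counting keeps "Holmes" and "Sherlock Holmes" from being ranked as two separate characters.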

In [31]:
# Your code here: extract PERSON entities using spaCy NER
# You'll need: import spacy, nlp = spacy.load("en_core_web_sm")
import spacy
nlp = spacy.load("en_core_web_sm")
t = nlp(story_text)
# When done, save your findings:
# with open("output/characters.txt", "w") as f:
#     for name in your_character_list:
#         f.write(f"{name}\n")

# Map partial names to each character's full name
name_map = {
    "Sherlock": "Sherlock Holmes",
    "Holmes": "Sherlock Holmes",
    "John": "John Watson",
    "Watson": "John Watson",
    "Jabez": "Jabez Wilson",
    "Wilson": "Jabez Wilson",
    "Mary": "Mary Sutherland",
    "Sutherland": "Mary Sutherland",
}

# Count PERSON mentions, merging aliases via name_map
characters = {}
for ent in t.ents:
    if ent.label_ == "PERSON":
        original_name = " ".join(ent.text.split())  # collapse line breaks inside a name
        proper_name = name_map.get(original_name, original_name)
        characters[proper_name] = characters.get(proper_name, 0) + 1

# Save names with their counts, most frequent first
sorted_characters = sorted(characters.items(), key=lambda x: x[1], reverse=True)
with open("output/characters.txt", "w") as f:
    for name, count in sorted_characters:
        f.write(f"{name}: {count}\n")



---

## Question 3: Story Locations

> "Where does the story take place?"

**NLP Method:** Use Named Entity Recognition (NER) to extract location entities (GPE and LOC).

**Hint:** Filter entities where `ent.label_` is 'GPE' (geopolitical entity) or 'LOC' (location).

In [27]:
# Your code here: extract GPE and LOC entities using spaCy NER

# When done, save your findings:
# with open("output/locations.txt", "w") as f:
#     for place in your_locations_list:
#         f.write(f"{place}\n")

locations = {}
for ent in t.ents:
    if ent.label_ in ("GPE", "LOC"):
        place = ent.text
        locations[place] = locations.get(place, 0) + 1
sorted_locations = sorted(locations.items(), key=lambda x: x[1], reverse=True)

print(sorted_locations)
with open("output/locations.txt", "w") as f:
    for loc, count in sorted_locations:
        f.write(f"{loc}: {count}\n")

[('London', 33), ('England', 18), ('America', 10), ('France', 6), ('Eyford', 6), ('China', 5), ('India', 4), ('Boscombe Valley', 3), ('Florida', 3), ('Europe', 3), ('Savannah', 3), ('California', 3), ('Streatham', 3), ('U.S.A.', 2), ('Encyclopædia Britannica', 2), ('Underground', 2), ('Scotland', 2), ('Scarlet', 2), ('Australia', 2), ('Victoria', 2), ('Bristol', 2), ('Ballarat', 2), ('Major Prendergast', 2), ('Horsham', 2), ('Georgia', 2), ('South', 2), ('Atlantic', 2), ('morocco', 2), ('Covent Garden', 2), ('Berkshire', 2), ('San Francisco', 2), ('Frisco', 2), ('Pa.', 2), ('Holmes', 2), ('Philadelphia', 2), ('Rucastle', 2), ('Lebanon', 1), ('Pennsylvania', 1), ('Londoners', 1), ('Abbots', 1), ('Strand', 1), ('west.', 1), ('the City and Suburban Bank', 1), ('Kensington', 1), ('Oxford', 1), ('Cornwall', 1), ('Holland', 1), ('Marseilles', 1), ('Auckland', 1), ('New Zealand', 1), ('Leadenhall\r\nStreet', 1), ('the St. Pancras\r\nHotel', 1), ('Andover', 1), ('The Hague', 1), ('Horace', 1),
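
The output above includes entries such as `'Holmes'`, `'Rucastle'`, and `'Leadenhall\r\nStreet'`, which suggests the raw NER results need some cleanup. One possible approach, sketched here under assumptions, is to collapse whitespace inside entity text and drop candidates already known to be people. The `known_people` set and the helper name `clean_locations` are hypothetical; the four sample pairs are copied from the output above.

```python
from collections import Counter

# A few (text, count) pairs copied from the printed output
raw_locations = [
    ("London", 33),
    ("Holmes", 2),
    ("Rucastle", 2),
    ("Leadenhall\r\nStreet", 1),
]

# Hypothetical: names already collected as PERSON entities in Question 2
known_people = {"Holmes", "Rucastle"}

def clean_locations(raw_locations, known_people):
    """Normalise whitespace in place names and drop names known to be people."""
    cleaned = Counter()
    for text, count in raw_locations:
        place = " ".join(text.split())  # turns 'Leadenhall\r\nStreet' into 'Leadenhall Street'
        if place in known_people:
            continue
        cleaned[place] += count
    return cleaned

locations_clean = clean_locations(raw_locations, known_people)
print(locations_clean.most_common())
```

This only removes the mislabels you already know about; other false positives in the list would still need a manual check.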

---

## Question 4: Wilson's Business

> "What is Wilson's business?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's business.

**Hint:** Create a TF-IDF vectorizer, fit it on the 3 sections, then transform your query using the same vectorizer (`.transform()`, not `.fit_transform()` - you want to use the vocabulary learned from the sections). Find which section has the highest cosine similarity and read it to find the answer.
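
To see why the query is transformed rather than fitted, here is a simplified pure-Python sketch of the idea. It is not scikit-learn's implementation: the IDF formula omits the library's default smoothing, no stop words are removed, and the function names and toy sections are illustrative assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def fit_idf(documents):
    """Learn the vocabulary and IDF weights from the documents only."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    # Simplified IDF (assumption: no smoothing)
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

def transform(text, idf):
    """Weight a text with an already-fitted vocabulary; unseen words are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy sections, not text from the story
toy_sections = [
    "the detective greets a visitor",
    "the visitor runs a small shop",
    "the detective waits in a cellar",
]
idf = fit_idf(toy_sections)                       # "fit" on the sections
section_vectors = [transform(s, idf) for s in toy_sections]
query_vector = transform("what shop does the visitor run", idf)  # "transform" only

scores = [cosine(query_vector, v) for v in section_vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Query words that never occur in the sections ("what", "does", "run") get no weight at all, because the vocabulary was fixed when the sections were fitted.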

In [41]:
# Your code here: use TF-IDF similarity to find the relevant section
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = ["What is Wilson's business?"]
vectorizer = TfidfVectorizer(stop_words='english')
result = vectorizer.fit_transform(sections)
vector = vectorizer.transform(query)
ret = cosine_similarity(vector, result).flatten()

best_index = ret.argmax()
best_score = ret[best_index]

print(best_index)
print(best_score)

# Note: this prints the first section, not the top-ranked one (sections[best_index])
print(sections[0])

with open("output/business.txt", "w") as f:
    f.write("Wilson's business is: pawnbroking")



1
0.18361726647829435
I had called upon my friend, Mr. Sherlock Holmes, one day in the autumn of last year and found him in deep conversation with a very stout, florid-faced, elderly gentleman with fiery red hair. With an apology for my intrusion, I was about to withdraw when Holmes pulled me abruptly into the room and closed the door behind me. “You could not possibly have come at a better time, my dear Watson,” he said cordially. “I was afraid that you were engaged.” “So I am. Very much so.” “Then I can wait in the next room.” “Not at all. This gentleman, Mr. Wilson, has been my partner and helper in many of my most successful cases, and I have no doubt that he will be of the utmost use to me in yours also.” The stout gentleman half rose from his chair and gave a bob of greeting, with a quick little questioning glance from his small fat-encircled eyes. “Try the settee,” said Holmes, relapsing into his armchair and putting his fingertips together, as was his custom when in judicial mo

---

## Question 5: Wilson's Work Routine

> "What is Wilson's daily work routine for the League?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's work routine.

**Hint:** Similar to Question 4 - use TF-IDF to find the section that best matches your query about work routine. The answer includes what Wilson had to do and what eventually happened.
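
When the similarity scores are close, it can help to rank every section rather than keep only the top one, so you know whether the runner-up is worth reading too. A minimal sketch, assuming made-up scores (the values and the helper name `rank_sections` are hypothetical):

```python
def rank_sections(scores):
    """Return (index, score) pairs sorted from best to worst match."""
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Hypothetical similarity scores for three sections
example_scores = [0.05, 0.11, 0.13]

ranking = rank_sections(example_scores)
margin = ranking[0][1] - ranking[1][1]  # gap between the top two sections

for index, score in ranking:
    print(f"section {index}: {score:.2f}")
print(f"margin between top two: {margin:.2f}")
```

A small margin is a sign that the answer may be spread across more than one section.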

In [45]:
# Your code here: use TF-IDF similarity to find the relevant section
query = ["What is Wilson's daily work routine for the League?"]
vectorizer = TfidfVectorizer(stop_words='english')
result = vectorizer.fit_transform(sections)
vector = vectorizer.transform(query)
ret = cosine_similarity(vector, result).flatten()

best_index = ret.argmax()
best_score = ret[best_index]

print(best_index)
print(best_score)


with open("output/routine.txt", "w") as f:
    f.write("Wilson's work routine: He runs his pawnbroker's business with the help of an assistant. For the League, he had to work from 10 to 2, stay in the office the whole time, and copy out the Encyclopædia Britannica.\n")
    f.write("What happened: He went to work one day and saw a sign that said THE RED-HEADED LEAGUE IS DISSOLVED.")



2
0.13064694976267216
