# The Lazy Book Report

Your professor has assigned a book report on "The Red-Headed League" by Arthur Conan Doyle. 

You haven't read the book. And out of stubbornness, you won't.

But you *have* learned NLP. Let's use it to answer the professor's questions without reading.

## Setup

First, let's fetch the text from Project Gutenberg and prepare it for analysis.

In [2]:
# Fetch and prepare text - RUN THIS CELL FIRST
import os
import urllib.request
import re

os.makedirs("output", exist_ok=True)

url = 'https://www.gutenberg.org/files/1661/1661-0.txt'
req = urllib.request.Request(url, headers={'User-Agent': 'Python-urllib'})
with urllib.request.urlopen(req, timeout=30) as resp:
    text = resp.read().decode('utf-8')

# Strip Gutenberg boilerplate
text = text.split('*** START OF')[1].split('***')[1]
text = text.split('*** END OF')[0]

# Extract "The Red-Headed League" story (it's the second story in the collection)
matches = list(re.finditer(r'THE RED-HEADED LEAGUE', text, re.IGNORECASE))
story_start = matches[1].end()
story_text = text[story_start:]
story_end = re.search(r'\n\s*III\.\s*\n', story_text)
story_text = story_text[:story_end.start()] if story_end else story_text

# Split into 3 sections by word count
words = story_text.split()[:4000]
section_size = len(words) // 3
sections = [
    ' '.join(words[:section_size]),
    ' '.join(words[section_size:2*section_size]),
    ' '.join(words[2*section_size:])
]

print(f"Story loaded: {len(words)} words in {len(sections)} sections")
print(f"Section sizes: {[len(s.split()) for s in sections]}")

Story loaded: 4000 words in 3 sections
Section sizes: [1333, 1333, 1334]


## Professor's Questions

Your professor wants you to answer 5 questions about the story. Let's use NLP to find the answers.

---

## Question 1: Writing Style

> "This text is from the 1890s. What makes it different from modern writing?"

**NLP Method:** Use preprocessing to compute text statistics. Tokenize the text and calculate:
- Vocabulary richness (unique words / total words)
- Average sentence length
- Average word length

**Hint:** Formal, literary writing typically shows higher vocabulary richness and longer sentences than modern casual text.

In [3]:
# Your code here: compute text statistics
# You'll need: import string, import re
# - Tokenize: remove punctuation, lowercase
# - Sentences: split on sentence-ending punctuation
# Calculate vocab_richness, avg_sentence_length, avg_word_length

import string 
import re 

def compute_text_statistics(text):
    # Tokenize: remove punctuation, lowercase
    translator = str.maketrans('', '', string.punctuation)
    tokens = text.lower().translate(translator).split()
    
    # Vocabulary richness
    vocab_richness = len(set(tokens)) / len(tokens) if tokens else 0
    
    # Sentences: split on sentence-ending punctuation
    sentences = re.split(r'[.!?]+', text)
    sentences = [s for s in sentences if s.strip()]
    
    # Average sentence length
    sentence_lengths = []
    for sentence in sentences:
        sentence_tokens = sentence.lower().translate(translator).split()
        sentence_lengths.append(len(sentence_tokens))
    avg_sentence_length = sum(sentence_lengths) / len(sentence_lengths) if sentence_lengths else 0
    
    # Average word length
    word_length = [len(word) for word in tokens]
    avg_word_length = sum(word_length) / len(word_length) if word_length else 0
    
    return {
        'vocab_richness': vocab_richness,
        'avg_sentence_length': avg_sentence_length,
        'avg_word_length': avg_word_length
    }

stats = compute_text_statistics(story_text)
print("Statistics for 'The Red-Headed League':")
print(f"Vocabulary Richness: {stats['vocab_richness']:.2f}")
print(f"Average Sentence Length: {stats['avg_sentence_length']:.2f} words")
print(f"Average Word Length: {stats['avg_word_length']:.2f} characters")

#  The vocabulary richness is rather low and and the average sentence length and average word length is rather low compared to modern text. 

Statistics for 'The Red-Headed League':
Vocabulary Richness: 0.10
Average Sentence Length: 14.84 words
Average Word Length: 4.19 characters


---

## Question 2: Main Characters

> "Who are the main characters in this story?"

**NLP Method:** Use Named Entity Recognition (NER) to extract PERSON entities.

**Hint:** Use spaCy's `en_core_web_sm` model. Process the text and filter entities where `ent.label_ == 'PERSON'`. Count how often each name appears.

In [4]:
# Your code here: extract PERSON entities using spaCy NER
# You'll need: import spacy, nlp = spacy.load("en_core_web_sm")

# When done, save your findings:
# with open("output/characters.txt", "w") as f:
#     for name in your_character_list:
#         f.write(f"{name}\n")

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(story_text)
raw_names = sorted(set(ent.text for ent in doc.ents if ent.label_ == "PERSON"))
clean_names = [name.replace("'s", "") for name in raw_names]
clean_names = [name for name in clean_names if len(name.split()) > 1]

remove = {
    "Agra", "Bosombe Pool", "Pall Mall", "Waterloo", "Waterloo Bridge", "St.", "XI", "Square", "Surrey", "Near Lee", "Principal", "F.H.M ' Now", "I.", "I. '"
}

clean_names = [name for name in clean_names if name not in remove]

merge_map = {
    "Holmes": "Sherlock Holmes",
    "Sherlock": "Sherlock Holmes",
    "Watson": "Dr. Watson"
   
}
final_names = sorted(set(merge_map.get(name, name) for name in clean_names))

with open("output/characters.txt", "w") as f:
    for name in final_names:
        f.write(f"{name}\n")


---

## Question 3: Story Locations

> "Where does the story take place?"

**NLP Method:** Use Named Entity Recognition (NER) to extract location entities (GPE and LOC).

**Hint:** Filter entities where `ent.label_` is 'GPE' (geopolitical entity) or 'LOC' (location).

In [5]:
# Your code here: extract GPE and LOC entities using spaCy NER

# When done, save your findings:
# with open("output/locations.txt", "w") as f:
#     for place in your_locations_list:
#         f.write(f"{place}\n")

doc = nlp(story_text)
your_locations_list = [ent.text for ent in doc.ents if ent.label_ in {"GPE", "LOC"}]
your_locations_list = sorted(list(set(your_locations_list)))
your_locations_list = [name.replace("'s", "") for name in your_locations_list]

remove = {
    "Holmes", "Horace", "Rucastle", "Uffa", "geese", "crisply", "n't", "the", "Street", "wooden", "Scarlet", "Esq", "Major Prendergast", "the West End", "the City", "the City and Suburban Bank", "the St. Pancras Hotel", "the Amoy River", "8_s", "Cal.", "D.D.", "Pa.", "the", "lodgings", "City", "Union", "Hotel", "west.", "china", "morocco"
}

generic_terms = {"North", "South", "East", "West", "Captial", "City", "States", "Union"}

your_locations_list = [name for name in your_locations_list if name not in remove and name not in generic_terms]

""
with open("output/locations.txt", "w") as f:
    for place in your_locations_list:
        f.write(f"{place}\n")



---

## Question 4: Wilson's Business

> "What is Wilson's business?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's business.

**Hint:** Create a TF-IDF vectorizer, fit it on the 3 sections, then transform your query using the same vectorizer (`.transform()`, not `.fit_transform()` - you want to use the vocabulary learned from the sections). Find which section has the highest cosine similarity and read it to find the answer.

In [6]:
# Your code here: use TF-IDF similarity to find the relevant section
# You'll need: from sklearn.feature_extraction.text import TfidfVectorizer
#              from sklearn.metrics.pairwise import cosine_similarity

# When done, save your findings:
# with open("output/business.txt", "w") as f:
#     f.write("Wilson's business is: ...")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "Wilson's business"

sections = story_text.split("\n\n")  
sections = [s.replace("\n", " ") for s in sections if s.strip()]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sections + [query])
cosine_similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])
most_relevant_index = cosine_similarities.argmax()
most_relevant_section = sections[most_relevant_index]

with open("output/business.txt", "w") as f:
    f.write(f"Wilson's business is:\n\n{most_relevant_section}\n")


---

## Question 5: Wilson's Work Routine

> "What is Wilson's daily work routine for the League?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's work routine.

**Hint:** Similar to Question 4 - use TF-IDF to find the section that best matches your query about work routine. The answer includes what Wilson had to do and what eventually happened.

In [8]:
# Your code here: use TF-IDF similarity to find the relevant section

# When done, save your findings:
# with open("output/routine.txt", "w") as f:
#     f.write("Wilson's work routine: ...\n")
#     f.write("What happened: ...\n")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [p.strip() for p in story_text.split("\n\n") if len(p.strip()) > 50]

query = "Wilson's work routine"

vectorizer = TfidfVectorizer(stop_words ='english')
tfidf = vectorizer.fit_transform([query] + paragraphs)

similarities = cosine_similarity(tfidf[0:1], tfidf[1:]).flatten()

best_idx = similarities.argmax()
best_paragraph = paragraphs[best_idx]

what_happened = ""
if best_idx + 1 < len(paragraphs):
    what_happened = paragraphs[best_idx + 1]

with open("output/routine.txt", "w") as f:
    f.write(f"Wilson's work routine:\n{best_paragraph}\n\n")
    f.write(f"What happened:\n{what_happened}\n")
