# The Lazy Book Report

Your professor has assigned a book report on "The Red-Headed League" by Arthur Conan Doyle. 

You haven't read the book. And out of stubbornness, you won't.

But you *have* learned NLP. Let's use it to answer the professor's questions without reading.

## Setup

First, let's fetch the text from Project Gutenberg and prepare it for analysis.

In [1]:
# Fetch and prepare text - RUN THIS CELL FIRST
import os
import urllib.request
import re
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

os.makedirs("output", exist_ok=True)

url = 'https://www.gutenberg.org/files/1661/1661-0.txt'
req = urllib.request.Request(url, headers={'User-Agent': 'Python-urllib'})
with urllib.request.urlopen(req, timeout=30) as resp:
    text = resp.read().decode('utf-8')

# Strip Gutenberg boilerplate
text = text.split('*** START OF')[1].split('***')[1]
text = text.split('*** END OF')[0]

# Extract "The Red-Headed League" story (it's the second story in the collection)
matches = list(re.finditer(r'THE RED-HEADED LEAGUE', text, re.IGNORECASE))
story_start = matches[1].end()
story_text = text[story_start:]
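# The next story's heading ("III. A CASE OF IDENTITY") marks where this story ends
story_end = re.search(r'\n\s*III\.\s+A CASE OF IDENTITY', story_text, re.IGNORECASE)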
story_text = story_text[:story_end.start()] if story_end else story_text

# Keep the first 4000 words of the story and split them into 3 equal sections
words = story_text.split()[:4000]
section_size = len(words) // 3
sections = [
    ' '.join(words[:section_size]),
    ' '.join(words[section_size:2*section_size]),
    ' '.join(words[2*section_size:])
]

print(f"Story loaded: {len(words)} words in {len(sections)} sections")
print(f"Section sizes: {[len(s.split()) for s in sections]}")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/jharatani/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jharatani/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Story loaded: 4000 words in 3 sections
Section sizes: [1333, 1333, 1334]


## Professor's Questions

Your professor wants you to answer five questions about the story. Let's use NLP to find the answers.

---

## Question 1: Writing Style

> "This text is from the 1890s. What makes it different from modern writing?"

**NLP Method:** Use preprocessing to compute text statistics. Tokenize the text and calculate:
- Vocabulary richness (unique words / total words)
- Average sentence length
- Average word length

**Hint:** Formal, literary writing typically shows higher vocabulary richness and longer sentences than modern casual text.
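
To make the three statistics concrete, here is a minimal sketch on a made-up two-sentence string. It uses only the standard library with a naive regex sentence splitter, so it is an illustration of the arithmetic rather than a replacement for NLTK's tokenizers; the `text_stats` helper and the toy text are assumptions for this example.

```python
import re

def text_stats(text):
    """Return (type-token ratio, mean sentence length, mean word length)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    # Lowercased alphabetic tokens only, so punctuation and digits are ignored.
    words = re.findall(r"[a-z]+", text.lower())
    ttr = len(set(words)) / len(words)                      # unique / total
    avg_sentence_len = len(words) / len(sentences)          # words per sentence
    avg_word_len = sum(len(w) for w in words) / len(words)  # characters per word
    return ttr, avg_sentence_len, avg_word_len

# Toy text (made up): 7 tokens, 5 unique words, 2 sentences.
toy = "The cat sat. The cat ran away."
ttr, avg_sentence_len, avg_word_len = text_stats(toy)
print(f"TTR={ttr:.3f}  words/sentence={avg_sentence_len:.1f}  chars/word={avg_word_len:.2f}")
```

Note that the type-token ratio tends to fall as a text gets longer, because common words keep repeating. Compare texts of similar length, and compute it on the story excerpt rather than on the whole collection.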

In [2]:
# Compute text statistics for the story excerpt (not the whole collection)

# Rejoin the three sections so the statistics cover the same 4000-word
# excerpt used in the other questions
story = ' '.join(sections)

# Sentences: split on sentence-ending punctuation
sentences = nltk.sent_tokenize(story)

# Tokenize: lowercase, then drop punctuation and numbers
tokens_raw = nltk.word_tokenize(story.lower())
words_only = [t for t in tokens_raw if t.isalpha()]

# Calculate vocab_richness, avg_sentence_len, avg_word_len
vocab_richness = len(set(words_only)) / len(words_only)
avg_sentence_len = len(words_only) / len(sentences)
avg_word_len = sum(len(w) for w in words_only) / len(words_only)

print(f"Vocabulary richness (type-token ratio): {vocab_richness:.3f}")
print(f"Average sentence length: {avg_sentence_len:.1f} words")
print(f"Average word length: {avg_word_len:.1f} characters")

---

## Question 2: Main Characters

> "Who are the main characters in this story?"

**NLP Method:** Use Named Entity Recognition (NER) to extract PERSON entities.

**Hint:** Use spaCy's `en_core_web_sm` model. Process the text and filter entities where `ent.label_ == 'PERSON'`. Count how often each name appears.

In [8]:
# Extract PERSON entities using spaCy NER
import re
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

all_persons = []

for section in sections:
    doc = nlp(section)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name = ent.text.strip()

            # Filter obvious junk: very short strings, Roman numerals,
            # and strings with no letters
            if (
                len(name) > 2
                and not re.fullmatch(r'[IVXLCDM]+\.?', name)
                and re.search(r'[A-Za-z]', name)
            ):
                # Count by last name so that "Sherlock Holmes" and "Holmes"
                # are tallied together
                all_persons.append(name.split()[-1])

counts = Counter(all_persons)

with open("output/characters.txt", "w", encoding="utf-8") as f:
    f.write("Top 10 main characters (by last-name mentions):\n")
    for name, c in counts.most_common(10):
        f.write(f"{name}\t{c}\n")



---

## Question 3: Story Locations

> "Where does the story take place?"

**NLP Method:** Use Named Entity Recognition (NER) to extract location entities (GPE and LOC).

**Hint:** Filter entities where `ent.label_` is 'GPE' (geopolitical entity) or 'LOC' (location).
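
One way to structure the label filter is sketched below. The helper name `count_locations` is an assumption for this example; it only relies on the `doc.ents`, `ent.text`, and `ent.label_` attributes already used in Question 2, and it takes the spaCy pipeline as an argument instead of loading it.

```python
from collections import Counter

def count_locations(nlp, texts, labels=("GPE", "LOC")):
    """Count entities whose label is in `labels` across all `texts`.

    `nlp` is any callable that returns an object with an `.ents` sequence,
    such as a loaded spaCy pipeline.
    """
    counts = Counter()
    for text in texts:
        doc = nlp(text)
        for ent in doc.ents:
            if ent.label_ in labels:
                # Collapse line breaks and repeated spaces inside the entity text.
                counts[" ".join(ent.text.split())] += 1
    return counts
```

In this notebook, `count_locations(nlp, sections).most_common(10)` would reuse the `nlp` pipeline loaded for Question 2 and the `sections` list from the setup cell. Small NER models can mislabel names in older prose, so treat the counts as candidates to check, not as final answers.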

In [None]:
# Your code here: extract GPE and LOC entities using spaCy NER

# When done, save your findings:
# with open("output/locations.txt", "w") as f:
#     for place in your_locations_list:
#         f.write(f"{place}\n")



---

## Question 4: Wilson's Business

> "What is Wilson's business?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's business.

**Hint:** Create a TF-IDF vectorizer, fit it on the 3 sections, then transform your query using the same vectorizer (`.transform()`, not `.fit_transform()` - you want to use the vocabulary learned from the sections). Find which section has the highest cosine similarity and read it to find the answer.
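
The fit-then-transform pattern from the hint can be sketched on toy data as follows. The `rank_sections` helper, the three toy sentences, and the query are made up for illustration, and `stop_words="english"` is an optional choice, not something the assignment requires.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sections(query, sections):
    """Return one cosine-similarity score per section for `query`."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Learn the vocabulary and IDF weights from the sections only.
    section_matrix = vectorizer.fit_transform(sections)
    # Reuse that vocabulary for the query: transform(), not fit_transform().
    query_vec = vectorizer.transform([query])
    # Result has shape (1, n_sections); flatten it to a 1-D array.
    return cosine_similarity(query_vec, section_matrix).ravel()

# Toy stand-ins for the three story sections (made up for this example).
toy_sections = [
    "The cat sleeps on the warm windowsill.",
    "Rain fell on the harbour all night.",
    "The baker sells bread and pastries every morning.",
]
scores = rank_sections("Where can I buy bread?", toy_sections)
best = int(scores.argmax())
print(f"Best match: section {best + 1} (score {scores[best]:.2f})")
```

Query words that never occur in the sections (here, "buy") are simply ignored by `.transform()`, which is why the query must share the vocabulary learned from the sections. In the notebook, pass the `sections` list from the setup cell and a query based on the professor's question.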

In [None]:
# Your code here: use TF-IDF similarity to find the relevant section
# You'll need: from sklearn.feature_extraction.text import TfidfVectorizer
#              from sklearn.metrics.pairwise import cosine_similarity

# When done, save your findings:
# with open("output/business.txt", "w") as f:
#     f.write("Wilson's business is: ...")



---

## Question 5: Wilson's Work Routine

> "What is Wilson's daily work routine for the League?"

**NLP Method:** Use TF-IDF similarity to find which section discusses Wilson's work routine.

**Hint:** As in Question 4, use TF-IDF to find the section that best matches your query about the work routine. The answer includes what Wilson had to do and what eventually happened.
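
Once the best section is known, the same technique can be applied one level down to rank its sentences, which cuts down how much of the section needs reading. This is an optional extension, not part of the assignment: the `top_sentences` helper, the regex sentence splitter, and the toy passage are all assumptions for this sketch.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_sentences(query, passage, k=2):
    """Return the k (score, sentence) pairs of `passage` most similar to `query`."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', passage) if s.strip()]
    vectorizer = TfidfVectorizer(stop_words="english")
    # Treat each sentence as its own document.
    sentence_matrix = vectorizer.fit_transform(sentences)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, sentence_matrix).ravel()
    # Sort by score, highest first; ties keep their original order.
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    return [(float(score), sentence) for score, sentence in ranked[:k]]

# Toy passage (made up for this example).
toy_passage = (
    "The train leaves at nine. Tickets are sold at the kiosk. "
    "The kiosk also sells maps and postcards."
)
results = top_sentences("When does the train leave?", toy_passage)
for score, sentence in results:
    print(f"{score:.2f}  {sentence}")
```

TF-IDF matches exact tokens, so "leave" in the query does not match "leaves" in the passage; the first sentence ranks highest only because of "train". Phrase the query with words likely to appear in the text.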

In [None]:
# Your code here: use TF-IDF similarity to find the relevant section

# When done, save your findings:
# with open("output/routine.txt", "w") as f:
#     f.write("Wilson's work routine: ...\n")
#     f.write("What happened: ...\n")

