# Lesson 1 Homework: NLTK basics and stopwords

This homework lets you practise the ideas from Lesson 1 using a longer,
real text about Robin Hood. You will:

- Load text from a file
- Tokenize into sentences and words
- Remove punctuation
- Handle stopwords
- Try stemming and lemmatization
- Explore pronunciations with `cmudict` (a preview for text-to-speech, TTS)

Cells with `TODO` comments are for you to fill in. Read the hint in
each cell before you start typing code.

## 0. Setup

Run this cell once to make sure all NLTK resources needed for this
homework are available in your environment. It will download:

- `punkt` – sentence tokenizer data
- `averaged_perceptron_tagger` / `averaged_perceptron_tagger_eng` – POS tagger models
- `cmudict` – CMU pronouncing dictionary (for phones)
- `wordnet` and `omw-1.4` – data for lemmatization
- `stopwords` – lists of common function words


In [None]:
import nltk

# Download the NLTK data used in this homework.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("cmudict")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("stopwords")


## 1. Load the Robin Hood text

We will work with a real literary text from *The Merry Adventures of
Robin Hood* (public domain, Project Gutenberg). The text is stored in
the file `example_text_robin_hood.txt` in the same folder as this
notebook.

Here we use `pathlib.Path(...).read_text()` instead of a context
manager with `open(...)` because it is a compact, modern way to read
an entire file as a string, and it automatically closes the file for
us. The `open`/`with` pattern is equivalent but more verbose; you can
always rewrite the code using `with open(...) as f:` if you prefer.

### Task

- Use `pathlib.Path` and `.read_text()` to load the contents of the
  file into a variable named `text`.
- Print the first 500 characters to get a feeling for the text.


In [None]:
from pathlib import Path

# TODO: read the contents of "example_text_robin_hood.txt" into a
# variable called `text` using Path(...).read_text(encoding="utf-8").
# TODO: print the first 500 characters of `text`.


## 2. Sentence tokenization with PunktSentenceTokenizer

In the lesson notebook you used `PunktSentenceTokenizer` to split a
short sample into sentences. Here you will do the same with the longer
Robin Hood text.
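
As a reminder of the pattern, here is a small sketch on a made-up two-sentence string (not taken from the homework file). It assumes an untrained `PunktSentenceTokenizer` with default parameters, which is enough for simple sentences like these; the longer literary text may be split less cleanly.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Made-up sample text, only for illustration.
sample = "Robin drew his bow. The arrow flew straight."

# An untrained tokenizer with default parameters.
sentence_tokenizer = PunktSentenceTokenizer()
sample_sentences = sentence_tokenizer.tokenize(sample)

print(len(sample_sentences))  # 2
for sentence in sample_sentences:
    print(repr(sentence))
```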

### Task

- Create a `PunktSentenceTokenizer` object.
- Use its `.tokenize()` method to split `text` into a list named
  `sentences`.
- Inspect the number of sentences and look at the first few.

In [None]:
from nltk.tokenize import PunktSentenceTokenizer

# TODO: create a PunktSentenceTokenizer and use its .tokenize(text)
# method to build a list called `sentences`.
# TODO: inspect how many sentences you get and look at the first few.


## 3. Word tokenization with TreebankWordTokenizer

Next, you will break the text into word-level tokens **and** keep track
of which words belong to which sentence. This is useful later for
text-to-speech, because the system usually speaks one sentence at a
time and needs to know where sentences begin and end to control
pauses and prosody.

### Task

- Use `TreebankWordTokenizer` (as in the lesson) **together with the
  `sentences` list from the previous step**.
- Build a list of lists called `words_per_sentence`, where each inner
  list contains the tokens for one sentence.
- Create a lowercased version of this grouped list called
  `lower_words_per_sentence`.
- Optionally also create a single flat list `words` containing all
  tokens, and a flat lowercased list `lower_words` that you can reuse
  later.

Hint: you will probably need a **nested loop** (or a list
comprehension with another loop inside it) that goes over sentences
first and then over the tokens inside each sentence.
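
The nested structure described in the hint can be sketched on two made-up sentences. The variable names starting with `toy_` are placeholders for this illustration; in your own solution you would use the names from the task and the real `sentences` list.

```python
from nltk.tokenize import TreebankWordTokenizer

# Made-up sentences standing in for the real `sentences` list.
toy_sentences = ["Robin drew his bow.", "The arrow flew straight."]
word_tokenizer = TreebankWordTokenizer()

# Outer level: one inner list of tokens per sentence.
toy_words_per_sentence = [word_tokenizer.tokenize(s) for s in toy_sentences]

# Nested comprehension: loop over sentences, then over tokens.
toy_lower_per_sentence = [
    [token.lower() for token in sentence_tokens]
    for sentence_tokens in toy_words_per_sentence
]

# Flattening: the sentence loop comes first, the token loop second.
toy_flat = [
    token
    for sentence_tokens in toy_words_per_sentence
    for token in sentence_tokens
]

print(toy_words_per_sentence[0])  # ['Robin', 'drew', 'his', 'bow', '.']
print(toy_lower_per_sentence[1])  # ['the', 'arrow', 'flew', 'straight', '.']
print(len(toy_flat))              # 10
```

Keeping the grouped list and the flat list side by side means you never lose track of where each sentence begins and ends.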


In [None]:
from nltk.tokenize import TreebankWordTokenizer

# TODO: use TreebankWordTokenizer together with the `sentences` list
#       to build `words_per_sentence` (a list of lists of tokens).
# TODO: build a lowercased version called `lower_words_per_sentence`.
# TODO: optionally create flat lists `words` and `lower_words` that
#       contain all tokens from all sentences.


## 4. Removing punctuation with RegexpTokenizer (review)

You previously saw how `TreebankWordTokenizer` splits text into
words *and* keeps punctuation as separate tokens (for example,
`"Robin,"` becomes `["Robin", ","]`).

In contrast, `RegexpTokenizer(r"\w+")` keeps only word-like
tokens and drops punctuation entirely. In real projects you would
normally pick **one** tokenizer style or the other, depending on
your goal, rather than using both at the same time. Here we use
both only to compare their behaviour.
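
A quick side-by-side comparison on a made-up sentence shows the difference. Note one side effect of the `\w+` pattern: the apostrophe counts as punctuation, so a contraction such as "didn't" is cut into two pieces.

```python
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer

# Made-up sample sentence, only for illustration.
sample = "Robin, the outlaw, didn't wait."

treebank_tokens = TreebankWordTokenizer().tokenize(sample)
regexp_tokens = RegexpTokenizer(r"\w+").tokenize(sample)

# Punctuation is kept as separate tokens.
print(treebank_tokens)
# ['Robin', ',', 'the', 'outlaw', ',', 'did', "n't", 'wait', '.']

# Punctuation is dropped, and "didn't" is split at the apostrophe.
print(regexp_tokens)
# ['Robin', 'the', 'outlaw', 'didn', 't', 'wait']
```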

For text-to-speech it is helpful to have **clean word tokens per
sentence**, so in this section we remove punctuation at the
sentence level using the `sentences` list from section 2.

### Task

- Create a `RegexpTokenizer` with the pattern `r"\w+"`.
- For each sentence in `sentences`, use the tokenizer to build a list
  of tokens without punctuation. Collect these into a list of lists
  called `words_no_punct_per_sentence`.
- Optionally also make a flat list `words_no_punct` with all
  punctuation-free tokens.


In [None]:
from nltk.tokenize import RegexpTokenizer

# TODO: create a RegexpTokenizer with the pattern r"\w+".
# TODO: using the `sentences` list, build `words_no_punct_per_sentence`
#       (tokens per sentence, no punctuation). Hint: use a nested loop
#       or a list comprehension over sentences.
# TODO: if you create a flat list `words_no_punct`, you can also
#       compare its length to `words`.


## 5. Stopwords (optional)

Stopwords are very common words (like "the", "and", "to") that often
do not carry much meaning by themselves. NLTK provides a list of
English stopwords.

This section is **optional** and mainly useful for text analysis; a
basic speech-generation system would normally keep these words so that
sentences sound natural.

### Task

- Import `stopwords` from `nltk.corpus`.
- Build a Python `set` of English stopwords.
- Starting from `words_no_punct_per_sentence`, create a list
  `content_words_per_sentence` that excludes any token whose
  lowercase form is in the stopword set. Optionally also build a
  flat list `content_words`.
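
The filtering step itself can be sketched without NLTK. The tiny stopword set and the token lists below are hand-made stand-ins for this illustration only; in the homework the set comes from `stopwords.words("english")` and the tokens from your own lists.

```python
# Hypothetical, hand-made stand-in for the real NLTK stopword set.
toy_stopwords = {"the", "and", "to", "in", "his"}

# Made-up punctuation-free tokens, grouped by sentence.
toy_tokens_per_sentence = [
    ["Robin", "went", "to", "the", "forest"],
    ["The", "sheriff", "and", "his", "men", "followed"],
]

# Compare the lowercase form, so that "The" is filtered like "the".
toy_content_per_sentence = [
    [token for token in sentence_tokens if token.lower() not in toy_stopwords]
    for sentence_tokens in toy_tokens_per_sentence
]

print(toy_content_per_sentence)
# [['Robin', 'went', 'forest'], ['sheriff', 'men', 'followed']]
```

A `set` is used because membership tests with `in` are fast on sets, which matters once the token lists get long.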


In [None]:
from nltk.corpus import stopwords

# TODO: build a set of English stopwords using stopwords.words("english").
# TODO: starting from `words_no_punct_per_sentence`, create a list of
#       lists `content_words_per_sentence` that excludes any token whose
#       lowercase form is in the stopword set.
# TODO: optionally flatten it into `content_words` and print how many
#       tokens remain after removing stopwords.


## 6. Stemming and lemmatization on content words (optional)

Finally, you will apply a stemmer and a lemmatizer to some
**content words** from the Robin Hood text that you choose yourself.
This section is **optional**. Stemming and lemmatization are very
useful for analysing language, but a basic text-to-speech system
usually works with the original word forms.

### Task

- Use `PorterStemmer` to compute stems for some frequent
  content words (for example from a flattened version of
  `content_words_per_sentence`).
- Use `WordNetLemmatizer` to compute lemmas for the same words.
- Compare stems and lemmas for a few examples.
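
Here is a minimal sketch of the stemming half on three hand-picked words (chosen for illustration, not computed from the text). The lemmatizer is left out so that the sketch runs without the `wordnet` download; in your solution you would call `WordNetLemmatizer().lemmatize(word)` on the same words and compare.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked sample words, only for illustration.
sample_words = ["arrows", "merry", "running"]
sample_stems = [stemmer.stem(word) for word in sample_words]

for word, stem in zip(sample_words, sample_stems):
    print(word, "->", stem)
# arrows -> arrow
# merry -> merri
# running -> run
```

Notice that a stem such as `merri` is not a real English word. A stemmer only strips suffixes by rule, whereas a lemmatizer looks words up and returns dictionary forms.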


In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

# TODO: choose a small list of content words, for example by picking
#       some interesting nouns or verbs from `content_words` (or from a
#       flattened `content_words_per_sentence`).
# TODO: create a list of stems using PorterStemmer and a list of
#       lemmas using WordNetLemmatizer.
# TODO: display (word, stem, lemma) triples for your sample.


## 7. CMU Pronouncing Dictionary (`cmudict`)

For text-to-speech we need to know **how words are pronounced**. NLTK
includes the CMU Pronouncing Dictionary (`cmudict`), which maps many
English words to lists of phones.
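
Each dictionary entry is a list of pronunciations, and each pronunciation is a list of phone strings. Vowel phones end in a stress digit: `1` for primary stress, `2` for secondary stress and `0` for no stress. The sketch below uses a tiny hand-typed dictionary in the same shape as `cmudict.dict()`. Treat its entries as illustrative assumptions and check them against the real dictionary.

```python
# Hand-typed sample in the cmudict format (an assumption for this
# sketch; the real entries come from cmudict.dict()).
toy_cmu = {
    "king": [["K", "IH1", "NG"]],
    "arrow": [["AE1", "R", "OW0"], ["EH1", "R", "OW0"]],
}

toy_summary = {}
for word in ["king", "arrow", "greenwood"]:
    pronunciations = toy_cmu.get(word.lower())
    if pronunciations is None:
        # Words missing from the dictionary return None with .get().
        toy_summary[word] = None
        continue
    first = pronunciations[0]
    # Every vowel phone ends in a digit, so this counts syllables.
    n_syllables = sum(1 for phone in first if phone[-1].isdigit())
    # Digits 1 and 2 mark stressed vowels; 0 marks unstressed ones.
    n_stressed = sum(1 for phone in first if phone[-1] in "12")
    toy_summary[word] = (len(first), n_syllables, n_stressed)

print(toy_summary)
# {'king': (3, 1, 1), 'arrow': (3, 2, 1), 'greenwood': None}
```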

### Task

- Import `cmudict` from `nltk.corpus`.
- Build the dictionary with `cmudict.dict()`.
- Pick a small list of words from the Robin Hood text (for example
  `["forest", "sheriff", "arrow", "king"]`).
- Look up each word in the dictionary and print the word together with
  its pronunciation(s).
- For each word, also print how many phones are in its first
  pronunciation (use `len(...)`).
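- (Optional, a bit trickier) Approximate the number of **syllables** by
  counting phones that end in a digit (like `AA1`, `EH2`, `AH0`); only
  vowel phones carry this stress digit. To count **stressed syllables**,
  keep only the phones ending in `1` or `2`, since `0` marks an
  unstressed vowel. Which words sound longer or more complex according
  to these simple measures?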
- Choose one short sentence from `words_no_punct_per_sentence` and, for
  each word in that sentence, look up its phones (skipping words that
  are missing). Build a list `phones_for_sentence` that contains all
  phones in order.
- From `phones_for_sentence`, build a list of simple **diphone labels**
  (pairs of neighbouring phones). This is close to what a diphone-based
  TTS system would need before matching phones to pieces of audio.
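
The pairing step for diphone labels can be sketched in plain Python. The phone list below is typed in by hand for illustration, and joining the two phones with a hyphen is just one possible label format, not a standard.

```python
# Hand-typed phone sequence, standing in for `phones_for_sentence`.
toy_phones = ["K", "IH1", "NG", "AE1", "R", "OW0"]

# Pair each phone with its right-hand neighbour.
toy_diphones = [
    f"{left}-{right}" for left, right in zip(toy_phones, toy_phones[1:])
]

print(toy_diphones)
# ['K-IH1', 'IH1-NG', 'NG-AE1', 'AE1-R', 'R-OW0']
```

A list of `n` phones always gives `n - 1` diphones, because `zip` stops when the shorter, shifted list runs out.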


In [None]:
from nltk.corpus import cmudict

# TODO: create the CMU dictionary object (for example, cmu = cmudict.dict()).
# TODO: choose a list of words from the text and print each word together
#       with its entry in the dictionary (use cmu.get(word.lower())).
# TODO: for the first pronunciation of each word, also print how many
#       phones it contains.
# TODO (optional): count how many phones end with a digit to estimate
#       the number of syllables (digits 1 and 2 mark stressed vowels,
#       0 marks unstressed ones).
# TODO: pick one short sentence from `words_no_punct_per_sentence` and
#       build a list `phones_for_sentence` by concatenating the phone
#       sequences for each word (skipping missing words). Then build a
#       list of diphone labels from `phones_for_sentence`.
