# Lesson 1: Introduction to NLTK

In this lesson we will start learning basic text processing with **NLTK** (Natural Language Toolkit).

The goals for today are:
- Install and import NLTK
- Load some example text
- Tokenize text into sentences and words
- Explore simple frequency counts
- Try stemming and lemmatization
- Preview pronunciation lookup and part-of-speech tagging
- Practice with a few short exercises

> This notebook is designed for step-by-step practice. You can run cells one by one and discuss the outputs as you go.

## 1. Setup

In this project we will use a **virtual environment (venv)** so that the packages for MariaTTS, the text-to-speech project this course builds toward, do not affect your system Python.

### 1.1 Create a virtual environment (in the terminal)

In a terminal inside this project folder, run:

```bash
python3 -m venv .venv
source .venv/bin/activate  # macOS / Linux
# On Windows (PowerShell):
# .venv\Scripts\Activate.ps1
```

You should now see `(.venv)` at the start of your terminal prompt.

Then install the packages we will use in this course:

```bash
pip install jupyter nltk numpy pyaudio
```

When you start Jupyter, make sure the kernel (Python interpreter) is the one from this `.venv`.
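
One way to check which interpreter the kernel is using is to run a short snippet like the one below in a notebook cell. This is only a sketch: the variable name `in_venv` is made up for illustration, and the check relies on the standard-library behaviour that `sys.prefix` and `sys.base_prefix` differ inside a venv.

```python
import sys

# Path of the Python interpreter that is running this kernel.
# For this project it should point somewhere inside the .venv folder.
print(sys.executable)

# Inside a venv, sys.prefix points at the environment folder,
# while sys.base_prefix still points at the base Python installation.
in_venv = sys.prefix != sys.base_prefix
print("Running inside a virtual environment:", in_venv)
```

If the printed path does not contain `.venv`, the notebook is probably attached to a different kernel.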

The next cell installs NLTK **inside** whichever environment your notebook is using. If NLTK is already installed in that environment, you can skip it.


In [1]:
# Install NLTK if needed (may already be installed).
# %pip installs into the environment of the running kernel.
%pip install nltk



In [2]:
import nltk

# Download the NLTK data we will need
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")        # older tagger name
nltk.download("averaged_perceptron_tagger_eng")    # newer tagger name used by pos_tag
nltk.download("cmudict")  # pronunciation dictionary used later for TTS
nltk.download("wordnet")   # needed for WordNetLemmatizer
nltk.download("omw-1.4")   # extra WordNet data (optional but useful)

[nltk_data] Downloading package punkt to /home/liv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/liv/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /home/liv/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[nltk_data] Downloading package wordnet to /home/liv/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/liv/nltk_data...


True

## 2. Loading some text

To experiment, we will start with a short sample text. Later, you can replace this with text you choose (e.g. a paragraph from a story or article).

In [3]:
sample_text = (
    "Text processing with Python is fun. "
    "We can use NLTK to explore language. "
    "This will be useful for future projects!"
)
print(sample_text)

Text processing with Python is fun. We can use NLTK to explore language. This will be useful for future projects!


### Exercise 1 (discussion)

- Read the text above out loud.
- Think about what *tokens* might mean in the context of text.
- How might a computer split this text into smaller pieces?

## 3. Sentence tokenization

NLTK provides tools for splitting text into sentences.

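Here we use **PunktSentenceTokenizer**, NLTK's implementation of the
Punkt algorithm for finding sentence boundaries. Created without
arguments, as in the cell below, it runs with default settings and has
not learned any abbreviations. NLTK's pre-trained English Punkt model,
which `nltk.sent_tokenize` uses, has learned how sentences usually start
and end from large collections of English text, so it handles periods in
abbreviations (e.g. "Dr.") better than a simple `split('.')` would.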

In [4]:
from nltk.tokenize import PunktSentenceTokenizer

sentence_tokenizer = PunktSentenceTokenizer()
sentences = sentence_tokenizer.tokenize(sample_text)
sentences

['Text processing with Python is fun.',
 'We can use NLTK to explore language.',
 'This will be useful for future projects!']

### Exercise 2

1. How many sentences did NLTK find?
2. Does this match what *you* consider separate sentences?
3. Replace `sample_text` with a short text of your own choice and run the cells again.
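
To see why a dedicated sentence tokenizer is preferable to `split('.')`, the sketch below runs both on the sample text. The names `naive_parts`, `punkt_sentences` and `price_parts` are illustrative only, and the second string is a made-up example that is not part of the lesson text.

```python
from nltk.tokenize import PunktSentenceTokenizer

text = (
    "Text processing with Python is fun. "
    "We can use NLTK to explore language. "
    "This will be useful for future projects!"
)

# Naive approach: split on every period and drop empty pieces.
# The periods themselves are lost.
naive_parts = [part.strip() for part in text.split(".") if part.strip()]
print(naive_parts)

# Punkt keeps the closing punctuation attached to each sentence.
punkt_sentences = PunktSentenceTokenizer().tokenize(text)
print(punkt_sentences)

# The naive split also cuts through decimal numbers.
price_parts = "It costs 3.50 dollars.".split(".")
print(price_parts)
```

The naive version only looks at the character `.`, so it cannot tell a sentence end from a decimal point.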

## 4. Word tokenization

Next, we split the text into **word tokens**. This will be important later when we want to count words or analyze patterns in language.

Here we use **TreebankWordTokenizer**, which follows rules that were
designed for a well-known annotated corpus (the Penn Treebank).
It knows, for example, how to separate contractions (`"don't" → "do" + "n't"`)
and how to handle punctuation around words.

In [5]:
from nltk.tokenize import TreebankWordTokenizer

word_tokenizer = TreebankWordTokenizer()
words = word_tokenizer.tokenize(sample_text)
words

['Text',
 'processing',
 'with',
 'Python',
 'is',
 'fun.',
 'We',
 'can',
 'use',
 'NLTK',
 'to',
 'explore',
 'language.',
 'This',
 'will',
 'be',
 'useful',
 'for',
 'future',
 'projects',
 '!']

### Exercise 3

- Look at the list of tokens above.
- Which tokens are words, and which are punctuation?
- Do you see anything surprising, or something you would handle differently if you were tokenizing by hand?
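
As a possible follow-up to this exercise, the sketch below shows two behaviours of `TreebankWordTokenizer`: it splits contractions, and it only separates a period at the very end of the string it is given. That second point is why `'fun.'` stayed in one piece above, and why tokenizing one sentence at a time gives cleaner tokens. The example strings and the names `contraction_tokens` and `per_sentence_tokens` are chosen for illustration.

```python
from nltk.tokenize import PunktSentenceTokenizer, TreebankWordTokenizer

word_tokenizer = TreebankWordTokenizer()

# Contractions are split into two tokens.
contraction_tokens = word_tokenizer.tokenize("I don't know.")
print(contraction_tokens)

# Split into sentences first, then tokenize each sentence separately,
# so every sentence-final period sits at the end of its own string.
text = "Text processing with Python is fun. We can use NLTK to explore language."
sentences = PunktSentenceTokenizer().tokenize(text)
per_sentence_tokens = [
    token
    for sentence in sentences
    for token in word_tokenizer.tokenize(sentence)
]
print(per_sentence_tokens)
```

Whether you tokenize the whole text at once or sentence by sentence is a design choice; the second option costs one extra step but keeps words and periods apart.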

## 5. Normalizing text (lowercasing)

A common step in text processing is to convert everything to lowercase so that `Text` and `text` are treated the same.

In [6]:
lower_words = [w.lower() for w in words]
lower_words

['text',
 'processing',
 'with',
 'python',
 'is',
 'fun.',
 'we',
 'can',
 'use',
 'nltk',
 'to',
 'explore',
 'language.',
 'this',
 'will',
 'be',
 'useful',
 'for',
 'future',
 'projects',
 '!']

### Exercise 4

- Why might lowercasing be helpful when counting word frequencies?
- Can you think of any situation where lowercasing might *lose* useful information?
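
A small plain-Python sketch can make both sides of this trade-off visible. The token list is invented for illustration: lowercasing merges `The` and `the`, which helps counting, but it also merges `US` (the country) with `us` (the pronoun).

```python
# Hypothetical token list, not taken from the sample text.
tokens = ["The", "US", "team", "told", "us", "the", "news"]
lowered = [token.lower() for token in tokens]

distinct_before = sorted(set(tokens))
distinct_after = sorted(set(lowered))

print(len(distinct_before), distinct_before)
print(len(distinct_after), distinct_after)
```

Seven distinct tokens become five after lowercasing: one merge is helpful, the other loses a distinction.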

## 6. Counting word frequencies

NLTK has a convenient `FreqDist` class for counting how often each token appears. This is a simple but powerful tool that we will reuse later.

In [7]:
from nltk import FreqDist

freq = FreqDist(lower_words)
freq.most_common(10)

[('text', 1),
 ('processing', 1),
 ('with', 1),
 ('python', 1),
 ('is', 1),
 ('fun.', 1),
 ('we', 1),
 ('can', 1),
 ('use', 1),
 ('nltk', 1)]

### Exercise 5

1. Which words are most common in this tiny sample?
2. Change `sample_text` to a different paragraph and run the cells again.
3. Compare the most common words between two different texts.
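
In the tiny sample every token occurs once, so `most_common` simply lists tokens in the order they first appeared. The sketch below uses a made-up sentence with repeated words to show a few `FreqDist` features; the sentence and the variable names are illustrative only.

```python
from nltk import FreqDist

# Hypothetical text with some repeated words, split on whitespace.
tokens = "the cat sat on the mat and the dog sat too".split()
freq = FreqDist(tokens)

top_two = freq.most_common(2)   # the two most frequent tokens with counts
total = freq.N()                # total number of tokens counted
share_the = freq.freq("the")    # relative frequency of "the"

print(top_two)
print(total)
print(round(share_the, 3))
```

`FreqDist` behaves like a dictionary of counts, so `freq["the"]` also works for looking up a single token.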

## 7. Dealing with punctuation

Sometimes we want to remove punctuation tokens so that we only count *words*.

NLTK also provides tokenizers that can help with this. Here we use
**RegexpTokenizer** with a pattern that keeps only word characters
(`\w+` = letters, digits and underscore) and drops punctuation.

In [8]:
from nltk.tokenize import RegexpTokenizer

no_punct_tokenizer = RegexpTokenizer(r"\w+")

# TODO: use `no_punct_tokenizer` to create a list called
# `words_no_punct` from `sample_text` that contains only
# tokens without punctuation.

words_no_punct = []  # TODO: replace this with your solution
words_no_punct

[]

### Exercise 6

- Complete the TODO above to fill in `words_no_punct`.
- Compare `words` and `words_no_punct`. Which tokens disappeared?
- Rerun this with your own `sample_text` and see what happens.
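
To get a feel for what the `\w+` pattern does before applying it to your own text, here is a sketch on a different, made-up sentence. Note the side effect: because an apostrophe is not a word character, contractions are cut in two.

```python
from nltk.tokenize import RegexpTokenizer

# Keep only runs of word characters (letters, digits, underscore).
demo_tokenizer = RegexpTokenizer(r"\w+")

# Hypothetical example sentence, not the lesson's sample text.
demo_tokens = demo_tokenizer.tokenize("Don't panic: it's 9 a.m.!")
print(demo_tokens)
```

If splitting `Don't` into `Don` and `t` is not what you want, a different pattern or tokenizer may suit your text better.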

## 8. Stemming

A **stemmer** reduces related word forms to a shorter root form.
For example, `"connect"`, `"connected"`, and `"connection"` may all
be reduced to something like `"connect"`.

NLTK includes several stemmers. Here we will use the **PorterStemmer**,
a classic rule-based algorithm that chops off common English endings
(such as `-ing`, `-ed`, `-s`). It is simple and fast, and often used as
a first step for grouping related word forms in text analysis.

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# TODO: apply the stemmer to each token in `lower_words`
# and store the result in a new list called `stems`.

stems = []  # TODO: replace this with your solution
stems[:30]

### Exercise 7

- Complete the TODO above to create the list `stems`.
- Pick a few words and compare the original token to its stem.
- Which stems look reasonable? Which ones look a bit strange?
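
The `connect` example from the introduction of this section can be checked directly. The sketch below is separate from the exercise: the word lists and the names `word_family`, `family_stems` and `other_words` are chosen only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Related forms that should collapse to the same stem.
word_family = ["connect", "connected", "connecting", "connection"]
family_stems = [stemmer.stem(word) for word in word_family]
print(list(zip(word_family, family_stems)))

# A few more words to inspect; stems are not always dictionary words.
other_words = ["language", "useful", "studies"]
print([(word, stemmer.stem(word)) for word in other_words])
```

Looking at the second list is a quick way to spot stems that look a bit strange, since the stemmer applies suffix rules without checking a dictionary.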

## 9. Lemmatization

A **lemmatizer** tries to reduce words to a base form that is a real
dictionary word (a *lemma*). This often works better if we know the
part-of-speech of each word.

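Here we use **WordNetLemmatizer** on a small list of words. Its
`lemmatize` method treats every word as a noun unless you pass a `pos`
argument (for example `pos="v"` for verbs).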

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# TODO: lemmatize a small list of example words.
# Start with a mix of word types, such as "running", "better", "mice", "was".

example_words = ["running", "better", "mice", "was"]  # you can edit this list

lemmas = []  # TODO: fill this list using lemmatizer
list(zip(example_words, lemmas))

### Exercise 8

- Complete the TODO above to fill in `lemmas`.
- Compare stems vs. lemmas for the same words (for example, by
  feeding the same list into your stemmer code).
- Which representation do you think would be more useful for
  analyzing text in your projects?

## 10. Pronunciations for speech (CMU dictionary)

For text-to-speech we eventually need to know **how to pronounce each word**.
NLTK includes the CMU Pronouncing Dictionary (`cmudict`), which we
will use later in the MariaTTS project to turn words into sequences
of **phones** (basic sound units).

> This section is a gentle preview; we will come back to it in more detail when we work directly on speech.


In [None]:
from nltk.corpus import cmudict

cmu = cmudict.dict()

for word in ["text", "processing", "python", "language"]:
    print(word, "→", cmu.get(word.lower()))

### Project connection

- Each inner list of uppercase codes (like `['P', 'AY1', 'TH', 'AA0', 'N']`) represents one possible pronunciation of a word. The digit on a vowel marks its stress (`0` none, `1` primary, `2` secondary).
- Later, the TTS system will break these sequences into **diphones** (pairs of phones) and use recorded audio for each diphone to synthesize speech.
- Try adding some of your own words to the list above. Which ones are missing from the dictionary (printed as `None`)?
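
As a rough preview of the diphone idea, the sketch below pairs each phone with the one that follows it, using the pronunciation of "python" shown above. It is a simplified illustration in plain Python, not the actual MariaTTS code, which may treat word boundaries, silence, and stress differently.

```python
# Pronunciation of "python" as listed in the CMU dictionary example above.
phones = ["P", "AY1", "TH", "AA0", "N"]

# Pair each phone with its right-hand neighbour.
diphones = list(zip(phones, phones[1:]))
print(diphones)

# Optionally drop the stress digits to get the bare phone symbols.
bare_phones = [phone.rstrip("012") for phone in phones]
print(bare_phones)
```

A word with five phones yields four such pairs, one for each transition between neighbouring sounds.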


## 11. Part-of-speech (POS) tagging (preview)

Part-of-speech tagging labels each word with its grammatical role (noun, verb, adjective, etc.). We will only take a quick look here.

> Do not worry about understanding every tag now. The goal is just to get familiar with the idea.

In [None]:
pos_tags = nltk.pos_tag(words)
pos_tags[:20]

### Exercise 9 (discussion)

- Pick a few words and look at their tags.
- Do the tags match your intuition about whether the word is a noun, verb, etc.?
- Note down any tags you find confusing and any questions you have.

## 12. Summary

In this lesson, you:
- Installed and imported NLTK
- Loaded and printed a short text
- Tokenized text into sentences and words
- Normalized words with lowercasing
- Counted word frequencies
- Practiced removing punctuation from token lists
- Explored stemming and lemmatization
- Looked up pronunciations using the CMU dictionary (preview for text-to-speech)
- Took a quick look at part-of-speech tags

These basic operations will be useful building blocks for later work with text in Python.
