# Lecture 6: NLTK Corpora

The Natural Language Toolkit (NLTK) is a powerful Python library for working with human language data. This tutorial covers accessing and exploring corpora, frequency distributions, and text analysis.

**Prerequisites:**
- Python and the `ling250` environment
- NLTK installed (`pip install nltk` if needed)
- `Night_Vale.txt` in your `data/` folder

## What is NLTK?

NLTK is an **open-source** Python library (free to use and modify) with a huge range of functionality:
- Accessing, exploring, and manipulating corpora
- Finding and visualizing statistical patterns in text
- Tagging/annotating text for linguistic structure
- Tokenization and basic NLP models

It's excellent for **exploratory analysis** of text. We could easily spend weeks on just NLTK — today we'll focus on `nltk.corpus`.

In [None]:
import nltk
from nltk.corpus import gutenberg

## Part 1: Accessing NLTK Corpora

NLTK comes with many built-in corpora. Let's start with the **Gutenberg corpus** — a collection of public domain literature from [Project Gutenberg](https://www.gutenberg.org/).

### Viewing available files

Every corpus has a `.fileids()` method to see what files it contains:

In [None]:
# See what files are in the Gutenberg corpus
gutenberg.fileids()

**Note:** If you get an error saying the corpus isn't found, you may need to download it first:

In [None]:
# Uncomment and run if needed:
# nltk.download('gutenberg')

### Levels of granularity

We can access corpus text at different levels:

In [None]:
# .raw() - raw characters (unprocessed)
raw_text = gutenberg.raw('austen-emma.txt')
print(type(raw_text))
print(raw_text[:200])

In [None]:
# .words() - tokenized words (no sentence structure)
words = gutenberg.words('austen-emma.txt')
print(type(words))
print(words[:30])

In [None]:
# .sents() - sentences (list of lists)
sents = gutenberg.sents('austen-emma.txt')
print(type(sents))
print(sents[:3])

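**Note:** If `.sents()` raises a `LookupError`, the sentence tokenizer data is missing. The error message names the resource to download: `punkt` on older NLTK releases, `punkt_tab` on recent ones.

In [None]:
# Uncomment and run whichever resource the error message asks for:
# nltk.download('punkt')
# nltk.download('punkt_tab')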
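
To see how the three levels of granularity relate to each other, here is a plain-Python sketch on a toy string. The regex-based sentence splitter and tokenizer are simplifying assumptions for illustration; they are not NLTK's actual tokenizers, and the variable names are made up.

```python
import re

# Toy stand-in for .raw(): one long string of characters
raw = "Emma was happy. She smiled."

# Toy stand-in for .sents(): a list of sentences, each a list of tokens.
# Assumption: sentences end at ., ! or ?, and tokens are either runs of
# word characters or single punctuation marks.
sentence_strings = re.split(r"(?<=[.!?])\s+", raw)
toy_sents = [re.findall(r"\w+|[^\w\s]", s) for s in sentence_strings]

# Toy stand-in for .words(): the same tokens with sentence boundaries removed
toy_words = [token for sent in toy_sents for token in sent]

print(len(raw), "characters")
print(toy_sents)
print(toy_words)
```

The point is only the shape of each result: a string, a list of lists, and a flat list.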

## Part 2: The Brown Corpus

The Brown Corpus is a collection of American English texts from different genres. What makes it special is that texts are **categorized** by genre.

In [None]:
from nltk.corpus import brown

# What genres are available?
brown.categories()

### Accessing specific categories

We can filter by category when accessing words or sentences:

In [None]:
# Get words from only the 'news' category
news_words = brown.words(categories='news')
print(news_words[:50])

In [None]:
# Multiple categories
fiction_words = brown.words(categories=['science_fiction', 'romance'])
print(len(fiction_words), "words in sci-fi and romance")

## Part 3: Loading Your Own Text Files

You can treat your own text files as a corpus using `PlaintextCorpusReader`.

In [None]:
from nltk.corpus import PlaintextCorpusReader

# Create a corpus from files in the ../data/ directory
# First argument: directory path
# Second argument: file pattern(s)
my_corpus = PlaintextCorpusReader('../data/', 'Night_Vale.txt')

# Now we can use the same methods
print(my_corpus.fileids())

In [None]:
# Get words from our Night Vale corpus
nv_words = my_corpus.words('Night_Vale.txt')
print(len(nv_words), "words")
print(nv_words[:50])

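**Tip:** The second argument is a regular expression, not a shell-style wildcard, so you can load multiple files with a pattern such as `.*\.txt`:

In [None]:
# Load all .txt files in a directory
# (raw string, so Python does not try to interpret the backslash itself)
# multi_corpus = PlaintextCorpusReader('../data/', r'.*\.txt')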
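
The file pattern is interpreted as a regular expression over file names, which is why it is written `.*\.txt` rather than `*.txt`. The sketch below uses only Python's `re` module on a made-up list of file names (all names are hypothetical) to show which files such a pattern would select. It approximates the matching with `re.fullmatch` and does not call NLTK itself.

```python
import re

# Hypothetical directory listing, for illustration only
file_names = ["Night_Vale.txt", "notes.md", "episode_01.txt", "txt_backup.zip"]

# Raw string so the backslash reaches the regex engine unchanged
pattern = r".*\.txt"

matching = [name for name in file_names if re.fullmatch(pattern, name)]
print(matching)

# A shell-style glob such as '*.txt' is not a valid regular expression
try:
    re.compile("*.txt")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False
print(glob_is_valid_regex)
```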

## Part 4: Frequency Distributions

A `FreqDist` counts occurrences in a list of items (usually words).

In [None]:
from nltk import FreqDist

# Count word frequencies in Emma
emma_words = gutenberg.words('austen-emma.txt')
fdist = FreqDist(emma_words)

print(type(fdist))

### Useful FreqDist methods

In [None]:
# Most common words
fdist.most_common(20)

In [None]:
# Single most common item (this can be a punctuation mark, because punctuation tokens are counted too)
fdist.max()

In [None]:
# How many times does a specific word appear?
fdist['Emma']

In [None]:
# Tabulate top items
fdist.tabulate(15)

In [None]:
# Plot frequencies
fdist.plot(30)

# If the plot doesn't show, you may need:
# import matplotlib.pyplot as plt
# plt.show()
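
`FreqDist` is built on Python's `collections.Counter`, so much of its behavior can be previewed without any corpus. The following is a minimal sketch on a made-up token list; the comments name the `FreqDist` call each line corresponds to.

```python
from collections import Counter

# Made-up tokens, for illustration only
tokens = ["the", "cat", "sat", "on", "the", "mat", ".", "the", "end", "."]

counts = Counter(tokens)

top_two = counts.most_common(2)               # like fdist.most_common(2)
the_count = counts["the"]                     # like fdist['the']
missing_count = counts["dog"]                 # unseen items count as 0, no KeyError
most_frequent = counts.most_common(1)[0][0]   # the item fdist.max() would report here
total_tokens = sum(counts.values())           # the total that fdist.N() reports

print(top_two, the_count, missing_count, most_frequent, total_tokens)
```

Note that the punctuation mark `.` is counted like any other token, which is why punctuation tends to appear near the top of real frequency lists.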

### Night Vale example

In [None]:
# Most common words in Night Vale
nv_fdist = FreqDist(nv_words)
nv_fdist.most_common(30)

**Observation:** The most common words are function words ("the", "a", "to", etc.). To find more informative content words, we can filter these out.

In [None]:
# Keep only alphabetic tokens longer than 3 characters (drops punctuation and most function words)
content_words = [w for w in nv_words if len(w) > 3 and w.isalpha()]
content_fdist = FreqDist(content_words)
content_fdist.most_common(30)
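
A length cutoff is a blunt tool: it drops short content words and keeps long function words. An alternative is to filter against an explicit stopword list. The sketch below assumes a tiny hand-written stopword set and a made-up token list (both hypothetical); NLTK also provides a `stopwords` corpus, downloaded separately, that could take the place of the toy set.

```python
from collections import Counter

# Hypothetical mini stopword list; a real one is much longer
toy_stopwords = {"the", "a", "to", "of", "and", "in", "is"}

# Made-up tokens, for illustration only
tokens = ["The", "radio", "is", "a", "voice", "in", "the", "desert", ".",
          "The", "desert", "listens", "."]

# Lowercase for case-insensitive matching, drop punctuation and stopwords
content = [
    t.lower()
    for t in tokens
    if t.isalpha() and t.lower() not in toy_stopwords
]

content_counts = Counter(content)
print(content_counts.most_common(3))
```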

## Part 5: Conditional Frequency Distributions

A `ConditionalFreqDist` counts occurrences based on **conditions**. It takes a list of **pairs** where:
- First item = condition (e.g., genre, category)
- Second item = sample (e.g., word)

In [None]:
from nltk import ConditionalFreqDist

# Example: Count words by genre in Brown corpus
# Create pairs of (genre, word)
genre_word_pairs = [
    (genre, word)
    for genre in ['news', 'romance']
    for word in brown.words(categories=genre)
]

print(genre_word_pairs[:10])

In [None]:
# Build the ConditionalFreqDist
cfd = ConditionalFreqDist(genre_word_pairs)
print(type(cfd))
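
Conceptually, a `ConditionalFreqDist` is a dictionary that maps each condition to its own frequency distribution. The sketch below rebuilds that idea with `defaultdict` and `Counter` from the standard library, on made-up pairs. It illustrates the data structure under that assumption and is not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Made-up (condition, sample) pairs, for illustration only
pairs = [
    ("news", "the"), ("news", "said"), ("romance", "she"),
    ("romance", "the"), ("news", "the"), ("romance", "she"),
]

# One Counter per condition, created on first use
by_condition = defaultdict(Counter)
for condition, sample in pairs:
    by_condition[condition][sample] += 1

conditions = sorted(by_condition)                      # like cfd.conditions()
news_the = by_condition["news"]["the"]                 # like cfd['news']['the']
romance_top = by_condition["romance"].most_common(1)   # like cfd['romance'].most_common(1)

print(conditions, news_the, romance_top)
```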

### Useful ConditionalFreqDist methods

In [None]:
# What conditions (genres) do we have?
cfd.conditions()

In [None]:
# Access a specific condition's FreqDist
cfd['news'].most_common(20)

In [None]:
# Compare specific words across conditions
# Use samples parameter to specify which words to show
cfd.tabulate(samples=['the', 'she', 'he', 'love', 'said', 'news'])

In [None]:
# Plot comparison (limit to specific words)
cfd.plot(samples=['she', 'he', 'love', 'said', 'news', 'man', 'woman'])

### More genres

In [None]:
# Compare more genres
genres = ['news', 'romance', 'science_fiction', 'humor']
genre_word_pairs = [
    (genre, word.lower())  # lowercase for case-insensitive comparison
    for genre in genres
    for word in brown.words(categories=genre)
]

cfd_multi = ConditionalFreqDist(genre_word_pairs)

In [None]:
# Compare interesting words
cfd_multi.tabulate(samples=['love', 'space', 'time', 'funny', 'said', 'president'])

## Part 6: Text Objects

NLTK's `Text` class wraps a list of words and provides convenient analysis methods.

In [None]:
from nltk.text import Text

# Create a Text object from our Night Vale words
nv_text = Text(nv_words)
print(type(nv_text))

**Important:** `Text()` expects a **list of words**, not a list of sentences. If you pass it sentences, it will treat each sentence as a single "word"!
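
If you only have sentences (a list of lists, like the output of `.sents()`), one way to get the flat word list that `Text()` expects is to chain the sentences together. A minimal sketch on made-up data, with hypothetical variable names:

```python
from itertools import chain

# Made-up sentences in the same shape as .sents() output
toy_sents = [["Welcome", "to", "Night", "Vale", "."], ["Good", "night", "."]]

# Flatten the list of lists into a single list of words
flat_words = list(chain.from_iterable(toy_sents))

print(flat_words)
```

When the corpus reader is available, calling `.words()` directly gives the same kind of flat list without this step.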

### Useful Text methods

In [None]:
# Concordance: see a word in context
nv_text.concordance('desert')
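
To make the idea of a concordance concrete, here is a small plain-Python sketch of keyword-in-context lookup. The function name `toy_concordance` and the token list are made up for illustration. Unlike NLTK, which measures the context window in characters, this sketch counts tokens, which is a simplification.

```python
def toy_concordance(tokens, target, width=2):
    """Return each occurrence of target with `width` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        # Compare in lowercase so matching ignores case
        if token.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

# Made-up tokens, for illustration only
tokens = "the lights above the desert hum while the desert sleeps".split()

kwic = toy_concordance(tokens, "desert")
for line in kwic:
    print(line)
```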

In [None]:
# Similar words: words appearing in similar contexts
nv_text.similar('desert')

In [None]:
# Common contexts: where do these words appear?
nv_text.common_contexts(['desert', 'city'])

In [None]:
# Count occurrences
nv_text.count('Cecil')

In [None]:
# Collocations: word pairs that occur together unusually often
nv_text.collocations()

In [None]:
# Dispersion plot: where do words appear in the text?
nv_text.dispersion_plot(['Cecil', 'Carlos', 'desert', 'radio'])

### Emma as a Text object

In [None]:
emma_text = Text(emma_words)
emma_text.concordance('Emma')

In [None]:
emma_text.collocations()

## Part 7: Bigrams

A **bigram** is a pair of consecutive words. Bigrams help us understand which words tend to follow other words.

In [None]:
from nltk.util import bigrams

# Simple example
text = ['the', 'cat', 'sat', 'on', 'the', 'mat']
bi = bigrams(text)

# bigrams() returns a generator, so convert to list to see it
print(list(bi))

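**Note on generators:** A generator is like a lazy list: it produces items one at a time rather than storing them all in memory. Converting it to a list forces it to produce everything at once. A generator can also be consumed only once; after that it is empty, so it has to be created again before it can be reused.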

In [None]:
# For a second use, create the generator again
bi = bigrams(text)
print(list(bi))
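
The reason for recreating the generator is that lazy sequences are used up after one pass. The sketch below shows this with `zip()`, which builds consecutive pairs lazily in plain Python. It is a stand-in for `bigrams()` chosen for illustration and behaves the same way in this respect.

```python
words = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Pairing the list with itself shifted by one yields consecutive word pairs lazily
pairs = zip(words, words[1:])

first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing left to produce

print(first_pass)
print(second_pass)
```

Storing the result as a list, as in `list(bigrams(text))`, avoids the problem whenever the pairs are needed more than once.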

### Bigrams with real text

In [None]:
# Get bigrams from Emma
emma_bigrams = list(bigrams(emma_words))
print(len(emma_bigrams), "bigrams")
print(emma_bigrams[:20])

### Bigrams with ConditionalFreqDist

Since bigrams are **pairs**, we can use them with `ConditionalFreqDist`:
- Condition = first word
- Sample = second word

This tells us: "Given the first word, what words commonly follow?"

In [None]:
# Build CFD from bigrams
emma_cfd = ConditionalFreqDist(emma_bigrams)

In [None]:
# What words follow "Mr"?
emma_cfd['Mr'].most_common(10)

In [None]:
# What words follow "she"?
emma_cfd['she'].most_common(10)

### Text generation with bigrams

We can use bigram frequencies to generate text — always pick the most common following word:

In [None]:
# Naive generation: always pick .max()
current_word = 'Emma'
generated = [current_word]

for i in range(20):
    # Get most common word that follows current_word
    next_word = emma_cfd[current_word].max()
    generated.append(next_word)
    current_word = next_word

print(' '.join(generated))

**Problem:** This often gets stuck in loops! Always choosing the *most* common word is too deterministic.
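
The loop is easy to reproduce on a tiny example. Because the next word depends only on the current word, the output repeats forever as soon as any word comes up a second time. The sketch below uses a made-up token sequence and plain `Counter` objects instead of a `ConditionalFreqDist` (an assumption made to keep it self-contained), and stops as soon as it detects a repeat.

```python
from collections import Counter, defaultdict

# Made-up tokens, for illustration only
tokens = "a b a b a c".split()

# For each word, count the words that follow it
follows = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    follows[first][second] += 1

current = "a"
generated = [current]
seen = {current}
looped = False

for _ in range(20):
    options = follows.get(current)
    if not options:
        break  # dead end: nothing ever follows this word
    next_word = options.most_common(1)[0][0]  # greedy choice, like .max()
    if next_word in seen:
        looped = True  # from here on the output would repeat
        break
    generated.append(next_word)
    seen.add(next_word)
    current = next_word

print(generated, "loop detected:", looped)
```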

### Better generation: random sampling

Instead of always picking the most common word, we can randomly sample based on frequencies:

In [None]:
import random

current_word = 'Emma'
generated = [current_word]

for i in range(30):
    # Get all possible next words and their frequencies
    following_words = list(emma_cfd[current_word].keys())
    frequencies = list(emma_cfd[current_word].values())
    
    # If no following words, stop
    if not following_words:
        break
    
    # Randomly choose weighted by frequency
    next_word = random.choices(following_words, weights=frequencies)[0]
    generated.append(next_word)
    current_word = next_word

print(' '.join(generated))

In [None]:
# Try running the above cell multiple times - you get different results!
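
Different output on every run is the point of random sampling, but it makes results hard to reproduce, for example when comparing notes with a classmate. One option is to draw from a seeded random generator. The sketch below is a self-contained illustration on made-up tokens; the `generate` function, its parameters, and the seed value are hypothetical choices, and plain `Counter` objects stand in for the `ConditionalFreqDist`.

```python
import random
from collections import Counter, defaultdict

# Made-up tokens, for illustration only
tokens = "the cat sat on the mat and the cat ran".split()

follows = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    follows[first][second] += 1

def generate(start, length, seed):
    """Sample up to `length` following words, weighted by bigram frequency."""
    rng = random.Random(seed)  # private generator, global random state untouched
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break  # dead end: no word follows the current one
        words, weights = zip(*options.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

run_a = generate("the", 8, seed=42)
run_b = generate("the", 8, seed=42)

print(" ".join(run_a))
print(run_a == run_b)
```

Changing the seed gives a different but equally repeatable sequence.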

## Part 8: Lexical Resources

NLTK includes several lexical resources. One of the most powerful is **WordNet**, a large lexical database of English.

WordNet groups words into **synsets** (synonym sets) and provides:
- Definitions
- Examples
- Relationships between words (synonyms, antonyms, hypernyms, etc.)

In [None]:
from nltk.corpus import wordnet as wn

# Uncomment if needed:
# nltk.download('wordnet')

In [None]:
# Get synsets for a word
wn.synsets('dog')

In [None]:
# Get definition of first synset
dog = wn.synset('dog.n.01')
print(dog.definition())

In [None]:
# Get examples
dog.examples()

In [None]:
# Get hypernyms (more general terms)
dog.hypernyms()

In [None]:
# Get hyponyms (more specific terms)
dog.hyponyms()[:10]

**Note:** We'll explore WordNet more in future lectures. For now, just know it exists as a powerful resource for understanding word relationships!

---

## Practice Exercises

Try these on your own!

### Exercise 1: Most common words in different genres

Compare the top 20 most common words (excluding punctuation and short words) in the 'news' and 'humor' categories of the Brown corpus. What differences do you notice?

In [None]:
# Your code here


### Exercise 2: Character names in Night Vale

Find the most common capitalized words in Night Vale (hint: use regex or check if `word[0].isupper()`). These are likely to include character names and locations. What are the top 20?

In [None]:
# Your code here


### Exercise 3: Comparing corpora with Text objects

Create Text objects for both Night Vale and one of the Gutenberg texts. Use `.similar()` to compare which words appear in similar contexts to "night" in each text. What differences do you notice?

In [None]:
# Your code here


### Exercise 4: Generate from Night Vale

Create a bigram-based text generator for Night Vale. Start with the word "Welcome" and generate 50 words. Try running it several times. Does it capture any Night Vale flavor?

In [None]:
# Your code here


---

## Quick Reference

### Accessing Corpora

| Method | Returns |
|--------|--------|
| `.fileids()` | List of file names |
| `.raw(fileid)` | Raw text as string |
| `.words(fileid)` | List of words |
| `.sents(fileid)` | List of sentences (each sentence is a list of words) |
| `.categories()` | List of categories (if applicable) |

### FreqDist

| Method | Returns |
|--------|--------|
| `.most_common(n)` | List of (word, count) tuples for top n words |
| `.max()` | Most common word |
| `[word]` | Count of specific word |
| `.tabulate(n)` | Print table of top n words |
| `.plot(n)` | Plot frequency graph |

### ConditionalFreqDist

| Method | Returns |
|--------|--------|
| `.conditions()` | List of conditions |
| `[condition]` | FreqDist for that condition |
| `.tabulate(samples=[...])` | Print table comparing conditions |
| `.plot(samples=[...])` | Plot comparison |

### Text Objects

| Method | Does |
|--------|-----|
| `.concordance(word)` | Show word in context |
| `.similar(word)` | Find words in similar contexts |
| `.common_contexts([words])` | Find shared contexts |
| `.count(word)` | Count occurrences |
| `.collocations()` | Find common phrases |
| `.dispersion_plot([words])` | Plot word positions |

### Other

| Function | Does |
|----------|-----|
| `nltk.download('name')` | Download NLTK data |
| `bigrams(words)` | Create bigrams from word list |
| `PlaintextCorpusReader(dir, files)` | Create corpus from local files |

---

## Further Reading

- [NLTK Book Chapter 1](https://www.nltk.org/book/ch01.html) — Language Processing and Python
- [NLTK Book Chapter 2](https://www.nltk.org/book/ch02.html) — Accessing Text Corpora and Lexical Resources
- [NLTK Corpus HOWTO](https://www.nltk.org/howto/corpus.html)
- [NLTK Documentation](https://www.nltk.org/)