# AIG 230 – Week 2 Lab
## From Raw Text to Corpus: Tokenization, Normalization, and Vocabulary

Industry Context: Exploring the State of the Union Corpus

## Learning Objectives
- Understand raw text, documents, and corpora
- Explore a real-world corpus
- Compare NLTK and spaCy preprocessing pipelines
- Perform tokenization, normalization, and vocabulary analysis

In [2]:

import nltk
import spacy
import string
from collections import Counter

nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("state_union")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!


True

In [3]:
# Load spaCy English model
# Run once if needed:
!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Breaking it down:

```
spacy.load() - loads a pre-trained language pipeline
```

"en_core_web_sm" - the small English pipeline, which includes:

- Tokenizer - splits text into tokens (words, punctuation, numbers)
- Part-of-speech tagger - identifies noun, verb, adjective, etc.
- Dependency parser - analyzes grammatical relationships and sets sentence boundaries
- Lemmatizer - maps each token to its base form
- Named entity recognizer - identifies people, places, organizations

The small model does not ship with static word vectors; the medium and large models (en_core_web_md, en_core_web_lg) do.

nlp = - stores the loaded pipeline in a variable so you can use it to process text

## Part 1 – Obtaining the Corpus

In [4]:

from nltk.corpus import state_union
state_union.fileids()[:10]


['1945-Truman.txt',
 '1946-Truman.txt',
 '1947-Truman.txt',
 '1948-Truman.txt',
 '1949-Truman.txt',
 '1950-Truman.txt',
 '1951-Truman.txt',
 '1953-Eisenhower.txt',
 '1954-Eisenhower.txt',
 '1955-Eisenhower.txt']

Each file is a document. The collection is the corpus.

In [5]:

len(state_union.fileids())


65

## Part 2 – Inspect Raw Text

In [6]:

sample_file = state_union.fileids()[0]
raw_text = state_union.raw(sample_file)
raw_text[:1000]


"PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\nIt is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.\nOnly yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt. At a time like this, words are inadequate. The most eloquent tribute would be a reverent silence.\nYet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.\nIn His infinite wisdom, Almighty God has seen fit to take from us a great man who loved, and was beloved by, all humanity.\nNo man could possibly fill the tremendous void left by the passing of that noble soul. No words can ease the aching hearts of untold millions of every race, creed and color. The world knows it has lost a heroic champion of justice and freedom.\nTragic fate has 

## Part 3 – Word Tokenization

In [7]:

from nltk.tokenize import word_tokenize
tokens_nltk = word_tokenize(raw_text)
tokens_nltk[:50]


['PRESIDENT',
 'HARRY',
 'S.',
 'TRUMAN',
 "'S",
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION',
 'OF',
 'THE',
 'CONGRESS',
 'April',
 '16',
 ',',
 '1945',
 'Mr.',
 'Speaker',
 ',',
 'Mr.',
 'President',
 ',',
 'Members',
 'of',
 'the',
 'Congress',
 ':',
 'It',
 'is',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'I',
 'stand',
 'before',
 'you',
 ',',
 'my',
 'friends',
 'and',
 'colleagues',
 ',',
 'in',
 'the',
 'Congress',
 'of',
 'the']

In [21]:

doc = nlp(raw_text)
doc


PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS
 
April 16, 1945

Mr. Speaker, Mr. President, Members of the Congress:
It is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.
Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt. At a time like this, words are inadequate. The most eloquent tribute would be a reverent silence.
Yet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.
In His infinite wisdom, Almighty God has seen fit to take from us a great man who loved, and was beloved by, all humanity.
No man could possibly fill the tremendous void left by the passing of that noble soul. No words can ease the aching hearts of untold millions of every race, creed and color. The world knows it has lost a heroic champion of justice and freedom.
Tragic fate has thrust upon

In [None]:
tokens_spacy = [token.text for token in doc]
tokens_spacy[:50]

📌 Tokenization splits text into meaningful units (tokens).
There is no universal standard; conventions vary by task and language.
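
To make the "no universal standard" point concrete, here is a minimal sketch that applies two simple conventions to a short phrase in the style of the speech. Both tokenizers are illustrative stand-ins written with the standard library; they are not what NLTK or spaCy do internally.

```python
import re

text = "Mr. Speaker, it is 1945."

# Convention 1: split on whitespace only; punctuation stays glued to words.
whitespace_tokens = text.split()

# Convention 2: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Mr.', 'Speaker,', 'it', 'is', '1945.']
print(regex_tokens)       # ['Mr', '.', 'Speaker', ',', 'it', 'is', '1945', '.']
```

The NLTK output in Part 3 keeps 'Mr.' as a single token, which is yet another convention.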

## Part 4 – Sentence Tokenization

In [9]:

from nltk.tokenize import sent_tokenize
sentences_nltk = sent_tokenize(raw_text)
sentences_nltk[:5]


["PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\nIt is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.",
 'Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt.',
 'At a time like this, words are inadequate.',
 'The most eloquent tribute would be a reverent silence.',
 'Yet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.']

In [10]:

sentences_spacy = list(doc.sents)
sentences_spacy[:5]


[PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS
  
 April 16, 1945
 
 Mr. Speaker, Mr. President, Members of the Congress:,
 It is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.,
 Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt.,
 At a time like this, words are inadequate.,
 The most eloquent tribute would be a reverent silence.]
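
Sentence splitting is harder than it looks because a period does not always end a sentence. As a rough sketch (a deliberately naive rule, not how NLTK or spaCy segment text), splitting after every period followed by whitespace breaks on the abbreviation "Mr.":

```python
import re

text = "Mr. Speaker, words are inadequate. Our silence might be misunderstood."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(naive_sentences)
# ['Mr.', 'Speaker, words are inadequate.', 'Our silence might be misunderstood.']
```

Both libraries above avoid this particular mistake, although they still disagree on where the title and greeting end.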

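## Part 5 – Normalization
Normalization makes text more consistent so that surface variants of the same word are counted together. Here it means lowercasing every token and keeping only purely alphabetic tokens, which drops punctuation, numbers, and tokens such as "Mr." or "'S".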

In [11]:

normalized_nltk = [t.lower() for t in tokens_nltk if t.isalpha()]
normalized_nltk[:20]


['president',
 'harry',
 'truman',
 'address',
 'before',
 'a',
 'joint',
 'session',
 'of',
 'the',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'of',
 'the',
 'congress',
 'it',
 'is']
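
The `isalpha()` filter removes more than punctuation. A small sketch on tokens copied from the Part 3 output shows what is kept and what is silently dropped:

```python
raw_tokens = ["HARRY", "S.", "TRUMAN", "'S", "April", "16", ",", "1945", "Mr.", "Speaker"]

# Keep purely alphabetic tokens and lowercase them.
kept = [t.lower() for t in raw_tokens if t.isalpha()]

# Everything containing a digit, period, or apostrophe is filtered out.
dropped = [t for t in raw_tokens if not t.isalpha()]

print(kept)     # ['harry', 'truman', 'april', 'speaker']
print(dropped)  # ['S.', "'S", '16', ',', '1945', 'Mr.']
```

Whether losing years and abbreviations is acceptable depends on the task.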

In [18]:
def normalize(tokens):
    return [
        token.lower()
        for token in tokens
        if token.isalpha()
    ]
normalize(tokens_nltk)

['president',
 'harry',
 'truman',
 'address',
 'before',
 'a',
 'joint',
 'session',
 'of',
 'the',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'of',
 'the',
 'congress',
 'it',
 'is',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'i',
 'stand',
 'before',
 'you',
 'my',
 'friends',
 'and',
 'colleagues',
 'in',
 'the',
 'congress',
 'of',
 'the',
 'united',
 'states',
 'only',
 'yesterday',
 'we',
 'laid',
 'to',
 'rest',
 'the',
 'mortal',
 'remains',
 'of',
 'our',
 'beloved',
 'president',
 'franklin',
 'delano',
 'roosevelt',
 'at',
 'a',
 'time',
 'like',
 'this',
 'words',
 'are',
 'inadequate',
 'the',
 'most',
 'eloquent',
 'tribute',
 'would',
 'be',
 'a',
 'reverent',
 'silence',
 'yet',
 'in',
 'this',
 'decisive',
 'hour',
 'when',
 'world',
 'events',
 'are',
 'moving',
 'so',
 'rapidly',
 'our',
 'silence',
 'might',
 'be',
 'misunderstood',
 'and',
 'might',
 'give',
 'comfort',
 'to',
 'our',
 'enemies',
 'in',
 'his',
 'infinite',
 'wisdom',
 'almigh

In [12]:

normalized_spacy = [token.text.lower() for token in doc if token.is_alpha]
normalized_spacy[:20]


['president',
 'harry',
 'truman',
 'address',
 'before',
 'a',
 'joint',
 'session',
 'of',
 'the',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'of',
 'the',
 'congress',
 'it',
 'is']

## Part 6 – Stop Word Removal

In [13]:

from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
filtered_nltk = [t for t in normalized_nltk if t not in stop_words]
filtered_nltk[:20]


['president',
 'harry',
 'truman',
 'address',
 'joint',
 'session',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'congress',
 'heavy',
 'heart',
 'stand',
 'friends',
 'colleagues',
 'congress',
 'united',
 'states']

In [14]:

filtered_spacy = [t for t in normalized_spacy if not nlp.vocab[t].is_stop]
filtered_spacy[:20]


['president',
 'harry',
 'truman',
 'address',
 'joint',
 'session',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'congress',
 'heavy',
 'heart',
 'stand',
 'friends',
 'colleagues',
 'congress',
 'united',
 'states']

📌 Stop words are common words that often add little semantic meaning.
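
Stop word removal is a set-membership filter. The sketch below uses a deliberately tiny, hypothetical stop word set (not the NLTK or spaCy list) on tokens from the opening sentence:

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
toy_stop_words = {"it", "is", "with", "a", "that", "i", "before", "you"}

tokens = ["it", "is", "with", "a", "heavy", "heart", "that", "i", "stand", "before", "you"]

# Keep only tokens that are not in the stop word set.
content_tokens = [t for t in tokens if t not in toy_stop_words]

print(content_tokens)  # ['heavy', 'heart', 'stand']
```

Because each library ships its own list, the same text can yield different filtered tokens depending on the pipeline.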

## Part 7 – Vocabulary and Frequency

In [24]:
all_tokens = []

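# normalized_nltk is a flat list of lowercased words from one speech, not a list of documents.
# The loop variable is named `word` so it does not overwrite the spaCy `doc` created earlier.
for word in normalized_nltk:
    tokens = word_tokenize(word.lower())
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in stop_words]
    all_tokens.extend(tokens)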

vocabulary = sorted(set(all_tokens))
print(len(vocabulary))
vocabulary


588


['abiding',
 'abject',
 'able',
 'abroad',
 'achieve',
 'achieved',
 'aching',
 'across',
 'address',
 'admiral',
 'aggressors',
 'ahead',
 'allies',
 'almighty',
 'alone',
 'along',
 'already',
 'also',
 'always',
 'america',
 'american',
 'americans',
 'announced',
 'another',
 'appeal',
 'april',
 'armed',
 'armies',
 'arnold',
 'ask',
 'assigned',
 'assist',
 'assisting',
 'assumed',
 'assure',
 'avert',
 'axis',
 'back',
 'backward',
 'bad',
 'barriers',
 'based',
 'battlefields',
 'beat',
 'beaten',
 'become',
 'becomes',
 'behind',
 'believe',
 'beloved',
 'benefit',
 'better',
 'beyond',
 'bitter',
 'blood',
 'bombs',
 'brave',
 'breakers',
 'bring',
 'bringing',
 'broken',
 'build',
 'call',
 'came',
 'camp',
 'carry',
 'carrying',
 'casts',
 'cease',
 'certain',
 'champion',
 'characteristic',
 'cherish',
 'chief',
 'civilization',
 'colleagues',
 'color',
 'come',
 'comfort',
 'commander',
 'common',
 'complete',
 'complicated',
 'conference',
 'confidence',
 'conflict',
 'c

In [15]:
# Vocabulary of the sample speech (spaCy pipeline)
vocabulary = sorted(set(filtered_spacy))
len(vocabulary), vocabulary[:30]


(543,
 ['abiding',
  'abject',
  'able',
  'abroad',
  'achieve',
  'achieved',
  'aching',
  'address',
  'admiral',
  'advantage',
  'aggressors',
  'ahead',
  'allies',
  'almighty',
  'america',
  'american',
  'americans',
  'announced',
  'appeal',
  'april',
  'armed',
  'armies',
  'arnold',
  'ask',
  'assigned',
  'assist',
  'assisting',
  'assumed',
  'assure',
  'avert'])

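📌 The vocabulary is the set of unique tokens in a corpus. Here it is computed from a single speech, and the two pipelines report different sizes (588 vs. 543) because their tokenizers and stop word lists differ.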
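
It helps to separate tokens (every occurrence) from types (unique entries in the vocabulary). A minimal sketch on the first twelve filtered tokens of the speech:

```python
from collections import Counter

tokens = ["president", "harry", "truman", "address", "joint", "session",
          "congress", "april", "speaker", "president", "members", "congress"]

counts = Counter(tokens)
vocabulary = sorted(counts)

num_tokens = len(tokens)      # total occurrences
num_types = len(vocabulary)   # unique tokens
type_token_ratio = num_types / num_tokens

print(num_tokens, num_types)       # 12 10
print(round(type_token_ratio, 2))  # 0.83
```

The ratio of types to tokens is a rough indicator of lexical variety; it falls as a text gets longer and words repeat.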

In [17]:

Counter(filtered_spacy).most_common(15)


[('peace', 23),
 ('world', 20),
 ('nations', 12),
 ('america', 11),
 ('people', 10),
 ('hope', 10),
 ('united', 8),
 ('freedom', 7),
 ('great', 6),
 ('shall', 6),
 ('man', 5),
 ('justice', 5),
 ('victory', 5),
 ('entire', 5),
 ('defense', 5)]

# Stemming vs Lemmatization

Stemming and lemmatization are both normalization techniques, but they make very different trade-offs:

- Stemming is fast and rule-based, but it can distort meaning.
- Lemmatization is slower but linguistically informed.

In industry, the choice depends on task, domain, and interpretability requirements.
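
The trade-off can be sketched with two toy functions. Both are assumptions made for illustration: `toy_stem` is a three-rule suffix stripper (not the Porter algorithm) and `TOY_LEMMAS` is a hypothetical lookup table standing in for a real lemmatizer's dictionary.

```python
def toy_stem(word):
    """Strip the first matching suffix; a toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
TOY_LEMMAS = {"friends": "friend", "moving": "move", "laid": "lay"}


def toy_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ["friends", "moving", "laid"]:
    print(word, toy_stem(word), toy_lemma(word))
# friends friend friend
# moving mov move
# laid laid lay
```

The rules need no dictionary but produce the non-word "mov" and miss the irregular form "laid"; the lookup handles both but only for words it already knows.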

In [26]:
# Stemming with NLTK
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_tokens = [stemmer.stem(t) for t in filtered_nltk[:30]]
list(zip(filtered_nltk[:30], stemmed_tokens))


[('president', 'presid'),
 ('harry', 'harri'),
 ('truman', 'truman'),
 ('address', 'address'),
 ('joint', 'joint'),
 ('session', 'session'),
 ('congress', 'congress'),
 ('april', 'april'),
 ('speaker', 'speaker'),
 ('president', 'presid'),
 ('members', 'member'),
 ('congress', 'congress'),
 ('heavy', 'heavi'),
 ('heart', 'heart'),
 ('stand', 'stand'),
 ('friends', 'friend'),
 ('colleagues', 'colleagu'),
 ('congress', 'congress'),
 ('united', 'unit'),
 ('states', 'state'),
 ('yesterday', 'yesterday'),
 ('laid', 'laid'),
 ('rest', 'rest'),
 ('mortal', 'mortal'),
 ('remains', 'remain'),
 ('beloved', 'belov'),
 ('president', 'presid'),
 ('franklin', 'franklin'),
 ('delano', 'delano'),
 ('roosevelt', 'roosevelt')]

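For example, the Porter stemmer maps “democracy” → “democraci”. In the output above, “president” becomes “presid” and “heavy” becomes “heavi”.

Stems are not necessarily real words.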

### spaCy does not include a built-in stemmer by default.

This is not a limitation. It is a design choice.

spaCy prioritizes:

- linguistically informed processing

- lemmatization over stemming

However, in real pipelines, you can still perform stemming alongside spaCy.

In [29]:
from nltk.stem import PorterStemmer
doc_spacy = nlp(raw_text)

stemmer = PorterStemmer()

# Use spaCy for tokenization, NLTK for stemming
stemmed_spacy_tokens = [
    stemmer.stem(token.text.lower())
    for token in doc_spacy
    if token.is_alpha and not token.is_stop
]

stemmed_spacy_tokens[:30]


['presid',
 'harri',
 'truman',
 'address',
 'joint',
 'session',
 'congress',
 'april',
 'speaker',
 'presid',
 'member',
 'congress',
 'heavi',
 'heart',
 'stand',
 'friend',
 'colleagu',
 'congress',
 'unit',
 'state',
 'yesterday',
 'laid',
 'rest',
 'mortal',
 'remain',
 'belov',
 'presid',
 'franklin',
 'delano',
 'roosevelt']

In [28]:
# Lemmatization with spaCy
# Always recreate the spaCy Doc explicitly
doc_spacy = nlp(raw_text)

lemmatized_tokens = [
    token.lemma_
    for token in doc_spacy
    if token.is_alpha and not token.is_stop
]

lemmatized_tokens[:30]


['PRESIDENT',
 'HARRY',
 'TRUMAN',
 'ADDRESS',
 'JOINT',
 'session',
 'CONGRESS',
 'April',
 'Speaker',
 'President',
 'Members',
 'Congress',
 'heavy',
 'heart',
 'stand',
 'friend',
 'colleague',
 'Congress',
 'United',
 'States',
 'yesterday',
 'lay',
 'rest',
 'mortal',
 'remain',
 'beloved',
 'President',
 'Franklin',
 'Delano',
 'Roosevelt']

NLTK does support lemmatization through its WordNetLemmatizer, but it needs part-of-speech information to work well.

By default, NLTK’s lemmatizer treats every word as a noun.

In [30]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download("wordnet")
nltk.download("omw-1.4")

lemmatizer = WordNetLemmatizer()

lemmatized_nltk = [
    lemmatizer.lemmatize(t)
    for t in filtered_nltk[:30]
]

list(zip(filtered_nltk[:30], lemmatized_nltk))


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...


[('president', 'president'),
 ('harry', 'harry'),
 ('truman', 'truman'),
 ('address', 'address'),
 ('joint', 'joint'),
 ('session', 'session'),
 ('congress', 'congress'),
 ('april', 'april'),
 ('speaker', 'speaker'),
 ('president', 'president'),
 ('members', 'member'),
 ('congress', 'congress'),
 ('heavy', 'heavy'),
 ('heart', 'heart'),
 ('stand', 'stand'),
 ('friends', 'friend'),
 ('colleagues', 'colleague'),
 ('congress', 'congress'),
 ('united', 'united'),
 ('states', 'state'),
 ('yesterday', 'yesterday'),
 ('laid', 'laid'),
 ('rest', 'rest'),
 ('mortal', 'mortal'),
 ('remains', 'remains'),
 ('beloved', 'beloved'),
 ('president', 'president'),
 ('franklin', 'franklin'),
 ('delano', 'delano'),
 ('roosevelt', 'roosevelt')]

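Notice that verbs are not lemmatized: “laid” stays “laid” and “remains” stays “remains”, whereas spaCy returned “lay” and “remain”.
This is because NLTK’s lemmatizer assumes the noun part of speech unless a `pos` argument is passed.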

In [31]:
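# Columns: normalized token, Porter stem, WordNet lemma (noun default).
# The lists come from different pipelines; they line up here only because
# both pipelines produce the same first 20 tokens.
comparison = list(zip(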
    filtered_spacy[:20],
    stemmed_spacy_tokens[:20],
    lemmatized_nltk[:20]
))

comparison


[('president', 'presid', 'president'),
 ('harry', 'harri', 'harry'),
 ('truman', 'truman', 'truman'),
 ('address', 'address', 'address'),
 ('joint', 'joint', 'joint'),
 ('session', 'session', 'session'),
 ('congress', 'congress', 'congress'),
 ('april', 'april', 'april'),
 ('speaker', 'speaker', 'speaker'),
 ('president', 'presid', 'president'),
 ('members', 'member', 'member'),
 ('congress', 'congress', 'congress'),
 ('heavy', 'heavi', 'heavy'),
 ('heart', 'heart', 'heart'),
 ('stand', 'stand', 'stand'),
 ('friends', 'friend', 'friend'),
 ('colleagues', 'colleagu', 'colleague'),
 ('congress', 'congress', 'congress'),
 ('united', 'unit', 'united'),
 ('states', 'state', 'state')]

# From Words to Subwords: Byte Pair Encoding (BPE)

So far, we have treated words as the basic unit of meaning.
Modern NLP systems often go one step further and operate on subword units.

One of the most common subword tokenization methods is Byte Pair Encoding (BPE).

In large corpora like the State of the Union addresses, word-level tokenization creates several problems:

- Rare words occur only a handful of times
- New words appear over time (e.g. cybersecurity, biotechnology)
- Related words are treated as completely separate tokens

Subword tokenization mitigates these problems by breaking words into frequently occurring pieces.
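
A short sketch of the word-level problem, using a hypothetical vocabulary (`word_vocab` is invented for illustration): unseen words are simply unknown, even when they share most of their characters with known ones.

```python
import os

# Hypothetical word-level vocabulary built from earlier speeches.
word_vocab = {"democracy", "democratic", "economy", "economic"}

new_words = ["democratization", "economics", "cybersecurity"]
unknown = [w for w in new_words if w not in word_vocab]
print(unknown)  # ['democratization', 'economics', 'cybersecurity']

# os.path.commonprefix compares strings character by character,
# so it also works on plain words.
democ_prefix = os.path.commonprefix(["democracy", "democratic", "democratization"])
econ_prefix = os.path.commonprefix(["economy", "economic", "economics"])
print(democ_prefix)  # democra
print(econ_prefix)   # econom
```

A subword vocabulary can reuse pieces like these instead of treating every surface form as unrelated.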

In [33]:
# We will use a small subset of real policy-related words that appear in State of the Union speeches.

words = [
    "democracy",
    "democratic",
    "democratization",
    "economy",
    "economic",
    "economics"
]

words
# At the word level, all of these are treated as separate tokens.

['democracy',
 'democratic',
 'democratization',
 'economy',
 'economic',
 'economics']

## Step 1 – Character-Level Representation

BPE starts by representing each word as a sequence of characters
(with a special end-of-word marker).

In [34]:
char_tokens = [list(word) + ["</w>"] for word in words]
char_tokens


[['d', 'e', 'm', 'o', 'c', 'r', 'a', 'c', 'y', '</w>'],
 ['d', 'e', 'm', 'o', 'c', 'r', 'a', 't', 'i', 'c', '</w>'],
 ['d',
  'e',
  'm',
  'o',
  'c',
  'r',
  'a',
  't',
  'i',
  'z',
  'a',
  't',
  'i',
  'o',
  'n',
  '</w>'],
 ['e', 'c', 'o', 'n', 'o', 'm', 'y', '</w>'],
 ['e', 'c', 'o', 'n', 'o', 'm', 'i', 'c', '</w>'],
 ['e', 'c', 'o', 'n', 'o', 'm', 'i', 'c', 's', '</w>']]

## Step 2 – Count Frequent Character Pairs

BPE repeatedly merges the most frequent pair of adjacent symbols across the corpus. On the first pass, every symbol is a single character.

In [35]:
from collections import Counter

pair_counts = Counter()

for word in char_tokens:
    for i in range(len(word) - 1):
        pair = (word[i], word[i+1])
        pair_counts[pair] += 1

pair_counts.most_common(10)


[(('o', 'n'), 4),
 (('d', 'e'), 3),
 (('e', 'm'), 3),
 (('m', 'o'), 3),
 (('o', 'c'), 3),
 (('c', 'r'), 3),
 (('r', 'a'), 3),
 (('a', 't'), 3),
 (('t', 'i'), 3),
 (('i', 'c'), 3)]

## Step 3 – Merge Frequent Pairs (Conceptual)

The most frequent pair is ('o', 'n').
BPE merges it into a new token: "on".

This process repeats many times, gradually forming meaningful subwords.
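
The count-and-merge loop can be sketched in a few lines. This is a simplified illustration under stated assumptions: each word is counted once (real BPE training weights pairs by word frequency), ties are broken by first occurrence, and `num_merges = 5` is an arbitrary choice. The helper names `count_pairs` and `merge_pair` are made up for this sketch.

```python
from collections import Counter


def count_pairs(corpus):
    """Count adjacent symbol pairs across all words."""
    counts = Counter()
    for symbols in corpus:
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += 1
    return counts


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


words = ["democracy", "democratic", "democratization",
         "economy", "economic", "economics"]
corpus = [list(w) + ["</w>"] for w in words]

num_merges = 5  # assumed; real tokenizers learn thousands of merges
merges = []
for step in range(num_merges):
    best_pair, best_count = count_pairs(corpus).most_common(1)[0]
    merges.append(best_pair)
    corpus = [merge_pair(symbols, best_pair) for symbols in corpus]

print(merges[0])  # ('o', 'n')
for symbols in corpus:
    print(symbols)
```

The ordered list of merges is what a trained BPE tokenizer stores; applying the same merges in the same order to a new word segments it into known subwords.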

In [36]:
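# Hand-written illustration of a plausible segmentation,
# not the result of actually running the merges.
bpe_tokens_example = [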
    ["democr", "acy</w>"],
    ["democr", "atic</w>"],
    ["democr", "atization</w>"],
    ["econ", "omy</w>"],
    ["econ", "omic</w>"],
    ["econ", "omics</w>"]
]

bpe_tokens_example


[['democr', 'acy</w>'],
 ['democr', 'atic</w>'],
 ['democr', 'atization</w>'],
 ['econ', 'omy</w>'],
 ['econ', 'omic</w>'],
 ['econ', 'omics</w>']]