# Obtaining and Exploring a Real-World Corpus

In industry, you almost never start with a perfectly curated list of documents.
You start by:

- locating an existing corpus

- understanding its scope and structure

- checking what it contains and what it does not

- inspecting it before deciding how to preprocess it

Two sources for the State of the Union addresses:

- https://stateoftheunion.onetwothree.net/

- https://www.kaggle.com/datasets/nltkdata/state-union-corpus

## Loading a Real Corpus (State of the Union)

NLTK ships a State of the Union corpus covering addresses from 1945 to 2006. The corpus data is a separate download: run `nltk.download('state_union')` once if it is not already installed.

In [15]:
# Import libraries
import nltk
from nltk.corpus import state_union
# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

- Each file corresponds to one speech, typically named by year and president.
- Each file is a document.
- The full collection is a corpus.
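
The naming convention can double as metadata. The sketch below assumes file IDs follow the `YEAR-Speaker.txt` pattern seen in `1945-Truman.txt`. `SpeechInfo` and `parse_fileid` are hypothetical helpers, and the sample list is illustrative rather than a listing of the real corpus.

```python
from typing import NamedTuple


class SpeechInfo(NamedTuple):
    year: int
    speaker: str


def parse_fileid(fileid: str) -> SpeechInfo:
    """Split a file ID such as '1945-Truman.txt' into year and speaker.

    Assumes the 'YEAR-Speaker.txt' naming pattern; anything after the
    first hyphen is kept as the speaker label.
    """
    stem = fileid.rsplit(".", 1)[0]      # drop the '.txt' extension
    year, speaker = stem.split("-", 1)   # split on the first hyphen only
    return SpeechInfo(int(year), speaker)


# Illustrative sample; with the corpus installed, use state_union.fileids().
sample_ids = ["1945-Truman.txt", "1946-Truman.txt"]
for fileid in sample_ids:
    print(parse_fileid(fileid))
```

With NLTK and the corpus data available, passing `state_union.fileids()` in place of `sample_ids` would yield one record per speech.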

## Understanding Corpus Scope

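- The NLTK version of this corpus covers 1945 to 2006, roughly six decades of political language; longer archives of the addresses reach back to the first one in 1790.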
- This makes it suitable for temporal, historical, and policy trend analysis.
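
Scope is worth checking rather than assuming. A minimal sketch, assuming every file ID starts with a four-digit year: the hypothetical `coverage` helper reports the first and last year and the number of speeches per decade. The sample IDs are placeholders, not the real file list.

```python
from collections import Counter


def coverage(fileids):
    """Return (first_year, last_year, speeches_per_decade) for a list of IDs.

    Assumes each ID starts with a four-digit year, as in '1945-Truman.txt'.
    """
    years = [int(fileid[:4]) for fileid in fileids]
    per_decade = Counter((year // 10) * 10 for year in years)
    return min(years), max(years), dict(sorted(per_decade.items()))


# Hypothetical sample IDs; with the corpus installed, use state_union.fileids().
sample_ids = ["1945-Truman.txt", "1946-Truman.txt", "1950-Truman.txt"]
first, last, per_decade = coverage(sample_ids)
print(f"{first} to {last}: {per_decade}")
```

Gaps or thin decades in the per-decade counts are an early warning that trend analyses over those periods will rest on little data.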

## Inspecting Raw Text (No Processing Yet)

Before tokenization, normalization, or cleaning, we inspect the raw text with a few questions in mind:

- Is this clean or messy text?

- Do you see formatting artifacts?

- Would you need domain-specific cleaning?
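
Some of these questions can be answered with a few crude counts. The sketch below is one possible set of signals, not a standard checklist: `raw_text_report` is a hypothetical helper, and the sample excerpt uses assumed line breaks rather than the file's actual layout.

```python
import re


def raw_text_report(text: str) -> dict:
    """Collect a few simple signals about how clean a raw text is."""
    lines = text.splitlines()
    return {
        # Characters outside ASCII often reveal encoding problems.
        "non_ascii_chars": sum(1 for ch in text if ord(ch) > 127),
        # Runs of two or more spaces or tabs suggest layout residue.
        "whitespace_runs": len(re.findall(r"[ \t]{2,}", text)),
        # Lines written entirely in capitals are usually titles or headers.
        "all_caps_lines": sum(1 for line in lines if line.strip() and line.isupper()),
        "blank_lines": sum(1 for line in lines if not line.strip()),
    }


# Hypothetical excerpt; the line breaks here are assumed, not taken from the file.
sample = (
    "PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n"
    "April 16, 1945\n"
    "\n"
    "Mr. Speaker, Mr. President, Members of the Congress:\n"
)
print(raw_text_report(sample))
```

With the corpus installed, the same function could be applied to `state_union.raw('1945-Truman.txt')` to inspect the untokenized file.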

In [8]:
# Inspect the first 500 tokens of the 1945 Truman address.
# words() returns tokens, so punctuation marks appear as separate items.
truman_1945_words = state_union.words('1945-Truman.txt')[:500]
# truman_1945_text = ' '.join(truman_1945_words)  # Join tokens into a single string

In [9]:
truman_1945_words

['PRESIDENT',
 'HARRY',
 'S',
 '.',
 'TRUMAN',
 "'",
 'S',
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION',
 'OF',
 'THE',
 'CONGRESS',
 'April',
 '16',
 ',',
 '1945',
 'Mr',
 '.',
 'Speaker',
 ',',
 'Mr',
 '.',
 'President',
 ',',
 'Members',
 'of',
 'the',
 'Congress',
 ':',
 'It',
 'is',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'I',
 'stand',
 'before',
 'you',
 ',',
 'my',
 'friends',
 'and',
 'colleagues',
 ',',
 'in',
 'the',
 'Congress',
 'of',
 'the',
 'United',
 'States',
 '.',
 'Only',
 'yesterday',
 ',',
 'we',
 'laid',
 'to',
 'rest',
 'the',
 'mortal',
 'remains',
 'of',
 'our',
 'beloved',
 'President',
 ',',
 'Franklin',
 'Delano',
 'Roosevelt',
 '.',
 'At',
 'a',
 'time',
 'like',
 'this',
 ',',
 'words',
 'are',
 'inadequate',
 '.',
 'The',
 'most',
 'eloquent',
 'tribute',
 'would',
 'be',
 'a',
 'reverent',
 'silence',
 '.',
 'Yet',
 ',',
 'in',
 'this',
 'decisive',
 'hour',
 ',',
 'when',
 'world',
 'events',
 'are',
 'moving',
 'so',
 'rapidly',
 ',',
 'our'

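## Words vs Sentences vs Characters?

The same document can be viewed as a string of characters, a list of word tokens, or a list of sentences. At this stage, we only explore that structure; no cleaning is applied.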
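
NLTK corpus readers expose these three views through `raw()`, `words()`, and `sents()`. The sketch below counts each view for one document. `describe_document` is a hypothetical helper, and `TinyReader` is an in-memory stand-in (not part of NLTK) so the example runs without downloaded data.

```python
def describe_document(reader, fileid):
    """Count characters, tokens, and sentences for one document.

    `reader` is any object with NLTK-style raw(), words(), and sents() methods.
    """
    return {
        "characters": len(reader.raw(fileid)),
        "tokens": len(reader.words(fileid)),
        "sentences": len(reader.sents(fileid)),
    }


class TinyReader:
    """Hypothetical in-memory stand-in so the sketch runs without corpus data."""

    def raw(self, fileid):
        return "Words are inadequate. Silence is eloquent."

    def words(self, fileid):
        return ["Words", "are", "inadequate", ".", "Silence", "is", "eloquent", "."]

    def sents(self, fileid):
        return [
            ["Words", "are", "inadequate", "."],
            ["Silence", "is", "eloquent", "."],
        ]


print(describe_document(TinyReader(), "demo.txt"))
```

With the corpus installed, `describe_document(state_union, '1945-Truman.txt')` should work the same way, since `state_union` provides all three methods.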

## Quick Corpus-Wide Exploration
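
A first corpus-wide pass can be as simple as counting tokens per document and the size of the vocabulary. The sketch below assumes the corpus is held as a mapping from file ID to token list; `corpus_summary` and the toy documents are hypothetical and stand in for the real data.

```python
def corpus_summary(docs):
    """Summarise a corpus given as {file ID: list of tokens}."""
    lengths = {fileid: len(tokens) for fileid, tokens in docs.items()}
    # Lowercased alphabetic tokens only, so punctuation does not inflate the count.
    vocabulary = {
        token.lower()
        for tokens in docs.values()
        for token in tokens
        if token.isalpha()
    }
    return {
        "documents": len(docs),
        "total_tokens": sum(lengths.values()),
        "vocabulary_size": len(vocabulary),
        "longest": max(lengths, key=lengths.get),
        "shortest": min(lengths, key=lengths.get),
    }


# Hypothetical toy corpus; with the corpus installed, build the mapping with
# {f: state_union.words(f) for f in state_union.fileids()}.
toy_docs = {
    "doc-a.txt": ["The", "Congress", "of", "the", "United", "States", "."],
    "doc-b.txt": ["Mr", ".", "Speaker", ","],
}
print(corpus_summary(toy_docs))
```

Per-document lengths computed this way are also a natural input for a plot of speech length over time.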

In [29]:
!pip install matplotlib


In the next section, we will treat each State of the Union address as a document and the full collection as a corpus. We will begin designing preprocessing pipelines using both NLTK and spaCy, and we will compare how each library handles the same steps.