
# Exercise 1: Text Preprocessing & Token Analysis

**Topics:** normalization, tokenization, stopwords, punctuation, stemming, lemmatization   

---

**Name:** Your Name  \
**Released on:** Jan 29, 2026  \
**Submission date:** Feb 6, 2026 (11:59 PM)  \
**AI assistance used for this exercise? (Yes / No):**  \
**AI Usage Disclosure Form submitted? (Yes / No):**

---

### Instructions for Submission

If you use Google Colab:

1. Change the access settings of your notebook
    - Click Share in the top-right corner of the notebook
    - Under General access, change the setting from "Restricted" to "University of Maine System"
    - Set the "Role" to "Editor"

2. Download the notebook in `.ipynb` format

3. Submit the following items
    - The Google Colab notebook URL with proper access permissions
    - The notebook file (.ipynb)
    - The AI Usage Disclosure Form (mandatory if any AI tools were used)


**In this exercise, we will work with a larger corpus. We are going to use the book Pride and Prejudice by Jane Austen from Project Gutenberg.**

## Reading the Corpus

In [None]:
import urllib.request   # read data from a URL (the request submodule must be imported explicitly)
import copy     # create a deep copy of an object

# loads and reads the book "Pride and Prejudice" by Jane Austen
doc = urllib.request.urlopen('https://www.gutenberg.org/files/1342/1342-0.txt')
text_original = doc.read().decode('utf8')

# create a deep copy of the full text
# This allows us to reuse the original text if we want to "restart" reading
text = copy.deepcopy(text_original)

print(text)


*** START OF THE PROJECT GUTENBERG EBOOK 1342 ***




                            [Illustration:

                             GEORGE ALLEN
                               PUBLISHER

                        156 CHARING CROSS ROAD
                                LONDON

                             RUSKIN HOUSE
                                   ]

                            [Illustration:

               _Reading Jane’s Letters._      _Chap 34._
                                   ]




                                PRIDE.
                                  and
                               PREJUDICE

                                  by
                             Jane Austen,

                           with a Preface by
                           George Saintsbury
                                  and
                           Illustrations by
                             Hugh Thomson

                         [Illustration: 1894]

         

## Corpus Cleaning and Normalization

Now we will discard some parts from the beginning and end of the document. Our goal is to keep the text from the first chapter to the end of the last chapter. Everything else before and after that is discarded.

In [None]:
import re

text = text.lower()

# START: keep text from chapter i
start_pattern = re.search(r'^chapter i\b', text, re.MULTILINE)
if start_pattern:
    text = text[start_pattern.start():]
else:
    print('Start pattern not found!')

# END: discard illustration + everything after
end_pattern = re.search(r'\[illustration:\s*the\s*end\s*\]', text)

if end_pattern:
    text = text[:end_pattern.start()]
else:
    print('End pattern not found!')

print(text)

chapter i.]


it is a truth universally acknowledged, that a single man in possession
of a good fortune must be in want of a wife.

however little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered as the rightful
property of some one or other of their daughters.

“my dear mr. bennet,” said his lady to him one day, “have you heard that
netherfield park is let at last?”

mr. bennet replied that he had not.

“but it is,” returned she; “for mrs. long has just been here, and she
told me all about it.”

mr. bennet made no answer.

“do not you want to know who has taken it?” cried his wife, impatiently.

“_you_ want to tell me, and i have no objection to hearing it.”

[illustration:

“he came down to see the place”

[_copyright 1894 by george allen._]]

this was invitation enough.

“why, my dear, you must know, mrs. long says that n


## **Task 1:** More Cleaning and Normalization

In [None]:
# step 1: remove illustration blocks that start with [Illustration: and
# continue until the fixed copyright line [_Copyright 1894 by George Allen._]]
# REPLACE THIS COMMENT WITH YOUR CODE
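# DOTALL lets ".*?" span line breaks; the literal "." before "_" is escaped
illustration_block = re.compile(
    r"\[illustration:.*?\[_copyright 1894 by george allen\._\]\]",
    flags=re.DOTALL | re.IGNORECASE)

text = illustration_block.sub("", text)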

# step 2: remove standalone [Illustration] tags from the text
# REPLACE THIS COMMENT WITH YOUR CODE
text = re.sub(r"\s*\[Illustration]\s*","",text,
              flags=re.IGNORECASE)

# step 3: remove chapter headings written with Roman numerals
# (e.g., chapter lxi., chapter iv.)
# REPLACE THIS COMMENT WITH YOUR CODE
text = re.sub(r"\s*\bchapter\s+[ivxlcdm]+\s*\.\s*", "", text,
              flags=re.IGNORECASE)

# step 4: remove underscore-based emphasis by converting _word_ to word
# REPLACE THIS COMMENT WITH YOUR CODE
text = re.sub(r"_(\w+)_",r"\1", text)

# step 5: remove editorial markers such as /* and */
# REPLACE THIS COMMENT WITH YOUR CODE
text = re.sub(r"/\*|\*/", "",text)

# step 6: normalize whitespace by replacing multiple spaces, tabs, and
# newlines with a single space
# REPLACE THIS COMMENT WITH YOUR CODE
text = re.sub(r"\s+", " ", text).strip()

# step 7: remove any leading closing bracket (]) from the beginning of the text
# REPLACE THIS COMMENT WITH YOUR CODE
text = re.sub(r"^\]+", "", text)

# step 8: trim leading and trailing whitespace from the final text
# REPLACE THIS COMMENT WITH YOUR CODE
text = text.strip()

print(text)

it is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters. “my dear mr. bennet,” said his lady to him one day, “have you heard that netherfield park is let at last?” mr. bennet replied that he had not. “but it is,” returned she; “for mrs. long has just been here, and she told me all about it.” mr. bennet made no answer. “do not you want to know who has taken it?” cried his wife, impatiently. “you want to tell me, and i have no objection to hearing it.” this was invitation enough. “why, my dear, you must know, mrs. long says that netherfield is taken by a young man of large fortune from the north of england; that he came down on monday in a chaise and four to see the place,

In [None]:
# limit reading to the first 200,000 characters for experimentation
# REPLACE THIS COMMENT WITH YOUR CODE
text = text[:200_000]

print(text)

it is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters. “my dear mr. bennet,” said his lady to him one day, “have you heard that netherfield park is let at last?” mr. bennet replied that he had not. “but it is,” returned she; “for mrs. long has just been here, and she told me all about it.” mr. bennet made no answer. “do not you want to know who has taken it?” cried his wife, impatiently. “you want to tell me, and i have no objection to hearing it.” this was invitation enough. “why, my dear, you must know, mrs. long says that netherfield is taken by a young man of large fortune from the north of england; that he came down on monday in a chaise and four to see the place,

## Importing Libraries

In [None]:
import re
import string
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

import matplotlib.pyplot as plt

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## **Task 2:** Tokenization (using NLTK)

Tokenize the corpus (our corpus is now limited to 200,000 characters) and store the tokens in variable `tokens`. Print the total number of tokens in the corpus. This step establishes the baseline representation of the text.

In [None]:
# REPLACE THIS COMMENT WITH YOUR CODE
tokens = word_tokenize(text)
print(tokens)
print("Total number of tokens:", len(tokens))

['it', 'is', 'a', 'truth', 'universally', 'acknowledged', ',', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', '.', 'however', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood', ',', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families', ',', 'that', 'he', 'is', 'considered', 'as', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters', '.', '“', 'my', 'dear', 'mr.', 'bennet', ',', '”', 'said', 'his', 'lady', 'to', 'him', 'one', 'day', ',', '“', 'have', 'you', 'heard', 'that', 'netherfield', 'park', 'is', 'let', 'at', 'last', '?', '”', 'mr.', 'bennet', 'replied', 'that', 'he', 'had', 'not', '.', '“', 'but', 'it', 'is', ',', '”', 'returned', 'she', ';', '“', 'for', 'mrs.', 'long', 'has', 'just', 'been', 'here', ',', 'and', 

## **Task 3:** Token Frequency Analysis

Identify:
- the 100 most frequent tokens
- the 100 least frequent tokens

Store the 100 most frequent and the 100 least frequent tokens with their frequencies in dictionaries `most_common_100` and `least_common_100`, respectively, sorted by frequency (most frequent tokens appear first). Print the dictionaries. This helps reveal which words dominate the corpus and which occur rarely.

In [None]:
# REPLACE THIS COMMENT WITH YOUR CODE
from nltk.probability import FreqDist
fdist = FreqDist(tokens)
most_common_100 = dict(fdist.most_common(100))
print("100 Most Frequent Tokens:", most_common_100)

least_common_sorted = sorted(fdist.items(), key=lambda x: (x[1],x[0]))
least_common_100=dict(least_common_sorted[:100])
print("100 Least Frequent Tokens:", least_common_100)

100 Most Frequent Tokens: {',': 2678, 'the': 1312, 'to': 1184, '.': 1108, 'of': 1102, 'and': 1037, 'a': 629, 'her': 617, '“': 584, '”': 577, 'i': 562, 'in': 540, ';': 522, 'was': 486, 'not': 470, 'she': 458, 'that': 457, 'he': 432, 'his': 413, 'it': 403, 'you': 384, 'be': 341, 'with': 335, 'as': 333, 'had': 321, 'for': 304, 'but': 297, 'mr.': 296, 'is': 281, 'at': 239, '’': 224, 'him': 223, 'have': 211, 'by': 210, 's': 195, 'my': 192, 'on': 179, 'so': 165, 'were': 161, 'elizabeth': 160, 'very': 159, 'which': 153, 'darcy': 148, 'all': 144, 'could': 144, 'bingley': 142, '--': 139, 'no': 137, 'they': 136, '?': 134, 'their': 131, 'said': 131, 'from': 125, 'them': 124, 'been': 120, 'such': 117, 'bennet': 117, 'would': 116, 'much': 115, 'miss': 114, 'are': 113, 'what': 113, 'mrs.': 111, 'an': 110, 'your': 109, 'can': 107, 'will': 107, 'do': 105, 'this': 104, 'if': 101, 'there': 101, 'me': 96, 'jane': 95, 'am': 93, 'more': 93, 'when': 89, '!': 88, 'or': 87, 'than': 85, 'must': 82, 'should': 8

## **Task 4:** Token Length Analysis

Identify:
- the 20 longest tokens
- the 20 shortest tokens

Store the 20 longest and the 20 shortest tokens with their lengths in dictionaries `longest_20` and `shortest_20`, respectively, sorted by length and then alphabetically, with the longest tokens appearing first. Print the dictionaries. This step highlights the variation in token size.

In [None]:
# REPLACE THIS COMMENT WITH YOUR CODE
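unique_tokens = set(tokens)

# dictionary of token -> length, longest first, ties broken alphabetically
longest_20 = {t: len(t) for t in sorted(unique_tokens, key=lambda t: (-len(t), t))[:20]}
print("20 Longest Tokens:")
for token, length in longest_20.items():
    print(token, length)

# dictionary of token -> length, shortest first, ties broken alphabetically
shortest_20 = {t: len(t) for t in sorted(unique_tokens, key=lambda t: (len(t), t))[:20]}
print("\n20 Shortest Tokens:")
for token, length in shortest_20.items():
    print(token, length)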

20 Longest Tokens:
chosen.elizabeth 16
fellow-creatures 16
hard-heartedness 16
incomprehensible 16
self-complacency 16
self-consequence 16
self-gratulation 16
superciliousness 16
three-and-twenty 16
accomplishments 15
acknowledgments 15
congratulations 15
conscientiously 15
correspondents. 15
five-and-twenty 15
four-and-twenty 15
inconsistencies 15
married.chapter 15
one-and-twenty. 15
self-importance 15

20 Shortest Tokens:
! 1
( 1
) 1
, 1
. 1
: 1
; 1
? 1
a 1
i 1
m 1
o 1
s 1
t 1
‘ 1
’ 1
“ 1
” 1
-- 2
ah 2


## **Task 5:** Stopword Frequency Analysis (using NLTK)

Identify the stopwords from `tokens` (variable already created in Task 2). Store the stopwords with their frequencies in dictionary `stopword_tokens_freq`, sorted by frequency (most frequent stopwords appear first). Print the resulting dictionary.

Also report the total number of stopword tokens and the percentage they represent relative to the original token count. This step illustrates how much of the corpus consists of stopwords.

In [None]:
# REPLACE THIS COMMENT WITH YOUR CODE
stop_words = set(stopwords.words('english'))
stopword_tokens = [m for m in tokens if m in stop_words]
stopwords_counts = Counter(stopword_tokens)
stopword_tokens_freq = dict(stopwords_counts.most_common())
print("The stopword frequencies are:")
for word, freq in stopword_tokens_freq.items():
    print(word, freq)


total_stopword_tokens = len(stopword_tokens)
total_tokens = len(tokens)

# Calculating the percentage of stopword tokens relative to all tokens
stopword_percentage = (total_stopword_tokens / total_tokens) * 100
print("The total number of stopword tokens is:", total_stopword_tokens)
print("The total number of tokens is:", total_tokens)
# {:.2f} prints two decimal places; {:.2} would print two significant digits
print("The percentage of stopwords in the corpus: {:.2f}%".format(stopword_percentage))

The stopword frequencies are:
the 1312
to 1184
of 1102
and 1037
a 629
her 617
i 562
in 540
was 486
not 470
she 458
that 457
he 432
his 413
it 403
you 384
be 341
with 335
as 333
had 321
for 304
but 297
is 281
at 239
him 223
have 211
by 210
s 195
my 192
on 179
so 165
were 161
very 159
which 153
all 144
no 137
they 136
their 131
from 125
them 124
been 120
such 117
are 113
what 113
an 110
your 109
can 107
will 107
do 105
this 104
if 101
there 101
me 96
am 93
more 93
when 89
or 87
than 85
should 81
who 75
did 72
has 71
how 71
any 70
most 66
we 64
some 63
herself 63
other 60
being 60
only 58
own 57
before 54
after 52
too 49
now 40
then 40
about 39
himself 39
does 37
into 34
out 34
again 30
having 29
up 28
same 28
its 27
between 26
myself 25
our 24
whom 23
where 23
over 22
these 22
once 22
because 21
just 20
nor 20
those 20
while 18
each 18
few 16
down 14
here 13
themselves 13
off 13
both 13
during 12
yourself 12
why 11
under 9
against 8
through 8
yours 6
doing 5
further 5
ma 4
hers 4
o 3
won 3

## **Task 6:** Stopword Removal and Comparison

Store tokens without stopwords in a list `tokens_no_stop`. Then recompute the 100 most frequent and 100 least frequent tokens from `tokens_no_stop`.

Store the 100 most frequent and the 100 least frequent tokens with their frequencies in dictionaries `most_common_100_no_stop` and `least_common_100_no_stop`, respectively, sorted by frequency (most frequent tokens appear first). Print the resulting dictionaries.

Count and report:
- How many tokens are common between the frequency list with stopwords and the frequency list without stopwords.
- How many tokens are replaced from the frequency list with stopwords (i.e., appear only after stopword removal).
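
The comparison described above can be sketched on toy data. This is only an illustration, not the expected solution: `top_n`, `bottom_n`, and `compare_top` are hypothetical helper names, and the toy token list and two-word stopword set are assumptions standing in for `tokens` and the NLTK English stopword list.

```python
from collections import Counter

def top_n(tokens, n):
    """Return the n most frequent tokens as a dict, most frequent first."""
    return dict(Counter(tokens).most_common(n))

def bottom_n(tokens, n):
    """Return the n least frequent tokens as a dict, most frequent first."""
    # Pick the n rarest tokens (ties broken alphabetically) ...
    rarest = sorted(Counter(tokens).items(), key=lambda kv: (kv[1], kv[0]))[:n]
    # ... then order them so the most frequent of them comes first.
    return dict(sorted(rarest, key=lambda kv: (-kv[1], kv[0])))

def compare_top(before, after):
    """Count tokens shared by both dicts and tokens that appear only in `after`."""
    common = set(before) & set(after)
    new_only = set(after) - set(before)
    return len(common), len(new_only)

# Toy data (assumption): stands in for `tokens` and NLTK's stopword list.
toy_tokens = ["the"] * 5 + ["man"] * 4 + ["and"] * 3 + ["wife"] * 2 + [","]
toy_stop_words = {"the", "and"}

toy_no_stop = [t for t in toy_tokens if t not in toy_stop_words]

top_before = top_n(toy_tokens, 3)
top_after = top_n(toy_no_stop, 3)
n_common, n_replaced = compare_top(top_before, top_after)

print("Top 3 with stopwords:   ", top_before)
print("Top 3 without stopwords:", top_after)
print("Common:", n_common, "| Replaced:", n_replaced)
```

With a top-N list of fixed size, every token that drops out is replaced by exactly one new token, so the two counts should add up to N whenever both lists are full.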

In [None]:
# REPLACE THIS COMMENT WITH YOUR CODE

## **Task 7:** Punctuation Removal and Comparison

Store tokens without stopwords and punctuation in a list `tokens_no_stop_no_punc`. Then recompute the 100 most frequent and 100 least frequent tokens from `tokens_no_stop_no_punc`.

Store the 100 most frequent and the 100 least frequent tokens with their frequencies in dictionaries `most_common_100_no_stop_no_punc` and `least_common_100_no_stop_no_punc`, respectively, sorted by frequency (most frequent tokens appear first). Print the resulting dictionaries.

Count and report:
- How many tokens are common between the frequency list with stopwords and punctuation, and the frequency list without stopwords and punctuation.
- How many tokens are replaced from the frequency list with stopwords and punctuation (i.e., appear only after stopwords and punctuation removal).
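
One detail worth checking in this task is what counts as a punctuation token. The corpus contains curly quotes (`“`, `”`, `’`) and `--`, which a simple `token in string.punctuation` test would miss. The sketch below shows one possible definition, under the assumption that a token is punctuation only when every character in it is a punctuation mark; `is_punctuation` is a hypothetical helper and the toy list stands in for `tokens_no_stop`.

```python
import string
import unicodedata

def is_punctuation(token):
    """True when every character is ASCII punctuation or a Unicode punctuation mark."""
    return bool(token) and all(
        ch in string.punctuation or unicodedata.category(ch).startswith("P")
        for ch in token
    )

# Toy data (assumption): a few tokens of the kind seen in the tokenizer output.
toy_tokens_no_stop = ["“", "dear", "mr.", "bennet", ",", "”", "--", "netherfield", "?"]

toy_no_stop_no_punc = [t for t in toy_tokens_no_stop if not is_punctuation(t)]
print(toy_no_stop_no_punc)
```

Under this definition a token such as `mr.` is kept because it mixes letters and punctuation. Whether such tokens should be kept, stripped, or dropped is a design choice to state explicitly in the analysis.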

## **Task 8:** Stemming vs. Lemmatization Comparison

Apply stemming and lemmatization to the token list from which stopwords and punctuation have been removed, and store the results in two different variables. Then compare the results and identify cases where stemming produces an incorrect or non-word form. Find the percentage of tokens for which stemming produces incorrect or non-word forms.
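
One way to approximate "non-word form" is to flag a stem that does not appear in a reference vocabulary. The sketch below illustrates only that counting logic. The stems, lemmas, and vocabulary are hand-written placeholders (assumptions) standing in for the output of `PorterStemmer`, `WordNetLemmatizer`, and a real word list, and `non_word_stems` is a hypothetical helper name.

```python
def non_word_stems(tokens, stems, vocabulary):
    """Pair each token with its stem when the stem is absent from `vocabulary`."""
    return [(tok, stem) for tok, stem in zip(tokens, stems) if stem not in vocabulary]

# Placeholder data (assumption): in the notebook these would come from
# applying a stemmer and a lemmatizer to the cleaned token list.
toy_tokens = ["families", "daughters", "happily", "man", "universally"]
toy_stems = ["famili", "daughter", "happili", "man", "univers"]
toy_lemmas = ["family", "daughter", "happily", "man", "universally"]

# Reference vocabulary (assumption): here, the lemmas plus the surface forms.
toy_vocabulary = set(toy_lemmas) | set(toy_tokens)

bad = non_word_stems(toy_tokens, toy_stems, toy_vocabulary)
pct_bad = 100 * len(bad) / len(toy_tokens)

print(bad)
print(f"{pct_bad:.2f}% of tokens have a stem outside the vocabulary")
```

The result depends heavily on the vocabulary chosen, so the percentage is best reported together with a description of how "non-word" was decided.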

In [None]:
# REPLACE THIS COMMENT WITH YOUR CODE

## **Task 9:** Frequency Analysis of Lemmatized Tokens

From the lemmatized token list, identify the 100 most frequent tokens. Store the tokens with their frequencies in a dictionary, sorted by frequency (most frequent tokens appear first). Print the resulting dictionary.

Finally, compare these results with the frequency distribution of tokens without stopwords and punctuation.

What do you observe? Discuss how lemmatization impacts the 100 most frequent tokens of the corpus.
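
A small sketch of the effect to look for: when several surface forms share a lemma, their counts are pooled, so the lemma can rise in the ranking while the vocabulary shrinks. The lemma lookup below is a hypothetical hand-written mapping (an assumption) that stands in for a real lemmatizer, and the toy tokens are not taken from the corpus counts.

```python
from collections import Counter

# Hypothetical lemma lookup (assumption): stands in for a real lemmatizer.
toy_lemma_of = {"daughters": "daughter", "sisters": "sister", "feelings": "feeling"}

toy_tokens = ["daughter", "daughters", "sister", "sisters", "sisters", "feelings", "man"]
toy_lemmas = [toy_lemma_of.get(t, t) for t in toy_tokens]

freq_before = Counter(toy_tokens)
freq_after = Counter(toy_lemmas)

# For each lemma whose count increased, record (count before, count after).
# A Counter returns 0 for a missing key, so unseen lemmas start from 0.
merged = {
    lemma: (freq_before[lemma], count)
    for lemma, count in freq_after.items()
    if count > freq_before[lemma]
}

print("Distinct tokens before:", len(freq_before), "| after:", len(freq_after))
print("Lemmas that gained counts:", merged)
```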

In [None]:
# REPLACE THIS COMMENT WITH YOUR CODE

REPLACE THIS SENTENCE WITH YOUR OBSERVATION AND DISCUSSION.