Use this NLTK corpus:

Option A (recommended): Gutenberg - 'austen-emma.txt' or 'carroll-alice.txt'

Option B: Brown corpus (choose 1-2 categories, e.g., 'news' and 'romance')

Choose ONE option and state your choice at the top of your notebook/report.
You must complete all three parts below. Each part builds on the previous parts.


## **Part A - Text Preprocessing (50%)**

Goal: Prepare a clean token stream and justify your choices.
A1. Load the corpus
Load the raw text (or sentences) from your chosen NLTK corpus and print:
Total number of characters (if raw text) or total number of sentences (if sentence-based)
Total number of tokens BEFORE preprocessing


In [2]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab') # Added to resolve LookupError

from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter
import string

# Load raw text
raw_text = gutenberg.raw('carroll-alice.txt')

print("Total characters:", len(raw_text))

# Token count BEFORE preprocessing
tokens_before = word_tokenize(raw_text)
print("Total tokens BEFORE preprocessing:", len(tokens_before))

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Total characters: 144395
Total tokens BEFORE preprocessing: 33535


## **A2. Preprocess**

Create a preprocessing function that performs:

Lowercasing
Tokenization

Removal of punctuation tokens

Optional: stopword removal (if you choose to remove stopwords, explain why)

Optional: stemming OR lemmatization (choose one if you use it, explain why)


In [3]:
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())

    # Remove punctuation and non-alphabetic tokens
    tokens = [t for t in tokens if t.isalpha()]

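    # Lemmatize. WordNetLemmatizer treats every token as a noun unless a
    # part of speech is passed, so verb forms are mostly left unchanged
    # and "was" is reduced to "wa" (visible in the frequency list below).
    tokens = [lemmatizer.lemmatize(t) for t in tokens]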

    return tokens

tokens_after = preprocess(raw_text)

print("Total tokens AFTER preprocessing:", len(tokens_after))
print("Vocabulary size:", len(set(tokens_after)))

# Top 20 frequent tokens
freq = Counter(tokens_after)
print("\nTop 20 most frequent tokens:")
for word, count in freq.most_common(20):
    print(word, count)


Total tokens AFTER preprocessing: 25499
Vocabulary size: 2293

Top 20 most frequent tokens:
the 1616
a 887
and 810
to 720
it 597
she 545
i 542
of 499
said 462
alice 397
wa 367
in 359
you 359
that 284
her 248
at 209
on 191
had 185
with 179
all 178


Report these statistics AFTER preprocessing:

Total number of tokens

Vocabulary size (unique tokens)

Top 20 most frequent tokens (with counts)

## **A3. Reflection**

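Preprocessing determines how sparse the data is and, in turn, how well downstream models perform. Lowercasing merges forms such as "Alice" and "alice", which keeps the vocabulary from growing needlessly. Removing punctuation tokens drops entries that carry little lexical meaning and would otherwise add noise to the vectors. Lemmatization maps inflected forms to a shared lemma (for example "rabbits" to "rabbit"), which makes counts more reliable for frequency-based models. Because WordNetLemmatizer treats every token as a noun unless a part of speech is supplied, verb forms such as "running" are left unchanged and "was" is reduced to "wa", as the top-20 list shows.
Stopwords are kept here because language models and embeddings rely on them for grammar and context. In Bag-of-Words representations, however, they dominate the counts and can mask the content words. The right amount of cleaning therefore depends on the task: remove noise, but keep the information the model needs.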
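
The effect of lowercasing and punctuation removal on vocabulary size can be illustrated on a toy sentence. This is only a sketch: the sentence is made up, and plain whitespace splitting stands in for `word_tokenize` so that the example runs without any NLTK data.

```python
import string

# Hypothetical toy sentence, not taken from the corpus
sample = "Alice ran. The Rabbit ran, and alice ran after the rabbit!"

raw_tokens = sample.split()                      # whitespace split (assumption)
lowered = [t.lower() for t in raw_tokens]        # merge case variants
stripped = [t.strip(string.punctuation) for t in lowered]
cleaned = [t for t in stripped if t.isalpha()]   # keep alphabetic tokens only

# Vocabulary size at each stage
print(len(set(raw_tokens)), len(set(lowered)), len(set(cleaned)))  # 11 9 6
```

Each step merges tokens that differ only in case or attached punctuation, so the same 11 running tokens are covered by a vocabulary of 6 instead of 11.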


## **Part B – Text Representation**

Goal: Compare Bag-of-Words and TF-IDF representations and interpret the results.
B1. Create documents

Split your corpus into documents. Choose ONE of the following strategies and justify it:

**Split into fixed-size chunks of 600 tokens.**
Justification: the novel is shorter than Emma, chunking ensures enough context per document, and it avoids extremely small documents (such as single sentences).


In [4]:
chunk_size = 600
documents = []

for i in range(0, len(tokens_after), chunk_size):
    chunk = tokens_after[i:i+chunk_size]
    documents.append(" ".join(chunk))

print("Number of documents:", len(documents))


Number of documents: 43
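
Since 25499 = 42 × 600 + 299, the last of the 43 documents holds only 299 tokens. If that short tail is a concern, one option is to fold it into the previous chunk. The helper below is a hypothetical sketch (the name `chunk_tokens` and the `min_tail` parameter are not part of the notebook) and is shown on toy tokens.

```python
def chunk_tokens(tokens, size, min_tail=0):
    """Split tokens into chunks of `size` tokens.

    If the final chunk has fewer than `min_tail` tokens, it is appended
    to the previous chunk instead of forming its own document.
    """
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_tail:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail
    return chunks

toy_tokens = [f"w{i}" for i in range(25)]
print([len(c) for c in chunk_tokens(toy_tokens, size=10)])              # [10, 10, 5]
print([len(c) for c in chunk_tokens(toy_tokens, size=10, min_tail=6)])  # [10, 15]
```

With the default `min_tail=0` the helper behaves like the loop above, so the document count would stay at 43.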


**B2. Vectorization**

In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag-of-Words
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(documents)

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print("BoW shape:", bow_matrix.shape)
print("TF-IDF shape:", tfidf_matrix.shape)


BoW shape: (43, 2283)
TF-IDF shape: (43, 2283)
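
The matrices have 2283 columns, ten fewer than the vocabulary size of 2293 reported in Part A. A likely explanation is that scikit-learn's vectorizers, by default, only keep tokens of two or more word characters, so single-letter tokens such as "a" and "i" are dropped. The sketch below applies that default pattern to a made-up sentence using only the `re` module; it has not been checked against the actual corpus vocabulary.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# two or more word characters between word boundaries.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

toy_doc = "alice saw a rabbit and i saw it too"  # hypothetical example
kept = DEFAULT_TOKEN_PATTERN.findall(toy_doc)
dropped = [t for t in toy_doc.split() if t not in kept]

print(kept)     # ['alice', 'saw', 'rabbit', 'and', 'saw', 'it', 'too']
print(dropped)  # ['a', 'i']
```

Passing a custom `token_pattern` to the vectorizer would be one way to keep single-letter tokens if they matter for the analysis.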


Top 15 TF-IDF Terms for 2 Documents

In [6]:
import numpy as np

feature_names = tfidf_vectorizer.get_feature_names_out()

def top_tfidf_terms(doc_index, top_n=15):
    row = tfidf_matrix[doc_index].toarray().flatten()
    top_indices = row.argsort()[-top_n:][::-1]
    return [(feature_names[i], row[i]) for i in top_indices]

print("Top TF-IDF terms for Document 0:")
print(top_tfidf_terms(0))

print("\nTop TF-IDF terms for Document 3:")
print(top_tfidf_terms(3))


Top TF-IDF terms for Document 0:
[('the', np.float64(0.3449720033229098)), ('to', np.float64(0.2973896580369912)), ('it', np.float64(0.2498073127510726)), ('she', np.float64(0.2498073127510726)), ('of', np.float64(0.202224967465154)), ('wa', np.float64(0.19032938114367434)), ('down', np.float64(0.18567486472003328)), ('and', np.float64(0.1784337948221947)), ('her', np.float64(0.1655631079451196)), ('picture', np.float64(0.12125986958676195)), ('very', np.float64(0.11997474266704955)), ('fell', np.float64(0.11329659945654286)), ('alice', np.float64(0.10952154108009704)), ('in', np.float64(0.10706027689331682)), ('jar', np.float64(0.09733069729752344))]

Top TF-IDF terms for Document 3:
[('to', np.float64(0.3239881019192489)), ('the', np.float64(0.29906594023315286)), ('she', np.float64(0.2741437785470568)), ('and', np.float64(0.22429945517486466)), ('it', np.float64(0.16199405095962446)), ('way', np.float64(0.15510220434605176)), ('cake', np.float64(0.1377788317574055)), ('wa', np.float

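Interpretation: Because stopwords were kept, function words such as "the", "to", "it" and "she" still top both lists; they occur so often within a chunk that their low IDF does not offset their high term frequency. The distinctive terms sit just below them: "down", "picture", "fell" and "jar" in Document 0, and "way" and "cake" in Document 3.
More generally, TF-IDF in Alice in Wonderland tends to highlight character names ("hatter", "queen", "rabbit") and scene-specific words ("tea", "trial", "garden"). These terms appear strongly in particular sections but not across the entire book, which makes them distinctive.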
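
To see why a word that occurs in every document can still rank first, it helps to compute the weights by hand. The sketch below is a pure-Python approximation of the default `TfidfVectorizer` weighting (raw counts, smoothed IDF, L2 normalisation); the function name `tfidf_rows`, the whitespace tokenisation and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return per-document TF-IDF weights and the IDF table."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so it never drops below 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})  # L2-normalise
    return rows, idf

toy_docs = [
    "the rabbit ran down the hole",
    "the queen shouted at the king",
    "the hatter poured the tea",
]
rows, idf = tfidf_rows(toy_docs)
print(idf["the"])                      # 1.0: present in every document
print(max(rows[0], key=rows[0].get))   # the
```

In the first toy document "the" occurs twice with IDF 1.0, giving a raw weight of 2.0, while "rabbit" occurs once with IDF of about 1.69. The smoothed IDF only down-weights common words; it does not remove them, which is consistent with function words leading the lists above.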

B3. Cosine Similarity

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tfidf_matrix)

# Remove self similarity
np.fill_diagonal(similarity_matrix, 0)

max_sim = similarity_matrix.max()
indices = np.where(similarity_matrix == max_sim)

print("Most similar pair of documents:", indices[0][0], "and", indices[1][0])
print("Similarity score:", max_sim)


Most similar pair of documents: 37 and 38
Similarity score: 0.7965779195927278


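Interpretation: Documents 37 and 38 are adjacent chunks, so their high similarity (about 0.80) most likely reflects one continuous scene whose characters and vocabulary carry across the chunk boundary. Highly similar chunks often correspond to continuous narrative scenes (e.g., the Mad Hatter tea party).
A more surprising similarity may occur between distant chunks that both focus heavily on the Queen or the courtroom scene.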
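
To look for such distant pairs, the ranking can be restricted to documents that are at least a few chunks apart. The helper below is a hypothetical sketch (the name `top_pairs` and the `min_gap` parameter are not part of the notebook); it is shown on a small hand-made matrix, and it should also accept `similarity_matrix`, since it only indexes `sim[i][j]`.

```python
def top_pairs(sim, k=3, min_gap=1):
    """Return the k highest-scoring (i, j, score) pairs with j - i >= min_gap."""
    n = len(sim)
    pairs = [
        (i, j, float(sim[i][j]))
        for i in range(n)
        for j in range(i + min_gap, n)   # upper triangle only, skipping near neighbours
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:k]

# Hypothetical 4-document similarity matrix
toy_sim = [
    [1.0, 0.8, 0.2, 0.6],
    [0.8, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.7],
    [0.6, 0.1, 0.7, 1.0],
]
print(top_pairs(toy_sim, k=1))             # [(0, 1, 0.8)]
print(top_pairs(toy_sim, k=1, min_gap=2))  # [(0, 3, 0.6)]
```

Raising `min_gap` filters out the adjacent pairs that tend to win simply because they split a single scene.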

## **Part C – Word Embeddings (25%)**

C1. Prepare Sentences

In [8]:
from nltk.tokenize import sent_tokenize

sentences_raw = sent_tokenize(raw_text)

sentences = [preprocess(sentence) for sentence in sentences_raw]


C2. Train Word2Vec

In [13]:
!pip install gensim
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=5,
    min_count=1,  # Changed from 3 to 1 to include less frequent words
    sg=1,      # Skip-gram
    epochs=15
)

print("Word2Vec vocabulary size:", len(model.wv))

Word2Vec vocabulary size: 2294


C3. Most Similar Words

In [14]:
target_words = ["alice", "queen", "rabbit", "hatter", "king"]

for word in target_words:
    print(f"\nTop 10 words similar to '{word}':")
    for sim_word, score in model.wv.most_similar(word, topn=10):
        print(sim_word, score)



Top 10 words similar to 'alice':
much 0.856599748134613
remark 0.854781985282898
curious 0.8514491319656372
rather 0.8483914732933044
certainly 0.8437233567237854
poor 0.841570258140564
very 0.8352744579315186
herself 0.8340606093406677
right 0.8322117924690247
never 0.8315980434417725

Top 10 words similar to 'queen':
verse 0.9257924556732178
turning 0.918398916721344
king 0.9160270690917969
executioner 0.9142700433731079
jury 0.9139493107795715
shouted 0.9110017418861389
heart 0.9057885408401489
shrill 0.9005929827690125
pointing 0.8996469378471375
teacup 0.8989439606666565

Top 10 words similar to 'rabbit':
white 0.9212437272071838
kid 0.9016737937927246
glove 0.8977483510971069
court 0.8786617517471313
low 0.8766154646873474
hurried 0.8696526885032654
fan 0.8694412112236023
pair 0.8631691932678223
blew 0.8622889518737793
took 0.8583149909973145

Top 10 words similar to 'hatter':
king 0.9272031188011169
angrily 0.9106067419052124
march 0.8996372222900391
duchess 0.898351788520813
g

C4. Analogies

In [15]:
print("king - man + woman ≈")
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"]))

print("\nqueen - woman + man ≈")
print(model.wv.most_similar(positive=["queen", "man"], negative=["woman"]))

print("\nalice - girl + boy ≈")
print(model.wv.most_similar(positive=["alice", "boy"], negative=["girl"]))


king - man + woman ≈
[('executioner', 0.9122945666313171), ('turning', 0.9034780859947205), ('jury', 0.8967708349227905), ('pointing', 0.8936290144920349), ('dormouse', 0.8914698362350464), ('eagerly', 0.8903188705444336), ('tea', 0.8832041621208191), ('knave', 0.8823738694190979), ('queen', 0.8801239728927612), ('hatter', 0.8761951923370361)]

queen - woman + man ≈
[('king', 0.9164907932281494), ('jury', 0.8530906438827515), ('hatter', 0.8434901833534241), ('turning', 0.841728687286377), ('verse', 0.8373295664787292), ('angrily', 0.8330967426300049), ('interrupted', 0.8316513299942017), ('gryphon', 0.8281545639038086), ('heart', 0.8279138803482056), ('place', 0.8225963115692139)]

alice - girl + boy ≈
[('hastily', 0.8559770584106445), ('eagerly', 0.8492262363433838), ('turning', 0.8454760909080505), ('executioner', 0.8416664004325867), ('tea', 0.8394157290458679), ('indignantly', 0.8390999436378479), ('itself', 0.8331753015518188), ('decidedly', 0.8312958478927612), ('turned', 0.82564

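Interpretation: The analogies mostly fail. "king - man + woman" ranks "executioner" first and "queen" only ninth, and "alice - girl + boy" returns mostly adverbs and verb forms such as "hastily" and "turning". "queen - woman + man" does return "king" first, but "king" is already among the nearest neighbours of "queen" (about 0.92), so this probably reflects co-occurrence in the same scenes rather than a learned gender relation. The corpus is relatively small (about 25,500 tokens after preprocessing) and has limited vocabulary diversity, so there is too little evidence to learn such relations. The uniformly high similarity scores (roughly 0.82 to 0.92) also suggest that the vectors are not well separated.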
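
For reference, the arithmetic behind an analogy query can be sketched with hand-made vectors. The example below is a simplified stand-in for `most_similar(positive=..., negative=...)`: it adds the positive vectors, subtracts the negative ones, and returns the closest remaining word by cosine similarity. The two-dimensional vectors are invented so that the analogy works, and details of gensim's implementation, such as normalising the inputs, are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding query words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "tea": [-1.0, 0.2],
}
print(analogy(toy_vectors, positive=["king", "woman"], negative=["man"]))   # queen
print(analogy(toy_vectors, positive=["queen", "man"], negative=["woman"]))  # king
```

The analogy only works here because the toy vectors encode gender along a clean axis. A model trained on a single short novel has little chance of learning such an axis, which matches the results above.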