# 7. A) Implement Non-Negative Matrix Factorization (NMF) for topic modeling and evaluate it using the reconstruction error.
# B) Use WordNet to demonstrate Word Sense Disambiguation.


A) Non-Negative Matrix Factorization (NMF) for Topic Modeling
Goal: Apply NMF to a collection of documents (here, the 20 Newsgroups corpus), extract topics, and evaluate the factorization using its reconstruction error.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import numpy as np

# Step 1: Load data
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), subset='all')
documents = newsgroups.data

# Step 2: Vectorize using TF-IDF
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = vectorizer.fit_transform(documents)

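# Step 3: Apply NMF, which approximates tfidf as W @ H with non-negative factors
num_topics = 10
nmf_model = NMF(n_components=num_topics, random_state=42)
W = nmf_model.fit_transform(tfidf)  # document-topic weights, shape (n_documents, num_topics)
H = nmf_model.components_           # topic-term weights, shape (num_topics, n_terms)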

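# Step 4: Print the 10 highest-weighted terms of each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    print(f"Topic #{topic_idx + 1}:")
    # argsort() sorts ascending, so [:-11:-1] walks back from the end to get the top 10 indices
    print(" ".join(feature_names[i] for i in topic.argsort()[:-11:-1]))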

# Step 5: Reconstruction error (Frobenius norm of tfidf - W @ H; lower means a closer fit)
print(f"\nReconstruction Error: {nmf_model.reconstruction_err_}")


Topic #1:
don just like think know good ve time really want
Topic #2:
windows dos file program files window use using run running
Topic #3:
god jesus bible believe christ faith christian christians sin church
Topic #4:
drive scsi ide disk card controller hard drives bus floppy
Topic #5:
key chip encryption clipper keys government escrow use algorithm phone
Topic #6:
thanks does know mail advance hi info looking information help
Topic #7:
00 new 10 sale car price 50 20 shipping offer
Topic #8:
game games team year hockey baseball season players play espn
Topic #9:
edu geb dsl cadre n3jxp chastity pitt skepticism intellect shameful
Topic #10:
people government israel armenian jews armenians gun state did children

Reconstruction Error: 133.310948282511
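
The raw reconstruction error is hard to judge without knowing the scale of the input matrix. The sketch below shows, on a small synthetic matrix, how the Frobenius-norm error is computed and how dividing by the norm of the input gives a relative error between 0 (perfect fit) and roughly 1 (no better than predicting zeros). The function name `reconstruction_errors` and the toy matrices are illustrative assumptions, not part of the notebook above.

```python
import numpy as np

def reconstruction_errors(X, W, H):
    """Return (absolute, relative) Frobenius reconstruction error of X ≈ W @ H.

    Hypothetical helper for illustration; expects dense NumPy arrays.
    """
    residual = X - W @ H
    absolute = np.linalg.norm(residual, "fro")
    relative = absolute / np.linalg.norm(X, "fro")
    return absolute, relative

# Toy data: X is built as an exact product of two non-negative factors.
rng = np.random.default_rng(0)
W_true = rng.random((6, 2))
H_true = rng.random((2, 5))
X = W_true @ H_true

# With the true factors the error is zero (up to floating-point rounding).
abs_exact, rel_exact = reconstruction_errors(X, W_true, H_true)

# Perturbing one factor makes the approximation worse and the error positive.
abs_noisy, rel_noisy = reconstruction_errors(X, W_true, H_true + 0.1)

print(f"exact factors:     absolute={abs_exact:.4f}, relative={rel_exact:.4f}")
print(f"perturbed factors: absolute={abs_noisy:.4f}, relative={rel_noisy:.4f}")
```

For the sparse TF-IDF matrix used above, forming `tfidf - W @ H` explicitly would create a large dense matrix. One way to get a relative figure instead, assuming SciPy is available, is to divide `nmf_model.reconstruction_err_` by `scipy.sparse.linalg.norm(tfidf)`.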


B) Word Sense Disambiguation using WordNet (Lesk Algorithm)
Use NLTK's implementation of the Lesk algorithm to demonstrate Word Sense Disambiguation.

In [3]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

# Download required corpora
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('punkt_tab')

# Example: disambiguate the word "bank" in two different contexts
sentence1 = "He deposited the money in the bank"
sentence2 = "The river overflowed the bank"

for sentence in [sentence1, sentence2]:
    context = word_tokenize(sentence)
    sense = lesk(context, 'bank')
    print(f"\nSentence: {sentence}")
    print(f"Predicted Sense: {sense}")
    print(f"Definition: {sense.definition() if sense else 'No sense found'}")

Sentence: He deposited the money in the bank
Predicted Sense: Synset('savings_bank.n.02')
Definition: a container (usually with a slot in the top) for keeping money at home

Sentence: The river overflowed the bank
Predicted Sense: Synset('savings_bank.n.02')
Definition: a container (usually with a slot in the top) for keeping money at home
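
In the output above, both sentences receive the same sense, so the two contexts were not told apart. Lesk-style methods pick the sense whose dictionary gloss shares the most words with the context, which means the result depends heavily on how the overlap is counted. The following is a minimal, dependency-free sketch of that overlap idea. It is not NLTK's implementation: the two-sense inventory `SENSES`, its paraphrased glosses, the `STOPWORDS` set, and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical mini sense inventory. The glosses are paraphrased for
# illustration and are NOT the exact WordNet definitions.
SENSES = {
    "bank (financial institution)":
        "a financial institution that accepts deposits of money and lends it",
    "bank (river side)":
        "sloping land beside a body of water such as a river",
}

# Small illustrative stop-word list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "in", "that", "and", "it", "such", "as", "he", "to"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(sentence, senses):
    """Return the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    scores = {name: len(context & content_words(gloss))
              for name, gloss in senses.items()}
    best = max(scores, key=scores.get)  # on a tie, the first sense listed wins
    return best, scores

for s in ["He deposited the money in the bank", "The river overflowed the bank"]:
    best, scores = simplified_lesk(s, SENSES)
    print(f"{s!r} -> {best}  (overlaps: {scores})")
```

With these toy glosses, "money" and "river" are the only overlapping content words, so each sentence is matched to a different sense. Real glosses are short and overlaps are sparse, so options such as passing a part-of-speech restriction to `lesk`, removing stop words from the context, or lemmatizing tokens may change the result, though none of them is guaranteed to fix it.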
