# Document Clustering and Topic Modeling on 20 Newsgroups

This project applies K-means clustering and Latent Dirichlet Allocation (LDA) topic modeling to group similar documents and uncover the topics in a large text corpus.

## Objective

The objective of this project is to explore and analyze a large corpus of text data using unsupervised machine learning techniques.

We apply:
- **KMeans Clustering** to group similar documents based on their content.
- **Latent Dirichlet Allocation (LDA)** to identify underlying topics across the dataset.

This helps in understanding document similarities and discovering hidden thematic structures within the data.


## Dataset Description

The dataset used in this project is the **20 Newsgroups** dataset, which contains around 20,000 Usenet newsgroup documents across **20 categories**.

Each category represents a discussion group, such as:
- `comp.graphics`
- `sci.space`
- `rec.autos`
- `talk.religion.misc`
- ... and more.

The data is organized in folders (one per category), and each document is stored as a text file.

The dataset includes real-world features such as:
- Noisy and informal language
- Headers and quoted text
- Cross-posted articles


In [4]:
import os

def load_dataset(base_path='data'):
    """Read one text file per document from the per-category subfolders of base_path."""
    texts = []
    labels = []
    label_names = []

    for category in sorted(os.listdir(base_path)):
        category_path = os.path.join(base_path, category)
        if not os.path.isdir(category_path):
            continue  # skip stray files so label indices stay aligned with label_names
        label_index = len(label_names)
        label_names.append(category)
        for filename in os.listdir(category_path):
            file_path = os.path.join(category_path, filename)
            try:
                # latin1 maps every byte to a character, so decoding never fails
                with open(file_path, 'r', encoding='latin1') as f:
                    texts.append(f.read())
                    labels.append(label_index)
            except OSError as e:
                print(f"Error reading {file_path}: {e}")

    return texts, labels, label_names


## Preprocessing Steps

Before applying clustering or topic modeling, we clean and normalize the text data:

- Convert text to lowercase
- Remove punctuation and non-word characters
- Tokenize text into words
- Remove stopwords (common but unimportant words)
- Apply **lemmatization** to reduce words to their base form

This helps reduce noise and improves the quality of clustering and topic extraction.


In [5]:
texts, labels, label_names = load_dataset('data')

print(f"Loaded {len(texts)} documents across {len(label_names)} categories.")
print("Categories:", label_names)


Loaded 19997 documents across 20 categories.
Categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r'\W+', ' ', text.lower())  # Remove non-word characters and lowercase
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 2]
    return ' '.join(tokens)

preprocessed_texts = [preprocess(text) for text in texts]

print("Original:")
print(texts[0][:500])

print("\nPreprocessed:")
print(preprocessed_texts[0][:500])


Original:
Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew
From: mathew <mathew@mantis.co.uk>
Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
Subject: Alt.Atheism FAQ: Atheist Resources
Summary: Books, addresses, mu

Preprocessed:
xref cantaloupe srv cmu edu alt atheism 49960 alt atheism moderated 713 news answer 7054 alt answer 126 path cantaloupe srv cmu edu crabapple srv cmu edu bb3 andrew cmu edu news sei cmu edu ci ohio state edu magnus ac ohio state edu usenet in cwru edu agate spool edu uunet pipex ibmpcug mantis mathew mathew mathew mantis newsgroups alt atheism alt atheism moderated news answer alt answer subject alt atheism faq atheist resource summary book address music anything relate

## Clustering using KMeans

We vectorize the preprocessed text using **TF-IDF (Term Frequency–Inverse Document Frequency)** to get numerical representations of documents, keeping only the 1,000 most frequent terms (`max_features=1000`).
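
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand for a toy corpus. It assumes the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The function name `tfidf_rows` and the toy documents are made up for illustration:

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Compute L2-normalized TF-IDF weights for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # An empty document has no terms and therefore an empty row.
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Toy corpus: "space" appears in two documents, so it gets a lower weight
# than terms that appear in only one.
toy_docs = ["space shuttle launch", "space station orbit", "hockey team game"]
for row in tfidf_rows(toy_docs):
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```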

Using **KMeans Clustering** with `k=20` (for 20 newsgroups), we group similar documents together.

We also extract the **top terms per cluster**, which give an idea of what each cluster is about.

This helps in automatically grouping documents with similar content or writing style.
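
Because every document also carries its newsgroup label, cluster assignments can be sanity-checked against those labels. A minimal sketch of cluster purity (the share of documents that carry the majority label of their cluster), using a hypothetical `cluster_purity` helper and toy inputs rather than results from this notebook:

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Fraction of items whose label matches the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if len(cluster_ids) == 0:
        return 0.0

    # Collect the true labels that ended up in each cluster.
    labels_by_cluster = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, true_labels):
        labels_by_cluster[cluster_id].append(label)

    # For each cluster, count how many members share its most common label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in labels_by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Toy assignments, invented for illustration.
toy_clusters = [0, 0, 0, 1, 1, 2]
toy_labels = ["sci.space", "sci.space", "rec.autos", "rec.autos", "rec.autos", "sci.med"]
print(cluster_purity(toy_clusters, toy_labels))
```

With the variables in this notebook, the call would look like `cluster_purity(list(kmeans.labels_), labels)`. Purity tends to rise as the number of clusters grows, so it is best read alongside a chance-adjusted score such as scikit-learn's `normalized_mutual_info_score`.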


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(preprocessed_texts)

k = 20  # number of clusters (equal to 20 newsgroups)
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)

# Show top terms per cluster
terms = vectorizer.get_feature_names_out()
for i in range(k):
    center = kmeans.cluster_centers_[i]
    top_indices = center.argsort()[-10:][::-1]
    top_terms = [terms[idx] for idx in top_indices]
    print(f"\nCluster {i} top terms: {', '.join(top_terms)}")



Cluster 0 top terms: magnus, ac, ohio, edu, state, cmu, auto, news, com, university

Cluster 1 top terms: gov, nasa, edu, space, jpl, sci, cmu, news, com, net

Cluster 2 top terms: edu, sys, mac, comp, hardware, ibm, drive, com, cmu, card

Cluster 3 top terms: uiuc, cso, edu, news, ux1, owner, cmu, net, com, talk

Cluster 4 top terms: forsale, edu, misc, sale, computer, cmu, offer, com, srv, news

Cluster 5 top terms: edu, game, hockey, team, sport, baseball, player, rec, news, year

Cluster 6 top terms: com, edu, cmu, news, rec, netcom, net, sun, srv, motorcycle

Cluster 7 top terms: god, edu, christian, jesus, rutgers, bible, people, religion, one, belief

Cluster 8 top terms: edu, sci, cmu, news, net, com, graphic, space, srv, would

Cluster 9 top terms: culture, soc, armenian, turkish, soviet, edu, politics, greek, muslim, mideast

Cluster 10 top terms: window, comp, edu, com, file, cmu, misc, do, news, net

Cluster 11 top terms: sandvik, mchp, sni, apple, horus, frank, kent, obje

## Topic Modeling using Latent Dirichlet Allocation (LDA)

LDA is a generative probabilistic model used to identify topics in a collection of documents.

We use the **Gensim library** to apply LDA on the corpus of cleaned text.

The model generates a set of topics (10 in this run, set by `num_topics=10`), each represented by a list of high-probability words.

Each document is modeled as a mixture of these topics, which helps us understand hidden thematic patterns in the dataset.
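
The mixture idea can be made concrete with a small numeric sketch. Under LDA, the probability of a word in a document is the topic-weighted sum p(w | d) = Σ_k θ(d, k) · φ(k, w), where θ is the document's topic mixture and φ is each topic's word distribution. The topic names, words, and probabilities below are invented for illustration and are not output from the model:

```python
def word_probabilities(doc_topic_weights, topic_word_probs):
    """Combine per-topic word distributions using a document's topic mixture."""
    combined = {}
    for topic, weight in doc_topic_weights.items():
        for word, prob in topic_word_probs[topic].items():
            combined[word] = combined.get(word, 0.0) + weight * prob
    return combined


# Hypothetical topics, each a probability distribution over a tiny vocabulary.
topic_word_probs = {
    "space": {"orbit": 0.6, "launch": 0.3, "team": 0.1},
    "sport": {"orbit": 0.0, "launch": 0.1, "team": 0.9},
}
# A hypothetical document that is 75% "space" and 25% "sport".
doc_topic_weights = {"space": 0.75, "sport": 0.25}

mixed = word_probabilities(doc_topic_weights, topic_word_probs)
for word, prob in sorted(mixed.items()):
    print(f"{word}: {prob:.3f}")
```

For a fitted Gensim model, the estimated mixture of a single document can be read with `lda_model.get_document_topics(corpus[0])`.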


In [8]:
from gensim import corpora, models

# Tokenize
tokenized = [doc.split() for doc in preprocessed_texts]

# Dictionary and Corpus
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Apply LDA
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, random_state=42)

# Show topics
for idx, topic in lda_model.print_topics(num_words=10):
    print(f"\nTopic {idx}: {topic}")



Topic 0: 0.145*"soc" + 0.132*"culture" + 0.018*"muslim" + 0.011*"arabic" + 0.007*"religion" + 0.006*"toronto" + 0.005*"umanitoba" + 0.005*"jewish" + 0.005*"halat" + 0.005*"concordia"

Topic 1: 0.009*"edu" + 0.007*"one" + 0.006*"use" + 0.006*"would" + 0.005*"geneva" + 0.004*"space" + 0.004*"also" + 0.004*"bit" + 0.004*"problem" + 0.004*"work"

Topic 2: 0.029*"atheism" + 0.021*"horus" + 0.019*"religion" + 0.018*"mchp" + 0.015*"d012s658" + 0.015*"sni" + 0.012*"dwyer" + 0.009*"christian" + 0.008*"oulu" + 0.008*"morality"

Topic 3: 0.020*"edu" + 0.010*"talk" + 0.009*"politics" + 0.008*"people" + 0.008*"com" + 0.007*"would" + 0.007*"cmu" + 0.007*"one" + 0.005*"gun" + 0.005*"right"

Topic 4: 0.048*"edu" + 0.035*"com" + 0.025*"cmu" + 0.016*"srv" + 0.012*"cantaloupe" + 0.012*"misc" + 0.011*"net" + 0.011*"news" + 0.009*"apr" + 0.009*"message"

Topic 5: 0.098*"edu" + 0.021*"cmu" + 0.014*"srv" + 0.013*"news" + 0.011*"cantaloupe" + 0.010*"net" + 0.009*"rutgers" + 0.009*"message" + 0.009*"apr" + 0.

## Visualizations

We use **pyLDAvis**, an interactive visualization tool, to explore LDA topics.

The visualization includes:
- Inter-topic distances (bubbles)
- Top terms per topic
- Term relevance sliders

This makes it easy to interpret the topics discovered by the model.
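
The relevance slider reweights terms between raw topic probability and topic specificity. A minimal sketch of the LDAvis relevance score, relevance(w, t) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)), where λ is the slider value. The function names and probabilities are hypothetical, chosen to show how a corpus-wide frequent token such as `edu` drops in rank as λ decreases:

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """LDAvis-style relevance: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


def rank_terms(topic_probs, corpus_probs, lam):
    """Sort a topic's terms by relevance, highest first."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )


# Hypothetical probabilities: "edu" is frequent everywhere, "orbit" is topic-specific.
topic_probs = {"edu": 0.10, "orbit": 0.05}
corpus_probs = {"edu": 0.08, "orbit": 0.002}

print(rank_terms(topic_probs, corpus_probs, lam=1.0))  # ranks by p(w|t) only
print(rank_terms(topic_probs, corpus_probs, lam=0.2))  # favors topic-specific terms
```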


In [10]:
# Import libraries
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Enable Jupyter inline display
pyLDAvis.enable_notebook()

# Prepare the visualization using LDA model
vis = gensimvis.prepare(lda_model, corpus, dictionary)

# Display the visualization
vis

## Conclusion and Observations

- We successfully applied **KMeans clustering** to group documents by content.
- We extracted **interpretable topics** using **LDA**, revealing hidden patterns and themes.
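- Several clusters and topics line up with recognizable themes such as space, religion, Middle East politics, computer hardware, and sports. Many top terms, however, are header tokens (`edu`, `cmu`, `srv`, `news`), because message headers were not removed during preprocessing.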
- This approach is useful for summarizing large text datasets, organizing content, and building topic-aware search systems.

### Future Work
- Apply **BERT embeddings + HDBSCAN** for better clustering.
- Build a **web app interface** to explore clustered or topic-tagged documents.
- Compare performance of LDA vs. NMF (Non-negative Matrix Factorization).


In [11]:
import pickle

with open("preprocessed_data.pkl", "wb") as f:
    pickle.dump((preprocessed_texts, labels, label_names), f)


In [12]:
with open("preprocessed_data.pkl", "rb") as f:
    preprocessed_texts, labels, label_names = pickle.load(f)


In [13]:
print(f"Documents loaded: {len(preprocessed_texts)}")
print(f"Categories found: {label_names}")


Documents loaded: 19997
Categories found: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
