
# NLP Tutorial — Part 6: Topic Modeling Algorithms

**Runtime:** Google Colab (recommended)

In this notebook, you'll learn and *compare* several topic modeling approaches:

1. **Vectorizing text** (Count & TF-IDF)
2. **Non-negative Matrix Factorization (NMF)** — great for *small* datasets, interpretable (scikit-learn)
3. **Latent Dirichlet Allocation (LDA)** — probabilistic model, works better on *medium* datasets (Gensim)
4. **Modern embedding-based methods** — great for *larger* datasets:
   - **BERTopic**
   - **Top2Vec**

> **Choice guidance**
> - **Small dataset:** Start with **NMF**.
> - **Medium dataset:** Try **LDA**.
> - **Large dataset / modern stack:** Try **BERTopic** or **Top2Vec** (embedding-based).



## 0) Setup & Installs

> 🧰 Run this cell first in Colab. It installs libraries you may not have locally.


In [None]:
# Core
%pip -q install scikit-learn pandas numpy matplotlib

# Topic modeling libraries
%pip -q install gensim==4.3.3 pyLDAvis==3.4.1

# Embeddings + modern topic models
%pip -q install sentence-transformers umap-learn hdbscan
%pip -q install bertopic
%pip -q install top2vec
%pip -q install gensim==4.3.3

print("✅ Installs complete.")



## 1) Load a Dataset

We'll default to a manageable subset of **20 Newsgroups**.  
You can also bring your own data — a CSV with a `text` column — using the provided template.


In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd


# --- Option A: Use 20 Newsgroups (subset for speed) ---
categories = [
    'sci.space', 'comp.graphics', 'rec.sport.baseball',
    'talk.politics.mideast', 'sci.med'
]

newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers','quotes','footers'))
texts = newsgroups.data
targets = newsgroups.target
target_names = newsgroups.target_names

df = pd.DataFrame({
    "text": texts,
    "label": [target_names[t] for t in targets]
})

print(df.shape)
df.head(3)


In [None]:

# --- Option B: Load your own CSV (expects a 'text' column) ---
# from google.colab import files
# uploaded = files.upload()  # then set filename below
# df = pd.read_csv("your_file.csv")
# assert 'text' in df.columns, "Your CSV must contain a 'text' column"
# df = df.dropna(subset=['text']).reset_index(drop=True)
# df.head()



## 2) Preprocessing & Vectorization

We demonstrate both **CountVectorizer** and **TfidfVectorizer**.  
You'll choose which to feed into a topic model depending on the algorithm and your goals.


In [None]:

import re
import numpy as np

def simple_clean(text):
    # Light cleaning to keep the tutorial focused on modeling
    text = re.sub(r"http\S+", " ", text)  # drop URLs first
    text = re.sub(r"\s+", " ", text)      # then collapse whitespace, including gaps left by removed URLs
    text = text.strip()
    return text

df['clean_text'] = df['text'].apply(simple_clean)
df = df[df['clean_text'].str.len() > 0].reset_index(drop=True)
len(df)


In [None]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# You can tweak min_df, max_df, ngram_range to balance specificity vs generality
count_vectorizer = CountVectorizer(
    max_features=20000,
    min_df=2,
    max_df=0.9,
    ngram_range=(1,2),
    stop_words='english'
)

tfidf_vectorizer = TfidfVectorizer(
    max_features=20000,
    min_df=2,
    max_df=0.9,
    ngram_range=(1,2),
    stop_words='english'
)

X_count = count_vectorizer.fit_transform(df['clean_text'])
X_tfidf = tfidf_vectorizer.fit_transform(df['clean_text'])

X_count.shape, X_tfidf.shape



> **Rule of thumb**
> - **NMF** usually does well with **TF-IDF** features.
> - **LDA** is a probabilistic model over **raw counts** (bag-of-words).
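
To make the difference between the two representations concrete, here is a minimal sketch on a hypothetical three-document corpus (the documents and the `toy_*` names are illustrative, not part of the tutorial data). It shows that raw counts treat a ubiquitous word and a distinctive word the same when each occurs once, while TF-IDF gives the distinctive word a higher weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical three-document corpus, for illustration only
toy_docs = [
    "the rocket launch was delayed",
    "the rocket reached orbit",
    "the pitcher threw a fastball",
]

toy_count = CountVectorizer()
toy_tfidf = TfidfVectorizer()

counts = toy_count.fit_transform(toy_docs).toarray()
weights = toy_tfidf.fit_transform(toy_docs).toarray()

# token -> column index (identical for both vectorizers with default settings)
vocab = toy_count.vocabulary_

# Raw counts: "the" and "orbit" each occur once in document 1
print(counts[1, vocab["the"]], counts[1, vocab["orbit"]])

# TF-IDF: "the" appears in every document, so it is weighted below "orbit"
print(weights[1, vocab["the"]] < weights[1, vocab["orbit"]])
```

This is why TF-IDF tends to suit NMF, which factorizes the weights directly, whereas LDA's generative model assumes integer word counts.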



## 3) Non-negative Matrix Factorization (NMF) — Small Datasets

- Library: **scikit-learn**  
- Inputs: Prefer **TF-IDF**  
- Strengths: Often yields **interpretable** topics for small corpora


In [None]:

from sklearn.decomposition import NMF

n_topics = 10  # adjust
nmf = NMF(n_components=n_topics, random_state=42, init='nndsvda', max_iter=400)
W = nmf.fit_transform(X_tfidf)  # doc-topic matrix
H = nmf.components_               # topic-term matrix

feature_names = tfidf_vectorizer.get_feature_names_out()

def show_top_terms(H, feature_names, topn=12):
    topics = []
    for idx, topic_vec in enumerate(H):
        top_idx = topic_vec.argsort()[::-1][:topn]
        terms = [feature_names[i] for i in top_idx]
        topics.append((idx, terms))
    return topics

topics_nmf = show_top_terms(H, feature_names, topn=12)
for k, terms in topics_nmf:
    print(f"Topic {k:02d}: " + ", ".join(terms))


In [None]:

# Inspect top documents per topic
import numpy as np

def top_docs_for_topic(W, docs, topic_id, topn=5):
    scores = W[:, topic_id]
    idx = np.argsort(scores)[::-1][:topn]
    return [(i, float(scores[i]), docs[i][:300].replace("\n"," ")) for i in idx]

topic_id = 0
top_docs = top_docs_for_topic(W, df['clean_text'].tolist(), topic_id, topn=3)
for i, score, snippet in top_docs:
    print(f"Doc {i} — score={score:.3f}\n{snippet}\n{'-'*80}")
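
The doc-topic matrix can also be read row by row to assign each document a dominant topic. The sketch below uses a small hypothetical matrix `W_toy` in place of `W`; normalizing rows to shares is one possible convention, not something NMF itself prescribes.

```python
import numpy as np

# Hypothetical doc-topic weights (3 documents x 2 topics), standing in for W
W_toy = np.array([
    [0.8, 0.2],
    [0.1, 0.3],
    [0.0, 0.0],   # a document with no weight on any topic
])

row_sums = W_toy.sum(axis=1, keepdims=True)
# Divide only where the row sum is positive to avoid division by zero
shares = np.divide(W_toy, row_sums, out=np.zeros_like(W_toy), where=row_sums > 0)

dominant = shares.argmax(axis=1)
dominant[row_sums.ravel() == 0] = -1  # mark documents without any topic weight

print(shares)
print(dominant)
```

With the real matrix, `pd.Series(dominant).value_counts()` gives a quick view of how many documents each topic attracts.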



> **Tuning tips for NMF**
> - Increase/decrease `n_components` to adjust granularity.
> - Filter very common/rare terms via `max_df`, `min_df`.
> - Try different `ngram_range` values (e.g., `(1,1)` vs `(1,2)`).
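
One way to explore the first tip is to fit NMF for several values of `n_components` and compare the reconstruction error. The sketch below does this on a hypothetical mini-corpus. The error generally shrinks as topics are added, so treat it as a rough guide (look for diminishing returns) alongside reading the topics themselves.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, for illustration only
toy_docs = [
    "rocket launch orbit satellite",
    "satellite orbit mission rocket",
    "pitcher fastball inning baseball",
    "baseball inning home run pitcher",
    "doctor patient treatment disease",
    "disease treatment symptoms patient",
]

X_toy = TfidfVectorizer().fit_transform(toy_docs)

errors = {}
for k in (1, 2, 3):
    model = NMF(n_components=k, init="nndsvda", random_state=42, max_iter=500)
    model.fit(X_toy)
    errors[k] = model.reconstruction_err_  # Frobenius norm of X - WH
    print(f"n_components={k}: reconstruction error = {errors[k]:.3f}")
```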



## 4) Latent Dirichlet Allocation (LDA) — Medium Datasets

- Library: **Gensim**  
- Inputs: **Count (bag-of-words)**  
- Strengths: Probabilistic (each document gets a distribution over topics); often works better than NMF on medium-sized corpora  
- Randomness is involved; set seeds for reproducibility where possible.


In [None]:

import gensim
from gensim import corpora

# Build a Gensim dictionary + BOW corpus from the CountVectorizer output.
# Reusing the vectorizer's vocabulary keeps tokens aligned with Section 2.
inv_vocab = {v: k for k, v in count_vectorizer.vocabulary_.items()}

# Rebuild each document's token list from its row of X_count, repeating every
# token by its count so that doc2bow recovers the true term frequencies
# (listing each nonzero column only once would reduce counts to 0/1 presence).
tokens_list = []
for i in range(X_count.shape[0]):
    row = X_count[i]  # 1 x vocab sparse row (CSR)
    doc_tokens = []
    for col, cnt in zip(row.indices, row.data):
        doc_tokens.extend([inv_vocab[col]] * int(cnt))
    tokens_list.append(doc_tokens)

dictionary = corpora.Dictionary(tokens_list)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokens_list]

# Train LDA
num_topics = 10
lda_model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    chunksize=2000,
    passes=5,
    alpha='auto',
    eta='auto',
    per_word_topics=False
)

for i, topic in lda_model.show_topics(num_topics=num_topics, num_words=12, formatted=False):
    terms = ", ".join([w for w,_ in topic])
    print(f"Topic {i:02d}: {terms}")
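
Gensim represents a bag-of-words document as a list of `(token_id, count)` pairs. The sketch below shows that format by converting rows of a scikit-learn count matrix by hand; the helper `row_to_bow` and the two toy documents are hypothetical. Gensim also ships `gensim.matutils.Sparse2Corpus` for wrapping a sparse matrix directly, which may be preferable on larger corpora.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, for illustration only
toy_docs = ["space space shuttle", "shuttle launch"]
cv = CountVectorizer()
X = cv.fit_transform(toy_docs)  # CSR matrix of raw counts

def row_to_bow(X_sparse, row):
    """Return one document as a sorted list of (token_id, count) pairs."""
    r = X_sparse[row]
    return sorted((int(col), int(cnt)) for col, cnt in zip(r.indices, r.data))

toy_bow = [row_to_bow(X, i) for i in range(X.shape[0])]
id2token = {idx: tok for tok, idx in cv.vocabulary_.items()}

for doc in toy_bow:
    print([(id2token[i], c) for i, c in doc])
```

Note that the count for `space` in the first document is 2: the counts, not just the presence of a term, are what LDA models.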


In [None]:
# Optional: Interactive visualization with pyLDAvis (works in Colab)
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
lda_vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
lda_vis  # In Colab: this should render an interactive panel



> **Tuning tips for LDA**
> - Experiment with `passes` and `iterations` (higher values can improve stability but take longer to train).
> - Start with 5–20 topics; refine as needed.
> - Consider lemmatization and custom stop-word lists for domain data.
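
As a sketch of the custom stop-word tip, the snippet below extends scikit-learn's built-in English list with two hypothetical domain terms (`patient`, `study`) and passes the result to `CountVectorizer`. Which terms count as noise depends entirely on your corpus, so treat the list as a placeholder.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical domain-specific stop words; replace with terms from your own corpus
custom_stop_words = {"patient", "study"}

# scikit-learn expects a list (or the string 'english'), so convert the combined set
stop_words = sorted(ENGLISH_STOP_WORDS.union(custom_stop_words))

# Hypothetical documents, for illustration only
toy_docs = [
    "the patient study reported improved sleep",
    "a study of patient diet and sleep",
]

cv = CountVectorizer(stop_words=stop_words)
cv.fit(toy_docs)
print(sorted(cv.vocabulary_))
```

The same `stop_words` list can be passed to the `CountVectorizer` in Section 2 before rebuilding the Gensim corpus.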



## 5) BERTopic — Embedding-based Topic Modeling (Modern, Larger Datasets)

- Uses transformer embeddings + clustering to form topics.
- Often more robust for varied language use and larger corpora.
- Library: **BERTopic** (depends on `sentence-transformers`, `umap-learn`, `hdbscan`).

> ⚠️ If this cell is slow on CPU, switch to **GPU** in Colab (Runtime → Change runtime type → T4 GPU).


In [None]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP  # or: from umap.umap_ import UMAP

# small, fast embedder
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# make results reproducible via UMAP's random_state
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42
)

topic_model = BERTopic(
    embedding_model=embed_model,
    umap_model=umap_model,
    verbose=True,
    calculate_probabilities=True,
    min_topic_size=10
)

texts_for_bertopic = df["clean_text"].tolist()
topics, probs = topic_model.fit_transform(texts_for_bertopic)

topic_info = topic_model.get_topic_info()
topic_info.head(10)


In [None]:

# Show top terms per topic
for topic_id in topic_info['Topic'].head(10):
    if topic_id == -1:
        continue  # -1 is BERTopic's outlier topic (documents not assigned to any cluster)
    print(f"Topic {topic_id}:")
    print(topic_model.get_topic(topic_id))
    print("-"*80)


In [None]:

# Optional interactive visualizations (Plotly)
try:
    fig = topic_model.visualize_barchart(top_n_topics=10)
    fig.show()
except Exception as e:
    print("Visualization skipped:", e)



## 6) Top2Vec — Embedding + Joint Topic Discovery

- Learns document embeddings and discovers topics without predefined `k`.
- Can be heavier than BERTopic depending on embedding choice; keep dataset modest in demos.


In [None]:
from top2vec import Top2Vec

docs_small = df['clean_text'].tolist()[:1500]

t2v = Top2Vec(
    documents=docs_small,
    embedding_model='doc2vec',   # or 'universal-sentence-encoder', 'distiluse-base-multilingual-cased'
    speed='learn',
    workers=4,
    verbose=True
)

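# Get the number of discovered topics, then fetch all of them.
# A separate name avoids overwriting `num_topics` from the LDA section.
n_t2v_topics = t2v.get_num_topics()
topic_words, word_scores, topic_nums = t2v.get_topics(n_t2v_topics)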

print("Discovered topics:", len(topic_nums))
for i in range(min(10, len(topic_nums))):
    terms = ", ".join(topic_words[i][:12])
    print(f"Topic {topic_nums[i]}: {terms}")
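
Top2Vec can discover the number of topics because it clusters document vectors by density rather than asking for `k`. The sketch below illustrates that idea only: scikit-learn's `DBSCAN` stands in for the HDBSCAN step, and the 2-D points are synthetic stand-ins for reduced document embeddings. The `eps` and `min_samples` values are assumptions tuned to this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Hypothetical 2-D "document embeddings": three tight groups plus one far-away point
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])
points = np.vstack([points, [[10.0, -10.0]]])

# No cluster count is supplied; dense regions become clusters
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

n_clusters = len(set(labels) - {-1})   # label -1 marks noise points
n_noise = int((labels == -1).sum())
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

The flip side of not choosing `k` is that the density parameters now control granularity, so the number of topics found can change noticeably with corpus size and settings.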



## 7) Evaluation & Model Selection

There isn't a single perfect metric for topic modeling, but you can use a mix of:

- **Coherence** (e.g., `c_v`, `u_mass`) — correlates with human interpretability.
- **Diversity** — share of unique tokens among top words across topics.
- **Human-in-the-loop** — have domain experts label topics or rate quality.
- **Downstream utility** — e.g., topic features improve a classifier or retrieval task.

Below are simple examples for *coherence* (Gensim) and *diversity*.


In [None]:
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

def nmf_topics_as_terms(H, feature_names, topn=10):
    topics = []
    for topic_vec in H:
        top_idx = topic_vec.argsort()[::-1][:topn]
        topics.append([feature_names[i] for i in top_idx])
    return topics

# 1) Prepare topics from NMF and a dictionary from your tokenized docs
nmf_term_lists = nmf_topics_as_terms(H, feature_names, topn=10)

tokens_for_eval = tokens_list  # list[list[str]] of tokens per doc, built in Section 4

# Use a separate dictionary for evaluation so the one used by LDA/pyLDAvis is not overwritten
eval_dictionary = Dictionary(tokens_for_eval)
# Optional: trim extremes to reduce noise. Topic terms filtered out here
# cannot be scored, so keep the thresholds loose enough to retain them.
eval_dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=50000)

# 2) Compute c_v coherence (uses texts + dictionary)
coh_nmf = CoherenceModel(
    topics=nmf_term_lists,
    texts=tokens_for_eval,
    dictionary=eval_dictionary,
    coherence='c_v'
).get_coherence()

print(f"NMF c_v coherence: {coh_nmf:.3f}")
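
To show what a coherence score measures, here is a simplified, pure-Python UMass-style score based on document co-occurrence. It is a sketch for intuition, not Gensim's exact implementation (Gensim's smoothing and aggregation differ); the function name and toy documents are hypothetical.

```python
import math
from itertools import combinations

def umass_style_coherence(topic_terms, tokenized_docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over pairs of topic terms.

    D(w) is the number of documents containing w, and D(w_i, w_j) the number
    containing both. Terms are assumed to be ordered from most to least
    probable; there must be at least two terms, and every term must occur
    in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for j, i in combinations(range(len(topic_terms)), 2):  # j < i
        w_j, w_i = topic_terms[j], topic_terms[i]
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j)))
    return sum(scores) / len(scores)

# Hypothetical tokenized documents, for illustration only
toy_docs = [
    ["rocket", "orbit", "launch"],
    ["rocket", "orbit"],
    ["pitcher", "inning"],
    ["pitcher", "inning", "rocket"],
]

coherent = umass_style_coherence(["rocket", "orbit"], toy_docs)
mixed = umass_style_coherence(["rocket", "inning"], toy_docs)
print(f"rocket/orbit: {coherent:.3f}  rocket/inning: {mixed:.3f}")
```

Terms that frequently share documents score closer to zero, while terms that rarely co-occur push the score further negative.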


In [None]:

# Topic diversity (unique top terms across topics)
def topic_diversity(topic_term_lists):
    all_terms = [term for topic in topic_term_lists for term in topic]
    unique_terms = len(set(all_terms))
    total_terms = len(all_terms)
    return unique_terms / total_terms if total_terms else 0.0

div_nmf = topic_diversity(nmf_term_lists)
print(f"NMF topic diversity (top-10 terms): {div_nmf:.3f}")



## 8) Using Your Own Data

1. Upload a CSV with a `text` column.
2. Re-run **Section 1 (Option B)** and **Section 2**.
3. Choose **one** modeling approach (Sections 3–6) and tune hyperparameters:
   - Topic count (`n_components` for NMF, `num_topics` for LDA)
   - Vectorizer settings (`min_df`, `max_df`, `ngram_range`)
   - Clustering & dimensionality reduction settings for BERTopic/Top2Vec
4. Evaluate with **Section 7**, iterate.



## 9) Practical Tips & Gotchas

- **Preprocessing matters**: custom stopwords, lemmatization, phrase detection (bigrams) can help.
- **Interpretability vs. coherence**: pick settings that produce topics you (and stakeholders) find useful.
- **Reproducibility**: fix random seeds where possible; document versions and parameters.
- **Performance**: for large corpora, prefer GPU runtime and smaller embedding models initially.
- **Ethics**: inspect topics for bias or privacy leaks before sharing.
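
For the reproducibility tip, one lightweight option is to record the seed and installed package versions next to your results. The sketch below uses only the standard library; the `SEED` value, the `package_versions` helper, and the `run_info` structure are illustrative choices rather than a fixed convention.

```python
import random
import sys
from importlib import metadata

SEED = 42  # hypothetical seed; any fixed integer works
random.seed(SEED)

def package_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

run_info = {
    "python": sys.version.split()[0],
    "seed": SEED,
    "packages": package_versions(
        ["scikit-learn", "gensim", "bertopic", "top2vec", "umap-learn", "hdbscan"]
    ),
}
print(run_info)
```

Seeding Python's `random` module does not by itself seed every library; the models in this notebook still need their own `random_state` arguments, as used in the earlier sections.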



---

### Credits & References
- scikit-learn: NMF, vectorizers
- Gensim: LDA, coherence
- pyLDAvis: interactive LDA visualization (`pyLDAvis.gensim_models`)
- BERTopic: https://github.com/MaartenGr/BERTopic
- Top2Vec: https://github.com/ddangelov/Top2Vec

