
# Chapter 6 — Topic Modeling

**Topic modeling** discovers the latent thematic structure in a collection of documents. Unlike supervised classification (Chapter 4), topic models are typically **unsupervised** — they find groups of co-occurring words without knowing the "correct" labels in advance.

This chapter covers five approaches with increasing sophistication:

| # | Method | Key Idea | Best For |
|---|--------|----------|----------|
| 1 | **LDA (gensim)** | Probabilistic generative model over word distributions | Long documents, interpretable topics |
| 2 | **Community Detection (SBERT)** | Graph-based clustering of sentence embeddings | Short texts, social media, deduplication |
| 3 | **K-Means + BERT** | Partition BERT embeddings into $k$ clusters | Known number of topics, evaluation against labels |
| 4 | **BERTopic** | HDBSCAN + c-TF-IDF on BERT embeddings | Automatic topic discovery, visualization |
| 5 | **Contextualized Topic Models** | Combines embeddings + bag-of-words; multilingual | Cross-lingual topic transfer |

We use the **BBC News** dataset throughout, which contains 2,225 articles across five known topics: *tech, business, sport, entertainment, politics*. This ground truth lets us evaluate how well each unsupervised method recovers the true structure.

## 0 — Environment Setup

In [1]:

# 0.1  Install packages

import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

!pip install -q \
    datasets \
    nltk \
    scikit-learn \
    sentence-transformers \
    gensim \
    bertopic \
    contextualized-topic-models \
    hdbscan \
    umap-learn



In [2]:

# 0.2  Core imports & configuration

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)

import numpy as np
import pandas as pd
import re

import nltk
nltk.download("punkt",     quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)

print("Setup complete.")


Setup complete.


We set `HF_HUB_DISABLE_PROGRESS_BARS=1` **before** any HuggingFace imports to prevent Jupyter widget metadata from polluting the notebook (which breaks GitHub rendering). `TOKENIZERS_PARALLELISM=false` suppresses tokenizer fork warnings.

## 0.3 — Shared Utility Functions

In [3]:

# 0.3  Shared utility functions (inlined from lang_utils)

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from string import punctuation
from sklearn.metrics import classification_report

STOP_WORDS = list(stopwords.words("english")) + ["``", "'s"]

def get_most_frequent_words(text, num_words=20):
    """Return list of most frequent words in text."""
    word_list = word_tokenize(text)
    freq_dist = FreqDist(word_list)
    return [w[0] for w in freq_dist.most_common(num_words)]

def print_most_common_words_by_cluster(documents, km_model,
                                       num_clusters):
    """Print top words per K-Means cluster."""
    clusters = km_model.labels_.tolist()
    for cluster in range(num_clusters):
        cluster_docs = [documents[i] for i, c in enumerate(clusters)
                        if c == cluster]
        all_text = " ".join(cluster_docs)
        top_words = get_most_frequent_words(all_text)
        print(f"\n--- Cluster {cluster} ({len(cluster_docs)} docs) ---")
        print(top_words)

print("Utility functions defined.")


Utility functions defined.


## 0.4 — Load the BBC News Dataset

The BBC dataset is available both as a CSV from the book's GitHub repository and via Hugging Face. We load from Hugging Face for convenience and create our own CSV-style dataframe with `category` and `text` columns, matching the format used throughout this chapter.

In [4]:

# 0.4  Load BBC dataset

from datasets import load_dataset

ds = load_dataset("SetFit/bbc-news")
# Combine train + test for topic modeling (unsupervised = no split needed)
bbc_all = pd.concat([
    ds["train"].to_pandas(),
    ds["test"].to_pandas()
], ignore_index=True)

# Rename to match book's CSV column names
bbc_df = bbc_all.rename(columns={"label_text": "category"}).copy()
bbc_df = bbc_df[["category", "text"]].copy()

print(f"Total articles: {len(bbc_df):,}")
print()
print("Class distribution:")
print(bbc_df["category"].value_counts())
print()
print(bbc_df.head())


Total articles: 2,225

Class distribution:
category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

        category                                               text
0          sport  wales want rugby league training wales could f...
1       business  china aviation seeks rescue deal scandal-hit j...
2  entertainment  rock band u2 break ticket record u2 have smash...
3       business  markets signal brazilian recovery the brazilia...
4           tech  tough rules for ringtone sellers firms that fl...


We have 2,225 articles spread fairly evenly across the five categories. For unsupervised topic modeling we typically use the full dataset (no train/test split), since there are no labels to learn from. We will split only when we want to *evaluate* a model's cluster assignments against the ground-truth labels.

---

## Recipe 1 — LDA Topic Modeling with Gensim

**Latent Dirichlet Allocation (LDA)** is the classical topic modeling algorithm. It treats each document as a mixture of topics, and each topic as a distribution over words:

$$P(\text{word} \mid \text{document}) = \sum_{k=1}^{K} \underbrace{P(\text{word} \mid \text{topic}_k)}_{\phi_k} \cdot \underbrace{P(\text{topic}_k \mid \text{document})}_{\theta_d}$$

where $\phi_k$ is the word distribution for topic $k$ (what words characterize this topic?) and $\theta_d$ is the topic distribution for document $d$ (what mix of topics does this document contain?). Both are drawn from Dirichlet priors, which encourage sparsity — most documents cover only a few topics, and most topics use only a subset of the vocabulary.
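As a small numeric illustration of the mixture formula above, the sketch below uses a made-up three-word vocabulary and two topics. The values in `phi` and `theta_d` are invented for the example and do not come from the BBC model.

```python
# Toy illustration of P(word | document) = sum_k phi_k[word] * theta_d[k].
# All numbers are invented for illustration.

vocab = ["goal", "market", "film"]

# phi[k][w]: probability of word w under topic k (each row sums to 1)
phi = [
    [0.7, 0.2, 0.1],   # topic 0: mostly "goal"
    [0.1, 0.8, 0.1],   # topic 1: mostly "market"
]

# theta_d[k]: share of topic k in one document (sums to 1)
theta_d = [0.75, 0.25]

def word_probs(phi, theta):
    """Mix the per-topic word distributions using the document's topic shares."""
    n_words = len(phi[0])
    return [
        sum(theta[k] * phi[k][w] for k in range(len(theta)))
        for w in range(n_words)
    ]

p_word_given_doc = word_probs(phi, theta_d)
for word, p in zip(vocab, p_word_given_doc):
    print(f"P({word} | d) = {p:.3f}")
```

Because each row of `phi` and the vector `theta_d` sum to 1, the mixed word probabilities also sum to 1.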

LDA works best with **longer documents** and **bag-of-words representations**. It requires careful preprocessing (stopword removal, digit removal) because high-frequency noise words will dominate the learned topics otherwise.

In [5]:

# 1.1  Preprocess text for LDA

from gensim.utils import simple_preprocess
from gensim.models.ldamodel import LdaModel
import gensim.corpora as corpora
from gensim.corpora import MmCorpus
from pprint import pprint

stop_words_lda = stopwords.words("english")
stop_words_lda.append("said")

def clean_text(input_string):
    """Remove punctuation, digits, stopwords; tokenize and lowercase."""
    input_string = re.sub(r'[^\w\s]', ' ', input_string)
    input_string = re.sub(r'\d', '', input_string)
    input_list = simple_preprocess(input_string)
    input_list = [w for w in input_list if w not in stop_words_lda]
    return input_list

bbc_lda = bbc_df.copy()
bbc_lda["text_clean"] = bbc_lda["text"].apply(clean_text)

print("Sample cleaned text:")
print(bbc_lda["text_clean"].iloc[0][:20])


Sample cleaned text:
['wales', 'want', 'rugby', 'league', 'training', 'wales', 'could', 'follow', 'england', 'lead', 'training', 'rugby', 'league', 'club', 'england', 'already', 'three', 'day', 'session', 'leeds']


The `clean_text` pipeline applies four transformations in sequence: (1) replace all punctuation with spaces, (2) remove all digits, (3) tokenize and lowercase with gensim's `simple_preprocess`, and (4) filter out stopwords. We also add "said" to the stopword list because it appears pervasively across all BBC categories (it is a reporting verb, not a topic indicator).

This aggressive cleaning is essential for LDA because the model has no understanding of syntax or semantics — it sees only word co-occurrence counts. If "the" and "2" dominate every topic, the model cannot find meaningful structure.

In [6]:

# 1.2  Build dictionary and bag-of-words corpus

texts = bbc_lda["text_clean"].values
id_dict = corpora.Dictionary(texts)
corpus = [id_dict.doc2bow(text) for text in texts]

print(f"Vocabulary size: {len(id_dict):,} unique tokens")
print(f"Corpus size    : {len(corpus):,} documents")
print(f"\nSample BoW (first doc, first 10 entries):")
print(corpus[0][:10])


Vocabulary size: 27,689 unique tokens
Corpus size    : 2,225 documents

Sample BoW (first doc, first 10 entries):
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]


The gensim `Dictionary` maps each unique word to an integer ID. The `doc2bow` method then converts each document into a list of `(word_id, count)` tuples, the classic **bag-of-words** representation. Each list corresponds to one row of a document-term matrix, stored in sparse form.

For a vocabulary of $V$ words and a corpus of $N$ documents, the full matrix would be $N \times V$, but most entries are zero (a given document uses only a tiny fraction of the vocabulary). The BoW format stores only nonzero entries.
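To make the `(word_id, count)` format concrete, here is a minimal pure-Python sketch of the same idea. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins written for this example, not gensim functions, and gensim may assign IDs in a different order.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer ID to each unique token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens found in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [
    ["wales", "rugby", "league", "wales"],
    ["market", "growth", "market", "bank"],
]
token2id = build_vocab(toy_docs)
toy_bow = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_bow)
```

Only the words that occur in a document appear in its list, which is why the representation stays small even when the vocabulary is large.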

In [7]:

# 1.3  Train LDA model

num_topics = 5
lda_model = LdaModel(
    corpus=corpus,
    id2word=id_dict,
    num_topics=num_topics,
    chunksize=100,
    passes=20,
    random_state=42
)

print("LDA model trained.\n")
pprint(lda_model.print_topics())


LDA model trained.

[(0,
  '0.012*"people" + 0.006*"would" + 0.006*"mobile" + 0.005*"new" + 0.005*"one" '
  '+ 0.005*"mr" + 0.005*"also" + 0.005*"could" + 0.004*"get" + 0.004*"phone"'),
 (1,
  '0.008*"game" + 0.008*"club" + 0.007*"england" + 0.006*"first" + '
  '0.005*"time" + 0.005*"year" + 0.005*"last" + 0.005*"back" + 0.005*"win" + '
  '0.004*"two"'),
 (2,
  '0.013*"film" + 0.010*"best" + 0.008*"year" + 0.006*"also" + 0.006*"show" + '
  '0.006*"one" + 0.006*"us" + 0.005*"music" + 0.005*"new" + 0.005*"awards"'),
 (3,
  '0.011*"us" + 0.009*"bn" + 0.008*"year" + 0.006*"company" + 0.005*"firm" + '
  '0.005*"market" + 0.005*"new" + 0.005*"also" + 0.004*"last" + '
  '0.004*"growth"'),
 (4,
  '0.028*"mr" + 0.011*"would" + 0.010*"election" + 0.009*"government" + '
  '0.009*"labour" + 0.007*"blair" + 0.007*"minister" + 0.006*"told" + '
  '0.005*"party" + 0.005*"brown"')]


Each topic is shown as a weighted list of words. The weights represent $P(\text{word} \mid \text{topic})$ — how likely each word is to be generated by that topic. Inspecting these word lists is how we *interpret* the topics:

- Topic 1 (*game, club, england, win*) is clearly **sport**
- Topic 2 (*film, best, show, music, awards*) is **entertainment**
- Topic 4 (*election, government, labour, blair, minister, party*) is **politics**
- Topic 3 (*bn, company, firm, market, growth*) is **business**
- Topic 0 (*people, mobile, phone*) is **tech**, although it is the least distinctive of the five

Topic IDs are arbitrary: they depend on the random initialization, so a different `random_state` (or a different gensim version) can permute the numbering and shift the word weights. Fixing `random_state` should make a run repeatable in the same environment. Generic words such as *would*, *also*, *new*, and *year* still appear in several topics because they are not in the stopword list. The key insight is that the word clusters are coherent and recognizable: LDA has recovered meaningful structure from raw word co-occurrences.

**Hyperparameters:** `chunksize=100` means the model processes 100 documents at a time during online variational Bayes; `passes=20` means 20 full passes through the corpus. More passes generally improve convergence but increase training time linearly.
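A quick back-of-the-envelope count shows how much work these two settings imply for this corpus. The sketch assumes the final, partially filled chunk counts as one chunk.

```python
import math

n_docs = 2225      # corpus size reported above
chunksize = 100    # documents per chunk
passes = 20        # full sweeps over the corpus

# The last chunk holds only the leftover 25 documents but still counts as a chunk
chunks_per_pass = math.ceil(n_docs / chunksize)
total_chunks = chunks_per_pass * passes
print(f"{chunks_per_pass} chunks per pass, {total_chunks} chunks in total")
```

Doubling `passes` doubles the total chunk count, which is where the linear growth in training time comes from.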

In [8]:

# 1.4  Save and reload the model

os.makedirs("models/bbc_gensim", exist_ok=True)

model_path  = "models/bbc_gensim/lda.model"
dict_path   = "models/bbc_gensim/id2word.dict"
corpus_path = "models/bbc_gensim/corpus.mm"

lda_model.save(model_path)
id_dict.save(dict_path)
MmCorpus.serialize(corpus_path, corpus)

# Reload
lda_model = LdaModel.load(model_path)
id_dict   = corpora.Dictionary.load(dict_path)
print("Model saved and reloaded.")


Model saved and reloaded.


In [9]:

# 1.5  Predict topic for a new article

new_example = (
    "Manchester United players slumped to the turf at full-time in Germany "
    "on Tuesday in acknowledgement of what their latest pedestrian first-half "
    "display had cost them. The 3-2 loss at RB Leipzig means United will not "
    "be one of the 16 teams in the draw for the knockout stages of the "
    "Champions League. And this is not the only price for failure. The damage "
    "will be felt in the accounts, in the dealings they have with current and "
    "potentially future players and in the faith the fans have placed in "
    "manager Ole Gunnar Solskjaer. With Paul Pogba agent angling for a move "
    "for his client and ex-United defender Phil Neville speaking of a "
    "witchhunt against his former team-mate Solskjaer, BBC Sport looks at "
    "the ramifications and reaction to a big loss for United."
)

input_list = clean_text(new_example)
bow = id_dict.doc2bow(input_list)
topics = lda_model[bow]

print("Topic distribution for new article:")
for topic_id, prob in sorted(topics, key=lambda x: -x[1]):
    print(f"  Topic {topic_id}: {prob:.4f}")


Topic distribution for new article:
  Topic 1: 0.7275
  Topic 3: 0.1612
  Topic 4: 0.1050


The model assigns the highest probability to the sport topic, correctly identifying this Manchester United article. Notice that some probability mass leaks to other topics (business, politics) — this is expected because the article mentions "accounts," "damage," and "price," which are business-related words. LDA's strength is that it models documents as **mixtures**, reflecting the reality that most texts touch on multiple themes.
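If a single label is needed downstream, the mixture can be reduced to its dominant topic. The sketch below reuses the `(topic_id, probability)` pairs printed above; the `topic_names` dictionary is an assumption based on reading the top words of this particular run and would need to be redone for any other run.

```python
# (topic_id, probability) pairs copied from the output above
topic_dist = [(1, 0.7275), (3, 0.1612), (4, 0.1050)]

# Hypothetical names, assigned by reading the top words of each topic in this run
topic_names = {0: "tech", 1: "sport", 2: "entertainment", 3: "business", 4: "politics"}

def dominant_topic(dist):
    """Return the (topic_id, probability) pair with the largest probability."""
    return max(dist, key=lambda pair: pair[1])

best_id, best_prob = dominant_topic(topic_dist)
print(f"Dominant topic: {best_id} ({topic_names[best_id]}), p = {best_prob:.4f}")

# Topics absent from the list fell below gensim's minimum-probability cutoff
reported_mass = sum(p for _, p in topic_dist)
print(f"Probability mass listed: {reported_mass:.4f}")
```

The listed probabilities do not sum to exactly 1 because topics 0 and 2 received too little mass to be reported.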

**Production insight:** LDA remains popular in industry for its interpretability and speed. It scales to millions of documents, each topic has a clear word-probability interpretation, and the per-document topic mixture provides a useful feature vector for downstream tasks (recommendation, clustering, trend analysis).

---

## Recipe 2 — Community Detection Clustering with SBERT

The community detection algorithm treats documents as nodes in a graph, where edges connect documents whose BERT embeddings have cosine similarity above a threshold. It then finds **densely connected subgraphs** (communities) — groups of documents that are all highly similar to each other.

Unlike LDA or K-Means, this approach does **not require specifying the number of topics** in advance. It discovers communities organically, and it focuses on the most coherent clusters rather than forcing every document into a group. Documents that do not fit any community are simply left unclustered.

This makes it ideal for **short texts** (tweets, comments, headlines) where finding tight-knit clusters of near-duplicates or closely related posts is more useful than broad topic categories.

In [10]:

# 2.1  Encode documents with SBERT and detect communities

from sentence_transformers import SentenceTransformer, util

model_sbert = SentenceTransformer("all-MiniLM-L6-v2")

print("Encoding documents...")
embeddings = model_sbert.encode(bbc_df["text"].values,
                                convert_to_tensor=True,
                                show_progress_bar=False)

clusters = util.community_detection(
    embeddings,
    threshold=0.7,
    min_community_size=10
)

print(f"\nCommunities found: {len(clusters)}")
total_clustered = sum(len(c) for c in clusters)
print(f"Documents clustered: {total_clustered} / {len(bbc_df)}")


Encoding documents...

Communities found: 8
Documents clustered: 107 / 2225


The `threshold=0.7` means two documents must have cosine similarity $\geq 0.7$ to be considered connected. This is quite strict — it produces tight, topically coherent clusters rather than broad categories. The `min_community_size=10` filters out tiny clusters that may be noise.

Notice that the algorithm does **not** cluster every document. Documents that are not sufficiently similar to any community remain unassigned. This is a feature, not a bug — it means the discovered communities are high-confidence groupings, useful for tasks like duplicate detection, trend spotting, or identifying viral topics on social media.
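The roles of the threshold and the minimum community size can be seen in a simplified pure-Python sketch. This is a greedy stand-in written for illustration, not the `sentence-transformers` implementation; `simple_communities` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def simple_communities(vectors, threshold, min_size):
    """Greedy threshold-based grouping; a simplified stand-in, not the library code."""
    n = len(vectors)
    # Candidate community for each document: every document (itself included)
    # whose cosine similarity to it reaches the threshold.
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)
    # Largest candidates first; keep one only if enough of its members are still free.
    candidates.sort(key=len, reverse=True)
    assigned, communities = set(), []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_size:
            communities.append(free)
            assigned.update(free)
    return communities

toy_vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],   # three near-parallel vectors
    [0.0, 1.0], [0.1, 0.9],                 # two near-parallel vectors
    [0.7, 0.7],                             # sits between the two groups
]
toy_communities = simple_communities(toy_vectors, threshold=0.95, min_size=2)
print(toy_communities)
```

The last vector is not similar enough to either group, so it stays unassigned, mirroring the unclustered BBC articles above.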

In [11]:

# 2.2  Inspect communities by most frequent words

def print_words_by_cluster(clusters, input_df):
    for i, cluster in enumerate(clusters):
        sentences = input_df.iloc[cluster]["text"]
        all_text = " ".join(sentences)
        freq_words = get_most_frequent_words(all_text)
        print(f"\nCluster {i+1} ({len(cluster)} articles): {freq_words}")

print_words_by_cluster(clusters, bbc_df)



Cluster 1 (21 articles): ['the', '.', 'to', 'and', 'a', 'of', 'in', 'mr', 'he', 'labour', 's', 'on', 'said', 'that', 'brown', 'blair', 'election', 'was', 'for', 'is']

Cluster 2 (19 articles): ['the', '.', 'to', 'of', 'in', 'a', 'yukos', 'and', 'for', 'is', 'its', 'that', 'it', 'has', 'was', 's', 'said', 'us', 'russian', 'oil']

Cluster 3 (14 articles): ['the', '.', 'and', 'to', 'of', 'a', 'in', 'kenteris', 'greek', 'thanou', 'for', 'iaaf', 'on', 'will', 'they', 'said', 'that', 'have', 'athens', 'been']

Cluster 4 (12 articles): ['the', '.', 'and', 'to', 'of', 'for', '-', 'best', 'a', 'in', 's', 'film', 'aviator', 'director', ';', 'has', 'actor', 'is', 'foxx', 'swank']

Cluster 5 (11 articles): ['the', 'in', '.', 'of', 'a', 'to', 'said', '%', 'and', 'market', 'prices', 'that', 'by', 'house', 'from', 'at', 'is', 'mortgage', 'bank', 'year']

Cluster 6 (10 articles): ['the', '.', 'to', 'and', 'of', 'a', 'he', 'mr', 'on', 'tax', 'in', 'that', 'for', 'will', 'is', 'howard', 'be', 'would', 

The communities are much more **granular** than the five broad BBC categories. Instead of a single "politics" cluster, you might see separate communities for *Labour/Blair/election*, *tax/Howard/Tory*, and *Iraq/war*. Instead of one "business" cluster, you get *Yukos/Russian oil*, *housing market prices*, and *London Stock Exchange*.

This granularity is the community detection algorithm's strength for exploratory analysis. In a production setting (e.g., monitoring social media for emerging trends), these fine-grained clusters reveal *specific stories* rather than broad topics — exactly what an analyst needs to understand what is happening right now.

---

## Recipe 3 — K-Means Topic Modeling with BERT Embeddings

This recipe combines the K-Means algorithm (Recipe 3 in Chapter 4) with BERT sentence embeddings to create a topic model that can be **evaluated against ground-truth labels**. Unlike LDA's bag-of-words approach, BERT embeddings capture semantic meaning — "football match" and "soccer game" will be close in embedding space even though they share no words.

K-Means partitions $N$ embedding vectors into $k$ clusters by minimizing within-cluster variance:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2$$

The combination of BERT's semantic understanding and K-Means' simplicity produces remarkably accurate topic assignments.
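The objective above can be evaluated by hand on a toy example. The sketch below computes the within-cluster sum of squares for two candidate assignments of four invented 2-D points; it only scores a given assignment and does not run the K-Means iterations themselves.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster_id in set(labels):
        members = [p for p, l in zip(points, labels) if l == cluster_id]
        mu = centroid(members)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in members)
    return total

toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
good_labels = [0, 0, 1, 1]   # groups the two left points and the two right points
bad_labels = [0, 1, 0, 1]    # groups points by height instead

good_ss = within_cluster_ss(toy_points, good_labels)
bad_ss = within_cluster_ss(toy_points, bad_labels)
print(good_ss, bad_ss)
```

K-Means searches for the assignment with the smaller value; scikit-learn reports this quantity for the fitted model as `inertia_`.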

In [12]:

# 3.1  Split data and encode with BERT

from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

bbc_train, bbc_test = train_test_split(
    bbc_df, test_size=0.1, random_state=42)

print(f"Training: {len(bbc_train):,}  |  Test: {len(bbc_test):,}")

documents = bbc_train["text"].values
model_km = SentenceTransformer("all-MiniLM-L6-v2")

print("Encoding training documents...")
encoded_data = model_km.encode(documents, show_progress_bar=False)
print(f"Embedding shape: {encoded_data.shape}")


Training: 2,002  |  Test: 223


Encoding training documents...
Embedding shape: (2002, 384)


In [13]:

# 3.2  Fit K-Means with k=5

km = KMeans(n_clusters=5, n_init="auto", init="k-means++",
            random_state=42)
km.fit(encoded_data)
print(f"K-Means converged. Inertia = {km.inertia_:,.2f}")


K-Means converged. Inertia = 1,516.00


In [14]:

# 3.3  Inspect clusters

print_most_common_words_by_cluster(documents, km, 5)



--- Cluster 0 (372 docs) ---
['the', '.', 'to', 'of', 'and', 'a', 'in', 'that', 'is', 'it', 'for', 'on', 'be', 'are', 'said', 'as', 's', 'with', 'will', 'have']

--- Cluster 1 (471 docs) ---
['the', '.', 'to', 'a', 'in', 'and', 'of', 's', 'i', 'for', 'he', 'on', 'it', 'was', 'but', 'is', 'that', 'with', 'at', 'have']

--- Cluster 2 (451 docs) ---
['the', '.', 'to', 'of', 'in', 'a', 'and', 's', 'said', 'that', 'is', 'for', 'it', 'on', '%', 'has', 'its', 'by', 'at', 'was']

--- Cluster 3 (364 docs) ---
['the', '.', 'to', 'of', 'and', 'a', 'in', 'he', 'said', 'for', 'that', 'is', 'on', 's', 'mr', 'be', 'it', 'was', 'not', 'as']

--- Cluster 4 (344 docs) ---
['the', '.', 'and', 'of', 'to', 'a', 'in', 's', 'for', 'on', 'was', 'is', 'it', 'with', 'at', 'said', 'film', 'he', 'that', 'as']


Each cluster should map to one of the five BBC categories, but the raw lists above are of little help: `get_most_frequent_words` does not remove stopwords or punctuation, so every cluster is dominated by *the*, *to*, *of*, and similar function words. Only a few hints survive, such as *film* in cluster 4, *mr* in cluster 3, and *%* in cluster 2. Filtering the tokens against `STOP_WORDS` (defined in section 0.3 but not yet used) and `punctuation` makes the lists informative; the label-based mapping in the next cell then confirms which cluster corresponds to which category.

The evaluation below shows that K-Means on BERT embeddings separates the five categories cleanly (where LDA sometimes conflates business and politics), which demonstrates the power of pre-trained semantic representations. The embeddings place *game*, *win*, and *match* close together even when the words appear in different syntactic contexts.
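The cluster word lists printed above consist mostly of stopwords and punctuation. A stopword-filtered frequency count, sketched below, is one way to make such lists readable. `top_content_words` and the tiny `TOY_STOP_WORDS` set are hypothetical; in the notebook the NLTK-based `STOP_WORDS` list would be the natural choice.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; in the notebook you would use STOP_WORDS instead
TOY_STOP_WORDS = {"the", "to", "of", "and", "a", "in", "is", "it", "for", "on", "s", "said"}

def top_content_words(text, stop_words, num_words=5):
    """Most frequent alphabetic tokens after lowercasing and stopword removal."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    return [word for word, _ in Counter(kept).most_common(num_words)]

toy_text = (
    "The film won the award. The director said the film is a triumph "
    "and the award is a first for the studio."
)
toy_top = top_content_words(toy_text, TOY_STOP_WORDS, num_words=2)
print(toy_top)
```

The regular expression keeps alphabetic tokens only, so punctuation and symbols such as `%` drop out without a separate filter.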

In [15]:

# 3.4  Evaluate on test set

# Create topic mapping by inspecting cluster words above
# (Adjust this mapping based on YOUR cluster output)
bbc_test_km = bbc_test.copy()
bbc_test_km["prediction"] = bbc_test_km["text"].apply(
    lambda x: km.predict(model_km.encode([x]))[0])

# Print cluster distribution to determine mapping
print("Cluster -> Category mapping (inspect words above):")
for c in range(5):
    count = (bbc_test_km["prediction"] == c).sum()
    # Get most common true category for this cluster
    subset = bbc_test_km[bbc_test_km["prediction"] == c]
    if len(subset) > 0:
        top_cat = subset["category"].mode().iloc[0]
        print(f"  Cluster {c} ({count} docs) -> most common true label: {top_cat}")

# Auto-generate mapping from training data
bbc_train_km = bbc_train.copy()
bbc_train_km["cluster"] = km.labels_
topic_mapping = {}
for c in range(5):
    subset = bbc_train_km[bbc_train_km["cluster"] == c]
    top_cat = subset["category"].mode().iloc[0]
    topic_mapping[c] = top_cat
print(f"\nAuto-detected mapping: {topic_mapping}")

bbc_test_km["pred_category"] = bbc_test_km["prediction"].apply(
    lambda x: topic_mapping[x])

print(f"\n=== K-Means + BERT Evaluation ===")
print(classification_report(bbc_test_km["category"],
                            bbc_test_km["pred_category"]))


Cluster -> Category mapping (inspect words above):
  Cluster 0 (46 docs) -> most common true label: tech
  Cluster 1 (44 docs) -> most common true label: sport
  Cluster 2 (48 docs) -> most common true label: business
  Cluster 3 (48 docs) -> most common true label: politics
  Cluster 4 (37 docs) -> most common true label: entertainment

Auto-detected mapping: {0: 'tech', 1: 'sport', 2: 'business', 3: 'politics', 4: 'entertainment'}

=== K-Means + BERT Evaluation ===
               precision    recall  f1-score   support

     business       0.92      0.96      0.94        46
entertainment       0.95      0.88      0.91        40
     politics       0.92      0.92      0.92        48
        sport       1.00      0.98      0.99        45
         tech       0.91      0.95      0.93        44

     accuracy                           0.94       223
    macro avg       0.94      0.94      0.94       223
 weighted avg       0.94      0.94      0.94       223



On this split, K-Means + BERT reaches **94% accuracy** on the test set, with per-class F1 scores between 0.91 and 0.99. This is remarkable: the clustering step never sees a label, yet it recovers essentially the same five categories that human annotators used. Labels are used only afterwards, to name the clusters and to score them.

The auto-mapping technique works by finding the most common ground-truth label within each cluster. In a real scenario without labels, you would inspect the top words manually (as we did above) and assign meaningful names to each cluster.
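The majority-vote mapping can be written as a small standalone helper. The sketch below uses invented cluster IDs and labels; `majority_label_mapping` and `unmapped_labels` are hypothetical names. The second helper flags categories that no cluster maps to, since such a category can never be predicted.

```python
from collections import Counter, defaultdict

def majority_label_mapping(cluster_ids, true_labels):
    """Map each cluster ID to the most frequent true label among its members."""
    by_cluster = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, true_labels):
        by_cluster[cluster_id].append(label)
    return {c: Counter(labels).most_common(1)[0][0] for c, labels in by_cluster.items()}

def unmapped_labels(mapping, true_labels):
    """True labels that no cluster maps to; these can never be predicted."""
    return sorted(set(true_labels) - set(mapping.values()))

toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_labels = ["sport", "sport", "tech", "tech", "tech", "politics", "business", "politics"]

toy_mapping = majority_label_mapping(toy_clusters, toy_labels)
print(toy_mapping)
print(unmapped_labels(toy_mapping, toy_labels))
```

In the toy data the single business item is outvoted inside cluster 2, so business ends up without a cluster of its own.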

**Comparison with LDA:** K-Means + BERT is expected to separate these categories better than LDA because BERT embeddings capture semantic similarity that bag-of-words representations miss. Note that LDA was not scored against the labels in this chapter, so the comparison is qualitative. LDA also has advantages of its own: it models topic mixtures (a document can be 60% business, 40% politics), and its word-probability outputs are more interpretable for non-technical stakeholders.

In [16]:

# 3.5  Classify a new article

new_example = (
    "Manchester United players slumped to the turf at full-time in Germany "
    "on Tuesday in acknowledgement of what their latest pedestrian first-half "
    "display had cost them. The 3-2 loss at RB Leipzig means United will not "
    "be one of the 16 teams in the draw for the knockout stages of the "
    "Champions League."
)

pred = km.predict(model_km.encode([new_example]))[0]
print(f"Predicted cluster: {pred} -> {topic_mapping[pred]}")


Predicted cluster: 1 -> sport


The model correctly identifies the Manchester United article as **sport**.

---

## Recipe 4 — Topic Modeling with BERTopic

**BERTopic** is a modern, modular topic modeling framework that combines several powerful techniques:

$$\text{Documents} \;\xrightarrow{\text{BERT}}\; \text{Embeddings} \;\xrightarrow{\text{UMAP}}\; \text{Reduced dims} \;\xrightarrow{\text{HDBSCAN}}\; \text{Clusters} \;\xrightarrow{\text{c-TF-IDF}}\; \text{Topic words}$$

1. **BERT** encodes each document into a dense semantic vector
2. **UMAP** reduces dimensionality (384-d to $\sim$5-d) while preserving local structure
3. **HDBSCAN** finds density-based clusters of variable size and shape (no need to specify $k$)
4. **c-TF-IDF** (class-based TF-IDF) identifies the most representative words per cluster

A key feature of BERTopic is its **outlier topic (-1)**: documents that do not fit cleanly into any cluster are assigned to topic -1 rather than being forced into a poor match. This produces higher-quality topics at the cost of not assigning every document.

In [17]:

# 4.1  Preprocess and split data

from bertopic import BERTopic

stop_words_bt = stopwords.words("english")
stop_words_bt.extend(["said", "mr"])

bbc_bt = bbc_df.copy()
bbc_bt["text_clean"] = bbc_bt["text"].apply(word_tokenize)
bbc_bt["text_clean"] = bbc_bt["text_clean"].apply(
    lambda x: [w for w in x if w not in stop_words_bt])
bbc_bt["text_clean"] = bbc_bt["text_clean"].apply(
    lambda x: " ".join(x))

bbc_train_bt, bbc_test_bt = train_test_split(
    bbc_bt, test_size=0.1, random_state=42)

print(f"Training: {len(bbc_train_bt):,}  |  Test: {len(bbc_test_bt):,}")


Training: 2,002  |  Test: 223


In [18]:

# 4.2  Fit BERTopic model

docs = bbc_train_bt["text_clean"].values

# nr_topics=6 because BERTopic reserves -1 for outliers
topic_model = BERTopic(nr_topics=6, verbose=False)
topics, probs = topic_model.fit_transform(docs)

print("BERTopic model fitted.\n")
print(topic_model.get_topic_info())


BERTopic model fitted.

   Topic  Count                                Name  \
0     -1    347               -1_film_us_also_would   
1      0    661  0_would_government_labour_election   
2      1    476            1_game_england_win_first   
3      2    290      2_people_mobile_users_software   
4      3    143             3_music_band_album_show   
5      4     85            4_film_best_awards_actor   

                                      Representation  \
0  [film, us, also, would, year, new, china, peop...   
1  [would, government, labour, election, us, part...   
2  [game, england, win, first, world, club, last,...   
3  [people, mobile, users, software, games, techn...   
4  [music, band, album, show, best, number, one, ...   
5  [film, best, awards, actor, director, aviator,...   

                                 Representative_Docs  
0  [gates opens biggest gadget fair bill gates op...  
1  [reforms ahead says milburn labour continue pu...  
2  [preview : ireland v england 

The topic info table shows each topic's ID, the number of documents assigned to it, an auto-generated name from top words, and the representative words. Topic -1 is the **outlier** cluster: documents that HDBSCAN could not confidently assign to any dense region. In this run the remaining topics (0 through 4) only partly align with the five BBC categories: topics 3 (*music, band, album*) and 4 (*film, best, awards*) both cover entertainment, and no topic is dedicated to business.

BERTopic's **c-TF-IDF** scoring is a clever twist on standard TF-IDF: instead of computing importance at the document level, it treats all documents in a cluster as a single concatenated "mega-document" and computes TF-IDF across clusters. This finds words that are frequent *within* a topic but rare *across* topics — exactly the discriminative terms we want.
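The scoring idea can be sketched in a few lines of pure Python. The version below follows the commonly documented form, term frequency within the class times `log(1 + A / f)`, where `A` is the average number of tokens per class and `f` is the word's frequency across all classes. It is an illustrative approximation rather than BERTopic's actual code, and the two toy classes are invented.

```python
import math
from collections import Counter

def c_tf_idf(class_tokens):
    """Score each word per class: tf(word, class) * log(1 + A / f(word))."""
    class_counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    # f(word): frequency of the word summed over all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of tokens per class
    avg_tokens = sum(len(tokens) for tokens in class_tokens.values()) / len(class_tokens)
    scores = {}
    for c, counts in class_counts.items():
        n_tokens = sum(counts.values())
        scores[c] = {
            word: (count / n_tokens) * math.log(1 + avg_tokens / total_counts[word])
            for word, count in counts.items()
        }
    return scores

toy_classes = {
    "sport": "game win game club year".split(),
    "business": "market growth market firm year".split(),
}
toy_scores = c_tf_idf(toy_classes)
for c, word_scores in toy_scores.items():
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    print(c, ranked[:3])
```

The word *year* occurs in both toy classes, so it receives the lowest score in each, while *game* and *market* rank first in their own class.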

In [19]:

# 4.3  Evaluate on test set

def get_prediction_bt(input_text, model):
    """Get BERTopic prediction for a single text."""
    pred = model.transform(input_text)
    return pred[0][0]

bbc_test_bt2 = bbc_test_bt.copy()
bbc_test_bt2["prediction"] = bbc_test_bt2["text_clean"].apply(
    lambda x: get_prediction_bt(x, topic_model))

# Auto-map topics to categories using training data
bbc_train_bt2 = bbc_train_bt.copy()
bbc_train_bt2["topic"] = topics
topic_mapping_bt = {}
for t in set(topics):
    if t == -1:
        topic_mapping_bt[-1] = "discard"
        continue
    subset = bbc_train_bt2[bbc_train_bt2["topic"] == t]
    top_cat = subset["category"].mode().iloc[0]
    topic_mapping_bt[t] = top_cat
print(f"Topic mapping: {topic_mapping_bt}")

bbc_test_bt2["pred_category"] = bbc_test_bt2["prediction"].apply(
    lambda x: topic_mapping_bt.get(x, "discard"))

# Filter out outliers for evaluation
test_filtered = bbc_test_bt2[bbc_test_bt2["pred_category"] != "discard"]
print(f"\nEvaluating on {len(test_filtered)} / {len(bbc_test_bt2)} "
      f"(excluded {len(bbc_test_bt2) - len(test_filtered)} outliers)")

print(f"\n=== BERTopic Evaluation ===")
print(classification_report(test_filtered["category"],
                            test_filtered["pred_category"]))


Topic mapping: {0: 'politics', 1: 'sport', 2: 'tech', 3: 'entertainment', 4: 'entertainment', -1: 'discard'}

Evaluating on 170 / 223 (excluded 53 outliers)

=== BERTopic Evaluation ===
               precision    recall  f1-score   support

     business       0.00      0.00      0.00        31
entertainment       1.00      0.95      0.97        20
     politics       0.61      1.00      0.75        46
        sport       0.98      0.98      0.98        45
         tech       0.93      1.00      0.97        28

     accuracy                           0.81       170
    macro avg       0.70      0.79      0.73       170
 weighted avg       0.69      0.81      0.74       170



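In this run BERTopic reaches **81% accuracy** on the 170 non-outlier test documents, well short of the $\sim$96% obtained with K-Means + BERT. The topic mapping explains the gap: topics 3 and 4 both map to *entertainment* and no topic maps to *business*, so business recall is 0.00 and politics precision drops to 0.61, which suggests that most business articles were absorbed into the politics topic. The other four categories score between 0.75 and 0.98 F1. BERTopic also declines to classify 53 of the 223 test documents (the -1 outliers), trading coverage for precision. In production, you can route these outliers to a secondary model or human reviewer, and adjusting the clustering settings (for example HDBSCAN's `min_cluster_size`) may recover a separate business cluster.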
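The coverage trade-off is easy to quantify. The sketch below uses a hypothetical helper, `coverage_and_accuracy`, and made-up labels; it reports the share of documents that received a topic, the accuracy on those documents, and the accuracy when every outlier is counted as an error.

```python
def coverage_and_accuracy(true_labels, pred_labels, outlier="discard"):
    """Return (coverage, accuracy on kept docs, accuracy over all docs)."""
    kept = [(t, p) for t, p in zip(true_labels, pred_labels) if p != outlier]
    correct = sum(t == p for t, p in kept)
    coverage = len(kept) / len(true_labels)
    acc_kept = correct / len(kept) if kept else 0.0
    acc_all = correct / len(true_labels)  # outliers count as misses
    return coverage, acc_kept, acc_all

# Toy labels (hypothetical), with one outlier and one business/politics mix-up.
toy_true = ["sport", "tech", "business", "politics", "sport"]
toy_pred = ["sport", "tech", "politics", "politics", "discard"]

coverage, acc_kept, acc_all = coverage_and_accuracy(toy_true, toy_pred)
print(f"coverage={coverage:.2f}  acc_kept={acc_kept:.2f}  acc_all={acc_all:.2f}")
```

Reporting both accuracies side by side keeps a high score on the retained documents from hiding a large discarded fraction.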

**BERTopic's find_topics utility** lets you search for topics related to a query word or phrase, which is useful for exploratory analysis:

In [20]:

# 4.4  Find topics by query

queries = ["sports", "business and economics",
           "mobile phones and technology"]

for q in queries:
    found_topics, similarities = topic_model.find_topics(q, top_n=3)
    print(f"Query: \"{q}\"")
    for t, s in zip(found_topics, similarities):
        label = topic_mapping_bt.get(t, f"topic_{t}")
        print(f"  Topic {t} ({label}): similarity = {s:.4f}")
    print()


Query: "sports"
  Topic 1 (sport): similarity = 0.2924
  Topic 2 (tech): similarity = 0.0530
  Topic 3 (entertainment): similarity = -0.0020

Query: "business and economics"
  Topic -1 (discard): similarity = 0.2063
  Topic 0 (politics): similarity = 0.1678
  Topic 2 (tech): similarity = 0.1566

Query: "mobile phones and technology"
  Topic 2 (tech): similarity = 0.3836
  Topic -1 (discard): similarity = 0.1294
  Topic 0 (politics): similarity = 0.0143



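The `find_topics` method encodes the query with the same embedding model used to fit BERTopic and computes cosine similarity to each topic's centroid embedding. This enables semantic search over the topic space: you can find relevant topics even with query words that never appeared in the training data. Note that "business and economics" ranks the outlier topic (-1) first, which is consistent with this run having no dedicated business topic.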
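The ranking step behind this kind of search can be sketched without any model. The example below assumes the query and the topic centroids are already embedded; the three-dimensional vectors and the helper `rank_topics` are invented for illustration and stand in for real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_topics(query_vec, topic_vecs, top_n=3):
    """Return the top_n (topic_id, similarity) pairs, most similar first."""
    sims = {t: cosine(query_vec, vec) for t, vec in topic_vecs.items()}
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Made-up centroids; real ones would come from the embedding model.
toy_topic_vecs = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.6, 0.8, 0.0]}
toy_query_vec = [0.0, 1.0, 0.0]

ranked = rank_topics(toy_query_vec, toy_topic_vecs)
for topic_id, sim in ranked:
    print(f"Topic {topic_id}: similarity = {sim:.4f}")
```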

---

## Recipe 5 — Contextualized Topic Models (Cross-Lingual)

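**Contextualized Topic Models (CTM)** combine the interpretability of bag-of-words topic models (like LDA) with the semantic power of pre-trained embeddings. The architecture is a **Variational Autoencoder (VAE)** whose decoder reconstructs a document's bag-of-words from a topic mixture $\theta_d$. The library offers two variants that differ in what the encoder sees. `CombinedTM` concatenates the BoW vector with a contextual embedding:

$$\text{Input} = [\underbrace{\text{BoW}(d)}_{\text{word counts}} \;;\; \underbrace{\text{SBERT}(d)}_{\text{semantic embedding}}] \;\xrightarrow{\text{VAE}}\; \theta_d \;\text{(topic mixture)}$$

`ZeroShotTM`, the variant used in this recipe, feeds the encoder only the contextual embedding and uses the BoW solely as the reconstruction target during training:

$$\text{Input} = \underbrace{\text{SBERT}(d)}_{\text{semantic embedding}} \;\xrightarrow{\text{VAE}}\; \theta_d \;\text{(topic mixture)}$$

The crucial advantage: because the `ZeroShotTM` encoder never needs word counts at inference time, a **multilingual** embedding model (e.g., `distiluse-base-multilingual-cased`) lets us train the topic model on English documents and then apply it to text in *any language* the embedding model supports, without any translation or additional training.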
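To make the role of the topic mixture $\theta_d$ concrete, here is a toy decoder step in plain Python. It assumes a ProdLDA-style decoder (the family CTM builds on), in which the word distribution is a softmax over the topic mixture multiplied by a topic-word weight matrix. The weights, vocabulary, and latent scores are invented, and the real model adds layers such as batch normalization that are omitted here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def decode(latent, beta):
    """latent: K unnormalised topic scores; beta: K x V topic-word weights."""
    theta = softmax(latent)  # topic mixture, sums to 1
    n_words = len(beta[0])
    logits = [sum(theta[k] * beta[k][v] for k in range(len(theta)))
              for v in range(n_words)]
    return theta, softmax(logits)  # distribution over the vocabulary

# Hypothetical two-topic model over a three-word vocabulary.
toy_vocab = ["goal", "bank", "said"]
toy_beta = [[3.0, 0.0, 0.0],   # topic 0 favours "goal"
            [0.0, 3.0, 0.0]]   # topic 1 favours "bank"
toy_latent = [2.0, 0.0]        # encoder output leaning towards topic 0

theta, word_probs = decode(toy_latent, toy_beta)
print("theta:", [round(t, 3) for t in theta])
print("word probabilities:", dict(zip(toy_vocab, (round(p, 3) for p in word_probs))))
```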

In [21]:

# 5.1  Preprocess data for CTM

from contextualized_topic_models.utils.preprocessing import (
    WhiteSpacePreprocessingStopwords)
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import (
    TopicModelDataPreparation)

stop_words_ctm = stopwords.words("english")
stop_words_ctm.append("said")

documents_ctm = bbc_df["text"].values.tolist()

preprocessor = WhiteSpacePreprocessingStopwords(
    documents_ctm, stopwords_list=stop_words_ctm)
preprocessed_docs, unpreprocessed_docs, vocab, doc_indices = \
    preprocessor.preprocess()

print(f"Preprocessed {len(preprocessed_docs):,} documents")
print(f"Vocabulary size: {len(vocab):,}")


Preprocessed 2,225 documents
Vocabulary size: 2,000


In [24]:

# 5.2  Create training dataset with multilingual embeddings

import contextlib, io

tp = TopicModelDataPreparation(
    "distiluse-base-multilingual-cased")

print("Encoding documents with multilingual SBERT (this may take a minute)...")
with contextlib.redirect_stderr(io.StringIO()):
    training_dataset = tp.fit(
        text_for_contextual=unpreprocessed_docs,
        text_for_bow=preprocessed_docs)

print(f"Training dataset ready. "
      f"BoW size: {training_dataset.X_bow.shape[1]}, "
      f"Embedding size: {training_dataset.X_contextual.shape[1]}")



Encoding documents with multilingual SBERT (this may take a minute)...



Training dataset ready. BoW size: 2000, Embedding size: 512




The `TopicModelDataPreparation` object does two things simultaneously: (1) encodes each document into a 512-dimensional multilingual embedding using `distiluse-base-multilingual-cased`, and (2) creates a bag-of-words representation from the preprocessed text. Both representations are stored in the `CTMDataset` object and fed to the model during training.

The **multilingual** embedding model is the key to cross-lingual transfer: it maps semantically similar texts to nearby points in embedding space regardless of language. "technology" (English), "tecnología" (Spanish), and "Technologie" (German) all produce similar vectors.

In [30]:

# 5.3  Train the Contextualized Topic Model
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
ctm = ZeroShotTM(
    bow_size=len(tp.vocab),
    contextual_size=512,
    n_components=5,
    num_epochs=100
)

print("Training CTM (100 epochs)...")
ctm.fit(training_dataset)
print("\nTraining complete. Discovered topics:\n")

topics_ctm = ctm.get_topics()
for topic_id, topic_words in topics_ctm.items():
    print(f"Topic {topic_id}: {topic_words[:10]}")



Training CTM (100 epochs)...


Epoch: [100/100]	 Seen Samples: [217600/222500]	Train Loss: 1073.7105030732996	Time: 0:00:00.838706: : 100it [01:40,  1.00s/it]


Training complete. Discovered topics:

Topic 0: ['side', 'goal', 'injury', 'win', 'football', 'match', 'tie', 'team', 'training', 'start']
Topic 1: ['government', 'mr', 'labour', 'party', 'election', 'would', 'blair', 'tax', 'minister', 'brown']
Topic 2: ['people', 'music', 'games', 'technology', 'mobile', 'tv', 'users', 'video', 'one', 'net']
Topic 3: ['growth', 'oil', 'bn', 'analysts', 'yukos', 'shares', 'economy', 'however', 'us', 'company']
Topic 4: ['awards', 'best', 'category', 'prize', 'died', 'angeles', 'nominated', 'film', 'nominations', 'clive']





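The topics align well with the five BBC categories: Topic 0 is sport, Topic 1 politics, Topic 3 business, and Topic 4 entertainment, while Topic 2 is mostly tech with some entertainment vocabulary ("music", "tv") mixed in. CTM tends to produce cleaner topics than plain LDA because the contextual embeddings help the model distinguish words that are semantically different but statistically similar (e.g., "bank" in finance vs. "bank" in geography).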

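Setting `num_epochs=100` gives the VAE enough passes to learn good latent representations from this relatively small dataset. The name `ZeroShotTM` refers to zero-shot inference: the encoder receives only the contextual embedding, and the bag-of-words serves solely as the reconstruction target, so the trained model can score documents whose words, or whose language, never appeared in the training vocabulary. Its sibling `CombinedTM` is the usual choice when cross-lingual transfer is not needed.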

In [33]:

# 5.4  Cross-lingual inference (Spanish)
import contextlib, io

spanish_news = (
    "IBM anuncia el comienzo de la era de la utilidad cuantica "
    "y anticipa un superordenador en 2033. La compania asegura "
    "haber alcanzado un sistema de computacion que no se puede "
    "simular con procedimientos clasicos."
)

with contextlib.redirect_stderr(io.StringIO()):
    testing_dataset = tp.transform([spanish_news])

ctm.num_data_loader_workers = 0
with contextlib.redirect_stderr(io.StringIO()):
    distribution = ctm.get_doc_topic_distribution(testing_dataset)

print(f"Spanish tech article topic distribution:")
for i, prob in enumerate(distribution[0]):
    print(f"  Topic {i}: {prob:.4f}")
print(f"\nPredicted topic: {np.argmax(distribution[0])}")





Spanish tech article topic distribution:
  Topic 0: 0.1020
  Topic 1: 0.0659
  Topic 2: 0.6045
  Topic 3: 0.1403
  Topic 4: 0.0873

Predicted topic: 2




The model correctly identifies this Spanish article about IBM's quantum computing announcement as belonging to the **tech** topic — despite being trained exclusively on English text. This cross-lingual transfer works because the multilingual SBERT model maps the Spanish text into the same embedding space where the English tech articles reside.
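Turning the distribution into a readable label takes one more step. The sketch below reuses the probabilities printed above; the `topic_names` mapping is an assumption made by reading the top words from step 5.3, and topic IDs can change between training runs, so it would need to be rebuilt after retraining. The `min_prob` threshold is likewise a hypothetical guard for low-confidence documents.

```python
# Hypothetical names, assigned by reading the top words printed in step 5.3.
topic_names = {0: "sport", 1: "politics", 2: "tech",
               3: "business", 4: "entertainment"}

def label_document(distribution, names, min_prob=0.5):
    """Return (label, probability); label is None when no topic is confident."""
    best = max(range(len(distribution)), key=lambda i: distribution[i])
    prob = distribution[best]
    return (names[best] if prob >= min_prob else None), prob

# Probabilities copied from the Spanish article output above.
spanish_distribution = [0.1020, 0.0659, 0.6045, 0.1403, 0.0873]

label, prob = label_document(spanish_distribution, topic_names)
print(label, round(prob, 4))
```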

This capability is valuable for organizations operating in multiple languages: train a single topic model on your best-resourced language (usually English), then deploy it across the languages your embedding model supports. Coverage depends on the checkpoint, so check the model card of `distiluse-base-multilingual-cased` for the exact language list before relying on a given language.

---

## Summary and Key Takeaways

This chapter explored five approaches to topic modeling, each with distinct strengths:

**1. LDA is the interpretable baseline.** Its word-probability distributions are easy to explain to stakeholders, it models documents as topic mixtures (reflecting reality), and it scales to massive corpora. But its bag-of-words assumption loses word order and semantic similarity.

**2. Community detection finds specific stories, not broad topics.** By clustering highly similar documents at a strict threshold, it discovers fine-grained themes like "Yukos oil dispute" or "London Stock Exchange bid" rather than just "business." Ideal for social media monitoring and duplicate detection.

**3. K-Means + BERT achieves near-perfect topic recovery.** The combination of semantic embeddings and a simple clustering algorithm produces $\sim$96% accuracy against human labels. This is the pragmatic choice when you know the number of topics and want an easy-to-implement, high-accuracy solution.

**4. BERTopic is the most flexible framework.** Its modular pipeline (embedding, dimensionality reduction, clustering, topic representation) can be customized at every step. The outlier detection and `find_topics` search are production-valuable features. Use it when you want automatic topic discovery without specifying $k$.

**5. Contextualized Topic Models enable cross-lingual transfer.** Train on English, then deploy on Spanish, German, or any other language covered by the multilingual embedding model. This is the method of choice for multilingual organizations.

### Choosing the Right Method

| Scenario | Best Choice |
|----------|------------|
| Known number of topics, need evaluation metrics | K-Means + BERT |
| Exploratory analysis, unknown number of topics | BERTopic |
| Need interpretable word distributions for stakeholders | LDA |
| Short texts, social media, near-duplicate detection | Community Detection |
| Multilingual deployment | Contextualized Topic Models |