#### Topic modeling automatically finds the main themes (“topics”) in a pile of documents.
#### Each topic is a weighted set of related words, and each document is a mix of a few topics.

### intuition
Imagine 1,000 support tickets thrown into one box.

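The algorithm looks for groups of words that often appear together (e.g., login, sso, token), treats each group as a topic, and then describes each ticket as a mix: say, 60% “Login Issues”, 30% “Network”, 10% “Billing”. The names are yours; the algorithm only gives you the word groups and the percentages.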

That’s it. The rest is how we represent text and which algorithm we use.

### the moving parts 

Document: one email/ticket/review.

Token: a word (or bigram like “access token”).

Vocabulary: unique tokens across your corpus.

Topic: a probability distribution over the vocabulary (its top-weighted words help you name the topic).

Document–Topic mix: for each document, how much of each topic it contains.
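To make these terms concrete, here is a minimal sketch using only the Python standard library. The three example tickets, the `tokenize` helper, and the hand-written topic and mix values are all made up for illustration; a real model would learn the last two from data.

```python
import re
from collections import Counter

# Documents: one ticket each (hypothetical examples).
docs = [
    "Cannot login after SSO token expired",
    "SSO login loop, token rejected",
    "Invoice shows wrong billing amount",
]

def tokenize(text):
    # Tokens: lowercase alphabetic runs. A real pipeline would usually
    # also drop stop words and may add bigrams.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]

# Vocabulary: unique tokens across the corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Bag-of-words representation: token counts per document.
doc_term_counts = [Counter(doc) for doc in tokenized]

# Topic: a distribution over words (hand-written here, not learned).
login_topic = {"login": 0.4, "sso": 0.3, "token": 0.3}

# Document-topic mix: how much of each topic one document contains
# (hand-written here, not learned).
doc_topic_mix = {"Login Issues": 0.6, "Network": 0.3, "Billing": 0.1}
```

Both the topic and the mix are distributions, so each sums to 1.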

### three families of methods

##### Classic (bag-of-words)

LDA (Latent Dirichlet Allocation) — probabilistic; the textbook baseline.

NMF (Non-negative Matrix Factorization) — linear algebra; very practical.

LDA is defined on raw word counts; NMF works on counts or tf-idf (tf-idf is the usual choice). No deep learning required.

##### Deep learning flavored

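Neural Topic Models (NTM): small neural nets (often VAE-style) that learn topics. Variants include ProdLDA, ETM (uses word embeddings), and CTM (Contextualized Topic Models, which feed sentence-embedding vectors to the encoder). If you want to condition on metadata such as labels, look at models like SCHOLAR.

Conceptually similar to LDA, but the topics are learned by gradient descent on a variational objective instead of by sampling or classic variational updates.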

##### Transformer era (semantic embeddings + clustering)

Use sentence/document embeddings (e.g., all-MiniLM-L6-v2), cluster them, then extract top words per cluster.

Tools: BERTopic, Top2Vec, or your own “Embeddings + KMeans + keywords”.

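Often the easiest way to get good, meaningful topics on short texts. One difference from LDA/NMF: plain clustering assigns each document to a single topic rather than a mix.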

### how to choose

Short texts / slang / mixed language → Embeddings + Clustering (BERTopic).

You need simple, explainable math → LDA/NMF.

You have metadata (labels/time) & want flexible models → Neural Topic Models.

### evaluation (how to tell if topics are “good”)

Topic coherence (C_v, NPMI): do top words in a topic co-occur in real docs?

Diversity: do topics avoid repeating the same top words?

Human sense-check: can a person name each topic in 5 seconds?

Downstream usefulness: dashboards, routing, search improvement, etc.
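The first two checks are easy to compute yourself. The sketch below is one simple reading of them, with assumptions: NPMI is estimated from whole-document co-occurrence (library implementations often use sliding windows and smoothing), C_v is not covered, and both function names and the toy documents are made up for illustration.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, tokenized_docs):
    """Mean NPMI over all pairs of a topic's top words."""
    doc_sets = [set(d) for d in tokenized_docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = prob(a, b)
        if p_ab == 0.0:
            scores.append(-1.0)   # never co-occur: lowest possible NPMI
        elif p_ab == 1.0:
            scores.append(1.0)    # degenerate case: both words in every doc
        else:
            pmi = math.log(p_ab / (prob(a) * prob(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)

def topic_diversity(topics):
    """Share of unique words among all topics' top words (1.0 = no overlap)."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

# Hypothetical tokenized tickets.
tokenized_docs = [
    ["login", "sso", "token"],
    ["login", "token", "expired"],
    ["billing", "invoice", "refund"],
    ["invoice", "refund", "card"],
]

coherent = npmi_coherence(["login", "token"], tokenized_docs)
incoherent = npmi_coherence(["login", "invoice"], tokenized_docs)
diversity = topic_diversity([["login", "token"], ["login", "invoice"]])
print(coherent, incoherent, diversity)
```

Here "login" and "token" always appear together, so that topic scores at the top of the NPMI range. "login" and "invoice" never do, so it scores at the bottom. Three unique words out of four give a diversity of 0.75.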