# Day-55: Topic Modeling (LDA)

We've covered what text is, what it means, and how it feels. 

Today, we tackle the big picture: Topic Modeling. This is a powerful machine learning technique that allows us to scan massive collections of documents (a corpus) and automatically discover the abstract topics that run through them. We'll focus on the most famous algorithm for this task: Latent Dirichlet Allocation (LDA).

## Topics Covered

Latent Dirichlet Allocation, Document-Topic Distributions

## Latent Dirichlet Allocation (LDA): Uncovering Hidden Themes 

LDA is a generative statistical model that assumes every document is a mixture of topics, and every topic is, in turn, a probability distribution over words. Words that frequently occur together across documents end up with high probability in the same topic.

- Generative Assumption: LDA imagines that each document was written by the following process:

    1. Randomly choosing a mixture of topics for the document (e.g., 80% "Space Exploration," 20% "Finance").

    2. For each word position, randomly choosing a topic from that mixture, then randomly choosing a word from that topic (e.g., "rocket," "Mars," "launch").

    3. Repeating step 2 until the document is complete.

- The Goal: LDA reverses this process, inferring the topics and the per-document topic mixtures that are most likely to have generated the observed corpus.

- Analogy: The Cook's Recipe Book. Imagine you have a large library of recipes. LDA doesn't read the titles; it looks at the ingredients and groups those that tend to appear together. LDA itself only returns numbered topics; the names are ones a human reader assigns after looking at the top words:

    - If a "topic" frequently uses "flour," "sugar," and "butter," you would label it the 'Baking' Topic.

    - If a "topic" frequently uses "tomato," "pasta," and "basil," you would label it the 'Italian Food' Topic.

    - A single recipe (document) might be 80% 'Italian Food' and 20% 'Baking' (for the bread sticks).
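The generative story can be made concrete with a small simulation. The following is a minimal sketch using only the Python standard library, not a real LDA implementation: the two toy topics, their word probabilities, and the `alpha` value are all made up for illustration, and the helper names (`sample_dirichlet`, `generate_document`) are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one toy document by following LDA's generative story."""
    topic_names = list(topic_word)
    # Draw this document's topic mixture from a symmetric Dirichlet prior.
    theta = sample_dirichlet([alpha] * len(topic_names), rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to the mixture.
        topic = rng.choices(topic_names, weights=theta, k=1)[0]
        # Pick a word from that topic's word distribution.
        vocab = list(topic_word[topic])
        probs = [topic_word[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return dict(zip(topic_names, theta)), words

# Hypothetical topics (assumed values), reusing the recipe analogy.
toy_topics = {
    "baking": {"flour": 0.4, "sugar": 0.3, "butter": 0.3},
    "italian": {"tomato": 0.4, "pasta": 0.4, "basil": 0.2},
}

rng = random.Random(0)
mixture, doc = generate_document(toy_topics, alpha=0.5, length=8, rng=rng)
print(mixture)  # the document's topic proportions (they sum to 1)
print(doc)      # eight words drawn from the two topics
```

Training LDA runs this story in reverse: given only documents like `doc`, it estimates both the topic-word probabilities and each document's mixture.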

## Document-Topic Distributions: The Mixture

The output of an LDA model provides two key distributions that summarize the corpus:

1. Topic-Word Distribution: This shows the probability of a specific word appearing in a specific topic. This is how we define and interpret the topics.

    - Example: Topic 1: {(dog, 0.15), (cat, 0.10), (vet, 0.05), …}

2. Document-Topic Distribution: This shows the proportional mixture of topics within each document. This tells you what each document is about.

    - Example: Document A: Topic 1 (Pets): 85%, Topic 2 (Finance): 10%, Topic 3 (Travel): 5%.
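The two distributions combine in a simple way: under the model, the probability of seeing a word in a document is the sum, over topics, of the topic's share in that document times the word's probability in that topic. Here is a minimal sketch of that arithmetic. The Pets numbers follow the examples above; the Finance and Travel word probabilities are assumed values added for illustration.

```python
# Topic-word probabilities. Only a few words per topic are listed, so each
# row sums to less than 1. Finance and Travel entries are assumed values.
topic_word = {
    "Pets": {"dog": 0.15, "cat": 0.10, "vet": 0.05},
    "Finance": {"stock": 0.12, "bond": 0.08, "vet": 0.01},
    "Travel": {"flight": 0.10, "hotel": 0.09},
}

# Document-topic distribution for Document A.
doc_topic = {"Pets": 0.85, "Finance": 0.10, "Travel": 0.05}

def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        share * topic_word[topic].get(word, 0.0)
        for topic, share in doc_topic.items()
    )

# 0.85 * 0.05 (Pets) + 0.10 * 0.01 (Finance) + 0.05 * 0.0 (Travel)
print(round(word_probability("vet", doc_topic, topic_word), 4))  # 0.0435
```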

By analyzing the Document-Topic Distribution, you can cluster, filter, or categorize documents based on their content without ever manually reading and labeling them.
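As a rough illustration of that idea, the sketch below groups documents by their dominant topic. The distributions are hypothetical, and the 0.5 cutoff (`min_share`) is an arbitrary assumption for marking documents with no clear dominant topic, not a recommended setting.

```python
from collections import defaultdict

# Hypothetical document-topic distributions (assumed values).
doc_topics = {
    "Document A": {"Pets": 0.85, "Finance": 0.10, "Travel": 0.05},
    "Document B": {"Pets": 0.05, "Finance": 0.80, "Travel": 0.15},
    "Document C": {"Pets": 0.40, "Finance": 0.35, "Travel": 0.25},
}

def dominant_topic(distribution):
    """Return the topic with the largest share in a document."""
    return max(distribution, key=distribution.get)

def group_by_dominant_topic(doc_topics, min_share=0.5):
    """Group documents by dominant topic; weak winners are filed as 'mixed'."""
    groups = defaultdict(list)
    for name, dist in doc_topics.items():
        topic = dominant_topic(dist)
        label = topic if dist[topic] >= min_share else "mixed"
        groups[label].append(name)
    return dict(groups)

groups = group_by_dominant_topic(doc_topics)
print(groups)
# {'Pets': ['Document A'], 'Finance': ['Document B'], 'mixed': ['Document C']}
```

The same grouping works on real model output once each document's topic list has been converted to a dictionary.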

## Code Example: Topic Modeling with Gensim

We use Gensim, a widely used Python library for topic modeling that includes an LDA implementation. The example reuses the text cleaning and Bag-of-Words (BoW) feature extraction covered on previous days.

In [None]:
! pip uninstall -y gensim numpy

In [None]:
! pip install numpy==1.23.5

In [None]:
! pip install gensim numpy nltk

In [1]:
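import nltk
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (no-op if already present).
# Newer NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)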

# --- 0. Data Preparation (Simulation) ---
documents = [
    "Machine learning models predict stock prices using technical analysis and financial reports.",
    "The new rocket launch from NASA successfully put a satellite into orbit around the Earth.",
    "Neural networks and deep learning are crucial for complex machine learning tasks like image recognition.",
    "Astronauts and cosmonauts prepare for long duration space missions to the Moon and Mars.",
    "A massive portfolio diversification strategy involves stocks, bonds, and real estate investments."
]

# Simple Preprocessing (Tokenization, Lowercasing, Stopword Removal)
stop_words = set(stopwords.words('english'))
processed_docs = []
for doc in documents:
    # Tokenize and remove stopwords/punctuation
    tokens = [token.lower() for token in word_tokenize(doc) if token.isalpha() and token.lower() not in stop_words]
    processed_docs.append(tokens)

# --- 1. Feature Extraction (Dictionary and Corpus) ---
# Gensim uses its own dictionary and corpus format (BoW)

# Create a dictionary (Mapping of word -> unique ID)
dictionary = corpora.Dictionary(processed_docs)

# Create a DTM (Document-Term Matrix / Bag-of-Words Corpus)
# Each tuple is (word_id, count)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# --- 2. LDA Model Training ---
# We tell the model to find a specific number of topics (e.g., 2)
num_topics = 2
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    passes=10 # Number of training passes
)

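# --- 3. Interpretation (Topic-Word Distribution) ---
# Note: Gensim topic IDs are 0-based; the labels printed here are 1-based (ID 0 -> "Topic #1")
print("--- TOPIC-WORD DISTRIBUTIONS ---")
for idx, topic in lda_model.print_topics(num_words=5):
    print(f"Topic #{idx + 1}: {topic}")

# Ideally, one topic collects 'Finance/ML' words (e.g., 'stock', 'investments', 'learning')
# and the other collects 'Space' words (e.g., 'rocket', 'nasa', 'mars').
# Which topic number gets which theme is arbitrary, and with only five short
# documents the separation can be noisy.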

# --- 4. Document-Topic Distribution ---
# Get the topic distribution for the first document
doc_topic_distribution = lda_model.get_document_topics(corpus[0])
print("\n--- DOCUMENT-TOPIC DISTRIBUTION (Doc 1) ---")
print(f"Document 1 is about: {doc_topic_distribution}")
# Each tuple is (topic_id, probability), with 0-based topic IDs.
# Example: [(0, 0.05), (1, 0.95)] -> 5% Topic #1, 95% Topic #2

--- TOPIC-WORD DISTRIBUTIONS ---
Topic #1: 0.034*"prepare" + 0.034*"long" + 0.034*"mars" + 0.034*"cosmonauts" + 0.034*"moon"
Topic #2: 0.061*"learning" + 0.043*"machine" + 0.026*"technical" + 0.026*"prices" + 0.026*"using"

--- DOCUMENT-TOPIC DISTRIBUTION (Doc 1) ---
Document 1 is about: [(0, 0.043867763), (1, 0.95613223)]
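
The raw list of `(topic_id, probability)` tuples is easier to read once it is sorted and relabeled. The helper below is a hypothetical convenience function, not part of Gensim. It takes the values copied from the output above and applies the same 1-based labels used when printing the topics.

```python
# Values copied from the Document 1 output above.
doc_topic_distribution = [(0, 0.043867763), (1, 0.95613223)]

def describe(distribution):
    """Sort topics by share (largest first) and format them with 1-based labels."""
    ranked = sorted(distribution, key=lambda pair: pair[1], reverse=True)
    return [f"Topic #{topic_id + 1}: {share:.1%}" for topic_id, share in ranked]

print(describe(doc_topic_distribution))
# ['Topic #2: 95.6%', 'Topic #1: 4.4%']
```

Document 1 (the stock-price sentence) lands almost entirely in Topic #2, the topic whose top words are "learning" and "machine". That fits its "machine learning models" opening.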
