# 04 — Topic Modeling

**Objective:** Discover latent topics in Reddit posts using LDA (gensim), evaluate coherence across topic counts, and analyze how topic prevalence changes over time.

**Approach:**
- TF-IDF analysis for top-level term importance
- Gensim LDA with coherence-based model selection (testing k=5, 10, 15, 20)
- Topic labeling based on discovered keywords
- Cross-analysis: topics × sentiment, topics × time

In [1]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from collections import Counter
import time

from src.utils import load_config
from src.topics import TopicModeler

config = load_config('../config/config.yaml')

# Load sentiment-analyzed data
df = pd.read_parquet('../data/processed/posts_sentiment.parquet')
print(f"Loaded {len(df):,} posts for topic modeling")

2026-02-25 17:04:53.704 | INFO     | src.utils:load_config:45 - Configuration loaded from ..\config\config.yaml


Loaded 294,704 posts for topic modeling


## 4.1 TF-IDF Analysis

In [2]:
# Build TF-IDF matrix
modeler = TopicModeler()
modeler.build_tfidf(df, text_col='text_lemmatized', max_features=3000)

print(f"TF-IDF matrix shape: {modeler.tfidf_matrix.shape}")
print(f"Vocabulary size: {len(modeler.tfidf_vectorizer.vocabulary_):,}")

# Top terms by TF-IDF
feature_names = modeler.tfidf_vectorizer.get_feature_names_out()
tfidf_means = modeler.tfidf_matrix.mean(axis=0).A1
top_idx = tfidf_means.argsort()[-20:][::-1]
print(f"\nTop 20 terms by avg TF-IDF:")
for i in top_idx:
    print(f"  {feature_names[i]:<25} {tfidf_means[i]:.4f}")

2026-02-25 17:04:54.446 | INFO     | src.utils:load_config:45 - Configuration loaded from c:\Users\shril\Documents\Projects\reddit-tech-sentiment\notebooks\..\config\config.yaml
2026-02-25 17:04:57.885 | INFO     | src.topics:build_tfidf:299 - TF-IDF matrix built: (294704, 3000)


TF-IDF matrix shape: (294704, 3000)
Vocabulary size: 3,000

Top 20 terms by avg TF-IDF:
  datum                     0.0302
  machine                   0.0252
  science                   0.0240
  learning                  0.0234
  use                       0.0230
  learn                     0.0221
  data                      0.0212
  would                     0.0161
  get                       0.0154
  model                     0.0152
  work                      0.0146
  good                      0.0144
  help                      0.0136
  like                      0.0133
  computer                  0.0127
  know                      0.0123
  project                   0.0121
  python                    0.0116
  deep                      0.0115
  need                      0.0113
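
The list above contains both `datum` and `data`, which suggests the lemmatizer maps "data" to "datum" in some contexts, so one concept is split across two vocabulary entries. A minimal sketch of merging such variants before vectorizing follows. `LEMMA_FIXES` and `normalize_lemmas` are hypothetical names, not part of `src.topics`, and the mapping is an assumption to adjust for the corpus.

```python
from collections import Counter

# Hypothetical mapping of lemma variants to a canonical form (an assumption,
# not taken from src.topics).
LEMMA_FIXES = {"datum": "data"}


def normalize_lemmas(text, fixes=LEMMA_FIXES):
    """Replace known lemma variants in a whitespace-tokenized string."""
    if not isinstance(text, str):
        return ""
    return " ".join(fixes.get(tok, tok) for tok in text.split())


docs = ["datum science project", "clean data pipeline", "big datum tool"]
normalized = [normalize_lemmas(d) for d in docs]

# Count terms after normalization: both variants now fall under "data".
counts = Counter(tok for d in normalized for tok in d.split())
print(counts["data"], counts["datum"])  # prints: 3 0
```

In this notebook the function could be applied with `df['text_lemmatized'].map(normalize_lemmas)` before calling `build_tfidf`, so that the two entries share one TF-IDF weight.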


## 4.2 LDA Coherence Evaluation

We test LDA with different topic counts (5, 10, 15, 20) and select the model with the highest c_v coherence score. To keep runtime manageable, we run coherence evaluation on a 50K random sample.

In [3]:
# Prepare tokenized texts for gensim
# Use a 50K sample for coherence evaluation (full dataset would be very slow)
COHERENCE_SAMPLE = 50_000
df_coherence = df.sample(min(COHERENCE_SAMPLE, len(df)), random_state=42)

texts_sample = df_coherence['text_lemmatized'].apply(
    lambda x: x.split() if isinstance(x, str) else []
).tolist()

print(f"Evaluating coherence on {len(texts_sample):,} documents...")
start = time.time()

coherence_scores = modeler.evaluate_coherence(
    texts_sample, topic_counts=[5, 10, 15, 20]
)

elapsed = time.time() - start
print(f"\nCoherence evaluation completed in {elapsed:.0f}s")
print(f"\nResults:")
for k, score in sorted(coherence_scores.items()):
    marker = ' ← best' if score == max(coherence_scores.values()) else ''
    print(f"  k={k:>2}: c_v = {score:.4f}{marker}")

best_k = max(coherence_scores, key=coherence_scores.get)
print(f"\nOptimal topic count: {best_k}")

Evaluating coherence on 50,000 documents...


2026-02-25 17:04:59.236 | INFO     | src.topics:evaluate_coherence:204 - Evaluating coherence for 5 topics...
2026-02-25 17:06:06.328 | INFO     | src.topics:evaluate_coherence:219 -   n_topics=5 → coherence=0.5192
2026-02-25 17:06:06.329 | INFO     | src.topics:evaluate_coherence:204 - Evaluating coherence for 10 topics...
2026-02-25 17:07:11.690 | INFO     | src.topics:evaluate_coherence:219 -   n_topics=10 → coherence=0.5186
2026-02-25 17:07:11.691 | INFO     | src.topics:evaluate_coherence:204 - Evaluating coherence for 15 topics...
2026-02-25 17:08:18.500 | INFO     | src.topics:evaluate_coherence:219 -   n_topics=15 → coherence=0.5492
2026-02-25 17:08:18.501 |


Coherence evaluation completed in 268s

Results:
  k= 5: c_v = 0.5192
  k=10: c_v = 0.5186
  k=15: c_v = 0.5492 ← best
  k=20: c_v = 0.5252

Optimal topic count: 15
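
The scores above span only about 0.03, so it can help to report the margin between the best and second-best k alongside the winner. The sketch below reproduces the argmax selection on the printed scores and adds that margin. `select_k` is a hypothetical helper, not part of `TopicModeler`.

```python
def select_k(scores):
    """Return (best_k, best_score, margin over the runner-up)."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_k, best_score = ranked[0]
    # With a single candidate there is no runner-up, so the margin is 0.
    margin = best_score - ranked[1][1] if len(ranked) > 1 else 0.0
    return best_k, best_score, margin


# c_v scores copied from the output above.
coherence_scores = {5: 0.5192, 10: 0.5186, 15: 0.5492, 20: 0.5252}

best_k, best_score, margin = select_k(coherence_scores)
print(f"best k={best_k}, c_v={best_score:.4f}, margin={margin:.4f}")
```

A small margin measured on a single 50K sample is a reason to treat the chosen k as a reasonable setting rather than a sharp optimum.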


In [4]:
# Visualize coherence curve
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=list(coherence_scores.keys()),
    y=list(coherence_scores.values()),
    mode='lines+markers',
    marker=dict(size=10),
    line=dict(color='#3498db', width=2),
))
fig.add_vline(x=best_k, line_dash='dash', line_color='#e74c3c',
              annotation_text=f'Optimal: k={best_k}')
fig.update_layout(height=350, title='LDA Coherence Score by Topic Count',
                  xaxis_title='Number of Topics', yaxis_title='Coherence (c_v)')
fig.show()

## 4.3 Fit Final LDA Model

In [5]:
# Fit LDA with the optimal number of topics on the full dataset
print(f"Fitting LDA with k={best_k} on {len(df):,} posts...")
start = time.time()

df, topic_info = modeler.fit_lda(df, text_col='text_lemmatized', n_topics=best_k)

elapsed = time.time() - start
print(f"LDA fitting completed in {elapsed:.0f}s")

# Display discovered topics with keywords
modeler.print_topics(n_words=8)

2026-02-25 17:09:26.533 | INFO     | src.topics:fit_lda:117 - Fitting LDA model with 15 topics...


Fitting LDA with k=15 on 294,704 posts...


2026-02-25 17:13:38.147 | INFO     | src.topics:fit_lda:158 - LDA model fitted — 15 topics discovered


LDA fitting completed in 252s

DISCOVERED TOPICS

Topic 0: 0.040*"2022" + 0.019*"buy" + 0.017*"2021" + 0.016*"gpu" + 0.015*"top" + 0.015*"cpu" + 0.015*"2020" + 0.013*"100"

Topic 1: 0.043*"google" + 0.036*"app" + 0.024*"user" + 0.024*"web" + 0.024*"website" + 0.017*"use" + 0.013*"search" + 0.012*"server"

Topic 2: 0.018*"use" + 0.018*"function" + 0.014*"algorithm" + 0.014*"problem" + 0.013*"network" + 0.011*"input" + 0.010*"feature" + 0.010*"method"

Topic 3: 0.037*"computer" + 0.027*"science" + 0.020*"get" + 0.016*"year" + 0.015*"would" + 0.014*"job" + 0.013*"want" + 0.013*"degree"

Topic 4: 0.056*"datum" + 0.024*"use" + 0.015*"sql" + 0.015*"tool" + 0.014*"data" + 0.014*"pipeline" + 0.013*"database" + 0.011*"company"

Topic 5: 0.079*"learning" + 0.070*"machine" + 0.048*"intelligence" + 0.043*"artificial" + 0.036*"deep" + 0.033*"network" + 0.033*"learn" + 0.025*"computer"

Topic 6: 0.239*"amp" + 0.163*"x200b" + 0.088*"image" + 0.016*"transformer" + 0.016*"face" + 0.015*"video" + 0.014*
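
Topic 6 is dominated by `amp` and `x200b`, which look like remnants of the escaped HTML entity `&amp;#x200B;` (a zero-width space) surviving tokenization. One way to avoid a topic built on such residue is to unescape entities and drop zero-width characters before lemmatization. The sketch below uses only the standard library. `clean_reddit_text` is a hypothetical helper, and where it would sit in the upstream preprocessing is an assumption.

```python
import html
import re

# Zero-width and BOM characters that carry no lexical content.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")


def clean_reddit_text(text, max_passes=3):
    """Unescape (possibly double-escaped) HTML entities and strip zero-width chars."""
    if not isinstance(text, str):
        return ""
    # Text can be escaped more than once, e.g. "&amp;#x200B;".
    for _ in range(max_passes):
        unescaped = html.unescape(text)
        if unescaped == text:
            break
        text = unescaped
    text = ZERO_WIDTH.sub("", text)
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(clean_reddit_text("Results:\n\n&amp;#x200B;\n\nPandas &amp; NumPy"))
# prints: Results: Pandas & NumPy
```

Cleaning at this stage would make the `amp` and `x200b` entries in the labeling filter unnecessary, since the tokens would never reach the vocabulary.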

In [6]:
# Label topics based on top keywords
top_words = modeler.get_top_words_per_topic(n_words=8)  # fetch more words for fallback

# Clean up common spaCy lemmatization artifacts
LABEL_FIXES = {
    'datum': 'data',
    'x200b': None,      # Unicode zero-width space artifact — skip
    'amp': None,         # HTML &amp; artifact — skip
    'wa': None,          # Tokenizer artifact
    'ha': None,          # Tokenizer artifact
    'crack': None,       # Ambiguous (interviews vs piracy) — skip
}

# Stopwords too generic for topic labels
LABEL_STOPWORDS = {'use', 'get', 'like', 'would', 'make', 'know', 'good', 'want', 'need', 'think'}

def clean_topic_words(words, fixes=LABEL_FIXES, stopwords=LABEL_STOPWORDS, n=3, min_words=2):
    """Return top-n cleaned keywords, skipping artifacts and generic words.
    If fewer than min_words survive after stopword removal, relax the filter
    and allow stopwords through to reach min_words."""
    cleaned = []
    skipped_stopwords = []  # keep track for fallback
    for w, score in words:
        if w in fixes:
            if fixes[w] is None:
                continue
            w = fixes[w]
        if w in stopwords:
            if w not in cleaned and w not in skipped_stopwords:
                skipped_stopwords.append(w)
            continue
        if w not in cleaned:
            cleaned.append(w)
        if len(cleaned) >= n:
            break
    # Fallback: if too few words, backfill from skipped stopwords
    while len(cleaned) < min_words and skipped_stopwords:
        cleaned.append(skipped_stopwords.pop(0))
    return cleaned

# First pass: generate candidate labels
candidates = {}
for tid, words in top_words.items():
    candidates[tid] = clean_topic_words(words, n=3)

# Second pass: deduplicate first-word across topics
# If two topics share the same lead word (e.g., 'Data'), keep the first
# occurrence and shift the duplicate to its next unique keyword
used_leads = {}  # lead_word → topic_id that claimed it
auto_labels = {}

for tid in sorted(candidates.keys()):
    # Fall back to a placeholder so words[0] below is safe when nothing survives cleaning
    words = candidates[tid] or [f'topic_{tid}']
    lead = words[0]
    
    if lead in used_leads:
        # This lead word is taken — try to find a unique lead from the full word list
        all_cleaned = clean_topic_words(top_words[tid], n=6)
        unique_lead = None
        for w in all_cleaned:
            if w not in used_leads:
                unique_lead = w
                break
        if unique_lead:
            # Rebuild label starting with the unique word
            remaining = [w for w in all_cleaned if w != unique_lead][:2]
            words = [unique_lead] + remaining
        # else: keep as-is (rare edge case)
    
    used_leads[words[0]] = tid
    label = ' / '.join(words).title()
    auto_labels[tid] = label

print("\nTop words per topic (raw → cleaned):")
print("=" * 60)
for tid, words in top_words.items():
    raw_str = ', '.join([w[0] for w in words[:5]])
    print(f"  Topic {tid}: [{raw_str}] → {auto_labels[tid]}")

df['topic_name'] = df['lda_topic'].map(auto_labels)
print(f"\nTopic distribution:")
for tid, label in sorted(auto_labels.items()):
    count = (df['lda_topic'] == tid).sum()
    print(f"  Topic {tid}: {label} ({count:,} posts)")


Top words per topic (raw → cleaned):
  Topic 0: [2022, buy, 2021, gpu, top] → 2022 / Buy / 2021
  Topic 1: [google, app, user, web, website] → Google / App / User
  Topic 2: [use, function, algorithm, problem, network] → Function / Algorithm / Problem
  Topic 3: [computer, science, get, year, would] → Computer / Science / Year
  Topic 4: [datum, use, sql, tool, data] → Data / Sql / Tool
  Topic 5: [learning, machine, intelligence, artificial, deep] → Learning / Machine / Intelligence
  Topic 6: [amp, x200b, image, transformer, face] → Image / Transformer / Face
  Topic 7: [data, datum, engineer, engineering, interview] → Engineer / Data / Engineering
  Topic 8: [code, crack, open, source, download] → Code / Open / Source
  Topic 9: [paper, research, post, article, read] → Paper / Research / Post
  Topic 10: [datum, table, number, use, column] → Table / Data / Number
  Topic 11: [work, like, time, make, get] → Work / Time / People
  Topic 12: [learn, language, python, book, good] → Lea

## 4.4 Topic Distribution

In [7]:
# Topic distribution
topic_counts = df['topic_name'].value_counts()

fig = px.bar(x=topic_counts.index, y=topic_counts.values,
             color=topic_counts.values, color_continuous_scale='viridis',
             title='Topic Distribution Across All Posts (LDA)',
             labels={'x': 'Topic', 'y': 'Number of Posts'})
fig.update_layout(height=450, showlegend=False, coloraxis_showscale=False,
                  xaxis_tickangle=45)
fig.show()

# Topic probability distribution — how confident is the model?
fig = px.histogram(df, x='lda_topic_prob', nbins=50,
                   title='LDA Topic Assignment Confidence',
                   labels={'lda_topic_prob': 'Topic Probability'})
fig.update_layout(height=350)
fig.show()

print(f"\nAvg topic confidence: {df['lda_topic_prob'].mean():.3f}")
print(f"Posts with confidence > 0.5: {(df['lda_topic_prob'] > 0.5).mean():.1%}")


Avg topic confidence: 0.483
Posts with confidence > 0.5: 41.9%
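
The confidence figures above are easier to interpret with the assignment rule in mind. Assuming `lda_topic` is the highest-probability topic of a post and `lda_topic_prob` is that probability (the notebook does not show how `fit_lda` derives them), the rule can be sketched on a list of `(topic_id, probability)` pairs, the shape returned by gensim's `get_document_topics`. `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(doc_topics):
    """Return (topic_id, probability) for the most probable topic.

    doc_topics: list of (topic_id, probability) pairs for one document.
    Returns (None, 0.0) when the list is empty.
    """
    if not doc_topics:
        return None, 0.0
    return max(doc_topics, key=lambda pair: pair[1])


# Illustrative distributions (made-up numbers, not from the dataset).
focused = [(3, 0.82), (11, 0.10), (13, 0.08)]
mixed = [(2, 0.36), (4, 0.33), (10, 0.31)]

for name, dist in [("focused", focused), ("mixed", mixed)]:
    topic_id, prob = dominant_topic(dist)
    print(f"{name}: topic={topic_id}, prob={prob:.2f}, above 0.5: {prob > 0.5}")
```

Under this reading, the 41.9% figure means that most posts are mixtures in which no single topic holds a majority of the probability mass, so hard topic labels for those posts should be read with caution.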


## 4.5 Topics × Sentiment

In [8]:
# Average sentiment per topic
topic_sentiment = df.groupby('topic_name')['vader_compound'].agg(['mean', 'std', 'count']).round(4)
topic_sentiment = topic_sentiment.sort_values('mean')

fig = px.bar(x=topic_sentiment['mean'], y=topic_sentiment.index, orientation='h',
             title='Average Sentiment by Topic',
             labels={'x': 'Avg VADER Compound', 'y': 'Topic'},
             color=topic_sentiment['mean'],
             color_continuous_scale='RdYlGn')
fig.add_vline(x=0, line_dash='dash', line_color='gray')
fig.update_layout(height=450, coloraxis_showscale=False)
fig.show()

print("\nSentiment by topic:")
print(topic_sentiment.to_string())


Sentiment by topic:
                                     mean     std  count
topic_name                                              
Image / Transformer / Face         0.0909  0.2763   3166
Engineer / Data / Engineering      0.1145  0.2753  14523
Model / Training / Train           0.1593  0.3631  11413
2022 / Buy / 2021                  0.1651  0.3370   9563
Learning / Machine / Intelligence  0.1654  0.3188  31162
Google / App / User                0.1807  0.3856  10894
Code / Open / Source               0.1943  0.3185   9981
Function / Algorithm / Problem     0.2111  0.4639  25534
Table / Data / Number              0.2460  0.4411  13793
Learn / Language / Python          0.2747  0.3566  17274
Paper / Research / Post            0.3094  0.4226  15299
Data / Sql / Tool                  0.3342  0.4017  23359
Work / Time / People               0.3673  0.5581  29909
Help / Would                       0.4822  0.4702  40549
Computer / Science / Year          0.4838  0.4534  38285
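
Because the table reports mean, standard deviation, and count, a rough standard error (`std / sqrt(count)`) can be attached to each mean to judge whether two topics really differ. The sketch below treats posts as independent samples, which is an assumption, and `mean_se` and `diff_in_se_units` are hypothetical helpers.

```python
import math


def mean_se(std, count):
    """Standard error of a mean from a sample std and sample size."""
    return std / math.sqrt(count)


def diff_in_se_units(a, b):
    """Difference of two means divided by the standard error of that difference.

    a, b: (mean, std, count) tuples.
    """
    se_diff = math.sqrt(mean_se(a[1], a[2]) ** 2 + mean_se(b[1], b[2]) ** 2)
    return (b[0] - a[0]) / se_diff


# (mean, std, count) copied from the table above.
help_would = (0.4822, 0.4702, 40549)
cs_year = (0.4838, 0.4534, 38285)
image = (0.0909, 0.2763, 3166)

print(f"Help vs CS/Year:  {diff_in_se_units(help_would, cs_year):.2f} SE")
print(f"Image vs CS/Year: {diff_in_se_units(image, cs_year):.2f} SE")
```

The first pair differs by well under two standard errors, so the ordering of the top two topics should not be over-read, whereas the gap between the lowest and highest topics is many standard errors wide.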


## 4.6 Topic Prevalence Over Time

In [9]:
# Topic trends over time
df['month'] = pd.to_datetime(df['created_utc']).dt.tz_localize(None).dt.to_period('M').astype(str)
topic_time = df.groupby(['month', 'topic_name']).size().reset_index(name='count')

# Normalize to percentages
totals = topic_time.groupby('month')['count'].transform('sum')
topic_time['pct'] = (topic_time['count'] / totals * 100).round(1)

fig = px.area(topic_time, x='month', y='pct', color='topic_name',
              title='Topic Prevalence Over Time (% of Posts)',
              labels={'pct': '% of Posts', 'month': 'Month', 'topic_name': 'Topic'})
fig.update_layout(height=500, xaxis_tickangle=45)
fig.show()

In [10]:
# Save with topic assignments
df.to_parquet('../data/processed/posts_final.parquet', index=False)
print(f"Saved {len(df):,} posts with topic assignments")

# Save model
modeler.save_model('lda', '../outputs/models')
print("LDA model saved to outputs/models/")

2026-02-25 17:13:40.145 | INFO     | src.topics:save_model:362 - LDA model saved to ..\outputs\models


Saved 294,704 posts with topic assignments
LDA model saved to outputs/models/


## Summary

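- Evaluated LDA coherence at k=5, 10, 15, 20 on a 50K-post sample; k=15 scored highest (c_v = 0.5492) and was used for the final model
- Fitted gensim LDA on all 294,704 posts and labeled the 15 topics automatically from their top keywords
- Topic assignments are often diffuse: average confidence is 0.483, and only 41.9% of posts have a topic probability above 0.5
- Every topic has a positive mean VADER compound score, ranging from 0.0909 (Image / Transformer / Face) to 0.4838 (Computer / Science / Year)
- Topic prevalence is plotted by month as a share of posts to show which themes rise and decline over time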

**Note:** BERTopic was not run in this notebook due to compute constraints (requires sentence-transformers + UMAP + HDBSCAN). It can be added as a future enhancement.

**Next:** 05_trend_analysis.ipynb