# Topic Modeling — Event-Based Analysis (LDA)

This notebook applies topic modeling to Reddit comments related to the Ukraine war.
The goal is to identify dominant discussion themes for each key event and examine how
the focus of public discourse changes over time.

We use Latent Dirichlet Allocation (LDA), a widely used and interpretable topic modeling
method that represents each document as a mixture of topics and each topic as a
probability distribution over words.


## 1. Imports and Setup


In [3]:
import os
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


## 2. Load Event Datasets

We load the same event-level CSV files used in the sentiment analysis notebook.


In [5]:
base_dir = "../data/processed"

event_files = {
    "event1_kyiv": os.path.join(base_dir, "event1_kyiv.csv"),
    "event2_kherson": os.path.join(base_dir, "event2_kherson.csv"),
    "event3_stalemate": os.path.join(base_dir, "event3_stalemate.csv"),
    "event4_trump_election": os.path.join(base_dir, "event4_trump_election.csv"),
    "event5_white_house_meeting": os.path.join(base_dir, "event5_white_house_meeting.csv"),
}

event_files


{'event1_kyiv': '../data/processed/event1_kyiv.csv',
 'event2_kherson': '../data/processed/event2_kherson.csv',
 'event3_stalemate': '../data/processed/event3_stalemate.csv',
 'event4_trump_election': '../data/processed/event4_trump_election.csv',
 'event5_white_house_meeting': '../data/processed/event5_white_house_meeting.csv'}

## 3. Text Preparation

We prepare comment text for topic modeling by:
- keeping only non-empty comments (empty and missing entries fail the length filter below)
- converting text to lowercase
- removing very short comments (30 characters or fewer)

Heavy text cleaning is intentionally avoided at this stage to preserve meaning.


In [7]:
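def prepare_text(df):
    """Return a copy of df with a lowercased `text` column, dropping short comments."""
    df = df.copy()
    # Cast to str so non-string values do not break the .str accessor
    df["text"] = df["self_text"].astype(str).str.lower()
    # Keep only comments longer than 30 characters
    df = df[df["text"].str.len() > 30]
    return df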
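
As a quick check of what `prepare_text` keeps, the sketch below runs the same logic on a small made-up DataFrame (the sample rows are hypothetical, not taken from the event files). It illustrates that a missing value and a short comment are both removed by the length filter.

```python
import numpy as np
import pandas as pd


def prepare_text(df):
    df = df.copy()
    df["text"] = df["self_text"].astype(str).str.lower()
    df = df[df["text"].str.len() > 30]
    return df


# Hypothetical sample rows: one short comment, one missing value, one long comment
sample = pd.DataFrame({
    "self_text": [
        "Short one.",
        np.nan,
        "This Comment Is Comfortably Longer Than Thirty Characters.",
    ]
})

prepared = prepare_text(sample)
print(prepared["text"].tolist())
```

Only the third row should survive, in lowercase.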


## 4. Topic Modeling Function

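This function:
- vectorizes text using bag-of-words
- fits an LDA model
- returns the top words for each topic, listed in ascending order of weight (the strongest word comes last)

Note that `run_lda` reads the global `custom_stopwords` list defined in Section 4b, so that cell must be run before `run_lda` is called.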


In [9]:
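def run_lda(texts, n_topics=5, n_words=10, min_df=20):
    """Fit LDA on `texts` and return a DataFrame with the top words per topic."""
    # Bag-of-words counts; ignore terms in more than 95% of documents
    # or in fewer than min_df documents
    vectorizer = CountVectorizer(
        max_df=0.95,
        min_df=min_df,
        stop_words=custom_stopwords  # global list defined in Section 4b
    )

    dtm = vectorizer.fit_transform(texts)

    lda = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=42  # fixed seed for reproducible topics
    )
    lda.fit(dtm)

    words = vectorizer.get_feature_names_out()

    topics = []
    for i, topic in enumerate(lda.components_):
        # argsort is ascending, so the last n_words indices are the highest-weight words
        top_words = ", ".join([words[j] for j in topic.argsort()[-n_words:]])
        topics.append({"topic_id": i, "top_words": top_words})

    return pd.DataFrame(topics)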



## 4b. Refinement: Cleaning Text to Reduce Noise Topics

The first LDA run can produce noise-driven topics (URLs, generic filler words like "just", "like").
In this section, we apply light cleaning and custom stopwords, then rerun LDA to get more interpretable topics.


In [11]:
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom_stopwords = set(ENGLISH_STOP_WORDS).union({
    # common filler / low-information words on Reddit
    "just", "like", "people", "think", "know", "really", "going", "said", "say",
    "don", "doesn", "didn", "isn", "aren", "wasn", "weren", "can", "could", "would",
    "im", "ive", "youre", "theyre", "were", "id",

    # reddit / quote / formatting artifacts
    "gt", "amp", "reddit", "comment", "comments", "post", "posts",

    # web leftovers
    "https", "http", "www", "com",

    # French stopwords
    "la", "le", "les", "et", "en", "que", "pas", "des", "il", "est", "un", "une", "du", "au",

    # common Ukraine-war generic terms (optional; helps topics be less repetitive)
    "ukraine", "russia", "russian"
})

# CountVectorizer expects a list; sorting keeps the order stable across runs
custom_stopwords = sorted(custom_stopwords)


def clean_for_topics(text):
    if not isinstance(text, str):
        return ""

    text = text.lower()

    # Remove URLs
    text = re.sub(r"http\S+|www\S+", " ", text)

    # Remove quote artifacts and common HTML leftovers
    text = re.sub(r"\bgt\b", " ", text)     # often comes from ">" quoting
    text = re.sub(r"\bamp\b", " ", text)

    # Keep only letters and spaces
    text = re.sub(r"[^a-z\s]", " ", text)

    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text
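
To make the effect of each step concrete, the sketch below applies the same function to a made-up comment (the sample string and its URL are hypothetical).

```python
import re


def clean_for_topics(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", " ", text)   # remove URLs
    text = re.sub(r"\bgt\b", " ", text)           # leftover from "&gt;" quoting
    text = re.sub(r"\bamp\b", " ", text)          # leftover from "&amp;"
    text = re.sub(r"[^a-z\s]", " ", text)         # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()      # normalize whitespace
    return text


# Hypothetical sample comment
sample = "&gt; Check https://example.com/a?b=1 NOW!! It's 100% true &amp; important."
cleaned = clean_for_topics(sample)
print(cleaned)  # check now it s true important
```

Apostrophes become spaces, so contractions split apart ("don't" becomes "don t"). This is why fragments such as "don" and "doesn" appear in the custom stopword list.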


## 5. Topic Modeling Per Event

We run topic modeling separately for each event to allow meaningful comparison.


In [13]:
for event, path in event_files.items():
    if not os.path.exists(path):
        continue

    # Load once
    df = pd.read_csv(path, low_memory=False)

    print(f"{event}: {len(df)} comments")

    if len(df) < 500:
        print("  Not enough data for reliable topic modeling\n")
        continue

    # ----------------------------
    # Version 1: light cleaning
    # ----------------------------
    df_v1 = prepare_text(df)
    topics_v1 = run_lda(df_v1["text"], n_topics=5)

    # ----------------------------
    # Version 2: stronger cleaning
    # ----------------------------
    texts_v2 = (
        df["self_text"]
        .dropna()
        .apply(clean_for_topics)
    )
    texts_v2 = texts_v2[texts_v2.str.len() > 30]

    # Lightweight "mostly-English" filter (reduces non-English topics like French)
    common_english = r"\b(?:the|and|to|of|in|is|for|that|with|on|as|are)\b"
    texts_v2 = texts_v2[texts_v2.str.contains(common_english, regex=True)]
    print("Cleaned comments kept:", len(texts_v2))

    # event2_kherson is a much smaller dataset, so use a lower document-frequency threshold
    min_df_event = 5 if event == "event2_kherson" else 20
    topics_v2 = run_lda(texts_v2, n_topics=5, min_df=min_df_event)

    out_dir = "../data/processed/topics"
    os.makedirs(out_dir, exist_ok=True)

    out_path = os.path.join(out_dir, f"{event}_topics.csv")
    topics_v2.to_csv(out_path, index=False)
    print("Saved:", out_path)

    print("=== RAW TOPICS ===")
    print(topics_v1, "\n")

    print("=== CLEANED TOPICS ===")
    print(topics_v2, "\n")
    print("Cell executed successfully")



event1_kyiv: 2940 comments
Cleaned comments kept: 2124
Saved: ../data/processed/topics/event1_kyiv_topics.csv
=== RAW TOPICS ===
   topic_id                                          top_words
0         0  udsc, ukraina, web, ua, polish, border, need, ...
1         1  probably, want, news, ukrainian, thing, war, w...
2         2  look, west, yes, support, propaganda, ukrainia...
3         3  actually, world, mean, end, got, fucking, yeah...
4         4  maybe, weapons, soldiers, love, right, militar... 

=== CLEANED TOPICS ===
   topic_id                                          top_words
0         0  used, use, weapons, children, look, troops, se...
1         1  shit, did, fuck, way, propaganda, world, right...
2         2  pl, visa, ua, news, ready, help, polish, borde...
3         3  sure, power, doing, better, mean, bad, point, ...
4         4  invasion, yes, countries, did, military, putin... 

Cell executed successfully
event2_kherson: 795 comments
Cleaned comments kept: 525
Saved

## 6. Interpretation Notes

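- Topics represent *themes*, not opinions.
- Topic labels are inferred from the top words, so they are an interpretation rather than model output.
- Differences across events suggest shifts in public focus. Because each event is fitted separately, topic IDs are not aligned across events and topics must be compared by their top words.
- Topic modeling complements sentiment analysis by offering context that may help explain *why* tone changes.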
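
One simple way to compare topics across events is to measure how much their top-word sets overlap. The sketch below computes a Jaccard similarity matrix between two topic tables shaped like the output of `run_lda`. The helper names `top_word_sets` and `jaccard_matrix` and the two example tables are hypothetical, introduced here for illustration only.

```python
import pandas as pd


def top_word_sets(topics_df):
    """Turn the comma-separated 'top_words' column into a list of sets."""
    return [set(w.strip() for w in s.split(",")) for s in topics_df["top_words"]]


def jaccard_matrix(topics_a, topics_b):
    """Rows are topics of A, columns are topics of B, values are Jaccard overlap."""
    sets_a, sets_b = top_word_sets(topics_a), top_word_sets(topics_b)
    rows = [[len(a & b) / len(a | b) for b in sets_b] for a in sets_a]
    return pd.DataFrame(
        rows,
        index=topics_a["topic_id"].tolist(),
        columns=topics_b["topic_id"].tolist(),
    )


# Hypothetical topic tables in the same shape that run_lda returns
event_a = pd.DataFrame({
    "topic_id": [0, 1],
    "top_words": ["weapons, troops, military", "border, visa, help"],
})
event_b = pd.DataFrame({
    "topic_id": [0, 1],
    "top_words": ["military, weapons, aid", "election, vote, president"],
})

overlap = jaccard_matrix(event_a, event_b)
print(overlap)
```

A value near 1 means two topics share most of their top words, while a value of 0 means they share none. In practice the saved `*_topics.csv` files could be loaded with `pd.read_csv` and passed in place of the example tables.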
