# Part 3: Advanced Text Processing - LDA and BERTopic Topic Modeling (20 pts)

In this section, we apply two topic modeling methods to the State of the Union (SOTU) corpus: (1) Latent Dirichlet Allocation (LDA) and (2) BERTopic.
Both models start from lemmatized text, which lets us compare a traditional bag-of-words model with a model built on transformer embeddings.

## 3.1 Preprocessing for Topic Modeling

We load the dataset, apply spaCy to tokenize and lemmatize text, and construct a list of lemma tokens for each document. Stop words, punctuation, and whitespace tokens are removed to prepare clean inputs for topic modeling.

In [15]:
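import pandas as pd
import spacy
from tqdm import tqdm

# Load the corpus; the "Text" column holds one document per row.
sou = pd.read_csv("data/SOTU.csv")
raw_docs = sou["Text"].tolist()

# Small English spaCy pipeline (includes the lemmatizer).
nlp = spacy.load("en_core_web_sm")

# Build one list of lowercased lemmas per document, dropping stop words,
# punctuation, and whitespace tokens.
processed_docs = []
for text in tqdm(raw_docs, desc="Processing texts for LDA"):
    doc = nlp(text)
    lemmas = [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]
    processed_docs.append(lemmas)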

Processing texts for LDA: 100%|██████████| 246/246 [04:38<00:00,  1.13s/it]
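
The filter above removes stop words, punctuation, and whitespace, but symbols and numbers can still pass through (the LDA output in Section 3.2 lists "$" among its top words). A minimal sketch of one possible extra filter follows. The helper `keep_alphabetic` is hypothetical and was not part of the pipeline that produced the results in this report.

```python
def keep_alphabetic(lemmas):
    """Keep only lemmas made entirely of letters (drops "$", "1945", "mr.")."""
    return [lemma for lemma in lemmas if lemma.isalpha()]


# Toy lemma list standing in for one entry of processed_docs.
sample = ["government", "$", "spend", "1945", "mr.", "year"]
filtered_sample = keep_alphabetic(sample)
print(filtered_sample)  # ['government', 'spend', 'year']
```

The trade-off is that `str.isalpha` also discards hyphenated words and abbreviations containing a period, so it suits a quick cleanup more than a careful one.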


## 3.2 LDA Topic Modeling

We construct a dictionary and bag-of-words corpus from the lemma lists.
Using Gensim’s LDA model, we extract a set of topics from the entire SOTU speech collection.
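
To make the bag-of-words step concrete, here is a small pure-Python sketch of the idea: assign each distinct lemma an integer id, then store every document as sorted `(id, count)` pairs. The helpers `build_vocabulary` and `to_bow` are illustrative stand-ins, not Gensim functions, and Gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())


toy_docs = [["congress", "law", "congress"], ["law", "nation"]]
toy_vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)   # {'congress': 0, 'law': 1, 'nation': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation, which is why LDA can only rely on which words appear together and how often.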

In [16]:
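from gensim import corpora
from gensim.models import LdaModel

# Map each lemma to an integer id, then convert every document
# to a list of (id, count) pairs.
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Fit a 7-topic LDA model; random_state fixes the seed for reproducibility.
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=7,
    passes=10,
    random_state=42,
)

# Show the ten highest-weighted words for each topic.
lda_topics = lda.print_topics(num_words=10)
for idx, topic in enumerate(lda_topics):
    print(f"--- LDA Topic {idx} ---")
    print(topic)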

--- LDA Topic 0 ---
(0, '0.010*"government" + 0.009*"year" + 0.007*"world" + 0.007*"nation" + 0.007*"program" + 0.006*"congress" + 0.006*"people" + 0.006*"war" + 0.006*"federal" + 0.005*"national"')
--- LDA Topic 1 ---
(1, '0.013*"year" + 0.009*"america" + 0.009*"people" + 0.008*"new" + 0.007*"world" + 0.007*"work" + 0.007*"american" + 0.006*"nation" + 0.005*"congress" + 0.005*"help"')
--- LDA Topic 2 ---
(2, '0.011*"government" + 0.009*"states" + 0.009*"year" + 0.008*"$" + 0.008*"united" + 0.005*"congress" + 0.005*"country" + 0.005*"law" + 0.004*"great" + 0.004*"public"')
--- LDA Topic 3 ---
(3, '0.012*"states" + 0.012*"government" + 0.008*"united" + 0.007*"congress" + 0.007*"country" + 0.006*"year" + 0.006*"great" + 0.005*"public" + 0.005*"$" + 0.005*"law"')
--- LDA Topic 4 ---
(4, '0.008*"government" + 0.008*"man" + 0.008*"law" + 0.007*"great" + 0.006*"country" + 0.005*"nation" + 0.005*"people" + 0.005*"work" + 0.005*"congress" + 0.004*"need"')
--- LDA Topic 5 ---
(5, '0.006*"govern

## LDA Topic Interpretation

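The LDA topics mostly capture high-frequency, surface-level vocabulary.
Words such as *government*, *year*, *world*, *nation*, *states*, and *congress* appear in several topics at once, and Topics 2 and 3 share the same ten top words in a slightly different order. LDA builds topics from word co-occurrence counts and has no notion of word meaning, so it separates speeches by repeated vocabulary patterns rather than by what they say.
Because SOTU speeches share strong stylistic overlap, the resulting topics are broad and partially overlapping. The "$" token among the top words also shows that currency symbols survive the punctuation filter from Section 3.1.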
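
The overlap can be quantified directly from the printed output. The sketch below parses the top words of Topics 0 to 4 (Topic 5 is cut off in the output above and is left out) and computes the Jaccard similarity of each pair. The helper names `top_words` and `jaccard` are illustrative; the topic strings are copied from the cell output.

```python
import re
from itertools import combinations

# Topic strings copied from the print_topics output above (Topics 0-4).
printed_topics = {
    0: '0.010*"government" + 0.009*"year" + 0.007*"world" + 0.007*"nation" + 0.007*"program" + 0.006*"congress" + 0.006*"people" + 0.006*"war" + 0.006*"federal" + 0.005*"national"',
    1: '0.013*"year" + 0.009*"america" + 0.009*"people" + 0.008*"new" + 0.007*"world" + 0.007*"work" + 0.007*"american" + 0.006*"nation" + 0.005*"congress" + 0.005*"help"',
    2: '0.011*"government" + 0.009*"states" + 0.009*"year" + 0.008*"$" + 0.008*"united" + 0.005*"congress" + 0.005*"country" + 0.005*"law" + 0.004*"great" + 0.004*"public"',
    3: '0.012*"states" + 0.012*"government" + 0.008*"united" + 0.007*"congress" + 0.007*"country" + 0.006*"year" + 0.006*"great" + 0.005*"public" + 0.005*"$" + 0.005*"law"',
    4: '0.008*"government" + 0.008*"man" + 0.008*"law" + 0.007*"great" + 0.006*"country" + 0.005*"nation" + 0.005*"people" + 0.005*"work" + 0.005*"congress" + 0.004*"need"',
}


def top_words(topic_string):
    """Extract the quoted words from a 'weight*"word" + ...' string."""
    return set(re.findall(r'\*"([^"]+)"', topic_string))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


word_sets = {tid: top_words(text) for tid, text in printed_topics.items()}
overlap = {
    (i, j): jaccard(word_sets[i], word_sets[j])
    for i, j in combinations(sorted(word_sets), 2)
}

# List topic pairs from most to least similar.
for pair, score in sorted(overlap.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Topics 2 and 3 score 1.0 because their top-10 word sets are identical, which suggests that seven topics may be more than this preprocessing can separate.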

## 3.3 BERTopic Modeling

We further clean the lemma lists by removing additional English stopwords from NLTK, then apply BERTopic, which uses transformer embeddings and density-based clustering to form semantically coherent clusters.

In [17]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

clean_docs = []
for doc in processed_docs:
    filtered = [w for w in doc if w not in stop_words]
    clean_docs.append(" ".join(filtered))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(clean_docs)


topic_info = topic_model.get_topic_info()
topic_info.head()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | 0 | 155 | 0_government_states_united_congress | [government, states, united, congress, year, c... | [congress united states discharge constitution... |
| 1 | 1 | 91 | 1_year_america_people_world | [year, america, people, world, new, nation, am... | [everybody seat mr. speaker mr. vice president... |
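
The `Name` and `Representation` columns come from BERTopic's class-based TF-IDF (c-TF-IDF) step, which scores words per topic after the documents have been clustered. The following is a minimal pure-Python sketch of that weighting, assuming the formula given in the BERTopic documentation: a word's frequency within a topic, multiplied by log(1 + A / f), where A is the average word count per topic and f is the word's frequency across all topics. The function `c_tf_idf` and the toy documents are illustrative; the library's own implementation works on sparse matrices and may differ in detail.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Score words per topic; docs_by_topic maps topic id -> list of token lists."""
    # Term counts per topic, treating all documents of a topic as one long document.
    term_counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_by_topic.items()
    }
    # Frequency of each word across all topics.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    avg_words_per_topic = sum(total_counts.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words)
            * math.log(1 + avg_words_per_topic / total_counts[word])
            for word, count in counts.items()
        }
    return scores


# Toy clusters standing in for the two BERTopic topics.
toy_clusters = {
    0: [["government", "congress", "law"], ["government", "states", "year"]],
    1: [["america", "people", "year"], ["america", "world", "work"]],
}
toy_scores = c_tf_idf(toy_clusters)
best_words = {topic: max(words, key=words.get) for topic, words in toy_scores.items()}
print(best_words)  # {0: 'government', 1: 'america'}
```

In the toy data, "year" occurs in both clusters and therefore scores lower than words of equal frequency that occur in only one cluster.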


## BERTopic Topic Interpretation

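With default settings, BERTopic assigns all 246 speeches to just two topics (155 and 91 documents), with no outlier topic.
Because transformer embeddings capture contextual meaning, the model groups speeches by overall content rather than by individual word counts, which yields two broad conceptual clusters: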

- Government/Congress/National Affairs

- America/People/World/Future Themes

Each cluster is internally coherent, in contrast to the overlapping, frequency-driven LDA topics.

## 3.4 Comparison: LDA vs BERTopic

- LDA represents each speech as a bag of words and infers topics from word co-occurrence.
   <br>→ Produces several overlapping topics with similar vocabulary.

- BERTopic embeds each document with a transformer model and clusters the embeddings.
   <br>→ Produces fewer clusters (two on this corpus), each easier to interpret.

- Since SOTU speeches share structure and vocabulary,
   <br>→ LDA tends to split the corpus into many similar topics,
   <br>→ BERTopic groups speeches by broader themes.

Overall, BERTopic provides more coherent high-level topics, while LDA gives granular but noisy lexical clusters.