# Project 2, Part 3: Advanced Text Processing - LDA and BERTopic Topic Modeling (20 pts)


**Resources:**
- LDA:
    - https://medium.com/sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06 
    - https://www.kaggle.com/code/faressayah/text-analysis-topic-modeling-with-spacy-gensim#%F0%9F%93%9A-Topic-Modeling (code for previous post)
    - https://towardsdatascience.com/topic-modelling-in-python-with-spacy-and-gensim-dc8f7748bdbf/ 
- BERTopic:
    - https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_documents.html#visualize-documents-with-plotly 
    - https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_topics.html 


In [19]:
####################
## MOVE THIS BLOCK TO part03.py
####################

####################
## CALL THIS BLOCK FROM part03.py
####################

# from tqdm.auto import tqdm
from spacy import displacy
from bertopic import BERTopic
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.gensim_models

In [20]:
####################
## CALL THIS BLOCK FROM part00_utils_visuals.py
####################

# read in SOTU.csv using pandas, name the variable `sou` for simplicity
# the below cell is what the output should look like

from src import part00_utils_visuals as part00
# import src.part00_utils_visuals as part00

from src import part01

part00.plot_style(style=part00.PLOT_STYLE_SEABORN)

sou = part00.pd.read_csv(part00.DIR_DATA_00_RAW / part00.CSV_SOTU)

In [21]:
sou

Unnamed: 0,President,Year,Text,Word Count
0,Joseph R. Biden,2024.0,"\n[Before speaking, the President presented hi...",8003
1,Joseph R. Biden,2023.0,\nThe President. Mr. Speaker——\n[At this point...,8978
2,Joseph R. Biden,2022.0,"\nThe President. Thank you all very, very much...",7539
3,Joseph R. Biden,2021.0,\nThe President. Thank you. Thank you. Thank y...,7734
4,Donald J. Trump,2020.0,\nThe President. Thank you very much. Thank yo...,6169
...,...,...,...,...
241,George Washington,1791.0,\nFellow-Citizens of the Senate and House of R...,2264
242,George Washington,1790.0,\nFellow-Citizens of the Senate and House of R...,1069
243,George Washington,1790.0,\nFellow-Citizens of the Senate and House of R...,1069
244,George Washington,1790.0,\nFellow-Citizens of the Senate and House of R...,1069


### LDA

- Train an LDA model with 18 topics
- Output the top 10 words for each topic. 
- Output the topic distribution for the first speech
- Make a visualization

You may use the next two cells to process the data.

In [22]:
import spacy
from tqdm import tqdm
from collections import Counter

spacy.cli.download("en_core_web_sm")

nlp = spacy.load("en_core_web_sm")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 76.0 MB/s 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [23]:
####################
## MOVE THIS BLOCK TO part03.py
####################

####################
## CALL THIS BLOCK FROM part03.py
####################

def preprocess_text(text): 
    doc = nlp(text) 
    return [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and not token.is_space and len(token.lemma_) > 3]

In [24]:
# Process all texts - note this takes ~ 5 minutes to run
# processed_docs = sou['Text'].apply(preprocess_text)

from tqdm.auto import tqdm
tqdm.pandas()  # registers .progress_apply()

step00_processed_docs = sou['Text'].progress_apply(preprocess_text)


  0%|          | 0/246 [00:00<?, ?it/s]

Example Progress Bar: ![ProgressBar_Screenshot 2025-11-23 at 15.05.05.png](attachment:d9bb6faf-2cb4-4593-8ec2-24cfc2da4fb0.png)

In [31]:
type(step00_processed_docs), step00_processed_docs

(pandas.core.series.Series,
 0      [speak, president, present, prepared, remark, ...
 1      [president, speaker, point, president, turn, f...
 2      [president, thank, thank, thank, madam, speake...
 3      [president, thank, thank, thank, good, mitch, ...
 4      [president, thank, thank, thank, madam, speake...
                              ...                        
 241    [fellow, citizen, senate, house, representativ...
 242    [fellow, citizen, senate, house, representativ...
 243    [fellow, citizen, senate, house, representativ...
 244    [fellow, citizen, senate, house, representativ...
 245    [fellow, citizen, senate, house, representativ...
 Name: Text, Length: 246, dtype: object)

In [28]:
part01.save_the_processed_data_to_csv(data=step00_processed_docs, filepath=part00.DIR_DATA_03_LDA_BERT / "step00_processed_docs.csv")

To train an LDA model, use the `LdaModel` class that we imported a few cells back. The last resource linked under LDA in the resources above is especially useful for walking through the steps below. *Note: `LdaModel` accepts a `random_state` argument, which sets the random seed for reproducibility. Please set yours to 42. The last resource uses `LdaMulticore`, which is essentially a parallelized version of `LdaModel`. Use `LdaModel` instead; the usage is similar, except that you can ignore the `iterations` and `workers` arguments.*

In [36]:
# processed_docs = read_csv(...)

import ast

step01_processed_docs_from_csv = part00.pd.read_csv(part00.DIR_DATA_03_LDA_BERT / "step00_processed_docs.csv")
step01_processed_docs_from_csv = step01_processed_docs_from_csv["Text"]
step01_processed_docs_from_csv = step01_processed_docs_from_csv.apply(ast.literal_eval)
type(step01_processed_docs_from_csv), step01_processed_docs_from_csv

(pandas.core.series.Series,
 0      [speak, president, present, prepared, remark, ...
 1      [president, speaker, point, president, turn, f...
 2      [president, thank, thank, thank, madam, speake...
 3      [president, thank, thank, thank, good, mitch, ...
 4      [president, thank, thank, thank, madam, speake...
                              ...                        
 241    [fellow, citizen, senate, house, representativ...
 242    [fellow, citizen, senate, house, representativ...
 243    [fellow, citizen, senate, house, representativ...
 244    [fellow, citizen, senate, house, representativ...
 245    [fellow, citizen, senate, house, representativ...
 Name: Text, Length: 246, dtype: object)
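The `ast.literal_eval` step above is needed because a CSV file stores each token list as its string representation. A minimal sketch of that round trip, using the standard-library `csv` module and a made-up `token_lists` sample instead of the real speeches (both are assumptions for illustration):

```python
import ast
import csv
import io

# Toy stand-in for the processed speeches: one token list per row.
token_lists = [["fellow", "citizen", "senate"], ["president", "thank"]]

# Write the lists to an in-memory CSV; each list is stored as its string repr.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Text"])
for tokens in token_lists:
    writer.writerow([str(tokens)])

# Read the file back: every cell is now a plain string, not a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_cell = rows[0]["Text"]
print(type(raw_cell).__name__, raw_cell)

# ast.literal_eval parses the string back into a Python list
# without executing arbitrary code (unlike eval).
restored = [ast.literal_eval(row["Text"]) for row in rows]
print(restored)
```

Skipping the `literal_eval` step would hand `Dictionary` a column of strings, which it would not treat as lists of tokens.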

In [54]:
# Build the dictionary from the processed docs (a Series holding one list of tokens per speech)
step02_build_dict_from_processed_docs = Dictionary(step01_processed_docs_from_csv)
step02_build_dict_from_processed_docs;

<gensim.corpora.dictionary.Dictionary at 0x7a7dae758650>

In [55]:
# Convert each tokenized speech to a bag-of-words: a list of (token_id, count) pairs
step03_corpus = [step02_build_dict_from_processed_docs.doc2bow(doc) for doc in step01_processed_docs_from_csv]
step03_corpus;
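To make the bag-of-words format concrete, here is a plain-Python imitation of what the dictionary and `doc2bow` steps produce. It is a sketch for intuition only, not gensim's implementation: the toy `docs`, the helper `to_bow`, and the first-seen id assignment are all assumptions (gensim may assign ids in a different order).

```python
from collections import Counter

# Two tiny tokenized "speeches" (made up for illustration).
docs = [["union", "state", "union"], ["state", "congress"]]

# Map each distinct token to an integer id, in first-seen order.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(corpus)
```

Word order is discarded: each document is reduced to how often each vocabulary id occurs, which is the only input LDA uses.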

In [62]:
# train LDA model with 18 topics

NUM_OF_TOPICS   = 18
RANDOM_SEED_NUM = 42
NUM_OF_PASSES   = 10
UPDATE_EVERY    = 0  # 0 = batch learning: update the model once per full pass over the corpus

lda_model = LdaModel(
    corpus=step03_corpus,
    id2word=step02_build_dict_from_processed_docs,
    num_topics=NUM_OF_TOPICS,
    random_state=RANDOM_SEED_NUM,
    passes=NUM_OF_PASSES,
    update_every=UPDATE_EVERY,
)
lda_model

<gensim.models.ldamodel.LdaModel at 0x7a7dc35fed50>

In [102]:
# print the top 10 words for each topic
NUM_OF_TOP_WORDS = 10

print("--- Top LDA topics. ---")
for idx, topic in lda_model.print_topics(num_words=NUM_OF_TOP_WORDS):
    print(f"Topic: {idx} \nWords: {topic}\n")

--- Top LDA topics. ---
Topic: 0 
Words: 0.013*"states" + 0.013*"government" + 0.009*"united" + 0.008*"congress" + 0.007*"country" + 0.006*"year" + 0.006*"public" + 0.006*"great" + 0.005*"state" + 0.005*"power"

Topic: 1 
Words: 0.002*"year" + 0.002*"people" + 0.002*"government" + 0.001*"states" + 0.001*"congress" + 0.001*"country" + 0.001*"nation" + 0.001*"great" + 0.001*"time" + 0.001*"united"

Topic: 2 
Words: 0.003*"year" + 0.003*"government" + 0.002*"states" + 0.002*"congress" + 0.002*"united" + 0.002*"people" + 0.002*"great" + 0.002*"service" + 0.002*"public" + 0.001*"increase"

Topic: 3 
Words: 0.015*"year" + 0.011*"world" + 0.011*"people" + 0.011*"america" + 0.010*"nation" + 0.007*"help" + 0.007*"congress" + 0.007*"american" + 0.007*"work" + 0.006*"time"

Topic: 4 
Words: 0.009*"isthmus" + 0.006*"colombia" + 0.005*"government" + 0.004*"states" + 0.004*"panama" + 0.004*"united" + 0.004*"colombian" + 0.004*"treaty" + 0.003*"year" + 0.003*"congress"

Topic: 5 
Words: 0.003*"year" 

In [103]:
# lda_model.print_topics(-1)

In [112]:
# print the topic distribution for the first speech,
# i.e. which topics the speech belongs to and with what probability

step04_topic_dist_first_speech = lda_model[step03_corpus[0]]
step04_topic_dist_first_speech

[(7, np.float32(0.9997309))]

In [78]:
# make a visualization using pyLDAvis
pyLDAvis.enable_notebook()

In [79]:
pyLDAvis.gensim_models.prepare(lda_model, step03_corpus, step02_build_dict_from_processed_docs)

In [113]:
# same distribution via get_document_topics; topics below the model's
# minimum probability threshold are omitted, so only the dominant topic prints

SPEECH_ID = 0

step05_first_speech_bow = step03_corpus[SPEECH_ID]

step06_first_speech_topics = lda_model.get_document_topics(bow=step05_first_speech_bow)

for topic_id, prob in step06_first_speech_topics:
    print(f"Topic {topic_id}: {prob}")

Topic 7: 0.9997308850288391


### BERTopic

- Train a BERTopic model with a `min_topic_size` of 3. *Hint: instantiate the model with `BERTopic` and pass `min_topic_size` there. Then fit the model with `fit_transform`, passing in `docs`.*
- Output the top 10 words for each topic. 
- Output the topic distribution for the first speech
- Make a visualization of the topics (see topic_model.visualize_topics())

In [None]:
docs = sou['Text'].to_list()

In [None]:
# train the model - this takes about 30 seconds

# remove stop words from the topics (Hint: use CountVectorizer and then .update_topics on topic_model)

In [None]:
# output the top 10 words for each topic - hint see get_topic_info

In [None]:
# output the topic distribution for the first speech
# hint: check out approximate_distribution() and visualize_distribution()

In [None]:
# run this cell to visualize the topics
topic_model.visualize_topics()
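Taken together, the BERTopic steps in this section might look like the sketch below. It is one possible arrangement under stated assumptions, not a reference solution: the function name `run_bertopic_sketch` and its return values are hypothetical, and the imports are placed inside the function because `bertopic` is heavy to load and downloads an embedding model on first use.

```python
def run_bertopic_sketch(docs, min_topic_size=3):
    """Fit BERTopic on raw documents and return the model plus two summaries.

    `docs` is a list of raw text strings (for example sou['Text'].to_list()).
    """
    # Lazy imports: only needed when the function is actually called.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # Instantiate and fit; fit_transform returns one topic id per document.
    topic_model = BERTopic(min_topic_size=min_topic_size)
    topics, probs = topic_model.fit_transform(docs)

    # Recompute the topic words with English stop words removed.
    vectorizer_model = CountVectorizer(stop_words="english")
    topic_model.update_topics(docs, vectorizer_model=vectorizer_model)

    # One row per topic, including its size and top words.
    topic_info = topic_model.get_topic_info()

    # Approximate per-document topic distributions; row 0 is the first speech.
    topic_distr, _ = topic_model.approximate_distribution(docs)

    return topic_model, topic_info, topic_distr[0]
```

A call such as `topic_model, topic_info, first_speech_distr = run_bertopic_sketch(docs)` would then allow `topic_model.visualize_distribution(first_speech_distr)` for the first speech and `topic_model.visualize_topics()` for the overall topic map.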