# Part 3: Advanced Text Processing - LDA and BERTopic Topic Modeling (20 pts)

### **References Used:**
- LDA:
    - https://medium.com/sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06 
    - https://www.kaggle.com/code/faressayah/text-analysis-topic-modeling-with-spacy-gensim#%F0%9F%93%9A-Topic-Modeling (code for previous post)
    - https://towardsdatascience.com/topic-modelling-in-python-with-spacy-and-gensim-dc8f7748bdbf/ 
- BERTopic:
    - https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_documents.html#visualize-documents-with-plotly 
    - https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_topics.html
    - https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html#example
    - https://maartengr.github.io/BERTopic/getting_started/topicrepresentation/topicrepresentation.html#update-topic-representation-after-training


In [1]:
!pip install spacy
import spacy
!python -m spacy download en_core_web_sm


In [2]:
import spacy
from tqdm import tqdm
from collections import Counter
import pandas as pd

# imports
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8-dark') 

sou = pd.read_csv('data/SOTU.csv')
nlp = spacy.load("en_core_web_sm")

In [3]:
from spacy import displacy
from bertopic import BERTopic
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.gensim_models



### LDA

To create and analyze potential topics associated with the speeches, we will first use latent Dirichlet allocation (LDA), as implemented in the gensim package.
- Train an LDA model with 18 topics
- Output the top 10 words for each topic. 
- Output the topic distribution for the first speech
- Make a visualization

In [4]:
def preprocess_text(text): 
    doc = nlp(text) 
    return [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and not token.is_space and len(token.lemma_) > 3]

In [5]:
# Process all texts - note this takes ~ 5 minutes to run
processed_docs = sou['Text'].apply(preprocess_text)

In [6]:
processed_docs

0      [speak, president, present, prepared, remark, ...
1      [president, speaker, point, president, turn, f...
2      [president, thank, thank, thank, madam, speake...
3      [president, thank, thank, thank, good, mitch, ...
4      [president, thank, thank, thank, madam, speake...
                             ...                        
241    [fellow, citizen, senate, house, representativ...
242    [fellow, citizen, senate, house, representativ...
243    [fellow, citizen, senate, house, representativ...
244    [fellow, citizen, senate, house, representativ...
245    [fellow, citizen, senate, house, representativ...
Name: Text, Length: 246, dtype: object

In [19]:
# Build dictionary from processed_docs, which holds one list of tokens per speech
sou['tokens'] = processed_docs
# Gensim Dictionary object maps each word to its unique ID:
dictionary = Dictionary(sou['tokens'])
#print(dictionary.token2id)
# Drop tokens that appear in fewer than 5 speeches or in more than 50% of speeches
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Convert each speech to a bag of words: a sparse list of (token_id, count) pairs,
# where count is the number of occurrences of that token in the speech
corpus = [dictionary.doc2bow(doc) for doc in sou['tokens']]
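
To make the dictionary and bag-of-words step concrete, here is a minimal pure-Python sketch of what a document-frequency filter and a bag-of-words encoding do. It is a stand-in for illustration, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helper names, the toy documents are made up, and the thresholds are scaled down from the `no_below=5, no_above=0.5` used in the cell above.

```python
from collections import Counter

def build_vocab(docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer id, filtering by document frequency."""
    # Count in how many documents each token appears (not total occurrences).
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {tok: i for i, tok in enumerate(kept)}

def to_bow(doc, vocab):
    """Encode one tokenized document as sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())

# Hypothetical toy corpus: four already-tokenized "speeches".
toy_docs = [
    ["canal", "tariff", "canal", "congress"],
    ["tariff", "silver", "congress"],
    ["farmer", "farm", "congress"],
    ["canal", "farm", "congress"],
]

# Keep tokens found in at least 2 documents and at most 75% of documents.
vocab = build_vocab(toy_docs, no_below=2, no_above=0.75)
bows = [to_bow(doc, vocab) for doc in toy_docs]

print(vocab)
print(bows[0])
```

In this toy corpus, "congress" is dropped because it appears in every document, and "silver" and "farmer" are dropped because they appear in only one. The same reasoning explains why very common and very rare words do not show up in the LDA topics.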

In [20]:
# train LDA model with 18 topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=18, random_state=42, passes=10)

In [21]:
# print the top 10 words for each topic
lda_model.print_topics(-1)

[(0,
  '0.008*"canal" + 0.005*"tariff" + 0.004*"panama" + 0.004*"statute" + 0.004*"company" + 0.004*"method" + 0.004*"convention" + 0.003*"board" + 0.003*"cent" + 0.003*"china"'),
 (1,
  '0.003*"mexico" + 0.001*"texas" + 0.001*"mexican" + 0.001*"convention" + 0.001*"americans" + 0.001*"minister" + 0.001*"program" + 0.001*"article" + 0.001*"cent" + 0.001*"loan"'),
 (2,
  '0.006*"method" + 0.005*"board" + 0.005*"agricultural" + 0.005*"farmer" + 0.005*"cent" + 0.004*"farm" + 0.004*"project" + 0.004*"veteran" + 0.004*"depression" + 0.004*"committee"'),
 (3,
  '0.004*"cent" + 0.004*"gold" + 0.004*"silver" + 0.003*"indian" + 0.003*"june" + 0.003*"bond" + 0.003*"method" + 0.003*"island" + 0.002*"conference" + 0.002*"tariff"'),
 (4,
  '0.019*"spain" + 0.009*"article" + 0.007*"minister" + 0.006*"likewise" + 0.005*"manufacture" + 0.005*"port" + 0.005*"tribe" + 0.005*"intercourse" + 0.004*"presume" + 0.004*"colony"'),
 (5,
  '0.009*"tariff" + 0.008*"corporation" + 0.007*"evil" + 0.006*"cable" + 0
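
The strings returned by `print_topics` are convenient to read but awkward to reuse. As a small sketch, the weighted-sum format shown above can be split into `(word, weight)` pairs with a regular expression. `parse_topic` is a hypothetical helper, and the input string is copied from topic 0 in the output above. gensim's `show_topic` returns such pairs directly, so this is mainly useful when only the printed strings are available.

```python
import re

# Matches one term of the form 0.008*"canal"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.008*"canal" + 0.005*"tariff" + ...' into [(word, weight), ...]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

# Topic 0, copied from the print_topics output above.
topic_0 = ('0.008*"canal" + 0.005*"tariff" + 0.004*"panama" + 0.004*"statute" + '
           '0.004*"company" + 0.004*"method" + 0.004*"convention" + 0.003*"board" + '
           '0.003*"cent" + 0.003*"china"')

pairs = parse_topic(topic_0)
words = [word for word, _ in pairs]
print(words)
```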

In [22]:
# print the topic distribution for the first speech
lda_model[corpus][0]

[(7, np.float32(0.9994253))]

The first speech belongs almost entirely (about 99.9%) to topic 7!
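
The distribution comes back as a sparse list of `(topic_id, probability)` pairs. gensim omits topics whose probability falls below the model's `minimum_probability` threshold (0.01 by default), which would explain why only one pair appears here. The sketch below, using hypothetical helpers `dominant_topic` and `to_dense`, shows one way to read off the dominant topic and expand the sparse list to all 18 topics. The input is copied from the output above.

```python
def dominant_topic(sparse_dist):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(sparse_dist, key=lambda pair: pair[1])

def to_dense(sparse_dist, num_topics):
    """Expand sparse (topic_id, probability) pairs to a list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_dist:
        dense[topic_id] = float(prob)
    return dense

# Sparse distribution for the first speech, copied from the output above.
first_speech = [(7, 0.9994253)]

topic_id, prob = dominant_topic(first_speech)
dense = to_dense(first_speech, num_topics=18)
print(f"dominant topic: {topic_id} ({prob:.1%})")
```

Note that the topic ids here are gensim's zero-based indices. pyLDAvis numbers topics by size in its own display, so the same topic may carry a different label in the visualization.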

In [23]:
# make a visualization using pyLDAvis
pyLDAvis.enable_notebook()

lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_display)


In [12]:
#save to outputs
pyLDAvis.save_html(lda_display, 'outputs/lda_topics.html')

### BERTopic
We will also conduct topic analysis using the BERTopic method and package. We will run through the following steps:
- Train a BERTopic model with a `min_topic_size` of 3 
- Output the top 10 words for each topic. 
- Output the topic distribution for the first speech
- Make a visualization of the topics

In [13]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
docs = sou['Text'].to_list()

In [15]:
# train the model - this takes about 30 seconds
topic_model = BERTopic(min_topic_size=3)
topics, probs = topic_model.fit_transform(docs)


# remove stop words from the topics (Hint: use CountVectorizer and then .update_topics on topic_model)
vectorizer_model = CountVectorizer(stop_words="english")
topic_model.update_topics(docs, vectorizer_model=vectorizer_model) 
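
To see why swapping in a stop-word-aware vectorizer changes the topic words without retraining, consider the simplified sketch below. It ranks words per topic with a weighting loosely modeled on the class-based TF-IDF that BERTopic documents (term frequency in the topic times `log(1 + A / f)`, with `A` the average word count per topic and `f` the word's frequency across all topics). This is an assumption-laden illustration, not BERTopic's code: `top_words`, the tiny stop-word set, and the toy documents are all hypothetical, and the real implementation differs in tokenization and normalization.

```python
import math
from collections import Counter

STOP_WORDS = frozenset({"the", "of", "and", "we"})  # tiny illustrative list

def top_words(docs_by_topic, stop_words=frozenset(), n=3):
    """Rank words per topic by a simplified class-based TF-IDF score."""
    # Word counts per topic, skipping stop words.
    counts = {
        topic: Counter(w for doc in docs for w in doc.lower().split()
                       if w not in stop_words)
        for topic, docs in docs_by_topic.items()
    }
    # Word counts across all topics, and the average words per topic.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    avg_words = sum(total.values()) / len(counts)

    result = {}
    for topic, topic_counts in counts.items():
        scores = {w: tf * math.log(1 + avg_words / total[w])
                  for w, tf in topic_counts.items()}
        # Highest score first; ties broken alphabetically for stable output.
        result[topic] = sorted(scores, key=lambda w: (-scores[w], w))[:n]
    return result

# Hypothetical documents already assigned to two topics.
docs_by_topic = {
    0: ["the peace of the world", "we seek peace and the world"],
    1: ["the jobs of the people", "we want jobs and the people"],
}

before = top_words(docs_by_topic)
after = top_words(docs_by_topic, stop_words=STOP_WORDS)
print(before)
print(after)
```

The document-to-topic assignments stay fixed in both calls. Only the word ranking is recomputed, which mirrors the idea behind calling `update_topics` after `fit_transform`.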

In [16]:
# output the top 10 words for each topic - hint see get_topic_info
topic_model.get_topic_info()['Representation']

0     [government, states, united, congress, year, p...
1     [america, american, americans, people, tonight...
2     [government, united, states, department, congr...
3     [government, work, public, congress, great, la...
4     [world, new, america, president, years, today,...
5     [world, peace, nations, soviet, economic, nati...
6     [government, law, states, united, congress, gr...
7     [states, public, government, congress, present...
8     [jobs, america, thats, new, americans, people,...
9     [people, jobs, work, year, american, new, amer...
10    [mexico, states, congress, united, texas, gove...
11    [government, states, public, united, subject, ...
12    [states, government, united, public, congress,...
13    [children, people, new, world, challenge, amer...
14    [bank, public, states, country, government, su...
15    [states, united, government, great, powers, pu...
16    [national, federal, reduction, public, ought, ...
17    [government, states, united, year, congres

In [17]:
# output the topic distribution for the first speech (row 0 of topic_distr)
topic_distr, _ = topic_model.approximate_distribution(docs)
first_speech_viz = topic_model.visualize_distribution(topic_distr[0])

#save first speech topic distribution to outputs
first_speech_viz.write_html("outputs/BERTopic_first_speech_viz.html")
first_speech_viz

In [18]:
# run this cell to visualize the topics
viz_topics = topic_model.visualize_topics()

#save topic visualizations to output
viz_topics.write_html("outputs/BERTopic_topics_viz.html")
viz_topics

## Discussion and Reflections

The 2D intertopic distance maps differ notably between the LDA (bag-of-words) and BERTopic (semantic similarity) approaches. In the LDA map, the larger clusters sit on the right side of the plot, with significantly smaller clusters on the left. LDA, run here with spaCy preprocessing and gensim, is a generative probabilistic model that represents each topic as a distribution over words and uncovers latent (hidden) topic clusters. The BERTopic topics, on the other hand, land in every quadrant of the plot, with cluster sizes more evenly distributed among them. This shows how the two approaches draw on different attributes of the speeches, and on different algorithms, to arrive at their topic summaries and distributions.