## Part 3: Advanced Text Processing - LDA and BERTopic Topic Modeling (20 pts)

**Resources:**
- LDA:
    - https://medium.com/sayahfares19/text-analysis-topic-modelling-with-spacy-gensim-4cd92ef06e06 
    - https://www.kaggle.com/code/faressayah/text-analysis-topic-modeling-with-spacy-gensim#%F0%9F%93%9A-Topic-Modeling (code for previous post)
    - https://towardsdatascience.com/topic-modelling-in-python-with-spacy-and-gensim-dc8f7748bdbf/ 
- BERTopic:
    - https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_documents.html#visualize-documents-with-plotly 
    - https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_topics.html 


In [16]:
from spacy import displacy
from bertopic import BERTopic
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.gensim_models
import spacy

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

### LDA

- Train an LDA model with 18 topics
- Output the top 10 words for each topic. 
- Output the topic distribution for the first speech
- Make a visualization

You may use the next two cells to process the data.

In [17]:
sou = pd.read_csv('data/SOTU.csv')
nlp = spacy.load("en_core_web_sm")

In [18]:
def preprocess_text(text): 
    doc = nlp(text) 
    return [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and not token.is_space and len(token.lemma_) > 3]

In [19]:
# Process all texts - note this takes ~ 5 minutes to run
processed_docs = sou['Text'].apply(preprocess_text)

To train an LDA model, use the `LdaModel` class that we imported earlier. The last resource linked under the LDA section is especially useful for walking through the steps below. *Note: one of the arguments to `LdaModel` is `random_state`, which sets the random seed for reproducibility. Please set yours to 42. The last resource uses `LdaMulticore`, which is essentially a parallelized version of `LdaModel`. Use `LdaModel` instead; the usage is similar, except that you can ignore the `iterations` and `workers` arguments.*

In [31]:
# Build the dictionary from processed_docs, which holds one list of tokens per speech
dictionary = Dictionary(processed_docs)
# Drop tokens that appear in fewer than 2 speeches or in more than 50% of speeches,
# then keep at most the 1000 most frequent remaining tokens
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=1000)
# Convert each speech to a bag of words: a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
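
To make the filtering and bag-of-words steps concrete, here is a minimal plain-Python sketch on toy documents. It mimics the idea behind `filter_extremes` and `doc2bow` and is not gensim's implementation; the helper names `filter_vocab` and `to_bow` and the toy tokens are hypothetical, chosen only for illustration.

```python
from collections import Counter


def filter_vocab(docs, no_below=2, no_above=0.5, keep_n=1000):
    """Keep tokens by document frequency, in the spirit of filter_extremes."""
    # Document frequency: the number of documents each token appears in
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent first; ties broken alphabetically (an assumption of this sketch)
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return {tok: i for i, tok in enumerate(kept[:keep_n])}


def to_bow(doc, token2id):
    """Count the surviving tokens of one document as (token_id, count) pairs."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[t], c) for t, c in counts.items())


toy_docs = [
    ["tariff", "farmer", "nation", "tariff"],
    ["budget", "nation", "worker"],
    ["farmer", "nation", "loan"],
    ["budget", "nation", "tonight"],
]
token2id = filter_vocab(toy_docs)
bow = to_bow(toy_docs[0], token2id)
print(token2id)  # "nation" is in every document, so it is dropped as too common
print(bow)
```

Tokens that occur in a single document ("tariff", "loan") are dropped as too rare, and "nation" is dropped as too common, so only tokens with a middling document frequency reach the model.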

In [32]:
# Train the LDA model with 18 topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=18, random_state=42, passes=10)

In [47]:
lda_model.print_topics(-1)

[(0,
  '0.010*"island" + 0.005*"mexico" + 0.004*"convention" + 0.004*"canal" + 0.004*"spain" + 0.004*"port" + 0.004*"chinese" + 0.004*"article" + 0.003*"admit" + 0.003*"corporation"'),
 (1,
  '0.011*"method" + 0.010*"board" + 0.009*"agricultural" + 0.009*"cent" + 0.008*"farmer" + 0.008*"project" + 0.008*"tariff" + 0.007*"committee" + 0.007*"loan" + 0.007*"conference"'),
 (2,
  '0.028*"americans" + 0.022*"tonight" + 0.012*"today" + 0.011*"thank" + 0.011*"budget" + 0.010*"percent" + 0.010*"program" + 0.009*"challenge" + 0.009*"worker" + 0.009*"hard"'),
 (3,
  '0.010*"cent" + 0.009*"june" + 0.008*"indian" + 0.007*"pension" + 0.006*"method" + 0.006*"indians" + 0.005*"mail" + 0.005*"postal" + 0.005*"amount" + 0.005*"bond"'),
 (4,
  '0.007*"gold" + 0.005*"note" + 0.004*"wrong" + 0.004*"currency" + 0.004*"bond" + 0.003*"island" + 0.003*"reserve" + 0.003*"convention" + 0.003*"spain" + 0.003*"americans"'),
 (5,
  '0.007*"americans" + 0.006*"program" + 0.006*"tonight" + 0.005*"today" + 0.003*"bi
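
If the top words are needed as a plain list rather than a formatted string, one option is to parse the `weight*"word"` format shown above. The sketch below is a hypothetical helper (`parse_topic` is not part of gensim) applied to a shortened copy of the topic 0 string; calling `lda_model.show_topic(topic_id, topn=10)` should give similar `(word, weight)` pairs directly.

```python
import re

# Matches one term of the form 0.010*"island"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a print_topics-style string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


example = '0.010*"island" + 0.005*"mexico" + 0.004*"convention"'
pairs = parse_topic(example)
words = [word for word, _ in pairs]
print(words)
```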

In [51]:
# print the topic distribution for the first speech
first_speech = corpus[0]
topic_dist = lda_model.get_document_topics(first_speech)
print(topic_dist)

[(2, np.float32(0.9988718))]
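
Only one topic is listed because `get_document_topics` omits topics whose probability falls below the model's `minimum_probability` threshold (0.01 by default). The sketch below expands such a sparse result into a full vector with one entry per topic; `to_dense` is a hypothetical helper, and the input is copied from the output above.

```python
def to_dense(sparse_topics, num_topics):
    """Expand [(topic_id, prob), ...] into a list with one entry per topic."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = float(prob)
    return dense


sparse = [(2, 0.9988718)]  # copied from the output above
dense = to_dense(sparse, num_topics=18)

# Probability mass spread over the 17 topics that fell below the threshold
leftover = 1.0 - sum(dense)
print(dense[2], round(leftover, 7))
```

To see the small probabilities themselves rather than zeros, `lda_model.get_document_topics(first_speech, minimum_probability=0)` can be used instead.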


In [52]:
# make a visualization using pyLDAvis
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)

### BERTopic

- Train a BERTopic model with a `min_topic_size` of 3. *Hint: instantiate the model with `BERTopic` and pass `min_topic_size` there. Then fit the model using `fit_transform`, with `docs` passed in.*
- Output the top 10 words for each topic.
- Output the topic distribution for the first speech
- Make a visualization of the topics (see `topic_model.visualize_topics()`)
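
The steps above can be sketched roughly as follows. This is one possible outline under stated assumptions, not a reference solution: the helper names `fit_bertopic` and `first_doc_distribution` are hypothetical, English stop words are assumed to be the ones to remove, and the imports are placed inside the function so that defining the helpers does not load the heavy dependencies.

```python
def fit_bertopic(docs, min_topic_size=3):
    """Fit BERTopic on raw documents, then rebuild topic words without stop words."""
    # Imported lazily because bertopic pulls in heavy dependencies
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    topic_model = BERTopic(min_topic_size=min_topic_size)
    topics, probs = topic_model.fit_transform(docs)

    # Recompute the topic representations with English stop words removed
    vectorizer = CountVectorizer(stop_words="english")
    topic_model.update_topics(docs, vectorizer_model=vectorizer)
    return topic_model, topics


def first_doc_distribution(topic_model, docs):
    """Approximate per-topic weights for every document; return the first row."""
    topic_distr, _ = topic_model.approximate_distribution(docs)
    return topic_distr[0]
```

With `docs` defined, `topic_model, topics = fit_bertopic(docs)` fits the model, `topic_model.get_topic_info()` summarizes the topics and their top words, and `topic_model.visualize_distribution(first_doc_distribution(topic_model, docs))` plots the distribution for the first speech.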

In [9]:
docs = sou['Text'].to_list()

NameError: name 'sou' is not defined

In [10]:
# train the model - this takes about 30 seconds

# remove stop words from the topics (Hint: use CountVectorizer and then .update_topics on topic_model)

In [11]:
# output the top 10 words for each topic - hint see get_topic_info

In [12]:
# output the topic distribution for the first speech
# hint: check out approximate_distribution() and visualize_distribution()

In [13]:
# run this cell to visualize the topics
topic_model.visualize_topics()

NameError: name 'topic_model' is not defined