# Project 2: Reproducibility in Natural Language Processing

## Part 3: Advanced Text Processing - LDA and BERTopic Topic Modeling (20 pts)

In this notebook, we will compare two methods for creating topic models of the speeches we've been analyzing: Latent Dirichlet allocation (LDA) and BERTopic. To begin, we need to import our requisite packages.

### Imports
In **Part 2,** we downloaded spaCy's English language text processing model `en_core_web_sm` into our environment. If, for whatever reason, you have reached this point without downloading it, please do so now. With your `sotu` environment activated, run the following:

```
python -m spacy download en_core_web_sm
```

In [None]:
# imports
import pandas as pd
import spacy
from spacy import displacy
from bertopic import BERTopic
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.gensim_models
import tqdm

nlp = spacy.load("en_core_web_sm")

### Read Data

In [None]:
sou = pd.read_csv("data/SOTU.csv")

### LDA
LDA takes a "bag-of-words" approach, which makes it much more sensitive to text preprocessing than BERTopic. The function below uses spaCy to tokenize the text; reduce each token to its lemma, or dictionary form (e.g., "runs" and "running" become "run"); lowercase it; and remove stop words, punctuation, whitespace, and any token whose lemma is three characters or shorter.

In [None]:
def preprocess_text(text):
    # keep lowercased lemmas; drop stop words, punctuation, whitespace, and lemmas of 3 characters or fewer
    doc = nlp(text)
    return [token.lemma_.lower() for token in doc
            if not (token.is_stop or token.is_punct or token.is_space) and len(token.lemma_) > 3]

In [None]:
# Process all texts - note this takes ~ 5 minutes to run
processed_docs = sou['Text'].apply(preprocess_text)

In [None]:
# Build dictionary from processed_docs, which holds one list of tokens per speech
dictionary = Dictionary(processed_docs)
# drop tokens that appear in fewer than 5 speeches or in more than 50% of speeches
dictionary.filter_extremes(no_below=5, no_above=0.5)
# convert each speech into a bag of words: a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
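
To make the dictionary and corpus steps more concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It is not gensim's implementation: `toy_docs`, `filter_vocab`, and `to_bow` are hypothetical names made up for illustration, the thresholds are scaled down to fit four tiny documents, and the alphabetical token ids are an assumption (gensim assigns its own ids).

```python
from collections import Counter

# Hypothetical tokenized documents standing in for processed_docs
toy_docs = [
    ["nation", "economy", "nation", "peace"],
    ["nation", "economy", "treaty"],
    ["nation", "peace", "treaty"],
    ["nation", "railroad"],
]

def filter_vocab(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most a `no_above` fraction of docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(docs)
    return sorted(
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    )

def to_bow(doc, token2id):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

vocab = filter_vocab(toy_docs, no_below=2, no_above=0.5)
token2id = {tok: i for i, tok in enumerate(vocab)}  # alphabetical ids (assumption)
bow = to_bow(toy_docs[0], token2id)

print(vocab)
print(bow)
```

In this toy corpus, "nation" is dropped because it appears in every document and "railroad" is dropped because it appears in only one, so neither can help separate topics. Word order is discarded entirely, which is what "bag of words" means.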

In [None]:
# Train LDA model with 18 topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=18, passes=10, random_state=42)

In [None]:
print("\n--- State of the Union LDA Topics ---") 
for idx, topic in lda_model.print_topics(-1): 
    print(f"Topic: {idx} \nWords: {topic}\n")

In [None]:
# print the topic distribution for the first speech
lda_model[corpus[0]]
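
The cell above returns a list of `(topic_id, probability)` pairs, and gensim leaves out topics whose probability falls below the model's `minimum_probability`. As a rough sketch of how you might read such a result, the snippet below ranks the pairs and picks the dominant topic. The numbers in `example_distribution` are invented for illustration and are not output from our model; `rank_topics` is a hypothetical helper.

```python
# Hypothetical values shaped like the output of lda_model[corpus[0]]
example_distribution = [(2, 0.27), (7, 0.61), (11, 0.12)]

def rank_topics(distribution):
    """Sort (topic_id, probability) pairs from most to least probable."""
    return sorted(distribution, key=lambda pair: pair[1], reverse=True)

ranked = rank_topics(example_distribution)
dominant_topic, dominant_weight = ranked[0]

print(f"Dominant topic: {dominant_topic} (weight {dominant_weight:.2f})")
for topic_id, weight in ranked[1:]:
    print(f"Also present: topic {topic_id} (weight {weight:.2f})")
```

With the real model you could pass `lda_model[corpus[0]]` to the same helper and then look up the top words for the dominant topic in the printout above.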

In [None]:
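# make a visualization using pyLDAvis
from pathlib import Path
Path("outputs").mkdir(exist_ok=True)  # make sure the output folder exists before saving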
pyLDAvis.enable_notebook()
ldavis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(ldavis, 'outputs/lda_visualization.html')
ldavis

### BERTopic
BERTopic is better at handling the semantic richness (or the messiness) of natural language, so to start we don't need to do any text preprocessing. All we do is list each speech as a string.

In [None]:
docs = sou['Text'].to_list()

In [None]:
# train the model - this takes about 30 seconds
topic_model = BERTopic(min_topic_size=3)
topics, probs = topic_model.fit_transform(docs)
# remove stop words from the topics
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=10)
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
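
The `ngram_range=(1, 3)` setting above means topic labels can be single words, two-word phrases, or three-word phrases, built after stop words are removed. The pure-Python sketch below illustrates that idea only. It is not scikit-learn's tokenizer: `word_ngrams` is a hypothetical helper, `TOY_STOP_WORDS` is a tiny made-up stop list rather than the built-in English one, and splitting on whitespace is a simplifying assumption.

```python
# Tiny made-up stop list for illustration (not scikit-learn's English list)
TOY_STOP_WORDS = {"the", "of", "is", "and"}

def word_ngrams(text, ngram_range=(1, 3), stop_words=TOY_STOP_WORDS):
    """Drop stop words, then build every n-gram with n in ngram_range."""
    tokens = [tok for tok in text.lower().split() if tok not in stop_words]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams("The state of the union is strong")
print(grams)
```

Because stop words are removed first, a phrase such as "state union" can appear as a two-word term even though the original text reads "state of the union".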

In [None]:
# output a summary table of the topics: size, name, and top words for each
topic_model.get_topic_info()

In [None]:
# output the topic distribution for the first speech
topic_distr, _ = topic_model.approximate_distribution(docs)
first_speech_distr = topic_model.visualize_distribution(topic_distr[0])
first_speech_distr.write_html("outputs/first_speech_distr.html")
first_speech_distr

In [None]:
# run this cell to visualize the topics
bertopicvis = topic_model.visualize_topics()
bertopicvis.write_html("outputs/bertopicvis.html")
bertopicvis