# Topic Modeling and Visualization

This notebook fits topic models with `sklearn` and visualizes them with `pyLDAvis`, which works with the results of both `sklearn` and `gensim`.

<a id="sec1"></a>
## 1. Prepare the notebook materials

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

Import the needed packages and enable pyLDAvis to work inside the notebook:

In [2]:
import pandas
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

### Load headline and actor description dataset

In [6]:
actor_desc = pandas.read_csv("GPT_actor_blurbs.csv")

In [7]:
# Display the first 10 rows of the CSV file
actor_desc.head(10)

Unnamed: 0,title,actor_blurb
0,"CNN Soledad OBrien , Liberal Guest Team Up to ...",The primary actor in the headline is Soledad O...
1,A Critical Race Tale Told by Leftist Idiots Si...,The headline does not reference a specific pri...
2,Amazing Benefits of Becoming Salesforce Develo...,The headline does not explicitly reference a p...
3,The race is on ; Candidates address critical i...,"The primary actors in the headline ""The race i..."
4,Journalist declares one - man war against cri...,"The primary actor in the headline is a ""journa..."
5,Reminder : Critical race theory isnt scriptural,The headline does not reference a specific pri...
6,B . C . Court Of Appeal Weighs In On Legal Rep...,"The primary actor in the headline is the ""B.C...."
7,Continued Violence In Kenosha May Tighten 2020...,The headline does not reference a specific pri...
8,Sandia Labs Goes Nuclear On Employee Who Spark...,"The primary actor in the headline is ""Sandia L..."
9,Rebellion Against Critical Race Theory in the ...,"The primary actor in the headline ""Rebellion A..."


In [8]:
blurbs = actor_desc['actor_blurb'].tolist()
print(len(blurbs))

11704


<a id="sec2"></a>
## 2. Convert to document-term matrices

Next, the raw documents are converted into a document-term matrix (DTM), first as raw counts (with `CountVectorizer`) and then as TF-IDF values (with `TfidfVectorizer`).

We create two different vectorizers to see how much the choice of feature representation affects the modeling results.

In [9]:
# Create the TF (term-frequency) vector representation; this only counts the terms in each document

tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)

dtm_tf = tf_vectorizer.fit_transform(blurbs)
print(dtm_tf.shape)

(11704, 2188)


Let's look at some of the features (i.e. words in the corpus):

In [10]:
tf_vectorizer.get_feature_names_out()[500:520]

array(['created', 'creating', 'creation', 'creator', 'crenshaw',
       'criminal', 'crisis', 'criteria', 'critic', 'criticism',
       'criticisms', 'criticize', 'criticized', 'criticizes',
       'criticizing', 'critics', 'critique', 'critiques', 'critiquing',
       'crt'], dtype=object)

To compare the results of the TF and TF-IDF representations, both vectorizers need to use the same parameters.
This is why we initialize the `TfidfVectorizer` with the parameters of the `CountVectorizer`.

Initialize the vectorizer and build the document-term matrix:

In [13]:
# TF-IDF weights each term by its importance relative to the other documents
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())

dtm_tfidf = tfidf_vectorizer.fit_transform(blurbs)
print(dtm_tfidf.shape)

(11704, 2188)




<a id="sec3"></a>
## 3. Fit Latent Dirichlet Allocation models

We will fit two LDA models: one on the TF matrix with 4 topics, and one on the TF-IDF matrix with 3 topics.

In [14]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=4, random_state=0)
lda_tf.fit(dtm_tf)

lda_tf

In [15]:
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=3, random_state=0)
lda_tfidf.fit(dtm_tfidf)

Because an LDA model consists of probability distributions (each document over topics, each topic over words), we will focus on visualization to learn more about the fitted models.

<a id="sec4"></a>
## 4. Visualizing the models with pyLDAvis

Note on the relevance metric: lambda (λ) influences the order in which the words of each topic are sorted. Adjust λ to help with the interpretation of the different topics.

For details, see the WE1S guide to pyLDAvis: https://we1s.ucsb.edu/research/we1s-tools-and-software/topic-model-observatory/tmo-guide/tmo-guide-pyldavis/
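As a rough illustration of how λ changes the ordering, the sketch below computes the relevance score as defined by Sievert and Shirley (the metric pyLDAvis is based on): λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The terms and probabilities are made-up values for a single hypothetical topic, and the `relevance` helper is an illustrative function, not part of the pyLDAvis API.

```python
import numpy as np

def relevance(topic_word, term_prob, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    log_phi = np.log(topic_word)
    log_lift = log_phi - np.log(term_prob)
    return lam * log_phi + (1.0 - lam) * log_lift

# Made-up values for one hypothetical topic
terms = np.array(["school", "board", "crt", "senator"])
phi = np.array([0.40, 0.30, 0.20, 0.10])  # p(w | t): word probability within the topic
p_w = np.array([0.40, 0.35, 0.05, 0.20])  # p(w): word probability in the whole corpus

for lam in (1.0, 0.0):
    order = np.argsort(relevance(phi, p_w, lam))[::-1]
    print(f"lambda={lam}: {', '.join(terms[order])}")
```

With λ = 1 the terms are ranked purely by their probability within the topic (school, board, crt, senator). With λ = 0 they are ranked by lift, which favors terms that are distinctive for the topic even when they are less frequent (crt, school, board, senator).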

In [16]:
# the TF representation model

pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer)

In [17]:
# The TF-IDF representation model
# Not as helpful as TF; could be due to French? Not sure

pyLDAvis.lda_model.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)