# Topic Modeling and Visualization

This notebook fits topic models with `sklearn` and visualizes them with `pyLDAvis`, which works with the results of both `sklearn` and `gensim`.

<a id="sec1"></a>
## 1. Prepare the notebook materials

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

Now install `pyLDAvis` if it is not already available (for example with `pip install pyLDAvis`).

Import the needed packages and enable pyLDAvis to work inside the notebook:

In [7]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

In [8]:
import pandas

### Load city article dataset

In [10]:
accra = pandas.read_excel("LexisUni_articles/processed/accra_articles.xlsx")
 
# Display the first rows of the Excel file
accra.head()

Unnamed: 0,full_article,date
0,\n\n\n\r\nAccra Technical University to offer ...,2018-12-08
1,\n\n\n\r\nMeteorological Agency warns of rains...,2021-12-25
2,\n\n\n\r\nMining; First Ecowas Mining and Petr...,2016-01-25
3,\n\n\n\r\nMövenpick Ambassador Hotel Accra: Po...,2019-12-19
4,\n\n\n\r\nPHOTOS: Accra Floods After Heavy Dow...,2016-07-26


In [11]:
docs_raw = accra['full_article'].tolist()
print(len(docs_raw))

579


<a id="sec2"></a>
## 2. Convert to document-term matrices

Next, the raw documents are converted into a document-term matrix (DTM), first as raw term counts (with `CountVectorizer`) and then as TF-IDF values (with `TfidfVectorizer`).

We create two different vectorizers so that we can see how the choice of feature representation affects the resulting topic models.

In [12]:
# Create the TF vector representation; this only counts the terms in each document

tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)

dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(579, 2679)


Let's look at some of the features (i.e. words in the corpus):

In [14]:
tf_vectorizer.get_feature_names_out()[500:520]

array(['construct', 'constructed', 'construction', 'consultant',
       'consumers', 'consumption', 'contact', 'contain', 'contained',
       'containers', 'containing', 'contemporary', 'content',
       'contentservices', 'context', 'continent', 'continental',
       'continue', 'continued', 'continues'], dtype=object)

To compare the results of the TF and TF-IDF representations, both vectorizers should use the same parameters.
This is why we initialize the `TfidfVectorizer` with the parameters of the `CountVectorizer`.

Initialize and prepare Document-Term-Matrix:

In [15]:
# TF IDF = measures term importance relative to the other documents
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())

dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)

(579, 2679)




<a id="sec3"></a>
## 3. Fit Latent Dirichlet Allocation models

We will fit two LDA models: one on the TF matrix with 4 topics and one on the TF-IDF matrix with 3 topics.

In [36]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=4, random_state=0)
lda_tf.fit(dtm_tf)

lda_tf

In [31]:
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=3, random_state=0)
lda_tfidf.fit(dtm_tfidf)

Because a fitted LDA model consists of probability distributions (topics over documents and words over topics), the numbers are hard to read directly, so we will rely on visualization to learn more about the models.

<a id="sec4"></a>
## 4. Visualizing the models with pyLDAvis

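Note on the relevance metric: the lambda (λ) slider controls how the terms listed for a selected topic are ranked. With λ = 1, terms are ranked by their probability within the topic; lower values give more weight to terms that are distinctive for that topic compared with the corpus as a whole. Adjust λ to help interpret the different topics.

For details, see the WE1S guide to pyLDAvis: https://we1s.ucsb.edu/research/we1s-tools-and-software/topic-model-observatory/tmo-guide/tmo-guide-pyldavis/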
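As a rough illustration of what λ does, the sketch below follows the relevance definition from the LDAvis paper (Sievert and Shirley, 2014): relevance = λ · log p(term | topic) + (1 − λ) · log(p(term | topic) / p(term)). The probabilities are hypothetical numbers chosen for the example, and `relevance` is an illustrative helper, not a pyLDAvis function.

```python
import numpy as np


def relevance(topic_word, term_freq, lam):
    """Relevance of each term to each topic for a given lambda in [0, 1]."""
    log_prob = np.log(topic_word)               # log p(term | topic)
    log_lift = np.log(topic_word / term_freq)   # log of p(term | topic) / p(term)
    return lam * log_prob + (1 - lam) * log_lift


# Hypothetical numbers: one topic over three terms.
topic_word = np.array([[0.5, 0.3, 0.2]])  # p(term | topic)
term_freq = np.array([0.6, 0.3, 0.1])     # p(term) across the whole corpus

# Term indices sorted from most to least relevant.
rank_lam1 = np.argsort(-relevance(topic_word, term_freq, 1.0)[0])
rank_lam0 = np.argsort(-relevance(topic_word, term_freq, 0.0)[0])

print(rank_lam1)  # ranking by in-topic probability only
print(rank_lam0)  # ranking by lift only
```

With these numbers the most frequent term leads the ranking at λ = 1, while the rarest term, which is the most concentrated in the topic, leads at λ = 0.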

In [37]:
# the TF representation model

pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer)

In [34]:
# The TFIDF representation model
# not as helpful as TF-- could be due to French? Not sure

# pyLDAvis.lda_model.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

## Repeat process for Kumasi

In [35]:
kumasi = pandas.read_excel("LexisUni_articles/processed/kumasi_articles.xlsx")
kumasi_docs_raw = kumasi['full_article'].tolist()
print(len(kumasi_docs_raw))

334


In [38]:
# Create the TF vector representation
kumasi_tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)

kumasi_dtm_tf = kumasi_tf_vectorizer.fit_transform(kumasi_docs_raw)
print(kumasi_dtm_tf.shape)

(334, 2433)


In [None]:
# TF-IDF measures term importance relative to the other documents
kumasi_tfidf_vectorizer = TfidfVectorizer(**kumasi_tf_vectorizer.get_params())

# Use the Kumasi vectorizer and documents (not the Accra ones)
kumasi_dtm_tfidf = kumasi_tfidf_vectorizer.fit_transform(kumasi_docs_raw)
print(kumasi_dtm_tfidf.shape)