# LDA Topic Modeling
* Notebook by Adam Lang
* Date: 8/5/2024

# Overview
* In this notebook I will go over implementing LDA (Latent Dirichlet Allocation) topic modeling in Python, using the implementation that ships with sklearn.

## Notes
* First we will import LDA from sklearn.
* For feature extraction we will use `CountVectorizer`. However, you can use `TfidfVectorizer` as well.
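   * The `CountVectorizer` simply counts how many times each word occurs in each document. By default it lowercases the text and drops single-character tokens such as "I" and "a".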

In [1]:
## imports
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# instantiate a CountVectorizer() object
cvectorizer = CountVectorizer()

In [2]:
## Create a test corpus
corpus = ["I love cooking", "I have prepared a cake today", "He is going to a new place", "He will learn cooking there"]

In [3]:
# fit_transform the corpus with count vectorizer
cvz = cvectorizer.fit_transform(corpus)

In [4]:
## let's see the output
cvz

<4x15 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

Summary:
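* We now have a sparse matrix with 4 rows (one per document) and 15 columns (one per unique word in the vocabulary).
* Each stored element is the count of a word in a document. The 17 stored elements are the non-zero counts.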

In [5]:
## we now create a vocabulary array to represent the corpus
vocab = cvectorizer.get_feature_names_out()
vocab

array(['cake', 'cooking', 'going', 'have', 'he', 'is', 'learn', 'love',
       'new', 'place', 'prepared', 'there', 'to', 'today', 'will'],
      dtype=object)
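
Notice that the vocabulary has no `i` or `a`, even though both occur in the corpus. As far as I can tell this comes from the `CountVectorizer` defaults: the text is lowercased and only tokens of two or more word characters are kept. Below is a rough pure-Python sketch of that behavior. The helper `tokenize` and the names `vocab_sketch` and `matrix` are my own for illustration, not part of sklearn.

```python
import re

corpus = [
    "I love cooking",
    "I have prepared a cake today",
    "He is going to a new place",
    "He will learn cooking there",
]

# Assumed to mirror CountVectorizer's defaults: lowercase the text and
# keep only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Hypothetical helper, not an sklearn function."""
    return TOKEN_PATTERN.findall(text.lower())

# Sorted unique tokens form the vocabulary (one column per term)
vocab_sketch = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document, one count per vocabulary term
matrix = [[tokenize(doc).count(term) for term in vocab_sketch] for doc in corpus]

n_nonzero = sum(1 for row in matrix for count in row if count > 0)
print(len(matrix), "x", len(vocab_sketch), "with", n_nonzero, "non-zero counts")
# 4 x 15 with 17 non-zero counts
```

This reproduces the 4 x 15 shape and the 17 stored elements reported for `cvz`.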

## Building the LDA topic model
* `n_components` is the number of topics we want to extract.
* `max_iter` is the maximum number of passes over the training data that the LDA algorithm will make.
* `random_state` seeds the random number generator so we can replicate our results.
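
To see what `random_state` buys us, here is a small sketch that fits the same model twice with the same seed and once with a different one. It uses this notebook's corpus. The helper `fit_topics` and the seed 99 are only for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love cooking",
    "I have prepared a cake today",
    "He is going to a new place",
    "He will learn cooking there",
]
counts = CountVectorizer().fit_transform(corpus)

def fit_topics(seed):
    """Hypothetical helper: fit a 3-topic LDA model, return its topic-word matrix."""
    lda = LatentDirichletAllocation(n_components=3, max_iter=20, random_state=seed)
    lda.fit(counts)
    return lda.components_

same_a = fit_topics(20)
same_b = fit_topics(20)
other = fit_topics(99)

# Same seed: the two fits should be identical
print("same seed matches:", np.allclose(same_a, same_b))
# Different seed: the topics may differ or come out in a different order
print("different seed matches:", np.allclose(same_a, other))
```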

In [6]:
## LDA model instantiate
lda_model = LatentDirichletAllocation(n_components = 3, max_iter = 20, random_state=20)

# Fit LDA model on data
X_topics = lda_model.fit_transform(cvz)

# get topic components
topic_words = lda_model.components_

In [7]:
## print the word weights for the first topic (row 0 of components_)
topic_words[0]

array([0.33409872, 1.3520179 , 0.33426983, 0.33409872, 0.3344864 ,
       0.33426983, 0.33484162, 1.33184251, 0.33426983, 0.33426983,
       0.33409872, 0.33484162, 0.33426983, 0.33409872, 0.33484162])
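
Note that these values are not probabilities. Each row of `components_` can be read as unnormalized weights (pseudo-counts) for how strongly each word is associated with the topic. A minimal sketch of turning the row above into a word distribution, with the printed values copied in by hand (so they are rounded):

```python
import numpy as np

vocab = np.array(['cake', 'cooking', 'going', 'have', 'he', 'is', 'learn', 'love',
                  'new', 'place', 'prepared', 'there', 'to', 'today', 'will'])

# Values copied from the topic_words[0] output above
topic0 = np.array([0.33409872, 1.3520179, 0.33426983, 0.33409872, 0.3344864,
                   0.33426983, 0.33484162, 1.33184251, 0.33426983, 0.33426983,
                   0.33409872, 0.33484162, 0.33426983, 0.33409872, 0.33484162])

# Divide by the row total so the weights sum to 1
topic0_probs = topic0 / topic0.sum()

# Word with the highest probability under this topic
best = topic0_probs.argmax()
print(vocab[best], round(float(topic0_probs[best]), 3))
```

For the fitted model itself, the same idea applies row by row: divide each row of `lda_model.components_` by its own sum.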

## Extracting topic words
* We can extract a specific number of words per topic.

In [8]:
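# number of words to show per topic
n_top_words = 3

for i, topic_dist in enumerate(topic_words):
    # vocabulary indices sorted by ascending weight for this topic
    sorted_topic_dist = np.argsort(topic_dist)
    # reverse the order and keep the n_top_words highest-weighted words;
    # the stop index is exclusive, so it has to be -(n_top_words + 1)
    top_words = np.array(vocab)[sorted_topic_dist][:-(n_top_words + 1):-1]
    print("Topic", str(i+1), top_words)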

Topic 1 ['cooking' 'love' 'will']
Topic 2 ['today' 'prepared' 'have']
Topic 3 ['he' 'to' 'place']
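
One detail worth spelling out: a reversed slice `seq[:-k:-1]` returns `k - 1` items, not `k`, because the stop index is excluded. Here is a small sketch with a hypothetical helper `top_n_words` and made-up weights (not output from the model above) that avoids the off-by-one by reversing first and then taking the first `n` items.

```python
import numpy as np

def top_n_words(weights, vocab, n):
    """Hypothetical helper: the n highest-weighted words, best first."""
    order = np.argsort(weights)[::-1]  # indices from largest to smallest weight
    return [str(vocab[i]) for i in order[:n]]

# Made-up example values, for illustration only
toy_vocab = np.array(["cake", "cooking", "love", "today"])
toy_weights = np.array([0.3, 1.4, 1.3, 0.5])

print(top_n_words(toy_weights, toy_vocab, 3))  # ['cooking', 'love', 'today']

# The reversed-slice form needs n + 1 as the stop to return n items
print(len(toy_vocab[:-3:-1]))        # 2
print(len(toy_vocab[:-(3 + 1):-1]))  # 3
```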


## Extracting Topics by Document

In [9]:
# LDA model transform using count vectorizer -- assign to doc_topic
doc_topic = lda_model.transform(cvz)

# iterate over all documents
for n in range(doc_topic.shape[0]):
    # index of the most probable topic (0-based, so 0 is "Topic 1" above)
    topic_doc = doc_topic[n].argmax()
    print("Document", n+1, " --- Topic:", topic_doc)

Document 1  --- Topic: 0
Document 2  --- Topic: 1
Document 3  --- Topic: 2
Document 4  --- Topic: 2
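
Taking only the `argmax` hides how confident each assignment is. Here is a sketch that prints the full topic distribution per document. It re-fits the same model so the block runs on its own, and the name `doc_topic_probs` is just for illustration. Each row should sum to 1.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love cooking",
    "I have prepared a cake today",
    "He is going to a new place",
    "He will learn cooking there",
]
counts = CountVectorizer().fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=3, max_iter=20, random_state=20)
doc_topic_probs = lda.fit_transform(counts)  # shape: (documents, topics)

for n, probs in enumerate(doc_topic_probs):
    # Round for display only; topic indices are 0-based
    print("Document", n + 1, np.round(probs, 3), "--- Topic:", probs.argmax())
```

With documents this short, the winning topic's probability is often not far ahead of the others, so the hard assignment is worth treating with some caution.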


# Summary

**Pros of LDA**
- Able to handle large datasets and can be easily parallelized
- Can infer a topic distribution for a new, unseen document, thanks to the Dirichlet prior over document-topic distributions
- Topics are open to human interpretation
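
As a rough sketch of the second point, a fitted model can score a document it has never seen. The sentence below is made up for illustration. Words that are not in the fitted vocabulary are simply ignored by the vectorizer.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love cooking",
    "I have prepared a cake today",
    "He is going to a new place",
    "He will learn cooking there",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=3, max_iter=20, random_state=20)
lda.fit(counts)

# Made-up unseen document; reuse the fitted vocabulary via transform()
new_docs = ["She will love the new cake"]
new_counts = vectorizer.transform(new_docs)

# One row per new document, one probability per topic
new_doc_topics = lda.transform(new_counts)
print(np.round(new_doc_topics, 3))
```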

**Cons of LDA**
- Can be computationally expensive, especially on large corpora or with many topics
- Often does not work well for short documents
- The number of topics must be known/set beforehand
- The bag-of-words approach disregards the semantic representation of words in a corpus, similar to LSA and pLSA
- Estimation of the Bayesian parameters relies on the assumption that the documents used are exchangeable
- LDA requires an extensive pre-processing phase to obtain a meaningful representation of the textual input data
- Some studies report that LDA topic modeling may yield topics that are too general (Rizvi et al., 2019) or irrelevant (Alnusyan et al., 2020).
   * Results may also be inconsistent across different executions (Egger et al., 2021).
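
Since the number of topics has to be chosen up front, one common (if imperfect) heuristic is to fit several candidate values and compare a score such as perplexity. The sketch below only shows the mechanics. On a four-sentence corpus the numbers mean very little, and the perplexity here is computed on the training data itself rather than on held-out documents. The candidate list is an arbitrary choice.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love cooking",
    "I have prepared a cake today",
    "He is going to a new place",
    "He will learn cooking there",
]
counts = CountVectorizer().fit_transform(corpus)

# Arbitrary candidate topic counts, for illustration only
candidates = [2, 3, 4]
perplexities = {}

for k in candidates:
    lda = LatentDirichletAllocation(n_components=k, max_iter=20, random_state=20)
    lda.fit(counts)
    # Lower perplexity is generally read as a better fit
    perplexities[k] = lda.perplexity(counts)

for k, value in perplexities.items():
    print(k, "topics -> perplexity", round(value, 2))
```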


In summary, we implemented LDA topic modeling, a classical statistical topic modeling technique, using sklearn. However, newer SOTA models such as BERTopic, which use transformers, are viable alternatives that consider semantic context and may be less prone to the problems listed above.