# **The fourth in-class-exercise (40 points in total, 03/28/2022)**

Question description: Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks:

## (1) (10 points) Generate K topics using LDA, choosing the number of topics K by coherence score, then summarize the topics. You may refer to the code here:

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:


import re

import nltk
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel
from nltk.corpus import stopwords

nltk.download('stopwords')

# Three-sentence sample corpus.
document_text = ["The product is great...!",
    "I'm not satisfied with the product",
    "It's an average and regular product, nothing more special"]

stopWord = set(stopwords.words('english'))

def pre_process(text):
    # Lowercase, split on word characters, and drop English stop words.
    tokens_01 = re.findall(r'\w+', text.lower())
    return [word for word in tokens_01 if word not in stopWord]

processed_doc = [pre_process(doc) for doc in document_text]
dictionary = Dictionary(processed_doc)
corpus = [dictionary.doc2bow(doc) for doc in processed_doc]

# Fit one LDA model per candidate K and record its c_v coherence.
k_values = list(range(2, 11))
lda_models = []
coherence_scores = []
for k in k_values:
    # A fixed random_state keeps the fitted topics reproducible between runs.
    lda_model = LdaModel(corpus, num_topics=k, id2word=dictionary, passes=15, random_state=42)
    coherence_model = CoherenceModel(model=lda_model, texts=processed_doc, dictionary=dictionary, coherence='c_v')
    lda_models.append(lda_model)
    coherence_scores.append(coherence_model.get_coherence())

# Reuse the best-scoring model instead of refitting, so the printed topics
# are the ones that were actually scored.
best_index = coherence_scores.index(max(coherence_scores))
optimal_k = k_values[best_index]
optimal_lda_model = lda_models[best_index]
print(f"The number of optimal topics is: {optimal_k}")

for t in optimal_lda_model.print_topics(num_words=10):  # Adjust num_words as needed
    print(t)
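
The selection step above depends on a coherence score. As a rough illustration of what such a score measures, here is a minimal sketch of a simplified UMass-style coherence in plain Python. It is not gensim's `c_v` implementation; the function names `umass_coherence` and `pick_k`, the token lists (roughly what `pre_process` yields for the sample corpus), and the scores passed to `pick_k` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """Simplified UMass-style coherence for one topic's ordered top words.

    For every ordered pair (earlier, later) it averages
    log((co-document count + 1) / document count of the earlier word).
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


def pick_k(scores_by_k):
    """Return the K with the highest score (first one wins on ties)."""
    return max(scores_by_k, key=scores_by_k.get)


# Assumed token lists, approximating the pre-processed sample corpus.
docs = [
    ["product", "great"],
    ["satisfied", "product"],
    ["average", "regular", "product", "nothing", "special"],
]

# Words that co-occur in a document score higher than words that never do.
together = umass_coherence(["average", "regular"], docs)
apart = umass_coherence(["great", "satisfied"], docs)
print(together > apart)

# Hypothetical coherence values per K, for illustration only.
print(pick_k({2: 0.41, 3: 0.47, 4: 0.39}))
```

With only three short documents, most word pairs never co-occur, so any coherence measure has very little signal to separate one K from another.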





## (2) (10 points) Generate K topics using LSA, choosing the number of topics K by coherence score, then summarize the topics. You may refer to the code here:

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [17]:
!pip install pyLDAvis



In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

req_documents = ["The product is great...!",
    "I'm not satisfied with the product",
    "It's an average and regular product, nothing more special"]

tfidf_vectorizer = TfidfVectorizer(max_df=0.85, max_features=5000, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(req_documents)

# K is fixed here rather than chosen by coherence score.
num_topics = 4
lsa = TruncatedSVD(n_components=num_topics)
lsa_topic_matrix = lsa.fit_transform(tfidf_matrix)
terms = tfidf_vectorizer.get_feature_names_out()

# Print the highest-weighted terms of each component that was returned.
for i, component in enumerate(lsa.components_):
    top_terms = [terms[j] for j in np.argsort(component)[::-1][:10]]
    print(f"Topic {i+1}: {', '.join(top_terms)}")






Topic 1: great, satisfied, regular, average, special
Topic 2: satisfied, special, regular, average, great
Topic 3: special, regular, average, great, satisfied
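
The word "product" does not appear in any of the topics above even though every document contains it. A likely reason is the `max_df=0.85` setting, which ignores terms whose document frequency is strictly above the threshold. The sketch below recomputes document-frequency ratios in plain Python; the helper name `document_frequency_ratio` is hypothetical, and the token pattern is assumed to mirror the `TfidfVectorizer` default.

```python
import re

docs = [
    "The product is great...!",
    "I'm not satisfied with the product",
    "It's an average and regular product, nothing more special",
]

# Assumed to match TfidfVectorizer's default token_pattern:
# tokens of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def document_frequency_ratio(documents):
    """Map each term to the fraction of documents that contain it."""
    n_docs = len(documents)
    counts = {}
    for doc in documents:
        for term in set(TOKEN.findall(doc.lower())):
            counts[term] = counts.get(term, 0) + 1
    return {term: count / n_docs for term, count in counts.items()}


ratios = document_frequency_ratio(docs)

# Terms above the max_df threshold are dropped from the vocabulary.
dropped = sorted(term for term, ratio in ratios.items() if ratio > 0.85)
print(dropped)
```

On a corpus this small, a single shared word is enough to cross the threshold, so `max_df` should be chosen with the corpus size in mind.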


## (3) (10 points) Generate K topics using lda2vec, choosing the number of topics K by coherence score, then summarize the topics. You may refer to the code here:

https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [32]:
import nltk
import pyLDAvis
from nltk.tokenize import word_tokenize
from gensim.corpora import Dictionary

pyLDAvis.enable_notebook()
nltk.download('punkt')
req_documents = ['By analyzing Google search data using Google Trends, we measured the impact of highly publicized plastic surgery-related events on the interest level of the general population in specific search terms.',
             'Additionally, we investigated seasonal and geographic trends around interest in rhinoplasties, which is information that physicians and small surgical centers can use to optimize marketing decisions.',
             'A noticeable impact was observed in both celebrity cases on search term volume, and a seasonal effect is apparent for rhinoplasty searches. ',
             'As many surgeons already employ aggressive Internet marketing strategies, understanding and utilizing these trends could help optimize their investments, increase social engagement, and increase practice awareness by potential patients.']

tokenizedDocs = [word_tokenize(doc.lower()) for doc in req_documents]

dictionary = Dictionary(tokenizedDocs)

corpus = [dictionary.doc2bow(doc) for doc in tokenizedDocs]

from gensim.models.coherencemodel import CoherenceModel
from gensim.models import LdaModel

def compute_coherence_values(dictionary, corpus, tokenizedDocs, limit, start=2, step=1):
    """Fit one model per candidate K in range(start, limit, step) and score it with c_v coherence."""
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Note: this fits a plain gensim LdaModel, not an lda2vec model.
        model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherence_model = CoherenceModel(model=model, texts=tokenizedDocs, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherence_model.get_coherence())

    return model_list, coherence_values

model_list, coherence_values = compute_coherence_values(dictionary, corpus, tokenizedDocs, limit=10)

# Keep the model with the highest coherence score.
optimal_model = model_list[coherence_values.index(max(coherence_values))]
optimal_K = optimal_model.num_topics

def summarize_topics(model):
    # Print the five highest-weighted words of each topic.
    topics = model.print_topics(num_words=5)
    for topic in topics:
        print(topic)

summarize_topics(optimal_model)



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bhanuprasadkommula/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(0, '0.020*"the" + 0.016*"google" + 0.016*"of" + 0.016*"search" + 0.015*"surgery-related"')
(1, '0.055*"," + 0.048*"and" + 0.027*"marketing" + 0.027*"t
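
The topics above are dominated by tokens such as "the", "of", "and", and ",", because `word_tokenize` keeps punctuation and stop words. A minimal sketch of a token filter that could be applied before building the dictionary follows. The name `clean_tokens` and the short `STOP_WORDS` set are hypothetical; in practice a full stop-word list such as NLTK's would be used.

```python
# Hypothetical, deliberately short stop-word list for illustration.
STOP_WORDS = {"the", "of", "and", "in", "on", "a", "to", "is", "we", "by"}


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words and tokens that contain no letters or digits."""
    kept = []
    for tok in tokens:
        tok = tok.lower()
        if tok in stop_words:
            continue
        if not any(ch.isalnum() for ch in tok):
            continue  # pure punctuation such as "," or "."
        kept.append(tok)
    return kept


sample = ["the", "google", ",", "search", "of", "surgery-related", "."]
print(clean_tokens(sample))
```

Hyphenated terms such as "surgery-related" survive the filter because they contain alphanumeric characters.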

## (4) (10 points) Generate K topics using BERTopic, choosing the number of topics K by coherence score, then summarize the topics. You may refer to the code here:

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [33]:

!pip install bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Full 20 Newsgroups corpus; headers, footers, and quotes are kept.
data = fetch_20newsgroups(subset='all')['data']

topic_model = BERTopic(nr_topics="auto", calculate_probabilities=True, verbose=True)
topics, _ = topic_model.fit_transform(data)

topic_overview = topic_model.get_topic_freq()

# Skip the first row, which is expected to be the outlier topic (-1),
# then print the top five words of every remaining topic.
for topic_num, freq in topic_overview[1:].values:
    topic_words = topic_model.get_topic(topic_num)
    topic_summary = ", ".join([word[0] for word in topic_words[:5]])
    print(f"Topic {topic_num}: {topic_summary} (Freq: {freq})")


2023-11-05 22:50:17,846 - BERTopic - Transformed documents to Embeddings
2023-11-05 22:50:21,992 - BERTopic - Reduced dimensionality



2023-11-05 22:51:17,869 - BERTopic - Clustered reduced embeddings
2023-11-05 22:51:24,057 - BERTopic - Reduced number of topics from 366 to 249


Topic 0: game, team, he, games, players (Freq: 1586)
Topic 1: clipper, president, fbi, chip, government (Freq: 716)
Topic 2: god, jesus, that, you, is (Freq: 649)
Topic 3: gun, guns, militia, weapons, firearms (Freq: 395)
Topic 4: israel, israeli, jews, arab, palestinian (Freq: 386)
Topic 5: drive, drives, disk, mhz, hard (Freq: 321)
Topic 6: homosexuality, homosexual, gay, homosexuals, sex (Freq: 220)
Topic 7: turkish, armenian, armenians, armenia, were (Freq: 200)
Topic 8: radar, detector, detectors, ir, tempest (Freq: 187)
Topic 9: window, cursor, xterm, colormap, expose (Freq: 187)
Topic 10: windows, dos, nt, memory, 31 (Freq: 173)
Topic 11: drivers, card, diamond, driver, ati (Freq: 171)
Topic 12: address, internet, email, organization, mail (Freq: 167)
Topic 13: car, mustang, ford, toyota, convertible (Freq: 147)
Topic 14: sale, cds, cd, shipping, speakers (Freq: 120)
Topic 15: moon, billion, lunar, space, prize (Freq: 119)
Topic 16: sky, space, billboard, vandalizing, advertisin
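
The listing above starts at topic 0 because the first row of the frequency table was skipped by position. A slightly more defensive sketch filters the outlier topic by its ID instead, so the result does not depend on row order. The helper name `summarize_topic_freq` is hypothetical, and the outlier count in the sample rows is made up for illustration; the other two counts are taken from the output above.

```python
def summarize_topic_freq(rows, outlier_id=-1):
    """Drop the outlier topic by ID and sort the rest by frequency, largest first.

    rows is an iterable of (topic_id, count) pairs.
    """
    kept = [(int(topic), int(count)) for topic, count in rows if int(topic) != outlier_id]
    return sorted(kept, key=lambda row: row[1], reverse=True)


# (0, 1586) and (1, 716) come from the output above; the -1 count is hypothetical.
rows = [(0, 1586), (-1, 5000), (1, 716)]
print(summarize_topic_freq(rows))
```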

## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms. Which one is better? Explain the reasons in detail.

In [None]:
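LDA is efficient on sparse bag-of-words data and gives readable topics when K is kept small. Once stop words are removed, its topics tend to surface nouns and adjectives.

LSA is a linear decomposition (a truncated SVD of the TF-IDF matrix), so it is fast and straightforward, but it may not capture complex interactions. On the sample corpus in (2), all three topics contain the same five terms in a different order.

lda2vec combines LDA-style topic mixtures with word2vec-style word embeddings. It needs a sizeable corpus to learn useful embeddings, so it may not suit small datasets. The cell in (3) fits a gensim LdaModel, so its output reflects plain LDA rather than lda2vec.

BERTopic clusters transformer embeddings and can reduce the number of topics automatically (nr_topics="auto"); in (4) it went from 366 topics to 249. It is versatile, stable across domains, and supports hierarchical topics, but it can assign many documents to the outlier topic.

If you seek a balance between interpretability and performance, LDA is a suitable choice.

For larger datasets with diverse content, BERTopic can be highly effective, assuming computational resources are available.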
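
One simple way to put a number on this comparison, alongside coherence, is topic diversity: the share of unique words among all top words across topics. The sketch below is an illustrative metric rather than part of the original analysis; the function name `topic_diversity` is an assumption. The word lists are copied from the LSA output in (2) and from topics 0, 3, and 5 of the BERTopic output in (4).

```python
def topic_diversity(topics):
    """Fraction of unique words among all top words across topics (1.0 = no overlap)."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words) if all_words else 0.0


# Top terms copied from the LSA output in (2).
lsa_topics = [
    ["great", "satisfied", "regular", "average", "special"],
    ["satisfied", "special", "regular", "average", "great"],
    ["special", "regular", "average", "great", "satisfied"],
]

# Topics 0, 3, and 5 copied from the BERTopic output in (4).
bertopic_sample = [
    ["game", "team", "he", "games", "players"],
    ["gun", "guns", "militia", "weapons", "firearms"],
    ["drive", "drives", "disk", "mhz", "hard"],
]

print(round(topic_diversity(lsa_topics), 3))
print(topic_diversity(bertopic_sample))
```

The two runs use different corpora, so the numbers describe how much the printed topics overlap rather than giving a like-for-like ranking of the algorithms.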



