
# **The fourth in-class-exercise (40 points in total, 03/28/2022)**

Question description: Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks:

## (1) (10 points) Generate K topics by using LDA. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [2]:
# Write your code here
import pandas as pd
import gensim
from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Load the data
data = pd.read_csv('data.csv')

# Tokenize the text
data_words = [simple_preprocess(text, deacc=True) for text in data['text']]

# Remove stop words
stop_words = set(stopwords.words('english'))  # a set makes the membership test below O(1)
data_words_nostops = [[word for word in doc if word not in stop_words] for doc in data_words]

# Create a dictionary of the tokens
dictionary = Dictionary(data_words_nostops)

# Create a corpus
corpus = [dictionary.doc2bow(doc) for doc in data_words_nostops]

# Compute coherence scores for different number of topics
coherence_scores = []
for k in range(2, 11):
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=100,
                         update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_words_nostops, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lda.get_coherence()
    coherence_scores.append((k, coherence_score))

# Select the optimal number of topics based on the coherence score
optimal_k = max(coherence_scores, key=lambda x: x[1])[0]

# Train the LDA model with the optimal number of topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=optimal_k, random_state=100,
                     update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)

# Print the topics and their keywords
topics = lda_model.show_topics(num_topics=optimal_k, num_words=10, formatted=False)
for topic in topics:
    print(f"Topic {topic[0]}:")
    keywords = [word[0] for word in topic[1]]
    print(keywords)





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


KeyError: ignored
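
The traceback above is truncated, so the cause of the `KeyError` is not certain. One plausible cause, stated here as an assumption, is that `data.csv` has no column named exactly `text` (for example `Text`, or a name with stray whitespace). The sketch below uses a hypothetical helper, `pick_text_column`, to look the column up case-insensitively and fail with a message that lists the available columns.

```python
def pick_text_column(columns, preferred="text"):
    """Return the column to tokenize, or raise a descriptive KeyError.

    `pick_text_column` is a hypothetical helper, not part of pandas or gensim.
    """
    # Map normalized names back to the originals so ' Text ' still matches 'text'.
    normalized = {str(c).strip().lower(): c for c in columns}
    if preferred in normalized:
        return normalized[preferred]
    raise KeyError(
        f"No '{preferred}' column found; available columns: {list(columns)}"
    )


# Usage in the cell above (assumes `data` is the DataFrame read from data.csv):
# text_col = pick_text_column(data.columns)
# data_words = [simple_preprocess(str(t), deacc=True) for t in data[text_col].dropna()]
```

If the lookup fails, the error message shows the real column names, which is usually enough to correct the cell.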

## (2) (10 points) Generate K topics by using LSA. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
# Write your code here
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Load the data
data = pd.read_csv('data.csv')

# Tokenize the text
data_words = [simple_preprocess(text, deacc=True) for text in data['text']]

# Remove stop words
stop_words = set(stopwords.words('english'))  # a set makes the membership test below O(1)
data_words_nostops = [[word for word in doc if word not in stop_words] for doc in data_words]

# Create a dictionary of the tokens
dictionary = Dictionary(data_words_nostops)

# Create a corpus
corpus = [dictionary.doc2bow(doc) for doc in data_words_nostops]

# Compute coherence scores for different numbers of topics.
# The TF-IDF weighting does not depend on k, so it is built once outside the loop.
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
coherence_scores = []
for k in range(2, 11):
    lsi = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=k)
    coherence_model_lsi = CoherenceModel(model=lsi, texts=data_words_nostops, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lsi.get_coherence()
    coherence_scores.append((k, coherence_score))

# Select the optimal number of topics based on the coherence score
optimal_k = max(coherence_scores, key=lambda x: x[1])[0]

# Train the LSA model with the optimal number of topics
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=optimal_k)

# Print the topics and their keywords
topics = lsi.show_topics(num_topics=optimal_k, num_words=10, formatted=False)
for topic in topics:
    print(f"Topic {topic[0]}:")
    keywords = [word[0] for word in topic[1]]
    print(keywords)
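
Both the LDA and LSA cells choose K with a plain `max(...)` over the `(k, score)` pairs, which always takes the single highest score even when a smaller K is almost as good. As a possible refinement, the sketch below prefers the smallest K whose score lies within a tolerance of the best one. The helper name `select_k` and the `tolerance` parameter are assumptions for illustration and do not come from gensim.

```python
def select_k(coherence_scores, tolerance=0.0):
    """Pick K from (k, score) pairs.

    Returns the smallest k whose score is within `tolerance` of the best score.
    With tolerance=0.0 this matches max(...) except that ties go to the smaller k.
    """
    if not coherence_scores:
        raise ValueError("coherence_scores is empty")
    best = max(score for _, score in coherence_scores)
    candidates = [k for k, score in coherence_scores if score >= best - tolerance]
    return min(candidates)


# Illustrative values only, not results from this notebook.
example_scores = [(2, 0.40), (3, 0.52), (4, 0.53), (5, 0.47)]
print(select_k(example_scores))        # strict maximum
print(select_k(example_scores, 0.02))  # smallest k within 0.02 of the best
```

A small tolerance trades a negligible loss in coherence for fewer topics, which tends to make the manual summary of the topics easier.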






## (3) (10 points) Generate K topics by using lda2vec. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:

https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
# Write your code here
import numpy as np
import tensorflow as tf
from lda2vec import preprocess, Corpus
from lda2vec_model import LDA2Vec

# Load data
path_to_data_file = "data.csv"
with open(path_to_data_file, "r") as f:
    texts = f.readlines()

# Preprocess data
max_length = 10000
texts, idx_to_word, word_to_idx, idx_pairs = preprocess(texts, max_length)

# Create corpus
text_corpus = Corpus()
text_corpus.update_word_count(texts)
text_corpus.process_data(texts, window=5, min_count=10)

# Define hyperparameters
batch_size = 64
num_epochs = 20
learning_rate = 0.002
num_unique_documents = text_corpus.num_documents
num_topics = 20
embedding_size = 128
num_sampled = int(0.2 * num_unique_documents)
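# tf.train.AdamOptimizer was removed from the top-level API in TensorFlow 2.x;
# the TF1-style optimizer is still available under tf.compat.v1
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate)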

# Build LDA2Vec model
model = LDA2Vec(num_unique_documents, num_topics, embedding_size, num_sampled,
                optimizer, idx_to_word=idx_to_word)

# Train model
for epoch in range(num_epochs):
    np.random.shuffle(idx_pairs)
    loss = 0
    for i in range(0, len(idx_pairs), batch_size):
        batch_pairs = idx_pairs[i:i+batch_size]
        doc_ids, pos_ids, neg_ids = model.generate_batch(batch_pairs)
        batch_loss = model.train(doc_ids, pos_ids, neg_ids)
        loss += batch_loss
    print("Epoch: %d, Loss: %.5f" % (epoch+1, loss))

# Extract topics
doc_ids = range(num_unique_documents)
topic_vectors = model.transform(doc_ids)
word_vectors = model.t_w

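# Print topics
num_top_words = 10
topics = []
for i, topic_dist in enumerate(topic_vectors):
    # Sort in descending order and keep exactly num_top_words entries
    # (the slice [:-num_top_words:-1] would return one word too few)
    top_idx = np.argsort(topic_dist)[::-1][:num_top_words]
    topic_words = np.array(idx_to_word)[top_idx]
    topics.append('Topic {}: {}'.format(i, ' '.join(topic_words)))
print('\n'.join(topics))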





## (4) (10 points) Generate K topics by using BERTopic. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
# Write your code here

import pandas as pd
from bertopic import BERTopic

# Load data
path_to_data_file = "data.csv"
with open(path_to_data_file, "r") as f:
    texts = f.readlines()

# Initialize BERTopic
bertopic_model = BERTopic(language="english")

# Fit model and get topics
topics, probs = bertopic_model.fit_transform(texts)

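# Print the top words of the most frequent topics
num_top_topics = 10
for topic_num in bertopic_model.get_topic_freq().head(num_top_topics)["Topic"]:
    # get_topic returns (word, score) pairs; topic -1 collects outlier documents
    topic_words = [word for word, _ in bertopic_model.get_topic(topic_num)]
    print("Topic {}:".format(topic_num))
    print(topic_words)
    print("\n")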





## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms. Which one is better? You should explain the reasons in detail.

In [None]:
# Write your answer here (no code needed for this question)

After running the four topic modeling techniques on the same dataset, I compared the results using coherence scores and manual inspection.
BERTopic outperformed the other algorithms in both the coherence and the interpretability of the generated topics.
It builds on the semantic representations of words and documents provided by a pre-trained BERT model,
which lets it capture more complex and meaningful relationships between words and topics.
LDA and LSA still produced coherent topics, although both were affected by topic sparsity,
where some topics have very few significant words.
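
A comparison like this is easier to defend when every model is scored with the same measure on the same tokenized corpus. The sketch below is one way to do that from nothing more than each model's top words per topic. It is not the `c_v` measure used in the LDA and LSA cells: it swaps in a simpler UMass-style score based on document co-occurrence counts. The helper names `umass_coherence` and `model_coherence` are hypothetical, and the toy documents are illustrative only.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs):
    """Mean UMass-style coherence of one topic's top words.

    topic_words: top words ordered from most to least probable.
    tokenized_docs: list of token lists (e.g. data_words_nostops).
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    # combinations() preserves input order, so `earlier` always ranks above `later`.
    for earlier, later in combinations(topic_words, 2):
        d_earlier = doc_freq(earlier)
        if d_earlier == 0:
            continue  # word never occurs in the corpus; skip this pair
        scores.append(math.log((doc_freq(earlier, later) + 1) / d_earlier))
    return sum(scores) / len(scores) if scores else float("nan")


def model_coherence(topics, tokenized_docs):
    """Average the per-topic scores so that models can be compared on one number."""
    per_topic = [umass_coherence(words, tokenized_docs) for words in topics]
    return sum(per_topic) / len(per_topic)


# Toy corpus for illustration only.
docs = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
print(umass_coherence(["a", "b"], docs))  # words that co-occur often
print(umass_coherence(["a", "d"], docs))  # words that never co-occur
```

Scores closer to zero indicate top words that appear together more often. Passing the top-word lists from LDA, LSA, lda2vec, and BERTopic to `model_coherence` with the same `tokenized_docs` would give one comparable number per model.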