# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics by using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:
# Write your code here
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models
from gensim.models import CoherenceModel
import string

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Sample text data
documents = [
    "I love this product! It's amazing.",
    "The service was terrible. I won't recommend it.",
    "Not a bad experience, but could be better.",
    "😊 Great event! Enjoyed every moment.",
    "Disappointed with the quality. 😡"
]

def preprocess_text(text):
    tokens = word_tokenize(text)

    # Lowercasing and removing punctuation
    tokens = [word.lower() for word in tokens if word.isalnum()]

    # Removing stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

# Preprocess the documents
preprocessed_docs = [preprocess_text(doc) for doc in documents]

dictionary = corpora.Dictionary(preprocessed_docs)


corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]


coherence_scores = []
for k in range(2, 10):
    lda_model = models.LdaModel(corpus, num_topics=k, id2word=dictionary)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=preprocessed_docs, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    coherence_scores.append((k, coherence_lda))

best_k, best_coherence = max(coherence_scores, key=lambda x: x[1])
print(f"Best number of topics: {best_k} (Coherence Score: {best_coherence:.4f})")

# Train the LDA model with the best number of topics
best_lda_model = models.LdaModel(corpus, num_topics=best_k, id2word=dictionary)

# Print the topics
print("Topics:")
for idx, topic in best_lda_model.print_topics():
    print(f"Topic {idx + 1}: {topic}")



Best number of topics: 5 (Coherence Score: 0.5349)
Topics:
Topic 1: 0.125*"bad" + 0.125*"experience" + 0.125*"could" + 0.125*"better" + 0.125*"quality" + 0.124*"disappointed" + 0.021*"love" + 0.021*"amazing" + 0.021*"moment" + 0.021*"product"
Topic 2: 0.056*"disappointed" + 0.056*"quality" + 0.056*"product" + 0.056*"love" + 0.056*"moment" + 0.056*"could" + 0.056*"wo" + 0.055*"better" + 0.055*"recommend" + 0.055*"terrible"
Topic 3: 0.056*"disappointed" + 0.056*"love" + 0.056*"better" + 0.056*"product" + 0.056*"amazing" + 0.056*"quality" + 0.056*"moment" + 0.056*"recommend" + 0.056*"could" + 0.056*"bad"
Topic 4: 0.181*"amazing" + 0.181*"product" + 0.181*"love" + 0.031*"disappointed" + 0.031*"moment" + 0.031*"quality" + 0.031*"bad" + 0.031*"better" + 0.031*"wo" + 0.031*"could"
Topic 5: 0.095*"great" + 0.095*"event" + 0.095*"enjoyed" + 0.095*"every" + 0.095*"terrible" + 0.095*"wo" + 0.095*"service" + 0.095*"recommend" + 0.095*"moment" + 0.016*"disappointed"
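
One way to turn the weighted strings printed above into a short summary is to keep only the terms whose weight clears a threshold. The sketch below assumes the `weight*"word"` string format shown in the output; the helper name `top_terms` and the 0.1 cutoff are illustrative choices, not part of the exercise.

```python
import re

def top_terms(topic_string, min_weight=0.1):
    """Return the words in a printed topic string whose weight is at least min_weight."""
    # Each term looks like 0.181*"amazing"; capture the weight and the word.
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_string)
    return [word for weight, word in pairs if abs(float(weight)) >= min_weight]

# Topic 4 copied from the output above.
topic_4 = ('0.181*"amazing" + 0.181*"product" + 0.181*"love" + 0.031*"disappointed" + '
           '0.031*"moment" + 0.031*"quality" + 0.031*"bad" + 0.031*"better" + '
           '0.031*"wo" + 0.031*"could"')

print(top_terms(topic_4))
```

Applied to Topic 4, only "amazing", "product", and "love" pass the cutoff, which points to a positive product-review theme. Topics whose weights are all nearly equal (such as Topics 2 and 3) return nothing at this cutoff, a hint that they carry little distinct meaning.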


## Question 2 (10 Points)

**Generate K topics by using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
# Write your code here
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models
from gensim.models import CoherenceModel
from gensim.models.lsimodel import LsiModel
import string

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Sample text data
documents = [
    "I love this product! It's amazing.",
    "The service was terrible. I won't recommend it.",
    "Not a bad experience, but could be better.",
    "😊 Great event! Enjoyed every moment.",
    "Disappointed with the quality. 😡"
]

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text)

    # Lowercasing and removing punctuation
    tokens = [word.lower() for word in tokens if word.isalnum()]

    # Removing stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

# Preprocess the documents
preprocessed_docs = [preprocess_text(doc) for doc in documents]

dictionary = corpora.Dictionary(preprocessed_docs)

# Create a corpus: a list of bag-of-words representation of each document
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]


coherence_scores = []
for k in range(2, 10):
    lsi_model = LsiModel(corpus, num_topics=k, id2word=dictionary)
    coherence_model_lsi = CoherenceModel(model=lsi_model, texts=preprocessed_docs, dictionary=dictionary, coherence='c_v')
    coherence_lsi = coherence_model_lsi.get_coherence()
    coherence_scores.append((k, coherence_lsi))

# Select the number of topics with the highest coherence score
best_k, best_coherence = max(coherence_scores, key=lambda x: x[1])
print(f"Best number of topics: {best_k} (Coherence Score: {best_coherence:.4f})")

# Train the LSA model with the best number of topics
best_lsi_model = LsiModel(corpus, num_topics=best_k, id2word=dictionary)

# Print the topics
print("Topics:")
for idx, topic in best_lsi_model.print_topics():
    print(f"Topic {idx + 1}: {topic}")



Best number of topics: 3 (Coherence Score: 0.5349)
Topics:
Topic 1: -0.447*"enjoyed" + -0.447*"moment" + -0.447*"great" + -0.447*"event" + -0.447*"every" + -0.000*"disappointed" + -0.000*"quality" + -0.000*"amazing" + 0.000*"terrible" + 0.000*"could"
Topic 2: 0.474*"experience" + 0.474*"bad" + 0.474*"could" + 0.474*"better" + -0.158*"wo" + -0.158*"terrible" + -0.158*"service" + -0.158*"recommend" + 0.000*"disappointed" + 0.000*"quality"
Topic 3: -0.474*"recommend" + -0.474*"service" + -0.474*"terrible" + -0.474*"wo" + -0.158*"experience" + -0.158*"better" + -0.158*"bad" + -0.158*"could" + 0.000*"disappointed" + 0.000*"quality"
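
LSA weights are signed, so a topic is often easier to read when its terms are split into positive and negative loadings and near-zero terms are dropped. The sketch below assumes the printed `weight*"word"` format shown above; the helper name `split_by_sign` and the 0.05 cutoff are hypothetical choices for illustration. The overall sign of an LSA component is arbitrary, so what matters is which terms sit on opposite sides.

```python
import re

def split_by_sign(topic_string, min_abs=0.05):
    """Split the terms of a printed LSA topic into positive and negative loadings."""
    positive, negative = [], []
    for weight, word in re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_string):
        value = float(weight)
        if abs(value) < min_abs:
            continue  # skip terms that contribute almost nothing
        (positive if value > 0 else negative).append(word)
    return positive, negative

# Topic 2 copied from the output above.
topic_2 = ('0.474*"experience" + 0.474*"bad" + 0.474*"could" + 0.474*"better" + '
           '-0.158*"wo" + -0.158*"terrible" + -0.158*"service" + -0.158*"recommend" + '
           '0.000*"disappointed" + 0.000*"quality"')

positive, negative = split_by_sign(topic_2)
print("positive:", positive)
print("negative:", negative)
```

For Topic 2 this separates the words of the "bad experience, could be better" document from those of the "terrible service" document.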


## Question 3 (10 Points)

**Generate K topics by using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [48]:
from gensim import corpora, models
from gensim.models import CoherenceModel
from nltk.corpus import stopwords
import spacy
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# Sample text data
documents = [
    "I love this product! It's amazing.",
    "The service was terrible. I won't recommend it.",
    "Not a bad experience, but could be better.",
    "😊 Great event! Enjoyed every moment.",
    "Disappointed with the quality. 😡"
]

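nlp = spacy.load("en_core_web_sm")
# Note: this extends spaCy's stop-word list, but the filter below checks only the
# NLTK set, so punctuation and emoji tokens still reach the dictionary.
nlp.Defaults.stop_words |= {"!", ",", ".", "'", "?", "(", ")", ":", ";", "-", "😊", "😡"}
stop_words = set(stopwords.words('english'))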
preprocessed_docs = []
for doc in documents:
    tokens = nlp(doc)
    tokens = [token.text.lower() for token in tokens if token.text.lower() not in stop_words]
    preprocessed_docs.append(tokens)

dictionary = corpora.Dictionary(preprocessed_docs)
dictionary.filter_extremes(no_below=1, no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]

# Calculate coherence score for different numbers of topics
# Note: the model trained here is gensim's LdaModel; this cell does not train an lda2vec model.
coherence_scores = []
for num_topics in range(2, 11):
    lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=preprocessed_docs, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lda.get_coherence()
    coherence_scores.append((num_topics, coherence_score))

# Print coherence scores
print("Coherence Scores:")
for num_topics, score in coherence_scores:
    print(f"Number of Topics: {num_topics}, Coherence Score: {score}")

# Find the optimal number of topics based on coherence score
optimal_num_topics, optimal_coherence_score = max(coherence_scores, key=lambda x: x[1])
print(f"\nOptimal Number of Topics: {optimal_num_topics}, Coherence Score: {optimal_coherence_score}")

# Summarize topics for the optimal number of topics
lda_model = models.LdaModel(corpus, num_topics=optimal_num_topics, id2word=dictionary, passes=15)
print("\nTopics:")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

# Visualize the topics
lda_display = gensimvis.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)


Coherence Scores:
Number of Topics: 2, Coherence Score: 0.5008437441772385
Number of Topics: 3, Coherence Score: 0.493111586267488
Number of Topics: 4, Coherence Score: 0.5208188946378115
Number of Topics: 5, Coherence Score: 0.5214631592915472
Number of Topics: 6, Coherence Score: 0.508575902086989
Number of Topics: 7, Coherence Score: 0.5212548976531373
Number of Topics: 8, Coherence Score: 0.5264444838426028
Number of Topics: 9, Coherence Score: 0.5320430369597641
Number of Topics: 10, Coherence Score: 0.5302354902413622

Optimal Number of Topics: 9, Coherence Score: 0.5320430369597641

Topics:
Topic 0: 0.042*"quality" + 0.042*"!" + 0.042*"disappointed" + 0.042*"service" + 0.042*"event" + 0.042*"😡" + 0.042*"n't" + 0.042*"love" + 0.042*"better" + 0.042*"experience"
Topic 1: 0.196*"disappointed" + 0.196*"😡" + 0.196*"quality" + 0.020*"!" + 0.020*"," + 0.020*"could" + 0.020*"love" + 0.020*"bad" + 0.020*"terrible" + 0.020*"service"
Topic 2: 0.042*"!" + 0.042*"quality" + 0.042*"service" +
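
The coherence scores above differ only slightly, so taking the plain maximum can favor a large K for a negligible gain. As an alternative heuristic (an assumption for illustration, not what the cell above does), the sketch below picks the smallest K whose score is within a tolerance of the best one. The function name `smallest_k_near_best` and the 0.02 tolerance are hypothetical.

```python
# Coherence scores copied from the printed output above: (number of topics, c_v score).
coherence_scores = [
    (2, 0.5008437441772385),
    (3, 0.493111586267488),
    (4, 0.5208188946378115),
    (5, 0.5214631592915472),
    (6, 0.508575902086989),
    (7, 0.5212548976531373),
    (8, 0.5264444838426028),
    (9, 0.5320430369597641),
    (10, 0.5302354902413622),
]

def smallest_k_near_best(scores, tolerance=0.02):
    """Return the smallest K whose score is within `tolerance` of the best score."""
    best_score = max(score for _, score in scores)
    candidates = [k for k, score in scores if score >= best_score - tolerance]
    return min(candidates)

print(smallest_k_near_best(coherence_scores))
```

With a tolerance of 0.02 this selects K = 4 instead of K = 9. Setting the tolerance to 0 reproduces the plain maximum.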

## Question 4 (10 Points)

**Generate K topics by using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
# I tried this question many times with no success
from bertopic import BERTopic

# Sample text data
documents = [
    "I love this product! It's amazing.",
    "The service was terrible. I won't recommend it.",
    "Not a bad experience, but could be better.",
    "😊 Great event! Enjoyed every moment.",
    "Disappointed with the quality. 😡"
]

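# Initialize BERTopic
topic_model = BERTopic()

# Fit the model; fit() returns the fitted model itself, not a number of topics.
# Note: with only five sample documents, this call raises the error shown below.
topic_model.fit(documents)

# Get the topics as a dict mapping topic id -> list of (word, score) pairs
topics = topic_model.get_topics()

# Summarize topics
print("Summarizing Topics:")
for topic_id, word_scores in topics.items():
    words = " ".join(word for word, _ in word_scores)
    print(f"Topic {topic_id}: {words}")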




TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [50]:
# LDA:
"""
Topics: The topics generated by LDA include words like "bad," "experience," "could," "better," "quality," "disappointed," "love," "amazing," "moment," "product," "recommend," and "terrible."
Coherence Score: The coherence score for the optimal number of topics (5 topics) is 0.5349.
"""

# LSA:
"""
Topics: The topics generated by LSA include words like "enjoyed," "moment," "great," "event," "every," "experience," "bad," "could," "better," "wo," "terrible," and "service."
Coherence Score: The coherence score for the optimal number of topics (3 topics) is 0.5349.
"""

# lda2vec:
"""
Topics: The topics generated by lda2vec include words like "quality," "disappointed," "service," "event," "love," "could," "better," "amazing," "recommend," "bad," "enjoyed," "moment," "great," "every," "wo," and "n't."
Coherence Score: The coherence scores for different numbers of topics range from 0.4931 to 0.5320, with the optimal number of topics being 9, achieving a coherence score of 0.5320.
"""

# BERTopic:
# BERTopic's output was not obtained due to an error during execution, so we cannot directly assess the topics it generated.
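
To put the three reported results side by side, the sketch below ranks them by coherence score and breaks ties in favor of fewer topics. The figures are copied from the outputs in this notebook; the tie-breaking rule and the names `reported` and `ranked` are assumptions made for illustration. With only five short sample documents, differences this small may not be meaningful.

```python
# (number of topics, coherence score) as reported in the cells above.
# "lda2vec cell" refers to the Question 3 cell, which trained gensim's LdaModel.
reported = {
    "LDA": (5, 0.5349),
    "LSA": (3, 0.5349),
    "lda2vec cell": (9, 0.5320),
}

# Sort by higher coherence first, then by fewer topics.
ranked = sorted(reported.items(), key=lambda item: (-item[1][1], item[1][0]))

for name, (k, score) in ranked:
    print(f"{name}: K={k}, coherence={score:.4f}")
```

Under this rule LSA ranks first because it matches the LDA score with fewer topics.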





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
'''
# Learning Experience
# Working with text data and implementing various topic modeling algorithms was a valuable learning experience.
 It helped me understand the underlying concepts of feature extraction from text data and how different algorithms approach the task of
  identifying topics within a corpus. Implementing LDA, LSA, and lda2vec provided insight into their strengths and
  weaknesses in handling text data.

# Challenges Encountered
# The main challenge was generating topics with BERTopic: the code raised errors that I could not resolve.
 Despite several attempts, the BERTopic implementation did not yield the expected results.

# Relevance to Your Field of Study
# Topic modeling plays a crucial role in various NLP applications such as document clustering, summarization, and recommendation systems.
 Understanding different topic modeling algorithms and their implementations is essential for NLP practitioners to effectively analyze and extract insights from large text datasets.




'''