# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class exercise. Perform the following tasks.***

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [17]:
import nltk
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (stopword list and tokenizer data)
nltk.download('stopwords')
nltk.download('punkt')

# Load the data from the CSV file
df = pd.read_csv('extract.csv')
data = df['Review'].tolist()  # 'Review' is the column that holds the review text

# Preprocess the text: lowercase, tokenize, keep alphabetic tokens, drop stopwords
stop_words = set(stopwords.words('english'))
texts = [[word for word in word_tokenize(str(doc).lower()) if word.isalpha() and word not in stop_words] for doc in data]

# Create a dictionary representation of the documents
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # Filter out too rare or too common words

# Create a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# Iterate over different values of K
best_coherence = -1
best_lda_model = None
best_k = 0
for k in range(2, 11):
    # Train the LDA model
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)

    # Compute the coherence score
    coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()

    # Update the best model and coherence score
    if coherence_score > best_coherence:
        best_coherence = coherence_score
        best_lda_model = lda_model
        best_k = k

# Print the optimal number of topics
print("Optimal number of topics:", best_k)

# Summarize the topics
for idx, topic in best_lda_model.print_topics(-1):
    print("Topic {}: {}".format(idx, topic))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Optimal number of topics: 7
Topic 0: 0.334*"bottle" + 0.333*"versace" + 0.333*"cologne"
Topic 1: 0.489*"bottle" + 0.333*"versace" + 0.178*"cologne"
Topic 2: 0.880*"bottle" + 0.060*"cologne" + 0.060*"versace"
Topic 3: 0.763*"cologne" + 0.211*"versace" + 0.027*"bottle"
Topic 4: 0.334*"bottle" + 0.334*"versace" + 0.332*"cologne"
Topic 5: 0.881*"versace" + 0.060*"bottle" + 0.060*"cologne"
Topic 6: 0.496*"versace" + 0.442*"cologne" + 0.062*"bottle"
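
Every topic above is built from the same three words, which suggests that very few tokens survived `dictionary.filter_extremes(no_below=5, no_above=0.5)`. The sketch below is a plain-Python illustration of that kind of document-frequency filter, assuming the documented meaning of the two parameters (keep tokens found in at least `no_below` documents and in no more than a `no_above` fraction of documents). It is not gensim's code, and the function name and toy corpus are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(texts, no_below=5, no_above=0.5):
    """Return the tokens kept by a no_below / no_above style filter.

    Illustrative re-implementation, not gensim's code: a token is kept when it
    appears in at least `no_below` documents and in at most
    `no_above * len(texts)` documents.
    """
    num_docs = len(texts)
    # Count each token once per document (document frequency).
    doc_freq = Counter(token for doc in texts for token in set(doc))
    max_docs = no_above * num_docs
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)


# Hypothetical toy corpus of 10 tokenized reviews.
toy_texts = (
    [["versace", "smell", "great"]] * 5   # 5 documents
    + [["nice", "smell", "gift"]] * 3     # 3 more; "smell" is now in 8 documents
    + [["fast", "shipping"]] * 2          # 2 documents
)

kept = filter_by_document_frequency(toy_texts)
print(kept)  # ['great', 'versace']
```

With 10 documents, these settings keep only tokens that occur in exactly 5 documents, so a small corpus can lose most of its vocabulary. Printing `len(dictionary)` after filtering, or lowering `no_below`, would be one way to check whether that happened here.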


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [19]:
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load the data from the CSV file
df = pd.read_csv('extract.csv')
data = df['Review'].tolist()  # 'Review' is the column that holds the review text

# Preprocess the text: lowercase, tokenize, keep alphabetic tokens, drop stopwords
stop_words = set(stopwords.words('english'))
texts = [[word for word in word_tokenize(str(doc).lower()) if word.isalpha() and word not in stop_words] for doc in data]

# Create a dictionary representation of the documents
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # Filter out too rare or too common words

# Create a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# Iterate over different values of K
best_coherence = -1
best_lsa_model = None
best_k = 0
for k in range(2, 11):
    # Train the LSA model
    lsa_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=k)

    # Compute the coherence score
    coherence_model = CoherenceModel(model=lsa_model, texts=texts, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()

    # Update the best model and coherence score
    if coherence_score > best_coherence:
        best_coherence = coherence_score
        best_lsa_model = lsa_model
        best_k = k

# Print the optimal number of topics
print("Optimal number of topics:", best_k)

# Summarize the topics
for idx, topic in best_lsa_model.print_topics(-1):
    print("Topic {}: {}".format(idx, topic))


Optimal number of topics: 10
Topic 0: 0.760*"versace" + 0.638*"cologne" + 0.125*"bottle"
Topic 1: 0.723*"cologne" + -0.536*"versace" + -0.436*"bottle"
Topic 2: -0.891*"bottle" + 0.368*"versace" + -0.264*"cologne"
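
The output above reports an optimal K of 10 but lists only three topics. A likely reason (an inference, not something the output states) is that LSA is a truncated SVD of the document-term matrix, and a matrix with only three term columns has rank at most three, so no more than three components can exist. The sketch below checks that bound on a hypothetical count matrix using exact Gaussian elimination; `matrix_rank` and the data are illustrative only.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix (list of rows) via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical document-term counts: 6 documents (rows) x 3 terms (columns).
doc_term = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [2, 0, 1],
    [1, 1, 1],
]

print(matrix_rank(doc_term))  # 3
```

However many documents are added, the rank cannot exceed the number of term columns, so requesting ten components from a three-term vocabulary cannot produce ten distinct topics.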


## Question 3 (10 Points)
**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [63]:
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load the data from the CSV file
df = pd.read_csv('extract.csv')
data = df['Review'].tolist()  # 'Review' is the column that holds the review text

# Preprocess the text: lowercase, tokenize, keep alphabetic tokens, drop stopwords
stop_words = set(stopwords.words('english'))
texts = [[word for word in word_tokenize(str(doc).lower()) if word.isalpha() and word not in stop_words] for doc in data]

# Create a dictionary representation of the documents
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # Filter out too rare or too common words

# Create a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# Iterate over different values of K
best_coherence = -1
best_lda_model = None
best_k = 0
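for k in range(2, 11):
    # NOTE: this trains gensim's LdaModel, exactly as in Question 1. It is a
    # stand-in, not an lda2vec model, so the results match the LDA results.
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)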

    # Compute the coherence score
    coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()

    # Update the best model and coherence score
    if coherence_score > best_coherence:
        best_coherence = coherence_score
        best_lda_model = lda_model
        best_k = k

# Print the optimal number of topics
print("Optimal number of topics:", best_k)

# Summarize the topics
for idx, topic in best_lda_model.print_topics(-1):
    print("Topic {}: {}".format(idx, topic))




Optimal number of topics: 7
Topic 0: 0.334*"bottle" + 0.333*"versace" + 0.333*"cologne"
Topic 1: 0.489*"bottle" + 0.333*"versace" + 0.178*"cologne"
Topic 2: 0.880*"bottle" + 0.060*"cologne" + 0.060*"versace"
Topic 3: 0.763*"cologne" + 0.211*"versace" + 0.027*"bottle"
Topic 4: 0.334*"bottle" + 0.334*"versace" + 0.332*"cologne"
Topic 5: 0.881*"versace" + 0.060*"bottle" + 0.060*"cologne"
Topic 6: 0.496*"versace" + 0.442*"cologne" + 0.062*"bottle"
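
The selection loops above keep the model with the highest `c_v` coherence. As a rough illustration of what a coherence score measures, the sketch below computes a UMass-style coherence, which rewards topic words that tend to occur in the same documents. This is a simplified stand-in chosen because it fits in a few lines; it is neither the `c_v` measure used above nor gensim's implementation, and the function name and toy corpus are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, texts):
    """UMass-style coherence of one topic.

    Sums, over every ordered pair of topic words, the value
    log((documents containing both + 1) / documents containing the earlier word).
    Assumes every topic word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in texts]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    # combinations() preserves list order, so `earlier` is ranked above `later`.
    for earlier, later in combinations(topic_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


# Hypothetical tokenized corpus.
toy_texts = [
    ["versace", "cologne", "bottle"],
    ["versace", "cologne"],
    ["bottle", "shipping"],
    ["shipping", "fast"],
]

coherent = umass_coherence(["versace", "cologne"], toy_texts)
mixed = umass_coherence(["versace", "fast"], toy_texts)
print(coherent, mixed)
```

The first topic scores higher than the second because "versace" and "cologne" share documents while "versace" and "fast" never do. Recording such a score for every K, rather than only the best one, would make it easier to see how close the candidates are.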


## Question 4 (10 Points)
**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [40]:
import pandas as pd
from bertopic import BERTopic
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load the data from the CSV file
df = pd.read_csv('extract.csv')
data = df['Review'].tolist()  # 'Review' is the column that holds the review text

# Preprocess the text: lowercase, tokenize, keep alphabetic tokens, drop stopwords,
# then join the tokens back into one string per document (BERTopic expects strings)
stop_words = set(stopwords.words('english'))
texts = [' '.join([word for word in word_tokenize(str(doc).lower()) if word.isalpha() and word not in stop_words]) for doc in data]

# Initialize BERTopic
bertopic_model = BERTopic()

# Fit BERTopic on the preprocessed text
topics, _ = bertopic_model.fit_transform(texts)

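# Print the topic ID assigned to each document.
# NOTE: `topics` holds one ID per document, so the number after "Topic" below
# is the document index; -1 is BERTopic's label for outliers (no topic assigned).
for doc_id, topic in enumerate(topics):
    print(f"Topic {doc_id}: {topic}")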


Topic 0: -1
Topic 1: -1
Topic 2: -1
Topic 3: -1
Topic 4: -1
Topic 5: -1
Topic 6: -1
Topic 7: -1
Topic 8: -1
Topic 9: -1
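
The list returned by `fit_transform` holds one topic ID per document, and BERTopic uses `-1` for outliers, so the ten lines above mean that every document was left unassigned rather than that ten topics were found. The sketch below summarizes such a list with the standard library; the `topics` values are copied from the output above, and the variable names are illustrative.

```python
from collections import Counter

# Topic assignments copied from the output above: one entry per document.
topics = [-1] * 10

counts = Counter(topics)
num_outliers = counts.get(-1, 0)
# Real topics are every distinct ID other than the outlier label -1.
num_real_topics = len([t for t in counts if t != -1])

print(f"documents: {len(topics)}, outliers: {num_outliers}, topics found: {num_real_topics}")
# documents: 10, outliers: 10, topics found: 0
```

A corpus this small may simply be too little data for the default clustering step to form any cluster. Collecting more reviews is one option to try before comparing BERTopic with the other models.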


## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [64]:
'''
LDA, LSA, lda2vec, and BERTopic each have their own strengths and shortcomings, so the choice among them depends on the requirements of the user or organization.
LDA and LSA are well known for producing topics that can be interpreted without additional tooling, but they can struggle to scale to very large datasets.
lda2vec and BERTopic offer more flexibility and take context into account, but their topics may be less explicit.

Coherence scores can serve as the measure of topic quality, with higher values indicating better topics. LDA and LSA are generally easier to understand,
so they suit people who are new to topic modeling, while lda2vec and BERTopic are more technical and better suited to people with more experience in the field.
In the end, the right algorithm depends on the task and on how much weight it places on interpretability, scalability, performance, flexibility, and reliability.
'''

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; answer the questions above in as much detail as possible):

'''
This task was a great hands-on introduction to NLP: I worked with text data and extracted features with topic modeling algorithms. Working through LDA, LSA, lda2vec, and BERTopic gave me a deeper understanding of these algorithms and of the properties that make them useful for text feature extraction. The main challenges were implementing the more complicated algorithms, lda2vec and BERTopic, and handling large datasets appropriately. The exercise is directly relevant to NLP, since topic modeling is a fundamental task for gaining insight into the subject matter of a text collection. Overall, this assignment was a first step into NLP research and applications, covering the fundamentals of text data processing and feature extraction.
'''