
# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [1]:
# Write your code here
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
import pandas as pd
import nltk
from nltk.corpus import stopwords

# Ensure nltk stopwords are downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Sample data
data = pd.DataFrame({
    'text': [
        "I bought this phone two weeks ago, and I have been extremely satisfied with its performance. The battery lasts all day, and the camera quality is outstanding. Highly recommend!",
    "Terrible laptop. It overheats constantly, and the battery dies in just a couple of hours. I regret purchasing this product.",
    "This vacuum cleaner has made my life so much easier. It's lightweight, powerful, and very easy to maneuver. I love how clean my house feels after using it.",
    "The sound quality of these headphones is amazing. However, after just two months, one side stopped working. I'm disappointed with the durability.",
    "I’ve tried a lot of fitness trackers, and this one is by far the best. It’s comfortable to wear, the tracking is accurate, and I love the sleep monitoring feature.",
    "This blender is a waste of money. It struggles with basic tasks like blending frozen fruit and constantly overheats. Definitely avoid this product.",
    "Fantastic service! The customer support team was responsive and helped me resolve my issue quickly. I will continue to shop with this company.",
    "The quality of this TV is top-notch. The picture is crystal clear, and the smart features are easy to use. It's the best TV I’ve ever owned.",
    "Worst purchase ever. The shoes fell apart after just a few days of light use. Poorly made and not worth the money.",
    "The camera takes great photos, but the app is clunky and hard to navigate. I would give it 5 stars for the camera, but the software brings it down to a 3.",
    "I love this coffee maker. It’s quick, easy to use, and makes the best cup of coffee I’ve ever had at home. I can’t start my day without it.",
    "The washing machine is too loud and doesn't clean clothes as well as my old one. It's not worth the high price tag.",
    "The tablet is great for reading and light browsing, but it slows down when I try to use it for anything more demanding. Good for basic tasks, but not a powerful device.",
    "I purchased this sofa for my living room, and it fits perfectly. The material feels durable, and it’s very comfortable to sit on. Highly satisfied with this purchase.",
    "The smartwatch looks nice, but the battery life is terrible. It barely lasts through the day without needing a charge. I wouldn't recommend it.",
    "This lawnmower works great for my small yard. It’s easy to push and cuts the grass evenly. I’m happy with my purchase so far.",
    "The wireless mouse stopped working after just a week of use. I had higher expectations for this brand, but unfortunately, the quality is subpar.",
    "I'm thrilled with my new gaming chair. It’s extremely comfortable, and the adjustable settings let me customize it to my preferences. Definitely worth the investment.",
    "This printer is the worst. It constantly jams, the ink cartridges are expensive, and the print quality is terrible. Do not buy!",
    "I’ve been using this air purifier for a month, and I can already notice a difference in the air quality. My allergies have improved, and it runs quietly in the background."
    ]
})

# Preprocess text: lowercasing, removing stopwords
data['processed'] = data['text'].apply(lambda x: ' '.join([word for word in x.lower().split() if word not in stop_words]))
data['tokenized'] = data['processed'].apply(lambda x: x.split())

# Create dictionary and corpus
dictionary = Dictionary(data['tokenized'])
corpus = [dictionary.doc2bow(text) for text in data['tokenized']]

# Determine the best number of topics using coherence score
coherence_scores = []
models = []
for k in range(2, 8):  # Try K = 2 through 7 topics
    lda_model = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=100, passes=10)
    coherence_model = CoherenceModel(model=lda_model, texts=data['tokenized'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append((k, coherence_score))
    models.append(lda_model)

# Find best model based on coherence score
best_k = max(coherence_scores, key=lambda x: x[1])[0]
best_lda_model = models[best_k - 2]

# Display topics
print(f"Best Number of Topics: {best_k}")
for idx, topic in best_lda_model.show_topics(formatted=False):
    print(f"Topic {idx + 1}: {[word for word, prob in topic]}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Best Number of Topics: 2
Topic 1: ['quality', 'battery', 'lasts', 'basic', 'terrible.', 'constantly', 'satisfied', 'it’s', 'highly', 'comfortable']
Topic 2: ['easy', 'i’ve', 'it’s', 'worth', 'use.', 'great', 'air', 'tv', "i'm", 'quality']
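
Tokens such as `'terrible.'` and `'use.'` in the topics above keep their trailing punctuation because the text is split on whitespace only. A minimal sketch of a stricter tokenizer is shown below. The function name `tokenize` and the small `STOP_WORDS` set are illustrative assumptions, not part of the notebook, which uses NLTK's English stopword list.

```python
import re

# Tiny illustrative stopword list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "and", "it", "i", "this", "a", "to", "of", "in", "just"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, normalize curly apostrophes, and keep alphabetic tokens only."""
    text = text.lower().replace("\u2019", "'")
    # Match runs of letters, optionally with one internal apostrophe (e.g. "it's").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return [tok for tok in tokens if tok not in stop_words]


print(tokenize("Terrible laptop. It overheats constantly, and the battery dies in just a couple of hours."))
```

Passing `stop_words=set(stopwords.words('english'))` would reuse the stopword list already loaded in the cell above.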


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [2]:
# Write your code here
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Apply TF-IDF to convert the text data into a TF-IDF-weighted document-term matrix
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data['processed'])

# Set up and apply LSA to reduce data to a specified number of topics (e.g., 2 topics)
n_topics = 2
lsa = TruncatedSVD(n_components=n_topics, random_state=100)
lsa.fit(X)

# Output the 10 highest-weighted words for each topic (argsort is ascending, so the strongest word is printed last)
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lsa.components_):
    print(f"Topic {idx + 1}: ", [terms[i] for i in topic.argsort()[-10:]])


Topic 1:  ['coffee', 'best', 'constantly', 've', 'easy', 'terrible', 'day', 'battery', 'use', 'quality']
Topic 2:  ['laptop', 'hours', 'dies', 'recommend', 'lasts', 'product', 'overheats', 'constantly', 'battery', 'terrible']
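
The lists above run from the weakest to the strongest of the ten terms, because `topic.argsort()[-10:]` returns indices in ascending order of weight. With NumPy, appending `[::-1]` reverses the slice. The sketch below shows the same ranking in plain Python. The helper name `top_terms` and the sample weights are hypothetical, chosen only to illustrate the ordering.

```python
def top_terms(weights, terms, n=10):
    """Return the n terms with the largest weights, highest weight first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [terms[i] for i in ranked[:n]]


# Hypothetical component weights for a five-term vocabulary.
terms = ["battery", "coffee", "quality", "terrible", "use"]
weights = [0.31, 0.05, 0.42, 0.27, 0.38]

print(top_terms(weights, terms, n=3))  # ['quality', 'use', 'battery']
```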


## Question 3 (10 points):
**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [3]:
# Write your code here


## Question 4 (10 points):
**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [4]:
# Import BERTopic library
from bertopic import BERTopic

# Set up and train the BERTopic model on the text data
model = BERTopic()
topics, probs = model.fit_transform(data['text'])

# Output the topics generated by the model
for idx, topic in enumerate(model.get_topics().values()):
    print(f"Topic {idx + 1}: {topic[:10]}")  # Show the top 10 words for each identified topic


Topic 1: [('the', 0.2000459536807333), ('and', 0.13640410885445992), ('this', 0.10733112202710283), ('is', 0.10208274543782703), ('it', 0.10208274543782703), ('to', 0.08546675583397369), ('my', 0.08546675583397369), ('its', 0.07352527177588986), ('for', 0.06723848059743669), ('of', 0.06723848059743669)]
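
In BERTopic, `get_topics()` returns a mapping from topic ID to a list of `(word, score)` pairs, and ID `-1` is reserved for outlier documents. Enumerating `.values()` therefore may print the outlier group under the label "Topic 1". The sketch below iterates over `.items()` and keeps the real IDs. The dictionary `sample_topics` is a hypothetical stand-in with made-up scores, and `describe_topics` is an illustrative helper, not a BERTopic function.

```python
# Hypothetical stand-in for the dictionary returned by get_topics().
sample_topics = {
    -1: [("the", 0.20), ("and", 0.14)],          # outlier topic
    0: [("battery", 0.31), ("lasts", 0.22)],
    1: [("coffee", 0.28), ("easy", 0.19)],
}


def describe_topics(topics, top_n=10, skip_outliers=True):
    """Format each topic with its real ID, optionally skipping the outlier topic (-1)."""
    lines = []
    for topic_id, word_scores in sorted(topics.items()):
        if skip_outliers and topic_id == -1:
            continue
        words = [word for word, _score in word_scores[:top_n]]
        lines.append(f"Topic {topic_id}: {words}")
    return lines


for line in describe_topics(sample_topics):
    print(line)
```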


## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, do the alternate question.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each of the visualizations in detail.

In [5]:
# Write your code here
# Then Explain the visualization
# Repeat for the other 2 visualizations as well.

In [6]:
!pip install pyLDAvis




In [7]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# Visualize LDA topics
lda_display = gensimvis.prepare(best_lda_model, corpus, dictionary)
pyLDAvis.display(lda_display)


LDA (Latent Dirichlet Allocation): LDA produces interpretable topics by modeling each document as a mixture of topics and each topic as a distribution over words. The coherence score is not part of LDA itself; it is an external metric, used here to choose K and to judge how well the topics separate. LDA is especially suitable when specific, well-defined topics are needed, since it helps differentiate distinct themes within a text corpus. The pyLDAvis view above plots each topic as a circle on an intertopic distance map and lists the most relevant terms for the selected topic.
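
To make the idea of a coherence score concrete, here is a minimal sketch of a simplified UMass-style coherence, averaged over word pairs. It is not the `c_v` measure that gensim computes in this notebook; the function name `umass_coherence` and the toy documents are assumptions for illustration only. Topic words that often appear in the same documents score higher.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs):
    """Average log((D(w_i, w_j) + 1) / D(w_i)) over ranked word pairs (i before j)."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score, pairs = 0.0, 0
    # combinations() keeps the ranking order: `earlier` is the higher-ranked word.
    for earlier, later in combinations(topic_words, 2):
        df_earlier = doc_freq(earlier)
        if df_earlier == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_freq(earlier, later) + 1) / df_earlier)
        pairs += 1
    return score / pairs if pairs else 0.0


# Toy corpus (hypothetical), already tokenized.
docs = [
    ["battery", "lasts", "day"],
    ["battery", "dies", "hours"],
    ["coffee", "maker", "easy"],
    ["coffee", "easy", "use"],
]

print(umass_coherence(["coffee", "easy"], docs))     # words that co-occur
print(umass_coherence(["battery", "coffee"], docs))  # words that never co-occur
```

Running the same function on the top words of each candidate model and comparing the averages is one way to pick K, under the assumption that this simplified measure is an acceptable proxy.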

In [8]:
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from bertopic import BERTopic

# Sample reviews data in a DataFrame
reviews_data = pd.DataFrame({
    'content': [
         "I bought this phone two weeks ago, and I have been extremely satisfied with its performance. The battery lasts all day, and the camera quality is outstanding. Highly recommend!",
         "Terrible laptop. It overheats constantly, and the battery dies in just a couple of hours. I regret purchasing this product.",
         "This vacuum cleaner has made my life so much easier. It's lightweight, powerful, and very easy to maneuver. I love how clean my house feels after using it.",
         "The sound quality of these headphones is amazing. However, after just two months, one side stopped working. I'm disappointed with the durability.",
         "I’ve tried a lot of fitness trackers, and this one is by far the best. It’s comfortable to wear, the tracking is accurate, and I love the sleep monitoring feature.",
         "This blender is a waste of money. It struggles with basic tasks like blending frozen fruit and constantly overheats. Definitely avoid this product.",
         "Fantastic service! The customer support team was responsive and helped me resolve my issue quickly. I will continue to shop with this company.",
         "The quality of this TV is top-notch. The picture is crystal clear, and the smart features are easy to use. It's the best TV I’ve ever owned.",
         "Worst purchase ever. The shoes fell apart after just a few days of light use. Poorly made and not worth the money.",
         "The camera takes great photos, but the app is clunky and hard to navigate. I would give it 5 stars for the camera, but the software brings it down to a 3.",
         "I love this coffee maker. It’s quick, easy to use, and makes the best cup of coffee I’ve ever had at home. I can’t start my day without it.",
         "The washing machine is too loud and doesn't clean clothes as well as my old one. It's not worth the high price tag.",
         "The tablet is great for reading and light browsing, but it slows down when I try to use it for anything more demanding. Good for basic tasks, but not a powerful device.",
         "I purchased this sofa for my living room, and it fits perfectly. The material feels durable, and it’s very comfortable to sit on. Highly satisfied with this purchase.",
         "The smartwatch looks nice, but the battery life is terrible. It barely lasts through the day without needing a charge. I wouldn't recommend it.",
         "This lawnmower works great for my small yard. It’s easy to push and cuts the grass evenly. I’m happy with my purchase so far.",
         "The wireless mouse stopped working after just a week of use. I had higher expectations for this brand, but unfortunately, the quality is subpar.",
         "I'm thrilled with my new gaming chair. It’s extremely comfortable, and the adjustable settings let me customize it to my preferences. Definitely worth the investment.",
         "This printer is the worst. It constantly jams, the ink cartridges are expensive, and the print quality is terrible. Do not buy!",
         "I’ve been using this air purifier for a month, and I can already notice a difference in the air quality. My allergies have improved, and it runs quietly in the background."
    ]
})

# Lowercase text for consistent processing
reviews_data['processed_text'] = reviews_data['content'].apply(lambda x: x.lower())

print("LSA Model: Summarizing broader patterns and possible term overlap.\n")

# Initialize TF-IDF Vectorizer and transform the text data
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(reviews_data['processed_text'])

# Set the number of topics for LSA
num_topics = 3
lsa_model = TruncatedSVD(n_components=num_topics, random_state=100)
lsa_model.fit(tfidf_matrix)

# Display the most relevant terms for each LSA topic
feature_names = tfidf_vectorizer.get_feature_names_out()
for idx, topic in enumerate(lsa_model.components_):
    print(f"LSA Topic {idx + 1}: ", [feature_names[i] for i in topic.argsort()[-10:]])

# =================== BERTopic Analysis ===================
print("\nBERTopic Model: Interactive visuals for identified topic clusters.\n")

# Fit the BERTopic model to generate topics
topic_model = BERTopic(min_topic_size=2)
topic_clusters, probabilities = topic_model.fit_transform(reviews_data['content'])

# Display the BERTopic Barchart
print("BERTopic Top Words Barchart:")
barchart_fig = topic_model.visualize_barchart()
barchart_fig.show()

# Display the BERTopic Heatmap
print("\nBERTopic Topic Similarity Heatmap:")
heatmap_fig = topic_model.visualize_heatmap()
heatmap_fig.show()




LSA Model: Summarizing broader patterns and possible term overlap.

LSA Topic 1:  ['coffee', 've', 'easy', 'constantly', 'terrible', 'day', 'battery', 'just', 'use', 'quality']
LSA Topic 2:  ['owned', 'notch', 'features', 'far', 'coffee', 'love', 'tv', 'best', 've', 'easy']
LSA Topic 3:  ['amazing', 'months', 'sound', 'disappointed', 'durability', 'headphones', 'use', 'just', 'stopped', 'working']

BERTopic Model: Interactive visuals for identified topic clusters.

BERTopic Top Words Barchart:



BERTopic Topic Similarity Heatmap:


1. **LSA Topic Terms**:
   - This output shows the most important words for each topic identified by the LSA model. By examining these key terms, we can understand the general themes that LSA has extracted from the dataset. Note that some words may appear in multiple topics, which indicates that LSA is capturing broader patterns but might have overlapping themes among topics.

2. **BERTopic Barchart**:
   - The BERTopic barchart displays the top words for each topic, where the length of each bar is the word's c-TF-IDF score within that specific topic. This visualization helps in quickly identifying the primary terms associated with each topic, allowing us to grasp the central ideas that each cluster represents.

3. **BERTopic Heatmap**:
   - The heatmap from BERTopic shows the pairwise similarity matrix between topics, computed as the cosine similarity between topic representations. Each cell's color encodes how similar the topics on its row and column are, so strongly colored off-diagonal cells mark related topics. This visual makes it easy to identify which topics are interrelated, providing insight into the relationships between the generated topic clusters.
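
The similarity values behind a topic heatmap can be sketched with plain cosine similarity. The example below is an illustration under assumptions: the three `topic_vectors` are made-up numbers standing in for topic representations, and the helper names are hypothetical rather than BERTopic functions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vector i with vector j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical topic vectors: the first two point in similar directions.
topic_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]

for row in similarity_matrix(topic_vectors):
    print([round(value, 2) for value in row])
```

The diagonal is always 1 because every topic is identical to itself, so only the off-diagonal cells carry information about how topics relate.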


## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [11]:
# Write your code here
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from bertopic import BERTopic

# Sample dataset containing various reviews
dataset = pd.DataFrame({
    'review': [
        "I bought this phone two weeks ago, and I have been extremely satisfied with its performance. The battery lasts all day, and the camera quality is outstanding. Highly recommend!",
    "Terrible laptop. It overheats constantly, and the battery dies in just a couple of hours. I regret purchasing this product.",
    "This vacuum cleaner has made my life so much easier. It's lightweight, powerful, and very easy to maneuver. I love how clean my house feels after using it.",
    "The sound quality of these headphones is amazing. However, after just two months, one side stopped working. I'm disappointed with the durability.",
    "I’ve tried a lot of fitness trackers, and this one is by far the best. It’s comfortable to wear, the tracking is accurate, and I love the sleep monitoring feature.",
    "This blender is a waste of money. It struggles with basic tasks like blending frozen fruit and constantly overheats. Definitely avoid this product.",
    "Fantastic service! The customer support team was responsive and helped me resolve my issue quickly. I will continue to shop with this company.",
    "The quality of this TV is top-notch. The picture is crystal clear, and the smart features are easy to use. It's the best TV I’ve ever owned.",
    "Worst purchase ever. The shoes fell apart after just a few days of light use. Poorly made and not worth the money.",
    "The camera takes great photos, but the app is clunky and hard to navigate. I would give it 5 stars for the camera, but the software brings it down to a 3.",
    "I love this coffee maker. It’s quick, easy to use, and makes the best cup of coffee I’ve ever had at home. I can’t start my day without it.",
    "The washing machine is too loud and doesn't clean clothes as well as my old one. It's not worth the high price tag.",
    "The tablet is great for reading and light browsing, but it slows down when I try to use it for anything more demanding. Good for basic tasks, but not a powerful device.",
    "I purchased this sofa for my living room, and it fits perfectly. The material feels durable, and it’s very comfortable to sit on. Highly satisfied with this purchase.",
    "The smartwatch looks nice, but the battery life is terrible. It barely lasts through the day without needing a charge. I wouldn't recommend it.",
    "This lawnmower works great for my small yard. It’s easy to push and cuts the grass evenly. I’m happy with my purchase so far.",
    "The wireless mouse stopped working after just a week of use. I had higher expectations for this brand, but unfortunately, the quality is subpar.",
    "I'm thrilled with my new gaming chair. It’s extremely comfortable, and the adjustable settings let me customize it to my preferences. Definitely worth the investment.",
    "This printer is the worst. It constantly jams, the ink cartridges are expensive, and the print quality is terrible. Do not buy!",
    "I’ve been using this air purifier for a month, and I can already notice a difference in the air quality. My allergies have improved, and it runs quietly in the background."
    ]
})

# Perform basic preprocessing: lowercase text and tokenize
dataset['cleaned'] = dataset['review'].apply(lambda x: x.lower())
dataset['tokenized'] = dataset['cleaned'].apply(lambda x: x.split())

# =================== 1. LDA Model with Coherence Score ===================
print("LDA Model Analysis: Creating distinct topics with well-separated themes.\n")

# Prepare LDA model input requirements
review_dict = Dictionary(dataset['tokenized'])
review_corpus = [review_dict.doc2bow(text) for text in dataset['tokenized']]

# Run LDA model with 3 topics
lda_analysis = LdaModel(review_corpus, num_topics=3, id2word=review_dict, random_state=100, passes=10)

# Calculate coherence score for LDA topics
lda_coherence_calc = CoherenceModel(model=lda_analysis, texts=dataset['tokenized'], dictionary=review_dict, coherence='c_v')
lda_coherence_score = lda_coherence_calc.get_coherence()
print(f"LDA Model Coherence Score: {lda_coherence_score}")

# Show the top terms for each LDA topic
lda_topic_terms = []
for idx, topic in lda_analysis.show_topics(formatted=False):
    lda_terms = [word for word, prob in topic]
    lda_topic_terms.append(lda_terms)
    print(f"LDA Topic {idx + 1}: {lda_terms}")
print("\n")

# =================== 2. LSA Model with Coherence Score ===================
print("LSA Model Analysis: Highlighting broad patterns with possible overlapping terms.\n")

# Apply TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_matrix = tfidf_vectorizer.fit_transform(dataset['cleaned'])

# Run LSA model with 3 topics
n_topics = 3
lsa_model = TruncatedSVD(n_components=n_topics, random_state=100)
lsa_model.fit(X_matrix)

# Gather top terms from each LSA topic for coherence scoring
vocab_terms = tfidf_vectorizer.get_feature_names_out()
lsa_topic_words = [[vocab_terms[i] for i in topic.argsort()[-10:]] for topic in lsa_model.components_]

# Calculate coherence score for LSA topics
lsa_coherence_calc = CoherenceModel(topics=lsa_topic_words, texts=dataset['tokenized'], dictionary=review_dict, coherence='c_v')
lsa_coherence_score = lsa_coherence_calc.get_coherence()
print(f"LSA Model Coherence Score: {lsa_coherence_score}")

# Display top terms for each LSA topic
for idx, topic in enumerate(lsa_model.components_):
    print(f"LSA Topic {idx + 1}: ", [vocab_terms[i] for i in topic.argsort()[-10:]])
print("\n")

# =================== 3. BERTopic Model ===================
print("BERTopic Model Analysis: Clustered topics with flexibility for nuanced exploration.\n")

# Run BERTopic model
bertopic_instance = BERTopic(min_topic_size=2)
topic_clusters, probabilities = bertopic_instance.fit_transform(dataset['review'])

# Show the top terms for each BERTopic cluster
bertopic_topic_words = bertopic_instance.get_topics()
for idx, topic in bertopic_topic_words.items():
    bertopic_terms = [word for word, prob in topic[:10]]
    print(f"BERTopic Topic {idx}: {bertopic_terms}")
print("\n")

# =================== Summary of Model Performance ===================
print("Summary of Model Comparisons:\n")

# Display coherence scores for LDA and LSA models
print(f"LDA Coherence Score: {lda_coherence_score}")
print(f"LSA Coherence Score: {lsa_coherence_score}")
print("\nInterpretability Insights:")

# Determine which model offers the most coherent and interpretable results
if lda_coherence_score > lsa_coherence_score:
    print("LDA shows better separation and coherence between topics compared to LSA.")
else:
    print("LSA captures broad thematic patterns and is effective for high-level analysis.")

# Additional notes on BERTopic's flexibility and interpretability
print("While BERTopic does not calculate a coherence score, it provides highly interpretable clusters and useful visualizations, ideal for complex datasets.")

# Final model selection based on performance
print("\nFinal Verdict:")
if lda_coherence_score > lsa_coherence_score and lda_coherence_score > 0.4:
    print("LDA is the recommended model for well-defined, distinct topics with high coherence.")
elif lsa_coherence_score > lda_coherence_score:
    print("LSA is preferred for identifying broader themes, even if overlap between topics is acceptable.")
else:
    print("BERTopic is optimal for datasets requiring nuanced topic interpretation and interactive visualizations.")






LDA Model Analysis: Creating distinct topics with well-separated themes.

LDA Model Coherence Score: 0.221383092991048
LDA Topic 1: ['and', 'this', 'i', 'for', 'it', 'the', 'of', 'to', 'my', 'a']
LDA Topic 2: ['and', 'the', 'this', 'very', 'it', 'feels', 'i', 'a', 'made', 'for']
LDA Topic 3: ['the', 'and', 'is', 'i', 'this', 'to', 'my', 'a', 'it', 'quality']


LSA Model Analysis: Highlighting broad patterns with possible overlapping terms.

LSA Model Coherence Score: 0.4590628174593488
LSA Topic 1:  ['coffee', 've', 'easy', 'constantly', 'terrible', 'day', 'battery', 'just', 'use', 'quality']
LSA Topic 2:  ['owned', 'notch', 'features', 'far', 'coffee', 'love', 'tv', 'best', 've', 'easy']
LSA Topic 3:  ['amazing', 'months', 'sound', 'disappointed', 'durability', 'headphones', 'use', 'just', 'stopped', 'working']


BERTopic Model Analysis: Clustered topics with flexibility for nuanced exploration.

BERTopic Topic -1: ['this', 'with', 'and', 'perfectly', 'purchased', 'waste', 'material',

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help in grasping the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [10]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
This exercise provided a valuable learning experience in handling text data and applying different topic modeling algorithms to extract meaningful features.
Working with LDA, LSA, and BERTopic allowed me to explore how each algorithm handles topic identification, and the hands-on implementation deepened my understanding of their strengths and limitations.
I gained insight into the nuances of feature extraction, especially in balancing topic coherence and interpretability.
A key challenge was understanding and configuring each model’s parameters to fit the data and produce clear topics, especially as each algorithm processes data differently.
This exercise is highly relevant to the field of NLP, as topic modeling is foundational in organizing and analyzing vast text corpora, whether for sentiment analysis, summarization, or content classification in real-world applications.





'''




