#### NOVA IMS / BSc in Data Science / Text Mining 2024/2025
### <b>Group Project: "Solving the Hyderabadi Word Soup"</b>
#### Notebook `Topic Modeling`

#### Group:
- `Miguel Matos - 20221925`
- `Nuno Leandro - 20221861`
- `Patrícia Bezerra - 20221907`
- `Rita Silva - 20221920`
- `Vasco Capão - 20221906`

#### <font color='#BFD72F'>Table of Contents </font> <a class="anchor" id='toc'></a> 
- [0. Imports](#p0)
- [1. Data Preparation for Topic Modeling](#p1)
- [2. LDA (Latent Dirichlet Allocation)](#p2)
- [3. LSA (Latent Semantic Analysis)](#p3)
- [4. BERTopic](#p4)
- [5. Conclusions](#p5)

<font color='#BFD72F' size=8>Topic Modeling (Information Requirement 3322)</font> <a class="anchor" id="p0-0"></a>

"Can the reviews be classified according to emergent topics? What do the emergent
topics mean?"

<font color='#BFD72F' size=7>0. Imports</font> <a class="anchor" id="p0"></a>

This notebook runs in its own environment: pyLDAvis and BERTopic depend on different versions of shared libraries, which often causes compatibility issues.

In [2]:
import pandas as pd

import gensim
from gensim import corpora
from gensim.models import LsiModel, LdaModel, CoherenceModel

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

import bertopic
from bertopic import BERTopic

# Import only the helper used in this notebook instead of a wildcard import
from utils.pipeline_project import main_pipeline

In [3]:
# Display library versions
print(f"Pandas version: {pd.__version__}")
print(f"Gensim version: {gensim.__version__}")
print(f"pyLDAvis version: {pyLDAvis.__version__}")
print(f"BERTopic version: {bertopic.__version__}")


Pandas version: 2.2.3
Gensim version: 4.3.0
pyLDAvis version: 3.4.0
BERTopic version: 0.16.4
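
Because this notebook relies on a dedicated environment, it can help to fail fast when the installed packages drift from the versions printed above. The following is a minimal sketch using only the standard library. `EXPECTED_VERSIONS` and `check_versions` are hypothetical names introduced here for illustration, and the pinned versions are simply the ones reported in the output above.

```python
from importlib import metadata

# Versions copied from the output printed above (assumption: these are the
# versions the notebook should be re-run with).
EXPECTED_VERSIONS = {
    "pandas": "2.2.3",
    "gensim": "4.3.0",
    "pyLDAvis": "3.4.0",
    "bertopic": "0.16.4",
}


def check_versions(expected, get_version=metadata.version):
    """Map each package name to a tuple (expected, installed, matches)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed, installed == wanted)
    return report


for package, (wanted, installed, ok) in check_versions(EXPECTED_VERSIONS).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{package}: expected {wanted}, installed {installed} -> {status}")
```

The `get_version` argument exists only so the check can be exercised without the real packages installed; in the notebook the default lookup would be used.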


<font color='#BFD72F' size=7>1. Data Preparation for Topic Modeling</font> <a class="anchor" id="p1"></a>

[Back to TOC](#toc)

In [2]:
data = pd.read_csv('Data/shorten_df.csv')

In [3]:
data

Unnamed: 0.1,Unnamed: 0,Name,Review,Rating,Cuisines,preproc_reviews,msg_len,sents,nr_sents
0,0,Beyond Flavours,"The ambience was good, food was quite good . h...",5.0,"Chinese, Continental, Kebab, European, South I...",ambience good food quite good saturday lunch c...,154,"['The ambience was good, food was quite good ....",5
1,1,Beyond Flavours,Ambience is too good for a pleasant evening. S...,5.0,"Chinese, Continental, Kebab, European, South I...",ambience good pleasant evening service prompt ...,98,['Ambience is too good for a pleasant evening....,5
2,2,Beyond Flavours,A must try.. great food great ambience. Thnx f...,5.0,"Chinese, Continental, Kebab, European, South I...",must try great food great ambience thnx servic...,137,"['A must try.. great food great ambience.', 'T...",3
3,3,Beyond Flavours,Soumen das and Arun was a great guy. Only beca...,5.0,"Chinese, Continental, Kebab, European, South I...",soumen da arun great guy behavior sincerety go...,83,"['Soumen das and Arun was a great guy.', 'Only...",2
4,4,Beyond Flavours,Food is good.we ordered Kodi drumsticks and ba...,5.0,"Chinese, Continental, Kebab, European, South I...",food ordered kodi drumstick basket mutton biry...,108,['Food is good.we ordered Kodi drumsticks and ...,6
...,...,...,...,...,...,...,...,...,...
9194,9991,Chinese Pavilion,I visited this restaurant with friends and was...,5.0,"Chinese, Seafood",visited restaurant friend immediately blown aw...,313,['I visited this restaurant with friends and w...,5
9195,9992,Chinese Pavilion,"Im going to cut to the chase, The food is exce...",5.0,"Chinese, Seafood",im going cut chase food excellent must say hon...,198,"['Im going to cut to the chase, The food is ex...",6
9196,9995,Chinese Pavilion,This place has never disappointed us.. The foo...,4.5,"Chinese, Seafood",place never disappointed u food courteous staf...,226,['This place has never disappointed us.. The f...,3
9197,9997,Chinese Pavilion,I personally love and prefer Chinese Food. Had...,4.0,"Chinese, Seafood",personally love prefer chinese food couple tim...,288,"['I personally love and prefer Chinese Food.',...",7


In [4]:
data['clean_reviews_tokens'] = data['preproc_reviews'].apply(
    lambda x: main_pipeline(str(x), 
                            print_output=False, 
                            no_stopwords=False,
                            custom_stopwords=[], 
                            convert_diacritics=False, 
                            lowercase=False, 
                            lemmatized=False,
                            stemmed=False, 
                            pos_tags_list="no_pos",
                            tokenized_output=True,
                            no_punctuation = False)
)

In [5]:
pd.set_option("display.max_colwidth", None) 

In [6]:
data[['Review', 'clean_reviews_tokens']]

Unnamed: 0,Review,clean_reviews_tokens
0,"The ambience was good, food was quite good . had Saturday lunch , which was cost effective .\nGood place for a sate brunch. One can also chill with friends and or parents.\nWaiter Soumen Das was really courteous and helpful.","[ambience, good, food, quite, good, saturday, lunch, cost, effective, good, place, sate, brunch, one, also, chill, friend, parent, waiter, soumen, das, really, courteous, helpful]"
1,Ambience is too good for a pleasant evening. Service is very prompt. Food is good. Over all a good experience. Soumen Das - kudos to the service,"[ambience, good, pleasant, evening, service, prompt, food, good, good, experience, soumen, das, -, kudos, service]"
2,A must try.. great food great ambience. Thnx for the service by Pradeep and Subroto. My personal recommendation is Penne Alfredo Pasta:) ....... Also the music in the background is amazing.,"[must, try, great, food, great, ambience, thnx, service, pradeep, subroto, personal, recommendation, penne, alfredo, pasta, also, music, background, amazing]"
3,"Soumen das and Arun was a great guy. Only because of their behavior and sincerety, And good food off course, I would like to visit this place again.","[soumen, da, arun, great, guy, behavior, sincerety, good, food, course, would, like, visit, place]"
4,Food is good.we ordered Kodi drumsticks and basket mutton biryani. All are good. Thanks to Pradeep. He served well. We enjoyed here. Ambience is also very good.,"[food, ordered, kodi, drumstick, basket, mutton, biryani, good, thanks, pradeep, served, well, enjoyed, ambience, also, good]"
...,...,...
9194,I visited this restaurant with friends and was immediately blown away with the quality of service.\nWe were seated immediately and the staff was courteous and professional especially with our large group.\nThe ambience is one of the best I've come across with a rather unusually quirky ceiling piece.\nFood - I had the stuffed mushroom which was delicious along with a Fruit punch and a chocolate volcano.\nAll in all one of the best culinary experiences I've had in the city.,"[visited, restaurant, friend, immediately, blown, away, quality, service, seated, immediately, staff, courteous, professional, especially, large, group, ambience, one, best, 've, come, across, rather, unusually, quirky, ceiling, piece, food, -, stuffed, mushroom, delicious, along, fruit, punch, chocolate, volcano, one, best, culinary, experience, 've, city]"
9195,"Im going to cut to the chase, The food is excellent! \n\nI must say the honey Chicken and the Thai Chow Kay is by far the best. \nThe Man Chow soup is another brilliant piece of art! \nFor me this Chinese Pavilion beats any Chinese restaurant in the city. \n\nNo wonder its called Chinese Pavilion! :D","[im, going, cut, chase, food, excellent, must, say, honey, chicken, thai, chow, kay, far, best, man, chow, soup, another, brilliant, piece, art, chinese, pavilion, beat, chinese, restaurant, city, wonder, called, chinese, pavilion]"
9196,"This place has never disappointed us.. The food, the courteous staff, the serene ambience.. We wanted to have something totally rice free and with very little oil. They served us Steamed Fish with chilly garlic noodles with Chicken. As always it was awesome.. Thanks Chinese Pavilion, it is always a pleasant experience!","[place, never, disappointed, u, food, courteous, staff, serene, ambience, wanted, something, totally, rice, free, little, oil, served, u, steamed, fish, chilly, garlic, noodle, chicken, always, awesome, thanks, chinese, pavilion, always, pleasant, experience]"
9197,"I personally love and prefer Chinese Food. Had been here couple of times with my husband.\n\nThe ambiance of the place is very good. The entrance has some carvings and the walls were very nicely decorated. The server was very polite\n\nWhen it comes to food this place does not disappoint.This is a small/comfy restaurant that is surprisingly not very crowded even on weekends. \n\nWe went there for our anniversary celebrations and it was just perfect. No crowd, we got the attention and service we were looking for.","[personally, love, prefer, chinese, food, couple, time, husband, ambiance, place, good, entrance, carving, wall, nicely, decorated, server, polite, come, food, place, disappointthis, smallcomfy, restaurant, surprisingly, crowded, even, weekend, went, anniversary, celebration, perfect, crowd, got, attention, service, looking]"


In [7]:
dictionary = corpora.Dictionary(data['clean_reviews_tokens'])
corpus = [dictionary.doc2bow(text) for text in data['clean_reviews_tokens']]

<font color='#BFD72F' size=7>2. LDA</font> <a class="anchor" id="p2"></a>

[Back to TOC](#toc)

For all three models, we intentionally tested a smaller number of topics than in the examples used in class. The goal was to keep the results manageable and easy to interpret, especially for a preliminary analysis: with fewer topics, we can focus on the most prominent themes without being overwhelmed by fine-grained, potentially less relevant ones.

We now test different numbers of topics for the LDA model. By varying the number of topics and comparing coherence scores, we aim to find the model that uncovers the most meaningful and interpretable patterns in the data.

In [8]:
# Parameters to test
num_topics_range = [3, 5, 10, 15]

# Variables to store the best results
best_coherence = 0
best_model = None
best_num_topics = 0

# List to store coherence scores
coherence_scores = []

# Counter for consecutive non-improvements
no_improvement_count = 0

# Loop to test different values of num_topics
for num_topics in num_topics_range:
    print(f"Testing num_topics={num_topics}...")
    
    # Train the LDA model
    lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=20, random_state=42)
    
    # Calculate the coherence score
    coherence_model = CoherenceModel(model=lda_model, texts=data['clean_reviews_tokens'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    
    # Store the result
    coherence_scores.append((num_topics, coherence_score))
    print(f"Coherence Score: {coherence_score:.4f}")
    
    # Check if the coherence score improved
    if coherence_score > best_coherence:
        best_coherence = coherence_score
        best_model = lda_model
        best_num_topics = num_topics
        no_improvement_count = 0  # Reset counter
    else:
        no_improvement_count += 1
    
    # Stop if no improvement for two consecutive iterations
    if no_improvement_count >= 2:
        print("Coherence did not improve for two consecutive iterations. Stopping...")
        break

# Final results
print("\nBest Model Found:")
print(f"Number of Topics: {best_num_topics}")
print(f"Best Coherence Score: {best_coherence:.4f}")

# Save results in a DataFrame
results_df = pd.DataFrame(coherence_scores, columns=['num_topics', 'coherence'])
print("\nResults Summary:")
print(results_df)

Testing num_topics=3...
Coherence Score: 0.5132
Testing num_topics=5...
Coherence Score: 0.5529
Testing num_topics=10...
Coherence Score: 0.5160
Testing num_topics=15...
Coherence Score: 0.5280
Coherence did not improve for two consecutive iterations. Stopping...

Best Model Found:
Number of Topics: 5
Best Coherence Score: 0.5529

Results Summary:
   num_topics  coherence
0           3   0.513184
1           5   0.552922
2          10   0.515994
3          15   0.527988
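
The search above combines two ideas: keep the candidate with the highest coherence, and stop once the score has failed to improve a given number of times in a row. As a rough sketch, that logic can be isolated in a small helper that is independent of gensim. `select_num_topics`, `score_fn`, and `patience` are hypothetical names introduced here; the scores are the rounded values printed above, and the helper starts from negative infinity rather than 0 as in the cell above.

```python
def select_num_topics(candidates, score_fn, patience=2):
    """Return (best_k, best_score, history) for a patience-based search.

    score_fn maps a number of topics to a coherence score. The search stops
    after `patience` consecutive candidates that do not beat the best score.
    """
    best_k, best_score = None, float("-inf")
    history = []
    misses = 0  # consecutive candidates without improvement
    for k in candidates:
        score = score_fn(k)
        history.append((k, score))
        if score > best_score:
            best_k, best_score, misses = k, score, 0
        else:
            misses += 1
            if misses >= patience:
                break
    return best_k, best_score, history


# Coherence scores as printed in the LDA output above (rounded to 4 decimals).
reported_scores = {3: 0.5132, 5: 0.5529, 10: 0.5160, 15: 0.5280}

best_k, best_score, history = select_num_topics(
    [3, 5, 10, 15], reported_scores.__getitem__
)
print(f"best_k={best_k}, best_score={best_score}, evaluated={len(history)}")
```

With these scores the stopping condition is only met at the last candidate, so all four values are evaluated and the early stop does not save any training time here.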


In [9]:
best_model.save('Models/best_LDA_model.gensim')

In [10]:
best_model.get_topics().shape

(5, 13342)

In [11]:
topics = best_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.024*"food" + 0.023*"order" + 0.014*"dont" + 0.014*"restaurant" + 0.014*"worst"')
(1, '0.031*"service" + 0.028*"good" + 0.026*"food" + 0.014*"staff" + 0.013*"buffet"')
(2, '0.052*"place" + 0.039*"food" + 0.033*"good" + 0.018*"service" + 0.017*"ambience"')
(3, '0.026*"chocolate" + 0.024*"cake" + 0.017*"cream" + 0.015*"ice" + 0.010*"brownie"')
(4, '0.044*"good" + 0.040*"chicken" + 0.024*"ordered" + 0.022*"taste" + 0.016*"biryani"')


Topic 0 - **Negative Dining Experience**: This topic includes terms such as "food", "order", "dont", "restaurant", and "worst", indicating negative reviews about dining experiences.

Topic 1 - **Positive Service and Buffet**: The main terms in this topic include "service", "good", "food", "staff", and "buffet". It reflects positive experiences, particularly those related to good service and buffet offerings. 

Topic 2 - **Restaurant Atmosphere and Food Quality**: This topic is characterized by words such as "place", "food", "good", "service", and "ambience". It suggests a focus on the overall dining experience, including the quality of food, the atmosphere of the restaurant, and the service provided.

Topic 3 - **Desserts and Sweets**: Words like "chocolate", "cake", "cream", "ice", and "brownie" dominate this topic, highlighting customer reviews about desserts. 

Topic 4 - **Positive Reviews on Food (Chicken & Biryani)**: This topic includes terms like "good", "chicken", "ordered", "taste", and "biryani". It reflects positive feedback on specific food items, particularly chicken dishes and biryani.

In [12]:
lda_vis = gensimvis.prepare(best_model, corpus, dictionary)
pyLDAvis.display(lda_vis)

<font color='#BFD72F' size=7>3. LSA</font> <a class="anchor" id="p3"></a>

[Back to TOC](#toc)

As we did for the LDA model, we test different numbers of topics for the LSA model to identify the configuration that best captures the themes in the data. Coherence is a commonly used metric for probabilistic models such as LDA, and it can also be computed for LSA. However, LSA topics come from a truncated singular value decomposition, so term weights can be negative and are not probabilities, which makes both the coherence scores and the topics themselves harder to interpret.

In [13]:
# Parameters to test
num_topics_range = [3, 5, 10, 15]  

# Variables to store the best results
best_coherence = 0
best_model_lsa = None
best_num_topics = 0

# List to store coherence scores
coherence_scores = []

# Counter for consecutive non-improvements
no_improvement_count = 0

# Loop to test different values of num_topics
for num_topics in num_topics_range:
    print(f"Testing num_topics={num_topics}...")
    
    # Train the LSA model (LsiModel)
    lsa_model = LsiModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
    
    # Calculate the coherence score
    coherence_model = CoherenceModel(model=lsa_model, texts=data['clean_reviews_tokens'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    
    # Store the result
    coherence_scores.append((num_topics, coherence_score))
    print(f"Coherence Score: {coherence_score:.4f}")
    
    # Check if the coherence score improved
    if coherence_score > best_coherence:
        best_coherence = coherence_score
        best_model_lsa = lsa_model
        best_num_topics = num_topics
        no_improvement_count = 0  # Reset counter
    else:
        no_improvement_count += 1
    
    # Stop if no improvement for two consecutive iterations
    if no_improvement_count >= 2:
        print("Coherence did not improve for two consecutive iterations. Stopping...")
        break

# Final results
print("\nBest Model Found:")
print(f"Number of Topics: {best_num_topics}")
print(f"Best Coherence Score: {best_coherence:.4f}")

# Save results in a DataFrame
results_df = pd.DataFrame(coherence_scores, columns=['num_topics', 'coherence'])
print("\nResults Summary:")
print(results_df)

Testing num_topics=3...
Coherence Score: 0.4932
Testing num_topics=5...
Coherence Score: 0.5072
Testing num_topics=10...
Coherence Score: 0.5030
Testing num_topics=15...
Coherence Score: 0.4807
Coherence did not improve for two consecutive iterations. Stopping...

Best Model Found:
Number of Topics: 5
Best Coherence Score: 0.5072

Results Summary:
   num_topics  coherence
0           3   0.493243
1           5   0.507231
2          10   0.503022
3          15   0.480742


In [14]:
best_model_lsa.save('Models/best_LSA_model.gensim')

In [15]:
best_model_lsa.get_topics().shape

(5, 13342)

In [16]:
topics = best_model_lsa.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.492*"good" + 0.454*"food" + 0.423*"place" + 0.209*"service" + 0.148*"chicken"')
(1, '-0.772*"good" + 0.514*"place" + 0.206*"food" + 0.118*"great" + 0.086*"visit"')
(2, '0.624*"place" + -0.618*"food" + 0.279*"good" + -0.147*"chicken" + -0.134*"ordered"')
(3, '-0.629*"chicken" + 0.390*"food" + -0.301*"ordered" + -0.254*"biryani" + -0.205*"taste"')
(4, '-0.705*"service" + 0.408*"food" + -0.261*"great" + -0.218*"-" + 0.207*"place"')


Although both models surface the same broad themes of food and service, LDA yields more diverse topics, including a distinctly negative topic and a dessert-specific one. The LSA topics are dominated by the same few generic terms ("good", "food", "place", "service") with mixed signs, so they are harder to tell apart and LSA did not separate the themes as effectively as LDA.
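
One rough way to quantify this difference is to count how many distinct terms appear across the printed top-5 lists of each model, and how many terms recur in more than one topic. The sketch below parses the topic strings exactly as printed above; `top_terms` and `vocabulary_overlap` are hypothetical helper names, and since only five terms per topic are considered this is an indicator rather than a formal diversity metric.

```python
import re
from collections import Counter

# Topic strings copied from the print_topics(num_words=5) outputs above.
LDA_TOPICS = [
    '0.024*"food" + 0.023*"order" + 0.014*"dont" + 0.014*"restaurant" + 0.014*"worst"',
    '0.031*"service" + 0.028*"good" + 0.026*"food" + 0.014*"staff" + 0.013*"buffet"',
    '0.052*"place" + 0.039*"food" + 0.033*"good" + 0.018*"service" + 0.017*"ambience"',
    '0.026*"chocolate" + 0.024*"cake" + 0.017*"cream" + 0.015*"ice" + 0.010*"brownie"',
    '0.044*"good" + 0.040*"chicken" + 0.024*"ordered" + 0.022*"taste" + 0.016*"biryani"',
]
LSA_TOPICS = [
    '0.492*"good" + 0.454*"food" + 0.423*"place" + 0.209*"service" + 0.148*"chicken"',
    '-0.772*"good" + 0.514*"place" + 0.206*"food" + 0.118*"great" + 0.086*"visit"',
    '0.624*"place" + -0.618*"food" + 0.279*"good" + -0.147*"chicken" + -0.134*"ordered"',
    '-0.629*"chicken" + 0.390*"food" + -0.301*"ordered" + -0.254*"biryani" + -0.205*"taste"',
    '-0.705*"service" + 0.408*"food" + -0.261*"great" + -0.218*"-" + 0.207*"place"',
]


def top_terms(topic_string):
    """Extract the quoted terms from a gensim topic string."""
    return re.findall(r'"([^"]+)"', topic_string)


def vocabulary_overlap(topic_strings):
    """Return (number of distinct terms, sorted terms shared by 2+ topics)."""
    counts = Counter(
        term for topic in topic_strings for term in set(top_terms(topic))
    )
    shared = sorted(term for term, n in counts.items() if n > 1)
    return len(counts), shared


for name, topic_strings in (("LDA", LDA_TOPICS), ("LSA", LSA_TOPICS)):
    distinct, shared = vocabulary_overlap(topic_strings)
    print(f"{name}: {distinct} distinct terms, shared across topics: {shared}")
```

For the outputs above, the LDA top terms cover 20 distinct words with 3 of them shared between topics, while the LSA top terms cover only 11 distinct words with 7 shared, which is consistent with the qualitative reading.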

<font color='#BFD72F' size=7>4. BERTopic</font> <a class="anchor" id="p4"></a>

[Back to TOC](#toc)

In [8]:
print(data['preproc_reviews'].isna().sum())

22


In [12]:
docs = data["preproc_reviews"].fillna("").reset_index(drop=True)

In [13]:
topic_model = BERTopic(nr_topics=20)
topics, probs = topic_model.fit_transform(docs)

In [14]:
topics_df = pd.DataFrame({'topic': topics, 'document': docs})

In [15]:
# Save the model
topic_model.save("Models/bertopic_model")


In [43]:
pd.reset_option("display.max_colwidth")

In [44]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,3994,-1_food_good_place_service,"[food, good, place, service, ambience, great, ...",[loved food ambience good desert could somewha...
1,0,2161,0_place_good_food_ambience,"[place, good, food, ambience, best, service, g...",[ambience great perfectly lit separate section...
2,1,1469,1_biryani_chicken_ordered_taste,"[biryani, chicken, ordered, taste, worst, food...","[biryani good, good service excellent food chi..."
3,2,515,2_good_nice_excellent_awesome,"[good, nice, excellent, awesome, job, thank, a...","[good, good, v good]"
4,3,285,3_delivery_delivered_time_fast,"[delivery, delivered, time, fast, packing, boy...","[good delivery, delivery, good delivery time d..."
5,4,204,4_zomato_order_gold_restaurant,"[zomato, order, gold, restaurant, received, de...",[delivery poor customer support zomato unable ...
6,5,163,5_taste_tasty_food_good,"[taste, tasty, food, good, awesome, delicious,...","[good taste, taste good, good food tasty]"
7,6,59,6_spicy_cold_food_hot,"[spicy, cold, food, hot, tasty, late, delivere...","[spicy, spicy food, food spicy]"
8,7,56,7_ok_nom_frsh_yup,"[ok, nom, frsh, yup, wowww, weast, hmmm, hhsjo...","[ok ok, ok, ok]"
9,8,55,8_service_good_excellent_mohammed,"[service, good, excellent, mohammed, assum, li...","[good service, good service, good service]"


Interactive Plotly visualizations did not render directly in VSCode's Jupyter environment, so we use a workaround to display the BERTopic visualizations. The error encountered (`Error loading renderer 'jupyter-notebook-renderer'`) points to a problem with the renderer configuration in VSCode, potentially related to the Jupyter or Jupyter Renderers extensions.

As a solution, we will display the visualizations using an external browser by specifying the browser renderer. This allows us to view the interactive plots outside of VSCode.

In [34]:
topic_model = BERTopic.load("Models/bertopic_model")

In [35]:
fig = topic_model.visualize_topics()
fig.show(renderer="browser")

## Output:


![](https://i.ibb.co/0t98fYB/newplot.png)

In [37]:
topics_df

Unnamed: 0,topic,document
0,-1,ambience good food quite good saturday lunch c...
1,-1,ambience good pleasant evening service prompt ...
2,0,must try great food great ambience thnx servic...
3,-1,soumen da arun great guy behavior sincerety go...
4,1,food ordered kodi drumstick basket mutton biry...
...,...,...
9194,-1,visited restaurant friend immediately blown aw...
9195,0,im going cut chase food excellent must say hon...
9196,0,place never disappointed u food courteous staf...
9197,-1,personally love prefer chinese food couple tim...


The BERTopic model produced a range of fairly coherent topics that reflect common themes in the dataset, such as food quality, service, ambiance, delivery, and specific dishes. Other topics are less meaningful: topic 7, for example, groups noise tokens such as "ok", "nom", and "frsh", and 3,994 reviews were assigned to the outlier topic -1. This could be due to issues in the preprocessing stage, such as incorrect tokenization or the inclusion of irrelevant terms.
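
To put the topic sizes in perspective, the counts from the `get_topic_info()` output can be converted into shares of the corpus. This is a small arithmetic sketch, not part of the original analysis: the counts are copied from the table above, `topic_shares` is a hypothetical helper, and the total of 9,198 documents is an assumption inferred from the displayed index of `topics_df` (0 to 9197).

```python
# Counts per topic as displayed in the get_topic_info() output above.
DISPLAYED_COUNTS = {
    -1: 3994, 0: 2161, 1: 1469, 2: 515, 3: 285,
    4: 204, 5: 163, 6: 59, 7: 56, 8: 55,
}

# Assumption: topics_df is indexed 0..9197, i.e. 9,198 documents in total.
TOTAL_DOCS = 9198


def topic_shares(counts, total):
    """Return the fraction of all documents assigned to each topic."""
    return {topic: n / total for topic, n in counts.items()}


shares = topic_shares(DISPLAYED_COUNTS, TOTAL_DOCS)
outlier_share = shares[-1]

# Documents not accounted for by the rows shown in the table.
unlisted_docs = TOTAL_DOCS - sum(DISPLAYED_COUNTS.values())

print(f"Outlier share (topic -1): {outlier_share:.1%}")
print(f"Documents in topics not shown in the table: {unlisted_docs}")
```

Under these assumptions, roughly 43% of the reviews fall into the outlier topic, and the displayed rows do not add up to the full corpus, which suggests the table shown above is an excerpt of the full topic list.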

<font color='#BFD72F' size=7>5. Conclusions</font> <a class="anchor" id="p5"></a>

[Back to TOC](#toc)

LDA (Latent Dirichlet Allocation) identified clear and diverse topics related to food quality, service, and specific dishes. By limiting the number of topics to five, we aimed to make the results easier to interpret. This approach allowed us to focus on the key themes without being overwhelmed by too many details. LDA was effective in distinguishing between different aspects of the reviews, such as specific complaints, praises, and references to the restaurant ambiance.

LSA (Latent Semantic Analysis), on the other hand, was less successful at producing distinct and coherent topics. Although it captured some broad themes similar to those identified by LDA, the topics were more general and sometimes overlapped. 

BERTopic offered another valuable way to explore the data, highlighting themes related to food, service, and ambiance similar to those found by the other models. However, some of the identified topics lacked coherence or contained redundant patterns, likely due to the preprocessing stage or the inherent noise in the dataset.

Due to limitations with VSCode and technical issues encountered during the analysis, we were unable to fully explore the potential of BERTopic. The visualization challenges and errors prevented us from leveraging all the capabilities BERTopic offers, such as interactive exploration and detailed topic refinement.
Given these constraints, we found LDA to be the most reliable and interpretable of the three models. It provided clear, distinct, and meaningful topics that were easier to analyze and understand.