# Topic Modeling with BERTopic for Amazon Fine Food Reviews

In this project, we explore the "Amazon Fine Food Reviews" dataset, which consists of reviews of fine foods from Amazon. The data span more than 10 years and include all ~500,000 reviews up to October 2012 (568,454 rows in the CSV loaded below).

We focus on negative feedback (1-star and 2-star reviews) and look for topics that reveal recurring problems, complaints, and sources of dissatisfaction.

## Data Preparation

* Dataset: https://www.kaggle.com/snap/amazon-fine-food-reviews

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
df = pd.read_csv('/kaggle/input/amazon-fine-food-reviews/Reviews.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568428 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB


In [3]:
df.head()

|   | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|----|-----------|--------|-------------|----------------------|------------------------|-------|------|---------|------|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |
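
The `Time` column holds integers that look like Unix timestamps in seconds. As a minimal sketch, assuming that interpretation, pandas can convert them to dates with `pd.to_datetime(..., unit="s")`. The `sample` frame and the `ReviewDate` column are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# Two `Time` values copied from rows 0 and 4 of the preview above.
sample = pd.DataFrame({"Time": [1303862400, 1350777600]})

# Interpret the integers as seconds since the Unix epoch (UTC).
sample["ReviewDate"] = pd.to_datetime(sample["Time"], unit="s")

print(sample["ReviewDate"].dt.strftime("%Y-%m-%d").tolist())
```

The two values map to 2011-04-27 and 2012-10-21, which fits the stated cut-off of October 2012.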


In [4]:
# Keep only negative reviews (1-star and 2-star)
negative_df = df[df['Score'].isin([1, 2])].copy()
negative_df.head()

|    | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|----|----|-----------|--------|-------------|----------------------|------------------------|-------|------|---------|------|
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 12 | 13 | B0009XLVG0 | A327PCT23YH90 | LT | 1 | 1 | 1 | 1339545600 | My Cats Are Not Fans of the New Food | My cats have been happily eating Felidae Plati... |
| 16 | 17 | B001GVISJM | A3KLWF6WQ5BNYO | Erica Neathery | 0 | 0 | 2 | 1348099200 | poor taste | I love eating them and they are good for watch... |
| 26 | 27 | B001GVISJM | A3RXAU2N8KV45G | lady21 | 0 | 1 | 1 | 1332633600 | Nasty No flavor | The candy is just red , No flavor . Just plan... |


In [5]:
negative_df.shape

(82037, 10)
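
To put the size of the negative subset in context, the short calculation below divides the two row counts reported above. The variable names are illustrative.

```python
# Row counts reported by df.info() and negative_df.shape above.
total_reviews = 568454
negative_reviews = 82037

# Fraction of all reviews that are 1-star or 2-star.
negative_share = negative_reviews / total_reviews
print(f"Negative reviews: {negative_share:.1%} of the dataset")
```

Roughly 14.4% of the reviews are negative, so the topic model below is trained on a minority slice of the data.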

## Data Preprocessing

In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Convert to lowercase
    text = str(text).lower()

    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)
    
    # Remove @mentions and '#' symbols (the hashtag word itself is kept)
    text = re.sub(r'\@\w+|\#','', text)
    
    # Remove numbers and punctuation
    text = re.sub(r"[^a-z\s]", '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize, remove stopwords, and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]
    
    return ' '.join(tokens)

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
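
One side effect of `clean_text` is visible in the outputs further down: punctuation is deleted rather than replaced, so neighbouring words can be glued together (`peanutsth...`), and a stray `br` token in one topic's representative documents suggests that HTML line breaks survive as words. The following is a minimal alternative sketch, assuming the raw reviews contain tags such as `<br />`. `normalize_review` and the example sentence are made up for illustration and were not used to produce the results in this notebook.

```python
import html
import re

def normalize_review(text):
    """Lowercase a review and strip markup without gluing words together."""
    text = html.unescape(str(text)).lower()
    # Replace HTML tags such as <br /> with a space instead of deleting them.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove URLs.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Drop apostrophes so contractions stay one token ("don't" -> "dont").
    text = re.sub(r"['’]", "", text)
    # Replace every other non-letter with a space so "peanuts...the" stays two words.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical review text, not taken from the dataset.
example = "Jumbo Salted Peanuts...the peanuts were small.<br />Don't buy!"
print(normalize_review(example))
```

Stopword removal and lemmatization could then be applied to the result exactly as in `clean_text`.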


In [7]:
from tqdm.notebook import tqdm
tqdm.pandas()

negative_df['Clean_Text'] = negative_df['Text'].progress_apply(clean_text)
negative_df.head()

|    | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text | Clean_Text |
|----|----|-----------|--------|-------------|----------------------|------------------------|-------|------|---------|------|------------|
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... | product arrived labeled jumbo salted peanutsth... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... | looking secret ingredient robitussin believe f... |
| 12 | 13 | B0009XLVG0 | A327PCT23YH90 | LT | 1 | 1 | 1 | 1339545600 | My Cats Are Not Fans of the New Food | My cats have been happily eating Felidae Plati... | cat happily eating felidae platinum two year g... |
| 16 | 17 | B001GVISJM | A3KLWF6WQ5BNYO | Erica Neathery | 0 | 0 | 2 | 1348099200 | poor taste | I love eating them and they are good for watch... | love eating good watching tv looking movie swe... |
| 26 | 27 | B001GVISJM | A3RXAU2N8KV45G | lady21 | 0 | 1 | 1 | 1332633600 | Nasty No flavor | The candy is just red , No flavor . Just plan... | candy red flavor plan chewy would never buy |


## Training the Model

Ref: https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html

In [8]:
%%capture
!pip install bertopic

In [9]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

# Get embeddings
reviews = negative_df['Clean_Text'].tolist()
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(reviews, show_progress_bar=True)

# Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, metric='cosine', random_state=42)

# Cluster the reduced embedding
hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Vectorizer and c-TFIDF
vectorizer_model = CountVectorizer(min_df=2, ngram_range=(1, 2))
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Initialize and train model
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    verbose=True
)

topics, probs = topic_model.fit_transform(reviews, embeddings)

# Show topics
topic_model.get_topic_info()

2025-05-04 16:55:26,638 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-04 16:57:57,498 - BERTopic - Dimensionality - Completed ✓
2025-05-04 16:57:57,503 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-04 16:58:12,420 - BERTopic - Cluster - Completed ✓
2025-05-04 16:58:12,442 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-05-04 16:58:19,507 - BERTopic - Representation - Completed ✓


|   | Topic | Count | Name | Representation | Representative_Docs |
|---|-------|-------|------|----------------|---------------------|
| 0 | -1 | 34883 | -1_box_chip_amazon_product | [box, chip, amazon, product, bar, taste, flavo... | [roller ball treat basically gravy roll around... |
| 1 | 0 | 9717 | 0_coffee_cup_kcups_pod | [coffee, cup, kcups, pod, roast, starbucks, kc... | [probably good organic coffee whole bean form ... |
| 2 | 1 | 5357 | 1_dog_treat_dog food_toy | [dog, treat, dog food, toy, bone, chew, puppy,... | [bought dog food bag showed pictured veggie ch... |
| 3 | 2 | 2870 | 2_tea_green tea_green_tea bag | [tea, green tea, green, tea bag, leaf, jasmine... | [dont much care stash premium loose green tea ... |
| 4 | 3 | 2277 | 3_cat_litter_cat food_food | [cat, litter, cat food, food, eat, food cat, t... | [bought cat food wonderful low price enjoy cat... |
| 5 | 4 | 1805 | 4_energy_juice_drink_soda | [energy, juice, drink, soda, energy drink, caf... | [love soda stream mix ive used quite good pret... |
| 6 | 5 | 1704 | 5_bread_gluten_cake_gluten free | [bread, gluten, cake, gluten free, flour, mix,... | [mom celiac disease one son tree nut allergy a... |
| 7 | 6 | 1231 | 6_oz_price_ounce_walmart | [oz, price, ounce, walmart, store, shipping, d... | [oct ordered different vendor product oz pack ... |
| 8 | 7 | 1218 | 7_stevia_sugar_fructose_sweetener | [stevia, sugar, fructose, sweetener, agave, sp... | [yes know know agave nectar supposed answer se... |
| 9 | 8 | 1188 | 8_china_made china_made_usa | [china, made china, made, usa, chicken jerky, ... | [br fda issue health warning veterinarian pet ... |
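
The `Representation` column lists the words with the highest class-based TF-IDF (c-TF-IDF) scores in each topic. The following is a simplified, standard-library sketch of the published formula, score = tf × log(1 + A / f), where tf is a word's frequency inside a topic, f is its frequency across all topics, and A is the average number of words per topic. An optional square root on tf mimics the idea behind `reduce_frequent_words=True`. This is not BERTopic's exact implementation, which works on sparse matrices and may order the normalisation steps differently. `class_tfidf` and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def class_tfidf(docs_per_topic, reduce_frequent_words=True):
    """Score words per topic with a simplified c-TF-IDF.

    docs_per_topic maps a topic id to the list of cleaned documents in it.
    """
    # Treat each topic as one big document and count its words.
    counts = {
        topic: Counter(" ".join(docs).split())
        for topic, docs in docs_per_topic.items()
    }
    # f: frequency of each word across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {}
        for word, count in topic_counts.items():
            tf = count / n_words  # term frequency within the topic
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampen very frequent words
            scores[topic][word] = tf * math.log(1 + avg_words / total[word])
    return scores

# Toy topics, made up for illustration.
toy = {
    0: ["coffee weak taste", "coffee cup weak"],
    1: ["tea bag taste", "green tea bag"],
}
scores = class_tfidf(toy)
for topic, word_scores in scores.items():
    top = sorted(word_scores, key=word_scores.get, reverse=True)[:2]
    print(topic, top)
```

In the toy data, `taste` occurs in both topics and therefore receives the lowest score in each, which is the behaviour that keeps generic words out of the topic labels.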


## Visualize Topics

In [10]:
topic_model.visualize_barchart(top_n_topics=16)

In [11]:
topic_model.visualize_topics()

In [13]:
# Get the top words in Topic 0
top_words_topic_0 = topic_model.get_topic(0)
print("Top words in Topic 0:", top_words_topic_0)

# Get the top words in Topic 2
top_words_topic_2 = topic_model.get_topic(2)
print("Top words in Topic 2:", top_words_topic_2)

# Get the top words in Topic 38
top_words_topic_38 = topic_model.get_topic(38)
print("Top words in Topic 38:", top_words_topic_38)

Top words in Topic 0: [('coffee', 0.31902365100288743), ('cup', 0.2382136148296567), ('kcups', 0.22249013770430487), ('pod', 0.22196782678669902), ('roast', 0.2160466721489919), ('starbucks', 0.20161280096473155), ('kcup', 0.19988676177014714), ('keurig', 0.19673825246435842), ('weak', 0.1831696613265393), ('bean', 0.1786750594192751)]
Top words in Topic 2: [('tea', 0.4358793799580712), ('green tea', 0.3082123485256532), ('green', 0.2670820769048698), ('tea bag', 0.24752372355306748), ('leaf', 0.2352804505535101), ('jasmine', 0.21967011370646877), ('black tea', 0.19652412867467503), ('tea taste', 0.19140035078228346), ('matcha', 0.1827805383324629), ('earl', 0.18037728753099738)]
Top words in Topic 38: [('chai', 0.7765776770997975), ('chai tea', 0.5522085960813354), ('clove', 0.41886581863865663), ('spice', 0.3792129494847659), ('tazo', 0.3529988685550384), ('tea', 0.3487895935454036), ('cinnamon', 0.3282669280666814), ('love chai', 0.32664186784333543), ('latte', 0.2996226733751508), 

In [16]:
topic_model.visualize_heatmap()

**Some topics are semantically similar and may form higher-level clusters, such as Topic 2 (tea) and Topic 38 (chai), whose top words overlap.**
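
A topic-similarity heatmap of this kind is typically built from pairwise cosine similarities between topic vectors. The sketch below shows that computation with NumPy on made-up three-dimensional vectors. On the fitted model, the rows would presumably come from an attribute such as `topic_model.topic_embeddings_`, which is an assumption here rather than something shown in the notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length, then take all dot products.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Toy "topic embeddings", made up for illustration.
toy_topic_vectors = {
    "tea": [0.9, 0.1, 0.0],
    "chai": [0.8, 0.3, 0.1],
    "dog": [0.0, 0.2, 0.9],
}
names = list(toy_topic_vectors)
sim = cosine_similarity_matrix(list(toy_topic_vectors.values()))
for i, name in enumerate(names):
    print(name, np.round(sim[i], 2))
```

Pairs with a similarity close to 1, like the toy `tea` and `chai` vectors, are candidates for merging into a single higher-level topic.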

In [18]:
topic_model.visualize_hierarchy()

## Testing

In [19]:
test_review = "I bought this dog food hoping it would be a healthy choice, but it made my dog sick. He had stomach issues for days, and I had to visit the vet. I’ll never buy this brand again."
topic, prob = topic_model.transform(test_review)

print("Predicted Topic:", topic)
print("Probability:", prob)
print("Top words in topic:", topic_model.get_topic(topic[0]))

2025-05-04 17:01:54,307 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2025-05-04 17:02:11,678 - BERTopic - Dimensionality - Completed ✓
2025-05-04 17:02:11,678 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2025-05-04 17:02:11,680 - BERTopic - Cluster - Completed ✓


Predicted Topic: [1]
Probability: [0.75502627]
Top words in topic: [('dog', 0.31566474401445194), ('treat', 0.23684333169887137), ('dog food', 0.22357212442696656), ('toy', 0.22272746395441928), ('bone', 0.22162442277794261), ('chew', 0.203437756092578), ('puppy', 0.19997265407481432), ('food', 0.18585313116010832), ('pet', 0.18153067933563294), ('vet', 0.18059930313628544)]


In [21]:
test_review = "I had high hopes for this tea, but it was a complete disappointment. The flavor was incredibly weak and watery"
topic, prob = topic_model.transform(test_review)

print("Predicted Topic:", topic)
print("Probability:", prob)
print("Top words in topic:", topic_model.get_topic(topic[0]))

2025-05-04 17:03:18,155 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2025-05-04 17:03:18,160 - BERTopic - Dimensionality - Completed ✓
2025-05-04 17:03:18,161 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2025-05-04 17:03:18,162 - BERTopic - Cluster - Completed ✓


Predicted Topic: [2]
Probability: [0.87007208]
Top words in topic: [('tea', 0.4358793799580712), ('green tea', 0.3082123485256532), ('green', 0.2670820769048698), ('tea bag', 0.24752372355306748), ('leaf', 0.2352804505535101), ('jasmine', 0.21967011370646877), ('black tea', 0.19652412867467503), ('tea taste', 0.19140035078228346), ('matcha', 0.1827805383324629), ('earl', 0.18037728753099738)]


In [24]:
test_review = "I had high hopes for this tea, but it was a complete disappointment."
topic, prob = topic_model.transform(test_review)

print("Predicted Topic:", topic)
print("Probability:", prob)
print("Top words in topic:", topic_model.get_topic(topic[0]))

2025-05-04 17:04:46,420 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2025-05-04 17:04:46,426 - BERTopic - Dimensionality - Completed ✓
2025-05-04 17:04:46,427 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2025-05-04 17:04:46,428 - BERTopic - Cluster - Completed ✓


Predicted Topic: [2]
Probability: [0.85255125]
Top words in topic: [('tea', 0.4358793799580712), ('green tea', 0.3082123485256532), ('green', 0.2670820769048698), ('tea bag', 0.24752372355306748), ('leaf', 0.2352804505535101), ('jasmine', 0.21967011370646877), ('black tea', 0.19652412867467503), ('tea taste', 0.19140035078228346), ('matcha', 0.1827805383324629), ('earl', 0.18037728753099738)]


In [27]:
test_review = "I was really disappointed with this cocoa powder. It had a bitter, stale taste "
topic, prob = topic_model.transform(test_review)

print("Predicted Topic:", topic)
print("Probability:", prob)
print("Top words in topic:", topic_model.get_topic(topic[0]))

2025-05-04 17:20:55,494 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2025-05-04 17:20:55,500 - BERTopic - Dimensionality - Completed ✓
2025-05-04 17:20:55,500 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2025-05-04 17:20:55,502 - BERTopic - Cluster - Completed ✓


Predicted Topic: [9]
Probability: [0.76716408]
Top words in topic: [('chocolate', 0.45938661812080545), ('cocoa', 0.4063697197190554), ('hot chocolate', 0.37204309548071457), ('hot', 0.3207766485659092), ('dark chocolate', 0.3030653631398925), ('hershey', 0.2613760960234172), ('lindt', 0.2605582810249242), ('dark', 0.26022404336671806), ('hot cocoa', 0.25478990121963135), ('milk chocolate', 0.25445231057509526)]
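
The test cells above repeat the same three steps for each review and pass raw text, although the model was trained on `Clean_Text`. The following is a minimal sketch of a helper that scores several reviews in one call and optionally applies the training-time cleaning first. `describe_predictions` and `StubModel` are hypothetical names. The stub only stands in for a fitted model so that the snippet runs on its own, and the helper assumes `transform` returns one probability per document, as in the outputs above.

```python
def describe_predictions(model, docs, preprocess=None, top_k=3):
    """Return (topic id, probability, top words) for each document.

    `model` is expected to behave like a fitted BERTopic model: `transform`
    returns (topics, probabilities) and `get_topic` returns (word, score) pairs.
    """
    if preprocess is not None:
        # Apply the same cleaning that was used on the training text.
        docs = [preprocess(doc) for doc in docs]
    topics, probs = model.transform(docs)
    results = []
    for topic, prob in zip(topics, probs):
        # Fall back to an empty list if the topic id is unknown.
        words = model.get_topic(topic) or []
        results.append((int(topic), float(prob), [w for w, _ in words[:top_k]]))
    return results


class StubModel:
    """Hypothetical stand-in so the sketch runs without a trained model."""

    _topics = {1: [("dog", 0.3), ("treat", 0.2)], 2: [("tea", 0.4), ("green tea", 0.3)]}

    def transform(self, docs):
        return [1 if "dog" in doc else 2 for doc in docs], [0.75 for _ in docs]

    def get_topic(self, topic):
        return self._topics.get(topic, False)


results = describe_predictions(StubModel(), ["My dog got sick", "Weak tea"], preprocess=str.lower)
print(results)
```

With the notebook's objects, the call would look like `describe_predictions(topic_model, [review_1, review_2], preprocess=clean_text)`.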
