# LDA Topic Modeling — Amazon Trustpilot Reviews

**Purpose:** Identify recurring topics in Trustpilot reviews of Amazon using Latent Dirichlet Allocation (LDA).  
This notebook is a draft for team use: every cell has clear explanations so teammates can follow and reproduce results.

**High-level steps:**
1. Download the dataset from Kaggle with the `kaggle` CLI
2. Exploratory Data Analysis (EDA)
3. Preprocessing & cleaning
4. Vectorization (CountVectorizer)
5. Train LDA
6. Inspect topics and assign dominant topic per review
7. Visualize with pyLDAvis
8. Save outputs for integration with backend/dashboard


In [2]:
!pip install nltk spacy scikit-learn pyLDAvis joblib
!python -m spacy download en_core_web_sm

Collecting nltk
  Downloading nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting spacy
  Downloading spacy-3.8.11-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (27 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting joblib
  Using cached joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2025.11.3-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (40 kB)
Collecting tqdm (from nltk)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)

In [None]:
# Import Essential Libraries
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import warnings
import nltk

# Download NLTK resources on first run.
# WordNetLemmatizer loads WordNet lazily, so call lemmatize() once inside the
# try block; otherwise a missing 'wordnet' corpus only fails later on.
try:
    lemmatizer = WordNetLemmatizer()
    lemmatizer.lemmatize('reviews')
    stop_words = set(stopwords.words('english'))
except LookupError:
    print("NLTK resources not found. Downloading 'stopwords' and 'wordnet'...")
    nltk.download('stopwords')
    nltk.download('wordnet')
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

# Suppress warnings
warnings.filterwarnings("ignore")

# Define global constants and file path
FILE_PATH = "/home/ismail/code/belachkar/voicelens_capstone_project/raw_data/review_for_amazon.csv"
ENCODING = 'latin-1'
TARGET_COLUMN = 'topic' # Final column for descriptive topic string
TEMP_ID_COLUMN = 'topic_id' # Temporary column for the numerical ID
SCORE_COLUMN = 'topic_score'
TEXT_COLUMN = 'comment' # Column holding the raw review text
RANDOM_STATE = 42
OPTIMAL_K = 5 # Number of topics; fixed at 5 for demonstration, not tuned

In [21]:
import os
import re
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import joblib

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Ensure NLTK tokenizers are available (optional)
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")

print("Libraries loaded successfully")

Libraries loaded successfully


In [43]:
## Method 1: Data Loading
def load_data(file_path: str, encoding: str) -> pd.DataFrame:
    """Loads the CSV file into a Pandas DataFrame."""
    try:
        df = pd.read_csv(file_path, encoding=encoding)
        if TEXT_COLUMN not in df.columns:
            raise ValueError(f"Column '{TEXT_COLUMN}' not found.")
        df.dropna(subset=[TEXT_COLUMN], inplace=True)
        return df
    except Exception as e:
        print(f"Error loading data: {e}")
        return pd.DataFrame()

## Method 2: Text Preprocessing
def preprocess_text(text: str) -> str:
    """Cleans text: lowercasing, punctuation removal, stop word removal, and lemmatization."""
    if not isinstance(text, str):
        return ""
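    text = text.lower()
    # Keep only a-z and whitespace. Other characters are deleted, not replaced
    # by a space, so "reset,many" becomes "resetmany".
    text = re.sub(r'[^a-z\s]', '', text)
    words = text.split()
    # Drop stop words and tokens of one or two characters, then lemmatize
    words = [word for word in words if word not in stop_words and len(word) > 2]
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)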

# Apply and Test the methods
df = load_data(FILE_PATH, ENCODING)
df['processed_text'] = df[TEXT_COLUMN].apply(preprocess_text)
print(f"Dataset Shape: {df.shape}")
df.info()

Dataset Shape: (12948, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12948 entries, 0 to 12947
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            12948 non-null  object
 1   n_review        12948 non-null  int64 
 2   country         12947 non-null  object
 3   comment         12948 non-null  object
 4   rating          12948 non-null  int64 
 5   date            12948 non-null  object
 6   processed_text  12948 non-null  object
dtypes: int64(2), object(5)
memory usage: 708.2+ KB
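
Because `preprocess_text` deletes non-letter characters instead of replacing them with a space, neighbouring words can fuse (`reset,many` becomes `resetmany`), and the literal `<U+0092>` placeholders in the raw comments leave stray letters behind (`isn<U+0092>t` becomes `isnut`). The sketch below is one possible cleaning step, not the notebook's method. The name `clean_text_sketch` and both regex patterns are assumptions made for illustration; stop word removal and lemmatization would still follow as before.

```python
import re

# Assumption: the raw CSV contains literal "<U+XXXX>" placeholders, as seen in
# the sample comments. These patterns are illustrative, not from the notebook.
UNICODE_PLACEHOLDER = re.compile(r'<U\+[0-9A-Fa-f]{4,6}>')
NON_LETTER = re.compile(r'[^a-z\s]')

def clean_text_sketch(text: str) -> str:
    """Lowercase, drop <U+XXXX> placeholders, and turn punctuation into spaces."""
    if not isinstance(text, str):
        return ""
    # Remove placeholders outright so "isn<U+0092>t" becomes "isnt", not "isnut"
    text = UNICODE_PLACEHOLDER.sub('', text)
    text = text.lower()
    # Drop plain apostrophes so "can't" stays a single token
    text = text.replace("'", "")
    # Replace every other non-letter with a space to avoid fusing words
    text = NON_LETTER.sub(' ', text)
    # Collapse repeated whitespace
    return ' '.join(text.split())

print(clean_text_sketch("reset,many attempts; it isn<U+0092>t fair"))
```

Swapping this in would change the vocabulary, so the topics and scores shown in the rest of this notebook would need to be regenerated.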


In [None]:
## Method 3: Feature Vectorization (Count Vectorizer)
def vectorize_text_for_lda(series: pd.Series, max_features: int = 5000):
    """Creates a Count Matrix (Document-Term Matrix) from the processed text for LDA."""
    vectorizer = CountVectorizer(max_features=max_features)
    X = vectorizer.fit_transform(series)
    return X, vectorizer

## Method 4: Train LDA Topic Model
def train_lda_model(X, n_topics: int, random_state: int) -> LatentDirichletAllocation:
    """Trains the Latent Dirichlet Allocation (LDA) model."""
    print(f"\nTraining LDA model with {n_topics} topics...")
    model = LatentDirichletAllocation(
        n_components=n_topics,
        max_iter=10,
        learning_method='batch',
        random_state=random_state,
        n_jobs=-1,
        verbose=0
    )
    model.fit(X)
    print("Training complete.")
    return model

# Run Feature Engineering and Training
X_counts, vectorizer = vectorize_text_for_lda(df['processed_text'])
lda_model = train_lda_model(X_counts, OPTIMAL_K, RANDOM_STATE)

## Method 5: Interpret Topics (Extract Keywords)
def interpret_topics(model: LatentDirichletAllocation, vectorizer: CountVectorizer, top_n: int = 10):
    """Extracts the top-weighted words for each topic."""
    feature_names = vectorizer.get_feature_names_out()
    topics = {}

    for topic_idx, topic in enumerate(model.components_):
        # argsort is ascending, so this slice takes the last top_n indices in reverse order
        top_words_indices = topic.argsort()[:-top_n - 1:-1]
        top_words = [feature_names[i] for i in top_words_indices]
        topics[topic_idx] = top_words
        print(f"Topic {topic_idx} Keywords: {', '.join(top_words)}")

    return topics

# Test the method
topic_keywords_dict = interpret_topics(lda_model, vectorizer)


Training LDA model with 5 topics...
Training complete.
Topic 0 Keywords: amazon, account, card, money, customer, gift, company, credit, email, prime
Topic 1 Keywords: amazon, delivery, day, prime, order, item, time, package, shipping, get
Topic 2 Keywords: amazon, service, great, always, good, customer, love, price, delivery, best
Topic 3 Keywords: amazon, review, product, seller, item, return, bad, buy, good, get
Topic 4 Keywords: customer, amazon, service, item, order, refund, time, would, get, told


In [None]:
## Method 6: Create Descriptive Topic Mapping
def create_topic_mapping(topic_keywords_dict: dict) -> dict:
    """
    Creates a mapping dictionary from numerical ID to a descriptive string label.

    NOTE: In a real-world scenario, a human analyst would review the keywords
    and manually assign these names for better accuracy.
    """
    topic_mapping = {}
    print("\n--- Manually Review & Name Topics ---")

    for topic_id, keywords in topic_keywords_dict.items():
        # Heuristic Naming: Use the top 3 keywords concatenated
        # REPLACE THIS LOGIC with manual names after reviewing the keywords
        name = " ".join(keywords[:3]).title()
        topic_mapping[topic_id] = name
        print(f"Topic {topic_id} (Keywords: {', '.join(keywords[:5])}) -> Assigned Name: '{name}'")

    return topic_mapping

# Create the mapping
topic_name_map = create_topic_mapping(topic_keywords_dict)


## Method 7: Assign Topics and Descriptive Labels
def assign_topics(df: pd.DataFrame, model: LatentDirichletAllocation, X_counts, topic_map: dict) -> pd.DataFrame:
    """
    Assigns each review its dominant topic: the probability of that topic
    (topic_score) and its descriptive label. Modifies df in place.
    """
    X_topic_distribution = model.transform(X_counts)

    # 1. Assign Numerical ID and Score
    df[TEMP_ID_COLUMN] = X_topic_distribution.argmax(axis=1)
    df[SCORE_COLUMN] = X_topic_distribution.max(axis=1)

    # 2. Assign Descriptive String Label (The Final Target Column)
    df[TARGET_COLUMN] = df[TEMP_ID_COLUMN].map(topic_map)

    # Drop the temporary numerical ID column
    df.drop(columns=[TEMP_ID_COLUMN], inplace=True)

    return df

# Test the final assignment method
df = assign_topics(df, lda_model, X_counts, topic_name_map)

print("\nSample Data with Final Descriptive Topics:")
print(df[[TEXT_COLUMN, TARGET_COLUMN, SCORE_COLUMN]].head())

# Check final topic distribution
print("\nFinal Topic Distribution:")
print(df[TARGET_COLUMN].value_counts())


--- Manually Review & Name Topics ---
Topic 0 (Keywords: amazon, account, card, money, customer) -> Assigned Name: 'Amazon Account Card'
Topic 1 (Keywords: amazon, delivery, day, prime, order) -> Assigned Name: 'Amazon Delivery Day'
Topic 2 (Keywords: amazon, service, great, always, good) -> Assigned Name: 'Amazon Service Great'
Topic 3 (Keywords: amazon, review, product, seller, item) -> Assigned Name: 'Amazon Review Product'
Topic 4 (Keywords: customer, amazon, service, item, order) -> Assigned Name: 'Customer Amazon Service'

Sample Data with Final Descriptive Topics:
                                             comment                    topic  \
0  Uncaring and incompetent\r\r\n\r\r\nImpossible...  Customer Amazon Service   
1  Amazon maybe the quickest way to get<U+0085>\r...      Amazon Account Card   
2  Not fair!\r\r\n\r\r\nIn genera! I am an Amazon...      Amazon Account Card   
3  Amazon Prime is crap\r\r\n\r\r\nAmazon Prime i...      Amazon Delivery Day   
4  Terrible deli
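
The heuristic names ('Amazon Account Card', 'Amazon Delivery Day') are placeholders, as the docstring of `create_topic_mapping` notes. A minimal sketch of the manual alternative follows. The labels in `MANUAL_TOPIC_NAMES` are hypothetical suggestions based on the keywords printed earlier, and `name_topics_sketch` is an illustrative helper, not part of the notebook. Topic IDs are only stable for a fixed `RANDOM_STATE` and unchanged data, so the mapping has to be reviewed after every retraining.

```python
# Hypothetical labels written after reading the keywords; not from the notebook
MANUAL_TOPIC_NAMES = {
    0: "Account & Payments",
    1: "Delivery & Shipping",
    2: "Positive Service Experience",
    3: "Products, Sellers & Reviews",
    4: "Customer Service & Refunds",
}

def name_topics_sketch(topic_keywords: dict, manual_names: dict) -> dict:
    """Prefer a manual label; fall back to the top-3-keyword heuristic."""
    mapping = {}
    for topic_id, keywords in topic_keywords.items():
        fallback = " ".join(keywords[:3]).title()
        mapping[topic_id] = manual_names.get(topic_id, fallback)
    return mapping

example_keywords = {
    0: ["amazon", "account", "card"],
    5: ["amazon", "delivery", "day"],  # no manual label, so the heuristic applies
}
example_mapping = name_topics_sketch(example_keywords, MANUAL_TOPIC_NAMES)
print(example_mapping)
```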

In [None]:
## Method 8: Evaluate LDA Model
def evaluate_lda(model: LatentDirichletAllocation, X):
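    """Prints perplexity and the approximate log-likelihood (variational bound) of X.

    Both are computed on the training matrix here, so they are optimistic
    compared with a held-out set.
    """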

    perplexity = model.perplexity(X)
    log_likelihood = model.score(X)

    print("\n--- LDA Model Evaluation Metrics ---")
    print(f"Perplexity (Lower is Better): {perplexity:.2f}")
    print(f"Log Likelihood (Higher is Better): {log_likelihood:.2f}")

# Test the method
evaluate_lda(lda_model, X_counts)


--- LDA Model Evaluation Metrics ---
Perplexity (Lower is Better): 923.50
Log Likelihood (Higher is Better): -3517017.54
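
The two metrics are not independent. As far as the scikit-learn implementation goes, perplexity is derived from the same approximate log-likelihood, divided by the total token count of the matrix: `perplexity = exp(-score / X.sum())`. The sketch below only rearranges that relation with the two printed values to recover the token count they imply. It is a sanity check under that assumption, not an additional evaluation.

```python
import math

# Values copied from the evaluation output
perplexity = 923.50
log_likelihood = -3517017.54

# perplexity = exp(-log_likelihood / n_tokens)
# => n_tokens = -log_likelihood / ln(perplexity)
implied_tokens = -log_likelihood / math.log(perplexity)
print(f"Implied token count in X_counts: {implied_tokens:,.0f}")

# Round trip: recompute the perplexity from the implied token count
recomputed = math.exp(-log_likelihood / implied_tokens)
print(f"Recomputed perplexity: {recomputed:.2f}")
```

Comparing the implied count with `X_counts.sum()` is a quick way to confirm that both numbers come from the same matrix. Because the log-likelihood scales with corpus size, perplexity is the easier metric to compare across datasets.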


In [36]:
df.head()

Unnamed: 0,name,n_review,country,comment,rating,date,processed_text,topic_score,topic
0,Graham MOORE,21,GB,Uncaring and incompetent\r\r\n\r\r\nImpossible...,1,2022-06-20,uncaring incompetent impossible deal customer ...,0.5523,Customer Amazon Service
1,popadog,5,GB,Amazon maybe the quickest way to get<U+0085>\r...,2,2022-06-20,amazon maybe quickest way getu amazon maybe qu...,0.535572,Amazon Account Card
2,Andrew Torok,6,US,Not fair!\r\r\n\r\r\nIn genera! I am an Amazon...,1,2022-06-20,fair genus amazon junkie love tthose package c...,0.403145,Amazon Account Card
3,Jerry Jocoy,15,US,Amazon Prime is crap\r\r\n\r\r\nAmazon Prime i...,1,2022-06-20,amazon prime crap amazon prime crap first orde...,0.531605,Amazon Delivery Day
4,steve erickson,3,US,Terrible delivery services\r\r\n\r\r\nTerrible...,1,2022-06-19,terrible delivery service terrible delivery se...,0.707978,Amazon Delivery Day
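
`topic_score` is the probability of the dominant topic, so a value near 0.40 (row 2) means the review is spread across several topics. A possible refinement is to keep the label only when the score clears a threshold. The sketch below copies the five rows shown above by hand; `MIN_TOPIC_SCORE`, the `topic_confident` column and the 'Mixed / Unclear' label are hypothetical names chosen for illustration.

```python
import pandas as pd

# First five rows of the notebook's output, copied by hand
sample = pd.DataFrame({
    "topic": [
        "Customer Amazon Service",
        "Amazon Account Card",
        "Amazon Account Card",
        "Amazon Delivery Day",
        "Amazon Delivery Day",
    ],
    "topic_score": [0.5523, 0.535572, 0.403145, 0.531605, 0.707978],
})

MIN_TOPIC_SCORE = 0.5  # hypothetical cut-off; tune on the real score distribution

# Keep the label where the score is high enough, otherwise mark the row as mixed
sample["topic_confident"] = sample["topic"].where(
    sample["topic_score"] >= MIN_TOPIC_SCORE, "Mixed / Unclear"
)
print(sample)
```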


In [None]:
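print(df[['comment','topic']])
# Print the topic and the start of the comment for the first five reviews
for row in df[['comment', 'topic']].head().itertuples(index=False):
    print(f"{row.topic}: {row.comment[:80]}")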

                                                 comment  \
0      Uncaring and incompetent\r\r\n\r\r\nImpossible...   
1      Amazon maybe the quickest way to get<U+0085>\r...   
2      Not fair!\r\r\n\r\r\nIn genera! I am an Amazon...   
3      Amazon Prime is crap\r\r\n\r\r\nAmazon Prime i...   
4      Terrible delivery services\r\r\n\r\r\nTerrible...   
...                                                  ...   
12943  Fast!!\r\n\r\nI have had perfect order fulfill...   
12944  Consistently Excellent\r\n\r\nI have had perfe...   
12945  Good prices but delivery can take time :(\r\n\...   
12946  World-class online shopping\r\n\r\nI have plac...   
12947  No title\r\n\r\nthose goods i've ordered by Am...   

                         topic  
0      Customer Amazon Service  
1          Amazon Account Card  
2          Amazon Account Card  
3          Amazon Delivery Day  
4          Amazon Delivery Day  
...                        ...  
12943     Amazon Service Great  
12944     Amazo

In [41]:
pd.set_option('display.max_colwidth', None)
df[['comment','topic','processed_text']].head()

Unnamed: 0,comment,topic,processed_text
0,"Uncaring and incompetent\r\r\n\r\r\nImpossible to deal with customer service. I purchased an echo show as a gift for an elderly, blind lady.\r\r\nAfter a power cut the machine would not reconnect.\r\r\n\r\r\nI tried to help but Amazon blocked her account.\r\r\n\r\r\nNumerous attempts to reset,many attempts to deal with Amazon service -all fobbed off to various other departments.\r\r\n\r\r\nA blind elderly lady has been ignored and no help given.\r\r\n\r\r\nShame on Amazon",Customer Amazon Service,uncaring incompetent impossible deal customer service purchased echo show gift elderly blind lady power cut machine would reconnect tried help amazon blocked account numerous attempt resetmany attempt deal amazon service fobbed various department blind elderly lady ignored help given shame amazon
1,"Amazon maybe the quickest way to get<U+0085>\r\r\n\r\r\nAmazon maybe the quickest way to get what you want but it isn<U+0092>t always the best option; customer service is frankly overall poor; frequent contact after purchase is necessary to report poor construction, inferior materials and specification in the face of Amazon<U+0092>s massive under-cutting of UK manufacturers; undermining the value of the UK highstreet just to get back your hard-earner cash. Agents are sometime helpful but often slow to act when challenged over repeated account violations by Sellers who try to interfere with the returns process or who object to some reviews; Amazon and it<U+0092>s Sellers are overly sensitive to criticism; I<U+0092>ve finally pulled the plug on online purchases from the big A, cancelled my Prime package and frozen my Amazon New Day linked credit card account after repeated attempts to take money not authorised by me. Had to invoke fraud investigation to stop this worrying account activity and force an investigation. Outcome unsatisfactory!!",Amazon Account Card,amazon maybe quickest way getu amazon maybe quickest way get want isnut always best option customer service frankly overall poor frequent contact purchase necessary report poor construction inferior material specification face amazonus massive undercutting manufacturer undermining value highstreet get back hardearner cash agent sometime helpful often slow act challenged repeated account violation seller try interfere return process object review amazon itus seller overly sensitive criticism iuve finally pulled plug online purchase big cancelled prime package frozen amazon new day linked credit card account repeated attempt take money authorised invoke fraud investigation stop worrying account activity force investigation outcome unsatisfactory
2,"Not fair!\r\r\n\r\r\nIn genera! I am an Amazon junkie. I love tthose packages coming in day after day. I think they have one of the best delivery systems in the business. So why a one star? When they protect sellers so customers can't post an honest review I think that is a dis-service. I recently received a gift which would have benefited from an honest review but Amazon blocked it by saying there was an unusual level of activity so only reviews by senders could be written. Duh, how could they know? Not fair. I was blocked from writing a review, being the recipient of the gift.",Amazon Account Card,fair genus amazon junkie love tthose package coming day day think one best delivery system business one star protect seller customer cant post honest review think disservice recently received gift would benefited honest review amazon blocked saying unusual level activity review sender could written duh could know fair blocked writing review recipient gift
3,"Amazon Prime is crap\r\r\n\r\r\nAmazon Prime is crap first of all I order an Al Mar knife for $140 and then what I get is a counterfeit $30 knife I send it back and they tell me I may have to wait 30 days to get my money back! Second thing I order 4 air filters for my RV A/C and I get the package and there's two filters in there they want me to take my filter out put my dirty one back in my air conditioner and send it back to them for them to do anything about it. I'm tired of their incompetence, and they're Crooks!",Amazon Delivery Day,amazon prime crap amazon prime crap first order mar knife get counterfeit knife send back tell may wait day get money back second thing order air filter get package there two filter want take filter put dirty one back air conditioner send back anything tired incompetence theyre crook
4,"Terrible delivery services\r\r\n\r\r\nTerrible delivery services and very inconsistent of where deliveries are left at a presice location.\r\r\nI would estimate 95% of my deliveries are not at the\r\r\nlocation i left in directions i left. Most diver's do not read the simple instructions, their to busy.",Amazon Delivery Day,terrible delivery service terrible delivery service inconsistent delivery left presice location would estimate delivery location left direction left diver read simple instruction busy


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12948 entries, 0 to 12947
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            12948 non-null  object 
 1   n_review        12948 non-null  int64  
 2   country         12947 non-null  object 
 3   comment         12948 non-null  object 
 4   rating          12948 non-null  int64  
 5   date            12948 non-null  object 
 6   processed_text  12948 non-null  object 
 7   topic_score     12948 non-null  float64
 8   topic           12948 non-null  object 
dtypes: float64(1), int64(2), object(6)
memory usage: 910.5+ KB
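
The last planned step, saving outputs for the backend/dashboard, could look roughly like the sketch below. It uses `joblib` (already imported above) and tiny stand-in objects so it runs on its own; in the notebook the real `df`, `vectorizer` and `lda_model` would take their place. The file names and the temporary output folder are assumptions.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-ins for the notebook's df, vectorizer and lda_model
demo_df = pd.DataFrame({"processed_text": [
    "late delivery package",
    "refund customer service",
    "great price fast delivery",
    "account card blocked",
]})
demo_vectorizer = CountVectorizer()
demo_X = demo_vectorizer.fit_transform(demo_df["processed_text"])
demo_lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(demo_X)

# Hypothetical output location; replace with the project's artifact folder
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "lda_artifacts.joblib")
csv_path = os.path.join(out_dir, "reviews_with_topics.csv")

# Persist the vectorizer with the model: both are needed to score new text
joblib.dump({"vectorizer": demo_vectorizer, "lda": demo_lda}, model_path)
demo_df.to_csv(csv_path, index=False)

# Reload and score an unseen (already preprocessed) review
artifacts = joblib.load(model_path)
new_X = artifacts["vectorizer"].transform(["package delivery refund"])
topic_probs = artifacts["lda"].transform(new_X)
print(topic_probs.shape)
```

Storing the topic name mapping next to these files would let the backend turn the `argmax` of `topic_probs` into the same labels used here.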
