# Introduction

For a business owner, customer reviews can be a valuable source of insight. Imagine being able to continuously monitor areas for improvement that would increase customer satisfaction, and to highlight the best parts of the business for effective branding.

This project aims to segment user reviews into several topics for easier analysis.

The key components of our project include:
- **Review clustering**: to segment customer reviews into distinct clusters by representing the reviews as embeddings (a combination of a pre-trained LLM and a self-trained model),
- **Sentiment analysis**: to classify the sentiment of a review as positive or negative,
- **Topic labeling**: to label review topics within each cluster using a large language model (LLM).


## Dataset Introduction

The dataset for this project is the [Google Local dataset](https://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local), obtained from J. McAuley's lab.

Originally, the dataset contains millions of business reviews from across the United States up to 2021. However, for the sake of simplicity and due to limited resources for this project, we focus exclusively on one state and one business type: **tourist attractions in Hawaii**.

In the end, we settle on approximately 260k reviews covering about 1200 locations.

In [None]:
!pip install google-cloud-storage
!python -m spacy download en_core_web_sm
!pip install langdetect
!pip install -q transformers
!pip install sentence-transformers
!pip install hdbscan
!pip install skorch
!pip install openai

In [69]:
from google.cloud import storage
from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import io

In [449]:
def download_csv_from_gcs(bucket, file_name,
                          date_columns=None, col_names=None):
    """ A function to download dataset from GCS. """

    blob = bucket.blob(file_name)
    data = blob.download_as_text()
    df = pd.read_csv(io.StringIO(data),
                     parse_dates=date_columns,
                     usecols=col_names)
    return df

In [453]:
# Create a GCS client and get the specified bucket
client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

In [454]:
# Download the dataset from GCS
reviews_df = download_csv_from_gcs(bucket, REVIEW_CSV)

In [455]:
reviews_df.head()

| | business_id | user_id | time | text |
|---|---|---|---|---|
| 0 | 0x7954d184b450b1e7:0x4bee7e570ae07db8 | 109709907397075607894 | 1521793918433 | Went their for a field trip. It was awesome! s... |
| 1 | 0x7954d184b450b1e7:0x4bee7e570ae07db8 | 108968256029885805791 | 1574633258124 | Nice interpretation center, hard to find the w... |
| 2 | 0x7954d184b450b1e7:0x4bee7e570ae07db8 | 113167915373388818291 | 1583292550820 | Great water birds! Clean place and easy access! |
| 3 | 0x7954d184b450b1e7:0x4bee7e570ae07db8 | 117153367922518677632 | 1528995771126 | Be sure to stop by the visitors center first. ... |
| 4 | 0x7954d184b450b1e7:0x4bee7e570ae07db8 | 112253051829957730666 | 1549652370655 | Great outdoor excursion. Ponds next to the oce... |


In [456]:
# Convert the `time` from 'ms' to datetime
reviews_df["time"] = pd.to_datetime(reviews_df["time"], unit="ms")

In [457]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259581 entries, 0 to 259580
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   business_id  259581 non-null  object        
 1   user_id      259581 non-null  object        
 2   time         259581 non-null  datetime64[ns]
 3   text         259579 non-null  object        
dtypes: datetime64[ns](1), object(3)
memory usage: 7.9+ MB


In [458]:
# Count the number of unique businesses and users
reviews_df[["business_id", "user_id"]].nunique()

| | 0 |
|---|---|
| business_id | 1235 |
| user_id | 127721 |


In [460]:
# Count the average number of reviews for each business
int(reviews_df.groupby("business_id")["text"].count().mean())

210

In [461]:
# Check the missing values
reviews_df.isna().sum()

| | 0 |
|---|---|
| business_id | 0 |
| user_id | 0 |
| time | 0 |
| text | 2 |


In [462]:
# Drop the rows with missing review text
reviews_df.dropna(subset=["text"], inplace=True)

# Word2Vec Model

In this section, we train our own word embeddings using Word2Vec on our tourist-attraction review dataset.

Starting from the original review dataset, we apply preprocessing steps such as cleaning, splitting into individual sentences, tokenization, and lemmatization.

In the end, we obtain around 550k sentences to train our language model.
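
Part of the reduction to sentence level comes from dropping very short fragments. The sketch below isolates that filter, using the same thresholds as `ProcessDataset.transform` (at least 10 characters and at least 2 tokens). The helper name `keep_sentence` and the third example are illustrative assumptions and are not part of the notebook.

```python
MIN_CHARS = 10   # same threshold as the `processed_text` length filter
MIN_TOKENS = 2   # same threshold as the `tokens` length filter


def keep_sentence(sentence, tokens):
    """Return True if a sentence is long enough to be kept for training."""
    return len(sentence) >= MIN_CHARS and len(tokens) >= MIN_TOKENS


# (sentence, lemmatized tokens) pairs; the last pair is made up for illustration
examples = [
    ("stunning.", ["stunning"]),
    ("definitely layer up appropriately.", ["definitely", "layer", "appropriately"]),
    ("it was ok and so on", ["ok"]),
]

kept = [sentence for sentence, tokens in examples if keep_sentence(sentence, tokens)]
print(kept)
```

Only the second sentence survives: the first is shorter than 10 characters, and the third has a single token left after stop-word removal.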

## Dataset Preparation

In [14]:
from langdetect import detect, DetectorFactory
from transformers import pipeline

import en_core_web_sm
import re

In [15]:
spacy_nlp = en_core_web_sm.load()

In [467]:
class ProcessDataset():
    """
      A class for preprocessing Reviews data for training downstream models.
      Preprocessing includes:
        - clean, split, and expand sentences,
        - tokenize, lemmatize, and remove stop words from sentences
    """

    def __init__(self, spacy_nlp):

        self.nlp = spacy_nlp

    def _clean_text(self, text):
        """ Clean text from unnecessary tokens/substrings """

        # Remove emoji patterns
        emoji_pattern = re.compile(
            "["
            "\U0001F600-\U0001F64F"  # Emoticons
            "\U0001F300-\U0001F5FF"  # Symbols & pictographs
            "\U0001F680-\U0001F6FF"  # Transport & map symbols
            "\U0001F1E0-\U0001F1FF"  # Flags (iOS)
            "\U00002700-\U000027BF"  # Dingbats
            "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
            "\U00002600-\U000026FF"  # Misc symbols
            "\U00002B50-\U00002B59"  # Stars
            "]+", flags=re.UNICODE
        )
        text = emoji_pattern.sub(r"", text)

        # Keep only the text between '(Translated by Google)' and '(Original)'
        match = re.search(r"\(Translated by Google\)(.+?)\(Original\)", text,
                          flags=re.DOTALL)
        if match:
            text = match.group(1).strip()

        return text

    def _split_and_tokenize(self, text):
        """
          Splits text into sentences using the spaCy model.
          Also tokenize and lemmatize.
        """

        sents = [sent for sent in self.nlp(text.lower()).sents if sent.text]

        full_sents = [sent.text for sent in sents]

        tokenized = [[ token.lemma_ for token in sent
                    if token.is_alpha
                     and (not token.is_stop)
                     and (not token.is_punct) ] for sent in sents]
        tokenized = [" ".join(sent) for sent in tokenized]
        # We do the above operation so that it can be exploded later

        return full_sents, tokenized

    def transform(self, dataset):
        """ The main text processing function. """

        data = dataset.copy()

        # Clean, split, and expand sentences
        data["text"] = data["text"].apply(self._clean_text)
        data.loc[:, ["processed_text", "tokens"]] = data["text"].apply(
            self._split_and_tokenize).apply(
                lambda x: pd.Series(x, index=["processed_text", "tokens"]))

        data = data.explode(["processed_text", "tokens"]).reset_index(drop=True)
        data = data[data["processed_text"].str.len() >= 10]

        data["tokens"] = data["tokens"].apply(lambda x: x.split())
        data = data[data["tokens"].apply(lambda x: len(x) >= 2)]

        return data

In [468]:
data_processor = ProcessDataset(spacy_nlp)

In [470]:
processed_dataset = data_processor.transform(reviews_df.sample(100))

In [471]:
processed_dataset.head()

| | business_id | user_id | time | text | processed_text | tokens |
|---|---|---|---|---|---|---|
| 0 | 0x7c006f4309506633:0xb8c592a70d9c602c | 103337304568968502574 | 2016-11-24 15:21:38.914 | Went twice with my family growing up | went twice with my family growing up | [go, twice, family, grow] |
| 1 | 0x7c00128bab113437:0x63ffd0979b5620cb | 111676273773967615495 | 2018-06-14 08:06:02.687 | Very good place to launch a jet ski | very good place to launch a jet ski | [good, place, launch, jet, ski] |
| 3 | 0x7954b73425a04bd1:0x9c23fd88e8f5f4ca | 108903954090738421114 | 2019-09-15 22:04:24.139 | Stunning. Definitely layer up appropriately. W... | definitely layer up appropriately. | [definitely, layer, appropriately] |
| 4 | 0x7954b73425a04bd1:0x9c23fd88e8f5f4ca | 108903954090738421114 | 2019-09-15 22:04:24.139 | Stunning. Definitely layer up appropriately. W... | we went from hiking shorts and t-shirts to swe... | [go, hiking, short, t, shirt, sweater, jacket,... |
| 5 | 0x7954b73425a04bd1:0x9c23fd88e8f5f4ca | 108903954090738421114 | 2019-09-15 22:04:24.139 | Stunning. Definitely layer up appropriately. W... | bring lots of water, the air is pretty dry at ... | [bring, lot, water, air, pretty, dry, ft] |


In [473]:
processed_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 213 entries, 0 to 233
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   business_id     213 non-null    object        
 1   user_id         213 non-null    object        
 2   time            213 non-null    datetime64[ns]
 3   text            213 non-null    object        
 4   processed_text  213 non-null    object        
 5   tokens          213 non-null    object        
dtypes: datetime64[ns](1), object(5)
memory usage: 11.6+ KB


## Training

In [20]:
from gensim.models import Word2Vec

In [474]:
model_w2v = Word2Vec(processed_dataset["tokens"], vector_size=300, window=1, min_count=1, workers=4)

In [None]:
# Save model to GCS for future access
timestamp = datetime.now().strftime("%d%m%H")

model_w2v.save(f"word2vec_amazon_reviews-{timestamp}")

In [None]:
!gsutil cp "word2vec_amazon_reviews"* "gs://customer_review_hawaii/models/"
!rm "word2vec_amazon_reviews"*

# Autoencoder for Dimensionality Reduction

In this section, we *merge* the Word2Vec embedding with the one generated by `SentenceTransformers`.
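
As a rough illustration of what merging means here, the sketch below mean-pools the word vectors of a sentence and concatenates the result with a sentence embedding, mirroring the `np.hstack` call used in `AutoEncoderDataset`. The random arrays are stand-ins for real model outputs, and the function name `combine_embeddings` is an assumption made for this example.

```python
import numpy as np

W2V_DIM = 300  # matches `vector_size` in the Word2Vec training cell
LLM_DIM = 384  # embedding size reported for all-MiniLM-L6-v2 in this notebook


def combine_embeddings(word_vectors, sentence_embedding):
    """Mean-pool the word vectors, then append the sentence embedding."""
    if len(word_vectors) > 0:
        pooled = np.mean(word_vectors, axis=0)
    else:
        # No in-vocabulary words: fall back to a zero vector of the same size
        pooled = np.zeros(W2V_DIM)
    return np.hstack([pooled, sentence_embedding])


rng = np.random.default_rng(0)
fake_word_vectors = rng.normal(size=(5, W2V_DIM))   # 5 words in the sentence
fake_sentence_embedding = rng.normal(size=LLM_DIM)

combined = combine_embeddings(fake_word_vectors, fake_sentence_embedding)
print(combined.shape)
```

With these sizes the combined vector has 300 + 384 = 684 dimensions, which is the input the autoencoder would need to compress when `source="mix"`.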

Next, we train an autoencoder to reduce the dimensionality of the combined embeddings, since in theory it can handle complex, non-linear relationships that linear methods such as PCA may not fully capture.

Before training the autoencoder, we apply PCA to determine the optimal dimensionality required to retain sufficient information from the embeddings, providing a basis for comparison between the autoencoder and PCA.
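
One possible way to make that comparison concrete is to compute the reconstruction error of PCA at a given number of components and set it against the autoencoder's MSE loss at the same latent size. The sketch below does this with a plain SVD on synthetic data. The function name `pca_reconstruction_mse` and the random matrix are assumptions for illustration only, not results from this project.

```python
import numpy as np


def pca_reconstruction_mse(X, n_components):
    """MSE between X and its reconstruction from the top principal components."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Project onto the components, then map back to the original space
    X_reconstructed = X_centered @ components.T @ components + mean
    return float(np.mean((X - X_reconstructed) ** 2))


rng = np.random.default_rng(2024)
X_demo = rng.normal(size=(200, 20))  # synthetic stand-in for the embeddings

errors = {k: pca_reconstruction_mse(X_demo, k) for k in (5, 10, 20)}
print(errors)
```

The error shrinks as components are added and is numerically zero once all 20 dimensions are kept. An autoencoder is only worth its extra cost if it reaches a lower error than PCA at the same latent size.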

## Dataset preparation

In [145]:
from sentence_transformers import SentenceTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from skorch import NeuralNet
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.optim as optim

In [477]:
model_llm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")



In [489]:
class AutoEncoderDataset(Dataset):
    def __init__(self, dataset, model_wv, model_llm, source="mix"):
        self.model_llm = model_llm
        self.model_wv = model_wv
        self.vocab_wv = model_wv.wv.key_to_index  # dict, so membership tests are O(1)

        self.data = dataset.copy()
        self.data["w2v"] = self.data["processed_text"].apply(self._get_word_vector)
        self.data["llm"] = self.data["processed_text"].apply(self.model_llm.encode)

        assert source in ["w2v", "llm", "mix"]
        self.source = source

    def __len__(self):
        return len(self.data)

    def _get_word_vector(self, sentence):
        words = [w for w in sentence.strip().split()]

        wv = [self.model_wv.wv[w] for w in words if w in self.vocab_wv]
        if wv:
            wv_mean = np.array(wv).mean(axis=0)
        else:
            # Fall back to zeros with the same size as the trained word vectors
            wv_mean = np.zeros(self.model_wv.wv.vector_size)

        return wv_mean

    def get_feature_dim(self):
        return len(self.__getitem__(0))

    def get_dataset(self):
        return self.data

    def get_full_wordvec(self):
        if self.source == "w2v":
            return self.data["w2v"]
        elif self.source == "llm":
            return self.data["llm"]
        else:
            return self.data[["w2v", "llm"]].apply(lambda x: np.hstack(x.values),
                                                   axis=1)

    def __getitem__(self, idx):
        data = self.data.iloc[idx, :]

        wv_mean = data["w2v"]
        embed = data["llm"]

        # Return the dataset according to wordvecs' source
        if self.source == "w2v":
            return torch.tensor(wv_mean, dtype=torch.float32)
        elif self.source == "llm":
            return torch.tensor(embed, dtype=torch.float32)
        else:
            return torch.tensor(np.hstack([wv_mean, embed]),
                                dtype=torch.float32)


In [500]:
sample_100k = processed_dataset.sample(100, random_state=2024)
dataset_ae = AutoEncoderDataset(sample_100k,
                                model_w2v,
                                model_llm,
                                source="llm")

In [501]:
full_wordvec = dataset_ae.get_full_wordvec()

In [502]:
full_wordvec.iloc[0].shape

(384,)

## Determine the size of the latent variables

To determine the size of the latent variables, we estimate it using PCA.

However, due to the size of our dataset, we use only a fraction of it for this analysis. We tested PCA on several fractions of the dataset and found that a 100-dimensional vector is sufficient to capture 98% of the total variance.

In [513]:
cumulative_vars = {}
for f in [.1, .3]:
    sample_len = int(f*len(full_wordvec))
    X_subset = full_wordvec.sample(sample_len, random_state=2024)
    X_subset = np.vstack(X_subset.values)

    pca = PCA()
    pca.fit(X_subset)
    cumulative_vars[f] = np.cumsum(pca.explained_variance_ratio_)

In [514]:
cumulative_vars

{0.1: array([0.1735438 , 0.3252003 , 0.46562293, 0.57443386, 0.67813617,
        0.7735233 , 0.86048883, 0.9339694 , 1.        , 1.        ],
       dtype=float32),
 0.3: array([0.10898483, 0.18409833, 0.25152448, 0.31464347, 0.37339377,
        0.42180645, 0.46933395, 0.5126991 , 0.55329597, 0.5900661 ,
        0.6254321 , 0.65939033, 0.69083256, 0.71995485, 0.7469614 ,
        0.7732771 , 0.7988464 , 0.822934  , 0.8456104 , 0.8672795 ,
        0.88832945, 0.9079746 , 0.92594796, 0.9417274 , 0.9556282 ,
        0.9687686 , 0.98158205, 0.9916443 , 0.99999994, 0.99999994],
       dtype=float32)}

Next, we plot the cumulative explained variance for each dataset fraction.

In [None]:
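# Plot the cumulative variance for the different dataset fractions
colors = ['red', 'green', 'blue']

# Wrap each array in a Series so that arrays of different lengths align (NaN-padded)
cumulative_vars_df = pd.DataFrame({f: pd.Series(v)
                                   for f, v in cumulative_vars.items()})
cumulative_vars_df.plot(color=colors)
plt.xlim(0, 100)

# Mark the smallest number of components that reaches the variance threshold
threshold = 0.98
least_n = cumulative_vars_df.apply(lambda x: np.argmax(x >= threshold) + 1)
for idx, n in enumerate(least_n):
    plt.axvline(x=n, linestyle='--', color=colors[idx], label=n)
plt.legend()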

The plot above suggests that retaining at least 95% of the variance of the original data requires around 80 dimensions.

With this information, we train the autoencoder accordingly.
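
The threshold lookup used for the dashed lines can be checked by hand. The sketch below applies it to the cumulative-variance array printed above for the `0.1` fraction. Those values come from the small 100-row demo run shown in this notebook, so the resulting count reflects that tiny sample rather than the full dataset. The helper name `components_for_threshold` is an assumption for illustration.

```python
def components_for_threshold(cumulative_variance, threshold):
    """Smallest number of leading components whose cumulative variance reaches the threshold."""
    for n, value in enumerate(cumulative_variance, start=1):
        if value >= threshold:
            return n
    return None  # threshold never reached


# Values copied from the `cumulative_vars[0.1]` output above
cumulative_10pct = [0.1735438, 0.3252003, 0.46562293, 0.57443386, 0.67813617,
                    0.7735233, 0.86048883, 0.9339694, 1.0, 1.0]

n_98 = components_for_threshold(cumulative_10pct, 0.98)
print(n_98)  # 9
```

This is the same quantity as `np.argmax(x >= threshold) + 1` in the plotting cell, written as an explicit loop.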

## Autoencoder training

In [516]:
class Autoencoder(BaseEstimator, TransformerMixin, nn.Module):
    def __init__(self,
                 input_dim,
                 hidden_dim,
                 latent_dim,
                 lr=1e-4,
                 epochs=1,
                 batch_size=256):

        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.latent_dim = latent_dim
        self.lr = lr
        self.epochs = epochs
        self.batch_size = batch_size

        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, latent_dim)
        )

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()  # Output range [0, 1]; inputs with negative values need rescaling to match
        )

    def forward(self, X):
        encoded = self.encoder(X)
        decoded = self.decoder(encoded)
        return decoded

    def encode(self, X):
        return self.encoder(X)

    def fit(self, X):
        dataloader = DataLoader(X, batch_size=self.batch_size, shuffle=True)

        # Define optimizer and loss function
        criterion = nn.MSELoss()
        optimizer = optim.Adam(self.parameters(), lr=self.lr)

        # Training loop
        self.train()
        for epoch in range(self.epochs):
            losses = 0
            with tqdm(total=len(dataloader),
                      desc=f"Epoch {epoch + 1}", unit="batch") as pbar:

                for X_batch in dataloader:
                    # Forward pass
                    reconstructed_X = self.forward(X_batch)
                    loss = criterion(reconstructed_X, X_batch)

                    # Backward pass and optimize
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

                    losses += loss.item()

                    pbar.set_postfix({'Loss': loss.item()})
                    pbar.update(1)

                print(f"Epoch {epoch + 1}/{self.epochs}, Loss: {(losses/len(dataloader)):.3f}")

        return self

    def transform(self, X):
        # Switch to evaluation mode and encode
        self.eval()

        # Do not shuffle, so that the encoded rows stay aligned with the input rows
        dataloader = DataLoader(X, batch_size=len(X), shuffle=False)
        for X_batch in dataloader:
            with torch.no_grad():
                encoded = self.encode(X_batch).cpu().numpy()
        return encoded

    def predict(self, X):
        # This is used for reconstruction
        self.eval()

        # Do not shuffle, so that the reconstructions stay aligned with the input rows
        dataloader = DataLoader(X, batch_size=len(X), shuffle=False)
        for X_batch in dataloader:
            with torch.no_grad():
                reconstructed = self.forward(X_batch).cpu().numpy()
        return reconstructed

In [518]:
input_dim = dataset_ae.get_feature_dim()
hidden_dim = 1024
latent_dim = 100

In [519]:
model_ae = Autoencoder(input_dim=input_dim,
                  hidden_dim=hidden_dim,
                  latent_dim=latent_dim,
                  lr=1e-4,
                  epochs=1,
                  batch_size=256)

In [520]:
model_ae.fit(dataset_ae)

Epoch 1: 100%|██████████| 1/1 [00:00<00:00, 24.30batch/s, Loss=0.252]

Epoch 1/1, Loss: 0.252





In [360]:
# Save model to GCS for future access
timestamp = datetime.now().strftime("%d%m%H")

torch.save(model_ae.state_dict(), f"autoencoder-{timestamp}")

In [362]:
!gsutil cp "autoencoder"* "gs://customer_review_hawaii/models/"
!rm "autoencoder"*

autoencoder-041111  word2vec_amazon_reviews-041109


# Clustering

Task TODO:
- (opt) ask for cluster fixing from OpenAI

In [386]:
from sklearn.pipeline import Pipeline
from sklearn.cluster import HDBSCAN

In [364]:
business_id = np.random.choice(processed_dataset["business_id"])
business_reviews = processed_dataset[processed_dataset["business_id"] == business_id]

In [365]:
# We only cluster the reviews from the last 3 months
min_review_date = business_reviews["time"].max() - pd.DateOffset(months=3)
business_reviews = business_reviews[business_reviews["time"] >= min_review_date]

In [366]:
business_reviews.shape

(188, 6)

In [1]:
# class WordVectorEncoder(BaseEstimator, TransformerMixin):
#     def __init__(self, autoencoder):
#         self.autoencoder = autoencoder

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         dl = DataLoader(X, batch_size=len(X), shuffle=False)
#         encoded_list = []

#         for batch in dl:
#             encoded_batch = self.autoencoder.encode(batch).detach().cpu().numpy()
#             encoded_list.append(encoded_batch)

#         encoded = np.vstack(encoded_list)
#         return encoded

In [2]:
# class PCAEncoder(BaseEstimator, TransformerMixin):
#     def __init__(self):
#         self.autoencoder = PCA(n_components=85)

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         dl = DataLoader(X, batch_size=len(X), shuffle=False)
#         for batch in dl:
#             print(batch.shape)
#             encoded_batch = self.autoencoder.fit_transform(batch)
#             print(encoded_batch.shape)

#         return encoded_batch

In [3]:
# class HDBSCANClustering(BaseEstimator, TransformerMixin):
#     def __init__(self):
#         self.model = HDBSCAN(min_cluster_size=5,
#                              min_samples=3,
#                              cluster_selection_epsilon=0.5)
#         self.labels_ = None

#     def fit(self, X, y=None):
#         dl = DataLoader(X, batch_size=len(X), shuffle=False)
#         for x in dl:
#             self.labels_ = self.model.fit_predict(x)
#         return self

#     def transform(self, X):
#         # Return the labels as a column
#         return self.labels_.reshape(-1, 1)

In [4]:
# class PCAVisualization(BaseEstimator, TransformerMixin):
#     def __init__(self):
#         self.model = PCA(n_components=2)

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         # Return the labels as a column
#         dl = DataLoader(X, batch_size=len(X), shuffle=False)
#         for x in dl:
#             reduced = self.model.fit_predict(x)
#         return reduced

In [5]:
# pipeline = Pipeline([
#     ('autoencoder', WordVectorEncoder(net)),
#     ('clustering', HDBSCANClustering())
# ])

In [6]:
# pipeline_pca = Pipeline([
#     ('autoencoder', PCAEncoder()),
#     ('clustering', HDBSCANClustering())
# ])

In [7]:
# # Convert the DataFrame to Pytorch dataset
# restaurant_dataset = AutoEncoderDataset(restaurant_reviews.copy().reset_index(),
#                                         model_wv,
#                                         model_transformer)

In [8]:
# pipeline.fit(restaurant_dataset)
# restaurant_reviews["labels"] = pipeline['clustering'].model.labels_

In [9]:
# pipeline_pca.fit(restaurant_dataset)
# restaurant_reviews["labels_pca"] = pipeline_pca['clustering'].model.labels_

In [10]:
# wve = WordVectorEncoder(net)
# encoded = wve.fit_transform(restaurant_dataset)
# print(encoded.shape)

# pca = PCA(n_components=2)
# reduced = pca.fit_transform(encoded)
# print(reduced.shape)

In [11]:
# restaurant_reviews["labels"].value_counts()

In [12]:
# restaurant_reviews["labels_pca"].value_counts()

In [13]:
# plt.scatter(x=reduced[:, 0], y=reduced[:, 1], c=restaurant_reviews["labels_pca"])

In [14]:
# plt.scatter(x=reduced[:, 0], y=reduced[:, 1], c=restaurant_reviews["labels"])

In [15]:
# from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

In [16]:
# silhouette_score(reduced, restaurant_reviews["labels"]), silhouette_score(reduced, restaurant_reviews["labels_pca"])

## Cluster labeling

In [17]:
# import openai

# def get_cluster_name(text_samples):
#     prompt = f"""
#         You are an expert in giving a descriptive topic to a given list of sentences.
#         The sentences may have different topics, so choose one that is the most commonly shared.
#         Please return the topic as concisely as possible, in at most 3 words.
#         Please also avoid a vague topic.

#         There are 5 sentences as the input.
#         The content of the sentences is limited to customer reviews for a tourist attraction.
#         So, please only choose the topic according to possible aspects in this business.

#         Here is an example with just 3 sentences:

#         INPUT
#         'Gorgeous place to visit, it can get crowded on holidays.'
#         'Great hike and beautiful views.'
#         'Awesome view.'
#         ENDINPUT

#         LABEL 'Scenic view'

#         Here are the sentences:
#         INPUT
#         {'\n'.join(text_samples)}
#         ENDINPUT

#         LABEL ...
#     """
#     response = openai.ChatCompletion.create(
#         model="gpt-3.5-turbo",
#         messages=[{"role": "user", "content": prompt}]
#     )
#     return response.choices[0].message['content'].strip()
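
The prompt assembly can be tested without calling any API. The sketch below builds a shortened version of the prompt, joining the samples before they enter the f-string, because a backslash inside an f-string expression is a syntax error before Python 3.12. The function name `build_cluster_prompt` and the abbreviated wording are assumptions for illustration, not the final prompt.

```python
def build_cluster_prompt(text_samples):
    """Build the labeling prompt for one cluster from a list of sample sentences."""
    # Join outside the f-string so the code also runs on Python < 3.12
    quoted_samples = "\n".join(f"'{sample}'" for sample in text_samples)
    return (
        "Give one concise topic (at most 3 words) shared by most of the "
        "customer reviews below.\n"
        "INPUT\n"
        f"{quoted_samples}\n"
        "ENDINPUT\n"
        "LABEL"
    )


samples = ["Great hike and beautiful views.", "Awesome view."]
prompt = build_cluster_prompt(samples)
print(prompt)
```

Keeping the prompt builder separate from the API call makes it easy to inspect what is sent for each cluster before spending any tokens.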

TODO:
- check again the youtube code
- review clustering methods with one core example on each method
- review again PCA vs autoencoder
- make sure the statistics are the same (wv and transformer)