# Introduction

Customer reviews can be a valuable source of insight for a business owner. Imagine being able to continuously monitor areas for improvement that would raise customer satisfaction, and to highlight the best parts of the business for effective branding.

This project aims to segment user reviews into several topics for easier analysis.

The key components of our project include:
- **Review clustering**: to segment customer reviews into distinct clusters by representing the reviews as embeddings (a pre-trained LLM embedding combined with a self-trained Word2Vec model),
- **Sentiment analysis**: to classify the sentiment of a review as positive or negative,
- **Topic labeling**: to label review topics within each cluster using a large language model (LLM).


## Dataset Introduction

The original dataset contains millions of business reviews from across the United States up to 2021. To keep the project simple and within our limited resources, we focus on one state and one business type: restaurants in Hawaii.

In the end, we settle on approximately 260k reviews covering about 1200 restaurants.

TODO:
- autoencoder for dimension reduction.
- use only the last six months of reviews during cluster prediction.

In [None]:
!pip install google-cloud-storage
!python -m spacy download en_core_web_sm
!pip install langdetect
!pip install -q transformers
!pip install sentence-transformers

In [155]:
from google.cloud import storage
from datetime import datetime

import pandas as pd
import numpy as np
import io

In [146]:
def download_csv_from_gcs(bucket, file_name, date_columns=None, col_names=None):
    """Download a CSV blob from a GCS bucket and load it into a DataFrame."""
    blob = bucket.blob(file_name)
    data = blob.download_as_text()
    df = pd.read_csv(io.StringIO(data),
                     parse_dates=date_columns,
                     usecols=col_names)
    return df

In [147]:
PROJECT_ID="machine-learning-toy-project"
LOCATION="us-west1"
SERVICE_ACCOUNT="ml-project-user@machine-learning-toy-project.iam.gserviceaccount.com"

In [148]:
BUCKET_NAME="customer_review_hawaii"

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

In [159]:
REVIEW_CSV="customer_review.csv"

reviews_df = download_csv_from_gcs(bucket, REVIEW_CSV)

In [160]:
reviews_df.head()

Unnamed: 0,business_id,user_id,time,text
0,0x7954d184b450b1e7:0x4bee7e570ae07db8,109709907397075607894,1521793918433,Went their for a field trip. It was awesome! s...
1,0x7954d184b450b1e7:0x4bee7e570ae07db8,108968256029885805791,1574633258124,"Nice interpretation center, hard to find the w..."
2,0x7954d184b450b1e7:0x4bee7e570ae07db8,113167915373388818291,1583292550820,Great water birds! Clean place and easy access!
3,0x7954d184b450b1e7:0x4bee7e570ae07db8,117153367922518677632,1528995771126,Be sure to stop by the visitors center first. ...
4,0x7954d184b450b1e7:0x4bee7e570ae07db8,112253051829957730666,1549652370655,Great outdoor excursion. Ponds next to the oce...


In [161]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259581 entries, 0 to 259580
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   business_id  259581 non-null  object
 1   user_id      259581 non-null  object
 2   time         259581 non-null  int64 
 3   text         259579 non-null  object
dtypes: int64(1), object(3)
memory usage: 7.9+ MB
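
The `time` column is stored as `int64`. The sample values look like milliseconds since the Unix epoch, which is an assumption rather than something the dataset documents here. Under that assumption, a minimal sketch of converting the column and keeping only the most recent six months (as planned in the TODO list) could look like this. The toy frame and the names `toy_df`, `review_date`, and `recent_df` are hypothetical.

```python
import pandas as pd

# Toy stand-in for reviews_df; both timestamps are copied from the head() output above.
toy_df = pd.DataFrame({
    "time": [1521793918433, 1583292550820],
    "text": ["older review", "newer review"],
})

# Assumption: `time` holds milliseconds since the Unix epoch.
toy_df["review_date"] = pd.to_datetime(toy_df["time"], unit="ms")

# Keep only reviews from the six months before the most recent review.
cutoff = toy_df["review_date"].max() - pd.DateOffset(months=6)
recent_df = toy_df[toy_df["review_date"] >= cutoff]

print(toy_df["review_date"].dt.year.tolist())
print(len(recent_df))
```

Measuring the window from the latest review, rather than from today's date, keeps the filter meaningful for a dataset that ends in 2021.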


In [162]:
reviews_df.shape

(259581, 4)

In [163]:
reviews_df[["business_id", "user_id"]].nunique()

Unnamed: 0,0
business_id,1235
user_id,127721


In [164]:
reviews_df.groupby("business_id")["user_id"].count().mean()

210.18704453441296

In [177]:
reviews_df.isna().sum()

Unnamed: 0,0
business_id,0
user_id,0
time,0
text,2


In [178]:
reviews_df.dropna(subset="text", inplace=True)

# Word2Vec Training

In this section, we will train our own word embedding using Word2Vec with our restaurant dataset.

Starting from the original review dataset, we apply preprocessing steps: cleaning, splitting into individual sentences, tokenization, and lemmatization.

In the end, we obtain around 550k sentences to train our language model.
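
To make the cleaning step concrete, here is a small standard-library sketch of what it does to a single review. The function name `clean_review`, the shortened emoji pattern, and the assumed layout of translated reviews (`(Translated by Google) ... (Original) ...`) are illustrative assumptions, not the exact code used in the notebook.

```python
import re

# Shortened emoji pattern covering only two Unicode ranges, to keep the sketch small.
EMOJI_PATTERN = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF]+")


def clean_review(text):
    """Strip emoji, then keep only the English part of a Google-translated review."""
    text = EMOJI_PATTERN.sub("", text)
    # Assumed layout: "(Translated by Google) <english> (Original) <source text>"
    match = re.search(
        r"\(Translated by Google\)(.+?)(?:\(Original\)|$)", text, flags=re.DOTALL
    )
    return match.group(1).strip() if match else text.strip()


print(clean_review("Great poke bowl! 😀"))
print(clean_review("(Translated by Google) Very tasty.\n\n(Original)\nとても美味しい。"))
```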

## Dataset Preparation

In [165]:
from langdetect import detect, DetectorFactory
from transformers import pipeline

import en_core_web_sm
import re

In [166]:
spacy_nlp = en_core_web_sm.load()

In [186]:
class Word2VecDataset():
    """
      A class for preprocessing Reviews data for training Word2Vec Model.
      Preprocessing includes:
        - clean, split, and expand sentences,
        - tokenize, lemmatize, and remove stop words from sentences
    """

    def __init__(self, dataset, spacy_nlp):

        self.nlp = spacy_nlp
        self.data = self.fit(dataset)

    def _clean_text(self, text):
        """ Clean text from unnecessary tokens/substrings """

        # Remove emoji patterns
        emoji_pattern = re.compile(
            "["
            "\U0001F600-\U0001F64F"  # Emoticons
            "\U0001F300-\U0001F5FF"  # Symbols & pictographs
            "\U0001F680-\U0001F6FF"  # Transport & map symbols
            "\U0001F1E0-\U0001F1FF"  # Flags (iOS)
            "\U00002700-\U000027BF"  # Dingbats
            "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
            "\U00002600-\U000026FF"  # Misc symbols
            "\U00002B50-\U00002B59"  # Stars
            "]+", flags=re.UNICODE
        )
        text = emoji_pattern.sub(r"", text)

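        # Keep only the English text between '(Translated by Google)' and
        # '(Original)'; if '(Original)' is missing, keep everything after the marker.
        match = re.search(r"\(Translated by Google\)(.+?)(?:\(Original\)|$)",
                          text, flags=re.DOTALL)
        if match:
            text = match.group(1).strip()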

        return text

    def _split_and_tokenize(self, text):
        """
          Splits text into sentences using the spaCy model.
          Also tokenizes and lemmatizes each sentence, dropping stop words
          and non-alphabetic tokens.
        """

        sents = [sent for sent in self.nlp(text.lower()).sents if sent.text]

        full_sents = [sent.text for sent in sents]

        tokenized = [[ token.lemma_ for token in sent
                    if token.is_alpha
                     and (not token.is_stop)
                     and (not token.is_punct) ] for sent in sents]
        # Join tokens into one string per sentence so the column can be exploded later
        tokenized = [" ".join(sent) for sent in tokenized]

        return full_sents, tokenized

    def fit(self, dataset):
        """ The main text processing function. """

        data = dataset.copy()

        # Clean, split, and expand sentences
        data["text"] = data["text"].apply(self._clean_text)
        data.loc[:, ["processed_text", "tokens"]] = data["text"].apply(
            self._split_and_tokenize).apply(
                lambda x: pd.Series(x, index=["processed_text", "tokens"]))

        data = data.explode(["processed_text", "tokens"]).reset_index(drop=True)
        data = data[data["processed_text"].str.len() >= 10]

        data["tokens"] = data["tokens"].apply(lambda x: x.split())
        data = data[data["tokens"].apply(lambda x: len(x) >= 2)]

        return data
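
The explode-and-filter logic in `fit` can be seen in isolation on a toy frame. This is only a sketch: the two sample reviews and their pre-split sentences are made up, and the spaCy step is replaced by hard-coded lists so the example needs nothing beyond pandas.

```python
import pandas as pd

# Each row holds one review plus parallel lists: raw sentences and joined lemma strings.
toy = pd.DataFrame({
    "text": ["Great food. Slow service though.", "Loved it!"],
    "processed_text": [["great food.", "slow service though."], ["loved it!"]],
    "tokens": [["great food", "slow service"], ["love"]],
})

# One row per sentence; both list columns are expanded in lockstep.
exploded = toy.explode(["processed_text", "tokens"]).reset_index(drop=True)

# Same filters as in `fit`: at least 10 characters and at least 2 tokens per sentence.
exploded = exploded[exploded["processed_text"].str.len() >= 10].copy()
exploded["tokens"] = exploded["tokens"].apply(lambda x: x.split())
exploded = exploded[exploded["tokens"].apply(len) >= 2]

print(exploded[["processed_text", "tokens"]])
```

Short fragments such as "loved it!" are dropped by the length filter, which explains why the index of `dataset_wv.data` has gaps.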

In [188]:
dataset_wv = Word2VecDataset(reviews_df, spacy_nlp)

print(f"Finished at {datetime.now().strftime('%H:%M %S')}")

Finished at 15:15 24


In [189]:
dataset_wv.data.head()

Unnamed: 0,business_id,user_id,time,text,processed_text,tokens
0,0x7954d184b450b1e7:0x4bee7e570ae07db8,109709907397075607894,1521793918433,Went their for a field trip. It was awesome! s...,went their for a field trip.,"[go, field, trip]"
2,0x7954d184b450b1e7:0x4bee7e570ae07db8,109709907397075607894,1521793918433,Went their for a field trip. It was awesome! s...,sometimes the road to the pond is flooded so y...,"[road, pond, flood, balance, concrete, block, ..."
3,0x7954d184b450b1e7:0x4bee7e570ae07db8,109709907397075607894,1521793918433,Went their for a field trip. It was awesome! s...,"plus, the birds there are cute!","[plus, bird, cute]"
4,0x7954d184b450b1e7:0x4bee7e570ae07db8,109709907397075607894,1521793918433,Went their for a field trip. It was awesome! s...,i recommend this place for tourists and i do n...,"[recommend, place, tourist, recommend, litter,..."
5,0x7954d184b450b1e7:0x4bee7e570ae07db8,108968256029885805791,1574633258124,"Nice interpretation center, hard to find the w...","nice interpretation center, hard to find the w...","[nice, interpretation, center, hard, find, wil..."


In [190]:
dataset_wv.data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 553246 entries, 0 to 618377
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   business_id     553246 non-null  object
 1   user_id         553246 non-null  object
 2   time            553246 non-null  int64 
 3   text            553246 non-null  object
 4   processed_text  553246 non-null  object
 5   tokens          553246 non-null  object
dtypes: int64(1), object(5)
memory usage: 29.5+ MB


## Training

In [191]:
from gensim.models import Word2Vec

In [192]:
model_wv = Word2Vec(dataset_wv.data["tokens"], vector_size=100, window=5, min_count=1, workers=4)

In [193]:
# Save model to GCS for future access
timestamp = datetime.now().strftime("%H%M%S")

model_wv.save(f"word2vec_amazon_reviews-{timestamp}")

In [194]:
!gsutil cp "word2vec_amazon_reviews"* "gs://customer_review_hawaii/models/"
!rm "word2vec_amazon_reviews"*

Copying file://word2vec_amazon_reviews-151545 [Content-Type=application/octet-stream]...
-
Operation completed over 1 objects/28.9 MiB.                                     


# Autoencoder for Dimensionality Reduction

In this section, we will merge the Word2Vec word embeddings with sentence embeddings from a pre-trained `SentenceTransformers` model.
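
The notebook does not yet specify how the two embeddings are merged. One plausible approach, shown here purely as an assumption, is to average the Word2Vec vectors of a sentence's tokens and concatenate the result with the sentence embedding. Random arrays stand in for the real models; `mean_word_vector` is a hypothetical helper, and 384 is the output size of `all-MiniLM-L6-v2`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real models: a word -> vector lookup (playing the role of
# `model_wv.wv`) and a 384-dimensional sentence embedding.
word_vectors = {w: rng.normal(size=100) for w in ["go", "field", "trip"]}
sentence_embedding = rng.normal(size=384)


def mean_word_vector(tokens, vectors, dim=100):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


w2v_part = mean_word_vector(["go", "field", "trip"], word_vectors)
merged = np.concatenate([w2v_part, sentence_embedding])
print(merged.shape)  # (484,)
```

Because the two parts can have different scales, it may be worth normalizing each before concatenation so that neither dominates the reconstruction loss later on.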

After that, we will train an autoencoder model to reduce the dimensionality of our input embedding.

We choose an autoencoder because it can capture complex, non-linear structure that linear methods such as PCA may miss.
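
As a rough illustration of the idea, the sketch below trains a tiny one-hidden-layer autoencoder with plain NumPy on synthetic data. It is not the model used in this project: the architecture, the bottleneck size `k`, the learning rate, and the toy data are all assumptions chosen to keep the example small and dependency-free. A real implementation would more likely use a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with low-dimensional structure: 200 samples, 20 features driven by 4 factors.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)

d, k = X.shape[1], 4  # k is the bottleneck (reduced) dimension
W1 = rng.normal(scale=0.1, size=(d, k))
b1 = np.zeros(k)
W2 = rng.normal(scale=0.1, size=(k, d))
b2 = np.zeros(d)


def encode(X, W1, b1):
    """Non-linear encoder: project to k dimensions and apply tanh."""
    return np.tanh(X @ W1 + b1)


losses = []
lr = 0.05
for epoch in range(500):
    H = encode(X, W1, b1)
    X_hat = H @ W2 + b2          # linear decoder
    err = X_hat - X
    losses.append(np.mean(err ** 2))

    # Backpropagate the mean squared reconstruction error.
    d_out = 2 * err / err.size
    dW2 = H.T @ d_out
    db2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * (1 - H ** 2)
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0)

    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

codes = encode(X, W1, b1)
print(codes.shape, losses[0], losses[-1])
```

The rows of `codes` are the reduced representations that would be passed on to clustering. The `tanh` activation is what separates this from a PCA-like linear projection.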

TODO: Comparison with PCA?

In [198]:
from sentence_transformers import SentenceTransformer

In [None]:
model_transformer = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

In [203]:
embeddings = model_transformer.encode(dataset_wv.data["processed_text"].sample(10).tolist())

In [205]:
embeddings.min(), embeddings.max()

(-0.19199066, 0.2247755)