# Introduction

For a business owner, customer reviews can be a valuable source of insight. Imagine being able to continuously monitor the areas that need improvement to increase customer satisfaction, and to highlight the best parts of the business for effective branding.

This project aims to segment user reviews into several topics for easier analysis.

The key components of our project include:
- **Review clustering**: to segment customer reviews into distinct clusters by representing the reviews as embeddings (a combination of a pre-trained language model and a self-trained model),
- **Sentiment analysis**: to classify the sentiment of a review as positive or negative,
- **Topic labeling**: to label review topics within each cluster using a large language model (LLM).


## Dataset

The dataset for this project is the [Google Local dataset](https://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local), obtained from J. McAuley's lab.

Originally, the dataset contains millions of business reviews from across the United States up to 2021. However, for the sake of simplicity and due to limited resources for this project, we focus exclusively on one state and one business type: **tourist attractions in Hawaii**.

In the end, we limit the data to about 1,000 locations (1,035 businesses) with approximately 144k reviews.

In [1]:
!pip install -q en_core_web_sm transformers sentence-transformers openai


In [2]:
from google.cloud import storage
from datetime import datetime

import matplotlib.pyplot as plt
import plotly.express as px
import pandas as pd
import numpy as np
import openai
import io

In [3]:
def download_csv_from_gcs(bucket, file_name,
                          date_columns=None, col_names=None):
    """ A function to download dataset from GCS. """

    blob = bucket.blob(file_name)
    data = blob.download_as_text()
    df = pd.read_csv(io.StringIO(data),
                     parse_dates=date_columns,
                     usecols=col_names)
    return df

In [5]:
# Create a client GCS and get the specified bucket
client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

In [6]:
# Download the dataset from GCS
reviews_df = download_csv_from_gcs(bucket, REVIEW_CSV)

In [7]:
reviews_df.head()

Unnamed: 0,business_id,user_id,time,text
0,0x7c006afc71065bd1:0x7a706dc72f4623ee,113728016128003691063,2021-08-31 04:41:40.565,We needed sun glasses before a boat ride and n...
1,0x7c006afc71065bd1:0x7a706dc72f4623ee,101488063088102775913,2021-08-24 07:48:29.263,Reasonable Prices Has quick food to go
2,0x7c006afc71065bd1:0x7a706dc72f4623ee,100536389119109882523,2021-07-25 00:58:57.672,Convenient liquor store with a little of every...
3,0x7c006afc71065bd1:0x7a706dc72f4623ee,110229694025406635705,2021-07-24 19:28:41.268,I love the spam musubis It's the bomb
4,0x7c006afc71065bd1:0x7a706dc72f4623ee,108697060494095206753,2021-04-08 21:00:29.786,Had a good selection of items at fairly decent...


In [8]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144762 entries, 0 to 144761
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   business_id  144762 non-null  object
 1   user_id      144762 non-null  object
 2   time         144762 non-null  object
 3   text         144762 non-null  object
dtypes: object(4)
memory usage: 4.4+ MB


In [40]:
reviews_df["time"] = pd.to_datetime(reviews_df["time"])
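
The `info()` output above lists `time` as `object`, which is why it is converted afterwards. As a minimal sketch, the same conversion could happen at load time through the `parse_dates` argument that `download_csv_from_gcs` forwards to `pd.read_csv`. The in-memory CSV below is made-up sample data standing in for the GCS blob.

```python
import io
import pandas as pd

# Made-up sample rows standing in for the text returned by `blob.download_as_text()`
csv_text = (
    "business_id,user_id,time,text\n"
    "b1,u1,2021-08-31 04:41:40.565,Great views\n"
    "b1,u2,2021-07-25 00:58:57.672,Crowded on weekends\n"
)

# Parse `time` while reading instead of converting it in a later cell
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

print(sample_df.dtypes["time"])
```

With the notebook's helper, this would correspond to calling `download_csv_from_gcs(bucket, REVIEW_CSV, date_columns=["time"])`.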

In [9]:
# Count the number of unique businesses and users
reviews_df[["business_id", "user_id"]].nunique()

Unnamed: 0,0
business_id,1035
user_id,72582


In [10]:
# Count the average number of reviews for each business
int(reviews_df.groupby("business_id")["text"].count().mean())

139

In [11]:
# Check the missing values
reviews_df.isna().sum()

Unnamed: 0,0
business_id,0
user_id,0
time,0
text,0


# Dataset Preparation

In [12]:
from transformers import pipeline

import en_core_web_sm
import re

In [13]:
spacy_nlp = en_core_web_sm.load()

In [17]:
class ProcessDataset():
    """
      A class for preprocessing review data for training downstream models.
      Preprocessing includes:
        - clean, split, and expand sentences,
        - tokenize and lemmatize sentences, keeping alphabetic tokens only
          (stop words are not removed)
    """

    def __init__(self, spacy_nlp):

        self.nlp = spacy_nlp

    def _clean_text(self, text):
        """ Clean text from unnecessary tokens/substrings """

        # Remove emoji patterns
        emoji_pattern = re.compile(
            "["
            "\U0001F600-\U0001F64F"  # Emoticons
            "\U0001F300-\U0001F5FF"  # Symbols & pictographs
            "\U0001F680-\U0001F6FF"  # Transport & map symbols
            "\U0001F1E0-\U0001F1FF"  # Flags (iOS)
            "\U00002700-\U000027BF"  # Dingbats
            "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
            "\U00002600-\U000026FF"  # Misc symbols
            "\U00002B50-\U00002B59"  # Stars
            "]+", flags=re.UNICODE
        )
        text = emoji_pattern.sub(r"", text)

        # Extract the translated text: everything after '(Translated by Google)'
        # up to the first double space, which is assumed to precede '(Original)'.
        match = re.search(r"\(Translated by Google\)(.+?)  ", text)
        if match:
            text = match.group(1)

        return text

    def _split_and_tokenize(self, text):
        """
          Splits text into sentences using the spaCy model.
          Also tokenize and lemmatize.
        """

        sents = [sent for sent in self.nlp(text.lower()).sents if sent.text]

        full_sents = [sent.text for sent in sents]

        tokenized = [[ token.lemma_ for token in sent
                      if token.is_alpha and not token.is_punct ]
                     for sent in sents ]
        tokenized = [" ".join(sent) for sent in tokenized]
        # We do the above operation so that it can be exploded later

        return full_sents, tokenized

    def transform(self, dataset):
        """ The main text processing function. """

        data = dataset.copy()

        # Clean, split, and expand sentences
        data["text"] = data["text"].apply(self._clean_text)
        data.loc[:, ["processed_text", "tokens"]] = data["text"].apply(
            self._split_and_tokenize).apply(
                lambda x: pd.Series(x, index=["processed_text", "tokens"]))

        data = data.explode(["processed_text", "tokens"]).reset_index(drop=True)
        data = data[data["processed_text"].str.len() >= 10]

        # Tokenize sentence
        data["tokens"] = data["tokens"].apply(lambda x: x.split())
        data = data[data["tokens"].apply(lambda x: len(x) >= 2)]

        return data
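
To illustrate the translation handling in `_clean_text`, here is a small standalone sketch that uses the same regular expression. The sample review and the helper name `keep_translation` are made up for illustration, and the layout of translated reviews (a double space before `(Original)`) is an assumption. The extra `.strip()` is not part of the class above.

```python
import re

# Same pattern as in `_clean_text`: capture lazily up to the first double space
TRANSLATION_PATTERN = re.compile(r"\(Translated by Google\)(.+?)  ")

def keep_translation(text):
    """Return the translated part if the marker is present, else the text unchanged."""
    match = TRANSLATION_PATTERN.search(text)
    return match.group(1).strip() if match else text

# Made-up review in the translated format (assumed layout, not taken from the dataset)
sample = "(Translated by Google) Beautiful beach  (Original) Playa hermosa"
print(keep_translation(sample))
print(keep_translation("Great hike and beautiful views."))
```

Reviews without the marker pass through untouched, so the step is safe to apply to every row.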

In [18]:
data_processor = ProcessDataset(spacy_nlp)

In [19]:
processed_dataset = data_processor.transform(reviews_df)

In [20]:
# NOTE: we use `processed_text` as the transformer input and `tokens` for w2v
processed_dataset[["processed_text", "tokens"]].head()

Unnamed: 0,processed_text,tokens
0,we needed sun glasses before a boat ride and n...,"[we, need, sun, glass, before, a, boat, ride, ..."
1,this little mom and pop convenience store had ...,"[this, little, mom, and, pop, convenience, sto..."
2,we both got our sunglasses for a great price a...,"[we, both, get, our, sunglass, for, a, great, ..."
3,reasonable prices has quick food to go,"[reasonable, price, have, quick, food, to, go]"
4,convenient liquor store with a little of every...,"[convenient, liquor, store, with, a, little, o..."
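
The number of rows grows after preprocessing because every review is expanded into one row per sentence. The sketch below reproduces that step on made-up data. It assumes pandas 1.3 or later, which is needed to explode several list columns at once.

```python
import pandas as pd

# Made-up reviews already split into sentences and their lemmatized tokens
toy = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "processed_text": [["great views.", "parking was hard."], ["lovely beach."]],
    "tokens": [["great view", "parking be hard"], ["lovely beach"]],
})

# Explode both list columns together: one row per sentence,
# with `business_id` repeated for each sentence of the same review
exploded = toy.explode(["processed_text", "tokens"]).reset_index(drop=True)
print(exploded)
```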


In [21]:
processed_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 323490 entries, 0 to 340746
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   business_id     323490 non-null  object
 1   user_id         323490 non-null  object
 2   time            323490 non-null  object
 3   text            323490 non-null  object
 4   processed_text  323490 non-null  object
 5   tokens          323490 non-null  object
dtypes: object(6)
memory usage: 17.3+ MB


In [22]:
cols = ["processed_text", "tokens"]
processed_dataset[cols].to_parquet("processed_dataset_ckpt.parquet",
                                   index=False)
!gsutil cp "processed_dataset_ckpt.parquet" "gs://customer_review_hawaii/data/"

Copying file://processed_dataset_ckpt.parquet [Content-Type=application/octet-stream]...
-
Operation completed over 1 objects/16.5 MiB.                                     


# Word2Vec Model

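In this section, we train custom word embeddings with Word2Vec on our review dataset. Note that the training call below does not set `sg`, so gensim uses its default CBOW architecture; passing `sg=1` would select skip-gram.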

We apply preprocessing steps including cleaning, sentence splitting, tokenization, and lemmatization to the original dataset.

This process results in approximately 323k sentences for training our Word2Vec model (see the `processed_dataset.info()` output above).

## Training

In [23]:
from gensim.models import Word2Vec
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import HDBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

In [24]:
class W2VEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, model=None, normalize=True):
        # Use any existing w2v model
        self.model = model
        self.normalize = normalize

    def fit(self, X, y=None):
        """ Train the w2v model. """

        if not self.model:
            self.model = Word2Vec(X, vector_size=300,
                                  window=5, min_count=1,
                                  compute_loss=True, epochs=100,
                                  alpha=0.01, min_alpha=0.001)
        print("Finished training!")
        print(f"Latest training loss (cumulative): {self.model.get_latest_training_loss()}")

        return self

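    def transform(self, X):
        """ Transform the data using the learned w2v model. """

        # Look up every in-vocabulary token: (n_tokens, vector_size) per sentence
        wv = X.apply(lambda tokens: np.array([self.model.wv[token] for token in tokens
                                              if token in self.model.wv]))
        if self.normalize:
            # Scale each sentence's token matrix by its overall (Frobenius) norm
            wv = wv.apply(lambda v: v / np.linalg.norm(v))
        # Average the token vectors into one sentence vector
        wv = wv.apply(lambda v: v.mean(axis=0))
        return wv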
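
The encoder turns a sentence into a single vector by averaging its token vectors. The sketch below shows this mean pooling on a made-up three-dimensional vocabulary (`toy_wv` is hypothetical and stands in for `model.wv`). It uses a variant that unit-normalizes each token vector before averaging, which differs from `W2VEncoder.transform`, where the whole stack of token vectors is divided by a single norm. It also adds a zero-vector fallback for sentences without any known token, which the class above does not have.

```python
import numpy as np

# Hypothetical 3-dimensional token vectors standing in for `model.wv`
toy_wv = {
    "great": np.array([3.0, 0.0, 4.0]),
    "view": np.array([0.0, 2.0, 0.0]),
}

def sentence_vector(tokens, wv, normalize=True):
    """Average the vectors of in-vocabulary tokens into one sentence vector."""
    vectors = np.array([wv[t] for t in tokens if t in wv])
    if vectors.size == 0:
        # No known token: fall back to a zero vector of the embedding size
        return np.zeros(len(next(iter(wv.values()))))
    if normalize:
        # Unit-normalize each token vector (row) before averaging
        vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors.mean(axis=0)

print(sentence_vector(["great", "view", "unknown"], toy_wv))
```

Per-token normalization keeps long vectors from dominating the average, at the cost of discarding magnitude information.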

In [25]:
# Train the Word2Vec model
w2v_encoder = W2VEncoder()
w2v_encoder.fit(processed_dataset["tokens"])

Finished training!
Latest training loss (cumulative): 66142100.0


In [26]:
# Save model to GCS for future access
# w2v_encoder.model.save(f"model_w2v")

## Clustering

In [None]:
model_sentiment = pipeline(model="cardiffnlp/twitter-roberta-base-sentiment-latest")

In [141]:
selected_id = np.random.choice(processed_dataset["business_id"])

In [142]:
# Helper functions for data selection, cluster evaluation, and visualization

def select_dataset(dataset, selected_id, model_sentiment,
                   sentiment="positive"):
    # The sentiment model returns lowercase labels: "positive", "neutral", "negative"

    # Select data based on `business_id`
    data = dataset[dataset["business_id"] == selected_id].copy()

    # Select reviews from the last six months
    data["time"] = pd.to_datetime(data["time"])
    time_limit = data["time"].max() - pd.DateOffset(months=6)
    data = data[data["time"] >= time_limit]

    # Extract the sentiment using a pretrained model
    data["sentiment"] = data["processed_text"].apply(lambda x: model_sentiment(x))
    data["sentiment"] = data["sentiment"].apply(lambda x: x[0]["label"])

    # Select only the matching sentiment
    data = data[data["sentiment"] == sentiment]

    return data

def evaluate_cluster(X, y):
    """
        Evaluate a clustering result using 3 evaluation metrics:
        1. `silhouette_score`: how close each point is to its own cluster compared
            with the nearest other cluster. It ranges from -1 to 1; higher is better.
        2. `davies_bouldin_score`: the average similarity of each cluster
            with its most similar cluster. The minimum is 0; lower is better.
        3. `calinski_harabasz_score`: the ratio of between-cluster dispersion
            to within-cluster dispersion. Higher values indicate better-separated clusters.
    """
    evaluation_scores = {}
    evaluation_scores["silhouette_score"] = [silhouette_score(X, y)]
    evaluation_scores["davies_bouldin_score"] = [davies_bouldin_score(X, y)]
    evaluation_scores["calinski_harabasz_score"] = [calinski_harabasz_score(X, y)]

    return pd.DataFrame(evaluation_scores).T

def visualize_cluster(df, x_column, y_column):
    pca = PCA(n_components=2)
    # df[x_column].apply(lambda x: print(x.shape))
    df["w2v_pca"] = list(pca.fit_transform(np.vstack(df[x_column].values)))
    df[["w2v_pca_x", "w2v_pca_y"]] = pd.DataFrame(df["w2v_pca"].tolist(), index=df.index)

    # Plot with px.scatter using the new PCA columns
    fig = px.scatter(df,
                    x="w2v_pca_x",
                    y="w2v_pca_y",
                    color=df[y_column],
                    hover_data={"processed_text": True},
                    title="Customer Review - Cluster Distribution")
    fig.show()
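
HDBSCAN assigns the label `-1` to points it considers noise, and `evaluate_cluster` passes all labels to the metrics, so the noise points are scored as if they formed one more cluster. As a hedged sketch on toy data, the helper below (`silhouette_without_noise` is a hypothetical name, not part of the notebook) computes the silhouette score on clustered points only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: two well-separated blobs plus a few points flagged as noise (-1)
X_toy, y_toy = make_blobs(n_samples=60, centers=[[0, 0], [10, 10]],
                          cluster_std=0.5, random_state=0)
labels = y_toy.copy()
labels[:5] = -1  # pretend the clusterer marked the first five points as noise

def silhouette_without_noise(X, labels):
    """Silhouette score computed on clustered points only (label != -1)."""
    mask = labels != -1
    return silhouette_score(X[mask], labels[mask])

print(silhouette_without_noise(X_toy, labels))
print(silhouette_score(X_toy, labels))  # noise treated as a third cluster
```

Reporting both numbers, together with the share of noise points, gives a fairer picture than either score alone.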

In [143]:
class HDBSCANClustering(BaseEstimator, TransformerMixin):
    def __init__(self, apply_pca=False):
        self.model = HDBSCAN(metric="cosine",
                             min_cluster_size=3)
        self.labels_ = None

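        self.apply_pca = apply_pca
        # NOTE: PCA() keeps all components, so the cumulative explained
        # variance printed in `fit` is always 1.000
        self.pca = PCA()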

    def fit(self, X, y=None):
        """ Fit the cluster. """
        if self.apply_pca:
            X = self.pca.fit_transform(X)
            explained_variance = np.cumsum(self.pca.explained_variance_ratio_)
            print(f"Explained variance: {explained_variance[-1]:.3f}")

        self.labels_ = self.model.fit_predict(X)
        return self

    def transform(self, X):
        """ Return the label """

        return self.labels_

In [145]:
# Get the word vector from the trained Word2Vec model
selected_data = select_dataset(processed_dataset,
                               selected_id,
                               model_sentiment,
                               sentiment="positive")
selected_data["w2v"] = w2v_encoder.transform(selected_data["tokens"])

# Clustering
clust = HDBSCANClustering(apply_pca=True)

X = np.vstack(selected_data["w2v"].values)
selected_data["w2v_label"] = clust.fit_transform(X)

# Visualization
visualize_cluster(selected_data, "w2v", "w2v_label")

Explained variance: 1.000
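
The explained variance is exactly 1.000 because `PCA()` without `n_components` keeps every component, so no dimensionality is actually reduced. A minimal sketch on synthetic data shows the alternative of passing a float to `n_components`, which asks scikit-learn to keep just enough components to reach that share of variance. The toy data and the 0.98 target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 200 samples in 50 dimensions driven by 5 latent factors plus small noise
latent = rng.normal(size=(200, 5))
X_toy = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(200, 50))

# Keep every component: the cumulative explained variance is 1 by construction
full_pca = PCA().fit(X_toy)

# Keep only as many components as needed to explain 98% of the variance
reduced_pca = PCA(n_components=0.98).fit(X_toy)

print(full_pca.n_components_, reduced_pca.n_components_)
```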


In [146]:
evaluate_cluster(X, selected_data["w2v_label"])

Unnamed: 0,0
silhouette_score,0.337746
davies_bouldin_score,1.63514
calinski_harabasz_score,17.123084


# `SentenceTransformer` Model

In [149]:
from sentence_transformers import SentenceTransformer

## Clustering

In [150]:
class LLMEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, model):
        # Load the Transformer model
        self.model = model

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """ Transform the data using the transformer model. """

        wv = X.apply(self.model.encode)
        return wv

In [None]:
model_llm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

In [152]:
llm_encoder = LLMEncoder(model_llm)

In [153]:
selected_data = select_dataset(processed_dataset,
                               selected_id,
                               model_sentiment,
                               sentiment="positive")
selected_data["llm"] = llm_encoder.transform(selected_data["processed_text"])

# Clustering
clust = HDBSCANClustering(apply_pca=True)

X = np.vstack(selected_data["llm"].values)
selected_data["llm_label"] = clust.fit_transform(X)

# Visualization
visualize_cluster(selected_data, "llm", "llm_label")

Explained variance: 1.000


In [155]:
evaluate_cluster(X, selected_data["llm_label"])

Unnamed: 0,0
silhouette_score,0.041643
davies_bouldin_score,2.857382
calinski_harabasz_score,4.004492


# Concatenated Word Vectors

In this section, we *concatenate* the Word2Vec embedding with the one generated by `SentenceTransformers`.

We also train an autoencoder to reduce the dimensionality of the combined embeddings since, in theory, it can capture complex, non-linear relationships that linear methods such as PCA may miss.

Before training the autoencoder, we apply PCA to determine the optimal dimensionality required to retain sufficient information from the embeddings.

## Determine the size of the latent vars

We estimate the size of the latent variables using PCA.

However, due to the size of our dataset, we only use a fraction of it for this analysis. We tested PCA with several fractions of the dataset and found that a 100-dimensional vector is sufficient to capture 98% of the total variance.

In [None]:
# NOTE: assumes the `w2v` and `llm` columns have been computed for the full dataset
processed_dataset["mix"] = processed_dataset[["w2v", "llm"]].apply(lambda x: np.hstack(x.values),
                                                                   axis=1)
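
As a quick illustration of the concatenation, the sketch below stacks a 300-dimensional vector (the `vector_size` used for Word2Vec above) and a 384-dimensional vector (the output size of `all-MiniLM-L6-v2`). The random values are stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for one sentence: a 300-d Word2Vec vector and a 384-d MiniLM vector
w2v_vec = rng.normal(size=300)
llm_vec = rng.normal(size=384)

# Concatenate along the feature axis, Word2Vec features first
mixed = np.hstack([w2v_vec, llm_vec])
print(mixed.shape)
```

The combined vector has 684 features, which is the input dimension the autoencoder has to compress.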

In [None]:
cumulative_vars = {}
for f in [.1, .3]:
    # Fit PCA on a random fraction `f` of the dataset
    sample = processed_dataset["mix"].sample(frac=f, random_state=42)
    X = np.vstack(sample.values)

    pca = PCA()
    pca.fit(X)
    cumulative_vars[f] = np.cumsum(pca.explained_variance_ratio_)

The cell above applies PCA to estimate the dimensionality of the latent variables.

In [None]:
# Plot the cumulative variance for the different dataset fractions
colors = ["#D76C82", "#7ED4AD"]

cumulative_vars_df = pd.DataFrame(cumulative_vars)
cumulative_vars_df.plot(color=colors)
plt.xlim(0, 100)

# Mark the smallest number of components that reaches the variance threshold
threshold = 0.98
least_n = cumulative_vars_df.apply(lambda x: np.argmax(x >= threshold) + 1)
for idx, n in enumerate(least_n):
    plt.axvline(x=n, linestyle='--', color=colors[idx], label=n)
plt.legend()

The plot above shows that retaining at least 95% of the variance of the original data requires around 80 dimensions.

With this information, we train the autoencoder accordingly.
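
The threshold logic in the plotting cell relies on `np.argmax` returning the first index at which the cumulative variance reaches the threshold. A small worked example with made-up variance ratios shows the arithmetic.

```python
import numpy as np

# Made-up explained-variance ratios for five principal components
ratios = np.array([0.60, 0.25, 0.10, 0.04, 0.01])
cumulative = np.cumsum(ratios)

threshold = 0.98
# argmax returns the first index where the condition holds; +1 converts it to a count
n_components = int(np.argmax(cumulative >= threshold) + 1)
print(n_components)
```

Here three components reach about 95% and the fourth pushes the total to about 99%, so four components are needed. Note that `np.argmax` also returns 0 when the condition is never met, so the threshold should be checked against the last cumulative value.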

In [None]:
# class PCAEncoder(BaseEstimator, TransformerMixin):
#     def __init__(self, n):
#         self.pca = PCA(n_components=n)

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         encoded = self.pca.fit_transform(batch)

#         return encoded

## Dataset preparation

In [None]:
from sklearn.decomposition import PCA
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.optim as optim

In [None]:
# Turn the DataFrame into a PyTorch Dataset

class AutoEncoderDataset(Dataset):
    def __init__(self, dataset, source="mix"):
        self.data = dataset

        assert source in ["w2v", "llm", "mix"]
        self.source = source

    def get_feature_dim(self):
        """ Return the feature (word vector) dimension """

        return len(self.__getitem__(0))

    def __len__(self):
        """ Return the length of the dataset """

        return len(self.data)

    def __getitem__(self, idx):
        """ Return row data """

        data = self.data.iloc[idx, :]

        # Return the dataset according to wordvecs' source
        return torch.tensor(data[self.source], dtype=torch.float32)

In [None]:
dataset_ae = AutoEncoderDataset(processed_dataset, source="mix")

## Autoencoder training

In [None]:
class Autoencoder(BaseEstimator, TransformerMixin, nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim,
                 lr=1e-4, epochs=1, batch_size=256):

        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.latent_dim = latent_dim
        self.lr = lr
        self.epochs = epochs
        self.batch_size = batch_size

        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, latent_dim)
        )

        # Decoder
        # No output activation: embedding components can be negative, which a
        # Sigmoid (output range [0, 1]) could not reconstruct
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, X):
        encoded = self.encoder(X)
        decoded = self.decoder(encoded)
        return decoded

    def encode(self, X):
        return self.encoder(X)

    def fit(self, X, y=None):
        dataloader = DataLoader(X, batch_size=self.batch_size, shuffle=True)

        # Define optimizer and loss function
        criterion = nn.MSELoss()
        optimizer = optim.Adam(self.parameters(), lr=self.lr)

        # Training loop
        self.train()
        for epoch in range(self.epochs):
            losses = 0
            with tqdm(total=len(dataloader),
                      desc=f"Epoch {epoch + 1}", unit="batch") as pbar:

                for X_batch in dataloader:
                    # Forward pass
                    reconstructed_X = self.forward(X_batch)
                    loss = criterion(reconstructed_X, X_batch)

                    # Backward pass and optimize
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

                    losses += loss.item()

                    pbar.set_postfix({'Loss': loss.item()})
                    pbar.update(1)

                print(f"Epoch {epoch + 1}/{self.epochs}, Loss: {(losses/len(dataloader)):.3f}")

        return self

    def transform(self, X):
        # Switch to evaluation mode and encode
        self.eval()

        dataloader = DataLoader(X, batch_size=self.batch_size, shuffle=False)

        encoded = []
        for X_batch in dataloader:
            with torch.no_grad():
                encoded.append(self.encode(X_batch).cpu().numpy())

        return np.vstack(encoded)
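
One point worth checking when choosing the decoder's output activation: a Sigmoid is bounded to (0, 1), while embedding components are typically signed. The NumPy sketch below, on a made-up target vector, shows that even in the best case the squared error on negative components cannot be removed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up embedding with signed components
target = np.array([-0.4, 0.2, -0.1, 0.7])

# Best case for a Sigmoid output: positive targets matched exactly,
# negative targets approached only from above zero
logits = np.array([-20.0, np.log(0.2 / 0.8), -20.0, np.log(0.7 / 0.3)])
reconstruction = sigmoid(logits)

mse = np.mean((reconstruction - target) ** 2)
floor = np.mean(np.minimum(target, 0.0) ** 2)  # error that cannot be removed
print(mse, floor)
```

A linear output layer, or rescaling the inputs to [0, 1] before training, avoids this error floor.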

In [None]:
# Define the hyperparameters for autoencoder training
ae_hyperparams = {
    "input_dim": dataset_ae.get_feature_dim(),
    "hidden_dim": 1024,
    "latent_dim": 100,
    "lr": 1e-4,
    "epochs": 1,
    "batch_size": 256
}

In [None]:
# Train autoencoder
model_ae = Autoencoder(**ae_hyperparams)
wv_ae = model_ae.fit_transform(dataset_ae)

In [None]:
# Save model to GCS for future access
# timestamp = datetime.now().strftime("%d%m%H")

# torch.save(model_ae.state_dict(), f"autoencoder-{timestamp}")

In [None]:
# !gsutil cp "autoencoder"* "gs://customer_review_hawaii/models/"
# !rm "autoencoder"*

autoencoder-041111  word2vec_amazon_reviews-041109


## Clustering

In [None]:
# Start clustering
clustering = HDBSCANClustering()

In [None]:
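# Only select a certain location, limited to its last six months of reviews
# (assumption: same time window as in `select_dataset`)
times = pd.to_datetime(processed_dataset["time"])
is_selected = processed_dataset["business_id"] == selected_id
timestamp_limit = times[is_selected].max() - pd.DateOffset(months=6)
selected_data = processed_dataset[is_selected & (times >= timestamp_limit)].copy()
len(selected_data)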

In [None]:
X = np.vstack(selected_data["w2v"].values)
selected_data["w2v_cluster"] = clustering.fit_transform(X)

In [None]:
evaluate_cluster(X, selected_data["w2v_cluster"])

In [None]:
visualize_cluster(selected_data, "w2v", "w2v_cluster")

## Cluster labeling

In [None]:
# import openai

# def get_cluster_name(text_samples):
#     # Join outside the f-string: backslashes inside f-string expressions
#     # are a syntax error before Python 3.12
#     samples = "\n".join(text_samples)
#     prompt = f"""
#         You are an expert in giving a descriptive topic to a given list of sentences.
#         The sentences may have different topics, so choose the one that is most commonly shared.
#         Please return the topic as concisely as possible, in at most 3 words.
#         Please also avoid a vague topic.

#         There are 5 sentences as the input.
#         The content of the sentences is limited to customer reviews for a tourist attraction.
#         So, please only choose the topic according to possible aspects of this business.

#         Here is an example with just 3 sentences:

#         INPUT
#         'Gorgeous place to visit, it can get crowded on holidays.'
#         'Great hike and beautiful views.'
#         'Awesome view.'
#         ENDINPUT

#         LABEL 'Scenic view'

#         Here are the sentences:
#         INPUT
#         {samples}
#         ENDINPUT

#         LABEL ...
#     """
#     # openai>=1.0 client interface (`openai.ChatCompletion` was removed in 1.0)
#     client = openai.OpenAI()
#     response = client.chat.completions.create(
#         model="gpt-3.5-turbo",
#         messages=[{"role": "user", "content": prompt}]
#     )
#     return response.choices[0].message.content.strip()
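
The prompt expects a handful of sentences per cluster. As a sketch of how those could be gathered, the two hypothetical helpers below (`sample_sentences_per_cluster` and `format_input_block` are not part of the notebook) pick up to five sentences for each cluster, skip the HDBSCAN noise label `-1`, and format them as the `INPUT ... ENDINPUT` block. The data is made up, and taking the first rows of each cluster is only one possible selection strategy.

```python
import pandas as pd

# Made-up clustered sentences; -1 is the HDBSCAN noise label
toy = pd.DataFrame({
    "processed_text": ["awesome view.", "great hike and beautiful views.",
                       "parking is expensive.", "hard to find parking.",
                       "ok."],
    "label": [0, 0, 1, 1, -1],
})

def sample_sentences_per_cluster(df, label_column, n=5):
    """Return {cluster label: up to `n` sentences}, skipping noise points."""
    clustered = df[df[label_column] != -1]
    return {
        label: group["processed_text"].head(n).tolist()
        for label, group in clustered.groupby(label_column)
    }

def format_input_block(sentences):
    """Format sentences as the INPUT ... ENDINPUT block used in the prompt."""
    quoted = "\n".join(f"'{s}'" for s in sentences)
    return f"INPUT\n{quoted}\nENDINPUT"

samples = sample_sentences_per_cluster(toy, "label")
print(format_input_block(samples[0]))
```

Each formatted block could then be placed in the prompt and sent once per cluster.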

TODO
- review clustering methods with one core example on each method
- make sure the statistics are the same (wv and transformer)

TODO:
- (opt) ask for cluster fixing from OpenAI