# Yelp Pulse

##### Generating Positive vs Negative Mentions from Yelp reviews for deeper sentiment understanding, leveraging LDA and BERT.

NOTE: Since the dataset is pretty huge (3.8M observations), we'll start with "Restaurant" businesses category as the most popular one and we'll start with California state (the state I currently live in). This is further reduce the computational cost when running the UMAP & BERTopic.

Hypothetical Requirement: [https://github.com/clement-hironimus/yelp-pulse-sentiment-analysis/blob/main/README.md](https://github.com/clement-hironimus/yelp-pulse-sentiment-analysis/blob/main/README.md)

&nbsp;
PROJECT OUTLINE:
- Step 1: Data Wrangling & Cleaning (see: 01_yelp_pulse_data_wrangling_and_cleaning.ipynb)
- Step 2: Exploratory Data Analysis (see: 02_yelp_pulse_exploratory_data_analysis.ipynb)
- Step 3: Topic Extraction and Sentiment Classification (THIS NOTEBOOK)

# TL;DR Summary

## Project Overview
This project analyzes Yelp reviews for restaurants in California, identifying key topics and categorizing reviews into high-level themes such as food and service. By leveraging advanced NLP techniques, we aim to provide actionable insights for restaurant owners and potential customers.

&nbsp;
#### 1. Load Data & Filter for Restaurants in California
- **Data Reduction**: Focused on restaurant reviews in California from 2017 to 2021 to ensure computational feasibility and relevance.
- **Assumption**: The latest Yelp review data is from Jan 2022.

#### 2. Create Sentence-Level Embeddings on Yelp Reviews
- **Model**: Used `all-MiniLM-L6-v2` SentenceTransformers to create embeddings that capture the semantic content of the reviews.

#### 3. UMAP Parameter Tuning
- **Optimal Parameters**: `n_neighbors=20` and `min_dist=0.1`.
- **Usage**: Employed higher dimensions (e.g., n_components=50) for HDBSCAN to capture more complex data structures.

#### 4. HDBSCAN Parameter Tuning
- **Optimal Parameter**: `min_cluster_size=300`.
- **Goal**: Achieved a balance between detailed clustering and capturing high-level clusters.

#### 5. Guided Topic Modeling with BERTopic
- **Seed Topics**: Used predefined topics to guide the modeling process.
- **Clustering Models**: Applied UMAP for dimensionality reduction and HDBSCAN for clustering.

#### 6. Evaluate Model Prediction
- **Prediction Confidence**: 42.06%
  - Indicates moderate model confidence, understandable given the reduced data size.
- **Average Prediction Spread**: 24.44%
  - Suggests some overlap in topic probabilities, reflecting the complexity of review content.
- **Outliers**: 40.23%
  - High outlier rate, likely due to strict tuning and limited data size, indicating room for improvement with more data.

#### 7. Analyze the Topic Result
- **Service-Related Topics**: Service quality, friendliness, dietary options.
- **Product-Related Topics**: 
  - By origins: Mexican, Italian, Chinese, Southeast Asian.
  - By genres: Breakfast, brunch, bread, seafood, beverages.
- **Future Work**: Further breakdown into categories like dietary, ambience, cleanliness, accessibility, wait-time, and price.

#### 8. Export Data
- Reviews categorized into topics with associated probabilities exported for dashboard visualization.

#### 9. Potential Next Steps
- **Include More Data & More Granular Topic Clusters**: Enhancing model prediction and discovering more topic categories.
- **Handling Multiple Topics per Review**: Include top-N predictions to account for reviews with multiple topics.
- **Deeper Sentiment Classification with BERT**: Incorporate a BERT-based sentiment classifier to provide deeper sentiment insights. Current "star-based sentiment label" already offers reasonable accuracy.


In [1]:
# Load Libraries
import os

# Data manipulation
import pandas as pd
import numpy as np

# Dimensional Reduction
from sklearn.decomposition import PCA
from umap import UMAP

# Topic Modeling
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer # To create sentence embeddings
from sklearn.feature_extraction.text import CountVectorizer # For topic representation
import hdbscan

# Metrics
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Viz
import matplotlib.pyplot as plt

# Utils
import time
import joblib # To save models


## 1. Load Data & Filter for Restaurants in California
Reduce the dataset size to make it manageable for computational resources.
- **Restaurant Category**: For more guided topic modeling, as different business categories can have completely different topics.
- **Business Location=California**: To reduce the computational resource required (and since I live in this state too).

Data Assumptions:
- Yelp Open Dataset contains review data from `2005-02-16` to `2022-01-19`. This project only uses more recent data from 2017 to 2021.
- Let's assume that "TODAY" is Jan 2022 as indicated from the latest Yelp review data.

### Load Data

In [2]:
# Load Yelp Reviews
print("Loading Yelp Reviews...")
df_clean_yelp_review_and_business = pd.read_json(
    'clean_dataset/02_features_yelp_review_and_business_metadata.json',
    lines=True  # Using "lines" because each line in the file is a separate JSON object
)

# Preview Data
print(f'Initial data loaded for df_clean_yelp_review_and_business: {df_clean_yelp_review_and_business.shape}')
df_clean_yelp_review_and_business.head()

Loading Yelp Reviews...
Initial data loaded for df_clean_yelp_review_and_business: (3797310, 14)


Unnamed: 0,review_id,review_text,cleaned_review_text,business_average_review_stars,business_review_count,original_review_text_length,review_stars,star_based_sentiment_label,business_name,is_business_open,business_categories,business_city,business_state,business_country
0,KU_O5udG6zpxOg-VcAEodg,"If you decide to eat here, just be aware it is...","if you decide to eat here, just be aware it is...",3.0,169,513,3,neutral,Turning Point of North Wales,True,"Restaurants, Breakfast & Brunch, Food, Juice B...",North Wales,PA,United States
1,Sx8TMOWLNuJBWer-0pcmoA,Cute interior and owner (?) gave us tour of up...,cute interior and owner (?) gave us tour of up...,4.0,32,534,4,positive,Melt,False,"Sandwiches, Beer, Wine & Spirits, Bars, Food, ...",New Orleans,LA,United States
2,lUUhg8ltDsUZ9h0xnwY4Dg,I was really between 3 and 4 stars for this on...,i was really between and stars for this one. i...,3.5,33,1555,4,positive,Naked Tchopstix Express,False,"Restaurants, Food, Poke, Hawaiian, Sushi Bars",Indianapolis,IN,United States
3,-P5E9BYUaK7s3PwBF5oAyg,First time there and it was excellent!!! It fe...,first time there and it was excellent!!! it fe...,4.0,137,222,5,positive,Portobello Cafe,True,"Restaurants, Seafood, Cafes, Italian",Eddystone,PA,United States
4,YbMyvlDA2W3Py5lTz8VK-A,"Great burgers,fries and salad! Burgers have a...","great burgers,fries and salad! burgers have a ...",4.0,329,209,5,positive,The Original Habit Burger Grill,True,"Fast Food, Burgers, Restaurants",Goleta,CA,United States


### Filter for Restaurants in California

In [3]:
# Filter for restaurant businesses
print("Filtering for restaurant businesses located in California...")
is_business_restaurant = df_clean_yelp_review_and_business['business_categories'].str.contains(
    'restaurants',  # word to search
    case=False,  # case=False to make search non-case sensitive
    na=False  # handle NaN values as False (ie. not containing word "restaurants")
)
is_business_in_california = df_clean_yelp_review_and_business['business_state'] == 'CA'
df_clean_yelp_review_and_business = df_clean_yelp_review_and_business[is_business_restaurant & is_business_in_california]
print(f"Filtered data: {df_clean_yelp_review_and_business.shape}")


Filtering for restaurant businesses located in California...
Filtered data: (103965, 14)


In [22]:
df_clean_yelp_review_and_business['business_categories'].str.split(', ').explode().value_counts()

business_categories
Restaurants                103965
Food                        35907
Nightlife                   34327
Bars                        32632
American (New)              27022
                            ...  
Bartenders                      3
Wholesalers                     1
Boating                         1
Religious Organizations         1
Churches                        1
Name: count, Length: 239, dtype: int64

## 2. Create Sentence-Level Embeddings on Yelp Reviews

Using `all-MiniLM-L6-v2` SentenceTransformers to capture the semantic content of the reviews and create the embeddings.


In [117]:
# Start timing
start_time = time.time()

# Configuration
sentence_transformer_model_name = 'all-MiniLM-L6-v2'
np_review_text_embeddings_file_name = 'np_review_text_embeddings.npy'
save_directory = 'clean_dataset/dependencies'
sentence_transformer_batch_size = 128  # To prevent memory overload


# Create a new directory if it doesn't exist yet
os.makedirs(save_directory, exist_ok=True)


# Convert review text to lowercase for consistency
print("Converting review text to lowercase for consistency...")
review_text_list = df_clean_yelp_review_and_business['cleaned_review_text'].str.lower().tolist()

# Create embeddings with SentenceTransformer
print(f"Creating embeddings with SentenceTransformer '{sentence_transformer_model_name}'...")
sentence_transformer = SentenceTransformer(sentence_transformer_model_name)
np_review_text_embeddings = sentence_transformer.encode(
    review_text_list,
    batch_size=sentence_transformer_batch_size,
    show_progress_bar=True
)

# Save the embeddings to a file, so we don't have to repeat when there's downstream error
print("Saving the embeddings result...")
np.save(
    os.path.join(save_directory, np_review_text_embeddings_file_name),
    np_review_text_embeddings
)

# Show time elapsed
end_time = time.time()
print(f"Embeddings saved. Process completed. Total time taken: {end_time - start_time:.2f} seconds")

# Preview Results
print(f"Embeddings shape: {np_review_text_embeddings.shape}")
print(np_review_text_embeddings[:5])

Converting review text to lowercase for consistency...
Creating embeddings with SentenceTransformer 'all-MiniLM-L6-v2'...


Batches:   0%|          | 0/813 [00:00<?, ?it/s]

Saving the embeddings result...
Embeddings saved. Process completed. Total time taken: 177.60 seconds
Embeddings shape: (103965, 384)
[[ 0.00310427 -0.00871006  0.05627474 ...  0.03302469 -0.08433873
   0.07478962]
 [-0.07137635  0.01459838  0.05616115 ...  0.01244401  0.01309771
  -0.03240254]
 [-0.07648057 -0.051689    0.03715101 ... -0.02480879 -0.0016976
   0.01311367]
 [-0.08753921  0.08483326 -0.00827177 ...  0.06044303 -0.06865957
   0.013284  ]
 [-0.01195816 -0.08051581  0.0200472  ... -0.00040389 -0.06875212
  -0.00902688]]


## 3. UMAP Parameter Tuning

Below are the examples of different UMAP parameter configurations in 2D space. **Optimal parameters: `n_neighbors=20` and `min_dist=0.1` (bottom-right)** seems to have good balance between capturing local structures and maintaining compact clusters.

**Note**: The optimal parameters for UMAP will be used in conjunction with HDBSCAN for clustering the embeddings. Higher dimensions (e.g., n_components=50) will be used for HDBSCAN to capture more complex structures in the data.

&nbsp;
<table>
  <tr>
    <td><img src="images/umap_tuning/option7_UMAP_n_neighbors=2_n_components=2_min_dist=0.1.png" style="width: 50%;"></td>
    <td><img src="images/umap_tuning/option8_UMAP_n_neighbors=3_n_components=2_min_dist=0.png" style="width: 50%;"></td>
  </tr>
  <tr>
    <td><img src="images/umap_tuning/option1_UMAP_n_neighbors=5_n_components=2_min_dist=0.1.png" style="width: 50%;"></td>
    <td><img src="images/umap_tuning/option6_UMAP_n_neighbors=20_n_components=2_min_dist=0.1.png" style="width: 50%;"></td>
  </tr>
</table>


In [30]:
# Ensure directory exists
output_dir = "images/umap_tuning"
os.makedirs(output_dir, exist_ok=True)

# UMAP parameter grid
n_neighbors_options = [5, 20, 30]
min_dist_options = [0, 0.1, 0.2]
n_components_chosen = 2

# Function to plot UMAP
def plot_umap(embedding, title):
    plt.figure(figsize=(10, 8))
    plt.scatter(embedding[:, 0], embedding[:, 1], s=1)
    plt.title(title)
    plt.xlabel('UMAP Dimension 1')
    plt.ylabel('UMAP Dimension 2')
    plt.grid(True)
    plt.savefig(f"{output_dir}/{title}.png")
    plt.close()

# Loop through the parameter grid
for n_neighbors in n_neighbors_options:
        for min_dist in min_dist_options:
            print(f"Running UMAP with n_neighbors={n_neighbors}, n_components={n_components_chosen}, min_dist={min_dist}")
            umap_model = UMAP(
                n_neighbors=n_neighbors,
                n_components=n_components_chosen,
                min_dist=min_dist,
                metric='cosine', # Suitable for sparse text clustering
                n_jobs=11, # Use 11 CPU cores
                random_state=42 # For reproducibility
            )
            start_time = time.time()
            umap_embedding = umap_model.fit_transform(np_review_text_embeddings)
            end_time = time.time()
            print(f"UMAP completed in {end_time - start_time:.2f} seconds")
            plot_title = f"UMAP_n_neighbors={n_neighbors}_n_components={n_components_chosen}_min_dist={min_dist}"
            plot_umap(umap_embedding, plot_title)


Running UMAP with n_neighbors=3, n_components=2, min_dist=0
UMAP completed in 18.81 seconds
Running UMAP with n_neighbors=3, n_components=2, min_dist=0.1
UMAP completed in 17.70 seconds
Running UMAP with n_neighbors=4, n_components=2, min_dist=0
UMAP completed in 12.17 seconds
Running UMAP with n_neighbors=4, n_components=2, min_dist=0.1
UMAP completed in 15.80 seconds


## 4. HDBSCAN Parameter Tuning

<table>
    <tr>
        <td><img src="images/hdbscan_tuning/iter1_HDBSCAN_visualization_umap_n_components=2_min_cluster_size=208.png" style="height:200px;"></td>
        <td><img src="images/hdbscan_tuning/iter2_HDBSCAN_visualization_umap_n_components=2_min_cluster_size=312.png" style="height:200px;"></td>
    </tr>
    <tr>
        <td><img src="images/hdbscan_tuning/iter3_HDBSCAN_visualization_umap_n_components=2_min_cluster_size=300.png" style="height:200px;"></td>
    </tr>
    
</table>

![HDBSCAN Tuning Overview](images/hdbscan_tuning/hdbscan_tuning.png)

In [7]:
# Optimal UMAP parameters obtained from initial tuning
optimal_n_neighbors = 20
optimal_min_dist = 0.1

In [37]:
# Directory to save the plot and CSV results
output_dir = 'images/hdbscan_tuning'
os.makedirs(output_dir, exist_ok=True)  # Creates the directory only if missing

# PARAMETER GRID

# UMAP
umap_n_components_options = [50] # Generally recommended for sparse text analysis

# HDBSCAN
data_length = len(df_clean_yelp_review_and_business)
min_cluster_size_options = [250, 350]

# Set environment variable for parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Load existing metrics DataFrame if it exists (so we can keep adding the results as we do more experiments)
csv_path = f"{output_dir}/hdbscan_tuning.csv"
if os.path.exists(csv_path):
    hdbscan_metrics_df = pd.read_csv(csv_path)
else:
    hdbscan_metrics_df = pd.DataFrame(columns=[
        'UMAP_n_components', 'HDBSCAN_min_cluster_size', 'UMAP_n_neighbors', 'UMAP_min_dist',
        'Num_Clusters', 'Silhouette_Score', 'Davies_Bouldin_Score', 'Calinski_Harabasz_Score', 
        'Outlier_Proportion'
    ])

# Function to run HDBSCAN and store results
def run_hdbscan(embedding, hdbscan_min_cluster_size, umap_n_components):
    """
    Runs HDBSCAN on the given embedding and plots the results.

    Parameters:
    - embedding: np.array, the UMAP reduced embedding.
    - hdbscan_min_cluster_size: int, the minimum cluster size for HDBSCAN.
    - umap_n_components: int, the number of UMAP components.
    """
    # Initialize HDBSCAN
    print(f"Running HDBSCAN with min_cluster_size={hdbscan_min_cluster_size}...")
    hdbscan_model = hdbscan.HDBSCAN(
        min_cluster_size=hdbscan_min_cluster_size,
        metric='euclidean',  # Trying with default euclidean as faster to compute, will try "cosine" later if needed
        core_dist_n_jobs=11  # Use 11 CPU cores
    )

    # Fit HDBSCAN model and count clusters
    clusters = hdbscan_model.fit_predict(embedding)
    num_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
    num_outliers = list(clusters).count(-1)
    outlier_proportion = num_outliers / len(clusters)
    print(f"Clustering complete. Number of clusters found: {num_clusters}")
    print(f"Outlier Proportion: {outlier_proportion}")

    # Calculate silhouette score if there are more than one clusters
    if num_clusters > 1:
        silhouette_avg = silhouette_score(embedding, clusters)
        davies_bouldin_avg = davies_bouldin_score(embedding, clusters)
        calinski_harabasz_avg = calinski_harabasz_score(embedding, clusters)
        print(f"Silhouette Score: {silhouette_avg}")
        print(f"Davies-Bouldin Score: {davies_bouldin_avg}")
        print(f"Calinski-Harabasz Score: {calinski_harabasz_avg}")
    else:
        silhouette_avg = None
        davies_bouldin_avg = None
        calinski_harabasz_avg = None
        print("Silhouette Score: Not applicable (only one cluster or noise).")
        print("Davies-Bouldin Score: Not applicable (only one cluster or noise).")
        print("Calinski-Harabasz Score: Not applicable (only one cluster or noise).")

    # Store metrics in the DataFrame
    hdbscan_metrics_df.loc[len(hdbscan_metrics_df)] = [
        umap_n_components, hdbscan_min_cluster_size, optimal_n_neighbors, optimal_min_dist, 
        num_clusters, silhouette_avg, davies_bouldin_avg, calinski_harabasz_avg,
        outlier_proportion
    ]

# Loop through each UMAP "n_components"
for n_components in umap_n_components_options:
    
    # Dimensional reduction with UMAP
    print(f"Initializing UMAP with n_components={n_components}, n_neighbors={optimal_n_neighbors}, min_dist={optimal_min_dist}...")
    umap_model = UMAP(
        n_neighbors=optimal_n_neighbors,
        min_dist=optimal_min_dist,
        n_components=n_components,
        metric='cosine',  # Optimal for sparse text clustering
        n_jobs=11,  # Use 11 CPU cores
        random_state=42 # For reproducibility
    )
    umap_result = umap_model.fit_transform(np_review_text_embeddings)
    print(f"UMAP transformation complete. Shape of UMAP embedding: {umap_result.shape}")

    # Loop through each HDBSCAN parameters
    for min_cluster_size in min_cluster_size_options:
        run_hdbscan(
            embedding=umap_result, 
            hdbscan_min_cluster_size=min_cluster_size, 
            umap_n_components=n_components
        )
        print(f"HDBSCAN run with n_components={n_components}, min_cluster_size={min_cluster_size} completed.")
        print('----------------------------------------------------------------------------------------')

print("All UMAP and HDBSCAN parameter combinations have been processed.")

# Save the metrics DataFrame to a CSV file
hdbscan_metrics_df.to_csv(csv_path, index=False)
print("HDBSCAN metrics have been saved to hdbscan_tuning.csv.")

# Show metric results
hdbscan_metrics_df


Initializing UMAP with n_components=50, n_neighbors=20, min_dist=0.1...
UMAP transformation complete. Shape of UMAP embedding: (103965, 50)
Running HDBSCAN with min_cluster_size=250...
Clustering complete. Number of clusters found: 38
Outlier Proportion: 0.4170730534314433
Silhouette Score: 0.14272354543209076
Davies-Bouldin Score: 0.8844499958672517
Calinski-Harabasz Score: 6787.781830741457
HDBSCAN run with n_components=50, min_cluster_size=250 completed.
----------------------------------------------------------------------------------------
Running HDBSCAN with min_cluster_size=350...
Clustering complete. Number of clusters found: 2
Outlier Proportion: 0.006127061992016544
Silhouette Score: 0.3711826801300049
Davies-Bouldin Score: 1.992668725157575
Calinski-Harabasz Score: 1228.4118584360492
HDBSCAN run with n_components=50, min_cluster_size=350 completed.
----------------------------------------------------------------------------------------
All UMAP and HDBSCAN parameter combina

Unnamed: 0,UMAP_n_components,HDBSCAN_min_cluster_size,UMAP_n_neighbors,UMAP_min_dist,Num_Clusters,Silhouette_Score,Davies_Bouldin_Score,Calinski_Harabasz_Score,Outlier_Proportion
0,50.0,100.0,20.0,0.1,65.0,0.11,0.9,5334.0,0.38
1,50.0,200.0,20.0,0.1,45.0,0.12,0.92,5997.0,0.41
2,50.0,300.0,20.0,0.1,34.0,0.15,0.87,7489.0,0.42
3,100.0,100.0,20.0,0.1,65.0,0.08,0.9,4941.0,0.42
4,5.0,50.0,20.0,0.1,135.0,-0.063279,0.947693,1546.363139,0.475949
5,5.0,150.0,20.0,0.1,53.0,0.111418,0.943953,6194.797213,0.368307
6,5.0,250.0,20.0,0.1,38.0,0.149044,0.980809,7838.109976,0.364709
7,10.0,50.0,20.0,0.1,144.0,-0.092178,0.93206,1325.103155,0.504122
8,10.0,150.0,20.0,0.1,57.0,0.10845,0.887338,5597.544505,0.406425
9,10.0,250.0,20.0,0.1,38.0,0.152401,0.889673,6938.917154,0.402308


### HDBSCAN Cluster Visualization

From the optimal UMAP & HDBSCAN parameters. Data is converted into UMAP 2-D for visualization purpose.

<img src="images/hdbscan_tuning/iter3_HDBSCAN_visualization_umap_n_components=2_min_cluster_size=300.png" style="height:200px;">

In [None]:
# Visualize using a specific UMAP "n_components=2". More representative than using the above "n_components=50" and takes only the first two dimensions.

best_min_cluster_size = 300 # Optimal from above parameter tuning (about 0.3% of the dataset length)

# Recompute UMAP with n_components=2 for visualization
print(f"Recomputing UMAP with n_components=2 for visualization...")
umap_model_2d = UMAP(
    n_neighbors=optimal_n_neighbors,
    min_dist=optimal_min_dist,
    n_components=2,
    metric='cosine',  # Optimal for sparse text clustering
    n_jobs=11,  # Use 11 CPU cores
    random_state=42 # For reproducibility
)
umap_result_2d = umap_model_2d.fit_transform(np_review_text_embeddings)
print(f"UMAP (2D) transformation complete. Shape of UMAP embedding: {umap_result_2d.shape}")

# Run HDBSCAN with the best parameters
print(f"Running HDBSCAN with the best min_cluster_size={best_min_cluster_size} for visualization...")
hdbscan_model_best = hdbscan.HDBSCAN(
    min_cluster_size=int(best_min_cluster_size),
    metric='euclidean',
    core_dist_n_jobs=11 # Use 11 CPU cores
)
clusters_best = hdbscan_model_best.fit_predict(umap_result_2d)

# Plot the DBSCAN clusters
plt.figure(figsize=(10, 8))
plt.scatter(umap_result_2d[:, 0], umap_result_2d[:, 1], c=clusters_best, cmap='Spectral', s=1)
plt.title(f"HDBSCAN Visualization (UMAP n_components=2, n_neighbors={optimal_n_neighbors}, min_dist={optimal_min_dist}, min_cluster_size={best_min_cluster_size})")
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.grid(True)
plt.savefig(f"{output_dir}/HDBSCAN_visualization_umap_n_components=2_min_cluster_size={best_min_cluster_size}.png") # Save for later use
plt.show()


## 5. Guided Topic Modeling with BERTopic


In [123]:
# Obtained from previous parameter tuning
optimal_n_neighbors = 20
optimal_min_dist = 0.1
best_min_cluster_size = 300  # Optimal from above parameter tuning (about 0.3% of the dataset length)
optimal_n_components = 50

# Define seed topics
print("Defining seed topics...")
seed_topic_list = [
    ["food", "quality", "taste", "delicious", "flavor", "fresh", "ingredients", "cuisine"],
    ["service", "waiter", "staff", "attentive", "friendly", "helpful", "polite", "rude", "customer", "manager"],
    ["wait", "time", "delay", "quick", "slow", "waiting", "line", "queue"],
    ["price", "cost", "value", "expensive", "cheap", "affordable", "reasonable", "worth"],
    ["clean", "dirty", "hygiene", "tidy", "sanitary", "spotless", "neat", "bathroom", "restroom", "kitchen"],
    ["dietary", "vegan", "vegetarian", "gluten-free", "allergy", "intolerance", "options", "healthy", "gluten"],
    ["ambience", "atmosphere", "vibe", "mood", "decor", "setting", "environment", "views", "interior"],
    ["accessibility", "parking", "wheelchair", "access", "entrance", "stairs", "ramps", "handicap", "service animal"]
]

# Ensure lengths of DataFrame and embeddings match
assert len(df_clean_yelp_review_and_business) == len(np_review_text_embeddings), "Mismatch between DataFrame and embeddings lengths"
print(f'np_review_text_embeddings shape: {np_review_text_embeddings.shape}')

# Initialize UMAP
print("Initializing UMAP...")
umap_model = UMAP(
    n_components=optimal_n_components, 
    n_neighbors=optimal_n_neighbors,  # Balances local and global structure
    min_dist=optimal_min_dist,  # Minimum distance between points
    metric="cosine",  # Effective for text data
    n_jobs=11,  # Utilize 11 CPU cores
    verbose=True, # Show progress log
    random_state=42  # For reproducibility
)

# Initialize HDBSCAN
print("Initializing HDBSCAN...")
hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=best_min_cluster_size,  # Balancing major vs micro clusters
    prediction_data=True,  # Enable prediction for new data
    cluster_selection_epsilon=0,  # Default value, utilizing the hierarchical nature in clustering process
    core_dist_n_jobs=11  # Utilize 11 CPU cores
)

# Initialize CountVectorizer
print("Initializing CountVectorizer...")
vectorizer_model = CountVectorizer(
    stop_words="english",  # Remove stop words
    ngram_range=(1, 3)  # Use 1-3 words when summarizing the topic
)

# Set environment variable for parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Initialize BERTopic
print("Initializing BERTopic modeling...")
bertopic_model = BERTopic(
    seed_topic_list=seed_topic_list,  # Guide the modeling with predefined topics
    calculate_probabilities=True,  # Allow multiple topics per document
    min_topic_size=round(0.0003 * len(df_clean_yelp_review_and_business)),  # Default is 0.0001 (100 of 1M documents), setting 0.0003 to prevent too many micro-topics appearing
    vectorizer_model=vectorizer_model,  # Use custom vectorizer modeling with stop words removal
    umap_model=umap_model,  # Use the customized UMAP modeling
    hdbscan_model=hdbscan_model,  # Use the customized HDBSCAN modeling
    verbose=True,  # Provide detailed progress indicators
    nr_topics='auto',  # Automatically determine the optimal number of topics
    top_n_words=15  # Use 15 words to describe each topic
)

# Fit BERTopic model
print("Fitting BERTopic model...")
topics, probabilities = bertopic_model.fit_transform(
    documents=df_clean_yelp_review_and_business['cleaned_review_text'],
    embeddings=np_review_text_embeddings
)

# Ensure that the lengths match
assert len(topics) == len(df_clean_yelp_review_and_business), "Mismatch between topics and DataFrame lengths"

# Add the "Topic" id to the original dataframe 
df_clean_yelp_review_and_business['Topic'] = topics

# Extract the probability associated with each predicted "Topic" on each row
df_clean_yelp_review_and_business['topic_probability'] = [
    probabilities[i][topic_id] if topic_id != -1 else 0 for i, topic_id in enumerate(topics)
]

# Preview the result
df_clean_yelp_review_and_business.head(10)


Defining seed topics...
np_review_text_embeddings shape: (103965, 384)
Initializing UMAP...
Initializing HDBSCAN...
Initializing CountVectorizer...
Initializing BERTopic modeling...
Fitting BERTopic model...


2024-05-27 20:02:18,973 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


UMAP(angular_rp_forest=True, metric='cosine', n_components=50, n_jobs=1, n_neighbors=20, random_state=42, verbose=True)
Mon May 27 20:02:19 2024 Construct fuzzy simplicial set
Mon May 27 20:02:19 2024 Finding Nearest Neighbors
Mon May 27 20:02:19 2024 Building RP forest with 21 trees
Mon May 27 20:02:22 2024 NN descent for 17 iterations
	 1  /  17
	 2  /  17
	 3  /  17
	 4  /  17
	 5  /  17
	 6  /  17
	Stopping threshold met -- exiting after 6 iterations
Mon May 27 20:02:39 2024 Finished Nearest Neighbor Search
Mon May 27 20:02:40 2024 Construct embedding


Epochs completed:   0%|            0/200 [00:00]

	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Mon May 27 20:04:26 2024 Finished embedding


2024-05-27 20:04:27,328 - BERTopic - Dimensionality - Completed ✓
2024-05-27 20:04:27,337 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-27 20:05:52,184 - BERTopic - Cluster - Completed ✓
2024-05-27 20:05:52,185 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-27 20:06:23,112 - BERTopic - Representation - Completed ✓
2024-05-27 20:06:23,150 - BERTopic - Topic reduction - Reducing number of topics
2024-05-27 20:06:54,577 - BERTopic - Topic reduction - Reduced number of topics from 34 to 34


Unnamed: 0,review_id,review_text,cleaned_review_text,business_average_review_stars,business_review_count,original_review_text_length,review_stars,star_based_sentiment_label,business_name,is_business_open,business_categories,business_city,business_state,business_country,Topic,topic_probability
4,YbMyvlDA2W3Py5lTz8VK-A,"Great burgers,fries and salad! Burgers have a...","great burgers,fries and salad! burgers have a ...",4.0,329,209,5,positive,The Original Habit Burger Grill,True,"Fast Food, Burgers, Restaurants",Goleta,CA,United States,8,1.0
16,5obXxR0b94b5q6j1zYCAzw,We visited once and were very disappointed in ...,we visited once and were very disappointed in ...,4.0,124,315,1,negative,PizzaMan Dan's,True,"Chicken Wings, Restaurants, Beer, Wine & Spiri...",Carpinteria,CA,United States,4,1.0
100,b1GkT5ojVlMCM6jdwPi3wQ,Best smoothies in the whole world. And fresh b...,best smoothies in the whole world. and fresh b...,4.5,97,88,5,positive,Pacific Health Foods,True,"Sandwiches, Specialty Food, Food, Health Marke...",Carpinteria,CA,United States,30,1.0
108,LFpaQzYkP5Pzm5lEjJpTRw,Possibly the best breakfast sandwich EVER. On...,possibly the best breakfast sandwich ever. on ...,4.0,389,86,5,positive,Helena Avenue Bakery,True,"Food, Restaurants, Salad, Coffee & Tea, Breakf...",Santa Barbara,CA,United States,-1,0.0
165,raMmqAddReOruHYmUTdT9Q,I love this place. It's been in La Cumbre Plaz...,i love this place. it's been in la cumbre plaz...,4.0,188,888,4,positive,Plaza Deli,True,"Sandwiches, Restaurants, Salad, Hot Dogs, Delis",Santa Barbara,CA,United States,7,0.204797
204,A-AIvNIGCwUplBhjd2OMhg,Their sushi is pretty good. Mesa rolls are my ...,their sushi is pretty good. mesa rolls are my ...,3.5,227,200,3,neutral,Ichiban,True,"Restaurants, Japanese, Sushi Bars",Santa Barbara,CA,United States,9,0.774801
207,GSyttRJvd5VqKK6O0WV1Eg,"According to my experience, there is just one ...","according to my experience, there is just one ...",4.0,2404,648,5,positive,Santa Barbara Shellfish Company,True,"Live/Raw Food, Restaurants, Seafood, Beer Bar,...",Santa Barbara,CA,United States,3,0.152615
228,s_X5uNLjLSgK_itDNFiadg,I am from Colorado and am visiting Santa Barba...,i am from colorado and am visiting santa barba...,4.0,807,338,4,positive,Dawn Patrol,True,"Coffee & Tea, Breakfast & Brunch, Restaurants,...",Santa Barbara,CA,United States,5,0.053964
274,Q3fPo_x6xKxafAzy1hFITg,Pricey ( a ham and cheese croissant was $5.50)...,pricey ( a ham and cheese croissant was $ . ) ...,4.0,389,392,4,positive,Helena Avenue Bakery,True,"Food, Restaurants, Salad, Coffee & Tea, Breakf...",Santa Barbara,CA,United States,5,0.061375
275,bc9aVpeB0WyByVVTnmOCWQ,I can say this is the best burger of my life!\...,i can say this is the best burger of my life! ...,4.0,329,135,5,positive,The Original Habit Burger Grill,True,"Fast Food, Burgers, Restaurants",Goleta,CA,United States,8,1.0


### Store BERT Topic Info in Dataframe

In [124]:
# Get topic info and convert list columns into the right format
df_bertopic_model_topic = bertopic_model.get_topic_info()
df_bertopic_model_topic.rename(
    columns={
        'Count': 'topic_count', 
        'Name': 'topic_name', 
        'Representation': 'topic_representation', 
        'Representative_Docs': 'topic_representative_docs'
    },
    inplace=True
)

# Check column format
print(f"'topic_representation' data type: {type(df_bertopic_model_topic['topic_representation'][0])}")
print(f"'topic_representative_docs' data type: {type(df_bertopic_model_topic['topic_representative_docs'][0])}")

df_bertopic_model_topic.head()

'topic_representation' data type: <class 'list'>
'topic_representative_docs' data type: <class 'list'>


Unnamed: 0,Topic,topic_count,topic_name,topic_representation,topic_representative_docs
0,-1,41825,-1_food_service_good_great,"[food, service, good, great, place, time, deli...",[my boyfriend took me here for my birthday din...
1,0,11821,0_tacos_food_mexican_taco,"[tacos, food, mexican, taco, salsa, burrito, g...",[great tacos! i would highly recommend this ta...
2,1,6296,1_food_great_service_place,"[food, great, service, place, friendly, good, ...",[this restaurant is absolutely delicious!!! gr...
3,2,5785,2_food_service_order_said,"[food, service, order, said, minutes, asked, t...",[stars for the service tonight. we love this p...
4,3,4707,3_chowder_fish_seafood_crab,"[chowder, fish, seafood, crab, clam, lobster, ...",[not bad. a . star rating for me. neighed bad ...


## 6. Evaluate Model Prediction

### Average Prediction Confidence Score

Represents the percentage of predictions where the model is confident about the assigned topic. A confidence score of 42.06% indicates that the model is moderately confident in its predictions.

- Given the reduced dataset size, a confidence score of 42.06% is understandable. The model is performing reasonably well given the limited data (reducing from 2.5M to 105K).
- Potential Improvements:
    - Incorporating more training data to help the model learn better.
    - Fine-tuning model parameters and exploring different configurations of the UMAP and HDBSCAN models.
    - Utilizing top-N predictions instead of just the highest probability topic to account for reviews with multiple relevant topics.

In [202]:
# Get all prediction probabilities excluding outliers
prediction_probabilities = df_clean_yelp_review_and_business[df_clean_yelp_review_and_business['Topic'] != -1]['topic_probability']
print(f"Average Prediction Confidence (can be confidently right or wrong): {np.mean(prediction_probabilities) * 100:.2f}%")

Average Prediction Confidence (can be confidently right or wrong): 42.06%


### Prediction Spread

Measures the average difference in probabilities between the top predicted topic and the other topics. A lower spread indicates that the model often finds multiple topics equally probable for a given review.
- A spread of 24.44% suggests that there is considerable overlap in the model's topic probabilities, indicating that reviews might be related to multiple topics.
- Potential Improvements:
    - Adjusting the granularity of topics to ensure more distinct topic categories.
    - Exploring alternative topic modeling methods or configurations that can better differentiate between topics.

In [196]:
# Prediction spread excluding outliers
np_topics = pd.Series(topics)
outlier_index = np_topics[np_topics == -1].index
probabilities_excluding_outliers = probabilities[~outlier_index]

prediction_spread = [
    sorted(row, reverse=True)[0] - sorted(row, reverse=True)[1] for row in probabilities_excluding_outliers
]

print(f"Average Prediction Spread (lower means clusters are more distinct): {np.mean(prediction_spread) * 100:.2f}%")

Average Prediction Spread (lower means clusters are more distinct): 24.44%


### Outliers Percentage

Shows the percentage of reviews that were not assigned to any meaningful topic cluster, often due to the review content being too unique or diverse.
- A high outlier rate of 40.23% indicates that a substantial portion of the data does not fit well into the discovered topics, which is not ideal.
- Potential Improvements:
    - Reducing the outlier rate by fine-tuning the clustering parameters.
    - Increasing the amount of training data to help the model form more robust clusters.
    - Adjusting the definition and granularity of topics to better capture the diversity in the reviews.

In [161]:
print(f"Outliers Percentage: {sum([1 for topic_id in topics if topic_id == -1]) / len(topics) * 100:.2f}%")

Outliers Percentage: 40.23%


## 7. Analyze the Topic Result

For restaurants in California, the topic clusters are mostly separated by:

Service-Related:
- Service: service, friendly, order, place, view, atmosphere. 
- Dietary: gluten-free, vegan options

Product-Related:
- By origins:
    - Mexican: tacos, empanadas, salsa.
    - Italian: pizza, pasta.
    - Chinese & Southeast Asian food: Chinese, masala, curry, ramen, pho.
- By genres:
    - Breakfast & Brunch: eggs, brunch, toast, cake, donuts.
    - Bread: burger, sandwich, bagel, gluten-free, hot dog.
    - Seafood: fish, chowder, poke bowl, sushi, crab.
    - Beverage: coffee, boba, smoothie.


Hover the map below to see more details

In [125]:
# To see more details of the clusters separation
bertopic_model.visualize_topics()

#### Top Topic & Keywords

In [126]:
# Create custom labels
custom_labels = df_bertopic_model_topic.set_index('Topic')['topic_name'].to_dict() # Create a dictionary of {Topic: Name}
bertopic_model.set_topic_labels(custom_labels) # Update the custom label

# Step 4: Visualize the updated topics
bertopic_model.visualize_barchart(
    n_words=3,
    top_n_topics=10,
    title='Top-10 Topic & Keywords', 
    autoscale=True,
    custom_labels=True,
    width=350
)


#### Topic Similarity Matrix

Code: ```bertopic_model.visualize_heatmap(top_n_topics=16, width=1000)```

<img src="images/topic_similarity_matrix.png" style="width: 50%;">

### Map Yelp Reviews with Topic Categories (Food/Service)

Given reviews are heavily distributed to either Service-Related or Food-Related, we'll create these two high-level categories.

In the future, I might incorporate more data so we can further breakdown the topics into `dietary`, `ambience`, `cleanliness`, `accessibility`, `wait-time`, and/or `price`.

In [127]:
# Set the category & keywords
food_category_keywords = {
    'food', 'taco', 'seafood', 'pizza', 'breakfast', 'sandwich', 'burger', 'sushi', 'italian',
    'coffee', 'thai', 'ramen', 'indian', 'chinese', 'poke', 'donut', 'acai', 'vietnamese', 'boba',
    'empanada', 'cake', 'dogs', 'smoothie', 'bagel'
}

service_category_keywords = {'service', 'vegan', 'option', 'gluten', 'view'}

categories = {
    "topic_food": food_category_keywords,
    "topic_service": service_category_keywords
}

# To map each topic_label ("Name" column) to one or more categories
def map_topic_to_categories(topic_label):
    
    # Initialize setup: {'food_category': False, 'service_category': False}
    category_mapping = {category: False for category in categories}
    
    # To not assign food/service category for outliers (marked by "-1")
    if "-1" in topic_label:
        return category_mapping
    
    # For each category, check if "topic_label" contains any of the category's keywords
    for category, keywords in categories.items():
        if any(keyword in topic_label for keyword in keywords):
            category_mapping[category] = True
            
    return category_mapping

# Map each topic into one or more categories (food, service)
bertopic_category_mappings = df_bertopic_model_topic['topic_name'].apply(map_topic_to_categories)

# Since the result is in dictionary, we convert each key into a separate column
bertopic_category_mappings = bertopic_category_mappings.apply(pd.Series)

# Create a dataframe containing the topic and food/service labels
bertopic_category_mappings.index = df_bertopic_model_topic.index # Ensure indexes are matching
df_bertopic_model_topic_and_category = pd.concat(
    [df_bertopic_model_topic, bertopic_category_mappings],
    axis=1 # Combine by columns
)

# Preview results
df_bertopic_model_topic_and_category[
    ['Topic', 'topic_count', 'topic_name', 'topic_food', 'topic_service']
].head()


Unnamed: 0,Topic,topic_count,topic_name,topic_food,topic_service
0,-1,41825,-1_food_service_good_great,False,False
1,0,11821,0_tacos_food_mexican_taco,True,False
2,1,6296,1_food_great_service_place,True,True
3,2,5785,2_food_service_order_said,True,True
4,3,4707,3_chowder_fish_seafood_crab,True,False


## 8. Export Data

In [203]:
# Create the final dataframe, combining the original dataframe with the topic categorization
df_final_clean_yelp_review_and_business_topic_and_category = pd.merge(
    df_clean_yelp_review_and_business, 
    df_bertopic_model_topic_and_category[['Topic', 'topic_name', 'topic_food', 'topic_service']],
    on='Topic',
    how='left'
)

# Export data to csv
df_final_clean_yelp_review_and_business_topic_and_category.to_csv('clean_dataset/03_yelp_pulse_final_data.csv', index=False)

# Preview results
df_final_clean_yelp_review_and_business_topic_and_category[
    ['business_name', 'business_categories', 'topic_name', 'topic_probability', 'topic_food', 'topic_service', 'cleaned_review_text']
].sample(10, random_state=24) # The random state=24 shows examples of reviews with both "topic_food" and "topic_service"


Unnamed: 0,business_name,business_categories,topic_name,topic_probability,topic_food,topic_service,cleaned_review_text
46672,Jeannine's American Bakery Restaurant,"Food, Breakfast & Brunch, Bakeries, Restaurant...",12_coffee_latte_coffee shop_shop,0.10088,True,False,"i had a delicious latte here, and a smashing b..."
2,Pacific Health Foods,"Sandwiches, Specialty Food, Food, Health Marke...",30_smoothie_blenders_smoothies_juice,1.0,True,False,best smoothies in the whole world. and fresh b...
32609,South Coast Deli- Carrillo,"Salad, Sandwiches, Restaurants, Delis",7_sandwich_sandwiches_pickles_salad,0.255777,True,False,"can't attach photo, food gone too quickly. max..."
52966,Cafe Ana,"Donuts, Wine Bars, Bars, Breakfast & Brunch, B...",-1_food_service_good_great,0.0,False,False,this place is amazing. their breakfast platter...
753,Santa Barbara Shellfish Company,"Live/Raw Food, Restaurants, Seafood, Beer Bar,...",3_chowder_fish_seafood_crab,1.0,True,False,stopped by to visit stearns wharf and came acr...
16345,Mac N Cheese After Dark,"Nightlife, Soul Food, Restaurants, Comfort Foo...",1_food_great_service_place,0.176587,True,True,my friends and i go here when we need a break ...
95373,Mesa Verde,"Garage Door Services, Restaurants, Live/Raw Fo...",11_vegan_options_food_vegetarian,0.166393,True,True,this was such a wonderful dining experience! t...
84013,Los Arroyos Mexican Restaurant & Take Out,"Mexican, Coffee & Tea, Food, Cocktail Bars, Ni...",0_tacos_food_mexican_taco,0.679147,True,False,my husband and i absolutely love this place. w...
22366,Bar 29,"Restaurants, Bars, Gastropubs, American (Tradi...",8_burger_fries_burgers_good,0.076583,True,False,rather disappointing experience. came in becau...
57375,Wingstop,"Chicken Wings, Restaurants",-1_food_service_good_great,0.0,False,False,this is why santa barbara cannot have nice thi...


## 9. Potential Next Steps

1) **Include More Data & More Granular Topic Clusters**: Given we have 40% outliers in the topic clusters, this is due to stricter tuning and potentially not enough data for the model to be confident in forming a cluster. Adding more data can enhance the model's overall prediction performance (more data to learn) and potentially discover more "topic categories" (e.g., Cleanliness, Ambience, Dietary Options).
    - Action Items:
        - Collect and incorporate additional data to reduce the outlier rate and improve model confidence.
        - Explore different granularity levels in topic modeling to ensure that topics are neither too broad nor too specific.
 
2) **Handling Multiple Topics per Review**: As reviews can contain multiple topics, consider modifying the model to include top-N predictions. This will involve:
    - Identifying the optimal N value to capture multiple relevant topics without increasing the prediction spread excessively.
    - Refining the model to better handle multi-topic reviews, potentially improving overall prediction confidence and reducing outliers.

3) **Deeper Sentiment Classification with BERT**: As I iterate and improve this project further, I might incorporate a BERT-based sentiment classifier to provide deeper insights into customer sentiment. This may include utilizing a pretrained BERT model and fine-tuning it specifically for this dataset. I am not including this yet since the "star-based sentiment label" already provides reasonable accuracy and insights.


