# **GROCERY STORE REVIEW ANALYSIS**
### [*Target, Trader Joe's, Safeway, Fry's*]

----


## Executive Summary of Project
Our project aims to analyze Yelp reviews and ratings of grocery stores to uncover key factors influencing customer satisfaction and business performance. By leveraging machine learning and unstructured data analytics, we will extract actionable insights to help grocery store managers optimize customer experience, marketing strategies, and operational efficiency.

## Data Sources & Filtering Criteria

We compiled publicly available Yelp data and, to keep the analysis relevant, kept only reviews of grocery stores such as Target, Fry’s, Safeway, and Trader Joe’s located in the state of Arizona. This approach allows us to gain localized insights into customer preferences, service quality, and areas for improvement.

---

# 1. Loading & Cleaning Data

In [None]:
!pip install textblob
!pip install seaborn --quiet
!pip install wordcloud --quiet
!pip install s3fs --quiet

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from collections import Counter
from textblob import TextBlob
import boto3
import nltk
from nltk.corpus import stopwords
import re
import matplotlib.dates as mdates

In [None]:
bucket_name = "S3"
file_key = "Grocery_Store_Arizona .csv"  # Change to .xlsx if your file is in Excel format

# Construct the S3 file path
s3_file_path = f"s3://amazon-sagemaker-058264306111-us-east-1-e23504aef6c5/dzd_5l5kah6gnsnq3r/cameadxdckbu07/dev/"


In [None]:
df = pd.read_csv('Grocery_Store_Arizona .csv')
df.head()  # Display the first few rows

In [None]:
# Display basic information about the dataset
print("Dataset Overview:")
df.info()  # info() prints its report directly and returns None
print("\nFirst 5 rows:\n", df.head())
# The latitude, longitude, and review_date columns have the wrong data types.

In [None]:
# Convert latitude and longitude to float (if they are not already)
df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')

# Convert review_date to datetime format
df['review_date'] = pd.to_datetime(df['review_date'], errors='coerce')

In [None]:
df.isnull().sum()
# Attributes and hours columns have missing values.

In [None]:
df.duplicated().sum()
# There are 6372 duplicated values
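
Because the duplicate count above is nonzero, one option is to drop exact duplicate rows before computing any statistics. The following is a minimal sketch on a toy frame; the helper name `drop_exact_duplicates` and the toy values are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

def drop_exact_duplicates(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without fully duplicated rows (first occurrence kept)."""
    return frame.drop_duplicates(keep="first").reset_index(drop=True)

# Toy data standing in for the review table (hypothetical values)
toy = pd.DataFrame({
    "review_id": ["r1", "r2", "r2", "r3"],
    "review_stars": [5, 1, 1, 4],
})

deduped = drop_exact_duplicates(toy)
n_removed = len(toy) - len(deduped)
print(f"Removed {n_removed} duplicate row(s); {len(deduped)} remain.")
```

On the notebook's data this would be called as `df = drop_exact_duplicates(df)`. Whether dropping is appropriate depends on how the duplicates arose, which is worth checking before removing them.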

---

# 2. Summary Statistics of the Data

In [None]:
# Identify categorical and continuous columns
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
continuous_columns = df.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Exclude non-relevant categorical columns (IDs and text data)
excluded_categorical = ['business_id', 'review_id', 'review_text', 'review_user_id', 'checkin_dates', 'attributes', 'categories', 'hours']
categorical_columns = [col for col in categorical_columns if col not in excluded_categorical]

In [None]:
# Plot count plots (bar charts of category frequencies) for categorical columns
for col in categorical_columns:
    plt.figure(figsize=(10, 5))
    sns.countplot(y=df[col], order=df[col].value_counts().index)
    plt.title(f"Distribution of {col}")
    plt.xlabel("Count")
    plt.ylabel(col)
    plt.show()
# All categorical columns are intact.

In [None]:
# Plot boxplots for continuous columns to check for outliers
for col in continuous_columns:
    plt.figure(figsize=(10, 5))
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot of {col}")
    plt.xlabel(col)
    plt.show()
# Data has outliers

In [None]:
# Basic statistics displayed as a DataFrame
summary_stats = df.describe(include='all')
display(summary_stats)

In [None]:
# Ensure 'review_text' is a string
df['review_text'] = df['review_text'].astype(str)

# Number of reviews
num_reviews = df.shape[0]

# Tokenizing the review text
df['tokenized_review'] = df['review_text'].apply(lambda x: x.split())

# Total number of tokens (words)
total_tokens = sum(df['tokenized_review'].apply(len))

# Number of unique words (vocabulary size)
unique_words = set(word for review in df['tokenized_review'] for word in review)
vocabulary_size = len(unique_words)

# Average review length (words per review)
average_review_length = total_tokens / num_reviews

# Number of unique customers
num_unique_customers = df['review_user_id'].nunique()

# Number of unique businesses
num_unique_businesses = df['business_id'].nunique()

# Number of unique regions (cities)
num_unique_regions = df['city'].nunique()

# Average stars per review (uses the per-review rating, not the business-level average)
avg_stars_per_review = df['review_stars'].mean()

# Average votes per review (sum of useful, funny, and cool votes)
df['total_votes'] = df[['useful', 'funny', 'cool']].sum(axis=1)
avg_votes_per_review = df['total_votes'].mean()

# Create summary statistics dataframe
summary_stats = pd.DataFrame({
    "Metric": [
        "Number of Reviews",
        "Total Tokens",
        "Vocabulary Size",
        "Average Review Length",
        "Unique Customers",
        "Unique Businesses",
        "Unique Regions",
        "Average Stars per Review",
        "Average Votes per Review"
    ],
    "Value": [
        num_reviews,
        total_tokens,
        vocabulary_size,
        average_review_length,
        num_unique_customers,
        num_unique_businesses,
        num_unique_regions,
        avg_stars_per_review,
        avg_votes_per_review
    ]
})

# Display the summary statistics
display(summary_stats)

In [None]:
# Word Cloud for most common words in reviews
text = ' '.join(df['review_text'].dropna())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Most Common Words in Reviews")
plt.show()

In [None]:
# Distribution of Review Ratings
plt.figure(figsize=(8, 4))
sns.countplot(x=df['review_stars'], palette='viridis')
plt.title("Distribution of Review Ratings")
plt.xlabel("Review Stars")
plt.ylabel("Count")
plt.show()

---

# 3. Data Evaluation

## (a) Suitability of the Selected Data for Business Questions  
The dataset includes detailed reviews, customer interactions, business information, and location details. This makes it useful for answering business questions about customer satisfaction, performance, and regional preferences. Specifically:  
- Customer Behavior Analysis: The review texts, star ratings, and votes (useful, funny, cool) allow in-depth analysis of customer feelings. This helps businesses understand how satisfied customers are and where they can improve.  
- Business Performance Evaluation: The dataset features business names, star ratings, and review counts, giving insights into how businesses perform based on customer feedback.  
- Geographical Insights: The dataset covers various cities and regions, helping businesses find performance trends and regional preferences.  
- User Engagement Trends: The dataset tracks review dates, check-in dates, and review counts, which help analyze trends in customer visits and engagement over time.  

Overall, this dataset is well suited to gaining insights related to customer satisfaction, business performance, and regional trends.  

## (b) Sample Size Appropriateness  
The dataset includes 8,000 reviews, which is a solid sample size for analysis. We chose 8,000 rows to process the data easily and to gain a clear overview of the analysis. This size allows us to run efficient calculations while still capturing important insights.

This sample size is suitable for understanding customer feelings and business performance across different locations. However, if certain businesses or regions have fewer reviews, it could limit how well we can generalize findings for those specific cases.
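
To see where thin coverage could limit generalization, the number of reviews per business or city can be counted and small groups flagged. This is a sketch under assumptions: the helper `flag_small_groups` and the `min_reviews` threshold are hypothetical, and the toy data only stands in for the notebook's `city` column.

```python
import pandas as pd

def flag_small_groups(frame, group_col, min_reviews):
    """Count reviews per group and mark groups with fewer than `min_reviews`."""
    counts = frame.groupby(group_col).size().rename("n_reviews").reset_index()
    counts["too_few"] = counts["n_reviews"] < min_reviews
    return counts.sort_values("n_reviews", ascending=False).reset_index(drop=True)

# Toy data (hypothetical): five Phoenix reviews, two Tucson, one Mesa
toy = pd.DataFrame({"city": ["Phoenix"] * 5 + ["Tucson"] * 2 + ["Mesa"]})

report = flag_small_groups(toy, "city", min_reviews=3)
print(report)
```

The same call with `group_col="business_id"` would flag individual stores. The threshold itself is a judgment call rather than a fixed rule.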

## (c) Potential Biases in the Data  
Even though the dataset is useful, it may have biases in several ways:  
- Review Bias: People who leave reviews often have strong opinions, either very positive or very negative. This can lead to an overrepresentation of unhappy or very happy customers, while neutral comments may be missing.  
- Geographical Bias: The dataset focuses on specific locations (e.g., Arizona), which might not represent customer behavior in other states or regions.  
- Business Selection Bias: It may mainly include larger or more popular grocery stores, leaving smaller, less-reviewed stores less visible.  
- Time-Based Bias: If the data isn’t evenly spread over time, some businesses might look better or worse due to seasonal changes or outside events.  
- Fake/Influenced Reviews: Some businesses may try to boost their ratings by encouraging positive feedback, which can lead to inflated ratings.  

## (d) Potential Challenges in Processing the Data  
There are several challenges when working with this dataset:  
- Text Data Complexity: Review text needs cleaning and normalization (such as removing stopwords and tokenizing the text into words) before it yields clear insights. Variations in language and slang can make sentiment analysis harder.  
- Data Cleaning Issues: Some areas may have missing data, which needs filling in or removing. Inconsistent formats (like review dates) may need fixing. Duplicate or spam reviews can confuse the analysis.  
- Outlier Management: The boxplots show outliers in star ratings, votes, and review counts, which can skew results. Transformations (such as log-scaling heavily skewed counts) or capping extreme values may be necessary to address this.  
- Regional and Business Distributions: Some businesses or regions might have too few reviews, making it difficult to draw broad conclusions. Normalization techniques (like weighting reviews) may help with fair comparisons.
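
One possible form of the review weighting mentioned above is a smoothed average that pulls each business's mean rating toward the global mean, with the pull being stronger when the business has few reviews. This is an illustrative sketch only; `smoothed_rating` and `prior_weight` are assumed names and the numbers are made up.

```python
import pandas as pd

def smoothed_rating(frame, group_col, rating_col, prior_weight=10):
    """Shrink each group's mean rating toward the global mean.

    smoothed = (prior_weight * global_mean + sum_of_ratings) / (prior_weight + n)
    """
    global_mean = frame[rating_col].mean()
    stats = frame.groupby(group_col)[rating_col].agg(
        n="count", total="sum", raw_mean="mean"
    )
    stats["smoothed_mean"] = (
        (prior_weight * global_mean + stats["total"]) / (prior_weight + stats["n"])
    )
    return stats.drop(columns="total")

# Toy data (hypothetical): business "a" has 8 reviews, business "b" only 2
toy = pd.DataFrame({
    "business_id": ["a"] * 8 + ["b"] * 2,
    "review_stars": [4, 4, 4, 4, 4, 4, 4, 4, 1, 1],
})

result = smoothed_rating(toy, "business_id", "review_stars", prior_weight=2)
print(result)
```

With only two reviews, business "b" moves noticeably toward the global mean, while business "a" barely changes. A larger `prior_weight` makes the comparison more conservative.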

---                                                                                                                        

# 4. Preliminary Data Exploration

In [None]:
# Download NLTK stopwords if not already available
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

In [None]:
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = " ".join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

# Apply preprocessing
df['cleaned_review_text'] = df['review_text'].apply(preprocess_text)

# Sentiment Analysis
def get_sentiment(text):
    return TextBlob(str(text)).sentiment.polarity
df['sentiment_score'] = df['review_text'].apply(get_sentiment)

df['subjectivity'] = df['cleaned_review_text'].apply(lambda text: TextBlob(text).sentiment.subjectivity)

# Classify sentiment based on polarity score
df['sentiment'] = df['sentiment_score'].apply(lambda x: 'Positive' if x > 0 else ('Negative' if x < 0 else 'Neutral'))

# Count sentiment distribution
sentiment_counts = df['sentiment'].value_counts()

# Display sentiment analysis results
sentiment_summary = df[['review_text', 'sentiment_score', 'subjectivity', 'sentiment']].head(10)
display(sentiment_summary)

In [None]:
# Plot sentiment distribution
plt.figure(figsize=(8,5))
sns.barplot(x=sentiment_counts.index, y=sentiment_counts.values, palette="coolwarm")
plt.title("Sentiment Distribution of Reviews")
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.show()

---

## 4.1 Sentiment Trends & Distribution

In [None]:
# Sentiment trend over time
df['review_date'] = pd.to_datetime(df['review_date'], errors='coerce')  # Ensure correct datetime format
sentiment_trend = df.groupby(df['review_date'].dt.to_period("M"))['sentiment_score'].mean()

plt.figure(figsize=(12, 6))
sentiment_trend.plot(marker="o", linestyle="-", color="blue")
plt.title("Sentiment Trend Over Time")
plt.xlabel("Date (Month)")
plt.ylabel("Average Polarity Score")
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

In [None]:
# Sentiment distribution
plt.figure(figsize=(8, 4))
sns.histplot(df['sentiment_score'], bins=20, kde=True, color='blue')
plt.title("Sentiment Score Distribution")
plt.xlabel("Sentiment Score")
plt.ylabel("Frequency")
plt.show()

In [None]:
# Data quality evaluation
missing_values = df.isnull().sum()
print("\nMissing Values per Column:\n", missing_values[missing_values > 0])

print("\nData Quality Observations:")
print("- The dataset contains", df.shape[0], "rows and", df.shape[1], "columns.")
print("- Sentiment analysis shows a general polarity distribution among reviews.")

## Interpretation

### 1. Sentiment Trend Over Time  
- The sentiment polarity fluctuates over the years, showing **ups and downs in customer satisfaction**.
- There are **notable spikes and dips**, indicating periods of **higher positivity and negativity**.
- From **2015 onward**, sentiment scores appear to **stabilize** but with occasional **negative outliers**.
- This suggests that **external factors** (such as changes in store policies, economic trends, or major events) might have influenced customer sentiment.
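
Before attributing spikes and dips to external factors, it may help to check how many reviews sit behind each monthly average, since a month with only a handful of reviews can produce an extreme value on its own. The sketch below assumes the `review_date` and `sentiment_score` columns created earlier; the function name, `min_reviews`, and `window` are hypothetical choices.

```python
import pandas as pd

def monthly_sentiment(frame, date_col="review_date", score_col="sentiment_score",
                      min_reviews=5, window=3):
    """Monthly mean sentiment with a review count and a rolling average.

    Months with fewer than `min_reviews` reviews are masked (set to NaN)
    before smoothing, so a few reviews cannot create a spike by themselves.
    """
    monthly = (
        frame.dropna(subset=[date_col])
        .set_index(date_col)[score_col]
        .resample("MS")  # one bin per calendar month
        .agg(["mean", "count"])
    )
    monthly["reliable_mean"] = monthly["mean"].where(monthly["count"] >= min_reviews)
    monthly["rolling_mean"] = (
        monthly["reliable_mean"].rolling(window, min_periods=1).mean()
    )
    return monthly

# Toy data (hypothetical): February has a single, very negative review
toy = pd.DataFrame({
    "review_date": pd.to_datetime(
        ["2020-01-05", "2020-01-20", "2020-02-10", "2020-03-01", "2020-03-15"]
    ),
    "sentiment_score": [0.2, 0.4, -0.9, 0.1, 0.3],
})

trend = monthly_sentiment(toy, min_reviews=2, window=2)
print(trend)
```

In the toy data the February dip comes from one review and is masked, which is the kind of case worth ruling out before reading a dip as a real change in satisfaction.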

### 2. Sentiment Score Distribution  
- The sentiment scores form a **roughly bell-shaped distribution**, centered on **a slightly positive polarity (~0.2)**; normality itself has not been formally tested.
- Most reviews fall between **-0.25 and 0.5**, meaning that customers tend to express **neutral to slightly positive sentiments**.
- There are **few extreme negative or positive reviews**, suggesting that **customers generally remain balanced in their feedback**.
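
The shape of the score distribution can be checked numerically rather than by eye. A rough sketch follows, using pandas' built-in skewness and excess kurtosis; `shape_summary` is a hypothetical helper and the toy scores are invented. The share of scores equal to exactly zero is included because a polarity of 0.0 may simply mean that no sentiment-bearing words were matched.

```python
import pandas as pd

def shape_summary(scores: pd.Series) -> dict:
    """Summarise how far a distribution is from symmetric and normal.

    A normal distribution has skewness 0 and excess kurtosis 0, and its
    mean equals its median.
    """
    scores = scores.dropna()
    return {
        "mean": scores.mean(),
        "median": scores.median(),
        "skewness": scores.skew(),
        "excess_kurtosis": scores.kurt(),
        "share_exactly_zero": (scores == 0).mean(),
    }

# Toy polarity scores (hypothetical values)
toy_scores = pd.Series([-0.5, 0.0, 0.0, 0.1, 0.2, 0.2, 0.3, 0.4, 0.9])

summary = shape_summary(toy_scores)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

On the notebook's data this would be `shape_summary(df['sentiment_score'])`. Values far from zero for skewness or excess kurtosis would argue against describing the distribution as normal.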

### 3. Data Quality Evaluation  
- **Missing Values**:  
  - **10 missing values in `attributes`**  
  - **64 missing values in `hours`**  
  - The missing data is **minimal** and **should not significantly impact analysis**.
- **Dataset Composition**:  
  - The dataset contains **8,000 rows** and **29 columns**, which is **sufficient for analysis**.
- **Observations**:  
  - Sentiment analysis indicates a **general polarity distribution**, meaning customer reviews **vary but lean slightly positive overall**.
  - Ensuring proper handling of missing values (e.g., imputation or exclusion) will improve data integrity.
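
The two options named above, imputation and exclusion, can be sketched as follows. This is an assumption-laden illustration: `handle_missing` is a hypothetical helper, the placeholder value `"unknown"` is an arbitrary choice, and the toy rows only imitate the `attributes` and `hours` columns.

```python
import pandas as pd

def handle_missing(frame, fill_values=None, drop_subset=None):
    """Fill selected columns with placeholder values, then drop rows that are
    still missing in `drop_subset`. Returns a new DataFrame."""
    out = frame.copy()
    if fill_values:
        out = out.fillna(value=fill_values)  # imputation with fixed placeholders
    if drop_subset:
        out = out.dropna(subset=drop_subset)  # exclusion of incomplete rows
    return out.reset_index(drop=True)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "attributes": ["{'WiFi': 'no'}", None, "{}"],
    "hours": [None, "{'Monday': '8:0-22:0'}", None],
    "review_stars": [5, 3, 1],
})

cleaned = handle_missing(
    toy, fill_values={"attributes": "unknown"}, drop_subset=["hours"]
)
print(cleaned)
```

Since neither column feeds the sentiment analysis directly, leaving the missing values in place may also be a reasonable choice.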

### **Key Takeaways:**
- **Customer sentiment has fluctuated significantly over time**, likely influenced by external events or business changes.
- **Overall sentiment is slightly positive**, with most customers providing **neutral to positive feedback**.
- **Data quality is reliable**, with only a small portion of missing values that can be addressed.
- **Next Steps**:  
  - Investigate the reasons behind **sentiment fluctuations over time**.
  - Address **missing values** where necessary.
  - Further analyze **extreme sentiment scores** (both highly positive and negative) to identify potential improvement areas.

These insights help businesses **track customer sentiment trends**, identify potential areas of concern, and ensure a **high-quality dataset for further analysis**.


---

## 4.2 Review Analysis

# (a) Target

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter reviews for Target
target_reviews = df[df["name"] == "Target"].copy()  # copy so a column can be added safely

# Skip if no data
if target_reviews.empty:
    print("No reviews found for Target.")
else:
    # Create subplots
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle("Review Analysis for Target", fontsize=16)

    # Histogram of review stars
    sns.histplot(target_reviews["review_stars"], bins=5, kde=True, ax=axes[0, 0])
    axes[0, 0].set_title("Distribution of Review Ratings")

    # Review length distribution
    target_reviews["review_length"] = target_reviews["cleaned_review_text"].apply(lambda x: len(str(x).split()))
    sns.histplot(target_reviews["review_length"], bins=20, kde=True, ax=axes[0, 1])
    axes[0, 1].set_title("Distribution of Review Lengths")

    # Review count per user
    sns.histplot(target_reviews["review_user_id"].value_counts(), bins=20, kde=True, ax=axes[1, 0])
    axes[1, 0].set_title("Review Count per User")

    # Useful votes distribution
    sns.histplot(target_reviews["useful"], bins=20, kde=True, ax=axes[1, 1])
    axes[1, 1].set_title("Distribution of Useful Votes")

    # Adjust layout and show plot
    plt.tight_layout(rect=[0, 0.03, 1, 0.97])
    plt.show()


## Interpretation

### 1. Distribution of Review Ratings  
- The ratings for **Target** are **concentrated at the high end of the scale** (a left-skewed distribution), with a significant number of **4-star and 5-star ratings**.
- There are fewer **1-star and 2-star reviews**, but they are still present, indicating some customer dissatisfaction.

### 2. Distribution of Review Lengths  
- Most reviews contain **fewer than 50 words**, suggesting that customers prefer to leave **short and concise feedback**.
- There are a few **longer reviews** exceeding 100 words, indicating detailed experiences, but they are rare.

### 3. Review Count per User  
- The majority of users **leave only 1-5 reviews**, suggesting that most reviewers are **casual** rather than frequent contributors.
- A few users have **left more than 10 reviews**, likely reflecting **repeat customers** or active Yelp users.

### 4. Distribution of Useful Votes  
- Most reviews have received **0 to 2 useful votes**, showing that customers primarily engage with a small number of reviews.
- A few reviews received **more than 5 useful votes**, indicating that **some detailed or insightful reviews** stand out.

### **Key Takeaways:**
- **Customer Satisfaction**: Target has **predominantly high ratings**, but a **small portion of negative reviews** still exist.
- **Review Behavior**: Most reviews are **short**, and **only a few users contribute multiple reviews**.
- **Engagement**: Only a small subset of reviews are considered highly **useful** by other users.

These insights help **Target** improve its **customer engagement and response strategies** by addressing concerns raised in lower-rated reviews and encouraging more detailed feedback.


---

# (b) Trader Joe's

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter reviews for Trader Joe’s
trader_joes_reviews = df[df["name"] == "Trader Joe's"].copy()  # copy so a column can be added safely

# Skip if no data
if trader_joes_reviews.empty:
    print("No reviews found for Trader Joe’s.")
else:
    # Create subplots
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle("Review Analysis for Trader Joe’s", fontsize=16)

    # Histogram of review stars
    sns.histplot(trader_joes_reviews["review_stars"], bins=5, kde=True, ax=axes[0, 0])
    axes[0, 0].set_title("Distribution of Review Ratings")

    # Review length distribution
    trader_joes_reviews["review_length"] = trader_joes_reviews["cleaned_review_text"].apply(lambda x: len(str(x).split()))
    sns.histplot(trader_joes_reviews["review_length"], bins=20, kde=True, ax=axes[0, 1])
    axes[0, 1].set_title("Distribution of Review Lengths")

    # Review count per user
    sns.histplot(trader_joes_reviews["review_user_id"].value_counts(), bins=20, kde=True, ax=axes[1, 0])
    axes[1, 0].set_title("Review Count per User")

    # Useful votes distribution
    sns.histplot(trader_joes_reviews["useful"], bins=20, kde=True, ax=axes[1, 1])
    axes[1, 1].set_title("Distribution of Useful Votes")

    # Adjust layout and show plot
    plt.tight_layout(rect=[0, 0.03, 1, 0.97])
    plt.show()


## Interpretation

### 1. Distribution of Review Ratings  
- **Trader Joe’s** has **overwhelmingly positive reviews**, with a high concentration of **4-star and 5-star ratings**.
- **Very few 1-star and 2-star reviews**, indicating **strong customer satisfaction**.
- The **peak at 5 stars** suggests that customers generally have a very positive shopping experience.

### 2. Distribution of Review Lengths  
- Most reviews are **between 20-80 words**, indicating that customers **express their thoughts in moderate detail**.
- A few longer reviews (above **100 words**) suggest that some customers provide detailed feedback, but they are rare.

### 3. Review Count per User  
- The majority of users **leave only 1-5 reviews**, indicating **occasional engagement**.
- A few highly engaged users have written **more than 10 reviews**, which might indicate **loyal or regular customers**.

### 4. Distribution of Useful Votes  
- Most reviews received **0-2 useful votes**, suggesting that readers rarely mark reviews as useful.
- Some detailed reviews received **higher useful votes (>5)**, indicating that certain reviews were considered **helpful and informative**.

### **Key Takeaways:**
- **Strong Customer Satisfaction**: Trader Joe’s **dominates in positive ratings**, with very few negative reviews.
- **Moderate Review Detail**: Reviews are **concise but informative**, with only a few long-form reviews.
- **Engagement & Loyalty**: A **small subset of users contribute frequently**, reflecting potential **brand loyalty**.
- **Opportunities for Improvement**: Encouraging **more detailed reviews and useful votes** can help future customers make informed decisions.

These insights can help **Trader Joe’s** further enhance customer experience by **addressing minor concerns and leveraging highly-rated products/services**.


---

# (c) Safeway

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter reviews for Safeway
safeway_reviews = df[df["name"] == "Safeway"].copy()  # copy so a column can be added safely

# Skip if no data
if safeway_reviews.empty:
    print("No reviews found for Safeway.")
else:
    # Create subplots
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle("Review Analysis for Safeway", fontsize=16)

    # Histogram of review stars
    sns.histplot(safeway_reviews["review_stars"], bins=5, kde=True, ax=axes[0, 0])
    axes[0, 0].set_title("Distribution of Review Ratings")

    # Review length distribution
    safeway_reviews["review_length"] = safeway_reviews["cleaned_review_text"].apply(lambda x: len(str(x).split()))
    sns.histplot(safeway_reviews["review_length"], bins=20, kde=True, ax=axes[0, 1])
    axes[0, 1].set_title("Distribution of Review Lengths")

    # Review count per user
    sns.histplot(safeway_reviews["review_user_id"].value_counts(), bins=20, kde=True, ax=axes[1, 0])
    axes[1, 0].set_title("Review Count per User")

    # Useful votes distribution
    sns.histplot(safeway_reviews["useful"], bins=20, kde=True, ax=axes[1, 1])
    axes[1, 1].set_title("Distribution of Useful Votes")

    # Adjust layout and show plot
    plt.tight_layout(rect=[0, 0.03, 1, 0.97])
    plt.show()


## Interpretation

### 1. Distribution of Review Ratings  
- **Safeway has a mixed distribution of ratings**, with significant peaks at **1-star, 3-star, and 4-star ratings**.
- The **large number of 1-star reviews** suggests **many dissatisfied customers**.
- A high count of **4-star and 5-star reviews** shows that many customers are still satisfied.
- The **variation in ratings** suggests **inconsistency in customer experience**.

### 2. Distribution of Review Lengths  
- Most reviews are **short**, typically under **50 words**, indicating that customers tend to leave **brief feedback**.
- Some longer reviews (above **100 words**) exist but are **relatively rare**.

### 3. Review Count per User  
- The majority of users **leave only one review**, which suggests **occasional engagement**.
- A small subset of users has contributed **more than 25 reviews**, possibly reflecting **loyal customers or frequent reviewers**.

### 4. Distribution of Useful Votes  
- Most reviews received **0-2 useful votes**, meaning customer engagement with reviews is **low**.
- A few reviews received **higher useful votes (>5)**, indicating that **some reviews were particularly insightful or informative**.

### **Key Takeaways:**
- **Inconsistent Customer Experience**: The **high volume of 1-star and 4-star reviews** indicates a **divided opinion** among customers.
- **Short Reviews**: Customers **do not elaborate much**, making it harder to gain detailed insights from textual feedback.
- **Low Review Engagement**: Reviews receive **few useful votes**, meaning customers **may not rely heavily on reviews for decision-making**.
- **Areas for Improvement**: Addressing **frequent complaints from 1-star reviews** and encouraging **more detailed customer feedback** could help improve overall customer satisfaction.

These insights can help **Safeway** identify inconsistencies in service quality and work on improving the overall customer experience.


---

# (d) Fry's

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter reviews for Fry’s
frys_reviews = df[df["name"] == "Fry's"].copy()  # copy so a column can be added safely

# Skip if no data
if frys_reviews.empty:
    print("No reviews found for Fry’s.")
else:
    # Create subplots
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle("Review Analysis for Fry’s", fontsize=16)

    # Histogram of review stars
    sns.histplot(frys_reviews["review_stars"], bins=5, kde=True, ax=axes[0, 0])
    axes[0, 0].set_title("Distribution of Review Ratings")

    # Review length distribution
    frys_reviews["review_length"] = frys_reviews["cleaned_review_text"].apply(lambda x: len(str(x).split()))
    sns.histplot(frys_reviews["review_length"], bins=20, kde=True, ax=axes[0, 1])
    axes[0, 1].set_title("Distribution of Review Lengths")

    # Review count per user
    sns.histplot(frys_reviews["review_user_id"].value_counts(), bins=20, kde=True, ax=axes[1, 0])
    axes[1, 0].set_title("Review Count per User")

    # Useful votes distribution
    sns.histplot(frys_reviews["useful"], bins=20, kde=True, ax=axes[1, 1])
    axes[1, 1].set_title("Distribution of Useful Votes")

    # Adjust layout and show plot
    plt.tight_layout(rect=[0, 0.03, 1, 0.97])
    plt.show()


## Interpretation

### 1. Distribution of Review Ratings  
- **Fry’s has a polarized rating distribution**, with **high counts for both 1-star and 5-star reviews**.
- The presence of **many 1-star reviews** suggests **customer dissatisfaction with certain aspects of the store**.
- The **moderate number of 3-star and 4-star reviews** indicates that some customers had an **average experience**.
- **Inconsistent service or product quality** may be leading to these **varied customer experiences**.

### 2. Distribution of Review Lengths  
- Most reviews are **short**, typically under **50 words**, meaning customers provide **brief feedback**.
- A small percentage of reviews exceed **100 words**, which may offer **more detailed insights into customer experiences**.

### 3. Review Count per User  
- The majority of users have written **only one review**, suggesting **occasional engagement** rather than **repeat reviewing**.
- A few users have contributed **multiple reviews**, indicating **repeat customers** or **active Yelp users**.

### 4. Distribution of Useful Votes  
- Most reviews received **0-2 useful votes**, meaning customer engagement with reviews is **low**.
- Some reviews received **higher useful votes (>5)**, indicating that **certain reviews provided meaningful or insightful content**.

### **Key Takeaways:**
- **Customer Experience is Divided**: A **high number of 1-star reviews** suggests frequent **negative experiences**, while **many 5-star reviews** show strong customer loyalty.
- **Brief Reviews**: Customers tend to **leave short feedback**, which may limit deeper insights into their experiences.
- **Low Review Engagement**: Most reviews receive **few useful votes**, suggesting that reviews are not heavily relied upon for decision-making.
- **Opportunities for Improvement**: Fry’s should **analyze common issues in 1-star reviews** to identify **recurring problems** and enhance customer satisfaction.

By addressing **negative customer experiences**, Fry’s can work on **improving service quality** and creating a **more consistent shopping experience**.
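
The per-store readings can also be placed side by side in a single table, which makes statements such as "polarized" or "overwhelmingly positive" easier to compare. A minimal sketch, assuming the `name` and `review_stars` columns used throughout; `compare_stores` is a hypothetical helper and the toy ratings are invented.

```python
import pandas as pd

def compare_stores(frame, name_col="name", stars_col="review_stars"):
    """One row per store: review count, mean rating, and the share of
    1-star and 5-star reviews."""
    grouped = frame.groupby(name_col)[stars_col]
    summary = pd.DataFrame({
        "n_reviews": grouped.size(),
        "mean_stars": grouped.mean(),
        "share_1_star": grouped.apply(lambda s: (s == 1).mean()),
        "share_5_star": grouped.apply(lambda s: (s == 5).mean()),
    })
    return summary.sort_values("mean_stars", ascending=False)

# Toy data (hypothetical ratings)
toy = pd.DataFrame({
    "name": ["Target", "Target", "Fry's", "Fry's", "Fry's", "Fry's"],
    "review_stars": [4, 5, 1, 5, 1, 5],
})

table = compare_stores(toy)
print(table)
```

A store with high shares at both ends of the scale, like the toy "Fry's" rows here, would show up as polarized even when its mean rating looks moderate.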


---

# 5. Proposed Solution

In [None]:
df.name.unique()

In [None]:
trader_joes_reviews

## Interpretation

To analyze the dataset and gain useful business insights, we propose combining machine learning models with techniques for analyzing unstructured data. This will help us understand customer sentiment, predict trends, and support better business decisions. Our approach is as follows:

### 1. Machine Learning Models for Sentiment Classification
   We will use supervised machine learning methods to categorize reviews as Positive, Neutral, or Negative based on text features.
   
#### Traditional ML Models:  
- Logistic Regression: A simple and effective starting model for sentiment classification, suitable for features extracted using TF-IDF or Bag-of-Words.
- Support Vector Machine (SVM): Good for text classification in high-dimensional spaces, and works well with TF-IDF data.
- Random Forest: Helps analyze which features are important for determining sentiment and is robust against noisy data.
- Multilayer Perceptron (MLP - Neural Network): Can learn complex patterns in text data and performs well when trained on word embeddings like Word2Vec or GloVe.

### 2. Deep Learning Approaches
- Recurrent Neural Networks (RNN) / LSTM:
     LSTM (Long Short-Term Memory) is effective for sentiment analysis because it maintains the context of words in a sequence. It works well with pre-trained embeddings like GloVe or FastText.
- Transformer-Based Models (BERT, DistilBERT):
     BERT (Bidirectional Encoder Representations from Transformers) is currently very effective for text classification and sentiment analysis. Fine-tuning BERT on customer reviews can lead to high accuracy in detecting sentiment.

### 3. Unstructured Data Analytics Techniques  
- Text Preprocessing:
     This includes breaking down text into individual words, reducing words to their base form, and removing common words, punctuation, numbers, and special characters.
- Feature Extraction:
     We’ll use TF-IDF (Term Frequency-Inverse Document Frequency) to determine keyword importance, Bag-of-Words as a basic representation, and Word2Vec or GloVe embeddings for deep learning models.
- Topic Modeling (Latent Dirichlet Allocation - LDA):  
     This helps us identify the main topics in customer reviews, giving businesses insight into what customers frequently talk about.
- Named Entity Recognition (NER):  
     This identifies entities such as business names, locations, product mentions, and staff names in reviews. It assists in analyzing customer feedback on specific topics.
- Sentiment Trend Analysis:
     We will track sentiment changes over time to spot trends, helping businesses monitor improvements or declines in service.
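
To make the TF-IDF step concrete, here is a from-scratch sketch of the basic weighting using only the standard library. It is an illustration of the idea, not the implementation the project will necessarily use: scikit-learn's `TfidfVectorizer`, for example, applies a smoothed IDF and L2 normalization by default, so its numbers would differ. The function name and the toy reviews are assumptions.

```python
import math
from collections import Counter

def tfidf(documents):
    """Compute TF-IDF weights for a list of tokenized, non-empty documents.

    tf  = term count / document length
    idf = ln(N / df), where df is the number of documents containing the term
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy reviews (hypothetical), already lowercased and tokenized
docs = [
    "great store friendly staff".split(),
    "long lines rude staff".split(),
    "great prices great produce".split(),
]

scores = tfidf(docs)
for i, doc_scores in enumerate(scores, start=1):
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(f"Review {i}:", [(term, round(weight, 3)) for term, weight in ranked])
```

Words that appear in several reviews ("great", "staff") receive lower weights than words unique to one review, which is what makes TF-IDF useful for surfacing distinctive complaints or praise.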

### 4. Predictive Analytics for Customer Satisfaction
We will use regression models (like Linear Regression and XGBoost) to predict business star ratings based on review sentiment. We will also build churn prediction models to estimate whether a customer is likely to keep visiting or stop coming to a business.
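
As a first baseline for the rating prediction described above, a one-variable least-squares line can relate average sentiment to star rating. The sketch below uses NumPy only. The numbers are invented and deliberately lie on a straight line so the fit is easy to verify; real data would be noisier, and `fit_line` is a hypothetical helper.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

# Hypothetical per-business averages: mean sentiment polarity and star rating
mean_sentiment = np.array([-0.2, 0.0, 0.1, 0.3, 0.5])
business_stars = np.array([1.5, 2.5, 3.0, 4.0, 5.0])

slope, intercept = fit_line(mean_sentiment, business_stars)
predicted = slope * 0.2 + intercept  # predicted rating at polarity 0.2

print(f"stars = {slope:.2f} * sentiment + {intercept:.2f}")
print(f"Predicted rating at polarity 0.2: {predicted:.2f}")
```

A model such as XGBoost would replace the straight line with a more flexible function, but a linear baseline like this gives a reference point for judging whether the added complexity helps.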

---

### 5.1 Tokenization

In [None]:

!pip install spacy==3.5.4

!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load("en_core_web_lg")


In [None]:
import spacy
from spacy import displacy

# Load the SpaCy language model with only essential components
nlp = spacy.load('en_core_web_lg', disable=['ner', 'textcat'])

# Convert reviews to a list of strings
reviews = df['review_text'].astype(str).tolist()

# Process reviews using nlp.pipe() for efficient batch processing
docs = list(nlp.pipe(reviews[:500], batch_size=30, n_process=4))  # Limit to 500 reviews for speed

# Extract linguistic features for the first 10 reviews
for i, doc in enumerate(docs[:10]):
    print(f"Review {i+1}:")
    for token in doc[:8]:  # Analyze only the first 8 tokens per review
        print(token.text, token.lemma_, token.pos_, token.dep_)
    print("-" * 50)

# Visualize dependency parsing for the first 5 reviews
for doc in docs[:5]:
    displacy.render(doc, style='dep', jupyter=True, options={'distance': 70})

In [None]:
# Process the first few reviews for efficiency
sample_reviews = df["review_text"].head(10)

# Initialize lists to store processed outputs
sentence_list_spacy = []
token_list_spacy = []

# Process each review using SpaCy
for review in sample_reviews:
    nlp_doc = nlp(review)  # Process each review
    sentences = [sentence.text for sentence in nlp_doc.sents]  # Extract sentences
    tokens = [token.text for token in nlp_doc]  # Extract tokens

    sentence_list_spacy.append(sentences)
    token_list_spacy.append(tokens)

# Display sample outputs
processed_reviews_df = pd.DataFrame({"Original Review": sample_reviews,
                                     "Sentences (SpaCy)": sentence_list_spacy,
                                     "Tokens (SpaCy)": token_list_spacy})


# Display the first few processed reviews
print(processed_reviews_df.head())

# Save results to a CSV file (optional)
processed_reviews_df.to_csv("SpaCy_Processed_Reviews.csv", index=False)

### 5.2 Part-Of-Speech Tagging

In [None]:
# Process the entire dataset for POS tagging using SpaCy
pos_tag_results = []

# Process each review using SpaCy (runs over the entire dataset, so this can be slow)
for review in df["review_text"]:
    nlp_doc = nlp(str(review))  # Convert to string and process each review
    # Keep (token, POS) pairs in order; a dict would collapse repeated tokens
    pos_tags = [(token.text, token.pos_) for token in nlp_doc]
    pos_tag_results.append(pos_tags)

# Create DataFrame for display
pos_tag_df = pd.DataFrame({"Original Review": df["review_text"], "POS Tags (SpaCy)": pos_tag_results})

# Display the processed results
from IPython.display import display
display(pos_tag_df)

### 5.3 Named Entity Recognition

In [None]:
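# SpaCy
# Reload the model with NER enabled: the earlier cell disabled 'ner', which leaves doc.ents empty
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_lg')
nlp_doc = nlp(str(df["review_text"].iloc[0]))  # Run NER on the first review as an example
for ent in nlp_doc.ents:
    print(ent.text, ent.label_)
displacy.render(nlp_doc, style='ent', jupyter=True)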

In [None]:
import spacy
from spacy import displacy

# Load the SpaCy language model
nlp = spacy.load('en_core_web_lg', disable=['ner', 'textcat'])  # Disable components not needed for speed

# Process all reviews efficiently using nlp.pipe()
reviews = df['review_text'].astype(str).tolist()  # Convert to list of strings
docs = list(nlp.pipe(reviews[:5], batch_size=25, n_process=2))  # Process only 5 reviews

# Extract linguistic features efficiently
for i, doc in enumerate(docs):
    print(f"Review {i+1}:")  # Show which review is being processed
    for token in doc[:8]:  # Limit printing to the first 8 tokens for efficiency
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)
    print("-" * 50)

# Limit visualization to the first 4-5 reviews
for doc in docs[:5]:
    displacy.render(doc, style='dep', jupyter=True, options={'distance': 75})

### 5.4 SpaCy X DataFrame = DframCy

In [None]:
!pip install dframcy

In [None]:
import spacy
from dframcy import DframCy

# Load the SpaCy model and initialize DframCy
nlp = spacy.load('en_core_web_lg')
dframcy = DframCy(nlp)

# `df` holds the grocery store reviews in its 'review_text' column
# Process each review and create a DataFrame of annotations
annotation_dataframes = []

for review in df["review_text"]:
    if isinstance(review, str):  # Ensure the review is a string
        doc = dframcy.nlp(review)  # Process the review using DframCy
        annotation_dataframe = dframcy.to_dataframe(doc)  # Convert to DataFrame
        annotation_dataframes.append(annotation_dataframe)

# Combine all individual annotation DataFrames into a single DataFrame for analysis
combined_annotations = pd.concat(annotation_dataframes, ignore_index=True)

# Display the combined DataFrame
display(combined_annotations)

---

### 5.5 Top 20 Nouns

In [None]:
# Function to extract nouns and count them
def get_top_nouns(reviews):
    nouns = []
    for review in reviews["review_text"]:  # Use the function argument rather than the global df
        try:
            doc = nlp(str(review))
            for token in doc:
                if token.pos_ == "NOUN":
                    nouns.append(token.lemma_)  # Use the lemmatized form of the noun
        except TypeError:
            pass
    return Counter(nouns).most_common(20)

# Get the top 20 nouns from the entire dataset
top_nouns = get_top_nouns(df)

# Convert to DataFrame for better visualization
top_nouns_df = pd.DataFrame(top_nouns, columns=["Noun", "Frequency"])

# Display results
print("Top 20 Frequently Used Nouns:")
display(top_nouns_df)

---

### 5.6 Top 20 Adjectives

In [None]:
# Function to extract adjectives and count them
def get_top_adjectives(reviews):
    adjectives = []
    for review in reviews["review_text"]:
        try:
            doc = nlp(str(review))
            for token in doc:
                if token.pos_ == "ADJ":
                    adjectives.append(token.lemma_)  # Use the lemmatized form of the adjective
        except TypeError:
            pass
    return Counter(adjectives).most_common(20)

# Get the top 20 adjectives from the entire dataset
top_adjectives = get_top_adjectives(df)

# Convert to DataFrame for better visualization
top_adjectives_df = pd.DataFrame(top_adjectives, columns=["Adjective", "Frequency"])

# Display results
print("Top 20 Frequently Used Adjectives:")
display(top_adjectives_df)

---

### 5.7 Top 20 Verbs

In [None]:
def get_top_verbs(reviews):
    verbs = []
    for review in reviews["review_text"]:
        try:
            doc = nlp(str(review))
            for token in doc:
                if token.pos_ == "VERB":
                    verbs.append(token.lemma_)  # Use the lemmatized form of the verb
        except TypeError:
            pass
    return Counter(verbs).most_common(20)

# Get the top 20 verbs from the entire dataset
top_verbs = get_top_verbs(df)

# Convert to DataFrame for better visualization
top_verbs_df = pd.DataFrame(top_verbs, columns=["Verb", "Frequency"])

# Display results
print("Top 20 Frequently Used Verbs:")
display(top_verbs_df)

---

### 5.8 Top 20 Named Entities

In [None]:
def get_top_named_entities(reviews):
    named_entities = []
    for review in reviews["review_text"]:
        try:
            doc = nlp(str(review))
            for ent in doc.ents:
                named_entities.append(ent.text)  # Extract named entities
        except TypeError:
            pass
    return Counter(named_entities).most_common(20)

# Get the top 20 named entities from the entire dataset
top_named_entities = get_top_named_entities(df)

# Convert to DataFrame for better visualization
top_named_entities_df = pd.DataFrame(top_named_entities, columns=["Named Entity", "Frequency"])

# Display results
print("Top 20 Frequently Used Named Entities:")
display(top_named_entities_df)

---

# Analysis & Interpretation

1. Named Entities Analysis
The named entities show the grocery stores and locations people mention most:
- Target (2041) and Safeway (2018) are the top mentioned stores, indicating they attract lots of customers.
- Trader Joe’s (656), TJ (628), and Joe (627) also show strong brand recognition for Trader Joe's.
- Walmart (278) and Bashas (274) are mentioned less but are still important competitors.
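- Words like "one," "two," "first," and "five" are numeric entities (spaCy tags them as CARDINAL or ORDINAL); they may point to rankings, counts, or star ratings rather than to stores or places.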

Insights:
High mentions of certain grocery stores suggest they are popular topics for customers, whether for positive or negative reasons. The difference between Fry’s (1098) and Frys (291) shows that people may refer to the same store differently. Tucson (705) suggests this data may focus on local areas.
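
One way to handle spelling variants such as "Fry’s" and "Frys" is to merge them before comparing counts. The sketch below uses the frequencies quoted above; the alias table and the helpers `canonical` and `merge_entity_counts` are assumptions made for illustration. "Joe" is deliberately left out of the alias table because it could also be a person's name.

```python
from collections import Counter

# Hypothetical alias table: lowercase variant -> canonical store name
ALIASES = {
    "frys": "Fry's",
    "fry's": "Fry's",
    "tj": "Trader Joe's",
    "trader joe's": "Trader Joe's",
}


def canonical(name):
    """Map an entity string to its canonical store name, if an alias is known."""
    key = name.replace("\u2019", "'").strip().lower()  # normalize curly apostrophes
    return ALIASES.get(key, name)


def merge_entity_counts(counts):
    """Sum the frequencies of entities that share a canonical name."""
    merged = Counter()
    for name, freq in counts.items():
        merged[canonical(name)] += freq
    return merged


# Frequencies taken from the named-entity results discussed above
raw_counts = {
    "Fry\u2019s": 1098,
    "Frys": 291,
    "Trader Joe\u2019s": 656,
    "TJ": 628,
    "Target": 2041,
}
merged = merge_entity_counts(raw_counts)
print(merged.most_common())
```

After merging, Fry's rises to 1389 mentions and Trader Joe's to 1284, which changes how the stores rank against each other.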

2. Verb Usage Analysis
The most common verbs highlight customer actions and feelings:
- Common actions: have (8560), go (4589), get (3674), find (2385), shop (2312) show that the reviews often focus on customer experiences while visiting and shopping.
- Preference and intent verbs: love (2082), need (1947), ask (1168), buy (1148), try (1144) indicate what customers want from products or services; "love" in particular signals strong positive feeling.
- Decision-making verbs: know (1288), take (1384), use (1278), see (1227), look (1217) suggest that customers often examine products or store features before making a purchase.

Insights:
The high count of "love" (2082) shows that many customers feel positively about products, services, or stores. The words "try" (1144) and "ask" (1168) suggest customers are curious or interact with staff. Frequent use of “need” (1947) might mean customers are looking for essential items.

3. Adjective Usage Analysis
Common adjectives describe store qualities and experiences:
- Positive words: good (2946), great (2469), friendly (1981), nice (1456), clean (1394), helpful (1268) show customers are generally satisfied.
- Neutral adjectives: more (1293), other (1844), new (864), few (826), same (777), only (689) indicate comparisons in shopping choices.
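- Negative or mixed words: bad (1035), busy (763), last (628) suggest some customers have mixed feelings, particularly about crowded stores or poor service. Note that fresh (883) usually reads as praise (for example, of produce), so it likely belongs with the positive words.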

Insights:
The frequent mention of "friendly" (1981) and "helpful" (1268) highlights that good customer service is important. Mentions of "busy" (763) and "bad" (1035) may relate to long lines or delays. The word “favorite” (646) indicates strong brand loyalty.

4. Noun Usage Analysis
Common nouns focus on key topics in reviews:
- Store-related nouns: store (7496), grocery (2497), location (1981), line (1398), parking (1194) show that customers care about their shopping experiences and store accessibility.
- Service nouns: customer (2868), staff (2062), employee (2024), service (2337) emphasize customer interactions with store employees.
- Product nouns: item (2422), selection (1566), price (1683), food (1600) indicate that shoppers often discuss product choices and costs.
- Time-related nouns: time (3018), day (1268) may relate to shopping frequency or wait times.

Insights:
The high mention of "store" (7496) points to a focus on in-person shopping rather than online. Frequent mentions of “price” (1683) and “selection” (1566) highlight that cost and product variety are important to customers. Mentions of “line” (1398) and “parking” (1194) may signal issues with shopping logistics.
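
To compare these themes at a glance, the noun frequencies listed above can be summed per group. This is only a sketch: the theme names and the helper `theme_totals` are our own labels for the four groups in the list, not output from the notebook.

```python
# Noun frequencies copied from the list above
NOUN_COUNTS = {
    "store": 7496, "grocery": 2497, "location": 1981, "line": 1398, "parking": 1194,
    "customer": 2868, "staff": 2062, "employee": 2024, "service": 2337,
    "item": 2422, "selection": 1566, "price": 1683, "food": 1600,
    "time": 3018, "day": 1268,
}

# Theme assignment mirrors the four groups used in the analysis
THEMES = {
    "store": ["store", "grocery", "location", "line", "parking"],
    "service": ["customer", "staff", "employee", "service"],
    "product": ["item", "selection", "price", "food"],
    "time": ["time", "day"],
}


def theme_totals(noun_counts, themes):
    """Sum noun frequencies for each theme; nouns missing from the counts add 0."""
    return {
        theme: sum(noun_counts.get(noun, 0) for noun in nouns)
        for theme, nouns in themes.items()
    }


totals = theme_totals(NOUN_COUNTS, THEMES)
for theme, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{theme}: {total}")
```

These totals only count how often a theme is mentioned; they say nothing about whether the mentions are positive or negative.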

## Overall Insights & Business Implications

- Brand Engagement: Target, Safeway, and Trader Joe’s are frequently discussed, suggesting they attract either many customers or strong opinions, both positive and negative. The presence of both "Fry’s" and "Frys" reflects inconsistent spelling of the same brand, so the two counts should be combined when comparing stores.

- Customer Experience Priorities: The focus on staff and service shows that customer interactions are crucial for satisfaction. Positive words like friendly, helpful, and clean point to good service experiences. Complaints often involve lines, prices, or parking issues, indicating frustration with logistics.

- Product & Shopping Trends: Customers often mention price, selection, food, and item, showing that product choices and affordability matter. Words like "buy," "need," and "try" show that customers regularly evaluate new items.

## Actionable Recommendations for Grocery Stores:
- Enhance Customer Service: Since "friendly" and "helpful" appear often, training employees can improve customer satisfaction.
- Optimize Pricing & Selection: Discussions about price and product variety indicate that competitive pricing and broad selections can boost customer loyalty.
- Improve Store Logistics: Addressing issues related to lines and parking can enhance the overall shopping experience.

---

# 6. Sentiment classification with machine learning approaches

In [None]:
import nltk
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from nltk.sentiment import SentimentIntensityAnalyzer
import datetime

In [None]:
!python -m spacy download en_core_web_lg
!pip install -q vaderSentiment
!pip install s3fs==2023.9.2
# Installing necessary packages

In [None]:
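def normalize(review, lowercase, remove_stopwords):
    """Lemmatize a review, optionally lowercasing it and dropping stop words."""
    review = str(review)  # Guard against missing (NaN) reviews
    if lowercase:
        review = review.lower()
    doc = nlp(review)
    lemmatized = list()
    for token in doc:
        # Keep the token unless stop-word removal is on and the token is a stop word
        if not (remove_stopwords and token.is_stop):
            lemmatized.append(token.lemma_)
    return " ".join(lemmatized)
df['processed'] = df["review_text"].apply(normalize, lowercase=True, remove_stopwords=True)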

In [None]:
# Lexicon-based approach with VaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment = SentimentIntensityAnalyzer()
df['Vader_Sentiment'] = df["review_text"].apply(lambda x: sentiment.polarity_scores(str(x))['compound'])
df['Vader_Label'] = df['Vader_Sentiment'].apply(lambda score: 'Positive' if score > 0.05 else ('Negative' if score < -0.05 else 'Neutral'))
print(df[["review_text", 'Vader_Sentiment', 'Vader_Label']].head())

In [None]:
from textblob import TextBlob

# Create sentiment scores
df["Sentiment"] = df["review_text"].apply(lambda x: TextBlob(str(x)).sentiment.polarity)

In [None]:
#Splitting the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(df["review_text"], df['Sentiment'], test_size=0.2, random_state=5)

In [None]:
# Pre-processing and Bag-of-Words vectorization using CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z]+')
cv = CountVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token.tokenize, max_features = 1000)
X_train_vect = cv.fit_transform(X_train)
X_train_vect.shape

---
## 6.1 Statistics

In [None]:
# a default list of stop words set by the Spacy language model
stopwords = nlp.Defaults.stop_words
print(stopwords)

In [None]:
# variables to store term statistics
num_of_comments = 0
unique_word = set() # using the set-type variable since it does not allow duplicates > able to count the number of unique words
num_of_token_per_comment = [] # using a list-type variable since we want to measure corpus-level statistics (e.g., avg, max, min, median, etc.)
num_of_token_per_comment_without_stop_words = []
total_number_of_tokens = 0 # in a corpus
unique_user = set() # using the set-type variable since it does not allow duplicates > able to count the number of unique users
date_list = [] # able to measure the number of comments by day, week, etc.
vote_count = 0
unique_submission = set() # using the set-type variable since it does not allow duplicates > able to count the number of unique submissions

In [None]:
for index, row in df.iterrows():
    text = str(row["review_text"])  # Guard against missing (NaN) reviews
    doc = nlp(text)
    num_of_comments += 1

    # statistics regarding words
    num_of_tokens = len(doc)
    total_number_of_tokens += num_of_tokens
    token_count_without_stop_words = 0

    for token in doc:
        if not token.is_stop:
            unique_word.add(token.text.lower())
            token_count_without_stop_words += 1

    num_of_token_per_comment.append(num_of_tokens)
    num_of_token_per_comment_without_stop_words.append(token_count_without_stop_words)

In [None]:
# statistics regarding date (collected for every review, not only the last row of the loop above)
date_list = df["review_date"].tolist()

# statistics regarding reviews
unique_submission = set(df["review_id"])

In [None]:
# statistics
print("number of comments:", num_of_comments)
print("number of unique words:", len(unique_word))
print("total number of words in the corpus:", total_number_of_tokens)
print("average number of words in comments:", np.mean(np.asarray(num_of_token_per_comment)))
print("average number of words in comments without stop words:", np.mean(np.asarray(num_of_token_per_comment_without_stop_words)))
print("maximum number of words in comments:", np.max(np.asarray(num_of_token_per_comment)))
print("maximum number of words in comments without stop words:", np.max(np.asarray(num_of_token_per_comment_without_stop_words)))
print("minimum number of words in comments:", np.min(np.asarray(num_of_token_per_comment)))
print("minimum number of words in comments without stop words:", np.min(np.asarray(num_of_token_per_comment_without_stop_words)))
print("median number of words in comments:", np.median(np.asarray(num_of_token_per_comment)))
print("median number of words in comments without stop words:", np.median(np.asarray(num_of_token_per_comment_without_stop_words)))
print("number of unique users:", len(unique_user))  # NOTE: unique_user is never populated above, so this prints 0
print("number of submissions:", len(unique_submission))

In [None]:
df1 = pd.DataFrame(X_train_vect.toarray(), columns=cv.get_feature_names_out())
df1.head()

In [None]:
cv.vocabulary_

In [None]:
X_test_vect= cv.transform(X_test)
X_test_vect.shape

---

## 6.2 Naive Bayes Classification

In [None]:
def categorize_sentiment(score):
    if score > 0:
        return 2  # Positive
    elif score < 0:
        return 0  # Negative
    else:
        return 1  # Neutral

# Apply the function to convert Y_train and Y_test
Y_train = np.array([categorize_sentiment(score) for score in Y_train])
Y_test = np.array([categorize_sentiment(score) for score in Y_test])

# Verify the unique values after conversion
print("Unique values in Y_train after conversion:", np.unique(Y_train))


In [None]:
#Training the model
MNB = MultinomialNB()
MNB.fit(X_train_vect, Y_train)

In [None]:
# Evaluate the performance of the model
from sklearn import metrics
predicted = MNB.predict(X_test_vect)
performance = metrics.classification_report(
    Y_test, predicted,
    labels=[0, 1, 2],  # Integer labels, matching the encoded values in Y_test
    target_names=["Negative", "Neutral", "Positive"], zero_division=0
)

print(performance)  # print() renders the report's line breaks; display() would show the raw string

In [None]:
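from sklearn import metrics
from sklearn.metrics import classification_report

# Classification report with integer labels, which match the encoded values in Y_test
performance = metrics.classification_report(Y_test, predicted, labels=[0, 1, 2], target_names=['Negative', 'Neutral', 'Positive'], zero_division=0)
print(performance)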

## 6.3 Support Vector Machines (SVM) classification

In [None]:
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train_vect, Y_train)

In [None]:
from sklearn import metrics
predicted = clf.predict(X_test_vect)
performance = metrics.classification_report(Y_test,predicted, target_names= ["Negative", "Neutral", "Positive"])
print(performance)  # print() renders the report's line breaks

## 6.4 TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z]+')
vectorizer = TfidfVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token.tokenize, max_features = 800)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_train_tfidf.shape

In [None]:
df2 = pd.DataFrame(X_train_tfidf.toarray(), columns=vectorizer.get_feature_names_out())
df2.head()

In [None]:
X_test_tfidf= vectorizer.transform(X_test)
X_test_tfidf.shape

## 6.5 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
LG = LogisticRegression()
print(X_train_tfidf.shape)
LG.fit(X_train_tfidf, Y_train)

In [None]:
predicted = LG.predict(X_test_tfidf)
performance = metrics.classification_report(Y_test,predicted, target_names= ["Negative", "Neutral", "Positive"])
print(performance)  # print() renders the report's line breaks

## 6.6 Comparison with VaderSentiment

In [None]:
import nltk
nltk.download('vader_lexicon')

In [None]:
v_predicted = []
for text in X_test:
    sent = sentiment.polarity_scores(text)
    if sent['compound'] > 0:
        v_predicted.append("Positive")
    elif sent['compound'] < 0:
        v_predicted.append("Negative")
    else:
        v_predicted.append("Neutral")  # Ensure "Neutral" is included

In [None]:
label_mapping = {"Negative": 0, "Neutral": 1, "Positive": 2}

In [None]:
Y_test_numeric = [label_mapping[label] if isinstance(label, str) else label for label in Y_test]
v_predicted_numeric = [label_mapping[label] for label in v_predicted]

In [None]:
v_performance = metrics.classification_report(
    Y_test_numeric, v_predicted_numeric, target_names=["Negative", "Neutral", "Positive"]
)
print(v_performance)  # print() renders the report's line breaks


In [None]:
import pandas as pd

# Convert Y_test to a Pandas Series with an index
Y_test_series = pd.Series(Y_test, index=range(len(Y_test)), name="Y_test")

In [None]:
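# Ensure Vader sentiment is only applied to the test set
# Select the actual test rows by index label; iloc with a 0..n-1 range would take the first n rows of df instead
df_test = df.loc[X_test.index].copy()  # Use .copy() to avoid modification warnings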

# Initialize Vader SentimentIntensityAnalyzer
sentiment = SentimentIntensityAnalyzer()

# Predict Vader sentiment for X_test
v_predicted = []
for text in X_test:
    sent = sentiment.polarity_scores(text)
    if sent['compound'] > 0:
        v_predicted.append("Positive")
    elif sent['compound'] < 0:
        v_predicted.append("Negative")
    else:
        v_predicted.append("Neutral")  # Ensure "Neutral" is included

# Convert labels to numeric format for consistency
label_mapping = {"Negative": 0, "Neutral": 1, "Positive": 2}
Y_test_numeric = [label_mapping[label] if isinstance(label, str) else label for label in Y_test_series]
v_predicted_numeric = [label_mapping[label] for label in v_predicted]

# Compute Vader Sentiment Analysis Accuracy
df_test["Vader_Label_Numeric"] = v_predicted_numeric
vader_accuracy = accuracy_score(Y_test_numeric, df_test["Vader_Label_Numeric"])

# Generate classification reports for all models
vader_performance = classification_report(Y_test_numeric, v_predicted_numeric, output_dict=True)
logistic_performance = classification_report(Y_test_numeric, LG.predict(X_test_tfidf), output_dict=True)
svm_performance = classification_report(Y_test_numeric, clf.predict(X_test_vect), output_dict=True)
naive_bayes_performance = classification_report(Y_test_numeric, MNB.predict(X_test_vect), output_dict=True)

# Creating a refined comparison table
comparison_df = pd.DataFrame({
    "Metric": ["Accuracy", "Precision (Positive)", "Recall (Positive)", "F1-Score (Positive)"],
    "Vader": [
        vader_accuracy,
        vader_performance["2"]["precision"],
        vader_performance["2"]["recall"],
        vader_performance["2"]["f1-score"],
    ],
    "Logistic Regression": [
        logistic_performance["accuracy"],
        logistic_performance["2"]["precision"],
        logistic_performance["2"]["recall"],
        logistic_performance["2"]["f1-score"],
    ],
    "SVM": [
        svm_performance["accuracy"],
        svm_performance["2"]["precision"],
        svm_performance["2"]["recall"],
        svm_performance["2"]["f1-score"],
    ],
    "Naive Bayes": [
        naive_bayes_performance["accuracy"],
        naive_bayes_performance["2"]["precision"],
        naive_bayes_performance["2"]["recall"],
        naive_bayes_performance["2"]["f1-score"],
    ],
})

# Display the refined comparison table
display(comparison_df)

---

## Comparison and Conclusion on Sentiment Analysis Approaches
### 1. Accuracy and Performance Differences:
The analysis shows that machine learning models, particularly Support Vector Machine (SVM) and Logistic Regression, significantly outperform VADER in sentiment classification. SVM achieves the highest accuracy, making it the most reliable model for this task. Logistic Regression follows closely behind, balancing performance and computational efficiency. Naive Bayes performs better than VADER but is not as effective as the other two machine learning models. Note that the reference labels in this comparison are derived from TextBlob polarity scores, so these metrics measure agreement with TextBlob rather than with human-annotated sentiment.

VADER, a lexicon-based approach, is limited by its predefined word lists and rules. While it provides reasonable accuracy without requiring training data, it cannot learn from domain-specific language patterns, making it less effective compared to trained models.
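
The limitation of a fixed word list can be illustrated with a toy lexicon scorer. This is not VADER's actual algorithm or lexicon; the word weights and the helpers `lexicon_score` and `label` are hypothetical and exist only to show why unseen domain words contribute nothing to the score.

```python
import re

# Hypothetical toy lexicon; real lexicons are far larger and carefully weighted
TOY_LEXICON = {"good": 1.0, "great": 1.5, "friendly": 1.0, "bad": -1.0, "rude": -1.5}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the weights of known words; unknown words contribute 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0.0) for word in words)


def label(score, threshold=0.0):
    """Turn a numeric score into a sentiment label."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"


praise = "Great store, friendly staff!"
complaint = "The produce was wilted and the line was endless"

print(label(lexicon_score(praise)))
print(label(lexicon_score(complaint)))
```

The complaint is scored as neutral because "wilted" and "endless" are missing from the word list, whereas a model trained on labeled grocery reviews could learn that both signal dissatisfaction.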

### 2. Strengths and Weaknesses of VADER:
VADER is designed for quick sentiment analysis and does not require labeled training data. This makes it useful for real-time applications, such as monitoring social media sentiment or analyzing customer feedback without an extensive dataset. It performs well in predicting positive sentiment with high precision but struggles with recall, meaning it may fail to identify some positive sentiments correctly. Although VADER includes rule-based handling for negation and intensifiers, its lack of broader contextual understanding makes it susceptible to errors on sarcasm or domain-specific sentiment expressions.

### 3. Strengths of Machine Learning Models:
Machine learning models, particularly SVM and Logistic Regression, demonstrate superior performance due to their ability to learn from labeled data. They capture complex language patterns, making them highly effective in sentiment classification. SVM, in particular, stands out as the best-performing model, achieving the highest accuracy and F1 score. Logistic Regression provides comparable results while being computationally less intensive. These models excel in both precision and recall, ensuring a more balanced classification compared to VADER.

Naive Bayes, while faster and computationally efficient, does not perform as well as the other machine learning models due to its assumption of feature independence. This limitation affects its ability to properly classify sentiment when contextual relationships between words are important.
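
The independence assumption is easiest to see in a from-scratch version of multinomial Naive Bayes, where each word adds its own log-probability regardless of its neighbors. The sketch below is a simplified illustration with add-one smoothing, not scikit-learn's `MultinomialNB` implementation; the four training reviews are made up.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Count documents per class and words per class from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log prior plus summed word log-likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing; each word is scored independently
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training reviews, for illustration only
training = [
    ("friendly helpful staff".split(), "Positive"),
    ("clean store great selection".split(), "Positive"),
    ("rude staff long line".split(), "Negative"),
    ("bad produce long wait".split(), "Negative"),
]
model = train_nb(training)
print(predict_nb(model, "friendly clean store".split()))
print(predict_nb(model, "rude long wait".split()))
```

Because word order never enters the score, a phrase like "not friendly" is treated as two unrelated words, which is one reason the model can trail SVM and Logistic Regression.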

### 4. Choosing the Right Approach:
The choice of sentiment analysis approach depends on the specific requirements of the task. If real-time sentiment analysis is needed and labeled data is not available, VADER is a suitable option. It provides reasonable accuracy without the need for model training. However, for tasks requiring higher accuracy and adaptability to domain-specific language, machine learning models such as SVM and Logistic Regression should be preferred.

For applications where computational efficiency is a priority, Logistic Regression offers a good balance between accuracy and speed. If maximum accuracy is required, SVM is the best choice. Naive Bayes can be considered for simpler tasks but is generally outperformed by the other two machine learning models.

### 5. Final Recommendation:
VADER is useful for quick sentiment classification but lacks the contextual understanding required for more complex tasks. Machine learning models, particularly SVM and Logistic Regression, provide significantly better performance and should be used when labeled data is available. For applications where high accuracy is essential, SVM is the best option. If computational efficiency is a concern, Logistic Regression serves as a strong alternative.

---

## 6.7 Train and evaluate SVM model on test data

In [None]:
from sklearn.preprocessing import LabelEncoder

# Convert sentiment labels into numerical values
label_encoder = LabelEncoder()
Y_train = label_encoder.fit_transform(Y_train)  # Convert labels
Y_test = label_encoder.transform(Y_test)  # Convert labels

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Train SVM model
SVM_model = SVC()
SVM_model.fit(X_train_tfidf, Y_train)  # Ensure Y_train is numerical

# Evaluate SVM model on test data
predicted_svm = SVM_model.predict(X_test_tfidf)

# Generate classification report with correct target names
svm_performance = classification_report(
    Y_test, predicted_svm, target_names=["Negative", "Neutral", "Positive"]
)
print("SVM Performance on Test Data:")
print(svm_performance)  # print() renders the report's line breaks

---

## Analysis of SVM Model Performance on Sentiment Analysis
### 1. Overall Model Performance
The Support Vector Machine (SVM) model achieves a high level of accuracy in classifying sentiment, demonstrating strong performance across all three sentiment categories: Negative, Neutral, and Positive. The overall accuracy is high, indicating that the model effectively distinguishes between different sentiments within the dataset.

### 2. Precision, Recall, and F1-Score Evaluation
The model maintains consistently high precision, recall, and F1-scores for all sentiment categories. Precision measures how many of the predicted sentiments were correct, recall reflects the model’s ability to capture all relevant instances, and the F1-score provides a balance between the two. The near-perfect values suggest that the model effectively minimizes misclassification.

For the negative and positive sentiment classes, both precision and recall are close to perfect. This indicates that the model is highly confident in its predictions and accurately classifies these sentiments. The neutral sentiment class, however, shows slightly lower recall compared to precision, which suggests that the model might sometimes misclassify neutral sentiments as either negative or positive.
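
The definitions above can be made concrete with a short sketch that computes precision, recall, and F1 per class from two label lists. The labels below are hypothetical and chosen so that the neutral class (1) shows high precision but low recall; `per_class_metrics` is an illustrative helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Compute precision, recall, F1, and support for each class label."""
    report = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true instances of this class
        }
    return report


# Hypothetical labels: 0 = Negative, 1 = Neutral, 2 = Positive
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2, 2, 2, 2, 2]

report = per_class_metrics(y_true, y_pred, labels=[0, 1, 2])
for label, scores in report.items():
    print(label, scores)
```

In this toy example every neutral prediction is correct, yet two of the three neutral reviews are assigned to other classes, which is the pattern described for the neutral class.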

### 3. Class Distribution and Model Balance
The support values indicate the number of instances for each sentiment class. The model performs well across different class distributions, maintaining balanced predictions despite potential variations in dataset composition. The macro and weighted averages confirm that the model is not biased toward any particular sentiment class.
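
The difference between the macro and weighted averages is easiest to see with numbers. The F1 scores and support values below are hypothetical, chosen to show what happens when one class is small and weak; they are not results from this notebook.

```python
def macro_average(scores):
    """Unweighted mean: every class counts equally."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, support):
    """Mean weighted by the number of true instances in each class."""
    total = sum(support.values())
    return sum(scores[name] * support[name] for name in scores) / total


# Hypothetical per-class F1 scores and support values
f1_scores = {"Negative": 0.9, "Neutral": 0.5, "Positive": 0.95}
support = {"Negative": 100, "Neutral": 20, "Positive": 880}

macro_f1 = macro_average(f1_scores)
weighted_f1 = weighted_average(f1_scores, support)
print(round(macro_f1, 4), round(weighted_f1, 4))
```

A weighted average well above the macro average suggests that a small class is dragging the macro figure down, so the two should be close before concluding that a model is unbiased across classes.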

### 4. Strengths of the SVM Model
The high scores across all evaluation metrics indicate that the SVM model effectively captures linguistic patterns in the data. This is particularly beneficial for sentiment analysis, where subtle differences in wording can change the meaning of a statement. The model's robustness ensures that it performs well even with varying input structures.

### 5. Areas for Further Improvement
While the model performs exceptionally well, minor improvements could be made in recall for the neutral sentiment class. Additional training data, more refined feature engineering, or hyperparameter tuning may further enhance the model’s ability to correctly identify neutral sentiments. If class imbalance is present, techniques such as oversampling or class weighting could help balance performance across all sentiment categories.
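
Class weighting can be sketched as follows. The formula, `n_samples / (n_classes * class_count)`, is the same heuristic scikit-learn applies for `class_weight="balanced"`; the label list is hypothetical and `balanced_class_weights` is an illustrative helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution: 0 = Negative, 1 = Neutral, 2 = Positive
labels = [2] * 6 + [0] * 3 + [1] * 1

weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

The rare neutral class receives the largest weight, so misclassifying a neutral review costs the model more during training.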

### 6. Conclusion
The SVM model demonstrates excellent performance in sentiment classification, making it a reliable choice for sentiment analysis tasks. Its high accuracy and balanced evaluation metrics indicate that it effectively distinguishes between negative, neutral, and positive sentiments. While it slightly struggles with recall in the neutral category, its overall performance is strong and suitable for real-world sentiment analysis applications.

---
## 6.8 Text Classification with Deep Learning

In [None]:
# Installing necessary libraries and packages.
%pip install --upgrade keras
!pip install pydot
!pip install graphviz
%pip install wordcloud spacy scikit-learn pandas matplotlib seaborn
!pip install --force tensorflow
!pip install numpy==1.26.1

---
## 6.9 Artificial Neural Network with multiple hidden layers

In [None]:
import tensorflow as tf

# Define input layer (TF-IDF features with 5000 dimensions)
input_layer = tf.keras.Input(shape=(5000,), name="input_layer")

# First hidden layer (ReLU activation for feature extraction)
hidden_layer1 = tf.keras.layers.Dense(units=1024, activation="relu", name="hidden_layer1")(input_layer)

# Second hidden layer (ReLU activation for deeper representation)
hidden_layer2 = tf.keras.layers.Dense(units=512, activation="relu", name="hidden_layer2")(hidden_layer1)

# Third hidden layer (ReLU activation for enhanced feature learning)
hidden_layer3 = tf.keras.layers.Dense(units=256, activation="relu", name="hidden_layer3")(hidden_layer2)

# Output layer (Softmax for multi-class classification)
output_layer = tf.keras.layers.Dense(units=3, activation="softmax", name="output_layer")(hidden_layer3)

# Define ANN Model
model = tf.keras.Model(inputs=input_layer, outputs=output_layer, name="ANN_Grocery_Store_Review")

# Compile model
model.compile(
    loss="categorical_crossentropy",  # Labels are one-hot encoded over 3 classes
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # Adaptive learning rate
    metrics=[
        tf.keras.metrics.CategoricalAccuracy(name="accuracy"),
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall")
    ]
)

# Display model summary
model.summary()

# Plot model architecture
tf.keras.utils.plot_model(model, show_shapes=True)

In [None]:
# Import necessary libraries
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from sklearn.feature_extraction.text import TfidfVectorizer

# Ensure text data is vectorized
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Use top 5000 features
X = tfidf_vectorizer.fit_transform(df['review_text'].astype(str)).toarray()  # Cast to str to guard against missing reviews; convert sparse matrix to dense array

# Ensure target labels are numerical
label_mapping = {"Negative": 0, "Neutral": 1, "Positive": 2}
Y = df["sentiment"].map(label_mapping).values  # Convert categorical labels to numbers

# Convert labels to categorical (One-Hot Encoding) for multi-class classification
Y_encoded = to_categorical(Y, num_classes=3)

# Split the dataset into 75% training and 25% testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_encoded, test_size=0.25, random_state=42)

# Define ModelCheckpoint to save the best model based on validation accuracy
checkpoint = ModelCheckpoint("best_ann_model.h5", monitor="val_accuracy", save_best_only=True, mode="max")

# Train the ANN model
history = model.fit(
    x=X_train,
    y=Y_train,
    epochs=5,            # Running for 5 epochs for better learning
    batch_size=16,       # Using batch size of 16 for stability
    validation_data=(X_test, Y_test),
    callbacks=[checkpoint]  # Save best model based on val_accuracy
)

# Print training results
print("Model training complete. Best model saved as 'best_ann_model.h5'.")

---
### 6.10 Evaluate the Best Model on Test Data

In [None]:
from tensorflow.keras.models import load_model

# Load the best saved model
best_model = load_model("best_ann_model.h5")


In [None]:
# Get model predictions (probabilities for each class)
Y_pred_probs = best_model.predict(X_test)

# Convert probabilities to class labels (0, 1, or 2)
Y_pred_classes = Y_pred_probs.argmax(axis=1)

# Convert one-hot encoded Y_test back to class labels
Y_true_classes = Y_test.argmax(axis=1)


In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Print overall accuracy
accuracy = accuracy_score(Y_true_classes, Y_pred_classes)
print(f"Test Accuracy: {accuracy:.4f}")

# Print classification report (Precision, Recall, F1-Score)
print("Classification Report:")
print(classification_report(Y_true_classes, Y_pred_classes, target_names=["Negative", "Neutral", "Positive"]))  # print() renders the report's line breaks


---

#### Overall Performance:
The Artificial Neural Network (ANN) model achieved 99.25% test accuracy, which indicates that it separates the three sentiment categories well on this dataset. Because the best checkpoint was selected using validation accuracy on the same 25% split that is reported here, this figure is likely a slightly optimistic estimate of performance on truly unseen reviews.

#### Performance Across Sentiment Classes:
The model performs well across all sentiment categories, with high precision, recall, and F1 scores. For negative sentiment, the precision is 98%, meaning that 98% of the reviews classified as negative are indeed negative. The recall for this class is 99%, indicating that nearly all actual negative reviews were correctly identified. The F1-score, the harmonic mean of precision and recall, is 98%.

For neutral sentiment, the precision is 100%: every review the model labeled neutral was in fact neutral. However, the recall is 88%, so 12% of neutral reviews were misclassified as either positive or negative. This lower recall may be due to the limited number of neutral samples in the dataset.

For positive sentiment, the model achieves 100% precision and 99% recall, meaning that nearly all actual positive reviews are correctly classified. The reported F1-score rounds to 100%, confirming that the model is highly reliable in identifying positive sentiment.

#### Macro and Weighted Averages:
The macro-average F1-score of 97% gives each of the three sentiment classes equal weight, so it reflects the weaker recall on the neutral class. The weighted-average F1-score of 99% weights each class by its number of test samples and is therefore dominated by the largest class. It summarizes overall performance but, on its own, does not show whether the minority classes are handled well.
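
The difference between the two averages can be reproduced by hand. The sketch below takes the per-class precision and recall reported above, derives F1 for each class, and then averages them both ways. The `support` counts are hypothetical placeholders (the actual per-class sample counts are not shown in this section) and should be replaced with the "support" column of the real classification report.

```python
# Per-class precision and recall as reported for the ANN above.
reported = {
    "Negative": {"precision": 0.98, "recall": 0.99},
    "Neutral": {"precision": 1.00, "recall": 0.88},
    "Positive": {"precision": 1.00, "recall": 0.99},
}

# Hypothetical test-set sizes per class (assumption, for illustration only).
support = {"Negative": 300, "Neutral": 50, "Positive": 2150}


def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_per_class = {
    name: f1_score_from(vals["precision"], vals["recall"])
    for name, vals in reported.items()
}

# Macro average: every class counts equally.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: each class counts in proportion to its support.
total_support = sum(support.values())
weighted_f1 = sum(f1_per_class[c] * support[c] for c in f1_per_class) / total_support

print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

With a dominant positive class, the weighted figure stays close to the positive-class F1, while the macro figure moves with the weakest class.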

---

## 6.11 RNN with word embeddings

In [None]:
# Word embeddings
# Import necessary libraries
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Extract review text from the dataset
review_corpus = df['review_text'].astype(str).tolist()  # Ensure all text is treated as string

# Define tokenizer parameters specific to Grocery Store Arizona Data
max_words = 7000  # Increased vocabulary size to accommodate grocery-related terms
embedding_dim = 200  # Lower dimension to reduce computation while preserving meaning
max_length = 75  # Adjusted to accommodate longer grocery reviews
trunc_type = "post"  # Truncate from the end of reviews
padding_type = "post"  # Pad shorter reviews at the end

# Initialize and fit the tokenizer
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")  # Handle out-of-vocabulary words
tokenizer.fit_on_texts(review_corpus)

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(review_corpus)

# Pad sequences to ensure uniform input size
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Convert to NumPy array for TensorFlow compatibility
X_embedded = np.array(padded_sequences)

# Wrap the padded token-index sequences in a DataFrame for inspection
# (the word embeddings themselves are learned later by the model's Embedding layer)
embedding_df = pd.DataFrame(X_embedded)

# Display output
print("Shape of padded sequence matrix:", embedding_df.shape)
display(embedding_df.head())  # Shows the first few rows
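
To make the effect of `padding="post"` and `truncating="post"` concrete, here is a minimal pure-Python sketch of the same behaviour on toy data. The helper `pad_post` is hypothetical and written only for illustration; the notebook itself relies on Keras `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Mimic pad_sequences(..., padding="post", truncating="post") on lists of ints."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                   # truncating="post": drop tokens from the end
        seq = seq + [value] * (maxlen - len(seq))  # padding="post": append zeros at the end
        padded.append(seq)
    return padded


# Toy token-index sequences: one shorter and one longer than maxlen.
toy_sequences = [[5, 12, 7], [3, 9, 4, 8, 2, 6]]
toy_padded = pad_post(toy_sequences, maxlen=4)

print(toy_padded)
```

Every row ends up with exactly `maxlen` entries, which is what allows the sequences to be stacked into a single array for the model.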


In [None]:
# Import necessary libraries
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Input
from tensorflow.keras.callbacks import ModelCheckpoint

# Define RNN Model
input_layer = Input(shape=(max_length,), name="input_layer")

# Embedding Layer (Uses word embeddings processed earlier)
embedding_layer = Embedding(
    input_dim=max_words,  # Vocabulary size (7000, as per Grocery Store dataset)
    output_dim=embedding_dim,  # Embedding dimensions (200)
    input_length=max_length,  # Max sequence length (75)
    name="embedding"
)(input_layer)

# RNN Layer
rnn_layer = SimpleRNN(
    units=256,  # Optimized to 256 neurons for sequential text learning
    activation="relu",
    return_sequences=False,  # We only need the final output
    name="rnn_layer"
)(embedding_layer)

# Fully Connected Hidden Layers
hidden_layer1 = Dense(units=256, activation="relu", name="hidden_layer1")(rnn_layer)
hidden_layer2 = Dense(units=128, activation="relu", name="hidden_layer2")(hidden_layer1)

# Output Layer (Multi-class classification)
output_layer = Dense(units=3, activation="softmax", name="output_layer")(hidden_layer2)

# Create the RNN model
model_rnn = Model(inputs=input_layer, outputs=output_layer, name="RNN_Model")

# Compile Model
model_rnn.compile(
    loss="categorical_crossentropy",  # Multi-class classification loss function
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # Optimized learning rate
    metrics=["accuracy"]
)

# Print Model Summary
model_rnn.summary()

In [None]:
# Import necessary libraries
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split

# Split data into 75% training and 25% testing
X_train, X_test, Y_train, Y_test = train_test_split(X_embedded, Y_encoded, test_size=0.25, random_state=42)

# Define ModelCheckpoint (Save in current directory)
checkpoint = ModelCheckpoint("best_rnn_model.h5", monitor="val_accuracy", save_best_only=True, mode="max")

# Train the RNN model
history = model_rnn.fit(
    x=X_train,
    y=Y_train,
    epochs=5,            # Increased to 5 epochs for better learning
    batch_size=16,       # Using batch size of 16 for more stable training
    validation_data=(X_test, Y_test),
    callbacks=[checkpoint]  # Save best model based on validation accuracy
)

# Print training results
print("Model training complete. Best model saved as 'best_rnn_model.h5'.")

---

### 6.12 Evaluation of RNN Model Performance on Test Data

In [None]:
# Load the best saved model
best_rnn_model = load_model("best_rnn_model.h5")
# Generate predictions (probabilities for each class)
Y_pred_probs = best_rnn_model.predict(X_test)

# Convert probabilities to class labels (0 = Negative, 1 = Neutral, 2 = Positive)
Y_pred_classes = Y_pred_probs.argmax(axis=1)

# Convert one-hot encoded Y_test back to class labels
Y_true_classes = Y_test.argmax(axis=1)

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Print overall accuracy
accuracy = accuracy_score(Y_true_classes, Y_pred_classes)
print(f"Test Accuracy: {accuracy:.4f}")

# Generate classification report
print("Classification Report:")
print(classification_report(Y_true_classes, Y_pred_classes, target_names=["Negative", "Neutral", "Positive"]))  # print() renders the report's line breaks

---

#### Overall Accuracy
The RNN model achieved an overall accuracy of 89.40% on the test dataset. This is clearly lower than the ANN with TF-IDF features, and the per-class results below show that most of it comes from the dominant positive class, so the headline accuracy overstates how well the model separates the three sentiments. It also suggests that the SimpleRNN did not make effective use of word order in this setup.
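
One way to put an accuracy figure like this in context is to compare it with the accuracy of always predicting the most frequent class. The sketch below does this with hypothetical class counts, since the actual counts are not shown in this section. The function name `majority_baseline_accuracy` is also just an illustrative choice.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Hypothetical test-set class counts (assumption, for illustration only).
example_counts = {"Negative": 300, "Neutral": 50, "Positive": 2150}

baseline = majority_baseline_accuracy(example_counts)
model_accuracy = 0.8940  # RNN test accuracy reported above

print(f"Majority-class baseline: {baseline:.4f}")
print(f"Model accuracy:          {model_accuracy:.4f}")
print(f"Gain over baseline:      {model_accuracy - baseline:+.4f}")
```

If the gain over the baseline is small, the model has learned little beyond the class distribution, however high the raw accuracy looks.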

#### Precision, Recall, and F1-Score Evaluation
The classification report provides a breakdown of the model’s ability to classify each sentiment category:

#### Negative Sentiment:

- Precision: 0.98 (high confidence in negative predictions)
- Recall: 0.46 (many actual negative reviews were misclassified)
- F1-Score: 0.63 (suggests an imbalance in capturing negative sentiments effectively)
- The model struggles to recall negative instances despite high precision.

#### Neutral Sentiment:

- Precision: 1.00 (perfect precision but on very few samples)
- Recall: 0.06 (fails to correctly identify most neutral samples)
- F1-Score: 0.11 (extremely low, indicating poor recognition of neutral sentiment)
- The model performs poorly in detecting neutral sentiment, likely due to class imbalance in the dataset.
    
#### Positive Sentiment:

- Precision: 0.89 (relatively high, meaning most predicted positives are correct)
- Recall: 1.00 (captures all actual positive instances)
- F1-Score: 0.94 (a strong balance between precision and recall)
- The model is highly effective in recognizing positive sentiments, benefiting from a large sample size in the dataset.
    
#### Class Distribution and Model Bias
The results suggest that the RNN model is heavily biased towards positive sentiment, likely due to the dominance of positive samples in the dataset. While it is highly precise in classifying negative sentiments, it struggles with recall, meaning it fails to correctly identify many actual negative reviews. The model's inability to recall neutral sentiments suggests that neutral samples are too few for the model to learn meaningful patterns.
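
A common way to counter this kind of bias is to weight the loss by inverse class frequency. scikit-learn's `compute_class_weight(class_weight="balanced", ...)` does this with the formula `n_samples / (n_classes * count)`. The sketch below reproduces that formula in plain Python on a hypothetical label vector. The resulting dictionary has the form that Keras accepts for the `class_weight` argument of `fit`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in sorted(counts.items())}


# Hypothetical label vector (0 = Negative, 1 = Neutral, 2 = Positive).
toy_labels = [0] * 30 + [1] * 5 + [2] * 215

weights = balanced_class_weights(toy_labels)
for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

The rarest class receives the largest weight, so each of its samples contributes more to the loss than a sample from the dominant class.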

---

## 6.13 LSTM with word embeddings

In [None]:
# Import necessary libraries
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
import tensorflow as tf

# Define LSTM Model
input_layer = tf.keras.Input(shape=(max_length,), name="input_layer")

# Embedding Layer (Convert word indices into word vectors)
embedding_layer = Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_length, name="embedding")(input_layer)

# LSTM Layer (Increased Units for Better Feature Extraction)
lstm_layer = LSTM(units=256, activation="tanh", return_sequences=False, dropout=0.3, name="lstm_layer")(embedding_layer)

# Fully Connected Hidden Layers (Deeper Network for Enhanced Learning)
hidden_layer1 = Dense(units=128, activation="relu", name="hidden_layer1")(lstm_layer)
hidden_layer2 = Dense(units=64, activation="relu", name="hidden_layer2")(hidden_layer1)
dropout_layer = Dropout(0.3)(hidden_layer2)  # Dropout to prevent overfitting

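# Output Layer (single sigmoid unit, i.e. a binary-classification head)
# NOTE: the labels used for training below take three values (0, 1, 2), so this head
# can never predict class 2. A 3-unit softmax output with
# loss="sparse_categorical_crossentropy" would match the labels.
output_layer = Dense(units=1, activation="sigmoid", name="output_layer")(dropout_layer)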

# Create the LSTM model
model_lstm = tf.keras.Model(inputs=input_layer, outputs=output_layer, name="LSTM_Model")

# Compile Model
model_lstm.compile(
    loss="binary_crossentropy",
    optimizer="adam",  # Adam optimizer is better for deep learning tasks
    metrics=["accuracy"]  # Track accuracy
)

# Print Model Summary
model_lstm.summary()

In [None]:
# Import necessary libraries
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split

# Split data into 75% training and 25% testing
X_train, X_test, Y_train, Y_test = train_test_split(X_embedded, Y, test_size=0.25, random_state=42)

# Define ModelCheckpoint to save the best model based on validation accuracy
checkpoint = ModelCheckpoint("best_lstm_model.h5", monitor="val_accuracy", save_best_only=True, mode="max")

# Train the LSTM model
history = model_lstm.fit(
    x=X_train,
    y=Y_train,
    epochs=5,            # Train for 5 epochs, as for the previous models
    batch_size=16,       # Same batch size as the previous models
    validation_data=(X_test, Y_test),
    callbacks=[checkpoint]  # Save best model based on validation accuracy
)

# Print training results
print("Model training complete. Best model saved as 'best_lstm_model.h5'.")

In [None]:
# Load the best LSTM model
from tensorflow.keras.models import load_model
from sklearn.metrics import classification_report

# Load the saved best model
best_lstm_model = load_model("best_lstm_model.h5")

# Make predictions on the test set
Y_pred = best_lstm_model.predict(X_test)
Y_pred_binary = (Y_pred > 0.5).astype("int32")  # Convert probabilities to binary labels

# Print test accuracy
test_accuracy = best_lstm_model.evaluate(X_test, Y_test, verbose=0)[1]
print(f"Test Accuracy: {test_accuracy:.4f}")

# Generate classification report
lstm_classification_report = classification_report(Y_test, Y_pred_binary, target_names=["Negative", "Neutral", "Positive"])
print("Classification Report:")
print(lstm_classification_report)

---

#### Possible Reasons for Poor LSTM Performance
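The LSTM model performs very poorly, with an overall accuracy of 2.8%. This is far below what even random guessing would achieve on three classes, which points to a problem in the setup rather than a model that merely fails to generalize. The most likely causes are listed below:

#### Mismatch Between Labels and Output Layer
- The labels take three values (0 = Negative, 1 = Neutral, 2 = Positive), but the model ends in a single sigmoid unit trained with binary cross-entropy.
- After thresholding at 0.5 the model can only output 0 or 1, so the "Positive" class can never be predicted. This alone explains its zero recall and F1-score.

#### Poor Word Representation in Embeddings
- The embedding layer is trained from scratch. If the embeddings are not well-trained, the model struggles to understand the context of words.
- The embeddings might not capture enough semantic meaning, leading to incorrect classifications.

#### Imbalanced Data Distribution
- The RNN results suggest that positive reviews dominate the dataset, so the negative and neutral classes contribute comparatively little to the loss.
- If there is an imbalance in training data, the model may not learn enough from underrepresented classes.

#### Ineffective Model Hyperparameters
- The number of layers, number of units, and dropout rate may not be optimized for text classification.
- The LSTM layer already uses the standard tanh activation, so the activation function is unlikely to be the cause. Learning rate, sequence length, and dropout are more plausible candidates for tuning.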
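
One setup detail is worth checking directly against the code above: the model ends in a single sigmoid unit, while the labels take three values (0, 1, 2). The effect of that pairing can be illustrated without Keras. In the sketch below, the labels and raw scores are made-up values. It shows that thresholding one sigmoid output only ever yields 0 or 1, whereas taking the argmax over three softmax scores can reach every class.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def accuracy(truth, preds):
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)


# Made-up true labels: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = [2, 2, 2, 0, 1, 2]

# Single sigmoid unit: one raw score per review, thresholded at 0.5.
sigmoid_logits = [3.0, 2.5, -1.0, -2.0, 0.4, 4.0]
sigmoid_preds = [int(sigmoid(z) > 0.5) for z in sigmoid_logits]

# Three-unit softmax: one score per class, prediction = index of the largest probability.
softmax_logits = [
    [0.1, 0.2, 2.0], [0.0, 0.3, 1.5], [0.2, 0.1, 1.0],
    [2.0, 0.1, 0.3], [0.2, 1.2, 0.4], [0.1, 0.0, 3.0],
]
softmax_preds = []
for row in softmax_logits:
    probs = softmax(row)
    softmax_preds.append(probs.index(max(probs)))

print("Sigmoid predictions:", sigmoid_preds, "accuracy:", round(accuracy(y_true, sigmoid_preds), 3))
print("Softmax predictions:", softmax_preds, "accuracy:", round(accuracy(y_true, softmax_preds), 3))
```

With a sigmoid head, accuracy is capped by the share of reviews whose true label is 0 or 1, however well the rest of the network is trained.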

---

## 6.14 Improved LSTM Model with Pre-trained Word Embeddings and Optimized Parameters

In [None]:
import os
print(os.listdir('.'))  # Lists all files in the current directory

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

In [None]:
!unzip -n glove.6B.zip glove.6B.300d.txt

In [None]:
# Ensure the correct embedding dimension is set
embedding_dim = 300  # Change to 200 if using glove.6B.200d.txt
# Load GloVe embeddings
embedding_index = {}
with open("glove.6B.300d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        embedding_index[word] = np.array(values[1:], dtype="float32")

# Create embedding matrix
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_words:
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
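
Before relying on the pre-trained vectors, it can help to check how much of the tokenizer vocabulary GloVe actually covers, since words without a match keep an all-zero row in the embedding matrix. The sketch below uses small stand-in dictionaries (`toy_word_index`, `toy_embedding_index`) in place of `tokenizer.word_index` and `embedding_index`. The helper name `embedding_coverage` is hypothetical.

```python
def embedding_coverage(word_index, embedding_index, max_words):
    """Return (fraction covered, missing words) for vocabulary indices below max_words."""
    in_vocab = [word for word, idx in word_index.items() if idx < max_words]
    missing = [word for word in in_vocab if word not in embedding_index]
    covered = 1.0 - len(missing) / len(in_vocab) if in_vocab else 0.0
    return covered, missing


# Stand-in data (assumption): indices start at 1, as in a Keras Tokenizer.
toy_word_index = {"<OOV>": 1, "store": 2, "fresh": 3, "fry's": 4, "produce": 5}
toy_embedding_index = {"store": [0.1], "fresh": [0.2], "produce": [0.3]}

coverage, missing_words = embedding_coverage(toy_word_index, toy_embedding_index, max_words=5)

print(f"Coverage: {coverage:.0%}")
print("Missing:", missing_words)
```

In the notebook, the same function could be called with `tokenizer.word_index`, `embedding_index`, and `max_words` to see which frequent review terms have no pre-trained vector.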

In [None]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

# Define the LSTM model
input_layer = tf.keras.Input(shape=(max_length,), name="input_layer")

# Embedding Layer with Pre-trained Weights
embedding_layer = Embedding(input_dim=max_words,
                            output_dim=embedding_dim,
                            input_length=max_length,
                            weights=[embedding_matrix],
                            trainable=False,
                            name="embedding")(input_layer)

# First LSTM Layer (Bidirectional for better learning)
lstm_layer1 = Bidirectional(LSTM(units=256, activation="tanh", return_sequences=True, name="lstm_layer1"))(embedding_layer)
dropout1 = Dropout(0.3)(lstm_layer1)  # Dropout for regularization

# Second LSTM Layer
lstm_layer2 = LSTM(units=128, activation="tanh", return_sequences=False, name="lstm_layer2")(dropout1)
dropout2 = Dropout(0.3)(lstm_layer2)

# Fully Connected Hidden Layers
hidden_layer1 = Dense(units=128, activation="relu", name="hidden_layer1")(dropout2)
hidden_layer2 = Dense(units=64, activation="relu", name="hidden_layer2")(hidden_layer1)

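# Output Layer (single sigmoid unit, i.e. a binary-classification head)
# NOTE: as in section 6.13, the labels take three values (0, 1, 2), so this head
# can never predict class 2. A 3-unit softmax output with
# loss="sparse_categorical_crossentropy" would match the labels.
output_layer = Dense(units=1, activation="sigmoid", name="output_layer")(hidden_layer2)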

# Create the LSTM model
model_lstm = tf.keras.Model(inputs=input_layer, outputs=output_layer, name="Optimized_LSTM_Model")

# Compile Model with Adam Optimizer
model_lstm.compile(
    loss="binary_crossentropy",
    optimizer="adam",  # Adam works better than SGD for text data
    metrics=["accuracy"]
)

# Print Model Summary
model_lstm.summary()

In [None]:
# Handle Class Imbalance with Class Weights
from sklearn.utils.class_weight import compute_class_weight

# Compute class weights
class_weights = compute_class_weight(class_weight="balanced", classes=np.unique(Y_train), y=Y_train)
class_weights_dict = dict(enumerate(class_weights))

print("Class Weights:", class_weights_dict)

In [None]:
# Train the Improved LSTM Model
from tensorflow.keras.callbacks import ModelCheckpoint

# Reuse the 75% / 25% split created in section 6.13 (X_train, X_test, Y_train, Y_test)

# Define ModelCheckpoint to save the best model based on validation accuracy
# (this overwrites the checkpoint written in section 6.13, which section 6.15 then loads)
checkpoint = ModelCheckpoint("best_lstm_model.h5", monitor="val_accuracy", save_best_only=True, mode="max")

# Train the LSTM model
history = model_lstm.fit(
    x=X_train,
    y=Y_train,
    epochs=10,            # 10 epochs, twice as many as for the first LSTM
    batch_size=16,        # Same batch size as the earlier models
    validation_data=(X_test, Y_test),
    class_weight=class_weights_dict,  # Apply class weights
    callbacks=[checkpoint]  # Save best model
)

# Print training results
print("Model training complete. Best model saved as 'best_lstm_model.h5'.")

---

### 6.15 Evaluate LSTM Model Performance on Test Data

In [None]:
# Load the best LSTM model
from tensorflow.keras.models import load_model
from sklearn.metrics import classification_report

# Load the saved best model
best_lstm_model = load_model("best_lstm_model.h5")

# Make predictions on the test set
Y_pred = best_lstm_model.predict(X_test)
Y_pred_binary = (Y_pred > 0.5).astype("int32")  # Convert probabilities to binary labels

# Print test accuracy
test_accuracy = best_lstm_model.evaluate(X_test, Y_test, verbose=0)[1]
print(f"Test Accuracy: {test_accuracy:.4f}")

# Generate classification report
lstm_classification_report = classification_report(Y_test, Y_pred_binary, target_names=["Negative", "Neutral", "Positive"])
print("Classification Report:")
print(lstm_classification_report)

#### Overall Performance
The LSTM model performed poorly on the test dataset, with a test accuracy of only 8.35%. As in section 6.13, the network ends in a single sigmoid unit while the labels have three classes, so the "Positive" class can never be predicted and accuracy is capped by the share of non-positive reviews. Pre-trained embeddings and class weights cannot compensate for this mismatch.

#### Precision, Recall, and F1-Score Analysis
- Negative Sentiment: The model achieved a precision of 0.60 and a recall of 0.45, meaning it identifies some negative reviews but misses more than half of them.
- Neutral Sentiment: The recall is 0.47 but the precision is close to zero, meaning that most reviews the model labeled neutral actually belong to another class.
- Positive Sentiment: Precision, recall, and F1-score are all 0.00. No positive review was classified correctly, which is expected because the thresholded sigmoid output only produces the labels 0 and 1.

#### Accuracy and Weighted Scores
- The overall accuracy is extremely low (8%), meaning that the model is making incorrect predictions for the vast majority of the test data.
- Macro and weighted averages are also very low. The zero scores on the positive class pull both averages down, so they say little about what the model learned for the other two classes.

---

## 6.16 One-Hot Vector Encoding + RNN

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Define tokenizer parameters
max_words = 5000  # Vocabulary size (adjusted for project)
max_length = 50   # Max sequence length (fixed to 50 words)

# Initialize Tokenizer
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(df['review_text'])  # Ensure to use the correct column from your dataset

# Convert reviews to sequences
sequences = tokenizer.texts_to_sequences(df["review_text"])

# Pad sequences to ensure uniform input size
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding="post", truncating="post")

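# Convert to One-Hot Encoding
# float32 halves the memory of NumPy's default float64; the array holds
# len(padded_sequences) * max_length * max_words values
X_onehot = np.zeros((len(padded_sequences), max_length, max_words), dtype="float32")
for i, sequence in enumerate(padded_sequences):
    for j, word_index in enumerate(sequence):
        if word_index < max_words:
            X_onehot[i, j, word_index] = 1  # One-hot encode each word

# Convert labels
# Binary cross-entropy expects labels in {0, 1}; verify the values in this column before training
Y = np.array(df["Sentiment"])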


In [None]:
import tensorflow as tf
from tensorflow.keras.layers import SimpleRNN, Dense

# Define the input layer with one-hot encoding (sequence length, vocabulary size)
input_layer = tf.keras.Input(shape=(50, 5000), name="input_layer")  # 50 words per review, vocabulary size of 5000

# Simple RNN Layer
rnn_layer = SimpleRNN(units=512, activation="tanh", return_sequences=False, name="rnn_layer")(input_layer)

# Fully Connected Hidden Layers
hidden_layer1 = Dense(units=256, activation="relu", name="hidden_layer1")(rnn_layer)
hidden_layer2 = Dense(units=128, activation="relu", name="hidden_layer2")(hidden_layer1)

# Output Layer (Binary classification)
output_layer = Dense(units=1, activation="sigmoid", name="output_layer")(hidden_layer2)

# Create the RNN model
model_rnn_onehot = tf.keras.Model(inputs=input_layer, outputs=output_layer, name="RNN_OneHot_Model")

# Compile the Model
model_rnn_onehot.compile(
    loss="binary_crossentropy",
    optimizer="adam",  # Adam optimizer for better convergence
    metrics=[
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.BinaryAccuracy(name="accuracy"),
    ],
)

# Print Model Summary
model_rnn_onehot.summary()

# Visualize Model Architecture
tf.keras.utils.plot_model(model_rnn_onehot, show_shapes=True)



In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split

# Split data into 75% training and 25% testing
X_train, X_test, Y_train, Y_test = train_test_split(X_onehot, Y, test_size=0.25, random_state=42)

# Define ModelCheckpoint to save the best model based on validation accuracy
checkpoint = ModelCheckpoint(
    "best_rnn_onehot_model.h5", monitor="val_accuracy", save_best_only=True, mode="max"
)

# Train the RNN model with one-hot vector encoding
history = model_rnn_onehot.fit(
    x=X_train,
    y=Y_train,
    epochs=5,            # Running for 5 epochs for better learning
    batch_size=16,       # Using batch size of 16 for stable training
    validation_data=(X_test, Y_test),
    callbacks=[checkpoint]  # Save best model based on validation accuracy
)

# Print training results
print("Model training complete. Best model saved as 'best_rnn_onehot_model.h5'.")


#### Training Accuracy and Loss Behavior
The training accuracy remains extremely low throughout all five epochs. It starts at 0.01 (1%) and drops further in later epochs. The loss values fluctuate and become negative in later epochs. Binary cross-entropy cannot be negative when the labels are 0 or 1 and the predictions are probabilities, so negative values indicate that the label column contains values outside that range (for example multi-class codes or a continuous score). The model is therefore being trained against targets that its loss function cannot represent.
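
The sign of the loss can be checked by hand. The sketch below evaluates the binary cross-entropy formula for a single prediction. The label value 2 is a hypothetical example of an out-of-range target, not a value confirmed to be in the dataset.

```python
import math


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Binary cross-entropy for one sample, with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))


# Valid labels (0 or 1) always give a non-negative loss.
loss_label_1 = binary_cross_entropy(1, 0.99)
loss_label_0 = binary_cross_entropy(0, 0.99)

# An out-of-range label such as 2 can drive the loss below zero.
loss_label_2 = binary_cross_entropy(2, 0.99)

print(f"label=1, p=0.99 -> loss {loss_label_1:.4f}")
print(f"label=0, p=0.99 -> loss {loss_label_0:.4f}")
print(f"label=2, p=0.99 -> loss {loss_label_2:.4f}")
```

A quick look at the distinct values in the label array before training is enough to rule this problem in or out.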

#### Validation Performance
The validation accuracy remains fixed at 0.0085 (0.85%). This is far below the roughly 50% that random guessing would give on a balanced binary task, which again points to a problem with the labels rather than ordinary underfitting. The validation loss fluctuates dramatically, turning negative in some epochs and eventually reaching 56.6585. This instability suggests that the model parameters are diverging rather than converging.

#### Precision and Recall Issues
Precision and recall are 0.0000e+00 in every epoch, meaning that the model never produces a true positive. This is a strong indication of model failure, potentially due to:

- Vanishing gradients in the SimpleRNN layer, leading to ineffective weight updates.
- Poorly initialized parameters or an unsuitable activation function for learning long-term dependencies.
- Incompatibility with One-Hot Encoding where a large vocabulary size leads to inefficient learning.

#### Possible Causes for Poor Performance
1. Choice of SimpleRNN Instead of LSTM or GRU
   - SimpleRNN suffers from vanishing gradients, making it difficult to capture sequential dependencies in long text sequences.
   - LSTM or GRU models are better suited for this task.
2. High Dimensionality in One-Hot Encoding
   - The model's input shape (50, 5000) results in very sparse and high-dimensional representations.
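
To give a sense of scale, the sketch below estimates the memory needed for a one-hot tensor of shape `(n_reviews, 50, 5000)` and compares it with storing the same reviews as integer token indices, which is what an `Embedding` layer consumes. The review count is a hypothetical figure, since the dataset size is not stated in this section. The estimate assumes 4 bytes per value, and NumPy's default float64 would double the one-hot figure.

```python
def array_mebibytes(shape, bytes_per_value):
    """Size in MiB of a dense array with the given shape and item size."""
    n_values = 1
    for dim in shape:
        n_values *= dim
    return n_values * bytes_per_value / 1024 ** 2


n_reviews = 10_000   # hypothetical dataset size (assumption)
max_length = 50      # tokens per review, as in the model above
max_words = 5000     # vocabulary size, as in the model above

onehot_mib = array_mebibytes((n_reviews, max_length, max_words), bytes_per_value=4)  # float32
index_mib = array_mebibytes((n_reviews, max_length), bytes_per_value=4)              # int32
ratio = onehot_mib / index_mib

print(f"One-hot tensor (float32): {onehot_mib:,.0f} MiB")
print(f"Integer indices (int32):  {index_mib:,.2f} MiB")
print(f"Ratio: {ratio:,.0f}x")
```

The ratio equals the vocabulary size, because each token index is expanded into a vector with one entry per vocabulary word.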

---

## 7. Bidirectional LSTM (BiLSTM) Model with Pre-trained GloVe Embeddings

In [None]:
print("Unique values in sentiment column:", df["sentiment"].unique())

In [None]:
df["sentiment"] = df["sentiment"].astype(str).str.lower()

In [None]:
from tensorflow.keras.utils import to_categorical
import numpy as np

# Ensure valid labels
df = df[df["sentiment"].isin(["negative", "neutral", "positive"])]

# Convert labels to numeric values
label_mapping = {"negative": 0, "neutral": 1, "positive": 2}
Y = df["sentiment"].map(label_mapping).astype(int)  # Ensure correct data type

# Convert to one-hot encoding
Y = to_categorical(Y, num_classes=3)

print("Shape of Y:", Y.shape)  # Should be (num_samples, 3)

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import numpy as np

# Define tokenizer parameters
max_words = 5000  # Vocabulary size
max_length = 50   # Maximum sequence length

# Initialize Tokenizer
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(df["review_text"])  # Tokenize the text column

# Convert text to sequences of integers
sequences = tokenizer.texts_to_sequences(df["review_text"])

# Pad sequences to ensure uniform input size
X_padded = pad_sequences(sequences, maxlen=max_length, padding="post", truncating="post")

# Convert labels to categorical format (One-Hot Encoding)
label_mapping = {"negative": 0, "neutral": 1, "positive": 2}
Y = np.array(df["sentiment"].map(label_mapping))  # Convert text labels to integers
Y = to_categorical(Y, num_classes=3)  # Convert to one-hot encoding

print("Shape of X_padded:", X_padded.shape)  # Should be (num_samples, 50)
print("Shape of Y:", Y.shape)  # Should be (num_samples, 3) for three sentiment classes

In [None]:
import tensorflow as tf

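# Rebuild the embedding matrix for the current tokenizer and vocabulary size, so that
# each row lines up with this tokenizer's word indices
# (reuses embedding_index, the GloVe lookup loaded in section 6.14)
embedding_matrix = np.zeros((max_words, 300))
for word, i in tokenizer.word_index.items():
    if i < max_words and word in embedding_index:
        embedding_matrix[i] = embedding_index[word]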

# BiLSTM model with multi-class classification
bilstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=max_words,  # Adjusted to match embedding_matrix
        output_dim=300,  # GloVe embedding dimensions
        input_length=max_length,  # Sequence length
        weights=[embedding_matrix],  # Use pre-trained GloVe embeddings
        trainable=False  # Keep embeddings fixed
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=128, dropout=0.3, return_sequences=True)),  # First BiLSTM Layer
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=64, dropout=0.3)),  # Second BiLSTM Layer
    tf.keras.layers.Dense(64, activation="relu"),  # Fully connected layer
    tf.keras.layers.Dropout(0.3),  # Regularization
    tf.keras.layers.Dense(32, activation="relu"),  # Additional feature extraction layer
    tf.keras.layers.Dense(3, activation="softmax")  # Output Layer (3 classes: Negative, Neutral, Positive)
])

# Compile the model for multi-class classification
bilstm.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy", tf.keras.metrics.Precision(name="precision"), tf.keras.metrics.Recall(name="recall")]
)

# Model summary
bilstm.summary()

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split

# Split data into 75% training and 25% testing
X_train, X_test, Y_train, Y_test = train_test_split(X_padded, Y, test_size=0.25, random_state=42)

# Define ModelCheckpoint to save the best model based on validation accuracy
checkpoint = ModelCheckpoint(
    "best_bilstm_model_multiclass.h5", monitor="val_accuracy", save_best_only=True, mode="max"
)

# Train the model
history = bilstm.fit(
    x=X_train,
    y=Y_train,
    epochs=10,  # Increased epochs for small dataset training
    batch_size=32,  # Adjusted batch size for stability
    validation_data=(X_test, Y_test),
    callbacks=[checkpoint]
)

# Print training completion message
print("Model training complete. Best model saved as 'best_bilstm_model_multiclass.h5'.")

In [None]:
# Evaluate the BiLSTM Model on Test Data
from tensorflow.keras.models import load_model
from sklearn.metrics import classification_report, accuracy_score

# Load the best saved model
best_bilstm_model = load_model("best_bilstm_model_multiclass.h5")

# Make predictions on test data
Y_pred_probs = best_bilstm_model.predict(X_test)

# Convert probabilities to class labels (argmax for multi-class classification)
Y_pred = Y_pred_probs.argmax(axis=1)
Y_true = Y_test.argmax(axis=1)  # Convert one-hot encoded labels back to class numbers

# Compute accuracy
test_accuracy = accuracy_score(Y_true, Y_pred)
print(f"Test Accuracy: {test_accuracy:.4f}")

# Generate classification report
bilstm_classification_report = classification_report(Y_true, Y_pred, target_names=["Negative", "Neutral", "Positive"])
print("Classification Report:\n", bilstm_classification_report)


#### Overall Model Performance
The Bidirectional LSTM (BiLSTM) model achieved a test accuracy of 99.15%, essentially matching the ANN with TF-IDF features (99.25%) and far exceeding the earlier recurrent models. This indicates that it is highly effective at classifying sentiment into the Negative, Neutral, and Positive categories. As with the other models, the checkpoint was selected on the same split that is used for testing, so the figure is best read as an upper estimate.

#### Precision, Recall, and F1-Score
The model exhibits exceptionally high precision across all classes. Precision refers to how many of the predicted instances of a class are actually correct. The Neutral class achieved perfect precision (1.00), meaning that whenever the model predicted a review as Neutral, it was indeed Neutral. The Negative and Positive classes also show high precision scores of 0.98 and 0.99, respectively.

The recall, which measures how well the model captures all actual instances of a class, is slightly lower for Neutral (0.88) compared to Negative (0.97) and Positive (1.00). This suggests that some Neutral examples may have been misclassified as either Negative or Positive. However, the model correctly identified all Positive reviews, with a recall of 1.00.

The F1-score, which balances precision and recall, is extremely high for all classes. The Negative and Positive classes both have nearly perfect F1-scores, while the Neutral class has a slightly lower score of 0.94 due to its recall drop.
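
The definitions used above can be made concrete with a small confusion matrix. The counts below are hypothetical, chosen only so that the arithmetic roughly reproduces the reported per-class figures. Rows are true classes and columns are predicted classes.

```python
CLASS_NAMES = ["Negative", "Neutral", "Positive"]

# Hypothetical confusion matrix: confusion[i][j] = reviews of true class i predicted as class j.
confusion = [
    [97, 0, 3],    # true Negative
    [2, 22, 1],    # true Neutral
    [0, 0, 375],   # true Positive
]


def per_class_metrics(matrix):
    """Compute (precision, recall, F1) for each class from a confusion matrix."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum: everything predicted as k
        actual_k = sum(matrix[k])                          # row sum: everything that truly is k
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append((precision, recall, f1))
    return results


metrics = per_class_metrics(confusion)
for name, (precision, recall, f1) in zip(CLASS_NAMES, metrics):
    print(f"{name:<9} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy matrix the Neutral column contains only true Neutral reviews (perfect precision), while three of the 25 true Neutral reviews land in other columns, which is where the recall of 0.88 comes from.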

#### Macro and Weighted Averages
The macro average (which gives equal weight to each class) results in a recall of 0.96, reflecting the slight recall drop in the Neutral category. The weighted average, which accounts for class distribution, remains at 0.99, confirming the model’s strong overall performance.
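
The difference between the two averages can be illustrated with a small sketch. The per-class recalls and class sizes below are hypothetical values chosen for illustration, not the notebook's results.

```python
# Hypothetical per-class recall and support (number of true examples per class)
recalls = {"Negative": 0.90, "Neutral": 0.50, "Positive": 1.00}
supports = {"Negative": 20, "Neutral": 2, "Positive": 78}

# Macro average: every class counts equally, so a weak minority class pulls it down
macro_recall = sum(recalls.values()) / len(recalls)

# Weighted average: each class is weighted by its share of the examples
total = sum(supports.values())
weighted_recall = sum(recalls[c] * supports[c] for c in recalls) / total

print(f"Macro recall:    {macro_recall:.2f}")
print(f"Weighted recall: {weighted_recall:.2f}")
```

With a small Neutral class, the weighted average stays close to the majority-class recall while the macro average exposes the weaker class.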

## Predict Customer Sentiment Under Business Changes

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Get unique store names
stores = df["name"].unique()

# Define expanded scenario reviews
scenario_reviews_template = {
    "Original": [
        "The store has good deals and fresh produce.",
        "The prices are decent and staff is friendly.",
        "I like the variety but sometimes items are missing.",
        "A decent shopping experience, but could improve in some areas.",
        "Love the cleanliness and organization of the store.",
        "Fresh vegetables and meat are always available.",
        "I can find almost everything I need, great selection!",
        "Service is okay, but checkout lines can be long.",
        "The store layout makes it easy to find products.",
        "A reliable place to shop for groceries every week."
    ],
    "Price Increase": [
        "The store increased prices and now it’s too expensive.",
        "Everything is overpriced now, not worth it.",
        "Higher prices but same quality, not fair!",
        "I used to shop here often, but now it’s just too expensive.",
        "Price hikes are ridiculous, I’m looking for alternatives.",
        "They raised prices on essential items, very disappointing.",
        "The quality is still good, but the prices are frustrating.",
        "Why are prices increasing every few weeks?",
        "It used to be affordable, now it’s just too costly.",
        "I don’t shop here as much anymore due to the price jumps."
    ],
    "Price Decrease": [
        "The store lowered prices, now it’s much more affordable!",
        "Great discounts, love shopping here now.",
        "Prices went down and I can buy more for the same budget.",
        "This place just became my go-to store for groceries!",
        "More savings means I can get extra items every visit.",
        "Happy to see better deals and fair pricing again.",
        "Affordable groceries make a big difference, love it.",
        "Shopping here now feels like a bargain.",
        "The discounts on fresh produce are a great improvement.",
        "Finally, a store that values its customers with fair pricing!"
    ],
    "Inventory Issues": [
        "They are always out of stock on essential items.",
        "I can never find what I need, shelves are empty.",
        "Stock issues are frustrating, they need better supply.",
        "Why is milk always out of stock?",
        "I came here for a few basics and left empty-handed.",
        "This store needs better inventory management.",
        "Out of stock signs everywhere, so frustrating.",
        "It’s impossible to do weekly shopping here anymore.",
        "I went to three different locations and still no stock!",
        "If they don’t fix inventory issues, I’m switching stores."
    ]
}

# Tokenize and pad the scenario reviews once. The template reviews do not mention
# a store, so the predictions (and therefore the chart) are identical for every store.
all_scenario_reviews = sum(scenario_reviews_template.values(), [])
scenario_sequences = tokenizer.texts_to_sequences(all_scenario_reviews)
scenario_padded = pad_sequences(scenario_sequences, maxlen=max_length, padding="post")

# Predict sentiment using the trained BiLSTM model
scenario_probs = best_bilstm_model.predict(scenario_padded)
scenario_preds = scenario_probs.argmax(axis=1)

# Map numeric predictions back to labels
label_mapping_inv = {0: "Negative", 1: "Neutral", 2: "Positive"}
scenario_sentiments = [label_mapping_inv[pred] for pred in scenario_preds]

# Produce one chart per store
for store in stores:
    print(f"\n🔍 Simulating Business Scenarios for {store}...\n")

    # Organize results into a dictionary
    scenario_results = {}
    index = 0
    for scenario, reviews in scenario_reviews_template.items():
        scenario_results[scenario] = [scenario_sentiments[index + i] for i in range(len(reviews))]
        index += len(reviews)

    # Count sentiment distribution per scenario
    sentiment_counts = {scenario: {"Negative": 0, "Neutral": 0, "Positive": 0} for scenario in scenario_results}
    for scenario, sentiments in scenario_results.items():
        for sentiment in sentiments:
            sentiment_counts[scenario][sentiment] += 1

    # Convert results into a visualization
    fig, ax = plt.subplots(figsize=(10, 5))
    scenarios = list(sentiment_counts.keys())
    negative_counts = [sentiment_counts[sc]["Negative"] for sc in scenarios]
    neutral_counts = [sentiment_counts[sc]["Neutral"] for sc in scenarios]
    positive_counts = [sentiment_counts[sc]["Positive"] for sc in scenarios]

    bar_width = 0.25  # Three bars per group must fit within the unit spacing between ticks
    x = np.arange(len(scenarios))

    ax.bar(x - bar_width, negative_counts, bar_width, label="Negative", color="red")
    ax.bar(x, neutral_counts, bar_width, label="Neutral", color="gray")
    ax.bar(x + bar_width, positive_counts, bar_width, label="Positive", color="green")

    ax.set_xticks(x)
    ax.set_xticklabels(scenarios, rotation=15)
    ax.set_ylabel("Sentiment Count")
    ax.set_title(f"Predicted Sentiment Shift Under Business Changes - {store}")
    ax.legend()
    plt.show()

---

## 7.1 Comparison of results we got by evaluating the models on test data

In [None]:
import pandas as pd

# Create a dictionary with model performance metrics
comparison_data = {
    "Model": [
        "ANN + TF-IDF",
        "RNN + Word Embeddings",
        "LSTM + Word Embeddings",
        "Improved LSTM + GloVe",
        "One-Hot Encoding + RNN",
        "BiLSTM + GloVe"
    ],
    "Accuracy": [99.25, 89.40, 2.80, 8.35, 0.85, 99.15],
    "Precision": [99, 90, 97, 60, 0, 99],
    "Recall": [99, 89, 11, 45, 0, 99],
    "F1-Score": [99, 88, 20, 51, 0, 99]
}

# Convert to DataFrame
comparison_df = pd.DataFrame(comparison_data)

# Display results
from IPython.display import display
display(comparison_df)


### Analysis of Model Performance
Based on the results from the comparison table, we can analyze the performance of each model in terms of accuracy, precision, recall, and F1-score.

#### Best Performing Models
1. ANN + TF-IDF

- Accuracy: 99.25% | Precision: 99 | Recall: 99 | F1-Score: 99
- This model performed exceptionally well across all metrics. Combining Artificial Neural Networks (ANN) with Term Frequency-Inverse Document Frequency (TF-IDF) effectively captured patterns in the data.
- This method is fast and efficient for structured text classification, making it a strong candidate for further tasks like topic modeling if supplemented with topic extraction techniques.

2. BiLSTM + GloVe

- Accuracy: 99.15% | Precision: 99 | Recall: 99 | F1-Score: 99
- The Bidirectional Long Short-Term Memory (BiLSTM) with Pre-trained GloVe Embeddings also achieved outstanding results, slightly behind ANN + TF-IDF.
- The BiLSTM model, leveraging GloVe embeddings, can capture contextual word meanings and dependencies, making it ideal for tasks involving sequential text dependencies.
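
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch using the plain formula (term frequency times log of inverse document frequency). The reviews are made up, and the formula is a simplified variant; library implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization.

```python
import math
from collections import Counter

# Toy reviews, made up for illustration
reviews = [
    "fresh produce and friendly staff",
    "long checkout lines and rude staff",
    "fresh bread and great prices",
]
tokenized = [r.split() for r in reviews]
n_docs = len(tokenized)

# Document frequency: in how many reviews does each word appear?
doc_freq = Counter(word for doc in tokenized for word in set(doc))


def tfidf(doc):
    """Plain TF-IDF: term frequency times log(N / document frequency)."""
    counts = Counter(doc)
    return {w: (c / len(doc)) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}


weights = tfidf(tokenized[1])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

Words that appear in every review (here "and") receive a weight of zero, while words unique to one review receive the highest weight, which is what lets a simple classifier focus on distinctive terms.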
                                                                                                                       
#### Poor Performing Models
1. LSTM + Word Embeddings (Accuracy: 2.80%)

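- Despite using word embeddings, this model struggled significantly. An accuracy of 2.80% is far below what uniform random guessing achieves on three classes (about 33%), so a small dataset alone is unlikely to explain it. Poor training convergence, incorrect parameter tuning, or a pipeline issue such as mismatched label encoding are more plausible causes and should be checked before drawing conclusions.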
- Not suitable for topic modeling due to its very low recall and accuracy.
              
2. One-Hot Encoding + RNN (Accuracy: 0.85%)

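- This model also failed, as it cannot capture rich semantic relationships between words. Its accuracy is also well below chance for three classes, which points to a training or label-handling problem in addition to the weak input representation.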
- One-hot encoding does not retain word meanings, making it ineffective for deep learning models like RNNs.
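
The point about one-hot encoding can be sketched numerically: any two distinct one-hot vectors have a cosine similarity of zero, so related words look exactly as unrelated as unrelated ones. The dense vectors below are hand-picked, hypothetical values for illustration, not trained embeddings.

```python
import math

vocab = ["cheap", "affordable", "rude"]


def one_hot(word):
    """Return a one-hot vector for a word in the toy vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# One-hot: every pair of distinct words is orthogonal
sim_one_hot = cosine(one_hot("cheap"), one_hot("affordable"))

# Hypothetical 3-dimensional dense embeddings (hand-picked, not trained vectors)
dense = {
    "cheap": [0.9, 0.1, 0.0],
    "affordable": [0.8, 0.2, 0.1],
    "rude": [0.0, 0.1, 0.9],
}
sim_dense_related = cosine(dense["cheap"], dense["affordable"])
sim_dense_unrelated = cosine(dense["cheap"], dense["rude"])

print(f"one-hot cheap vs affordable: {sim_one_hot:.2f}")
print(f"dense   cheap vs affordable: {sim_dense_related:.2f}")
print(f"dense   cheap vs rude:       {sim_dense_unrelated:.2f}")
```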
                                                
#### Top 2 Models for Topic Modeling
For topic modeling, we need models that can accurately capture the structure and meaning of text, ensuring strong contextual understanding.

1. BiLSTM + GloVe

- Since topic modeling often requires context-aware text representations, BiLSTM is an excellent choice as it captures both past and future word dependencies.
- GloVe embeddings help with semantic representation, which is crucial for distinguishing topics.
    
2. ANN + TF-IDF

- While ANN is a simple and effective model, TF-IDF provides a robust way to extract important words from the text.
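- This method is fast, efficient, and works well for topic modeling, especially when the TF-IDF matrix is passed to a topic-extraction technique such as NMF (Non-negative Matrix Factorization). LDA (Latent Dirichlet Allocation) is another option, although it is usually fit on raw term counts rather than TF-IDF weights.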

---

## 7.2 Sentiment classification for each store

In [None]:
# Load the best trained BiLSTM model
import tensorflow as tf

best_bilstm_model = tf.keras.models.load_model("best_bilstm_model_multiclass.h5")

# Padding length (the tokenizer fitted during training is reused below)
max_length = 50  # Ensure this matches your training setting


# Tokenize and pad sequences
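# Missing reviews are replaced with empty strings so the tokenizer only sees text
sequences = tokenizer.texts_to_sequences(df["review_text"].fillna("").astype(str))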
X_padded = pad_sequences(sequences, maxlen=max_length, padding="post", truncating="post")

# Predict sentiment probabilities
Y_pred_probs = best_bilstm_model.predict(X_padded)

# Convert probabilities to class labels (argmax for multi-class)
Y_pred = Y_pred_probs.argmax(axis=1)  # 0: Negative, 1: Neutral, 2: Positive

# Map predictions to labels
sentiment_labels = {0: "Negative", 1: "Neutral", 2: "Positive"}
df["Predicted_Sentiment"] = [sentiment_labels[label] for label in Y_pred]

# Split data by store
stores = ["Fry's", "Safeway", "Target", "Trader Joe's"]
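# .copy() makes each subset independent, so later column assignments do not trigger SettingWithCopyWarning
store_dfs = {store: df[df["name"] == store].copy() for store in stores}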

# Save sentiment-labeled datasets per store
for store, store_df in store_dfs.items():
    store_df.to_csv(f"{store}_sentiment_labeled.csv", index=False)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot sentiment distribution per store (with percentage labels)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.flatten()

for i, store in enumerate(store_dfs.keys()):
    store_df = store_dfs[store]
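    # Percentages in a fixed class order so the four panels are directly comparable
    sentiment_counts = (
        store_df["Predicted_Sentiment"].value_counts(normalize=True)
        .reindex(["Negative", "Neutral", "Positive"], fill_value=0) * 100
    )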

    ax = sns.barplot(x=sentiment_counts.index, y=sentiment_counts.values, palette="coolwarm", ax=axes[i])
    axes[i].set_title(f"Sentiment Distribution - {store}")
    axes[i].set_xlabel("Sentiment")
    axes[i].set_ylabel("Percentage")

    # Adding percentage labels on bars
    for p in ax.patches:
        ax.annotate(f'{p.get_height():.1f}%', (p.get_x() + p.get_width() / 2, p.get_height()),
                    ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

---

## 7.3 Sentiment Trend Over Time (Time Series Analysis)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Convert review_date to datetime format
df["review_date"] = pd.to_datetime(df["review_date"])

# Aggregate sentiment counts over time (monthly)
df["month"] = df["review_date"].dt.to_period("M")  # Convert to month format

# Plot sentiment trends per store (one figure per store)
for store in stores:
    store_df = df[df["name"] == store]
    # fill_value=0 keeps months with no reviews of a given sentiment on the chart
    sentiment_trend = store_df.groupby(["month", "Predicted_Sentiment"]).size().unstack(fill_value=0)
    ax = sentiment_trend.plot(kind="line", marker="o", figsize=(12, 6), title=f"Sentiment Trend - {store}")
    ax.set_xlabel("Month")
    ax.set_ylabel("Review Count")
    ax.legend(title="Sentiment")
    plt.show()

---

## 7.4 Word Cloud for Each Sentiment Category

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Define stores
stores = ["Fry's", "Safeway", "Target", "Trader Joe's"]

# Generate word clouds for each sentiment within each store
for store in stores:
    store_df = df[df["name"] == store]  # Filter data for the specific store

    for sentiment in ["Positive", "Neutral", "Negative"]:
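        # Drop missing reviews so str.join only receives strings
        sentiment_reviews = store_df.loc[store_df["Predicted_Sentiment"] == sentiment, "review_text"]
        text = " ".join(sentiment_reviews.dropna().astype(str))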

        if text.strip():  # Only generate word cloud if text is available
            wordcloud = WordCloud(width=800, height=400, background_color="white").generate(text)

            # Plot
            plt.figure(figsize=(10, 5))
            plt.imshow(wordcloud, interpolation="bilinear")
            plt.axis("off")
            plt.title(f"Word Cloud - {sentiment} Reviews ({store})")
            plt.show()


---

## 7.5 Predicting Future Store Ratings Using Past Sentiment Trends

In [None]:
!pip install prophet
from prophet import Prophet

# Convert review_date to datetime
df["review_date"] = pd.to_datetime(df["review_date"])
df["month"] = df["review_date"].dt.to_period("M").astype(str)

# Aggregate sentiment scores over time
df["sentiment_score"] = df["Predicted_Sentiment"].map({"Negative": -1, "Neutral": 0, "Positive": 1})
monthly_sentiment = df.groupby("month")["sentiment_score"].mean().reset_index()

# Prepare data for Prophet model
monthly_sentiment.columns = ["ds", "y"]  # Prophet requires column names 'ds' (date) and 'y' (value)

# Train Prophet model
model = Prophet()
model.fit(monthly_sentiment)

# Make future predictions
future = model.make_future_dataframe(periods=6, freq="MS")  # Next 6 months; "MS" (month start) matches the monthly 'ds' values
forecast = model.predict(future)

# Plot forecast
model.plot(forecast)
plt.title("Predicted Sentiment Trend for Future Months")
plt.show()

In [None]:
from prophet import Prophet
import matplotlib.pyplot as plt

# Convert review_date to datetime
df["review_date"] = pd.to_datetime(df["review_date"])
df["month"] = df["review_date"].dt.to_period("M").astype(str)

# Map sentiment labels to numeric scores
df["sentiment_score"] = df["Predicted_Sentiment"].map({"Negative": -1, "Neutral": 0, "Positive": 1})

# Define store names
stores = ["Fry's", "Safeway", "Target", "Trader Joe's"]

# Define colors for actual and predicted trends
actual_color = "blue"
predicted_color = "red"

# Initialize figure for multiple plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

# Loop through each store
for i, store in enumerate(stores):
    store_df = df[df["name"] == store]  # Filter data for store
    monthly_sentiment = store_df.groupby("month")["sentiment_score"].mean().reset_index()

    # Convert 'month' to datetime and rename the columns to Prophet's 'ds' and 'y'
    monthly_sentiment["month"] = pd.to_datetime(monthly_sentiment["month"])
    monthly_sentiment.columns = ["ds", "y"]

    # Train Prophet model
    model = Prophet()
    model.fit(monthly_sentiment)

    # Predict next 6 months; "MS" (month start) matches the monthly 'ds' values
    future = model.make_future_dataframe(periods=6, freq="MS")

    # Ensure future 'ds' is in datetime format
    future["ds"] = pd.to_datetime(future["ds"])

    forecast = model.predict(future)

    # Ensure forecast 'ds' is also in datetime format
    forecast["ds"] = pd.to_datetime(forecast["ds"])

    # Plot results
    axes[i].plot(monthly_sentiment["ds"], monthly_sentiment["y"], marker="o", linestyle="-", color=actual_color, label="Actual Sentiment Trend")
    axes[i].plot(forecast["ds"], forecast["yhat"], linestyle="dashed", color=predicted_color, label="Predicted Trend")

    # Formatting
    axes[i].set_title(f"Sentiment Forecast for {store}")
    axes[i].set_xlabel("Month")
    axes[i].set_ylabel("Average Sentiment Score")
    axes[i].legend()
    axes[i].tick_params(axis="x", rotation=45)

# Adjust layout and display
plt.tight_layout()
plt.show()

1. Fry’s Sentiment Analysis
- The actual sentiment fluctuates drastically, showing extreme positive and negative spikes.
- The predicted sentiment trend shows a more stable decline over time.
- Sentiment volatility suggests that customer experiences have been inconsistent over time, although months with only a few reviews can also produce large swings in the monthly average.
- There could be seasonal changes or external factors affecting sentiment at different times.
- The trend suggests a potential decline in positive reviews over the years.
    
Potential Business Insight:

- Fry’s may need to analyze periods with extreme negative sentiment to identify specific causes (e.g., product availability, service issues).
- Addressing customer pain points during periods of sentiment dips could help improve long-term perception.
    
2. Safeway Sentiment Analysis
- The actual sentiment is also highly volatile, similar to Fry’s.
- The predicted trend suggests an upward movement followed by a gradual decline.
- While positive sentiment dominates, there are multiple deep negative spikes.
- A recurrent pattern of fluctuations indicates that customer satisfaction is not stable over time.
    
Potential Business Insight:

- Safeway should focus on periods where negative sentiment increased to analyze customer complaints.
- Loyalty programs or quality assurance initiatives could help maintain consistent positive sentiment.
    
3. Target Sentiment Analysis
- The predicted sentiment trend is relatively stable and leans towards positivity.
- However, actual sentiment data shows major drops in certain timeframes.
- Unlike Fry’s and Safeway, Target appears to have a relatively higher baseline sentiment, meaning more positive reviews overall.
- The negative dips could correspond to specific events like policy changes, product issues, or economic downturns.
    
Potential Business Insight:

- Target has a stronger positive sentiment base, but needs to focus on periods of major sentiment drops.
- Investigating key timeframes where customer dissatisfaction spiked can help prevent similar future occurrences.
    
4. Trader Joe’s Sentiment Analysis
- Trader Joe’s has the most consistently positive sentiment trend among all stores.
- The predicted sentiment trend remains stable, showing that customers generally have a positive experience.
- The actual sentiment trend has occasional dips but remains largely positive over the years.
- This indicates strong brand loyalty and a high level of customer satisfaction.
    
Potential Business Insight:

- Trader Joe’s should maintain its current customer service and product quality.
- Identifying what drives loyalty and ensuring high-quality customer engagement can help sustain this strong sentiment trend.
- Periodic analysis of the negative dips can help improve operational efficiencies.
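
Before acting on individual spikes or dips, it can help to check how many reviews stand behind each monthly average. The sketch below uses made-up data and a hypothetical threshold `MIN_REVIEWS`; the column names mirror those used in the forecasting cells.

```python
import pandas as pd

# Toy data standing in for one store's reviews; values are made up
toy = pd.DataFrame({
    "month": ["2021-01", "2021-01", "2021-01", "2021-02", "2021-03", "2021-03"],
    "sentiment_score": [1, 1, -1, -1, 1, 0],
})

MIN_REVIEWS = 2  # Hypothetical threshold; tune to the data

# Monthly mean score together with the number of reviews behind it
monthly = toy.groupby("month")["sentiment_score"].agg(mean_score="mean", n_reviews="count")
monthly["reliable"] = monthly["n_reviews"] >= MIN_REVIEWS
print(monthly)
```

Months flagged as not reliable rest on too few reviews to say much about a change in customer experience, so extreme values there deserve less weight.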


---

## 7.6 Sentiment Comparison Across Stores

In [None]:
import seaborn as sns

# Create a pivot table for heatmap
heatmap_data = df.pivot_table(index="name", columns="Predicted_Sentiment", aggfunc="size", fill_value=0)

# Normalize by row (percentage per store)
heatmap_data = heatmap_data.div(heatmap_data.sum(axis=1), axis=0) * 100

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(heatmap_data, annot=True, fmt=".1f", cmap="coolwarm")
plt.title("Sentiment Comparison Across Stores")
plt.xlabel("Sentiment")
plt.ylabel("Store")
plt.show()

1. Fry’s Sentiment Distribution
- Negative Sentiment: 26.8% – Fry’s has a relatively high level of dissatisfaction among customers compared to other stores.
- Neutral Sentiment: 0.8% – Very few reviews are neutral, indicating strongly polarized opinions.
- Positive Sentiment: 72.5% – Although a majority of reviews are positive, the lower positivity compared to Target and Trader Joe’s signals potential areas for improvement.
- Business Insight: Fry’s needs to analyze recurring customer complaints and focus on service improvements to reduce the high proportion of negative reviews. Operational inefficiencies, product availability, or customer service issues could be contributing to the dissatisfaction.

2. Safeway Sentiment Distribution
- Negative Sentiment: 22.4% – Slightly lower negative sentiment than Fry’s but still significant.
- Neutral Sentiment: 1.2% – Minimal neutral responses, indicating strong customer opinions.
- Positive Sentiment: 76.4% – Higher customer satisfaction compared to Fry’s but lower than Trader Joe’s and Target.
- Business Insight: Safeway has a relatively strong brand reputation but should focus on reducing negative sentiment further. Addressing customer complaints related to product quality, pricing, or service delays could help improve brand perception.

3. Target Sentiment Distribution
- Negative Sentiment: 16.4% – Lower than both Fry’s and Safeway, showing better customer satisfaction.
- Neutral Sentiment: 0.7% – Very few neutral responses.
- Positive Sentiment: 83.0% – High customer satisfaction, showing that Target maintains a strong customer experience and service standards.
- Business Insight: Target stands out with high positive sentiment, reflecting strong customer loyalty. However, addressing the remaining 16.4% negative sentiment could further enhance its reputation.

4. Trader Joe’s Sentiment Distribution
- Negative Sentiment: 6.0% – The lowest among all stores, indicating an overwhelmingly positive reputation.
- Neutral Sentiment: 0.6% – Very few neutral responses, meaning customers either love or dislike their experience.
- Positive Sentiment: 93.4% – The highest customer satisfaction, showing strong brand loyalty and an exceptional shopping experience.
- Business Insight: Trader Joe’s has an outstanding customer approval rating, which could be linked to its niche product selection, customer service, and unique shopping experience. Maintaining consistent quality and service will be key to sustaining this strong reputation.
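
One compact way to compare the four stores is a net sentiment score, defined here as the positive percentage minus the negative percentage. This summary is an illustrative addition rather than something the notebook computes; the input percentages are the ones reported above.

```python
# Percentages reported above: (negative %, positive %) per store
reported = {
    "Fry's": (26.8, 72.5),
    "Safeway": (22.4, 76.4),
    "Target": (16.4, 83.0),
    "Trader Joe's": (6.0, 93.4),
}

# Net sentiment = positive % minus negative % (illustrative summary metric)
net_sentiment = {store: round(pos - neg, 1) for store, (neg, pos) in reported.items()}

# Rank stores from highest to lowest net sentiment
ranking = sorted(net_sentiment, key=net_sentiment.get, reverse=True)

for store in ranking:
    print(f"{store:13s} {net_sentiment[store]:+.1f}")
```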

---

# 8. Topic Modeling using BERTopic

In [None]:
# Install required packages if not installed
!pip install bertopic umap-learn hdbscan wordcloud matplotlib seaborn pandas

from bertopic import BERTopic
from umap import UMAP
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud


# Ensure 'review_date' is in datetime format and derive a monthly period for trend analysis
if "review_date" in df.columns:
    df["review_date"] = pd.to_datetime(df["review_date"], errors="coerce")
    df["month"] = df["review_date"].dt.to_period("M")

# List of grocery stores
stores = ["Fry's", "Safeway", "Target", "Trader Joe's"]

# Dictionary to store topic models and results
topic_models = {}
topic_results = {}

# Loop through each store and apply BERTopic
for store in stores:
    print(f"\nProcessing Topic Modeling for {store}...")

    # Filter reviews for the store
    store_reviews = df[df["name"] == store]["review_text"].dropna().tolist()

    if len(store_reviews) < 10:
        print(f" Not enough reviews for {store}, skipping topic modeling.\n")
        continue  # Skip stores with insufficient reviews

    # Reduce dimensions for effective clustering
    umap_model = UMAP(n_neighbors=15, n_components=5, metric='cosine', random_state=42)

    # Initialize BERTopic model
    topic_model = BERTopic(language="english", umap_model=umap_model)

    # Fit and transform the model
    topics, probs = topic_model.fit_transform(store_reviews)

    # Store results
    topic_models[store] = topic_model
    topic_results[store] = topics

    # Display top topics
    print(f" Top Topics for {store}:")
    print(topic_model.get_topic_info().head())

    # ** Visualize Top Words in Each Topic (Bar Chart)**
    fig1 = topic_model.visualize_barchart(top_n_topics=10)
    fig1.show()

    # ** Generate Word Clouds for Top Topics**
    # Use the actual topic ids (excluding the outlier topic -1) rather than assuming 0..8 exist
    topic_ids = [t for t in topic_model.get_topic_info()["Topic"] if t != -1][:9]
    fig, axes = plt.subplots(3, 3, figsize=(15, 10))
    for ax, topic_id in zip(axes.flatten(), topic_ids):
        words_freq = dict(topic_model.get_topic(topic_id))
        wordcloud = WordCloud(width=400, height=400, background_color='white').generate_from_frequencies(words_freq)
        ax.imshow(wordcloud, interpolation="bilinear")
        ax.set_title(f"Topic {topic_id} - {store}")
    # Hide the axes of every panel, including unused ones
    for ax in axes.flatten():
        ax.axis("off")

    plt.tight_layout()
    plt.show()

    # **Intertopic Distance Map**
    # Inside a loop the figure is only rendered if .show() is called explicitly
    topic_model.visualize_topics().show()

    # **Sentiment-Based Topic Analysis**
    if "rating" in df.columns:
        df["sentiment"] = df["rating"].apply(lambda x: "positive" if x >= 4 else "negative")

        # Add topic assignments to the dataframe (same rows, in the same order, as store_reviews)
        store_df = df[df["name"] == store].dropna(subset=["review_text"]).copy()
        store_df["topic"] = topics

        # Count topics for positive and negative reviews
        topic_counts = store_df.groupby(["topic", "sentiment"]).size().unstack().fillna(0)

        # Plot topic distribution
        topic_counts.plot(kind="bar", stacked=True, figsize=(12, 6), colormap="coolwarm")
        plt.title(f"Topic Distribution among Positive & Negative Reviews for {store}")
        plt.xlabel("Topics")
        plt.ylabel("Number of Reviews")
        plt.legend(["Negative", "Positive"])
        plt.show()

    # **Optimized Time-Based Topic Trend Analysis**
    if "month" in df.columns:
        try:
            print(f"Processing time-based topic analysis for {store}...")

            # Filter store-specific data, keeping the same rows the model was fitted on
            store_df = df[df["name"] == store].dropna(subset=["review_text"]).copy()

            # Drop any rows where 'month' is missing
            store_df = store_df.dropna(subset=["month"])

            # Ensure there are enough unique months
            if store_df["month"].nunique() > 5:  # Requires more than 5 unique months
                print(f"Found {store_df['month'].nunique()} unique months for {store}, proceeding with time-based trends...")

                # Extract topics over time with reduced bins for performance
                topics_over_time = topic_model.topics_over_time(
                    store_df["review_text"].tolist(),
                    store_df["month"].astype(str).tolist(),  # Convert to string format to prevent errors
                    nr_bins=20  # Adjust to optimize performance
                )

                # Visualize topics evolving over time
                topic_model.visualize_topics_over_time(topics_over_time).show()
            else:
                print(f" Not enough month variability for {store} ({store_df['month'].nunique()} unique months), skipping time-based topic trends.")

        except Exception as e:
            print(f"❌Error in time-based topic analysis for {store}: {e}")

    print(f"✅ Completed Topic Modeling for {store}\n")


In [None]:
# Display top topics (topic_model here is the model fitted for the last store in the loop)
topic_model.get_topic_info()

In [None]:
# Prepare topic distribution data
topic_distribution = {}
for store, topics in topic_results.items():
    topic_distribution[store] = pd.Series(topics).value_counts()

# Convert to DataFrame for plotting
topic_df = pd.DataFrame(topic_distribution).fillna(0)

# Plot topic distribution across stores
# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
topic_df.plot(kind='bar', stacked=True, colormap='viridis', figsize=(14, 6))
plt.title("Topic Distribution Across Grocery Stores")
plt.xlabel("Topic Number")
plt.ylabel("Number of Reviews")
plt.legend(title="Store")
plt.show()

In [None]:
# Combine topics from all stores
all_topics = {}

for store, model in topic_models.items():
    all_topics[store] = model.get_topic_info().head(6)  # Get top 6 topics

# Convert to DataFrame for easy comparison
topic_comparison_df = pd.concat(all_topics, axis=1)

# Display results
import IPython.display as display
display.display(topic_comparison_df)

In [None]:
import seaborn as sns

# Count positive vs. negative reviews per store
sentiment_counts = df.groupby(["name", "sentiment"]).size().unstack().fillna(0)

# Plot sentiment distribution across stores
# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
sentiment_counts.plot(kind="bar", stacked=True, colormap="coolwarm", figsize=(10, 6))
plt.title("Sentiment Comparison Across Stores")
plt.xlabel("Store Name")
plt.ylabel("Number of Reviews")
plt.legend(["Negative", "Positive"])
plt.xticks(rotation=45)
plt.show()

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import pandas as pd


# Ensure sentiment labels are correctly assigned
if "rating" in df.columns:
    df["sentiment"] = df["rating"].apply(lambda x: "positive" if x >= 4 else "negative")

# List of grocery stores
stores = ["Fry's", "Safeway", "Target", "Trader Joe's"]

# Loop through each store and generate word clouds
for store in stores:
    print(f"\n🔍 Generating word clouds for {store}...")

    # Filter positive and negative reviews for the store
    positive_reviews = df[(df["name"] == store) & (df["sentiment"] == "positive")]["review_text"].dropna()
    negative_reviews = df[(df["name"] == store) & (df["sentiment"] == "negative")]["review_text"].dropna()

    # Combine text for word clouds
    positive_text = " ".join(positive_reviews)
    negative_text = " ".join(negative_reviews)

    # Generate word clouds
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))

    # Positive word cloud
    wordcloud_positive = WordCloud(width=600, height=400, background_color="white").generate(positive_text)
    axes[0].imshow(wordcloud_positive, interpolation="bilinear")
    axes[0].set_title(f"Positive Reviews - {store}")
    axes[0].axis("off")

    # Negative word cloud
    wordcloud_negative = WordCloud(width=600, height=400, background_color="black", colormap="Reds").generate(negative_text)
    axes[1].imshow(wordcloud_negative, interpolation="bilinear")
    axes[1].set_title(f"Negative Reviews - {store}")
    axes[1].axis("off")

    # Display word clouds
    plt.tight_layout()
    plt.show()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

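# Prepare data: drop rows missing either the text or the label so X and y stay aligned
labeled = df.dropna(subset=["review_text", "sentiment"])
X = labeled["review_text"]
y = labeled["sentiment"]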

# Convert text into TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
import pandas as pd

# Create a report dictionary
report_data = []

for store in stores:
    if store in topic_models:
        # Get top 6 topics
        top_topics = topic_models[store].get_topic_info().head(6)
        sentiment_distribution = df[df["name"] == store]["sentiment"].value_counts()

        report_data.append({
            "Store": store,
            "Top Topics": top_topics["Name"].tolist(),
            "Positive Reviews": sentiment_distribution.get("positive", 0),
            "Negative Reviews": sentiment_distribution.get("negative", 0)
        })

# Convert report to DataFrame
report_df = pd.DataFrame(report_data)

# Display to user
display.display(report_df)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Improved visualization for better readability
for store in stores:
    topic_info = topic_models[store].get_topic_info()

    # Filter out the outlier topic "-1" (which represents outliers in BERTopic)
    topic_info = topic_info[topic_info["Topic"] != -1]

    # Sort topics by frequency
    topic_info = topic_info.sort_values(by="Count", ascending=False).head(15)  # Show top 15 topics for better clarity

    topic_counts = topic_info["Count"].values
    topic_labels = [f"Topic {i}" for i in topic_info["Topic"].values]

    plt.figure(figsize=(12, 6))
    sns.barplot(x=topic_counts, y=topic_labels, palette="coolwarm")
    plt.xlabel("Number of Reviews")
    plt.ylabel("Topics")
    plt.title(f"Top 15 Topic Distribution for {store}")

    # Annotate each bar with its review count
    for index, value in enumerate(topic_counts):
        plt.text(value + 1, index, str(value), va='center', fontsize=10)

    plt.show()


In [None]:
# Extract top keywords for each topic per store
for store in stores:
    topic_info = topic_models[store].get_topic_info()

    # Remove outlier topic (-1)
    topic_info = topic_info[topic_info["Topic"] != -1]

    print(f"\nTop Keywords for {store}:")
    for topic in topic_info["Topic"].values[:10]:  # Show top 10 topics
        keywords = topic_models[store].get_topic(topic)
        keyword_list = [word for word, _ in keywords[:10]]  # Get top 10 keywords
        print(f"Topic {topic}: {', '.join(keyword_list)}")

In [None]:
import pandas as pd

# Create a DataFrame to store topic sentiment distribution for each store
topic_sentiment_distribution = []

for store in stores:
    store_df = store_dfs[store]

    # Assign topics to reviews
    topics_per_review = topic_models[store].transform(store_df["review_text"].tolist())[0]

    # Assign topics to the store dataframe
    store_df["Topic"] = topics_per_review

    # Remove outlier topics (-1)
    store_df = store_df[store_df["Topic"] != -1]

    # Count sentiment distribution per topic
    topic_sentiment_counts = store_df.groupby(["Topic", "Predicted_Sentiment"]).size().unstack(fill_value=0)

    # Convert to percentage format
    topic_sentiment_percentages = topic_sentiment_counts.div(topic_sentiment_counts.sum(axis=1), axis=0) * 100

    # Store results
    for topic, row in topic_sentiment_percentages.iterrows():
        topic_sentiment_distribution.append({
            "Store": store,
            "Topic": topic,
            "Negative (%)": round(row.get("Negative", 0), 2),
            "Neutral (%)": round(row.get("Neutral", 0), 2),
            "Positive (%)": round(row.get("Positive", 0), 2),
        })

# Convert to DataFrame for better readability
topic_sentiment_df = pd.DataFrame(topic_sentiment_distribution)

# Display DataFrame
import IPython.display as display
display.display(topic_sentiment_df)

In [None]:
# Extract top keywords for each store's topics
topic_keywords = {}

for store in stores:
    print(f"Extracting top keywords for {store}...")

    # Get the most important words per topic
    topic_info = topic_models[store].get_topic_info()
    top_topics = topic_info[['Topic', 'Name']].head(10)  # Extract top 10 topics

    # Store in dictionary
    topic_keywords[store] = top_topics

    print(f"Extracted top topics for {store}")

# Display results
import pandas as pd
for store, topics in topic_keywords.items():
    print(f"\n🔹 **Top Topics for {store}**")
    print(topics.to_string(index=False))  # Display neatly formatted

In [None]:
# Create dictionary to store sentiment distribution per topic
topic_sentiments = {}

for store in stores:
    print(f"Analyzing sentiment distribution for {store}...")

    # Merge topic assignments with predicted sentiments
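    # Copy the selection so adding the Topic column does not write to a slice of the original frame
    store_df = store_dfs[store][['Predicted_Sentiment', 'cleaned_review_text']].copy()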
    store_df['Topic'] = topic_models[store].topics_

    # Count sentiment per topic
    sentiment_distribution = store_df.groupby(["Topic", "Predicted_Sentiment"]).size().unstack(fill_value=0)

    # Normalize to percentage
    sentiment_distribution_percentage = sentiment_distribution.div(sentiment_distribution.sum(axis=1), axis=0) * 100

    # Store in dictionary
    topic_sentiments[store] = sentiment_distribution_percentage

    print(f"Completed sentiment analysis for {store}")

# Display results
for store, sentiment_data in topic_sentiments.items():
    print(f"\n **Sentiment Distribution by Topic for {store}**")
    print(sentiment_data.head(10).to_string())  # Display top 10 topics


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

for store, sentiment_data in topic_sentiments.items():
    plt.figure(figsize=(12, 6))
    sns.heatmap(sentiment_data.T, annot=True, fmt=".1f", cmap="coolwarm")
    plt.title(f"Sentiment Distribution Across Topics - {store}")
    plt.xlabel("Topic")
    plt.ylabel("Sentiment")
    plt.show()

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import pandas as pd
from collections import Counter
import os

# Ensure sentiment labels are correctly assigned
if "rating" in df.columns:
    df["sentiment"] = df["rating"].apply(lambda x: "positive" if x >= 4 else "negative")

# List of grocery stores
stores = ["Fry's", "Safeway", "Target", "Trader Joe's"]

# Define product categories and associated keywords
categories = {
    "Bakery": ["bread", "cake", "pastry", "bakery", "cookies", "doughnut"],
    "Produce": ["fruits", "vegetables", "organic", "fresh", "produce", "greens"],
    "Checkout Experience": ["cashier", "checkout", "register", "line", "self-checkout"],
    "Deli": ["deli", "meat", "cheese", "sandwich", "sliced"],
    "Customer Service": ["staff", "service", "employee", "rude", "helpful", "manager"]
}

# Create output directory for word cloud images
output_dir = "store_category_wordclouds"
os.makedirs(output_dir, exist_ok=True)

# Custom stopwords for cleaner analysis
custom_stopwords = set(STOPWORDS).union({"store", "grocery", "food", "customer", "shop", "shopping", "buy", "purchase"})

# Dictionary to store word frequencies
word_frequencies = {"Store": [], "Category": [], "Sentiment": [], "Word": [], "Count": []}

# Function to get word frequencies
def get_word_frequencies(text, store, category, sentiment):
    words = text.lower().split()
    words = [word for word in words if word not in custom_stopwords]
    word_counts = Counter(words).most_common(10)  # Get top 10 words
    for word, count in word_counts:
        word_frequencies["Store"].append(store)
        word_frequencies["Category"].append(category)
        word_frequencies["Sentiment"].append(sentiment)
        word_frequencies["Word"].append(word)
        word_frequencies["Count"].append(count)

# Loop through each store and category
for store in stores:
    for category, keywords in categories.items():
        print(f"\n🔍 Generating word clouds for {store} - {category}...")

        # Filter reviews mentioning category keywords & belonging to the store
        category_reviews = df[(df["name"] == store) & (df["review_text"].str.contains('|'.join(keywords), case=False, na=False))]

        # Separate positive and negative reviews
        positive_reviews = category_reviews[category_reviews["sentiment"] == "positive"]["review_text"].dropna()
        negative_reviews = category_reviews[category_reviews["sentiment"] == "negative"]["review_text"].dropna()

        # Combine text for word clouds
        positive_text = " ".join(positive_reviews)
        negative_text = " ".join(negative_reviews)

        # Generate word clouds
        fig, axes = plt.subplots(1, 2, figsize=(15, 6))

        # Positive word cloud
        if positive_text.strip():
            wordcloud_positive = WordCloud(width=600, height=400, background_color="white", stopwords=custom_stopwords).generate(positive_text)
            axes[0].imshow(wordcloud_positive, interpolation="bilinear")
        else:
            axes[0].text(0.5, 0.5, "No Positive Reviews", fontsize=15, ha="center", va="center", bbox=dict(facecolor="white", edgecolor="black"))
        axes[0].set_title(f"Positive Reviews - {store} ({category})")
        axes[0].axis("off")

        # Negative word cloud
        if negative_text.strip():
            wordcloud_negative = WordCloud(width=600, height=400, background_color="black", colormap="Reds", stopwords=custom_stopwords).generate(negative_text)
            axes[1].imshow(wordcloud_negative, interpolation="bilinear")
        else:
            axes[1].text(0.5, 0.5, "No Negative Reviews", fontsize=15, ha="center", va="center", bbox=dict(facecolor="black", edgecolor="red", alpha=0.5), color="white")
        axes[1].set_title(f"Negative Reviews - {store} ({category})")
        axes[1].axis("off")

        # Finalize the layout, then save the combined figure once
        # (a single figure holds both the positive and negative clouds)
        plt.tight_layout()
        img_path = os.path.join(output_dir, f"{store}_{category}_wordclouds.png")
        plt.savefig(img_path)
        plt.show()

        print(f"Word clouds saved for {store} - {category}:")
        print(f"   - {img_path}")

        # Compute word frequencies for positive and negative reviews
        get_word_frequencies(positive_text, store, category, "Positive")
        get_word_frequencies(negative_text, store, category, "Negative")

# Convert word frequency dictionary to DataFrame
word_freq_df = pd.DataFrame(word_frequencies)

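# Display word frequencies table
display.display(word_freq_df)

# Save the table so the message below is accurate (to_excel needs openpyxl installed)
word_freq_df.to_excel("store_category_word_frequencies.xlsx", index=False)
print("\n Category-based word frequency table saved as 'store_category_word_frequencies.xlsx'.")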


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select the category to compare (e.g., Bakery)
selected_category = "Bakery"

# Filter dataset for the selected category across all stores
category_data = df[df["review_text"].str.contains('|'.join(categories[selected_category]), case=False, na=False)]

# Group by store and sentiment to count reviews
store_comparison = category_data.groupby(["name", "sentiment"]).size().unstack().fillna(0)

# Get average business_stars per store
avg_ratings = df.groupby("name")["business_stars"].mean().round(2)

# Create figure and axes
fig, ax1 = plt.subplots(figsize=(12, 6))

# Bar chart for sentiment distribution
store_comparison.plot(kind="bar", stacked=True, colormap="coolwarm", ax=ax1)
ax1.set_title(f"Comparison of '{selected_category}' Sentiment Across Stores (w/ Business Stars)")
ax1.set_xlabel("Store")
ax1.set_ylabel("Number of Reviews")
ax1.legend(["Negative", "Positive"])
ax1.set_xticklabels(store_comparison.index, rotation=45)

# Create second y-axis for business ratings
ax2 = ax1.twinx()
ax2.set_ylabel("Average Business Stars (Rating)")

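# Scatter plot for business stars, positioned on the bar slots so both axes line up
aligned_ratings = avg_ratings.reindex(store_comparison.index)
x_positions = range(len(aligned_ratings))
ax2.scatter(x_positions, aligned_ratings.values, color="gold", s=200, label="Avg Rating", marker="*")

# Annotate ratings
for x, rating in zip(x_positions, aligned_ratings.values):
    ax2.text(x, rating + 0.1, f"{rating}★", ha="center", fontsize=12, color="gold")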

ax2.set_ylim(0, 5)  # Set y-axis limit for ratings
ax2.legend(loc="upper right")

plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Define categories to compare
selected_categories = ["Bakery", "Produce", "Checkout Experience", "Deli", "Customer Service"]

# Create a subplot for each category
fig, axes = plt.subplots(len(selected_categories), 1, figsize=(12, 5 * len(selected_categories)))

for i, category in enumerate(selected_categories):
    # Filter dataset for the category
    category_data = df[df["review_text"].str.contains('|'.join(categories[category]), case=False, na=False)]

    # Group by store and sentiment to count reviews
    store_comparison = category_data.groupby(["name", "sentiment"]).size().unstack().fillna(0)

    # Get average business_stars per store
    avg_ratings = df.groupby("name")["business_stars"].mean().round(2)

    # Bar chart for sentiment distribution
    store_comparison.plot(kind="bar", stacked=True, colormap="coolwarm", ax=axes[i])
    axes[i].set_title(f"'{category}' Sentiment Across Stores (w/ Business Stars)")
    axes[i].set_xlabel("Store")
    axes[i].set_ylabel("Number of Reviews")
    axes[i].legend(["Negative", "Positive"])
    axes[i].set_xticklabels(store_comparison.index, rotation=45)

    # Create second y-axis for business ratings
    ax2 = axes[i].twinx()
    ax2.set_ylabel("Average Business Stars (Rating)")

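    # Scatter plot for business stars, positioned on the bar slots so both axes line up
    aligned_ratings = avg_ratings.reindex(store_comparison.index)
    x_positions = range(len(aligned_ratings))
    ax2.scatter(x_positions, aligned_ratings.values, color="gold", s=200, label="Avg Rating", marker="*")

    # Annotate ratings
    for x, rating in zip(x_positions, aligned_ratings.values):
        ax2.text(x, rating + 0.1, f"{rating}★", ha="center", fontsize=12, color="gold")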

    ax2.set_ylim(0, 5)  # Set y-axis limit for ratings
    ax2.legend(loc="upper right")

plt.tight_layout()
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Ensure 'month' column is in datetime format
df["month"] = pd.to_datetime(df["month"], errors="coerce")

# Convert to year for yearly aggregation
df["year"] = df["month"].dt.year

# Group by year and store
trend_data = df.groupby(["year", "name"]).agg({
    "sentiment": lambda x: (x == "positive").mean() * 100,  # % of positive reviews (0-100)
    "business_stars": "mean"  # Average rating per year
}).reset_index()

# Apply rolling average (2-year window) for smoothing
trend_data["sentiment_smooth"] = trend_data.groupby("name")["sentiment"].transform(lambda x: x.rolling(2, min_periods=1).mean())
trend_data["business_stars_smooth"] = trend_data.groupby("name")["business_stars"].transform(lambda x: x.rolling(2, min_periods=1).mean())

# Plot smoothed sentiment trends (Yearly)
plt.figure(figsize=(12, 6))
sns.lineplot(data=trend_data, x="year", y="sentiment_smooth", hue="name", marker="o", linewidth=2, palette="coolwarm")
plt.xticks(rotation=45)
plt.ylabel("Smoothed % of Positive Reviews")
plt.xlabel("Year")
plt.title("Smoothed Sentiment Trend Over Time Across Stores (2-Year Avg.)")
plt.legend(title="Store")
plt.grid(True)
plt.show()

# Plot smoothed business stars trend (Yearly)
plt.figure(figsize=(12, 6))
sns.lineplot(data=trend_data, x="year", y="business_stars_smooth", hue="name", marker="*", linewidth=2)
plt.xticks(rotation=45)
plt.ylabel("Smoothed Business Stars")
plt.xlabel("Year")
plt.title("Smoothed Business Stars Trend Over Time Across Stores (2-Year Avg.)")
plt.legend(title="Store")
plt.grid(True)
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Select the category to analyze (e.g., "Bakery")
selected_category = "Bakery"

# Filter reviews related to this category (copy so adding a column does not write to a slice of df)
category_trend_data = df[df["review_text"].str.contains('|'.join(categories[selected_category]), case=False, na=False)].copy()

# Ensure 'month' column is in datetime format and extract the year
category_trend_data["year"] = pd.to_datetime(category_trend_data["month"], errors="coerce").dt.year

# Group by year and store
category_trend = category_trend_data.groupby(["year", "name"]).agg({
    "sentiment": lambda x: (x == "positive").mean() * 100,  # % of positive reviews (0-100)
    "business_stars": "mean"  # Average rating per year
}).reset_index()

# Apply rolling average (2-year window) for smoothing
category_trend["sentiment_smooth"] = category_trend.groupby("name")["sentiment"].transform(lambda x: x.rolling(2, min_periods=1).mean())
category_trend["business_stars_smooth"] = category_trend.groupby("name")["business_stars"].transform(lambda x: x.rolling(2, min_periods=1).mean())

# Plot smoothed sentiment trends (Yearly)
plt.figure(figsize=(12, 6))
sns.lineplot(data=category_trend, x="year", y="sentiment_smooth", hue="name", marker="o", linewidth=2, palette="coolwarm")
plt.xticks(rotation=45)
plt.ylabel(f"Smoothed % of Positive Reviews for {selected_category}")
plt.xlabel("Year")
plt.title(f"Sentiment Trend Over Time for '{selected_category}' (2-Year Avg.)")
plt.legend(title="Store")
plt.grid(True)
plt.show()

# Plot smoothed business stars trend (Yearly)
plt.figure(figsize=(12, 6))
sns.lineplot(data=category_trend, x="year", y="business_stars_smooth", hue="name", marker="*", linewidth=2)
plt.xticks(rotation=45)
plt.ylabel(f"Smoothed Business Stars for {selected_category}")
plt.xlabel("Year")
plt.title(f"Business Stars Trend Over Time for '{selected_category}' (2-Year Avg.)")
plt.legend(title="Store")
plt.grid(True)
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select the store to analyze
selected_store = "Fry's"  # Change to other stores as needed

# Create a dataframe for category comparison
category_sentiment_data = []

# Loop through categories and calculate sentiment distribution
for category, keywords in categories.items():
    # Filter dataset for the selected category within the selected store
    category_data = df[
        (df["name"] == selected_store) &
        (df["review_text"].str.contains('|'.join(keywords), case=False, na=False))
    ]

    # Count positive and negative reviews
    positive_count = (category_data["sentiment"] == "positive").sum()
    negative_count = (category_data["sentiment"] == "negative").sum()

    # Store results
    category_sentiment_data.append({
        "Category": category,
        "Positive Reviews": positive_count,
        "Negative Reviews": negative_count
    })

# Convert to DataFrame
category_sentiment_df = pd.DataFrame(category_sentiment_data)

# Plot stacked bar chart for sentiment distribution across categories
# (DataFrame.plot creates its own figure, so no separate plt.figure call is needed)
category_sentiment_df.set_index("Category").plot(kind="bar", stacked=True, colormap="coolwarm", figsize=(10, 6))
plt.title(f"Sentiment Comparison Across Categories - {selected_store}")
plt.xlabel("Category")
plt.ylabel("Number of Reviews")
plt.legend(["Positive", "Negative"])  # Matches column order: Positive Reviews, then Negative Reviews
plt.xticks(rotation=45)
plt.grid(True)
plt.show()


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Select store to analyze
selected_store = "Fry's"

# Create an empty dataframe for category trends
category_trend_data = pd.DataFrame()

# Ensure 'month' column is in datetime format and extract the year
df["year"] = pd.to_datetime(df["month"], errors="coerce").dt.year

# Loop through categories to calculate sentiment trends over time
for category, keywords in categories.items():
    # Filter dataset for the selected store and category
    category_df = df[
        (df["name"] == selected_store) &
        (df["review_text"].str.contains('|'.join(keywords), case=False, na=False))
    ]

    # Aggregate sentiment over time by year
    category_trend = category_df.groupby("year").agg({
        "sentiment": lambda x: (x == "positive").mean() * 100  # % positive reviews (0-100)
    }).reset_index()

    # Add category column
    category_trend["Category"] = category

    # Append to main dataframe (fresh index so rows stay uniquely labeled)
    category_trend_data = pd.concat([category_trend_data, category_trend], ignore_index=True)

# Apply rolling average (2-year smoothing)
category_trend_data["sentiment_smooth"] = category_trend_data.groupby("Category")["sentiment"].transform(lambda x: x.rolling(2, min_periods=1).mean())

# Plot sentiment trend for different categories (Yearly)
plt.figure(figsize=(12, 6))
sns.lineplot(data=category_trend_data, x="year", y="sentiment_smooth", hue="Category", marker="o", linewidth=2, palette="tab10")
plt.xticks(rotation=45)
plt.ylabel("Smoothed % of Positive Reviews")
plt.xlabel("Year")
plt.title(f"Sentiment Trends Across Categories - {selected_store} (2-Year Avg.)")
plt.legend(title="Category")
plt.grid(True)
plt.show()


---
# Conclusion
This project aimed to analyze Yelp reviews of grocery stores (Target, Fry’s, Safeway, Trader Joe’s) to extract actionable insights on customer satisfaction and business performance. By leveraging natural language processing (NLP), sentiment analysis, and machine learning models, the study explored customer perceptions through textual reviews and rating distributions.

The analysis involved several key steps:
1. *Data Preprocessing & Cleaning*  
   - Addressed missing values, removed duplicates, and corrected data types such as review dates and geographical information.  
   - Processed review text using tokenization, stopword removal, and lemmatization for improved analysis.  

2. *Exploratory Data Analysis (EDA)*  
   - Visualized rating distributions and customer experiences across stores.  
   - Analyzed review length, review count per user, and useful votes to assess customer engagement.  

3. *Sentiment Analysis*  
   - Classified reviews as Positive, Negative, or Neutral using VADER and machine learning models.  
   - Examined sentiment trends over time to identify fluctuations in customer satisfaction.  

4. *Machine Learning Classification*  
   - Built and evaluated Logistic Regression, SVM, and Naïve Bayes classifiers, and compared them against the lexicon-based VADER scorer.  
   - SVM demonstrated the highest classification accuracy and F1-score.  

5. *Interpretation & Business Impact*  
   - Identified sentiment trends, store-specific performance, and key factors influencing satisfaction.  
   - Suggested improvements in customer service, product offerings, and business operations based on sentiment insights.  
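
Step 3 labels each review as Positive, Negative, or Neutral. As a minimal sketch of how a VADER-style compound score can be mapped to those labels, the snippet below assumes the commonly used ±0.05 cutoffs; the helper name `label_from_compound` and the sample scores are hypothetical, and the thresholds used earlier in this notebook may differ.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    The 0.05 threshold is an assumption (a common convention for VADER),
    not necessarily the cutoff used elsewhere in this notebook.
    """
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical compound scores, for illustration only
for score in (0.62, 0.01, -0.40):
    print(score, label_from_compound(score))
```

Keeping the threshold as a parameter makes it easy to check how sensitive the Neutral share is to the cutoff.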

---

# Key Insights from the Analysis

## Sentiment Trends & Customer Satisfaction  
- Customer sentiment fluctuates over time, with positive sentiment peaking at certain periods and declining at others.  
- Trader Joe’s has the highest positive sentiment, while Safeway and Fry’s show a more polarized distribution with significant 1-star and 5-star reviews.  
- The sentiment distribution leans slightly positive: most customers report neutral to positive experiences, while the negative reviews point to areas for improvement.  

## Review Length & Engagement  
- Most reviews are short (fewer than 50 words), so typical customer feedback is brief.  
- Some longer reviews exceeding 100 words provide detailed experiences, but they are relatively rare.  
- Useful votes are generally low, indicating that reviews are not highly engaged with or upvoted by other customers.  
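
The "fewer than 50 words" observation can be checked with a simple word count. The sketch below is illustrative only: `short_review_share` and the sample reviews are hypothetical, and words are counted by whitespace splitting, which may differ slightly from the tokenization used earlier.

```python
def short_review_share(reviews, max_words=50):
    """Return the fraction of reviews with fewer than `max_words` words."""
    if not reviews:
        return 0.0
    short = sum(1 for text in reviews if len(text.split()) < max_words)
    return short / len(reviews)


# Hypothetical reviews, for illustration only
sample_reviews = [
    "Great produce and friendly staff.",
    "Long checkout lines every weekend.",
    "word " * 120,  # stands in for a detailed 120-word review
]
share = short_review_share(sample_reviews)
print(f"{share:.0%} of the sample reviews have fewer than 50 words")
```

On the real data, the same function could be applied to `df["review_text"].dropna().tolist()`.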

## Store-Specific Insights  

| Store        | Sentiment Analysis | Customer Engagement | Review Highlights |
|-------------|-------------------|--------------------|------------------|
| *Trader Joe’s* | Highest positive sentiment (4-5 star reviews dominant) | Higher engagement, more useful votes | Customers highlight customer service, product quality, and store environment. |
| *Target* | Balanced sentiment, occasional negative spikes | Medium engagement, some useful votes | Customers mention store layout and stock availability issues. |
| *Safeway* | High 1-star and 4-star reviews (polarized) | Lower engagement, few useful votes | Complaints about pricing and checkout experiences. |
| *Fry’s* | Mixed sentiment (1-star and 5-star peaks) | Low engagement, lowest useful votes | Complaints about customer service, pricing, and product availability. |

## Machine Learning Model Performance  

| Metric  | VADER  | Logistic Regression | SVM  | Naïve Bayes  |
|---------|--------|--------------------|------|-------------|
| *Accuracy* | 87.06% | 95.25% | *98.38%* | 87.93% |
| *Precision (Positive Sentiment)* | 94.49% | 95.89% | *98.48%* | 96.10% |
| *Recall (Positive Sentiment)* | 89.76% | 98.61% | *99.85%* | 89.15% |
| *F1-Score (Positive Sentiment)* | 92.07% | 97.23% | *99.16%* | 92.49% |

- SVM achieved the best overall performance, followed by Logistic Regression.  
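- Naïve Bayes and VADER trailed the other two models, likely because Naïve Bayes treats words as conditionally independent and VADER scores text against a fixed, general-purpose lexicon instead of learning from these reviews.  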
- Deep learning techniques such as LSTMs and Transformers could further improve sentiment classification accuracy.  
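
As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so the F1 row can be recomputed from the two rows above it. The sketch below uses the rounded values from the table; because the inputs are already rounded, a result may differ from the reported figure by about 0.01. The helper `f1_score` is defined here for illustration and is not the scikit-learn function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for the positive class (in %), copied from the table
reported = {
    "VADER": (94.49, 89.76),
    "Logistic Regression": (95.89, 98.61),
    "SVM": (98.48, 99.85),
    "Naive Bayes": (96.10, 89.15),
}

for model, (precision, recall) in reported.items():
    print(f"{model}: F1 = {f1_score(precision, recall):.2f}%")
```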

---

# Recommendations for Business Improvements

## Address Negative Reviews & Improve Customer Experience  
- Safeway and Fry’s should analyze common issues in 1-star reviews to identify recurring problems in pricing, checkout experience, and customer service.  
- Implement real-time feedback systems to address customer concerns immediately.  
- Trader Joe’s should maintain its high service quality while identifying areas for minor improvements.  

## Enhance Review Engagement & Encourage Detailed Feedback  
- Most reviews are short and not very detailed. Stores should incentivize customers to write longer, more informative reviews, such as offering small discounts for detailed feedback.  
- Introduce "Most Helpful Review" sections to highlight valuable reviews and encourage users to vote for useful ones.  

## Improve Sentiment Classification Accuracy  
- Implement more advanced deep learning models such as BERT, GPT, and LSTMs to improve sentiment classification.  
- Extend the per-store topic modeling already applied here (BERTopic), or try LDA as an alternative, to understand common themes in negative and positive reviews.  

## Leverage Sentiment Trends for Business Strategy  
- Track seasonal sentiment fluctuations to align marketing campaigns and promotions with customer expectations.  
- Analyze spikes in negative sentiment to proactively address issues before they escalate.  
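
One simple way to surface spikes in negative sentiment is to compare each period's share of negative reviews with the average of the preceding periods. The sketch below is a rough illustration under stated assumptions: the function `flag_negative_spikes`, the 3-period window, the 0.10 margin, and the sample values are all hypothetical and would need tuning on the real data.

```python
from statistics import mean


def flag_negative_spikes(negative_share, window=3, margin=0.10):
    """Return indices of periods whose negative share exceeds the mean
    of the preceding `window` periods by more than `margin`."""
    spikes = []
    for i in range(window, len(negative_share)):
        baseline = mean(negative_share[i - window:i])
        if negative_share[i] - baseline > margin:
            spikes.append(i)
    return spikes


# Hypothetical monthly share of negative reviews for one store
monthly_negative_share = [0.20, 0.22, 0.19, 0.21, 0.38, 0.23]
print(flag_negative_spikes(monthly_negative_share))  # prints [4]
```

Using only preceding periods as the baseline keeps the check usable in a monitoring setting, where later months are not yet known.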

## Store-Specific Actionable Insights  

| Store | Actionable Recommendations |
|--------|-------------------------|
| *Trader Joe’s* | Maintain high-quality service and product variety while enhancing store convenience. |
| *Target* | Improve stock availability, customer flow, and checkout speed to reduce customer frustration. |
| *Safeway* | Address pricing complaints and optimize self-checkout systems for a smoother experience. |
| *Fry’s* | Enhance customer service training, improve inventory consistency, and offer better pricing transparency. |

---

# Final Thoughts  
This project successfully extracted valuable insights from customer reviews using machine learning, sentiment analysis, and NLP techniques. The analysis provided data-driven recommendations to improve grocery store experiences and highlighted the best-performing ML models for sentiment classification.

## Key Takeaways  
- SVM was the best-performing sentiment classifier among the models tested, with the highest accuracy and F1-score.  
- Trader Joe’s leads in positive sentiment, while Fry’s and Safeway face inconsistent reviews.  
- Short reviews dominate, and engagement with reviews is low, meaning stores should encourage more detailed feedback.  
- Stores can use sentiment analysis trends to predict customer satisfaction and optimize their operations.  

By implementing these recommendations, grocery stores can enhance customer satisfaction, improve business operations, and increase customer retention using data-driven decision-making.  

## Next Steps  
1. Deploy ML models into a real-time review monitoring system.  
2. Expand the dataset to include more grocery stores and locations for broader insights.  
3. Apply deep learning techniques to further enhance sentiment classification accuracy.  

---