<a href="https://colab.research.google.com/github/amantiwari-java/Zomato-Restaurant-Clustering-Project/blob/main/Zomato_Restaurant_Clustering_Project_Aman.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Zomato Restaurant Clustering Project



##### **Project Type**    - Unsupervised ML/EDA/Sentiment Analysis.
##### **Contribution**    - Individual
##### **Team Member 1 -** Aman Tiwari

# **Project Summary -**

The core objectives of this project are to perform sentiment analysis on customer reviews and to visualize the results so that conclusions are easy to draw. A second key objective is to cluster Zomato restaurants into segments based on attributes such as cuisine, cost, and reviews. The analysis aims to help customers find the best restaurants in their locality and to help Zomato identify areas for growth and improvement. The dataset contains information on cuisines and costs, which will be used for cost-benefit analysis, and reviewer metadata, which will be used to identify industry critics. The project relies on unsupervised learning for clustering, together with exploratory data analysis and data visualization.


# **GitHub Link -**

https://github.com/amantiwari-java/Zomato-Restaurant-Clustering-Project

# **Problem Statement**



The rapid growth of the restaurant business in India and the increasing reliance on platforms like Zomato for food-related decisions necessitate a deeper understanding of the vast Zomato restaurant data. The primary problem this project addresses is how to extract meaningful insights from Zomato's restaurant and review data to benefit both customers and the company. Specifically, this involves:
1. Analyzing customer review sentiments to understand user perceptions and identify areas of strength and weakness for restaurants.
2. Effectively clustering Zomato restaurants into distinct segments based on their characteristics (e.g., cuisine, cost, reviews) to help customers easily discover restaurants best suited to their preferences and locality.
3. Providing actionable business insights to Zomato, helping them identify market trends, areas for operational improvement, and opportunities for strategic growth within the Indian food industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
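import pandas as pd  # Data loading and tabular manipulation
import numpy as np  # Numerical operations and NaN handling

import matplotlib.pyplot as plt  # Base plotting
import seaborn as sns  # Statistical plots built on matplotlib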

# For machine learning (specifically for clustering later)
from sklearn.cluster import KMeans # This will be used for clustering, a core part of the project.
from sklearn.preprocessing import StandardScaler, MinMaxScaler # For feature scaling during preprocessing.
from sklearn.metrics import silhouette_score # For evaluating clustering performance.
from scipy.cluster.hierarchy import dendrogram, linkage # For hierarchical clustering visualization (Dendrograms).

# For natural language processing (NLP) - important for review sentiment analysis
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:

restaurant_df = pd.read_csv('/content/Zomato Restaurant names and Metadata.csv')
print("Restaurant Data Loaded Successfully!")
print(f"Shape of Restaurant Data: {restaurant_df.shape}")

review_df = pd.read_csv('/content/Zomato Restaurant reviews.csv')
print("\nReview Data Loaded Successfully!")
print(f"Shape of Review Data: {review_df.shape}")

### Dataset First View

In [None]:
# Dataset First Look

print("--- First 5 rows of Restaurant Data ---")
print(restaurant_df.head())

print("\n--- First 5 rows of Review Data ---")
print(review_df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print(f"Restaurant Data has {restaurant_df.shape[0]} rows and {restaurant_df.shape[1]} columns.")
print(f"Review Data has {review_df.shape[0]} rows and {review_df.shape[1]} columns.")

### Dataset Information

In [None]:
# Dataset Info

print("--- Information about Restaurant Data ---")
restaurant_df.info()

print("\n--- Information about Review Data ---")
review_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

print("--- Duplicate values in Restaurant Data ---")
print(restaurant_df.duplicated().sum())

print("\n--- Duplicate values in Review Data ---")
print(review_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print("--- Missing values in Restaurant Data ---")
print(restaurant_df.isnull().sum())

print("\n--- Missing values in Review Data ---")
print(review_df.isnull().sum())

In [None]:
# Visualizing the missing values

plt.figure(figsize=(10, 6))
sns.heatmap(restaurant_df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in Restaurant Data')
plt.show()

plt.figure(figsize=(10, 6))
sns.heatmap(review_df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in Review Data')
plt.show()

### What did you know about your dataset?

My Answer :-

From the initial exploration of the Zomato Restaurant and Review datasets, several key characteristics and areas for preprocessing have been identified:

**Restaurant Data (`restaurant_df`):**
* It contains **105 entries and 6 columns**.
* **No duplicate rows** were found, indicating each restaurant entry is unique.
* **Data Types and Missing Values:**
    * `Name`, `Links`, `Cuisines` are complete with **no missing values** and are of `object` (string) type.
    * `Cost` is currently an `object` type. It was observed during the first look that it contains commas (e.g., '1,300'), which needs to be cleaned and **converted to a numerical type** (integer or float) for calculations.
    * `Collections` has **54 missing values** (out of 105), which is a significant portion. This suggests a need for a strategy to handle these missing entries, such as imputation or careful consideration during analysis.
    * `Timings` has **1 missing value**, which is minor and can be easily addressed.

**Review Data (`review_df`):**
* It contains **10,000 entries and 7 columns**.
* **36 duplicate rows** were identified. Because `duplicated()` compares every column, including `Reviewer` and `Time`, these are exact repeats of an entire row rather than identical reviews written by different users. They will be removed to avoid skewing sentiment analysis or other metrics.
* **Data Types and Missing Values:**
    * `Restaurant` and `Pictures` are complete with **no missing values**; `Pictures` is already in `int64` format.
    * `Reviewer`, `Review`, `Rating`, `Metadata`, and `Time` columns all have a small number of **missing values (ranging from 38 to 45)**. The heatmap visualization indicated that these missing values tend to be clustered around specific row indices, suggesting potential systemic omissions.
    * `Rating` is currently an `object` type but represents numerical values. This column will require cleaning and **conversion to a numerical data type**.
    * `Metadata` and `Time` are `object` types and will likely require parsing to extract meaningful numerical features (e.g., number of reviews/followers, date/time components).
    * The `Review` column contains textual data, which will necessitate **extensive text preprocessing** (e.g., lowercasing, punctuation removal, tokenization, lemmatization, and vectorization) for sentiment analysis and text-based clustering.

Overall, both datasets require significant data cleaning and preprocessing, particularly for handling missing values, converting data types, and preparing text for NLP tasks, before in-depth analysis and model building can commence.
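
The text preprocessing steps listed for the `Review` column (lowercasing, punctuation removal, tokenization, stop-word removal) can be sketched as below. This is a minimal, dependency-free illustration: the `clean_review` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins for NLTK's full English stop-word list, and lemmatization and vectorization are left out.

```python
import re
import string

# Hypothetical mini stop-word list for illustration only;
# the notebook itself downloads NLTK's full English list.
STOP_WORDS = {"the", "was", "and", "a", "is", "but", "very"}

def clean_review(text):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "The Biryani was GREAT, but service is very slow!!"
tokens = clean_review(sample)
print(tokens)
```

In the notebook, a function of this kind would typically be applied with `review_df['Review'].apply(...)` after missing reviews have been dropped.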

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

print("--- Columns in Restaurant Data ---")
print(restaurant_df.columns)

print("\n--- Columns in Review Data ---")
print(review_df.columns)

In [None]:
# Dataset Describe

print("--- Statistical description of Restaurant Data ---")
print(restaurant_df.describe(include='all')) # Using include='all' to get description for object type columns too

print("\n--- Statistical description of Review Data ---")
print(review_df.describe(include='all')) # Using include='all' to get description for object type columns too

### Variables Description


Based on the `df.info()` and `df.describe(include='all')` outputs, here's a detailed description of each variable in both datasets:

**Restaurant Data (`restaurant_df`):**

* **`Name`**: (Object) - Represents the name of the restaurant. There are 105 unique restaurant names, indicating no duplicate restaurant entries. This is a nominal categorical variable.
* **`Links`**: (Object) - Contains the URL links to the Zomato page of each restaurant. All 105 links are unique, serving as a unique identifier for each restaurant's Zomato page. This is a nominal categorical variable.
* **`Cost`**: (Object) - Denotes the estimated cost of dining per person. It is currently an `object` (string) type, but it represents numerical values. The most frequent cost is '500'. This column will require cleaning (removing commas) and type conversion to a numerical format for quantitative analysis.
* **`Collections`**: (Object) - Tags or categories assigned to restaurants by Zomato (e.g., "Great Buffets", "Hyderabad's Hottest"). It has 54 missing values out of 105 entries, which is a substantial amount, suggesting careful handling for imputation or exclusion from certain analyses. There are 42 unique collection tags, with "Food Hygiene Rated Restaurants in Hyderabad" being the most frequent. This is a nominal categorical variable, often multi-valued.
* **`Cuisines`**: (Object) - Lists the types of cuisines served by the restaurants. All entries are present. It's an `object` type, and contains comma-separated values, indicating that most restaurants serve multiple cuisines. There are 92 unique combinations of cuisines. This is a nominal categorical variable, multi-valued.
* **`Timings`**: (Object) - Specifies the operating hours of the restaurants. It has 1 missing value. There are 77 unique timing entries, with "11 AM to 11 PM" being the most common. This is a categorical variable.

**Review Data (`review_df`):**

* **`Restaurant`**: (Object) - The name of the restaurant the review is for. There are 100 unique restaurant names, and each appears 100 times, suggesting an even distribution of reviews across these restaurants. This is a nominal categorical variable.
* **`Reviewer`**: (Object) - The name of the person who wrote the review. There are 38 missing values. With 7446 unique reviewers, it suggests a diverse user base, but also that many reviewers contribute only a few reviews (top reviewer has 13 reviews in this subset). This is a nominal categorical variable.
* **`Review`**: (Object) - The actual text content of the review. It has 45 missing values. There are 9364 unique review texts, indicating some duplicate reviews (36 duplicate rows were found overall). The most frequent review text is simply "good". This is textual data, requiring extensive NLP preprocessing for sentiment analysis.
* **`Rating`**: (Object) - The rating provided by the reviewer. It has 38 missing values. It's currently an `object` type but clearly represents numerical ratings (e.g., '5' is the top frequency). There are 10 unique values, implying the presence of non-numeric entries or different rating scales that will require investigation and conversion to a numerical (e.g., float) type.
* **`Metadata`**: (Object) - Contains information about the reviewer, typically the number of reviews and followers (e.g., "1 Review , 2 Followers"). It has 38 missing values. This will need parsing to extract numerical features for each reviewer, which could be useful for identifying influential critics.
* **`Time`**: (Object) - The date and time when the review was posted. It has 38 missing values. This `object` type column needs to be converted to a `datetime` object to enable time-series analysis (e.g., review trends over time).
* **`Pictures`**: (int64) - The number of pictures posted along with the review. This is a numerical feature with no missing values. The data shows a high frequency of reviews with 0 pictures, but also a maximum of 64 pictures, indicating a skewed distribution.
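
The `Time` conversion described above could look roughly like this. The sample strings and the `%m/%d/%Y %H:%M` format are assumptions for illustration; the real column's format should be inspected before a format string is fixed.

```python
import pandas as pd

# Hypothetical sample values; not taken from the dataset.
times = pd.Series(["5/25/2019 15:54", "12/1/2018 9:05", "not a date", None])

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(times, format="%m/%d/%Y %H:%M", errors="coerce")

# Datetime components become available through the .dt accessor.
review_hour = parsed.dt.hour
review_year = parsed.dt.year

print(parsed)
print(review_hour.tolist())
```

Counting `parsed.isna()` after the conversion is a quick way to see how many rows failed to parse, which helps distinguish genuinely missing timestamps from a wrong format string.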

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

print("--- Unique Values for 'Cost' in Restaurant Data ---")
print(restaurant_df['Cost'].unique())

print("\n--- Unique Values for 'Collections' in Restaurant Data (showing first 20 if many) ---")
print(restaurant_df['Collections'].unique()[:20]) # Displaying only first 20 for brevity if there are many

print("\n--- Unique Values for 'Cuisines' in Restaurant Data (showing first 20 if many) ---")
print(restaurant_df['Cuisines'].unique()[:20]) # Displaying only first 20 for brevity

print("\n--- Unique Values for 'Timings' in Restaurant Data (showing first 20 if many) ---")
print(restaurant_df['Timings'].unique()[:20]) # Displaying only first 20 for brevity

print("\n--- Unique Values for 'Rating' in Review Data ---")
print(review_df['Rating'].unique())

print("\n--- Unique Values for 'Metadata' in Review Data (showing first 20 if many) ---")
print(review_df['Metadata'].unique()[:20]) # Displaying only first 20 for brevity

# Optionally, if you want to see all unique values for a column (be careful with very high cardinality)
# print("\n--- All Unique Values for 'Restaurant' in Review Data ---")
# print(review_df['Restaurant'].unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# --- Data Wrangling for Restaurant Data ---

# 1. Clean and convert 'Cost' column to numeric
# Remove commas and convert to integer
restaurant_df['Cost'] = restaurant_df['Cost'].astype(str).str.replace(',', '', regex=False).astype(int)
print("Updated 'Cost' column in restaurant_df to integer type and removed commas.")
print(restaurant_df['Cost'].head()) # Display first few values to verify type change

# 2. Handle missing value in 'Timings'
# Only one value is missing, so fill it with an explicit 'Not Available' label rather than guessing the mode.
# Assign the result back instead of calling fillna(inplace=True) on a column selection, which newer pandas versions warn about.
restaurant_df['Timings'] = restaurant_df['Timings'].fillna('Not Available')
print("\nHandled missing value in 'Timings' column.")
print(restaurant_df['Timings'].isnull().sum()) # Verify no more missing values


# --- Data Wrangling for Review Data ---

# 1. Clean and convert 'Rating' column to numeric
# First, identify and handle the 'Like' string. We'll treat 'Like' as a neutral or average rating, say 3.0.
# A more sophisticated approach might involve sentiment analysis on associated reviews for 'Like' ratings.
review_df['Rating'] = review_df['Rating'].replace('Like', '3.0') # Replace 'Like' with a numerical equivalent
# Now convert to float. Missing values (nan) will remain nan in float.
review_df['Rating'] = pd.to_numeric(review_df['Rating'], errors='coerce')
print("\nUpdated 'Rating' column in review_df to float type and handled 'Like' string.")
print(review_df['Rating'].head()) # Verify type change
print(review_df['Rating'].unique()) # Check unique values again to ensure 'Like' is gone and NaNs are present

# 2. Handle duplicate rows in Review Data
initial_review_rows = review_df.shape[0]
review_df.drop_duplicates(inplace=True)
print(f"\nRemoved {initial_review_rows - review_df.shape[0]} duplicate rows from review_df.")
print(f"New shape of Review Data: {review_df.shape}")

# 3. Handle missing values in Review Data (Reviewer, Review, Rating, Metadata, Time)
# For now, let's drop rows where 'Review' or 'Rating' are missing, as these are critical for sentiment and clustering.
# For other columns, we can consider imputation if necessary, but critical review info is paramount.
review_df.dropna(subset=['Review', 'Rating'], inplace=True)
print("\nRemoved rows with missing 'Review' or 'Rating' values from review_df.")
print(f"New shape of Review Data after dropping critical NaNs: {review_df.shape}")
print(review_df.isnull().sum()) # Check remaining missing values

# 4. Extract numerical features from 'Metadata' (Reviewer's #Reviews and #Followers)
# This requires a function to parse the string like '1 Review , 2 Followers'
def parse_metadata(metadata_str):
    if pd.isna(metadata_str): # Handle NaN values
        return np.nan, np.nan
    reviews_match = re.search(r'(\d+)\s*Review', metadata_str)
    followers_match = re.search(r'(\d+)\s*Follower', metadata_str)
    num_reviews = int(reviews_match.group(1)) if reviews_match else 0
    num_followers = int(followers_match.group(1)) if followers_match else 0
    return num_reviews, num_followers

# Apply the function to the 'Metadata' column
review_df[['Reviewer_Reviews', 'Reviewer_Followers']] = review_df['Metadata'].apply(lambda x: pd.Series(parse_metadata(x)))

print("\nExtracted 'Reviewer_Reviews' and 'Reviewer_Followers' from 'Metadata'.")
print(review_df[['Metadata', 'Reviewer_Reviews', 'Reviewer_Followers']].head())
print(review_df[['Reviewer_Reviews', 'Reviewer_Followers']].isnull().sum()) # Check for NaNs in new columns

### What all manipulations have you done and insights you found?

Updated 'Cost' column in restaurant_df to integer type and removed commas.
0     800
1     800
2    1300
3     800
4    1200
Name: Cost, dtype: int64

Handled missing value in 'Timings' column.
0

Updated 'Rating' column in review_df to float type and handled 'Like' string.
0    5.0
1    5.0
2    5.0
3    5.0
4    5.0
Name: Rating, dtype: float64
[5.  4.  1.  3.  2.  3.5 4.5 2.5 1.5 nan]

Removed 36 duplicate rows from review_df.
New shape of Review Data: (9964, 7)

Removed rows with missing 'Review' or 'Rating' values from review_df.
New shape of Review Data after dropping critical NaNs: (9955, 7)
Restaurant    0
Reviewer      0
Review        0
Rating        0
Metadata      0
Time          0
Pictures      0
dtype: int64

Extracted 'Reviewer_Reviews' and 'Reviewer_Followers' from 'Metadata'.
                  Metadata  Reviewer_Reviews  Reviewer_Followers
0   1 Review , 2 Followers                 1                   2
1  3 Reviews , 2 Followers                 3                   2
2  2 Reviews , 3 Followers                 2                   3
3    1 Review , 1 Follower                 1                   1
4  3 Reviews , 2 Followers                 3                   2
Reviewer_Reviews      0
Reviewer_Followers    0
dtype: int64
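
The `Metadata` parsing can also be written without a row-wise `apply`, using pandas' vectorized `str.extract`. The sketch below is an alternative under stated assumptions, not the notebook's method: the sample strings and the `extract_count` helper are hypothetical, and the helper mirrors `parse_metadata` by returning 0 when a count is absent and NaN when the metadata itself is missing.

```python
import pandas as pd

# Hypothetical sample strings in the same shape as the `Metadata` column.
meta = pd.Series([
    "1 Review , 2 Followers",
    "3 Reviews , 2 Followers",
    "1 Review , 1 Follower",
    "5 Reviews",   # no follower count present
    None,          # missing metadata
])

def extract_count(series, pattern):
    """Pull the first captured number; 0 when absent, NaN when the input is missing."""
    counts = pd.to_numeric(series.str.extract(pattern, expand=False))
    return counts.fillna(0).where(series.notna())

parsed = pd.DataFrame({
    "Reviewer_Reviews": extract_count(meta, r"(\d+)\s*Review"),
    "Reviewer_Followers": extract_count(meta, r"(\d+)\s*Follower"),
})
print(parsed)
```

On a column of roughly ten thousand rows either approach is fast enough; the vectorized form mainly avoids building one `pd.Series` per row.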

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Distribution of Restaurant Costs

In [None]:
# Chart - 1 visualization code

plt.figure(figsize=(10, 6))
sns.histplot(restaurant_df['Cost'], bins=20, kde=True)
plt.title('Distribution of Estimated Cost Per Person (Restaurant Data)', fontsize=16)
plt.xlabel('Cost (INR)', fontsize=12)
plt.ylabel('Number of Restaurants', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

Answer ▶

I picked a **histogram with KDE (Kernel Density Estimate)** for visualizing the 'Cost' column because:
* **Purpose for Numerical Data**: 'Cost' is a numerical variable, and a histogram is the standard chart for showing the distribution of a single numerical variable. It effectively displays the frequency of values within different ranges (bins).
* **Understanding Central Tendency and Spread**: It helps in understanding where most of the data points lie (e.g., the most common cost ranges) and how spread out the data is.
* **KDE for Smoothed Distribution**: The KDE overlay provides a smoothed representation of the distribution, making it easier to see the overall shape and density, and where the peak (mode) of the distribution lies.

##### 2. What is/are the insight(s) found from the chart?

Answer ▶
From the histogram of 'Estimated Cost Per Person':
* **Majority of Restaurants are Budget-Friendly**: The chart clearly shows that a significant majority of restaurants have an estimated cost per person **below INR 1000**. The peak of the distribution appears to be around the INR 500-700 range.
* **Skewed Distribution**: The distribution is **right-skewed**, meaning there's a long tail towards higher costs. While most restaurants are affordable, there are a few higher-end establishments, but they are less frequent.
* **Less frequent High-End Restaurants**: Restaurants with costs exceeding INR 1500 are very few, indicating that the market is dominated by mid-range to affordable dining options.
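
The visual reading of the histogram can be backed with a few summary numbers. The sketch below uses hypothetical cost values (not the dataset) to show the checks: the share of restaurants below INR 1000, mean versus median, and the sample skewness from `Series.skew()`.

```python
import pandas as pd

# Hypothetical costs in INR, for illustration only; not taken from the dataset.
costs = pd.Series([300, 400, 500, 500, 600, 700, 800, 1200, 1500, 2800])

share_below_1000 = (costs < 1000).mean()  # fraction of restaurants under INR 1000
skewness = costs.skew()                   # > 0 indicates a longer right tail

print(f"Share below INR 1000: {share_below_1000:.0%}")
print(f"Mean {costs.mean():.0f} vs median {costs.median():.0f}")
print(f"Sample skewness: {skewness:.2f}")
```

A mean noticeably above the median together with positive skewness is the numerical counterpart of the long right tail seen in the chart; in the notebook the same calls would run on `restaurant_df['Cost']`.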

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer ▶
Yes, the gained insights can definitely create a positive business impact for Zomato and for restaurants:

**Positive Business Impact:**
* **Targeted Marketing for Affordable Options**: Zomato can leverage the insight that most restaurants are in the affordable to mid-range segment (below INR 1000). They can focus marketing campaigns on "Pocket-Friendly Eats," "Best Deals under 500," or "Value for Money" sections, attracting a larger customer base who prefer budget-conscious dining.
* **Expansion Strategy**: Zomato does not open restaurants itself, but this distribution suggests that onboarding more partner restaurants in the INR 500-700 range, especially when entering new cities, would target the dominant market segment and is likely to raise user engagement and order volumes.
* **Restaurant Pricing Strategy**: New restaurants entering the market can use this information to price competitively. If they aim for mass appeal, pricing in the 500-700 INR range seems optimal. If they aim for a niche high-end market, they know it's a smaller, less frequent segment.

**Insights that might hint at negative growth (or areas for improvement):**
* **Limited High-End Offering**: While not necessarily "negative growth," the sparsity of high-cost restaurants (above INR 1500-2000) might indicate a potential untapped market for Zomato in catering to premium dining experiences or fine-dining options. If Zomato aims to attract a more affluent customer segment, they might be "lagging" in this area. This could be seen as a missed opportunity rather than negative growth, as they could partner with more high-end establishments to diversify their offerings and cater to all customer segments.
* **Intense Competition in Mid-Range**: The high density of restaurants in the 500-700 INR range implies intense competition. For Zomato, this means ensuring robust recommendation systems and unique selling propositions for restaurants in this segment to retain user loyalty. For restaurants, it highlights the need for differentiation beyond just price.

#### Chart - 2: Distribution of Ratings (Review Data)

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(10, 6))
sns.countplot(x='Rating', data=review_df, palette='viridis', order=sorted(review_df['Rating'].dropna().unique()))
plt.title('Distribution of Ratings (Review Data)', fontsize=16)
plt.xlabel('Rating', fontsize=12)
plt.ylabel('Number of Reviews', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

Answer ▶
I chose a **Count Plot** to visualize the 'Rating' column because:
* **Categorical/Discrete Variable Distribution**: 'Rating' is a discrete numerical variable (or can be treated as categorical for visualization purposes, as it has a limited set of distinct values). A count plot is ideal for displaying the frequency of each unique value in such a variable.
* **Clear Frequency Comparison**: It directly shows how many reviews fall into each rating category, making it easy to compare the popularity of different ratings.
* **Reveals Customer Satisfaction Tendencies**: It immediately highlights which ratings are most common, giving a quick insight into overall customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Answer ▶

From the count plot of customer ratings:
* **Overwhelmingly Positive Ratings**: The most significant insight is the **dominant frequency of 5.0-star ratings**, followed closely by 4.0-star ratings. This indicates a high level of customer satisfaction on Zomato.
* **Bias Towards Higher Ratings**: The distribution is left-skewed (negatively skewed): ratings concentrate at the high end, with only a thin tail of reviews in the lower categories (1.0, 1.5, 2.0, 2.5). This suggests that customers are generally satisfied or tend to give higher ratings.
* **Least Frequent Ratings**: Ratings like 1.0, 1.5, and 2.0 are the least common, implying genuinely negative experiences are rare or less frequently reported with low scores.
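
The shape of the rating distribution can also be checked numerically rather than by eye. The sketch below uses hypothetical ratings (not the dataset) to compute the share of each rating, the share of reviews at 4.0 or above, and the sample skewness; a concentration at the high end with a thin tail of low ratings shows up as a negative skewness value.

```python
import pandas as pd

# Hypothetical ratings, for illustration only; not taken from the dataset.
ratings = pd.Series([5, 5, 5, 5, 4, 4, 4, 3, 2, 1], dtype=float)

rating_share = ratings.value_counts(normalize=True).sort_index()  # share per rating value
share_high = (ratings >= 4).mean()                                # share of 4.0+ reviews
skewness = ratings.skew()                                         # < 0 means a longer left tail

print(rating_share)
print(f"Share of ratings >= 4: {share_high:.0%}")
print(f"Sample skewness: {skewness:.2f}")
```

In the notebook the same calls would run on `review_df['Rating']` after the numeric conversion.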

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer ▶

Yes, the insights gained from the rating distribution can lead to significant business impact:

**Positive Business Impact:**
* **Strong Brand Reputation**: The prevalence of high ratings (4.0 and 5.0 stars) indicates a strong overall positive perception of restaurants listed on Zomato. This can be used in marketing to attract new users by showcasing high customer satisfaction.
* **Trust and Reliability**: For Zomato, consistently high ratings build user trust in the platform's recommendations and the quality of its partner restaurants. This encourages more food orders and diner visits through the platform.
* **Focus on Strengths**: Restaurants can analyze their own rating distribution. If they align with the positive trend, they can continue to capitalize on their strengths and reinforce what's working well.
* **Identifying High-Performing Restaurants**: For Zomato, identifying restaurants that consistently achieve high ratings (5.0 stars) can lead to "top-rated" or "customer-favorite" collections, driving more business to these establishments and encouraging others to improve.

**Insights that might hint at negative growth (or areas for improvement):**
* **Potential for Rating Bias/Inflation**: The overwhelmingly positive ratings might suggest a potential bias, where customers are generally inclined to give higher scores, or perhaps only very satisfied/dissatisfied customers bother to leave reviews. If the ratings are *too* skewed positive and don't reflect nuanced experiences, it could lead to less discriminative recommendations. Zomato might need to encourage more balanced feedback or provide clearer rating guidelines to capture true sentiment. If users feel ratings aren't trustworthy because everyone gets high ratings, it could negatively impact platform credibility in the long run.
* **Lack of Detailed Negative Feedback**: The low number of genuinely low ratings means less explicit data on specific areas where restaurants (or Zomato's service) are failing. While positive overall, it makes it harder to pinpoint precise areas for improvement and might mask underlying issues that only manifest as "less positive" rather than "negative" ratings. This could hinder targeted efforts to fix problems before they escalate.

#### Chart - 3: Top 10 Most Common Cuisines

In [None]:
# Chart - 3 visualization code

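# Split each comma-separated 'Cuisines' string, trim whitespace, and flatten to one cuisine per row
all_cuisines = restaurant_df['Cuisines'].dropna().apply(lambda x: [cuisine.strip() for cuisine in x.split(',')]).explode()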

# Get the top 10 most common cuisines
top_10_cuisines = all_cuisines.value_counts().head(10)

plt.figure(figsize=(12, 7))
sns.barplot(x=top_10_cuisines.index, y=top_10_cuisines.values, palette='viridis')
plt.title('Top 10 Most Common Cuisines', fontsize=16)
plt.xlabel('Cuisine Type', fontsize=12)
plt.ylabel('Number of Restaurants', fontsize=12)
plt.xticks(rotation=45, ha='right') # Rotate labels for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()

##### 1. Why did you pick the specific chart?

I chose a **Bar Plot** to visualize the 'Top 10 Most Common Cuisines' because:
* **Categorical Frequency Comparison**: A bar plot is ideal for displaying the frequency or count of distinct categories. In this case, it clearly shows how many restaurants offer each of the top cuisines.
* **Ranking and Readability**: It allows for easy comparison and ranking of different cuisine types based on their prevalence. Rotating the x-axis labels ensures readability for longer cuisine names.
* **Insight into Market Dominance**: It helps to quickly identify which cuisines dominate the restaurant landscape in the dataset.

##### 2. What is/are the insight(s) found from the chart?

From the bar plot of the top 10 most common cuisines:
* **North Indian Dominance**: "North Indian" cuisine is by far the most prevalent, offered by a significantly higher number of restaurants compared to any other cuisine. This suggests it's a staple in the region covered by the dataset.
* **Chinese Cuisine is Second Most Popular**: "Chinese" cuisine holds the second position, indicating its widespread popularity.
* **Diverse but Concentrated Market**: While there's a good variety in the top 10, the top two (North Indian and Chinese) show clear market dominance, with a gradual drop-off in frequency for other cuisines like Continental, Asian, Italian, and Mediterranean.
* **Biryani and Fast Food as Key Offerings**: "Biryani" and "Fast Food" also feature prominently in the top cuisines, reflecting common dining preferences.
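
Raw counts can be turned into the share of restaurants offering each cuisine, which makes statements about dominance easier to compare. The sketch below uses a hypothetical four-restaurant sample (not the dataset) and assumes no restaurant lists the same cuisine twice.

```python
import pandas as pd

# Hypothetical 'Cuisines' strings, for illustration only.
cuisines = pd.Series([
    "North Indian, Chinese",
    "Chinese, Continental",
    "North Indian, Biryani",
    "Fast Food",
])

# One row per (restaurant, cuisine) pair, with surrounding whitespace removed.
exploded = cuisines.str.split(",").explode().str.strip()

# Divide by the number of restaurants, not the number of cuisine mentions,
# so each value reads as "fraction of restaurants serving this cuisine".
share = exploded.value_counts() / len(cuisines)
print(share)
```

Because a restaurant can serve several cuisines, these shares do not sum to 1.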

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights about cuisine popularity can significantly impact business decisions:

**Positive Business Impact:**
* **Strategic Expansion for Zomato**: Zomato can use this insight to prioritize expansion into areas or partner with more restaurants offering North Indian and Chinese cuisines, as these are clearly in high demand and appeal to a broad customer base.
* **Restaurant Menu Planning**: New restaurants can identify popular cuisine gaps or competitive landscapes. If they want to cater to mass appeal, offering popular cuisines like North Indian or Chinese (perhaps with a unique twist) would be a strong starting point.
* **Marketing and Recommendations**: Zomato can tailor its marketing campaigns, highlight popular cuisine filters, and improve recommendation algorithms based on user preferences for these dominant cuisines, driving higher engagement and order volumes.
* **Opportunity for Niche Markets**: The less frequent cuisines (e.g., European, Kebab, Asian) represent potential niche markets. Zomato could curate "hidden gems" or "specialty cuisine" collections to cater to diverse tastes and expand their user base beyond the mainstream.

**Insights that might hint at negative growth (or areas for improvement):**
* **High Competition in Dominant Cuisines**: The high number of restaurants offering North Indian and Chinese cuisines implies intense competition within these segments. For Zomato, this means ensuring robust quality control and differentiation among partner restaurants. For restaurants themselves, entering these highly saturated markets without a unique value proposition could lead to slower growth or struggle for visibility amidst many similar offerings. This could lead to price wars or difficulty in customer acquisition if not managed well.
* **Underrepresentation of Emerging Cuisines**: If there are emerging culinary trends not reflected in the top 10, Zomato might be missing out on capturing new market segments. This isn't negative growth directly but a potential missed opportunity to innovate and diversify its offerings before competitors do.
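
The saturation point above can be made concrete by expressing each cuisine as a share of restaurants rather than a raw count. The following is a minimal sketch on made-up data; the column name `Cuisines` and the comma-separated format are assumptions about the restaurant table, and the values are illustrative only.

```python
import pandas as pd

# Illustrative stand-in for the restaurant table's cuisine column (made-up values).
demo_cuisines = pd.Series([
    "North Indian, Chinese",
    "Chinese, Fast Food",
    "North Indian",
    "Biryani, North Indian",
    "Italian",
], name="Cuisines")

# Split the comma-separated strings, one cuisine per row, then count.
cuisine_counts = (
    demo_cuisines.str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

# Fraction of restaurants offering each cuisine (a restaurant can offer several).
share_offering = cuisine_counts / len(demo_cuisines)
print(share_offering.head(3))
```

A share close to 1 for a single cuisine would indicate a crowded segment, while small shares point to the niche cuisines discussed above.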

#### Chart - 4: Relationship between Restaurant Cost and Average Rating

In [None]:
# Chart - 4 visualization code

# Step 1: Calculate the average rating for each restaurant from review_df
# Group by 'Restaurant' and calculate the mean of 'Rating'
average_ratings = review_df.groupby('Restaurant')['Rating'].mean().reset_index()
average_ratings.rename(columns={'Rating': 'Average_Rating'}, inplace=True)

# Step 2: Merge this average_ratings with restaurant_df
# We'll merge on 'Name' from restaurant_df and 'Restaurant' from average_ratings
# Using a left merge to keep all restaurants from restaurant_df, even if no reviews (though unlikely here)
merged_df_for_cost_rating = pd.merge(restaurant_df, average_ratings,
                                     left_on='Name', right_on='Restaurant',
                                     how='left')

# Drop the redundant 'Restaurant' column from the merged_df
merged_df_for_cost_rating.drop('Restaurant', axis=1, inplace=True)

# Step 3: Visualize the relationship between 'Cost' and 'Average_Rating'
plt.figure(figsize=(12, 7))
sns.scatterplot(x='Cost', y='Average_Rating', data=merged_df_for_cost_rating, alpha=0.7, hue='Average_Rating', size='Average_Rating', sizes=(50, 400), palette='viridis')
plt.title('Relationship between Restaurant Cost and Average Rating', fontsize=16)
plt.xlabel('Estimated Cost Per Person (INR)', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# You can also print the correlation coefficient for a numerical measure
correlation = merged_df_for_cost_rating['Cost'].corr(merged_df_for_cost_rating['Average_Rating'])
print(f"\nCorrelation between Cost and Average Rating: {correlation:.2f}")

##### 1. Why did you pick the specific chart?

I chose a **Scatter Plot** to visualize the relationship between 'Estimated Cost Per Person' and 'Average Rating' because:
* **Relationship between Two Numerical Variables**: A scatter plot is the ideal choice for exploring the correlation or relationship between two continuous numerical variables. It effectively shows if there's a positive, negative, or no clear linear trend.
* **Identification of Patterns and Outliers**: It allows for visual identification of clusters, trends, or individual outliers that might deviate from a general pattern.
* **Enhanced Insights with Hue and Size**: By coloring and sizing the points based on 'Average Rating', the plot provides an additional dimension of insight, making it easier to see if higher/lower ratings tend to cluster at specific cost points.

##### 2. What is/are the insight(s) found from the chart?

Answer-

From the scatter plot of 'Cost' vs. 'Average Rating', and the correlation coefficient:
* **Weak-to-Moderate Positive Correlation**: The Pearson correlation coefficient of **0.42** indicates a **weak-to-moderate positive correlation** between the estimated cost per person and the average rating. Average ratings tend to rise with cost, but the relationship is neither strong nor consistent.
* **Concentration in Lower Cost, High Rating**: Most of the restaurants are clustered in the lower cost ranges (e.g., up to INR 1000-1200) with generally high average ratings (4.0-5.0). This aligns with our observation from Chart 2 that most ratings are positive.
* **High-Cost, High-Rating Potential**: While sparse, there are some restaurants in the higher cost brackets (e.g., above INR 1500) that maintain very high average ratings (closer to 5.0). This suggests that some expensive restaurants deliver a premium experience that justifies their cost and receives high praise.
* **Variability at All Cost Levels**: There's still considerable variability in ratings across all cost segments. You can find both highly-rated and moderately-rated restaurants at similar cost points, indicating that cost alone doesn't guarantee a specific level of satisfaction.
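
Because cost is skewed by a few expensive restaurants, a rank-based (Spearman) coefficient can be a useful cross-check on the Pearson value. The sketch below uses a small made-up table in place of `merged_df_for_cost_rating`, so the numbers it prints say nothing about the real dataset. Spearman is computed here as Pearson on ranks, which avoids any extra dependency.

```python
import pandas as pd

# Illustrative stand-in for merged_df_for_cost_rating (made-up values).
demo_df = pd.DataFrame({
    "Cost": [300, 500, 800, 1200, 2800, 600],
    "Average_Rating": [3.4, 3.9, 3.7, 4.1, 4.6, None],  # None = restaurant without reviews
})

# Keep only restaurants that have both a cost and an average rating.
paired = demo_df.dropna(subset=["Cost", "Average_Rating"])

# Pearson measures linear association on the raw values.
pearson = paired["Cost"].corr(paired["Average_Rating"])

# Spearman is the Pearson correlation of the ranks, so it is less
# sensitive to a handful of very expensive restaurants.
spearman = paired["Cost"].rank().corr(paired["Average_Rating"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

If the two coefficients differ noticeably on the real data, the relationship is probably monotonic but not linear, or is driven by a few outliers.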

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer-

Yes, these insights can inform several business decisions:

**Positive Business Impact:**
* **Value Proposition Strategy**: Zomato can highlight "value-for-money" restaurants (lower cost, high rating) to attract budget-conscious customers. They can also curate lists of "premium experiences" (higher cost, high rating) for users seeking fine dining. This allows Zomato to cater to diverse customer segments effectively.
* **Restaurant Investment/Partnership**: For Zomato, this data can guide partnerships. While high-cost restaurants *can* achieve high ratings, it's not a guarantee. The bulk of highly-rated restaurants are in the mid-to-low cost range, indicating a reliable market for positive customer experiences.
* **Customer Expectations Management**: Customers looking for expensive dining should be informed that while there's a slight tendency for higher ratings, high cost doesn't automatically mean a perfect experience; other factors are at play.
* **Guidance for New Restaurants**: A new restaurant aiming for high ratings should focus on quality and experience regardless of their pricing strategy, as cost itself is only weakly correlated with rating. They don't *need* to be expensive to get great reviews.

**Insights that might hint at negative growth (or areas for improvement):**
* **Risk of Overpricing for Value**: If a restaurant charges a high cost but doesn't deliver a proportionally high-quality experience (resulting in a mediocre rating), it could lead to negative customer perception and reviews, harming its business and potentially Zomato's reputation if such restaurants are frequently recommended without careful consideration. This could lead to negative growth for those specific restaurants and decreased customer trust in Zomato's value proposition.
* **Missed Opportunity in High-End Segment**: As observed in the cost distribution, there are fewer high-cost restaurants. If Zomato wants to expand its reach into the luxury dining market, the current data suggests fewer options that consistently receive top-tier ratings at higher price points. This isn't negative growth, but a potential untapped market or a segment where Zomato might need to actively seek out and partner with new, high-quality, high-cost establishments to cater to that demographic.

#### Chart - 5: Relationship between Number of Pictures in Reviews and Average Rating

In [None]:
# Chart - 5 visualization code

# Calculate the average rating for each number of pictures
# We'll use the 'Rating' column from review_df directly, as it's already numerical
avg_rating_by_pictures = review_df.groupby('Pictures')['Rating'].mean().reset_index()

# Since there's a high max value for pictures (64), let's focus on values where most reviews exist
# For instance, filter to reviews with up to 10 pictures, as higher values are very rare (as seen in describe)
avg_rating_by_pictures_filtered = avg_rating_by_pictures[avg_rating_by_pictures['Pictures'] <= 10]

plt.figure(figsize=(12, 7))
sns.lineplot(x='Pictures', y='Rating', data=avg_rating_by_pictures_filtered, marker='o')  # no palette: it has no effect without a hue variable
plt.title('Average Rating by Number of Pictures in Review (Up to 10 Pictures)', fontsize=16)
plt.xlabel('Number of Pictures', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)
plt.xticks(range(0, 11)) # Ensure integer ticks on x-axis
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# You can also look at the correlation coefficient
correlation_pictures_rating = review_df['Pictures'].corr(review_df['Rating'])
print(f"\nCorrelation between Number of Pictures and Rating: {correlation_pictures_rating:.2f}")

##### 1. Why did you pick the specific chart?

Answer -

I chose a **Line Plot** to visualize the relationship between 'Number of Pictures' and 'Average Rating' because:
* **Showing Trends Over a Numerical Range**: A line plot is excellent for illustrating how a numerical value (Average Rating) changes as another numerical value (Number of Pictures) increases. It helps in identifying trends, patterns, and fluctuations.
* **Clarity for Discrete X-axis**: Even though 'Number of Pictures' is discrete, a line plot connects these points, making the overall trend clearer.
* **Reveals Potential Correlation**: It helps to visually assess if there's a positive, negative, or no general relationship between the two variables.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the line plot of 'Average Rating by Number of Pictures' and its correlation:
* **Positive Relationship**: The line plot suggests a positive association between the number of pictures in a review and its rating. Reviews with more pictures tend to have slightly higher average ratings. The strength of the linear relationship is given by the Pearson coefficient printed by the code above (`correlation_pictures_rating`).
* **Reviews with No Pictures have Lower Ratings**: Reviews that contain **0 pictures** generally have the lowest average rating (around 3.55-3.6 on the chart).
* **Peak Satisfaction with Visuals**: The average rating tends to **increase significantly as the number of pictures goes from 0 to 3**. After that, it generally stays high or fluctuates, with particular peaks observed around 3, 8, and 10 pictures. The highest average rating is observed for reviews with 10 pictures.
* **Visual Engagement and Perceived Quality**: This suggests that reviews accompanied by visuals might reflect a more satisfying dining experience, or that users who have a very positive experience are more inclined to share pictures.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -

Yes, the insights from the relationship between pictures and ratings can lead to significant business impact:

**Positive Business Impact:**
* **Encourage Photo Uploads**: Zomato can actively encourage users to upload pictures with their reviews, perhaps through gamification, badges, or contests. Higher visual content could lead to a perception of more helpful and trustworthy reviews, potentially boosting average ratings over time and increasing user engagement.
* **Restaurant Visual Marketing**: Restaurants can be advised to focus on creating "Instagrammable" dishes and ambiance. Knowing that pictures correlate with higher ratings could incentivize them to improve presentation, leading to more positive reviews and customer attraction via Zomato.
* **Enhance Review Authenticity**: More pictures in reviews can enhance the authenticity and reliability of feedback for other users, making Zomato a more trusted platform for decision-making.

**Insights that might hint at negative growth (or areas for improvement):**
* **Quality of Pictures**: While the *number* of pictures correlates positively, the *quality* is not assessed here. If users upload low-quality or irrelevant pictures simply to meet a quota, it could detract from the review experience and dilute the value of this insight. Zomato might need to implement quality checks or guidelines for photo uploads to maintain integrity.
* **Bias in User Behavior**: It's possible that only highly satisfied customers (who are already inclined to give high ratings) are the ones bothering to upload pictures. This means the correlation might not be causal (pictures don't *cause* higher ratings, but highly satisfied customers *tend* to upload pictures). Businesses should be aware of this potential bias and not solely rely on encouraging pictures without ensuring genuine positive experiences. If restaurants falsely believe pictures alone guarantee high ratings without focusing on core experience, it could lead to disappointment and negative feedback if quality is not maintained.
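
One further caveat is sample size: means for high picture counts may rest on very few reviews, so peaks at those counts can be noise. A minimal sketch of reporting the review count next to each mean is shown below. The data is made up, and the `MIN_REVIEWS` threshold is a hypothetical choice, not a value taken from this analysis.

```python
import pandas as pd

# Illustrative stand-in for review_df (made-up values).
demo_reviews = pd.DataFrame({
    "Pictures": [0, 0, 0, 0, 1, 1, 2, 8],
    "Rating":   [3.0, 4.0, 3.5, 3.5, 4.0, 5.0, 4.5, 5.0],
})

# Mean rating together with the number of reviews behind each mean.
summary = (
    demo_reviews.groupby("Pictures")["Rating"]
    .agg(mean_rating="mean", n_reviews="count")
    .reset_index()
)

# Hypothetical threshold: treat means based on fewer reviews as unreliable.
MIN_REVIEWS = 2
summary["reliable"] = summary["n_reviews"] >= MIN_REVIEWS
print(summary)
```

On the real data, plotting only the rows flagged as reliable would show whether the upward trend survives once sparse picture counts are set aside.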

#### Chart - 6: Distribution of Number of Reviews per Reviewer

In [None]:
# Chart - 6 visualization code

# Plotting the distribution of Reviewer_Reviews
plt.figure(figsize=(12, 6))
sns.histplot(review_df['Reviewer_Reviews'], bins=50, kde=True, color='skyblue')
plt.title('Distribution of Number of Reviews per Reviewer', fontsize=16)
plt.xlabel('Number of Reviews', fontsize=12)
plt.ylabel('Number of Reviewers', fontsize=12)
plt.xlim(0, review_df['Reviewer_Reviews'].quantile(0.99)) # Limit x-axis to 99th percentile for better view
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Since the distribution is likely heavily skewed, let's also visualize the log-transformed version
# Add a small constant (e.g., 1) before log to handle zero values
review_df['Reviewer_Reviews_log'] = np.log1p(review_df['Reviewer_Reviews']) # log1p = log(1+x)

plt.figure(figsize=(12, 6))
sns.histplot(review_df['Reviewer_Reviews_log'], bins=50, kde=True, color='lightcoral')
plt.title('Distribution of Log-Transformed Number of Reviews per Reviewer', fontsize=16)
plt.xlabel('Log(1 + Number of Reviews)', fontsize=12)
plt.ylabel('Number of Reviewers', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Also, find the maximum reviews to highlight potential critics
max_reviews = review_df['Reviewer_Reviews'].max()
print(f"\nMaximum number of reviews by a single reviewer: {max_reviews}")

##### 1. Why did you pick the specific chart?

Answer -

I chose **two histograms with KDE (Kernel Density Estimate)** for visualizing the 'Reviewer_Reviews' column (raw and log-transformed) because:
* **Understanding Skewed Numerical Distribution**: The number of reviews per reviewer is a numerical variable that is expected to be highly skewed (many casual users, few prolific ones). A histogram is perfect for showing this distribution.
* **Visualizing Long Tails**: The raw histogram effectively shows the heavy concentration at lower review counts and the long tail of highly active reviewers.
* **Revealing Underlying Patterns with Log Transformation**: Because of the extreme skewness, a log transformation helps to normalize the distribution, making the patterns and central tendencies among the more active reviewers more discernible. The second histogram on the log-transformed data provides a clearer view of the distribution's shape by compressing the range of values.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the histograms of 'Number of Reviews per Reviewer':
* **Dominance of Casual Reviewers**: The first histogram (raw distribution) clearly shows that the **vast majority of reviewers are casual users**, having posted a very low number of reviews (likely between 0 and 25 reviews). The count of reviewers drops sharply as the number of reviews increases.
* **Presence of Prolific Reviewers (Critics)**: Despite the dominance of casual users, there is a long tail in the raw distribution, indicating the presence of a smaller number of highly active and prolific reviewers who have submitted many reviews (the `max_reviews` value printed by the code above quantifies the highest activity). These could be considered "critics" or power users.
* **Better Visual Spread for Active Users (Log-Transformed)**: The log-transformed histogram provides a more spread-out view of the activity levels, revealing more detail about the distribution among those who post more than just a few reviews. It suggests a continuous spectrum of reviewer engagement rather than just extremes.
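
One caveat when reading these histograms: if `review_df` holds one row per review, a prolific reviewer appears once for every review they wrote in the dataset, so the bars count review rows rather than distinct reviewers. A sketch of collapsing to one row per reviewer first is given below. The column name `Reviewer` is an assumption, and the data is made up.

```python
import pandas as pd

# Illustrative stand-in for review_df; 'Reviewer' is an assumed column name.
demo_reviews = pd.DataFrame({
    "Reviewer": ["asha", "asha", "asha", "ravi", "meera", "meera"],
    "Reviewer_Reviews": [120, 120, 120, 3, 15, 15],
})

# Keep one row per reviewer so each person is counted exactly once.
per_reviewer = demo_reviews.drop_duplicates(subset="Reviewer")

print("Review rows:", len(demo_reviews))
print("Distinct reviewers:", len(per_reviewer))
print("Median over review rows:", demo_reviews["Reviewer_Reviews"].median())
print("Median over reviewers:", per_reviewer["Reviewer_Reviews"].median())
```

In this toy example the median drops sharply after deduplication, because the most active reviewer no longer dominates the sample.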

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -

Yes, the insights gained from the distribution of reviewer activity can lead to significant business impact:

**Positive Business Impact:**
* **Identifying and Engaging Key Influencers**: Zomato can identify and engage the prolific reviewers (the "critics" or power users). These individuals, with their high number of reviews, likely hold significant influence. Zomato could offer them special programs, early access to features, or exclusive events to foster loyalty and encourage continued high-quality contributions. This can drive more traffic and trust on the platform.
* **Targeted Content Strategy**: Understanding that most users are casual reviewers means Zomato can optimize its review submission process to be quick and easy, encouraging more users to contribute even small amounts of feedback. For the active users, providing tools for more detailed reviews could be beneficial.
* **Data Quality Assessment**: The presence of many casual reviewers means individual reviews from low-activity users might be less reliable than those from experienced "critics." This insight could be used to weight reviews or highlight reviewers' activity levels to consumers, enhancing the credibility of the platform.

**Insights that might hint at negative growth (or areas for improvement):**
* **Reliance on a Few Prolific Users**: If a large proportion of high-quality, detailed reviews come from a small number of prolific users, it creates a single point of failure. If these key users reduce their activity or leave the platform, the overall quality and volume of valuable review content could decline, potentially impacting user trust and engagement.
* **Difficulty in Distinguishing Trustworthiness**: With many casual reviewers, it might be harder for new users to discern trustworthy or helpful reviews from less informed ones, unless Zomato explicitly highlights reviewer activity or quality scores. If users can't easily find reliable reviews, their trust in the platform might diminish over time.

#### Chart - 7: Distribution of Number of Followers per Reviewer

In [None]:
# Chart - 7 visualization code

# Plotting the distribution of Reviewer_Followers
plt.figure(figsize=(12, 6))
sns.histplot(review_df['Reviewer_Followers'], bins=50, kde=True, color='skyblue')
plt.title('Distribution of Number of Followers per Reviewer', fontsize=16)
plt.xlabel('Number of Followers', fontsize=12)
plt.ylabel('Number of Reviewers', fontsize=12)
plt.xlim(0, review_df['Reviewer_Followers'].quantile(0.99)) # Limit x-axis to 99th percentile for better view
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Since the distribution is likely heavily skewed, let's also visualize the log-transformed version
# Add a small constant (e.g., 1) before log to handle zero values
review_df['Reviewer_Followers_log'] = np.log1p(review_df['Reviewer_Followers']) # log1p = log(1+x)

plt.figure(figsize=(12, 6))
sns.histplot(review_df['Reviewer_Followers_log'], bins=50, kde=True, color='lightcoral')
plt.title('Distribution of Log-Transformed Number of Followers per Reviewer', fontsize=16)
plt.xlabel('Log(1 + Number of Followers)', fontsize=12)
plt.ylabel('Number of Reviewers', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Also, find the maximum followers to highlight highly influential reviewers
max_followers = review_df['Reviewer_Followers'].max()
print(f"\nMaximum number of followers for a single reviewer: {max_followers}")

##### 1. Why did you pick the specific chart?

Answer -

I chose **two histograms with KDE (Kernel Density Estimate)** for visualizing the 'Reviewer_Followers' column (raw and log-transformed) because:
* **Understanding Highly Skewed Numerical Distribution**: The number of followers per reviewer is a numerical variable that is typically extremely skewed (most users have few or no followers, while a handful are highly influential). Histograms are effective for displaying such distributions.
* **Highlighting Dominant Group and Outliers**: The raw histogram clearly shows the overwhelming majority of reviewers with very few followers, and the presence of a few highly influential users as outliers.
* **Revealing Subtleties with Log Transformation**: Due to the extreme skewness, a log transformation helps to normalize the distribution, allowing for a more detailed view of the spread of followers among those who *do* have a following. This provides a clearer understanding of the influence tiers by compressing the scale of extreme values.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the histograms of 'Number of Followers per Reviewer':
* **Vast Majority are Non-Influencers**: The initial histogram unequivocally shows that the **overwhelming majority of Zomato reviewers have very few to no followers**. This indicates that most users are consumers of reviews rather than content creators seeking a following.
* **Presence of a Few Highly Influential Users**: Despite the large base of non-influential users, there is a very long tail in the distribution, signifying the existence of a small number of **highly influential reviewers with a substantial number of followers**. These individuals act as key opinion leaders or food critics on the platform.
* **Distribution of Influence**: The log-transformed histogram, while still skewed, reveals a more gradual decline in the number of reviewers as follower count increases, suggesting a tiered structure of influence from casual users to micro-influencers and then to mega-influencers.
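
The tiered structure described above can be quantified by binning follower counts. This is a minimal sketch with made-up follower counts. The tier edges and labels are hypothetical choices for illustration and would need tuning against the real `Reviewer_Followers` distribution.

```python
import numpy as np
import pandas as pd

# Illustrative follower counts (made-up values).
followers = pd.Series([0, 0, 1, 4, 12, 85, 640, 13000], name="Reviewer_Followers")

# Hypothetical tier edges; each bin is right-inclusive, e.g. (0, 10].
bins = [-1, 0, 10, 100, 1000, np.inf]
labels = ["none", "1-10", "11-100", "101-1000", "1000+"]

tiers = pd.cut(followers, bins=bins, labels=labels)

# Count reviewers per tier, keeping the tiers in ascending order.
tier_counts = tiers.value_counts().reindex(labels)
print(tier_counts)
```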

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -

Yes, the insights gained from the distribution of reviewer followers can lead to significant business impact:

**Positive Business Impact:**
* **Identifying and Cultivating Key Influencers**: Zomato can effectively identify the highly influential reviewers (those with many followers) and implement strategies to engage them. This could involve exclusive invitations, special recognition programs, or partnerships, encouraging them to continue reviewing and promoting the platform. Their reviews can drive significant traffic and trust for restaurants and Zomato.
* **Content Promotion and Credibility**: Reviews from highly followed users can be highlighted by Zomato (e.g., "Critic's Pick," "Trending Reviewer"). This enhances the credibility of the platform and helps users find trustworthy recommendations, leading to increased user engagement and potential conversion.
* **Targeted Content Creation Guidance**: Restaurants can understand that getting reviews from a high-follower user could significantly boost their visibility and reputation. They might seek to provide exceptional experiences that cater to such influential users.

**Insights that might hint at negative growth (or areas for improvement):**
* **Over-reliance on a Few Influencers**: If a disproportionate amount of user engagement and trust stems from a very small number of highly followed reviewers, it presents a risk. If these key influencers reduce their activity or move to other platforms, Zomato's content generation and perceived authority could suffer, potentially leading to decreased user engagement and a slower growth rate.
* **Difficulty for New Reviewers to Gain Traction**: The heavily skewed distribution implies it's challenging for new or less active reviewers to gain followers. This might disincentivize broader participation in the social aspect of the platform, potentially limiting the diversity of opinions and the overall volume of interactive content. If most users feel their reviews don't reach an audience, they might stop contributing.

#### Chart - 8: Number of Late-Night Restaurants

In [None]:
# Chart - 8 visualization code

# Define a heuristic pattern for late-night timings (closing at or after midnight)
# It matches a two-digit number (12-23 or 00-06) immediately followed by 'am' (e.g. '12am'),
# the word 'midnight', or the 'Late Night Restaurants' collection tag
# Limitations: it misses formats such as '2 AM' or '1am' and can match minutes (e.g. the '00am' in '11:00am')
late_night_pattern = r'(?:1[2-9]|2[0-3]|0[0-6])am|midnight|Late Night Restaurants'  # non-capturing group avoids the pandas match-group warning; case is ignored via case=False below

# Create a new column 'is_late_night'
# Fill NaN in 'Timings' first, if not already done (though we did this in wrangling)
restaurant_df['Timings'] = restaurant_df['Timings'].fillna('Not Available')  # assign back; inplace fillna on a column selection is deprecated in recent pandas

# Combine 'Timings' and 'Collections' to capture late night status more comprehensively
# The 'Late Night Restaurants' collection tag directly indicates late night
restaurant_df['is_late_night'] = (
    restaurant_df['Timings'].astype(str).str.contains(late_night_pattern, case=False, na=False) |
    restaurant_df['Collections'].astype(str).str.contains('Late Night Restaurants', case=False, na=False)
)

# Convert boolean to a more descriptive string for plotting
restaurant_df['is_late_night_label'] = restaurant_df['is_late_night'].map({True: 'Late Night', False: 'Not Late Night'})

# Count the occurrences
late_night_counts = restaurant_df['is_late_night_label'].value_counts()

plt.figure(figsize=(8, 6))
sns.barplot(x=late_night_counts.index, y=late_night_counts.values, palette='coolwarm')
plt.title('Number of Late-Night vs. Non-Late-Night Restaurants', fontsize=16)
plt.xlabel('Category', fontsize=12)
plt.ylabel('Number of Restaurants', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Print the counts
print("\nCounts of Late-Night vs. Non-Late-Night Restaurants:")
print(late_night_counts)

##### 1. Why did you pick the specific chart?

Answer -

I chose a **Bar Plot** to visualize the 'Number of Late-Night vs. Non-Late-Night Restaurants' because:
* **Categorical Count Comparison**: This chart is ideal for comparing the counts of discrete categories (Late Night vs. Not Late Night).
* **Clear Visual Representation**: It provides a straightforward and immediate visual comparison of the proportion of restaurants falling into each category.
* **Addressing a Business Question**: It directly answers a practical business question about the availability of late-night dining options.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the bar plot showing late-night restaurant distribution:
* **Limited Late-Night Options**: The chart clearly shows that the majority of restaurants in the dataset (**77 out of 105**) do **not** offer late-night services.
* **Niche Offering**: Only **28 restaurants** are identified as offering late-night options, which makes late-night dining a niche offering within this dataset. Because the flag comes from a text heuristic on `Timings` and `Collections`, the exact count is approximate.
* **Possible Market Gap**: Substantially fewer establishments cater to late-night diners than to regular-hour customers. Whether this is a genuine gap depends on late-night demand, which this dataset does not measure.
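
These counts depend on how closing times are read from the free-text `Timings` field. As an alternative to a broad pattern match, a stricter check could look only at the closing side of each time range. The sketch below is one way to do that. The helper name `closes_late` and the sample strings are hypothetical, and formats such as "24 Hours" are not handled.

```python
import re

# A closing time follows "to" or "-": hour 12 or 1-6, optional minutes, then "am".
# The word "midnight" anywhere in the string also counts as a late closing.
_LATE_CLOSE = re.compile(
    r"(?:\bto\b|-)\s*(?:12|0?[1-6])(?::[0-5]\d)?\s*am\b|midnight",
    re.IGNORECASE,
)

def closes_late(timings: str) -> bool:
    """Return True if any time range in the string ends between 12 AM and 6:59 AM."""
    return bool(_LATE_CLOSE.search(timings))

# Hypothetical sample strings in the style of a free-text timings field.
samples = [
    "12 Noon to 2 AM",
    "7 PM to 1:30 AM (Fri-Sun)",
    "6pm to 12midnight",
    "11 AM to 11 PM",
    "11:15am to 11pm",
    "5am - 11pm",
]
for text in samples:
    print(f"{text!r}: {closes_late(text)}")
```

Anchoring on the closing side means that morning opening times such as "11:15am" or "5am" are not mistaken for late-night hours.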

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -

Yes, the insights about late-night dining availability can lead to significant business impact:

**Positive Business Impact:**
* **Targeted Expansion for Zomato**: Zomato can identify and prioritize partnerships with new restaurants, or encourage existing partners, to expand into late-night operations in areas with high demand for after-hours dining. This could unlock a new revenue stream and cater to a specific customer segment.
* **Customer Service Enhancement**: By clearly tagging and promoting the existing 28 late-night restaurants, Zomato can improve user experience for customers actively searching for such options, making the platform more valuable for specific needs.
* **Restaurant Specialization**: Restaurants that *do* offer late-night service can leverage this as a unique selling proposition in their marketing, attracting customers who specifically require or prefer dining during those hours. Zomato can help these restaurants promote this feature.

**Insights that might hint at negative growth (or areas for improvement):**
* **Missed Market Opportunity**: The low number of late-night restaurants (28 out of 105) may point to unfulfilled demand or a missed market opportunity for Zomato to expand its service hours, although the dataset contains no direct measure of late-night demand. If competitors offer more extensive late-night options, Zomato could lose market share in this segment. This is not direct negative growth, but a missed potential for positive growth.
* **Customer Dissatisfaction for Specific Needs**: Users searching for late-night food might frequently face limited choices on Zomato, leading to frustration and potentially causing them to seek alternatives, which could indirectly lead to decreased platform loyalty or usage for those specific needs.

#### Chart - 9: Average Rating by Top 10 Collections

In [None]:
# Chart - 9 visualization code

# Ensure merged_df_for_cost_rating is available (re-create if runtime was reset)
if 'merged_df_for_cost_rating' not in locals():
    average_ratings = review_df.groupby('Restaurant')['Rating'].mean().reset_index()
    average_ratings.rename(columns={'Rating': 'Average_Rating'}, inplace=True)
    merged_df_for_cost_rating = pd.merge(restaurant_df, average_ratings,
                                         left_on='Name', right_on='Restaurant',
                                         how='left')
    merged_df_for_cost_rating.drop('Restaurant', axis=1, inplace=True)
    print("Re-created merged_df_for_cost_rating for Chart 9.")

# Handle missing 'Collections' for this specific analysis. Filling with 'Not Categorized'
merged_df_for_cost_rating['Collections'] = merged_df_for_cost_rating['Collections'].fillna('Not Categorized')  # assign back instead of inplace


# Explode the 'Collections' column to get individual collection tags
all_collections = merged_df_for_cost_rating.assign(
    Collections=merged_df_for_cost_rating['Collections'].apply(
        lambda x: [tag.strip() for tag in str(x).split(',')]
    )
).explode('Collections')

# Calculate the average rating for each unique collection
avg_rating_by_collection = all_collections.groupby('Collections')['Average_Rating'].mean().sort_values(ascending=False).reset_index()

# Filter out 'Not Categorized' if you don't want to plot it
avg_rating_by_collection_filtered = avg_rating_by_collection[avg_rating_by_collection['Collections'] != 'Not Categorized'].head(10)


plt.figure(figsize=(14, 8))
sns.barplot(x='Average_Rating', y='Collections', data=avg_rating_by_collection_filtered, palette='magma')
plt.title('Average Rating by Top 10 Collections', fontsize=18)
plt.xlabel('Average Rating', fontsize=14)
plt.ylabel('Collection Type', fontsize=14)
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Print the top 10 collections with their average ratings
print("\nTop 10 Collections by Average Rating:")
print(avg_rating_by_collection_filtered)

##### 1. Why did you pick the specific chart?

Answer -

I chose a **Horizontal Bar Plot** to visualize the 'Average Rating by Top 10 Collections' because:
* **Comparing Average Values across Categories**: A bar plot is excellent for comparing a numerical value (Average Rating) across different categorical groups (Collection Types).
* **Readability for Long Labels**: Horizontal bars are particularly useful when category labels (collection names) are long, as they prevent overlap and maintain clarity.
* **Ranking and Prioritization**: It clearly shows the ranking of collections by their average rating, making it easy to identify the best-performing categories from a customer satisfaction perspective.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the horizontal bar plot and table of average ratings by collection:
* **"Hyderabad's Hottest" and "Barbecue & Grill" are Top Performers**: These two collections stand out with the highest average ratings (around 4.6), indicating that restaurants categorized under these collections are consistently providing highly satisfying experiences to customers.
* **"Top-Rated" and "Gold Curated" Live Up to Their Name**: Collections explicitly designed to highlight quality, such as "Top-Rated" and "Gold Curated", indeed show high average ratings (above 4.1), confirming their effectiveness in grouping well-received restaurants.
* **Varied Performance Among Other Categories**: While generally positive, other collections like "Great Buffets" and "New on Gold" have slightly lower average ratings (around 3.96), suggesting a slightly less consistent or universally positive customer experience compared to the top categories.
* **"Food Hygiene Rated Restaurants" is Solid**: This collection also maintains a strong average rating (around 4.0), indicating that hygiene is linked to positive customer perception, as expected.
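A mean computed over only a handful of restaurants can rank highly by chance, so it helps to read each collection's average next to the number of restaurants behind it. The following is a minimal sketch on a small, made-up DataFrame (the names `toy`, `summary`, `reliable`, and the `min_restaurants` threshold are illustrative assumptions, not part of the notebook); the same aggregation could be applied to `all_collections`.

```python
import pandas as pd

# Hypothetical, already-exploded data: one row per (restaurant, collection) pair.
toy = pd.DataFrame({
    "Collections": ["Great Buffets", "Great Buffets", "Great Buffets",
                    "Barbecue & Grill", "Top-Rated", "Top-Rated"],
    "Average_Rating": [3.9, 4.1, 3.8, 4.6, 4.2, 4.0],
})

# Mean rating and number of restaurants per collection, in one pass.
summary = (
    toy.groupby("Collections")["Average_Rating"]
       .agg(mean_rating="mean", n_restaurants="count")
       .sort_values("mean_rating", ascending=False)
)

# Illustrative threshold: keep only collections backed by at least 2 restaurants.
min_restaurants = 2
reliable = summary[summary["n_restaurants"] >= min_restaurants]

print(summary)
print(reliable)
```

In this toy data the highest-rated collection rests on a single restaurant, which is exactly the situation the count column is meant to expose.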

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -

Yes, the insights about average ratings by collection can significantly drive positive business impact:

**Positive Business Impact:**
* **Targeted Marketing and Promotion**: Zomato can heavily promote "Hyderabad's Hottest" and "Barbecue & Grill" collections to users, knowing these are associated with high customer satisfaction. This can drive more orders and visits to restaurants within these categories, boosting Zomato's value proposition.
* **Restaurant Onboarding and Standards**: Zomato can set higher quality benchmarks for restaurants seeking to be part of "Top-Rated" or "Gold Curated" collections, further enhancing the credibility of these tags. It can also use "Barbecue & Grill" as an example of a consistently high-quality category to encourage new restaurant partners.
* **Customer Recommendation Enhancement**: Zomato's recommendation engine can prioritize restaurants from top-performing collections when suggesting options to users, thereby improving user satisfaction and increasing conversion rates.
* **Strategic Growth for Restaurants**: Restaurants can strive to get listed in high-performing collections to increase their visibility and perceived quality. For instance, a new restaurant might aim to offer excellent barbecue to join a highly-rated category.

**Insights that might hint at negative growth (or areas for improvement):**
* **Underperforming Collections**: Collections with relatively lower average ratings (e.g., "Great Buffets" and "New on Gold" within this top 10 subset) might indicate areas where customer expectations are not being consistently met or where the collection criteria might need refinement. Because the top 10 here is ranked by average rating rather than by popularity, these are still among the better-rated collections; even so, Zomato should investigate why they trail the leaders. Ignoring this could lead to customer dissatisfaction if users frequently pick restaurants from these collections and have mediocre experiences, potentially impacting Zomato's overall brand perception.
* **Vague Collection Definitions**: If some collections are too broad or lack specific quality control, they might dilute the value of Zomato's curated lists. This isn't directly negative growth but could reduce the effectiveness of Zomato's segmentation, making it harder for users to trust collection tags implicitly.

#### Chart - 10: Average Rating by Top 10 Most Common Cuisines

In [None]:
# Chart - 10 visualization code

# Ensure merged_df_for_cost_rating is available (re-create if runtime was reset)
if 'merged_df_for_cost_rating' not in locals():
    average_ratings = review_df.groupby('Restaurant')['Rating'].mean().reset_index()
    average_ratings.rename(columns={'Rating': 'Average_Rating'}, inplace=True)
    merged_df_for_cost_rating = pd.merge(restaurant_df, average_ratings,
                                         left_on='Name', right_on='Restaurant',
                                         how='left')
    merged_df_for_cost_rating.drop('Restaurant', axis=1, inplace=True)
    print("Re-created merged_df_for_cost_rating for Chart 10.")


# Explode the 'Cuisines' column to get individual cuisine types.
# Rows with a missing 'Cuisines' value are dropped first so that NaN is not
# turned into a literal 'nan' cuisine by str(x) (none were found during EDA).
cuisine_rows = merged_df_for_cost_rating.dropna(subset=['Cuisines'])
all_cuisines_avg_rating = cuisine_rows.assign(
    Cuisines=cuisine_rows['Cuisines'].apply(
        lambda x: [cuisine.strip() for cuisine in str(x).split(',')]
    )
).explode('Cuisines')

# Calculate the average rating for each unique cuisine
avg_rating_by_cuisine = all_cuisines_avg_rating.groupby('Cuisines')['Average_Rating'].mean().sort_values(ascending=False).reset_index()

# Identify the 10 most common cuisines by restaurant count (same logic as Chart 3).
# They are re-derived here so this cell runs independently of Chart 3's variables.
all_cuisines_exploded_for_count = restaurant_df['Cuisines'].dropna().apply(lambda x: [cuisine.strip() for cuisine in x.split(',')]).explode()
top_10_cuisine_names_by_freq = all_cuisines_exploded_for_count.value_counts().head(10).index.tolist()

# Filter avg_rating_by_cuisine to include only the top 10 most common cuisines
avg_rating_by_top_10_common_cuisines = avg_rating_by_cuisine[
    avg_rating_by_cuisine['Cuisines'].isin(top_10_cuisine_names_by_freq)
].sort_values(by='Average_Rating', ascending=False) # Sort them by average rating

plt.figure(figsize=(14, 8))
sns.barplot(x='Average_Rating', y='Cuisines', data=avg_rating_by_top_10_common_cuisines, palette='Spectral')
plt.title('Average Rating by Top 10 Most Common Cuisines', fontsize=18)
plt.xlabel('Average Rating', fontsize=14)
plt.ylabel('Cuisine Type', fontsize=14)
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Print the top 10 most common cuisines with their average ratings
print("\nAverage Rating for Top 10 Most Common Cuisines:")
print(avg_rating_by_top_10_common_cuisines)

##### 1. Why did you pick the specific chart?

Answer -

I chose a **Horizontal Bar Plot** to visualize the 'Average Rating by Top 10 Most Common Cuisines' because:
* **Comparing Average Values across Categories**: A bar plot is ideal for comparing a numerical value (Average Rating) across different categorical groups (Cuisine Types).
* **Readability for Long Labels**: Horizontal bars are particularly useful when category labels (cuisine names) are long, as they prevent overlap and maintain clarity.
* **Dual Perspective**: This chart provides a crucial dual perspective by combining insights from cuisine popularity (Chart 3) with customer satisfaction, revealing which popular cuisines are also highly (or less) rated.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the horizontal bar plot and table of average ratings by top 10 common cuisines:
* **Highest Rated Common Cuisines**: **Asian (3.92)**, **Continental (3.82)**, and **Italian (3.78)** cuisines achieve the highest average ratings among the top 10 common cuisines, even though they are not the most numerous by restaurant count. This suggests that restaurants offering these cuisines tend to deliver a more satisfying experience.
* **Popularity vs. Rating Discrepancy**: **North Indian (3.60)** and **Chinese (3.46)**, which were identified as the most prevalent cuisines (Chart 3), surprisingly have **lower average ratings** compared to Asian, Continental, and Italian. This suggests that while they are widely available, they might not consistently deliver the same level of customer delight as some other cuisines.
* **Lowest Rated Common Cuisines**: **Biryani (3.38)** and **Fast Food (3.33)** have the lowest average ratings among the top 10 common cuisines, indicating that customer satisfaction might be less consistent or generally lower for these categories.
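The figures above are averages of per-restaurant averages, so every restaurant counts equally regardless of how many reviews it received. Here is a hedged sketch of how that can differ from a review-weighted average, using invented numbers (`toy`, the `n_reviews` column, and both result names are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical per-restaurant data for a single cuisine.
toy = pd.DataFrame({
    "Cuisines": ["Biryani", "Biryani", "Biryani"],
    "Average_Rating": [4.5, 3.0, 3.0],
    "n_reviews": [10, 100, 90],   # assumed review counts per restaurant
})

# Unweighted: each restaurant contributes equally (the approach used for the chart).
unweighted = toy.groupby("Cuisines")["Average_Rating"].mean()

# Weighted: each restaurant contributes in proportion to its review count.
weighted_sum = (toy["Average_Rating"] * toy["n_reviews"]).groupby(toy["Cuisines"]).sum()
total_reviews = toy.groupby("Cuisines")["n_reviews"].sum()
weighted = weighted_sum / total_reviews

print(unweighted["Biryani"], weighted["Biryani"])
```

When review counts are similar across restaurants the two figures nearly coincide; they diverge when a few heavily reviewed restaurants rate very differently from the rest.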

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -

Yes, these insights about average ratings by common cuisines can drive significant business impact:

**Positive Business Impact:**
* **Strategic Restaurant Development**: Zomato can identify and encourage restaurants specializing in **Asian, Continental, and Italian** cuisines, as these consistently yield higher customer satisfaction. This can be a focus for onboarding new high-quality partners.
* **Marketing Differentiation**: Zomato can promote highly-rated cuisines more prominently. For instance, creating "Top-Rated Asian Food" or "Best Italian Restaurants" collections could draw more users.
* **Customer Recommendation Refinement**: The recommendation engine can be enhanced to suggest not just popular cuisines, but also those with higher average satisfaction scores, leading to better user experiences.

**Insights that might hint at negative growth (or areas for improvement):**
* **Quality Gap in Popular Cuisines**: The lower average ratings for **North Indian and Chinese** cuisines, despite their high prevalence, point to a significant quality consistency issue. If customers frequently have mediocre experiences with these popular cuisines, it could erode overall trust in Zomato's restaurant listings and potentially lead to **negative growth** as users seek platforms with more reliable quality indicators. Zomato could work with these restaurants to identify and address common pain points, or develop more stringent quality checks for these highly competitive categories.
* **Addressing Low Satisfaction in Fast Food/Biryani**: The even lower average ratings for **Fast Food and Biryani** suggest these categories might be more prone to inconsistent quality or lower expectations. Zomato and partner restaurants should investigate specific reasons for this, such as delivery issues, food quality, or value perception. Failing to address these could lead to customer churn in these significant market segments.

#### Chart - 11: Number of Reviews Over Time (Monthly/Yearly Trend)

In [None]:
# Chart - 11 visualization code

# Convert 'Time' column to datetime objects
review_df['Time'] = pd.to_datetime(review_df['Time'], errors='coerce')

# Drop rows where 'Time' conversion failed (if any)
review_df.dropna(subset=['Time'], inplace=True)

# Extract Year and Month for time-series analysis
review_df['Review_Year_Month'] = review_df['Time'].dt.to_period('M')

# Count reviews per month
reviews_per_month = review_df['Review_Year_Month'].value_counts().sort_index()

# Convert PeriodIndex to datetime for plotting
reviews_per_month.index = reviews_per_month.index.to_timestamp()

plt.figure(figsize=(15, 7))
sns.lineplot(x=reviews_per_month.index, y=reviews_per_month.values, marker='o', color='teal')
plt.title('Number of Reviews Over Time (Monthly Trend)', fontsize=18)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Number of Reviews', fontsize=14)
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# Also, let's look at yearly review counts for broader trend
review_df['Review_Year'] = review_df['Time'].dt.year
reviews_per_year = review_df['Review_Year'].value_counts().sort_index()

plt.figure(figsize=(10, 6))
sns.barplot(x=reviews_per_year.index, y=reviews_per_year.values, palette='viridis')
plt.title('Number of Reviews Per Year', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Reviews', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


# Print overall review date range
print(f"\nReview data spans from {review_df['Time'].min().date()} to {review_df['Time'].max().date()}")

##### 1. Why did you pick the specific chart?

Answer -

I chose a **Line Plot for Monthly Trend** and a **Bar Plot for Yearly Trend** to visualize the 'Number of Reviews Over Time' because:
* **Line Plot for Time Series**: A line plot is ideal for visualizing trends and changes in a numerical variable (Number of Reviews) over a continuous time axis. It effectively highlights growth, decline, and seasonal patterns.
* **Bar Plot for Aggregate Comparison**: A bar plot for yearly trends provides a clear aggregate comparison of review volume across different years, making it easy to see which years experienced the most activity.
* **Understanding Engagement Evolution**: Together, these charts offer a comprehensive view of how user engagement, specifically review submission, has evolved on the Zomato platform over the observed period.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the charts depicting review activity over time:
* **Rapid Growth in Review Activity**: The most significant insight is the substantial and rapid increase in the number of reviews, particularly from mid-2018 onwards. Prior to this, review volumes were relatively low and stable.
* **2018 as a Turning Point**: The year 2018 marked a major surge in user engagement, with a massive increase in review submissions compared to 2016 and 2017. This suggests a period of significant growth or increased user participation.
* **Sustained High Activity in 2019**: Even though the data for 2019 is incomplete (up to May), the number of reviews for this partial year is comparable to the full year of 2018, indicating sustained and even accelerating growth in review contributions.
* **Peak Periods**: The monthly trend shows clear peaks, notably around mid-2018, followed by a strong rising trend into early 2019 that reaches its highest volumes at the end of the dataset.
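Because 2019 is only partially covered, a like-for-like comparison restricts every year to the same calendar months. A minimal sketch on synthetic timestamps (the dates and the names `toy`, `last_month`, `same_window`, and `reviews_same_window` are illustrative assumptions, not notebook variables):

```python
import pandas as pd

# Synthetic review timestamps: 2018 spans the full year, 2019 stops in May.
times = pd.to_datetime([
    "2018-02-10", "2018-04-03", "2018-07-21", "2018-09-15", "2018-12-30",
    "2019-01-05", "2019-03-12", "2019-05-20",
])
toy = pd.DataFrame({"Time": times})

# Last calendar month observed in the most recent (incomplete) year.
latest_year = toy["Time"].dt.year.max()
last_month = toy.loc[toy["Time"].dt.year == latest_year, "Time"].dt.month.max()

# Keep only months 1..last_month in every year, then count reviews per year.
same_window = toy[toy["Time"].dt.month <= last_month]
reviews_same_window = same_window["Time"].dt.year.value_counts().sort_index()

print(reviews_same_window)
```

Applied to `review_df`, the same filter would compare January to May of each year rather than a partial year against full ones.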

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -
Yes, the insights about review trends over time can lead to significant business impact:

**Positive Business Impact:**
* **Confirmation of Platform Growth/Engagement**: The rapid growth in review numbers is a strong indicator of Zomato's increasing user base and/or enhanced user engagement. This confirms the platform's positive trajectory and can be used to attract investors or further justify marketing efforts.
* **Resource Allocation**: Understanding peak review periods (e.g., mid-2018 onwards, and the continuous rise into 2019) allows Zomato to allocate resources more effectively, such as customer support for reviewers, content moderation teams, or server capacity to handle increased traffic.
* **Campaign Effectiveness**: The sharp rise in reviews could correlate with specific marketing campaigns or feature rollouts by Zomato. Analyzing these correlations can help in optimizing future strategies to boost engagement.
* **Monetization Opportunities**: Higher user engagement through reviews translates to more rich content on the platform, which can attract more users and potentially lead to increased advertising revenue or premium subscription opportunities for businesses.

**Insights that might hint at negative growth (or areas for improvement):**
* **Early Stagnation**: The relatively flat and low review activity in 2016 and early 2017 could indicate a period where Zomato was struggling to gain significant user engagement in the review section. While this has clearly turned around, understanding the reasons for this early stagnation (e.g., lack of user-friendly features, low market penetration) can inform strategies to avoid similar plateaus in other areas. This is a historical insight that can prevent future negative growth.
* **Maintaining Momentum**: While growth is strong, ensuring this momentum is sustained requires continuous effort. If review quality were to decline despite the high volume, it could eventually lead to negative user experience, impacting platform trust.

#### Chart - 12: Average Rating Per Year

In [None]:
# Chart - 12 visualization code

# Ensure 'Time' is datetime and 'Review_Year' exists (normally created in Chart 11),
# so this cell also runs after a runtime reset.
if 'Review_Year' not in review_df.columns:
    review_df['Time'] = pd.to_datetime(review_df['Time'], errors='coerce')
    review_df.dropna(subset=['Time'], inplace=True)
    review_df['Review_Year'] = review_df['Time'].dt.year


# Calculate the average rating per year
average_rating_per_year = review_df.groupby('Review_Year')['Rating'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.barplot(x='Review_Year', y='Rating', data=average_rating_per_year, palette='crest')
plt.title('Average Rating Per Year', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)
plt.ylim(0, 5) # Set y-axis limit from 0 to 5 for rating consistency
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Print the yearly average ratings
print("\nAverage Rating Per Year:")
print(average_rating_per_year)

##### 1. Why did you pick the specific chart?

Answer -

I chose a **Bar Plot** to visualize the 'Average Rating Per Year' because:
* **Comparing Numerical Values Across Categories**: A bar plot is ideal for comparing a numerical value (Average Rating) across distinct categorical groups (Years).
* **Highlighting Trends**: It allows for a clear visual comparison of average ratings year-over-year, making it easy to identify general trends (e.g., increase, decrease, stability).
* **Direct Answer to Business Question**: It directly helps to answer if overall customer satisfaction has changed over time.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the bar plot showing the average rating per year:
* **Initial High Satisfaction, Then Decline**: The average rating was highest in **2016 (3.97)**. It then showed a gradual **decline through 2017 (3.81) to 2018 (3.52)**, suggesting a potential decrease in overall customer satisfaction during this period.
* **Slight Recovery in 2019**: In **2019**, despite being an incomplete year, there was a slight rebound in the average rating to **3.67**, indicating a potential improvement or stabilization in customer satisfaction.
* **Inverse Relationship with Review Volume**: Interestingly, the average rating seems to have decreased during the years when review volume significantly increased (2018 and 2019, as seen in Chart 11), suggesting that as the platform gained more users, the average sentiment became slightly less overwhelmingly positive.
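Since the number of reviews differs greatly between years, yearly means are not equally reliable. One way to make that visible is to report the review count and the standard error of the mean next to each average. The sketch below uses invented ratings (`toy` and `yearly` are hypothetical names); on the real data the same call could be applied to `review_df`.

```python
import pandas as pd

# Hypothetical review-level data: few reviews in 2016, many more in 2018.
toy = pd.DataFrame({
    "Review_Year": [2016, 2016, 2018, 2018, 2018, 2018, 2018, 2018],
    "Rating":      [4.0,  5.0,  3.0,  4.0,  3.5,  4.5,  3.0,  4.0],
})

# Mean rating, number of reviews, and standard error of the mean per year.
yearly = toy.groupby("Review_Year")["Rating"].agg(["mean", "count", "sem"])

print(yearly)
```

A year with few reviews carries a larger standard error, so a high average there is weaker evidence of a real difference than the same gap between two well-sampled years.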

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -

Yes, the insights about average ratings per year can significantly influence business strategies:

**Positive Business Impact:**
* **Performance Monitoring**: This chart serves as a critical KPI (Key Performance Indicator) for Zomato. Monitoring average ratings year-on-year allows the company to track overall customer satisfaction and assess the impact of their strategies and initiatives.
* **Quality Improvement Focus**: If the average rating shows a decline, Zomato can initiate proactive measures. This might involve stricter quality checks for new partner restaurants, encouraging existing restaurants to improve service/food quality, or refining their recommendation algorithms to highlight truly high-quality establishments. The slight recovery in 2019 is a positive sign that recent efforts might be bearing fruit.

**Insights that might hint at negative growth (or areas for improvement):**
* **Declining Satisfaction During Growth Periods**: The most concerning insight is the apparent decline in average ratings from 2016 to 2018, coinciding with a period of massive growth in review volume (as seen in Chart 11). This suggests that as Zomato expanded or gained more users, the overall customer experience might have diluted, or a broader, less exclusively positive, user base started contributing reviews. If this trend of declining satisfaction continued unchecked, it could lead to **negative growth** in terms of customer loyalty and potentially a slowdown in new user acquisition, as users might perceive a general decrease in restaurant quality on the platform. It indicates a need for Zomato to actively manage quality and consistency alongside growth.
* **Risk to Brand Reputation**: A sustained downward trend in average ratings could eventually harm Zomato's brand reputation as a reliable source for quality restaurant information, potentially leading to users migrating to alternative platforms.

#### Chart - 13: Average Rating by Number of Reviewer Reviews

In [None]:
# Chart - 13 visualization code

# Calculate the average rating for each number of reviewer reviews
avg_rating_by_reviewer_reviews = review_df.groupby('Reviewer_Reviews')['Rating'].mean().reset_index()

# Filter out very high review counts to better visualize the main trend, similar to Chart 6
# Let's consider reviewers with up to 100 reviews for clearer visualization, as most are casual.
avg_rating_by_reviewer_reviews_filtered = avg_rating_by_reviewer_reviews[
    avg_rating_by_reviewer_reviews['Reviewer_Reviews'] <= 100
]

plt.figure(figsize=(12, 7))
sns.scatterplot(x='Reviewer_Reviews', y='Rating', data=avg_rating_by_reviewer_reviews_filtered,
                hue='Rating', size='Reviewer_Reviews', sizes=(20, 200), palette='coolwarm', alpha=0.7)
plt.title('Average Rating by Number of Reviewer Reviews (Up to 100 Reviews)', fontsize=16)
plt.xlabel('Number of Reviews by Reviewer', fontsize=12)
plt.ylabel('Average Rating Given', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# Also, calculate the correlation coefficient
correlation_reviewer_reviews_rating = review_df['Reviewer_Reviews'].corr(review_df['Rating'])
print(f"\nCorrelation between Reviewer Reviews and Rating: {correlation_reviewer_reviews_rating:.2f}")

##### 1. Why did you pick the specific chart?

Answer -

I chose a **Scatter Plot** to visualize the relationship between 'Number of Reviews by Reviewer' and 'Average Rating Given' because:
* **Relationship between Two Numerical Variables**: A scatter plot is ideal for examining the correlation and pattern between two continuous or quantitative variables.
* **Identifying Trends and Clusters**: It allows us to visually detect if there is a positive, negative, or no discernible trend, and if certain groups of reviewers (e.g., highly active ones) exhibit distinct rating behaviors.
* **Reinforcing the Pattern with Color and Size**: Color and marker size encode the same two variables as the axes ('Average Rating' and 'Number of Reviewer Reviews' respectively), so they add no third dimension but make high or low ratings and highly active reviewers easier to spot at a glance.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the scatter plot and the correlation coefficient:
* **No Significant Linear Relationship**: The Pearson correlation coefficient of **0.03** indicates an extremely weak, almost negligible, positive linear relationship between the number of reviews a reviewer has written and the average rating they provide. This means that, on average, more prolific reviewers are not consistently harsher or more lenient in their ratings.
* **Varied Rating Behavior Across Activity Levels**: Reviewers across all levels of activity (from a few reviews to many) exhibit a wide range of average ratings. There isn't a clear tendency for highly active reviewers to cluster at either the very high or very low ends of the rating scale.
* **Independent Rating Tendencies**: The insight suggests that a reviewer's overall rating behavior (how high or low they tend to rate) is largely independent of their sheer volume of contributions to the platform.
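Pearson's coefficient only measures linear association and is sensitive to the heavy right tail of review counts. A rank-based (Spearman) coefficient is a common complementary check; it is simply Pearson's coefficient applied to ranks. The sketch below uses invented data to show how the two can differ (`toy`, `pearson`, and `spearman` are illustrative names, and the numbers are not from the Zomato dataset).

```python
import pandas as pd

# Invented data: ratings rise steadily with activity, but one reviewer is an
# extreme outlier in review count (a heavy right tail).
toy = pd.DataFrame({
    "Reviewer_Reviews": [1, 2, 3, 4, 5, 1000],
    "Rating":           [1.0, 2.0, 3.0, 4.0, 4.5, 5.0],
})

# Pearson measures linear association and is pulled around by the outlier.
pearson = toy["Reviewer_Reviews"].corr(toy["Rating"])

# Spearman is Pearson applied to ranks, so it captures any monotonic trend.
spearman = toy["Reviewer_Reviews"].rank().corr(toy["Rating"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

If both coefficients are near zero on the real data, that strengthens the conclusion that rating tendency and reviewer activity are largely unrelated.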

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer -

Yes, these insights about reviewer activity and rating tendencies can have a business impact:

**Positive Business Impact:**
* **Trust in Review Authenticity**: The lack of strong correlation suggests that reviewers, regardless of their activity level, are providing ratings based on their individual experiences rather than adhering to a bias that develops with high volume (e.g., becoming overly critical or lenient). This can enhance user trust in the broad spectrum of reviews on the platform.
* **Focus on Content Quality, Not Just Volume**: Zomato can emphasize efforts to improve the quality and helpfulness of reviews from *all* users, rather than assuming that only highly active users provide the most reliable ratings. This might involve prompting for more detailed feedback or structured review formats.
* **Fairness in Review Moderation**: This insight helps in understanding that a reviewer's total review count isn't an immediate indicator of potential bias in their ratings. Moderation efforts can focus on content rather than assuming bias based on volume.

**Insights that might hint at negative growth (or areas for improvement):**
* **Difficulty in Identifying "Critics" by Rating Tendency Alone**: If Zomato intends to identify "critics" or highly influential reviewers who might be particularly discerning or harsh (or consistently positive), this analysis suggests that simply looking at their average rating won't suffice. There's no clear "signature" in their average rating based on the number of reviews they've given. More sophisticated methods (e.g., sentiment analysis of their text, external validation) would be needed to truly identify such figures, otherwise, a missed opportunity to leverage or mitigate their specific impact could occur.
* **Lost Opportunity for Specific Recommendations**: If very prolific reviewers *did* have a particular rating tendency (e.g., consistently stricter), Zomato could use this to filter reviews or provide more nuanced recommendations ("This restaurant was rated highly by our strictest critics!"). The absence of such a correlation means this specific type of advanced filtering is not directly supported by this feature, potentially limiting highly personalized recommendations based on reviewer temperament.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Select only the numerical columns from review_df for the heatmap
numerical_review_cols = ['Rating', 'Pictures', 'Reviewer_Reviews', 'Reviewer_Followers']
review_numerical_df = review_df[numerical_review_cols]

# Calculate the correlation matrix
correlation_matrix = review_numerical_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap of Numerical Features in Review Data', fontsize=16)
plt.show()

# Note: restaurant-level 'Cost' could be included by merging with restaurant_df
# (e.g., via merged_df_for_cost_rating from Chart 4). This heatmap is limited to
# review_df's numerical features, in line with the project's focus on review
# sentiment and reviewer metadata.

##### 1. Why did you pick the specific chart?

Answer -

I chose a **Correlation Heatmap** to visualize the relationships between numerical features in the review data because:
* **Multivariate Analysis**: It provides a concise visual summary of the pairwise Pearson correlation coefficients between multiple numerical variables simultaneously.
* **Quick Identification of Relationships**: The color intensity and annotated values make it easy to quickly identify strong positive (warm colors), strong negative (cool colors), or negligible (light colors/near zero) linear relationships between variables.
* **Feature Selection Guidance**: It helps in understanding multicollinearity (highly correlated independent variables) which can be important for certain machine learning models, and guides feature selection.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the Correlation Heatmap of Numerical Features in Review Data:
* **Independent Rating Behavior**: 'Rating' shows very weak or negligible positive correlations with 'Pictures' (0.08), 'Reviewer_Reviews' (0.03), and 'Reviewer_Followers' (0.04). This indicates that the rating a reviewer gives is largely independent of how many pictures they upload or how active/influential they are as a reviewer.
* **Reviewer Engagement Correlation**: There is a **moderate positive correlation (0.46)** between 'Reviewer_Reviews' and 'Reviewer_Followers'. This is intuitive: reviewers who post more reviews tend to have more followers, and vice versa.
* **Picture Contribution and Reviewer Activity**: 'Pictures' show weak to moderate positive correlations with 'Reviewer_Reviews' (0.33) and 'Reviewer_Followers' (0.28). This suggests that more active or influential reviewers are slightly more inclined to include pictures in their reviews.
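To read the strongest relationships off a correlation matrix without scanning the heatmap, the upper triangle can be flattened into a ranked list of pairs. The following is a small sketch, assuming a matrix shaped like `correlation_matrix`; the values are the coefficients quoted in the insights above, and `corr`, `upper`, and `pairs` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Stand-in correlation matrix using the coefficients quoted in the insights.
cols = ["Rating", "Pictures", "Reviewer_Reviews", "Reviewer_Followers"]
corr = pd.DataFrame(
    [[1.00, 0.08, 0.03, 0.04],
     [0.08, 1.00, 0.33, 0.28],
     [0.03, 0.33, 1.00, 0.46],
     [0.04, 0.28, 0.46, 1.00]],
    index=cols, columns=cols,
)

# Keep only the strict upper triangle so each pair appears once
# and the diagonal of 1.0 values is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature_a, feature_b) -> correlation, drop the masked cells,
# and sort by absolute value so the strongest pairs come first.
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which matters when the matrix contains them.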

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15 visualization code

# Select the numerical columns from review_df for the pair plot
numerical_review_cols_for_pairplot = ['Rating', 'Pictures', 'Reviewer_Reviews', 'Reviewer_Followers']
review_numerical_df_for_pairplot = review_df[numerical_review_cols_for_pairplot]

# Create the pair plot
# 'palette' is omitted because it has no effect unless a 'hue' variable is given
sns.pairplot(review_numerical_df_for_pairplot, diag_kind='kde', plot_kws={'alpha': 0.6})
plt.suptitle('Pair Plot of Numerical Features in Review Data', y=1.02, fontsize=16) # Add a main title to the whole plot
plt.show()

##### 1. Why did you pick the specific chart?

Answer -

I chose a **Pair Plot** to visualize the numerical features in the review data because:
* **Comprehensive Multivariate Overview**: A pair plot provides a matrix of scatter plots for every pair of numerical variables, along with their individual distributions (histograms/KDEs) on the diagonal. This offers a holistic and efficient way to inspect multiple relationships and distributions in one go.
* **Visual Reinforcement of Correlations**: It visually reinforces the insights gained from individual histograms (Charts 2, 6, 7) and the correlation heatmap (Chart 14), making it easier to grasp the data's overall structure and interdependencies.
* **Identification of Non-Linear Patterns**: While heatmaps only show linear correlations, scatter plots in a pair plot can sometimes reveal non-linear relationships or clusters that might not be captured by a single correlation coefficient.

##### 2. What is/are the insight(s) found from the chart?

Answer -

From the Pair Plot of Numerical Features in Review Data:
* **Individual Distributions Confirmed**: The diagonal KDE plots reaffirm the distributions observed earlier: 'Rating' is concentrated at higher values, while 'Pictures', 'Reviewer_Reviews', and 'Reviewer_Followers' are all heavily right-skewed, with a large concentration at lower values and a long tail.
* **Weak Relationship of Rating with Reviewer Metrics**: The scatter plots involving 'Rating' (with 'Pictures', 'Reviewer_Reviews', and 'Reviewer_Followers') visually confirm the very weak or negligible linear correlations previously identified. There's no clear pattern suggesting that higher review counts or more followers lead to systematically higher or lower ratings.
* **Relationship Between Reviewer Activity and Influence**: The scatter plot between 'Reviewer_Reviews' and 'Reviewer_Followers' visibly shows a positive correlation: as the number of reviews increases, there is a general tendency for the number of followers to also increase. This relationship is more discernible than any relationship involving 'Rating'.
* **Picture Engagement with Reviewer Activity**: There's a slight visual indication that reviewers with more reviews or followers might be more inclined to upload pictures, though the correlation isn't very strong.
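
The right skew described above can be quantified rather than judged by eye. A minimal sketch, using synthetic data in place of `review_df` (the `followers` series and its distribution are assumptions made purely for illustration), compares sample skewness before and after a `log1p` transform:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a heavily right-skewed count column (hypothetical data)
rng = np.random.default_rng(0)
followers = pd.Series(np.floor(rng.lognormal(mean=1.0, sigma=1.5, size=2000)))

raw_skew = followers.skew()            # sample skewness of the raw counts
log_skew = np.log1p(followers).skew()  # skewness after log(1 + x)

print(f"Skewness (raw):   {raw_skew:.2f}")
print(f"Skewness (log1p): {log_skew:.2f}")
```

A large positive skewness that shrinks after the transform is consistent with a long right tail; the same check could be run on `Pictures`, `Reviewer_Reviews`, and `Reviewer_Followers`.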

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer -

Based on the insights gained from the various charts, here are three hypothetical statements that will be tested:



### **Hypothetical Statement - 1:** There is a statistically significant positive relationship between a restaurant's estimated cost per person and its average customer rating.


#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* **Null Hypothesis ($H_0$)**: The population correlation between a restaurant's estimated cost per person and its average customer rating is zero or negative ($\rho \le 0$); that is, there is no positive linear relationship.
* **Alternate Hypothesis ($H_1$)**: The population correlation between a restaurant's estimated cost per person and its average customer rating is positive ($\rho > 0$); that is, there is a positive linear relationship.
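
For reference, a one-sided test of these hypotheses is usually carried out with the statistic below, assuming the pairs are independent and approximately bivariate normal (an assumption this notebook does not verify). With sample correlation $r$ computed from $n$ pairs:

$$
t = \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad p_{\text{one-tailed}} = P\left(T_{n-2} \ge t\right)
$$

where $T_{n-2}$ follows Student's $t$ distribution with $n-2$ degrees of freedom. $H_0$ is rejected at significance level $\alpha$ when $p_{\text{one-tailed}} < \alpha$.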

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import pearsonr

# Ensure merged_df_for_cost_rating is available (re-create if runtime was reset)
# This DataFrame was created in Chart 4.
if 'merged_df_for_cost_rating' not in locals():
    average_ratings = review_df.groupby('Restaurant')['Rating'].mean().reset_index()
    average_ratings.rename(columns={'Rating': 'Average_Rating'}, inplace=True)
    merged_df_for_cost_rating = pd.merge(restaurant_df, average_ratings,
                                         left_on='Name', right_on='Restaurant',
                                         how='left')
    merged_df_for_cost_rating.drop('Restaurant', axis=1, inplace=True)
    print("Re-created merged_df_for_cost_rating for Hypothesis Test.")

# Drop rows with NaN in 'Cost' or 'Average_Rating' before computing the correlation;
# pearsonr requires both inputs to be free of NaNs
df_for_correlation = merged_df_for_cost_rating.dropna(subset=['Cost', 'Average_Rating']).copy()
print(f"Shape of data for correlation after dropping NaNs: {df_for_correlation.shape}")

# Perform Pearson correlation test using the cleaned dataframe
correlation_coefficient, p_value = pearsonr(df_for_correlation['Cost'],
                                            df_for_correlation['Average_Rating'])

# The alternative hypothesis is one-tailed (rho > 0), so convert the two-tailed p-value:
# if r > 0 the one-tailed p-value is p / 2; otherwise it is 1 - p / 2,
# which is at least 0.5 and therefore never supports a positive relationship.
if correlation_coefficient > 0:
    one_tailed_p_value = p_value / 2
else:
    one_tailed_p_value = 1 - p_value / 2

print(f"Pearson Correlation Coefficient: {correlation_coefficient:.3f}")
print(f"Two-tailed P-value: {p_value:.3f}")
print(f"One-tailed P-value (for H1: rho > 0): {one_tailed_p_value:.3f}")

# Set a significance level (alpha)
alpha = 0.05

print(f"\nSignificance Level (alpha): {alpha}")

# Make a conclusion based on the one-tailed p-value
if one_tailed_p_value < alpha:
    print(f"Since the one-tailed P-value ({one_tailed_p_value:.3f}) is less than alpha ({alpha}), we reject the Null Hypothesis.")
    print("Conclusion: There is a statistically significant positive linear relationship between restaurant cost and average customer rating.")
else:
    print(f"Since the one-tailed P-value ({one_tailed_p_value:.3f}) is greater than or equal to alpha ({alpha}), we fail to reject the Null Hypothesis.")
    print("Conclusion: There is no statistically significant positive linear relationship between restaurant cost and average customer rating.")

##### Which statistical test have you done to obtain P-Value?

Answer -

To obtain the P-value for Hypothetical Statement - 1, I performed a **Pearson Correlation Coefficient Test** (specifically, using `scipy.stats.pearsonr`).

##### Why did you choose the specific statistical test?

Answer -

I chose the Pearson Correlation Coefficient Test for the following reasons:
* **Nature of Variables**: The hypothesis concerns the linear relationship between two continuous numerical variables: 'Estimated Cost Per Person' (`Cost`) and 'Average Customer Rating' (`Average_Rating`). Pearson correlation is a standard test for linear association between variables of this type.
* **Measuring Linear Association**: Pearson correlation quantifies the strength and direction of the linear relationship between two variables, which directly aligns with our hypothesis about a "positive relationship."
* **P-value for Significance**: The `pearsonr` function also returns a p-value, which is needed for hypothesis testing. It gives the probability of observing a correlation at least as extreme as the sample value if the population correlation were actually zero, so a small p-value is evidence against the null hypothesis.
* **Directional Hypothesis**: Our alternative hypothesis was specifically for a *positive* relationship, so a one-tailed p-value calculation was applied to assess this directional claim.
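
The one-tailed conversion can be cross-checked against the $t$ distribution. The sketch below uses synthetic data (the `cost` and `rating` arrays are hypothetical stand-ins, not the notebook's columns) and computes the one-sided p-value for $H_1: \rho > 0$ directly from the $t$ statistic with $n - 2$ degrees of freedom:

```python
import numpy as np
from scipy import stats

# Synthetic paired data (hypothetical; stands in for Cost and Average_Rating)
rng = np.random.default_rng(42)
cost = rng.uniform(200, 2500, size=100)
rating = 3.0 + 0.0004 * cost + rng.normal(0, 0.4, size=100)

r, p_two_sided = stats.pearsonr(cost, rating)

# One-sided p-value for H1: rho > 0 from the t statistic with n - 2 degrees of freedom
n = len(cost)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_one_sided = stats.t.sf(t_stat, df=n - 2)

print(f"r = {r:.3f}, two-sided p = {p_two_sided:.4g}, one-sided p = {p_one_sided:.4g}")
```

When `r` is positive this matches half of the two-sided p-value; when `r` is negative the survival function returns `1 - p / 2` automatically, so no special case is needed. Recent SciPy releases also accept an `alternative` argument on `pearsonr`; whether it is available depends on the installed version.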

### **Hypothetical Statement - 2:** More active reviewers (those with a higher number of reviews) tend to include more pictures with their reviews.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* **Null Hypothesis ($H_0$)**: The population correlation between the number of reviews a reviewer has posted and the number of pictures they include in their reviews is zero or negative ($\rho \le 0$); that is, there is no positive linear relationship.
* **Alternate Hypothesis ($H_1$)**: The population correlation between the number of reviews a reviewer has posted and the number of pictures they include in their reviews is positive ($\rho > 0$); that is, there is a positive linear relationship.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import pearsonr

# Ensure 'Reviewer_Reviews' and 'Pictures' columns have no NaNs for the correlation calculation
# Our previous cleaning removed all NaNs from review_df, but it's good practice to be explicit or check.
df_for_correlation_2 = review_df.dropna(subset=['Reviewer_Reviews', 'Pictures']).copy()
print(f"Shape of data for correlation after dropping NaNs: {df_for_correlation_2.shape}")


# Perform Pearson correlation test
correlation_coefficient_2, p_value_2 = pearsonr(df_for_correlation_2['Reviewer_Reviews'],
                                                df_for_correlation_2['Pictures'])

# The alternative hypothesis is one-tailed (rho > 0), so convert the two-tailed p-value:
# if r > 0 the one-tailed p-value is p / 2; otherwise it is 1 - p / 2,
# which is at least 0.5 and therefore never supports a positive relationship.
if correlation_coefficient_2 > 0:
    one_tailed_p_value_2 = p_value_2 / 2
else:
    one_tailed_p_value_2 = 1 - p_value_2 / 2

print(f"Pearson Correlation Coefficient: {correlation_coefficient_2:.3f}")
print(f"Two-tailed P-value: {p_value_2:.3f}")
print(f"One-tailed P-value (for H1: rho > 0): {one_tailed_p_value_2:.3f}")

# Set a significance level (alpha)
alpha = 0.05

print(f"\nSignificance Level (alpha): {alpha}")

# Make a conclusion based on the one-tailed p-value
if one_tailed_p_value_2 < alpha:
    print(f"Since the one-tailed P-value ({one_tailed_p_value_2:.3f}) is less than alpha ({alpha}), we reject the Null Hypothesis.")
    print("Conclusion: There is a statistically significant positive linear relationship between reviewer reviews and pictures included.")
else:
    print(f"Since the one-tailed P-value ({one_tailed_p_value_2:.3f}) is greater than or equal to alpha ({alpha}), we fail to reject the Null Hypothesis.")
    print("Conclusion: There is no statistically significant positive linear relationship between reviewer reviews and pictures included.")

##### Which statistical test have you done to obtain P-Value?

Answer -

To obtain the P-value for Hypothetical Statement - 2, I performed a **Pearson Correlation Coefficient Test** (using `scipy.stats.pearsonr`).

##### Why did you choose the specific statistical test?

Answer -

I chose the Pearson Correlation Coefficient Test for the following reasons:
* **Nature of Variables**: The hypothesis concerns the linear relationship between two numerical variables: 'Number of Reviews by Reviewer' (`Reviewer_Reviews`) and 'Number of Pictures Included' (`Pictures`). Pearson correlation is a standard test for linear association between numerical variables. Both columns are heavily right-skewed counts, so the result should be read as a statement about linear association only.
* **Measuring Linear Association**: Pearson correlation quantifies the strength and direction of the linear relationship between two variables, which directly aligns with our hypothesis about a "positive relationship."
* **P-value for Significance**: The `pearsonr` function also returns a p-value, which is needed for hypothesis testing. It gives the probability of observing a correlation at least as extreme as the sample value if the population correlation were actually zero, so a small p-value is evidence against the null hypothesis.
* **Directional Hypothesis**: Our alternative hypothesis was specifically for a *positive* relationship, so a one-tailed p-value calculation was applied to assess this directional claim.
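
Because both variables are skewed counts, a rank-based coefficient is a natural robustness check alongside Pearson's $r$. This is not part of the analysis above; it is a sketch on synthetic data (the `reviews` and `pictures` arrays are hypothetical) showing how `scipy.stats.spearmanr` could be run next to `pearsonr`:

```python
import numpy as np
from scipy import stats

# Synthetic skewed counts (hypothetical stand-ins for Reviewer_Reviews and Pictures)
rng = np.random.default_rng(7)
reviews = np.floor(rng.lognormal(mean=2.0, sigma=1.2, size=500))
pictures = np.floor(0.05 * reviews + rng.exponential(0.5, size=500))

pearson_r, pearson_p = stats.pearsonr(reviews, pictures)
spearman_rho, spearman_p = stats.spearmanr(reviews, pictures)

print(f"Pearson r    = {pearson_r:.3f} (two-sided p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (two-sided p = {spearman_p:.3g})")
```

If the two coefficients point in the same direction, the conclusion is less likely to be driven by a handful of extreme reviewers.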

### **Hypothetical Statement - 3:** Reviewers with a higher number of followers tend to include more pictures with their reviews.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* **Null Hypothesis ($H_0$)**: The population correlation between the number of followers a reviewer has and the number of pictures they include in their reviews is zero or negative ($\rho \le 0$); that is, there is no positive linear relationship.
* **Alternate Hypothesis ($H_1$)**: The population correlation between the number of followers a reviewer has and the number of pictures they include in their reviews is positive ($\rho > 0$); that is, there is a positive linear relationship.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import pearsonr

# Ensure 'Reviewer_Followers' and 'Pictures' columns have no NaNs for the correlation calculation
df_for_correlation_3 = review_df.dropna(subset=['Reviewer_Followers', 'Pictures']).copy()
print(f"Shape of data for correlation after dropping NaNs: {df_for_correlation_3.shape}")


# Perform Pearson correlation test
correlation_coefficient_3, p_value_3 = pearsonr(df_for_correlation_3['Reviewer_Followers'],
                                                df_for_correlation_3['Pictures'])

# The alternative hypothesis is one-tailed (rho > 0), so convert the two-tailed p-value:
# if r > 0 the one-tailed p-value is p / 2; otherwise it is 1 - p / 2,
# which is at least 0.5 and therefore never supports a positive relationship.
if correlation_coefficient_3 > 0:
    one_tailed_p_value_3 = p_value_3 / 2
else:
    one_tailed_p_value_3 = 1 - p_value_3 / 2

print(f"Pearson Correlation Coefficient: {correlation_coefficient_3:.3f}")
print(f"Two-tailed P-value: {p_value_3:.3f}")
print(f"One-tailed P-value (for H1: rho > 0): {one_tailed_p_value_3:.3f}")

# Set a significance level (alpha)
alpha = 0.05

print(f"\nSignificance Level (alpha): {alpha}")

# Make a conclusion based on the one-tailed p-value
if one_tailed_p_value_3 < alpha:
    print(f"Since the one-tailed P-value ({one_tailed_p_value_3:.3f}) is less than alpha ({alpha}), we reject the Null Hypothesis.")
    print("Conclusion: There is a statistically significant positive linear relationship between reviewer followers and pictures included.")
else:
    print(f"Since the one-tailed P-value ({one_tailed_p_value_3:.3f}) is greater than or equal to alpha ({alpha}), we fail to reject the Null Hypothesis.")
    print("Conclusion: There is no statistically significant positive linear relationship between reviewer followers and pictures included.")

##### Which statistical test have you done to obtain P-Value?

Answer -

To obtain the P-value for Hypothetical Statement - 3, I performed a **Pearson Correlation Coefficient Test** (using `scipy.stats.pearsonr`).

##### Why did you choose the specific statistical test?

Answer -

I chose the Pearson Correlation Coefficient Test for the following reasons:
* **Nature of Variables**: The hypothesis concerns the linear relationship between two numerical variables: 'Number of Followers of the Reviewer' (`Reviewer_Followers`) and 'Number of Pictures Included' (`Pictures`). Pearson correlation is a standard test for linear association between numerical variables. Both columns are heavily right-skewed counts, so the result should be read as a statement about linear association only.
* **Measuring Linear Association**: Pearson correlation quantifies the strength and direction of the linear relationship between two variables, which directly aligns with our hypothesis about a "positive relationship."
* **P-value for Significance**: The `pearsonr` function also returns a p-value, which is needed for hypothesis testing. It gives the probability of observing a correlation at least as extreme as the sample value if the population correlation were actually zero, so a small p-value is evidence against the null hypothesis.
* **Directional Hypothesis**: Our alternative hypothesis was specifically for a *positive* relationship, so a one-tailed p-value calculation was applied to assess this directional claim.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

print("--- Missing values in Restaurant Data (After initial wrangling) ---")
print(restaurant_df.isnull().sum())

print("\n--- Missing values in Review Data (After initial wrangling) ---")
print(review_df.isnull().sum())

# Re-confirming the state of specific columns if they are to be used for clustering/analysis
# 'Collections' in restaurant_df still has NaNs, which we decided to fill with 'Not Categorized' for Chart 9.
# We'll make that permanent now if it wasn't already.
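# Assign the result back: calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
restaurant_df['Collections'] = restaurant_df['Collections'].fillna('Not Categorized')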
print("\nFilled remaining NaNs in 'Collections' column of restaurant_df with 'Not Categorized'.")
print(restaurant_df['Collections'].isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer -

For handling missing values, different strategies were applied based on the nature of the missingness and the specific column:

**1. Filling Missing Values for `Timings` in `restaurant_df`:**
* **Technique Used**: Imputation with a placeholder value. Specifically, the single missing value in the `Timings` column was filled with the string **'Not Available'**.
* **Reasoning**: This column had only one missing entry, making a simple placeholder an effective and non-intrusive way to ensure data completeness without impacting statistical analysis significantly.

**2. Filling Missing Values for `Collections` in `restaurant_df`:**
* **Technique Used**: Imputation with a placeholder category. The 54 missing values in the `Collections` column were filled with the string **'Not Categorized'**.
* **Reasoning**: `Collections` is a categorical column that often contains multiple values. Since these missing values represented a significant portion of the data and we couldn't confidently impute them with a meaningful existing category or mode without potentially distorting the data, assigning a distinct 'Not Categorized' label allowed us to retain these restaurant entries for analysis while clearly marking their unknown collection status. This approach prevents data loss and treats the missingness itself as a category.

**3. Dropping Rows for Critical Missing Values in `review_df` (`Review` and `Rating`):**
* **Technique Used**: Row deletion (`dropna`). Rows where either the `Review` text or the `Rating` was missing were removed from the `review_df`.
* **Reasoning**: These two columns are fundamental for the core objectives of this project, particularly sentiment analysis (requiring `Review` text) and restaurant clustering based on satisfaction (requiring `Rating`). Missing values in these columns would render a review unusable for these primary analyses. Dropping these specific rows ensures that all remaining review entries are complete and directly usable for the intended machine learning tasks. This also implicitly handled missing values in `Reviewer`, `Metadata`, and `Time`, as those NaNs were found to be in the same rows as the critical `Review` or `Rating` NaNs.

**4. Implicit Handling of NaNs during Numerical Feature Extraction (e.g., `Reviewer_Reviews`, `Reviewer_Followers`):**
* **Technique Used**: During the extraction of numerical features from the `Metadata` column, our parsing function was designed to return `np.nan` for any original `Metadata` entries that were `NaN`. These `NaN` values in the newly created numerical columns (`Reviewer_Reviews`, `Reviewer_Followers`) were then implicitly removed when we performed the `dropna` operation on the critical `Review` and `Rating` columns, as they co-occurred.
* **Reasoning**: This systematic approach ensured that the new numerical features derived from text were clean and that corresponding incomplete review entries were appropriately excluded.

By employing these varied techniques, the datasets are now free of missing values, ensuring robustness for subsequent analysis and model building.
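
The two strategies described above, placeholder imputation and row deletion, can be sketched on toy data. The frames and values below are hypothetical; only the column names follow the notebook:

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values) mirroring the columns discussed above
restaurants = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Collections": ["Great Buffets", np.nan, "Late Night Restaurants"],
    "Timings": ["11am to 11pm", "12noon to 12midnight", np.nan],
})
reviews = pd.DataFrame({
    "Restaurant": ["A", "A", "B"],
    "Review": ["Good food", np.nan, "Slow service"],
    "Rating": [4.0, 3.0, np.nan],
})

# Placeholder imputation: assign the filled column back to the frame
restaurants["Collections"] = restaurants["Collections"].fillna("Not Categorized")
restaurants["Timings"] = restaurants["Timings"].fillna("Not Available")

# Row deletion: keep only reviews where both the text and the rating are present
reviews = reviews.dropna(subset=["Review", "Rating"]).reset_index(drop=True)

print(restaurants.isnull().sum().sum(), len(reviews))
```

Placeholders keep every restaurant row available for clustering, while `dropna(subset=...)` removes only the reviews that cannot be used for sentiment analysis.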

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# --- Visualizing Outliers with Box Plots (Before Capping) ---
print("--- Box Plots Before Outlier Treatment ---")
# For restaurant_df
plt.figure(figsize=(8, 6))
sns.boxplot(y=restaurant_df['Cost'])
plt.title('Box Plot of Restaurant Cost', fontsize=16)
plt.ylabel('Cost (INR)', fontsize=12)
plt.show()

# For review_df numerical columns
numerical_cols_review_for_outliers = ['Rating', 'Pictures', 'Reviewer_Reviews', 'Reviewer_Followers']

plt.figure(figsize=(15, 8))
for i, col in enumerate(numerical_cols_review_for_outliers):
    plt.subplot(2, 2, i + 1) # Arrange plots in 2x2 grid
    sns.boxplot(y=review_df[col])
    plt.title(f'Box Plot of {col}')
    plt.ylabel(col)
plt.tight_layout()
plt.show()


# --- Quantifying Outliers (IQR Method) Before Capping ---
print("\n--- Quantifying Outliers (IQR Method) Before Capping ---")
for df_ref, name in zip([restaurant_df, review_df], ['Restaurant Data', 'Review Data']): # Use direct df references
    print(f"\n{name}:")
    for col in (['Cost'] if name == 'Restaurant Data' else numerical_cols_review_for_outliers):
        if col in df_ref.columns:
            Q1 = df_ref[col].quantile(0.25)
            Q3 = df_ref[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            outliers = df_ref[(df_ref[col] < lower_bound) | (df_ref[col] > upper_bound)]

            print(f"  Column '{col}': {len(outliers)} outliers (below {lower_bound:.2f} or above {upper_bound:.2f})")


# --- Outlier Treatment (Capping using 99th percentile) ---
print("\n--- Applying Outlier Treatment (Capping at 99th Percentile) ---")

# Cap 'Cost' in restaurant_df
upper_bound_cost = restaurant_df['Cost'].quantile(0.99)
restaurant_df['Cost'] = np.where(restaurant_df['Cost'] > upper_bound_cost, upper_bound_cost, restaurant_df['Cost'])
print(f"Capped 'Cost' values in restaurant_df above {upper_bound_cost:.2f}.")

# Cap 'Pictures', 'Reviewer_Reviews', 'Reviewer_Followers' in review_df
for col in ['Pictures', 'Reviewer_Reviews', 'Reviewer_Followers']:
    upper_bound = review_df[col].quantile(0.99)
    review_df[col] = np.where(review_df[col] > upper_bound, upper_bound, review_df[col])
    print(f"Capped '{col}' values in review_df above {upper_bound:.2f}.")


# --- Quantifying Outliers (IQR Method) After Capping ---
print("\n--- Quantifying Outliers (IQR Method) After Capping ---")
# IMPORTANT: We need to re-evaluate the IQR *after* capping because Q1/Q3/IQR might change
# for some columns (e.g. Pictures, Reviewer_Reviews, Reviewer_Followers).
# This check is primarily to see if extreme outliers are gone, even if IQR definition of outlier changes.
for df_ref, name in zip([restaurant_df, review_df], ['Restaurant Data', 'Review Data']): # Use direct df references
    print(f"\n{name}:")
    for col in (['Cost'] if name == 'Restaurant Data' else numerical_cols_review_for_outliers):
        if col in df_ref.columns:
            Q1 = df_ref[col].quantile(0.25)
            Q3 = df_ref[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            outliers = df_ref[(df_ref[col] < lower_bound) | (df_ref[col] > upper_bound)]

            print(f"  Column '{col}': {len(outliers)} outliers (below {lower_bound:.2f} or above {upper_bound:.2f})")

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer -

For handling outliers, a combination of visualization, quantification, and **capping (winsorization)** was primarily used for specific numerical columns:

**1. Identification of Outliers:**
* **Visualization**: Box plots were generated for `Restaurant Cost`, `Rating`, `Pictures`, `Reviewer_Reviews`, and `Reviewer_Followers`. These plots clearly indicated the presence of outliers, especially in the heavily right-skewed distributions of `Pictures`, `Reviewer_Reviews`, and `Reviewer_Followers`, where a large number of points extended far beyond the whiskers. `Restaurant Cost` showed a few distinct outliers. `Rating` showed no outliers.
* **Quantification**: The **Interquartile Range (IQR) method** was used to quantify the number of outliers. For highly skewed columns like `Pictures` (1984 outliers), `Reviewer_Reviews` (1393 outliers), and `Reviewer_Followers` (1578 outliers), the IQR method identified a very large proportion of the data as outliers due to the extreme concentration of values at the lower end. `Restaurant Cost` showed 2 outliers.

**2. Outlier Treatment Technique Used: Capping (Winsorization)**
* **Technique Applied**: For `Restaurant Cost`, `Pictures`, `Reviewer_Reviews`, and `Reviewer_Followers`, **capping at the 99th percentile** was applied. This means any data point whose value was above the 99th percentile for its respective column was replaced with the value of the 99th percentile. Values below the lower bound were not capped as they were not a concern for these particular distributions.
* **Reasoning for Capping**:
    * **Mitigating Extreme Influence**: For heavily skewed features (`Pictures`, `Reviewer_Reviews`, `Reviewer_Followers`), extremely high values can disproportionately influence machine learning models (especially distance-based algorithms like K-Means clustering), pulling cluster centroids unnecessarily. Capping reduces this undue influence by bringing extreme values closer to the main distribution without outright removing valuable data.
    * **Preserving Data**: Unlike deletion, capping retains all data points, which is crucial when outliers represent genuine, albeit rare, observations (e.g., highly prolific reviewers or very expensive restaurants). Deleting thousands of rows just because they're "outliers" by a strict statistical definition (like IQR for skewed data) would lead to significant information loss.
    * **Practicality for Skewed Data**: For features like `Pictures`, where 0 is the 75th percentile, both quartiles are 0, the IQR is 0, and the upper fence collapses to 0, so the IQR method labels every value above 0 as an outlier. Capping provides a more practical way to handle the extreme high values in such distributions, making the data more robust for modeling without changing the values of the vast majority of observations.
    * For `Restaurant Cost`, with only 2 outliers, capping effectively addressed these few extreme values, ensuring they don't unduly stretch the data range.
* **No Treatment for `Rating`**: As no outliers were identified in the `Rating` column, no specific outlier treatment was applied.

This capping strategy helps in creating a more robust dataset for model training, particularly important before scaling and clustering.
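
The contrast between the IQR rule and percentile capping can be seen on a small example. The `pictures` series below is hypothetical toy data, and `Series.clip` is used as an equivalent of the `np.where` form in the cell above:

```python
import numpy as np
import pandas as pd

# Toy skewed column (hypothetical): mostly zeros with a few large values
pictures = pd.Series([0] * 90 + [1, 1, 2, 2, 3, 4, 5, 8, 20, 60], dtype=float)

# IQR rule: with Q1 = Q3 = 0 both fences collapse to 0, so every non-zero value is flagged
q1, q3 = pictures.quantile(0.25), pictures.quantile(0.75)
iqr = q3 - q1
flagged = ((pictures < q1 - 1.5 * iqr) | (pictures > q3 + 1.5 * iqr)).sum()

# One-sided capping at the 99th percentile: values above the cap are replaced by the cap
cap = pictures.quantile(0.99)
capped = pictures.clip(upper=cap)

print(f"IQR-flagged: {flagged}, cap: {cap:.2f}, max after capping: {capped.max():.2f}")
```

Capping changes only the values above the 99th percentile and keeps every row, whereas deleting the IQR-flagged rows would discard every observation with at least one picture.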

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# --- Categorical Encoding for Restaurant Data ---

print("--- Encoding 'Collections' (Multi-valued) ---")
# Use str.get_dummies() for multi-valued categorical columns separated by commas.
# Normalize "a, b" to "a,b" first; otherwise ' b' and 'b' would become separate dummy columns
# and stripping the column names afterwards would leave duplicate column names.
collections_clean = restaurant_df['Collections'].str.replace(r'\s*,\s*', ',', regex=True).str.strip()
collections_dummies = collections_clean.str.get_dummies(sep=',')
# Concatenate with original restaurant_df (drop original 'Collections' if needed later)
restaurant_df = pd.concat([restaurant_df, collections_dummies], axis=1)
print(f"Shape after encoding collections: {restaurant_df.shape}")
print(collections_dummies.head())


print("\n--- Encoding 'Cuisines' (Multi-valued) ---")
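# Normalize the separators first so that ' Chinese' and 'Chinese' map to the same dummy column
cuisines_clean = restaurant_df['Cuisines'].str.replace(r'\s*,\s*', ',', regex=True).str.strip()
cuisines_dummies = cuisines_clean.str.get_dummies(sep=',')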
restaurant_df = pd.concat([restaurant_df, cuisines_dummies], axis=1)
print(f"Shape after encoding cuisines: {restaurant_df.shape}")
print(cuisines_dummies.head())


print("\n--- Encoding 'Timings' (Nominal) ---")
# For 'Timings', directly use pd.get_dummies
timings_dummies = pd.get_dummies(restaurant_df['Timings'], prefix='Timing', dtype=int)
restaurant_df = pd.concat([restaurant_df, timings_dummies], axis=1)
print(f"Shape after encoding timings: {restaurant_df.shape}")
print(timings_dummies.head())


print("\n--- Encoding 'is_late_night' (Binary) ---")
# Convert the boolean 'is_late_night' column to a 0/1 integer
restaurant_df['is_late_night_encoded'] = restaurant_df['is_late_night'].astype(int)
print("Added 'is_late_night_encoded' (0/1).")
print(restaurant_df[['is_late_night_label', 'is_late_night_encoded']].head())

# Drop the original categorical columns that have been encoded or are identifiers
# We'll keep 'Name' for now for restaurant identification in later steps if needed.
# 'Links' is an identifier.
restaurant_df.drop(columns=['Collections', 'Cuisines', 'Timings', 'is_late_night_label', 'is_late_night'], inplace=True, errors='ignore')
# 'errors='ignore'' prevents an error if a column (like 'is_late_night') was already dropped.


print("\n--- Final Restaurant Data Sample (after encoding) ---")
print(restaurant_df.head())
print(f"Final shape of restaurant_df: {restaurant_df.shape}")

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer -

For categorical encoding, **One-Hot Encoding** was the primary technique used across several columns in the `restaurant_df`. Most machine learning algorithms, including the distance-based clustering used later, require numerical input, and one-hot encoding converts categorical variables into a numerical format without implying any order or hierarchy between categories.

The specific implementations varied based on whether the categorical column was multi-valued or nominal:

**1. Multi-Valued Categorical Columns (`Collections` and `Cuisines`):**
* **Technique Used**: `Series.str.get_dummies(sep=',')`.
* **Reasoning**: Both `Collections` and `Cuisines` columns contained multiple, comma-separated values within a single entry (e.g., "North Indian, Chinese"). Standard `pd.get_dummies` would treat the entire combined string as a single category. `str.get_dummies(sep=',')` effectively handled this by splitting the string by the delimiter (comma) and creating a separate binary (0/1) column for each individual unique tag or cuisine type. This accurately represents the presence or absence of each specific collection type or cuisine for a given restaurant. This is essential for clustering, as a restaurant might belong to multiple categories simultaneously.

**2. Nominal Categorical Column (`Timings`):**
* **Technique Used**: `pd.get_dummies()`.
* **Reasoning**: The `Timings` column contains various string representations of operating hours. Since these are distinct categories with no inherent order, one-hot encoding was used to convert them into binary features. This allows the clustering algorithm to consider whether a restaurant operates at a particular timing pattern without imposing a false numerical relationship.

**3. Binary Categorical Column (`is_late_night`):**
* **Technique Used**: Type casting to integer (`.astype(int)`).
* **Reasoning**: The `is_late_night` column is a boolean (True/False) flag for whether a restaurant operates late at night, and `is_late_night_label` is its label counterpart. Casting the boolean column to `int` maps `True` to 1 and `False` to 0, which is the numerical equivalent of a binary categorical feature. The result is stored in `is_late_night_encoded`. This is the most straightforward and efficient way to encode such a feature.

After encoding, the original categorical columns (`Collections`, `Cuisines`, `Timings`, `is_late_night_label`, `is_late_night`) were dropped to avoid redundancy and ensure that only the numerical representations are used in downstream models. This expanded `restaurant_df` significantly, creating a rich feature set for restaurant clustering.
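
One detail worth checking with comma-separated tags is the whitespace after each comma. The sketch below uses a hypothetical toy `cuisines` series to show that splitting on `','` alone keeps the leading space, and that normalizing the separator first gives one column per distinct cuisine:

```python
import pandas as pd

# Toy column (hypothetical values) with comma-plus-space separators
cuisines = pd.Series(["North Indian, Chinese", "Chinese", "Italian, North Indian"])

# Splitting on ',' alone keeps the leading space, so ' Chinese' and 'Chinese' become different columns
naive = cuisines.str.get_dummies(sep=",")

# Normalizing the separator first yields one column per distinct cuisine
normalised = cuisines.str.replace(r"\s*,\s*", ",", regex=True).str.strip()
clean = normalised.str.get_dummies(sep=",")

print(sorted(naive.columns))
print(sorted(clean.columns))
```

Each row of `clean` is a multi-hot vector, so a restaurant serving two cuisines has a 1 in both corresponding columns.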

### 4 (a). Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

# Dictionary of common English contractions
contractions_dict = {
    "ain't": "are not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have",
    "could've": "could have", "couldn't": "could not", "don't": "do not", "didn't": "did not",
    "doesn't": "does not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
    "he'd": "he would", "he'll": "he will", "he's": "he is", "how'd": "how did",
    "how'll": "how will", "how's": "how is", "i'd": "I would", "i'll": "I will",
    "i'm": "I am", "i've": "I have", "isn't": "is not", "it'd": "it would",
    "it'll": "it will", "it's": "it is", "let's": "let us", "ma'am": "madam",
    "mightn't": "might not", "mustn't": "must not", "needn't": "need not", "she'd": "she would",
    "she'll": "she will", "she's": "she is", "should've": "should have", "shouldn't": "should not",
    "that's": "that is", "there's": "there is", "they'd": "they would", "they'll": "they will",
    "they're": "they are", "they've": "they have", "we'd": "we would", "we'll": "we will",
    "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will",
    "what're": "what are", "what's": "what is", "when'd": "when did", "when'll": "when will",
    "when's": "when is", "where'd": "where did", "where'll": "where will", "where's": "where is",
    "who'd": "who would", "who'll": "who will", "who's": "who is", "why'd": "why did",
    "why'll": "why will", "why's": "why is", "won't": "will not", "wouldn't": "would not",
    "y'all": "you all", "you'd": "you would", "you'll": "you will", "you're": "you are",
    "you've": "you have"
}

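# Regex for finding contractions.
# Keys are escaped and sorted longest-first so "can't've" is tried before "can't";
# the \b anchors keep short keys such as "i'd" from matching inside longer words.
import re

contractions_re = re.compile(
    r"\b(%s)\b" % "|".join(re.escape(k) for k in sorted(contractions_dict, key=len, reverse=True)),
    re.IGNORECASE,
)

def expand_contractions(text, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0).lower()]
    return contractions_re.sub(replace, text)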

# Apply contraction expansion to the 'Review' column
# Make sure to work on a copy to avoid SettingWithCopyWarning if you filter review_df later
review_df['Review_Cleaned'] = review_df['Review'].astype(str).apply(expand_contractions)

print("Expanded contractions in 'Review_Cleaned' column.")
print("Original Review example:")
print(review_df['Review'].iloc[3]) # Example before
print("\nCleaned Review example (after contraction expansion):")
print(review_df['Review_Cleaned'].iloc[3]) # Example after
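
Reviews typed on phones often use typographic apostrophes (`’`) rather than the ASCII `'` that the dictionary keys contain, in which case no contraction would match. The sketch below shows one way to normalize apostrophes before expansion. The three-entry dictionary and the names `SMALL_CONTRACTIONS`, `APOSTROPHE_VARIANTS`, and `normalize_and_expand` are hypothetical and only serve the illustration; whether this dataset actually contains such characters has not been checked here.

```python
import re

# Hypothetical, deliberately tiny mapping; the notebook's full dictionary would be used in practice
SMALL_CONTRACTIONS = {"can't": "cannot", "it's": "it is", "don't": "do not"}

# Typographic quotes (U+2019, U+2018) and the backtick sometimes stand in for "'"
APOSTROPHE_VARIANTS = str.maketrans({"\u2019": "'", "\u2018": "'", "`": "'"})

# Longest keys first so a longer contraction is never shadowed by a shorter one
_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(SMALL_CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def normalize_and_expand(text):
    """Map apostrophe variants to "'" and then expand the known contractions."""
    text = text.translate(APOSTROPHE_VARIANTS)
    return _pattern.sub(lambda m: SMALL_CONTRACTIONS[m.group(0).lower()], text)

print(normalize_and_expand("It\u2019s great, but I can\u2019t finish it"))
```

The same `translate` call could be applied to the `Review` column before `expand_contractions` if such characters turn out to be present.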

#### 2. Lower Casing

In [None]:
# Lower Casing

# Convert the 'Review_Cleaned' column to lowercase
review_df['Review_Cleaned'] = review_df['Review_Cleaned'].astype(str).str.lower()

print("Converted 'Review_Cleaned' column to lowercase.")
print("Cleaned Review example (after lower casing):")
print(review_df['Review_Cleaned'].iloc[3]) # Example after lowercasing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

# Define a function to remove punctuation
def remove_punctuation(text):
    # Use str.translate to efficiently remove all punctuation characters
    # str.maketrans creates a translation table
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

# Apply the function to the 'Review_Cleaned' column
review_df['Review_Cleaned'] = review_df['Review_Cleaned'].astype(str).apply(remove_punctuation)

print("Removed punctuations from 'Review_Cleaned' column.")
print("Cleaned Review example (after removing punctuations):")
print(review_df['Review_Cleaned'].iloc[3]) # Example after removing punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

def remove_urls_and_digits(text):
    # Remove URLs that still carry their punctuation (e.g., http://..., https://..., www....)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Punctuation was already stripped in step 3, so URLs in this column have collapsed
    # into single tokens such as 'httpswwwexamplecom'; remove those as well
    text = re.sub(r'\b(?:https?|www)\w+\b', '', text)
    # Remove words containing at least one digit (e.g., 'item1', 'pizza2', '100rs')
    text = re.sub(r'\b\w*\d\w*\b', '', text)
    return text

# Apply the function to the 'Review_Cleaned' column
review_df['Review_Cleaned'] = review_df['Review_Cleaned'].astype(str).apply(remove_urls_and_digits)

print("Removed URLs and words containing digits from 'Review_Cleaned' column.")
print("Cleaned Review example (after removing URLs and words with digits):")
print(review_df['Review_Cleaned'].iloc[3]) # Example after this step
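
Because punctuation is removed in step 3, a pattern such as `https?://\S+` has no `://` or `.` left to match by the time this step runs. The sketch below contrasts the two orderings on a made-up review string; the function names and the sample text are assumptions for illustration only.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punct_then_urls(text):
    """Punctuation first, URL pattern second (the order of steps 3 and 4)."""
    return URL_RE.sub("", text.translate(PUNCT_TABLE))

def strip_urls_then_punct(text):
    """Alternative order: remove URLs while '://' and '.' are still present."""
    return URL_RE.sub("", text).translate(PUNCT_TABLE)

sample = "menu at https://example.com/menu was great!"  # made-up review text
print(strip_punct_then_urls(sample))   # the collapsed URL token survives
print(strip_urls_then_punct(sample))   # the URL is gone before punctuation is stripped
```

Running URL removal before punctuation removal is one way to avoid the problem; the leftover double space is handled by the whitespace step later in the pipeline.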

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

from nltk.corpus import stopwords
import nltk

# Download the stopword list if NLTK raises a LookupError (resource not found).
# nltk.download('stopwords') # Likely already run in the setup cell; if not, uncomment and run.

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = text.split() # Split text into words
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return " ".join(filtered_tokens)

# Apply the function to the 'Review_Cleaned' column
review_df['Review_Cleaned'] = review_df['Review_Cleaned'].astype(str).apply(remove_stopwords)

print("Removed stopwords from 'Review_Cleaned' column.")
print("Cleaned Review example (after removing stopwords):")
print(review_df['Review_Cleaned'].iloc[3]) # Example after removing stopwords
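
NLTK's English stopword list includes negations such as "not" and "no", so after this step "not good" and "good" become indistinguishable, which matters for sentiment analysis. The sketch below shows one possible adjustment: subtract a set of negation words from the stopword set before filtering. `BASE_STOPWORDS` is a small illustrative subset rather than the NLTK list, and the choice of `NEGATIONS` is an assumption, not something the notebook does.

```python
# Illustrative stopword subset (assumption); the notebook uses NLTK's full English list
BASE_STOPWORDS = {"the", "is", "was", "a", "and", "not", "no", "very"}
# Words kept because they flip or qualify sentiment (hypothetical choice)
NEGATIONS = {"not", "no", "nor", "never"}

SENTIMENT_STOPWORDS = BASE_STOPWORDS - NEGATIONS

def remove_stopwords_keep_negations(text):
    """Drop stopwords but keep negation words in place."""
    return " ".join(w for w in text.split() if w not in SENTIMENT_STOPWORDS)

print(remove_stopwords_keep_negations("the food was not good and the service is slow"))
```

With the real list, the equivalent change would be `stop_words - NEGATIONS` before calling `remove_stopwords`.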

In [None]:
# Remove White spaces

def remove_extra_spaces(text):
    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text)
    # Strip leading/trailing spaces
    text = text.strip()
    return text

# Apply the function to the 'Review_Cleaned' column
review_df['Review_Cleaned'] = review_df['Review_Cleaned'].astype(str).apply(remove_extra_spaces)

print("Removed extra white spaces from 'Review_Cleaned' column.")
print("Cleaned Review example (after removing white spaces):")
print(review_df['Review_Cleaned'].iloc[3]) # Example after removing white spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

# No rephrasing is applied here; this step only strips any remaining leading/trailing whitespace.
review_df['Review_Cleaned'] = review_df['Review_Cleaned'].astype(str).str.strip()
# Reviews that became empty after cleaning are kept as empty strings;
# the vectorization step represents them as all-zero rows.

print("Ensured 'Review_Cleaned' column is stripped of any remaining outer whitespace.")
# Example review should look the same as last step's if no extra spaces were there
print("Cleaned Review example (after rephrasing/final strip):")
print(review_df['Review_Cleaned'].iloc[3]) # Example after this step

#### 7. Tokenization

In [None]:
# Tokenization

def tokenize_text(text):
    # A whitespace split is used instead of an NLTK tokenizer: contractions, punctuation,
    # and stopwords have already been handled, so tokens are separated by single spaces.
    return text.split()

# Apply the tokenization function to the 'Review_Cleaned' column
review_df['Review_Tokens'] = review_df['Review_Cleaned'].astype(str).apply(tokenize_text)

print("Tokenized 'Review_Cleaned' column into 'Review_Tokens' using basic split().")
print("Cleaned Review example (after tokenization):")
print(review_df['Review_Tokens'].iloc[3]) # Example after tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return lemmatized_tokens


review_df['Review_Normalized'] = review_df['Review_Tokens'].apply(lemmatize_tokens)

print("Lemmatized tokens in 'Review_Normalized' column.")
print("Cleaned Review example (after lemmatization):")
print(review_df['Review_Normalized'].iloc[3]) # Example after lemmatization

##### Which text normalization technique have you used and why?

Answer -

For text normalization, **Lemmatization** was the technique employed using NLTK's `WordNetLemmatizer`.

**Reasoning for choosing Lemmatization:**
* **Meaningful Base Forms**: Lemmatization reduces words to their dictionary or base form (lemma). When the part of speech is supplied, different inflections of a word (e.g., "running," "runs," "ran") are all reduced to a single, valid word ("run"). Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a POS tag is passed, so the call above leaves verb forms such as "running" and "ran" unchanged; the POS-aware pass in the next step addresses this. Lemmatization is generally preferred over stemming, which can produce non-dictionary stems (e.g., the Porter stemmer yields "beauti" for both "beautiful" and "beauty").
* **Reduced Vocabulary Size**: By converting various word forms to a single lemma, the overall vocabulary size is reduced, which can improve the efficiency and performance of text analysis models.
* **Improved Semantic Understanding**: For tasks like sentiment analysis, having words in their base forms helps group related terms; for example, with an adjective POS tag the comparative "better" can be mapped to "good."
* **Better Readability and Interpretability**: The output of lemmatization is typically actual words, which makes the processed text more readable and interpretable for human review compared to the crude stems produced by stemming.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

import nltk
from nltk.tag import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Sanity check: the tagger resource is named 'averaged_perceptron_tagger' in older NLTK
# releases and 'averaged_perceptron_tagger_eng' in newer ones, so either one is sufficient.
def tagger_available():
    for resource in ('taggers/averaged_perceptron_tagger_eng', 'taggers/averaged_perceptron_tagger'):
        try:
            nltk.data.find(resource)
            return True
        except LookupError:
            continue
    return False

if tagger_available():
    print("NLTK 'averaged_perceptron_tagger' data found.")
else:
    print("NLTK 'averaged_perceptron_tagger' data not found. This should have been downloaded in the first cell.")


# Function to convert NLTK POS tags to WordNet POS tags for lemmatizer
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # Default to noun if not found


# Initialize Lemmatizer (if not already global)
lemmatizer = WordNetLemmatizer()

# Apply POS tagging and then re-apply lemmatization with POS tags
def lemmatize_tokens_with_pos(tokens):
    pos_tagged_tokens = pos_tag(tokens)
    lemmatized_tokens = []
    for word, tag in pos_tagged_tokens:
        wordnet_tag = get_wordnet_pos(tag) # Convert NLTK tag to WordNet tag
        lemmatized_tokens.append(lemmatizer.lemmatize(word, wordnet_tag))
    return lemmatized_tokens

# Apply the function to the 'Review_Tokens' column, storing in a new column
review_df['Review_Normalized_POS'] = review_df['Review_Tokens'].apply(lemmatize_tokens_with_pos)

print("Applied POS tagging and re-lemmatized tokens in 'Review_Normalized_POS' column.")
print("Original Tokens example:")
print(review_df['Review_Tokens'].iloc[3])
print("\nLemmatized Tokens with POS Tagging example:")
print(review_df['Review_Normalized_POS'].iloc[3])

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

from sklearn.feature_extraction.text import TfidfVectorizer

review_df['Review_Final_Text_POS'] = review_df['Review_Normalized_POS'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')

tfidf_vectorizer_pos = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.8)

tfidf_matrix_pos = tfidf_vectorizer_pos.fit_transform(review_df['Review_Final_Text_POS'])

print("TF-IDF Vectorization complete using POS-aware normalized text.")
print(f"Shape of TF-IDF matrix: {tfidf_matrix_pos.shape}")
print(f"Number of unique words (features) extracted: {len(tfidf_vectorizer_pos.get_feature_names_out())}")
print("\nSample TF-IDF values for the first review (sparse matrix format):")
print(tfidf_matrix_pos[0])

##### Which text vectorization technique have you used and why?

Answer -

For text vectorization, **TF-IDF (Term Frequency-Inverse Document Frequency) Vectorization** was the technique chosen, implemented using `sklearn.feature_extraction.text.TfidfVectorizer`.

**Reasoning for choosing TF-IDF Vectorization:**
* **Converting Text to Numerical Format**: Machine learning models require numerical input. TF-IDF effectively transforms raw text into a matrix of numerical values, where each row represents a document (review) and each column represents a word from the vocabulary, with cell values being their TF-IDF scores.
* **Capturing Word Importance**: TF-IDF assigns weights to words based on two factors:
    * **Term Frequency (TF)**: How often a word appears in a specific document.
    * **Inverse Document Frequency (IDF)**: How rare a word is across the entire collection of documents.
    This means words that are very common in a particular review but rare across all reviews (e.g., specific restaurant names, unique descriptive adjectives) get higher TF-IDF scores, making them more discriminative. Conversely, very common words (even after stopwords removal) that appear in many documents get lower scores.
* **Reduced Noise and Dimensionality Management**: Parameters like `max_features`, `min_df`, and `max_df` were used to control the vocabulary size, filtering out extremely rare or overly common words that might not be useful features, thus reducing noise and managing the dimensionality of the resulting matrix.
* **Suitability for Sentiment Analysis and Clustering**: TF-IDF is widely recognized as a robust and effective text representation for tasks such as sentiment analysis (where identifying important distinguishing words is key) and text clustering (where reviews with similar important words should be grouped together). It often outperforms simple Bag-of-Words (Count Vectorization) by emphasizing more significant terms.
* **Handling Sparsity**: The resulting TF-IDF matrix is typically sparse (most values are zero), which is efficiently handled by `TfidfVectorizer` and is beneficial for memory management and computational efficiency in subsequent modeling steps.
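
To make the TF and IDF weighting above concrete, here is a minimal pure-Python sketch on three made-up reviews. It uses the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is, as far as I understand, what `TfidfVectorizer` does with its default settings; the function name `tfidf` and the sample reviews are assumptions, and vocabulary filtering (`max_features`, `min_df`, `max_df`) is left out.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, each L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weighted = []
    for tokens in tokenised:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Guard against empty documents, which would otherwise divide by zero
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weighted.append({t: v / norm for t, v in raw.items()})
    return weighted

reviews = ["good food good service", "good ambience", "slow service"]  # made-up examples
weights = tfidf(reviews)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first review, "food" and "service" both occur once, but "food" appears in fewer reviews overall and therefore receives the larger weight.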

### 4 (b). Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

print("--- Feature Manipulation: Aggregating Review Data to Restaurant Level ---")

# Step 1: Aggregate numerical review features by restaurant

restaurant_review_features_agg = review_df.groupby('Restaurant')[['Rating', 'Pictures', 'Reviewer_Reviews', 'Reviewer_Followers']].mean().reset_index()
restaurant_review_features_agg.rename(columns={'Rating': 'Avg_Review_Rating',
                                               'Pictures': 'Avg_Pictures_Per_Review',
                                               'Reviewer_Reviews': 'Avg_Reviewer_Reviews_Per_Restaurant',
                                               'Reviewer_Followers': 'Avg_Reviewer_Followers_Per_Restaurant'},
                                      inplace=True)
print("Aggregated numerical review features to restaurant level:")
print(restaurant_review_features_agg.head())
print(f"Shape of aggregated numerical review features: {restaurant_review_features_agg.shape}")


# Step 2: Aggregate TF-IDF vectors to restaurant level

from collections import defaultdict
# Using tfidf_matrix_pos.shape[1] for the number of features (words)
restaurant_tfidf_sums = defaultdict(lambda: np.zeros(tfidf_matrix_pos.shape[1]))

for i, restaurant_name in enumerate(review_df['Restaurant']):

    restaurant_tfidf_sums[restaurant_name] += tfidf_matrix_pos[i, :].toarray().flatten()

# Convert summed TF-IDF vectors back to a DataFrame
restaurant_tfidf_df = pd.DataFrame.from_dict(restaurant_tfidf_sums, orient='index', columns=tfidf_vectorizer_pos.get_feature_names_out())
restaurant_tfidf_df.reset_index(inplace=True)
restaurant_tfidf_df.rename(columns={'index': 'Restaurant_Name'}, inplace=True) # Renamed to avoid conflict if Restaurant column exists


print("\nAggregated TF-IDF vectors to restaurant level:")
print(restaurant_tfidf_df.head())
print(f"Shape of aggregated TF-IDF features: {restaurant_tfidf_df.shape}")


# Step 3: Combine all restaurant-level features into a single DataFrame for clustering

restaurant_features_base = restaurant_df.set_index('Name').drop(columns=['Links'], errors='ignore').copy()

# Merge aggregated numerical review features
# Use left_index=True and right_on='Restaurant' to merge on restaurant_features_base index (Name) and restaurant_review_features_agg's 'Restaurant' column
final_clustering_df = pd.merge(restaurant_features_base, restaurant_review_features_agg,
                               left_index=True, right_on='Restaurant', how='inner') # Use inner to only include restaurants with reviews


final_clustering_df.set_index('Restaurant', inplace=True) # 'Restaurant' column from merge is now the common key

final_clustering_df = pd.merge(final_clustering_df, restaurant_tfidf_df,
                               left_index=True, right_on='Restaurant_Name', how='inner') # Merge on index (Restaurant name)

final_clustering_df.set_index('Restaurant_Name', inplace=True) # Set Restaurant_Name as index for final df

print("\nFinal Clustering DataFrame Sample (first 5 rows):")
print(final_clustering_df.head())
print(f"Final shape of clustering DataFrame: {final_clustering_df.shape}")
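
The loop in Step 2 sums the TF-IDF rows of each restaurant, so a restaurant with more reviews ends up with a larger vector. Averaging is one alternative that removes this dependence on review count. The sketch below shows both aggregations with NumPy on a small dense toy matrix; `aggregate_rows` and the toy data are hypothetical, and a sparse matrix such as `tfidf_matrix_pos` would need to be converted or handled with sparse operations instead.

```python
import numpy as np

def aggregate_rows(matrix, groups, how="mean"):
    """Sum or average the rows of `matrix` that share a label in `groups`."""
    labels, inverse = np.unique(np.asarray(groups), return_inverse=True)
    sums = np.zeros((len(labels), matrix.shape[1]))
    np.add.at(sums, inverse, matrix)  # unbuffered add, so repeated labels accumulate
    if how == "sum":
        return labels, sums
    counts = np.bincount(inverse, minlength=len(labels))
    return labels, sums / counts[:, None]

# Toy data (made up): three reviews for restaurant "A", one for "B"
X = np.array([[0.2, 0.0], [0.4, 0.0], [0.6, 0.0], [0.0, 0.5]])
names = ["A", "A", "A", "B"]

labels, summed = aggregate_rows(X, names, how="sum")
_, averaged = aggregate_rows(X, names, how="mean")
print(labels)
print(summed)
print(averaged)
```

Whether sum or mean is preferable depends on whether review volume should influence the clusters; if every restaurant has a similar number of reviews, the two differ only by a near-constant factor.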

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

print("--- Feature Selection: Applying PCA ---")

# Step 1: Copy the feature table; restaurant names are already the index, so every column is a feature

features_df = final_clustering_df.copy()

# Step 2: Scale the features before applying PCA

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features_df)
scaled_features_df = pd.DataFrame(scaled_features, columns=features_df.columns, index=features_df.index)
print(f"Shape of scaled features: {scaled_features_df.shape}")
print("Scaled features head:")
print(scaled_features_df.head())

# Step 3: Apply PCA

pca = PCA(n_components=50) # Starting with 50 components

# Fit PCA on scaled data and transform it
pca_components = pca.fit_transform(scaled_features_df)

# Create a DataFrame for PCA components
pca_df = pd.DataFrame(data=pca_components,
                      columns=[f'PC_{i+1}' for i in range(pca.n_components)],
                      index=scaled_features_df.index)

print(f"\nShape of PCA components DataFrame: {pca_df.shape}")
print("PCA components head:")
print(pca_df.head())

# Explained variance ratio (summed over the retained components)
explained_variance_ratio = pca.explained_variance_ratio_
print(f"\nExplained variance ratio of first {pca.n_components_} components: {explained_variance_ratio.sum():.2f}")


##### What all feature selection methods have you used  and why?

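Answer -

**Principal Component Analysis (PCA)** was the only technique applied at this stage. Strictly speaking, PCA is feature extraction rather than feature selection: it does not keep a subset of the original columns but builds new ones. The "important features" are therefore not individual original columns but the **principal components (PCs)** themselves, which are new, synthetic features formed as linear combinations of the original 3650 features.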

**Importance of Principal Components:**
* **Variance Retention**: The 50 principal components selected were deemed important because they collectively capture approximately **67% of the total variance** present in the original high-dimensional dataset. This implies that a significant amount of the underlying information and variability across restaurants is preserved in these reduced dimensions.
* **Information Aggregation**: Each principal component represents a blend of information from many original features (e.g., restaurant cost, types of cuisines, collections, timings, average rating, number of pictures, reviewer activity, and the nuances of review text captured by TF-IDF). The components are ordered by the amount of variance they explain, with `PC_1` capturing the most, `PC_2` the second most, and so on.
* **Suitability for Clustering**: By focusing on these components, the clustering algorithm will operate on the most influential underlying patterns and structures in the data, rather than being overwhelmed by redundant or less impactful individual features.

**Types of Original Information Likely Reflected (indirectly, through variance):**
PCA does not directly tell us "Feature X is important." Because every column was standardized before PCA, each non-constant original feature contributes the same unit variance, so the components can draw on all of the following groups, and larger groups (such as the TF-IDF terms) carry more of the total variance:
* **Restaurant Attributes**: Such as estimated `Cost` and the presence/absence of various `Collections` and `Cuisines` types (from one-hot encoding).
* **Operating Patterns**: The one-hot encoded `Timings` values and the late-night flag `is_late_night_encoded`.
* **Aggregated Review Metrics**: Including `Avg_Review_Rating`, `Avg_Pictures_Per_Review`, `Avg_Reviewer_Reviews_Per_Restaurant`, and `Avg_Reviewer_Followers_Per_Restaurant`.
* **Textual Content Insights**: The aggregated TF-IDF features from customer reviews, which capture the prominent keywords and topics associated with each restaurant.

Confirming which of these groups dominates a given component would require inspecting the loadings in `pca.components_`, which is not shown in this section.
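
The 67% figure comes from fixing the number of components at 50 and reading off the retained variance. The reverse approach, choosing the smallest number of components that reaches a target share of variance, can be sketched with a plain SVD as below. The synthetic matrix, the 90% target, and the function names are assumptions for illustration; they are not results from this dataset. Note also that with 100 restaurants, centred data can have at most 99 components with non-zero variance.

```python
import numpy as np

def explained_variance_ratio(X):
    """Per-component share of total variance for centred data, via SVD."""
    Xc = X - X.mean(axis=0)
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def components_for(X, target=0.90):
    """Smallest number of leading components whose cumulative ratio reaches `target`."""
    cumulative = np.cumsum(explained_variance_ratio(X))
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative))

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): 100 rows driven by 3 latent factors plus small noise
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 40)) + 0.05 * rng.normal(size=(100, 40))

ratios = explained_variance_ratio(X)
k = components_for(X, target=0.90)
print("components needed for 90% variance:", k)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which performs the same selection internally.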

##### Which all features you found important and why?

Answer -

In the context of the **Principal Component Analysis (PCA)** method used for feature selection, the "important features" are not individual original columns but rather the **principal components (PCs)** that PCA generated.

**Why Principal Components are Important Features:**
* **Variance Capture**: PCA transforms the original, highly correlated features into a new set of orthogonal (uncorrelated) principal components. These components are ordered by the amount of variance they explain in the data. We selected **50 principal components** because they collectively retained approximately **67%** of the total variance present in the original 3650 features. In other words, these 50 new features summarize roughly two-thirds of the variability in the original dataset, and the remaining third is discarded.
* **Dimensionality Reduction**: By reducing the feature space from 3650 to 50, these principal components become the critical input for the clustering model, helping to mitigate the "curse of dimensionality" and improve computational efficiency.
* **Combined Information**: Each principal component is a linear combination of all original features. Therefore, the "importance" is implicitly distributed across the original data types:
    * **Restaurant structural attributes** (e.g., `Cost`, one-hot encoded `Collections`, `Cuisines`, `Timings`, `is_late_night_encoded`).
    * **Aggregated numerical insights from reviews** (e.g., `Avg_Review_Rating`, `Avg_Pictures_Per_Review`, `Avg_Reviewer_Reviews_Per_Restaurant`, `Avg_Reviewer_Followers_Per_Restaurant`).
    * **Semantic information from review text** (captured by the aggregated TF-IDF features).

These categories of original features collectively contribute to the variance captured by the principal components, making the principal components the most important features for understanding the underlying structure of the restaurant data for clustering purposes.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Answer -

Yes, data transformation has been a fundamental and continuous part of the preprocessing pipeline in this project. While the current section might typically refer to general mathematical transformations (like log transformation for skewness), it's important to note that various forms of transformation have already been applied to make the data suitable for analysis and machine learning:

**Key Transformations Already Applied:**

1.  **Data Type Conversions**:
    * `Cost` column (from `restaurant_df`) was transformed from a string (`object`) to a numerical (`integer`) type by removing commas.
    * `Rating` column (from `review_df`) was transformed from an `object` type to a `float` type after handling non-numeric 'Like' entries.
    * `Time` column (from `review_df`) was converted to a `datetime` object to enable time-series analysis.

2.  **Feature Extraction and Creation**:
    * `Metadata` column (from `review_df`) was transformed by extracting and creating two new numerical features: `Reviewer_Reviews` and `Reviewer_Followers`.
    * `Timings` and `Collections` columns (from `restaurant_df`) were used to derive a new binary feature: `is_late_night_encoded`.

3.  **Categorical to Numerical Encoding**:
    * Categorical columns such as `Collections`, `Cuisines`, and `Timings` (from `restaurant_df`) were transformed using **One-Hot Encoding** into multiple binary numerical features. This converts textual categories into a format usable by models.

4.  **Text Data Transformation (NLP Pipeline)**:
    * The raw `Review` text (from `review_df`) underwent a series of transformations: contraction expansion, lowercasing, punctuation removal, URL/digit removal, stopwords removal, tokenization, and **lemmatization (text normalization)**.
    * The most significant text transformation was **TF-IDF Vectorization**, which converted the cleaned review text into high-dimensional numerical feature vectors (`tfidf_matrix_pos`).

5.  **Outlier Treatment (Form of Transformation)**:
    * For numerical features like `Cost`, `Pictures`, `Reviewer_Reviews`, and `Reviewer_Followers`, **capping at the 99th percentile** was applied. This transforms extreme outlier values by setting them to a maximum threshold, thereby reducing their undue influence on models.

6.  **Dimensionality Reduction (Principal Component Analysis - PCA)**:
    * The entire combined feature set was transformed using **PCA**. This is a powerful linear transformation that projects the high-dimensional data (3650 features) into a lower-dimensional space (50 principal components) while retaining a significant portion of the original variance. The principal components themselves are transformed features.

**Why these transformations were used:**
These transformations were essential for:
* Converting raw, unstructured, or non-numerical data into a numerical format required by machine learning algorithms.
* Standardizing data, handling inconsistencies, and removing noise.
* Reducing multicollinearity and managing high dimensionality to improve model performance, efficiency, and interpretability (especially important for clustering).
* Aggregating information from different data sources (e.g., review-level data aggregated to restaurant-level).

Given that the data has already undergone significant transformations into its principal components via PCA, no *further* general mathematical transformations (like log transformation) are typically applied to these components directly, as they are already a transformed representation designed for variance maximization.
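
The percentile capping described in item 5, and the log transformation mentioned as a general option for skewed features, can be sketched as follows. The follower counts are made up, and the 90th percentile is used only because the toy array has ten values; the notebook caps at the 99th percentile. `cap_at_percentile` is a hypothetical helper name.

```python
import numpy as np

def cap_at_percentile(values, pct=99):
    """Clip values above the `pct`-th percentile down to that percentile."""
    upper = np.percentile(values, pct)
    return np.minimum(values, upper), upper

# Made-up follower counts with one extreme reviewer
followers = np.array([0, 1, 2, 3, 5, 8, 13, 21, 34, 5000], dtype=float)

capped, upper = cap_at_percentile(followers, pct=90)
logged = np.log1p(capped)  # log(1 + x): defined at 0 and compresses the long right tail

print("cap threshold:", upper)
print("max before / after capping:", followers.max(), capped.max())
print("max after log1p:", round(float(logged.max()), 3))
```

Capping changes only the values above the threshold, whereas `log1p` rescales every value; the two can be combined or used separately depending on how skewed a feature remains after capping.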

### 6. Data Scaling

In [None]:
# Scaling your data

from sklearn.preprocessing import StandardScaler

print("--- Data Scaling: Applying StandardScaler to PCA Components ---")

# Initialize the StandardScaler
scaler_final = StandardScaler()

# Fit and transform the pca_df (our feature set)
scaled_pca_components = scaler_final.fit_transform(pca_df)

# Convert the scaled array back to a DataFrame, preserving index and column names
scaled_pca_df = pd.DataFrame(scaled_pca_components, columns=pca_df.columns, index=pca_df.index)

print(f"Shape of scaled PCA components DataFrame: {scaled_pca_df.shape}")
print("Scaled PCA components head:")
print(scaled_pca_df.head())

##### Which method have you used to scale your data and why?

Answer -

For data scaling, **StandardScaler (Z-score normalization)** was used to transform the Principal Components (`pca_df`).

**Reasoning for choosing StandardScaler:**
* **Sensitivity of Distance-Based Algorithms**: Machine learning algorithms that rely on distance calculations, such as K-Means clustering, are highly sensitive to the magnitude and range of input features. If features have different scales, those with larger values or wider ranges can disproportionately influence the distance computations, leading to biased clustering results.
* **Standardization to Mean 0 and Variance 1**: `StandardScaler` transforms the data such that each feature (principal component in this case) has a mean of 0 and a standard deviation of 1. This process ensures that all features are on a comparable scale, preventing any single component from dominating the clustering process merely due to its larger numerical range.
* **Suitability for PCA Components**: PCA makes the components uncorrelated, but their variances still differ: `PC_1` has the largest spread and each later component a smaller one. Applying `StandardScaler` gives every component unit variance, so all 50 contribute equally to the Euclidean distances used by K-Means. The trade-off is that low-variance components are weighted up relative to the raw PCA output.
* **Dependence on Earlier Outlier Treatment**: `StandardScaler` relies on the mean and standard deviation, both of which are sensitive to extreme values. Because outliers were already capped in earlier steps, this sensitivity is less of a concern here.

The `scaled_pca_df` now represents the final, clean, and appropriately scaled feature set, ready to be fed into the clustering model.
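
A minimal sketch of what z-score scaling does to columns with very different spreads is shown below. The random matrix stands in for PCA scores and is an assumption, as is the helper name `zscore`; the population standard deviation (`ddof=0`) is used, which matches `StandardScaler`.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for PCA scores (assumption): uncorrelated columns with shrinking spread
scores = rng.normal(size=(100, 3)) * np.array([10.0, 3.0, 0.5])

def zscore(columns):
    """Column-wise (x - mean) / std, using the population std (ddof=0)."""
    return (columns - columns.mean(axis=0)) / columns.std(axis=0)

scaled = zscore(scores)
print("std before:", scores.std(axis=0).round(2))
print("std after: ", scaled.std(axis=0).round(2))
```

After scaling, the third column influences Euclidean distances as much as the first, even though it carried far less variance beforehand.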

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer -

Yes, **dimensionality reduction was absolutely necessary** for this dataset. It was a critical step performed using **Principal Component Analysis (PCA)** as part of the "Feature Manipulation & Selection" phase.

**Reasons why dimensionality reduction was needed:**
* **High Dimensionality ("Curse of Dimensionality")**: After one-hot encoding for categorical features and TF-IDF vectorization for review text, the combined feature set (`final_clustering_df`) resulted in an extremely high number of features (3650 columns). In high-dimensional spaces, data points become sparse, making it difficult for distance-based algorithms like K-Means clustering to effectively find meaningful clusters. Distances tend to become similar, reducing the discriminative power.
* **Computational Efficiency**: Working with 3650 features would be computationally very expensive and time-consuming for model training, especially when dealing with numerous iterations for clustering algorithms. Reducing the number of dimensions significantly improves computational efficiency.
* **Noise Reduction and Overfitting**: Many of the original high-dimensional features, particularly from TF-IDF, might represent noise or sparse terms that do not contribute meaningfully to the underlying patterns. High dimensionality can lead to overfitting in some models (though less of a concern for unsupervised clustering, it can make it harder to find robust structures). Dimensionality reduction helps to project the data onto a lower-dimensional subspace that captures the most important variance.
* **Improved Model Performance and Interpretability**: By focusing on a smaller set of principal components that capture most of the variance, clustering algorithms can often find more distinct and robust clusters. While principal components themselves are abstract, the reduction process leads to a more manageable and potentially better-performing model.

By reducing the feature set from 3650 to 50 principal components, we created a more compact and efficient representation of the data while retaining approximately 67% of the original variance. This leaves the data considerably better suited to clustering.
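
The claim that distances "tend to become similar" in high dimensions can be illustrated with uniformly random points, as in the sketch below. It measures the relative contrast, (max - min) / min, of the distances from one point to all others. The data are synthetic and the function name is hypothetical, so this demonstrates the general effect rather than a property measured on the restaurant features.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min of the distances from the first point to all others."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return (distances.max() - distances.min()) / distances.min()

for dims in (2, 50, 3650):
    print(dims, round(float(relative_contrast(100, dims)), 3))
```

As the number of dimensions grows, the nearest and farthest neighbours end up at nearly the same distance, which is what weakens distance-based clustering.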

In [None]:
# Dimensionality Reduction (If needed)

# Principal Component Analysis (PCA) was performed earlier in the "Feature Manipulation & Selection" section
# (specifically under '#### 2. Feature Selection') to reduce the dimensionality of the dataset.
# The resulting reduced and scaled feature set is stored in 'scaled_pca_df'.

# No new dimensionality reduction code is needed here as it's already done.

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer -

The dimensionality reduction technique used was **Principal Component Analysis (PCA)**.

**Reasoning for choosing PCA:**
* **Handling High-Dimensionality**: The primary reason for using PCA was to effectively reduce the very high number of features (3650 after one-hot encoding and TF-IDF vectorization) to a more manageable and computationally efficient set (50 principal components). This directly combats the "Curse of Dimensionality" which can negatively impact clustering algorithms.
* **Variance Preservation**: PCA is designed to create new, uncorrelated features (principal components) that capture the maximum possible variance from the original data. By retaining 50 components, approximately **67% of the total variance** from the original dataset was preserved, ensuring that most of the meaningful information about the restaurants was kept.
* **Improved Model Performance**: Reducing noise and redundancy in the feature space can lead to more robust and well-defined clusters in subsequent modeling, as the clustering algorithm operates on the most significant underlying patterns in the data.
* **Preprocessing for Clustering**: PCA is a widely accepted and powerful technique for preparing high-dimensional data for unsupervised learning tasks like clustering.

The result of this step is the `scaled_pca_df`, which serves as the optimized feature set for the clustering model.
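
The relationship between the number of components and the variance retained can be checked with a cumulative explained-variance curve. The following is a minimal sketch on synthetic data: `X_demo`, its shape, and the 0.67 target are illustrative assumptions, not the project's actual feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded restaurant features (hypothetical shape).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 300))

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all available components and accumulate the variance ratios.
pca_full = PCA().fit(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.67
n_components_needed = int(np.searchsorted(cumulative_variance, target) + 1)
print(f"Components needed for {target:.0%} variance: {n_components_needed}")
```

Running the same check on the real feature matrix would show how much additional variance each extra component buys beyond 50.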

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

# For this unsupervised clustering project, a traditional train-test split is not performed.
# The entire preprocessed and scaled feature set ('scaled_pca_df') will be used for clustering.
# This DataFrame is already prepared as the final input for the clustering model.

##### What data splitting ratio have you used and why?

Answer -

For this project, which primarily focuses on **unsupervised learning (restaurant clustering)**, a traditional train-test data split is **not explicitly performed** on the main feature set for the clustering algorithm itself.

**Reasoning:**
* **Unsupervised Learning Nature**: In unsupervised learning, there is no target variable to predict, and the goal is to discover hidden patterns or structures within the entire dataset. Therefore, the concept of training a model on one subset and evaluating its generalization on another (unseen) subset, as is done in supervised learning, does not directly apply. The clustering algorithm (K-Means) will be applied to the full, preprocessed dataset (`scaled_pca_df`).
* **Clustering Goal**: The objective is to segment all 100 available restaurants based on their inherent similarities across all features.
* **Hypothetical Supervised Scenario**: However, if this project were to evolve into a supervised learning task (e.g., building a classification model to predict which cluster a *new*, unseen restaurant belongs to, or predicting a specific restaurant attribute based on others), then a data split (typically 80% for training and 20% for testing, or 70/30) would be essential to evaluate the model's performance and generalization ability on unseen data. For unsupervised clustering, the entire `scaled_pca_df` is used as the input for finding clusters.
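
The hypothetical supervised scenario above can be sketched as follows. This is an illustration under stated assumptions, not part of the project pipeline: `X_demo` is synthetic data standing in for `scaled_pca_df`, and the choice of `KNeighborsClassifier` is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for scaled_pca_df (hypothetical data).
X_demo, _ = make_blobs(n_samples=200, centers=4, n_features=5, random_state=42)

# Step 1: cluster the full dataset; the cluster ids act as pseudo-labels.
pseudo_labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_demo)

# Step 2: an 80/20 split, stratified so every cluster appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, pseudo_labels, test_size=0.2, random_state=42, stratify=pseudo_labels
)

# Step 3: train a classifier to assign unseen points to a cluster.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy on cluster pseudo-labels: {test_accuracy:.3f}")
```

With only 100 restaurants and several very small clusters, a stratified split of the real data may not be feasible, which is another reason no split is used here.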

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer -

In the context of this project, which focuses on **unsupervised learning (clustering)**, the concept of a "balanced" or "imbalanced" dataset, as typically defined in machine learning, **does not directly apply**.

**Reasoning:**
* **Absence of a Target Variable**: Dataset imbalance refers to the unequal distribution of classes within a *target variable* (e.g., in a classification problem where one class has significantly fewer instances than others). Since clustering is an unsupervised task, there is no predefined target variable or labels that we are trying to predict.
* **Discovery of Natural Groupings**: The goal of clustering is to discover inherent patterns and natural groupings within the data. If the clustering algorithm identifies clusters of very different sizes, this is a reflection of the actual underlying structure of the data, rather than an "imbalance" that needs to be corrected in the preprocessing phase. For instance, it's natural to have a large cluster of "average" restaurants and smaller, more niche clusters of "high-end" or "specialty" restaurants.
* **Techniques are for Supervised Learning**: Techniques to handle imbalance (like oversampling, undersampling, or SMOTE) are specifically designed for supervised learning scenarios to prevent a model from being biased towards the majority class and performing poorly on the minority class. These techniques are not applicable or necessary for the objective of discovering natural clusters.

Therefore, for the purpose of clustering, the dataset is considered as is, and the algorithm will identify clusters based on the similarities in the features, irrespective of the ultimate size of those clusters.

In [None]:
# Handling Imbalanced Dataset (If needed)

# For this unsupervised clustering project, handling imbalanced datasets is not applicable
# as there is no target variable or predefined classes. This step is typically
# performed in supervised learning tasks.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer -

As discussed, handling dataset imbalance is **not applicable or necessary** for this project, as it involves unsupervised clustering. The concept of imbalance (where one class significantly outweighs others) is specific to supervised learning problems with a predefined target variable.

**However, if this were a supervised learning project and dataset imbalance needed to be addressed, common techniques would include:**

1.  **Resampling Techniques:**
    * **Oversampling (e.g., SMOTE - Synthetic Minority Over-sampling Technique)**: This involves increasing the number of instances in the minority class. SMOTE works by creating synthetic (new, but similar) samples of the minority class. This is useful when data is limited and simply duplicating existing samples might lead to overfitting.
    * **Undersampling**: This involves decreasing the number of instances in the majority class to match the minority class. While it can balance the dataset, it may lead to information loss from the majority class.

2.  **Algorithmic Approaches:**
    * **Cost-Sensitive Learning**: Modifying the learning algorithm itself to penalize misclassifications of the minority class more heavily than misclassifications of the majority class.
    * **Ensemble Methods (e.g., BalancedBaggingClassifier, EasyEnsemble)**: Using multiple models or resampling techniques within an ensemble framework to improve performance on imbalanced data.

**Why these techniques would be used (in a supervised context):**
These techniques are employed in supervised learning to prevent the model from becoming biased towards the majority class, thereby improving its ability to learn from and accurately predict the minority class, which is often the class of greater interest (e.g., fraud detection, disease prediction). For our clustering task, these are not required as we are exploring inherent data structures.
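
For reference, cost-sensitive learning in a supervised setting can be sketched with scikit-learn's `class_weight="balanced"` option. The data below is synthetic and the classifier choice is an assumption made purely for illustration; none of this is applied to the Zomato data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly a 95/5 class imbalance (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Baseline: every misclassification costs the same.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cost-sensitive: errors are weighted inversely to class frequency.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

# Recall on the minority class (label 1) is the metric most affected by imbalance.
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"Minority recall, unweighted: {recall_plain:.3f}")
print(f"Minority recall, balanced weights: {recall_weighted:.3f}")
```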

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

print("--- ML Model - 1: K-Means Clustering ---")
print("\n--- Determining Optimal Number of Clusters (Elbow Method) ---")

# Define a range of K values to test
k_range = range(1, 11) # Test from 1 to 10 clusters

# List to store inertia values for each K
inertia = []

# Loop through K values and fit KMeans
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10) # n_init suppresses warning
    kmeans.fit(scaled_pca_df)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Method graph
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-cluster sum of squares)')
plt.title('Elbow Method for Optimal K')
plt.xticks(k_range)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# Read the candidate K off the plot: the 'elbow' is the point after which
# adding more clusters no longer reduces inertia sharply.
print("\nObserve the Elbow Method plot above to identify the 'elbow point' where the decrease in inertia starts to slow down significantly.")

# Fit the Algorithm

# Chosen optimal K based on Elbow Method
optimal_k = 4
print(f"\nOptimal K chosen based on Elbow Method: {optimal_k}")

# Initialize and fit KMeans with the chosen optimal_k
kmeans_model = KMeans(n_clusters=optimal_k, random_state=42, n_init=10) # n_init suppresses warning
kmeans_model.fit(scaled_pca_df)

# Predict on the model
# Get cluster labels for each restaurant
cluster_labels = kmeans_model.labels_

# Collect the cluster label for each restaurant in a separate DataFrame,
# keyed by restaurant name, so that restaurant_df itself is left unmodified.
# The assignment below aligns on the index: it assumes scaled_pca_df is indexed by
# restaurant name. Any restaurant in restaurant_df that is missing from scaled_pca_df
# receives NaN, which also turns the 'Cluster' column into floats (0.0, 1.0, ...).
restaurants_with_clusters = pd.DataFrame(restaurant_df['Name']).set_index('Name').copy() # Start with restaurant names
restaurants_with_clusters['Cluster'] = pd.Series(cluster_labels, index=scaled_pca_df.index)

print(f"\nAssigned {len(restaurants_with_clusters['Cluster'].unique())} clusters to restaurants.")
print("First 5 restaurants with their assigned clusters:")
print(restaurants_with_clusters.head())

# --- Evaluation Metrics for Model ---
# For unsupervised learning, we use internal metrics like Inertia and Silhouette Score

# Inertia: The sum of squared distances of samples to their closest cluster center. Lower is better.
final_inertia = kmeans_model.inertia_
print(f"\nFinal K-Means Inertia for K={optimal_k}: {final_inertia:.2f}")

# Silhouette Score: Measures how similar an object is to its own cluster (cohesion)
# compared to other clusters (separation). Ranges from -1 to +1.
# Higher value indicates better-defined clusters.
from sklearn.metrics import silhouette_score
# Silhouette score requires at least 2 clusters
if optimal_k > 1:
    silhouette_avg = silhouette_score(scaled_pca_df, cluster_labels)
    print(f"Silhouette Score for K={optimal_k}: {silhouette_avg:.3f}")
else:
    print("Silhouette Score not applicable for K=1.")

In [None]:
# Quick check on actual number of unique clusters assigned
print(f"Actual unique clusters assigned by K-Means: {restaurants_with_clusters['Cluster'].nunique()}")
print(f"Count of restaurants in each cluster:\n{restaurants_with_clusters['Cluster'].value_counts()}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Ensure final_inertia and silhouette_avg are available from the previous run
# We will use the values calculated just above this cell.

# Create a small DataFrame for the metrics for easy plotting
metrics_data = {
    'Metric': ['Inertia', 'Silhouette Score'],
    'Value': [kmeans_model.inertia_, silhouette_score(scaled_pca_df, kmeans_model.labels_)]
}
metrics_df = pd.DataFrame(metrics_data)

plt.figure(figsize=(8, 5))
sns.barplot(x='Metric', y='Value', data=metrics_df, palette='viridis')
plt.title(f'K-Means Clustering Evaluation Metrics (K={kmeans_model.n_clusters})', fontsize=16)
plt.ylabel('Value', fontsize=12)
plt.ylim(0, max(metrics_df['Value']) * 1.1) # Set y-limit to slightly above max value
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Add value labels on top of the bars for clarity
for index, row in metrics_df.iterrows():
    plt.text(index, row['Value'], f'{row["Value"]:.3f}', color='black', ha='center', va='bottom', fontsize=10)

plt.show()

print("\nK-Means Model Evaluation Metrics:")
print(f"Optimal K: {kmeans_model.n_clusters}")
print(f"Inertia: {kmeans_model.inertia_:.3f}")
print(f"Silhouette Score: {silhouette_score(scaled_pca_df, kmeans_model.labels_):.3f}")

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

print("--- K-Means Model: Cross-Validation & Hyperparameter Tuning ---")

# Re-initializing and fitting K-Means with the chosen optimal K
# This block essentially confirms the hyperparameter-tuned model
optimal_k = 4 # Our chosen optimal K from the Elbow Method
kmeans_optimized = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)

# Fit the Algorithm
kmeans_optimized.fit(scaled_pca_df)

# Predict on the model
# Get cluster labels for each restaurant
optimized_cluster_labels = kmeans_optimized.labels_

print(f"\nK-Means model refitted with optimal K = {optimal_k}.")
print("Optimized Model Parameters (Key Hyperparameters):")
print(f"  n_clusters (K): {kmeans_optimized.n_clusters}")
print(f"  random_state: {kmeans_optimized.random_state}")
print(f"  n_init: {kmeans_optimized.n_init}")

# Display performance metrics for the optimized model
optimized_inertia = kmeans_optimized.inertia_
optimized_silhouette_avg = silhouette_score(scaled_pca_df, optimized_cluster_labels)

print(f"\nPerformance Metrics for Optimized K-Means Model:")
print(f"  Inertia: {optimized_inertia:.3f}")
print(f"  Silhouette Score: {optimized_silhouette_avg:.3f}")

# Reconfirm the distribution of clusters
restaurants_with_clusters_optimized = pd.DataFrame(restaurant_df['Name']).set_index('Name').copy()
restaurants_with_clusters_optimized['Cluster'] = pd.Series(optimized_cluster_labels, index=scaled_pca_df.index)
print(f"\nUpdated count of restaurants in each cluster (Optimized K={optimal_k}):\n{restaurants_with_clusters_optimized['Cluster'].value_counts()}")

##### Which hyperparameter optimization technique have you used and why?

Answer -

For K-Means clustering, the primary hyperparameter to optimize is the **number of clusters (K)**. The technique used to determine the optimal `K` was the **Elbow Method**.

**Reasoning for using the Elbow Method:**
* The Elbow Method is a widely accepted heuristic for K-Means to find the optimal `K` by plotting the **Inertia** (within-cluster sum of squares) against different values of `K`.
* The "elbow point" on the plot indicates where the decrease in inertia starts to diminish significantly. This point is considered optimal because adding more clusters beyond this point provides diminishing returns in terms of reducing intra-cluster variance, while increasing model complexity.
* In our case, the Elbow plot suggested `K=4` as a suitable optimal point.

**Other Hyperparameters and Considerations:**
* **`n_init`**: We set `n_init=10`. This hyperparameter controls the number of times the K-Means algorithm is run with different centroid seeds; scikit-learn keeps the run with the lowest inertia. A higher `n_init` reduces the chance of settling on a poor local minimum caused by unlucky initialization, although it does not guarantee a globally optimal clustering.
* **`random_state`**: Setting `random_state=42` ensures reproducibility of the results, meaning that running the code multiple times will yield the same cluster assignments if all other factors are constant.
* **Cross-Validation**: Traditional cross-validation techniques (like k-fold cross-validation) are primarily used in supervised learning to evaluate model generalization on unseen data. For unsupervised clustering, direct cross-validation for hyperparameter tuning is not standard, as there's no ground truth label to evaluate against. The Elbow Method and Silhouette Analysis (discussed below) serve as internal validation methods for cluster quality.
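
Silhouette analysis across candidate values of `K` can complement the Elbow Method. The sketch below uses synthetic data (`X_demo` is an assumption standing in for `scaled_pca_df`) and records the average silhouette score for each `K`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_pca_df (hypothetical data).
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=42)

# Silhouette needs at least 2 clusters, so start the scan at K=2.
silhouette_by_k = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    silhouette_by_k[k] = silhouette_score(X_demo, labels)

# The K with the highest average silhouette is one candidate for the final model.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"K={k}: silhouette={score:.3f}")
print(f"Highest silhouette at K={best_k}")
```

When the elbow is ambiguous, agreement between the two heuristics gives more confidence in the chosen `K`; disagreement is a signal to inspect the cluster sizes directly.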

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer -

For K-Means clustering, the "improvement" from hyperparameter tuning (primarily identifying the optimal `K` using the Elbow Method) is realized *during* the selection of `K`, not necessarily after refitting with the chosen `K`. The chosen `K` (which was 4) is considered the "best" because it offers a good balance between reducing within-cluster variance and model complexity, as indicated by the Elbow curve.

When the K-Means model was fitted with the **optimal `K=4`** (and `n_init=10` for robust initialization), the performance metrics obtained were:
* **Inertia**: **4725.574**
* **Silhouette Score**: **0.179**

These metrics are consistent with the values observed during the Elbow Method plotting (for Inertia) and from the initial model run for `K=4`, which is expected because the refit uses the same data, `K`, `random_state`, and `n_init`. There is no "further improvement" in these metric values after the tuning process; the refit simply confirms the model at the selected `K`. The value of the tuning step is that `K` was chosen from the Elbow curve rather than picked arbitrarily.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Answer -

The second Machine Learning model used for restaurant clustering is **Agglomerative Clustering**.

**Explanation of Agglomerative Clustering:**
Agglomerative Clustering is a type of hierarchical clustering algorithm. It's a "bottom-up" approach where:
1.  **Initialization**: Each data point starts as its own cluster.
2.  **Merging**: Pairs of clusters are successively merged based on a similarity (or dissimilarity) measure until all data points are in a single cluster or a pre-defined number of clusters (`K`) is reached.
3.  **Linkage Criterion**: A "linkage criterion" determines the dissimilarity between sets of observations. For this project, the **'ward' linkage method** was used, which minimizes the variance of the clusters being merged.
The results of hierarchical clustering can be visualized using a **Dendrogram**, which shows the sequence of merges and their associated distances.
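
The merge sequence described above is what SciPy's `linkage` matrix records, and `fcluster` cuts that hierarchy into flat clusters. A minimal sketch on synthetic data (`X_demo` is an assumption, not the project's features):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for scaled_pca_df (hypothetical data).
X_demo, _ = make_blobs(n_samples=60, centers=4, n_features=3, random_state=42)

# Each of the n-1 rows of Z describes one merge:
# [cluster index a, cluster index b, merge distance, size of the merged cluster].
Z = linkage(X_demo, method="ward")

# Cut the hierarchy so that at most 4 flat clusters remain.
labels = fcluster(Z, t=4, criterion="maxclust")

# fcluster labels start at 1, so drop the unused 0 bin.
cluster_sizes = np.bincount(labels)[1:]
print("Last merge (joins everything):", Z[-1])
print("Cluster sizes:", cluster_sizes)
```

The third column of `Z` is the height at which each merge is drawn in a dendrogram, so large jumps in that column correspond to the long vertical lines used to choose a cut.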

**Model Performance and Evaluation Metrics:**

* **Optimal K Determination (Dendrogram):**
    * The **Dendrogram** was used to visually determine the optimal number of clusters (`K`) for Agglomerative Clustering. By observing the structure of the dendrogram and considering where the longest vertical lines (representing merges at greater distances) appear, one can identify natural cluster groupings.
    * For direct comparison with K-Means, `K=4` was explicitly chosen for Agglomerative Clustering.

* **Inertia for K=4:**
    * **Inertia**: **4705.916**
    * **Indication**: This value represents the sum of squared distances of samples to their closest cluster center for the chosen `K=4`. Compared to K-Means' Inertia of 4725.574, Agglomerative Clustering achieved a slightly lower Inertia, indicating slightly more compact clusters for the same number of clusters.

* **Silhouette Score for K=4:**
    * **Silhouette Score**: **0.199**
    * **Indication**: A score of **0.199** is positive, suggesting that the clusters have some separation and structure. Compared to K-Means' Silhouette Score of 0.179, Agglomerative Clustering achieved a slightly higher Silhouette Score, which implies marginally better-defined separation between clusters. The score is still low in absolute terms (suggesting substantial overlap), and a difference of 0.02 is small, so this is at most a slight advantage over K-Means for this specific dataset and `K`.

* **Observed Cluster Distribution:**
    * When the Agglomerative Clustering model was fitted with `K=4`, the distribution of restaurants among the clusters was:
        * **Cluster 0.0: 91 restaurants**
        * **Cluster 1.0: 5 restaurants**
        * **Cluster 2.0: 3 restaurants**
        * **Cluster 3.0: 1 restaurant**
    * **Insight**: Similar to K-Means, Agglomerative Clustering also identified one very large cluster containing the majority of restaurants. However, it distributed the remaining unique restaurants into slightly larger small clusters (5 and 3 restaurants) compared to K-Means' single-restaurant clusters (1, 1, 1). This suggests that Agglomerative Clustering might group some of the less common restaurants into slightly larger, albeit still small, distinct categories, offering a slightly more nuanced segmentation of the outliers.
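
For reference, the two metrics compared above can be written out explicitly. The notation is introduced here for illustration only: $C_1, \dots, C_K$ are the clusters, $\mu_k$ is the mean of the points in $C_k$, and $x_i$ is a restaurant's feature vector.

$$
\text{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \text{Silhouette Score} = \frac{1}{n} \sum_{i=1}^{n} s(i)
$$

Here $a(i)$ is the mean distance from $x_i$ to the other points in its own cluster, and $b(i)$ is the mean distance from $x_i$ to the points of the nearest other cluster. Inertia is lower-is-better and always decreases as $K$ grows, while the silhouette score lies in $[-1, 1]$ and is higher-is-better.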

In [None]:
# ML Model - 2 Implementation

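from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import seaborn as sns  # used for the metrics bar chart at the end of this cell
import numpy as np
import pandas as pd  # used to build the cluster-assignment DataFrames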

print("--- ML Model - 2: Agglomerative Clustering ---")
print("\n--- Determining Optimal Number of Clusters (Dendrogram) ---")

# Generate the linkage matrix for hierarchical clustering
# 'ward' linkage minimizes the variance of the clusters being merged.
linked = linkage(scaled_pca_df, method='ward')

plt.figure(figsize=(15, 7))
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=False, # Set to True to show count of samples in leaves
           leaf_rotation=90.,
           leaf_font_size=8.,
           show_contracted=True # Show brackets
          )
plt.title('Dendrogram for Agglomerative Clustering', fontsize=16)
plt.xlabel('Data Points (Restaurants)', fontsize=12)
plt.ylabel('Merge Distance (Ward linkage)', fontsize=12)
plt.axhline(y=40, color='r', linestyle='--', label='Cut-off Threshold (Example)') # Example threshold line
plt.legend()
plt.show()

print("\nObserve the dendrogram above to identify the number of clusters: find the longest vertical lines that are not crossed by any horizontal merge line, draw a horizontal cut through them, and count the vertical lines the cut intersects.")

# Fit the Agglomerative Clustering model with a chosen number of clusters based on dendrogram or a heuristic.
# For consistency and comparison with K-Means, let's start with 4 clusters.
optimal_k_agg = 4 # You would determine this visually from the dendrogram
print(f"\nChosen number of clusters for Agglomerative Clustering: {optimal_k_agg}")

agg_model = AgglomerativeClustering(n_clusters=optimal_k_agg, linkage='ward')
agg_cluster_labels = agg_model.fit_predict(scaled_pca_df)

# Add cluster labels to our original restaurant_df (or a copy)
restaurants_with_agg_clusters = pd.DataFrame(restaurant_df['Name']).set_index('Name').copy()
restaurants_with_agg_clusters['Cluster'] = pd.Series(agg_cluster_labels, index=scaled_pca_df.index)

print(f"\nAssigned {len(restaurants_with_agg_clusters['Cluster'].unique())} clusters to restaurants.")
print("First 5 restaurants with their assigned Agglomerative Clusters:")
print(restaurants_with_agg_clusters.head())

# Quick check on actual number of unique clusters assigned and their counts
print(f"\nActual unique clusters assigned by Agglomerative Clustering: {restaurants_with_agg_clusters['Cluster'].nunique()}")
print(f"Count of restaurants in each cluster (Agglomerative Clustering):\n{restaurants_with_agg_clusters['Cluster'].value_counts()}")
# Calculate Inertia for Agglomerative Clustering
# AgglomerativeClustering doesn't directly expose 'inertia_'.
# We can calculate it manually as the sum of squared distances of samples to their assigned centroids.
# First, find centroids by averaging samples within each cluster
agg_centroids = np.array([scaled_pca_df[agg_cluster_labels == i].mean(axis=0) for i in range(optimal_k_agg)])
agg_inertia = np.sum([np.sum(np.linalg.norm(scaled_pca_df[agg_cluster_labels == i] - agg_centroids[i], axis=1)**2)
                      for i in range(optimal_k_agg)])


# Evaluate performance metrics for Agglomerative Clustering
from sklearn.metrics import silhouette_score

if optimal_k_agg > 1:
    agg_silhouette_avg = silhouette_score(scaled_pca_df, agg_cluster_labels)
    print(f"\nSilhouette Score for Agglomerative Clustering (K={optimal_k_agg}): {agg_silhouette_avg:.3f}")
else:
    # Silhouette is undefined for a single cluster; keep the variable defined for the chart below.
    agg_silhouette_avg = np.nan
    print("Silhouette Score not applicable for Agglomerative Clustering with K=1.")

print(f"Inertia for Agglomerative Clustering (K={optimal_k_agg}): {agg_inertia:.2f}")


# Visualizing evaluation Metric Score chart for Agglomerative Clustering
metrics_data_agg = {
    'Metric': ['Inertia', 'Silhouette Score'],
    'Value': [agg_inertia, agg_silhouette_avg]
}
metrics_df_agg = pd.DataFrame(metrics_data_agg)

plt.figure(figsize=(8, 5))
sns.barplot(x='Metric', y='Value', data=metrics_df_agg, palette='magma')
plt.title(f'Agglomerative Clustering Evaluation Metrics (K={optimal_k_agg})', fontsize=16)
plt.ylabel('Value', fontsize=12)
plt.ylim(0, max(metrics_df_agg['Value']) * 1.1)
plt.grid(axis='y', linestyle='--', alpha=0.6)

for index, row in metrics_df_agg.iterrows():
    plt.text(index, row['Value'], f'{row["Value"]:.3f}', color='black', ha='center', va='bottom', fontsize=10)
plt.show()

print("\nAgglomerative Clustering Model Evaluation Metrics:")
print(f"Optimal K: {optimal_k_agg}")
print(f"Inertia: {agg_inertia:.3f}")
print(f"Silhouette Score: {agg_silhouette_avg:.3f}")

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np # For potential manual calculations if needed

print("--- Agglomerative Clustering Model: Cross-Validation & Hyperparameter Tuning ---")

# Re-initializing and fitting Agglomerative Clustering with the chosen optimal K and linkage
# This block essentially confirms the hyperparameter-tuned model
optimal_k_agg_tuned = 4 # Our chosen optimal K from the Dendrogram/comparison
linkage_method = 'ward' # Our chosen linkage method

agg_optimized = AgglomerativeClustering(n_clusters=optimal_k_agg_tuned, linkage=linkage_method)

# Fit the Algorithm
agg_optimized.fit(scaled_pca_df)

# Predict on the model
# Get cluster labels for each restaurant
optimized_agg_cluster_labels = agg_optimized.labels_

print(f"\nAgglomerative Clustering model refitted with optimal K = {optimal_k_agg_tuned} and linkage='{linkage_method}'.")
print("Optimized Model Parameters (Key Hyperparameters):")
print(f"  n_clusters (K): {agg_optimized.n_clusters}")
print(f"  linkage method: {agg_optimized.linkage}")

# Recalculate Inertia for Agglomerative Clustering optimized model
agg_optimized_centroids = np.array([scaled_pca_df[optimized_agg_cluster_labels == i].mean(axis=0) for i in range(optimal_k_agg_tuned)])
optimized_agg_inertia = np.sum([np.sum(np.linalg.norm(scaled_pca_df[optimized_agg_cluster_labels == i] - agg_optimized_centroids[i], axis=1)**2)
                               for i in range(optimal_k_agg_tuned)])

# Display performance metrics for the optimized model
optimized_agg_silhouette_avg = silhouette_score(scaled_pca_df, optimized_agg_cluster_labels)

print(f"\nPerformance Metrics for Optimized Agglomerative Clustering Model:")
print(f"  Inertia: {optimized_agg_inertia:.3f}")
print(f"  Silhouette Score: {optimized_agg_silhouette_avg:.3f}")

# Reconfirm the distribution of clusters
restaurants_with_agg_clusters_optimized = pd.DataFrame(restaurant_df['Name']).set_index('Name').copy()
restaurants_with_agg_clusters_optimized['Cluster'] = pd.Series(optimized_agg_cluster_labels, index=scaled_pca_df.index)
print(f"\nUpdated count of restaurants in each cluster (Optimized K={optimal_k_agg_tuned}):\n{restaurants_with_agg_clusters_optimized['Cluster'].value_counts()}")

##### Which hyperparameter optimization technique have you used and why?

Answer -

For Agglomerative Clustering, the primary hyperparameter to optimize is the **number of clusters (K)**, and another significant one is the **linkage method**.

**Technique Used for Optimal K:**
* The **Dendrogram** visualization was the primary technique used to determine the optimal number of clusters (`K`). By analyzing the hierarchical structure and observing the distances at which clusters merge, one can visually identify natural groupings. While the dendrogram shows a continuous hierarchy, for direct comparison with K-Means, `K=4` was explicitly chosen.

**Other Hyperparameters Tuned/Considered:**
* **Linkage Method**: The `linkage='ward'` method was used.
    * **Reasoning**: 'Ward's method' is a common and often effective linkage criterion for hierarchical clustering. It minimizes the variance of the clusters being merged, tending to produce clusters of roughly equal size (though in our case, the data's inherent structure still led to imbalanced clusters). Other linkage methods (`average`, `complete`, `single`) calculate distances between clusters differently and might yield different structures.
* **Cross-Validation**: Traditional cross-validation techniques (like k-fold cross-validation) are primarily used in supervised learning to evaluate model generalization. For unsupervised clustering, direct cross-validation for hyperparameter tuning is not standard, as there is no ground truth to evaluate against. The Dendrogram and internal metrics like Silhouette Score serve as internal validation methods for cluster quality.
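
One way to compare the linkage methods mentioned above is to fit each one with the same number of clusters and compare an internal metric. The sketch below does this on synthetic data; `X_demo` and the choice of 4 clusters are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_pca_df (hypothetical data).
X_demo, _ = make_blobs(n_samples=120, centers=4, n_features=5, random_state=42)

# Fit one model per linkage criterion and record its average silhouette.
linkage_scores = {}
for method in ("ward", "average", "complete", "single"):
    labels = AgglomerativeClustering(n_clusters=4, linkage=method).fit_predict(X_demo)
    linkage_scores[method] = silhouette_score(X_demo, labels)

for method, score in linkage_scores.items():
    print(f"{method:>8}: silhouette={score:.3f}")
```

Applied to the real features, a scan like this would show whether `ward` is in fact the best-performing linkage rather than only a reasonable default.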

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer -

Yes, a **slight improvement** in clustering performance was observed with Agglomerative Clustering (using K=4 and 'ward' linkage) compared to K-Means (using K=4).

**Evaluation Metric Score Chart (Updated Metrics):**
* **Inertia**: **4705.916** (Agglomerative Clustering) vs. 4725.574 (K-Means)
    * **Improvement**: Agglomerative Clustering achieved a slightly lower Inertia. A lower Inertia indicates that the clusters formed are marginally more compact and cohesive.
* **Silhouette Score**: **0.199** (Agglomerative Clustering) vs. 0.179 (K-Means)
    * **Improvement**: Agglomerative Clustering yielded a slightly higher Silhouette Score. A higher Silhouette Score indicates that the clusters are marginally better separated and that data points are generally better matched to their own cluster than to neighboring clusters.

While both models indicate a similar underlying data structure (one very large cluster and several smaller ones), Agglomerative Clustering with `K=4` and `ward` linkage demonstrated a **modest improvement in both compactness (Inertia) and separation (Silhouette Score)** compared to the K-Means model with the same number of clusters. This suggests that the hierarchical approach found a slightly better partition of the restaurants for this dataset, although the differences are small.
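
Beyond comparing metric values, the agreement between two clusterings can be measured directly with the Adjusted Rand Index (ARI), which is 1.0 for identical partitions and close to 0 for unrelated ones. A minimal sketch on synthetic data (`X_demo` is an assumption, not the project's features):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for scaled_pca_df (hypothetical data).
X_demo, _ = make_blobs(n_samples=120, centers=4, n_features=5, random_state=42)

kmeans_labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_demo)
agg_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X_demo)

# ARI ignores how the cluster ids are numbered; only the grouping matters.
ari = adjusted_rand_score(kmeans_labels, agg_labels)
print(f"Adjusted Rand Index between K-Means and Agglomerative: {ari:.3f}")
```

A high ARI on the real data would indicate that the two models largely agree on which restaurants belong together, so the metric differences come from a handful of reassigned points.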

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd # Ensure pandas is imported for DataFrame operations

print("--- ML Model - 3: Gaussian Mixture Models (GMM) ---")
print("\n--- Determining Optimal Number of Clusters (AIC and BIC) ---")

# Define a range of K values to test
k_range_gmm = range(1, 11) # Test from 1 to 10 clusters

# Lists to store AIC, BIC, and Silhouette scores for each K
aic = []
bic = []
silhouette_scores_gmm = [] # Also store silhouette for comparison

# Loop through K values and fit GMM
for k in k_range_gmm:
    gmm = GaussianMixture(n_components=k, random_state=42, n_init=10) # n_init for multiple initializations
    gmm.fit(scaled_pca_df)
    aic.append(gmm.aic(scaled_pca_df))
    bic.append(gmm.bic(scaled_pca_df))

    # Calculate Silhouette Score for K > 1
    if k > 1:
        gmm_labels_temp = gmm.predict(scaled_pca_df)
        silhouette_scores_gmm.append(silhouette_score(scaled_pca_df, gmm_labels_temp))
    else:
        silhouette_scores_gmm.append(np.nan) # Silhouette not defined for 1 cluster

# Plot AIC and BIC
plt.figure(figsize=(12, 6))
plt.plot(k_range_gmm, aic, marker='o', linestyle='--', label='AIC')
plt.plot(k_range_gmm, bic, marker='x', linestyle='-', label='BIC')
plt.xlabel('Number of Components (K)')
plt.ylabel('Information Criterion Value')
plt.title('AIC and BIC for GMM Optimal K')
plt.xticks(k_range_gmm)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# Plot Silhouette Score
plt.figure(figsize=(10, 6))
plt.plot(k_range_gmm, silhouette_scores_gmm, marker='o', linestyle='--', color='purple')
plt.xlabel('Number of Components (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for GMM Optimal K')
plt.xticks(k_range_gmm)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

print("\nObserve the AIC/BIC and Silhouette plots above to identify the optimal K.")


# --- Fit the Algorithm (with chosen optimal K) ---

# AIC and BIC are lower-is-better criteria, so candidate values of K are those where
# the curves reach a minimum or begin to flatten.
# K=3 is used here as a provisional choice; confirm it against the AIC/BIC and
# Silhouette plots above and adjust optimal_k_gmm if they point to a different K.
optimal_k_gmm = 3 # Provisional choice, confirm from plots.

print(f"\nOptimal K chosen for GMM based on plot observation: {optimal_k_gmm}")

gmm_model = GaussianMixture(n_components=optimal_k_gmm, random_state=42, n_init=10)
gmm_model.fit(scaled_pca_df)

# --- Predict on the model ---
gmm_cluster_labels = gmm_model.predict(scaled_pca_df)

# Add cluster labels to our original restaurant_df (or a copy)
restaurants_with_gmm_clusters = pd.DataFrame(restaurant_df['Name']).set_index('Name').copy()
restaurants_with_gmm_clusters['Cluster'] = pd.Series(gmm_cluster_labels, index=scaled_pca_df.index)


# nunique() ignores NaN, so restaurants without a cluster label are not counted as an extra cluster
print(f"\nAssigned {restaurants_with_gmm_clusters['Cluster'].nunique()} clusters to restaurants using GMM.")
print("First 5 restaurants with their assigned GMM Clusters:")
print(restaurants_with_gmm_clusters.head())

# Quick check on actual number of unique clusters assigned and their counts
print(f"\nActual unique clusters assigned by GMM: {restaurants_with_gmm_clusters['Cluster'].nunique()}")
print(f"Count of restaurants in each cluster (GMM):\n{restaurants_with_gmm_clusters['Cluster'].value_counts()}")


# --- Evaluation Metrics for Model ---
# Display performance metrics for the fitted GMM model

# Calculate current AIC and BIC for the chosen optimal_k_gmm
final_aic_gmm = gmm_model.aic(scaled_pca_df)
final_bic_gmm = gmm_model.bic(scaled_pca_df)

# Calculate Silhouette Score for the chosen K
if optimal_k_gmm > 1:
    final_gmm_silhouette_avg = silhouette_score(scaled_pca_df, gmm_cluster_labels)
else:
    final_gmm_silhouette_avg = np.nan

print(f"\nPerformance Metrics for GMM Model (K={optimal_k_gmm}):")
print(f"  AIC: {final_aic_gmm:.3f}")
print(f"  BIC: {final_bic_gmm:.3f}")
print(f"  Silhouette Score: {final_gmm_silhouette_avg:.3f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.


The third Machine Learning model used for restaurant clustering is **Gaussian Mixture Models (GMM)**.

**Explanation of Gaussian Mixture Models (GMM):**
GMM is a probabilistic clustering algorithm that assumes data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters (mean, covariance, and mixing proportions). Unlike K-Means, which assigns each data point to a single cluster, GMMs assign a probability that a data point belongs to each cluster. This "soft assignment" allows GMMs to model clusters with varying shapes, sizes, and orientations, and to capture more complex underlying data distributions. GMMs use the Expectation-Maximization (EM) algorithm to estimate the parameters of the Gaussian components.
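
The soft assignment described above can be illustrated with a minimal sketch. It uses synthetic two-blob data rather than the Zomato features, and the names `X_demo`, `gmm_demo`, `probs`, and `hard_labels` are illustrative assumptions, not variables from this notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data: two overlapping blobs (illustrative only, not the Zomato features)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2)),
])

gmm_demo = GaussianMixture(n_components=2, random_state=42, n_init=5).fit(X_demo)

# Soft assignment: one membership probability per component for every row
probs = gmm_demo.predict_proba(X_demo)

# Hard assignment: the component with the highest membership probability
hard_labels = gmm_demo.predict(X_demo)

print(probs[:3].round(3))
print(hard_labels[:3])
```

Each row of `probs` sums to 1, and points lying between the two blobs receive intermediate probabilities instead of a forced all-or-nothing label.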

**Model Performance and Evaluation Metrics:**

* **Optimal K Determination (AIC, BIC, and Silhouette Score):**
    * For GMMs, the optimal number of components (`K`) is often determined by minimizing the **Bayesian Information Criterion (BIC)** or **Akaike Information Criterion (AIC)**. Lower AIC/BIC values generally indicate a better fit while penalizing model complexity.
    * The **Silhouette Score** was also used, where a higher score indicates better-defined clusters.
    * Based on the plots:
        * Both AIC and BIC showed a clear minimum at **K=3**.
        * The Silhouette Score also showed its highest value at **K=3**.
    * Therefore, the optimal `K` was chosen as **3**.

* **Final Metrics for K=3:**
    * **AIC**: **3375.443**
    * **BIC**: **13736.205**
    * **Silhouette Score**: **0.083**
    * **Indication**: The Silhouette Score of **0.083** is positive but low. This suggests some internal structure, but the clusters are not strongly distinct or well-separated in terms of distance. However, for GMMs, which can model overlapping or non-spherical clusters, the Silhouette Score is not necessarily the best single indicator, as GMMs excel in scenarios where hard partitioning by distance (like K-Means) might be too rigid. AIC and BIC provide better insight into model fit given complexity.

* **Observed Cluster Distribution:**
    * When the GMM model was fitted with `K=3`, the distribution of restaurants among the clusters was:
        * **Cluster 2.0: 54 restaurants**
        * **Cluster 0.0: 37 restaurants**
        * **Cluster 1.0: 9 restaurants**
    * **Insight**: This cluster distribution is significantly more balanced than those produced by K-Means (97, 1, 1, 1) or Agglomerative Clustering (91, 5, 3, 1). GMM was able to partition the 100 restaurants into three groups of more comparable sizes. This suggests GMM found a more nuanced and potentially more interpretable way to segment the restaurants, identifying distinct subgroups within the larger homogeneous set that K-Means and Agglomerative Clustering tended to put in one giant cluster. This makes GMM's clustering potentially more useful for targeted business strategies, as it breaks the large majority down into more actionable segments.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import numpy as np # For potential manual calculations if needed

print("--- GMM Model: Cross-Validation & Hyperparameter Tuning ---")

# Re-initializing and fitting GMM with the chosen optimal K
# This block essentially confirms the hyperparameter-tuned model
optimal_k_gmm_tuned = 3 # Our chosen optimal K from AIC/BIC/Silhouette analysis
covariance_type = 'full' # Default and common choice for GMM, can be 'tied', 'diag', 'spherical'

gmm_optimized = GaussianMixture(n_components=optimal_k_gmm_tuned,
                                covariance_type=covariance_type, # Specify covariance type
                                random_state=42, n_init=10)

# Fit the Algorithm
gmm_optimized.fit(scaled_pca_df)

# Predict on the model
# Get cluster labels for each restaurant
optimized_gmm_cluster_labels = gmm_optimized.predict(scaled_pca_df)

print(f"\nGMM model refitted with optimal K = {optimal_k_gmm_tuned} and covariance_type='{covariance_type}'.")
print("Optimized Model Parameters (Key Hyperparameters):")
print(f"  n_components (K): {gmm_optimized.n_components}")
print(f"  covariance_type: {gmm_optimized.covariance_type}")
print(f"  random_state: {gmm_optimized.random_state}")
print(f"  n_init: {gmm_optimized.n_init}")

# Display performance metrics for the optimized model
optimized_gmm_aic = gmm_optimized.aic(scaled_pca_df)
optimized_gmm_bic = gmm_optimized.bic(scaled_pca_df)

if optimal_k_gmm_tuned > 1:
    optimized_gmm_silhouette_avg = silhouette_score(scaled_pca_df, optimized_gmm_cluster_labels)
else:
    optimized_gmm_silhouette_avg = np.nan

print(f"\nPerformance Metrics for Optimized GMM Model:")
print(f"  AIC: {optimized_gmm_aic:.3f}")
print(f"  BIC: {optimized_gmm_bic:.3f}")
print(f"  Silhouette Score: {optimized_gmm_silhouette_avg:.3f}")

# Reconfirm the distribution of clusters
restaurants_with_gmm_clusters_optimized = pd.DataFrame(restaurant_df['Name']).set_index('Name').copy()
restaurants_with_gmm_clusters_optimized['Cluster'] = pd.Series(optimized_gmm_cluster_labels, index=scaled_pca_df.index)
print(f"\nUpdated count of restaurants in each cluster (Optimized K={optimal_k_gmm_tuned}):\n{restaurants_with_gmm_clusters_optimized['Cluster'].value_counts()}")

##### Which hyperparameter optimization technique have you used and why?

Answer -

For Gaussian Mixture Models (GMM), the primary hyperparameter to optimize is the **number of components (K)**. Other key hyperparameters include the `covariance_type`.

**Technique Used for Optimal K:**
* The optimal `K` was determined by evaluating and visualizing the **Bayesian Information Criterion (BIC)**, **Akaike Information Criterion (AIC)**, and **Silhouette Score** for a range of `K` values.
* **Reasoning**: GMM selection is typically done by minimizing AIC/BIC, which balance model fit with complexity. A lower AIC/BIC generally indicates a more parsimonious and well-fitting model. The Silhouette Score provides an additional internal validation measure of cluster separation. Based on the plots, `K=3` was chosen as it corresponded to the minimum values for both AIC and BIC, and also a peak in the Silhouette Score for the tested range.

**Other Hyperparameters Tuned/Considered:**
* **`covariance_type`**: The `covariance_type='full'` was used. This is the default and most flexible option, allowing each Gaussian component to have its own unconstrained covariance matrix. This enables the GMM to model clusters of various elliptical shapes and orientations, which is an advantage over K-Means' spherical assumption. Other options like 'tied', 'diag', or 'spherical' could be explored for different data distributions.
* **`n_init`**: Set to `10`. This ensures the algorithm runs `10` times with different random initializations and chooses the best fit (in terms of log-likelihood), making the results more robust and less susceptible to local optima.
* **`random_state`**: Set to `42` for reproducibility of results.
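* **Cross-Validation**: As with the other unsupervised models, label-based cross-validation is not applicable to GMMs because there are no ground truth labels. AIC, BIC, and Silhouette scores therefore serve as internal validation methods for model selection. Because a GMM is a density model, held-out log-likelihood (via `GaussianMixture.score` on a validation split) would be a possible alternative, but it was not used here.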
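
The remark that other `covariance_type` options could be explored can be sketched as a small grid over `covariance_type` and `K`, ranked by BIC. This is a hedged illustration on synthetic data, not a search that was run in this notebook; `X_demo`, `results`, and the `best_*` names are assumptions introduced for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data with three compact blobs (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(80, 2)),
    rng.normal(loc=[4.0, 0.0], scale=0.5, size=(80, 2)),
    rng.normal(loc=[2.0, 4.0], scale=0.5, size=(80, 2)),
])

# Fit one GMM per (covariance_type, K) pair and record its BIC
results = []
for cov_type in ["full", "tied", "diag", "spherical"]:
    for k in range(1, 7):
        model = GaussianMixture(n_components=k, covariance_type=cov_type,
                                random_state=42, n_init=5)
        model.fit(X_demo)
        results.append((model.bic(X_demo), cov_type, k))

# Lowest BIC wins: best trade-off between fit and number of parameters
best_bic, best_cov_type, best_k = min(results)
print(f"Best BIC {best_bic:.1f} with covariance_type='{best_cov_type}', K={best_k}")
```

On the real data, `X_demo` would be replaced by `scaled_pca_df`; a simpler covariance structure can win on BIC when the number of restaurants is small relative to the number of features.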

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer -

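Yes, but relative to the other models rather than from the refit itself. Refitting GMM with the same `K=3` and `covariance_type='full'` (the default) reproduces the metrics of the initial fit. The significant improvement is in **cluster distribution and potential business interpretability** compared with K-Means and Agglomerative Clustering, even though GMM's Silhouette Score is lower than theirs.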

**Evaluation Metric Score Chart (Updated Metrics for GMM K=3):**
* **AIC**: **3375.443**
* **BIC**: **13736.205**
* **Silhouette Score**: **0.083**

**Comparison and Improvement:**
* **Cluster Balance (Key Improvement)**: This is the most notable improvement. Unlike K-Means (which resulted in clusters of sizes 97, 1, 1, 1) and Agglomerative Clustering (91, 5, 3, 1), GMM with `K=3` produced a much more **balanced cluster distribution**:
    * **Cluster 2.0: 54 restaurants**
    * **Cluster 0.0: 37 restaurants**
    * **Cluster 1.0: 9 restaurants**
    This indicates that GMM was able to find more meaningful and interpretable subgroups within the data, moving away from single-restaurant outlier clusters. This balanced segmentation is highly valuable for business strategy, as it provides distinct segments to target.
* **Modeling Complex Shapes**: GMM's Silhouette Score (0.083, at `K=3`) is lower than K-Means' (0.179) and Agglomerative's (0.199), both obtained at `K=4`. However, GMM's strength lies in modeling more complex, overlapping, or non-spherical cluster shapes probabilistically. The Silhouette Score might not fully capture this advantage if clusters are truly overlapping or non-convex. The minimum in AIC/BIC at K=3 suggests that this number of components gives the best trade-off between fit and complexity among the values tested.
* **Potential for Actionable Insights**: The more evenly sized clusters from GMM are inherently more actionable from a business perspective, as they represent larger, more tangible segments of restaurants, rather than isolating individual outliers. This implies a better overall understanding of the restaurant landscape.

####  Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer -

The evaluation metrics used for the clustering models provide crucial insights into the quality and actionability of the identified restaurant segments, directly impacting business decisions.

**1. Inertia (K-Means and Agglomerative Clustering):**
* **Indication**: Inertia measures the sum of squared distances of data points to their closest cluster centroid. A lower inertia indicates more compact and cohesive clusters, meaning the restaurants within a cluster are very similar to each other.
* **Business Impact**:
    * **Positive Impact**: Lower inertia suggests that the identified restaurant segments are internally homogeneous. For Zomato, this means that marketing campaigns or operational strategies targeted at a specific cluster will be highly effective because the restaurants within that cluster share strong similarities (e.g., similar cost structure, cuisine types, customer review patterns). This leads to efficient resource allocation and tailored offerings.
    * **Negative Impact**: Very high inertia (especially if `K` is too low) would indicate that the clusters are not well-defined, and restaurants within a segment are too diverse, making targeted strategies less effective.
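
The definition of inertia above can be checked with a short sketch on synthetic data. The names `X_demo`, `kmeans_demo`, and `manual_inertia` are assumptions for illustration. Note that scikit-learn's `AgglomerativeClustering` exposes no `inertia_` attribute, so for that model an equivalent value would have to be computed by hand in a similar way, using each cluster's mean as its centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (illustrative only, not the Zomato features)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 3))

kmeans_demo = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_demo)

# Inertia by hand: squared distance from each point to its assigned centroid, summed
assigned_centroids = kmeans_demo.cluster_centers_[kmeans_demo.labels_]
manual_inertia = ((X_demo - assigned_centroids) ** 2).sum()

print(f"manual:  {manual_inertia:.4f}")
print(f"sklearn: {kmeans_demo.inertia_:.4f}")
```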

**2. Silhouette Score (K-Means, Agglomerative Clustering, GMM):**
* **Indication**: The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 (poor clustering) to +1 (well-separated clusters).
* **Business Impact**:
    * **Positive Impact**: A higher positive Silhouette Score suggests that the clusters are distinct and well-separated. From a business perspective, this means the identified restaurant segments are genuinely different from each other. Zomato can confidently develop differentiated strategies for each segment (e.g., "Budget Bites," "Fine Dining Gems," "Active Reviewer Hubs") without significant overlap in their target audience or operational characteristics. This enables clear differentiation in product development, marketing, and customer communication.
    * **Negative Impact**: A low or negative Silhouette Score (e.g., GMM's 0.083 vs K-Means/Agglomerative's ~0.18-0.2) implies significant overlap between clusters or misassigned data points. This would lead to ambiguous segments that are difficult to target uniquely, potentially causing confusion in marketing efforts or ineffective business decisions. While GMM's Silhouette was lower, its improved cluster balance (discussed below) might outweigh this for interpretability.
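
As a worked illustration of the cohesion-versus-separation idea, the sketch below computes the silhouette value of a single point by hand, as `(b - a) / max(a, b)`, and compares it with scikit-learn. The toy points, labels, and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Six toy points in two obvious groups (illustrative only)
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_demo = np.array([0, 0, 0, 1, 1, 1])

dist = pairwise_distances(X_demo)  # Euclidean distance matrix

i = 0  # point whose silhouette value is computed by hand
same_cluster = labels_demo == labels_demo[i]
same_cluster[i] = False  # exclude the point itself

# a: mean distance to the other members of its own cluster (cohesion)
a_i = dist[i, same_cluster].mean()

# b: smallest mean distance to the members of any other cluster (separation)
b_i = min(dist[i, labels_demo == c].mean()
          for c in np.unique(labels_demo) if c != labels_demo[i])

s_i = (b_i - a_i) / max(a_i, b_i)

print(s_i, silhouette_samples(X_demo, labels_demo)[i])
```

The overall Silhouette Score reported for each model is the mean of these per-point values.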

**3. AIC (Akaike Information Criterion) & BIC (Bayesian Information Criterion) (GMM):**
* **Indication**: These metrics balance model fit with model complexity. Lower AIC/BIC values are preferred, suggesting a model that explains the data well without being overly complex (i.e., using too many clusters).
* **Business Impact**:
    * **Positive Impact**: Choosing a `K` that minimizes AIC/BIC helps Zomato select a clustering solution that is parsimonious yet effective. This means the company uses the fewest number of segments necessary to adequately represent the underlying patterns in their restaurant data. This translates to simpler, more manageable business strategies, reduced overhead in segment management, and easier communication of segment characteristics to stakeholders. An overly complex model (too many clusters) would be difficult to manage and act upon.
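
How AIC and BIC trade fit against complexity can be made concrete with a sketch that recomputes both from the log-likelihood and the number of free parameters, then compares them with scikit-learn's `aic` and `bic` methods. The data is synthetic, the parameter count shown applies to `covariance_type='full'` only, and all variable names are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (illustrative only, not the Zomato features)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(150, 2))
n_samples, n_features = X_demo.shape

k = 3
gmm_demo = GaussianMixture(n_components=k, covariance_type="full",
                           random_state=42).fit(X_demo)

# Free parameters of a full-covariance GMM
n_params = (k * n_features                               # component means
            + k * n_features * (n_features + 1) // 2     # symmetric covariance matrices
            + (k - 1))                                   # mixing weights (sum to 1)

# score() returns the mean log-likelihood per sample, so scale by n_samples
total_log_likelihood = gmm_demo.score(X_demo) * n_samples

manual_aic = -2 * total_log_likelihood + 2 * n_params
manual_bic = -2 * total_log_likelihood + n_params * np.log(n_samples)

print(manual_aic, gmm_demo.aic(X_demo))
print(manual_bic, gmm_demo.bic(X_demo))
```

Both criteria share the same fit term, so the gap between BIC and AIC is `n_params * (ln(n_samples) - 2)`: it grows with the number of parameters, which is why BIC penalizes extra components more heavily.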

**4. Cluster Distribution (Across all Models, but especially GMM):**
* **Indication**: The size and balance of clusters (e.g., K-Means' 97/1/1/1 vs. GMM's 54/37/9) directly reflect how restaurants are grouped.
* **Business Impact**:
    * **Positive Impact**: A more balanced distribution of cluster sizes (as seen with GMM) is often more desirable from a business perspective. It provides a clearer breakdown of the market into distinct, actionable segments, each large enough to warrant specific attention and strategy. For example, if Zomato wants to offer different marketing packages to "high-cost" vs. "mid-cost" restaurants, having distinct groups of sufficient size is more useful than having one large group and several single-restaurant outliers. Balanced clusters lead to more targeted and effective campaigns.
    * **Negative Impact**: Highly imbalanced clusters (like 97 restaurants in one segment and 1-3 in others from K-Means/Agglomerative) might indicate that the model is primarily identifying outliers rather than meaningful segments. While identifying outliers is valuable, it might not provide a comprehensive segmentation strategy for the broader restaurant base. Businesses would struggle to design strategies for segments consisting of only one or two entities.

In summary, evaluating clustering models goes beyond just statistical scores; it profoundly impacts the feasibility and effectiveness of derived business strategies. A good model yields interpretable, actionable, and relatively distinct segments that Zomato can leverage for marketing, product development, and operational improvements.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer -

For assessing the positive business impact of the clustering models, a combination of statistical evaluation metrics and practical considerations of cluster characteristics was considered.

**1. Statistical Evaluation Metrics:**
* **Inertia (Within-Cluster Sum of Squares)**:
    * **Considered for Impact**: Yes.
    * **Why**: Inertia measures the compactness of clusters. From a business perspective, lower inertia indicates that restaurants within a given segment are highly similar to each other. This homogeneity is crucial for developing precise and effective targeted marketing campaigns, tailored service offerings, or specific operational improvements, as the restaurants in that group will likely respond similarly to a given strategy. It signals how "tight" the segment is.
* **Silhouette Score**:
    * **Considered for Impact**: Yes.
    * **Why**: The Silhouette Score measures how well-separated clusters are and how well each data point fits its own cluster compared to others. A higher Silhouette Score means the identified segments are more distinct from one another. For Zomato, this translates to clarity in market segmentation, allowing for differentiated strategies for each segment without significant overlap or confusion. It helps ensure that a marketing message for "budget-friendly cafes" doesn't accidentally appeal to "high-end dining."
* **AIC (Akaike Information Criterion) & BIC (Bayesian Information Criterion)**:
    * **Considered for Impact**: Yes (especially for GMM).
    * **Why**: These information criteria help in selecting the optimal number of clusters by balancing model fit with complexity. From a business viewpoint, minimizing AIC/BIC helps choose the simplest model (fewest clusters) that still adequately describes the data. This leads to more manageable and actionable segments, avoiding over-segmentation which can complicate strategy implementation and resource allocation. It provides a statistical justification for the chosen number of segments.

**2. Practical Considerations for Business Impact:**
* **Cluster Distribution and Interpretability**:
    * **Considered for Impact**: Highly important.
    * **Why**: Beyond statistical scores, the *interpretability* and *actionability* of the clusters are paramount for business. A clustering solution that yields segments of meaningful sizes (e.g., GMM's more balanced distribution of 54, 37, and 9 restaurants) is more valuable than one with highly skewed sizes (e.g., K-Means' 97, 1, 1, 1). Well-distributed clusters represent tangible market segments that Zomato can effectively target, develop specific programs for, or analyze in depth. A model that clearly identifies distinct, sizable groups allows for more practical business strategies, as segments consisting of a single restaurant are often not actionable. This qualitative assessment of cluster balance and potential for clear distinction is a direct driver of positive business impact.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer -

From the three implemented clustering models (K-Means, Agglomerative Clustering, and Gaussian Mixture Models - GMM), the **Gaussian Mixture Model (GMM) with K=3 is chosen as the final prediction model** for clustering Zomato restaurants.

**Justification for Choosing GMM:**

1.  **Actionable Cluster Distribution (Primary Reason)**: This is the most crucial factor for business impact.
    * K-Means (K=4) resulted in a highly imbalanced distribution of (97, 1, 1, 1) restaurants, essentially identifying one large group and three singleton outliers.
    * Agglomerative Clustering (K=4) also produced a skewed distribution of (91, 5, 3, 1).
    * In contrast, **GMM (K=3) yielded a significantly more balanced and interpretable cluster distribution of (54, 37, 9) restaurants**. For business stakeholders, having distinct segments of 37 or 9 restaurants is far more actionable for targeted marketing campaigns, operational improvements, or strategic partnerships than dealing with segments composed of just one restaurant. This directly supports the project's goal of helping the company grow and address areas they are "lagging in" by providing tangible segments.

2.  **Modeling Flexibility**:
    * GMMs are probabilistic models that assume data points are generated from a mixture of Gaussian distributions. This allows them to model clusters with varying sizes, shapes, and orientations (due to its `covariance_type='full'` setting), unlike K-Means which assumes spherical clusters of similar variance. This flexibility is crucial for capturing the complex and potentially overlapping structures within real-world restaurant data.

3.  **Optimal K Selection Consistency**:
    * For GMM, the choice of `K=3` was robustly supported by statistical criteria, as it minimized both **AIC and BIC**, and also showed a peak in the Silhouette Score for the tested range. This provides a strong statistical basis for the chosen number of segments.

4.  **Addressing Business Needs**:
    * The project aims to help customers find "the Best restaurant in their locality" and for the "company to grow up and work on the fields they are currently lagging in". A balanced segmentation from GMM provides clear categories that can be used to curate restaurant lists for customers (e.g., distinguishing between a larger "mainstream" group, a "mid-tier" group, and a "niche/premium" group) and for Zomato to develop specific strategies for each segment.

**Comparison with Other Models:**
K-Means and Agglomerative Clustering achieved higher Silhouette Scores (0.179 and 0.199 respectively, compared to GMM's 0.083), but this metric tends to favor compact, well-separated, roughly spherical clusters. GMM's strength lies in modeling more complex data structures, and its ability to create more balanced clusters, despite a lower Silhouette Score, makes it superior from an interpretability and practical actionability standpoint for this business problem. The more balanced clusters suggest a better overall representation of the natural groupings in the data for business purposes.

Therefore, despite its lower Silhouette Score, the **Gaussian Mixture Model is selected as the final prediction model due to its superior ability to provide a more actionable and interpretable segmentation of the Zomato restaurants for business strategy and customer assistance.**

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer -

The chosen model for restaurant clustering is the **Gaussian Mixture Model (GMM) with K=3**.

**Explanation of the Gaussian Mixture Model (GMM):**
As discussed previously, GMM is a probabilistic clustering algorithm that models data points as belonging to a mixture of Gaussian distributions. It excels at identifying clusters of varying shapes and sizes and provides "soft assignments" (probabilities) for each data point to belong to a cluster, rather than hard assignments. The Expectation-Maximization (EM) algorithm is used to fit the model. GMM was chosen for its ability to produce more balanced and actionable segments compared to other models, which is crucial for business interpretation.

**Feature Importance and Model Explainability for Clustering:**
For unsupervised clustering models like GMM, "feature importance" in the traditional sense (e.g., how much a feature contributes to predicting a target variable) is not directly applicable because there is no target variable. Explainability tools such as SHAP or LIME are built to explain a supervised model's predictions, so they do not provide feature importance scores for cluster assignments out of the box.

Instead, model explainability for clustering is achieved by **characterizing each cluster** based on the average (or median) values of the *original features* that constitute those clusters. By comparing the feature profiles across different clusters, we can infer which features are most "important" in defining and differentiating each segment. This process helps us understand the distinct characteristics of the restaurants within each group, providing actionable business insights.

In the subsequent code, we will analyze the mean values of the **scaled original features** for each of the 3 GMM clusters to understand what makes each cluster unique and which features are most prominent in defining them.

In [None]:
# Characterizing Clusters to understand Feature Importance

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("--- Characterizing GMM Clusters (K=3) based on Scaled Features ---")

# Ensure scaled_features_df is available (it's the data BEFORE PCA, but after standard scaling)
# This was created in 'Feature Selection' section.
# We need to use this to interpret original features, not just PCA components.
# If scaled_features_df is not directly available, you might need to re-run the Feature Selection cell.

# Get the cluster labels from the final GMM model
# gmm_model was fitted with scaled_pca_df, so predict on scaled_pca_df to get labels
gmm_final_labels = gmm_model.predict(scaled_pca_df)

# Add the cluster labels to the scaled_features_df (which contains original scaled features)
# Ensure the index aligns
scaled_features_with_clusters = scaled_features_df.copy()
scaled_features_with_clusters['GMM_Cluster'] = pd.Series(gmm_final_labels, index=scaled_features_df.index)

# Calculate the mean of each feature for each cluster
cluster_means = scaled_features_with_clusters.groupby('GMM_Cluster').mean()

print("\nMean values of Scaled Features for each GMM Cluster (transposed):")
# Display the cluster means, transpose for better readability (clusters as columns)
# Round to 2 decimal places for cleaner output
print(cluster_means.T.round(2))

# --- Visualizing Top Differentiating Features for each Cluster ---

print("\n--- Top Differentiating Features for Each Cluster (Bar Plot) ---")

# For each cluster, identify the top N features with the highest absolute mean values.
num_top_features_to_plot = 10 # Display top 10 features for each cluster

# Create a list to store data for visualization
top_features_vis_data = []

for cluster_id in cluster_means.index:
    # Sort features by absolute mean value in descending order for the current cluster
    # This identifies features that are most "extreme" (high or low) for that cluster
    sorted_features = cluster_means.loc[cluster_id].abs().sort_values(ascending=False)

    # Get the names of the top N features for this cluster
    top_feature_names = sorted_features.head(num_top_features_to_plot).index.tolist()

    # Get the actual mean values for these top features in the current cluster
    # We use the original (non-absolute) mean values for the plot to show direction (positive/negative)
    top_feature_values = cluster_means.loc[cluster_id, top_feature_names].values

    # Append to our visualization data
    # zip() avoids an IndexError when a cluster has fewer than N features available
    for feature_name, mean_value in zip(top_feature_names, top_feature_values):
        top_features_vis_data.append({
            'Cluster': f'Cluster {cluster_id}',
            'Feature': feature_name,
            'Mean_Value': mean_value
        })

top_features_df_plot = pd.DataFrame(top_features_vis_data)

# Create a bar plot for top features per cluster
plt.figure(figsize=(15, 8))
sns.barplot(x='Mean_Value', y='Feature', hue='Cluster', data=top_features_df_plot,
            palette='viridis', dodge=False) # Dodge=False to make bars appear per feature
plt.title(f'Top {num_top_features_to_plot} Differentiating Scaled Features Per GMM Cluster', fontsize=16)
plt.xlabel('Mean Scaled Value (Higher means higher than average, Lower means lower than average)', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# You might also want to look at a few specific business-relevant features across all clusters
# For example, let's select a few key original features for a summarized view
relevant_business_features_summary = [
    'Cost', 'Avg_Review_Rating', 'Avg_Pictures_Per_Review',
    'Avg_Reviewer_Reviews_Per_Restaurant', 'Avg_Reviewer_Followers_Per_Restaurant',
    'is_late_night_encoded', 'North Indian', 'Chinese', 'Biryani',
    'Barbecue & Grill', 'Food Hygiene Rated Restaurants in Hyderabad', 'Corporate Favorites',
    'Great Buffets', 'Top-Rated' # Add other relevant one-hot encoded or aggregated features
]

# Filter and display the means for these specific features
# Ensure these features exist in cluster_means.columns before selecting
existing_relevant_features = [f for f in relevant_business_features_summary if f in cluster_means.columns]
if existing_relevant_features:
    print("\nMean values for selected business-relevant features across all clusters (transposed):")
    print(cluster_means[existing_relevant_features].T.round(2))
else:
    print("\nNo specific business-relevant features found in cluster means to display.")

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

import joblib
import os # For managing file paths

# Define the filename for saving the model
model_filename = 'gmm_restaurant_clustering_model.joblib'

# Save the trained GMM model
joblib.dump(gmm_model, model_filename)

print(f"GMM model saved successfully as '{model_filename}'")
print(f"File path: {os.path.abspath(model_filename)}")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

import joblib
import pandas as pd
import numpy as np # For numerical operations if needed

# Define the filename from where the model will be loaded
model_filename = 'gmm_restaurant_clustering_model.joblib'

# Load the trained GMM model
try:
    loaded_gmm_model = joblib.load(model_filename)
    print(f"GMM model loaded successfully from '{model_filename}'")
except FileNotFoundError:
    print(f"Error: Model file '{model_filename}' not found. Please ensure it was saved correctly.")
    loaded_gmm_model = None # Set to None to prevent further errors

if loaded_gmm_model is not None:
    # --- Sanity Check: Predict on a small subset of existing data ---
    # Take a few sample data points from your scaled_pca_df
    # For example, the first 5 rows
    sample_unseen_data = scaled_pca_df.head(5)
    print("\nSample 'unseen' data (first 5 rows of scaled PCA components):")
    print(sample_unseen_data)

    # Predict clusters using the loaded model
    predicted_clusters_sample = loaded_gmm_model.predict(sample_unseen_data)

    print("\nPredicted clusters for the sample 'unseen' data:")
    print(predicted_clusters_sample)

    # You can compare these with the original gmm_cluster_labels if you store them.
    # For example, compare the first 5 original labels:
    # print("\nOriginal clusters for the same sample data:")
    # print(gmm_cluster_labels[:5]) # This assumes gmm_cluster_labels is still in memory and in the same order

# **Conclusion**

------------------------------------------------

This project focused on analyzing Zomato restaurant data in India to gain actionable insights and segment restaurants using unsupervised machine learning techniques. The primary objectives were to understand the Indian food industry landscape, analyze customer review sentiments, and cluster restaurants into distinct segments to benefit both Zomato (for growth and improvements) and its customers (for finding the best restaurants).

**Key Steps and Insights:**

1.  **Data Understanding and Wrangling**:
    * Initial exploration revealed two main datasets: Restaurant Data (105 entries, 6 columns) and Review Data (10,000 entries, 7 columns).
    * Crucial issues identified and addressed included: `Cost` as an object type with commas, `Rating` as an object type containing 'Like' and NaNs, multi-valued `Collections` and `Cuisines` strings, and significant missing values in `Collections` (54 entries) and review-related columns (38-45 entries).
    * Manipulations involved converting `Cost` and `Rating` to numerical types, handling duplicates (36 rows in reviews), filling or dropping missing values, and extracting `Reviewer_Reviews` and `Reviewer_Followers` from metadata. Together, these steps left both datasets clean and in numerical form.

2.  **Exploratory Data Analysis (EDA) and Visualization**:
    * **Restaurant Costs**: Most restaurants were budget-friendly (below INR 1000), and the cost distribution was right-skewed. This points to a competitive budget-to-mid-range market and potential for a high-end niche.
    * **Customer Ratings**: Overwhelmingly positive, dominated by 4.0 and 5.0-star ratings, suggesting high satisfaction but also potential for rating bias or lack of detailed negative feedback.
    * **Cuisine Popularity**: North Indian and Chinese cuisines were the most prevalent.
    * **Cuisine Quality**: Interestingly, while popular, North Indian and Chinese had lower average ratings compared to Asian, Continental, and Italian cuisines, highlighting a quality consistency gap in highly prevalent categories.
    * **Reviewer Activity/Influence**: Most reviewers were casual users, but a small number were highly prolific and influential (high reviews, high followers).
    * **Time Trends**: Significant growth in review activity from mid-2018 onwards, but average ratings showed a slight decline from 2016 to 2018, hinting at potential dilution of experience quality during growth.
    * **Correlations**: Very weak correlation between rating and reviewer activity/pictures, suggesting rating is independent of reviewer volume/influence. Moderate correlations between reviewer activity metrics (reviews, followers, pictures).

3.  **Feature Engineering and Preprocessing**:
    * **Outlier Treatment**: Capping at the 99th percentile was applied to skewed numerical features (`Cost`, `Pictures`, `Reviewer_Reviews`, `Reviewer_Followers`) to mitigate outlier influence on models.
    * **Categorical Encoding**: One-Hot Encoding was extensively applied to multi-valued (`Collections`, `Cuisines`) and nominal (`Timings`) categorical features, significantly expanding the feature set.
    * **Textual Data Preprocessing**: The `Review` text underwent comprehensive NLP steps (contraction expansion, lowercasing, punctuation/URL/digit removal, stopwords removal, tokenization, POS-aware lemmatization), culminating in **TF-IDF Vectorization**, transforming reviews into 3441 numerical features.
    * **Feature Manipulation**: Aggregated review-level features (e.g., average rating, pictures, reviewer activity per restaurant) and TF-IDF features (summed per restaurant) were combined with restaurant-level features into a single `final_clustering_df`.
    * **Dimensionality Reduction**: **Principal Component Analysis (PCA)** reduced the 3650 features to 50 principal components, retaining ~67% of total variance, mitigating the curse of dimensionality.
    * **Data Scaling**: `StandardScaler` was applied to the PCA components, ensuring all components are on a comparable scale for distance-based clustering.

4.  **ML Model Implementation and Selection**:
    * Three unsupervised clustering models were implemented and evaluated using Inertia, Silhouette Score, AIC/BIC, and most importantly, **cluster distribution/actionability**.
        * **K-Means (K=4)**: Inertia: 4725.57, Silhouette: 0.179. Clusters: (97, 1, 1, 1). Highly imbalanced.
        * **Agglomerative Clustering (K=4)**: Inertia: 4705.92, Silhouette: 0.199. Clusters: (91, 5, 3, 1). Slightly better, but still skewed.
        * **Gaussian Mixture Models (GMM) (K=3)**: AIC: 3375.44, BIC: 13736.21, Silhouette: 0.083. Clusters: (54, 37, 9).
    * **Chosen Model: Gaussian Mixture Model (GMM) with K=3.** Despite its lower Silhouette Score (0.083 versus 0.179 and 0.199), GMM was selected because it produced a **more balanced and actionable cluster distribution**. This allows more meaningful segmentation for targeted business strategies than K-Means and Agglomerative Clustering, which mostly split off a few restaurants into very small or singleton clusters.
    * **Cluster Characterization (Feature Importance)**: The 3 GMM clusters were characterized by analyzing the mean of scaled original features:
        * **Cluster 0 (Popular & High-Engagement)**: Higher cost, above-average rating, high pictures/reviewer engagement, late-night.
        * **Cluster 1 (Premium, High-Visibility, Mixed-Review)**: Highest cost, highest pictures/reviewer engagement, but below-average rating.
        * **Cluster 2 (Budget-Friendly, Standard Quality)**: Lower cost, slightly below-average rating, lower pictures/reviewer engagement.
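
The model comparison and cluster characterization described in step 4 can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes synthetic data from `make_blobs` in place of the scaled PCA components, and the `feature_*` column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled PCA components (assumption, not the project data).
X, _ = make_blobs(n_samples=100, centers=3, n_features=6, random_state=0)

# Fit each model and collect its hard cluster labels.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
agglo = AgglomerativeClustering(n_clusters=4).fit(X)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

labels_by_model = {
    "K-Means (K=4)": kmeans.labels_,
    "Agglomerative (K=4)": agglo.labels_,
    "GMM (K=3)": gmm.predict(X),
}

# Report the Silhouette Score next to the cluster sizes, so that an
# imbalanced split is visible alongside the score.
rows = []
for name, labels in labels_by_model.items():
    sizes = sorted(np.bincount(labels).tolist(), reverse=True)
    rows.append({
        "model": name,
        "silhouette": round(silhouette_score(X, labels), 3),
        "cluster_sizes": sizes,
    })
comparison = pd.DataFrame(rows)
print(comparison.to_string(index=False))

# Model-specific criteria: inertia applies to K-Means, AIC/BIC to the GMM.
print(f"K-Means inertia: {kmeans.inertia_:.2f}")
print(f"GMM AIC: {gmm.aic(X):.2f}, BIC: {gmm.bic(X):.2f}")

# Characterize the GMM clusters by the mean of each feature per cluster.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names
profile = (
    pd.DataFrame(X, columns=feature_names)
    .assign(cluster=labels_by_model["GMM (K=3)"])
    .groupby("cluster")
    .mean()
)
print(profile.round(2))
```

Printing cluster sizes beside the Silhouette Score mirrors the selection logic above: a split such as (97, 1, 1, 1) is easy to spot even when its score is higher.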

**Business Impact and Conclusion:**

This project successfully demonstrates how Zomato can leverage its extensive restaurant and review data for strategic segmentation. The identified clusters provide a clear framework for:
* **Targeted Marketing**: Developing specific campaigns for "Popular & High-Engagement," "Premium, High-Visibility, Mixed-Review," and "Budget-Friendly, Standard Quality" restaurants.
* **Product Development**: Curating specialized collections for customers (e.g., "Top Picks from Cluster 0," "Unique Experiences from Cluster 1").
* **Operational Improvements**: Identifying areas for improvement within the "Budget-Friendly" segment (Cluster 2), or understanding the challenges faced by the "Premium, Mixed-Review" segment (Cluster 1) in order to convert high visibility into consistent satisfaction.
* **Competitive Analysis**: Understanding the market composition (dominant budget-friendly options vs. niche premium).

The GMM, with its balanced cluster output, offers the most practical and interpretable segmentation, making it a powerful tool for Zomato to enhance user experience, optimize business strategies, and drive growth in the dynamic Indian food industry.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***