# **Project Name**    -



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name**    - Arpit Dutta

# **Project Summary -**

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Zomato, an Indian start-up founded in 2008 by Deepinder Goyal and Pankaj Chaddah, serves as a restaurant aggregator and food delivery platform. It offers information such as menus, customer reviews, and food delivery options from partnered restaurants in select cities. Known for its rich culinary diversity, India boasts numerous restaurants and hotel resorts, embodying the concept of unity in diversity. The restaurant industry in India is constantly evolving, with an increasing number of people embracing dining out or ordering food deliveries. The surge in restaurants across various states has inspired an exploration of data to uncover insights and intriguing trends about the Indian food scene in different cities.

This project aims to analyze Zomato's restaurant data across Indian cities, focusing on both customers and the company. It involves assessing customer review sentiments to draw meaningful conclusions, which will be presented through visualizations for quick and easy analysis. Additionally, the project will cluster Zomato's restaurants into different segments. The analysis will address business questions, aiding customers in finding the best local restaurants and helping the company identify areas for growth and improvement. The data also provides insights into cuisine and pricing, which can be leveraged for cost-benefit analysis. Moreover, sentiment analysis and reviewer metadata can help identify influential critics within the industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # numpy.math is not available in recent NumPy releases; use the standard-library module
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
import missingno

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer


import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
restaurant_data = pd.read_csv('/content/drive/MyDrive/ZOMATO/Zomato Restaurant names and Metadata.csv')
reviews = pd.read_csv('/content/drive/MyDrive/ZOMATO/Zomato Restaurant reviews.csv')

### Dataset First View

In [None]:
# Dataset First Look
restaurant_data.head()

In [None]:
reviews.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
restaurant_data.shape

In [None]:
reviews.shape

### Dataset Information

In [None]:
# Dataset Info
restaurant_data.info()

In [None]:
reviews.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Duplicates in restaurant_data:",restaurant_data.duplicated().sum())
print("Duplicates in reviews:",reviews.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing values in restaurant_data:")
print(restaurant_data.isnull().sum(),"\n")
print("Missing values in reviews:")
print(reviews.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(restaurant_data.isnull(), cbar=False)

In [None]:
sns.heatmap(reviews.isnull(), cbar=False)

### What did you know about your dataset?

Restaurants data:

* The restaurants data contains information about 105 restaurants with 6 features.
* About 52% of the values in the "Collections" column are missing.

Reviews data:

* The reviews data consists of 10,000 reviews of the 105 restaurants, with 7 features.
* Less than 0.5% of the values in the "Reviewer", "Review", "Rating", "Metadata" and "Time" columns are missing.

According to `info()`, the columns in both datasets are read in as string (object) data, including Cost and Rating, which therefore need converting to numeric types before analysis.
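
The missing-value percentages quoted above can be checked with a small helper. The sketch below is only illustrative: `missing_percentage` is a hypothetical helper name, and the toy frame stands in for `restaurant_data` with made-up values.

```python
import numpy as np
import pandas as pd

def missing_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).round(2).sort_values(ascending=False)

# Toy frame standing in for restaurant_data (values are invented for illustration)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Collections": ["Trending", np.nan, np.nan, "Late Night"],
    "Cost": ["800", "1,200", "500", np.nan],
})

missing_pct = missing_percentage(toy)
print(missing_pct)
```

In the notebook, the same helper could be called as `missing_percentage(restaurant_data)` and `missing_percentage(reviews)`.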

In [None]:
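# Split the comma-separated Cuisines string into a list of normalized cuisine names.
# Bracket assignment is used because attribute assignment does not create a DataFrame column.
restaurant_data["Cuisines_list"] = restaurant_data.Cuisines.apply(lambda x: x.lower().replace(" ","").split(","))

# Collect the distinct cuisines across all restaurants
cusines_set = set()
for cuisines in restaurant_data.Cuisines_list:
  cusines_set.update(cuisines)
print("Total No. of cuisines",len(cusines_set))
print(cusines_set)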

In [None]:
# Replacing non-numerical values in the 'Rating' column with NaN
reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors='coerce')

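# Filling NaN values with the median of the numeric ratings
# (assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas)
reviews['Rating'] = reviews['Rating'].fillna(reviews['Rating'].median())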

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
restaurant_data.columns

In [None]:
reviews.columns

In [None]:
# Dataset Describe
restaurant_data.describe()

In [None]:
reviews.describe()

### Variables Description

Restaurant Data:

* Name: Name of the restaurant
* Links: URL link of the restaurant
* Cost: Estimated per-person cost of dining
* Collections: Zomato category tags for the restaurant
* Cuisines: Cuisines served by the restaurant
* Timings: Restaurant timings

Review Data:

* Restaurant: Name of the restaurant being reviewed
* Reviewer: Name of the reviewer
* Review: Review text
* Rating: Rating provided
* Metadata: Reviewer metadata (number of reviews and followers)
* Time: Date and time of the review
* Pictures: Number of pictures posted with the review

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in restaurant_data.columns.tolist():
  print("Unique",i,":",restaurant_data[i].nunique())

In [None]:
for i in reviews.columns.tolist():
  print("Unique",i,":",reviews[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#  Handling missing values

# Dropping collections column since most of the values are null
restaurant_data.drop("Collections",axis= 1,inplace=True)

# Dropping remaining null values from restaurant_data and reviews since they are very few
restaurant_data.dropna(inplace=True)
reviews.dropna(inplace=True)

# Check for missing values After handling
print("Missing values in restaurant_data:")
print(restaurant_data.isnull().sum(),"\n")
print("Missing values in reviews:")
print(reviews.isnull().sum())

In [None]:
# Drop duplicate rows
restaurant_data.drop_duplicates(inplace=True)
reviews.drop_duplicates(inplace=True)
print("restaurant_data shape:",restaurant_data.shape)
print("reviews shape:",reviews.shape)

In [None]:
# Lets extract cuisines from the Cuisines(string) column and store as a list
print("Cuisines data Before preprocessing:\n")
print(restaurant_data.Cuisines[0])
print("\nCuisines data After preprocessing:\n")
print(restaurant_data.Cuisines[0].lower().replace(" ","").split(","))

In [None]:
# Lets apply the preprocessing steps on the Cuisines column
restaurant_data["Cuisines_list"] = restaurant_data.Cuisines.apply(lambda x: x.lower().replace(" ","").split(","))

# Let's check the count and names of the unique cuisines
cusines_set = set()
for cuisines in restaurant_data.Cuisines_list:
  cusines_set.update(cuisines)
print("Total number of unique cuisines: ",len(cusines_set))
cusines_set

In [None]:
# Preprocess and convert the cost column to int data type
restaurant_data['Cost'] = restaurant_data['Cost'].str.replace(',', '').str.replace('₹', '').astype(int)
restaurant_data.Cost

In [None]:
restaurant_data.Cost.describe()

In [None]:
restaurant_data[['Name','Cost','Cuisines_list']]

*Data Wrangling on Reviews*

In [None]:
# Coerce non-numeric ratings to NaN, then fill any that remain with the mean rating
reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors='coerce')
reviews['Rating'] = reviews['Rating'].fillna(reviews['Rating'].mean())

In [None]:
# Checking the Metadata column to create a regex expression
followers = reviews.Metadata.apply(lambda x: x.split(",")[-1])
print(followers.apply(lambda x: x.split(" ")[-1]).value_counts())

review_count = reviews.Metadata.apply(lambda x: x.split(",")[0])
print(review_count.apply(lambda x: x.split(" ")[-1]).value_counts())

In [None]:
import re

def extract_follower_and_review_count(text):
    """Return [review_count, follower_count] parsed from a Metadata string."""

    # Regular expressions for the review and follower counts
    review_pattern = r'(\d+) Review'
    followers_pattern = r'(\d+) Follower'

    # Search for the review and follower counts using regex
    review_match = re.search(review_pattern, text)
    followers_match = re.search(followers_pattern, text)

    # Convert matches to int so both columns are numeric; default to 0 when a count is absent
    review = int(review_match.group(1)) if review_match else 0
    followers = int(followers_match.group(1)) if followers_match else 0
    return [review, followers]

extract_follower_and_review_count("1 Review , 22 Follower")

In [None]:
reviews[['prev_reviews_count', 'followers_count']] = reviews['Metadata'].apply(extract_follower_and_review_count).apply(pd.Series)
reviews.drop('Metadata', axis=1, inplace=True)
merged_restaurant_data = pd.merge(reviews, restaurant_data[["Name","Cost","Cuisines_list"]], left_on='Restaurant', right_on='Name')
merged_restaurant_data

### What all manipulations have you done and insights you found?

Manipulations:

* Dropped the Collections column, the remaining missing values, and duplicates
* Extracted cuisines from the Cuisines column
* Converted the Cost column to the int data type

Insights:

* There are 44 unique cuisines across 104 restaurants.
* The estimated cost of dining at the 104 restaurants ranges from ₹150 to ₹2800.
* Extracting the locations from the Links column shows that all restaurants are in Gachibowli, Hyderabad.
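
The location check mentioned above is not shown in the notebook, so here is a minimal sketch of one way it could be done. It assumes each link has the form `/<city>/<restaurant-slug>` and that the locality is the last hyphen-separated word of the slug; both are assumptions, and the second would break for multi-word localities. The URL and the helper name `split_restaurant_link` are hypothetical.

```python
from urllib.parse import urlparse

def split_restaurant_link(link):
    """Return (city, slug) from a restaurant URL, assuming the path is /<city>/<slug>."""
    segments = [s for s in urlparse(link).path.split("/") if s]
    city = segments[0] if len(segments) > 0 else None
    slug = segments[1] if len(segments) > 1 else None
    return city, slug

# Hypothetical link used only to illustrate the assumed URL layout
example_link = "https://www.zomato.com/hyderabad/example-restaurant-gachibowli"

city, slug = split_restaurant_link(example_link)
# Assumption: the locality is the last hyphen-separated token of the slug
locality = slug.rsplit("-", 1)[-1] if slug else None
print(city, locality)
```

Applied to the Links column with `.apply`, counting the distinct `(city, locality)` pairs would show whether every restaurant really falls in a single area.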

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.histplot(data=restaurant_data, x='Cost', kde=True, color='skyblue')
plt.title('Restaurant Cost Distribution')
plt.xlabel('Cost')
plt.ylabel('Frequency')
plt.show()
print("\n",restaurant_data['Cost'].describe())

##### 1. Why did you pick the specific chart?
A histogram with a KDE overlay was chosen for the following reasons:

* Combines discrete and continuous views: the histogram shows the frequency of different cost ranges, while the KDE provides a smooth representation of the distribution, giving a more comprehensive view of the underlying patterns.
* Highlights central tendency and spread: with both the histogram and the KDE curve plotted, it is easy to see where most restaurant costs lie (central tendency) and how spread out they are (variability).
* Makes peaks and skewness easy to identify: the KDE curve reveals any peaks (modes), showing which cost ranges are more common, and indicates whether the distribution is skewed, which helps in understanding market segmentation.

##### 2. What is/are the insight(s) found from the chart?

* The average cost per person is ₹861, with a standard deviation of ₹515.
* The minimum observed cost is ₹150 and the maximum is ₹2800.

##### 3. Will the gained insights help creating a positive business impact?
Understanding Typical Cost Ranges:

From the descriptive statistics:

* Mean cost: ₹861, indicating that on average a meal at a restaurant costs around ₹861.
* Median cost (50%): ₹700, meaning half of the restaurants have meal costs of ₹700 or less.
* Standard deviation (std): ₹512, showing considerable variability in restaurant costs.
* Most restaurants have costs concentrated between ₹500 and ₹1200, as indicated by the 25th percentile (₹500) and the 75th percentile (₹1200).

Identifying Price Segmentation:

The data reveals several market segments, which can help Zomato target different customer groups. For example:

* Lower-end segment: restaurants with costs below ₹500.
* Mid-range segment: costs between ₹500 and ₹1200.
* High-end segment: restaurants costing above ₹1200.

Zomato can tailor its marketing strategies to these segments to attract different types of customers.

Potential Growth Opportunities:

If the distribution is skewed towards lower costs, Zomato could consider expanding partnerships with higher-end restaurants to balance the offering. Conversely, if high-cost restaurants dominate, promoting budget-friendly options could attract more customers.
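
The three price segments described above can be assigned programmatically. This is a minimal sketch on made-up costs; the thresholds (₹500 and ₹1200) come from the text, while the choice to count exactly ₹500 and ₹1200 as mid-range is an assumption about how the boundaries should be read.

```python
import numpy as np
import pandas as pd

# Toy costs standing in for restaurant_data['Cost'] (values invented for illustration)
costs = pd.Series([150, 400, 500, 700, 1200, 1500, 2800], name="Cost")

# Conditions are evaluated in order: below 500 is lower-end, up to and including 1200 is mid-range
conditions = [costs < 500, costs <= 1200]
choices = ["lower-end", "mid-range"]
segments = pd.Series(
    np.select(conditions, choices, default="high-end"),
    index=costs.index,
    name="Segment",
)

print(pd.concat([costs, segments], axis=1))
print(segments.value_counts())
```

Counting restaurants per segment in this way gives a quick read on whether the offering leans towards the lower or the higher end.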

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data = reviews, x = 'Rating', kde = True, color = 'skyblue')

print("\n",reviews['Rating'].describe())

The average rating is 3.6, indicating that ratings are positive overall.
There are two peaks in the distribution, at 1 and 5, indicating that customers tend to have strong opinions about their experiences, either very good or very bad, rather than neutral or average.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
all_cuisines = []

for cuisine_list in restaurant_data.Cuisines_list:
    for cuisine in cuisine_list:
        all_cuisines.append(cuisine)

# Count the occurrences of each cuisine type
cuisine_counts = pd.Series(all_cuisines).value_counts()

# Choose the top N cuisine types to display on the y-axis
top_n = 10  # You can change this number as needed
cuisine_counts = cuisine_counts.head(top_n)

# Create the bar chart
plt.figure(figsize=(10, 6))
bars = plt.barh(cuisine_counts.index, cuisine_counts.values)
plt.title('Top {} Cuisine Types in Restaurants'.format(top_n))
plt.xlabel('Number of Restaurants')
plt.ylabel('Cuisine Type')
plt.gca().invert_yaxis()  # Invert the y-axis to display the most common cuisine at the top
plt.show()

##### 2. What is/are the insight(s) found from the chart?

North Indian is the most common cuisine, while South Indian and bakery are the least common among the top 10 shown.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Reset index to ensure it's unique
restaurant_data.reset_index(drop=True, inplace=True)

# Explode 'Cuisines_list' together with 'Cost' so that each row pairs one cuisine with its restaurant's cost
cuisine_cost_df = (
    restaurant_data[['Cuisines_list', 'Cost']]
    .explode('Cuisines_list')
    .rename(columns={'Cuisines_list': 'Cuisine Type'})
)

# Count the occurrences of each cuisine type and select the top 20
top_20_cuisines = cuisine_cost_df['Cuisine Type'].value_counts().head(20).index
# Filter the DataFrame to include only the top 20 cuisines
cuisine_cost_df_top_20 = cuisine_cost_df[cuisine_cost_df['Cuisine Type'].isin(top_20_cuisines)]

# Calculate the median cost for each cuisine type and sort from highest to lowest
average_cost_by_cuisine = cuisine_cost_df_top_20.groupby('Cuisine Type')['Cost'].median().sort_values(ascending=False).reset_index()

# Create the bar chart
plt.figure(figsize=(10, 6))
bars = plt.barh(average_cost_by_cuisine['Cuisine Type'], average_cost_by_cuisine['Cost'])
plt.title('Median Cost by Cuisine Type')
plt.xlabel('Median Cost')
plt.ylabel('Cuisine Type')
plt.gca().invert_yaxis()  # Invert the y-axis to display the highest median cost at the top

plt.tight_layout()
plt.show()

##### 2. What is/are the insight(s) found from the chart?

From the above plot, the median cost is highest for Asian cuisine, followed by Italian and Mediterranean.
The lowest median costs are for fast food, burgers and bakery.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Take the 20 most affordable restaurants (the last 20 rows when sorted by cost, highest first)
affordable_restaurants = restaurant_data.sort_values(by='Cost', ascending=False).tail(20)

# Take the 20 most expensive restaurants (the first 20 rows when sorted by cost, highest first)
expensive_restaurants = restaurant_data.sort_values(by='Cost', ascending=False).head(20)

# Concatenate both DataFrames to create a single DataFrame with the top 20 affordable and top 20 expensive restaurants
top_restaurants = pd.concat([expensive_restaurants,affordable_restaurants])

# Create a bar plot to visualize the top 20 affordable and top 20 expensive restaurants
plt.figure(figsize=(18, 8))
plt.barh(top_restaurants['Name'], top_restaurants['Cost'])
plt.title('Top 20 Affordable and Expensive Restaurants')
plt.xlabel('Cost')
plt.ylabel('Restaurant Name')
plt.gca().invert_yaxis()  # Invert the y-axis to show the highest cost at the top
plt.tight_layout()

plt.show()

print("\n Statistics for top 20 expensive restaurants")
print(expensive_restaurants.describe())
print("\n Statistics for top 20 affordable restaurants")
print(affordable_restaurants.describe())

##### 2. What is/are the insight(s) found from the chart?


*  The top expensive restaurants are, on average, about 5.4 times more costly than the top affordable restaurants.
*  This data could help consumers make decisions based on their budget.
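
The cost ratio quoted above can be reproduced by dividing the mean cost of the most expensive group by the mean cost of the most affordable group. The sketch below uses a made-up six-row frame and `n = 2` purely for illustration; in the notebook the same calculation would use `restaurant_data` and `n = 20`.

```python
import pandas as pd

# Toy data standing in for restaurant_data (values invented for illustration)
toy = pd.DataFrame({
    "Name": list("ABCDEF"),
    "Cost": [200, 300, 400, 1000, 1500, 2000],
})

n = 2  # the notebook uses 20
expensive = toy.nlargest(n, "Cost")
affordable = toy.nsmallest(n, "Cost")

# Ratio of mean costs between the two groups
ratio = expensive["Cost"].mean() / affordable["Cost"].mean()
print(f"Top {n} expensive restaurants cost {ratio:.1f}x the top {n} affordable ones on average")
```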




#### Chart - 5

In [None]:
# Chart - 5 visualization code
import datetime as dt

# Converting 'Time' column to datetime
reviews['Time'] = pd.to_datetime(reviews['Time'])

# Extracting month, day of the week, and hour
reviews['Month'] = reviews['Time'].dt.month
reviews['DayOfWeek'] = reviews['Time'].dt.day_name()
reviews['Hour'] = reviews['Time'].dt.hour

# Seasonal Trend Analysis: Average Rating by Month
monthly_avg_rating = reviews.groupby('Month')['Rating'].mean()

# Weekly Trend Analysis: Average Rating by Day of the Week
weekly_avg_rating = reviews.groupby('DayOfWeek')['Rating'].mean().reindex([
    'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'
])

# Hourly Trend Analysis: Average Rating by Hour of the Day
hourly_avg_rating = reviews.groupby('Hour')['Rating'].mean()

# Plotting all three trends in a single figure with 3 rows and 1 column
fig, axs = plt.subplots(3, 1, figsize=(10, 10))

# Monthly Trend Plot
axs[0].plot(monthly_avg_rating, color='teal')
axs[0].set_title('Average Ratings by Month')
axs[0].set_xlabel('Month')
axs[0].set_ylabel('Average Rating')
# Month values in the index run from 1 to 12, so the tick positions must match them
axs[0].set_xticks(range(1, 13))
axs[0].set_xticklabels([dt.date(2000, m, 1).strftime('%B') for m in range(1, 13)])
axs[0].tick_params(axis='x', rotation=45)

# Weekly Trend Plot
axs[1].plot(weekly_avg_rating, color='purple')
axs[1].set_title('Average Ratings by Day of the Week')
axs[1].set_xlabel('Day of the Week')
axs[1].set_ylabel('Average Rating')
axs[1].set_xticks(range(7))
axs[1].set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
axs[1].tick_params(axis='x', rotation=45)

# Hourly Trend Plot
axs[2].plot(hourly_avg_rating, color='orange')
axs[2].set_title('Average Ratings by Hour of the Day')
axs[2].set_xlabel('Hour of the Day')
axs[2].set_ylabel('Average Rating')
axs[2].set_xticks(range(0, 24))
axs[2].grid(True)

plt.tight_layout()
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Average Ratings by Month:

* There is a noticeable peak in June, where the average rating is highest.
* The lowest points fall in April and September, suggesting a possible seasonal effect on ratings.

Average Ratings by Day of the Week:

* Ratings peak mid-week, particularly on Wednesday, and then decline towards the weekend.
* The lowest average ratings occur on Friday.

Average Ratings by Hour of the Day:

* There are peaks in the early hours (around 5 AM), mid-morning (around 9 AM), and evening (around 8 PM).
* There are noticeable dips in the early morning (around 7-8 AM) and the afternoon (around 2-4 PM).

Business Implications:

* Positive Impact: The insights could lead to targeted marketing during peak times, quality control measures when lower ratings are expected, and staffing adjustments to ensure service quality during critical hours or days.

Strategic Applications:

* Seasonal Adjustments: The variation in monthly ratings could suggest that restaurants should adjust their offerings or operations seasonally, perhaps offering summer specials or comfort food in colder months.
* Weekly Planning: Since ratings dip towards the end of the week, customers may have higher expectations, or restaurants may face operational challenges, on those days.
* Hourly Focus: The hourly variations might reflect changes in customer base or staff shifts. Special attention to service quality at known low points, while maintaining high standards during the peaks, could enhance overall ratings.
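
Group averages such as these can be misleading when a month, day, or hour has only a handful of reviews. As a hedged sketch (toy data, and a hypothetical `min_reviews` cut-off chosen only for illustration), reporting the review count next to each mean makes thin groups easy to spot before drawing conclusions.

```python
import pandas as pd

# Toy reviews standing in for the notebook's `reviews` frame (values invented for illustration)
toy = pd.DataFrame({
    "Time": pd.to_datetime([
        "2019-05-06 12:00",  # Monday
        "2019-05-06 20:30",  # Monday
        "2019-05-08 13:15",  # Wednesday
        "2019-05-10 21:00",  # Friday
    ]),
    "Rating": [4.0, 5.0, 3.0, 1.0],
})
toy["DayOfWeek"] = toy["Time"].dt.day_name()

# Mean rating and number of reviews behind it, per day of the week
summary = toy.groupby("DayOfWeek")["Rating"].agg(["mean", "count"])

# Hypothetical cut-off: keep only groups backed by at least this many reviews
min_reviews = 2
reliable = summary[summary["count"] >= min_reviews]
print(summary)
print(reliable)
```

The same pattern applies to the `Month` and `Hour` groupings used in the chart.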

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Uses merged_restaurant_data (reviews joined with restaurant cost) built in the Data Wrangling section

# Create 7 equal-width bins for the 'Cost' column
merged_restaurant_data['Cost_Bin'] = pd.cut(merged_restaurant_data['Cost'], bins=7)

# Set a custom color palette for the plot
colors = sns.color_palette("Set2")

# Create a box plot with seaborn for better styling
plt.figure(figsize=(12, 8))
sns.set(style="whitegrid")  # Set the style to whitegrid
sns.boxplot(x='Cost_Bin', y='Rating', data=merged_restaurant_data, palette=colors)
plt.xticks(rotation=45)
plt.title('Variation of Ratings with Cost')
plt.xlabel('Cost Bins')
plt.ylabel('Ratings')

# Add grid lines to the plot
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Customize the plot further as needed, e.g., adjusting fonts, colors, etc.

plt.tight_layout()  # Ensure the plot is well-fit within the figure
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Higher Cost, Higher Ratings: The highest cost bin, (2421, 2800], shows the highest median rating, which could suggest that more expensive restaurants tend to receive better ratings, possibly due to perceived quality, service, or experience.

Spread of Ratings: The box plots show the interquartile range (IQR), which represents the middle 50% of ratings for each cost bin. The IQR appears to be tighter for the lower and middle cost bins, indicating that ratings are more consistent in these categories. The highest cost bin has a larger IQR, suggesting more variability in how customers rate these more expensive restaurants.

Presence of Outliers: There are outliers in both the lower and higher cost bins, indicated by the dots outside the main "box" of the box plot. This indicates that there are some restaurants with ratings that are significantly lower than the typical range of ratings for their cost category.

Business Implications:

Positive Impact:

* For mid-range and lower-cost restaurants, consistent quality seems to lead to consistently good ratings, indicating a well-met customer expectation for these price points.
* The presence of outliers suggests that there is an opportunity for improvement or differentiation for restaurants that are underperforming within their cost category.

Negative Impact:

* Outliers, especially in the highest cost bin, suggest that high prices alone do not guarantee high ratings. Poor experiences at expensive restaurants may lead to significantly lower ratings due to higher customer expectations associated with higher costs.
* The larger IQR in the highest cost bin suggests a riskier investment; while high costs can correlate with high ratings, the variability is greater, indicating potential for lower-than-average ratings as well.
* Addressing the reasons behind outlier ratings, especially in the higher cost bins, could be key to improving overall customer satisfaction and business success.
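
The IQR and outlier observations above are read off the box plot by eye; they can also be computed directly. The sketch below is illustrative only: `iqr_outlier_summary` is a hypothetical helper, the toy groups stand in for `Cost_Bin`, and outliers are defined with the usual 1.5 × IQR whisker rule that box plots use by default.

```python
import pandas as pd

def iqr_outlier_summary(df, group_col, value_col):
    """Per group: median, IQR, and number of points outside the 1.5 * IQR whiskers."""
    rows = {}
    for name, values in df.groupby(group_col)[value_col]:
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        rows[name] = {
            "median": values.median(),
            "iqr": iqr,
            "n_outliers": int(((values < lower) | (values > upper)).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy ratings for two cost groups (values invented for illustration)
toy = pd.DataFrame({
    "Cost_Group": ["low"] * 5 + ["high"] * 5,
    "Rating": [4, 4, 4, 4, 1, 3, 4, 5, 5, 2],
})

box_stats = iqr_outlier_summary(toy, "Cost_Group", "Rating")
print(box_stats)
```

In the notebook, calling the helper with `merged_restaurant_data`, `'Cost_Bin'` and `'Rating'` would give the figures behind each box.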


## ***5. Sentiment Analysis***

In [None]:
def clean_string(input_string):
    # Strip @mentions, URLs, a leading "rt", digits and punctuation from the lowercased text
    pattern = r"(@[A-Za-z0-9]+)|(\w+://\S+)|(http\S+)|^rt|[^a-z \t]"
    return re.sub(pattern, "", input_string.lower())

reviews['Review Cleaned'] = reviews['Review'].apply(clean_string)

In [None]:
import pandas as pd
from textblob import TextBlob

# Function to get sentiment polarity
def get_sentiment(text):
    return TextBlob(text).sentiment.polarity

# Applying the function to the review column
reviews['Sentiment'] = reviews['Review Cleaned'].apply(get_sentiment)

# Classifying sentiments into positive, negative, and neutral
reviews['Sentiment_Type'] = reviews['Sentiment'].apply(lambda x: 'positive' if x > 0 else ('negative' if x < 0 else 'neutral'))

# Displaying the first few rows with the sentiment analysis
reviews[['Review Cleaned', 'Sentiment', 'Sentiment_Type']].head()

In [None]:
# Extract nouns from each review for further analysis
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

def extract_food_entities(review):
    doc = nlp(review)
    # Extract noun from text using spacy
    food_entities = [token.text for token in doc if token.pos_ == 'NOUN']
    return food_entities

# Apply the function to each review
reviews['Food_Entities'] = reviews['Review Cleaned'].apply(extract_food_entities)

# Display the first few rows with extracted food entities
reviews[['Review Cleaned', 'Food_Entities']].head()

In [None]:
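# Each row holds a list of nouns, and lists are unhashable, so explode to one noun per row before counting
reviews['Food_Entities'].explode().value_counts()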

In [None]:
# Calculating the cumulative sentiment for each entity
from collections import defaultdict

# Initialize dictionaries to hold the sum of values of sentiments for each food item
food_sentiment_counts = defaultdict(lambda: {'positive': 0, 'negative': 0, 'neutral': 0})

# Iterating through each review
for index, row in reviews.iterrows():
    sentiment = row['Sentiment_Type']
    sentiment_value = row['Sentiment']
    for food_item in row['Food_Entities']:
        # Get the cumulative sentiment value for this food item
        food_sentiment_counts[food_item][sentiment] += sentiment_value

# Now, food_sentiment_counts holds the cumulative sentiment value per sentiment type for each noun
food_sentiment_count_df = pd.DataFrame.from_dict(food_sentiment_counts, orient='index')
food_sentiment_count_df

In [None]:
# Visualizing the result
# Select the 20 entities with the highest cumulative positive sentiment (returned in descending order)
top_sentiments_df = food_sentiment_count_df.nlargest(20, 'positive')


# Plotting
fig, ax = plt.subplots(figsize=(10, 6))

# Bar chart of the cumulative positive sentiment per entity
top_sentiments_df[['positive']].plot(kind='bar', stacked=True, ax=ax, color='green')

# Line chart for negative values
ax2 = ax.twinx()
(top_sentiments_df['negative']*(-1)).plot(kind='line', ax=ax2, color='red', marker='o', linewidth=2, label='Negative')

# Labels and legend
ax.set_xlabel('Food')
ax.set_ylabel('Positive Sentiment Value')
ax2.set_ylabel('Negative Sentiment Value')
ax.set_title('Sentiment Value for Top 20 Entities Sorted by Positive Sentiment')
fig.legend(loc='upper right', bbox_to_anchor=(1.1, 1))
ax.set_xticklabels(top_sentiments_df.index, rotation=45)

plt.tight_layout()
plt.show()


Food is Key: The entities 'food' and 'place' have the highest cumulative positive sentiment by a significant margin.

Service Matters: 'Service' is another entity with a high positive sentiment value, showing it's an important factor for customers.

Negative Sentiments: Even for entities with high positive sentiment values, there are corresponding negative sentiments, as indicated by the line graph. This suggests that while an aspect might be generally well-received, there are still notable areas of dissatisfaction.

Consistent Positive Aspects: Entities such as 'place', 'service' and 'ambience' have strong positive sentiment values, suggesting that these are consistently positive aspects of the restaurants included in this analysis.

Specific Food Items: 'chicken' and 'veg' appear as separate entities with differing sentiment values, which could reflect the quality of, or preference for, these food items.

Quality Over Quantity: 'Quality' has a higher positive sentiment compared to 'quantity', suggesting that customers value the quality of their meal more than the amount.

Less Impactful Entities: Entities like 'music', 'buffet', 'menu', and 'quantity' have lower positive sentiment values, suggesting they are less impactful on overall customer satisfaction.

Business Impact:

Positive Impact:

The chart can help restaurateurs understand what aspects are most valued by customers, allowing them to focus on maintaining high standards in these areas.

Negative Impact:

Ignoring the entities with high negative sentiment could result in customer dissatisfaction and poor online reviews, which can be detrimental to the business.

In [None]:
# Selecting the 20 entities with the strongest (most negative) negative sentiment values
# (nsmallest returns them already sorted in ascending order)
top_sentiments_df = food_sentiment_count_df.nsmallest(20, 'negative')


# Plotting
fig, ax = plt.subplots(figsize=(10, 6))

# Stacked bar chart for positive counts and any other sentiment counts except negative
top_sentiments_df[['positive']].plot(kind='bar', stacked=True, ax=ax)

# Line chart for negative values
ax2 = ax.twinx()
(top_sentiments_df['negative']*(-1)).plot(kind='line', ax=ax2, color='red', marker='o', linewidth=2, label='Negative')

# Labels and legend
ax.set_xlabel('Food')
ax.set_ylabel('Positive Sentiment Value')
ax2.set_ylabel('Negative Sentiment Value')
ax.set_title('Sentiment Value for Top 20 Entities Sorted by Negative Sentiment')
fig.legend(loc='upper right', bbox_to_anchor=(1.1, 1))
ax.set_xticklabels(top_sentiments_df.index, rotation=45)
plt.tight_layout()
plt.show()

Based on the sentiment values provided for various entities, sorted by negative sentiment, here are the insights and their potential impact on business:

Food and Place as Major Drivers: Food and place have the highest positive sentiment values, suggesting that they are key drivers of satisfaction. However, they also have the highest negative sentiments, indicating that these aspects can greatly detract from the experience when they do not meet expectations.

Chicken as a High-Risk Item: While chicken has a relatively high positive sentiment, it has a disproportionately high negative sentiment compared to its positive score. This suggests that chicken dishes are crucial to get right, as they can significantly impact customer sentiment.

Consistency in Service: Service has a high positive sentiment, but also a notable negative sentiment. Consistent service quality could be a determining factor in overall customer satisfaction.

Experience and Taste: These are areas with substantial positive sentiment but also notable negative sentiment. This indicates that while good experiences and taste are praised, bad ones leave a strong negative impression on customers.

Quality Over Speed and Accuracy: Quality has a high positive sentiment and a relatively lower negative sentiment compared to order and time. This implies that customers value the quality of their food over the efficiency of service or order accuracy.

Value for Money: Money has a low positive sentiment and a negative sentiment, which suggests that the perception of value for money is a concern for customers.

Operational Aspects: The negative sentiments for 'order', 'delivery', and 'time' are lower than for 'food', 'chicken', and 'place', but are still significant. This could indicate that operational efficiency in order processing and delivery is an area for improvement.

Staff Interaction: Staff have higher positive sentiment and comparatively lower negative sentiment. This suggests that good staff interactions can greatly enhance the customer experience, but poor interactions have less impact on negative sentiment compared to food quality or place.

Business Impact:

Positive Impact:

The data can guide targeted improvements in areas that significantly affect customer sentiment, like food quality and ambiance.

Understanding that quality is more important than speed could lead to prioritizing cooking quality over rapid service, improving overall satisfaction.

Since value for money is a concern, restaurants could review pricing strategies to better align with customer expectations.

Negative Impact:

Failing to address areas with high negative sentiment, especially food and place, could lead to customer loss and negative reviews.

Overlooking the importance of the chicken dishes and experience could disproportionately impact the business negatively due to their high negative sentiments relative to their positive values.

Overall, focusing on areas with high positive sentiment can reinforce strengths, while addressing areas with high negative sentiment can mitigate risks. This balance can create a positive business impact by improving customer satisfaction and loyalty.
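
The "high negative sentiment relative to positive" comparison used above can be made explicit as a ratio per entity. The sketch below uses made-up numbers and a hypothetical `toy` DataFrame, and it assumes negative sentiment is stored as negative numbers (as the `*(-1)` in the plotting code suggests). It is not the notebook's actual data.

```python
import pandas as pd

# Hypothetical counts; the real values live in food_sentiment_count_df.
toy = pd.DataFrame(
    {"positive": [900.0, 300.0, 120.0], "negative": [-300.0, -150.0, -20.0]},
    index=["food", "chicken", "staff"],
)

# Assumption: negative sentiment is stored as negative numbers, so use the magnitude.
toy["neg_to_pos"] = toy["negative"].abs() / toy["positive"]

# Entity whose negative sentiment is largest relative to its positive sentiment.
riskiest = toy["neg_to_pos"].idxmax()
print(toy.sort_values("neg_to_pos", ascending=False))
print("Highest negative-to-positive ratio:", riskiest)
```

An entity can have the largest absolute negative value and still not have the largest ratio, which is the distinction drawn above between 'food' and 'chicken'.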

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# One-hot encoding the cuisine lists for further feature engineering
from sklearn.preprocessing import MultiLabelBinarizer

# Create a MultiLabelBinarizer
mlb = MultiLabelBinarizer()
features = mlb.fit_transform(restaurant_data.Cuisines_list)

# Create a DataFrame with the cuisine labels
features_df = pd.DataFrame(features, columns=mlb.classes_)

# Add restaurant name and cost to the features DataFrame
features_df['Cost'] = restaurant_data['Cost']
features_df['Name'] = restaurant_data['Name']

#features_df['avg_rating'] = restaurant_data_ratings['Rating']
features_df.set_index('Name', inplace=True)
features_df
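
To illustrate what `MultiLabelBinarizer` produces, here is a minimal sketch on three made-up restaurants. The names `toy_cuisines`, `toy_mlb` and `encoded` are hypothetical, and the example assumes each entry is already a list of cuisine names, as `Cuisines_list` appears to be.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical cuisine lists, one list per restaurant.
toy_cuisines = [["Chinese", "Biryani"], ["Italian"], ["Chinese", "Italian"]]

toy_mlb = MultiLabelBinarizer()
# One column per distinct cuisine, 1 where the restaurant offers it, 0 otherwise.
encoded = pd.DataFrame(toy_mlb.fit_transform(toy_cuisines), columns=toy_mlb.classes_)
print(encoded)
```

Each entry must be a list (or set) of labels. If an entry were a plain string such as `"Chinese, Italian"`, the binarizer would iterate over its characters and create one column per character.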

In [None]:
# Average rating per restaurant; the assignment aligns on the restaurant name index
features_df['Avg_Ratings'] = reviews.groupby('Restaurant')['Rating'].mean()

# Fill missing ratings and cost values with the column mean
features_df['Avg_Ratings'] = features_df['Avg_Ratings'].fillna(features_df['Avg_Ratings'].mean())
features_df['Cost'] = features_df['Cost'].fillna(features_df['Cost'].mean())
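
The cell above relies on pandas index alignment: a restaurant with no reviews gets `NaN` for its average rating, which is then replaced by the column mean. A small sketch with hypothetical data (`toy_features`, `toy_reviews`) shows both steps:

```python
import pandas as pd

# Hypothetical restaurants A, B, C; B has no cost and no reviews.
toy_features = pd.DataFrame({"Cost": [500.0, None, 1200.0]}, index=["A", "B", "C"])
toy_reviews = pd.DataFrame(
    {"Restaurant": ["A", "A", "C"], "Rating": [4.0, 5.0, 3.0]}
)

# Assignment aligns on the index, so B (absent from the reviews) becomes NaN.
toy_features["Avg_Ratings"] = toy_reviews.groupby("Restaurant")["Rating"].mean()

# Fill each column's missing values with that column's mean.
toy_features = toy_features.fillna(toy_features.mean())
print(toy_features)
```

Mean imputation keeps every restaurant in the clustering input, at the cost of placing unreviewed restaurants exactly at the average rating.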

In [None]:
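# Select only the columns whose total exceeds 7, i.e. cuisines offered by more than 7 restaurants
# (the Cost and Avg_Ratings column sums also exceed 7, so both are kept)
selected_features = features_df.columns[features_df.sum(axis=0)>7].tolist()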
selected_features
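
The column-sum filter applies to every column, so numeric columns such as `Cost` pass only because their totals happen to be large. A possible alternative, sketched on hypothetical data with a hypothetical `min_count`, filters the cuisine columns on their own and appends the numeric columns explicitly:

```python
import pandas as pd

# Hypothetical one-hot cuisine columns plus a numeric Cost column.
toy = pd.DataFrame(
    {"Chinese": [1, 1, 1], "Thai": [0, 1, 0], "Cost": [500, 800, 300]}
)

min_count = 2                      # hypothetical threshold (the notebook uses 7)
cuisine_cols = ["Chinese", "Thai"]  # columns that came from the binarizer

# Keep cuisines offered by more than min_count restaurants.
frequent_cuisines = [c for c in cuisine_cols if toy[c].sum() > min_count]

# Numeric features are added deliberately rather than through the sum filter.
selected = frequent_cuisines + ["Cost"]
print(selected)
```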

## ***7. ML Model Implementation***

### ML Model - 1  Implementing K-Means

In [None]:
selected_features

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardize the selected features so that Cost does not dominate the distance computation
X = StandardScaler().fit_transform(features_df[selected_features])

# Define a range of cluster numbers to try
cluster_range = range(2, 15)  # You can adjust this range as needed

# Initialize lists to store the inertia (within-cluster sum of squares) and silhouette scores
inertia_values = []
silhouette_scores = []

# Perform K-means clustering for each cluster number in the range
for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X)

    # Calculate the inertia and silhouette score for this cluster number
    inertia_values.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))

# Plot the elbow curve to find the optimal number of clusters
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(cluster_range, inertia_values, marker='o')
plt.title('Elbow Curve')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')

# Plot the silhouette score to evaluate cluster quality
plt.subplot(1, 2, 2)
plt.plot(cluster_range, silhouette_scores, marker='o')
plt.title('Silhouette Score')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')

plt.tight_layout()
plt.show()

# Pick the cluster count with the highest silhouette score in the tested range
optimal_clusters = cluster_range[np.argmax(silhouette_scores)]
print(f"Optimal number of clusters: {optimal_clusters}")

Optimal number of clusters: 14
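
In the run above the maximum falls at 14, the largest value tried, so the silhouette curve may still be rising and a wider range could be worth checking. The sketch below shows the same argmax selection on synthetic data with three known groups, plus a simple edge check. All names (`X_toy`, `k_values`, `best_at_edge`) and the data are hypothetical and unrelated to the restaurant features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the restaurant data).
centers = [(-10, -10), (0, 10), (10, -10)]
X_toy, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5, random_state=0)

k_values = list(range(2, 7))
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores.append(silhouette_score(X_toy, labels))

# Cluster count with the highest silhouette score.
best_k = k_values[int(np.argmax(scores))]

# True when the maximum sits at the upper edge of the searched range.
best_at_edge = best_k == k_values[-1]
print(best_k, best_at_edge)
```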

In [None]:
# Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch

dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Restaurants')
plt.ylabel('Euclidean Distances')
plt.axhline(y=11.5, color='r', linestyle='--')
plt.show() # find largest vertical distance we can make without crossing any other horizontal line
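
The number of clusters below a horizontal cut can also be counted programmatically instead of being read off the figure. This is a minimal sketch using SciPy's `fcluster` on made-up points. `toy_points` and `cut_height` are hypothetical, and the cut height here is unrelated to the `y=11.5` line drawn above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three tight, well-separated groups of hypothetical 2-D points.
toy_points = np.array(
    [
        [0, 0], [0, 1], [1, 0],
        [10, 10], [10, 11], [11, 10],
        [20, 0], [20, 1], [21, 0],
    ],
    dtype=float,
)

# Same linkage method as the dendrogram cell.
Z = linkage(toy_points, method="ward")

# Cut the tree at a chosen height and count the resulting flat clusters.
cut_height = 5.0  # hypothetical value for this toy data
labels = fcluster(Z, t=cut_height, criterion="distance")
n_clusters_at_cut = len(np.unique(labels))
print(n_clusters_at_cut)
```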

In [None]:
# Fitting hierarchical clustering to the restaurant features
from sklearn.cluster import AgglomerativeClustering

# Hierarchical clustering on the standardized feature matrix X
hc = AgglomerativeClustering(n_clusters=7, metric='euclidean', linkage='ward')
features_df['Cluster'] = hc.fit_predict(X)


In [None]:
results_df = features_df[selected_features+['Cluster']]
results_df

In [None]:
results_df_grouped = results_df.groupby('Cluster').sum()
results_df_grouped[['Cost','Avg_Ratings']] = results_df[['Cost','Avg_Ratings','Cluster']].groupby('Cluster').mean()
results_df_grouped
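
The sum-then-overwrite pattern above can also be written as a single aggregation that names the function for each column. This is a sketch on hypothetical data (`toy`, `agg_spec`), not a replacement the notebook requires:

```python
import pandas as pd

# Hypothetical per-restaurant rows with a cluster label.
toy = pd.DataFrame(
    {
        "Cluster": [0, 0, 1],
        "Chinese": [1, 0, 1],
        "Cost": [400.0, 600.0, 1000.0],
        "Avg_Ratings": [3.0, 4.0, 4.5],
    }
)

# Count cuisines per cluster, but average the cost and rating.
agg_spec = {"Chinese": "sum", "Cost": "mean", "Avg_Ratings": "mean"}
summary = toy.groupby("Cluster").agg(agg_spec)
print(summary)
```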

In [None]:
# Preparing the data for heatmap
cuisine_data = results_df_grouped.drop(['Cost', 'Avg_Ratings'], axis=1)
cost_ratings_data = results_df_grouped[['Cost', 'Avg_Ratings']]

# Plotting
fig, ax = plt.subplots(2, 1, figsize=(12, 10))

# Heatmap for cuisine distribution
sns.heatmap(cuisine_data.T, cmap="YlGnBu", ax=ax[0])  # Transposing for better layout
ax[0].set_title('Heatmap of Cuisine Distribution Across Clusters')

# Grouped bar chart for cost and ratings
cost_ratings_data.plot(kind='bar', ax=ax[1], secondary_y='Avg_Ratings')
ax[1].set_title('Average Cost and Ratings by Cluster')
ax[1].set_ylabel('Cost')
ax[1].right_ax.set_ylabel('Avg Ratings')

plt.tight_layout()
plt.show()

In [None]:
#@title Scatterplot of Cost vs Average Ratings with Top 3 Cuisines Labeled
df = results_df_grouped.reset_index()

# Plotting the scatterplot
plt.figure(figsize=(10, 6))
for _, row in df.iterrows():
    plt.scatter(row['Cost'], row['Avg_Ratings'], s=200, label=f"{int(row['Cluster'])}")
    top_cuisines = row.drop(['Cluster', 'Cost', 'Avg_Ratings']).nlargest(3)
    label_offset = 0  # stacks the cuisine labels vertically next to the point
    for cuisine, value in top_cuisines.items():
        if value > 0:
            plt.annotate(cuisine, (row['Cost'], row['Avg_Ratings']), textcoords="offset points", xytext=(30, -20 - label_offset*15), ha='center')
            label_offset += 1
plt.xlabel('Cost')
plt.ylabel('Average Ratings')
plt.title('Scatterplot of Cost vs Average Ratings with Top 3 Cuisines Labeled')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
#@title Ratings to Average cost Ratio
ax = (cost_ratings_data.Avg_Ratings/cost_ratings_data.Cost).sort_values(ascending=False).plot(kind='bar', figsize=(10, 6))
# Adding labels and title
plt.title('Ratings to Average Cost Ratio by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Ratio')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show plot
plt.tight_layout()
plt.show()

#### 1. Explain the ML Model used

Explanation of the ML Model and its Performance:

K-Means, an unsupervised learning algorithm that groups data points into K clusters based on similarity, was used to explore how many clusters the data supports. The final cluster labels were then assigned with agglomerative hierarchical clustering (Ward linkage, 7 clusters) after inspecting the dendrogram.

The K-Means runs were evaluated using two metrics: inertia and silhouette score.

Inertia is the sum of squared distances from each sample to its closest cluster center. Lower inertia values indicate tighter clusters.

Silhouette score measures how similar a sample is to its own cluster (cohesion) compared to the nearest other cluster (separation). It ranges from -1 to 1, and higher scores indicate better-defined clusters.

Based on the results:

Inertia decreases as the number of clusters (K) increases, which is expected. However, the rate of decrease slows down as K increases.

Silhouette scores also increase with the number of clusters over the tested range and are highest at K = 14, the upper end of that range.

By analyzing both metrics, we can determine a number of clusters that balances tight clustering (low inertia) against well-separated clusters (high silhouette score).
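
To make the two metrics concrete, the sketch below recomputes inertia by hand on four made-up points and compares it with the value scikit-learn reports. `pts`, `km` and `manual_inertia` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four hypothetical points forming two obvious pairs.
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(pts)

# Inertia: sum of squared distances from each point to its own cluster center.
manual_inertia = sum(
    np.sum((pts[km.labels_ == k] - km.cluster_centers_[k]) ** 2) for k in range(2)
)

# Silhouette: mean over points of (b - a) / max(a, b), where a is the mean distance
# to the point's own cluster and b the mean distance to the nearest other cluster.
sil = silhouette_score(pts, km.labels_)

print(manual_inertia, km.inertia_, round(sil, 3))
```

Each point lies 1 unit from its cluster center, so the inertia is 4 × 1² = 4, matching `km.inertia_`.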

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For a positive business impact, the evaluation metrics of interest may include:

Silhouette Score: This metric indicates how well-separated the clusters are. A higher silhouette score suggests that the clusters are well-defined and distinct, which could lead to better segmentation of restaurants and potentially more targeted marketing strategies.


Inertia: While inertia is not a direct measure of cluster quality, lower inertia values indicate tighter clusters. Tighter clusters imply that restaurants within the same cluster are more similar to each other in terms of cuisines and other features, which could lead to more accurate recommendations or targeted promotions for customers.

# **Conclusion**

Cluster 1 has a moderate cost and ratings, and a high ratio of ratings to cost. This cluster has a variety of cuisines, such as desserts, continental, biryani, and Chinese, which may indicate that these restaurants offer a good balance of quality, price, and diversity. This cluster may attract customers who are looking for a satisfying and reasonable dining experience.

Cluster 2 has the highest average cost and ratings, indicating that it consists of high-end restaurants that offer premium quality and service. The cuisine of this cluster is mainly Italian and Asian, suggesting that these are popular and profitable choices for upscale dining.

Cluster 5 has the second highest average ratings, but a lower cost than cluster 2. This cluster has a large proportion of Asian restaurants, indicating that this cuisine is well liked by customers and affordable. This cluster also has some continental and Chinese restaurants, which may appeal to a diverse range of tastes and preferences.

Cluster 4 has the lowest average ratings, but a moderate cost. This cluster is dominated by biryani and Chinese restaurants, which may indicate that these cuisines are oversaturated or underperforming in the market. The low ratings may also reflect quality, service, or hygiene issues at these restaurants.

Cluster 6 has the lowest average cost, but also low ratings. This cluster consists mostly of fast food restaurants, which may cater to the budget-conscious or time-pressed customers. However, the low ratings may suggest that these restaurants do not offer much value or satisfaction to the customers.

Clusters 0 and 3 have similar costs and ratings, but different cuisines. Cluster 0 has mostly north Indian and Chinese restaurants, while cluster 3 has mostly south Indian and continental restaurants. These clusters may reflect the regional and cultural preferences of the customers, as well as the availability and competition of these cuisines in the market.

**Summary of Business Case Solutions Based on Insights:**

**Optimizing Food and Place Experiences:**

Invest in improving the quality and ambiance of restaurants, focusing on addressing the high negative sentiment associated with food and place.

**Strategic Approach to Chicken Dishes:**

Implement a detailed review and improvement strategy specifically for chicken dishes, given their high impact on both positive and negative sentiments.

**Prioritizing Consistent Service Quality:**

Develop and implement training programs for staff to ensure consistent service quality, considering its influence on overall customer satisfaction.

**Balancing Experience and Taste:**

Conduct regular quality checks to maintain positive sentiments related to experience and taste, mitigating the negative impact of occasional lapses.

**Emphasizing Quality Over Speed and Accuracy:**

Adjust operational priorities to prioritize food quality over speed and accuracy, aligning with customer preferences.

**Addressing Value for Money Concerns:**

Review pricing strategies to align with customer expectations and communicate value-added services to address the negative sentiment associated with money.

**Strategic Approach to Cluster Insights:**

Provide insights to restaurants based on their cluster categorization, enabling them to tailor strategies for improvement or expansion.

**Encouraging Diversity in Cuisine Offerings:**

Encourage restaurants to diversify their cuisine offerings based on cluster insights to meet varied customer preferences.

These specific business case solutions derived from the insights aim to guide Zomato and its partner restaurants in addressing key areas of improvement, optimizing customer satisfaction, and fostering sustainable business growth.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***