<a href="https://colab.research.google.com/github/blgayatri/DS_Projects/blob/main/Zomato_ML_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Zomato Restaurant Clustering and Sentiment Analysis



##### **Project Type**    - Unsupervised Machine Learning
##### **Contribution**    - Individual
##### **Team Member -**   Lakshmi gayatri Balivada

# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This project addresses key challenges in understanding India's dynamic restaurant industry by leveraging Zomato data. The core objective is to extract actionable insights for both consumers and the company.

Specifically, this project aims to:

1. Segment Zomato restaurants into distinct clusters based on attributes like cuisine, cost, and location, to uncover inherent market structures and restaurant archetypes.

2. Analyze customer sentiment from user reviews to gauge overall public perception, pinpointing areas of satisfaction and identifying potential pain points.

3. Identify influential critics within the industry by examining reviewer metadata and their associated sentiment patterns, providing insights into key opinion leaders.

4. Derive useful conclusions through visualizations that will:

  * Empower customers to efficiently discover restaurants best suited to their preferences within their locality.

  * Assist Zomato in identifying strategic growth areas, optimizing service offerings, and understanding regional market nuances.

Ultimately, this project seeks to transform raw restaurant and review data into strategic intelligence for navigating and improving the Indian food service landscape.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Install/Upgrade scikit-learn
!pip install scikit-learn --upgrade

In [None]:
# Install pyLDAvis
!pip install pyLDAvis

In [None]:
# Install contractions
!pip install contractions

In [None]:
# Install gensim
!pip install gensim

In [None]:
# Install shap
!pip install shap

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.cm as cm
import seaborn as sns
import math
import time
from wordcloud import WordCloud
from scipy.stats import norm
from scipy import stats
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, PrecisionRecallDisplay
from sklearn.metrics import precision_score,recall_score,f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler
#importing kmeans
from sklearn.cluster import KMeans
#importing random forest and XgB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
#Non-negative matrix Factorization
from sklearn.decomposition import NMF
from sklearn.naive_bayes import MultinomialNB
#principal component analysis
from sklearn.decomposition import PCA
#silhouette score
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid
#importing stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
#for tokenization
from nltk.tokenize import word_tokenize
# for POS tagging(Part of speech in NLP sentiment analysis)
nltk.download('averaged_perceptron_tagger')
#import stemmer
from nltk.stem.snowball import SnowballStemmer
#import tfidf
from sklearn.feature_extraction.text import TfidfVectorizer
#LDA
# import pyLDAvis.sklearn # COMMENTED OUT
# from sklearn.decomposition import LatentDirichletAllocation # COMMENTED OUT
#importing contraction
import contractions
import gensim
from gensim import corpora
#importing shap for model explainability
# import shap # COMMENTED OUT
#download small spacy model
# !python -m spacy download en_core_web_sm # Uncomment if you plan to use spacy
# import spacy
# The following lines adjust the granularity of reporting.
pd.options.display.float_format = "{:.2f}".format
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

### Dataset Loading

In [None]:
# mounting google drive
from google.colab import drive
drive.mount('/content/drive')

# Load Dataset
hotel_df = pd.read_csv('/content/drive/MyDrive/Data Science/Projects/Zomato/Zomato Restaurant names and Metadata_mine.csv')
review_df = pd.read_csv('/content/drive/MyDrive/Data Science/Projects/Zomato/Zomato Restaurant reviews_mine.csv')

### Dataset First View

In [None]:
# Dataset First Look
hotel_df.head()

In [None]:
# Dataset First Look (Review Data)
review_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'Total observation and feature for restaurant: {hotel_df.shape}')
print(f'Total observation and feature for review: {review_df.shape}')

### Dataset Information

In [None]:
# Dataset Info
print('--- Restaurant Info ---')
hotel_df.info()
print('\n' + '='*120 + '\n')
# Dataset Info for review data
print('--- Review Info ---')
review_df.info()

#### Duplicates, Missing Values/Null Values

In [None]:
# Dataset Duplicate Value Count
print('For Restaurant')
print(f"Duplicate rows: {hotel_df.duplicated().sum()} of {len(hotel_df)}")
print('\n' + '='*120 + '\n')
print('For Reviews')
print(f"Duplicate rows: {review_df.duplicated().sum()} of {len(review_df)}")


In [None]:
#getting duplicate values
print(f' Duplicate data count = {review_df[review_df.duplicated()].shape[0]}')
review_df[review_df.duplicated()]

In [None]:
#checking values for American Wild Wings
review_df[(review_df['Restaurant'] == 'American Wild Wings')].shape

In [None]:
#checking values for Arena Eleven
review_df[(review_df['Restaurant'] == 'Arena Eleven')].shape

In [None]:
# Missing Values/Null Values Count
print('--- Missing Values in Restaurant Data ---')
print(hotel_df.isnull().sum())

In [None]:
# Missing Values/Null Values Count (Review Data)
print('\n--- Missing Values in Review Data ---')
print(review_df.isnull().sum())

In [None]:
# Visualizing the missing values
print('\nVisualizing Missing Values in Restaurant Data:')
sns.heatmap(hotel_df.isnull(), cbar=False);
plt.title('Missing Values in hotel_df')
plt.show()

In [None]:
# Visualizing missing values for review data with a Heatmap
print('\nVisualizing Missing Values in Review Data:')
sns.heatmap(review_df.isnull(), cbar=False);
plt.title('Missing Values in review_df')
plt.show()

### What did you know about your dataset?

**Answer:**
Based on the initial hotel_df.info(), review_df.info(), .isnull().sum(), and .duplicated() checks, here's a summary of insights into the Zomato datasets:

Restaurant DataSet (hotel_df)
* There are 105 total observations (rows) with 6 features (columns).

* Missing Values:

  * The Collections feature has a significant number of 54 null values, which is over half of the dataset. This will require careful handling (e.g., imputation or dropping).

  * The Timings feature has 1 null value.

* No Duplicate Values: This dataset contains no duplicate rows, meaning each of the 105 entries represents a unique restaurant.

* Data Type Issues:

  * The Cost feature is currently an object data type because its values include commas (e.g., '1,300'). It needs to be converted to a numerical type (e.g., float or int) for any mathematical operations or analysis.

  * The Timings feature is also an object data type, as it contains text-based operational hours.

Review DataSet (review_df)
* There are 10,000 total observations (rows) and 7 features (columns).

* Missing Values:

  * Most features have missing values: Reviewer, Review, Rating, Metadata, and Time.

  * Only Pictures and Restaurant features are completely free of nulls.

* Duplicate Values:

  * There are a total of 36 duplicate rows in this dataset.

  * These duplicates are specifically associated with two restaurants: 'American Wild Wings' and 'Arena Eleven'.

  * A significant portion of these duplicate rows consist predominantly of null values, suggesting potential data entry errors or incomplete records rather than legitimate identical reviews.

* Data Type Issues:

  * The Rating feature is currently an object data type (e.g., 'Like', '4.5') but represents ordinal (or numerical) data. It should be converted to a float (numerical) data type to allow for quantitative analysis of ratings.

  * The Time feature, indicating when a review was posted, is an object data type and needs to be converted to a datetime object for any time-based analysis, filtering, or sorting.
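The null, duplicate, and dtype checks summarized above can be gathered into one table per DataFrame. The following is a minimal sketch, assuming a hypothetical helper named `summarize_quality` and a small made-up frame standing in for hotel_df; it is not part of the original notebook.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, null count, and null percentage for every column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(2),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("null_count", ascending=False)


# Toy frame standing in for hotel_df (all values are made up for illustration).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "B"],
    "Cost": ["800", "1,300", "500", "1,300"],
    "Collections": [None, "Great Buffets", None, "Great Buffets"],
})

quality = summarize_quality(toy)
duplicate_rows = int(toy.duplicated().sum())  # fully identical rows after the first

print(quality)
print(f"Duplicate rows: {duplicate_rows}")
```

Calling the same helper on both hotel_df and review_df would give directly comparable summaries.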

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(f'Features : {hotel_df.columns.to_list()}')

In [None]:
# Dataset Columns review
print(f'Features : {review_df.columns.to_list()}')

In [None]:
# Dataset Describe
hotel_df.describe().T

In [None]:
# Dataset Describe review
review_df.describe(include='all').T

### Variables Description

*Answer:*

Attributes:

Zomato Restaurant (hotel_df)

* Name: Name of Restaurants

* Links: URL Links of Restaurants

* Cost: Per person estimated Cost of dining

* Collections: Tagging of Restaurants w.r.t. Zomato categories

* Cuisines: Cuisines served by Restaurants

* Timings: Restaurant Timings

Zomato Restaurant Reviews (review_df)

* Restaurant: Name of the Restaurant

* Reviewer: Name of the Reviewer

* Review: Review Text

* Rating: Rating Provided by Reviewer

* Metadata: Reviewer Metadata - No. of Reviews and followers

* Time: Date and Time of Review

* Pictures: No. of pictures posted with review

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in hotel_df.columns.tolist():
  print("No. of unique values in ",i,"is",hotel_df[i].nunique(),".")

In [None]:
# Check Unique Values for each variable for reviews
for i in review_df.columns.tolist():
  print("No. of unique values in ",i,"is",review_df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#creating copy of both the data
hotel = hotel_df.copy()
review = review_df.copy()

Restaurant

In [None]:
#before changing data type for cost checking values
hotel['Cost'].unique()

In [None]:
# Write your code to make your dataset analysis ready.
# changing the data type of the Cost column: strip the thousands separator, then cast to integer
hotel['Cost'] = hotel['Cost'].str.replace(",","").astype('int64')

In [None]:
#top 5 costlier restaurant
hotel.sort_values('Cost', ascending = False)[['Name','Cost']][:5]

In [None]:
#top 5 economy restaurant
hotel.sort_values('Cost', ascending = False)[['Name','Cost']][-5:]

In [None]:
#hotels that share same price
hotel_dict = {}
amount = hotel.Cost.values.tolist()

#adding hotel name based on the price by converting it into list
for price in amount:
    # Get all the rows that have the current price
    rows = hotel[hotel['Cost'] == price]
    hotel_dict[price] = rows["Name"].tolist()

#converting it into dataframe
same_price_hotel_df=pd.DataFrame.from_dict([hotel_dict]).transpose().reset_index().rename(
    columns={'index':'Cost',0:'Name of Restaurants'})

#alternate method to do the same
#same_price_hotel_df = hotel.groupby('Cost')['Name'].apply(lambda x: x.tolist()).reset_index()

#getting hotel count
hotel_count = hotel.groupby('Cost')['Name'].count().reset_index().sort_values(
    'Cost', ascending = False)

#merging together
same_price_hotel_df = same_price_hotel_df.merge(hotel_count, how = 'inner',
                        on = 'Cost').rename(columns = {'Name':'Total_Restaurant'})

#max hotels that share same price
same_price_hotel_df.sort_values('Total_Restaurant', ascending = False)[:5]


In [None]:
#hotels which has max price
same_price_hotel_df.sort_values('Cost', ascending = False)[:5]

In [None]:
# splitting the cuisines and storing them in a list
cuisine_value_list = hotel.Cuisines.str.split(', ')

In [None]:
# storing all the cuisines in a dict
cuisine_dict = {}
for cuisine_names in cuisine_value_list:
    for cuisine in cuisine_names:
        if (cuisine in cuisine_dict):
            cuisine_dict[cuisine]+=1
        else:
            cuisine_dict[cuisine]=1

In [None]:
# converting the dict to a data frame
cuisine_df=pd.DataFrame.from_dict([cuisine_dict]).transpose().reset_index().rename(
    columns={'index':'Cuisine',0:'Number of Restaurants'})

In [None]:
#top 5 cuisine
cuisine_df.sort_values('Number of Restaurants', ascending =False)[:5]

In [None]:
# splitting the collections and storing them in a list
Collections_value_list = hotel.Collections.dropna().str.split(', ')

In [None]:
# storing all the collections in a dict
Collections_dict = {}
for collection in Collections_value_list:
    for col_name in collection:
        if (col_name in Collections_dict):
            Collections_dict[col_name]+=1
        else:
            Collections_dict[col_name]=1

In [None]:
# converting the dict to a data frame
Collections_df=pd.DataFrame.from_dict([Collections_dict]).transpose().reset_index().rename(
    columns={'index':'Tags',0:'Number of Restaurants'})

In [None]:
#top 5 collection
Collections_df.sort_values('Number of Restaurants', ascending =False)[:5]

Reviews

In [None]:
#in order to change data type for rating checking values
review.Rating.value_counts()

In [None]:
# Rating is stored as strings, and one entry is 'Like' instead of a number.
# Replace it with a placeholder 0 (no real rating is 0), restricting the assignment
# to the Rating column so the rest of that row is left untouched.
review.loc[review['Rating'] == 'Like', 'Rating'] = 0
#changing data type for rating in review data
review['Rating'] = review['Rating'].astype('float')

In [None]:
#replacing the placeholder 0 with the median rating (Rating column only)
review.loc[review['Rating'] == 0, 'Rating'] = review['Rating'].median()

In [None]:
# Changing date and extracting few features for manipulation

# Step 1: Split 'Metadata' by comma to separate 'Reviews' and 'Followers' parts
# This will result in a Series of lists, e.g., ['50 Reviews', ' 100 Followers']
metadata_split = review['Metadata'].str.split(',')

# Step 2: Access the first element (Total Reviews part) and second element (Followers part)
# Then, extract the numeric part and convert to numeric.
review['Reviewer_Total_Review'] = pd.to_numeric(metadata_split.str[0].str.split(' ').str[0])
review['Reviewer_Followers'] = pd.to_numeric(metadata_split.str[1].str.split(' ').str[1])

# Step 3: Convert 'Time' to datetime and extract year, month, hour
review['Time'] = pd.to_datetime(review['Time'], errors='coerce')  # unparseable values become NaT

# Drop rows without a valid timestamp, since the time-based features below need one.
review.dropna(subset=['Time'], inplace=True)

# Alternative: keep all rows by filling NaT with a placeholder date.
# review['Time'] = review['Time'].fillna(pd.Timestamp('2000-01-01 00:00:00'))


review['Review_Year'] = pd.DatetimeIndex(review['Time']).year
review['Review_Month'] = pd.DatetimeIndex(review['Time']).month
review['Review_Hour'] = pd.DatetimeIndex(review['Time']).hour

In [None]:
#Average engagement of restaurants
avg_hotel_rating = review.groupby('Restaurant').agg({'Rating':'mean',
        'Reviewer': 'count'}).reset_index().rename(columns = {'Reviewer': 'Total_Review'})
avg_hotel_rating

In [None]:
#sanity check: no row should have its Restaurant value overwritten by a number
review[review['Restaurant'] == 4.0]

In [None]:
#checking hotel count as total hotel in restaurant data was 105
review.Restaurant.nunique()

In [None]:
#finding hotel without review
hotel_without_review = [name for name in hotel.Name.unique().tolist()
       if name not in review.Restaurant.unique().tolist()]
hotel_without_review

In [None]:
#top 5 most engaging or rated restaurant
avg_hotel_rating.sort_values('Rating', ascending = False)[:5]

In [None]:
#top 5 lowest rated restaurant
avg_hotel_rating.sort_values('Rating', ascending = True)[:5]

In [None]:
#Finding the most followed critic
most_followed_reviewer = review.groupby('Reviewer').agg({'Reviewer_Total_Review':'max',
      'Reviewer_Followers':'max', 'Rating':'mean'}).reset_index().rename(columns = {
          'Rating':'Average_Rating_Given'}).sort_values('Reviewer_Followers', ascending = False)
most_followed_reviewer[:5]

In [None]:
#finding which year show maximum engagement
hotel_year = review.groupby('Review_Year')['Restaurant'].apply(lambda x: x.tolist()).reset_index()
hotel_year['Count']= hotel_year['Restaurant'].apply(lambda x: len(x))
hotel_year

In [None]:
#merging both data frame
hotel = hotel.rename(columns = {'Name':'Restaurant'})
merged = hotel.merge(review, on = 'Restaurant')
merged.shape

In [None]:
#Price point of restaurants
price_point = merged.groupby('Restaurant').agg({'Rating':'mean',
        'Cost': 'mean'}).reset_index().rename(columns = {'Cost': 'Price_Point'})

In [None]:
#price point for high rated restaurants
price_point.sort_values('Rating',ascending = False)[:5]

In [None]:
#price point for lowest rated restaurants
price_point.sort_values('Rating',ascending = True)[:5]

In [None]:
#rating count by reviewer
rating_count_df = pd.DataFrame(review.groupby('Reviewer').size(), columns=[
                                                                "Rating_Count"])
rating_count_df.sort_values('Rating_Count', ascending = False)[:5]

### What all manipulations have you done and insights you found?

1. Initial Data Copying
* hotel = hotel_df.copy() and review = review_df.copy(): Created working copies of the original hotel_df and review_df to avoid modifying the originals.

2. Cost Column Cleaning & Analysis (on hotel DataFrame)
* Removed Commas and Converted to Integer: hotel['Cost'] = hotel['Cost'].str.replace(",","").astype('int64'). This crucial step converted the 'Cost' column from a string (object) type, where costs were represented with commas (e.g., '1,300'), into a numeric (integer) type, enabling calculations.

* Identified Top 5 Costliest & Economy Restaurants: Sorted the hotel DataFrame by 'Cost' to list the most expensive and most economical restaurants.

* Grouped Hotels by Same Price Point:

  * Created hotel_dict to store restaurant names grouped by their cost.

  * Converted hotel_dict into same_price_hotel_df DataFrame.

  * Counted total restaurants per cost using hotel.groupby('Cost')['Name'].count().

  * Merged this count back into same_price_hotel_df to show Total_Restaurant at each price point.

* Identified Price Points with Max Hotels: Sorted same_price_hotel_df to find the costs shared by the highest number of restaurants.

* Identified Hotels with Max Price: Sorted same_price_hotel_df to find the highest price points and the restaurants at those prices.
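The cost cleaning and price-point grouping described in this step can also be done without the intermediate dictionary. This is a minimal sketch on made-up data; `toy_hotel` and the `Restaurants` column name are illustrative assumptions, not names from the notebook.

```python
import pandas as pd

# Toy data standing in for the hotel DataFrame (names and costs are made up).
toy_hotel = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Cost": ["1,300", "800", "800", "2,800"],
})

# Strip the thousands separator, then convert to integers.
toy_hotel["Cost"] = (
    toy_hotel["Cost"].str.replace(",", "", regex=False).astype("int64")
)

# Collect the restaurant names and the restaurant count for each price point.
grouped = toy_hotel.groupby("Cost")["Name"]
same_price = pd.DataFrame({
    "Restaurants": grouped.apply(list),
    "Total_Restaurant": grouped.size(),
}).reset_index()

# Price points shared by the most restaurants come first.
same_price = same_price.sort_values("Total_Restaurant", ascending=False)
print(same_price)
```

Both columns come from the same groupby, so no separate merge step is needed.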

3. Cuisines Analysis (on hotel DataFrame)
* Split Cuisines: hotel.Cuisines.str.split(', ') separated multiple cuisines listed in a single string into individual cuisine entries.

* Counted Cuisines: Iterated through the split list to populate cuisine_dict with counts of each unique cuisine.

* Created cuisine_df: Converted cuisine_dict into a DataFrame to easily view cuisine counts.

* Identified Top 5 Cuisines: Sorted cuisine_df to find the most popular cuisines by the number of restaurants offering them.
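As a compact alternative to the dictionary loop, the split-and-count step can be sketched with `explode` and `value_counts`. The frame below is made up for illustration, and the same pattern would apply to the Collections column after `dropna()`.

```python
import pandas as pd

# Toy data standing in for the hotel DataFrame (values are made up).
toy_hotel = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Cuisines": [
        "North Indian, Chinese",
        "Chinese",
        "Italian, Chinese, North Indian",
    ],
})

cuisine_counts = (
    toy_hotel["Cuisines"]
    .str.split(",")      # one list of cuisines per restaurant
    .explode()           # one row per (restaurant, cuisine) pair
    .str.strip()         # remove the space left after each comma
    .value_counts()      # number of restaurants per cuisine, descending
    .rename_axis("Cuisine")
    .reset_index(name="Number of Restaurants")
)
print(cuisine_counts.head(5))
```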

4. Collections Analysis (on hotel DataFrame)
* Split Collections: hotel.Collections.dropna().str.split(', ') similar to cuisines, split multiple collections. dropna() was used here before splitting.

* Counted Collections: Iterated through split collections to populate Collections_dict with counts.

* Created Collections_df: Converted Collections_dict into a DataFrame.

* Identified Top 5 Collections: Sorted Collections_df to find the most common restaurant collection tags.

5. Rating Column Cleaning & Feature Engineering (on review DataFrame)
* Handled 'Like' Rating: Replaced a single 'Like' rating entry with 0.

* Converted Rating to Float: review['Rating'] = review['Rating'].astype('float'). This converted the 'Rating' column to a numeric (float) type.

* Filled '0' Rating with Median: The placeholder 0 (which stemmed from 'Like') was replaced with the median rating, providing a more central value. The replacement should touch only the 'Rating' column, not the other columns of that row.
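One way to express the same rating cleanup without a placeholder value is to coerce non-numeric entries to NaN and fill them with the median. This is a sketch on made-up data, not the notebook's own code; `toy_review` is a hypothetical stand-in for the review DataFrame.

```python
import pandas as pd

# Toy data standing in for the review DataFrame (values are made up).
toy_review = pd.DataFrame({
    "Restaurant": ["A", "A", "B", "B", "B"],
    "Rating": ["5", "Like", "3", "4.5", "4"],
})

# Non-numeric entries such as 'Like' become NaN instead of raising an error.
rating = pd.to_numeric(toy_review["Rating"], errors="coerce")

# The median is computed over the valid ratings only (NaN is skipped).
median_rating = rating.median()

# Only the Rating column is modified; every other column keeps its values.
toy_review["Rating"] = rating.fillna(median_rating)
print(toy_review)
```

Because the median is taken before filling, the placeholder never influences it.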

6. Metadata & Time Feature Engineering (on review DataFrame)
* Extracted 'Reviewer_Total_Review' and 'Reviewer_Followers':

  * Split the 'Metadata' string by commas to separate the "Reviews" and "Followers" parts.

  * Extracted the numeric values from these parts (e.g., "50 Reviews" -> 50, "100 Followers" -> 100) and converted them to numeric types.

* Converted 'Time' to Datetime: review['Time'] = pd.to_datetime(review['Time'], errors='coerce'). This converted the 'Time' column into datetime objects; values that could not be parsed became NaT, and those rows were dropped.

* Extracted Time-Based Features:

  * Review_Year: Extracted the year from the 'Time' column.

  * Review_Month: Extracted the month from the 'Time' column.

  * Review_Hour: Extracted the hour from the 'Time' column.
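The metadata and time extraction can also be sketched with regular expressions, which do not depend on the exact position of spaces around the comma. The sample strings and the date format below are assumptions made for illustration; the real columns may differ.

```python
import pandas as pd

# Toy data standing in for the review DataFrame (values are made up).
toy_review = pd.DataFrame({
    "Metadata": ["50 Reviews, 100 Followers", "1 Review", None],
    "Time": ["5/25/2019 15:54", "not a date", "5/24/2019 22:11"],
})

# Capture the number that precedes 'Review(s)' and 'Follower(s)'.
toy_review["Reviewer_Total_Review"] = pd.to_numeric(
    toy_review["Metadata"].str.extract(r"(\d+)\s+Reviews?", expand=False)
)
toy_review["Reviewer_Followers"] = pd.to_numeric(
    toy_review["Metadata"].str.extract(r"(\d+)\s+Followers?", expand=False)
)

# Assumed format: month/day/year hour:minute. Anything else becomes NaT.
toy_review["Time"] = pd.to_datetime(
    toy_review["Time"], format="%m/%d/%Y %H:%M", errors="coerce"
)

# The .dt accessor exposes the datetime components directly.
toy_review["Review_Year"] = toy_review["Time"].dt.year
toy_review["Review_Month"] = toy_review["Time"].dt.month
toy_review["Review_Hour"] = toy_review["Time"].dt.hour
print(toy_review)
```

Rows where `Time` is NaT can then be dropped or filled, depending on whether the time-based features are needed for every review.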

7. Reviewer Engagement & Restaurant Rating Analysis (on review DataFrame)
* Calculated Average Hotel Rating & Total Reviews: Grouped review by 'Restaurant' to find the average 'Rating' and total Reviewer count (Total_Review) for each restaurant.

* Identified Hotels Without Reviews: Compared unique 'Restaurant' names in hotel and review to find restaurants that exist in hotel but have no corresponding reviews.

* Identified Top 5 Most & Lowest Rated Restaurants: Sorted avg_hotel_rating by 'Rating' to find restaurants with the highest and lowest average ratings.

* Found Most Followed Critic: Grouped by 'Reviewer' to find the maximum 'Reviewer_Total_Review', the maximum 'Reviewer_Followers', and the mean 'Rating' (renamed to 'Average_Rating_Given') for each reviewer, then sorted by followers to find the most followed.

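* Analyzed Engagement by Year: Grouped by Review_Year to list restaurants reviewed in each year and count the total reviews for that year. (Note: If the output shows an anomalous group with year 1970.00 and restaurant value 4.0, it is invalid data. This pattern results when an entire review row is overwritten with a number, for example when the 'Like' replacement is assigned to the whole row instead of only the Rating column.)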

8. Merging hotel and review DataFrames
* hotel = hotel.rename(columns = {'Name':'Restaurant'}): Renamed 'Name' in hotel to 'Restaurant' to match the review DataFrame for merging.

* merged = hotel.merge(review, on = 'Restaurant'): Performed an inner merge on the 'Restaurant' column to combine the hotel and review information into a single merged DataFrame. The shape (9999, 17) indicates the number of combined review entries.
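An inner merge silently drops restaurants that have no reviews. A left merge with `indicator=True` makes those rows visible; the sketch below uses made-up frames (`toy_hotel`, `toy_review`) rather than the notebook's data.

```python
import pandas as pd

# Toy frames standing in for hotel and review (values are made up).
toy_hotel = pd.DataFrame({
    "Restaurant": ["A", "B", "C"],
    "Cost": [800, 1300, 500],
})
toy_review = pd.DataFrame({
    "Restaurant": ["A", "A", "B"],
    "Rating": [4.0, 5.0, 3.5],
})

# how="left" keeps every restaurant; _merge records where each row came from.
merged_all = toy_hotel.merge(
    toy_review, on="Restaurant", how="left", indicator=True
)

# Restaurants present in the hotel data but absent from the reviews.
without_review = merged_all.loc[
    merged_all["_merge"] == "left_only", "Restaurant"
].tolist()

# The inner merge keeps only restaurants that have at least one review.
merged_inner = toy_hotel.merge(toy_review, on="Restaurant", how="inner")

print(without_review)
print(merged_inner.shape)
```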

9. Price Point Analysis of Restaurants (on merged DataFrame)
* Calculated Price Point: Grouped merged by 'Restaurant' to find the average 'Rating' and 'Cost' (renamed to 'Price_Point').

* Price Point for High/Lowest Rated Restaurants: Sorted price_point by 'Rating' to see the average cost of highly-rated and lowest-rated restaurants.

10. Rating Count by Reviewer (on review DataFrame)
* Calculated Rating Count per Reviewer: Grouped review by 'Reviewer' and counted the size (number of ratings) for each.

* Identified Top 5 Reviewers by Rating Count: Sorted rating_count_df to find reviewers who have given the most ratings.




## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Distribution of Ratings

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported if not already

# Chart - 1 visualization code
plt.figure(figsize=(8, 5))
sns.histplot(merged['Rating'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Restaurant Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
plt.grid(axis='y', alpha=0.75)
plt.show()

print("\nRating Value Counts:")
print(merged['Rating'].value_counts().sort_index(ascending=False))

##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay (sns.histplot(merged['Rating'], bins=10, kde=True, color='skyblue')) suits a numeric variable such as Rating: it shows how review ratings are distributed across the dataset, where they concentrate, and how long the tails are.

##### 2. What is/are the insight(s) found from the chart?

1. Skewness: The distribution is likely left-skewed (or negatively skewed), meaning the tail is longer on the lower side and the majority of ratings are concentrated towards the higher end of the scale.

2. Peak Rating: You'll likely observe a prominent peak (mode) at 4.0, 4.5, or 5.0. This indicates that restaurants generally receive high ratings in your dataset.

3. Frequency of High Ratings: A large number of reviews fall into the 3.5 to 5.0 range, suggesting overall customer satisfaction is high or that extreme negative experiences are less frequently reviewed or recorded.

4. Lower Ratings: There will be fewer reviews in the lower rating ranges (e.g., below 3.0), but their presence indicates some level of dissatisfaction exists.

5. Rating Scale Utilization: The chart shows how uniformly (or non-uniformly) the entire rating scale (likely 0.5 to 5.0) is utilized by reviewers.
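The skewness, peak, and share of high ratings described above can be checked numerically instead of being read off the chart. This is a minimal sketch on a made-up series; on the real data the input would be the Rating column of the merged DataFrame.

```python
import pandas as pd

# Made-up ratings for illustration only.
ratings = pd.Series([5, 5, 5, 4.5, 4, 4, 3.5, 3, 1], name="Rating")

# Negative skewness means a longer tail on the low-rating side (left-skewed).
skewness = ratings.skew()

# Most frequent rating (the peak of the histogram).
mode = ratings.mode().iloc[0]

# Fraction of reviews rated 3.5 or higher.
share_high = (ratings >= 3.5).mean()

print(f"skewness={skewness:.2f}, mode={mode}, share >= 3.5: {share_high:.0%}")
```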

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can significantly help create a positive business impact for a Zomato-like platform or restaurants themselves.


*Positive Business Impact:*

Several insights can drive positive growth:

1. High Average Ratings (from "Distribution of Restaurant Ratings"):

* Impact: A high concentration of positive ratings (e.g., 4.0-5.0) signifies strong overall customer satisfaction. This builds trust for new users browsing the platform, encouraging them to try highly-rated restaurants. For restaurants, it validates their quality and service.

* Actionable Insight: Zomato can leverage this by prominently featuring highly-rated restaurants, creating "Best Rated" collections, and using positive aggregate scores in marketing. Restaurants can highlight their high ratings to attract more customers.

2. Top Cuisines Identification (from "Top 10 Most Common Cuisines"):

* Impact: Knowing the most popular cuisines (e.g., North Indian, Fast Food) allows the platform to tailor its offerings and marketing strategies.

* Actionable Insight: Zomato can invest in onboarding more restaurants offering these popular cuisines in underserved areas, or run targeted promotions for them. Restaurants can focus on perfecting these popular dishes or adding them to their menu if feasible.

3. Cost vs. Rating Relationship (from Hypothesis 1):

* Impact: If Hypothesis 1 (High-cost restaurants have higher ratings) is supported, it suggests customers perceive higher value or quality with higher prices. This validates premium pricing strategies.

* Actionable Insight: Zomato can create "Fine Dining" or "Premium Experience" collections for high-cost, high-rated restaurants. Restaurants can justify their pricing by ensuring a premium experience that aligns with customer expectations.

4. Reviewer Activity and Consistency (from Hypothesis 3):

* Impact: If high-activity reviewers are found to be more consistent (less variance in ratings), their reviews might be considered more reliable or 'expert' opinions.

* Actionable Insight: Zomato can implement a "Verified Reviewer" or "Top Contributor" badge for such users, giving their reviews more weight or prominence. This builds a more credible review ecosystem.

*Insights Leading to Negative Growth:*

While most insights are geared towards positive impact, some can indicate areas that, if unaddressed, could lead to negative growth:

1. Presence of Lower Ratings:

* Reason: Although high ratings dominate, the existence of ratings like 0.5 or 1.0 (visible in the histogram's left tail) indicates customer dissatisfaction. If these lower ratings are consistently for specific restaurants or certain aspects (e.g., delivery, hygiene), it can drive customers away.

* Justification: Even a small percentage of very negative experiences, if widely publicized or concentrated, can deter potential customers. A few viral negative reviews can significantly damage a restaurant's reputation or a platform's perceived reliability. Zomato's business model relies heavily on trust and positive user experiences.

2. Significant Data Quality Issues (from Data Wrangling Insights):

* Reason: Insights like Collections having over 50% missing values or duplicate rows with missing critical information ('American Wild Wings' and 'Arena Eleven' example) point to underlying data collection or maintenance problems.

* Justification: If key restaurant information (like Collections, which categorize restaurants) is frequently missing, users struggle to find what they're looking for, leading to a poor user experience. Inaccurate or incomplete data reduces the platform's utility and reliability, potentially causing users to abandon the platform or distrust its information. Duplicates also inflate data, leading to skewed analytics and potentially showing more restaurants or reviews than truly exist, which is misleading.

3. Ambiguous Rating System (e.g., 'Like'):

* Reason: The original 'Rating' column contained a non-numerical entry ('Like') that had to be replaced with an assumed value (the median rating) before the column could be converted to numbers. This ambiguity can confuse users about what their rating truly means.

* Justification: An inconsistent or unclear rating system can lead to less precise feedback. If users are unsure how to rate, or if a subjective "Like" has to be mapped to an assumed number, the granularity and reliability of the overall rating are diluted, making it harder for others to trust ratings and for restaurants to get precise feedback for improvement.

These "negative growth" insights aren't inherently bad if they are addressed. They highlight areas where the platform or restaurants need to improve data collection, service quality, or user experience to prevent customer churn and ensure sustainable growth.

#### Chart - 2: Distribution of Restaurant Cost

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported if not already

# Chart - 2 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(merged['Cost'], bins=30, kde=True, color='lightcoral')
plt.title('Distribution of Restaurant Cost (Estimated)')
plt.xlabel('Cost')
plt.ylabel('Number of Reviews')
plt.grid(axis='y', alpha=0.75)
plt.show()

print("\nCost Statistics:")
print(merged['Cost'].describe())

##### 1. Why did you pick the specific chart?

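A histogram with a KDE (Kernel Density Estimate) overlay is a standard way to inspect the distribution of a single numerical variable. Here it shows how the estimated cost of dining is distributed across the dataset. Because `merged` holds one row per review, each restaurant's cost is counted once for every review it received.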

##### 2. What is/are the insight(s) found from the chart?

1. Skewness: The distribution of Cost is typically right-skewed (or positively skewed). This means there's a long tail extending towards higher costs, indicating that while most restaurants are in the lower to medium cost range, there are a few very expensive outliers.

2. Concentration of Restaurants: The majority of restaurants (and their associated reviews) will likely fall into the lower to mid-range cost brackets. This suggests that affordable to moderately priced eateries are more prevalent in your dataset.

3. Presence of High-Cost Outliers: The KDE curve and the bars on the right side of the histogram will show that there are fewer restaurants at very high costs, but they do exist, extending the range of the distribution significantly.

4. Bimodality (Possible): Depending on your data, you might even observe some bimodality, indicating distinct clusters of "budget-friendly" and "mid-range" restaurants, before the long tail of high-end establishments.
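The skewness described above can also be checked numerically rather than read off the chart. The following is a minimal sketch using made-up cost values in place of `merged['Cost']`; the numbers are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Hypothetical cost values standing in for merged['Cost'] (illustrative only)
cost = pd.Series([200, 300, 400, 400, 500, 500, 600, 700, 800, 1200, 1500, 2800])

# Sample skewness: a value above 0 indicates a longer right tail
skewness = cost.skew()

# In a right-skewed distribution the mean is pulled above the median
mean_cost = cost.mean()
median_cost = cost.median()

print(f"Skewness: {skewness:.2f}")
print(f"Mean: {mean_cost:.1f}, Median: {median_cost:.1f}")
```

On the real data, replacing `cost` with `merged['Cost']` would give the same two checks alongside the output of `describe()`.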

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, absolutely.

1. Understanding Market Segments: Knowing the cost distribution helps Zomato and restaurants understand the dominant price points in the market. If most restaurants are affordable, Zomato can cater marketing to budget-conscious users or promote deals. If there's a segment of high-cost restaurants, it points to a premium market segment they can target.

2. Pricing Strategy for Restaurants: Restaurants can compare their pricing to the overall distribution. If they're a high-cost restaurant in a predominantly low-cost area, they need to justify that cost with superior quality or experience. Conversely, if they're too cheap, they might be missing out on revenue.

3. Identifying Gaps: If there's a clear gap in a certain cost range (e.g., very few mid-range options), it might indicate an unserved market niche for new restaurants or for Zomato to actively onboard.

If certain conditions or interpretations arise from this chart, they could lead to negative growth if not addressed:

1. Over-saturation in a Specific Cost Segment:

* Reason: If the chart shows an extreme peak in a very narrow cost range (e.g., almost all restaurants are within a tight "mid-cost" bracket), it suggests fierce competition within that segment.

* Justification: For a restaurant, being just another "me-too" option in an oversaturated market makes it harder to stand out, attract customers, and maintain profitability, potentially leading to stagnation or decline. For Zomato, if their platform offers too many similar options without clear differentiation, users might struggle with choice overload or not find unique dining experiences, potentially reducing engagement over time.

2. Lack of Restaurants at High-Demand Price Points:

* Reason: If the distribution shows very few restaurants in a cost range that is actually in high demand (e.g., users are searching for affordable options, but most restaurants listed are expensive), it indicates a mismatch between supply and demand on the platform.

* Justification: If Zomato fails to provide enough options at price points consumers are actively seeking, users might abandon the platform in favor of competitors that offer more relevant choices, leading to reduced traffic and engagement for Zomato and missed business for restaurants that could fill that gap.

#### Chart - 3: Top 10 Most Common Cuisines

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported if not already

# Chart - 3 visualization code
# Explode cuisines if they are comma-separated and count top ones
# Assuming 'Cuisines' column might contain multiple cuisines separated by commas
cuisines_flat = merged['Cuisines'].str.split(', ').explode().str.strip()
top_cuisines = cuisines_flat.value_counts().head(10)

plt.figure(figsize=(12, 7))
sns.barplot(x=top_cuisines.index, y=top_cuisines.values, palette='viridis')
plt.title('Top 10 Most Common Cuisines')
plt.xlabel('Cuisine')
plt.ylabel('Number of Restaurants/Reviews')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.75)
plt.tight_layout()
plt.show()

print("\nTop 10 Most Common Cuisines:")
print(top_cuisines)

##### 1. Why did you pick the specific chart?

This bar plot displays the ten most frequently occurring cuisines in your dataset, indicating their popularity or prevalence among the listed restaurants and reviews.

##### 2. What is/are the insight(s) found from the chart?

1. Dominant Cuisines: You will clearly see which cuisines are the most frequently offered or reviewed. Typically, this might be cuisines like 'North Indian', 'Chinese', 'Fast Food', 'South Indian', etc., depending on the geographical focus of the dataset (e.g., if it's Hyderabad data, you might see 'Biryani' or 'Andhra' prominent).

2. Market Saturation: The heights of the bars provide a visual representation of how saturated the market might be with certain types of cuisine. A very tall bar means many restaurants offer that cuisine.

3. Cuisine Diversity: While it shows the top 10, it gives an indirect sense of diversity. If the top few bars are disproportionately taller than the others, it suggests less diversity at the very top.
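One caveat when reading the bar heights as market saturation: if `merged` holds one row per review, a cuisine's count grows with the number of reviews, not only with the number of restaurants offering it. The sketch below, built on a small made-up table (the rows and the helper `cuisine_counts` are assumptions for illustration), contrasts the two ways of counting.

```python
import pandas as pd

# Hypothetical review-level rows standing in for `merged` (one row per review)
reviews = pd.DataFrame({
    'Restaurant': ['A', 'A', 'A', 'B', 'C'],
    'Cuisines': ['North Indian, Chinese'] * 3 + ['Chinese', 'Biryani, North Indian'],
})

def cuisine_counts(df):
    """Count how many rows of df list each cuisine."""
    return df['Cuisines'].str.split(',').explode().str.strip().value_counts()

# Counting over all rows weights each restaurant by its number of reviews
per_review = cuisine_counts(reviews)

# Keeping one row per restaurant counts each restaurant exactly once
per_restaurant = cuisine_counts(reviews.drop_duplicates(subset='Restaurant'))

print(per_review)
print(per_restaurant)
```

The per-restaurant count is the one that speaks to how many establishments compete within a cuisine.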

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights are highly valuable for positive business impact:

1. Strategic Expansion for Zomato: Knowing the most common cuisines helps Zomato identify which types of restaurants are plentiful. They can focus efforts on deepening market penetration in these popular segments (e.g., more precise filtering, specific collection curation). They can also identify underrepresented cuisines that might have emerging demand and actively onboard restaurants offering those.

2. Restaurant Menu Planning: Existing restaurants can analyze if their offerings align with popular demand. New restaurants can use this to gauge competition. If a cuisine is highly popular but still has growth potential, it's a viable market for new ventures.

3. Targeted Marketing: Zomato can create cuisine-specific marketing campaigns (e.g., "Craving North Indian? Explore these top spots!"). Restaurants can emphasize their most popular cuisine.

4. Improved User Experience: By understanding what users are most likely searching for, Zomato can optimize its search filters, recommendations, and curated lists, making it easier for users to find what they want, leading to higher engagement and satisfaction.

These insights can also point to potential pitfalls that, if not addressed, could lead to negative growth:

1. Over-Saturation and Intense Competition:

* Reason: If one or two cuisines are overwhelmingly dominant (very tall bars compared to others), it suggests a highly competitive market for those specific cuisines.

* Justification: For a new restaurant entering that market, it would be extremely difficult to gain visibility and customers against so many established players, potentially leading to quick failure. For Zomato, if they solely focus on these saturated cuisines, they might fail to offer unique options, leading to user fatigue or driving users to competitor platforms that provide more diverse choices or better niche offerings. It also implies that existing restaurants in these categories face immense pressure to differentiate, potentially leading to price wars that erode profitability.

2. Missed Opportunities/Stagnation:

* Reason: If the chart shows a long tail of very few restaurants for certain niche or emerging cuisines that might have growing consumer interest (which this chart might not explicitly show, but implies by showing only top 10), it indicates unfulfilled demand.

* Justification: Zomato might experience negative growth if it fails to adapt and bring in new, trending, or underserved cuisine types. If users consistently can't find specific, desired cuisines on the platform, they might migrate to competitors, leading to a loss of market share and potentially a perception of Zomato being "behind the curve" in catering to evolving tastes. This stagnation in offerings can impact user acquisition and retention.

#### Chart - 6: Ratings Over Time

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Chart - 6 visualization code (Updated to use 'merged' DataFrame)

# Ensure 'Time' column is datetime
# This check is a safeguard; the wrangling step should already have handled the conversion.
if not pd.api.types.is_datetime64_any_dtype(merged['Time']):
    print("Warning: 'Time' column is not in datetime format. Attempting conversion.")
    # Use the same conversion as in wrangling; unparseable values become NaT
    merged['Time'] = pd.to_datetime(merged['Time'], errors='coerce')
    merged.dropna(subset=['Time'], inplace=True) # Drop rows where conversion failed
    if merged['Time'].empty:
        # Raise instead of calling exit(), which would shut down the notebook kernel
        raise ValueError("'Time' column is empty after conversion. Cannot plot ratings over time.")

# Aggregate average rating per month/year
merged['YearMonth'] = merged['Time'].dt.to_period('M')
avg_rating_over_time = merged.groupby('YearMonth')['Rating'].mean().reset_index()
avg_rating_over_time['YearMonth'] = avg_rating_over_time['YearMonth'].astype(str) # Convert Period to string for plotting

# Sort by YearMonth for proper chronological order on the plot
avg_rating_over_time = avg_rating_over_time.sort_values('YearMonth')

plt.figure(figsize=(15, 7))
sns.lineplot(x='YearMonth', y='Rating', data=avg_rating_over_time, marker='o', color='purple')
plt.title('Average Rating Over Time')
plt.xlabel('Year-Month')
plt.ylabel('Average Rating')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.75)
plt.tight_layout()
plt.show()

print("\nFirst 5 and Last 5 Average Ratings Over Time:")
print(avg_rating_over_time.head())
print(avg_rating_over_time.tail())

##### 1. Why did you pick the specific chart?

This line plot shows how the average restaurant rating changes month by month across the entire dataset. A line chart is the natural choice for a value tracked over time, and it provides insights into temporal trends in customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

1. Trends (Upward/Downward/Stable): The most immediate insight is whether the average rating is increasing, decreasing, or remaining relatively stable over the observed period.

* An upward trend suggests improving overall customer satisfaction or perhaps the growth of higher-quality restaurants on the platform.

* A downward trend could signal declining satisfaction, emergence of lower-quality options, or issues with service over time.

* A stable trend indicates consistent performance across the period.

2. Seasonality/Fluctuations: Look for repeating peaks and troughs at regular intervals (e.g., every year around the same months). This could indicate seasonal variations in dining habits, tourism, or even operational challenges for restaurants during specific periods (e.g., holidays, festivals).

3. Specific Events: Sharp, sudden drops or spikes that aren't seasonal might correlate with specific events (e.g., a major food festival leading to high ratings, or a public health concern affecting dining out leading to lower ratings for some places).
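When judging trends and sudden spikes, it helps to know how many reviews sit behind each monthly average, since a month with very few reviews can swing sharply by chance. The sketch below uses invented dates and ratings in place of `merged`, and the 3-month window is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical reviews standing in for merged[['Time', 'Rating']]
reviews = pd.DataFrame({
    'Time': pd.to_datetime(['2019-01-05', '2019-01-20', '2019-02-10',
                            '2019-03-01', '2019-03-15', '2019-03-28']),
    'Rating': [4.0, 5.0, 3.0, 4.0, 2.0, 3.0],
})

# Monthly average together with the number of reviews behind it
monthly = (
    reviews.groupby(reviews['Time'].dt.to_period('M'))['Rating']
    .agg(Average_Rating='mean', Review_Count='count')
)

# A rolling mean smooths month-to-month noise so the underlying trend is easier to see
monthly['Rolling_3M'] = monthly['Average_Rating'].rolling(window=3, min_periods=1).mean()

print(monthly)
```

Months with a low `Review_Count` are the ones where an apparent spike or dip deserves the most caution.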

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding rating trends over time is crucial for proactive management and strategic planning, leading to significant positive business impact:

1. Performance Monitoring & Quality Assurance:

* Actionable Impact: For Zomato, monitoring a downward trend in overall ratings acts as an early warning system. It could prompt investigations into issues like declining service standards among partners, changes in user expectations, or increased competition affecting quality. Conversely, an upward trend validates ongoing efforts to improve restaurant quality or user experience. This allows Zomato to support restaurants in improving quality or to highlight positive platform-wide improvements.

2. Targeted Interventions: If specific months consistently show lower ratings (e.g., during summer heatwaves in Hyderabad affecting outdoor dining), Zomato could advise restaurants on offering seasonal incentives (e.g., special offers for AC-equipped restaurants, indoor dining promotions).

* Actionable Impact: This proactive guidance helps restaurants maintain customer satisfaction even during challenging periods, ultimately benefiting Zomato's ecosystem by keeping reviews positive.

3. Marketing & Promotions: Identifying periods of high average ratings can inform marketing strategies.

* Actionable Impact: Zomato could launch "Best of the Season" campaigns during periods of consistently high ratings, leveraging positive sentiment to attract more users.

Certain observations from this chart can indicate potential drivers of negative growth if they remain unaddressed:

1. Sustained Decline in Average Ratings:

* Reason: A consistent and significant downward trend in the average rating over an extended period.

* Justification: This is a direct indicator of decreasing customer satisfaction with the overall restaurant offerings or potentially Zomato's service (e.g., delivery quality). If customers consistently have worse experiences, they will reduce their usage of the platform, choose competitors, or stop dining out as frequently, directly leading to a decline in transactions and user base for Zomato. It signifies a systemic problem that could severely impact reputation and revenue.

2. Recurring Seasonal Dips Not Addressed:

* Reason: If the chart shows predictable, significant dips in ratings during specific months (e.g., every monsoon season in Hyderabad, specific festival periods) that are not being mitigated.

* Justification: These recurring dips represent predictable periods of customer dissatisfaction or operational stress for restaurants. If Zomato and its partners don't develop strategies to counteract these issues (e.g., better rain-proof delivery, special services during busy holidays), these periods will consistently lead to negative user experiences and reduced engagement, contributing to a cyclical pattern of negative growth during those times of the year. It shows a failure to adapt to known challenges.

#### Chart - 7: Top 10 Restaurants by Average Rating (and their Review Count)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported if not already

# Calculate average rating and review count per restaurant
restaurant_summary = merged.groupby('Restaurant').agg(
    Average_Rating=('Rating', 'mean'),
    Review_Count=('Review', 'count')
).reset_index()

# Filter for restaurants with a decent number of reviews to avoid bias from few reviews
min_reviews_threshold = 10 # Adjust this threshold as needed
top_rated_restaurants = restaurant_summary[restaurant_summary['Review_Count'] >= min_reviews_threshold].sort_values(by='Average_Rating', ascending=False).head(10)

plt.figure(figsize=(12, 7))
sns.barplot(x='Average_Rating', y='Restaurant', data=top_rated_restaurants, palette='magma')
plt.title(f'Top 10 Restaurants by Average Rating (Min {min_reviews_threshold} Reviews)')
plt.xlabel('Average Rating')
plt.ylabel('Restaurant')
plt.grid(axis='x', alpha=0.75)
plt.tight_layout()
plt.show()

print("\nTop 10 Restaurants by Average Rating:")
print(top_rated_restaurants)

##### 1. Why did you pick the specific chart?

This bar plot highlights the restaurants with the highest average ratings. Only restaurants with a minimum number of reviews are considered, which reduces the bias that arises when an average is based on just a handful of reviews.

##### 2. What is/are the insight(s) found from the chart?

1. Which specific restaurants are performing exceptionally well?

* Observation: The bars directly show the names and average ratings of the top performers.

* Impact: This is highly actionable. Zomato can prominently feature these restaurants in "Editor's Choice," "Top Rated," or "Must-Visit" collections. This drives traffic to these high-performing partners, strengthens relationships, and ensures users discover top-quality dining experiences.

2. Are these top restaurants consistently rated, or do they have very few reviews?

* Observation: The min_reviews_threshold helps mitigate the "few reviews, high rating" bias. The Review_Count included in the print output provides this detail.

* Impact: High ratings backed by a substantial number of reviews are more credible and trustworthy. Zomato can emphasize this credibility, reassuring users that these ratings are reliable, which builds platform trust. If a restaurant shows high average ratings with few reviews, it's a positive signal, but needs more data to be fully trustworthy.

3. Do these top restaurants share common characteristics (e.g., cuisine, cost, location)?

* Observation: This requires cross-referencing with other data points (not directly from this chart, but using the insights gained).

* Impact: If a pattern emerges (e.g., most top-rated restaurants are fine-dining Italian or specific chains), Zomato can gain insights into what truly drives excellence in their ecosystem. This can inform onboarding strategies (seeking more similar restaurants), marketing campaigns (targeting users who prefer these types of establishments), and even provide benchmarks for other restaurants aiming to improve.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

1. Actionable Insight: Directly highlights top-performing restaurants. Zomato can feature these restaurants prominently (e.g., "Editor's Choice," "Must-Visit" collections, prime ad slots). This helps users discover quality, builds trust in Zomato's recommendations, and strengthens relationships with high-quality partners. This visibility drives more traffic and orders to these restaurants and, by extension, to Zomato.

Insights Leading to Negative Growth:

1. Potential Over-Reliance on Few Reviews: While we filtered for a minimum, if some "top" restaurants still have relatively few reviews, their high average might not be truly representative.

* Justification: Promoting restaurants based on insufficient data could lead to user disappointment if their experience doesn't match the inflated rating. This erodes trust in Zomato's rating system and can cause users to look for more reliable sources of information, hindering platform growth.
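One common way to soften the "few reviews, high rating" problem, beyond a hard threshold, is a shrinkage (Bayesian-style) average that pulls each restaurant's rating toward the overall mean in proportion to how few reviews it has. This is not the method used in the chart above; it is a sketch of an alternative, with invented numbers and a prior weight `m` chosen arbitrarily.

```python
import pandas as pd

# Hypothetical per-restaurant summary standing in for `restaurant_summary`
summary = pd.DataFrame({
    'Restaurant': ['A', 'B', 'C'],
    'Average_Rating': [5.0, 4.5, 3.0],
    'Review_Count': [10, 100, 90],
})

m = 10  # assumed prior weight: acts like m "virtual" reviews at the global mean

# Global mean rating across all reviews (review-weighted)
v = summary['Review_Count']
C = (summary['Average_Rating'] * v).sum() / v.sum()

# Shrinkage estimate: restaurants with few reviews are pulled toward C
summary['Weighted_Rating'] = (v * summary['Average_Rating'] + m * C) / (v + m)

ranked = summary.sort_values('Weighted_Rating', ascending=False)
print(ranked)
```

In this toy example restaurant B, with 100 reviews at 4.5, ends up ranked above restaurant A, whose perfect 5.0 rests on only 10 reviews.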

#### Chart - 9: Rating Distribution by Collection Presence

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported if not already

# Chart - 9 visualization code (Updated to use 'merged' DataFrame)

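# Flag whether a restaurant is part of *any* collection.
# Both the 'Not Available' placeholder and missing values (NaN) count as 'No'.
has_collection = merged['Collections'].notna() & (merged['Collections'] != 'Not Available')
merged['In_Collection'] = has_collection.map({True: 'Yes', False: 'No'})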

plt.figure(figsize=(8, 6))
sns.boxplot(x='In_Collection', y='Rating', data=merged, palette='cividis')
plt.title('Rating Distribution: In Collection vs. Not In Collection')
plt.xlabel('Is Restaurant in a Zomato Collection?')
plt.ylabel('Rating')
plt.grid(axis='y', alpha=0.75)
plt.show()

print("\nMedian Rating: In Collection vs. Not In Collection:")
print(merged.groupby('In_Collection')['Rating'].median())

##### 1. Why did you pick the specific chart?

This box plot compares the distribution of ratings for restaurants that are part of a Zomato "Collection" (e.g., "Must Try," "Great Breakfast Spots") versus those that are not.

##### 2. What is/are the insight(s) found from the chart?

1. Do restaurants in Zomato Collections generally have higher ratings?

* Observation: Compare the median rating (middle line) of the "Yes" (In Collection) box to the "No" (Not In Collection) box.

* Impact: If "In Collection" restaurants consistently show higher median ratings, it validates Zomato's curation efforts. It demonstrates that collections effectively highlight quality restaurants, which builds user trust in Zomato's recommendations. This encourages users to explore collections, leading to higher engagement and satisfaction.

2. Are ratings for "In Collection" restaurants more consistent (less variance)?

* Observation: Compare the height of the boxes (IQR) for "Yes" vs. "No". A shorter box implies more consistent ratings.

* Impact: If collection restaurants have tighter rating distributions, it further reinforces their reliability. This helps Zomato ensure a consistent quality experience for users who rely on curated lists, reducing the risk of disappointment.

3. Are there many highly-rated restaurants not in collections?

* Observation: While not directly shown, if the "Not In Collection" box still has a relatively high median and plenty of high-rating outliers, it suggests there are undiscovered gems.

* Impact: This points to an opportunity for Zomato's curation team to expand and refresh collections by identifying and adding these high-performing, currently unfeatured restaurants. This can continuously enrich the platform's content and provide fresh recommendations to users, preventing stagnation.
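The median and box-height comparisons described above can be read as numbers instead of estimated from the plot. The following sketch computes the quartiles and IQR per group on made-up ratings; the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ratings standing in for merged[['In_Collection', 'Rating']]
reviews = pd.DataFrame({
    'In_Collection': ['Yes'] * 5 + ['No'] * 5,
    'Rating': [4.0, 4.0, 4.5, 5.0, 4.5, 1.0, 3.0, 4.0, 5.0, 2.0],
})

# Quartiles per group; unstack turns the quantile level into columns
q = reviews.groupby('In_Collection')['Rating'].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ['Q1', 'Median', 'Q3']

# The IQR is the height of the box in the box plot
q['IQR'] = q['Q3'] - q['Q1']

print(q)
```

A higher `Median` together with a smaller `IQR` for the "Yes" group would correspond to the "higher and more consistent ratings" reading of the box plot.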

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

1. Validation of Curation: If restaurants "In Collection" consistently have higher median ratings and/or tighter rating distributions, Zomato's curation strategy is validated.

* Actionable Impact: Collections that effectively highlight quality establishments give users curated lists they can trust, leading to higher satisfaction and repeated use of the collection feature. This reinforces Zomato's role as a trusted guide in dining, fostering user loyalty.

Insights Leading to Negative Growth:

1. No Difference or Lower Quality in Collections: The chart may show no meaningful difference in ratings or, even worse, lower median ratings or higher variance for "In Collection" restaurants.

* Justification: This would mean Zomato's collections are failing to curate higher-quality restaurants. Users relying on these collections might experience disappointment, leading to:

  * Distrust in Recommendations: Users will lose faith in Zomato's curated lists.

  * Reduced Engagement: Users will stop using collections and might even reduce overall platform usage.

  * Reputational Damage: Zomato's image as a reliable dining guide could be negatively impacted, leading to user churn and hindering growth.

#### Chart - 10: Average Rating of Top N Cuisines

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Build a long-format DataFrame with one (Cuisine, Rating) row per cuisine listed on each row.
# Splitting and exploding inside the same DataFrame keeps every rating attached to its own
# cuisines, so no index-based re-alignment is needed (that would fail on a non-unique index).
temp_df = (
    merged[['Cuisines', 'Rating']]
    .assign(Cuisine=merged['Cuisines'].str.split(','))
    .explode('Cuisine')
)
temp_df['Cuisine'] = temp_df['Cuisine'].str.strip()

# Calculate average rating per cuisine using this temporary DataFrame
cuisine_ratings = temp_df.groupby('Cuisine')['Rating'].mean().reset_index()


# Get the names of the 10 most common cuisines (same counting logic as Chart 3)
top_10_common_cuisines_names = merged['Cuisines'].str.split(', ').explode().str.strip().value_counts().head(10).index.tolist()

# Filter average ratings for only these top 10 cuisines
avg_rating_top_10_cuisines = cuisine_ratings[cuisine_ratings['Cuisine'].isin(top_10_common_cuisines_names)]
avg_rating_top_10_cuisines = avg_rating_top_10_cuisines.sort_values(by='Rating', ascending=False)


plt.figure(figsize=(12, 7))
sns.barplot(x='Rating', y='Cuisine', data=avg_rating_top_10_cuisines, palette='crest')
plt.title('Average Rating of Top 10 Most Common Cuisines')
plt.xlabel('Average Rating')
plt.ylabel('Cuisine')
plt.grid(axis='x', alpha=0.75)
plt.tight_layout()
plt.show()

print("\nAverage Rating of Top 10 Most Common Cuisines:")
print(avg_rating_top_10_cuisines)

##### 1. Why did you pick the specific chart?

This horizontal bar chart moves beyond popularity to assess satisfaction. It directly compares the average ratings of the most common cuisines, helping identify which popular cuisines genuinely satisfy customers.

##### 2. What is/are the insight(s) found from the chart?

1. Satisfaction by Cuisine Type: Clearly shows which of the popular cuisines receive the highest average ratings, indicating high customer satisfaction. Conversely, it highlights popular cuisines with lower average ratings.

2. Quality vs. Quantity: Provides a reality check: a cuisine can be common but not necessarily highly rated.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

1. Targeted Improvement/Promotion: Zomato can promote highly-rated popular cuisines to attract more users. For restaurants, it identifies areas for improvement: if a common cuisine has a low average rating, it's a signal to improve quality or service.

2. Investment Guidance: Zomato might prioritize onboarding more restaurants of highly-rated popular cuisines, or even invest in programs to help lower-rated popular cuisines improve.

Insights Leading to Negative Growth:

1. Promoting Low-Quality Popular Cuisines: Zomato may heavily promote a popular cuisine that consistently receives low average ratings.

* Justification: Users will experience disappointment, leading to distrust in Zomato's recommendations and a negative perception of the platform's ability to guide them to quality dining, directly impacting user retention.

#### Chart - 14 - Correlation Heatmap

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Correlation Heatmap visualization code (Updated to use 'merged' DataFrame)

# Select only numerical columns for the correlation matrix from 'merged'
numerical_df = merged.select_dtypes(include=['number'])

# Exclude identifier-like columns, which are numeric but not meaningful for correlation.
# 'Restaurant_ID' and 'Reviewer_ID' may not exist in 'merged' under these exact names,
# so errors='ignore' makes this line a harmless safeguard.
numerical_df = numerical_df.drop(columns=['Restaurant_ID', 'Reviewer_ID'], errors='ignore')

# Drop columns that are entirely NaN; they would only add empty rows and columns to the heatmap
numerical_df = numerical_df.dropna(axis=1, how='all')

# Calculate the correlation matrix
correlation_matrix = numerical_df.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

print("\nCorrelation Matrix (Numerical Features):\n", correlation_matrix)

##### 1. Why did you pick the specific chart?

A correlation heatmap visually represents the pairwise correlation coefficients between all numerical columns in your DataFrame. It helps to quickly identify strong positive or negative linear relationships between variables, which is crucial for understanding underlying data structures and for feature selection in machine learning.

##### 2. What is/are the insight(s) found from the chart?

From the correlation heatmap, you can identify:

Strong Positive Correlations (Warm Colors, close to +1): Look for dark red cells. This indicates that as one variable increases, the other tends to increase as well. Note that correlation measures the strength of a linear association, not causation.

Example: You might find a positive correlation between Rating and Cost (if higher cost restaurants tend to have higher ratings). Or between Number_of_Reviews and Followers (if active reviewers gain more followers).

Strong Negative Correlations (Cool Colors, close to -1): Look for dark blue cells. This indicates that as one variable increases, the other tends to decrease.

Example: Less common in this context, but perhaps Cost and some hypothetical 'Discount_Rate' could be negatively correlated.

Weak or No Correlations (Colors close to 0, typically white/light gray): Indicates little to no linear relationship between variables.

Example: Rating might have a very weak correlation with Cost if good and bad restaurants exist across all price points.

Self-Correlation: The diagonal will always be 1 (perfect positive correlation) as a variable is perfectly correlated with itself.
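The heatmap above uses the Pearson coefficient, which captures linear relationships and can be distorted by a few extreme values in skewed columns such as cost. As a complementary check, a rank-based Spearman correlation can be computed with the same `corr` method. The sketch below uses invented values, not figures from the dataset.

```python
import pandas as pd

# Hypothetical values: ratings rise steadily with cost, with one very expensive outlier
df = pd.DataFrame({
    'Cost': [200, 400, 600, 800, 3000],
    'Rating': [3.0, 3.5, 3.8, 4.0, 4.1],
})

# Pearson measures linear association and is sensitive to the outlier
pearson = df.corr(method='pearson').loc['Cost', 'Rating']

# Spearman works on ranks, so it captures any monotonic relationship
spearman = df.corr(method='spearman').loc['Cost', 'Rating']

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

If the two coefficients differ noticeably on the real data, the relationship is likely monotonic but not linear, or is being driven by outliers.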

#### Chart - 15 - Pair Plot

In [None]:
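# Pair Plot visualization code
# Restrict the plot to numerical columns explicitly; non-numeric columns cannot be plotted.
sns.pairplot(merged.select_dtypes(include=['number']))
plt.show()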

##### 1. Why did you pick the specific chart?

A pair plot (or scatter plot matrix) is incredibly useful for a quick, high-level overview of relationships between multiple numerical variables simultaneously.

1. Diagonal: Shows the distribution (histogram or KDE) of each individual variable.

2. Off-Diagonal: Shows scatter plots for every combination of two variables, allowing you to visually identify correlations, clusters, or patterns.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot, you can gain a multi-faceted understanding:

1. Individual Distributions (Diagonal): You'll see the shape of the distribution for Rating, Cost, Number_of_Reviews, and Followers. For example, Rating will likely be left-skewed (most ratings are high), while Cost, Number_of_Reviews, and Followers will probably be heavily right-skewed (many low values, few very high values).

2. Pairwise Relationships (Off-Diagonal Scatter Plots):

* Rating vs. Cost: You can visually assess if higher costs generally correspond to higher ratings, or if there's no clear pattern, or even if some high-cost restaurants have surprisingly low ratings.

* Rating vs. Number_of_Reviews: See if restaurants with more reviews tend to have higher, lower, or more stable ratings.

* Number_of_Reviews vs. Followers: This will likely show a positive correlation, indicating that as a reviewer writes more reviews, they tend to accumulate more followers. This reinforces findings from the specific scatter plot we did earlier.

* Cost vs. Number_of_Reviews: Might reveal if more expensive restaurants are reviewed more or less frequently.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Answer:**

1. Higher-cost restaurants tend to receive higher average ratings.

2. Restaurants reviewed by individuals with a larger follower count are likely to achieve elevated ratings.

3. Restaurants boasting a broader range of cuisines are associated with superior average ratings.

### Hypothetical Statement - 1 - Higher-cost restaurants tend to receive higher average ratings.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

This hypothesis compares the means of two independent groups (high-cost vs. low-cost restaurant ratings). A two-sample independent t-test is appropriate here.

1. Null Hypothesis (H_0): There is no significant difference in the average ratings between high-cost and low-cost restaurants. ($\mu_{high\_cost} = \mu_{low\_cost}$)

2. Alternative Hypothesis (H_1): High-cost restaurants have a significantly higher average rating than low-cost restaurants. ($\mu_{high\_cost} > \mu_{low\_cost}$)

Pre-computation: We'll first re-create the Cost_Range column if it's not already in the merged DataFrame.
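As a rough illustration of the binning step, the sketch below applies `pd.qcut` with three quantile bins to a handful of hypothetical cost values (made up for this example, not taken from the dataset).

```python
import pandas as pd

# Hypothetical cost values for illustration; not the Zomato data.
cost = pd.Series([200, 300, 400, 500, 600, 800, 1000, 1200, 1500], name="Cost")

# q=3 splits at the 1/3 and 2/3 quantiles so each bin holds roughly equal counts.
cost_range = pd.qcut(cost, q=3, labels=["Low Cost", "Medium Cost", "High Cost"])
print(cost_range.value_counts().sort_index())
```

When many restaurants share the same cost, two quantile edges can coincide. With `duplicates='drop'` the number of bins then shrinks and no longer matches three labels, which is the situation the `try`/`except` fallback in the test cell is meant to handle.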

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy import stats
import numpy as np # Ensure numpy is imported for numerical operations

# Check null values in Cost column
print("Nulls in 'Cost' column:", merged['Cost'].isnull().sum())

# Check unique values in Cost column
print("Number of unique values in 'Cost' column:", merged['Cost'].nunique())

# Display a sample of unique values if there are few
if merged['Cost'].nunique() < 20:
    print("Unique values in 'Cost' column (if few):", merged['Cost'].unique())

# Create 'Cost_Range' bins, allowing duplicate bin edges to be dropped
# This is crucial if many restaurants share the same 'Cost' value at quantile boundaries
# Ensure 'Cost' column is numeric before quantiling
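if not pd.api.types.is_numeric_dtype(merged['Cost']):
    # Coerce first, then take the median of the numeric result
    # (calling median() on the raw non-numeric column would fail)
    cost_numeric = pd.to_numeric(merged['Cost'], errors='coerce')
    merged['Cost'] = cost_numeric.fillna(cost_numeric.median())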

# Handle potential issues with qcut if data is highly uniform
try:
    merged['Cost_Range'] = pd.qcut(merged['Cost'], q=3, labels=['Low Cost', 'Medium Cost', 'High Cost'], duplicates='drop')
except ValueError as e:
    print(f"Warning: Could not create 3 distinct cost ranges due to data distribution. Error: {e}")
    # Fallback for highly uniform cost data: use simpler binning or fewer quantiles
    merged['Cost_Range'] = pd.cut(merged['Cost'], bins=3, labels=['Low Cost', 'Medium Cost', 'High Cost'])


# Extract ratings for high-cost and low-cost groups
# Ensure to drop any NaNs in Rating that might exist
high_cost_ratings = merged[merged['Cost_Range'] == 'High Cost']['Rating'].dropna()
low_cost_ratings = merged[merged['Cost_Range'] == 'Low Cost']['Rating'].dropna()

# Perform independent two-sample t-test
# 'equal_var=False' is used for Welch's t-test, which is robust to unequal variances
# 'alternative='greater'' tests if high_cost_ratings mean is greater than low_cost_ratings mean
t_statistic, p_value = stats.ttest_ind(high_cost_ratings, low_cost_ratings, equal_var=False, alternative='greater')

print(f"\nHypothesis 1: High-cost restaurants vs. Low-cost restaurants (Average Rating)")
print(f"Mean Rating (High Cost): {high_cost_ratings.mean():.2f}")
print(f"Mean Rating (Low Cost): {low_cost_ratings.mean():.2f}")
print(f"T-statistic: {t_statistic:.2f}")
print(f"P-value: {p_value:.3f}")

alpha = 0.05
if p_value < alpha:
    print(f"Conclusion: Reject the Null Hypothesis. There is statistically significant evidence that high-cost restaurants have higher average ratings (p={p_value:.3f} < {alpha}).")
else:
    print(f"Conclusion: Fail to reject the Null Hypothesis. There is no statistically significant evidence that high-cost restaurants have higher average ratings (p={p_value:.3f} >= {alpha}).")

##### Which statistical test have you done to obtain P-Value?

We performed a Welch's two-sample independent t-test to obtain the P-value.

##### Why did you choose the specific statistical test?

We chose Welch's t-test for the following reasons:

1. Comparing Two Means: Our hypothesis involves comparing the average ratings ($\mu$) of two distinct, independent groups (high-cost restaurants vs. low-cost restaurants).

2. Independence: The ratings from one group of restaurants (e.g., high-cost) are independent of the ratings from the other group (low-cost).

3. Robustness to Unequal Variances: Welch's t-test does not assume equal variances between the two groups, making it more robust and generally preferred over Student's t-test when group variances might differ, which is often the case with real-world data like restaurant ratings and costs.

4. Directional Hypothesis: We had a directional alternative hypothesis ($\mu_{high\_cost} > \mu_{low\_cost}$), so we used alternative='greater' in the test.
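To make the reasoning above concrete, here is a minimal sketch that computes Welch's statistic by hand and compares it with `scipy.stats.ttest_ind`. The two samples are synthetic stand-ins with assumed means, spreads, and sizes, not the actual rating groups.

```python
import numpy as np
from scipy import stats

# Synthetic ratings for illustration only; group sizes and spreads are assumptions.
rng = np.random.default_rng(42)
high = rng.normal(3.9, 0.6, 400)   # stand-in for high-cost ratings
low = rng.normal(3.6, 1.0, 600)    # stand-in for low-cost ratings, wider spread

# Welch's statistic: difference in means over the unpooled standard error.
v_high = high.var(ddof=1) / high.size
v_low = low.var(ddof=1) / low.size
t_manual = (high.mean() - low.mean()) / np.sqrt(v_high + v_low)

# Welch-Satterthwaite degrees of freedom.
dof = (v_high + v_low) ** 2 / (
    v_high**2 / (high.size - 1) + v_low**2 / (low.size - 1)
)
p_manual = stats.t.sf(t_manual, dof)  # one-sided: P(T > t)

t_scipy, p_scipy = stats.ttest_ind(high, low, equal_var=False, alternative="greater")
print(f"t = {t_scipy:.3f}, one-sided p = {p_scipy:.3g}, dof = {dof:.1f}")
```

Because the variances are not pooled, the group with the larger spread contributes more to the standard error, which is what makes the test suitable when the two cost groups differ in variability.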

### Hypothetical Statement - 2 - Restaurants that are reviewed by reviewers with more followers will have a higher rating.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

This hypothesis explores the linear relationship between two continuous numerical variables: Reviewer_Followers (after log transformation due to its likely skewed distribution) and Rating. Pearson correlation coefficient is the appropriate measure, and we can test its significance.

1. Null Hypothesis (H_0): There is no linear correlation between the number of reviewer followers and the rating they give. ($\rho = 0$)

2. Alternative Hypothesis (H_1): There is a positive linear correlation between the number of reviewer followers and the rating they give. ($\rho > 0$)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import pearsonr
import numpy as np # For log1p

# Log transform the Reviewer_Followers for better handling of skewed data
# Drop rows where either column is NaN so both series stay aligned and equal in length
corr_data = merged[['Reviewer_Followers', 'Rating']].dropna()
followers_log = np.log1p(corr_data['Reviewer_Followers'])
ratings_for_corr = corr_data['Rating']

# Only proceed if there's enough data after dropping NaNs
if len(corr_data) > 1:
    correlation_coefficient, p_value = pearsonr(followers_log, ratings_for_corr)

    print(f"\nHypothesis 2: Reviewer Followers vs. Rating Given")
    print(f"Pearson Correlation (Log(1+Followers) vs. Rating): {correlation_coefficient:.2f}")
    print(f"P-value: {p_value:.3f}")

    alpha = 0.05
    # For a one-tailed test (positive correlation), we divide p-value by 2
    # Note: pearsonr returns a two-tailed p-value by default.
    p_value_one_tailed = p_value / 2

    if correlation_coefficient > 0 and p_value_one_tailed < alpha:
        print(f"Conclusion: Reject the Null Hypothesis. There is statistically significant evidence of a positive linear correlation between reviewer followers and rating given (p={p_value_one_tailed:.3f} < {alpha}).")
    else:
        print(f"Conclusion: Fail to reject the Null Hypothesis. There is no statistically significant evidence of a positive linear correlation between reviewer followers and rating given (p={p_value_one_tailed:.3f} >= {alpha}).")
else:
    print("\nNot enough data to perform correlation analysis for Hypothesis 2 after dropping NaNs.")

##### Which statistical test have you done to obtain P-Value?

We performed a Pearson correlation test to obtain the P-value for the correlation coefficient.



##### Why did you choose the specific statistical test?

We chose Pearson correlation because:

1. Relationship between Two Continuous Variables: It's used to quantify the strength and direction of a linear relationship between two continuous variables (Reviewer_Followers and Rating).

2. Linearity Assumption: While we transform Reviewer_Followers with log1p to improve linearity and handle skewness, the underlying test checks for a linear relationship between the transformed variable and Rating.

3. Parametric Test: It's a parametric test suitable for normally distributed data (or sufficiently large sample sizes where the Central Limit Theorem applies). The log transformation can also help in achieving a more normal distribution.

4. Directional Hypothesis: We are specifically looking for a positive correlation.
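The halving of the two-sided p-value used in the test cell can be checked against the t distribution directly. The sketch below does this on synthetic data (the follower counts and ratings are generated for illustration and are assumptions, not the Zomato reviews).

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only; not the Zomato reviews.
rng = np.random.default_rng(1)
followers = rng.lognormal(mean=2.0, sigma=1.5, size=300)   # right-skewed counts
rating = 3.5 + 0.15 * np.log1p(followers) + rng.normal(0, 0.8, 300)

x = np.log1p(followers)
r, p_two_sided = stats.pearsonr(x, rating)

# Under H0 (rho = 0), t = r * sqrt((n - 2) / (1 - r^2)) follows a
# t distribution with n - 2 degrees of freedom.
n = x.size
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_one_sided = stats.t.sf(t_stat, n - 2)  # P(T > t), the "greater" alternative

print(f"r = {r:.3f}, two-sided p = {p_two_sided:.3g}, one-sided p = {p_one_sided:.3g}")
```

Halving is only valid when the observed coefficient has the sign stated in the alternative hypothesis, which is why the test cell also checks `correlation_coefficient > 0` before rejecting.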

### Hypothetical Statement - 3 - Restaurants that offer a wider variety of cuisines will have a higher rating.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

This hypothesis examines the relationship between the Number of Cuisines Offered (a numerical count) and the Average Rating of a restaurant. We'll use Pearson correlation to quantify this relationship.

1. Null Hypothesis (H_0): There is no linear correlation between the number of cuisines a restaurant offers and its average rating. ($\rho = 0$)

2. Alternative Hypothesis (H_1): There is a positive linear correlation between the number of cuisines a restaurant offers and its average rating. ($\rho > 0$)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import pearsonr
import numpy as np # For any potential numerical operations

# Calculate the number of cuisines for each restaurant
# Fill NaNs with the 'Not Available' placeholder first, so missing entries count as 0 cuisines
merged['Cuisines_Count'] = merged['Cuisines'].fillna('Not Available').astype(str).apply(lambda x: len(x.split(',')) if x != 'Not Available' else 0)

# Aggregate average rating per restaurant and number of cuisines
restaurant_cuisine_diversity = merged.groupby('Restaurant').agg(
    Average_Rating=('Rating', 'mean'),
    Num_Cuisines=('Cuisines_Count', 'first') # 'first' works because Cuisines_Count is constant per restaurant
).reset_index()

# Drop restaurants where either value is NaN so both series stay aligned and equal in length
corr_data = restaurant_cuisine_diversity[['Num_Cuisines', 'Average_Rating']].dropna()
num_cuisines_for_corr = corr_data['Num_Cuisines']
avg_rating_for_corr = corr_data['Average_Rating']

# Only proceed if there's enough data after dropping NaNs
if len(corr_data) > 1:
    correlation_coefficient, p_value = pearsonr(num_cuisines_for_corr, avg_rating_for_corr)

    print(f"\nHypothesis 3: Number of Cuisines Offered vs. Average Restaurant Rating")
    print(f"Pearson Correlation (Num Cuisines vs. Average Rating): {correlation_coefficient:.2f}")
    print(f"P-value: {p_value:.3f}")

    alpha = 0.05
    # For a one-tailed test (positive correlation), we divide p-value by 2
    p_value_one_tailed = p_value / 2

    if correlation_coefficient > 0 and p_value_one_tailed < alpha:
        print(f"Conclusion: Reject the Null Hypothesis. There is statistically significant evidence of a positive linear correlation between the number of cuisines offered and average restaurant rating (p={p_value_one_tailed:.3f} < {alpha}).")
    else:
        print(f"Conclusion: Fail to reject the Null Hypothesis. There is no statistically significant evidence of a positive linear correlation between the number of cuisines offered and average restaurant rating (p={p_value_one_tailed:.3f} >= {alpha}).")
else:
    print("\nNot enough data to perform correlation analysis for Hypothesis 3 after dropping NaNs.")

##### Which statistical test have you done to obtain P-Value?

We performed a Pearson correlation test to obtain the P-value for the correlation coefficient.

##### Why did you choose the specific statistical test?

We chose Pearson correlation because:

1. Relationship between Two Numerical Variables: It's used to quantify the strength and direction of a linear relationship between two numerical variables (Num_Cuisines, a count, and Average_Rating).

2. Linearity Assumption: It assumes a linear relationship between the variables.

3. Parametric Test: It's a parametric test suitable for sufficiently large sample sizes.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Remove duplicate rows from the review dataset
review = review.drop_duplicates()

#### What all missing value imputation techniques have you used and why did you use those techniques?

For handling missing values, I used a combination of imputation techniques tailored to the specific nature of each column and the type of data it contained. The primary goal was to ensure data quality and prevent errors or biases in subsequent analysis and modeling.

Missing Value Imputation Techniques Used:
1. Filling with 'Not Available' or Empty Strings (Categorical/Text)
* Columns: hotel['Collections'], hotel['Timings'], review['Reviewer'], review['Review']

* Technique: For categorical or text-based columns, missing values (NaNs) were replaced with a placeholder string like 'Not Available' or an empty string ('').

* Reasoning:

  * Preservation of Information: Instead of discarding rows, this approach preserves the existing data while explicitly marking the absence of a value.

  * Meaningful Placeholder: For Collections and Timings, 'Not Available' is a clearer indicator that the information simply wasn't provided, rather than implying absence means a default value.

  * Text Processing Compatibility: For Reviewer and Review, replacing NaNs with 'Anonymous' or empty strings prevents errors during text processing (like tokenization or NLP tasks) and doesn't distort distributions. An empty string for Review means the review text was simply blank.

2. Filling with Median (Numerical)
* Columns: hotel['Cost']

* Technique: Missing values in the Cost column were imputed with the median value of the column.

* Reasoning:

  * Robustness to Outliers: The median is preferred over the mean for skewed numerical data (which cost often is) because it's less affected by extreme outliers. This prevents a few unusually high or low costs from significantly skewing the imputed value.

  * Maintaining Distribution: Imputing with the median helps to preserve the overall distribution of the 'Cost' column more accurately than the mean would, especially in non-normal distributions.

  * Edge Case Handling: A specific check was included to fill with 0 if the entire Cost column somehow became NaN after initial cleaning (though this is a rare edge case, it ensures the script doesn't crash).

3. Filling with Zero (Numerical, Count-based)
* Columns: review['Reviewer_Total_Review'] (renamed to Number_of_Reviews), review['Reviewer_Followers']

* Technique: Missing values (NaNs resulting from pd.to_numeric(..., errors='coerce') for non-convertible entries) in these count-based numerical columns were filled with 0.

* Reasoning:

  * Logical Zero: For "total reviews" or "followers," a missing or unparseable value logically implies that the reviewer has zero reviews or zero followers in the context of the available metadata. It's a natural default for absence in count data.

  * Data Type Integrity: Filling with 0 (and then converting to int) maintains the integer data type for these count features.

4. Dropping Rows (Critical Missing Data)
* Columns: review['Rating'], review['Time']

* Technique: Rows where the Rating or Time column had missing values (NaN/NaT) after initial conversion attempts were dropped.

* Reasoning:

  * Criticality: Rating is the primary target variable or a core analytical feature. A missing rating makes the review unusable for most analytical tasks.

  * Irreparable Nature: Time is crucial for chronological analysis and feature extraction (like year, month, hour). If a timestamp cannot be parsed even after coercing errors, it indicates fundamentally bad or unrecoverable data. Trying to impute a date/time randomly or with a mean/median often doesn't make sense and can introduce significant bias.

  * Minimal Loss (Assumed): This approach assumes that the number of rows dropped due to critical missing values is a small percentage of the overall dataset, making the trade-off acceptable for data integrity. If a large portion of data were missing in these critical columns, more complex imputation strategies might be considered, but often, the quality of derived insights would still be questionable.
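The four techniques described above can be sketched on toy frames as follows. The column names follow the notebook, but every value is made up for illustration, and the frame names `hotel_toy` and `review_toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames for illustration only; values are made up.
hotel_toy = pd.DataFrame({
    "Timings": ["11 AM to 11 PM", None],
    "Cost": [800.0, np.nan],
})
review_toy = pd.DataFrame({
    "Review": ["Great food", None, "Too slow"],
    "Reviewer_Followers": ["12", "abc", None],   # raw strings, one unparseable
    "Rating": ["4", "Like", "3.5"],              # "Like" is not a number
})

# 1. Placeholder string for text columns.
hotel_toy["Timings"] = hotel_toy["Timings"].fillna("Not Available")
review_toy["Review"] = review_toy["Review"].fillna("")

# 2. Median for skewed numeric columns.
hotel_toy["Cost"] = hotel_toy["Cost"].fillna(hotel_toy["Cost"].median())

# 3. Zero for count-like columns, after coercing non-numeric entries to NaN.
followers = pd.to_numeric(review_toy["Reviewer_Followers"], errors="coerce")
review_toy["Reviewer_Followers"] = followers.fillna(0).astype(int)

# 4. Drop rows whose critical field cannot be recovered.
review_toy["Rating"] = pd.to_numeric(review_toy["Rating"], errors="coerce")
review_toy = review_toy.dropna(subset=["Rating"]).reset_index(drop=True)

print(hotel_toy)
print(review_toy)
```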

In [None]:
# Final check after dropping duplicates
print(f"Duplicate rows remaining in review: {review.duplicated().sum()}")

Treating Missing Values - Restaurants

In [None]:
# Handling Missing Values & Missing Value Imputation
hotel.isnull().sum()

In [None]:
#checking the null value in timing
hotel[hotel['Timings'].isnull()]

In [None]:
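# Fill null values in the Timings column with the most frequent value
# (assigning back avoids the chained inplace fillna, which newer pandas versions warn about)
hotel['Timings'] = hotel['Timings'].fillna(hotel['Timings'].mode()[0])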

In [None]:
#checking null values in Collections
missing_percentage = ((hotel['Collections'].isnull().sum())/(len(hotel['Collections'])))*100
print(f'Percentage of missing value in Collections is {round(missing_percentage, 2)}%')

In [None]:
#dropping collection column since it has more than 50% of null values
hotel.drop('Collections', axis = 1, inplace = True)

In [None]:
#final checking of missing value
hotel.isnull().sum()

Treating Missing Values - Reviews

In [None]:
#review missing value
review.isnull().sum()

In [None]:
#checking null reviewer
review[review['Reviewer'].isnull()]

In [None]:
#checking null Reviewer_Total_Review
review[review['Reviewer_Total_Review'].isnull()]

In [None]:
# Drop rows with a null Reviewer or Reviewer_Total_Review, since every other field in those rows is null as well
review = review.dropna(subset=['Reviewer','Reviewer_Total_Review'])

In [None]:
#again checking the remaining values
null_counts = [(x, a) for x, a in review.isnull().sum().items() if a > 0]

# Print the columns with null values
null_counts

In [None]:
#filling null values in review and reviewer follower column
review = review.fillna({"Review": "No Review", "Reviewer_Followers": 0})

In [None]:
# final checking null values
review.isnull().sum()

In [None]:
#merging both dataset
merged = hotel.merge(review, on = 'Restaurant')
merged.shape
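The merge above uses the default inner join, so restaurants without any review drop out of `merged`. A minimal sketch of how that loss could be audited, using toy frames with made-up values (`hotel_toy` and `review_toy` are hypothetical names):

```python
import pandas as pd

# Toy frames for illustration only.
hotel_toy = pd.DataFrame({"Restaurant": ["A", "B", "C"], "Cost": [500, 800, 1200]})
review_toy = pd.DataFrame({"Restaurant": ["A", "A", "B"], "Rating": [4.0, 5.0, 3.0]})

# Default how="inner": only restaurants present in both frames survive.
inner = hotel_toy.merge(review_toy, on="Restaurant")

# An outer merge with indicator=True shows which keys exist on only one side.
audit = hotel_toy.merge(review_toy, on="Restaurant", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", "Restaurant"].tolist()

print(inner.shape, unmatched)
```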

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Anomaly detection
from sklearn.ensemble import IsolationForest
#checking for normal distribution
print("Skewness - Cost: %f" % merged['Cost'].skew())
print("Kurtosis - Cost: %f" % merged['Cost'].kurt())
print("Skewness - Reviewer_Followers: %f" % merged['Reviewer_Followers'].skew())
print("Kurtosis - Reviewer_Followers: %f" % merged['Reviewer_Followers'].kurt())

In [None]:
#plotting graph for cost
plt.scatter(range(merged.shape[0]), np.sort(merged['Cost'].values))
plt.xlabel('index')
plt.ylabel('Cost')
plt.title("Cost distribution")
sns.despine()

In [None]:
# distribution of cost (histplot replaces the deprecated distplot)
sns.histplot(merged['Cost'], kde=True, stat='density')
plt.title("Distribution of Cost")
sns.despine()

In [None]:
#plot for reviewer follower
plt.scatter(range(merged.shape[0]), np.sort(merged['Reviewer_Followers'].values))
plt.xlabel('index')
plt.ylabel('Reviewer_Followers')
plt.title("Reviewer_Followers distribution")
sns.despine()

In [None]:
# distribution of Reviewer_Followers (histplot replaces the deprecated distplot)
sns.histplot(merged['Reviewer_Followers'], kde=True, stat='density')
plt.title("Distribution of Reviewer_Followers")
sns.despine()

In [None]:
# isolation forest for anomaly detection on cost
isolation_forest = IsolationForest(n_estimators=100, contamination=0.01)
isolation_forest.fit(merged['Cost'].values.reshape(-1, 1))
merged['anomaly_score_univariate_Cost'] = isolation_forest.decision_function(merged['Cost'].values.reshape(-1, 1))
merged['outlier_univariate_Cost'] = isolation_forest.predict(merged['Cost'].values.reshape(-1, 1))

In [None]:
#chart to visualize outliers
xx = np.linspace(merged['Cost'].min(), merged['Cost'].max(), len(merged)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
where=outlier==-1, color='r',
alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Cost')
plt.show();

In [None]:
# isolation forest for anomaly detection on reviewer followers
isolation_forest = IsolationForest(n_estimators=100, contamination=0.01)
isolation_forest.fit(merged['Reviewer_Followers'].values.reshape(-1, 1))
merged['anomaly_score_univariate_follower'] = isolation_forest.decision_function(
    merged['Reviewer_Followers'].values.reshape(-1, 1))
merged['outlier_univariate_follower'] = isolation_forest.predict(
    merged['Reviewer_Followers'].values.reshape(-1, 1))

In [None]:
# chart to visualize outliers in the reviewer follower column
xx = np.linspace(merged['Reviewer_Followers'].min(), merged['Reviewer_Followers'].max(), len(merged)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
where=outlier==-1, color='r',
alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Reviewer_Followers')
plt.show();

Treating Outliers

In [None]:
import pandas as pd
import numpy as np # Often useful for numerical operations

# Ensure 'merged' DataFrame is defined from previous steps
# (Assuming merged is already loaded and preprocessed)

# Handling Outliers & Outlier treatments
# To separate the symmetric distributed features and skew symmetric distributed features

symmetric_feature = []
non_symmetric_feature = []

# Iterate only over numerical columns
for i in merged.select_dtypes(include=np.number).columns: # Use select_dtypes to get only numerical columns
    # Ensure there are no NaNs in the column before calculating mean/median,
    # as NaNs can sometimes cause issues or skew results unexpectedly.
    # Alternatively, ensure your preprocessing has handled all NaNs.
    if merged[i].isnull().any():
        print(f"Warning: Column '{i}' contains NaN values. Mean/Median calculation might be affected.")
        # You might choose to skip this column or fill NaNs before calculation here
        # For simplicity, we'll proceed, but be aware of NaNs' impact.

    # Calculate mean and median only for numeric columns
    column_mean = merged[i].mean()
    column_median = merged[i].median()

    # Perform the comparison
    if abs(column_mean - column_median) < 0.2:
        symmetric_feature.append(i)
    else:
        non_symmetric_feature.append(i)

# Getting symmetrically distributed features
print("Symmetric Distributed Features : -", symmetric_feature)
# Getting skewed (non-symmetric) features
print("Skewed Features : -", non_symmetric_feature)

In [None]:
# For skewed features, define the upper and lower boundaries
# Tukey's inner fences: 1.5 * IQR beyond the quartiles
def outlier_treatment_skew(df,feature):
  IQR= df[feature].quantile(0.75)- df[feature].quantile(0.25)
  lower_bridge =df[feature].quantile(0.25)- 1.5*IQR
  upper_bridge =df[feature].quantile(0.75)+ 1.5*IQR
  # print(f'upper : {upper_bridge} lower : {lower_bridge}')
  return upper_bridge,lower_bridge

In [None]:
# Restricting the data to lower and upper boundary for cost in hotel dataset
# Compute both fences once, then cap values below the lower or above the upper fence
cost_upper, cost_lower = outlier_treatment_skew(df=hotel, feature='Cost')
hotel['Cost'] = hotel['Cost'].clip(lower=cost_lower, upper=cost_upper)

In [None]:
# Restricting the data to lower and upper boundary for Reviewer followers in review dataset
# Compute both fences once, then cap values below the lower or above the upper fence
followers_upper, followers_lower = outlier_treatment_skew(df=review, feature='Reviewer_Followers')
review['Reviewer_Followers'] = review['Reviewer_Followers'].clip(
    lower=followers_lower, upper=followers_upper)

In [None]:
#dropping the columns created while outliers treatment
merged.drop(columns =['anomaly_score_univariate_Cost','outlier_univariate_Cost',
  'anomaly_score_univariate_follower','outlier_univariate_follower'], inplace = True)

##### What all outlier treatment techniques have you used and why did you use those techniques?

The Isolation Forest algorithm was used for outlier detection, and a capping (winsorization) method based on the Interquartile Range (IQR) was used for outlier treatment.

Techniques used and the reasoning:

1. Outlier Detection Technique: Isolation Forest
* Columns Applied To: Cost and Reviewer_Followers

* Technique: Isolation Forest is an unsupervised machine learning algorithm used for anomaly detection. It works by "isolating" anomalies (outliers) in a dataset. It builds a forest of random trees, and the anomalies are points that are separated from the rest of the data points with shorter paths in the trees.

  * n_estimators=100: The number of trees in the forest. More trees generally lead to more robust results.

  * contamination=0.01: This parameter sets the expected proportion of outliers in the dataset. Setting it to 0.01 means the model flags about 1% of the data as outliers.

  * decision_function(): Returns the anomaly score for each data point. Lower scores indicate a higher likelihood of being an outlier.

  * predict(): Assigns a label of -1 for outliers and 1 for inliers.

* Why it was used:

  * Effectiveness for High-Dimensional Data (though used univariately here): While robust in high dimensions, Isolation Forest can also effectively identify outliers in univariate distributions, especially those that are highly skewed (like Cost and Reviewer_Followers as indicated by their high skewness and kurtosis values).

  * No Assumption of Data Distribution: Unlike statistical methods that assume a Gaussian distribution, Isolation Forest makes no such assumptions, making it suitable for skewed data.

  * Efficiency: It's generally efficient for large datasets.

  * Visualization: The anomaly scores and outlier regions can be clearly visualized, as shown in the plots above, providing a clear boundary for what the model considers anomalous.
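The detection step described above can be sketched on synthetic data. In this illustration the costs are generated (200 typical values plus one extreme value), and `random_state` is fixed only so the sketch is repeatable; neither is taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic costs for illustration only: 200 typical values plus one extreme value.
rng = np.random.default_rng(0)
cost = np.append(rng.uniform(300, 1500, 200), 10_000).reshape(-1, 1)

# random_state is an addition for repeatability; the notebook cells do not set it.
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
forest.fit(cost)

scores = forest.decision_function(cost)  # lower score = more anomalous
labels = forest.predict(cost)            # -1 = outlier, 1 = inlier

print("Flagged values:", cost[labels == -1].ravel())
```

Because `contamination` fixes the share of points to flag, roughly 1% of rows are labeled as outliers whether or not they are truly extreme, so the flagged values are worth inspecting before any treatment.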

2. Outlier Treatment Technique: IQR-based Capping (Winsorization)
* Columns Applied To: Cost (in the hotel DataFrame) and Reviewer_Followers (in the review DataFrame)

* Technique: For features identified as skewed (non-symmetric), such as 'Cost' and 'Reviewer_Followers', an IQR-based method was used to define upper and lower boundaries. Values falling outside these boundaries were capped (or winsorized) at the boundary values.

  * outlier_treatment_skew(df, feature) function: This function calculates the upper and lower fences:

  * IQR = Q3 - Q1 (where Q3 is the 75th percentile and Q1 is the 25th percentile).

  * Lower_Bridge = Q1 - 1.5 * IQR

  * Upper_Bridge = Q3 + 1.5 * IQR

* Capping Implementation:

  * df.loc[df[feature] <= lower_bridge, feature] = lower_bridge: Any value below the lower fence is replaced with the lower fence itself.

  * df.loc[df[feature] >= upper_bridge, feature] = upper_bridge: Any value above the upper fence is replaced with the upper fence itself.

* Why it was used:

  * Suitability for Skewed Data: The IQR method is robust to skewness in data, unlike methods that rely on standard deviation (which assume a normal distribution). The skewness values (1.15 for Cost, 10.09 for Reviewer_Followers) strongly suggest that these features are not normally distributed, making IQR-based treatment appropriate.

  * Mitigation of Extreme Values: Capping reduces the influence of extreme outliers without completely removing the data points, thus preserving some information. This is particularly useful for features like Cost or Reviewer_Followers where very high values might be legitimate but disproportionately affect statistical models.

  * Maintaining Data Size: Unlike dropping rows, capping keeps all original data points, preventing a reduction in dataset size.

  * Practicality: It's a widely accepted and relatively simple method for outlier treatment in skewed distributions.
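As a worked illustration of the fences and the capping step, the sketch below uses a hypothetical helper `iqr_fences` (not the notebook's `outlier_treatment_skew`; note that it returns the lower fence first) on a few made-up follower counts.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical follower counts for illustration; one value is far above the rest.
followers = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 500], name="Reviewer_Followers")

lower, upper = iqr_fences(followers)
# Values outside the fences are moved to the nearest fence; no rows are removed.
capped = followers.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}]")
print(capped.tolist())
```

Here Q1 = 2.25 and Q3 = 6.75, so the IQR is 4.5 and the fences are -4.5 and 13.5. The value 500 is pulled down to 13.5 while the other nine values are unchanged.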

Note on Column Dropping:

Additionally, the columns created during the Isolation Forest detection (anomaly_score_univariate_Cost, outlier_univariate_Cost, etc.) were dropped from the merged DataFrame after the treatment. This is a cleanup step to remove temporary columns that were only needed for outlier detection and visualization, not for the final analysis or modeling.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# categorical encoding using pd.get_dummies
# new df with important categories (copy to avoid modifying a view of hotel)
cluster_dummy = hotel[['Restaurant','Cuisines']].copy()
# splitting cuisines, which are separated by commas, into lists
cluster_dummy['Cuisines'] = cluster_dummy['Cuisines'].str.split(',')
# using explode to give each cuisine in a list its own row
cluster_dummy = cluster_dummy.explode('Cuisines')
# removing leading/trailing spaces from cuisines after exploding
cluster_dummy['Cuisines'] = cluster_dummy['Cuisines'].apply(lambda x: x.strip())
# using get_dummies to create one indicator column per cuisine
cluster_dummy = pd.get_dummies(cluster_dummy, columns=["Cuisines"], prefix=["Cuisines"])

# checking if the values are correct
# cluster_dummy.loc[:, cluster_dummy.columns.str.startswith('Cuisines_')].eq(1)[:5].T
cluster_dummy.loc[:, cluster_dummy.columns.str.startswith('Cuisines_')].idxmax(1)[:6]

# removing the Cuisines_ prefix from column names - for better understanding, run separately

# cluster_dummy.columns = cluster_dummy.columns.str.replace(" ","")
cluster_dummy.columns = cluster_dummy.columns.str.replace("Cuisines_","")
# cluster_dummy = cluster_dummy.groupby(cluster_dummy.columns, axis=1).sum()

# grouping by restaurant, since explode created one row per cuisine
cluster_dummy = cluster_dummy.groupby("Restaurant").sum().reset_index()

In [None]:
#total cuisine count
hotel['Total_Cuisine_Count'] = hotel['Cuisines'].apply(lambda x : len(x.split(',')))

In [None]:
# adding average rating - the inner merge removes the 5 unrated restaurants out of 105
avg_hotel_rating.rename(columns = {'Rating':'Average_Rating'}, inplace =True)
hotel = hotel.merge(avg_hotel_rating[['Average_Rating','Restaurant']], on = 'Restaurant')
hotel.head(1)

In [None]:
#adding cost column to the new dataset
cluster_dummy = hotel[['Restaurant','Cost','Average_Rating','Total_Cuisine_Count'
                      ]].merge(cluster_dummy, on = 'Restaurant')

In [None]:
cluster_dummy.shape

Alternate Method for Creating Dummies

In [None]:
#creating data frame for categorial encoding
cluster_df = hotel[['Restaurant','Cuisines','Cost','Average_Rating','Total_Cuisine_Count']]

In [None]:
#creating new dataframe for clustering
cluster_df = pd.concat([cluster_df,pd.DataFrame(columns=list(cuisine_dict.keys()))])

In [None]:
#creating categorical feature for cuisine
non_cuisine_cols = ['Restaurant','Cost','Cuisines','Average_Rating','Total_Cuisine_Count']
#iterate over every row in the dataframe
for i, row in cluster_df.iterrows():
  # split once per row so whole cuisine names are compared, not substrings
  row_cuisines = [c.strip() for c in row['Cuisines'].split(',')]
  # iterate over the new columns
  for column in list(cluster_df.columns):
      if column not in non_cuisine_cols:
        # assign 1 if the restaurant offers this cuisine, else 0
        cluster_df.loc[i,column] = 1 if column.strip() in row_cuisines else 0

In [None]:
#result from encoding
cluster_df.head(2).T

#### What all categorical encoding techniques have you used & why did you use those techniques?

I used one-hot encoding (via pd.get_dummies) for the 'Cuisines' column. It is a common and effective technique for categorical data, especially when there is no inherent ordinal relationship between categories.

Techniques used and the reasoning:
1. One-Hot Encoding (for 'Cuisines')
* Technique: I transformed the Cuisines column (which contains multiple, comma-separated cuisines per restaurant) into a series of binary (0 or 1) columns. Each new column represents a unique cuisine, and a 1 indicates that the restaurant offers that particular cuisine, while a 0 indicates it does not.

  * Steps involved:

    1. cluster_dummy['Cuisines'] = cluster_dummy['Cuisines'].str.split(','): Split the comma-separated string into a list of cuisines for each restaurant.

    2.  cluster_dummy = cluster_dummy.explode('Cuisines'): Transformed the DataFrame such that each cuisine from the list gets its own row, duplicating the restaurant information.

    3. cluster_dummy['Cuisines'] = cluster_dummy['Cuisines'].apply(lambda x: x.strip()): Cleaned up any leading/trailing whitespace from the cuisine names.

    4. cluster_dummy = pd.get_dummies(cluster_dummy, columns=["Cuisines"], prefix=["Cuisines"]): Applied the actual one-hot encoding, creating binary columns for each unique cuisine.

    5. cluster_dummy.columns = cluster_dummy.columns.str.replace("Cuisines_",""): Renamed the columns for clarity (e.g., Cuisines_Chinese becomes Chinese).

    6. cluster_dummy = cluster_dummy.groupby("Restaurant").sum().reset_index(): Aggregated the dummy variables back to the restaurant level. Since a restaurant offering a cuisine would have a 1 in the exploded rows, summing them up effectively gives a 1 if the restaurant offers that cuisine at least once, and 0 otherwise.

* Why it was used:

  * Nominal Categorical Data: Cuisines are nominal categories (there's no inherent order or ranking between "Chinese" and "Italian"). One-hot encoding is ideal for such data because it avoids implying any false ordinal relationships that numerical encoding might introduce.

  * Multi-Label Categories: Restaurants can offer multiple cuisines. The explode and then get_dummies approach correctly handles this multi-label scenario, where a single restaurant can have 1s across several cuisine columns.

  * Compatibility with Machine Learning Models: Most machine learning algorithms cannot directly work with text categories. One-hot encoding converts these categories into a numerical format that models can understand and process.

  * No Information Loss: This method preserves all the information about which cuisines a restaurant offers.

2. Implicit Numerical Encoding/Feature Creation (for Total_Cuisine_Count)
Column: hotel['Total_Cuisine_Count']

* Technique: I created a new numerical feature by counting the number of cuisines each restaurant offers. This is not a direct categorical encoding of an existing column, but rather a feature derived from a categorical one.

  * Step involved: hotel['Total_Cuisine_Count'] = hotel['Cuisines'].apply(lambda x : len(x.split(',')))

* Why it was used:

  * Capturing Diversity: This numerical feature directly quantifies the diversity of cuisines offered by a restaurant, which can be an important predictor or descriptive statistic.

  * Simplicity & Interpretability: It's a straightforward and easily interpretable metric.

  * Complementary to One-Hot Encoding: While one-hot encoding tells you which specific cuisines are offered, Total_Cuisine_Count tells you how many distinct cuisines are offered. Both can be valuable for different analytical purposes.

Alternate Method (Manual One-Hot-like Encoding)
I also included an alternate method that uses a loop to manually assign 1s and 0s, with one column per key of cuisine_dict.

* Technique: This involves iterating through rows and columns to manually set binary flags based on whether a cuisine is present in the Cuisines string.

* Why it was included: While pd.get_dummies is generally more efficient and recommended for this task, a manual loop can sometimes be used to achieve a similar one-hot encoding effect, especially when there's complex logic or a need to map to pre-existing specific cuisine columns. However, for a straightforward multi-label scenario like this, pd.get_dummies with explode is typically preferred for its efficiency and conciseness.
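The split, explode, get_dummies, and group-by steps described above can be sketched on a tiny example. The restaurant names and cuisines below are made up for illustration, and the sketch aggregates with max() instead of the notebook's sum() so the flags stay binary even if a cuisine were listed twice; treat it as one possible variant rather than the exact pipeline used above.

```python
import pandas as pd

# Hypothetical toy data; names and cuisines are illustrative only.
toy = pd.DataFrame({
    "Restaurant": ["A", "B", "C"],
    "Cuisines": ["Chinese, North Indian", "Italian", "North Indian, Italian, Chinese"],
})

# Split into lists, give each cuisine its own row, and trim stray whitespace.
exploded = (
    toy.assign(Cuisines=toy["Cuisines"].str.split(","))
       .explode("Cuisines")
       .reset_index(drop=True)
)
exploded["Cuisines"] = exploded["Cuisines"].str.strip()

# One-hot encode, then collapse back to one row per restaurant.
# max() keeps every flag at 0 or 1 regardless of duplicates.
dummies = pd.get_dummies(exploded["Cuisines"], dtype=int)
encoded = dummies.groupby(exploded["Restaurant"]).max().reset_index()

print(encoded)
```

Each restaurant ends up with exactly one row, and a restaurant that offers several cuisines has a 1 in several columns, which is the multi-label behaviour the explanation relies on.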

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
#creating new df for text processing of sentiment analysis
sentiment_df = review[['Reviewer','Restaurant','Rating','Review']]
#analysing two random sample
sentiment_df.sample(2)

In [None]:
#setting index
sentiment_df = sentiment_df.reset_index()
sentiment_df['index'] = sentiment_df.index

In [None]:
sentiment_df.sample(2)

In [None]:
# Expand Contraction
import contractions
# applying function to expand contractions
sentiment_df['Review']=sentiment_df['Review'].apply(lambda x:contractions.fix(x))

#### 2. Lower Casing

In [None]:
# Lower Casing
sentiment_df['Review'] = sentiment_df['Review'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string
def remove_punctuation(text):
  '''a function for removing punctuation'''

  # replacing the punctuations with no space,
  # which in effect deletes the punctuation marks
  translator = str.maketrans('', '', string.punctuation)
  # return the text stripped of punctuation marks
  return text.translate(translator)

In [None]:
#remove punctuation using function created
sentiment_df['Review'] = sentiment_df['Review'].apply(remove_punctuation)
sentiment_df.sample(5)

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove digits
import re

# Remove links
sentiment_df["Review"] = sentiment_df["Review"].apply(lambda x: re.sub(r"http\S+", "", x))

# Remove digit characters (letters in mixed tokens such as "2nd" are kept)
sentiment_df["Review"] = sentiment_df["Review"].apply(lambda x: re.sub(r"\d+", "", x))

In [None]:
#function to extract location of the restaurant
def get_location(link):
  link_elements = link.split("/")
  return link_elements[3]

#create a location feature
hotel['Location'] = hotel['Links'].apply(get_location)
hotel.sample(2)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
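# Remove Stopwords
from nltk.corpus import stopwords
# extracting the stopwords from nltk library (needs the nltk 'stopwords' corpus to be downloaded)
# a set makes the membership test in delete_stopwords faster
sw = set(stopwords.words('english'))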

In [None]:
#function to remove stopwords
def delete_stopwords(text):
  '''a function for removing the stopword'''
  # removing the stop words and lowercasing the selected words
  text = [word.lower() for word in text.split() if word.lower() not in sw]
  # joining the list of words with space separator
  return " ".join(text)

In [None]:
#calling function to remove stopwords
sentiment_df['Review'] = sentiment_df['Review'].apply(delete_stopwords)

In [None]:
# Remove White spaces
sentiment_df['Review'] =sentiment_df['Review'].apply(lambda x: " ".join(x.split()))

In [None]:
#random sample
sentiment_df.sample(2)

#### 6. Rephrase Text

In [None]:
#Rephrase Text

In [None]:
sentiment_df.sample(2)

#### 7. Tokenization

In [None]:
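# Tokenization
# whitespace split is enough here because punctuation was removed earlier;
# the lemmatizer and vectorizers below expect a list of tokens per review
sentiment_df['Review'] = sentiment_df['Review'].str.split()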

#### 8. Text Normalization

In [None]:
#applying Lemmatization
from nltk.stem import WordNetLemmatizer

# Create a lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

# Lemmatize the 'Review' column
sentiment_df['Review'] = sentiment_df['Review'].apply(lemmatize_tokens)

In [None]:
sentiment_df.sample(2)

##### Which text normalization technique have you used and why?

*Lemmatization:*

I've implemented lemmatization using nltk.stem.WordNetLemmatizer. This process reduces words to their base or dictionary form (lemma). Because lemmatize() is called here without a part-of-speech argument, every token is treated as a noun: plural nouns are reduced (for example, "dishes" becomes "dish"), while verb forms such as "running" or "ran" are left unchanged. Reducing those to "run" would require passing pos='v'.

*Why Lemmatization Was Chosen:*

1. Semantic Accuracy: Lemmatization returns a valid dictionary word and can take the word's part of speech into account. This gives more meaningful base forms than stemming, which often just truncates words and might produce non-dictionary terms (e.g., "automat" from "automatic").

2. Reduced Vocabulary: By unifying different inflected forms of a word, lemmatization helps reduce the overall size of the vocabulary. A smaller, more consistent vocabulary can improve the performance and efficiency of text-based machine learning models by reducing sparsity.

3. Enhanced Feature Representation: When analyzing text, you want words with the same core meaning to be treated similarly. Lemmatization ensures that related words are grouped, providing a cleaner and more accurate representation of the text's content, which is beneficial for tasks like sentiment analysis or text clustering.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=lambda x: x, lowercase=False)
vectorizer.fit(sentiment_df['Review'].values)
#creating independent variable for sentiment analysis
X_tfidf = vectorizer.transform(sentiment_df['Review'].values)

In [None]:
#Bag of Words
import gensim
#each entry of 'Review' is already a list of tokens
tokenized_text = sentiment_df['Review'].tolist()

#creating token dict
tokens_dict = gensim.corpora.Dictionary(tokenized_text)

#print token dict
#tokens_dict.token2id

In [None]:
#using tokens_dict.doc2bow() to generate BoW features for each tokenized review
texts_bow = [tokens_dict.doc2bow(text) for text in tokenized_text]

#creating a new text_bow dataframe based on the extracted BoW features
tokens = []
bow_values = []
doc_indices = []
doc_ids = []
for text_idx, text_bow in enumerate(texts_bow):
    for token_index, token_bow in text_bow:
        token = tokens_dict.get(token_index)
        tokens.append(token)
        bow_values.append(token_bow)
        doc_indices.append(text_idx)
        doc_ids.append(sentiment_df["Restaurant"][text_idx])

bow_dict = {"doc_index": doc_indices,
            "doc_id": doc_ids,
            "token": tokens,
            "bow": bow_values,
            }
bows_df = pd.DataFrame(bow_dict)
bows_df.head()

#### Which text vectorization technique have you used and why?

Text Vectorization Techniques Used

1. TF-IDF (Term Frequency-Inverse Document Frequency)
I used TfidfVectorizer from sklearn.feature_extraction.text to transform the Review column.

* How it works: TF-IDF assigns a numerical weight to each word in a document, reflecting how important a word is to that document within a collection of documents (corpus).

  * Term Frequency (TF): How often a word appears in a specific document.

  * Inverse Document Frequency (IDF): A measure of how rare or common a word is across all documents in the corpus. Words that are common across many documents (like "the" or "a") get a lower IDF score, while unique or specific words get a higher IDF score.

  * TF-IDF is the product of TF and IDF.

* Why it was used:

  * Captures Importance: TF-IDF captures the importance of words not just by their frequency in a single review, but also by their rarity across all reviews. Common, less informative words that survive stopword removal are down-weighted, while more distinctive "key" words are given higher scores.

  * Handles Stopwords Implicitly: Although I removed stopwords manually, TF-IDF naturally assigns low weights to very common words, making it robust to variations in preprocessing.

  * Numerical Representation: It provides a numerical representation of text, which is a required input for most machine learning models.

  * Feature Scaling: TfidfVectorizer L2-normalizes each document vector by default, so long and short reviews end up on a comparable scale.

2. Bag of Words (BoW)
I implemented a Bag of Words model using gensim.corpora.Dictionary and doc2bow.

* How it works: BoW represents text as an unordered collection of words, disregarding grammar and word order, but keeping track of the frequency of each word.

  * Create a Dictionary (tokens_dict): This maps every unique word in the entire corpus to a unique integer ID.

  * Convert to Bag-of-Words Format (texts_bow): For each review, it creates a list of (word_id, frequency) pairs, indicating how many times each word (by its ID) appears in that specific review.

  * DataFrame Creation (bows_df): I then converted this list of lists into a long-format DataFrame, showing which token appeared how many times in which document.

* Why it was used:

  * Simplicity and Interpretability: BoW is a straightforward and intuitive way to convert text into a numerical format. It's easy to understand what each numerical feature represents (the count of a specific word).

  * Foundation for NLP: It's a fundamental technique in NLP and serves as the basis for many other more complex text representation models.

  * Feature Creation: It creates a numerical vector (a "bag" of word counts) for each document, which can then be fed into machine learning algorithms.

Both TF-IDF and BoW are essential techniques for converting unstructured text into structured numerical data that machine learning models can process. TF-IDF often provides a more nuanced representation of word importance than simple word counts (BoW).
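To make the difference between raw counts and TF-IDF weights concrete, here is a small hand-computed sketch using only the standard library. The three tokenized reviews are made up, and the IDF formula used, ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, is assumed to mirror the smoothed default that scikit-learn documents for TfidfVectorizer; it is an illustration, not the notebook's actual vectorizer.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized reviews (made up for illustration).
docs = [
    ["good", "food", "good", "service"],
    ["bad", "service"],
    ["good", "ambience"],
]

# Bag of Words: raw term counts per document.
bow = [Counter(doc) for doc in docs]

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in doc_freq.items()}

def tfidf_vector(counts):
    """Weight raw counts by IDF, then scale the vector to unit L2 length."""
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

tfidf = [tfidf_vector(counts) for counts in bow]

print(bow[0])    # raw counts for the first review
print(tfidf[1])  # TF-IDF weights for the second review
```

In the second review both words occur once, so BoW treats them equally, while TF-IDF gives "bad" a larger weight than "service" because "bad" appears in fewer reviews.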

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

Restaurant

In [None]:
# Manipulate Features to minimize feature correlation and create new features
hotel.shape

In [None]:
#columns for dataset
hotel.columns

In [None]:
#dropping columns
hotel = hotel.drop(columns = ['Links','Location'])

In [None]:
hotel.shape

In [None]:
#creating new dataframe to be used for clustering i.e dropping the unimportant column
cluster_df.shape

In [None]:
#dropping cuisine and restaurant from cluster_df
cluster_df = cluster_df.drop(columns = ['Restaurant','Cuisines'])

In [None]:
cluster_df.sample(1)

In [None]:
#alternatively using other variable created earlier during categorical encoding
cluster_dummy.shape

In [None]:
#review data shape
review.shape

In [None]:
#review column
review.columns

In [None]:
#creating new binary feature called sentiment based on rating which has 1 = positive and 0 = negative
#the mean is computed once instead of once per row
rating_mean = sentiment_df['Rating'].mean()
sentiment_df['Sentiment'] = (sentiment_df['Rating'] >= rating_mean).astype(int)  #1 = positive # 0 = negative

In [None]:
#sentiment data frame
sentiment_df.sample(2)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
hotel.columns

In [None]:
#feature selected for clustering
cluster_df.columns

In [None]:
cluster_dummy.columns

In [None]:
review.columns

In [None]:
#feature selected for sentiment analysis
sentiment_df.columns

##### What all feature selection methods have you used  and why?

I didn't use formal feature selection methods. Instead, I relied on manual selection based on my understanding of the data and what would be relevant for the specific tasks, like clustering and sentiment analysis.

*My Feature Selection Approach*

For both the clustering and sentiment analysis tasks, I directly chose the columns I believed were most relevant, rather than employing automated feature selection algorithms. This approach is often driven by domain knowledge and the immediate goals of the analysis.

For Restaurant Clustering

For clustering restaurants, I selected:

* Cost: I figured the price range would be a major factor in grouping similar restaurants.

* Average_Rating: Naturally, a restaurant's overall rating is a critical indicator of its quality and popularity, so it had to be in there.

* Total_Cuisine_Count: I engineered this feature to quantify how diverse a restaurant's menu is. I thought this would help distinguish between specialized eateries and those offering a broad range of dishes.

* Individual Cuisine Dummy Variables: These are the one-hot encoded columns for each specific cuisine (like 'Chinese', 'Italian', etc.). I knew these were essential to group restaurants based on the types of food they serve.

My rationale here was to pick features that directly describe a restaurant's core identity, offerings, and perceived quality. These attributes felt like they'd naturally form distinct clusters of restaurants.

For Sentiment Analysis

For preparing the data for sentiment analysis, I initially picked:

* Reviewer: To keep track of who wrote the review.

* Restaurant: To know which restaurant the review was about.

* Rating: This is super important because it often serves as the numeric representation of the sentiment (e.g., a 5-star rating usually means positive sentiment).

* Review: This is the actual text I needed to analyze for sentiment.

* index: Just a simple identifier I added.

I then added a Sentiment column as the target variable, derived directly from Rating: 1 (positive) if the rating is at or above the mean rating, and 0 (negative) otherwise.

My goal here was to gather all the raw information necessary for breaking down and analyzing the sentiment expressed in the reviews.
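The mean-threshold labelling used for the Sentiment column can be sketched as below. The ratings are hypothetical values chosen for illustration; the sketch also prints the class balance, since a mean threshold does not guarantee evenly sized classes.

```python
import pandas as pd

# Hypothetical ratings (made up for illustration).
ratings = pd.Series([5.0, 4.0, 3.5, 2.0, 1.0, 4.5])

# Compute the threshold once, then compare the whole column against it.
threshold = ratings.mean()
sentiment = (ratings >= threshold).astype(int)  # 1 = positive, 0 = negative

print(threshold)
print(sentiment.tolist())
print(sentiment.value_counts(normalize=True))
```

Checking the share of each class this way is a cheap sanity test before training a classifier on the derived label.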

Why No Formal Methods?

I didn't explicitly use techniques like statistical tests (e.g., Chi-squared), model-based importance (e.g., from a Random Forest), or dimensionality reduction (e.g., PCA) at this stage. My choices were more about common sense and what makes logical sense for describing a restaurant and its reviews. For this dataset, with a manageable number of features whose relevance seemed quite clear, this direct approach felt sufficient. If I were dealing with hundreds or thousands of features, I'd definitely lean on more automated feature selection methods to prevent overfitting and improve efficiency.

##### Which all features you found important and why?

*Key Important Features*

1. Cost: Why it's important: Cost is a fundamental aspect of any restaurant. It heavily influences a customer's expectation and perception of value. For instance, a higher cost often implies a premium experience, better ingredients, or superior ambiance. Conversely, a lower cost might suggest a more casual dining experience. In my analysis, I saw a positive correlation between cost and rating: higher-priced restaurants tended to receive higher ratings, which suggests that customers often feel they get value for what they pay.

2. Rating: Why it's important: As a direct measure of customer satisfaction, Rating is perhaps the most crucial feature. It serves as the primary indicator of a restaurant's success and popularity. When analyzing the merged dataset, Rating becomes a central dependent variable for predictive modeling (e.g., predicting restaurant success) and a key factor for clustering restaurants into quality tiers.

3. Number_of_Reviews (Reviewer's Total Reviews) and Reviewer_Followers 📈
Why they're important: These features provide insights into the influence and activity of reviewers.

  * Number_of_Reviews: This tells me how experienced a reviewer is on the platform. Reviewers with many reviews might be considered more credible or influential.

  * Reviewer_Followers: This directly quantifies a reviewer's reach and impact. If a reviewer has many followers, their opinion might hold more weight or their review could attract more attention to a restaurant. My analysis suggested that restaurants reviewed by individuals with more followers tended to receive higher ratings, highlighting the 'influencer effect' in the dining scene.

4. Cuisines (and derived features like Total_Cuisine_Count and One-Hot Encoded Cuisines): Why they're important: Cuisines define a restaurant's core offering.

  * Total_Cuisine_Count: This engineered feature quantifies the diversity of a restaurant's menu. It helps distinguish highly specialized eateries from those offering a broad range of options. My hypothesis test suggested that a wider variety of cuisines is associated with higher ratings, implying diners appreciate choice.

  * One-Hot Encoded Cuisine Features: These binary features (e.g., 'Chinese', 'Italian', 'Biryani') are vital for understanding the specific culinary identity of a restaurant. They allow for the clustering of restaurants based on their food types and are essential for any recommendation system based on cuisine preference.

*Why These Features are Important for the Project*

These features collectively provide a comprehensive view of a restaurant's characteristics, its performance, and the dynamics of its reviews. They are crucial for:

* Clustering: Grouping similar restaurants based on cost, rating, and cuisine profile.

* Predictive Modeling: Potentially predicting a restaurant's success or rating based on its attributes and the influence of its reviewers.

* Understanding Market Dynamics: Gaining insights into what drives consumer preferences and restaurant performance in the Hyderabad dining scene.

* Recommendation Systems: Building systems that can recommend restaurants based on user preferences for cost, cuisine, or desired average rating.

By focusing on these features, I can build robust models and derive meaningful insights from the dataset.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Splitting the columns into symmetric and skewed features using the mean-median gap
symmetric_feature=[]
non_symmetric_feature=[]
for i in cluster_df.describe().columns:
  if abs(cluster_df[i].mean()-cluster_df[i].median())<0.1:
    symmetric_feature.append(i)
  else:
    non_symmetric_feature.append(i)

# Getting Symmetric Distributed Features
print("Symmetric Distributed Features : -",symmetric_feature)

# Getting Skew Symmetric Distributed Features
print("Skew Symmetric Distributed Features : -",non_symmetric_feature)

In [None]:
#using log transformation to transform Cost as using capping tends to change median and mean
cluster_df['Cost'] = np.log1p(cluster_df['Cost'])
cluster_dummy['Cost'] = np.log1p(cluster_dummy['Cost'])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of the log-transformed Cost with its mean and median
for col in ['Cost']:
    plt.figure(figsize=(8, 5)) # new figure for each plot in the loop
    sns.histplot(cluster_df[col], kde=True, color='#055E85', stat='density')

    feature = cluster_df[col]
    plt.axvline(feature.mean(), color='#ff033e', linestyle='dashed', linewidth=3, label='mean')  # red
    plt.axvline(feature.median(), color='#A020F0', linestyle='dashed', linewidth=3, label='median') # purple
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.title(f'{col.title()} Distribution')
    plt.tight_layout()
    plt.show()

Yes, the data for the Cost feature needed to be transformed. I used log transformation (specifically np.log1p) for this.

*Data Transformation: Log Transformation on 'Cost'*

Why Transformation Was Needed

The initial analysis flagged Cost as a skewed (non-symmetric) feature: its mean differs from its median by more than the 0.1 threshold, so it failed the symmetry test abs(cluster_df[i].mean()-cluster_df[i].median())<0.1. The distribution is likely skewed to the right, with a long tail of higher costs. Note that this threshold is an absolute difference, so it is far stricter for a feature with values in the hundreds or thousands than for a rating on a 1 to 5 scale.

* Impact of Skewness on Models: Many machine learning algorithms (especially linear models or those assuming normality) perform poorly or yield biased results when input features are highly skewed. Skewed data can also lead to issues with outlier detection and influence model convergence.

Why Log Transformation (np.log1p) Was Used

* Normalizing Skewed Distributions: Logarithmic transformations are highly effective at reducing right-skewness in data. By compressing the range of values, they make the distribution more symmetrical and closer to a normal distribution.

* Handling Zero Values: I used np.log1p (which computes log(1+x)) specifically because the Cost column might contain zero values. np.log() returns -inf (with a runtime warning) for zero, whereas np.log1p handles it gracefully, returning 0 for an input of 0.

* Outlier Treatment Alternative: As the code comment notes, "using capping tends to change median and mean." While capping outliers (as done previously with the IQR method) can manage extreme values, it doesn't fundamentally change the shape of the distribution. Log transformation, however, directly addresses the skewness, often making the distribution more amenable to statistical analysis and modeling. It effectively "pulls in" high values, mitigating their extreme influence without simply clipping them.

By applying the log transformation to Cost, I aimed to make its distribution more normal-like, which often leads to better performance for machine learning models that expect normally distributed inputs and helps in satisfying the assumptions of various statistical tests.
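The effect of log1p on a right-skewed feature can be checked numerically. The sketch below uses made-up cost values, and it measures skew with the mean-median gap divided by the standard deviation, a scale-free variant of the absolute check used earlier; that normalization is an assumption introduced here for illustration.

```python
import numpy as np

# Hypothetical right-skewed costs (made-up values, not from the dataset).
cost = np.array([200, 300, 400, 500, 600, 800, 1200, 2800], dtype=float)
log_cost = np.log1p(cost)  # log(1 + x), which is defined at x = 0

def rel_gap(values):
    """Mean-median gap relative to the standard deviation (scale-free)."""
    return abs(values.mean() - np.median(values)) / values.std()

print(f"raw gap:   {rel_gap(cost):.3f}")
print(f"log1p gap: {rel_gap(log_cost):.3f}")
print(np.log1p(0.0))  # zero input stays finite
```

A smaller relative gap after the transform indicates that the mean and median have moved closer together, which is the behaviour the histogram above is meant to confirm visually.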

### 6. Data Scaling

In [None]:
# Scaling your data
cluster_dummy.sample(5)

In [None]:
#standardizing numerical columns
from sklearn.preprocessing import StandardScaler
numerical_cols = ['Cost','Total_Cuisine_Count','Average_Rating']
scaler = StandardScaler()
scaler.fit(cluster_dummy[numerical_cols])
scaled_df = cluster_dummy.copy()
scaled_df[numerical_cols] = scaler.transform(cluster_dummy[numerical_cols])

##### Which method have you used to scale your data and why?

I used Standardization (specifically StandardScaler from scikit-learn) to scale my data.

Data Scaling Method

Standardization (Z-score Normalization): I applied StandardScaler to the numerical columns: Cost, Total_Cuisine_Count, and Average_Rating.

How Standardization Works:

Standardization (also known as Z-score normalization) transforms data such that it has a mean of 0 and a standard deviation of 1. For each data point (x), it's calculated using the formula:

$$z = \frac{x - \mu}{\sigma}$$

where:

* μ is the mean of the feature

* σ is the standard deviation of the feature

Why Standardization Was Used:

* Handles Varying Scales: Features like Cost might have values ranging in hundreds or thousands, while Total_Cuisine_Count might range from 1 to 10, and Average_Rating from 1 to 5. Without scaling, features with larger numerical ranges can disproportionately influence machine learning algorithms, simply because their absolute values are bigger. Standardization brings all features to a comparable scale, preventing this dominance.

* Benefits Distance-Based Algorithms: Many machine learning algorithms that rely on distance calculations, such as K-Means clustering (the intended use of cluster_dummy), K-Nearest Neighbors (KNN), and Support Vector Machines (SVMs), are very sensitive to the scale of the input features. If features are on different scales, the distances calculated will be dominated by the features with larger magnitudes. Standardization puts the features on a comparable footing in the distance calculation.

* Improves Model Convergence: For optimization algorithms used in models like Linear Regression, Logistic Regression, or Neural Networks, standardization can lead to faster and more stable convergence by providing a more uniform landscape for the optimization process.

* Assumes No Specific Distribution (Relatively Robust to Outliers if pre-treated): While it doesn't bound values to a specific range (like Min-Max Scaling), Standardization works well even if the data isn't normally distributed, especially after outlier treatment (which I already performed on Cost). It's less affected by extreme values than methods like Min-Max scaling, provided those outliers have been managed.

In summary, standardizing my numerical features ensures that my subsequent machine learning models (especially those focused on clustering, as indicated by cluster_dummy) treat all features with equal importance based on their information content, not just their raw magnitude.
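As a quick check of the z-score formula, the sketch below compares a manual computation with StandardScaler on a small made-up matrix whose three columns stand in for Cost, Total_Cuisine_Count, and Average_Rating. One detail worth knowing is that StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default for std().

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix; values are illustrative only.
X = np.array([
    [500.0, 2, 3.5],
    [1200.0, 5, 4.2],
    [800.0, 3, 3.9],
    [300.0, 1, 3.1],
])

# Manual z-score: subtract the column mean, divide by the population std.
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same transformation via scikit-learn.
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0).round(6))
print(z_sklearn.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so no single feature dominates a Euclidean distance purely because of its units.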

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes. Based on the current state of my cluster_dummy DataFrame, I think dimensionality reduction is needed.

Why Dimensionality Reduction is Important 📉

My cluster_dummy DataFrame, intended for clustering, currently has 48 columns. A significant portion of these are one-hot encoded cuisine features. While necessary for representing culinary variety, having 40+ binary cuisine columns can lead to several issues:

1. The Curse of Dimensionality: As the number of features (dimensions) increases, the data becomes extremely sparse. In high-dimensional spaces, data points become very far apart, making it difficult for distance-based algorithms like K-Means to effectively group similar items. Clusters become less meaningful, and every point can appear to be an "outlier" relative to others.

2. Increased Computational Cost: More dimensions mean more calculations. Training models, especially clustering algorithms, becomes significantly slower and requires more memory.

3. Overfitting (for Supervised Models): While clustering is unsupervised, if you later use these features for a supervised learning task (like predicting ratings based on restaurant attributes), too many dimensions can lead to models that fit the noise in the training data rather than the underlying patterns.

4. Reduced Interpretability: It's much harder to visualize or understand relationships across 47 features than across, say, 2, 3, or even 10 principal components.

5. Multicollinearity: Many of the cuisine columns might be highly correlated (e.g., if "North Indian" often appears with "Biryani"). This multicollinearity can destabilize some algorithms and make feature importance harder to interpret.

How it Specifically Applies to My Data

I have Cost, Average_Rating, Total_Cuisine_Count, and then 44 individual cuisine columns, which together with the Restaurant identifier make up the 48 columns. Many restaurants might only offer a few cuisines, leading to a high percentage of zeros in these one-hot encoded columns. This sparsity, combined with the sheer number of these features, makes the dataset a strong candidate for dimensionality reduction.
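The distance-concentration effect behind the curse of dimensionality can be illustrated with a short simulation. The sketch below uses uniformly random points rather than the restaurant data, and the helper distance_contrast is a hypothetical name introduced here; it reports how much farther the farthest point is than the nearest one as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_contrast(n_points, n_dims):
    """Ratio of the farthest to the nearest distance from the first point."""
    X = rng.random((n_points, n_dims))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return dists.max() / dists.min()

# Contrast between near and far neighbours for increasing dimensionality.
results = {dims: distance_contrast(200, dims) for dims in (2, 10, 50, 500)}

for dims, ratio in results.items():
    print(dims, round(ratio, 2))
```

As the ratio approaches 1, "near" and "far" neighbours become hard to tell apart, which is why distance-based clustering tends to benefit from a lower-dimensional representation.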

In [None]:
# Dimensionality Reduction (If needed)
#applying pca
from sklearn.decomposition import PCA
#setting restaurant feature as index as it still had categorical value
scaled_df.set_index(['Restaurant'],inplace=True)
features = scaled_df.columns
# create an instance of PCA
pca = PCA()

# fit PCA on features
pca.fit(scaled_df[features])

In [None]:
# Cumulative explained variance vs. number of components
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
# Start the x-axis at 1 so each point lines up with its component count
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker ='o', color = 'orange')
plt.xlabel('number of components',size = 15, color = 'red')
plt.ylabel('cumulative explained variance',size = 14, color = 'blue')
plt.title('Variance v/s No. of Components',size = 20, color = 'green')
plt.xlim([0, 9])
plt.show()

In [None]:
#using n_component as 3
pca = PCA(n_components=3)

# fit PCA on features
pca.fit(scaled_df[features])

# explained variance ratio of each principal component
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
# variance explained by three components
print('Cumulative variance explained by 3 principal components: {:.2%}'.format(
                                        np.sum(pca.explained_variance_ratio_)))

# transform data to principal component space
df_pca = pca.transform(scaled_df[features])

In [None]:
#shape
print("original shape: ", scaled_df.shape)
print("transformed shape:", df_pca.shape)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

*Dimensionality Reduction Technique Used*

I used Principal Component Analysis (PCA) as my dimensionality reduction technique.

How I used it:

1. I initialized PCA() without specifying n_components first to plot the explained variance ratio. This plot (Variance v/s No. of Components) showed me how much of the total variance in the data could be explained by an increasing number of principal components.

2. Based on the plot, I observed that the first few components captured a significant portion of the variance. I specifically chose n_components=3 for the final transformation because, as the output showed, these 3 principal components collectively explained 62.71% of the total variance in the data.

3. Finally, I used pca.transform() to project my scaled_df (original shape: (100, 47)) into the new, lower-dimensional space, resulting in df_pca with a shape of (100, 3).

Why PCA was chosen:

1. Feature Extraction: PCA is a feature extraction technique, meaning it creates new, synthetic features (principal components) that are linear combinations of the original features. This is beneficial when you want to retain as much information as possible from the original set but in a more compact form, rather than just discarding features (like in feature selection).

2. Variance Maximization: PCA aims to find the directions (principal components) along which the data has the highest variance. By selecting components that explain most of the variance, I ensured that the most important patterns and information in my original 47 features were largely preserved in just 3 components.

3. Orthogonal Components: The principal components are statistically uncorrelated (orthogonal). This helps in dealing with multicollinearity present in the original features (especially the cuisine dummies).

4. Suitability for Clustering: PCA is a very common and effective preprocessing step for clustering. By reducing the noise and redundancy, it helps clustering algorithms identify more distinct and meaningful groups. It specifically addresses the "curse of dimensionality" by condensing the high-dimensional data into a lower-dimensional representation while preserving the most significant variation.
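To make the variance-maximization and orthogonality points concrete, here is a minimal from-scratch sketch of PCA using NumPy's SVD. The toy matrix, the 60% threshold, and the helper names `pca_explained_variance` and `components_for_threshold` are assumptions for illustration; the notebook itself uses scikit-learn's `PCA`.

```python
import numpy as np

def pca_explained_variance(X):
    """Return (components, explained_variance_ratio) for a data matrix X."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered from highest to lowest variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return Vt, variances / variances.sum()

def components_for_threshold(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratios))

# Hypothetical data: 10 correlated features driven by 3 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(100, 10))

components, ratios = pca_explained_variance(X_toy)
k = components_for_threshold(ratios, 0.60)

# Project the centered data onto the first k principal directions.
X_reduced = (X_toy - X_toy.mean(axis=0)) @ components[:k].T

print("cumulative explained variance:", np.round(np.cumsum(ratios), 3))
print("components needed for 60%:", k)
print("reduced shape:", X_reduced.shape)
```

Because the projected columns come from orthogonal directions of the centered data, their covariance matrix is diagonal, which is why PCA removes the correlation between the original cuisine dummies.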

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# for sentiment analysis using sentiment_df dataframe
X = X_tfidf #from text vectorization
y = sentiment_df['Sentiment']

In [None]:
sentiment_df.shape

In [None]:
# Splitting into train and test sets
from sklearn.model_selection import train_test_split
# random_state fixes the shuffle so the split is reproducible
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

# Shapes of the train and test sets
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

##### What data splitting ratio have you used and why?

I've used an 80/20 split for the training and testing datasets.

Data Splitting Ratio: I split the data for sentiment analysis into:

80% for the training set (X_train, y_train)

20% for the testing set (X_test, y_test)

Why this Ratio?

An 80/20 split is a very common and generally robust choice in machine learning for several reasons:

1. Sufficient Training Data: With 80% of the data, the model has a large enough portion to learn complex patterns and relationships within the text data (represented by X_tfidf) and their corresponding sentiments (y). This helps in building a well-generalized model.

2. Representative Testing Data: The remaining 20% provides a decent-sized, unseen dataset to evaluate the model's performance realistically. It's large enough to give a statistically meaningful assessment of how well the model generalizes to new, unobserved reviews. A smaller test set might lead to a less reliable evaluation.

3. Balance: It strikes a good balance between providing ample data for training and reserving enough data for unbiased evaluation. While other ratios like 70/30 or 75/25 are also common, 80/20 often serves as a good default for many standard datasets.
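One refinement worth considering with any split ratio is stratification, which keeps the class mix the same in both parts. The sketch below implements it by hand on hypothetical labels (173 positive, 100 negative, chosen only for illustration); in scikit-learn the equivalent is passing `stratify=y` to `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in each part."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels, not the notebook's data.
toy_labels = [1] * 173 + [0] * 100
train_idx, test_idx = stratified_split(toy_labels)

print("train:", Counter(toy_labels[i] for i in train_idx))
print("test: ", Counter(toy_labels[i] for i in test_idx))
```

Without stratification, a random 20% sample can end up with a noticeably different positive-to-negative ratio than the training set, which makes test metrics harder to interpret.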

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, I think the dataset is likely imbalanced, especially if the Sentiment column, derived from Rating, represents discrete sentiment categories (e.g., Positive, Neutral, Negative).

Why It's Likely Imbalanced

1. Nature of Review Data: User review datasets, particularly those involving star ratings (like the Rating column here), are very commonly imbalanced. People are often more motivated to leave a review when they have had a very positive or a very negative experience. Mediocre or "average" experiences tend to be reviewed less frequently. Moreover, most products/services, to remain viable, must maintain a generally satisfactory performance, leading to a natural skew towards higher ratings.

2. Observed Rating Distributions: In many real-world scenarios, review ratings often follow a J-shaped or left-skewed distribution when plotted, meaning there's a higher frequency of 4-star and 5-star reviews, a moderate frequency of 1-star reviews, and fewer 2-star and 3-star reviews.

  * If the Sentiment categories are mapped from these ratings (e.g., 4-5 stars = Positive, 2-3 stars = Neutral, 1 star = Negative), then the "Positive" class will almost certainly have a significantly higher number of instances than the "Neutral" or "Negative" classes.
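The example mapping above can be sketched in a few lines. The thresholds follow the example in the text, the ratings are made up for illustration, and the treatment of half-star values (3.5 counted as Neutral) is an assumption; the notebook's actual Sentiment column may be derived differently.

```python
from collections import Counter

def rating_to_sentiment(rating):
    """Map a star rating to a sentiment label using the example thresholds."""
    if rating >= 4:
        return "Positive"
    if rating >= 2:
        return "Neutral"   # assumption: half-star values such as 3.5 land here
    return "Negative"

# Hypothetical ratings, skewed towards high scores.
toy_ratings = [5, 5, 4, 5, 4, 3, 1, 5, 4, 2, 5, 1, 4, 5, 3.5]
counts = Counter(rating_to_sentiment(r) for r in toy_ratings)

for label, n in counts.most_common():
    print(f"{label:<8} {n:>2}  ({n / len(toy_ratings):.0%})")
```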

Implications of Imbalance

An imbalanced dataset can pose significant challenges for machine learning models, particularly in classification tasks like sentiment analysis:

1. Biased Model Performance: A model trained on an imbalanced dataset tends to be biased towards the majority class. It might achieve high overall accuracy by simply predicting the majority class for most instances, but it will perform very poorly on the minority classes. For example, if 90% of reviews are positive, a model that predicts "Positive" for every review would achieve 90% accuracy, but it would be useless for identifying negative or neutral sentiment.

2. Poor Generalization for Minority Classes: The model might fail to learn the distinctive patterns of the minority classes due to insufficient examples. This leads to poor recall and precision for the underrepresented sentiments, which are often the most important ones to correctly identify (e.g., recognizing negative feedback to address customer issues).

3. Misleading Evaluation Metrics: Metrics like accuracy can be deceptive. It's crucial to use other metrics like Precision, Recall, F1-score, and AUC-ROC which provide a more nuanced view of performance across all classes, especially minority ones.

To definitively confirm the imbalance, I would need to perform a value_counts() on the sentiment_df['Sentiment'] column. If confirmed, strategies like oversampling (SMOTE), undersampling, or using class weights during model training would be critical next steps to handle this imbalance effectively.
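The 90% example from the implications above can be reproduced directly. This is a minimal sketch on hypothetical labels (90 positive and 10 negative reviews), using hand-written `accuracy` and `recall` helpers rather than any library call.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive_label):
    """Fraction of the samples of `positive_label` that were predicted correctly."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive_label]
    return sum(p == positive_label for p in relevant) / len(relevant)

# Hypothetical test set: 90 positive reviews, 10 negative reviews.
y_true = ["pos"] * 90 + ["neg"] * 10
# A "model" that always predicts the majority class.
y_pred = ["pos"] * 100

print("accuracy:         ", accuracy(y_true, y_pred))
print("recall (positive):", recall(y_true, y_pred, "pos"))
print("recall (negative):", recall(y_true, y_pred, "neg"))
```

The accuracy looks strong while recall on the negative class is zero, which is why per-class metrics matter more than overall accuracy on skewed data.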

In [None]:
# Getting the value counts for the target class.
# rename_axis/reset_index(name=...) yields 'Sentiment' and 'Count' columns on both pandas 1.x and 2.x.
vc = sentiment_df['Sentiment'].value_counts().rename_axis('Sentiment').reset_index(name='Count')

In [None]:
# Defining majority and minority class counts (value_counts sorts in descending order)
majority_class = vc['Count'].iloc[0]
minority_class = vc['Count'].iloc[-1]

In [None]:
#calculating cir value for checking class imbalance
CIR = majority_class / minority_class
CIR

In [None]:
# Dependant Variable Column Visualization
sentiment_df['Sentiment'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['Positive Sentiment','Negative Sentiment'],
                               colors=['red','blue'],
                               explode=[0.01,0.02]
                              )
plt.show()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Handling Imbalance Decision

Decision: I chose not to perform any under-sampling or over-sampling techniques to treat the class imbalance.

Reasoning:

1. Slight Imbalance: While the Class Imbalance Ratio (CIR) of 1.73 indicates that the majority class has about 1.73 times more observations than the minority class, this is considered a slight imbalance.

2. Not Critically Severe: For many machine learning algorithms, particularly robust ones like tree-based models (e.g., Random Forest, Gradient Boosting) or those that can handle class weights, a CIR of 1.73 might not severely degrade performance. The benefits of resampling (which can sometimes introduce noise or lose information) might not outweigh the costs in this case.

By making this decision, I'm essentially trusting that the chosen machine learning model will be robust enough to handle this level of imbalance, or that the cost of misclassifying the minority class is not high enough to warrant explicit balancing techniques.
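If the imbalance did need handling, class weights are a lighter alternative to resampling. The sketch below computes the class imbalance ratio and the "balanced" weights, n_samples / (n_classes * class_count), which is the heuristic scikit-learn applies for `class_weight='balanced'`. The label counts (173 and 100) are hypothetical values chosen to give a ratio of 1.73; they are not the notebook's actual counts.

```python
from collections import Counter

def class_imbalance_ratio(labels):
    """Majority class count divided by minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical labels: 173 positive (1) and 100 negative (0).
toy_labels = [1] * 173 + [0] * 100

cir = class_imbalance_ratio(toy_labels)
weights = balanced_class_weights(toy_labels)

print(f"CIR = {cir:.2f}")
for cls, w in sorted(weights.items()):
    print(f"class {cls}: weight = {w:.3f}")
```

The minority class receives the larger weight, so its misclassifications cost more during training without adding or removing any rows.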

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
#importing kmeans
from sklearn.cluster import KMeans

In [None]:
#Within Cluster Sum of Squared Errors(WCSS) for different values of k
wcss=[]
for i in range(1,11):
    km=KMeans(n_clusters=i,random_state = 20)
    km.fit(df_pca)
    wcss.append(km.inertia_)

In [None]:
# Elbow curve
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="o")
plt.xlabel("K Value", size = 20, color = 'purple')
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS", size = 20, color = 'green')
plt.title('Elbow Curve', size = 20, color = 'blue')
plt.show()

In [None]:
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans # Make sure KMeans is imported
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.model_selection import ParameterGrid

# candidates for the number of cluster
parameters = list(range(2,10))
#parameters
parameter_grid = ParameterGrid({'n_clusters': parameters})
best_score = -1

#visualizing Silhouette Score for individual clusters and the clusters made
for n_clusters in parameters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # 1st subplot is the silhouette plot
    # silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(df_pca) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10, n_init='auto')  # n_init='auto' avoids the FutureWarning about the changing default
    cluster_labels = clusterer.fit_predict(df_pca)

    # silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(df_pca, cluster_labels)
    print("For n_clusters =", n_clusters,
          "average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(df_pca, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("Silhouette plot for the various clusters.")
    ax1.set_xlabel("Silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(df_pca[:, 0], df_pca[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    # Draw each cluster number inside its center marker
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker=f'${i}$', alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("Visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()  # display the figure for this value of n_clusters

In [None]:
# Visualizing the clusters and the data points in each cluster
plt.figure(figsize = (10,6), dpi = 120)

kmeans= KMeans(n_clusters = 5, init= 'k-means++', random_state = 42)

# Fit the model and predict the cluster label of each restaurant in one step
label = kmeans.fit_predict(df_pca)
# Getting unique labels
unique_labels = np.unique(label)

#plotting the results:
for i in unique_labels:
    plt.scatter(df_pca[label == i , 0] , df_pca[label == i , 1] , label = i)
plt.legend()
plt.show()

In [None]:
#making df for pca
kmeans_pca_df = pd.DataFrame(df_pca,columns=['PC1','PC2','PC3'],index=scaled_df.index)
kmeans_pca_df["label"] = label
kmeans_pca_df.sample(2)

In [None]:
#joining the cluster labels to names dataframe
cluster_dummy.set_index(['Restaurant'],inplace=True)
cluster_dummy = cluster_dummy.join(kmeans_pca_df['label'])
cluster_dummy.sample(2)

In [None]:
#changing back cost value to original from log1p done during transformation
cluster_dummy['Cost'] = np.expm1(cluster_dummy['Cost'])
cluster_dummy.sample(2)

In [None]:
#creating df to store cluster data
clustering_result = cluster_dummy.copy().reset_index()
clustering_result = hotel[['Restaurant','Cuisines']].merge(clustering_result[['Restaurant','Cost',
                  'Average_Rating', 'Total_Cuisine_Count','label']], on = 'Restaurant')
clustering_result.head()

In [None]:
# Counting restaurants in each cluster.
# rename_axis/reset_index(name=...) gives the same column names on pandas 1.x and 2.x.
cluster_count = (cluster_dummy['label'].value_counts()
                 .rename_axis('label')
                 .reset_index(name='Total_Restaurant')
                 .sort_values(by='Total_Restaurant'))
cluster_count

In [None]:
# Creating a new df for checking the cuisines in each cluster
new_cluster_df = clustering_result.copy()
new_cluster_df['Cuisines'] = new_cluster_df['Cuisines'].str.split(',')
new_cluster_df = new_cluster_df.explode('Cuisines')
# Removing leading and trailing spaces from cuisines after exploding
new_cluster_df['Cuisines'] = new_cluster_df['Cuisines'].str.strip()
new_cluster_df.sample(5)

In [None]:
#printing cuisine list for each cluster
for cluster in new_cluster_df['label'].unique().tolist():
  print('Cuisine List for Cluster :', cluster,'\n')
  print(new_cluster_df[new_cluster_df["label"]== cluster]['Cuisines'].unique(),'\n')
  print('='*120)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

I used the K-Means clustering algorithm as my machine learning model.

ML Model: K-Means Clustering 📊

I've implemented K-Means, an unsupervised machine learning algorithm, for clustering the restaurants based on their transformed features (from PCA).

How K-Means Works:

K-Means aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean (centroid).

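1. Initialization: k initial centroids are chosen. Classic K-Means picks them at random; scikit-learn's default, init='k-means++' (used for the final model here), spreads the initial centroids apart to speed up convergence.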

2. Assignment Step: Each data point is assigned to the nearest centroid, forming k clusters.

3. Update Step: The centroids are re-calculated as the mean of all data points assigned to that cluster.

4. Iteration: Steps 2 and 3 are repeated until the centroids no longer move significantly or a maximum number of iterations is reached.
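The four steps above can be sketched from scratch in NumPy. This is an illustrative implementation, not the scikit-learn one: it uses plain random initialization from the data points rather than k-means++, and the two synthetic blobs are made-up data. The returned `wcss` is the same quantity scikit-learn exposes as `inertia_`.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain K-Means (Lloyd's algorithm). Returns (labels, centroids, wcss)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        #    (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Iteration: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and within-cluster sum of squares for the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    wcss = float((dists.min(axis=1) ** 2).sum())
    return labels, centroids, wcss

# Hypothetical data: two well-separated blobs in 3 dimensions.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0, 0, 0], scale=0.3, size=(50, 3))
blob_b = rng.normal(loc=[5, 5, 5], scale=0.3, size=(50, 3))
X_toy = np.vstack([blob_a, blob_b])

labels, centroids, wcss = kmeans(X_toy, k=2)
print("cluster sizes:", np.bincount(labels))
print("WCSS:", round(wcss, 2))
```

Running this function for several values of k and plotting the resulting WCSS values gives the elbow curve used in the notebook.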

Model Implementation Steps:

1. Determining Optimal k (Number of Clusters): I used two methods to find the optimal number of clusters:

* Elbow Method (using WCSS - Within-Cluster Sum of Squares): I calculated the WCSS for k values ranging from 1 to 10. WCSS measures the sum of squared distances between each point and its cluster centroid. The "elbow" point in the plot, where the rate of decrease in WCSS sharply changes, suggests an optimal k.

  * Observation from Elbow Curve: While not perfectly sharp, the elbow curve shows a significant drop up to k=3 or k=4, and then the rate of decrease slows. It suggests a value around 3 or 4 might be reasonable.

* Silhouette Score Analysis: I calculated the average silhouette score for k values from 2 to 9. The silhouette score measures how similar an object is to its own cluster compared to other clusters. Higher values indicate better-defined and more separated clusters.

  * Observation from Silhouette Scores:

    * For n_clusters = 2, the average silhouette score is: 0.31357

    * For n_clusters = 3, the average silhouette score is: 0.29742

    * For n_clusters = 4, the average silhouette score is: 0.31274

    * For n_clusters = 5, the average silhouette score is: 0.30244

    * For n_clusters = 6, the average silhouette score is: 0.31674 (Highest)

    * For n_clusters = 7, the average silhouette score is: 0.30965

    * For n_clusters = 8, the average silhouette score is: 0.29778

    * For n_clusters = 9, the average silhouette score is: 0.29957

* Based on the silhouette scores, k=6 yields the highest average silhouette score, suggesting it's the best choice for cluster separation and cohesion among the tested values.
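For reference, the silhouette value of a single point is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes it by hand on four made-up points; the notebook itself relies on scikit-learn's `silhouette_score` and `silhouette_samples`.

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-sample silhouette values s = (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster is given a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in unique if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical points: two tight pairs far apart from each other.
X_small = [[0, 0], [0, 1], [10, 0], [10, 1]]
good_labels = [0, 0, 1, 1]   # pairs grouped correctly
bad_labels = [0, 1, 0, 1]    # pairs split across clusters

good = silhouette_scores(X_small, good_labels).mean()
bad = silhouette_scores(X_small, bad_labels).mean()
print(f"good grouping: {good:.3f}")
print(f"bad grouping:  {bad:.3f}")
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own. Scores around 0.3, as reported above, therefore point to moderate rather than sharp separation.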

2. Final K-Means Model: Although the silhouette score peaked at k=6, I fit the final K-Means model with n_clusters = 5 (kmeans= KMeans(n_clusters = 5, init= 'k-means++', random_state = 42)). The average silhouette scores for k=5 (0.30244) and k=6 (0.31674) differ by only about 0.014, so the two choices are close on this metric.

3. Cluster Assignment: The kmeans.fit_predict(df_pca) step assigns each restaurant (represented by its PCA-transformed features) to one of the 5 clusters.

4. Cluster Visualization: The plt.scatter plot shows the data points colored by their assigned cluster in the 2D PCA space (using PC1 and PC2), providing a visual representation of the clusters formed. Cluster centers are marked only in the silhouette-analysis plots, not in this final scatter plot.

5. Adding Cluster Labels to DataFrame: The label (cluster ID) is then added back to kmeans_pca_df and subsequently joined with cluster_dummy (after converting Cost back from log1p), enabling further analysis of what defines each cluster.

Performance Evaluation Metric: Silhouette Score Chart
The Silhouette Score is the primary evaluation metric used here for K-Means clustering.

As seen, n_clusters = 6 yields the highest average silhouette score of 0.31674. This indicates that, among the tested k values, 6 clusters provide the best separation between clusters and cohesion within clusters, making it the most optimal choice according to this metric.

Despite the silhouette score suggesting 6 clusters, the final K-Means model used for visualization and DataFrame merging was set to n_clusters=5. If the goal is to choose k purely on the silhouette score, then k=6 would be the data-driven choice from this evaluation.

In [None]:
# Importing the module for hierarchical clustering and visualizing the dendrogram
import scipy.cluster.hierarchy as sch
plt.figure(figsize=(12,5))
dendrogram = sch.dendrogram(sch.linkage(df_pca, method = 'ward'),orientation='top',
            distance_sort='descending',
            show_leaf_counts=True)

plt.title('Dendrogram')
plt.xlabel('Restaurants')
plt.ylabel('Euclidean Distances')

plt.show()

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

range_n_clusters = range(2, 16)
for n_clusters in range_n_clusters:
    # linkage='ward' always uses Euclidean distance, so no distance metric is passed
    hc = AgglomerativeClustering(n_clusters = n_clusters, linkage = 'ward')
    y_hc = hc.fit_predict(df_pca)
    score = silhouette_score(df_pca, y_hc)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

In [None]:
# Agglomerative clustering
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Define the model (ward linkage is the default)
model = AgglomerativeClustering(n_clusters = 5)
# Fit the model and predict clusters
y_hc = model.fit_predict(df_pca)
# Scatter plot of the samples in each cluster (first two principal components)
for cluster in np.unique(y_hc):
    # Boolean mask selecting the rows that belong to this cluster
    mask = y_hc == cluster
    plt.scatter(df_pca[mask, 0], df_pca[mask, 1])
# Show the plot
plt.show()

# Evaluation
# Silhouette coefficient (higher is better)
print("Silhouette Coefficient: %0.3f"%silhouette_score(df_pca,y_hc, metric='euclidean'))

# Davies-Bouldin score of the clusters (lower is better)
print("davies_bouldin_score %0.3f"%davies_bouldin_score(df_pca, y_hc))

In [None]:
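# Creating a new column with the cluster predicted by hierarchical clustering.
# y_hc follows the row order of scaled_df, so align on restaurant name rather than on position.
hc_labels = pd.Series(y_hc, index=scaled_df.index)
clustering_result['label_hr'] = clustering_result['Restaurant'].map(hc_labels)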

In [None]:
clustering_result.sample(5)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.metrics import silhouette_score
from sklearn.decomposition import LatentDirichletAllocation

topic_range = range(2, 11)
silhouette_scores = []

for n_components in topic_range:
    # random_state makes the randomly initialized LDA fit reproducible
    lda = LatentDirichletAllocation(n_components=n_components, random_state=42)
    lda.fit(X)
    # Assign each review to its most probable topic
    labels = lda.transform(X).argmax(axis=1)
    silhouette_scores.append(silhouette_score(X, labels))

In [None]:
#plotting silhouette score
plt.plot(topic_range, silhouette_scores, marker ='o', color='red')
plt.xlabel('Number of Topics', size = 15, color = 'green')
plt.ylabel('Silhouette Score', size = 15, color = 'blue')
plt.show()

In [None]:
# LDA model (random_state matches the search above so the result is reproducible)
lda = LatentDirichletAllocation(n_components=4, random_state=42)
lda.fit(X)

In [None]:
import pyLDAvis
import pyLDAvis.lda_model # This module is typically used for sklearn's LDA models

pyLDAvis.enable_notebook()

In [None]:
# Plotting the topics and their top 30 terms.
# Assumes 'lda' is the fitted LatentDirichletAllocation model, 'X' is the
# document-term matrix, and 'vectorizer' is the fitted vectorizer that produced X.
lda_pyLDAvis = pyLDAvis.lda_model.prepare(lda, X, vectorizer, mds='tsne')
lda_pyLDAvis

In [None]:
#creating copy to store predicted sentiments
review_sentiment_prediction = review[review_df.columns.to_list()].copy()
review_sentiment_prediction.head()

In [None]:
# predicting the sentiments and storing in a feature
topic_results = lda.transform(X)
review_sentiment_prediction['Prediction'] = topic_results.argmax(axis=1)
review_sentiment_prediction.sample(5)

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Define the number of words to include in the word cloud
N = 100

# Create a list of strings for each topic
feature_names = vectorizer.get_feature_names_out()
topic_text = []
for index, topic in enumerate(lda.components_):
    # Take the N highest-weighted words for this topic
    topic_words = [feature_names[i] for i in topic.argsort()[-N:]]
    topic_text.append(" ".join(topic_words))

# Create a word cloud for each topic
for i in range(len(topic_text)):
    print(f'TOP {N} WORDS FOR TOPIC #{i}')
    wordcloud = WordCloud(background_color="black", colormap='rainbow').generate(topic_text[i])
    plt.figure(figsize=(10,5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    print('='*120)

In [None]:
for sentiment in review_sentiment_prediction['Prediction'].unique().tolist():
  print('Prediction = ',sentiment,'\n')
  print(review_sentiment_prediction[review_sentiment_prediction['Prediction'] ==
        sentiment]['Rating'].value_counts())
  print('='*120)

In [None]:
# Defining functions to calculate and display the evaluation scores
from sklearn.metrics import (roc_auc_score, f1_score, accuracy_score,
                             precision_score, recall_score, confusion_matrix)
from tabulate import tabulate
import itertools


#calculating score
def calculate_scores(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    roc_auc = roc_auc_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
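    # Plot the confusion matrix for the test set

    cm = confusion_matrix(y_test, y_pred)
    plt.imshow(cm, cmap='Wistia')

    # Add labels to the plot.
    # Note: confusion_matrix orders classes by sorted label value, so these
    # names must be listed in that same order to match the rows and columns.
    class_names = ["Positive", "Negative"]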
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names)
    plt.yticks(tick_marks, class_names)

    # Add values inside the confusion matrix
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                horizontalalignment="center",
                color="white" if cm[i, j] > thresh else "black")

    # Add a title and x and y labels
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')

    plt.show()
    print(cm)
    return roc_auc, f1, accuracy, precision, recall

#printing result
def print_table(model, X_train, y_train, X_test, y_test):
    roc_auc, f1, accuracy, precision, recall = calculate_scores(model, X_train, y_train, X_test, y_test)
    table = [["ROC AUC", roc_auc], ["Precision", precision],
             ["Recall", recall], ["F1", f1], ["Accuracy", accuracy]]
    print(tabulate(table, headers=["Metric", "Score"]))

In [None]:
# Logistic regression
from sklearn.linear_model import LogisticRegression

# Create the model (it is fitted later inside print_table)
clf = LogisticRegression()

In [None]:
# XGBoost
from xgboost import XGBClassifier

# Create the model (it is fitted later inside print_table)
xgb = XGBClassifier()

In [None]:
# Visualizing evaluation Metric Score chart for logistic regression
# printing result
print_table(clf, X_train, y_train, X_test, y_test)

In [None]:
# Visualizing evaluation Metric Score chart for XgBoost
# printing result
print_table(xgb, X_train, y_train, X_test, y_test)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV

# Logistic regression
# Finding the best parameters for LogisticRegression with GridSearchCV.
# The 'liblinear' solver is set because the default solver ('lbfgs') does not support the 'l1' penalty.
param_dict = {'C': [0.1,1,10,100,1000],'penalty': ['l1', 'l2'],'solver': ['liblinear'],'max_iter':[1000]}
clf_grid = GridSearchCV(clf, param_dict,n_jobs=-1, cv=5, verbose = 5,scoring='recall')

In [None]:
# printing result
print_table(clf_grid, X_train, y_train, X_test, y_test)

In [None]:
# Finding the best parameters for XGBClassifier with GridSearchCV
# ('criterion' is not an XGBoost parameter, so it is left out of the grid)
xgb_param={'n_estimators': [100,125,150],'max_depth': [7,10,15]}
xgb_grid=GridSearchCV(estimator=xgb,param_grid = xgb_param,cv=3,scoring='recall',verbose=5,n_jobs = -1)

In [None]:
# printing result for gridsearch Xgb
print_table(xgb_grid, X_train, y_train, X_test, y_test)

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import roc_curve, roc_auc_score

# Best estimators found by the grid searches
# (clf_grid and xgb_grid must already be fitted, which print_table does above)
log_reg_best = clf_grid.best_estimator_
xgbc_best = xgb_grid.best_estimator_

# Predicted probability of the positive class on the test set
y_preds_proba_lr = log_reg_best.predict_proba(X_test)[:, 1]
y_preds_proba_xgbc = xgbc_best.predict_proba(X_test)[:, 1]

classifiers_proba = [(log_reg_best, y_preds_proba_lr),
                    (xgbc_best, y_preds_proba_xgbc)]

# Collect one result row (as a dictionary) per classifier
results = []

# Compute the ROC curve and AUC for each model
for pair in classifiers_proba:
    fpr, tpr, _ = roc_curve(y_test,  pair[1])
    auc = roc_auc_score(y_test, pair[1])

    results.append({
        'classifiers': pair[0].__class__.__name__,
        'fpr': fpr,
        'tpr': tpr,
        'auc': auc
    })

# Convert list of dictionaries to DataFrame
result_table = pd.DataFrame(results)

# Set name of the classifiers as index labels
result_table.set_index('classifiers', inplace=True)

# Plotting the ROC curve for all models
fig = plt.figure(figsize=(10,6))
for i in result_table.index:
    plt.plot(result_table.loc[i]['fpr'],
             result_table.loc[i]['tpr'],
             label="{}, AUC={:.3f}".format(i, result_table.loc[i]['auc']))

plt.plot([0,1], [0,1],'r--') # Plotting the random classifier line

plt.xlabel("False Positive Rate", fontsize=15)
plt.ylabel("True Positive Rate", fontsize=15)
plt.title('ROC AUC Curve', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')
plt.grid(True) # Added grid for better readability
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I've used Grid Search Cross-Validation (GridSearchCV) for hyperparameter optimization for both Logistic Regression and XGBoost models.

Hyperparameter Optimization Technique: Grid Search Cross-Validation (GridSearchCV)

How GridSearchCV Works:

GridSearchCV exhaustively searches over a specified parameter grid for the best combination of hyperparameters. For each combination:

1. It trains the model once per cross-validation fold (the number of folds is set by cv; here cv=5 for Logistic Regression and cv=3 for XGBoost).

2. It evaluates the model's performance on each held-out fold using a specified scoring metric (here, 'recall').

3. After trying all combinations, it selects the set of hyperparameters that resulted in the best average score across the cross-validation folds.
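
The three steps above can be sketched on synthetic data. This is a minimal illustration rather than the notebook's actual tuning run: the data from make_classification, the grid of C values, and the names X_demo, demo_grid, and demo_search are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced data standing in for the vectorized reviews (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.35, 0.65], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Hypothetical grid: 3 candidates x 5 folds = 15 fits, plus one refit on all of X_tr
demo_grid = {'C': [0.01, 0.1, 1.0]}
demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=demo_grid,
    cv=5,
    scoring='recall',
    n_jobs=1,
)
demo_search.fit(X_tr, y_tr)

# Mean recall across the 5 folds for each candidate; the best one is kept
mean_scores = demo_search.cv_results_['mean_test_score']
print(demo_search.best_params_, round(demo_search.best_score_, 3))
```

The value in best_score_ is a cross-validated score on the training data, so it can differ from the recall measured later on a separate test set.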

Why GridSearchCV Was Chosen:

1. Exhaustive Search: GridSearchCV evaluates every combination in the defined grid, so it is guaranteed to return the combination with the best cross-validated score among those grid points. This is valuable when there is a relatively small and well-defined set of hyperparameters to tune, because no candidate in the grid is skipped.

2. Cross-Validation for Robustness: By performing cross-validation (cv=5 for Logistic Regression, cv=3 for XGBoost), it provides a more robust estimate of model performance for each hyperparameter combination. This helps in selecting hyperparameters that generalize well to unseen data, rather than just performing well on a single train/validation split.

3. Direct Optimization for Key Metric: The scoring parameter was set to 'recall'. This is an important choice on imbalanced datasets (as identified earlier), where accuracy alone can be misleading. Optimizing for recall prioritizes minimizing False Negatives, i.e., correctly identifying as many actual positive-class reviews as possible. Recall is computed for whichever class is labeled positive, so if the business priority is to avoid missing negative feedback, the negative class has to be the one treated as positive when scoring.

4. Parallel Processing: The use of n_jobs=-1 allows GridSearchCV to run computations in parallel, significantly speeding up the process, especially when the parameter grid is large or cross-validation folds are numerous.

While other techniques like Randomized Search (RandomizedSearchCV) might be more efficient for very large search spaces, GridSearchCV is effective for smaller, targeted searches as performed here, especially when you want to ensure the absolute best combination within the defined ranges is found.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, there is an improvement in model performance after applying hyperparameter optimization with GridSearchCV.

The "improvement" needs to be understood in the context of the scoring metric used for GridSearchCV, which was 'recall'.

1. Logistic Regression

* Targeted Improvement: For Logistic Regression, the Recall score improved from 0.9333 to 0.9667, a direct consequence of optimizing for 'recall' during GridSearchCV. A higher recall means the model identifies a larger share of the actual positive cases, which matters when minimizing false negatives is the priority.

* Trade-offs: This improvement in recall came at the cost of a decrease in other metrics like ROC AUC, Precision, F1-score, and Accuracy. The increase in False Positives (from 183 to 320) while reducing False Negatives (from 82 to 41) indicates this trade-off. The model became more aggressive in predicting the positive class.

2. XGBoost

* Overall Improvement: For XGBoost, the improvement is more balanced. ROC AUC, Precision, F1-score, and Accuracy all increased.

  * ROC AUC improved from 0.7603 to 0.8181, indicating better overall discriminative power.

  * Precision improved from 0.7838 to 0.8481, meaning fewer false positives.

  * F1-score improved from 0.8538 to 0.8706, showing a better balance between precision and recall.

  * Accuracy improved from 0.8018 to 0.8359, indicating more correct predictions overall.

* Recall Trade-off: Even though the scoring metric for optimization was recall, XGBoost's test-set recall decreased slightly (from 0.9374 to 0.8943). GridSearchCV picks the max_depth and n_estimators combination with the best mean recall across the cross-validation folds of the training data, and that choice does not necessarily maximize recall on the held-out test set. The confusion matrix shows an increase in False Negatives (from 77 to 130), consistent with the drop in recall.
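
The recall figures and False Negative counts quoted above can be cross-checked with a little arithmetic. The sketch below assumes the test set holds 1230 actual positives; that count is not reported anywhere in the notebook and is only inferred here because it is the value that makes every reported FN and recall pair agree. The helper names are hypothetical.

```python
def recall_from_fn(false_negatives, actual_positives):
    """Recall = TP / (TP + FN), which equals 1 - FN / P for P actual positives."""
    return 1 - false_negatives / actual_positives


def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Inferred, not reported: number of actual positives in the test set
ACTUAL_POSITIVES = 1230

# (false negatives, recall as reported in the text)
reported = {
    'Logistic Regression, untuned': (82, 0.9333),
    'Logistic Regression, tuned': (41, 0.9667),
    'XGBoost, untuned': (77, 0.9374),
    'XGBoost, tuned': (130, 0.8943),
}
for name, (fn, reported_recall) in reported.items():
    recomputed = recall_from_fn(fn, ACTUAL_POSITIVES)
    print(f"{name}: recomputed {recomputed:.4f}, reported {reported_recall}")

# Logistic Regression precision implied by the reported False Positive counts
lr_precision_before = precision(ACTUAL_POSITIVES - 82, 183)
lr_precision_after = precision(ACTUAL_POSITIVES - 41, 320)
print(f"LR precision: {lr_precision_before:.4f} -> {lr_precision_after:.4f}")
```

Under that assumption, halving the False Negatives for Logistic Regression raises recall while the extra False Positives lower precision, which is the trade-off described above.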

In conclusion, GridSearchCV was run with recall as the scoring metric for both models. For Logistic Regression, this raised test-set recall at the expense of precision and accuracy. For XGBoost, test-set recall fell slightly, but ROC AUC, precision, F1-score, and accuracy all rose, giving a better-balanced classifier overall. Both outcomes show that hyperparameter tuning changes model behaviour in measurable ways, and that the result should always be checked on held-out data.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
#creating variable that contain restaurant cuisine details
restaurant_df = cluster_dummy.copy()
restaurant_df = restaurant_df.reset_index()
restaurant_df = restaurant_df.drop(columns=['Cost', 'Average_Rating', 'Total_Cuisine_Count', 'label'])
restaurant_df.head(2)

In [None]:
#shape
restaurant_df.shape

In [None]:
#restaurant matrix
rest_genre = restaurant_df.loc[:, restaurant_df.columns != 'Restaurant']
rest_matrix = rest_genre.values
rest_matrix

In [None]:
#matrix shape
rest_matrix.shape

In [None]:
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer # Or CountVectorizer

# --- Ensure 'review' DataFrame exists (your raw review data) ---
try:
    _ = review
except NameError:
    # If 'review' is not defined, create a dummy DataFrame for demonstration.
    # In your actual notebook, this would be your loaded review data.
    review = pd.DataFrame({
        'Restaurant': ['Taste Buds', 'Spice Route', 'Green Oasis'],
        'Reviewer': ['Alice', 'Bob', 'Charlie'],
        'Review': [
            "Great food and fantastic service!",
            "Service was slow, but the ambiance was nice.",
            "Horrible experience, never again.",
        ],
        'Rating': [5.0, 3.0, 1.0]
    })
    print("Dummy 'review' DataFrame created for demonstration.")

# --- Ensure 'X' (vectorized text) and 'vectorizer' exist ---
try:
    _ = X
    _ = vectorizer
except NameError:
    # If 'X' and 'vectorizer' are not defined, create them by vectorizing 'Review' column.
    vectorizer = TfidfVectorizer(max_features=1000) # Use your actual vectorizer parameters
    X = vectorizer.fit_transform(review['Review'])
    print("Dummy 'X' (vectorized reviews) and 'vectorizer' created for demonstration.")

# --- Ensure 'lda' (fitted LDA model) exists ---
try:
    _ = lda
except NameError:
    # If 'lda' is not defined, fit a dummy LDA model.
    lda = LatentDirichletAllocation(n_components=4, random_state=42) # Use your actual n_components
    lda.fit(X)
    print("Dummy 'lda' model fitted for demonstration.")

# --- Create 'review_sentiment_prediction' (if it's not already) ---
# This DataFrame typically holds your original review data plus the predicted topic/sentiment.
try:
    _ = review_sentiment_prediction
except NameError:
    review_sentiment_prediction = review.copy()
    topic_results = lda.transform(X)
    review_sentiment_prediction['Prediction'] = topic_results.argmax(axis=1)
    print("'review_sentiment_prediction' created from previous steps.")

# Define 'sentiment_df' from 'review_sentiment_prediction'
sentiment_df = review_sentiment_prediction.copy()

# creating user or reviewer profile
user_df = sentiment_df[['Reviewer', 'Restaurant', 'Rating']].copy()
user_df.head()

In [None]:
#shape
user_df.shape

In [None]:
# grouping the data by the 'Reviewer' column
grouped_data = user_df.groupby('Reviewer')

# defining a function that collects a reviewer's ratings as a list of dicts
def create_new_column(data):
    # `_` receives the row index, which is not needed here
    return [{'Restaurant': row['Restaurant'], 'Rating': row['Rating']} for _, row in data.iterrows()]

# applying the function to the grouped data and creating a new dataframe
user_rating = grouped_data.apply(create_new_column)
user_rating = user_rating.reset_index().rename(columns ={0:'Rated_Restaurant'})
user_rating.head()

In [None]:
#shape
user_rating.shape

In [None]:
#building a nested dict {reviewer: {restaurant: rating}} from the user rating df
user_rated_restaurant = {}
for index, row in user_rating.iterrows():
    user_rated_restaurant[row['Reviewer']] = {}
    for i in range(len(row['Rated_Restaurant'])):
        user_rated_restaurant[row['Reviewer']][row['Rated_Restaurant'][i][
            'Restaurant']] = row['Rated_Restaurant'][i]['Rating']

# creating an empty user preference vector for each user
user_preference_vector = pd.DataFrame(np.zeros((len(user_rating), len(restaurant_df))),
                      columns=restaurant_df.Restaurant, index=user_rating['Reviewer'])

# Iterate through the user rating dataframe
for index, row in user_rating.iterrows():
    for i in range(len(row['Rated_Restaurant'])):
        restaurant = row['Rated_Restaurant'][i]['Restaurant']
        rating = row['Rated_Restaurant'][i]['Rating']
        user_preference_vector.loc[row['Reviewer'], restaurant] = rating

#reset index
user_preference_vector = user_preference_vector.reset_index()

In [None]:
#getting output
user_preference_vector.sample(5)

In [None]:
import pandas as pd
import numpy as np

# Requires 'rest_genre', 'user_preference_vector', and 'rest_matrix' from the cells above.

# Initialize an empty list to collect the DataFrames
rows_to_concat = []

for index, row in user_preference_vector.iterrows():
    # Ensure that row[1:] correctly captures the numerical preference vector.
    # If 'Reviewer' is the first column, this is correct.
    user_preference_vector_array = row[1:].values.astype(float).reshape(1, -1)
    dot_product = np.dot(user_preference_vector_array, rest_matrix)

    # Create a DataFrame for the current reviewer's scores and append to the list
    # Use row['Reviewer'] directly for the index when creating the DataFrame
    reviewer_scores = pd.DataFrame(dot_product, columns=rest_genre.columns, index=[row['Reviewer']])
    rows_to_concat.append(reviewer_scores)

# Concatenate all collected DataFrames outside the loop
result_df = pd.concat(rows_to_concat)

# Reset the index and rename the new index column to 'Reviewer'
result_df = result_df.reset_index().rename(columns={'index': 'Reviewer'})

In [None]:
#getting output
result_df[:5]
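
The loop that builds result_df multiplies each reviewer's rating vector by the restaurant-by-cuisine matrix, one reviewer at a time. The same profiles can be obtained with a single matrix product. The sketch below uses hypothetical toy data (two reviewers, three restaurants, two cuisine flags) to show the equivalence; the names ratings, cuisines, and profiles are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are reviewers, columns are restaurants (0 = not rated)
ratings = pd.DataFrame(
    [[5.0, 0.0, 3.0],
     [0.0, 4.0, 0.0]],
    index=['reviewer_a', 'reviewer_b'],
    columns=['rest_1', 'rest_2', 'rest_3'],
)

# Hypothetical one-hot cuisine flags: rows are restaurants, columns are cuisines
cuisines = pd.DataFrame(
    [[1, 0],
     [0, 1],
     [1, 1]],
    index=['rest_1', 'rest_2', 'rest_3'],
    columns=['Chinese', 'Biryani'],
)

# One matrix product yields every reviewer's cuisine profile at once
profiles = ratings.dot(cuisines)

# Row-by-row version, mirroring the loop used in the notebook
looped = np.vstack([
    np.dot(row.values.reshape(1, -1), cuisines.values)
    for _, row in ratings.iterrows()
])

print(profiles)
```

Each profile entry is the sum of the ratings a reviewer gave to restaurants serving that cuisine, so reviewers who rate many restaurants end up with larger scores.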

In [None]:
#creating test user
test_user_ids = user_rating.copy()
test_user_ids['Rated_Restaurant_Count'] = test_user_ids['Rated_Restaurant'].apply(lambda x: len(x))

#taking the 1000 reviewers with the most rated restaurants, since repeat reviewers give richer profiles
test_user_ids = test_user_ids.sort_values('Rated_Restaurant_Count', ascending = False)[:1000]
test_user_ids.head()

In [None]:
#creating list for all reviewer in test ids
test_user_ids = test_user_ids['Reviewer'].to_list()
print(f"Total numbers of test users {len(test_user_ids)}")

In [None]:
#test user profile
test_user_profile = result_df[result_df['Reviewer']=='Ankita']
test_user_profile

In [None]:
# Now let's get the test user vector by excluding the `Reviewer` column
test_user_vector = test_user_profile.iloc[0, 1:].values
test_user_vector

In [None]:
#let test reviewer or user be 'Ankita'
liked_restaurant = user_df[user_df['Reviewer'] == 'Ankita']['Restaurant'].to_list()
liked_restaurant = set(liked_restaurant)
liked_restaurant

In [None]:
#getting values for all restaurant
all_restaurant = set(restaurant_df['Restaurant'].values)

In [None]:
#getting unknown restaurants
unknown_restaurant = all_restaurant.difference(liked_restaurant)

In [None]:
#getting unknown restaurant genre
unknown_restaurant_genres = restaurant_df[restaurant_df['Restaurant'].isin(unknown_restaurant)]
#getting the restaurant matrix by excluding the `Restaurant` column
restaurant_matrix = unknown_restaurant_genres.iloc[:, 1:].values
restaurant_matrix

In [None]:
#recommendation score
score = np.dot(restaurant_matrix[1], test_user_vector)
score

In [None]:
# Only keep the score larger than the recommendation threshold
# The threshold can be fine-tuned to adjust the size of generated recommendations
score_threshold = 10.0
# score_threshold = 20.0
res_dict = {}

In [None]:
def generate_recommendation_scores():
    users = []
    restaurant = []
    scores = []
    for user_id in test_user_ids:
        test_user_profile = result_df[result_df['Reviewer'] == user_id]
        # get user vector for the current user id
        test_user_vector = test_user_profile.iloc[0, 1:].values


        # get the unknown restaurant ids for the current user id
        liked_restaurant = user_df[user_df['Reviewer'] == user_id]['Restaurant'].to_list()
        all_restaurant = set(restaurant_df['Restaurant'].values)
        unknown_restaurant = all_restaurant.difference(liked_restaurant)
        unknown_restaurant_genres = restaurant_df[restaurant_df['Restaurant'].isin(unknown_restaurant)]
        unknown_restaurant_ids = unknown_restaurant_genres.iloc[:, :1].values

        # use np.dot() to get the recommendation scores for each restaurant
        recommendation_scores = np.dot(unknown_restaurant_genres.iloc[:, 1:].values, test_user_vector)

        # Append the results into the users, restaurant, and scores list
        for i in range(0, len(unknown_restaurant_ids)):
            score = recommendation_scores[i]
            # Only keep the restaurant with high recommendation score
            if score >= score_threshold:
              users.append(user_id)
              restaurant.append(unknown_restaurant_ids[i])
              scores.append(recommendation_scores[i])

    return users, restaurant, scores

In [None]:
# Return users, restaurants, and scores lists for the dataframe
users, restaurant, scores = generate_recommendation_scores()
res_dict['User'] = users
res_dict['Restaurant'] = restaurant
res_dict['Score'] = scores
res_df = pd.DataFrame(res_dict, columns=['User', 'Restaurant', 'Score'])
res_df['Restaurant'] = res_df['Restaurant'].apply(lambda x: str(x[0]))
res_df

In [None]:
#most recommended restaurant
recom_rest = res_df.groupby('Restaurant')['User'].count().reset_index().sort_values(
                            'User', ascending = False)
recom_rest[:5]

In [None]:
#least recommended restaurant
recom_rest[-5:]

In [None]:
# grouping the data by the 'user' column
grouped_data = res_df.groupby('User')

# defining a function to create the new dataframe
def create_new_column(data):
    return [{'Restaurant': row['Restaurant'], 'Score': row['Score']} for _, row in data.iterrows()]
    #variable _ is used to store the index value, which is not used in the loop

# applying the function to the grouped data and creating a new dataframe
recommendation = grouped_data.apply(create_new_column)
recommendation = recommendation.reset_index().rename(columns ={0:'Recommended_Restaurant'})
recommendation.head()

In [None]:
#creating column for total recommendation count for each user
recommendation['Total_Recommendation'] = recommendation['Recommended_Restaurant'].apply(
    lambda x: len(x))

#top 10 user who get most recommendation
recommendation.sort_values('Total_Recommendation', ascending= False)[:10]

In [None]:
# creating new dataframe for recommendation for test user
for i in recommendation[recommendation['User']=='Ankita']['Recommended_Restaurant']:
    # creating the dataframe
    vis = pd.DataFrame(i, columns = ['Restaurant', 'Score'])
vis.sort_values('Score', ascending = False)

In [None]:
#bag of word with doc index as these index will be used for finding similarity later
bows_df.sample(5)

In [None]:
#using extracted bag of words
bow_df = bows_df.drop(columns=['doc_index'])
bow_df.head()

In [None]:
#Restaurant and review
rest_review = sentiment_df[['Restaurant','Review']].copy()
rest_review.sample(5)

In [None]:
#bag of words for restaurant 'Asian Meal Box'
rest_bow = bow_df[bow_df['doc_id'] == 'Asian Meal Box']
rest_bow[:10]

In [None]:
#converting bow to horizontal format using pivot
rest_bowT = rest_bow.pivot_table(index=['doc_id'], columns=['token'],
                                  aggfunc='sum').reset_index(level=[0])
rest_bowT

In [None]:
#using union set to compare two restaurant set of tokens
import pandas as pd

def pivot_two_bows(basedoc, comparedoc):
    base = basedoc.copy()
    base['type'] = 'base'
    compare = comparedoc.copy()
    compare['type'] = 'compare'
    # stack the two token sets vertically (DataFrame.append was removed in pandas 2.0)
    join = pd.concat([base, compare])
    # pivot so each restaurant becomes one row with one column per token
    joinT = join.pivot_table(index=['doc_id', 'type'], columns='token',
              aggfunc='sum').fillna(0).reset_index(level=[0, 1])
    # assign columns
    joinT.columns = ['doc_id', 'type'] + [t[1] for t in joinT.columns][2:]
    return joinT

In [None]:
#creating two test restaurant
rest1 = bow_df[bow_df['doc_id'] == 'Asian Meal Box']
rest2 = bow_df[bow_df['doc_id'] == 'Biryanis And More']

In [None]:
#pivoting the two test restaurants into aligned token-count rows
bow_vectors = pivot_two_bows(rest1, rest2)

In [None]:
from scipy.spatial.distance import cosine
#calculating similarity between two restaurant
similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])

similarity
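
The cell above takes one minus the cosine distance between two token-count rows. To make the calculation concrete, here is a small standard-library sketch on two hypothetical token lists; the function name cosine_similarity_bow and the sample tokens are assumptions for illustration and are not part of the notebook.

```python
import math
from collections import Counter


def cosine_similarity_bow(tokens_a, tokens_b):
    """Cosine similarity between the bag-of-words count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Align both documents on the union of their vocabularies
    vocab = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[token] for token in vocab]
    vec_b = [counts_b[token] for token in vocab]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


doc_a = "spicy chicken biryani".split()
doc_b = "spicy mutton biryani biryani".split()

# Count vectors over [biryani, chicken, mutton, spicy]: [1, 1, 0, 1] and [2, 0, 1, 1]
sim_ab = cosine_similarity_bow(doc_a, doc_b)
print(round(sim_ab, 4))
```

Aligning both documents on a shared vocabulary is the step that the pivot performs in the notebook; without it the two vectors would not be comparable position by position.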

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
#building the pairwise cosine similarity matrix across all restaurants

# Get the list of all restaurant
all_restaurant = rest_review['Restaurant'].unique()

# Initialize the dataframe to store the similarities
df_similarities = pd.DataFrame(columns = all_restaurant, index = all_restaurant)

# Iterate over the rows and columns of the dataframe
for i in all_restaurant:
    for j in all_restaurant:
        # Get the BoW representation of the current row and column restaurant
        #creating two test restaurant
        rest1 = bow_df[bow_df['doc_id'] == i]
        rest2 = bow_df[bow_df['doc_id'] == j]
        bow_vectors = pivot_two_bows(rest1, rest2)
        # Calculate the cosine similarity between the two restaurant' BoW representations
        sim = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
        # Assign the similarity score to the corresponding cell of the dataframe
        df_similarities.at[i, j] = sim

In [None]:
#creating function for mapping
# Create restaurant id to index and index to id mappings
def get_doc_dicts(bow_df):
    grouped_df = bow_df.groupby(['doc_id']).max().reset_index(drop=False)
    idx_id_dict = grouped_df[['doc_id']].to_dict()['doc_id']
    id_idx_dict = {v: k for k, v in idx_id_dict.items()}
    del grouped_df
    return idx_id_dict, id_idx_dict

In [None]:
#two test subject
rest1 = rest_review[rest_review['Restaurant'] == "Beyond Flavours"]
rest2 = rest_review[rest_review['Restaurant'] == "Paradise"]

In [None]:
#with restaurant name finding index for similarity
idx_id_dict, id_idx_dict = get_doc_dicts(bows_df)
idx1 = id_idx_dict["Beyond Flavours"]
idx2 = id_idx_dict["Paradise"]
print(f"Restaurant 1's index is {idx1} and Restaurant 2's index is {idx2}")

In [None]:
#locating in the similarity df
sim_matrix = df_similarities.to_numpy()

#similarity between the two restaurant
sim = sim_matrix[idx1][idx2]
sim

In [None]:
#function to recommend restaurant based on similarity
def generate_recommendations_for_one_user(liked_restaurant, unknown_restaurant, id_idx_dict, sim_matrix):
    # Create a dictionary to store your recommendation results
    res = {}
    threshold = 0.6
    for liked_rest in liked_restaurant:
        for unselect_rest in unknown_restaurant:
            if liked_rest in id_idx_dict and unselect_rest in id_idx_dict:
                sim = 0
                idx1 = id_idx_dict[liked_rest]
                idx2 = id_idx_dict[unselect_rest]

                # Find the similarity value from the sim_matrix
                sim = sim_matrix[idx1][idx2]
                if sim > threshold:
                    if unselect_rest not in res:
                        res[unselect_rest] = sim
                    else:
                        if sim >= res[unselect_rest]:
                            res[unselect_rest] = sim

    # Sort the results by similarity
    res = {k: v for k, v in sorted(res.items(), key=lambda item: item[1], reverse=True)}
    return res
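
The function above keeps, for each unseen restaurant, the highest similarity to any restaurant the user has already reviewed, provided it exceeds the threshold. A compact sketch of the same idea on a hypothetical four-restaurant similarity matrix is shown below. The names toy_sim, name_to_idx, and recommend are illustrative, and unlike the notebook function this version assumes every name is present in the index mapping and that the user has reviewed at least one restaurant.

```python
import numpy as np

# Hypothetical restaurants and a symmetric similarity matrix (assumed values)
names = ['rest_a', 'rest_b', 'rest_c', 'rest_d']
name_to_idx = {name: i for i, name in enumerate(names)}
toy_sim = np.array([
    [1.00, 0.80, 0.30, 0.65],
    [0.80, 1.00, 0.70, 0.20],
    [0.30, 0.70, 1.00, 0.10],
    [0.65, 0.20, 0.10, 1.00],
])


def recommend(liked, unknown, name_to_idx, sim_matrix, threshold=0.6):
    """Return {restaurant: best similarity} for unseen restaurants above the threshold."""
    best = {}
    for candidate in unknown:
        # Highest similarity between this candidate and anything already reviewed
        top = max(sim_matrix[name_to_idx[seen]][name_to_idx[candidate]] for seen in liked)
        if top > threshold:
            best[candidate] = float(top)
    # Most similar candidates first
    return dict(sorted(best.items(), key=lambda item: item[1], reverse=True))


recs = recommend({'rest_a'}, {'rest_b', 'rest_c', 'rest_d'}, name_to_idx, toy_sim)
print(recs)
```

Raising the threshold shortens the list to near-duplicates of what the user already knows, while lowering it produces broader but weaker recommendations.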

In [None]:
#function to calculate recommendation for all Reviewer
def generate_recommendations_for_all():
    users = []
    restaurant = []
    sim_scores = []
    idx_id_dict, id_idx_dict = get_doc_dicts(bows_df)
    sim_matrix = df_similarities.to_numpy()
    all_restaurant = set(restaurant_df['Restaurant'])
    for user_id in test_user_ids:
        liked_restaurant = user_df[user_df['Reviewer'] == user_id]['Restaurant'].to_list()
        unknown_restaurant = all_restaurant.difference(liked_restaurant)
        rec = generate_recommendations_for_one_user(liked_restaurant, unknown_restaurant, id_idx_dict, sim_matrix)
        for k, v in rec.items():
            users.append(user_id)
            restaurant.append(k)
            sim_scores.append(v)

    return users, restaurant, sim_scores

In [None]:
#storing recommendation for each user in dataframe
res_sim_dict = {}
users, restaurant, sim_scores = generate_recommendations_for_all()
res_sim_dict['USER'] = users
res_sim_dict['RESTAURANT'] = restaurant
res_sim_dict['SCORE'] = sim_scores
res_sim_df = pd.DataFrame(res_sim_dict, columns=['USER', 'RESTAURANT', 'SCORE'])

In [None]:
#getting the output
res_sim_df.sample(10)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

ML Models Used:

I used two supervised machine learning models for sentiment analysis:

1. Logistic Regression

This is a linear model that predicts the probability of a binary outcome (e.g., positive or negative sentiment). It uses a sigmoid function to map the input features to a probability score. It's a fundamental classification algorithm, simple yet effective, especially for linearly separable data.

2. XGBoost (Extreme Gradient Boosting)

This is an advanced ensemble learning method based on gradient boosting. It builds multiple decision trees sequentially, with each new tree trying to correct the errors of the previous ones. XGBoost is highly efficient, flexible, and known for its strong performance in various machine learning tasks due to its robust handling of complex relationships and built-in regularization to prevent overfitting.

Performance Evaluation:

Model performance was evaluated using several key metrics, which provide a comprehensive view of how well each model classifies sentiments. The chosen metrics were particularly relevant given the nature of classification and the possibility of class imbalance in sentiment data:

1. ROC AUC (Receiver Operating Characteristic Area Under the Curve)

This metric assesses the model's ability to distinguish between positive and negative classes. A higher ROC AUC indicates better overall discriminative power.

2. Precision

This measures the accuracy of positive predictions. It tells us, "Of all the instances predicted as positive, how many were actually positive?" High precision means fewer false positives.

3. Recall (Sensitivity)

This measures the model's ability to find all actual positive instances. It tells us, "Of all the actual positive instances, how many did the model correctly identify?" High recall means fewer false negatives.

4. F1-Score

This is the harmonic mean of precision and recall. It provides a single score that balances both precision and recall, being particularly useful when there's an uneven class distribution.

5. Accuracy

This is the ratio of correctly predicted instances to the total number of instances. While a general indicator of performance, it can be misleading if the dataset is imbalanced.

6. Confusion Matrix

This table provides a detailed breakdown of correct and incorrect predictions, showing True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
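
To show how the count-based metrics relate to the confusion matrix, and why accuracy can mislead on imbalanced data, here is a minimal sketch. The counts are hypothetical (900 positive and 100 negative reviews) and the function name classification_metrics is an assumption for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}


# Hypothetical classifier that labels every review as positive:
# all 900 positives are caught, all 100 negatives are misclassified
always_positive = classification_metrics(tp=900, fp=100, tn=0, fn=0)

for metric, value in always_positive.items():
    print(f"{metric}: {value:.3f}")
```

This degenerate classifier reaches perfect recall and 90% accuracy without recognising a single negative review, which is why the confusion matrix and precision are reported alongside recall.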

Hyperparameter Optimization Technique:

I used Grid Search Cross-Validation (GridSearchCV) for hyperparameter optimization for both Logistic Regression and XGBoost.

Why GridSearchCV?

1. Exhaustive Search

GridSearchCV systematically explores all possible combinations of hyperparameters defined within a specified grid. This ensures that the optimal set of parameters within the chosen range is found.

2. Robust Evaluation with Cross-Validation

By integrating cross-validation, GridSearchCV trains and evaluates the model multiple times on different subsets of the data. This helps in selecting hyperparameters that lead to better generalization performance on unseen data, reducing the risk of overfitting to a single training set.

3. Targeted Metric Optimization

For both models, the scoring parameter was set to 'recall'. This means GridSearchCV specifically aimed to find hyperparameter combinations that maximize the model's recall score. This is crucial in sentiment analysis, especially when identifying all instances of a particular sentiment (e.g., negative reviews to address customer complaints) is a high priority.

4. Parallelization

The n_jobs=-1 parameter allowed the computation to be distributed across all available CPU cores, significantly speeding up the hyperparameter tuning process.

While other methods like Randomized Search could be faster for very large search spaces, Grid Search was chosen for its thoroughness given the moderately sized parameter grids.

Improvement After Hyperparameter Tuning:

Hyperparameter tuning with GridSearchCV led to noticeable improvements in model performance, particularly aligned with the chosen optimization metric:

* For Logistic Regression: The most significant improvement was observed in Recall. By optimizing for this metric, the model became more effective at identifying true positive sentiments, leading to a higher recall score. This enhancement was a direct result of fine-tuning the C (regularization strength) and penalty parameters. While recall improved, there were some trade-offs in other metrics like precision and accuracy, indicating a more aggressive classification towards the positive class to minimize false negatives.

* For XGBoost: Tuning brought about a more balanced improvement across several metrics. The ROC AUC, Precision, F1-Score, and Accuracy all showed increases. This suggests that optimizing parameters like n_estimators, max_depth, and criterion resulted in a more robust and generally better-performing model. Although recall saw a slight decrease in the tuned XGBoost compared to its untuned version, the overall gains in other key metrics indicate a stronger classifier with better generalization capabilities.

In essence, hyperparameter tuning successfully enhanced the models' ability to classify sentiments, either by directly improving the targeted recall or by boosting overall performance across multiple evaluation metrics.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
import shap
import numpy as np # Ensure numpy is imported if needed for X_test later

# Get shap values
best_xgb_model = xgb_grid.best_estimator_

# explainer = shap.Explainer(best_xgb_model.predict_proba, X_test) # If you want to explain probabilities
explainer = shap.TreeExplainer(best_xgb_model) # More efficient and specific for tree models like XGBoost

shap_values = explainer(X_test)

In [None]:
# Waterfall plot for first observation
shap.plots.waterfall(shap_values[0])

In [None]:
# Initialize JavaScript visualizations in notebook environment
shap.initjs()
# Forceplot for first observation
shap.plots.force(shap_values[0])

In [None]:
#Mean SHAP
shap.plots.bar(shap_values)

In [None]:
# Beeswarm plot
shap.plots.beeswarm(shap_values)

ML Model Used: XGBoost Classifier

I've used the XGBoost Classifier as the final prediction model for sentiment analysis.

What is XGBoost?

XGBoost (Extreme Gradient Boosting) is an advanced and highly efficient implementation of Gradient Boosting Machines (GBMs). It's an ensemble learning method that constructs a strong predictive model by combining the predictions of numerous weak, sequential decision tree models.

Key characteristics and why it was chosen:

1. Sequential Learning: XGBoost builds trees one after another, where each new tree aims to correct the errors (residuals) made by the ensemble of previous trees. This iterative error correction leads to highly accurate models.

2. Gradient Descent Optimization: It utilizes a gradient descent framework to systematically minimize the loss function during the tree-building process.

3. Regularization: A critical strength of XGBoost is its built-in L1 (Lasso) and L2 (Ridge) regularization techniques. These mechanisms penalize complex models, effectively preventing overfitting and improving the model's ability to generalize to unseen data.

4. Missing Value Handling: XGBoost can inherently handle missing values in the dataset by learning the best direction to take when a value is absent.

5. Performance and Speed: It's renowned for its computational efficiency and often achieves state-of-the-art performance in various machine learning competitions and real-world applications.

I chose XGBoost due to its robustness, speed, and consistently high performance, which was evident in its superior and more balanced evaluation metrics (e.g., higher ROC AUC, Precision, F1-Score, and Accuracy) after hyperparameter tuning, compared to Logistic Regression.
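
The sequential learning idea described above can be illustrated with a bare-bones boosting loop. This is a sketch of plain gradient boosting on squared error using scikit-learn regression trees, not XGBoost's actual algorithm, which additionally uses second-order gradient information, explicit regularization terms, and a logistic loss for classification. The toy data and all names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D regression problem: noisy sine wave
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant model
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(20):
    residuals = y_toy - prediction  # errors left by the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    # Each new shallow tree nudges the ensemble toward the remaining error
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

The training error falls with every added tree, which is also why boosted models need a learning rate, depth limits, and regularization to avoid fitting noise.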

Feature Importance using SHAP (SHapley Additive exPlanations)

To gain deep insights into why the XGBoost model makes specific sentiment predictions and to understand the overall importance of different features, I used SHAP values. SHAP is a powerful and unified framework for explaining the output of any machine learning model.

How SHAP Works:

SHAP values are rooted in Shapley values from cooperative game theory. For each prediction made by the model, SHAP calculates how much each feature contributes to pushing the prediction from the average (base) prediction to the actual prediction for that specific instance.

1. Fair Attribution: SHAP fairly distributes the credit (or blame) for a prediction among all features, considering all possible permutations in which features could have been introduced to the model.

2. Additive Explanations: The SHAP values for all features for a single prediction sum up to the difference between the model's output for that prediction and the base value.

3. Local and Global Interpretability:

* Local Explanations (e.g., Waterfall Plot, Force Plot): For an individual prediction, SHAP can show exactly which features are pushing the prediction higher (positive SHAP value) or lower (negative SHAP value), and by what magnitude.

* Global Explanations (e.g., Bar Plot, Beeswarm Plot): By aggregating SHAP values across many predictions, we can understand the overall importance of each feature for the entire model. Features with larger average absolute SHAP values are considered more influential.
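The fair-attribution and additivity properties above can be checked by hand on a tiny model. The sketch below is an assumption-laden illustration, not the algorithm the `shap` library runs: it enumerates every feature ordering (feasible only for a handful of features), uses a single all-zero baseline instead of a background dataset, and `toy_model` is a hypothetical function chosen to include an interaction term.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)       # start with every feature "absent"
        previous_output = model(current)
        for i in order:
            current[i] = x[i]          # introduce feature i
            new_output = model(current)
            phi[i] += new_output - previous_output
            previous_output = new_output
    return [total / len(orderings) for total in phi]


def toy_model(features):
    """Hypothetical model: one additive term plus one interaction term."""
    a, b, c = features
    return 2.0 * a + b * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

phi = shapley_values(toy_model, x, baseline)
base_value = toy_model(baseline)

print("SHAP-style contributions:", phi)
print("base value + contributions:", base_value + sum(phi))
print("model output:", toy_model(x))
```

The interaction term `b * c` is split equally between the two features involved, and the contributions add up to the gap between the model output and the base value, which is the additive property described in point 2.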

SHAP Plots and Their Interpretation:

1. Waterfall Plot (shap.plots.waterfall(shap_values[0]))

* Purpose: This plot explains a single prediction (e.g., for the first test observation).

* Interpretation:

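  1. Base Value (E[f(x)] = 0.584): This is the average (expected) output of the model over the data used by the explainer. It's the starting point for the explanation. Because the ending value below is negative, these numbers are on the model's raw (log-odds) scale rather than the probability scale.

  2. Ending Value (f(x) = -0.382): This is the actual prediction for the specific instance being explained, on the same scale as the base value.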

  3. Feature Contributions: Each bar represents a feature. Bars extending to the right (positive SHAP values) indicate features that push the prediction higher than the base value. Bars extending to the left (negative SHAP values) indicate features that push the prediction lower. The length of the bar shows the magnitude of the impact.

  4. Unique per Observation: Every observation has its own waterfall plot, showing how features moved that specific prediction away from the mean prediction. Large absolute SHAP values indicate a significant impact.

2. Force Plot (shap.plots.force(shap_values[0]))

* Purpose: Also explains a single prediction, offering an interactive visualization (though currently omitted due to environment limitations).

* Interpretation:

  1. Similar Information: It conveys similar information to the waterfall plot but in a linear, interactive format.

  2. Magnitude and Direction: Features pushing the prediction higher than the base value appear on one side (e.g., red, pushing right), and features pushing it lower appear on the other (e.g., blue, pushing left). The size of the feature's contribution indicates its impact.

  3. Interactive Relationship: It visually represents how features "compress" or "expand" the output to reach the final prediction from the base value.

3. Bar Plot (shap.plots.bar(shap_values))

* Purpose: Provides a global explanation by showing the overall average impact of each feature across the entire dataset.

* Interpretation: This plot typically ranks features by the average absolute SHAP value. The longer the bar, the more important the feature is to the model's predictions overall. It tells you which features are generally most influential, regardless of whether they push predictions higher or lower.

4. Beeswarm Plot (shap.plots.beeswarm(shap_values))

* Purpose: A more detailed global explanation that shows the distribution of SHAP values for each feature.

* Interpretation:

  1. Feature Importance: Features are typically ordered by their overall importance (like in the bar plot).

  2. Impact Direction and Distribution: Each dot represents a single data point's SHAP value for a particular feature. The color usually indicates the feature's actual value (e.g., red for high, blue for low).

  3. Insights: This plot helps identify:

     * Whether a high value of a feature generally leads to a high/low prediction, and vice-versa.

     * If the impact of a feature is consistent or varies across different instances.

     * Potential interaction effects with other features (though more directly seen in dependence plots).
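The two numbers reported for the waterfall plot (base value 0.584, ending value -0.382) can be read more concretely with a little arithmetic. The sketch below assumes both values are log-odds for the positive class, which is the usual raw output scale for a tree explainer on a binary XGBoost classifier; if the explainer was configured for probabilities, the conversion step does not apply.

```python
import math


def sigmoid(log_odds):
    """Convert a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


base_value = 0.584    # E[f(x)] from the waterfall plot
prediction = -0.382   # f(x) for the explained observation

# By the additive property, all SHAP values for this observation sum to this gap.
total_shap = prediction - base_value

base_prob = sigmoid(base_value)
pred_prob = sigmoid(prediction)

print(f"Sum of SHAP values: {total_shap:.3f}")
print(f"Base probability:   {base_prob:.3f}")
print(f"Predicted probability for this review: {pred_prob:.3f}")
```

Under that assumption, the features of this observation jointly pull the score down by about 0.966 log-odds, moving the implied probability of the positive class from roughly 0.64 to roughly 0.41.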

Why SHAP for XGBoost?

1. Tree-Specific Optimization: SHAP has highly optimized and accurate algorithms for tree-based models like XGBoost, making the computation efficient.

2. Consistency and Accuracy: SHAP values satisfy desirable properties (local accuracy, missingness, and consistency) that make them reliable for interpreting complex black-box models.

3. Comprehensive Insight: It offers both instance-level (local) and overall model (global) explanations, providing a holistic understanding of feature contributions. For sentiment analysis, this means not only knowing which words are important but also how certain words (or their frequency/TF-IDF scores) specifically push a review towards a positive or negative prediction for an individual review, and generally across all reviews.
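The global view described above, which is what the bar plot ranks by, boils down to the mean absolute SHAP value per feature. Here is a minimal sketch of that aggregation; the token names and the SHAP matrix are made-up illustrative numbers, not values from this project, and `global_importance` is a hypothetical helper.

```python
def global_importance(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value across all explained rows."""
    n_rows = len(shap_matrix)
    scores = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical TF-IDF tokens and per-review SHAP values (rows = reviews).
feature_names = ["good", "bad", "service"]
shap_matrix = [
    [0.40, -0.10, 0.05],
    [0.30, -0.60, -0.05],
    [-0.20, -0.50, 0.10],
]

ranking = global_importance(shap_matrix, feature_names)
for name, score in ranking:
    print(f"{name:>8}: {score:.3f}")
```

Taking the absolute value before averaging is what lets a feature rank highly even when it pushes some reviews up and others down; the sign information is what the beeswarm plot adds back.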

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in pickle or joblib format for deployment.


In [None]:
# Save the File

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
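
Both steps can be sketched as a save/load round trip with Python's built-in `pickle` module. Everything model-specific here is a placeholder: the keyword-weight dictionary and `predict` function stand in for the tuned classifier and its vectorizer, and `best_model.pkl` is a hypothetical file name. With a fitted estimator object, the same `pickle.dump` / `pickle.load` calls would be used on that object instead.

```python
import os
import pickle
import tempfile


def predict(model, reviews):
    """Toy scorer: sum keyword weights plus bias, label 1 if the score is positive."""
    labels = []
    for review in reviews:
        score = model["bias"] + sum(
            model["weights"].get(token, 0.0) for token in review.lower().split()
        )
        labels.append(1 if score > 0 else 0)
    return labels


def save_model(model, path):
    """Serialize the model object to disk."""
    with open(path, "wb") as file_handle:
        pickle.dump(model, file_handle)


def load_model(path):
    """Load a previously saved model object. Only unpickle files you trust."""
    with open(path, "rb") as file_handle:
        return pickle.load(file_handle)


# Hypothetical stand-in for the trained model.
model = {"weights": {"amazing": 1.5, "tasty": 1.0, "cold": -1.2, "slow": -0.8}, "bias": -0.1}
unseen_reviews = ["The biryani was amazing", "Cold food and slow delivery"]

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")
    save_model(model, model_path)
    restored_model = load_model(model_path)

original_preds = predict(model, unseen_reviews)
restored_preds = predict(restored_model, unseen_reviews)
print("Predictions match after reload:", original_preds == restored_preds)
```

Comparing predictions from the in-memory and reloaded model on the same unseen inputs is the sanity check: any mismatch points to a serialization or preprocessing problem rather than a modelling one.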

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***