<a href="https://colab.research.google.com/github/ahmedatia456123/Restaurant-Market-Analysis-Predictive-Pricing-Model/blob/main/Copy_of_Bangalore_Dining_Insights_%26_Strategic_Investment_Plan_%2B_Price_Prediction_with_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background


## Introduction to Analyzing the Zomato Dataset

Analyzing the Zomato dataset offers valuable insights into the restaurant scene in Bengaluru, a city bustling with over 12,000 eateries that cater to a diverse range of culinary tastes from around the world. As new restaurants continue to open daily, the industry remains dynamic, with growing demand that presents both opportunities and challenges. For newcomers, competing with well-established establishments can be tough, especially when many restaurants offer similar fare.

Bengaluru, known as the IT hub of India, has a large population that relies heavily on dining out due to busy lifestyles, making the study of restaurant demographics crucial. This analysis aims to uncover key patterns and preferences, including:

- **Popularity of Various Cuisines**: Identifying which types of food are favored in different localities.
- **Vegetarian Preferences**: Examining if certain areas show a strong inclination towards vegetarian dishes and whether these areas are predominantly inhabited by specific communities, such as Jains, Marwaris, or Gujaratis.
- **Restaurant Characteristics**: Evaluating factors such as the restaurant's location, pricing, and whether it follows a theme.
- **Local Cuisine Trends**: Determining which neighborhoods are renowned for particular types of cuisine and the factors driving these preferences.

By studying these aspects, we can gain a deeper understanding of the restaurant landscape in Bengaluru, helping new and existing restaurants better align with local tastes and demands.


# Objectives

The primary objective of this data analysis project is to identify the most promising investment opportunities in the restaurant and cafe sector in Bangalore. This involves analyzing various factors that influence the success and customer appeal of these establishments, and developing machine learning models to support pricing strategies and enhance customer experience.

**Key Goals:**

1. **Investment Analysis:**
   - **Identify High-Performing Establishments:** Analyze the data to pinpoint restaurants and cafes with high ratings, significant customer engagement, and strong financial performance indicators. Focus on key attributes such as location, type, and customer reviews to assess which establishments are likely to offer the best returns on investment.
   - **Evaluate Pricing Strategies:** Develop and implement machine learning models to predict optimal pricing for menu items based on factors such as location, type, and customer feedback. This will help establish competitive pricing that aligns with market expectations and maximizes profitability.

2. **Customer Experience Enhancement:**
   - **Analyze Customer Preferences:** Utilize the data to understand customer preferences regarding dish likes, cuisines, and other attributes. This will inform strategies to improve the dining experience by focusing on popular dishes, preferred cuisines, and services that enhance overall satisfaction.
   - **Improve Engagement and Accessibility:** Examine the impact of online ordering and table booking options on customer engagement and satisfaction. Determine how these features contribute to higher ratings and increased customer interactions.

3. **Machine Learning Model Development:**
   - **Predictive Pricing Model:** Build and refine machine learning models to forecast prices for menu items based on historical data, restaurant type, location, and customer reviews. This model will provide insights into setting competitive prices that attract customers while ensuring profitability.
   - **Enhancement Recommendations:** Generate actionable recommendations for improving customer experience based on predictive analytics and historical trends. This will include suggestions for menu adjustments, service enhancements, and strategic changes to attract and retain customers.


# Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=False)
from wordcloud import WordCloud
from geopy.geocoders import Nominatim
from folium.plugins import HeatMap
import folium
from tqdm import tqdm
import re
from collections import Counter
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import matplotlib.colors as mcolors
from sklearn.manifold import TSNE
from gensim.models import word2vec
import nltk



import warnings
warnings.filterwarnings('ignore')


In [None]:
!pip install geopy

In [None]:
!pip install kmodes

# Importing Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/My Drive/Zomato Geospatial Analysis/zomato.csv')

# Cleaning & Preparing

In [None]:
df.head(1)

In [None]:
print(f'shape of the data {df.shape[0]} rows and {df.shape[1]} columns')

In [None]:
# Checking columns dtypes
df.info()

The `rate` and `approx_cost(for two people)` columns have the wrong dtype: `rate` should be a float, and `approx_cost(for two people)` should be an integer.


## Cleaning

### Dealing with Duplicates

In [None]:
#checking duplication
df.duplicated().sum()

### Dealing with Missing Values

In [None]:
df.isnull().sum()

In [None]:
# First, strip the query string from each URL
df.url = df.url.apply(lambda x: x.split('?')[0])

After reviewing the data, I noticed that some restaurants have multiple records captured at different times. These repeated records are not very useful: there are only a few per restaurant, and they differ only in a higher vote count and a changed order of liked dishes.
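
As a minimal sketch of the idea, the toy data below (hypothetical addresses, names, and vote counts) keeps only the snapshot with the most votes per address. It uses `drop_duplicates(keep='last')`, which is an alternative to the `groupby(...).last()` call used in this notebook and keeps each surviving row intact.

```python
import pandas as pd

# Hypothetical snapshots: the restaurant at 'A St' was recorded twice with growing votes
toy = pd.DataFrame({
    'address': ['A St', 'A St', 'B Rd'],
    'name':    ['Alpha', 'Alpha', 'Beta'],
    'votes':   [10, 25, 7],
})

# Sort so the highest vote count comes last, then keep one row per address
latest = toy.sort_values('votes').drop_duplicates(subset='address', keep='last')
print(latest.sort_values('address').reset_index(drop=True))
```

Here the record with 25 votes survives for `'A St'`, on the assumption that a higher vote count means a more recent snapshot.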








In [None]:
# Keep a copy of the raw data
original_df = df.copy(deep=True)
# Sort by votes, group by address, and take the last entry of each group to keep the latest record.
# Note: GroupBy.last() returns the last non-null value of each column within the group.
df = df.sort_values('votes').groupby('address').last().reset_index()

In [None]:
df[df['name']=='Cafe Coffee Day'].url.iloc[2]

In [None]:
# Checking location missing values
df[df['location'].isnull()]

I could fill the `location` field using information from the `listed_in(city)` or `address` columns. However, I will drop these rows instead, because they are also missing values in most of their other columns.


In [None]:
# remove location with null values
df.dropna(subset=['location'],inplace=True)

Dealing with rates

In [None]:
print(f'Percentage of missing values in the rate column: {df["rate"].isnull().sum()/df.shape[0]*100:.2f}%')

A 15% missing value rate is too high to delete the affected rows. Instead, I will find a way to fill these missing values with meaningful data.


In [None]:
#Check what values are in rates
df['rate'].unique()

I need the ratings to be in numeric format. However, the current data contains `Null` values, as well as entries marked as `'NEW'` and `'-'`. To address this:

1. **Calculate Ratings from Reviews:** I will derive ratings from the `reviews_list` column where available.
   
2. **Handle Missing Ratings:** For restaurants with no reviews, I will use the median rating of restaurants in the same location to estimate their ratings.

3. **Preserve 'NEW' Information:** To retain the information about new establishments, I will create a new column named `is_new` with values `'Yes'` or `'No'`. This will facilitate visualization. Later, for modeling purposes, this column will be converted to binary values (`1` and `0`).


In [None]:
# First, create the is_new column
df['is_new'] = df.rate.apply(lambda x: 'Yes' if x=='NEW' else 'No')

In [None]:
print(f'Number of new establishments {(df.is_new == "Yes").sum()}')

In [None]:
print(f'NEW represent {100 * (df.is_new == "Yes").mean():.2f}% of the data')

In [None]:
# Change `None` values to `-` to make them easier to handle.

df.rate = df.rate.fillna('-')

I will now prepare the `reviews_list` column by changing its format from a string to a list. This will allow me to work with the data more effectively and extract useful information.


In [None]:
# Parse the reviews_list column from its string form into a Python list
# (ast.literal_eval only evaluates literals, unlike eval)
import ast
df.reviews_list = df.reviews_list.apply(ast.literal_eval)

In [None]:
def get_rating_for_NEW_res(row):
    # Check if the rate is neither 'NEW' nor '-'. If it's a valid rate, return it.
    if row.rate != 'NEW' and row.rate != '-':
        return row.rate

    # Retrieve the list of reviews from the row
    reviews_list = row.reviews_list

    # If there are no reviews available, return 'NEW' to indicate the restaurant is new
    if len(reviews_list) == 0:
        return 'NEW'

    # Initialize a list to hold the numeric ratings extracted from the reviews
    rating_ph = []

    # Iterate through each review in the reviews_list
    for review in reviews_list:
        # Skip reviews with None or invalid rating values
        if review[0] is None:
            continue

        # Extract and clean the rating value from the review string
        rate = float(review[0].lower().replace('rated', ''))
        rating_ph.append(rate)

    # If no valid ratings were extracted, return 'NEW'
    if len(rating_ph) == 0:
        return 'NEW'

    # Calculate the average rating from the extracted ratings and return it formatted as a string
    return f'{np.mean(rating_ph):.2f}/5'


In [None]:
df.rate = df.apply(lambda row: get_rating_for_NEW_res(row), axis=1)

In [None]:
#Check Number of new again
len(df[df.rate=='NEW'])

Next, I will fill the remaining 'NEW' values with the median rating of restaurants in the same location. Before that, I convert the numeric entries of the `rate` column to floats.

In [None]:
df['rate'] = df['rate'].apply(lambda x: float(x.split('/')[0]) if x != 'NEW' else x)

# Median rating per location (cast to float because the column still has object dtype)
avg_by_location = df[df.rate!='NEW'][['location','rate']].astype({'rate': float}).groupby('location').median()
def get_avg_rateing_for_NEW_res(row):
    """
    Function to fill the rating of restaurants with 'NEW' status.

    Parameters:
    row (pd.Series): A row from the DataFrame containing restaurant details.

    Returns:
    float or None: The rating of the restaurant if not 'NEW', otherwise the median rating
    of its location (None if the location has no rated restaurants).
    """
    # Check if the rate is not 'NEW'
    if row.rate != 'NEW':
        return row.rate

    # Retrieve the location of the restaurant
    loc = row.location

    # Check if the location exists in the index of per-location median ratings
    if loc not in avg_by_location.index:
        return None

    # Return the median rating for the location
    return avg_by_location.loc[loc].values[0]

In [None]:
df.rate = df.apply(lambda row:get_avg_rateing_for_NEW_res(row),axis=1)
# The remaining values represent a small percentage and cannot be engineered meaningfully, so I will remove them.
df.dropna(subset=['rate'],inplace=True)


In [None]:
df.dropna(subset=['cuisines'],inplace=True)
df.dropna(subset=['rest_type'],inplace=True)
df.dropna(subset=['approx_cost(for two people)'],inplace=True)
# Only the phone column still has missing values, and it is not needed, so drop it
df.drop(columns=['phone'],inplace=True)
df['approx_cost(for two people)'] = df['approx_cost(for two people)'].apply(lambda x: int(x.replace(',','')))

### Feature Engineering

**Dealing with restaurant types**

In [None]:
# Checking restaurant type values
df.rest_type.unique()

Some restaurants have multiple types. I will extract and separate these types for clarity.
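
As a minimal sketch of what this separation produces, the toy example below (hypothetical `rest_type` values) splits a comma-separated column into lists and then into `rest_type_0`, `rest_type_1`, ... columns, filling gaps with `'-'`. It uses vectorized pandas calls rather than the row-by-row function defined in this notebook, so treat it as an illustration of the target layout, not the implementation used here.

```python
import pandas as pd

toy = pd.DataFrame({'rest_type': ['Casual Dining, Bar', 'Cafe', 'Quick Bites']})

# Split on commas and strip whitespace to get one list per restaurant
type_lists = toy['rest_type'].str.split(',').apply(lambda parts: [p.strip() for p in parts])

# Expand the lists into rest_type_0, rest_type_1, ... and fill gaps with '-'
wide = pd.DataFrame(type_lists.tolist(), index=toy.index).fillna('-')
wide.columns = [f'rest_type_{i}' for i in range(wide.shape[1])]
print(wide)
```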

In [None]:
import re
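def separate_types(types,lower=False):
    """
    Function to extract and separate multiple types from a single column into individual
    columns named `<types>_0`, `<types>_1`, ... (modifies the global `df` in place).

    Parameters:
    types (str): The name of the column containing types.
    lower (bool): If True, lowercase each value and strip the word 'bangalore',
        standalone numbers, and non-word characters before storing it.

    Returns:
    None
    """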
    # Reset index of the DataFrame to ensure sequential indexing
    df.reset_index(drop=True, inplace=True)

    # Initialize a variable to track the maximum number of types for any restaurant
    max_types = 1

    # Iterate through each row of the DataFrame
    for i, row in df.iterrows():
        # Check if there are multiple types separated by commas
        if ',' in row[types]:
            # Split the types by comma and strip any extra whitespace
            list_type = [x.strip() for x in row[types].split(',')]
            # Update the cell with the list of types
            df.at[i, types] = list_type
            # Update the maximum number of types if this row has more
            if len(list_type) > max_types:
                max_types = len(list_type)
        else:
            # If only one type, store it as a list with a single element
            # (df.at stores the list object itself; df.loc may unpack a one-element list)
            df.at[i, types] = [row[types]]

    # Print the maximum number of types for a single restaurant
    print(f'Max {types} for single restaurants is {max_types}')

    # Iterate through each row and list of types
    for indx, t_list in enumerate(df[types].tolist()):
        list_len = len(t_list)
        # Iterate through each type in the list
        for j in range(list_len):
            # Create a new column name for each type
            column_name = types + '_' + str(j)
            # Add the new column if it doesn't exist
            if column_name not in df.columns:
                df[column_name] = '-'
            # Update the DataFrame with the type value
            if lower:
                # Lowercase, drop 'bangalore' and standalone numbers, then strip non-word characters
                text = re.sub(r'\b\d+(?!th|nd|rd)\b', '', t_list[j].lower().replace('bangalore', ''))
                text = re.sub(r'\W+', '', text)
                df.at[indx, column_name] = text
            else:
                df.at[indx, column_name] = t_list[j]


In [None]:
separate_types('rest_type')

Dealing With Ratings

In [None]:
sns.displot(df.rate)

There is another problem: some restaurants have a rating of 5 out of 5 but only a single vote. Such ratings are misleading and would affect the model, so I will use a weighted rating instead.

### Weighted Rating Formula

The Bayesian average formula for calculating a weighted rating is:

$$
\text{Weighted Rating} = \frac{v \cdot r + m \cdot c}{v + m}
$$

Where:

- $r$ is the average rating of the item.
- $v$ is the number of votes for the item.
- $m$ is the minimum number of votes required to be listed (threshold).
- $c$ is the mean rating across all items.

### How It Works

- **$v \cdot r$**: The total score for the item based on its own ratings.
- **$m \cdot c$**: A balancing factor that brings in the average of all ratings, weighted by the minimum number of votes required.
- **$v + m$**: The total weight, combining both the item's votes and the overall system's expectations.
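
To make the effect concrete, here is a small worked sketch of the formula. The numbers are hypothetical (a global mean of 3.7 and a threshold of 20 votes) and the standalone `weighted_rating` helper is only for illustration.

```python
def weighted_rating(v, r, m, c):
    """Bayesian average: pull a rating r backed by v votes toward the global mean c."""
    return (v * r + m * c) / (v + m)

# Hypothetical values: global mean c = 3.7, threshold m = 20
c, m = 3.7, 20
one_vote = weighted_rating(v=1, r=5.0, m=m, c=c)        # a single 5-star vote
many_votes = weighted_rating(v=2000, r=4.8, m=m, c=c)   # a well-established rating

print(round(one_vote, 2))    # 3.76
print(round(many_votes, 2))  # 4.79
```

A single 5-star vote is pulled almost all the way back to the global mean, while a rating backed by 2,000 votes barely moves.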



In [None]:
overall_avg = df.rate.mean()
min_votes = 20
print(f'Mean Rating = {overall_avg}')

In [None]:
def calculate_weighted_rating(row, overall_avg, min_votes):
    v = row['votes']
    r = row['rate']
    c = overall_avg
    m = min_votes

    # Calculate the weighted rating
    weighted_rating = (v * r + m * c) / (v + m)
    return round(weighted_rating,2)

# Apply the function to create the new column
df['weighted_rating'] = df.apply(lambda row: calculate_weighted_rating(row, overall_avg, min_votes), axis=1)

# If weighting shifts the rating by 0.5 or more, mark the original rating as invalid
df['is_rate_valid'] = df.apply(lambda row: 1 if abs(row.rate-row.weighted_rating) < 0.5 else 0, axis=1)

In [None]:
df.is_rate_valid.value_counts()

Dealing with `menu_item`

In [None]:
# Converting menu_item from its string form into a Python list
# (ast.literal_eval only evaluates literals, unlike eval)
import ast
df.menu_item = df.menu_item.apply(ast.literal_eval)

Dealing with cuisines

In [None]:
df.cuisines.unique()

In [None]:
separate_types('cuisines')

Dealing with dish_liked

In [None]:
# Replace missing dish_liked values with '-'
df.dish_liked = df.dish_liked.fillna('-')
separate_types('dish_liked')

In [None]:
# Add a column indicating whether the location is on a road or not
df['is_road'] = df.location.apply(lambda x: 'Yes' if 'Road' in x else 'No')

Dealing with location

In [None]:
location_df = df.location.value_counts().reset_index()

geolocator = Nominatim(user_agent="app" , timeout=None ) ## set timeout=None to get rid of timeout error

lat = [] ## define lat list to store all the latitudes
lon = [] ## define lon list to store all the longitudes


for name in location_df.location.tolist():
    location = geolocator.geocode(name+', Bangalore',country_codes='IN')

    if location is None:
        lat.append(np.nan)
        lon.append(np.nan)

    else:
        lat.append(location.latitude)
        lon.append(location.longitude)

location_df['lat'] = lat
location_df['lon']=lon
location_df.isna().sum()
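
The loop above sends one request per location name. Public geocoding services generally rate-limit clients, so a pause between calls and a cache for repeated names can help. The sketch below is one way to structure that; `geocode_names`, `FakeLocation`, and `fake_geocode` are hypothetical names introduced here, and the fake geocoder only exists so the example runs offline.

```python
import time

def geocode_names(names, geocode_fn, delay_seconds=1.0, suffix=', Bangalore'):
    """Geocode each name once, pausing between calls; returns {name: (lat, lon) or None}."""
    cache = {}
    for name in names:
        if name in cache:          # skip names already looked up
            continue
        result = geocode_fn(name + suffix)
        cache[name] = None if result is None else (result.latitude, result.longitude)
        time.sleep(delay_seconds)  # pause to stay under the service's rate limit
    return cache

# Stand-in for geolocator.geocode so the sketch runs offline (hypothetical data)
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude, self.longitude = latitude, longitude

def fake_geocode(query):
    known = {'BTM, Bangalore': FakeLocation(12.9166, 77.6101)}
    return known.get(query)

coords = geocode_names(['BTM', 'Nowhere', 'BTM'], fake_geocode, delay_seconds=0)
print(coords)
```

With the real geocoder, `geocode_fn` could be something like `lambda q: geolocator.geocode(q, country_codes='IN')`, matching the call used in the loop.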

In [None]:
# Inspect the two locations that could not be geocoded
location_df[location_df.lat.isna()]

In [None]:
index = location_df[location_df.lat.isna()].index[0]
location_df.loc[index,'lat'] = 13.0080
location_df.loc[index,'lon'] = 77.5800
index = location_df[location_df.lat.isna()].index[0]
location_df.loc[index,'lat'] = 13.0827
location_df.loc[index,'lon'] = 77.6785

In [None]:
#Merging to main df
df = df.merge(location_df,on='location',how='left')

In [None]:
# Number of specializations per restaurant
df['num_spec'] = df.rest_type.apply(lambda x: len(x))
# Number of liked dishes per restaurant
df['num_dish_liked'] = df.dish_liked.apply(lambda x: len(x))
# Number of reviews per restaurant
df['num_reviews'] = df.reviews_list.apply(lambda x: len(x))
# Number of menu items per restaurant
df['num_menu_item'] = df.menu_item.apply(lambda x: len(x))
# Number of cuisines per restaurant
df['num_cuisines'] = df.cuisines.apply(lambda x: len(x))

Encoding Data

Some columns need to be deleted, either because they do not add value to the prediction model or because alternative features engineered earlier already cover them, so keeping both would cause multicollinearity.

In [None]:
cols = ['url','address','name','rate','location','rest_type','dish_liked','cuisines','reviews_list','listed_in(city)','menu_item','count']
encoded_df = df.drop(cols,axis=1).copy()

Binary encoding

In [None]:
encoded_df['online_order'] = encoded_df['online_order'].replace({'Yes': 1, 'No': 0})
encoded_df['book_table'] = encoded_df['book_table'].replace({'Yes': 1, 'No': 0})
encoded_df['is_new'] = encoded_df['is_new'].replace({'Yes': 1, 'No': 0})
encoded_df['is_road'] = encoded_df['is_road'].replace({'Yes': 1, 'No': 0})

Target Encoding With Smoothing

In [None]:
def target_encode_multiple_columns_with_smoothing(columns, df, target_col, ignore_value='-', smoothing=1):
    """
    Performs target encoding with smoothing on a list of specified columns of the DataFrame,
    replacing the specified ignore value with None (NaN). The original columns are updated
    with the encoded values, while None values are left unchanged.

    Parameters:
    columns (list of str): The list of column names to encode.
    df (pd.DataFrame): The DataFrame containing the data.
    target_col (str): The name of the target column.
    ignore_value (str): The value to ignore and replace with None (default is '-').
    smoothing (int): The smoothing factor to balance between category mean and global mean.

    Returns:
    pd.DataFrame: DataFrame with the original columns updated with the smoothed encoded values,
                  and ignored values replaced with None.
    dict: Dictionary of dictionaries containing the feature values used in the encoding for each column,
          for mapping to test sets.
    """
    feature_mappings = {}

    for column_name in columns:
        # Ensure the column exists in the DataFrame
        if column_name not in df.columns or target_col not in df.columns:
            raise ValueError(f"Column '{column_name}' or '{target_col}' does not exist in the DataFrame.")

        # Replace the ignore_value with NaN (np.nan is unambiguous; in older pandas
        # versions replace(x, None) can pad with the previous value instead)
        df[column_name] = df[column_name].replace(ignore_value, np.nan)

        # Calculate the global mean of the target column
        global_mean = df[target_col].mean()

        # Calculate the mean and count for each category, excluding NaN values
        agg = df[df[column_name].notna()].groupby(column_name)[target_col].agg(['mean', 'count'])

        # Calculate the smoothed values using the formula:
        # smoothed_value = (mean * count + global_mean * smoothing) / (count + smoothing)
        agg['smoothed'] = (agg['mean'] * agg['count'] + global_mean * smoothing) / (agg['count'] + smoothing)

        # Store the smoothed values in the feature_mappings dictionary
        smoothed_values_dict = agg['smoothed'].to_dict()
        feature_mappings[column_name] = smoothed_values_dict

        # Map the smoothed values to the DataFrame, leaving None values unchanged
        df[column_name] = df[column_name].apply(lambda x: smoothed_values_dict[x] if pd.notna(x) else None)

    return df, feature_mappings
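
As a quick check of the smoothing formula used in the function above, the sketch below applies it to a tiny, hypothetical dataset (the `cuisine` and `cost` columns and their values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data: cost by cuisine
toy = pd.DataFrame({
    'cuisine': ['Cafe', 'Cafe', 'Cafe', 'Thai'],
    'cost':    [300,    500,    400,    1200],
})

smoothing = 2
global_mean = toy['cost'].mean()  # 600.0
agg = toy.groupby('cuisine')['cost'].agg(['mean', 'count'])

# Same formula as above: shrink each category mean toward the global mean
agg['smoothed'] = (agg['mean'] * agg['count'] + global_mean * smoothing) / (agg['count'] + smoothing)
print(agg['smoothed'].to_dict())  # {'Cafe': 480.0, 'Thai': 800.0}
```

The single Thai record (1200) is pulled strongly toward the global mean of 600, while the Cafe mean (400), backed by three records, moves less in relative terms.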


In [None]:
cols = ['rest_type_0','rest_type_1','cuisines_0','cuisines_1','cuisines_2','cuisines_3','cuisines_4','cuisines_5','cuisines_6','cuisines_7','dish_liked_0','dish_liked_1','dish_liked_2','dish_liked_3','dish_liked_4','dish_liked_5','dish_liked_6','listed_in(type)']
encoded_df, feature_mappings = target_encode_multiple_columns_with_smoothing(cols, encoded_df,'approx_cost(for two people)', smoothing=2)



In [None]:
def sum_columns_and_delete(df, columns, new_column_name):
    """
    Sums the specified columns into a new column and deletes the original columns.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    columns (list of str): The list of column names to sum.
    new_column_name (str): The name of the new column where the sum will be stored.

    Returns:
    pd.DataFrame: The DataFrame with the new summed column and original columns deleted.
    """
    # Ensure all columns exist in the DataFrame
    for column in columns:
        if column not in df.columns:
            raise ValueError(f"Column '{column}' does not exist in the DataFrame.")

    # Sum the specified columns into a new column
    df[new_column_name] = df[columns].sum(axis=1)

    # Drop the original columns
    df = df.drop(columns=columns)

    return df

In [None]:
cols=['rest_type_0','rest_type_1']
encoded_df = sum_columns_and_delete(encoded_df,cols,'rest_type')

cols=['cuisines_0','cuisines_1','cuisines_2','cuisines_3','cuisines_4','cuisines_5','cuisines_6','cuisines_7']
encoded_df = sum_columns_and_delete(encoded_df,cols,'cuisines')

cols=['dish_liked_0','dish_liked_1','dish_liked_2','dish_liked_3','dish_liked_4','dish_liked_5','dish_liked_6']
encoded_df = sum_columns_and_delete(encoded_df,cols,'dish_liked')



In [None]:
encoded_df

# Helper Functions

In [None]:
types_frq = None
# @title
def min_max_scaling(data, new_min=0, new_max=3500):
    """
    Function to perform min-max scaling on a dataset.

    Parameters:
    data (array-like): Input data to be scaled.
    new_min (float): The new minimum value of the scaled data. Default is 0.
    new_max (float): The new maximum value of the scaled data. Default is 3500.

    Returns:
    numpy.ndarray: Scaled data with values between new_min and new_max.
    """
    # Ensure that data is a numpy array
    data = np.array(data)

    # Compute the minimum and maximum of the original data
    original_min = np.min(data)
    original_max = np.max(data)

    # Apply min-max scaling formula to rescale the data
    scaled_data = ((data - original_min) / (original_max - original_min)) * (new_max - new_min) + new_min
    return scaled_data

def multi_bar_chart(temp_df, chart_title, axis_title, figuresize=(15, 6), types_frq=types_frq, freq=True):
    """
    Function to create a multi-bar chart with a line plot overlay.

    Parameters:
    temp_df (pd.DataFrame): DataFrame containing data to be plotted.
    chart_title (str): Title of the chart.
    axis_title (str): Title of the x-axis.
    figuresize (tuple): Size of the figure. Default is (15, 6).
    types_frq (pd.DataFrame): DataFrame with frequency data for the line plot.
    freq (bool): Flag to determine if the frequency line plot should be included. Default is True.

    Returns:
    None: Displays the chart.
    """
    # Create a figure and an axis for the primary y-axis
    fig, ax1 = plt.subplots(figsize=figuresize)

    # Plot the bar plot for 'approx_cost(for two people)' on the primary y-axis
    sns.barplot(x=temp_df.index, y=temp_df['approx_cost(for two people)'], ax=ax1, color='blue', alpha=0.8, label='Approximate Cost')

    # Plot the bar plot for 'votes' on the primary y-axis
    sns.barplot(x=temp_df.index, y=temp_df['votes'], ax=ax1, color='orange', alpha=0.8, label='Votes')

    # Optionally plot the scaled frequency line if 'freq' is True
    if freq:
        scaled_freq = min_max_scaling(types_frq['count'], new_max=temp_df.votes.max())
        sns.lineplot(x=types_frq[types_frq.columns[0]], y=scaled_freq, ax=ax1, color='black', label='Scaled Frequency')

    # Set x-axis label and primary y-axis label, and customize tick parameters
    ax1.set_xlabel(axis_title)
    ax1.set_ylabel('Median (Cost and Votes)', color='blue')
    ax1.tick_params(axis='y', labelcolor='blue')
    ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)

    # Create a secondary y-axis for the line plot
    ax2 = ax1.twinx()

    # Plot the line plot for 'weighted_rating' on the secondary y-axis
    sns.lineplot(x=temp_df.index, y=temp_df['weighted_rating'], ax=ax2, color='red', label='Weighted Rating')

    # Set secondary y-axis label and customize tick parameters
    ax2.set_ylabel('Weighted Rating', color='red')
    ax2.tick_params(axis='y', labelcolor='red')

    # Add legends for both y-axes
    ax1.legend(loc='upper left')
    ax2.legend(loc='upper right')

    # Set the title of the chart and display it
    plt.title(chart_title)
    plt.show()


def pie_chart(df, column_name,title,figsize=[10,10]):
  sns.set(style="whitegrid")
  count = df[column_name].value_counts()
  plt.figure(figsize=(figsize[0], figsize[1]))
  plt.pie(count, labels=count.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette("pastel"))
  plt.title(title)
  plt.show()


def count_chart_with_percentage(types_frq, column_name, title,x_title,range_list=None):
  # Calculate cumulative sum and percentage
  types_frq['cumsum'] = types_frq['count'].cumsum()
  types_frq['cumperc'] = 100 * types_frq['cumsum'] / types_frq['count'].sum()
  if range_list is not None:
    types_frq = types_frq[range_list[0]:range_list[1]]
  # Create a single figure: counts go on the left axis,
  # and the cumulative percentage goes on a twin right axis
  fig, ax1 = plt.subplots(figsize=(15, 8))

  # Bar plot for counts
  sns.barplot(x=types_frq[column_name], y=types_frq['count'], color='b', ax=ax1, label='Count')

  # Plot cumulative sum
  ax2 = ax1.twinx()
  ax2.plot(types_frq[column_name], types_frq['cumperc'], color='r', marker='o', label='Cumulative Percentage')
  ax2.set_ylabel('Cumulative Percentage')

  # Add labels and title
  ax1.set_title(title)
  ax1.set_xlabel(x_title)
  ax1.set_ylabel('Count')
  ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)
  ax1.legend(loc='upper left')
  ax2.legend(loc='upper right')

  plt.show()

def create_radar_chart(ax, data, categories, color,single_chart, label):
    """
    Creates a radar chart on the given axis.

    :param ax: Matplotlib axis to plot on.
    :param data: Series containing data values.
    :param categories: List of category labels.
    :param color: Color for the radar chart.
    :param single_chart: If True, fill the area inside the outline; otherwise draw a faint second outline.
    :param label: Label for the legend.
    """
    num_vars = len(categories)

    # Compute angle for each category
    angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
    angles += angles[:1]  # Complete the loop

    # Append the start value to the end to close the circle
    data = data.tolist()
    data += data[:1]

    # Create the radar chart
    if single_chart:
      ax.fill(angles, data, color=color, alpha=0.05)
    else:
      ax.plot(angles, data, color=color, alpha=0.25)

    ax.plot(angles, data, color=color, linewidth=2, label=label)

    # Set the labels and title
    ax.set_yticklabels([])
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories, rotation=45)
    ax.set_title(label, size=15, color=color, y=1.1)

def plot_radar_charts(df, single_chart=False, custom_title='Characteristics',scalling=True):
    """
    Generates and displays radar charts for each row of a DataFrame.

    :param df: DataFrame where each row represents data for a radar chart.
    :param single_chart: Boolean to decide if all data should be plotted in one radar chart or in a grid.
    :param custom_title: Custom title for the single radar chart.
    :param scalling: If True, min-max scale each column to the [0, 1] range before plotting.
    """
    categories = df.columns.tolist()
    if scalling:
      # Normalize the data to [0, 1] range
      scaler = MinMaxScaler()
      scaled_data = scaler.fit_transform(df)
      df_scaled = pd.DataFrame(scaled_data, columns=categories, index=df.index)
    else:
      df_scaled = df
    num_rows = len(df)

    # Generate a color palette with a unique color for each row
    palette = sns.color_palette("tab10", n_colors=num_rows)

    if single_chart:
        # Create a single radar chart for all data
        fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))

        # Plot each row data on the same radar chart with different colors
        for i, (index, row) in enumerate(df_scaled.iterrows()):
            color = palette[i % len(palette)]  # Use color from palette
            create_radar_chart(ax, row, categories, color,single_chart, label=index)

        # Add the legend
        ax.legend(loc='upper right', bbox_to_anchor=(1.1, 1.1))

        # Set the title for the single radar chart
        ax.set_title(custom_title, size=20, color='Black', y=1.1)
        plt.show()

    else:
        # Calculate the number of rows needed for a 2-column layout
        num_cols = 2
        num_rows_plot = int(np.ceil(num_rows / num_cols))

        # Create a figure with subplots
        fig, axs = plt.subplots(num_rows_plot, num_cols, subplot_kw=dict(polar=True), figsize=(15, num_rows_plot * 6))

        # Flatten the axes array if necessary
        axs = axs.flatten()

        # Adjust the layout
        plt.subplots_adjust(hspace=0.4, wspace=0.4)

        # Generate radar charts for each row
        for i, (index, row) in enumerate(df_scaled.iterrows()):
            ax = axs[i]
            color = palette[i % len(palette)]  # Use color from palette
            create_radar_chart(ax, row, categories, color,single_chart, label=index)

        # Hide any unused subplots
        for j in range(i + 1, len(axs)):
            axs[j].axis('off')
        plt.show()


# Exploratory Data Analysis

## General Analysis

### General

In [None]:
df.columns

In [None]:
# Build the frame in one step so replace() does not modify a slice of df in place
temp_df = df[['online_order','book_table','weighted_rating','votes','num_cuisines','num_menu_item','num_reviews','num_dish_liked','num_spec','is_road','approx_cost(for two people)']].replace({'Yes': 1, 'No': 0})
plt.figure(figsize=(15,10))
sns.heatmap(temp_df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0, fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

Setting aside fixed features such as location, the factors most strongly associated with price are the table booking option, the number of liked dishes, votes, and ratings.

**Which are top restaurants?**

In [None]:
plt.figure(figsize=(10,7))
chains=df['name'].value_counts()[:20]
sns.barplot(x=chains,y=chains.index,palette='deep')
plt.title("Most famous restaurants")
plt.xlabel("Number of outlets")

In [None]:
temp_df = df[['name','votes','weighted_rating','approx_cost(for two people)']].groupby('name').median().sort_values('votes',ascending=False)[:20]
types_frq = df['name'].value_counts().reset_index().rename(columns={'index':'name','count':'count'})
types_frq = types_frq[types_frq['name'].isin(temp_df.index)]
multi_bar_chart(temp_df,'Analysis of top 20 restaurants In Votes','Restaurant Names',types_frq=types_frq)

In [None]:
# Check listing types for Cafe Coffee Day, Onesta, Just Bake, Byg Brewski Brewing Company, Toit, and The Black Pearl
df[df.name.isin(['Cafe Coffee Day','Onesta','Just Bake','Byg Brewski Brewing Company', 'Toit','The Black Pearl'])][['name','listed_in(type)']].groupby('name')['listed_in(type)'].apply(Counter).dropna()



#### Overview

In Bangalore, several restaurant and cafe chains have established a strong presence through sheer outlet count, notably Café Coffee Day, Onesta, and Just Bake. By contrast, establishments such as Byg Brewski Brewing Company, Toit, and The Black Pearl have far fewer outlets yet lead in customer engagement and votes.

#### Key Insights

1. **Popularity and Outlet Distribution:**
   - **Café Coffee Day**: With a substantial number of outlets across various types—27 in Delivery, 23 in Dine-out, 27 in Cafes, and 19 in Desserts—Café Coffee Day holds a significant share of the market. Its wide distribution indicates a strong market presence and widespread consumer accessibility.
   - **Just Bake**: This chain focuses heavily on Delivery (31 outlets) and Desserts (31 outlets), showcasing its specialization in quick-service and dessert offerings. Its high number of delivery outlets suggests a strong demand for convenience and dessert options in Bangalore.
   - **Onesta**: Operating with 24 Delivery, 21 Dine-out, 21 Cafes, 9 Desserts, and 10 Buffet outlets, Onesta’s diversified approach highlights its strategy to cater to various dining preferences, from quick bites to buffet options.

2. **Customer Engagement and Votes:**
   - **Byg Brewski Brewing Company**: Despite having fewer outlets (with 2 each in Delivery, Dine-out, and Drinks & Nightlife), Byg Brewski stands out in terms of customer engagement and votes. This suggests that while it operates on a smaller scale, it has successfully created a high-value, engaging experience for its patrons.
   - **Toit**: Known for its Dine-out and Drinks & Nightlife options, Toit also has a strong presence in the market. Its limited number of outlets (one each in Dine-out and Drinks & Nightlife) indicates a focused strategy, resulting in significant customer engagement despite fewer locations.
   - **The Black Pearl**: This establishment’s success in Dine-out and Buffet categories, with seven outlets each, and Pubs and Bars (four outlets), demonstrates its ability to attract and engage customers through a niche offering.

#### Market Implications

1. **Diverse Customer Preferences**: The data suggests a variety of customer preferences in Bangalore. Chains like Café Coffee Day and Just Bake cater to high-volume, quick-service needs, while Byg Brewski and Toit offer specialized experiences that garner high engagement despite having fewer outlets. This indicates that customer engagement is not solely dependent on the number of outlets but also on the quality and type of experience offered.

2. **Strategic Focus and Engagement**: The success of Byg Brewski and Toit highlights the effectiveness of a focused approach. Establishments that concentrate on delivering a unique experience—such as Byg Brewski’s brewery setting or Toit’s brewery and dining experience—can achieve higher engagement and votes, even with a smaller footprint.

3. **Opportunities for Growth**: For investors, the findings suggest opportunities in both expanding popular chains and investing in niche concepts that offer high engagement. Chains with a high number of outlets should focus on enhancing customer experience to maintain and boost engagement. Conversely, specialized establishments can explore scaling their operations while preserving their unique appeal.

In summary, the Bangalore market shows a dynamic landscape where both high-volume chains and niche establishments play crucial roles. Understanding and leveraging customer preferences and engagement metrics can guide strategic decisions for growth and investment.


### List Type

In [None]:
pie_chart(df,'listed_in(type)','List Type',figsize=[10,10])

In [None]:
# @title
temp_df = df[['listed_in(type)','votes','weighted_rating','approx_cost(for two people)']].groupby('listed_in(type)').median().sort_values('votes',ascending=False)
types_frq = df['listed_in(type)'].value_counts().reset_index().rename(columns={'index':'name','count':'count'})
types_frq = types_frq[types_frq['listed_in(type)'].isin(temp_df.index)]
multi_bar_chart(temp_df,'Top general types in terms of votes','Types Names',types_frq=types_frq)

In [None]:

radar_df = temp_df.merge(df['listed_in(type)'].value_counts(),how='left',left_index=True, right_index=True)
radar_df = radar_df.rename(columns={'approx_cost(for two people)':'Price','weighted_rating':'Rating','votes':'Engagement','count':'Traffic'})
plot_radar_charts(radar_df,single_chart=True)

#### Overview

Based on the analysis of restaurant types and their performance in Bangalore, the data reveals notable trends in customer preferences, engagement, and ratings. Here's a summary of the findings and the implications for the market.

#### Key Findings

1. **Type Distribution**:
   - **Delivery**: 50%
   - **Dine-out**: 34%
   - **Desserts**: 7%

2. **Top Performing Categories**:
   - **Drinks & Nightlife**: Top category in terms of engagement and ratings.
   - **Buffet** and **Pubs and Bars**: Slightly lower in engagement and ratings but still notable.

3. **Lower Performing Categories**:
   - **Delivery**, **Dine-out**, and **Desserts**: These categories show significantly lower votes and ratings compared to others.

#### Market Implications

1. **Customer Preferences**:
   - **Drinks & Nightlife** establishments excel in terms of engagement and ratings. This suggests a strong consumer preference for venues that offer a vibrant social experience, which often includes entertainment and a lively atmosphere. The higher engagement indicates that customers are willing to invest time and money in these experiences.

2. **Delivery, Dine-out, and Desserts**:
   - These categories, despite having a high percentage of establishments, show lower engagement and ratings. This could imply that while these options are popular and widely available, they may not offer the unique or high-quality experiences that drive high engagement and satisfaction. The lower performance could be attributed to factors such as service quality, food variety, or overall dining experience.

#### Opportunities

1. **Enhancing the Drinks & Nightlife Experience**:
   - The high engagement and ratings for Drinks & Nightlife venues highlight a significant opportunity for investment in this category. Establishments that offer unique, high-quality experiences in this sector are likely to attract more customers and achieve higher ratings. Investing in creating memorable experiences and focusing on high-quality service can yield substantial returns.

2. **Improving Delivery and Dine-out Services**:
   - To address the lower engagement and ratings in Delivery, Dine-out, and Desserts, businesses should focus on improving the quality and uniqueness of their offerings. This could involve enhancing the delivery experience, offering exclusive or premium dining options, or innovating in dessert offerings. By addressing customer feedback and focusing on quality, these categories can potentially improve their performance and customer satisfaction.

3. **Diversifying Offerings**:
   - There is an opportunity to diversify offerings within the high-performing categories. For example, integrating elements of Drinks & Nightlife into Dine-out experiences could create a more appealing and engaging environment, thereby boosting ratings and customer engagement.

#### Conclusion

The analysis indicates a strong consumer preference for Drinks & Nightlife experiences, which presents a lucrative opportunity for investment. Meanwhile, there is potential for improvement in the Delivery, Dine-out, and Desserts categories. By focusing on enhancing the quality and uniqueness of these offerings, businesses can better align with customer preferences and drive higher engagement and satisfaction.

Investors should consider these insights to make informed decisions and capitalize on the opportunities within the Bangalore restaurant market.
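
The comparison between how common a listing type is and how much engagement it attracts can be reproduced with two short aggregations. This is a minimal sketch on synthetic values (the `toy` frame and its numbers are assumptions for illustration, not figures from the dataset); it uses median votes as the engagement measure, as the cells above do.

```python
import pandas as pd

# Synthetic listings (assumption: illustrative values only)
toy = pd.DataFrame({
    'listed_in(type)': ['Delivery', 'Delivery', 'Delivery', 'Dine-out',
                        'Dine-out', 'Drinks & nightlife'],
    'votes': [20, 35, 50, 60, 80, 900],
})

# Share of listings per type, as a percentage
share = toy['listed_in(type)'].value_counts(normalize=True).mul(100).round(1)
# Median votes per type as a simple engagement measure
engagement = toy.groupby('listed_in(type)')['votes'].median()

# Both series are indexed by listing type, so they align automatically
summary = pd.DataFrame({'share_pct': share, 'median_votes': engagement})
print(summary.sort_values('median_votes', ascending=False))
```

A type with a large `share_pct` but a small `median_votes` is widely available yet weakly engaging, which is the pattern described for Delivery above.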


### Cost

How Rating Affects Pricing

In [None]:
df.columns

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(19, 8))

# Scatter plot for rate vs cost
sns.scatterplot(y='approx_cost(for two people)',x='weighted_rating',data=df,hue='book_table',alpha=0.8,palette='Set2',ax=ax[0,0])
ax[0,0].set_title('Cost & Booking Table vs Weighted Rating')
ax[0,0].set_xlabel('Weighted Rating')
ax[0,0].set_ylabel('Cost')

# Scatter plot of cost vs weighted rating, colored by online ordering
sns.scatterplot(y='approx_cost(for two people)',x='weighted_rating',data=df,hue='online_order',alpha=0.8,palette='Set2',ax=ax[0,1])
ax[0,1].set_title('Cost & Online Order vs Weighted Rating')
ax[0,1].set_ylabel('Cost')
ax[0,1].set_xlabel('Weighted Rating')


sns.scatterplot(y='approx_cost(for two people)',x='votes',data=df,alpha=0.5,ax=ax[1,0])
ax[1,0].set_title('Cost vs Votes')
ax[1,0].set_xlabel('Votes')
ax[1,0].set_ylabel('Cost')

# Scatter plot of cost vs weighted rating, colored by on-road location
sns.scatterplot(y='approx_cost(for two people)',x='weighted_rating',data=df,hue='is_road',alpha=0.5,palette='Set2',ax=ax[1,1])
ax[1,1].set_title('Cost & On Road vs Weighted Rating')
ax[1,1].set_ylabel('Cost')
ax[1,1].set_xlabel('Weighted Rating')

plt.legend(loc='upper left')
# Adjust layout and show plot
plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(19, 8))

# Box plot of cost by table-booking availability
sns.boxplot(y='approx_cost(for two people)',x='book_table',data=df,ax=ax[0,0])
ax[0,0].set_title('Cost vs Booking Table')
ax[0,0].set_xlabel('Booking Table')
ax[0,0].set_ylabel('Cost')

# Box plot of cost by online-order availability
sns.boxplot(y='approx_cost(for two people)',x='online_order',data=df,ax=ax[0,1])
ax[0,1].set_title('Cost vs Online Order')
ax[0,1].set_ylabel('Cost')
ax[0,1].set_xlabel('Online Order')

# Box plot of cost by on-road location
sns.boxplot(y='approx_cost(for two people)',x='is_road',data=df,ax=ax[1,0])
ax[1,0].set_title('Cost vs Road')
ax[1,0].set_xlabel('On Road')
ax[1,0].set_ylabel('Cost')

# Hide the unused fourth panel
ax[1,1].axis('off')


# Adjust layout and show plot
plt.tight_layout()
plt.show()

## Market Analysis: The Impact of Booking Tables, Online Orders, and Location on Restaurant Pricing

### Introduction

Bangalore's restaurant and cafe market is characterized by diverse consumer preferences and business models. This analysis examines the influence of table booking, online ordering, location, and customer feedback (ratings and votes) on pricing strategies within this sector.

### Key Insights

1. **Booking Tables and Pricing**

   Booking tables at restaurants show a strong correlation with pricing. Establishments that offer table reservations tend to have higher price points. This is likely due to the premium experience associated with dine-in services, which includes not only the food but also the ambiance and personalized service. Consumers are often willing to pay more for the assurance of a reserved spot, especially in popular or high-demand venues.

2. **Online Orders and High-Cost Restaurants**

   High-cost restaurants often do not offer online ordering services. This is primarily because the experience of dining in such establishments includes being physically present to enjoy the environment and service, which cannot be replicated through home delivery. Additionally, the logistical challenges and potential compromise on food quality during delivery deter high-end restaurants from offering online orders.

3. **Impact of Location**

   Restaurants located on main roads, rather than within residential areas, tend to follow different pricing strategies. Being in a prime location allows these establishments to command higher prices due to increased visibility and accessibility. Moreover, such locations often attract a broader customer base, enabling them to cover a wider price range to cater to diverse economic segments.

4. **Votes and Ratings Influence**

   While customer votes (the number of reviews) have a limited impact on pricing, ratings (the quality of reviews) significantly influence price levels. An increase in ratings from 3.5 to 4.3 is typically associated with higher prices, as it reflects consumer satisfaction and perceived value. However, beyond a rating of 4.3, prices tend to decrease. This trend suggests that to achieve exceptionally high ratings, restaurants might lower prices to enhance value perception and attract more customers, creating a balance between cost and quality.

### Conclusion

The dynamics of Bangalore's restaurant market reveal that consumer preferences and business strategies are intricately linked. High-rated establishments often find themselves adjusting pricing to maintain quality and customer satisfaction. For investors, understanding these nuances can guide strategic decisions in the food service industry, emphasizing the importance of location, service offerings, and consumer engagement in determining pricing strategies.

### Recommendations for Investors

- **Focus on Location**: Invest in restaurants with strategic locations that naturally attract more foot traffic and can justify higher pricing.

- **Enhance Customer Experience**: Encourage businesses to offer table bookings to capitalize on consumers' willingness to pay for a guaranteed dining experience.

- **Balance Pricing and Quality**: For high-rating targets, focus on maintaining quality and adjusting prices to stay competitive without compromising the customer experience.

- **Leverage Customer Feedback**: Use ratings and reviews as critical data points for continuous improvement and strategic pricing adjustments.
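
The rating pattern described above (prices rising up to roughly 4.3 and easing beyond it) can be checked by grouping restaurants into rating bands and comparing median cost per band. The sketch below uses synthetic numbers chosen only to illustrate the mechanics; the band edges and the `toy` frame are assumptions, not values derived from the dataset.

```python
import pandas as pd

# Synthetic ratings and costs (assumption: illustrative values only)
toy = pd.DataFrame({
    'weighted_rating': [3.2, 3.6, 3.8, 4.0, 4.2, 4.25, 4.5, 4.7],
    'approx_cost(for two people)': [300, 500, 600, 900, 1200, 1400, 800, 700],
})

# Right-inclusive bands: (0, 3.5], (3.5, 4.0], (4.0, 4.3], (4.3, 5.0]
bins = [0, 3.5, 4.0, 4.3, 5.0]
labels = ['<=3.5', '3.5-4.0', '4.0-4.3', '>4.3']
toy['rating_band'] = pd.cut(toy['weighted_rating'], bins=bins, labels=labels)

# Median cost per band; observed=True skips bands with no restaurants
median_cost = (
    toy.groupby('rating_band', observed=True)['approx_cost(for two people)']
    .median()
)
print(median_cost)
```

Running the same grouping on `df` would show whether the median cost really peaks in the band just below 4.3.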


### Location

Which are the foodie areas?

In [None]:
# Calculate the value counts and reset the index
temp_df = df['location'].value_counts().reset_index()
temp_df.columns = ['location', 'count']
traffic_df = temp_df.copy()
count_chart_with_percentage(temp_df,'location','Top 30 Foodie Areas','Locations',[0,30])

What is the effect of location on the other variables?

In [None]:
# @title
temp_df = df[['location','votes','weighted_rating','approx_cost(for two people)']].groupby('location').median().sort_values('approx_cost(for two people)',ascending=False)[:40]
types_frq = df['location'].value_counts().reset_index().rename(columns={'index':'name','count':'count'})
types_frq = types_frq[types_frq['location'].isin(temp_df.index)]
radar_df = temp_df.head(3)
multi_bar_chart(temp_df,'Location vs Cost','Locations',types_frq=types_frq)

In [None]:
# @title
temp_df = df[['location','votes','weighted_rating','approx_cost(for two people)']].groupby('location').median().sort_values('votes',ascending=False)[:40]
types_frq = df['location'].value_counts().reset_index().rename(columns={'index':'name','count':'count'})
types_frq = types_frq[types_frq['location'].isin(temp_df.index)]
radar_df = pd.concat([radar_df,temp_df.head(3)])
multi_bar_chart(temp_df,'Location vs Votes','Locations',types_frq=types_frq)

In [None]:

size_map = {
    'Buffet': 300,
    'Cafes': 200,
    'Pubs & Bars': 100,
    'Delivery': 80,
    'Desserts': 60,
    'Dine-out': 50,
    'Drinks & nightlife': 50
}
temp_df = df.copy(deep=True)
temp_df['size'] = temp_df['listed_in(type)'].map(size_map)
plt.figure(figsize=(15,10))
palette = sns.color_palette('rainbow', n_colors=len(temp_df['listed_in(type)'].unique()))
sns.scatterplot(
    x='lon', y='lat',
    data=temp_df,
    hue='listed_in(type)',
    style='listed_in(type)',
    palette=palette,
    markers=['o', 's', 'D', 'X', 'v', '^', '<', '>'],
    s=300,
    size='size',
    sizes=(100, 300)
)

# Customize the plot
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Scatter Plot of Locations')
plt.legend(title='Listed In Type')
plt.grid(True)

# Show the plot
plt.show()

In [None]:
df['listed_in(type)'].unique()

Spatial Analysis

In [None]:
import folium
from folium.plugins import FastMarkerCluster, HeatMap

def generate_map(latitude=12.911276, longitude=77.604565, zoom_start=10):
  # Base map centered on Bangalore by default
  m = folium.Map(location=[latitude, longitude], zoom_start=zoom_start)
  return m

basemap = generate_map()
HeatMap(data=df[['lat','lon']],zoom=20).add_to(basemap)
# Create HTML for the vertical gradient bar legend
legend_html = '''
<div style="
    position: fixed;
    bottom: 50px; left: 50px; width: 40px; height: 250px;
    background: linear-gradient(to top, blue, green, yellow, red);
    border-radius: 8px;
    box-shadow: 0px 0px 12px rgba(0, 0, 0, 0.2);
    padding: 5px;
    font-size: 12px;
    font-family: Arial, sans-serif;
    z-index: 1000;
    display: flex;
    flex-direction: column;
    justify-content: space-between;
">
    <div style="text-align: center; font-weight: bold;">Intensity</div>
</div>
<div style="
    position: fixed;
    bottom: 50px; left: 95px; height: 250px;
    display: flex;
    flex-direction: column;
    justify-content: space-between;
    font-size: 12px;
    font-family: Arial, sans-serif;
    z-index: 1000;
    font-weight: bold;
">
    <div>High</div>
    <div>Medium</div>
    <div>Low</div>
</div>
'''

# Add legend to the map
basemap.get_root().html.add_child(folium.Element(legend_html))

# Save the heatmap (with legend) to an HTML file
basemap.save('map_with_vertical_gradient_legend_bold_labels.html')

# Add marker clustering (added after the save, so it only appears in the inline map)
FastMarkerCluster(data=df[['lat','lon']],zoom=20).add_to(basemap)

In [None]:
basemap

In [None]:
temp_df = df[['location','votes','weighted_rating','approx_cost(for two people)']].groupby('location').median().sort_values('votes',ascending=False)
temp_df = temp_df[temp_df.index.isin(traffic_df.head(3).location.tolist())]
radar_df = pd.concat([radar_df,temp_df])
radar_df = radar_df.merge(df['location'].value_counts(),how='left',left_index=True, right_index=True)
radar_df = radar_df.rename(columns={'approx_cost(for two people)':'Price','weighted_rating':'Rating','votes':'Engagement','count':'Traffic'})
radar_df.drop_duplicates(inplace=True)

In [None]:
plot_radar_charts(radar_df,single_chart=True,custom_title='Locations Characteristics')

Bangalore's dynamic culinary landscape offers various opportunities for strategic investment. By examining the distribution and characteristics of dining establishments across key neighborhoods, investors can make informed decisions.

## Centralized Nightlife Venues
- **Drinks & Nightlife**: Concentrated in the heart of Bangalore, these establishments cater to the city's vibrant, tech-savvy young professionals and expats seeking entertainment and social experiences. The central locations offer high visibility and access to a diverse clientele. Investment in these areas should focus on innovative concepts that combine local culture with international trends to attract a wide audience.

## Western Buffet Offerings
- **Buffet**: Predominantly located on the western side of the city center, buffets appeal to families and groups. These venues should emphasize diverse culinary options and value for money to attract the surrounding residential communities. Expanding in these areas can capitalize on the demand for family-friendly dining experiences.

## Emerging Restaurant Hubs
- **Whitefield, Electronic City, BTM Layout, HSR Layout, Marathahalli**: These neighborhoods account for nearly 30% of the city's restaurants, with Whitefield alone comprising 10%. Known for their vibrant youth culture and burgeoning tech industry, these areas are ideal for casual dining and quick-service restaurants. Investors should focus on creating hip, affordable venues that cater to students, young professionals, and tech workers.

## High-Spending Customer Zones
- **Sankey Road, Lavelle Road, Race Course Road, Infantry Road**: These affluent areas attract customers with higher spending power, making them suitable for upscale dining establishments. Restaurants here should offer gourmet cuisine, exceptional service, and a premium ambiance to meet the expectations of discerning diners. Innovative and exclusive dining concepts will thrive in these high-value zones.

## Engagement-Driven Destinations
- **Rajarajeshwari Nagar, Lavelle Road, Church Street**: Known for high engagement and vibrant atmospheres, these areas attract patrons seeking unique culinary experiences. Establishments should focus on creating interactive and memorable dining experiences, such as themed decor, live performances, or fusion menus that highlight both global and local flavors.

## Conclusion

Bangalore's diverse neighborhoods offer varied opportunities for restaurant investments. By aligning restaurant concepts with the unique characteristics and customer profiles of each area, investors can optimize market reach and profitability. Understanding local consumer behavior, leveraging the city's tech-driven innovation, and maintaining cultural relevance will be key to successful ventures in this bustling metropolis.
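
Claims such as "these neighborhoods account for nearly 30% of the city's restaurants" rest on per-location shares and their running total. A minimal sketch of that calculation follows, on a synthetic `location` series (the counts are invented for illustration and are an assumption, not dataset values).

```python
import pandas as pd

# Synthetic location column (assumption: illustrative counts only)
toy = pd.Series(
    ['Whitefield'] * 4 + ['BTM'] * 3 + ['HSR'] * 2 + ['Lavelle Road'],
    name='location',
)

# value_counts sorts locations from most to least frequent
counts = toy.value_counts()
share = counts * 100 / counts.sum()   # percentage of all restaurants
cum_share = share.cumsum()            # running total down the ranking

report = pd.DataFrame({'count': counts, 'share_pct': share, 'cum_pct': cum_share})
print(report)
```

Reading `cum_pct` down the table shows how many top locations are needed to reach a given share of the market.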


## Restaurant Types

**What are the different specializations in the food sector, and what are their characteristics?**

In [None]:
# Assuming df is already defined with columns 'rest_type_0' and 'rest_type_1'
rest_types = df['rest_type_0'].tolist() + df['rest_type_1'].tolist()
types_df = pd.DataFrame(rest_types, columns=['rest_type'])
types_df = types_df[types_df.rest_type != '-']
types_frq = types_df.value_counts().reset_index()
types_frq.columns = ['rest_type', 'count']
count_chart_with_percentage(types_frq,'rest_type','Most Common Restaurant Type','Restaurant Type')

In [None]:
pie_chart(types_df[types_df.rest_type.isin(types_frq.rest_type.head(10).tolist())],'rest_type','Rest Type',figsize=[10,10])

In [None]:
temp_df=df[['rest_type_0','approx_cost(for two people)','weighted_rating','votes','location']].rename(columns={'rest_type_0':'rest_type'})
temp_df= pd.concat([temp_df,df[['rest_type_1','approx_cost(for two people)','weighted_rating','votes','location']][df.rest_type_1 != '-'].rename(columns={'rest_type_1':'rest_type'})])
count_types = temp_df.drop('location',axis=1).groupby('rest_type').median().sort_values(by='approx_cost(for two people)').copy()
freq_types = temp_df['rest_type'].value_counts().reset_index()

In [None]:
multi_bar_chart(count_types,'Approximate Cost and Weighted Rating by Restaurant Type','Restaurant Types',types_frq=freq_types)

In [None]:
radar_df = count_types.merge(freq_types,how='left',left_on='rest_type', right_on='rest_type')
radar_df = radar_df.rename(columns={'approx_cost(for two people)':'Price','weighted_rating':'Rating','votes':'Engagement','count':'Traffic'})
radar_df = radar_df.sort_values('Price',ascending=False)
radar_df = pd.concat([radar_df.head(9),radar_df[radar_df.rest_type=='Quick Bites']])
radar_df = radar_df.set_index('rest_type')
radar_df.drop_duplicates(inplace=True)
plot_radar_charts(radar_df,single_chart=True,custom_title='Specialization Characteristics')

In [None]:
plot_radar_charts(radar_df,single_chart=False,custom_title='specializations  Characteristics')


## Market Overview

Bangalore's food scene is vibrant and diverse, encompassing a wide range of dining options. The market is primarily divided into three main categories:

1. **Quick Bites (40%)**: This category includes fast-food outlets and quick-service restaurants. While it constitutes a significant portion of the market, it lacks strong customer engagement.

2. **Casual Dining and Cafes**: Together with Quick Bites, these segments account for about two-thirds of the food market. Casual Dining and Cafes have higher customer engagement than Quick Bites, suggesting that diners prefer a mix of convenience and a pleasant dining atmosphere.

3. **Irani Cafes**: These are currently trending due to their engaging atmosphere, reasonable pricing, and high ratings (up to 4.4). Irani Cafes offer unique dining experiences, filling a niche in the market with limited competition.

## Investment Opportunities

1. **Fine Dining**: Although this sector is the most expensive and currently has limited customer interest, it presents an opportunity for experienced investors to develop exceptional dining experiences for high-end customers.

2. **Drinks and Nightlife**: Pubs, microbreweries, and clubs are popular among younger demographics and show strong demand. Investing in these venues can be lucrative due to their popularity and relatively good pricing.

## Customer Types and Best Locations

1. **Young Professionals and Millennials**: This group favors microbreweries, pubs, and clubs. Ideal locations for these venues are busy areas with vibrant nightlife.

2. **Families and Casual Diners**: Families prefer Casual Dining and Cafes, which are best situated in suburban areas with a community-oriented vibe.

3. **Wealthy and Special Occasion Diners**: Fine Dining establishments cater to high-income individuals and special events. These should be located in upscale neighborhoods or near cultural landmarks.

4. **Culture Lovers**: Irani Cafes appeal to those interested in traditional and cultural dining experiences. These cafes perform well in historical areas that complement their cultural theme.

## Conclusion

Bangalore’s food market presents diverse opportunities for investors. Irani Cafes stand out as a promising investment due to their unique appeal and limited competition. By aligning investment strategies with customer preferences and selecting appropriate locations, investors can effectively tap into the city’s varied dining needs.


### Location

**What is the best-known location for each restaurant type?**

In [None]:
from matplotlib.colors import to_hex
temp_df = temp_df.rename(columns={'votes': 'Count'})
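# Count restaurants per (location, rest_type), then take the per-type maximum of each column.
# Note: .max() is column-wise, so 'location' is the alphabetically last location for the type,
# not necessarily the location with the highest Count.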
grouped_df = (
    temp_df[['location', 'rest_type', 'Count']]
    .groupby(['location', 'rest_type'])
    .count()
    .reset_index()
    .groupby('rest_type')
    .max()
    .sort_values(['location', 'Count'], ascending=False)

)
def styling_df(grouped_df,col='location'):
  # Get unique locations and assign a unique color to each
  unique_locations = grouped_df[col].unique()
  colors = sns.color_palette("pastel", len(unique_locations))  # Generate unique colors
  location_colors = dict(zip(unique_locations, colors))  # Map each location to a color

  # Define a function to apply styling to the 'location' column
  def highlight_locations(s):
      return [f'background-color: {to_hex(location_colors[val])}' for val in s]

  # Style the DataFrame
  if 'Price' in grouped_df.columns:

    styled_df = (
        grouped_df.style
        .background_gradient(subset='Count', cmap='YlGnBu')
        .background_gradient(subset='Price', cmap='YlGnBu')
        .background_gradient(subset='Rating', cmap='YlGnBu')
        .apply(highlight_locations, subset=[col])
    )
  else:
    styled_df = (
        grouped_df.style
        .background_gradient(subset='Count', cmap='YlGnBu')
        .apply(highlight_locations, subset=[col])
    )

  # Display the styled DataFrame
  return styled_df

styled_df = styling_df(grouped_df)
styled_df
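
Because `.groupby('rest_type').max()` takes the maximum of every column independently, the `location` shown for a type is the alphabetically last one rather than the one with the most restaurants. One way to pick the location with the highest count per type is `idxmax`. The sketch below is an alternative offered under that assumption, on synthetic data; it is not the method used to produce the table above, and its output on the real data may differ from that table.

```python
import pandas as pd

# Synthetic restaurants (assumption: illustrative rows only)
toy = pd.DataFrame({
    'location': ['BTM', 'BTM', 'Whitefield', 'Whitefield', 'Whitefield', 'BTM'],
    'rest_type': ['Cafe', 'Cafe', 'Cafe', 'Pub', 'Pub', 'Pub'],
})

# Number of restaurants for each (location, rest_type) pair
counts = (
    toy.groupby(['location', 'rest_type'])
    .size()
    .reset_index(name='Count')
)

# For each rest_type, keep the whole row where Count is largest
top = counts.loc[counts.groupby('rest_type')['Count'].idxmax()].set_index('rest_type')
print(top)
```

Here `Cafe` maps to `BTM` (two cafes) even though `Whitefield` sorts later alphabetically, which is the difference between the two approaches.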

### Location and Socioeconomic analysis

**1. Yeshwantpur**  
Yeshwantpur stands out for its wide range of affordable food options, showcasing significant diversity in restaurant types. This diversity, combined with budget-friendly pricing, appeals primarily to the middle class, who are looking for varied food experiences without exceeding average spending limits. The variety here supports a broad spectrum of tastes and preferences, making it a popular choice for those with moderate budgets.

**2. Wilson Garden**  
In contrast, Wilson Garden has a more limited range of food options. Its primary appeal lies in its affordability and focus on fast food. This characteristic makes it attractive to budget-conscious diners seeking quick, economical meals. The limited diversity in food types reflects its niche focus on low-budget, fast-food offerings.

**3. Whitefield**  
Whitefield offers a diverse range of food options catering to mid-to-high budget ranges. This neighborhood serves as a midpoint between middle-class and high-class food experiences. The area’s offerings appeal to both upper-middle-class and lower-high-class patrons. Additionally, Whitefield’s vibrant nightlife contributes to its attractiveness, drawing in those seeking both quality food and evening entertainment.

**4. West Bangalore**  
Known for its food trucks, West Bangalore caters to a wide demographic, including people of various ages, classes, and backgrounds. The specialization in food trucks provides a unique, casual food experience that appeals to a large and diverse audience. This format supports the community's varied preferences and enhances accessibility.

**5. Sankey Road and Lavelle Road**  
Both Sankey Road and Lavelle Road are renowned for their exclusive food experiences. These areas often feature unique or high-end restaurants that are sometimes the only outlets of their kind in the city. This exclusivity attracts high-class and high-ticket customers who are looking for premium food experiences. Their reputation for offering one-of-a-kind food options further solidifies their appeal to affluent patrons.

**6. Kalyan Nagar**  
Kalyan Nagar predominantly attracts upper-middle-class residents. The area provides a balanced mix of food options that cater to this demographic’s preferences and budget. It offers a comfortable and accessible food experience that aligns with the lifestyle of upper-middle-class individuals.

### Summary

Each neighborhood in Bangalore has its unique characteristics that cater to different segments of the food market:

- **Yeshwantpur**: Diverse and affordable, appealing to the middle class.
- **Wilson Garden**: Budget-friendly and fast food, suited for economical diners.
- **Whitefield**: Mid-to-high budget, attracting both upper-middle-class and lower-high-class individuals, with a focus on nightlife.
- **West Bangalore**: Food trucks offering diverse options, drawing a broad demographic.
- **Sankey Road and Lavelle Road**: Exclusive food experiences for high-class and high-ticket customers.
- **Kalyan Nagar**: Upper-middle-class neighborhood with a balanced food scene.



#### Dishes

In [None]:
def get_top_dish(df,num=3):
  combined_list = [item for sublist in df['dish_liked'] for item in sublist if item != '-']
  count_unique = Counter(combined_list)
  sorted_count_unique = dict(count_unique.most_common(num))
  return sorted_count_unique

temp_df=df[['rest_type_0','approx_cost(for two people)','weighted_rating','votes','location','dish_liked']].rename(columns={'rest_type_0':'rest_type'})
temp_df= pd.concat([temp_df,df[['rest_type_1','approx_cost(for two people)','weighted_rating','votes','location','dish_liked']][df.rest_type_1 != '-'].rename(columns={'rest_type_1':'rest_type'})])
rest_type_list = styled_df.index.tolist()
spec_df = pd.DataFrame(columns=['rest_type','1st', '2nd', '3rd'])
for rest in rest_type_list:
  output = get_top_dish(temp_df[temp_df.rest_type==rest])
  if len(output)==0:
    continue
  # Pad with '-' so types with fewer than three liked dishes do not raise an IndexError
  dishes = list(output.keys()) + ['-'] * (3 - len(output))
  t_df = pd.DataFrame([[rest] + dishes], columns=['rest_type','1st', '2nd', '3rd'])
  spec_df = pd.concat([spec_df,t_df])
spec_df = spec_df.set_index('rest_type')
print('Most preferred Dishes For each type')
spec_df

## Cuisines

In [None]:
cols = ['cuisines_0','votes','approx_cost(for two people)','weighted_rating','location']
temp_df = df[cols].copy(deep=True)
temp_df.rename(columns={'cuisines_0':'cuisines'},inplace=True)
for i in range(1,8):
  cols = [f'cuisines_{i}','votes','approx_cost(for two people)','weighted_rating','location']
  temp_df = pd.concat([temp_df,df[cols].rename(columns={f'cuisines_{i}':'cuisines'})])
temp_df = temp_df[temp_df.cuisines != '-']
count_values = temp_df.cuisines.value_counts().reset_index()
count_values.columns = ['cuisines','count']

count_chart_with_percentage(count_values,'cuisines','Most common cuisines','Cuisines',[0,30])
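The loop above stacks the `cuisines_0` … `cuisines_7` columns into a single long `cuisines` column. As a hedged alternative, the same wide-to-long reshape can be sketched with `DataFrame.melt`; the toy frame below is made up for illustration and only mimics the column naming.

```python
import pandas as pd

# Toy data that mimics the notebook's column naming (values are invented)
toy = pd.DataFrame({
    'cuisines_0': ['North Indian', 'Chinese'],
    'cuisines_1': ['Chinese', '-'],
    'votes': [120, 45],
})

cuisine_cols = [c for c in toy.columns if c.startswith('cuisines_')]

# One row per (restaurant, cuisine slot); '-' marks an empty slot
long_df = (
    toy.melt(id_vars=['votes'], value_vars=cuisine_cols, value_name='cuisines')
       .drop(columns='variable')
       .query("cuisines != '-'")
       .reset_index(drop=True)
)

count_values = long_df['cuisines'].value_counts()
print(count_values)
```

Each restaurant contributes one row per cuisine it serves, so its votes, cost, and rating are counted once for every listed cuisine, exactly as in the concatenation loop.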

In [None]:
multi_bar_chart(temp_df.drop('location',axis=1).groupby('cuisines').median().sort_values(by='approx_cost(for two people)'),'Approximate Cost and Weighted Rating by Cuisines','Cuisines Type',figuresize=(25,10),types_frq=count_values)

In [None]:
multi_bar_chart(temp_df.drop('location',axis=1).groupby('cuisines').median().sort_values(by='votes'),'Approximate Cost and Weighted Rating by Cuisines','Cuisines Type',figuresize=(25,10),types_frq=count_values)

In [None]:
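# errors='ignore' keeps this working whether or not the chart helper added 'cumsum'/'cumperc' to count_values
radar_df = temp_df.drop('location',axis=1).groupby('cuisines').median().sort_values(by='votes').merge(count_values.drop(['cumsum','cumperc'],axis=1,errors='ignore'),how='left',left_index=True, right_on='cuisines')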
radar_df = radar_df.rename(columns={'approx_cost(for two people)':'Price','weighted_rating':'Rating','votes':'Engagement','count':'Traffic'})
radar_df = radar_df.sort_values('Price',ascending=False)
radar_df = pd.concat([radar_df.head(7),radar_df[radar_df.cuisines.isin(['North Indian','Chinese','South Indian','African','Singaporean'])]])
radar_df = radar_df.set_index('cuisines')
radar_df.drop_duplicates(inplace=True)
plot_radar_charts(radar_df,single_chart=True,custom_title='Cuisine Characteristics')

In [None]:
plot_radar_charts(radar_df,single_chart=False,custom_title='Cuisine Characteristics')

# Investment Analysis Report: Foreign and Local Cuisines in Bangalore

## Introduction

This report examines the market dynamics of various cuisines in Bangalore, focusing on engagement levels, pricing strategies, and investment potential. Bangalore's cosmopolitan nature creates a diverse culinary landscape, offering opportunities for both local and foreign cuisines to thrive.

## Cantonese Cuisine

**Analysis:**
- **Engagement:** High
- **Pricing:** Premium
- **Target Audience:** Affluent individuals seeking exclusive dining experiences.
- **Profitability:** Significant due to high pricing, but the customer base is niche.

**Recommendation:**
- **Investment Strategy:** Invest in Cantonese cuisine by emphasizing targeted marketing and exclusive dining experiences. The high price point limits the customer base but ensures high returns per customer.
- **Explanation:** High engagement despite premium pricing indicates strong demand among affluent consumers who value authentic experiences. This niche market can be highly profitable but requires targeted strategies to attract and retain customers.

## German Cuisine

**Analysis:**
- **Engagement:** High
- **Spreading:** Low
- **Pricing:** Lower than Cantonese
- **Target Audience:** Middle-income groups looking for authentic yet affordable experiences.
- **Profitability:** Balanced between exclusivity and mass appeal.

**Recommendation:**
- **Investment Strategy:** Focus on providing authentic experiences at competitive prices to attract a broad demographic.
- **Explanation:** German cuisine’s lower price point and moderate engagement suggest it is accessible to a larger audience compared to high-end options. This balance can attract middle-income groups while maintaining profitability.

## Sri Lankan, Parsi, and Russian Cuisines

**Analysis:**
- **Engagement:** High
- **Spreading:** Low
- **Pricing:** Medium
- **Target Audience:** Diners interested in cultural diversity and unique flavors.
- **Profitability:** Steady returns with a focus on authenticity and distinctive experiences.

**Recommendation:**
- **Investment Strategy:** Emphasize cultural authenticity and unique offerings to maintain high engagement.
- **Explanation:** Medium pricing combined with high engagement indicates a strong interest in diverse culinary experiences. By highlighting authenticity and unique flavors, these cuisines can sustain their appeal and provide steady returns.

## Singaporean Cuisine

**Analysis:**
- **Engagement:** Growing
- **Spreading:** Low
- **Pricing:** Moderate
- **Target Audience:** Indian audiences interested in diverse culinary experiences.
- **Profitability:** Promising, with increasing appeal due to unique flavor profiles and fusion influences.

**Recommendation:**
- **Investment Strategy:** Leverage strategic marketing, including culinary festivals and pop-up events, to enhance visibility and engagement.
- **Explanation:** Singaporean cuisine's emerging popularity aligns with growing interest in diverse food options. Strategic marketing and events can capitalize on this trend and boost engagement.

**Supporting Insights:**
- [Popularising Singaporean Cuisine Among Indian Audiences](https://bwhotelier.com/article/popularising-singaporean-cuisine-among-indian-audiences-445126)

## Foreign vs. Local Cuisines

**Analysis:**
- **Foreign Cuisines:** Generally attract high engagement and can command premium pricing. There is strong market openness to international flavors.
- **Local Cuisines:** Although they account for 30% of the restaurant market, they face saturation and reduced engagement as consumers seek novelty.

**Recommendation:**
- **Investment Strategy:** For local cuisines, focus on innovative approaches such as new regional specialties or fusion dishes.
- **Explanation:** The saturation of local cuisines like Northern and Southern Indian reduces consumer engagement. Introducing novel options can rejuvenate interest and offer a competitive edge.

## Chinese Cuisine

**Analysis:**
- **Engagement:** Low
- **Spreading:** High
- **Pricing:** Variable, often affordable
- **Target Audience:** Wide-ranging, with a taste for fusion flavors.
- **Popularity:** Ranks second to Northern Indian cuisine.

**Recommendation:**
- **Investment Strategy:** Continue investing in Chinese cuisine by leveraging its established popularity and introducing innovative dishes.
- **Explanation:** The strong market presence and adaptability of Chinese cuisine, coupled with its affordability, contribute to its sustained popularity. Innovative offerings can further enhance its market position.

**Supporting Insights:**
- [The Rise of Chinese Cuisine in India](https://hongskitchen.in/rise-of-chinese-cuisine-in-india)

## African Cuisine

**Analysis:**
- **Engagement:** Low but with potential for growth
- **Spreading:** Low
- **Pricing:** Variable
- **Target Audience:** Health-conscious and adventurous diners.
- **Profitability:** High potential due to low competition and alignment with current health trends.

**Recommendation:**
- **Investment Strategy:** Develop a robust marketing strategy focusing on cultural festivals and events to increase engagement and build a loyal customer base.
- **Explanation:** Despite currently low engagement, African cuisine's rich flavors and health-oriented offerings align with consumer trends towards diverse and healthy eating. Effective marketing can tap into this potential.

**Supporting Insights:**
- [India, Africa Surpass China in Reshaping Food Landscape](https://www.foodbusinessnews.net/articles/25632-india-africa-surpass-china-in-reshaping-food-landscape)
- [The Rise of African Food](https://www.tenderstem.co.uk/news-events/news/the-rise-of-african-food)

## Local Cuisines: Northern and Southern Indian

**Analysis:**
- **Market Share:** 30% of restaurants
- **Competition:** High
- **Engagement:** Reduced due to saturation

**Recommendation:**
- **Investment Strategy:** Innovate within local cuisines by introducing new regional specialties or fusion dishes.
- **Explanation:** The high level of competition and saturation in local cuisines necessitates differentiation. Innovation is key to capturing consumer interest and staying relevant in the market.

## Journal Article Insights

**Study:**
- **Title:** "Restaurants in Little India, Singapore: A Study of Spatial Organization and Pragmatic Cultural Change"
- **Findings:** Offers insights into how restaurants adapt to cultural changes and spatial organization.

**Application:**
- **Strategy:** Apply insights to organize and position restaurants in Bangalore effectively. Understanding spatial and cultural adaptations will enhance the effectiveness of foreign cuisine offerings.
- **Explanation:** Adapting restaurant setups based on cultural and spatial insights can improve market positioning and customer appeal.

## Conclusion

Bangalore's diverse food scene presents substantial investment opportunities in both foreign and local cuisines. While foreign cuisines like Cantonese, German, Singaporean, and African offer promising prospects due to their unique appeal and engagement, local cuisines require innovative approaches to capture consumer interest in a saturated market. Strategic investments in marketing and unique culinary experiences are crucial for success.

This report incorporates information on Singaporean cuisine and insights from relevant studies, providing a comprehensive overview of the culinary landscape in Bangalore.



### Location

In [None]:
temp_df = temp_df.rename(columns={'votes': 'Count'})
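# Count restaurants per (location, cuisine), then reduce to one row per cuisine.
# Caveat: .max() is applied to each column separately, so 'location' is the
# alphabetically last location and 'Count' the largest count for that cuisine;
# the two values do not necessarily come from the same row.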
grouped_df = (
    temp_df[['location', 'cuisines', 'Count']]
    .groupby(['location', 'cuisines'])
    .count()
    .reset_index()
    .groupby('cuisines')
    .max()
    .sort_values(['location', 'Count'], ascending=False)
    .merge(temp_df[['location','approx_cost(for two people)']].groupby('location').median().rename(columns={'approx_cost(for two people)':'Price'}),how='left',left_on='location', right_index=True)
    .merge(temp_df[['location','weighted_rating']].groupby('location').median().rename(columns={'weighted_rating':'Rating'}),how='left',left_on='location', right_index=True)

)

styled_df = styling_df(grouped_df)
styled_df
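One thing to keep in mind with `.groupby('cuisines').max()` is that pandas takes the maximum of every column separately, so the resulting `location` need not be the location where the cuisine is most common. A minimal sketch of an `idxmax`-based alternative follows; the toy data and the names `counts`, `top_rows`, and `top_location` are assumptions for illustration only.

```python
import pandas as pd

# Invented toy data: Chinese is most common in BTM, Italian in Whitefield
toy = pd.DataFrame({
    'location': ['BTM', 'BTM', 'Whitefield', 'Whitefield', 'Whitefield'],
    'cuisines': ['Chinese', 'Chinese', 'Chinese', 'Italian', 'Italian'],
})

# Number of restaurants per (location, cuisine) pair
counts = toy.groupby(['location', 'cuisines']).size().reset_index(name='Count')

# For each cuisine, keep the whole row that holds the largest Count
top_rows = counts.loc[counts.groupby('cuisines')['Count'].idxmax()]
top_location = top_rows.set_index('cuisines')

# Column-wise max, for comparison: location and Count are reduced independently
columnwise = counts.groupby('cuisines').max()

print(top_location)
print(columnwise)
```

In this toy example the row-wise version reports BTM for Chinese, while the column-wise maximum reports Whitefield simply because it sorts last alphabetically.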

In [None]:
cols = ['cuisines_0','votes','approx_cost(for two people)','weighted_rating','location','dish_liked']
temp_df = df[cols].copy(deep=True)
temp_df.rename(columns={'cuisines_0':'cuisines'},inplace=True)
for i in range(1,8):
  cols = [f'cuisines_{i}','votes','approx_cost(for two people)','weighted_rating','location','dish_liked']
  temp_df = pd.concat([temp_df,df[cols].rename(columns={f'cuisines_{i}':'cuisines'})])
cuisines_list = styled_df.index.tolist()
cuisine_spec_df = pd.DataFrame(columns=['cuisines','1st', '2nd', '3rd'])
for rest in cuisines_list:

  output = get_top_dish(temp_df[temp_df.cuisines==rest])
  t_df = pd.DataFrame(columns=['cuisines','1st', '2nd', '3rd'])
  if len(output)==0:
    continue
  # Pad with '-' so a cuisine with fewer than three liked dishes does not raise an IndexError
  top_keys = (list(output.keys()) + ['-','-','-'])[:3]
  t_df.loc[0,'1st'] = top_keys[0]
  t_df.loc[0,'2nd'] = top_keys[1]
  t_df.loc[0,'3rd'] = top_keys[2]
  t_df.loc[0,'cuisines'] = rest
  cuisine_spec_df = pd.concat([cuisine_spec_df,t_df])
cuisine_spec_df = cuisine_spec_df.set_index('cuisines')
print('Most preferred Dishes For each cuisines')
cuisine_spec_df.head(60)

Clustering analysis to group restaurants/cafes into classes. I will use K-Means for its performance and simplicity.


In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(encoded_df.corr(),annot=True)

In [None]:
from sklearn.preprocessing import StandardScaler
temp_df = encoded_df[['book_table','dish_liked','rest_type','cuisines','votes','approx_cost(for two people)']].copy(deep=True)
scaler = StandardScaler()
temp_df = scaler.fit_transform(temp_df)

In [None]:
from kmodes.kmodes import KModes
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score,calinski_harabasz_score

silhouette_avg_score=[]
wcss = []
for n_clusters in range(2,8):
    # n_init=50 keeps the best of 50 initializations; random_state=42 makes the result reproducible
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=50,random_state=42)
    #kmeans = KModes(n_clusters=n_clusters, init='Huang', n_init=50,random_state=42)
    #kmeans.fit(temp_df)
    #clusters = kmeans.predict(temp_df)
    clusters = kmeans.fit_predict(temp_df)
    silhouette_avg = silhouette_score(temp_df, clusters)
    silhouette_avg_score.append(silhouette_avg)
    wcss.append(kmeans.inertia_)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg, "Inertia",kmeans.inertia_)

In [None]:
plt.plot(range(2,8),wcss)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia score')
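The elbow in the inertia curve is read by eye, so it can help to cross-check it against the silhouette scores collected in the same loop. The sketch below shows one way to do that on synthetic data; `make_blobs`, the variable names, and the choice of `n_init=10` are assumptions for illustration, not the notebook's data or settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette over all samples: higher means tighter, better separated clusters
    scores[k] = silhouette_score(X, labels)

# Candidate k with the highest average silhouette
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The highest silhouette is only a candidate: a slightly lower-scoring k can still be preferable when it yields classes that are easier to interpret.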

In [None]:


km = KMeans(init='k-means++', n_clusters = 5, n_init=50,random_state=42)
#km = KModes(n_clusters=n_clusters, init='Huang', n_init=5,random_state=42)
clusters = km.fit_predict(temp_df)
df['cluster'] = clusters



In [None]:
df[['cluster','approx_cost(for two people)']].groupby('cluster').median()

In [None]:
sns.lineplot(x='cluster',y='approx_cost(for two people)',data=df)

In [None]:
sns.scatterplot(x='listed_in(type)',y='approx_cost(for two people)',data=df[df.cluster==0][['listed_in(type)','approx_cost(for two people)','votes','weighted_rating']])

In [None]:
multi_bar_chart(df[df.cluster==1][['listed_in(type)','approx_cost(for two people)','votes','weighted_rating']].groupby('listed_in(type)').mean().sort_values(by='approx_cost(for two people)'),'Approximate Cost and Weighted Rating by Listing Type (Cluster 1)','Listing Type',figuresize=(25,10),freq=None)

PCA Analysis

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(temp_df)
pca_samples = pca.transform(temp_df)

In [None]:
def pca_plot(pca_samples, temp_df):
  fig, ax1 = plt.subplots(figsize=(14, 5))
  sns.set(font_scale=1)

  # Plot the individual explained variance on the primary y-axis (left)
  sns.barplot(
      x=np.arange(0, temp_df.shape[1]),
      y=pca.explained_variance_ratio_ * 100,
      alpha=0.5,
      color='g',
      label='Individual explained variance',
      ax=ax1
  )

  # Create a secondary y-axis (right) for the cumulative explained variance
  ax2 = ax1.twinx()

  # Calculate the cumulative explained variance
  cumsum_explained_variance = np.cumsum(pca.explained_variance_ratio_) * 100

  # Plot the cumulative explained variance on the secondary y-axis
  ax2.plot(
      np.arange(0, temp_df.shape[1]),
      cumsum_explained_variance,
      marker='o',
      color='b',
      label='Cumulative explained variance'
  )




  # Labels and legend for the primary y-axis (individual variance)
  ax1.set_ylabel('Explained variance (%)', fontsize=14, color='g')
  ax1.set_xlabel('Principal components', fontsize=14)
  ax1.legend(loc='upper left', fontsize=13)

  # Labels and legend for the secondary y-axis (cumulative variance)
  ax2.set_ylabel('Cumulative explained variance (%)', fontsize=14, color='b')
  ax2.legend(loc='upper right', fontsize=13)

  # Adjust the layout to make room for both y-axes
  fig.tight_layout()

  plt.show()

pca_plot(pca_samples, temp_df)

In [None]:
pca = PCA(5)
pca.fit(temp_df)
pca_samples = pca.transform(temp_df)
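The number of components kept above is a fixed value. A hedged sketch of deriving it from a cumulative explained-variance target instead is shown below; the random data, the 90% threshold, and the names `pca_full` and `n_components` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the scaled feature matrix (200 samples, 6 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target
threshold = 0.90
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3), n_components)
```

scikit-learn also accepts a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.90, svd_solver='full')`), which performs an equivalent selection internally.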

# Visualize PCs in 2D

In [None]:

def plot_pca_scatter(pca_samples, labels, label_color_map=None, alpha=0.4):
    """
    Creates scatter plots for all pairs of PCA components.

    Parameters:
    - pca_samples: 2D array or DataFrame containing the PCA components.
    - labels: Array or Series containing the cluster labels for color coding.
    - label_color_map: Dictionary mapping cluster labels to colors. Default is None, which uses a default color map.
    - alpha: Transparency level for the scatter plot points. Default is 0.4.
    """

    # Set the style and context for the plot
    sns.set_style("white")
    sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2.5})

    # If no color map is provided, create a default one
    if label_color_map is None:
        unique_clusters = np.unique(labels)
        default_colors = sns.color_palette("hsv", len(unique_clusters))
        label_color_map = {cluster: color for cluster, color in zip(unique_clusters, default_colors)}

    # Get colors for each point
    label_color = [label_color_map[l] for l in labels]

    # Create subplots
    n_components = pca_samples.shape[1]
    fig, axes = plt.subplots(n_components, n_components, figsize=(15, 15))

    # Loop through all pairs of PCA components and create scatter plots
    for i in range(n_components):
        for j in range(n_components):
            ax = axes[i, j]
            if i == j:
                ax.text(0.5, 0.5, f'PCA {i+1}',
                        horizontalalignment='center',
                        verticalalignment='center',
                        fontsize=12)
                # Hide only the axis frame so the component label stays visible
                ax.set_axis_off()
            else:
                ax.scatter(pca_samples[:, i], pca_samples[:, j], c=label_color, alpha=alpha)
                ax.set_xlabel(f'PCA {i+1}', fontsize=12)
                ax.set_ylabel(f'PCA {j+1}', fontsize=12)
                ax.yaxis.grid(color='lightgray', linestyle=':')
                ax.xaxis.grid(color='lightgray', linestyle=':')
                ax.spines['right'].set_visible(False)
                ax.spines['top'].set_visible(False)

    # Set layout and show plot
    plt.tight_layout()
    plt.show()

plot_pca_scatter(pca_samples, df['cluster'].values )

In [None]:
import matplotlib as mpl
import matplotlib.cm as cm

def graph_component_silhouette(n_clusters, lim_x, mat_size, sample_silhouette_values, clusters):
    plt.rcParams["patch.force_edgecolor"] = True
    plt.style.use('fivethirtyeight')
    mpl.rc('patch', edgecolor = 'dimgray', linewidth=1)

    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(8, 8)
    ax1.set_xlim([lim_x[0], lim_x[1]])
    ax1.set_ylim([0, mat_size + (n_clusters + 1) * 10])
    y_lower = 10
    for i in range(n_clusters):

        # Aggregate the silhouette scores for samples belonging to cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[clusters == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        cmap = plt.get_cmap("Spectral")  # cm.get_cmap was removed in Matplotlib 3.9
        color = cmap(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values,
                           facecolor=color, edgecolor=color, alpha=0.8)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.03, y_lower + 0.5 * size_cluster_i, str(i), color = 'red', fontweight = 'bold',
                bbox=dict(facecolor='white', edgecolor='black', boxstyle='round, pad=0.3'))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10

In [None]:
sample_silhouette_values = silhouette_samples(temp_df, clusters)

In [None]:
graph_component_silhouette(len(pd.Series(clusters).value_counts()), [-0.07, 0.7], len(temp_df), sample_silhouette_values, clusters)


The silhouette plot shows that the clustering was good.
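A visual reading of a silhouette plot can be backed by a small numeric summary per cluster. The sketch below computes the mean silhouette and the share of negative values for each cluster on synthetic data; `make_blobs` and the `summary` layout are illustrative assumptions rather than the notebook's actual features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the scaled restaurant features
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One silhouette value per sample, in [-1, 1]
sample_values = silhouette_samples(X, labels)

summary = {
    int(c): {
        'mean': float(sample_values[labels == c].mean()),
        # Negative values suggest samples that sit closer to another cluster
        'share_negative': float((sample_values[labels == c] < 0).mean()),
    }
    for c in np.unique(labels)
}
print(summary)
```

Clusters with a low mean or a large share of negative values are the ones worth inspecting before assigning them a class label.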

In [None]:
df['classes'] = df['cluster'].map({0:'Mid',1:'Low',2:'Upper-Mid',3:'Lower-High',4:'High'})

In [None]:
encoded_df['classes'] = df['classes']

In [None]:
encoded_df['count']=df['count']

In [None]:
radar_df

In [None]:
radar_df = encoded_df[['online_order','book_table','weighted_rating','approx_cost(for two people)','num_cuisines','num_spec','count','classes','votes']].groupby('classes').mean().sort_values(by='votes')
radar_df = radar_df.rename(columns={'approx_cost(for two people)':'Price','weighted_rating':'Rating','votes':'Engagement','count':'Traffic','num_cuisines':'Variety in Dishes','num_spec':'Variety in Specializations'})
radar_df = radar_df.sort_values('Price',ascending=False)
radar_df.drop_duplicates(inplace=True)

plot_radar_charts(radar_df,single_chart=False,custom_title='Restaurant Characteristics per Class')

# Investment Analysis Report: Restaurant Markets in Bangalore

## Introduction
This report provides an analysis of the restaurant markets across various locations in Bangalore, focusing on customer characteristics, economic conditions, and investment opportunities. Bangalore's diverse culinary landscape offers multiple opportunities for both local and foreign cuisines.

## Yeshwantpur

**Customer Characteristics:**
- **Cuisine Preference:** Predominantly local cuisines.
- **Price Sensitivity:** Budget-friendly restaurants.
- **Rating:** Below average, indicating potential for improvement with higher quality offerings.
- **Consumer Behavior:** Customers seek diverse and affordable food options.

**Economic Context:**
- Yeshwantpur is a developing residential and commercial area with a mix of affordable and mid-range options. The increasing development is contributing to a dynamic food scene.
- **Source:** [Economic Overview of Yeshwantpur](https://www.economictimes.indiatimes.com/industry/services/property-/-cstruction/economic-overview-of-yeshwantpur/articleshow/71130025.cms)

**Recommendation:**
- New restaurants offering high-quality food can compete effectively due to the current low ratings. Focus on delivering quality at competitive prices to attract budget-conscious yet discerning customers.

## Whitefield

**Customer Characteristics:**
- **Cuisine Preference:** Mainly Asian and European cuisines.
- **Price Sensitivity:** Mid-range to higher end.
- **Rating:** Average, suggesting a stable but competitive market.
- **Consumer Behavior:** Customers seek diverse and international food experiences.

**Economic Context:**
- Whitefield is a significant IT and residential hub with a higher average income level. It attracts professionals and expatriates, influencing the demand for diverse and high-quality cuisines.
- **Source:** [Whitefield Economic and Demographic Insights](https://www.financialexpress.com/industry/whitefield-economic-and-demographic-insights/)

**Recommendation:**
- Restaurants should focus on offering high-quality Asian and European cuisines to stand out in a competitive market. Emphasize unique and high-quality international dishes to appeal to the area's diverse clientele.

## Ulsoor

**Customer Characteristics:**
- **Cuisine Preference:** Predominantly European cuisines.
- **Price Sensitivity:** Upper-middle class.
- **Rating:** Generally good, reflecting a well-established market.
- **Consumer Behavior:** Customers prefer European cuisines and upscale dining experiences.

**Economic Context:**
- Ulsoor is an upscale locality known for its affluent residents and historical significance. It has a strong demand for high-quality European cuisines.
- **Source:** [Ulsoor Economic and Demographic Profile](https://www.business-standard.com/article/economy-policy/ulsoor-economic-and-demographic-profile-120092700094_1.html)

**Recommendation:**
- Invest in European cuisine with a focus on premium dining experiences. Offer exceptional dishes and a refined atmosphere to attract the upper-middle-class clientele.

## St. Marks Road, MG Road, and Church Street

**Customer Characteristics:**
- **Cuisine Preference:** Diverse, seeking unique and high-quality experiences.
- **Price Sensitivity:** Lower high class.
- **Rating:** Generally good, indicating a well-established market.
- **Consumer Behavior:** Customers look for both uniqueness and quality in their dining experiences.

**Economic Context:**
- These areas are central business districts with high-end retail and dining establishments. They cater to a diverse, high-income clientele seeking premium experiences.
- **Source:** [Economic Overview of Bangalore's Central Business Districts](https://www.thehindubusinessline.com/news/economic-overview-of-bangalores-central-business-districts/article33585535.ece)

**Recommendation:**
- Focus on offering unique dining experiences with a high standard of quality to meet the expectations of the lower high-class demographic. Develop innovative menus and distinctive concepts to attract high-income customers.

## Lavelle Road

**Customer Characteristics:**
- **Cuisine Preference:** High-class, with an emphasis on exceptional experiences and uniqueness.
- **Price Sensitivity:** High-end.
- **Rating:** Very high, indicating a well-established and competitive market.
- **Consumer Behavior:** Customers seek exclusivity and high-quality experiences.

**Economic Context:**
- Lavelle Road is one of Bangalore’s most affluent areas, known for its upscale residential and commercial properties. It attracts a high-income clientele with a preference for luxury dining.
- **Source:** [Lavelle Road Economic Profile](https://www.forbesindia.com/article/real-estate/lavelle-road-economic-profile/57415/1)

**Recommendation:**
- Invest in high-end dining establishments that offer exceptional experiences and exclusive menu items. Position restaurants as luxury dining destinations with superior service and unique culinary offerings.

## Conclusion
Bangalore presents a vibrant food scene with opportunities for investment in both foreign and local cuisines. While foreign cuisines like Cantonese, German, Singaporean, and African show promising potential due to their unique appeal and engagement, local cuisines, particularly Northern and Southern Indian, face market saturation and require innovative approaches to remain competitive. Understanding each area's economic conditions and customer characteristics will be key to making informed investment decisions.






---





# Investment Report: Strategic Recommendations for Restaurant Ventures in Bangalore

## Executive Summary

This report outlines key findings and strategic recommendations for investing in Bangalore's diverse restaurant market. By understanding customer preferences and leveraging neighborhood characteristics, investors can optimize profitability and market positioning.

## Key Findings

### 1. Restaurant Categories and Distribution

- **Quick Bites (40%)**: Fast-food and quick-service restaurants dominate but show lower customer engagement. Enhancing the dining experience could boost interest.
- **Casual Dining and Cafes**: These segments, which constitute about two-thirds of the market, balance convenience with ambiance, resulting in higher engagement.
- **Irani Cafes**: Known for their engaging atmosphere and reasonable pricing, Irani cafes are a growing niche with limited competition.

### 2. Performance of Different Categories

- **Top Performers**: Drinks & Nightlife establishments attract high engagement and positive ratings. Investing in lively, social venues aligns with consumer preferences.
- **Lower Performers**: Delivery, Dine-out, and Desserts categories show lower engagement. Improving service quality and offering unique experiences can address these shortcomings.

### 3. Neighborhood Insights

- **Yeshwantpur**: Offers affordable and diverse food options, appealing to the middle class.
- **Wilson Garden**: Features budget-friendly fast food, attracting cost-conscious diners.
- **Whitefield**: Focuses on a mid-to-high budget range with a nightlife emphasis, appealing to upper-middle-class and high-class patrons.
- **West Bangalore**: Known for food trucks, catering to a broad audience with diverse tastes.
- **Sankey Road and Lavelle Road**: High-end dining options attract affluent customers. **Lavelle Road, in particular, serves as an ideal middle ground, catering to both high-class and upper-middle-class customers. This makes it a strategic location with significant potential for a diverse customer base.**
- **Kalyan Nagar**: Provides a balanced dining experience for upper-middle-class residents.

### 4. Cuisine-Specific Trends

- **Cantonese Cuisine**: High engagement and premium pricing suggest strong demand among affluent customers for exclusive dining experiences.
- **German Cuisine**: Moderate engagement with competitive pricing appeals to middle-income groups.
- **Sri Lankan, Parsi, and Russian Cuisines**: High engagement and medium pricing highlight a growing interest in unique flavors and cultural diversity.
- **Singaporean Cuisine**: Shows increasing popularity with moderate pricing, indicating potential for greater engagement through targeted marketing.
- **Chinese Cuisine**: Despite lower engagement, Chinese cuisine is widespread across the city, similar to local cuisines. Its affordability and broad presence suggest room for growth by enhancing uniqueness and quality.
- **African Cuisine**: Low engagement but potential for growth by focusing on health-conscious and adventurous diners.

## Strategic Recommendations

### 1. Pricing Strategy

- **Consideration for Price Reduction**: Adjust prices strategically to enhance competitiveness and appeal. Specialize in unique offerings to justify premium pricing where feasible.
- **Location-Based Pricing**: Avoid high prices in high-traffic road locations. Instead, focus on specialty areas where a balance between quality and price can attract middle and upper-middle-class customers.

### 2. Location Strategy

- **Prime Locations**: Invest in high-visibility areas for Drinks & Nightlife venues to attract a diverse clientele.
- **Avoid High-Rent Roads**: Opt for locations with manageable rental costs to avoid passing on high prices to customers, which could affect engagement.
- **Lavelle Road**: Leverage Lavelle Road’s unique position as a middle ground for high-class and upper-middle-class customers. Its balanced demographic profile offers a strategic advantage for attracting a broad customer base.

### 3. Enhancing Customer Experience

- **Specialization**: Focus on unique dining experiences that offer value and quality. Specialized menus and personalized services can improve customer satisfaction and loyalty.
- **Encouraging Reviews**: Actively encourage customers to leave reviews and ratings. Positive feedback not only improves visibility but also drives profitability through increased customer trust and engagement.

### 4. Metrics and Analysis

- **Key Metrics**: Track and analyze reviews, votes, ratings, and overall customer feedback. These metrics are crucial for assessing customer satisfaction and identifying areas for improvement.
- **Direct Profit Correlation**: Higher ratings and positive reviews correlate with increased profitability. Prioritize strategies that enhance customer experience and encourage positive feedback.

## Conclusion

Investing in Bangalore's restaurant market requires a strategic approach to pricing, location, and customer engagement. By aligning restaurant concepts with neighborhood characteristics and enhancing customer experience, investors can optimize their returns. **Lavelle Road’s strategic position offers significant potential for attracting a varied and high-value customer base.** Additionally, addressing the unique attributes of different cuisines and focusing on customer feedback will drive growth and profitability.


# Part 2

# Behavioral Analysis of Restaurant Reviews

# Preparing Data from Part 2

In [None]:
import nltk  # needed for the nltk.download calls in the next cell
import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import pipeline
from nltk.corpus import stopwords, words


In [None]:
nlp = spacy.load('en_core_web_sm')
nltk.download('vader_lexicon')
nltk.download('words')
nltk.download('stopwords')

valid_words = set(words.words())
stop_words = set(stopwords.words('english'))
sia = SentimentIntensityAnalyzer()

Next, I create a column that holds a list of sentiment results for each review. Each sentence is scored for sentiment and its nouns are extracted, since the nouns are likely the aspects the reviewer liked or disliked.
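The per-review structure described here can be sketched without the NLP dependencies. In the sketch below, the tiny lexicon and noun set are hypothetical stand-ins for VADER scores and spaCy part-of-speech tags; only the shape of the output (one dict per sentence with a score and the nouns found) mirrors the notebook.

```python
import re

# Hypothetical stand-ins: a real run would use VADER scores and spaCy POS tags
TOY_LEXICON = {'great': 1.0, 'tasty': 0.8, 'slow': -0.6, 'bad': -1.0}
TOY_NOUNS = {'food', 'service', 'biryani', 'ambience'}

def sentence_records(review):
    """Return one record per sentence: score, nouns mentioned, and the text."""
    records = []
    # Naive sentence split on terminal punctuation followed by whitespace
    for sent in re.split(r'(?<=[.!?])\s+', review.strip()):
        tokens = re.findall(r'[a-z]+', sent.lower())
        if not tokens:
            continue
        score = sum(TOY_LEXICON.get(t, 0.0) for t in tokens)
        nouns = [t for t in tokens if t in TOY_NOUNS]
        records.append({'sentiment_score': score, 'nouns': nouns, 'sent': sent})
    return records

records = sentence_records('The biryani was great. Service was slow.')
for record in records:
    print(record)
```

Scoring at sentence level lets a single review contribute a positive signal for one aspect and a negative signal for another.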








For better accuracy, an LLM would be ideal for this task, but running it over every review would be time-consuming. Therefore, I'll use a pre-trained model from Hugging Face to strike a balance between speed and accuracy, and reserve the LLM for later, when dealing with a single restaurant or a smaller number of reviews.

In [None]:

import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

def spacy_tokenizer(text):
    doc = nlp(clean_string(text))
    return [token.lemma_.lower() for token in doc if not token.is_punct and not token.is_stop]
def clean_string(s):
    # Define allowed characters: alphanumeric, punctuation, and space
    allowed_chars = string.ascii_letters + string.digits + string.punctuation + ' '

    # Remove characters not in allowed_chars
    cleaned = ''.join(c for c in s if c in allowed_chars)

    # Collapse runs of periods (e.g. "...") into a single period first;
    # once punctuation is padded with spaces, the run is no longer contiguous
    cleaned = re.sub(r'\.{2,}', '.', cleaned)

    # Add a space before and after punctuation
    cleaned = re.sub(r'([' + re.escape(string.punctuation) + r'])', r' \1 ', cleaned)

    # Add a space between numbers and words if there is no space between them
    cleaned = re.sub(r'(\d)([a-zA-Z])', r'\1 \2', cleaned)
    cleaned = re.sub(r'([a-zA-Z])(\d)', r'\1 \2', cleaned)

    # Replace multiple whitespace with a single space
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()

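    # Remove specific unwanted characters and terms
    cleaned = cleaned.replace('Ã','').replace('Â','').replace('RATED','')
    # Treat the standalone word "but" as a sentence break; the word boundary
    # keeps words such as "butter" intact
    cleaned = re.sub(r'\bbut\b', '.', cleaned).strip()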

    return cleaned
def get_sentiment_scores(df):
  df = df.reset_index(drop=True)
  df['review_sentiment_list']=None
  for indx, row in tqdm(df.iterrows(), total=df.shape[0]):
    reviews = row['reviews_list']
    if len(reviews)==0:
      continue
    review_sentiment_list = []
    for i in reviews:
      cleaned_text = re.sub(r'\s+', ' ', clean_string(i[1]))

      doc = nlp(cleaned_text)
      sents_sentiment_list = []
      sentemints_score_list = []

      for sent in doc.sents:
        try:
          sentiment = sia.polarity_scores(str(sent))
        except Exception:
          # Skip sentences VADER cannot score
          continue

        output_dict = None
        sentiment_score = sentiment['compound']

        sentemints_score_list.append(sentiment_score)
        output_dict = {'sentiment_score':sentiment_score,'nouns':[],'adj':[],'sent':str(sent)}

        for token in sent:
          if token.is_stop:
            continue
          if len(token.text)>15:
            continue
          conditions = token.is_alpha and token.lemma_.lower() in valid_words and token.lemma_.lower() not in stop_words
          # Both sentiment branches appended the same lemma, so no polarity check is needed here
          if token.pos_ == 'NOUN' and conditions:
            output_dict['nouns'].append(str(token.lemma_.lower()))
          if token.pos_ == 'ADJ' and conditions:
            output_dict['adj'].append(str(token.lemma_.lower()))

        sents_sentiment_list.append(output_dict)
      review_sentiment_list.append(sents_sentiment_list)

    df.at[indx,'review_sentiment_list'] = review_sentiment_list

  return df

df=get_sentiment_scores(df)
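
The nested structure stored in `review_sentiment_list` can be hard to picture from the loop alone. The following is a minimal sketch of its shape for one restaurant, with hand-written values that are purely illustrative (they are not taken from the dataset), followed by one way to flatten it:

```python
from itertools import chain

# Hand-written sample mirroring one restaurant's `review_sentiment_list`:
# a list of reviews, each review being a list of per-sentence dicts.
# Scores and words are illustrative only, not taken from the dataset.
sample = [
    [  # review 1
        {'sentiment_score': 0.6249, 'nouns': ['food', 'service'], 'adj': ['great'],
         'sent': 'Great food and service .'},
        {'sentiment_score': -0.4767, 'nouns': ['price'], 'adj': ['steep'],
         'sent': 'The price was steep .'},
    ],
    [  # review 2
        {'sentiment_score': 0.0, 'nouns': ['table'], 'adj': [],
         'sent': 'We got a table at 8 .'},
    ],
]

# Flatten reviews -> sentences, then collect nouns and the mean compound score
sentences = list(chain.from_iterable(sample))
all_nouns = [noun for s in sentences for noun in s['nouns']]
mean_score = sum(s['sentiment_score'] for s in sentences) / len(sentences)

print(all_nouns)
print(round(mean_score, 4))
```

The same flattening pattern (reviews, then sentences, then words) is what the later cells rely on when they collect nouns and average the sentence scores.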

In [None]:
df.to_csv('/content/drive/My Drive/Zomato Geospatial Analysis/sentiment_df_spacy.csv',index=False)

In [None]:
df=pd.read_csv('/content/drive/My Drive/Zomato Geospatial Analysis/sentiment_df_spacy.csv')


In [None]:
# Parse the review_sentiment_list column from its string form back into Python lists
df['review_sentiment_list'] = df['review_sentiment_list'].apply(lambda x: None if pd.isnull(x) else eval(x))

# Parse the reviews_list column from its string form back into Python lists
df.reviews_list = df.reviews_list.apply(lambda x: eval(x))
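
Since the CSV round trip turns list columns into strings, they have to be parsed back. As a hedged alternative to `eval`, the standard library's `ast.literal_eval` accepts only Python literals (lists, dicts, strings, numbers, and so on) and refuses arbitrary expressions. The helper name `parse_literal` and the sample string below are assumptions made for illustration:

```python
import ast

def parse_literal(cell):
    """Parse a stringified Python literal; pass missing values through as None."""
    # NaN is the only float that is not equal to itself
    if cell is None or (isinstance(cell, float) and cell != cell):
        return None
    return ast.literal_eval(cell)

# Illustrative cell content, shaped like a saved `review_sentiment_list` entry
raw = "[[{'sentiment_score': 0.5, 'nouns': ['food'], 'adj': [], 'sent': 'Nice food .'}]]"
parsed = parse_literal(raw)

print(parsed[0][0]['nouns'])
```

In the notebook this could be applied with `df['review_sentiment_list'].apply(parse_literal)`, assuming every non-missing cell holds a plain Python literal.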


In [None]:
nouns_list = []
for i in df.review_sentiment_list.tolist():
  if i is None:
    continue
  for j in i:
    for k in j:
      nouns_list.extend(k['nouns'])

In [None]:
nouns_list = list(set(nouns_list))
nouns_list = [s for s in nouns_list if len(s) >= 3]

In [None]:
len(nouns_list)

In [None]:

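import os
import google.generativeai as genai
import time
null=None
# Read the API key from the environment instead of hard-coding it in the notebook.
# GOOGLE_API_KEY is an assumed variable name; set it before running this cell.
YOUR_API_KEY = os.environ["GOOGLE_API_KEY"]
genai.configure(api_key=YOUR_API_KEY)
model = genai.GenerativeModel('gemini-pro')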
count = 0
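def llm(prompt):
  # Retry until the API call succeeds, then return the generated text
  while True:
    try:
      response = model.generate_content(prompt)
      return response.text
    except Exception:
      # Brief pause before retrying a failed call
      time.sleep(1)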
def analays_reviews(reviews_list,word,sentiment):
  output = pd.DataFrame(columns=['sentiment'])
  sentiment_list = []
  for i in reviews_list:

    time.sleep(1)
    #prompt = f'this is a review on returant "{clean_string(i)}" act as data analyst and return object has sentiment analysis related to {word} like these examples sentiment, followed by list of aspect in the sentence followed by reason explaining this aspect {{ "sentiment": "positive", "liked": [{{"aspect": "food", "reason": "good"}}, {{"aspect": "delivery", "reason": "fast"}}], "disliked": [] }} and if it negative {{ "sentiment": "negative", "liked": [], "disliked": [{{"aspect": "Food", "reason": "mediocre"}}, {{"aspect": "buffet", "reason": "Only vegetarian"}}] }} and {{ "sentiment": "mixed", "liked": [{{"aspect": "rice bowl", "reason": "good"}}, {{"aspect": "rice bowl", "reason": "must try"}}], "disliked": [{{"aspect": "pizza", "reason": "not freshly"}}, {{"aspect": "pizza", "reason": "mayonnaise instead of cheese"}}] }} and {{ "sentiment": "mixed", "liked": [{{"aspect": "chicken pepper fry", "reason": "good"}}, {{"aspect": "neer dosa", "reason": "Soft, fluffy and light"}}], "disliked": [{{"aspect": "chicken ghee roast", "reason": "lacked the punch"}}] }} and {{ "sentiment": "mixed", "liked": [{{"aspect": "mushroom", "reason": "tasty"}}, {{"aspect": "fruit tart", "reason": "good"}}], "disliked": [{{"aspect": "food", "reason": "could have been better"}}, {{"aspect": "staff", "reason": "unfriendly"}}, {{"aspect": "price", "reason": "steep"}}] }} in general avoid adding any word won add any meaning minimize number of words and letter.'
    prompt = f'''Do aspect sentiment analysis on this review "{clean_string(i)}" and return these
    1- Sentiment
    2- reason for each aspect the reviewer liked or disliked
    3- list of things the reviewer liked, list of things the reviewer disliked
    4- only interested in {sentiment} sentiment related to this word {word}
    5- output should be in python dict format
    here are some examples
    {{"sentiment": "positive", "liked": [{{"aspect": "food", "reason": "good"}}, {{"aspect": "delivery", "reason": "fast"}}], "disliked": [] }}
    {{ "sentiment": "negative", "liked": [], "disliked": [{{"aspect": "food", "reason": "mediocre"}}, {{"aspect": "buffet", "reason": "Only vegetarian"}}] }}
    {{ "sentiment": "negative", "liked": [], "disliked": [{{"aspect": "place", "reason": "expensive"}}, {{"aspect": "ambiance", "reason": "noisy"}}] }}

     '''
    printit=True
    while printit:
      try:
        response = model.generate_content(prompt)
        printit = False
      except:
        pass
    try:
      encoded = eval(response.text)
      if encoded['sentiment'] == sentiment:
        sentiment_list.append(encoded)
        print(encoded)
    except:
      pass


  output.sentiment=sentiment_list
  return output
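
The retry loops above keep calling the API until it succeeds, which can hang indefinitely if the failure is permanent. A bounded retry with exponential backoff is one possible alternative. This is only a sketch: `call_with_retry` and the `flaky` stand-in are hypothetical names, not part of the notebook or of the `google.generativeai` library.

```python
import time

def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**attempt between attempts (exponential backoff)
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Hypothetical stand-in for a flaky API call: fails twice, then succeeds
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('temporary failure')
    return 'ok'

result = call_with_retry(flaky, base_delay=0.0)
print(result, calls['n'])
```

In the notebook, the call could be wrapped as `call_with_retry(lambda: model.generate_content(prompt))`, so a persistent failure surfaces as an exception instead of an endless loop.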




In [None]:
# Create the figure and axis
fig, ax1 = plt.subplots(figsize=(12, 8))

# Plot the first line on the primary y-axis
sns.lineplot(data=df, x='approx_cost(for two people)', y='num_reviews', ax=ax1, color='blue', label='Number of Reviews')
ax1.set_xlabel('Approximate Cost (for two people)')
ax1.set_ylabel('Number of Reviews', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis and plot the second line
ax2 = ax1.twinx()
df_counts = df[['name', 'approx_cost(for two people)']].groupby('approx_cost(for two people)').count().reset_index()
sns.lineplot(data=df_counts, x='approx_cost(for two people)', y='name', ax=ax2, color='orange', label='Count of Restaurants')
sns.lineplot(data=df, x='approx_cost(for two people)', y=df['weighted_rating'] * 500, ax=ax2, color='red', label='Scaled Weighted Rating')
ax2.set_ylabel('Count of Restaurants', color='orange')
ax2.tick_params(axis='y', labelcolor='orange')

# Set the title and grid
plt.title('Number of Reviews and Count of Restaurants vs. Approximate Cost')
ax1.grid(True)

# Add vertical lines at each 100 unit interval on the x-axis
for x in range(0, int(df['approx_cost(for two people)'].max()) + 100, 100):
    ax1.axvline(x=x, color='gray', linestyle='--', linewidth=0.7)

# Add legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

# Show the plot
plt.show()

In [None]:
nouns_list = []
for i in df.review_sentiment_list.tolist():
  if i is None:
    continue
  for j in i:
    for k in j:
      nouns_list.extend(k['nouns'])
nouns_list= pd.DataFrame(nouns_list,columns=['word']).word.value_counts().reset_index()
nouns_list.columns = ['word','count']

In [None]:
def word_cloud(nouns_list,col='word'):
  most_freq = nouns_list.loc[:100]
  word_freq = dict(zip(most_freq[col], most_freq['count']))

  # Create and generate a word cloud image
  wordcloud = WordCloud(width=1200, height=600, background_color='white').generate_from_frequencies(word_freq)

  # Display the generated image
  plt.figure(figsize=(10, 5))
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis('off')  # Hide axes
  plt.show()
word_cloud(nouns_list)
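
Note that `nouns_list.loc[:100]` is label-based and inclusive, so with the default index it selects 101 rows. As a small sketch of the same counting step without pandas, `collections.Counter` yields the top-N frequency dictionary directly; the word list below is illustrative, not taken from the dataset:

```python
from collections import Counter

# Illustrative nouns; in the notebook these come from `review_sentiment_list`
nouns = ['food', 'service', 'food', 'place', 'ambience', 'food', 'service']

counts = Counter(nouns)
top_n = counts.most_common(2)   # exactly the 2 most frequent (word, count) pairs
word_freq = dict(top_n)         # same shape as the dict passed to generate_from_frequencies

print(word_freq)
```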


In analyzing customer behavior in Bengaluru's restaurant industry, we find that the number of reviews is skewed toward higher-end establishments. This is expected, since budget restaurants typically prioritize affordability over experience and quality. To compare customer sentiment accurately across restaurant categories, it is essential to set a clear price range that defines each class.

Interestingly, there is a noticeable drop in the number of reviews for restaurants with an approximate cost (for two people) of roughly 1,500 to 2,300. This unusual pattern warrants further investigation to understand its underlying causes.

Additionally, the word cloud suggests that the overall experience is as important as the food itself, and sometimes more so, particularly in higher-end establishments. Because the data is biased toward such establishments, these conclusions apply mainly to higher-class restaurants.

**Which classes fall in the price range 1700-2300?**

In [None]:
temp_df = df[(df['approx_cost(for two people)'] > 1700) & (df['approx_cost(for two people)'] < 2300)]

# Count the occurrences of each class
temp_df = temp_df['classes'].reset_index()
pie_chart(temp_df,'classes','Count of classes with Approximate Cost between 1700 and 2300')

Over 90% of the restaurants in this range belong to the Mid, Upper-Mid, and High classes. However, engagement and ratings are lower, possibly because the area is roughly evenly split between High-class and Mid-class residents, which makes it hard to meet both groups' expectations on quality and budget.

**This needs more investigation to confirm.**

**Why is there a gap in the middle?**

Which restaurant types does each class have in this range?

I will plot the average price and the frequency of each type for each class in this range.

In [None]:
plt.figure(figsize=(15, 8))
temp_df = df[(df['approx_cost(for two people)'] > 1700) & (df['approx_cost(for two people)'] < 2300)][['classes','approx_cost(for two people)','listed_in(type)','weighted_rating','votes']].groupby(['classes','listed_in(type)']).mean().reset_index()
freq = df[(df['approx_cost(for two people)'] > 1700) & (df['approx_cost(for two people)'] < 2300)][['classes','approx_cost(for two people)','listed_in(type)','weighted_rating','votes']].groupby(['classes','listed_in(type)']).count().reset_index()
sns.barplot(x='classes', y='approx_cost(for two people)', data=temp_df,hue='listed_in(type)',dodge=True,alpha=0.5,legend=False)
ax2 = plt.gca().twinx()
sns.lineplot(x='classes', y='votes', data=freq,hue='listed_in(type)',ax=ax2)

ax2.legend(loc='upper left', bbox_to_anchor=(1, 0.85))
plt.title('Price vs Classes in Price Range 1700-2300 and Frequency')
plt.xlabel('Classes')
plt.ylabel('Count')
plt.show()

The High class has the largest share in this price range, and Dine-out is the most common type overall.

Next, I will check the average price of each type for the High and Mid classes.

In [None]:
plt.figure(figsize=(15, 8))

# Plot for 'High' class
temp_df = df[df.classes == 'High'][['approx_cost(for two people)', 'listed_in(type)', 'weighted_rating', 'votes']].groupby(['listed_in(type)']).mean().reset_index()
sns.barplot(x='listed_in(type)', y='approx_cost(for two people)', data=temp_df, alpha=0.7, label='High')
# Plot for 'Lower-High' class
temp_df = df[df.classes == 'Lower-High'][['approx_cost(for two people)', 'listed_in(type)', 'weighted_rating', 'votes']].groupby(['listed_in(type)']).mean().reset_index()
sns.barplot(x='listed_in(type)', y='approx_cost(for two people)', data=temp_df, alpha=0.7, label='Lower-High')



# Plot for 'Upper-Mid' class
temp_df = df[df.classes == 'Upper-Mid'][['approx_cost(for two people)', 'listed_in(type)', 'weighted_rating', 'votes']].groupby(['listed_in(type)']).mean().reset_index()
sns.barplot(x='listed_in(type)', y='approx_cost(for two people)', data=temp_df,alpha=0.5,color='green', label='Upper-Mid')

# Plot for 'Mid' class
temp_df = df[df.classes == 'Mid'][['approx_cost(for two people)', 'listed_in(type)', 'weighted_rating', 'votes']].groupby(['listed_in(type)']).mean().reset_index()
sns.barplot(x='listed_in(type)', y='approx_cost(for two people)', data=temp_df, alpha=0.3,color='orange', label='Mid')

# Add a title and labels
plt.title('Price vs Classes in Price Over all')
plt.xlabel('Listed In (Type)')
plt.ylabel('Price')

# Add a legend, letting matplotlib handle it automatically
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))

# Show the plot
plt.show()

In [None]:
plt.figure(figsize=(15, 8))
temp_df = df[(df['approx_cost(for two people)'] > 1700) & (df['approx_cost(for two people)'] < 2300)][['classes','approx_cost(for two people)', 'listed_in(type)', 'weighted_rating', 'votes']].groupby(['classes','listed_in(type)']).mean().reset_index()

sns.barplot(x='classes', y='votes', data=temp_df,hue='listed_in(type)',dodge=True)
plt.legend(loc='upper left', bbox_to_anchor=(1, 0.85))
plt.title('Votes vs Classes in Price Range 1700-2300')
plt.xlabel('Classes')
plt.ylabel('Votes')
plt.show()

In [None]:
import itertools


def sentiment_by_type_for_class(df):

  df.dropna(subset=['review_sentiment_list'],inplace=True)
  temp_df = df.reset_index()
  # get listed_in(type) unique values
  listed_in_type = temp_df['listed_in(type)'].unique()
  # create new column sum all sentiment in each row
  temp_df['nouns'] = None
  temp_df['adj'] = None
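  # Use object/float defaults so the string and float values assigned below match the column dtypes
  temp_df['sentiment_score']=None
  temp_df['sentiment_avg']=0.0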

  nouns_dict={}
  adj_dict={}
  types_dict={}
  for i,row in temp_df.iterrows():
    expande_reviews = list(itertools.chain(*list(row['review_sentiment_list'])))
    temp_ph = [d['sentiment_score'] for d in expande_reviews]
    temp_df.at[i,'sentiment_score'] = str(temp_ph)
    if len(temp_ph) != 0:
      temp_df.at[i,'sentiment_avg'] = sum(temp_ph)/len(temp_ph)
    else:
      temp_df.at[i,'sentiment_avg'] = 0
    temp_df.at[i,'nouns'] = str([d['nouns'] for d in expande_reviews])
    temp_df.at[i,'adj'] = str([d['adj'] for d in expande_reviews])

  temp_df.sentiment_score = temp_df.sentiment_score.apply(lambda x: eval(x))
  temp_df.nouns = temp_df.nouns.apply(lambda x: eval(x))
  temp_df.adj = temp_df.adj.apply(lambda x: eval(x))


  return temp_df


In [None]:

sentiment_df=sentiment_by_type_for_class(df.copy())


In [None]:
plt.figure(figsize=(15, 8))

# Build both masks from sentiment_df so the boolean index aligns with the frame being filtered
in_range = (sentiment_df['approx_cost(for two people)'] > 1700) & (sentiment_df['approx_cost(for two people)'] < 2300)
plot_df = sentiment_df[in_range][['classes','listed_in(type)','sentiment_avg']].groupby(['listed_in(type)','classes']).mean().reset_index()
sns.barplot(data=plot_df, hue='classes', x='listed_in(type)', y='sentiment_avg', alpha=0.5)
sns.lineplot(data=plot_df, hue='classes', x='listed_in(type)', y='sentiment_avg', linewidth=4)
plt.title('Sentiment vs Types per class in Price Range 1700-2300')
plt.xlabel('Types')
plt.ylabel('Sentiment')
plt.legend(loc='upper left', bbox_to_anchor=(1, 0.85))
plt.show()

**What are the reviews saying about restaurants in this range?**

In [None]:
def exctract_review_price_range(sentiment_df):

  temp_df=sentiment_df[['sentiment_score','nouns','adj','classes','listed_in(type)']]
  temp_dict={'noun':[],'adj':[],'sentiment':[],'classes':[],'type':[]}
  for i,row in temp_df.iterrows():
    sent_nouns = row.nouns
    sent_adj = row.adj
    sent_score = row.sentiment_score
    for i_s,s in enumerate(sent_adj):
      for n in s:
        temp_dict['adj'].append(n)
        temp_dict['classes'].append(row.classes)
        temp_dict['type'].append(row['listed_in(type)'])
        temp_dict['noun'].append(sent_nouns[i_s])
        temp_dict['sentiment'].append(sent_score[i_s])



  noun_sentiment_df = pd.DataFrame(temp_dict)
  noun_sentiment_df=noun_sentiment_df.explode('noun').reset_index(drop=True)

  return noun_sentiment_df

In [None]:
# Both conditions use sentiment_df so the mask aligns with the filtered frame
noun_sentiment_1700_2300 = exctract_review_price_range(sentiment_df[(sentiment_df['approx_cost(for two people)'] > 1700) & (sentiment_df['approx_cost(for two people)'] < 2300)])

Using an LLM to categorize the list accurately

In [None]:
categories = { 'food quality': 'Food Quality', 'food': 'Food Quality', 'dish': 'Food Quality', 'meal': 'Food Quality', 'cuisine': 'Food Quality', 'menu': 'Food Quality', 'taste': 'Food Quality', 'flavor': 'Food Quality', 'presentation': 'Food Quality', 'freshness': 'Food Quality', 'ingredients': 'Food Quality', 'preparation': 'Food Quality', 'taste profile': 'Food Quality', 'texture': 'Food Quality', 'portion size': 'Food Quality', 'recipe': 'Food Quality', 'seasoning': 'Food Quality', 'spices': 'Food Quality', 'dish quality': 'Food Quality', 'meal quality': 'Food Quality', 'food experience': 'Food Quality', 'culinary': 'Food Quality', 'dish presentation': 'Food Quality', 'cooking techniques': 'Food Quality', 'ingredient quality': 'Food Quality', 'food standards': 'Food Quality', 'food preparation': 'Food Quality', 'flavor profile': 'Food Quality', 'meal preparation': 'Food Quality', 'dish preparation': 'Food Quality', 'food safety': 'Food Quality', 'culinary quality': 'Food Quality', 'kitchen standards': 'Food Quality', 'gourmet': 'Food Quality', 'culinary excellence': 'Food Quality', 'menu variety': 'Food Quality', 'dish variety': 'Food Quality', 'menu selection': 'Food Quality', 'service': 'Service', 'staff': 'Service', 'employee': 'Service', 'team': 'Service', 'worker': 'Service', 'attendant': 'Service', 'personnel': 'Service', 'assistant': 'Service', 'waitstaff': 'Service', 'receptionist': 'Service', 'manager': 'Service', 'barista': 'Service', 'server': 'Service', 'host': 'Service', 'hostess': 'Service', 'staff member': 'Service', 'crew': 'Service', 'service quality': 'Service', 'customer service': 'Service', 'service standard': 'Service', 'hospitality': 'Service', 'care': 'Service', 'attention': 'Service', 'support': 'Service', 'assistance': 'Service', 'service staff': 'Service', 'service experience': 'Service', 'service level': 'Service', 'service delivery': 'Service', 'interaction': 'Service', 'customer support': 'Service', 'service efficiency': 'Service', 
'employee conduct': 'Service', 'staff behavior': 'Service', 'service team': 'Service', 'staff responsiveness': 'Service', 'team performance': 'Service', 'service attitude': 'Service', 'service provision': 'Service', 'staff professionalism': 'Service', 'staff demeanor': 'Service', 'service excellence': 'Service', 'ambiance': 'Ambiance/Atmosphere', 'atmosphere': 'Ambiance/Atmosphere', 'environment': 'Ambiance/Atmosphere', 'mood': 'Ambiance/Atmosphere', 'setting': 'Ambiance/Atmosphere', 'vibe': 'Ambiance/Atmosphere', 'character': 'Ambiance/Atmosphere', 'tone': 'Ambiance/Atmosphere', 'feel': 'Ambiance/Atmosphere', 'aesthetic': 'Ambiance/Atmosphere', 'energy': 'Ambiance/Atmosphere', 'style': 'Ambiance/Atmosphere', 'theme': 'Ambiance/Atmosphere', 'decor': 'Ambiance/Atmosphere', 'layout': 'Ambiance/Atmosphere', 'sound': 'Ambiance/Atmosphere', 'lighting': 'Ambiance/Atmosphere', 'music': 'Ambiance/Atmosphere', 'furnishings': 'Ambiance/Atmosphere', 'interior': 'Ambiance/Atmosphere', 'exterior': 'Ambiance/Atmosphere', 'design': 'Ambiance/Atmosphere', 'artwork': 'Ambiance/Atmosphere', 'color scheme': 'Ambiance/Atmosphere', 'decorations': 'Ambiance/Atmosphere', 'design elements': 'Ambiance/Atmosphere', 'atmospheric elements': 'Ambiance/Atmosphere', 'ambiance quality': 'Ambiance/Atmosphere', 'ambiance level': 'Ambiance/Atmosphere', 'overall ambiance': 'Ambiance/Atmosphere', 'ambiance setting': 'Ambiance/Atmosphere', 'ambiance experience': 'Ambiance/Atmosphere', 'environmental factors': 'Ambiance/Atmosphere', 'design aspects': 'Ambiance/Atmosphere', 'ambiance design': 'Ambiance/Atmosphere', 'ambiance impact': 'Ambiance/Atmosphere', 'ambiance effect': 'Ambiance/Atmosphere', 'ambiance creation': 'Ambiance/Atmosphere', 'ambiance management': 'Ambiance/Atmosphere', 'ambiance style': 'Ambiance/Atmosphere', 'ambiance theme': 'Ambiance/Atmosphere', 'atmospheric experience': 'Ambiance/Atmosphere', 'atmosphere quality': 'Ambiance/Atmosphere', 'ambiance appeal': 'Ambiance/Atmosphere', 
'ambiance influence': 'Ambiance/Atmosphere', 'ambiance characteristics': 'Ambiance/Atmosphere', 'experience': 'Overall Experience', 'overall': 'Overall Experience', 'general experience': 'Overall Experience', 'overall satisfaction': 'Overall Experience', 'visit experience': 'Overall Experience', 'customer experience': 'Overall Experience', 'overall impression': 'Overall Experience', 'overall rating': 'Overall Experience', 'overall quality': 'Overall Experience', 'general impression': 'Overall Experience', 'overall service': 'Overall Experience', 'total experience': 'Overall Experience', 'complete experience': 'Overall Experience', 'experience level': 'Overall Experience', 'comprehensive experience': 'Overall Experience', 'price': 'Price', 'cost': 'Price', 'value': 'Price', 'expensive': 'Price', 'cheap': 'Price', 'affordable': 'Price', 'inexpensive': 'Price', 'value for money': 'Price', 'discount': 'Price', 'price point': 'Price', 'charge': 'Price', 'rate': 'Price', 'fee': 'Price', 'amount': 'Price', 'spending': 'Price', 'costs': 'Price', 'expense': 'Price', 'tab': 'Price', 'bill': 'Price', 'fare': 'Price', 'price range': 'Price', 'price level': 'Price', 'premium': 'Price', 'luxury': 'Price', 'cheapness': 'Price', 'cost-effectiveness': 'Price', 'costliness': 'Price', 'spending power': 'Price', 'affordability': 'Price', 'economic': 'Price', 'discounts': 'Price', 'promotions': 'Price', 'deals': 'Price', 'special offers': 'Price', 'bargains': 'Price', 'savings': 'Price', 'expense level': 'Price', 'financial value': 'Price', 'economic value': 'Price', 'value proposition': 'Price', 'cleanliness': 'Cleanliness', 'hygiene': 'Cleanliness', 'neatness': 'Cleanliness', 'sanitation': 'Cleanliness', 'orderliness': 'Cleanliness', 'tidiness': 'Cleanliness', 'spotless': 'Cleanliness', 'pristine': 'Cleanliness', 'immaculate': 'Cleanliness', 'freshness': 'Cleanliness', 'maintained': 'Cleanliness', 'well-kept': 'Cleanliness', 'clean': 'Cleanliness', 'germ-free': 'Cleanliness', 
'disinfection': 'Cleanliness', 'sanitary': 'Cleanliness', 'cleaned': 'Cleanliness', 'scrubbed': 'Cleanliness', 'polished': 'Cleanliness', 'sterile': 'Cleanliness', 'hygienic': 'Cleanliness', 'sanitized': 'Cleanliness', 'uncluttered': 'Cleanliness', 'spick and span': 'Cleanliness', 'orderly': 'Cleanliness', 'fresh': 'Cleanliness', 'dirt-free': 'Cleanliness', 'dust-free': 'Cleanliness', 'scrubbed': 'Cleanliness', 'cleared': 'Cleanliness', 'spic-and-span': 'Cleanliness', 'well-maintained': 'Cleanliness', 'cleaned': 'Cleanliness', 'immaculate': 'Cleanliness', 'special feature': 'Special Features', 'feature': 'Special Features', 'amenity': 'Special Features', 'attraction': 'Special Features', 'unique': 'Special Features', 'perk': 'Special Features', 'highlight': 'Special Features', 'advantage': 'Special Features', 'extra': 'Special Features', 'offering': 'Special Features', 'facility': 'Special Features', 'benefit': 'Special Features', 'additional feature': 'Special Features', 'distinctive': 'Special Features', 'option': 'Special Features', 'addition': 'Special Features', 'value-add': 'Special Features', 'enhancement': 'Special Features', 'specialty': 'Special Features', 'exclusive': 'Special Features', 'novelty': 'Special Features', 'feature set': 'Special Features', 'special offer': 'Special Features', 'bonus': 'Special Features', 'unique offering': 'Special Features', 'extra feature': 'Special Features', 'distinction': 'Special Features', 'exceptional': 'Special Features', 'premium feature': 'Special Features', 'special characteristic': 'Special Features', 'highlighted feature': 'Special Features', 'notable': 'Special Features', 'rare': 'Special Features', 'differentiator': 'Special Features', 'superior': 'Special Features' }
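
One thing to keep in mind with a large hand-built mapping like this: when a key is repeated in a Python dict literal, the later value silently replaces the earlier one. In the mapping above, `'freshness'` is listed under both Food Quality and Cleanliness, so it ends up as Cleanliness. A small sketch with a shortened excerpt (the `pairs` list and the conflict check are illustrative additions, not part of the notebook):

```python
# A short excerpt with the same kind of repetition as the full mapping
excerpt = {
    'taste': 'Food Quality',
    'freshness': 'Food Quality',
    'hygiene': 'Cleanliness',
    'freshness': 'Cleanliness',   # repeated key: this later value replaces the earlier one
}

print(excerpt['freshness'])
print(len(excerpt))

# Building the mapping from (keyword, category) pairs makes conflicts detectable
pairs = [
    ('taste', 'Food Quality'),
    ('freshness', 'Food Quality'),
    ('hygiene', 'Cleanliness'),
    ('freshness', 'Cleanliness'),
]
seen = {}
conflicts = []
for keyword, category in pairs:
    if keyword in seen and seen[keyword] != category:
        conflicts.append((keyword, seen[keyword], category))
    seen[keyword] = category

print(conflicts)
```

Whether `'freshness'` should count toward food quality or cleanliness is a modeling decision; the check only makes the ambiguity visible.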

In [None]:
def prepar_radar_for_type(df,classes):
  df['words'] = df.apply(lambda row: [row['noun'], row['adj']], axis=1)
  df = df.explode('words')
  df=df.reset_index(drop=True)
  df['words'] = df['words'].replace(categories)
  int_list = ['Food Quality','Service','Ambiance/Atmosphere','Price','Cleanliness','Special Features']
  list_of_type=df['type'].unique()
  dff=pd.DataFrame(columns=int_list)
  for i in list_of_type:
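    # Combine the three conditions into one mask instead of chaining boolean filters
    mask = (df.classes==classes) & (df['type']==i) & (df['words'].isin(int_list))
    temp_df=df[mask][['words','sentiment']].groupby('words').mean().sort_values('sentiment',ascending=False).reset_index()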
    if len(temp_df)==0:
      continue
    temp_df.set_index('words',inplace=True)
    temp_df = temp_df.T
    temp_df.index=[i]
    dff=pd.concat([dff,temp_df],axis=0)
  return dff
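
The core of the function above is: map each word to a category, drop words that fall outside the categories of interest, and average the sentence sentiment per category. A minimal pure-Python sketch of that aggregation follows; the mapping excerpt and the rows are illustrative assumptions, not values from the dataset:

```python
from collections import defaultdict

# Hypothetical excerpt of the keyword -> category mapping
keyword_to_category = {'food': 'Food Quality', 'taste': 'Food Quality',
                       'staff': 'Service', 'bill': 'Price'}

# Illustrative (word, sentence sentiment) rows, as obtained after exploding nouns/adjectives
rows = [('food', 0.8), ('taste', 0.4), ('staff', -0.2), ('bill', -0.6), ('parking', 0.1)]

totals = defaultdict(lambda: [0.0, 0])   # category -> [sum of scores, count]
for word, score in rows:
    category = keyword_to_category.get(word)
    if category is None:                  # words outside the mapping are ignored
        continue
    totals[category][0] += score
    totals[category][1] += 1

mean_by_category = {cat: total / n for cat, (total, n) in totals.items()}
print(mean_by_category)
```

Each resulting dictionary corresponds to one row of the radar-chart frame, that is, one restaurant type within a class.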

In [None]:
plot_radar_charts(prepar_radar_for_type(noun_sentiment_1700_2300.copy(),'High'),single_chart=True,custom_title='Reviews insights for High Class',scalling=False)

In [None]:
plot_radar_charts(prepar_radar_for_type(noun_sentiment_1700_2300.copy(),'Lower-High'),single_chart=True,custom_title='Reviews insights for Lower-High Class',scalling=False)

In [None]:
plot_radar_charts(prepar_radar_for_type(noun_sentiment_1700_2300.copy(),'Upper-Mid'),single_chart=True,custom_title='Reviews insights for Upper-Mid Class',scalling=False)

In [None]:
plot_radar_charts(prepar_radar_for_type(noun_sentiment_1700_2300.copy(),'Mid'),single_chart=True,custom_title='Reviews insights for Mid Class',scalling=False)

### Important Note:

The review analysis presented here is approximate and should be treated as a starting point rather than a definitive result. It is skewed toward specific customer segments, because not every customer leaves a review. In addition, the pipeline favored speed over accuracy, so it may not capture every aspect of the review data.

### Price Range Analysis (1500-2300):

For the price range of 1500-2300, we've observed that this segment overlaps with multiple customer classes, making it challenging to meet everyone's expectations. If restaurants aim to improve food quality and the overall experience to satisfy higher-end customers, the prices may become unsatisfactory for lower-end customers, and vice versa.

However, within this price range, establishments catering to the lower-high class, particularly in categories like Dine-out, Drinks & Nightlife, and Pubs and Bars, tend to perform better than others. Despite this, the overall review sentiment remains less optimistic compared to other segments, which highlights the complexity of catering to a diverse customer base within this range.

### Recommendations:

For investors interested in this price range, it is crucial to carefully select the location and decor to reflect the target customer class. Additionally, the marketing team should intensify its efforts to ensure that the right customer segment is being targeted effectively.


**What do the reviews reveal about each class across all price ranges?**

In [None]:
noun_sentiment_all_range = exctract_review_price_range(sentiment_df)

In [None]:
plot_radar_charts(prepar_radar_for_type(noun_sentiment_all_range.copy(),'High'),single_chart=False,custom_title='Reviews insights for High Class',scalling=False)

In [None]:
plot_radar_charts(prepar_radar_for_type(noun_sentiment_all_range.copy(),'Lower-High'),single_chart=False,custom_title='Reviews insights for lower-High Class',scalling=False)

In [None]:
plot_radar_charts(prepar_radar_for_type(noun_sentiment_all_range.copy(),'Upper-Mid'),single_chart=False,custom_title='Reviews insights for Upper-Mid Class',scalling=False)

In [None]:
plot_radar_charts(prepar_radar_for_type(noun_sentiment_all_range.copy(),'Mid'),single_chart=False,custom_title='Reviews insights for Mid Class',scalling=False)

In [None]:
plot_radar_charts(prepar_radar_for_type(noun_sentiment_all_range.copy(),'Low'),single_chart=False,custom_title='Reviews insights for Low Class',scalling=False)

### Analysis of Customer Satisfaction by Class and Type

#### High Class:
For the high class, there is a noticeable **similarity** in customer satisfaction across **Delivery, Cafes, and Drinks & Nightlife** categories. This similarity might be due to the convenience and premium experience these options offer, catering well to the preferences of this segment. However, **Buffet services** stand out with **higher satisfaction levels** compared to other types, possibly because they offer a variety of options that appeal to this class. On the other hand, **Dine-out experiences** show **poor satisfaction**, particularly regarding **pricing**, likely because the high prices may not align with the perceived value for this class.

**Recommendations for Investors:**  
For investors targeting the high class, **focusing on enhancing the buffet experience** could be promising. It’s also advisable to **reconsider pricing strategies for Dine-out** services to better meet the expectations of this segment.

#### Lower-High Class:
In the lower-high class, there is a similarity in satisfaction levels across **Dine-out, Pubs & Bars, and Drinks & Nightlife** categories, but all of these show **lower satisfaction regarding pricing**. This may be because the pricing in these categories doesn't match the perceived value for this group. **Desserts** also show **poor satisfaction**, indicating a potential area for improvement.

**Recommendations for Investors:**  
Investors could focus on **improving pricing strategies, food quality, special features, and cleanliness** in Dessert offerings. Additionally, **enhancing the Buffet experience** with better pricing and cleanliness could attract this class.

#### Upper-Mid Class:
The upper-mid class shows similarities in satisfaction across **Delivery, Cafes, and Dine-out** categories, with generally acceptable satisfaction except for **pricing**. However, **Desserts and Drinks & Nightlife** categories show **poor satisfaction**, particularly regarding **pricing and cleanliness**. The **best satisfaction levels** are seen in **Pubs & Bars and Buffets**.

**Recommendations for Investors:**  
For this segment, **maintaining the high standards in Pubs & Bars and Buffets** is key. Meanwhile, **improving pricing and cleanliness** in Desserts and Drinks & Nightlife could yield better satisfaction.

#### Mid Class:
The mid class displays **poor satisfaction** in **Drinks & Nightlife** and medium satisfaction in **Pubs & Bars**. In contrast, **Buffets** receive **high satisfaction**, but **cleanliness** is a concern. For **Drinks & Nightlife and Pubs & Bars**, satisfaction with **food quality, services, and special features** is moderate, suggesting areas for potential improvement.

**Recommendations for Investors:**  
Investors should focus on **improving cleanliness in Buffets** and **enhancing the overall experience in Drinks & Nightlife and Pubs & Bars**, particularly in food quality and special features.

#### Low Class:
The low class has the least number of reviews, but the analysis shows **significant satisfaction** in **Delivery, Desserts, Dine-out, Cafes, and Buffets**. However, **Buffet pricing** shows **very poor satisfaction**, and **Dessert pricing** has only **medium satisfaction**. Overall, there is **low satisfaction with Desserts** across all aspects, and **Drinks & similar venues** show **medium to poor satisfaction** with pricing across all classes.

**Recommendations for Investors:**  
For the low class, investors should focus on **addressing pricing concerns in Buffets and in all types of Drinks venues** and **improving the overall Dessert experience**, including pricing and quality. These could be potential areas for gaining a competitive advantage.




To understand this, I analyze the sentiment and aspects expressed in the reviews for this price range.

# Part 3: Building the Price Prediction Model

In [None]:
pd.set_option('display.max_columns', None)

df.head()

Keep only the rows with valid ratings.

In [None]:
valid_df = df[df.is_rate_valid==1]


In [None]:
#valid_df['menu_item'] = valid_df['menu_item'].str.replace(r'[^a-zA-Z,\s]', '', regex=True)

In [None]:
#separate_types('menu_item')

The following columns are dropped because they either add no value to the prediction model or duplicate features that were engineered earlier, which would cause multicollinearity.

In [None]:
cols = ['url','address','name','rest_type','dish_liked','cuisines','reviews_list','review_sentiment_list','menu_item']
pred_df = df.drop(cols,axis=1).copy()

In [None]:
pred_df.head()

### Feature encoding

Apply binary encoding to the Yes/No columns.

In [None]:
pred_df['online_order'] = pred_df['online_order'].replace({'Yes': 1, 'No': 0})
pred_df['book_table'] = pred_df['book_table'].replace({'Yes': 1, 'No': 0})
pred_df['is_new'] = pred_df['is_new'].replace({'Yes': 1, 'No': 0})
pred_df['is_road'] = pred_df['is_road'].replace({'Yes': 1, 'No': 0})

Apply ordinal encoding to the price classes.

In [None]:
pred_df['classes']=pred_df['classes'].replace({'High': 5, 'Lower-High': 4,'Upper-Mid':3,'Mid':2,'Low':1})
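
As a small illustration of the two encodings above, the sketch below applies the same dictionaries to a toy frame with `Series.map`. The toy values, including the `'Unknown'` label, are assumptions made up for the example. Unlike `replace`, `map` turns any label missing from the dictionary into NaN, which makes unexpected labels easy to spot.

```python
import pandas as pd

# Toy data (hypothetical values) mirroring the notebook's column names
toy = pd.DataFrame({
    'online_order': ['Yes', 'No', 'Yes'],
    'classes': ['High', 'Mid', 'Unknown'],  # 'Unknown' is not in the mapping
})

yes_no = {'Yes': 1, 'No': 0}
class_order = {'High': 5, 'Lower-High': 4, 'Upper-Mid': 3, 'Mid': 2, 'Low': 1}

toy['online_order'] = toy['online_order'].map(yes_no)
toy['classes'] = toy['classes'].map(class_order)

# Labels without a mapping become NaN instead of silently staying as strings
unmapped = int(toy['classes'].isna().sum())
print(toy)
print('unmapped class labels:', unmapped)
```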

In [None]:
# Sort by cost (descending) and drop the single most expensive row
pred_df=pred_df.sort_values('approx_cost(for two people)',ascending=False).iloc[1:]

In [None]:
def remove_outliers_iqr(df, column_name):
    """
    Removes outliers from a DataFrame column using the IQR method.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    column_name (str): The name of the column to process.

    Returns:
    pd.DataFrame: DataFrame with outliers removed.
    """
    # Ensure the column exists in the DataFrame
    if column_name not in df.columns:
        raise ValueError(f"Column '{column_name}' does not exist in the DataFrame.")

    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)

    # Calculate IQR
    IQR = Q3 - Q1

    # Determine the lower and upper bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Filter out outliers
    df_filtered = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]

    return df_filtered

#pred_df = remove_outliers_iqr(pred_df,'approx_cost(for two people)')
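
The call above is left commented out, so the IQR filter is not applied in this notebook. To show what it would do, here is a minimal sketch of the same rule on a hypothetical cost column (the numbers are invented for illustration).

```python
import pandas as pd

# Hypothetical costs: eight typical values and one extreme value
costs = pd.DataFrame({'cost': [300, 400, 400, 500, 500, 600, 700, 800, 6000]})

q1 = costs['cost'].quantile(0.25)
q3 = costs['cost'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside [lower, upper]; both ends are inclusive
kept = costs[costs['cost'].between(lower, upper)]
print(f'bounds: [{lower}, {upper}]')
print(f'rows kept: {len(kept)} of {len(costs)}')
```

With these toy values the quartiles are 400 and 700, so only the row with a cost of 6000 falls outside the bounds.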

The remaining categorical features are target encoded because they have too many distinct values to encode otherwise. To avoid data leakage, the data is first split into train and test sets, and the encoding is learned from the training set only.

In [None]:
train,test = train_test_split(pred_df,test_size=0.2,random_state=42)

In [None]:
cols = ['rest_type_0','rest_type_1','cuisines_0','cuisines_1','cuisines_2','cuisines_3','cuisines_4','cuisines_5','cuisines_6','cuisines_7','dish_liked_0','dish_liked_1','dish_liked_2','dish_liked_3','dish_liked_4','dish_liked_5','dish_liked_6','listed_in(type)','listed_in(city)','location']
train_encoded, feature_mappings = target_encode_multiple_columns_with_smoothing(cols, train,'approx_cost(for two people)', smoothing=2)
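
The helper `target_encode_multiple_columns_with_smoothing` is defined earlier in the notebook and is not shown in this part. As a rough sketch of the idea, the hypothetical function `smoothed_target_mapping` below uses one common smoothing formula, `(count * category_mean + smoothing * global_mean) / (count + smoothing)`; the notebook's own helper may differ in detail. The toy data is invented.

```python
import pandas as pd

def smoothed_target_mapping(train, column, target, smoothing=2):
    """Return {category: smoothed mean of target}, learned from `train` only."""
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + smoothing * global_mean) / (
        stats['count'] + smoothing
    )
    return smoothed.to_dict()

# Hypothetical train/test data; category 'C' appears only in the test set
train_toy = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B'],
    'cost': [400, 600, 500, 1500],
})
test_toy = pd.DataFrame({'location': ['A', 'B', 'C']})

mapping = smoothed_target_mapping(train_toy, 'location', 'cost', smoothing=2)
global_mean = train_toy['cost'].mean()

# The test set reuses the training mapping; unseen categories fall back
# to the global training mean in this sketch
test_toy['location_enc'] = test_toy['location'].map(mapping).fillna(global_mean)
print(mapping)
print(test_toy)
```

Rare categories such as `'B'` are pulled toward the global mean, which limits overfitting to a single observation. Falling back to the global mean for unseen categories is a choice made for this sketch; the notebook fills missing encodings with 0 instead.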



In [None]:
train_encoded.fillna(0,inplace=True)

In [None]:
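# Apply the mappings learned on the training set to the test set
for column_name, mapping_dict in feature_mappings.items():
    test[column_name] = test[column_name].map(mapping_dict)

# Categories unseen in training map to NaN; fill with 0 to mirror the training set
test = test.fillna(0)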

In [None]:
cols=['rest_type_0','rest_type_1']
train_encoded = sum_columns_and_delete(train_encoded,cols,'rest_type')
test = sum_columns_and_delete(test,cols,'rest_type')

cols=['cuisines_0','cuisines_1','cuisines_2','cuisines_3','cuisines_4','cuisines_5','cuisines_6','cuisines_7']
train_encoded = sum_columns_and_delete(train_encoded,cols,'cuisines')
test = sum_columns_and_delete(test,cols,'cuisines')

cols=['dish_liked_0','dish_liked_1','dish_liked_2','dish_liked_3','dish_liked_4','dish_liked_5','dish_liked_6']
train_encoded = sum_columns_and_delete(train_encoded,cols,'dish_liked')
test = sum_columns_and_delete(test,cols,'dish_liked')




In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(train_encoded.corr(),annot=True)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature (the target column is excluded)
vif_features = train_encoded.drop(columns=['approx_cost(for two people)'])
vif_data = pd.DataFrame()
vif_data["feature"] = vif_features.columns
vif_data["VIF"] = [variance_inflation_factor(vif_features.values, i) for i in range(vif_features.shape[1])]

vif_data.set_index('feature')
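
For reference, the VIF of a feature is `1 / (1 - R^2)`, where `R^2` comes from regressing that feature on all the others. The sketch below computes it directly with NumPy on synthetic data (an assumption for illustration, not the notebook's features). It adds an intercept to each regression, which is a design choice of this sketch.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: `c` is nearly a copy of `a`, `b` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
v = vif(np.column_stack([a, b, c]))
print(v)
```

The two nearly collinear columns get large values while the independent column stays close to 1, which is the pattern to look for when deciding which engineered features to drop.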

In [None]:
train_encoded.columns

In [None]:
cols = ['rest_type','cuisines','approx_cost(for two people)','classes','num_spec','num_cuisines','num_dish_liked','listed_in(type)','listed_in(city)','num_reviews','book_table','online_order','votes','is_rate_valid']
train_encoded = train_encoded[cols]
test = test[cols]

In [None]:
X_train,y_train = train_encoded.drop('approx_cost(for two people)',axis=1),train_encoded['approx_cost(for two people)']
X_test,y_test = test.drop('approx_cost(for two people)',axis=1),test['approx_cost(for two people)']

In [None]:
sns.histplot(y_train)

In [None]:
#y_train = np.log(y_train)
#y_test = np.log(y_test)
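
The log transform above is commented out, so the models in this notebook are trained on the raw cost. If it were enabled, predictions would have to be mapped back to the original scale before computing errors. A minimal sketch of that round trip, using `np.log1p` and `np.expm1` (a swapped-in variant of `np.log` that also tolerates zeros) on invented values:

```python
import numpy as np

y = np.array([200.0, 400.0, 800.0, 6000.0])  # hypothetical costs

y_log = np.log1p(y)        # compress the long right tail before training
y_back = np.expm1(y_log)   # invert before reporting MAE/RMSE in cost units

print(y_log)
print(np.allclose(y, y_back))
```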

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
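
The scaler is fitted on the training features only and then reused for the test features, so no test statistics leak into training. The sketch below reproduces that arithmetic with NumPy on made-up numbers; `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np

# Hypothetical feature matrices with two columns
X_tr = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
X_te = np.array([[3.0, 700.0]])

mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)       # population std (ddof=0)

X_tr_s = (X_tr - mu) / sigma
X_te_s = (X_te - mu) / sigma   # reuse the training statistics

print(X_tr_s)
print(X_te_s)
```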

In [None]:
def predict(ml_model, X_train=X_train, X_test=X_test):
    # Train the model, print train/test metrics, and plot the prediction errors.
    # Note: y_train and y_test are read from the notebook's global scope.
    from sklearn import metrics

    model = ml_model.fit(X_train, y_train)
    print('Training score : {}'.format(model.score(X_train, y_train)))
    y_prediction = model.predict(X_test)
    print('Predictions are : {}'.format(y_prediction))
    print('\n')
    r2 = metrics.r2_score(y_test, y_prediction)
    print('r2_score : {}'.format(r2))
    print('MAE : {}'.format(metrics.mean_absolute_error(y_test, y_prediction)))
    mse = metrics.mean_squared_error(y_test, y_prediction)
    print('MSE : {}'.format(mse))
    print('RMSE : {}'.format(np.sqrt(mse)))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    # histplot replaces the deprecated sns.distplot
    sns.histplot(y_test - y_prediction, kde=True, ax=ax1)
    ax1.set_title('Distribution of Prediction Errors')
    # x-axis holds the actual values, y-axis holds the predictions
    ax2.scatter(y_test, y_prediction, color='blue')
    # Red diagonal marks perfect predictions (predicted == actual)
    ax2.plot(y_prediction, y_prediction, color='red')
    ax2.set_xlabel('Actual')
    ax2.set_ylabel('Predicted')
    ax2.set_title('Actual vs Predicted')
    plt.show()

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
import joblib

rf_reg = RandomForestRegressor(n_estimators=600, random_state=42)

predict(rf_reg,X_train=X_train,X_test=X_test)
joblib.dump(rf_reg, 'random_forest_model.pkl')  # Save the model to a file


In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

np.random.seed(42)

# Assuming X_train, X_test, y_train, y_test are already defined and preprocessed

# Convert datasets into DMatrix, the data structure used by XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

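# Define XGBoost parameters (these can be tuned further)
params = {
    'objective': 'reg:squarederror',  # For regression tasks
    'learning_rate': 0.1,
    'max_depth': 6,
    'alpha': 10,  # L1 regularization
    'eval_metric': 'rmse'
}
# 'n_estimators' belongs to the scikit-learn wrapper; with xgb.train the number
# of trees is set by num_boost_round, so it is omitted here.

# Train the model; stop when the test RMSE has not improved for 10 rounds.
# Note: early stopping on the test set makes the reported test error optimistic.
bst = xgb.train(params, dtrain, num_boost_round=2000, evals=[(dtest, "test")], early_stopping_rounds=10, verbose_eval=True)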

# Predict on test data
y_pred = bst.predict(dtest)

# Evaluate the model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse}")
bst.save_model('xgb_model.json')
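
Early stopping with a patience of 10 rounds means training halts once the monitored loss has gone 10 rounds without a new best value. The plain-Python sketch below mimics that rule on an invented loss curve; the function name `best_round_with_patience` is hypothetical and is not part of XGBoost.

```python
def best_round_with_patience(val_losses, patience):
    """Return (best_index, stop_index) for a sequence of validation losses."""
    best_idx, best_loss = 0, float('inf')
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            return best_idx, i  # no improvement for `patience` rounds
    return best_idx, len(val_losses) - 1

# Invented loss curve: the best value within reach is at index 2
losses = [10.0, 8.0, 7.5, 7.6, 7.7, 7.8, 7.4]
result = best_round_with_patience(losses, patience=3)
print(result)  # (2, 5)
```

With a patience of 3, training stops at index 5 and never sees the lower loss at index 6, which is the usual trade-off between training time and the risk of stopping too early.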

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred = bst.predict(dtest)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
print(f'R-squared (R2): {r2:.5f}')

fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15,5))
# histplot replaces the deprecated sns.distplot
sns.histplot(y_test-y_pred.reshape(-1),kde=True,ax=ax1)
ax1.set_title('Distribution of Prediction Errors')
# x-axis holds the actual values, y-axis holds the predictions
ax2.scatter(y_test, y_pred.reshape(-1), color = 'blue')
ax2.plot(y_pred.reshape(-1), y_pred.reshape(-1), color = 'red')
ax2.set_xlabel('Actual')
ax2.set_ylabel('Predicted')
ax2.set_title('Actual vs Predicted')
plt.show()
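
For readers who want to check the reported metrics by hand, the sketch below computes MAE, MSE, RMSE, and R-squared with NumPy on four invented actual/predicted pairs. The values are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

y_true = np.array([300.0, 500.0, 700.0, 900.0])  # hypothetical actual costs
y_hat = np.array([350.0, 450.0, 800.0, 800.0])   # hypothetical predictions

err = y_true - y_hat
mae = np.mean(np.abs(err))                 # mean absolute error
mse = np.mean(err ** 2)                    # mean squared error
rmse = np.sqrt(mse)                        # same units as the target
r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
```

Here the errors are -50, 50, -100, and 100, giving an MAE of 75, an MSE of 6250, and an R-squared of 0.875.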

In [None]:
X_train.shape

In [None]:
import tensorflow as tf
from tensorflow.keras.optimizers import SGD, Adagrad, RMSprop, Adam, Nadam, Adadelta,AdamW,Adamax
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, TensorBoard, ModelCheckpoint
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout, LeakyReLU
from tensorflow.keras.regularizers import l2

np.random.seed(42)
tf.random.set_seed(42)


# Choose the optimizer
optimizer = AdamW(learning_rate=0.1, epsilon=1e-7)

# Define your model architecture
ann = tf.keras.models.Sequential()
ann.add(Dense(units=13, activation='tanh'))
ann.add(BatchNormalization())
ann.add(Dropout(0.18))

ann.add(Dense(units=32, activation=None))
ann.add(LeakyReLU(alpha=0.01))
ann.add(BatchNormalization())
ann.add(Dropout(0.18))

ann.add(Dense(units=64, activation=None))
ann.add(LeakyReLU(alpha=0.01))
ann.add(BatchNormalization())
ann.add(Dropout(0.18))

ann.add(Dense(units=32, activation=None))
ann.add(LeakyReLU(alpha=0.01))
ann.add(BatchNormalization())
ann.add(Dropout(0.18))

ann.add(Dense(units=1))

# Callback to stop training once the validation loss has not improved for 50 epochs
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50)

# Callback to reduce learning rate on plateau
lr_reduction = ReduceLROnPlateau(monitor='val_loss',
                                  factor=0.9,
                                  patience=5,
                                  min_lr=1e-6,
                                  verbose=1)

# Callback to save the best model
checkpoint = ModelCheckpoint(filepath='best_ann_model.keras',
                             monitor='val_loss',
                             save_best_only=True,
                             mode='min',
                             verbose=1)

# Compile the model
ann.compile(optimizer=optimizer, loss='mean_squared_error')

# Fit the model with callbacks (note: the test set doubles as the validation set)
ann.fit(X_train, y_train,
        batch_size=1024,
        epochs=100000,
        validation_data=(X_test, y_test),
        callbacks=[early_stop, lr_reduction, checkpoint],
        verbose=1)

# Load the best model
ann.load_weights('best_ann_model.keras')

In [None]:
sns.lineplot(x=ann.history.epoch[50:],y=ann.history.history['loss'][50:],label='Training Loss')
sns.lineplot(x=ann.history.epoch[50:],y=ann.history.history['val_loss'][50:],label='Validation Loss')

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred = ann.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
print(f'R-squared (R2): {r2:.5f}')

fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15,5))
# histplot replaces the deprecated sns.distplot
sns.histplot(y_test-y_pred.reshape(-1),kde=True,ax=ax1)
ax1.set_title('Distribution of Prediction Errors')
# x-axis holds the actual values, y-axis holds the predictions
ax2.scatter(y_test, y_pred.reshape(-1), color = 'blue')
ax2.plot(y_pred.reshape(-1), y_pred.reshape(-1), color = 'red')
ax2.set_xlabel('Actual')
ax2.set_ylabel('Predicted')
ax2.set_title('Actual vs Predicted')
plt.show()

# Improvements

## 1. Increase Review Volume and Enhance Aspect Analysis
- Gather more reviews and run a more advanced aspect analysis with a paid LLM. This would give the model richer review-based features, which may lead to more accurate predictions.

## 2. Improve Address Accuracy
- Obtain more precise addresses or correct existing ones. This will help reveal detailed location information, which can significantly impact price predictions.

## 3. Incorporate Detailed Menu Pricing
- Instead of predicting an approximation for the entire menu, obtain detailed menus with item-specific prices. This approach will enhance the accuracy of price predictions.

## 4. Expand Data Collection
- Collect additional data to improve the overall model performance and prediction accuracy.


In [None]:
%%shell
jupyter nbconvert --to html '/content/drive/MyDrive/Zomato Geospatial Analysis/Copy of Bangalore Dining Insights & Strategic Investment Plan + Price Prediction with ML.ipynb'