# Zomato Dataset Analysis

**Detailed exploratory data analysis (EDA)** of the Zomato dataset, following the structure of the provided Spotify notebook and covering restaurants, cuisines, ratings, pricing, and city-level patterns.

**Contents**

1. Introduction
2. Imports
3. Load data
4. Initial inspection
5. Data cleaning & preprocessing
6. Feature engineering
7. Exploratory Data Analysis (visuals & insights)
8. Advanced analysis ideas
9. Insights & observations
10. Conclusion


In [None]:

# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Display settings
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100)
plt.rcParams['figure.figsize'] = (10,6)
sns.set(style='whitegrid')

print('Libraries imported')

In [None]:

# Load dataset (adjust path if necessary)
csv_path = '/mnt/data/Zomato_Dataset/Zomato Dataset.csv'
df = pd.read_csv(csv_path, encoding='latin-1')
print('Loaded:', df.shape)
df.head()

In [None]:

# Initial inspection
print('Columns:', len(df.columns))
df.info()  # prints its summary directly and returns None, so it is not wrapped in display()
display(df.describe(include='all').T)

## Data Cleaning & Preprocessing

Steps:
- Rename columns to consistent snake_case
- Handle missing values
- Convert numeric columns
- Remove duplicates
- Basic filtering (remove unrealistic prices/ratings)
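
The last step in the list, filtering unrealistic values, is not spelled out in the cells that follow. A minimal sketch on a toy frame, assuming ratings lie on a 0 to 5 scale and that prices must be positive. The bounds, the `max_price` option, and the helper name `filter_plausible` are illustrative assumptions, not properties of the Zomato file:

```python
import numpy as np
import pandas as pd

def filter_plausible(frame, rating_col='average_rating', price_col='prices_clean',
                     rating_bounds=(0.0, 5.0), max_price=None):
    """Keep rows whose rating and price look plausible (hypothetical helper).

    Missing values are kept; only present-but-implausible values are dropped.
    """
    keep = pd.Series(True, index=frame.index)
    if rating_col in frame.columns:
        rating = frame[rating_col]
        keep &= rating.isna() | rating.between(*rating_bounds)
    if price_col in frame.columns:
        price = frame[price_col]
        ok = price.isna() | (price > 0)
        if max_price is not None:
            ok &= price.isna() | (price <= max_price)
        keep &= ok
    return frame[keep].reset_index(drop=True)

# Toy data: one rating above the assumed scale and one negative price
toy = pd.DataFrame({
    'average_rating': [4.2, 7.5, np.nan, 3.9],
    'prices_clean': [250.0, 120.0, 80.0, -10.0],
})
filtered = filter_plausible(toy)
print(filtered)
```

Keeping rows with missing values here is a deliberate choice, so that filtering and imputation stay separate decisions.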

In [None]:

# Make a copy and normalize column names
data = df.copy()
data.columns = [c.strip().replace(' ', '_').replace('/', '_').replace('-', '_') for c in data.columns]
data.rename(columns=lambda x: x.lower(), inplace=True)
data.shape

In [None]:

# Check null counts and duplicates
nulls = data.isnull().sum().sort_values(ascending=False)
display(nulls.head(30))
print('Duplicates:', data.duplicated().sum())
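
Counting nulls only diagnoses the problem. One possible way to act on the counts is sketched below on a toy frame. The 50% threshold for dropping sparse columns and the choice of `restaurant_name` as a required key column are assumptions made for illustration, and the right policy depends on the actual data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'restaurant_name': ['A', 'B', None, 'D'],
    'average_rating': [4.1, np.nan, 3.8, 4.5],
    'notes': [None, None, None, 'x'],
})

# Share of missing values per column, highest first
null_share = toy.isnull().mean().sort_values(ascending=False)

# Assumed policy: drop columns that are more than half empty,
# then drop rows that lack the (assumed) key column 'restaurant_name'.
sparse_cols = null_share[null_share > 0.5].index.tolist()
cleaned = toy.drop(columns=sparse_cols).dropna(subset=['restaurant_name'])

print(sparse_cols)
print(cleaned.shape)
```

Missing ratings are left as NaN rather than imputed, so a later indicator such as `rating_missing` still carries information.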

In [None]:

# Example cleaning steps (safe and conservative)
# Convert ratings to numeric where appropriate
for col in ['dining_rating','delivery_rating','average_rating','avg_rating_restaurant','avg_rating_cuisine','avg_rating_city']:
    if col in data.columns:
        data[col] = pd.to_numeric(data[col], errors='coerce')

# Prices may be string; try to convert
if 'prices' in data.columns:
    # strip currency symbols and commas, then convert to numeric;
    # a range such as '200-400' keeps its hyphen, fails to parse, and becomes NaN
    data['prices_clean'] = (data['prices'].astype(str)
                            .str.replace('[^0-9.-]', '', regex=True)
                            .replace('', np.nan))
    data['prices_clean'] = pd.to_numeric(data['prices_clean'], errors='coerce')

# Drop exact duplicate rows
data = data.drop_duplicates().reset_index(drop=True)
data.shape
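
If the price column does contain ranges, stripping non-numeric characters leaves a string such as `'200-400'`, which `pd.to_numeric(..., errors='coerce')` turns into NaN. A hedged alternative is to take the midpoint of the range. The helper `parse_price` and the sample strings below are hypothetical, and the midpoint rule is an assumption rather than something the dataset prescribes:

```python
import re
import numpy as np
import pandas as pd

def parse_price(value):
    """Return a float price; ranges like '200-400' become their midpoint (hypothetical helper)."""
    if pd.isna(value):
        return np.nan
    # Drop thousands separators, then pull out up to two unsigned numbers
    numbers = re.findall(r'\d+(?:\.\d+)?', str(value).replace(',', ''))
    if not numbers:
        return np.nan
    values = [float(n) for n in numbers[:2]]
    return sum(values) / len(values)

raw = pd.Series(['Rs 250', '1,200', '200-400', 'n/a', None])
parsed = raw.apply(parse_price)
print(parsed.tolist())
```

The pattern ignores minus signs on purpose, since a hyphen in a price string is far more likely to be a range separator than a negative value.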

In [None]:

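# Quick checks after cleaning (show only the columns that exist in this dataset)
check_cols = [c for c in ['dining_rating','delivery_rating','average_rating','prices','prices_clean']
              if c in data.columns]
display(data[check_cols].head(10))
data.isnull().mean().sort_values(ascending=False).head(20)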

## Feature engineering

Create helpful features:
- total_votes
- price_per_vote (if applicable)
- is_expensive (0/1 flag: price above the dataset median)
- cuisine list (split multiple cuisines)
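
Two of the features listed above, `price_per_vote` and the full cuisine list, can be sketched as follows on a toy frame. The column names mirror the ones used in this notebook, but the data, the `cuisine_list` column name, and the choice to treat zero votes as missing are assumptions for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'restaurant_name': ['A', 'B', 'C'],
    'prices_clean': [200.0, 450.0, 120.0],
    'total_votes': [100, 0, 60],
    'cuisine': ['North Indian, Chinese', 'Pizza', None],
})

# price_per_vote: treat zero votes as missing to avoid dividing by zero
votes = toy['total_votes'].where(toy['total_votes'] > 0)
toy['price_per_vote'] = toy['prices_clean'] / votes

# cuisine_list: split on commas and strip whitespace; a missing value becomes an empty list
toy['cuisine_list'] = (toy['cuisine'].fillna('')
                       .apply(lambda s: [c.strip() for c in s.split(',') if c.strip()]))

# One row per (restaurant, cuisine) pair, convenient for counting every cuisine
exploded = toy.explode('cuisine_list').dropna(subset=['cuisine_list'])
print(exploded[['restaurant_name', 'cuisine_list']])
```

Exploding the list counts secondary cuisines too, whereas a first-cuisine-only column counts each row once.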

In [None]:

# Feature engineering examples
if 'total_votes' not in data.columns:
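    if 'dining_votes' in data.columns and 'delivery_votes' in data.columns:
        # coerce to numeric first so string-typed vote counts are added, not concatenated
        data['total_votes'] = (pd.to_numeric(data['dining_votes'], errors='coerce').fillna(0)
                               + pd.to_numeric(data['delivery_votes'], errors='coerce').fillna(0))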
    elif 'votes' in data.columns:
        data['total_votes'] = pd.to_numeric(data['votes'], errors='coerce')
    else:
        data['total_votes'] = np.nan

# is_expensive: 1 if the cleaned price is above the dataset median (heuristic);
# rows with a missing price compare as False and therefore get 0
if 'is_expensive' not in data.columns and 'prices_clean' in data.columns:
    data['is_expensive'] = (data['prices_clean'] > data['prices_clean'].median()).astype(int)

# split cuisines (keep first cuisine as main category)
# astype(str) turns missing values into the string 'nan', so map those back to NaN
if 'cuisine' in data.columns:
    data['cuisine_main'] = (data['cuisine'].astype(str).str.split(',').str[0].str.strip()
                            .replace('nan', np.nan))
elif 'cuisines' in data.columns:
    data['cuisine_main'] = (data['cuisines'].astype(str).str.split(',').str[0].str.strip()
                            .replace('nan', np.nan))
    
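# example derived column: rating_missing (1 if average_rating is null)
if 'average_rating' in data.columns:
    data['rating_missing'] = data['average_rating'].isnull().astype(int)

# preview only the engineered columns that were actually created
feature_cols = [c for c in ['total_votes','prices_clean','is_expensive','cuisine_main','rating_missing']
                if c in data.columns]
data[feature_cols].head()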

## Exploratory Data Analysis (EDA)

We'll produce a series of charts and tables to understand the dataset.

In [None]:

# Distribution of Average Ratings
if 'average_rating' in data.columns:
    plt.figure(figsize=(8,5))
    plt.hist(data['average_rating'].dropna(), bins=30)
    plt.title('Distribution of Average Rating')
    plt.xlabel('Average Rating')
    plt.ylabel('Count')
    plt.show()


In [None]:

# Top 15 cuisines
if 'cuisine_main' in data.columns:
    top_cuisines = data['cuisine_main'].value_counts().nlargest(15)
    display(top_cuisines)
    plt.figure(figsize=(10,6))
    sns.barplot(y=top_cuisines.index, x=top_cuisines.values)
    plt.title('Top 15 Cuisines (by item/restaurant entries)')
    plt.xlabel('Count')
    plt.ylabel('Cuisine')
    plt.show()


In [None]:

# Top 20 restaurants by total_votes
if 'total_votes' in data.columns and 'restaurant_name' in data.columns:
    top_rest = (data.groupby('restaurant_name')['total_votes']
                .sum().sort_values(ascending=False).head(20))
    display(top_rest)
    plt.figure(figsize=(10,6))
    sns.barplot(y=top_rest.index, x=top_rest.values)
    plt.title('Top 20 Restaurants by Total Votes')
    plt.xlabel('Total Votes')
    plt.ylabel('Restaurant')
    plt.show()


In [None]:

# Average rating by city (the 15 cities with the most entries, ordered by rating)
if 'city' in data.columns and 'average_rating' in data.columns:
    largest_cities = data['city'].value_counts().nlargest(15).index
    city_avg = (data[data['city'].isin(largest_cities)].groupby('city')['average_rating']
                .mean().sort_values(ascending=False))
    display(city_avg)
    plt.figure(figsize=(12,6))
    sns.barplot(x=city_avg.values, y=city_avg.index)
    plt.title('Average Rating by City (15 largest cities by entry count)')
    plt.xlabel('Average Rating')
    plt.ylabel('City')
    plt.show()


In [None]:

# Price vs Rating scatter
if 'prices_clean' in data.columns and 'average_rating' in data.columns:
    plt.figure(figsize=(8,6))
    plt.scatter(data['prices_clean'], data['average_rating'], alpha=0.3)
    plt.title('Price vs Average Rating (scatter)')
    plt.xlabel('Price (cleaned)')
    plt.ylabel('Average Rating')
    plt.show()


In [None]:

# Top cuisines by average rating (with at least min_samples entries)
min_samples = 50
if all(c in data.columns for c in ['cuisine_main','average_rating','prices_clean']):
    cuisine_stats = (data.groupby('cuisine_main')
                     .agg(count_items=('cuisine_main','size'), avg_rating=('average_rating','mean'),
                          avg_price=('prices_clean','mean'))
                     .sort_values('count_items', ascending=False))
    display(cuisine_stats.head(20))
    # keep only cuisines with enough entries for the mean rating to be stable
    popular_cuisines = (cuisine_stats[cuisine_stats['count_items'] >= min_samples]
                        .sort_values('avg_rating', ascending=False).head(15))
    plt.figure(figsize=(10,6))
    sns.barplot(x=popular_cuisines['avg_rating'], y=popular_cuisines.index)
    plt.title(f'Top cuisines by average rating (>={min_samples} entries)')
    plt.xlabel('Average Rating')
    plt.ylabel('Cuisine')
    plt.show()


In [None]:

# Correlation heatmap for numeric features
num_cols = data.select_dtypes(include=[np.number]).columns.tolist()
if len(num_cols) > 1:
    corr = data[num_cols].corr()
    plt.figure(figsize=(10,8))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdYlBu', center=0)
    plt.title('Correlation matrix (numeric features)')
    plt.show()
else:
    print('No numeric columns to correlate.')

## Advanced analysis ideas (examples)

- Identify highly-rated & affordable restaurants
- Top cuisines per city
- Time-series if data has timestamp (not present here)
- Restaurant popularity scoring
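
The popularity-scoring idea has no cell of its own below. One common approach, used here as an illustrative choice and not taken from this notebook, is a vote-weighted rating that shrinks each restaurant's rating toward the global mean when it has few votes. The function name `popularity_score` and the `min_votes=50` prior are assumptions:

```python
import pandas as pd

def popularity_score(frame, rating_col='average_rating', votes_col='total_votes', min_votes=50):
    """Shrink each rating toward the global mean in proportion to how few votes it has."""
    global_mean = frame[rating_col].mean()
    votes = frame[votes_col].fillna(0)
    weight = votes / (votes + min_votes)  # close to 1 for heavily voted rows
    return weight * frame[rating_col] + (1 - weight) * global_mean

# Toy data: 'A' has the best raw rating but very few votes
toy = pd.DataFrame({
    'restaurant_name': ['A', 'B', 'C'],
    'average_rating': [4.8, 4.2, 3.0],
    'total_votes': [5, 500, 200],
})
toy['popularity'] = popularity_score(toy)
ranked = toy.sort_values('popularity', ascending=False)
print(ranked[['restaurant_name', 'popularity']])
```

With these toy numbers, restaurant B outranks A despite a lower raw rating, because A's five votes carry little weight. If the real data has one row per menu item, aggregating to one row per restaurant first would be the natural preliminary step.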

In [None]:

# Highly-rated & affordable: rating >= 4.0 and price at or below the median
required = ['average_rating','prices_clean','restaurant_name','total_votes']
if all(c in data.columns for c in required):
    med = data['prices_clean'].median()
    high_affordable = data[(data['average_rating']>=4.0) & (data['prices_clean']<=med)]
    top_high_affordable = (high_affordable.groupby('restaurant_name')
                           .agg(avg_rating=('average_rating','mean'), avg_price=('prices_clean','mean'),
                                total_votes=('total_votes','sum'))
                           .sort_values(['avg_rating','total_votes'], ascending=[False,False]).head(30))
    display(top_high_affordable)
else:
    print('Required columns not present for this analysis.')

In [None]:

# Top cuisines per city (example for top 10 cities)
if 'city' in data.columns and 'cuisine_main' in data.columns:
    top_cities = data['city'].value_counts().nlargest(10).index
    top_cuisine_city = (data[data['city'].isin(top_cities)]
                       .groupby(['city','cuisine_main']).size().reset_index(name='count'))
    top_by_city = top_cuisine_city.sort_values(['city','count'], ascending=[True,False]).groupby('city').head(5)
    display(top_by_city)
else:
    print('City or cuisine data missing')

## Insights & Observations

- Summarize the main findings here. Write 6–10 bullet points discussing top cuisines, city-level differences, price vs rating behaviour, and any data quality notes.

## Conclusion

Short wrap-up and suggestions for next steps (e.g., build a recommendation system, use NLP on menu items, or enrich with location/restaurant metadata).