# Swiggy Restaurant Recommendation System  
## Data Understanding and Exploration

### Objective
The objective of this notebook is to understand the structure, quality, and characteristics
of the Swiggy restaurant dataset. This step helps identify missing values, duplicates,
and inconsistencies before performing data cleaning and preprocessing.


In [1]:
import pandas as pd
import numpy as np


In [2]:
# Loading raw Swiggy restaurant data
df = pd.read_csv("../data/raw/swiggy_raw.csv")

# Display first 5 rows
df.head()


,id,name,city,rating,rating_count,cost,cuisine,lic_no,link,address,menu
0,567335,AB FOODS POINT,Abohar,--,Too Few Ratings,₹ 200,"Beverages,Pizzas",22122652000138,https://www.swiggy.com/restaurants/ab-foods-po...,"AB FOODS POINT, NEAR RISHI NARANG DENTAL CLINI...",Menu/567335.json
1,531342,Janta Sweet House,Abohar,4.4,50+ ratings,₹ 200,"Sweets,Bakery",12117201000112,https://www.swiggy.com/restaurants/janta-sweet...,"Janta Sweet House, Bazar No.9, Circullar Road,...",Menu/531342.json
2,158203,theka coffee desi,Abohar,3.8,100+ ratings,₹ 100,Beverages,22121652000190,https://www.swiggy.com/restaurants/theka-coffe...,"theka coffee desi, sahtiya sadan road city",Menu/158203.json
3,187912,Singh Hut,Abohar,3.7,20+ ratings,₹ 250,"Fast Food,Indian",22119652000167,https://www.swiggy.com/restaurants/singh-hut-n...,"Singh Hut, CIRCULAR ROAD NEAR NEHRU PARK ABOHAR",Menu/187912.json
4,543530,GRILL MASTERS,Abohar,--,Too Few Ratings,₹ 250,"Italian-American,Fast Food",12122201000053,https://www.swiggy.com/restaurants/grill-maste...,"GRILL MASTERS, ADA Heights, Abohar - Hanumanga...",Menu/543530.json


In [3]:
df.shape

(148541, 11)

### Dataset Shape

- The dataset contains 148,541 rows and 11 columns
- Each row represents a restaurant
- Each column represents an attribute such as city, rating, cost, or cuisine


In [4]:
# Viewing all column names
df.columns


Index(['id', 'name', 'city', 'rating', 'rating_count', 'cost', 'cuisine',
       'lic_no', 'link', 'address', 'menu'],
      dtype='object')

### Column Description

| Column Name | Description |
|------------|-------------|
| id | Unique restaurant identifier |
| name | Restaurant name |
| city | City where the restaurant is located |
| rating | Average customer rating |
| rating_count | Number of ratings |
| cost | Average cost for two |
| cuisine | Cuisine types |
| lic_no | License number |
| link | Swiggy restaurant URL |
| address | Restaurant address |
| menu | Menu file reference |


In [11]:
# Checking data types of each column
df.dtypes


id               int64
name            object
city            object
rating          object
rating_count    object
cost            object
cuisine         object
lic_no          object
link            object
address         object
menu            object
dtype: object

### Data Type Observations

- Every column except id (int64) is stored as object type
- Rating, rating_count, and cost should be numerical
- These columns require cleaning and type conversion
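
The conversion described above could look roughly like the following. This is a minimal sketch, not the cleaning notebook's actual code: `sample` is a hypothetical three-row frame that mirrors the value formats visible in the `df.head()` output, and the `*_num` column names are illustrative. It assumes rating counts are written as plain digits followed by `+ ratings`; any bucket written differently would need separate handling.

```python
import pandas as pd

# Hypothetical sample mirroring the raw value formats shown by df.head()
sample = pd.DataFrame({
    "rating": ["--", "4.4", "3.8"],
    "rating_count": ["Too Few Ratings", "50+ ratings", "100+ ratings"],
    "cost": ["₹ 200", "₹ 200", "₹ 100"],
})

# "--" cannot be parsed as a number, so errors="coerce" turns it into NaN
sample["rating_num"] = pd.to_numeric(sample["rating"], errors="coerce")

# Keep the first run of digits: "50+ ratings" -> 50; text without digits -> NaN
sample["rating_count_num"] = pd.to_numeric(
    sample["rating_count"].str.extract(r"(\d+)", expand=False),
    errors="coerce",
)

# Drop everything except digits and the decimal point: "₹ 200" -> 200
sample["cost_num"] = pd.to_numeric(
    sample["cost"].str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

print(sample.dtypes)
```

Using `errors="coerce"` keeps unparseable entries as NaN instead of raising, so they show up in later missing-value counts.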


In [24]:
# Checking missing values
df.isnull().sum()


id                0
name             86
city              0
rating           86
rating_count     86
cost            131
cuisine          99
lic_no          229
link              0
address          86
menu              0
dtype: int64

### Missing Values Observation

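- Seven of the 11 columns contain null values, ranging from 86 (name, rating, rating_count, address) to 229 (lic_no)
- isnull() does not count non-numeric placeholders such as "--" in rating and "Too Few Ratings" in rating_count, so the real amount of missing rating data is higher than the counts above
- These issues will be handled in the data cleaning stage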
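
To see how much the placeholders add to the null counts, one option is to count them alongside real nulls. The sketch below uses a hypothetical four-row `sample` frame; the placeholder strings are the two visible in the `df.head()` output, and other placeholders may exist in the full data.

```python
import pandas as pd

# Hypothetical sample: one real null and one placeholder per column
sample = pd.DataFrame({
    "rating": ["--", "4.4", None, "3.7"],
    "rating_count": ["Too Few Ratings", "50+ ratings", None, "20+ ratings"],
})

# Placeholder strings seen in the raw data (assumed list, may be incomplete)
placeholders = ["--", "Too Few Ratings"]

# Nulls as reported by isnull() alone
raw_missing = sample.isnull().sum()

# Nulls plus placeholder strings
effective_missing = (sample.isnull() | sample.isin(placeholders)).sum()

print(raw_missing)
print(effective_missing)
```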


In [26]:
# Checking for duplicate rows
df.duplicated().sum()


np.int64(0)

### Duplicate Records Observation

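- No fully identical rows were found (df.duplicated().sum() returns 0)
- This check does not rule out restaurants repeated under the same id or name; any such duplicates should be removed to avoid repeated recommendations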
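
A whole-row comparison can miss restaurants that appear more than once with small differences in other columns. A minimal sketch of a key-based check follows, assuming id is meant to be unique per restaurant (as the column description states); `sample` is a hypothetical frame built for illustration.

```python
import pandas as pd

# Hypothetical sample: id 2 appears twice but the rows differ in city
sample = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["A", "B", "B", "C"],
    "city": ["X", "Y", "Z", "X"],
})

# Whole-row comparison finds nothing because the city values differ
full_row_dupes = sample.duplicated().sum()

# Key-based comparison flags the second occurrence of id 2
id_dupes = sample.duplicated(subset=["id"]).sum()

# keep=False returns every row that shares an id, for manual inspection
repeated_rows = sample[sample.duplicated(subset=["id"], keep=False)]

print(full_row_dupes, id_dupes)
print(repeated_rows)
```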


In [30]:
# Number of unique cities
df['city'].nunique()


821

In [31]:
# Number of unique cuisines
df['cuisine'].nunique()


2132

### City and Cuisine Analysis

- Restaurants are distributed across 821 unique cities
- Cuisine column has 2,132 unique values because a single entry can list several comma-separated cuisines (e.g. "Beverages,Pizzas")
- One-Hot Encoding will be required during preprocessing
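
Because one cuisine entry can hold several comma-separated values, a plain one-hot encoding of the raw strings would create one column per combination. One possible approach is to encode each individual cuisine as its own indicator column with `Series.str.get_dummies`. This is a sketch on a hypothetical `sample` frame using values from the `df.head()` output; it assumes cuisines are separated by a bare comma with no surrounding spaces.

```python
import pandas as pd

# Hypothetical sample using cuisine strings from the df.head() output
sample = pd.DataFrame({
    "cuisine": ["Beverages,Pizzas", "Sweets,Bakery", "Beverages", "Fast Food,Indian"],
})

# Split on "," and create one 0/1 column per individual cuisine
cuisine_flags = sample["cuisine"].str.get_dummies(sep=",")

print(cuisine_flags)
```

Each row can then have several columns set to 1, which keeps the number of encoded columns tied to the number of distinct cuisines rather than to the number of combinations.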


In [32]:
# Summary statistics for rating, rating_count, and cost
# (all three are still object dtype, so describe() reports count/unique/top/freq)
df[['rating', 'rating_count', 'cost']].describe()


,rating,rating_count,cost
count,148455,148455,148410
unique,42,8,363
top,--,Too Few Ratings,₹ 200
freq,87014,87014,38635


### Numerical Feature Observations

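- Rating column contains the placeholder "--" in 87,014 of its 148,455 non-null rows, in addition to the null values
- Cost column contains currency symbols (e.g. "₹ 200") and inconsistent formats
- Rating count is stored as text in 8 distinct labels (e.g. "50+ ratings", "Too Few Ratings") and needs conversion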


## Key Observations

- Dataset contains missing and inconsistent values; no fully duplicated rows were found
- Categorical features require encoding
- Numerical features need cleaning and conversion
- Proper preprocessing is required before building the recommendation system

### Next Step
Proceed to data cleaning and preprocessing in the next notebook.
