# Swiggy Restaurant Recommendation System  
## Data Cleaning

### Objective
The purpose of this notebook is to clean the raw Swiggy restaurant dataset by:
- Removing duplicate records
- Handling missing and invalid values
- Cleaning numerical columns such as rating, rating_count, and cost

The cleaned dataset will be saved for further preprocessing and modeling.


In [1]:
import pandas as pd
import numpy as np


In [2]:
# Load raw data
df = pd.read_csv("../data/raw/swiggy_raw.csv")

# Preview dataset
df.head()


Unnamed: 0,id,name,city,rating,rating_count,cost,cuisine,lic_no,link,address,menu
0,567335,AB FOODS POINT,Abohar,--,Too Few Ratings,₹ 200,"Beverages,Pizzas",22122652000138,https://www.swiggy.com/restaurants/ab-foods-po...,"AB FOODS POINT, NEAR RISHI NARANG DENTAL CLINI...",Menu/567335.json
1,531342,Janta Sweet House,Abohar,4.4,50+ ratings,₹ 200,"Sweets,Bakery",12117201000112,https://www.swiggy.com/restaurants/janta-sweet...,"Janta Sweet House, Bazar No.9, Circullar Road,...",Menu/531342.json
2,158203,theka coffee desi,Abohar,3.8,100+ ratings,₹ 100,Beverages,22121652000190,https://www.swiggy.com/restaurants/theka-coffe...,"theka coffee desi, sahtiya sadan road city",Menu/158203.json
3,187912,Singh Hut,Abohar,3.7,20+ ratings,₹ 250,"Fast Food,Indian",22119652000167,https://www.swiggy.com/restaurants/singh-hut-n...,"Singh Hut, CIRCULAR ROAD NEAR NEHRU PARK ABOHAR",Menu/187912.json
4,543530,GRILL MASTERS,Abohar,--,Too Few Ratings,₹ 250,"Italian-American,Fast Food",12122201000053,https://www.swiggy.com/restaurants/grill-maste...,"GRILL MASTERS, ADA Heights, Abohar - Hanumanga...",Menu/543530.json


### Dataset Loading

The raw Swiggy dataset is loaded from the data/raw directory.
No modifications are applied at this stage, so the original file
remains available for reference.


In [3]:
df.shape


(148541, 11)

In [4]:
# Removing duplicate rows
df = df.drop_duplicates()

# Shape after removing duplicates
df.shape


(148541, 11)

### Duplicate Removal

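- `drop_duplicates()` was applied to remove fully identical rows
- The shape is unchanged at (148541, 11), so the raw data contained no exact duplicate rows
- Checking for duplicates helps avoid bias and repeated recommendations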
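
Because the shape is identical before and after `drop_duplicates()`, it can help to count duplicates explicitly, both across all columns and on a key column. The following is a minimal sketch on a toy frame with made-up values; treating `id` as a unique restaurant key is an assumption, not something the notebook confirms.

```python
import pandas as pd

# Toy frame standing in for the Swiggy data (values are illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["A", "B", "B", "C"],
    "city": ["X", "Y", "Z", "X"],
})

# Rows that repeat an earlier row in every column
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier row's id (assumed key column)
id_dupes = int(toy.duplicated(subset=["id"]).sum())

# Keep the first occurrence of each id
deduped = toy.drop_duplicates(subset=["id"], keep="first")
```

A full-row check can report zero duplicates while a key-based check still finds repeated restaurants, so the two counts answer different questions.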


## Handle Rating Column

In [7]:
# Replace invalid rating values with NaN
df['rating'] = df['rating'].replace('--', np.nan)

# Convert rating to numeric
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

df['rating'].head()


0    NaN
1    4.4
2    3.8
3    3.7
4    NaN
Name: rating, dtype: float64

### Rating Column Cleaning

The rating column contained the placeholder '--' for restaurants without a rating.
These values were converted to NaN and the column was cast to a numeric type
so that it can be used in calculations during recommendation.
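
As a small illustration of the conversion, the sketch below runs `pd.to_numeric` with `errors="coerce"` on a made-up sample. It also adds a range check; the 0 to 5 bounds are an assumption about the rating scale and are not part of the original notebook.

```python
import pandas as pd

# Made-up sample: a placeholder, two valid ratings, junk text, an out-of-range value
sample = pd.Series(["--", "4.4", "3.8", "n/a", "7.2"])

# errors="coerce" turns anything non-numeric (including '--') into NaN
rating = pd.to_numeric(sample, errors="coerce")

# Assumed valid range for a rating; values outside it become NaN
rating = rating.where(rating.between(0, 5))
```

Since `errors="coerce"` already maps '--' to NaN, the explicit `replace` step mainly documents intent.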


## Handling Rating Count Column

In [8]:
# Remove text like 'ratings' and '+' from rating_count
df['rating_count'] = (
    df['rating_count']
    .astype(str)
    .str.replace('ratings', '', regex=False)
    .str.replace('rating', '', regex=False)
    .str.replace('+', '', regex=False)
    .str.strip()
)

# Convert to numeric
df['rating_count'] = pd.to_numeric(df['rating_count'], errors='coerce')

df['rating_count'].head()


0      NaN
1     50.0
2    100.0
3     20.0
4      NaN
Name: rating_count, dtype: float64

### Rating Count Cleaning

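The rating_count column was stored as text (e.g., '50+ ratings').
The textual components were removed and the column was converted to numeric
format. Labels without a number, such as 'Too Few Ratings', become NaN.
Because of the '+', each value is a lower bound rather than an exact count.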
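
The missing-value counts later in this notebook show 2,852 more NaNs in rating_count than in rating, which may indicate labels that a plain numeric conversion cannot parse. One possible cause, assumed here and not visible in the preview, is abbreviated counts such as '1K+ ratings'. The helper below, `parse_rating_count`, is a hypothetical sketch of how such a suffix could be handled.

```python
import pandas as pd

def parse_rating_count(series: pd.Series) -> pd.Series:
    """Parse strings such as '50+ ratings' or '1K+ ratings' into numbers."""
    # Capture the leading number and an optional K suffix (assumed format)
    parts = series.astype(str).str.extract(r"^\s*(\d+(?:\.\d+)?)\s*([kK])?")
    number = pd.to_numeric(parts[0], errors="coerce")
    # Multiply by 1000 where the K suffix matched, by 1 otherwise
    multiplier = parts[1].notna().map({True: 1000, False: 1})
    return number * multiplier

# Made-up sample values
sample = pd.Series(["50+ ratings", "1K+ ratings", "Too Few Ratings", "100+ ratings"])
parsed = parse_rating_count(sample)
```

Running `df['rating_count'].unique()` on the raw column before cleaning would show whether such labels actually occur.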


## Clean Cost Column

In [9]:
# Remove currency symbols and encoding issues
df['cost'] = (
    df['cost']
    .astype(str)
    .str.replace('₹', '', regex=False)
    .str.replace('â‚¹', '', regex=False)
    .str.strip()
)

# Convert cost to numeric
df['cost'] = pd.to_numeric(df['cost'], errors='coerce')

df['cost'].head()


0    200.0
1    200.0
2    100.0
3    250.0
4    250.0
Name: cost, dtype: float64

### Cost Column Cleaning

The cost column contained the rupee symbol (₹), which can also appear mis-encoded as 'â‚¹'.
Both forms were removed and the column was converted to numeric format
to support comparison and similarity calculations.
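
An alternative to listing each symbol is to strip every character that is not a digit or a decimal point. The sketch below shows this on made-up values; the thousands separator in '₹ 1,200' is an assumed format that does not appear in the preview.

```python
import pandas as pd

# Made-up sample: clean symbol, mis-encoded symbol, thousands separator, missing
cost_raw = pd.Series(["₹ 200", "â‚¹ 250", "₹ 1,200", None])

# Drop everything except digits and '.', then convert; failures become NaN
cost = pd.to_numeric(
    cost_raw.str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
```

This variant does not depend on how the currency symbol was encoded, at the cost of silently accepting any other stray text around the number.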


## Missing Values After Cleaning

In [10]:
df.isnull().sum()


id                  0
name               86
city                0
rating          87100
rating_count    89952
cost              131
cuisine            99
lic_no            229
link                0
address            86
menu                0
dtype: int64

In [11]:
# Drop rows where essential values are missing
df = df.dropna(subset=['city', 'cuisine', 'cost'])

# Fill missing ratings with median rating
df['rating'] = df['rating'].fillna(df['rating'].median())

# Fill missing rating counts with 0
df['rating_count'] = df['rating_count'].fillna(0)


In [12]:
df.isnull().sum()


id                0
name              0
city              0
rating            0
rating_count      0
cost              0
cuisine           0
lic_no          143
link              0
address           0
menu              0
dtype: int64

### Missing Value Treatment

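Rows missing essential features such as city, cuisine, or cost were removed,
reducing the dataset from 148,541 to 148,398 rows.
Missing ratings were filled with the median rating, and missing rating counts
were set to zero, so restaurants without ratings are treated as having none.
Note that 87,100 of the 148,541 ratings were missing, so a majority of the
rating values are imputed. The lic_no column still has 143 missing values and
was left as-is.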
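
Since many ratings are imputed, one option is to record which rows were originally missing before filling them. The sketch below does this on a toy frame; the `rating_missing` column is a hypothetical addition and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the first five rows of the preview
toy = pd.DataFrame({
    "rating": [np.nan, 4.4, 3.8, 3.7, np.nan],
    "rating_count": [np.nan, 50.0, 100.0, 20.0, np.nan],
})

# Record which ratings were originally missing before imputing
toy["rating_missing"] = toy["rating"].isna()

# The median is computed over observed ratings only (NaN is skipped)
median_rating = toy["rating"].median()
toy["rating"] = toy["rating"].fillna(median_rating)

# Restaurants without a parsed count are treated as having zero ratings
toy["rating_count"] = toy["rating_count"].fillna(0)
```

Keeping the flag lets later stages distinguish a restaurant that genuinely scored the median from one that was never rated.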


In [13]:
df.head()


Unnamed: 0,id,name,city,rating,rating_count,cost,cuisine,lic_no,link,address,menu
0,567335,AB FOODS POINT,Abohar,4.0,0.0,200.0,"Beverages,Pizzas",22122652000138,https://www.swiggy.com/restaurants/ab-foods-po...,"AB FOODS POINT, NEAR RISHI NARANG DENTAL CLINI...",Menu/567335.json
1,531342,Janta Sweet House,Abohar,4.4,50.0,200.0,"Sweets,Bakery",12117201000112,https://www.swiggy.com/restaurants/janta-sweet...,"Janta Sweet House, Bazar No.9, Circullar Road,...",Menu/531342.json
2,158203,theka coffee desi,Abohar,3.8,100.0,100.0,Beverages,22121652000190,https://www.swiggy.com/restaurants/theka-coffe...,"theka coffee desi, sahtiya sadan road city",Menu/158203.json
3,187912,Singh Hut,Abohar,3.7,20.0,250.0,"Fast Food,Indian",22119652000167,https://www.swiggy.com/restaurants/singh-hut-n...,"Singh Hut, CIRCULAR ROAD NEAR NEHRU PARK ABOHAR",Menu/187912.json
4,543530,GRILL MASTERS,Abohar,4.0,0.0,250.0,"Italian-American,Fast Food",12122201000053,https://www.swiggy.com/restaurants/grill-maste...,"GRILL MASTERS, ADA Heights, Abohar - Hanumanga...",Menu/543530.json


In [17]:
df.shape


(148398, 11)

In [16]:
# Save cleaned dataset
df.to_csv("../data/processed/cleaned_data.csv", index=False)

## Data Cleaning Summary

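- Checked for duplicate rows with drop_duplicates() (none were found)
- Cleaned rating, rating_count, and cost and converted them to numeric types
- Dropped rows missing city, cuisine, or cost; filled missing ratings with the median and missing rating counts with 0
- Final dataset: 148,398 rows and 11 columns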

The cleaned dataset is now ready for feature engineering and encoding.
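
Before handing the file to the next stage, a few checks could confirm the properties listed in the summary. The function `check_cleaned` below is a hypothetical helper, sketched under the assumption that rating, rating_count, and cost must be numeric and free of missing values.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_cleaned(frame: pd.DataFrame) -> list[str]:
    """Return a list of problems found in the numeric columns."""
    problems = []
    for col in ("rating", "rating_count", "cost"):
        if not is_numeric_dtype(frame[col]):
            problems.append(f"{col} is not numeric")
        if frame[col].isna().any():
            problems.append(f"{col} has missing values")
    return problems

# Made-up frames: one that passes and one that fails two checks
good = pd.DataFrame({"rating": [4.0], "rating_count": [0.0], "cost": [200.0]})
bad = pd.DataFrame({"rating": ["--"], "rating_count": [np.nan], "cost": [200.0]})

good_problems = check_cleaned(good)
bad_problems = check_cleaned(bad)
```

An empty list means the frame passed every check; otherwise each entry names the column and the issue.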
