# 📚 Data Science Internship:

## Level 1 – Task 1: Data Exploration and Preprocessing

## 🎯 Objective:
In today's class, we will cover the **fundamentals of data exploration and preprocessing** as part of the **Cognifyz Technologies Data Science Internship**. This session focuses on understanding the dataset and preparing it for analysis.

---

## 📊 Dataset Introduction:
We will work with a **restaurant dataset** that contains key information, including:
- **Restaurant Information:** Name, Address, City, Country.
- **Customer Insights:** Ratings, Votes, Cuisines.
- **Service Availability:** Online Delivery, Table Booking.
- **Geolocation:** Latitude and Longitude for mapping.

---

## 📝 Today's Topics:
1. **Dataset Understanding**
   - Explore the structure (rows, columns).
   - Identify the meaning of each column.
2. **Data Cleaning & Preprocessing**
   - Handle missing values using appropriate techniques.
   - Convert data types for better analysis.
3. **Analyze the Target Variable**
   - Understand the distribution of **restaurant ratings**.
   - Detect any class imbalances in the dataset.
4. **Documenting the Process**
   - Write clean and clear explanations for every step.
   - Ensure reproducibility and clarity when documenting your tasks and preparing your final report.


### 📊 Step 1: Dataset Shape and Structure

1. **Objective:** Understand the dataset's structure by checking its size and column information.  

2. **Actions Taken:**
   - Identified the dataset contains **9551 rows** and **21 columns**.  
   - Explored column names, data types, and non-null values using `df.info()`.  

3. **Why is this Important?**
   - Helps to understand the **scope of the dataset**.  
   - Reveals potential **data issues** (e.g., incorrect types or missing values).  


In [1]:
# import libraries
import pandas as pd
import numpy as np

In [4]:
# load the dataset
df = pd.read_csv('Dataset .csv')
# view the first 5 rows
df.head()

|   | Restaurant ID | Restaurant Name | Country Code | City | Address | Locality | Locality Verbose | Longitude | Latitude | Cuisines | ... | Currency | Has Table booking | Has Online delivery | Is delivering now | Switch to order menu | Price range | Aggregate rating | Rating color | Rating text | Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6317637 | Le Petit Souffle | 162 | Makati City | Third Floor, Century City Mall, Kalayaan Avenu... | Century City Mall, Poblacion, Makati City | Century City Mall, Poblacion, Makati City, Mak... | 121.027535 | 14.565443 | French, Japanese, Desserts | ... | Botswana Pula(P) | Yes | No | No | No | 3 | 4.8 | Dark Green | Excellent | 314 |
| 1 | 6304287 | Izakaya Kikufuji | 162 | Makati City | Little Tokyo, 2277 Chino Roces Avenue, Legaspi... | Little Tokyo, Legaspi Village, Makati City | Little Tokyo, Legaspi Village, Makati City, Ma... | 121.014101 | 14.553708 | Japanese | ... | Botswana Pula(P) | Yes | No | No | No | 3 | 4.5 | Dark Green | Excellent | 591 |
| 2 | 6300002 | Heat - Edsa Shangri-La | 162 | Mandaluyong City | Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal... | Edsa Shangri-La, Ortigas, Mandaluyong City | Edsa Shangri-La, Ortigas, Mandaluyong City, Ma... | 121.056831 | 14.581404 | Seafood, Asian, Filipino, Indian | ... | Botswana Pula(P) | Yes | No | No | No | 4 | 4.4 | Green | Very Good | 270 |
| 3 | 6318506 | Ooma | 162 | Mandaluyong City | Third Floor, Mega Fashion Hall, SM Megamall, O... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.056475 | 14.585318 | Japanese, Sushi | ... | Botswana Pula(P) | No | No | No | No | 4 | 4.9 | Dark Green | Excellent | 365 |
| 4 | 6314302 | Sambo Kojin | 162 | Mandaluyong City | Third Floor, Mega Atrium, SM Megamall, Ortigas... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.057508 | 14.58445 | Japanese, Korean | ... | Botswana Pula(P) | Yes | No | No | No | 4 | 4.8 | Dark Green | Excellent | 229 |


In [6]:
# see the dataset shape
df.shape

(9551, 21)

In [8]:
# information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Restaurant ID         9551 non-null   int64  
 1   Restaurant Name       9551 non-null   object 
 2   Country Code          9551 non-null   int64  
 3   City                  9551 non-null   object 
 4   Address               9551 non-null   object 
 5   Locality              9551 non-null   object 
 6   Locality Verbose      9551 non-null   object 
 7   Longitude             9551 non-null   float64
 8   Latitude              9551 non-null   float64
 9   Cuisines              9542 non-null   object 
 10  Average Cost for two  9551 non-null   int64  
 11  Currency              9551 non-null   object 
 12  Has Table booking     9551 non-null   object 
 13  Has Online delivery   9551 non-null   object 
 14  Is delivering now     9551 non-null   object 
 15  Switch to order menu 

In [10]:
# count the missing values in each column
df.isnull().sum()

Restaurant ID           0
Restaurant Name         0
Country Code            0
City                    0
Address                 0
Locality                0
Locality Verbose        0
Longitude               0
Latitude                0
Cuisines                9
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Switch to order menu    0
Price range             0
Aggregate rating        0
Rating color            0
Rating text             0
Votes                   0
dtype: int64
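
The per-column counts above are easier to judge as a share of the rows; for example, 9 missing `Cuisines` values out of 9551 rows is roughly 0.09%. A minimal sketch of such a summary follows. The helper name `missing_summary` and the toy frame are hypothetical and not part of the notebook; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C", "D"],
    "Cuisines": ["Japanese", np.nan, "French", np.nan],
    "Votes": [10, 20, 30, 40],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for each column."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that have at least one missing value
    return summary[summary["missing"] > 0]


summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(df)`.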

### 📌 Step 2: Handling Missing Values

1. **Objective:** Identify and manage any missing or incorrectly recorded data.  

2. **Actions Taken:**
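   - Found that **Longitude** and **Latitude** use `0` as a placeholder for missing coordinates (498 rows each, 497 rows with both set to `0`).  
   - Replaced these **0 values** with the **mean coordinates** of the restaurant's city.  
   - Left the 9 missing **Cuisines** values unchanged at this stage.  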

3. **Why is this Important?**
   - Ensures the dataset is **complete** for accurate analysis.  
   - Missing values in critical columns like location can cause **biased insights**.  


In [12]:
# Count rows where Longitude is 0
longitude_zero_count = df[df['Longitude'] == 0].shape[0]
print("Rows with Longitude = 0:", longitude_zero_count)

# Count rows where Latitude is 0
latitude_zero_count = df[df['Latitude'] == 0].shape[0]
print("Rows with Latitude = 0:", latitude_zero_count)

# Count rows where BOTH Longitude and Latitude are 0
both_zero_count = df[(df['Longitude'] == 0) & (df['Latitude'] == 0)].shape[0]
print("Rows with both Longitude and Latitude = 0:", both_zero_count)


Rows with Longitude = 0: 498
Rows with Latitude = 0: 498
Rows with both Longitude and Latitude = 0: 497


In [14]:
# Step 1: Replace zero values with NaN (to treat them as missing)
# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated and will not modify df in pandas 3.0
df['Longitude'] = df['Longitude'].replace(0, np.nan)
df['Latitude'] = df['Latitude'].replace(0, np.nan)

# Step 2: Fill missing values using the mean longitude and latitude for each city
df['Longitude'] = df.groupby('City')['Longitude'].transform(lambda x: x.fillna(x.mean()))
df['Latitude'] = df.groupby('City')['Latitude'].transform(lambda x: x.fillna(x.mean()))

# Step 3: Check if there are any missing values left
print(df[['Longitude', 'Latitude']].isnull().sum())


Longitude    0
Latitude     0
dtype: int64
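
The city-mean fill only works when a city has at least one valid coordinate. In this notebook no missing values remained, so that held for every city here. The sketch below shows the edge case on a toy frame so it can be checked explicitly on other data. The city names and coordinates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: city names and coordinates are made up
toy = pd.DataFrame({
    "City": ["X", "X", "X", "Y", "Y"],
    "Longitude": [10.0, 0.0, 12.0, 0.0, 0.0],
    "Latitude": [1.0, 0.0, 3.0, 0.0, 0.0],
})

coord_cols = ["Longitude", "Latitude"]

# Treat 0 as a placeholder for a missing coordinate
toy[coord_cols] = toy[coord_cols].replace(0, np.nan)

# Fill each gap with the mean of the valid coordinates in the same city
city_means = toy.groupby("City")[coord_cols].transform("mean")
toy[coord_cols] = toy[coord_cols].fillna(city_means)

# City "Y" has no valid coordinates at all, so its rows are still NaN
still_missing = toy[coord_cols].isnull().any(axis=1)
print(toy)
print("Rows still missing:", int(still_missing.sum()))
```

Rows that remain missing after this step need a different strategy, such as a country-level mean or exclusion from map-based analysis. Which one is appropriate depends on the analysis and is not decided here.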


### 📌 Step 3: Data Type Conversion

1. **Objective:** Ensure all columns are in the correct format for accurate analysis.  
2. **Actions Taken:**
   - Converted **Yes/No** columns (`Has Table booking`, `Has Online delivery`, `Is delivering now`) to **binary (0/1)**.  
   - Cast **`Average Cost for two`** to an integer dtype with `astype(int)`; the column was already `int64` and is reported as `int32` afterwards.  
3. **Why is this Important?**
   - Prevents errors during calculations.  
   - Ensures consistency for statistical analysis and future modeling.  


In [16]:
# Data type conversion
# Map 'Yes' to 1 and 'No' to 0 in the binary service columns
# (map avoids the downcasting FutureWarning raised by replace on object columns)
binary_cols = ['Has Table booking', 'Has Online delivery', 'Is delivering now']
df[binary_cols] = df[binary_cols].apply(lambda col: col.map({'Yes': 1, 'No': 0}))

# Cast 'Average Cost for two' to an integer dtype
df['Average Cost for two'] = df['Average Cost for two'].astype(int)
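
A Yes/No conversion silently produces missing values if a column contains anything other than the two expected strings. The following sketch adds a guard for that case. The helper `yes_no_to_binary` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "Has Table booking": ["Yes", "No", "Yes"],
    "Has Online delivery": ["No", "No", "Yes"],
})


def yes_no_to_binary(column: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' to 1/0 and fail loudly on any other value."""
    mapped = column.map({"Yes": 1, "No": 0})
    if mapped.isnull().any():
        unexpected = column[mapped.isnull()].unique().tolist()
        raise ValueError(f"Unexpected values in {column.name!r}: {unexpected}")
    return mapped.astype("int64")


# DataFrame.apply calls the helper once per column
converted = toy.apply(yes_no_to_binary)
print(converted.dtypes)
```

Raising an error here is a design choice. Keeping the unexpected values as `NaN` and reporting them would also be reasonable for exploratory work.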


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Restaurant ID         9551 non-null   int64  
 1   Restaurant Name       9551 non-null   object 
 2   Country Code          9551 non-null   int64  
 3   City                  9551 non-null   object 
 4   Address               9551 non-null   object 
 5   Locality              9551 non-null   object 
 6   Locality Verbose      9551 non-null   object 
 7   Longitude             9551 non-null   float64
 8   Latitude              9551 non-null   float64
 9   Cuisines              9542 non-null   object 
 10  Average Cost for two  9551 non-null   int32  
 11  Currency              9551 non-null   object 
 12  Has Table booking     9551 non-null   int64  
 13  Has Online delivery   9551 non-null   int64  
 14  Is delivering now     9551 non-null   int64  
 15  Switch to order menu 

### 📊 Step 4: Analyze the Target Variable ("Aggregate Rating")

1. **Objective:**  
   - Understand the distribution of **restaurant ratings**.  
   - Identify **class imbalances** in the "Aggregate rating" column.  

2. **Actions Taken:**  
   - Analyzed how often each **rating** appears.  
   - Identified a **class imbalance**:  
     - **2148** restaurants are **Not Rated (0.0)**.  
     - Very **high ratings (4.5+)** are **rare** (only 301 entries).  
   - Decided to **keep all data**, including **Not Rated** entries.  

3. **Why is This Important?**  
   - Helps understand customer **satisfaction** and **rating trends**.  
   - Retaining all data allows for **comprehensive** analysis.  


In [20]:
# Count how many restaurants have each rating
rating_distribution = df['Aggregate rating'].value_counts().sort_index()

# Display the distribution
print(rating_distribution)


Aggregate rating
0.0    2148
1.8       1
1.9       2
2.0       7
2.1      15
2.2      27
2.3      47
2.4      87
2.5     110
2.6     191
2.7     250
2.8     315
2.9     381
3.0     468
3.1     519
3.2     522
3.3     483
3.4     498
3.5     480
3.6     458
3.7     427
3.8     400
3.9     335
4.0     266
4.1     274
4.2     221
4.3     174
4.4     144
4.5      95
4.6      78
4.7      42
4.8      25
4.9      61
Name: count, dtype: int64
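
The imbalance noted in Step 4 can be expressed as shares of the whole dataset. The sketch below rebuilds the distribution from the counts printed above, so it runs without the CSV file. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above
rating_counts = pd.Series({
    0.0: 2148, 1.8: 1, 1.9: 2, 2.0: 7, 2.1: 15, 2.2: 27, 2.3: 47, 2.4: 87,
    2.5: 110, 2.6: 191, 2.7: 250, 2.8: 315, 2.9: 381, 3.0: 468, 3.1: 519,
    3.2: 522, 3.3: 483, 3.4: 498, 3.5: 480, 3.6: 458, 3.7: 427, 3.8: 400,
    3.9: 335, 4.0: 266, 4.1: 274, 4.2: 221, 4.3: 174, 4.4: 144, 4.5: 95,
    4.6: 78, 4.7: 42, 4.8: 25, 4.9: 61,
})

total = int(rating_counts.sum())
not_rated = int(rating_counts.loc[0.0])
high = int(rating_counts[rating_counts.index >= 4.5].sum())

print(f"Total restaurants: {total}")
print(f"Not rated: {not_rated} ({not_rated / total:.1%})")
print(f"Rated 4.5 or higher: {high} ({high / total:.1%})")
```

The counts sum to 9551, which matches the number of rows. Not Rated entries make up about 22.5% of the data, and ratings of 4.5 or higher make up about 3.2%.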


In [22]:
# Analyze the relationship between rating text and aggregate rating
rating_summary = df.groupby(['Aggregate rating', 'Rating color', 'Rating text']).size().reset_index(name='Count')

# Display the summary
print(rating_summary)


    Aggregate rating Rating color Rating text  Count
0                0.0        White   Not rated   2148
1                1.8          Red        Poor      1
2                1.9          Red        Poor      2
3                2.0          Red        Poor      7
4                2.1          Red        Poor     15
5                2.2          Red        Poor     27
6                2.3          Red        Poor     47
7                2.4          Red        Poor     87
8                2.5       Orange     Average    110
9                2.6       Orange     Average    191
10               2.7       Orange     Average    250
11               2.8       Orange     Average    315
12               2.9       Orange     Average    381
13               3.0       Orange     Average    468
14               3.1       Orange     Average    519
15               3.2       Orange     Average    522
16               3.3       Orange     Average    483
17               3.4       Orange     Average 
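
The summary shows each rating text covering a contiguous band of ratings. A compact way to view those bands is to aggregate the minimum and maximum rating per label. This sketch uses only the rows with a visible count in the truncated output above, so the `Average` band is incomplete. The variable names are illustrative.

```python
import pandas as pd

# Only the rows with a visible count in the truncated summary above
visible = pd.DataFrame({
    "Aggregate rating": [0.0, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4,
                         2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3],
    "Rating text": ["Not rated"] + ["Poor"] * 7 + ["Average"] * 9,
    "Count": [2148, 1, 2, 7, 15, 27, 47, 87,
              110, 191, 250, 315, 381, 468, 519, 522, 483],
})

# Rating range and number of restaurants for each rating text
bands = visible.groupby("Rating text", sort=False).agg(
    min_rating=("Aggregate rating", "min"),
    max_rating=("Aggregate rating", "max"),
    restaurants=("Count", "sum"),
)
print(bands)
```

On the full data the equivalent would be `df.groupby('Rating text')['Aggregate rating'].agg(['min', 'max', 'count'])`.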

In [24]:
# Confirm rating distribution (including Not Rated)
print(df['Aggregate rating'].value_counts())

Aggregate rating
0.0    2148
3.2     522
3.1     519
3.4     498
3.3     483
3.5     480
3.0     468
3.6     458
3.7     427
3.8     400
2.9     381
3.9     335
2.8     315
4.1     274
4.0     266
2.7     250
4.2     221
2.6     191
4.3     174
4.4     144
2.5     110
4.5      95
2.4      87
4.6      78
4.9      61
2.3      47
4.7      42
2.2      27
4.8      25
2.1      15
2.0       7
1.9       2
1.8       1
Name: count, dtype: int64


In [26]:
# Save to CSV
df.to_csv('task1_dataset.csv', index=False)
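
For reproducibility, it can help to confirm that the saved file reloads with the same shape, columns, and missing-value counts. The sketch below does this round trip on a hypothetical toy frame. It uses an in-memory buffer rather than `task1_dataset.csv` so that it runs on its own.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset
toy = pd.DataFrame({
    "Restaurant ID": [1, 2, 3],
    "Cuisines": ["Japanese", np.nan, "French"],
    "Has Table booking": [1, 0, 1],
    "Aggregate rating": [4.5, 0.0, 3.25],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Basic round-trip checks
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_missing = reloaded.isnull().sum().tolist() == toy.isnull().sum().tolist()
print(same_shape, same_columns, same_missing)
```

With the real file, the same checks would compare `pd.read_csv('task1_dataset.csv')` against `df`.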