# Lesson 3B: Data Cleaning

# 🎯Learning Objectives:

1. Identify and handle missing values  
2. Identify and remove duplicates  
3. Export the cleaned dataset

# ▶️Getting Started
We will get started by loading pandas and the Restaurant_Transactions_Dataset.csv file.

In [9]:

import pandas as pd

df = pd.read_csv("Restaurant_Transactions_Dataset.csv")


In [10]:
df.info()
print("\nMissing values per column:")
print(df.isnull().sum())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1002 entries, 0 to 1001
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Customer_ID     1002 non-null   int64  
 1   Food_Item       989 non-null    object 
 2   Category        1002 non-null   object 
 3   Date_of_Visit   1002 non-null   object 
 4   Time            1002 non-null   object 
 5   Weather         1002 non-null   object 
 6   Price           996 non-null    float64
 7   Weekend         1002 non-null   object 
 8   Public_Holiday  1002 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 70.6+ KB


# 1️⃣ Handling Missing Values  

Missing values can distort summary statistics such as counts and averages, which reduces the accuracy of our analysis.

### ✅ What We'll Do:
✔️ Identify missing values in our dataset  
✔️ Display the rows with missing values as a DataFrame  
✔️ Choose an appropriate method to handle them  (remove or impute)



### (i) Identify missing values
We will start by identifying which rows of the dataset contain missing values.

In [12]:
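# isnull() marks each missing cell as True
# any(axis=1) checks across the columns of each row, so a row is selected if any of its cells is missing
# (axis=0, the default, would check down each column instead)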
df[df.isnull().any(axis=1)]








| | Customer_ID | Food_Item | Category | Date_of_Visit | Time | Weather | Price | Weekend | Public_Holiday |
|---|---|---|---|---|---|---|---|---|---|
| 27 | 1016 | NaN | Cold | 03/03/2023 | 22:00 | Sunny | 18.68 | No | No |
| 41 | 1008 | NaN | Hot | 05/03/2023 | 20:30 | Raining | 19.602 | Yes | No |
| 53 | 1073 | NaN | Cold | 27/03/2023 | 15:00 | Sunny | 12.88 | No | No |
| 69 | 1008 | NaN | Hot | 30/01/2023 | 14:30 | Raining | 12.573 | No | No |
| 117 | 1085 | NaN | Hot | 08/02/2023 | 10:00 | Raining | 9.63 | No | No |
| 125 | 1003 | NaN | Cold | 27/02/2023 | 10:00 | Sunny | 13.15 | No | No |
| 175 | 1008 | NaN | Cold | 25/03/2023 | 18:30 | Sunny | 22.56 | Yes | No |
| 238 | 1038 | NaN | Hot | 10/01/2023 | 20:30 | Raining | 5.868 | No | No |
| 334 | 1023 | NaN | Cold | 01/04/2023 | 15:00 | Sunny | 23.604 | Yes | No |
| 402 | 1075 | NaN | Hot | 25/02/2023 | 18:30 | Raining | 10.7676 | Yes | Yes |
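To make the role of the `axis` argument concrete, here is a minimal sketch on a toy DataFrame. The `toy` data and the variable names are made up for illustration and are not part of the restaurant dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not taken from the restaurant dataset)
toy = pd.DataFrame({
    "Food_Item": ["Soup", np.nan, "Tea"],
    "Price": [4.5, 3.0, np.nan],
})

# Count the missing cells in each column
missing_per_column = toy.isnull().sum()

# axis=1 looks across the columns of each row: True if the row has any NaN
rows_with_missing = toy[toy.isnull().any(axis=1)]

# axis=0 (the default) looks down each column: True if the column has any NaN
columns_with_missing = toy.isnull().any(axis=0)

print(missing_per_column)
print(rows_with_missing)
print(columns_with_missing)
```

In this toy example, rows 1 and 2 are selected because each has one missing cell, and both columns are flagged as containing at least one NaN.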


### (ii) Handling Missing Values - Removal
One way to handle missing values is to remove the rows that contain them. We usually do this for categorical data such as Food_Item, where there is no meaningful average to fill in.

In [13]:
# dropna with subset drops only the rows that have NaN in the listed columns
df_cleaned1 = df.dropna(subset=['Food_Item'])

print("Missing values after cleaning: ")
print(df_cleaned1.isnull().sum())

Missing values after cleaning: 
Customer_ID       0
Food_Item         0
Category          0
Date_of_Visit     0
Time              0
Weather           0
Price             6
Weekend           0
Public_Holiday    0
dtype: int64
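The output above still shows missing Price values because `subset` limits the check to Food_Item. As a rough illustration of that difference, the sketch below compares `dropna(subset=...)` with a plain `dropna()` on a small made-up DataFrame (the `toy` data is hypothetical).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one row is missing Food_Item, another is missing Price
toy = pd.DataFrame({
    "Food_Item": ["Soup", np.nan, "Tea", "Cake"],
    "Price": [4.5, 3.0, np.nan, 6.0],
})

# Drop rows only when Food_Item is missing; the row with a missing Price is kept
only_food_item = toy.dropna(subset=["Food_Item"])

# Drop rows when any column is missing
any_column = toy.dropna()

print(len(toy), len(only_food_item), len(any_column))
```

Keeping the rows with a missing Price is deliberate here, since those gaps can be filled in rather than discarded.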


### (iii) Handling Missing Values - Imputation
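Another way to handle missing values is imputation: filling each gap with a substitute value, usually the column's mean or median. This is typically done for numerical data. Which value you impute depends on which one represents the data more accurately: the mean works well when the values are roughly symmetric, while the median is more robust when outliers would distort the average.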

In [14]:
# make a copy
df_cleaned2 = df_cleaned1.copy()

# replace NaN values with the column mean using fillna
# when outliers distort the average, use the median instead of the mean
df_cleaned2['Price'] = df_cleaned2['Price'].fillna(df_cleaned2['Price'].mean())

print("Missing values after imputation: ")
print(df_cleaned2['Price'].isnull().sum())


Missing values after imputation: 
0
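To see why the choice between mean and median matters, consider the following sketch. The `prices` values are invented for illustration, with one deliberately extreme value acting as an outlier.

```python
import numpy as np
import pandas as pd

# Toy prices (hypothetical): 200.0 is an outlier, and one value is missing
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 200.0, np.nan])

# mean() and median() skip NaN by default
mean_price = prices.mean()      # pulled upward by the outlier
median_price = prices.median()  # stays close to the typical values

filled_with_mean = prices.fillna(mean_price)
filled_with_median = prices.fillna(median_price)

print(f"Mean: {mean_price}, Median: {median_price}")
```

Here the mean is 49.2 while the median is 12.0, so imputing the median gives a fill value much closer to a typical price in this toy series.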


# 2️⃣: Identifying & Removing Duplicates

Duplicate records count the same transaction more than once, which can skew totals and averages in our analysis.

### ✅ What We'll Do:
✔️ **Check if there are duplicate rows**

✔️ **Decide whether to remove them**

Let's clean our data!


In [22]:
df_cleaned3 = df_cleaned2.copy()

# count the duplicated rows before removing them
duplicates = df_cleaned3.duplicated().sum()

# drop exact duplicate rows, keeping the first occurrence of each
df_cleaned3 = df_cleaned3.drop_duplicates()
print(f"Number of duplicated rows remaining after removal: {df_cleaned3.duplicated().sum()}")

# show every copy of the duplicated rows, taken from the DataFrame before removal
if duplicates > 0:
    print("Duplicate rows found:")
    display(df_cleaned2[df_cleaned2.duplicated(keep=False)])

Number of duplicated rows remaining after removal: 0
Duplicate rows found:


| | Customer_ID | Food_Item | Category | Date_of_Visit | Time | Weather | Price | Weekend | Public_Holiday |
|---|---|---|---|---|---|---|---|---|---|
| 85 | 1023 | Ice Cream | Cold | 16/03/2023 | 12:00 | Sunny | 10.14 | No | Yes |
| 86 | 1023 | Ice Cream | Cold | 16/03/2023 | 12:00 | Sunny | 10.14 | No | Yes |
| 131 | 1075 | Smoothie | Cold | 03/03/2023 | 13:00 | Sunny | 6.74 | No | No |
| 132 | 1075 | Smoothie | Cold | 03/03/2023 | 13:00 | Sunny | 6.74 | No | No |
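The table above lists both copies of each duplicated row because of `keep=False`. The sketch below, on a small made-up DataFrame, shows how the default `duplicated()` differs from `duplicated(keep=False)`, and what `drop_duplicates()` leaves behind.

```python
import pandas as pd

# Toy data (hypothetical): rows 0/1 and rows 3/4 are exact duplicates
toy = pd.DataFrame({
    "Customer_ID": [1, 1, 2, 3, 3],
    "Food_Item": ["Tea", "Tea", "Cake", "Soup", "Soup"],
})

# Default keep="first": only the later copies are flagged
extra_copies = toy.duplicated().sum()

# keep=False: every row that has a duplicate is flagged, including the first copy
all_copies = toy.duplicated(keep=False).sum()

# drop_duplicates keeps the first occurrence of each duplicated row
deduplicated = toy.drop_duplicates()

print(extra_copies, all_copies, len(deduplicated))
```

In this toy example two rows are flagged by default, four with `keep=False`, and three rows remain after removal.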


# 3️⃣: Exporting the Cleaned Data

Now that we have cleaned our dataset, it's important to save it for future use.  

### 🔹 Why Export the Data?
- To **preserve** the cleaned version of the dataset  
- To **avoid redoing** preprocessing steps  
- To **use it in another project** (e.g., visualization, modeling)  

We will export the dataset as a **CSV file**. Let's do it! 🚀


In [32]:
output_file = "Cleaned_Restaurant_Transactions_Dataset.csv"
# export df_cleaned3, the version with duplicates removed
df_cleaned3.to_csv(output_file, index=False)
print(f"Cleaned dataset has been successfully saved to {output_file}")

Cleaned dataset has been successfully saved to Cleaned_Restaurant_Transactions_Dataset.csv
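A quick way to confirm an export is to read the file back and compare it with the DataFrame in memory. The sketch below does this on a made-up DataFrame written to a temporary folder (the `toy` data and file names are hypothetical), and also shows what `index=False` prevents.

```python
import os
import tempfile

import pandas as pd

# Toy data (hypothetical) standing in for the cleaned dataset
toy = pd.DataFrame({"Food_Item": ["Tea", "Cake"], "Price": [3.5, 6.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # index=False keeps the row index out of the file
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # With the default index=True, the index is written as an unnamed first column
    path_with_index = os.path.join(tmp_dir, "toy_with_index.csv")
    toy.to_csv(path_with_index)
    reloaded_with_index = pd.read_csv(path_with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

The file written with `index=False` reloads with the original two columns, while the other one gains an extra `Unnamed: 0` column holding the old row index.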


# Summary & Next Steps

Great job! You’ve successfully cleaned and prepared your dataset!

### ✅ What We Did:
✔️ Handled missing values  
✔️ Removed duplicate records  
✔️ Exported the cleaned dataset for future use  


