## Final Data Cleaning Summary

This notebook implements a data cleaning pipeline for a retail sales dataset. The `quantity` and `unit_price` columns were coerced to numeric types, with unparseable entries (such as "three") treated as missing and filled by median imputation. Missing `region` values were explicitly labeled as "Unknown", and exact duplicate records were removed.

The cleaned dataset is written to `retail_sales_cleaned.csv` and serves as a reusable starting point for exploratory data analysis (EDA) and further modeling.
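
The steps described above can be collected into a single function so they always run in the same order. The following is a minimal sketch, not the notebook's own code: `clean_retail_sales` and `RAW_CSV` are hypothetical names, and the inline CSV is a copy of the rows shown in this notebook so that the example runs without the data file.

```python
import io

import pandas as pd

# Inline copy of the rows shown in this notebook (assumption: stands in for the CSV file).
RAW_CSV = """order_id,order_date,region,product,category,quantity,unit_price
1001,2023-01-05,North,Laptop,Electronics,2,750.0
1002,2023-01-07,South,Mobile,Electronics,,300.0
1003,2023-01-10,East,Chair,Furniture,10,45.0
1003,2023-01-10,East,Chair,Furniture,10,45.0
1004,2023-01-15,West,Table,Furniture,three,120.0
1005,2023-01-20,,Headphones,Electronics,8,60.0
1006,2023-01-22,South,Sofa,Furniture,1,
"""


def clean_retail_sales(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the notebook's cleaning steps to a copy of `raw`."""
    df = raw.copy()
    for col in ("quantity", "unit_price"):
        # Non-numeric entries such as "three" become NaN, then receive the column median.
        df[col] = pd.to_numeric(df[col], errors="coerce")
        df[col] = df[col].fillna(df[col].median())
    # Missing regions are labeled explicitly.
    df["region"] = df["region"].fillna("Unknown").astype(str)
    # Exact duplicate rows are dropped after imputation, as in the notebook.
    return df.drop_duplicates()


cleaned = clean_retail_sales(pd.read_csv(io.StringIO(RAW_CSV)))
print(cleaned)
```

Working on a copy leaves the raw frame untouched, so the raw and cleaned versions can be compared side by side.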

In [18]:
import pandas as pd
df = pd.read_csv("../data/retail_sales_dirty.csv")
df

Unnamed: 0,order_id,order_date,region,product,category,quantity,unit_price
0,1001,2023-01-05,North,Laptop,Electronics,2,750.0
1,1002,2023-01-07,South,Mobile,Electronics,,300.0
2,1003,2023-01-10,East,Chair,Furniture,10,45.0
3,1003,2023-01-10,East,Chair,Furniture,10,45.0
4,1004,2023-01-15,West,Table,Furniture,three,120.0
5,1005,2023-01-20,,Headphones,Electronics,8,60.0
6,1006,2023-01-22,South,Sofa,Furniture,1,


In [20]:
# Coerce non-numeric entries (e.g. "three") to NaN, then fill gaps with the column median
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
df["quantity"] = df["quantity"].fillna(df["quantity"].median())

df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
df["unit_price"] = df["unit_price"].fillna(df["unit_price"].median())

# Label missing regions explicitly instead of leaving them as NaN
df["region"] = df["region"].fillna("Unknown").astype(str)
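
To make the imputation step concrete, here is a small worked sketch using the raw `quantity` values from the table above. The standalone `quantity` Series is an illustrative stand-in for the DataFrame column.

```python
import pandas as pd

# The raw `quantity` values from the table above; None marks the empty cell.
quantity = pd.Series(["2", None, "10", "10", "three", "8", "1"])

# errors="coerce" turns anything unparseable ("three") into NaN instead of raising.
numeric = pd.to_numeric(quantity, errors="coerce")

# median() skips NaN by default, so it is computed over 2, 10, 10, 8 and 1.
fill_value = numeric.median()

filled = numeric.fillna(fill_value)
print(fill_value)       # 8.0
print(filled.tolist())  # [2.0, 8.0, 10.0, 10.0, 8.0, 8.0, 1.0]
```

Both the empty cell and the text entry "three" end up as 8.0, which is why coerced values are indistinguishable from originally missing ones after this step.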

In [5]:
df.isna().sum()

order_id      0
order_date    0
region        0
product       0
category      0
quantity      0
unit_price    0
dtype: int64

In [6]:
df

Unnamed: 0,order_id,order_date,region,product,category,quantity,unit_price
0,1001,2023-01-05,North,Laptop,Electronics,2.0,750.0
1,1002,2023-01-07,South,Mobile,Electronics,8.0,300.0
2,1003,2023-01-10,East,Chair,Furniture,10.0,45.0
3,1003,2023-01-10,East,Chair,Furniture,10.0,45.0
4,1004,2023-01-15,West,Table,Furniture,8.0,120.0
5,1005,2023-01-20,Unknown,Headphones,Electronics,8.0,60.0
6,1006,2023-01-22,South,Sofa,Furniture,1.0,90.0


In [7]:
df.duplicated().sum()

np.int64(1)

In [8]:
df = df.drop_duplicates()

In [9]:
df

Unnamed: 0,order_id,order_date,region,product,category,quantity,unit_price
0,1001,2023-01-05,North,Laptop,Electronics,2.0,750.0
1,1002,2023-01-07,South,Mobile,Electronics,8.0,300.0
2,1003,2023-01-10,East,Chair,Furniture,10.0,45.0
4,1004,2023-01-15,West,Table,Furniture,8.0,120.0
5,1005,2023-01-20,Unknown,Headphones,Electronics,8.0,60.0
6,1006,2023-01-22,South,Sofa,Furniture,1.0,90.0
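
The output above keeps the original row labels, so label 3 is missing after the duplicate is dropped. The sketch below shows how `duplicated()` and `drop_duplicates()` behave on a few rows from this dataset. The `reset_index` call is an optional extra step that is not part of this notebook, and the `orders` frame is a made-up subset for illustration.

```python
import pandas as pd

# A few rows from the table above, including the repeated order 1003.
orders = pd.DataFrame(
    {
        "order_id": [1002, 1003, 1003, 1004],
        "product": ["Mobile", "Chair", "Chair", "Table"],
        "quantity": [8.0, 10.0, 10.0, 8.0],
    }
)

# duplicated() flags every repeat after the first occurrence (keep="first" is the default).
dup_mask = orders.duplicated()

# drop_duplicates() keeps the original index labels, leaving a gap where the repeat was.
deduped = orders.drop_duplicates()

# Optional extra step (not in the notebook): renumber the rows from 0.
renumbered = deduped.reset_index(drop=True)

print(dup_mask.tolist())          # [False, False, True, False]
print(deduped.index.tolist())     # [0, 1, 3]
print(renumbered.index.tolist())  # [0, 1, 2]
```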


In [10]:
# Check dtypes and non-null counts after cleaning
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 0 to 6
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   order_id    6 non-null      int64  
 1   order_date  6 non-null      object 
 2   region      6 non-null      object 
 3   product     6 non-null      object 
 4   category    6 non-null      object 
 5   quantity    6 non-null      float64
 6   unit_price  6 non-null      float64
dtypes: float64(2), int64(1), object(4)
memory usage: 384.0+ bytes
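
The `info()` output above shows that `order_date` is still stored as `object` and `quantity` as `float64`. If downstream analysis needs real dates or integer counts, a follow-up along these lines is one option. This is a sketch of a possible extra step, not something the notebook ran, and the small `df` here is a stand-in with the same dtypes.

```python
import pandas as pd

# A small stand-in for the cleaned frame, with the dtypes reported by df.info() above.
df = pd.DataFrame(
    {
        "order_date": ["2023-01-05", "2023-01-07", "2023-01-10"],
        "quantity": [2.0, 8.0, 10.0],
    }
)

# Parse ISO dates; an explicit format makes unexpected strings fail loudly.
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d")

# Safe here only because imputation left no NaN and every value is a whole number.
df["quantity"] = df["quantity"].astype("int64")

print(df.dtypes)
```

The integer cast deserves care: a median taken over an even number of values can be fractional (for example 8.5), in which case rounding or keeping `float64` would be a deliberate choice.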


In [11]:
import os
os.listdir("../data")

['retail_sales_dirty.csv',
 'retail_sales.csv',
 'retail_sales.db',
 'retail_sales_cleaned.csv',
 '.ipynb_checkpoints']

In [12]:
import os
os.getcwd()

'/Users/ansh/Documents/Data-analysis-bootcamp/01_data_analysis/notebooks'

In [14]:
# Reload the raw file under a separate name for comparison, so the cleaned `df` is not overwritten
raw_df = pd.read_csv("../data/retail_sales_dirty.csv")
raw_df.head()

Unnamed: 0,order_id,order_date,region,product,category,quantity,unit_price
0,1001,2023-01-05,North,Laptop,Electronics,2,750.0
1,1002,2023-01-07,South,Mobile,Electronics,,300.0
2,1003,2023-01-10,East,Chair,Furniture,10,45.0
3,1003,2023-01-10,East,Chair,Furniture,10,45.0
4,1004,2023-01-15,West,Table,Furniture,three,120.0


In [21]:
df.to_csv("../data/retail_sales_cleaned.csv", index=False)
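
A quick way to confirm what `index=False` does is to write a frame and read it back. The sketch below uses a temporary directory and a two-row stand-in frame rather than the real `../data` path, so both are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

cleaned = pd.DataFrame(
    {
        "order_id": [1001, 1005],
        "region": ["North", "Unknown"],
        "quantity": [2.0, 8.0],
    },
    index=[0, 5],  # non-contiguous labels, as left behind by drop_duplicates()
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "retail_sales_cleaned.csv")  # temporary stand-in path
    # index=False keeps the row labels out of the file, so no extra index column on reload.
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())  # ['order_id', 'region', 'quantity']
print(reloaded.index.tolist())    # [0, 1]
```

The "Unknown" label survives the round trip as ordinary text, because it is not one of the strings `read_csv` treats as missing by default.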