# Blinkit Sales Analytics – Data Cleaning

## Objective
Clean and standardize each raw dataset individually to improve data quality
before analysis and modeling.

This includes:
- Handling missing values
- Removing duplicates
- Fixing data types
- Standardizing categorical values

Cleaned datasets are saved to the interim layer.
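
As a rough illustration of the four steps listed above, the sketch below applies them to a tiny made-up frame. The column names and values are hypothetical and are not taken from the Blinkit files.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Blinkit files
raw = pd.DataFrame({
    "order_id": ["1", "2", "2", None],
    "category": [" Dairy", "snacks ", "snacks ", "Dairy"],
})

clean = (
    raw.drop_duplicates()                  # remove exact duplicate rows
       .dropna(subset=["order_id"])        # drop rows missing a critical ID
       .assign(
           # fix the data type: IDs arrive as text
           order_id=lambda d: d["order_id"].astype("int64"),
           # standardize categorical text: trim and lowercase
           category=lambda d: d["category"].str.strip().str.lower(),
       )
       .reset_index(drop=True)
)
print(clean)
```

Each dataset below uses only the subset of these steps that it needs.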


In [11]:
import pandas as pd
import os

RAW_PATH = "../data/raw"
INTERIM_PATH = "../data/interim"

os.makedirs(INTERIM_PATH, exist_ok=True)
pd.set_option("display.max_columns", None)


In [12]:
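def load_csv(filename):
    """Read a CSV file from the raw data directory into a DataFrame."""
    return pd.read_csv(os.path.join(RAW_PATH, filename))

def save_csv(df, filename):
    """Write a DataFrame to the interim data directory without the index."""
    df.to_csv(os.path.join(INTERIM_PATH, filename), index=False)
    print(f"✅ Saved: {filename}")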


In [13]:
orders = load_csv("blinkit_orders.csv")

orders.drop_duplicates(inplace=True)

# Parse columns whose name contains "date"; unparseable values become NaT
for col in orders.columns:
    if "date" in col.lower():
        orders[col] = pd.to_datetime(orders[col], errors="coerce")

save_csv(orders, "orders_clean.csv")
orders.info()


✅ Saved: orders_clean.csv
<class 'pandas.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   order_id                5000 non-null   int64         
 1   customer_id             5000 non-null   int64         
 2   order_date              5000 non-null   datetime64[us]
 3   promised_delivery_time  5000 non-null   str           
 4   actual_delivery_time    5000 non-null   str           
 5   delivery_status         5000 non-null   str           
 6   order_total             5000 non-null   float64       
 7   payment_method          5000 non-null   str           
 8   delivery_partner_id     5000 non-null   int64         
 9   store_id                5000 non-null   int64         
dtypes: datetime64[us](1), float64(1), int64(4), str(4)
memory usage: 390.8 KB
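
The `info()` output shows that `promised_delivery_time` and `actual_delivery_time` are still stored as strings, because the name-based rule only matches columns containing "date". If these columns hold timestamps, they could be parsed explicitly. This is a minimal sketch on made-up rows. The timestamp format is an assumption, since the source does not show the raw values.

```python
import pandas as pd

# Hypothetical sample rows; the timestamp format is an assumption
sample = pd.DataFrame({
    "promised_delivery_time": ["2024-01-05 10:15:00", "2024-01-05 11:00:00"],
    "actual_delivery_time": ["2024-01-05 10:20:00", "not recorded"],
})

time_cols = ["promised_delivery_time", "actual_delivery_time"]
for col in time_cols:
    # Unparseable values become NaT instead of raising an error
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

# Count how many values failed to parse in each column
print(sample[time_cols].isna().sum())
```

Checking the NaT count after parsing makes it visible when `errors="coerce"` has silently discarded values.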


In [14]:
order_items = load_csv("blinkit_order_items.csv")

order_items.drop_duplicates(inplace=True)

# Remove rows with missing critical IDs
order_items.dropna(subset=["order_id", "product_id"], inplace=True)

save_csv(order_items, "order_items_clean.csv")


✅ Saved: order_items_clean.csv
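
The cell above does not report how many rows each step removed. One way to make that auditable is a small helper that returns row counts alongside the cleaned frame. `clean_with_report` is a hypothetical name, and the sample rows are made up for illustration.

```python
import pandas as pd

def clean_with_report(df, id_cols):
    """Drop duplicates and rows missing any of id_cols; return the result and row counts."""
    start = len(df)
    deduped = df.drop_duplicates()
    cleaned = deduped.dropna(subset=id_cols)
    report = {
        "rows_in": start,
        "duplicates_removed": start - len(deduped),
        "missing_id_removed": len(deduped) - len(cleaned),
        "rows_out": len(cleaned),
    }
    return cleaned, report

# Hypothetical rows standing in for blinkit_order_items.csv
items = pd.DataFrame({
    "order_id": [1, 1, 2, None],
    "product_id": [10, 10, None, 30],
    "quantity": [1, 1, 2, 3],
})
cleaned, report = clean_with_report(items, ["order_id", "product_id"])
print(report)
```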


In [15]:
products = load_csv("blinkit_products.csv")

products.drop_duplicates(inplace=True)

# Standardize text columns: trim whitespace and lowercase
# ("object" matches pandas 2 text columns, "string" matches the pandas 3 str dtype)
text_cols = products.select_dtypes(include=["object", "string"]).columns
products[text_cols] = products[text_cols].apply(lambda x: x.str.strip().str.lower())

save_csv(products, "products_clean.csv")


✅ Saved: products_clean.csv


In [16]:
customers = load_csv("blinkit_customers.csv")

customers.drop_duplicates(inplace=True)

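# Fill missing values in text columns only, so numeric columns keep their dtype
text_cols = customers.select_dtypes(include=["object", "string"]).columns
customers[text_cols] = customers[text_cols].fillna("unknown")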

save_csv(customers, "customers_clean.csv")


✅ Saved: customers_clean.csv


In [17]:
inventory = load_csv("blinkit_inventory.csv")
inventory_new = load_csv("blinkit_inventoryNew.csv")

inventory.drop_duplicates(inplace=True)
inventory_new.drop_duplicates(inplace=True)

save_csv(inventory, "inventory_clean.csv")
save_csv(inventory_new, "inventory_new_clean.csv")


✅ Saved: inventory_clean.csv
✅ Saved: inventory_new_clean.csv


In [18]:
delivery = load_csv("blinkit_delivery_performance.csv")

delivery.drop_duplicates(inplace=True)

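# Convert "time" columns to numeric, but only when every non-missing value parses;
# this leaves timestamp-like columns untouched instead of turning them into NaN
for col in delivery.columns:
    if "time" in col.lower():
        converted = pd.to_numeric(delivery[col], errors="coerce")
        if converted.notna().sum() == delivery[col].notna().sum():
            delivery[col] = converted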

save_csv(delivery, "delivery_clean.csv")


✅ Saved: delivery_clean.csv


In [19]:
marketing = load_csv("blinkit_marketing_performance.csv")

marketing.drop_duplicates(inplace=True)

save_csv(marketing, "marketing_clean.csv")


✅ Saved: marketing_clean.csv


In [20]:
feedback = load_csv("blinkit_customer_feedback.csv")

feedback.drop_duplicates(inplace=True)

# Coerce rating columns to numeric; unparseable values become NaN
for col in feedback.columns:
    if "rating" in col.lower():
        feedback[col] = pd.to_numeric(feedback[col], errors="coerce")

save_csv(feedback, "feedback_clean.csv")


✅ Saved: feedback_clean.csv
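
Coercing ratings to numeric does not check that they fall on the expected scale. If the ratings are meant to lie between 1 and 5 (an assumption, since the source does not state the scale), out-of-range values could be blanked as well. This is a sketch on made-up values.

```python
import pandas as pd

# Hypothetical ratings; the 1-5 scale is an assumption, not confirmed by the source
ratings = pd.Series(["5", "3", "abc", "9", None], name="rating")

numeric = pd.to_numeric(ratings, errors="coerce")  # "abc" and None become NaN
in_range = numeric.between(1, 5)                   # NaN evaluates to False here
validated = numeric.where(in_range)                # out-of-range values become NaN

print(validated.tolist())
```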


## Data Cleaning Summary

- Each dataset cleaned independently
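- Exact duplicate rows removed from every dataset
- Missing values handled conservatively: order items lacking `order_id` or `product_id` dropped; missing customer values filled with "unknown"
- Data types standardized: date columns parsed to datetime; delivery time and rating columns coerced to numeric
- Product text columns trimmed and lowercased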
- Cleaned files saved to `data/interim/`
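
One caveat on the standardized data types: CSV stores everything as text, so a datetime column comes back as plain strings when the interim file is reloaded unless it is parsed again. The sketch below shows this with an in-memory buffer and a made-up frame standing in for `orders_clean.csv`.

```python
import io
import pandas as pd

# Hypothetical frame standing in for orders_clean.csv
orders = pd.DataFrame({
    "order_id": [1, 2],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-06"]),
})

buffer = io.StringIO()
orders.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # order_date comes back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["order_date"])  # order_date restored as datetime

print(plain["order_date"].dtype, parsed["order_date"].dtype)
```

Passing `parse_dates` when loading the interim files in later notebooks restores the datetime columns.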

Next Step:
Exploratory Data Analysis using cleaned datasets
