# Blinkit Sales Analytics – Data Understanding

## Objective
Understand the structure, schema, and relationships of multiple Blinkit business datasets.

This notebook focuses on:
- Identifying available datasets
- Understanding each table’s purpose
- Inspecting schema, data types, and missing values
- Identifying potential primary & foreign keys

No cleaning or transformation is done here.


In [8]:
import pandas as pd
import os

pd.set_option("display.max_columns", None)


In [9]:
DATA_PATH = "../data/raw"

files = [f for f in os.listdir(DATA_PATH) if f.endswith(".csv")]
files


['blinkit_customers.csv',
 'blinkit_customer_feedback.csv',
 'blinkit_delivery_performance.csv',
 'blinkit_inventory.csv',
 'blinkit_inventoryNew.csv',
 'blinkit_marketing_performance.csv',
 'blinkit_orders.csv',
 'blinkit_order_items.csv',
 'blinkit_products.csv']

In [10]:
# Load every CSV into a dict keyed by the file name without its extension
dataframes = {}

for file in files:
    df_name = os.path.splitext(file)[0]
    dataframes[df_name] = pd.read_csv(os.path.join(DATA_PATH, file))


In [11]:
for name, df in dataframes.items():
    print(f"\n--- {name.upper()} ---")
    print("Shape:", df.shape)
    print("Columns:", df.columns.tolist())



--- BLINKIT_CUSTOMERS ---
Shape: (2500, 11)
Columns: ['customer_id', 'customer_name', 'email', 'phone', 'address', 'area', 'pincode', 'registration_date', 'customer_segment', 'total_orders', 'avg_order_value']

--- BLINKIT_CUSTOMER_FEEDBACK ---
Shape: (5000, 8)
Columns: ['feedback_id', 'order_id', 'customer_id', 'rating', 'feedback_text', 'feedback_category', 'sentiment', 'feedback_date']

--- BLINKIT_DELIVERY_PERFORMANCE ---
Shape: (5000, 8)
Columns: ['order_id', 'delivery_partner_id', 'promised_time', 'actual_time', 'delivery_time_minutes', 'distance_km', 'delivery_status', 'reasons_if_delayed']

--- BLINKIT_INVENTORY ---
Shape: (75172, 4)
Columns: ['product_id', 'date', 'stock_received', 'damaged_stock']

--- BLINKIT_INVENTORYNEW ---
Shape: (18105, 4)
Columns: ['product_id', 'date', 'stock_received', 'damaged_stock']

--- BLINKIT_MARKETING_PERFORMANCE ---
Shape: (5400, 11)
Columns: ['campaign_id', 'campaign_name', 'date', 'target_audience', 'channel', 'impressions', 'clicks', 'conv

In [12]:
for name, df in dataframes.items():
    print(f"\n--- {name.upper()} INFO ---")
    df.info()



--- BLINKIT_CUSTOMERS INFO ---
<class 'pandas.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customer_id        2500 non-null   int64  
 1   customer_name      2500 non-null   str    
 2   email              2500 non-null   str    
 3   phone              2500 non-null   int64  
 4   address            2500 non-null   str    
 5   area               2500 non-null   str    
 6   pincode            2500 non-null   int64  
 7   registration_date  2500 non-null   str    
 8   customer_segment   2500 non-null   str    
 9   total_orders       2500 non-null   int64  
 10  avg_order_value    2500 non-null   float64
dtypes: float64(1), int64(4), str(6)
memory usage: 215.0 KB

--- BLINKIT_CUSTOMER_FEEDBACK INFO ---
<class 'pandas.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  --

In [13]:
for name, df in dataframes.items():
    print(f"\n--- {name.upper()} MISSING VALUES ---")
    print(df.isnull().sum().sort_values(ascending=False))




--- BLINKIT_CUSTOMERS MISSING VALUES ---
customer_id          0
customer_name        0
email                0
phone                0
address              0
area                 0
pincode              0
registration_date    0
customer_segment     0
total_orders         0
avg_order_value      0
dtype: int64

--- BLINKIT_CUSTOMER_FEEDBACK MISSING VALUES ---
feedback_id          0
order_id             0
customer_id          0
rating               0
feedback_text        0
feedback_category    0
sentiment            0
feedback_date        0
dtype: int64

--- BLINKIT_DELIVERY_PERFORMANCE MISSING VALUES ---
reasons_if_delayed       1902
order_id                    0
promised_time               0
delivery_partner_id         0
actual_time                 0
delivery_time_minutes       0
distance_km                 0
delivery_status             0
dtype: int64

--- BLINKIT_INVENTORY MISSING VALUES ---
product_id        0
date              0
stock_received    0
damaged_stock     0
dtype: int64

---
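
Most columns in the output above report zero nulls, so the few that matter are easy to miss. A minimal sketch of a more compact report, assuming the same `{name: DataFrame}` layout as `dataframes`, keeps only the columns that have at least one missing value. The helper name `missing_summary` and the toy frames are hypothetical and are not part of the notebook.

```python
import pandas as pd


def missing_summary(frames):
    """Return one row per column that has at least one null value."""
    rows = []
    for name, df in frames.items():
        null_counts = df.isnull().sum()
        # Keep only the columns where something is actually missing
        for column, count in null_counts[null_counts > 0].items():
            rows.append({
                "table": name,
                "column": column,
                "missing": int(count),
                "missing_pct": round(100 * count / len(df), 2),
            })
    return pd.DataFrame(rows, columns=["table", "column", "missing", "missing_pct"])


# Toy stand-ins for the real `dataframes` dict (values are illustrative only)
toy_frames = {
    "delivery": pd.DataFrame({
        "order_id": [1, 2, 3, 4],
        "reasons_if_delayed": ["Traffic", None, None, None],
    }),
    "inventory": pd.DataFrame({"product_id": [10, 11], "stock_received": [4, 4]}),
}

summary = missing_summary(toy_frames)
print(summary)
```

Calling `missing_summary(dataframes)` on the real data would produce the same kind of table in one pass, which may be easier to scan than nine separate printouts.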

In [14]:
for name, df in dataframes.items():
    print(f"\n--- {name.upper()} SAMPLE ---")
    display(df.head(2))



--- BLINKIT_CUSTOMERS SAMPLE ---


Unnamed: 0,customer_id,customer_name,email,phone,address,area,pincode,registration_date,customer_segment,total_orders,avg_order_value
0,97475543,Niharika Nagi,ektataneja@example.org,912987579691,"23, Nayar Path, Bihar Sharif-154625",Udupi,321865,2023-05-13,Premium,13,451.92
1,22077605,Megha Sachar,vedant45@example.com,915123179717,"51/302, Buch Chowk\nSrinagar-570271",Aligarh,149394,2024-06-18,Inactive,4,825.48



--- BLINKIT_CUSTOMER_FEEDBACK SAMPLE ---


Unnamed: 0,feedback_id,order_id,customer_id,rating,feedback_text,feedback_category,sentiment,feedback_date
0,2234710,1961864118,30065862,4,"It was okay, nothing special.",Delivery,Neutral,2024-07-17
1,5450964,1549769649,9573071,3,The order was incorrect.,App Experience,Negative,2024-05-28



--- BLINKIT_DELIVERY_PERFORMANCE SAMPLE ---


Unnamed: 0,order_id,delivery_partner_id,promised_time,actual_time,delivery_time_minutes,distance_km,delivery_status,reasons_if_delayed
0,1961864118,63230,2024-07-17 08:52:01,2024-07-17 08:47:01,-5.0,0.96,On Time,
1,1549769649,14983,2024-05-28 13:25:29,2024-05-28 13:27:29,2.0,0.98,On Time,Traffic



--- BLINKIT_INVENTORY SAMPLE ---


Unnamed: 0,product_id,date,stock_received,damaged_stock
0,153019,17-03-2023,4,2
1,848226,17-03-2023,4,2



--- BLINKIT_INVENTORYNEW SAMPLE ---


Unnamed: 0,product_id,date,stock_received,damaged_stock
0,153019,Mar-23,4,1
1,848226,Mar-23,4,1



--- BLINKIT_MARKETING_PERFORMANCE SAMPLE ---


Unnamed: 0,campaign_id,campaign_name,date,target_audience,channel,impressions,clicks,conversions,spend,revenue_generated,roas
0,548299,New User Discount,2024-11-05,Premium,App,3130,163,78,1431.85,4777.75,3.6
1,390914,Weekend Special,2024-11-05,Inactive,App,3925,494,45,4506.34,6238.11,2.98



--- BLINKIT_ORDERS SAMPLE ---


Unnamed: 0,order_id,customer_id,order_date,promised_delivery_time,actual_delivery_time,delivery_status,order_total,payment_method,delivery_partner_id,store_id
0,1961864118,30065862,2024-07-17 08:34:01,2024-07-17 08:52:01,2024-07-17 08:47:01,On Time,3197.07,Cash,63230,4771
1,1549769649,9573071,2024-05-28 13:14:29,2024-05-28 13:25:29,2024-05-28 13:27:29,On Time,976.55,Cash,14983,7534



--- BLINKIT_ORDER_ITEMS SAMPLE ---


Unnamed: 0,order_id,product_id,quantity,unit_price
0,1961864118,642612,3,517.03
1,1549769649,378676,1,881.42



--- BLINKIT_PRODUCTS SAMPLE ---


Unnamed: 0,product_id,product_name,category,brand,price,mrp,margin_percentage,shelf_life_days,min_stock_level,max_stock_level
0,153019,Onions,Fruits & Vegetables,Aurora LLC,947.95,1263.93,25.0,3,13,88
1,11422,Potatoes,Fruits & Vegetables,Ramaswamy-Tata,127.16,169.55,25.0,3,20,65
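
The samples above show two different `date` styles in the inventory files (`17-03-2023` in `blinkit_inventory`, `Mar-23` in `blinkit_inventoryNew`). The sketch below shows one possible way to measure how many values match a given format without modifying the data. It assumes an English locale for month abbreviations; the helper name `format_match_rate`, the two format strings, and the second value in each toy series are assumptions made for illustration.

```python
import pandas as pd


def format_match_rate(values, fmt):
    """Share of non-null strings that parse under the given strftime format."""
    parsed = pd.to_datetime(values, format=fmt, errors="coerce")
    non_null = values.notna().sum()
    return float(parsed.notna().sum() / non_null) if non_null else 0.0


# Toy series mimicking the two styles seen in the samples (illustrative values)
daily = pd.Series(["17-03-2023", "18-03-2023"])  # style seen in blinkit_inventory
monthly = pd.Series(["Mar-23", "Apr-23"])        # style seen in blinkit_inventoryNew

# Candidate formats to test; these are guesses based on the sample rows
candidates = {"day-month-year": "%d-%m-%Y", "month-year": "%b-%y"}

# For each candidate: (match rate on daily, match rate on monthly)
rates = {
    label: (format_match_rate(daily, fmt), format_match_rate(monthly, fmt))
    for label, fmt in candidates.items()
}
print(rates)
```

A match rate below 1.0 on the real column would indicate mixed or unexpected formats that the cleaning notebook needs to handle.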


## Key Observations (Initial)

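Based on column names and structure (candidate keys, not yet verified for uniqueness or referential integrity):

- `blinkit_orders`  
  Likely Primary Key: `order_id`  
  Foreign Keys: `customer_id` (`delivery_partner_id` and `store_id` are also present, but no matching partner or store table is in this dataset)

- `blinkit_order_items`  
  Foreign Keys: `order_id`, `product_id`

- `blinkit_products`  
  Likely Primary Key: `product_id`

- `blinkit_customers`  
  Likely Primary Key: `customer_id`

- `blinkit_inventory`, `blinkit_inventoryNew`  
  Foreign Key: `product_id` (both files share the same four columns)

- `blinkit_delivery_performance`  
  Foreign Key: `order_id`

- `blinkit_marketing_performance`  
  Campaign-level data (may join by `date` or `campaign_id`)

- `blinkit_customer_feedback`  
  Likely Primary Key: `feedback_id`  
  Foreign Keys: `order_id`, `customer_id`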
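
The keys listed above are inferred from column names only. A minimal sketch of how they could be checked: a primary-key candidate should be unique and non-null, and every foreign-key value should exist in the parent table. The helper names `is_candidate_key` and `orphan_count` are hypothetical, and the toy tables use made-up IDs rather than real rows.

```python
import pandas as pd


def is_candidate_key(df, column):
    """A column can serve as a primary key only if it is unique and non-null."""
    return bool(df[column].notna().all() and df[column].is_unique)


def orphan_count(child, parent, key):
    """Number of child rows whose key value has no match in the parent table."""
    return int((~child[key].isin(parent[key])).sum())


# Toy tables; the IDs are illustrative, not taken from the dataset
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 11, 10]})
customers = pd.DataFrame({"customer_id": [10, 11]})
order_items = pd.DataFrame({"order_id": [1, 1, 2, 9], "product_id": [5, 6, 5, 7]})

orders_pk_ok = is_candidate_key(orders, "order_id")        # unique per order
items_pk_ok = is_candidate_key(order_items, "order_id")    # repeats per line item
orphan_orders = orphan_count(orders, customers, "customer_id")
orphan_items = orphan_count(order_items, orders, "order_id")

print(orders_pk_ok, items_pk_ok, orphan_orders, orphan_items)
```

On the real data, the same two helpers could be applied to each table and key pair in the list, for example `orphan_count(dataframes["blinkit_order_items"], dataframes["blinkit_orders"], "order_id")`.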


## Data Modeling Perspective

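- **Fact Table**
  - blinkit_orders (central sales transactions)

- **Dimension Tables**
  - products
  - customers

- **Supporting Fact-Like Tables** (event or measurement rows that reference the keys above rather than describe an entity)
  - order_items (line-item detail for each order)
  - inventory
  - delivery_performance
  - marketing_performance
  - customer_feedback

This project follows a **star-schema-inspired analytics design** with `blinkit_orders` at the center.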
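
As a rough illustration of how this layout might be used, the sketch below joins line items to orders and products on their shared keys and aggregates revenue by category. The toy rows are invented, and `line_total` is a hypothetical derived column. The `validate="many_to_one"` argument of `DataFrame.merge` makes pandas raise an error if the right-hand key turns out not to be unique, which guards against accidental row duplication.

```python
import pandas as pd

# Toy tables that mirror the column names in the dataset (values are made up)
orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11]})
order_items = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [5, 6, 5],
    "quantity": [3, 1, 2],
    "unit_price": [10.0, 20.0, 10.0],
})
products = pd.DataFrame({"product_id": [5, 6], "category": ["Snacks", "Dairy"]})

# Enrich each line item with its order and product attributes
lines = (
    order_items
    .merge(orders, on="order_id", how="left", validate="many_to_one")
    .merge(products, on="product_id", how="left", validate="many_to_one")
)

# Hypothetical derived measure: revenue per line item
lines["line_total"] = lines["quantity"] * lines["unit_price"]
revenue_by_category = lines.groupby("category")["line_total"].sum()
print(revenue_by_category)
```

Using left joins keeps every line item even when a key has no match, so unmatched rows surface as nulls instead of silently disappearing.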


## Summary

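- The dataset represents a full e-commerce business system: customers, orders, order items, products, inventory, delivery, marketing, and feedback
- Data is split across nine CSV files covering these logical entities
- Orders act as the central fact table
- Multiple dimensions enrich business analysis
- Cleaning is required: missing values appear at least in `reasons_if_delayed` (1,902 of 5,000 rows in `blinkit_delivery_performance`), `registration_date` loads as a string rather than a date, and the two inventory files use different date formats (`17-03-2023` vs `Mar-23`)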

### Next Step
Proceed to **02_data_cleaning.ipynb**  
Each dataset will be cleaned **individually** and saved to `data/interim/`
