# Supermarket Sales Data Pipeline with Advanced Date Analytics

Complete ETL pipeline demonstration with temporal analysis:
1. **Extract**: Data from Kaggle; synthetic dates and times are generated only when the source is aggregated (the transform log in Step 2 shows the dataset's own timestamps being used)
2. **Transform**: Dimensional model with proper DATE/TIME/DATETIME types
3. **Load**: SQLite database with a schema designed for temporal queries
4. **Analyze**: Comprehensive reports with date-based insights

## Key Features
- **Synthetic Data Generation**: Converts aggregated data to individual transactions with realistic date/time patterns
- **Temporal Analytics**: Time-of-day, day-of-week, and monthly trend analysis
- **Advanced SQL**: Window functions, statistical filtering, and temporal queries
- **Business Insights**: Morning/afternoon/evening patterns, weekend vs weekday performance

## Setup

Import required libraries and set up the environment.

In [1]:
import sys
import os
sys.path.append('..')

# Force reload of the pipeline module to get latest changes
import importlib
if 'simple_pipeline' in sys.modules:
    importlib.reload(sys.modules['simple_pipeline'])

import pandas as pd
import sqlite3
from pathlib import Path
from simple_pipeline import extract_data, transform_data, load_data, generate_reports

print("✅ Pipeline modules imported successfully (with reload)")
print(f"📂 Current working directory: {os.getcwd()}")

✅ Pipeline modules imported successfully (with reload)
📂 Current working directory: /app/notebooks


## Step 1: Extract Data from Kaggle

Download and load the supermarket sales dataset from Kaggle.

In [2]:
# Extract raw data from Kaggle
print("🚀 Starting data extraction...")
raw_df = extract_data()

if raw_df is not None:
    print(f"\n📊 Dataset Overview:")
    print(f"  Rows: {len(raw_df)}")
    print(f"  Columns: {len(raw_df.columns)}")
    print(f"  Column names: {list(raw_df.columns)}")
    
    print(f"\n📋 Sample data:")
    display(raw_df.head(3))
else:
    print("❌ Data extraction failed. Please check Kaggle credentials.")

🚀 Starting data extraction...
🔄 Extracting data from Kaggle...
Dataset URL: https://www.kaggle.com/datasets/lovishbansal123/sales-of-a-supermarket
✅ Extracted 1000 rows with 17 columns

📊 Dataset Overview:
  Rows: 1000
  Columns: 17
  Column names: ['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender', 'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Date', 'Time', 'Payment', 'cogs', 'gross margin percentage', 'gross income', 'Rating']

📋 Sample data:


Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,76.4,4.761905,3.82,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4


## Step 2: Transform Data into Dimensional Model

Transform the raw transaction data into a star schema with:
- **dim_product**: Product dimension table
- **dim_store**: Store dimension table  
- **fact_sales**: Sales fact table with foreign keys

In [3]:
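# Transform data into dimensional model (requires a successful extraction in Step 1)
if raw_df is None:
    raise RuntimeError("No data to transform: extraction in Step 1 failed")

print("🔄 Starting data transformation...")
dim_product, dim_store, fact_sales = transform_data(raw_df)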

print("\n📦 Product Dimension (dim_product):")
print(f"  Shape: {dim_product.shape}")
display(dim_product.head())

print("\n🏪 Store Dimension (dim_store):") 
print(f"  Shape: {dim_store.shape}")
display(dim_store)

print("\n💰 Sales Fact Table (fact_sales):")
print(f"  Shape: {fact_sales.shape}")
print(f"  Columns: {list(fact_sales.columns)}")
display(fact_sales.head())

🔄 Starting data transformation...
🔄 Transforming data...
📋 Available columns: ['invoice_id', 'branch', 'city', 'customer_type', 'gender', 'product_line', 'unit_price', 'quantity', 'tax_5%', 'total', 'date', 'time', 'payment', 'cogs', 'gross_margin_percentage', 'gross_income', 'rating']
📊 Data shape: (1000, 17)
📅 Processing individual transaction data with existing date/time...
✅ Processed 1000 existing transaction dates
✅ Processed 1000 existing transaction times
✅ Created 1000 combined datetime records
📊 Date range: 2019-01-01 to 2019-03-30
📊 Using existing individual transaction data
📋 Fact columns after rename: ['invoice_id', 'branch', 'city', 'customer_type', 'gender', 'product_line', 'unit_price', 'quantity', 'tax_5', 'total', 'date', 'time', 'payment', 'cogs', 'gross_margin_percentage', 'gross_income', 'rating', 'datetime', 'product_id', 'store_id']
✅ Created: 6 products, 3 stores, 1000 sales transactions

📦 Product Dimension (dim_product):
  Shape: (6, 5)


Unnamed: 0,product_id,product_line,unit_price,cogs,gross_margin_percentage
0,1,Electronic accessories,53.55,304.41,4.76
1,2,Fashion accessories,57.15,290.56,4.76
2,3,Food and beverages,56.01,307.31,4.76
3,4,Health and beauty,54.85,308.23,4.76
4,5,Home and lifestyle,55.32,320.61,4.76



🏪 Store Dimension (dim_store):
  Shape: (3, 3)


Unnamed: 0,store_id,branch,city
0,1,A,Yangon
1,2,C,Naypyitaw
2,3,B,Mandalay



💰 Sales Fact Table (fact_sales):
  Shape: (1000, 15)
  Columns: ['sales_id', 'product_id', 'store_id', 'invoice_id', 'customer_type', 'gender', 'quantity', 'tax_5', 'total', 'date', 'time', 'datetime', 'payment', 'gross_income', 'rating']


Unnamed: 0,sales_id,product_id,store_id,invoice_id,customer_type,gender,quantity,tax_5,total,date,time,datetime,payment,gross_income,rating
0,1,4,1,750-67-8428,Member,Female,7,26.1415,548.9715,2019-01-05,13:08:00,2019-01-05 13:08:00,Ewallet,26.1415,9.1
1,2,1,2,226-31-3081,Normal,Female,5,3.82,80.22,2019-03-08,10:29:00,2019-03-08 10:29:00,Cash,3.82,9.6
2,3,5,1,631-41-3108,Normal,Male,7,16.2155,340.5255,2019-03-03,13:23:00,2019-03-03 13:23:00,Credit card,16.2155,7.4
3,4,4,1,123-19-1176,Member,Male,8,23.288,489.048,2019-01-27,20:33:00,2019-01-27 20:33:00,Ewallet,23.288,8.4
4,5,6,1,373-73-7910,Normal,Male,7,30.2085,634.3785,2019-02-08,10:37:00,2019-02-08 10:37:00,Ewallet,30.2085,5.3
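
The raw `Date` and `Time` strings (for example `1/5/2019` and `13:08`) appear in the fact table as ISO-style `date`, `time`, and `datetime` values. The sketch below shows one way such a conversion could be done with pandas; the month/day/year format is inferred from the sample rows, and the variable names are hypothetical rather than taken from `simple_pipeline`.

```python
import pandas as pd

# Raw strings as they appear in the Kaggle sample rows.
raw_dates = pd.DataFrame({
    "date": ["1/5/2019", "3/8/2019", "1/27/2019"],
    "time": ["13:08", "10:29", "20:33"],
})

# Assumption: dates are month/day/year, times are 24-hour HH:MM.
combined = pd.to_datetime(
    raw_dates["date"] + " " + raw_dates["time"],
    format="%m/%d/%Y %H:%M",
)

# ISO-8601 text sorts chronologically and is understood by SQLite's date functions.
parsed = pd.DataFrame({
    "date": combined.dt.strftime("%Y-%m-%d"),
    "time": combined.dt.strftime("%H:%M:%S"),
    "datetime": combined.dt.strftime("%Y-%m-%d %H:%M:%S"),
})
print(parsed)
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates such as `3/8/2019`.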


## Step 3: Load Data into SQLite Database

Create the database schema from an external SQL file, then load the dimensional data.
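
The schema-then-load pattern can be sketched with the standard library and pandas alone. The snippet below is a minimal stand-in, not the contents of `create_tables.sql` or `load_data`: the trimmed schema in `SCHEMA_SQL` and the two demo frames are assumptions. It also shows that SQLite keeps `DATE` values as text, because it has no dedicated date storage class.

```python
import sqlite3

import pandas as pd

# Hypothetical, trimmed-down schema; the real one lives in an external .sql file.
SCHEMA_SQL = """
CREATE TABLE dim_store (
    store_id INTEGER PRIMARY KEY,
    branch   TEXT NOT NULL,
    city     TEXT NOT NULL
);
CREATE TABLE fact_sales (
    sales_id INTEGER PRIMARY KEY,
    store_id INTEGER NOT NULL REFERENCES dim_store(store_id),
    total    REAL NOT NULL,
    date     DATE NOT NULL,
    time     TIME NOT NULL,
    datetime DATETIME NOT NULL
);
"""

stores = pd.DataFrame({
    "store_id": [1, 2],
    "branch": ["A", "C"],
    "city": ["Yangon", "Naypyitaw"],
})
sales = pd.DataFrame({
    "sales_id": [1, 2],
    "store_id": [1, 2],
    "total": [548.9715, 80.22],
    "date": ["2019-01-05", "2019-03-08"],
    "time": ["13:08:00", "10:29:00"],
    "datetime": ["2019-01-05 13:08:00", "2019-03-08 10:29:00"],
})

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_SQL)  # create the tables first so the declared types apply

# if_exists="append" inserts into the pre-created tables instead of replacing them.
stores.to_sql("dim_store", conn, if_exists="append", index=False)
sales.to_sql("fact_sales", conn, if_exists="append", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
stored_type = conn.execute("SELECT typeof(date) FROM fact_sales LIMIT 1").fetchone()[0]
conn.close()

print(row_count, stored_type)
```

Because the dates are stored as ISO-8601 text, string comparison and SQLite's `date()`/`strftime()` functions both work on them directly.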

In [4]:
# Clean up any existing database for fresh start
db_path = "../data/supermarket_sales.db"
if os.path.exists(db_path):
    os.remove(db_path)
    print("🗑️ Removed existing database")

# Load data into SQLite database using external SQL schema
db_path = load_data(dim_product, dim_store, fact_sales)

# Verify the database was created correctly
with sqlite3.connect(db_path) as conn:
    print("\n📊 Database verification:")
    
    # Check table row counts
    for table in ['dim_product', 'dim_store', 'fact_sales']:
        count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        print(f"  {table}: {count} rows")
    
    # Check actual schema
    fact_columns = pd.read_sql("PRAGMA table_info(fact_sales)", conn)
    print(f"\n📋 Fact table schema: {list(fact_columns['name'])}")
    
    # Sample data verification
    sample = pd.read_sql("SELECT * FROM fact_sales LIMIT 3", conn)
    print("\n📊 Sample fact_sales data:")
    display(sample)

🔄 Loading data into database...
✅ Database schema created from SQL file: /app/sql/create_tables.sql
✅ Data loaded successfully

📊 Database verification:
  dim_product: 6 rows
  dim_store: 3 rows
  fact_sales: 1000 rows

📋 Fact table schema: ['sales_id', 'product_id', 'store_id', 'invoice_id', 'customer_type', 'gender', 'quantity', 'tax_5', 'total', 'date', 'time', 'datetime', 'payment', 'gross_income', 'rating']

📊 Sample fact_sales data:


Unnamed: 0,sales_id,product_id,store_id,invoice_id,customer_type,gender,quantity,tax_5,total,date,time,datetime,payment,gross_income,rating
0,1,4,1,750-67-8428,Member,Female,7,26.1415,548.9715,2019-01-05,13:08:00,2019-01-05 13:08:00,Ewallet,26.1415,9.1
1,2,1,2,226-31-3081,Normal,Female,5,3.82,80.22,2019-03-08,10:29:00,2019-03-08 10:29:00,Cash,3.82,9.6
2,3,5,1,631-41-3108,Normal,Male,7,16.2155,340.5255,2019-03-03,13:23:00,2019-03-03 13:23:00,Credit card,16.2155,7.4


## Step 4: Generate Automated Reports with Date Analytics

Execute external SQL queries to generate comprehensive business reports with:
- **Temporal Analysis**: Date ranges, time-of-day patterns, day-of-week analysis
- **Trend Analysis**: Recent performance, monthly trends, daily averages  
- **Advanced SQL**: Window functions, ranking, statistical filtering
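
To make the temporal columns concrete, here is a small sketch of how time-of-day and weekend/weekday splits can be expressed in SQLite. The report SQL files are not shown in this notebook, so the bucket boundaries (morning before 12:00, afternoon before 17:00, evening otherwise) and the query itself are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (total REAL, date TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [
        (548.9715, "2019-01-05", "13:08:00"),  # Saturday afternoon
        (80.22, "2019-03-08", "10:29:00"),     # Friday morning
        (340.5255, "2019-03-03", "13:23:00"),  # Sunday afternoon
        (489.048, "2019-01-27", "20:33:00"),   # Sunday evening
    ],
)

# strftime('%w', date) returns the day of week as text: '0' = Sunday ... '6' = Saturday.
TEMPORAL_QUERY = """
SELECT
    CASE
        WHEN time < '12:00:00' THEN 'morning'
        WHEN time < '17:00:00' THEN 'afternoon'
        ELSE 'evening'
    END AS day_part,
    CASE
        WHEN strftime('%w', date) IN ('0', '6') THEN 'weekend'
        ELSE 'weekday'
    END AS day_type,
    COUNT(*) AS transaction_count,
    ROUND(AVG(total), 2) AS avg_transaction
FROM fact_sales
GROUP BY day_part, day_type
ORDER BY day_part, day_type
"""

temporal_rows = conn.execute(TEMPORAL_QUERY).fetchall()
conn.close()

for row in temporal_rows:
    print(row)
```

Comparing `time` as text works here only because the values are zero-padded `HH:MM:SS` strings.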

In [5]:
# Generate automated reports using external SQL files with comprehensive date analysis
print("🔄 Generating comprehensive reports with date-based analysis...")
generate_reports(db_path)

print("\n📊 Let's examine the detailed reports with all date analytics:")

# Set pandas display options to show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

try:
    # Read and display the saved reports with full details
    print("\n📈 SALES BY PRODUCT LINE - Complete Analysis:")
    product_report = pd.read_csv('../data/reports/sales_by_product_line.csv')
    display(product_report)
    
    print("\n🏪 SALES BY STORE - Complete Analysis:")  
    store_report = pd.read_csv('../data/reports/sales_by_store.csv')
    display(store_report)
    
    print("\n🌍 PRODUCT PERFORMANCE BY CITY - Complete Analysis:")
    city_report = pd.read_csv('../data/reports/product_performance_by_city.csv')
    display(city_report.head(10))  # Show first 10 rows due to size
    print(f"... showing 10 of {len(city_report)} total rows")
    
except Exception as e:
    print(f"⚠️ Error loading detailed reports: {e}")
    
# Reset pandas display options
pd.reset_option('display.max_columns')
pd.reset_option('display.width') 
pd.reset_option('display.max_colwidth')

print("\n✨ Key Insights from Date-Based Analysis:")
print("• Product reports now include time-of-day patterns (morning/afternoon/evening)")
print("• Store reports show weekend vs weekday performance and recent trends")
print("• City analysis includes monthly trends and peak day analysis")
print("• All reports track active days, date ranges, and daily revenue averages")

🔄 Generating comprehensive reports with date-based analysis...
🔄 Generating reports...
📋 Fact table columns: ['sales_id', 'product_id', 'store_id', 'invoice_id', 'customer_type', 'gender', 'quantity', 'tax_5', 'total', 'date', 'time', 'datetime', 'payment', 'gross_income', 'rating']

📊 SALES BY PRODUCT LINE
             product_line  transaction_count  total_revenue  \
0      Food and beverages                174       56144.84   
1       Sports and travel                166       55122.83   
2  Electronic accessories                170       54337.53   
3     Fashion accessories                178       54305.89   
4      Home and lifestyle                160       53861.91   
5       Health and beauty                152       49193.74   

   avg_transaction_value  avg_rating first_sale_date last_sale_date  \
0                 322.67        7.11      2019-01-01     2019-03-30   
1                 332.07        6.92      2019-01-01     2019-03-30   
2                 319.63        6.92
... output truncated

Unnamed: 0,product_line,transaction_count,total_revenue,avg_transaction_value,avg_rating,first_sale_date,last_sale_date,active_days,avg_daily_revenue,morning_avg_transaction,afternoon_avg_transaction,evening_avg_transaction
0,Food and beverages,174,55205.44,317.27,7.04,2024-01-01,2024-03-30,79,698.8,315.99,314.96,321.14
1,Sports and travel,166,55105.35,331.96,6.89,2024-01-01,2024-03-29,74,744.67,331.73,331.83,332.49
2,Electronic accessories,170,54647.61,321.46,6.9,2024-01-02,2024-03-29,75,728.63,325.94,322.77,315.75
3,Fashion accessories,178,53405.94,300.03,7.11,2024-01-01,2024-03-30,77,693.58,293.6,307.97,299.72
4,Home and lifestyle,160,53387.45,333.67,6.85,2024-01-01,2024-03-30,77,693.34,332.19,330.98,337.99
5,Health and beauty,152,49348.22,324.66,7.02,2024-01-01,2024-03-30,80,616.85,322.17,328.36,322.74



🏪 SALES BY STORE - Complete Analysis:


Unnamed: 0,city,branch,transaction_count,total_revenue,revenue_rank,running_total,first_sale_date,last_sale_date,active_days,avg_daily_revenue,weekend_avg_transaction,weekday_avg_transaction,recent_7day_revenue
0,Naypyitaw,C,328,110727.55,1,217181.32,2024-01-01,2024-03-30,89,1244.13,331.74,339.66,9794.212996
1,Mandalay,B,332,106453.77,2,106453.77,2024-01-01,2024-03-30,89,1196.11,324.03,319.39,9170.398046
2,Yangon,A,340,103918.68,3,321100.0,2024-01-01,2024-03-30,85,1222.57,303.6,306.46,5264.976093



🌍 PRODUCT PERFORMANCE BY CITY - Complete Analysis:


Unnamed: 0,product_line,city,branch,transaction_count,total_revenue,city_rank,first_sale_date,last_sale_date,active_days,avg_daily_revenue,jan_transactions,feb_transactions,mar_transactions,best_day_of_week,sunday_peak,saturday_peak
0,Health and beauty,Mandalay,B,53,20367.21,1,2024-01-02,2024-03-29,37,550.47,24,16,13,6,435.239643,448.789147
1,Sports and travel,Mandalay,B,62,19965.2,2,2024-01-03,2024-03-29,49,407.45,20,18,24,6,385.876587,379.635454
2,Home and lifestyle,Mandalay,B,50,17593.68,3,2024-01-01,2024-03-30,34,517.46,26,13,11,6,410.576864,406.724988
3,Electronic accessories,Mandalay,B,55,17265.54,4,2024-01-02,2024-03-29,46,375.34,25,13,17,6,342.427159,351.037369
4,Fashion accessories,Mandalay,B,62,16124.91,5,2024-01-02,2024-03-30,42,383.93,24,15,23,6,315.331563,298.240374
5,Food and beverages,Mandalay,B,50,15137.23,6,2024-01-03,2024-03-29,42,360.41,21,9,20,6,360.914911,351.950861
6,Food and beverages,Naypyitaw,C,66,23367.56,1,2024-01-03,2024-03-30,45,519.28,24,24,18,6,400.657666,412.750988
7,Fashion accessories,Naypyitaw,C,65,21501.34,2,2024-01-01,2024-03-30,48,447.94,23,13,29,6,343.577633,386.869519
8,Electronic accessories,Naypyitaw,C,55,19288.57,3,2024-01-05,2024-03-27,41,470.45,14,19,22,6,350.430685,407.784102
9,Health and beauty,Naypyitaw,C,52,16597.67,4,2024-01-06,2024-03-27,39,425.58,20,16,16,6,368.001718,345.801925


... showing 10 of 18 total rows

✨ Key Insights from Date-Based Analysis:
• Product reports now include time-of-day patterns (morning/afternoon/evening)
• Store reports show weekend vs weekday performance and recent trends
• City analysis includes monthly trends and peak day analysis
• All reports track active days, date ranges, and daily revenue averages
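
The `revenue_rank` and `running_total` columns in the store report are the kind of values window functions produce. The sketch below reproduces them from the three store totals shown above; the table name `store_revenue` and the query are assumptions, and window functions need SQLite 3.25 or later. In the store report, `running_total` (106453.77, 217181.32, 321100.0) accumulates in the order Mandalay, Naypyitaw, Yangon rather than in rank order, which would be consistent with a window ordered by city name. That is an inference from the numbers, not something the source states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store_revenue (city TEXT, total_revenue REAL)")
conn.executemany(
    "INSERT INTO store_revenue VALUES (?, ?)",
    [("Naypyitaw", 110727.55), ("Mandalay", 106453.77), ("Yangon", 103918.68)],
)

# Both windows are ordered by revenue, so the running total follows the rank order.
ranked = conn.execute(
    """
    SELECT
        city,
        total_revenue,
        RANK() OVER (ORDER BY total_revenue DESC) AS revenue_rank,
        SUM(total_revenue) OVER (
            ORDER BY total_revenue DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS running_total
    FROM store_revenue
    ORDER BY revenue_rank
    """
).fetchall()
conn.close()

for row in ranked:
    print(row)
```

If the running total should read top-down alongside the rank, both `OVER` clauses need the same ordering, as in this sketch.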


## Step 5: Custom Analysis

Run an additional analysis by reading a query from an external SQL file and executing it against the database.
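
Reading a query from disk and handing it to pandas can be wrapped in a small helper. The function `run_sql_file` below is hypothetical (it is not part of `simple_pipeline`), and the demo uses a temporary file and an in-memory table so it runs without the project's `sql/` directory.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def run_sql_file(conn, sql_path, params=None):
    """Read a single SELECT statement from a .sql file and return a DataFrame."""
    query = Path(sql_path).read_text(encoding="utf-8")
    return pd.read_sql_query(query, conn, params=params)


# Self-contained demo with a throwaway query file and table.
with tempfile.TemporaryDirectory() as tmp_dir:
    sql_file = Path(tmp_dir) / "report_demo.sql"
    sql_file.write_text(
        "SELECT city, SUM(total) AS total_revenue "
        "FROM sales GROUP BY city ORDER BY city",
        encoding="utf-8",
    )

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (city TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("Yangon", 10.0), ("Yangon", 5.0), ("Mandalay", 7.5)],
    )
    demo_report = run_sql_file(conn, sql_file)
    conn.close()

print(demo_report)
```

Keeping the query in its own file lets it be edited or reviewed without touching the notebook code.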

In [6]:
# Custom analysis: Product performance by city using external SQL
with sqlite3.connect(db_path) as conn:
    try:
        # Execute the product performance by city query
        with open('../sql/report_product_performance_by_city.sql', 'r') as f:
            query = f.read()
        
        result = pd.read_sql_query(query, conn)
        print("📊 Product Performance by City (from external SQL file):")
        display(result)
        
    except Exception as e:
        print(f"❌ Error executing custom analysis: {e}")
        
        # Debug information
        tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn)
        print(f"\n🔍 Available tables: {list(tables['name'])}")

📊 Product Performance by City (from external SQL file):


Unnamed: 0,product_line,city,branch,transaction_count,total_revenue,city_rank,first_sale_date,last_sale_date,active_days,avg_daily_revenue,jan_transactions,feb_transactions,mar_transactions,best_day_of_week,sunday_peak,saturday_peak
0,Fashion accessories,Mandalay,B,3100,310228.84,1,2024-01-01,2024-03-30,90,3446.99,1059,1049,992,6,119.999604,119.953348
1,Sports and travel,Mandalay,B,3100,310225.64,2,2024-01-01,2024-03-30,90,3446.95,1059,1030,1011,6,119.987776,119.914703
2,Electronic accessories,Mandalay,B,2750,275469.55,3,2024-01-01,2024-03-30,90,3060.77,976,875,899,6,119.838292,119.895652
3,Health and beauty,Mandalay,B,2650,265942.8,4,2024-01-01,2024-03-30,90,2954.92,867,843,940,6,119.887418,119.967535
4,Food and beverages,Mandalay,B,2500,250397.91,5,2024-01-01,2024-03-30,90,2782.2,840,799,861,6,119.892469,119.850402
5,Home and lifestyle,Mandalay,B,2500,250367.3,6,2024-01-01,2024-03-30,90,2781.86,812,806,882,6,119.812965,119.730373
6,Food and beverages,Naypyitaw,C,3300,330020.85,1,2024-01-01,2024-03-30,90,3666.9,1118,1058,1124,6,119.612775,119.719642
7,Fashion accessories,Naypyitaw,C,3250,325174.4,2,2024-01-01,2024-03-30,90,3613.05,1131,1025,1094,6,119.990607,119.996529
8,Electronic accessories,Naypyitaw,C,2750,274786.49,3,2024-01-01,2024-03-30,90,3053.18,978,852,920,6,119.90182,119.850852
9,Health and beauty,Naypyitaw,C,2600,260232.5,4,2024-01-01,2024-03-30,90,2891.47,914,837,849,6,119.923432,119.969211


## Pipeline Summary with Advanced Analytics

The complete ETL pipeline with temporal analytics:
1. ✅ **Extract**: Downloaded 1,000 individual transactions (17 columns) from Kaggle
2. ✅ **Transform**: Created a dimensional model (6 products, 3 stores, 1,000 sales rows) with DATE/TIME/DATETIME columns
3. ✅ **Load**: Created the SQLite schema from the external `create_tables.sql` file and loaded all three tables
4. ✅ **Analyze**: Generated product, store, and city reports with date-based metrics

### Key Achievements
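- **Data Volume**: 1,000 individual transactions across 6 product lines and 3 stores
- **Temporal Coverage**: 2019-01-01 to 2019-03-30 per the transform log (the report tables shown in Steps 4 and 5 carry 2024 dates, and Step 5 shows far larger transaction counts, so those outputs likely predate this run)
- **Advanced Analytics**: 12 or more columns per report, including date/time metrics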
- **Business Intelligence**: Time patterns, trends, and performance insights

### Report Capabilities
- **Product Analysis**: Time-of-day performance patterns across morning/afternoon/evening
- **Store Analysis**: Weekend vs weekday comparison and recent trend tracking  
- **Geographic Analysis**: City-specific performance with monthly trends and peak days
- **Statistical Quality**: Filtering for significance and robust date range analysis
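
The notebook does not show how the reports filter for significance. One common approach, given here purely as an assumption, is to drop groups with too few transactions using a `HAVING` clause; the threshold `MIN_TRANSACTIONS` is a hypothetical value chosen for the demo.

```python
import sqlite3

MIN_TRANSACTIONS = 2  # hypothetical cut-off; a real report would likely use a larger value

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (product_line TEXT, total REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [
        ("Health and beauty", 548.9715),
        ("Health and beauty", 489.048),
        ("Electronic accessories", 80.22),
    ],
)

# HAVING filters after aggregation, so it can test the per-group row count.
significant = conn.execute(
    """
    SELECT product_line, COUNT(*) AS transaction_count, AVG(total) AS avg_total
    FROM fact_sales
    GROUP BY product_line
    HAVING COUNT(*) >= ?
    ORDER BY product_line
    """,
    (MIN_TRANSACTIONS,),
).fetchall()
conn.close()

print(significant)
```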

All SQL code is externalized into separate .sql files for maintainability and business user accessibility.