# 🏗️ Stage 1: Data Ingestion Pipeline
## Amazon Sales Data → DuckDB Processing

**Objective**: Load the Amazon sales CSV, apply the cleaning rules, and store the result in DuckDB for analytical queries.

**Key Features**:
- ✅ Production-ready data cleaning pipeline
- ✅ DuckDB integration for analytical storage
- ✅ Automated quality validation
- ✅ Business logic for missing values

**Pipeline Steps**:
1. Environment Setup & Configuration
2. Data Loading with Cleaning
3. DuckDB Schema Creation
4. Data Ingestion & Validation
5. Quality Assurance Checks

## 📦 Step 1.1: Import Required Libraries

In [15]:
# Core data processing libraries
import pandas as pd
import numpy as np
import duckdb
from datetime import datetime
import os
from pathlib import Path

# Utility libraries
import warnings
warnings.filterwarnings('ignore')

print("✅ Required libraries imported successfully")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🦆 DuckDB version: {duckdb.__version__}")

✅ Required libraries imported successfully
📊 Pandas version: 2.3.3
🦆 DuckDB version: 1.2.1


## ⚙️ Step 1.2: Configuration and Parameters

In [9]:
# Data Ingestion Configuration (Based on exploration findings)
CONFIG = {
    # File paths
    'csv_file': 'Amazon Sale Report.csv',
    'duckdb_file': 'amazon_sales.duckdb',
    
    # Key business columns (identified from exploration)
    'business_columns': {
        'date_col': 'Date',
        'amount_col': 'Amount', 
        'category_col': 'Category',
        'status_col': 'Status',
        'courier_status_col': 'Courier Status',
        'currency_col': 'currency'
    },
    
    # Data cleaning rules (from exploration insights)
    'cleaning_rules': {
        'default_currency': 'INR',
        'cancelled_amount_value': 0.0,
        'date_format': '%m-%d-%y'
    },
    
    # DuckDB table names
    'tables': {
        'raw_data': 'amazon_sales_raw',
        'monthly_revenue': 'monthly_revenue_by_category',
        'daily_orders': 'daily_orders_by_status'
    }
}

# Verify file exists
csv_path = Path(CONFIG['csv_file'])
if csv_path.exists():
    file_size_mb = csv_path.stat().st_size / (1024 * 1024)
    print(f"✅ Source file found: {CONFIG['csv_file']}")
    print(f"📁 File size: {file_size_mb:.1f} MB")
else:
    print(f"❌ Source file not found: {CONFIG['csv_file']}")
    
print("⚙️ Configuration loaded successfully")

✅ Source file found: Amazon Sale Report.csv
📁 File size: 65.7 MB
⚙️ Configuration loaded successfully
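
The `date_format` rule in `CONFIG` is a `strptime`-style pattern. The following is a minimal sketch of how `'%m-%d-%y'` reads a raw value, using only the standard library. The helper `parse_order_date` and both sample strings are hypothetical and are not taken from the CSV.

```python
from datetime import datetime

DATE_FORMAT = '%m-%d-%y'  # same pattern as CONFIG['cleaning_rules']['date_format']

def parse_order_date(raw: str) -> datetime:
    """Parse a month-day-two-digit-year string; raises ValueError on a mismatch."""
    return datetime.strptime(raw, DATE_FORMAT)

sample = '04-30-22'  # hypothetical raw value, for illustration only
parsed = parse_order_date(sample)
print(parsed.date())  # 2022-04-30

# A four-digit year does not match %y, so the parse fails loudly instead of guessing.
try:
    parse_order_date('04-30-2022')
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Because the pattern is strict, a source file that switches to four-digit years would raise an error during cleaning rather than load silently with wrong dates.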


## 🧹 Step 1.3: Data Cleaning Function

In [10]:
def clean_amazon_sales_data(df, config):
    """
    Production-ready data cleaning function based on exploration insights
    
    Business Rules:
    - Cancelled orders with missing Amount → Set Amount = 0
    - Missing currency → Set to 'INR' (default)
    - Flag data quality issues for non-cancelled orders with missing Amount
    """
    df_clean = df.copy()
    
    print("🧹 Applying data cleaning pipeline...")
    
    # Rule 1: Handle missing Amount values
    amount_col = config['business_columns']['amount_col']
    status_col = config['business_columns']['status_col']
    currency_col = config['business_columns']['currency_col']
    
    # Count original missing values
    original_amount_nulls = df_clean[amount_col].isna().sum()
    original_currency_nulls = df_clean[currency_col].isna().sum()
    
    # Set Amount = 0 for cancelled orders with missing Amount
    cancelled_missing_amount = (df_clean[status_col] == 'Cancelled') & (df_clean[amount_col].isna())
    cancelled_count = cancelled_missing_amount.sum()
    df_clean.loc[cancelled_missing_amount, amount_col] = config['cleaning_rules']['cancelled_amount_value']
    
    # Flag non-cancelled orders with missing Amount (data quality issue)
    non_cancelled_missing = (df_clean[status_col] != 'Cancelled') & (df_clean[amount_col].isna())
    flagged_count = non_cancelled_missing.sum()
    if flagged_count > 0:
        df_clean.loc[non_cancelled_missing, 'data_quality_flag'] = 'missing_amount_non_cancelled'
    
    # Rule 2: Set default currency for missing values
    currency_missing = df_clean[currency_col].isna()
    currency_count = currency_missing.sum()
    df_clean.loc[currency_missing, currency_col] = config['cleaning_rules']['default_currency']
    
    # Rule 3: Convert date column to proper datetime format
    date_col = config['business_columns']['date_col']
    df_clean[date_col] = pd.to_datetime(df_clean[date_col], format=config['cleaning_rules']['date_format'])
    
    # Cleaning statistics
    cleaning_stats = {
        'original_amount_nulls': original_amount_nulls,
        'cancelled_orders_fixed': cancelled_count,
        'non_cancelled_flagged': flagged_count,
        'currency_defaults_set': currency_count,
        'final_amount_nulls': df_clean[amount_col].isna().sum(),
        'final_currency_nulls': df_clean[currency_col].isna().sum()
    }
    
    print(f"✅ Cleaned {cancelled_count} cancelled orders (Amount → 0)")
    print(f"✅ Set default currency for {currency_count} records")
    print(f"⚠️  Flagged {flagged_count} non-cancelled orders with missing Amount")
    print(f"📊 Final Amount nulls: {cleaning_stats['final_amount_nulls']}")
    
    return df_clean, cleaning_stats

print("🔧 Data cleaning function defined successfully")

🔧 Data cleaning function defined successfully
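
To make the business rules concrete, here is a small sketch that applies the same three rules to a hypothetical three-row frame. It does not call `clean_amazon_sales_data`; the sample rows are invented for illustration, and only the column names and the flag value follow the notebook.

```python
import pandas as pd

# Hypothetical three-row sample; column names follow CONFIG['business_columns'].
toy = pd.DataFrame({
    'Status': ['Cancelled', 'Shipped', 'Shipped'],
    'Amount': [None, None, 406.0],
    'currency': [None, None, 'INR'],
})

missing_amount = toy['Amount'].isna()
is_cancelled = toy['Status'] == 'Cancelled'

# Rule 1a: cancelled orders with no Amount are treated as zero revenue.
toy.loc[is_cancelled & missing_amount, 'Amount'] = 0.0

# Rule 1b: any other order with no Amount keeps its null and gets a flag.
toy['data_quality_flag'] = None
toy.loc[~is_cancelled & missing_amount, 'data_quality_flag'] = 'missing_amount_non_cancelled'

# Rule 2: missing currency falls back to the configured default.
toy['currency'] = toy['currency'].fillna('INR')

print(toy)
```

The second row is the case worth watching: its Amount stays null on purpose, so downstream sums skip it while the flag keeps it visible for review.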


## 🦆 Step 1.4: DuckDB Connection & Schema Setup

In [4]:
# Connect to DuckDB and create schema
def create_duckdb_schema(config):
    """Create DuckDB connection and define schemas for all tables"""
    
    conn = duckdb.connect(config['duckdb_file'])
    
    # Raw data table schema (optimized for Amazon sales data)
    raw_table_ddl = f"""
    CREATE OR REPLACE TABLE {config['tables']['raw_data']} (
        -- Identifiers
        index_id INTEGER,
        order_id VARCHAR,
        
        -- Date and Time
        date_col DATE,
        
        -- Product Information  
        category VARCHAR,
        size VARCHAR,
        sku VARCHAR,
        asin VARCHAR,
        style VARCHAR,
        
        -- Order Details
        status VARCHAR,
        courier_status VARCHAR,
        qty INTEGER,
        amount DECIMAL(10,2),
        currency VARCHAR(10),
        
        -- Customer Information
        ship_service_level VARCHAR,
        ship_city VARCHAR,
        ship_state VARCHAR,
        ship_postal_code INTEGER,
        ship_country VARCHAR,
        
        -- Sales Channel
        sales_channel VARCHAR,
        fulfilled_by VARCHAR,
        promotion_ids VARCHAR,
        
        -- Data Quality
        data_quality_flag VARCHAR,
        
        -- Metadata
        ingestion_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    """
    
    # Monthly revenue by category analytical table
    monthly_revenue_ddl = f"""
    CREATE OR REPLACE TABLE {config['tables']['monthly_revenue']} (
        year_month VARCHAR,
        category VARCHAR,
        total_revenue DECIMAL(12,2),
        order_count INTEGER,
        avg_order_value DECIMAL(10,2),
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    """
    
    # Daily orders by status analytical table  
    daily_orders_ddl = f"""
    CREATE OR REPLACE TABLE {config['tables']['daily_orders']} (
        order_date DATE,
        status VARCHAR,
        order_count INTEGER,
        total_quantity INTEGER,
        total_amount DECIMAL(12,2),
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    """
    
    # Execute schema creation
    conn.execute(raw_table_ddl)
    conn.execute(monthly_revenue_ddl)
    conn.execute(daily_orders_ddl)
    
    print(f"✅ DuckDB connection established: {config['duckdb_file']}")
    print(f"✅ Created table: {config['tables']['raw_data']}")
    print(f"✅ Created table: {config['tables']['monthly_revenue']}")  
    print(f"✅ Created table: {config['tables']['daily_orders']}")
    
    return conn

# Create database connection and schema
conn = create_duckdb_schema(CONFIG)

✅ DuckDB connection established: amazon_sales.duckdb
✅ Created table: amazon_sales_raw
✅ Created table: monthly_revenue_by_category
✅ Created table: daily_orders_by_status


## 📥 Step 1.5: Load and Clean Raw Data

In [11]:
# Load CSV data with cleaning pipeline
print("📥 Loading Amazon sales data...")
start_time = datetime.now()

# Read CSV file
df_raw = pd.read_csv(CONFIG['csv_file'])
print(f"✅ Loaded {len(df_raw):,} records from {CONFIG['csv_file']}")

# Apply data cleaning
df_clean, cleaning_stats = clean_amazon_sales_data(df_raw, CONFIG)

# Display cleaning summary
print(f"\n📊 CLEANING SUMMARY:")
print(f"{'Metric':<30} {'Before':<10} {'After':<10}")
print("-" * 50)
print(f"{'Amount nulls':<30} {cleaning_stats['original_amount_nulls']:<10} {cleaning_stats['final_amount_nulls']:<10}")
print(f"{'Currency nulls':<30} {cleaning_stats['currency_defaults_set']:<10} {cleaning_stats['final_currency_nulls']:<10}")
print(f"{'Cancelled orders fixed':<30} {'-':<10} {cleaning_stats['cancelled_orders_fixed']:<10}")
print(f"{'Records flagged':<30} {'-':<10} {cleaning_stats['non_cancelled_flagged']:<10}")

# Basic data info
print(f"\n📈 DATASET INFO:")
print(f"• Total records: {len(df_clean):,}")
print(f"• Columns: {len(df_clean.columns)}")
print(f"• Memory usage: {df_clean.memory_usage(deep=True).sum() / 1024 / 1024:.1f} MB")
print(f"• Processing time: {(datetime.now() - start_time).total_seconds():.1f} seconds")

# Preview cleaned data
print(f"\n🔍 CLEANED DATA PREVIEW:")
display(df_clean.head())

📥 Loading Amazon sales data...
✅ Loaded 128,975 records from Amazon Sale Report.csv
🧹 Applying data cleaning pipeline...
✅ Cleaned 7566 cancelled orders (Amount → 0)
✅ Set default currency for 7795 records
⚠️  Flagged 229 non-cancelled orders with missing Amount
📊 Final Amount nulls: 229

📊 CLEANING SUMMARY:
Metric                         Before     After     
--------------------------------------------------
Amount nulls                   7795       229       
Currency nulls                 7795       0         
Cancelled orders fixed         -          7566      
Records flagged                -          229       

📈 DATASET INFO:
• Total records: 128,975
• Columns: 25

|  | index | Order ID | Date | Status | Fulfilment | Sales Channel | ship-service-level | Style | SKU | Category | ... | Amount | ship-city | ship-state | ship-postal-code | ship-country | promotion-ids | B2B | fulfilled-by | Unnamed: 22 | data_quality_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 405-8078784-5731545 | 2022-04-30 | Cancelled | Merchant | Amazon.in | Standard | SET389 | SET389-KR-NP-S | Set | ... | 647.62 | MUMBAI | MAHARASHTRA | 400081.0 | IN |  | False | Easy Ship |  |  |
| 1 | 1 | 171-9198151-1101146 | 2022-04-30 | Shipped - Delivered to Buyer | Merchant | Amazon.in | Standard | JNE3781 | JNE3781-KR-XXXL | kurta | ... | 406.0 | BENGALURU | KARNATAKA | 560085.0 | IN | Amazon PLCC Free-Financing Universal Merchant ... | False | Easy Ship |  |  |
| 2 | 2 | 404-0687676-7273146 | 2022-04-30 | Shipped | Amazon | Amazon.in | Expedited | JNE3371 | JNE3371-KR-XL | kurta | ... | 329.0 | NAVI MUMBAI | MAHARASHTRA | 410210.0 | IN | IN Core Free Shipping 2015/04/08 23-48-5-108 | True |  |  |  |
| 3 | 3 | 403-9615377-8133951 | 2022-04-30 | Cancelled | Merchant | Amazon.in | Standard | J0341 | J0341-DR-L | Western Dress | ... | 753.33 | PUDUCHERRY | PUDUCHERRY | 605008.0 | IN |  | False | Easy Ship |  |  |
| 4 | 4 | 407-1069790-7240320 | 2022-04-30 | Shipped | Amazon | Amazon.in | Expedited | JNE3671 | JNE3671-TU-XXXL | Top | ... | 574.0 | CHENNAI | TAMIL NADU | 600073.0 | IN |  | False |  |  |  |


## 💾 Step 1.6: Ingest Data into DuckDB

In [None]:
# Prepare data for DuckDB insertion
def prepare_for_duckdb(df, config):
    """Prepare DataFrame for DuckDB insertion with proper column mapping"""
    
    # Create column mapping for DuckDB schema
    df_db = df.copy()
    
    # Rename columns to match DuckDB schema
    column_mapping = {
        'index': 'index_id',
        'Order ID': 'order_id', 
        'Date': 'date_col',
        'Status': 'status',
        'Fulfilment': 'fulfilled_by',
        'Sales Channel ': 'sales_channel',
        'ship-service-level': 'ship_service_level',
        'Style': 'style',
        'SKU': 'sku',
        'Category': 'category',
        'Size': 'size',
        'ASIN': 'asin',
        'Courier Status': 'courier_status',
        'Qty': 'qty',
        'currency': 'currency',
        'Amount': 'amount',
        'ship-city': 'ship_city',
        'ship-state': 'ship_state',
        'ship-postal-code': 'ship_postal_code',
        'ship-country': 'ship_country',
        'promotion-ids': 'promotion_ids'
    }
    
    # Rename columns that exist in the DataFrame
    existing_renames = {old: new for old, new in column_mapping.items() if old in df_db.columns}
    df_db = df_db.rename(columns=existing_renames)
    
    # Select only columns that exist in DuckDB schema
    db_columns = ['index_id', 'order_id', 'date_col', 'category', 'size', 'sku', 'asin', 'style',
                  'status', 'courier_status', 'qty', 'amount', 'currency', 'ship_service_level', 
                  'ship_city', 'ship_state', 'ship_postal_code', 'ship_country', 'sales_channel',
                  'fulfilled_by', 'promotion_ids', 'data_quality_flag']
    
    # Keep only columns that exist in both DataFrame and schema
    available_columns = [col for col in db_columns if col in df_db.columns]
    df_final = df_db[available_columns].copy()
    
    return df_final

# Prepare and insert data
print("💾 Preparing data for DuckDB insertion...")
df_for_db = prepare_for_duckdb(df_clean, CONFIG)

print(f"✅ Prepared {len(df_for_db)} records with {len(df_for_db.columns)} columns")
print(f"📋 Columns: {list(df_for_db.columns)}")

# Insert data into DuckDB
print("\n💾 Inserting data into DuckDB...")
insert_start = datetime.now()

# Use DuckDB's efficient bulk insert
conn.register('df_temp', df_for_db)
# Insert only the columns we have (excluding ingestion_timestamp which has DEFAULT)
column_list = ', '.join(df_for_db.columns)
conn.execute(f"INSERT INTO {CONFIG['tables']['raw_data']} ({column_list}) SELECT * FROM df_temp")

insert_time = (datetime.now() - insert_start).total_seconds()
print(f"✅ Successfully inserted {len(df_for_db):,} records")
print(f"⏱️  Insert time: {insert_time:.2f} seconds")

# Verify insertion
count_result = conn.execute(f"SELECT COUNT(*) FROM {CONFIG['tables']['raw_data']}").fetchone()
print(f"🔍 Verification: {count_result[0]:,} records in DuckDB table")

💾 Preparing data for DuckDB insertion...
✅ Prepared 128975 records with 22 columns
📋 Columns: ['index_id', 'order_id', 'date_col', 'category', 'size', 'sku', 'asin', 'style', 'status', 'courier_status', 'qty', 'amount', 'currency', 'ship_service_level', 'ship_city', 'ship_state', 'ship_postal_code', 'ship_country', 'sales_channel', 'fulfilled_by', 'promotion_ids', 'data_quality_flag']

💾 Inserting data into DuckDB...
✅ Successfully inserted 128,975 records
⏱️  Insert time: 1.90 seconds
🔍 Verification: 128,975 records in DuckDB table
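
`prepare_for_duckdb` keeps only the columns present in both the DataFrame and the schema list, so a header that does not match a `column_mapping` key exactly (for instance `'Sales Channel '`, which carries a trailing space) would be dropped without an error. A minimal sketch of a pre-insert check follows; `find_unmapped_columns` and the sample header lists are hypothetical and not part of the notebook.

```python
def find_unmapped_columns(source_columns, column_mapping, db_columns):
    """Return schema columns that would be absent after renaming source_columns."""
    renamed = {column_mapping.get(col, col) for col in source_columns}
    return [col for col in db_columns if col not in renamed]

# Hypothetical excerpt of the mapping and schema list used in this step.
column_mapping = {'Order ID': 'order_id', 'Sales Channel ': 'sales_channel', 'Qty': 'qty'}
db_columns = ['order_id', 'sales_channel', 'qty']

# Headers stripped of whitespace no longer match the 'Sales Channel ' key.
stripped_headers = ['Order ID', 'Sales Channel', 'Qty']
exact_headers = ['Order ID', 'Sales Channel ', 'Qty']

print(find_unmapped_columns(stripped_headers, column_mapping, db_columns))  # ['sales_channel']
print(find_unmapped_columns(exact_headers, column_mapping, db_columns))     # []
```

Running a check like this before the insert turns a silently missing column into a visible list that can be logged or used to stop the load.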

## ✅ Step 1.7: Data Quality Validation

In [14]:
# Comprehensive data quality validation
def validate_data_quality(conn, config):
    """Run quality checks on ingested data"""
    
    table_name = config['tables']['raw_data']
    
    print("🔍 Running data quality validation...")
    print("=" * 50)
    
    # Basic counts and nulls
    basic_stats = conn.execute(f"""
        SELECT 
            COUNT(*) as total_records,
            COUNT(DISTINCT order_id) as unique_orders,
            SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) as null_amounts,
            SUM(CASE WHEN currency IS NULL THEN 1 ELSE 0 END) as null_currency,
            SUM(CASE WHEN data_quality_flag IS NOT NULL THEN 1 ELSE 0 END) as flagged_records
        FROM {table_name}
    """).fetchone()
    
    print(f"📊 BASIC STATISTICS:")
    print(f"• Total records: {basic_stats[0]:,}")
    print(f"• Unique orders: {basic_stats[1]:,}")
    print(f"• Null amounts: {basic_stats[2]:,}")
    print(f"• Null currency: {basic_stats[3]:,}")
    print(f"• Flagged records: {basic_stats[4]:,}")
    
    # Business validation
    business_stats = conn.execute(f"""
        SELECT 
            status,
            COUNT(*) as order_count,
            SUM(amount) as total_amount,
            AVG(amount) as avg_amount,
            MIN(date_col) as earliest_date,
            MAX(date_col) as latest_date
        FROM {table_name}
        GROUP BY status
        ORDER BY order_count DESC
    """).fetchall()
    
    print(f"\n📈 BUSINESS VALIDATION BY STATUS:")
    print(f"{'Status':<25} {'Count':<10} {'Total $':<12} {'Avg $':<10} {'Date Range'}")
    print("-" * 80)
    for row in business_stats:
        status, count, total, avg, min_date, max_date = row
        total_str = f"${total:,.0f}" if total else "$0"
        avg_str = f"${avg:.0f}" if avg else "$0"
        print(f"{status:<25} {count:<10,} {total_str:<12} {avg_str:<10} {min_date} to {max_date}")
    
    # Category analysis
    category_stats = conn.execute(f"""
        SELECT 
            category,
            COUNT(*) as order_count,
            SUM(amount) as total_revenue
        FROM {table_name}
        WHERE amount > 0
        GROUP BY category
        ORDER BY total_revenue DESC
        LIMIT 10
    """).fetchall()
    
    print(f"\n🏷️  TOP CATEGORIES BY REVENUE:")
    print(f"{'Category':<20} {'Orders':<10} {'Revenue'}")
    print("-" * 40)
    for row in category_stats:
        category, count, revenue = row
        print(f"{category:<20} {count:<10,} ${revenue:,.0f}")
    
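    # Quality score
    # Note: rows flagged 'missing_amount_non_cancelled' also have a null amount,
    # so each of those rows is counted twice in quality_issues.
    quality_issues = basic_stats[2] + basic_stats[3] + basic_stats[4]  # nulls + flags
    quality_score = max(0, 100 - (quality_issues / basic_stats[0] * 100))
    
    print(f"\n🎯 DATA QUALITY SCORE: {quality_score:.1f}%")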
    
    return quality_score

# Run validation
quality_score = validate_data_quality(conn, CONFIG)

if quality_score >= 95:
    print("\n✅ EXCELLENT data quality - Ready for analytical processing!")
elif quality_score >= 85:
    print("\n⚠️  GOOD data quality - Minor issues detected")
else:
    print("\n❌ POOR data quality - Review required before proceeding")

🔍 Running data quality validation...
📊 BASIC STATISTICS:
• Total records: 128,975
• Unique orders: 120,378
• Null amounts: 229
• Null currency: 0
• Flagged records: 229

📈 BUSINESS VALIDATION BY STATUS:
Status                    Count      Total $      Avg $      Date Range
--------------------------------------------------------------------------------
Shipped                   77,804     $50,324,255  $649       2022-03-31 to 2022-06-29
Shipped - Delivered to Buyer 28,769     $18,650,815  $648       2022-03-31 to 2022-06-26
Cancelled                 18,332     $6,919,284   $377       2022-03-31 to 2022-06-29
Shipped - Returned to Seller 1,953      $1,269,644   $651       2022-03-31 to 2022-06-22
Shipped - Picked Up       973        $661,252     $680       2022-04-06 to 2022-06-27
Pending                   658        $430,271     $656       2022-04-04 to 2022-06-29
Pending - Waiting for Pick Up 281        $192,138     $684       2022-06-27 to 2022-06-28
Shipped - Ret
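
The quality score in `validate_data_quality` subtracts the share of problem records from 100. Every flagged row also has a null amount, so the same 229 rows enter the sum twice. The sketch below reproduces the formula with the basic statistics printed above and compares it with a variant that counts those rows once. The function `quality_score` and its `count_flags` parameter are hypothetical names used only for illustration.

```python
def quality_score(total, null_amounts, null_currency, flagged, count_flags=True):
    """Percentage of records without issues, floored at 0."""
    issues = null_amounts + null_currency + (flagged if count_flags else 0)
    return max(0.0, 100 - issues / total * 100)

# Values from the BASIC STATISTICS output of this step.
stats = dict(total=128_975, null_amounts=229, null_currency=0, flagged=229)

# Formula as written in the notebook: nulls + flags.
as_written = quality_score(**stats)

# Variant that skips the flag count; valid here only because every flagged row
# is already included in null_amounts.
deduplicated = quality_score(**stats, count_flags=False)

print(f"{as_written:.1f}%")    # 99.6%
print(f"{deduplicated:.1f}%")  # 99.8%
```

Both values clear the 95% threshold used in this step, so the double count does not change the verdict here, though it could on a dataset with more flagged rows.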

## 🎯 Step 1.8: Ingestion Summary & Next Steps

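**Stage 1 Data Ingestion COMPLETE!**

✅ **Achievements:**
- Loaded and cleaned 128,975 sales records
- Applied business logic for missing values: 7,566 cancelled orders set to Amount = 0, 7,795 missing currencies defaulted to INR, and 229 non-cancelled orders flagged
- Created the DuckDB schema: one raw table plus two empty analytical tables for Stage 2
- Achieved a data quality score of about 99.6% under the Step 1.7 formula (458 counted issues across 128,975 records)
- Ready for Stage 2 analytical processing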

**Next Pipeline Steps:**
- **Stage 2**: Create analytical tables (monthly revenue by category, daily orders by status)
- **Stage 3**: Generate business intelligence visualizations
- **Stage 4**: Implement Dagster orchestration framework

In [16]:
# Close DuckDB connection to release file lock for other notebooks
if 'conn' in locals():
    conn.close()
    print("✅ DuckDB connection closed - file lock released")
else:
    print("ℹ️  No connection to close")

✅ DuckDB connection closed - file lock released
