# Data Exploration & Discovery
## Understanding Amazon Sales Data Structure

This notebook covers **Data Exploration & Discovery**: examining the `Amazon Sale Report.csv` file to understand its structure before building our Dagster pipeline.

### Objectives:
1. Understand the CSV file structure and size
2. Identify key columns for our analytical tables
3. Plan the raw data database schema for DuckDB
4. Prepare insights for the data ingestion pipeline

### Project Pipeline Overview:
- **Data Exploration** ← **(Current Notebook)**
- **Stage 1: Data Ingestion** - CSV loading, cleaning, and DuckDB storage
- **Stage 2: Analytical Processing** - Create business intelligence tables
- **Stage 3: Visualization & Insights** - Generate charts and analysis reports

## Setup Instructions

**⚠️ Important: Install Required Packages First**

Before running this notebook, make sure you have installed the required packages:

```bash
# Activate your conda environment
conda activate verihub-dagster

# Install core requirements
pip install -r requirements-core.txt
```

## Import Required Libraries

In [5]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


In [6]:
# Define file path
csv_file = "Amazon Sale Report.csv"

# Check if file exists
if os.path.exists(csv_file):
    file_size = os.path.getsize(csv_file) / (1024 * 1024)  # Convert to MB
    print(f"📁 File found: {csv_file}")
    print(f"📊 File size: {file_size:.2f} MB")
else:
    print(f"❌ File not found: {csv_file}")
    print("Please ensure the file is in the current directory")

📁 File found: Amazon Sale Report.csv
📊 File size: 65.73 MB


## Step 1: Read Sample Data (First 1000 rows)

In [7]:
# Read sample data to understand structure without loading entire file
try:
    print("🔍 Reading sample data (first 1000 rows)...")
    df_sample = pd.read_csv(csv_file, nrows=1000)

    print(f"✅ Successfully loaded sample data")
    print(f"📋 Number of columns: {len(df_sample.columns)}")
    print(f"🔢 Number of sample rows: {len(df_sample)}")
    print(f"💾 Sample memory usage: {df_sample.memory_usage(deep=True).sum() / 1024:.2f} KB")

except Exception as e:
    print(f"❌ Error reading CSV: {e}")

🔍 Reading sample data (first 1000 rows)...
✅ Successfully loaded sample data
📋 Number of columns: 24
🔢 Number of sample rows: 1000
💾 Sample memory usage: 1398.88 KB


## Step 2: Examine Column Structure

In [8]:
# Display all column names
print("📝 Column Names:")
print("=" * 50)
for i, col in enumerate(df_sample.columns, 1):
    print(f"{i:2d}. {col}")

📝 Column Names:
 1. index
 2. Order ID
 3. Date
 4. Status
 5. Fulfilment
 6. Sales Channel 
 7. ship-service-level
 8. Style
 9. SKU
10. Category
11. Size
12. ASIN
13. Courier Status
14. Qty
15. currency
16. Amount
17. ship-city
18. ship-state
19. ship-postal-code
20. ship-country
21. promotion-ids
22. B2B
23. fulfilled-by
24. Unnamed: 22
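
The listing above includes a header with trailing whitespace (`Sales Channel `) and headers that mix spaces, hyphens, and colons. One possible way to get predictable names before loading into DuckDB is sketched below; the helper `normalize_column_name` and the snake_case convention are assumptions for illustration, not something the notebook prescribes.

```python
import re

import pandas as pd


def normalize_column_name(name: str) -> str:
    """Trim, lowercase, and collapse runs of non-alphanumerics into '_' (hypothetical helper)."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return cleaned.strip("_")


# Toy frame that reuses a few of the header names listed above
df = pd.DataFrame(columns=["Order ID", "Sales Channel ", "ship-service-level", "Unnamed: 22"])
df = df.rename(columns=normalize_column_name)
print(list(df.columns))
```

Renaming once at ingestion time would keep later SQL free of quoted identifiers, at the cost of no longer matching the CSV headers exactly.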


## Step 3: Analyze Data Types and Missing Values

In [9]:
# Detailed column analysis
print("📊 Column Details:")
print("=" * 80)
print(f"{'Column':<25} {'Data Type':<15} {'Non-Null':<10} {'Null Count':<10} {'Sample Values':<20}")
print("-" * 80)

for col in df_sample.columns:
    dtype = str(df_sample[col].dtype)
    non_null = df_sample[col].count()
    null_count = len(df_sample) - non_null
    
    # Get sample unique values (first 3)
    sample_values = df_sample[col].dropna().unique()[:3]
    sample_str = str(list(sample_values))[:18] + ".." if len(str(list(sample_values))) > 20 else str(list(sample_values))
    
    print(f"{col:<25} {dtype:<15} {non_null:<10} {null_count:<10} {sample_str:<20}")

📊 Column Details:
Column                    Data Type       Non-Null   Null Count Sample Values       
--------------------------------------------------------------------------------
index                     int64           1000       0          [np.int64(0), np.i..
Order ID                  object          1000       0          ['405-8078784-5731..
Date                      object          1000       0          ['04-30-22']        
Status                    object          1000       0          ['Cancelled', 'Shi..
Fulfilment                object          1000       0          ['Merchant', 'Amaz..
Sales Channel             object          1000       0          ['Amazon.in', 'Non..
ship-service-level        object          1000       0          ['Standard', 'Expe..
Style                     object          1000       0          ['SET389', 'JNE378..
SKU                       object          1000       0          ['SET389-KR-NP-S',..
Category                  object          1000  

## Step 4: Display Sample Data

In [10]:
# Show first 5 rows
print("👀 First 5 rows of data:")
print("=" * 100)
display(df_sample.head())

👀 First 5 rows of data:


| | index | Order ID | Date | Status | Fulfilment | Sales Channel | ship-service-level | Style | SKU | Category | Size | ASIN | Courier Status | Qty | currency | Amount | ship-city | ship-state | ship-postal-code | ship-country | promotion-ids | B2B | fulfilled-by | Unnamed: 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 405-8078784-5731545 | 04-30-22 | Cancelled | Merchant | Amazon.in | Standard | SET389 | SET389-KR-NP-S | Set | S | B09KXVBD7Z | | 0 | INR | 647.62 | MUMBAI | MAHARASHTRA | 400081.0 | IN | | False | Easy Ship | |
| 1 | 1 | 171-9198151-1101146 | 04-30-22 | Shipped - Delivered to Buyer | Merchant | Amazon.in | Standard | JNE3781 | JNE3781-KR-XXXL | kurta | 3XL | B09K3WFS32 | Shipped | 1 | INR | 406.0 | BENGALURU | KARNATAKA | 560085.0 | IN | Amazon PLCC Free-Financing Universal Merchant ... | False | Easy Ship | |
| 2 | 2 | 404-0687676-7273146 | 04-30-22 | Shipped | Amazon | Amazon.in | Expedited | JNE3371 | JNE3371-KR-XL | kurta | XL | B07WV4JV4D | Shipped | 1 | INR | 329.0 | NAVI MUMBAI | MAHARASHTRA | 410210.0 | IN | IN Core Free Shipping 2015/04/08 23-48-5-108 | True | | |
| 3 | 3 | 403-9615377-8133951 | 04-30-22 | Cancelled | Merchant | Amazon.in | Standard | J0341 | J0341-DR-L | Western Dress | L | B099NRCT7B | | 0 | INR | 753.33 | PUDUCHERRY | PUDUCHERRY | 605008.0 | IN | | False | Easy Ship | |
| 4 | 4 | 407-1069790-7240320 | 04-30-22 | Shipped | Amazon | Amazon.in | Expedited | JNE3671 | JNE3671-TU-XXXL | Top | 3XL | B098714BZP | Shipped | 1 | INR | 574.0 | CHENNAI | TAMIL NADU | 600073.0 | IN | | False | | |
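
In the rows above, `ship-postal-code` is parsed as a float (`400081.0`), which is awkward for an identifier-like field. A minimal sketch of one way to turn it into a clean text column follows; the two-row CSV is toy data, and routing through pandas' nullable `Int64` type is an assumption about how one might handle it, not the notebook's method.

```python
import io

import pandas as pd

# Toy CSV mimicking the float-formatted postal codes seen in the sample
csv_text = "Order ID,ship-postal-code\nA-1,400081.0\nA-2,\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df["ship-postal-code"].dtype)  # parsed as a float column

# float -> nullable integer -> string drops the ".0" and keeps missing values as <NA>
df["ship-postal-code"] = df["ship-postal-code"].astype("Int64").astype("string")
print(df["ship-postal-code"].tolist())
```

Storing postal codes as text would also avoid meaningless statistics such as a mean postal code in `describe()` output.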


## Step 5: Identify Key Business Columns

In [11]:
# Identify potential date columns
print("📅 POTENTIAL DATE COLUMNS:")
print("=" * 40)
date_keywords = ['date', 'time', 'day', 'month', 'year', 'created', 'updated', 'shipped']
date_cols = []

for col in df_sample.columns:
    if any(keyword in col.lower() for keyword in date_keywords):
        date_cols.append(col)
        print(f"✓ {col} ({df_sample[col].dtype})")
        print(f"  Sample values: {df_sample[col].dropna().head(3).tolist()}")
        print()

if not date_cols:
    print("⚠️ No obvious date columns found. Let's check all columns manually.")

📅 POTENTIAL DATE COLUMNS:
✓ Date (object)
  Sample values: ['04-30-22', '04-30-22', '04-30-22']
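
The `Date` column is stored as text such as `'04-30-22'`. Since 30 cannot be a month, the values look like month-day-two-digit-year. A small sketch of parsing them under that assumption (the format string is inferred from the sample values, not documented by the dataset):

```python
import pandas as pd

# Two values from the sample plus one deliberately invalid toy value
dates = pd.Series(["04-30-22", "04-30-22", "not-a-date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(dates, format="%m-%d-%y", errors="coerce")
months = parsed.dt.to_period("M")

print(parsed.iloc[0])
print(months.iloc[0])
print(int(parsed.isna().sum()), "unparseable value(s)")
```

Counting the `NaT` values after parsing the full file would be a cheap way to confirm that the assumed format holds beyond the first 1,000 rows.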



In [12]:
# Identify potential revenue/money columns
print("💰 POTENTIAL REVENUE/MONEY COLUMNS:")
print("=" * 40)
money_keywords = ['price', 'amount', 'revenue', 'total', 'cost', 'value', 'sales', 'profit']
money_cols = []

for col in df_sample.columns:
    if any(keyword in col.lower() for keyword in money_keywords):
        money_cols.append(col)
        print(f"✓ {col} ({df_sample[col].dtype})")
        print(f"  Sample values: {df_sample[col].dropna().head(3).tolist()}")
        if df_sample[col].dtype in ['float64', 'int64']:
            print(f"  Range: ${df_sample[col].min():.2f} - ${df_sample[col].max():.2f}")
        print()

if not money_cols:
    print("⚠️ No obvious money columns found. Let's check numeric columns.")

💰 POTENTIAL REVENUE/MONEY COLUMNS:
✓ Sales Channel  (object)
  Sample values: ['Amazon.in', 'Amazon.in', 'Amazon.in']

✓ Amount (float64)
  Sample values: [647.62, 406.0, 329.0]
  Range: $0.00 - $2224.00
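
The keyword search above also flags `Sales Channel` because its name contains "sales", although it is a text column. One way to reduce such false positives, sketched here on toy data, is to require a numeric dtype as well. Note too that the sample's `currency` column contains `INR`, so the `$` in the printed range is only a formatting choice of the cell above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame reusing three column names from the sample
df = pd.DataFrame({
    "Sales Channel ": ["Amazon.in", "Amazon.in"],
    "Amount": [647.62, 406.0],
    "Qty": [0, 1],
})

money_keywords = ["price", "amount", "revenue", "total", "cost", "value", "sales", "profit"]

# Keep a column only if its name matches a keyword AND it holds numbers
money_cols = [
    col for col in df.columns
    if any(keyword in col.lower() for keyword in money_keywords)
    and is_numeric_dtype(df[col])
]
print(money_cols)
```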



In [13]:
# Identify potential category columns
print("🏷️ POTENTIAL CATEGORY COLUMNS:")
print("=" * 40)
category_keywords = ['category', 'type', 'class', 'group', 'segment', 'product', 'item']
category_cols = []

for col in df_sample.columns:
    if any(keyword in col.lower() for keyword in category_keywords):
        category_cols.append(col)
        unique_count = df_sample[col].nunique()
        print(f"✓ {col} ({df_sample[col].dtype})")
        print(f"  Unique values: {unique_count}")
        print(f"  Sample categories: {df_sample[col].dropna().unique()[:5].tolist()}")
        print()

if not category_cols:
    print("⚠️ No obvious category columns found. Let's check all text columns.")

🏷️ POTENTIAL CATEGORY COLUMNS:
✓ Category (object)
  Unique values: 8
  Sample categories: ['Set', 'kurta', 'Western Dress', 'Top', 'Ethnic Dress']



In [14]:
# Identify potential order status columns
print("📊 POTENTIAL ORDER STATUS COLUMNS:")
print("=" * 40)
status_keywords = ['status', 'state', 'condition', 'fulfillment', 'delivery', 'shipped', 'cancelled']
status_cols = []

for col in df_sample.columns:
    if any(keyword in col.lower() for keyword in status_keywords):
        status_cols.append(col)
        unique_vals = df_sample[col].dropna().unique()
        print(f"✓ {col} ({df_sample[col].dtype})")
        print(f"  Unique statuses: {len(unique_vals)}")
        print(f"  Status values: {unique_vals.tolist()}")
        print()

if not status_cols:
    print("⚠️ No obvious status columns found. Let's check all categorical columns.")

📊 POTENTIAL ORDER STATUS COLUMNS:
✓ Status (object)
  Unique statuses: 5
  Status values: ['Cancelled', 'Shipped - Delivered to Buyer', 'Shipped', 'Shipped - Returned to Seller', 'Shipped - Rejected by Buyer']

✓ Courier Status (object)
  Unique statuses: 3
  Status values: ['Shipped', 'Cancelled', 'Unshipped']

✓ ship-state (object)
  Unique statuses: 33
  Status values: ['MAHARASHTRA', 'KARNATAKA', 'PUDUCHERRY', 'TAMIL NADU', 'UTTAR PRADESH', 'CHANDIGARH', 'TELANGANA', 'ANDHRA PRADESH', 'RAJASTHAN', 'DELHI', 'HARYANA', 'ASSAM', 'JHARKHAND', 'CHHATTISGARH', 'ODISHA', 'KERALA', 'MADHYA PRADESH', 'WEST BENGAL', 'NAGALAND', 'Gujarat', 'UTTARAKHAND', 'BIHAR', 'JAMMU & KASHMIR', 'PUNJAB', 'HIMACHAL PRADESH', 'ARUNACHAL PRADESH', 'MANIPUR', 'Goa', 'MEGHALAYA', 'GOA', 'TRIPURA', 'LADAKH', 'DADRA AND NAGAR']
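
The `ship-state` values above mix letter case (`'Goa'` and `'GOA'`, `'Gujarat'` alongside upper-case names), which inflates the unique count. A minimal normalization sketch on toy values follows; upper-casing and the extra `strip()` are assumptions about a reasonable convention, not a rule taken from the dataset.

```python
import pandas as pd

# Toy values; the leading space and the None are added for illustration
states = pd.Series(["GOA", "Goa", "Gujarat", " MAHARASHTRA", None])

# Trim whitespace and upper-case; missing values stay missing
normalized = states.str.strip().str.upper()

print(states.nunique(), "distinct values before")
print(normalized.nunique(), "distinct values after")
```

Note that `ship-state` matched the `state` keyword only by name; it is a geography column rather than an order status.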



## Step 6: Data Quality Assessment

In [15]:
# Check for missing values
print("🔍 MISSING VALUES ANALYSIS:")
print("=" * 50)
missing_data = df_sample.isnull().sum()
missing_percent = (missing_data / len(df_sample)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing %': missing_percent
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if not missing_df.empty:
    display(missing_df)
else:
    print("✅ No missing values found in sample data!")

🔍 MISSING VALUES ANALYSIS:

| | Missing Count | Missing % |
|---|---|---|
| Unnamed: 22 | 1000 | 100.0 |
| fulfilled-by | 788 | 78.8 |
| promotion-ids | 328 | 32.8 |
| currency | 70 | 7.0 |
| Amount | 70 | 7.0 |
| Courier Status | 40 | 4.0 |
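
`Unnamed: 22` is empty in every sampled row. If that also holds for the full file (the sample alone cannot confirm it), such columns could be detected and dropped before loading. A sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame: one partially filled column and one entirely empty column
df = pd.DataFrame({
    "Amount": [647.62, np.nan],
    "Unnamed: 22": [np.nan, np.nan],
})

# Columns in which every value is missing
empty_cols = [col for col in df.columns if df[col].isna().all()]
df_trimmed = df.drop(columns=empty_cols)

print("Dropped:", empty_cols)
print("Remaining:", list(df_trimmed.columns))
```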


In [16]:
# Check data types distribution
print("📈 DATA TYPES DISTRIBUTION:")
print("=" * 30)
dtype_counts = df_sample.dtypes.value_counts()
for dtype, count in dtype_counts.items():
    print(f"{dtype}: {count} columns")

# Show numeric columns statistics
numeric_cols = df_sample.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 0:
    print(f"\n🔢 NUMERIC COLUMNS SUMMARY:")
    print("=" * 40)
    display(df_sample[numeric_cols].describe())

📈 DATA TYPES DISTRIBUTION:
object: 18 columns
float64: 3 columns
int64: 2 columns
bool: 1 columns

🔢 NUMERIC COLUMNS SUMMARY:

| | index | Qty | Amount | ship-postal-code | Unnamed: 22 |
|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 930.0 | 1000.0 | 0.0 |
| mean | 499.5 | 0.914 | 630.399817 | 477829.116 | |
| std | 288.819436 | 0.541199 | 282.928469 | 196377.271646 | |
| min | 0.0 | 0.0 | 0.0 | 110007.0 | |
| 25% | 249.75 | 1.0 | 432.0 | 386001.0 | |
| 50% | 499.5 | 1.0 | 568.0 | 500074.0 | |
| 75% | 749.25 | 1.0 | 752.7475 | 600088.25 | |
| max | 999.0 | 15.0 | 2224.0 | 855102.0 | |


## Step 7: Estimate Full Dataset Size

In [17]:
# Estimate total rows and memory requirements
print("📏 FULL DATASET ESTIMATION:")
print("=" * 40)

try:
    # Count total lines in file (this might take a moment for large files)
    print("Counting total rows... (this may take a moment)")
    with open(csv_file, 'r', encoding='utf-8') as f:
        total_rows = sum(1 for line in f) - 1  # Subtract header
    
    print(f"📊 Total rows in file: {total_rows:,}")
    
    # Estimate memory requirements
    sample_memory_mb = df_sample.memory_usage(deep=True).sum() / (1024 * 1024)
    estimated_full_memory = (sample_memory_mb / len(df_sample)) * total_rows
    
    print(f"💾 Sample memory usage: {sample_memory_mb:.2f} MB")
    print(f"💾 Estimated full dataset memory: {estimated_full_memory:.2f} MB")

    if estimated_full_memory > 1000:
        print(f"⚠️ Large dataset! Consider chunked processing.")
    else:
        print(f"✅ Dataset size manageable for in-memory processing.")

except Exception as e:
    print(f"⚠️ Could not estimate full dataset size: {e}")

📏 FULL DATASET ESTIMATION:
Counting total rows... (this may take a moment)
📊 Total rows in file: 128,975
💾 Sample memory usage: 1.37 MB
💾 Estimated full dataset memory: 176.19 MB
✅ Dataset size manageable for in-memory processing.
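
The estimate is a linear extrapolation from the sample: memory per sampled row multiplied by the total row count. The arithmetic can be reproduced from the figures printed in this notebook (1398.88 KB for 1,000 rows, 128,975 rows in total). It assumes the first 1,000 rows are representative of the whole file.

```python
# Figures reported earlier in this notebook
sample_rows = 1000
sample_memory_kb = 1398.88
total_rows = 128_975

# Memory per row in the sample, scaled to the full row count
per_row_kb = sample_memory_kb / sample_rows
estimated_mb = per_row_kb * total_rows / 1024

print(f"Per-row memory: {per_row_kb:.3f} KB")
print(f"Estimated full dataset memory: {estimated_mb:.2f} MB")
```

The line count used for `total_rows` counts physical lines, so it would overstate the number of records if any quoted field contained a line break.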


### 🔧 Data Cleaning Strategy

Before proceeding with the Dagster pipeline, we need to address the critical missing values found in the 1,000-row sample:
- **Amount** (7% missing): Essential for revenue analysis
- **currency** (7% missing): Needed for currency normalization
- **Courier Status** (4% missing): Important for order tracking

In [18]:
# Investigate missing Amount values pattern
print("🔍 MISSING AMOUNT VALUES ANALYSIS:")
print("=" * 50)

# Check missing Amount records
missing_amount = df_sample[df_sample['Amount'].isna()]
print(f"📊 Missing Amount records: {len(missing_amount)}")

if len(missing_amount) > 0:
    print(f"📊 Sample of missing Amount records:")
    print("\n🔍 Status distribution for missing Amount:")
    print(missing_amount['Status'].value_counts())

    print(f"\n🔍 Category distribution for missing Amount:")
    print(missing_amount['Category'].value_counts())

    print(f"\n🔍 Currency status for missing Amount:")
    print(missing_amount['currency'].value_counts(dropna=False))

    # Check if missing Amount correlates with missing Currency
    both_missing = df_sample[(df_sample['Amount'].isna()) & (df_sample['currency'].isna())]
    print(f"\n📊 Records missing both Amount AND Currency: {len(both_missing)}")

    print(f"\n🔍 First few missing Amount records:")
    display(missing_amount[['Date', 'Status', 'Category', 'Amount', 'currency', 'Qty']].head())

🔍 MISSING AMOUNT VALUES ANALYSIS:
📊 Missing Amount records: 70
📊 Sample of missing Amount records:

🔍 Status distribution for missing Amount:
Status
Cancelled    69
Shipped       1
Name: count, dtype: int64

🔍 Category distribution for missing Amount:
Category
kurta            30
Set              26
Western Dress     7
Top               6
Blouse            1
Name: count, dtype: int64

🔍 Currency status for missing Amount:
currency
NaN    70
Name: count, dtype: int64

📊 Records missing both Amount AND Currency: 70

🔍 First few missing Amount records:

| | Date | Status | Category | Amount | currency | Qty |
|---|---|---|---|---|---|---|
| 8 | 04-30-22 | Cancelled | Set | | | 0 |
| 29 | 04-30-22 | Cancelled | kurta | | | 0 |
| 65 | 04-30-22 | Cancelled | kurta | | | 0 |
| 84 | 04-30-22 | Cancelled | kurta | | | 0 |
| 95 | 04-30-22 | Cancelled | kurta | | | 0 |


In [19]:
# Data Cleaning Strategy for Missing Values
print("\n🔧 DATA CLEANING STRATEGY:")
print("=" * 50)

print("✅ STRATEGY FOR MISSING AMOUNT VALUES:")
print("1. Cancelled orders with missing Amount: SET Amount = 0 (business logic)")
print("2. Shipped orders with missing Amount: INVESTIGATE/EXCLUDE (data quality issue)")
print("3. Missing Currency: SET to 'INR' (default currency for Indian sales)")

print(f"\n📊 IMPACT ANALYSIS:")
cancelled_missing = df_sample[(df_sample['Amount'].isna()) & (df_sample['Status'] == 'Cancelled')]
shipped_missing = df_sample[(df_sample['Amount'].isna()) & (df_sample['Status'] != 'Cancelled')]

print(f"• Cancelled orders to set Amount=0: {len(cancelled_missing)} ({len(cancelled_missing)/len(df_sample)*100:.1f}%)")
print(f"• Non-cancelled orders with missing Amount: {len(shipped_missing)} ({len(shipped_missing)/len(df_sample)*100:.1f}%)")

print(f"\n🔍 Non-cancelled missing Amount details:")
if len(shipped_missing) > 0:
    display(shipped_missing[['Date', 'Status', 'Category', 'Amount', 'currency', 'Qty']])
else:
    print("None found - all missing Amount are cancelled orders!")


🔧 DATA CLEANING STRATEGY:
✅ STRATEGY FOR MISSING AMOUNT VALUES:
1. Cancelled orders with missing Amount: SET Amount = 0 (business logic)
2. Shipped orders with missing Amount: INVESTIGATE/EXCLUDE (data quality issue)
3. Missing Currency: SET to 'INR' (default currency for Indian sales)

📊 IMPACT ANALYSIS:
• Cancelled orders to set Amount=0: 69 (6.9%)
• Non-cancelled orders with missing Amount: 1 (0.1%)

🔍 Non-cancelled missing Amount details:

| | Date | Status | Category | Amount | currency | Qty |
|---|---|---|---|---|---|---|
| 937 | 04-30-22 | Shipped | Blouse | | | 15 |


In [20]:
# Implement Data Cleaning Function
def clean_sales_data(df):
    """
    Clean Amazon sales data with business logic for missing values
    """
    df_clean = df.copy()
    
    print("🧹 APPLYING DATA CLEANING RULES:")
    print("=" * 40)
    
    # Rule 1: Set Amount = 0 for cancelled orders with missing Amount
    cancelled_mask = (df_clean['Status'] == 'Cancelled') & (df_clean['Amount'].isna())
    before_cancelled = cancelled_mask.sum()
    df_clean.loc[cancelled_mask, 'Amount'] = 0.0
    print(f"✅ Set Amount=0 for {before_cancelled} cancelled orders")
    
    # Rule 2: Handle the 1 shipped order with missing Amount (investigate)
    shipped_missing = (df_clean['Status'] != 'Cancelled') & (df_clean['Amount'].isna())
    if shipped_missing.sum() > 0:
        print(f"⚠️  Found {shipped_missing.sum()} non-cancelled orders with missing Amount:")
        print("   → Recommendation: Exclude from analysis or investigate further")
        # For now, we'll flag but keep the record
        df_clean.loc[shipped_missing, 'data_quality_flag'] = 'missing_amount_shipped'
    
    # Rule 3: Set default currency to INR for missing currency
    currency_missing = df_clean['currency'].isna()
    before_currency = currency_missing.sum()
    df_clean.loc[currency_missing, 'currency'] = 'INR'
    print(f"✅ Set currency=INR for {before_currency} records with missing currency")
    
    # Summary
    remaining_amount_nulls = df_clean['Amount'].isna().sum()
    remaining_currency_nulls = df_clean['currency'].isna().sum()
    
    print(f"\n📊 CLEANING RESULTS:")
    print(f"• Amount nulls remaining: {remaining_amount_nulls}")
    print(f"• Currency nulls remaining: {remaining_currency_nulls}")
    
    return df_clean

# Apply cleaning to sample data
print("🔄 Applying cleaning to sample data...")
df_sample_clean = clean_sales_data(df_sample)

# Verify results
print(f"\n✅ VERIFICATION:")
print(f"• Original Amount nulls: {df_sample['Amount'].isna().sum()}")
print(f"• Cleaned Amount nulls: {df_sample_clean['Amount'].isna().sum()}")
print(f"• Amount=0 records: {(df_sample_clean['Amount'] == 0).sum()}")
# Explicit dtype avoids the empty-Series dtype warning when the flag column is absent
print(f"• Records with data_quality_flag: {df_sample_clean.get('data_quality_flag', pd.Series(dtype=object)).notna().sum()}")

🔄 Applying cleaning to sample data...
🧹 APPLYING DATA CLEANING RULES:
✅ Set Amount=0 for 69 cancelled orders
⚠️  Found 1 non-cancelled orders with missing Amount:
   → Recommendation: Exclude from analysis or investigate further
✅ Set currency=INR for 70 records with missing currency

📊 CLEANING RESULTS:
• Amount nulls remaining: 1
• Currency nulls remaining: 0

✅ VERIFICATION:
• Original Amount nulls: 70
• Cleaned Amount nulls: 1
• Amount=0 records: 85
• Records with data_quality_flag: 1
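
The cleaning rules could later be turned into automated checks inside the ingestion pipeline. The sketch below is one hypothetical shape for such a check: `check_cleaned` and its three rules are illustrative assumptions layered on the column names used above, and the toy frame stands in for real cleaned data.

```python
import numpy as np
import pandas as pd


def check_cleaned(df: pd.DataFrame) -> list:
    """Return a list of rule violations found in a cleaned frame (hypothetical helper)."""
    problems = []
    if df["currency"].isna().any():
        problems.append("currency still has nulls")
    # A missing Amount is tolerated only when the row carries a quality flag
    if "data_quality_flag" in df.columns:
        flags = df["data_quality_flag"]
    else:
        flags = pd.Series(pd.NA, index=df.index, dtype="object")
    if (df["Amount"].isna() & flags.isna()).any():
        problems.append("unflagged rows with missing Amount")
    if (df["Amount"] < 0).any():
        problems.append("negative Amount values")
    return problems


# Toy cleaned data: one flagged row that still lacks an Amount
toy = pd.DataFrame({
    "Status": ["Cancelled", "Shipped", "Shipped"],
    "Amount": [0.0, np.nan, 406.0],
    "currency": ["INR", "INR", "INR"],
    "data_quality_flag": [np.nan, "missing_amount_shipped", np.nan],
})
print(check_cleaned(toy))
print(check_cleaned(toy.drop(columns="data_quality_flag")))
```

Returning a list of messages rather than raising keeps the decision about failing the run with the caller.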


🏯 DATA EXPLORATION COMPLETION - DISCOVERY & UNDERSTANDING:

**Key Findings:**
- **Dataset Size**: 128,975 rows, ~176 MB estimated in memory
- **Business Columns**: Date, Amount, Category, Status identified
- **Data Quality**: 7% of sampled rows lack Amount; 6.9% are cancelled orders (set to 0) and 0.1% (one shipped order) is flagged for investigation
- **Cleaning Strategy**: Cancelled orders → Amount=0, Missing currency → INR

**Ready for Step 1.2**: Environment & DuckDB Schema Design

## Step 8: Plan Raw Data Schema

In [21]:
# Based on our analysis, plan the database tables we need to create
print("🗄️ DATABASE SCHEMA PLANNING:")
print("=" * 50)

print("\n1️⃣ RAW DATA TABLE (amazon_sales_raw):")
print("   - Store complete CSV data as-is")
print("   - All original columns preserved")
print("   - Source for downstream transformations")

print("\n2️⃣ MONTHLY REVENUE BY CATEGORY TABLE (monthly_revenue_by_category):")
print("   - Columns needed:")
print("     * month (DATE or VARCHAR)")
print("     * product_category (VARCHAR)")
print("     * total_revenue (DECIMAL)")
print("   - Aggregated from raw data")

print("\n3️⃣ DAILY ORDERS BY STATUS TABLE (daily_orders_by_status):")
print("   - Columns needed:")
print("     * order_date (DATE)")
print("     * order_status (VARCHAR)")
print("     * order_count (INTEGER)")
print("   - Aggregated from raw data")

print("\n📋 COLUMN MAPPING NEEDED:")
print("Based on the columns we found, we need to identify:")
for i, col in enumerate(df_sample.columns, 1):
    print(f"{i:2d}. {col} -> Purpose: [TO BE DETERMINED]")

🗄️ DATABASE SCHEMA PLANNING:

1️⃣ RAW DATA TABLE (amazon_sales_raw):
   - Store complete CSV data as-is
   - All original columns preserved
   - Source for downstream transformations

2️⃣ MONTHLY REVENUE BY CATEGORY TABLE (monthly_revenue_by_category):
   - Columns needed:
     * month (DATE or VARCHAR)
     * product_category (VARCHAR)
     * total_revenue (DECIMAL)
   - Aggregated from raw data

3️⃣ DAILY ORDERS BY STATUS TABLE (daily_orders_by_status):
   - Columns needed:
     * order_date (DATE)
     * order_status (VARCHAR)
     * order_count (INTEGER)
   - Aggregated from raw data

📋 COLUMN MAPPING NEEDED:
Based on the columns we found, we need to identify:
 1. index -> Purpose: [TO BE DETERMINED]
 2. Order ID -> Purpose: [TO BE DETERMINED]
 3. Date -> Purpose: [TO BE DETERMINED]
 4. Status -> Purpose: [TO BE DETERMINED]
 5. Fulfilment -> Purpose: [TO BE DETERMINED]
 6. Sales Channel  -> Purpose: [TO BE DETERMINED]
 7. ship-service-level -> Purpose: [TO BE DE
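
To make the two planned analytical tables concrete, here is a small pandas sketch of what they might contain, built from three toy rows. The output column names follow the plan above. Treating every CSV row as one order and including cancelled orders in revenue are assumptions that the real pipeline would need to settle.

```python
import pandas as pd

# Toy rows using column names and value formats from the sample
raw = pd.DataFrame({
    "Date": ["04-30-22", "04-30-22", "05-01-22"],
    "Status": ["Cancelled", "Shipped", "Shipped"],
    "Category": ["Set", "kurta", "kurta"],
    "Amount": [0.0, 406.0, 329.0],
})
raw["order_date"] = pd.to_datetime(raw["Date"], format="%m-%d-%y")

# monthly_revenue_by_category: month, product_category, total_revenue
monthly_revenue_by_category = (
    raw.assign(month=raw["order_date"].dt.to_period("M").astype(str))
    .groupby(["month", "Category"], as_index=False)["Amount"]
    .sum()
    .rename(columns={"Category": "product_category", "Amount": "total_revenue"})
)

# daily_orders_by_status: order_date, order_status, order_count (one row = one order here)
daily_orders_by_status = (
    raw.groupby(["order_date", "Status"], as_index=False)
    .size()
    .rename(columns={"Status": "order_status", "size": "order_count"})
)

print(monthly_revenue_by_category)
print(daily_orders_by_status)
```

The same aggregations could be written as `GROUP BY` queries once the raw table exists in DuckDB; pandas is used here only to keep the sketch self-contained.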

## Step 9: Document Findings & Next Steps

In [None]:
print("🎯 DATA EXPLORATION COMPLETE - DISCOVERY & UNDERSTANDING:")
print("=" * 60)

print("\n✅ Data Understanding Complete:")
print(f"   - File size: {file_size:.2f} MB")
print(f"   - Columns: {len(df_sample.columns)}")
print(f"   - Sample rows analyzed: {len(df_sample)}")

print("\n🔧 NEXT DATA INGESTION STEPS:")
print("1.2 Environment & Connection Setup")
print("    - Set up DuckDB connection resource")
print("    - Configure Dagster resources")
print("1.3 Raw Data Schema Design")
print("    - Create raw data table in DuckDB")
print("    - Define proper data types")
print("1.4 CSV Loading Pipeline")
print("    - Build Dagster asset/op for CSV loading")
print("    - Implement error handling and validation")
print("1.5 Data Validation & Quality Checks")
print("    - Verify data integrity after loading")
print("    - Implement business rule validations")
print("1.6 Ingestion Monitoring & Logging")
print("    - Set up pipeline monitoring")
print("    - Configure logging and alerting")

print("\n📊 FUTURE STAGES (After Data Ingestion):")
print("Stage 2: Data Transformation")
print("  - Create analytical tables (monthly revenue, daily orders)")
print("  - Implement data aggregations")
print("Stage 3: Data Visualization & Analytics")
print("  - Daily orders by status chart")
print("  - Most profitable month analysis")

print("\n⚠️ KEY FINDINGS TO REMEMBER:")
print("- Date column is 'Date' (MM-DD-YY strings); revenue column is 'Amount'")
print("- Missing Amount/currency values need the cleaning rules defined above")
print("- Dataset (~176 MB estimated) fits in memory; chunked processing is optional")
print("- Plan proper error handling and logging")

print("\n🎉 Ready to proceed to Stage 1: Data Ingestion Pipeline!")

🎯 STEP 1.1 COMPLETION - DATA DISCOVERY & UNDERSTANDING:

✅ Data Understanding Complete:
   - File size: 65.73 MB
   - Columns: 24
   - Sample rows analyzed: 1000

🔧 NEXT DATA INGESTION STEPS:
1.2 Environment & Connection Setup
    - Set up DuckDB connection resource
    - Configure Dagster resources
1.3 Raw Data Schema Design
    - Create raw data table in DuckDB
    - Define proper data types
1.4 CSV Loading Pipeline
    - Build Dagster asset/op for CSV loading
    - Implement error handling and validation
1.5 Data Validation & Quality Checks
    - Verify data integrity after loading
    - Implement business rule validations
1.6 Ingestion Monitoring & Logging
    - Set up pipeline monitoring
    - Configure logging and alerting

📊 FUTURE STAGES (After Data Ingestion):
Stage 2: Data Transformation
  - Create analytical tables (monthly revenue, daily orders)
  - Implement data aggregations
Stage 3: Data Visualization & Analytics
  - Daily orders by status chart
  - Most profi