# E-commerce Data Processing Pipeline

This notebook uses the modular code from the `src` package to process e-commerce sales data.

The pipeline follows a medallion architecture:
- **Bronze Layer** (`bronze_layer` schema): Raw data ingestion
- **Silver Layer** (`processed_layer` schema): Cleaned and enriched data
- **Gold Layer** (`gold_layer` schema): Aggregated business metrics
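
The layer names map onto the schemas created in Step 1, with the silver layer stored in `processed_layer`. As a minimal sketch, the hypothetical helper below (not part of the `src` package) builds the three-part table names used in the verification cells later in this notebook.

```python
# Hypothetical helper, not part of the src package: maps each medallion layer
# to the schema name used in this notebook and builds three-part table names.
CATALOG = "ecommerce_sales"

LAYER_SCHEMAS = {
    "bronze": "bronze_layer",
    "silver": "processed_layer",
    "gold": "gold_layer",
}


def table_name(layer: str, table: str) -> str:
    """Return the fully qualified name catalog.schema.table for a layer."""
    try:
        schema = LAYER_SCHEMAS[layer]
    except KeyError:
        raise ValueError(f"Unknown layer: {layer!r}") from None
    return f"{CATALOG}.{schema}.{table}"


print(table_name("silver", "comprehensive_orders"))
# prints: ecommerce_sales.processed_layer.comprehensive_orders
```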

In [None]:
# Install required packages if not already available
%pip install openpyxl pandas

## Step 1: Create Catalog and Schemas

In [None]:
%sql
CREATE CATALOG IF NOT EXISTS ecommerce_sales COMMENT 'Ecommerce sales data processing';

In [None]:
%sql
USE CATALOG ecommerce_sales;
CREATE SCHEMA IF NOT EXISTS source_data_layer COMMENT 'Source data';
CREATE SCHEMA IF NOT EXISTS bronze_layer COMMENT 'Raw data';
CREATE SCHEMA IF NOT EXISTS processed_layer COMMENT 'Processed data';
CREATE SCHEMA IF NOT EXISTS gold_layer COMMENT 'Final publish layer - Aggregated data';

## Step 2: Create Volumes

In [None]:
%sql
USE CATALOG ecommerce_sales;

USE SCHEMA bronze_layer;
CREATE VOLUME IF NOT EXISTS ecom_volume_raw COMMENT 'Managed volume for raw files';

USE SCHEMA processed_layer;
CREATE VOLUME IF NOT EXISTS ecom_volume_processed COMMENT 'Managed volume for processed files';

USE SCHEMA gold_layer;
CREATE VOLUME IF NOT EXISTS ecom_volume_gold COMMENT 'Managed volume for final publish aggregated files';
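
Files in a Unity Catalog volume are addressed with paths of the form `/Volumes/<catalog>/<schema>/<volume>/<path>`. The sketch below shows one way to build such paths for the volumes created above. The `volume_path` helper and the file name `orders.json` are hypothetical and used only for illustration.

```python
import posixpath

CATALOG = "ecommerce_sales"


def volume_path(schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path."""
    return posixpath.join("/Volumes", CATALOG, schema, volume, *parts)


# "orders.json" is a made-up file name, not a file known to exist in the volume.
print(volume_path("bronze_layer", "ecom_volume_raw", "orders.json"))
# prints: /Volumes/ecommerce_sales/bronze_layer/ecom_volume_raw/orders.json
```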

## Step 3: Initialize Spark Session

In [None]:
from pyspark.sql import SparkSession
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

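# Get the Spark session (getOrCreate() reuses the active session if one
# already exists, as it does in a Databricks notebook)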
spark = SparkSession.builder \
    .appName("Ecommerce Data Processing") \
    .getOrCreate()

print("Spark session initialized successfully")

## Step 4: Run the Complete Data Pipeline

The pipeline will:
1. Load raw data from CSV, Excel, and JSON files
2. Enrich and transform the data
3. Create comprehensive order tables
4. Generate profit aggregations
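
The four steps run in a fixed order, each building on the previous one. The sketch below illustrates that staged pattern in plain Python. It is not the implementation of `run_pipeline`: `run_stages` and `placeholder` are hypothetical names, and the stages only record that they ran.

```python
import logging
from typing import Callable, Dict, List, Tuple

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline_sketch")

Context = Dict[str, object]
# A stage receives the shared context and returns an updated copy of it.
Stage = Callable[[Context], Context]


def run_stages(stages: List[Tuple[str, Stage]]) -> Context:
    """Run each stage in order, passing the context from one to the next."""
    context: Context = {}
    for name, stage in stages:
        logger.info("Starting stage: %s", name)
        context = stage(context)
    return context


def placeholder(label: str) -> Stage:
    """Hypothetical stand-in for a real stage; it only records that it ran."""
    def stage(context: Context) -> Context:
        done = list(context.get("done", []))
        done.append(label)
        return {**context, "done": done}
    return stage


result = run_stages([
    ("load", placeholder("load raw data")),
    ("enrich", placeholder("enrich and transform")),
    ("orders", placeholder("create comprehensive order tables")),
    ("aggregate", placeholder("generate profit aggregations")),
])
print(result["done"])
```

Keeping the stages in an explicit list makes the execution order visible in one place and lets a single stage be rerun or skipped while debugging.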

In [None]:
from src.pipeline import run_pipeline

# Run the complete pipeline
run_pipeline(spark)

## Step 5: Verify Bronze Layer Tables

In [None]:
# Check raw products data
display(spark.read.table("ecommerce_sales.bronze_layer.products_raw").limit(10))

In [None]:
# Check raw customers data
display(spark.read.table("ecommerce_sales.bronze_layer.customers_raw").limit(10))

In [None]:
# Check raw orders data
display(spark.read.table("ecommerce_sales.bronze_layer.orders_raw").limit(10))

## Step 6: Verify Silver Layer Tables

In [None]:
# Check processed products
display(spark.read.table("ecommerce_sales.processed_layer.products_processed").limit(10))

In [None]:
# Check processed customers
display(spark.read.table("ecommerce_sales.processed_layer.customer_processed").limit(10))

In [None]:
# Check comprehensive orders
display(spark.read.table("ecommerce_sales.processed_layer.comprehensive_orders").limit(10))

## Step 7: Verify Gold Layer Tables

In [None]:
# Check profit aggregates
display(spark.read.table("ecommerce_sales.gold_layer.profit_aggregates").limit(100))
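
Looking at sample rows confirms that each table exists, but an empty table can still go unnoticed. As a complementary check, the sketch below counts rows per table and reports any that are empty. `count_rows` and `empty_tables` are hypothetical helpers, not part of the `src` package. They rely only on `spark.table(name).count()`.

```python
from typing import Dict, Iterable, List


def count_rows(spark, table_names: Iterable[str]) -> Dict[str, int]:
    """Return the row count of every table, keyed by table name.

    `spark` is expected to be a SparkSession, or any object that offers a
    compatible `table(name).count()` interface.
    """
    return {name: spark.table(name).count() for name in table_names}


def empty_tables(counts: Dict[str, int]) -> List[str]:
    """List the tables whose row count is zero."""
    return [name for name, n in counts.items() if n == 0]
```

In this notebook, `count_rows` could be called with `spark` and the fully qualified table names from Steps 5 to 7. An empty list from `empty_tables` would then indicate that every layer received data.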

## Step 8: Generate SQL Reports

Generate the business reports with `generate_sql_reports`, then query profit by year, category, and customer directly with SQL.

In [None]:
from src.pipeline import generate_sql_reports

generate_sql_reports(spark)

### Profit by Year

In [None]:
%sql
SELECT
  year,
  ROUND(SUM(profit), 2) as total_profit
FROM ecommerce_sales.processed_layer.comprehensive_orders
GROUP BY year
ORDER BY year

### Profit by Year and Category

In [None]:
%sql
SELECT
  year,
  Category,
  ROUND(SUM(profit), 2) as total_profit
FROM ecommerce_sales.processed_layer.comprehensive_orders
GROUP BY year, Category
ORDER BY year, Category

### Profit by Customer

In [None]:
%sql
SELECT
  customer_id,
  customer_name,
  ROUND(SUM(profit), 2) as total_profit
FROM ecommerce_sales.processed_layer.comprehensive_orders
GROUP BY customer_id, customer_name
ORDER BY total_profit DESC
LIMIT 20

### Profit by Customer and Year

In [None]:
%sql
SELECT
  customer_id,
  customer_name,
  year,
  ROUND(SUM(profit), 2) as total_profit
FROM ecommerce_sales.processed_layer.comprehensive_orders
GROUP BY customer_id, customer_name, year
ORDER BY customer_name, year
LIMIT 50
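
The four queries in this section share one shape: they sum `profit` from `comprehensive_orders`, group by a set of columns, and optionally sort and limit the result. The sketch below builds such a query from its varying parts. `profit_query` is a hypothetical helper, not a function from the `src` package, and it interpolates column names directly, so it is only suitable for trusted, hard-coded names.

```python
from typing import Optional, Sequence

# Table queried by all four reports in this section.
ORDERS_TABLE = "ecommerce_sales.processed_layer.comprehensive_orders"


def profit_query(
    group_by: Sequence[str],
    order_by: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Build a query that sums profit grouped by the given columns.

    Column names are inserted as-is; pass only trusted, hard-coded names.
    """
    if not group_by:
        raise ValueError("group_by must contain at least one column")
    columns = ", ".join(group_by)
    lines = [
        f"SELECT {columns}, ROUND(SUM(profit), 2) AS total_profit",
        f"FROM {ORDERS_TABLE}",
        f"GROUP BY {columns}",
        # Default to ordering by the grouping columns.
        f"ORDER BY {order_by or columns}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


print(profit_query(["customer_id", "customer_name"], "total_profit DESC", 20))
```

The returned string could be passed to `spark.sql(...)` and the result shown with `display`.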

## Optional: Run Individual Pipeline Components

You can also run individual components of the pipeline separately:

In [None]:
# Example: Load and enrich only product data
from src.ingestion import load_products_data
from src.transformations import enrich_product_data

raw_products = load_products_data(spark)
enriched_products = enrich_product_data(raw_products)
display(enriched_products)