# Ray Data for ETL: A Comprehensive Beginner's Guide

This notebook provides a complete introduction to Ray Data for Extract, Transform, Load (ETL) workflows. We'll cover both the practical aspects of building ETL pipelines and the underlying architecture that makes Ray Data powerful for distributed data processing.

<div class="alert alert-block alert-info">
<b>Learning Roadmap:</b>
<ul>
    <li><b>Part 1:</b> What is Ray Data and ETL?</li>
    <li><b>Part 2:</b> Ray Data Architecture & Concepts</li>
    <li><b>Part 3:</b> Extract - Reading Data</li>
    <li><b>Part 4:</b> Transform - Processing Data</li>
    <li><b>Part 5:</b> Load - Writing Data</li>
    <li><b>Part 6:</b> Advanced ETL Patterns</li>
    <li><b>Part 7:</b> Performance & Best Practices</li>
    <li><b>Part 8:</b> Troubleshooting Common Issues</li>
</ul>
</div>


## Setup and Imports

Let's start by importing the necessary libraries and setting up our environment.


In [None]:
import ray
import pandas as pd
import numpy as np
import pyarrow as pa
from typing import Dict, Any
import time

# Initialize Ray
if not ray.is_initialized():
    ray.init()

print(f"Ray version: {ray.__version__}")
print(f"Ray cluster resources: {ray.cluster_resources()}")


## Part 1: What is Ray Data and ETL?

### Understanding ETL

**ETL** stands for Extract, Transform, Load - a fundamental pattern in data engineering:

- **Extract**: Reading data from various sources (databases, files, APIs, etc.)
- **Transform**: Processing, cleaning, and enriching the data
- **Load**: Writing the processed data to destination systems
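
Before any distributed machinery is involved, the three stages can be sketched in a few lines of plain Python. This is only an illustration: the CSV text, the column names, and the `extract`/`transform`/`load` helpers are made up for the example and are not part of Ray Data.

```python
import csv
import io
import json

# Hypothetical source data; the second record is deliberately malformed
RAW_CSV = "order_id,amount\n1,10.50\n2,not_a_number\n3,4.25\n"

def extract(text):
    """Extract: parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with an unparsable amount and add a derived column."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # cleaning step: skip malformed records
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": amount,
            "is_large": amount > 5,
        })
    return cleaned

def load(rows, sink):
    """Load: write one JSON document per line to a file-like destination."""
    for row in rows:
        sink.write(json.dumps(row) + "\n")

sink = io.StringIO()
load(transform(extract(RAW_CSV)), sink)
etl_output = sink.getvalue().splitlines()
print(etl_output)
```

Ray Data keeps this same shape, but runs each stage over many blocks of data in parallel.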

### What is Ray Data?

Ray Data is a distributed data processing library built on top of Ray that has recently reached **General Availability (GA)**. As the fastest-growing use case for Ray, it's designed to handle **both traditional ETL/ML workloads and next-generation AI applications**, providing a unified platform that scales from CPU clusters to heterogeneous GPU environments.

<div class="alert alert-block alert-success">
<b>Ray Data: One Platform for All Data Workloads</b><br>
Ray Data is part of Ray, the AI Compute Engine that now orchestrates <b>over 1 million clusters per month</b>. Whether you're running traditional ETL on CPU clusters or cutting-edge multimodal AI pipelines, Ray Data provides a unified solution that evolves with your needs.
</div>

<div class="alert alert-block alert-info">
<b>Ray Data: From Traditional to Transformational:</b>
<ul>
    <li><b>Traditional ETL:</b> Excellent for structured data processing, business intelligence, and reporting</li>
    <li><b>ML Workflows:</b> Perfect for feature engineering, model training pipelines, and batch scoring</li>
    <li><b>Scalable Processing:</b> Automatically scales from single machines to thousands of CPU cores</li>
    <li><b>Future-Ready:</b> Seamlessly extends to GPU workloads and multimodal data when needed</li>
    <li><b>Python-Native:</b> No JVM overhead - pure Python performance at scale</li>
    <li><b>Streaming Architecture:</b> Handle datasets larger than memory with ease</li>
</ul>
</div>

### Today's Workloads, Tomorrow's Possibilities

Ray Data excels across the entire spectrum of data processing needs:

**Traditional & Current Workloads:**
- **Business ETL**: Customer analytics, financial reporting, operational dashboards
- **Classical ML**: Recommendation systems, fraud detection, predictive analytics
- **Data Engineering**: Large-scale data cleaning, transformation, and aggregation
- **BI Pipelines**: Data warehouse loading, metric computation, and reporting

**Next-Generation Workloads:**
- **Multimodal AI**: Processing text, images, video, and audio together
- **LLM Pipelines**: Fine-tuning, embedding generation, and batch inference
- **Computer Vision**: Image preprocessing and model inference at scale
- **Compound AI Systems**: Orchestrating multiple models and traditional ML

### Ray Data vs Traditional Tools

Let's understand how Ray Data compares to other data processing tools across traditional and modern workloads:

| Feature | Ray Data | Pandas | Spark | Dask |
|---------|----------|--------|-------|------|
| **Traditional ETL** | Excellent | Good | Excellent | Good |
| **Scale** | Multi-machine | Single-machine | Multi-machine | Multi-machine |
| **Memory** | Streaming | In-memory | Mixed | Lazy evaluation |
| **Python Performance** | Native (no JVM) | Native | JVM overhead | Native |
| **CPU Clusters** | Optimized | Single-node | Good | Good |
| **GPU Support** | Native | None | Limited | Limited |
| **Classical ML** | Excellent | Limited | Good | Good |
| **Multimodal Data** | Optimized | Limited | Limited | Limited |
| **Fault Tolerance** | Built-in | None | Built-in | Limited |

### Real-World Impact Across All Workloads

Organizations worldwide are seeing dramatic results with Ray Data for both traditional and advanced workloads:

**Traditional ETL & Analytics:**
- **Amazon**: Migrated an exabyte-scale workload from Spark to Ray Data, cutting costs by **82%** and saving **$120 million annually**
- **Instacart**: Processing **100x more data** for recommendation systems and business analytics
- **Financial Services**: Major banks using Ray Data for fraud detection and risk analytics at scale

**Modern AI & ML:**
- **Niantic**: Reduced code complexity by **85%** while scaling AR/VR data pipelines
- **Canva**: Cut cloud costs in **half** while processing design assets and user data
- **Pinterest**: Boosted GPU utilization to **90%+** for image processing and recommendations

Ray Data provides a unified platform that excels at traditional ETL, classical ML, and next-generation AI workloads - eliminating the need for multiple specialized systems.


## Part 2: Ray Data Architecture & Concepts

### The AI Compute Engine Architecture

Ray Data is built on Ray, the AI Compute Engine that powers the most demanding AI workloads in production. Ray's architecture addresses the core challenges of modern AI infrastructure:

<div class="alert alert-block alert-success">
<b>Ray: Built for the AI Era</b><br>
Unlike traditional distributed systems designed for structured data and CPU workloads, Ray was purpose-built for:<br>
<ul>
    <li><b>Python-Native:</b> No JVM overhead or serialization bottlenecks</li>
    <li><b>Heterogeneous Compute:</b> Seamlessly orchestrates CPUs, GPUs, and other accelerators</li>
    <li><b>Dynamic Workloads:</b> Adapts to varying compute needs in real-time</li>
    <li><b>Fault Tolerance:</b> Handles failures gracefully at massive scale</li>
</ul>
Ray now supports clusters up to <b>8,000 nodes</b> with 4x improved scalability.
</div>

### Core Concepts

Before diving into ETL examples, let's understand the fundamental concepts that power Ray Data.

#### 1. Datasets and Blocks

A **Dataset** in Ray Data is a distributed collection of data that's divided into **blocks**. Think of blocks as chunks of your data that can be processed independently.

<div class="alert alert-block alert-info">
<b>Understanding Blocks:</b>
<ul>
    <li>Each block contains a subset of your data (typically 1-128 MB)</li>
    <li>Blocks are stored in Ray's distributed object store</li>
    <li>Operations are applied to blocks in parallel across the cluster</li>
    <li>Block size affects performance - too small causes overhead, too large causes memory issues</li>
</ul>
</div>
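
To make the idea of blocks concrete, here is a minimal sketch in plain Python. It is an analogy, not Ray Data's implementation: threads stand in for cluster tasks, and `split_into_blocks` and `process_block` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(rows, rows_per_block):
    """Chunk a list of rows into fixed-size blocks (the last block may be smaller)."""
    return [rows[i:i + rows_per_block] for i in range(0, len(rows), rows_per_block)]

def process_block(block):
    """Work applied to one block, independently of all other blocks."""
    return [{"id": row["id"], "value_doubled": row["value"] * 2} for row in block]

rows = [{"id": i, "value": i * 0.5} for i in range(10)]
blocks = split_into_blocks(rows, rows_per_block=4)

# Each block is handed to a separate worker; threads stand in for cluster tasks here.
with ThreadPoolExecutor(max_workers=3) as pool:
    processed_blocks = list(pool.map(process_block, blocks))

block_sizes = [len(block) for block in blocks]
print(block_sizes)
```

With very small blocks, the per-block bookkeeping dominates the useful work. With very large blocks, each worker has to hold more data in memory at once.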

### Performance Innovations: RayTurbo

For production workloads, **Anyscale** offers **RayTurbo**, an optimized runtime with 30+ performance improvements:

<div class="alert alert-block alert-warning">
<b>RayTurbo Performance Improvements:</b>
<ul>
    <li><b>Ray Data:</b> Up to 4.5x faster with streaming metadata fetching</li>
    <li><b>Ray Serve:</b> Up to 56% faster inference with replica compaction</li>
    <li><b>Ray Train:</b> Up to 60% cost reduction with elastic training on spot instances</li>
    <li><b>Autoscaling:</b> 5.1x faster node autoscaling</li>
    <li><b>Batch Inference:</b> Up to 6x lower costs compared to AWS Bedrock</li>
</ul>
</div>


In [None]:
# Let's create a simple dataset to understand blocks
# Create sample data as dictionaries so that every field becomes a named column
data = [{"id": i, "name": f"name_{i}", "value": np.random.rand()} for i in range(1000)]
ds = ray.data.from_items(data)

print(f"Dataset: {ds}")
print(f"Number of blocks: {ds.num_blocks()}")
print(f"Schema: {ds.schema()}")

# Look at a few rows
print("\nFirst 3 rows:")
for i, row in enumerate(ds.take(3)):
    print(f"Row {i}: {row}")


#### 2. Lazy vs Eager Execution

Ray Data uses **lazy execution** by default, meaning operations are not executed immediately but are planned and optimized before execution.

**Lazy Execution Benefits:**
- **Optimization**: Ray Data can optimize the entire pipeline before execution
- **Memory efficiency**: Only necessary data is loaded into memory
- **Fault tolerance**: Because the plan is recorded, lost intermediate results can be recomputed after a failure

<div class="alert alert-block alert-warning">
<b>Understanding Execution:</b><br>
<b>Lazy:</b> Build a plan first, then execute (default)<br>
<b>Eager:</b> Execute operations immediately as they're called<br><br>
Lazy execution allows Ray Data to optimize your entire pipeline for better performance!
</div>


In [None]:
# Demonstrate lazy execution
print("Creating a lazy dataset pipeline...")

# These operations are not executed yet - they're just planned
# Each row is a dict, so fields are accessed by column name
ds_lazy = (ds
    .map(lambda x: {"id": x["id"], "name": x["name"], "value": x["value"] * 2})
    .filter(lambda x: x["value"] > 1.0)
    .map(lambda x: {**x, "category": "high" if x["value"] > 1.5 else "medium"})
)

print(f"Lazy dataset: {ds_lazy}")
print("Notice: No actual computation has happened yet!")

# Execution happens when we materialize the data
print("\nExecuting pipeline...")
result = ds_lazy.take(5)  # This triggers execution
print(f"Results: {result}")
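
The difference between building a plan and running it can be illustrated without Ray at all. The `LazyPipeline` class below is a toy written for this notebook, not Ray Data's implementation: `map` and `filter` only record steps, and `take` runs them over just enough source rows to produce the requested results.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, for illustration)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # the recorded plan: ("map" | "filter", fn) pairs

    def map(self, fn):
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyPipeline(self._source, self._steps + (("filter", fn),))

    def _execute(self):
        """Run the recorded plan row by row, yielding results as they are produced."""
        for row in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                yield row

    def take(self, n):
        """Consume only as many source rows as are needed to return n results."""
        out = []
        for row in self._execute():
            out.append(row)
            if len(out) == n:
                break
        return out

calls = []  # records which source rows were actually touched

def double(row):
    calls.append(row)
    return row * 2

plan = LazyPipeline(range(1_000_000)).map(double).filter(lambda v: v % 3 == 0)
assert calls == []          # building the plan ran nothing
first_three = plan.take(3)  # execution starts here and stops early
print(first_three, len(calls))
```

Only a handful of the one million source rows are ever touched, which is the same reason `take(5)` on a large lazy dataset can return quickly.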


## Part 3: Extract - Reading Data

The **Extract** phase involves reading data from various sources. Ray Data provides built-in connectors for many common data sources and makes it easy to scale data reading across a distributed cluster, especially for the **multimodal data** that powers modern AI applications.

### The Multimodal Data Revolution

Today's AI applications process vastly more complex data than traditional ETL pipelines:

<div class="alert alert-block alert-success">
<b>The Scale of Modern Data:</b>
<ul>
    <li><b>Unstructured Data Growth:</b> Now outpaces structured data by 10x+ in most organizations</li>
    <li><b>Video Processing:</b> Companies like OpenAI (Sora), Pinterest, and Apple process petabytes of multimodal data daily</li>
    <li><b>Foundation Models:</b> Require processing millions of images, videos, and documents</li>
    <li><b>AI-Powered Processing:</b> Every aspect of data processing is becoming AI-enhanced</li>
</ul>
</div>

### How Ray Data Reads Data Under the Hood

When you read data with Ray Data, here's what happens:

1. **File Discovery**: Ray Data discovers all files matching your path pattern
2. **Task Creation**: Files are grouped into read tasks; depending on the requested parallelism, a task may read one file or several
3. **Parallel Reading**: Multiple tasks read files simultaneously across the cluster
4. **Block Creation**: Each task creates data blocks stored in Ray's object store
5. **Lazy Planning**: The dataset is created but data isn't loaded until needed
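
Steps 1 to 4 above can be mimicked with the standard library. This is a rough sketch under simplifying assumptions: threads play the role of read tasks, the temporary CSV files and the helper names (`write_sample_files`, `read_task`) are invented for the example, and the lazy planning of step 5 is not modeled.

```python
import csv
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_sample_files(directory, num_files=4, rows_per_file=3):
    """Create small CSV files so the example has something to discover."""
    for f in range(num_files):
        with open(os.path.join(directory, f"part-{f}.csv"), "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["id", "value"])
            for r in range(rows_per_file):
                writer.writerow([f * rows_per_file + r, r * 1.5])

def read_task(paths):
    """One read task: parse its assigned files into a single block (list of rows)."""
    block = []
    for path in paths:
        with open(path, newline="") as fh:
            block.extend(csv.DictReader(fh))
    return block

with tempfile.TemporaryDirectory() as tmp:
    write_sample_files(tmp)
    # 1. File discovery
    paths = sorted(glob.glob(os.path.join(tmp, "*.csv")))
    # 2. Task creation: two files per task here; the grouping is a tunable choice
    tasks = [paths[i:i + 2] for i in range(0, len(paths), 2)]
    # 3-4. Parallel reading, with each task producing one block
    with ThreadPoolExecutor(max_workers=2) as pool:
        read_blocks = list(pool.map(read_task, tasks))

total_rows = sum(len(block) for block in read_blocks)
print(len(paths), len(tasks), total_rows)
```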

This architecture enables Ray Data to efficiently handle both traditional structured data and modern unstructured formats that power AI applications.

<div class="alert alert-block alert-info">
<b>Built-in Data Sources:</b>
<ul>
    <li><b>Structured:</b> Parquet, CSV, JSON, Arrow</li>
    <li><b>Unstructured:</b> Images, Videos, Audio, Binary files</li>
    <li><b>Databases:</b> MongoDB, MySQL, PostgreSQL, Snowflake</li>
    <li><b>Cloud Storage:</b> S3, GCS, Azure Blob Storage</li>
    <li><b>Data Lakes:</b> Delta Lake, Iceberg (via RayTurbo)</li>
    <li><b>ML Formats:</b> TensorFlow Records, PyTorch datasets</li>
    <li><b>Memory:</b> Python lists, NumPy arrays, Pandas DataFrames</li>
</ul>
</div>

### Enterprise-Grade Data Connectivity

For enterprise environments, **Anyscale** provides additional connectors and optimizations:
- **Enhanced Security**: Integration with enterprise identity systems
- **Governance Controls**: Data lineage and access controls
- **Performance Optimization**: RayTurbo's streaming metadata fetching provides up to **4.5x faster** data loading
- **Hybrid Deployment**: Support for Kubernetes, on-premises, and multi-cloud environments


In [None]:
# Using TPC-H Benchmark Dataset - Industry Standard for Data Processing
# TPC-H is the gold standard benchmark for decision support systems and analytics
print(" Accessing TPC-H Benchmark Dataset (Scale Factor 1000)")
print("   The Transaction Processing Performance Council Benchmark H")
print("   - Industry standard for testing data processing systems")
print("   - Scale Factor 1000 = ~1TB of data across 8 tables")
print("   - Used by enterprises worldwide for performance evaluation")

# TPC-H S3 data location
TPCH_S3_PATH = "s3://ray-benchmark-data/tpch/parquet/sf1000"

print(f"\n TPC-H Dataset Overview:")
print(f"    Source: {TPCH_S3_PATH}")
print(f"    Scale: 1000 (approximately 1TB)")
print(f"    Use Case: Enterprise decision support and business intelligence")

# TPC-H Schema Overview
tpch_tables = {
    "customer": "Customer master data with demographics and market segments",
    "orders": "Order header information with dates, priorities, and status",
    "lineitem": "Detailed line items for each order (largest table ~6B rows)",
    "part": "Parts catalog with specifications and retail prices", 
    "supplier": "Supplier information including contact details and geography",
    "partsupp": "Part-supplier relationships with costs and availability",
    "nation": "Nation reference data with geographic regions",
    "region": "Regional groupings for geographic analysis"
}

print(f"\n TPC-H Schema (8 Tables):")
for table, description in tpch_tables.items():
    print(f"    {table.upper()}: {description}")

print(f"\n Business Scenario:")
print(f"   Global supply chain and retail operation")
print(f"   - Multi-national customer base")
print(f"   - Complex supplier relationships") 
print(f"   - Detailed transaction history")
print(f"   - Perfect for traditional BI and modern AI/ML applications")

print(f"\n TPC-H dataset ready for analysis!")
print(f" This represents real-world enterprise-scale data processing challenges")
print(f" Demonstrates Ray Data's capabilities on industry-standard benchmarks")


In [None]:
# Read TPC-H Customer Master Data (Traditional Structured Data Processing)
customers_ds = ray.data.read_parquet(f"{TPCH_S3_PATH}/customer")

print(" TPC-H Customer Master Data (Traditional ETL):")
print(f"    Schema: {customers_ds.schema()}")
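# In recent Ray releases num_blocks() is only defined for materialized datasets,
# so report the number of input files for this lazy dataset instead
print(f"    Input files: {len(customers_ds.input_files())}")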
print(f"    Total customers: {customers_ds.count():,}")
print(f"    Estimated size: {customers_ds.size_bytes() / (1024*1024):.1f} MB")

print("\n Sample customer records:")
customers_ds.show(5)


In [None]:
# Customer Market Segment Analysis - Traditional BI Workload
print(" Customer Market Segment Distribution:")
print("   Analyzing customer segments for business intelligence...")

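# Ray Data aggregates with AggregateFn objects (Count, Mean, ...) passed to aggregate();
# pandas-style named aggregation is not part of its API. Count() is emitted as "count()".
segment_analysis = customers_ds.groupby('c_mktsegment').aggregate(
    ray.data.aggregate.Count(),
    ray.data.aggregate.Mean('c_acctbal', alias_name='avg_account_balance'),
)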
segment_analysis.show(5)
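
A grouped aggregation like the one above can run in parallel because counts and sums can be computed per block and combined afterwards. The sketch below shows that pattern in plain Python. It is a simplified illustration rather than Ray Data's actual execution path, although Ray Data's `AggregateFn` interface is organized around a similar init/accumulate/merge/finalize shape. The function names and sample rows are made up.

```python
from collections import defaultdict

def accumulate_block(block, key, value):
    """Partial aggregate for one block: per-group [count, sum]."""
    partial = defaultdict(lambda: [0, 0.0])
    for row in block:
        state = partial[row[key]]
        state[0] += 1
        state[1] += row[value]
    return partial

def merge_partials(partials):
    """Combine the per-block states; only these small states cross block boundaries."""
    merged = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for group, (count, total) in partial.items():
            merged[group][0] += count
            merged[group][1] += total
    return merged

def finalize(merged):
    """Turn [count, sum] into the reported count and mean."""
    return {group: {"count": c, "mean": s / c} for group, (c, s) in merged.items()}

sample_blocks = [
    [{"segment": "AUTOMOBILE", "balance": 100.0}, {"segment": "BUILDING", "balance": 50.0}],
    [{"segment": "AUTOMOBILE", "balance": 300.0}],
]
segment_stats = finalize(
    merge_partials(accumulate_block(b, "segment", "balance") for b in sample_blocks)
)
print(segment_stats)
```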


In [None]:
# Geographic Reference Data - Nations Table
print(" TPC-H Nations Reference Data:")
print("   Loading geographic data for customer demographics...")

nation_ds = ray.data.read_parquet(f"{TPCH_S3_PATH}/nation")
print(f"    Total nations: {nation_ds.count():,}")
print(f"    Size: {nation_ds.size_bytes() / 1024:.1f} KB")

print("\n Sample nation records:")
nation_ds.show(5)


In [None]:
# Customer Demographics by Nation - Join Analysis
print(" Customer Demographics by Nation:")
print("   Joining customer and nation data for geographic analysis...")

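# Dataset.join (available in recent Ray releases) takes a join type, a partition count,
# and the key columns as tuples; num_partitions=16 is an arbitrary starting point.
customer_nation_analysis = (
    customers_ds
    .join(
        nation_ds,
        join_type='inner',
        num_partitions=16,
        on=('c_nationkey',),
        right_on=('n_nationkey',),
    )
    .groupby('n_name')
    .aggregate(
        ray.data.aggregate.Count(),  # emitted as the "count()" column
        ray.data.aggregate.Mean('c_acctbal', alias_name='avg_balance'),
        ray.data.aggregate.Sum('c_acctbal', alias_name='total_balance'),
    )
)
customer_nation_analysis.sort('count()', descending=True).show(10)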


In [None]:
# Read TPC-H High-Volume Transactional Data (Orders + Line Items)
# This demonstrates Ray Data's strength in traditional ETL: processing massive, enterprise-scale datasets

# Read Orders table (header information)
orders_ds = ray.data.read_parquet(f"{TPCH_S3_PATH}/orders")

print(" TPC-H Orders Data (Enterprise Transaction Processing):")
print(f"    Schema: {orders_ds.schema()}")
print(f"    Input files: {len(orders_ds.input_files())}")  # num_blocks() requires a materialized dataset in recent Ray releases
print(f"    Total orders: {orders_ds.count():,}")
print(f"    Estimated size: {orders_ds.size_bytes() / (1024*1024):.1f} MB")

print("\n Sample order records:")
orders_ds.show(3)

# Read Line Items table (detailed transaction data - largest table in TPC-H)
lineitem_ds = ray.data.read_parquet(f"{TPCH_S3_PATH}/lineitem")

print(f"\n TPC-H Line Items Data (Detailed Transaction Processing):")
print(f"    Schema: {lineitem_ds.schema()}")
print(f"    Input files: {len(lineitem_ds.input_files())}")  # num_blocks() requires a materialized dataset in recent Ray releases
print(f"    Total line items: {lineitem_ds.count():,}")
print(f"    Estimated size: {lineitem_ds.size_bytes() / (1024*1024):.1f} MB")

print("\n Sample line item records:")
lineitem_ds.show(3)

# Demonstrate column pruning optimization (common ETL optimization)
lineitem_subset = ray.data.read_parquet(
    f"{TPCH_S3_PATH}/lineitem",
    columns=['l_orderkey', 'l_partkey', 'l_quantity', 'l_extendedprice', 'l_discount', 'l_shipdate']
)
print(f"\n Column Pruning Optimization on Line Items:")
print(f"   Original columns: {len(lineitem_ds.schema().names)}")
print(f"   Selected columns: {len(lineitem_subset.schema().names)}")
print(f"   Data reduction: {(1 - lineitem_subset.size_bytes()/lineitem_ds.size_bytes())*100:.1f}% size reduction")

# Traditional ETL analytics - business KPIs on enterprise data
print("\n Traditional Business Analytics (Enterprise Scale - Billions of Records):")

# Order priority analysis - typical business reporting
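# aggregate() takes AggregateFn objects; alias_name sets the output column name
order_priority_analysis = orders_ds.groupby('o_orderpriority').aggregate(
    ray.data.aggregate.Count(),  # emitted as the "count()" column
    ray.data.aggregate.Mean('o_totalprice', alias_name='avg_total_price'),
    ray.data.aggregate.Sum('o_totalprice', alias_name='total_value'),
)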
print("Order Priority Distribution:")
order_priority_analysis.sort('total_value', descending=True).show()

# Time-based order analysis - common time-series analysis
orders_with_year = orders_ds.map(lambda x: {
    **x,
    'order_year': int(str(x['o_orderdate'])[:4])
})
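yearly_revenue = orders_with_year.groupby('order_year').aggregate(
    ray.data.aggregate.Count(),  # yearly orders, emitted as the "count()" column
    ray.data.aggregate.Sum('o_totalprice', alias_name='yearly_revenue'),
    ray.data.aggregate.Mean('o_totalprice', alias_name='avg_order_value'),
)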
print("\nYearly Revenue Trends:")
yearly_revenue.sort('order_year').show()


## Part 4: Transform - Processing Data

The **Transform** phase is where the real data processing happens. Ray Data provides several transformation operations that can be applied to datasets, and understanding how they work under the hood is key to building efficient ETL pipelines that power modern AI applications.

### Transformations for the AI Era

Modern AI workloads require more than traditional data transformations. Ray Data is designed for the era of **compound AI systems** and **agentic workflows** where:

<div class="alert alert-block alert-success">
<b>AI-Powered Transformations:</b>
<ul>
    <li><b>Multimodal Processing:</b> Simultaneously process text, images, video, and audio</li>
    <li><b>Model Inference:</b> Embed ML models directly into transformation pipelines</li>
    <li><b>GPU Acceleration:</b> Seamlessly utilize both CPU and GPU resources</li>
    <li><b>Compound AI:</b> Orchestrate multiple models and traditional ML within single workflows</li>
    <li><b>AI-Enhanced ETL:</b> Use AI to optimize every aspect of data processing</li>
</ul>
</div>

### How Ray Data Processes Transformations

When you apply transformations with Ray Data:

1. **Task Distribution**: Transformations are distributed across Ray tasks/actors
2. **Block-level Processing**: Each task processes one or more blocks independently  
3. **Streaming Execution**: Blocks flow through the pipeline without waiting for all data
4. **Operator Fusion**: Compatible operations are automatically combined for efficiency
5. **Heterogeneous Compute**: Intelligently schedules CPU and GPU work
6. **Fault Tolerance**: Failed tasks are automatically retried
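
Streaming execution and operator fusion (steps 3 and 4) can be sketched with generators and function composition. This is a conceptual illustration in plain Python, not how Ray Data implements either feature. `fuse`, `stream_blocks`, and the sample functions are hypothetical names.

```python
def fuse(*row_fns):
    """Operator fusion: collapse several row-wise functions into a single pass."""
    def fused(row):
        for fn in row_fns:
            row = fn(row)
        return row
    return fused

def stream_blocks(blocks, block_fn, log):
    """Streaming execution: yield each output block as soon as it is ready."""
    for index, block in enumerate(blocks):
        log.append(f"processed block {index}")
        yield block_fn(block)

def add_tax(row):
    return {**row, "total": round(row["price"] * 1.1, 2)}

def tag_size(row):
    return {**row, "size": "large" if row["total"] > 100 else "small"}

fused_row_fn = fuse(add_tax, tag_size)

source_blocks = [[{"price": 50.0}, {"price": 200.0}], [{"price": 10.0}], [{"price": 99.0}]]
event_log = []
stream = stream_blocks(
    source_blocks,
    lambda block: [fused_row_fn(row) for row in block],
    event_log,
)

first_block = next(stream)  # only the first block has been processed at this point
print(first_block, event_log)
```

Fusing the two functions means each row is visited once instead of twice. Streaming means a downstream consumer can start working on the first block while later blocks are still untouched.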

This architecture enables Ray Data to handle everything from traditional business logic to cutting-edge AI inference within the same pipeline.

<div class="alert alert-block alert-info">
<b>Transformation Categories:</b>
<ul>
    <li><b>Row-wise operations:</b> <code>map()</code> - Transform individual rows</li>
    <li><b>Batch operations:</b> <code>map_batches()</code> - Transform groups of rows (ideal for ML inference)</li>
    <li><b>Filtering:</b> <code>filter()</code> - Remove rows based on conditions</li>
    <li><b>Aggregations:</b> <code>groupby()</code> - Group and aggregate data</li>
    <li><b>Joins:</b> <code>join()</code> - Combine datasets</li>
    <li><b>AI Operations:</b> Embed models for inference, embeddings, and feature extraction</li>
    <li><b>Shuffling:</b> <code>random_shuffle()</code>, <code>sort()</code> - Reorder data</li>
</ul>
</div>
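
The practical difference between `map()` and `map_batches()` is the unit of work handed to your function: one row versus a group of rows in columnar form. The plain-Python sketch below mimics both styles. The `to_batches` helper and the column names are assumptions made for the example, and the dict-of-lists batch is a stand-in for the NumPy, pandas, or Arrow batches Ray Data would pass.

```python
def to_batches(rows, batch_size):
    """Group rows into column-oriented batches: {column: [values...]}."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {col: [row[col] for row in chunk] for col in chunk[0]}

def add_discounted_row(row):
    """Row-wise function: called once per row."""
    return {**row, "discounted": row["price"] * (1 - row["discount"])}

def add_discounted_batch(batch):
    """Batch function: called once per batch, operating on whole columns."""
    batch = dict(batch)
    batch["discounted"] = [p * (1 - d) for p, d in zip(batch["price"], batch["discount"])]
    return batch

rows = [
    {"price": 100.0, "discount": 0.1},
    {"price": 80.0, "discount": 0.0},
    {"price": 40.0, "discount": 0.5},
]

row_results = [add_discounted_row(row) for row in rows]                    # 3 function calls
batch_results = [add_discounted_batch(b) for b in to_batches(rows, 2)]     # 2 function calls

flattened = [value for batch in batch_results for value in batch["discounted"]]
print(flattened)
```

Both styles produce the same values. The batch style amortizes per-call overhead and allows vectorized operations, which is why it is usually preferred for pandas logic and model inference.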

### Enterprise-Scale Transformation Performance

With **RayTurbo**, transformation performance reaches new levels:
- **Compiled Graphs**: Up to 17x faster GPU communication and 2.8x faster multi-node performance
- **Advanced Scheduling**: Intelligent resource allocation across heterogeneous clusters
- **Memory Optimization**: Reduced overhead for small tasks and efficient peer-to-peer communication


### Practical ETL Transformations

Let's implement common ETL transformations using the TPC-H orders data loaded above:

#### 1. Data Enrichment with Business Logic


In [None]:
def traditional_etl_enrichment_tpch(batch):
    """
    Traditional ETL transformations for TPC-H business intelligence and reporting
    This demonstrates classic data warehouse-style transformations on enterprise data
    """
    df = batch.to_pandas() if hasattr(batch, 'to_pandas') else pd.DataFrame(batch)
    
    # Parse order date and create time dimensions (standard BI practice)
    df['o_orderdate'] = pd.to_datetime(df['o_orderdate'])
    df['order_year'] = df['o_orderdate'].dt.year
    df['order_quarter'] = df['o_orderdate'].dt.quarter
    df['order_month'] = df['o_orderdate'].dt.month
    df['order_day_of_week'] = df['o_orderdate'].dt.dayofweek
    
    # Business day classifications (common in traditional ETL)
    df['is_weekend'] = df['order_day_of_week'].isin([5, 6])
    df['quarter_name'] = 'Q' + df['order_quarter'].astype(str)
    df['month_name'] = df['o_orderdate'].dt.month_name()
    
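    # Revenue tiers (standard BI metric), stored as plain strings rather than a pandas categorical
    # so that the column groups and serializes predictably
    df['revenue_tier'] = pd.cut(
        df['o_totalprice'],
        bins=[0, 50000, 150000, 300000, float('inf')],
        labels=['Small', 'Medium', 'Large', 'Enterprise']
    ).astype(str)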
    
    # Order priority business rules (TPC-H specific)
    priority_weights = {
        '1-URGENT': 1.0,
        '2-HIGH': 0.8,
        '3-MEDIUM': 0.6,
        '4-NOT SPECIFIED': 0.4,
        '5-LOW': 0.2
    }
    df['priority_weight'] = df['o_orderpriority'].map(priority_weights).fillna(0.4)
    df['weighted_revenue'] = df['o_totalprice'] * df['priority_weight']
    
    # Order status analysis
    df['is_urgent'] = df['o_orderpriority'].isin(['1-URGENT', '2-HIGH'])
    df['is_large_order'] = df['o_totalprice'] > 200000
    df['requires_expedited_processing'] = df['is_urgent'] | df['is_large_order']
    
    # Date-based business logic
    df['days_to_process'] = (pd.to_datetime(df['o_orderdate']) - pd.Timestamp('1992-01-01')).dt.days
    df['is_peak_season'] = df['order_month'].isin([11, 12])  # Nov-Dec peak
    
    return df

def ml_ready_feature_engineering_tpch(batch):
    """
    Modern ML feature engineering for TPC-H data
    This prepares enterprise data for machine learning models
    """
    df = batch.to_pandas() if hasattr(batch, 'to_pandas') else pd.DataFrame(batch)
    
    # Temporal features for ML models
    df['days_since_epoch'] = (df['o_orderdate'] - pd.Timestamp('1992-01-01')).dt.days
    df['month_sin'] = np.sin(2 * np.pi * df['order_month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['order_month'] / 12)
    df['quarter_sin'] = np.sin(2 * np.pi * df['order_quarter'] / 4)
    df['quarter_cos'] = np.cos(2 * np.pi * df['order_quarter'] / 4)
    
    # Priority encoding for ML (one-hot style features)
    for priority in ['1-URGENT', '2-HIGH', '3-MEDIUM']:
        df[f'is_priority_{priority.split("-")[0]}'] = (df['o_orderpriority'] == priority).astype(int)
    
    # Revenue-based features (common in ML)
    df['log_total_price'] = np.log1p(df['o_totalprice'])  # Log transformation for ML
    df['revenue_per_priority'] = df['o_totalprice'] * df['priority_weight']
    df['weekend_large_order'] = (df['is_weekend'] & df['is_large_order']).astype(int)
    
    # Time-series features for predictive modeling
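    # Normalize against the fixed TPC-H order-date range (1992-1998) so that every batch
    # uses the same scale and a single-year batch cannot cause a division by zero
    df['year_normalized'] = (df['order_year'] - 1992) / (1998 - 1992)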
    df['seasonal_revenue_multiplier'] = np.where(df['is_peak_season'], 1.2, 1.0)
    
    # Customer key features (for customer analytics)
    df['customer_id_mod_100'] = df['o_custkey'] % 100  # Simple customer segmentation feature
    
    return df

# Traditional ETL Processing on TPC-H Data (CPU-based, enterprise business logic)
print(" Traditional ETL Processing on TPC-H Data (Business Intelligence Focus):")
print("   Processing about 1.5 billion enterprise orders with standard BI transformations...")

traditional_enriched = orders_ds.map_batches(
    traditional_etl_enrichment_tpch,
    batch_format="pyarrow",
    batch_size=10000  # Larger batches for efficiency on CPU clusters with enterprise data
)

print("\n Traditional ETL Results:")
traditional_enriched.show(3)

# ML-Ready Feature Engineering (Preparing enterprise data for model training/inference)
print("\n ML-Ready Feature Engineering (Next-Generation Capabilities):")
print("   Adding ML features for predictive analytics on enterprise transaction data...")

ml_ready_data = traditional_enriched.map_batches(
    ml_ready_feature_engineering_tpch,
    batch_format="pyarrow",
    batch_size=10000
)

print("\n ML-Ready Data Sample:")
ml_ready_data.show(3)

print(f"\n Complete TPC-H ETL Pipeline:")
print(f"    Original TPC-H columns: {len(orders_ds.schema().names)}")
print(f"    After traditional ETL: {len(traditional_enriched.schema().names)}")
print(f"    After ML enrichment: {len(ml_ready_data.schema().names)}")
print(f"    Total processing: {orders_ds.count():,} enterprise orders with {len(ml_ready_data.schema().names)} features")

# Store the enriched dataset for later use
enriched_orders = ml_ready_data
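
The `month_sin`/`month_cos` and `quarter_sin`/`quarter_cos` features above use cyclical encoding: a periodic value is placed on the unit circle so that the end of one cycle sits next to the start of the next. The short standard-library sketch below shows why this helps. The helper names are made up for the illustration.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b, period=12):
    """Euclidean distance between two encoded values."""
    (s1, c1), (s2, c2) = cyclical_encode(a, period), cyclical_encode(b, period)
    return math.hypot(s1 - s2, c1 - c2)

raw_gap_dec_jan = abs(12 - 1)                 # 11 when months are treated as plain integers
encoded_gap_dec_jan = circle_distance(12, 1)  # December to January on the circle
encoded_gap_jan_feb = circle_distance(1, 2)   # January to February on the circle
print(raw_gap_dec_jan, round(encoded_gap_dec_jan, 3), round(encoded_gap_jan_feb, 3))
```

As raw integers, December and January look eleven months apart. After encoding they are exactly as close as any other pair of adjacent months, which is what a seasonal model needs.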


In [None]:
# Traditional Business Intelligence & Data Warehouse Analytics on TPC-H Enterprise Data
print(" TPC-H Business Intelligence & Reporting (Enterprise CPU Cluster Workloads)")
print("   Generating executive dashboards and operational reports on industry-standard benchmark data...")

# Executive Summary Dashboard - typical BI metrics on enterprise data
print("\n Executive Dashboard (Traditional BI on TPC-H):")
executive_summary = (
    enriched_orders
    .groupby('order_quarter')
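    # aggregate() takes AggregateFn objects; alias_name sets the output column name
    .aggregate(
        ray.data.aggregate.Count(),  # total orders, emitted as the "count()" column
        ray.data.aggregate.Sum('o_totalprice', alias_name='total_revenue'),
        ray.data.aggregate.Mean('o_totalprice', alias_name='avg_order_value'),
        ray.data.aggregate.Sum('weighted_revenue', alias_name='weighted_revenue'),
        ray.data.aggregate.Mean('is_urgent', alias_name='urgent_order_percentage'),
    )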
)
print("Quarterly Business Performance:")
executive_summary.show()

# Operational Analytics - business process optimization
print("\n Operational Analytics (Enterprise Process Optimization):")
operational_metrics = (
    enriched_orders
    .groupby('revenue_tier')
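    .aggregate(
        ray.data.aggregate.Count(),  # order volume, emitted as the "count()" column
        ray.data.aggregate.Sum('o_totalprice', alias_name='total_revenue'),
        ray.data.aggregate.Mean('priority_weight', alias_name='avg_priority_weight'),
        ray.data.aggregate.Mean('requires_expedited_processing', alias_name='expedited_processing_rate'),
        ray.data.aggregate.Sum('is_peak_season', alias_name='peak_season_orders'),
    )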
)
print("Performance by Revenue Tier:")
operational_metrics.show()

# Priority-Based Analysis - enterprise order management
print("\n Priority-Based Analysis (Order Management Insights):")
priority_performance = (
    enriched_orders
    .groupby('o_orderpriority')
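    .aggregate(
        ray.data.aggregate.Count(),  # orders per priority, emitted as the "count()" column
        ray.data.aggregate.Sum('o_totalprice', alias_name='priority_revenue'),
        ray.data.aggregate.Mean('o_totalprice', alias_name='avg_order_value'),
        ray.data.aggregate.Mean('is_large_order', alias_name='large_order_rate'),
        ray.data.aggregate.Mean('is_weekend', alias_name='weekend_order_rate'),
    )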
)
print("Performance by Order Priority:")
priority_performance.sort('priority_revenue', descending=True).show()

# Temporal Business Analysis - time-series insights
print("\n Temporal Analysis (Time-Series Business Intelligence):")
temporal_intelligence = (
    enriched_orders
    .groupby('order_year')
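    .aggregate(
        ray.data.aggregate.Count(),  # yearly orders, emitted as the "count()" column
        ray.data.aggregate.Sum('o_totalprice', alias_name='yearly_revenue'),
        ray.data.aggregate.Mean('o_totalprice', alias_name='avg_order_value'),
        # Share of orders placed in peak season, i.e. the mean of (seasonal_revenue_multiplier > 1.0)
        ray.data.aggregate.Mean('is_peak_season', alias_name='peak_season_order_share'),
        ray.data.aggregate.Mean('is_large_order', alias_name='large_order_percentage'),
    )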
)
print("Year-over-Year Performance:")
temporal_intelligence.sort('order_year').show()

# Advanced Analytics for ML/AI Applications on Enterprise Data
print("\n Advanced Analytics (Enterprise ML/AI Preparation):")
print("   Preparing aggregated features for machine learning models on TPC-H data...")

# Customer behavior patterns - ML feature engineering on enterprise scale
customer_behavior = (
    enriched_orders
    .groupby('o_custkey')
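    .aggregate(
        ray.data.aggregate.Count(),
        ray.data.aggregate.Sum('o_totalprice', alias_name='total_lifetime_value'),
        ray.data.aggregate.Mean('o_totalprice', alias_name='avg_order_value'),
        ray.data.aggregate.Mean('priority_weight', alias_name='priority_preference'),
        ray.data.aggregate.Mean('log_total_price', alias_name='ml_features_avg_log_price'),
        ray.data.aggregate.Sum('weekend_large_order', alias_name='ml_features_weekend_large_orders'),
        ray.data.aggregate.Max('customer_id_mod_100', alias_name='customer_segment_mod'),
        ray.data.aggregate.Mean('is_peak_season', alias_name='peak_season_preference'),
    )
    # Count() is emitted as "count()"; rename it so later cells can refer to order_frequency
    .map_batches(
        lambda df: df.rename(columns={'count()': 'order_frequency'}),
        batch_format='pandas',
    )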
)

print("Customer Behavior Patterns (Enterprise ML Features):")
customer_behavior.sort('total_lifetime_value', descending=True).show(5)

# Advanced time-series features for forecasting models
monthly_trends = (
    enriched_orders
    .groupby(['order_year', 'order_month'])
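    .aggregate(
        ray.data.aggregate.Count(),  # monthly orders, emitted as the "count()" column
        ray.data.aggregate.Sum('o_totalprice', alias_name='monthly_revenue'),
        ray.data.aggregate.Mean('is_urgent', alias_name='urgent_order_ratio'),
        ray.data.aggregate.Mean('seasonal_revenue_multiplier', alias_name='seasonal_multiplier'),
    )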
)

print("\nMonthly Trends (Time-Series ML Features):")
monthly_trends.sort(['order_year', 'order_month']).show(10)

print(f"\n TPC-H Analytics Summary:")
print(f"    Processed {enriched_orders.count():,} enterprise orders")
print(f"    Traditional BI: Quarterly, operational, priority-based, and temporal analytics")
print(f"    ML Preparation: Customer behavior patterns and time-series features")
print(f"    All analytics computed on distributed CPU cluster using Ray Data")
print(f"    Industry-standard TPC-H benchmark demonstrates real-world enterprise capabilities")


In [None]:
# Create output directories for TPC-H processed data
import json
import os
os.makedirs("/tmp/tpch_etl_output", exist_ok=True)
os.makedirs("/tmp/tpch_etl_output/analytics", exist_ok=True)

print(" Writing TPC-H processed data to various formats...")

# Write enriched TPC-H orders to Parquet (best for large enterprise datasets)
print(" Writing enriched TPC-H orders to Parquet...")
enriched_orders.write_parquet("/tmp/tpch_etl_output/enriched_tpch_orders")

# Write analytics results to CSV (good for business users)
# Ray Data treats the path as a directory and writes one or more part files into it
print(" Writing priority analytics to CSV...")
priority_performance.write_csv("/tmp/tpch_etl_output/analytics/priority_performance.csv")

# Write customer analytics to JSON (good for APIs and downstream systems)
print(" Writing customer behavior analytics to JSON...")
customer_behavior.limit(1000).write_json("/tmp/tpch_etl_output/analytics/top_customers_tpch.json")

# Custom writer example - create a TPC-H executive summary report
def create_tpch_executive_summary(batch):
    """Create a custom TPC-H executive summary report"""
    df = batch.to_pandas() if hasattr(batch, 'to_pandas') else pd.DataFrame(batch)
    
    summary = {
        'report_timestamp': pd.Timestamp.now().isoformat(),
        'report_type': 'TPC-H Executive Summary',
        'total_customers_analyzed': len(df),
        'total_customer_lifetime_value': float(df['total_lifetime_value'].sum()),
        'average_customer_lifetime_value': float(df['total_lifetime_value'].mean()),
        'top_customer_ltv': float(df['total_lifetime_value'].max()),
        'customers_by_order_frequency': {
            'single_order': int((df['order_frequency'] == 1).sum()),
            'repeat_customers': int((df['order_frequency'] >= 2).sum()),
            'high_frequency': int((df['order_frequency'] >= 10).sum())
        },
        'average_priority_preference': float(df['priority_preference'].mean()),
        'peak_season_customers': int((df['peak_season_preference'] > 0.5).sum()),
        'enterprise_insights': {
            'benchmark': 'TPC-H Scale Factor 1000',
            'data_size': '~1TB processed',
            'processing_engine': 'Ray Data distributed CPU cluster'
        }
    }
    
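    # Write summary to file (each call overwrites the file written by the previous call)
    with open('/tmp/tpch_etl_output/analytics/tpch_executive_summary.json', 'w') as f:
        json.dump(summary, f, indent=2)
    
    print(" Generated TPC-H executive summary report")
    # map_batches expects a batch back (DataFrame, Arrow table, or dict of arrays), not a list of rows
    return pd.DataFrame([{"summary_created": True, "customers_processed": len(df)}])

# Generate executive summary using custom writer.
# batch_size=None passes each block as one batch, so the summary covers the whole
# dataset only when customer_behavior is stored in a single block.
summary_result = customer_behavior.map_batches(create_tpch_executive_summary, batch_size=None)
summary_result.take_all()  # Trigger execution over every block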

# Write time-series data for forecasting models
print(" Writing monthly trends for time-series analysis...")
monthly_trends.write_parquet("/tmp/tpch_etl_output/analytics/monthly_trends_tpch")

# Display what was created
print("\n TPC-H ETL Output files created:")
for root, dirs, files in os.walk("/tmp/tpch_etl_output"):
    level = root.replace("/tmp/tpch_etl_output", "").count(os.sep)
    indent = " " * 2 * level
    print(f"{indent} {os.path.basename(root)}/")
    sub_indent = " " * 2 * (level + 1)
    for file in files:
        file_path = os.path.join(root, file)
        size_mb = os.path.getsize(file_path) / (1024 * 1024)
        print(f"{sub_indent} {file} ({size_mb:.1f} MB)")

print(f"\n TPC-H ETL Pipeline Complete!")
print(f"    Industry-standard benchmark data processed")
print(f"    Traditional BI and modern ML features generated") 
print(f"    Enterprise-scale ETL demonstrated on CPU cluster")
print(f"    Ready for both business intelligence and AI/ML applications")
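
`map_batches` calls the summary function once per batch, so when the dataset spans several blocks the custom writer runs several times and each call sees only part of the data. One way around this, sketched below under stated assumptions, is to compute mergeable partial summaries per batch and combine them once at the end. The helper names `partial_summary` and `merge_summaries` are hypothetical, and the two small frames stand in for batches; only the column names come from the cell above.

```python
import pandas as pd

def partial_summary(df: pd.DataFrame) -> dict:
    """Summarize one batch using values that can be merged later (sums, counts, maxima)."""
    return {
        "customers": len(df),
        "ltv_sum": float(df["total_lifetime_value"].sum()),
        "ltv_max": float(df["total_lifetime_value"].max()),
        "repeat_customers": int((df["order_frequency"] >= 2).sum()),
    }

def merge_summaries(parts: list) -> dict:
    """Combine per-batch summaries into one dataset-wide report."""
    total = sum(p["customers"] for p in parts)
    ltv_sum = sum(p["ltv_sum"] for p in parts)
    return {
        "total_customers_analyzed": total,
        "total_customer_lifetime_value": ltv_sum,
        # The average is derived from merged sums, never by averaging batch averages.
        "average_customer_lifetime_value": ltv_sum / total if total else 0.0,
        "top_customer_ltv": max((p["ltv_max"] for p in parts), default=0.0),
        "repeat_customers": sum(p["repeat_customers"] for p in parts),
    }

# Two toy batches with invented values, standing in for blocks of customer_behavior.
batch_a = pd.DataFrame({"total_lifetime_value": [100.0, 300.0], "order_frequency": [1, 3]})
batch_b = pd.DataFrame({"total_lifetime_value": [600.0], "order_frequency": [12]})

report = merge_summaries([partial_summary(batch_a), partial_summary(batch_b)])
print(report)
```

Averaging the per-batch averages would give a different answer whenever batches differ in size, which is why the sketch carries sums and counts instead.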


In [None]:
import json
import os

class ProductionETLPipeline:
    """
    A production-ready ETL pipeline class demonstrating best practices:
    - Error handling and data validation
    - Logging and monitoring
    - Resource management
    - Modular design
    """
    
    def __init__(self, batch_size=1000, concurrency=2):
        self.batch_size = batch_size
        self.concurrency = concurrency
        self.processed_records = 0
        self.error_count = 0
        self.quality_metrics = {}
        
    def validate_data_quality(self, batch, dataset_name="dataset"):
        """Validate data quality and collect metrics"""
        df = batch.to_pandas() if hasattr(batch, 'to_pandas') else pd.DataFrame(batch)
        initial_count = len(df)
        
        # Track quality metrics
        quality_checks = {}
        
        if 'customer_id' in df.columns:
            # Check for missing customer IDs (cast to int so the metric is JSON-serializable)
            missing_customer_ids = int(df['customer_id'].isna().sum())
            quality_checks['missing_customer_ids'] = missing_customer_ids
            df = df[df['customer_id'].notna()]
            
        if 'price' in df.columns:
            # Check for negative prices
            negative_prices = int((df['price'] < 0).sum())
            quality_checks['negative_prices'] = negative_prices
            df = df[df['price'] >= 0]
            
        if 'quantity' in df.columns:
            # Check for invalid quantities
            invalid_quantities = int((df['quantity'] <= 0).sum())
            quality_checks['invalid_quantities'] = invalid_quantities
            df = df[df['quantity'] > 0]
        
        final_count = len(df)
        dropped_count = initial_count - final_count
        
        if dropped_count > 0:
            print(f"  Data quality issues in {dataset_name}: dropped {dropped_count}/{initial_count} records")
            self.error_count += dropped_count
            
        # Store quality metrics
        self.quality_metrics[dataset_name] = quality_checks
        
        # Add quality metadata to the data
        df['data_quality_score'] = 1.0 - (dropped_count / initial_count if initial_count > 0 else 0)
        df['validation_timestamp'] = pd.Timestamp.now()
        
        return df
    
    def enrich_with_advanced_features(self, batch):
        """Advanced feature engineering with error handling"""
        try:
            df = batch.to_pandas() if hasattr(batch, 'to_pandas') else pd.DataFrame(batch)
            
            # Calculate advanced business metrics
            df['total_value'] = df['quantity'] * df['price']
            
            # Customer value segments (quartile cut points computed within each batch)
            value_quartiles = df['total_value'].quantile([0.25, 0.5, 0.75])
            df['value_segment'] = pd.cut(
                df['total_value'],
                # pd.cut needs one more bin edge than labels: 5 edges for 4 segments
                bins=[-float('inf')] + value_quartiles.tolist() + [float('inf')],
                labels=['Economy', 'Standard', 'Premium', 'Luxury']
            )
            
            # Time-based features
            df['order_date'] = pd.to_datetime(df['order_date'])
            df['order_hour'] = df['order_date'].dt.hour
            df['is_business_hours'] = df['order_hour'].between(9, 17)
            # Season index: 0 = Dec-Feb, 1 = Mar-May, 2 = Jun-Aug, 3 = Sep-Nov
            df['order_season'] = df['order_date'].dt.month % 12 // 3
            
            # RFM Analysis components (Recency, Frequency, Monetary)
            reference_date = df['order_date'].max()
            df['days_since_order'] = (reference_date - df['order_date']).dt.days
            
            # Product affinity scoring
            product_scores = {'laptop': 5, 'phone': 4, 'tablet': 3, 'watch': 2, 'headphones': 1}
            df['product_affinity_score'] = df['product'].map(product_scores).fillna(0)
            
            # Calculate expected shipping costs (business rule)
            df['estimated_shipping_cost'] = np.where(
                df['total_value'] > 500,
                0,  # Free shipping for orders over $500
                np.where(df['is_business_hours'], 15, 25)  # Higher cost outside business hours
            )
            
            self.processed_records += len(df)
            return df
            
        except Exception as e:
            print(f" Error in feature engineering: {str(e)}")
            # Return original batch to continue processing
            return batch
    
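    def create_customer_360_view(self, orders_df, customers_df):
        """Create a comprehensive customer view from two Ray Datasets"""

        def summarize_customer(group):
            """Collapse one customer's orders (a pandas DataFrame) into a single row"""
            product_modes = group['product'].mode()
            return pd.DataFrame({
                'customer_id': [group['customer_id'].iloc[0]],
                'total_orders': [int(group['order_id'].count())],
                'total_lifetime_value': [float(group['total_value'].sum())],
                'avg_order_value': [float(group['total_value'].mean())],
                'max_order_value': [float(group['total_value'].max())],
                'preferred_product': [product_modes.iloc[0] if not product_modes.empty else 'unknown'],
                'avg_product_affinity': [float(group['product_affinity_score'].mean())],
                # Estimated savings: $20 for every order that shipped free
                'total_shipping_saved': [int((group['estimated_shipping_cost'] == 0).sum()) * 20]
            })

        try:
            # Aggregate order data by customer; map_groups reduces each customer's rows with pandas
            customer_order_metrics = (
                orders_df
                .groupby('customer_id')
                .map_groups(summarize_customer, batch_format='pandas')
            )
            
            # Join with customer data (Dataset.join requires a recent Ray release;
            # num_partitions is a tuning knob, reused from self.concurrency here)
            customer_360 = customers_df.join(
                customer_order_metrics,
                join_type='left_outer',
                num_partitions=self.concurrency,
                on=('customer_id',)
            )
            
            return customer_360
            
        except Exception as e:
            # Datasets are lazy, so this only catches errors raised while building the plan
            print(f" Error creating customer 360 view: {str(e)}")
            return customers_df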
    
    def run_pipeline(self):
        """Execute the complete ETL pipeline"""
        print(" Starting Production ETL Pipeline...")
        start_time = time.time()
        
        try:
            # EXTRACT: Read data with validation
            print(" Phase 1: Extracting data...")
            raw_orders = ray.data.read_parquet('/tmp/sample_data/orders.parquet')
            raw_customers = ray.data.read_csv('/tmp/sample_data/customers.csv')
            
            # TRANSFORM: Data quality validation
            print(" Phase 2: Data quality validation...")
            clean_orders = raw_orders.map_batches(
                lambda batch: self.validate_data_quality(batch, "orders"),
                batch_format="pyarrow",
                batch_size=self.batch_size
            )
            
            clean_customers = raw_customers.map_batches(
                lambda batch: self.validate_data_quality(batch, "customers"),
                batch_format="pyarrow", 
                batch_size=self.batch_size
            )
            
            # TRANSFORM: Feature engineering
            print(" Phase 3: Feature engineering...")
            enriched_orders = clean_orders.map_batches(
                self.enrich_with_advanced_features,
                batch_format="pyarrow",
                batch_size=self.batch_size,
                concurrency=self.concurrency
            )
            
            # TRANSFORM: Create customer 360 view
            print(" Phase 4: Creating customer 360 view...")
            customer_360 = self.create_customer_360_view(enriched_orders, clean_customers)
            
            # LOAD: Write results
            print(" Phase 5: Loading results...")
            os.makedirs("/tmp/production_output", exist_ok=True)
            
            # Write different outputs for different use cases
            enriched_orders.write_parquet("/tmp/production_output/enriched_orders")
            customer_360.write_parquet("/tmp/production_output/customer_360")
            
            # Generate final metrics
            execution_time = time.time() - start_time
            total_orders = enriched_orders.count()
            total_customers = customer_360.count()
            
            # map_batches ships a copy of this object to the workers, so the counters
            # updated there (processed_records, error_count, quality_metrics) do not
            # flow back to the driver. Throughput is derived from the row count instead.
            pipeline_metrics = {
                'execution_time_seconds': round(execution_time, 2),
                'total_orders_processed': total_orders,
                'total_customers_processed': total_customers,
                'records_per_second': round(total_orders / execution_time, 2),
                'error_count': self.error_count,
                'data_quality_metrics': self.quality_metrics
            }
            
            # Save metrics
            with open('/tmp/production_output/pipeline_metrics.json', 'w') as f:
                json.dump(pipeline_metrics, f, indent=2)
            
            print(" Pipeline completed successfully!")
            print(f" Processed {total_orders:,} orders and {total_customers:,} customers in {execution_time:.2f}s")
            print(f" Throughput: {pipeline_metrics['records_per_second']:,.0f} records/second")
            
            return pipeline_metrics
            
        except Exception as e:
            print(f" Pipeline failed: {str(e)}")
            raise

# Execute the production pipeline
pipeline = ProductionETLPipeline(batch_size=500, concurrency=2)
metrics = pipeline.run_pipeline()
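
The pipeline class keeps counters such as `processed_records` on `self` and updates them inside functions passed to `map_batches`. Those functions run on workers against a serialized copy of the object, so the driver's counters do not change. The sketch below imitates that with `copy.deepcopy` as a stand-in for shipping the object to a worker (an assumption made so the example runs without a cluster) and then shows the alternative of returning counts as data. `BatchCounter` and `process_with_count` are hypothetical names.

```python
import copy

class BatchCounter:
    """Tracks processed rows on the instance, like ProductionETLPipeline does."""

    def __init__(self):
        self.processed_records = 0

    def process(self, batch):
        self.processed_records += len(batch)
        return batch

driver_copy = BatchCounter()

# Stand-in for serialization: the worker operates on its own copy of the object.
worker_copy = copy.deepcopy(driver_copy)
worker_copy.process([1, 2, 3])

print(driver_copy.processed_records)  # 0: the driver never sees the update
print(worker_copy.processed_records)  # 3: only the worker-side copy changed

def process_with_count(batch):
    """Alternative: return the count alongside the rows so it travels with the data."""
    return {"rows": batch, "processed": len(batch)}

results = [process_with_count(b) for b in ([1, 2, 3], [4, 5])]
total_processed = sum(r["processed"] for r in results)
print(total_processed)  # 5
```

In a real pipeline the same idea can be expressed by emitting a metrics column from each batch and aggregating it on the dataset, or by counting rows on the final dataset.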


## Summary: Your Journey with Ray Data ETL

Congratulations! You've completed a comprehensive journey through Ray Data for ETL. Let's summarize what you've learned and explore how to take your AI data pipelines to production.

<div class="alert alert-block alert-success">
<b>What You've Mastered:</b>
<ul>
    <li><b>Ray Data Fundamentals:</b> Blocks, lazy execution, streaming processing</li>
    <li><b>Extract Phase:</b> Reading from multiple data sources efficiently, including multimodal data</li>
    <li><b>Transform Phase:</b> Distributed data processing and feature engineering</li>
    <li><b>Load Phase:</b> Writing to various destinations with optimization</li>
    <li><b>Production Patterns:</b> Error handling, monitoring, and data quality</li>
    <li><b>Performance Optimization:</b> Understanding bottlenecks and solutions</li>
</ul>
</div>

### When to Use Ray Data

**Ray Data excels across the full spectrum of data workloads:**

**Traditional ETL & Business Intelligence:**
- **High-volume transaction processing** for e-commerce, finance, and operations
- **Business intelligence** and executive reporting at scale
- **Data warehouse** loading and transformation pipelines
- **CPU cluster workloads** written in plain Python, with no JVM to deploy or tune
- **Traditional analytics** that need to scale beyond single-node tools

**Modern ML & AI Workloads:**
- **Feature engineering** for machine learning at scale
- **Batch inference** on foundation models and LLMs
- **Multimodal data processing** (text, images, video, audio)
- **GPU-accelerated pipelines** for AI applications
- **Inference workloads** that complement real-time model serving (online serving itself is Ray Serve's role)

**Ray Data's Unified Platform Advantage:**
- **One system** for both traditional ETL and cutting-edge AI
- **Seamless evolution** from CPU-based analytics to GPU-powered AI
- **No migration** required as your data needs grow and change
- **Consistent APIs** whether processing structured business data or unstructured AI content

**Ray Data is proven at scale:**
- Processing **exabyte-scale** workloads (Amazon's migration from Spark)
- **1M+ clusters** orchestrated monthly across the Ray ecosystem
- **$120M annual savings** achieved by leading enterprises
- **Traditional workloads** running alongside **next-generation AI** on the same platform

### From Open Source to Enterprise: Anyscale Platform

While Ray Data open source provides powerful capabilities, **Anyscale** offers a unified AI platform for production deployments:

<div class="alert alert-block alert-info">
<b>Anyscale: The Unified AI Platform</b>
<ul>
    <li><b>RayTurbo Runtime:</b> Up to 5.1x performance improvements over open source</li>
    <li><b>Enterprise Governance:</b> Resource quotas, usage tracking, and advanced observability</li>
    <li><b>AI Anywhere:</b> Deploy on Kubernetes, hybrid cloud, or any infrastructure</li>
    <li><b>LLM Suite:</b> Complete capabilities for embeddings, fine-tuning, and serving</li>
    <li><b>Marketplace Ready:</b> Available on AWS and GCP Marketplaces</li>
</ul>
</div>

### Production Deployment Options

**Getting Started:**
1. **Ray Open Source**: Perfect for development and smaller workloads
2. **Anyscale Platform**: Enterprise features with RayTurbo optimizations
3. **Marketplace Deployment**: One-click setup via AWS or GCP Marketplace

### Key Architectural Insights

Understanding how Ray Data works under the hood helps you build better pipelines:

1. **AI-Native Architecture**: Purpose-built for Python, GPUs, and multimodal data
2. **Streaming Execution**: Process datasets larger than cluster memory
3. **Heterogeneous Compute**: Seamlessly orchestrate CPUs, GPUs, and other accelerators
4. **Operator Fusion**: Combines compatible operations for efficiency
5. **Enterprise Scalability**: Proven to scale to 8,000+ nodes
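
To make streaming execution and operator fusion more concrete, here is a conceptual sketch in plain Python. It is not Ray Data's implementation: generators stand in for streaming reads, and the hypothetical `fuse` helper stands in for the planner that combines compatible operations so that each batch passes through them in a single step.

```python
from typing import Callable, Iterator, List

Batch = List[int]
read_log: List[int] = []  # start offset of every batch that was actually read

def read_batches(n_rows: int, batch_size: int) -> Iterator[Batch]:
    """Yield rows 0..n_rows-1 in small batches instead of loading them all at once."""
    for start in range(0, n_rows, batch_size):
        read_log.append(start)
        yield list(range(start, min(start + batch_size, n_rows)))

def fuse(*fns: Callable[[Batch], Batch]) -> Callable[[Batch], Batch]:
    """Combine several per-batch functions into one call per batch."""
    def fused(batch: Batch) -> Batch:
        for fn in fns:
            batch = fn(batch)
        return batch
    return fused

def double(batch: Batch) -> Batch:
    return [x * 2 for x in batch]

def keep_multiples_of_four(batch: Batch) -> Batch:
    return [x for x in batch if x % 4 == 0]

# Building the pipeline does no work: map() and the generator are both lazy.
pipeline = map(fuse(double, keep_multiples_of_four), read_batches(10, 4))
reads_before = len(read_log)

# Consuming the pipeline pulls one batch at a time through both steps.
total = 0
for out_batch in pipeline:
    total += sum(out_batch)

print(reads_before, len(read_log), total)  # 0 3 40
```

Only one batch is held in memory at a time, which is the property that lets a streaming engine work through datasets larger than memory.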

### Production Readiness Checklist

Before deploying Ray Data pipelines to production:

- **Architecture**: Choose between Ray OSS and Anyscale based on your needs
- **Performance**: Consider RayTurbo for production workloads requiring maximum efficiency
- **Governance**: Implement enterprise controls for AI sprawl and cost management
- **Security**: Leverage enterprise identity integration and access controls
- **Monitoring**: Use advanced observability tools for optimization insights
- **Scalability**: Test with realistic data volumes and cluster sizes

### Join the Ray Ecosystem

The Ray community is thriving with **1,000+ contributors** and growing:

1. **Community**: Join the Ray Slack community for support and discussions
2. **Learning**: Access Ray Summit sessions and technical deep-dives
3. **Contributing**: Contribute to the fastest-growing AI infrastructure project
4. **Enterprise Support**: Explore Anyscale for production deployments

<div class="alert alert-block alert-success">
<b>One Platform for All Your Data Workloads</b><br>
You now have the knowledge to build production-ready, scalable data pipelines that handle everything from traditional business ETL to cutting-edge AI applications. Whether you're processing millions of e-commerce transactions for business intelligence or preparing multimodal data for foundation models, Ray Data provides a unified platform that scales with your needs.<br><br>
<b>Start with traditional ETL today, evolve to AI tomorrow - all on the same platform.</b> Ray Data and Anyscale eliminate the complexity of managing multiple systems as your data requirements grow.
</div>

### Get Started Today

- **Ray Documentation**: [docs.ray.io](https://docs.ray.io/en/latest/data/)
- **Try Anyscale**: Available on [AWS](https://aws.amazon.com/marketplace) and [GCP](https://console.cloud.google.com/marketplace) Marketplaces
- **Community**: Join the conversation on [Ray Slack](https://ray-distributed.slack.com)
- **Learn More**: Watch [Ray Summit sessions](https://www.youtube.com/c/RayProject) for deeper insights
