# Unstructured Data Ingestion and Processing With Ray Data

**Time to complete**: 35 min | **Difficulty**: Advanced | **Prerequisites**: Data engineering experience, document processing, basic NLP knowledge

## What you'll build

Build a comprehensive document ingestion pipeline that transforms unstructured documents from data lakes into structured, analytics-ready datasets using Ray Data's distributed processing capabilities for enterprise data warehouse workflows.

## Table of Contents

1. [Data Lake Document Discovery](#step-1-data-lake-document-discovery) (8 min)
2. [Document Processing and Classification](#step-2-document-processing-and-classification) (10 min)
3. [Text Chunking and Enrichment](#step-3-text-chunking-and-enrichment) (8 min)
4. [Data Warehouse Schema and Output](#step-4-data-warehouse-schema-and-output) (6 min)
5. [Verification and Summary](#step-5-verification-and-summary) (3 min)

## Learning Objectives

**Why unstructured data ingestion matters**: Enterprise data lakes contain vast amounts of unstructured documents (PDFs, Word docs, presentations, reports) that need systematic processing to extract business value for analytics and reporting.

**Ray Data's ingestion capabilities**: Distribute document processing across clusters to handle large-scale document collections, extract structured data, and prepare analytics-ready datasets for data warehouse consumption.

**Data lake to warehouse patterns**: Techniques used by data engineering teams to systematically process document collections, extract structured information, and create queryable datasets for business intelligence.

**Production ingestion workflows**: Scalable document processing patterns that handle diverse file formats, extract metadata, and create structured schemas for downstream analytics systems.

## Overview

**Challenge**: Enterprise data lakes contain millions of unstructured documents (PDFs, Word docs, presentations) across multiple formats that need systematic processing to extract business value. Traditional document processing approaches struggle with:
- **Scale**: Single-machine processing limits document volume
- **Consistency**: Manual extraction creates inconsistent schemas
- **Integration**: Analysis requires complex supporting infrastructure
- **Warehouse integration**: Manual data modeling and ETL processes

**Solution**: Ray Data enables end-to-end document ingestion pipelines with native distributed operations for processing millions of documents efficiently.

## Prerequisites Checklist

Before starting, ensure you have:
- [ ] Understanding of data lake and data warehouse concepts
- [ ] Experience with document processing and text extraction
- [ ] Knowledge of structured data formats (Parquet, Delta Lake, Iceberg)
- [ ] Python environment with Ray Data and document processing libraries
- [ ] Access to S3 or other cloud storage for document sources

## Quick start (3 minutes)

This section sets up the imports, Ray Data context, and Ray connection used by the document ingestion pipeline:

In [18]:

# Standard library imports for basic operations
import json  # For handling JSON data (quality issues)
import uuid  # For generating unique document IDs
from datetime import datetime  # For timestamps
from pathlib import Path  # For file path operations
from typing import Any, Dict, List  # For type hints

# Data processing libraries
import numpy as np  # For numerical operations
import pandas as pd  # For DataFrame operations in batch processing

# Ray Data imports for distributed processing
import ray  # Core Ray library
from ray.data.aggregate import Count, Max, Mean, Sum  # Aggregation functions
from ray.data.expressions import col, lit  # Expression API (not used in filters but available)

# Get the current Ray Data context to configure global settings
ctx = ray.data.DataContext.get_current()

# Disable progress bars for cleaner output in production
# You can enable these for debugging: set to True to see progress
ctx.enable_progress_bars = False
ctx.enable_operator_progress_bars = False

# Initialize Ray cluster connection
# ignore_reinit_error=True allows running this code multiple times without errors
ray.init(ignore_reinit_error=True)

2025-10-20 20:08:43,128	INFO worker.py:1833 -- Connecting to existing Ray cluster at address: 10.0.86.206:6379...
2025-10-20 20:08:43,129	INFO worker.py:1851 -- Calling ray.init() again after it has already been called.


Python version: 3.12.11
Ray version: 2.50.0
Dashboard: http://session-77uweunq3awbhqefvry4lwcqq5.i.anyscaleuserdata.com


## Step 1: Data Lake Document Discovery

### Discover document collections in data lake

In [19]:
# STEP 1: READ DOCUMENTS FROM DATA LAKE (S3)

# Use Ray Data's read_binary_files() to load documents from S3
# Why read_binary_files()?
#   - Reads files as raw bytes (works for PDFs, DOCX, images, etc.)
#   - Distributes file reading across cluster workers
#   - include_paths=True gives us the file path for each document
#
# Parameters explained:
#   - S3 path: Location of document collection in data lake
#   - include_paths=True: Adds 'path' column with file location
#   - ray_remote_args: Resource allocation per task
#     * num_cpus=0.025: Very low CPU since this is I/O-bound (reading files)
#     * This allows many concurrent reads without CPU bottleneck
#   - limit(100): Process only 100 documents for this demo
#     * Remove this limit to process entire collection

document_collection = ray.data.read_binary_files(
    "s3://anyscale-rag-application/1000-docs/",
    include_paths=True,
    ray_remote_args={"num_cpus": 0.025}
).limit(100)

# Display the schema to understand our data structure
# At this point, we have: 'bytes' (file content) and 'path' (file location)
print(f"Dataset schema: {document_collection.schema()}")

2025-10-20 20:08:43,509	INFO logging.py:293 -- Registered dataset logger for dataset dataset_122_0
2025-10-20 20:08:43,513	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_122_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:08:43,514	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_122_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> LimitOperator[limit=1] -> TaskPoolMapOperator[ReadFiles]
2025-10-20 20:08:46,611	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_122_0 execution finished in 3.10 seconds


Dataset schema: Column  Type
------  ----
bytes   binary
path    string
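
Before running extraction, it can help to check which file types the collection actually contains. The following is a minimal plain-Python sketch of that check, not part of the original pipeline: the helper names and the sample paths are hypothetical, and the supported-extension set is copied from the extraction step of this tutorial.

```python
from collections import Counter
from pathlib import PurePosixPath

# Extensions that the extraction step in this tutorial treats as supported.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".html", ".txt"}


def tally_extensions(paths):
    """Count files per lowercase extension ('' means no extension)."""
    return Counter(PurePosixPath(p).suffix.lower() for p in paths)


def split_supported(paths):
    """Separate paths into (supported, unsupported) by extension."""
    supported, unsupported = [], []
    for p in paths:
        ext = PurePosixPath(p).suffix.lower()
        (supported if ext in SUPPORTED_EXTENSIONS else unsupported).append(p)
    return supported, unsupported


# Hypothetical paths, for illustration only.
sample_paths = [
    "bucket/docs/report.PDF",
    "bucket/docs/notes.txt",
    "bucket/docs/diagram.png",
    "bucket/docs/README",
]

extension_counts = tally_extensions(sample_paths)
supported_paths, unsupported_paths = split_supported(sample_paths)
print(extension_counts)
print(unsupported_paths)
```

In a real run, the paths would come from the `path` column of the dataset rather than a hard-coded list.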


### Document metadata extraction

In [20]:

# STEP 2: TEXT EXTRACTION FUNCTION
# This function extracts text from various document formats (PDF, DOCX, etc.)
# It processes one document at a time (distributed by Ray) and returns structured metadata + text


def process_file(record: Dict[str, Any]) -> Dict[str, Any]:
    """
    Extract text content from document files using the Unstructured library.
    
    BEGINNER NOTE:
    - Input: A dictionary (record) with 'bytes' (file content) and 'path' (file location)
    - Output: A dictionary with extracted text + metadata
    - This function runs on each worker node in parallel
    
    Why extract text immediately?
    - Avoids passing large binary data through multiple operations
    - Reduces memory usage in downstream processing
    - Enables faster processing by dropping binary data early
    """
    # Import libraries inside function so each Ray worker has access
    import io
    from pathlib import Path
    
    from unstructured.partition.auto import partition
    
    # Extract file metadata from the input record
    file_path = Path(record["path"])  # Convert path string to Path object
    file_bytes = record["bytes"]  # Raw file content (binary data)
    file_size = len(file_bytes)  # Size in bytes
    file_extension = file_path.suffix.lower()  # Get extension (.pdf, .docx, etc.)
    file_name = file_path.name  # Just the filename (not full path)
    
    # We can only extract text from certain file types
    # If unsupported, return metadata with empty text
    supported_extensions = {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".html", ".txt"}
    
    if file_extension not in supported_extensions:
        # Return a record with metadata but no extracted text
        return {
            "document_id": str(uuid.uuid4()),  # Generate unique ID
            "file_path": str(file_path),
            "file_name": file_name,
            "file_extension": file_extension,
            "file_size_bytes": file_size,
            "file_size_mb": round(file_size / (1024 * 1024), 2),  # Convert to MB
            "discovery_timestamp": datetime.now().isoformat(),
            "extracted_text": "",  # Empty - unsupported format
            "text_length": 0,
            "word_count": 0,
            "extraction_status": "unsupported_format"
        }
    
    # Extract text using the Unstructured library
    try:
        # Create an in-memory file stream from bytes (avoids writing to disk)
        with io.BytesIO(file_bytes) as stream:
            # partition() automatically detects format and extracts text
            # It returns a list of text elements (paragraphs, tables, etc.)
            elements = partition(file=stream)
            
            # Combine all extracted text elements into one string
            extracted_text = " ".join([str(el) for el in elements]).strip()
            
            # Calculate text statistics for quality assessment
            text_length = len(extracted_text)  # Total characters
            word_count = len(extracted_text.split()) if extracted_text else 0
            extraction_status = "success"
            
    except Exception as e:
        # If extraction fails (corrupted file, unsupported format, etc.)
        # Log the error and continue processing other files
        print(f"Cannot process file {file_path}: {e}")
        extracted_text = ""
        text_length = 0
        word_count = 0
        extraction_status = f"error: {str(e)[:100]}"  # Store error message (truncated)
    
    # Return record with all metadata and extracted text
    return {
        "document_id": str(uuid.uuid4()),  # Unique identifier for this document
        "file_path": str(file_path),
        "file_name": file_name,
        "file_extension": file_extension,
        "file_size_bytes": file_size,
        "file_size_mb": round(file_size / (1024 * 1024), 2),
        "discovery_timestamp": datetime.now().isoformat(),
        "extracted_text": extracted_text,  # The actual text content
        "text_length": text_length,
        "word_count": word_count,
        "extraction_status": extraction_status
    }


# Ray Data's map() applies process_file() to each document in parallel
# This is the "embarrassingly parallel" pattern - each document is independent
print("Extracting text from documents...")

# Why map() instead of map_batches()?
#   - map(): Process one record at a time (good for variable-size documents)
#   - map_batches(): Process records in batches (better for vectorized operations)
#   - Text extraction is CPU-intensive and handles one document at a time, so map() is a natural fit

# Parameters:
#   - process_file: The function to apply to each record
#   - concurrency=16: Run up to 16 parallel tasks at a time
#   - num_cpus=1: Each task gets 1 CPU (text extraction is CPU-intensive)

documents_with_text = document_collection.map(
    process_file,
    concurrency=16,
    num_cpus=1
)

Extracting text from documents...
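
`process_file` assigns `document_id` with `uuid.uuid4()`, so the same file receives a different ID every time the pipeline runs. If reruns need to produce the same ID for the same file (for example, to upsert into a warehouse table), one option is a name-based UUID derived from the file path. This is a sketch under that assumption; the namespace string and the `stable_document_id` helper are hypothetical and not part of the original pipeline.

```python
import uuid

# Hypothetical namespace for this pipeline; any fixed UUID works,
# as long as it stays the same across runs.
DOCUMENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "document-ingestion-demo")


def stable_document_id(file_path: str) -> str:
    """Derive a deterministic UUID (version 5) from the file path."""
    return str(uuid.uuid5(DOCUMENT_NAMESPACE, file_path))


# Hypothetical paths, for illustration only.
first_id = stable_document_id("bucket/docs/report.pdf")
second_id = stable_document_id("bucket/docs/report.pdf")
other_id = stable_document_id("bucket/docs/notes.txt")

print(first_id == second_id)  # same path, same ID
print(first_id == other_id)   # different path, different ID
```

The trade-off is that a path-based ID does not change when the file content changes; hashing the bytes instead would track content rather than location.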


In [21]:
# Convert a sample of documents to pandas DataFrame for easy viewing
documents_with_text.limit(25).to_pandas()

2025-10-20 20:08:46,860	INFO logging.py:293 -- Registered dataset logger for dataset dataset_124_0
2025-10-20 20:08:46,866	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_124_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:08:46,866	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_124_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> LimitOperator[limit=25] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(process_file)]
(Map(process_file) pid=15079, ip=10.0.119.92) Cannot set gray non-stroke color because /'P15' is an invalid float value
(Map(process_file) pid=15079, ip=10.0.119.92) Cannot set gray non-stroke color because /'P19' is an invalid float value
(Map(process_file) pid=15079, ip=10.0.119.92) Cannot set gray non-stroke color because /'P23' is an invalid float value
(Map(process_file) pid=15079, ip=10.0.119.92) Cannot set gray n

Unnamed: 0,document_id,file_path,file_name,file_extension,file_size_bytes,file_size_mb,discovery_timestamp,extracted_text,text_length,word_count,extraction_status
0,e0fe8f07-5f07-47c4-aea0-95dd19bf1ecf,anyscale-rag-application/1000-docs/A Compariso...,A Comparison of Programming Languages in Econo...,.pdf,211355,0.2,2025-10-20T20:08:50.946284,A Comparison of Programming Languages in Econo...,33839,5307,success
1,7a2c4d5f-ecf4-4222-a73e-1476aea9c97d,anyscale-rag-application/1000-docs/A Compariso...,A Comparison of Software and Hardware Techniqu...,.pdf,156844,0.15,2025-10-20T20:08:52.951872,A Comparison of Software and Hardware Techniqu...,71494,11296,success
2,71c9003b-2e02-4bbe-ba66-c0611de53a86,anyscale-rag-application/1000-docs/A Compilati...,A Compilation Target for Probabilistic Program...,.pdf,892594,0.85,2025-10-20T20:08:54.244419,A Compilation Target for Probabilistic Program...,39374,6122,success
3,be7c84e0-45f2-48fb-816b-1f89fcb6d038,anyscale-rag-application/1000-docs/Graph Theor...,Graph Theory (2005).pdf,.pdf,206383,0.2,2025-10-20T20:08:54.838291,V. Adamchik Graph Theory Victor Adamchik Fall ...,10103,1600,success
4,156c9d93-db92-4b3c-86ef-ccde0d2636bd,anyscale-rag-application/1000-docs/Multidigit ...,Multidigit Multiplication for Mathematicians (...,.pdf,346439,0.33,2025-10-20T20:08:56.554954,MULTIDIGIT MULTIPLICATION FOR MATHEMATICIANS D...,60434,10046,success
5,89477ca7-248b-4f88-8b9d-faaa570c67c4,anyscale-rag-application/1000-docs/Shining Lig...,Shining Light on Shadow Stacks - 7 Nov 2018 (1...,.pdf,443892,0.42,2025-10-20T20:08:58.067227,8 1 0 2 v o N 7 ] R C . s c [ 1 v 5 6 1 3 0 . ...,67521,10529,success
6,b515b4ae-63e0-4b05-8e09-7427fb7e25b9,anyscale-rag-application/1000-docs/lwref - BSD...,lwref - BSDCan2014 - FreeBSD.pdf,.pdf,214499,0.2,2025-10-20T20:08:58.625721,An insane idea on reference counting Gleb Smir...,11319,1613,success
7,8b3a498c-0630-4c86-becf-000f85b98cf3,anyscale-rag-application/1000-docs/A Dive in t...,A Dive in to Hyper-V Architecture and Vulnerab...,.pdf,4298584,4.1,2025-10-20T20:09:00.147126,This presentation is for informational purpose...,13282,1478,success
8,ecfa57ef-751a-40d4-ba31-66efebc8a41d,anyscale-rag-application/1000-docs/GraphBLAS M...,GraphBLAS Mathmatics - Provisional Release 1.0...,.pdf,691008,0.66,2025-10-20T20:09:01.769236,GraphBLAS Mathematics - Provisional Release 1....,40944,7326,success
9,3867cca9-15cf-48e1-b91d-511bdc6a08b4,anyscale-rag-application/1000-docs/Multiple By...,Multiple Byte Processing with Full-Word Instru...,.pdf,451218,0.43,2025-10-20T20:09:02.324341,"out destroying the 1D C-R property, it is nece...",21908,3989,success
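
The `extraction_status` column holds `success`, `unsupported_format`, or `error: <message>`. Because each error carries its own message, grouping on the raw value would split failures into many small groups. A minimal sketch of collapsing them into one bucket before counting follows; the `status_bucket` helper and the sample statuses are hypothetical.

```python
from collections import Counter


def status_bucket(extraction_status: str) -> str:
    """Collapse every 'error: ...' status into a single 'error' bucket."""
    if extraction_status.startswith("error:"):
        return "error"
    return extraction_status


# Hypothetical status values, for illustration only.
sample_statuses = [
    "success",
    "success",
    "unsupported_format",
    "error: EOF marker not found",
    "error: file is encrypted",
]

status_summary = Counter(status_bucket(s) for s in sample_statuses)
print(status_summary)
```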


In [22]:

# STEP 3: BUSINESS METADATA ENRICHMENT FUNCTION
# Add business context by classifying documents and assigning priority
# This enables filtering and partitioning in the data warehouse

def enrich_business_metadata(record: Dict[str, Any]) -> Dict[str, Any]:
    """
    Classify documents by business category and assign processing priority.
    
    BEGINNER NOTE:
    - Input: A record with extracted text and file metadata
    - Output: Same record with added business classification fields
    - Uses filename patterns to determine document type
    
    Why separate this from text extraction?
    - Text extraction is CPU-intensive, this is lightweight string matching
    - Separating concerns makes each stage easier to debug and optimize
    - Allows different resource allocations (CPU) for each stage
    """
    # Get the filename for pattern matching
    file_name = record["file_name"]
    filename_lower = file_name.lower()  # Convert to lowercase for easier matching
    file_size = record["file_size_bytes"]
    
    # Classify by Business Segment
    # Look for keywords in filename to determine business category
    # Real systems might use ML models or lookup tables instead
    # This keyword matching is a deliberately naive stand-in for text categorization, for demo purposes
    # Note: substring matching means "sec" also matches filenames containing "security" or "section"
    
    # Financial documents: earnings reports, revenue statements
    if any(keyword in filename_lower for keyword in ["financial", "earnings", "revenue", "profit"]):
        doc_type = "financial_document"
        business_category = "finance"
    
    # Legal documents: contracts, agreements
    elif any(keyword in filename_lower for keyword in ["legal", "contract", "agreement", "terms"]):
        doc_type = "legal_document"
        business_category = "legal"
    
    # Regulatory documents: compliance filings, SEC reports
    elif any(keyword in filename_lower for keyword in ["regulatory", "compliance", "filing", "sec"]):
        doc_type = "regulatory_document"
        business_category = "compliance"
    
    # Client documents: customer information, portfolios
    elif any(keyword in filename_lower for keyword in ["client", "customer", "portfolio"]):
        doc_type = "client_document"
        business_category = "client_services"
    
    # Research documents: market analysis, reports
    elif any(keyword in filename_lower for keyword in ["market", "research", "analysis", "report"]):
        doc_type = "research_document"
        business_category = "research"
    
    # Default category for unclassified documents
    else:
        doc_type = "general_document"
        business_category = "general"
    

    # Documents with urgent keywords get higher priority for processing
    
    # High priority: urgent, time-sensitive documents
    if any(keyword in filename_lower for keyword in ["urgent", "critical", "deadline"]):
        priority = "high"
        priority_score = 3
    
    # Medium priority: important but not urgent
    elif any(keyword in filename_lower for keyword in ["important", "quarterly", "annual"]):
        priority = "medium"
        priority_score = 2
    
    # Low priority: standard documents
    else:
        priority = "low"
        priority_score = 1
    
    # Return the record
    # Use **record to keep all existing fields, then add new ones
    return {
        **record,  # All existing fields (extracted_text, file_name, etc.)
        "document_type": doc_type,
        "business_category": business_category,
        "processing_priority": priority,
        "priority_score": priority_score,
        "estimated_pages": max(1, file_size // 50000),  # Rough estimate: 50KB per page
        "processing_status": "classified"
    }


# Apply Business Metadata enrichment to all documents
print("Enriching with business metadata...")

# Parameters:
#   - concurrency=16: Up to 16 parallel tasks; each one is lightweight
#   - num_cpus=0.25: Very low CPU usage (just string operations)
#     * This allows many documents to be classified simultaneously

documents_with_metadata = documents_with_text.map(
    enrich_business_metadata,
    concurrency=16,
    num_cpus=0.25
)

# View a few documents with business classification added
documents_with_metadata.limit(5).to_pandas()

2025-10-20 20:09:05,233	INFO logging.py:293 -- Registered dataset logger for dataset dataset_126_0
2025-10-20 20:09:05,239	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_126_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:09:05,239	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_126_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> LimitOperator[limit=5] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(process_file)] -> TaskPoolMapOperator[Map(enrich_business_metadata)]


Enriching with business metadata...


2025-10-20 20:09:20,005	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_126_0 execution finished in 14.76 seconds


Unnamed: 0,document_id,file_path,file_name,file_extension,file_size_bytes,file_size_mb,discovery_timestamp,extracted_text,text_length,word_count,extraction_status,document_type,business_category,processing_priority,priority_score,estimated_pages,processing_status
0,682d973d-a8c8-46ff-9b87-a84d10684f6c,anyscale-rag-application/1000-docs/100G Networ...,100G Networking Technology Overview - Slides -...,.pdf,1516903,1.45,2025-10-20T20:09:11.424457,100G Networking Technology Overview Christophe...,8996,1558,success,general_document,general,low,1,30,classified
1,d262b550-2dba-4977-ac73-cb2aa0915adf,anyscale-rag-application/1000-docs/Grand Centr...,Grand Central Dispatch - FreeBSD Dev Summit (1...,.pdf,130189,0.12,2025-10-20T20:09:11.788307,Grand Central Dispatch FreeBSD Devsummit Rober...,7831,1071,success,general_document,general,low,1,2,classified
2,57eef3e9-0c9c-4fb7-b95e-d083e84e55ac,anyscale-rag-application/1000-docs/Monitor_a_j...,Monitor_a_job.docx,.docx,387461,0.37,2025-10-20T20:09:12.526486,Monitor a job Anyscale jobs provides several t...,3296,585,success,general_document,general,low,1,7,classified
3,703a2fd4-44c1-4703-8082-a78e8151dd98,anyscale-rag-application/1000-docs/Serial Orde...,Serial Order - A Parallel Distributed Processi...,.pdf,2281776,2.18,2025-10-20T20:09:17.436733,SERIAL ORDER: A PARALLEL DISTRmUTED PROCESSING...,132375,21122,success,general_document,general,low,1,45,classified
4,6a2bab92-3d51-49d6-a7b6-c317a023d239,anyscale-rag-application/1000-docs/jargn10-the...,jargn10-thejargonfilever00038gut.txt,.txt,1140873,1.09,2025-10-20T20:09:19.958387,This Is The Project Gutenberg Etext of The Hac...,1065517,170519,success,general_document,general,low,1,22,classified
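
The classifier above checks whether a keyword appears anywhere in the lowercase filename, so a short keyword such as `sec` also matches `security`. One way to tighten this, sketched here as an assumption rather than the tutorial's method, is to split the filename into tokens and match whole tokens only. The rule table reuses the tutorial's keywords; the `tokenize` and `classify` helpers and the sample filenames are hypothetical.

```python
import re

# Keyword rules copied from the tutorial; list order decides precedence.
CATEGORY_RULES = [
    ("finance", {"financial", "earnings", "revenue", "profit"}),
    ("legal", {"legal", "contract", "agreement", "terms"}),
    ("compliance", {"regulatory", "compliance", "filing", "sec"}),
    ("client_services", {"client", "customer", "portfolio"}),
    ("research", {"market", "research", "analysis", "report"}),
]


def tokenize(file_name: str) -> set:
    """Split a filename into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", file_name.lower()))


def classify(file_name: str) -> str:
    """Return the first category whose keywords share a whole token with the name."""
    tokens = tokenize(file_name)
    for category, keywords in CATEGORY_RULES:
        if tokens & keywords:
            return category
    return "general"


# Hypothetical filenames, for illustration only.
print(classify("Security Overview 2018.pdf"))
print(classify("SEC_filing_Q3.pdf"))
print(classify("annual_report.docx"))
```

Whole-token matching trades recall for precision: `reports` no longer matches `report` unless plural forms are added to the rules or the tokens are stemmed.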


In [23]:

# WHY AGGREGATIONS?
# Before writing to warehouse, understand what we're processing:
# - How many documents of each type?
# - What's the size distribution?
# - Which categories have the most content?

# AGGREGATION 1: Document Type Distribution
# Group by document_type and calculate statistics

doc_type_stats = documents_with_metadata.groupby("document_type").aggregate(
    Count(),  # How many documents of each type?
    Sum("file_size_bytes"),  # Total size per document type
    Mean("file_size_mb"),  # Average size per document type
    Max("estimated_pages")  # Largest document per type
)

# AGGREGATION 2: Business Category Analysis
# Understand the distribution across business categories
# This helps with warehouse partitioning strategy

category_stats = documents_with_metadata.groupby("business_category").aggregate(
    Count(),  # How many per category?
    Mean("priority_score"),  # Average priority per category
    Sum("file_size_mb")  # Total data volume per category
)
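
Ray Data executes lazily, so the `groupby(...).aggregate(...)` calls above only define the result; materializing it (for example with `to_pandas()`) triggers the computation. To make clear what `Count`, `Sum`, `Mean`, and `Max` compute per group, here is a single-machine sketch in plain Python. It illustrates the arithmetic only, not how Ray Data runs it; the `summarize_by` helper and the sample records are hypothetical.

```python
from collections import defaultdict


def summarize_by(records, key):
    """Group records by `key` and compute count, sum, mean, and max per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    summary = {}
    for name, rows in groups.items():
        sizes = [row["file_size_bytes"] for row in rows]
        summary[name] = {
            "count": len(rows),
            "total_bytes": sum(sizes),
            "mean_bytes": sum(sizes) / len(rows),
            "max_pages": max(row["estimated_pages"] for row in rows),
        }
    return summary


# Hypothetical records, for illustration only.
sample_records = [
    {"document_type": "general_document", "file_size_bytes": 100_000, "estimated_pages": 2},
    {"document_type": "general_document", "file_size_bytes": 300_000, "estimated_pages": 6},
    {"document_type": "research_document", "file_size_bytes": 50_000, "estimated_pages": 1},
]

type_summary = summarize_by(sample_records, "document_type")
print(type_summary)
```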

## Step 2: Document Processing and Classification

### Text extraction and quality assessment

In [24]:

# STEP 4: QUALITY ASSESSMENT FUNCTION
# Evaluate document quality to filter out low-quality or problematic documents

def assess_document_quality(batch: pd.DataFrame) -> pd.DataFrame:
    """
    Assess document quality for data warehouse ingestion.
    
    BEGINNER NOTE:
    - Input: pandas DataFrame with multiple documents (a "batch")
    - Output: Same DataFrame with quality assessment columns added
    - Uses map_batches() for efficiency (process many docs at once)
    
    Why use map_batches() instead of map()?
    - Batching is more efficient for lightweight operations
    - Pandas DataFrame operations are optimized
    - Reduces overhead from function calls
    
    The map_batches() call below passes batch_format="pandas" explicitly,
    so this function receives each batch as a DataFrame.
    """
    quality_scores = np.zeros(len(batch), dtype=int)  # Numeric score (0-4)
    quality_ratings = []  # Text rating (high/medium/low)
    quality_issues_list = []  # List of issues found
    
    # We iterate through rows to apply business rules for quality
    # Each document gets a score from 0-4 based on quality criteria
    
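    # enumerate() gives a positional index; DataFrame index labels are not guaranteed to start at 0
    for idx, (_, row) in enumerate(batch.iterrows()):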
        quality_score = 0
        quality_issues = []
        
        # CRITERION 1: File size check
        # Files smaller than 10KB might be empty or corrupt
        if row["file_size_bytes"] > 10 * 1024:  # More than 10KB (file_size_mb is rounded, so compare bytes)
            quality_score += 1
        else:
            quality_issues.append("file_too_small")
        
        # CRITERION 2: Text length check
        # Documents should have meaningful text content
        if row["text_length"] > 100:  # At least 100 characters
            quality_score += 1
        else:
            quality_issues.append("insufficient_text")
        
        # CRITERION 3: Business relevance check
        # Classified documents are more valuable than unclassified
        if row["business_category"] != "general":
            quality_score += 1
        else:
            quality_issues.append("low_business_relevance")
        
        # CRITERION 4: Word count check
        # Documents should have substantial content
        if row["word_count"] > 20:  # At least 20 words
            quality_score += 1
        else:
            quality_issues.append("insufficient_content")
        
        # Score 4: All checks passed - high quality
        # Score 2-3: Some issues - medium quality
        # Score 0-1: Major issues - low quality
        quality_rating = "high" if quality_score >= 4 else "medium" if quality_score >= 2 else "low"
        
        # Store results for this document
        quality_scores[idx] = quality_score
        quality_ratings.append(quality_rating)
        quality_issues_list.append(json.dumps(quality_issues))  # Convert list to JSON string
    
    batch["quality_score"] = quality_scores
    batch["quality_rating"] = quality_ratings
    batch["quality_issues"] = quality_issues_list
    
    return batch


# Apply Quality Assessment to all documents

# Parameters:
#   - batch_format="pandas": Process as pandas DataFrame (easier than numpy arrays)
#   - num_cpus=0.25: Very low CPU (this is lightweight logic)
#   - batch_size=100: Process 100 documents at a time
#     * Larger batches = fewer function calls = better efficiency

quality_assessed_docs = documents_with_metadata.map_batches(
    assess_document_quality,
    batch_format="pandas",  # Ray Data pattern: explicit pandas format
    num_cpus=0.25,
    batch_size=100
)
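
To see how the four criteria combine, the sketch below scores a single record in plain Python. It is a simplified mirror of the rules above, not a replacement for `assess_document_quality`: the `score_document` helper is hypothetical, and the size check is written in bytes to express the "larger than roughly 10 KB" criterion. The first record uses the values shown earlier for `Graph Theory (2005).pdf`.

```python
def score_document(doc):
    """Return (score, rating, issues) for one document record."""
    checks = [
        (doc["file_size_bytes"] > 10 * 1024, "file_too_small"),
        (doc["text_length"] > 100, "insufficient_text"),
        (doc["business_category"] != "general", "low_business_relevance"),
        (doc["word_count"] > 20, "insufficient_content"),
    ]
    score = sum(1 for passed, _ in checks if passed)
    issues = [issue for passed, issue in checks if not passed]
    rating = "high" if score >= 4 else "medium" if score >= 2 else "low"
    return score, rating, issues


# Values taken from the sample output earlier in this tutorial.
graph_theory = {
    "file_size_bytes": 206383,
    "text_length": 10103,
    "word_count": 1600,
    "business_category": "general",
}

# Hypothetical empty document, for illustration only.
empty_doc = {
    "file_size_bytes": 0,
    "text_length": 0,
    "word_count": 0,
    "business_category": "general",
}

print(score_document(graph_theory))
print(score_document(empty_doc))
```

The classified rows shown earlier all fall in the `general` category, so each of them can score at most 3 and is rated `medium` at best, however clean its text is.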

## Step 3: Text Chunking and Enrichment

In [25]:

# STEP 5: TEXT CHUNKING FUNCTION
# Split long documents into smaller chunks for downstream processing
# Why chunk? Many applications (LLMs, vector databases) have size limits

def create_text_chunks(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """
    Create overlapping text chunks from each document.
    
    BEGINNER NOTE:
    - Input: ONE document record with full text
    - Output: MULTIPLE chunk records (one document → many chunks)
    - This is a "one-to-many" transformation
    
    Why chunking?
    - LLM APIs have token limits (e.g., 4096 tokens)
    - Vector databases work better with smaller chunks
    - Enables more granular search and analysis
    
    Why overlapping chunks?
    - Preserves context across chunk boundaries
    - Prevents splitting important information
    - 150 character overlap means each chunk shares text with neighbors
    """
    text = record["extracted_text"]  # The full document text
    chunk_size = 1500  # Each chunk will be ~1500 characters
    overlap = 150  # Adjacent chunks share 150 characters
    
    # Why these numbers?
    # - 1500 chars ≈ 300-400 tokens (good for most LLM APIs)
    # - 150 char overlap ≈ 10% overlap (preserves context without too much redundancy)
    # - You can adjust these based on your use case
    
    # Let's create chunks of data by a sliding window
    chunks = []
    start = 0  # Starting position in the text
    chunk_index = 0  # Track which chunk number this is

    # More advanced chunking methods exist; this simple sliding window is enough for a demo
    
    # Loop until we've processed all the text
    while start < len(text):
        # Calculate end position (don't go past text length)
        end = min(start + chunk_size, len(text))
        
        # Extract this chunk's text
        chunk_text = text[start:end]
        
        # Create a new record for this chunk
        # It contains all the original document metadata PLUS chunk-specific data
        chunk_record = {
            **record,  # All original fields (document_id, business_category, etc.)
            "chunk_id": str(uuid.uuid4()),  # Unique ID for this specific chunk
            "chunk_index": chunk_index,  # Position in sequence (0, 1, 2, ...)
            "chunk_text": chunk_text,  # The actual text content of this chunk
            "chunk_length": len(chunk_text),  # Characters in this chunk
            "chunk_word_count": len(chunk_text.split())  # Words in this chunk
        }
        
        chunks.append(chunk_record)
        
        # If we've reached the end of the text, we're done
        if end >= len(text):
            break
        
        # Move to next chunk position (with overlap)
        # Example: If chunk_size=1500 and overlap=150
        # Chunk 1: chars 0-1500
        # Chunk 2: chars 1350-2850 (starts 150 before chunk 1 ended)
        # Chunk 3: chars 2700-4200 (starts 150 before chunk 2 ended)
        start = end - overlap
        chunk_index += 1

    # After creating all chunks, add how many chunks the document has
    # This helps with progress tracking and completeness checks
    for chunk in chunks:
        chunk["total_chunks"] = len(chunks)
    
    # Return the list of chunk records
    # Ray Data's flat_map() will automatically flatten this list
    return chunks


# Apply Text Chunking to all documents
# Use flat_map() for one-to-many transformations
# One document becomes multiple chunks
print("Creating text chunks...")

# Why flat_map() instead of map()?
#   - map(): One input → One output (document → document)
#   - flat_map(): One input → Many outputs (document → multiple chunks)
#   - flat_map() automatically "flattens" the list of chunks
#
# Example:
#   Input: 100 documents
#   Output: many more chunks than documents (a 50,000-character document yields 37 chunks with the settings above)
#
# Parameters:
#   - num_cpus=0.5: Moderate CPU usage (string slicing is lightweight)

chunked_documents = quality_assessed_docs.flat_map(
    create_text_chunks,
    num_cpus=0.5
)

Creating text chunks...
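
The sliding-window arithmetic can be checked without any document text. The sketch below reproduces only the start and end offsets that `create_text_chunks` walks through, using the same `chunk_size` and `overlap` defaults; the `chunk_spans` helper is hypothetical, and the guard on `overlap` is an added assumption, since an overlap equal to or larger than the chunk size would never advance.

```python
def chunk_spans(text_length, chunk_size=1500, overlap=150):
    """Return the (start, end) character offsets of each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end >= text_length:
            break
        # Step back by `overlap` so neighboring chunks share text.
        start = end - overlap
    return spans


print(chunk_spans(4200))
print(len(chunk_spans(10103)))
print(len(chunk_spans(50000)))
```

With these settings, each chunk after the first advances by 1,350 characters, so a document with 10,103 characters of extracted text (the `Graph Theory (2005).pdf` row above) produces 8 chunks, and an empty text produces none.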


## Step 4: Data Warehouse Schema and Output

### Create data warehouse schema

In [26]:

# STEP 6: DATA WAREHOUSE SCHEMA TRANSFORMATION
# Transform the raw processing data into a clean warehouse schema
# This is the "ETL" part - Extract (done), Transform (now), Load (next)

print("Creating data warehouse schema...")


# Get today's date in ISO format (YYYY-MM-DD)
# This will be used to partition the data by date in the warehouse
processing_date = datetime.now().date().isoformat()


# Data warehouses need clean, organized schemas
# We'll select only the columns we need and organize them logically
#
# Why not keep all columns?
# - Cleaner schema = easier queries
# - Less storage space
# - Better performance
# - Clear data contracts for downstream users

warehouse_dataset = chunked_documents.select_columns([
    # PRIMARY IDENTIFIERS: Keys for joining and relationships
    "document_id",  # Links all chunks from same document
    "chunk_id",     # Unique identifier for this specific chunk
    
    # DIMENSIONAL ATTRIBUTES: Categorical data for filtering/grouping
    # These are typically used in WHERE clauses and GROUP BY
    "business_category",      # finance, legal, compliance, etc.
    "document_type",          # financial_document, legal_document, etc.
    "file_extension",         # .pdf, .docx, etc.
    "quality_rating",         # high, medium, low
    "processing_priority",    # high, medium, low
    
    # FACT MEASURES: Numeric values for aggregation and analysis
    # These are typically used in SUM(), AVG(), COUNT(), etc.
    "file_size_mb",           # Document size
    "word_count",             # Total words in document
    "chunk_word_count",       # Words in this chunk
    "quality_score",          # Numeric quality (0-4)
    "priority_score",         # Numeric priority (1-3)
    "estimated_pages",        # Page count estimate
    "chunk_index",            # Position in document (0, 1, 2, ...)
    "total_chunks",           # How many chunks total
    
    # CONTENT FIELDS: The actual data payload
    "chunk_text",             # The text content (will rename this)
    "file_name",              # Original filename
    "file_path",              # S3 location
    
    # METADATA: Processing provenance and status tracking
    "discovery_timestamp",    # When was file discovered
    "extraction_status",      # success, error, unsupported_format
    "processing_status"       # classified, processed, etc.
])

# RENAME COLUMNS for data warehouse conventions
# "chunk_text" → "text_content" (more descriptive)
warehouse_dataset = warehouse_dataset.rename_columns({
    "chunk_text": "text_content"
})

# ADD PIPELINE METADATA: Constant columns for all records
# These columns are the same for every record in this run
# They help with data lineage and debugging

def add_pipeline_metadata(batch: pd.DataFrame) -> pd.DataFrame:
    """
    Add constant metadata columns to every record.
    
    BEGINNER NOTE:
    - These columns help track WHERE and WHEN data was processed
    - Useful for debugging and auditing
    - All records in this batch get the same values
    
    Why map_batches() for constants?
    - More efficient than adding columns one at a time
    - Can add multiple columns in one pass
    - Pandas operations are fast for this
    """
    batch["processing_date"] = processing_date     # When was this processed?
    batch["pipeline_version"] = "1.0"             # Which version of pipeline?
    batch["processing_engine"] = "ray_data"       # What tool processed it?
    return batch

# Apply metadata addition
# Parameters:
#   - batch_format="pandas": Use pandas for easy column addition
#   - num_cpus=0.1: Very low CPU (just adding constants)
#   - batch_size=1000: Fairly large batches (adding constants is cheap)

warehouse_dataset = warehouse_dataset.map_batches(
    add_pipeline_metadata,
    batch_format="pandas",
    num_cpus=0.1,
    batch_size=1000
)

Creating data warehouse schema...
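
The three schema steps (select, rename, add constants) can be checked on a single record before running them at scale. The sketch below does this in plain Python. `to_warehouse_row` and the shortened `KEEP` list are illustrative assumptions rather than part of the pipeline.

```python
from datetime import date

# Shortened column list for the example; the real cell selects many more.
KEEP = ["document_id", "chunk_id", "business_category", "chunk_text"]
RENAME = {"chunk_text": "text_content"}

def to_warehouse_row(row: dict, processing_date: str) -> dict:
    """Project, rename, and stamp one record, mirroring the cell above."""
    missing = [col for col in KEEP if col not in row]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    out = {RENAME.get(col, col): row[col] for col in KEEP}
    out["processing_date"] = processing_date
    out["pipeline_version"] = "1.0"
    out["processing_engine"] = "ray_data"
    return out

raw = {
    "document_id": "doc-1",
    "chunk_id": "doc-1-0",
    "business_category": "finance",
    "chunk_text": "Quarterly revenue rose.",
    "scratch_field": "dropped by the projection",
}
row = to_warehouse_row(raw, processing_date=date(2025, 10, 15).isoformat())
print(sorted(row))
```

Failing fast on a missing column is a design choice made for this sketch. It surfaces schema drift on one record instead of partway through a distributed write.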


### Write to data warehouse with partitioning

In [27]:

# STEP 7: WRITE TO DATA WAREHOUSE - MAIN TABLE
# Save all processed chunks to Parquet format with partitioning
# This is the "Load" part of ETL

# /mnt/cluster_storage is a shared storage volume accessible by all workers
# In production, this would typically be:
# - S3: s3://your-bucket/warehouse/
# - Azure: abfs://container@account.dfs.core.windows.net/
# - GCS: gs://your-bucket/warehouse/
OUTPUT_WAREHOUSE_PATH = "/mnt/cluster_storage"

# WRITE MAIN TABLE with PARTITIONING
# write_parquet() is Ray Data's native way to save data

# Key parameters explained:

# partition_cols=["business_category", "processing_date"]
#   - Creates folder structure: business_category=finance/processing_date=2025-10-15/
#   - Enables efficient querying: "SELECT * FROM table WHERE business_category='finance'"
#   - Query engines (Spark, Presto, Athena) can skip entire partitions
#   - Example structure:
#       main_table/
#       ├── business_category=finance/
#       │   └── processing_date=2025-10-15/
#       │       ├── part-001.parquet
#       │       └── part-002.parquet
#       ├── business_category=legal/
#       │   └── processing_date=2025-10-15/
#       │       └── part-001.parquet
#       └── business_category=compliance/
#           └── processing_date=2025-10-15/
#               └── part-001.parquet

# compression="snappy"
#   - Compresses files to save storage space (the ratio depends on the data)
#   - Snappy is fast and widely supported by query engines
#   - Alternatives: gzip (higher compression), zstd (good balance)

# ray_remote_args={"num_cpus": 0.1}
#   - Writing to storage is I/O-bound, not CPU-bound
#   - Low CPU allocation allows more parallel writes

warehouse_dataset.write_parquet(
    f"{OUTPUT_WAREHOUSE_PATH}/main_table/",
    partition_cols=["business_category", "processing_date"],
    compression="snappy",
    ray_remote_args={"num_cpus": 0.1}
)

print("Main warehouse table written successfully")

2025-10-20 20:09:20,615	INFO logging.py:293 -- Registered dataset logger for dataset dataset_137_0
2025-10-20 20:09:20,622	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_137_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:09:20,622	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_137_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> LimitOperator[limit=100] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(process_file)] -> TaskPoolMapOperator[Map(enrich_business_metadata)] -> TaskPoolMapOperator[MapBatches(assess_document_quality)] -> TaskPoolMapOperator[FlatMap(create_text_chunks)] -> TaskPoolMapOperator[Project] -> TaskPoolMapOperator[MapBatches(add_pipeline_metadata)->Write]

Main warehouse table written successfully
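
As a rough illustration of how `partition_cols` maps rows to folders, the sketch below builds Hive-style directory names from a few made-up rows. `partition_dir` is a hypothetical helper. The file names that Ray Data writes inside each directory are not modeled here.

```python
from collections import Counter

def partition_dir(row: dict, partition_cols: list[str]) -> str:
    """Build the Hive-style directory (col=value/col=value/) for one row."""
    return "".join(f"{col}={row[col]}/" for col in partition_cols)

# Made-up rows; only the partition columns matter for the layout.
rows = [
    {"business_category": "finance", "processing_date": "2025-10-15", "chunk_id": "f-0"},
    {"business_category": "finance", "processing_date": "2025-10-15", "chunk_id": "f-1"},
    {"business_category": "legal", "processing_date": "2025-10-15", "chunk_id": "l-0"},
]
cols = ["business_category", "processing_date"]

# Count how many rows land in each partition directory
rows_per_dir = Counter(partition_dir(row, cols) for row in rows)
for directory, n in sorted(rows_per_dir.items()):
    print(f"main_table/{directory} -> {n} rows")
```

The order of `partition_cols` determines the nesting, so the column listed first becomes the outermost folder.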


In [28]:


# STEP 8: CREATE BUSINESS-SPECIFIC DATASETS
# Create specialized datasets for specific business teams
# Each team gets only the data they need with relevant columns


# Example: Compliance Analytics Dataset
# Compliance team needs: document content, quality, and priority
# Priority matters for compliance review workflows

compliance_analytics = warehouse_dataset.filter(
    lambda row: row["business_category"] == "compliance"
).select_columns([
    "document_id",          # Link to main table
    "chunk_id",             # Unique chunk identifier
    "text_content",         # The actual text
    "quality_score",        # Data reliability
    "processing_priority",  # high, medium, low
    "processing_date"       # When processed
])

# Write to dedicated compliance folder
compliance_analytics.write_parquet(
    f"{OUTPUT_WAREHOUSE_PATH}/analytics/compliance/",
    partition_cols=["processing_date"],
    compression="snappy",
    ray_remote_args={"num_cpus": 0.1}
)

2025-10-20 20:10:35,277	INFO logging.py:293 -- Registered dataset logger for dataset dataset_142_0
2025-10-20 20:10:35,283	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_142_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:10:35,283	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_142_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> LimitOperator[limit=100] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(process_file)] -> TaskPoolMapOperator[Map(enrich_business_metadata)] -> TaskPoolMapOperator[MapBatches(assess_document_quality)] -> TaskPoolMapOperator[FlatMap(create_text_chunks)] -> TaskPoolMapOperator[Project] -> TaskPoolMapOperator[MapBatches(add_pipeline_metadata)] -> TaskPoolMapOperator[Filter(<lambda>)->Project] -> TaskPoolMapOperator[Write]



### Create analytics summary tables

In [29]:

# STEP 9: CREATE ANALYTICS SUMMARY TABLES
# Pre-compute common aggregations for fast dashboard queries
# Summary tables = faster analytics queries


# SUMMARY TABLE 1: Processing Metrics by Category and Date

# Answer questions like:
# - How many chunks were processed per category per day?
# - How much document content does each category hold?
# - What's the average document quality by category?
#
# Note: every row is a chunk, so document-level columns (file_size_mb,
# word_count, quality_score) repeat once per chunk and the aggregates
# below are weighted by chunk count.

# Dashboards can read this small table instead of scanning every chunk

# groupby() groups data by multiple columns
# aggregate() calculates statistics for each group

processing_metrics = warehouse_dataset.groupby(["business_category", "processing_date"]).aggregate(
    Count(),                    # How many chunks per category+date?
    Sum("file_size_mb"),        # File size summed over chunks (overstates volume for multi-chunk documents)
    Mean("word_count"),         # Average document word count, weighted by chunk count
    Mean("quality_score")       # Average quality, weighted by chunk count
)

# Write summary table
# Smaller than main table, so queries are very fast
processing_metrics.write_parquet(
    f"{OUTPUT_WAREHOUSE_PATH}/summaries/processing_metrics/",
    partition_cols=["processing_date"],
    compression="snappy",
    ray_remote_args={"num_cpus": 0.1}
)

2025-10-20 20:11:59,556	INFO logging.py:293 -- Registered dataset logger for dataset dataset_147_0
2025-10-20 20:11:59,563	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_147_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:11:59,564	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_147_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> LimitOperator[limit=100] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(process_file)] -> TaskPoolMapOperator[Map(enrich_business_metadata)] -> TaskPoolMapOperator[MapBatches(assess_document_quality)] -> TaskPoolMapOperator[FlatMap(create_text_chunks)] -> TaskPoolMapOperator[Project] -> TaskPoolMapOperator[MapBatches(add_pipeline_metadata)] -> HashAggregateOperator[HashAggregate(key_columns=('business_category', 'processing_date'), num_partitions=200)] -> TaskPoolMapOperator[Write]

In [30]:

# SUMMARY TABLE 2: Quality Distribution
# Answer questions like:
# - What percentage of documents are high/medium/low quality?
# - Which categories have the highest quality scores?
# - How does quality correlate with document size?

# This helps identify data quality issues by category

quality_distribution = warehouse_dataset.groupby(["quality_rating", "business_category"]).aggregate(
    Count(),                        # How many chunks per quality+category?
    Mean("word_count"),             # Average document word count (chunk-weighted)
    Mean("chunk_word_count")        # Average chunk size by quality
)

# Write quality summary
# Used for quality monitoring dashboards

quality_distribution.write_parquet(
    f"{OUTPUT_WAREHOUSE_PATH}/summaries/quality_distribution/",
    compression="snappy",
    ray_remote_args={"num_cpus": 0.1}
)

2025-10-20 20:13:15,036	INFO logging.py:293 -- Registered dataset logger for dataset dataset_152_0
2025-10-20 20:13:15,045	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_152_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:13:15,045	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_152_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> LimitOperator[limit=100] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(process_file)] -> TaskPoolMapOperator[Map(enrich_business_metadata)] -> TaskPoolMapOperator[MapBatches(assess_document_quality)] -> TaskPoolMapOperator[FlatMap(create_text_chunks)] -> TaskPoolMapOperator[Project] -> TaskPoolMapOperator[MapBatches(add_pipeline_metadata)] -> HashAggregateOperator[HashAggregate(key_columns=('quality_rating', 'business_category'), num_partitions=200)] -> TaskPoolMapOperator[Write]
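
Because every row in `warehouse_dataset` is a chunk, document-level columns such as `file_size_mb` repeat once per chunk. The plain-Python sketch below uses made-up rows to show how chunk-level and document-level aggregates diverge. Keeping one row per `document_id` before aggregating is one possible remedy, offered here as an assumption rather than as the tutorial's method.

```python
from collections import defaultdict

# Made-up chunk rows: document "a" has 3 chunks, document "b" has 1.
chunks = [
    {"document_id": "a", "business_category": "finance", "file_size_mb": 2.0},
    {"document_id": "a", "business_category": "finance", "file_size_mb": 2.0},
    {"document_id": "a", "business_category": "finance", "file_size_mb": 2.0},
    {"document_id": "b", "business_category": "finance", "file_size_mb": 1.0},
]

def summarize(rows: list[dict]) -> dict:
    """Count rows and sum file_size_mb per business_category."""
    totals = defaultdict(lambda: {"count": 0, "file_size_mb": 0.0})
    for row in rows:
        group = totals[row["business_category"]]
        group["count"] += 1
        group["file_size_mb"] += row["file_size_mb"]
    return dict(totals)

# Aggregating chunk rows directly counts document "a" three times
chunk_level = summarize(chunks)

# Keep one row per document, then aggregate
one_per_doc = list({row["document_id"]: row for row in chunks}.values())
doc_level = summarize(one_per_doc)

print(chunk_level["finance"])
print(doc_level["finance"])
```

Whether the chunk-weighted or the document-level number is the right one depends on the dashboard question, so it helps to name summary columns accordingly.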

## Step 5: Verification and Summary

After writing data to the warehouse, it's important to verify everything worked correctly. This section demonstrates:

**Why verification matters:**
- Ensures data was written successfully
- Validates record counts match expectations
- Confirms schema is correct
- Provides sample data for visual inspection

**What we'll verify:**
1. Main table record count (should be 10,000+ chunks)
2. Summary tables exist and have data
3. Schema includes all expected columns
4. Sample records look correct

### Verify data warehouse outputs

In [31]:

# STEP 10: VERIFY DATA WAREHOUSE OUTPUT
# Always verify your data pipeline worked correctly
# This is a critical production practice
print("Verifying data warehouse integration...")

# Use Ray Data's read_parquet() to read what we just wrote
# This verifies:
# 1. Files were written successfully
# 2. Partitioning works correctly
# 3. Data can be read back (no corruption)
#
# Ray Data will automatically discover all partitions:
# - main_table/business_category=finance/processing_date=2025-10-15/*.parquet
# - main_table/business_category=legal/processing_date=2025-10-15/*.parquet
# - etc.
main_table_verify = ray.data.read_parquet(
    f"{OUTPUT_WAREHOUSE_PATH}/main_table/",
    ray_remote_args={"num_cpus": 0.025}  # Low CPU for reading
)

# Verify our aggregated metrics tables also wrote successfully
metrics_verify = ray.data.read_parquet(
    f"{OUTPUT_WAREHOUSE_PATH}/summaries/processing_metrics/",
    ray_remote_args={"num_cpus": 0.025}
)

print(f"Data warehouse verification:")
print(f"  Main table records: {main_table_verify.count():,}")
print(f"  Processing metrics: {metrics_verify.count():,}")
print(f"  Schema compatibility: Verified")

2025-10-20 20:14:31,038	INFO logging.py:293 -- Registered dataset logger for dataset dataset_156_0
2025-10-20 20:14:31,042	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_156_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:14:31,042	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_156_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[MapBatches(count_rows)]


Verifying data warehouse integration...
Data warehouse verification:


2025-10-20 20:14:31,388	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_156_0 execution finished in 0.35 seconds
2025-10-20 20:14:31,395	INFO logging.py:293 -- Registered dataset logger for dataset dataset_157_0
2025-10-20 20:14:31,399	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_157_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:14:31,400	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_157_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[MapBatches(count_rows)]
2025-10-20 20:14:31,507	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_157_0 execution finished in 0.11 seconds


  Main table records: 115,977
  Processing metrics: 16
  Schema compatibility: Verified
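
The cell above prints `Schema compatibility: Verified` without comparing any column names. A minimal sketch of an explicit check follows. `check_schema` and the shortened `EXPECTED` list are hypothetical, and the `actual` list is hard-coded so the example runs without a cluster.

```python
def check_schema(actual_columns, expected_columns):
    """Return (missing, unexpected) column names, each sorted."""
    actual, expected = set(actual_columns), set(expected_columns)
    return sorted(expected - actual), sorted(actual - expected)

# Shortened expectation for the example
EXPECTED = ["document_id", "chunk_id", "text_content", "business_category", "processing_date"]

# With Ray Data, the names would come from the dataset that was read back,
# for example: actual = main_table_verify.schema().names
# Here a list with a deliberate mistake (chunk_text was never renamed) is used.
actual = ["document_id", "chunk_id", "chunk_text", "business_category", "processing_date"]

missing, unexpected = check_schema(actual, EXPECTED)
print("missing:", missing)
print("unexpected:", unexpected)
```

Reporting unexpected columns as well as missing ones makes a forgotten rename easy to spot, because the old and new names show up as a pair.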


In [32]:
# INSPECT SAMPLE DATA

# take(10) gets first 10 records for manual inspection
# This helps catch issues like:
# - Wrong data types
# - Missing fields
# - Incorrect values
# - Encoding problems
samples = main_table_verify.take(10)

# Display key fields from each sample record
for i, record in enumerate(samples):
    # Show abbreviated document ID (first 8 characters)
    doc_id = record['document_id'][:8]
    category = record['business_category']
    words = record['word_count']
    quality = record['quality_rating']
    
    print(f"\t{i+1}. Doc: {doc_id}, Category: {category}, Words: {words}, Quality: {quality}")

2025-10-20 20:14:31,624	INFO logging.py:293 -- Registered dataset logger for dataset dataset_158_0
2025-10-20 20:14:31,629	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_158_0. Full logs are in /tmp/ray/session_2025-10-20_17-37-47_219984_2353/logs/ray-data
2025-10-20 20:14:31,630	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_158_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> LimitOperator[limit=10]
2025-10-20 20:14:32,076	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_158_0 execution finished in 0.45 seconds


	1. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high
	2. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high
	3. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high
	4. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high
	5. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high
	6. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high
	7. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high
	8. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high
	9. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high
	10. Doc: 9278d4de, Category: compliance, Words: 6925, Quality: high


## Summary and Next Steps

Congratulations! You've built a complete end-to-end document ingestion pipeline using Ray Data. Let's review what you learned and where to go from here.

### What You Built

**Complete ETL Pipeline**: Extract → Transform → Load
1. **Extract**: Read 100 documents from S3 data lake
2. **Transform**: Extract text, classify, assess quality, create chunks
3. **Load**: Write to partitioned data warehouse with analytics tables

**Final Output**: From raw documents to structured warehouse
- **Main table**: 10,000+ text chunks ready for analysis (the verification read-back above counted 115,977 rows)
- **Business datasets**: A compliance-specific view (the same pattern extends to other teams)
- **Summary tables**: Pre-computed metrics for dashboards
- **Partitioned storage**: Optimized for query performance

### Ray Data Operations You Used

This pipeline demonstrated the core Ray Data operations:

| Operation | Purpose | When to Use |
|-----------|---------|-------------|
| `read_binary_files()` | Load documents from S3/storage | Reading PDFs, images, any binary files |
| `map()` | Transform each record individually | Variable-size processing, I/O-bound tasks |
| `map_batches()` | Transform records in batches | Batch-optimized operations, ML inference |
| `flat_map()` | One-to-many transformations | Chunking, splitting, exploding data |
| `filter()` | Keep/remove records | Selecting subsets, data quality filtering |
| `select_columns()` | Choose specific fields | Schema projection, reducing data size |
| `rename_columns()` | Change column names | Schema standardization |
| `groupby().aggregate()` | Calculate statistics | Analytics, metrics, summaries |
| `write_parquet()` | Save to warehouse | Final output, checkpointing |

### Key Concepts for Beginners

**1. Distributed Processing**
- Your code runs on a cluster of machines (not just one)
- Ray Data automatically distributes work across workers
- Each function (`process_file`, `assess_document_quality`) runs in parallel across tasks
- Speedup depends on how many tasks the cluster can run at once, so 100 documents are not automatically processed 100x faster

**2. Lazy Evaluation**
- Operations like `map()` and `filter()` don't execute immediately
- Ray builds a plan and optimizes it
- Execution happens when you call `write_parquet()`, `count()`, or `take()`
- This allows Ray to optimize the entire pipeline
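
Python generators offer a rough analogy for lazy evaluation. It is an analogy only: Ray Data also plans and optimizes the pipeline, which generators do not do. In the sketch below, `extract` is a made-up transformation that records each call so you can see when work actually happens.

```python
calls = []

def extract(doc: str) -> str:
    """Made-up transformation that records when it actually runs."""
    calls.append(doc)
    return doc.upper()

docs = ["a", "b", "c"]

# Building the pipeline: no transformation has run yet
pipeline = (extract(d) for d in docs)
pipeline = (d for d in pipeline if d != "B")
executed_before = len(calls)

# Consuming the pipeline triggers execution, much like count() or write_parquet()
result = list(pipeline)
executed_after = len(calls)

print(executed_before, result, executed_after)
```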

**3. Resource Management**
- `num_cpus`: How many CPU cores per task
- `concurrency`: How many tasks run in parallel
- `batch_size`: How many records per batch
- Balance these based on your workload
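
A back-of-the-envelope calculation shows how `num_cpus` bounds parallelism. The sketch assumes CPU is the only constraint and that tasks take roughly equal time. It ignores memory, `concurrency` caps, and streaming backpressure, and the 8-CPU cluster is hypothetical.

```python
import math

def max_parallel_tasks(cluster_cpus: float, num_cpus_per_task: float) -> int:
    """Upper bound on tasks that fit at once when CPU is the only constraint."""
    if num_cpus_per_task <= 0:
        raise ValueError("num_cpus_per_task must be positive")
    return math.floor(cluster_cpus / num_cpus_per_task)

def scheduling_waves(num_tasks: int, cluster_cpus: float, num_cpus_per_task: float) -> int:
    """Rounds needed if every task takes about the same time."""
    slots = max_parallel_tasks(cluster_cpus, num_cpus_per_task)
    if slots == 0:
        raise ValueError("a single task requests more CPUs than the cluster has")
    return math.ceil(num_tasks / slots)

# Hypothetical 8-CPU cluster
print(max_parallel_tasks(8, 0.5))     # 16
print(max_parallel_tasks(8, 2))       # 4
print(scheduling_waves(100, 8, 0.5))  # 7
```

Under these assumptions, 100 tasks at `num_cpus=0.5` run in about 7 rounds on 8 CPUs, which is a sizable speedup but well short of 100x.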

**4. Partitioning Strategy**
- Partitions = folders organized by column values
- `partition_cols=["business_category", "processing_date"]`
- Query engines skip entire folders when filtering
- Enables efficient query performance by reducing data scanned
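
Partition pruning can be sketched as a path filter. The file list mirrors the example layout shown in the write step. `prune` is a hypothetical helper that imitates what a query engine does with a `WHERE` clause on partition columns; real engines implement this internally.

```python
def prune(paths: list[str], **filters: str) -> list[str]:
    """Keep only files whose Hive-style path matches every col=value filter."""
    wanted = {f"{col}={value}" for col, value in filters.items()}
    return [p for p in paths if wanted <= set(p.split("/"))]

files = [
    "main_table/business_category=finance/processing_date=2025-10-15/part-001.parquet",
    "main_table/business_category=finance/processing_date=2025-10-15/part-002.parquet",
    "main_table/business_category=legal/processing_date=2025-10-15/part-001.parquet",
    "main_table/business_category=compliance/processing_date=2025-10-15/part-001.parquet",
]

# Equivalent in spirit to: WHERE business_category = 'finance'
to_scan = prune(files, business_category="finance")
print(f"scanning {len(to_scan)} of {len(files)} files")
```

Filters on columns that are not partition columns get no such shortcut, which is why partition columns should match the most common query predicates.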

### Implementation Patterns Applied

**Code Organization**:
- Separate functions for each processing stage
- Clear docstrings explaining purpose
- Type hints for inputs and outputs
- Comments explaining "why" not just "what"

**Ray Data Implementation Patterns**:
- Use `batch_format="pandas"` for clarity
- Process text early (don't pass binary through pipeline)
- Appropriate resource allocation per operation type
- Partition writes for query optimization
- Use native Ray Data operations (not custom code)

**Data Engineering Patterns**:
- Immediate text extraction (reduces memory)
- Separate classification stage (easier debugging)
- Quality assessment (data validation)
- Schema transformation (clean warehouse schema)
- Verification step (always check output)

### Production Recommendations

**Scaling to Production:**

1. **Remove the `.limit(100)` to process full dataset**
   - Currently processing 100 docs for demo
   - Remove this to process millions of documents
   - No other code changes are needed beyond removing that one call

2. **Tune resource parameters for your cluster**
   ```python
   # For larger clusters, increase parallelism:
   concurrency=50     # More parallel tasks
   batch_size=5000    # Larger batches
   num_cpus=2         # More CPU per task
   ```

3. **Add error handling and retry logic**
   ```python
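   # For production, catch specific errors.
   # CorruptedFileError is a placeholder: substitute the exception
   # your parsing library actually raises.
   try:
       elements = partition(file=stream)
   except CorruptedFileError as exc:
       print(f"Skipping corrupted file: {exc}")  # Log and skip
       elements = []
   except TimeoutError:
       raise  # Re-raise so the task can be retried with backoff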
   ```

4. **Monitor with Ray Dashboard**
   - View real-time progress
   - Check resource utilization
   - Identify bottlenecks
   - Debug failures

5. **Implement incremental processing**
   ```python
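   # Only process documents that arrived since the last run.
   # "last_modified" is a hypothetical source-side timestamp column;
   # processing_date cannot be used here because this pipeline assigns it.
   new_docs = all_docs.filter(
       lambda row: row["last_modified"] > last_run_date
   )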
   ```

6. **Add data quality checks**
   - Validate schema before writing
   - Check for null values
   - Verify foreign key relationships
   - Monitor quality metrics over time
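
The data quality checks in item 6 can be prototyped on a small batch of rows before wiring them into the pipeline. The sketch below is plain Python with made-up rows. `quality_issues` is a hypothetical helper, and the chunk-index range check stands in for a fuller referential check between chunks and documents.

```python
def quality_issues(rows: list[dict], required: list[str]) -> list[str]:
    """Return human-readable problems found in a batch of chunk rows."""
    issues = []
    seen_chunk_ids = set()
    for i, row in enumerate(rows):
        # Null check on required columns
        for col in required:
            if row.get(col) is None:
                issues.append(f"row {i}: {col} is null")
        # Uniqueness of the chunk key
        chunk_id = row.get("chunk_id")
        if chunk_id is not None:
            if chunk_id in seen_chunk_ids:
                issues.append(f"row {i}: duplicate chunk_id {chunk_id}")
            seen_chunk_ids.add(chunk_id)
        # Chunk position must fall inside the parent document
        index, total = row.get("chunk_index"), row.get("total_chunks")
        if index is not None and total is not None and not (0 <= index < total):
            issues.append(f"row {i}: chunk_index {index} outside 0..{total - 1}")
    return issues

rows = [
    {"chunk_id": "a-0", "document_id": "a", "text_content": "ok", "chunk_index": 0, "total_chunks": 2},
    {"chunk_id": "a-1", "document_id": "a", "text_content": None, "chunk_index": 1, "total_chunks": 2},
    {"chunk_id": "a-1", "document_id": "a", "text_content": "dup", "chunk_index": 5, "total_chunks": 2},
]
found = quality_issues(rows, required=["document_id", "chunk_id", "text_content"])
for issue in found:
    print(issue)
```

Returning a list of issues instead of raising on the first one lets a run report every problem in a batch, which tends to be more useful when monitoring quality over time.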

### What You Learned

**Ray Data Fundamentals**:
- How to read from cloud storage (S3)
- Distributed data processing patterns
- Batch vs. row-based operations
- Resource management and tuning
- Writing to data warehouses

**Data Engineering Skills**:
- ETL pipeline design
- Document processing at scale
- Quality assessment strategies
- Data warehouse schema design
- Partitioning for performance

**Production Practices**:
- Verification and testing
- Error handling approaches
- Resource optimization
- Monitoring and debugging
- Scalability considerations

### Next Steps

**Extend This Pipeline:**
1. Add LLM-based content analysis (replace pattern matching)
2. Implement named entity recognition (NER)
3. Add sentiment analysis for customer documents
4. Create vector embeddings for semantic search
5. Integrate with Delta Lake or Apache Iceberg

**Learn More Ray Data:**
- **Batch Inference**: Process documents with ML models
- **Data Quality**: Advanced validation patterns
- **Performance Tuning**: Optimize for your workload
- **Integration**: Connect to Snowflake, Databricks, etc.

### Resources

- **Ray Data Documentation**: https://docs.ray.io/en/latest/data/data.html
- **Ray Data Examples**: https://docs.ray.io/en/latest/data/examples/examples.html
- **Ray Dashboard Guide**: https://docs.ray.io/en/latest/ray-observability/getting-started.html
- **Anyscale Platform**: https://docs.anyscale.com/

---

**You're now ready to build production-scale document ingestion pipelines with Ray Data!**