# Part 2: Advanced Optimization

**Time to complete**: 20 min | **Difficulty**: Intermediate | **Prerequisites**: Complete Part 1

---

## What You'll Learn

- Performance optimization decision framework
- Systematic optimization process with visual guides
- Observability tools for monitoring and debugging
- Production deployment best practices

---

## Optimization Framework

Ray Data performance tuning follows a clear hierarchy. Most issues can be resolved with simple parameter adjustments—always start with the simplest solutions first.

<div style="background-color: #e3f2fd; padding: 15px; border-left: 4px solid #2196F3; margin: 20px 0;">
<strong>Core Principle</strong><br>
Start with the simplest optimization first. Most performance issues can be solved with <code>num_cpus</code> adjustments alone.
</div>

### Three-Level Optimization Hierarchy

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #4CAF50; color: white;">
<th style="padding: 12px; text-align: left;">Level</th>
<th style="padding: 12px; text-align: left;">Complexity</th>
<th style="padding: 12px; text-align: left;">When to Use</th>
<th style="padding: 12px; text-align: left;">Share of Issues Resolved</th>
</tr>
<tr style="background-color: #e8f5e9;">
<td style="padding: 10px;"><strong>1. num_cpus</strong></td>
<td style="padding: 10px;">Simple</td>
<td style="padding: 10px;">Low CPU utilization, imbalanced stages</td>
<td style="padding: 10px;">80% of issues</td>
</tr>
<tr style="background-color: #fff3e0;">
<td style="padding: 10px;"><strong>2. batch_size</strong></td>
<td style="padding: 10px;">Medium</td>
<td style="padding: 10px;">Memory issues, GPU OOM</td>
<td style="padding: 10px;">15% of issues</td>
</tr>
<tr style="background-color: #ffebee;">
<td style="padding: 10px;"><strong>3. DataContext configs</strong></td>
<td style="padding: 10px;">Complex</td>
<td style="padding: 10px;">Advanced requirements only</td>
<td style="padding: 10px;">5% of issues</td>
</tr>
</table>

Each level increases in complexity and has diminishing returns. Level 1 optimizations are simple, safe, and highly effective. Level 2 requires understanding memory constraints. Level 3 should only be used when Levels 1 and 2 fail to resolve your issue.

---

## Quick Decision Guide

### Master Decision Tree


```
                    ┌─────────────────────────────┐
                    │  What's your main symptom?  │
                    └────────────┬────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
    ┌────▼────┐            ┌─────▼─────┐          ┌──────▼──────┐
    │ Too Slow│            │ Crashing  │          │  Imbalanced │
    └────┬────┘            └─────┬─────┘          └──────┬──────┘
         │                       │                       │
         │                       │                       │
    Check CPU & GPU         Check Logs              Compare Stage
    Utilization            & Memory                   Progress Bars
         │                       │                       │
    ┌────┴────┐            ┌─────┴─────┐          ┌──────┴──────┐
    │         │            │           │          │             │
Low CPU  Low GPU       Workers     GPU OOM     One Stage     Pipeline
(<50%)   (<50%)         Killed                   Slow         Stalls
  │         │              │           │          │             │
  │         │              │           │          │             │
  ▼         ▼              ▼           ▼          ▼             ▼


LOW CPU UTILIZATION (<50%)
─────────────────────────────
1. Check which stage is slow:
   
   I/O Operations (read/write)
   └─→ num_cpus = 0.025-0.1
       Reason: Hide network/disk latency with high concurrency
   
   Simple Transforms (filter, map)
   └─→ num_cpus = 0.1-0.25
       Reason: Fast operations benefit from high parallelism
   
   Complex CPU Work (preprocessing)
   └─→ num_cpus = 0.25-0.5
       Reason: Balance parallelism with task overhead
   
   Heavy Compute (CPU inference)
   └─→ num_cpus = 2.0-4.0
       Reason: Minimize scheduling overhead


LOW GPU UTILIZATION (<50%)
─────────────────────────────
1. Check if CPUs are busy:
   
   YES: Data preprocessing is bottleneck
   └─→ Decrease num_cpus on CPU stages (0.5 → 0.25)
   └─→ Increase preprocessing concurrency
   └─→ Consider GPU preprocessing if available
   
   NO: Batch size too small
   └─→ Increase batch_size (32 → 64 → 128)
   └─→ Check GPU memory allows larger batches
   
   Spiky GPU usage:
   └─→ Increase batch_size for smoother utilization
   └─→ Increase prefetch_batches if available


WORKERS KILLED (OOM)
─────────────────────────────
1. Check error message:
   
   "Killed" or "Out of memory"
   └─→ Increase num_cpus to reduce parallelism
       (0.5 → 1.0 → 2.0)
   └─→ Reduce concurrency parameter
   └─→ Decrease batch_size if applicable
   
   "Ray object store full"
   └─→ Increase num_cpus across ALL stages
   └─→ Reduce number of blocks (override_num_blocks)
   └─→ Decrease target_max_block_size (128MB → 64MB)
   
   Still failing?
   └─→ Enable eager_free in DataContext
   └─→ Increase object store memory fraction


GPU OUT OF MEMORY
─────────────────────────────
1. First: Reduce batch_size
   └─→ 128 → 64 → 32 → 16 → 8
   
2. Still OOM? Separate CPU/GPU:
   └─→ Preprocess on CPU nodes
   └─→ Inference on GPU nodes
   └─→ Use accelerator_type parameter
   
3. Still OOM? Advanced options:
   └─→ Enable mixed precision (fp16/bf16)
   └─→ Disable gradient tracking (torch.inference_mode())
   └─→ Enable model parallelism


ONE STAGE SLOW (IMBALANCED)
─────────────────────────────
1. Identify the slow stage:
   └─→ Check operator progress bars (watch out for backpressure)
   
2. Adjust num_cpus for that stage:
   
   Stage has empty output queue:
   └─→ Decrease num_cpus (increase parallelism)
   
   Stage has large input queue:
   └─→ Increase num_cpus (reduce parallelism)
       OR decrease upstream num_cpus
   
3. Check for data skew:
   └─→ Some tasks much slower than others?
   └─→ Use repartition() for better distribution
   └─→ Check ds.stats() to see whether block sizes look reasonable


PIPELINE STALLS (NO PROGRESS)
─────────────────────────────
1. Check progress bars:
   
   All stages stuck:
   └─→ Remove .count(), .show(), .schema() calls
   └─→ Each call can trigger additional pipeline execution
   
   One stage stuck:
   └─→ Check for errors in that stage
   └─→ Enable verbose logging
   └─→ Check actor stack traces
   
   Intermittent stalls:
   └─→ Network issues or rate limiting
   └─→ Check retry configuration
   └─→ Increase io_timeout

   Scheduling issues:
   └─→ Look for a misconfigured stage that is blocking other stages
```
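The LOW CPU UTILIZATION branch of the tree can be written down as a small lookup. This is a minimal sketch: `NUM_CPUS_RANGES` and `suggest_num_cpus` are hypothetical names, not part of Ray, and the ranges are the starting points quoted in this guide (the heavy-compute range follows the 2.0-4.0 figure used in the tables below), not values that Ray computes.

```python
# Hypothetical lookup of starting num_cpus ranges, copied from this guide.
NUM_CPUS_RANGES = {
    "io": (0.025, 0.1),                # read/write: hide latency with concurrency
    "simple_transform": (0.1, 0.25),   # filter, lightweight map
    "cpu_preprocessing": (0.25, 0.5),  # heavier per-row CPU work
    "heavy_compute": (2.0, 4.0),       # CPU inference
}


def suggest_num_cpus(operation_type: str) -> tuple[float, float]:
    """Return the (low, high) starting range for a stage's num_cpus."""
    try:
        return NUM_CPUS_RANGES[operation_type]
    except KeyError:
        raise ValueError(f"Unknown operation type: {operation_type!r}") from None


low, high = suggest_num_cpus("io")
print(f"Start an I/O stage between num_cpus={low} and num_cpus={high}")
```

Treat the returned range as a first guess to measure against, not as a final setting.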

<div style="background-color: #fff9c4; padding: 15px; border-left: 4px solid #FFC107; margin: 20px 0;">
<strong>The num_cpus Paradox</strong><br>
<strong>Lower num_cpus = MORE parallelism!</strong>
<ul>
<li><code>num_cpus=4.0</code> → Only 4 tasks on 16-CPU machine</li>
<li><code>num_cpus=0.5</code> → 32 tasks on 16-CPU machine</li>
</ul>
Use LOW values (0.025-0.1) for I/O operations, HIGH values (2.0-4.0) for CPU-intensive work.
</div>

The `num_cpus` parameter tells Ray Data how many CPUs to *reserve* for each task. When you set `num_cpus=4.0`, Ray reserves 4 CPUs for each task, so only a few tasks run simultaneously. When you set `num_cpus=0.5`, Ray reserves half a CPU per task, allowing many more tasks to run in parallel—beneficial for I/O-bound operations where tasks spend time waiting for data.

Note that `num_cpus` is a scheduling hint, not hardware isolation: Ray uses it to decide how many tasks to place on a node, but it does not cap what a task actually consumes. Watch out for fractional CPU or GPU reservations that oversubscribe a node.
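The arithmetic behind the paradox is a single division. The sketch below uses a hypothetical helper, `max_concurrent_tasks`, and considers only the CPU reservation; in a real cluster, memory limits and backpressure can lower the number further.

```python
import math


def max_concurrent_tasks(node_cpus: float, num_cpus_per_task: float) -> int:
    """Upper bound on simultaneously scheduled tasks, from CPU reservations alone."""
    if num_cpus_per_task <= 0:
        raise ValueError("num_cpus_per_task must be positive")
    # The small epsilon guards against floating-point error in the division.
    return math.floor(node_cpus / num_cpus_per_task + 1e-9)


# Reproduce the two cases from the callout for a 16-CPU machine.
for reservation in (4.0, 0.5):
    print(f"num_cpus={reservation}: up to {max_concurrent_tasks(16, reservation)} tasks")
```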

---

## Resource Allocation Decision Matrix

### By Operation Type

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #3f51b5; color: white;">
<th style="padding: 12px;">Operation</th>
<th style="padding: 12px;">num_cpus</th>
<th style="padding: 12px;">batch_size</th>
<th style="padding: 12px;">Why?</th>
</tr>
<tr style="background-color: #f5f5f5;">
<td style="padding: 10px;">Data loading</td>
<td style="padding: 10px;"><code>0.025-0.1</code></td>
<td style="padding: 10px;">N/A</td>
<td style="padding: 10px;">I/O bound, hide latency with high concurrency</td>
</tr>
<tr>
<td style="padding: 10px;">Filtering</td>
<td style="padding: 10px;"><code>0.1-0.25</code></td>
<td style="padding: 10px;">N/A</td>
<td style="padding: 10px;">Fast operation, high parallelism beneficial</td>
</tr>
<tr style="background-color: #f5f5f5;">
<td style="padding: 10px;">CPU preprocessing</td>
<td style="padding: 10px;"><code>0.25-0.5</code></td>
<td style="padding: 10px;">100-1000</td>
<td style="padding: 10px;">Balance parallelism with task overhead</td>
</tr>
<tr>
<td style="padding: 10px;">CPU inference</td>
<td style="padding: 10px;"><code>2.0-4.0</code></td>
<td style="padding: 10px;">16-32</td>
<td style="padding: 10px;">Heavy compute, minimize task overhead</td>
</tr>
<tr style="background-color: #f5f5f5;">
<td style="padding: 10px;">GPU inference</td>
<td style="padding: 10px;"><code>1.0</code> (CPU)<br><code>1.0</code> (GPU)</td>
<td style="padding: 10px;">32-128</td>
<td style="padding: 10px;">Match GPU memory, maximize utilization</td>
</tr>
</table>

### By Hardware Configuration

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #9c27b0; color: white;">
<th style="padding: 12px;">Scenario</th>
<th style="padding: 12px;">Configuration</th>
<th style="padding: 12px;">Rationale</th>
</tr>
<tr style="background-color: #f3e5f5;">
<td style="padding: 10px;"><strong>GPU Cluster</strong><br>(T4/A10G/A100)</td>
<td style="padding: 10px;">
• concurrency = # GPUs<br>
• batch_size = 32-128<br>
• num_cpus = 1.0 per GPU
</td>
<td style="padding: 10px;">One actor per GPU, maximize GPU utilization</td>
</tr>
<tr>
<td style="padding: 10px;"><strong>CPU Cluster</strong><br>(No GPUs)</td>
<td style="padding: 10px;">
• concurrency = CPUs / 4<br>
• batch_size = 16-32<br>
• num_cpus = 4.0 per actor
</td>
<td style="padding: 10px;">Balance actor count with CPU resources</td>
</tr>
<tr style="background-color: #f3e5f5;">
<td style="padding: 10px;"><strong>Mixed Cluster</strong><br>(CPU + GPU nodes)</td>
<td style="padding: 10px;">
• Separate CPU/GPU stages<br>
• CPU: num_cpus=0.25<br>
• GPU: as above
</td>
<td style="padding: 10px;">Prevent CPU tasks from running on GPU nodes</td>
</tr>
</table>
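The concurrency rules in the hardware table reduce to a short calculation. This sketch assumes you already know the cluster totals; `actor_concurrency` is a hypothetical helper, and the default of 4 CPUs per actor is the value from the CPU Cluster row.

```python
def actor_concurrency(total_cpus: int, total_gpus: int, cpus_per_actor: float = 4.0) -> int:
    """Pick an actor count using this guide's rules of thumb."""
    if cpus_per_actor <= 0:
        raise ValueError("cpus_per_actor must be positive")
    if total_gpus > 0:
        # GPU cluster: one actor per GPU.
        return total_gpus
    # CPU cluster: total CPUs divided by CPUs reserved per actor, at least 1.
    return max(1, int(total_cpus // cpus_per_actor))


print("GPU cluster (64 CPUs, 4 GPUs):", actor_concurrency(64, 4))
print("CPU cluster (32 CPUs, no GPUs):", actor_concurrency(32, 0))
```

The result is meant to be passed as the `concurrency` argument of the inference stage.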

---

## Systematic Optimization Process

### Six-Step Workflow

```
┌──────────────┐
│  1. Monitor  │  Enable progress bars, check dashboards
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  2. Baseline │  Measure current performance
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  3. Identify │  Find the bottleneck stage
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  4. One Fix  │  Apply single optimization
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  5. Measure  │  Calculate improvement
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  6. Repeat   │  Continue with next bottleneck
└──────────────┘
```

**Key practices:**

1. **Monitor**: Enable progress bars and open Ray Dashboard before optimizing
2. **Baseline**: Write down how long your pipeline takes
3. **Identify**: Use progress bars to see which stage is slowest
4. **One Fix**: Apply a single optimization—never change multiple parameters at once
5. **Measure**: Calculate improvement percentage
6. **Repeat**: Move to the next bottleneck or try a different optimization

### Measurement Template

```python
import time

import ray

# Enable monitoring
ctx = ray.data.DataContext.get_current()
ctx.enable_progress_bars = True
ctx.enable_operator_progress_bars = True

# Measure baseline (ds is the dataset built by your pipeline)
start = time.time()
ds.write_parquet("output.parquet")
baseline = time.time() - start
print(f"Baseline: {baseline:.2f}s")

# After applying one optimization: rebuild ds, restart the timer, and rerun
start = time.time()
ds.write_parquet("output_optimized.parquet")
new_time = time.time() - start
improvement = (baseline - new_time) / baseline * 100
print(f"Improvement: {improvement:.1f}%")
```
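The same measurement can be wrapped in two reusable helpers so that every run is timed the same way. This is a sketch that does not depend on Ray: `time_run` and `improvement_pct` are hypothetical names, and the `time.sleep` calls stand in for real pipeline runs.

```python
import time
from typing import Callable


def time_run(run_pipeline: Callable[[], None]) -> float:
    """Run the pipeline once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    run_pipeline()
    return time.perf_counter() - start


def improvement_pct(baseline_s: float, new_s: float) -> float:
    """Percentage improvement of new_s over baseline_s (negative means slower)."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (baseline_s - new_s) / baseline_s * 100


# Stand-ins for "run the pipeline before and after one change"; replace each
# lambda with a call such as ds.write_parquet(...) on your own dataset.
baseline = time_run(lambda: time.sleep(0.02))
optimized = time_run(lambda: time.sleep(0.01))
print(f"Baseline: {baseline:.3f}s, optimized: {optimized:.3f}s")
print(f"Improvement: {improvement_pct(baseline, optimized):.1f}%")
```

Timing each run from its own start point avoids accidentally including the baseline run in the optimized measurement.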

---

## Observability Tools

### Progress Bars

Ray Data provides two types of progress bars for monitoring:

**Main Progress Bar**: Shows overall operation progress including total rows, execution time, and high-level resource usage.

**Operator Progress Bars**: Display individual stage progress with detailed metrics per operator, showing which stage is the bottleneck.

**Configuration:**

```python
ctx = ray.data.DataContext.get_current()
ctx.enable_progress_bars = True  # Overall progress
ctx.enable_operator_progress_bars = True  # Per-stage details
ctx.enable_progress_bar_name_truncation = False  # See full operator names
```

**When to use:**
- Development/debugging: Enable both for real-time feedback
- Production: Disable for performance but keep metrics collection
- Notebooks: Enable for debugging, disable for cleaner saved output

**Key metrics to watch:**
- Rows processed (if stalled, pipeline may be hung)
- Resource usage (low CPU <50% needs more parallelism)
- Stage timing (longest stage is your bottleneck)
- Block counts (large queues indicate backpressure)

### Ray Dashboard

Access at `http://localhost:8265` (local) or through your cluster management system.

**Critical tabs for optimization:**

**Cluster Tab**: Node-level resource utilization (CPU, GPU, memory, disk). Look for idle CPUs (need more parallelism), underutilized GPUs (data loading bottleneck), or memory near capacity (risk of spilling/OOM).

**Jobs Tab**: Resource usage over time for active and completed jobs. Use to compare performance before and after optimization.

**Metrics Tab**: Time-series graphs of system and application metrics for trend analysis.

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #1976d2; color: white;">
<th style="padding: 12px;">Metric</th>
<th style="padding: 12px;">What to Look For</th>
<th style="padding: 12px;">Optimization Action</th>
</tr>
<tr style="background-color: #e3f2fd;">
<td style="padding: 10px;"><strong>CPU Utilization</strong></td>
<td style="padding: 10px;">% of CPU cores actively computing</td>
<td style="padding: 10px;">
Low (<50%): Decrease num_cpus<br>
High (>90%): Well-optimized
</td>
</tr>
<tr>
<td style="padding: 10px;"><strong>GPU Utilization</strong></td>
<td style="padding: 10px;">% of GPU compute active</td>
<td style="padding: 10px;">
Low (<50%): Data loading bottleneck<br>
Spiky: Increase batch size
</td>
</tr>
<tr style="background-color: #e3f2fd;">
<td style="padding: 10px;"><strong>Memory Usage</strong></td>
<td style="padding: 10px;">RAM and object store consumption</td>
<td style="padding: 10px;">
Near capacity: Reduce parallelism<br>
Spilling: Increase num_cpus
</td>
</tr>
<tr>
<td style="padding: 10px;"><strong>Object Store</strong></td>
<td style="padding: 10px;">Memory for storing data blocks</td>
<td style="padding: 10px;">
Rapidly filling: Downstream too slow<br>
Empty: Upstream bottleneck
</td>
</tr>
</table>

### Ray Data Dashboard

Specialized view focusing on Ray Data pipeline metrics with detailed operator-level insights.

<div style="background-color: #e3f2fd; padding: 15px; border-left: 4px solid #2196F3; margin: 20px 0;">
<strong>Anyscale Enhanced Dashboard</strong><br>
If you're using Anyscale (Ray 2.44+), you have access to an enhanced dashboard with tree visualization for complex pipelines, integrated log views, and persistence across sessions. See the <a href="https://docs.anyscale.com/monitoring/workload-debugging/data-dashboard">Anyscale Data Dashboard documentation</a> for details and screenshots.
</div>

**Access:**
- **Open-source Ray**: Navigate to Ray Dashboard → Data or Metrics tabs
- **Anyscale**: Ray Workloads tab → Data tab

**Key sections:**

**Overview**: Total throughput (rows/second), execution time, aggregate resource usage. Shows CPU/GPU usage per operator, queued rows, and task counts.

**Inputs/Outputs**: Tracks data flow between operators. A rising input queue means the operator can't keep up with its upstream; a growing output queue means the downstream operator is too slow.

**Tasks**: Task execution metrics including running, completed, and average duration. Useful for understanding parallelism settings.

**Object Store Memory**: Memory usage for Ray Data blocks. Shows which operators consume the most memory.

**Iteration** (for training): Tracks how fast workers consume data from iterators. Increasing "Iteration Blocked Time" means data preprocessing can't keep up with training.

**Optimization workflow:**

1. **Identify bottleneck**: Look for operator with lowest throughput
2. **Check resources**: Is CPU low? (decrease num_cpus) Is memory high? (increase num_cpus)
3. **Examine queues**: Large input queues = operator can't keep up; empty = upstream too slow
4. **Monitor blocked time**: For training, linear increase = need more data parallelism
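Steps 1 and 3 of this workflow can be sketched as two small functions. The operator readings below are hypothetical numbers copied by hand from the dashboard, the function names are illustrative, and the `large_queue` threshold of 100 blocks is an arbitrary assumption you would tune for your own pipeline.

```python
def find_bottleneck(operators: list[dict]) -> dict:
    """Return the operator with the lowest throughput (rows per second)."""
    return min(operators, key=lambda op: op["rows_per_s"])


def diagnose_queue(input_queue_blocks: int, large_queue: int = 100) -> str:
    """Interpret an operator's input queue size, following step 3 above."""
    if input_queue_blocks >= large_queue:
        return "operator cannot keep up"
    if input_queue_blocks == 0:
        return "upstream too slow"
    return "no clear queue signal"


# Hypothetical per-operator readings.
operators = [
    {"name": "ReadParquet", "rows_per_s": 50_000.0, "input_queue_blocks": 0},
    {"name": "MapBatches(preprocess)", "rows_per_s": 4_000.0, "input_queue_blocks": 250},
    {"name": "MapBatches(GPUModel)", "rows_per_s": 9_000.0, "input_queue_blocks": 0},
]

bottleneck = find_bottleneck(operators)
print(bottleneck["name"], "->", diagnose_queue(bottleneck["input_queue_blocks"]))
```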

**Anyscale-specific features:**

**Operator drill-down**: Click operators to view estimated remaining runtime, peak memory, task statistics, and resource utilization over time.

**Tree visualization**: For pipelines with `union`, `zip`, or `join`, see parent-child relationships and merge points in tree structure.

**Integrated logs**: View logs specific to each dataset from `/tmp/ray/{SESSION_NAME}/logs/ray-data/`. Includes automatic backpressure warnings, health checks, and OOM events. Add custom logs:

```python
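import logging

logger = logging.getLogger("ray.data")

def my_map_function(batch):
    # is_interesting() and process() are placeholders for your own logic.
    if is_interesting(batch):
        # A batch is a dict of column name -> array, so count rows via a column.
        num_rows = len(next(iter(batch.values())))
        logger.info(f"Processing batch with {num_rows} rows")
    return process(batch)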
```

**Dashboard persistence**: Unlike open-source, Anyscale preserves dashboards after job termination. Use for post-mortem analysis, comparing runs, and sharing with team members via session dropdown.

### Actor Stack Traces

View stack traces of running actors to diagnose hangs or slow operations:

1. Navigate to Actors tab in Ray Dashboard
2. Find the actor (search by name or filter by state)
3. Click actor → Stack Trace tab

The stack trace shows the exact file, line number, and function where the actor is executing—immediately revealing if it's doing productive work or stuck waiting.

---

## Batch Size Selection

### GPU Inference Batch Sizes

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #00897b; color: white;">
<th style="padding: 12px;">GPU Memory</th>
<th style="padding: 12px;">Small Models<br>(ResNet50)</th>
<th style="padding: 12px;">Medium Models<br>(ResNet152)</th>
<th style="padding: 12px;">Large Models<br>(ViT-L)</th>
</tr>
<tr style="background-color: #e0f2f1;">
<td style="padding: 10px;"><strong>16 GB</strong> (T4)</td>
<td style="padding: 10px;">128-256</td>
<td style="padding: 10px;">64-128</td>
<td style="padding: 10px;">32-64</td>
</tr>
<tr>
<td style="padding: 10px;"><strong>24 GB</strong> (A10G)</td>
<td style="padding: 10px;">256-512</td>
<td style="padding: 10px;">128-256</td>
<td style="padding: 10px;">64-128</td>
</tr>
<tr style="background-color: #e0f2f1;">
<td style="padding: 10px;"><strong>40 GB</strong> (A100)</td>
<td style="padding: 10px;">512+</td>
<td style="padding: 10px;">256-512</td>
<td style="padding: 10px;">128-256</td>
</tr>
</table>

<div style="background-color: #ffe0b2; padding: 15px; border-left: 4px solid #FF9800; margin: 20px 0;">
<strong>Image Resolution Matters</strong><br>
Values assume 224×224 resolution. For higher resolution:
<ul>
<li>448×448 → Divide by 4</li>
<li>512×512 → Divide by 5</li>
<li>1024×1024 → Divide by about 21</li>
</ul>
</div>

Start with the higher end of the range and reduce if you encounter out-of-memory errors.
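For resolutions not listed above, one rough way to rescale a batch size is by pixel count. The sketch below assumes activation memory grows in proportion to the number of pixels, which is an approximation rather than a guarantee; `scale_batch_size` is a hypothetical helper.

```python
def scale_batch_size(base_batch_size: int, resolution: int, base_resolution: int = 224) -> int:
    """Scale a batch size chosen for base_resolution to another square resolution."""
    if resolution <= 0 or base_resolution <= 0:
        raise ValueError("resolutions must be positive")
    # Pixel count grows with the square of the side length.
    pixel_ratio = (resolution / base_resolution) ** 2
    return max(1, int(base_batch_size / pixel_ratio))


# A batch size of 128 tuned at 224x224, rescaled for larger inputs.
for side in (224, 448, 512):
    print(f"{side}x{side}: batch_size ~ {scale_batch_size(128, side)}")
```

Use the result as a starting point and confirm it against measured GPU memory.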

### CPU Inference Batch Sizes

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #5d4037; color: white;">
<th style="padding: 12px;">Available RAM</th>
<th style="padding: 12px;">Recommended Batch Size</th>
</tr>
<tr style="background-color: #efebe9;">
<td style="padding: 10px;">4-8 GB</td>
<td style="padding: 10px;">4-16</td>
</tr>
<tr>
<td style="padding: 10px;">8-16 GB</td>
<td style="padding: 10px;">16-32</td>
</tr>
<tr style="background-color: #efebe9;">
<td style="padding: 10px;">16-32 GB</td>
<td style="padding: 10px;">32-64</td>
</tr>
<tr>
<td style="padding: 10px;">32+ GB</td>
<td style="padding: 10px;">64-128</td>
</tr>
</table>

RAM values refer to RAM per worker task. With high concurrency, each worker gets a fraction of total RAM.
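To apply the table, first estimate RAM per worker and then look up the range. This sketch assumes RAM is shared evenly among concurrent workers and assigns boundary values (for example, exactly 8 GB) to the higher band; both are simplifying assumptions, and the helper names are hypothetical.

```python
def ram_per_worker_gb(node_ram_gb: float, concurrent_workers: int) -> float:
    """Approximate RAM available to each worker, assuming an even split."""
    if concurrent_workers <= 0:
        raise ValueError("concurrent_workers must be positive")
    return node_ram_gb / concurrent_workers


def cpu_batch_size_range(worker_ram_gb: float) -> tuple[int, int]:
    """Look up the recommended (low, high) batch size from the table above."""
    if worker_ram_gb >= 32:
        return (64, 128)
    if worker_ram_gb >= 16:
        return (32, 64)
    if worker_ram_gb >= 8:
        return (16, 32)
    if worker_ram_gb >= 4:
        return (4, 16)
    raise ValueError("Less than 4 GB per worker is below the table's range")


per_worker = ram_per_worker_gb(node_ram_gb=64, concurrent_workers=8)
print(f"{per_worker:.0f} GB per worker -> batch size range {cpu_batch_size_range(per_worker)}")
```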

### Batch Size Tuning

```
Start with recommended batch size
       │
       ▼
┌─────────────────┐
│ Run and measure │
│  memory usage   │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
Memory <50%  Memory >85%
    │         │
    ├→ 2x    ├→ 0.5x
    │  larger│  smaller
    │         │
    └─────────┴───→ Repeat until optimal
```

**Measure memory usage:**
- GPU: `torch.cuda.max_memory_allocated()` or `nvidia-smi`
- CPU: Ray Dashboard memory graphs

Target 60-80% memory usage for optimal performance without OOM risk.
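The tuning loop above can be sketched as a doubling and halving search. Here `measure_memory_fraction` is a hypothetical callback that runs one trial at a given batch size and returns peak memory as a fraction of capacity (for example, derived from `torch.cuda.max_memory_allocated()`); the thresholds are the 50% and 85% marks from the diagram.

```python
from typing import Callable


def tune_batch_size(
    measure_memory_fraction: Callable[[int], float],
    start: int,
    low: float = 0.5,
    high: float = 0.85,
    max_steps: int = 10,
) -> int:
    """Double the batch size while memory is under `low`, halve it while over `high`."""
    batch_size = start
    too_big = None  # smallest batch size seen to exceed `high`
    for _ in range(max_steps):
        usage = measure_memory_fraction(batch_size)
        if usage > high:
            too_big = batch_size
            if batch_size == 1:
                break
            batch_size = max(1, batch_size // 2)
        elif usage < low and (too_big is None or batch_size * 2 < too_big):
            batch_size *= 2
        else:
            # In the target band, or doubling would return to a size known to be too big.
            break
    return batch_size


# Toy memory model for illustration only: 10% fixed overhead plus 0.5% per sample.
def fake_usage(batch_size: int) -> float:
    return 0.1 + 0.005 * batch_size


print("Chosen batch size:", tune_batch_size(fake_usage, start=32))
```

Remembering the smallest size that was too large keeps the loop from bouncing between two neighboring sizes.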

---

## Troubleshooting Guide

### Symptom → Solution Matrix

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #d32f2f; color: white;">
<th style="padding: 12px; width: 25%;">Symptom</th>
<th style="padding: 12px; width: 25%;">Root Cause</th>
<th style="padding: 12px; width: 25%;">Solution</th>
<th style="padding: 12px; width: 25%;">Verification</th>
</tr>
<tr style="background-color: #ffebee;">
<td style="padding: 10px;">Low CPU (<50%)</td>
<td style="padding: 10px;">Not enough parallel tasks</td>
<td style="padding: 10px;"><strong>Decrease</strong> num_cpus<br>(1.0 → 0.5 → 0.25)</td>
<td style="padding: 10px;">CPU utilization >80%</td>
</tr>
<tr>
<td style="padding: 10px;">Workers killed</td>
<td style="padding: 10px;">Out of memory</td>
<td style="padding: 10px;"><strong>Increase</strong> num_cpus<br>(0.5 → 1.0 → 2.0)</td>
<td style="padding: 10px;">No more crashes</td>
</tr>
<tr style="background-color: #ffebee;">
<td style="padding: 10px;">GPU OOM</td>
<td style="padding: 10px;">Batch too large</td>
<td style="padding: 10px;"><strong>Reduce</strong> batch_size<br>(64 → 32 → 16)</td>
<td style="padding: 10px;">No CUDA errors</td>
</tr>
<tr>
<td style="padding: 10px;">Pipeline stalls</td>
<td style="padding: 10px;">Materialization in pipeline</td>
<td style="padding: 10px;">Remove .count(), .schema()</td>
<td style="padding: 10px;">Continuous progress</td>
</tr>
<tr style="background-color: #ffebee;">
<td style="padding: 10px;">One slow stage</td>
<td style="padding: 10px;">Imbalanced parallelism</td>
<td style="padding: 10px;">Adjust that stage's num_cpus</td>
<td style="padding: 10px;">Balanced progress bars</td>
</tr>
<tr>
<td style="padding: 10px;">Uneven progress</td>
<td style="padding: 10px;">Data skew</td>
<td style="padding: 10px;">Check data distribution</td>
<td style="padding: 10px;">Even task durations</td>
</tr>
</table>

### Memory Issue Decision Tree

```
Memory Error?
    │
    ├─ Workers killed
    │      │
    │      ├─ First: num_cpus↑ (1.0 → 2.0)
    │      ├─ Still OOM: batch_size↓
    │      └─ Still OOM: Block size↓ (128MB → 64MB)
    │
    ├─ "Ray object store full"
    │      │
    │      ├─ First: num_cpus↑ all stages
    │      ├─ Still full: Override object store fraction
    │      └─ Still full: Reduce pipeline width
    │
    └─ "CUDA out of memory"
           │
           ├─ First: batch_size↓ (64 → 32 → 16 → 8)
           ├─ Still OOM: Separate CPU/GPU stages
           └─ Still OOM: Enable mixed precision
```
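The decision tree can be encoded as a lookup from error text to an ordered list of fixes. This is a sketch only: the matched phrases are the ones quoted in this guide, real log messages may be worded differently, and the function and dictionary names are hypothetical.

```python
# Ordered fixes per error family, transcribed from the decision tree above.
REMEDIATION_STEPS = {
    "cuda_oom": [
        "Reduce batch_size (64 -> 32 -> 16 -> 8)",
        "Separate CPU and GPU stages",
        "Enable mixed precision",
    ],
    "object_store_full": [
        "Increase num_cpus on all stages",
        "Override the object store memory fraction",
        "Reduce pipeline width",
    ],
    "workers_killed": [
        "Increase num_cpus (1.0 -> 2.0)",
        "Reduce batch_size",
        "Reduce block size (128MB -> 64MB)",
    ],
}


def classify_memory_error(message: str) -> str:
    """Map an error message to one of the three families, or 'unknown'."""
    text = message.lower()
    # Check the most specific phrase first: "CUDA out of memory" also
    # contains the generic phrase "out of memory".
    if "cuda out of memory" in text:
        return "cuda_oom"
    if "object store full" in text:
        return "object_store_full"
    if "killed" in text or "out of memory" in text:
        return "workers_killed"
    return "unknown"


def remediation_steps(message: str) -> list[str]:
    """Return the fixes to try, in order, for the given error message."""
    return REMEDIATION_STEPS.get(classify_memory_error(message), [])


for step in remediation_steps("RuntimeError: CUDA out of memory"):
    print("-", step)
```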

---

## Common Patterns

### Pattern Comparison

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #1976d2; color: white;">
<th style="padding: 12px;">Pattern</th>
<th style="padding: 12px;">Use Case</th>
<th style="padding: 12px;">Key Parameters</th>
</tr>
<tr style="background-color: #e3f2fd;">
<td style="padding: 10px;"><strong>Optimized Pipeline</strong></td>
<td style="padding: 10px;">General ETL</td>
<td style="padding: 10px;">
• Read: num_cpus=0.025<br>
• Transform: num_cpus=0.5<br>
• Write: num_cpus=0.1
</td>
</tr>
<tr>
<td style="padding: 10px;"><strong>GPU Inference</strong></td>
<td style="padding: 10px;">Deep learning models</td>
<td style="padding: 10px;">
• num_gpus=1.0<br>
• batch_size=64<br>
• concurrency=# of GPUs
</td>
</tr>
<tr style="background-color: #e3f2fd;">
<td style="padding: 10px;"><strong>CPU Inference</strong></td>
<td style="padding: 10px;">CPU-only clusters</td>
<td style="padding: 10px;">
• num_cpus=4.0<br>
• batch_size=16<br>
• concurrency=CPUs/4
</td>
</tr>
</table>

### Code Examples

```python
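import ray

# ETL Pattern (path, cols, condition, transform, and output are placeholders)
ds = (ray.data.read_parquet(path, columns=cols, num_cpus=0.025)
      .filter(condition, num_cpus=0.1)
      .map_batches(transform, num_cpus=0.5))
# write_parquet returns None, so call it separately instead of assigning its result.
# Task resources for the write are passed through ray_remote_args.
ds.write_parquet(output, ray_remote_args={"num_cpus": 0.1})

# GPU Inference Pattern (load_model is a placeholder for your own loader)
class GPUModel:
    def __init__(self):
        self.model = load_model().cuda()  # runs once per actor

    def __call__(self, batch):
        return self.model(batch)  # runs once per batch

gpu_predictions = ds.map_batches(GPUModel, num_gpus=1.0, batch_size=64, concurrency=2)

# CPU Inference Pattern
class CPUModel:
    def __init__(self):
        self.model = load_model()

    def __call__(self, batch):
        return self.model(batch)

cpu_predictions = ds.map_batches(CPUModel, num_cpus=4.0, batch_size=16, concurrency=8)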
```

Use class-based actors for model inference—`__init__` loads the model once per worker, and `__call__` processes each batch.
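The benefit of the class-based pattern shows up even without Ray. The sketch below uses a hypothetical `FakeModel` that counts how often it is constructed, and compares a callable class with a plain function that reloads the model on every batch. In Ray Data, each actor would build one `Predictor` instance.

```python
class FakeModel:
    """Stand-in for an expensive model; counts how many times it is loaded."""

    loads = 0

    def __init__(self):
        FakeModel.loads += 1

    def predict(self, values):
        return [v * 2 for v in values]


class Predictor:
    """Class-based pattern: load once in __init__, reuse in __call__."""

    def __init__(self):
        self.model = FakeModel()

    def __call__(self, batch):
        return {"output": self.model.predict(batch["input"])}


def predict_stateless(batch):
    """Function-based pattern: the model is reloaded for every batch."""
    model = FakeModel()
    return {"output": model.predict(batch["input"])}


batches = [{"input": [1, 2]}, {"input": [3]}, {"input": [4, 5, 6]}]

predictor = Predictor()
stateful_outputs = [predictor(b) for b in batches]
loads_stateful = FakeModel.loads

stateless_outputs = [predict_stateless(b) for b in batches]
loads_stateless = FakeModel.loads - loads_stateful

print(f"Class-based loads: {loads_stateful}, function-based loads: {loads_stateless}")
```

Both approaches produce the same outputs; only the number of model loads differs.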

---

## Production Checklist

<table style="width:100%; border-collapse: collapse;">
<tr style="background-color: #388e3c; color: white;">
<th style="padding: 12px; width: 70%;">Item</th>
<th style="padding: 12px; width: 30%;">Status</th>
</tr>
<tr style="background-color: #e8f5e9;">
<td style="padding: 10px;">Column pruning enabled (columns= parameter)</td>
<td style="padding: 10px;">☐ Complete</td>
</tr>
<tr>
<td style="padding: 10px;">Early filtering applied (after read)</td>
<td style="padding: 10px;">☐ Complete</td>
</tr>
<tr style="background-color: #e8f5e9;">
<td style="padding: 10px;">Class-based actors for stateful ops</td>
<td style="padding: 10px;">☐ Complete</td>
</tr>
<tr>
<td style="padding: 10px;">num_cpus set for each stage</td>
<td style="padding: 10px;">☐ Complete</td>
</tr>
<tr style="background-color: #e8f5e9;">
<td style="padding: 10px;">batch_size tested with production data</td>
<td style="padding: 10px;">☐ Complete</td>
</tr>
<tr>
<td style="padding: 10px;">Concurrency matches resources</td>
<td style="padding: 10px;">☐ Complete</td>
</tr>
<tr style="background-color: #e8f5e9;">
<td style="padding: 10px;">Error handling configured</td>
<td style="padding: 10px;">☐ Complete</td>
</tr>
<tr>
<td style="padding: 10px;">Monitoring configured</td>
<td style="padding: 10px;">☐ Complete</td>
</tr>
<tr style="background-color: #e8f5e9;">
<td style="padding: 10px;">Full data testing completed</td>
<td style="padding: 10px;">☐ Complete</td>
</tr>
</table>

**Key items:**

- **Column pruning**: Read only needed columns to reduce I/O by 60-95%
- **Early filtering**: Filter immediately after reading to reduce downstream data volume
- **Class-based actors**: Use classes with `__init__` and `__call__` for stateful operations
- **Explicit num_cpus**: Set for every stage based on operation type
- **Test with production data**: Memory patterns change with data size
- **Match concurrency to resources**: GPU: concurrency = # GPUs; CPU: concurrency = total CPUs / CPUs per actor
- **Configure error handling**: Set `max_errored_blocks` for large datasets with quality issues
- **Enable monitoring**: Disable progress bars but enable metrics collection for production

---

## Key Takeaways

<div style="background-color: #c8e6c9; padding: 20px; border-radius: 8px; margin: 20px 0;">
<h3 style="margin-top: 0;">Remember These Seven Rules</h3>
<ol>
<li><strong>Start simple:</strong> Try num_cpus before anything else</li>
<li><strong>Monitor first:</strong> Enable progress bars to see bottlenecks</li>
<li><strong>One change at a time:</strong> Measure each optimization's impact</li>
<li><strong>Column pruning:</strong> Specify only needed columns in read operations</li>
<li><strong>Class-based pattern:</strong> Load models once per worker using __init__</li>
<li><strong>Batch size matters:</strong> Start high and reduce if you hit OOM</li>
<li><strong>CPU paradox:</strong> Lower num_cpus = MORE parallelism for I/O</li>
</ol>
</div>

Optimization is iterative. As data grows or requirements change, revisit your configuration. Monitor performance continuously and measure the impact of changes—intuition about performance is often wrong.

---

**[← Back to Part 1](01-inference-fundamentals.md)** | **[Continue to Part 3 →](03-ray-data-architecture.md)**


---

## Next Steps

You've learned advanced optimization techniques for batch inference. Continue to Part 3 to understand the Ray Data architecture that makes these optimizations possible.

In Part 3, you'll learn:
- How streaming execution enables unlimited dataset processing
- How blocks and memory management affect your optimization choices
- How operator fusion and backpressure work under the hood
- How to calculate optimal parameters from architectural constraints

---

**[← Back to Part 1](01-inference-fundamentals.md)** | **[Return to Overview](README.md)** | **[Continue to Part 3 →](03-ray-data-architecture.md)**


---