# Part 1: Inference Fundamentals

**Time to complete**: 20 min | **Difficulty**: Beginner | **Prerequisites**: Basic Python, ML model concepts

**[← Back to Overview](README.md)** | **[Continue to Part 2 →](02-advanced-optimization.md)**

---

## What You'll Learn

In this part, you'll understand the fundamentals of batch inference optimization by comparing inefficient and efficient approaches:
- How to set up Ray Data for accelerated inference (CPU or GPU)
- Why naive inference patterns create performance bottlenecks
- How Ray Data's actor-based pattern solves these problems
- How to implement optimized inference with proper resource allocation for both CPU and GPU

## Table of Contents

1. [Introduction and Setup](#introduction-and-setup)
2. [The Wrong Way: Inefficient Batch Inference](#the-wrong-way-inefficient-batch-inference)
3. [Why the Naive Approach Fails](#why-the-naive-approach-fails)
4. [The Right Way: Optimized with Ray Data](#the-right-way-optimized-with-ray-data)

---

## Introduction and Setup

Batch inference is the process of running ML model predictions on large batches of data. While this sounds straightforward, naive implementations create severe performance bottlenecks that prevent production deployment. This part shows you the difference between inefficient and optimized approaches using real-world examples.

### What You'll Learn

By comparing inefficient and optimized implementations, you'll understand:
- **Why** repeated model loading destroys performance
- **How** Ray Data's actor pattern solves the problem
- **When** to apply specific optimization techniques
- **What** parameters to tune for your workload

### Initial Setup

In [1]:
import ray
import torch
import numpy as np
from PIL import Image
import time

# Initialize Ray for distributed processing
ray.init(ignore_reinit_error=True)

print("Ray cluster initialized for batch inference optimization")
print(f"Available resources: {ray.cluster_resources()}")

# Configure Ray Data for optimal performance monitoring
try:
    ctx = ray.data.DataContext.get_current()
    ctx.enable_progress_bars = True
    ctx.enable_operator_progress_bars = True
    print("Ray Data progress bars enabled")
except Exception as e:
    print(f"Note: Could not configure Ray Data context (progress bars disabled): {e}")
    print("This doesn't affect functionality - continuing with notebook...")

# Detect hardware availability
HAS_GPU = torch.cuda.is_available()
device = torch.device("cuda" if HAS_GPU else "cpu")

print(f"\nUsing device: {device}")
if HAS_GPU:
    print(f"GPU count: {torch.cuda.device_count()}")
    print("GPU detected - examples will use GPU acceleration")
else:
    print("No GPU detected - examples will run on CPU")

2025-10-10 16:32:19,037	INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.0.71.116:6379...
2025-10-10 16:32:19,049	INFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at https://session-77uweunq3awbhqefvry4lwcqq5.i.anyscaleuserdata.com
2025-10-10 16:32:19,056	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_21cbb801d2a37fbeb0421b1464bfc910a4f77070.zip' (0.15MiB) to Ray cluster...
2025-10-10 16:32:19,057	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_21cbb801d2a37fbeb0421b1464bfc910a4f77070.zip'.


Ray cluster initialized for batch inference optimization
Available resources: {'anyscale/node-group:8CPU-32GB': 10.0, 'node:10.0.104.3': 1.0, 'anyscale/provider:aws': 11.0, 'CPU': 80.0, 'memory': 377957122048.0, 'object_store_memory': 105537718267.0, 'anyscale/cpu_only:true': 11.0, 'anyscale/region:us-west-2': 11.0, 'node:10.0.99.160': 1.0, 'node:10.0.116.84': 1.0, 'node:10.0.93.34': 1.0, 'node:10.0.109.213': 1.0, 'node:10.0.94.252': 1.0, 'node:10.0.83.124': 1.0, 'anyscale/node-group:head': 1.0, 'node:__internal_head__': 1.0, 'node:10.0.71.116': 1.0, 'node:10.0.100.254': 1.0, 'node:10.0.83.247': 1.0, 'node:10.0.127.67': 1.0}
Ray Data progress bars enabled

Using device: cpu
No GPU detected - examples will run on CPU



<div class="alert alert-block alert-info">
<b>Tip: GPU acceleration.</b> This template works on both CPU-only and GPU clusters. The code automatically detects GPUs and uses <code>num_gpus=1</code> for acceleration.
</div>

All optimization concepts (actor-based loading, batching, concurrency) apply equally to both environments.
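
One way to express that CPU/GPU choice in code is a small helper that returns the resource arguments for one inference worker. This is a minimal sketch: the helper name `inference_resource_kwargs` and its `cpus_per_worker` parameter are illustrative assumptions, not part of Ray. Only `num_gpus` and `num_cpus` are actual `map_batches()` arguments.

```python
from typing import Any, Dict


def inference_resource_kwargs(has_gpu: bool, cpus_per_worker: int = 1) -> Dict[str, Any]:
    """Return resource arguments for a single inference worker.

    Hypothetical helper: reserves one GPU per worker when a GPU is
    available, otherwise reserves `cpus_per_worker` CPUs.
    """
    if has_gpu:
        return {"num_gpus": 1}
    return {"num_cpus": cpus_per_worker}


print(inference_resource_kwargs(True))   # {'num_gpus': 1}
print(inference_resource_kwargs(False))  # {'num_cpus': 1}
```

The returned dictionary could then be unpacked into the call, for example `dataset.map_batches(fn, **inference_resource_kwargs(HAS_GPU))`, so the same notebook runs unchanged on both kinds of cluster.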


### Load Demo Dataset

For this demonstration, you'll use the Imagenette dataset, which provides a realistic subset of ImageNet with 10 classes.

In [2]:
dataset = ray.data.read_images(
    "s3://anonymous@air-example-data-2/imagenette2/train/",
    mode="RGB",
)

---

## The Wrong Way: Inefficient Batch Inference

This section demonstrates a common anti-pattern in ML inference systems. Understanding why this approach fails is essential before learning the optimized solution.

When models are loaded repeatedly for each batch, the initialization overhead dominates processing time. This pattern is unfortunately common in production systems where developers haven't considered the cost of model loading operations.

In [3]:
from typing import Dict, Any
import torch
from torchvision.models import ResNet152_Weights
from torchvision import transforms
from torchvision import models
import pandas as pd
import numpy as np

weights = ResNet152_Weights.IMAGENET1K_V1

# ============================================================================
# MISTAKE 1: Model initialization at module level
# ============================================================================
# NEVER do this: model = models.resnet152(weights=weights)
#
# WHY THIS IS WRONG:
# - When Ray serializes this function for distributed execution, it would try 
#   to serialize the entire model object and store it in the object store
# - ResNet152 is ~230MB, so every worker would need to download this massive
#   serialized object from the object store
# - This causes huge memory overhead and network transfer costs
# - Ray's object store would be unnecessarily bloated with duplicate models
#
# CORRECT APPROACH:
# - Use a callable class with __init__ and __call__ methods
# - Load the model once in __init__ (per worker)
# - Reuse the model across all batches in __call__
# ============================================================================

def inefficient_inference(batch: Dict[str, Any]) -> Dict[str, Any]:
    """INEFFICIENT: Loads model for every single batch.
    
    Anti-pattern demonstration - DO NOT use this approach in production!
    This function intentionally shows bad practices to highlight optimization opportunities.
    
    Note: This example runs on CPU for classroom use. The antipatterns shown here
    apply equally to GPU-based inference, where the performance differences would
    be even more pronounced.
    
    Args:
        batch: Dictionary containing 'image' key with array of images
        
    Returns:
        Dictionary with 'prediction' and 'image' arrays
    """
    import time
    import requests
    import json
    import tempfile
    import os
    
    # ========================================================================
    # MISTAKE 2: Model loading happens inside the batch processing function
    # ========================================================================
    # This is the MOST CRITICAL performance mistake in this code.
    #
    # WHY THIS IS WRONG:
    # - The model gets loaded from scratch for EVERY SINGLE BATCH
    # - ResNet152 has 60+ million parameters that need to be initialized
    # - Loading weights from disk/network is extremely expensive (~230MB for ResNet152)
    # - With batch_size=4 and 1000 samples, this loads the model 250 times
    #
    # CORRECT APPROACH:
    # - Use a callable class with __init__ and __call__ methods
    # - Load the model once in __init__ (per worker)
    # - Reuse the model across all batches in __call__
    # ========================================================================
    start_load = time.time()
    
    # Note: Using CPU for classroom environment
    # In production with GPUs, you would use: torch.device("cuda" if torch.cuda.is_available() else "cpu")
    device = torch.device("cpu")

    model = models.resnet152(weights=weights).to(device)
    model.eval()

    # ========================================================================
    # MISTAKE 3: Transform pipeline recreated for every batch
    # ========================================================================
    # WHY THIS IS WRONG:
    # - While less expensive than reloading the model, this still has overhead
    # - Transform objects and their internal state get recreated repeatedly
    # - ImageNet transforms include normalization parameters that are constants
    # - Unnecessary object creation causes garbage collection pressure
    #
    # CORRECT APPROACH:
    # - Create transforms once in __init__ method
    # - Reuse the same transform pipeline for all batches
    # ========================================================================
    imagenet_transforms = weights.transforms()
    transform = transforms.Compose([
        transforms.ToTensor(),
        imagenet_transforms
    ])

    load_time = time.time() - start_load
    print(f"Model loading (per batch) took: {load_time:.2f} seconds")
    
    # ========================================================================
    # MISTAKE 4: Excessive per-batch logging/printing
    # ========================================================================
    # WHY THIS IS WRONG:
    # - Print statements in distributed tasks create massive log files
    # - Each worker prints to stdout, creating I/O contention
    # - Logs get scattered across different worker processes
    # - With 1000 batches, you get 1000+ print statements flooding logs
    # - Makes debugging harder (signal-to-noise ratio problems)
    #
    # CORRECT APPROACH:
    # - Use proper logging with appropriate log levels
    # - Log only errors and warnings during execution
    # - Use Ray metrics/counters for monitoring
    # - Sample logging (log every Nth batch, not every batch)
    # ========================================================================
    print(f"Processing batch of {len(batch['image'])} images...")
    print(f"Current device: {device}")
    
    # ========================================================================
    # MISTAKE 5: Synchronous network I/O inside inference function
    # ========================================================================
    # WHY THIS IS WRONG:
    # - Making HTTP requests during inference blocks the entire batch
    # - Network latency is orders of magnitude slower than inference
    # - External services can fail, timeout, or rate-limit you
    # - Creates hard dependency on external service availability
    # - CPU sits idle while waiting for network responses
    #
    # CORRECT APPROACH:
    # - Separate data processing from external I/O operations
    # - Use async/batch APIs if external calls are necessary
    # - Cache results from external services
    # - Consider using Ray Data's read_* functions for data loading
    # ========================================================================
    # try:
    #     # Simulating calling an external API for "metadata" (DON'T DO THIS!)
    #     response = requests.get(
    #         "https://api.example.com/model-config",
    #         timeout=5
    #     )
    #     config = response.json()
    #     print(f"Retrieved config from API: {config}")
    # except Exception as e:
    #     print(f"API call failed (this is expected in this demo): {e}")
    #     config = {}
    
    # ========================================================================
    # MISTAKE 6: Unnecessary data format conversions
    # ========================================================================
    # WHY THIS IS WRONG:
    # - Converting between numpy, pandas, PyArrow, and torch repeatedly
    # - Each conversion allocates new memory and copies data
    # - Pandas DataFrame creation has significant overhead
    # - Converting back and forth wastes CPU cycles
    #
    # CORRECT APPROACH:
    # - Keep data in optimal format for your operations
    # - For PyTorch: work with tensors directly
    # - Minimize conversions; convert once at boundaries
    # - Use zero-copy operations when possible
    # ========================================================================
    # Unnecessarily convert to pandas then back (DON'T DO THIS!)
    df = pd.DataFrame(batch)
    images_from_df = df["image"].tolist()
    
    # ========================================================================
    # MISTAKE 7: Creating temporary files during inference
    # ========================================================================
    # WHY THIS IS WRONG:
    # - Disk I/O is much slower than memory operations
    # - File creation/deletion creates filesystem overhead
    # - Temporary files can fill disk space if not cleaned up
    # - Multiple workers writing files simultaneously causes contention
    #
    # CORRECT APPROACH:
    # - Keep all intermediate data in memory
    # - Use numpy arrays or tensors for temporary data
    # - Only write files for final outputs if necessary
    # ========================================================================
    temp_dir = tempfile.mkdtemp()
    print(f"Created temporary directory: {temp_dir}")
    
    # MISTAKE 8: Processing images one-by-one instead of batched inference
    predictions = []
    confidence_scores = []

    for idx, img in enumerate(images_from_df):
        print(f"Processing image {idx + 1}/{len(images_from_df)}")
        
        # MISTAKE 9: Inefficient data transfer patterns
        img_tensor = transform(img).unsqueeze(0).to(device)
        
        # MISTAKE 10: Writing temporary files during inference loop
        temp_file = os.path.join(temp_dir, f"temp_image_{idx}.pt")
        torch.save(img_tensor, temp_file)
        print(f"Saved temporary tensor to {temp_file}")
        
        # Run inference on a single image
        with torch.no_grad():
            prediction = model(img_tensor)
            
            # MISTAKE 11: Not using mixed precision on GPUs
            # MISTAKE 12: Unnecessary device synchronization
            predicted_classes = prediction.argmax(dim=1).detach().cpu()
            predicted_label = weights.meta["categories"][predicted_classes[0].item()]
            
            # Get confidence score
            probs = torch.nn.functional.softmax(prediction, dim=1)
            confidence = probs.max().detach().cpu().item()
        
        predictions.append(predicted_label)
        confidence_scores.append(confidence)
        
        # Clean up temp file (but this still wasted time creating it!)
        os.remove(temp_file)
    
    # MISTAKE 13: Not managing memory properly
    # Clean up temporary directory
    try:
        os.rmdir(temp_dir)
    except OSError:
        pass  # Directory might not be empty, but we don't care in this demo
    
    # MISTAKE 14: Returning inefficient data structures
    # Return dictionary with equal-length arrays
    return {
        "prediction": predictions,
        "confidence": confidence_scores,
        "image": batch["image"]
    }

# MISTAKE 15: Suboptimal resource configuration
inefficient_results = dataset.map_batches(
    inefficient_inference,
    num_cpus=8,  # Too many CPUs reserved per task
    batch_size=4,  # Too small for efficient processing
    concurrency=4  # Too low for this cluster
).take(1000)

2025-10-10 16:32:24,722	INFO dataset.py:3248 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2025-10-10 16:32:24,724	INFO logging.py:295 -- Registered dataset logger for dataset dataset_2_0
2025-10-10 16:32:24,779	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_2_0. Full logs are in /tmp/ray/session_2025-10-10_16-23-49_015346_2333/logs/ray-data
2025-10-10 16:32:24,780	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_2_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(inefficient_inference)] -> LimitOperator[limit=1000]

(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Downloading: "https://download.pytorch.org/models/resnet152-394f9c45.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth
100%|██████████| 230M/230M [00:01<00:00, 206MB/s]
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Model loading (per batch) took: 2.28 seconds
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Processing batch of 4 images...
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Current device: cpu
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Created temporary directory: /tmp/tmprg965k1x
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Processing image 1/4
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Saved temporary tensor to /tmp/tmprg965k1x/temp_image_0.pt
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Processing image 2/4
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Saved temporary tensor to /tmp/tmprg965k1x/temp_image_1.pt
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Processing image 3/4
(MapBatches(inefficient_inference) pid=2681, ip=10.0.83.247) Saved temporary tensor to /tmp/tmprg965k1x/temp_image_2.pt

(MapBatches(inefficient_inference) pid=2679, ip=10.0.127.67) Downloading: "https://download.pytorch.org/models/resnet152-394f9c45.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(MapBatches(inefficient_inference) pid=2679, ip=10.0.127.67) Model loading (per batch) took: 0.83 seconds [repeated 11x across cluster]
(MapBatches(inefficient_inference) pid=2679, ip=10.0.127.67) Processing batch of 4 images... [repeated 11x across cluster]
(MapBatches(inefficient_inference) pid=2679, ip=10.0.127.67) Current device: cpu [repeated 11x across cluster]
(MapBatches(inefficient_inference) pid=2679, ip=10.0.127.67) Created temporary directory: /tmp/tmpn284ef0j

(MapBatches(inefficient_inference) pid=2594, ip=10.0.99.160) Model loading (per batch) took: 0.90 seconds [repeated 10x across cluster]
(MapBatches(inefficient_inference) pid=2584, ip=10.0.109.213) Model loading (per batch) took: 2.18 seconds [repeated 10x across cluster]
(MapBatches(inefficient_inference) pid=2584, ip=10.0.109.213) Model loading (per batch) took: 0.82 seconds [repeated 14x across cluster]
(MapBatches(inefficient_inference) pid=2587, ip=10.0.94.252) Model loading (per batch) took: 0.83 seconds [repeated 12x across cluster]

[... similar per-batch output from the remaining workers omitted ...]

2025-10-10 16:35:50,185	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_2_0 execution finished in 205.40 seconds
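
The resource settings in the call above (`num_cpus=8`, `concurrency=4`) can be put into numbers. The following is a rough estimate only, assuming the 80 CPUs reported by `ray.cluster_resources()` earlier and ignoring the CPUs used by other operators such as `ReadFiles`. The helper `estimate_task_slots` is a hypothetical name introduced for illustration.

```python
from typing import Tuple


def estimate_task_slots(cluster_cpus: int, cpus_per_task: int, concurrency: int) -> Tuple[int, float]:
    """Estimate how many inference tasks run at once and the CPU share they reserve.

    Rough model: tasks are limited both by how many CPU reservations fit
    in the cluster and by the `concurrency` cap.
    """
    fit_by_cpu = cluster_cpus // cpus_per_task
    running = min(fit_by_cpu, concurrency)
    reserved_share = running * cpus_per_task / cluster_cpus
    return running, reserved_share


# Settings from the inefficient run: 80 CPUs, num_cpus=8, concurrency=4
print(estimate_task_slots(80, 8, 4))   # (4, 0.4)

# Lifting the concurrency cap lets reservations cover the whole cluster
print(estimate_task_slots(80, 8, 10))  # (10, 1.0)
```

Under these assumptions, at most 4 tasks run at a time and they reserve only 40% of the cluster's CPUs, which is why the configuration is labeled suboptimal.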


<div style="background-color: #e7f3ff; border-left: 4px solid #2196F3; padding: 12px; margin: 16px 0;">
<strong>Quick Reference:</strong> The most critical mistakes in Ray Data inference are initialization patterns (Mistakes 1-3) and batching strategies (Mistake 8). These alone can cause 10-100x performance differences.
</div>

## Mistake 1: Model initialization at module level

**Why this is wrong:**
- When Ray serializes the function for distributed execution, it tries to serialize the entire model object and store it in the object store
- Large models (ResNet152 is ~230MB) cause every worker to download massive serialized objects
- Causes huge memory overhead and network transfer costs
- Ray's object store becomes unnecessarily bloated with duplicate models

**Correct approach:**
- Use a callable class with `__init__` and `__call__` methods
- Load the model once in `__init__` (per worker)
- Reuse the model across all batches in `__call__`

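```python
import torch
from torchvision import models
from torchvision.models import ResNet152_Weights

weights = ResNet152_Weights.IMAGENET1K_V1

def predict(model, batch):
    # Assumes batch["image"] is a preprocessed float32 NCHW array
    with torch.no_grad():
        logits = model(torch.as_tensor(batch["image"]))
    return {"prediction": logits.argmax(dim=1).numpy()}

# Wrong: Model loaded at module level and captured by the function
model = models.resnet152(weights=weights).eval()

def process_batch(batch):
    return predict(model, batch)

# Correct: Model loaded once per worker
class ImageClassifier:
    def __init__(self):
        self.model = models.resnet152(weights=weights)
        self.model.eval()

    def __call__(self, batch):
        return predict(self.model, batch)
```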

## Mistake 2: Model loading inside the batch processing function

**Why this is wrong:**
- The model gets loaded from scratch for EVERY SINGLE BATCH
- Models with millions of parameters need to be initialized repeatedly
- Loading weights from disk/network is extremely expensive
- With small batch sizes, you reload the model hundreds or thousands of times

**Correct approach:**
- Use a callable class with `__init__` and `__call__` methods
- Load the model once in `__init__` (per worker)
- Reuse the model across all batches in `__call__`

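```python
import torch
from torchvision import models
from torchvision.models import ResNet152_Weights

weights = ResNet152_Weights.IMAGENET1K_V1

def predict(model, batch):
    # Assumes batch["image"] is a preprocessed float32 NCHW array
    with torch.no_grad():
        logits = model(torch.as_tensor(batch["image"]))
    return {"prediction": logits.argmax(dim=1).numpy()}

# Wrong: Model reloaded for every batch
def process_batch(batch):
    model = models.resnet152(weights=weights)  # Loaded every time
    model.eval()
    return predict(model, batch)

# Correct: Model loaded once, reused for all batches
class ImageClassifier:
    def __init__(self):
        self.model = models.resnet152(weights=weights)
        self.model.eval()

    def __call__(self, batch):
        return predict(self.model, batch)
```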
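
The difference between the two patterns can be checked without a real model. The sketch below is a simulation under stated assumptions: `fake_load_model` is a hypothetical stand-in for an expensive load such as `models.resnet152(...)`, the `"x"` column is made up, and the plain `for` loops stand in for Ray Data invoking the function or class once per batch.

```python
from typing import Callable, Dict, List

load_calls = 0  # Counts how many times the "model" is loaded


def fake_load_model() -> Callable[[List[float]], List[float]]:
    """Hypothetical stand-in for an expensive model load."""
    global load_calls
    load_calls += 1
    return lambda xs: [2 * x for x in xs]


def per_batch_fn(batch: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Anti-pattern: loads the model on every call."""
    model = fake_load_model()
    return {"prediction": model(batch["x"])}


class LoadOncePredictor:
    """Callable class: loads the model once in __init__ and reuses it."""

    def __init__(self) -> None:
        self.model = fake_load_model()

    def __call__(self, batch: Dict[str, List[float]]) -> Dict[str, List[float]]:
        return {"prediction": self.model(batch["x"])}


batches = [{"x": [float(i), float(i + 1)]} for i in range(5)]

load_calls = 0
for batch in batches:
    per_batch_fn(batch)
per_batch_loads = load_calls

load_calls = 0
predictor = LoadOncePredictor()
for batch in batches:
    predictor(batch)
load_once_loads = load_calls

print(per_batch_loads, load_once_loads)  # 5 1
```

With a callable class, the number of loads depends on how many worker instances exist rather than on how many batches they process.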

<div style="background-color: #fff3cd; border-left: 4px solid #ffc107; padding: 12px; margin: 16px 0;">
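<strong>Performance Impact:</strong> Loading a ResNet152 model takes approximately 2-3 seconds when the weights are not yet cached (the run above measured about 0.8 seconds per load once they were). At 2-3 seconds per load, processing 1,000 batches with Mistake 2 wastes 2,000-3,000 seconds (33-50 minutes) of compute time just reloading the same model.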
</div>
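
The arithmetic behind that estimate can be reproduced in a few lines. This is a back-of-the-envelope sketch: both function names are illustrative, and the worker count of 4 in the last line is an assumed value, not a measurement.

```python
def reload_overhead_seconds(num_batches: int, load_seconds: float) -> float:
    """Total load time when the model is reloaded for every batch."""
    return num_batches * load_seconds


def load_once_overhead_seconds(num_workers: int, load_seconds: float) -> float:
    """Total load time when each worker loads the model exactly once."""
    return num_workers * load_seconds


for load_seconds in (2.0, 3.0):
    wasted = reload_overhead_seconds(1000, load_seconds)
    print(f"{load_seconds:.0f} s per load: {wasted:.0f} s ({wasted / 60:.0f} min)")
# 2 s per load: 2000 s (33 min)
# 3 s per load: 3000 s (50 min)

# Assumed: 4 workers that each load the model once
print(load_once_overhead_seconds(4, 3.0))  # 12.0
```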

## Mistake 3: Transform pipeline recreated for every batch

**Why this is wrong:**
- Transform objects and their internal state get recreated repeatedly
- Normalization parameters and other constants are recomputed
- Unnecessary object creation causes garbage collection pressure

**Correct approach:**
- Create transforms once in `__init__` method
- Reuse the same transform pipeline for all batches

## Mistake 4: Excessive per-batch logging/printing

**Why this is wrong:**
- Print statements in distributed tasks create massive log files
- Each worker prints to stdout, creating I/O contention
- Logs get scattered across different worker processes
- Makes debugging harder (signal-to-noise ratio problems)

**Correct approach:**
- Use proper logging with appropriate log levels
- Log only errors and warnings during execution
- Use Ray metrics/counters for monitoring
- Sample logging (log every Nth batch, not every batch)
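
Sampled logging can be sketched with the standard `logging` module. The class name `SampledBatchLogger` and the interval of 100 batches are assumptions chosen for illustration. Inside a callable inference class, one instance per worker would be created in `__init__`.

```python
import logging

logger = logging.getLogger("batch_inference")


class SampledBatchLogger:
    """Emit one log record every `every_n` batches instead of one per batch."""

    def __init__(self, every_n: int = 100) -> None:
        self.every_n = every_n
        self.batches_seen = 0

    def record(self, batch_size: int) -> bool:
        """Count a batch and return True if a log record was emitted for it."""
        self.batches_seen += 1
        should_log = self.batches_seen % self.every_n == 0
        if should_log:
            logger.info(
                "Processed %d batches (last batch: %d rows)",
                self.batches_seen,
                batch_size,
            )
        return should_log


sampler = SampledBatchLogger(every_n=100)
logged = sum(sampler.record(batch_size=4) for _ in range(1000))
print(logged)  # 10
```

For 1,000 batches this produces 10 log records instead of the 1,000 or more lines that per-batch `print` calls would write.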

## Mistake 5: Synchronous network I/O inside inference function

**Why this is wrong:**
- Making HTTP requests during inference blocks the entire batch
- Network latency is orders of magnitude slower than inference
- External services can fail, timeout, or rate-limit you
- Creates hard dependency on external service availability
- Compute resources sit idle while waiting for network responses

**Correct approach:**
- Separate data processing from external I/O operations
- Use async/batch APIs if external calls are necessary
- Cache results from external services
- Consider using Ray Data's `read_*` functions for data loading

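```python
import ray
import requests

# Wrong: HTTP request during inference
# (`model` and the "id" column are placeholders for illustration)
def process_batch(batch):
    results = []
    for image_id, image in zip(batch["id"], batch["image"]):
        # Blocks the whole batch on every request
        metadata = requests.get(f"https://api.example.com/meta/{image_id}", timeout=5)
        results.append(model(image))
    return {"result": results}

# Correct: Preload metadata separately
# zip() requires both datasets to have the same number of rows, in matching order
metadata_ds = ray.data.read_json("s3://bucket/metadata/")
image_ds = ray.data.read_images("s3://bucket/images/")
joined_ds = image_ds.zip(metadata_ds)
```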
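
When an external lookup cannot be avoided, caching keeps repeated keys from triggering repeated requests. The sketch below uses `functools.lru_cache` from the standard library. `fetch_metadata` is a hypothetical stand-in that simulates the network call, and the `"key"` column is an assumed name. Note that such a cache lives in one process, so each worker keeps its own copy.

```python
from functools import lru_cache
from typing import Dict, List

request_count = 0  # Counts simulated network calls


@lru_cache(maxsize=1024)
def fetch_metadata(key: str) -> str:
    """Hypothetical stand-in for a slow external lookup."""
    global request_count
    request_count += 1
    return f"metadata-for-{key}"


def attach_metadata(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Add a metadata column; repeated keys are served from the cache."""
    return {**batch, "metadata": [fetch_metadata(k) for k in batch["key"]]}


batch = {"key": ["a", "b", "a", "a", "b"]}
result = attach_metadata(batch)
print(request_count)  # 2
```

Five rows with two distinct keys result in only two simulated requests.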

## Mistake 6: Unnecessary data format conversions

**Why this is wrong:**
- Converting between numpy, pandas, PyArrow, and torch repeatedly
- Each conversion allocates new memory and copies data
- Pandas DataFrame creation has significant overhead
- Converting back and forth wastes CPU cycles

**Correct approach:**
- Keep data in optimal format for your operations
- For PyTorch: work with tensors directly
- Minimize conversions; convert once at boundaries
- Use zero-copy operations when possible

<div style="background-color: #e8f5e9; border-left: 4px solid #4caf50; padding: 12px; margin: 16px 0;">
<strong>Pro Tip:</strong> Use <code>batch_format="numpy"</code> in <code>map_batches()</code> to receive batches in the most efficient format for your operation. Ray Data handles the conversion once at the boundary.
</div>

## Mistake 7: Creating temporary files during inference

**Why this is wrong:**
- Disk I/O is much slower than memory operations
- File creation/deletion creates filesystem overhead
- Temporary files can fill disk space if not cleaned up
- Multiple workers writing files simultaneously causes contention

**Correct approach:**
- Keep all intermediate data in memory
- Use numpy arrays or tensors for temporary data
- Only write files for final outputs if necessary
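
Many libraries that expect a file accept any file-like object, so an `io.BytesIO` buffer can often replace a temporary file. Below is a minimal sketch using only the standard library. The gzip payload is an illustrative stand-in for whatever intermediate bytes your pipeline handles, and `decompress_in_memory` is a hypothetical helper name.

```python
import gzip
import io


def decompress_in_memory(payload: bytes) -> bytes:
    """Decompress gzip bytes through a file-like buffer, without touching disk."""
    with gzip.GzipFile(fileobj=io.BytesIO(payload)) as stream:
        return stream.read()


# Round trip: compress some bytes, then decode them entirely in memory.
original = b"intermediate data " * 100
payload = gzip.compress(original)
restored = decompress_in_memory(payload)
```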

## Mistake 8: Processing images one-by-one instead of batched inference

**Why this is wrong:**
- Neural networks are optimized for batch processing
- Processing one image at a time prevents vectorization benefits
- With GPUs: compute units sit idle (poor parallelization)
- With CPUs: SIMD instructions and cache aren't utilized effectively
- Each forward pass has overhead
- Memory bandwidth is underutilized

**Correct approach:**
- Stack all images in the batch into a single tensor
- Run one forward pass on the entire batch: `model(stacked_tensor)`
- Let PyTorch parallelize across batch dimension

```python
# Wrong: Process images one-by-one
def __call__(self, batch):
    results = []
    for image in batch["image"]:
        # One forward pass per image
        result = self.model(torch.as_tensor(image).unsqueeze(0))
        results.append(result)
    return results

# Correct: Batched inference
def __call__(self, batch):
    images = torch.stack([torch.as_tensor(img) for img in batch["image"]])
    with torch.no_grad():  # No autograd graph, so .numpy() works on the output
        results = self.model(images)  # Single batched forward pass
    return {"predictions": results.cpu().numpy()}
```

## Mistake 9: Inefficient data transfer patterns

**Why this is wrong:**
- Moving data to device (CPU or GPU) has overhead
- Doing it per-image instead of per-batch multiplies overhead
- Each `.to(device)` call can synchronize operations
- Prevents overlapping compute and data movement

**Correct approach:**
- Transfer entire batch to device at once
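- For GPU: stage the batch in pinned (page-locked) host memory with `tensor.pin_memory()` and copy it with `.to(device, non_blocking=True)` for faster transfers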
- For GPU: overlap data transfer with computation using streams

## Mistake 10: Writing temporary files during inference loop

**Why this is wrong:**
- Combines disk I/O overhead with loop iteration overhead
- Creates many small files that stress filesystem metadata
- Cleanup adds additional overhead

**Correct approach:**
- Keep all intermediate data in memory
- Avoid writing files inside processing loops
- Only persist final results if needed

<div style="background-color: #fce4ec; border-left: 4px solid #e91e63; padding: 12px; margin: 16px 0;">
<strong>Memory Management:</strong> If you're concerned about memory usage, Ray automatically spills objects to disk when its object store fills up. You don't need to manually write temporary files.
</div>

## Mistake 11: Not using mixed precision on GPUs

**Why this is wrong (GPU-specific):**
- Modern GPUs (Volta, Turing, Ampere+) have specialized FP16 units
- Running in FP32 doesn't utilize these specialized units
- FP16 inference maintains accuracy for most vision models
- Uses less GPU memory (can fit larger batches)

**Correct approach (for GPU inference):**
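- Use `torch.autocast(device_type="cuda", dtype=torch.float16)` for automatic mixed precision (the older `torch.cuda.amp.autocast()` spelling is deprecated in recent PyTorch releases)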
- Enables tensor cores on modern GPUs

**Note:** Mixed precision is primarily beneficial for GPUs. For CPU inference, FP32 is typically fine.

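```python
import torch
import torchvision
from torchvision.models import ResNet152_Weights

# GPU inference with mixed precision
class ImageClassifier:
    def __init__(self):
        # `weights=` replaces the deprecated `pretrained=True` argument
        self.model = torchvision.models.resnet152(weights=ResNet152_Weights.IMAGENET1K_V1).cuda()
        self.model.eval()

    def __call__(self, batch):
        images = torch.stack([torch.as_tensor(img) for img in batch["image"]]).cuda()
        # Enable mixed precision; no_grad skips autograd bookkeeping
        with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
            results = self.model(images)
        # Cast FP16 outputs back to FP32 for downstream consumers
        return {"predictions": results.float().cpu().numpy()}
```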

## Mistake 12: Unnecessary device synchronization

**Why this is wrong:**
- Calls such as `.cpu()` and `.item()` block until pending device operations complete
- Synchronization prevents pipelining of operations
- Transferring back to CPU per-image instead of per-batch adds overhead

**Correct approach:**
- Keep intermediate results on device until needed
- Transfer back to CPU once for entire batch at the end
- Let framework manage operation scheduling

## Mistake 13: Not managing memory properly

**Why this is wrong:**
- PyTorch caches memory allocations for performance
- Long-running jobs can accumulate fragmented memory
- For GPU: manually clearing cache every batch is overkill and counterproductive
- For CPU: Python garbage collector usually handles this

**Correct approach:**
- Let PyTorch manage memory automatically in most cases
- For GPU: only clear cache if you see OOM errors
- Clear between large operations, not every batch
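
One way to act on "only clear cache if you see OOM errors" is to wrap the batch call and react only when an out-of-memory error actually occurs. The sketch below is framework-agnostic and uses only the standard library. `run_with_oom_fallback` and `fake_model` are hypothetical names, and the built-in `MemoryError` stands in for a GPU error. In a GPU pipeline you would likely pass `torch.cuda.OutOfMemoryError` (available in recent PyTorch versions) as `oom_error` and `torch.cuda.empty_cache` as `on_oom`.

```python
from typing import Callable, List, Optional, Sequence, Type


def run_with_oom_fallback(
    fn: Callable[[Sequence], List],
    batch: Sequence,
    oom_error: Type[BaseException] = MemoryError,
    on_oom: Optional[Callable[[], None]] = None,
) -> List:
    """Run fn(batch); on an out-of-memory error, run the cleanup hook and retry in halves."""
    try:
        return fn(batch)
    except oom_error:
        if len(batch) <= 1:
            raise  # A single item does not fit; retrying cannot help.
        if on_oom is not None:
            on_oom()  # Cleanup hook, called only after a real failure.
        mid = len(batch) // 2
        left = run_with_oom_fallback(fn, batch[:mid], oom_error, on_oom)
        right = run_with_oom_fallback(fn, batch[mid:], oom_error, on_oom)
        return left + right


def fake_model(items: Sequence) -> List:
    """Simulated model that cannot handle more than 4 items at once."""
    if len(items) > 4:
        raise MemoryError("simulated out-of-memory")
    return [x * 2 for x in items]


cleanups = []
outputs = run_with_oom_fallback(
    fake_model, list(range(10)), on_oom=lambda: cleanups.append(1)
)
```

The cleanup hook never runs on the happy path, which matches the advice above: batches that fit pay no extra cost.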

## Mistake 14: Returning inefficient data structures

**Why this is wrong:**
- Returning Python lists instead of numpy arrays when possible
- Lists have more overhead for Ray Data to process
- Ray Data works best with numpy/PyArrow columnar formats

**Correct approach:**
- Return numpy arrays for numeric data
- Use appropriate dtypes (don't return float64 if float32 works)

```python
# Wrong: Return Python lists
def __call__(self, batch):
    with torch.no_grad():
        results = self.model(torch.as_tensor(batch["image"]))
    return {"predictions": results.tolist()}  # Converts to Python list

# Correct: Return numpy arrays
def __call__(self, batch):
    with torch.no_grad():
        results = self.model(torch.as_tensor(batch["image"]))
    return {"predictions": results.cpu().numpy()}  # Keep as numpy array
```

## Mistake 15: Suboptimal resource configuration

**Why this is wrong:**
- Requesting too many CPUs per task prevents other tasks from running
- Very small batch sizes don't utilize hardware efficiently
- High concurrency compounds model reloading problems
- Causes resource fragmentation and memory pressure

**Correct approach:**
- For CPU inference: set `num_cpus` based on model threading needs (usually 2-4)
- For GPU inference: use `num_gpus=1` (or fractional: `num_gpus=0.25`)
- Set `batch_size` based on available memory: typically 32-256
- Set concurrency based on available resources:
  - CPU: `num_cpus_available / num_cpus_per_task`
  - GPU: number of available GPUs
- Use `accelerator_type="A10G"` or similar for specific GPU types in production

<div style="background-color: #f3e5f5; border-left: 4px solid #9c27b0; padding: 12px; margin: 16px 0;">
<strong>Resource Configuration Example:</strong>
<pre><code>ds.map_batches(
    ImageClassifier,
    batch_size=64,           # Process 64 images at once
    num_gpus=1,              # 1 GPU per worker
    concurrency=4,           # 4 workers if you have 4 GPUs
    accelerator_type="A10G"  # Request specific GPU type
)</code></pre>
</div>
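
The concurrency rules above reduce to simple arithmetic. Here is a small sketch; the helper names and the example cluster size are illustrative assumptions. In practice the available counts could come from `ray.cluster_resources()`.

```python
def cpu_concurrency(num_cpus_available: float, num_cpus_per_task: float) -> int:
    """Number of CPU actors that fit: available CPUs divided by CPUs per actor."""
    if num_cpus_per_task <= 0:
        raise ValueError("num_cpus_per_task must be positive")
    return max(1, int(num_cpus_available // num_cpus_per_task))


def gpu_concurrency(num_gpus_available: float, num_gpus_per_task: float = 1.0) -> int:
    """Number of GPU actors that fit; fractional GPUs allow several actors per device."""
    if num_gpus_per_task <= 0:
        raise ValueError("num_gpus_per_task must be positive")
    return max(1, int(num_gpus_available // num_gpus_per_task))


# Example cluster: 32 CPUs and 4 GPUs (illustrative numbers).
cpu_actors = cpu_concurrency(32, 4)           # 4 CPUs per actor
gpu_actors = gpu_concurrency(4, 1)            # one actor per GPU
shared_gpu_actors = gpu_concurrency(4, 0.25)  # four actors share each GPU
```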

# Other common mistakes and antipatterns

## Mistake 16: Not handling errors gracefully

**Why this is wrong:**
- Letting one bad image crash entire batch/job
- No error handling around individual image processing
- Not leveraging Ray Data's error handling features

**Correct approach:**
- Use try-except around individual item processing
- Set `DataContext.max_errored_blocks` to tolerate some failures
- Log errors for debugging while continuing execution
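
A rough sketch of per-item error handling follows. It fits decoding or preprocessing steps, where a single corrupt input is the typical failure; the model forward pass itself should still run on the whole batch. `preprocess_one` and `safe_process_batch` are hypothetical names, and the mean-brightness "result" is only a placeholder for real work.

```python
import numpy as np


def preprocess_one(image: np.ndarray) -> float:
    """Hypothetical per-image step; rejects inputs that are not 3-dimensional."""
    if image.ndim != 3:
        raise ValueError(f"expected HxWxC image, got shape {image.shape}")
    return float(image.mean())


def safe_process_batch(batch: dict) -> dict:
    """Process each item, recording failures instead of raising."""
    scores, errors = [], []
    for image in batch["image"]:
        try:
            scores.append(preprocess_one(image))
            errors.append("")
        except Exception as exc:  # Record and continue; one bad item should not end the job.
            scores.append(float("nan"))
            errors.append(str(exc))
    return {"score": np.array(scores, dtype=np.float32), "error": np.array(errors)}


# One valid image and one corrupt (2-D) input.
batch = {"image": [np.ones((4, 4, 3)), np.ones((4, 4))]}
result = safe_process_batch(batch)
```

Keeping an `error` column lets you filter and inspect failed rows after the job finishes instead of losing them.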

## Mistake 17: Using non-deterministic operations without setting seeds

**Why this is wrong:**
- Makes debugging and reproduction difficult
- Can cause flaky tests and inconsistent results
- Different workers can produce different results for the same input

**Correct approach:**
- Set random seeds for Python, NumPy, and PyTorch
- Use deterministic algorithms when available
- Document any remaining sources of non-determinism
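
A minimal seeding helper might look like the sketch below. `seed_everything` is a hypothetical name. The function seeds Python's `random` module unconditionally and seeds NumPy and PyTorch only when they are installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python, plus NumPy and PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # Seeds the CPU and CUDA generators.
    except ImportError:
        pass


# The same seed reproduces the same sequence.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
```

In a Ray Data pipeline, calling a helper like this from the actor's `__init__` seeds each worker process. For stricter guarantees, PyTorch also offers `torch.use_deterministic_algorithms(True)`, which can reduce performance.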

## Mistake 18: Not considering data locality

**Why this is wrong:**
- Reading data from remote storage repeatedly
- Not using Ray Data's automatic data locality optimizations
- Not colocating compute with data

**Correct approach:**
- Use Ray Data's built-in data sources that optimize locality
- Cache intermediate results when appropriate
- Let Ray schedule tasks close to data

## Mistake 19: Ignoring batch format

**Why this is wrong:**
- Not checking if batch is in numpy/pandas/pyarrow format
- Assuming specific format without verification
- Unnecessary conversions when batch is already in usable format

**Correct approach:**
- Use `batch_format` parameter in `map_batches`
- Specify format that works best for your operations
- Common formats: `"numpy"`, `"pandas"`, `"pyarrow"`

## Mistake 20: Using wrong tensor dtypes

**Why this is wrong:**
- Using float64 when float32 is sufficient
- Not matching model's expected input dtype
- Unnecessary precision wastes memory and compute

**Correct approach:**
- Use float32 for most deep learning operations
- Match input dtype to model expectations
- Only use higher precision when mathematically necessary
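
The memory cost of an accidental float64 is easy to check. This is a small NumPy sketch with an arbitrary, illustrative batch shape:

```python
import numpy as np

# NumPy defaults to float64, which is easy to produce by accident.
images_f64 = np.random.rand(16, 3, 32, 32)

# Convert once, at the boundary, to the dtype the model expects.
images_f32 = images_f64.astype(np.float32)

# float64 takes twice the memory of float32 for the same data.
ratio = images_f64.nbytes / images_f32.nbytes
```

PyTorch model weights are float32 by default, so feeding float64 inputs typically raises a dtype mismatch error rather than silently working.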

## Mistake 21: Not using class-based inference pattern

**Why this is wrong:**
- Functions can't maintain state between batches
- Can't initialize resources once and reuse them
- No proper lifecycle management (setup/teardown)

**Correct approach:**
- Use callable classes with `__init__` and `__call__`
- Initialize expensive resources (models, transforms) in `__init__`
- Process batches in `__call__`
- Optionally implement `__del__` for cleanup

```python
# Wrong: Function-based (can't maintain state)
def inference_function(batch):
    model = load_model()  # Reloaded every batch
    return model(batch)

# Correct: Class-based pattern
class InferenceModel:
    def __init__(self):
        self.model = load_model()  # Loaded once
        self.transform = create_transform()
    
    def __call__(self, batch):
        return self.model(self.transform(batch))
```

<div style="background-color: #e0f7fa; border-left: 4px solid #00bcd4; padding: 12px; margin: 16px 0;">
<strong>Summary:</strong> The class-based inference pattern (Mistakes 1-3, 21) is the foundation of efficient Ray Data inference. Master this pattern first, then optimize batching (Mistake 8) and resource allocation (Mistake 15).
</div>

## Fixed Version

In [4]:
from typing import Dict, Any
import numpy as np
import torch
from torchvision.models import ResNet152_Weights
from torchvision import transforms
from torchvision import models
import ray.data

# Check if GPU is available
HAS_GPU = torch.cuda.is_available()

# Read the dataset
dataset = ray.data.read_images(
    "s3://anonymous@air-example-data-2/imagenette2/train/",
    mode="RGB",
    # Overriding the number of blocks can help with parallelism
    # override_num_blocks=100,
    ray_remote_args={"num_cpus": 0.1},
)


def preprocess_image(row: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Preprocess a single image by applying transforms.
    
    Note: This function is kept lightweight and stateless. For more complex
    preprocessing that requires initialization, consider using a class-based
    approach similar to InferenceWorker.
    
    Args:
        row: Dictionary containing 'image' key with numpy array
        
    Returns:
        Dictionary with transformed image as numpy array
    """
    # ========================================================================
    # KNOWN TRADE-OFF: Transform defined inside function
    # ========================================================================
    # This recreates the transform for each image, which adds overhead.
    # 
    # For this preprocessing step, we have a trade-off:
    # - Option 1: Keep it simple with a function (current approach)
    #   - Pros: Simple, works for lightweight transforms
    #   - Cons: Recreates transform objects repeatedly
    # 
    # - Option 2: Use a class-based approach (better for complex preprocessing)
    #   - Pros: Initialize transforms once per worker
    #   - Cons: More code, might be overkill for simple transforms
    #
    # For production with expensive preprocessing, use a class like InferenceWorker
    # ========================================================================
    weights = ResNet152_Weights.IMAGENET1K_V1
    imagenet_transforms = weights.transforms()
    transform = transforms.Compose([
        transforms.ToTensor(),
        imagenet_transforms
    ])
    
    # Transform returns a tensor, convert back to numpy for Ray Data
    transformed_tensor = transform(row["image"])
    
    return {
        "transformed_image": transformed_tensor.numpy(),
    }


class InferenceWorker:
    """Efficient inference worker using class-based pattern.
    
    This class demonstrates best practices for Ray Data batch inference:
    - Model loaded once in __init__ (not per batch)
    - Transforms initialized once in __init__
    - Reuses resources across all batches in __call__
    - Proper batched inference (not one-by-one)
    """
    
    def __init__(self):
        """Initialize model and transforms once per worker.
        
        This method runs once when the actor is created, not for every batch.
        All expensive initialization happens here.
        """
        # ====================================================================
        # BEST PRACTICE: Initialize model once in __init__
        # ====================================================================
        # The model is loaded once per worker and reused for all batches
        # This avoids the expensive model loading overhead for each batch
        # ====================================================================
        self.weights = ResNet152_Weights.IMAGENET1K_V1
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = models.resnet152(weights=self.weights).to(self.device)
        self.model.eval()
        
        # ====================================================================
        # BEST PRACTICE: Initialize transforms once in __init__
        # ====================================================================
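        # Transforms are created once per worker. This pipeline preprocesses in
        # preprocess_image, so self.transform is only needed if you move
        # preprocessing into the actor.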
        # ====================================================================
        imagenet_transforms = self.weights.transforms()
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            imagenet_transforms
        ])

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, Any]:
        """Process a batch of images.
        
        This method is called for each batch. The model and transforms are
        already loaded and ready to use.
        
        Args:
            batch: Dictionary with 'transformed_image' key containing numpy array
            
        Returns:
            Dictionary with predictions
        """
        # ====================================================================
        # BEST PRACTICE: Batched inference
        # ====================================================================
        # Process entire batch at once, not one image at a time
        # This utilizes GPU/CPU parallelism and vectorization effectively
        # ====================================================================
        
        # Convert the numpy array of images into a PyTorch tensor
        # Shape: (batch_size, channels, height, width)
        torch_batch = torch.from_numpy(batch["transformed_image"]).to(self.device)
        
        # Run inference on the entire batch at once
        with torch.no_grad():
            # ================================================================
            # BEST PRACTICE: Single forward pass for entire batch
            # ================================================================
            # One forward pass processes all images simultaneously
            # Much more efficient than looping through images
            # ================================================================
            prediction = self.model(torch_batch)
            
            # Get predicted classes for all images in batch
            # argmax(dim=1) gets the class with highest score for each image
            predicted_classes = prediction.argmax(dim=1).detach().cpu()
            
            # Convert class indices to human-readable labels
            predicted_labels = [
                self.weights.meta["categories"][i] for i in predicted_classes
            ]
            
            # Optional: Get confidence scores
            probabilities = torch.nn.functional.softmax(prediction, dim=1)
            confidence_scores = probabilities.max(dim=1).values.detach().cpu().numpy()
        
        return {
            "predicted_label": np.array(predicted_labels),
            "confidence": confidence_scores,  # Already a NumPy array; no .tolist() needed
        }


# ============================================================================
# BEST PRACTICE: Proper resource configuration
# ============================================================================
# - Use class-based approach for stateful processing
# - Set concurrency based on available GPUs/CPUs
# - Use appropriate batch_size for memory and throughput
# - Allocate resources (num_gpus, num_cpus) appropriately
# ============================================================================

inference_results = (
    dataset
    .limit(1000)
    # Preprocess images before inference
    # Using low num_cpus since this is lightweight
    .map(preprocess_image, num_cpus=0.1)
    # Run batched inference with class-based actors
    .map_batches(
        InferenceWorker,
        concurrency=4 if HAS_GPU else 8,  # Fewer actors for GPU, more for CPU
        num_gpus=1 if HAS_GPU else 0,  # Allocate GPU if available
        num_cpus=1 if HAS_GPU else 4,  # Use more CPU cores if no GPU
        batch_size=64 if HAS_GPU else 16,  # Larger batches for GPU
        # Optional: Control batch format
        # batch_format="numpy" is default and works well here
    )
).take_all()

2025-10-10 16:35:50,357	INFO logging.py:295 -- Registered dataset logger for dataset dataset_6_0
2025-10-10 16:35:50,367	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_6_0. Full logs are in /tmp/ray/session_2025-10-10_16-23-49_015346_2333/logs/ray-data
2025-10-10 16:35:50,368	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_6_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> LimitOperator[limit=1000] -> TaskPoolMapOperator[Map(preprocess_image)] -> ActorPoolMapOperator[MapBatches(InferenceWorker)]


Running optimized Ray Data inference with stateful workers...


{"asctime":"2025-10-10 16:35:50,422","levelname":"E","message":"Actor with class name: 'MapWorker(MapBatches(InferenceWorker))' and ID: '81e93d832395eb07fd426fb002000000' has constructor arguments in the object store and max_restarts > 0. If the arguments in the object store go out of scope or are lost, the actor restart will fail. See https://github.com/ray-project/ray/issues/53727 for more details.","filename":"core_worker.cc","lineno":2254}


Operator 'Map(preprocess_image)' uses 485.8MB of memory per task on
average, but Ray only requests 0.0B per task at the start of the
pipeline.

To avoid out-of-memory errors, consider setting `memory=485.8MB` in
the appropriate function or method call. (This might be unnecessary if
the number of concurrent tasks is low.)

`DataContext.get_current().issue_detectors_config.high_memory_detector_config.detection_time_interval_s`,

2025-10-10 16:36:22,857	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_6_0 execution finished in 32.49 seconds


---

## Key Takeaways from Part 1

You've learned the fundamentals of batch inference optimization:
- Identified common anti-patterns that destroy performance
- Understood why repeated model loading is problematic  
- Implemented class-based actors for stateful model loading
- Used proper resource allocation with `num_gpus` and `concurrency`
- Learned CPU and GPU compatibility patterns

## Next Steps

Now that you understand the fundamentals, you're ready to learn systematic optimization techniques.

**[← Back to Overview](README.md)** | **[Continue to Part 2: Advanced Optimization →](02-advanced-optimization.md)**

In Part 2, you'll learn:
- Systematic decision frameworks for choosing optimization techniques
- Multi-model ensemble inference patterns
- Performance monitoring and diagnostics
- Production deployment best practices

**Or skip ahead to Part 3** for a deep dive into Ray Data's architecture:

**[Jump to Part 3: Ray Data Architecture →](03-ray-data-architecture.md)**

In Part 3, you'll learn:
- How streaming execution enables unlimited dataset processing
- How blocks and memory management affect optimization
- How operator fusion and backpressure work
- How to calculate optimal parameters from architectural principles

**[Return to overview](README.md)** to see all available parts.