# Chapter 18: Optimization and Parallelization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/computer-vision/blob/main/chapter_18_optimization_and_parallelization.ipynb)

**Performance optimization** is critical for production rendering. This chapter covers algorithmic optimizations, multi-threading, SIMD vectorization, and GPU acceleration strategies.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import time
import math
from typing import List, Tuple
import threading
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from multiprocessing import cpu_count

## 1. Profiling and Measurement

**Always profile before optimizing!**

### Performance Metrics

- **Time**: Wall-clock, CPU time
- **Throughput**: Rays/second, pixels/second
- **Memory**: Peak usage, allocations
- **Cache**: Hit rate, misses
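
As a minimal sketch of how the time and memory metrics above can be collected with the standard library, the cell below profiles a deliberately naive workload. The functions `slow_sum_of_squares` and `workload` are hypothetical stand-ins, not part of any renderer; cache statistics need hardware counters and are not covered here.

```python
import cProfile
import io
import pstats
import tracemalloc

def slow_sum_of_squares(n):
    """Deliberately naive hotspot, used only for this illustration."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def workload():
    """Hypothetical workload: call the hotspot a few times."""
    return [slow_sum_of_squares(20_000) for _ in range(20)]

# CPU profile: which functions dominate the run time?
profiler = cProfile.Profile()
profiler.enable()
results = workload()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)

# Memory profile: peak size of Python allocations made during the call
tracemalloc.start()
workload()
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Peak traced memory: {peak_bytes / 1024:.1f} KiB")
```

Sorting by cumulative time puts the call chain leading to the hotspot at the top of the report, which is usually the first thing to look at before changing any code.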

### Amdahl's Law

Maximum speedup from parallelization:

$$
S = \frac{1}{(1-P) + \frac{P}{N}}
$$

where:
- $P$ = fraction of code that can be parallelized
- $N$ = number of processors

Example: If 90% parallelizable with 8 cores:
$$
S = \frac{1}{0.1 + \frac{0.9}{8}} \approx 4.7\times
$$
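
The formula is easy to explore numerically. The helper below is a small sketch (the name `amdahl_speedup` is chosen here for illustration) that reproduces the 8-core example and shows how the speedup saturates as $N$ grows.

```python
def amdahl_speedup(parallel_fraction: float, num_processors: int) -> float:
    """Upper bound on speedup when the parallel part scales perfectly."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if num_processors < 1:
        raise ValueError("num_processors must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)

for n in (1, 2, 8, 64, 1024):
    print(f"N={n:5d}  speedup={amdahl_speedup(0.9, n):.2f}x")

# As N grows, the speedup approaches 1 / (1 - P), which is 10x for P = 0.9.
```

Even with unlimited cores, the 10% serial part caps the speedup at 10x, which is why reducing serial work often matters more than adding processors.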

In [None]:
class Timer:
    """Context manager that prints the wall-clock time of a code block"""
    def __init__(self, name="Operation"):
        self.name = name
        self.start_time = None
        self.elapsed = None
    
    def __enter__(self):
        # perf_counter is monotonic and has higher resolution than time.time
        self.start_time = time.perf_counter()
        return self
    
    def __exit__(self, *args):
        self.elapsed = time.perf_counter() - self.start_time
        print(f"{self.name}: {self.elapsed:.4f} seconds")

def benchmark(func, *args, iterations=10):
    """Return (mean, std) of wall-clock time over several calls"""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - start)
    
    mean_time = np.mean(times)
    std_time = np.std(times)
    return mean_time, std_time

print("✓ Profiling tools loaded")

## 2. Algorithmic Optimizations

### Early Ray Termination

Stop tracing when contribution becomes negligible:

$$
\text{throughput} < \epsilon \implies \text{terminate}
$$
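
A minimal sketch of this rule, assuming a constant per-bounce RGB `albedo` as a stand-in for a real BSDF evaluation (both function names are hypothetical). A hard cut-off like this slightly biases the result because the discarded energy is never accounted for; Russian roulette, shown second, is a common unbiased alternative that terminates paths at random and reweights the survivors.

```python
import numpy as np

def trace_path_bounces(albedo, epsilon=1e-3, max_bounces=64):
    """Count bounces until the path throughput drops below epsilon."""
    throughput = np.ones(3)
    for bounce in range(max_bounces):
        throughput = throughput * albedo      # energy lost at this bounce
        if throughput.max() < epsilon:        # contribution is negligible
            return bounce + 1, throughput
    return max_bounces, throughput

def russian_roulette(throughput, rng):
    """Randomly terminate a path; survivors are reweighted to stay unbiased."""
    survive_prob = min(1.0, float(throughput.max()))
    if rng.random() >= survive_prob:
        return None                           # path terminated
    return throughput / survive_prob

bounces, final_throughput = trace_path_bounces(np.array([0.5, 0.5, 0.5]))
print(f"Terminated after {bounces} bounces")

rng = np.random.default_rng(0)
print(russian_roulette(np.array([0.2, 0.1, 0.05]), rng))
```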

### Bounding Volume Hierarchy (Review)

- Use the surface area heuristic (SAH) to choose good splits
- Traverse in front-to-back order
- Cache-friendly memory layout
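
As a reminder of how the SAH scores a candidate split, here is a small sketch. The cost constants `c_traverse` and `c_intersect` are assumed tuning parameters, and the function names are hypothetical; boxes are given as `(min_corner, max_corner)` pairs.

```python
import numpy as np

def surface_area(bounds_min, bounds_max):
    """Surface area of an axis-aligned bounding box."""
    d = np.maximum(np.asarray(bounds_max, float) - np.asarray(bounds_min, float), 0.0)
    return 2.0 * (d[0] * d[1] + d[1] * d[2] + d[2] * d[0])

def sah_cost(parent, left, right, n_left, n_right,
             c_traverse=1.0, c_intersect=1.0):
    """Expected cost of splitting `parent` into `left` and `right` children.

    The probability of a ray hitting a child, given that it hit the parent,
    is approximated by the ratio of their surface areas.
    """
    sa_parent = surface_area(*parent)
    p_left = surface_area(*left) / sa_parent
    p_right = surface_area(*right) / sa_parent
    return c_traverse + c_intersect * (p_left * n_left + p_right * n_right)

# Unit cube with 8 primitives, split in half along x with 4 primitives per side
parent = ((0, 0, 0), (1, 1, 1))
left = ((0, 0, 0), (0.5, 1, 1))
right = ((0.5, 0, 0), (1, 1, 1))

split_cost = sah_cost(parent, left, right, n_left=4, n_right=4)
leaf_cost = 8 * 1.0  # intersect all 8 primitives without splitting
print(f"split cost = {split_cost:.3f}, leaf cost = {leaf_cost:.3f}")
```

A builder would evaluate this cost for many candidate planes and keep the cheapest one, falling back to a leaf when no split beats the leaf cost.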

### Incremental Computation

Reuse calculations across iterations:
- Don't recompute static geometry
- Cache intersection data
- Incremental updates for animation
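
One way to avoid recomputing static geometry is to cache derived data and invalidate it only when its inputs change. The class below is an illustrative sketch; `CachedMesh` and its methods are hypothetical names, not an existing API.

```python
import numpy as np

class CachedMesh:
    """Caches world-space vertices until the transform actually changes."""

    def __init__(self, local_vertices):
        self.local_vertices = np.asarray(local_vertices, dtype=np.float64)
        self._transform = np.eye(4)
        self._world_vertices = None   # None means the cache is stale
        self.recompute_count = 0      # instrumentation for this demo only

    def set_transform(self, matrix):
        matrix = np.asarray(matrix, dtype=np.float64)
        if not np.array_equal(matrix, self._transform):
            self._transform = matrix
            self._world_vertices = None   # invalidate only on a real change

    def world_vertices(self):
        if self._world_vertices is None:
            ones = np.ones(len(self.local_vertices))
            homogeneous = np.c_[self.local_vertices, ones]        # (N, 4)
            self._world_vertices = (homogeneous @ self._transform.T)[:, :3]
            self.recompute_count += 1
        return self._world_vertices

mesh = CachedMesh([[0, 0, 0], [1, 0, 0], [0, 1, 0]])
for _ in range(3):                 # three "frames" with static geometry
    mesh.world_vertices()
print("Recomputations for static frames:", mesh.recompute_count)

translate = np.eye(4)
translate[:3, 3] = [2.0, 0.0, 0.0]
mesh.set_transform(translate)      # animation step: cache is invalidated
print(mesh.world_vertices()[0])
```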

In [None]:
# Example: Naive vs optimized matrix-vector multiplication

def matmul_naive(matrix, vector):
    """Naive matrix-vector multiply (list of lists)"""
    n = len(matrix)
    result = [0] * n
    for i in range(n):
        for j in range(n):
            result[i] += matrix[i][j] * vector[j]
    return result

def matmul_numpy(matrix, vector):
    """NumPy optimized (BLAS)"""
    return np.dot(matrix, vector)

# Benchmark
size = 100
mat_list = [[float(i + j) for j in range(size)] for i in range(size)]
vec_list = [float(i) for i in range(size)]
mat_np = np.array(mat_list)
vec_np = np.array(vec_list)

with Timer("Naive matmul"):
    for _ in range(100):
        matmul_naive(mat_list, vec_list)

with Timer("NumPy matmul"):
    for _ in range(100):
        matmul_numpy(mat_np, vec_np)

print("NumPy dispatches to compiled BLAS routines (often 10-100x faster than the pure-Python loop)")

## 3. Data Structure Optimization

### Array of Structures (AoS) vs Structure of Arrays (SoA)

**AoS** (each vertex's fields stored together; poor cache locality when a loop reads only one field):
```python
vertices = [Vertex(x, y, z), Vertex(x, y, z), ...]
```

**SoA** (each field stored contiguously; better cache locality and easier SIMD for per-field loops):
```python
vertices_x = [x1, x2, x3, ...]
vertices_y = [y1, y2, y3, ...]
vertices_z = [z1, z2, z3, ...]
```

### Memory Layout

- **Contiguous memory**: Better cache performance
- **Alignment**: Align to cache line boundaries (typically 64 bytes)
- **Padding**: Avoid false sharing in multi-threading
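
The layout difference can be inspected directly in NumPy. This sketch uses a structured array as an AoS stand-in and a plain array as one SoA field; the 64-byte cache line is an assumption about the target CPU.

```python
import numpy as np

n = 1_000
CACHE_LINE_BYTES = 64  # assumed cache line size

# AoS: one record per vertex, fields x, y, z stored side by side
aos = np.zeros(n, dtype=[("x", np.float64), ("y", np.float64), ("z", np.float64)])
# SoA: one contiguous array per field (only x shown)
soa_x = np.zeros(n, dtype=np.float64)

aos_x_view = aos["x"]   # a strided view into the records, not a copy

print("AoS record size:", aos.dtype.itemsize, "bytes")    # 24
print("AoS x stride:   ", aos_x_view.strides[0], "bytes")  # 24
print("SoA x stride:   ", soa_x.strides[0], "bytes")       # 8
print("AoS x contiguous:", aos_x_view.flags["C_CONTIGUOUS"])
print("SoA x contiguous:", soa_x.flags["C_CONTIGUOUS"])

# Useful x values fetched per cache line when a loop reads only x
print("x values per line, AoS:", CACHE_LINE_BYTES / aos_x_view.strides[0])
print("x values per line, SoA:", CACHE_LINE_BYTES / soa_x.strides[0])
```

When a loop touches only `x`, the AoS view drags the unused `y` and `z` bytes through the cache, while the SoA array uses every byte it loads.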

In [None]:
# Demonstrate AoS vs SoA performance

class Vec3AoS:
    """Array of Structures"""
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

def process_aos(vertices):
    """Process AoS"""
    result = 0.0
    for v in vertices:
        result += v.x + v.y + v.z
    return result

def process_soa(x, y, z):
    """Process SoA"""
    return np.sum(x) + np.sum(y) + np.sum(z)

# Create data
n = 100000
vertices_aos = [Vec3AoS(i, i+1, i+2) for i in range(n)]
vertices_x = np.arange(n, dtype=np.float64)
vertices_y = np.arange(n, dtype=np.float64) + 1
vertices_z = np.arange(n, dtype=np.float64) + 2

# Benchmark
with Timer("AoS processing"):
    for _ in range(10):
        process_aos(vertices_aos)

with Timer("SoA processing"):
    for _ in range(10):
        process_soa(vertices_x, vertices_y, vertices_z)

print("SoA with NumPy is faster here mainly because the loop runs in compiled, vectorized code; in native code the SoA layout also improves cache locality")

## 4. Multi-Threading

### Thread-Based Parallelism

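- Python's `threading`: suited to I/O-bound tasks, because the GIL lets only one thread execute Python bytecode at a time
- Python's `multiprocessing`: suited to CPU-bound tasks, since each process has its own interpreter and its own GIL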

### Rendering Parallelization Strategies

1. **Tile-based**: Divide image into tiles, one per thread
2. **Scanline**: Each thread renders different rows
3. **Interleaved**: With $n$ threads, thread $i$ renders pixels $i, i+n, i+2n, \dots$
4. **Dynamic**: Work queue with tasks
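
The three static strategies differ only in how pixels are assigned to workers. The helpers below are an illustrative sketch with hypothetical names; the dynamic strategy corresponds to feeding a list such as `tile_jobs(...)` to a pool, which hands out tasks from its internal queue as workers become free.

```python
def tile_jobs(width, height, tile_size):
    """Tile-based: list of (x, y, tile_width, tile_height), clipped at the edges."""
    return [(x, y, min(tile_size, width - x), min(tile_size, height - y))
            for y in range(0, height, tile_size)
            for x in range(0, width, tile_size)]

def scanline_rows(height, num_workers, worker):
    """Scanline: a contiguous block of rows for one worker."""
    rows_per_worker = -(-height // num_workers)   # ceiling division
    start = min(worker * rows_per_worker, height)
    stop = min(start + rows_per_worker, height)
    return range(start, stop)

def interleaved_pixels(width, height, num_workers, worker):
    """Interleaved: flat pixel indices worker, worker + n, worker + 2n, ..."""
    return range(worker, width * height, num_workers)

width, height, workers = 100, 70, 4
print("tiles:", len(tile_jobs(width, height, 32)))
print("rows for worker 3:", scanline_rows(height, workers, 3))
print("first pixels for worker 1:", list(interleaved_pixels(width, height, workers, 1))[:4])
```

Interleaving balances load well when cost varies smoothly across the image but gives each worker poor spatial coherence; tiles keep nearby rays together at the price of less even load.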

### Thread Synchronization

- **Locks**: Protect shared data
- **Atomics**: Lock-free operations
- **Thread-local storage**: Avoid sharing
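
The sketch below contrasts two of these options on a toy reduction: a shared total protected by a lock, and private per-task accumulators that are merged once at the end. Function names are hypothetical, and the worker count of 4 is arbitrary.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def sum_with_lock(chunks):
    """Every update to the shared total takes the lock (correct, but contended)."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        for value in chunk:
            with lock:               # protects the read-modify-write on total
                total += value

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(worker, chunks))   # consume the iterator to wait for all tasks
    return total

def sum_with_local_partials(chunks):
    """Each task accumulates privately; results are merged once at the end."""
    def worker(chunk):
        return sum(chunk)            # no shared state, so no lock is needed

    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(worker, chunks))

chunks = [range(i * 1000, (i + 1) * 1000) for i in range(8)]
print(sum_with_lock(chunks), sum_with_local_partials(chunks))
```

Both versions return the same total, but the second avoids all lock traffic. In a renderer the analogue is letting each thread write its own tile buffer and copying tiles into the image afterwards.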

In [None]:
def render_tile(x_start, y_start, tile_size, width, height):
    """Render a tile of pixels (dummy implementation)"""
    tile = np.zeros((tile_size, tile_size, 3))
    
    for y in range(tile_size):
        for x in range(tile_size):
            px = x_start + x
            py = y_start + y
            
            if px < width and py < height:
                # Simulate ray tracing work
                tile[y, x] = [px / width, py / height, 0.5]
    
    return (x_start, y_start, tile)

def render_single_threaded(width, height, tile_size):
    """Single-threaded tiled rendering"""
    image = np.zeros((height, width, 3))
    
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            x_start, y_start, tile = render_tile(x, y, tile_size, width, height)
            
            # Copy tile to image
            h = min(tile_size, height - y_start)
            w = min(tile_size, width - x_start)
            image[y_start:y_start+h, x_start:x_start+w] = tile[:h, :w]
    
    return image

def render_multi_threaded(width, height, tile_size, num_threads=None):
    """Multi-threaded tiled rendering (threads share the GIL, so pure-Python tile work does not run in parallel)"""
    if num_threads is None:
        num_threads = cpu_count()
    
    image = np.zeros((height, width, 3))
    
    # Generate tile jobs
    jobs = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            jobs.append((x, y, tile_size, width, height))
    
    # Process tiles in parallel
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        results = executor.map(lambda args: render_tile(*args), jobs)
        
        for x_start, y_start, tile in results:
            h = min(tile_size, height - y_start)
            w = min(tile_size, width - x_start)
            image[y_start:y_start+h, x_start:x_start+w] = tile[:h, :w]
    
    return image

print("✓ Multi-threading loaded")
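
Because `render_tile` above is a pure-Python loop, threads contend for the GIL while running it. One option, sketched here under the assumption that the per-pixel work can be expressed with array operations, is to vectorize the tile so that most of the time is spent inside NumPy's compiled code. `render_tile_vectorized` is a hypothetical drop-in with the same arguments and return value.

```python
import numpy as np

def render_tile_vectorized(x_start, y_start, tile_size, width, height):
    """Same output as the loop-based render_tile, computed with array ops."""
    xs = x_start + np.arange(tile_size)
    ys = y_start + np.arange(tile_size)
    px, py = np.meshgrid(xs, ys)            # both (tile_size, tile_size), indexed [y, x]

    inside = (px < width) & (py < height)   # pixels outside the image stay black
    tile = np.zeros((tile_size, tile_size, 3))
    tile[..., 0] = np.where(inside, px / width, 0.0)
    tile[..., 1] = np.where(inside, py / height, 0.0)
    tile[..., 2] = np.where(inside, 0.5, 0.0)
    return (x_start, y_start, tile)

x0, y0, tile = render_tile_vectorized(64, 64, 64, 100, 90)
print(tile.shape, tile[0, 0])
```

For CPU-bound Python code that cannot be vectorized, `ProcessPoolExecutor` is the usual alternative, at the cost of pickling arguments and results between processes.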

## 5. SIMD Vectorization

**SIMD** (Single Instruction, Multiple Data) processes multiple values simultaneously.

### Vector Operations

Modern CPUs support:
- **SSE**: 128-bit (4 floats)
- **AVX**: 256-bit (8 floats)
- **AVX-512**: 512-bit (16 floats)

### Ray Packet Tracing

Trace 4/8/16 coherent rays together:
- Better cache utilization
- Amortize BVH traversal
- SIMD intersection tests
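
The idea of testing a whole packet with one set of array operations can be sketched in NumPy. This is an illustration of the data-parallel formulation, not a real SIMD packet tracer; the function name is hypothetical and ray directions are assumed to be normalized.

```python
import numpy as np

def intersect_sphere_packet(origins, directions, center, radius):
    """Nearest positive hit distance per ray, np.inf where a ray misses.

    origins, directions: (N, 3) arrays; directions must be unit length.
    """
    oc = origins - center                          # (N, 3)
    b = np.einsum("ij,ij->i", oc, directions)      # per-ray dot(oc, d)
    c = np.einsum("ij,ij->i", oc, oc) - radius * radius
    disc = b * b - c                               # quarter discriminant
    hit = disc >= 0.0

    sqrt_disc = np.sqrt(np.where(hit, disc, 0.0))  # avoid sqrt of negatives
    t_near = -b - sqrt_disc
    t_far = -b + sqrt_disc
    t = np.where(t_near > 0.0, t_near, t_far)      # use far root if inside the sphere
    return np.where(hit & (t > 0.0), t, np.inf)

# A packet of 4 rays, all pointing along +z
origins = np.array([[0.0, 0.0, 0.0],    # hits the front of the sphere
                    [2.0, 0.0, 0.0],    # passes beside it
                    [0.0, 0.0, 5.0],    # starts at the sphere center
                    [0.0, 0.0, 10.0]])  # sphere is behind the ray
directions = np.tile([0.0, 0.0, 1.0], (4, 1))
t_hit = intersect_sphere_packet(origins, directions, np.array([0.0, 0.0, 5.0]), 1.0)
print(t_hit)
```

Every ray follows the same instruction sequence, and misses are handled with masks instead of branches, which is the property that lets hardware SIMD lanes stay busy.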

### NumPy Vectorization

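NumPy runs array operations in compiled loops, many of which use SIMD instructions where the CPU supports them. Much of the speedup over pure Python also comes from avoiding per-element interpreter overhead.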

In [None]:
# Vectorization example: computing distances

def distance_scalar(points, center):
    """Scalar computation (slow)"""
    distances = []
    for p in points:
        dx = p[0] - center[0]
        dy = p[1] - center[1]
        dz = p[2] - center[2]
        dist = math.sqrt(dx*dx + dy*dy + dz*dz)
        distances.append(dist)
    return distances

def distance_vectorized(points, center):
    """Vectorized computation (fast)"""
    diff = points - center
    return np.sqrt(np.sum(diff * diff, axis=1))

# Generate test data (the same points for both versions, so the comparison is fair)
n_points = 10000
points_np = np.random.rand(n_points, 3)
points_list = points_np.tolist()
center = np.array([0.5, 0.5, 0.5])

# Benchmark
with Timer("Scalar distance"):
    distance_scalar(points_list, center)

with Timer("Vectorized distance"):
    distance_vectorized(points_np, center)

print("Vectorized version is typically 10-100x faster!")

## 6. GPU Acceleration Concepts

### CUDA/OpenCL Architecture

- **Massive parallelism**: Thousands of threads
- **Memory hierarchy**: Global, shared, local, constant
- **Warps/Wavefronts**: Groups of threads execute together
- **Divergence**: Threads in a warp that take different branches run serially, which reduces efficiency

### Ray Tracing on GPU

**One thread per pixel**:
```cuda
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) {
    Ray ray = generate_camera_ray(x, y);
    Color c = trace_ray(ray, scene);
    image[y * width + x] = c;
}
```

### Optimization Strategies

1. **Coalesced memory access**: Access consecutive memory
2. **Shared memory**: Cache frequently accessed data
3. **Occupancy**: Balance registers/shared memory/threads
4. **Avoid divergence**: Minimize if/else branching
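
The launch arithmetic behind the one-thread-per-pixel mapping can be checked on the CPU. The sketch below is plain Python that mirrors the CUDA index computation; it does not run on a GPU, and the function names are hypothetical.

```python
def launch_dims(width, height, block=(16, 16)):
    """Grid size (in blocks) needed to cover the image, using ceiling division."""
    grid_x = (width + block[0] - 1) // block[0]
    grid_y = (height + block[1] - 1) // block[1]
    return grid_x, grid_y

def thread_counts(width, height, block=(16, 16)):
    """Return (launched threads, threads that pass the bounds check)."""
    grid_x, grid_y = launch_dims(width, height, block)
    launched = grid_x * block[0] * grid_y * block[1]
    active = width * height
    return launched, active

def pixel_index(block_idx, thread_idx, block_dim, width):
    """Flat image index for one thread, as in image[y * width + x]."""
    x = block_idx[0] * block_dim[0] + thread_idx[0]
    y = block_idx[1] * block_dim[1] + thread_idx[1]
    return y * width + x

grid = launch_dims(1920, 1080)
launched, active = thread_counts(1920, 1080)
print(f"grid = {grid}, launched = {launched}, idle = {launched - active}")
```

Threads with consecutive `threadIdx.x` map to consecutive `x`, and therefore to consecutive addresses in a row-major image, which is what makes the final write coalesced. The idle threads in the last row of blocks are the reason the kernel needs its bounds check.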

In [None]:
# CUDA kernel sketch stored in a Python string for display (not compiled or run here)
gpu_code = """
// CUDA kernel for ray tracing
__global__ void raytrace_kernel(
    float3* image,
    const Camera camera,
    const Scene* scene,
    int width, int height)
{
    // Thread coordinates
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x >= width || y >= height) return;
    
    // Generate camera ray through the pixel center
    float u = (x + 0.5f) / width;
    float v = (y + 0.5f) / height;
    Ray ray = camera.generate_ray(u, v);
    
    // Trace ray
    float3 color = trace_ray(ray, scene, 0);
    
    // Write to image
    image[y * width + x] = color;
}

// Launch kernel
dim3 blockSize(16, 16);
dim3 gridSize(
    (width + blockSize.x - 1) / blockSize.x,
    (height + blockSize.y - 1) / blockSize.y
);
raytrace_kernel<<<gridSize, blockSize>>>(image, camera, scene, width, height);
"""

print("GPU code example (CUDA):")
print(gpu_code)
print("\nKey points:")
print("- One thread per pixel")
print("- Launched in 16x16 thread blocks")
print("- Massively parallel (thousands of threads)")

## Example: Multi-Threading Performance

In [None]:
# Compare single vs multi-threaded rendering
width, height = 512, 512
tile_size = 64

print(f"Rendering {width}x{height} image with {tile_size}x{tile_size} tiles")
print(f"Available CPU cores: {cpu_count()}\n")

# Single-threaded
with Timer("Single-threaded"):
    img_single = render_single_threaded(width, height, tile_size)

# Multi-threaded (deduplicate thread counts in case cpu_count() is 2 or 4)
for num_threads in sorted({2, 4, cpu_count()}):
    with Timer(f"Multi-threaded ({num_threads} threads)"):
        img_multi = render_multi_threaded(width, height, tile_size, num_threads)

# Visualize result
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

axes[0].imshow(img_single)
axes[0].set_title('Single-threaded Result')
axes[0].axis('off')

axes[1].imshow(img_multi)
axes[1].set_title('Multi-threaded Result')
axes[1].axis('off')

plt.tight_layout()
plt.show()

print("\nNote: with the GIL, threads give little or no speedup for this pure-Python loop; near-linear scaling on embarrassingly parallel tasks needs processes or native code.")

## Summary

**Optimization and parallelization** strategies for production rendering:

### Key Concepts

1. **Profiling First**
   - Measure before optimizing
   - Identify hotspots
   - Amdahl's law limits

2. **Algorithmic Optimization**
   - BVH acceleration
   - Early termination
   - Incremental computation
   - Cache-friendly data structures

3. **Data Structure Layout**
   - SoA vs AoS: SoA usually better
   - Memory alignment
   - Contiguous storage
   - Avoid false sharing

4. **Multi-Threading**
   - Tile-based parallelism
   - Thread pool
   - Lock-free when possible
   - Near-linear speedup is possible when tiles are independent and the work runs outside the GIL (processes or native code)

5. **SIMD Vectorization**
   - Process 4-16 values at once
   - Ray packet tracing
   - NumPy runs compiled loops that often use SIMD
   - Often 10-100x faster than pure-Python loops

6. **GPU Acceleration**
   - Massive parallelism (1000s of threads)
   - One thread per pixel/ray
   - Coalesced memory access
   - Avoid divergence
   - Often 10-100x faster than CPU on well-suited workloads

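### Performance Hierarchy

Rough, workload-dependent speedups relative to naive Python:

```
Naive Python:           1x        (baseline)
NumPy vectorized:       10-100x   (compiled loops + SIMD)
Multi-core CPU:         4-16x     (cores; needs processes or native code in Python)
Combined (NumPy+MT):    40-1600x  (idealized product of the two rows above)
GPU (CUDA):             100-1000x
```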

### Optimization Checklist

✅ Profile to find bottlenecks  
✅ Use efficient algorithms (BVH, etc.)  
✅ Vectorize with NumPy  
✅ Use SoA data layout  
✅ Multi-thread embarrassingly parallel tasks  
✅ Consider GPU for massive parallelism  

### Production Renderer Performance

Modern renderers combine all techniques:
- **Pixar RenderMan**: Multi-core CPU + GPU hybrid
- **Arnold**: Multi-threaded with SIMD
- **V-Ray**: CPU + GPU modes
- **Cycles**: CPU + GPU rendering (CUDA, OptiX, HIP, Metal)
- **Real-time (games)**: GPU ray tracing (RTX)

Optimization is essential for interactive and production rendering!