# Lab 05: Out-of-Core, Streaming & Parallel Processing

**Course:** Big Data

---

## Student Information

**Name:** `Your Name Here`

**Date:** `DD/MM/YYYY`

---

**Goal:** Process datasets larger than RAM using chunking, implement streaming statistics, and leverage parallelization (threading/multiprocessing) for performance.

## Learning Objectives

1. **Use PyArrow Directly**: Understand when to use PyArrow vs Pandas for I/O
2. **Apply Projection Pushdown**: Read only the columns you need
3. **Process Out-of-Core**: Handle datasets larger than RAM with chunking
4. **Implement Online Statistics**: Compute mean/std in a single pass (Welford's algorithm)
5. **Parallelize Work**: Use threading for I/O and multiprocessing for CPU-bound tasks
6. **Build Complete Pipelines**: Combine chunking + parallelization

## Instructions

1. **Fill in your information above** before starting the lab
2. Read each cell carefully before running it
3. Implement the **TODO functions** when you see them
4. Run cells **from top to bottom** (Shift+Enter)
5. Check that output makes sense after each cell

---

## Libraries Used in This Lab

- **`pyarrow`** — Direct Parquet reading and Arrow Table operations
- **`pandas`** — DataFrame operations and chunked CSV reading
- **`numpy`** — Numerical operations
- **`concurrent.futures`** — ThreadPoolExecutor and ProcessPoolExecutor
- **`psutil`** — Memory monitoring
- **`matplotlib`** — Plotting memory and speedup charts

---

## 1. Imports and Setup

In [None]:
import json
import time
import os
import glob
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import psutil
import matplotlib.pyplot as plt

print("Imports successful!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"PyArrow version: {pa.__version__}")

## 2. Define Paths

In [None]:
DATA_RAW = Path("../data/raw")
DATA_PROCESSED = Path("../data/processed")
RESULTS_DIR = Path("../results")

WARMUP_PARQUET = DATA_PROCESSED / "sales_warmup.parquet"
SALES_CSV = DATA_RAW / "sales_large.csv"
SALES_PARTITIONED = DATA_PROCESSED / "sales_partitioned"
PARTITIONS_DIR = DATA_PROCESSED / "partitions"
ELECTRONICS_PARQUET = DATA_PROCESSED / "electronics_only.parquet"
METRICS_PATH = RESULTS_DIR / "lab05_metrics.json"

for d in [DATA_RAW, DATA_PROCESSED, RESULTS_DIR, PARTITIONS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("Paths defined.")

---

## Exercise 0: PyArrow Benchmark & Warm-up (15 min)

**Objective**: Familiarize yourself with PyArrow by comparing its performance against Pandas, and understand `iter_batches()`, which we will use in the rest of the lab.

### TODO 1: `generate_warmup_data()`

Generate a warmup dataset (5M rows) and save as Parquet.

**Columns**: `product_id`, `category`, `price`, `quantity`, `customer_id`

In [None]:
def generate_warmup_data(n: int = 5_000_000, seed: int = 42) -> pd.DataFrame:
    """
    Generate a warmup dataset and save as Parquet.

    Args:
        n: Number of rows
        seed: Random seed

    Returns:
        DataFrame with columns: product_id, category, price, quantity, customer_id
    """
    # TODO: Implement this function
    # Step 1: Set random seed with np.random.seed(seed)
    # Step 2: Create DataFrame with:
    #   - 'product_id': np.random.randint(1, 10000, n)
    #   - 'category': np.random.choice(['Electronics','Clothing','Home','Sports','Food'], n)
    #   - 'price': np.random.uniform(1, 1000, n).round(2)
    #   - 'quantity': np.random.randint(1, 50, n)
    #   - 'customer_id': np.random.randint(1, 100000, n)
    # Step 3: Save to WARMUP_PARQUET with df.to_parquet(..., index=False)
    # Step 4: Return df
    pass

In [None]:
print("Generating warmup dataset (5M rows)...")
df_warmup = generate_warmup_data()
if df_warmup is not None:
    print(f"Shape: {df_warmup.shape}")
    print(f"Memory: {df_warmup.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    print(df_warmup.head())
else:
    print("TODO: Implement generate_warmup_data()")

### TODO 2: `benchmark_read_methods()`

Compare three approaches to reading Parquet:
- **Method A**: `pd.read_parquet()` — Pandas (with conversion overhead)
- **Method B**: `pq.read_table()` — Arrow direct (no conversion)
- **Method C**: `table.to_pandas()` — Measure conversion cost separately

In [None]:
def benchmark_read_methods() -> dict:
    """
    Compare pd.read_parquet vs pq.read_table vs Arrow-to-Pandas conversion.

    Returns:
        Dictionary with timing results for each method.
    """
    # TODO: Implement this function
    # Method A: start = time.perf_counter(); df = pd.read_parquet(WARMUP_PARQUET); ...
    #   RAM: df.memory_usage(deep=True).sum() / 1e6
    # Method B: table = pq.read_table(WARMUP_PARQUET); ...
    #   RAM: table.nbytes / 1e6
    # Method C: df = table.to_pandas(); ...
    # Print results and compute speedup = t_pandas / t_arrow
    pass

In [None]:
read_bench = benchmark_read_methods()
if read_bench is None:
    print("TODO: Implement benchmark_read_methods()")

**Question**: Why is reading as an Arrow Table faster than reading directly to Pandas, if Pandas uses Arrow internally?

*Your answer here:*

---

### TODO 3: `benchmark_projection_pushdown()`

Compare reading all columns vs only 2 columns (`price`, `quantity`).

In [None]:
def benchmark_projection_pushdown() -> dict:
    """
    Compare reading all columns vs only needed columns from Parquet.

    Returns:
        Dictionary with timing, size, speedup, and total_revenue.
    """
    # TODO: Implement this function
    # 1. Read ALL: table_all = pq.read_table(WARMUP_PARQUET)
    # 2. Read 2 cols: table_cols = pq.read_table(WARMUP_PARQUET, columns=['price', 'quantity'])
    # 3. Compute revenue: pc.multiply(table_cols.column('price'), table_cols.column('quantity'))
    #    total = pc.sum(revenue).as_py()
    # 4. Print speedup and data reduction
    pass

In [None]:
proj_bench = benchmark_projection_pushdown()
if proj_bench is None:
    print("TODO: Implement benchmark_projection_pushdown()")

**Question**: In what real-world cases would you leverage this pattern instead of reading the full DataFrame?

*Your answer here:*

---

### `iter_batches()` — The bridge between Arrow and out-of-core

The function below is **already implemented**: study it and run it. Its `iter_batches()` pattern is what we will use for Parquet streaming in the rest of the lab.

In [None]:
def process_with_iter_batches(batch_size: int = 500_000) -> dict:
    """Use iter_batches() to process Parquet in streaming fashion (pre-filled)."""
    pf = pq.ParquetFile(WARMUP_PARQUET)
    print(f"Row groups: {pf.metadata.num_row_groups}")
    print(f"Total rows: {pf.metadata.num_rows:,}")

    total_revenue = 0.0
    total_rows = 0
    num_batches = 0  # counted explicitly so the return value is valid even with zero batches
    for batch in pf.iter_batches(batch_size=batch_size, columns=['price', 'quantity']):
        print(f"Batch {num_batches}: type={type(batch).__name__}, rows={len(batch):,}")
        revenue = pc.multiply(batch.column('price'), batch.column('quantity'))
        total_revenue += pc.sum(revenue).as_py()
        total_rows += len(batch)
        num_batches += 1

    print(f"\nTotal rows processed: {total_rows:,}")
    print(f"Total revenue: {total_revenue:,.0f}")
    # Rough estimate: assumes 8 bytes per value in both columns (float64 price, int64 quantity).
    print(f"Memory per batch: ~{batch_size * 2 * 8 / 1e6:.1f} MB (2 columns x 8 bytes/value)")
    return {'total_revenue': total_revenue, 'total_rows': total_rows, 'num_batches': num_batches}

batch_results = process_with_iter_batches()

### Schema inspection (pre-filled)

Inspect a Parquet file's metadata **without reading any data**.

In [None]:
def inspect_parquet_schema() -> dict:
    """Inspect Parquet file metadata without reading data (pre-filled)."""
    pf = pq.ParquetFile(WARMUP_PARQUET)
    print("Schema:")
    print(pf.schema_arrow)
    print(f"\nFile metadata:")
    print(f"  Rows:       {pf.metadata.num_rows:,}")
    print(f"  Row groups: {pf.metadata.num_row_groups}")
    print(f"  Columns:    {pf.metadata.num_columns}")
    rg = pf.metadata.row_group(0)
    col = rg.column(0)
    print(f"\nStatistics for '{col.path_in_schema}':")
    if col.statistics:
        print(f"  Min: {col.statistics.min}")
        print(f"  Max: {col.statistics.max}")
    return {'num_rows': pf.metadata.num_rows, 'num_row_groups': pf.metadata.num_row_groups}

schema_info = inspect_parquet_schema()

### Exercise 0 — Summary Table

| Operation | Approx. Time | Approx. RAM | When to use |
|-----------|-------------|-------------|-------------|
| `pd.read_parquet()` | ~1.0s | ~150 MB | When you need a full DataFrame |
| `pq.read_table()` | ~0.5s | ~80 MB | When you operate in Arrow or convert later |
| `pq.read_table(columns=[...])` | ~0.2s | ~20 MB | When you only need a few columns |
| `iter_batches()` | Similar total, spread over batches | ~10 MB/batch | When the file doesn't fit in RAM |

---

## Exercise 1: Out-of-Core Processing with Chunking (25 min)

**Objective**: Process a dataset larger than RAM using chunking.

### TODO 4: `generate_large_dataset()`

Generate 20M rows and save as CSV + partitioned Parquet.

In [None]:
def generate_large_dataset(n: int = 20_000_000, seed: int = 42) -> None:
    """
    Generate a large dataset and save as CSV and partitioned Parquet.

    Args:
        n: Number of rows
        seed: Random seed
    """
    # TODO: Implement this function
    # Step 1: np.random.seed(seed)
    # Step 2: Create DataFrame with columns:
    #   date, product_id, category, price, quantity, customer_id
    # Step 3: df.to_csv(SALES_CSV, index=False)
    # Step 4: df.to_parquet(SALES_PARTITIONED, partition_cols=['category'], index=False)
    pass

In [None]:
print("Generating large dataset (20M rows)...")
print("(This may take 2-5 minutes)\n")
start = time.perf_counter()
generate_large_dataset()
print(f"\nCompleted in {time.perf_counter() - start:.1f} seconds")

### TODO 5: `chunked_statistics()`

Calculate average price using chunking — **without loading the full file in RAM**.
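
One pitfall is worth seeing on a small example: averaging the per-chunk averages gives the wrong answer when chunks differ in size, so the running sum and count must be kept instead. The sketch below uses an in-memory toy CSV (values invented for illustration) and also passes `usecols` so that only the needed column is parsed.

```python
import io

import pandas as pd

# Toy CSV held in memory; it stands in for a file too large to load at once.
toy_csv = "category,price\nFood,10.0\nFood,20.0\nHome,30.0\nHome,40.0\nFood,50.0\n"

total_sum = 0.0
total_count = 0
chunk_means = []
# chunksize=2 yields chunks of 2, 2, and 1 rows.
for chunk in pd.read_csv(io.StringIO(toy_csv), usecols=["price"], chunksize=2):
    total_sum += chunk["price"].sum()
    total_count += len(chunk)
    chunk_means.append(chunk["price"].mean())

avg_price = total_sum / total_count              # correct: 150 / 5 = 30.0
naive_avg = sum(chunk_means) / len(chunk_means)  # biased: (15 + 35 + 50) / 3

print(avg_price)
print(round(naive_avg, 2))
```

The last chunk holds a single row, yet the naive version gives it the same weight as the two-row chunks.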

In [None]:
def chunked_statistics(chunksize: int = 500_000) -> dict:
    """
    Calculate statistics using chunking over CSV.

    Args:
        chunksize: Number of rows per chunk

    Returns:
        Dictionary with total_sum, total_count, and avg_price.
    """
    # TODO: Implement this function
    # 1. Initialize total_sum = 0, total_count = 0
    # 2. for chunk in pd.read_csv(SALES_CSV, chunksize=chunksize):
    #      total_sum += chunk['price'].sum()
    #      total_count += len(chunk)
    # 3. avg_price = total_sum / total_count
    # 4. Return {'total_sum': ..., 'total_count': ..., 'avg_price': ...}
    pass

In [None]:
stats = chunked_statistics()
if stats is None:
    print("TODO: Implement chunked_statistics()")
else:
    print(f"Average price: {stats['avg_price']:.4f}")

### TODO 6: `chunked_filter_save()`

Filter only "Electronics" sales and save to Parquet — processing chunk by chunk.

In [None]:
def chunked_filter_save(chunksize: int = 500_000) -> int:
    """
    Filter and save only Electronics sales using chunking.

    Args:
        chunksize: Number of rows per chunk

    Returns:
        Number of Electronics rows saved.
    """
    # TODO: Implement this function
    # 1. results = []
    # 2. for chunk in pd.read_csv(SALES_CSV, chunksize=chunksize):
    #      filtered = chunk[chunk['category'] == 'Electronics']
    #      results.append(filtered)
    # 3. electronics = pd.concat(results)
    # 4. electronics.to_parquet(ELECTRONICS_PARQUET, index=False)
    # 5. Return len(electronics)
    pass

In [None]:
n_electronics = chunked_filter_save()
if n_electronics is None:
    print("TODO: Implement chunked_filter_save()")
else:
    print(f"Saved {n_electronics:,} Electronics rows")

### Memory Monitoring (pre-filled)

Monitor memory usage during chunked processing to check that it stays roughly constant.

In [None]:
def monitor_memory_chunking(chunksize: int = 500_000) -> list:
    """Monitor memory usage during chunked processing (pre-filled)."""
    process = psutil.Process(os.getpid())
    mem_usage = []
    for i, chunk in enumerate(pd.read_csv(SALES_CSV, chunksize=chunksize)):
        mem_mb = process.memory_info().rss / 1024**2
        mem_usage.append(mem_mb)
        chunk['price'].sum()
    print(f"Chunks processed: {len(mem_usage)}")
    print(f"Memory min: {min(mem_usage):.1f} MB, max: {max(mem_usage):.1f} MB")
    print(f"Variation: {max(mem_usage) - min(mem_usage):.1f} MB")
    return mem_usage

mem = monitor_memory_chunking()

plt.figure(figsize=(10, 5))
plt.plot(mem, 'b-o', markersize=3)
plt.xlabel('Chunk number')
plt.ylabel('Memory (MB)')
plt.title('Constant Memory with Chunking')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'memory_chunking.png', dpi=100)
plt.show()

---

## Exercise 2: Online Statistics — Welford's Algorithm (20 min)

**Objective**: Implement streaming statistics that work in a single pass with constant memory, even on unbounded data streams.

### Welford's Algorithm
```
For each new value x:
  count += 1
  delta = x - mean
  mean += delta / count
  delta2 = x - mean       (uses UPDATED mean!)
  M2 += delta * delta2

variance = M2 / count
std = sqrt(variance)
```
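
The per-value update has a chunk-level counterpart: two partial summaries `(count, mean, M2)` can be merged exactly with a pairwise formula, often attributed to Chan, Golub and LeVeque. The sketch below is not part of the lab's required solution; the names `chunk_summary` and `merge` are hypothetical. It splits a small list into two chunks, merges the summaries, and compares the result with the standard library.

```python
import math
import statistics


def chunk_summary(values):
    """Return (count, mean, M2) for one chunk using the per-value update."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # second factor uses the updated mean
    return count, mean, m2


def merge(a, b):
    """Combine two (count, mean, M2) summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    if n_a == 0:
        return b
    if n_b == 0:
        return a
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
left = chunk_summary(data[:3])
right = chunk_summary(data[3:])
count, mean, m2 = merge(left, right)
std = math.sqrt(m2 / count)  # population standard deviation

print(count, mean, std)
print(statistics.fmean(data), statistics.pstdev(data))
```

The same merge step is what would allow per-partition statistics computed by separate workers to be combined afterwards.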

### TODO 7: `OnlineStats` class

In [None]:
class OnlineStats:
    """Online (streaming) statistics using Welford's algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.M2 = 0.0
        self.min_val = float('inf')
        self.max_val = float('-inf')

    def update(self, x):
        """Update statistics with a new value using Welford's algorithm."""
        # TODO: Implement Welford's algorithm
        # 1. self.count += 1
        # 2. delta = x - self.mean
        # 3. self.mean += delta / self.count
        # 4. delta2 = x - self.mean  (uses UPDATED mean!)
        # 5. self.M2 += delta * delta2
        # 6. Update min_val and max_val
        pass

    def variance(self):
        """Return the population variance."""
        # TODO: return self.M2 / self.count (handle count < 2)
        pass

    def std(self):
        """Return the population standard deviation."""
        # TODO: return sqrt(self.variance())
        pass

### Validate OnlineStats against NumPy (pre-filled)

In [None]:
def test_online_stats(n_rows: int = 100_000) -> dict:
    """Test OnlineStats against NumPy (pre-filled)."""
    chunk = pd.read_csv(SALES_CSV, nrows=n_rows)
    stats = OnlineStats()
    for val in chunk['price'].values:
        stats.update(val)
    np_mean = chunk['price'].mean()
    np_std = chunk['price'].std(ddof=0)
    mean_match = np.isclose(stats.mean, np_mean)
    std_match = np.isclose(stats.std(), np_std)
    print(f"OnlineStats mean: {stats.mean:.4f}  |  NumPy mean: {np_mean:.4f}  |  Match: {mean_match}")
    print(f"OnlineStats std:  {stats.std():.4f}  |  NumPy std:  {np_std:.4f}  |  Match: {std_match}")
    assert mean_match, f"Mean mismatch: {stats.mean} vs {np_mean}"
    assert std_match, f"Std mismatch: {stats.std()} vs {np_std}"
    print("\n✅ All assertions passed!")
    return {'online_mean': round(stats.mean, 4), 'numpy_mean': round(np_mean, 4)}

validation = test_online_stats()

### Streaming stats on full dataset (pre-filled)

In [None]:
def streaming_stats_full(chunksize: int = 500_000) -> dict:
    """Compute statistics over full CSV using OnlineStats + chunking (pre-filled)."""
    stats = OnlineStats()
    for chunk in pd.read_csv(SALES_CSV, chunksize=chunksize):
        for value in chunk['price'].values:
            stats.update(value)
    print(f"Mean: {stats.mean:.4f}, Std: {stats.std():.4f}")
    print(f"Min: {stats.min_val:.2f}, Max: {stats.max_val:.2f}, Count: {stats.count:,}")
    return {'mean': round(stats.mean, 4), 'std': round(stats.std(), 4), 'count': stats.count}

print("Computing streaming statistics over full dataset...")
start = time.perf_counter()
full_stats = streaming_stats_full()
print(f"Completed in {time.perf_counter() - start:.1f} seconds")

---

## Exercise 3: Practical Parallelization (25 min)

**Objective**: Compare threading, multiprocessing, and sequential execution.

### Create Partitions (pre-filled)

In [None]:
def create_partitions(n_partitions: int = 16) -> list:
    """Create partition files from the large dataset (pre-filled)."""
    PARTITIONS_DIR.mkdir(parents=True, exist_ok=True)
    print("Reading dataset for partitioning...")
    # Note: this loads the whole CSV at once, so it needs enough RAM for all rows.
    df = pd.read_csv(SALES_CSV)
    files = []
    for i in range(n_partitions):
        start_idx = i * len(df) // n_partitions
        end_idx = (i + 1) * len(df) // n_partitions
        filepath = PARTITIONS_DIR / f"part_{i:03d}.parquet"
        df.iloc[start_idx:end_idx].to_parquet(filepath, index=False)
        files.append(str(filepath))
    print(f"Created {n_partitions} partitions in {PARTITIONS_DIR}")
    return files

partition_files = create_partitions()

### TODO 8: `benchmark_threading()`

Compare sequential vs `ThreadPoolExecutor` for **reading** 16 partition files.

Reading files is **I/O-bound** — threads help because the GIL is released during I/O.
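
The effect can be seen without any files. In the sketch below, `fake_read` is a hypothetical stand-in that sleeps instead of reading; sleeping releases the GIL in the same way a blocking read does, so the threads overlap their waiting time. The partition names and the 0.05 s delay are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_read(name):
    """Stand-in for a file read: time.sleep releases the GIL while waiting."""
    time.sleep(0.05)
    return f"{name}: done"


names = [f"part_{i:03d}" for i in range(8)]  # hypothetical partition names

start = time.perf_counter()
sequential = [fake_read(n) for n in names]
sequential_sec = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(fake_read, names))  # map preserves input order
threaded_sec = time.perf_counter() - start

print(f"sequential: {sequential_sec:.2f}s, threaded: {threaded_sec:.2f}s")
print(f"speedup: {sequential_sec / threaded_sec:.1f}x")
```

With real files the gain depends on the storage device and on how much of the read time is spent decoding rather than waiting.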

In [None]:
def benchmark_threading(n_workers: int = 8) -> dict:
    """
    Benchmark sequential vs threaded reading of partition files.

    Args:
        n_workers: Number of threads

    Returns:
        Dictionary with sequential_sec, threaded_sec, speedup.
    """
    # TODO: Implement this function
    # 1. files = sorted(glob.glob(str(PARTITIONS_DIR / '*.parquet')))
    # 2. Sequential: dfs = [pd.read_parquet(f) for f in files]
    # 3. Threaded: with ThreadPoolExecutor(max_workers=n_workers) as executor:
    #        dfs = list(executor.map(pd.read_parquet, files))
    # 4. Print times and speedup
    pass

In [None]:
thread_bench = benchmark_threading()
if thread_bench is None:
    print("TODO: Implement benchmark_threading()")

### TODO 9: `benchmark_multiprocessing()`

Compare sequential vs `ProcessPoolExecutor` for **heavy computation**.

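This work is **CPU-bound**, so processes help: each worker process runs its own interpreter with its own GIL. On Windows and macOS, where worker processes are spawned rather than forked, functions defined in notebook cells may not be importable by the workers; if you see a `BrokenProcessPool` error, move `heavy_process` into a `.py` module and import it.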

In [None]:
def heavy_process(filepath):
    """Process a partition: read, transform, aggregate."""
    df = pd.read_parquet(filepath)
    df['score'] = np.sqrt(df['price']) * np.log1p(df['quantity'])
    return df.groupby('category')['score'].agg(['mean', 'sum', 'count'])

In [None]:
def benchmark_multiprocessing(n_workers: int = 4) -> dict:
    """
    Benchmark sequential vs multiprocessing for heavy computation.

    Args:
        n_workers: Number of processes

    Returns:
        Dictionary with sequential_sec, multiprocessing_sec, speedup.
    """
    # TODO: Implement this function
    # 1. files = sorted(glob.glob(str(PARTITIONS_DIR / '*.parquet')))
    # 2. Sequential: [heavy_process(f) for f in files]
    # 3. Parallel: with ProcessPoolExecutor(max_workers=n_workers) as executor:
    #        list(executor.map(heavy_process, files))
    # 4. Print times and speedup
    pass

In [None]:
proc_bench = benchmark_multiprocessing()
if proc_bench is None:
    print("TODO: Implement benchmark_multiprocessing()")

### Worker Scaling Experiment (pre-filled)

In [None]:
def benchmark_workers_scaling(max_workers: int = 8) -> dict:
    """Vary number of workers and measure speedup (pre-filled)."""
    files = sorted(glob.glob(str(PARTITIONS_DIR / '*.parquet')))
    # perf_counter() is a monotonic, high-resolution clock, consistent with the other benchmarks.
    start = time.perf_counter()
    for f in files:
        heavy_process(f)
    seq_time = time.perf_counter() - start
    speedups = {}
    for n in [1, 2, 4, max_workers]:
        start = time.perf_counter()
        with ProcessPoolExecutor(max_workers=n) as ex:
            list(ex.map(heavy_process, files))
        speedups[n] = round(seq_time / (time.perf_counter() - start), 1)
    for w, s in speedups.items():
        print(f"  {w} workers: {s}x speedup")
    return speedups

print("Benchmarking worker scaling (this may take a few minutes)...\n")
scaling = benchmark_workers_scaling()

plt.figure(figsize=(8, 5))
plt.plot(list(scaling.keys()), list(scaling.values()), 'bo-', linewidth=2, markersize=8)
plt.plot([1, max(scaling.keys())], [1, max(scaling.keys())], 'r--', label='Ideal linear', alpha=0.7)
plt.xlabel('Workers')
plt.ylabel('Speedup')
plt.legend()
plt.title('Speedup vs Number of Workers')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'speedup_workers.png', dpi=100)
plt.show()

---

## Exercise 4: Complete Pipeline — Out-of-Core + Parallel (20 min)

**Objective**: Combine chunking and parallelization in a real pipeline.

### `process_partition()` (pre-filled)

This function processes a single partition — study it before implementing the pipeline.

In [None]:
def process_partition(filepath):
    """Process a partition: read, transform, aggregate (pre-filled)."""
    df = pd.read_parquet(filepath)
    df['revenue'] = df['price'] * df['quantity']
    df['price_bin'] = pd.cut(df['price'], bins=[0, 50, 200, 500, 1000],
                            labels=['low', 'mid', 'high', 'premium'])
    return df.groupby(['category', 'price_bin'], observed=True).agg({
        'revenue': ['sum', 'mean', 'count'],
        'quantity': 'sum'
    })

### TODO 10: `run_parallel_pipeline()`

Run the complete pipeline: process all partitions sequentially vs in parallel, then combine results.
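
One detail to check when combining per-partition aggregates: sums and counts add up across partitions, but means do not. The sketch below uses two toy partitions (values invented for illustration, `partial_agg` is a hypothetical helper) and recomputes the mean from the combined sum and count.

```python
import pandas as pd

# Two toy "partitions" of the same dataset.
part_a = pd.DataFrame({"category": ["Food", "Food", "Home"], "revenue": [10.0, 20.0, 5.0]})
part_b = pd.DataFrame({"category": ["Food", "Home"], "revenue": [60.0, 15.0]})


def partial_agg(df):
    """Per-partition aggregate holding only additive quantities."""
    return df.groupby("category")["revenue"].agg(["sum", "count"])


# Stack the partial results, then add them up per category.
combined = pd.concat([partial_agg(part_a), partial_agg(part_b)]).groupby(level=0).sum()
combined["mean"] = combined["sum"] / combined["count"]  # recompute; never sum means

# Reference: the mean computed over all rows at once.
expected = pd.concat([part_a, part_b]).groupby("category")["revenue"].mean()

print(combined)
print(expected)
```

Here the combined mean for Food is 90 / 3 = 30, whereas adding the two per-partition means (15 and 60) would give 75.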

In [None]:
def run_parallel_pipeline(n_workers: int = 4) -> dict:
    """
    Run the complete parallel pipeline and compare with sequential.

    Args:
        n_workers: Number of worker processes

    Returns:
        Dictionary with sequential_sec, parallel_sec, speedup.
    """
    # TODO: Implement this function
    # 1. files = sorted(glob.glob(str(PARTITIONS_DIR / '*.parquet')))
    # 2. Sequential: results = [process_partition(f) for f in files]
    #    final_seq = pd.concat(results).groupby(level=[0, 1]).sum()
    #    Sums and counts add up across partitions, but means do not, so recompute:
    #    final_seq[('revenue', 'mean')] = final_seq[('revenue', 'sum')] / final_seq[('revenue', 'count')]
    # 3. Parallel: with ProcessPoolExecutor(max_workers=n_workers) as executor:
    #        results = list(executor.map(process_partition, files))
    #    final_par = pd.concat(results).groupby(level=[0, 1]).sum()  (recompute the mean the same way)
    # 4. Print times, speedup, and final results
    pass

In [None]:
print("Running complete pipeline (sequential vs parallel)...\n")
pipeline_results = run_parallel_pipeline()
if pipeline_results is None:
    print("TODO: Implement run_parallel_pipeline()")

---

## 5. Reflection

**Your task:** Write a short reflection (3-5 sentences) answering:

1. What was the difference in speed between `pq.read_table()` and `pd.read_parquet()`? Why?
2. How did chunking affect memory usage during processing?
3. When would you use threading vs multiprocessing in your own data pipelines?

In [None]:
# TODO: Write your reflection here
reflection = """
Replace this text with your reflection.
Think about what you learned about out-of-core processing and parallelization.
What will you do differently in your future data pipelines?
""".strip()

print("Your reflection:")
print(reflection)

---

## 6. Save Results

In [None]:
results = {
    "lab": "05_outofcore_parallel",
    "timestamp": pd.Timestamp.now().isoformat(),
    "exercise_0": {
        "read_benchmark": read_bench if 'read_bench' in dir() and read_bench else None,
        "projection": proj_bench if 'proj_bench' in dir() and proj_bench else None,
    },
    "exercise_1": {
        "chunked_stats": stats if 'stats' in dir() and stats else None,
        "electronics_rows": n_electronics if 'n_electronics' in dir() and n_electronics else None,
    },
    "exercise_2": {
        "validation": validation if 'validation' in dir() and validation else None,
        "full_stats": full_stats if 'full_stats' in dir() and full_stats else None,
    },
    "exercise_3": {
        "threading": thread_bench if 'thread_bench' in dir() and thread_bench else None,
        "multiprocessing": proc_bench if 'proc_bench' in dir() and proc_bench else None,
        "scaling": scaling if 'scaling' in dir() and scaling else None,
    },
    "exercise_4": pipeline_results if 'pipeline_results' in dir() and pipeline_results else None,
    "reflection": reflection,
}

with open(METRICS_PATH, 'w') as f:
    json.dump(results, f, indent=2, default=str)

print(f"\n✅ Results saved to {METRICS_PATH}")

---

## Lab Complete!

### What You Learned

1. **PyArrow for Parquet I/O**: `pq.read_table()` avoids the Arrow-to-Pandas conversion overhead when you do not need a DataFrame
2. **Projection pushdown**: read only needed columns, save time and memory
3. **Chunking bounds memory**: peak RAM depends on the chunk size, not the file size
4. **Welford's algorithm**: compute mean and standard deviation in a single pass
5. **Threading for I/O, multiprocessing for CPU**: choose the right tool
6. **Amdahl's Law**: speedup is limited by the sequential portion
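
To make the last point concrete, Amdahl's Law bounds the speedup at `1 / ((1 - p) + p / n)` for a parallel fraction `p` and `n` workers. The sketch below evaluates that bound; the value `p = 0.9` is an arbitrary assumption for illustration, not a measurement from this lab.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the runtime scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Assumed parallel fraction of 90%; the bound approaches 1 / 0.1 = 10x.
for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Comparing these bounds with the measured curve from the worker scaling experiment gives a rough sense of how much of the pipeline runs serially.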

### Files to Submit

1. `notebooks/lab05_outofcore_parallel.ipynb` (this notebook)
2. `results/lab05_metrics.json`

---