# Lab 04: Vectorization and Out-of-Core Computing

**Course:** Big Data

---

## 👤 Student Information

**Name:** `Your Name Here`

**Date:** `DD/MM/YYYY`

---

**Goal:** Master vectorization techniques and out-of-core computing to process data efficiently at scale.

## Learning Objectives

By the end of this lab, you will be able to:

1. **Vectorize Operations**: Replace slow Python loops with NumPy/Pandas operations (often a 100-200x speedup on simple numeric work)
2. **Use Broadcasting**: Apply operations across arrays of different shapes
3. **Process Large Files**: Handle datasets larger than RAM using chunking
4. **Implement Online Algorithms**: Calculate statistics with O(1) memory
5. **Use Dask**: Scale Pandas operations to larger-than-memory datasets

## Instructions

1. **Fill in your information above** before starting the lab
2. Read each cell carefully before running it
3. Implement the **TODO functions** when you see them
4. Run cells **from top to bottom** (Shift+Enter)
5. Check that output makes sense after each cell

---

## 📚 Libraries Used in This Lab

### Core Libraries

- **`numpy`** - Vectorized numerical operations
- **`pandas`** - DataFrame operations and I/O
- **`psutil`** - Memory monitoring
- **`time`** - Performance measurement
- **`dask`** (optional) - Parallel and out-of-core computing

### Why Vectorization Matters

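In CPython, each loop iteration costs on the order of 200 CPU instructions (a rough estimate), because the interpreter must:
- Interpret bytecode
- Look up variables in a hash table
- Check types (dynamic typing)
- Find methods
- Create stack frames

**NumPy**: runs the same loop inside compiled C code, where each element costs only a few CPU instructions

**Result**: speedups of roughly 100-200x are typical for simple numeric loops; the exact figure depends on the operation and the hardware.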
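A rough way to see the gap on your own machine is to time the same reduction both ways. The sketch below is illustrative only: the array size and the helper name `loop_sum` are arbitrary choices made for this example, and the measured ratio will vary with hardware and NumPy version.

```python
import time
import numpy as np

def loop_sum(values):
    """Sum of squares with an explicit Python loop (interpreter overhead per element)."""
    total = 0.0
    for v in values:
        total += v * v
    return total

data = np.arange(200_000, dtype=np.float64)

start = time.perf_counter()
loop_result = loop_sum(data)
loop_sec = time.perf_counter() - start

start = time.perf_counter()
vec_result = float(np.dot(data, data))  # one call into compiled code
vec_sec = time.perf_counter() - start

print(f"loop: {loop_sec:.4f}s  vectorized: {vec_sec:.6f}s")
```

Both versions compute the same number; only the place where the loop runs differs.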

---

## 1. Imports and Setup

In [None]:
import json
import math
import time
import os
from pathlib import Path
from collections import defaultdict

import pandas as pd
import numpy as np
import psutil

# Optional: Dask
try:
    import dask
    import dask.dataframe as dd
    DASK_AVAILABLE = True
    print(f"Dask version: {dask.__version__}")
except ImportError:
    DASK_AVAILABLE = False
    print("Dask not installed. Bonus exercise will be skipped.")

print("\n✓ Core imports successful!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 2. Define Paths

In [None]:
# Base directories
DATA_RAW = Path("../data/raw")
DATA_PROCESSED = Path("../data/processed")
RESULTS_DIR = Path("../results")

# File paths for this lab
TEST_CSV = DATA_RAW / "benchmark_10m.csv"
LARGE_CSV = DATA_RAW / "benchmark_large.csv"
FILTERED_OUTPUT = DATA_PROCESSED / "filtered_output.parquet"
METRICS_PATH = RESULTS_DIR / "lab04_metrics.json"

# Ensure directories exist
DATA_RAW.mkdir(parents=True, exist_ok=True)
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

print("Paths defined:")
print(f"  Test CSV: {TEST_CSV}")
print(f"  Large CSV: {LARGE_CSV}")
print(f"  Metrics: {METRICS_PATH}")

---

## 3. Dataset Generation

We'll create test datasets for benchmarking:
- **Medium dataset** (10M rows) for vectorization benchmarks
- **Large dataset** (50M+ rows) for out-of-core exercises

### TODO 1: `generate_test_data()`

Generate a synthetic dataset for benchmarking.

**💡 Hints:**
- Use `np.random.seed(seed)` for reproducibility
- Use `np.random.randn()` for float columns (normal distribution)
- Use `np.random.randint()` for integer columns
- Use `np.random.choice()` for category column
- Use `np.random.uniform()` for value column
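The hints above can be tried on a toy frame first. This is a minimal sketch, not the lab solution: the function name `tiny_frame` and its column names are made up for illustration. It shows that re-seeding before generation reproduces exactly the same rows.

```python
import numpy as np
import pandas as pd

def tiny_frame(seed: int, n_rows: int = 5) -> pd.DataFrame:
    """Build a small random frame; the same seed yields the same rows."""
    np.random.seed(seed)
    return pd.DataFrame({
        "x": np.random.randn(n_rows),                     # standard normal floats
        "k": np.random.randint(0, 10, n_rows),            # integers in [0, 10)
        "label": np.random.choice(["lo", "hi"], n_rows),  # categorical strings
        "u": np.random.uniform(0, 1, n_rows),             # uniform floats in [0, 1)
    })

first = tiny_frame(seed=7)
second = tiny_frame(seed=7)
print(first.equals(second))  # True
```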

In [None]:
def generate_test_data(path: Path, n_rows: int = 10_000_000, seed: int = 42) -> dict:
    """
    Generate a test dataset for benchmarking.
    
    Args:
        path: Where to save the CSV
        n_rows: Number of rows
        seed: Random seed
    
    Returns:
        Dictionary with metadata: {"rows": int, "cols": int, "size_mb": float}
    """
    # TODO: Implement this function
    # Step 1: Set random seed
    # Step 2: Generate columns:
    #   - 'a': random normal values (randn)
    #   - 'b': random normal values (randn)
    #   - 'c': random integers in [0, 100) (np.random.randint excludes the upper bound)
    #   - 'age': random integers in [0, 100)
    #   - 'category': random choice from ['A', 'B', 'C', 'D']
    #   - 'value': random uniform 0-1000
    # Step 3: Create DataFrame
    # Step 4: Save to CSV
    # Step 5: Return metadata
    pass

In [None]:
# Generate the test dataset
if not TEST_CSV.exists():
    print("Generating 10 million row test dataset...")
    print("(This may take 1-2 minutes)\n")
    
    start = time.perf_counter()
    metadata = generate_test_data(TEST_CSV, n_rows=10_000_000)
    elapsed = time.perf_counter() - start
    
    print(f"Generated in {elapsed:.1f} seconds")
    print(f"Rows: {metadata['rows']:,}")
    print(f"Size: {metadata['size_mb']:.1f} MB")
else:
    size_mb = TEST_CSV.stat().st_size / 1e6
    print(f"Dataset already exists: {size_mb:.1f} MB")

---

## Exercise 1: Loop to Vectorized Conversion 🎯

In this exercise, you'll convert slow loop-based code to fast vectorized operations.

### The Problem with Python Loops

```python
# Each iteration:
# 1. Interpret bytecode (Python VM)        ~50 instructions
# 2. Look up variable in hash table        ~20 instructions
# 3. Check type (dynamic typing)           ~30 instructions
# 4. Find method (__mul__, __add__)        ~40 instructions
# 5. Create stack frame                    ~30 instructions
# 6. Perform actual operation              ~2 instructions
# 7. Check result type                     ~20 instructions
# 8. Assign result                         ~10 instructions
# Total: ~200 CPU instructions per operation!
```

**NumPy**: Direct CPU operations in compiled C code = 1-2 instructions

In [None]:
# Load a sample of the data for testing
print("Loading sample data for vectorization exercises...")
df_sample = pd.read_csv(TEST_CSV, nrows=100_000)
print(f"Sample size: {len(df_sample):,} rows")
print(f"\nColumns: {list(df_sample.columns)}")

### Part 1A: Distance Calculation

Convert a loop-based Euclidean distance calculation to NumPy broadcasting.
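Broadcasting also works when the two operands have different shapes. As a small illustration (the arrays `points` and `center` are invented for this example), subtracting a single `(2,)` point from an `(N, 2)` array applies the subtraction to every row:

```python
import numpy as np

points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])      # shape (3, 2)
center = np.array([3.0, 4.0])        # shape (2,)

# Broadcasting stretches `center` across all three rows; no loop needed.
offsets = points - center            # shape (3, 2)
dist_to_center = np.sqrt((offsets ** 2).sum(axis=1))
print(dist_to_center)  # [5. 0. 5.]
```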

In [None]:
# Original slow implementation
def calculate_distances_slow(points_a, points_b):
    """Calculate Euclidean distances using a loop."""
    distances = []
    for i in range(len(points_a)):
        dist = math.sqrt(
            (points_a[i][0] - points_b[i][0])**2 +
            (points_a[i][1] - points_b[i][1])**2
        )
        distances.append(dist)
    return distances

In [None]:
def calculate_distances_fast(points_a: np.ndarray, points_b: np.ndarray) -> np.ndarray:
    """
    Calculate Euclidean distances using vectorization.
    
    Args:
        points_a: Array of shape (N, 2) with x, y coordinates
        points_b: Array of shape (N, 2) with x, y coordinates
    
    Returns:
        Array of distances
    
    Hints:
        - Use broadcasting: diff = points_a - points_b
        - Square: diff**2
        - Sum along axis 1: np.sum(..., axis=1)
        - Square root: np.sqrt(...)
        - Or use np.linalg.norm(points_a - points_b, axis=1)
    """
    # TODO: Implement this function
    pass

In [None]:
# Test distance calculation
n_points = 100_000
np.random.seed(42)
points_a = np.random.randn(n_points, 2)
points_b = np.random.randn(n_points, 2)

# Benchmark slow version
start = time.perf_counter()
result_slow = calculate_distances_slow(points_a, points_b)
time_slow = time.perf_counter() - start

# Benchmark fast version
start = time.perf_counter()
result_fast = calculate_distances_fast(points_a, points_b)
time_fast = time.perf_counter() - start

# Verify results match
if result_fast is not None:
    match = np.allclose(result_slow, result_fast)
    speedup = time_slow / time_fast
    print(f"Distance calculation ({n_points:,} points):")
    print(f"  Slow: {time_slow:.4f} sec")
    print(f"  Fast: {time_fast:.6f} sec")
    print(f"  Speedup: {speedup:.1f}x")
    print(f"  Results match: {match}")
else:
    print("TODO: Implement calculate_distances_fast()")
    speedup = None

### Part 1B: Age Classification

Replace conditional loop with `np.select()`.
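To see how `np.select()` evaluates its conditions, here is a short sketch on unrelated toy data (temperature labels, chosen only for illustration). Conditions are tested in order and the first match wins, so later conditions do not need to repeat earlier bounds.

```python
import numpy as np

temps_c = np.array([-5, 12, 21, 35])

conditions = [temps_c < 0, temps_c < 20]   # checked in order; first match wins
choices = ["freezing", "mild"]

# `default` covers every element no condition matched. Keeping it a string
# avoids mixing str and int dtypes, which newer NumPy releases reject.
labels = np.select(conditions, choices, default="warm")
print(labels)  # ['freezing' 'mild' 'warm' 'warm']
```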

In [None]:
# Original slow implementation
def classify_ages_slow(ages):
    """Classify ages using a loop."""
    categories = []
    for age in ages:
        if age < 18:
            categories.append('child')
        elif age < 65:
            categories.append('adult')
        else:
            categories.append('senior')
    return categories

In [None]:
def classify_ages_fast(ages: np.ndarray) -> np.ndarray:
    """
    Classify ages using vectorization.
    
    Args:
        ages: Array of ages
    
    Returns:
        Array of categories ('child', 'adult', 'senior')
    
    Hints:
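        - Use np.select(conditions, choices, default=...)
        - conditions = [ages < 18, ages < 65, ages >= 65]  (first match wins)
        - choices = ['child', 'adult', 'senior']
        - Pass a string default (e.g. default='unknown'); recent NumPy versions
          reject the implicit integer default 0 when the choices are strings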
    """
    # TODO: Implement this function
    pass

In [None]:
# Test age classification
ages = df_sample['age'].values

# Benchmark
start = time.perf_counter()
result_slow = classify_ages_slow(ages)
time_slow = time.perf_counter() - start

start = time.perf_counter()
result_fast = classify_ages_fast(ages)
time_fast = time.perf_counter() - start

if result_fast is not None:
    match = all(s == f for s, f in zip(result_slow, result_fast))
    speedup_ages = time_slow / time_fast
    print(f"Age classification ({len(ages):,} values):")
    print(f"  Slow: {time_slow:.4f} sec")
    print(f"  Fast: {time_fast:.6f} sec")
    print(f"  Speedup: {speedup_ages:.1f}x")
    print(f"  Results match: {match}")
else:
    print("TODO: Implement classify_ages_fast()")
    speedup_ages = None

### Part 1C: Column Normalization

Replace nested loops with broadcasting.
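One detail matters when comparing against a hand-written loop: pandas computes the sample standard deviation (`ddof=1`) by default, while a loop that divides by `len(values)` computes the population version (`ddof=0`). The sketch below uses a made-up four-row frame to show column-wise broadcasting with an explicit `ddof`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"h": [1.0, 2.0, 3.0, 4.0],
                      "w": [10.0, 10.0, 30.0, 30.0]})

col_mean = frame.mean()          # one mean per column
col_std = frame.std(ddof=0)      # population std; pandas defaults to ddof=1

# The per-column Series align with the frame's columns and broadcast down the rows.
z = (frame - col_mean) / col_std
print(z["w"].tolist())  # [-1.0, -1.0, 1.0, 1.0]
```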

In [None]:
# Original slow implementation
def normalize_columns_slow(df, columns):
    """Normalize columns using nested loops."""
    df = df.copy()
    for col in columns:
        values = df[col].values
        mean = sum(values) / len(values)
        variance = sum((x - mean)**2 for x in values) / len(values)
        std = math.sqrt(variance)
        for i in range(len(df)):
            df.loc[i, col] = (df.loc[i, col] - mean) / std
    return df

In [None]:
def normalize_columns_fast(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """
    Normalize columns using vectorization.
    
    Args:
        df: DataFrame to normalize
        columns: List of columns to normalize
    
    Returns:
        Normalized DataFrame
    
    Hints:
        - Use df[columns].mean() to get means for all columns at once
        - Use df[columns].std(ddof=0) for standard deviations (the slow version
          divides by N, so the pandas default ddof=1 would not match it)
        - Broadcasting: (df[columns] - mean) / std
    """
    # TODO: Implement this function
    pass

In [None]:
# Test normalization (use smaller sample due to slow version)
df_norm_test = df_sample.head(10_000).copy()
columns_to_normalize = ['a', 'b', 'value']

# Benchmark
start = time.perf_counter()
result_slow = normalize_columns_slow(df_norm_test, columns_to_normalize)
time_slow = time.perf_counter() - start

start = time.perf_counter()
result_fast = normalize_columns_fast(df_norm_test, columns_to_normalize)
time_fast = time.perf_counter() - start

if result_fast is not None:
    match = np.allclose(result_slow[columns_to_normalize].values, 
                        result_fast[columns_to_normalize].values, rtol=1e-5)
    speedup_norm = time_slow / time_fast
    print(f"Normalization ({len(df_norm_test):,} rows, {len(columns_to_normalize)} columns):")
    print(f"  Slow: {time_slow:.4f} sec")
    print(f"  Fast: {time_fast:.6f} sec")
    print(f"  Speedup: {speedup_norm:.1f}x")
    print(f"  Results match: {match}")
else:
    print("TODO: Implement normalize_columns_fast()")
    speedup_norm = None

### Part 1D: Score Calculation with Clipping

Replace loop with vectorized operations and `np.clip()`.
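As a quick illustration of `np.clip()` on its own (the input values are arbitrary), every element outside the interval is replaced by the nearest bound and everything else passes through unchanged:

```python
import numpy as np

raw = np.array([-25.0, -3.5, 0.0, 7.2, 18.0])
bounded = np.clip(raw, -10, 10)   # values outside [-10, 10] snap to the nearest bound
print(bounded)
```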

In [None]:
# Original slow implementation
def calculate_scores_slow(df):
    """Calculate scores using a loop."""
    scores = []
    for i in range(len(df)):
        row = df.iloc[i]
        score = (row['a'] * 2 + row['b']) / (row['c'] + 1)
        if score > 10:
            score = 10
        elif score < -10:
            score = -10
        scores.append(score)
    return scores

In [None]:
def calculate_scores_fast(df: pd.DataFrame) -> np.ndarray:
    """
    Calculate scores using vectorization.
    
    Args:
        df: DataFrame with columns 'a', 'b', 'c'
    
    Returns:
        Array of scores (clipped to [-10, 10])
    
    Hints:
        - Vectorized formula: (df['a'] * 2 + df['b']) / (df['c'] + 1)
        - Use np.clip(scores, -10, 10) to clip values
    """
    # TODO: Implement this function
    pass

In [None]:
# Test score calculation
df_score_test = df_sample.head(50_000).copy()

# Benchmark
start = time.perf_counter()
result_slow = calculate_scores_slow(df_score_test)
time_slow = time.perf_counter() - start

start = time.perf_counter()
result_fast = calculate_scores_fast(df_score_test)
time_fast = time.perf_counter() - start

if result_fast is not None:
    match = np.allclose(result_slow, result_fast)
    speedup_scores = time_slow / time_fast
    print(f"Score calculation ({len(df_score_test):,} rows):")
    print(f"  Slow: {time_slow:.4f} sec")
    print(f"  Fast: {time_fast:.6f} sec")
    print(f"  Speedup: {speedup_scores:.1f}x")
    print(f"  Results match: {match}")
else:
    print("TODO: Implement calculate_scores_fast()")
    speedup_scores = None

### 💡 Key Insight: The .apply() Trap

`.apply()` looks clean but is NOT vectorized:

```python
# This is SLOW (hidden Python loop):
df['result'] = df['x'].apply(lambda x: x * 2)

# This is FAST (vectorized):
df['result'] = df['x'] * 2
```

**Benchmark on 10M elements:**
| Method | Time | Speedup |
|--------|------|---------|
| Python loop | 12.5s | 1x |
| `.apply()` | 5.8s | 2.2x |
| Vectorized | 0.062s | **200x** |
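Conditional logic is a common reason people reach for `.apply()`. A minimal sketch of the usual replacement, assuming a toy column `x` invented for this example, is `np.where()`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [-2.0, -0.5, 0.0, 1.5, 3.0]})

# Row by row: the lambda runs once per element in the Python interpreter.
via_apply = df["x"].apply(lambda v: v * 2 if v > 0 else 0.0)

# Vectorized equivalent: the comparison and the multiply each run over the whole column.
via_where = np.where(df["x"] > 0, df["x"] * 2, 0.0)

print(np.array_equal(via_apply.to_numpy(), via_where))  # True
```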

---

## Exercise 2: Benchmarking 📊

Quantify the performance impact of vectorization.
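The standard library's `timeit` module is an alternative to hand-rolled `perf_counter()` loops. The sketch below is one possible setup, not the lab's required method; the function names `py_total` and `np_total` and the run counts are arbitrary choices for illustration.

```python
import timeit
import numpy as np

values = np.random.default_rng(0).normal(size=50_000)

def py_total():
    return sum(float(v) for v in values)   # interpreted loop

def np_total():
    return values.sum()                    # compiled reduction

# repeat() returns one total time per repetition; taking the minimum
# filters out runs that were disturbed by other processes.
py_best = min(timeit.repeat(py_total, number=5, repeat=3)) / 5
np_best = min(timeit.repeat(np_total, number=5, repeat=3)) / 5
print(f"python: {py_best:.5f}s  numpy: {np_best:.6f}s  ratio: {py_best / np_best:.0f}x")
```

Whether you report the minimum or the median, running each function several times gives a more trustworthy number than a single measurement.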

In [None]:
def benchmark_operation(slow_fn, fast_fn, *args, n_runs: int = 3) -> dict:
    """
    Benchmark slow vs fast function.
    
    Args:
        slow_fn: Slow function
        fast_fn: Fast function  
        *args: Arguments to pass to functions
        n_runs: Number of runs for timing
    
    Returns:
        Dictionary with timing results
    """
    # Time slow function
    slow_times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        slow_result = slow_fn(*args)
        slow_times.append(time.perf_counter() - start)
    
    # Time fast function
    fast_times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fast_result = fast_fn(*args)
        fast_times.append(time.perf_counter() - start)
    
    slow_median = np.median(slow_times)
    fast_median = np.median(fast_times)
    
    return {
        'slow_sec': round(slow_median, 4),
        'fast_sec': round(fast_median, 6),
        'speedup': round(slow_median / fast_median, 1) if fast_median > 0 else float('inf')
    }

In [None]:
# Run comprehensive benchmarks
print("Running comprehensive benchmarks...\n")

benchmark_results = {}

# 1. Distance calculation
if 'speedup' in dir() and speedup is not None:
    n = 50_000
    pts_a = np.random.randn(n, 2)
    pts_b = np.random.randn(n, 2)
    benchmark_results['distance'] = benchmark_operation(
        calculate_distances_slow, calculate_distances_fast, pts_a, pts_b
    )
    print(f"Distance: {benchmark_results['distance']['speedup']}x speedup")

# 2. Age classification
if 'speedup_ages' in dir() and speedup_ages is not None:
    ages_test = np.random.randint(0, 100, 100_000)
    benchmark_results['age_classification'] = benchmark_operation(
        classify_ages_slow, classify_ages_fast, ages_test
    )
    print(f"Age classification: {benchmark_results['age_classification']['speedup']}x speedup")

# 3. Score calculation
if 'speedup_scores' in dir() and speedup_scores is not None:
    df_bench = df_sample.head(20_000).copy()
    benchmark_results['scores'] = benchmark_operation(
        calculate_scores_slow, calculate_scores_fast, df_bench
    )
    print(f"Score calculation: {benchmark_results['scores']['speedup']}x speedup")

print("\n✓ Benchmarks complete!")

---

## Exercise 3: Out-of-Core Processing 💾

Process datasets larger than RAM using chunking.

### The Problem

```python
# This can raise MemoryError (or swap heavily) when the file does not fit in RAM:
df = pd.read_csv('huge_file.csv')  # Tries to load all into RAM
result = df['value'].mean()
```

### The Solution: Chunking

```python
# Process in chunks:
for chunk in pd.read_csv('huge_file.csv', chunksize=500_000):
    # Process each chunk
    partial_result = process(chunk)
    # Combine results
```
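The chunking pattern can be tried without a large file. This sketch reads a ten-row CSV from an in-memory string (a stand-in for a real file, used only for illustration) and keeps nothing but two counters between chunks:

```python
import io
import pandas as pd

csv_text = "value\n" + "\n".join(str(v) for v in range(10))  # values 0..9

rows_seen = 0
rows_over_5 = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each chunk is an ordinary DataFrame holding at most 4 rows.
    rows_seen += len(chunk)
    rows_over_5 += int((chunk["value"] > 5).sum())

print(rows_seen, rows_over_5)  # 10 4
```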

In [None]:
# Helper: Get current memory usage
def get_memory_mb() -> float:
    """Get current process memory in MB."""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

print(f"Current memory usage: {get_memory_mb():.1f} MB")

### Part 3A: Chunked Mean Calculation

In [None]:
def chunked_mean(path: Path, column: str, chunksize: int = 500_000) -> dict:
    """
    Calculate mean using chunking.
    
    Args:
        path: Path to CSV file
        column: Column to calculate mean for
        chunksize: Rows per chunk
    
    Returns:
        Dictionary with keys 'mean', 'count', and 'peak_memory_mb'
    
    Hints:
        - Keep running total_sum and total_count
        - For each chunk: total_sum += chunk[column].sum()
        - For each chunk: total_count += len(chunk)
        - Track peak memory with get_memory_mb()
        - Final mean = total_sum / total_count
    """
    # TODO: Implement this function
    pass

In [None]:
# Test chunked mean vs full load
print("Comparing chunked vs full load...\n")

# Full load
mem_before = get_memory_mb()
start = time.perf_counter()
df_full = pd.read_csv(TEST_CSV)
full_mean = df_full['value'].mean()
full_time = time.perf_counter() - start
full_memory = get_memory_mb() - mem_before
del df_full  # Free memory

print(f"Full load:")
print(f"  Mean: {full_mean:.6f}")
print(f"  Time: {full_time:.2f} sec")
print(f"  Memory: {full_memory:.1f} MB")

# Chunked
chunked_result = chunked_mean(TEST_CSV, 'value')
if chunked_result:
    print(f"\nChunked:")
    print(f"  Mean: {chunked_result['mean']:.6f}")
    print(f"  Count: {chunked_result['count']:,}")
    print(f"  Peak memory: {chunked_result['peak_memory_mb']:.1f} MB")
    print(f"\nMeans match: {math.isclose(full_mean, chunked_result['mean'], rel_tol=1e-9)}")
else:
    print("\nTODO: Implement chunked_mean()")

### Part 3B: Chunked Filter and Save

In [None]:
def chunked_filter(input_path: Path, output_path: Path,
                   column: str, value: str, chunksize: int = 500_000) -> dict:
    """
    Filter large file and save subset using chunking.
    
    Args:
        input_path: Input CSV path
        output_path: Output Parquet path
        column: Column to filter on
        value: Value to filter for
        chunksize: Rows per chunk
    
    Returns:
        Dictionary with keys 'input_rows', 'output_rows', and 'ratio'
    
    Hints:
        - Filter each chunk: filtered = chunk[chunk[column] == value]
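        - Collect the filtered chunks in a list, pd.concat them once at the end,
          then call to_parquet(output_path); this assumes the filtered subset fits in RAM
        - to_parquet() overwrites the file, so it cannot append chunk by chunk;
          for true streaming, pyarrow.parquet.ParquetWriter is one option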
        - Track input and output row counts
    """
    # TODO: Implement this function
    pass

In [None]:
# Test chunked filter
print("Filtering category='A' from dataset...\n")

filter_result = chunked_filter(TEST_CSV, FILTERED_OUTPUT, 'category', 'A')
if filter_result:
    print(f"Input rows: {filter_result['input_rows']:,}")
    print(f"Output rows: {filter_result['output_rows']:,}")
    print(f"Ratio: {filter_result['ratio']:.2%}")
    
    # Verify output
    if FILTERED_OUTPUT.exists():
        df_check = pd.read_parquet(FILTERED_OUTPUT)
        print(f"\nVerification: {len(df_check):,} rows in output file")
        print(f"All category='A': {(df_check['category'] == 'A').all()}")
else:
    print("TODO: Implement chunked_filter()")

### Part 3C: Chunked GroupBy Aggregation

In [None]:
def chunked_groupby_sum(path: Path, group_col: str,
                        agg_col: str, chunksize: int = 500_000) -> pd.DataFrame:
    """
    Perform groupby sum using chunking.
    
    Args:
        path: Path to CSV file
        group_col: Column to group by
        agg_col: Column to sum
        chunksize: Rows per chunk
    
    Returns:
        DataFrame with aggregated results (sum, count, mean per group)
    
    Hints:
        - Use defaultdict(float) for group_sums
        - Use defaultdict(int) for group_counts
        - For each chunk: aggregate with chunk.groupby(group_col)[agg_col].agg(['sum', 'count'])
        - Accumulate into global totals
        - Calculate mean = sum / count at the end
    """
    # TODO: Implement this function
    pass

In [None]:
# Test chunked groupby
print("Calculating sum by category...\n")

groupby_result = chunked_groupby_sum(TEST_CSV, 'category', 'value')
if groupby_result is not None:
    print("Chunked groupby result:")
    print(groupby_result)
    
    # Verify against full load
    df_verify = pd.read_csv(TEST_CSV)
    full_result = df_verify.groupby('category')['value'].agg(['sum', 'count', 'mean'])
    print("\nFull load result:")
    print(full_result)
    del df_verify
else:
    print("TODO: Implement chunked_groupby_sum()")

### 💡 Key Insight: Chunking Compatibility

| Operation | Compatible | Strategy |
|-----------|------------|----------|
| Sum | ✅ Yes | Accumulate partial sums |
| Mean | ✅ Yes | Sum / Count |
| Min/Max | ✅ Yes | Running min/max |
| Variance | ✅ Yes | Welford's algorithm |
| Groupby | ✅ Yes | Accumulate per group |
| Filter | ✅ Yes | Write matching rows |
| Sort | ❌ No | External merge sort |
| Median | ❌ No | Approximate algorithms |
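What the compatible rows share is that a small, fixed-size state can be carried from one chunk to the next. A minimal sketch for the Min/Max row, using a tiny in-memory CSV invented for this example:

```python
import io
import pandas as pd

csv_text = "value\n" + "\n".join(str(v) for v in [7, 2, 9, 4, 11, 1, 5])

running_min = float("inf")
running_max = float("-inf")
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # Only two scalars survive between chunks, so memory stays constant.
    running_min = min(running_min, int(chunk["value"].min()))
    running_max = max(running_max, int(chunk["value"].max()))

print(running_min, running_max)  # 1 11
```

An exact median has no such fixed-size state: the middle value of the whole file cannot be derived from the medians of its chunks, which is why the table points to approximate algorithms instead.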

---

## Exercise 4: Online Statistics (Welford's Algorithm) 📊

Calculate statistics in a single pass with O(1) memory.

### The Problem with Standard Formulas

```python
# Naive variance (requires TWO passes and ALL data in memory):
mean = sum(data) / len(data)  # Pass 1
variance = sum((x - mean)**2 for x in data) / len(data)  # Pass 2
```

### Welford's Algorithm (One Pass)

- **Memory**: O(1) - only 3 variables
- **Passes**: 1 - single scan
- **Numerically stable**: Avoids catastrophic cancellation
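The stability point is easiest to see with the tempting one-pass shortcut `E[x^2] - E[x]^2`, which is not the two-pass formula shown earlier. The sketch below uses four hand-picked values with a large offset (chosen for illustration; their true population variance is 22.5) and compares the shortcut with the two-pass result. Welford's update aims for accuracy close to the two-pass formula while still scanning the data once.

```python
import numpy as np

# Four values with a large common offset; the true population variance is 22.5.
x = 1e9 + np.array([4.0, 7.0, 13.0, 16.0])

# One-pass shortcut E[x^2] - E[x]^2: subtracts two numbers near 1e18,
# where float64 can no longer resolve a difference of this size.
naive_var = np.mean(x ** 2) - np.mean(x) ** 2

# Two-pass formula: subtract the mean first, then square the small deviations.
two_pass_var = np.mean((x - np.mean(x)) ** 2)

print(naive_var, two_pass_var)
```

The two-pass value is exact here, while the shortcut is off by far more than the variance itself.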

In [None]:
class OnlineStats:
    """
    Calculate running statistics using Welford's algorithm.
    
    Properties:
        - O(1) memory
        - Single pass
        - Numerically stable
    
    Welford's update formulas:
        count += 1
        delta = x - mean
        mean += delta / count
        delta2 = x - mean  # Using NEW mean
        M2 += delta * delta2
    
    Variance = M2 / count
    """
    
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.M2 = 0.0  # Sum of squared differences from mean
        self.min_val = float('inf')
        self.max_val = float('-inf')
    
    def update(self, x: float) -> None:
        """
        Update statistics with a new value.
        
        TODO: Implement Welford's algorithm
        
        Steps:
        1. Increment count
        2. Calculate delta = x - mean
        3. Update mean: mean += delta / count
        4. Calculate delta2 = x - mean (using UPDATED mean)
        5. Update M2: M2 += delta * delta2
        6. Update min/max
        """
        # TODO: Implement this method
        pass
    
    def update_batch(self, values: np.ndarray) -> None:
        """Update with multiple values."""
        for x in values:
            self.update(x)
    
    def variance(self) -> float:
        """Return population variance."""
        if self.count < 2:
            return 0.0
        return self.M2 / self.count
    
    def std(self) -> float:
        """Return population standard deviation."""
        return np.sqrt(self.variance())
    
    def summary(self) -> dict:
        """Return summary of all statistics."""
        return {
            'count': self.count,
            'mean': self.mean,
            'std': self.std(),
            'min': self.min_val,
            'max': self.max_val
        }

In [None]:
# Test OnlineStats
print("Testing OnlineStats (Welford's algorithm)...\n")

# Generate test data
np.random.seed(42)
test_data = np.random.randn(1_000_000)

# Calculate with OnlineStats
stats = OnlineStats()
stats.update_batch(test_data)
online_result = stats.summary()

# Calculate with NumPy (ground truth)
numpy_result = {
    'count': len(test_data),
    'mean': np.mean(test_data),
    'std': np.std(test_data),
    'min': np.min(test_data),
    'max': np.max(test_data)
}

if online_result['count'] > 0:
    print("Comparison:")
    print(f"{'Metric':<10} {'OnlineStats':<15} {'NumPy':<15} {'Match'}")
    print("-" * 50)
    for key in ['count', 'mean', 'std', 'min', 'max']:
        online_val = online_result[key]
        numpy_val = numpy_result[key]
        if isinstance(online_val, float):
            match = abs(online_val - numpy_val) < 1e-10
            print(f"{key:<10} {online_val:<15.6f} {numpy_val:<15.6f} {'✓' if match else '✗'}")
        else:
            match = online_val == numpy_val
            print(f"{key:<10} {online_val:<15,} {numpy_val:<15,} {'✓' if match else '✗'}")
else:
    print("TODO: Implement OnlineStats.update()")

In [None]:
# Apply OnlineStats to large file with chunking
print("Calculating statistics on large file with chunking + OnlineStats...\n")

stats_chunked = OnlineStats()
chunks_processed = 0

start = time.perf_counter()
for chunk in pd.read_csv(TEST_CSV, chunksize=500_000, usecols=['value']):
    stats_chunked.update_batch(chunk['value'].values)
    chunks_processed += 1
elapsed = time.perf_counter() - start

if stats_chunked.count > 0:
    result = stats_chunked.summary()
    print(f"Processed {chunks_processed} chunks in {elapsed:.2f} sec")
    print(f"\nResults:")
    print(f"  Count: {result['count']:,}")
    print(f"  Mean: {result['mean']:.6f}")
    print(f"  Std: {result['std']:.6f}")
    print(f"  Min: {result['min']:.6f}")
    print(f"  Max: {result['max']:.6f}")
else:
    print("TODO: Implement OnlineStats.update()")

---

## Exercise 5 (Bonus): Introduction to Dask 🚀

Compare Dask with manual chunking.

**Dask** = "Pandas but bigger"
- Lazy evaluation
- Automatic parallelism
- Familiar API

In [None]:
if DASK_AVAILABLE:
    print("Dask is available! Running bonus exercise...\n")
    
    # Read with Dask (lazy - no data loaded yet)
    print("Creating Dask DataFrame (lazy)...")
    ddf = dd.read_csv(TEST_CSV)
    print(f"Type: {type(ddf)}")
    print(f"Partitions: {ddf.npartitions}")
    print(f"\nNote: No data has been loaded yet!")
else:
    print("Dask not installed. Skipping bonus exercise.")
    print("To install: pip install 'dask[complete]'")

In [None]:
if DASK_AVAILABLE:
    # Compare Dask vs manual chunking for groupby
    print("Comparing Dask vs manual chunking for groupby...\n")
    
    # Manual chunking
    start = time.perf_counter()
    manual_result = chunked_groupby_sum(TEST_CSV, 'category', 'value')
    manual_time = time.perf_counter() - start
    
    # Dask
    start = time.perf_counter()
    dask_result = ddf.groupby('category')['value'].sum().compute()
    dask_time = time.perf_counter() - start
    
    print(f"Manual chunking: {manual_time:.3f} sec")
    print(f"Dask: {dask_time:.3f} sec")
    print(f"\nDask speedup: {manual_time / dask_time:.2f}x")
    
    # Verify results match
    if manual_result is not None:
        manual_sums = manual_result['sum'].sort_index()
        dask_sums = dask_result.sort_index()
        match = np.allclose(manual_sums.values, dask_sums.values, rtol=1e-5)
        print(f"Results match: {match}")

In [None]:
if DASK_AVAILABLE:
    # Show Dask's lazy evaluation
    print("Dask's lazy evaluation:")
    print("-" * 40)
    
    # Define computation (nothing runs yet)
    lazy_result = ddf.groupby('category')['value'].mean()
    print(f"Lazy result type: {type(lazy_result)}")
    print(f"\nPrint lazy result (shows a lazy placeholder, not data):")
    print(lazy_result)
    
    # Execute computation
    print(f"\nAfter .compute() (actually runs):")
    actual_result = lazy_result.compute()
    print(actual_result)

### 💡 When to Use What

| Tool | Data Size | Use Case |
|------|-----------|----------|
| Pandas | < 1 GB | Standard analysis |
| Manual Chunking | 1-10 GB | Simple aggregations, full control |
| Dask | 1-100 GB | Complex operations, automatic parallelism |
| Spark | > 100 GB | Multi-node cluster, distributed computing |

---

## 5. Reflection

**Your task:** Write a short reflection (3-5 sentences) answering:

1. What was the biggest speedup you achieved with vectorization?
2. When would you use chunking vs loading the full dataset?
3. How does Welford's algorithm solve the memory problem for variance calculation?

In [None]:
# TODO: Write your reflection here
reflection = """
Replace this text with your reflection.
Think about what you learned about vectorization and out-of-core computing.
What will you do differently in your future projects?
""".strip()

print("Your reflection:")
print(reflection)

---

## 6. Save Results

In [None]:
# Compile all results
results = {
    "lab": "04_vectorization_out_of_core",
    "timestamp": pd.Timestamp.now().isoformat(),
    "exercise_1_vectorization": {
        "distance_speedup": benchmark_results.get('distance', {}).get('speedup'),
        "age_classification_speedup": benchmark_results.get('age_classification', {}).get('speedup'),
        "normalization_speedup": speedup_norm if 'speedup_norm' in dir() else None,
        "scores_speedup": benchmark_results.get('scores', {}).get('speedup'),
    },
    "exercise_2_benchmarks": benchmark_results,
    "exercise_3_chunking": {
        "chunked_mean_result": chunked_result if 'chunked_result' in dir() else None,
        "chunked_filter_result": filter_result if 'filter_result' in dir() else None,
    },
    "exercise_4_online_stats": {
        "welford_result": online_result if 'online_result' in dir() and online_result['count'] > 0 else None,
    },
    "exercise_5_dask": {
        "available": DASK_AVAILABLE,
    },
    "reflection": reflection,
}

# Save to JSON
with open(METRICS_PATH, "w") as f:
    json.dump(results, f, indent=2, default=str)

print(f"✓ Results saved to: {METRICS_PATH}")

---

## 🎉 Lab Complete!

### What You Learned

1. **Vectorization**: NumPy/Pandas operations are often 100-200x faster than Python loops on numeric data
2. **Broadcasting**: Apply operations across arrays without explicit loops
3. **The .apply() Trap**: It's a hidden loop, not vectorization
4. **Chunking**: Process datasets larger than RAM piece by piece
5. **Welford's Algorithm**: Calculate mean and variance in O(1) memory
6. **Dask**: Scales Pandas with lazy evaluation and parallelism

### Optimization Checklist

- ✅ Avoid Python loops for element-wise array operations
- ✅ Replace `.apply()` with vectorized operations
- ✅ Use broadcasting for multi-array operations
- ✅ Use chunking for files larger than RAM
- ✅ Use online algorithms for streaming statistics

### Vectorization Cheat Sheet

| Pattern | Slow | Fast |
|---------|------|------|
| Arithmetic | loop + append | `df['a'] * df['b']` |
| Conditional | loop + if/else | `np.where()` or `np.select()` |
| Binning | loop + if/elif | `pd.cut()` |
| Clipping | loop + min/max | `np.clip()` |
| Normalization | nested loops | `(df - mean) / std` |
| Distance | loop + math.sqrt | `np.linalg.norm()` |
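The Binning row is the one pattern in the table not exercised earlier in the lab. As a hedged sketch (the sample ages are made up), `pd.cut()` can reproduce the child/adult/senior rule from Part 1B when the bins are closed on the left:

```python
import numpy as np
import pandas as pd

ages = np.array([3, 17, 18, 40, 64, 65, 90])

# right=False makes each bin closed on the left: [-inf, 18), [18, 65), [65, inf).
groups = pd.cut(
    ages,
    bins=[-np.inf, 18, 65, np.inf],
    labels=["child", "adult", "senior"],
    right=False,
)
print(list(groups))
# ['child', 'child', 'adult', 'adult', 'adult', 'senior', 'senior']
```

Unlike `np.select()`, the result is a pandas `Categorical`, which stores each label once and can save memory on large columns.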

### Files to Submit

1. `notebooks/lab04_vectorization_out_of_core.ipynb` (this notebook)
2. `results/lab04_metrics.json`

---

**Next Lab**: We'll explore parallel processing and distributed computing with PySpark!