# Lab 04: Efficient Formats and Vectorization

**Course:** Big Data

---

## Student Information

**Name:** `Your Name Here`

**Date:** `DD/MM/YYYY`

---

**Goal:** Compare storage formats (CSV, Parquet, Feather) and master vectorization to process data efficiently.

## Learning Objectives

By the end of this lab, you will be able to:

1. **Compare Storage Formats**: Understand trade-offs between CSV, Parquet (Snappy, Zstd), and Feather
2. **Benchmark I/O**: Measure read/write speed and disk usage for each format
3. **Vectorize Operations**: Replace slow Python loops with NumPy/Pandas operations (often a 100-200x speedup)
4. **Use Broadcasting**: Apply operations across arrays without explicit loops
5. **Build Optimized Pipelines**: Combine efficient formats + vectorization for maximum performance

## Instructions

1. **Fill in your information above** before starting the lab
2. Read each cell carefully before running it
3. Implement the **TODO functions** when you see them
4. Run cells **from top to bottom** (Shift+Enter)
5. Check that output makes sense after each cell

---

## Libraries Used in This Lab

### Core Libraries

- **`numpy`** - Vectorized numerical operations
- **`pandas`** - DataFrame operations and I/O
- **`pyarrow`** - Parquet/Feather engine
- **`time`** - Performance measurement
- **`os`** - File size measurement

### Why This Matters

Two key optimizations can transform your data pipeline:

1. **Format choice**: Parquet files are typically 3-10x smaller than the equivalent CSV and faster to read
2. **Vectorization**: NumPy operations are often 100-200x faster than equivalent Python loops

Combined, these can turn a 10-minute pipeline into one that runs in a few seconds.

---

## 1. Imports and Setup

In [None]:
import json
import math
import time
import os
from pathlib import Path

import pandas as pd
import numpy as np
import pyarrow

print("Imports successful!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"PyArrow version: {pyarrow.__version__}")

## 2. Define Paths

In [None]:
# Base directories
DATA_RAW = Path("../data/raw")
DATA_PROCESSED = Path("../data/processed")
RESULTS_DIR = Path("../results")

# File paths for this lab
VENTAS_CSV = DATA_RAW / "ventas.csv"
VENTAS_SNAPPY = DATA_PROCESSED / "ventas_snappy.parquet"
VENTAS_ZSTD = DATA_PROCESSED / "ventas_zstd.parquet"
VENTAS_NONE = DATA_PROCESSED / "ventas_none.parquet"
VENTAS_FEATHER = DATA_PROCESSED / "ventas.feather"
METRICS_PATH = RESULTS_DIR / "lab04_metrics.json"

# Ensure directories exist
DATA_RAW.mkdir(parents=True, exist_ok=True)
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

print("Paths defined:")
print(f"  CSV: {VENTAS_CSV}")
print(f"  Parquet (snappy): {VENTAS_SNAPPY}")
print(f"  Metrics: {METRICS_PATH}")

---

## 3. Dataset Generation

We generate a synthetic sales dataset with 5 million rows to benchmark both storage formats and vectorization.

### TODO 1: `generate_ventas()`

Generate a synthetic sales dataset.

**Hints:**
- Use `np.random.seed(seed)` for reproducibility
- Use `pd.date_range()` for dates with frequency `'s'` (seconds)
- Use `np.random.choice()` for categorical columns
- Use `np.random.uniform()` for price, `np.random.randint()` for quantity
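
The hints above can be tried on a tiny frame before scaling up. The sketch below is not the lab solution: it builds five rows with hypothetical column names (`ts`, `group`, `value`, `units`) to show how a fixed seed makes the random columns reproducible.

```python
import numpy as np
import pandas as pd


def tiny_sample(n: int = 5, seed: int = 0) -> pd.DataFrame:
    """Build a small synthetic frame; the column names are illustrative only."""
    np.random.seed(seed)  # same seed -> same random draws
    return pd.DataFrame({
        "ts": pd.date_range("2020-01-01", periods=n, freq="s"),  # one row per second
        "group": np.random.choice(["x", "y", "z"], n),           # categorical column
        "value": np.random.uniform(1, 1000, n).round(2),         # floats drawn from [1, 1000)
        "units": np.random.randint(1, 50, n),                    # ints from 1 to 49 (upper bound excluded)
    })


first = tiny_sample()
second = tiny_sample()
print(first)
print("Reproducible:", first.equals(second))
```

Note that `np.random.randint(1, 50, n)` never returns 50, because the upper bound is exclusive.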

In [None]:
def generate_ventas(n: int = 5_000_000, seed: int = 42) -> pd.DataFrame:
    """
    Generate a synthetic sales dataset.
    
    Args:
        n: Number of rows
        seed: Random seed
    
    Returns:
        DataFrame with columns: id, fecha, categoria, producto, precio, cantidad, ciudad
    """
    # TODO: Implement this function
    # Step 1: Set random seed with np.random.seed(seed)
    # Step 2: Create DataFrame with columns:
    #   - 'id': range(n)
    #   - 'fecha': pd.date_range('2020-01-01', periods=n, freq='s')
    #   - 'categoria': np.random.choice(['Electronica', 'Ropa', 'Hogar', 'Deportes'], n)
    #   - 'producto': np.random.choice([f'prod_{i}' for i in range(1000)], n)
    #   - 'precio': np.random.uniform(1, 1000, n).round(2)
    #   - 'cantidad': np.random.randint(1, 50, n)
    #   - 'ciudad': np.random.choice(['Madrid', 'Barcelona', 'Valencia', 'Sevilla', 'Bilbao'], n)
    # Step 3: Return DataFrame
    pass

In [None]:
# Generate the dataset
print("Generating 5 million row sales dataset...")
print("(This may take 30-60 seconds)\n")

start = time.perf_counter()
df = generate_ventas(n=5_000_000)
elapsed = time.perf_counter() - start

if df is not None:
    print(f"Generated in {elapsed:.1f} seconds")
    print(f"Shape: {df.shape}")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    print(f"\nSample:")
    print(df.head())
    print(f"\nData types:")
    print(df.dtypes)
else:
    print("TODO: Implement generate_ventas()")

---

## Exercise 1: CSV vs Parquet vs Feather - Practical Comparison (25 min)

**Objective**: Experience firsthand the differences between storage formats.

### Format Overview

| Format | Type | Compression | Column Selection | Predicate Pushdown |
|--------|------|-------------|------------------|--------------------|
| CSV | Row-based, text | None | No (reads all) | No |
| Parquet (Snappy) | Columnar, binary | Fast, moderate ratio | Yes | Yes |
| Parquet (Zstd) | Columnar, binary | Slower, best ratio | Yes | Yes |
| Parquet (None) | Columnar, binary | None | Yes | Yes |
| Feather | Columnar, binary | Optional (LZ4) | Yes | No |

### TODO 2: `save_all_formats()`

Save the DataFrame in all formats and measure file sizes.

**Hints:**
- Use `df.to_csv(path, index=False)`
- Use `df.to_parquet(path, compression='snappy')`, `compression='zstd'`, `compression=None`
- Use `df.to_feather(path)`
- Use `os.path.getsize(path) / 1024**2` for size in MB
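
The five formats all follow the same measure-write-measure pattern, so one way to avoid repeating it is a small helper. The sketch below assumes a hypothetical helper name, `timed_write`, and demonstrates it on CSV in a temporary directory so it does not touch the lab's files.

```python
import os
import tempfile
import time
from pathlib import Path

import pandas as pd


def timed_write(write_fn, path: Path) -> dict:
    """Call write_fn(path), then report elapsed seconds and file size in MB."""
    start = time.perf_counter()
    write_fn(path)
    elapsed = time.perf_counter() - start
    return {"write_sec": elapsed, "size_mb": os.path.getsize(path) / 1024**2}


# Small demo frame (hypothetical data, not the lab dataset).
demo = pd.DataFrame({"x": range(1000), "y": [0.5] * 1000})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    info = timed_write(lambda p: demo.to_csv(p, index=False), csv_path)

print(info)
```

The same helper accepts any writer, for example `lambda p: df.to_parquet(p, compression='zstd')` or `lambda p: df.to_feather(p)`.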

In [None]:
def save_all_formats(df: pd.DataFrame) -> dict:
    """
    Save DataFrame in CSV, Parquet (snappy, zstd, none), and Feather.
    Measure write time and file size for each.
    
    Args:
        df: DataFrame to save
    
    Returns:
        Dictionary with format -> {"write_sec": float, "size_mb": float}
    """
    # TODO: Implement this function
    # For each format:
    #   1. Record start time with time.perf_counter()
    #   2. Save the DataFrame
    #   3. Record elapsed time
    #   4. Measure file size with os.path.getsize() / 1024**2
    #   5. Store results in dictionary
    #
    # Formats to save:
    #   - CSV:             df.to_csv(VENTAS_CSV, index=False)
    #   - Parquet Snappy:  df.to_parquet(VENTAS_SNAPPY, compression='snappy')
    #   - Parquet Zstd:    df.to_parquet(VENTAS_ZSTD, compression='zstd')
    #   - Parquet None:    df.to_parquet(VENTAS_NONE, compression=None)
    #   - Feather:         df.to_feather(VENTAS_FEATHER)
    pass

In [None]:
# Save in all formats
print("Saving in all formats...\n")

format_results = save_all_formats(df)

if format_results:
    print(f"{'Format':<20} {'Size (MB)':>10} {'Write (s)':>10}")
    print("-" * 42)
    for fmt, info in format_results.items():
        print(f"{fmt:<20} {info['size_mb']:>10.1f} {info['write_sec']:>10.2f}")
    
    # Calculate compression ratios vs CSV
    csv_size = format_results['csv']['size_mb']
    print(f"\nCompression ratios vs CSV ({csv_size:.1f} MB):")
    for fmt, info in format_results.items():
        if fmt != 'csv':
            ratio = csv_size / info['size_mb']
            print(f"  {fmt}: {ratio:.1f}x smaller")
else:
    print("TODO: Implement save_all_formats()")

### TODO 3: `benchmark_reads()`

Benchmark read performance for each format.

**Hints:**
- Full read: `pd.read_csv()`, `pd.read_parquet()`, `pd.read_feather()`
- Selective read (2 columns): use `usecols=` for CSV, `columns=` for Parquet/Feather
- Filtered read: use `filters=[('categoria', '==', 'Electronica')]` for Parquet
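
A single timing is noisy (disk caches, background processes), which is why the benchmark takes the median of several runs. The sketch below shows one possible timing helper, with the hypothetical name `median_time`, applied to an in-memory CSV so that it runs without any files on disk.

```python
import io
import statistics
import time

import pandas as pd


def median_time(fn, n_runs: int = 3) -> float:
    """Run fn() n_runs times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Small in-memory CSV (hypothetical columns a, b, c).
csv_text = pd.DataFrame(
    {"a": range(10_000), "b": range(10_000), "c": range(10_000)}
).to_csv(index=False)

full = median_time(lambda: pd.read_csv(io.StringIO(csv_text)))
selective = median_time(lambda: pd.read_csv(io.StringIO(csv_text), usecols=["a"]))
print(f"full: {full:.4f}s  selective: {selective:.4f}s")
```

With CSV, `usecols=` skips converting the unused columns, but the parser still has to scan every line of text, so the saving is usually much smaller than with a columnar format.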

In [None]:
def benchmark_reads(n_runs: int = 3) -> dict:
    """
    Benchmark read performance for all formats.
    
    Tests:
    - Full read (all columns, all rows)
    - Selective read (only 'precio' and 'cantidad' columns)
    - Filtered read (only 'Electronica' category, Parquet only)
    
    Args:
        n_runs: Number of runs for timing (use median)
    
    Returns:
        Dictionary with benchmark results
    """
    # TODO: Implement this function
    # For each test, run n_runs times and take the median time.
    #
    # Test 1 - Full read:
    #   - pd.read_csv(VENTAS_CSV)
    #   - pd.read_parquet(VENTAS_SNAPPY)
    #   - pd.read_feather(VENTAS_FEATHER)
    #
    # Test 2 - Selective read (2 columns: 'precio', 'cantidad'):
    #   - pd.read_csv(VENTAS_CSV, usecols=['precio', 'cantidad'])
    #   - pd.read_parquet(VENTAS_SNAPPY, columns=['precio', 'cantidad'])
    #
    # Test 3 - Filtered read (Parquet with predicate pushdown):
    #   - pd.read_parquet(VENTAS_SNAPPY, filters=[('categoria', '==', 'Electronica')])
    pass

In [None]:
# Run read benchmarks
print("Running read benchmarks (this may take a few minutes)...\n")

read_results = benchmark_reads()

if read_results:
    for test_name, timings in read_results.items():
        print(f"\n{test_name}:")
        for fmt, sec in timings.items():
            print(f"  {fmt:<20} {sec:.3f} sec")
    
    # Speedup vs CSV
    if 'full_read' in read_results:
        csv_time = read_results['full_read'].get('csv', 0)
        if csv_time > 0:
            print(f"\nFull read speedup vs CSV ({csv_time:.2f}s):")
            for fmt, sec in read_results['full_read'].items():
                if fmt != 'csv':
                    print(f"  {fmt}: {csv_time / sec:.1f}x faster")
else:
    print("TODO: Implement benchmark_reads()")

### Key Insight: Why Parquet is Faster

**CSV** (row-based, text):
- Must read ALL data even for 2 columns
- Must parse text to numbers
- No metadata about data types

**Parquet** (columnar, binary):
- Reads only requested columns (column pruning)
- Data already in binary format (no parsing)
- Built-in statistics for predicate pushdown
- Compression reduces I/O

**Feather** (columnar, binary):
- Fastest read/write (minimal overhead)
- Good for intermediate data (between pipeline steps)
- Less compression than Parquet

---

## Exercise 2: Rewrite Loops to Vectorized (25 min)

**Objective**: Practice identifying and vectorizing slow loop-based code.

### Why Python Loops Are Slow

Each iteration of a Python loop costs on the order of hundreds of CPU instructions:
- The interpreter decodes bytecode, looks up variables, checks types, finds methods, creates stack frames...

**NumPy**: roughly 1-2 CPU instructions per element (compiled C code)

**Result**: speedups of 100-200x are common!
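
The difference is easy to observe on a single array. The sketch below sums one million floats twice, once with an interpreted loop and once with a single NumPy call. The exact timings depend on the machine, so only the equality of the results is checked.

```python
import time

import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Interpreted loop: every iteration goes through the Python interpreter.
start = time.perf_counter()
total_loop = 0.0
for v in values:
    total_loop += v
loop_sec = time.perf_counter() - start

# Vectorized: a single call into compiled code.
start = time.perf_counter()
total_vec = values.sum()
vec_sec = time.perf_counter() - start

print(f"loop: {loop_sec:.4f}s  vectorized: {vec_sec:.6f}s")
print("Same result:", np.isclose(total_loop, total_vec))
```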

In [None]:
# Load sample data for vectorization exercises
print("Loading sample data for vectorization exercises...")
df_sample = pd.read_parquet(VENTAS_SNAPPY)
print(f"Loaded {len(df_sample):,} rows")
print(f"Columns: {list(df_sample.columns)}")

### Part 2A: Distance Calculation

Convert a loop-based Euclidean distance calculation to NumPy broadcasting.
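
As a warm-up on a related but different problem, the sketch below measures the distance from several points to one reference point. Here broadcasting does real work: the `(2,)` array is stretched across every row of the `(3, 2)` array. The names `points` and `origin` are illustrative.

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])  # shape (3, 2)
origin = np.array([3.0, 4.0])                             # shape (2,)

# Broadcasting stretches `origin` across all rows, so no loop is needed.
diff = points - origin                     # shape (3, 2)
dist = np.sqrt((diff ** 2).sum(axis=1))    # shape (3,)

print(dist)
```

`np.linalg.norm(points - origin, axis=1)` gives the same values in a single call.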

In [None]:
# Original slow implementation
def calculate_distances_slow(points_a, points_b):
    """Calculate Euclidean distances using a loop."""
    distances = []
    for i in range(len(points_a)):
        dist = math.sqrt(
            (points_a[i][0] - points_b[i][0])**2 +
            (points_a[i][1] - points_b[i][1])**2
        )
        distances.append(dist)
    return distances

### TODO 4: `calculate_distances_fast()`

In [None]:
def calculate_distances_fast(points_a: np.ndarray, points_b: np.ndarray) -> np.ndarray:
    """
    Calculate Euclidean distances using vectorization.
    
    Args:
        points_a: Array of shape (N, 2) with x, y coordinates
        points_b: Array of shape (N, 2) with x, y coordinates
    
    Returns:
        Array of distances
    
    Hints:
        - Use broadcasting: diff = points_a - points_b
        - Square: diff**2
        - Sum along axis 1: np.sum(..., axis=1)
        - Square root: np.sqrt(...)
        - Or use np.linalg.norm(points_a - points_b, axis=1)
    """
    # TODO: Implement this function
    pass

In [None]:
# Test distance calculation
n_points = 100_000
np.random.seed(42)
points_a = np.random.randn(n_points, 2)
points_b = np.random.randn(n_points, 2)

# Benchmark slow version
start = time.perf_counter()
result_slow = calculate_distances_slow(points_a, points_b)
time_slow = time.perf_counter() - start

# Benchmark fast version
start = time.perf_counter()
result_fast = calculate_distances_fast(points_a, points_b)
time_fast = time.perf_counter() - start

if result_fast is not None:
    match = np.allclose(result_slow, result_fast)
    speedup_dist = time_slow / time_fast
    print(f"Distance calculation ({n_points:,} points):")
    print(f"  Slow: {time_slow:.4f} sec")
    print(f"  Fast: {time_fast:.6f} sec")
    print(f"  Speedup: {speedup_dist:.1f}x")
    print(f"  Results match: {match}")
else:
    print("TODO: Implement calculate_distances_fast()")
    speedup_dist = None

### Part 2B: Age Classification

Replace the conditional loop with `np.select()` or `pd.cut()`.
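
Both functions are sketched below on a different, hypothetical problem (banding scores into `low`/`mid`/`high`) so that the two APIs can be compared side by side. The bin edges and labels are assumptions made for the example.

```python
import numpy as np
import pandas as pd

scores = pd.Series([12, 50, 79, 80, 95])

# np.select: the first condition that is True wins, so order matters.
conditions = [scores < 50, scores < 80]
choices = ["low", "mid"]
by_select = np.select(conditions, choices, default="high")

# pd.cut: right=False makes each bin closed on the left: [0, 50), [50, 80), [80, 101).
by_cut = pd.cut(scores, bins=[0, 50, 80, 101], labels=["low", "mid", "high"], right=False)

print(by_select)
print(by_cut.tolist())
```

`np.select` returns a plain NumPy array, while `pd.cut` returns a categorical Series, which is more memory-efficient when there are few distinct labels.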

In [None]:
# Original slow implementation
def classify_ages_slow(df):
    """Classify ages using a loop."""
    categories = []
    for age in df['age']:
        if age < 18:
            categories.append('child')
        elif age < 65:
            categories.append('adult')
        else:
            categories.append('senior')
    return categories

### TODO 5: `classify_ages_fast()`

In [None]:
def classify_ages_fast(df: pd.DataFrame) -> np.ndarray:
    """
    Classify ages using vectorization.
    
    Args:
        df: DataFrame with 'age' column
    
    Returns:
        Array of categories ('child', 'adult', 'senior')
    
    Hints:
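        - Use np.select(conditions, choices, default='senior')
        - conditions = [df['age'] < 18, df['age'] < 65, df['age'] >= 65]
          (checked in order; the first True condition wins)
        - choices = ['child', 'adult', 'senior']
        - Keep the default a string so it shares a dtype with the choices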
    """
    # TODO: Implement this function
    pass

In [None]:
# Create test data with age column
n_test = 200_000
df_ages = pd.DataFrame({'age': np.random.randint(0, 100, n_test)})

# Benchmark
start = time.perf_counter()
result_slow = classify_ages_slow(df_ages)
time_slow = time.perf_counter() - start

start = time.perf_counter()
result_fast = classify_ages_fast(df_ages)
time_fast = time.perf_counter() - start

if result_fast is not None:
    match = all(s == f for s, f in zip(result_slow, result_fast))
    speedup_ages = time_slow / time_fast
    print(f"Age classification ({n_test:,} values):")
    print(f"  Slow: {time_slow:.4f} sec")
    print(f"  Fast: {time_fast:.6f} sec")
    print(f"  Speedup: {speedup_ages:.1f}x")
    print(f"  Results match: {match}")
else:
    print("TODO: Implement classify_ages_fast()")
    speedup_ages = None

### Part 2C: Column Normalization

Replace nested loops with broadcasting.
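
One detail matters when comparing against a hand-written loop: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), whereas dividing by `len(values)` gives the population standard deviation (`ddof=0`). The sketch below uses a small hypothetical frame to show the broadcasting pattern and the difference between the two defaults.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"u": [1.0, 2.0, 3.0, 4.0], "v": [10.0, 20.0, 30.0, 40.0]})

mean = frame.mean()        # one mean per column
std = frame.std(ddof=0)    # population std, matching a manual sum(...) / n

# Broadcasting aligns the per-column mean/std with every row of the frame.
z = (frame - mean) / std

print(z)
# pandas defaults to ddof=1 (sample std); NumPy's np.std defaults to ddof=0.
print("pandas default:", frame["u"].std(), " NumPy default:", np.std(frame["u"].to_numpy()))
```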

In [None]:
# Original slow implementation
def normalize_columns_slow(df, columns):
    """Normalize columns using nested loops."""
    df = df.copy()
    for col in columns:
        values = df[col].values
        mean = sum(values) / len(values)
        variance = sum((x - mean)**2 for x in values) / len(values)
        std = math.sqrt(variance)
        for i in range(len(df)):
            df.loc[i, col] = (df.loc[i, col] - mean) / std
    return df

### TODO 6: `normalize_columns_fast()`

In [None]:
def normalize_columns_fast(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """
    Normalize columns using vectorization.
    
    Args:
        df: DataFrame to normalize
        columns: List of columns to normalize
    
    Returns:
        Normalized DataFrame
    
    Hints:
        - Use df[columns].mean() to get means for all columns at once
        - Use df[columns].std(ddof=0) for standard deviations (the slow version
          divides by n; the pandas default ddof=1 would fail the allclose check)
        - Broadcasting: (df[columns] - mean) / std
    """
    # TODO: Implement this function
    pass

In [None]:
# Test normalization (use smaller sample due to slow version)
n_norm = 10_000
df_norm_test = pd.DataFrame({
    'a': np.random.randn(n_norm),
    'b': np.random.randn(n_norm),
    'value': np.random.uniform(0, 1000, n_norm)
})
columns_to_normalize = ['a', 'b', 'value']

# Benchmark
start = time.perf_counter()
result_slow = normalize_columns_slow(df_norm_test, columns_to_normalize)
time_slow = time.perf_counter() - start

start = time.perf_counter()
result_fast = normalize_columns_fast(df_norm_test, columns_to_normalize)
time_fast = time.perf_counter() - start

if result_fast is not None:
    match = np.allclose(result_slow[columns_to_normalize].values, 
                        result_fast[columns_to_normalize].values, rtol=1e-5)
    speedup_norm = time_slow / time_fast
    print(f"Normalization ({n_norm:,} rows, {len(columns_to_normalize)} columns):")
    print(f"  Slow: {time_slow:.4f} sec")
    print(f"  Fast: {time_fast:.6f} sec")
    print(f"  Speedup: {speedup_norm:.1f}x")
    print(f"  Results match: {match}")
else:
    print("TODO: Implement normalize_columns_fast()")
    speedup_norm = None

### Part 2D: Score Calculation with Clipping

Replace the loop with vectorized operations and `np.clip()`.
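
A minimal illustration of one-sided clipping, on a hypothetical array:

```python
import numpy as np

raw = np.array([-3.0, 4.5, 10.0, 27.0])

# Passing None as the lower bound leaves small values untouched.
capped = np.clip(raw, None, 10)
# np.minimum is an equivalent way to apply an upper bound only.
also_capped = np.minimum(raw, 10)

print(capped)
```

Only the last value changes, because it is the only one above the cap of 10.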

In [None]:
# Original slow implementation
def calculate_scores_slow(df):
    """Calculate scores using a loop."""
    scores = []
    for i in range(len(df)):
        row = df.iloc[i]
        score = (row['a'] * 2 + row['b']) / (row['c'] + 1)
        if score > 10:
            score = 10
        scores.append(score)
    return scores

### TODO 7: `calculate_scores_fast()`

In [None]:
def calculate_scores_fast(df: pd.DataFrame) -> np.ndarray:
    """
    Calculate scores using vectorization.
    
    Args:
        df: DataFrame with columns 'a', 'b', 'c'
    
    Returns:
        Array of scores (clipped to max 10)
    
    Hints:
        - Vectorized formula: (df['a'] * 2 + df['b']) / (df['c'] + 1)
        - Use np.clip(scores, None, 10) to clip max value
    """
    # TODO: Implement this function
    pass

In [None]:
# Test score calculation
n_scores = 50_000
df_score_test = pd.DataFrame({
    'a': np.random.randn(n_scores),
    'b': np.random.randn(n_scores),
    'c': np.random.randint(0, 100, n_scores)
})

# Benchmark
start = time.perf_counter()
result_slow = calculate_scores_slow(df_score_test)
time_slow = time.perf_counter() - start

start = time.perf_counter()
result_fast = calculate_scores_fast(df_score_test)
time_fast = time.perf_counter() - start

if result_fast is not None:
    match = np.allclose(result_slow, result_fast)
    speedup_scores = time_slow / time_fast
    print(f"Score calculation ({n_scores:,} rows):")
    print(f"  Slow: {time_slow:.4f} sec")
    print(f"  Fast: {time_fast:.6f} sec")
    print(f"  Speedup: {speedup_scores:.1f}x")
    print(f"  Results match: {match}")
else:
    print("TODO: Implement calculate_scores_fast()")
    speedup_scores = None

### Key Insight: The .apply() Trap

`.apply()` looks clean but is NOT vectorized:

```python
# This is SLOW (hidden Python loop):
df['result'] = df['x'].apply(lambda x: x * 2)

# This is FAST (vectorized):
df['result'] = df['x'] * 2
```

**Benchmark on 10M elements:**

| Method | Time | Speedup |
|--------|------|--------|
| Python loop | 12.5s | 1x |
| `.apply()` | 5.8s | 2.2x |
| Vectorized | 0.062s | **200x** |
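
The comparison can be reproduced at a smaller scale. The sketch below times both approaches on 200,000 elements and checks that they produce identical values. The absolute timings will differ from the table above and from machine to machine.

```python
import time

import numpy as np
import pandas as pd

s = pd.Series(np.arange(200_000, dtype=np.float64))

start = time.perf_counter()
via_apply = s.apply(lambda x: x * 2)   # calls the lambda once per element
apply_sec = time.perf_counter() - start

start = time.perf_counter()
via_vec = s * 2                        # one operation over the whole column
vec_sec = time.perf_counter() - start

print(f"apply: {apply_sec:.4f}s  vectorized: {vec_sec:.6f}s")
print("Same values:", via_apply.equals(via_vec))
```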

---

## Exercise 3: Comprehensive Benchmark (20 min)

**Objective**: Quantify the impact of vectorization across different scenarios.

In [None]:
# Create large test dataset for benchmarks
n_bench = 10_000_000
df_bench = pd.DataFrame({
    'a': np.random.randn(n_bench),
    'b': np.random.randn(n_bench),
    'c': np.random.randint(0, 100, n_bench)
})
print(f"Benchmark dataset: {n_bench:,} rows")

### TODO 8: `run_vectorization_benchmarks()`

Run 5 benchmarks comparing loops vs vectorized operations.

**Hints:**
- Use a subset (100K rows) for loop versions to avoid waiting too long
- Use full dataset for vectorized versions
- Scale the loop time: `loop_time * (n_bench / n_subset)`
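
The scaling hint can be wrapped in a helper so that every benchmark reports the same three numbers. The sketch below uses a hypothetical function name, `scaled_benchmark`, and assumes that loop cost grows linearly with the number of rows, which is what makes the extrapolation reasonable.

```python
import time

import numpy as np


def scaled_benchmark(loop_fn, vec_fn, n_full: int, n_subset: int) -> dict:
    """Time loop_fn (run on the subset) and vec_fn (run on the full data).

    The loop time is multiplied by n_full / n_subset, which assumes the
    loop cost grows linearly with the number of rows.
    """
    start = time.perf_counter()
    loop_fn()
    loop_sec = (time.perf_counter() - start) * (n_full / n_subset)

    start = time.perf_counter()
    vec_fn()
    vec_sec = time.perf_counter() - start

    return {"loop_sec": loop_sec, "vec_sec": vec_sec, "speedup": loop_sec / vec_sec}


data = np.random.randn(1_000_000)
subset = data[:10_000]

result = scaled_benchmark(
    loop_fn=lambda: sum(x for x in subset),  # interpreted loop on the subset
    vec_fn=lambda: data.sum(),               # vectorized sum on the full array
    n_full=len(data),
    n_subset=len(subset),
)
print(result)
```

The reported `loop_sec` is an estimate, not a measurement, so it is worth saying so when presenting the results.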

In [None]:
def run_vectorization_benchmarks(df: pd.DataFrame) -> dict:
    """
    Run 5 benchmarks comparing loops vs vectorized operations.
    
    Benchmarks:
    1. Sum: loop vs .sum()
    2. Element-wise multiply: loop vs operator *
    3. Filter + transform: loop vs .loc[]
    4. .apply() with lambda vs vectorized
    5. .apply() with complex function vs NumPy equivalent
    
    Args:
        df: DataFrame with columns 'a', 'b', 'c'
    
    Returns:
        Dictionary with benchmark name -> {"loop_sec", "vec_sec", "speedup"}
    """
    results = {}
    n = len(df)
    # Use subset for loop versions
    n_subset = 100_000
    df_sub = df.head(n_subset)
    scale = n / n_subset
    
    # TODO: Implement 5 benchmarks
    #
    # Benchmark 1: Sum
    #   Loop: total = 0; for x in df_sub['a']: total += x
    #   Vectorized: df['a'].sum()
    #
    # Benchmark 2: Element-wise multiply
    #   Loop: result = []; for i in range(len(df_sub)): result.append(df_sub.iloc[i]['a'] * df_sub.iloc[i]['b'])
    #   Vectorized: df['a'] * df['b']
    #
    # Benchmark 3: Filter + transform
    #   Loop: result = []; for i in range(len(df_sub)):
    #             if df_sub.iloc[i]['c'] > 50: result.append(df_sub.iloc[i]['a'] * 2)
    #   Vectorized: df.loc[df['c'] > 50, 'a'] * 2
    #
    # Benchmark 4: .apply() with lambda vs vectorized
    #   Apply: df['a'].apply(lambda x: x * 2 + 1)
    #   Vectorized: df['a'] * 2 + 1
    #
    # Benchmark 5: .apply() with complex function vs NumPy
    #   Apply: df.apply(lambda row: (row['a'] * 2 + row['b']) / (row['c'] + 1), axis=1)
    #   Vectorized: (df['a'] * 2 + df['b']) / (df['c'] + 1)
    #   (Use df_sub for apply, scale time)
    #
    # For each benchmark, store:
    #   results[name] = {'loop_sec': ..., 'vec_sec': ..., 'speedup': ...}
    pass

In [None]:
# Run benchmarks
print("Running vectorization benchmarks...\n")

vec_benchmarks = run_vectorization_benchmarks(df_bench)

if vec_benchmarks:
    print(f"{'Benchmark':<30} {'Loop (s)':>10} {'Vec (s)':>10} {'Speedup':>10}")
    print("-" * 62)
    for name, result in vec_benchmarks.items():
        print(f"{name:<30} {result['loop_sec']:>10.4f} {result['vec_sec']:>10.6f} {result['speedup']:>9.0f}x")
else:
    print("TODO: Implement run_vectorization_benchmarks()")

# Clean up large DataFrame
del df_bench

---

## Exercise 4: Integrated Pipeline - Format + Vectorization (20 min)

**Objective**: Combine efficient formats with vectorized operations in a realistic pipeline.

We will compare:
- **Naive pipeline**: Read CSV + process with loops
- **Optimized pipeline**: Read Parquet (selective + filtered) + vectorized operations

### TODO 9: `pipeline_naive()`

Implement the naive (slow) pipeline.

**Hints:**
- Read from CSV (full file)
- Use a loop to filter and calculate totals

In [None]:
def pipeline_naive() -> dict:
    """
    Naive pipeline: CSV + loops.
    
    Steps:
    1. Read full CSV
    2. Loop through rows to find 'Electronica' category
    3. Calculate precio * cantidad for matching rows
    
    Returns:
        {"total": float, "count": int, "time_sec": float}
    """
    # TODO: Implement this function
    # start = time.perf_counter()
    # df = pd.read_csv(VENTAS_CSV)
    # totals = []
    # for i in range(len(df)):
    #     if df.iloc[i]['categoria'] == 'Electronica':
    #         totals.append(df.iloc[i]['precio'] * df.iloc[i]['cantidad'])
    # elapsed = time.perf_counter() - start
    # return {'total': sum(totals), 'count': len(totals), 'time_sec': round(elapsed, 2)}
    pass

### TODO 10: `pipeline_optimized()`

Implement the optimized pipeline.

**Hints:**
- Read from Parquet with `columns=` and `filters=` for predicate pushdown
- Use vectorized operations for calculations

In [None]:
def pipeline_optimized() -> dict:
    """
    Optimized pipeline: Parquet + vectorized.
    
    Steps:
    1. Read Parquet with column selection and predicate pushdown
    2. Calculate precio * cantidad using vectorized operations
    
    Returns:
        {"total": float, "count": int, "time_sec": float}
    """
    # TODO: Implement this function
    # start = time.perf_counter()
    # df = pd.read_parquet(VENTAS_SNAPPY,
    #                      columns=['categoria', 'precio', 'cantidad'],
    #                      filters=[('categoria', '==', 'Electronica')])
    # df['total'] = df['precio'] * df['cantidad']
    # elapsed = time.perf_counter() - start
    # return {'total': df['total'].sum(), 'count': len(df), 'time_sec': round(elapsed, 2)}
    pass

In [None]:
# Compare pipelines
print("Running pipeline comparison...\n")

naive_result = pipeline_naive()
optimized_result = pipeline_optimized()

if naive_result and optimized_result:
    print(f"Naive pipeline (CSV + loops):")
    print(f"  Time: {naive_result['time_sec']:.2f} sec")
    print(f"  Rows processed: {naive_result['count']:,}")
    print(f"  Total: {naive_result['total']:,.2f}")
    
    print(f"\nOptimized pipeline (Parquet + vectorized):")
    print(f"  Time: {optimized_result['time_sec']:.2f} sec")
    print(f"  Rows processed: {optimized_result['count']:,}")
    print(f"  Total: {optimized_result['total']:,.2f}")
    
    speedup_pipeline = naive_result['time_sec'] / optimized_result['time_sec']
    print(f"\nSpeedup: {speedup_pipeline:.0f}x")
    # Use a relative tolerance: the two pipelines add the floats in a different order
    print(f"Results match: {math.isclose(naive_result['total'], optimized_result['total'], rel_tol=1e-9)}")
else:
    print("TODO: Implement pipeline_naive() and pipeline_optimized()")
    speedup_pipeline = None

### Where Does the Speedup Come From?

| Optimization | Contribution |
|--------------|-------------|
| Parquet vs CSV (less I/O) | ~3-5x |
| Column pruning (3 vs 7 columns) | ~2x |
| Predicate pushdown (skip non-matching row groups) | ~2-4x |
| Vectorized ops vs loop | ~100-200x |
| **Combined** | **~50-200x** |
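
The individual factors do not simply multiply, because each one applies only to part of the run time: the format-related gains speed up the read stage, while vectorization speeds up the compute stage. A worked sketch with hypothetical stage timings (illustrative numbers, not measurements) makes the arithmetic concrete:

```python
# Hypothetical stage timings in seconds (illustrative, not measured).
naive = {"read": 20.0, "compute": 300.0}
# Assumed per-stage speedups, chosen only for the example.
stage_speedup = {"read": 10.0, "compute": 150.0}

optimized = {stage: naive[stage] / stage_speedup[stage] for stage in naive}

# Overall speedup is total time before divided by total time after.
overall = sum(naive.values()) / sum(optimized.values())
print(f"optimized stages: {optimized}")
print(f"overall speedup: {overall:.0f}x")
```

With these assumed numbers the overall speedup is 80x, which lies between the two stage-level factors rather than at their product.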

---

## 5. Reflection

**Your task:** Write a short reflection (3-5 sentences) answering:

1. What was the biggest compression ratio you observed between CSV and Parquet?
2. Which vectorization exercise gave you the largest speedup and why?
3. How much total speedup did you achieve in the integrated pipeline?

In [None]:
# TODO: Write your reflection here
reflection = """
Replace this text with your reflection.
Think about what you learned about efficient formats and vectorization.
What will you do differently in your future data pipelines?
""".strip()

print("Your reflection:")
print(reflection)

---

## 6. Save Results

In [None]:
# Compile all results
results = {
    "lab": "04_formats_vectorization",
    "timestamp": pd.Timestamp.now().isoformat(),
    "exercise_1_formats": {
        "file_sizes": format_results if 'format_results' in dir() and format_results else None,
        "read_benchmarks": read_results if 'read_results' in dir() and read_results else None,
    },
    "exercise_2_vectorization": {
        "distance_speedup": speedup_dist if 'speedup_dist' in dir() and speedup_dist else None,
        "age_classification_speedup": speedup_ages if 'speedup_ages' in dir() and speedup_ages else None,
        "normalization_speedup": speedup_norm if 'speedup_norm' in dir() and speedup_norm else None,
        "scores_speedup": speedup_scores if 'speedup_scores' in dir() and speedup_scores else None,
    },
    "exercise_3_benchmarks": vec_benchmarks if 'vec_benchmarks' in dir() and vec_benchmarks else None,
    "exercise_4_pipeline": {
        "naive": naive_result if 'naive_result' in dir() and naive_result else None,
        "optimized": optimized_result if 'optimized_result' in dir() and optimized_result else None,
        "speedup": speedup_pipeline if 'speedup_pipeline' in dir() and speedup_pipeline else None,
    },
    "reflection": reflection,
}

# Save to JSON
with open(METRICS_PATH, "w") as f:
    json.dump(results, f, indent=2, default=str)

print(f"Results saved to: {METRICS_PATH}")

---

## Lab Complete!

### What You Learned

1. **Format comparison**: Parquet files are typically 3-10x smaller than CSV and faster to read
2. **Compression trade-offs**: Snappy = fast, Zstd = best compression, None = fastest write
3. **Feather**: Best for intermediate data (fastest read/write)
4. **Vectorization**: NumPy/Pandas operations are often 100-200x faster than Python loops
5. **The .apply() trap**: It's a hidden loop, not vectorization
6. **Combined optimization**: Format + vectorization = massive speedup

### Vectorization Cheat Sheet

| Pattern | Slow | Fast |
|---------|------|------|
| Arithmetic | loop + append | `df['a'] * df['b']` |
| Conditional | loop + if/else | `np.where()` or `np.select()` |
| Binning | loop + if/elif | `pd.cut()` |
| Clipping | loop + min/max | `np.clip()` |
| Normalization | nested loops | `(df - mean) / std` |
| Distance | loop + math.sqrt | `np.linalg.norm()` |
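
Of the patterns in the table, `np.where()` is the only one not exercised earlier in the lab. A minimal sketch on a hypothetical `orders` frame, with an assumed 10% discount rule:

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({"amount": [20.0, 150.0, 99.0, 300.0]})

# np.where(condition, value_if_true, value_if_false) evaluates the whole column at once.
orders["discount"] = np.where(orders["amount"] >= 100, orders["amount"] * 0.1, 0.0)

print(orders)
```

`np.where()` suits a single if/else. For three or more branches, `np.select()` is usually easier to read.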

### Files to Submit

1. `notebooks/lab04_formats_vectorization.ipynb` (this notebook)
2. `results/lab04_metrics.json`

---
