# Module 01 — Mathematical & Programming Foundations
## 01-01: Python, NumPy & Tensor Speed

**Objective:** Understand *why* vectorized array operations are orders of magnitude faster
than plain Python loops, and learn to write NumPy/PyTorch code that exploits this speed.

**Prerequisites:** Basic Python (lists, loops, functions). No prior NumPy or PyTorch experience required.


---
## Part 0 — Setup & Prerequisites

This notebook is the very first in the course. We start with the most fundamental question
in scientific computing: **why are some code patterns 100× faster than others on the same data?**

We will build a benchmarking framework from scratch, systematically measure Python loops vs
NumPy vectorization vs PyTorch tensors, explore broadcasting rules, and study how memory
layout affects performance. By the end, you will have a deep intuition for writing fast
numerical code — a skill that underpins every notebook in this course.

**Course prerequisites:** None beyond the basic Python listed above; this is the entry point to the course.

In [None]:
# ── Imports ──────────────────────────────────────────────────────────────────
import sys
import time
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print(f'Python: {sys.version.split()[0]}')
print(f'NumPy: {np.__version__}')

In [None]:
# ── Reproducibility ──────────────────────────────────────────────────────────
import random

SEED = 1103
random.seed(SEED)
np.random.seed(SEED)

In [None]:
# ── Configuration ────────────────────────────────────────────────────────────
# Benchmark parameters
SMALL_SIZE = 1_000
MEDIUM_SIZE = 100_000
LARGE_SIZE = 1_000_000
XL_SIZE = 10_000_000

# Timing parameters
NUM_WARMUP = 2       # Warmup runs before timing
NUM_TIMED = 5        # Timed runs to average

# Visualization
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

---
## Part 1 — Vectorization from Scratch

### Why Vectorization Matters

Python is an interpreted, dynamically typed language. Every time you write `a + b` in a
Python loop, the interpreter must:

1. Look up the types of `a` and `b`
2. Find the appropriate `__add__` method
3. Check for overflow, type coercion, etc.
4. Allocate memory for the result
5. Return the result as a new Python object

This overhead costs on the order of tens to ~100 nanoseconds **per operation**; the exact figure
depends on the interpreter version and hardware. At ~100 ns each, 10 million elements add up
to ~1 second of pure overhead, before any actual math happens.

**Vectorized libraries** like NumPy bypass this by:
- Storing data in contiguous C arrays (not Python objects)
- Dispatching a single C/Fortran function call that loops over the entire array
- Using SIMD (Single Instruction, Multiple Data) CPU instructions
- Avoiding Python object creation for intermediate results

The result? The same operation can be **50–500× faster**. Let's measure it.
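
The storage difference described above can be made concrete with a small sketch. It assumes a 64-bit CPython build (object sizes are an implementation detail), and the variable names are illustrative only.

```python
import sys
import numpy as np

# Assumption: 64-bit CPython; exact object sizes are implementation details.
values = [float(i) for i in range(1_000)]
arr = np.array(values, dtype=np.float64)

bytes_per_python_float = sys.getsizeof(values[0])  # whole object: header + payload
bytes_per_numpy_float = arr.itemsize               # raw 8-byte double, no header

print(f'Python float object: {bytes_per_python_float} bytes')
print(f'NumPy float64 item:  {bytes_per_numpy_float} bytes')
print(f'Array payload:       {arr.nbytes} bytes for {arr.size} elements')
```

The list additionally stores one pointer per element, so the real gap is larger than the per-object numbers suggest.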

### 1.1 Building a Benchmarking Framework

Before we can compare approaches, we need a reliable way to measure execution time.
We'll build a `measure_time` function that handles warmup runs (to prime CPU caches)
and multiple timed runs (to reduce noise from OS scheduling).

In [None]:
def measure_time(
    func: callable,
    num_warmup: int = NUM_WARMUP,
    num_timed: int = NUM_TIMED,
) -> tuple[float, float]:
    """Measure execution time of a zero-argument callable.

    Runs warmup iterations first (to prime CPU caches), then times
    multiple runs and returns the mean and standard deviation.

    Args:
        func: Zero-argument callable to benchmark.
        num_warmup: Number of warmup runs before timing.
        num_timed: Number of timed runs to average.

    Returns:
        Tuple of (mean_seconds, std_seconds).
    """
    # Warmup: bring data into CPU cache
    for _ in range(num_warmup):
        func()

    # Timed runs
    times: list[float] = []
    for _ in range(num_timed):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    return float(np.mean(times)), float(np.std(times))

Let's verify our timing function works correctly by measuring a known operation — sleeping
for a fixed duration.

In [None]:
# Sanity check: time.sleep(0.01) should take ~10 ms
mean_t, std_t = measure_time(lambda: time.sleep(0.01), num_warmup=1, num_timed=3)
print(f'time.sleep(0.01): {mean_t*1000:.1f} ± {std_t*1000:.1f} ms (expected ~10 ms)')
# Generous upper bound: sleep() can overshoot on systems with coarse timers or heavy load
assert 8.0 < mean_t * 1000 < 30.0, f'Timing sanity check failed: {mean_t*1000:.1f} ms'
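
As a cross-check on a hand-rolled timer, the standard library's `timeit` module offers a similar repeat-and-measure loop. The sketch below is not part of the framework above; `busy_work` is a hypothetical workload chosen only for illustration.

```python
import timeit


def busy_work() -> int:
    """Hypothetical workload used only for this illustration."""
    return sum(range(10_000))


CALLS_PER_RUN = 100

# Five independent measurements, each timing CALLS_PER_RUN calls
raw_times = timeit.repeat(busy_work, number=CALLS_PER_RUN, repeat=5)
per_call = [t / CALLS_PER_RUN for t in raw_times]

# Noise from the OS only ever adds time, so the minimum is a useful lower bound
best = min(per_call)
print(f'best of {len(per_call)} runs: {best * 1e6:.1f} µs per call')
```

Reporting the minimum alongside the mean is a common convention; which statistic is more appropriate depends on what you want to compare.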

### 1.2 The Speed Gap: Python Loops vs NumPy

Let's start with the most basic operation: **element-wise addition** of two arrays.
We'll implement it three ways and measure the speed difference.

In [None]:
def python_add(a: list[float], b: list[float]) -> list[float]:
    """Element-wise addition using a Python for loop.

    Args:
        a: First list of numbers.
        b: Second list of numbers.

    Returns:
        List of element-wise sums.
    """
    result = [0.0] * len(a)
    for i in range(len(a)):
        result[i] = a[i] + b[i]
    return result


def python_add_comprehension(a: list[float], b: list[float]) -> list[float]:
    """Element-wise addition using a list comprehension.

    Args:
        a: First list of numbers.
        b: Second list of numbers.

    Returns:
        List of element-wise sums.
    """
    return [x + y for x, y in zip(a, b)]


def numpy_add(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Element-wise addition using NumPy vectorization.

    Args:
        a: First NumPy array.
        b: Second NumPy array.

    Returns:
        NumPy array of element-wise sums.
    """
    return a + b

Now we'll benchmark all three on the same data. We create lists and arrays of the same
size and measure how long each approach takes.

In [None]:
# Create test data
np.random.seed(SEED)
size = MEDIUM_SIZE

arr_a = np.random.randn(size)
arr_b = np.random.randn(size)
# .tolist() yields plain Python floats; list(arr) would yield slower np.float64 scalars
list_a = arr_a.tolist()
list_b = arr_b.tolist()

print(f'Array size: {size:,} elements')
print(f'Data type: float64 ({arr_a.dtype})')
print()

# Benchmark each approach
t_loop, s_loop = measure_time(lambda: python_add(list_a, list_b))
t_comp, s_comp = measure_time(lambda: python_add_comprehension(list_a, list_b))
t_numpy, s_numpy = measure_time(lambda: numpy_add(arr_a, arr_b))

print(f'Python for-loop:       {t_loop*1000:8.2f} ± {s_loop*1000:.2f} ms')
print(f'List comprehension:    {t_comp*1000:8.2f} ± {s_comp*1000:.2f} ms')
print(f'NumPy vectorized:      {t_numpy*1000:8.2f} ± {s_numpy*1000:.2f} ms')
print()
print(f'Speedup (loop → NumPy):          {t_loop / t_numpy:6.1f}×')
print(f'Speedup (comprehension → NumPy): {t_comp / t_numpy:6.1f}×')

The NumPy version is dramatically faster — typically **50–200×** compared to pure Python.
The list comprehension is somewhat faster than the explicit loop (less Python overhead
per iteration), but still nowhere near NumPy's speed.

**Why?** NumPy's `+` operator dispatches to a compiled C function that processes the entire
array in a single call. Python never touches individual elements.
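
To make the dispatch explicit: for arrays, `a + b` ends up in the `np.add` universal function (ufunc). The following minimal sketch shows the two spellings giving the same result, plus the optional `out=` argument that writes into a preallocated buffer.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

via_operator = a + b       # operator syntax
via_ufunc = np.add(a, b)   # the ufunc that the operator dispatches to

# Optional: reuse an existing buffer instead of allocating a new result array
out = np.empty_like(a)
np.add(a, b, out=out)

print(via_operator, via_ufunc, out)
```

Reusing a buffer with `out=` can help in tight loops over large arrays, where repeated allocation becomes noticeable.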

### 1.3 Scaling: How Speed Varies with Array Size

The speedup ratio isn't constant — it depends on the array size. For very small arrays,
NumPy's function call overhead can make it slower than a simple loop. For large arrays,
the vectorized advantage grows. Let's map out this relationship.

In [None]:
def benchmark_addition_scaling() -> pd.DataFrame:
    """Benchmark element-wise addition at various array sizes.

    Returns:
        DataFrame with timing results for each approach and size.
    """
    sizes = [10, 100, 1_000, 10_000, 100_000, 1_000_000]
    records: list[dict] = []

    for n in sizes:
        np.random.seed(SEED)
        npa = np.random.randn(n)
        npb = np.random.randn(n)
        # .tolist() yields plain Python floats (list(arr) would yield np.float64 scalars)
        lst_a = npa.tolist()
        lst_b = npb.tolist()

        t_loop, _ = measure_time(lambda: python_add(lst_a, lst_b), num_warmup=1, num_timed=3)
        t_comp, _ = measure_time(lambda: python_add_comprehension(lst_a, lst_b), num_warmup=1, num_timed=3)
        t_np, _ = measure_time(lambda: numpy_add(npa, npb), num_warmup=1, num_timed=3)

        records.append({
            'Size': n,
            'Python Loop (ms)': t_loop * 1000,
            'List Comp (ms)': t_comp * 1000,
            'NumPy (ms)': t_np * 1000,
            'Loop/NumPy': t_loop / t_np,
        })
        print(f'  n={n:>10,}: loop={t_loop*1000:.3f}ms, numpy={t_np*1000:.4f}ms, '
              f'speedup={t_loop/t_np:.1f}×')

    return pd.DataFrame(records)


print('Benchmarking addition at various sizes...')
scaling_df = benchmark_addition_scaling()
print()
print(scaling_df.to_string(index=False))

Let's visualize this scaling relationship. The log-log plot reveals the **crossover point**
where NumPy starts winning, and how the speedup grows with array size.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: absolute times
axes[0].loglog(scaling_df['Size'], scaling_df['Python Loop (ms)'], 'o-',
               label='Python Loop', color='#E53935', linewidth=2, markersize=7)
axes[0].loglog(scaling_df['Size'], scaling_df['List Comp (ms)'], 's-',
               label='List Comprehension', color='#FF9800', linewidth=2, markersize=7)
axes[0].loglog(scaling_df['Size'], scaling_df['NumPy (ms)'], '^-',
               label='NumPy Vectorized', color='#1E88E5', linewidth=2, markersize=7)
axes[0].set_xlabel('Array Size (n)')
axes[0].set_ylabel('Time (ms, log scale)')
axes[0].set_title('Element-wise Addition: Absolute Time')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right: speedup ratio
axes[1].semilogx(scaling_df['Size'], scaling_df['Loop/NumPy'], 'o-',
                 color='#43A047', linewidth=2, markersize=7)
axes[1].set_xlabel('Array Size (n)')
axes[1].set_ylabel('Speedup (Python Loop / NumPy)')
axes[1].set_title('Vectorization Speedup vs Array Size')
axes[1].axhline(y=1.0, color='gray', linestyle='--', alpha=0.5, label='Break-even')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f'Maximum observed speedup: {scaling_df["Loop/NumPy"].max():.1f}× '
      f'at n={scaling_df.loc[scaling_df["Loop/NumPy"].idxmax(), "Size"]:,}')

**Key observations:**
- For n < ~50, Python loops can be comparable or faster (NumPy has function call overhead)
- For n > 1,000, NumPy is consistently 10–100× faster
- The speedup generally plateaus once the array is large enough to saturate memory bandwidth

This is the **first rule of numerical Python:** always vectorize when possible.

### 1.4 Beyond Addition: Common Operations

Addition is the simplest case. Let's verify that the speedup pattern holds across
different mathematical operations: element-wise multiplication, dot products,
aggregation (sum), and a mathematical function (exp).

In [None]:
def python_dot(a: list[float], b: list[float]) -> float:
    """Compute dot product using a Python for loop.

    Args:
        a: First vector as a list.
        b: Second vector as a list.

    Returns:
        Scalar dot product.
    """
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def python_sum(a: list[float]) -> float:
    """Compute sum using a Python for loop.

    Args:
        a: List of numbers.

    Returns:
        Sum of all elements.
    """
    total = 0.0
    for val in a:
        total += val
    return total


def python_elementwise_mul(a: list[float], b: list[float]) -> list[float]:
    """Element-wise multiplication using a Python for loop.

    Args:
        a: First list of numbers.
        b: Second list of numbers.

    Returns:
        List of element-wise products.
    """
    result = [0.0] * len(a)
    for i in range(len(a)):
        result[i] = a[i] * b[i]
    return result


def python_exp(a: list[float]) -> list[float]:
    """Compute element-wise exp using Python's math module.

    Args:
        a: List of numbers.

    Returns:
        List of exp(x) for each element.
    """
    import math
    return [math.exp(x) for x in a]

Now let's run a comprehensive benchmark across all these operations, comparing Python
loops against their NumPy equivalents.

In [None]:
# Prepare data
np.random.seed(SEED)
n_ops = MEDIUM_SIZE
arr_x = np.random.randn(n_ops)
arr_y = np.random.randn(n_ops)
# .tolist() yields plain Python floats; list(arr) would yield slower np.float64 scalars
list_x = arr_x.tolist()
list_y = arr_y.tolist()

# Define operation pairs: (name, python_func, numpy_func)
operations = [
    ('Addition',
     lambda: python_add(list_x, list_y),
     lambda: arr_x + arr_y),
    ('Multiplication',
     lambda: python_elementwise_mul(list_x, list_y),
     lambda: arr_x * arr_y),
    ('Dot Product',
     lambda: python_dot(list_x, list_y),
     lambda: np.dot(arr_x, arr_y)),
    ('Sum',
     lambda: python_sum(list_x),
     lambda: np.sum(arr_x)),
    ('Exp',
     lambda: python_exp(list_x),
     lambda: np.exp(arr_x)),
]

op_records: list[dict] = []
for name, py_func, np_func in operations:
    t_py, _ = measure_time(py_func)
    t_np, _ = measure_time(np_func)
    speedup = t_py / t_np
    op_records.append({
        'Operation': name,
        'Python (ms)': t_py * 1000,
        'NumPy (ms)': t_np * 1000,
        'Speedup': speedup,
    })

ops_df = pd.DataFrame(op_records)
print(f'Operation benchmarks (n={n_ops:,}):')
print(ops_df.to_string(index=False))

In [None]:
# Visualize operation speedups
fig, ax = plt.subplots(figsize=(10, 5))
colors = ['#1E88E5', '#E53935', '#43A047', '#FF9800', '#9C27B0']
bars = ax.barh(ops_df['Operation'], ops_df['Speedup'], color=colors)
ax.set_xlabel('Speedup (Python Loop / NumPy)')
ax.set_title(f'Vectorization Speedup by Operation (n={n_ops:,})')
ax.axvline(x=1.0, color='gray', linestyle='--', alpha=0.5)

# Add value labels on bars
for bar, val in zip(bars, ops_df['Speedup']):
    ax.text(bar.get_width() + 1, bar.get_y() + bar.get_height() / 2,
            f'{val:.0f}×', va='center', fontweight='bold')

ax.grid(True, axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

Every operation shows a significant speedup, though the exact ratios depend on your
hardware and NumPy build:
- **Sum and dot product** tend to show large speedups because they reduce to a single
  scalar, so no result array has to be allocated.
- **Exp** is compute-heavy. NumPy can use vectorized math routines, but the Python baseline
  already calls the C-implemented `math.exp` once per element, so the ratio is often smaller
  than for addition.
- **Addition and multiplication** are memory-bound on the NumPy side: their speed is limited
  by how fast data can be read from and written to RAM.
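
The point about reductions avoiding a result array can be sketched as follows. The example assumes nothing beyond NumPy itself; the variable names are illustrative. It computes the same dot product three ways and checks that they agree up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000)
y = rng.standard_normal(1_000)

fused = np.dot(x, y)        # single reduction, no intermediate array
two_step = np.sum(x * y)    # allocates a 1,000-element temporary first

# Plain Python reference, using Python floats
loop_total = 0.0
for xi, yi in zip(x.tolist(), y.tolist()):
    loop_total += xi * yi

print(fused, two_step, loop_total)
```

The three results may differ in the last few digits because the summation order differs, which is why the comparison should use a tolerance rather than exact equality.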

### 1.5 Matrix Multiplication: Where Vectorization Shines Brightest

Matrix multiplication is the most important operation in machine learning. A single
forward pass through a neural network is largely a chain of matrix multiplications. Let's
implement it from scratch and compare against NumPy's optimized BLAS (Basic Linear
Algebra Subprograms) implementation.

For matrices $\mathbf{A} \in \mathbb{R}^{m \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times n}$:

$$C_{ij} = \sum_{p=1}^{k} A_{ip} \cdot B_{pj}$$

The naive implementation requires three nested loops — $O(m \cdot k \cdot n)$ operations.

In [None]:
def python_matmul(
    a: list[list[float]],
    b: list[list[float]],
) -> list[list[float]]:
    """Matrix multiplication using triple nested Python loops.

    Args:
        a: Matrix of shape (m, k) as nested lists.
        b: Matrix of shape (k, n) as nested lists.

    Returns:
        Result matrix of shape (m, n) as nested lists.
    """
    m = len(a)
    k = len(a[0])
    n = len(b[0])

    result = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            total = 0.0
            for p in range(k):
                total += a[i][p] * b[p][j]
            result[i][j] = total
    return result

Let's benchmark this against NumPy's `@` operator (which calls BLAS under the hood)
at several matrix sizes.

In [None]:
def benchmark_matmul_scaling() -> pd.DataFrame:
    """Benchmark matrix multiplication at various sizes.

    Returns:
        DataFrame with timing results.
    """
    sizes = [10, 25, 50, 100, 200]
    records: list[dict] = []

    for n in sizes:
        np.random.seed(SEED)
        np_a = np.random.randn(n, n)
        np_b = np.random.randn(n, n)
        py_a = np_a.tolist()
        py_b = np_b.tolist()

        t_py, _ = measure_time(lambda: python_matmul(py_a, py_b), num_warmup=1, num_timed=2)
        t_np, _ = measure_time(lambda: np_a @ np_b, num_warmup=1, num_timed=3)

        # Verify correctness
        py_result = np.array(python_matmul(py_a, py_b))
        np_result = np_a @ np_b
        max_error = np.max(np.abs(py_result - np_result))

        records.append({
            'Size': f'{n}×{n}',
            'Python (ms)': t_py * 1000,
            'NumPy (ms)': t_np * 1000,
            'Speedup': t_py / t_np,
            'Max Error': max_error,
        })
        print(f'  {n:>3}×{n:<3}: python={t_py*1000:>10.2f}ms, '
              f'numpy={t_np*1000:>8.4f}ms, '
              f'speedup={t_py/t_np:>8.0f}×, error={max_error:.2e}')

    return pd.DataFrame(records)


print('Benchmarking matrix multiplication...')
matmul_df = benchmark_matmul_scaling()
print()
print(matmul_df.to_string(index=False))

The speedup for matrix multiplication is **enormous**, often 1,000× or more for larger
matrices, depending on the machine. This is because NumPy calls an optimized BLAS library
(such as OpenBLAS or MKL) that uses:

- **Cache-aware blocking:** Splits the matrix into tiles that fit in L1/L2 cache
- **SIMD instructions:** Processes several values per CPU instruction (for example, 4 float64
  or 8 float32 values with 256-bit AVX)
- **Multi-threading:** Uses multiple CPU cores in parallel

This is why `np.dot()` or `@` should **always** be used instead of manual loops for
matrix operations.
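
The summation formula from this section can be checked by hand on a 2×2 case. The sketch below computes one entry term by term and confirms that `@`, `np.dot`, and `np.matmul` agree for 2-D arrays.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# C[0, 1] by the summation formula: A[0,0]*B[0,1] + A[0,1]*B[1,1] = 1*6 + 2*8
c01_by_hand = A[0, 0] * B[0, 1] + A[0, 1] * B[1, 1]

C_operator = A @ B
C_dot = np.dot(A, B)
C_matmul = np.matmul(A, B)

print(c01_by_hand)   # 22.0
print(C_operator)    # [[19. 22.] [43. 50.]]
```

For arrays with more than two dimensions, `np.dot` and `@` follow different rules, so `@` is usually the safer default for batched matrix products.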

### 1.6 Broadcasting: Vectorization Without Shape Matching

One of NumPy's most powerful features is **broadcasting** — the ability to perform
operations on arrays with different shapes. Broadcasting eliminates the need to
manually replicate data, saving both memory and computation time.

#### Broadcasting Rules

When operating on two arrays, NumPy compares their shapes element-wise, starting from
the trailing (rightmost) dimensions:

1. If two dimensions are equal, they're compatible
2. If one of them is 1, it's broadcast (stretched) to match the other
3. If neither is 1 and they differ, it's an error

If the arrays have different numbers of dimensions, the shape of the array with fewer
dimensions is padded with size-1 dimensions on the left.
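
These rules can be checked without creating any data. The sketch below uses `np.broadcast_shapes`, which assumes NumPy 1.20 or newer; it returns the resulting shape or raises `ValueError` when the shapes are incompatible.

```python
import numpy as np

# Compatible pairs: each trailing dimension is equal or one of them is 1
row_plus_col = np.broadcast_shapes((1, 3), (3, 1))
matrix_plus_bias = np.broadcast_shapes((3, 4), (4,))
batch_times_scale = np.broadcast_shapes((2, 3, 4), (4,))

print(row_plus_col)        # (3, 3)
print(matrix_plus_bias)    # (3, 4)
print(batch_times_scale)   # (2, 3, 4)

# Incompatible: trailing dimensions are 4 and 3, and neither is 1
try:
    np.broadcast_shapes((3, 4), (3,))
    compatible = True
except ValueError:
    compatible = False

print(compatible)          # False
```

To add a length-3 vector down the rows of a `(3, 4)` matrix, reshape it to `(3, 1)` first so the trailing dimensions line up.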

In [None]:
def demonstrate_broadcasting() -> None:
    """Show broadcasting rules with concrete examples and shape annotations."""
    print('=== Broadcasting Examples ===')
    print()

    # Example 1: Scalar + Array
    a1 = np.array([1, 2, 3, 4])
    b1 = 10
    result1 = a1 + b1
    print(f'1. Scalar + Array')
    print(f'   {a1} + {b1} = {result1}')
    print(f'   Shapes: {a1.shape} + () → {result1.shape}')
    print()

    # Example 2: Row vector + Column vector → Matrix
    row = np.array([[1, 2, 3]])          # Shape: (1, 3)
    col = np.array([[10], [20], [30]])   # Shape: (3, 1)
    result2 = row + col
    print(f'2. Row vector + Column vector → Matrix')
    print(f'   row shape: {row.shape}, col shape: {col.shape}')
    print(f'   Result shape: {result2.shape}')
    print(f'   Result:\n{result2}')
    print()

    # Example 3: Matrix + Row vector (common in ML: adding bias)
    matrix = np.ones((3, 4))
    bias = np.array([1, 2, 3, 4])       # Shape: (4,)
    result3 = matrix + bias
    print(f'3. Matrix + Row vector (bias addition)')
    print(f'   matrix shape: {matrix.shape}, bias shape: {bias.shape}')
    print(f'   Result shape: {result3.shape}')
    print(f'   Result:\n{result3}')
    print()

    # Example 4: 3D tensor broadcasting (batch operations)
    batch = np.random.randn(2, 3, 4)     # (batch, rows, cols)
    scale = np.array([1, 2, 3, 4])       # (cols,)
    result4 = batch * scale
    print(f'4. Batch tensor × Row vector')
    print(f'   batch shape: {batch.shape}, scale shape: {scale.shape}')
    print(f'   Result shape: {result4.shape}')
    print(f'   (Each row in each batch sample is scaled element-wise)')


demonstrate_broadcasting()

#### Broadcasting Speed: Why It Matters

Broadcasting isn't just convenient — it's also **faster** than the manual alternative.
Without broadcasting, you'd have to use `np.tile()` or `np.repeat()` to manually expand
the smaller array, which allocates a large temporary copy. Broadcasting avoids this copy.

In [None]:
# Compare: manual expansion vs broadcasting
np.random.seed(SEED)
n_rows, n_cols = 10_000, 1_000
matrix = np.random.randn(n_rows, n_cols)
bias = np.random.randn(n_cols)

def add_with_tile() -> np.ndarray:
    """Add bias to matrix by tiling the bias vector."""
    bias_tiled = np.tile(bias, (n_rows, 1))  # Allocates a full copy
    return matrix + bias_tiled

def add_with_broadcast() -> np.ndarray:
    """Add bias to matrix using broadcasting (no copy)."""
    return matrix + bias

t_tile, _ = measure_time(add_with_tile)
t_bcast, _ = measure_time(add_with_broadcast)

# Verify same result
assert np.allclose(add_with_tile(), add_with_broadcast())

print(f'Matrix shape: {matrix.shape}, Bias shape: {bias.shape}')
print(f'np.tile + add:  {t_tile*1000:.2f} ms')
print(f'Broadcasting:   {t_bcast*1000:.2f} ms')
print(f'Speedup: {t_tile/t_bcast:.1f}×')
print(f'Memory saved: {n_rows * n_cols * 8 / 1024 / 1024:.1f} MB '
      f'(avoided tiling a {n_rows}×{n_cols} temporary array)')
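
One way to see why broadcasting needs no copy is to inspect strides. The sketch below uses `np.broadcast_to`, which returns a read-only view whose stride along the repeated axis is zero, and contrasts it with the real copy made by `np.tile`. The small shapes are chosen only for illustration.

```python
import numpy as np

bias = np.arange(4, dtype=np.float64)      # shape (4,)
virtual = np.broadcast_to(bias, (3, 4))    # read-only view, no data copied
tiled = np.tile(bias, (3, 1))              # real (3, 4) copy

print(virtual.strides)                     # (0, 8): stepping a row moves 0 bytes
print(tiled.strides)                       # (32, 8): each row has its own storage
print(np.shares_memory(bias, virtual))     # True
print(np.shares_memory(bias, tiled))       # False
```

A zero stride means every "row" of the broadcast view reads the same four values in memory, which is what an expression like `matrix + bias` relies on internally.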

### 1.7 Memory Layout: Row-Major vs Column-Major

Computer memory is a flat 1D sequence of bytes. A 2D array must be **linearized** into
this flat memory. There are two conventions:

- **Row-major (C order):** Rows are stored contiguously. Element `A[i][j+1]` is next to
  `A[i][j]` in memory. This is NumPy's default.
- **Column-major (Fortran order):** Columns are stored contiguously. Element `A[i+1][j]`
  is next to `A[i][j]` in memory.

When you access elements sequentially in memory, CPU caches work efficiently (a cache
line is typically 64 bytes = 8 float64 values). When you access elements with large strides
(jumping across rows/columns), cache misses slow things down dramatically.
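
The two layouts can be inspected directly through the `strides` attribute, which gives the number of bytes to step along each axis. A minimal sketch on a small float64 array:

```python
import numpy as np

c_arr = np.zeros((3, 4), dtype=np.float64, order='C')   # row-major
f_arr = np.zeros((3, 4), dtype=np.float64, order='F')   # column-major

# C order: next column is 8 bytes away, next row is 4 * 8 = 32 bytes away
print(c_arr.strides)                  # (32, 8)
# F order: next row is 8 bytes away, next column is 3 * 8 = 24 bytes away
print(f_arr.strides)                  # (8, 24)

print(c_arr.flags['C_CONTIGUOUS'])    # True
print(f_arr.flags['F_CONTIGUOUS'])    # True
```

The axis with the smallest stride is the one that is contiguous in memory, and iterating along it is the cache-friendly direction.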

In [None]:
def benchmark_memory_layout() -> pd.DataFrame:
    """Benchmark row-wise vs column-wise operations on C and F order arrays.

    Returns:
        DataFrame with timing results.
    """
    n = 5_000
    np.random.seed(SEED)

    c_array = np.ascontiguousarray(np.random.randn(n, n))  # Row-major
    f_array = np.asfortranarray(c_array)                     # Column-major

    # Verify same data
    assert np.allclose(c_array, f_array)

    records: list[dict] = []

    # Row-wise sum on each layout
    t_c_row, _ = measure_time(lambda: c_array.sum(axis=1))
    t_f_row, _ = measure_time(lambda: f_array.sum(axis=1))
    records.append({
        'Operation': 'Row sum (axis=1)',
        'C-order (ms)': t_c_row * 1000,
        'F-order (ms)': t_f_row * 1000,
        'Winner': 'C-order' if t_c_row < t_f_row else 'F-order',
    })

    # Column-wise sum on each layout
    t_c_col, _ = measure_time(lambda: c_array.sum(axis=0))
    t_f_col, _ = measure_time(lambda: f_array.sum(axis=0))
    records.append({
        'Operation': 'Column sum (axis=0)',
        'C-order (ms)': t_c_col * 1000,
        'F-order (ms)': t_f_col * 1000,
        'Winner': 'C-order' if t_c_col < t_f_col else 'F-order',
    })

    # Row-wise mean
    t_c_rmean, _ = measure_time(lambda: c_array.mean(axis=1))
    t_f_rmean, _ = measure_time(lambda: f_array.mean(axis=1))
    records.append({
        'Operation': 'Row mean (axis=1)',
        'C-order (ms)': t_c_rmean * 1000,
        'F-order (ms)': t_f_rmean * 1000,
        'Winner': 'C-order' if t_c_rmean < t_f_rmean else 'F-order',
    })

    # Column-wise mean
    t_c_cmean, _ = measure_time(lambda: c_array.mean(axis=0))
    t_f_cmean, _ = measure_time(lambda: f_array.mean(axis=0))
    records.append({
        'Operation': 'Column mean (axis=0)',
        'C-order (ms)': t_c_cmean * 1000,
        'F-order (ms)': t_f_cmean * 1000,
        'Winner': 'C-order' if t_c_cmean < t_f_cmean else 'F-order',
    })

    return pd.DataFrame(records)


layout_df = benchmark_memory_layout()
print(f'Memory Layout Benchmark (5000×5000 matrix):')
print(layout_df.to_string(index=False))

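**Takeaway:** Operations that walk memory contiguously are generally faster, so C-order
(row-major) tends to favor row-wise access and F-order (column-major) tends to favor
column-wise access. NumPy's reduction loops are heavily optimized, so the gap for simple
sums and means may be small; check the `Winner` column on your own machine. Since ML data
is typically stored as *samples × features* (rows = samples), NumPy's default C-order is
usually the right choice.

When you see unexpectedly slow operations, check whether you're accessing data along the
'wrong' axis relative to the memory layout.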

### 1.8 Views vs Copies: Understanding Memory Ownership

NumPy operations can return either a **view** (shared memory with the original) or a
**copy** (independent memory). Understanding this distinction is critical for both
correctness and performance:

- **Views** are fast (no data copying) but mutations affect the original
- **Copies** are safe (independent data) but use extra memory and time

In [None]:
def demonstrate_views_and_copies() -> None:
    """Show which operations create views vs copies and their implications."""
    original = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    print(f'Original: {original}')
    print()

    # Slicing creates a VIEW
    view_slice = original[2:6]
    print(f'Slice original[2:6]: {view_slice}')
    print(f'  Is a view? {view_slice.base is original}')
    print(f'  Shares memory? {np.shares_memory(original, view_slice)}')

    # Modifying the view changes the original!
    view_slice[0] = 999
    print(f'  After view_slice[0] = 999:')
    print(f'    view_slice: {view_slice}')
    print(f'    original:   {original}  ← ALSO CHANGED!')
    original[2] = 3  # Restore
    print()

    # Fancy indexing creates a COPY
    fancy_idx = original[[0, 3, 7]]
    print(f'Fancy index original[[0, 3, 7]]: {fancy_idx}')
    print(f'  Shares memory? {np.shares_memory(original, fancy_idx)}')
    fancy_idx[0] = 999
    print(f'  After fancy_idx[0] = 999:')
    print(f'    fancy_idx: {fancy_idx}')
    print(f'    original:  {original}  ← NOT changed')
    print()

    # Reshape creates a VIEW (when possible)
    matrix_view = original.reshape(2, 5)
    print(f'Reshape to (2, 5):')
    print(f'  Shares memory? {np.shares_memory(original, matrix_view)}')
    print(f'  Strides: original={original.strides}, reshaped={matrix_view.strides}')
    print()

    # .copy() forces a copy
    explicit_copy = original.copy()
    print(f'Explicit .copy():')
    print(f'  Shares memory? {np.shares_memory(original, explicit_copy)}')
    print()

    # Summary table
    operations_table = pd.DataFrame({
        'Operation': ['Slicing [a:b]', 'Reshape', 'Transpose (.T)',
                      'Fancy indexing [list]', 'Boolean indexing [mask]',
                      '.flatten()', '.ravel()', '.copy()'],
        'Returns': ['View', 'View*', 'View', 'Copy', 'Copy',
                    'Copy', 'View*', 'Copy'],
        'Note': ['Always a view', 'View when strides allow; copy otherwise',
                 'Always a view', 'Always a copy',
                 'Always a copy', 'Always a copy',
                 'View when contiguous; copy otherwise',
                 'Explicit copy'],
    })
    print('=== View vs Copy Summary ===')
    print(operations_table.to_string(index=False))


demonstrate_views_and_copies()

Understanding views is essential for two reasons:
1. **Performance:** Avoid unnecessary copies when working with large datasets
2. **Correctness:** Know when modifying a slice will affect the original data
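
The "View*" cases are worth checking directly. The sketch below uses `np.shares_memory` to show that `reshape` returns a view when the existing strides allow it and silently falls back to a copy when they do not, as happens after a transpose.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

same_layout = a.reshape(3, 2)    # compatible with the existing layout: view
transposed = a.T                 # view with swapped strides
flat_from_t = a.T.reshape(-1)    # transposed data is not contiguous: copy
raveled = a.ravel()              # contiguous input: view
flattened = a.flatten()          # always a copy

print(np.shares_memory(a, same_layout))   # True
print(np.shares_memory(a, transposed))    # True
print(np.shares_memory(a, flat_from_t))   # False
print(np.shares_memory(a, raveled))       # True
print(np.shares_memory(a, flattened))     # False
print(flat_from_t)                        # [0 3 1 4 2 5]
```

When in doubt, `np.shares_memory` is a cheap check to run before mutating the result of a reshape or ravel.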

---
## Part 2 — Putting It All Together: BenchmarkSuite Class

We've built individual benchmarking functions. Now let's combine them into a reusable
`BenchmarkSuite` class that can systematically compare any set of approaches across
multiple array sizes and produce publication-quality visualizations.

In [None]:
class BenchmarkSuite:
    """Systematic benchmarking tool for comparing numerical computation approaches.

    Supports registering multiple implementations for the same operation,
    running them across various input sizes, and generating comparison
    tables and plots.

    Attributes:
        results: Dictionary mapping experiment names to DataFrames of results.
    """

    def __init__(self) -> None:
        """Initialize with empty results."""
        self.results: dict[str, pd.DataFrame] = {}

    def run_scaling_benchmark(
        self,
        name: str,
        approaches: dict[str, callable],
        data_factory: callable,
        sizes: list[int],
        num_warmup: int = NUM_WARMUP,
        num_timed: int = NUM_TIMED,
    ) -> pd.DataFrame:
        """Run a benchmark across multiple sizes for several approaches.

        Args:
            name: Name for this benchmark experiment.
            approaches: Dictionary mapping approach names to callables.
                Each callable takes data_factory output and returns a
                zero-argument callable to benchmark.
            data_factory: Callable that takes (size,) and returns data
                to pass to each approach.
            sizes: List of input sizes to benchmark.
            num_warmup: Warmup runs before timing.
            num_timed: Timed runs to average.

        Returns:
            DataFrame with timing results.
        """
        records: list[dict] = []

        for size in sizes:
            data = data_factory(size)
            row: dict[str, object] = {'Size': size}

            for approach_name, make_func in approaches.items():
                func = make_func(data)
                mean_t, std_t = measure_time(func, num_warmup, num_timed)
                row[f'{approach_name} (ms)'] = mean_t * 1000
                row[f'{approach_name} (std)'] = std_t * 1000

            records.append(row)

        df = pd.DataFrame(records)
        self.results[name] = df
        return df

    def plot_scaling(
        self,
        name: str,
        approach_names: list[str] | None = None,
        title: str | None = None,
    ) -> None:
        """Plot log-log scaling curves for a benchmark experiment.

        Args:
            name: Name of the benchmark to plot.
            approach_names: Approaches to include. None plots all.
            title: Plot title. None uses the experiment name.
        """
        df = self.results[name]
        if approach_names is None:
            approach_names = [c.replace(' (ms)', '')
                             for c in df.columns if c.endswith(' (ms)')]

        colors = ['#1E88E5', '#E53935', '#43A047', '#FF9800', '#9C27B0', '#795548']
        markers = ['o', 's', '^', 'D', 'v', 'p']

        fig, ax = plt.subplots(figsize=(10, 6))
        for idx, approach in enumerate(approach_names):
            col = f'{approach} (ms)'
            if col in df.columns:
                ax.loglog(df['Size'], df[col],
                         f'{markers[idx % len(markers)]}-',
                         label=approach,
                         color=colors[idx % len(colors)],
                         linewidth=2, markersize=7)

        ax.set_xlabel('Input Size (n)')
        ax.set_ylabel('Time (ms, log scale)')
        ax.set_title(title or f'Scaling: {name}')
        ax.legend()
        ax.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

    def summary_table(self, name: str) -> pd.DataFrame:
        """Return a summary of the benchmark with speedup ratios.

        The first approach is treated as the baseline.

        Args:
            name: Name of the benchmark to summarize.

        Returns:
            DataFrame with timing and speedup columns.
        """
        df = self.results[name].copy()
        time_cols = [c for c in df.columns if c.endswith(' (ms)')]
        baseline_col = time_cols[0]

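        baseline_name = baseline_col.replace(' (ms)', '')
        for col in time_cols[1:]:
            approach_name = col.replace(' (ms)', '')
            # Ratio > 1 means this approach is faster than the baseline
            df[f'{approach_name} speedup vs {baseline_name}'] = df[baseline_col] / df[col]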

        # Drop std columns for readability
        drop_cols = [c for c in df.columns if c.endswith(' (std)')]
        return df.drop(columns=drop_cols)

Let's verify the suite works by running the addition benchmark through it.

In [None]:
suite = BenchmarkSuite()

# Define the data factory and approaches
def addition_data_factory(size: int) -> dict:
    """Create test data for addition benchmarks.

    Args:
        size: Number of elements.

    Returns:
        Dictionary with list and array versions of data.
    """
    np.random.seed(SEED)
    arr_a = np.random.randn(size)
    arr_b = np.random.randn(size)
    return {
        'list_a': arr_a.tolist(),
        'list_b': arr_b.tolist(),
        'arr_a': arr_a,
        'arr_b': arr_b,
    }

addition_approaches = {
    'Python Loop': lambda d: lambda: python_add(d['list_a'], d['list_b']),
    'List Comp': lambda d: lambda: python_add_comprehension(d['list_a'], d['list_b']),
    'NumPy': lambda d: lambda: d['arr_a'] + d['arr_b'],
}

print('Running addition scaling benchmark...')
suite.run_scaling_benchmark(
    name='Element-wise Addition',
    approaches=addition_approaches,
    data_factory=addition_data_factory,
    sizes=[100, 1_000, 10_000, 100_000, 1_000_000],
)

suite.plot_scaling('Element-wise Addition')
print(suite.summary_table('Element-wise Addition').to_string(index=False))
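
One way to read a log-log scaling plot numerically is to fit a straight line to log(time) against log(size): the slope estimates the exponent k in time ≈ c·nᵏ, and is about 1 for linear-time operations. The sketch below uses synthetic, made-up timings purely to illustrate the calculation (they are not measurements), and `scaling_exponent` is a hypothetical helper rather than part of `BenchmarkSuite`.

```python
import numpy as np


def scaling_exponent(sizes, times_ms):
    """Estimate k in time ≈ c * n**k from a least-squares fit in log-log space."""
    slope, _intercept = np.polyfit(np.log(sizes), np.log(times_ms), deg=1)
    return float(slope)


# Synthetic timings (NOT measured): one exactly linear in n, and one with a
# constant per-call overhead that flattens the curve at small sizes.
sizes = np.array([100, 1_000, 10_000, 100_000, 1_000_000], dtype=float)
linear_ms = 2e-6 * sizes
with_overhead_ms = 1e-3 + 2e-6 * sizes

print(f'pure linear:   k = {scaling_exponent(sizes, linear_ms):.2f}')
print(f'with overhead: k = {scaling_exponent(sizes, with_overhead_ms):.2f}')
```

A fitted exponent well below 1 usually means fixed per-call overhead dominates at the small end of the size range, which is one reason a vectorized curve can look flat on the left of such a plot.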

---
## Part 3 — PyTorch Tensors: The Bridge to GPU Computing

So far we've compared Python loops vs NumPy. But in deep learning, we use **PyTorch
tensors** — which are similar to NumPy arrays but can also run on GPUs.

Let's bring PyTorch into the picture and measure how it compares to NumPy on CPU,
and whether GPU acceleration (if available) provides further speedup.

In [None]:
import torch

print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name(0)}')

# Set PyTorch seed
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

### 3.1 NumPy Arrays vs PyTorch Tensors

PyTorch tensors and NumPy arrays share many similarities — they're both typed,
multi-dimensional arrays stored in contiguous memory. But there are key differences:

| Feature | NumPy ndarray | PyTorch Tensor |
|---------|--------------|----------------|
| Backend | CPU only | CPU + GPU |
| Autograd | No | Yes (gradient tracking) |
| Default dtype | float64 | **float32** |
| Memory format | Row-major | Row-major (optional `channels_last` for images) |
| Ecosystem | Scientific computing | Deep learning |

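The default dtype difference (float64 vs float32) is important: deep learning almost
always uses float32, which takes half the memory of float64 and is often faster (up to
about 2× for memory-bound operations), because twice as many values fit in each cache
line and SIMD register.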
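
The practical consequence of the differing defaults can be checked directly in NumPy. The snippet below is a small sketch with illustrative variable names; it shows the default dtype, the memory halving after an explicit cast, and the silent promotion back to float64 when the two dtypes are mixed.

```python
import numpy as np

a64 = np.random.default_rng(0).standard_normal(1_000)  # NumPy default: float64
a32 = a64.astype(np.float32)                            # explicit downcast

print(a64.dtype, a64.nbytes)   # float64 8000
print(a32.dtype, a32.nbytes)   # float32 4000

# Mixing the two silently promotes the result back to float64.
mixed = a32 + a64
print(mixed.dtype)             # float64
```

The same care applies when crossing into PyTorch: `torch.from_numpy` keeps the array's dtype, so a default NumPy array becomes a float64 tensor unless it is cast first.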

In [None]:
# Create equivalent arrays and tensors
np.random.seed(SEED)
data_np = np.random.randn(1_000_000).astype(np.float32)
data_torch = torch.from_numpy(data_np.copy())  # Copy to avoid shared memory

print(f'NumPy:   dtype={data_np.dtype}, shape={data_np.shape}, '
      f'nbytes={data_np.nbytes / 1024 / 1024:.1f} MB')
print(f'PyTorch: dtype={data_torch.dtype}, shape={data_torch.shape}, '
      f'nbytes={data_torch.element_size() * data_torch.nelement() / 1024 / 1024:.1f} MB')
print()

# Verify they contain the same data
print(f'First 5 values match: {np.allclose(data_np[:5], data_torch[:5].numpy())}')

### 3.2 Zero-Copy Conversion Between NumPy and PyTorch

NumPy and PyTorch can share memory through zero-copy conversions. For CPU tensors,
converting between them is essentially free because no data is copied. The flip side
is that a write through one object is visible through the other.

In [None]:
def demonstrate_zero_copy() -> None:
    """Show zero-copy conversion between NumPy and PyTorch."""
    # NumPy → PyTorch (shared memory)
    np_arr = np.array([1.0, 2.0, 3.0])
    torch_tensor = torch.from_numpy(np_arr)  # Zero copy!

    print('Zero-copy NumPy → PyTorch:')
    print(f'  np_arr:      {np_arr}')
    print(f'  torch_tensor: {torch_tensor}')

    # Modifying one affects the other (shared memory)
    torch_tensor[0] = 999.0
    print(f'  After torch_tensor[0] = 999:')
    print(f'    np_arr:      {np_arr}  ← ALSO CHANGED (shared memory)')
    np_arr[0] = 1.0  # Restore
    print()

    # PyTorch → NumPy (shared memory on CPU)
    t = torch.tensor([10.0, 20.0, 30.0])
    n = t.numpy()  # Zero copy on CPU
    print('Zero-copy PyTorch → NumPy:')
    print(f'  tensor: {t}')
    print(f'  numpy:  {n}')
    print(f'  Shares memory: {np.shares_memory(n, t.numpy())}')
    print()

    # Safe copy (independent memory)
    safe_copy = torch.from_numpy(np_arr.copy())  # .copy() breaks sharing
    safe_copy[0] = 999.0
    print('Safe copy (no sharing):')
    print(f'  np_arr unchanged: {np_arr}')


demonstrate_zero_copy()
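
Not every conversion path shares memory. The sketch below (variable names are illustrative) contrasts `torch.from_numpy`, `torch.tensor`, and `torch.as_tensor`, and shows that a tensor tracking gradients has to be detached before `.numpy()` will accept it.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)

shared = torch.from_numpy(arr)   # shares memory with arr
copied = torch.tensor(arr)       # always copies
maybe = torch.as_tensor(arr)     # shares when no dtype/device change is needed

arr[0] = -1.0
print(shared[0].item(), copied[0].item(), maybe[0].item())  # -1.0 1.0 -1.0

# A tensor that tracks gradients must be detached before conversion.
w = torch.ones(3, requires_grad=True)
try:
    w.numpy()
except RuntimeError as err:
    print('numpy() refused:', type(err).__name__)
w_np = w.detach().numpy()        # shares memory with w's data
```

As a rule of thumb, reach for `torch.tensor` when an independent copy is wanted and `torch.from_numpy` or `torch.as_tensor` when avoiding the copy matters.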

### 3.3 CPU Speed: NumPy vs PyTorch

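On CPU, NumPy and PyTorch should perform similarly: both run element-wise operations
in compiled, SIMD-vectorized loops and delegate dot products and matrix multiplication
to an optimized BLAS library. Let's verify this across common operations.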

In [None]:
def benchmark_numpy_vs_pytorch_cpu() -> pd.DataFrame:
    """Compare NumPy and PyTorch speed on CPU across common operations.

    Returns:
        DataFrame with timing results.
    """
    np.random.seed(SEED)
    n = 1_000_000

    # Create matching data
    np_a = np.random.randn(n).astype(np.float32)
    np_b = np.random.randn(n).astype(np.float32)
    np_mat = np.random.randn(1000, 1000).astype(np.float32)

    pt_a = torch.from_numpy(np_a.copy())
    pt_b = torch.from_numpy(np_b.copy())
    pt_mat = torch.from_numpy(np_mat.copy())

    operations = [
        ('Addition (1M)',
         lambda: np_a + np_b,
         lambda: pt_a + pt_b),
        ('Multiplication (1M)',
         lambda: np_a * np_b,
         lambda: pt_a * pt_b),
        ('Dot Product (1M)',
         lambda: np.dot(np_a, np_b),
         lambda: torch.dot(pt_a, pt_b)),
        ('Sum (1M)',
         lambda: np.sum(np_a),
         lambda: torch.sum(pt_a)),
        ('Exp (1M)',
         lambda: np.exp(np_a),
         lambda: torch.exp(pt_a)),
        ('MatMul (1000×1000)',
         lambda: np_mat @ np_mat,
         lambda: pt_mat @ pt_mat),
    ]

    records: list[dict] = []
    for name, np_func, pt_func in operations:
        t_np, _ = measure_time(np_func)
        t_pt, _ = measure_time(pt_func)
        records.append({
            'Operation': name,
            'NumPy (ms)': t_np * 1000,
            'PyTorch CPU (ms)': t_pt * 1000,
            'Ratio (NP/PT)': t_np / t_pt,
        })

    return pd.DataFrame(records)


print('NumPy vs PyTorch CPU Benchmark (float32):')
cpu_df = benchmark_numpy_vs_pytorch_cpu()
print(cpu_df.to_string(index=False))

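On CPU with the same dtype (float32), NumPy and PyTorch usually land within a small
factor of each other. The differences come from the BLAS library each one links
against, from threading (PyTorch parallelizes large element-wise operations across
cores by default, while NumPy's element-wise loops are single-threaded), and from
per-call overhead. For the purposes of this course they're interchangeable on CPU.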

The key advantage of PyTorch comes when you need:
- **GPU acceleration** (see below if CUDA is available)
- **Automatic differentiation** (Module 5: Backpropagation)
- **Integration with the DL ecosystem** (data loaders, models, etc.)

### 3.4 GPU Acceleration (If Available)

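If a CUDA GPU is available, let's measure the speedup from moving computation
to the GPU. Every GPU operation pays a fixed kernel-launch cost, and data that starts
on the CPU must also be transferred to the device, so small operations may actually be
slower on GPU. In the benchmark below the tensors are moved to the GPU before timing,
so the measurements cover launch overhead and compute only, not the transfer.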

In [None]:
def benchmark_gpu_if_available() -> pd.DataFrame | None:
    """Benchmark CPU vs GPU for various operations.

    Returns:
        DataFrame with the vector-addition results (the matrix-multiplication
        table is printed but not returned), or None if no GPU is available.
    """
    if not torch.cuda.is_available():
        print('No CUDA GPU available. Skipping GPU benchmark.')
        print('When running on a machine with a GPU (e.g., Colab with GPU runtime),')
        print('this section will show significant speedups for large operations.')
        return None

    device_gpu = torch.device('cuda')
    torch.manual_seed(SEED)

    sizes = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
    records: list[dict] = []

    for n in sizes:
        cpu_a = torch.randn(n)
        cpu_b = torch.randn(n)
        gpu_a = cpu_a.to(device_gpu)
        gpu_b = cpu_b.to(device_gpu)

        # CPU timing
        t_cpu, _ = measure_time(lambda: cpu_a + cpu_b)

        # GPU timing (include synchronization)
        def gpu_add() -> None:
            """GPU addition with synchronization."""
            _ = gpu_a + gpu_b
            torch.cuda.synchronize()

        t_gpu, _ = measure_time(gpu_add)

        records.append({
            'Size': n,
            'CPU (ms)': t_cpu * 1000,
            'GPU (ms)': t_gpu * 1000,
            'Speedup': t_cpu / t_gpu,
        })

    df = pd.DataFrame(records)

    # Also benchmark matrix multiplication
    mat_sizes = [256, 512, 1024, 2048, 4096]
    mat_records: list[dict] = []
    for n in mat_sizes:
        cpu_mat = torch.randn(n, n)
        gpu_mat = cpu_mat.to(device_gpu)

        t_cpu, _ = measure_time(lambda: cpu_mat @ cpu_mat)

        def gpu_matmul() -> None:
            """GPU matmul with synchronization."""
            _ = gpu_mat @ gpu_mat
            torch.cuda.synchronize()

        t_gpu, _ = measure_time(gpu_matmul)
        mat_records.append({
            'Size': f'{n}×{n}',
            'CPU (ms)': t_cpu * 1000,
            'GPU (ms)': t_gpu * 1000,
            'Speedup': t_cpu / t_gpu,
        })

    mat_df = pd.DataFrame(mat_records)

    print('Vector Addition: CPU vs GPU')
    print(df.to_string(index=False))
    print()
    print('Matrix Multiplication: CPU vs GPU')
    print(mat_df.to_string(index=False))

    return df


gpu_df = benchmark_gpu_if_available()

**GPU performance notes:**
- Small operations are **slower** on GPU due to kernel launch overhead (~5–10 µs per operation)
- GPU wins for large arrays and especially matrix multiplication
- The speedup increases with matrix size because GPUs have massive parallelism (thousands of cores)
- Always remember to call `torch.cuda.synchronize()` when timing GPU operations — GPU
  operations are asynchronous by default
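
The synchronization point in the last note can be folded into a small timing helper. This is a minimal sketch, separate from the notebook's `measure_time`: `time_op` and its `sync` parameter are hypothetical names chosen for illustration, and the GPU call in the trailing comment assumes PyTorch with a CUDA device.

```python
import time
from typing import Callable, Optional


def time_op(fn: Callable[[], object],
            sync: Optional[Callable[[], None]] = None,
            repeats: int = 5) -> float:
    """Return the best wall-clock time (seconds) of fn over `repeats` runs.

    `sync` is called before starting the clock and again before stopping it,
    so asynchronously queued work is included in the measurement.
    """
    best = float('inf')
    for _ in range(repeats):
        if sync is not None:
            sync()                      # drain work queued earlier
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                      # wait for fn's work to finish
        best = min(best, time.perf_counter() - start)
    return best


# CPU usage needs no sync hook.
elapsed = time_op(lambda: sum(range(100_000)))
print(f'{elapsed * 1e3:.3f} ms')

# On a CUDA machine (assumption: torch is installed and a GPU is present):
#   time_op(lambda: gpu_a + gpu_b, sync=torch.cuda.synchronize)
```

Without the second `sync()` call, the timer would mostly capture how long it takes to enqueue the kernel rather than how long the kernel takes to run.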

---
## Part 4 — Evaluation & Analysis

Let's run a comprehensive set of benchmarks to build a complete picture of performance
across all the approaches and operations we've studied.

### 4.1 Comprehensive Operation Benchmark

We'll benchmark a wide range of operations commonly used in ML, comparing Python
loops, NumPy, and PyTorch on CPU.

In [None]:
def comprehensive_benchmark() -> pd.DataFrame:
    """Run a comprehensive benchmark of common ML operations.

    Returns:
        DataFrame with timing results for all approaches and operations.
    """
    np.random.seed(SEED)
    n = 500_000

    # Prepare data in all formats
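    # .tolist() yields plain Python floats; list(arr) would keep np.float64
    # scalars, which makes the pure-Python baseline artificially slow.
    list_a = np.random.randn(n).tolist()
    list_b = np.random.randn(n).tolist()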
    np_a = np.array(list_a, dtype=np.float32)
    np_b = np.array(list_b, dtype=np.float32)
    pt_a = torch.tensor(np_a)
    pt_b = torch.tensor(np_b)

    # Matrix data: all three approaches multiply the same 50×50 matrices so the
    # comparison is like-for-like. Pure-Python matmul is O(n³), so 500×500
    # would take far too long to time repeatedly.
    np_mat_a = np.random.randn(50, 50).astype(np.float32)
    np_mat_b = np.random.randn(50, 50).astype(np.float32)
    pt_mat_a = torch.from_numpy(np_mat_a.copy())
    pt_mat_b = torch.from_numpy(np_mat_b.copy())
    list_mat_a = np_mat_a.tolist()  # Converted once, outside the timed call
    list_mat_b = np_mat_b.tolist()

    benchmarks = [
        ('Vector Add (500K)',
         lambda: python_add(list_a, list_b),
         lambda: np_a + np_b,
         lambda: pt_a + pt_b),
        ('Vector Mul (500K)',
         lambda: python_elementwise_mul(list_a, list_b),
         lambda: np_a * np_b,
         lambda: pt_a * pt_b),
        ('Dot Product (500K)',
         lambda: python_dot(list_a, list_b),
         lambda: np.dot(np_a, np_b),
         lambda: torch.dot(pt_a, pt_b)),
        ('Sum (500K)',
         lambda: python_sum(list_a),
         lambda: np.sum(np_a),
         lambda: torch.sum(pt_a)),
        ('MatMul (50×50)',
         lambda: python_matmul(list_mat_a, list_mat_b),
         lambda: np_mat_a @ np_mat_b,
         lambda: pt_mat_a @ pt_mat_b),
    ]

    records: list[dict] = []
    for name, py_func, np_func, pt_func in benchmarks:
        t_py, _ = measure_time(py_func, num_warmup=1, num_timed=3)
        t_np, _ = measure_time(np_func)
        t_pt, _ = measure_time(pt_func)
        records.append({
            'Operation': name,
            'Python (ms)': t_py * 1000,
            'NumPy (ms)': t_np * 1000,
            'PyTorch CPU (ms)': t_pt * 1000,
            'Speedup (Py→NP)': t_py / t_np,
            'Speedup (Py→PT)': t_py / t_pt,
        })

    return pd.DataFrame(records)


print('Comprehensive ML Operations Benchmark:')
comprehensive_df = comprehensive_benchmark()
print(comprehensive_df.to_string(index=False))

### 4.2 Visualization: The Full Picture

Let's create a comprehensive visualization showing speedups across all operations
and approaches.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Absolute times (log scale)
x = range(len(comprehensive_df))
width = 0.25
axes[0].bar([i - width for i in x], comprehensive_df['Python (ms)'],
            width, label='Python', color='#E53935')
axes[0].bar(x, comprehensive_df['NumPy (ms)'],
            width, label='NumPy', color='#1E88E5')
axes[0].bar([i + width for i in x], comprehensive_df['PyTorch CPU (ms)'],
            width, label='PyTorch CPU', color='#43A047')
axes[0].set_yscale('log')
axes[0].set_xticks(x)
axes[0].set_xticklabels(comprehensive_df['Operation'], rotation=45, ha='right')
axes[0].set_ylabel('Time (ms, log scale)')
axes[0].set_title('Execution Time by Operation')
axes[0].legend()
axes[0].grid(True, axis='y', alpha=0.3)

# Right: Speedup ratios
axes[1].bar([i - width/2 for i in x], comprehensive_df['Speedup (Py→NP)'],
            width, label='Python → NumPy', color='#1E88E5')
axes[1].bar([i + width/2 for i in x], comprehensive_df['Speedup (Py→PT)'],
            width, label='Python → PyTorch', color='#43A047')
axes[1].set_xticks(x)
axes[1].set_xticklabels(comprehensive_df['Operation'], rotation=45, ha='right')
axes[1].set_ylabel('Speedup Factor')
axes[1].set_title('Speedup over Python Loops')
axes[1].legend()
axes[1].grid(True, axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### 4.3 Performance Guidelines Summary

The table below turns our benchmarks into practical rules of thumb for this course.
The speedup ranges are approximate and depend on hardware, library builds, and the
specific operation.

In [None]:
# Build a decision guide
guidelines = pd.DataFrame({
    'Scenario': [
        'Small data (n < 100)',
        'Medium data (100 ≤ n ≤ 10K)',
        'Large data (n > 10K)',
        'Matrix operations',
        'Need gradients (training)',
        'GPU available + large data',
        'Quick prototyping / EDA',
        'Production ML pipeline',
    ],
    'Recommended': [
        'Python lists (overhead of NumPy may not be worth it)',
        'NumPy (clear speedup over Python)',
        'NumPy or PyTorch (massive speedup)',
        'NumPy/PyTorch (BLAS optimized)',
        'PyTorch (autograd support)',
        'PyTorch on GPU',
        'NumPy + Pandas',
        'PyTorch (ecosystem integration)',
    ],
    'Expected Speedup': [
        '~1× (no benefit)',
        '5–20×',
        '50–200×',
        '100–10,000×',
        'N/A (feature, not speed)',
        '10–100× over CPU',
        'N/A (convenience)',
        'N/A (ecosystem)',
    ],
})

print('=== Performance Decision Guide ===')
print(guidelines.to_string(index=False))
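
The first row of the guide reflects the fixed per-call overhead that NumPy pays regardless of array size. The sketch below measures it with the standard-library `timeit` module. `per_call_us` is a hypothetical helper, and the crossover size varies by machine, so no specific numbers are claimed here.

```python
import timeit

import numpy as np


def per_call_us(stmt, number=2_000, repeat=5):
    """Best-of-`repeat` time per call, in microseconds."""
    best = min(timeit.repeat(stmt, number=number, repeat=repeat))
    return best / number * 1e6


for n in (4, 64, 1_024):
    xs = [float(i) for i in range(n)]
    arr = np.array(xs)
    t_list = per_call_us(lambda: [x * 2.0 for x in xs])
    t_numpy = per_call_us(lambda: arr * 2.0)
    print(f'n={n:>5}: list {t_list:8.3f} µs   numpy {t_numpy:8.3f} µs')
```

Comparing the two columns as n grows shows where the list comprehension's per-element cost overtakes NumPy's roughly constant call overhead on a given machine.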

### 4.4 Common Anti-Patterns and Their Fixes

Let's demonstrate the most common performance mistakes beginners make and how
to fix them. These patterns will appear throughout the course.

In [None]:
def demonstrate_antipatterns() -> None:
    """Show common performance anti-patterns and their vectorized fixes."""
    np.random.seed(SEED)
    n = 100_000
    data = np.random.randn(n)
    matrix = np.random.randn(1000, 100)

    print('=== Anti-Pattern 1: Growing a list in a loop ===')
    # Bad: appending to a list
    def bad_normalize() -> list[float]:
        """Normalize by appending to a list (slow)."""
        mean = sum(data) / len(data)
        std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
        result = []
        for x in data:
            result.append((x - mean) / std)
        return result

    # Good: vectorized
    def good_normalize() -> np.ndarray:
        """Normalize using NumPy vectorization (fast)."""
        return (data - data.mean()) / data.std()

    t_bad, _ = measure_time(bad_normalize, num_warmup=1, num_timed=3)
    t_good, _ = measure_time(good_normalize)
    print(f'  Loop + append: {t_bad*1000:.2f} ms')
    print(f'  Vectorized:    {t_good*1000:.2f} ms')
    print(f'  Speedup:       {t_bad/t_good:.0f}×')
    print()

    print('=== Anti-Pattern 2: Iterating over rows ===')
    # Bad: loop over rows
    def bad_row_means() -> list[float]:
        """Compute row means with a loop (slow)."""
        means = []
        for i in range(matrix.shape[0]):
            means.append(np.mean(matrix[i, :]))
        return means

    # Good: axis parameter
    def good_row_means() -> np.ndarray:
        """Compute row means with axis parameter (fast)."""
        return matrix.mean(axis=1)

    t_bad, _ = measure_time(bad_row_means)
    t_good, _ = measure_time(good_row_means)
    print(f'  Loop over rows:  {t_bad*1000:.2f} ms')
    print(f'  matrix.mean(1):  {t_good*1000:.2f} ms')
    print(f'  Speedup:         {t_bad/t_good:.0f}×')
    print()

    print('=== Anti-Pattern 3: Conditional selection with loops ===')
    # Bad: loop with if
    def bad_filter() -> list[float]:
        """Filter positive values with a loop (slow)."""
        result = []
        for x in data:
            if x > 0:
                result.append(x)
        return result

    # Good: boolean indexing
    def good_filter() -> np.ndarray:
        """Filter positive values with boolean indexing (fast)."""
        return data[data > 0]

    t_bad, _ = measure_time(bad_filter, num_warmup=1, num_timed=3)
    t_good, _ = measure_time(good_filter)
    print(f'  Loop + if:         {t_bad*1000:.2f} ms')
    print(f'  Boolean indexing:  {t_good*1000:.2f} ms')
    print(f'  Speedup:           {t_bad/t_good:.0f}×')
    print()

    print('=== Anti-Pattern 4: Pairwise distances with nested loops ===')
    small_matrix = matrix[:100, :10]  # 100 points, 10 features

    def bad_pairwise_dist() -> np.ndarray:
        """Compute pairwise distances with nested loops (slow)."""
        n = small_matrix.shape[0]
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                diff = small_matrix[i] - small_matrix[j]
                dist[i, j] = np.sqrt(np.sum(diff ** 2))
        return dist

    def good_pairwise_dist() -> np.ndarray:
        """Compute pairwise distances with broadcasting (fast)."""
        # (n, 1, d) - (1, n, d) → (n, n, d) → sum over d → sqrt
        diff = small_matrix[:, np.newaxis, :] - small_matrix[np.newaxis, :, :]
        return np.sqrt(np.sum(diff ** 2, axis=2))

    t_bad, _ = measure_time(bad_pairwise_dist, num_warmup=1, num_timed=2)
    t_good, _ = measure_time(good_pairwise_dist)

    # Verify correctness
    assert np.allclose(bad_pairwise_dist(), good_pairwise_dist(), atol=1e-10)

    print(f'  Nested loops:   {t_bad*1000:.2f} ms')
    print(f'  Broadcasting:   {t_good*1000:.2f} ms')
    print(f'  Speedup:        {t_bad/t_good:.0f}×')


demonstrate_antipatterns()

**These anti-patterns are among the most common sources of slow Python code in ML.**
Throughout this course, we'll use the vectorized ("Good") versions shown above.
When you see a `for` loop over array elements, ask yourself: *can this be vectorized?*
The answer is almost always yes.
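
Vectorizing can trade time for memory: the broadcasting fix for Anti-Pattern 4 builds an intermediate of shape (n, n, d). A common alternative, sketched here for Euclidean distance only, expands ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b so that most of the work becomes a single matrix product. `pairwise_dist_gram` is an illustrative name, not a function from this notebook.

```python
import numpy as np


def pairwise_dist_broadcast(x):
    """Reference version: explicit (n, n, d) difference tensor."""
    diff = x[:, np.newaxis, :] - x[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))


def pairwise_dist_gram(x):
    """Same distances via the Gram matrix; peak memory is O(n²), not O(n²·d)."""
    sq_norms = np.sum(x ** 2, axis=1)                      # shape (n,)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(sq_dist, 0.0, out=sq_dist)                  # clip rounding noise
    return np.sqrt(sq_dist)


rng = np.random.default_rng(0)
points = rng.standard_normal((100, 10))
d_ref = pairwise_dist_broadcast(points)
d_gram = pairwise_dist_gram(points)
print(np.allclose(d_ref, d_gram, atol=1e-6))
```

The identity loses some precision for nearly identical points because two large, almost equal numbers are subtracted, which is why the squared distances are clipped at zero before the square root.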

### 4.5 dtype Impact on Performance

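The choice of data type (float32 vs float64) affects both memory usage and computation
speed. Deep learning almost always uses float32, while scientific computing
traditionally uses float64. The benchmark below also includes float16 for reference;
most CPUs lack native float16 arithmetic, so NumPy typically runs it slower than
float32 despite the smaller size.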

In [None]:
def benchmark_dtypes() -> pd.DataFrame:
    """Compare performance across different dtypes.

    Returns:
        DataFrame with timing and memory results.
    """
    np.random.seed(SEED)
    n = 5_000_000
    base = np.random.randn(n)

    dtypes = [np.float16, np.float32, np.float64]
    records: list[dict] = []

    for dtype in dtypes:
        arr = base.astype(dtype)
        arr2 = np.random.randn(n).astype(dtype)

        t_add, _ = measure_time(lambda: arr + arr2)
        t_mul, _ = measure_time(lambda: arr * arr2)
        t_sum, _ = measure_time(lambda: np.sum(arr))

        records.append({
            'dtype': str(dtype.__name__),
            'Bytes/element': np.dtype(dtype).itemsize,
            'Array MB': arr.nbytes / 1024 / 1024,
            'Add (ms)': t_add * 1000,
            'Multiply (ms)': t_mul * 1000,
            'Sum (ms)': t_sum * 1000,
        })

    return pd.DataFrame(records)


dtype_df = benchmark_dtypes()
print(f'dtype Performance Comparison (n=5,000,000):')
print(dtype_df.to_string(index=False))
print()
# Memory savings summary
f32_mb = dtype_df.loc[dtype_df['dtype'] == 'float32', 'Array MB'].values[0]
f64_mb = dtype_df.loc[dtype_df['dtype'] == 'float64', 'Array MB'].values[0]
print(f'Memory: float32 uses {f32_mb:.1f} MB vs float64 {f64_mb:.1f} MB ')
print(f'  → {f64_mb / f32_mb:.0f}× memory savings with float32')
print()
print('float32 uses half the memory of float64 and is often faster due to')
print('better cache utilization. This is why PyTorch defaults to float32.')
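
The memory saving is not free: float32 carries roughly 7 significant decimal digits, against about 16 for float64. A short illustrative check (variable names are arbitrary):

```python
import numpy as np

print(np.finfo(np.float32).eps)   # about 1.19e-07
print(np.finfo(np.float64).eps)   # about 2.22e-16

# float32 has a 24-bit significand, so 2**24 + 1 is not representable.
big32 = np.float32(2 ** 24)
print(big32 + np.float32(1) == big32)                  # True
print(np.float64(2 ** 24) + np.float64(1) == 2 ** 24)  # False
```

This is rarely a problem for neural-network training, but it can matter when accumulating very long sums or comparing values that differ only slightly.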

---
## Part 5 — Summary & Lessons Learned

### Key Takeaways

1. **Vectorization is non-negotiable.** On large arrays, NumPy and PyTorch are typically
   one to two orders of magnitude faster than Python loops, and the gap is larger still
   for matrix operations. Prefer vectorized operations over element-wise loops.

2. **Broadcasting eliminates redundant copies.** Instead of manually expanding arrays with
   `np.tile()`, let broadcasting handle shape mismatches — it's faster and uses less memory.

3. **Memory layout matters.** C-order (row-major) arrays are faster for row operations.
   NumPy defaults to C-order, which matches ML's samples-as-rows convention.

4. **Views vs copies affect correctness and speed.** Slicing creates views (shared memory);
   fancy indexing creates copies. Use `.copy()` when you need independent data.

5. **float32 is the standard for ML.** It uses half the memory of float64 and is often
   faster. PyTorch defaults to float32; NumPy defaults to float64 — be aware of this
   when converting between them.

### What's Next

→ **01-02 (Advanced NumPy & PyTorch Operations)** builds on these fundamentals with
  reshape, einsum, advanced indexing, and in-place operations — the power tools for
  efficient tensor manipulation.

### Going Further

- [NumPy Internals](https://numpy.org/doc/stable/reference/internals.html) — How NumPy
  stores and accesses array data
- [Why Python is Slow](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) —
  Deep dive into Python's object model overhead
- [BLAS Libraries](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) —
  The optimized math kernels behind NumPy and PyTorch