# Module 01 — Mathematical & Programming Foundations
## 1-01: Python, NumPy & Tensor Speed

**Objective:** Understand why vectorized computation is essential for ML and learn the foundations of NumPy and PyTorch tensor operations.

**Prerequisites:** None (entry point to the course)

---
## Part 0 — Setup & Prerequisites

This notebook covers the performance gap between pure Python loops, NumPy vectorized operations, and PyTorch tensor operations. We will benchmark identical computations across all three approaches, explore broadcasting rules, and examine memory layout concepts that determine computational efficiency.

**Prerequisites:** None — this is the entry point to the entire 200-topic course.

In [None]:
# ── Imports ──────────────────────────────────────────────────────────────────────
import sys
import warnings
warnings.filterwarnings("ignore")

import random
import time
from typing import Callable

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

print(f"Python: {sys.version.split()[0]}")
print(f"NumPy: {np.__version__}")
print(f"PyTorch: {torch.__version__}")
if torch.cuda.is_available():
    print(f"CUDA: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# ── Reproducibility ───────────────────────────────────────────────────────────────
SEED = 1103
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

In [None]:
# ── Configuration ─────────────────────────────────────────────────────────────────
VECTOR_SIZES = [100, 500, 1_000, 5_000, 10_000, 50_000]  # Sizes for vector benchmarks
MATRIX_SIZES = [50, 100, 200, 500, 1_000]                 # Sizes for matrix benchmarks
NUM_WARMUP = 2                                              # Warmup runs before timing
NUM_TIMED_RUNS = 5                                          # Timed runs to average
PYTHON_MAX_SIZE = 10_000                                    # Max size for pure Python (too slow beyond)

### Synthetic Data Generation

We generate random vectors and matrices for benchmarking. Since this notebook focuses on
computational speed rather than a specific ML task, all data is synthetic.

In [None]:
# Generate sample data for initial demonstrations
DEMO_SIZE = 1_000

# Python lists
python_vec_a = [random.gauss(0, 1) for _ in range(DEMO_SIZE)]
python_vec_b = [random.gauss(0, 1) for _ in range(DEMO_SIZE)]

# NumPy arrays
numpy_vec_a = np.array(python_vec_a)
numpy_vec_b = np.array(python_vec_b)

# PyTorch tensors
torch_vec_a = torch.tensor(python_vec_a, dtype=torch.float64)  # float64 to match the NumPy arrays
torch_vec_b = torch.tensor(python_vec_b, dtype=torch.float64)

print(f"Python list length:  {len(python_vec_a)}")
print(f"NumPy array shape:   {numpy_vec_a.shape}, dtype: {numpy_vec_a.dtype}")
print(f"PyTorch tensor shape: {torch_vec_a.shape}, dtype: {torch_vec_a.dtype}")
print(f"\nFirst 5 values (Python):  {python_vec_a[:5]}")
print(f"First 5 values (NumPy):   {numpy_vec_a[:5]}")
print(f"First 5 values (PyTorch): {torch_vec_a[:5]}")

---
## Part 1 — Vectorization from Scratch

Vectorization is the technique of replacing explicit Python loops with batch operations
implemented in compiled languages (C, C++, Fortran). Libraries like NumPy and PyTorch
execute operations on entire arrays in a single call, leveraging:

- **Optimized C/C++ backends** (NumPy uses BLAS/LAPACK; PyTorch uses ATen/MKL)
- **Contiguous memory layout** enabling CPU cache efficiency
- **SIMD instructions** (Single Instruction, Multiple Data) on modern CPUs
- **Optional GPU parallelism** (PyTorch tensors can move to CUDA devices)

The mathematical operations are identical — only the *implementation* differs.
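
As a minimal sketch of that point, the same sum of squares can be written as an explicit loop or as a single array expression. The function names `sum_of_squares_loop` and `sum_of_squares_vectorized` are illustrative and not part of the course code.

```python
import numpy as np


def sum_of_squares_loop(values: list[float]) -> float:
    """Explicit Python loop: one interpreted iteration per element."""
    total = 0.0
    for value in values:
        total += value * value
    return total


def sum_of_squares_vectorized(values: np.ndarray) -> float:
    """Single array expression: the loop runs inside NumPy's compiled code."""
    return float(np.sum(values * values))


data = [0.5, -1.0, 2.0, 3.5]
loop_result = sum_of_squares_loop(data)
vectorized_result = sum_of_squares_vectorized(np.array(data))
print(loop_result, vectorized_result)  # both are 17.5
```

Both functions compute the same number; the vectorized version simply hands the whole array to compiled code in one call.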

### 1.1 Pure Python Operations

We start with three fundamental operations implemented using only Python loops:

1. **Dot product:** $\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$
2. **Element-wise multiplication:** $c_i = a_i b_i$ for all $i$
3. **Matrix multiplication:** $(\mathbf{C})_{ij} = \sum_{k=1}^{m} (\mathbf{A})_{ik} (\mathbf{B})_{kj}$

In [None]:
def python_dot_product(vec_a: list[float], vec_b: list[float]) -> float:
    """Compute dot product of two vectors using a Python for loop.

    Args:
        vec_a: First vector as a Python list.
        vec_b: Second vector as a Python list.

    Returns:
        Scalar dot product value.
    """
    assert len(vec_a) == len(vec_b), "Vectors must have the same length"
    result = 0.0
    for idx in range(len(vec_a)):
        result += vec_a[idx] * vec_b[idx]
    return result


def python_elementwise_multiply(vec_a: list[float], vec_b: list[float]) -> list[float]:
    """Compute element-wise multiplication using a Python for loop.

    Args:
        vec_a: First vector as a Python list.
        vec_b: Second vector as a Python list.

    Returns:
        Result vector as a Python list.
    """
    assert len(vec_a) == len(vec_b), "Vectors must have the same length"
    result = [0.0] * len(vec_a)
    for idx in range(len(vec_a)):
        result[idx] = vec_a[idx] * vec_b[idx]
    return result


def python_matmul(mat_a: list[list[float]], mat_b: list[list[float]]) -> list[list[float]]:
    """Compute matrix multiplication using nested Python for loops.

    Args:
        mat_a: First matrix as a list of lists, shape (rows_a, cols_a).
        mat_b: Second matrix as a list of lists, shape (rows_b, cols_b).

    Returns:
        Result matrix as a list of lists, shape (rows_a, cols_b).
    """
    rows_a = len(mat_a)
    cols_a = len(mat_a[0])
    rows_b = len(mat_b)
    cols_b = len(mat_b[0])
    assert cols_a == rows_b, f"Incompatible shapes: ({rows_a}, {cols_a}) x ({rows_b}, {cols_b})"

    result = [[0.0] * cols_b for _ in range(rows_a)]
    for row_idx in range(rows_a):
        for col_idx in range(cols_b):
            total = 0.0
            for k_idx in range(cols_a):
                total += mat_a[row_idx][k_idx] * mat_b[k_idx][col_idx]
            result[row_idx][col_idx] = total
    return result

In [None]:
# Test pure Python implementations
dot_result_python = python_dot_product(python_vec_a, python_vec_b)
elem_result_python = python_elementwise_multiply(python_vec_a[:5], python_vec_b[:5])

# Small matrix test
small_mat_a = [[1.0, 2.0], [3.0, 4.0]]
small_mat_b = [[5.0, 6.0], [7.0, 8.0]]
matmul_result_python = python_matmul(small_mat_a, small_mat_b)

print(f"Dot product (Python):           {dot_result_python:.4f}")
print(f"Element-wise (first 5, Python): {elem_result_python}")
print(f"Matrix multiply (Python):       {matmul_result_python}")
print(f"Expected matmul result:         [[19.0, 22.0], [43.0, 50.0]]")

### 1.2 NumPy Vectorized Operations

NumPy performs the same operations without explicit Python loops. Under the hood,
dot products and matrix multiplications dispatch to optimized BLAS (Basic Linear Algebra
Subprograms) routines written in C and Fortran, while element-wise operations run in
NumPy's own compiled C loops. Key functions:

- `np.dot(a, b)` — dot product for 1-D arrays, matrix multiply for 2-D
- `np.multiply(a, b)` or `a * b` — element-wise multiplication
- `np.matmul(a, b)` or `a @ b` — matrix multiplication (preferred over `np.dot` for matrices)
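
The preference for `np.matmul` over `np.dot` matters mainly for arrays with more than two dimensions. The following sketch, using made-up array names, shows that the two agree for 2-D inputs but produce different shapes for stacked matrices:

```python
import numpy as np

# 2-D case: np.dot and @ give the same matrix product.
mat_a = np.arange(6.0).reshape(2, 3)
mat_b = np.arange(12.0).reshape(3, 4)
same_for_2d = np.array_equal(np.dot(mat_a, mat_b), mat_a @ mat_b)

# Stacked (3-D) case: the two functions follow different rules.
stack_a = np.ones((2, 3, 4))
stack_b = np.ones((2, 4, 5))
matmul_shape = np.matmul(stack_a, stack_b).shape  # axis 0 is treated as a batch
dot_shape = np.dot(stack_a, stack_b).shape        # every matrix in a is paired with every matrix in b

print(same_for_2d)   # True
print(matmul_shape)  # (2, 3, 5)
print(dot_shape)     # (2, 3, 2, 5)
```

For batches of matrices, which are common in ML, the `matmul` behavior is usually the intended one.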

In [None]:
def numpy_dot_product(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Compute dot product using NumPy.

    Args:
        vec_a: First vector as a NumPy array.
        vec_b: Second vector as a NumPy array.

    Returns:
        Scalar dot product value.
    """
    return float(np.dot(vec_a, vec_b))


def numpy_elementwise_multiply(vec_a: np.ndarray, vec_b: np.ndarray) -> np.ndarray:
    """Compute element-wise multiplication using NumPy.

    Args:
        vec_a: First vector as a NumPy array.
        vec_b: Second vector as a NumPy array.

    Returns:
        Result array.
    """
    return np.multiply(vec_a, vec_b)


def numpy_matmul(mat_a: np.ndarray, mat_b: np.ndarray) -> np.ndarray:
    """Compute matrix multiplication using NumPy.

    Args:
        mat_a: First matrix as a NumPy array.
        mat_b: Second matrix as a NumPy array.

    Returns:
        Result matrix as a NumPy array.
    """
    return mat_a @ mat_b

In [None]:
# Test NumPy implementations and verify against Python results
dot_result_numpy = numpy_dot_product(numpy_vec_a, numpy_vec_b)
elem_result_numpy = numpy_elementwise_multiply(numpy_vec_a[:5], numpy_vec_b[:5])

np_mat_a = np.array(small_mat_a)
np_mat_b = np.array(small_mat_b)
matmul_result_numpy = numpy_matmul(np_mat_a, np_mat_b)

print(f"Dot product (NumPy):           {dot_result_numpy:.4f}")
print(f"Element-wise (first 5, NumPy): {elem_result_numpy}")
print(f"Matrix multiply (NumPy):\n{matmul_result_numpy}")

# Verify Python and NumPy produce the same results
print(f"\nDot product match: {np.isclose(dot_result_python, dot_result_numpy)}")
print(f"Matmul match:      {np.allclose(matmul_result_python, matmul_result_numpy)}")

### 1.3 PyTorch Tensor Operations

PyTorch tensors provide the same vectorized operations as NumPy, with two key additions:

1. **GPU support** — tensors can be placed on a CUDA device for massive parallelism
2. **Automatic differentiation** — `autograd` tracks operations for gradient computation (covered in Module 5)

Key functions:
- `torch.dot(a, b)` — dot product (1-D only)
- `torch.mul(a, b)` or `a * b` — element-wise multiplication
- `torch.matmul(a, b)` or `a @ b` — matrix multiplication
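
The two additions listed above can be sketched in a few lines. This is an illustrative example rather than course code: it falls back to the CPU when no CUDA device is present, and the variable names are arbitrary.

```python
import torch

# Pick a device: CUDA if present, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Device placement: the same code runs on either device.
vec_a = torch.tensor([1.0, 2.0, 3.0], device=device)
vec_b = torch.tensor([4.0, 5.0, 6.0], device=device)
dot_value = torch.dot(vec_a, vec_b).item()  # .item() copies the scalar back to Python

# 2. Autograd: d/dx (x^2) evaluated at x = 3 is 6.
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.backward()
gradient = x.grad.item()

print(device, dot_value, gradient)
```

Creating tensors directly on the target device avoids an extra host-to-device copy.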

In [None]:
def torch_dot_product(vec_a: torch.Tensor, vec_b: torch.Tensor) -> float:
    """Compute dot product using PyTorch.

    Args:
        vec_a: First vector as a PyTorch tensor.
        vec_b: Second vector as a PyTorch tensor.

    Returns:
        Scalar dot product value.
    """
    return float(torch.dot(vec_a, vec_b))


def torch_elementwise_multiply(vec_a: torch.Tensor, vec_b: torch.Tensor) -> torch.Tensor:
    """Compute element-wise multiplication using PyTorch.

    Args:
        vec_a: First vector as a PyTorch tensor.
        vec_b: Second vector as a PyTorch tensor.

    Returns:
        Result tensor.
    """
    return torch.mul(vec_a, vec_b)


def torch_matmul(mat_a: torch.Tensor, mat_b: torch.Tensor) -> torch.Tensor:
    """Compute matrix multiplication using PyTorch.

    Args:
        mat_a: First matrix as a PyTorch tensor.
        mat_b: Second matrix as a PyTorch tensor.

    Returns:
        Result matrix as a PyTorch tensor.
    """
    return torch.matmul(mat_a, mat_b)

In [None]:
# Test PyTorch implementations and verify against NumPy
dot_result_torch = torch_dot_product(torch_vec_a, torch_vec_b)
elem_result_torch = torch_elementwise_multiply(torch_vec_a[:5], torch_vec_b[:5])

torch_mat_a = torch.tensor(small_mat_a)
torch_mat_b = torch.tensor(small_mat_b)
matmul_result_torch = torch_matmul(torch_mat_a, torch_mat_b)

print(f"Dot product (PyTorch):           {dot_result_torch:.4f}")
print(f"Element-wise (first 5, PyTorch): {elem_result_torch}")
print(f"Matrix multiply (PyTorch):\n{matmul_result_torch}")

# Verify all three implementations agree
print(f"\nAll three dot products match: {np.isclose(dot_result_python, dot_result_numpy) and np.isclose(dot_result_numpy, dot_result_torch)}")
print(f"NumPy vs PyTorch matmul match: {np.allclose(matmul_result_numpy, matmul_result_torch.numpy())}")

#### `torch.Tensor` vs `torch.tensor`

A common source of confusion for beginners:

- **`torch.tensor(data)`**: a *function* that infers the dtype from the input data. This is the **recommended** way to create tensors.
- **`torch.Tensor(data)`**: the *class constructor*, which ignores the input's type and returns a tensor of the default floating-point dtype (`float32` unless you change it), even if you pass integers.

Always prefer `torch.tensor()` for predictable dtype behavior.

In [None]:
# Demonstrate torch.Tensor vs torch.tensor
int_data = [1, 2, 3]

tensor_from_class = torch.Tensor(int_data)    # Default float dtype (float32 unless changed)
tensor_from_func = torch.tensor(int_data)     # Infers int64

print(f"torch.Tensor([1,2,3]) -> dtype: {tensor_from_class.dtype}")
print(f"torch.tensor([1,2,3]) -> dtype: {tensor_from_func.dtype}")

# Explicit dtype control with torch.tensor
tensor_float = torch.tensor(int_data, dtype=torch.float32)
print(f"torch.tensor([1,2,3], dtype=float32) -> dtype: {tensor_float.dtype}")

### 1.4 Broadcasting Rules

Broadcasting is the mechanism by which NumPy and PyTorch handle operations on arrays/tensors
with different shapes. Instead of requiring identical shapes, broadcasting *stretches* smaller
dimensions to match larger ones — without actually copying data in memory.

**Three rules of broadcasting** (applied from right to left):

1. **If the two shapes have different numbers of dimensions, pad the shorter shape with 1s on the left.**
   - Example: shape `(3,)` becomes `(1, 3)` when paired with `(2, 3)`.

2. **Dimensions of size 1 are stretched to match the other array's size in that dimension.**
   - Example: `(1, 3)` is broadcast to `(2, 3)` when paired with `(2, 3)`.

3. **Dimensions must either be equal or one of them must be 1. Otherwise, broadcasting fails.**
   - Example: `(2, 3)` and `(4, 3)` cannot be broadcast — dimension 0 is 2 vs 4, and neither is 1.
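
The three rules can be written out as a small shape calculator. The helper `broadcast_shape` below is a hypothetical illustration of the rules, not a NumPy function; NumPy's own `np.broadcast_shapes` is used only to cross-check it.

```python
import numpy as np


def broadcast_shape(shape_a: tuple[int, ...], shape_b: tuple[int, ...]) -> tuple[int, ...]:
    """Apply the three broadcasting rules to two shapes (illustrative helper)."""
    # Rule 1: left-pad the shorter shape with 1s.
    ndim = max(len(shape_a), len(shape_b))
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(padded_a, padded_b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)  # sizes agree, or b is stretched (Rule 2)
        elif dim_a == 1:
            result.append(dim_b)  # a is stretched (Rule 2)
        else:
            # Rule 3: sizes differ and neither is 1.
            raise ValueError(f"Cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)


print(broadcast_shape((3,), (2, 3)))    # (2, 3)
print(broadcast_shape((2, 1), (2, 3)))  # (2, 3)
print(broadcast_shape((3, 1), (3,)))    # (3, 3)

# Cross-check against NumPy's built-in shape calculator.
assert broadcast_shape((3, 1), (3,)) == np.broadcast_shapes((3, 1), (3,))
```

Calling `broadcast_shape((2, 3), (4, 3))` raises `ValueError`, matching the failing example under Rule 3.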

In [None]:
# Broadcasting example 1: Adding a scalar to a matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
scalar = 10
result_scalar = matrix + scalar
print("Example 1: Matrix + Scalar")
print(f"Matrix shape: {matrix.shape} + Scalar shape: () -> Result shape: {result_scalar.shape}")
print(f"Result:\n{result_scalar}\n")

# Broadcasting example 2: Adding a row vector to a matrix
row_vector = np.array([10, 20, 30])  # shape (3,)
result_row = matrix + row_vector
print("Example 2: Matrix (2,3) + Row Vector (3,)")
print(f"Matrix shape: {matrix.shape} + Row shape: {row_vector.shape} -> Result shape: {result_row.shape}")
print(f"Result:\n{result_row}\n")

# Broadcasting example 3: Adding a column vector to a matrix
col_vector = np.array([[100], [200]])  # shape (2, 1)
result_col = matrix + col_vector
print("Example 3: Matrix (2,3) + Column Vector (2,1)")
print(f"Matrix shape: {matrix.shape} + Col shape: {col_vector.shape} -> Result shape: {result_col.shape}")
print(f"Result:\n{result_col}\n")

# Broadcasting example 4: Outer product via broadcasting
vec_row = np.array([1, 2, 3])          # shape (3,)
vec_col = np.array([[10], [20], [30]])  # shape (3, 1)
outer_product = vec_col * vec_row       # (3,1) * (3,) -> (3,3)
print("Example 4: Outer Product via Broadcasting")
print(f"Col (3,1) * Row (3,) -> shape {outer_product.shape}")
print(f"Result:\n{outer_product}")

#### Common Broadcasting Patterns in ML

Broadcasting appears constantly in ML code:

| Pattern | Shapes | ML Use Case |
|---------|--------|-------------|
| Add bias to each sample | `(batch, features) + (features,)` | Linear layer: $\mathbf{y} = \mathbf{X}\mathbf{W} + \mathbf{b}$ |
| Normalize each feature | `(batch, features) - (features,)` | Feature standardization |
| Scale per-channel | `(batch, channels, H, W) * (1, channels, 1, 1)` | Batch normalization |
| Compute pairwise distances | `(n, 1, d) - (1, m, d)` | k-NN, kernel methods |
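
Two of the patterns in the table, feature standardization and per-channel scaling, can be sketched as follows. The array names, sizes, and scale values are made up for illustration, and a local random generator is used so the notebook's global seed is not affected.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; the seed value is arbitrary

# Feature standardization: (batch, features) - (features,), then / (features,)
features = rng.normal(loc=5.0, scale=2.0, size=(8, 3))
feature_mean = features.mean(axis=0)  # shape (3,)
feature_std = features.std(axis=0)    # shape (3,)
standardized = (features - feature_mean) / feature_std

# Per-channel scaling: (batch, channels, H, W) * (1, channels, 1, 1)
images = rng.normal(size=(2, 3, 4, 4))
channel_scale = np.array([0.5, 1.0, 2.0]).reshape(1, 3, 1, 1)
scaled_images = images * channel_scale

print(standardized.shape, scaled_images.shape)
```

After standardization each column has mean close to 0 and standard deviation close to 1; the reshape to `(1, 3, 1, 1)` is what lines the scale vector up with the channel axis.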

In [None]:
# ML broadcasting pattern: Adding bias to a batch of samples
batch_size = 4
num_features = 3

# Simulating the output of a linear layer: y = Xw + b
batch_data = np.random.randn(batch_size, num_features)
bias = np.array([0.5, -0.3, 1.0])  # shape (3,) - one bias per feature

# Broadcasting adds bias to every sample in the batch
result_with_bias = batch_data + bias
print(f"Batch shape: {batch_data.shape} + Bias shape: {bias.shape}")
print(f"Result shape: {result_with_bias.shape}")
print(f"\nBefore bias (first 2 rows):\n{batch_data[:2]}")
print(f"After bias  (first 2 rows):\n{result_with_bias[:2]}")

# Same pattern works identically in PyTorch
torch_batch = torch.tensor(batch_data)
torch_bias = torch.tensor(bias)
torch_result = torch_batch + torch_bias
print(f"\nPyTorch result matches NumPy: {np.allclose(result_with_bias, torch_result.numpy())}")

In [None]:
# Broadcasting pattern: Pairwise distance computation
# This is used in k-NN (Module 2) and kernel methods (Module 3)
num_points_a = 3
num_points_b = 4
num_dims = 2

points_a = np.random.randn(num_points_a, num_dims)  # shape (3, 2)
points_b = np.random.randn(num_points_b, num_dims)  # shape (4, 2)

# Expand dimensions for broadcasting: (3,1,2) - (1,4,2) -> (3,4,2)
diff = points_a[:, np.newaxis, :] - points_b[np.newaxis, :, :]
pairwise_distances = np.sqrt(np.sum(diff ** 2, axis=2))  # shape (3, 4)

print(f"Points A shape:     {points_a.shape}")
print(f"Points B shape:     {points_b.shape}")
print(f"Difference shape:   {diff.shape}")
print(f"Distances shape:    {pairwise_distances.shape}")
print(f"\nPairwise distances:\n{pairwise_distances}")

### 1.5 Memory Layout

Understanding how tensors are stored in memory is crucial for performance. Two key concepts:

**Row-major (C order)** vs **Column-major (Fortran order):**
- Row-major: consecutive elements in a row are adjacent in memory (NumPy default, PyTorch default)
- Column-major: consecutive elements in a column are adjacent in memory

**Strides:** The step to take in memory to move one position along each dimension. NumPy reports
strides in bytes; PyTorch reports them in elements. For a contiguous row-major matrix of shape
$(m, n)$, the element strides are $(n, 1)$: moving to the next row skips $n$ elements, and moving
to the next column skips 1 element.

**Contiguous vs Non-contiguous:** A tensor is *contiguous* when its elements are stored in one
unbroken block of memory in row-major order, so stepping along the last dimension visits adjacent
addresses. Transpose operations create non-contiguous views, which can slow down operations that
expect this layout.
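
To make the stride arithmetic concrete, the sketch below locates one element of a small array using only its strides. The array and index values are chosen for illustration.

```python
import numpy as np

arr = np.arange(12, dtype=np.int64).reshape(3, 4)  # C-contiguous, 8-byte items
flat = arr.ravel()                                  # elements in memory order

# NumPy reports strides in bytes; divide by itemsize to get element counts.
row_stride, col_stride = (s // arr.itemsize for s in arr.strides)

# Element (i, j) lives at flat offset i * row_stride + j * col_stride.
i, j = 2, 1
offset = i * row_stride + j * col_stride
print(arr[i, j], flat[offset])  # 9 9

# A transpose swaps the strides and reuses the same buffer (no copy).
transposed = arr.T
print(transposed.strides, np.shares_memory(arr, transposed))  # (8, 32) True
```

The transposed view has the same data but is no longer C-contiguous, because its last dimension now steps 32 bytes at a time.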

In [None]:
# Demonstrate strides in NumPy
arr_c = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12]], dtype=np.int64)  # Row-major (C order)

print("Row-major (C order) array:")
print(f"  Shape:   {arr_c.shape}")
print(f"  Strides: {arr_c.strides} bytes")
print(f"  Strides in elements: ({arr_c.strides[0] // arr_c.itemsize}, {arr_c.strides[1] // arr_c.itemsize})")
print(f"  C-contiguous: {arr_c.flags['C_CONTIGUOUS']}")
print(f"  F-contiguous: {arr_c.flags['F_CONTIGUOUS']}")

# Column-major (Fortran order)
arr_f = np.asfortranarray(arr_c)
print(f"\nColumn-major (Fortran order) array:")
print(f"  Same data: {np.array_equal(arr_c, arr_f)}")
print(f"  Strides: {arr_f.strides} bytes")
print(f"  C-contiguous: {arr_f.flags['C_CONTIGUOUS']}")
print(f"  F-contiguous: {arr_f.flags['F_CONTIGUOUS']}")

In [None]:
# Demonstrate contiguous vs non-contiguous in PyTorch
tensor_2d = torch.tensor([[1, 2, 3],
                          [4, 5, 6]], dtype=torch.float32)

print("Original tensor:")
print(f"  Shape:         {tensor_2d.shape}")
print(f"  Stride:        {tensor_2d.stride()}")
print(f"  Is contiguous: {tensor_2d.is_contiguous()}")

# Transpose creates a non-contiguous view (no data copy!)
tensor_transposed = tensor_2d.T
print(f"\nTransposed tensor (view, no copy):")
print(f"  Shape:         {tensor_transposed.shape}")
print(f"  Stride:        {tensor_transposed.stride()}")
print(f"  Is contiguous: {tensor_transposed.is_contiguous()}")

# Making it contiguous creates a new copy with proper layout
tensor_contiguous = tensor_transposed.contiguous()
print(f"\nAfter .contiguous() (new copy):")
print(f"  Shape:         {tensor_contiguous.shape}")
print(f"  Stride:        {tensor_contiguous.stride()}")
print(f"  Is contiguous: {tensor_contiguous.is_contiguous()}")

# Verify the data is the same
print(f"\nData matches: {torch.equal(tensor_transposed, tensor_contiguous)}")

In [None]:
# Visualize memory layout
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Row-major layout
row_major_data = np.array([[1, 2, 3], [4, 5, 6]])
flat_row = row_major_data.flatten()  # Row-major: [1, 2, 3, 4, 5, 6]

colors_row = ['#2196F3', '#2196F3', '#2196F3', '#FF9800', '#FF9800', '#FF9800']
axes[0].barh(range(6), [1]*6, color=colors_row, edgecolor='black', linewidth=1.5)
for idx, val in enumerate(flat_row):
    axes[0].text(0.5, idx, str(val), ha='center', va='center', fontsize=14, fontweight='bold')
axes[0].set_yticks(range(6))
axes[0].set_yticklabels([f'addr {i}' for i in range(6)])
axes[0].set_title('Row-Major (C Order)\nRow 0 = blue, Row 1 = orange', fontsize=12)
axes[0].set_ylabel('Memory Address')
axes[0].invert_yaxis()
axes[0].set_xlim(0, 1)
axes[0].set_xticks([])

# Column-major layout
flat_col = row_major_data.flatten('F')  # Column-major: [1, 4, 2, 5, 3, 6]
colors_col = ['#2196F3', '#FF9800', '#2196F3', '#FF9800', '#2196F3', '#FF9800']
axes[1].barh(range(6), [1]*6, color=colors_col, edgecolor='black', linewidth=1.5)
for idx, val in enumerate(flat_col):
    axes[1].text(0.5, idx, str(val), ha='center', va='center', fontsize=14, fontweight='bold')
axes[1].set_yticks(range(6))
axes[1].set_yticklabels([f'addr {i}' for i in range(6)])
axes[1].set_title('Column-Major (Fortran Order)\nRow 0 = blue, Row 1 = orange', fontsize=12)
axes[1].set_ylabel('Memory Address')
axes[1].invert_yaxis()
axes[1].set_xlim(0, 1)
axes[1].set_xticks([])

plt.tight_layout()
plt.show()

---
## Part 2 — Putting It All Together

We now build a reusable `BenchmarkSuite` class that standardizes timing methodology.
Good benchmarking requires:

1. **Warmup runs** — the first execution is often slower due to JIT compilation, cache loading, etc.
2. **Multiple timed runs** — take the average (or median) to reduce variance.
3. **High-resolution timer** — `time.perf_counter()` provides the best precision for short operations.
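
The same three ingredients can also be sketched with the standard library alone. The helper `time_callable` below is a hypothetical illustration, not the harness built in this notebook; it relies on `timeit.repeat`, whose default timer is `time.perf_counter`.

```python
import statistics
import timeit


def time_callable(func, num_warmup: int = 2, num_timed_runs: int = 5) -> dict[str, float]:
    """Time a zero-argument callable (illustrative helper)."""
    # 1. Warmup runs: results are discarded.
    for _ in range(num_warmup):
        func()
    # 2. Several timed runs, one call each.
    # 3. timeit uses time.perf_counter as its default timer.
    times = timeit.repeat(func, repeat=num_timed_runs, number=1)
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "min": min(times),
    }


stats = time_callable(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Reporting the median or minimum alongside the mean makes a single slow outlier run easier to spot.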

In [None]:
def benchmark_function(
    func: Callable,
    num_warmup: int = NUM_WARMUP,
    num_timed_runs: int = NUM_TIMED_RUNS,
) -> dict[str, object]:
    """Benchmark a callable with warmup and multiple timed runs.

    Args:
        func: Callable that takes no arguments (use lambda or functools.partial).
        num_warmup: Number of warmup executions before timing.
        num_timed_runs: Number of timed executions to average.

    Returns:
        Dictionary with 'mean', 'min', 'max', and 'all_times' keys.
    """
    # Warmup
    for _ in range(num_warmup):
        func()

    # Timed runs
    times: list[float] = []
    for _ in range(num_timed_runs):
        start_time = time.perf_counter()
        func()
        end_time = time.perf_counter()
        times.append(end_time - start_time)

    return {
        "mean": np.mean(times),
        "min": np.min(times),
        "max": np.max(times),
        "all_times": times,
    }

In [None]:
class BenchmarkSuite:
    """A reusable benchmarking harness for comparing operations.

    Attributes:
        tasks: List of (name, callable) pairs to benchmark.
        results: DataFrame of timing results after running.
    """

    def __init__(
        self,
        tasks: list[tuple[str, Callable]],
        num_warmup: int = NUM_WARMUP,
        num_timed_runs: int = NUM_TIMED_RUNS,
    ) -> None:
        """Initialize the benchmark suite.

        Args:
            tasks: List of (name, callable) pairs.
            num_warmup: Number of warmup runs.
            num_timed_runs: Number of timed runs.
        """
        self.tasks = tasks
        self.num_warmup = num_warmup
        self.num_timed_runs = num_timed_runs
        self.results: pd.DataFrame | None = None

    def run(self) -> pd.DataFrame:
        """Execute all benchmarks and return a results DataFrame.

        Returns:
            DataFrame with columns: Name, Mean (s), Min (s), Max (s).
        """
        rows: list[dict[str, object]] = []
        for name, func in self.tasks:
            timing = benchmark_function(func, self.num_warmup, self.num_timed_runs)
            rows.append({
                "Name": name,
                "Mean (s)": timing["mean"],
                "Min (s)": timing["min"],
                "Max (s)": timing["max"],
            })
        self.results = pd.DataFrame(rows)
        return self.results

    def plot_results(self, title: str = "Benchmark Results") -> None:
        """Plot benchmark results as a horizontal bar chart.

        Args:
            title: Title for the plot.
        """
        if self.results is None:
            raise ValueError("Run benchmarks first with .run()")

        fig, ax = plt.subplots(figsize=(8, 5))
        names = self.results["Name"]
        means = self.results["Mean (s)"]
        colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(names)))

        bars = ax.barh(range(len(names)), means, color=colors, edgecolor='black')
        ax.set_yticks(range(len(names)))
        ax.set_yticklabels(names)
        ax.set_xlabel("Mean Time (seconds)")
        ax.set_title(title)

        # Add time labels on bars
        for bar_item, mean_val in zip(bars, means):
            ax.text(
                bar_item.get_width() * 1.02,
                bar_item.get_y() + bar_item.get_height() / 2,
                f"{mean_val:.6f}s",
                va='center',
                fontsize=10,
            )
        ax.set_xlim(0, max(means) * 1.3)
        plt.tight_layout()
        plt.show()

In [None]:
# Sanity check: benchmark the three dot product implementations
dot_tasks = [
    ("Python Loop", lambda: python_dot_product(python_vec_a, python_vec_b)),
    ("NumPy", lambda: numpy_dot_product(numpy_vec_a, numpy_vec_b)),
    ("PyTorch", lambda: torch_dot_product(torch_vec_a, torch_vec_b)),
]

dot_suite = BenchmarkSuite(dot_tasks)
dot_results = dot_suite.run()
print(f"Dot product benchmark (vector size = {DEMO_SIZE}):")
print(dot_results.to_string(index=False))
dot_suite.plot_results(f"Dot Product Benchmark (n={DEMO_SIZE})")

---
## Part 3 — Benchmarking Real ML Operations

Now we systematically benchmark operations at increasing scales. This reveals:

1. At what size does vectorization become essential?
2. How does the speedup ratio grow with problem size?
3. Where does PyTorch overhead exceed NumPy for small operations?

### 3.1 Vector Dot Product at Scale

In [None]:
def run_scaling_benchmark_dot(
    sizes: list[int],
    python_max_size: int = PYTHON_MAX_SIZE,
) -> pd.DataFrame:
    """Benchmark dot product across Python, NumPy, and PyTorch at multiple sizes.

    Args:
        sizes: List of vector sizes to benchmark.
        python_max_size: Maximum size to include Python loop benchmarks (too slow beyond).

    Returns:
        DataFrame with timing results for each approach and size.
    """
    records: list[dict[str, object]] = []

    for size in sizes:
        print(f"  Benchmarking size {size:>6d}...", end="")

        # Generate data
        py_a = [random.gauss(0, 1) for _ in range(size)]
        py_b = [random.gauss(0, 1) for _ in range(size)]
        np_a = np.array(py_a)
        np_b = np.array(py_b)
        pt_a = torch.tensor(py_a, dtype=torch.float64)  # float64 to match NumPy for a fair comparison
        pt_b = torch.tensor(py_b, dtype=torch.float64)

        # Python benchmark (skip for very large sizes)
        if size <= python_max_size:
            python_time = benchmark_function(lambda: python_dot_product(py_a, py_b))["mean"]
        else:
            python_time = float("nan")

        # NumPy benchmark
        numpy_time = benchmark_function(lambda: numpy_dot_product(np_a, np_b))["mean"]

        # PyTorch benchmark
        torch_time = benchmark_function(lambda: torch_dot_product(pt_a, pt_b))["mean"]

        records.append({
            "Size": size,
            "Python (s)": python_time,
            "NumPy (s)": numpy_time,
            "PyTorch (s)": torch_time,
        })
        print(" done")

    return pd.DataFrame(records)


print("Running dot product scaling benchmark...")
dot_scaling_results = run_scaling_benchmark_dot(VECTOR_SIZES)
print("\nDot Product Scaling Results:")
print(dot_scaling_results.to_string(index=False))

### 3.2 Matrix Multiplication at Scale

In [None]:
def run_scaling_benchmark_matmul(
    sizes: list[int],
    python_max_size: int = 200,
) -> pd.DataFrame:
    """Benchmark matrix multiplication across Python, NumPy, and PyTorch.

    Args:
        sizes: List of matrix dimensions (square matrices of size n x n).
        python_max_size: Maximum matrix dimension for Python loop benchmarks.

    Returns:
        DataFrame with timing results for each approach and size.
    """
    records: list[dict[str, object]] = []

    for size in sizes:
        print(f"  Benchmarking {size}x{size} matrix multiply...", end="")

        # Generate data
        np_a = np.random.randn(size, size)
        np_b = np.random.randn(size, size)
        pt_a = torch.tensor(np_a)
        pt_b = torch.tensor(np_b)

        # Python benchmark (skip for large sizes - O(n^3) is very slow)
        if size <= python_max_size:
            py_a = np_a.tolist()
            py_b = np_b.tolist()
            python_time = benchmark_function(lambda: python_matmul(py_a, py_b))["mean"]
        else:
            python_time = float("nan")

        # NumPy benchmark
        numpy_time = benchmark_function(lambda: numpy_matmul(np_a, np_b))["mean"]

        # PyTorch benchmark
        torch_time = benchmark_function(lambda: torch_matmul(pt_a, pt_b))["mean"]

        # Verify NumPy and PyTorch produce the same result
        np_result = numpy_matmul(np_a, np_b)
        pt_result = torch_matmul(pt_a, pt_b).numpy()
        assert np.allclose(np_result, pt_result, atol=1e-6), (
            f"NumPy and PyTorch disagree at size {size}"
        )

        records.append({
            "Size": f"{size}x{size}",
            "Python (s)": python_time,
            "NumPy (s)": numpy_time,
            "PyTorch (s)": torch_time,
        })
        print(" done")

    return pd.DataFrame(records)


print("Running matrix multiplication scaling benchmark...")
matmul_scaling_results = run_scaling_benchmark_matmul(MATRIX_SIZES)
print("\nMatrix Multiplication Scaling Results:")
print(matmul_scaling_results.to_string(index=False))

### 3.3 Element-wise Operations at Scale

In [None]:
def run_scaling_benchmark_elementwise(
    sizes: list[int],
    python_max_size: int = PYTHON_MAX_SIZE,
) -> pd.DataFrame:
    """Benchmark element-wise multiplication across Python, NumPy, and PyTorch.

    Args:
        sizes: List of vector sizes to benchmark.
        python_max_size: Maximum size for Python loop benchmarks.

    Returns:
        DataFrame with timing results for each approach and size.
    """
    records: list[dict[str, object]] = []

    for size in sizes:
        print(f"  Benchmarking element-wise multiply, size {size:>6d}...", end="")

        py_a = [random.gauss(0, 1) for _ in range(size)]
        py_b = [random.gauss(0, 1) for _ in range(size)]
        np_a = np.array(py_a)
        np_b = np.array(py_b)
        pt_a = torch.tensor(py_a, dtype=torch.float64)  # float64 to match NumPy for a fair comparison
        pt_b = torch.tensor(py_b, dtype=torch.float64)

        if size <= python_max_size:
            python_time = benchmark_function(
                lambda: python_elementwise_multiply(py_a, py_b)
            )["mean"]
        else:
            python_time = float("nan")

        numpy_time = benchmark_function(
            lambda: numpy_elementwise_multiply(np_a, np_b)
        )["mean"]

        torch_time = benchmark_function(
            lambda: torch_elementwise_multiply(pt_a, pt_b)
        )["mean"]

        records.append({
            "Size": size,
            "Python (s)": python_time,
            "NumPy (s)": numpy_time,
            "PyTorch (s)": torch_time,
        })
        print(" done")

    return pd.DataFrame(records)


print("Running element-wise multiply scaling benchmark...")
elemwise_scaling_results = run_scaling_benchmark_elementwise(VECTOR_SIZES)
print("\nElement-wise Multiply Scaling Results:")
print(elemwise_scaling_results.to_string(index=False))

### 3.4 Computing Speedup Ratios

In [None]:
def compute_speedup_table(results_df: pd.DataFrame, operation_name: str) -> pd.DataFrame:
    """Compute speedup ratios relative to Python for a benchmark results table.

    Args:
        results_df: DataFrame with 'Size', 'Python (s)', 'NumPy (s)', 'PyTorch (s)' columns.
        operation_name: Name of the operation for display.

    Returns:
        DataFrame with speedup ratios added.
    """
    speedup_df = results_df.copy()

    # Speedup = slower time / faster time; sizes where Python was skipped (NaN) stay NaN
    python_times = speedup_df["Python (s)"]
    numpy_times = speedup_df["NumPy (s)"]
    torch_times = speedup_df["PyTorch (s)"]

    speedup_df["NumPy vs Python"] = python_times / numpy_times
    speedup_df["PyTorch vs Python"] = python_times / torch_times
    speedup_df["NumPy vs PyTorch"] = torch_times / numpy_times

    return speedup_df


# Compute speedups for dot product
dot_speedup = compute_speedup_table(dot_scaling_results, "Dot Product")
print("Dot Product Speedups:")
print(dot_speedup.to_string(index=False, float_format="{:.2f}".format))

print("\nElement-wise Multiply Speedups:")
elemwise_speedup = compute_speedup_table(elemwise_scaling_results, "Element-wise Multiply")
print(elemwise_speedup.to_string(index=False, float_format="{:.2f}".format))
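
As a small worked example of the ratio arithmetic, the sketch below uses made-up timings (the numbers and the `toy_results` name are illustrative assumptions, not measurements from this notebook). It shows that a NaN Python time, as recorded for sizes where the pure-Python run is skipped, propagates to a NaN speedup, while the NumPy-versus-PyTorch ratio stays defined.

```python
import math

import pandas as pd

# Hypothetical timings in seconds; NaN marks a size where pure Python was skipped.
toy_results = pd.DataFrame({
    "Size": [1_000, 50_000],
    "Python (s)": [2.0e-4, float("nan")],
    "NumPy (s)": [2.0e-6, 4.0e-5],
    "PyTorch (s)": [4.0e-6, 2.0e-5],
})

# Speedup = baseline time / candidate time, so values above 1 favor the candidate.
numpy_vs_python = toy_results["Python (s)"] / toy_results["NumPy (s)"]
numpy_vs_torch = toy_results["PyTorch (s)"] / toy_results["NumPy (s)"]

print("NumPy vs Python:", numpy_vs_python.tolist())
print("NumPy vs PyTorch:", numpy_vs_torch.tolist())
print("Second row is NaN:", math.isnan(numpy_vs_python.iloc[1]))
```

A ratio below 1 in the last column means PyTorch was the faster of the two for that row.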

### 3.5 Broadcasting Performance: Loops vs Vectorized

In [None]:
def add_bias_python_loop(
    data: list[list[float]], bias: list[float]
) -> list[list[float]]:
    """Add a bias vector to each row of a data matrix using Python loops.

    Args:
        data: Matrix as a list of lists, shape (num_samples, num_features).
        bias: Bias vector as a list, shape (num_features,).

    Returns:
        Result matrix with bias added to each row.
    """
    num_samples = len(data)
    num_features = len(data[0])
    result = [[0.0] * num_features for _ in range(num_samples)]
    for row_idx in range(num_samples):
        for col_idx in range(num_features):
            result[row_idx][col_idx] = data[row_idx][col_idx] + bias[col_idx]
    return result


def add_bias_numpy_broadcast(
    data: np.ndarray, bias: np.ndarray
) -> np.ndarray:
    """Add a bias vector using NumPy broadcasting.

    Args:
        data: Matrix of shape (num_samples, num_features).
        bias: Bias vector of shape (num_features,).

    Returns:
        Result array with bias added.
    """
    return data + bias


def add_bias_torch_broadcast(
    data: torch.Tensor, bias: torch.Tensor
) -> torch.Tensor:
    """Add a bias vector using PyTorch broadcasting.

    Args:
        data: Tensor of shape (num_samples, num_features).
        bias: Bias tensor of shape (num_features,).

    Returns:
        Result tensor with bias added.
    """
    return data + bias


# Benchmark broadcasting: add bias to 1000 samples x 500 features
broadcast_num_samples = 1000
broadcast_num_features = 500

np_data = np.random.randn(broadcast_num_samples, broadcast_num_features)
np_bias = np.random.randn(broadcast_num_features)
py_data = np_data.tolist()
py_bias = np_bias.tolist()
pt_data = torch.tensor(np_data)
pt_bias = torch.tensor(np_bias)

broadcast_tasks = [
    ("Python Loops", lambda: add_bias_python_loop(py_data, py_bias)),
    ("NumPy Broadcasting", lambda: add_bias_numpy_broadcast(np_data, np_bias)),
    ("PyTorch Broadcasting", lambda: add_bias_torch_broadcast(pt_data, pt_bias)),
]

broadcast_suite = BenchmarkSuite(broadcast_tasks)
broadcast_results = broadcast_suite.run()
print(f"Broadcasting Benchmark ({broadcast_num_samples} samples x {broadcast_num_features} features):")
print(broadcast_results.to_string(index=False))

python_broadcast_time = float(broadcast_results.loc[0, "Mean (s)"])
numpy_broadcast_time = float(broadcast_results.loc[1, "Mean (s)"])
torch_broadcast_time = float(broadcast_results.loc[2, "Mean (s)"])

print(f"\nSpeedup: NumPy is {python_broadcast_time / numpy_broadcast_time:.2f}x faster than Python")
print(f"Speedup: PyTorch is {python_broadcast_time / torch_broadcast_time:.2f}x faster than Python")
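
The bias addition above works because both libraries align shapes from the trailing dimension before stretching size-1 dimensions. The sketch below re-implements that shape rule for illustration (the helper `broadcast_shape` is hypothetical, written only for this example), checks it against `np.broadcast_shapes`, and confirms on a tiny matrix that an explicit loop and a broadcast give the same values.

```python
import numpy as np


def broadcast_shape(shape_a: tuple[int, ...], shape_b: tuple[int, ...]) -> tuple[int, ...]:
    """Hypothetical helper: result shape of broadcasting two shapes together."""
    # Rule 1: left-pad the shorter shape with 1s so both have the same rank.
    rank = max(len(shape_a), len(shape_b))
    padded_a = (1,) * (rank - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (rank - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(padded_a, padded_b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)  # Rule 2: a size-1 dimension stretches to match
        elif dim_a == 1:
            result.append(dim_b)
        else:
            # Rule 3: sizes differ and neither is 1, so the shapes are incompatible
            raise ValueError(f"Cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)


# A (4, 3) matrix plus a (3,) bias: the bias is treated as (1, 3) and stretched over rows.
print(broadcast_shape((4, 3), (3,)))  # (4, 3)
shapes_match = broadcast_shape((4, 3), (3,)) == np.broadcast_shapes((4, 3), (3,))

# The loop and the broadcast produce the same values on a tiny example.
data = np.arange(12, dtype=np.float64).reshape(4, 3)
bias = np.array([10.0, 20.0, 30.0])
looped = np.array([[data[i, j] + bias[j] for j in range(3)] for i in range(4)])
values_match = np.array_equal(looped, data + bias)
print(shapes_match, values_match)
```

A bias of shape `(4,)` would fail against `(4, 3)` data because the trailing sizes 4 and 3 differ and neither is 1.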

---
## Part 4 — Evaluation & Analysis

We now visualize and analyze the benchmark results to understand the full picture.

### 4.1 Speedup Curves (Log Scale)

In [None]:
def plot_scaling_comparison(
    results_df: pd.DataFrame,
    title: str,
    size_column: str = "Size",
) -> None:
    """Plot timing results on a log scale for all three implementations.

    Args:
        results_df: DataFrame with timing results.
        title: Plot title.
        size_column: Column name for the x-axis (problem size).
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    sizes = results_df[size_column]

    # Convert sizes to numeric if they are strings like "50x50"
    if isinstance(sizes.iloc[0], str):
        numeric_sizes = [int(s.split('x')[0]) for s in sizes]
    else:
        numeric_sizes = sizes.tolist()

    # Left plot: Absolute times (log scale)
    python_times = results_df["Python (s)"]
    numpy_times = results_df["NumPy (s)"]
    torch_times = results_df["PyTorch (s)"]

    # Only plot Python where it was measured
    valid_python_mask = ~python_times.isna()
    if valid_python_mask.any():
        axes[0].plot(
            [numeric_sizes[i] for i in range(len(numeric_sizes)) if valid_python_mask.iloc[i]],
            python_times[valid_python_mask],
            'o-', label='Python Loop', color='#E53935', linewidth=2, markersize=7,
        )

    axes[0].plot(numeric_sizes, numpy_times, 's-', label='NumPy', color='#1E88E5',
                 linewidth=2, markersize=7)
    axes[0].plot(numeric_sizes, torch_times, '^-', label='PyTorch', color='#43A047',
                 linewidth=2, markersize=7)

    axes[0].set_xscale('log')
    axes[0].set_yscale('log')
    axes[0].set_xlabel('Problem Size')
    axes[0].set_ylabel('Time (seconds, log scale)')
    axes[0].set_title(f'{title} \u2014 Absolute Times')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Right plot: Speedup ratios
    if valid_python_mask.any():
        valid_sizes = [numeric_sizes[i] for i in range(len(numeric_sizes)) if valid_python_mask.iloc[i]]
        numpy_speedup = (python_times[valid_python_mask] / numpy_times[valid_python_mask]).tolist()
        torch_speedup = (python_times[valid_python_mask] / torch_times[valid_python_mask]).tolist()

        axes[1].plot(valid_sizes, numpy_speedup, 's-', label='NumPy vs Python',
                     color='#1E88E5', linewidth=2, markersize=7)
        axes[1].plot(valid_sizes, torch_speedup, '^-', label='PyTorch vs Python',
                     color='#43A047', linewidth=2, markersize=7)
        axes[1].axhline(y=1, color='gray', linestyle='--', alpha=0.5, label='No speedup')

        axes[1].set_xscale('log')
        axes[1].set_yscale('log')
        axes[1].set_xlabel('Problem Size')
        axes[1].set_ylabel('Speedup (x times faster)')
        axes[1].set_title(f'{title} \u2014 Speedup vs Python')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()


plot_scaling_comparison(dot_scaling_results, "Dot Product")

In [None]:
plot_scaling_comparison(elemwise_scaling_results, "Element-wise Multiply")

In [None]:
plot_scaling_comparison(matmul_scaling_results, "Matrix Multiplication")
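
On log-log axes, a power law (time proportional to `n**k`) appears as a straight line with slope `k`, so the slope of each curve above hints at how an implementation scales. The sketch below estimates that slope with a least-squares line fit. The timings are synthetic, generated from an exact cubic law as an assumption for illustration, and are not measurements from this notebook.

```python
import numpy as np

sizes = np.array([50, 100, 200, 500, 1_000], dtype=np.float64)

# Hypothetical timings for an operation whose cost grows with the cube of the size.
synthetic_times = 2.0e-9 * sizes**3

# Fit log(time) = slope * log(size) + intercept; the slope is the scaling exponent.
slope, intercept = np.polyfit(np.log(sizes), np.log(synthetic_times), deg=1)
print(f"Estimated scaling exponent: {slope:.2f}")
```

Applied to measured timings, the same fit gives only a rough exponent, because small sizes are dominated by fixed per-call overhead rather than by arithmetic.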

### 4.2 Comprehensive Comparison DataFrame

In [None]:
def build_comprehensive_table(
    dot_results: pd.DataFrame,
    elemwise_results: pd.DataFrame,
    matmul_results: pd.DataFrame,
) -> pd.DataFrame:
    """Build a comprehensive comparison table for all operations and sizes.

    Args:
        dot_results: Dot product benchmark results.
        elemwise_results: Element-wise multiply benchmark results.
        matmul_results: Matrix multiply benchmark results.

    Returns:
        Combined DataFrame with all results and speedups.
    """
    rows: list[dict[str, object]] = []

    # Every operation uses the same row format, so loop over (label, results) pairs
    labelled_results = [
        ("Dot Product", dot_results),
        ("Elem-wise Multiply", elemwise_results),
        ("Matrix Multiply", matmul_results),
    ]

    for operation_label, results_df in labelled_results:
        for _, row in results_df.iterrows():
            python_time = row["Python (s)"]
            numpy_time = row["NumPy (s)"]
            torch_time = row["PyTorch (s)"]
            numpy_speedup = python_time / numpy_time if not np.isnan(python_time) else float("nan")
            rows.append({
                "Operation": operation_label,
                "Size": row["Size"],
                "Python (s)": f"{python_time:.6f}" if not np.isnan(python_time) else "N/A",
                "NumPy (s)": f"{numpy_time:.6f}",
                "PyTorch (s)": f"{torch_time:.6f}",
                "NumPy Speedup": f"{numpy_speedup:.2f}x" if not np.isnan(numpy_speedup) else "N/A",
            })

    return pd.DataFrame(rows)


comprehensive_table = build_comprehensive_table(
    dot_scaling_results, elemwise_scaling_results, matmul_scaling_results
)
print("Comprehensive Benchmark Results:")
print(comprehensive_table.to_string(index=False))

### 4.3 PyTorch Overhead Analysis

For very small arrays, PyTorch can actually be *slower* than NumPy due to:

1. **Tensor creation overhead** — PyTorch allocates memory through its own allocator
2. **Dispatch overhead** — PyTorch routes operations through a dispatcher that checks for autograd, device type, etc.
3. **Autograd bookkeeping** — even when not computing gradients, the system checks whether it should
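
One rough way to see the fixed per-call cost is to time both libraries on vectors so small that the arithmetic itself is negligible. The sketch below uses the standard-library `timeit` module rather than this notebook's `benchmark_function`. The vector length of 8 and the repeat counts are arbitrary choices, and the numbers depend heavily on the machine and library builds.

```python
import timeit

import numpy as np
import torch

tiny_np_a = np.random.randn(8)
tiny_np_b = np.random.randn(8)
tiny_pt_a = torch.from_numpy(tiny_np_a)  # shares memory and keeps float64
tiny_pt_b = torch.from_numpy(tiny_np_b)

num_calls = 10_000

# Take the best of several repeats to reduce noise from other processes.
numpy_per_call = min(
    timeit.repeat(lambda: np.dot(tiny_np_a, tiny_np_b), number=num_calls, repeat=5)
) / num_calls
torch_per_call = min(
    timeit.repeat(lambda: torch.dot(tiny_pt_a, tiny_pt_b), number=num_calls, repeat=5)
) / num_calls

print(f"NumPy   per call: {numpy_per_call * 1e6:.2f} microseconds")
print(f"PyTorch per call: {torch_per_call * 1e6:.2f} microseconds")
```

With only eight multiply-adds per call, almost all of the measured time is overhead rather than computation.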

Let us measure where the crossover falls on this machine, if it occurs at all within the tested sizes.

In [None]:
def benchmark_overhead_crossover(sizes: list[int]) -> pd.DataFrame:
    """Time NumPy and PyTorch dot products at each size to locate the overhead crossover.

    Args:
        sizes: List of vector sizes to test.

    Returns:
        DataFrame comparing NumPy and PyTorch times at each size.
    """
    records: list[dict[str, object]] = []

    for size in sizes:
        np_a = np.random.randn(size)
        np_b = np.random.randn(size)
        pt_a = torch.tensor(np_a)
        pt_b = torch.tensor(np_b)

        numpy_time = benchmark_function(lambda: np.dot(np_a, np_b), num_timed_runs=10)["mean"]
        torch_time = benchmark_function(lambda: torch.dot(pt_a, pt_b), num_timed_runs=10)["mean"]

        records.append({
            "Size": size,
            "NumPy (s)": numpy_time,
            "PyTorch (s)": torch_time,
            "Ratio (PT/NP)": torch_time / numpy_time,
            "Faster": "NumPy" if numpy_time < torch_time else "PyTorch",
        })

    return pd.DataFrame(records)


overhead_sizes = [10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000]
overhead_results = benchmark_overhead_crossover(overhead_sizes)
print("NumPy vs PyTorch Overhead Crossover:")
print(overhead_results.to_string(index=False))

In [None]:
# Visualize the overhead crossover
fig, ax = plt.subplots(figsize=(8, 5))

ax.plot(overhead_results["Size"], overhead_results["NumPy (s)"],
        's-', label='NumPy', color='#1E88E5', linewidth=2, markersize=7)
ax.plot(overhead_results["Size"], overhead_results["PyTorch (s)"],
        '^-', label='PyTorch', color='#43A047', linewidth=2, markersize=7)

ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Vector Size')
ax.set_ylabel('Time (seconds, log scale)')
ax.set_title('NumPy vs PyTorch: Overhead Crossover for Dot Product')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 4.4 Memory Layout Performance Impact

In [None]:
def benchmark_contiguous_vs_noncontiguous(matrix_size: int) -> pd.DataFrame:
    """Benchmark operations on contiguous vs non-contiguous tensors.

    Args:
        matrix_size: Size of square matrix (n x n).

    Returns:
        DataFrame with timing results.
    """
    # Create a contiguous tensor
    contiguous_tensor = torch.randn(matrix_size, matrix_size)
    assert contiguous_tensor.is_contiguous(), "Should be contiguous"

    # Create a non-contiguous view via transpose
    noncontiguous_tensor = contiguous_tensor.T
    assert not noncontiguous_tensor.is_contiguous(), "Transpose should be non-contiguous"

    # Benchmark sum operation on both
    contiguous_time = benchmark_function(
        lambda: contiguous_tensor.sum(), num_timed_runs=10
    )["mean"]
    noncontiguous_time = benchmark_function(
        lambda: noncontiguous_tensor.sum(), num_timed_runs=10
    )["mean"]

    # Benchmark matrix multiply
    other_matrix = torch.randn(matrix_size, matrix_size)
    contiguous_matmul_time = benchmark_function(
        lambda: torch.matmul(contiguous_tensor, other_matrix), num_timed_runs=10
    )["mean"]
    noncontiguous_matmul_time = benchmark_function(
        lambda: torch.matmul(noncontiguous_tensor, other_matrix), num_timed_runs=10
    )["mean"]

    return pd.DataFrame([
        {"Operation": "Sum", "Contiguous (s)": contiguous_time,
         "Non-contiguous (s)": noncontiguous_time,
         "Slowdown": noncontiguous_time / contiguous_time},
        {"Operation": "MatMul", "Contiguous (s)": contiguous_matmul_time,
         "Non-contiguous (s)": noncontiguous_matmul_time,
         "Slowdown": noncontiguous_matmul_time / contiguous_matmul_time},
    ])


contiguity_size = 1000
contiguity_results = benchmark_contiguous_vs_noncontiguous(contiguity_size)
print(f"Contiguous vs Non-contiguous Performance ({contiguity_size}x{contiguity_size} matrix):")
print(contiguity_results.to_string(index=False))
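
To see why the transpose is non-contiguous, it helps to inspect strides directly on a small tensor. The sketch below is illustrative only (the 2x3 tensor and the variable names are chosen for this example). It shows that `.T` reuses the same storage with swapped strides, that `.view()` can fail on such a tensor, and that `.contiguous()` produces a row-major copy.

```python
import torch

base = torch.arange(6).reshape(2, 3)  # rows are stored one after another
transposed = base.T                   # same storage, strides swapped

print(base.stride(), base.is_contiguous())              # (3, 1) True
print(transposed.stride(), transposed.is_contiguous())  # (1, 3) False
shares_storage = transposed.data_ptr() == base.data_ptr()

# .view() requires compatible strides, so flattening the transposed view fails...
try:
    transposed.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True

# ...whereas .contiguous() first copies the data into row-major order.
compact = transposed.contiguous()
print(compact.stride(), compact.view(-1).tolist())  # (2, 1) [0, 3, 1, 4, 2, 5]
```

`reshape` would also succeed on the transposed tensor, because it falls back to copying when a view is not possible.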

In [None]:
# Visualize contiguity impact
fig, ax = plt.subplots(figsize=(8, 5))

operations = contiguity_results["Operation"]
contiguous_times = contiguity_results["Contiguous (s)"]
noncontiguous_times = contiguity_results["Non-contiguous (s)"]

x_positions = np.arange(len(operations))
bar_width = 0.35

bars1 = ax.bar(x_positions - bar_width / 2, contiguous_times,
               bar_width, label='Contiguous', color='#1E88E5', edgecolor='black')
bars2 = ax.bar(x_positions + bar_width / 2, noncontiguous_times,
               bar_width, label='Non-contiguous', color='#FF9800', edgecolor='black')

ax.set_ylabel('Time (seconds)')
ax.set_title(f'Contiguous vs Non-contiguous Tensor Performance\n({contiguity_size}x{contiguity_size} matrix)')
ax.set_xticks(x_positions)
ax.set_xticklabels(operations)
ax.legend()

# Add slowdown labels
for idx, (bar1, bar2) in enumerate(zip(bars1, bars2)):
    slowdown = float(contiguity_results.iloc[idx]["Slowdown"])
    ax.text(bar2.get_x() + bar2.get_width() / 2, bar2.get_height() * 1.02,
            f'{slowdown:.2f}x', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

### 4.5 Summary Visualization

In [None]:
def plot_speedup_summary(dot_df: pd.DataFrame, elemwise_df: pd.DataFrame) -> None:
    """Plot a summary of speedups across operations and sizes.

    Args:
        dot_df: Dot product scaling results.
        elemwise_df: Element-wise multiply scaling results.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Filter to sizes where Python was benchmarked
    dot_valid = dot_df[~dot_df["Python (s)"].isna()]
    elem_valid = elemwise_df[~elemwise_df["Python (s)"].isna()]

    # Dot product speedups
    if len(dot_valid) > 0:
        numpy_speedup = dot_valid["Python (s)"] / dot_valid["NumPy (s)"]
        torch_speedup = dot_valid["Python (s)"] / dot_valid["PyTorch (s)"]
        sizes = dot_valid["Size"]

        axes[0].bar(np.arange(len(sizes)) - 0.2, numpy_speedup, 0.4,
                    label='NumPy vs Python', color='#1E88E5', edgecolor='black')
        axes[0].bar(np.arange(len(sizes)) + 0.2, torch_speedup, 0.4,
                    label='PyTorch vs Python', color='#43A047', edgecolor='black')
        axes[0].set_xticks(np.arange(len(sizes)))
        axes[0].set_xticklabels([str(s) for s in sizes], rotation=45)
        axes[0].set_xlabel('Vector Size')
        axes[0].set_ylabel('Speedup (x times faster)')
        axes[0].set_title('Dot Product Speedup')
        axes[0].legend()
        axes[0].grid(axis='y', alpha=0.3)

    # Element-wise speedups
    if len(elem_valid) > 0:
        numpy_speedup = elem_valid["Python (s)"] / elem_valid["NumPy (s)"]
        torch_speedup = elem_valid["Python (s)"] / elem_valid["PyTorch (s)"]
        sizes = elem_valid["Size"]

        axes[1].bar(np.arange(len(sizes)) - 0.2, numpy_speedup, 0.4,
                    label='NumPy vs Python', color='#1E88E5', edgecolor='black')
        axes[1].bar(np.arange(len(sizes)) + 0.2, torch_speedup, 0.4,
                    label='PyTorch vs Python', color='#43A047', edgecolor='black')
        axes[1].set_xticks(np.arange(len(sizes)))
        axes[1].set_xticklabels([str(s) for s in sizes], rotation=45)
        axes[1].set_xlabel('Vector Size')
        axes[1].set_ylabel('Speedup (x times faster)')
        axes[1].set_title('Element-wise Multiply Speedup')
        axes[1].legend()
        axes[1].grid(axis='y', alpha=0.3)

    plt.tight_layout()
    plt.show()


plot_speedup_summary(dot_scaling_results, elemwise_scaling_results)

In [None]:
# Broadcasting benchmark plot
broadcast_suite.plot_results(
    f"Broadcasting: Add Bias ({broadcast_num_samples} samples x {broadcast_num_features} features)"
)

---
## Part 5 — Summary & Lessons Learned

### Key Takeaways

1. **Vectorized operations are typically 10x to 1000x faster than Python loops** for numerical computation; the exact factor depends on the operation, the array size, and the hardware. The speedup grows with problem size because vectorized libraries amortize their fixed overhead (dispatch, memory allocation) across more useful work.

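2. **NumPy and PyTorch agree to within floating-point tolerance when their dtypes match**, but they have different performance profiles. NumPy tends to have lower per-call overhead for small arrays, while PyTorch is usually competitive at larger sizes and adds GPU support. Keep in mind that `torch.tensor` applied to a list of Python floats defaults to float32, whereas `np.array` defaults to float64.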

3. **Broadcasting eliminates the need for explicit loops** when combining arrays of different shapes. The three broadcasting rules (align shapes from the right and pad missing leading dimensions with 1s, stretch size-1 dimensions, fail if any remaining pair of sizes differs) apply identically in NumPy and PyTorch.

4. **Memory layout (contiguous, strided) affects performance.** Understanding strides and contiguity prevents subtle bugs (e.g., passing a non-contiguous tensor to an operation that requires contiguous memory) and helps you reason about cache efficiency.

5. **PyTorch tensors mirror much of NumPy's API and add GPU support and autograd.** The autograd system (covered in Module 5) automatically computes gradients through differentiable tensor operations, enabling neural network training.
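
As a quick check of how closely the two libraries agree, the sketch below computes the same dot product in NumPy and PyTorch. The seed, vector length, and variable names are arbitrary choices for this example. With float64 inputs on both sides, the results should agree to within floating-point tolerance; a float32 tensor, which is what `torch.tensor` produces from a list of Python floats, typically differs in the low-order digits.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)
vec_a = rng.standard_normal(1_000)
vec_b = rng.standard_normal(1_000)

numpy_result = float(np.dot(vec_a, vec_b))

# torch.from_numpy keeps the float64 dtype, so this is a like-for-like comparison.
torch_result_f64 = torch.dot(torch.from_numpy(vec_a), torch.from_numpy(vec_b)).item()

# Tensors built from Python lists default to float32, which carries less precision.
torch_a_f32 = torch.tensor(vec_a.tolist())
torch_b_f32 = torch.tensor(vec_b.tolist())
torch_result_f32 = torch.dot(torch_a_f32, torch_b_f32).item()

print("float32 tensor dtype:", torch_a_f32.dtype)
print("float64 results close:", np.isclose(numpy_result, torch_result_f64))
print("float32 absolute difference:", abs(numpy_result - torch_result_f32))
```

Comparing with a tolerance (`np.isclose`, `np.allclose`, or `torch.allclose`) rather than `==` is the safer habit, since summation order alone can change the last few bits.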

### What's Next

**1-02 (Advanced NumPy & PyTorch Operations)** builds on the tensor foundations established here with reshape, view, transpose, einsum notation, and advanced indexing — the manipulation operations you will use in every notebook from here forward.