# Module 01 — Mathematical & Programming Foundations
## 1-02: Advanced NumPy & PyTorch Operations

**Objective:** Master tensor manipulation operations — stacking, reshaping, transposing, matrix multiplication, einsum notation, and advanced indexing — that form the building blocks of every ML algorithm.

**Prerequisites:** 1-01 (Python, NumPy & Tensor Speed)

## Part 0 — Setup & Prerequisites

This notebook covers the advanced tensor manipulation operations that ML practitioners use daily. We will explore stacking, reshaping, transposing, matrix multiplication, einsum notation, advanced indexing, and in-place aliasing — all in both NumPy and PyTorch. These skills are prerequisites for implementing attention mechanisms (Module 8), convolutional operations (Module 6), and any from-scratch algorithm that requires careful tensor manipulation.

In [None]:
# ── Imports ──────────────────────────────────────────────────────────────────
import sys
import warnings
warnings.filterwarnings("ignore")

import random
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

print(f"Python: {sys.version.split()[0]}")
print(f"NumPy: {np.__version__}")
print(f"PyTorch: {torch.__version__}")

In [None]:
# ── Reproducibility ─────────────────────────────────────────────────────────
SEED = 1103
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

In [None]:
# ── Configuration ────────────────────────────────────────────────────────────
NUM_TIMING_RUNS = 50          # Number of timing iterations for benchmarks
SMALL_DIM = 4                 # Small dimension for demonstration
MEDIUM_DIM = 64               # Medium dimension for practical examples
LARGE_DIM = 256               # Larger dimension for performance tests

### Synthetic Data for Demonstrations

We generate a variety of synthetic arrays and tensors that we will use throughout the notebook. Using synthetic data keeps us focused on the operations themselves rather than data loading.

In [None]:
# ── Generate synthetic data ──────────────────────────────────────────────────
# Small arrays for step-by-step demonstrations
array_1d = np.random.randn(6)
array_2d = np.random.randn(3, 4)
array_3d = np.random.randn(2, 3, 4)

# Equivalent PyTorch tensors
tensor_1d = torch.from_numpy(array_1d.copy())
tensor_2d = torch.from_numpy(array_2d.copy())
tensor_3d = torch.from_numpy(array_3d.copy())

# Feature vectors for stacking demonstration
NUM_SAMPLES = 5
FEATURE_DIM = 8
feature_vectors = [np.random.randn(FEATURE_DIM) for _ in range(NUM_SAMPLES)]

# Larger matrices for performance demonstrations
matrix_a = np.random.randn(LARGE_DIM, LARGE_DIM)
matrix_b = np.random.randn(LARGE_DIM, LARGE_DIM)

print(f"array_1d shape: {array_1d.shape}")
print(f"array_2d shape: {array_2d.shape}")
print(f"array_3d shape: {array_3d.shape}")
print(f"Feature vectors: {NUM_SAMPLES} vectors of dim {FEATURE_DIM}")
print(f"Large matrices: {matrix_a.shape}")

---
## Part 1 — Tensor Operations from Scratch

We systematically cover the seven families of tensor operations that every ML practitioner needs: stacking/concatenation, reshaping, transposing, matrix multiplication, einsum, advanced indexing, and in-place aliasing.

### 1.1 Stacking and Concatenation

Stacking and concatenation combine multiple arrays into one. The key distinction is:

- **Concatenation** joins arrays along an **existing** axis.
- **Stacking** joins arrays along a **new** axis.

In NumPy: `np.concatenate` (existing axis) and `np.stack` (new axis), plus the shortcuts `np.vstack` and `np.hstack`  
In PyTorch: `torch.cat` (concatenate) vs `torch.stack` (stack along new dimension)
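
As a quick illustration of the existing-axis versus new-axis distinction, the sketch below uses `np.concatenate` and `np.stack` on two small vectors. The variable names are illustrative only and are not used elsewhere in the notebook.

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([4, 5, 6])

# Concatenation keeps the number of axes: two (3,) arrays become one (6,) array.
joined = np.concatenate([first, second], axis=0)

# Stacking adds an axis: two (3,) arrays become a (2, 3) or a (3, 2) array,
# depending on where the new axis is placed.
stacked_rows = np.stack([first, second], axis=0)
stacked_cols = np.stack([first, second], axis=1)

print(joined.shape, stacked_rows.shape, stacked_cols.shape)
```

For 1D inputs, `np.stack(..., axis=0)` produces the same array as `np.vstack`.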

In [None]:
# ── NumPy: vstack, hstack, concatenate ──────────────────────────────────────
row_a = np.array([1, 2, 3])
row_b = np.array([4, 5, 6])

# vstack: stack vertically (along axis=0) — rows become rows of a 2D array
vstacked = np.vstack([row_a, row_b])
print(f"vstack result shape: {vstacked.shape}")
print(f"vstack result:\n{vstacked}\n")

# hstack: stack horizontally (along axis=1 for 2D, concatenate for 1D)
hstacked = np.hstack([row_a, row_b])
print(f"hstack result shape: {hstacked.shape}")
print(f"hstack result: {hstacked}\n")

# concatenate with explicit axis parameter
mat_a = np.random.randn(2, 3)
mat_b = np.random.randn(2, 3)
concat_axis0 = np.concatenate([mat_a, mat_b], axis=0)
concat_axis1 = np.concatenate([mat_a, mat_b], axis=1)
print(f"Concatenate axis=0: {mat_a.shape} + {mat_b.shape} -> {concat_axis0.shape}")
print(f"Concatenate axis=1: {mat_a.shape} + {mat_b.shape} -> {concat_axis1.shape}")

In [None]:
# ── PyTorch: torch.cat vs torch.stack ────────────────────────────────────────
vec_a = torch.tensor([1.0, 2.0, 3.0])
vec_b = torch.tensor([4.0, 5.0, 6.0])

# torch.cat: concatenate along an EXISTING dimension
catted = torch.cat([vec_a, vec_b], dim=0)
print(f"torch.cat (1D): {vec_a.shape} + {vec_b.shape} -> {catted.shape}")
print(f"  Result: {catted}\n")

# torch.stack: create a NEW dimension and stack along it
stacked = torch.stack([vec_a, vec_b], dim=0)
print(f"torch.stack dim=0: {vec_a.shape} + {vec_b.shape} -> {stacked.shape}")
print(f"  Result:\n{stacked}\n")

stacked_dim1 = torch.stack([vec_a, vec_b], dim=1)
print(f"torch.stack dim=1: {vec_a.shape} + {vec_b.shape} -> {stacked_dim1.shape}")
print(f"  Result:\n{stacked_dim1}")

In [None]:
# ── Practical example: Stacking feature vectors into a batch matrix ──────────
# In ML, we frequently collect individual feature vectors and need to form a batch
torch_features = [torch.randn(FEATURE_DIM) for _ in range(NUM_SAMPLES)]

# Using torch.stack to create a (num_samples, feature_dim) batch matrix
batch_matrix = torch.stack(torch_features, dim=0)
assert batch_matrix.shape == (NUM_SAMPLES, FEATURE_DIM), (
    f"Expected ({NUM_SAMPLES}, {FEATURE_DIM}), got {batch_matrix.shape}"
)
print(f"Individual feature shape: {torch_features[0].shape}")
print(f"Batch matrix shape:       {batch_matrix.shape}")
print(f"Batch matrix (first 2 rows):\n{batch_matrix[:2]}")

### 1.2 Reshape vs View

Both `reshape` and `view` change the shape of a tensor without changing the data. The critical difference lies in **memory layout**:

| Operation | Library | Behavior |
|-----------|---------|----------|
| `np.reshape` | NumPy | Returns a view if possible, a copy otherwise |
| `tensor.view()` | PyTorch | Never copies; **requires** a stride-compatible layout (any contiguous tensor qualifies) and raises an error otherwise |
| `tensor.reshape()` | PyTorch | Works like `view` when possible, copies when needed |

A tensor is **contiguous** when its elements are stored in a single, unbroken block of memory in row-major (C) order. Operations like `transpose` change strides without moving data, making the tensor non-contiguous.
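
To make the stride idea concrete, here is a minimal sketch that computes the strides a row-major layout would have and compares them with what PyTorch reports. The helper `row_major_strides` is a hypothetical name introduced for this illustration, and the comparison ignores edge cases such as size-1 dimensions.

```python
import numpy as np
import torch


def row_major_strides(shape):
    """Element strides of a C-ordered (row-major) array with this shape."""
    strides = []
    step = 1
    for dim_size in reversed(shape):
        strides.append(step)
        step *= dim_size
    return tuple(reversed(strides))


grid = torch.arange(12).view(3, 4)

# A freshly created tensor matches the row-major strides for its shape.
print(grid.stride(), row_major_strides(grid.shape))

# After a transpose the strides are swapped, so they no longer match.
print(grid.t().stride(), row_major_strides(grid.t().shape))

# NumPy reports strides in bytes rather than in elements.
grid_np = np.arange(12, dtype=np.int64).reshape(3, 4)
print(grid_np.strides, grid_np.T.strides)
```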

In [None]:
# ── NumPy reshape ────────────────────────────────────────────────────────────
original_np = np.arange(12)
print(f"Original: {original_np}, shape={original_np.shape}\n")

# Reshape to 3x4 — returns a view (shared memory)
reshaped_np = original_np.reshape(3, 4)
print(f"Reshaped to (3,4):\n{reshaped_np}")
print(f"Shares memory: {np.shares_memory(original_np, reshaped_np)}")

# Modify the view — original changes too!
reshaped_np[0, 0] = 999
print(f"\nAfter modifying reshaped[0,0] = 999:")
print(f"  Original[0] = {original_np[0]} (also changed!)")
original_np[0] = 0  # Reset

# Shape inference with -1: NumPy infers the unknown dimension
auto_reshaped = original_np.reshape(2, -1)  # -1 means "infer this dimension"
print(f"\nreshape(2, -1): {original_np.shape} -> {auto_reshaped.shape}")

In [None]:
# ── PyTorch view vs reshape ──────────────────────────────────────────────────
original_pt = torch.arange(12, dtype=torch.float32)
print(f"Original tensor: {original_pt}")
print(f"Is contiguous: {original_pt.is_contiguous()}\n")

# view: works on contiguous tensors — returns a view (shared memory)
viewed = original_pt.view(3, 4)
print(f"view(3, 4):\n{viewed}")
print(f"Same storage: {viewed.data_ptr() == original_pt.data_ptr()}\n")

# Transpose makes a tensor non-contiguous
transposed = viewed.t()  # (3,4) -> (4,3)
print(f"After transpose: shape={transposed.shape}")
print(f"Is contiguous: {transposed.is_contiguous()}")

# view FAILS on non-contiguous tensor
try:
    transposed.view(12)
except RuntimeError as error:
    print(f"\nview() fails on non-contiguous tensor:")
    print(f"  Error: {error}")

# Fix 1: use .contiguous() first, then view
fixed_view = transposed.contiguous().view(12)
print(f"\n.contiguous().view(12): {fixed_view.shape}")

# Fix 2: use .reshape() — handles both contiguous and non-contiguous
reshaped_pt = transposed.reshape(12)
print(f".reshape(12):           {reshaped_pt.shape}")

In [None]:
def demonstrate_contiguity(tensor: torch.Tensor, name: str) -> None:
    """Print contiguity info, strides, and storage details for a tensor.

    Args:
        tensor: The tensor to inspect.
        name: A descriptive name for display.
    """
    print(f"{name}:")
    print(f"  Shape:      {tuple(tensor.shape)}")
    print(f"  Strides:    {tensor.stride()}")
    print(f"  Contiguous: {tensor.is_contiguous()}")
    print(f"  Data ptr:   {tensor.data_ptr()}")
    print()


base = torch.arange(12, dtype=torch.float32).view(3, 4)
demonstrate_contiguity(base, "Original (3x4)")
demonstrate_contiguity(base.t(), "Transposed (4x3)")
demonstrate_contiguity(base.t().contiguous(), "Transposed + contiguous()")

### 1.3 Transpose Mechanics

Transposing swaps dimensions of a tensor. Under the hood, it only changes the **strides** — the actual data in memory stays put. This is why transposed tensors become non-contiguous.

| NumPy | PyTorch | Description |
|-------|---------|-------------|
| `.T` | `.T` | Reverse all dimensions (PyTorch deprecates `.T` on tensors that are not 2D; prefer `.permute` there) |
| `np.transpose(a, axes)` | `.permute(*dims)` | Arbitrary axis reordering |
| `np.swapaxes(a, ax1, ax2)` | `.transpose(dim0, dim1)` | Swap exactly two axes |
| — | `.t()` | 2D-only transpose |
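
The correspondences in the table can be checked directly. The sketch below builds the same data in both libraries and confirms that each NumPy call and its PyTorch counterpart produce identical results, and that the transposed array is a view rather than a copy. Variable names are illustrative.

```python
import numpy as np
import torch

cube_np = np.arange(24).reshape(2, 3, 4)
cube_pt = torch.from_numpy(cube_np)

# np.transpose(a, axes) corresponds to tensor.permute(*dims)
same_permute = np.array_equal(
    np.transpose(cube_np, (1, 0, 2)), cube_pt.permute(1, 0, 2).numpy()
)

# np.swapaxes(a, ax1, ax2) corresponds to tensor.transpose(dim0, dim1)
same_swap = np.array_equal(
    np.swapaxes(cube_np, 0, 2), cube_pt.transpose(0, 2).numpy()
)

# Transposing only rewrites strides, so the result still shares memory.
shares_buffer = np.shares_memory(cube_np, np.swapaxes(cube_np, 0, 2))

print(same_permute, same_swap, shares_buffer)
```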

In [None]:
# ── NumPy transpose operations ───────────────────────────────────────────────
mat = np.arange(12).reshape(3, 4)
print(f"Original (3x4):\n{mat}\n")

# .T — simple transpose
print(f".T (4x3):\n{mat.T}\n")

# For 3D arrays, .T reverses ALL dimensions
arr_3d = np.arange(24).reshape(2, 3, 4)
print(f"3D shape: {arr_3d.shape}")
print(f"3D .T shape: {arr_3d.T.shape}  (reversed: 4,3,2)")

# np.transpose with explicit axis order
permuted_np = np.transpose(arr_3d, (1, 0, 2))  # swap first two axes
print(f"np.transpose(arr, (1,0,2)): {arr_3d.shape} -> {permuted_np.shape}")

# np.swapaxes — swap exactly two axes
swapped_np = np.swapaxes(arr_3d, 0, 2)
print(f"np.swapaxes(arr, 0, 2): {arr_3d.shape} -> {swapped_np.shape}")

In [None]:
# ── PyTorch transpose operations ─────────────────────────────────────────────
tensor_mat = torch.arange(12, dtype=torch.float32).view(3, 4)
print(f"Original (3x4):\n{tensor_mat}\n")

# .t() — 2D only
print(f".t() (4x3):\n{tensor_mat.t()}\n")

# .transpose(dim0, dim1) — swap exactly two dimensions
tensor_3d = torch.arange(24, dtype=torch.float32).view(2, 3, 4)
transposed_pt = tensor_3d.transpose(0, 2)  # swap dim 0 and dim 2
print(f".transpose(0, 2): {tuple(tensor_3d.shape)} -> {tuple(transposed_pt.shape)}")

# .permute(*dims) — arbitrary reordering (most flexible)
permuted_pt = tensor_3d.permute(1, 0, 2)  # (2,3,4) -> (3,2,4)
print(f".permute(1, 0, 2): {tuple(tensor_3d.shape)} -> {tuple(permuted_pt.shape)}")

# Verify non-contiguity after transpose
print(f"\nOriginal contiguous:   {tensor_3d.is_contiguous()}")
print(f"Transposed contiguous: {transposed_pt.is_contiguous()}")
print(f"Permuted contiguous:   {permuted_pt.is_contiguous()}")

In [None]:
# ── Practical: Transposing for matrix multiplication ─────────────────────────
# Computing X^T X (common in linear regression normal equation)
data_matrix = torch.randn(100, 10)  # 100 samples, 10 features
xtx = data_matrix.t() @ data_matrix  # (10, 100) @ (100, 10) -> (10, 10)
assert xtx.shape == (10, 10), f"Expected (10, 10), got {xtx.shape}"
print(f"X shape: {tuple(data_matrix.shape)}")
print(f"X^T X shape: {tuple(xtx.shape)}")
print(f"X^T X is symmetric: {torch.allclose(xtx, xtx.t())}")

### 1.4 Matrix Multiplication

Matrix multiplication is the most fundamental operation in ML. Different functions handle different cases:

**NumPy:**
- `np.dot(a, b)` — dot product for 1D, matrix multiply for 2D, sum-product over last/second-to-last axes for higher dims
- `np.matmul(a, b)` / `a @ b` — matrix multiply with **broadcasting** for 3D+ tensors

**PyTorch:**
- `torch.mm(a, b)` — strict 2D matrix multiply only
- `torch.matmul(a, b)` / `a @ b` — matrix multiply with broadcasting
- `torch.bmm(a, b)` — batched matrix multiply (both inputs must be 3D)
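
The practical difference between these three PyTorch functions shows up when one operand has no batch dimension. The following sketch assumes a single weight matrix shared across a batch; the names `batch`, `weights`, and `expanded` are illustrative.

```python
import torch

batch = torch.randn(2, 3, 4)
weights = torch.randn(4, 5)  # one matrix shared by every batch element

# torch.matmul broadcasts the 2D operand across the batch dimension.
broadcasted = torch.matmul(batch, weights)  # (2, 3, 4) @ (4, 5) -> (2, 3, 5)

# torch.mm accepts only 2D inputs, so a 3D operand is rejected.
try:
    torch.mm(batch, weights)
    mm_accepts_3d = True
except RuntimeError:
    mm_accepts_3d = False

# torch.bmm needs both operands to be 3D, so the shared matrix is expanded
# explicitly (expand creates a view and does not copy the data).
expanded = weights.unsqueeze(0).expand(batch.shape[0], -1, -1)
via_bmm = torch.bmm(batch, expanded)

print(tuple(broadcasted.shape), mm_accepts_3d)
print(torch.allclose(broadcasted, via_bmm, atol=1e-5))
```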

For two matrices $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$:

$$\mathbf{C}_{ij} = \sum_{k=1}^{n} \mathbf{A}_{ik} \mathbf{B}_{kj}$$
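
The formula translates directly into three nested loops. The sketch below is meant only to mirror the summation index by index; it is far slower than `a @ b` and `matmul_loops` is a hypothetical helper name.

```python
import numpy as np


def matmul_loops(a, b):
    """Compute C[i, j] = sum_k A[i, k] * B[k, j] with explicit loops."""
    m, n = a.shape
    n_b, p = b.shape
    if n != n_b:
        raise ValueError(f"Inner dimensions differ: {n} vs {n_b}")
    c = np.zeros((m, p))
    for i in range(m):          # rows of A
        for j in range(p):      # columns of B
            for k in range(n):  # shared inner dimension
                c[i, j] += a[i, k] * b[k, j]
    return c


rng = np.random.default_rng(0)
left = rng.standard_normal((3, 4))
right = rng.standard_normal((4, 2))

looped = matmul_loops(left, right)
print(looped.shape, np.allclose(looped, left @ right))
```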

In [None]:
# ── NumPy: dot vs matmul vs @ ───────────────────────────────────────────────
vec_x = np.array([1.0, 2.0, 3.0])
vec_y = np.array([4.0, 5.0, 6.0])

# 1D: dot product
print(f"np.dot (1D): {np.dot(vec_x, vec_y)}")
print(f"np.matmul (1D): {np.matmul(vec_x, vec_y)}")
print(f"@ operator (1D): {vec_x @ vec_y}\n")

# 2D: matrix multiplication — all three are equivalent
mat_x = np.random.randn(3, 4)
mat_y = np.random.randn(4, 2)
print(f"np.dot (2D):    shape = {np.dot(mat_x, mat_y).shape}")
print(f"np.matmul (2D): shape = {np.matmul(mat_x, mat_y).shape}")
print(f"@ operator (2D): shape = {(mat_x @ mat_y).shape}")
print(f"All equal: {np.allclose(np.dot(mat_x, mat_y), mat_x @ mat_y)}\n")

# 3D: dot and matmul DIFFER for higher-dimensional arrays
batch_x = np.random.randn(2, 3, 4)
batch_y = np.random.randn(2, 4, 5)

matmul_result = np.matmul(batch_x, batch_y)  # Broadcasts: (2,3,4) @ (2,4,5) -> (2,3,5)
dot_result = np.dot(batch_x, batch_y)  # Different behavior for 3D+!

print(f"np.matmul (3D): {batch_x.shape} @ {batch_y.shape} -> {matmul_result.shape}")
print(f"np.dot (3D):    {batch_x.shape} dot {batch_y.shape} -> {dot_result.shape}")
print(f"Note: np.dot gives a DIFFERENT shape for 3D tensors!")

In [None]:
# ── PyTorch: mm vs matmul vs bmm ────────────────────────────────────────────
# torch.mm: strictly 2D
pt_a = torch.randn(3, 4)
pt_b = torch.randn(4, 5)
mm_result = torch.mm(pt_a, pt_b)
print(f"torch.mm: {tuple(pt_a.shape)} x {tuple(pt_b.shape)} -> {tuple(mm_result.shape)}")

# torch.matmul: handles broadcasting for batched operations
batch_a = torch.randn(2, 3, 4)
batch_b = torch.randn(2, 4, 5)
matmul_result_pt = torch.matmul(batch_a, batch_b)
print(f"torch.matmul: {tuple(batch_a.shape)} x {tuple(batch_b.shape)} -> {tuple(matmul_result_pt.shape)}")

# torch.bmm: strictly batched 3D — both inputs must have same batch size
bmm_result = torch.bmm(batch_a, batch_b)
print(f"torch.bmm: {tuple(batch_a.shape)} x {tuple(batch_b.shape)} -> {tuple(bmm_result.shape)}")

# Verify matmul and bmm give same result for 3D inputs
print(f"\nmatmul == bmm: {torch.allclose(matmul_result_pt, bmm_result)}")

In [None]:
def batch_matmul_manual(
    tensor_a: torch.Tensor, tensor_b: torch.Tensor
) -> torch.Tensor:
    """Batch matrix multiply using explicit loops (for understanding).

    Multiplies each corresponding pair of matrices in the batch dimension.

    Args:
        tensor_a: Tensor of shape (batch_size, m, n).
        tensor_b: Tensor of shape (batch_size, n, p).

    Returns:
        Result tensor of shape (batch_size, m, p).
    """
    assert tensor_a.dim() == 3 and tensor_b.dim() == 3, "Both inputs must be 3D"
    assert tensor_a.shape[0] == tensor_b.shape[0], "Batch sizes must match"
    assert tensor_a.shape[2] == tensor_b.shape[1], "Inner dimensions must match"

    batch_size = tensor_a.shape[0]
    rows_m = tensor_a.shape[1]
    cols_p = tensor_b.shape[2]
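    # Match dtype and device of the inputs so the loop also works on GPU tensors
    result = torch.zeros(
        batch_size, rows_m, cols_p, dtype=tensor_a.dtype, device=tensor_a.device
    )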

    for batch_idx in range(batch_size):
        result[batch_idx] = torch.mm(tensor_a[batch_idx], tensor_b[batch_idx])

    return result


# Compare manual vs torch.bmm
test_a = torch.randn(4, 3, 5)
test_b = torch.randn(4, 5, 2)

manual_result = batch_matmul_manual(test_a, test_b)
bmm_result = torch.bmm(test_a, test_b)

print(f"Manual batch matmul shape: {tuple(manual_result.shape)}")
print(f"torch.bmm shape:           {tuple(bmm_result.shape)}")
print(f"Results match: {torch.allclose(manual_result, bmm_result, atol=1e-6)}")

### 1.5 Einsum — The Swiss Army Knife

Einstein summation (`einsum`) provides a compact, readable notation for expressing tensor operations. The key idea:

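- **Indices missing from the output** are summed over (contraction)
- **Output indices** specify which dimensions remain; an index shared by several inputs but kept in the output (such as a batch index) is multiplied element-wise, not summed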
- The `->` arrow separates input subscripts from the output subscript

For example, matrix multiplication $\mathbf{C}_{ik} = \sum_j \mathbf{A}_{ij} \mathbf{B}_{jk}$ is written as `'ij,jk->ik'`.

Einsum becomes critical in Module 8 (multi-head attention) and Module 12 (tensor decomposition).
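
NumPy offers the same notation through `np.einsum`. The sketch below shows three small cases and checks each against the conventional call; note that in `'ij->i'` the index `j` is summed simply because it is absent from the output. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mat_left = rng.standard_normal((3, 4))
mat_right = rng.standard_normal((4, 5))
vec = rng.standard_normal(4)

# j appears in both inputs and not in the output, so it is contracted.
product = np.einsum('ij,jk->ik', mat_left, mat_right)

# Matrix-vector product: contract the shared index j.
mat_vec = np.einsum('ij,j->i', mat_left, vec)

# j appears in one input only, but is still summed because the output omits it.
row_sums = np.einsum('ij->i', mat_left)

print(np.allclose(product, mat_left @ mat_right))
print(np.allclose(mat_vec, mat_left @ vec))
print(np.allclose(row_sums, mat_left.sum(axis=1)))
```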

In [None]:
# ── Einsum basics ────────────────────────────────────────────────────────────
vec_p = torch.tensor([1.0, 2.0, 3.0])
vec_q = torch.tensor([4.0, 5.0, 6.0])
mat_m = torch.randn(3, 4)
mat_n = torch.randn(4, 5)

# 1. Vector dot product: sum_i (p_i * q_i)
dot_einsum = torch.einsum('i,i->', vec_p, vec_q)
dot_explicit = torch.dot(vec_p, vec_q)
print(f"Dot product:")
print(f"  einsum 'i,i->':  {dot_einsum.item():.4f}")
print(f"  torch.dot:       {dot_explicit.item():.4f}\n")

# 2. Outer product: C_ij = p_i * q_j
outer_einsum = torch.einsum('i,j->ij', vec_p, vec_q)
outer_explicit = torch.outer(vec_p, vec_q)
print(f"Outer product:")
print(f"  einsum 'i,j->ij': shape={tuple(outer_einsum.shape)}")
print(f"  Match: {torch.allclose(outer_einsum, outer_explicit)}\n")

# 3. Matrix multiply: C_ik = sum_j A_ij * B_jk
mm_einsum = torch.einsum('ij,jk->ik', mat_m, mat_n)
mm_explicit = torch.mm(mat_m, mat_n)
print(f"Matrix multiply:")
print(f"  einsum 'ij,jk->ik': shape={tuple(mm_einsum.shape)}")
print(f"  Match: {torch.allclose(mm_einsum, mm_explicit)}\n")

# 4. Transpose: B_ji = A_ij
transpose_einsum = torch.einsum('ij->ji', mat_m)
print(f"Transpose:")
print(f"  einsum 'ij->ji': shape={tuple(transpose_einsum.shape)}")
print(f"  Match: {torch.allclose(transpose_einsum, mat_m.t())}\n")

# 5. Trace: sum_i A_ii
square_mat = torch.randn(4, 4)
trace_einsum = torch.einsum('ii->', square_mat)
trace_explicit = torch.trace(square_mat)
print(f"Trace:")
print(f"  einsum 'ii->':  {trace_einsum.item():.4f}")
print(f"  torch.trace:    {trace_explicit.item():.4f}")

In [None]:
# ── Batch matrix multiply via einsum ─────────────────────────────────────────
batch_p = torch.randn(8, 3, 4)  # 8 matrices of shape (3, 4)
batch_q = torch.randn(8, 4, 5)  # 8 matrices of shape (4, 5)

bmm_einsum = torch.einsum('bij,bjk->bik', batch_p, batch_q)
bmm_explicit = torch.bmm(batch_p, batch_q)

print(f"Batch matmul via einsum 'bij,bjk->bik':")
print(f"  Shape: {tuple(bmm_einsum.shape)}")
print(f"  Match: {torch.allclose(bmm_einsum, bmm_explicit, atol=1e-5)}")

In [None]:
# ── Attention-like operation preview (Module 8) ──────────────────────────────
# In multi-head attention, we compute: scores = Q @ K^T for each batch and head
# Shape: Q, K are (batch, heads, seq_len, d_k)
BATCH_SIZE_DEMO = 2
NUM_HEADS = 4
SEQ_LEN = 6
D_K = 8

query = torch.randn(BATCH_SIZE_DEMO, NUM_HEADS, SEQ_LEN, D_K)
key = torch.randn(BATCH_SIZE_DEMO, NUM_HEADS, SEQ_LEN, D_K)

# Using einsum: 'bhqd,bhkd->bhqk' — contract over d dimension
attention_scores_einsum = torch.einsum('bhqd,bhkd->bhqk', query, key)

# Explicit equivalent: Q @ K^T for each batch and head
attention_scores_explicit = torch.matmul(query, key.transpose(-2, -1))

print(f"Query shape: {tuple(query.shape)}")
print(f"Key shape:   {tuple(key.shape)}")
print(f"Attention scores shape (einsum): {tuple(attention_scores_einsum.shape)}")
print(f"Match explicit: {torch.allclose(attention_scores_einsum, attention_scores_explicit, atol=1e-5)}")
print(f"\nNote: einsum 'bhqd,bhkd->bhqk' is a compact way to express")
print(f"Q @ K^T across batch and head dimensions simultaneously.")

In [None]:
def compare_einsum_vs_explicit(
    num_runs: int = NUM_TIMING_RUNS,
) -> pd.DataFrame:
    """Compare execution time of einsum vs explicit operations.

    Benchmarks five common operations to show when einsum is competitive
    versus dedicated functions.

    Args:
        num_runs: Number of timing iterations for each operation.

    Returns:
        DataFrame with timing comparison results.
    """
    size = 128
    mat_x = torch.randn(size, size)
    mat_y = torch.randn(size, size)
    batch_x = torch.randn(16, size, size)
    batch_y = torch.randn(16, size, size)
    vec_x = torch.randn(size)
    vec_y = torch.randn(size)

    operations = [
        ("Dot product", "i,i->", [vec_x, vec_y],
         lambda: torch.dot(vec_x, vec_y)),
        ("Matrix multiply", "ij,jk->ik", [mat_x, mat_y],
         lambda: torch.mm(mat_x, mat_y)),
        ("Batch matmul", "bij,bjk->bik", [batch_x, batch_y],
         lambda: torch.bmm(batch_x, batch_y)),
        ("Trace", "ii->", [mat_x],
         lambda: torch.trace(mat_x)),
        ("Transpose", "ij->ji", [mat_x],
         lambda: mat_x.t()),
    ]

    results = []
    for op_name, subscripts, einsum_inputs, explicit_fn in operations:
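        # Warm-up calls so one-time setup cost is not charged to either side
        torch.einsum(subscripts, *einsum_inputs)
        explicit_fn()

        # Time einsum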
        start_time = time.perf_counter()
        for _ in range(num_runs):
            torch.einsum(subscripts, *einsum_inputs)
        einsum_time = (time.perf_counter() - start_time) / num_runs * 1000

        # Time explicit
        start_time = time.perf_counter()
        for _ in range(num_runs):
            explicit_fn()
        explicit_time = (time.perf_counter() - start_time) / num_runs * 1000

        results.append({
            "Operation": op_name,
            "Einsum (ms)": round(einsum_time, 4),
            "Explicit (ms)": round(explicit_time, 4),
            "Ratio (einsum/explicit)": round(einsum_time / max(explicit_time, 1e-9), 2),
        })

    return pd.DataFrame(results)


timing_df = compare_einsum_vs_explicit()
print("Einsum vs Explicit Operations Timing Comparison:")
print(timing_df.to_string(index=False))

### 1.6 Advanced Indexing

Beyond simple slicing (`a[0:3]`), NumPy and PyTorch support two powerful indexing modes:

1. **Boolean indexing** — use a boolean mask to select elements where the condition is `True`
2. **Fancy indexing** — use an integer array to select specific elements by their indices

These patterns are used constantly in ML: filtering samples by label, selecting top-k predictions, masking padded tokens, etc.
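
Two of the use cases just mentioned, top-k selection and padding masks, can be sketched in a few lines. The data below is made up for illustration, and `PAD_ID` is a hypothetical padding token id, not a value defined elsewhere in this notebook.

```python
import torch

logits = torch.tensor([[0.1, 2.0, -1.0, 0.5],
                       [1.5, 0.2, 0.3, 3.0]])

# Top-k predictions: torch.topk returns the values and their integer indices.
top_values, top_indices = torch.topk(logits, k=2, dim=1)

# Fancy indexing with those indices recovers the same values.
# rows has shape (2, 1) and broadcasts against top_indices of shape (2, 2).
rows = torch.arange(logits.shape[0]).unsqueeze(1)
gathered = logits[rows, top_indices]

# Boolean indexing: drop padded positions from a batch of token ids.
PAD_ID = 0  # hypothetical padding id for this sketch
token_ids = torch.tensor([[5, 7, 0, 0],
                          [3, 9, 4, 0]])
not_padding = token_ids != PAD_ID
lengths = not_padding.sum(dim=1)       # real tokens per sequence
real_tokens = token_ids[not_padding]   # flattened 1D tensor of kept tokens

print(top_indices)
print(lengths, real_tokens)
```

Boolean indexing always returns a flattened 1D result, which is why `lengths` is computed separately when the per-sequence structure matters.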

In [None]:
# ── Boolean indexing ─────────────────────────────────────────────────────────
data = np.array([3, -1, 4, -1, 5, 9, -2, 6, 5, 3])

# Boolean mask: True where condition holds
positive_mask = data > 0
print(f"Data:          {data}")
print(f"Positive mask: {positive_mask}")
print(f"Positive vals: {data[positive_mask]}\n")

# Compound conditions
range_mask = (data > 0) & (data < 6)
print(f"Between 0 and 6: {data[range_mask]}\n")

# PyTorch equivalent
tensor_data = torch.tensor(data, dtype=torch.float32)
torch_mask = tensor_data > 0
print(f"PyTorch masked_select: {torch.masked_select(tensor_data, torch_mask)}")
print(f"PyTorch boolean index: {tensor_data[torch_mask]}")

In [None]:
# ── Fancy indexing (integer array indexing) ──────────────────────────────────
data_2d = np.arange(20).reshape(4, 5)
print(f"Data (4x5):\n{data_2d}\n")

# Select specific rows
row_indices = np.array([0, 2, 3])
print(f"Rows [0, 2, 3]:\n{data_2d[row_indices]}\n")

# Select specific elements: (row_i, col_i) pairs
row_idx = np.array([0, 1, 2, 3])
col_idx = np.array([1, 3, 0, 4])
print(f"Elements at (row, col) pairs: {data_2d[row_idx, col_idx]}")

# PyTorch equivalents
tensor_2d = torch.arange(20, dtype=torch.float32).view(4, 5)
selected_rows = torch.index_select(tensor_2d, dim=0, index=torch.tensor([0, 2, 3]))
print(f"\ntorch.index_select (rows 0,2,3):\n{selected_rows}")

In [None]:
# ── torch.where — element-wise conditional selection ─────────────────────────
values = torch.randn(8)
print(f"Values: {values}")

# torch.where(condition, x, y) — select from x where True, y where False
clamped = torch.where(values > 0, values, torch.zeros_like(values))
print(f"ReLU via where: {clamped}")

# torch.where(condition) — returns indices where True (like np.where)
positive_indices = torch.where(values > 0)
print(f"Positive indices: {positive_indices[0]}")

In [None]:
# ── Practical: Selecting samples by class label ──────────────────────────────
NUM_SAMPLES_DEMO = 100
NUM_FEATURES_DEMO = 5
NUM_CLASSES = 3

# Simulate a dataset with features and labels
features = torch.randn(NUM_SAMPLES_DEMO, NUM_FEATURES_DEMO)
labels = torch.randint(0, NUM_CLASSES, (NUM_SAMPLES_DEMO,))

print(f"Features shape: {tuple(features.shape)}")
print(f"Labels shape:   {tuple(labels.shape)}")
print(f"Class distribution: {[(labels == cls).sum().item() for cls in range(NUM_CLASSES)]}\n")

# Select all samples belonging to class 1
class_1_mask = labels == 1
class_1_features = features[class_1_mask]
print(f"Class 1 samples: {class_1_features.shape[0]}")
print(f"Class 1 mean features: {class_1_features.mean(dim=0)}\n")

# Per-class statistics using boolean indexing
for class_idx in range(NUM_CLASSES):
    class_mask = labels == class_idx
    class_features = features[class_mask]
    print(f"Class {class_idx}: n={class_features.shape[0]}, "
          f"mean={class_features.mean():.4f}, std={class_features.std():.4f}")

### 1.7 In-Place Operations and Aliasing

**Aliasing** occurs when two variables point to the same underlying memory. Modifying one silently changes the other. This is a frequent source of subtle bugs.

**NumPy:** Views (from reshape, slicing, transpose) share memory with the original.  
**PyTorch:** In-place operations (those ending with `_`, like `.add_()`, `.mul_()`) modify the tensor directly. This can **break autograd** when the overwritten values are still needed to compute gradients in the backward pass.
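
Aliasing also crosses the library boundary: `torch.from_numpy` shares the NumPy buffer, whereas `torch.tensor` copies it. The sketch below shows an in-place PyTorch operation writing through to a NumPy array; variable names are illustrative.

```python
import numpy as np
import torch

source = np.zeros(3)
shared = torch.from_numpy(source)    # shares the NumPy buffer
independent = torch.tensor(source)   # copies the data

# The in-place add writes through to the NumPy array.
shared.add_(1)
print(source)        # modified by the PyTorch operation
print(independent)   # still all zeros

# A NumPy slice is also a view of the same buffer.
window = source[1:]
window[:] = 7
print(np.shares_memory(source, window))
print(source, shared)
```

This is the reason for calling `.copy()` before `torch.from_numpy` when the tensor and the array should evolve independently.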

In [None]:
# ── NumPy: Views share memory — aliasing bugs ────────────────────────────────
original = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
view_reshaped = original.reshape(2, 3)  # This is a VIEW, not a copy

print(f"Original:  {original}")
print(f"View (2x3):\n{view_reshaped}\n")

# Modifying the view changes the original!
view_reshaped[0, 0] = 999
print(f"After setting view[0,0] = 999:")
print(f"  Original: {original}  <- also changed!")
print(f"  View:     {view_reshaped[0]}\n")

# Use .copy() to break the alias
original = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
safe_copy = original.reshape(2, 3).copy()  # Independent copy
safe_copy[0, 0] = 999
print(f"With .copy():")
print(f"  Original: {original}  <- unchanged")
print(f"  Copy:     {safe_copy[0]}")

In [None]:
# ── PyTorch: In-place operations ─────────────────────────────────────────────
tensor_orig = torch.tensor([1.0, 2.0, 3.0])
print(f"Original: {tensor_orig}")
print(f"Data pointer: {tensor_orig.data_ptr()}\n")

# In-place add: modifies tensor_orig directly
tensor_orig.add_(10)  # equivalent to tensor_orig += 10
print(f"After .add_(10): {tensor_orig}")
print(f"Data pointer (same): {tensor_orig.data_ptr()}\n")

# Out-of-place add: creates a new tensor
tensor_new = tensor_orig.add(100)  # creates new tensor
print(f"After .add(100): {tensor_new}")
print(f"Original unchanged: {tensor_orig}")
print(f"Different pointers: {tensor_orig.data_ptr()} vs {tensor_new.data_ptr()}")

In [None]:
# ── Danger: In-place operations break autograd ───────────────────────────────
# This is critical when building neural networks (Modules 5+)

# Safe: out-of-place operation preserves the computation graph
param_safe = torch.tensor([2.0], requires_grad=True)
result_safe = param_safe * 3  # out-of-place
result_safe = result_safe + 1  # out-of-place
result_safe.backward()
print(f"Safe gradient: {param_safe.grad}")

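# Dangerous: in-place operation on a tensor whose value autograd saved.
# exp() keeps its output for the backward pass, so overwriting that output
# in place invalidates the graph. (An in-place add after `param * 3` would
# not fail, because that multiplication does not save its result.)
param_danger = torch.tensor([2.0], requires_grad=True)
result_danger = param_danger.exp()
try:
    result_danger.add_(1)  # in-place edit of a value needed for backward
    result_danger.backward()
    print(f"Danger gradient: {param_danger.grad}")
except RuntimeError as error:
    print(f"In-place operation error: {error}")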

In [None]:
def detect_aliasing(tensor_a: torch.Tensor, tensor_b: torch.Tensor) -> bool:
    """Check if two tensors share the same underlying storage.

    Args:
        tensor_a: First tensor.
        tensor_b: Second tensor.

    Returns:
        True if tensors share storage (are aliases), False otherwise.
    """
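    # untyped_storage() replaces the deprecated storage() accessor
    # (assumes PyTorch >= 2.0)
    return (
        tensor_a.untyped_storage().data_ptr()
        == tensor_b.untyped_storage().data_ptr()
    )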


# Demonstrate alias detection
base_tensor = torch.arange(12, dtype=torch.float32)
view_tensor = base_tensor.view(3, 4)
clone_tensor = base_tensor.clone()
reshape_contig = base_tensor.reshape(3, 4)  # contiguous -> view

print(f"base vs view:    alias = {detect_aliasing(base_tensor, view_tensor)}")
print(f"base vs clone:   alias = {detect_aliasing(base_tensor, clone_tensor)}")
print(f"base vs reshape: alias = {detect_aliasing(base_tensor, reshape_contig)}")

# Non-contiguous reshape may create a copy
transposed_tensor = view_tensor.t()  # non-contiguous
reshaped_noncontig = transposed_tensor.reshape(12)  # must copy
print(f"transposed vs reshaped(12): alias = {detect_aliasing(transposed_tensor, reshaped_noncontig)}")

---
## Part 2 — Putting It All Together

We combine the operations from Part 1 into a reusable `TensorToolkit` class with static methods for common ML tensor manipulation patterns.

In [None]:
class TensorToolkit:
    """Collection of utility methods for safe, common tensor operations.

    All methods are static and operate on PyTorch tensors. This class
    provides checked versions of operations that are commonly needed in
    ML pipelines: batch reshaping, einsum-based attention preview,
    bounds-checked indexing, and contiguity verification.
    """

    @staticmethod
    def batch_reshape(
        tensor: torch.Tensor, new_shape: tuple[int, ...]
    ) -> torch.Tensor:
        """Reshape a batched tensor, preserving the batch dimension.

        The first dimension (batch) is kept intact; only the remaining
        dimensions are reshaped according to new_shape.

        Args:
            tensor: Input tensor of shape (batch_size, ...).
            new_shape: Target shape for non-batch dimensions.
                Use -1 to infer one dimension.

        Returns:
            Reshaped tensor of shape (batch_size, *new_shape).
        """
        batch_size = tensor.shape[0]
        full_shape = (batch_size,) + new_shape
        return tensor.reshape(full_shape)

    @staticmethod
    def einsum_attention(
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
    ) -> torch.Tensor:
        """Compute scaled dot-product attention using einsum notation.

        This is a preview of the attention mechanism covered in Module 8.
        Computes: softmax(Q @ K^T / sqrt(d_k)) @ V.

        Args:
            query: Query tensor of shape (batch, heads, seq_len, d_k).
            key: Key tensor of shape (batch, heads, seq_len, d_k).
            value: Value tensor of shape (batch, heads, seq_len, d_v).

        Returns:
            Attention output of shape (batch, heads, seq_len, d_v).
        """
        d_k = query.shape[-1]
        scale = d_k ** 0.5

        # Q @ K^T scaled
        scores = torch.einsum('bhqd,bhkd->bhqk', query, key) / scale

        # Softmax over key dimension
        attention_weights = torch.softmax(scores, dim=-1)

        # Weighted sum of values
        output = torch.einsum('bhqk,bhkd->bhqd', attention_weights, value)
        return output

    @staticmethod
    def safe_index(
        tensor: torch.Tensor,
        indices: torch.Tensor,
        dim: int = 0,
    ) -> torch.Tensor:
        """Index a tensor along a dimension with bounds checking.

        Args:
            tensor: Input tensor to index.
            indices: Integer tensor of indices to select.
            dim: Dimension along which to index.

        Returns:
            Selected elements from the tensor.

        Raises:
            IndexError: If any index is out of bounds.
        """
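        max_idx = tensor.shape[dim]
        if indices.numel() == 0:
            # min()/max() are undefined on an empty tensor; nothing to check.
            return torch.index_select(tensor, dim=dim, index=indices)
        lo, hi = indices.min().item(), indices.max().item()
        if hi >= max_idx or lo < -max_idx:
            raise IndexError(
                f"Index out of bounds: indices range [{lo}, {hi}] "
                f"for dimension {dim} with size {max_idx}"
            )
        # index_select rejects negative indices, so wrap them to positive first.
        wrapped = torch.where(indices < 0, indices + max_idx, indices)
        return torch.index_select(tensor, dim=dim, index=wrapped)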

    @staticmethod
    def check_contiguous(tensor: torch.Tensor, fix: bool = True) -> torch.Tensor:
        """Check if a tensor is contiguous and optionally fix it.

        Args:
            tensor: Input tensor to check.
            fix: If True, return a contiguous copy when non-contiguous.
                If False, raise an error for non-contiguous tensors.

        Returns:
            The original tensor if contiguous, or a contiguous copy.

        Raises:
            RuntimeError: If tensor is non-contiguous and fix=False.
        """
        if tensor.is_contiguous():
            return tensor
        if fix:
            return tensor.contiguous()
        raise RuntimeError(
            f"Tensor with shape {tuple(tensor.shape)} and strides "
            f"{tensor.stride()} is not contiguous."
        )

In [None]:
# ── Sanity check: TensorToolkit ──────────────────────────────────────────────
toolkit = TensorToolkit()

# Test batch_reshape
batch_tensor = torch.randn(8, 3, 32, 32)  # (B, C, H, W) image batch
flat = toolkit.batch_reshape(batch_tensor, (-1,))
assert flat.shape == (8, 3072), f"Expected (8, 3072), got {flat.shape}"
print(f"batch_reshape: {tuple(batch_tensor.shape)} -> {tuple(flat.shape)} [PASS]")

# Test einsum_attention
test_query = torch.randn(2, 4, 6, 8)
test_key = torch.randn(2, 4, 6, 8)
test_value = torch.randn(2, 4, 6, 8)
attn_out = toolkit.einsum_attention(test_query, test_key, test_value)
assert attn_out.shape == (2, 4, 6, 8), f"Expected (2,4,6,8), got {attn_out.shape}"
print(f"einsum_attention: Q{tuple(test_query.shape)} -> {tuple(attn_out.shape)} [PASS]")

# Test safe_index
test_data = torch.randn(10, 5)
selected = toolkit.safe_index(test_data, torch.tensor([0, 3, 7]), dim=0)
assert selected.shape == (3, 5), f"Expected (3, 5), got {selected.shape}"
print(f"safe_index: rows [0,3,7] from (10,5) -> {tuple(selected.shape)} [PASS]")

# Test safe_index with out-of-bounds
try:
    toolkit.safe_index(test_data, torch.tensor([0, 15]), dim=0)
    print("safe_index: out-of-bounds NOT caught [FAIL]")
except IndexError:
    print("safe_index: out-of-bounds correctly caught [PASS]")

# Test check_contiguous
non_contig = torch.randn(3, 4).t()  # non-contiguous
fixed = toolkit.check_contiguous(non_contig, fix=True)
assert fixed.is_contiguous(), "Should be contiguous after fix"
print("check_contiguous: non-contiguous -> contiguous [PASS]")

try:
    toolkit.check_contiguous(non_contig, fix=False)
    print("check_contiguous: should have raised [FAIL]")
except RuntimeError:
    print("check_contiguous: correctly raised for fix=False [PASS]")
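
A shape check alone does not confirm that the einsum formulation computes the intended quantity. The sketch below re-implements scaled dot-product attention twice, once with einsum and once with explicit `transpose` and `@`, and compares the results. The names `attention_einsum` and `attention_matmul` are illustrative helpers and not part of `TensorToolkit`; inputs are assumed to have shape `(batch, heads, seq_len, d)`.

```python
import torch

torch.manual_seed(0)


def attention_einsum(q, k, v):
    """Scaled dot-product attention written with einsum (illustrative helper)."""
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / q.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", weights, v), weights


def attention_matmul(q, k, v):
    """Same computation with an explicit transpose and matmul."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


q = torch.randn(2, 4, 6, 8)
k = torch.randn(2, 4, 6, 8)
v = torch.randn(2, 4, 6, 8)

out_einsum, w_einsum = attention_einsum(q, k, v)
out_matmul, _ = attention_matmul(q, k, v)

print("Outputs match:", torch.allclose(out_einsum, out_matmul, atol=1e-5))
# Each query's weights over the keys should sum to 1 after the softmax.
print("Weights sum to 1:",
      torch.allclose(w_einsum.sum(dim=-1), torch.ones(2, 4, 6), atol=1e-5))
```

The einsum subscripts make the contracted index visible (`d` for the scores, `k` for the weighted sum), which is the main readability argument for this notation.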

---
## Part 3 — Application to ML Data Patterns

Now we apply these operations to realistic ML scenarios. Each example demonstrates a pattern you will encounter repeatedly when building models in later modules.

### 3.1 Image Batch Manipulation

In convolutional neural networks (Module 6), images are stored as tensors of shape $(B, C, H, W)$ — batch, channels, height, width. Many operations require flattening the spatial dimensions, e.g., before passing to a fully connected layer.

In [None]:
# ── Image batch: (B, C, H, W) -> (B, C*H*W) flattening ─────────────────────
BATCH = 16
CHANNELS = 3
HEIGHT = 32
WIDTH = 32

image_batch = torch.randn(BATCH, CHANNELS, HEIGHT, WIDTH)
print(f"Image batch shape (B,C,H,W): {tuple(image_batch.shape)}")

# Method 1: reshape
flat_reshape = image_batch.reshape(BATCH, -1)
print(f"Flattened (reshape):  {tuple(flat_reshape.shape)}")

# Method 2: view
flat_view = image_batch.view(BATCH, -1)
print(f"Flattened (view):     {tuple(flat_view.shape)}")

# Method 3: torch.flatten with start_dim
flat_flatten = torch.flatten(image_batch, start_dim=1)
print(f"Flattened (flatten):  {tuple(flat_flatten.shape)}")

# Verify all methods give the same result
assert torch.allclose(flat_reshape, flat_view)
assert torch.allclose(flat_reshape, flat_flatten)
assert flat_reshape.shape == (BATCH, CHANNELS * HEIGHT * WIDTH)
print("\nAll methods equivalent: True")
print(f"Feature dimension: {CHANNELS} * {HEIGHT} * {WIDTH} = {CHANNELS * HEIGHT * WIDTH}")
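
To connect the flattening step to the fully connected layer mentioned above, here is a minimal sketch that places `nn.Flatten` in front of an `nn.Linear`. The number of output classes (`num_classes = 10`) and the two-layer model are assumptions made only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Sizes mirror the (B, C, H, W) example; num_classes is an assumed value.
batch, channels, height, width = 16, 3, 32, 32
num_classes = 10

model = nn.Sequential(
    nn.Flatten(start_dim=1),  # (B, C, H, W) -> (B, C*H*W)
    nn.Linear(channels * height * width, num_classes),
)

images = torch.randn(batch, channels, height, width)
logits = model(images)
print(tuple(logits.shape))
```

Because `start_dim=1` leaves the batch dimension alone, the same module works for any batch size.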

### 3.2 Sequence Batch Manipulation

PyTorch's recurrent layers (Module 7) default to time-major input $(T, B, D)$ (time steps, batch, features), while data pipelines typically produce batch-major $(B, T, D)$. Converting between the two layouts is therefore a routine operation.

In [None]:
# ── Sequence batch: (B, T, D) <-> (T, B, D) ─────────────────────────────────
BATCH_SEQ = 8
TIME_STEPS = 20
EMBED_DIM = 64

# Batch-major format (common for data loading)
batch_major = torch.randn(BATCH_SEQ, TIME_STEPS, EMBED_DIM)
print(f"Batch-major (B,T,D): {tuple(batch_major.shape)}")

# Convert to time-major format for RNN processing
# Method 1: transpose
time_major_transpose = batch_major.transpose(0, 1)
print(f"Time-major (transpose): {tuple(time_major_transpose.shape)}")

# Method 2: permute
time_major_permute = batch_major.permute(1, 0, 2)
print(f"Time-major (permute):   {tuple(time_major_permute.shape)}")

# Verify equivalence
assert torch.allclose(time_major_transpose, time_major_permute)
assert time_major_transpose.shape == (TIME_STEPS, BATCH_SEQ, EMBED_DIM)
print("\nBoth methods equivalent: True")

# Note: transposed tensor is NOT contiguous
print(f"Time-major contiguous: {time_major_transpose.is_contiguous()}")
# If a downstream operation needs contiguous data:
time_major_contig = time_major_transpose.contiguous()
print(f"After .contiguous():   {time_major_contig.is_contiguous()}")
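
As a sketch of where the layout matters in practice, the example below feeds the same data to two `nn.RNN` layers that share weights: one with the default time-major layout and one constructed with `batch_first=True`. The sizes and the hidden dimension are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, steps, dim, hidden = 8, 20, 64, 32  # illustrative sizes

rnn_time_major = nn.RNN(input_size=dim, hidden_size=hidden)  # expects (T, B, D)
rnn_batch_major = nn.RNN(input_size=dim, hidden_size=hidden, batch_first=True)
rnn_batch_major.load_state_dict(rnn_time_major.state_dict())  # same weights

x_batch_major = torch.randn(batch, steps, dim)

with torch.no_grad():
    # Transpose to time-major and make the layout contiguous before the call.
    out_tm, _ = rnn_time_major(x_batch_major.transpose(0, 1).contiguous())
    out_bm, _ = rnn_batch_major(x_batch_major)

print(tuple(out_tm.shape), tuple(out_bm.shape))
print("Same values:", torch.allclose(out_tm.transpose(0, 1), out_bm, atol=1e-5))
```

Both layers compute the same recurrence; only the axis order of the input and output differs.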

### 3.3 Feature Matrix Operations with Boolean Indexing

Class-conditional analysis is common in ML: computing per-class statistics, visualizing class distributions, or implementing class-balanced sampling.

In [None]:
# ── Class-conditional analysis with boolean indexing ──────────────────────────
NUM_SAMPLES_FEAT = 200
NUM_FEATURES_FEAT = 10
NUM_CLASSES_FEAT = 4

# Generate synthetic feature matrix and labels
feature_matrix = torch.randn(NUM_SAMPLES_FEAT, NUM_FEATURES_FEAT)
class_labels = torch.randint(0, NUM_CLASSES_FEAT, (NUM_SAMPLES_FEAT,))

# Compute per-class means using boolean indexing
class_means = torch.zeros(NUM_CLASSES_FEAT, NUM_FEATURES_FEAT)
class_counts = torch.zeros(NUM_CLASSES_FEAT, dtype=torch.long)

for class_idx in range(NUM_CLASSES_FEAT):
    mask = class_labels == class_idx
    class_counts[class_idx] = mask.sum()
    class_means[class_idx] = feature_matrix[mask].mean(dim=0)

print(f"Feature matrix: {tuple(feature_matrix.shape)}")
print(f"Class counts: {class_counts.tolist()}")
print(f"\nPer-class mean of first 3 features:")
for class_idx in range(NUM_CLASSES_FEAT):
    means = class_means[class_idx, :3]
    print(f"  Class {class_idx} (n={class_counts[class_idx].item()}): "
          f"[{means[0]:.4f}, {means[1]:.4f}, {means[2]:.4f}]")
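
The per-class loop can also be written without Python-level iteration. The sketch below is one possible vectorized variant, assuming integer labels in `[0, num_classes)`: it builds a one-hot membership matrix and contracts it with the features via einsum. Variable names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_samples, num_features, num_classes = 200, 10, 4  # illustrative sizes
features = torch.randn(num_samples, num_features)
labels = torch.randint(0, num_classes, (num_samples,))

# membership[n, c] is 1 if sample n belongs to class c, else 0.
membership = F.one_hot(labels, num_classes=num_classes).to(features.dtype)

class_sums = torch.einsum("nc,nd->cd", membership, features)  # (C, D)
class_counts = membership.sum(dim=0)                           # (C,)
# clamp_min(1) avoids division by zero for a class with no samples.
class_means = class_sums / class_counts.clamp_min(1).unsqueeze(1)

# Cross-check against the boolean-mask loop.
loop_means = torch.stack(
    [features[labels == c].mean(dim=0) for c in range(num_classes)]
)
print("Matches loop:", torch.allclose(class_means, loop_means, atol=1e-5))
```

The one-hot approach trades extra memory (an `(N, C)` matrix) for a single contraction, which tends to pay off when the number of classes is moderate.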

In [None]:
# ── Visualize class-conditional feature distributions ─────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for feat_idx in range(3):
    ax = axes[feat_idx]
    for class_idx in range(NUM_CLASSES_FEAT):
        mask = class_labels == class_idx
        values = feature_matrix[mask, feat_idx].numpy()
        ax.hist(values, bins=15, alpha=0.5, label=f"Class {class_idx}")
    ax.set_xlabel("Feature Value")
    ax.set_ylabel("Count")
    ax.set_title(f"Feature {feat_idx} Distribution")
    ax.legend(fontsize=8)

plt.tight_layout()
plt.show()

### 3.4 Pairwise Distances via Einsum

Computing pairwise distances between samples is a core operation in algorithms like k-NN, k-means, and kernel methods. We can express this efficiently with einsum.

For two sets of vectors $\mathbf{X} \in \mathbb{R}^{n \times d}$ and $\mathbf{Y} \in \mathbb{R}^{m \times d}$, the squared Euclidean distance matrix is:

$$D_{ij}^2 = \|\mathbf{x}_i - \mathbf{y}_j\|^2 = \|\mathbf{x}_i\|^2 - 2\mathbf{x}_i \cdot \mathbf{y}_j + \|\mathbf{y}_j\|^2$$
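
The expansion follows from writing the squared norm as an inner product and multiplying out:

$$\|\mathbf{x}_i - \mathbf{y}_j\|^2 = (\mathbf{x}_i - \mathbf{y}_j)^\top(\mathbf{x}_i - \mathbf{y}_j) = \mathbf{x}_i^\top\mathbf{x}_i - 2\,\mathbf{x}_i^\top\mathbf{y}_j + \mathbf{y}_j^\top\mathbf{y}_j$$

As a quick check with two vectors chosen purely for illustration, take $\mathbf{x} = (1, 2)$ and $\mathbf{y} = (3, 0)$. Directly, $\|\mathbf{x} - \mathbf{y}\|^2 = (-2)^2 + 2^2 = 8$. Via the expansion, $\|\mathbf{x}\|^2 = 5$, $\mathbf{x} \cdot \mathbf{y} = 3$, and $\|\mathbf{y}\|^2 = 9$, so $5 - 2 \cdot 3 + 9 = 8$.

In matrix form, with $\mathbf{r}_X \in \mathbb{R}^n$ and $\mathbf{r}_Y \in \mathbb{R}^m$ denoting the vectors of squared row norms (notation introduced here for convenience):

$$\mathbf{D}^2 = \mathbf{r}_X \mathbf{1}_m^\top - 2\,\mathbf{X}\mathbf{Y}^\top + \mathbf{1}_n \mathbf{r}_Y^\top$$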

In [None]:
def pairwise_squared_distance_einsum(
    tensor_x: torch.Tensor, tensor_y: torch.Tensor
) -> torch.Tensor:
    """Compute pairwise squared Euclidean distances using einsum.

    Uses the expansion ||x-y||^2 = ||x||^2 - 2*x.y + ||y||^2
    where the cross term is computed efficiently via einsum.

    Args:
        tensor_x: First set of vectors, shape (n, d).
        tensor_y: Second set of vectors, shape (m, d).

    Returns:
        Distance matrix of shape (n, m) where entry (i,j) is
        the squared Euclidean distance between x_i and y_j.
    """
    # ||x_i||^2 for each x_i
    x_sq = torch.einsum('id,id->i', tensor_x, tensor_x)  # shape (n,)
    # ||y_j||^2 for each y_j
    y_sq = torch.einsum('jd,jd->j', tensor_y, tensor_y)  # shape (m,)
    # Cross term: x_i . y_j
    cross = torch.einsum('id,jd->ij', tensor_x, tensor_y)  # shape (n, m)

    # ||x_i - y_j||^2 = ||x_i||^2 - 2*x_i.y_j + ||y_j||^2
    # Floating-point cancellation can produce tiny negative values, so clamp at 0.
    distances = x_sq.unsqueeze(1) - 2 * cross + y_sq.unsqueeze(0)
    return distances.clamp_min(0.0)


def pairwise_squared_distance_loop(
    tensor_x: torch.Tensor, tensor_y: torch.Tensor
) -> torch.Tensor:
    """Compute pairwise squared Euclidean distances using loops (baseline).

    Args:
        tensor_x: First set of vectors, shape (n, d).
        tensor_y: Second set of vectors, shape (m, d).

    Returns:
        Distance matrix of shape (n, m).
    """
    num_x = tensor_x.shape[0]
    num_y = tensor_y.shape[0]
    # new_zeros matches the dtype and device of the inputs
    distances = tensor_x.new_zeros(num_x, num_y)
    for idx_i in range(num_x):
        for idx_j in range(num_y):
            diff = tensor_x[idx_i] - tensor_y[idx_j]
            distances[idx_i, idx_j] = torch.dot(diff, diff)
    return distances


# Compare einsum vs loop vs library
points_x = torch.randn(50, 10)
points_y = torch.randn(30, 10)

dist_einsum = pairwise_squared_distance_einsum(points_x, points_y)
dist_loop = pairwise_squared_distance_loop(points_x, points_y)
dist_cdist = torch.cdist(points_x, points_y, p=2) ** 2  # library version

print(f"Distance matrix shape: {tuple(dist_einsum.shape)}")
print(f"Einsum vs loop match: {torch.allclose(dist_einsum, dist_loop, atol=1e-4)}")
print(f"Einsum vs cdist match: {torch.allclose(dist_einsum, dist_cdist, atol=1e-4)}")
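
Since k-NN is one of the motivating algorithms, here is a minimal sketch of how a distance matrix turns into neighbour indices using `torch.topk`. The point sets and the choice `k = 3` are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
queries = torch.randn(5, 10)      # illustrative query points
references = torch.randn(30, 10)  # illustrative reference points
k = 3                             # assumed number of neighbours

sq_dists = torch.cdist(queries, references, p=2) ** 2  # (5, 30)

# largest=False returns the k smallest distances per row, sorted ascending.
nearest_dists, nearest_idx = torch.topk(sq_dists, k=k, dim=1, largest=False)

print(tuple(nearest_idx.shape))
print("Closest neighbour matches argmin:",
      torch.equal(nearest_idx[:, 0], sq_dists.argmin(dim=1)))
```

Squared distances preserve the neighbour ordering, so the square root can be skipped when only the ranking is needed.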

In [None]:
# ── Visualize the pairwise distance matrix ───────────────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
im = ax.imshow(dist_einsum.numpy(), cmap='viridis', aspect='auto')
ax.set_xlabel("Y Samples")
ax.set_ylabel("X Samples")
ax.set_title("Pairwise Squared Euclidean Distance Matrix")
plt.colorbar(im, ax=ax, label="Squared Distance")
plt.tight_layout()
plt.show()

---
## Part 4 — Evaluation & Analysis

We now evaluate the performance characteristics of the operations covered, analyze memory behavior, build a comprehensive reference table, and demonstrate common bugs with their fixes.

### 4.1 Performance Comparison: Einsum vs torch.cdist vs Loop

In [None]:
def benchmark_pairwise_distance(num_points: int, num_dims: int) -> dict[str, float]:
    """Benchmark three approaches to pairwise distance computation.

    Args:
        num_points: Number of points in each set.
        num_dims: Dimensionality of each point.

    Returns:
        Dictionary mapping method name to execution time in milliseconds.
    """
    pts_x = torch.randn(num_points, num_dims)
    pts_y = torch.randn(num_points, num_dims)

    timings = {}

    # Einsum approach
    start = time.perf_counter()
    for _ in range(NUM_TIMING_RUNS):
        pairwise_squared_distance_einsum(pts_x, pts_y)
    timings["Einsum"] = (time.perf_counter() - start) / NUM_TIMING_RUNS * 1000

    # torch.cdist (library); squared so all methods compute the same quantity
    start = time.perf_counter()
    for _ in range(NUM_TIMING_RUNS):
        torch.cdist(pts_x, pts_y, p=2) ** 2
    timings["torch.cdist"] = (time.perf_counter() - start) / NUM_TIMING_RUNS * 1000

    # Loop approach (only for small sizes)
    if num_points <= 50:
        start = time.perf_counter()
        for _ in range(min(NUM_TIMING_RUNS, 5)):
            pairwise_squared_distance_loop(pts_x, pts_y)
        timings["Loop"] = (time.perf_counter() - start) / min(NUM_TIMING_RUNS, 5) * 1000
    else:
        timings["Loop"] = float('nan')

    return timings


# Run benchmarks at different scales
scales = [(20, 10), (50, 50), (100, 50), (200, 100)]
benchmark_results = []

for num_pts, num_d in scales:
    timings = benchmark_pairwise_distance(num_pts, num_d)
    benchmark_results.append({
        "Points": num_pts,
        "Dims": num_d,
        "Einsum (ms)": round(timings["Einsum"], 4),
        "torch.cdist (ms)": round(timings["torch.cdist"], 4),
        "Loop (ms)": round(timings["Loop"], 4) if not np.isnan(timings["Loop"]) else "N/A",
    })

benchmark_df = pd.DataFrame(benchmark_results)
print("Pairwise Distance Performance Comparison:")
print(benchmark_df.to_string(index=False))
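
Mean timings over a single loop can be skewed by first-call overhead and occasional slow iterations. The sketch below shows one common alternative: discard a few warm-up calls and report the median of per-call timings. `time_ms` is a hypothetical helper, and the warm-up and run counts are arbitrary assumptions.

```python
import statistics
import time

import torch


def time_ms(fn, warmup=5, runs=30):
    """Return the median wall-clock time of fn() in milliseconds (hypothetical helper)."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


torch.manual_seed(0)
pts_x = torch.randn(100, 50)
pts_y = torch.randn(100, 50)

cdist_ms = time_ms(lambda: torch.cdist(pts_x, pts_y, p=2) ** 2)
einsum_ms = time_ms(lambda: torch.einsum("id,jd->ij", pts_x, pts_y))
print(f"cdist squared:     {cdist_ms:.4f} ms")
print(f"einsum cross term: {einsum_ms:.4f} ms")
```

Absolute numbers depend on hardware, thread settings, and library versions, so treat any single run as indicative rather than definitive.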

### 4.2 Memory Analysis: Views vs Copies

In [None]:
def analyze_memory_sharing(
    original: torch.Tensor, operations: dict[str, torch.Tensor]
) -> pd.DataFrame:
    """Analyze memory sharing between an original tensor and derived tensors.

    Args:
        original: The base tensor.
        operations: Dictionary mapping operation names to result tensors.

    Returns:
        DataFrame summarizing memory relationship for each operation.
    """
    results = []
    # untyped_storage() replaces the deprecated Tensor.storage() (assumes PyTorch >= 2.0)
    original_ptr = original.untyped_storage().data_ptr()

    for op_name, result_tensor in operations.items():
        result_storage = result_tensor.untyped_storage()
        shares_memory = result_storage.data_ptr() == original_ptr
        result_size = result_storage.nbytes()
        results.append({
            "Operation": op_name,
            "Shape": str(tuple(result_tensor.shape)),
            "Contiguous": result_tensor.is_contiguous(),
            "Shares Storage": shares_memory,
            "Extra Memory": "None" if shares_memory else f"{result_size} bytes",
        })

    return pd.DataFrame(results)


# Create test tensor and apply various operations
base = torch.randn(100, 200)

derived_ops = {
    "view(200, 100)": base.view(200, 100),
    "reshape(200, 100)": base.reshape(200, 100),
    ".t() (transpose)": base.t(),
    ".t().contiguous()": base.t().contiguous(),
    ".clone()": base.clone(),
    "slice [:50]": base[:50],
    "flatten()": base.flatten(),
}

memory_df = analyze_memory_sharing(base, derived_ops)
print(f"Base tensor: shape={tuple(base.shape)}, storage={base.untyped_storage().nbytes()} bytes\n")
print("Memory Sharing Analysis:")
print(memory_df.to_string(index=False))
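
Storage sharing is only half of the picture: what distinguishes one view from another is its shape, strides, and storage offset. The following sketch prints that metadata for a few views of a tensor with the same shape as `base`; the dictionary of views is an illustrative selection.

```python
import torch

base = torch.arange(100 * 200, dtype=torch.float32).reshape(100, 200)

views = {
    "base": base,
    "view(200, 100)": base.view(200, 100),
    ".t()": base.t(),
    "slice [50:]": base[50:],
}

for name, tensor in views.items():
    print(f"{name:>16}: shape={tuple(tensor.shape)}, stride={tensor.stride()}, "
          f"offset={tensor.storage_offset()}, contiguous={tensor.is_contiguous()}")
```

The transpose swaps the strides without moving data, and the slice keeps the strides but starts 50 rows (10,000 elements) into the same storage.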

### 4.3 Operations Reference Table

In [None]:
# ── Comprehensive operations reference table ─────────────────────────────────
reference_data = [
    {"Operation": "Concatenate",
     "NumPy": "np.concatenate(arrs, axis)",
     "PyTorch": "torch.cat(tensors, dim)",
     "Creates New Dim": "No",
     "When to Use": "Joining arrays along existing dimension"},
    {"Operation": "Stack",
     "NumPy": "np.stack(arrs, axis)",
     "PyTorch": "torch.stack(tensors, dim)",
     "Creates New Dim": "Yes",
     "When to Use": "Creating batch from individual samples"},
    {"Operation": "Vertical stack",
     "NumPy": "np.vstack(arrs)",
     "PyTorch": "torch.vstack(tensors)",
     "Creates New Dim": "If 1D",
     "When to Use": "Stacking rows"},
    {"Operation": "Horizontal stack",
     "NumPy": "np.hstack(arrs)",
     "PyTorch": "torch.hstack(tensors)",
     "Creates New Dim": "No",
     "When to Use": "Stacking columns"},
    {"Operation": "Reshape",
     "NumPy": "arr.reshape(shape)",
     "PyTorch": "tensor.reshape(shape)",
     "Creates New Dim": "—",
     "When to Use": "Safe reshape (copies if needed)"},
    {"Operation": "View",
     "NumPy": "arr.reshape (view if possible)",
     "PyTorch": "tensor.view(shape)",
     "Creates New Dim": "—",
     "When to Use": "Guaranteed view (raises if strides are incompatible)"},
    {"Operation": "Transpose (2D)",
     "NumPy": "arr.T",
     "PyTorch": "tensor.t() or .T",
     "Creates New Dim": "No",
     "When to Use": "Swap rows and columns"},
    {"Operation": "Permute dims",
     "NumPy": "np.transpose(arr, axes)",
     "PyTorch": "tensor.permute(*dims)",
     "Creates New Dim": "No",
     "When to Use": "Arbitrary axis reordering"},
    {"Operation": "Matrix multiply",
     "NumPy": "np.matmul(a, b) or a @ b",
     "PyTorch": "torch.matmul(a, b) or a @ b",
     "Creates New Dim": "—",
     "When to Use": "General matrix multiply with broadcasting"},
    {"Operation": "Batch matmul",
     "NumPy": "np.matmul(a, b)",
     "PyTorch": "torch.bmm(a, b)",
     "Creates New Dim": "—",
     "When to Use": "Batched 3D matmul (strict)"},
    {"Operation": "Einsum",
     "NumPy": "np.einsum(subscripts, *ops)",
     "PyTorch": "torch.einsum(subscripts, *ops)",
     "Creates New Dim": "Varies",
     "When to Use": "Complex tensor contractions, attention"},
    {"Operation": "Boolean index",
     "NumPy": "arr[mask]",
     "PyTorch": "tensor[mask] or masked_select",
     "Creates New Dim": "—",
     "When to Use": "Filtering by condition"},
    {"Operation": "Fancy index",
     "NumPy": "arr[idx_array]",
     "PyTorch": "tensor[idx] or index_select",
     "Creates New Dim": "—",
     "When to Use": "Selecting specific elements by index"},
]

reference_df = pd.DataFrame(reference_data)
print("Operations Reference Table:")
print(reference_df.to_string(index=False))
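
A few rows of the table can be checked side by side. This small sketch compares the resulting shapes of the NumPy and PyTorch calls for concatenate, stack, and axis permutation; the input arrays are arbitrary examples.

```python
import numpy as np
import torch

rows_np = [np.arange(3), np.arange(3, 6)]
rows_pt = [torch.arange(3), torch.arange(3, 6)]

# Concatenate keeps the number of dimensions; stack adds a new one.
cat_np, cat_pt = np.concatenate(rows_np, axis=0), torch.cat(rows_pt, dim=0)
stack_np, stack_pt = np.stack(rows_np, axis=0), torch.stack(rows_pt, dim=0)

# np.transpose(arr, axes) corresponds to tensor.permute(*dims).
perm_np = np.transpose(np.zeros((2, 3, 4)), (2, 0, 1))
perm_pt = torch.zeros(2, 3, 4).permute(2, 0, 1)

print("concatenate:", cat_np.shape, tuple(cat_pt.shape))
print("stack:      ", stack_np.shape, tuple(stack_pt.shape))
print("permute:    ", perm_np.shape, tuple(perm_pt.shape))
```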

### 4.4 Common Bugs and Their Fixes

Three common bugs arise from misunderstanding tensor operations. We demonstrate each one and show how to diagnose and fix it.

In [None]:
# ── Bug 1: Aliasing — Modifying a view silently changes the original ─────────
print("=" * 60)
print("BUG 1: Aliasing (Shared Memory)")
print("=" * 60)

# Scenario: You reshape a weight vector and accidentally modify the original
weights = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(f"Original weights: {weights}")

# Bug: view creates an alias
reshaped_weights = weights.view(2, 3)
reshaped_weights[0, 0] = 999.0
print(f"After modifying view[0,0]: weights = {weights}")
print(f"  -> Original was corrupted!\n")

# Fix: use .clone() to create an independent copy
weights = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
safe_reshaped = weights.view(2, 3).clone()
safe_reshaped[0, 0] = 999.0
print(f"Fix: clone() before modifying")
print(f"  Original weights: {weights}  <- unchanged")
print(f"  Cloned view:      {safe_reshaped[0]}")
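
The same aliasing behaviour exists in NumPy, where `np.shares_memory` offers a direct diagnostic. A minimal sketch, using an array with the same values as `weights`:

```python
import numpy as np

weights_np = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

alias = weights_np.reshape(2, 3)               # view onto the same buffer
independent = weights_np.reshape(2, 3).copy()  # separate buffer

print("alias shares memory:", np.shares_memory(weights_np, alias))
print("copy shares memory: ", np.shares_memory(weights_np, independent))

independent[0, 0] = 999.0  # does not touch weights_np
alias[0, 1] = -1.0         # writes through to weights_np
print(weights_np)
```

Checking for shared memory before an in-place write is a cheap way to catch this class of bug during debugging.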

In [None]:
# ── Bug 2: Non-contiguous view failure ───────────────────────────────────────
print("=" * 60)
print("BUG 2: Non-Contiguous View Failure")
print("=" * 60)

# Scenario: You transpose a tensor and try to view it
tensor_bug2 = torch.arange(12, dtype=torch.float32).view(3, 4)
transposed_bug2 = tensor_bug2.t()  # (4, 3) — non-contiguous!

print(f"Transposed shape: {tuple(transposed_bug2.shape)}")
print(f"Contiguous: {transposed_bug2.is_contiguous()}")

try:
    flat_bug2 = transposed_bug2.view(-1)
    print("view() succeeded (unexpected)")
except RuntimeError as error:
    print(f"view() fails: {error}\n")

# Fix 1: Use .contiguous() before .view()
flat_fix1 = transposed_bug2.contiguous().view(-1)
print(f"Fix 1 — .contiguous().view(-1): {flat_fix1}")

# Fix 2: Use .reshape() which handles non-contiguous tensors
flat_fix2 = transposed_bug2.reshape(-1)
print(f"Fix 2 — .reshape(-1):          {flat_fix2}")
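
One consequence of Fix 2 deserves attention: `.reshape()` returns a view when it can and a copy when it must, so whether writes propagate to the original depends on the input's layout. The sketch below illustrates both cases on a small tensor (names are illustrative).

```python
import torch

grid = torch.arange(12, dtype=torch.float32).view(3, 4)

# Contiguous input: reshape returns a view, so writes reach the original.
flat_view = grid.reshape(-1)
flat_view[0] = -1.0
print("grid[0, 0] after write via contiguous reshape:", grid[0, 0].item())

# Non-contiguous input: reshape has to copy, so writes stay local.
flat_copy = grid.t().reshape(-1)
flat_copy[1] = -2.0  # position 1 in transposed order corresponds to grid[1, 0]
print("grid[1, 0] after write via non-contiguous reshape:", grid[1, 0].item())
print("Same starting address:", flat_copy.data_ptr() == grid.data_ptr())
```

If code relies on aliasing, `.view()` is the safer choice because it raises instead of silently copying.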

In [None]:
# ── Bug 3: Shape mismatch in matrix multiplication ───────────────────────────
print("=" * 60)
print("BUG 3: Shape Mismatch in Matmul")
print("=" * 60)

# Scenario: Forgetting to transpose when computing X^T X
feature_data = torch.randn(100, 10)  # 100 samples, 10 features

try:
    wrong_result = torch.mm(feature_data, feature_data)  # (100,10) x (100,10) -> error!
except RuntimeError as error:
    print(f"Shape mismatch: {error}\n")

# Fix: Transpose the first operand
correct_result = torch.mm(feature_data.t(), feature_data)  # (10,100) x (100,10) -> (10,10)
print(f"Fix: X.t() @ X gives shape {tuple(correct_result.shape)}")

# Diagnostic tip: always print shapes before matmul
def safe_matmul(
    tensor_a: torch.Tensor, tensor_b: torch.Tensor, verbose: bool = True
) -> torch.Tensor:
    """Matrix multiply with shape validation and helpful error messages.

    Args:
        tensor_a: Left operand.
        tensor_b: Right operand.
        verbose: If True, print shape information.

    Returns:
        Result of matrix multiplication.

    Raises:
        ValueError: If either operand has fewer than 2 dimensions or the
            inner dimensions do not match.
    """
    if tensor_a.dim() < 2 or tensor_b.dim() < 2:
        raise ValueError(
            f"Expected operands with at least 2 dims, got "
            f"{tuple(tensor_a.shape)} and {tuple(tensor_b.shape)}."
        )
    if tensor_a.shape[-1] != tensor_b.shape[-2]:
        raise ValueError(
            f"Shape mismatch: {tuple(tensor_a.shape)} x {tuple(tensor_b.shape)}. "
            f"Inner dims {tensor_a.shape[-1]} != {tensor_b.shape[-2]}. "
            f"Did you forget to transpose?"
        )
    result = torch.matmul(tensor_a, tensor_b)
    if verbose:
        # Print the actual output shape so batch broadcasting is reflected.
        print(f"Matmul: {tuple(tensor_a.shape)} x {tuple(tensor_b.shape)} -> {tuple(result.shape)}")
    return result


# Test the safe version
result = safe_matmul(feature_data.t(), feature_data)

try:
    safe_matmul(feature_data, feature_data)
except ValueError as error:
    print(f"Caught: {error}")
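
Einsum offers another way to avoid the forgotten-transpose mistake, because the summed index is named explicitly. A minimal sketch for $X^\top X$, with illustrative data of the same shape as `feature_data`:

```python
import torch

torch.manual_seed(0)
features = torch.randn(100, 10)  # (samples, features), illustrative data

# Sum over the shared sample index n; i and j index features.
gram_einsum = torch.einsum("ni,nj->ij", features, features)  # (10, 10)
gram_matmul = features.t() @ features

print(tuple(gram_einsum.shape))
print("Matches X^T X:", torch.allclose(gram_einsum, gram_matmul, atol=1e-4))
```

Writing `"in,jn->ij"` on the same input would instead contract over features and give the 100 x 100 matrix $X X^\top$, so the subscripts document which product is intended.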

### 4.5 Performance Visualization

In [None]:
# ── Visualize: einsum vs explicit ops across matrix sizes ─────────────────────
sizes = [16, 32, 64, 128, 256]
einsum_times = []
explicit_times = []

for size in sizes:
    mat_test_a = torch.randn(size, size)
    mat_test_b = torch.randn(size, size)

    # Einsum matmul
    start = time.perf_counter()
    for _ in range(NUM_TIMING_RUNS):
        torch.einsum('ij,jk->ik', mat_test_a, mat_test_b)
    einsum_times.append((time.perf_counter() - start) / NUM_TIMING_RUNS * 1000)

    # Explicit matmul
    start = time.perf_counter()
    for _ in range(NUM_TIMING_RUNS):
        torch.mm(mat_test_a, mat_test_b)
    explicit_times.append((time.perf_counter() - start) / NUM_TIMING_RUNS * 1000)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(sizes, einsum_times, 'o-', label='einsum', linewidth=2)
ax.plot(sizes, explicit_times, 's-', label='torch.mm', linewidth=2)
ax.set_xlabel('Matrix Size (NxN)')
ax.set_ylabel('Time (ms)')
ax.set_title('Matrix Multiplication: einsum vs torch.mm')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# ── Summary of all einsum patterns covered ───────────────────────────────────
einsum_patterns = pd.DataFrame([
    {"Pattern": "i,i->", "Operation": "Vector dot product",
     "Example Shapes": "(d,) x (d,) -> scalar"},
    {"Pattern": "i,j->ij", "Operation": "Outer product",
     "Example Shapes": "(m,) x (n,) -> (m,n)"},
    {"Pattern": "ij,jk->ik", "Operation": "Matrix multiply",
     "Example Shapes": "(m,n) x (n,p) -> (m,p)"},
    {"Pattern": "ij->ji", "Operation": "Transpose",
     "Example Shapes": "(m,n) -> (n,m)"},
    {"Pattern": "ii->", "Operation": "Trace",
     "Example Shapes": "(n,n) -> scalar"},
    {"Pattern": "bij,bjk->bik", "Operation": "Batch matmul",
     "Example Shapes": "(B,m,n) x (B,n,p) -> (B,m,p)"},
    {"Pattern": "bhqd,bhkd->bhqk", "Operation": "Attention scores",
     "Example Shapes": "(B,H,Q,d) x (B,H,K,d) -> (B,H,Q,K)"},
    {"Pattern": "id,jd->ij", "Operation": "Pairwise dot products",
     "Example Shapes": "(n,d) x (m,d) -> (n,m)"},
])

print("Einsum Pattern Reference:")
print(einsum_patterns.to_string(index=False))
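
Several of the patterns in the table have a dedicated PyTorch function, which makes them easy to verify. The sketch below compares six of them on small random inputs; the shapes are arbitrary and the `checks` dictionary is just an illustrative way to organise the comparison.

```python
import torch

torch.manual_seed(0)
vec_a, vec_b = torch.randn(5), torch.randn(5)
mat_a, mat_b = torch.randn(3, 4), torch.randn(4, 2)
square = torch.randn(4, 4)
batch_a, batch_b = torch.randn(6, 3, 4), torch.randn(6, 4, 2)

# Each entry pairs an einsum result with its dedicated-function counterpart.
checks = {
    "i,i->": (torch.einsum("i,i->", vec_a, vec_b), torch.dot(vec_a, vec_b)),
    "i,j->ij": (torch.einsum("i,j->ij", vec_a, vec_b), torch.outer(vec_a, vec_b)),
    "ij,jk->ik": (torch.einsum("ij,jk->ik", mat_a, mat_b), torch.mm(mat_a, mat_b)),
    "ij->ji": (torch.einsum("ij->ji", mat_a), mat_a.t()),
    "ii->": (torch.einsum("ii->", square), torch.trace(square)),
    "bij,bjk->bik": (torch.einsum("bij,bjk->bik", batch_a, batch_b),
                     torch.bmm(batch_a, batch_b)),
}

results = {
    pattern: torch.allclose(via_einsum, via_function, atol=1e-5)
    for pattern, (via_einsum, via_function) in checks.items()
}
for pattern, ok in results.items():
    print(f"{pattern:>14}: {'match' if ok else 'MISMATCH'}")
```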

---
## Part 5 — Summary & Lessons Learned

### Key Takeaways

1. **`.view()` always aliases; `.reshape()` aliases when it can:** `.view()` returns a view of the same memory or raises if the strides are incompatible, while `.reshape()` returns a view when possible and copies otherwise. Modifying a view changes the original, so use `.clone()` when you need an independent copy.

2. **einsum provides a unified notation for many tensor contractions:** from dot products (`'i,i->'`) to attention scores (`'bhqd,bhkd->bhqk'`). It is readable and general; the timing cells in Part 4 let you check on your own hardware how it compares with dedicated functions such as `torch.mm`.

3. **Advanced indexing enables efficient data selection without loops:** boolean masks filter by condition, fancy indexing selects by position, and `torch.where` provides element-wise conditional logic. These are essential for class-conditional analysis and data preprocessing.

4. **In-place operations save memory but can break autograd graphs:** operations ending with `_` (like `.add_()`, `.mul_()`) modify tensors in place. Use them only when you are certain the tensor is not part of a computation graph that needs gradients.

5. **Transposing changes strides, not data:** no data is copied, which makes transpose cheap, but the result is usually non-contiguous, so a subsequent `.view()` can fail. Call `.contiguous()` or use `.reshape()` in that case. Understanding contiguity is essential for debugging shape-related errors.
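
The autograd caveat in takeaway 4 can be reproduced in a few lines. The sketch below shows two ways an in-place operation raises a `RuntimeError`, followed by an out-of-place version that works; the tensors are toy examples, and `exp` is chosen because it saves its output for the backward pass.

```python
import torch

# Case 1: in-place update of a leaf tensor that requires grad is rejected.
leaf = torch.ones(3, requires_grad=True)
try:
    leaf.add_(1.0)
    leaf_error = None
except RuntimeError as error:
    leaf_error = str(error)
print("Leaf in-place error raised:", leaf_error is not None)

# Case 2: exp() saves its output for backward; overwriting it breaks the graph.
x = torch.ones(3, requires_grad=True)
y = x.exp()
y.add_(1.0)
try:
    y.sum().backward()
    backward_error = None
except RuntimeError as error:
    backward_error = str(error)
print("Backward error raised:", backward_error is not None)

# Out-of-place version leaves the saved output intact.
x_ok = torch.ones(3, requires_grad=True)
(x_ok.exp() + 1.0).sum().backward()
print("Gradient:", x_ok.grad)

# In-place updates on a leaf are allowed when gradient tracking is disabled.
with torch.no_grad():
    leaf.sub_(0.1)
print("Updated leaf:", leaf)
```

The `torch.no_grad()` pattern at the end is how manual parameter updates are usually written, since the update itself should not be recorded in the graph.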

### What's Next

- **1-03 (Pandas for Tabular Data)** applies data manipulation skills to real-world tabular datasets with Pandas DataFrames.
- **1-05 (Data Loading with PyTorch)** builds directly on tensor operations to create Dataset and DataLoader pipelines.
- **1-06 (Linear Algebra for ML)** uses matrix multiplication and decompositions introduced here for eigendecomposition and SVD.