# Module 01 — Mathematical & Programming Foundations

## 01-02: Advanced NumPy & PyTorch Operations

**Objective:** Master the tensor manipulation operations (reshape, einsum, advanced indexing, and in-place ops) that form the backbone of every ML/DL implementation in this course.

**Prerequisites:** 01-01 (Python, NumPy & Tensor Speed)

---

## Part 0 — Setup & Prerequisites

In 01-01 we learned *why* vectorization matters and measured the speed gap between Python loops and NumPy/PyTorch. Now we learn *how* to manipulate tensor shapes and dimensions so that vectorized operations can replace loops in complex scenarios.

We will cover:

- **Reshaping & dimension manipulation**: reshape, view, squeeze, unsqueeze, permute, transpose
- **Advanced indexing**: fancy indexing, boolean masks, `np.where`, scatter/gather
- **Einsum notation**: a universal language for tensor contractions
- **In-place operations**: when they help, when they break autograd
- **Stacking & concatenation**: combining tensors along new or existing axes

These operations appear in virtually every notebook from Module 2 onward.

In [None]:
# ── Imports ──────────────────────────────────────────────────────────────────
import sys
import time
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

print(f'Python: {sys.version.split()[0]}')
print(f'NumPy: {np.__version__}')
print(f'PyTorch: {torch.__version__}')
if torch.cuda.is_available():
    print(f'CUDA: {torch.version.cuda}')
    print(f'GPU: {torch.cuda.get_device_name(0)}')

In [None]:
# ── Reproducibility ──────────────────────────────────────────────────────────
import random

SEED = 1103
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

In [None]:
# ── Configuration ────────────────────────────────────────────────────────────
# Visualization
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

# Timing helper, reused from 01-01
from collections.abc import Callable


def measure_time(
    func: Callable[[], object],
    num_warmup: int = 2,
    num_timed: int = 5,
) -> tuple[float, float]:
    """Measure execution time of a zero-argument callable.

    Args:
        func: Zero-argument callable to benchmark.
        num_warmup: Number of warmup runs before timing.
        num_timed: Number of timed runs to average.

    Returns:
        Tuple of (mean_seconds, std_seconds).
    """
    for _ in range(num_warmup):
        func()
    times: list[float] = []
    for _ in range(num_timed):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    return float(np.mean(times)), float(np.std(times))

---

## Part 1 — Tensor Manipulation from Scratch

Tensor manipulation is the art of rearranging data without changing the underlying values. In ML, we constantly need to:

- Reshape a flat vector into a batch of images
- Transpose matrices for matrix multiplication compatibility
- Select specific elements based on conditions
- Contract tensors along specific dimensions

We'll build up from basic reshaping to the powerful einsum notation, implementing each operation from scratch to understand what it does under the hood.

### 1.1 Reshape & View: Changing Tensor Geometry

Reshaping changes how we *interpret* the same block of memory. When the memory layout allows it, no data is copied: only the shape and stride metadata change, which makes the reshape essentially free.

**Key distinction:**

- `np.reshape()` / `tensor.reshape()`: always works; returns a view when possible and copies otherwise
- `tensor.view()`: PyTorch only; never copies, and raises an error when the tensor's strides cannot represent the requested shape (for example, flattening a transposed tensor)
- `tensor.contiguous().view()`: safe pattern when `view()` alone might fail
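
Whether a reshape is free depends on the memory layout of the input. The sketch below uses only NumPy and `np.shares_memory` to check one case where `reshape` returns a view and one where it has to copy. The variable names are illustrative.

```python
import numpy as np

base = np.arange(12)

# Contiguous input: reshape only rewrites shape and strides, so it is a view.
grid = base.reshape(3, 4)
view_shares = np.shares_memory(base, grid)

# Transposing is also a view, but the result is no longer C-contiguous.
grid_t = grid.T

# Flattening the transposed array cannot be expressed with strides alone,
# so NumPy materializes a copy.
flat_t = grid_t.reshape(-1)
copy_shares = np.shares_memory(base, flat_t)

print(f'reshape of contiguous array shares memory: {view_shares}')
print(f'reshape of transposed array shares memory: {copy_shares}')
print(f'strides: grid={grid.strides}, grid_t={grid_t.strides}')
```

The copy is silent, which is why checking `np.shares_memory` (or `data_ptr()` in PyTorch) is useful when memory use matters.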

In [None]:
def demonstrate_reshape() -> None:
    """Show reshape operations with shape tracking at each step."""
    # Start with a 1D array of 24 elements
    arr = np.arange(24)
    print(f'Original: shape={arr.shape}, strides={arr.strides}')
    print(f'  Data: {arr}')
    print()

    # Reshape to 2D (4 rows × 6 cols)
    mat_4x6 = arr.reshape(4, 6)
    print(f'reshape(4, 6): shape={mat_4x6.shape}, strides={mat_4x6.strides}')
    print(f'  Shares memory: {np.shares_memory(arr, mat_4x6)}')
    print(f'{mat_4x6}')
    print()

    # Reshape to 3D (2 × 3 × 4) — e.g., 2 images, 3 rows, 4 cols
    tensor_3d = arr.reshape(2, 3, 4)
    print(f'reshape(2, 3, 4): shape={tensor_3d.shape}, strides={tensor_3d.strides}')
    print(f'  Element [1, 2, 3] = {tensor_3d[1, 2, 3]}')
    print(f'  Flat index: 1×12 + 2×4 + 3 = {1*12 + 2*4 + 3}')
    print()

    # Using -1 to infer one dimension
    auto_shape = arr.reshape(6, -1)  # -1 infers 4
    print(f'reshape(6, -1): shape={auto_shape.shape} (-1 inferred as 4)')
    print()

    # Flatten back
    flat = tensor_3d.reshape(-1)
    print(f'reshape(-1): shape={flat.shape} (flattened back to 1D)')
    assert np.array_equal(flat, arr), 'Flatten should recover original data'


demonstrate_reshape()

Now let's see how PyTorch's `view()` differs from `reshape()`. The key difference is that `view()` guarantees no data copy: it raises an error when the tensor's memory layout cannot represent the new shape, which is the typical case for non-contiguous tensors.

In [None]:
def demonstrate_view_vs_reshape() -> None:
    """Compare PyTorch view() and reshape() behavior."""
    t = torch.arange(24)
    print(f'Original: shape={t.shape}, is_contiguous={t.is_contiguous()}')
    print()

    # view() works on contiguous tensors
    v1 = t.view(4, 6)
    print(f'view(4, 6): shape={v1.shape}, is_contiguous={v1.is_contiguous()}')
    print(f'  data_ptr matches: {t.data_ptr() == v1.data_ptr()}')
    print()

    # Transpose makes tensor non-contiguous
    v2 = v1.t()  # Transpose: (4, 6) → (6, 4)
    print(f'After .t(): shape={v2.shape}, is_contiguous={v2.is_contiguous()}')
    print(f'  Strides: {v2.stride()}')
    print()

    # view() fails on non-contiguous tensor
    try:
        v2.view(-1)
    except RuntimeError as e:
        print(f'view() on non-contiguous: RuntimeError')
        print(f'  {str(e)[:80]}...')
    print()

    # reshape() works (makes a copy internally)
    r1 = v2.reshape(-1)
    print(f'reshape() on non-contiguous: shape={r1.shape} (works, may copy)')
    print()

    # contiguous().view() — safe pattern
    safe = v2.contiguous().view(-1)
    print(f'contiguous().view(): shape={safe.shape} (always works)')
    assert torch.equal(r1, safe)


demonstrate_view_vs_reshape()

### 1.2 Squeeze & Unsqueeze: Adding and Removing Size-1 Dimensions

These operations add or remove dimensions of size 1. They're essential for making tensors compatible for broadcasting and batch operations.

Common use cases:

- `unsqueeze(0)`: add a batch dimension to a single sample
- `unsqueeze(-1)`: turn a vector into a column vector
- `squeeze()`: remove all size-1 dimensions from a result
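
One caveat worth seeing in code: a bare `squeeze()` removes *every* size-1 dimension, so it also drops the batch dimension when a batch happens to contain a single sample. The sketch below is a small illustration with made-up shapes; passing the dimension explicitly avoids the surprise.

```python
import torch

# A batch that happens to contain one sample: (batch=1, channels=1, length=5)
single_batch = torch.zeros(1, 1, 5)

# Bare squeeze() drops every size-1 dim, including the batch dimension.
all_squeezed = single_batch.squeeze()

# squeeze(dim) removes only the named dimension and keeps the batch intact.
channel_squeezed = single_batch.squeeze(1)

# squeeze(dim) is a no-op when that dimension is not size 1.
unchanged = single_batch.squeeze(2)

print(tuple(all_squeezed.shape))      # (5,)
print(tuple(channel_squeezed.shape))  # (1, 5)
print(tuple(unchanged.shape))         # (1, 1, 5)
```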

In [None]:
def demonstrate_squeeze_unsqueeze() -> None:
    """Show squeeze and unsqueeze operations with practical ML examples."""
    # Start with a vector (common: a single feature vector)
    feature_vec = torch.randn(512)
    print(f'Feature vector: shape={feature_vec.shape}')
    print()

    # unsqueeze(0): Add batch dimension — needed to pass through nn.Module
    batched = feature_vec.unsqueeze(0)
    print(f'unsqueeze(0) — add batch dim: shape={batched.shape}')
    print(f'  Use case: single sample → model expects (batch, features)')
    print()

    # unsqueeze(-1): Column vector for matrix operations
    col_vec = feature_vec.unsqueeze(-1)
    print(f'unsqueeze(-1) — column vector: shape={col_vec.shape}')
    print()

    # Multiple unsqueezes: prepare for broadcasting
    # E.g., (C,) → (1, C, 1, 1) for channel-wise operations on images
    channel_weights = torch.randn(3)  # RGB channel weights
    broadcast_ready = channel_weights.unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
    print(f'Channel weights (3,) → {broadcast_ready.shape} for (N, C, H, W) broadcasting')
    print()

    # Alternative: use [None, :, None, None] indexing (equivalent)
    broadcast_alt = channel_weights[None, :, None, None]
    print(f'Equivalent via indexing: {broadcast_alt.shape}')
    assert torch.equal(broadcast_ready, broadcast_alt)
    print()

    # squeeze: Remove size-1 dimensions
    bloated = torch.randn(1, 3, 1, 5, 1)
    squeezed = bloated.squeeze()
    print(f'squeeze(): {bloated.shape} → {squeezed.shape}')
    print()

    # squeeze(dim): Remove only a specific size-1 dimension
    partial = bloated.squeeze(0)
    print(f'squeeze(0): {bloated.shape} → {partial.shape}')
    partial2 = bloated.squeeze(2)
    print(f'squeeze(2): {bloated.shape} → {partial2.shape}')
    print()

    # Shape summary table
    operations = pd.DataFrame({
        'Operation': ['unsqueeze(0)', 'unsqueeze(-1)', 'unsqueeze(1)',
                      'squeeze()', 'squeeze(0)', '[None, :]'],
        'Input Shape': ['(D,)', '(D,)', '(N, D)',
                        '(1, D, 1)', '(1, D)', '(D,)'],
        'Output Shape': ['(1, D)', '(D, 1)', '(N, 1, D)',
                         '(D,)', '(D,)', '(1, D)'],
        'ML Use Case': [
            'Add batch dim for inference',
            'Column vector for matmul',
            'Add sequence dim',
            'Remove all size-1 dims',
            'Remove batch dim after inference',
            'Same as unsqueeze(0)',
        ],
    })
    print('=== Squeeze/Unsqueeze Reference ===')
    print(operations.to_string(index=False))


demonstrate_squeeze_unsqueeze()

### 1.3 Transpose & Permute: Reordering Dimensions

Transpose swaps two dimensions; permute reorders all dimensions at once. These are critical for converting between data format conventions:

- Images: `(H, W, C)` ↔ `(C, H, W)` (channels-last ↔ channels-first)
- Sequences: `(batch, seq, features)` ↔ `(seq, batch, features)`
- Attention: rearranging `(batch, heads, seq, d_k)` dimensions

**Important:** Transpose and permute return *views*. The data stays in place and only the stride metadata changes. The result is typically non-contiguous.
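
To make the "view, not copy" point concrete, the following sketch inspects strides before and after `permute` on a tiny tensor (the shapes are chosen only to keep the numbers readable) and shows that `contiguous()` is what actually moves data.

```python
import torch

image_hwc = torch.arange(24).reshape(2, 4, 3)   # (H=2, W=4, C=3)
image_chw = image_hwc.permute(2, 0, 1)          # (C=3, H=2, W=4)

# Same storage, different strides: no element has moved.
same_storage = image_hwc.data_ptr() == image_chw.data_ptr()
print(image_hwc.stride(), image_chw.stride())   # (12, 3, 1) (1, 12, 3)

# contiguous() copies the data into row-major order for the new shape.
image_chw_packed = image_chw.contiguous()
print(image_chw_packed.stride())                # (8, 4, 1)

# Writing through the permuted view is visible in the original tensor,
# but not in the contiguous copy made earlier.
image_chw[0, 0, 0] = -1
print(image_hwc[0, 0, 0].item(), image_chw_packed[0, 0, 0].item())  # -1 0
```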

In [None]:
def demonstrate_transpose_permute() -> None:
    """Show transpose and permute with shape tracking."""
    # 2D transpose
    mat = torch.arange(12).reshape(3, 4)
    print(f'Original: shape={mat.shape}')
    print(mat)
    print()

    # .T shorthand (2D only)
    mat_t = mat.T
    print(f'.T: shape={mat_t.shape}')
    print(mat_t)
    print()

    # .t() method (2D only, equivalent)
    mat_t2 = mat.t()
    assert torch.equal(mat_t, mat_t2)
    print(f'.t(): same as .T for 2D')
    print()

    # transpose(dim0, dim1) — works for any number of dimensions
    tensor_3d = torch.randn(2, 3, 4)
    swapped = tensor_3d.transpose(1, 2)  # Swap dims 1 and 2
    print(f'3D transpose(1, 2): {tensor_3d.shape} → {swapped.shape}')
    print(f'  Is contiguous: {swapped.is_contiguous()}')
    print()

    # permute — reorder all dimensions at once
    # Common: convert image from (H, W, C) to (C, H, W)
    image_hwc = torch.randn(224, 224, 3)  # Height, Width, Channels
    image_chw = image_hwc.permute(2, 0, 1)  # Channels, Height, Width
    print(f'Image HWC → CHW: {image_hwc.shape} → {image_chw.shape}')
    print()

    # Batch of images: (N, H, W, C) → (N, C, H, W)
    batch_hwc = torch.randn(32, 224, 224, 3)
    batch_chw = batch_hwc.permute(0, 3, 1, 2)
    print(f'Batch HWC → CHW: {batch_hwc.shape} → {batch_chw.shape}')
    print()

    # Attention tensor rearrangement
    # (batch, seq, num_heads, d_k) → (batch, num_heads, seq, d_k)
    attn = torch.randn(8, 64, 12, 64)
    attn_rearranged = attn.permute(0, 2, 1, 3)
    print(f'Attention rearrange: {attn.shape} → {attn_rearranged.shape}')
    print(f'  (batch, seq, heads, d_k) → (batch, heads, seq, d_k)')


demonstrate_transpose_permute()

### 1.4 Stacking & Concatenation: Combining Tensors

Two ways to combine tensors:

- **`cat` / `concatenate`**: join along an *existing* dimension (sizes must match on all other dims)
- **`stack`**: join along a *new* dimension (all tensors must have the same shape)

These appear constantly: concatenating features, stacking batch elements, building sequence tensors from individual timesteps.
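
One way to relate the two: `stack` behaves like `unsqueeze` on each input followed by `cat` along the new dimension. The sketch below checks that equivalence and shows the shape requirement that separates the two operations; tensor names are illustrative.

```python
import torch

a = torch.zeros(3, 4)
b = torch.ones(3, 4)

# stack along a new dim 0 ...
stacked = torch.stack([a, b], dim=0)                           # (2, 3, 4)
# ... gives the same result as unsqueeze + cat.
via_cat = torch.cat([a.unsqueeze(0), b.unsqueeze(0)], dim=0)   # (2, 3, 4)

# cat tolerates a size mismatch only along the joined dimension.
wide = torch.ones(3, 6)
joined = torch.cat([a, wide], dim=1)                           # (3, 10)

# stack requires identical shapes, so the same inputs are rejected.
try:
    torch.stack([a, wide], dim=0)
    stack_failed = False
except RuntimeError:
    stack_failed = True

print(stacked.shape, joined.shape, stack_failed)
```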

In [None]:
def demonstrate_cat_stack() -> None:
    """Show concatenation and stacking operations."""
    # Three (3, 4) matrices
    a = torch.ones(3, 4)
    b = torch.ones(3, 4) * 2
    c = torch.ones(3, 4) * 3
    print(f'Input shapes: a={a.shape}, b={b.shape}, c={c.shape}')
    print()

    # cat along dim=0 (stack rows)
    cat0 = torch.cat([a, b, c], dim=0)
    print(f'cat(dim=0): {cat0.shape}  (9 rows = 3+3+3)')
    print()

    # cat along dim=1 (stack columns)
    cat1 = torch.cat([a, b, c], dim=1)
    print(f'cat(dim=1): {cat1.shape}  (12 cols = 4+4+4)')
    print()

    # stack creates a NEW dimension
    stacked0 = torch.stack([a, b, c], dim=0)
    print(f'stack(dim=0): {stacked0.shape}  (new batch dim)')
    print()

    stacked1 = torch.stack([a, b, c], dim=1)
    print(f'stack(dim=1): {stacked1.shape}  (new dim inserted at position 1)')
    print()

    # Practical: building a batch from individual samples
    samples = [torch.randn(3, 32, 32) for _ in range(16)]  # 16 RGB images
    batch = torch.stack(samples, dim=0)
    print(f'Stack 16 images: list of {samples[0].shape} → {batch.shape}')
    print()

    # Practical: concatenating feature vectors
    visual_features = torch.randn(8, 256)
    text_features = torch.randn(8, 128)
    combined = torch.cat([visual_features, text_features], dim=1)
    print(f'Feature concat: {visual_features.shape} + {text_features.shape} → {combined.shape}')
    print()

    # NumPy equivalents
    np_a = np.ones((3, 4))
    np_b = np.ones((3, 4)) * 2
    print(f'NumPy concatenate(axis=0): {np.concatenate([np_a, np_b], axis=0).shape}')
    print(f'NumPy stack(axis=0): {np.stack([np_a, np_b], axis=0).shape}')
    print(f'NumPy vstack: {np.vstack([np_a, np_b]).shape}')
    print(f'NumPy hstack: {np.hstack([np_a, np_b]).shape}')


demonstrate_cat_stack()

### 1.5 Advanced Indexing: Selecting and Modifying Elements

Beyond basic slicing, NumPy and PyTorch support **fancy indexing** (integer arrays) and **boolean indexing** (masks). These are essential for:

- Selecting specific samples from a batch
- Gathering predictions for specific classes
- Masking out padded positions in sequences
- Implementing attention masks

**Critical difference from basic slicing:** reading with fancy or boolean indexing always returns a *copy*, not a view. Assigning through such an index (for example `x[mask] = 0`) still writes into the original tensor.
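
The view-versus-copy distinction is easiest to see by writing into each result and checking the original. This is a minimal sketch with made-up values.

```python
import torch

data = torch.arange(10.0)

sliced = data[2:5]                       # basic slice: a view
picked = data[torch.tensor([2, 3, 4])]   # fancy index: a copy
masked = data[data > 7]                  # boolean mask: a copy

sliced[0] = 100.0   # writes through to data[2]
picked[1] = -1.0    # changes only the copy; data[3] stays 3
masked[0] = -2.0    # changes only the copy; data[8] stays 8

# Index *assignment* is different: it writes into the original tensor.
data[torch.tensor([0, 1])] = 50.0

print(data)
```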

In [None]:
def demonstrate_fancy_indexing() -> None:
    """Show integer (fancy) indexing patterns."""
    # Sample data: batch of 5 feature vectors, each with 4 features
    data = torch.tensor([
        [1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0],
        [17.0, 18.0, 19.0, 20.0],
    ])
    print(f'Data shape: {data.shape}')
    print()

    # Select specific rows (samples)
    indices = torch.tensor([0, 2, 4])
    selected = data[indices]
    print(f'Select rows [0, 2, 4]: shape={selected.shape}')
    print(selected)
    print()

    # Select specific (row, col) pairs
    row_idx = torch.tensor([0, 1, 2, 3, 4])
    col_idx = torch.tensor([3, 2, 1, 0, 3])
    diagonal_like = data[row_idx, col_idx]
    print(f'Select (row, col) pairs: shape={diagonal_like.shape}')
    print(f'  Values: {diagonal_like}')
    print(f'  data[0,3]={data[0,3]:.0f}, data[1,2]={data[1,2]:.0f}, ...')
    print()

    # Practical: gather predicted class probabilities
    # logits shape: (batch, num_classes)
    logits = torch.randn(4, 10)  # 4 samples, 10 classes
    targets = torch.tensor([3, 7, 1, 5])  # True class for each sample
    target_logits = logits[torch.arange(4), targets]
    print(f'Gather target logits: logits{list(logits.shape)} → {target_logits.shape}')
    print(f'  target_logits[0] = logits[0, {targets[0]}] = {target_logits[0]:.4f}')
    print()

    # Negative indexing
    last_two = data[[-2, -1]]
    print(f'Negative indices [-2, -1]: shape={last_two.shape}')
    print(last_two)


demonstrate_fancy_indexing()

Boolean indexing uses a mask of True/False values to select elements. This is the vectorized replacement for `if` statements inside loops.
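
As a small illustration of that replacement, the sketch below sums the positive entries of a vector twice: once with an explicit `if` inside a loop, and once with a mask. The values are arbitrary.

```python
import torch

values = torch.tensor([0.5, -1.2, 3.0, -0.1, 2.2])

# Loop version: branch on every element.
loop_total = 0.0
for v in values.tolist():
    if v > 0:
        loop_total += v

# Mask version: one comparison, one indexed sum.
mask = values > 0
mask_total = values[mask].sum().item()

print(f'loop: {loop_total:.2f}, mask: {mask_total:.2f}')
```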

In [None]:
def demonstrate_boolean_indexing() -> None:
    """Show boolean masking patterns common in ML."""
    torch.manual_seed(SEED)
    data = torch.randn(5, 4)
    print(f'Data:\n{data}')
    print()

    # Basic condition: select positive values
    mask = data > 0
    print(f'Mask (data > 0):\n{mask}')
    print(f'Positive values: {data[mask]}')
    print(f'Count: {mask.sum()} / {mask.numel()}')
    print()

    # Combining conditions
    complex_mask = (data > -0.5) & (data < 0.5)
    print(f'Values in (-0.5, 0.5): {data[complex_mask]}')
    print()

    # Practical: ReLU activation via masking
    relu_manual = data.clone()
    relu_manual[relu_manual < 0] = 0
    relu_torch = torch.relu(data)
    print(f'Manual ReLU matches torch.relu: {torch.allclose(relu_manual, relu_torch)}')
    print()

    # Practical: mask padded positions in a sequence
    seq_lengths = torch.tensor([3, 5, 2, 4])  # 4 sequences, max_len=5
    max_len = 5
    # Create padding mask: True where position < seq_length
    positions = torch.arange(max_len).unsqueeze(0)  # (1, max_len)
    lengths = seq_lengths.unsqueeze(1)               # (batch, 1)
    padding_mask = positions < lengths               # (batch, max_len)
    print(f'Sequence lengths: {seq_lengths}')
    print(f'Padding mask (True = valid):\n{padding_mask}')
    print()

    # Apply mask to zero out padded positions
    sequences = torch.randn(4, 5, 8)  # (batch, seq_len, features)
    masked_sequences = sequences * padding_mask.unsqueeze(-1)
    print(f'Masked sequences shape: {masked_sequences.shape}')
    print(f'  Padded positions are zero: '
          f'{(masked_sequences[2, 2:, :].abs().sum() == 0).item()}')


demonstrate_boolean_indexing()

### 1.6 Conditional Selection: where, clamp, and masked_fill

These operations apply conditions element-wise without explicit loops:

- `torch.where(condition, x, y)`: select from `x` where True, `y` where False
- `torch.clamp(x, min, max)`: clip values to a range
- `tensor.masked_fill(mask, value)`: fill masked positions with a constant

These are building blocks for activation functions, loss clipping, and attention masking.
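
For attention masking in particular, the reason for filling with `-inf` becomes clear once softmax is applied. The sketch below uses all-zero scores (an assumption made purely so the resulting weights are easy to verify by hand) and also expresses the same fill with `torch.where`.

```python
import torch

scores = torch.zeros(3, 3)  # uniform scores keep the arithmetic checkable
future = torch.triu(torch.ones(3, 3), diagonal=1).bool()  # True above the diagonal

masked = scores.masked_fill(future, float('-inf'))
weights = torch.softmax(masked, dim=-1)

# Row 0 attends only to position 0, row 1 splits evenly over two positions,
# row 2 splits evenly over all three.
print(weights)

# The same fill written with torch.where.
masked_where = torch.where(future, torch.full_like(scores, float('-inf')), scores)
```

Each row still sums to 1 because softmax renormalizes over the unmasked positions.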

In [None]:
def demonstrate_conditional_ops() -> None:
    """Show conditional tensor operations."""
    torch.manual_seed(SEED)
    x = torch.randn(3, 4)
    print(f'x:\n{x}')
    print()

    # torch.where: element-wise conditional
    result = torch.where(x > 0, x, torch.zeros_like(x))
    print(f'where(x > 0, x, 0) — manual ReLU:\n{result}')
    print()

    # torch.where with different fill values
    labels = torch.where(x > 0, torch.ones_like(x), -torch.ones_like(x))
    print(f'where(x > 0, 1, -1) — sign function:\n{labels}')
    print()

    # torch.clamp: clip to range
    clamped = torch.clamp(x, min=-0.5, max=0.5)
    print(f'clamp(-0.5, 0.5):\n{clamped}')
    print()

    # Practical: gradient clipping simulation
    gradients = torch.randn(1000) * 5
    clipped = torch.clamp(gradients, min=-1.0, max=1.0)
    print(f'Gradient clipping: max before={gradients.abs().max():.2f}, '
          f'max after={clipped.abs().max():.2f}')
    print()

    # masked_fill: used in transformer attention
    attention_scores = torch.randn(4, 4)
    # Create causal mask (upper triangular = future tokens)
    causal_mask = torch.triu(torch.ones(4, 4), diagonal=1).bool()
    masked_scores = attention_scores.masked_fill(causal_mask, float('-inf'))
    print(f'Causal mask:\n{causal_mask.int()}')
    print(f'Masked attention scores:\n{masked_scores}')
    print(f'  -inf positions will become 0 after softmax')


demonstrate_conditional_ops()

### 1.7 Gather & Scatter: Advanced Element Selection

`torch.gather` and `torch.scatter` are the most powerful (and most confusing) indexing operations. They select or place elements along a specific dimension using an index tensor.

**`gather(input, dim, index)`**: for each position in `index`, look up the element in `input` at that index along `dim`.

**`scatter(dim, index, src)`**: the inverse of gather. Place elements from `src` into positions specified by `index` along `dim`.

These are used in:

- Cross-entropy loss (gathering log-probabilities for target classes)
- One-hot encoding (scattering 1s into class positions)
- Top-k selection in beam search
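
The two definitions above can be spelled out as plain loops for the 2D, `dim=1` case. The helpers `gather_dim1` and `scatter_dim1` below are hypothetical reference implementations written for this illustration, not part of PyTorch; they are slow and only meant to make the index arithmetic explicit.

```python
import torch


def gather_dim1(source: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    """Loop reference for source.gather(1, index) on 2D tensors."""
    out = torch.empty(index.shape, dtype=source.dtype)
    for i in range(index.shape[0]):
        for j in range(index.shape[1]):
            out[i, j] = source[i, index[i, j]]   # index picks the source column
    return out


def scatter_dim1(target: torch.Tensor, index: torch.Tensor,
                 src: torch.Tensor) -> torch.Tensor:
    """Loop reference for target.scatter(1, index, src) on 2D tensors."""
    out = target.clone()
    for i in range(index.shape[0]):
        for j in range(index.shape[1]):
            out[i, index[i, j]] = src[i, j]      # index picks the destination column
    return out


source = torch.arange(12.0).reshape(3, 4)
index = torch.tensor([[0, 3], [1, 2], [2, 0]])   # no repeated columns within a row

gathered = gather_dim1(source, index)
scattered = scatter_dim1(torch.zeros(3, 4), index, gathered)

print(gathered)
print(scattered)
```

Scattering the gathered values back with the same index restores them to their original columns, which is the sense in which the two operations are inverses.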

In [None]:
def demonstrate_gather_scatter() -> None:
    """Show gather and scatter operations with practical examples."""
    # ── Gather ──────────────────────────────────────────────────────────────
    # Scenario: select the probability of the correct class for each sample
    torch.manual_seed(SEED)
    probs = torch.softmax(torch.randn(4, 5), dim=1)  # (batch=4, classes=5)
    targets = torch.tensor([2, 0, 4, 1])              # Correct class per sample
    print(f'Class probabilities (4 samples, 5 classes):')
    print(f'{probs}')
    print(f'Targets: {targets}')
    print()

    # Gather: select target class probability for each sample
    target_probs = probs.gather(dim=1, index=targets.unsqueeze(1))
    print(f'Gathered target probs: {target_probs.squeeze()}')
    print(f'  Verify: probs[0, 2] = {probs[0, 2]:.4f}, '
          f'gathered[0] = {target_probs[0, 0]:.4f}')
    print()

    # Equivalent using fancy indexing (simpler but less general)
    target_probs_fancy = probs[torch.arange(4), targets]
    assert torch.allclose(target_probs.squeeze(), target_probs_fancy)
    print(f'Fancy indexing gives same result: True')
    print()

    # ── Scatter ─────────────────────────────────────────────────────────────
    # Scenario: create one-hot encoding
    num_classes = 5
    labels = torch.tensor([0, 3, 1, 4, 2])
    one_hot = torch.zeros(5, num_classes)
    one_hot.scatter_(dim=1, index=labels.unsqueeze(1), value=1.0)
    print(f'One-hot encoding (scatter):')
    print(one_hot)
    print()

    # Verify against F.one_hot
    import torch.nn.functional as F
    one_hot_lib = F.one_hot(labels, num_classes=num_classes).float()
    assert torch.equal(one_hot, one_hot_lib)
    print(f'Matches F.one_hot: True')
    print()

    # ── Top-k gather ────────────────────────────────────────────────────────
    scores = torch.randn(3, 8)  # 3 sequences, vocabulary of 8
    topk_vals, topk_idx = scores.topk(3, dim=1)  # Top 3 per sequence
    print(f'Top-3 values: {topk_vals.shape}')
    print(f'Top-3 indices: {topk_idx}')
    print(f'  Gathered back: {scores.gather(1, topk_idx)}')
    assert torch.equal(topk_vals, scores.gather(1, topk_idx))


demonstrate_gather_scatter()

### 1.8 Einsum: The Universal Tensor Operation

Einstein summation (`einsum`) is a compact notation for expressing tensor contractions, transpositions, traces, and outer products in a single string.

The notation uses subscript labels for each dimension:

- An index shared between operands lines up those dimensions, and the matching elements are multiplied
- Output indices appear on the right side of `->`
- Indices that appear in the inputs but not in the output are summed out (contraction)

Examples:

- `'ij,jk->ik'`: matrix multiplication (sum over j)
- `'ii->'`: trace (sum of diagonal)
- `'ij->ji'`: transpose
- `'i,j->ij'`: outer product
- `'bij,bjk->bik'`: batched matrix multiplication
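
Reading an einsum string as nested loops can help: every distinct letter becomes a loop, and letters missing from the output accumulate into the same output cell. The helper `matmul_loops` below is a hypothetical from-scratch version of `'ij,jk->ik'` written for this illustration.

```python
import torch


def matmul_loops(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Spell out 'ij,jk->ik' as explicit loops over i, k, and j."""
    I, J = A.shape
    J2, K = B.shape
    assert J == J2, 'the shared index j must have the same size in both inputs'
    out = torch.zeros(I, K)
    for i in range(I):
        for k in range(K):
            for j in range(J):   # j is absent from the output, so it is summed
                out[i, k] += A[i, j] * B[j, k]
    return out


torch.manual_seed(0)
A = torch.randn(2, 3)
B = torch.randn(3, 4)

loop_result = matmul_loops(A, B)
einsum_result = torch.einsum('ij,jk->ik', A, B)
print(torch.allclose(loop_result, einsum_result, atol=1e-5))
```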

In [None]:
def demonstrate_einsum_basics() -> None:
    """Show einsum for common operations, verifying against explicit implementations."""
    torch.manual_seed(SEED)

    # ── Dot product: 'i,i->' ────────────────────────────────────────────────
    a = torch.randn(5)
    b = torch.randn(5)
    dot_einsum = torch.einsum('i,i->', a, b)
    dot_manual = torch.dot(a, b)
    print(f'Dot product:')
    print(f'  einsum: {dot_einsum:.4f}')
    print(f'  manual: {dot_manual:.4f}')
    assert torch.allclose(dot_einsum, dot_manual)
    print()

    # ── Matrix multiplication: 'ij,jk->ik' ─────────────────────────────────
    A = torch.randn(3, 4)
    B = torch.randn(4, 5)
    matmul_einsum = torch.einsum('ij,jk->ik', A, B)
    matmul_manual = A @ B
    print(f'Matrix multiply: {A.shape} × {B.shape} → {matmul_einsum.shape}')
    assert torch.allclose(matmul_einsum, matmul_manual)
    print(f'  Matches @ operator: True')
    print()

    # ── Outer product: 'i,j->ij' ────────────────────────────────────────────
    x = torch.tensor([1.0, 2.0, 3.0])
    y = torch.tensor([4.0, 5.0])
    outer_einsum = torch.einsum('i,j->ij', x, y)
    outer_manual = x.unsqueeze(1) * y.unsqueeze(0)
    print(f'Outer product: ({x.shape[0]},) × ({y.shape[0]},) → {outer_einsum.shape}')
    print(outer_einsum)
    assert torch.allclose(outer_einsum, outer_manual)
    print()

    # ── Transpose: 'ij->ji' ─────────────────────────────────────────────────
    M = torch.randn(3, 4)
    transpose_einsum = torch.einsum('ij->ji', M)
    assert torch.allclose(transpose_einsum, M.T)
    print(f'Transpose: {M.shape} → {transpose_einsum.shape} (matches .T)')
    print()

    # ── Trace: 'ii->' ───────────────────────────────────────────────────────
    sq = torch.randn(4, 4)
    trace_einsum = torch.einsum('ii->', sq)
    trace_manual = torch.trace(sq)
    print(f'Trace: {trace_einsum:.4f} (matches torch.trace: {trace_manual:.4f})')
    print()

    # ── Diagonal: 'ii->i' ───────────────────────────────────────────────────
    diag_einsum = torch.einsum('ii->i', sq)
    diag_manual = torch.diagonal(sq)
    print(f'Diagonal: {diag_einsum} (sum = trace = {diag_einsum.sum():.4f})')
    assert torch.allclose(diag_einsum, diag_manual)
    print()

    # ── Element-wise multiply and sum: 'ij,ij->' ────────────────────────────
    C = torch.randn(3, 4)
    D = torch.randn(3, 4)
    hadamard_sum_einsum = torch.einsum('ij,ij->', C, D)
    hadamard_sum_manual = (C * D).sum()
    print(f'Frobenius inner product: {hadamard_sum_einsum:.4f}')
    assert torch.allclose(hadamard_sum_einsum, hadamard_sum_manual)


demonstrate_einsum_basics()

Now let's look at the einsum patterns that appear most frequently in deep learning: batched operations, attention computation, and multi-head projections.
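
As one end-to-end sketch, the two einsum calls below chain attention scores into a weighted sum of values. The `1/sqrt(d_k)` scaling and the value tensor `V` are assumptions taken from standard scaled dot-product attention rather than from this notebook, and the shapes are arbitrary.

```python
import math

import torch

torch.manual_seed(0)
batch, heads, seq, d_k = 2, 3, 5, 8
Q = torch.randn(batch, heads, seq, d_k)
K = torch.randn(batch, heads, seq, d_k)
V = torch.randn(batch, heads, seq, d_k)

# Scores: contract the feature index d between queries (i) and keys (j).
scores = torch.einsum('bhid,bhjd->bhij', Q, K) / math.sqrt(d_k)
weights = torch.softmax(scores, dim=-1)

# Context: contract the key index j between weights and values.
context = torch.einsum('bhij,bhjd->bhid', weights, V)

# Same computation with the @ operator, for comparison.
reference = weights @ V
print(context.shape, torch.allclose(context, reference, atol=1e-5))
```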

In [None]:
def demonstrate_einsum_ml_patterns() -> None:
    """Show einsum patterns commonly used in ML/DL implementations."""
    torch.manual_seed(SEED)

    # ── Batched matrix multiplication: 'bij,bjk->bik' ──────────────────────
    batch_A = torch.randn(8, 3, 4)  # 8 batches of (3×4) matrices
    batch_B = torch.randn(8, 4, 5)  # 8 batches of (4×5) matrices
    bmm_einsum = torch.einsum('bij,bjk->bik', batch_A, batch_B)
    bmm_torch = torch.bmm(batch_A, batch_B)
    print(f'Batched matmul: {batch_A.shape} × {batch_B.shape} → {bmm_einsum.shape}')
    assert torch.allclose(bmm_einsum, bmm_torch)
    print(f'  Matches torch.bmm: True')
    print()

    # ── Attention scores: 'bhid,bhjd->bhij' ─────────────────────────────────
    batch, heads, seq_q, seq_k, d_k = 2, 4, 6, 8, 16
    Q = torch.randn(batch, heads, seq_q, d_k)
    K = torch.randn(batch, heads, seq_k, d_k)
    attn_scores = torch.einsum('bhid,bhjd->bhij', Q, K)
    attn_manual = Q @ K.transpose(-2, -1)
    print(f'Attention scores: Q{list(Q.shape)} × K{list(K.shape)} → {list(attn_scores.shape)}')
    assert torch.allclose(attn_scores, attn_manual, atol=1e-6)
    print(f'  Matches Q @ K^T: True')
    print()

    # ── Bilinear form: 'bi,ij,bj->b' ────────────────────────────────────────
    x = torch.randn(4, 3)
    W = torch.randn(3, 3)
    y = torch.randn(4, 3)
    bilinear = torch.einsum('bi,ij,bj->b', x, W, y)
    bilinear_manual = (x @ W * y).sum(dim=1)
    print(f'Bilinear form x^T W y: batch result shape={bilinear.shape}')
    assert torch.allclose(bilinear, bilinear_manual, atol=1e-5)
    print()

    # ── Row-wise and column-wise sum ─────────────────────────────────────────
    mat = torch.randn(3, 4)
    row_sum = torch.einsum('ij->i', mat)  # Sum each row
    col_sum = torch.einsum('ij->j', mat)  # Sum each column
    total = torch.einsum('ij->', mat)      # Sum all elements
    print(f'Row sums: {row_sum} (matches .sum(1): {torch.allclose(row_sum, mat.sum(1))})')
    print(f'Col sums: {col_sum} (matches .sum(0): {torch.allclose(col_sum, mat.sum(0))})')
    print(f'Total: {total:.4f} (matches .sum(): {torch.allclose(total, mat.sum())})')
    print()

    # ── Summary table ───────────────────────────────────────────────────────
    einsum_ref = pd.DataFrame({
        'Pattern': [
            'i,i->', 'ij,jk->ik', 'bij,bjk->bik', 'i,j->ij',
            'ij->ji', 'ii->', 'ij,ij->', 'bhid,bhjd->bhij',
        ],
        'Operation': [
            'Dot product', 'Matrix multiply', 'Batched matmul', 'Outer product',
            'Transpose', 'Trace', 'Frobenius inner', 'Attention scores',
        ],
        'Equivalent': [
            'torch.dot', 'A @ B', 'torch.bmm', 'torch.outer',
            '.T', 'torch.trace', '(A*B).sum()', 'Q @ K.transpose(-2, -1)',
        ],
    })
    print('=== Einsum Quick Reference ===')
    print(einsum_ref.to_string(index=False))


demonstrate_einsum_ml_patterns()

### 1.9 In-Place Operations: Speed vs Safety

In-place operations modify a tensor without allocating new memory. In PyTorch, they're marked with a trailing underscore: `add_()`, `mul_()`, `relu_()`.

**Benefits:**

- Save memory (no temporary allocation)
- Slightly faster for large tensors

**Dangers:**

- **Break autograd** if applied to tensors that require gradients
- **Corrupt views** if another tensor shares the same memory
- **Make debugging harder**: original values are lost

**Rule of thumb:** Use in-place ops only in inference or data preprocessing. Never use them during training on tensors involved in gradient computation.
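
The sketch below shows both sides of that rule on a toy parameter `w` (a made-up example): PyTorch rejects an in-place update on a leaf tensor that requires grad, while the same kind of update inside `torch.no_grad()` is the usual way to apply a manual gradient step.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# In-place on a leaf that requires grad is rejected before anything changes.
try:
    w.add_(1.0)
    leaf_error = None
except RuntimeError as err:
    leaf_error = str(err)
print(f'leaf in-place raised: {leaf_error is not None}')

# Compute a gradient, then update in place with autograd recording disabled.
loss = (w ** 2).sum()
loss.backward()                  # w.grad = 2 * w = [2., 4.]
with torch.no_grad():
    w.sub_(0.1 * w.grad)         # w becomes [0.8, 1.6]
w.grad.zero_()                   # reset the accumulated gradient in place

print(w)
```

Here the in-place calls are safe because they happen outside the recorded graph: the update runs under `no_grad`, and `grad.zero_()` touches only the gradient buffer.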

In [None]:
def demonstrate_inplace_ops() -> None:
    """Show in-place operations and their gotchas."""
    # Basic in-place operations
    a = torch.tensor([1.0, 2.0, 3.0])
    print(f'Before: {a}')
    print(f'  data_ptr: {a.data_ptr()}')

    a.add_(10)  # In-place add
    print(f'After add_(10): {a}')
    print(f'  data_ptr: {a.data_ptr()} (same!)')
    print()

    # Compare: out-of-place creates new tensor
    b = torch.tensor([1.0, 2.0, 3.0])
    c = b + 10  # Out-of-place
    print(f'Out-of-place: b unchanged = {b}, c = {c}')
    print()

    # ── Danger 1: Breaking autograd ─────────────────────────────────────────
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = x.exp()  # exp() saves its output y for the backward pass
    try:
        y.add_(1)  # In-place on a tensor needed for backward
        y.sum().backward()  # Autograd detects the modification here
        print('In-place on grad tensor: succeeded')
    except RuntimeError as e:
        print(f'In-place on grad tensor: RuntimeError')
        print(f'  {str(e)[:80]}')
    print()

    # ── Danger 2: Corrupting views ──────────────────────────────────────────
    original = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
    view_slice = original[1:4]  # View of elements [2, 3, 4]
    print(f'original: {original}')
    print(f'view:     {view_slice}')

    view_slice.mul_(10)  # In-place multiply
    print(f'After view.mul_(10):')
    print(f'  view:     {view_slice}')
    print(f'  original: {original}  ← CORRUPTED!')
    print()

    # ── Common in-place operations ──────────────────────────────────────────
    ops_table = pd.DataFrame({
        'In-place': ['add_()', 'mul_()', 'zero_()', 'fill_(v)',
                     'clamp_()', 'relu_()', 'uniform_()', 'normal_()'],
        'Out-of-place': ['add() / +', 'mul() / *', 'torch.zeros_like()', 'torch.full_like()',
                         'torch.clamp()', 'torch.relu()', 'torch.rand_like()', 'torch.randn_like()'],
        'Safe in Training?': ['No*', 'No*', 'Yes (param init)', 'Yes (param init)',
                              'No*', 'No*', 'Yes (param init)', 'Yes (param init)'],
    })
    print('=== In-place Operations Reference ===')
    print(ops_table.to_string(index=False))
    print('* Not safe if tensor requires grad or is used in computation graph')


demonstrate_inplace_ops()

---

## Part 2 — Putting It All Together: TensorOps Toolkit

We've covered many individual operations. Now let's assemble them into a reusable `TensorOps` class that provides common tensor manipulation patterns used throughout this course, with shape validation at each step.

In [None]:
class TensorOps:
    """Collection of common tensor manipulation patterns with shape validation.

    All methods are static and operate on PyTorch tensors (they rely on
    tensor-only methods such as `.dim()`, `.unsqueeze()`, and `.permute()`).
    Several methods include assertion checks to catch shape mismatches early.

    Attributes:
        None — all methods are static.
    """

    @staticmethod
    def batch_flatten(x: torch.Tensor) -> torch.Tensor:
        """Flatten all dimensions except the batch dimension.

        Args:
            x: Input tensor of shape (batch, ...).

        Returns:
            Flattened tensor of shape (batch, product_of_remaining_dims).
        """
        assert x.dim() >= 2, f'Expected at least 2D tensor, got {x.dim()}D'
        batch_size = x.shape[0]
        return x.reshape(batch_size, -1)

    @staticmethod
    def add_batch_dim(x: torch.Tensor) -> torch.Tensor:
        """Add a batch dimension at position 0.

        Args:
            x: Input tensor of any shape.

        Returns:
            Tensor with shape (1, *original_shape).
        """
        return x.unsqueeze(0)

    @staticmethod
    def channels_last_to_first(x: torch.Tensor) -> torch.Tensor:
        """Convert image tensor from (N, H, W, C) to (N, C, H, W).

        Args:
            x: Image tensor in channels-last format.

        Returns:
            Image tensor in channels-first format.
        """
        assert x.dim() == 4, f'Expected 4D tensor (N,H,W,C), got {x.dim()}D'
        return x.permute(0, 3, 1, 2)

    @staticmethod
    def channels_first_to_last(x: torch.Tensor) -> torch.Tensor:
        """Convert image tensor from (N, C, H, W) to (N, H, W, C).

        Args:
            x: Image tensor in channels-first format.

        Returns:
            Image tensor in channels-last format.
        """
        assert x.dim() == 4, f'Expected 4D tensor (N,C,H,W), got {x.dim()}D'
        return x.permute(0, 2, 3, 1)

    @staticmethod
    def create_padding_mask(
        lengths: torch.Tensor,
        max_len: int,
    ) -> torch.Tensor:
        """Create a boolean padding mask from sequence lengths.

        Args:
            lengths: Tensor of sequence lengths, shape (batch,).
            max_len: Maximum sequence length.

        Returns:
            Boolean mask of shape (batch, max_len). True = valid, False = padding.
        """
        positions = torch.arange(max_len, device=lengths.device).unsqueeze(0)
        return positions < lengths.unsqueeze(1)

    @staticmethod
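    def create_causal_mask(seq_len: int) -> torch.Tensor:
        """Create a causal attention mask that hides future positions.

        Note the convention: True marks entries strictly above the diagonal
        (positions to block), the opposite of `create_padding_mask`.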

        Args:
            seq_len: Sequence length.

        Returns:
            Boolean mask of shape (seq_len, seq_len). True = masked (future).
        """
        return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

    @staticmethod
    def one_hot_encode(
        labels: torch.Tensor,
        num_classes: int,
    ) -> torch.Tensor:
        """Create one-hot encoding using scatter.

        Args:
            labels: Integer labels of shape (batch,).
            num_classes: Total number of classes.

        Returns:
            One-hot tensor of shape (batch, num_classes).
        """
        assert labels.dim() == 1, f'Expected 1D labels, got {labels.dim()}D'
        # Allocate on the same device as `labels` so scatter_ also works on GPU
        one_hot = torch.zeros(labels.shape[0], num_classes,
                              dtype=torch.float32, device=labels.device)
        one_hot.scatter_(1, labels.unsqueeze(1), 1.0)
        return one_hot

    @staticmethod
    def attention_scores(
        query: torch.Tensor,
        key: torch.Tensor,
    ) -> torch.Tensor:
        """Compute scaled dot-product attention scores using einsum.

        Args:
            query: Query tensor of shape (batch, heads, seq_q, d_k).
            key: Key tensor of shape (batch, heads, seq_k, d_k).

        Returns:
            Attention scores of shape (batch, heads, seq_q, seq_k).
        """
        d_k = query.shape[-1]
        scores = torch.einsum('bhid,bhjd->bhij', query, key)
        return scores / (d_k ** 0.5)

Let's verify that each method in the toolkit works correctly on sample data.

In [None]:
def test_tensor_ops() -> None:
    """Test all TensorOps methods with assertions."""
    ops = TensorOps()
    torch.manual_seed(SEED)

    # batch_flatten
    images = torch.randn(8, 3, 32, 32)
    flat = ops.batch_flatten(images)
    assert flat.shape == (8, 3 * 32 * 32), f'Expected (8, 3072), got {flat.shape}'
    print(f'batch_flatten: {images.shape} → {flat.shape} ✓')

    # add_batch_dim
    single = torch.randn(3, 32, 32)
    batched = ops.add_batch_dim(single)
    assert batched.shape == (1, 3, 32, 32)
    print(f'add_batch_dim: {single.shape} → {batched.shape} ✓')

    # channels_last_to_first
    nhwc = torch.randn(8, 32, 32, 3)
    nchw = ops.channels_last_to_first(nhwc)
    assert nchw.shape == (8, 3, 32, 32)
    print(f'channels_last_to_first: {nhwc.shape} → {nchw.shape} ✓')

    # channels_first_to_last
    back = ops.channels_first_to_last(nchw)
    assert back.shape == (8, 32, 32, 3)
    assert torch.equal(back, nhwc)
    print(f'channels_first_to_last: {nchw.shape} → {back.shape} ✓')

    # create_padding_mask
    lens = torch.tensor([3, 5, 2])
    mask = ops.create_padding_mask(lens, max_len=6)
    assert mask.shape == (3, 6)
    assert mask[0].sum() == 3
    assert mask[1].sum() == 5
    print(f'create_padding_mask: lengths={lens.tolist()} → {mask.shape} ✓')

    # create_causal_mask
    cmask = ops.create_causal_mask(4)
    assert cmask.shape == (4, 4)
    assert cmask[0, 1] == True   # Future position masked
    assert cmask[1, 0] == False  # Past position not masked
    print(f'create_causal_mask: seq_len=4 → {cmask.shape} ✓')

    # one_hot_encode
    labels = torch.tensor([0, 3, 1, 4])
    onehot = ops.one_hot_encode(labels, num_classes=5)
    assert onehot.shape == (4, 5)
    assert onehot[0, 0] == 1.0
    assert onehot[1, 3] == 1.0
    print(f'one_hot_encode: {labels.tolist()} → {onehot.shape} ✓')

    # attention_scores
    Q = torch.randn(2, 4, 6, 16)
    K = torch.randn(2, 4, 8, 16)
    scores = ops.attention_scores(Q, K)
    assert scores.shape == (2, 4, 6, 8)
    print(f'attention_scores: Q{list(Q.shape)} × K{list(K.shape)} → {list(scores.shape)} ✓')

    print()
    print('All TensorOps tests passed!')


test_tensor_ops()
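
The two mask helpers use opposite conventions (`create_padding_mask` marks valid tokens, `create_causal_mask` marks blocked positions), so they need to be combined with care. A minimal sketch, assuming the masks are applied to raw attention scores with `masked_fill` before a softmax; the masks are rebuilt inline here so the block runs on its own.

```python
import torch

torch.manual_seed(0)
batch, seq_len = 2, 4
lengths = torch.tensor([4, 2])

# True = valid token (same convention as create_padding_mask)
valid = torch.arange(seq_len).unsqueeze(0) < lengths.unsqueeze(1)     # (batch, seq)
# True = future position to hide (same convention as create_causal_mask)
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()  # (seq, seq)

# Block a key position if it is padding OR lies in the future.
# (batch, 1, seq) | (1, seq, seq) broadcasts to (batch, seq, seq).
blocked = (~valid).unsqueeze(1) | future.unsqueeze(0)

scores = torch.randn(batch, seq_len, seq_len)
weights = scores.masked_fill(blocked, float('-inf')).softmax(dim=-1)
print(weights[1])  # second sequence: only the first two key columns are nonzero
```

Each query row keeps at least one unblocked key here (position 0 is always valid and never in the future), so the softmax never sees a row of all `-inf`.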

---

## Part 3 — Application: Real-World Tensor Manipulation

Now we apply our tensor manipulation skills to realistic scenarios. We'll work through three practical problems that combine multiple operations:

1. **Image data pipeline**: loading, reshaping, and normalizing image batches
2. **Einsum-powered linear layer**: building a neural network layer with einsum
3. **Sequence padding and batching**: preparing variable-length text for models

### 3.1 Image Data Pipeline

When working with image data, we constantly need to convert between formats, normalize pixel values, and reshape tensors for model input. Let's build a complete pipeline.

In [None]:
def image_pipeline_demo() -> None:
    """Demonstrate a complete image tensor manipulation pipeline."""
    np.random.seed(SEED)

    # Simulate loading 16 images as a uint8 batch in (N, H, W, C) format
    raw_images = np.random.randint(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)
    print(f'Raw images: shape={raw_images.shape}, dtype={raw_images.dtype}')
    print(f'  Pixel range: [{raw_images.min()}, {raw_images.max()}]')
    print()

    # Step 1: Convert to float32 and normalize to [0, 1]
    images_float = raw_images.astype(np.float32) / 255.0
    print(f'Step 1 — Float32 normalize: dtype={images_float.dtype}, '
          f'range=[{images_float.min():.2f}, {images_float.max():.2f}]')
    print()

    # Step 2: Convert to PyTorch tensor
    images_tensor = torch.from_numpy(images_float)
    print(f'Step 2 — To tensor: {images_tensor.shape}')
    print()

    # Step 3: Channels-last to channels-first (N, H, W, C) → (N, C, H, W)
    images_chw = images_tensor.permute(0, 3, 1, 2)
    print(f'Step 3 — Channels first: {images_chw.shape}')
    assert images_chw.shape == (16, 3, 64, 64)
    print()

    # Step 4: Normalize with ImageNet mean/std (per channel)
    mean = torch.tensor([0.485, 0.456, 0.406]).reshape(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).reshape(1, 3, 1, 1)
    images_normalized = (images_chw - mean) / std
    print(f'Step 4 — ImageNet normalize: '
          f'mean≈{images_normalized.mean(dim=(0,2,3)).tolist()}')
    print()

    # Step 5: Flatten for a fully-connected layer
    images_flat = images_normalized.reshape(16, -1)
    print(f'Step 5 — Flatten: {images_flat.shape} '
          f'({3*64*64} = 3×64×64 features)')
    print()

    # Visualize the pipeline
    fig, axes = plt.subplots(2, 4, figsize=(12, 6))
    for i in range(4):
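        # Original uint8 image
        axes[0, i].imshow(raw_images[i])
        axes[0, i].set_title(f'Raw #{i}')
        # Float image in [0, 1] (before ImageNet normalization),
        # permuted back to (H, W, C) because imshow expects channels-last
        img_display = images_chw[i].permute(1, 2, 0).numpy()
        axes[1, i].imshow(img_display)
        axes[1, i].set_title(f'Float #{i}')
        # Hide ticks but keep the axes on so the row labels set below stay visible
        for row in (0, 1):
            axes[row, i].set_xticks([])
            axes[row, i].set_yticks([])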
    axes[0, 0].set_ylabel('uint8 [0, 255]', fontsize=11)
    axes[1, 0].set_ylabel('float32 [0, 1]', fontsize=11)
    plt.suptitle('Image Pipeline: Raw → Float → Channels-first', fontsize=13)
    plt.tight_layout()
    plt.show()

    # Shape tracking summary
    pipeline_steps = pd.DataFrame({
        'Step': ['Load', 'Normalize', 'To Tensor', 'Permute', 'Normalize (IN)', 'Flatten'],
        'Shape': ['(16,64,64,3)', '(16,64,64,3)', '(16,64,64,3)',
                  '(16,3,64,64)', '(16,3,64,64)', '(16,12288)'],
        'dtype': ['uint8', 'float32', 'float32', 'float32', 'float32', 'float32'],
        'Range': ['[0,255]', '[0,1]', '[0,1]', '[0,1]', '~[-2.1,2.6]', '~[-2.1,2.6]'],
    })
    print('=== Pipeline Shape Tracking ===')
    print(pipeline_steps.to_string(index=False))


image_pipeline_demo()
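
The plotting code in the pipeline displays the pre-normalization tensor. If only the ImageNet-normalized tensor is available, the transform can be inverted with the same broadcast shapes. A minimal sketch, assuming random images in place of real data:

```python
import torch

torch.manual_seed(0)

# Per-channel statistics reshaped to (1, C, 1, 1) so they broadcast over N, H, W
mean = torch.tensor([0.485, 0.456, 0.406]).reshape(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).reshape(1, 3, 1, 1)

images = torch.rand(16, 3, 64, 64)     # channels-first, values in [0, 1)
normalized = (images - mean) / std     # forward transform
restored = normalized * std + mean     # inverse transform

# Back to (N, H, W, C) for plt.imshow; clamp guards against float round-off
display = restored.permute(0, 2, 3, 1).clamp(0.0, 1.0)
print(display.shape)
```

Keeping `mean` and `std` in `(1, C, 1, 1)` form means the same two tensors serve both directions without any further reshaping.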

### 3.2 Einsum-Powered Linear Layer

Let's implement a linear layer (`y = xW + b`) using einsum and compare it against PyTorch's `nn.Linear`. Einsum expresses the matrix multiplication as a single index string; the bias is still added through ordinary broadcasting.

In [None]:
def einsum_linear_layer_demo() -> None:
    """Build and benchmark a linear layer using einsum."""
    torch.manual_seed(SEED)

    # Layer parameters
    in_features = 512
    out_features = 256
    batch_size = 64

    # Create weight and bias
    W = torch.randn(in_features, out_features) * 0.01
    b = torch.zeros(out_features)
    x = torch.randn(batch_size, in_features)

    # Method 1: Explicit matmul + broadcast
    y_matmul = x @ W + b
    print(f'Matmul: x{list(x.shape)} @ W{list(W.shape)} + b{list(b.shape)} → {list(y_matmul.shape)}')

    # Method 2: Einsum
    y_einsum = torch.einsum('bi,io->bo', x, W) + b
    print(f'Einsum: "bi,io->bo" → {list(y_einsum.shape)}')

    # Verify equivalence
    assert torch.allclose(y_matmul, y_einsum, atol=1e-5)
    print(f'Outputs match: True')
    print()

    # Method 3: nn.Linear (library reference)
    import torch.nn as nn
    linear = nn.Linear(in_features, out_features, bias=True)
    with torch.no_grad():
        linear.weight.copy_(W.T)  # nn.Linear stores weight as (out_features, in_features)
        linear.bias.copy_(b)
    y_nn = linear(x)
    assert torch.allclose(y_matmul, y_nn, atol=1e-5)
    print(f'Matches nn.Linear: True')
    print()

    # Benchmark all three
    t_matmul, _ = measure_time(lambda: x @ W + b)
    t_einsum, _ = measure_time(lambda: torch.einsum('bi,io->bo', x, W) + b)
    t_nn, _ = measure_time(lambda: linear(x))

    bench_df = pd.DataFrame({
        'Method': ['x @ W + b', 'einsum("bi,io->bo")', 'nn.Linear'],
        'Time (ms)': [t_matmul * 1000, t_einsum * 1000, t_nn * 1000],
        'Relative': [1.0, t_einsum / t_matmul, t_nn / t_matmul],
    })
    print('=== Linear Layer Benchmark ===')
    print(bench_df.to_string(index=False))
    print()
    print('All three agree within floating-point tolerance. Einsum is typically similar')
    print('in speed to explicit matmul; nn.Linear may be slightly faster due to fused ops.')


einsum_linear_layer_demo()
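
The same einsum pattern extends to inputs with extra leading dimensions, such as `(batch, seq, features)`. A minimal sketch, assuming small illustrative sizes, that checks a named-index string and an ellipsis string against `torch.nn.functional.linear`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 10, 32)       # (batch, seq, in_features)
W = torch.randn(32, 16) * 0.01   # (in_features, out_features)
b = torch.randn(16)

# Explicit index names: contract over i, keep batch (b) and sequence (s)
y_named = torch.einsum('bsi,io->bso', x, W) + b
# Ellipsis form: works for any number of leading dimensions
y_ellipsis = torch.einsum('...i,io->...o', x, W) + b
# Reference: F.linear expects the weight as (out_features, in_features)
y_ref = F.linear(x, W.T, b)

print(y_named.shape)
```

The ellipsis form is convenient when one layer has to accept both 2D and 3D inputs.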

### 3.3 Sequence Padding and Batching

In NLP, sequences have variable lengths but models need fixed-size tensors. We need to pad shorter sequences and create masks that tell the model which positions are real vs padded. This combines multiple operations we've learned.

In [None]:
def sequence_batching_demo() -> None:
    """Demonstrate variable-length sequence padding and masking."""
    torch.manual_seed(SEED)

    # Simulate tokenized sequences of different lengths
    sequences = [
        torch.tensor([4, 12, 7, 23, 1]),        # length 5
        torch.tensor([8, 3]),                     # length 2
        torch.tensor([15, 6, 9, 2, 11, 4, 7]),  # length 7
        torch.tensor([1, 22, 5, 18]),            # length 4
    ]
    lengths = torch.tensor([len(s) for s in sequences])
    print(f'Sequences: {[s.tolist() for s in sequences]}')
    print(f'Lengths: {lengths.tolist()}')
    print()

    # Step 1: Find max length and create padded tensor
    max_len = lengths.max().item()
    batch_size = len(sequences)
    PAD_TOKEN = 0
    padded = torch.full((batch_size, max_len), PAD_TOKEN, dtype=torch.long)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq
    print(f'Padded tensor (pad={PAD_TOKEN}):')
    print(padded)
    print(f'Shape: {padded.shape}')
    print()

    # Step 2: Create padding mask
    padding_mask = TensorOps.create_padding_mask(lengths, max_len)
    print(f'Padding mask (True = valid):')
    print(padding_mask.int())
    print()

    # Step 3: Simulate embedding lookup
    vocab_size = 30
    embed_dim = 8
    embedding_table = torch.randn(vocab_size, embed_dim)
    embedded = embedding_table[padded]  # Fancy indexing for lookup!
    print(f'Embedded: {padded.shape} → {embedded.shape} (each token → {embed_dim}D vector)')
    print()

    # Step 4: Apply mask to zero out padded embeddings
    mask_expanded = padding_mask.unsqueeze(-1)  # (batch, seq, 1)
    embedded_masked = embedded * mask_expanded
    print(f'Masked embedding shape: {embedded_masked.shape}')
    # Verify padded positions are zeroed
    assert embedded_masked[1, 2:, :].abs().sum() == 0, 'Padded positions should be zero'
    print(f'Padded positions zeroed: True')
    print()

    # Step 5: Compute sequence representations (mean of valid tokens)
    # Sum valid embeddings, divide by actual lengths
    seq_sums = embedded_masked.sum(dim=1)  # (batch, embed_dim)
    seq_means = seq_sums / lengths.unsqueeze(1).float()
    print(f'Sequence representations: {seq_means.shape}')
    print(f'  (Mean pooling over valid tokens only)')
    print()

    # Shape tracking
    steps = pd.DataFrame({
        'Step': ['Raw sequences', 'Padded', 'Embedded', 'Mask expanded',
                 'Masked embedded', 'Mean pooled'],
        'Shape': ['variable', f'({batch_size}, {max_len})',
                  f'({batch_size}, {max_len}, {embed_dim})',
                  f'({batch_size}, {max_len}, 1)',
                  f'({batch_size}, {max_len}, {embed_dim})',
                  f'({batch_size}, {embed_dim})'],
        'Operation': ['—', 'torch.full + assign', 'embedding[padded]',
                      'mask.unsqueeze(-1)', 'element-wise multiply',
                      'sum(dim=1) / lengths'],
    })
    print('=== Sequence Pipeline Shape Tracking ===')
    print(steps.to_string(index=False))


sequence_batching_demo()
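
The padding loop in Step 1 can also be written with PyTorch's built-in `torch.nn.utils.rnn.pad_sequence`. A minimal sketch, assuming the first three sequences from the demo and a pad token of 0:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [
    torch.tensor([4, 12, 7, 23, 1]),
    torch.tensor([8, 3]),
    torch.tensor([15, 6, 9, 2, 11, 4, 7]),
]
lengths = torch.tensor([len(s) for s in sequences])

# batch_first=True gives (batch, max_len) rather than (max_len, batch)
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

# Same mask construction as TensorOps.create_padding_mask (True = valid)
mask = torch.arange(padded.shape[1]).unsqueeze(0) < lengths.unsqueeze(1)

print(padded)
print(mask.int())
```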

### 3.4 Library Comparison: Our Implementations vs PyTorch Built-ins

Let's systematically verify that our manual implementations produce the same outputs as PyTorch's built-in operations, and compare their speed.

In [None]:
def library_comparison() -> None:
    """Compare manual tensor ops against PyTorch built-in equivalents."""
    torch.manual_seed(SEED)
    import torch.nn.functional as F

    comparisons: list[dict] = []

    # 1. One-hot encoding
    labels = torch.randint(0, 10, (1000,))
    our_onehot = TensorOps.one_hot_encode(labels, 10)
    lib_onehot = F.one_hot(labels, 10).float()
    match = torch.equal(our_onehot, lib_onehot)
    t_ours, _ = measure_time(lambda: TensorOps.one_hot_encode(labels, 10))
    t_lib, _ = measure_time(lambda: F.one_hot(labels, 10).float())
    comparisons.append({
        'Operation': 'One-hot encode',
        'Match': match,
        'Ours (ms)': t_ours * 1000,
        'PyTorch (ms)': t_lib * 1000,
    })

    # 2. Batch flatten
    images = torch.randn(32, 3, 32, 32)
    our_flat = TensorOps.batch_flatten(images)
    lib_flat = torch.flatten(images, start_dim=1)
    match = torch.equal(our_flat, lib_flat)
    t_ours, _ = measure_time(lambda: TensorOps.batch_flatten(images))
    t_lib, _ = measure_time(lambda: torch.flatten(images, start_dim=1))
    comparisons.append({
        'Operation': 'Batch flatten',
        'Match': match,
        'Ours (ms)': t_ours * 1000,
        'PyTorch (ms)': t_lib * 1000,
    })

    # 3. Causal mask
    our_mask = TensorOps.create_causal_mask(64)
    lib_mask = torch.triu(torch.ones(64, 64), diagonal=1).bool()
    match = torch.equal(our_mask, lib_mask)
    t_ours, _ = measure_time(lambda: TensorOps.create_causal_mask(64))
    t_lib, _ = measure_time(lambda: torch.triu(torch.ones(64, 64), diagonal=1).bool())
    comparisons.append({
        'Operation': 'Causal mask (64×64)',
        'Match': match,
        'Ours (ms)': t_ours * 1000,
        'PyTorch (ms)': t_lib * 1000,
    })

    # 4. Attention scores
    Q = torch.randn(4, 8, 32, 64)
    K = torch.randn(4, 8, 32, 64)
    our_attn = TensorOps.attention_scores(Q, K)
    lib_attn = (Q @ K.transpose(-2, -1)) / (64 ** 0.5)
    match = torch.allclose(our_attn, lib_attn, atol=1e-5)
    t_ours, _ = measure_time(lambda: TensorOps.attention_scores(Q, K))
    t_lib, _ = measure_time(lambda: (Q @ K.transpose(-2, -1)) / (64 ** 0.5))
    comparisons.append({
        'Operation': 'Attention scores',
        'Match': match,
        'Ours (ms)': t_ours * 1000,
        'PyTorch (ms)': t_lib * 1000,
    })

    comp_df = pd.DataFrame(comparisons)
    print('=== Library Comparison ===')
    print(comp_df.to_string(index=False))
    print()
    all_match = all(c['Match'] for c in comparisons)
    print(f'All implementations match PyTorch built-ins: {all_match}')


library_comparison()

---

## Part 4 — Evaluation & Analysis

Let's analyze the performance characteristics of the operations we've learned, identify common pitfalls, and build a comprehensive reference.

### 4.1 Reshape Operations: Performance Impact

Reshaping is supposed to be free (just a metadata change), but some patterns force copies. Let's measure when reshaping has an actual cost.

In [None]:
def benchmark_reshape_patterns() -> pd.DataFrame:
    """Benchmark various reshape patterns to identify hidden copies.

    Returns:
        DataFrame with timing and copy status.
    """
    torch.manual_seed(SEED)
    n = 10_000_000
    data = torch.randn(n)
    mat = torch.randn(1000, 1000)

    patterns: list[dict] = []

    # reshape (contiguous → view, no copy)
    t, _ = measure_time(lambda: data.reshape(1000, 10000))
    patterns.append({'Pattern': 'reshape (contiguous)', 'Time (µs)': t * 1e6,
                     'Copies Data': 'No', 'Note': 'Just changes metadata'})

    # view (contiguous, no copy)
    t, _ = measure_time(lambda: data.view(1000, 10000))
    patterns.append({'Pattern': 'view (contiguous)', 'Time (µs)': t * 1e6,
                     'Copies Data': 'No', 'Note': 'Identical to reshape'})

    # reshape after transpose (non-contiguous → copy)
    mat_t = mat.t()  # Non-contiguous
    t, _ = measure_time(lambda: mat_t.reshape(-1))
    patterns.append({'Pattern': 'reshape (non-contiguous)', 'Time (µs)': t * 1e6,
                     'Copies Data': 'Yes', 'Note': 'Forces contiguous copy'})

    # contiguous() cost
    t, _ = measure_time(lambda: mat_t.contiguous())
    patterns.append({'Pattern': '.contiguous()', 'Time (µs)': t * 1e6,
                     'Copies Data': 'Yes', 'Note': 'Explicit copy to C-order'})

    # flatten (contiguous → view)
    t, _ = measure_time(lambda: data.reshape(100, 100, 1000).flatten())
    patterns.append({'Pattern': 'flatten (contiguous)', 'Time (µs)': t * 1e6,
                     'Copies Data': 'No', 'Note': 'View when possible'})

    # unsqueeze/squeeze (always free)
    t, _ = measure_time(lambda: data.unsqueeze(0).unsqueeze(-1).squeeze(0))
    patterns.append({'Pattern': 'unsqueeze/squeeze', 'Time (µs)': t * 1e6,
                     'Copies Data': 'No', 'Note': 'Always metadata-only'})

    # permute (returns view but non-contiguous)
    img = torch.randn(8, 3, 64, 64)
    t, _ = measure_time(lambda: img.permute(0, 2, 3, 1))
    patterns.append({'Pattern': 'permute', 'Time (µs)': t * 1e6,
                     'Copies Data': 'No', 'Note': 'View, but non-contiguous'})

    # permute + contiguous (forced copy)
    t, _ = measure_time(lambda: img.permute(0, 2, 3, 1).contiguous())
    patterns.append({'Pattern': 'permute + contiguous', 'Time (µs)': t * 1e6,
                     'Copies Data': 'Yes', 'Note': 'Copy to make C-order'})

    return pd.DataFrame(patterns)


reshape_bench = benchmark_reshape_patterns()
print('=== Reshape Performance ===')
print(reshape_bench.to_string(index=False))
print()
print('Key insight: view/reshape/unsqueeze/squeeze are essentially free (~1µs).')
print('The only cost comes when a copy is forced (non-contiguous → contiguous).')
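
Timing is an indirect signal for copies. A more direct check compares storage addresses. A minimal sketch, where `shares_memory` is a hypothetical helper (not a PyTorch function) built on `Tensor.data_ptr()`:

```python
import torch

def shares_memory(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Hypothetical helper: True if both tensors start at the same address.

    This only detects views that begin at the same offset; a slice such as
    a[1:] is still a view but starts at a different address.
    """
    return a.data_ptr() == b.data_ptr()

mat = torch.randn(1000, 1000)
mat_t = mat.t()                  # view with swapped strides (non-contiguous)

view_result = mat.reshape(-1)    # contiguous input: reshape returns a view
copy_result = mat_t.reshape(-1)  # non-contiguous input: reshape must copy

print(f'transpose is a view:            {shares_memory(mat, mat_t)}')
print(f'reshape (contiguous) is a view: {shares_memory(mat, view_result)}')
print(f'reshape (transposed) is a view: {shares_memory(mat_t, copy_result)}')
```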

### 4.2 Einsum vs Explicit Operations: Speed Comparison

Einsum is convenient, but is it slower than explicit operations? Let's find out with a systematic benchmark.

In [None]:
def benchmark_einsum_vs_explicit() -> pd.DataFrame:
    """Compare einsum against explicit implementations for common operations.

    Returns:
        DataFrame with timing comparisons.
    """
    torch.manual_seed(SEED)

    records: list[dict] = []

    # Matrix multiply
    A = torch.randn(256, 512)
    B = torch.randn(512, 256)
    t_exp, _ = measure_time(lambda: A @ B)
    t_ein, _ = measure_time(lambda: torch.einsum('ij,jk->ik', A, B))
    records.append({'Operation': 'MatMul (256×512)×(512×256)',
                    'Explicit (ms)': t_exp * 1000,
                    'Einsum (ms)': t_ein * 1000,
                    'Ratio': t_ein / t_exp})

    # Batched matmul
    bA = torch.randn(16, 64, 128)
    bB = torch.randn(16, 128, 64)
    t_exp, _ = measure_time(lambda: torch.bmm(bA, bB))
    t_ein, _ = measure_time(lambda: torch.einsum('bij,bjk->bik', bA, bB))
    records.append({'Operation': 'Batched MatMul (16×64×128)',
                    'Explicit (ms)': t_exp * 1000,
                    'Einsum (ms)': t_ein * 1000,
                    'Ratio': t_ein / t_exp})

    # Dot product
    v1 = torch.randn(100_000)
    v2 = torch.randn(100_000)
    t_exp, _ = measure_time(lambda: torch.dot(v1, v2))
    t_ein, _ = measure_time(lambda: torch.einsum('i,i->', v1, v2))
    records.append({'Operation': 'Dot Product (100K)',
                    'Explicit (ms)': t_exp * 1000,
                    'Einsum (ms)': t_ein * 1000,
                    'Ratio': t_ein / t_exp})

    # Attention scores
    Q = torch.randn(4, 8, 32, 64)
    K = torch.randn(4, 8, 32, 64)
    t_exp, _ = measure_time(lambda: Q @ K.transpose(-2, -1))
    t_ein, _ = measure_time(lambda: torch.einsum('bhid,bhjd->bhij', Q, K))
    records.append({'Operation': 'Attention (4×8×32×64)',
                    'Explicit (ms)': t_exp * 1000,
                    'Einsum (ms)': t_ein * 1000,
                    'Ratio': t_ein / t_exp})

    # Outer product
    u = torch.randn(1000)
    w = torch.randn(1000)
    t_exp, _ = measure_time(lambda: u.unsqueeze(1) * w.unsqueeze(0))
    t_ein, _ = measure_time(lambda: torch.einsum('i,j->ij', u, w))
    records.append({'Operation': 'Outer Product (1000×1000)',
                    'Explicit (ms)': t_exp * 1000,
                    'Einsum (ms)': t_ein * 1000,
                    'Ratio': t_ein / t_exp})

    return pd.DataFrame(records)


einsum_bench = benchmark_einsum_vs_explicit()
print('=== Einsum vs Explicit Operations ===')
print(einsum_bench.to_string(index=False))
print()
avg_ratio = einsum_bench['Ratio'].mean()
print(f'Average ratio (einsum/explicit): {avg_ratio:.2f}×')
print('Einsum has slight overhead for parsing the format string, but for')
print('compute-heavy operations the difference is negligible.')
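
The benchmark above measures speed only. A small correctness check for the same five einsum patterns can sit alongside it. A minimal sketch, assuming smaller tensors than the benchmark so it runs instantly:

```python
import torch

torch.manual_seed(0)
A, B = torch.randn(8, 16), torch.randn(16, 4)
bA, bB = torch.randn(3, 8, 16), torch.randn(3, 16, 4)
v1, v2 = torch.randn(100), torch.randn(100)
Q, K = torch.randn(2, 4, 6, 16), torch.randn(2, 4, 5, 16)

# Each entry pairs the einsum result with its explicit equivalent
checks = {
    'matmul': (torch.einsum('ij,jk->ik', A, B), A @ B),
    'batched matmul': (torch.einsum('bij,bjk->bik', bA, bB), torch.bmm(bA, bB)),
    'dot': (torch.einsum('i,i->', v1, v2), torch.dot(v1, v2)),
    'attention': (torch.einsum('bhid,bhjd->bhij', Q, K), Q @ K.transpose(-2, -1)),
    'outer': (torch.einsum('i,j->ij', v1, v2), v1.unsqueeze(1) * v2.unsqueeze(0)),
}
results = {name: torch.allclose(ein, exp, atol=1e-4)
           for name, (ein, exp) in checks.items()}
for name, ok in results.items():
    print(f'{name:>15}: match={ok}')
```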

### 4.3 Visualization: Operation Performance Overview

Let's create a visual summary of all the performance data we've collected.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Reshape costs
reshape_data = reshape_bench.copy()
colors = ['#43A047' if c == 'No' else '#E53935' for c in reshape_data['Copies Data']]
bars = axes[0].barh(reshape_data['Pattern'], reshape_data['Time (µs)'], color=colors)
axes[0].set_xlabel('Time (µs)')
axes[0].set_title('Reshape Operation Costs')
axes[0].set_xscale('log')
# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='#43A047', label='No Copy (Free)'),
                   Patch(facecolor='#E53935', label='Copies Data')]
axes[0].legend(handles=legend_elements, loc='lower right')
axes[0].grid(True, axis='x', alpha=0.3)

# Right: Einsum overhead
x_pos = range(len(einsum_bench))
width = 0.35
axes[1].bar([p - width/2 for p in x_pos], einsum_bench['Explicit (ms)'],
            width, label='Explicit', color='#1E88E5')
axes[1].bar([p + width/2 for p in x_pos], einsum_bench['Einsum (ms)'],
            width, label='Einsum', color='#FF9800')
axes[1].set_xticks(list(x_pos))
short_labels = [op.split('(')[0].strip() for op in einsum_bench['Operation']]
axes[1].set_xticklabels(short_labels, rotation=45, ha='right')
axes[1].set_ylabel('Time (ms)')
axes[1].set_title('Einsum vs Explicit Operations')
axes[1].legend()
axes[1].grid(True, axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### 4.4 Common Pitfalls and Error Analysis

Let's document the most common mistakes when working with tensor operations and show how to diagnose and fix them.

In [None]:
def demonstrate_common_pitfalls() -> None:
    """Show common tensor manipulation mistakes and how to fix them."""
    print('=== Pitfall 1: Shape Mismatch in Broadcasting ===')
    a = torch.randn(3, 4)
    b = torch.randn(3)  # Intending to add to each row
    try:
        result = a + b  # Fails! (3, 4) + (3,) — 4 ≠ 3
    except RuntimeError as e:
        print(f'  Error: {str(e)[:60]}')
        print(f'  Fix: b.unsqueeze(1) makes (3,) → (3, 1), which broadcasts')
        result = a + b.unsqueeze(1)  # (3, 4) + (3, 1) ✓
        print(f'  Result shape: {result.shape}')
    print()

    print('=== Pitfall 2: Forgetting contiguous() Before view() ===')
    mat = torch.randn(4, 4)
    transposed = mat.t()
    print(f'  Transposed is contiguous: {transposed.is_contiguous()}')
    try:
        flat = transposed.view(-1)
    except RuntimeError:
        print(f'  view(-1) fails on non-contiguous tensor')
        flat = transposed.contiguous().view(-1)  # Fix
        print(f'  Fix: .contiguous().view(-1) → shape={flat.shape}')
    print()

    print('=== Pitfall 3: Unintended Dimension Collapse ===')
    batch = torch.randn(8, 3, 32, 32)
    # Wrong: mean without keepdim collapses the dimension
    wrong_mean = batch.mean(dim=1)        # shape (8, 32, 32) — lost channel dim
    right_mean = batch.mean(dim=1, keepdim=True)  # shape (8, 1, 32, 32)
    print(f'  mean(dim=1):             shape={wrong_mean.shape} (channel dim gone!)')
    print(f'  mean(dim=1, keepdim):    shape={right_mean.shape} (can still broadcast)')
    print()

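    print('=== Pitfall 4: Integer vs Float Division ===')
    int_tensor = torch.tensor([1, 2, 3, 4, 5])
    # In current PyTorch, `/` is true division and returns floats even for
    # integer inputs. Truncation happens with floor division (`//`) or when
    # the result is cast back to an integer dtype.
    wrong_norm = int_tensor // int_tensor.sum()  # sum=15 → every entry floors to 0
    right_norm = int_tensor / int_tensor.sum()   # true division → float32
    print(f'  Floor division (//): {wrong_norm}')
    print(f'  True division (/):   {right_norm}')
    print(f'  Sum check: floor={wrong_norm.sum().item():.4f}, '
          f'true={right_norm.sum().item():.4f}')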
    print()

    print('=== Pitfall 5: Mixing NumPy and PyTorch Random Seeds ===')
    np.random.seed(42)
    torch.manual_seed(42)
    np_vals = np.random.randn(3)
    torch_vals = torch.randn(3).numpy()
    print(f'  NumPy random:   {np_vals}')
    print(f'  PyTorch random: {torch_vals}')
    print(f'  Same values? {np.allclose(np_vals, torch_vals)}')
    print(f'  Different RNGs! Set both seeds independently.')
    # Restore seeds
    np.random.seed(SEED)
    torch.manual_seed(SEED)


demonstrate_common_pitfalls()
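
Pitfall 3 matters most when the reduced tensor is used again in arithmetic with the original. A minimal sketch, assuming per-channel centering of an image batch:

```python
import torch

torch.manual_seed(0)
batch = torch.randn(8, 3, 32, 32)

# keepdim=True leaves size-1 axes in place, so the result broadcasts directly
channel_mean = batch.mean(dim=(0, 2, 3), keepdim=True)   # (1, 3, 1, 1)
centered = batch - channel_mean

# Without keepdim the result has shape (3,), which would align with the last
# axis (width = 32) and fail to broadcast against (8, 3, 32, 32).
collapsed = batch.mean(dim=(0, 2, 3))                    # (3,)

print(channel_mean.shape, collapsed.shape)
print(centered.mean(dim=(0, 2, 3)))  # approximately zero per channel
```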

### 4.5 Comprehensive Operation Reference

Let's build a master reference table summarizing every operation we've covered, with the NumPy and PyTorch equivalents side by side.

In [None]:
reference = pd.DataFrame({
    'Operation': [
        'Reshape', 'View (no copy)', 'Flatten', 'Squeeze',
        'Unsqueeze', 'Transpose (2D)', 'Permute (nD)',
        'Concatenate', 'Stack', 'Fancy index', 'Boolean mask',
        'Where', 'Gather', 'Scatter', 'Einsum',
    ],
    'NumPy': [
        'arr.reshape()', 'N/A', 'arr.flatten() / ravel()', 'np.squeeze()',
        'np.expand_dims()', 'arr.T', 'arr.transpose()',
        'np.concatenate()', 'np.stack()', 'arr[[0,2]]', 'arr[mask]',
        'np.where()', 'np.take_along_axis()', 'np.put_along_axis()', 'np.einsum()',
    ],
    'PyTorch': [
        't.reshape()', 't.view()', 't.flatten()', 't.squeeze()',
        't.unsqueeze()', 't.t() / t.T', 't.permute()',
        'torch.cat()', 'torch.stack()', 't[[0,2]]', 't[mask]',
        'torch.where()', 'torch.gather()', 'torch.scatter()', 'torch.einsum()',
    ],
    'Returns View?': [
        'If contiguous', 'Always', 'If contiguous', 'Yes',
        'Yes', 'Yes', 'Yes',
        'No (new tensor)', 'No (new tensor)', 'No (copy)', 'No (copy)',
        'No (new tensor)', 'No (new tensor)', 'No (scatter_ is in-place)', 'No (new tensor)',
    ],
})
print('=== Tensor Operations Reference ===')
print(reference.to_string(index=False))
print()
print(f'Total operations covered: {len(reference)}')
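
NumPy's closest counterparts to `torch.gather` and `Tensor.scatter_` are `np.take_along_axis` and `np.put_along_axis`. A minimal sketch of the correspondence, assuming a small 3×4 array and one index per row:

```python
import numpy as np
import torch

scores = np.arange(12, dtype=np.float32).reshape(3, 4)
idx = np.array([[1], [3], [0]], dtype=np.int64)  # one column index per row

# Gather: pick scores[i, idx[i]] for every row
np_gathered = np.take_along_axis(scores, idx, axis=1)
pt_gathered = torch.gather(torch.from_numpy(scores), 1, torch.from_numpy(idx))

# Scatter: write 1.0 at (i, idx[i]); both calls modify their target in place
np_onehot = np.zeros((3, 4), dtype=np.float32)
np.put_along_axis(np_onehot, idx, 1.0, axis=1)
pt_onehot = torch.zeros(3, 4).scatter_(1, torch.from_numpy(idx), 1.0)

print(np_gathered.ravel())
print(np_onehot)
```

The index array is created as `int64` explicitly because `torch.gather` and `scatter_` require 64-bit indices.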

### 4.6 Indexing Performance: Fancy vs Boolean vs Gather

Different selection methods have different performance profiles. Let's measure them for a common scenario: selecting specific elements from a large tensor.

In [None]:
def benchmark_indexing_methods() -> pd.DataFrame:
    """Compare indexing methods for selecting elements.

    Returns:
        DataFrame with timing comparison.
    """
    torch.manual_seed(SEED)
    n_rows = 10_000
    n_cols = 1_000
    data = torch.randn(n_rows, n_cols)

    # Select one element per row (like gathering class probabilities)
    indices = torch.randint(0, n_cols, (n_rows,))

    # Method 1: Fancy indexing
    t_fancy, _ = measure_time(
        lambda: data[torch.arange(n_rows), indices])

    # Method 2: Gather
    t_gather, _ = measure_time(
        lambda: data.gather(1, indices.unsqueeze(1)).squeeze(1))

    # Method 3: Loop (anti-pattern)
    def loop_select() -> torch.Tensor:
        """Select elements with a loop (slow)."""
        result = torch.empty(n_rows)
        for i in range(n_rows):
            result[i] = data[i, indices[i]]
        return result
    t_loop, _ = measure_time(loop_select, num_warmup=1, num_timed=2)

    # Verify all methods give same result
    r_fancy = data[torch.arange(n_rows), indices]
    r_gather = data.gather(1, indices.unsqueeze(1)).squeeze(1)
    r_loop = loop_select()
    assert torch.allclose(r_fancy, r_gather)
    assert torch.allclose(r_fancy, r_loop)

    records = [
        {'Method': 'Loop (anti-pattern)', 'Time (ms)': t_loop * 1000,
         'Speedup': 1.0},
        {'Method': 'Fancy indexing', 'Time (ms)': t_fancy * 1000,
         'Speedup': t_loop / t_fancy},
        {'Method': 'torch.gather', 'Time (ms)': t_gather * 1000,
         'Speedup': t_loop / t_gather},
    ]
    return pd.DataFrame(records)


idx_bench = benchmark_indexing_methods()
print(f'Indexing Benchmark ({10_000} rows × {1_000} cols):')
print(idx_bench.to_string(index=False))
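
Selecting one element per row is the core of the classification loss. A minimal sketch, assuming random logits and targets, that computes the negative log-likelihood with `gather` and checks it against `F.cross_entropy`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 5)             # (batch, num_classes)
targets = torch.randint(0, 5, (6,))    # correct class index per row

log_probs = F.log_softmax(logits, dim=1)
# Pick the log-probability of the correct class in each row
picked = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
loss_manual = -picked.mean()

# Same selection with fancy indexing
picked_fancy = log_probs[torch.arange(6), targets]

loss_ref = F.cross_entropy(logits, targets)  # default reduction is the mean
print(f'manual={loss_manual.item():.6f}  reference={loss_ref.item():.6f}')
```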

---

## Part 5 — Summary & Lessons Learned

### Key Takeaways

1. **View, squeeze, and unsqueeze are free; reshape usually is.** They only change metadata (shape and strides), not the underlying data. `reshape` copies only when the input's strides cannot express the new shape, for example after a transpose. Use them liberally to make tensors compatible for broadcasting and batch operations.
2. **Permute creates non-contiguous views.** After permuting, call `.contiguous()` if you need to `view()` the result, or use `.reshape()`, which copies automatically when required.
3. **Einsum is a universal tool.** It can express any tensor contraction, from dot products to batched attention, in a single readable string. For compute-heavy operations, the overhead relative to explicit operations is negligible.
4. **Advanced indexing (fancy + boolean) always copies.** Unlike slicing, which returns views, fancy indexing and boolean masks create new tensors. Use `gather`/`scatter` for per-position selection along a dimension during training.
5. **In-place operations save memory but can break autograd.** Only use `add_()`, `mul_()`, etc. on tensors that don't require gradients. During training, prefer out-of-place operations.

### What's Next

→ **01-03 (Pandas for Tabular Data)** applies these manipulation skills to real-world tabular datasets with Pandas, the primary tool for data exploration and preprocessing.

### Going Further

- [PyTorch Tensor Views](https://pytorch.org/docs/stable/tensor_view.html): official documentation on which operations return views vs copies
- [Einsum Is All You Need](https://rockt.github.io/2018/04/30/einsum): visual guide to einsum notation with diagrams
- [NumPy Broadcasting Rules](https://numpy.org/doc/stable/user/basics.broadcasting.html): the definitive reference for broadcasting semantics