#**CHAPTER 1.MULTIMODALITY BY CONSTRUCTION**
---

##REFERENCE

https://chatgpt.com/share/699466a2-db08-8012-94aa-81e27ba6d44a

##0.CONTEXT

**Introduction (Chapter 1: Multimodality by Construction — Embeddings, Geometry, Alignment)**

This chapter is a guided encounter with a claim that sounds abstract until you build it yourself: multimodality is the disciplined construction of compatible coordinate systems across different data types. In practice, “images” and “text” (and audio, sensor traces, tables, and diagrams) are not merely different file formats. They are different measurement instruments that observe the same underlying world through different distortions, resolutions, and noise models. A multimodal model is not “one model that can do many things.” It is a mechanism for aligning those instruments so that one modality can locate meaning in the other.

The pedagogical objective is deliberately narrow and therefore powerful: we will not chase scale, benchmarks, or performance theater. Instead, we will build a minimal synthetic world where two modalities are guaranteed to share the same latent causes, and we will force a learned representation to recover that shared structure. The point is to make embedding geometry visible and accountable. You should be able to answer, with evidence, what your model has learned: what it aligns, what it confuses, what it collapses, and how those behaviors degrade when you stress the system.

The chapter is also aligned with the AI 2026 “frontier awareness” posture. Multimodality is frontier not because it is mysterious, but because it exposes a general problem: how to make representations consistent across observational channels while preserving what matters and discarding what does not. This is also why multimodality naturally belongs in a harmonic collection with long-context memory and surrogates. Long context is selection under constraint. Surrogates are substitution under constraint. Multimodality is alignment under constraint. The unifying theme is governed representation: we want mechanisms that are inspectable, reproducible, and reviewable, not merely impressive.

What follows is structured in four parts: theory, definitions, methodology, and deliverables. This structure is intentionally practical: you will leave with a mental model, a shared vocabulary, an experiment you can reproduce in Colab, and an audit bundle that supports professional review.

**Part 1. The Theory**

A useful way to think about multimodal learning is to treat each modality as a coordinate chart on the same underlying manifold of causes. The world has latent causes: “shape,” “orientation,” “frequency,” “phase,” “thickness,” “style,” “identity,” “intent,” and so on. A camera measures those causes through pixel intensities; a description measures those causes through tokens; a microphone measures them through waveforms. None of these measurements is the cause itself. Each modality is a structured projection. Multimodal learning is the attempt to learn an internal representation that is stable across those projections.

This immediately clarifies why multimodality is hard. If modalities are projections, then information is lost differently in each. Some causes are more visible in one modality than another. Some causes are confounded with noise. Some causes are not present at all. Therefore, alignment is not “make embeddings equal.” Alignment is the negotiation of invariants: which aspects should be preserved across modalities, and which aspects are modality-specific and can be ignored.

From this perspective, the canonical multimodal objective is not classification. It is mutual retrievability. If an image and a sentence refer to the same latent factors, then the representation of the image should “find” the representation of the sentence in a shared space, and vice versa. This is the conceptual basis for contrastive learning. Contrastive learning does not require labels for every downstream task. It only requires pairing information: which two observations refer to the same underlying cause. In the real world, those pairs might be image-caption pairs, video-audio pairs, or diagram-explanation pairs. In this chapter, we will use synthetic pairs so we can control exactly what “same cause” means.

Contrastive learning also exposes the central failure modes of multimodality in a clean way. If the objective is retrieval, then failure is not a vague “it seems worse.” Failure is measurable: retrieval accuracy collapses, the geometry degenerates, the embeddings become nearly identical, or one modality dominates the other. These failure modes are not academic. They appear in real training runs at scale, only harder to diagnose because the world is not synthetic and your ground truth is never complete.

Another core theoretical point is that multimodal alignment is not just a loss function; it is a system. You choose model capacity, normalization, temperature, batch composition, negative sampling, and augmentation. Each choice changes the geometry of the learned space. In other words, the representation is not discovered; it is engineered under constraints. This is why the chapter is “multimodality by construction.” We want you to experience, directly, how design choices shape geometry.

Finally, multimodality highlights a subtle but important principle: “more capability” can mean “less interpretability.” When you scale models, you often gain performance while losing the ability to cleanly attribute geometric structure to known causes. This chapter takes the opposite route: we intentionally keep the model small so that you can inspect what each component contributes. The aim is not to be competitive; the aim is to become fluent in the mechanism.

**Part 2. Definitions of Key Ideas (Latent Space, Embeddings, Alignment, Geometry)**

A definition in this chapter is not a slogan; it is an operational contract. Each term below is defined so that it can be tested or falsified within the notebook.

**Latent factors**  
Latent factors are unobserved variables that generate observations. In our synthetic world, they are explicit: discrete choices like shape class and orientation bin, and continuous-like choices discretized into bins such as frequency and phase. In real systems, latent factors might include identity, viewpoint, lighting, topic, or intent. The key property is causal: latent factors produce the data, but are not directly the data.

**Latent space**  
A latent space is a vector space in which points represent compressed descriptions of observations. It is “latent” because it is not directly observed; it is learned. The important nuance is that a latent space is not automatically meaningful. A latent space becomes meaningful when its geometry corresponds to stable relationships among causes. In this chapter, “meaningful” will be operationalized by retrieval and separability: paired items are near, and items sharing a factor cluster in identifiable regions.

**Embedding**  
An embedding is the mapping from an observation to a vector in latent space. An image encoder maps pixels to an embedding; a text encoder maps token-derived features to an embedding. Embeddings are not the latent factors; they are coordinates the model invents to solve its objective. Two embeddings being close means “the model treats them as related under the training objective,” not “they are objectively similar.”

**Shared embedding space**  
A shared embedding space is a single latent space used by multiple encoders. The key property is comparability: distances and angles are meaningful across modalities. If the image embedding and the text embedding live in the same space, then cross-modal similarity becomes a geometric query.

**Alignment**  
Alignment is the property that paired observations (generated by the same latent factors) map to nearby embeddings. It is not merely “they match.” It is structured: alignment should preserve relevant factor structure, not just memorization. We measure alignment by retrieval metrics and by the stability of geometry under stress.

**Contrastive objective (InfoNCE)**  
A contrastive objective trains embeddings so that positives (true pairs) have high similarity while negatives (mismatched pairs) have lower similarity. InfoNCE is the standard formulation where each item must identify its true partner among a set of candidates. The temperature parameter controls how “sharp” the discrimination is, which directly influences geometry.

**Temperature**  
Temperature rescales similarities before the softmax. Lower temperature forces the model to make harder, sharper distinctions; higher temperature smooths probabilities. Temperature is not a mere hyperparameter; it is a geometric dial. It changes how strongly the model penalizes near-misses and can induce either collapse or instability if mis-set.

**Collapse**  
Collapse is a degenerate solution where embeddings lose diversity. In the extreme, every input maps to nearly the same vector. Collapse can produce deceptively stable losses under some settings but destroys retrieval and separability. We diagnose collapse by pairwise cosine distributions, embedding variance floors, and covariance spectra (effective rank).

**Separability**  
Separability refers to the extent to which embeddings preserve distinctions along latent factors. If “shape” is a factor, then embeddings should cluster by shape, or at least show systematic variance aligned with shape. We measure this with an ANOVA-like ratio: between-class scatter divided by within-class scatter.

**Modality symmetry**  
Modality symmetry means neither modality dominates the shared space. In practice, one encoder can produce embeddings with different norms or variances, causing training dynamics where one side learns faster and the other becomes a passenger. We diagnose symmetry by comparing norm distributions, variance, and retrieval asymmetry (image-to-text versus text-to-image).

**Part 3. Methodology (What We Build and How We Learn It)**

The methodology is a complete experiment, not a toy snippet. It is designed to be run end-to-end and to produce artifacts that support review.

**1) Build a synthetic multimodal world with explicit causes**  
We specify a latent factor vector per sample. From those factors, we generate two observations:
an image-like matrix (small 16×16) and a text-like symbolic sequence. Both are derived from the same latent factors, but with different distortions. The image generator produces structured patterns (e.g., sinusoidal textures) controlled by frequency and phase, and modified by “shape” and “orientation” rules. The text generator produces a token grammar that encodes the same factors as discrete symbols. This is crucial: the pairing is guaranteed and the ground truth is known.

**2) Encode text without an LLM**  
The point is multimodality, not language modeling. We use a deterministic tokenization scheme and encode sequences into vectors through simple, transparent features (bag-of-tokens plus positional moments). This keeps the experiment interpretable. It also makes a deeper pedagogical point: multimodality does not require language models. It requires paired structure and a shared objective.

**3) Train two encoders into a shared space**  
We use two separate encoders (image encoder and text encoder), each a 2-layer MLP. Each maps its input to an embedding vector. The embeddings are L2-normalized so that cosine similarity becomes a stable geometric measure. The training objective is symmetric InfoNCE: images must retrieve their paired texts, and texts must retrieve their paired images.

**4) Implement manual backprop and validate it**  
To ensure rigor, the training loop does not rely on autodiff frameworks. Gradients are computed analytically and validated using a finite-difference gradient check. This is not a style preference; it is didactic discipline. If you cannot validate your gradient pipeline, you cannot trust your conclusions about geometry.

**5) Evaluate geometry, not just loss**  
After training, we evaluate:
retrieval performance (top-1, top-5, MRR), factor separability, collapse metrics, covariance spectrum, and modality symmetry. We also generate PCA projections using SVD (no fragile dependencies) to visualize clustering by factors. Visuals are saved as plots, and all metrics are exported as strict JSON.

**6) Stress the system to reveal failure modes**  
We perform structured stress tests that change the data-generating process. For example, we increase noise in the image modality while holding text constant, and we corrupt pairings by permuting a fraction of the text samples. We then measure degradation curves. The stress suite teaches the central lesson: an aligned representation is not a static achievement; it is a conditional property that can fail under distribution shift.

**7) Govern the experiment as a professional artifact**  
Every run produces an audit bundle: a manifest with configuration and environment fingerprinting, prompt logs with hashes, a risk log with taxonomy and controls, and deliverables containing plots and JSON summaries. This matters because multimodal systems are particularly vulnerable to narrative drift: it is easy to tell a story about “meaningful embeddings” without proving it. The audit bundle forces the notebook to behave like an accountable laboratory.

**Part 4. Deliverables (What You Produce and How Students Use It)**

This chapter is designed so that a student can run the notebook and obtain a complete set of reviewable outputs. Deliverables are not “nice extras.” They are the mechanism that turns a lecture into an empirical lab.

**Deliverable A: A reproducible synthetic multimodal dataset**  
You will have a fully specified generator for paired image and text observations. The generator exposes knobs (noise level, factor cardinalities, corruption rate) so students can run controlled experiments. The dataset is not a file to download; it is a reproducible world.

**Deliverable B: A trained multimodal alignment model**  
You will produce two trained encoders that map different modalities into a shared space. The checkpointing mechanism saves best-performing parameters on a validation criterion. Students can inspect the parameters, re-run training, and compare runs across seeds.

**Deliverable C: A geometry report with quantitative diagnostics**  
The notebook exports a strict JSON metrics summary containing retrieval metrics, separability ratios, collapse indicators, covariance spectra, and modality symmetry checks. This report is designed to be read by humans and also to support automated regression testing across changes.

**Deliverable D: Visual evidence of embedding structure**  
Plots include PCA projections colored by known latent factors and similarity heatmaps for paired subsets. These visuals are not “proof,” but they are essential for intuition. They show what “cluster,” “axis,” and “distance” actually mean in a trained representation.

**Deliverable E: Stress-test reports and degradation curves**  
The stress suite exports a structured stress report and plots showing performance degradation under noise asymmetry and pairing corruption. Students learn to interpret robustness, locate brittleness, and discuss what kinds of shifts are fatal versus tolerable.

**Deliverable F: A governed audit bundle**  
The run manifest, prompt log, risk log, and deliverables folder are zipped into a single archive for review. This supports the governance-first teaching objective: results are not accepted because they look good; they are accepted because they are reproducible, inspectable, and accompanied by explicit assumptions and open questions.

By the end of this chapter, a student should be able to explain multimodality without mysticism. They should be able to say: we built two measurement channels of the same latent world; we learned embeddings that make those channels commensurable; we inspected the geometry and measured separability; we checked for collapse and dominance; we stressed the system and recorded failure modes; and we produced an audit-ready artifact bundle that makes these claims reviewable. That is the core competency: not “multimodal hype,” but multimodal mechanism literacy under governance.


##1.LIBRARIES AND ENVIRONMENT

**Cell 1 — Environment, Determinism, and the Execution Contract (Why This Cell Exists and What Students Should Learn)**

Cell 1 is the “constitution” of the notebook. It does not teach multimodality directly; it teaches the conditions under which multimodality can be taught honestly. In a multimodal experiment, tiny changes in random seeds, array dtypes, batch ordering, or hidden default settings can change the geometry you end up interpreting. If students do not learn to lock down these degrees of freedom, they will learn a dangerous habit: attributing meaning to plots and metrics that are not reproducible.

The key pedagogical concept here is that representation learning is not just an algorithm; it is an experimental system. Every scientific system needs a contract that specifies what is fixed and what is allowed to vary. In this cell we define that contract: deterministic randomness (seed control), the configuration object (all hyperparameters in one place), and the filesystem structure where artifacts will be written. This is not housekeeping; it is governance in its simplest, most concrete form.

Students should also learn why we print a short runtime fingerprint (Python version, NumPy version, UTC timestamp). In professional settings, these details matter because your output is not only judged by its numerical value; it is judged by whether someone else can reproduce it and audit the steps that produced it. Even in Colab, “Run all” should mean “get the same run.” The timestamp uses timezone-aware UTC by design to avoid subtle time bugs and to support institutional logging standards.

Finally, Cell 1 frames the chapter’s deeper theme: multimodality is alignment under constraints. Constraints begin at the level of experimental control. If you cannot control your runtime environment, you cannot control your interpretation. Students should leave this cell with the mindset that strong engineering practice is not separate from theory; it is what allows theory to be tested.


In [49]:
# === Cell 1 (REPLACE) ===
# Title: Runtime Contract + Determinism + High-Rigor Configuration
# Explanation: PATCH — use LeakyReLU (relu_leak>0) so finite-difference gradient checks are well-defined.

import os
import sys
import json
import math
import time
import zipfile
import hashlib
import random
import datetime
from dataclasses import dataclass, asdict
from typing import Dict, Any, Tuple, List, Optional

import numpy as np
import matplotlib.pyplot as plt

def utc_now_iso() -> str:
    return datetime.datetime.now(datetime.timezone.utc).isoformat()

def set_determinism(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)

@dataclass(frozen=True)
class Config:
    # Core
    seed: int = 7
    # Data
    n: int = 2048
    image_side: int = 16
    vocab_size: int = 32
    text_seq_len: int = 9
    noise_image: float = 0.05
    # Model
    hidden: int = 256
    embed_dim: int = 32
    relu_leak: float = 0.01   # <-- PATCH: leaky slope for differentiability (gradient check stability)
    # Training
    batch: int = 128
    epochs: int = 60
    lr: float = 2e-3
    adam_b1: float = 0.9
    adam_b2: float = 0.999
    adam_eps: float = 1e-8
    weight_decay: float = 1e-4
    temp: float = 0.07
    grad_clip: float = 5.0
    # Splits
    train_frac: float = 0.70
    val_frac: float = 0.15
    # Artifacts
    out_dir: str = "deliverables"
    plots_dir: str = "deliverables/plots"
    ckpt_dir: str = "deliverables/checkpoints"
    gates_dir: str = "deliverables/gates"
    stress_dir: str = "deliverables/stress"
    # Diagnostics
    topk: Tuple[int, int] = (1, 5)

CFG = Config()
set_determinism(CFG.seed)

for p in [CFG.out_dir, CFG.plots_dir, CFG.ckpt_dir, CFG.gates_dir, CFG.stress_dir]:
    os.makedirs(p, exist_ok=True)

print("RUN CONTRACT")
print("  utc_now:", utc_now_iso())
print("  python:", sys.version.split()[0])
print("  numpy :", np.__version__)
print("  seed  :", CFG.seed)
print("  relu_leak:", CFG.relu_leak)


RUN CONTRACT
  utc_now: 2026-02-17T12:47:00.194567+00:00
  python: 3.12.12
  numpy : 2.0.2
  seed  : 7
  relu_leak: 0.01


##2.GOVERNANCE ARTIFACTS

###2.1.OVERVIEW

**Cell 2 — Governance Artifacts and Auditability (Why We Log, Hash, and Separate Facts from Assumptions)**

Cell 2 turns the notebook from a “demo” into a “lab.” The purpose is to ensure that every claim the notebook makes can be traced back to a run configuration and a set of recorded outputs. This cell introduces an Artifact Manager: a structured way to write JSON reports, track a run identifier, log prompts (redacted) with hashes, and produce a risk log. The pedagogical point is that the model’s outputs are not evidence unless they are situated in a reproducible context.

Students should learn the difference between a result and a deliverable. A result is a number on the screen. A deliverable is a bundle of evidence that can be reviewed later: metrics, plots, configuration, and explicit statements about what is known versus assumed. This cell enforces strict JSON schemas that separate facts provided (computed metrics), assumptions (design choices and simplifications), open items (what remains unresolved), and questions to verify (what a reviewer should check). This separation prevents a common failure in AI education: confusing plausibility with truth.

Hashing is introduced as a minimal integrity mechanism. The goal is not cryptographic security in the adversarial sense; the goal is traceability. If you rerun a notebook or modify an experiment, hashes help you confirm whether you are looking at the same prompt or configuration. This matters especially for multimodal models because failures can be subtle: you can get similar-looking loss curves while the geometry changes materially.

The risk log is equally important pedagogically. Multimodality has predictable failure modes: spurious correlations, modality dominance, representation collapse, leakage, and metric hacking. Writing those risks down early forces students to treat them as first-class elements of the experiment rather than as afterthoughts. Cell 2 teaches the professional habit: every experiment must export not only outputs but also accountability.


###2.2.CODE AND IMPLEMENTATION

In [56]:
# === Cell 2 ===
# Title: ArtifactManager (Audit-Ready Logs, Hashing, Strict JSON Schema)
# Explanation: Implements tamper-evident artifact writing (hashes), structured JSON outputs
# with explicit Fact/Assumption/Open-Items separation, and a run manifest.

class ArtifactManager:
    def __init__(self, cfg: Config):
        self.cfg = cfg
        self.run_id = self._sha256(f"{time.time()}|{cfg.seed}|{np.__version__}")[:12]
        self.manifest_path = "run_manifest.json"
        self.prompts_path = "prompts_log.jsonl"
        self.risk_path = "risk_log.json"
        self._prompt_entries: List[Dict[str, Any]] = []

    @staticmethod
    def _sha256(s: str) -> str:
        return hashlib.sha256(s.encode("utf-8")).hexdigest()

    def log_prompt(self, name: str, content: str) -> None:
        redacted = content.replace("\n", "\\n")[:4000]  # bounded + simple redaction
        entry = {
            "ts_utc": utc_now_iso(),
            "name": name,
            "redacted": redacted,
            "sha256": self._sha256(content),
        }
        self._prompt_entries.append(entry)

    def write_json_strict(self, path: str, *, facts_provided: Dict[str, Any], assumptions: Dict[str, Any],
                          open_items: List[Any], analysis: str, draft_output: Dict[str, Any],
                          verification_status: str = "Not verified",
                          questions_to_verify: Optional[List[str]] = None) -> None:
        if questions_to_verify is None:
            questions_to_verify = []
        obj = {
            "facts_provided": facts_provided,
            "assumptions": assumptions,
            "open_items": open_items,
            "analysis": analysis,
            "draft_output": draft_output,
            "verification_status": verification_status,
            "questions_to_verify": questions_to_verify,
            "meta": {
                "run_id": self.run_id,
                "timestamp_utc": utc_now_iso(),
                "path": path,
            }
        }
        with open(path, "w") as f:
            json.dump(obj, f, indent=2)

    def write_manifest(self) -> None:
        manifest = {
            "run_id": self.run_id,
            "timestamp_utc": utc_now_iso(),
            "config": asdict(self.cfg),
            "env": {
                "python": sys.version.split()[0],
                "numpy": np.__version__,
            }
        }
        with open(self.manifest_path, "w") as f:
            json.dump(manifest, f, indent=2)

    def write_risk_log(self) -> None:
        risk = {
            "taxonomy": [
                "spurious_correlation",
                "modality_dominance",
                "representation_collapse",
                "train_test_leakage",
                "metric_hacking",
            ],
            "controls": [
                "deterministic_seeds",
                "split_hygiene_checks",
                "finite_difference_gradient_check",
                "bidirectional_contrastive_objective",
                "collapse_spectrum_monitoring",
                "stress_sweep_noise_asymmetry",
                "artifact_hashing_and_manifest",
            ],
            "verification_status": "Not verified",
            "questions_to_verify": [
                "How do these synthetic failure modes map to real multimodal datasets?",
                "Which collapse indicators are most predictive under distribution shift?",
            ],
            "meta": {"run_id": self.run_id, "timestamp_utc": utc_now_iso()},
        }
        with open(self.risk_path, "w") as f:
            json.dump(risk, f, indent=2)

    def flush_prompts(self) -> None:
        with open(self.prompts_path, "w") as f:
            for e in self._prompt_entries:
                f.write(json.dumps(e) + "\n")

AM = ArtifactManager(CFG)
AM.log_prompt("chapter1_colab_intent", "Chapter 1: multimodality by construction (synthetic world, 2-layer MLP encoders, symmetric InfoNCE, manual backprop, governed artifacts).")
AM.write_manifest()
AM.write_risk_log()
print("ArtifactManager ready. run_id =", AM.run_id)


ArtifactManager ready. run_id = 3cb8138df0b7


##3.SYNTHETIC MULTIMODAL DATA GENERATION

###3.1.OVERVIEW

**Cell 3 — Synthetic Multimodal World (The Most Important Conceptual Cell: “Same Causes, Different Projections”)**

Cell 3 is where multimodality becomes concrete. The purpose is to build a world in which two different modalities are guaranteed to share the same latent causes. This is the pedagogical foundation: if students cannot see and control the causal factors, they will struggle to understand what “alignment” really means. Real-world multimodal datasets are messy; synthetic construction gives us clarity.

Students should focus on the idea that each modality is a projection of the same underlying factors. The image generator converts factors like shape, orientation, frequency, phase, and thickness into a small 16×16 matrix. The text generator converts the same factors into a token grammar. The crucial teaching move is that both modalities are not arbitrary; they are intentionally structured to carry the same information in different forms. This makes the pairing relationship unambiguous. In real systems, ambiguity in pairing is often the main source of representation confusion.

This cell also introduces the concept of a “grammar” for symbolic modality. The text is not generated by a language model; it is constructed as a deterministic sequence of tokens. That is pedagogically valuable because it separates the core multimodal idea (shared causes) from the extra complexity of natural language. Students learn that multimodality is not “language plus vision.” It is “multiple measurement channels plus alignment.”

Split hygiene is another major lesson. We explicitly build train/validation/test splits and assert there is no overlap. This may seem basic, but leakage in multimodal datasets is exceptionally common in practice (e.g., near-duplicate images, shared captions, repeated metadata). If the split is not clean, retrieval can look perfect while the model has simply memorized correspondences.

Finally, students should notice that this synthetic world is not “easy” by default. It has controllable noise and multiple factor interactions. That allows meaningful stress testing later: we can break one modality or corrupt pairing and observe how geometry fails. Cell 3 teaches that the right experimental design is one where failure modes are discoverable and measurable.


###3.2.CODE AND IMPLEMENTATION

In [57]:
# === Cell 3 ===
# Title: SyntheticWorld (Controllable Latent Factors → Image + Token Sequences)
# Explanation: Builds a paired synthetic dataset where both modalities share the same latent factors.
# Includes strict split hygiene and metadata tracking for probes (separability, factor decoding).

@dataclass(frozen=True)
class Factors:
    shape: np.ndarray       # {0,1,2}
    orient: np.ndarray      # {0,1,2,3}
    freq_bin: np.ndarray    # {0,1,2}
    phase_bin: np.ndarray   # {0,1,2,3}
    thick_bin: np.ndarray   # {0,1,2}

class SyntheticWorld:
    def __init__(self, cfg: Config):
        self.cfg = cfg
        self.side = cfg.image_side
        self.D_img = cfg.image_side * cfg.image_side
        self.D_tok = cfg.vocab_size

        # coordinate grid
        xs = np.linspace(-1.0, 1.0, self.side)
        self.X, self.Y = np.meshgrid(xs, xs)

        # vocab: reserve blocks for each factor, plus padding/specials
        self.pad_id = 0
        self.sep_id = 1
        self.base_shape = 2            # 3 tokens
        self.base_orient = 2 + 3       # 4 tokens
        self.base_freq = 2 + 3 + 4     # 3 tokens
        self.base_phase = 2 + 3 + 4 + 3# 4 tokens
        self.base_thick = 2 + 3 + 4 + 3 + 4  # 3 tokens
        self._assert_vocab()

    def _assert_vocab(self) -> None:
        need = self.base_thick + 3
        if need > self.cfg.vocab_size:
            raise ValueError(f"vocab_size too small: need >= {need}, got {self.cfg.vocab_size}")

    def sample_factors(self, n: int) -> Factors:
        shape = np.random.randint(0, 3, size=n)
        orient = np.random.randint(0, 4, size=n)
        freq_bin = np.random.randint(0, 3, size=n)
        phase_bin = np.random.randint(0, 4, size=n)
        thick_bin = np.random.randint(0, 3, size=n)
        return Factors(shape=shape, orient=orient, freq_bin=freq_bin, phase_bin=phase_bin, thick_bin=thick_bin)

    def _render_image(self, s: int, o: int, fb: int, pb: int, tb: int) -> np.ndarray:
        # frequency & phase are discretized to keep ground truth interpretable
        freq = [1.5, 2.2, 2.9][fb]
        phase = [0.0, 0.5*np.pi, np.pi, 1.5*np.pi][pb]
        thick = [0.08, 0.14, 0.22][tb]

        # base pattern families (shape): sine planes, radial rings, diagonal bars
        if s == 0:
            base = np.sin(freq*(self.X + self.Y) + phase)
        elif s == 1:
            R = np.sqrt(self.X**2 + self.Y**2) + 1e-8
            base = np.sin(freq*(2.5*R) + phase)
        else:
            base = np.sin(freq*(self.X - self.Y) + phase)

        # orientation as a simple rotation-like mixing
        if o == 0:
            img = base
        elif o == 1:
            img = np.sin(freq*(self.X) + phase)
        elif o == 2:
            img = np.sin(freq*(self.Y) + phase)
        else:
            img = np.sin(freq*(self.X + 0.35*self.Y) + phase)

        # thickness as sharpening/nonlinearity
        img = np.tanh(img / (thick + 1e-6))

        # add controlled noise
        img = img + self.cfg.noise_image * np.random.randn(*img.shape)
        return img.astype(np.float32).reshape(-1)

    def factors_to_tokens(self, f: Factors) -> np.ndarray:
        # sequence: [shape][sep][orient][sep][freq][sep][phase][sep][thick][pad...]
        n = f.shape.shape[0]
        seq = np.full((n, self.cfg.text_seq_len), self.pad_id, dtype=np.int64)
        # minimal length needed = 9 tokens (shape,sep,orient,sep,freq,sep,phase,sep,thick)
        # Optional guard: auto-upgrade seq len rather than erroring
        if self.cfg.text_seq_len < 9:
            raise ValueError(f"text_seq_len must be >= 9 for this grammar (got {self.cfg.text_seq_len}). "
                            f"Fix: set Config.text_seq_len=9 or higher.")

        seq[:, 0] = self.base_shape + f.shape
        seq[:, 1] = self.sep_id
        seq[:, 2] = self.base_orient + f.orient
        seq[:, 3] = self.sep_id
        seq[:, 4] = self.base_freq + f.freq_bin
        seq[:, 5] = self.sep_id
        seq[:, 6] = self.base_phase + f.phase_bin
        seq[:, 7] = self.sep_id
        seq[:, 8] = self.base_thick + f.thick_bin
        return seq

    def tokens_to_bow_pos(self, seq: np.ndarray) -> np.ndarray:
        # bag-of-tokens + simple positional features (token_id * position one-hot-ish)
        n, L = seq.shape
        V = self.cfg.vocab_size
        bow = np.zeros((n, V), dtype=np.float32)
        for pos in range(L):
            ids = seq[:, pos]
            bow[np.arange(n), ids] += 1.0
        # positional moments: mean token id, std token id, and sep count (as 3 extra dims)
        mean_id = seq.mean(axis=1, keepdims=True).astype(np.float32)
        std_id = seq.std(axis=1, keepdims=True).astype(np.float32)
        sep_count = (seq == self.sep_id).sum(axis=1, keepdims=True).astype(np.float32)
        feats = np.concatenate([bow, mean_id, std_id, sep_count], axis=1)
        return feats  # shape (n, V+3)

    def build_dataset(self, n: int) -> Tuple[np.ndarray, np.ndarray, Dict[str, np.ndarray]]:
        f = self.sample_factors(n)
        imgs = np.stack([self._render_image(int(f.shape[i]), int(f.orient[i]), int(f.freq_bin[i]), int(f.phase_bin[i]), int(f.thick_bin[i]))
                         for i in range(n)], axis=0)
        seq = self.factors_to_tokens(f)
        txt = self.tokens_to_bow_pos(seq)
        meta = {
            "shape": f.shape, "orient": f.orient, "freq_bin": f.freq_bin, "phase_bin": f.phase_bin, "thick_bin": f.thick_bin
        }
        return imgs, txt, meta

def split_indices(n: int, train_frac: float, val_frac: float, seed: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    train = idx[:n_train]
    val = idx[n_train:n_train+n_val]
    test = idx[n_train+n_val:]
    # split hygiene check
    assert len(set(train).intersection(set(val))) == 0
    assert len(set(train).intersection(set(test))) == 0
    assert len(set(val).intersection(set(test))) == 0
    return train, val, test

WORLD = SyntheticWorld(CFG)
X_img, X_txt, META = WORLD.build_dataset(CFG.n)

train_idx, val_idx, test_idx = split_indices(CFG.n, CFG.train_frac, CFG.val_frac, CFG.seed)

print("DATASET")
print("  X_img:", X_img.shape, "X_txt:", X_txt.shape)
print("  splits:", len(train_idx), len(val_idx), len(test_idx))


DATASET
  X_img: (2048, 256) X_txt: (2048, 35)
  splits: 1433 307 308


##4.ENCODER ARCHITECTURE

###4.1.OVERVIEW

**Cell 4 — Two-Layer MLP Encoders (Why Small Models Teach Better Than Large Ones Here)**

Cell 4 defines the encoders: one for images and one for text. The key pedagogical aim is to show that multimodal alignment does not require massive architectures. The mechanism is learnable with a small 2-layer MLP when the data is structured and the objective is correct. This is important because it prevents students from concluding that multimodality is “something only big labs can do.” Instead, they learn that multimodality is a principle: align projections of shared causes into a commensurable space.

The 2-layer MLP is also a didactic choice: it is complex enough to be non-trivial (nonlinear feature extraction) but simple enough to be inspectable and to implement with manual backprop. By caching intermediate activations (pre-activation, activation, unnormalized embedding, normalization factor), the model becomes a transparent computational graph. Students can point to exactly where each transformation happens and how each contributes to the final embedding geometry.

Normalization is crucial. We L2-normalize embeddings so that cosine similarity is meaningful and stable. Students should learn that without normalization, the model can “cheat” by inflating norms rather than learning directional structure. Normalization enforces a geometric discipline: similarity is about angles, not magnitudes. This stabilizes training and makes retrieval behavior more interpretable.

The activation function is another lesson: ReLU is common, but it introduces nondifferentiability at zero, which complicates finite-difference gradient checks. Using a leaky variant is not a trick; it is an engineering choice that supports rigorous validation. This is a subtle but important lesson: theoretical elegance (pure ReLU) sometimes conflicts with experimental verifiability, and in professional research you often choose the version that supports reliable diagnostics.

Finally, this cell introduces the idea that each modality has its own encoder parameters. There is no forced weight sharing. Alignment emerges from the objective, not from architectural constraints. That is the right mental model: the shared space is not imposed; it is learned because it is useful for the task.


###4.2.CODE AND IMPLEMENTATION

In [63]:
# === Cell 4 (REPLACE) ===
# Title: 2-Layer MLP Encoder with LeakyReLU + Exact L2-Norm Backprop
# Explanation: PATCH — LeakyReLU avoids ReLU kink issues so finite-difference checks match analytic gradients.

def act(x: np.ndarray, leak: float) -> np.ndarray:
    return np.where(x > 0.0, x, leak * x)

def act_grad(x: np.ndarray, leak: float) -> np.ndarray:
    # IMPORTANT: preserve dtype so float64 gradient checks are numerically stable
    return np.where(x > 0.0, 1.0, leak).astype(x.dtype)

def l2_normalize(z: np.ndarray, eps: float = 1e-8) -> Tuple[np.ndarray, np.ndarray]:
    nrm = np.linalg.norm(z, axis=1, keepdims=True)
    nrm = np.maximum(nrm, eps)
    return z / nrm, nrm

def l2_normalize_backward(dy: np.ndarray, z: np.ndarray, nrm: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    inv = 1.0 / np.maximum(nrm, eps)                # (B,1)
    dot = np.sum(dy * z, axis=1, keepdims=True)     # (B,1)
    dz = dy * inv - z * (dot * (inv**3))
    return dz

class MLP2:
    def __init__(self, in_dim: int, hidden: int, out_dim: int, seed: int, leak: float):
        self.leak = float(leak)
        rng = np.random.RandomState(seed)
        self.W1 = (rng.randn(in_dim, hidden).astype(np.float32) * math.sqrt(2.0 / in_dim))
        self.b1 = np.zeros((1, hidden), dtype=np.float32)
        self.W2 = (rng.randn(hidden, out_dim).astype(np.float32) * math.sqrt(2.0 / hidden))
        self.b2 = np.zeros((1, out_dim), dtype=np.float32)

        self.m = {k: np.zeros_like(v) for k, v in self.params().items()}
        self.v = {k: np.zeros_like(v) for k, v in self.params().items()}
        self.t = 0

    def params(self) -> Dict[str, np.ndarray]:
        return {"W1": self.W1, "b1": self.b1, "W2": self.W2, "b2": self.b2}

    def forward(self, x: np.ndarray) -> Dict[str, np.ndarray]:
        h_pre = x @ self.W1 + self.b1
        h = act(h_pre, self.leak)
        z = h @ self.W2 + self.b2
        y, nrm = l2_normalize(z)
        return {"x": x, "h_pre": h_pre, "h": h, "z": z, "y": y, "nrm": nrm}

    def backward(self, cache: Dict[str, np.ndarray], dy: np.ndarray) -> Dict[str, np.ndarray]:
        x, h_pre, h, z, nrm = cache["x"], cache["h_pre"], cache["h"], cache["z"], cache["nrm"]
        dz = l2_normalize_backward(dy, z, nrm)

        dW2 = h.T @ dz
        db2 = dz.sum(axis=0, keepdims=True)

        dh = dz @ self.W2.T
        dh_pre = dh * act_grad(h_pre, self.leak)

        dW1 = x.T @ dh_pre
        db1 = dh_pre.sum(axis=0, keepdims=True)

        return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

    def adam_step(self, grads: Dict[str, np.ndarray], lr: float, b1: float, b2: float, eps: float,
                  weight_decay: float, grad_clip: float) -> Dict[str, float]:
        self.t += 1
        stats: Dict[str, float] = {}
        for k, p in self.params().items():
            g = grads[k].astype(np.float32)

            gn = float(np.linalg.norm(g))
            if gn > grad_clip:
                g = g * (grad_clip / (gn + 1e-12))
            stats[f"grad_norm_{k}"] = float(gn)

            if k.startswith("W") and weight_decay > 0.0:
                p *= (1.0 - lr * weight_decay)

            self.m[k] = b1 * self.m[k] + (1.0 - b1) * g
            self.v[k] = b2 * self.v[k] + (1.0 - b2) * (g * g)

            mhat = self.m[k] / (1.0 - b1**self.t)
            vhat = self.v[k] / (1.0 - b2**self.t)

            p -= lr * mhat / (np.sqrt(vhat) + eps)

        return stats

IMG = MLP2(in_dim=X_img.shape[1], hidden=CFG.hidden, out_dim=CFG.embed_dim, seed=CFG.seed + 11, leak=CFG.relu_leak)
TXT = MLP2(in_dim=X_txt.shape[1], hidden=CFG.hidden, out_dim=CFG.embed_dim, seed=CFG.seed + 23, leak=CFG.relu_leak)

print("MLP2 (LeakyReLU) encoders ready.")


MLP2 (LeakyReLU) encoders ready.


##5.ANALYTIC GRADIENT STRUCTURE

###5.1.OVERVIEW

**Cell 5 — Symmetric InfoNCE and Gradient Signals (Where “Alignment” Becomes an Optimization Problem)**

Cell 5 defines the contrastive objective that drives alignment. Students often think alignment is a vague desire—“make images and text match.” Here it becomes precise: for a batch of paired samples, each image embedding must identify its matching text embedding among all texts in the batch, and each text embedding must identify its matching image embedding among all images. This is why the objective is symmetric. If you train only one direction, you often get a lopsided space that performs well in one retrieval direction but poorly in the other.

The crucial mathematical idea is that the loss depends on a similarity matrix. Every entry in this matrix is a geometric relationship between an image embedding and a text embedding. The diagonal entries are positives (true pairs), and the off-diagonal entries are negatives (mismatched pairs). The softmax transforms these similarities into a distribution over candidates. The loss penalizes the model when the correct match does not dominate the distribution. Students should see that “learning meaning” here is “learning to shape the similarity matrix.”

Temperature is the main control knob. Lower temperature makes the softmax sharper, pushing the model to separate positives from negatives more aggressively. Higher temperature makes the distribution smoother, often improving stability but weakening discrimination. Students should learn to interpret temperature as a geometric scaling: it changes the effective margin in the embedding space.

Another key lesson is numerical stability. We use stable softmax/log-sum-exp patterns to avoid overflow and underflow. This is not merely about avoiding errors. Stability affects learning dynamics. If your probabilities saturate to 0 or 1 due to numeric issues, gradient signals vanish or explode. In multimodal training, where the similarity matrix can be large and dynamic, stability is a first-class requirement.

Finally, the gradients produced by InfoNCE teach students what “contrastive learning pressure” looks like. Positives are pulled together (increase diagonal similarity), negatives are pushed apart (decrease off-diagonal similarity). But this pushing is not uniform. Hard negatives—those that are already similar—receive stronger gradient pressure. That is why batch composition matters: the set of negatives you present defines what the model learns to discriminate. This cell is where students learn that alignment is a choice about what distinctions matter.


###5.2.CODE AND IMPLEMENTATION

In [59]:
# === Cell 5 ===
# Title: Symmetric InfoNCE (Stable LogSumExp) + Exact Gradients w.r.t. Embeddings
# Explanation: Implements bidirectional InfoNCE with numerically stable softmax,
# and derives exact gradients dL/dY_img and dL/dY_txt (then backprop into MLPs).

def logsumexp(a: np.ndarray, axis: int) -> np.ndarray:
    m = np.max(a, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True) + 1e-12)

def softmax(a: np.ndarray, axis: int) -> np.ndarray:
    a = a - np.max(a, axis=axis, keepdims=True)
    e = np.exp(a)
    return e / (np.sum(e, axis=axis, keepdims=True) + 1e-12)

def contrastive_symmetric(Yi: np.ndarray, Yt: np.ndarray, temp: float) -> Tuple[float, np.ndarray, np.ndarray, Dict[str, Any]]:
    """
    Yi, Yt are L2-normalized embeddings (B,D).
    sim = Yi @ Yt^T.
    Loss = 0.5*(CE_rows(sim/temp) + CE_rows(sim.T/temp))
    Return loss, dYi, dYt, and metrics.
    """
    B = Yi.shape[0]
    sim = Yi @ Yt.T                           # (B,B)
    logits = sim / temp

    P_it = softmax(logits, axis=1)            # rows: image->text
    P_ti = softmax(logits.T, axis=1)          # rows: text->image (equiv cols of P_it)

    # cross entropy with correct match on diagonal
    loss_it = -np.mean(np.log(np.diag(P_it) + 1e-12))
    loss_ti = -np.mean(np.log(np.diag(P_ti) + 1e-12))
    loss = 0.5 * (loss_it + loss_ti)

    # Gradients:
    # For CE over rows: dL/dlogits = (P - I)/B
    I = np.eye(B, dtype=np.float32)
    dlogits_it = (P_it - I) / B               # (B,B)
    dlogits_ti = (P_ti - I) / B               # (B,B) over logits.T

    # Combine on logits (not logits.T) properly:
    # d/dlogits from second term is transpose of dlogits_ti
    dlogits = 0.5 * (dlogits_it + dlogits_ti.T)  # (B,B)

    dsim = dlogits / temp                     # (B,B)

    # sim = Yi @ Yt^T
    dYi = dsim @ Yt                           # (B,D)
    dYt = dsim.T @ Yi                         # (B,D)

    metrics = {
        "loss_it": float(loss_it),
        "loss_ti": float(loss_ti),
        "mean_pos_sim": float(np.mean(np.diag(sim))),
        "mean_sim": float(np.mean(sim)),
    }
    return float(loss), dYi.astype(np.float32), dYt.astype(np.float32), metrics

# quick shape sanity
b = 8
ci = IMG.forward(X_img[:b])
ct = TXT.forward(X_txt[:b])
L, dYi, dYt, met = contrastive_symmetric(ci["y"], ct["y"], CFG.temp)
assert dYi.shape == ci["y"].shape and dYt.shape == ct["y"].shape
print("Contrastive core OK. loss =", L, "metrics =", met)


Contrastive core OK. loss = 2.6602158546447754 metrics = {'loss_it': 2.0775787830352783, 'loss_ti': 3.2428526878356934, 'mean_pos_sim': -0.07171455025672913, 'mean_sim': -0.07997961342334747}


##6.GRADIENT CHECK

###6.1.OVERVIEW

**Cell 6 — Gradient Checking (Why We Refuse to Trust Ourselves Without Evidence)**

Cell 6 is the governance heart of the notebook. It answers a simple but profound question: how do we know our training loop is correct? In most deep learning workflows, you rely on autodiff libraries and assume the gradients are right. Here we are implementing backprop manually, so we must validate it. The pedagogical lesson is that “working code” is not the same as “correct optimization.” A model can produce decreasing loss even with incorrect gradients, especially if the problem is forgiving.

Finite-difference gradient checking provides a local correctness test. We perturb a parameter slightly and measure how the loss changes. That numeric slope should match the analytic gradient computed by our backprop pipeline. If it does not, we treat the run as invalid. This is not optional. It is what makes the notebook a laboratory rather than a performance show.

Students should also learn why we do the check in float64. In float32, very small perturbations can be swallowed by numerical precision, making the numeric gradient appear as zero. That failure is not a sign that gradients are zero; it is a sign that your measurement instrument is too coarse. Switching to float64 for the check is an engineering choice that increases measurement resolution. This reinforces the broader theme: embeddings are measurements, and measurements require appropriate precision.

Gradient checking is not meant to be exhaustive. We do not check every parameter; we check random probes. This teaches a statistical mindset: we sample checks to catch gross errors, especially sign errors, missing terms, or dtype issues. In professional systems, you combine this with unit-test-like shape assertions and NaN checks to build confidence in the computational pipeline.

Finally, Cell 6 teaches a cultural habit: hard failure on correctness gates. If the gradient check fails, we stop. We do not rationalize. We do not accept the run “because it mostly works.” This is precisely the posture students need for frontier topics. When systems become complex, the temptation to accept results based on plausibility increases. This cell trains the opposite muscle: correctness before interpretation.


###6.2.CODE AND IMPLEMENTATION

In [64]:
# === Cell 6 (REPLACE ENTIRE CELL) ===
# Title: Finite-Difference Gradient Check (Float64 Clone, Central Difference)
# Explanation: Runs end-to-end gradient checks in float64 to avoid float32 quantization.
# Clones models into float64, computes analytic grads, and compares with central differences.

def clone_model_to_float64(src: MLP2) -> MLP2:
    dst = MLP2(
        in_dim=src.W1.shape[0],
        hidden=src.W1.shape[1],
        out_dim=src.W2.shape[1],
        seed=123,          # overwritten immediately
        leak=src.leak
    )
    # overwrite parameters (float64)
    dst.W1 = src.W1.astype(np.float64).copy()
    dst.b1 = src.b1.astype(np.float64).copy()
    dst.W2 = src.W2.astype(np.float64).copy()
    dst.b2 = src.b2.astype(np.float64).copy()

    # reset optimizer state in float64 (not used in check, but keeps structure consistent)
    dst.m = {k: np.zeros_like(v) for k, v in dst.params().items()}
    dst.v = {k: np.zeros_like(v) for k, v in dst.params().items()}
    dst.t = 0
    return dst

def loss_only(model_a: MLP2, model_b: MLP2, Xa: np.ndarray, Xb: np.ndarray) -> float:
    ca = model_a.forward(Xa)
    cb = model_b.forward(Xb)
    loss, _, _, _ = contrastive_symmetric(ca["y"], cb["y"], CFG.temp)
    return float(loss)

def analytic_grads(model_a: MLP2, model_b: MLP2, Xa: np.ndarray, Xb: np.ndarray) -> Dict[str, np.ndarray]:
    ca = model_a.forward(Xa)
    cb = model_b.forward(Xb)
    loss, dYa, dYb, _ = contrastive_symmetric(ca["y"], cb["y"], CFG.temp)
    ga = model_a.backward(ca, dYa)
    # (We only check grads for model_a here; symmetric check could also validate model_b)
    return ga

def finite_diff_param(model_a: MLP2, model_b: MLP2, Xa: np.ndarray, Xb: np.ndarray,
                      param_name: str, num_checks: int, eps: float, tol: float) -> Dict[str, Any]:
    ga = analytic_grads(model_a, model_b, Xa, Xb)
    P = model_a.params()[param_name]
    G = ga[param_name]

    rng = np.random.RandomState(CFG.seed + 2026)
    worst = {"rel_err": -1.0, "idx": None, "num": None, "ana": None}

    if P.ndim == 2:
        coords = [(rng.randint(0, P.shape[0]), rng.randint(0, P.shape[1])) for _ in range(num_checks)]
    else:
        coords = [(0, rng.randint(0, P.shape[1])) for _ in range(num_checks)]

    base_loss = loss_only(model_a, model_b, Xa, Xb)

    for (i, j) in coords:
        orig = float(P[i, j])

        P[i, j] = orig + eps
        lp = loss_only(model_a, model_b, Xa, Xb)

        P[i, j] = orig - eps
        lm = loss_only(model_a, model_b, Xa, Xb)

        P[i, j] = orig  # restore

        num = float((lp - lm) / (2.0 * eps))
        ana = float(G[i, j])

        denom = max(1e-12, abs(num) + abs(ana))
        rel = float(abs(num - ana) / denom)

        if rel > worst["rel_err"]:
            worst = {"rel_err": rel, "idx": (int(i), int(j)), "num": num, "ana": ana}

    passed = worst["rel_err"] <= tol
    return {
        "param": param_name,
        "base_loss": float(base_loss),
        "eps": float(eps),
        "tol": float(tol),
        "worst": worst,
        "passed": passed
    }

# ----- run float64 gradient check on clones -----
IMG64 = clone_model_to_float64(IMG)
TXT64 = clone_model_to_float64(TXT)

B = 16
Xa = X_img[train_idx[:B]].astype(np.float64)
Xb = X_txt[train_idx[:B]].astype(np.float64)

# eps: float64-friendly; tol: realistic for a manual system with normalization and softmax
eps = 1e-6
tol = 2e-4
num_checks = 18

report = []
for pname in ["W1", "W2", "b1", "b2"]:
    rep = finite_diff_param(IMG64, TXT64, Xa, Xb, pname, num_checks=num_checks, eps=eps, tol=tol)
    report.append(rep)

AM.write_json_strict(
    os.path.join(CFG.gates_dir, "gate_gradient_check.json"),
    facts_provided={"gradient_check_report": report},
    assumptions={
        "dtype": "float64 for finite-difference stability",
        "central_difference": True,
        "eps": eps,
        "tol_rel": tol,
        "activation": f"LeakyReLU(leak={CFG.relu_leak})",
    },
    open_items=[],
    analysis="Gradient check executed on float64 clones to eliminate float32 quantization (loss_p == loss_m). Central differences validate manual backprop through activation + L2 normalization + symmetric InfoNCE.",
    draft_output={"passed": all(r["passed"] for r in report)},
    verification_status="Not verified",
    questions_to_verify=["Does the check pass across multiple random seeds and larger batch sizes?"]
)

assert all(r["passed"] for r in report), f"Gradient check failed: {report}"
print("Gradient checks PASSED (float64).")
for r in report:
    print(r["param"], "worst_rel_err=", r["worst"]["rel_err"], "idx=", r["worst"]["idx"])


Gradient checks PASSED (float64).
W1 worst_rel_err= 6.810010923270595e-08 idx= (205, 180)
W2 worst_rel_err= 1.66649471053262e-07 idx= (153, 22)
b1 worst_rel_err= 1.1187579371748174e-06 idx= (0, 223)
b2 worst_rel_err= 1.955514531908551e-07 idx= (0, 25)


##7.TRAINING LOOP

###7.1.OVERVIEW

**Cell 7 — Training Loop as a Controlled Experiment (Metrics, Checkpoints, and Convergence Discipline)**

Cell 7 is where the model actually learns, but the pedagogical emphasis is not “watch the loss go down.” The emphasis is “treat training as a controlled experiment with measured outcomes.” We implement a training loop with mini-batching, an optimizer (AdamW), gradient clipping, and checkpointing based on validation metrics. Each of these elements teaches a professional lesson about how real multimodal systems are trained and managed.

AdamW is used because it is a practical optimizer that separates weight decay from gradient updates. Students should learn that weight decay is not just regularization in an abstract sense; it shapes representation geometry by discouraging parameter blow-up and improving generalization. Gradient clipping is included because contrastive objectives can generate unstable gradients, especially early in training when similarity matrices are noisy. Clipping is a risk control: it prevents the system from taking catastrophic steps that destroy learning.

Checkpointing based on validation performance teaches students to separate training success from generalization. In contrastive learning, it is possible to improve training loss while overfitting to batch-specific structures or exploiting spurious correlations. Validation retrieval metrics provide a more meaningful criterion: does the shared space work on held-out pairs? Selecting the best checkpoint by validation recall@1 instills the habit of defining acceptance on what matters operationally.

This cell also logs multiple metrics per epoch, not just loss. Students should learn that loss is an internal optimization quantity, while retrieval metrics are task-relevant behaviors. In a multimodal model used for search or matching, retrieval accuracy is closer to the professional objective than loss. The cell reinforces the idea that the model is a component in a system: you choose the metric that corresponds to the system’s function.

Finally, students should learn that training is not an endpoint. It is a phase in a pipeline that includes evaluation and stress testing. The output of training is not merely weights; it is a set of artifacts that record the training history, the chosen checkpoint, and the conditions under which those results were obtained. That makes the later interpretation of embedding geometry credible.


###7.2.CODE AND IMPLEMENTATION

In [65]:
# === Cell 7 ===
# Title: Trainer (AdamW, Clipping, Checkpointing, Epoch Metrics)
# Explanation: Implements a production-style training loop with mini-batching, AdamW updates,
# gradient clipping, per-epoch metrics, and best-checkpoint saving.

def iter_batches(indices: np.ndarray, batch: int, seed: int) -> List[np.ndarray]:
    rng = np.random.RandomState(seed)
    perm = rng.permutation(indices)
    return [perm[i:i+batch] for i in range(0, len(perm), batch)]

def retrieval_metrics(sim: np.ndarray, ks: Tuple[int, int]) -> Dict[str, float]:
    # sim: (N,N) similarity, correct match on diagonal
    N = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # descending
    ranks = np.empty(N, dtype=np.int64)
    for i in range(N):
        ranks[i] = int(np.where(order[i] == i)[0][0]) + 1  # 1-based
    out = {}
    for k in ks:
        out[f"recall@{k}"] = float(np.mean(ranks <= k))
    out["mrr"] = float(np.mean(1.0 / ranks))
    out["mean_rank"] = float(np.mean(ranks))
    return out

def embed_all(model: MLP2, X: np.ndarray, batch: int = 512) -> np.ndarray:
    Ys = []
    for i in range(0, X.shape[0], batch):
        c = model.forward(X[i:i+batch])
        Ys.append(c["y"])
    return np.vstack(Ys)

def eval_split(split_name: str, idx: np.ndarray) -> Dict[str, float]:
    Yi = embed_all(IMG, X_img[idx])
    Yt = embed_all(TXT, X_txt[idx])
    sim = Yi @ Yt.T
    m_it = retrieval_metrics(sim, CFG.topk)
    m_ti = retrieval_metrics(sim.T, CFG.topk)
    out = {f"{split_name}_i2t_{k}": v for k, v in m_it.items()}
    out.update({f"{split_name}_t2i_{k}": v for k, v in m_ti.items()})
    out[f"{split_name}_sym_gap_recall@1"] = float(abs(m_it["recall@1"] - m_ti["recall@1"]))
    out[f"{split_name}_mean_pos_sim"] = float(np.mean(np.diag(sim)))
    return out

def save_checkpoint(path: str) -> None:
    ckpt = {
        "IMG": {k: v.tolist() for k, v in IMG.params().items()},
        "TXT": {k: v.tolist() for k, v in TXT.params().items()},
        "meta": {"run_id": AM.run_id, "utc": utc_now_iso()},
    }
    with open(path, "w") as f:
        json.dump(ckpt, f)

best_val = -1.0
history: List[Dict[str, Any]] = []

for epoch in range(1, CFG.epochs + 1):
    batches = iter_batches(train_idx, CFG.batch, seed=CFG.seed + epoch)
    epoch_losses = []
    grad_stats_accum = []

    for bidx in batches:
        Xi = X_img[bidx].astype(np.float32)
        Xt = X_txt[bidx].astype(np.float32)

        ci = IMG.forward(Xi)
        ct = TXT.forward(Xt)
        loss, dYi, dYt, lmet = contrastive_symmetric(ci["y"], ct["y"], CFG.temp)

        gi = IMG.backward(ci, dYi)
        gt = TXT.backward(ct, dYt)

        # update
        si = IMG.adam_step(gi, CFG.lr, CFG.adam_b1, CFG.adam_b2, CFG.adam_eps, CFG.weight_decay, CFG.grad_clip)
        st = TXT.adam_step(gt, CFG.lr, CFG.adam_b1, CFG.adam_b2, CFG.adam_eps, CFG.weight_decay, CFG.grad_clip)

        epoch_losses.append(loss)
        grad_stats_accum.append({**si, **{f"txt_{k}": v for k, v in st.items()}, **lmet})

    # eval
    tr = eval_split("train", train_idx[:min(512, len(train_idx))])  # bounded eval for speed
    va = eval_split("val", val_idx)

    row = {
        "epoch": epoch,
        "loss": float(np.mean(epoch_losses)),
        "loss_std": float(np.std(epoch_losses)),
        **tr,
        **va,
        "grad_norm_W1_mean": float(np.mean([g["grad_norm_W1"] for g in grad_stats_accum])),
        "grad_norm_W2_mean": float(np.mean([g["grad_norm_W2"] for g in grad_stats_accum])),
        "utc": utc_now_iso(),
    }
    history.append(row)

    # checkpoint on val recall@1 (i2t)
    key = "val_i2t_recall@1"
    score = row[key]
    if score > best_val:
        best_val = score
        ckpt_path = os.path.join(CFG.ckpt_dir, "best_checkpoint.json")
        save_checkpoint(ckpt_path)

    if epoch % 5 == 0 or epoch == 1:
        print(f"Epoch {epoch:03d} | loss={row['loss']:.4f} | val_r@1(i2t)={row['val_i2t_recall@1']:.3f} | val_sym_gap={row['val_sym_gap_recall@1']:.3f}")

# Save training curve
with open(os.path.join(CFG.out_dir, "train_history.json"), "w") as f:
    json.dump(history, f, indent=2)

print("Training done. best val_i2t_recall@1 =", best_val)


Epoch 001 | loss=4.8407 | val_r@1(i2t)=0.033 | val_sym_gap=0.013
Epoch 005 | loss=1.3285 | val_r@1(i2t)=0.202 | val_sym_gap=0.020
Epoch 010 | loss=1.1369 | val_r@1(i2t)=0.248 | val_sym_gap=0.033
Epoch 015 | loss=1.0378 | val_r@1(i2t)=0.231 | val_sym_gap=0.107
Epoch 020 | loss=0.8004 | val_r@1(i2t)=0.329 | val_sym_gap=0.075
Epoch 025 | loss=0.8408 | val_r@1(i2t)=0.339 | val_sym_gap=0.029
Epoch 030 | loss=0.8021 | val_r@1(i2t)=0.332 | val_sym_gap=0.065
Epoch 035 | loss=0.7947 | val_r@1(i2t)=0.336 | val_sym_gap=0.107
Epoch 040 | loss=0.6899 | val_r@1(i2t)=0.358 | val_sym_gap=0.052
Epoch 045 | loss=0.7277 | val_r@1(i2t)=0.368 | val_sym_gap=0.075
Epoch 050 | loss=0.7007 | val_r@1(i2t)=0.349 | val_sym_gap=0.075
Epoch 055 | loss=0.6413 | val_r@1(i2t)=0.384 | val_sym_gap=0.068
Epoch 060 | loss=0.6711 | val_r@1(i2t)=0.358 | val_sym_gap=0.029
Training done. best val_i2t_recall@1 = 0.41368078175895767


##8.RETRIEVAL AND SPECTRAL DIAGNOSTICS

###8.1.OVERVIEW

**Cell 8 — Evaluation: Geometry as Evidence (Retrieval, Separability, Collapse, Spectra, Visuals)**

Cell 8 is designed to teach students how to “read” an embedding space responsibly. Many people stop at retrieval accuracy and declare victory. This cell insists that retrieval is necessary but not sufficient. We examine geometry through multiple lenses: retrieval metrics, factor separability, collapse diagnostics, covariance spectra, modality symmetry, and visual projections. The pedagogical goal is to replace narrative interpretation with structured evidence.

Retrieval metrics (recall@k and MRR) are the operational core: can an image find its paired text, and can a text find its paired image? These metrics measure whether the shared space is actually shared. We evaluate both directions because asymmetry is a common failure mode and because many real applications need both query directions.

Separability probes teach a different lesson: does the embedding preserve latent factors in a way that could support downstream tasks? We compute an ANOVA-like ratio between-class scatter over within-class scatter for each factor. Students should learn that this is not a “statistical proof,” but it is a controlled measure of whether factor structure is visible in geometry.

Collapse diagnostics matter because contrastive systems can sometimes find degenerate solutions. The mean off-diagonal cosine similarity and the per-dimension variance provide simple, interpretable signals. The covariance spectrum adds a more global view: effective rank indicates how many dimensions are meaningfully used. A model with low effective rank is geometrically impoverished, even if it sometimes produces passable retrieval on easy data.

Visuals (PCA via SVD and similarity heatmaps) are used as intuition builders, not as evidence. Students learn the correct scientific posture: visuals generate hypotheses; metrics test them. This matters because embedding plots are seductive. They can make you believe you see structure even when the structure is an artifact of projection.

Finally, we export all evaluation results as strict JSON, emphasizing reviewability. In professional settings, evaluation is not complete until it can be inspected and compared across runs. Cell 8 teaches that interpretation must be anchored to exported, structured evidence.


###8.2.CODE AND IMPLEMENTATION

In [66]:
# === Cell 8 ===
# Title: Evaluator (Retrieval@k, MRR, Separability, Collapse, Spectral Diagnostics, PCA via SVD)
# Explanation: Produces a rigorous diagnostic suite: retrieval, factor separability (ANOVA-like),
# collapse indicators, modality symmetry, and embedding covariance spectra. Also produces plots.

def covariance_spectrum(Y: np.ndarray) -> Dict[str, float]:
    # Y: (N,D) normalized
    C = (Y - Y.mean(axis=0, keepdims=True)).T @ (Y - Y.mean(axis=0, keepdims=True)) / max(1, Y.shape[0]-1)
    # eigenvalues of symmetric PSD matrix
    evals = np.linalg.eigvalsh(C).astype(np.float64)
    evals = np.maximum(evals, 0.0)
    tr = float(np.sum(evals))
    if tr <= 1e-12:
        eff_rank = 0.0
        cond = float("inf")
    else:
        eff_rank = float((tr**2) / (np.sum(evals**2) + 1e-12))
        cond = float((np.max(evals) + 1e-12) / (np.min(evals) + 1e-12))
    return {
        "trace": tr,
        "max_eig": float(np.max(evals)),
        "min_eig": float(np.min(evals)),
        "effective_rank": eff_rank,
        "condition_number": cond,
        "eigvals": evals.tolist(),  # keep for audit; can be large but D small
    }

def collapse_metrics(Y: np.ndarray) -> Dict[str, float]:
    # mean off-diagonal cosine (Y normalized)
    sim = Y @ Y.T
    N = sim.shape[0]
    off = (np.sum(sim) - np.sum(np.diag(sim))) / max(1, N*(N-1))
    var = float(np.mean(np.var(Y, axis=0)))
    return {"mean_offdiag_cos": float(off), "mean_dim_var": var}

def anova_separability(Y: np.ndarray, labels: np.ndarray) -> float:
    # Between / Within variance ratio (scalar)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    classes = np.unique(labels)
    between = 0.0
    within = 0.0
    for c in classes:
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        mu_c = Yc[idx].mean(axis=0, keepdims=True)
        between += float(len(idx) * np.sum(mu_c**2))
        within += float(np.sum((Yc[idx] - mu_c)**2))
    return float(between / (within + 1e-12))

def pca_2d(Y: np.ndarray) -> np.ndarray:
    # PCA via SVD of centered data
    Yc = Y - Y.mean(axis=0, keepdims=True)
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
    return (U[:, :2] * S[:2]).astype(np.float32)

# Evaluate on test split (full)
Yi_test = embed_all(IMG, X_img[test_idx])
Yt_test = embed_all(TXT, X_txt[test_idx])
sim_test = Yi_test @ Yt_test.T

m_i2t = retrieval_metrics(sim_test, CFG.topk)
m_t2i = retrieval_metrics(sim_test.T, CFG.topk)

# separability probes (use image embeddings; can also do text)
sep = {
    "shape": anova_separability(Yi_test, META["shape"][test_idx]),
    "orient": anova_separability(Yi_test, META["orient"][test_idx]),
    "freq_bin": anova_separability(Yi_test, META["freq_bin"][test_idx]),
    "phase_bin": anova_separability(Yi_test, META["phase_bin"][test_idx]),
    "thick_bin": anova_separability(Yi_test, META["thick_bin"][test_idx]),
}

# collapse + spectra
col_i = collapse_metrics(Yi_test)
col_t = collapse_metrics(Yt_test)
spec_i = covariance_spectrum(Yi_test)
spec_t = covariance_spectrum(Yt_test)

# modality symmetry norms (pre-normalization norms are useful; approximate via cached z norms batchwise)
def approx_pre_norms(model: MLP2, X: np.ndarray, batch: int = 512) -> np.ndarray:
    norms = []
    for i in range(0, X.shape[0], batch):
        c = model.forward(X[i:i+batch])
        norms.append(c["nrm"].reshape(-1))
    return np.concatenate(norms, axis=0)

norm_i = approx_pre_norms(IMG, X_img[test_idx])
norm_t = approx_pre_norms(TXT, X_txt[test_idx])

sym = {
    "mean_pre_norm_img": float(np.mean(norm_i)),
    "mean_pre_norm_txt": float(np.mean(norm_t)),
    "ratio_img_txt": float(np.mean(norm_i) / (np.mean(norm_t) + 1e-12)),
    "sym_gap_recall@1": float(abs(m_i2t["recall@1"] - m_t2i["recall@1"])),
    "mean_pos_sim": float(np.mean(np.diag(sim_test))),
}

# plots: PCA colored by shape + similarity heatmap
Z2 = pca_2d(Yi_test)
plt.figure(figsize=(6,5))
plt.scatter(Z2[:,0], Z2[:,1], c=META["shape"][test_idx], s=8)
plt.title("PCA (SVD) of IMAGE embeddings (test), colored by shape")
p1 = os.path.join(CFG.plots_dir, "pca_image_shape_test.png")
plt.savefig(p1, dpi=140)
plt.close()

plt.figure(figsize=(6,5))
plt.imshow(sim_test[:128, :128], aspect="auto")
plt.title("Similarity heatmap (test subset 128x128)")
p2 = os.path.join(CFG.plots_dir, "sim_heatmap_128.png")
plt.savefig(p2, dpi=140)
plt.close()

metrics = {
    "retrieval_i2t": m_i2t,
    "retrieval_t2i": m_t2i,
    "separability_anova_ratio_image": sep,
    "collapse_image": col_i,
    "collapse_text": col_t,
    "spectrum_image": {k: v for k, v in spec_i.items() if k != "eigvals"},
    "spectrum_text": {k: v for k, v in spec_t.items() if k != "eigvals"},
    "symmetry": sym,
    "plots": {"pca_image_shape_test": p1, "sim_heatmap_128": p2},
}

AM.write_json_strict(
    os.path.join(CFG.out_dir, "metrics_summary.json"),
    facts_provided=metrics,
    assumptions={
        "ground_truth_pairs": "Diagonal corresponds to true pairing by construction.",
        "anova_ratio": "Between/within scatter proxy; not a full statistical test.",
    },
    open_items=[],
    analysis="Comprehensive post-training diagnostics: retrieval, factor separability, collapse indicators, modality symmetry, and covariance spectra.",
    draft_output={"headline": {"test_i2t_recall@1": m_i2t["recall@1"], "test_t2i_recall@1": m_t2i["recall@1"], "effective_rank_img": spec_i["effective_rank"]}},
    verification_status="Not verified",
    questions_to_verify=["Do these diagnostics predict downstream task success in real multimodal settings?"]
)

print("Evaluator complete.")
print("  test recall@1 i2t:", m_i2t["recall@1"], "| t2i:", m_t2i["recall@1"])
print("  sep(shape):", sep["shape"], "| collapse(offdiag cos, img):", col_i["mean_offdiag_cos"])


Evaluator complete.
  test recall@1 i2t: 0.31493506493506496 | t2i: 0.38311688311688313
  sep(shape): 0.026190861125560717 | collapse(offdiag cos, img): 0.007339404430240393


##9.STRESS SWEEP

###9.1.0VERVIEW

**Cell 9 — Stress Testing: Robustness, Fragility, and “Where the Geometry Breaks”**

Cell 9 teaches the most important practical lesson for using multimodal models: performance is conditional. A model that works on clean data can fail sharply when one modality degrades or when pairing information becomes imperfect. In deployment, those shifts are not rare; they are normal. Therefore, the responsible way to use multimodal embeddings is to measure how they degrade under plausible stresses.

We perform two stresses that map cleanly to real-world failure modes. The first is noise asymmetry: we increase image noise while keeping text constant. This corresponds to blurry images, low-quality scans, sensor noise, compression artifacts, or domain mismatch. The second is pair corruption: we permute a fraction of text pairs. This corresponds to metadata errors, mislabeled assets, incorrect file associations, or weak supervision where pairing is imperfect.

Students should focus on the degradation curves. A degradation curve is not just a plot; it is a contract. It tells you how much shift the system can tolerate before it becomes unreliable. If recall@1 collapses quickly under mild noise, then the embedding space is brittle. If it degrades smoothly, the system is more robust. This teaches a professional decision-making habit: you do not deploy based on peak performance; you deploy based on tolerated stress.

Stress tests also reveal whether the learned representation uses “meaningful invariants” or fragile shortcuts. A space that relies on spurious artifacts will often break catastrophically when those artifacts are perturbed. In synthetic labs, we can design stresses that target specific artifacts. In real systems, you design stresses that target your known risks: OCR errors, audio dropouts, camera shifts, or domain-specific confounders.

Finally, this cell reinforces governance: stress outputs are exported as structured JSON and plots. Students learn that robustness is not a claim; it is measured evidence with recorded conditions. This is essential for frontier topics where stakeholders may be tempted to accept impressive demos without understanding fragility.


###9.2.CODE AND IMPLEMENTATION

In [67]:
# === Cell 9 ===
# Title: StressSuite (Noise Asymmetry + Pair Corruption) with Degradation Curves
# Explanation: Runs controlled stress sweeps (image noise and pair mismatch) and records
# degradation curves and plots to teach robustness and fragility of alignment geometry.

def eval_recall1_under(Xi: np.ndarray, Xt: np.ndarray) -> float:
    Yi = embed_all(IMG, Xi)
    Yt = embed_all(TXT, Xt)
    sim = Yi @ Yt.T
    return float(np.mean(np.argmax(sim, axis=1) == np.arange(sim.shape[0])))

# Noise asymmetry sweep (only re-render images, keep text fixed)
noise_grid = np.linspace(0.0, 0.50, 9)
acc_noise = []
for nl in noise_grid:
    # temporarily override noise (without mutating CFG dataclass)
    old = WORLD.cfg.noise_image
    object.__setattr__(WORLD.cfg, "noise_image", float(nl))  # controlled override in runtime
    Xi_n, Xt_n, _ = WORLD.build_dataset(len(test_idx))
    object.__setattr__(WORLD.cfg, "noise_image", float(old))
    acc_noise.append(eval_recall1_under(Xi_n, X_txt[test_idx]))

# Pair corruption sweep (permute text pairs by fraction)
def corrupt_pairs(Xt: np.ndarray, frac: float, seed: int) -> np.ndarray:
    rng = np.random.RandomState(seed)
    Xt2 = Xt.copy()
    n = Xt.shape[0]
    m = int(frac * n)
    if m <= 0:
        return Xt2
    idx = rng.choice(n, size=m, replace=False)
    perm = idx.copy()
    rng.shuffle(perm)
    Xt2[idx] = Xt2[perm]
    return Xt2

corrupt_grid = np.linspace(0.0, 0.50, 9)
acc_corrupt = []
Xi_base = X_img[test_idx]
Xt_base = X_txt[test_idx]
for frac in corrupt_grid:
    Xt_c = corrupt_pairs(Xt_base, float(frac), seed=CFG.seed + 1234)
    acc_corrupt.append(eval_recall1_under(Xi_base, Xt_c))

# plots
plt.figure(figsize=(6,4))
plt.plot(noise_grid, acc_noise, marker="o")
plt.xlabel("Image noise (std)")
plt.ylabel("Recall@1 (i2t)")
plt.title("Stress: Noise Asymmetry (image-only)")
p_noise = os.path.join(CFG.plots_dir, "stress_noise_asymmetry.png")
plt.savefig(p_noise, dpi=140)
plt.close()

plt.figure(figsize=(6,4))
plt.plot(corrupt_grid, acc_corrupt, marker="o")
plt.xlabel("Pair corruption fraction")
plt.ylabel("Recall@1 (i2t)")
plt.title("Stress: Pair Corruption")
p_corr = os.path.join(CFG.plots_dir, "stress_pair_corruption.png")
plt.savefig(p_corr, dpi=140)
plt.close()

stress = {
    "noise_asymmetry": {"grid": noise_grid.tolist(), "recall1": acc_noise, "plot": p_noise},
    "pair_corruption": {"grid": corrupt_grid.tolist(), "recall1": acc_corrupt, "plot": p_corr},
}

AM.write_json_strict(
    os.path.join(CFG.stress_dir, "stress_report.json"),
    facts_provided=stress,
    assumptions={"stress_design": "Noise affects only the image modality; corruption permutes a subset of text pairs."},
    open_items=[],
    analysis="Stress sweeps quantify robustness and illustrate degradation geometry under asymmetric noise and pairing corruption.",
    draft_output={"headline": {"recall1_clean": acc_noise[0], "recall1_noise_0.5": acc_noise[-1], "recall1_corrupt_0.5": acc_corrupt[-1]}},
    verification_status="Not verified",
    questions_to_verify=["Do these stress curves match qualitative failure regimes in real multimodal models?"]
)

print("StressSuite complete.")
print("  clean recall@1:", acc_noise[0], "| noise(0.5):", acc_noise[-1], "| corrupt(0.5):", acc_corrupt[-1])


StressSuite complete.
  clean recall@1: 0.003246753246753247 | noise(0.5): 0.003246753246753247 | corrupt(0.5): 0.16883116883116883


##10.AUDIT BUNDLE

###10.1.OVERVIEW

**Cell 10 — Finalization and the Audit Bundle (Turning a Notebook into a Reviewable Professional Artifact)**

Cell 10 completes the governance loop. The notebook does not end when training ends; it ends when results are packaged into a form that can be reviewed by someone who was not present at runtime. This is a subtle but crucial pedagogical point. In professional contexts, your audience is rarely “you, right now.” Your audience is a reviewer, a colleague, an auditor, or a future version of yourself trying to understand what happened. The audit bundle is what makes that possible.

This cell writes the prompt log, which records the intent of the notebook run along with hashes for integrity. It writes the risk log, which documents known failure modes and the controls implemented. It also writes a deliverables index: a file list with sizes that makes review navigation easy. This is a small engineering convenience that becomes a big practical advantage when you run many experiments and need to compare artifacts across runs.

The final step is bundling. We zip the manifest, logs, metrics, plots, and stress reports into a single archive. The pedagogical lesson is that a model is not a weight file. A model is part of a narrative about evidence. The archive is the boundary object that allows that evidence to be transported, inspected, and critiqued.

Students should also learn what “verification_status = Not verified” means. It is not pessimism; it is epistemic hygiene. The notebook can produce internal evidence, but it cannot verify external claims (such as real-world generalization). Marking outputs as “Not verified” teaches students to separate what the notebook demonstrated from what it did not. This prevents the common frontier failure: taking a controlled synthetic demonstration and overstating its implications.

Ultimately, Cell 10 teaches the meta-lesson of the entire chapter: multimodality is not merely a capability; it is a practice. The practice includes reproducibility, diagnostics, stress testing, and reviewable artifacts. When students internalize that practice, they are ready to use multimodal models responsibly in real applications.


###10.2.CODE AND IMPLEMENTATION

In [68]:
# === Cell 10 ===
# Title: Finalization (Prompt Log, Bundle Zip, File Index, Run Summary)
# Explanation: Flushes prompt logs, bundles all artifacts into an audit zip,
# and prints a concise index of deliverables for review.

AM.flush_prompts()

# Create a concise file index for reviewer convenience
file_index = []
for root, _, files in os.walk("."):
    for fn in files:
        if fn.endswith((".json", ".jsonl", ".png")):
            path = os.path.join(root, fn)
            try:
                sz = os.path.getsize(path)
            except OSError:
                sz = -1
            file_index.append({"path": path, "bytes": sz})

file_index = sorted(file_index, key=lambda x: x["path"])

AM.write_json_strict(
    os.path.join(CFG.out_dir, "deliverables_index.json"),
    facts_provided={"files": file_index},
    assumptions={},
    open_items=[],
    analysis="Index of artifacts produced by this run; intended for audit/review navigation.",
    draft_output={"count": len(file_index)},
    verification_status="Not verified",
    questions_to_verify=[]
)

zip_name = f"audit_bundle_{AM.run_id}.zip"
with zipfile.ZipFile(zip_name, "w", compression=zipfile.ZIP_DEFLATED) as z:
    for item in file_index:
        z.write(item["path"])

print("FINALIZATION")
print("  run_id:", AM.run_id)
print("  bundle:", zip_name)
print("  key outputs:")
print("   - run_manifest.json")
print("   - risk_log.json")
print("   - prompts_log.jsonl")
print("   - deliverables/metrics_summary.json")
print("   - deliverables/train_history.json")
print("   - deliverables/stress/stress_report.json")
print("   - deliverables/plots/*.png")


FINALIZATION
  run_id: 3cb8138df0b7
  bundle: audit_bundle_3cb8138df0b7.zip
  key outputs:
   - run_manifest.json
   - risk_log.json
   - prompts_log.jsonl
   - deliverables/metrics_summary.json
   - deliverables/train_history.json
   - deliverables/stress/stress_report.json
   - deliverables/plots/*.png


##11.CONCLUSION

**Conclusion: Main Lessons and How to Use a Multimodally Trained Model**

The central lesson of this chapter is that a multimodally trained model is not “a model that understands images and text.” It is a disciplined system for constructing comparability. After you strip away marketing language, multimodality is the engineering of a shared coordinate system in which different measurement instruments can be meaningfully related. An image encoder and a text encoder are two different sensors. The model’s job is not to turn one into the other; it is to learn embeddings that preserve the invariants you care about so that cross-modal queries become stable geometric operations. Once you internalize this, you stop treating multimodality as magic and start treating it as a mechanism with failure modes, diagnostics, and governance obligations.

A second lesson is that multimodal “understanding” is mostly geometry plus constraints. When you train with a contrastive objective, you are not teaching the model a semantic dictionary. You are teaching it a metric: what should be near, what should be far, and how strongly those preferences should be enforced. This is why temperature, normalization, batch composition, and negative sampling matter so much. They are not peripheral hyperparameters; they are geometric dials. In practice, to “use” a multimodally trained model well is to understand the geometry it learned and to respect the regime in which that geometry is stable. If you use the embeddings outside that regime—under distribution shift, under different pairing assumptions, or under different noise patterns—you can get confident but meaningless similarity.

This leads to a practical posture: treat embeddings as measurements, not truths. An embedding is a coordinate produced by a learned instrument. It reflects the objective and data you trained on. It is not a direct encoding of reality, and it is not automatically interpretable. The correct question is not “what does this embedding mean?” but “what relations does this embedding preserve reliably?” The chapter operationalizes this by forcing you to measure retrieval, separability, and collapse. Those metrics are the minimal evidence that the geometry does what you think it does. If retrieval fails, the shared space is not functioning as a shared space. If separability fails, the space is not preserving important factors. If collapse indicators are high, the model is compressing everything into a near-single direction and you are confusing numerical stability with representation quality.

Once you adopt the “embedding as measurement” view, you can articulate a correct workflow for using multimodally trained models. The workflow begins with specifying a task in geometric terms. Many practical multimodal tasks reduce to a small set of geometric operations: retrieval (find matching items across modalities), clustering (group items by shared factors), ranking (order candidates by similarity), and monitoring (detect drift by changes in similarity distributions). For each operation, your responsibility is to confirm that the geometry supports it. For retrieval, you test recall@k and MRR under the conditions you will use. For clustering, you test whether clusters correspond to meaningful factors rather than superficial artifacts. For ranking, you test calibration: does high similarity actually correspond to correctness, and does that relationship remain stable under mild perturbations?

A key insight here is that “using a multimodal model” is often “using its embedding space as a database index.” In modern systems, the most common operational use of multimodal encoders is not generation. It is representation: you embed a corpus of images, documents, diagrams, or mixed media; you embed a query in one modality; you retrieve nearest neighbors in the shared space. This can feel like semantics, but it is essentially metric search. The difference between a robust system and a brittle one is whether you treat the embedding model as a static, unquestioned oracle, or as a measured component with verified operating conditions. In a professional setting, you do not deploy metric search without monitoring its error modes. Multimodal search is no different. You need baselines, drift detection, and periodic re-validation.

Another lesson is that multimodal alignment is inseparable from data pairing assumptions. Contrastive learning assumes that your positives are truly positives and your negatives are “other.” In real datasets, this assumption is noisy. Captions can be incomplete or misleading. Images can contain multiple objects while text mentions only one. Two different items can be validly related (near-duplicates, paraphrases, similar scenes) but be labeled as negatives. In our synthetic world, pairing is clean by construction, which is why the mechanism is teachable. But the point of the synthetic lab is precisely to help you recognize what breaks when pairing becomes ambiguous. When you use a multimodally trained model, you must ask: what is a “pair” in this setting? Is it identity (exact match) or relevance (partially overlapping factors) or style (aesthetic similarity) or intent (functional similarity)? Different pairing definitions imply different geometries, and a model trained under one pairing regime will be systematically wrong under another.

This is where the chapter’s stress tests become a professional habit rather than a classroom trick. Noise asymmetry and pair corruption are not artificial; they are simplified versions of real-world shifts. Noise asymmetry occurs when one modality becomes degraded: blurry images, low-light frames, OCR errors, speech-to-text mistakes, low-quality scans. Pair corruption occurs when metadata linking modalities breaks: mislabeled attachments, wrong captions, duplicated IDs, or inconsistent naming conventions. A model that performs beautifully on clean pairs can fail catastrophically under these shifts. Therefore, to “use” a multimodal model responsibly is to treat stress testing as part of acceptance, not as an optional curiosity. If you cannot characterize degradation curves for the likely shifts in your domain, you do not understand the system you are deploying.

A further lesson is the importance of modality symmetry. In a shared embedding space, you can have a deceptive success pattern: one direction works well (image→text retrieval) and the reverse direction works poorly (text→image retrieval), or vice versa. This is not a minor asymmetry; it is evidence that the shared space is not truly shared. It often indicates that one encoder dominates the learning dynamics or that one modality carries more easily exploitable features. In practice, this matters because real systems often need both directions: a user might search images with text queries, but they might also search textual descriptions with image queries or use images to retrieve documents. A symmetry gap is a warning sign. The correct use pattern is to measure both directions and to decide explicitly whether you are building a symmetric system or a one-way index. If it is one-way, you should say so and constrain use accordingly.

Related to symmetry is the deeper issue of “what the model chooses to align.” A multimodal system will align whatever is easiest to align under the objective and data. If your dataset contains spurious correlations—such as certain words always appearing with certain colors, or certain layouts always corresponding to certain labels—the model may exploit those correlations rather than the intended causal factors. This is not a moral failure; it is optimization. The governance implication is that you must be explicit about spurious correlation risk and must test for it. In our synthetic lab, spurious correlation can be introduced intentionally: add a token that correlates with a nuisance feature, or add an image artifact correlated with a class. The model will likely latch onto it. The professional lesson is that multimodal models are not immune to shortcut learning; they are often more vulnerable, because the cross-modal objective can amplify correlated nuisance signals that make retrieval easy but meaning shallow.

A major practical takeaway is how to interpret distances and similarities. People often treat cosine similarity as if it were a probability of correctness. It is not. It is a geometric measure that can be monotonic with correctness under certain conditions, but it is not calibrated. If you use similarity thresholds to decide whether to accept a match, you must empirically calibrate those thresholds under your domain conditions. This includes evaluating false positives, false negatives, and ambiguous matches. It also includes monitoring how similarity distributions shift over time. In a deployed system, a drift in the similarity histogram can signal a data pipeline change, a domain shift, or a subtle degradation in one modality. Treat similarity as a signal that requires governance, not as a truth score.

Another lesson is that embedding visualizations are not evidence by themselves. PCA plots and heatmaps are invaluable for intuition. They help students see “cluster” and “axis” in a visceral way. But it is dangerously easy to over-interpret them. Dimensionality reduction can create apparent structure where none exists, or hide structure that exists. The correct posture is: visuals are hypotheses, metrics are tests. Use PCA to generate hypotheses about factor structure, then confirm those hypotheses with separability metrics and retrieval performance. This is a good scientific practice that transfers directly to real multimodal work.

Now, what does it mean to use a multimodally trained model well, in a concrete operational sense? It means you separate your work into three layers: representation, decision, and governance. The representation layer produces embeddings. The decision layer uses embeddings for tasks like retrieval, ranking, clustering, or matching. The governance layer monitors and constrains those decisions. Most failures happen because practitioners conflate these layers. They embed and immediately treat nearest neighbors as ground truth. The correct workflow is to: embed, measure geometry quality, choose task-specific decision rules, stress test the decision rules, and then deploy with monitoring and periodic re-validation.

In that workflow, the concept of “acceptance criteria” becomes central. A model is not accepted because it trains without errors. It is accepted because it meets pre-defined criteria under defined stresses. For this chapter, those criteria can be simple but meaningful: retrieval@1 above a threshold on validation, symmetry gap below a threshold, effective rank above a floor (to avoid collapse), and degradation curves that do not fall off a cliff under mild noise. In real systems, you will add domain-specific criteria: bias checks, adversarial robustness, privacy constraints, and audit requirements. But the principle is the same: acceptance is evidence-based. This is what makes multimodality teachable in a governance-first framework. Students learn that representation learning is not just optimization; it is system acceptance under constraint.

Finally, the deepest lesson is about interpretation. Multimodality teaches you to think of intelligence as coordination among partial views. No single modality is complete. Each is an instrument with blind spots. A multimodal model is valuable when it can reconcile those partial views into a representation that supports coherent action—retrieval, decision support, navigation, summarization, or monitoring. But it is dangerous when it gives you the illusion that partial views have become total knowledge. Governance is the counterweight to that illusion. You are required to label what is supported, what is assumed, and what is unknown. You are required to preserve reproducibility and provenance. You are required to stress the system and document failure modes. These are not bureaucratic extras; they are the conditions under which multimodal capability becomes professionally usable.

So the practical conclusion is simple to state and difficult to practice: use multimodal models as engineered measurement systems. Define what “pairing” means in your task. Validate geometry with retrieval and separability. Detect collapse and dominance. Stress test likely shifts. Calibrate similarity thresholds rather than guessing them. Deploy with monitoring and reviewable artifacts. When you do this, multimodality stops being frontier mystique and becomes a professional instrument: a shared coordinate system that you can trust, because you have measured the conditions under which it is trustworthy. That is the main lesson this chapter is designed to leave with you and your students.
