Layout quality evaluation: metric core, corpus sweep, calibration, and Rung 0 by bpowers · Pull Request #637 · bpowers/simlin

bpowers · 2026-05-23T19:59:05Z

Summary

Adds a layout-quality evaluation system for the SD diagram auto-layout, then uses it to improve production seed selection (Rung 0). Implements the 5-phase 2026-05-22-layout-quality-eval plan, plus two correctness fixes discovered during execution.

Phase 1 — metric core (layout/metrics.rs): a pure, geometry-accurate LayoutMetrics (per-aesthetic-concern costs: node/connector/label overlap, crossings, sprawl, edge-length CV, aspect, loop compactness) computed from the same geometry the renderer draws. Refactored arc geometry out of render_arc into a shared connector_polyline producer (byte-identical SVG) and re-pointed count_view_crossings to count on sampled polylines via build_view_segments.
Phase 2 — statistics core (layout/eval_stats.rs): pure benchstat-style seed-sample stats (geomean, type-7 percentile/median, Mann-Whitney U) + ModelStats/CorpusReport/compare.
Phase 3 — corpus sweep (examples/layout_eval.rs, gated [[example]]): on-demand sweep that lays out a curated corpus across many seeds, scores each, renders best/median/worst + reference PNGs, and emits metrics.json + a worst-first index.html contact-sheet + a baseline diff under gitignored target/. Not part of cargo test.
Phase 4 — calibration (human-in-the-loop): tuned the metric to a modeler's taste over the curated default_projects references. Readability dominates compactness: overlaps (on bare shape boxes) + crossings + per-label obscuration carry the weight; sprawl/edge-CV/aspect are zeroed; loop_compactness (isoperimetric) is a small nudge. Committed calibrated MetricWeights::default() with explicit user sign-off; AC5.1 dominance test + AC5.2 human-vs-auto reference-pair test.
Phase 5 — Rung 0 (select_best_layout): production best-of-k selection now minimizes the calibrated weighted_cost (NaN-safe, order-independent) instead of fewest crossings. Plus a deterministic weighted_cost regression guard and a per-seed determinism test.

Correctness fixes found and fixed in-branch

Layout nondeterminism (layout: generate_layout_with_config is nondeterministic per seed (HashMap/HashSet iteration order in force accumulation) #633): fresh_layout and the incremental diff_connectors path were not reproducible per seed (HashMap iteration order leaked into float accumulation and element ordering). Made deterministic and tested. This was a prerequisite for the regression guard, reference-pair test, and determinism check.
Flow-vs-link crossing miscount: links terminating at a flow valve were counted as crossing the flow pipe (different segment node-names defeated shared-endpoint suppression). Fixed; genuine mid-span crossings still counted.

Test Plan

Automated (in cargo test -p simlin-engine): metric geometry (AC1.), polyline crossings (AC2.), statistics vs reference values (AC4.4/4.5), weight dominance (AC5.1) + reference-pair ordering (AC5.2), Rung-0 selection + regression guard + determinism (AC6-AC8). All green; clippy clean; SVG parity (svg-rendering.test.ts) unchanged.

Manual (the corpus sweep is operational, not a suite test): see docs/test-plans/2026-05-22-layout-quality-eval.md for the on-demand sweep walkthrough and the human-judgment calibration gate (best/median/worst ordering, reference-vs-auto scoring, weight magnitudes).

Run the full sweep and inspect target/layout-eval/index.html + metrics.json
Confirm the contact-sheet's best/median/worst ordering and reference-vs-auto scoring match expectation
Exercise the baseline diff and skip-on-failure paths

Follow-ups (filed, out of scope here)

layout: revisit AC1.8 scale-invariance term classification during Phase 4 calibration #631 — AC1.8 scale-invariance term classification (design-doc vs implementation consistency note; implementation is documented and correct).
server: @simlin/server Jest suite intermittently fails one test under parallel pre-commit load #635 — pre-existing flaky @simlin/server Jest test under parallel pre-commit load (unrelated to this work).

Documents infrastructure for iterating on simlin-engine's automatic diagram layout. Today layout judges itself by edge-crossing count alone and cannot be rendered outside the browser; this plan closes both gaps with a pure, geometry-accurate LayoutMetrics suite scored on the same geometry the PNG renderer draws, a benchstat-style statistics core that treats layout quality as a distribution over seeds (median, geomean, Mann-Whitney U) rather than one fixed-seed sample, and an on-demand corpus sweep example that renders and scores layouts against human-authored reference views. The first algorithm step (Rung 0) re-points select_best_layout to rank seeds by the full weighted_cost instead of crossings-only, and the crossing count moves from a straight-chord approximation to polyline sampling so curved/bent connectors are measured faithfully. Rungs 1-3 (parameter search, metric-driven annealing cost, new layout passes) are documented as the forward path the harness is built to support.

Addresses a codebase-accuracy review of the design plan so it does not mislead the implementation planner: - The diagram geometry helpers (elements/flow/label/connector) live in private modules today, so Phase 1 must first expose them pub(crate) before layout::metrics can reuse them; this was previously presented as free. - The Arc->polyline factoring must keep render_svg byte-identical (a TS-vs-Rust parity test asserts it); flagged as the highest-effort Phase 1 item with that constraint as an acceptance condition. - MultiPoint links currently render to nothing, so AC2.2 focuses on the Arc case and MultiPoint is documented as a known renderer gap. - Corpus loading uses open_xmile/open_vensim + salsa sync (verify_layout is only an assertion helper, not a loader), and the per-seed seam is the existing generate_layout_with_config, not a new function. - AC8.1 (determinism) is now covered by Phase 5; AC8.2 is noted as satisfied by the design document itself.

…producer

Add a pure layout::metrics module computing one quality cost per aesthetic concern (0 = ideal): node_overlap, node_connector_overlap, label_overlap, crossings, sprawl, edge_length_cv, aspect_penalty, plus the reserved zero-weighted chain_straightness/loop_compactness. weighted_cost is the linear combination an optimizer minimizes; MetricWeights::default() is all-zeros so any accidental use before the Phase 4 calibration is inert. compute_layout_metrics is Functional Core: it takes a StockFlow plus a LayoutConfig (kept for the design's optimizer signature; presently unused since box geometry comes from the diagram helpers) and performs no I/O. Every term is sourced from the same geometry the renderer draws -- the diagram *_bounds and label helpers, connector_polyline, and the shared build_view_segments -- so a layout's score can never disagree with its rendered SVG, and the crossings term exactly equals count_view_crossings. Every division guards a zero denominator, so empty and single-element views yield all-zero, NaN-free metrics. aspect_penalty uses the documented ar - TARGET_AR_MAX overshoot form with TARGET_AR_MAX = 16/9. The label-vs-node term skips a label's own element box (the label is part of that element's merged bounds, so charging it would always add a constant equal to the label area). Inline tests cover AC1.1-AC1.8: hand-computed node/label/connector overlaps and aspect penalties, weighted_cost linearity, empty/single-element finiteness, a node_overlap shuffle-invariance proptest, and scale invariance. The AC1.8 test corrects the plan's scoping note: with fixed-pixel element boxes (the design's load-bearing renderer-fidelity invariant), only the topological crossings term is exactly scale-invariant; node_connector_overlap/edge_length_cv/aspect_penalty inherit a fixed-pixel offset and are scale-sensitive (asymptotically invariant), so the test asserts crossings invariance and pins node_connector_overlap's documented scale-sensitivity instead.

Drop the now-stale #[allow(dead_code)] on the diagram::common rect helpers that the layout::metrics core consumes (rect_width/rect_height/rect_area/ rect_overlap_area/segment_length_in_rect). Task 2 added those attributes as a forward reference because the production callers had not landed yet; now that metrics.rs uses them they are reachable, so the suppression is removed and the explanatory comment updated. rect_contains_point keeps its allow since no non-test caller needs it yet.

…ment The label_overlap term summed label-vs-label overlap plus label-vs-node overlap, but the node boxes it compared labels against were the label-merged `aux_bounds`/`stock_bounds`/`flow_bounds` (which union each element's own label into the returned rect). So a label-vs-label overlap was charged again inside the other node's merged box, inflating the term's absolute magnitude. That magnitude matters because Phase 4 calibrates label_overlap's weight against it. Expose label-free shape boxes (`aux_shape_bounds`/`stock_shape_bounds`/ `flow_shape_bounds`) -- the geometry each `*_bounds` already computes before merging the label, refactored into a single source of truth so renderer output stays byte-identical -- and compare labels against the OTHER element's bare shape box. Modules/clouds already exclude their label from `*_bounds`, so they are their own shape box. This cleanly separates "label lands on another label" from "label lands on another node's shape", counting each label pair exactly once. Also narrow the AC1.8 scale-invariance comment: crossings are exactly scale-invariant only for crossings interior to both connectors. A crossing grazing a fixed-size node boundary (the polylines are clipped to those boundaries) can flip under uniform scale. The test fixture's crossing is at the center of the square the links form, far from every node box, so the test remains correct.

…cance

The best_seed and median_seed lowest-seed tie-breaks in ModelStats::from_samples already had dedicated tests, but worst_seed did not. worst_seed uses a non-obvious construct -- max_by(...) followed by a reversed seed comparison (.then(y.seed.cmp(&x.seed))) -- precisely so the lowest seed wins when two samples tie at the maximum weighted cost. Add a focused test that pins this behavior: two samples share the max cost at different seeds (plus a lower-cost third), and worst_seed must be the lower of the tied seeds. The test was verified non-vacuous by temporarily flipping the tie-break direction (it failed, picking the higher seed) before reverting.

For each loaded corpus model, lay out the main model once per sampled seed (rayon par_iter, mirroring generate_best_layout), score each layout with compute_layout_metrics, and reduce the costs to a weighted scalar. Collect MetricSamples into ModelStats::from_samples and aggregate into a CorpusReport, printing a per-model median/p25/p75/best-of-k line plus the corpus geomean-of-medians. The weighted_cost uses a Phase-3 PLACEHOLDER MetricWeights (overlap and crossings dominant; sprawl/edge-cv/aspect moderate; reserved structure terms zero), since MetricWeights::default() is all-zeros until Phase 4 calibrates and commits the real weights. Phase 4 must switch this sweep to MetricWeights::default() once those land. The parallel results are sorted back into seed order before summarizing, so the sample vector and every derived statistic are invariant to rayon scheduling. Note that generate_layout_with_config is itself not deterministic per seed -- the same (model, seed) yields slightly different layouts even serially within one process, traced to per-process-randomized HashMap iteration order in the layout pipeline. That is a pre-existing layout-engine issue independent of this sweep.

generate_layout / generate_layout_with_config / generate_best_layout produced different diagrams for the same (model, annealing_random_seed) on repeated serial calls in one process. The annealing RNG is already deterministic (StdRng::seed_from_u64); the run-to-run entropy came from per-instance-random std HashMap iteration order in two layout functions. In run_sfdp_with_rigid_chains, two loops over var_to_node (a HashMap) were order-sensitive: the centroid accumulation summed float positions (float addition is non-associative, so hash order perturbs the centroid) and the auxiliary initial-placement loop assigned each unpositioned aux a polar seed angle by its hash-iteration rank, so each aux got a different seed position that SFDP diverged into a different final layout. Both loops now iterate a once-materialized sorted Vec of the entries. var_to_node stays a HashMap (it is still used as a .get() lookup table elsewhere). The incremental path (incremental_layout -> diff_connectors) had the same class of bug, which the original report did not cover: new causal links were created while iterating a HashSet of new edges, and each new link both allocates a sequential uid and is appended to the view in that order, so hash order produced different uids and element ordering for the same logical link. The two HashSet/HashMap-driven link loops now iterate sorted keys, and the alias-match fallback picks the lowest matching key. Adds tests/layout.rs determinism tests for both the fresh and incremental paths: SIR previously differed in 8/15 (fresh) and 6-7/17 (incremental) elements between two serial calls; both are now 0.

Add serde::Deserialize alongside the existing Serialize on LayoutMetrics and MetricWeights (metrics.rs) and add serde::Serialize+Deserialize to MetricSample/ModelStats/CorpusReport (eval_stats.rs) so a full CorpusReport -- including each model's per-seed samples -- round-trips through JSON. The samples must survive so compare() can re-run Mann-Whitney U over the seed-sample cost sets. Comparison/ModelComparison gain Serialize only (the diff is recomputed every run, never read back). The layout_eval example now resolves a baseline diff: LAYOUT_EVAL_WRITE_BASELINE=1 writes the run's CorpusReport to the committed examples/layout_eval_baseline.json; a normal run reads that file, runs compare(baseline, candidate), prints the per-model delta_ratio + p_value + significance plus the aggregate, and embeds the Comparison into metrics.json (the baseline_comparison slot widened from Option<()> to Option<Comparison>) and the index.html header. An absent baseline skips the diff with a note. The committed baseline is seeded from a small representative subset (LAYOUT_EVAL_MODELS=sir,teacup LAYOUT_EVAL_SEEDS=8) to keep the run fast and the JSON modest. It captures CURRENT pre-Rung-0 behavior scored with the Phase-3 PLACEHOLDER_WEIGHTS. It MUST be regenerated after Phase 4 commits the calibrated weights (and the example switches to MetricWeights::default()) and before Phase 5 measures Rung 0's improvement -- see the sibling layout_eval_baseline.README.md. Since layout is now deterministic per seed, seeding then diffing candidate==baseline yields exactly-zero, non-significant deltas (consistent with AC4.5).

Wrap each model's full pipeline (load -> seed sweep -> render) in a new process_model() returning Result<(ModelStats, ModelRenders), String>, the single model-level skip-on-failure boundary. main() WARN-logs any per-model Err and continues to the next model, omitting the failed model from every artifact; the harness always reaches the end and exits 0 (even a fully-empty corpus is exit 0 with warnings). Three failure modes funnel through the Result, validated in the order data flows: a load/parse failure (already surfaced by load_model), a layout that fails on EVERY seed (zero usable samples -- now a model-level skip rather than a degenerate all-zero report entry), and render failures (kept non-fatal inside render_model, so a model with valid scores but a missing PNG still appears). A partial per-seed failure never sinks a model: the skip fires only when no seed produced a usable sample. Also refreshes the now-stale sweep_model doc comment: generate_layout_with_config is deterministic per seed since fix #633, so the prior note about run-to-run median/spread drift no longer applies.

The four built-in default_projects (fishbanks, logistic-growth, population, reliability) are hand-laid-out reference layouts and are the primary good-layout taste anchors for the Phase 4 metric calibration. Added them to the eval corpus so the sweep can score their reference views against generated layouts.

… boxes These two layout-quality terms previously iterated the label-merged node boxes (node_box), which conflated shape-vs-shape overlap with label overlap and charged a connector for merely passing under a label. The user's calibrated taste treats node shapes overlapping other node shapes, and a connector passing under a non-incident node SHAPE (it reads as a false causal connection at a glance), as the real costs; a connector passing only under a semi-transparent LABEL is mild noise, and label collisions already belong to label_overlap. node_overlap now sums pairwise overlap over the bare shape boxes (node_shape_box) and normalizes by total shape-box area; node_connector_overlap charges a connector for the length it spends inside non-incident shape boxes. node_shape_boxes is hoisted to be computed once and shared with label_overlap. sprawl and aspect_penalty intentionally keep using the label-merged node_box because the view's visual extent and characteristic node size include labels. MetricWeights, weighted_cost, crossings, and loop_compactness are untouched.

…chments build_view_segments named a link's endpoint vertices elem_{from_uid}/ elem_{to_uid} but a flow's pipe vertices flow_{uid}#{i}. do_segments_intersect only suppresses a crossing when two segments share a from_node/to_node string, so a link incident on a flow -- terminating at the flow's valve (link.to_uid == flow.uid) or at a stock/cloud the flow attaches to -- grazed the pipe at the connection point with mismatched node names and was miscounted as a real crossing. The flow branch now names each pipe vertex to share the elem_{uid} name the incident link/stock/cloud already uses: an attached point becomes elem_{attached_to_uid}, and the valve (which lies on the pipe, not at a stored point) is injected as an elem_{flow.uid} vertex by splitting the pipe segment it sits on. A genuinely free interior point keeps the per-flow flow_{uid}#{i} name, so a real mid-span crossing -- segments sharing no element with the flow -- is still counted. Consecutive pipe sub-segments share the joining vertex name, so a flow never self-crosses. This fixes the dp_logistic_growth reference miscount (two links terminating at the net birth rate valve reported 2 spurious crossings) that was distorting the Phase 4 layout-quality calibration; the reference now scores crossings == 0.

loop_compactness was a hardcoded 0.0 reserved field. Compute and populate it so the layout-quality eval sweep can report a per-model value. The committed weight stays 0 (MetricWeights::default and weighted_cost are unchanged); calibration happens in a later phase. The user wants to see what the term reports before it influences any optimizer. The measure is isoperimetric over the view's feedback cycles. Build a directed graph over positioned node-box elements (aux/stock/flow/module/ cloud): each Link contributes from_uid -> to_uid, and each Flow contributes source_attached -> flow.uid -> dest_attached through its own valve, so a stock--flow--stock feedback path is part of the graph. Enumerate simple directed cycles with a bounded DFS (each cycle's smallest uid is the search root, so a cycle is discovered once), capped at 12 nodes per cycle and 64 cycles total -- SD diagrams are tiny, so this stays O(small) and total. For each cycle of >= 3 distinct positioned nodes, take the node-box centers in cycle order, compute the shoelace area and summed-edge perimeter, and form the isoperimetric quotient Q = 4*PI*Area / Perimeter^2 clamped to [0,1] (Q=1 a perfect circle, Q->0 collinear/degenerate). The per-cycle penalty is 1 - Q; loop_compactness is the mean penalty over all qualifying cycles, or 0.0 when there is no cycle of >= 3 nodes. It thus rewards clean, well-spread loops and penalizes collapsed/collinear ones. Centers come from the bare shape box (node_shape_box), which is symmetric about the element position -- unlike the label-merged node_box, whose label offset would skew the polygon. Determinism does not rely on layout seed alone: adjacency targets are sorted, the search visits nodes in sorted uid order, every cycle is canonicalized (rotated to its smallest uid) and de-duplicated, and every per-cycle guard returns None rather than a NaN, so the mean is identical for a given view regardless of element ordering.

…sion sensitive) The old label_overlap divided the corpus-wide sum of label-vs-label and label-vs-shape overlap areas by the corpus's total label area. That under-counted a small-but-readability-killing overlap: a node circle clipping the last couple of characters of a short label has a small absolute overlap area, and dividing it by the total label area of every label in the diagram diluted it to near zero (~0.004 on the eval corpus), so the term could not distinguish "one label is 30% obscured" from "nothing is obscured." Redefine label_overlap as a per-label obscuration measure: for each labeled element L with box B_L, approximate the area of B_L covered by any OTHER label box or any OTHER element's bare shape box as the sum of the individual overlap areas, capped at area(B_L) so the obscured fraction is in [0,1]; label_overlap is the SUM of those fractions over all labels. A small clip now registers at its true obscuration fraction (~0.3 for the fixture above) instead of ~0.004. The capped sum is a monotone proxy for the exact covered area -- a pixel-exact union is unnecessary, and more/larger overlaps never decrease a label's fraction. This preserves the Phase-1 guards: a label is never charged against its own element's shape box, and the comparison uses the label-free shape box (node_shape_box), not the label-merged box, so label-vs-label coverage is not re-counted via another node's merged bounds. A mutual label-label collision is charged from both labels' perspectives -- intended, since both are unreadable. area(B_L) == 0 is guarded (skip), so the term stays finite. MetricWeights/weighted_cost are unchanged (calibration is a later phase). On the eval corpus the within-model worse-is-higher anchor is preserved (best <= median <= worst label_overlap for every model) while the magnitudes rise to reflect true per-label obscuration: e.g. reliability best 0.127 -> 0.937, fishbanks best 0.019 -> 0.312.

Replace the Phase-1 all-zeros placeholder Default impl for MetricWeights with the calibrated production weights from the Phase 3 contact-sheet calibration, with explicit user sign-off (2026-05-23): node_overlap = 1.0 node_connector_overlap = 1.0 label_overlap = 1.0 crossings = 1.0 sprawl = 0.0 edge_length_cv = 0.0 aspect_penalty = 0.0 loop_compactness = 0.25 chain_straightness = 0.0 Failure-mode rationale -- readability >> compactness. The dominant concerns (node-shape overlap, connectors under node shapes, obscured labels, edge crossings) carry weight 1: they make a diagram unreadable or assert false causal connections. sprawl, edge_length_cv, and aspect_penalty are intentionally 0 -- compactness and aspect ratio are NOT goals; spreading out to keep labels legible and feedback loops visible is good, not penalized. loop_compactness is a low 0.25 that gently rewards drawing feedback loops as visible circles without dominating the overlap/crossings family. chain_straightness stays 0 (reserved, not yet computed). The old AC5.1 test asserted the default was all-zeros so weighted_cost was inert; that is no longer true by design. It is replaced by a dominance-ordering test that pins the readability-dominant relationships (each overlap/crossings term strictly exceeds each compactness term; sprawl/edge_length_cv/aspect_penalty/chain_straightness == 0; 0 < loop_compactness < node_overlap) and re-confirms weighted_cost under the default equals the explicit sum of weighted terms.

Encode the user's taste anchors as an objective validation of the calibrated MetricWeights::default(): on the agreed reference pairs the shipped hand-authored (human) layout must score a lower weighted_cost than a machine-generated (auto) layout of the same model. Construction (b) -- human view vs generated layout. For each anchor we load the default_projects XMILE model, score its as-loaded main view (human), generate a single fixed-seed layout (auto), and assert human < auto under the committed weights. One fixed seed (not generate_best_layout) keeps the test deterministic (layout is deterministic per seed, #633) and fast: all five tests run in well under half a second over the small (<= 42 element) default_projects. Anchor costs under the committed weights (human < auto): reliability: human 0.255 < auto 2.997 fishbanks: human 0.194 < auto 1.603 population: human 0.052 < auto 0.053 logistic-growth: human 0.089 < auto 0.154 sir is deliberately NOT a human<auto anchor: its shipped reference obscures more labels than the auto layout (human 0.458 > auto 0.039), so the metric correctly prefers the auto. That direction is pinned by a separate documented test so the asymmetry is recorded, not silently assumed.

Now that Phase 4 has committed the calibrated MetricWeights::default(), the layout_eval sweep scores every weighted_cost with the committed default instead of the Phase-3 PLACEHOLDER_WEIGHTS const. Remove the PLACEHOLDER_WEIGHTS const (and its 'Phase 4 will replace' doc comment) and route the three call sites (per-seed sweep, render scoring, and the report builder) through MetricWeights::default(). Re-seed the committed examples/layout_eval_baseline.json from current behavior under the calibrated weights, over the same small documented subset (teacup,sir at M=8) so the committed file stays small. The recorded per-seed sample costs change because the weights changed, so the old baseline was stale. Update the sibling README to note the 2026-05-23 re-seed with the committed weights and drop the now-done 'regenerate after Phase 4' caveat.

…tion The population reference-pair anchor beats the auto layout by only ~2.3% under the committed default weights, far thinner than the other anchors (reliability ~8.5%, fishbanks ~12%, logistic-growth ~58%). Document that it is a marginal taste anchor so a future failure is read as a near-boundary case rather than necessarily a metric regression; the assertion is unchanged. Also note that canonicalize_cycle normalizes rotation but not traversal direction. A reverse-direction duplicate essentially never arises for directed SD feedback loops, and even if it did the shoelace polygon area is direction-invariant, so both yield the identical isoperimetric penalty.

Add a dedicated layout_selection_tests.rs (wired via #[path], following the crossings_tests.rs precedent so layout_tests.rs stays under the per-file line cap) covering AC6.1: select_best_layout minimizes weighted_cost, not crossings. The headline case builds two candidates whose crossing ordering is the inverse of their weighted_cost ordering -- the lowest-cost view actually has MORE connector crossings (asserted via count_view_crossings, so the inversion is real, not narrative) -- and confirms the lowest-cost view wins, which the retired crossings-only rule would not have done. Companion cases pin the lowest-seed tie-break and NaN safety: a NaN challenger never displaces a finite running best, while a NaN seeded first is sticky under the current fold (pinned as a documented limitation rather than silently assumed away).

Add a fast, deterministic guard (AC7.1) over three tiny programmatic models built with TestProject -- a population feedback model, a pure aux dependency chain, and a two-stock transfer. Each is laid out at the fixed annealing seed 42 and its calibrated weighted_cost asserted at or below a committed per-model ceiling. The ceilings are observed-cost ceilings (observed at seed 42: pop 0.0533, chain 0.0, two_stock 0.1646) set a small margin above the observed value, with a comment documenting how to regenerate them after an intentional metric/weight change. For AC7.2, a companion test takes a real fixed-seed layout and piles every node onto the origin (blowing up node_overlap to ~6.2) and asserts the result exceeds the pop ceiling, proving the guard discriminates good layouts from bad rather than passing vacuously. The whole guard runs in well under a second (tiny models, one seed each).

Add an AC8.1 determinism test: laying out the same model twice through generate_layout_with_config at the same explicit seed yields two StockFlow values that compare equal (StockFlow derives PartialEq, so every field -- positions, view box, element order -- is checked). The test uses a seed-sensitive model (a stock fed/drained by ten leaf auxes through two flows) where the SFDP/annealing RNG genuinely shapes the layout, and additionally asserts that a different seed produces a different layout -- so the same-seed equality is a real determinism guarantee, not a vacuous pass on a tiny model whose arrangement ignores the seed. A comment notes this per-seed reproducibility is distinct from the Phase 3 M-seed statistical sweep, which deliberately varies the seed.

…s of order The fold in select_best_layout kept the running best whenever the challenger's `<` comparison was false. That correctly drops a NaN-cost challenger (`NaN < finite` is false), but it also failed to let a finite challenger overtake a NaN running best (`finite < NaN` and `finite == NaN` are both false). So a degenerate NaN-cost layout from the first seed in the fixed order [42, 123, 456, 789] was sticky: generate_best_layout would ship it even when a later seed produced a usable finite layout. Add explicit NaN branches so a NaN challenger never wins and a finite challenger always beats a NaN running best, making the guard order-independent. If every candidate is NaN the earliest is kept deterministically. The metric is NaN-free in practice (compute_layout_metrics guards every division), so this is defensive robustness rather than an observed production failure.

Document the layout-quality eval branch's contract and structure changes in src/simlin-engine/CLAUDE.md: the new pure metrics.rs (LayoutMetrics / MetricWeights / compute_layout_metrics / weighted_cost) and eval_stats.rs modules, the per-seed bit-identical layout determinism guarantee (#633), the shift of generate_best_layout/select_best_layout from fewest-crossings to the calibrated weighted_cost (NaN-safe), the rebuilt count_view_crossings / build_view_segments crossing geometry, the now-public LAYOUT_SEEDS, the diagram submodules widened to pub(crate) plus connector_polyline and the label-free *_shape_bounds / rect-segment helpers they share with the metric, and the gated layout_eval example.

bpowers · 2026-05-23T20:24:40Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f00822399

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-23T20:29:29Z

+                    x: (r.left + r.right) / 2.0,
+                    y: (r.top + r.bottom) / 2.0,
+                },


Use valve position for flow loop geometry

build_loop_graph computes each node center from the shape-box midpoint, but for flows that midpoint is not the valve position because flow_shape_bounds now includes the entire pipe polyline (src/simlin-engine/src/diagram/flow.rs), which is often asymmetric when users drag the valve along the pipe. This makes loop_compactness evaluate loop polygons with shifted flow vertices, so the new production weighted_cost can rank seeds based on pipe endpoint extents instead of actual loop geometry. A flow node in loop scoring should use (flow.x, flow.y) (the valve) rather than the flow bounding-box center.

Useful? React with 👍 / 👎.

claude · 2026-05-23T20:32:07Z

Code review

Reviewed the production-impacting changes in depth (the eval harness/docs are out of the critical path):

diagram/connector.rs — arc_geometry/render_arc factor-out is guarded by a byte-identical SVG regression test; connector_polyline reproduces the drawn geometry faithfully (straight branch matches render_straight_line's clip endpoints; arc branch samples center-to-center exactly as render_arc draws). sample_arc's SVG endpoint→center reconstruction is self-consistent with the fixed circ/flags and is exercised by the arc-crossing test.
diagram/common.rs — Liang-Barsky segment_length_in_rect and the rect helpers are correct, including the parallel/outside and clamp cases.
diagram/{elements,flow}.rs — *_shape_bounds split preserves *_bounds output (bounding-box union is order-insensitive).
layout/mod.rs — build_view_segments valve injection / vertex naming correctly suppresses graze-at-valve while still counting genuine mid-span crossings; select_best_layout NaN handling is order-independent for finite candidates and deterministic for the all-NaN case; the diff_connectors and run_sfdp_with_rigid_chains determinism fixes materialize sorted iteration in every layout-affecting HashMap/HashSet loop.
layout/metrics.rs — every term guards its denominator (finite, NaN-free); node_overlap/node_connector_overlap/label_overlap use the bare shape boxes to avoid the documented double-counts; loop-cycle enumeration is bounded, canonicalized, and order-independent; weighted_cost matches MetricWeights::default().
layout/eval_stats.rs — geomean/type-7 percentile/Mann-Whitney (tie + continuity correction) formulas are correct, with documented non-NaN defaults.

Tests are non-vacuous (crossing inversions asserted via count_view_crossings, determinism checked with a seed-sensitivity counter-case, regression-guard ceilings backed by a degenerate-layout rejection test).

No blocking issues found.

Overall correctness verdict: correct

build_loop_graph computed each loop-polygon vertex as the midpoint of the element's bare shape box. For flows that box is flow_shape_bounds -- the valve box unioned with every pipe point -- so its midpoint is the pipe-extent center, which drifts off the valve (flow.x, flow.y) whenever the valve is dragged off-center or the pipe is bent. Flows are vertices in the loop graph (stock->flow->stock edges), so loop_compactness, and therefore the weighted_cost that Phase-5 Rung-0 select_best_layout minimizes, was being skewed by pipe geometry rather than true loop geometry. Compute each loop vertex via diagram::connector::get_visual_center -- the same visual center the SVG renderer uses -- which returns the valve for a flow and the element center (the symmetric shape-box midpoint) for aux/stock/module/cloud, so only flow vertices change. The node-membership gate stays node_shape_box. This restores the module's same-geometry-as-renderer invariant and corrects the now- stale LoopGraph/build_loop_graph doc comments that claimed the shape box is always the element center. Addresses the PR #637 review comment.

claude · 2026-05-23T21:33:30Z

Automated review — PR #637 (layout-quality eval + Rung 0)

I reviewed the production-impacting changes and the new cores in detail:

Connector refactor (connector.rs) — arc_geometry/render_arc factor-out is guarded by a byte-identical SVG regression test; sample_arc's endpoint→center arc parametrization reconstructs the drawn arc consistently with the SVG flags, and the bulge-direction + translation/rotation-invariance tests confirm the sampled geometry.
build_view_segments / count_view_crossings rewrite — all element kinds resolved by uid (Module/Alias no longer dropped), valve injected as elem_{flow.uid}, and the elem_/link_#i/flow_#i vertex naming correctly drives do_segments_intersect's shared-endpoint suppression. The flow-vs-link miscount fix and a genuine mid-span crossing are both covered.
Determinism fixes (layout: generate_layout_with_config is nondeterministic per seed (HashMap/HashSet iteration order in force accumulation) #633) — diff_connectors (sorted new-edge / lowest-key alias-match / sorted preserved-link loops) and run_sfdp_with_rigid_chains (sorted var_to_node materialization for both the centroid sum and the aux-placement rank) remove the HashMap-order nondeterminism; guarded by per-seed bit-identical tests for both the fresh and incremental paths.
select_best_layout — NaN-safe and order-independent (finite always beats NaN; cost ties break to lowest seed regardless of order).
metrics.rs — every term guards its zero denominator (finite, no NaN on empty/single-element views); overlap/connector terms correctly use bare shape boxes vs. label-merged boxes to avoid double-counting; loop_compactness cycle enumeration is complete (each cycle found from its min node) and order-independent.
eval_stats.rs — geomean, type-7 percentile/median, and tie+continuity-corrected Mann-Whitney U match their reference definitions; documented finite defaults on degenerate input.
flow.rs/elements.rs — *_shape_bounds extraction is union-preserving, so *_bounds behavior is unchanged.

No blocking correctness issues found. Test coverage is thorough and matches the documented contracts. (annealing::count_crossings remains O(n²) and arc sampling raises the per-seed segment count ~16×, but this is on the once-per-seed fresh-layout path, is dominated by annealing, and is already documented at annealing.rs:82 — not introduced or hidden here.)

Overall correctness: correct.

bpowers · 2026-05-23T22:24:14Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35ef355faa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-23T22:35:38Z

+                for seg in c.polyline.windows(2) {
+                    inside += segment_length_in_rect(&seg[0], &seg[1], rect);


Cap per-segment node connector coverage once

node_connector_overlap sums segment_length_in_rect across every non-incident node box, so when two node boxes overlap, the same connector sub-segment is counted multiple times. That makes this "fraction of connector length" exceed 1.0 and can disproportionately inflate weighted_cost, causing generate_best_layout to reject otherwise better layouts just because overlap regions were double-counted. The overlap should be measured as a union length per segment (or otherwise de-duplicated/clamped) before dividing by total_connector_length.

Useful? React with 👍 / 👎.

…oss boxes) node_connector_overlap is documented as a "fraction of total connector length", but it accumulated segment_length_in_rect for every non-incident node shape box, summing the per-box scalar lengths. When two non-incident shape boxes overlapped, the connector sub-length in the overlap region was counted once per box, so the term could exceed 1.0 -- breaking the fraction-in-[0,1] contract, over-inflating weighted_cost (this term's weight is 1.0), and skewing Phase-5 Rung-0 select_best_layout. Overlapping shape boxes are common because a Flow's node_shape_box is its whole-pipe bounding box, which frequently overlaps stocks/auxes/other flows. The fix unions the per-segment clip intervals across all non-incident boxes before summing, so each physical sub-length is counted at most once. A new pub(crate) segment_clip_interval_in_rect exposes the Liang-Barsky [t0, t1] core; segment_length_in_rect now delegates to it (behavior-preserving), and the metric merges the intervals per segment. Single-non-incident-box layouts are unchanged (nothing to merge); overlapping-box layouts see the term decrease toward the true union fraction, lowering weighted_cost in the safe direction so the AC5.2 human<auto anchors and the AC7 regression-guard ceilings continue to hold. Addresses the PR #637 review comment.

bpowers · 2026-05-24T02:06:39Z

@codex review

chatgpt-codex-connector · 2026-05-24T02:12:07Z

Codex Review: Didn't find any major issues. Another round soon, please!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

claude · 2026-05-24T02:14:45Z

Review — layout-quality eval + Rung-0 selection

I focused on the code that newly runs in the production layout path (generate_best_layout → compute_layout_metrics), the render_arc/connector_polyline factor-out (must stay byte-identical), the build_view_segments flow-valve crossing fix, the select_best_layout NaN handling, and the #633 determinism fixes. These all look correct: the NaN-safe selection is order-independent and ties break to the lowest seed; the arc factor-out is guarded by test_render_arc_svg_byte_identical; the metric terms each guard their zero-denominators so weighted_cost is finite; and the diff_connectors / run_sfdp_with_rigid_chains sorts make the per-seed layout reproducible. The benchstat statistics (geomean, type-7 percentile, mann_whitney_u) also check out.

One finding below.

[P2] Cycle enumeration has no global exploration budget (now on the production layout path)

src/simlin-engine/src/layout/metrics.rs (lines ~519-548, dfs_cycles)

compute_loop_compactness now runs for every candidate seed inside generate_best_layout. Its dfs_cycles DFS is bounded only by the per-path length cap (path.len() < MAX_CYCLE_LEN, 12) and the found-cycles cap (found.len() >= MAX_CYCLES, 64). The MAX_CYCLES short-circuit only fires after a cycle is recorded, so on a large, dense, but cycle-sparse directed graph (e.g. a wide/deep acyclic causal graph — a forecasting/calculation model with little feedback) few or no paths return to start, the cap never triggers, and the search enumerates essentially all simple paths of length ≤12 through the nodes with uid > start. That is combinatorial (up to ~d^12 for out-degree d) and can dominate layout generation, run ×4 seeds. This contradicts the inline claim at lines ~372-374 that the caps make enumeration "O(small) and total" / that "enumeration stops once the cap is hit". Severity depends on the input: feedback-rich SD models hit the cycle cap immediately and are unaffected, and no model in the current corpus triggers it (CI is green); the risk is a large near-acyclic model. A small global node-visit/step budget threaded through dfs_cycles (bailing the whole enumeration, like MAX_CYCLES already does) would bound the worst case and make the comment accurate.

Overall correctness verdict: correct. The production-impacting logic and the byte-identical render refactor are sound and well-tested; the one finding above is a latent worst-case performance concern (not a correctness break) that only bites on an uncommon large near-acyclic model.

bpowers added 30 commits May 23, 2026 13:22

engine: expose diagram geometry modules to layout (pub(crate))

c33fcd3

engine: add rect overlap + segment-clip helpers to diagram::common

39c09f0

engine: factor arc geometry out of render_arc into a shared polyline …

d643bdb

…producer

engine: count view crossings on sampled connector polylines

65e564c

engine: add pure layout eval statistics primitives (eval_stats.rs)

fb32e36

engine: add ModelStats/CorpusReport constructors

a94a930

engine: add baseline-vs-candidate compare() with Mann-Whitney signifi…

26a2232

…cance

engine: add layout_eval example skeleton + expose LAYOUT_SEEDS

df6ac00

engine: layout_eval renders best/median/worst + reference PNGs

7139212

engine: layout_eval emits metrics.json + index.html contact-sheet

7eadbca

engine: rung 0 - select best layout by weighted_cost

ccdd783

bpowers added 6 commits May 23, 2026 13:22

doc: add test plan for layout quality eval

37e2719

bpowers force-pushed the layout-quality-eval branch from 7f00822 to 37e2719 Compare May 23, 2026 20:25

chatgpt-codex-connector Bot reviewed May 23, 2026

View reviewed changes

bpowers merged commit 930bb5f into main May 24, 2026
10 of 13 checks passed

bpowers deleted the layout-quality-eval branch May 24, 2026 02:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Layout quality evaluation: metric core, corpus sweep, calibration, and Rung 0#637

Layout quality evaluation: metric core, corpus sweep, calibration, and Rung 0#637
bpowers merged 38 commits into
mainfrom
layout-quality-eval

bpowers commented May 23, 2026

Uh oh!

bpowers commented May 23, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Uh oh!

claude Bot commented May 23, 2026

Uh oh!

claude Bot commented May 23, 2026

Uh oh!

bpowers commented May 23, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Uh oh!

bpowers commented May 24, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 24, 2026

Uh oh!

Uh oh!

claude Bot commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		for seg in c.polyline.windows(2) {
		inside += segment_length_in_rect(&seg[0], &seg[1], rect);

Conversation

bpowers commented May 23, 2026

Summary

Correctness fixes found and fixed in-branch

Test Plan

Follow-ups (filed, out of scope here)

Uh oh!

bpowers commented May 23, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 23, 2026

Code review

Overall correctness verdict: correct

Uh oh!

claude Bot commented May 23, 2026

Automated review — PR #637 (layout-quality eval + Rung 0)

Uh oh!

bpowers commented May 23, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

bpowers commented May 24, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 24, 2026

Uh oh!

Uh oh!

claude Bot commented May 24, 2026

Review — layout-quality eval + Rung-0 selection

[P2] Cycle enumeration has no global exploration budget (now on the production layout path)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant