Skip to content

Layout quality evaluation: metric core, corpus sweep, calibration, and Rung 0#637

Merged
bpowers merged 38 commits into
mainfrom
layout-quality-eval
May 24, 2026
Merged

Layout quality evaluation: metric core, corpus sweep, calibration, and Rung 0#637
bpowers merged 38 commits into
mainfrom
layout-quality-eval

Conversation

@bpowers
Copy link
Copy Markdown
Owner

@bpowers bpowers commented May 23, 2026

Summary

Adds a layout-quality evaluation system for the SD diagram auto-layout, then uses it to improve production seed selection (Rung 0). Implements the 5-phase 2026-05-22-layout-quality-eval plan, plus two correctness fixes discovered during execution.

  • Phase 1 — metric core (layout/metrics.rs): a pure, geometry-accurate LayoutMetrics (per-aesthetic-concern costs: node/connector/label overlap, crossings, sprawl, edge-length CV, aspect, loop compactness) computed from the same geometry the renderer draws. Refactored arc geometry out of render_arc into a shared connector_polyline producer (byte-identical SVG) and re-pointed count_view_crossings to count on sampled polylines via build_view_segments.
  • Phase 2 — statistics core (layout/eval_stats.rs): pure benchstat-style seed-sample stats (geomean, type-7 percentile/median, Mann-Whitney U) + ModelStats/CorpusReport/compare.
  • Phase 3 — corpus sweep (examples/layout_eval.rs, gated [[example]]): on-demand sweep that lays out a curated corpus across many seeds, scores each, renders best/median/worst + reference PNGs, and emits metrics.json + a worst-first index.html contact-sheet + a baseline diff under gitignored target/. Not part of cargo test.
  • Phase 4 — calibration (human-in-the-loop): tuned the metric to a modeler's taste over the curated default_projects references. Readability dominates compactness: overlaps (on bare shape boxes) + crossings + per-label obscuration carry the weight; sprawl/edge-CV/aspect are zeroed; loop_compactness (isoperimetric) is a small nudge. Committed calibrated MetricWeights::default() with explicit user sign-off; AC5.1 dominance test + AC5.2 human-vs-auto reference-pair test.
  • Phase 5 — Rung 0 (select_best_layout): production best-of-k selection now minimizes the calibrated weighted_cost (NaN-safe, order-independent) instead of fewest crossings. Plus a deterministic weighted_cost regression guard and a per-seed determinism test.

Correctness fixes found and fixed in-branch

  • Layout nondeterminism (layout: generate_layout_with_config is nondeterministic per seed (HashMap/HashSet iteration order in force accumulation) #633): fresh_layout and the incremental diff_connectors path were not reproducible per seed (HashMap iteration order leaked into float accumulation and element ordering). Made deterministic and tested. This was a prerequisite for the regression guard, reference-pair test, and determinism check.
  • Flow-vs-link crossing miscount: links terminating at a flow valve were counted as crossing the flow pipe (different segment node-names defeated shared-endpoint suppression). Fixed; genuine mid-span crossings still counted.

Test Plan

Automated (in cargo test -p simlin-engine): metric geometry (AC1.), polyline crossings (AC2.), statistics vs reference values (AC4.4/4.5), weight dominance (AC5.1) + reference-pair ordering (AC5.2), Rung-0 selection + regression guard + determinism (AC6-AC8). All green; clippy clean; SVG parity (svg-rendering.test.ts) unchanged.

Manual (the corpus sweep is operational, not a suite test): see docs/test-plans/2026-05-22-layout-quality-eval.md for the on-demand sweep walkthrough and the human-judgment calibration gate (best/median/worst ordering, reference-vs-auto scoring, weight magnitudes).

  • Run the full sweep and inspect target/layout-eval/index.html + metrics.json
  • Confirm the contact-sheet's best/median/worst ordering and reference-vs-auto scoring match expectation
  • Exercise the baseline diff and skip-on-failure paths

Follow-ups (filed, out of scope here)

bpowers added 30 commits May 23, 2026 13:22
Documents infrastructure for iterating on simlin-engine's automatic
diagram layout. Today layout judges itself by edge-crossing count alone
and cannot be rendered outside the browser; this plan closes both gaps
with a pure, geometry-accurate LayoutMetrics suite scored on the same
geometry the PNG renderer draws, a benchstat-style statistics core that
treats layout quality as a distribution over seeds (median, geomean,
Mann-Whitney U) rather than one fixed-seed sample, and an on-demand
corpus sweep example that renders and scores layouts against
human-authored reference views.

The first algorithm step (Rung 0) re-points select_best_layout to rank
seeds by the full weighted_cost instead of crossings-only, and the
crossing count moves from a straight-chord approximation to polyline
sampling so curved/bent connectors are measured faithfully. Rungs 1-3
(parameter search, metric-driven annealing cost, new layout passes) are
documented as the forward path the harness is built to support.
Addresses a codebase-accuracy review of the design plan so it does not
mislead the implementation planner:

- The diagram geometry helpers (elements/flow/label/connector) live in
  private modules today, so Phase 1 must first expose them pub(crate)
  before layout::metrics can reuse them; this was previously presented
  as free.
- The Arc->polyline factoring must keep render_svg byte-identical (a
  TS-vs-Rust parity test asserts it); flagged as the highest-effort
  Phase 1 item with that constraint as an acceptance condition.
- MultiPoint links currently render to nothing, so AC2.2 focuses on the
  Arc case and MultiPoint is documented as a known renderer gap.
- Corpus loading uses open_xmile/open_vensim + salsa sync (verify_layout
  is only an assertion helper, not a loader), and the per-seed seam is
  the existing generate_layout_with_config, not a new function.
- AC8.1 (determinism) is now covered by Phase 5; AC8.2 is noted as
  satisfied by the design document itself.
Add a pure layout::metrics module computing one quality cost per aesthetic
concern (0 = ideal): node_overlap, node_connector_overlap, label_overlap,
crossings, sprawl, edge_length_cv, aspect_penalty, plus the reserved
zero-weighted chain_straightness/loop_compactness. weighted_cost is the linear
combination an optimizer minimizes; MetricWeights::default() is all-zeros so any
accidental use before the Phase 4 calibration is inert.

compute_layout_metrics is Functional Core: it takes a StockFlow plus a
LayoutConfig (kept for the design's optimizer signature; presently unused since
box geometry comes from the diagram helpers) and performs no I/O. Every term is
sourced from the same geometry the renderer draws -- the diagram *_bounds and
label helpers, connector_polyline, and the shared build_view_segments -- so a
layout's score can never disagree with its rendered SVG, and the crossings term
exactly equals count_view_crossings. Every division guards a zero denominator,
so empty and single-element views yield all-zero, NaN-free metrics.

aspect_penalty uses the documented ar - TARGET_AR_MAX overshoot form with
TARGET_AR_MAX = 16/9. The label-vs-node term skips a label's own element box
(the label is part of that element's merged bounds, so charging it would always
add a constant equal to the label area).

Inline tests cover AC1.1-AC1.8: hand-computed node/label/connector overlaps and
aspect penalties, weighted_cost linearity, empty/single-element finiteness, a
node_overlap shuffle-invariance proptest, and scale invariance. The AC1.8 test
corrects the plan's scoping note: with fixed-pixel element boxes (the design's
load-bearing renderer-fidelity invariant), only the topological crossings term
is exactly scale-invariant; node_connector_overlap/edge_length_cv/aspect_penalty
inherit a fixed-pixel offset and are scale-sensitive (asymptotically invariant),
so the test asserts crossings invariance and pins node_connector_overlap's
documented scale-sensitivity instead.
Drop the now-stale #[allow(dead_code)] on the diagram::common rect helpers that
the layout::metrics core consumes (rect_width/rect_height/rect_area/
rect_overlap_area/segment_length_in_rect). Task 2 added those attributes as a
forward reference because the production callers had not landed yet; now that
metrics.rs uses them they are reachable, so the suppression is removed and the
explanatory comment updated. rect_contains_point keeps its allow since no
non-test caller needs it yet.
…ment

The label_overlap term summed label-vs-label overlap plus label-vs-node
overlap, but the node boxes it compared labels against were the
label-merged `aux_bounds`/`stock_bounds`/`flow_bounds` (which union each
element's own label into the returned rect). So a label-vs-label overlap
was charged again inside the other node's merged box, inflating the
term's absolute magnitude. That magnitude matters because Phase 4
calibrates label_overlap's weight against it.

Expose label-free shape boxes (`aux_shape_bounds`/`stock_shape_bounds`/
`flow_shape_bounds`) -- the geometry each `*_bounds` already computes
before merging the label, refactored into a single source of truth so
renderer output stays byte-identical -- and compare labels against the
OTHER element's bare shape box. Modules/clouds already exclude their
label from `*_bounds`, so they are their own shape box. This cleanly
separates "label lands on another label" from "label lands on another
node's shape", counting each label pair exactly once.

Also narrow the AC1.8 scale-invariance comment: crossings are exactly
scale-invariant only for crossings interior to both connectors. A
crossing grazing a fixed-size node boundary (the polylines are clipped
to those boundaries) can flip under uniform scale. The test fixture's
crossing is at the center of the square the links form, far from every
node box, so the test remains correct.
The best_seed and median_seed lowest-seed tie-breaks in
ModelStats::from_samples already had dedicated tests, but worst_seed
did not. worst_seed uses a non-obvious construct -- max_by(...) followed
by a reversed seed comparison (.then(y.seed.cmp(&x.seed))) -- precisely
so the lowest seed wins when two samples tie at the maximum weighted
cost. Add a focused test that pins this behavior: two samples share the
max cost at different seeds (plus a lower-cost third), and worst_seed
must be the lower of the tied seeds. The test was verified non-vacuous
by temporarily flipping the tie-break direction (it failed, picking the
higher seed) before reverting.
For each loaded corpus model, lay out the main model once per sampled
seed (rayon par_iter, mirroring generate_best_layout), score each layout
with compute_layout_metrics, and reduce the costs to a weighted scalar.
Collect MetricSamples into ModelStats::from_samples and aggregate into a
CorpusReport, printing a per-model median/p25/p75/best-of-k line plus the
corpus geomean-of-medians.

The weighted_cost uses a Phase-3 PLACEHOLDER MetricWeights (overlap and
crossings dominant; sprawl/edge-cv/aspect moderate; reserved structure
terms zero), since MetricWeights::default() is all-zeros until Phase 4
calibrates and commits the real weights. Phase 4 must switch this sweep
to MetricWeights::default() once those land.

The parallel results are sorted back into seed order before summarizing,
so the sample vector and every derived statistic are invariant to rayon
scheduling. Note that generate_layout_with_config is itself not
deterministic per seed -- the same (model, seed) yields slightly
different layouts even serially within one process, traced to
per-process-randomized HashMap iteration order in the layout pipeline.
That is a pre-existing layout-engine issue independent of this sweep.
generate_layout / generate_layout_with_config / generate_best_layout
produced different diagrams for the same (model, annealing_random_seed)
on repeated serial calls in one process. The annealing RNG is already
deterministic (StdRng::seed_from_u64); the run-to-run entropy came from
per-instance-random std HashMap iteration order in two layout functions.

In run_sfdp_with_rigid_chains, two loops over var_to_node (a HashMap)
were order-sensitive: the centroid accumulation summed float positions
(float addition is non-associative, so hash order perturbs the centroid)
and the auxiliary initial-placement loop assigned each unpositioned aux a
polar seed angle by its hash-iteration rank, so each aux got a different
seed position that SFDP diverged into a different final layout. Both loops
now iterate a once-materialized sorted Vec of the entries. var_to_node
stays a HashMap (it is still used as a .get() lookup table elsewhere).

The incremental path (incremental_layout -> diff_connectors) had the same
class of bug, which the original report did not cover: new causal links
were created while iterating a HashSet of new edges, and each new link
both allocates a sequential uid and is appended to the view in that order,
so hash order produced different uids and element ordering for the same
logical link. The two HashSet/HashMap-driven link loops now iterate sorted
keys, and the alias-match fallback picks the lowest matching key.

Adds tests/layout.rs determinism tests for both the fresh and incremental
paths: SIR previously differed in 8/15 (fresh) and 6-7/17 (incremental)
elements between two serial calls; both are now 0.
Add serde::Deserialize alongside the existing Serialize on LayoutMetrics
and MetricWeights (metrics.rs) and add serde::Serialize+Deserialize to
MetricSample/ModelStats/CorpusReport (eval_stats.rs) so a full CorpusReport
-- including each model's per-seed samples -- round-trips through JSON. The
samples must survive so compare() can re-run Mann-Whitney U over the
seed-sample cost sets. Comparison/ModelComparison gain Serialize only (the
diff is recomputed every run, never read back).

The layout_eval example now resolves a baseline diff: LAYOUT_EVAL_WRITE_BASELINE=1
writes the run's CorpusReport to the committed examples/layout_eval_baseline.json;
a normal run reads that file, runs compare(baseline, candidate), prints the
per-model delta_ratio + p_value + significance plus the aggregate, and embeds
the Comparison into metrics.json (the baseline_comparison slot widened from
Option<()> to Option<Comparison>) and the index.html header. An absent baseline
skips the diff with a note.

The committed baseline is seeded from a small representative subset
(LAYOUT_EVAL_MODELS=sir,teacup LAYOUT_EVAL_SEEDS=8) to keep the run fast and
the JSON modest. It captures CURRENT pre-Rung-0 behavior scored with the
Phase-3 PLACEHOLDER_WEIGHTS. It MUST be regenerated after Phase 4 commits the
calibrated weights (and the example switches to MetricWeights::default()) and
before Phase 5 measures Rung 0's improvement -- see the sibling
layout_eval_baseline.README.md. Since layout is now deterministic per seed,
seeding then diffing candidate==baseline yields exactly-zero, non-significant
deltas (consistent with AC4.5).
Wrap each model's full pipeline (load -> seed sweep -> render) in a new
process_model() returning Result<(ModelStats, ModelRenders), String>, the
single model-level skip-on-failure boundary. main() WARN-logs any per-model
Err and continues to the next model, omitting the failed model from every
artifact; the harness always reaches the end and exits 0 (even a fully-empty
corpus is exit 0 with warnings).

Three failure modes funnel through the Result, validated in the order data
flows: a load/parse failure (already surfaced by load_model), a layout that
fails on EVERY seed (zero usable samples -- now a model-level skip rather than
a degenerate all-zero report entry), and render failures (kept non-fatal
inside render_model, so a model with valid scores but a missing PNG still
appears). A partial per-seed failure never sinks a model: the skip fires only
when no seed produced a usable sample.

Also refreshes the now-stale sweep_model doc comment: generate_layout_with_config
is deterministic per seed since fix #633, so the prior note about run-to-run
median/spread drift no longer applies.
The four built-in default_projects (fishbanks, logistic-growth, population,
reliability) are hand-laid-out reference layouts and are the primary good-layout
taste anchors for the Phase 4 metric calibration. Added them to the eval corpus
so the sweep can score their reference views against generated layouts.
… boxes

These two layout-quality terms previously iterated the label-merged
node boxes (node_box), which conflated shape-vs-shape overlap with
label overlap and charged a connector for merely passing under a
label. The user's calibrated taste treats node shapes overlapping
other node shapes, and a connector passing under a non-incident node
SHAPE (it reads as a false causal connection at a glance), as the
real costs; a connector passing only under a semi-transparent LABEL
is mild noise, and label collisions already belong to label_overlap.

node_overlap now sums pairwise overlap over the bare shape boxes
(node_shape_box) and normalizes by total shape-box area;
node_connector_overlap charges a connector for the length it spends
inside non-incident shape boxes. node_shape_boxes is hoisted to be
computed once and shared with label_overlap. sprawl and
aspect_penalty intentionally keep using the label-merged node_box
because the view's visual extent and characteristic node size include
labels. MetricWeights, weighted_cost, crossings, and loop_compactness
are untouched.
…chments

build_view_segments named a link's endpoint vertices elem_{from_uid}/
elem_{to_uid} but a flow's pipe vertices flow_{uid}#{i}. do_segments_intersect
only suppresses a crossing when two segments share a from_node/to_node string,
so a link incident on a flow -- terminating at the flow's valve
(link.to_uid == flow.uid) or at a stock/cloud the flow attaches to -- grazed
the pipe at the connection point with mismatched node names and was miscounted
as a real crossing.

The flow branch now names each pipe vertex to share the elem_{uid} name the
incident link/stock/cloud already uses: an attached point becomes
elem_{attached_to_uid}, and the valve (which lies on the pipe, not at a stored
point) is injected as an elem_{flow.uid} vertex by splitting the pipe segment
it sits on. A genuinely free interior point keeps the per-flow flow_{uid}#{i}
name, so a real mid-span crossing -- segments sharing no element with the flow
-- is still counted. Consecutive pipe sub-segments share the joining vertex
name, so a flow never self-crosses.

This fixes the dp_logistic_growth reference miscount (two links terminating at
the net birth rate valve reported 2 spurious crossings) that was distorting the
Phase 4 layout-quality calibration; the reference now scores crossings == 0.
loop_compactness was a hardcoded 0.0 reserved field. Compute and populate
it so the layout-quality eval sweep can report a per-model value. The
committed weight stays 0 (MetricWeights::default and weighted_cost are
unchanged); calibration happens in a later phase. The user wants to see
what the term reports before it influences any optimizer.

The measure is isoperimetric over the view's feedback cycles. Build a
directed graph over positioned node-box elements (aux/stock/flow/module/
cloud): each Link contributes from_uid -> to_uid, and each Flow contributes
source_attached -> flow.uid -> dest_attached through its own valve, so a
stock--flow--stock feedback path is part of the graph. Enumerate simple
directed cycles with a bounded DFS (each cycle's smallest uid is the search
root, so a cycle is discovered once), capped at 12 nodes per cycle and 64
cycles total -- SD diagrams are tiny, so this stays O(small) and total. For
each cycle of >= 3 distinct positioned nodes, take the node-box centers in
cycle order, compute the shoelace area and summed-edge perimeter, and form
the isoperimetric quotient Q = 4*PI*Area / Perimeter^2 clamped to [0,1]
(Q=1 a perfect circle, Q->0 collinear/degenerate). The per-cycle penalty is
1 - Q; loop_compactness is the mean penalty over all qualifying cycles, or
0.0 when there is no cycle of >= 3 nodes. It thus rewards clean, well-spread
loops and penalizes collapsed/collinear ones.

Centers come from the bare shape box (node_shape_box), which is symmetric
about the element position -- unlike the label-merged node_box, whose label
offset would skew the polygon. Determinism does not rely on layout seed
alone: adjacency targets are sorted, the search visits nodes in sorted uid
order, every cycle is canonicalized (rotated to its smallest uid) and
de-duplicated, and every per-cycle guard returns None rather than a NaN, so
the mean is identical for a given view regardless of element ordering.
…sion sensitive)

The old label_overlap divided the corpus-wide sum of label-vs-label and
label-vs-shape overlap areas by the corpus's total label area. That
under-counted a small-but-readability-killing overlap: a node circle
clipping the last couple of characters of a short label has a small
absolute overlap area, and dividing it by the total label area of every
label in the diagram diluted it to near zero (~0.004 on the eval corpus),
so the term could not distinguish "one label is 30% obscured" from "nothing
is obscured."

Redefine label_overlap as a per-label obscuration measure: for each labeled
element L with box B_L, approximate the area of B_L covered by any OTHER
label box or any OTHER element's bare shape box as the sum of the individual
overlap areas, capped at area(B_L) so the obscured fraction is in [0,1];
label_overlap is the SUM of those fractions over all labels. A small clip
now registers at its true obscuration fraction (~0.3 for the fixture above)
instead of ~0.004. The capped sum is a monotone proxy for the exact covered
area -- a pixel-exact union is unnecessary, and more/larger overlaps never
decrease a label's fraction.

This preserves the Phase-1 guards: a label is never charged against its own
element's shape box, and the comparison uses the label-free shape box
(node_shape_box), not the label-merged box, so label-vs-label coverage is
not re-counted via another node's merged bounds. A mutual label-label
collision is charged from both labels' perspectives -- intended, since both
are unreadable. area(B_L) == 0 is guarded (skip), so the term stays finite.

MetricWeights/weighted_cost are unchanged (calibration is a later phase). On
the eval corpus the within-model worse-is-higher anchor is preserved (best
<= median <= worst label_overlap for every model) while the magnitudes rise
to reflect true per-label obscuration: e.g. reliability best 0.127 -> 0.937,
fishbanks best 0.019 -> 0.312.
Replace the Phase-1 all-zeros placeholder Default impl for MetricWeights
with the calibrated production weights from the Phase 3 contact-sheet
calibration, with explicit user sign-off (2026-05-23):

  node_overlap            = 1.0
  node_connector_overlap  = 1.0
  label_overlap           = 1.0
  crossings               = 1.0
  sprawl                  = 0.0
  edge_length_cv          = 0.0
  aspect_penalty          = 0.0
  loop_compactness        = 0.25
  chain_straightness      = 0.0

Failure-mode rationale -- readability >> compactness. The dominant
concerns (node-shape overlap, connectors under node shapes, obscured
labels, edge crossings) carry weight 1: they make a diagram unreadable
or assert false causal connections. sprawl, edge_length_cv, and
aspect_penalty are intentionally 0 -- compactness and aspect ratio are
NOT goals; spreading out to keep labels legible and feedback loops
visible is good, not penalized. loop_compactness is a low 0.25 that
gently rewards drawing feedback loops as visible circles without
dominating the overlap/crossings family. chain_straightness stays 0
(reserved, not yet computed).

The old AC5.1 test asserted the default was all-zeros so weighted_cost
was inert; that is no longer true by design. It is replaced by a
dominance-ordering test that pins the readability-dominant relationships
(each overlap/crossings term strictly exceeds each compactness term;
sprawl/edge_length_cv/aspect_penalty/chain_straightness == 0;
0 < loop_compactness < node_overlap) and re-confirms weighted_cost under
the default equals the explicit sum of weighted terms.
Encode the user's taste anchors as an objective validation of the
calibrated MetricWeights::default(): on the agreed reference pairs the
shipped hand-authored (human) layout must score a lower weighted_cost
than a machine-generated (auto) layout of the same model.

Construction (b) -- human view vs generated layout. For each anchor we
load the default_projects XMILE model, score its as-loaded main view
(human), generate a single fixed-seed layout (auto), and assert
human < auto under the committed weights. One fixed seed (not
generate_best_layout) keeps the test deterministic (layout is
deterministic per seed, #633) and fast: all five tests run in well under
half a second over the small (<= 42 element) default_projects.

Anchor costs under the committed weights (human < auto):
  reliability:      human 0.255 < auto 2.997
  fishbanks:        human 0.194 < auto 1.603
  population:       human 0.052 < auto 0.053
  logistic-growth:  human 0.089 < auto 0.154

sir is deliberately NOT a human<auto anchor: its shipped reference
obscures more labels than the auto layout (human 0.458 > auto 0.039), so
the metric correctly prefers the auto. That direction is pinned by a
separate documented test so the asymmetry is recorded, not silently
assumed.
Now that Phase 4 has committed the calibrated MetricWeights::default(),
the layout_eval sweep scores every weighted_cost with the committed
default instead of the Phase-3 PLACEHOLDER_WEIGHTS const. Remove the
PLACEHOLDER_WEIGHTS const (and its 'Phase 4 will replace' doc comment)
and route the three call sites (per-seed sweep, render scoring, and the
report builder) through MetricWeights::default().

Re-seed the committed examples/layout_eval_baseline.json from current
behavior under the calibrated weights, over the same small documented
subset (teacup,sir at M=8) so the committed file stays small. The
recorded per-seed sample costs change because the weights changed, so
the old baseline was stale. Update the sibling README to note the
2026-05-23 re-seed with the committed weights and drop the now-done
'regenerate after Phase 4' caveat.
…tion

The population reference-pair anchor beats the auto layout by only ~2.3%
under the committed default weights, far thinner than the other anchors
(reliability ~8.5%, fishbanks ~12%, logistic-growth ~58%). Document that
it is a marginal taste anchor so a future failure is read as a near-boundary
case rather than necessarily a metric regression; the assertion is unchanged.

Also note that canonicalize_cycle normalizes rotation but not traversal
direction. A reverse-direction duplicate essentially never arises for
directed SD feedback loops, and even if it did the shoelace polygon area is
direction-invariant, so both yield the identical isoperimetric penalty.
bpowers added 6 commits May 23, 2026 13:22
Add a dedicated layout_selection_tests.rs (wired via #[path], following the
crossings_tests.rs precedent so layout_tests.rs stays under the per-file line
cap) covering AC6.1: select_best_layout minimizes weighted_cost, not crossings.

The headline case builds two candidates whose crossing ordering is the
inverse of their weighted_cost ordering -- the lowest-cost view actually has
MORE connector crossings (asserted via count_view_crossings, so the inversion
is real, not narrative) -- and confirms the lowest-cost view wins, which the
retired crossings-only rule would not have done. Companion cases pin the
lowest-seed tie-break and NaN safety: a NaN challenger never displaces a
finite running best, while a NaN seeded first is sticky under the current
fold (pinned as a documented limitation rather than silently assumed away).
Add a fast, deterministic guard (AC7.1) over three tiny programmatic models
built with TestProject -- a population feedback model, a pure aux dependency
chain, and a two-stock transfer. Each is laid out at the fixed annealing seed
42 and its calibrated weighted_cost asserted at or below a committed per-model
ceiling. The ceilings are observed-cost ceilings (observed at seed 42:
pop 0.0533, chain 0.0, two_stock 0.1646) set a small margin above the
observed value, with a comment documenting how to regenerate them after an
intentional metric/weight change.

For AC7.2, a companion test takes a real fixed-seed layout and piles every
node onto the origin (blowing up node_overlap to ~6.2) and asserts the result
exceeds the pop ceiling, proving the guard discriminates good layouts from
bad rather than passing vacuously. The whole guard runs in well under a
second (tiny models, one seed each).
Add an AC8.1 determinism test: laying out the same model twice through
generate_layout_with_config at the same explicit seed yields two StockFlow
values that compare equal (StockFlow derives PartialEq, so every field --
positions, view box, element order -- is checked).

The test uses a seed-sensitive model (a stock fed/drained by ten leaf auxes
through two flows) where the SFDP/annealing RNG genuinely shapes the layout,
and additionally asserts that a different seed produces a different layout --
so the same-seed equality is a real determinism guarantee, not a vacuous pass
on a tiny model whose arrangement ignores the seed. A comment notes this
per-seed reproducibility is distinct from the Phase 3 M-seed statistical
sweep, which deliberately varies the seed.
…s of order

The fold in select_best_layout kept the running best whenever the
challenger's `<` comparison was false. That correctly drops a NaN-cost
challenger (`NaN < finite` is false), but it also failed to let a finite
challenger overtake a NaN running best (`finite < NaN` and `finite ==
NaN` are both false). So a degenerate NaN-cost layout from the first seed
in the fixed order [42, 123, 456, 789] was sticky: generate_best_layout
would ship it even when a later seed produced a usable finite layout.

Add explicit NaN branches so a NaN challenger never wins and a finite
challenger always beats a NaN running best, making the guard
order-independent. If every candidate is NaN the earliest is kept
deterministically. The metric is NaN-free in practice
(compute_layout_metrics guards every division), so this is defensive
robustness rather than an observed production failure.
Document the layout-quality eval branch's contract and structure changes in
src/simlin-engine/CLAUDE.md: the new pure metrics.rs (LayoutMetrics /
MetricWeights / compute_layout_metrics / weighted_cost) and eval_stats.rs
modules, the per-seed bit-identical layout determinism guarantee (#633), the
shift of generate_best_layout/select_best_layout from fewest-crossings to the
calibrated weighted_cost (NaN-safe), the rebuilt count_view_crossings /
build_view_segments crossing geometry, the now-public LAYOUT_SEEDS, the
diagram submodules widened to pub(crate) plus connector_polyline and the
label-free *_shape_bounds / rect-segment helpers they share with the metric,
and the gated layout_eval example.
@bpowers
Copy link
Copy Markdown
Owner Author

bpowers commented May 23, 2026

@codex review

@bpowers bpowers force-pushed the layout-quality-eval branch from 7f00822 to 37e2719 Compare May 23, 2026 20:25
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f00822399

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/simlin-engine/src/layout/metrics.rs Outdated
Comment on lines +393 to +395
x: (r.left + r.right) / 2.0,
y: (r.top + r.bottom) / 2.0,
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use valve position for flow loop geometry

build_loop_graph computes each node center from the shape-box midpoint, but for flows that midpoint is not the valve position because flow_shape_bounds now includes the entire pipe polyline (src/simlin-engine/src/diagram/flow.rs), which is often asymmetric when users drag the valve along the pipe. This makes loop_compactness evaluate loop polygons with shifted flow vertices, so the new production weighted_cost can rank seeds based on pipe endpoint extents instead of actual loop geometry. A flow node in loop scoring should use (flow.x, flow.y) (the valve) rather than the flow bounding-box center.

Useful? React with 👍 / 👎.

@claude
Copy link
Copy Markdown

claude Bot commented May 23, 2026

Code review

Reviewed the production-impacting changes in depth (the eval harness/docs are out of the critical path):

  • diagram/connector.rsarc_geometry/render_arc factor-out is guarded by a byte-identical SVG regression test; connector_polyline reproduces the drawn geometry faithfully (straight branch matches render_straight_line's clip endpoints; arc branch samples center-to-center exactly as render_arc draws). sample_arc's SVG endpoint→center reconstruction is self-consistent with the fixed circ/flags and is exercised by the arc-crossing test.
  • diagram/common.rs — Liang-Barsky segment_length_in_rect and the rect helpers are correct, including the parallel/outside and clamp cases.
  • diagram/{elements,flow}.rs*_shape_bounds split preserves *_bounds output (bounding-box union is order-insensitive).
  • layout/mod.rsbuild_view_segments valve injection / vertex naming correctly suppresses graze-at-valve while still counting genuine mid-span crossings; select_best_layout NaN handling is order-independent for finite candidates and deterministic for the all-NaN case; the diff_connectors and run_sfdp_with_rigid_chains determinism fixes materialize sorted iteration in every layout-affecting HashMap/HashSet loop.
  • layout/metrics.rs — every term guards its denominator (finite, NaN-free); node_overlap/node_connector_overlap/label_overlap use the bare shape boxes to avoid the documented double-counts; loop-cycle enumeration is bounded, canonicalized, and order-independent; weighted_cost matches MetricWeights::default().
  • layout/eval_stats.rs — geomean/type-7 percentile/Mann-Whitney (tie + continuity correction) formulas are correct, with documented non-NaN defaults.

Tests are non-vacuous (crossing inversions asserted via count_view_crossings, determinism checked with a seed-sensitivity counter-case, regression-guard ceilings backed by a degenerate-layout rejection test).

No blocking issues found.

Overall correctness verdict: correct

build_loop_graph computed each loop-polygon vertex as the midpoint of the
element's bare shape box. For flows that box is flow_shape_bounds -- the valve
box unioned with every pipe point -- so its midpoint is the pipe-extent center,
which drifts off the valve (flow.x, flow.y) whenever the valve is dragged
off-center or the pipe is bent. Flows are vertices in the loop graph
(stock->flow->stock edges), so loop_compactness, and therefore the weighted_cost
that Phase-5 Rung-0 select_best_layout minimizes, was being skewed by pipe
geometry rather than true loop geometry.

Compute each loop vertex via diagram::connector::get_visual_center -- the same
visual center the SVG renderer uses -- which returns the valve for a flow and
the element center (the symmetric shape-box midpoint) for aux/stock/module/cloud,
so only flow vertices change. The node-membership gate stays node_shape_box. This
restores the module's same-geometry-as-renderer invariant and corrects the now-
stale LoopGraph/build_loop_graph doc comments that claimed the shape box is
always the element center. Addresses the PR #637 review comment.
@claude
Copy link
Copy Markdown

claude Bot commented May 23, 2026

Automated review — PR #637 (layout-quality eval + Rung 0)

I reviewed the production-impacting changes and the new cores in detail:

  • Connector refactor (connector.rs)arc_geometry/render_arc factor-out is guarded by a byte-identical SVG regression test; sample_arc's endpoint→center arc parametrization reconstructs the drawn arc consistently with the SVG flags, and the bulge-direction + translation/rotation-invariance tests confirm the sampled geometry.
  • build_view_segments / count_view_crossings rewrite — all element kinds resolved by uid (Module/Alias no longer dropped), valve injected as elem_{flow.uid}, and the elem_/link_#i/flow_#i vertex naming correctly drives do_segments_intersect's shared-endpoint suppression. The flow-vs-link miscount fix and a genuine mid-span crossing are both covered.
  • Determinism fixes (layout: generate_layout_with_config is nondeterministic per seed (HashMap/HashSet iteration order in force accumulation) #633)diff_connectors (sorted new-edge / lowest-key alias-match / sorted preserved-link loops) and run_sfdp_with_rigid_chains (sorted var_to_node materialization for both the centroid sum and the aux-placement rank) remove the HashMap-order nondeterminism; guarded by per-seed bit-identical tests for both the fresh and incremental paths.
  • select_best_layout — NaN-safe and order-independent (finite always beats NaN; cost ties break to lowest seed regardless of order).
  • metrics.rs — every term guards its zero denominator (finite, no NaN on empty/single-element views); overlap/connector terms correctly use bare shape boxes vs. label-merged boxes to avoid double-counting; loop_compactness cycle enumeration is complete (each cycle found from its min node) and order-independent.
  • eval_stats.rs — geomean, type-7 percentile/median, and tie+continuity-corrected Mann-Whitney U match their reference definitions; documented finite defaults on degenerate input.
  • flow.rs/elements.rs*_shape_bounds extraction is union-preserving, so *_bounds behavior is unchanged.

No blocking correctness issues found. Test coverage is thorough and matches the documented contracts. (annealing::count_crossings remains O(n²) and arc sampling raises the per-seed segment count ~16×, but this is on the once-per-seed fresh-layout path, is dominated by annealing, and is already documented at annealing.rs:82 — not introduced or hidden here.)

Overall correctness: correct.

@bpowers
Copy link
Copy Markdown
Owner Author

bpowers commented May 23, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35ef355faa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/simlin-engine/src/layout/metrics.rs Outdated
Comment on lines +679 to +680
for seg in c.polyline.windows(2) {
inside += segment_length_in_rect(&seg[0], &seg[1], rect);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Cap per-segment node connector coverage once

node_connector_overlap sums segment_length_in_rect across every non-incident node box, so when two node boxes overlap, the same connector sub-segment is counted multiple times. That makes this "fraction of connector length" exceed 1.0 and can disproportionately inflate weighted_cost, causing generate_best_layout to reject otherwise better layouts just because overlap regions were double-counted. The overlap should be measured as a union length per segment (or otherwise de-duplicated/clamped) before dividing by total_connector_length.

Useful? React with 👍 / 👎.

…oss boxes)

node_connector_overlap is documented as a "fraction of total connector
length", but it accumulated segment_length_in_rect for every non-incident
node shape box, summing the per-box scalar lengths. When two non-incident
shape boxes overlapped, the connector sub-length in the overlap region was
counted once per box, so the term could exceed 1.0 -- breaking the
fraction-in-[0,1] contract, over-inflating weighted_cost (this term's weight
is 1.0), and skewing Phase-5 Rung-0 select_best_layout. Overlapping shape
boxes are common because a Flow's node_shape_box is its whole-pipe bounding
box, which frequently overlaps stocks/auxes/other flows.

The fix unions the per-segment clip intervals across all non-incident boxes
before summing, so each physical sub-length is counted at most once. A new
pub(crate) segment_clip_interval_in_rect exposes the Liang-Barsky [t0, t1]
core; segment_length_in_rect now delegates to it (behavior-preserving), and
the metric merges the intervals per segment. Single-non-incident-box layouts
are unchanged (nothing to merge); overlapping-box layouts see the term
decrease toward the true union fraction, lowering weighted_cost in the safe
direction so the AC5.2 human<auto anchors and the AC7 regression-guard
ceilings continue to hold. Addresses the PR #637 review comment.
@bpowers
Copy link
Copy Markdown
Owner Author

bpowers commented May 24, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Another round soon, please!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bpowers bpowers merged commit 930bb5f into main May 24, 2026
10 of 13 checks passed
@bpowers bpowers deleted the layout-quality-eval branch May 24, 2026 02:12
@claude
Copy link
Copy Markdown

claude Bot commented May 24, 2026

Review — layout-quality eval + Rung-0 selection

I focused on the code that newly runs in the production layout path (generate_best_layoutcompute_layout_metrics), the render_arc/connector_polyline factor-out (must stay byte-identical), the build_view_segments flow-valve crossing fix, the select_best_layout NaN handling, and the #633 determinism fixes. These all look correct: the NaN-safe selection is order-independent and ties break to the lowest seed; the arc factor-out is guarded by test_render_arc_svg_byte_identical; the metric terms each guard their zero-denominators so weighted_cost is finite; and the diff_connectors / run_sfdp_with_rigid_chains sorts make the per-seed layout reproducible. The benchstat statistics (geomean, type-7 percentile, mann_whitney_u) also check out.

One finding below.

[P2] Cycle enumeration has no global exploration budget (now on the production layout path)

src/simlin-engine/src/layout/metrics.rs (lines ~519-548, dfs_cycles)

compute_loop_compactness now runs for every candidate seed inside generate_best_layout. Its dfs_cycles DFS is bounded only by the per-path length cap (path.len() < MAX_CYCLE_LEN, 12) and the found-cycles cap (found.len() >= MAX_CYCLES, 64). The MAX_CYCLES short-circuit only fires after a cycle is recorded, so on a large, dense, but cycle-sparse directed graph (e.g. a wide/deep acyclic causal graph — a forecasting/calculation model with little feedback) few or no paths return to start, the cap never triggers, and the search enumerates essentially all simple paths of length ≤12 through the nodes with uid > start. That is combinatorial (up to ~d^12 for out-degree d) and can dominate layout generation, run ×4 seeds. This contradicts the inline claim at lines ~372-374 that the caps make enumeration "O(small) and total" / that "enumeration stops once the cap is hit". Severity depends on the input: feedback-rich SD models hit the cycle cap immediately and are unaffected, and no model in the current corpus triggers it (CI is green); the risk is a large near-acyclic model. A small global node-visit/step budget threaded through dfs_cycles (bailing the whole enumeration, like MAX_CYCLES already does) would bound the worst case and make the comment accurate.


Overall correctness verdict: correct. The production-impacting logic and the byte-identical render refactor are sound and well-tested; the one finding above is a latent worst-case performance concern (not a correctness break) that only bites on an uncommon large near-acyclic model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant