Layout quality evaluation: metric core, corpus sweep, calibration, and Rung 0#637
Conversation
Documents infrastructure for iterating on simlin-engine's automatic diagram layout. Today layout judges itself by edge-crossing count alone and cannot be rendered outside the browser; this plan closes both gaps with a pure, geometry-accurate LayoutMetrics suite scored on the same geometry the PNG renderer draws, a benchstat-style statistics core that treats layout quality as a distribution over seeds (median, geomean, Mann-Whitney U) rather than one fixed-seed sample, and an on-demand corpus sweep example that renders and scores layouts against human-authored reference views. The first algorithm step (Rung 0) re-points select_best_layout to rank seeds by the full weighted_cost instead of crossings-only, and the crossing count moves from a straight-chord approximation to polyline sampling so curved/bent connectors are measured faithfully. Rungs 1-3 (parameter search, metric-driven annealing cost, new layout passes) are documented as the forward path the harness is built to support.
Addresses a codebase-accuracy review of the design plan so it does not mislead the implementation planner: - The diagram geometry helpers (elements/flow/label/connector) live in private modules today, so Phase 1 must first expose them pub(crate) before layout::metrics can reuse them; this was previously presented as free. - The Arc->polyline factoring must keep render_svg byte-identical (a TS-vs-Rust parity test asserts it); flagged as the highest-effort Phase 1 item with that constraint as an acceptance condition. - MultiPoint links currently render to nothing, so AC2.2 focuses on the Arc case and MultiPoint is documented as a known renderer gap. - Corpus loading uses open_xmile/open_vensim + salsa sync (verify_layout is only an assertion helper, not a loader), and the per-seed seam is the existing generate_layout_with_config, not a new function. - AC8.1 (determinism) is now covered by Phase 5; AC8.2 is noted as satisfied by the design document itself.
Add a pure layout::metrics module computing one quality cost per aesthetic concern (0 = ideal): node_overlap, node_connector_overlap, label_overlap, crossings, sprawl, edge_length_cv, aspect_penalty, plus the reserved zero-weighted chain_straightness/loop_compactness. weighted_cost is the linear combination an optimizer minimizes; MetricWeights::default() is all-zeros so any accidental use before the Phase 4 calibration is inert. compute_layout_metrics is Functional Core: it takes a StockFlow plus a LayoutConfig (kept for the design's optimizer signature; presently unused since box geometry comes from the diagram helpers) and performs no I/O. Every term is sourced from the same geometry the renderer draws -- the diagram *_bounds and label helpers, connector_polyline, and the shared build_view_segments -- so a layout's score can never disagree with its rendered SVG, and the crossings term exactly equals count_view_crossings. Every division guards a zero denominator, so empty and single-element views yield all-zero, NaN-free metrics. aspect_penalty uses the documented ar - TARGET_AR_MAX overshoot form with TARGET_AR_MAX = 16/9. The label-vs-node term skips a label's own element box (the label is part of that element's merged bounds, so charging it would always add a constant equal to the label area). Inline tests cover AC1.1-AC1.8: hand-computed node/label/connector overlaps and aspect penalties, weighted_cost linearity, empty/single-element finiteness, a node_overlap shuffle-invariance proptest, and scale invariance. The AC1.8 test corrects the plan's scoping note: with fixed-pixel element boxes (the design's load-bearing renderer-fidelity invariant), only the topological crossings term is exactly scale-invariant; node_connector_overlap/edge_length_cv/aspect_penalty inherit a fixed-pixel offset and are scale-sensitive (asymptotically invariant), so the test asserts crossings invariance and pins node_connector_overlap's documented scale-sensitivity instead.
Drop the now-stale #[allow(dead_code)] on the diagram::common rect helpers that the layout::metrics core consumes (rect_width/rect_height/rect_area/ rect_overlap_area/segment_length_in_rect). Task 2 added those attributes as a forward reference because the production callers had not landed yet; now that metrics.rs uses them they are reachable, so the suppression is removed and the explanatory comment updated. rect_contains_point keeps its allow since no non-test caller needs it yet.
…ment The label_overlap term summed label-vs-label overlap plus label-vs-node overlap, but the node boxes it compared labels against were the label-merged `aux_bounds`/`stock_bounds`/`flow_bounds` (which union each element's own label into the returned rect). So a label-vs-label overlap was charged again inside the other node's merged box, inflating the term's absolute magnitude. That magnitude matters because Phase 4 calibrates label_overlap's weight against it. Expose label-free shape boxes (`aux_shape_bounds`/`stock_shape_bounds`/ `flow_shape_bounds`) -- the geometry each `*_bounds` already computes before merging the label, refactored into a single source of truth so renderer output stays byte-identical -- and compare labels against the OTHER element's bare shape box. Modules/clouds already exclude their label from `*_bounds`, so they are their own shape box. This cleanly separates "label lands on another label" from "label lands on another node's shape", counting each label pair exactly once. Also narrow the AC1.8 scale-invariance comment: crossings are exactly scale-invariant only for crossings interior to both connectors. A crossing grazing a fixed-size node boundary (the polylines are clipped to those boundaries) can flip under uniform scale. The test fixture's crossing is at the center of the square the links form, far from every node box, so the test remains correct.
The best_seed and median_seed lowest-seed tie-breaks in ModelStats::from_samples already had dedicated tests, but worst_seed did not. worst_seed uses a non-obvious construct -- max_by(...) followed by a reversed seed comparison (.then(y.seed.cmp(&x.seed))) -- precisely so the lowest seed wins when two samples tie at the maximum weighted cost. Add a focused test that pins this behavior: two samples share the max cost at different seeds (plus a lower-cost third), and worst_seed must be the lower of the tied seeds. The test was verified non-vacuous by temporarily flipping the tie-break direction (it failed, picking the higher seed) before reverting.
For each loaded corpus model, lay out the main model once per sampled seed (rayon par_iter, mirroring generate_best_layout), score each layout with compute_layout_metrics, and reduce the costs to a weighted scalar. Collect MetricSamples into ModelStats::from_samples and aggregate into a CorpusReport, printing a per-model median/p25/p75/best-of-k line plus the corpus geomean-of-medians. The weighted_cost uses a Phase-3 PLACEHOLDER MetricWeights (overlap and crossings dominant; sprawl/edge-cv/aspect moderate; reserved structure terms zero), since MetricWeights::default() is all-zeros until Phase 4 calibrates and commits the real weights. Phase 4 must switch this sweep to MetricWeights::default() once those land. The parallel results are sorted back into seed order before summarizing, so the sample vector and every derived statistic are invariant to rayon scheduling. Note that generate_layout_with_config is itself not deterministic per seed -- the same (model, seed) yields slightly different layouts even serially within one process, traced to per-process-randomized HashMap iteration order in the layout pipeline. That is a pre-existing layout-engine issue independent of this sweep.
generate_layout / generate_layout_with_config / generate_best_layout produced different diagrams for the same (model, annealing_random_seed) on repeated serial calls in one process. The annealing RNG is already deterministic (StdRng::seed_from_u64); the run-to-run entropy came from per-instance-random std HashMap iteration order in two layout functions. In run_sfdp_with_rigid_chains, two loops over var_to_node (a HashMap) were order-sensitive: the centroid accumulation summed float positions (float addition is non-associative, so hash order perturbs the centroid) and the auxiliary initial-placement loop assigned each unpositioned aux a polar seed angle by its hash-iteration rank, so each aux got a different seed position that SFDP diverged into a different final layout. Both loops now iterate a once-materialized sorted Vec of the entries. var_to_node stays a HashMap (it is still used as a .get() lookup table elsewhere). The incremental path (incremental_layout -> diff_connectors) had the same class of bug, which the original report did not cover: new causal links were created while iterating a HashSet of new edges, and each new link both allocates a sequential uid and is appended to the view in that order, so hash order produced different uids and element ordering for the same logical link. The two HashSet/HashMap-driven link loops now iterate sorted keys, and the alias-match fallback picks the lowest matching key. Adds tests/layout.rs determinism tests for both the fresh and incremental paths: SIR previously differed in 8/15 (fresh) and 6-7/17 (incremental) elements between two serial calls; both are now 0.
Add serde::Deserialize alongside the existing Serialize on LayoutMetrics and MetricWeights (metrics.rs) and add serde::Serialize+Deserialize to MetricSample/ModelStats/CorpusReport (eval_stats.rs) so a full CorpusReport -- including each model's per-seed samples -- round-trips through JSON. The samples must survive so compare() can re-run Mann-Whitney U over the seed-sample cost sets. Comparison/ModelComparison gain Serialize only (the diff is recomputed every run, never read back). The layout_eval example now resolves a baseline diff: LAYOUT_EVAL_WRITE_BASELINE=1 writes the run's CorpusReport to the committed examples/layout_eval_baseline.json; a normal run reads that file, runs compare(baseline, candidate), prints the per-model delta_ratio + p_value + significance plus the aggregate, and embeds the Comparison into metrics.json (the baseline_comparison slot widened from Option<()> to Option<Comparison>) and the index.html header. An absent baseline skips the diff with a note. The committed baseline is seeded from a small representative subset (LAYOUT_EVAL_MODELS=sir,teacup LAYOUT_EVAL_SEEDS=8) to keep the run fast and the JSON modest. It captures CURRENT pre-Rung-0 behavior scored with the Phase-3 PLACEHOLDER_WEIGHTS. It MUST be regenerated after Phase 4 commits the calibrated weights (and the example switches to MetricWeights::default()) and before Phase 5 measures Rung 0's improvement -- see the sibling layout_eval_baseline.README.md. Since layout is now deterministic per seed, seeding then diffing candidate==baseline yields exactly-zero, non-significant deltas (consistent with AC4.5).
Wrap each model's full pipeline (load -> seed sweep -> render) in a new process_model() returning Result<(ModelStats, ModelRenders), String>, the single model-level skip-on-failure boundary. main() WARN-logs any per-model Err and continues to the next model, omitting the failed model from every artifact; the harness always reaches the end and exits 0 (even a fully-empty corpus is exit 0 with warnings). Three failure modes funnel through the Result, validated in the order data flows: a load/parse failure (already surfaced by load_model), a layout that fails on EVERY seed (zero usable samples -- now a model-level skip rather than a degenerate all-zero report entry), and render failures (kept non-fatal inside render_model, so a model with valid scores but a missing PNG still appears). A partial per-seed failure never sinks a model: the skip fires only when no seed produced a usable sample. Also refreshes the now-stale sweep_model doc comment: generate_layout_with_config is deterministic per seed since fix #633, so the prior note about run-to-run median/spread drift no longer applies.
The four built-in default_projects (fishbanks, logistic-growth, population, reliability) are hand-laid-out reference layouts and are the primary good-layout taste anchors for the Phase 4 metric calibration. Added them to the eval corpus so the sweep can score their reference views against generated layouts.
… boxes These two layout-quality terms previously iterated the label-merged node boxes (node_box), which conflated shape-vs-shape overlap with label overlap and charged a connector for merely passing under a label. The user's calibrated taste treats node shapes overlapping other node shapes, and a connector passing under a non-incident node SHAPE (it reads as a false causal connection at a glance), as the real costs; a connector passing only under a semi-transparent LABEL is mild noise, and label collisions already belong to label_overlap. node_overlap now sums pairwise overlap over the bare shape boxes (node_shape_box) and normalizes by total shape-box area; node_connector_overlap charges a connector for the length it spends inside non-incident shape boxes. node_shape_boxes is hoisted to be computed once and shared with label_overlap. sprawl and aspect_penalty intentionally keep using the label-merged node_box because the view's visual extent and characteristic node size include labels. MetricWeights, weighted_cost, crossings, and loop_compactness are untouched.
…chments
build_view_segments named a link's endpoint vertices elem_{from_uid}/
elem_{to_uid} but a flow's pipe vertices flow_{uid}#{i}. do_segments_intersect
only suppresses a crossing when two segments share a from_node/to_node string,
so a link incident on a flow -- terminating at the flow's valve
(link.to_uid == flow.uid) or at a stock/cloud the flow attaches to -- grazed
the pipe at the connection point with mismatched node names and was miscounted
as a real crossing.
The flow branch now names each pipe vertex to share the elem_{uid} name the
incident link/stock/cloud already uses: an attached point becomes
elem_{attached_to_uid}, and the valve (which lies on the pipe, not at a stored
point) is injected as an elem_{flow.uid} vertex by splitting the pipe segment
it sits on. A genuinely free interior point keeps the per-flow flow_{uid}#{i}
name, so a real mid-span crossing -- segments sharing no element with the flow
-- is still counted. Consecutive pipe sub-segments share the joining vertex
name, so a flow never self-crosses.
This fixes the dp_logistic_growth reference miscount (two links terminating at
the net birth rate valve reported 2 spurious crossings) that was distorting the
Phase 4 layout-quality calibration; the reference now scores crossings == 0.
loop_compactness was a hardcoded 0.0 reserved field. Compute and populate it so the layout-quality eval sweep can report a per-model value. The committed weight stays 0 (MetricWeights::default and weighted_cost are unchanged); calibration happens in a later phase. The user wants to see what the term reports before it influences any optimizer. The measure is isoperimetric over the view's feedback cycles. Build a directed graph over positioned node-box elements (aux/stock/flow/module/ cloud): each Link contributes from_uid -> to_uid, and each Flow contributes source_attached -> flow.uid -> dest_attached through its own valve, so a stock--flow--stock feedback path is part of the graph. Enumerate simple directed cycles with a bounded DFS (each cycle's smallest uid is the search root, so a cycle is discovered once), capped at 12 nodes per cycle and 64 cycles total -- SD diagrams are tiny, so this stays O(small) and total. For each cycle of >= 3 distinct positioned nodes, take the node-box centers in cycle order, compute the shoelace area and summed-edge perimeter, and form the isoperimetric quotient Q = 4*PI*Area / Perimeter^2 clamped to [0,1] (Q=1 a perfect circle, Q->0 collinear/degenerate). The per-cycle penalty is 1 - Q; loop_compactness is the mean penalty over all qualifying cycles, or 0.0 when there is no cycle of >= 3 nodes. It thus rewards clean, well-spread loops and penalizes collapsed/collinear ones. Centers come from the bare shape box (node_shape_box), which is symmetric about the element position -- unlike the label-merged node_box, whose label offset would skew the polygon. Determinism does not rely on layout seed alone: adjacency targets are sorted, the search visits nodes in sorted uid order, every cycle is canonicalized (rotated to its smallest uid) and de-duplicated, and every per-cycle guard returns None rather than a NaN, so the mean is identical for a given view regardless of element ordering.
…sion sensitive) The old label_overlap divided the corpus-wide sum of label-vs-label and label-vs-shape overlap areas by the corpus's total label area. That under-counted a small-but-readability-killing overlap: a node circle clipping the last couple of characters of a short label has a small absolute overlap area, and dividing it by the total label area of every label in the diagram diluted it to near zero (~0.004 on the eval corpus), so the term could not distinguish "one label is 30% obscured" from "nothing is obscured." Redefine label_overlap as a per-label obscuration measure: for each labeled element L with box B_L, approximate the area of B_L covered by any OTHER label box or any OTHER element's bare shape box as the sum of the individual overlap areas, capped at area(B_L) so the obscured fraction is in [0,1]; label_overlap is the SUM of those fractions over all labels. A small clip now registers at its true obscuration fraction (~0.3 for the fixture above) instead of ~0.004. The capped sum is a monotone proxy for the exact covered area -- a pixel-exact union is unnecessary, and more/larger overlaps never decrease a label's fraction. This preserves the Phase-1 guards: a label is never charged against its own element's shape box, and the comparison uses the label-free shape box (node_shape_box), not the label-merged box, so label-vs-label coverage is not re-counted via another node's merged bounds. A mutual label-label collision is charged from both labels' perspectives -- intended, since both are unreadable. area(B_L) == 0 is guarded (skip), so the term stays finite. MetricWeights/weighted_cost are unchanged (calibration is a later phase). On the eval corpus the within-model worse-is-higher anchor is preserved (best <= median <= worst label_overlap for every model) while the magnitudes rise to reflect true per-label obscuration: e.g. reliability best 0.127 -> 0.937, fishbanks best 0.019 -> 0.312.
Replace the Phase-1 all-zeros placeholder Default impl for MetricWeights with the calibrated production weights from the Phase 3 contact-sheet calibration, with explicit user sign-off (2026-05-23): node_overlap = 1.0 node_connector_overlap = 1.0 label_overlap = 1.0 crossings = 1.0 sprawl = 0.0 edge_length_cv = 0.0 aspect_penalty = 0.0 loop_compactness = 0.25 chain_straightness = 0.0 Failure-mode rationale -- readability >> compactness. The dominant concerns (node-shape overlap, connectors under node shapes, obscured labels, edge crossings) carry weight 1: they make a diagram unreadable or assert false causal connections. sprawl, edge_length_cv, and aspect_penalty are intentionally 0 -- compactness and aspect ratio are NOT goals; spreading out to keep labels legible and feedback loops visible is good, not penalized. loop_compactness is a low 0.25 that gently rewards drawing feedback loops as visible circles without dominating the overlap/crossings family. chain_straightness stays 0 (reserved, not yet computed). The old AC5.1 test asserted the default was all-zeros so weighted_cost was inert; that is no longer true by design. It is replaced by a dominance-ordering test that pins the readability-dominant relationships (each overlap/crossings term strictly exceeds each compactness term; sprawl/edge_length_cv/aspect_penalty/chain_straightness == 0; 0 < loop_compactness < node_overlap) and re-confirms weighted_cost under the default equals the explicit sum of weighted terms.
Encode the user's taste anchors as an objective validation of the calibrated MetricWeights::default(): on the agreed reference pairs the shipped hand-authored (human) layout must score a lower weighted_cost than a machine-generated (auto) layout of the same model. Construction (b) -- human view vs generated layout. For each anchor we load the default_projects XMILE model, score its as-loaded main view (human), generate a single fixed-seed layout (auto), and assert human < auto under the committed weights. One fixed seed (not generate_best_layout) keeps the test deterministic (layout is deterministic per seed, #633) and fast: all five tests run in well under half a second over the small (<= 42 element) default_projects. Anchor costs under the committed weights (human < auto): reliability: human 0.255 < auto 2.997 fishbanks: human 0.194 < auto 1.603 population: human 0.052 < auto 0.053 logistic-growth: human 0.089 < auto 0.154 sir is deliberately NOT a human<auto anchor: its shipped reference obscures more labels than the auto layout (human 0.458 > auto 0.039), so the metric correctly prefers the auto. That direction is pinned by a separate documented test so the asymmetry is recorded, not silently assumed.
Now that Phase 4 has committed the calibrated MetricWeights::default(), the layout_eval sweep scores every weighted_cost with the committed default instead of the Phase-3 PLACEHOLDER_WEIGHTS const. Remove the PLACEHOLDER_WEIGHTS const (and its 'Phase 4 will replace' doc comment) and route the three call sites (per-seed sweep, render scoring, and the report builder) through MetricWeights::default(). Re-seed the committed examples/layout_eval_baseline.json from current behavior under the calibrated weights, over the same small documented subset (teacup,sir at M=8) so the committed file stays small. The recorded per-seed sample costs change because the weights changed, so the old baseline was stale. Update the sibling README to note the 2026-05-23 re-seed with the committed weights and drop the now-done 'regenerate after Phase 4' caveat.
…tion The population reference-pair anchor beats the auto layout by only ~2.3% under the committed default weights, far thinner than the other anchors (reliability ~8.5%, fishbanks ~12%, logistic-growth ~58%). Document that it is a marginal taste anchor so a future failure is read as a near-boundary case rather than necessarily a metric regression; the assertion is unchanged. Also note that canonicalize_cycle normalizes rotation but not traversal direction. A reverse-direction duplicate essentially never arises for directed SD feedback loops, and even if it did the shoelace polygon area is direction-invariant, so both yield the identical isoperimetric penalty.
Add a dedicated layout_selection_tests.rs (wired via #[path], following the crossings_tests.rs precedent so layout_tests.rs stays under the per-file line cap) covering AC6.1: select_best_layout minimizes weighted_cost, not crossings. The headline case builds two candidates whose crossing ordering is the inverse of their weighted_cost ordering -- the lowest-cost view actually has MORE connector crossings (asserted via count_view_crossings, so the inversion is real, not narrative) -- and confirms the lowest-cost view wins, which the retired crossings-only rule would not have done. Companion cases pin the lowest-seed tie-break and NaN safety: a NaN challenger never displaces a finite running best, while a NaN seeded first is sticky under the current fold (pinned as a documented limitation rather than silently assumed away).
Add a fast, deterministic guard (AC7.1) over three tiny programmatic models built with TestProject -- a population feedback model, a pure aux dependency chain, and a two-stock transfer. Each is laid out at the fixed annealing seed 42 and its calibrated weighted_cost asserted at or below a committed per-model ceiling. The ceilings are observed-cost ceilings (observed at seed 42: pop 0.0533, chain 0.0, two_stock 0.1646) set a small margin above the observed value, with a comment documenting how to regenerate them after an intentional metric/weight change. For AC7.2, a companion test takes a real fixed-seed layout and piles every node onto the origin (blowing up node_overlap to ~6.2) and asserts the result exceeds the pop ceiling, proving the guard discriminates good layouts from bad rather than passing vacuously. The whole guard runs in well under a second (tiny models, one seed each).
Add an AC8.1 determinism test: laying out the same model twice through generate_layout_with_config at the same explicit seed yields two StockFlow values that compare equal (StockFlow derives PartialEq, so every field -- positions, view box, element order -- is checked). The test uses a seed-sensitive model (a stock fed/drained by ten leaf auxes through two flows) where the SFDP/annealing RNG genuinely shapes the layout, and additionally asserts that a different seed produces a different layout -- so the same-seed equality is a real determinism guarantee, not a vacuous pass on a tiny model whose arrangement ignores the seed. A comment notes this per-seed reproducibility is distinct from the Phase 3 M-seed statistical sweep, which deliberately varies the seed.
…s of order The fold in select_best_layout kept the running best whenever the challenger's `<` comparison was false. That correctly drops a NaN-cost challenger (`NaN < finite` is false), but it also failed to let a finite challenger overtake a NaN running best (`finite < NaN` and `finite == NaN` are both false). So a degenerate NaN-cost layout from the first seed in the fixed order [42, 123, 456, 789] was sticky: generate_best_layout would ship it even when a later seed produced a usable finite layout. Add explicit NaN branches so a NaN challenger never wins and a finite challenger always beats a NaN running best, making the guard order-independent. If every candidate is NaN the earliest is kept deterministically. The metric is NaN-free in practice (compute_layout_metrics guards every division), so this is defensive robustness rather than an observed production failure.
Document the layout-quality eval branch's contract and structure changes in src/simlin-engine/CLAUDE.md: the new pure metrics.rs (LayoutMetrics / MetricWeights / compute_layout_metrics / weighted_cost) and eval_stats.rs modules, the per-seed bit-identical layout determinism guarantee (#633), the shift of generate_best_layout/select_best_layout from fewest-crossings to the calibrated weighted_cost (NaN-safe), the rebuilt count_view_crossings / build_view_segments crossing geometry, the now-public LAYOUT_SEEDS, the diagram submodules widened to pub(crate) plus connector_polyline and the label-free *_shape_bounds / rect-segment helpers they share with the metric, and the gated layout_eval example.
|
@codex review |
7f00822 to
37e2719
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f00822399
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| x: (r.left + r.right) / 2.0, | ||
| y: (r.top + r.bottom) / 2.0, | ||
| }, |
There was a problem hiding this comment.
Use valve position for flow loop geometry
build_loop_graph computes each node center from the shape-box midpoint, but for flows that midpoint is not the valve position because flow_shape_bounds now includes the entire pipe polyline (src/simlin-engine/src/diagram/flow.rs), which is often asymmetric when users drag the valve along the pipe. This makes loop_compactness evaluate loop polygons with shifted flow vertices, so the new production weighted_cost can rank seeds based on pipe endpoint extents instead of actual loop geometry. A flow node in loop scoring should use (flow.x, flow.y) (the valve) rather than the flow bounding-box center.
Useful? React with 👍 / 👎.
Code reviewReviewed the production-impacting changes in depth (the eval harness/docs are out of the critical path):
Tests are non-vacuous (crossing inversions asserted via No blocking issues found. Overall correctness verdict: correct |
build_loop_graph computed each loop-polygon vertex as the midpoint of the element's bare shape box. For flows that box is flow_shape_bounds -- the valve box unioned with every pipe point -- so its midpoint is the pipe-extent center, which drifts off the valve (flow.x, flow.y) whenever the valve is dragged off-center or the pipe is bent. Flows are vertices in the loop graph (stock->flow->stock edges), so loop_compactness, and therefore the weighted_cost that Phase-5 Rung-0 select_best_layout minimizes, was being skewed by pipe geometry rather than true loop geometry. Compute each loop vertex via diagram::connector::get_visual_center -- the same visual center the SVG renderer uses -- which returns the valve for a flow and the element center (the symmetric shape-box midpoint) for aux/stock/module/cloud, so only flow vertices change. The node-membership gate stays node_shape_box. This restores the module's same-geometry-as-renderer invariant and corrects the now- stale LoopGraph/build_loop_graph doc comments that claimed the shape box is always the element center. Addresses the PR #637 review comment.
Automated review — PR #637 (layout-quality eval + Rung 0)I reviewed the production-impacting changes and the new cores in detail:
No blocking correctness issues found. Test coverage is thorough and matches the documented contracts. ( Overall correctness: correct. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 35ef355faa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for seg in c.polyline.windows(2) { | ||
| inside += segment_length_in_rect(&seg[0], &seg[1], rect); |
There was a problem hiding this comment.
Cap per-segment node connector coverage once
node_connector_overlap sums segment_length_in_rect across every non-incident node box, so when two node boxes overlap, the same connector sub-segment is counted multiple times. That makes this "fraction of connector length" exceed 1.0 and can disproportionately inflate weighted_cost, causing generate_best_layout to reject otherwise better layouts just because overlap regions were double-counted. The overlap should be measured as a union length per segment (or otherwise de-duplicated/clamped) before dividing by total_connector_length.
Useful? React with 👍 / 👎.
…oss boxes) node_connector_overlap is documented as a "fraction of total connector length", but it accumulated segment_length_in_rect for every non-incident node shape box, summing the per-box scalar lengths. When two non-incident shape boxes overlapped, the connector sub-length in the overlap region was counted once per box, so the term could exceed 1.0 -- breaking the fraction-in-[0,1] contract, over-inflating weighted_cost (this term's weight is 1.0), and skewing Phase-5 Rung-0 select_best_layout. Overlapping shape boxes are common because a Flow's node_shape_box is its whole-pipe bounding box, which frequently overlaps stocks/auxes/other flows. The fix unions the per-segment clip intervals across all non-incident boxes before summing, so each physical sub-length is counted at most once. A new pub(crate) segment_clip_interval_in_rect exposes the Liang-Barsky [t0, t1] core; segment_length_in_rect now delegates to it (behavior-preserving), and the metric merges the intervals per segment. Single-non-incident-box layouts are unchanged (nothing to merge); overlapping-box layouts see the term decrease toward the true union fraction, lowering weighted_cost in the safe direction so the AC5.2 human<auto anchors and the AC7 regression-guard ceilings continue to hold. Addresses the PR #637 review comment.
|
@codex review |
|
Codex Review: Didn't find any major issues. Another round soon, please! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Review — layout-quality eval + Rung-0 selectionI focused on the code that newly runs in the production layout path ( One finding below. [P2] Cycle enumeration has no global exploration budget (now on the production layout path)
Overall correctness verdict: correct. The production-impacting logic and the byte-identical render refactor are sound and well-tested; the one finding above is a latent worst-case performance concern (not a correctness break) that only bites on an uncommon large near-acyclic model. |
Summary
Adds a layout-quality evaluation system for the SD diagram auto-layout, then uses it to improve production seed selection (Rung 0). Implements the 5-phase
2026-05-22-layout-quality-evalplan, plus two correctness fixes discovered during execution.layout/metrics.rs): a pure, geometry-accurateLayoutMetrics(per-aesthetic-concern costs: node/connector/label overlap, crossings, sprawl, edge-length CV, aspect, loop compactness) computed from the same geometry the renderer draws. Refactored arc geometry out ofrender_arcinto a sharedconnector_polylineproducer (byte-identical SVG) and re-pointedcount_view_crossingsto count on sampled polylines viabuild_view_segments.layout/eval_stats.rs): pure benchstat-style seed-sample stats (geomean, type-7 percentile/median, Mann-Whitney U) +ModelStats/CorpusReport/compare.examples/layout_eval.rs, gated[[example]]): on-demand sweep that lays out a curated corpus across many seeds, scores each, renders best/median/worst + reference PNGs, and emitsmetrics.json+ a worst-firstindex.htmlcontact-sheet + a baseline diff under gitignoredtarget/. Not part ofcargo test.default_projectsreferences. Readability dominates compactness: overlaps (on bare shape boxes) + crossings + per-label obscuration carry the weight; sprawl/edge-CV/aspect are zeroed;loop_compactness(isoperimetric) is a small nudge. Committed calibratedMetricWeights::default()with explicit user sign-off; AC5.1 dominance test + AC5.2 human-vs-auto reference-pair test.select_best_layout): production best-of-k selection now minimizes the calibratedweighted_cost(NaN-safe, order-independent) instead of fewest crossings. Plus a deterministicweighted_costregression guard and a per-seed determinism test.Correctness fixes found and fixed in-branch
fresh_layoutand the incrementaldiff_connectorspath were not reproducible per seed (HashMap iteration order leaked into float accumulation and element ordering). Made deterministic and tested. This was a prerequisite for the regression guard, reference-pair test, and determinism check.Test Plan
Automated (in
cargo test -p simlin-engine): metric geometry (AC1.), polyline crossings (AC2.), statistics vs reference values (AC4.4/4.5), weight dominance (AC5.1) + reference-pair ordering (AC5.2), Rung-0 selection + regression guard + determinism (AC6-AC8). All green; clippy clean; SVG parity (svg-rendering.test.ts) unchanged.Manual (the corpus sweep is operational, not a suite test): see
docs/test-plans/2026-05-22-layout-quality-eval.mdfor the on-demand sweep walkthrough and the human-judgment calibration gate (best/median/worst ordering, reference-vs-auto scoring, weight magnitudes).target/layout-eval/index.html+metrics.jsonFollow-ups (filed, out of scope here)
@simlin/serverJest test under parallel pre-commit load (unrelated to this work).