
Feature/segmentation engine#6

Merged
dennys246 merged 16 commits into main from
feature/segmentation-engine
Feb 17, 2026

Conversation

@dennys246
Owner

Abstracts the current YOLO approach behind a segmentation engine. Due to recent YOLO license changes, the default segmentation engine is now MMPose, with YOLO remaining an option depending on the long-term licensing needs of the project at hand.

@dennys246 dennys246 merged commit 9c1bc64 into main Feb 17, 2026
dennys246 added a commit that referenced this pull request Mar 8, 2026
Feature/segmentation engine passes all current tests, and the new tests examining the segmentation engine structure all pass: zero regressions from the prior version.
dennys246 added a commit that referenced this pull request Apr 13, 2026
Ten findings from the R2 pre-merge review round, folded into a single
commit. Details in the inline comments (each fix is tagged with its
review number).

REAL BEHAVIOR BUGS
- #1 Stage-2 probe mis-classified HTTPAuthError (401/403) and
  HTTPRateLimited (429) as inference_broken. Auth-gated or throttled
  peers would report "model loading" instead of "rotate key" or "busy",
  and get the 15s short-backoff instead of the appropriate signal.
  Added explicit except branches BEFORE the generic HTTPError catch.
- #2 _has_local_llm_flag scanned full argv including subcommand names,
  so `maxim tunnel --llm foo` on a leader falsely flipped role to solo
  and _model_state_file would read active_llm_model.solo.txt on that
  machine for the process lifetime. Added _SUBCOMMAND_NAMES guard.
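The #1 fix hinges on Python's except-clause ordering: subclasses must be caught before their parent or the parent branch shadows them. A minimal sketch of that principle, with illustrative exception names (the real classes live in the project's HTTP layer):

```python
# Hypothetical exception hierarchy; names are illustrative.
class HTTPError(Exception): ...
class HTTPAuthError(HTTPError): ...      # 401/403
class HTTPRateLimited(HTTPError): ...    # 429

def classify_probe_failure(exc: Exception) -> str:
    try:
        raise exc
    except HTTPAuthError:
        return "auth_rejected"      # signal: rotate key
    except HTTPRateLimited:
        return "rate_limited"       # signal: peer busy, back off politely
    except HTTPError:
        # Generic catch LAST, or it swallows the specific cases above.
        return "inference_broken"   # signal: model still loading
```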

BAND-AIDS PROMOTED
- #3 cli.py detect_and_apply_role exception wrapped a bare try/except
  with DEBUG log. Real failures would silently skip role export.
  Promoted to WARNING + structured role_detection_failed event.
- #4 migrate_persisted_model_file caught Exception at DEBUG. Same
  silent-failure pattern. Promoted to WARNING +
  persisted_model_migration_failed event.
- #5 Migration used .replace() without checking destination existence.
  On Windows this silently clobbered an existing role-specific file
  with legacy content. Added .exists() branch: delete legacy, preserve
  role-specific, emit persisted_model_migrated with collision=True.
- #6 Moved `_validate_base_url` compat alias in openai_backend.py from
  mid-file line 101 (with # noqa: E402) to the imports block so the
  suppression is no longer needed.
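The #5 collision branch can be sketched as follows. Paths and the return shape are illustrative, not the project's actual migration signature; the point is the `.exists()` check that `Path.replace()` alone does not give you on Windows:

```python
from pathlib import Path

def migrate_persisted_model_file(legacy: Path, dest: Path) -> dict:
    """Hypothetical sketch: move legacy state to its role-specific
    destination without clobbering an existing role-specific file."""
    if not legacy.exists():
        return {"migrated": False, "collision": False}
    if dest.exists():
        # Destination already has role-specific content: keep it,
        # drop the legacy file, and report the collision.
        legacy.unlink()
        return {"migrated": True, "collision": True}
    # Path.replace() silently overwrites the destination on Windows,
    # hence the .exists() branch above.
    legacy.replace(dest)
    return {"migrated": True, "collision": False}
```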

GAPS CLOSED
- #7 _model_state_file now runs migrate_persisted_model_file lazily on
  first access per process via a module-level flag. Python-API users
  who never go through cli.py::main now get their legacy
  active_llm_model.txt migrated instead of orphaned.
- #9 detect_and_apply_role now calls _check_leader_mode_divergence,
  which compares role.py's decision against leader_mode.detect_role()
  and emits a structured role_divergence event when they disagree.
  Normalizes leader_mode's "client" → role.py's "peer" semantically.
  Insurance against Plan 4 consolidating these.
- #10 probe_llm_server now generates a per-probe uuid4 hex[:8]
  correlation ID threaded through both probe_started and
  probe_completed events so concurrent probes can be matched.
- #11 role_detected event renamed decision-rule field from `source`
  to `role_source` so it doesn't collide with StructuredFormatter's
  top-level `s` short-key (logger module name).
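The #10 correlation pattern, sketched with illustrative event shapes: one short ID generated per probe and attached to both bracketing events, so concurrent probes can be matched in the log stream:

```python
import uuid

def probe_events(url: str) -> list[dict]:
    # One correlation ID per probe, shared by started/completed events.
    probe_id = uuid.uuid4().hex[:8]
    events = [{"event": "probe_started", "probe_id": probe_id, "url": url}]
    # ... the actual probe would run here ...
    events.append({"event": "probe_completed", "probe_id": probe_id, "url": url})
    return events
```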

#8 (explicit __init__ per BackendError subclass) is a separate commit
per the triage guidance — it's a real refactor with its own review
surface.

Tests added:
- test_fix1_stage2_auth_rejected_not_inference_broken
- test_fix1_stage2_rate_limited_not_inference_broken
- test_fix2_subcommand_suppresses_llm_flag_scan
- test_fix2_bare_llm_still_detects_solo
- test_fix5_migration_destination_collision
- test_fix7_lazy_migration_in_model_state_file
- test_fix9_role_divergence_warns
- test_fix10_probe_correlation_id_in_events
- test_fix11_role_source_field_name

Fast suite: 4082 passing (was 4073 post-R2). CI grep clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 13, 2026
Pre-merge review round per the refined R2 pattern — two parallel review
Claudes (Executor + Architecture lens) audited R2.5 + R2.6 on
feat/llm-path-r3 and returned 18 findings (2 critical, 7 high, 5
medium, 4 low). This commit folds the 12 items that block the PR into
a single fix commit on the same branch. Deferred items listed at the
end with rationale.

## CRITICAL — behavior bugs

### 1. _MaximPeerBackend.for_url mutated os.environ (Exec CRIT #1, Arch HIGH #1)

Both reviewers flagged the same race: the R2.5 factory stored the
probe API key in ``os.environ["MAXIM_PEER_PROBE_KEY"]``. Under
concurrent probes (lane-A and lane-B validated in parallel, doctor
--retry loop, or the deferred multi-peer dispatch path), the env var
would race — lane-B's key could leak into lane-A's request, or
lane-A's key would survive past its probe and mis-authenticate a
later unrelated call. The pop-on-None branch also wiped any
pre-existing user value for the same env key.

Fix: store the key on the backend INSTANCE via a new
``_api_key_override: str | None`` attribute consulted by
``_get_api_key()`` before the env-var fallback. ``for_url`` no longer
touches ``os.environ`` at all — it builds the cfg with an empty
``api_key_env`` and writes the key to the instance override. Safe to
call from threads / doctor re-probe loops / future parallel
validation paths.

New regression tests:
- ``test_for_url_does_not_mutate_os_environ`` — asserts env is
  untouched after ``for_url(api_key=k)``
- ``test_for_url_concurrent_instances_have_distinct_keys`` — asserts
  two backends built back-to-back do NOT observe each other's keys
- ``test_for_url_none_key_returns_empty_string`` — asserts
  ``api_key=None`` yields an empty ``_get_api_key()`` (not a stale
  value from the env)

### 2. _emit_dispatch_exhausted bypassed the canonical normalization shim (Exec CRIT #2)

The original R2.5 aggregated-failure log read ``ctx.get("agent_id")
or ctx.get("agent")`` inline — reproducing the legacy-key bridge
that Plan 2 R2b's ``_normalize_request_context`` was explicitly
introduced to centralize. This was the one remaining inline
duplication of the shim's logic anywhere in the new router code. A
future rename of the legacy ``"agent"`` key to ``"agent_id"`` would
leave this call site stranded with a stale fallback; a caller passing
an empty / None context would land nulls in all fields.

Fix: ``_emit_dispatch_exhausted`` now lazy-imports
``maxim.agents.llm_worker._normalize_request_context`` and reads
``ctx.request_id / .agent_id / .session_id / .lane`` off the typed
result. Single canonical bridge, single rename surface. The existing
test ``test_dispatch_exhausted_emits_one_warn_with_all_attempts`` was
ALSO beefed up (per the reviewer's finding) to assert every
multi-agent field reaches the LogRecord — the original test only
checked the event-name substring, which would have allowed a
regression that dropped ``agent_id`` / ``session_id`` / etc. to sail
through.

## HIGH — dispatch + test coverage + CI enforcement

### 3. BACKEND_CLASSES dispatch table was a dead identity map (Exec HIGH #5, Arch LOW #7)

The R2.5 shipment had ``BACKEND_CLASSES: dict[str, str] = {"openai":
"openai", "maxim_peer": "maxim_peer"}`` and ``LLMRouter
._get_backend_for_provider`` hard-coded the ``"maxim_peer" /
"maxim-peer"`` branch as a string literal. Zero indirection. Adding a
new backend required edits in two files, and a typo in either would
fall through to the router's "unknown provider type" warning silently.

Fix: ``BACKEND_CLASSES`` now maps identifier → ``(module_path,
class_name)`` tuples (lazy import so optional-dependency backends
don't pay startup cost). New helper ``resolve_backend_class`` in
``lane_backends.py`` does the lookup + import + class validation +
hyphen normalisation in one place. ``LLMRouter
._get_backend_for_provider`` now calls a thin
``_maybe_dispatch_backend_class`` helper which wraps
``resolve_backend_class``; the ``maxim_peer`` / ``maxim-peer``
string-literal branch is gone. Adding a new backend is now one line
in ``BACKEND_CLASSES`` plus wiring in ``_classify_backend``.

### 4. health_check had no direct unit tests (Exec HIGH #3)

``health_check`` was only exercised indirectly via the
``_probe(url, ...)`` helper in ``test_two_stage_probe.py`` which
routes through ``for_url``. Four direct tests now lock in the
two-stage probe contract:

- ``test_missing_base_url_returns_other``
- ``test_two_attempt_fires_on_unreachable_first_attempt``
- ``test_stage2_skipped_when_stage1_auth_rejected`` — the R2c stage-2
  probe bug regression guard, now at the backend-method layer
- ``test_stage2_runs_only_when_stage1_ok``

### 5. Router + BACKEND_CLASSES dispatch had no integration test (Arch MEDIUM #6)

``test_router_typed_exceptions`` stuffed ``router._backends[...] =
MagicMock()`` to bypass the backend lookup, meaning the
``"maxim_peer"`` provider-type branch in
``_get_backend_for_provider`` was never exercised by the unit tests.
A typo in ``BACKEND_CLASSES`` or the normalisation step would have
slipped through. Four new tests in a new ``TestBackendClassesDispatch``
class exercise the full dispatch:

- ``test_maxim_peer_type_instantiates_maxim_peer_backend``
- ``test_hyphenated_maxim_peer_type_also_dispatches``
- ``test_resolve_backend_class_unknown_returns_none``
- ``test_resolve_backend_class_maxim_peer``

### 6. No CI enforcement preventing new deprecated-shim callers (Arch HIGH #2)

The R2.6 compat-shim deviation from spec ("kept as deprecated thin
delegates instead of strict delete") only holds if no new code paths
grow call sites of ``probe_llm_server`` / ``llm_server_responding_at``
/ ``_probe_once``. Nothing structural prevented that.

Fix: new CI grep invariant in ``.github/workflows/test.yml``. Matches
only function-call patterns (trailing ``(``) and imports (leading
``import ``) — prose mentions in docstrings and comments are ignored
by regex rather than post-filter. Allow list records every existing
in-repo call site at R2.6 landing:

- ``src/maxim/runtime/llm_server.py`` (the definition file)
- ``src/maxim/runtime/lane_backends.py`` (probe_llm_server +
  _llm_server_responding_at usages)
- ``src/maxim/doctor/checks.py`` (probe_llm_server +
  _llm_server_responding_at usages)
- ``src/maxim/models/language/maxim_peer_backend.py`` (the
  ``_probe_once`` lazy-import inside ``health_check``)

Any new match is a CI failure with an explicit migration hint in the
error message.
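The matching rule ("call patterns and imports, not prose mentions") can be sketched as a regex check. This pattern is illustrative, not the actual workflow regex, and covers one shim name:

```python
import re

# Match probe_llm_server as a call (trailing "(") or on an import line,
# but not as a bare prose mention in a comment or docstring.
SHIM_PATTERN = re.compile(
    r"(probe_llm_server\s*\()"
    r"|(^\s*(from\s+\S+\s+)?import\s+.*probe_llm_server)",
    re.MULTILINE,
)

def violates_shim_invariant(source: str) -> bool:
    return bool(SHIM_PATTERN.search(source))
```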

## MEDIUM — consistency, safety nets, dead indirection

### 7. _record_attempt_outcome now asserts _inference_lock.locked() (Arch MEDIUM #4)

The R2.5 ``_dispatch_attempts`` list is documented as safe-under-lock
but the safety is invisible to future edits. Added an explicit
``assert self._inference_lock.locked()`` at the top of
``_record_attempt_outcome`` so any future code path that calls
``_try_provider`` outside the lock (warmup probe, benchmark hook,
test helper) trips loudly instead of silently corrupting the buffer
across dispatches.

The existing test helper ``_call_try_provider`` and the two
``_complete_text_locked`` tests were updated to acquire
``router._inference_lock`` before the call.
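The guard turns a documented invariant into an executable one. A minimal sketch, assuming ``_inference_lock`` is a plain ``threading.Lock`` (names follow the commit; the class body is illustrative):

```python
import threading

class Router:
    def __init__(self):
        self._inference_lock = threading.Lock()
        self._dispatch_attempts: list[str] = []

    def _record_attempt_outcome(self, outcome: str) -> None:
        # Trips loudly if a future code path calls this helper without
        # holding the dispatch lock, instead of silently corrupting the
        # attempts buffer across dispatches.
        assert self._inference_lock.locked(), "caller must hold _inference_lock"
        self._dispatch_attempts.append(outcome)
```

Note ``Lock.locked()`` only reports that *some* thread holds the lock, which matches the intent here: catch the obvious lock-free call path, not full ownership.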

### 8. Auth-branch comment corrected — no misleading "ordering" reference (Exec MEDIUM #6)

The R2.5 auth-rejected / model-missing / inference-broken branches
had a comment claiming "apply the long cooldown AFTER so
_note_provider_failure's exponential ramp doesn't overwrite the
hard value." But those branches never call ``_note_provider_failure``
at all — only the direct ``state.last_error`` assignment + the
``_set_long_backoff`` / ``_set_short_backoff`` helper. The comment
was misleading and risked a future "cleanup" edit that added
``_note_provider_failure(...)`` "for symmetry" and silently regressed
the hard-cooldown value. Comment rewritten to say explicitly:
"Do NOT call ``_note_provider_failure`` in this branch" and explain
why the hard value is load-bearing.

### 9. _probe_once docstring corrected — canonical primitive, not deprecated copy (Exec MEDIUM #7)

R2.6's docstring mislabelled ``runtime/llm_server.py::_probe_once``
as "deprecated" and "intentionally a direct copy" of a backend static
method. Neither was true: after deleting the backend's static method
wrapper (see #10 below), ``_probe_once`` IS the single canonical
liveness primitive. Both the backend's ``health_check`` (via lazy
import inside the method body) and the compat shims delegate into
it. Docstring rewritten to reflect this + documents the circular-
import avoidance (the lazy imports between llm_server.py and
maxim_peer_backend.py are load-bearing).

### 10. Removed _probe_liveness_once static method (Exec MEDIUM #8)

The static ``_MaximPeerBackend._probe_liveness_once(url, api_key,
timeout_s)`` was dead indirection — it only existed to call
``runtime/llm_server._probe_once`` and had no per-instance state or
logic. Deleted. ``health_check`` now calls ``_probe_once`` directly
(lazy-imported alongside ``_probe_stage2_readiness``). Cleanest
outcome: single call site, single implementation, no dead wrapper to
drift.

### 11. _record_unclassified_backend_error logs import failures once (Exec MEDIUM #9)

The R2.5 safety-net helper silently swallowed all exceptions,
including import failures on ``lane_metrics``. Silent failures
defeat the safety net — the whole point of
``backend_unclassified_errors_total`` is to be the canary for missing
typed exception classes, and a silent zero counter masks both the
target bug and any underlying import break. Fix:
``_UNCLASSIFIED_IMPORT_WARNED`` module-level flag + ``warn(...)`` on
first failure. Subsequent calls stay silent to avoid log flood; the
first-failure warning is enough to surface a broken lane_metrics
module.

### 12. Streaming malformed-chunk + empty-content paths now emit per-call WARN (Exec LOW #10, #11)

Both streaming failure paths raised ``BackendDown`` without calling
``self._log_failure(...)``, so the only record of a mid-stream break
was the router's aggregated ``dispatch_exhausted`` WARN — which
combines all providers' outcomes into one line. Operators debugging
"why did peer-X fail?" couldn't see the individual mid-stream
malformed-chunk vs empty-content signal per call. Fix: both branches
now call ``_log_failure("malformed_chunk", ...)`` or
``_log_failure("empty_stream", ...)`` before raising, so each
streaming failure has its own ``peer_backend_failed`` WARN with the
typed exception's ``fix_hint`` + request_id + agent_id + full
multi-agent context. The malformed-chunk path also includes the
model name in the ``fix_hint`` so operators know which model's
streaming is broken.

## Deferred — rationale

### Exec HIGH #4 — real-network performance gate

Reviewer correctly observed that ``tests/performance/test_fast_failover.py``
mocks ``_http.post`` and thus wouldn't catch a regression that added
retry logic at the http-client layer (e.g., a ``max_retries=3`` kwarg
on the registered endpoint). Real-network tests against dropping
ports are flaky in CI. Deferred to the stress test protocol Phase D
(``docs/plans/llm_path_fast_failover.md`` "Stress test protocol"
section) which exercises the full path end-to-end against a real
restarting leader. The mocked gate still catches every failure-mode
classification bug; the real-network gate catches http-client config
regressions that the mocked layer cannot see.

### Arch MEDIUM #5 — request_context dict | None typing

Reviewer's tracking-only finding: the ``request_context: dict[str,
Any] | None`` parameter signature of ``_MaximPeerBackend
.complete_with_usage`` perpetuates the legacy dict shape even though
the canonical contract is the typed ``RequestContext``. The
``_OpenAIBackend`` has the same signature, so this is a multi-backend
cleanup for 0.5 or later, not Plan 3 scope. Not blocking the PR.

### Arch HIGH #3 — probe JSONL events gained a "provider" field

R3 added a ``provider`` field to ``probe_started`` and
``probe_completed`` JSONL events. This is additive and does not
break any downstream consumer (the existing ``probe_id`` correlation
test passes unchanged). Documented here rather than via a separate
schema-change commit because no tests or dashboards currently assert
the event schema's field count.

## Test invariants verified after fold-in

- Fast suite: 4142 passed, 2 skipped, 20 deselected (up from 4131
  post-R2.6 = +11 new tests from the review fixes)
- ``ruff check src/ tests/`` clean
- ``ruff format --check`` clean
- mypy on touched files: no new errors vs R2.6 baseline (confirmed
  by running mypy on the R2.6 router.py snapshot — pre-existing
  str|None errors unchanged, zero new errors introduced)
- Plan 1 R1 urllib invariant: clean
- Plan 3 R2.5 no-retry-loop invariant: clean (excluding
  ``retry_after_s`` + ``retry_timeout_s`` parameter names)
- Plan 3 R2.6 deprecated-shim allow list: clean (new CI invariant
  now enforces this)

## Related
- Review round: Executor-lens + Architecture-lens Claudes, 2026-04-12
- Spec: ``docs/plans/llm_path_fast_failover.md``
- Prior commits on this branch: ``824d737`` (R2.5), ``d09b74d`` (R2.6)
- Review pattern: ``feedback_review_before_ship.md`` — pre-merge
  reviewer round catches bugs tests don't (2 critical + 10 non-
  critical this round, 0 overlap with test coverage)
- Grep-before-dismissing: ``feedback_grep_before_assuming_fixed.md``
  — verified every disputed finding against the actual code before
  folding in (no false-dismissals this round)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 14, 2026
…graph

Ships the Hebbian episode-binding mechanism: episodes close, nodes
that co-activated within them form durable links on a binding graph
owned by Hippocampus, and presenting one node from a prior episode as
a cue retrieves the others via Hippocampus.retrieve_on_cue.

New module: src/maxim/memory/episode.py (~310 LOC)
- Episode dataclass — immutable, frozen=True, to_dict / from_dict for
  P3.5 round-trip.
- CaptureEvent + PendingEpisodeState lightweight types feeding the
  boundary detector.
- EpisodeStore — standalone class (NOT inlined onto Hippocampus per
  Round 1 Arch important #1) with node→episode inverted index,
  to_dict / load_from_dict, and an internal RLock.
- Boundary detector as a rule-list shape (Round 1 Arch important #2):
  BoundaryRule is a Callable[[PendingEpisodeState, CaptureEvent], bool].
  Three default rules ship: tick_gap_rule, channel_change_rule,
  scn_tag_change_rule. EpisodeBoundaryDetector.add_rule appends more
  without touching Stage 1 code (P3b extension seam).
- apply_hebbian_on_close function implements the Hebbian update rule
  with itertools.combinations (unordered pairs — fixes Round 1 Exec
  critical #1 double-delta). Calls find_edge before add_bidirectional
  to prevent parallel edge accumulation (Round 1 Exec critical #2 —
  DependencyGraph.add_edge has no dedupe).
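The two regression-critical properties of the Hebbian update — combinations yields each unordered pair exactly once (no double delta), and find-before-add prevents parallel duplicate edges — can be sketched on a toy dict-backed graph (the real code targets DependencyGraph):

```python
import itertools

def apply_hebbian_on_close(edges: dict, nodes: list[str],
                           init: float = 0.3, delta: float = 0.1,
                           max_weight: float = 1.0) -> None:
    """Hypothetical sketch of the Hebbian close-time update."""
    for a, b in itertools.combinations(sorted(nodes), 2):  # each unordered pair once
        key = (a, b)
        if key in edges:                                   # find_edge first: no dupes
            edges[key] = min(edges[key] + delta, max_weight)  # strengthen + clamp
        else:
            edges[key] = init                              # create once at init weight
```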

Architectural pivot: binding graph on Hippocampus, NOT ATL (Round 1
Arch critical #6):
- self._binding_graph: DependencyGraph lives on Hippocampus.
- Hebbian edges never touch ATL.graph. Decouples the binding layer
  from ATL concept eviction / compression lifecycle. Still reuses
  DependencyGraph + EdgeType.ASSOCIATES — the split proposal's
  "no new infrastructure" intent is preserved; what changed is the
  instance that owns the edges.
- Rationale documented in full in the memory/episode.py module
  docstring + referenced from the plan file.

Hippocampus wiring (src/maxim/memory/hippocampus.py):
- EpisodeConfig dataclass added: boundary_tick_gap (50),
  hebbian_init (0.3), hebbian_delta (0.1), hebbian_max (1.0). Nested
  on HippocampusConfig.episode. Tests override via
  HippocampusConfig(episode=EpisodeConfig(hebbian_delta=0.2)).
- Instance fields: _episode_store (EpisodeStore), _binding_graph
  (DependencyGraph), _pending_episode (PendingEpisodeState or None),
  _episode_detector (EpisodeBoundaryDetector seeded with the three
  default rules), _next_episode_ordinal, _episode_lock (RLock).
- Public methods:
  - observe_episode_event(event) — feeds an event into the detector,
    opens a pending episode on first call, extends or closes it on
    subsequent calls.
  - finalize_pending_episode() — force-closes the current pending
    episode (test hook + end-of-session teardown).
  - retrieve_on_cue(cue_node_id, limit) — one-hop partial-cue
    retrieval via DependencyGraph.get_associated, sorted by weight
    descending. Multi-hop via spreading_activation is Stage 2.
- Private helpers: _start_episode, _close_pending_episode_locked.
- Lock ordering invariant: self._episode_lock →
  self._episode_store._lock → self._binding_graph._lock. Regression-
  guarded by test_finalize_pending_episode_no_deadlock_under_capture_thread.

Hippocampus._to_dict / load_state (from P3.5 Stage 1) now delegate the
reserved "episodes" key to self._episode_store.to_dict /
load_from_dict, so a full Hippocampus round-trip preserves the episodes. The
binding graph itself is NOT persisted in Stage 1 (deferred to P3.5
Stage 2 — either persist or rebuild from episode replay, decided in
Stage 2 after seeing real data).

Tests (tests/substrate/test_p3a_episode_binding.py, 18 tests):

TestP3aMechanism:
- test_episode_close_creates_hebbian_edges: three unordered pairs at
  hebbian_init.
- test_episode_close_strengthens_existing_edges_by_exactly_delta:
  Round 1 Exec critical #1 regression. Asserts exact 0.4 weight
  (0.3 + 0.1, single delta), not 0.5 (double-delta).
- test_repeated_closes_no_edge_duplication: Round 1 Exec critical #2
  regression. Asserts len(_outgoing["a"]) matches expected pair count
  after 5 repeated closes, not 5×.
- test_strengthen_saturates_at_max: hebbian_max clamp.

TestP3aRetrieval:
- test_partial_cue_retrieves_co_activated_nodes.
- test_partial_cue_non_member_returns_nothing.
- test_multiple_episodes_shared_cue_merge.
- test_retrieve_respects_limit.

TestP3aBoundaryDetector:
- test_tick_gap_closes_pending.
- test_channel_change_closes_pending.
- test_scn_tag_change_closes_pending.
- test_scn_tag_none_does_not_close (None on either side does not
  spuriously close).
- test_custom_rule_extension (P3b rule-list extension seam).

TestP3aWireDiscipline:
- test_hebbian_fires_when_atl_is_none: binding graph is Hippocampus-
  owned, ATL is irrelevant to Stage 1 Hebbian.
- test_hebbian_fires_when_atl_wired_and_empty: empty ATL (len == 0,
  falsy under truthy check) must not block Hebbian — general wire-
  discipline regression for the bug class that bit NAc twice in P2.
- test_p3a_diff_has_no_truthy_biosystem_checks: subprocess git-diff
  sentinel that greps added lines for forbidden `if self._X` truthy
  checks, asserts zero matches.

TestP3aLockOrdering:
- test_finalize_pending_episode_no_deadlock_under_capture_thread:
  spawns a background worker that hammers observe_episode_event +
  finalize_pending_episode while main thread calls retrieve_on_cue.
  Asserts completion within a 2-second budget.

TestP3aPersistence:
- test_episode_round_trips_via_hippocampus_dump: dump → fresh instance
  → load_state → assert episodes survive. Depends on P3.5 Stage 1.

All 18 tests green. Full fast suite: 4345 passed, 1 skipped, zero
regressions (+18 tests over P3.5 Stage 1 baseline). Ruff check +
format clean.

Deferred to Stage 2 (not in this commit):
- 100-episode synthetic fixture + TF-IDF baseline.
- Full metric extractor (precision/recall/F1, baseline comparison).
- Binding graph persistence (or rebuild-from-episodes path).
- Multi-hop retrieval via spreading_activation.
- Reward-modulated Hebbian delta.
- Episode thread_id handling (reserved in dataclass, unused in Stage 1).
dennys246 added a commit that referenced this pull request Apr 15, 2026
Architecture lens: 2 critical + 5 important + 3 minor
Executor lens: 3 minor (0 critical, 0 important)

All critical + important findings folded. Three minors folded plus
the fixture stale-comment fix. Two minors deferred with explicit
rationale (tfidf baseline location move; p3a_metrics consolidation
with p2_metrics) — rule of three, extract when 3 consumers exist.

── CRITICAL FIXES (Architecture lens) ──

1. **Flip `retrieve_on_cue` default from `multi_hop=False` to
   `multi_hop=True`** (Arch critical #1). The Stage 1 "backward
   compat" hedge protected exactly one test suite while silently
   degrading every future P3b/P4/P5 caller that forgot the kwarg.
   Stage 2 architectural finding says one-hop is TF-IDF-equivalent
   on bag-of-words and the mechanism's value lives entirely in the
   multi-hop path; default must match the primary path. Stage 1
   mechanism tests that specifically exercise one-hop weight
   semantics now opt into `multi_hop=False` explicitly.

2. **Add `node_filter: Callable[[str], bool] | None` kwarg to
   `DependencyGraph.spreading_activation` and wire through
   `Hippocampus.retrieve_on_cue`** (Arch critical #2). Reserves the
   P3b channel-filter seam and the P4 modality-filter seam without
   committing to semantics. 5-line change in bus.py + wired through
   retrieve_on_cue; drops filtered nodes from traversal (both as
   sources and as hop targets). New `TestNodeFilterSeam` (3 tests)
   validates the seam: multi-hop and one-hop modes both honor the
   filter, and `node_filter=None` is an identity with the default.

── IMPORTANT FIXES (Architecture lens) ──

3. **Split `EpisodeConfig` into nested `HebbianConfig` +
   `RetrievalConfig`** (Arch important #4). Before: flat fields
   (`hebbian_init/delta/max`, `retrieval_decay/threshold/max_depth`)
   heading toward a kitchen-sink as P3b/P4/P6 add their own knobs.
   After: `cfg.episode.hebbian.{init, delta, max_weight}` and
   `cfg.episode.retrieval.{decay, threshold, max_depth}` — each
   concern in its own dataclass with a natural home. `max_weight`
   is named with the suffix to avoid shadowing the builtin.
   Callers updated: `Hippocampus._close_pending_episode_locked`,
   `Hippocampus.retrieve_on_cue` multi-hop path,
   `Hippocampus.load_state` binding-graph rebuild, Stage 1 test
   `_fresh_hippocampus` helper. Doing this now is cheap (20 lines);
   doing it after P6 lands is a breaking config rename.

4. **Soften parity assertion to `multi_hop_f1 > one_hop_f1 + 0.20`**
   (Arch important #5). Before: `|one_hop_f1 - tfidf_f1| < 0.05`
   locked in the parity between one-hop and TF-IDF as a hard
   invariant. Any future one-hop improvement (normalized weights,
   PageRank-style inference) would trip the test as a regression.
   The REAL architectural claim is "multi-hop lift is measurable and
   large," which is what the softened assertion locks in. Test
   renamed from `test_one_hop_does_not_beat_tfidf` to
   `test_multi_hop_lift_over_one_hop_is_real`. Current lift: 0.3045.

5. **Add 10% per-seed episode dropout as a real variance source**
   (Arch important #5). Before: all seeds produced byte-identical
   metrics (std_f1 = 0.0) and the `baseline_mean + 2×std` gate
   collapsed to `baseline_mean` — ceremonial. After: the generator
   drops 17 of 170 base episodes per seed via an independent seeded
   RNG. The gate now performs real statistical work:
   - Hebbian multi-hop F1: 0.9955 ± 0.0055
   - Hebbian one-hop F1:   0.6910 ± 0.0074
   - TF-IDF F1:            0.6600 ± 0.0058
   - Gate (baseline + 2σ): 0.6715
   - Margin:               0.3240 absolute
   Probes are topology-only (do NOT depend on dropped episodes) so
   every seed runs the same 50 retrieval tasks; dropout affects
   what edges the retriever can build, not what it's asked to
   retrieve. Multi-hop robustness absorbs the dropout almost
   perfectly (F1 stays near 1.0 because redundant hub+chain
   topology survives loss of any single reinforcement episode).

── DOCUMENTATION FOLDS ──

6. **Document the `max`-aggregation ↔ reinforcement ↔ P6 extinction
   coupling as a load-bearing invariant + P6 entry condition**
   (Arch important #3). `spreading_activation` uses `max`-path
   aggregation, not `sum`. The Stage 2 reinforcement-doubling fix
   (core weight 0.4 vs peripheral 0.3) creates a fragile equilibrium
   that P6 extinction can collapse if it decays core weights back
   toward 0.3. Three options documented for P6:
   - Add `sum` mode to `spreading_activation` (kwarg).
   - Use distinct `EdgeType` values (`HEBBIAN_BIND` vs `ASSOCIATES`).
   - Hold core weights above a strict floor during extinction.
   P6 must pick one before implementing extinction.

7. **Document P8 replay + `load_state` rebuild double-count risk**
   (Arch important #6). `apply_hebbian_on_close` is called during
   both persistence load and (future) P8 sleep replay. If both fire
   on the same episode in one session, edge weights double-apply
   (silently absorbed by the `hebbian_max=1.0` clamp today, but
   will bite if anyone raises the clamp). P8 entry condition: add
   an `(episode_id, edge_key)` idempotency marker or a
   `replayed_at_hebbian` flag.

8. **Document retrieval defaults calibration derivation**
   (Arch minor #3). The `RetrievalConfig` docstring now derives
   `decay=0.7 × core_weight=0.4 → ~5 effective hops` under
   `threshold=0.001`, with a "re-tune when" note for P3b real-text
   fixtures.
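A quick arithmetic check of that derivation, assuming activation multiplies by edge weight × decay per hop (so each hop scales by 0.4 × 0.7 = 0.28, and the deepest hop still above threshold is the 5th):

```python
core_weight, decay, threshold = 0.4, 0.7, 0.001

effective_hops = 0
activation = 1.0
# Walk hops until the next one would fall below the threshold.
while activation * core_weight * decay >= threshold:
    activation *= core_weight * decay
    effective_hops += 1
```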

── EXECUTOR MINOR FOLDS ──

9. **Fix stale "Episodes/topic = 10" comment** in the generator
   docstring + the generated YAML header (Exec minor #1). Left over
   from the pre-reinforcement draft; the actual fixture ships 17
   base episodes/topic.

10. **Add `peripherals_per_topic > pool_size` validation**
    (Exec minor #2). Pre-fold: silent undercount on pool overflow.
    Post-fold: generator raises `ValueError` with a clear message.
    New test `test_peripherals_per_topic_validation` regression-
    guards it. Also added `episode_dropout_rate` bounds validation.

11. **Rename `test_retrieval_defaults_produce_stage2_pass` →
    `test_retrieval_defaults_match_spec`** (Exec minor #3). The old
    name promised a probe but the body only asserted default values.
    New name matches what the test does.

── DEFERRED WITH RATIONALE ──

- **`TfidfBaseline` location move** (Arch minor #1). Keeps
  `tests/substrate/tfidf_baseline.py` — test-to-test imports work
  from P3b fine. Move to `src/maxim/memory/baselines/` when a 3rd
  consumer emerges. Rule of three.
- **`p3a_metrics.py` / `p2_metrics.py` extraction** (Arch minor #2).
  Extract common `aggregate_seeds` / `compare_to_baseline` into
  `tests/substrate/metrics_common.py` when P3b adds a 3rd
  consumer. Same rationale.

── TEST STATE ──

- tests/substrate/test_p3a_fixture_validation.py: 12 → 19 tests
  (+7: node-filter seam × 3, dropout variance × 2, pool/rate
  validation × 2).
- tests/substrate/test_p3a_episode_binding.py: 24 tests (Stage 1 +
  0 regressions; 4 call sites updated to `multi_hop=False`).
- tests/unit/test_bio_system_snapshot.py: 52 tests (P3.5 + 0
  regressions).
- Full fast suite: 4454 passed, 1 skipped, 0 regressions.
- Ruff check + format clean.

── RESULTS AFTER FOLD ──

  Retriever           mean F1    std F1    Beats TF-IDF + 2σ?
  ------------------  --------   -------   ----------------------
  Hebbian multi-hop    0.9955    0.0055    YES (margin 0.324)
  Hebbian one-hop      0.6910    0.0074    ≈ parity
  TF-IDF baseline      0.6600    0.0058    --

  Multi-hop lift over one-hop: 0.3045 (architectural invariant ≥ 0.20).

Results JSON + experiment writeup regenerated with the dropout-
based numbers; reproduction protocol unchanged (inline snippets
still work).
dennys246 added a commit that referenced this pull request Apr 15, 2026
Cataloging the broader structural-enforcement work that came out of
executor_bootstrap_unification.md. Three identical "forgot to wire X"
bugs in three weeks taught us that helper-discipline is too weak when
the failure mode is silent. The same shape exists across the rest of
the bio-system construction surface — every PainBus, MemoryHub,
DefaultNetwork, etc. has its own scattered construction pattern with
the same silent-no-op risk on every entry point.

This commit is planning-only, no code. Cuts a fresh planning branch
off main so the in-flight executor_bootstrap_unification PR diff
stays clean.

CENTRAL DOC: docs/plans/biosystem_unification.md
- Lessons learned (L1-L7) from the executor_bootstrap_unification work
  and pre-merge review round, especially:
  - L1: silent failure mode is the primary trigger, not headcount
  - L2: audit before designing
  - L3: pre-merge review is non-negotiable
  - L4: gate construction on the learning subject, not signal source
    (the C2 cross-confirmed finding from the executor unification)
  - L5: declared fields beat attribute stashes
  - L6: one unification per PR
  - L7: doc + memory refinement is part of the work
- General guidelines for any new bio-system unification plan
- Catalog of 6 candidates ranked by silent-failure blast radius
- Logical chain diagram + parallel-safety analysis
- Wave-based recommended ordering (1: PainBus + ReactionBus parallel;
  2: MemoryHub + DefaultNetwork parallel; 3: bio_stack umbrella;
  4: agent_factory_canonicalization)
- Status tracking table

SHELL PLANS (all marked DRAFT, audit + design pending):

- docs/plans/pain_bus_unification.md
  Wave 1. Highest-priority next move — same bug class as executor
  unification, currently scattered across 5+ entry points.

- docs/plans/reaction_bus_unification.md
  Wave 1. Parallel-safe with pain_bus. Same shape, smaller surface.

- docs/plans/memory_hub_unification.md
  Wave 2. Folds MemoryHub.connect() into __init__ so forgetting
  bridge wiring is a TypeError. Three design options sketched.

- docs/plans/default_network_unification.md
  Wave 2. Replaces the broad-except swallow + _sim_maxim_stub
  pattern with explicit construction.

- docs/plans/bio_stack_unification.md
  Wave 3. Umbrella build_bio_stack(...) returning a frozen BioStack
  dataclass. Composes all Wave 1+2 builders. Single PR but explicitly
  not parallel-safe with anything in the catalog. Why-Wave-3
  rationale documented inline.

ALSO NOTED (not getting a shell yet):
- LearnedToolIndex registration coupling — different shape (registry
  binding, not bus subscription). Logged in central doc as item #6
  with notes; revisit when a second instance surfaces.

Each shell follows the same template: Status / Goal / Audit
(PENDING) / Design sketch / Open design questions / Pre-merge review
questions / Estimated scope / Out of scope. They are intentionally
shells — full plans open when their wave activates per
biosystem_unification.md trigger conditions.
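Lesson L5 ("declared fields beat attribute stashes") and the memory_hub_unification sketch share one mechanism: make required wiring a constructor argument so omitting it is a loud TypeError instead of a silent no-op. A minimal illustration with hypothetical names (the real MemoryHub design options are still only sketched in the shell plan):

```python
from dataclasses import dataclass

class Bridge:
    """Stand-in for real bridge wiring."""
    def send(self, msg):
        return f"sent:{msg}"

@dataclass(frozen=True)
class MemoryHubSketch:
    # Required, no default: the "forgot to wire X" bug class becomes
    # a TypeError at construction time instead of a silent no-op later.
    bridge: Bridge

hub = MemoryHubSketch(bridge=Bridge())

try:
    MemoryHubSketch()  # forgotten wiring fails loudly
    forgot_wiring_error = ""
except TypeError as exc:
    forgot_wiring_error = str(exc)
```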

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 15, 2026
Two parallel reviewers (Executor + Architecture lenses) produced 20
total findings against P4 Stage 1; this commit folds the IMPORTANT
and MINOR items that require code changes. No CRITICAL findings.
4702 tests pass (185 across the P3a/P3b/P3.5/P4/hippocampus slice).

Cross-confirmed folds (both lenses caught independently):

Consistency windows:
- hippocampus_persistence.load_state now wraps the ENTIRE episode-
  binding restore block (episode_store.load, binding_graph rebuild,
  ordinal restore, node_modality clear-then-load) in a single
  `with self._episode_lock:` inside the existing `_rwlock.write()`.
  Pre-fold split regime held _rwlock.write() only for the first
  three and acquired _episode_lock separately for just the sidecar,
  exposing a torn-read window to any concurrent observe_episode_event.
  Lock order _rwlock then _episode_lock verified deadlock-free by
  inspection (Arch-lens #1).
- hippocampus_persistence.dump mirror: episode-binding reads
  (episode_store.to_dict, _next_episode_ordinal, _node_modality) now
  all happen under a SINGLE _episode_lock acquisition nested inside
  the existing _rwlock.read() block.
- hippocampus._close_pending_episode_locked flips the drain order:
  modality buffer now drains into _node_modality AFTER
  _episode_store.add and apply_hebbian_on_close both succeed. Drain
  is a pure dict update, so drain-last means if either earlier step
  raises, all three pieces of episode-binding state stay consistently
  unmutated. The symmetric drain-first choice introduced a "sidecar
  entries without graph nodes" failure window that Exec-lens #6
  flagged.
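The single-acquisition regime above comes down to a fixed lock order: the episode lock is only ever taken inside the outer lock, never the reverse. A simplified sketch using plain RLocks as stand-ins for the real reader-writer lock (class and field names are illustrative):

```python
import threading

class HippocampusSketch:
    def __init__(self):
        self._rwlock = threading.RLock()       # stand-in for _rwlock.read()/write()
        self._episode_lock = threading.RLock()
        self.graph_state = {}
        self.episode_sidecar = {}

    def load_state(self, graph, sidecar):
        # Lock order: _rwlock, then _episode_lock — never the reverse.
        with self._rwlock:
            with self._episode_lock:
                # The entire restore happens under both locks, so a
                # concurrent observer can't see the graph rebuilt but
                # the sidecar still stale (the torn-read window).
                self.graph_state = dict(graph)
                self.episode_sidecar = dict(sidecar)

    def dump(self):
        with self._rwlock:
            with self._episode_lock:
                # Single acquisition covers all episode-binding reads.
                return dict(self.graph_state), dict(self.episode_sidecar)

h = HippocampusSketch()
h.load_state({"n1": 1}, {"n1": "text"})
graph, sidecar = h.dump()
```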

Cue handling:
- retrieve_cross_modal adds explicit runtime validation of
  target_modality literal ("text" or "vision"), mirroring load_state's
  validation. Typed Literal is only checked by mypy; callers without
  mypy (tests, notebooks) would silently return [] on a typo, exactly
  the silent-no-op failure mode the Literal was added to prevent
  (Exec-lens #8).
- retrieve_cross_modal docstring updated with explicit point-in-time
  read semantics (Arch-lens #2) and explicit single-hop-only cross-
  modal limitation (Arch-lens #3 — multi-hop paths through same-
  modality intermediates are silently truncated; decision deferred
  to Stage 2/3).
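The Exec-lens #8 point — a typed Literal only protects callers who run mypy — can be sketched like this (function body is a hypothetical stand-in for the real retrieval):

```python
from typing import Literal

def retrieve_cross_modal_sketch(cue: str,
                                target_modality: Literal["text", "vision"]):
    # Literal is a static-analysis contract only; validate at runtime so
    # untyped callers (tests, notebooks) fail loudly instead of silently
    # getting [] on a typo.
    if target_modality not in ("text", "vision"):
        raise ValueError(
            f"target_modality must be 'text' or 'vision', got {target_modality!r}"
        )
    return [f"{target_modality}_{cue}"]  # stand-in for real retrieval

hits = retrieve_cross_modal_sketch("mug", "vision")
try:
    retrieve_cross_modal_sketch("mug", "vison")  # typo an untyped caller could make
    typo_raised = False
except ValueError:
    typo_raised = True
```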

Test additions (9 new tests, total 34 P4 Stage 1 tests vs 25 pre-fold):
- TestSnapshotPatternFilter: frozenset content assertion pins
  `frozenset({"vision_mug"})` exactly, not just "some frozenset"
  (Exec-lens #1 — shape-only check let a regression writing
  `frozenset(_node_modality.keys())` slip through).
- test_load_state_rejects_unknown_modality_value: seeded with prior
  state, asserts EVERY piece (sidecar, episode store, ordinal,
  binding graph edges) is intact after the raise (Exec-lens #3).
- test_node_modality_stale_entries_cleared_on_rollback: the broken
  nac adapter now asserts hippocampus HAS been mutated to state C at
  raise time. Guards against a future SNAPSHOT_KINDS reordering
  silently trivial-passing the rollback path (Exec-lens #2).
- TestCueExemptionWithInGraphUntaggedCue: directly exercises the cue-
  exemption branch with a cue that IS in the binding graph but NOT
  tagged — pre-fold test used a cue absent from the graph entirely
  so spreading_activation early-returned before touching the
  exemption (Exec-lens #4).
- TestLastWriteWinsOnDuplicateNodeIdWithinEpisode: pins the drain's
  last-write-wins contract for the degenerate same-node-two-
  modalities-in-one-episode case (Arch-lens #7).
- TestStageThreeLimitation: pins the current single-hop-only cross-
  modal limitation. If a future refactor enables multi-hop traversal
  through same-modality intermediates this test FAILS and forces an
  explicit Stage 2/3 design decision (Arch-lens #3).
- TestConcurrencyCrossLockSmoke: spawns concurrent dump + observe
  workers for 0.5s and asserts both make forward progress. Guards
  against any future _episode_lock holder acquiring _rwlock that
  would deadlock against dump's _rwlock then _episode_lock order
  (Exec-lens #5).
- TestFixtureGeometry::test_within_pair_similarity_is_dim_invariant:
  dim-parametrized across 64/128/384/512/768. Pins the Arch-lens #4
  fold that rescaled noise_scale to noise_scale / sqrt(dim) so
  within-pair similarity is dim-independent. A Stage 2 caller picking
  CLIP's native 512-d space with the pre-fold coupling would have
  produced within-pair sim ~0.44, right at EC's 0.40 threshold.

Fixture cleanup:
- p4_fixture_gen renames noise_std to noise_scale and rescales per-
  element std as noise_scale / sqrt(dim) so within-pair expected
  cosine similarity is invariant in dim. Default noise_scale=0.4
  gives within-pair ~0.862 at any reasonable dim (Arch-lens #4).
- p4_fixture_gen cosine_similarity drops the zero-norm guard and
  lets numpy surface nan loudly on degenerate inputs (Exec-lens #10).
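The guard-free cosine choice (Exec-lens #10) can be sketched as follows — on a degenerate all-zero input the 0/0 division surfaces as nan (numpy may emit a RuntimeWarning), which fails any downstream threshold comparison loudly rather than being masked by a defaulted value:

```python
import numpy as np

def cosine_similarity_sketch(a, b):
    # No zero-norm guard on purpose: degenerate inputs produce nan.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_similarity_sketch([1.0, 0.0], [1.0, 0.0])
degenerate = cosine_similarity_sketch([0.0, 0.0], [1.0, 0.0])  # nan, loudly
```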

Persistence cargo-cult cleanup:
- hippocampus_persistence dump and load_state drop the
  getattr(self, "_episode_lock", None) defensive patterns. Both
  fields are set unconditionally in Hippocampus.__init__ and there
  is no subclass that omits them (Arch-lens #8).

Deferred / noted in PR description:
- Arch #5 (error-message PII): inapplicable — cue node ids are
  internal validated identifiers.
- Arch #6 (reset-path hygiene): audit came back empty, no existing
  reset paths touch episode state.
- Exec #9 (_start_episode redundant last_tick): cosmetic.
- Open Stage 2/3 design decision: should Stage 3 split
  retrieve_on_cue's node_filter into traversal_filter and
  result_filter to enable multi-hop cross-modal paths through same-
  modality intermediates? TestStageThreeLimitation is the pin;
  decision needs to happen BEFORE Stage 3's metric is frozen.

Test surface after fold: 34 P4 Stage 1 tests, 185 across broader
episode/persistence/hippocampus slice, 4702 full fast suite clean.
Only failures are 2 pre-existing worktree-environmental failures in
TestVersionInfo unrelated to P4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dennys246 dennys246 mentioned this pull request Apr 15, 2026
dennys246 added a commit that referenced this pull request Apr 15, 2026
…ption 2 decision (fold steps 5+6+7)

Stage 2 v2 fold — commit sequence steps 5 (sweep runner), 6
(execution), 7 (operating point selection). Answers Arch #4 head-on:
Phase 2D v1's "tautological 1.000 recall → defer Option 2" decision
was built on unfalsifiable evidence; v2 rebuilds the measurement with
real signal-vs-noise pressure + real multi-hop reachability paths.

scripts/p4_mug_test_sweep_v2.py — new sweep runner:
- Iterates noise_reps ∈ {0, 1, 2, 3, 4, 5} × bridge_conditions
  ∈ {none, shared_superclass} = 12 combinations.
- For each: calls build_and_bind with the given BuildConfig, runs
  forward top-5 recall on all 10 classes, runs exhaustive cross-class
  reachability (every (text_X, vision_Y, X≠Y) pair — 450 total),
  counts single-hop via retrieve_cross_modal and multi-hop via raw
  BFS (Option 2 simulation).
- Metrics per combination: mean/std/min forward top-5 recall,
  n_cross_class_pairs, n_single_hop_reachable, n_multi_hop_reachable,
  option_2_lift_pct, episode counts by layer, wall clock.
- Operating point rule: LARGEST noise_reps at which bridge-enabled
  mean forward recall ≥ 0.90 (user-chosen tighter threshold from the
  fold planning session).
- Output writes to docs/experiments/p4_mug_test_sweep_v2.md +
  results/p4_mug_test_sweep_v2.json. Stage 2 v1 report at the
  non-suffixed path is PRESERVED as the historical record.
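The sweep grid and the operating-point rule above can be sketched together (recall numbers below are taken from the results table in this commit for illustration; the real runner computes them via build_and_bind and the recall probe):

```python
from itertools import product

def pick_operating_point(results, gate=0.90):
    """LARGEST noise_reps whose bridge-enabled mean recall clears the gate."""
    passing = [nr for (nr, bridges), recall in results.items()
               if bridges == "shared_superclass" and recall >= gate]
    return max(passing)

# 6 noise levels x 2 bridge conditions = the 12-combination grid.
grid = product(range(6), ["none", "shared_superclass"])
measured = {(0, "shared_superclass"): 1.0, (1, "shared_superclass"): 0.98,
            (2, "shared_superclass"): 0.80, (3, "shared_superclass"): 0.80,
            (4, "shared_superclass"): 0.80, (5, "shared_superclass"): 0.80}
results = {combo: measured.get(combo, 0.0) for combo in grid}
operating_point = pick_operating_point(results)  # 1, per the cliff at reps=2
```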

Results (22s wall clock, 12 combinations):

Same-class recall degradation (bridge topology ON):
- noise_reps=0: 1.000 ✅
- noise_reps=1: 0.980 (min 0.80) ✅ ← operating point
- noise_reps=2: 0.800 ❌ (sharp cliff)
- noise_reps=3-5: 0.800 ❌ (flat)

The substrate's ranker discriminates signal (weight 0.7) from weak
noise (weight 0.4 at reps=2) at the noise=1 level, but the cliff at
reps=2 is unexpected — probably the substrate's activation-based
ranking tie-breaks in a specific way once noise weight approaches
signal weight. Worth investigating as a follow-up, but it does not
affect the operating point selection.

Cross-class reachability (bridge topology ON):
- Single-hop: 9-18 pairs (just the direct noise edges)
- Multi-hop (Option 2 simulated via raw BFS max_depth=5): 450/450
  pairs reachable (every class connects to every other via the
  bridge topology + noise chain)
- Option 2 lift: +96.0% at the operating point

**Option 2 DECISION: SHIP.** Not a marginal judgment call — the lift
is overwhelming. Per the fold planning, Option 2 goes in a separate
follow-up PR after this fold merges. That PR renames node_filter →
traversal_filter, adds result_filter, provides a P3b compat shim,
re-validates P3a's 10-seed sweep, and flips TestStageThreeLimitation.

Interesting finding worth preserving: the `none` bridge rows at
noise_reps ≥ 1 ALREADY show +44.4% Option 2 lift through the noise
chain alone (class_A → vision_B_0 → text_B → vision_B_1..4). So
Option 2's value doesn't rest exclusively on the bridge construct —
the noise chain creates the same class of multi-hop reachability that
real-world usage patterns would produce naturally.

**Caveat:** the multi-hop metric uses raw BFS with max_depth=5 and
no decay/threshold accounting. Actual Option 2 filter would apply
the existing spreading_activation decay, so nodes at 5-hop depth may
have activation below threshold and not actually be returned. The
true lift is somewhat less than +96% but clearly non-zero — the
decision is robust to the caveat.
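The raw-BFS reachability check behind the multi-hop metric can be sketched as a depth-limited BFS over a toy graph (the real sweep runs it over the built binding graph; per the caveat above, no decay or threshold accounting is applied):

```python
from collections import deque

def reachable_within(graph, start, target, max_depth=5):
    """Depth-limited BFS — the Option 2 simulation's reachability check."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return True
        if depth == max_depth:
            continue  # don't expand beyond the hop budget
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return False

# Toy chain mirroring the noise-chain path class_A -> ... -> vision_B_1:
toy = {"text_A": ["vision_B_0"],
       "vision_B_0": ["text_B"],
       "text_B": ["vision_B_1"]}
multi_hop = reachable_within(toy, "text_A", "vision_B_1")                # 3 hops
single_hop = reachable_within(toy, "text_A", "vision_B_1", max_depth=1)  # too far
```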

46 P4 tests pass. Lint + format clean. Fold branch pushed to
origin/fix/substrate-p4-stage2-fold.

Next fold steps:
- 8: materialize new pinned fixture YAML with noise_reps=1 + bridge
  shape (or canonical BuildConfig alongside)
- 9: update round-trip test + add 0.70 retrieval gate test (Arch #6)
- 10: tactical fixes
- 11: Phase 2D v2 report writeup (partially done via the sweep
  output — needs an overview doc at a separate location)
- 12-14: Round 2 review + fold + PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 15, 2026
…+ 0.70 retrieval gate (fold steps 8+9)

Stage 2 v2 fold — commit sequence steps 8 (materialize new pinned
fixture) and 9 (round-trip update + retrieval gate). Addresses
Arch #5 (threshold entanglement) + Arch #6 (missing 0.70 retrieval
gate specified in plan deliverable 5).

**fixture_version bumped 1 → 2.** This commit is tagged with the
"change-fixture" prefix per the loader's pinned-SHA contract, which
requires an explicit regeneration trigger.

scenarios/substrate/p4_mug_test.yaml:

- fixture_version: 1 → 2
- dataset: "nelorth/oxford-flowers" → "torchvision.datasets.Flowers102"
  (Arch #1 follow-through: the v1 stale label is replaced with the
  actual library in use, matching the plan amendment in commit
  6de09c6)
- NEW fields at the top-level (canonical ship-shape BuildConfig):
  - build_noise_reps: 1
  - build_bridges_enabled: true
  - build_text_ec_threshold: 0.60
  - build_vision_ec_threshold: 1.01
- Extensive header comment explaining the operating point selection
  from the Phase 2D v2 sweep: "largest noise_reps at which
  bridge-enabled mean forward top-5 recall ≥ 0.90." At noise_reps=1
  the sweep measured 0.980 mean recall + +96.0% Option 2 lift, at
  noise_reps=2 recall dropped to 0.800 (fails the gate), so 1 is
  the canonical choice.
- FIXTURE_SHA256 regenerated: 967e83ed18851e1dfcad418be57f3275cf04a961462e6dc4dd055b6b71c8920b

**Note on IDE diagnostics:** the editor's YAML language server auto-
infers a schema from the v1 shape and flags the new build_* fields
as "property not allowed." These are false positives — no schema
file exists in the repo, PyYAML (the sole parser after Exec #1+#2
fold) does not enforce additional-property rules, and the new
FixtureDescriptor fields are validated by the unit tests below.
The IDE warnings will clear once the language server re-infers from
the new shape.

tests/substrate/p4_fixture_loader.py:

- FixtureDescriptor dataclass extended with 4 new build_* fields
  (defaults match v1 hardcoded values for backward compat if a
  future fixture omits them).
- load_fixture_descriptor() pulls the new fields via .get() with
  defaults so legacy callers don't break.
- FIXTURE_SHA256 bumped to match the regenerated YAML.
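The backward-compat shape of the descriptor extension — new fields defaulting to the v1 hardcoded values, pulled via `.get()` — can be sketched with hypothetical names (the real FixtureDescriptor has more fields):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DescriptorSketch:
    fixture_version: int
    # New v2 fields default to the v1 hardcoded values, so a fixture
    # that omits them still loads.
    build_noise_reps: int = 0
    build_bridges_enabled: bool = False

def load_descriptor_sketch(raw: dict) -> DescriptorSketch:
    return DescriptorSketch(
        fixture_version=raw["fixture_version"],
        build_noise_reps=raw.get("build_noise_reps", 0),
        build_bridges_enabled=raw.get("build_bridges_enabled", False),
    )

v2 = load_descriptor_sketch({"fixture_version": 2, "build_noise_reps": 1,
                             "build_bridges_enabled": True})
legacy = load_descriptor_sketch({"fixture_version": 1})  # old shape still parses
```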

tests/substrate/test_p4_mug_test_roundtrip.py:

- New _build_canonical_config(descriptor) helper constructs a
  BuildConfig from the fixture descriptor's build_* fields. Shared
  between the round-trip test and the new retrieval gate test.
- _prepare_mug_test_hippocampus() now uses _build_canonical_config
  instead of hardcoded BuildConfig values. The round-trip test
  therefore exercises the canonical ship shape (noise + bridges
  ON), validating that P3.5 SessionSnapshot round-trip survives the
  full fixture shape, not just the no-noise baseline.

tests/substrate/test_p4_fixture_validation.py (Arch #5 + Arch #6
enforcement):

- TestFixtureV2BuildConfig (4 tests) — semantic pin on the operating
  point values:
  - fixture_version == 2 (drift guard on the bump)
  - build_noise_reps == 1 (matches Phase 2D v2 operating point)
  - build_bridges_enabled is True (shared_superclass topology is
    the ship shape)
  - build_text/vision_ec_threshold == 0.60/1.01 (Stage 2 v1
    calibration values carried forward)
- TestFixtureRetrievalGate (1 test) — the 0.70 gate the plan
  specified in deliverable (5). Runs build_and_bind with the
  canonical config, measures forward + reverse top-5 recall on all
  10 classes, asserts both means ≥ 0.70. The bar is lower than
  the 0.90 operating-point selection rule because it's a CI
  regression gate, not the stricter operating-point criterion.
- Old test_descriptor_parses_successfully loosened from
  fixture_version == 1 to >= 1 (the specific v2 pin now lives in
  TestFixtureV2BuildConfig).
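The retrieval gate's assertion shape can be sketched as follows (per-class recalls below are fabricated for illustration; the real test measures them via build_and_bind with the canonical config):

```python
def retrieval_gate(forward_recalls, reverse_recalls, gate=0.70):
    """CI regression gate: both directional mean recalls must clear 0.70.
    Deliberately looser than the 0.90 operating-point selection rule."""
    fwd = sum(forward_recalls) / len(forward_recalls)
    rev = sum(reverse_recalls) / len(reverse_recalls)
    return fwd >= gate and rev >= gate, (round(fwd, 3), round(rev, 3))

# Fabricated per-class recalls standing in for the 10-class measurement:
ok, means = retrieval_gate([1.0] * 9 + [0.8], [0.9] * 10)
```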

17/17 P4 fixture + round-trip + mechanism tests pass after the
fixture bump. The retrieval gate is a first-class CI check now,
failing loudly if a future fixture edit drops recall below 0.70.

The round-trip test is now harder: it runs against the noisy +
bridged fixture, which has 80 total episodes (50 class + 10 noise +
20 bridge) vs the v1 baseline 50. Wall clock stays ~17s. P3.5
SessionSnapshot round-trip proves the full canonical ship shape
survives serialize-kill-reload intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 15, 2026
…t, VRAM OOM, canonical mps, cosmetics, tripwire (fold step 10)

Stage 2 v2 fold — commit sequence step 10. Bundles six Round 2
review findings that are each small and tactically independent but
thematically "polish before the PR":

**Exec #3** — `tests/substrate/p4_mug_probe.py`: sort the probe's
returned vision node lists by node id lexicographically instead of
by weight-rank. Stage 2 v1's probe returned the ranked list as-is,
which the round-trip test compared with exact equality. After P3.5
restore_into rebuilds the binding graph via apply_hebbian_on_close,
edges may be inserted in a different order than the original build;
tied-weight retrievals (5 class-correct nodes all at identical
activation) then flake across platforms without indicating a real
correctness regression. Sorting by node id captures "the binding
graph round-trips" (the actual claim) without false flakes from
insertion-order drift. Module docstring updated with the full
rationale.
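The sort-by-id fix can be sketched in a few lines — two runs with identical tied weights but different insertion order compare equal after sorting, which is exactly the "binding graph round-trips" claim without the rank-order flake:

```python
def probe_nodes_sketch(ranked):
    """Return the probe's node list sorted by node id, not weight rank,
    so tied-weight insertion-order drift can't flake equality tests."""
    return sorted(node_id for node_id, _weight in ranked)

# Same tied-weight retrieval, different edge-insertion order after restore:
run_a = [("vision_mug_3", 0.7), ("vision_mug_1", 0.7), ("vision_mug_2", 0.7)]
run_b = [("vision_mug_1", 0.7), ("vision_mug_2", 0.7), ("vision_mug_3", 0.7)]
stable = probe_nodes_sketch(run_a) == probe_nodes_sketch(run_b)
```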

**Exec #5** — `scripts/p4_clip_calibration_sweep.py::_pick_headroom_band`:
hard-assert `len(candidates) >= target_n` at the end. Before the
fold, a pathological sweep (NaN accuracies, empty dataset, etc.)
could silently return a short list and the fixture would be written
with fewer than target_n classes; downstream tests assume target_n.
Loud failure at sweep time beats silent short-ship.

**Exec #6** — `scripts/p4_vram_audit.py::_write_report`: fix the
`+-4 MB` rendering when torch.cuda.empty_cache releases memory.
Change the hardcoded `+` prefix on both torch_alloc and nvidia-smi
deltas to explicit signed format `{:+.0f}` so negative deltas render
correctly.
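The fix is Python's signed format spec — `{:+.0f}` emits the sign for both directions, where a hardcoded `+` prefix mangled negative deltas as `+-4 MB`:

```python
def fmt_delta_mb(delta):
    # "+" in the format spec renders an explicit sign either way.
    return f"{delta:+.0f} MB"

released = fmt_delta_mb(-4.2)   # e.g. memory released by empty_cache
allocated = fmt_delta_mb(37.0)
```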

**Exec #7** — `scripts/p4_vram_audit.py::_torch_allocated_reserved_mb`:
use the canonical `torch.backends.mps.is_available()` instead of the
torch 2.x-only `torch.mps.is_available()` shortcut. The canonical
path works on any torch version with MPS support; the shortcut is
not documented as stable.

**Exec #9** — `tests/substrate/test_p4_fixture_validation.py::TestFixtureV2BuildConfig::test_thresholds_are_not_the_ec_default`:
tripwire guarding against a future edit dropping either EC threshold
to the default 0.40. Stage 2 v1's original sweep-with-defaults
produced the "fake 1.000 recall" bug (every prompt collapsing into
one text node); this test fires if the fixture drifts back into
that regime. Strict: asserts both thresholds >= 0.45 so a
near-default value cannot sneak through.

**Arch #7** — `scripts/p4_vram_audit.py::main` + new `_safe_step`
helper: wrap each audit step (CLIP load, mpnet load, mug test
encoding) in a try/except capturing torch.cuda.OutOfMemoryError,
MemoryError, and RuntimeError. On failure, the helper records a
Sample with notes='OOM: ...', logs an error, writes the PARTIAL
report with the captured samples, and returns 2. Before the fold,
the script crashed on OOM and _write_report was never called,
leaving operators with zero data from the audit they specifically
ran to detect OOM. The audit's entire purpose is to measure
"does the encoder stack fit?" — an OOM IS the answer, not a crash.
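The `_safe_step` shape can be sketched like this (hypothetical simplified version; torch's `OutOfMemoryError` subclasses `RuntimeError`, so a stdlib-only sketch catching `MemoryError` and `RuntimeError` covers the same failure class):

```python
def safe_step_sketch(name, fn, samples):
    """Run one audit step; on memory errors record a PARTIAL sample
    instead of crashing, so the report still gets written."""
    try:
        samples.append({"step": name, "mb": fn(), "notes": ""})
        return True
    except (MemoryError, RuntimeError) as exc:
        samples.append({"step": name, "mb": None, "notes": f"OOM: {exc}"})
        return False

def failing_step():
    raise MemoryError("encoder stack does not fit")

samples = []
ok1 = safe_step_sketch("clip_load", lambda: 1200.0, samples)  # fabricated MB figure
ok2 = safe_step_sketch("mug_encode", failing_step, samples)
# samples now holds one clean sample and one OOM sample; a real caller
# would write the PARTIAL report and return a nonzero exit code.
```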

**Arch #8** — `docs/experiments/p4_vram_audit.md`: soften the
Phase 2E interpretation's "3.02 GB is steady-state" claim to be
honest about what the measurement actually captured. The 0-MB-delta
was taken at completion of ONE encode pass, not at peak during
encoding, and not across repeated runs. torch caching allocator
drift under Stage 3's 60-run sweep could move the number. The
Option 3γ recommendation (dedicated worktree with MAXIM_LLM_ENABLED=0)
is reaffirmed specifically because it sidesteps the allocator-drift
uncertainty — we don't have to reason about it if the LLM isn't
there.

**Exec #10 (topology metric reverse direction)**: NOT addressed in
this commit. The sweep's Option 2 lift metric is forward-only
(text → vision). Adding reverse measurement is a Stage 3 concern,
not a Stage 2 v2 fold item — the +96% forward lift is unambiguous
and the Option 2 SHIP decision doesn't hinge on symmetry data.
Documented as a known limitation in the next commit's report.

58/58 P4 tests pass after the bundle. Lint + format clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 17, 2026
…on candidates (Wave 3)

Full audit of every bio-pipeline construction site in src/maxim/.
Ten production sites: 8 umbrella callers that reproduce ~30 lines of
bio-system construction, 2 leaf factories (create.py, load.py) that
return individual systems and don't need the umbrella.

Six sites collapse into build_bio_stack: CLI non-sim (#1), sim AUT
(#4), sim orch NPC (#5), Reachy (#6), AgentFactory (#7), and the
CLI sim modes (#2/#3) that reuse #1's bio-systems. Two are not
candidates: simulation/tools.py sub-AUT (consumer, not constructor)
and the leaf factories.

Key findings:
- Construction order is consistent: Hippocampus → NAc → SCN → EC →
  (ATL, AngularGyrus) → MemoryHub → PainBus → DefaultNetwork
- ATL and AngularGyrus optional everywhere (try/except)
- DefaultNetwork is opt-in (only sites #4 and #6)
- PainBus wiring varies (CLI builds independently, sim/Reachy inject)
- MemoryHub.connect() with DN subsystems only at Reachy (#6)
- All umbrella sites already use Wave 1/2 builders

4 open design questions carried to implementation session.
biosystem_unification.md status row updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 17, 2026
…migrations

Introduces runtime/bio_stack.py::build_bio_stack(*, persistence_dir) that
composes the four Wave 1+2 builders (build_reaction_bus, build_pain_bus,
build_memory_hub, build_default_network) into a single call returning a
frozen BioStack dataclass. Collapses ~30 lines of hand-rolled bio-pipeline
construction at each entry point into ~5.
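The umbrella shape can be sketched as follows — stub builders stand in for the four real Wave 1+2 builders, and the frozen dataclass is what keeps callers from stashing extra attributes on the stack:

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Any

def build_reaction_bus():   # stand-ins for the real Wave 1+2 builders
    return "reaction_bus"

def build_pain_bus():
    return "pain_bus"

def build_memory_hub():
    return "memory_hub"

def build_default_network():
    return "default_network"

@dataclass(frozen=True)
class BioStackSketch:
    reaction_bus: Any
    pain_bus: Any
    memory_hub: Any
    default_network: Any

def build_bio_stack_sketch() -> BioStackSketch:
    # One call replaces the per-entry-point hand-rolled construction.
    return BioStackSketch(
        reaction_bus=build_reaction_bus(),
        pain_bus=build_pain_bus(),
        memory_hub=build_memory_hub(),
        default_network=build_default_network(),
    )

stack = build_bio_stack_sketch()
try:
    stack.pain_bus = "mutated"  # frozen: attribute stashes are rejected
    frozen = False
except FrozenInstanceError:
    frozen = True
```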

Migrated sites:
- cli.py non-sim agent (site #1)
- simulation/orchestrator.py AUT (site #4, with pre-built pain_bus=)
- simulation/orchestrator.py orch NPC hub (site #5)
- embodied_runtime/agentic_runtime.py Reachy (site #6)

Not migrated (documented):
- CLI sim modes (#2/#3): just build_pain_bus, too thin for umbrella
- AgentFactory (#7): conditional remembers/learns + auto_load deferred to
  agent_factory_canonicalization.md Wave 4

27 new tests, 4902 passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>