
Feature/agentic backbone#2

Merged
dennys246 merged 26 commits into main from
feature/agentic_backbone
Jan 6, 2026
Conversation

@dennys246
Owner

Adds an agentic infrastructure for running agents on mini through the Maxim repo. Initially focused on creating agents for controlling the Reachy, running LLMs, accessing the filesystem running Reachy, and accessing the internet.

…d reachy agent for controlling the proposing intent for the reachy robot
…small movements that don't actually move the robot head but cause it to jerk
…he agentic loop, and preserving and loading states
@dennys246 dennys246 merged commit c1a6fef into main Jan 6, 2026
@dennys246 dennys246 deleted the feature/agentic_backbone branch January 6, 2026 03:32
dennys246 added a commit that referenced this pull request Apr 3, 2026
New items from codebase audit:
8. Unused PerceptSource protocol module
9. Any type overuse in runtime functions
10. Silent exception swallowing in agent loop
11. Missing unit tests for simulation modules
12. Config 'int4' vs 'Q4_K_M' mismatch
13. Stale re-exports in llm_worker.py
14. ScenarioRunner race condition
15. README prompt profile table inaccurate

Updated: #2 marked partially done (PIPELINE traces silenced).
Added suggested execution order (trivial → small → medium).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 7, 2026
Writer plan: added 6 specific issues from first paper output
(hallucinated refs, duplicate headings, wrong metrics, etc.)

Run notes: added Run #2 results — Verath survived, miller fixation
eliminated, behavioral recall still fails at door, full pipeline
completed in 133s with paper accepted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 11, 2026
… gate

Standalone bug-fix wave. Six independent bugs discovered while reviewing
the peer_leader_flexibility plan — all pure fixes with no design
decisions, safe to ship ahead of the larger wave. Fixes items #1, #2,
#3, #9, #12 from the plan's "Issues discovered during investigation"
table plus one additional UX bug caught while testing.

Bug 1+2: download_file leaked partial files on URLError and
KeyboardInterrupt. The existing code cleaned up on generic Exception,
but neither failure path actually reached that handler in practice:
URLError was just printed and the function returned False (leaving the
file at the final path), and Ctrl+C wasn't caught at all. Both bugs
produced a file that passed profile_has_local_file on subsequent runs
and crashed the spawner in a cryptic loop.

Fix: write to {dest}.partial, catch BaseException (including
KeyboardInterrupt, re-raised after cleanup), os.replace() to the
final path only after optional size verification. Stale .partial
files from prior crashed runs are cleared before each new download.

Bug 3: LLM_MODELS had no integrity metadata. A truncated download
(network drop that urlretrieve didn't notice) would silently land at
the final path.

Fix: add expected_bytes field to each entry, sourced from actual
HuggingFace metadata. Values are authoritative when set:
  qwen2.5-14b-instruct: 8,988,110,976 (Q4_K_M)
  mistral-7b-instruct-v0.2: 4,368,439,584
  llama-3-8b-instruct: 4,920,734,272
  smollm-1.7b-instruct: 1,055,609,344
  phi-2: 1,789,239,136
  gemma-2-2b-it: 1,708,582,752
  qwen2-7b-instruct: 4,683,071,264
Profiles without verified sizes use expected_bytes=None which skips
the check (legacy behavior). download_file threads the value through.

Bug 12: profile_has_local_file accepted any .is_file() match including
.partial tmp files, compounding with bugs 1+2.

Fix: explicitly reject model_path values ending in .partial before
the is_file() check. Paired with the atomicity fix above, this closes
the loop on corrupt-file-loading on subsequent runs.

Bug 9: openai_backend logged str(last_err) verbatim, which on a
Cloudflare 502 is ~4KB of HTML. The resulting warning line buried
terminals during sim failures (visible in stability-wave transcripts).

Fix: _sanitize_error_message helper in openai_backend.py. Detects
<!DOCTYPE/<html in the first 200 chars of the error string (covers
exception-chain prefixes like "RuntimeError: <!DOCTYPE..."), extracts
the <title> tag if present, returns "<{title}, {N} bytes>" summary.
Plain-text errors over max_len (500) are truncated with an ellipsis.
Short plain strings pass through unchanged. Applied to both the trace
record's error field and the final warn() log line.

New UX bug caught during testing: maxim --llm <model> just printed the
quick-start banner and exited without loading the model. The gate at
cli.py:830-833 checked the MAXIM_LLM_ENABLED env var but never
args.language_model, so an explicit CLI model choice wasn't recognized
as a meaningful action.

Fix: add `_has_llm_cli = bool(args.language_model)` to the action set.
`maxim --llm qwen2.5-14b` now enters the main loop with that model
selected (or fails cleanly at model-load time if the GGUF is missing).

Test coverage (21 new tests):
- test_download_atomicity.py: URLError cleanup, KeyboardInterrupt
  cleanup + re-raise, size mismatch rejection, successful atomic
  rename, expected_bytes=None legacy path, stale partial cleanup,
  already-exists short-circuit.
- test_openai_error_sanitize.py: plain passthrough, long truncation,
  HTML with/without title, exception-chain prefix detection,
  upper-case tag variants, BaseException (KeyboardInterrupt) safety.
- test_cli_action_gate.py: --llm short form, --language-model long
  form, bare maxim (negative control), --sim positive control,
  MAXIM_ROLE=leader positive control.
- test_llm_server.py: extended TestProfileHasLocalFile with a
  .partial rejection test.

Full fast suite: 3631 passed (up from 3610, +21 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 13, 2026
Ten findings from the R2 pre-merge review round, folded into a single
commit. Details in the inline comments (each fix is tagged with its
review number).

REAL BEHAVIOR BUGS
- #1 Stage-2 probe mis-classified HTTPAuthError (401/403) and
  HTTPRateLimited (429) as inference_broken. Auth-gated or throttled
  peers would report "model loading" instead of "rotate key" or "busy",
  and get the 15s short-backoff instead of the appropriate signal.
  Added explicit except branches BEFORE the generic HTTPError catch.
- #2 _has_local_llm_flag scanned full argv including subcommand names,
  so `maxim tunnel --llm foo` on a leader falsely flipped role to solo
  and _model_state_file would read active_llm_model.solo.txt on that
  machine for the process lifetime. Added _SUBCOMMAND_NAMES guard.
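The exception-ordering fix in #1 hinges on Python matching `except` clauses top-down, so the specific subclasses must precede the generic catch. A minimal sketch with stand-in exception classes and return values (the real code emits backoffs and operator signals, not strings):

```python
class HTTPError(Exception): ...
class HTTPAuthError(HTTPError): ...        # 401/403
class HTTPRateLimited(HTTPError): ...      # 429


def classify_probe_failure(probe) -> str:
    """Specific branches BEFORE the generic HTTPError catch; otherwise
    auth-gated or throttled peers are swallowed as inference_broken."""
    try:
        probe()
    except HTTPAuthError:
        return "auth_rejected"     # operator signal: rotate key
    except HTTPRateLimited:
        return "busy"              # operator signal: throttled, back off
    except HTTPError:
        return "inference_broken"  # generic: gets the short-backoff path
    return "ok"
```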

BAND-AIDS PROMOTED
- #3 cli.py detect_and_apply_role exception wrapped a bare try/except
  with DEBUG log. Real failures would silently skip role export.
  Promoted to WARNING + structured role_detection_failed event.
- #4 migrate_persisted_model_file caught Exception at DEBUG. Same
  silent-failure pattern. Promoted to WARNING +
  persisted_model_migration_failed event.
- #5 Migration used .replace() without checking destination existence.
  On Windows this silently clobbered an existing role-specific file
  with legacy content. Added .exists() branch: delete legacy, preserve
  role-specific, emit persisted_model_migrated with collision=True.
- #6 Moved `_validate_base_url` compat alias in openai_backend.py from
  mid-file line 101 (with # noqa: E402) to the imports block so the
  suppression is no longer needed.

GAPS CLOSED
- #7 _model_state_file now runs migrate_persisted_model_file lazily on
  first access per process via a module-level flag. Python-API users
  who never go through cli.py::main now get their legacy
  active_llm_model.txt migrated instead of orphaned.
- #9 detect_and_apply_role now calls _check_leader_mode_divergence,
  which compares role.py's decision against leader_mode.detect_role()
  and emits a structured role_divergence event when they disagree.
  Normalizes leader_mode's "client" → role.py's "peer" semantically.
  Insurance against Plan 4 consolidating these.
- #10 probe_llm_server now generates a per-probe uuid4 hex[:8]
  correlation ID threaded through both probe_started and
  probe_completed events so concurrent probes can be matched.
- #11 role_detected event renamed decision-rule field from `source`
  to `role_source` so it doesn't collide with StructuredFormatter's
  top-level `s` short-key (logger module name).

#8 (explicit __init__ per BackendError subclass) is a separate commit
per the triage guidance — it's a real refactor with its own review
surface.

Tests added:
- test_fix1_stage2_auth_rejected_not_inference_broken
- test_fix1_stage2_rate_limited_not_inference_broken
- test_fix2_subcommand_suppresses_llm_flag_scan
- test_fix2_bare_llm_still_detects_solo
- test_fix5_migration_destination_collision
- test_fix7_lazy_migration_in_model_state_file
- test_fix9_role_divergence_warns
- test_fix10_probe_correlation_id_in_events
- test_fix11_role_source_field_name

Fast suite: 4082 passing (was 4073 post-R2). CI grep clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 13, 2026
Pre-merge review round per the refined R2 pattern — two parallel review
Claudes (Executor + Architecture lens) audited R2.5 + R2.6 on
feat/llm-path-r3 and returned 18 findings (2 critical, 7 high, 5
medium, 4 low). This commit folds the 12 items that block the PR into
a single fix commit on the same branch. Deferred items listed at the
end with rationale.

## CRITICAL — behavior bugs

### 1. _MaximPeerBackend.for_url mutated os.environ (Exec CRIT #1, Arch HIGH #1)

Both reviewers flagged the same race: the R2.5 factory stored the
probe API key in ``os.environ["MAXIM_PEER_PROBE_KEY"]``. Under
concurrent probes (lane-A and lane-B validated in parallel, doctor
--retry loop, or the deferred multi-peer dispatch path), the env var
would race — lane-B's key could leak into lane-A's request, or
lane-A's key would survive past its probe and mis-authenticate a
later unrelated call. The pop-on-None branch also wiped any
pre-existing user value for the same env key.

Fix: store the key on the backend INSTANCE via a new
``_api_key_override: str | None`` attribute consulted by
``_get_api_key()`` before the env-var fallback. ``for_url`` no longer
touches ``os.environ`` at all — it builds the cfg with an empty
``api_key_env`` and writes the key to the instance override. Safe to
call from threads / doctor re-probe loops / future parallel
validation paths.

New regression tests:
- ``test_for_url_does_not_mutate_os_environ`` — asserts env is
  untouched after ``for_url(api_key=k)``
- ``test_for_url_concurrent_instances_have_distinct_keys`` — asserts
  two backends built back-to-back do NOT observe each other's keys
- ``test_for_url_none_key_returns_empty_string`` — asserts
  ``api_key=None`` yields an empty ``_get_api_key()`` (not a stale
  value from the env)

### 2. _emit_dispatch_exhausted bypassed the canonical normalization shim (Exec CRIT #2)

The original R2.5 aggregated-failure log read ``ctx.get("agent_id")
or ctx.get("agent")`` inline — reproducing the legacy-key bridge
that Plan 2 R2b's ``_normalize_request_context`` was explicitly
introduced to centralize. This was the one remaining inline
duplication of the shim's logic anywhere in the new router code. A
future rename of the legacy ``"agent"`` key to ``"agent_id"`` would
leave this call site stranded with a stale fallback; a caller passing
an empty / None context would land nulls in all fields.

Fix: ``_emit_dispatch_exhausted`` now lazy-imports
``maxim.agents.llm_worker._normalize_request_context`` and reads
``ctx.request_id / .agent_id / .session_id / .lane`` off the typed
result. Single canonical bridge, single rename surface. The existing
test ``test_dispatch_exhausted_emits_one_warn_with_all_attempts`` was
ALSO beefed up (per the reviewer's finding) to assert every
multi-agent field reaches the LogRecord — the original test only
checked the event-name substring, which would have allowed a
regression that dropped ``agent_id`` / ``session_id`` / etc. to sail
through.

## HIGH — dispatch + test coverage + CI enforcement

### 3. BACKEND_CLASSES dispatch table was a dead identity map (Exec HIGH #5, Arch LOW #7)

The R2.5 shipment had ``BACKEND_CLASSES: dict[str, str] = {"openai":
"openai", "maxim_peer": "maxim_peer"}`` and ``LLMRouter
._get_backend_for_provider`` hard-coded the ``"maxim_peer" /
"maxim-peer"`` branch as a string literal. Zero indirection. Adding a
new backend required edits in two files, and a typo in either would
fall through to the router's "unknown provider type" warning silently.

Fix: ``BACKEND_CLASSES`` now maps identifier → ``(module_path,
class_name)`` tuples (lazy import so optional-dependency backends
don't pay startup cost). New helper ``resolve_backend_class`` in
``lane_backends.py`` does the lookup + import + class validation +
hyphen normalisation in one place. ``LLMRouter
._get_backend_for_provider`` now calls a thin
``_maybe_dispatch_backend_class`` helper which wraps
``resolve_backend_class``; the ``maxim_peer`` / ``maxim-peer``
string-literal branch is gone. Adding a new backend is now one line
in ``BACKEND_CLASSES`` plus wiring in ``_classify_backend``.
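The dispatch-table shape described above, sketched with illustrative module paths (the real entries point at the maxim backend modules):

```python
import importlib

# identifier -> (module_path, class_name); lazy import so optional-dependency
# backends pay no startup cost. Paths here are stand-ins.
BACKEND_CLASSES: dict[str, tuple[str, str]] = {
    "openai": ("maxim.models.language.openai_backend", "_OpenAIBackend"),
    "maxim_peer": ("maxim.models.language.maxim_peer_backend", "_MaximPeerBackend"),
}


def resolve_backend_class(identifier: str):
    """Lookup + lazy import + hyphen normalisation in one place.
    Returns None for unknown identifiers, so the router can fall
    through to its 'unknown provider type' warning."""
    key = identifier.replace("-", "_")    # "maxim-peer" -> "maxim_peer"
    entry = BACKEND_CLASSES.get(key)
    if entry is None:
        return None
    module_path, class_name = entry
    module = importlib.import_module(module_path)   # paid only on first use
    return getattr(module, class_name)
```

Adding a backend is then a single table entry; a typo fails loudly at `import_module` rather than silently falling through.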

### 4. health_check had no direct unit tests (Exec HIGH #3)

``health_check`` was only exercised indirectly via the
``_probe(url, ...)`` helper in ``test_two_stage_probe.py`` which
routes through ``for_url``. Four direct tests now lock in the
two-stage probe contract:

- ``test_missing_base_url_returns_other``
- ``test_two_attempt_fires_on_unreachable_first_attempt``
- ``test_stage2_skipped_when_stage1_auth_rejected`` — the R2c stage-2
  probe bug regression guard, now at the backend-method layer
- ``test_stage2_runs_only_when_stage1_ok``

### 5. Router + BACKEND_CLASSES dispatch had no integration test (Arch MEDIUM #6)

``test_router_typed_exceptions`` stuffed ``router._backends[...] =
MagicMock()`` to bypass the backend lookup, meaning the
``"maxim_peer"`` provider-type branch in
``_get_backend_for_provider`` was never exercised by the unit tests.
A typo in ``BACKEND_CLASSES`` or the normalisation step would have
slipped through. Four new tests in a new ``TestBackendClassesDispatch``
class exercise the full dispatch:

- ``test_maxim_peer_type_instantiates_maxim_peer_backend``
- ``test_hyphenated_maxim_peer_type_also_dispatches``
- ``test_resolve_backend_class_unknown_returns_none``
- ``test_resolve_backend_class_maxim_peer``

### 6. No CI enforcement preventing new deprecated-shim callers (Arch HIGH #2)

The R2.6 compat-shim deviation from spec ("kept as deprecated thin
delegates instead of strict delete") only holds if no new code paths
grow call sites of ``probe_llm_server`` / ``llm_server_responding_at``
/ ``_probe_once``. Nothing structural prevented that.

Fix: new CI grep invariant in ``.github/workflows/test.yml``. Matches
only function-call patterns (trailing ``(``) and imports (leading
``import ``) — prose mentions in docstrings and comments are ignored
by regex rather than post-filter. Allow list records every existing
in-repo call site at R2.6 landing:

- ``src/maxim/runtime/llm_server.py`` (the definition file)
- ``src/maxim/runtime/lane_backends.py`` (probe_llm_server +
  _llm_server_responding_at usages)
- ``src/maxim/doctor/checks.py`` (probe_llm_server +
  _llm_server_responding_at usages)
- ``src/maxim/models/language/maxim_peer_backend.py`` (the
  ``_probe_once`` lazy-import inside ``health_check``)

Any new match is a CI failure with an explicit migration hint in the
error message.

## MEDIUM — consistency, safety nets, dead indirection

### 7. _record_attempt_outcome now asserts _inference_lock.locked() (Arch MEDIUM #4)

The R2.5 ``_dispatch_attempts`` list is documented as safe-under-lock
but the safety is invisible to future edits. Added an explicit
``assert self._inference_lock.locked()`` at the top of
``_record_attempt_outcome`` so any future code path that calls
``_try_provider`` outside the lock (warmup probe, benchmark hook,
test helper) trips loudly instead of silently corrupting the buffer
across dispatches.

The existing test helper ``_call_try_provider`` and the two
``_complete_text_locked`` tests were updated to acquire
``router._inference_lock`` before the call.
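The assertion pattern makes the documented lock requirement executable. A minimal sketch (note `threading.Lock.locked()` reports whether the lock is held by anyone, which is exactly the loud-failure tripwire the fix wants):

```python
import threading


class Router:
    """Stand-in for LLMRouter: _dispatch_attempts is documented
    safe-under-lock; the assert makes that visible to future edits."""

    def __init__(self):
        self._inference_lock = threading.Lock()
        self._dispatch_attempts: list[str] = []

    def _record_attempt_outcome(self, outcome: str) -> None:
        # Trip loudly if a future call site (warmup probe, benchmark
        # hook, test helper) forgets to hold the lock, instead of
        # silently corrupting the buffer across dispatches.
        assert self._inference_lock.locked(), "caller must hold _inference_lock"
        self._dispatch_attempts.append(outcome)
```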

### 8. Auth-branch comment corrected — no misleading "ordering" reference (Exec MEDIUM #6)

The R2.5 auth-rejected / model-missing / inference-broken branches
had a comment claiming "apply the long cooldown AFTER so
_note_provider_failure's exponential ramp doesn't overwrite the
hard value." But those branches never call ``_note_provider_failure``
at all — only the direct ``state.last_error`` assignment + the
``_set_long_backoff`` / ``_set_short_backoff`` helper. The comment
was misleading and risked a future "cleanup" edit that added
``_note_provider_failure(...)`` "for symmetry" and silently regressed
the hard-cooldown value. Comment rewritten to say explicitly:
"Do NOT call ``_note_provider_failure`` in this branch" and explain
why the hard value is load-bearing.

### 9. _probe_once docstring corrected — canonical primitive, not deprecated copy (Exec MEDIUM #7)

R2.6's docstring mislabelled ``runtime/llm_server.py::_probe_once``
as "deprecated" and "intentionally a direct copy" of a backend static
method. Neither was true: after deleting the backend's static method
wrapper (see #10 below), ``_probe_once`` IS the single canonical
liveness primitive. Both the backend's ``health_check`` (via lazy
import inside the method body) and the compat shims delegate into
it. Docstring rewritten to reflect this + documents the circular-
import avoidance (the lazy imports between llm_server.py and
maxim_peer_backend.py are load-bearing).

### 10. Removed _probe_liveness_once static method (Exec MEDIUM #8)

The static ``_MaximPeerBackend._probe_liveness_once(url, api_key,
timeout_s)`` was dead indirection — it only existed to call
``runtime/llm_server._probe_once`` and had no per-instance state or
logic. Deleted. ``health_check`` now calls ``_probe_once`` directly
(lazy-imported alongside ``_probe_stage2_readiness``). Cleanest
outcome: single call site, single implementation, no dead wrapper to
drift.

### 11. _record_unclassified_backend_error logs import failures once (Exec MEDIUM #9)

The R2.5 safety-net helper silently swallowed all exceptions,
including import failures on ``lane_metrics``. Silent failures
defeat the safety net — the whole point of
``backend_unclassified_errors_total`` is to be the canary for missing
typed exception classes, and a silent zero counter masks both the
target bug and any underlying import break. Fix:
``_UNCLASSIFIED_IMPORT_WARNED`` module-level flag + ``warn(...)`` on
first failure. Subsequent calls stay silent to avoid log flood; the
first-failure warning is enough to surface a broken lane_metrics
module.

### 12. Streaming malformed-chunk + empty-content paths now emit per-call WARN (Exec LOW #10, #11)

Both streaming failure paths raised ``BackendDown`` without calling
``self._log_failure(...)``, so the only record of a mid-stream break
was the router's aggregated ``dispatch_exhausted`` WARN — which
combines all providers' outcomes into one line. Operators debugging
"why did peer-X fail?" couldn't see the individual mid-stream
malformed-chunk vs empty-content signal per call. Fix: both branches
now call ``_log_failure("malformed_chunk", ...)`` or
``_log_failure("empty_stream", ...)`` before raising, so each
streaming failure has its own ``peer_backend_failed`` WARN with the
typed exception's ``fix_hint`` + request_id + agent_id + full
multi-agent context. The malformed-chunk path also includes the
model name in the ``fix_hint`` so operators know which model's
streaming is broken.

## Deferred — rationale

### Exec HIGH #4 — real-network performance gate

Reviewer correctly observed that ``tests/performance/test_fast_failover.py``
mocks ``_http.post`` and thus wouldn't catch a regression that added
retry logic at the http-client layer (e.g., a ``max_retries=3`` kwarg
on the registered endpoint). Real-network tests against dropping
ports are flaky in CI. Deferred to the stress test protocol Phase D
(``docs/plans/llm_path_fast_failover.md`` "Stress test protocol"
section) which exercises the full path end-to-end against a real
restarting leader. The mocked gate still catches every failure-mode
classification bug; the real-network gate catches http-client config
regressions that the mocked layer cannot see.

### Arch MEDIUM #5 — request_context dict | None typing

Reviewer's tracking-only finding: the ``request_context: dict[str,
Any] | None`` parameter signature of ``_MaximPeerBackend
.complete_with_usage`` perpetuates the legacy dict shape even though
the canonical contract is the typed ``RequestContext``. The
``_OpenAIBackend`` has the same signature, so this is a multi-backend
cleanup for 0.5 or later, not Plan 3 scope. Not blocking the PR.

### Arch HIGH #3 — probe JSONL events gained a "provider" field

R3 added a ``provider`` field to ``probe_started`` and
``probe_completed`` JSONL events. This is additive and does not
break any downstream consumer (the existing ``probe_id`` correlation
test passes unchanged). Documented here rather than via a separate
schema-change commit because no tests or dashboards currently assert
the event schema's field count.

## Test invariants verified after fold-in

- Fast suite: 4142 passed, 2 skipped, 20 deselected (up from 4131
  post-R2.6 = +11 new tests from the review fixes)
- ``ruff check src/ tests/`` clean
- ``ruff format --check`` clean
- mypy on touched files: no new errors vs R2.6 baseline (confirmed
  by running mypy on the R2.6 router.py snapshot — pre-existing
  str|None errors unchanged, zero new errors introduced)
- Plan 1 R1 urllib invariant: clean
- Plan 3 R2.5 no-retry-loop invariant: clean (excluding
  ``retry_after_s`` + ``retry_timeout_s`` parameter names)
- Plan 3 R2.6 deprecated-shim allow list: clean (new CI invariant
  now enforces this)

## Related
- Review round: Executor-lens + Architecture-lens Claudes, 2026-04-12
- Spec: ``docs/plans/llm_path_fast_failover.md``
- Prior commits on this branch: ``824d737`` (R2.5), ``d09b74d`` (R2.6)
- Review pattern: ``feedback_review_before_ship.md`` — pre-merge
  reviewer round catches bugs tests don't (2 critical + 10 non-
  critical this round, 0 overlap with test coverage)
- Grep-before-dismissing: ``feedback_grep_before_assuming_fixed.md``
  — verified every disputed finding against the actual code before
  folding in (no false-dismissals this round)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 14, 2026
Two parallel pre-merge reviews (Executor lens + Architecture lens)
surfaced 13 findings. The Executor focused on correctness and test
rigor; the Architecture lens focused on coupling, blast radius, and
doc consistency. One finding cross-confirmed across both lenses; the
rest were unique to their respective lens. None CRITICAL, two BLOCKING,
the rest NICE-TO-HAVE. This commit folds all 13 into the same branch
before PR open, per the Plan 3 R3 review-before-ship pattern
(feedback_review_before_ship.md).

=== BLOCKING (Architecture) ===

**B1. Stale source-code refs to archived plan paths.**
After the plans-audit commit (711105b) moved Plans 1-3.5 into
docs/plans/archive/, eight source/CI files still cited the unprefixed
paths in module docstrings and CI error-hint strings:

- src/maxim/mesh/__init__.py:6 and mesh/message.py:6 (llm_path_foundation)
- src/maxim/runtime/rate_limit.py:4 (llm_path_foundation)
- src/maxim/time/scn.py:235 (llm_path_foundation)
- src/maxim/models/language/maxim_peer_backend.py:18 (llm_path_fast_failover)
- src/maxim/models/language/router.py:1042 (llm_path_fast_failover)
- src/maxim/cli.py:183 (llm_path_typed_errors)
- **.github/workflows/test.yml:84,109** (llm_path_fast_failover, twice)
  — this one is the worst: it's the CI grep guard's "go read the
  plan doc" hint printed to the operator on workflow failure. Before
  this fix, hitting the grep guard would land users on a 404.

All eight rewritten to docs/plans/archive/ paths.

**B2. docs/architecture/llm_routing.md:755 sub-plans list broken.**
The plans-audit commit partially-rewrote this line but only caught
three of the four archived plans. Plan 3.5 (cancellation_hygiene)
was missing from the sub-plans list entirely. Also referenced
mesh_operations.md as "created by Plan 4" — but Plan 4 Stage C is
deferred so that file does not exist yet. Fixed to add Plan 3.5,
Plan 3.6, Plan 4 ship status, and to mark mesh_operations.md as
deferred to Stage C.

=== NICE-TO-HAVE (Executor, #1-#4, #7) ===

**#1. Autouse contextvar scrub missing in tests/conftest.py.**
Plan 4 A.2 added a ``current_context()`` fallback in
``_normalize_request_context(None)``. If a test leaks a
``set_context`` binding (no autouse scrub existed), every later test
that asserts ``_normalize_request_context(None).agent_id is None``
silently inherits the bound context. Exact same bug class as the
env-var leak lesson in CLAUDE.md. Fix: new
``_isolate_maxim_request_context_contextvar`` autouse fixture,
matching the shape of ``_isolate_maxim_cancellation_contextvar``
(Plan 3.5 R4). Also widened ``set_context(ctx: RequestContext)`` to
accept ``None`` so the scrub is type-clean.

**#2. Bench JSONL was NOT actually wire-compatible with production.**
The load-bearing claim in the commit message and plan doc was that
``jq 'select(.e=="peer_backend_call")'`` queries work unchanged on
bench output. Executor verified by comparing field sets and found
the bench omitted ``provider``, ``model``, ``input_tokens``,
``output_tokens``, ``fix_hint`` and used ``error`` for the message
string instead of the exception class name (production uses
``error=type(exc).__name__`` and ``fix_hint`` for the message).

Fixed via: ``BenchAttempt`` carries all production fields
(``provider``, ``model``, ``input_tokens``, ``output_tokens``,
``error`` as class name, ``fix_hint``, ``http_status``). Success
path captures the LLMResponse's ``provider``/``model``/token fields;
failure path captures ``type(exc).__name__`` + ``exc.fix_hint``.
``_classify_error`` now returns ``(outcome, fix_hint)`` where
fix_hint uses the exact same ``getattr(exc, "fix_hint", "") or
str(exc)`` pattern as ``_MaximPeerBackend._log_failure``.

Added TWO new regression guards that lock the bench JSONL shape
against the production event fields:
- ``test_success_jsonl_field_parity_with_production_log_success``
- ``test_failure_jsonl_field_parity_with_production_log_failure``

These walk the field sets as Python sets and assert nothing from
production is missing. If production adds a new field, these fail
loudly and force the bench to be updated in lockstep.
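The set-walk at the heart of those guards is simple enough to show directly (a sketch; the real tests compare actual production and bench event dicts):

```python
def assert_field_parity(production_event: dict, bench_event: dict) -> None:
    """Every field production emits must also appear in the bench
    JSONL; a new production field fails loudly and forces a lockstep
    bench update. Bench-only extras are allowed."""
    missing = set(production_event) - set(bench_event)
    assert not missing, f"bench JSONL missing production fields: {sorted(missing)}"
```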

**#3. _analyse_recovery empty-list edge case.**
An empty attempt list was conflated with ``no_outage_observed`` —
semantically wrong because the bench may have been killed before
firing its first call. Added a dedicated ``no_attempts`` reason
branch + regression test.

**#4. test_tight_loop_fires_many_calls_in_short_duration fragility.**
The test asserted ``result.successes >= 10`` in a 0.25s window —
loose enough that a future regression adding e.g. ``time.sleep(0.025)``
per call would still pass. Replaced with a rate-based assertion
(``observed_rate >= 40/sec``) so real hot-path regressions fail
loudly.

**#7. Stale self-ref in archive/llm_path_foundation.md:363.**
The archived plan's memory snippet pointed at ``docs/plans/
llm_path_foundation.md`` (its pre-archive path). Rewrote to make
the move explicit ("archived 2026-04-14 when Plans 2, 3, 3.5 also
landed").

=== NICE-TO-HAVE (Architecture, N2-N6, N8) ===

**N2. Memory caveat about warmup agent_id coverage.**
The memory entry's "750/750 agent_id coverage" claim was technically
narrower than implied — ``_MaximPeerBackend.warmup()`` emits system-
level DNS preflight events with ``agent_id=null`` by design. Added
a caveat to the memory file making the scope explicit ("per-call
inference events only").

**N3. accepts_request_context → supports_request_context rename.**
Architecture reviewer pointed out the naming inconsistency with
``supports_model_override`` / ``supports_streaming`` /
``supports_tool_use``. The user's distinction (``accepts_*`` = wants
forwarding vs ``supports_*`` = feature available) was too subtle to
survive future refactors. Rename lands now while the flag has only
three call sites. Updated across router.py, maxim_peer_backend.py,
test_router_typed_exceptions.py (6 refs), test_maxim_peer_backend.py
(2 refs), the plan doc, and the memory file.

**N4. cli-reference.md missing bench subcommand.**
Added a new "Bench Harnesses" section explicitly disambiguating
``maxim bench recovery-time`` (tight-loop, Plan 4 B) from the
existing ``--benchmark`` model-evaluation flag.

**N5/#5 cross-confirmed. Redundant outer set_context in bench loop.**
Both lenses independently flagged that ``run_recovery_benchmark``
called ``set_context(bench_ctx)`` at the top AND
``set_context(per_call_ctx)`` per attempt. The outer binding was
dead code because the per-call binding always shadowed it. Dropped
the outer ``set_context`` + ``ctx_token``; only the per-call
binding remains. Removed the now-unused ``dataclasses`` import.

**N6. _classify_error subclass coverage regression test.**
Bench's ``_classify_error`` enumerates BackendError subclasses by
isinstance. If a new subclass is added without updating the bench,
it falls through to ``unhandled_*``, misleading the recovery
signal. Added
``test_every_backend_error_subclass_has_specific_mapping`` which
walks ``BackendError.__subclasses__()`` transitively and asserts
every concrete subclass maps to a non-unhandled outcome.

**N8. Rerun protocol denominator clarification.**
Plan 3's ``< 10s from leader-ready`` gate and this bench's
``recovery_time_s`` measure different things with different
denominators. Added a "Two different gates, two different
denominators" section to bench_recovery_time_rerun.md explaining
that Plan 3's gate measures peer-side wedge latency (< 10s) while
the bench measures end-to-end outage window (dominated by leader
reload time). Both gates are valuable; they're complementary.

=== Verified clean (both reviewers) ===

- set_context binding order (before copy_context) — correct
- try/finally symmetry for set_context + reset_context — correct
- Router capability-flag check doesn't break cloud backends or
  Path A legacy prompt-formatting path
- CI grep invariant ``retry|backoff|gateway`` in maxim_peer_backend.py
  still holds (no new matches)
- ``maxim.benchmark`` public API verb still loads correctly after
  the ``src/maxim/bench/`` rename
- _MaximPeerBackend.complete_with_usage still makes EXACTLY one HTTP
  call (Plan 3 invariant preserved)
- httpx _stream_ctx invariant (Plan 3.6 lesson) untouched
- Plan audit git mv operations were pure renames (R100, no content
  drift)

=== Tests / guards ===

- 4234 passed, 0 failed in the full fast suite (+6 over the
  pre-fold baseline: 4 new regression tests + 2 previously-skipped
  SemanticEngine tests that re-enabled)
- ruff check + format: clean on all touched files
- mypy: no new errors introduced (the 11 pre-existing errors on
  llm_worker.py / router.py / maxim_peer_backend.py are unchanged at
  unchanged line numbers)
- Cross-confirmed finding (N5/#5) — the strongest signal to trust a
  finding — was folded first

=== Refs ===

- Plan doc with shipped invariants: docs/plans/llm_path_operator_visibility.md
- Phase D2 hardware validation: docs/experiments/results/llm_path_stress_plan4_20260414.md
- Bench rerun runbook: docs/experiments/protocols/bench_recovery_time_rerun.md
- Memory entry: memory/project_llm_path_operator_visibility_ab_shipped.md

Follow-up to 71f7c24 (Plan 4 A+B initial) and 711105b (plans audit)
on the same feature branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 14, 2026
…graph

Ships the Hebbian episode-binding mechanism: episodes close, nodes
that co-activated within them form durable links on a binding graph
owned by Hippocampus, and presenting one node from a prior episode as
a cue retrieves the others via Hippocampus.retrieve_on_cue.

New module: src/maxim/memory/episode.py (~310 LOC)
- Episode dataclass — immutable, frozen=True, to_dict / from_dict for
  P3.5 round-trip.
- CaptureEvent + PendingEpisodeState lightweight types feeding the
  boundary detector.
- EpisodeStore — standalone class (NOT inlined onto Hippocampus per
  Round 1 Arch important #1) with node→episode inverted index,
  to_dict / load_from_dict, and an internal RLock.
- Boundary detector as a rule-list shape (Round 1 Arch important #2):
  BoundaryRule is a Callable[[PendingEpisodeState, CaptureEvent], bool].
  Three default rules ship: tick_gap_rule, channel_change_rule,
  scn_tag_change_rule. EpisodeBoundaryDetector.add_rule appends more
  without touching Stage 1 code (P3b extension seam).
- apply_hebbian_on_close function implements the Hebbian update rule
  with itertools.combinations (unordered pairs — fixes Round 1 Exec
  critical #1 double-delta). Calls find_edge before add_bidirectional
  to prevent parallel edge accumulation (Round 1 Exec critical #2 —
  DependencyGraph.add_edge has no dedupe).

Architectural pivot: binding graph on Hippocampus, NOT ATL (Round 1
Arch critical #6):
- self._binding_graph: DependencyGraph lives on Hippocampus.
- Hebbian edges never touch ATL.graph. Decouples the binding layer
  from ATL concept eviction / compression lifecycle. Still reuses
  DependencyGraph + EdgeType.ASSOCIATES — the split proposal's
  "no new infrastructure" intent is preserved; what changed is the
  instance that owns the edges.
- Rationale documented in full in the memory/episode.py module
  docstring + referenced from the plan file.

Hippocampus wiring (src/maxim/memory/hippocampus.py):
- EpisodeConfig dataclass added: boundary_tick_gap (50),
  hebbian_init (0.3), hebbian_delta (0.1), hebbian_max (1.0). Nested
  on HippocampusConfig.episode. Tests override via
  HippocampusConfig(episode=EpisodeConfig(hebbian_delta=0.2)).
- Instance fields: _episode_store (EpisodeStore), _binding_graph
  (DependencyGraph), _pending_episode (PendingEpisodeState or None),
  _episode_detector (EpisodeBoundaryDetector seeded with the three
  default rules), _next_episode_ordinal, _episode_lock (RLock).
- Public methods:
  - observe_episode_event(event) — feeds an event into the detector,
    opens a pending episode on first call, extends or closes it on
    subsequent calls.
  - finalize_pending_episode() — force-closes the current pending
    episode (test hook + end-of-session teardown).
  - retrieve_on_cue(cue_node_id, limit) — one-hop partial-cue
    retrieval via DependencyGraph.get_associated, sorted by weight
    descending. Multi-hop via spreading_activation is Stage 2.
- Private helpers: _start_episode, _close_pending_episode_locked.
- Lock ordering invariant: self._episode_lock →
  self._episode_store._lock → self._binding_graph._lock. Regression-
  guarded by test_finalize_pending_episode_no_deadlock_under_capture_thread.
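
The one-hop retrieval contract can be sketched as follows (shapes assumed;
the real method walks DependencyGraph.get_associated rather than taking a
prebuilt dict):

```python
def retrieve_on_cue(associated: dict[str, float],
                    limit: int) -> list[tuple[str, float]]:
    """One-hop partial-cue retrieval: neighbors of the cue node,
    sorted by edge weight descending, truncated to `limit`."""
    ranked = sorted(associated.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


# Cue co-activated with three nodes across prior episodes:
hits = retrieve_on_cue({"desk": 0.3, "coffee": 0.9, "door": 0.4}, limit=2)
```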

Hippocampus._to_dict / load_state (from P3.5 Stage 1) now delegate the
reserved "episodes" key to self._episode_store.to_dict /
load_from_dict, so a dump → load_state round-trip preserves the episodes. The
binding graph itself is NOT persisted in Stage 1 (deferred to P3.5
Stage 2 — either persist or rebuild from episode replay, decided in
Stage 2 after seeing real data).

Tests (tests/substrate/test_p3a_episode_binding.py, 18 tests):

TestP3aMechanism:
- test_episode_close_creates_hebbian_edges: three unordered pairs at
  hebbian_init.
- test_episode_close_strengthens_existing_edges_by_exactly_delta:
  Round 1 Exec critical #1 regression. Asserts exact 0.4 weight
  (0.3 + 0.1, single delta), not 0.5 (double-delta).
- test_repeated_closes_no_edge_duplication: Round 1 Exec critical #2
  regression. Asserts len(_outgoing["a"]) matches expected pair count
  after 5 repeated closes, not 5×.
- test_strengthen_saturates_at_max: hebbian_max clamp.

TestP3aRetrieval:
- test_partial_cue_retrieves_co_activated_nodes.
- test_partial_cue_non_member_returns_nothing.
- test_multiple_episodes_shared_cue_merge.
- test_retrieve_respects_limit.

TestP3aBoundaryDetector:
- test_tick_gap_closes_pending.
- test_channel_change_closes_pending.
- test_scn_tag_change_closes_pending.
- test_scn_tag_none_does_not_close (None on either side does not
  spuriously close).
- test_custom_rule_extension (P3b rule-list extension seam).

TestP3aWireDiscipline:
- test_hebbian_fires_when_atl_is_none: binding graph is Hippocampus-
  owned, ATL is irrelevant to Stage 1 Hebbian.
- test_hebbian_fires_when_atl_wired_and_empty: empty ATL (len == 0,
  falsy under truthy check) must not block Hebbian — general wire-
  discipline regression for the bug class that bit NAc twice in P2.
- test_p3a_diff_has_no_truthy_biosystem_checks: subprocess git-diff
  sentinel that greps added lines for forbidden `if self._X` truthy
  checks, asserts zero matches.

TestP3aLockOrdering:
- test_finalize_pending_episode_no_deadlock_under_capture_thread:
  spawns a background worker that hammers observe_episode_event +
  finalize_pending_episode while main thread calls retrieve_on_cue.
  Asserts completion within a 2-second budget.

TestP3aPersistence:
- test_episode_round_trips_via_hippocampus_dump: dump → fresh instance
  → load_state → assert episodes survive. Depends on P3.5 Stage 1.

All 18 tests green. Full fast suite: 4345 passed, 1 skipped, zero
regressions (+18 tests over P3.5 Stage 1 baseline). Ruff check +
format clean.

Deferred to Stage 2 (not in this commit):
- 100-episode synthetic fixture + TF-IDF baseline.
- Full metric extractor (precision/recall/F1, baseline comparison).
- Binding graph persistence (or rebuild-from-episodes path).
- Multi-hop retrieval via spreading_activation.
- Reward-modulated Hebbian delta.
- Episode thread_id handling (reserved in dataclass, unused in Stage 1).
dennys246 added a commit that referenced this pull request Apr 14, 2026
Two independent reviewers (Executor + Architecture lenses) ran in
parallel against commits 1befe9e (P3.5 Stage 1) + 71a5921 (P3a Stage
1) and produced 22 findings total. This commit folds all critical +
important findings + most minor findings. A few deferred items are
flagged below with rationale.

── CRITICAL FIXES ──

1. **_next_episode_ordinal not restored on load_state** (Exec critical
   #1). Pre-fix: dump → fresh Hippocampus → load_state → observe_
   episode_event crashed with `duplicate episode id: ep_1` because the
   ordinal re-initialized to 0. Hippocampus.dump() now emits
   "next_episode_ordinal" and load_state restores it, with a fallback
   that derives max ordinal from loaded episode ids for corrupt-file
   recovery. Regression guards:
   - test_next_episode_ordinal_restored_on_load_state
   - test_ordinal_derived_from_max_id_when_dump_missing_ordinal
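
A sketch of the restore-plus-fallback path (dump keys from the commit,
surrounding types reduced): prefer the dumped counter, else derive it from
the max numeric suffix of the loaded episode ids.

```python
import re


def restore_next_ordinal(dump: dict) -> int:
    """Restore the episode ordinal; for corrupt/legacy dumps missing the
    key, fall back to the max suffix of ids shaped like 'ep_<n>'."""
    if "next_episode_ordinal" in dump:
        return dump["next_episode_ordinal"]
    suffixes = [
        int(m.group(1))
        for ep_id in dump.get("episodes", {})
        if (m := re.fullmatch(r"ep_(\d+)", ep_id))
    ]
    return max(suffixes, default=0)
```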

2. **Hippocampus.load_state still enforced payload version check**
   (Exec critical #2). Contradicted the envelope-authoritative
   tombstoned versioning rule documented in snapshot.py. Removed the
   check from Hippocampus / NAc / SCN load_state for uniformity with
   ATL. Hippocampus.dump no longer emits a "version" key at all
   (absence is the cleanest tombstone form). Regression guards:
   - test_nac_payload_version_mutation_does_not_affect_load
   - test_hippocampus_payload_legacy_version_absent_from_dump

3. **PerceptTraceBuffer.load_state broke ring-buffer capacity
   invariant** (Exec critical #3 + Arch critical #2). Pre-fix:
   restored entries list at arbitrary length, silently exceeding
   self._max_entries until the next record() call. Also silently
   dropped dumped tuning params (tau, tick_rate, etc.) with no
   diagnostic surface, causing P4 subprocess-restore across mismatched
   tuning to decay at the wrong rate. Fix: trim restored entries to
   max_entries keeping the most recent; emit a WARN log listing every
   dumped vs live tuning mismatch. Regression guards:
   - test_load_state_trims_to_max_entries
   - test_load_state_warns_on_tuning_drift
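
The two-part fix can be sketched like this (function shape hypothetical;
the real code lives on PerceptTraceBuffer): trim restored entries to live
capacity keeping the most recent, and surface every tuning mismatch
instead of dropping the dumped params silently.

```python
import logging

log = logging.getLogger("percept_trace")


def load_state(entries: list, dumped_tuning: dict, live_tuning: dict,
               max_entries: int) -> list:
    # WARN on every dumped-vs-live tuning mismatch (tau, tick_rate, ...).
    for key, live_value in live_tuning.items():
        dumped_value = dumped_tuning.get(key)
        if dumped_value is not None and dumped_value != live_value:
            log.warning("tuning drift on %s: dumped=%r live=%r",
                        key, dumped_value, live_value)
    # Honor the ring-buffer capacity invariant: keep the MOST RECENT.
    return entries[-max_entries:]


restored = load_state([1, 2, 3, 4, 5], {"tau": 0.5}, {"tau": 0.7},
                      max_entries=3)
```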

4. **EpisodeConfig forward-reference** (Arch critical #1). Pre-fix:
   HippocampusConfig.episode: EpisodeConfig = ... where EpisodeConfig
   was defined BELOW HippocampusConfig in the same module. Worked at
   runtime via `from __future__ import annotations` but broke
   typing.get_type_hints() and dict-shaped YAML construction — a
   latent bug that any future pydantic adapter or YAML config surface
   would have hit. Moved EpisodeConfig definition above
   HippocampusConfig. Verified with an inline get_type_hints() probe
   before commit.
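
A hedged sketch of the fixed ordering (class bodies reduced to one field
each): with `from __future__ import annotations` every annotation is a
string, so `get_type_hints()` must resolve `EpisodeConfig` by name, and
`field(default_factory=EpisodeConfig)` needs the name bound at
class-creation time; defining it above removes both hazards.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import get_type_hints


# EpisodeConfig defined ABOVE the class that annotates a field with it.
@dataclass
class EpisodeConfig:
    boundary_tick_gap: int = 50


@dataclass
class HippocampusConfig:
    episode: EpisodeConfig = field(default_factory=EpisodeConfig)


# The inline probe mentioned above: resolves cleanly post-fix.
hints = get_type_hints(HippocampusConfig)
```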

── IMPORTANT FIXES ──

5. **apply_hebbian_on_close never called add_node** (Exec important
   #3). Pre-fix: _outgoing / _incoming populated but _nodes empty.
   Stage 2 binding-graph persistence via DependencyGraph.to_dict()
   would have silently lost the node list. Fix: add_node(id, id) for
   every activated node before any edge write. Regression guards:
   - test_binding_graph_nodes_populated_after_hebbian
   - test_binding_graph_to_dict_includes_nodes

6. **EpisodeStore.load_from_dict silently overwrote duplicates** (Exec
   important #4). Pre-fix: direct _episodes[ep.id] = ep assignment
   collapsed duplicate ids and left stale _by_node inverted-index
   entries. Fix: raise ValueError on duplicate id. Regression guard:
   - test_duplicate_episode_id_in_loaded_state_raises

7. **Binding graph not persisted → silent asymmetry** (Arch important
   #4). Pre-fix: episodes round-tripped through Hippocampus.dump but
   binding-graph edges did not, so post-load retrieve_on_cue returned
   nothing until new episodes landed. Fix: rebuild the binding graph
   from loaded episodes inside Hippocampus.load_state via
   apply_hebbian_on_close, so a restored Hippocampus produces the
   same retrieve_on_cue results as the original. Clears internal
   DependencyGraph state under its lock before rebuilding. Regression
   guard:
   - test_binding_graph_rebuilt_on_load_state

8. **_episode_detector was private, no P3b public seam** (Arch
   important #1). Fix: added Hippocampus.add_boundary_rule(rule) as a
   public method forwarding to self._episode_detector.add_rule. P3b
   channel integration will call this without reaching into private
   state. Regression guard:
   - test_add_boundary_rule_is_public_and_consulted

9. **SessionSnapshot silent partial-capture / partial-restore** (Exec
   minor #2 + Arch important #7, cross-confirmed). Fix: both capture
   and restore_into accept strict: bool = False. Under strict=True,
   partial capture raises with the missing system list, and
   restore_into raises on any mismatch between provided instances
   and envelope systems. P4 / P5 harnesses should use strict mode.
   Regression guards:
   - TestSessionSnapshotStrictMode (4 tests)

10. **test_hebbian_fires_when_atl_wired_and_empty was a no-op** (Exec
    important #1). Pre-fix: set h._atl = empty_atl — but Hippocampus
    has no _atl field in Stage 1, so the test was attaching an
    arbitrary attribute that exercised no real code. Deleted.
    Replaced with the in-tree source grep below.

11. **test_p3a_diff_has_no_truthy_biosystem_checks silently skipped on
    shallow clones** (Exec important #2). Pre-fix: used `git diff
    main` which returns non-zero on shallow-clone CI (the one place
    the guard is most needed), falling back to pytest.skip. Fix:
    replaced with inspect.getsource regex scan over memory/episode.py
    module + every P3a-added Hippocampus method. Always runs, always
    catches violations.
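
The always-runs guard can be sketched as a plain source scan (regex and
helper names hypothetical; the real test feeds inspect.getsource output
for memory/episode.py plus the P3a-added Hippocampus methods):

```python
import re

# Forbidden shape: a bare truthy check like `if self._atl:` where
# wire-discipline requires `if self._atl is not None:`.
FORBIDDEN = re.compile(r"\bif\s+self\._\w+\s*:")


def truthy_biosystem_checks(source: str) -> list[str]:
    """Return every line containing a bare `if self._X:` truthy check."""
    return [ln.strip() for ln in source.splitlines() if FORBIDDEN.search(ln)]


good = "if self._atl is not None:\n    pass\n"
bad = "if self._atl:\n    pass\n"
```

Unlike `git diff main`, this never skips on shallow-clone CI.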

12. **Protocol lock-acquisition undocumented** (Arch important #5).
    Fix: added a "Locking caveat" note to the BioSystemSnapshot
    Protocol docstring warning callers on hot-path threads to spawn
    dump() off to a worker thread.

13. **_graph vs _binding_graph collision risk** (Arch important #3).
    Rather than renaming one of the fields (bigger blast radius),
    added prominent docstrings on both init-time field declarations +
    expanded the retrieve_on_cue docstring to explicitly state it
    queries ONLY the binding graph. Stage 2 may drive a rename if
    P3b retrieval fusion actually surfaces confusion.

── MINOR FIXES ──

14. Deadlock test iter-count threshold: `> 10` → `> 1` to avoid
    flaking on slow CI runners under lock contention. The "didn't
    deadlock" assertion via thread.join timeout remains the load-
    bearing check.

15. Plan file fold: updated every `load(state)` → `load_state(state)`
    reference in both plan files. Added a Production integration
    section to substrate_p3a_episode_binding.md specifying the post-
    Stage-1 wiring path (runtime/agent_loop.py call site, event
    construction from percept context, lock-order discipline).
    Populated the Load-bearing invariants section in both plan files
    with the post-Round-2 invariant list (13 for P3a, 8 for P3.5).

── DEFERRED (NOT FOLDED, WITH RATIONALE) ──

- **retrieve_on_cue return shape (list[tuple] vs list[Memory])** (Arch
  minor): wrapping in a BindingHit dataclass is a shape decision that
  should happen when P3b retrieval fusion actually lands and the cost
  of shape divergence becomes concrete. Stage 1 deliberately uses raw
  tuples.
- **memory_hub integration** (Arch minor): adding a
  MemoryHub.capture_session_snapshot() facade is plan-level
  orchestration that doesn't belong in this commit. Flagged for the
  P4 session.
- **find_edge / update_edge non-atomic under multi-writer** (Exec
  important #5): real concern for Stage 2 multi-writer correctness
  but Stage 1 has no multi-writer tests or call sites — deferred to
  Stage 2 with an explicit comment in apply_hebbian_on_close.
- **find_edge O(degree) perf under P5 stress** (Exec minor #3): flag
  only, deferred to P5.
- **Tombstone string enforcement via AST test** (Arch minor): the
  docstring + comment convention plus removed load_state checks
  (nothing now branches on the payload version) make silent divergence
  extremely unlikely. Adding an AST test is cheap but not gating.
- **load_state naming collision risk with future pydantic/msgspec
  conventions** (Arch important): deferred to the plan session that
  tackles the load(path) → load_from_path rename across 37+ call
  sites. Plan doc explicitly calls this out as out of scope for
  Stage 1.

── TEST STATE ──

- tests/unit/test_bio_system_snapshot.py: 43 → 57 tests (+14)
- tests/substrate/test_p3a_episode_binding.py: 18 → 24 tests (+6;
  removed 1 no-op, added 7 regression guards)
- Full fast suite: 4360 passed, 1 skipped, 0 regressions (+15 over
  P3a Stage 1 baseline).
- Ruff check + format clean on all touched files.

── BRANCH STATE ──

feat/substrate-p3-kickoff is now 5 commits ahead of main:
- 6aaa4c4 docs(plans): open P3.5 + P3a plan files
- 12b0310 docs(plans): Round 1 review fold
- 1befe9e feat(memory): P3.5 Stage 1
- 71a5921 feat(memory): P3a Stage 1
- THIS:  fix(substrate-p3): Round 2 fold
dennys246 added a commit that referenced this pull request Apr 15, 2026
Architecture lens: 2 critical + 5 important + 3 minor
Executor lens: 3 minor (0 critical, 0 important)

All critical + important findings folded. Three minors folded plus
the fixture stale-comment fix. Two minors deferred with explicit
rationale (tfidf baseline location move; p3a_metrics consolidation
with p2_metrics) — rule of three, extract when 3 consumers exist.

── CRITICAL FIXES (Architecture lens) ──

1. **Flip `retrieve_on_cue` default from `multi_hop=False` to
   `multi_hop=True`** (Arch critical #1). The Stage 1 "backward
   compat" hedge protected exactly one test suite while silently
   degrading every future P3b/P4/P5 caller that forgot the kwarg.
   Stage 2 architectural finding says one-hop is TF-IDF-equivalent
   on bag-of-words and the mechanism's value lives entirely in the
   multi-hop path; default must match the primary path. Stage 1
   mechanism tests that specifically exercise one-hop weight
   semantics now opt into `multi_hop=False` explicitly.

2. **Add `node_filter: Callable[[str], bool] | None` kwarg to
   `DependencyGraph.spreading_activation` and wire through
   `Hippocampus.retrieve_on_cue`** (Arch critical #2). Reserves the
   P3b channel-filter seam and the P4 modality-filter seam without
   committing to semantics. 5-line change in bus.py + wired through
   retrieve_on_cue; drops filtered nodes from traversal (both as
   sources and as hop targets). New `TestNodeFilterSeam` (3 tests)
   validates the seam: multi-hop and one-hop modes both honor the
   filter, and `node_filter=None` is an identity with the default.

── IMPORTANT FIXES (Architecture lens) ──

3. **Split `EpisodeConfig` into nested `HebbianConfig` +
   `RetrievalConfig`** (Arch important #4). Before: flat fields
   (`hebbian_init/delta/max`, `retrieval_decay/threshold/max_depth`)
   heading toward a kitchen-sink as P3b/P4/P6 add their own knobs.
   After: `cfg.episode.hebbian.{init, delta, max_weight}` and
   `cfg.episode.retrieval.{decay, threshold, max_depth}` — each
   concern in its own dataclass with a natural home. `max_weight`
   is named with the suffix to avoid shadowing the builtin.
   Callers updated: `Hippocampus._close_pending_episode_locked`,
   `Hippocampus.retrieve_on_cue` multi-hop path,
   `Hippocampus.load_state` binding-graph rebuild, Stage 1 test
   `_fresh_hippocampus` helper. Doing this now is cheap (20 lines);
   doing it after P6 lands is a breaking config rename.

4. **Soften parity assertion to `multi_hop_f1 > one_hop_f1 + 0.20`**
   (Arch important #5). Before: `|one_hop_f1 - tfidf_f1| < 0.05`
   locked in the parity between one-hop and TF-IDF as a hard
   invariant. Any future one-hop improvement (normalized weights,
   PageRank-style inference) would trip the test as a regression.
   The REAL architectural claim is "multi-hop lift is measurable and
   large," which is what the softened assertion locks in. Test
   renamed from `test_one_hop_does_not_beat_tfidf` to
   `test_multi_hop_lift_over_one_hop_is_real`. Current lift: 0.3045.

5. **Add 10% per-seed episode dropout as a real variance source**
   (Arch important #5). Before: all seeds produced byte-identical
   metrics (std_f1 = 0.0) and the `baseline_mean + 2×std` gate
   collapsed to `baseline_mean` — ceremonial. After: the generator
   drops 17 of 170 base episodes per seed via an independent seeded
   RNG. The gate now performs real statistical work:
   - Hebbian multi-hop F1: 0.9955 ± 0.0055
   - Hebbian one-hop F1:   0.6910 ± 0.0074
   - TF-IDF F1:            0.6600 ± 0.0058
   - Gate (baseline + 2σ): 0.6715
   - Margin:               0.3240 absolute
   Probes are topology-only (do NOT depend on dropped episodes) so
   every seed runs the same 50 retrieval tasks; dropout affects
   what edges the retriever can build, not what it's asked to
   retrieve. Multi-hop robustness absorbs the dropout almost
   perfectly (F1 stays near 1.0 because redundant hub+chain
   topology survives loss of any single reinforcement episode).
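
The dropout mechanism can be sketched as follows (function shape assumed;
the load-bearing details from the fold are the independent seeded RNG and
the 10% rate over 170 base episodes):

```python
import random


def drop_episodes(episodes: list[str], seed: int,
                  rate: float = 0.10) -> list[str]:
    """Drop ~rate of the base episodes, deterministically per seed."""
    rng = random.Random(seed)  # independent of any other RNG state
    n_drop = round(len(episodes) * rate)
    dropped = set(rng.sample(range(len(episodes)), n_drop))
    return [ep for i, ep in enumerate(episodes) if i not in dropped]


base = [f"ep_{i}" for i in range(170)]
kept = drop_episodes(base, seed=1)
```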

── DOCUMENTATION FOLDS ──

6. **Document the `max`-aggregation ↔ reinforcement ↔ P6 extinction
   coupling as a load-bearing invariant + P6 entry condition**
   (Arch important #3). `spreading_activation` uses `max`-path
   aggregation, not `sum`. The Stage 2 reinforcement-doubling fix
   (core weight 0.4 vs peripheral 0.3) creates a fragile equilibrium
   that P6 extinction can collapse if it decays core weights back
   toward 0.3. Three options documented for P6:
   - Add `sum` mode to `spreading_activation` (kwarg).
   - Use distinct `EdgeType` values (`HEBBIAN_BIND` vs `ASSOCIATES`).
   - Hold core weights above a strict floor during extinction.
   P6 must pick one before implementing extinction.

7. **Document P8 replay + `load_state` rebuild double-count risk**
   (Arch important #6). `apply_hebbian_on_close` is called during
   both persistence load and (future) P8 sleep replay. If both fire
   on the same episode in one session, edge weights double-apply
   (silently absorbed by the `hebbian_max=1.0` clamp today, but
   will bite if anyone raises the clamp). P8 entry condition: add
   an `(episode_id, edge_key)` idempotency marker or a
   `replayed_at_hebbian` flag.

8. **Document retrieval defaults calibration derivation**
   (Arch minor #3). The `RetrievalConfig` docstring now derives
   `decay=0.7 × core_weight=0.4 → ~5 effective hops` under
   `threshold=0.001`, with a "re-tune when" note for P3b real-text
   fixtures.
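
The derivation checks out numerically, assuming activation multiplies by
edge_weight × decay on each hop (which is how the docstring's numbers
compose):

```python
# Per-hop attenuation under the shipped defaults: 0.4 * 0.7 = 0.28.
weight, decay, threshold = 0.4, 0.7, 0.001
activation, hops = 1.0, 0
while activation * weight * decay > threshold:
    activation *= weight * decay
    hops += 1
# 0.28**5 ~= 0.00172 > 0.001 >= 0.28**6 ~= 0.00048, so ~5 effective hops
```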

── EXECUTOR MINOR FOLDS ──

9. **Fix stale "Episodes/topic = 10" comment** in the generator
   docstring + the generated YAML header (Exec minor #1). Left over
   from the pre-reinforcement draft; the actual fixture ships 17
   base episodes/topic.

10. **Add `peripherals_per_topic > pool_size` validation**
    (Exec minor #2). Pre-fold: silent undercount on pool overflow.
    Post-fold: generator raises `ValueError` with a clear message.
    New test `test_peripherals_per_topic_validation` regression-
    guards it. Also added `episode_dropout_rate` bounds validation.

11. **Rename `test_retrieval_defaults_produce_stage2_pass` →
    `test_retrieval_defaults_match_spec`** (Exec minor #3). The old
    name promised a probe but the body only asserted default values.
    New name matches what the test does.

── DEFERRED WITH RATIONALE ──

- **`TfidfBaseline` location move** (Arch minor #1). Keeps
  `tests/substrate/tfidf_baseline.py` — test-to-test imports work
  from P3b fine. Move to `src/maxim/memory/baselines/` when a 3rd
  consumer emerges. Rule of three.
- **`p3a_metrics.py` / `p2_metrics.py` extraction** (Arch minor #2).
  Extract common `aggregate_seeds` / `compare_to_baseline` into
  `tests/substrate/metrics_common.py` when P3b adds a 3rd
  consumer. Same rationale.

── TEST STATE ──

- tests/substrate/test_p3a_fixture_validation.py: 12 → 19 tests
  (+7: node-filter seam × 3, dropout variance × 2, pool/rate
  validation × 2).
- tests/substrate/test_p3a_episode_binding.py: 24 tests (Stage 1 +
  0 regressions; 4 call sites updated to `multi_hop=False`).
- tests/unit/test_bio_system_snapshot.py: 52 tests (P3.5 + 0
  regressions).
- Full fast suite: 4454 passed, 1 skipped, 0 regressions.
- Ruff check + format clean.

── RESULTS AFTER FOLD ──

  Retriever           mean F1    std F1    Beats TF-IDF + 2σ?
  ------------------  --------   -------   ----------------------
  Hebbian multi-hop    0.9955    0.0055    YES (margin 0.324)
  Hebbian one-hop      0.6910    0.0074    ≈ parity
  TF-IDF baseline      0.6600    0.0058    --

  Multi-hop lift over one-hop: 0.3045 (architectural invariant ≥ 0.20).

Results JSON + experiment writeup regenerated with the dropout-
based numbers; reproduction protocol unchanged (inline snippets
still work).
dennys246 added a commit that referenced this pull request Apr 15, 2026
Plan-only commit, no code. Two parallel reviewers (Executor lens +
Architecture lens) ran against commit 2c92444 and produced:

- Executor: 3 critical + 3 important + 3 minor (1 critical REJECTED
  as false positive — Executor missed the sender_ids writers in
  _start_episode + observe_episode_event; verified against real code)
- Architecture: 2 critical + 4 important + 3 minor
- Cross-confirmed between lenses: 2 findings (metadata-grep baseline
  circularity + channel_membership_filter semantics ambiguity)

All critical + important findings folded into the plan file. All
minors folded except those explicitly deferred with rationale. One
Executor critical pushed back on with a grep-verified correction.

── Critical folds ──

1. Metadata-grep baseline is circular (Exec I1 + Arch I4,
   CROSS-CONFIRMED). Draft baseline was "nodes in episodes matching
   filter" — strictly weaker than Hebbian one-hop (which is "nodes
   that co-occurred with cue in those episodes"). Pass gate cleared
   by construction, not by mechanism. Fix: baseline is now cue-aware
   one-hop ("nodes that co-occurred with cue in filtered episodes").
   Stage 2 also adds a SECOND head-to-head — Hebbian multi-hop
   filtered vs Hebbian one-hop filtered — to prove the multi-hop
   lift is real within the filtered subgraph, matching the P3a
   Stage 2 ≥0.20 absolute lift invariant.

2. channel_specific_rule wrapper gates on wrong field (Exec I2).
   Plan said pending.channel == channel; that only works because
   channel_change_rule is always installed. Fix: gate on
   event.channel == channel. This stays correct under any
   default-rule configuration.

3. sms_sender_change_rule fires on cold start (Exec C1).
   If first event had sender_id=None, pending.sender_ids stays
   empty, and "bob" not in empty_set is trivially True → rule fires
   on any later non-None sender. Fix: guard on
   bool(pending.sender_ids) before the membership check.

4. channel_membership_filter ANY-semantics will box in P4 (Arch C1).
   Stage 1 ships membership_mode: Literal["any","exclusive"] = "any"
   parameter even though Stage 1 only implements "any". P4 cross-
   modal mug test needs "exclusive" semantics for bridge-node
   detection; committing the parameter shape now means P4 can add
   the mode without breaking P3b callers.

5. Mixed-hop semantics need explicit test (Exec C3). Under
   channel-filtered multi-hop, if an intermediate node is in a
   non-matching episode, the BFS breaks there — downstream matching
   nodes reachable only through that intermediate become unreachable.
   This is intended (path-strict, not bridge-transparent) but the
   earlier plan didn't pin the semantic. Fix:
   test_channel_filter_mixed_hop_breaks_transitive_path explicitly
   asserts the BFS breaks at the filtered intermediate.

── Important folds ──

6. channel_membership_filter misplaced on Hippocampus (Arch I3).
   Moved to EpisodeStore.episode_membership_filter. Hippocampus
   exposes a thin convenience alias delegating to the store. Keeps
   Hippocampus from becoming a filter-builder factory as P4/P5/P6
   add their own axes.

7. Filter axis commitment too narrow (Arch I4).
   Generalized from (channel, sender) to **criteria that introspect
   Episode dataclass fields. P4 adds modality= without new code.

8. Persistence + rule re-registration contract under-specified
   (Exec I3). Boundary rules are NOT persisted; callers must
   re-register per-channel rules at construction time post-load.
   New test test_channel_rules_contract_re_registered_on_load pins
   the contract.

9. P8 replay × channel-rule interaction under-specified (Arch I5).
   Documented as a P8 entry condition: replay must either bypass
   the detector (replay closed Episode objects directly) or go
   through the detector with replay-idempotent channel rules. P3b
   does not pick one — P8 decides.

── Minor folds ──

10. narrative_scene_rule alias dropped (Exec minor #1). Stage 1
    uses scn_tag_change_rule directly. If Stage 2 surfaces a real
    need for narrator-change detection, the factory lands then as
    substantive code.

11. Capture-thread deadlock regression guard added (Exec minor #2)
    following the P3a Stage 1 pattern.

12. Default+P3b rule composition model made explicit as ADDITIVE
    (Arch minor #1). Regression guard:
    test_channel_specific_rule_composes_additively_with_defaults.

13. Stage 1 Blocks: states the P4 unblock boundary explicitly
    (Arch minor #2). Stage 1 unblocks P4 seam consumption; Stage 2
    unblocks P4 baseline comparison work.

14. node_filter STOP-and-escalate rule named as a load-bearing
    invariant (Arch minor #3). If Stage 2 testing surfaces a case
    where node_filter is insufficient, STOP — do not add a parallel
    retrieval method.

── Deferred with rationale ──

15. metrics_common.py extraction deferred to P4 (Exec minor #3 +
    Arch minor #3). P4 is the 1.0-gating comparison shape and
    deserves to drive the shared module's API. Folding the
    extraction into P3b now would force a rush decision on an API
    that has to serve P4.

16. Boundary provenance (Arch C2, pushed back on). Arch lens wanted
    rules to return (matched, reason) so P6 extinction could decay
    channel-specific boundaries differently from generic ones.
    Deferred to P6 entry condition because P6 extinction operates
    at the binding-graph edge level, not at boundary state; whether
    provenance is needed depends on P6's design, which hasn't
    committed yet. P3b does not ship the Episode.closed_by field.

17. PageRank baseline (Arch important #4, downgraded). Stage 2
    already ships two baselines (cue-aware metadata-grep +
    Hebbian-one-hop-filtered); a third against PageRank is an
    interesting open question but explicitly deferred to P4 or
    later.

── Pushback on Executor C2 (REJECTED, not folded) ──

Executor flagged "nothing populates PendingEpisodeState.sender_ids
today" as critical. Verified against the actual code:
src/maxim/memory/hippocampus.py has TWO writers —
_start_episode populates from the first event's sender_id
(line around "if event.sender_id is not None: pending.sender_ids
.add(event.sender_id)"), and observe_episode_event's extension
path also appends. The Executor reviewer's grep missed both.
P3a Stage 1 correctly wired sender_ids; P3b does NOT need to add
the wire-up. This is a false positive, not folded.

── Diff ──

Plan file: 118 insertions, 53 deletions.
dennys246 added a commit that referenced this pull request Apr 15, 2026
…bership filter

Ships the P3b Stage 1 mechanism on top of the seams P3a Stage 2 fold
reserved (`add_boundary_rule`, `retrieve_on_cue(node_filter=...)`,
`spreading_activation(node_filter=...)`). 28 synthetic mechanism
tests, no fixture YAML or baseline yet (those are Stage 2).

── New code ──

src/maxim/memory/episode.py:

- channel_specific_rule(channel, inner) — wraps any BoundaryRule to
  gate it on event.channel == channel. Round 1 Exec important #2:
  gates on EVENT.channel, not pending.channel. Pending-channel gating
  only happens to work because the default channel_change_rule is
  always installed; removing it would silently disable every wrapped
  rule under the wrong gating.
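
A minimal sketch of the wrapper (state/event types reduced to namespaces;
the real signature takes PendingEpisodeState + CaptureEvent). The
load-bearing detail: gate on EVENT.channel, never pending.channel.

```python
from types import SimpleNamespace
from typing import Callable

BoundaryRule = Callable[[object, object], bool]


def channel_specific_rule(channel: str, inner: BoundaryRule) -> BoundaryRule:
    def rule(pending: object, event: object) -> bool:
        if getattr(event, "channel", None) != channel:
            return False  # wrong channel: wrapped rule never consulted
        return inner(pending, event)
    return rule


def always_close(pending: object, event: object) -> bool:
    return True


sms_only = channel_specific_rule("sms", always_close)

# Pending channel is deliberately irrelevant to the gating decision:
pending = SimpleNamespace(channel="vision")
sms_event = SimpleNamespace(channel="sms")
vision_event = SimpleNamespace(channel="vision")
```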

- sms_gap_rule(max_gap_ticks=500) — channel_specific_rule wrapping a
  tick_gap_rule with a 10x larger default than the canonical
  EpisodeConfig.boundary_tick_gap=50. SMS conversations have natural
  minute-to-hour gaps that aren't episode boundaries.

- sms_sender_change_rule() — channel-gated rule that closes when a
  new contact texts. Round 1 Exec critical #1: includes a cold-start
  guard (returns False when pending.sender_ids is empty). Without
  the guard, an episode opened with sender_id=None would close the
  moment any later non-None sender arrived, because "bob" not in
  empty_set is trivially True.
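
The cold-start guard can be sketched like this (field names from the
commit, surrounding types reduced to namespaces; channel gating via
channel_specific_rule is omitted here):

```python
from types import SimpleNamespace


def sms_sender_change_rule(pending, event) -> bool:
    """Close the pending episode when a NEW contact texts."""
    sender = getattr(event, "sender_id", None)
    if sender is None:
        return False
    if not pending.sender_ids:  # cold-start guard: no known sender yet
        return False
    return sender not in pending.sender_ids


cold = SimpleNamespace(sender_ids=set())   # opened with sender_id=None
warm = SimpleNamespace(sender_ids={"alice"})
```

Without the `not pending.sender_ids` guard, the cold case would close on
the first non-None sender, because membership in an empty set is
vacuously false.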

- EpisodeStore.episode_membership_filter(*, membership_mode="any",
  **criteria) — Round 1 Arch important #3: lives on EpisodeStore
  where the inverted index is, NOT on Hippocampus. Round 1 Arch
  important #4: filter axes are introspected from Episode dataclass
  fields via **criteria — pass channel="sms", sender_ids="alice",
  thread_id="x", or any other Episode field. Round 1 Arch critical
  #1: ships membership_mode parameter with both "any" and "exclusive"
  semantics (Stage 1 implements both per the spec; P4 cross-modal
  will be the primary "exclusive" consumer for bridge-node detection).
  Collection fields (sender_ids tuple) match via membership; scalar
  fields match via equality. Strings are explicitly NOT treated as
  collections.
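
A hedged sketch of the filter semantics (Episode reduced to two fields):
"any" keeps a node if ANY of its episodes match the criteria; "exclusive"
requires ALL of them to match, which is what drops bridge nodes.
Collection fields match via membership, scalars via equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ep:
    channel: str
    sender_ids: tuple[str, ...]


def matches(ep: Ep, **criteria) -> bool:
    for field_name, wanted in criteria.items():
        value = getattr(ep, field_name)
        # Strings are NOT collections; only real containers use membership.
        if isinstance(value, (tuple, set, frozenset)):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


def node_passes(eps: list[Ep], *, membership_mode: str = "any",
                **criteria) -> bool:
    if not eps:
        return False
    hits = [matches(ep, **criteria) for ep in eps]
    return all(hits) if membership_mode == "exclusive" else any(hits)


sms = Ep("sms", ("alice",))
vision = Ep("vision", ())
```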

src/maxim/memory/hippocampus.py:

- channel_membership_filter(channel, sender=None, *, membership_mode)
  — thin convenience alias forwarding to
  self._episode_store.episode_membership_filter(...). Kept on
  Hippocampus for ergonomics; the real logic lives on EpisodeStore.

- add_boundary_rule docstring extended to document the Round 1 Exec
  important #3 contract: boundary rules are NOT persisted; callers
  must re-add P3b rules at construction time on every load_state.

── New tests ──

tests/substrate/test_p3b_channel_integration.py — 28 tests across
8 test classes, all green:

TestP3bRuleFactories (8 tests):
- sms_gap_rule fires/doesn't fire across the gap threshold
- sms_gap_rule channel-gated (no fire on non-SMS)
- sms_sender_change_rule fires/no-ops on new/known/None contact
- test_sms_sender_change_rule_no_op_on_cold_start (Round 1 Exec C1
  regression)
- sms_sender_change_rule channel-gated

TestChannelSpecificRule (3 tests):
- Wrapper gates on event.channel (Round 1 Exec I2 regression)
- Wrapper does NOT gate on pending.channel (the load-bearing
  invariant — verifies the wrapper fires when pending and event
  channels DIFFER but the event matches the target)
- Inner rule result passthrough

TestRuleComposition (1 test):
- Round 1 Arch minor #1 regression: P3b rules compose ADDITIVELY
  with P3a defaults (the default channel_change_rule still fires on
  channel switch even when an SMS-specific rule is also installed)

TestMembershipFilter (7 tests):
- ANY mode: channel filter retains nodes from any matching episode
- Empty episodes returns False
- Sender criterion via collection membership
- No-match returns False
- EXCLUSIVE mode drops cross-channel nodes (Round 1 Arch C1
  regression — P4 entry point validation)
- membership_mode validation (raises on bogus value)
- Filter introspects arbitrary Episode dataclass fields (thread_id
  filter works without any new code — Round 1 Arch I4 regression)

TestChannelFilteredRetrieval (4 tests):
- Channel filter drops cross-channel neighbors via retrieve_on_cue
- Sender filter drops other-sender neighbors
- No filter (sanity) returns all neighbors
- test_channel_filter_mixed_hop_breaks_transitive_path (Round 1 Exec
  C3 regression): channel-filtered multi-hop is path-strict, NOT
  bridge-transparent. If an intermediate is in a non-matching
  episode, the BFS stops there even if downstream nodes pass the
  filter.

TestPersistenceContract (3 tests):
- Episodes round-trip preserves channel metadata
- test_boundary_rules_NOT_persisted_post_load (Round 1 Exec I3
  regression): the contract is explicit — load_state restores
  episodes only, not rules
- test_post_load_caller_can_re_register_rules: verifies the
  re-registration pattern works as expected

TestConcurrency (1 test):
- test_channel_filter_no_deadlock_under_concurrent_capture: the
  filter callback runs inside spreading_activation's graph lock and
  queries episodes_containing inside EpisodeStore._lock — verifies
  the acquire order doesn't reverse. Inherits the P3a Stage 1
  pattern.

TestP3bWireDiscipline (1 test):
- AST grep for `if self._<biosystem>` truthy checks in
  memory/episode.py and Hippocampus.channel_membership_filter.
  Reuses the P3a Stage 1 invariant.

── Test state ──

- tests/substrate/test_p3b_channel_integration.py: 28 new tests
- tests/substrate/test_p3a_episode_binding.py: 24 (Stage 1, 0
  regressions)
- tests/substrate/test_p3a_fixture_validation.py: 19 (Stage 2, 0
  regressions)
- tests/unit/test_bio_system_snapshot.py: 52 (P3.5, 0 regressions)
- Full fast suite: 4491 passed, 1 skipped, 0 regressions
- Ruff check + format clean

── Stage 1 deliverables NOT in this commit (Stage 2 + 3) ──

- scenarios/substrate/channel_episodes.yaml fixture
- Cue-aware metadata-grep baseline + Hebbian one-hop second
  head-to-head
- 10-seed sweep + variance-source dropout
- Results JSON + experiment writeup + reproduction protocol
- Pre-merge review round (Executor + Architecture lenses on the
  code — Round 1 was the plan-only review)

Stage 1 unblocks P4 seam consumption (the membership_mode parameter
and the introspected-criteria filter shape are committed). Stage 2
will unblock P4 baseline-comparison work.
dennys246 added a commit that referenced this pull request Apr 15, 2026
Two parallel reviewers (Executor lens + Architecture lens) ran against
commit 40b5500; in addition, a self-found critical surfaced during the
wait. Total folded:

- 1 self-found critical (lock-inversion deadlock, cross-confirmed by
  Executor lens)
- 2 cross-confirmed criticals (unknown Episode field validation,
  Episode None-scalar handling)
- 5 important findings (3 Exec, 2 Arch)
- 5 minor findings folded; 2 deferred with rationale

── Critical folds ──

1. **Lock-inversion deadlock between filter callback and concurrent
   episode close** (self-found + Exec critical, cross-confirmed).
   `DependencyGraph.spreading_activation` holds binding_graph._lock
   for the entire BFS and invokes node_filter inside the critical
   section. The earlier (Stage 1 initial) `episode_membership_filter`
   was lazy — its closure called `self.episodes_containing(node_id)`
   per filter invocation, acquiring `EpisodeStore._lock` while
   `binding_graph._lock` was held. Meanwhile
   `Hippocampus._close_pending_episode_locked` acquires store_lock
   first (via store.add) then graph_lock (via apply_hebbian_on_close).
   Classic AB-BA inversion. P3a Stage 1 was safe because no caller
   ever passed a `node_filter` that touched the store; P3b is the
   first caller to do so.

   **Fix:** snapshot-based filter. `episode_membership_filter` walks
   `_episodes` once under `_lock` at build time and returns a
   lock-free closure over a `frozenset[str]`. The closure does
   `node_id in allowed` — no lock acquisition per call. As a bonus,
   the snapshot gives consistent point-in-time retrieval semantics.

   Regression guards (2 new tests):
   - test_filter_closure_holds_no_lock_after_construction: spawns a
     side thread that grabs EpisodeStore._lock and holds it; main
     thread then calls the filter. Asserts the call returns in
     <0.5s. With a lazy filter this would block until timeout.
   - test_filter_snapshot_does_not_see_episodes_added_after_construction:
     pins the snapshot-staleness contract so a future refactor can't
     quietly switch back to lazy lookup.

   Also updated _close_pending_episode_locked docstring to document
   the lock-inversion risk + name the snapshot fix as the
   load-bearing invariant.
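
   The snapshot pattern can be sketched with a toy store. The real EpisodeStore carries full Episode records, an inverted index, and criteria matching; this keeps only the locking shape:

   ```python
   import threading

   class EpisodeStore:
       """Toy store: episode id -> member node ids."""

       def __init__(self) -> None:
           self._lock = threading.Lock()
           self._episodes: dict[str, set[str]] = {}

       def add(self, episode_id: str, node_ids: set[str]) -> None:
           with self._lock:
               self._episodes[episode_id] = set(node_ids)

       def episode_membership_filter(self):
           """Snapshot-based filter: walk _episodes ONCE under _lock at
           build time and return a lock-free closure over a frozenset.
           The closure can then run inside another component's lock
           (e.g. the binding graph's) with no AB-BA inversion risk."""
           with self._lock:
               allowed = frozenset(
                   node_id
                   for members in self._episodes.values()
                   for node_id in members
               )
           return lambda node_id: node_id in allowed
   ```

   Episodes added after construction are invisible to an existing closure; that staleness is the contract the second regression test pins.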

2. **Unknown Episode field criteria silently returned empty filters**
   (Exec important #1 + Arch critical #1, cross-confirmed). The
   pre-fold `_matches` used `getattr(episode, key, None)` + a None
   short-circuit, conflating typos (`channnel="sms"`) with missing
   fields and with legitimately-None scalar values. The Arch lens
   framing was sharp: "P4 cross-modal mug test would pass at 0%
   collapse with an empty filter and nobody would know why."

   **Fix:** enumerate `dataclasses.fields(Episode)` once at filter
   build time and `raise ValueError` on unknown criterion keys.
   Regression guard: test_unknown_episode_field_raises_at_build_time.

3. **`field_value is None` short-circuit broke legitimate None scalar
   matches** (Arch critical #2). `Episode.thread_id: str | None` and
   `Episode.scn_tag: str | None` are nullable. A caller filtering
   `thread_id=None` (intent: "match episodes with no thread") got
   zero matches because the None short-circuit bailed before the
   equality check. Conflates "field missing" with "field is
   legitimately None."

   **Fix:** split the typo check (now done up-front via dataclass
   field enumeration) from the value check. Once the typo check has
   passed, `getattr(episode, key)` returns the field value directly
   and the equality / membership match handles None correctly.
   Regression guard: test_filter_matches_legitimate_none_scalar_field.
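
   The build-time validation plus None-tolerant matching from folds 2 and 3 can be sketched as follows. The Episode fields here are a subset, and `make_matcher` is a hypothetical stand-in for the real filter builder:

   ```python
   import dataclasses

   @dataclasses.dataclass(frozen=True)
   class Episode:
       channel: str
       sender_ids: tuple = ()
       thread_id: str | None = None

   _FIELDS = {f.name for f in dataclasses.fields(Episode)}

   def make_matcher(**criteria):
       """Reject unknown criterion keys at BUILD time, so a typo raises
       instead of becoming a silently-empty filter; treat None as a
       legitimate scalar value rather than 'field missing'."""
       unknown = set(criteria) - _FIELDS
       if unknown:
           raise ValueError(f"unknown Episode field(s): {sorted(unknown)}")

       def matches(episode: Episode) -> bool:
           for key, wanted in criteria.items():
               value = getattr(episode, key)   # typo check already passed
               if isinstance(value, tuple):    # collection: membership
                   if wanted not in value:
                       return False
               elif value != wanted:           # scalar: equality, None ok
                   return False
           return True

       return matches
   ```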

── Important folds ──

4. **`Hippocampus.channel_membership_filter` per-axis alias replaced
   with `Hippocampus.episode_filter(**criteria)`** (Arch important
   #1). The narrow alias would have forced P4 to add
   `modality_membership_filter`, P5 to add `stress_membership_filter`,
   etc. — N per-axis aliases each requiring sync with the underlying
   signature. The general `**criteria` form takes any combination of
   `Episode` field matchers and inherits future axes for free.
   Common-case ergonomics preserved via plain kwargs:
   `h.episode_filter(channel="sms", sender_ids="alice")`. No
   external callers existed yet, so the rename is free.

5. **`sms_gap_rule` and `sms_sender_change_rule` renamed to
   `channel_gap_rule(channel, max_gap_ticks)` and channel-agnostic
   `sender_change_rule()`** (Arch important #2). The earlier `sms_*`
   factories hard-coded "sms" three layers deep and would have
   spawned a per-channel rule zoo as P5/P6 added email / slack /
   voice. The renamed factories take the channel as an explicit
   parameter (or omit it entirely, leaving callers to compose with
   `channel_specific_rule` if they want gating). Regression guard:
   test_channel_gap_rule_works_for_arbitrary_channel + new
   test_sender_change_rule_composes_with_channel_specific_rule.

6. **Tuple criterion values raise TypeError** (Exec important #3).
   Pre-fold: passing `sender_ids=("alice", "bob")` (intent: "any of
   these senders") silently returned nothing because `tuple in
   tuple` checks element membership, not subset. Fix: type-check at
   filter build and raise. Subset / set-intersection semantics
   deferred to a future plan. Regression guard:
   test_collection_criterion_value_raises_typeerror.

7. **`exclusive` mode + zero-containing-episodes pinned to False**
   (Exec important #2). The vacuous-truth alternative ("every (zero)
   episode matches") was rejected as practically wrong. Regression
   guard: test_exclusive_mode_zero_episodes_returns_false.

8. **`exclusive` mode shipped with a 2-criteria mug-shape test**
   (Arch important #4). Uses `scn_tag` as a stand-in for the P4
   `modality` field that doesn't exist yet, validating the
   `**criteria` cross-criteria pattern works under `exclusive` mode.
   Regression guard: test_exclusive_mode_with_two_criteria_cross_mug_shape.

── Minor folds ──

9. test_boundary_rules_NOT_persisted_post_load now snapshots the
   default rule count from a baseline Hippocampus rather than
   hard-coding `len == 3` (Arch minor #1). Survives any future P3.5
   default-rule-count change.

10. test_wrapper_correct_in_isolation_without_default_rules
    (Arch minor #3). Builds a detector with ONLY
    `channel_specific_rule("sms", always_fire)` — no defaults — and
    verifies the wrapper still gates correctly on event.channel when
    pending.channel doesn't match. Pre-fold, the existing test ran
    with the full default rule set, masking the difference between
    event-channel and pending-channel gating in subset cases.

11. EpisodeBoundaryDetector class docstring documents the per-rule
    O(1) invariant + flags the P5 perf consideration (Arch important
    #3, deferred to P5 with a documented entry condition).

── Deferred with rationale ──

- **Per-channel rule index for P5 stress.** The `any()` evaluation
  is O(rules × events). Stage 1 ships 3-5 rules; the index becomes
  worth building when the rule list crosses ~8-10 entries. Documented
  in EpisodeBoundaryDetector docstring as a P5 entry condition.
- **`add_boundary_rule` runtime signal on load** (Arch minor #2).
  Post-load instances silently lose their custom rules; the contract
  is documented but there's no operator-visible event. A future
  ergonomics enhancement, not load-bearing for Stage 1.

── Test state ──

- tests/substrate/test_p3b_channel_integration.py: 28 → 38 tests
  (+10: 2 deadlock guards already in place from self-found fold,
   plus 8 from this Round 2 fold for the criticals + importants)
- tests/substrate/test_p3a_episode_binding.py: 24 (Stage 1, 0
  regressions)
- tests/substrate/test_p3a_fixture_validation.py: 19 (Stage 2, 0
  regressions)
- tests/unit/test_bio_system_snapshot.py: 52 (P3.5, 0 regressions)
- Full fast suite: 4501 passed, 1 skipped, 0 regressions.
- Ruff check + format clean.

── False positive REJECTED (not folded) ──

Round 1 Executor flagged "nothing populates PendingEpisodeState
.sender_ids" as critical. Verified against actual code:
src/maxim/memory/hippocampus.py has TWO writers — _start_episode
populates from the first event's sender_id and observe_episode_event's
extension path also appends. The Executor reviewer's grep missed both.
Documented in the Round 1 fold commit (59eecf0).
dennys246 added a commit that referenced this pull request Apr 15, 2026
Two parallel reviewers (Executor lens + Architecture lens) on commits
c3a88cf..600b736. Zero directly cross-confirmed findings (lenses
focused on different layers — Executor on migration mechanics,
Architecture on cross-module interaction). Architecture lens caught
one BLOCKING-adjacent finding (#10) that needed a code-level
investigation before fold; trace confirmed it is a LATENT risk, not an
active bug.

Architecture #10 (latent bridge × subscriber attribution-asymmetry):
post-Wave-1, CLI fast paths now wire BOTH `create_pain_nac_subscriber`
AND `ToolPainBridge` to the same PainBus. The bridge has a
`_pending_tools` guard at `_on_embodiment_pain`; the subscriber does
NOT. **Today, the subscriber's `record_outcome_full` silently fails
to link pending tool events** because `record_tool_start` records the
pending event with a stripped 1-key `{"params": ...}` context, the
body's `_publish_pain` outcome carries 7 keys with zero overlap, and
`_context_similarity` (directional, denominator = `len(ctx1)` per the
Substrate P2 Stage 2 rule) computes `0/1 = 0.0` — below the 0.5
threshold, no link forms. Net: no double-counting today. **The
correctness is load-bearing on this similarity mismatch**, not on any
guard. If anyone enriches `record_tool_start`'s context (e.g., adds
`entity` for the broad-guard narrowing that the bridge docstring at
lines 367-371 explicitly contemplates), the trap activates and
double-counting starts silently.
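
The similarity arithmetic behind the 0/1 = 0.0 outcome, as a toy; the real `_context_similarity` signature and weighting are assumed from the description above:

```python
def context_similarity(ctx1: dict, ctx2: dict) -> float:
    """Directional overlap: shared key/value pairs over len(ctx1),
    following the Substrate P2 Stage 2 rule described above."""
    if not ctx1:
        return 0.0
    shared = sum(1 for k, v in ctx1.items() if ctx2.get(k) == v)
    return shared / len(ctx1)

# The latent trap in numbers: a 1-key pending context against a 7-key
# outcome context with zero overlap scores 0/1 = 0.0, below the 0.5
# link threshold, so no pending/outcome link forms today.
pending_ctx = {"params": "{}"}
outcome_ctx = {f"outcome_key_{i}": i for i in range(7)}
```

Enriching the pending side with even one key that also appears in the outcome would push the ratio over the threshold — exactly the trigger the tripwire test watches for.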

Per the no-band-aid rule, NOT fixed in this PR by adding a
`_pending_tools` guard to `create_pain_nac_subscriber` (would couple
the subscriber to bridge state across `proprioception/pain_bus.py`
and `bridges/tool_pain_bridge.py` with no clean ownership
relationship — coupling band-aid). Instead:

1. New shell plan `pain_bus_bridge_subscriber_unification.md`
   captures the trigger conditions and the non-band-aid fix
   (Option B: bridge-aware subscriber via
   `build_pain_bus(*, tool_pain_bridge=...)` parameter, OR invert the
   wiring so `ToolPainBridge` mediates NAc attribution and the
   subscriber falls back when no bridge is wired). Both shapes need
   their own audit + design pass — sibling plan to this Wave 1.
2. Tripwire test
   `test_pain_bus.py::TestBuildPainBus::test_subscriber_does_not_link_pending_tool_event`
   pins the current behavior. Fails LOUDLY the moment anyone enriches
   `record_tool_start`'s context dict, forcing the deeper plan to
   open. The test docstring explicitly says: DO NOT relax the
   assertion.
3. CLAUDE.md `build_pain_bus` invariant updated with the latent-risk
   warning + tripwire reference.
4. `pain_bus_unification.md` audit doc gains a "Latent risk surfaced
   during pre-merge review" section enumerating the four band-aid
   options that were rejected and the non-band-aid path.

Architecture #2 (L4 commitment) + #3 (Gap B rationale wording):
audit doc Gap B section reworded. Prior wording ("DN has no other use
for hippocampus") was shaky because `agentic_runtime.py:719` IS DN's
hippocampus consumer, just externally wired. Replaced with the cleaner
framing: DN currently *constructs* its own bus; the Wave-2 fix is to
invert ownership so the agent runtime constructs the bus via
`build_pain_bus(...)` and INJECTS it into DN, which becomes a bus
*consumer*. Added explicit L4 commitment that Wave 2 MUST do the
inversion, NOT add `pain_detector=`/`hippocampus=` parameters to
`build_pain_bus` (which would violate "gate on the learning subject,
not the signal source").

Executor #1 (sim orchestrator:69 deferral note): Site #4 in the audit
table now explicitly documents why it is NOT migrated — the bus must
exist at line 69 (the sandbox's `PainTriggerLayer` needs it) BEFORE
`aut_hippocampus`/`aut_nac` are constructed at line 613+. Migrating
would require restructuring the construction-timing relationship.
The site is already behaviorally correct (subscribes both standard
learners) so it does not contribute to Gap A. Future Wave 1.5/Wave 2
candidate.

Executor #2 (dead `hippo_marker` test code): removed the unused
inner function from
`test_additional_subscribers_registered_after_standard_learners` —
artifact from an earlier draft. Test still uses
`hippo.capture.side_effect` to record ordering; behavior unchanged.

Executor #3 (history_size hard assertion): replaced the smoke check
with `assert bus.reaction_bus._history.maxlen == 42`. Catches a
regression where the kwarg silently fails to flow.

Executor #5 (interactive sim logger.info parity): added
`logger.info("Interactive sim PainBus wired ...")` to the interactive
sim path so all three CLI migration sites have parallel observability.

Tests: `tests/unit/test_pain_bus.py` 28 passing (was 27 — added the
tripwire test). `tests/unit/test_build_executor.py` 15 passing.
Combined 43 passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 15, 2026
Two parallel reviewers (Executor + Architecture lenses) produced 20
total findings against P4 Stage 1; this commit folds the IMPORTANT
and MINOR items that require code changes. No CRITICAL findings.
4702 tests pass (185 across the P3a/P3b/P3.5/P4/hippocampus slice).

Cross-confirmed folds (both lenses caught independently):

Consistency windows:
- hippocampus_persistence.load_state now wraps the ENTIRE episode-
  binding restore block (episode_store.load, binding_graph rebuild,
  ordinal restore, node_modality clear-then-load) in a single
  `with self._episode_lock:` inside the existing `_rwlock.write()`.
  Pre-fold split regime held _rwlock.write() only for the first
  three and acquired _episode_lock separately for just the sidecar,
  exposing a torn-read window to any concurrent observe_episode_event.
  Lock order _rwlock then _episode_lock verified deadlock-free by
  inspection (Arch-lens #1).
- hippocampus_persistence.dump mirror: episode-binding reads
  (episode_store.to_dict, _next_episode_ordinal, _node_modality) now
  all happen under a SINGLE _episode_lock acquisition nested inside
  the existing _rwlock.read() block.
- hippocampus._close_pending_episode_locked flips the drain order:
  modality buffer now drains into _node_modality AFTER
  _episode_store.add and apply_hebbian_on_close both succeed. Drain
  is a pure dict update, so drain-last means if either earlier step
  raises, all three pieces of episode-binding state stay consistently
  unmutated. The opposite drain-first ordering introduced a "sidecar
  entries without graph nodes" failure window that Exec-lens #6
  flagged.
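
The drain-last ordering can be sketched as a hypothetical free-function stand-in for `_close_pending_episode_locked`, with locking elided:

```python
def close_pending_episode(store, graph, pending, modality_buffer, node_modality):
    """Drain-last ordering: both fallible steps run first; the modality
    drain is a pure dict update that cannot raise and runs only after
    they succeed, so a raise leaves all three pieces unmutated."""
    store.add(pending)                     # may raise
    graph.apply_hebbian_on_close(pending)  # may raise
    node_modality.update(modality_buffer)  # pure dict update
    modality_buffer.clear()
```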

Cue handling:
- retrieve_cross_modal adds explicit runtime validation of
  target_modality literal ("text" or "vision"), mirroring load_state's
  validation. Typed Literal is only checked by mypy; callers without
  mypy (tests, notebooks) would silently return [] on a typo, exactly
  the silent-no-op failure mode the Literal was added to prevent
  (Exec-lens #8).
- retrieve_cross_modal docstring updated with explicit point-in-time
  read semantics (Arch-lens #2) and explicit single-hop-only cross-
  modal limitation (Arch-lens #3 — multi-hop paths through same-
  modality intermediates are silently truncated; decision deferred
  to Stage 2/3).

Test additions (9 new tests, total 34 P4 Stage 1 tests vs 25 pre-fold):
- TestSnapshotPatternFilter: frozenset content assertion pins
  `frozenset({"vision_mug"})` exactly, not just "some frozenset"
  (Exec-lens #1 — shape-only check let a regression writing
  `frozenset(_node_modality.keys())` slip through).
- test_load_state_rejects_unknown_modality_value: seeded with prior
  state, asserts EVERY piece (sidecar, episode store, ordinal,
  binding graph edges) is intact after the raise (Exec-lens #3).
- test_node_modality_stale_entries_cleared_on_rollback: the broken
  nac adapter now asserts hippocampus HAS been mutated to state C at
  raise time. Guards against a future SNAPSHOT_KINDS reordering
  silently trivial-passing the rollback path (Exec-lens #2).
- TestCueExemptionWithInGraphUntaggedCue: directly exercises the cue-
  exemption branch with a cue that IS in the binding graph but NOT
  tagged — pre-fold test used a cue absent from the graph entirely
  so spreading_activation early-returned before touching the
  exemption (Exec-lens #4).
- TestLastWriteWinsOnDuplicateNodeIdWithinEpisode: pins the drain's
  last-write-wins contract for the degenerate same-node-two-
  modalities-in-one-episode case (Arch-lens #7).
- TestStageThreeLimitation: pins the current single-hop-only cross-
  modal limitation. If a future refactor enables multi-hop traversal
  through same-modality intermediates this test FAILS and forces an
  explicit Stage 2/3 design decision (Arch-lens #3).
- TestConcurrencyCrossLockSmoke: spawns concurrent dump + observe
  workers for 0.5s and asserts both make forward progress. Guards
  against any future _episode_lock holder acquiring _rwlock that
  would deadlock against dump's _rwlock then _episode_lock order
  (Exec-lens #5).
- TestFixtureGeometry::test_within_pair_similarity_is_dim_invariant:
  dim-parametrized across 64/128/384/512/768. Pins the Arch-lens #4
  fold that rescaled noise_scale to noise_scale / sqrt(dim) so
  within-pair similarity is dim-independent. A Stage 2 caller picking
  CLIP's native 512-d space with the pre-fold coupling would have
  produced within-pair sim ~0.44, right at EC's 0.40 threshold.

Fixture cleanup:
- p4_fixture_gen renames noise_std to noise_scale and rescales per-
  element std as noise_scale / sqrt(dim) so within-pair expected
  cosine similarity is invariant in dim. Default noise_scale=0.4
  gives within-pair ~0.862 at any reasonable dim (Arch-lens #4).
- p4_fixture_gen cosine_similarity drops the zero-norm guard and
  lets numpy surface nan loudly on degenerate inputs (Exec-lens #10).
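
The dim-invariance claim can be checked numerically. A pure-Python sketch of the rescaling; the real fixture generator uses numpy and its own vector construction:

```python
import math
import random

def within_pair_cosine(dim: int, noise_scale: float = 0.4, seed: int = 0) -> float:
    """Cosine similarity between two noisy copies of one unit base
    vector. Per-element noise std is noise_scale / sqrt(dim), so the
    total noise norm, and hence the expected similarity of roughly
    1 / (1 + noise_scale**2) ~ 0.862, is independent of dim."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in base))
    base = [x / norm for x in base]
    sigma = noise_scale / math.sqrt(dim)  # the rescaling under test

    def noisy_copy() -> list:
        v = [x + rng.gauss(0.0, sigma) for x in base]
        n = math.sqrt(sum(y * y for y in v))
        return [y / n for y in v]

    a, b = noisy_copy(), noisy_copy()
    return sum(x * y for x, y in zip(a, b))
```

With a fixed per-element std (the pre-fold coupling), the same computation would drift with dim — the ~0.44 similarity at 512-d mentioned above.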

Persistence cargo-cult cleanup:
- hippocampus_persistence dump and load_state drop the
  getattr(self, "_episode_lock", None) defensive patterns. Both
  fields are set unconditionally in Hippocampus.__init__ and there
  is no subclass that omits them (Arch-lens #8).

Deferred / noted in PR description:
- Arch #5 (error-message PII): inapplicable — cue node ids are
  internal validated identifiers.
- Arch #6 (reset-path hygiene): audit came back empty, no existing
  reset paths touch episode state.
- Exec #9 (_start_episode redundant last_tick): cosmetic.
- Open Stage 2/3 design decision: should Stage 3 split
  retrieve_on_cue's node_filter into traversal_filter and
  result_filter to enable multi-hop cross-modal paths through same-
  modality intermediates? TestStageThreeLimitation is the pin;
  decision needs to happen BEFORE Stage 3's metric is frozen.

Test surface after fold: 34 P4 Stage 1 tests, 185 across broader
episode/persistence/hippocampus slice, 4702 full fast suite clean.
Only failures are 2 pre-existing worktree-environmental failures in
TestVersionInfo unrelated to P4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 mentioned this pull request Apr 15, 2026
dennys246 added a commit that referenced this pull request Apr 15, 2026
Three-lens pre-merge review (Executor + Architecture + Blast Radius)
found 2 BLOCKING + 9 IMPORTANT + 6 MINOR findings against the C3.3
branch tip. Two of the BLOCKING/IMPORTANT-tier findings were
cross-confirmed across lenses — the strongest trust signal the
review pattern produces.

All 17 folded in this single commit. Regression guards added for
every non-cosmetic fold. Bisect surface stays atomic.

## Cross-confirmed folds (highest trust)

**CC1 [BLOCKING] — Probe cache URL mismatch is a silent no-op.**
Executor B1 + Blast Radius I3. `install_on_target` calls
`_clear_probe_cache(url)` with the original mesh.yml URL. The cache
in lane_backends.py is keyed by `cfg.remote_url` (the peer.yml URL
shape) with no normalization. If the two URL shapes differ (trailing
slash, /v1 suffix, query string), the clear is a silent no-op → next
startup trusts a stale `auth_rejected` entry and wires the lane off.

Fix: tolerant-match via `probe_cache._canonical_url()` on both sides
of the comparison in `clear_cache_for_url`. Normalize-on-lookup
rather than normalize-on-write so existing cache entries don't
become orphans on upgrade. Regression tests cover the trailing-slash
variant, the missing-/v1 variant, the exact-match case (no
regression), and the unrelated-entry case (no over-removal).

**CC2 [BLOCKING] — Drain TOCTOU: concurrent `--node install` can
clobber operator drain intent.** Blast Radius B1 + Executor #5.
`_run_node_install` read drain state via `read_drained_nodes()`
OUTSIDE the filelock, then called `drain_node()` which is idempotent.
Two operators running `--node install` simultaneously could both see
`was_drained=False`, both drain, both resume at the end — silently
clobbering any concurrent operator-explicit drain that landed in
between. Violates the C2 "sticky operator intent" invariant that the
code docstring explicitly claimed.

Fix: new `drain_state.drain_node_if_absent(name, ...) -> (set, bool)`
primitive. Atomic "drain if not already drained, return whether we
changed anything" under a single filelock critical section.
`_run_node_install` uses the boolean return value to decide
`drained_here`, closing the TOCTOU window completely. Matches the
C2 A2 fold pattern (enforce invariants at the state layer, not the
CLI layer). Regression test pre-seeds the drain state file and
verifies the install verb does NOT auto-resume.
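
The atomic primitive's shape, with a plain threading lock standing in for the real filelock and an in-memory set standing in for the drain state file:

```python
import threading

_lock = threading.Lock()      # stands in for the real filelock
_drained: set = set()         # stands in for the drain state file

def drain_node_if_absent(name: str) -> tuple:
    """Atomically drain `name` iff it is not already drained, in one
    critical section. Returns (drained set, whether WE changed
    anything) so the caller decides `drained_here` without a separate
    read outside the lock, closing the TOCTOU window."""
    with _lock:
        changed = name not in _drained
        _drained.add(name)
        return set(_drained), changed
```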

## IMPORTANT folds

**I1 — Coupling inversion: mesh_cli → peer/cli was upside-down.**
Architecture #1. The lazy `from maxim.peer.cli import _install_on_target,
_classify_install_tokens, KNOWN_EXTRAS` in `_run_node_install` made
the mesh-specific module depend on the 1,300-line kitchen-sink peer
verb dispatcher. Every C3.5/C3.6 verb (`--node update`, `--node
restart`, `--node llm`) would replicate this exact pattern.

Fix: new `src/maxim/peer/install_core.py` holds the shared core
(`install_on_target`, `classify_install_tokens`, `KNOWN_EXTRAS`,
`_INSTALL_POLL_DEADLINE_S`, `_looks_like_url`,
`_normalize_url_for_probe_cache`). Both `peer/cli.py::_cmd_install`
and `peer/mesh_cli.py::_run_node_install` import from the leaf. The
coupling becomes `cli.py → install_core ← mesh_cli.py` — both depend
on the leaf, neither on each other. Moves the dependency BEFORE
C3.5/C3.6 replicate the inversion and cement it by precedent.

**I2 — Missing CI grep for `/v1/admin/install` shape lock.**
Architecture #2. Plan 3 R2.5 + C3.1 both set the precedent: shared
HTTP cores get a strict CI grep locking the literal endpoint string
to a single module. `install_on_target` had the same
single-source-of-truth property but no grep.

Fix: added CI grep in `.github/workflows/test.yml` — any match of
`/v1/admin/install` outside `install_core.py` + its test file fails
CI with a hint pointing at the module + allow-list update path.

**I3 — Over-broad `http` prefix check rejects real PyPI packages.**
Executor #2. Both `_cmd_install`'s argv parser and
`_parse_node_install_tokens` used `arg.startswith("http")` to detect
URLs, which false-positives on `httpx`, `http-client`, `httpie` —
all real PyPI packages operators might install.

Fix: new `install_core._looks_like_url(token)` requires `"://"` in
the token. Applied to both parsers. Regression tests for `httpx`,
`http-client`, `httpie` in both test files, plus a guard that
confirms real URLs with `"://"` still get rejected.
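
The sniff fix is small enough to show in full, sketched here as a hypothetical free function mirroring the description above:

```python
def looks_like_url(token: str) -> bool:
    """Treat a token as a URL only if it carries a scheme separator.
    A bare startswith("http") sniff false-positives on real PyPI
    package names such as httpx, http-client, and httpie."""
    return "://" in token
```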

**I4 — Resume-after-success failure returned same exit code as
install failure.** Executor #4. Both paths returned 1, so operators
tailing exit codes couldn't distinguish "install failed, stuck in
drain" from "install succeeded, resume failed, node is upgraded but
stuck in drain."

Fix: new `_EXIT_RESUME_FAILED_POST_INSTALL = 3` constant.
`_run_node_install` returns 3 on post-install resume failure; 1 is
reserved for install-itself-failed. Exit-code documented in the
verb docstring, cli-reference, and mesh_debug runbook. Regression
test locks the code distinction.

**I5 — Cluster-key divergence between mesh.yml and peer.yml.** Blast
Radius B2. `init-mesh` copies `peer.yml::api_key` into
`mesh.yml::cluster_key` once; after that the two files are
independent. Post-rotation, one install verb 401s while the other
succeeds against the same target.

Fix (docs-only for C3.3, doctor check deferred): explicit warning in
`cli-reference.md` under the `--node install` row, dedicated
section in `mesh_debug.md` with the symptom pattern, and a new
Stage C7 roadmap entry for the future
`check_cluster_key_consistency` doctor check. The doctor check
belongs with the cluster-key rotation verb (C7 ships both together
so the check has a remediation path to point at).

**I6 — Stacked failures leave node drained with no message.** Blast
Radius I2. If a `KeyboardInterrupt` or programming bug fired between
the drain and the install result, the function exited with the
exception and the node stayed drained silently.

Fix: wrapped `install_on_target` call in `try/except BaseException`
(includes `KeyboardInterrupt` + `SystemExit`, not just Exception).
Prints the "Install interrupted. Node X is STILL DRAINED" banner
before re-raising. Regression tests for `KeyboardInterrupt`,
`ValueError`, and the pre-drained case (no banner when we didn't
change state).

**I7 — Failure-message ordering not asserted.** Executor #3. The
regression test confirmed that both "Install failed" and "STILL
DRAINED" appeared in stderr, but not that STILL DRAINED was LAST.
Operator should see the actionable recovery hint at the bottom of
their terminal.

Fix: added `test_still_drained_banner_comes_after_install_error_output`
that uses `str.rfind` + `str.find` to verify the ordering invariant.

**I8 — Was-drained sticky semantics not documented operator-facing.**
Architecture #4. The code docstring covered it; `cli-reference.md`
did not. In a C4 world where the router consults drain state, the
"install never mutates drain intent" contract is load-bearing.

Fix: explicit "Was-drained sticky" paragraph in the cli-reference
row + dedicated "pre-drained nodes stay drained after success"
section in the mesh_debug walkthrough.

**I9 — Empty `mesh.cluster_key` not rejected at construction.** Blast
Radius I1. `MeshConfig.__post_init__` validated yaml-safe characters
but not empty string. If `peer.yml::api_key` was ever empty,
`synthesize_from_peer_config` would copy it through and the install
POST would send `Authorization: Bearer ` with a trailing space →
cryptic 401.

Fix: new empty-string guard in `MeshConfig.__post_init__`, paired
with a docstring explaining the failure mode it prevents. Matches
the E7 (empty nodes) + C3.2 A1 (self_name in nodes) pattern of
hoisting parser-side validation into the construction-time check.
Regression tests for both the rejection case and the
whitespace-only documented-limitation case.
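
A minimal sketch of the construction-time guard; the real MeshConfig has more fields plus the yaml-safe-character validation, both elided here:

```python
from dataclasses import dataclass

@dataclass
class MeshConfig:
    cluster_key: str

    def __post_init__(self) -> None:
        # An empty key would flow through synthesize_from_peer_config and
        # the install POST would send "Authorization: Bearer " with a
        # trailing space, surfacing much later as a cryptic 401.
        # Whitespace-only keys still pass: the documented limitation.
        if not self.cluster_key:
            raise ValueError("mesh cluster_key must be a non-empty string")
```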

## MINOR folds

- **M1** — `install_on_target` docstring claimed "no assumptions
  about drained" but mutated probe cache. One-sentence clarification
  added.
- **M2** — Parametrized test `test_every_known_extra_classifies_as_extra`
  locks the `KNOWN_EXTRAS` ↔ classifier contract so a future pyproject
  bump can't silently drop an extra out of the classifier.
- **M3** — Positional-URL rejection error leaked the typed URL
  verbatim to stderr (could contain secrets). Fix: new
  `_redact_url_for_error()` shows only `<scheme>://...`. Regression
  test with `sk-abc123-secret-token` in the URL path confirms the
  secret does NOT appear in stderr.
- **M4** — `600.0` polling deadline is now `_INSTALL_POLL_DEADLINE_S`
  named constant at module scope in `install_core.py`.
- **M5** — Skipped by design. The `verb_args` name matches its role.
- **M6** — `,semantic,` and `,,,` edge cases now have regression
  tests in both test files.
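The M3 redaction idea can be sketched with the standard library; the helper name mirrors the one above but the body is an assumption:

```python
from urllib.parse import urlsplit

def redact_url_for_error(url: str) -> str:
    """Sketch of the _redact_url_for_error idea: keep only the scheme
    so secrets embedded in the path or query never reach stderr."""
    scheme = urlsplit(url).scheme or "unknown"
    return f"{scheme}://..."

msg = redact_url_for_error("https://host/v1/sk-abc123-secret-token/install")
assert "sk-abc123" not in msg
assert msg == "https://..."
```

Everything after the scheme is dropped rather than masked, so there is no partial-redaction logic to get wrong.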

## Not folded (deferred with notes)

- **Architecture #7** — `run_node_subcommand` dispatch-dict refactor.
  Flagged for C3.5 when the verb count hits 8.
- **Architecture #5** — step-1 vs step-6 self-guard redundancy.
  Intentional defense-in-depth; the step-1 message is better. Added
  docstring comment explaining the choice.
- **Executor Q1** — drained-for-install distinguishability. Added as
  explicit C4.5 re-check trigger in the roadmap doc with the
  operator/install/auto tri-state semantics written out.
- **Executor Q2** — multi-http* argv divergence between the two
  parsers. Minor, not worth a design round in C3.3.
- **Blast Radius I4** — leader-side Authorization echo in error
  bodies. Out of scope for a peer-side ship.

## Files touched

- NEW `src/maxim/peer/install_core.py` — shared install core
- NEW CI grep in `.github/workflows/test.yml` — /v1/admin/install
  shape lock
- `src/maxim/peer/cli.py` — thin `_cmd_install` re-imports from
  install_core, fixes http:// sniff
- `src/maxim/peer/mesh_cli.py` — `_run_node_install` rewritten to
  use `drain_node_if_absent`, `_EXIT_RESUME_FAILED_POST_INSTALL`,
  try/except BaseException, `_redact_url_for_error`, import from
  install_core
- `src/maxim/peer/mesh_config.py` — empty cluster_key guard
- `src/maxim/peer/drain_state.py` — new `drain_node_if_absent`
  primitive
- `src/maxim/runtime/probe_cache.py` — `_canonical_url` +
  tolerant-match `clear_cache_for_url`
- `tests/unit/test_peer_install.py` — 4 new test classes:
  TestHttpPrefixSniffNotOverBroad (4), TestKnownExtrasClassification
  (3), TestProbeCacheNormalizedLookup (4),
  TestMeshConfigRejectsEmptyClusterKey (2)
- `tests/unit/test_peer_mesh_cli.py` — 2 new test classes:
  TestRunNodeInstallFoldGuards (12), TestDrainNodeIfAbsent (5)
- `docs/user/cli-reference.md` — expanded --node install row with
  exit-code table, cluster-key divergence warning, was-drained
  sticky contract
- `docs/troubleshooting/mesh_debug.md` — exit-code 3 recovery
  walkthrough, cluster-key divergence section, httpx/http-client
  note
- `docs/plans/reactive_peer_mesh_roadmap.md` — Stage C7
  cluster-key consistency doctor check entry + C4.5 re-check
  trigger for drain origin tri-state

## Test results

- 4816 passed, 1 skipped, 22 deselected in the full fast suite
  (was 4786 pre-fold; +30 new regression tests from the fold)
- 377/377 green in the concentrated peer+mesh+doctor slice
- `ruff check` + `ruff format --check` clean
- `mypy` on public API files clean
- Zero regressions from pre-fold baseline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 15, 2026
…op YAML fallback (fold steps 3+4)

Stage 2 v2 fold — commit sequence steps 3 and 4 (bundled because they
touch the same file). Addresses:

- Arch #2 (CRITICAL): Flowers102.classes is not documented torchvision
  API. Pin the 102 names in-repo as FLOWERS102_CLASS_NAMES +
  assert_torchvision_classes_match_pin() drift guard called from
  load_fixture_images at load time.
- Arch #10 (IMPORTANT): fixture's class_idx was declarative dead
  weight — only checked for uniqueness, never for correctness. Now
  enforced in load_fixture_images: every FixtureClass's class_idx is
  validated against FLOWERS102_CLASS_NAMES.index(name). Drift raises
  ValueError with a change-fixture hint.
- Exec #1 (IMPORTANT): the fallback _parse_simple_yaml failed to
  strip inline `#` comments. Line 22 of p4_mug_test.yaml
  (`dataset: nelorth/oxford-flowers  # torchvision...`) was parsed as
  the full tail string, comment included, when PyYAML was absent, vs
  just "nelorth/oxford-flowers" when PyYAML was present.
  Environment-dependent silent drift.
- Exec #2 (IMPORTANT): split-brain between PyYAML and fallback
  parsers is a ticking bomb for a SHA-pinned fixture (pin protects
  bytes, not semantic interpretation).
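The Exec #1 divergence is easy to reproduce directly; PyYAML strips the whitespace-preceded inline comment, while a naive split-on-colon fallback keeps the whole tail:

```python
import yaml  # PyYAML -- a core dependency after this fold

line = "dataset: nelorth/oxford-flowers  # torchvision is what is actually used"

# The real parser strips the inline comment:
parsed = yaml.safe_load(line)
assert parsed["dataset"] == "nelorth/oxford-flowers"

# What a simple string-splitting fallback would produce instead:
naive_value = line.split(":", 1)[1].strip()
assert "#" in naive_value  # the comment leaks into the value -- silent drift
```

This is why the fix removes the fallback entirely rather than trying to make two parsers agree on semantics.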

Changes to tests/substrate/p4_fixture_loader.py:

1. New FLOWERS102_CLASS_NAMES constant: the canonical 102-entry
   tuple from Nilsback & Zisserman 2008, matching what
   torchvision.datasets.Flowers102.classes currently ships. Regen
   instructions + the change-fixture escalation path are documented
   in the constant's docstring.

2. New assert_torchvision_classes_match_pin() function: pulls
   torchvision's attribute (tolerantly — returns early if
   torchvision removed `.classes` entirely so downstream callers use
   the pinned list directly), compares against the pin, raises
   AssertionError with a diff-like message listing the first 10
   mismatches. Called from load_fixture_images as drift guard 1.

3. load_fixture_images tightened to three ordered drift guards:
   - Guard 1: assert_torchvision_classes_match_pin() (global
     ordering)
   - Guard 2: per-class class_idx consistency with
     FLOWERS102_CLASS_NAMES.index(name) (Arch #10)
   - Guard 3: per-sample label vs expected class (existing guard,
     now reads from the pinned list instead of torchvision.classes)

4. Removed _parse_simple_yaml, _parse_simple_class_list,
   _coerce_scalar (Exec #1/#2). load_fixture_descriptor now calls
   yaml.safe_load(raw) directly — no try/except fallback. PyYAML is
   a core pymaxim dependency (declared in pyproject.toml core deps)
   AND transitive via sentence-transformers in the semantic extra,
   so no caller can reach this function without PyYAML being
   importable.
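The pin-and-drift-guard pattern in changes 1-3 can be sketched as follows, with a toy 3-entry pin standing in for the real 102-entry tuple and simplified function bodies:

```python
# Toy stand-in for the 102-entry FLOWERS102_CLASS_NAMES pin.
PINNED_CLASS_NAMES = ("pink primrose", "hard-leaved pocket orchid", "canterbury bells")

def assert_library_classes_match_pin(library_classes) -> None:
    """Guard 1: compare the library's live class list against the
    in-repo pin; raise with the first mismatches listed."""
    if library_classes is None:
        return  # library dropped the attribute -- callers use the pin directly
    mismatches = [
        (i, pin, live)
        for i, (pin, live) in enumerate(zip(PINNED_CLASS_NAMES, library_classes))
        if pin != live
    ]
    if mismatches or len(library_classes) != len(PINNED_CLASS_NAMES):
        raise AssertionError(f"class-name drift, first mismatches: {mismatches[:10]}")

def check_class_idx(name: str, class_idx: int) -> None:
    """Guard 2: a fixture class_idx must agree with the pinned ordering."""
    expected = PINNED_CLASS_NAMES.index(name)
    if class_idx != expected:
        raise ValueError(f"class_idx drift for {name!r}: got {class_idx}, expected {expected}")

assert_library_classes_match_pin(list(PINNED_CLASS_NAMES))  # in sync: no error
check_class_idx("canterbury bells", 2)                      # consistent: no error
```

The pin is the single source of truth; the library comparison and the fixture comparison are both checked against it rather than against each other.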

All 11 fixture validation tests pass. No fixture YAML change — the
pinned SHA256 is unchanged because the YAML bytes are unchanged;
only the loader logic tightened.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 15, 2026
…+ 0.70 retrieval gate (fold steps 8+9)

Stage 2 v2 fold — commit sequence steps 8 (materialize new pinned
fixture) and 9 (round-trip update + retrieval gate). Addresses
Arch #5 (threshold entanglement) + Arch #6 (missing 0.70 retrieval
gate specified in plan deliverable 5).

**fixture_version bumped 1 → 2.** This commit is tagged with the
"change-fixture" prefix per the loader's pinned-SHA contract, which
requires an explicit regeneration trigger.

scenarios/substrate/p4_mug_test.yaml:

- fixture_version: 1 → 2
- dataset: "nelorth/oxford-flowers" → "torchvision.datasets.Flowers102"
  (Arch #1 follow-through: the v1 stale label is replaced with the
  actual library in use, matching the plan amendment in commit
  6de09c6)
- NEW fields at the top-level (canonical ship-shape BuildConfig):
  - build_noise_reps: 1
  - build_bridges_enabled: true
  - build_text_ec_threshold: 0.60
  - build_vision_ec_threshold: 1.01
- Extensive header comment explaining the operating point selection
  from the Phase 2D v2 sweep: "largest noise_reps at which
  bridge-enabled mean forward top-5 recall ≥ 0.90." At noise_reps=1
  the sweep measured 0.980 mean recall + +96.0% Option 2 lift, at
  noise_reps=2 recall dropped to 0.800 (fails the gate), so 1 is
  the canonical choice.
- FIXTURE_SHA256 regenerated: 967e83ed18851e1dfcad418be57f3275cf04a961462e6dc4dd055b6b71c8920b

**Note on IDE diagnostics:** the editor's YAML language server auto-
infers a schema from the v1 shape and flags the new build_* fields
as "property not allowed." These are false positives — no schema
file exists in the repo, PyYAML (the sole parser after Exec #1+#2
fold) does not enforce additional-property rules, and the new
FixtureDescriptor fields are validated by the unit tests below.
The IDE warnings will clear once the language server re-infers from
the new shape.

tests/substrate/p4_fixture_loader.py:

- FixtureDescriptor dataclass extended with 4 new build_* fields
  (defaults match v1 hardcoded values for backward compat if a
  future fixture omits them).
- load_fixture_descriptor() pulls the new fields via .get() with
  defaults so legacy callers don't break.
- FIXTURE_SHA256 bumped to match the regenerated YAML.
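The pinned-SHA contract the loader enforces can be sketched like this; the function and payload are illustrative, and the hash here is computed for the toy bytes, not the real fixture:

```python
import hashlib

def load_pinned(raw: bytes, expected_sha256: str) -> bytes:
    """Refuse to load fixture bytes that don't match the in-repo pin,
    forcing any edit through an explicit change-fixture commit."""
    actual = hashlib.sha256(raw).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"fixture drift: expected {expected_sha256}, got {actual}")
    return raw

payload = b"fixture_version: 2\n"
pin = hashlib.sha256(payload).hexdigest()
load_pinned(payload, pin)  # ok: bytes match the pin
```

As the commit notes, the pin protects bytes, not semantics — which is exactly why the single-parser change above matters.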

tests/substrate/test_p4_mug_test_roundtrip.py:

- New _build_canonical_config(descriptor) helper constructs a
  BuildConfig from the fixture descriptor's build_* fields. Shared
  between the round-trip test and the new retrieval gate test.
- _prepare_mug_test_hippocampus() now uses _build_canonical_config
  instead of hardcoded BuildConfig values. The round-trip test
  therefore exercises the canonical ship shape (noise + bridges
  ON), validating that P3.5 SessionSnapshot round-trip survives the
  full fixture shape, not just the no-noise baseline.
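The shared-helper shape can be sketched as below; the field names follow the build_* fields described above, but both dataclasses are simplified stand-ins for the real FixtureDescriptor and BuildConfig:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Descriptor:
    """Stand-in for FixtureDescriptor; defaults mirror the v2 fixture."""
    build_noise_reps: int = 1
    build_bridges_enabled: bool = True
    build_text_ec_threshold: float = 0.60
    build_vision_ec_threshold: float = 1.01

@dataclass(frozen=True)
class BuildConfig:
    noise_reps: int
    bridges_enabled: bool
    text_ec_threshold: float
    vision_ec_threshold: float

def build_canonical_config(desc: Descriptor) -> BuildConfig:
    """One shared helper so the round-trip test and the retrieval gate
    exercise the exact same canonical ship-shape config."""
    return BuildConfig(
        noise_reps=desc.build_noise_reps,
        bridges_enabled=desc.build_bridges_enabled,
        text_ec_threshold=desc.build_text_ec_threshold,
        vision_ec_threshold=desc.build_vision_ec_threshold,
    )

cfg = build_canonical_config(Descriptor())
assert cfg.noise_reps == 1 and cfg.bridges_enabled
```

Routing both tests through one constructor is what removes the threshold entanglement: the values live in the fixture, not in two sets of hardcoded literals.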

tests/substrate/test_p4_fixture_validation.py (Arch #5 + Arch #6
enforcement):

- TestFixtureV2BuildConfig (4 tests) — semantic pin on the operating
  point values:
  - fixture_version == 2 (drift guard on the bump)
  - build_noise_reps == 1 (matches Phase 2D v2 operating point)
  - build_bridges_enabled is True (shared_superclass topology is
    the ship shape)
  - build_text/vision_ec_threshold == 0.60/1.01 (Stage 2 v1
    calibration values carried forward)
- TestFixtureRetrievalGate (1 test) — the 0.70 gate the plan
  specified in deliverable (5). Runs build_and_bind with the
  canonical config, measures forward + reverse top-5 recall on all
  10 classes, asserts both means ≥ 0.70. The bar is lower than
  the 0.90 operating-point selection rule because it's a CI
  regression gate, not the stricter operating-point criterion.
- Old test_descriptor_parses_successfully loosened from
  fixture_version == 1 to >= 1 (the specific v2 pin now lives in
  TestFixtureV2BuildConfig).
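The two-bar structure — a 0.90 operating-point selection rule vs a looser 0.70 CI gate — can be sketched with the sweep numbers quoted above; the selection function is illustrative, not the actual test code:

```python
OPERATING_POINT_BAR = 0.90  # selection rule for the shipped noise_reps
CI_GATE = 0.70              # looser regression gate run in CI

def pick_noise_reps(sweep: dict) -> int:
    """Largest noise_reps whose bridge-enabled mean forward top-5
    recall clears the operating-point bar."""
    passing = [reps for reps, recall in sweep.items() if recall >= OPERATING_POINT_BAR]
    return max(passing)

# Sweep numbers from this commit: 0.980 at noise_reps=1, 0.800 at noise_reps=2.
assert pick_noise_reps({1: 0.980, 2: 0.800}) == 1
assert 0.800 >= CI_GATE  # reps=2 would clear the CI gate, yet fails selection
```

Keeping the CI bar below the selection bar means a modest recall wobble fails the operating-point criterion at sweep time without turning every CI run red.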

17/17 P4 fixture + round-trip + mechanism tests pass after the
fixture bump. The retrieval gate is a first-class CI check now,
failing loudly if a future fixture edit drops recall below 0.70.

The round-trip test is now harder: it runs against the noisy +
bridged fixture, which has 80 total episodes (50 class + 10 noise +
20 bridge) vs the v1 baseline 50. Wall clock stays ~17s. P3.5
SessionSnapshot round-trip proves the full canonical ship shape
survives serialize-kill-reload intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 17, 2026
Wave 2 of biosystem_unification. Adds build_memory_hub() to
integration/memory_hub.py — the canonical MemoryHub construction
door that always calls .connect(), ensuring the three always-created
bridges (PlanHistoryBridge, EscalationLearningBridge, FearCircuitBridge)
are alive on every hub returned by the builder.
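The "canonical construction door" pattern can be sketched as follows; the class and bridge names mirror the ones above but are heavily simplified stand-ins:

```python
class Bridge:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = False

class MemoryHub:
    def __init__(self) -> None:
        self.bridges = [
            Bridge("plan_history"),
            Bridge("escalation_learning"),
            Bridge("fear_circuit"),
        ]

    def connect(self) -> None:
        for bridge in self.bridges:
            bridge.alive = True

def build_memory_hub() -> MemoryHub:
    hub = MemoryHub()
    hub.connect()  # the builder guarantees this; bare MemoryHub() callers forgot it
    return hub

hub = build_memory_hub()
assert all(b.alive for b in hub.bridges)
```

Because the builder is the only sanctioned entry point, "constructed but never connected" (the Gap A/B failure mode) becomes unrepresentable for callers that use it.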

Audit found 4 gaps across 6 construction sites:
- Gap A: CLI non-sim agent (cli.py) — zero bridges, all dead
- Gap B: AgentFactory NPC agents — zero bridges, all dead
- Gap C: Sim orchestrator hub — dead code (missing scn/ec TypeError)
- Gap D: locals().get("fear_agent") fragility in sim AUT

Migration:
- Site #1 (cli.py): MemoryHub() → build_memory_hub() — fixes Gap A
- Site #2 (orchestrator AUT): MemoryHub() → build_memory_hub(),
  removed redundant .connect() call (fear_agent stored but never
  read by any bridge)
- Site #3 (orchestrator orch): added missing scn/ec — fixes Gap C
- Site #4 (Reachy): MemoryHub() → build_memory_hub(), keeps second
  .connect() for spatial/salience wiring from DefaultNetwork
- Site #5 (AgentFactory): MemoryHub() → build_memory_hub() — fixes Gap B

21 new tests in tests/unit/test_build_memory_hub.py.
4834 tests passing, 24 integration tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 17, 2026
…on candidates (Wave 3)

Full audit of every bio-pipeline construction site in src/maxim/.
Ten production sites: 8 umbrella callers that reproduce ~30 lines of
bio-system construction, 2 leaf factories (create.py, load.py) that
return individual systems and don't need the umbrella.

Six sites collapse into build_bio_stack: CLI non-sim (#1), sim AUT
(#4), sim orch NPC (#5), Reachy (#6), AgentFactory (#7), and the
CLI sim modes (#2/#3) that reuse #1's bio-systems. Two are not
candidates: simulation/tools.py sub-AUT (consumer, not constructor)
and the leaf factories.

Key findings:
- Construction order is consistent: Hippocampus → NAc → SCN → EC →
  (ATL, AngularGyrus) → MemoryHub → PainBus → DefaultNetwork
- ATL and AngularGyrus optional everywhere (try/except)
- DefaultNetwork is opt-in (only sites #4 and #6)
- PainBus wiring varies (CLI builds independently, sim/Reachy inject)
- MemoryHub.connect() with DN subsystems only at Reachy (#6)
- All umbrella sites already use Wave 1/2 builders

4 open design questions carried to implementation session.
biosystem_unification.md status row updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request Apr 17, 2026
…migrations

Introduces runtime/bio_stack.py::build_bio_stack(*, persistence_dir) that
composes the four Wave 1+2 builders (build_reaction_bus, build_pain_bus,
build_memory_hub, build_default_network) into a single call returning a
frozen BioStack dataclass. Collapses ~30 lines of hand-rolled bio-pipeline
construction at each entry point into ~5.

Migrated sites:
- cli.py non-sim agent (site #1)
- simulation/orchestrator.py AUT (site #4, with pre-built pain_bus=)
- simulation/orchestrator.py orch NPC hub (site #5)
- embodied_runtime/agentic_runtime.py Reachy (site #6)

Not migrated (documented):
- CLI sim modes (#2/#3): just build_pain_bus, too thin for umbrella
- AgentFactory (#7): conditional remembers/learns + auto_load deferred to
  agent_factory_canonicalization.md Wave 4

27 new tests, 4902 passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>