dennys246 added a commit that referenced this pull request on Apr 3, 2026
Replace repo_cleanup_plan.md (85% done) and file_splitting_plan.md with a single future_plans.md that:
- Maps dependencies between all 6 plans
- Absorbs remaining cleanup items as prerequisites into the feature plans they gate (e.g. #3 double LLM load is a Sim Agent prereq, S4 externalize profiles is a Multi-LLM prereq)
- Marks opportunistic items with when-to-do triggers
- Drops low-ROI items (#4, #5, #36, #37, S5, S6)
- Provides 3 execution-order options (sim-first, infra-first, breadth)

Individual design docs (simulation_agent_plan.md, multi_llm_scaling.md, agent_mesh.md, intelligent_context_upgrade.md) remain as detailed references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request on Apr 13, 2026
Ten findings from the R2 pre-merge review round, folded into a single commit. Details in the inline comments (each fix is tagged with its review number).

REAL BEHAVIOR BUGS

- #1 Stage-2 probe mis-classified HTTPAuthError (401/403) and HTTPRateLimited (429) as inference_broken. Auth-gated or throttled peers would report "model loading" instead of "rotate key" or "busy", and get the 15s short-backoff instead of the appropriate signal. Added explicit except branches BEFORE the generic HTTPError catch.
- #2 _has_local_llm_flag scanned the full argv, including subcommand names, so `maxim tunnel --llm foo` on a leader falsely flipped role to solo and _model_state_file would read active_llm_model.solo.txt on that machine for the process lifetime. Added a _SUBCOMMAND_NAMES guard.

BAND-AIDS PROMOTED

- #3 cli.py detect_and_apply_role wrapped failures in a bare try/except with a DEBUG log. Real failures would silently skip role export. Promoted to WARNING + structured role_detection_failed event.
- #4 migrate_persisted_model_file caught Exception at DEBUG. Same silent-failure pattern. Promoted to WARNING + persisted_model_migration_failed event.
- #5 Migration used .replace() without checking destination existence. On Windows this silently clobbered an existing role-specific file with legacy content. Added an .exists() branch: delete legacy, preserve role-specific, emit persisted_model_migrated with collision=True.
- #6 Moved the `_validate_base_url` compat alias in openai_backend.py from mid-file line 101 (with # noqa: E402) to the imports block so the suppression is no longer needed.

GAPS CLOSED

- #7 _model_state_file now runs migrate_persisted_model_file lazily on first access per process via a module-level flag. Python-API users who never go through cli.py::main now get their legacy active_llm_model.txt migrated instead of orphaned.
- #9 detect_and_apply_role now calls _check_leader_mode_divergence, which compares role.py's decision against leader_mode.detect_role() and emits a structured role_divergence event when they disagree. Normalizes leader_mode's "client" → role.py's "peer" semantically. Insurance against Plan 4 consolidating these.
- #10 probe_llm_server now generates a per-probe uuid4 hex[:8] correlation ID threaded through both probe_started and probe_completed events so concurrent probes can be matched.
- #11 The role_detected event's decision-rule field was renamed from `source` to `role_source` so it doesn't collide with StructuredFormatter's top-level `s` short-key (logger module name).

#8 (explicit __init__ per BackendError subclass) is a separate commit per the triage guidance — it's a real refactor with its own review surface.

Tests added:
- test_fix1_stage2_auth_rejected_not_inference_broken
- test_fix1_stage2_rate_limited_not_inference_broken
- test_fix2_subcommand_suppresses_llm_flag_scan
- test_fix2_bare_llm_still_detects_solo
- test_fix5_migration_destination_collision
- test_fix7_lazy_migration_in_model_state_file
- test_fix9_role_divergence_warns
- test_fix10_probe_correlation_id_in_events
- test_fix11_role_source_field_name

Fast suite: 4082 passing (was 4073 post-R2). CI grep clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
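The ordering fix in #1 above hinges on Python matching `except` clauses top to bottom: subclass handlers must appear before the generic parent catch, or the parent silently absorbs everything. A minimal sketch with stand-in exception classes (the real class and function names in the codebase may differ):

```python
class HTTPError(Exception):
    """Generic transport failure (illustrative stand-in for the real class)."""

class HTTPAuthError(HTTPError):
    """401/403 — key rejected."""

class HTTPRateLimited(HTTPError):
    """429 — peer is throttling."""

def classify_probe_failure(exc: Exception) -> str:
    """Map a stage-2 probe exception to a peer status signal."""
    try:
        raise exc
    except HTTPAuthError:
        return "auth_rejected"     # "rotate key", long backoff
    except HTTPRateLimited:
        return "busy"              # throttled, not broken
    except HTTPError:
        return "inference_broken"  # generic catch MUST come last
    except Exception:
        return "other"
```

Swapping the `HTTPError` branch to the top would reproduce the bug: every auth or rate-limit failure would classify as `inference_broken`.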
dennys246 added a commit that referenced this pull request on Apr 13, 2026
Pre-merge review round per the refined R2 pattern — two parallel review Claudes (Executor + Architecture lens) audited R2.5 + R2.6 on feat/llm-path-r3 and returned 18 findings (2 critical, 7 high, 5 medium, 4 low). This commit folds the 12 items that block the PR into a single fix commit on the same branch. Deferred items are listed at the end with rationale.

## CRITICAL — behavior bugs

### 1. _MaximPeerBackend.for_url mutated os.environ (Exec CRIT #1, Arch HIGH #1)

Both reviewers flagged the same race: the R2.5 factory stored the probe API key in ``os.environ["MAXIM_PEER_PROBE_KEY"]``. Under concurrent probes (lane-A and lane-B validated in parallel, the doctor --retry loop, or the deferred multi-peer dispatch path), the env var would race — lane-B's key could leak into lane-A's request, or lane-A's key could survive past its probe and mis-authenticate a later unrelated call. The pop-on-None branch also wiped any pre-existing user value for the same env key.

Fix: store the key on the backend INSTANCE via a new ``_api_key_override: str | None`` attribute consulted by ``_get_api_key()`` before the env-var fallback. ``for_url`` no longer touches ``os.environ`` at all — it builds the cfg with an empty ``api_key_env`` and writes the key to the instance override. Safe to call from threads / doctor re-probe loops / future parallel validation paths.

New regression tests:
- ``test_for_url_does_not_mutate_os_environ`` — asserts env is untouched after ``for_url(api_key=k)``
- ``test_for_url_concurrent_instances_have_distinct_keys`` — asserts two backends built back-to-back do NOT observe each other's keys
- ``test_for_url_none_key_returns_empty_string`` — asserts ``api_key=None`` yields an empty ``_get_api_key()`` (not a stale value from the env)

### 2. _emit_dispatch_exhausted bypassed the canonical normalization shim (Exec CRIT #2)

The original R2.5 aggregated-failure log read ``ctx.get("agent_id") or ctx.get("agent")`` inline — reproducing the legacy-key bridge that Plan 2 R2b's ``_normalize_request_context`` was explicitly introduced to centralize. This was the one remaining inline duplication of the shim's logic anywhere in the new router code. A future rename of the legacy ``"agent"`` key to ``"agent_id"`` would leave this call site stranded with a stale fallback; a caller passing an empty / None context would land nulls in all fields.

Fix: ``_emit_dispatch_exhausted`` now lazy-imports ``maxim.agents.llm_worker._normalize_request_context`` and reads ``ctx.request_id / .agent_id / .session_id / .lane`` off the typed result. Single canonical bridge, single rename surface. The existing test ``test_dispatch_exhausted_emits_one_warn_with_all_attempts`` was ALSO beefed up (per the reviewer's finding) to assert every multi-agent field reaches the LogRecord — the original test only checked the event-name substring, which would have allowed a regression that dropped ``agent_id`` / ``session_id`` / etc. to sail through.

## HIGH — dispatch + test coverage + CI enforcement

### 3. BACKEND_CLASSES dispatch table was a dead identity map (Exec HIGH #5, Arch LOW #7)

The R2.5 shipment had ``BACKEND_CLASSES: dict[str, str] = {"openai": "openai", "maxim_peer": "maxim_peer"}`` and ``LLMRouter._get_backend_for_provider`` hard-coded the ``"maxim_peer" / "maxim-peer"`` branch as a string literal. Zero indirection. Adding a new backend required edits in two files, and a typo in either would fall through to the router's "unknown provider type" warning silently.

Fix: ``BACKEND_CLASSES`` now maps identifier → ``(module_path, class_name)`` tuples (lazy import so optional-dependency backends don't pay startup cost).
A new helper, ``resolve_backend_class`` in ``lane_backends.py``, does the lookup + import + class validation + hyphen normalisation in one place. ``LLMRouter._get_backend_for_provider`` now calls a thin ``_maybe_dispatch_backend_class`` helper which wraps ``resolve_backend_class``; the ``maxim_peer`` / ``maxim-peer`` string-literal branch is gone. Adding a new backend is now one line in ``BACKEND_CLASSES`` plus wiring in ``_classify_backend``.

### 4. health_check had no direct unit tests (Exec HIGH #3)

``health_check`` was only exercised indirectly via the ``_probe(url, ...)`` helper in ``test_two_stage_probe.py``, which routes through ``for_url``. Four direct tests now lock in the two-stage probe contract:
- ``test_missing_base_url_returns_other``
- ``test_two_attempt_fires_on_unreachable_first_attempt``
- ``test_stage2_skipped_when_stage1_auth_rejected`` — the R2c stage-2 probe bug regression guard, now at the backend-method layer
- ``test_stage2_runs_only_when_stage1_ok``

### 5. Router + BACKEND_CLASSES dispatch had no integration test (Arch MEDIUM #6)

``test_router_typed_exceptions`` stuffed ``router._backends[...] = MagicMock()`` to bypass the backend lookup, meaning the ``"maxim_peer"`` provider-type branch in ``_get_backend_for_provider`` was never exercised by the unit tests. A typo in ``BACKEND_CLASSES`` or the normalisation step would have slipped through. Four new tests in a new ``TestBackendClassesDispatch`` class exercise the full dispatch:
- ``test_maxim_peer_type_instantiates_maxim_peer_backend``
- ``test_hyphenated_maxim_peer_type_also_dispatches``
- ``test_resolve_backend_class_unknown_returns_none``
- ``test_resolve_backend_class_maxim_peer``

### 6. No CI enforcement preventing new deprecated-shim callers (Arch HIGH #2)

The R2.6 compat-shim deviation from spec ("kept as deprecated thin delegates instead of strict delete") only holds if no new code paths grow call sites of ``probe_llm_server`` / ``llm_server_responding_at`` / ``_probe_once``.
Nothing structural prevented that.

Fix: a new CI grep invariant in ``.github/workflows/test.yml``. It matches only function-call patterns (trailing ``(``) and imports (leading ``import ``) — prose mentions in docstrings and comments are excluded by the regex itself rather than a post-filter. The allow list records every existing in-repo call site at R2.6 landing:
- ``src/maxim/runtime/llm_server.py`` (the definition file)
- ``src/maxim/runtime/lane_backends.py`` (probe_llm_server + _llm_server_responding_at usages)
- ``src/maxim/doctor/checks.py`` (probe_llm_server + _llm_server_responding_at usages)
- ``src/maxim/models/language/maxim_peer_backend.py`` (the ``_probe_once`` lazy-import inside ``health_check``)

Any new match is a CI failure with an explicit migration hint in the error message.

## MEDIUM — consistency, safety nets, dead indirection

### 7. _record_attempt_outcome now asserts _inference_lock.locked() (Arch MEDIUM #4)

The R2.5 ``_dispatch_attempts`` list is documented as safe-under-lock, but the safety is invisible to future edits. Added an explicit ``assert self._inference_lock.locked()`` at the top of ``_record_attempt_outcome`` so any future code path that calls ``_try_provider`` outside the lock (warmup probe, benchmark hook, test helper) trips loudly instead of silently corrupting the buffer across dispatches. The existing test helper ``_call_try_provider`` and the two ``_complete_text_locked`` tests were updated to acquire ``router._inference_lock`` before the call.

### 8. Auth-branch comment corrected — no misleading "ordering" reference (Exec MEDIUM #6)

The R2.5 auth-rejected / model-missing / inference-broken branches had a comment claiming "apply the long cooldown AFTER so _note_provider_failure's exponential ramp doesn't overwrite the hard value." But those branches never call ``_note_provider_failure`` at all — only the direct ``state.last_error`` assignment plus the ``_set_long_backoff`` / ``_set_short_backoff`` helper.
The comment was misleading and risked a future "cleanup" edit that added ``_note_provider_failure(...)`` "for symmetry" and silently regressed the hard-cooldown value. The comment was rewritten to say explicitly: "Do NOT call ``_note_provider_failure`` in this branch" and to explain why the hard value is load-bearing.

### 9. _probe_once docstring corrected — canonical primitive, not deprecated copy (Exec MEDIUM #7)

R2.6's docstring mislabelled ``runtime/llm_server.py::_probe_once`` as "deprecated" and "intentionally a direct copy" of a backend static method. Neither was true: after deleting the backend's static method wrapper (see #10 below), ``_probe_once`` IS the single canonical liveness primitive. Both the backend's ``health_check`` (via lazy import inside the method body) and the compat shims delegate into it. The docstring was rewritten to reflect this and now documents the circular-import avoidance (the lazy imports between llm_server.py and maxim_peer_backend.py are load-bearing).

### 10. Removed _probe_liveness_once static method (Exec MEDIUM #8)

The static ``_MaximPeerBackend._probe_liveness_once(url, api_key, timeout_s)`` was dead indirection — it only existed to call ``runtime/llm_server._probe_once`` and had no per-instance state or logic. Deleted. ``health_check`` now calls ``_probe_once`` directly (lazy-imported alongside ``_probe_stage2_readiness``). Cleanest outcome: single call site, single implementation, no dead wrapper to drift.

### 11. _record_unclassified_backend_error logs import failures once (Exec MEDIUM #9)

The R2.5 safety-net helper silently swallowed all exceptions, including import failures on ``lane_metrics``. Silent failures defeat the safety net — the whole point of ``backend_unclassified_errors_total`` is to be the canary for missing typed exception classes, and a silent zero counter masks both the target bug and any underlying import break.

Fix: ``_UNCLASSIFIED_IMPORT_WARNED`` module-level flag + ``warn(...)`` on first failure.
Subsequent calls stay silent to avoid log flood; the first-failure warning is enough to surface a broken lane_metrics module.

### 12. Streaming malformed-chunk + empty-content paths now emit per-call WARN (Exec LOW #10, #11)

Both streaming failure paths raised ``BackendDown`` without calling ``self._log_failure(...)``, so the only record of a mid-stream break was the router's aggregated ``dispatch_exhausted`` WARN — which combines all providers' outcomes into one line. Operators debugging "why did peer-X fail?" couldn't see the individual mid-stream malformed-chunk vs empty-content signal per call.

Fix: both branches now call ``_log_failure("malformed_chunk", ...)`` or ``_log_failure("empty_stream", ...)`` before raising, so each streaming failure has its own ``peer_backend_failed`` WARN with the typed exception's ``fix_hint`` + request_id + agent_id + full multi-agent context. The malformed-chunk path also includes the model name in the ``fix_hint`` so operators know which model's streaming is broken.

## Deferred — rationale

### Exec HIGH #4 — real-network performance gate

The reviewer correctly observed that ``tests/performance/test_fast_failover.py`` mocks ``_http.post`` and thus wouldn't catch a regression that added retry logic at the http-client layer (e.g., a ``max_retries=3`` kwarg on the registered endpoint). Real-network tests against dropping ports are flaky in CI. Deferred to the stress test protocol Phase D (``docs/plans/llm_path_fast_failover.md``, "Stress test protocol" section), which exercises the full path end-to-end against a real restarting leader. The mocked gate still catches every failure-mode classification bug; the real-network gate catches http-client config regressions that the mocked layer cannot see.
### Arch MEDIUM #5 — request_context dict | None typing

A tracking-only finding: the ``request_context: dict[str, Any] | None`` parameter signature of ``_MaximPeerBackend.complete_with_usage`` perpetuates the legacy dict shape even though the canonical contract is the typed ``RequestContext``. ``_OpenAIBackend`` has the same signature, so this is a multi-backend cleanup for 0.5 or later, not Plan 3 scope. Not blocking the PR.

### Arch HIGH #3 — probe JSONL events gained a "provider" field

R3 added a ``provider`` field to ``probe_started`` and ``probe_completed`` JSONL events. This is additive and does not break any downstream consumer (the existing ``probe_id`` correlation test passes unchanged). Documented here rather than via a separate schema-change commit because no tests or dashboards currently assert the event schema's field count.

## Test invariants verified after fold-in

- Fast suite: 4142 passed, 2 skipped, 20 deselected (up from 4131 post-R2.6 = +11 new tests from the review fixes)
- ``ruff check src/ tests/`` clean
- ``ruff format --check`` clean
- mypy on touched files: no new errors vs the R2.6 baseline (confirmed by running mypy on the R2.6 router.py snapshot — pre-existing str|None errors unchanged, zero new errors introduced)
- Plan 1 R1 urllib invariant: clean
- Plan 3 R2.5 no-retry-loop invariant: clean (excluding the ``retry_after_s`` + ``retry_timeout_s`` parameter names)
- Plan 3 R2.6 deprecated-shim allow list: clean (the new CI invariant now enforces this)

## Related

- Review round: Executor-lens + Architecture-lens Claudes, 2026-04-12
- Spec: ``docs/plans/llm_path_fast_failover.md``
- Prior commits on this branch: ``824d737`` (R2.5), ``d09b74d`` (R2.6)
- Review pattern: ``feedback_review_before_ship.md`` — the pre-merge reviewer round catches bugs tests don't (2 critical + 10 non-critical this round, 0 overlap with test coverage)
- Grep-before-dismissing: ``feedback_grep_before_assuming_fixed.md`` — verified every disputed finding against the actual code before folding in (no false dismissals this round)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
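The dispatch-table shape from #3 — identifier → ``(module_path, class_name)`` tuples with lazy import and hyphen normalisation — can be sketched as follows. The module paths, class names, and helper signature here are illustrative, not the repo's actual entries:

```python
import importlib

# identifier -> (module_path, class_name); import is deferred so
# optional-dependency backends don't pay startup cost
BACKEND_CLASSES: dict[str, tuple[str, str]] = {
    "openai": ("maxim.models.language.openai_backend", "_OpenAIBackend"),
    "maxim_peer": ("maxim.models.language.maxim_peer_backend", "_MaximPeerBackend"),
}

def resolve_backend_class(provider_type: str):
    """Normalise hyphens, look up, lazy-import, validate; None if unknown."""
    key = provider_type.replace("-", "_")
    entry = BACKEND_CLASSES.get(key)
    if entry is None:
        return None  # caller falls through to its "unknown provider type" warning
    module_path, class_name = entry
    cls = getattr(importlib.import_module(module_path), class_name)
    if not isinstance(cls, type):
        raise TypeError(f"{module_path}.{class_name} is not a class")
    return cls
```

With this shape, adding a backend is one dictionary entry, and both the ``maxim_peer`` and ``maxim-peer`` spellings resolve through the same key.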
dennys246 added a commit that referenced this pull request on Apr 14, 2026
Two parallel pre-merge reviews (Executor lens + Architecture lens) surfaced 13 findings. The Executor focused on correctness and test rigor; the Architecture lens focused on coupling, blast radius, and doc consistency. One finding was cross-confirmed across both lenses; the rest were unique to their respective lenses. None CRITICAL, two BLOCKING, the rest NICE-TO-HAVE. This commit folds all 13 into the same branch before PR open, per the Plan 3 R3 review-before-ship pattern (feedback_review_before_ship.md).

=== BLOCKING (Architecture) ===

**B1. Stale source-code refs to archived plan paths.**

After the plans-audit commit (711105b) moved Plans 1-3.5 into docs/plans/archive/, eight source/CI files still cited the unprefixed paths in module docstrings and CI error-hint strings:
- src/maxim/mesh/__init__.py:6 and mesh/message.py:6 (llm_path_foundation)
- src/maxim/runtime/rate_limit.py:4 (llm_path_foundation)
- src/maxim/time/scn.py:235 (llm_path_foundation)
- src/maxim/models/language/maxim_peer_backend.py:18 (llm_path_fast_failover)
- src/maxim/models/language/router.py:1042 (llm_path_fast_failover)
- src/maxim/cli.py:183 (llm_path_typed_errors)
- **.github/workflows/test.yml:84,109** (llm_path_fast_failover, twice) — this one is the worst: it's the CI grep guard's "go read the plan doc" hint printed to the operator on workflow failure. Before this fix, hitting the grep guard would land users on a 404.

All eight rewritten to docs/plans/archive/ paths.

**B2. docs/architecture/llm_routing.md:755 sub-plans list broken.**

The plans-audit commit partially rewrote this line but only caught three of the four archived plans. Plan 3.5 (cancellation_hygiene) was missing from the sub-plans list entirely. It also referenced mesh_operations.md as "created by Plan 4" — but Plan 4 Stage C is deferred, so that file does not exist yet. Fixed to add Plan 3.5, Plan 3.6, and Plan 4 ship status, and to mark mesh_operations.md as deferred to Stage C.

=== NICE-TO-HAVE (Executor, #1-#4, #7) ===

**#1. Autouse contextvar scrub missing in tests/conftest.py.**

Plan 4 A.2 added a ``current_context()`` fallback in ``_normalize_request_context(None)``. If a test leaks a ``set_context`` binding (no autouse scrub existed), every later test that asserts ``_normalize_request_context(None).agent_id is None`` silently inherits the bound context. Exact same bug class as the env-var leak lesson in CLAUDE.md.

Fix: a new ``_isolate_maxim_request_context_contextvar`` autouse fixture, matching the shape of ``_isolate_maxim_cancellation_contextvar`` (Plan 3.5 R4). Also widened ``set_context(ctx: RequestContext)`` to accept ``None`` so the scrub is type-clean.

**#2. Bench JSONL was NOT actually wire-compatible with production.**

The load-bearing claim in the commit message and plan doc was that ``jq 'select(.e=="peer_backend_call")'`` queries work unchanged on bench output. The Executor verified this by comparing field sets and found the bench omitted ``provider``, ``model``, ``input_tokens``, ``output_tokens``, and ``fix_hint``, and used ``error`` for the message string instead of the exception class name (production uses ``error=type(exc).__name__`` and ``fix_hint`` for the message).

Fixed via: ``BenchAttempt`` carries all production fields (``provider``, ``model``, ``input_tokens``, ``output_tokens``, ``error`` as class name, ``fix_hint``, ``http_status``). The success path captures the LLMResponse's ``provider``/``model``/token fields; the failure path captures ``type(exc).__name__`` + ``exc.fix_hint``. ``_classify_error`` now returns ``(outcome, fix_hint)`` where fix_hint uses the exact same ``getattr(exc, "fix_hint", "") or str(exc)`` pattern as ``_MaximPeerBackend._log_failure``.

Added TWO new regression guards that lock the bench JSONL shape against the production event fields:
- ``test_success_jsonl_field_parity_with_production_log_success``
- ``test_failure_jsonl_field_parity_with_production_log_failure``

These walk the field sets as Python sets and assert nothing from production is missing.
If production adds a new field, these fail loudly and force the bench to be updated in lockstep.

**#3. _analyse_recovery empty-list edge case.**

An empty attempt list was conflated with ``no_outage_observed`` — semantically wrong because the bench may have been killed before firing its first call. Added a dedicated ``no_attempts`` reason branch + regression test.

**#4. test_tight_loop_fires_many_calls_in_short_duration fragility.**

The test asserted ``result.successes >= 10`` in a 0.25s window — loose enough that a future regression adding e.g. ``time.sleep(0.025)`` per call would still pass. Replaced with a rate-based assertion (``observed_rate >= 40/sec``) so real hot-path regressions fail loudly.

**#7. Stale self-ref in archive/llm_path_foundation.md:363.**

The archived plan's memory snippet pointed at ``docs/plans/llm_path_foundation.md`` (its pre-archive path). Rewrote to make the move explicit ("archived 2026-04-14 when Plans 2, 3, 3.5 also landed").

=== NICE-TO-HAVE (Architecture, N2-N6, N8) ===

**N2. Memory caveat about warmup agent_id coverage.**

The memory entry's "750/750 agent_id coverage" claim was technically narrower than implied — ``_MaximPeerBackend.warmup()`` emits system-level DNS preflight events with ``agent_id=null`` by design. Added a caveat to the memory file making the scope explicit ("per-call inference events only").

**N3. accepts_request_context → supports_request_context rename.**

The Architecture reviewer pointed out the naming inconsistency with ``supports_model_override`` / ``supports_streaming`` / ``supports_tool_use``. The user's distinction (``accepts_*`` = wants forwarding vs ``supports_*`` = feature available) was too subtle to survive future refactors. The rename lands now, while the flag has only three call sites. Updated across router.py, maxim_peer_backend.py, test_router_typed_exceptions.py (6 refs), test_maxim_peer_backend.py (2 refs), the plan doc, and the memory file.

**N4. cli-reference.md missing bench subcommand.**

Added a new "Bench Harnesses" section explicitly disambiguating ``maxim bench recovery-time`` (tight-loop, Plan 4 B) from the existing ``--benchmark`` model-evaluation flag.

**N5/#5 cross-confirmed. Redundant outer set_context in bench loop.**

Both lenses independently flagged that ``run_recovery_benchmark`` called ``set_context(bench_ctx)`` at the top AND ``set_context(per_call_ctx)`` per attempt. The outer binding was dead code because the per-call binding always shadowed it. Dropped the outer ``set_context`` + ``ctx_token``; only the per-call binding remains. Removed the now-unused ``dataclasses`` import.

**N6. _classify_error subclass coverage regression test.**

The bench's ``_classify_error`` enumerates BackendError subclasses by isinstance. If a new subclass is added without updating the bench, it falls through to ``unhandled_*``, misleading the recovery signal. Added ``test_every_backend_error_subclass_has_specific_mapping``, which walks ``BackendError.__subclasses__()`` transitively and asserts every concrete subclass maps to a non-unhandled outcome.

**N8. Rerun protocol denominator clarification.**

Plan 3's ``< 10s from leader-ready`` gate and this bench's ``recovery_time_s`` measure different things with different denominators. Added a "Two different gates, two different denominators" section to bench_recovery_time_rerun.md explaining that Plan 3's gate measures peer-side wedge latency (< 10s) while the bench measures the end-to-end outage window (dominated by leader reload time). Both gates are valuable; they're complementary.
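The field-parity guards from #2 above compare field sets so that any new production field fails loudly. A minimal sketch of the set-walk idea; the event dicts here are illustrative stand-ins, not the real schema:

```python
def assert_field_parity(production_event: dict, bench_event: dict) -> None:
    """Fail loudly if the bench JSONL drops any field present in production."""
    missing = set(production_event) - set(bench_event)
    assert not missing, f"bench JSONL missing production fields: {sorted(missing)}"

# illustrative event shapes only — not the actual peer_backend_call schema
production = {"e": "peer_backend_call", "provider": "maxim_peer", "model": "m",
              "input_tokens": 1, "output_tokens": 2, "error": "", "fix_hint": ""}
bench = dict(production)  # wire-compatible by construction in this sketch
assert_field_parity(production, bench)
```

The asymmetric difference is deliberate: the bench may carry extra bench-only fields, but it must never drop a production one.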
=== Verified clean (both reviewers) ===

- set_context binding order (before copy_context) — correct
- try/finally symmetry for set_context + reset_context — correct
- Router capability-flag check doesn't break cloud backends or the Path A legacy prompt-formatting path
- CI grep invariant ``retry|backoff|gateway`` in maxim_peer_backend.py still holds (no new matches)
- The ``maxim.benchmark`` public API verb still loads correctly after the ``src/maxim/bench/`` rename
- _MaximPeerBackend.complete_with_usage still makes EXACTLY one HTTP call (Plan 3 invariant preserved)
- httpx _stream_ctx invariant (Plan 3.6 lesson) untouched
- Plan-audit git mv operations were pure renames (R100, no content drift)

=== Tests / guards ===

- 4234 passed, 0 failed in the full fast suite (+6 over the pre-fold baseline: 4 new regression tests + 2 previously-skipped SemanticEngine tests that were re-enabled)
- ruff check + format: clean on all touched files
- mypy: no new errors introduced (the 11 pre-existing errors on llm_worker.py / router.py / maxim_peer_backend.py are unchanged at unchanged line numbers)
- The cross-confirmed finding (N5/#5) — the strongest signal to trust a finding — was folded first

=== Refs ===

- Plan doc with shipped invariants: docs/plans/llm_path_operator_visibility.md
- Phase D2 hardware validation: docs/experiments/results/llm_path_stress_plan4_20260414.md
- Bench rerun runbook: docs/experiments/protocols/bench_recovery_time_rerun.md
- Memory entry: memory/project_llm_path_operator_visibility_ab_shipped.md

Follow-up to 71f7c24 (Plan 4 A+B initial) and 711105b (plans audit) on the same feature branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request on Apr 14, 2026
Two independent reviewers (Executor + Architecture lenses) ran in parallel against commits 1befe9e (P3.5 Stage 1) + 71a5921 (P3a Stage 1) and produced 22 findings total. This commit folds all critical + important findings and most minor findings. A few deferred items are flagged below with rationale.

── CRITICAL FIXES ──

1. **_next_episode_ordinal not restored on load_state** (Exec critical #1). Pre-fix: dump → fresh Hippocampus → load_state → observe_episode_event crashed with `duplicate episode id: ep_1` because the ordinal re-initialized to 0. Hippocampus.dump() now emits "next_episode_ordinal" and load_state restores it, with a fallback that derives the max ordinal from loaded episode ids for corrupt-file recovery. Regression guards:
   - test_next_episode_ordinal_restored_on_load_state
   - test_ordinal_derived_from_max_id_when_dump_missing_ordinal

2. **Hippocampus.load_state still enforced a payload version check** (Exec critical #2). This contradicted the envelope-authoritative tombstoned versioning rule documented in snapshot.py. Removed the check from Hippocampus / NAc / SCN load_state for uniformity with ATL. Hippocampus.dump no longer emits a "version" key at all (absence is the cleanest tombstone form). Regression guards:
   - test_nac_payload_version_mutation_does_not_affect_load
   - test_hippocampus_payload_legacy_version_absent_from_dump

3. **PerceptTraceBuffer.load_state broke the ring-buffer capacity invariant** (Exec critical #3 + Arch critical #2). Pre-fix: restored the entries list at arbitrary length, silently exceeding self._max_entries until the next record() call. It also silently dropped dumped tuning params (tau, tick_rate, etc.) with no diagnostic surface, causing P4 subprocess-restore across mismatched tuning to decay at the wrong rate. Fix: trim restored entries to max_entries, keeping the most recent; emit a WARN log listing every dumped-vs-live tuning mismatch. Regression guards:
   - test_load_state_trims_to_max_entries
   - test_load_state_warns_on_tuning_drift

4. **EpisodeConfig forward-reference** (Arch critical #1). Pre-fix: HippocampusConfig.episode: EpisodeConfig = ... where EpisodeConfig was defined BELOW HippocampusConfig in the same module. This worked at runtime via `from __future__ import annotations` but broke typing.get_type_hints() and dict-shaped YAML construction — a latent bug that any future pydantic adapter or YAML config surface would have hit. Moved the EpisodeConfig definition above HippocampusConfig. Verified with an inline get_type_hints() probe before commit.

── IMPORTANT FIXES ──

5. **apply_hebbian_on_close never called add_node** (Exec important #3). Pre-fix: _outgoing / _incoming populated but _nodes empty. Stage 2 binding-graph persistence via DependencyGraph.to_dict() would have silently lost the node list. Fix: add_node(id, id) for every activated node before any edge write. Regression guards:
   - test_binding_graph_nodes_populated_after_hebbian
   - test_binding_graph_to_dict_includes_nodes

6. **EpisodeStore.load_from_dict silently overwrote duplicates** (Exec important #4). Pre-fix: the direct _episodes[ep.id] = ep assignment collapsed duplicate ids and left stale _by_node inverted-index entries. Fix: raise ValueError on duplicate id. Regression guard:
   - test_duplicate_episode_id_in_loaded_state_raises

7. **Binding graph not persisted → silent asymmetry** (Arch important #4). Pre-fix: episodes round-tripped through Hippocampus.dump but binding-graph edges did not, so post-load retrieve_on_cue returned nothing until new episodes landed. Fix: rebuild the binding graph from loaded episodes inside Hippocampus.load_state via apply_hebbian_on_close, so a restored Hippocampus produces the same retrieve_on_cue results as the original. Clears internal DependencyGraph state under its lock before rebuilding. Regression guard:
   - test_binding_graph_rebuilt_on_load_state

8. **_episode_detector was private, no P3b public seam** (Arch important #1).
Fix: added Hippocampus.add_boundary_rule(rule) as a public method forwarding to self._episode_detector.add_rule. P3b channel integration will call this without reaching into private state. Regression guard:
   - test_add_boundary_rule_is_public_and_consulted

9. **SessionSnapshot silent partial-capture / partial-restore** (Exec minor #2 + Arch important #7, cross-confirmed). Fix: both capture and restore_into accept strict: bool = False. Under strict=True, a partial capture raises with the missing-system list, and restore_into raises on any mismatch between provided instances and envelope systems. P4 / P5 harnesses should use strict mode. Regression guards:
   - TestSessionSnapshotStrictMode (4 tests)

10. **test_hebbian_fires_when_atl_wired_and_empty was a no-op** (Exec important #1). Pre-fix: it set h._atl = empty_atl — but Hippocampus has no _atl field in Stage 1, so the test was attaching an arbitrary attribute that exercised no real code. Deleted. Replaced with the in-tree source scan below.

11. **test_p3a_diff_has_no_truthy_biosystem_checks silently skipped on shallow clones** (Exec important #2). Pre-fix: it used `git diff main`, which returns non-zero on shallow-clone CI (the one place the guard is most needed), falling back to pytest.skip. Fix: replaced with an inspect.getsource regex scan over the memory/episode.py module + every P3a-added Hippocampus method. Always runs, always catches violations.

12. **Protocol lock-acquisition undocumented** (Arch important #5). Fix: added a "Locking caveat" note to the BioSystemSnapshot Protocol docstring warning callers on hot-path threads to spawn dump() off to a worker thread.

13. **_graph vs _binding_graph collision risk** (Arch important #3). Rather than renaming one of the fields (bigger blast radius), added prominent docstrings on both init-time field declarations + expanded the retrieve_on_cue docstring to explicitly state it queries ONLY the binding graph.
Stage 2 may drive a rename if P3b retrieval fusion actually surfaces confusion. ── MINOR FIXES ── 14. Deadlock test iter-count threshold: `> 10` → `> 1` to avoid flaking on slow CI runners under lock contention. The "didn't deadlock" assertion via thread.join timeout remains the load- bearing check. 15. Plan file fold: updated every `load(state)` → `load_state(state)` reference in both plan files. Added a Production integration section to substrate_p3a_episode_binding.md specifying the post- Stage-1 wiring path (runtime/agent_loop.py call site, event construction from percept context, lock-order discipline). Populated the Load-bearing invariants section in both plan files with the post-Round-2 invariant list (13 for P3a, 8 for P3.5). ── DEFERRED (NOT FOLDED, WITH RATIONALE) ── - **retrieve_on_cue return shape (list[tuple] vs list[Memory])** (Arch minor): wrapping in a BindingHit dataclass is a shape decision that should happen when P3b retrieval fusion actually lands and the cost of shape divergence becomes concrete. Stage 1 deliberately uses raw tuples. - **memory_hub integration** (Arch minor): adding a MemoryHub.capture_session_snapshot() facade is plan-level orchestration that doesn't belong in this commit. Flagged for the P4 session. - **find_edge / update_edge non-atomic under multi-writer** (Exec important #5): real concern for Stage 2 multi-writer correctness but Stage 1 has no multi-writer tests or call sites — deferred to Stage 2 with an explicit comment in apply_hebbian_on_close. - **find_edge O(degree) perf under P5 stress** (Exec minor #3): flag only, deferred to P5. - **Tombstone string enforcement via AST test** (Arch minor): the docstring + comment convention plus removed load_state checks (nothing now branches on the payload version) make silent divergence extremely unlikely. Adding an AST test is cheap but not gating. 
- **load_state naming collision risk with future pydantic/msgspec conventions** (Arch important): deferred to the plan session that tackles the load(path) → load_from_path rename across 37+ call sites. Plan doc explicitly calls this out as out of scope for Stage 1. ── TEST STATE ── - tests/unit/test_bio_system_snapshot.py: 43 → 57 tests (+14) - tests/substrate/test_p3a_episode_binding.py: 18 → 24 tests (+6; removed 1 no-op, added 7 regression guards) - Full fast suite: 4360 passed, 1 skipped, 0 regressions (+15 over P3a Stage 1 baseline). - Ruff check + format clean on all touched files. ── BRANCH STATE ── feat/substrate-p3-kickoff now has 5 commits ahead of main: - 6aaa4c4 docs(plans): open P3.5 + P3a plan files - 12b0310 docs(plans): Round 1 review fold - 1befe9e feat(memory): P3.5 Stage 1 - 71a5921 feat(memory): P3a Stage 1 - THIS: fix(substrate-p3): Round 2 fold
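The forward-reference failure mode fixed above (Arch critical #1) can be reproduced in a few lines. A minimal sketch, with hypothetical field names: a string/postponed annotation lets the class body parse even though EpisodeConfig does not exist yet, and any adapter that resolves hints before the module finishes loading hits the NameError.

```python
import typing
from dataclasses import dataclass


@dataclass
class HippocampusConfig:
    # Forward reference: EpisodeConfig is defined BELOW this class.
    episode: "EpisodeConfig" = None


# A pydantic-style adapter or YAML constructor that resolves hints at
# class-definition time would run HERE — before EpisodeConfig exists:
try:
    typing.get_type_hints(HippocampusConfig)
except NameError as exc:
    print("latent bug surfaces:", exc)


@dataclass
class EpisodeConfig:
    window_ticks: int = 8  # hypothetical field, for the sketch only


# Once the whole module has loaded, resolution succeeds — which is why
# the bug stayed invisible at runtime until a hints consumer appeared.
assert typing.get_type_hints(HippocampusConfig)["episode"] is EpisodeConfig
```

Defining EpisodeConfig first removes the window entirely, which is why the fold moved the definition rather than adding a lazy-resolution workaround.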
dennys246 added a commit that referenced this pull request on Apr 15, 2026
Architecture lens: 2 critical + 5 important + 3 minor. Executor lens: 3 minor (0 critical, 0 important). All critical + important findings folded. Three minors folded, plus the fixture stale-comment fix. Two minors deferred with explicit rationale (tfidf baseline location move; p3a_metrics consolidation with p2_metrics) — rule of three: extract when 3 consumers exist.

── CRITICAL FIXES (Architecture lens) ──

1. **Flip the `retrieve_on_cue` default from `multi_hop=False` to `multi_hop=True`** (Arch critical #1). The Stage 1 "backward compat" hedge protected exactly one test suite while silently degrading every future P3b/P4/P5 caller that forgot the kwarg. The Stage 2 architectural finding says one-hop is TF-IDF-equivalent on bag-of-words and the mechanism's value lives entirely in the multi-hop path; the default must match the primary path. Stage 1 mechanism tests that specifically exercise one-hop weight semantics now opt into `multi_hop=False` explicitly.

2. **Add a `node_filter: Callable[[str], bool] | None` kwarg to `DependencyGraph.spreading_activation` and wire it through `Hippocampus.retrieve_on_cue`** (Arch critical #2). Reserves the P3b channel-filter seam and the P4 modality-filter seam without committing to semantics. A 5-line change in bus.py, wired through retrieve_on_cue; drops filtered nodes from traversal (both as sources and as hop targets). New `TestNodeFilterSeam` (3 tests) validates the seam: multi-hop and one-hop modes both honor the filter, and `node_filter=None` is an identity with the default.

── IMPORTANT FIXES (Architecture lens) ──

3. **Split `EpisodeConfig` into nested `HebbianConfig` + `RetrievalConfig`** (Arch important #4). Before: flat fields (`hebbian_init/delta/max`, `retrieval_decay/threshold/max_depth`) heading toward a kitchen sink as P3b/P4/P6 add their own knobs. After: `cfg.episode.hebbian.{init, delta, max_weight}` and `cfg.episode.retrieval.{decay, threshold, max_depth}` — each concern in its own dataclass with a natural home. `max_weight` carries the suffix to avoid shadowing the builtin. Callers updated: `Hippocampus._close_pending_episode_locked`, the `Hippocampus.retrieve_on_cue` multi-hop path, the `Hippocampus.load_state` binding-graph rebuild, and the Stage 1 test `_fresh_hippocampus` helper. Doing this now is cheap (20 lines); doing it after P6 lands is a breaking config rename.

4. **Soften the parity assertion to `multi_hop_f1 > one_hop_f1 + 0.20`** (Arch important #5). Before: `|one_hop_f1 - tfidf_f1| < 0.05` locked in the parity between one-hop and TF-IDF as a hard invariant; any future one-hop improvement (normalized weights, PageRank-style inference) would trip the test as a regression. The REAL architectural claim is "multi-hop lift is measurable and large," which is what the softened assertion locks in. Test renamed from `test_one_hop_does_not_beat_tfidf` to `test_multi_hop_lift_over_one_hop_is_real`. Current lift: 0.3045.

5. **Add 10% per-seed episode dropout as a real variance source** (Arch important #5). Before: all seeds produced byte-identical metrics (std_f1 = 0.0) and the `baseline_mean + 2×std` gate collapsed to `baseline_mean` — ceremonial. After: the generator drops 17 of 170 base episodes per seed via an independent seeded RNG. The gate now performs real statistical work:
   - Hebbian multi-hop F1: 0.9955 ± 0.0055
   - Hebbian one-hop F1: 0.6910 ± 0.0074
   - TF-IDF F1: 0.6600 ± 0.0058
   - Gate (baseline + 2σ): 0.6715
   - Margin: 0.3240 absolute
   Probes are topology-only (they do NOT depend on dropped episodes), so every seed runs the same 50 retrieval tasks; dropout affects what edges the retriever can build, not what it is asked to retrieve. Multi-hop robustness absorbs the dropout almost perfectly (F1 stays near 1.0 because the redundant hub+chain topology survives the loss of any single reinforcement episode).

── DOCUMENTATION FOLDS ──

6. **Document the `max`-aggregation ↔ reinforcement ↔ P6 extinction coupling as a load-bearing invariant + P6 entry condition** (Arch important #3). `spreading_activation` uses `max`-path aggregation, not `sum`. The Stage 2 reinforcement-doubling fix (core weight 0.4 vs peripheral 0.3) creates a fragile equilibrium that P6 extinction can collapse if it decays core weights back toward 0.3. Three options are documented for P6:
   - Add a `sum` mode to `spreading_activation` (kwarg).
   - Use distinct `EdgeType` values (`HEBBIAN_BIND` vs `ASSOCIATES`).
   - Hold core weights above a strict floor during extinction.
   P6 must pick one before implementing extinction.

7. **Document the P8 replay + `load_state` rebuild double-count risk** (Arch important #6). `apply_hebbian_on_close` is called during both persistence load and (future) P8 sleep replay. If both fire on the same episode in one session, edge weights double-apply (silently absorbed by the `hebbian_max=1.0` clamp today, but it will bite if anyone raises the clamp). P8 entry condition: add an `(episode_id, edge_key)` idempotency marker or a `replayed_at_hebbian` flag.

8. **Document the retrieval-defaults calibration derivation** (Arch minor #3). The `RetrievalConfig` docstring now derives `decay=0.7 × core_weight=0.4 → ~5 effective hops` under `threshold=0.001`, with a "re-tune when" note for P3b real-text fixtures.

── EXECUTOR MINOR FOLDS ──

9. **Fix the stale "Episodes/topic = 10" comment** in the generator docstring + the generated YAML header (Exec minor #1). Left over from the pre-reinforcement draft; the actual fixture ships 17 base episodes per topic.

10. **Add `peripherals_per_topic > pool_size` validation** (Exec minor #2). Pre-fold: silent undercount on pool overflow. Post-fold: the generator raises `ValueError` with a clear message. New test `test_peripherals_per_topic_validation` regression-guards it. Also added `episode_dropout_rate` bounds validation.

11. **Rename `test_retrieval_defaults_produce_stage2_pass` → `test_retrieval_defaults_match_spec`** (Exec minor #3). The old name promised a probe but the body only asserted default values. The new name matches what the test does.
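The calibration derivation documented in #8 is simple enough to check by hand. A sketch, assuming activation attenuates multiplicatively by decay × edge weight per hop (the constants are the ones in the commit; the propagation rule is an assumption):

```python
decay, core_weight, threshold = 0.7, 0.4, 0.001

per_hop = decay * core_weight   # 0.28 attenuation per hop along core edges
activation, hops = 1.0, 0
while activation * per_hop >= threshold:
    activation *= per_hop
    hops += 1

# ~5 effective hops before activation falls below the 0.001 threshold:
# 0.28**5 ≈ 0.0017 survives, 0.28**6 ≈ 0.0005 does not.
print(hops)
```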
── DEFERRED WITH RATIONALE ──

- **`TfidfBaseline` location move** (Arch minor #1). Keeps `tests/substrate/tfidf_baseline.py` — test-to-test imports from P3b work fine. Move to `src/maxim/memory/baselines/` when a 3rd consumer emerges. Rule of three.
- **`p3a_metrics.py` / `p2_metrics.py` extraction** (Arch minor #2). Extract the common `aggregate_seeds` / `compare_to_baseline` into `tests/substrate/metrics_common.py` when P3b adds a 3rd consumer. Same rationale.

── TEST STATE ──

- tests/substrate/test_p3a_fixture_validation.py: 12 → 19 tests (+7: node-filter seam × 3, dropout variance × 2, pool/rate validation × 2).
- tests/substrate/test_p3a_episode_binding.py: 24 tests (Stage 1 + 0 regressions; 4 call sites updated to `multi_hop=False`).
- tests/unit/test_bio_system_snapshot.py: 52 tests (P3.5 + 0 regressions).
- Full fast suite: 4454 passed, 1 skipped, 0 regressions.
- Ruff check + format clean.

── RESULTS AFTER FOLD ──

Retriever            mean F1   std F1   Beats TF-IDF + 2σ?
------------------   -------   ------   ----------------------
Hebbian multi-hop    0.9955    0.0055   YES (margin 0.324)
Hebbian one-hop      0.6910    0.0074   ≈ parity
TF-IDF baseline      0.6600    0.0058   --

Multi-hop lift over one-hop: 0.3045 (architectural invariant ≥ 0.20). Results JSON + experiment writeup regenerated with the dropout-based numbers; reproduction protocol unchanged (inline snippets still work).
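Two of the seams in this fold — max-path aggregation (documentation fold #6) and the `node_filter` kwarg (Arch critical #2) — can be illustrated together. A minimal sketch, NOT the project's `DependencyGraph` API; the adjacency shape, defaults, and filter semantics here are assumptions drawn from the commit text:

```python
from __future__ import annotations
from typing import Callable


def spreading_activation(
    edges: dict[str, list[tuple[str, float]]],   # node -> [(neighbor, weight)]
    sources: dict[str, float],                   # cue nodes -> initial activation
    *,
    decay: float = 0.7,
    threshold: float = 0.001,
    max_depth: int = 5,
    node_filter: Callable[[str], bool] | None = None,
) -> dict[str, float]:
    keep = node_filter or (lambda _n: True)
    # Filtered nodes are dropped both as sources and as hop targets.
    frontier = {n: a for n, a in sources.items() if keep(n)}
    best: dict[str, float] = dict(frontier)
    for _ in range(max_depth):
        nxt: dict[str, float] = {}
        for node, act in frontier.items():
            for neighbor, weight in edges.get(node, []):
                if not keep(neighbor):
                    continue
                a = act * decay * weight
                # max-path aggregation: keep only the strongest path to a node,
                # never sum parallel paths.
                if a < threshold or a <= best.get(neighbor, 0.0):
                    continue
                best[neighbor] = a
                nxt[neighbor] = a
        frontier = nxt
    return best
```

With `edges = {"a": [("b", 0.4)], "b": [("c", 0.4)]}` and cue `{"a": 1.0}`, the sketch yields b at 0.28 and c at 0.0784; passing `node_filter=lambda n: n != "b"` removes b from traversal, which also makes c unreachable.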
dennys246 added a commit that referenced this pull request on Apr 15, 2026
Two parallel reviewers (Executor lens + Architecture lens) on commits c3a88cf..600b736. Zero directly cross-confirmed findings (the lenses focused on different layers — Executor on migration mechanics, Architecture on cross-module interaction). The Architecture lens caught one BLOCKING-adjacent finding (#10) that needed a code-level investigation before fold; the trace confirmed it is a LATENT risk, not an active bug.

Architecture #10 (latent bridge × subscriber attribution asymmetry): post-Wave-1, the CLI fast paths wire BOTH `create_pain_nac_subscriber` AND `ToolPainBridge` to the same PainBus. The bridge has a `_pending_tools` guard at `_on_embodiment_pain`; the subscriber does NOT. Today, the subscriber's `record_outcome_full` silently fails to link pending tool events: `record_tool_start` records the pending event with a stripped 1-key `{"params": ...}` context, the body's `_publish_pain` outcome carries 7 keys with zero overlap, and `_context_similarity` (directional, denominator = `len(ctx1)` per the Substrate P2 Stage 2 rule) computes `0/1 = 0.0` — below the 0.5 threshold, so no link forms. Net: no double-counting today. The correctness is load-bearing on this similarity mismatch, not on any guard. If anyone enriches `record_tool_start`'s context (e.g. adds `entity` for the broad-guard narrowing the bridge docstring at lines 367-371 explicitly contemplates), the trap activates and double-counting starts silently.

Per the no-band-aid rule, this is NOT fixed in this PR by adding a `_pending_tools` guard to `create_pain_nac_subscriber` (that would couple the subscriber to bridge state across `proprioception/pain_bus.py` and `bridges/tool_pain_bridge.py` with no clean ownership relationship — a coupling band-aid). Instead:

1. A new shell plan, `pain_bus_bridge_subscriber_unification.md`, captures the trigger conditions and the non-band-aid fix (Option B: a bridge-aware subscriber via a `build_pain_bus(*, tool_pain_bridge=...)` parameter, OR invert the wiring so `ToolPainBridge` mediates NAc attribution and the subscriber falls back when no bridge is wired). Both shapes need their own audit + design pass — a sibling plan to this Wave 1.
2. Tripwire test `test_pain_bus.py::TestBuildPainBus::test_subscriber_does_not_link_pending_tool_event` pins the current behavior. It fails LOUDLY the moment anyone enriches `record_tool_start`'s context dict, forcing the deeper plan to open. The test docstring explicitly says: DO NOT relax the assertion.
3. The CLAUDE.md `build_pain_bus` invariant is updated with the latent-risk warning + tripwire reference.
4. The `pain_bus_unification.md` audit doc gains a "Latent risk surfaced during pre-merge review" section enumerating the four band-aid options that were rejected and the non-band-aid path.

Architecture #2 (L4 commitment) + #3 (Gap B rationale wording): the audit doc's Gap B section is reworded. The prior wording ("DN has no other use for hippocampus") was shaky because `agentic_runtime.py:719` IS DN's hippocampus consumer, just externally wired. Replaced with the cleaner framing: DN currently *constructs* its own bus; the Wave-2 fix is to invert ownership so the agent runtime constructs the bus via `build_pain_bus(...)` and INJECTS it into DN, which becomes a bus *consumer*. Added an explicit L4 commitment that Wave 2 MUST do the inversion, NOT add `pain_detector=`/`hippocampus=` parameters to `build_pain_bus` (which would violate "gate on the learning subject, not the signal source").

Executor #1 (sim orchestrator:69 deferral note): Site #4 in the audit table now explicitly documents why it is NOT migrated — the bus must exist at line 69 (the sandbox's `PainTriggerLayer` needs it) BEFORE `aut_hippocampus`/`aut_nac` are constructed at line 613+. Migrating would require restructuring the construction-timing relationship. The site is already behaviorally correct (it subscribes both standard learners), so it does not contribute to Gap A. A future Wave 1.5 / Wave 2 candidate.

Executor #2 (dead `hippo_marker` test code): removed the unused inner function from `test_additional_subscribers_registered_after_standard_learners` — an artifact from an earlier draft. The test still uses `hippo.capture.side_effect` to record ordering; behavior unchanged.

Executor #3 (history_size hard assertion): replaced the smoke check with `assert bus.reaction_bus._history.maxlen == 42`. Catches a regression where the kwarg silently fails to flow.

Executor #5 (interactive sim logger.info parity): added `logger.info("Interactive sim PainBus wired ...")` to the interactive sim path so all three CLI migration sites have parallel observability.

Tests: `tests/unit/test_pain_bus.py` 28 passing (was 27 — added the tripwire test). `tests/unit/test_build_executor.py` 15 passing. Combined: 43 passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
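The directional-similarity arithmetic behind Architecture #10 fits in a few lines. A hypothetical sketch — the real `_context_similarity` lives in the pain-bus module, and these dicts are stand-ins for the stripped pending context and the disjoint outcome context:

```python
def context_similarity(ctx1: dict, ctx2: dict) -> float:
    """Directional: shared matching keys counted against len(ctx1) only."""
    if not ctx1:
        return 0.0
    shared = [k for k in ctx1 if k in ctx2 and ctx1[k] == ctx2[k]]
    return len(shared) / len(ctx1)


# record_tool_start's stripped 1-key context vs an outcome with no key overlap:
pending = {"params": {"path": "a.txt"}}
outcome = {"tool": "write_file", "entity": "a.txt", "ok": False}
print(context_similarity(pending, outcome))   # 0.0 — below 0.5, no link forms

# Enriching the pending context (e.g. adding "entity") arms the trap:
enriched = {**pending, "entity": "a.txt"}
print(context_similarity(enriched, outcome))  # 0.5 — right at the threshold
```

This is why the tripwire test pins the current 1-key context shape: one extra shared key moves the score from 0.0 to the threshold boundary, and the no-double-counting property stops holding by accident.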
dennys246 added a commit that referenced this pull request on Apr 15, 2026
Two parallel reviewers (Executor + Architecture lenses) produced 20 total findings against P4 Stage 1; this commit folds the IMPORTANT and MINOR items that require code changes. No CRITICAL findings. 4702 tests pass (185 across the P3a/P3b/P3.5/P4/hippocampus slice).

Cross-confirmed folds (both lenses caught independently):

Consistency windows:
- hippocampus_persistence.load_state now wraps the ENTIRE episode-binding restore block (episode_store.load, binding_graph rebuild, ordinal restore, node_modality clear-then-load) in a single `with self._episode_lock:` inside the existing `_rwlock.write()`. The pre-fold split regime held _rwlock.write() only for the first three and acquired _episode_lock separately for just the sidecar, exposing a torn-read window to any concurrent observe_episode_event. Lock order _rwlock then _episode_lock verified deadlock-free by inspection (Arch-lens #1).
- hippocampus_persistence.dump mirrors this: episode-binding reads (episode_store.to_dict, _next_episode_ordinal, _node_modality) now all happen under a SINGLE _episode_lock acquisition nested inside the existing _rwlock.read() block.
- hippocampus._close_pending_episode_locked flips the drain order: the modality buffer now drains into _node_modality AFTER _episode_store.add and apply_hebbian_on_close both succeed. The drain is a pure dict update, so drain-last means that if either earlier step raises, all three pieces of episode-binding state stay consistently unmutated. The symmetric drain-first choice introduced a "sidecar entries without graph nodes" failure window that Exec-lens #6 flagged.

Cue handling:
- retrieve_cross_modal adds explicit runtime validation of the target_modality literal ("text" or "vision"), mirroring load_state's validation. A typed Literal is only checked by mypy; callers without mypy (tests, notebooks) would silently get [] back on a typo — exactly the silent-no-op failure mode the Literal was added to prevent (Exec-lens #8).
- The retrieve_cross_modal docstring is updated with explicit point-in-time read semantics (Arch-lens #2) and the explicit single-hop-only cross-modal limitation (Arch-lens #3 — multi-hop paths through same-modality intermediates are silently truncated; decision deferred to Stage 2/3).

Test additions (9 new tests; 34 total P4 Stage 1 tests vs 25 pre-fold):
- TestSnapshotPatternFilter: the frozenset content assertion pins `frozenset({"vision_mug"})` exactly, not just "some frozenset" (Exec-lens #1 — the shape-only check let a regression writing `frozenset(_node_modality.keys())` slip through).
- test_load_state_rejects_unknown_modality_value: seeded with prior state; asserts EVERY piece (sidecar, episode store, ordinal, binding-graph edges) is intact after the raise (Exec-lens #3).
- test_node_modality_stale_entries_cleared_on_rollback: the broken nac adapter now asserts the hippocampus HAS been mutated to state C at raise time. Guards against a future SNAPSHOT_KINDS reordering silently trivial-passing the rollback path (Exec-lens #2).
- TestCueExemptionWithInGraphUntaggedCue: directly exercises the cue-exemption branch with a cue that IS in the binding graph but NOT tagged — the pre-fold test used a cue absent from the graph entirely, so spreading_activation early-returned before touching the exemption (Exec-lens #4).
- TestLastWriteWinsOnDuplicateNodeIdWithinEpisode: pins the drain's last-write-wins contract for the degenerate same-node-two-modalities-in-one-episode case (Arch-lens #7).
- TestStageThreeLimitation: pins the current single-hop-only cross-modal limitation. If a future refactor enables multi-hop traversal through same-modality intermediates, this test FAILS and forces an explicit Stage 2/3 design decision (Arch-lens #3).
- TestConcurrencyCrossLockSmoke: spawns concurrent dump + observe workers for 0.5s and asserts both make forward progress. Guards against any future _episode_lock holder acquiring _rwlock in a way that would deadlock against dump's _rwlock-then-_episode_lock order (Exec-lens #5).
- TestFixtureGeometry::test_within_pair_similarity_is_dim_invariant: dim-parametrized across 64/128/384/512/768. Pins the Arch-lens #4 fold that rescaled noise_scale to noise_scale / sqrt(dim) so within-pair similarity is dim-independent. A Stage 2 caller picking CLIP's native 512-d space with the pre-fold coupling would have produced within-pair similarity ~0.44, right at EC's 0.40 threshold.

Fixture cleanup:
- p4_fixture_gen renames noise_std to noise_scale and rescales the per-element std as noise_scale / sqrt(dim) so the within-pair expected cosine similarity is invariant in dim. The default noise_scale=0.4 gives within-pair ~0.862 at any reasonable dim (Arch-lens #4).
- p4_fixture_gen cosine_similarity drops the zero-norm guard and lets numpy surface nan loudly on degenerate inputs (Exec-lens #10).

Persistence cargo-cult cleanup:
- hippocampus_persistence dump and load_state drop the getattr(self, "_episode_lock", None) defensive patterns. Both fields are set unconditionally in Hippocampus.__init__ and no subclass omits them (Arch-lens #8).

Deferred / noted in the PR description:
- Arch #5 (error-message PII): inapplicable — cue node ids are internal validated identifiers.
- Arch #6 (reset-path hygiene): the audit came back empty; no existing reset paths touch episode state.
- Exec #9 (_start_episode redundant last_tick): cosmetic.
- Open Stage 2/3 design decision: should Stage 3 split retrieve_on_cue's node_filter into traversal_filter and result_filter to enable multi-hop cross-modal paths through same-modality intermediates? TestStageThreeLimitation is the pin; the decision needs to happen BEFORE Stage 3's metric is frozen.

Test surface after fold: 34 P4 Stage 1 tests, 185 across the broader episode/persistence/hippocampus slice, 4702 full fast suite clean. The only failures are 2 pre-existing worktree-environmental failures in TestVersionInfo, unrelated to P4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
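The Arch-lens #4 rescaling described above can be sanity-checked numerically. A sketch assuming each member of a within-class pair is an independently noised copy of a shared class vector (function and parameter names are hypothetical, not the fixture generator's API): with per-element std noise_scale / sqrt(dim), the noise vector's norm stays ~noise_scale regardless of dim, so the expected within-pair cosine is ≈ 1 / (1 + noise_scale²) ≈ 0.862 at noise_scale=0.4, at every dim.

```python
import numpy as np


def within_pair_sim(dim: int, noise_scale: float = 0.4,
                    trials: int = 2000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    base = rng.normal(size=(trials, dim))
    base /= np.linalg.norm(base, axis=1, keepdims=True)
    # Per-element std scaled by 1/sqrt(dim) keeps the noise NORM ~0.4
    # independent of dim — the point of the fold.
    std = noise_scale / np.sqrt(dim)
    a = base + rng.normal(scale=std, size=(trials, dim))
    b = base + rng.normal(scale=std, size=(trials, dim))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))


for dim in (64, 512, 768):
    print(dim, round(within_pair_sim(dim), 3))  # ~0.86 at every dim
```

Under the pre-fold coupling (a fixed per-element std), the noise norm grows with sqrt(dim), which is how a 512-d CLIP space could land near the 0.40 EC threshold.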
dennys246 added a commit that referenced this pull request on Apr 15, 2026
Three-lens pre-merge review (Executor + Architecture + Blast Radius) found 2 BLOCKING + 9 IMPORTANT + 6 MINOR findings against the C3.3 branch tip. Two of the BLOCKING/IMPORTANT-tier findings were cross-confirmed across lenses — the strongest trust signal the review pattern produces. All 17 folded in this single commit. Regression guards added for every non-cosmetic fold. The bisect surface stays atomic.

## Cross-confirmed folds (highest trust)

**CC1 [BLOCKING] — Probe cache URL mismatch is a silent no-op.** Executor B1 + Blast Radius I3. `install_on_target` calls `_clear_probe_cache(url)` with the original mesh.yml URL. The cache in lane_backends.py is keyed by `cfg.remote_url` (the peer.yml URL shape) with no normalization. If the two URL shapes differ (trailing slash, /v1 suffix, query string), the clear is a silent no-op → the next startup trusts a stale `auth_rejected` entry and wires the lane off. Fix: tolerant match via `probe_cache._canonical_url()` on both sides of the comparison in `clear_cache_for_url`. Normalize-on-lookup rather than normalize-on-write, so existing cache entries don't become orphans on upgrade. Regression tests cover the trailing-slash variant, the missing-/v1 variant, the exact-match case (no regression), and the unrelated-entry case (no over-removal).

**CC2 [BLOCKING] — Drain TOCTOU: concurrent `--node install` can clobber operator drain intent.** Blast Radius B1 + Executor #5. `_run_node_install` read drain state via `read_drained_nodes()` OUTSIDE the filelock, then called `drain_node()`, which is idempotent. Two operators running `--node install` simultaneously could both see `was_drained=False`, both drain, and both resume at the end — silently clobbering any concurrent operator-explicit drain that landed in between. This violates the C2 "sticky operator intent" invariant that the code docstring explicitly claimed. Fix: a new `drain_state.drain_node_if_absent(name, ...) -> (set, bool)` primitive — an atomic "drain if not already drained, return whether we changed anything" under a single filelock critical section. `_run_node_install` uses the boolean return value to decide `drained_here`, closing the TOCTOU window completely. Matches the C2 A2 fold pattern (enforce invariants at the state layer, not the CLI layer). The regression test pre-seeds the drain state file and verifies the install verb does NOT auto-resume.

## IMPORTANT folds

**I1 — Coupling inversion: mesh_cli → peer/cli was upside-down.** Architecture #1. The lazy `from maxim.peer.cli import _install_on_target, _classify_install_tokens, KNOWN_EXTRAS` in `_run_node_install` made the mesh-specific module depend on the 1,300-line kitchen-sink peer verb dispatcher. Every C3.5/C3.6 verb (`--node update`, `--node restart`, `--node llm`) would replicate this exact pattern. Fix: a new `src/maxim/peer/install_core.py` holds the shared core (`install_on_target`, `classify_install_tokens`, `KNOWN_EXTRAS`, `_INSTALL_POLL_DEADLINE_S`, `_looks_like_url`, `_normalize_url_for_probe_cache`). Both `peer/cli.py::_cmd_install` and `peer/mesh_cli.py::_run_node_install` import from the leaf. The coupling becomes `cli.py → install_core ← mesh_cli.py` — both depend on the leaf, neither on each other. This lands the dependency shape BEFORE C3.5/C3.6 replicate the inversion and cement it by precedent.

**I2 — Missing CI grep for the `/v1/admin/install` shape lock.** Architecture #2. Plan 3 R2.5 + C3.1 both set the precedent: shared HTTP cores get a strict CI grep locking the literal endpoint string to a single module. `install_on_target` had the same single-source-of-truth property but no grep. Fix: added a CI grep in `.github/workflows/test.yml` — any match of `/v1/admin/install` outside `install_core.py` + its test file fails CI with a hint pointing at the module + the allow-list update path.

**I3 — Over-broad `http` prefix check rejects real PyPI packages.** Executor #2. Both `_cmd_install`'s argv parser and `_parse_node_install_tokens` used `arg.startswith("http")` to detect URLs, which false-positives on `httpx`, `http-client`, `httpie` — all real PyPI packages operators might install. Fix: the new `install_core._looks_like_url(token)` requires `"://"` in the token. Applied to both parsers. Regression tests for `httpx`, `http-client`, `httpie` in both test files, plus a guard confirming that real URLs with `"://"` still get rejected.

**I4 — Resume-after-success failure returned the same exit code as install failure.** Executor #4. Both paths returned 1, so operators tailing exit codes couldn't distinguish "install failed, stuck in drain" from "install succeeded, resume failed, node is upgraded but stuck in drain." Fix: new `_EXIT_RESUME_FAILED_POST_INSTALL = 3` constant. `_run_node_install` returns 3 on post-install resume failure; 1 is reserved for install-itself-failed. The exit code is documented in the verb docstring, cli-reference, and the mesh_debug runbook. A regression test locks in the code distinction.

**I5 — Cluster-key divergence between mesh.yml and peer.yml.** Blast Radius B2. `init-mesh` copies `peer.yml::api_key` into `mesh.yml::cluster_key` once; after that the two files are independent. Post-rotation, one install verb 401s while the other succeeds against the same target. Fix (docs-only for C3.3, doctor check deferred): an explicit warning in `cli-reference.md` under the `--node install` row, a dedicated section in `mesh_debug.md` with the symptom pattern, and a new Stage C7 roadmap entry for the future `check_cluster_key_consistency` doctor check. The doctor check belongs with the cluster-key rotation verb (C7 ships both together so the check has a remediation path to point at).

**I6 — Stacked failures leave the node drained with no message.** Blast Radius I2. If a `KeyboardInterrupt` or a programming bug fired between the drain and the install result, the function exited with the exception and the node stayed drained silently. Fix: wrapped the `install_on_target` call in `try/except BaseException` (covers `KeyboardInterrupt` + `SystemExit`, not just Exception). Prints the "Install interrupted. Node X is STILL DRAINED" banner before re-raising. Regression tests for `KeyboardInterrupt`, `ValueError`, and the pre-drained case (no banner when we didn't change state).

**I7 — Failure-message ordering not asserted.** Executor #3. The regression test confirmed that both "Install failed" and "STILL DRAINED" appeared in stderr, but not that STILL DRAINED came LAST. The operator should see the actionable recovery hint at the bottom of the terminal. Fix: added `test_still_drained_banner_comes_after_install_error_output`, which uses `str.rfind` + `str.find` to verify the ordering invariant.

**I8 — Was-drained sticky semantics not documented operator-facing.** Architecture #4. The code docstring covered it; `cli-reference.md` did not. In a C4 world where the router consults drain state, the "install never mutates drain intent" contract is load-bearing. Fix: an explicit "Was-drained sticky" paragraph in the cli-reference row + a dedicated "pre-drained nodes stay drained after success" section in the mesh_debug walkthrough.

**I9 — Empty `mesh.cluster_key` not rejected at construction.** Blast Radius I1. `MeshConfig.__post_init__` validated yaml-safe characters but not the empty string. If `peer.yml::api_key` was ever empty, `synthesize_from_peer_config` would copy it through and the install POST would send `Authorization: Bearer ` with a trailing space → a cryptic 401. Fix: a new empty-string guard in `MeshConfig.__post_init__`, paired with a docstring explaining the failure mode it prevents. Matches the E7 (empty nodes) + C3.2 A1 (self_name in nodes) pattern of hoisting parser-side validation into the construction-time check. Regression tests for both the rejection case and the whitespace-only documented-limitation case.
## MINOR folds

- **M1** — The `install_on_target` docstring claimed "no assumptions about drained" but mutated the probe cache. One-sentence clarification added.
- **M2** — Parametrized test `test_every_known_extra_classifies_as_extra` locks the `KNOWN_EXTRAS` ↔ classifier contract so a future pyproject bump can't silently drop an extra out of the classifier.
- **M3** — The positional-URL rejection error leaked the typed URL verbatim to stderr (which could contain secrets). Fix: new `_redact_url_for_error()` shows only `<scheme>://...`. A regression test with `sk-abc123-secret-token` in the URL path confirms the secret does NOT appear in stderr.
- **M4** — The `600.0` polling deadline is now the `_INSTALL_POLL_DEADLINE_S` named constant at module scope in `install_core.py`.
- **M5** — Skipped by design. The `verb_args` name matches its role.
- **M6** — The `,semantic,` and `,,,` edge cases now have regression tests in both test files.

## Not folded (deferred with notes)

- **Architecture #7** — `run_node_subcommand` dispatch-dict refactor. Flagged for C3.5 when the verb count hits 8.
- **Architecture #5** — step-1 vs step-6 self-guard redundancy. Intentional defense-in-depth; the step-1 message is better. Added a docstring comment explaining the choice.
- **Executor Q1** — drained-for-install distinguishability. Added as an explicit C4.5 re-check trigger in the roadmap doc with the operator/install/auto tri-state semantics written out.
- **Executor Q2** — multi-http* argv divergence between the two parsers. Minor; not worth a design round in C3.3.
- **Blast Radius I4** — leader-side Authorization echo in error bodies. Out of scope for a peer-side ship.
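The CC1 tolerant match can be sketched as follows. The exact normalization rules of `probe_cache._canonical_url` are not shown in the commit; this hypothetical version covers the three variants the regression tests name (trailing slash, /v1 suffix, query string):

```python
from urllib.parse import urlsplit


def canonical_url(url: str) -> str:
    # Normalize-on-lookup: both sides of the cache comparison go through
    # this, so legacy cache entries written pre-upgrade still match.
    parts = urlsplit(url)
    path = parts.path.rstrip("/")          # trailing-slash variant
    if path.endswith("/v1"):
        path = path[: -len("/v1")]         # /v1-suffix variant
    return f"{parts.scheme}://{parts.netloc}{path}"  # query string dropped


assert canonical_url("http://peer:8080/v1/") == canonical_url("http://peer:8080")
assert canonical_url("http://peer:8080?probe=1") == canonical_url("http://peer:8080/")
assert canonical_url("http://other:9000") != canonical_url("http://peer:8080")
```

Normalizing on lookup rather than on write is the key choice: existing `auth_rejected` entries keyed by the raw URL remain reachable after the upgrade instead of becoming orphans.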
## Files touched

- NEW `src/maxim/peer/install_core.py` — shared install core
- NEW CI grep in `.github/workflows/test.yml` — /v1/admin/install shape lock
- `src/maxim/peer/cli.py` — thin `_cmd_install` re-imports from install_core, fixes the http:// sniff
- `src/maxim/peer/mesh_cli.py` — `_run_node_install` rewritten to use `drain_node_if_absent`, `_EXIT_RESUME_FAILED_POST_INSTALL`, try/except BaseException, `_redact_url_for_error`; imports from install_core
- `src/maxim/peer/mesh_config.py` — empty cluster_key guard
- `src/maxim/peer/drain_state.py` — new `drain_node_if_absent` primitive
- `src/maxim/runtime/probe_cache.py` — `_canonical_url` + tolerant-match `clear_cache_for_url`
- `tests/unit/test_peer_install.py` — 4 new test classes: TestHttpPrefixSniffNotOverBroad (4), TestKnownExtrasClassification (3), TestProbeCacheNormalizedLookup (4), TestMeshConfigRejectsEmptyClusterKey (2)
- `tests/unit/test_peer_mesh_cli.py` — 2 new test classes: TestRunNodeInstallFoldGuards (12), TestDrainNodeIfAbsent (5)
- `docs/user/cli-reference.md` — expanded --node install row with exit-code table, cluster-key divergence warning, was-drained sticky contract
- `docs/troubleshooting/mesh_debug.md` — exit-code 3 recovery walkthrough, cluster-key divergence section, httpx/http-client note
- `docs/plans/reactive_peer_mesh_roadmap.md` — Stage C7 cluster-key consistency doctor-check entry + C4.5 re-check trigger for drain-origin tri-state

## Test results

- 4816 passed, 1 skipped, 22 deselected in the full fast suite (was 4786 pre-fold; +30 new regression tests from the fold)
- 377/377 green in the concentrated peer+mesh+doctor slice
- `ruff check` + `ruff format --check` clean
- `mypy` on public API files clean
- Zero regressions from the pre-fold baseline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
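The CC2 primitive's contract (signature taken from the commit message) can be sketched in-process; the real implementation persists the drain set under a filelock rather than a threading.Lock, and takes additional arguments elided here.

```python
import threading

_lock = threading.Lock()
_drained: set = set()


def drain_node_if_absent(name: str) -> tuple:
    """Atomically drain `name` iff it is not already drained.

    Returns (current drain set, whether THIS call changed anything), so
    the install verb can decide `drained_here` without a read-then-write
    race against a concurrent operator drain.
    """
    with _lock:
        changed = name not in _drained
        _drained.add(name)
        return set(_drained), changed


state, drained_here = drain_node_if_absent("node-a")
assert drained_here          # we drained it, so we may resume afterwards
state, drained_here = drain_node_if_absent("node-a")
assert not drained_here      # sticky operator intent: do NOT auto-resume
```

The single critical section is the whole fix: the pre-fold code's separate `read_drained_nodes()` + `drain_node()` calls left a window where a concurrent operator drain between the read and the write was later clobbered by the install verb's resume.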
dennys246 added a commit that referenced this pull request on Apr 15, 2026
…parameterize encoders (fold step 2)

Stage 2 v2 fold — commit sequence step 2. Addresses Arch #3 (encoder-confound discipline gap), Arch #5 (threshold entanglement with fixture SHA), Exec #4 (scripts/ imports from tests is a layering smell), and Exec #9 (scripts/p4_mug_test_sweep.py has module-level logging.basicConfig side effects that leak into pytest sessions).

New canonical orchestrator: tests/substrate/p4_build_and_bind.py
- build_and_bind(descriptor, images, config) is the single source of truth for "wire EC + Hippocampus, encode fixture pairs, bind episodes."
- Takes a BuildConfig with: text_encoder, vision_encoder, per-modality EC thresholds, FixtureNoiseConfig, FixtureBridgeConfig, seed. Stage 3 Arms A/B/C can reuse the same orchestrator by swapping the TextEncoder factory (paraphrase_mpnet_text_encoder vs clip_text_encoder).
- Phase 2D v2 noise layer: FixtureNoiseConfig(noise_reps=N) creates 1 cross-class contamination pair per class via deterministic rotating assignment (class i gets a noise edge to class (i+1) % n), reinforced N times. noise_reps=0 disables.
- Phase 2D v2 bridge layer: FixtureBridgeConfig(enabled=True) creates a shared "text_flower" superclass node bound to each text_X via (text_X, text_flower) episodes AND to each vision_X_0 via (text_flower, vision_X_0) episodes. This creates 2-hop paths text_X → text_flower → vision_Y_0 that are blocked under the Option 1 single-hop filter but reachable under the Option 2 split filter. Topology A (shared superclass) only; Topology B (pairwise) is documented in the P4 plan's Stage 2 v2 fold-status section as a known alternative not run in this fold.
- Returns a BuildResult with ec, hippocampus, class_results, noise_edges accounting, bridge_nodes, and BuildStats counters. Does NOT compute retrieval metrics — callers decide what to measure.
- Encoder factories: paraphrase_mpnet_text_encoder(), clip_text_encoder() (uses VisionEncoder._ensure_model for the shared CLIP singleton so Arms B/C don't double-load weights), clip_vision_encoder().

scripts/p4_mug_test_sweep.py:
- Migrated to import from tests.substrate.p4_build_and_bind via the sys.path insert trick at the start of main().
- Moved logging.basicConfig INSIDE main() so `import scripts.p4_mug_test_sweep` has no side effects (Exec #9 fold).
- Output path redirected from docs/experiments/p4_mug_test_sweep.md to p4_mug_test_sweep_v1_smoke.md — the v1 report is FROZEN as a historical artifact (including its tautological 1.000 recall and WRONG Option 2 decision) so the v2 report can explicitly supersede it. Running this script now only regenerates the smoke-test artifact, not the published finding.
- Local _build_and_bind, _build_hippocampus, _build_ec, _encode_text_mpnet, _encode_image_clip, and _make_pair_episode helpers deleted — all now live in p4_build_and_bind.py.

tests/substrate/test_p4_mug_test_roundtrip.py:
- _prepare_mug_test_hippocampus() no longer imports from scripts.p4_mug_test_sweep. It now imports from .p4_build_and_bind directly — a sibling-in-package import, with no sys.path manipulation needed, that works in both the parent and the subprocess child.
- Uses the new BuildConfig shape explicitly (paraphrase_mpnet text + clip vision + 0.60/1.01 thresholds + no noise + no bridges), which also documents the Stage 2 v1 fixture shape explicitly in the test source.

46 tests green across the P4 slice (vacuous_pass_guard, cross_modal_mechanism, fixture_validation, mug_test_roundtrip). Lint + format clean.

What this commit does NOT do:
- Does not change the pinned fixture at scenarios/substrate/p4_mug_test.yaml
- Does not touch the Stage 2 v1 reports (docs/experiments/p4_*)
- Does not run the v2 sweep (that's steps 5-7)
- Does not address remaining Round 2 review findings (class list pin, YAML fallback parser, tactical fixes) — those are separate fold commits

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request on Apr 15, 2026
…+ 0.70 retrieval gate (fold steps 8+9)

Stage 2 v2 fold — commit sequence steps 8 (materialize the new pinned fixture) and 9 (round-trip update + retrieval gate). Addresses Arch #5 (threshold entanglement) + Arch #6 (missing 0.70 retrieval gate specified in plan deliverable 5).

**fixture_version bumped 1 → 2.** This commit is tagged with the "change-fixture" prefix per the loader's pinned-SHA contract, which requires an explicit regeneration trigger.

scenarios/substrate/p4_mug_test.yaml:
- fixture_version: 1 → 2
- dataset: "nelorth/oxford-flowers" → "torchvision.datasets.Flowers102" (Arch #1 follow-through: the v1 stale label is replaced with the actual library in use, matching the plan amendment in commit 6de09c6)
- NEW fields at the top level (canonical ship-shape BuildConfig):
  - build_noise_reps: 1
  - build_bridges_enabled: true
  - build_text_ec_threshold: 0.60
  - build_vision_ec_threshold: 1.01
- Extensive header comment explaining the operating-point selection from the Phase 2D v2 sweep: "largest noise_reps at which bridge-enabled mean forward top-5 recall ≥ 0.90." At noise_reps=1 the sweep measured 0.980 mean recall + a +96.0% Option 2 lift; at noise_reps=2 recall dropped to 0.800 (fails the gate), so 1 is the canonical choice.
- FIXTURE_SHA256 regenerated: 967e83ed18851e1dfcad418be57f3275cf04a961462e6dc4dd055b6b71c8920b

**Note on IDE diagnostics:** the editor's YAML language server auto-infers a schema from the v1 shape and flags the new build_* fields as "property not allowed." These are false positives — no schema file exists in the repo, PyYAML (the sole parser after the Exec #1+#2 fold) does not enforce additional-property rules, and the new FixtureDescriptor fields are validated by the unit tests below. The IDE warnings will clear once the language server re-infers from the new shape.

tests/substrate/p4_fixture_loader.py:
- FixtureDescriptor dataclass extended with 4 new build_* fields (defaults match the v1 hardcoded values for backward compat if a future fixture omits them).
- load_fixture_descriptor() pulls the new fields via .get() with defaults so legacy callers don't break.
- FIXTURE_SHA256 bumped to match the regenerated YAML.

tests/substrate/test_p4_mug_test_roundtrip.py:
- New _build_canonical_config(descriptor) helper constructs a BuildConfig from the fixture descriptor's build_* fields. Shared between the round-trip test and the new retrieval-gate test.
- _prepare_mug_test_hippocampus() now uses _build_canonical_config instead of hardcoded BuildConfig values. The round-trip test therefore exercises the canonical ship shape (noise + bridges ON), validating that the P3.5 SessionSnapshot round-trip survives the full fixture shape, not just the no-noise baseline.

tests/substrate/test_p4_fixture_validation.py (Arch #5 + Arch #6 enforcement):
- TestFixtureV2BuildConfig (4 tests) — a semantic pin on the operating-point values:
  - fixture_version == 2 (drift guard on the bump)
  - build_noise_reps == 1 (matches the Phase 2D v2 operating point)
  - build_bridges_enabled is True (shared_superclass topology is the ship shape)
  - build_text/vision_ec_threshold == 0.60/1.01 (Stage 2 v1 calibration values carried forward)
- TestFixtureRetrievalGate (1 test) — the 0.70 gate the plan specified in deliverable (5). Runs build_and_bind with the canonical config, measures forward + reverse top-5 recall on all 10 classes, and asserts both means ≥ 0.70. The bar is lower than the 0.90 operating-point selection rule because it's a CI regression gate, not the stricter operating-point criterion.
- The old test_descriptor_parses_successfully is loosened from fixture_version == 1 to >= 1 (the specific v2 pin now lives in TestFixtureV2BuildConfig).

17/17 P4 fixture + round-trip + mechanism tests pass after the fixture bump.

The retrieval gate is now a first-class CI check, failing loudly if a future fixture edit drops recall below 0.70. The round-trip test is now harder: it runs against the noisy + bridged fixture, which has 80 total episodes (50 class + 10 noise + 20 bridge) vs the v1 baseline's 50. Wall clock stays ~17s. The P3.5 SessionSnapshot round-trip proves the full canonical ship shape survives serialize-kill-reload intact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
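The loader pattern described above (new build_* fields pulled via `.get()` with v1-equivalent defaults) can be sketched as follows. The field names come from the commit message; the exact FixtureDescriptor shape in the repo has more fields, so treat this as a minimal illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FixtureDescriptor:
    # Illustrative subset of the real descriptor. Defaults mirror the
    # v1 hardcoded values (no noise, no bridges, 0.60/1.01 thresholds)
    # so a fixture that omits the build_* fields still loads.
    fixture_version: int
    build_noise_reps: int = 0
    build_bridges_enabled: bool = False
    build_text_ec_threshold: float = 0.60
    build_vision_ec_threshold: float = 1.01

def load_fixture_descriptor(raw: dict) -> FixtureDescriptor:
    # .get() with defaults keeps legacy v1 fixtures parseable while
    # the v2 fixture carries the canonical ship-shape values.
    return FixtureDescriptor(
        fixture_version=raw["fixture_version"],
        build_noise_reps=raw.get("build_noise_reps", 0),
        build_bridges_enabled=raw.get("build_bridges_enabled", False),
        build_text_ec_threshold=raw.get("build_text_ec_threshold", 0.60),
        build_vision_ec_threshold=raw.get("build_vision_ec_threshold", 1.01),
    )
```

The frozen dataclass means downstream tests can pin the operating point (`build_noise_reps == 1`, bridges on) without worrying about mutation between load and assertion.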
dennys246 added a commit that referenced this pull request on Apr 15, 2026
…t, VRAM OOM, canonical mps, cosmetics, tripwire (fold step 10)

Stage 2 v2 fold — commit sequence step 10. Bundles six Round 2 review findings that are each small and tactically independent but thematically "polish before the PR":

**Exec #3** — `tests/substrate/p4_mug_probe.py`: sort the probe's returned vision node lists lexicographically by node id instead of by weight rank. Stage 2 v1's probe returned the ranked list as-is, which the round-trip test compared with exact equality. After P3.5 restore_into rebuilds the binding graph via apply_hebbian_on_close, edges may be inserted in a different order than in the original build; tied-weight retrievals (5 class-correct nodes all at identical activation) then flake across platforms without indicating a real correctness regression. Sorting by node id captures "the binding graph round-trips" (the actual claim) without false flakes from insertion-order drift. Module docstring updated with the full rationale.

**Exec #5** — `scripts/p4_clip_calibration_sweep.py::_pick_headroom_band`: hard-assert `len(candidates) >= target_n` at the end. Before the fold, a pathological sweep (NaN accuracies, empty dataset, etc.) could silently return a short list and the fixture would be written with fewer than target_n classes; downstream tests assume target_n. Loud failure at sweep time beats a silent short-ship.

**Exec #6** — `scripts/p4_vram_audit.py::_write_report`: fix the `+-4 MB` rendering when torch.cuda.empty_cache releases memory. Change the hardcoded `+` prefix on both the torch_alloc and nvidia-smi deltas to the explicit signed format `{:+.0f}` so negative deltas render correctly.

**Exec #7** — `scripts/p4_vram_audit.py::_torch_allocated_reserved_mb`: use the canonical `torch.backends.mps.is_available()` instead of the torch 2.x-only `torch.mps.is_available()` shortcut. The canonical path works on any torch version with MPS support; the shortcut is not documented as stable.

**Exec #9** — `tests/substrate/test_p4_fixture_validation.py::TestFixtureV2BuildConfig::test_thresholds_are_not_the_ec_default`: a tripwire guarding against a future edit dropping either EC threshold to the 0.40 default. Stage 2 v1's original sweep-with-defaults produced the "fake 1.000 recall" bug (every prompt collapsing into one text node); this test fires if the fixture drifts back into that regime. Strict: asserts both thresholds >= 0.45 so a near-default value cannot sneak through.

**Arch #7** — `scripts/p4_vram_audit.py::main` + new `_safe_step` helper: wrap each audit step (CLIP load, mpnet load, mug-test encoding) in a try/except capturing torch.cuda.OutOfMemoryError, MemoryError, and RuntimeError. On failure, the helper records a Sample with notes='OOM: ...', logs an error, writes the PARTIAL report with the captured samples, and returns 2. Before the fold, the script crashed on OOM and _write_report was never called, leaving operators with zero data from the audit they specifically ran to detect OOM. The audit's entire purpose is to measure "does the encoder stack fit?" — an OOM IS the answer, not a crash.

**Arch #8** — `docs/experiments/p4_vram_audit.md`: soften the Phase 2E interpretation's "3.02 GB is steady-state" claim to be honest about what the measurement actually captured. The 0-MB delta was taken at completion of ONE encode pass, not at peak during encoding, and not across repeated runs. torch caching-allocator drift under Stage 3's 60-run sweep could move the number. The Option 3γ recommendation (dedicated worktree with MAXIM_LLM_ENABLED=0) is reaffirmed specifically because it sidesteps the allocator-drift uncertainty — we don't have to reason about it if the LLM isn't there.

**Exec #10 (topology metric, reverse direction)**: NOT addressed in this commit. The sweep's Option 2 lift metric is forward-only (text → vision). Adding reverse measurement is a Stage 3 concern, not a Stage 2 v2 fold item — the +96% forward lift is unambiguous and the Option 2 SHIP decision doesn't hinge on symmetry data. Documented as a known limitation in the next commit's report.

58/58 P4 tests pass after the bundle. Lint + format clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
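The Arch #7 pattern above (run each audit step, but convert an OOM-class crash into recorded data so the partial report still gets written) can be sketched as follows. The helper name matches the commit's `_safe_step`, but the Sample shape here is a plain dict stand-in, and the real code also catches torch.cuda.OutOfMemoryError, which this torch-free sketch omits:

```python
import logging

def safe_step(name: str, fn, samples: list) -> bool:
    """Run one audit step. On an OOM-class failure, record a
    placeholder sample with the error text instead of crashing,
    so the caller can still write a PARTIAL report. Returns
    True on success, False on failure."""
    try:
        samples.append(fn())
        return True
    except (MemoryError, RuntimeError) as exc:
        # The OOM IS the measurement: capture it as data, not a crash.
        logging.error("audit step %s failed: %s", name, exc)
        samples.append({"step": name, "notes": f"OOM: {exc}"})
        return False
```

A caller would run each step through `safe_step`, then write the report from `samples` unconditionally, returning a distinct exit code (the commit uses 2) if any step failed.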
dennys246 added a commit that referenced this pull request on Apr 17, 2026
Wave 2 of biosystem_unification. Adds build_memory_hub() to
integration/memory_hub.py — the canonical MemoryHub construction
door that always calls .connect(), ensuring the three always-created
bridges (PlanHistoryBridge, EscalationLearningBridge, FearCircuitBridge)
are alive on every hub returned by the builder.
Audit found 4 gaps across 6 construction sites:
- Gap A: CLI non-sim agent (cli.py) — zero bridges, all dead
- Gap B: AgentFactory NPC agents — zero bridges, all dead
- Gap C: Sim orchestrator hub — dead code (missing scn/ec TypeError)
- Gap D: locals().get("fear_agent") fragility in sim AUT
Migration:
- Site #1 (cli.py): MemoryHub() → build_memory_hub() — fixes Gap A
- Site #2 (orchestrator AUT): MemoryHub() → build_memory_hub(),
removed redundant .connect() call (fear_agent stored but never
read by any bridge)
- Site #3 (orchestrator orch): added missing scn/ec — fixes Gap C
- Site #4 (Reachy): MemoryHub() → build_memory_hub(), keeps second
.connect() for spatial/salience wiring from DefaultNetwork
- Site #5 (AgentFactory): MemoryHub() → build_memory_hub() — fixes Gap B
21 new tests in tests/unit/test_build_memory_hub.py.
4834 tests passing, 24 integration tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
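The "canonical construction door" idea above is simple to sketch: one builder that registers the three always-created bridges and never returns an unconnected hub. The MemoryHub and bridge classes below are minimal stand-ins for the real ones in integration/; only the invariant (bridges registered + `.connect()` called) is taken from the commit message:

```python
class _Bridge:
    """Stand-in for PlanHistoryBridge / EscalationLearningBridge /
    FearCircuitBridge; the real classes live in integration/."""
    def __init__(self, name: str):
        self.name = name

class MemoryHub:
    """Minimal stand-in: tracks registered bridges and connect()."""
    def __init__(self):
        self.bridges: list[_Bridge] = []
        self.connected = False
    def register(self, bridge: _Bridge) -> None:
        self.bridges.append(bridge)
    def connect(self) -> None:
        self.connected = True

def build_memory_hub() -> MemoryHub:
    # Every hub leaves this function with all three bridges registered
    # AND .connect() already called, so call sites that previously
    # constructed MemoryHub() directly (Gaps A/B: zero bridges, all
    # dead) cannot forget the wiring.
    hub = MemoryHub()
    for name in ("plan_history", "escalation_learning", "fear_circuit"):
        hub.register(_Bridge(name))
    hub.connect()
    return hub
```

Callers that need extra wiring (like the Reachy site's spatial/salience subsystems) can still call `.connect()` a second time after the builder returns, which is what the migration notes above describe.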
dennys246 added a commit that referenced this pull request on Apr 17, 2026
…on candidates (Wave 3)

Full audit of every bio-pipeline construction site in src/maxim/. Ten production sites: 8 umbrella callers that reproduce ~30 lines of bio-system construction, and 2 leaf factories (create.py, load.py) that return individual systems and don't need the umbrella.

Six sites collapse into build_bio_stack: CLI non-sim (#1), sim AUT (#4), sim orch NPC (#5), Reachy (#6), AgentFactory (#7), and the CLI sim modes (#2/#3) that reuse #1's bio-systems. Two are not candidates: the simulation/tools.py sub-AUT (a consumer, not a constructor) and the leaf factories.

Key findings:
- Construction order is consistent: Hippocampus → NAc → SCN → EC → (ATL, AngularGyrus) → MemoryHub → PainBus → DefaultNetwork
- ATL and AngularGyrus are optional everywhere (try/except)
- DefaultNetwork is opt-in (only sites #4 and #6)
- PainBus wiring varies (CLI builds it independently; sim/Reachy inject it)
- MemoryHub.connect() with DN subsystems happens only at Reachy (#6)
- All umbrella sites already use the Wave 1/2 builders

4 open design questions carried to the implementation session. biosystem_unification.md status row updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dennys246 added a commit that referenced this pull request on Apr 17, 2026
…migrations

Introduces runtime/bio_stack.py::build_bio_stack(*, persistence_dir) that composes the four Wave 1+2 builders (build_reaction_bus, build_pain_bus, build_memory_hub, build_default_network) into a single call returning a frozen BioStack dataclass. Collapses ~30 lines of hand-rolled bio-pipeline construction at each entry point into ~5.

Migrated sites:
- cli.py non-sim agent (site #1)
- simulation/orchestrator.py AUT (site #4, with pre-built pain_bus=)
- simulation/orchestrator.py orch NPC hub (site #5)
- embodied_runtime/agentic_runtime.py Reachy (site #6)

Not migrated (documented):
- CLI sim modes (#2/#3): just build_pain_bus, too thin for the umbrella
- AgentFactory (#7): conditional remembers/learns + auto_load deferred to agent_factory_canonicalization.md Wave 4

27 new tests, 4902 passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Within this feature I implement an agentic brain whose structure is inspired by the organization of the human brain. The system uses typical agentic agents for accessing different tools like the internet or the filesystem, but it also comprises three main parts: an ExecAgent (executive function), a MemoryAgent (memory), and a SalienceAgent (salience), which all work to accomplish a single goal set out by the user (defaulting to "learn more about the world and help humans"). The system constantly listens to CLI input and audio transcripts, and if it hears its name it will attempt to react to the user.
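The always-listening, name-triggered behavior described above can be sketched as a simple filter over incoming transcript lines. Everything here is illustrative: the wake word, the helper names, and the substring-match heuristic are assumptions, not the system's actual detection logic:

```python
def hears_name(transcript: str, agent_name: str = "maxim") -> bool:
    """Case-insensitive wake-word check on one line of CLI input or
    audio transcript. A substring match is a deliberately loose
    heuristic; the real system may use something stricter."""
    return agent_name.lower() in transcript.lower()

def listen_loop(lines, agent_name: str = "maxim"):
    # Yield only the inputs the agentic brain should react to; in the
    # real system these would be routed to the Exec/Memory/Salience
    # agents for a response.
    for line in lines:
        if hears_name(line, agent_name):
            yield line
```

In a live deployment the `lines` iterable would be a merged stream of stdin and speech-to-text output rather than a list.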