feat(memory): bind _SPEAKER_WEIGHTS from calibration sweep #443
Spec §5 calibration sweep ran 2026-05-08 against the production
Layer 1 26-probe baseline (20 scored, 6 drifted). The pre-spec
baseline (`_SOURCE_WEIGHTS` weighting) scored p@1=0.65, p@3=0.80,
p@5=0.90, MRR=0.7475.
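
For readers less familiar with the metrics, here is a minimal sketch of how p@k and MRR could be computed from per-probe ranks. It assumes p@k means the fraction of scored probes whose expected fact appears in the top k, and that each probe has a single expected fact. The function names and the `ranks` representation are illustrative and not taken from `kai/eval/retrieval.py`.

```python
from typing import Optional, Sequence


def p_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of probes whose expected fact ranked in the top k.

    `ranks` holds the 1-based rank of the expected fact for each probe,
    or None when the fact was not retrieved at all (assumed convention).
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mrr(ranks: Sequence[Optional[int]]) -> float:
    """Mean reciprocal rank; a probe that misses contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)
```

Under this reading, every metric at 20 scored probes moves in steps of 1/20 for p@k, which is why the gates below are expressed in 0.05 increments.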
Of the 24 (user x assistant x episode_summary) configurations at
production floor (0.30) and overfetch (20), three cleared the §5.2
hard floor (p@1 within 0.05 of baseline) and the improvement filter
(p@3 ≥ 0.85, the +0.05 threshold). All three tied on the p@3+MRR
sum and on the larger-assistant_weight tiebreaker;
episode_summary_weight had no observable effect on this probe set,
so the middle value (matching the issue body's nominal) ships.
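
The gate-then-rank procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the sweep harness itself: `SweepResult` and `select_candidates` are hypothetical names, "within 0.05 of baseline" is read as a one-sided floor (p@1 ≥ baseline − 0.05), and a small epsilon guards against floating-point noise at the threshold.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SweepResult:
    """One sweep configuration and its scores (hypothetical container)."""
    user: float
    assistant: float
    episode_summary: float
    p_at_1: float
    p_at_3: float
    mrr: float


def select_candidates(
    results: Sequence[SweepResult],
    baseline_p1: float,
    baseline_p3: float,
    tolerance: float = 0.05,  # hard floor: p@1 may drop at most this much
    min_gain: float = 0.05,   # improvement filter: p@3 must gain at least this
) -> List[SweepResult]:
    eps = 1e-9  # absorb float rounding at the exact threshold
    survivors = [
        r for r in results
        if r.p_at_1 >= baseline_p1 - tolerance - eps
        and r.p_at_3 >= baseline_p3 + min_gain - eps
    ]
    # Rank by the p@3 + MRR sum, then prefer the larger assistant weight.
    return sorted(
        survivors,
        key=lambda r: (r.p_at_3 + r.mrr, r.assistant),
        reverse=True,
    )
```

When every survivor ties on both sort keys, as happened here, the ordering carries no information and the choice among them falls back to a judgment call such as the middle value.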
Bound values:
user 0.85
assistant 0.8
episode_summary 0.85
The change is "non-regressive within tolerance and marginal +0.05
improvement on p@3" — at 20 scored probes, 0.05 deltas are
one-probe granularity. The issue body's nominal values
(user=1.0, assistant=0.7, episode_summary=0.85) FAIL the §5.2
hard floor at p@1=0.50, so the §5.2 fallback path ("ship nominal
values with non-regressive note") is not available; the calibrated
result is the only defensible ship.
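
The one-probe granularity can be made concrete by converting the quoted rates back into probe counts. A short sketch, assuming each rate is a simple hit count over the 20 scored probes; `probes_hit` is an illustrative helper, not part of the codebase.

```python
SCORED_PROBES = 20  # from the description: 20 scored, 6 drifted


def probes_hit(rate: float, scored: int = SCORED_PROBES) -> int:
    """Convert a hit rate back into a whole number of probes."""
    return round(rate * scored)


baseline_p1 = probes_hit(0.65)  # 13 of 20
hard_floor = probes_hit(0.60)   # 12 of 20: baseline minus one probe
nominal_p1 = probes_hit(0.50)   # 10 of 20: two probes below the floor
baseline_p3 = probes_hit(0.80)  # 16 of 20
bound_p3 = probes_hit(0.85)     # 17 of 20: one probe above baseline
```

Read this way, the p@3 improvement is a single additional probe answered within the top three, and the nominal values miss the hard floor by two probes.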
Production constants in kai/eval/retrieval.py and the test fixture
in test_eval_retrieval.py move in lockstep so a future calibration
follow-up only needs to touch the three values in one place per
file.
Future calibration with a richer probe set (especially probes
whose expected_fact_id is an episode summary) is the way to retune
episode_summary_weight against real signal.
Review by Kai
PR Review: feat(memory): bind _SPEAKER_WEIGHTS from calibration sweep
Overall: Clean. No bugs, no security issues, no convention violations.
Correctness
The three-way sync is correct:
Security
None. Constants-only change with no new code paths.
Style / Conventions
The updated comment in
One suggestion (non-blocking)
That's a suggestion worth a follow-up ticket, not a blocker. The PR is shippable as-is.
…ture
Round 1 review of #443. Adopted.
The `_reset_memory_module` fixture in tests/test_eval_retrieval.py hardcoded the three production weights as literals. This left a future calibration with three places to edit (memory.py + retrieval.py + the test fixture) rather than the two the PR description claimed.
Importing the `_PRODUCTION_*` constants from kai.eval.retrieval makes the fixture auto-track any retune to the harness baseline; a future re-calibration touches memory.py's _SPEAKER_WEIGHTS and retrieval.py's _PRODUCTION_* constants only.
The `_UNKNOWN_SPEAKER_WEIGHT` alias rebind (added in the prior review round) stays where it is: it derives from the live _SPEAKER_WEIGHTS["assistant"] entry, which is now populated from the imported constant, so the alias tracks automatically.
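
The pattern this commit describes, building the fixture's weights from shared constants instead of literals, might look roughly like the sketch below. The three constant names are hypothetical stand-ins for the `_PRODUCTION_*` constants (the real names are not shown in this thread), and they are defined locally here only so the example runs on its own; in the actual fixture they would be imported from kai.eval.retrieval. The `"user"` and `"episode_summary"` dictionary keys are likewise assumed.

```python
from typing import Dict

# Stand-ins for the constants the fixture imports from kai.eval.retrieval.
# Hypothetical names: only the `_PRODUCTION_*` prefix appears in the thread.
_PRODUCTION_USER_WEIGHT = 0.85
_PRODUCTION_ASSISTANT_WEIGHT = 0.8
_PRODUCTION_EPISODE_SUMMARY_WEIGHT = 0.85


def production_speaker_weights() -> Dict[str, float]:
    """Build the weight table from the shared constants, never from literals."""
    return {
        "user": _PRODUCTION_USER_WEIGHT,
        "assistant": _PRODUCTION_ASSISTANT_WEIGHT,
        "episode_summary": _PRODUCTION_EPISODE_SUMMARY_WEIGHT,
    }


def unknown_speaker_weight(weights: Dict[str, float]) -> float:
    """Derive the unknown-speaker alias from the live assistant entry."""
    return weights["assistant"]
```

Because both the table and the alias are derived from the constants, a retune changes the constants alone and the fixture follows.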
Review by Kai
Clean PR. The author addressed the prior review's non-blocking suggestion: the test fixture now imports
One minor observation (suggestion):
No bugs, no security issues, no convention violations. The values are correctly synchronized:
Summary
- Bind `_SPEAKER_WEIGHTS` from the spec §5 calibration sweep run 2026-05-08 against the production Layer 1 26-probe baseline.
- `user=0.85, assistant=0.8, episode_summary=0.85`.
- Production constants in `kai/eval/retrieval.py` and the `_reset_memory_module` fixture in `tests/test_eval_retrieval.py` move in lockstep.

Calibration result
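Pre-spec baseline (`_SOURCE_WEIGHTS` weighting against the production index, 2026-05-08): p@1=0.65, p@3=0.80, p@5=0.90, MRR=0.7475.

§5.2 gates applied to 24 configs (production floor=0.30, overfetch=20):
- Hard floor (p@1 within 0.05 of baseline) and improvement filter (p@3 ≥ 0.85): three configs cleared both.
- Ranking (`p@3 + MRR` sum): all three tied at 1.5792.
- Tiebreaker (larger `assistant_weight`): all three tied at 0.8.
- The three survivors differ only in `episode_summary_weight` (0.7 / 0.85 / 1.0). Indistinguishable on this probe set; the middle / issue-body-nominal value ships.

Sweep table (24 configs at production floor=0.30, overfetch=20, sorted by p@1 desc then MRR desc):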
Caveats worth noting in the closing comment
- Issue-body nominal `user=1.0, assistant=0.7, episode_summary=0.85` scores p@1=0.50, which fails the §5.2 hard floor by 0.10. The §5.2 fallback path ("ship nominal values with non-regressive note") is therefore not available; the calibrated result is the only defensible ship.

Test plan
- `make test` passes (2823 tests).
- `make check` passes (ruff lint + ruff format).
- `kai/eval/retrieval.py` and the `_reset_memory_module` fixture both updated; a future re-calibration touches three values in two files only.
- `sudo make install`; verify `/memory` rendering still surfaces Speaker / Confidence correctly under the new weights.