feat(memory): bind _SPEAKER_WEIGHTS from calibration sweep by dcellison · Pull Request #443 · dcellison/kai

dcellison · 2026-05-08T20:09:55Z

Summary

Binds _SPEAKER_WEIGHTS from the spec §5 calibration sweep run 2026-05-08 against the production Layer 1 26-probe baseline.
Result: user=0.85, assistant=0.8, episode_summary=0.85. Production constants in kai/eval/retrieval.py and the _reset_memory_module fixture in tests/test_eval_retrieval.py move in lockstep.
The result is non-regressive within the §5.2 hard floor (p@1 within 0.05 of baseline) and shows a marginal +0.05 improvement on p@3. Closes the §11 step 11 calibration item from "Memory: per-fact provenance with retrieval-time downweight" #437.

Calibration result

Pre-spec baseline (_SOURCE_WEIGHTS weighting against the production index, 2026-05-08):

Metric	Value
Probes scored	20 (6 drifted from 26)
p@1	0.65
p@3	0.80
p@5	0.90
MRR	0.7475
fraction_in_prompt	0.90
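
For readers unfamiliar with the metric names, here is a minimal sketch of how such a table could be computed. It assumes p@k is the fraction of scored probes whose expected fact appears in the top k results, and MRR is the mean reciprocal rank of the first expected hit. The PR does not spell out these definitions, so treat the function names and the toy probes below as illustrative, not as the harness's actual code.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

# A probe here is (ranked fact ids returned by retrieval, expected fact ids).
# This shape is an assumption for illustration only.
Probe = Tuple[Sequence[str], Set[str]]


def first_hit_rank(ranked: Sequence[str], expected: Set[str]) -> Optional[int]:
    """Return the 1-based rank of the first expected fact, or None on a miss."""
    for rank, fact_id in enumerate(ranked, start=1):
        if fact_id in expected:
            return rank
    return None


def summarize(probes: List[Probe]) -> Dict[str, float]:
    """Compute hit-rate style p@k and MRR over a list of probes."""
    ranks = [first_hit_rank(ranked, expected) for ranked, expected in probes]
    n = len(probes)

    def hit_rate(k: int) -> float:
        return sum(1 for r in ranks if r is not None and r <= k) / n

    return {
        "p@1": hit_rate(1),
        "p@3": hit_rate(3),
        "p@5": hit_rate(5),
        "MRR": sum(1.0 / r for r in ranks if r is not None) / n,
    }


# Toy data (not from the production index): first hits at ranks 1, 2, 4, and a miss.
toy_probes: List[Probe] = [
    (["f1", "f2", "f3"], {"f1"}),
    (["f9", "f2", "f3"], {"f2"}),
    (["a", "b", "c", "f4", "d"], {"f4"}),
    (["a", "b", "c"], {"zz"}),
]
metrics = summarize(toy_probes)
```

On the toy data, a miss contributes 0 to MRR and to every p@k, which is why MRR can never exceed p@5 when no first hit falls below rank 5.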

§5.2 gates applied to 24 configs (production floor=0.30, overfetch=20):

Hard floor: p@1 ≥ 0.60 (baseline 0.65 minus 0.05). Three configs cleared.
Improvement filter: p@3 ≥ 0.85 OR MRR ≥ 0.7975. All three survivors cleared via p@3 (exactly 0.85; MRR below the 0.7975 threshold).
Tiebreaker (p@3 + MRR sum): all three tied at 1.5792.
Secondary tiebreaker (larger assistant_weight): all three tied at 0.8.
The three survivors differ only in episode_summary_weight (0.7 / 0.85 / 1.0) and are indistinguishable on this probe set; the middle value, which is also the issue-body nominal, ships.

Sweep table (24 configs at production floor=0.30, overfetch=20, sorted by p@1 desc then MRR desc):

user_w	asst_w	ep_w	p@1	p@3	p@5	MRR	in_prompt
0.85	0.8	0.7	0.60	0.85	0.90	0.7292	0.90
0.85	0.8	0.85	0.60	0.85	0.90	0.7292	0.90
0.85	0.8	1.0	0.60	0.85	0.90	0.7292	0.90
1.0	0.8	0.7	0.55	0.85	0.90	0.7042	0.90
1.0	0.8	0.85	0.55	0.85	0.90	0.7042	0.90
1.0	0.8	1.0	0.55	0.85	0.90	0.7042	0.90
0.85	0.7	0.7	0.55	0.85	0.90	0.7042	0.90
0.85	0.7	0.85	0.55	0.85	0.90	0.7042	0.90
1.0	0.7	0.7	0.50	0.85	0.90	0.6792	0.90
1.0	0.7	0.85	0.50	0.85	0.90	0.6792	0.90
0.85	0.6	0.7	0.50	0.85	0.90	0.6792	0.90
0.85	0.7	1.0	0.50	0.80	0.90	0.6667	0.90
1.0	0.7	1.0	0.45	0.80	0.90	0.6417	0.90
0.85	0.6	0.85	0.45	0.80	0.90	0.6417	0.90
1.0	0.6	0.7	0.40	0.85	0.90	0.6292	0.90
1.0	0.6	0.85	0.40	0.80	0.90	0.6083	0.90
0.85	0.5	0.7	0.40	0.80	0.90	0.6083	0.90
0.85	0.6	1.0	0.40	0.80	0.90	0.5917	0.90
1.0	0.5	0.7	0.35	0.80	0.90	0.5750	0.90
1.0	0.6	1.0	0.35	0.75	0.90	0.5625	0.90
0.85	0.5	0.85	0.35	0.75	0.90	0.5625	0.90
0.85	0.5	1.0	0.35	0.75	0.90	0.5625	0.90
1.0	0.5	0.85	0.30	0.70	0.90	0.5333	0.90
1.0	0.5	1.0	0.30	0.70	0.90	0.5333	0.90
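
The §5.2 selection described above can be replayed against rows from the sweep table. This is a sketch under stated assumptions: the `Row` type, the function names, and the small epsilon for float comparison are mine, and only a subset of the 24 rows is included. The thresholds and the "middle value ships" rule come from this PR's description.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    user_w: float
    asst_w: float
    ep_w: float
    p1: float
    p3: float
    mrr: float


# Pre-spec baseline and tolerance, as stated in this PR.
BASELINE_P1, BASELINE_P3, BASELINE_MRR = 0.65, 0.80, 0.7475
TOLERANCE = 0.05
EPS = 1e-9  # assumption: guards against float rounding at the exact thresholds

# A subset of the sweep table (values copied from the rows above).
ROWS: List[Row] = [
    Row(0.85, 0.8, 0.7, 0.60, 0.85, 0.7292),
    Row(0.85, 0.8, 0.85, 0.60, 0.85, 0.7292),
    Row(0.85, 0.8, 1.0, 0.60, 0.85, 0.7292),
    Row(1.0, 0.8, 0.85, 0.55, 0.85, 0.7042),
    Row(1.0, 0.7, 0.85, 0.50, 0.85, 0.6792),  # issue-body nominal values
    Row(1.0, 0.5, 1.0, 0.30, 0.70, 0.5333),
]


def passes_gates(row: Row) -> bool:
    """Hard floor on p@1, then improvement on p@3 OR MRR."""
    hard_floor = row.p1 >= BASELINE_P1 - TOLERANCE - EPS
    improved = (
        row.p3 >= BASELINE_P3 + TOLERANCE - EPS
        or row.mrr >= BASELINE_MRR + TOLERANCE - EPS
    )
    return hard_floor and improved


def tiebreak_key(row: Row) -> Tuple[float, float]:
    """Primary: p@3 + MRR sum. Secondary: larger assistant weight."""
    return (round(row.p3 + row.mrr, 4), row.asst_w)


survivors = [r for r in ROWS if passes_gates(r)]
best = max(tiebreak_key(r) for r in survivors)
tied = [r for r in survivors if tiebreak_key(r) == best]

# Remaining ties differ only in ep_w; the middle value ships.
chosen = sorted(tied, key=lambda r: r.ep_w)[len(tied) // 2]
```

With these rows, `survivors` holds the three `user=0.85, assistant=0.8` configurations and `chosen` is the `ep_w=0.85` one, matching the bound values.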

Caveats worth noting in the closing comment

Marginal gains. p@1 sits at exactly the -0.05 floor; p@3 sits at exactly the +0.05 improvement threshold. With 20 scored probes, 0.05 deltas are one-probe granularity. The change is technically shippable per the spec's gates but not a slam-dunk.
Episode_summary_weight has zero effect on this probe set. Likely because the probes' expected_fact_ids are all extracted facts, not episode summaries. The three values (0.7 / 0.85 / 1.0) produce identical metrics. Calibrating that axis would require a probe set with episode-summary expected hits.
Issue-body nominal values fail the §5.2 hard floor. user=1.0, assistant=0.7, episode_summary=0.85 scores p@1=0.50 — fails by 0.10. The §5.2 fallback path ("ship nominal values with non-regressive note") is therefore not available; the calibrated result is the only defensible ship.
Today's baseline (p@1=0.65) is below the recorded 2026-04-23 baseline (p@1=0.731). The data state differs: roughly two weeks of new extractions and hygiene sweeps, and a different drift mix. Not load-bearing for this decision but worth noting.
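
The one-probe granularity from the first caveat can be checked with a little arithmetic. This sketch assumes each metric is a per-probe hit rate over the 20 scored probes, which is consistent with the caveat's wording but is not defined explicitly in the PR; the helper name is hypothetical.

```python
from fractions import Fraction

SCORED_PROBES = 20

# Smallest possible change in a hit-rate metric: one probe out of 20.
STEP = Fraction(1, SCORED_PROBES)


def hits(rate: float) -> int:
    """Convert a reported rate back to a whole number of probes."""
    return round(rate * SCORED_PROBES)


# Baseline vs. calibrated values reported in this PR.
p1_delta_probes = hits(0.60) - hits(0.65)  # p@1 moved by one probe, downward
p3_delta_probes = hits(0.85) - hits(0.80)  # p@3 moved by one probe, upward
```

In other words, under this reading the shipped configuration loses one probe at rank 1 and gains one probe within the top 3.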

Test plan

make test passes (2823 tests).
make check passes (ruff lint + ruff format).
Production constants in kai/eval/retrieval.py and the _reset_memory_module fixture both updated; a future re-calibration touches three values in two files only.
Deploy via sudo make install; verify /memory rendering still surfaces Speaker / Confidence correctly under the new weights.
Post a closing comment on "feat(memory): per-fact speaker attribution with retrieval-time downweight" #441 with the sweep table and the §5.4 / §7 deliverables. This PR's body covers most of it; whether to also include five before/after sample retrieved-fact lists, which would need another service-stop run, is at the operator's discretion.

Spec §5 calibration sweep ran 2026-05-08 against the production Layer 1 26-probe baseline (20 scored, 6 drifted). The pre-spec baseline (`_SOURCE_WEIGHTS` weighting) scored p@1=0.65, p@3=0.80, p@5=0.90, MRR=0.7475.

Of the 24 (user x assistant x episode_summary) configurations at production floor (0.30) and overfetch (20), three cleared the §5.2 hard floor (p@1 within 0.05 of baseline) and the improvement filter (p@3 ≥ 0.85, the +0.05 threshold). All three tied on the p@3+MRR sum and on the larger-assistant_weight tiebreaker; episode_summary_weight had no observable effect on this probe set, so the middle value (matching the issue body's nominal) ships.

Bound values:
user 0.85
assistant 0.8
episode_summary 0.85

The change is "non-regressive within tolerance and marginal +0.05 improvement on p@3"; at 20 scored probes, 0.05 deltas are one-probe granularity. The issue body's nominal values (user=1.0, assistant=0.7, episode_summary=0.85) FAIL the §5.2 hard floor at p@1=0.50, so the §5.2 fallback path ("ship nominal values with non-regressive note") is not available; the calibrated result is the only defensible ship.

Production constants in kai/eval/retrieval.py and the test fixture in test_eval_retrieval.py move in lockstep so a future calibration follow-up only needs to touch the three values in one place per file. Future calibration with a richer probe set (especially probes whose expected_fact_id is an episode summary) is the way to retune episode_summary_weight against real signal.

dcellison · 2026-05-08T20:24:56Z

Review by Kai

PR Review: feat(memory): bind _SPEAKER_WEIGHTS from calibration sweep

Overall: Clean. No bugs, no security issues, no convention violations.

Correctness

The three-way sync is correct:

retrieval.py:96-97: _PRODUCTION_USER_WEIGHT = 0.85, _PRODUCTION_ASSISTANT_WEIGHT = 0.8
memory.py:~248: _SPEAKER_WEIGHTS = {"user": 0.85, "assistant": 0.8, "episode_summary": 0.85}
test_eval_retrieval.py:85: fixture reset matches both

episode_summary is 0.85 in all three locations and was already at 0.85 before - unchanged and consistent.

Security

None. Constants-only change with no new code paths.

Style / Conventions

The updated comment in memory.py is thorough and meets the project's commenting standard - it explains the calibration provenance, the selection rationale, and the known limitation (episode_summary_weight blind spot).

One suggestion (non-blocking)

tests/test_eval_retrieval.py:85 - The fixture hard-codes the production weights as literals rather than importing them from memory.py or retrieval.py. This means a future calibration touches three values in three places, not two. Pre-existing pattern so not a regression introduced here, and the PR description even notes "a future re-calibration touches three values in two files only" - which undersells it by one. Consider importing _PRODUCTION_USER_WEIGHT etc. from retrieval.py in the fixture to make it truly two-file maintenance.

That's a suggestion worth a follow-up ticket, not a blocker. The PR is shippable as-is.

…ture

Round 1 review of #443. Adopted.

The `_reset_memory_module` fixture in tests/test_eval_retrieval.py hardcoded the three production weights as literals. This left a future calibration with three places to edit (memory.py + retrieval.py + the test fixture) rather than the two the PR description claimed. Importing the `_PRODUCTION_*` constants from kai.eval.retrieval makes the fixture auto-track any retune to the harness baseline; a future re-calibration touches memory.py's _SPEAKER_WEIGHTS and retrieval.py's _PRODUCTION_* constants only.

The `_UNKNOWN_SPEAKER_WEIGHT` alias rebind (added in the prior review round) stays where it is: it derives from the live _SPEAKER_WEIGHTS["assistant"] entry, which is now populated from the imported constant, so the alias automatically tracks.
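
The idea behind this change can be sketched in a self-contained form. The two `SimpleNamespace` objects below are stand-ins for kai.eval.retrieval and the memory module so that the snippet runs without the repository. The episode-summary constant name and the plain `reset_memory_module` function are hypothetical; the real fixture's body is not shown in this thread.

```python
from types import SimpleNamespace

# Stand-in for kai.eval.retrieval. The first two names appear in the review;
# _PRODUCTION_EPISODE_SUMMARY_WEIGHT is a hypothetical name for illustration.
retrieval = SimpleNamespace(
    _PRODUCTION_USER_WEIGHT=0.85,
    _PRODUCTION_ASSISTANT_WEIGHT=0.8,
    _PRODUCTION_EPISODE_SUMMARY_WEIGHT=0.85,
)

# Stand-in for the memory module whose state the fixture resets.
memory = SimpleNamespace(_SPEAKER_WEIGHTS={}, _UNKNOWN_SPEAKER_WEIGHT=None)


def reset_memory_module() -> None:
    """Reset weights from the harness constants instead of repeated literals."""
    memory._SPEAKER_WEIGHTS = {
        "user": retrieval._PRODUCTION_USER_WEIGHT,
        "assistant": retrieval._PRODUCTION_ASSISTANT_WEIGHT,
        "episode_summary": retrieval._PRODUCTION_EPISODE_SUMMARY_WEIGHT,
    }
    # The alias derives from the live "assistant" entry, so it tracks retunes.
    memory._UNKNOWN_SPEAKER_WEIGHT = memory._SPEAKER_WEIGHTS["assistant"]


reset_memory_module()
weights_after_reset = dict(memory._SPEAKER_WEIGHTS)
```

Because the reset reads the constants at call time, a retune of the harness constants flows into the fixture without editing the test file.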

dcellison · 2026-05-08T20:27:37Z

Review by Kai

Clean PR. The author addressed the prior review's non-blocking suggestion: the test fixture now imports _PRODUCTION_* constants from retrieval.py rather than repeating literals, so future calibration truly touches two files instead of three. The values are consistent across all three changed files.

One minor observation (suggestion):

tests/test_eval_retrieval.py:74-77 - The fixture now imports private symbols (_PRODUCTION_USER_WEIGHT, etc.) from a production module. This works and is documented in the updated comment, but it creates a test-to-internals coupling that could silently break if those constants are ever renamed or moved without updating the test import. The comment covers the intent well enough that this is fine as-is; just worth keeping in mind if retrieval.py is ever restructured.

No bugs, no security issues, no convention violations. The values are correctly synchronized: retrieval.py constants, memory.py dict, and the test fixture reset all agree on user=0.85, assistant=0.8, episode_summary=0.85.

zigguratt added 2 commits May 8, 2026 16:09

chore: trigger PR review bot

5919793

dcellison merged commit d564095 into main May 8, 2026
1 check passed

dcellison deleted the feature/437-speaker-calibration branch May 8, 2026 20:31

This was referenced May 8, 2026

Memory: per-fact provenance with retrieval-time downweight #437

Closed

Memory quality: replace write-time string filters with structural controls #436

Open

dcellison merged 3 commits into main from feature/437-speaker-calibration