feat(memory): bind _SPEAKER_WEIGHTS from calibration sweep #443

Merged
dcellison merged 3 commits into main from feature/437-speaker-calibration on May 8, 2026

Conversation

@dcellison
Owner

Summary

  • Binds _SPEAKER_WEIGHTS from the spec §5 calibration sweep run 2026-05-08 against the production Layer 1 26-probe baseline.
  • Result: user=0.85, assistant=0.8, episode_summary=0.85. Production constants in kai/eval/retrieval.py and the _reset_memory_module fixture in tests/test_eval_retrieval.py move in lockstep.
  • Non-regressive within the §5.2 hard floor (p@1 within 0.05 of baseline), with a marginal +0.05 improvement on p@3. Closes the §11 step 11 calibration item from "Memory: per-fact provenance with retrieval-time downweight" (#437).

Calibration result

Pre-spec baseline (_SOURCE_WEIGHTS weighting against the production index, 2026-05-08):

| Metric | Value |
|---|---|
| Probes scored | 20 (6 drifted from 26) |
| p@1 | 0.65 |
| p@3 | 0.80 |
| p@5 | 0.90 |
| MRR | 0.7475 |
| fraction_in_prompt | 0.90 |
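
For readers unfamiliar with the metric names, here is a minimal sketch of how such figures are commonly computed. It assumes p@k means the fraction of probes whose expected fact ranks within the top k, and that MRR scores a miss as 0; the PR does not spell out either definition. The rank list is hypothetical: it is one distribution that happens to reproduce the baseline numbers above, not the actual per-probe ranks from the sweep.

```python
from typing import Optional, Sequence

Rank = Optional[int]  # 1-based rank of the expected fact, None if not retrieved


def hit_rate_at_k(ranks: Sequence[Rank], k: int) -> float:
    """Fraction of probes whose expected fact ranked within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Rank]) -> float:
    """Mean of 1/rank over all probes; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for 20 scored probes (assumption, for illustration only):
# 13 hits at rank 1, 3 at rank 2, one each at ranks 4 and 5, 2 misses.
baseline_ranks = [1] * 13 + [2] * 3 + [4, 5] + [None, None]

p_at_1 = hit_rate_at_k(baseline_ranks, 1)
p_at_3 = hit_rate_at_k(baseline_ranks, 3)
p_at_5 = hit_rate_at_k(baseline_ranks, 5)
mrr = mean_reciprocal_rank(baseline_ranks)
print(p_at_1, p_at_3, p_at_5, round(mrr, 4))
```

Under these assumed definitions the hypothetical ranks give p@1=0.65, p@3=0.80, p@5=0.90 and MRR=0.7475, matching the table.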

§5.2 gates applied to 24 configs (production floor=0.30, overfetch=20):

  • Hard floor: p@1 ≥ 0.60 (baseline 0.65 minus 0.05). Three configs cleared.
  • Improvement filter: p@3 ≥ 0.85 OR MRR ≥ 0.7975. All three survivors cleared via p@3 (exactly 0.85; MRR below the 0.7975 threshold).
  • Tiebreaker (p@3 + MRR sum): all three tied at 1.5792.
  • Secondary tiebreaker (larger assistant_weight): all three tied at 0.8.
  • Differ only in episode_summary_weight (0.7 / 0.85 / 1.0). Indistinguishable on this probe set; the middle / issue-body-nominal value ships.
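
The gate sequence above can be replayed against the sweep table. The sketch below is an illustration, not the harness code: the `Row` type, function names and float tolerance are assumptions, and picking the median `ep_w` stands in for the "middle value ships" judgment call. The data rows are copied from the sweep table in this PR.

```python
from typing import NamedTuple


class Row(NamedTuple):
    user_w: float
    asst_w: float
    ep_w: float
    p1: float
    p3: float
    mrr: float


# Baseline figures from the PR; EPS guards float comparisons at exact thresholds.
BASE_P1, BASE_P3, BASE_MRR, EPS = 0.65, 0.80, 0.7475, 1e-9

# The 24 sweep rows (p@5 and in_prompt omitted: both are 0.90 in every row).
SWEEP = [Row(*r) for r in [
    (0.85, 0.8, 0.7, 0.60, 0.85, 0.7292), (0.85, 0.8, 0.85, 0.60, 0.85, 0.7292),
    (0.85, 0.8, 1.0, 0.60, 0.85, 0.7292), (1.0, 0.8, 0.7, 0.55, 0.85, 0.7042),
    (1.0, 0.8, 0.85, 0.55, 0.85, 0.7042), (1.0, 0.8, 1.0, 0.55, 0.85, 0.7042),
    (0.85, 0.7, 0.7, 0.55, 0.85, 0.7042), (0.85, 0.7, 0.85, 0.55, 0.85, 0.7042),
    (1.0, 0.7, 0.7, 0.50, 0.85, 0.6792), (1.0, 0.7, 0.85, 0.50, 0.85, 0.6792),
    (0.85, 0.6, 0.7, 0.50, 0.85, 0.6792), (0.85, 0.7, 1.0, 0.50, 0.80, 0.6667),
    (1.0, 0.7, 1.0, 0.45, 0.80, 0.6417), (0.85, 0.6, 0.85, 0.45, 0.80, 0.6417),
    (1.0, 0.6, 0.7, 0.40, 0.85, 0.6292), (1.0, 0.6, 0.85, 0.40, 0.80, 0.6083),
    (0.85, 0.5, 0.7, 0.40, 0.80, 0.6083), (0.85, 0.6, 1.0, 0.40, 0.80, 0.5917),
    (1.0, 0.5, 0.7, 0.35, 0.80, 0.5750), (1.0, 0.6, 1.0, 0.35, 0.75, 0.5625),
    (0.85, 0.5, 0.85, 0.35, 0.75, 0.5625), (0.85, 0.5, 1.0, 0.35, 0.75, 0.5625),
    (1.0, 0.5, 0.85, 0.30, 0.70, 0.5333), (1.0, 0.5, 1.0, 0.30, 0.70, 0.5333),
]]


def clears_gates(r: Row) -> bool:
    """Hard floor on p@1, then improvement on p@3 or MRR (both +0.05)."""
    hard_floor = r.p1 >= BASE_P1 - 0.05 - EPS
    improved = r.p3 >= BASE_P3 + 0.05 - EPS or r.mrr >= BASE_MRR + 0.05 - EPS
    return hard_floor and improved


def tiebreak_key(r: Row) -> tuple:
    """Primary: p@3 + MRR sum. Secondary: larger assistant weight."""
    return (round(r.p3 + r.mrr, 4), r.asst_w)


survivors = [r for r in SWEEP if clears_gates(r)]
best = max(tiebreak_key(r) for r in survivors)
tied = sorted((r for r in survivors if tiebreak_key(r) == best), key=lambda r: r.ep_w)
chosen = tied[len(tied) // 2]  # assumption: "middle value ships" as the median ep_w
print(chosen)
```

Run as written, three rows survive, all three stay tied through both tiebreakers, and the median pick is user=0.85, assistant=0.8, episode_summary=0.85.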

Sweep table (24 configs at production floor=0.30, overfetch=20, sorted by p@1 desc then MRR desc):

user_w asst_w ep_w p@1 p@3 p@5 MRR in_prompt
0.85 0.8 0.7 0.60 0.85 0.90 0.7292 0.90
0.85 0.8 0.85 0.60 0.85 0.90 0.7292 0.90
0.85 0.8 1.0 0.60 0.85 0.90 0.7292 0.90
1.0 0.8 0.7 0.55 0.85 0.90 0.7042 0.90
1.0 0.8 0.85 0.55 0.85 0.90 0.7042 0.90
1.0 0.8 1.0 0.55 0.85 0.90 0.7042 0.90
0.85 0.7 0.7 0.55 0.85 0.90 0.7042 0.90
0.85 0.7 0.85 0.55 0.85 0.90 0.7042 0.90
1.0 0.7 0.7 0.50 0.85 0.90 0.6792 0.90
1.0 0.7 0.85 0.50 0.85 0.90 0.6792 0.90
0.85 0.6 0.7 0.50 0.85 0.90 0.6792 0.90
0.85 0.7 1.0 0.50 0.80 0.90 0.6667 0.90
1.0 0.7 1.0 0.45 0.80 0.90 0.6417 0.90
0.85 0.6 0.85 0.45 0.80 0.90 0.6417 0.90
1.0 0.6 0.7 0.40 0.85 0.90 0.6292 0.90
1.0 0.6 0.85 0.40 0.80 0.90 0.6083 0.90
0.85 0.5 0.7 0.40 0.80 0.90 0.6083 0.90
0.85 0.6 1.0 0.40 0.80 0.90 0.5917 0.90
1.0 0.5 0.7 0.35 0.80 0.90 0.5750 0.90
1.0 0.6 1.0 0.35 0.75 0.90 0.5625 0.90
0.85 0.5 0.85 0.35 0.75 0.90 0.5625 0.90
0.85 0.5 1.0 0.35 0.75 0.90 0.5625 0.90
1.0 0.5 0.85 0.30 0.70 0.90 0.5333 0.90
1.0 0.5 1.0 0.30 0.70 0.90 0.5333 0.90

Caveats worth noting in the closing comment

  1. Marginal gains. p@1 sits at exactly the -0.05 floor; p@3 sits at exactly the +0.05 improvement threshold. With 20 scored probes, 0.05 deltas are one-probe granularity. The change is technically shippable per the spec's gates but not a slam-dunk.
  2. Episode_summary_weight has no effect among the three surviving configs. At user=0.85, assistant=0.8, the three values (0.7 / 0.85 / 1.0) produce identical metrics, likely because the probes' expected_fact_ids are all extracted facts, not episode summaries. It does shift p@1 at lower assistant weights (see the sweep table), so the axis is not inert; calibrating it properly would require a probe set with episode-summary expected hits.
  3. Issue-body nominal values fail the §5.2 hard floor. user=1.0, assistant=0.7, episode_summary=0.85 scores p@1=0.50 — fails by 0.10. The §5.2 fallback path ("ship nominal values with non-regressive note") is therefore not available; the calibrated result is the only defensible ship.
  4. Today's baseline (p@1=0.65) is below the recorded 2026-04-23 baseline (p@1=0.731). The data state differs: roughly two weeks of new extractions and hygiene sweeps, and a different drift mix. Not load-bearing for this decision but worth noting.
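
To make the "one-probe granularity" point in caveat 1 concrete, the arithmetic can be written out. This is a small sketch that assumes each p@k value is a hit count divided by the 20 scored probes; the helper name is hypothetical.

```python
SCORED_PROBES = 20


def hits(rate: float) -> int:
    """Convert a reported rate back to a whole number of probes."""
    return round(rate * SCORED_PROBES)


step = 1 / SCORED_PROBES                      # smallest possible metric change
p1_delta_probes = hits(0.60) - hits(0.65)     # calibrated vs baseline p@1
p3_delta_probes = hits(0.85) - hits(0.80)     # calibrated vs baseline p@3
nominal_shortfall = hits(0.60) - hits(0.50)   # hard floor vs issue-body nominal p@1

print(step, p1_delta_probes, p3_delta_probes, nominal_shortfall)
```

Under that assumption the shipped config loses one probe at rank 1 and gains one within the top 3, while the issue-body nominal values sit two probes below the hard floor.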

Test plan

  • make test passes (2823 tests).
  • make check passes (ruff lint + ruff format).
  • Production constants in kai/eval/retrieval.py and the _reset_memory_module fixture both updated; a future re-calibration touches three values in two files only.
  • Deploy via sudo make install; verify /memory rendering still surfaces Speaker / Confidence correctly under the new weights.
  • Closing comment on "feat(memory): per-fact speaker attribution with retrieval-time downweight" (#441) with the sweep table and the §5.4 / §7 deliverables. This PR's body covers most of it; whether to also include five before/after sample retrieved-fact lists, which would need another service-stop run, is at the operator's discretion.

zigguratt added 2 commits May 8, 2026 16:09
Spec §5 calibration sweep ran 2026-05-08 against the production
Layer 1 26-probe baseline (20 scored, 6 drifted). The pre-spec
baseline (`_SOURCE_WEIGHTS` weighting) scored p@1=0.65, p@3=0.80,
p@5=0.90, MRR=0.7475.

Of the 24 (user x assistant x episode_summary) configurations at
production floor (0.30) and overfetch (20), three cleared the §5.2
hard floor (p@1 within 0.05 of baseline) and the improvement filter
(p@3 ≥ 0.85, the +0.05 threshold). All three tied on the p@3+MRR
sum and on the larger-assistant_weight tiebreaker;
episode_summary_weight had no observable effect among them, so the
middle value (matching the issue body's nominal) ships.

Bound values:

    user             0.85
    assistant        0.8
    episode_summary  0.85

The change is "non-regressive within tolerance and marginal +0.05
improvement on p@3" — at 20 scored probes, 0.05 deltas are
one-probe granularity. The issue body's nominal values
(user=1.0, assistant=0.7, episode_summary=0.85) FAIL the §5.2
hard floor at p@1=0.50, so the §5.2 fallback path ("ship nominal
values with non-regressive note") is not available; the calibrated
result is the only defensible ship.

Production constants in kai/eval/retrieval.py and the test fixture
in test_eval_retrieval.py move in lockstep so a future calibration
follow-up only needs to touch the three values in one place per
file.

Future calibration with a richer probe set (especially probes
whose expected_fact_id is an episode summary) is the way to retune
episode_summary_weight against real signal.
@dcellison
Owner Author

Review by Kai

PR Review: feat(memory): bind _SPEAKER_WEIGHTS from calibration sweep

Overall: Clean. No bugs, no security issues, no convention violations.

Correctness

The three-way sync is correct:

  • retrieval.py:96-97: _PRODUCTION_USER_WEIGHT = 0.85, _PRODUCTION_ASSISTANT_WEIGHT = 0.8
  • memory.py:~248: _SPEAKER_WEIGHTS = {"user": 0.85, "assistant": 0.8, "episode_summary": 0.85}
  • test_eval_retrieval.py:85: fixture reset matches both

episode_summary is 0.85 in all three locations and was already at 0.85 before - unchanged and consistent.

Security

None. Constants-only change with no new code paths.

Style / Conventions

The updated comment in memory.py is thorough and meets the project's commenting standard - it explains the calibration provenance, the selection rationale, and the known limitation (episode_summary_weight blind spot).

One suggestion (non-blocking)

tests/test_eval_retrieval.py:85 - The fixture hard-codes the production weights as literals rather than importing them from memory.py or retrieval.py. This means a future calibration touches three values in three places, not two. Pre-existing pattern so not a regression introduced here, and the PR description even notes "a future re-calibration touches three values in two files only" - which undersells it by one. Consider importing _PRODUCTION_USER_WEIGHT etc. from retrieval.py in the fixture to make it truly two-file maintenance.

That's a suggestion worth a follow-up ticket, not a blocker. The PR is shippable as-is.

…ture

Round 1 review of #443. Adopted.

The `_reset_memory_module` fixture in tests/test_eval_retrieval.py
hardcoded the three production weights as literals. This left a
future calibration with three places to edit (memory.py +
retrieval.py + the test fixture) rather than the two the PR
description claimed. Importing the `_PRODUCTION_*` constants from
kai.eval.retrieval makes the fixture auto-track any retune to the
harness baseline; a future re-calibration touches memory.py's
_SPEAKER_WEIGHTS and retrieval.py's _PRODUCTION_* constants only.

The `_UNKNOWN_SPEAKER_WEIGHT` alias rebind (added in the prior
review round) stays where it is — it derives from the live
_SPEAKER_WEIGHTS["assistant"] entry, which is now populated from
the imported constant so the alias automatically tracks.
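
The pattern this commit describes can be sketched as follows. This is a self-contained illustration, not the project's fixture: `SimpleNamespace` objects stand in for `kai.eval.retrieval` and the memory module, the reset is a plain function rather than a pytest fixture, and `_PRODUCTION_EPISODE_SUMMARY_WEIGHT` is a hypothetical name (the PR only names the user and assistant constants explicitly).

```python
from types import SimpleNamespace

# Stand-in for kai.eval.retrieval (assumption: real module not imported here).
retrieval = SimpleNamespace(
    _PRODUCTION_USER_WEIGHT=0.85,
    _PRODUCTION_ASSISTANT_WEIGHT=0.8,
    _PRODUCTION_EPISODE_SUMMARY_WEIGHT=0.85,  # hypothetical constant name
)

# Stand-in for the memory module whose state the fixture resets.
memory = SimpleNamespace(_SPEAKER_WEIGHTS={}, _UNKNOWN_SPEAKER_WEIGHT=None)


def reset_memory_module(memory_mod, retrieval_mod) -> None:
    """Rebuild the weights from the harness constants instead of literals."""
    memory_mod._SPEAKER_WEIGHTS = {
        "user": retrieval_mod._PRODUCTION_USER_WEIGHT,
        "assistant": retrieval_mod._PRODUCTION_ASSISTANT_WEIGHT,
        "episode_summary": retrieval_mod._PRODUCTION_EPISODE_SUMMARY_WEIGHT,
    }
    # The alias derives from the live "assistant" entry, so it tracks retunes.
    memory_mod._UNKNOWN_SPEAKER_WEIGHT = memory_mod._SPEAKER_WEIGHTS["assistant"]


reset_memory_module(memory, retrieval)
print(memory._SPEAKER_WEIGHTS, memory._UNKNOWN_SPEAKER_WEIGHT)
```

Because the reset reads the constants at call time, changing a value in the stand-in `retrieval` namespace and calling the function again is enough for both the dict and the alias to follow.
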
@dcellison
Owner Author

Review by Kai

Clean PR. The author addressed the prior review's non-blocking suggestion: the test fixture now imports _PRODUCTION_* constants from retrieval.py rather than repeating literals, so future calibration truly touches two files instead of three. The values are consistent across all three changed files.

One minor observation (suggestion):

tests/test_eval_retrieval.py:74-77 - The fixture now imports private symbols (_PRODUCTION_USER_WEIGHT, etc.) from a production module. This works and is documented in the updated comment, but it creates a test-to-internals coupling that could silently break if those constants are ever renamed or moved without updating the test import. The comment covers the intent well enough that this is fine as-is; just worth keeping in mind if retrieval.py is ever restructured.

No bugs, no security issues, no convention violations. The values are correctly synchronized: retrieval.py constants, memory.py dict, and the test fixture reset all agree on user=0.85, assistant=0.8, episode_summary=0.85.

@dcellison dcellison merged commit d564095 into main May 8, 2026
1 check passed
@dcellison dcellison deleted the feature/437-speaker-calibration branch May 8, 2026 20:31