Problem statement
benchmarks/README.md:101-110 baseline shows items_dropped=0, dedup_removed=0 for every scenario. The select_and_pack budget-pressure path (src/contextweaver/context/selection.py) and the deduplicate_candidates near-duplicate detector (src/contextweaver/context/dedup.py) are unit-tested but never exercised by integration scenarios. Open issue #181 specifies this gap.
This makes ~40% of the context pipeline invisible to end-to-end benchmark metrics. Tokenizer / weight / dedup-threshold changes can't be evaluated for their drop/dedup impact today.
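To make the gap concrete, here is a minimal sketch of budget-constrained packing that shows why a light scenario reports items_dropped=0. It is not the select_and_pack implementation: the greedy highest-score-first policy, the pack_within_budget and estimate_tokens names, and the 4-characters-per-token estimate are all assumptions chosen for illustration.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def pack_within_budget(items, budget):
    """Greedy packing sketch: items is a list of (score, text) pairs.

    Higher-scoring items are considered first; an item is dropped when
    adding it would exceed the token budget.
    """
    kept, dropped, used = [], [], 0
    for score, text in sorted(items, key=lambda it: it[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
        else:
            dropped.append(text)
    return kept, dropped


# Three items of roughly 100 estimated tokens each.
items = [(0.9, "a" * 400), (0.5, "b" * 400), (0.1, "c" * 400)]

# Generous budget: everything fits, so nothing is dropped.
kept, dropped = pack_within_budget(items, budget=1000)
print(len(kept), len(dropped))  # 3 0

# Tight budget: the lowest-scoring item no longer fits.
kept, dropped = pack_within_budget(items, budget=250)
print(len(kept), len(dropped))  # 2 1
```

As long as a scenario's total content fits inside the budget, the drop branch is never taken, which is the situation the current baselines are in.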
Proposed solution
New benchmarks/scenarios/stress_conversation.jsonl with:
- 50+ turns
- ≥2 tool results with text length ≥2000 chars (firewall compaction kicks in)
- ≥2 pairs of consecutive near-duplicate agent_msg items (Jaccard ≥0.85, matching the engine's default dedup threshold)
- Tuned so budget @ 6000 tokens forces items_dropped > 0
Keep the existing long_conversation.jsonl (don't replace it — it remains the "light load" baseline).
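The structural requirements above (turn count and large tool results) can be checked mechanically before running the benchmark. The sketch below is hedged: it assumes one JSONL object per turn and that each object carries a "type" field (with values such as "tool_result" and "agent_msg") and a "text" field. Those field names are guesses and should be adjusted to the real scenario schema; summarize_fixture and meets_shape are hypothetical helpers, not part of the repo.

```python
import json


def summarize_fixture(lines, min_chars=2000):
    """Count turns and large tool results in an iterable of JSONL lines.

    Assumes (hypothetically) that each object has "type" and "text" keys.
    """
    items = [json.loads(line) for line in lines if line.strip()]
    large_tool_results = [
        item
        for item in items
        if item.get("type") == "tool_result"
        and len(item.get("text", "")) >= min_chars
    ]
    return {"turns": len(items), "large_tool_results": len(large_tool_results)}


def meets_shape(summary, min_turns=50, min_large=2):
    """True when the fixture satisfies the turn and tool-result minimums."""
    return (
        summary["turns"] >= min_turns
        and summary["large_tool_results"] >= min_large
    )


# Tiny in-memory sample standing in for the real file.
sample = [
    json.dumps({"type": "tool_result", "text": "x" * 2000}),
    json.dumps({"type": "agent_msg", "text": "short reply"}),
    json.dumps({"type": "tool_result", "text": "ok"}),
]
summary = summarize_fixture(sample)
print(summary)               # {'turns': 3, 'large_tool_results': 1}
print(meets_shape(summary))  # False: too few turns and large results
```

For the real fixture, pass an open file handle for benchmarks/scenarios/stress_conversation.jsonl instead of the in-memory sample.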
Alternatives considered
- Generate scenarios programmatically. Rejected: a scenario generation script adds surface area, and a hand-authored fixture is more readable in review.
- Lower the answer budget instead of expanding the scenario. Rejected: changing the budget makes the existing baseline numbers incomparable. Adding a new scenario keeps the existing baselines stable.
- Add three new scenarios for a full coverage matrix. Rejected: one stress scenario suffices to make the drop/dedup stages visible; further scenarios yield diminishing returns.
Affected modules
benchmarks/scenarios/stress_conversation.jsonl (new), benchmarks/benchmark.py, benchmarks/README.md
Estimated effort
S (1–2 days)
Baseline
All current scenarios: items_dropped=0, dedup_removed=0, avg_compaction_ratio=1.00x–1.41x.
Success metric
In benchmarks/results/latest.json, context.stress_conversation reports items_dropped > 0 AND dedup_removed > 0 AND avg_compaction_ratio > 1.0.
Evaluation plan
make benchmark produces the new scenario row with positive drops and dedup count. Two consecutive runs are byte-identical for non-latency keys.
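One way to check the "byte-identical for non-latency keys" condition is to strip timing fields from both result files and compare canonical JSON serializations. This is a sketch under assumptions: it treats any key whose name contains "latency" as a timing field, which may not match the real key names in latest.json, and strip_latency, canonical_bytes, and runs_match are hypothetical helpers. The numbers in the demo are placeholders, not benchmark output.

```python
import json


def strip_latency(obj, marker="latency"):
    """Recursively drop dict keys whose name contains `marker`."""
    if isinstance(obj, dict):
        return {
            key: strip_latency(value, marker)
            for key, value in obj.items()
            if marker not in key.lower()
        }
    if isinstance(obj, list):
        return [strip_latency(value, marker) for value in obj]
    return obj


def canonical_bytes(result):
    """Serialize with sorted keys so equal content yields equal bytes."""
    text = json.dumps(strip_latency(result), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def runs_match(first, second):
    """True when two results agree on every non-latency key."""
    return canonical_bytes(first) == canonical_bytes(second)


# Placeholder results: same metrics, different timings.
run_a = {"context": {"stress_conversation": {"items_dropped": 3, "latency_ms": 12.5}}}
run_b = {"context": {"stress_conversation": {"items_dropped": 3, "latency_ms": 14.1}}}
print(runs_match(run_a, run_b))  # True
```

In practice the two dictionaries would come from json.load on copies of benchmarks/results/latest.json saved after each make benchmark run.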
Scope
One JSONL data file + scenario registration in benchmark.py + benchmarks/README.md baseline table update.
Non-goals
- Don't replace the existing long_conversation.jsonl.
- Don't add fixture utilities or generator scripts.
- Don't change scoring weights or dedup threshold to force the effect — the scenario should hit those stages with the default configuration.
Design constraints
- Determinism preserved.
- Near-duplicates should have Jaccard similarity ≥ 0.85 on tokens (matches the default dedup threshold).
- Each large tool result ≥ 2000 chars.
- File format: JSONL, one object per line, matching the existing scenario schema.
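For orientation on the Jaccard constraint, the sketch below computes token-set Jaccard similarity with a stand-in tokenizer (lowercase, whitespace split). This is an assumption for illustration only: the engine's own tokenizer may split text differently, so the final check on the fixture should use the project's real helpers rather than simple_tokens or jaccard_similarity, which are hypothetical names.

```python
def simple_tokens(text):
    # Stand-in tokenizer (assumption): lowercase and split on whitespace.
    return set(text.lower().split())


def jaccard_similarity(tokens_a, tokens_b):
    """|A ∩ B| / |A ∪ B| for two token sets."""
    union = tokens_a | tokens_b
    if not union:
        # Convention chosen here: two empty texts count as identical.
        return 1.0
    return len(tokens_a & tokens_b) / len(union)


first = "i have checked the invoice records and found three matching entries for your account today"
second = "i have checked the invoice records and found four matching entries for your account today"

# 15 unique tokens each, 14 shared, 16 in the union.
score = jaccard_similarity(simple_tokens(first), simple_tokens(second))
print(score)          # 0.875
print(score >= 0.85)  # True
```

Note how sensitive the ratio is to message length: swapping one word in a 12-token message gives 11/13 ≈ 0.846, which falls just under the 0.85 threshold, so near-duplicate pairs need enough shared tokens to absorb the edit.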
Depends on
None.
Suggested order
Independent — can land in parallel with the matrix issue and the gold-set expansion.
Validation
make benchmark
python -c "
import json
d = json.load(open('benchmarks/results/latest.json'))['context']['stress_conversation']
assert d['items_dropped'] > 0, d
assert d['dedup_removed'] > 0, d
assert d['avg_compaction_ratio'] > 1.0, d
print('OK:', d)
"
Files likely touched
benchmarks/scenarios/stress_conversation.jsonl (new), benchmarks/benchmark.py (add scenario to scenarios list), benchmarks/README.md (table row).
Risk / rollback
None — pure additive data. Revert by removing the file + scenario registration.
Security/privacy notes
No real-user content; synthetic data only. Tool result payloads should be plausibly-shaped but fully fabricated.
Agent-ready notes
- Starting points: benchmarks/scenarios/long_conversation.jsonl for style + structure; the existing invoice-search blob is the prototype for the large tool result.
- Constraints: verify the near-duplicate pairs hit Jaccard ≥ 0.85 with the real tokenizer before committing (don't eyeball it). Verify determinism (two make benchmark runs identical for the new scenario).
- Review focus for Copilot code review: does the scenario actually trigger items_dropped > 0 with the default ScoringConfig (not a tuned one), and does the near-duplicate pair pass the actual deduplicate_candidates threshold.
Cross-references
Size
S
Acceptance Criteria (STRICT)
- benchmarks/scenarios/stress_conversation.jsonl exists, ≥50 turns
- ≥2 tool_result entries with text length ≥ 2000 chars
- ≥2 pairs of consecutive near-duplicate agent_msg entries (Jaccard ≥ 0.85 verified with contextweaver._utils.jaccard + tokenize)
- latest.json.context.stress_conversation shows items_dropped > 0 AND dedup_removed > 0 AND avg_compaction_ratio > 1.0
- benchmarks/README.md baseline table updated with the new scenario row
- make ci passes; existing scenarios' baseline numbers unchanged