[eval] Add stress conversation scenario forcing items_dropped>0 + dedup_removed>0 #212

@dgenio

Problem statement

The baseline table at benchmarks/README.md:101-110 shows items_dropped=0 and dedup_removed=0 for every scenario. The select_and_pack budget-pressure path (src/contextweaver/context/selection.py) and the deduplicate_candidates near-duplicate detector (src/contextweaver/context/dedup.py) are unit-tested but never exercised by an integration scenario. Open issue #181 describes this gap.

As a result, roughly 40% of the context pipeline is invisible to end-to-end benchmark metrics. Changes to the tokenizer, the scoring weights, or the dedup threshold cannot currently be evaluated for their effect on drops or dedup.

Proposed solution

New benchmarks/scenarios/stress_conversation.jsonl with:

  • 50+ turns
  • ≥2 tool results with text length ≥2000 chars (firewall compaction kicks in)
  • ≥2 pairs of consecutive near-duplicate agent_msg items (Jaccard ≥0.85, matching the engine's default dedup threshold)
  • Sized so that a 6000-token budget forces items_dropped > 0

Keep the existing long_conversation.jsonl (don't replace it — it remains the "light load" baseline).
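
To illustrate why the fixture has to be sized against the budget, here is a toy model of budget pressure. It is not contextweaver's select_and_pack; it is a hypothetical greedy packer (toy_pack, with made-up scores and token counts) that keeps the highest-scored items until the budget is spent and counts the rest as dropped.

```python
from typing import List, Tuple

# Hypothetical item shape for illustration only: (score, token_count).
Item = Tuple[float, int]


def toy_pack(items: List[Item], budget: int) -> Tuple[List[Item], int]:
    """Greedy packing: take items by descending score while they fit the budget."""
    kept: List[Item] = []
    used = 0
    dropped = 0
    for score, tokens in sorted(items, key=lambda item: item[0], reverse=True):
        if used + tokens <= budget:
            kept.append((score, tokens))
            used += tokens
        else:
            dropped += 1
    return kept, dropped


# 50 items at 100 tokens each total 5000 tokens and fit a 6000-token budget.
light = [(float(i), 100) for i in range(50)]
# 50 items at 150 tokens each total 7500 tokens and overflow it.
heavy = [(float(i), 150) for i in range(50)]

_, light_dropped = toy_pack(light, budget=6000)
_, heavy_dropped = toy_pack(heavy, budget=6000)
print(light_dropped, heavy_dropped)  # 0 10
```

In this model a scenario only produces drops once its total token count exceeds the budget, which is why the stress fixture needs more and larger items than the light-load baseline.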

Alternatives considered

  • Generate scenarios programmatically. Rejected: a scenario generation script adds surface area, and a hand-authored fixture is more readable in review.
  • Lower the answer budget instead of expanding the scenario. Rejected: changing the budget makes the existing baseline numbers incomparable. Adding a new scenario keeps the existing baselines stable.
  • Add three new scenarios for a full coverage matrix. Rejected: one stress scenario suffices to make the drop/dedup stages visible, and further scenarios yield diminishing returns.

Affected modules

benchmarks/scenarios/stress_conversation.jsonl (new), benchmarks/benchmark.py, benchmarks/README.md

Estimated effort

S (1–2 days)

Baseline

All current scenarios: items_dropped=0, dedup_removed=0, avg_compaction_ratio=1.00x–1.41x.

Success metric

latest.json.context.stress_conversation.items_dropped > 0 AND latest.json.context.stress_conversation.dedup_removed > 0 AND avg_compaction_ratio > 1.0x.

Evaluation plan

make benchmark produces the new scenario row with items_dropped > 0 and dedup_removed > 0. Two consecutive runs are byte-identical for all non-latency keys.
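
The determinism check could be sketched as follows. This assumes latency keys can be recognised by name (containing "latency" or "_ms"), which is a guess about the results schema. The helper names (strip_latency, canonical) and the sample values are placeholders, not real benchmark output.

```python
import json
from typing import Any

# Assumption: latency-related keys contain one of these substrings.
LATENCY_MARKERS = ("latency", "_ms")


def strip_latency(node: Any) -> Any:
    """Return a copy of a parsed JSON value with latency-like keys removed."""
    if isinstance(node, dict):
        return {
            key: strip_latency(value)
            for key, value in node.items()
            if not any(marker in key.lower() for marker in LATENCY_MARKERS)
        }
    if isinstance(node, list):
        return [strip_latency(item) for item in node]
    return node


def canonical(results: Any) -> str:
    """Serialise with sorted keys so equal content yields equal bytes."""
    return json.dumps(strip_latency(results), sort_keys=True)


# Placeholder results from two hypothetical runs; only the latency differs.
run_a = {"context": {"stress_conversation": {"items_dropped": 3, "latency_ms": 12.4}}}
run_b = {"context": {"stress_conversation": {"items_dropped": 3, "latency_ms": 15.1}}}

print(canonical(run_a) == canonical(run_b))  # True
```

In practice the two inputs would be loaded from copies of benchmarks/results/latest.json saved after each make benchmark run.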

Scope

One JSONL data file + scenario registration in benchmark.py + benchmarks/README.md baseline table update.

Non-goals

  • Don't replace the existing long_conversation.jsonl.
  • Don't add fixture utilities or generator scripts.
  • Don't change scoring weights or dedup threshold to force the effect — the scenario should hit those stages with the default configuration.

Design constraints

  • Determinism preserved.
  • Near-duplicates should have Jaccard similarity ≥ 0.85 on tokens (matches the default dedup threshold).
  • Each large tool result ≥ 2000 chars.
  • File format: JSONL, one object per line, matching the existing scenario schema.
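
For reference, token-level Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. The sketch below uses a stand-in regex tokenizer (toy_tokenize), which is an assumption and may differ from the engine's tokenizer, so the actual fixture should still be checked with the real one.

```python
import re
from typing import Set


def toy_tokenize(text: str) -> Set[str]:
    """Stand-in tokenizer: lowercase alphanumeric words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_jaccard(a: Set[str], b: Set[str]) -> float:
    """|A intersect B| / |A union B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(left: str, right: str, threshold: float = 0.85) -> bool:
    return toy_jaccard(toy_tokenize(left), toy_tokenize(right)) >= threshold


# Fabricated messages that differ only in the final word.
first = "I found twelve overdue invoices for the Acme account and flagged them for review today"
second = "I found twelve overdue invoices for the Acme account and flagged them for review now"

# 13 shared tokens out of 15 distinct tokens overall: 13/15 is about 0.867.
print(is_near_duplicate(first, second))  # True
```

Short messages leave little margin: with 14 distinct tokens per message, changing two words instead of one drops the similarity to 12/16 = 0.75, below the threshold.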

Depends on

None.

Suggested order

Independent — can land in parallel with the matrix issue and the gold-set expansion.

Acceptance Criteria (STRICT)

  • benchmarks/scenarios/stress_conversation.jsonl exists, ≥50 turns
  • At least 2 tool_result entries with text length ≥ 2000 chars
  • At least 2 pairs of near-duplicate agent_msg entries (Jaccard ≥ 0.85 verified with contextweaver._utils.jaccard + tokenize)
  • latest.json.context.stress_conversation shows items_dropped > 0 AND dedup_removed > 0 AND avg_compaction_ratio > 1.0
  • benchmarks/README.md baseline table updated with the new scenario row
  • make ci passes; existing scenarios' baseline numbers unchanged
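
The first two structural criteria can be checked mechanically before running the benchmark. The sketch below is a minimal example that assumes each JSONL line is one turn and that items carry hypothetical "kind" and "text" fields; the real field names should be taken from the existing scenario schema.

```python
import json
from typing import Dict, Iterable


def check_fixture(
    lines: Iterable[str],
    kind_field: str = "kind",  # assumed field name
    text_field: str = "text",  # assumed field name
    min_chars: int = 2000,
) -> Dict[str, int]:
    """Count turns and large tool results in a JSONL scenario."""
    items = [json.loads(line) for line in lines if line.strip()]
    large_tool_results = [
        item
        for item in items
        if item.get(kind_field) == "tool_result"
        and len(item.get(text_field, "")) >= min_chars
    ]
    return {"turns": len(items), "large_tool_results": len(large_tool_results)}


# Synthetic in-memory sample; "user_turn" is a made-up kind for illustration.
sample = [json.dumps({"kind": "user_turn", "text": f"question {i}"}) for i in range(48)]
sample += [json.dumps({"kind": "tool_result", "text": "x" * 2000}) for _ in range(2)]

summary = check_fixture(sample)
print(summary)  # {'turns': 50, 'large_tool_results': 2}
```

Against the real file, the lines would come from opening benchmarks/scenarios/stress_conversation.jsonl and iterating over it.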

Validation

make benchmark
python -c "
import json
d = json.load(open('benchmarks/results/latest.json'))['context']['stress_conversation']
assert d['items_dropped'] > 0, d
assert d['dedup_removed'] > 0, d
assert d['avg_compaction_ratio'] > 1.0, d
print('OK:', d)
"

Files likely touched

benchmarks/scenarios/stress_conversation.jsonl (new), benchmarks/benchmark.py (add scenario to scenarios list), benchmarks/README.md (table row).

Risk / rollback

None: the change is purely additive data. Revert by removing the file and the scenario registration.

Security/privacy notes

No real-user content; synthetic data only. Tool result payloads should be plausibly-shaped but fully fabricated.

Agent-ready notes

  • Starting points: benchmarks/scenarios/long_conversation.jsonl for style + structure; the existing invoice-search blob is the prototype for the large tool result.
  • Constraints: verify the near-duplicate pairs hit Jaccard ≥ 0.85 with the real tokenizer before committing (don't eyeball it). Verify determinism (two make benchmark runs identical for the new scenario).
  • Review focus for Copilot code review: does the scenario actually trigger items_dropped > 0 with the default ScoringConfig (not a tuned one), and does the near-duplicate pair pass the actual deduplicate_candidates threshold.

Cross-references

Size

S

Metadata

    Labels

    area/context (Context engine: manager, pipeline, firewall)
    area/eval (Evaluation, benchmarking, quality measurement)
    complexity/simple (Straightforward change, minimal risk)
    enhancement (New feature or request)
    priority/low (Lower priority: scale & validation)
