[eval] Add stress conversation scenario forcing items_dropped>0 + dedup_removed>0 #212

## Problem statement

`benchmarks/README.md:101-110` baseline shows `items_dropped=0, dedup_removed=0` for **every** scenario. The `select_and_pack` budget-pressure path (`src/contextweaver/context/selection.py`) and the `deduplicate_candidates` near-duplicate detector (`src/contextweaver/context/dedup.py`) are unit-tested but never exercised by integration scenarios. Open issue #181 specifies this gap.

This makes ~40% of the context pipeline invisible to end-to-end benchmark metrics. Tokenizer / weight / dedup-threshold changes can't be evaluated for their drop/dedup impact today.

## Proposed solution

New `benchmarks/scenarios/stress_conversation.jsonl` with:
- 50+ turns
- ≥2 tool results with `text` length ≥2000 chars (firewall compaction kicks in)
- ≥2 pairs of consecutive near-duplicate `agent_msg` items (Jaccard ≥0.85, matching the engine's default dedup threshold)
- Enough total content that a 6000-token budget forces `items_dropped > 0`

Keep the existing `long_conversation.jsonl` (don't replace it — it remains the "light load" baseline).

## Alternatives considered

- **Generate scenarios programmatically.** Rejected — scenario generation script adds surface area; hand-authored fixture is more readable for review.
- **Lower the answer budget instead of expanding the scenario.** Rejected — changing the budget makes existing baseline numbers incomparable. Adding a new scenario keeps the existing baselines stable.
- **Add three new scenarios for full coverage matrix.** Rejected — one stress scenario suffices to make the drop/dedup stages visible; further scenarios are diminishing returns.

## Affected modules

`benchmarks/scenarios/stress_conversation.jsonl` (new), `benchmarks/benchmark.py`, `benchmarks/README.md`

## Estimated effort

S (1–2 days)

## Baseline

All current scenarios: `items_dropped=0, dedup_removed=0, avg_compaction_ratio=1.00x–1.41x`.

## Success metric

`latest.json.context.stress_conversation.items_dropped > 0` AND `latest.json.context.stress_conversation.dedup_removed > 0` AND `latest.json.context.stress_conversation.avg_compaction_ratio > 1.0`.

## Evaluation plan

`make benchmark` produces the new scenario row with `items_dropped > 0` and `dedup_removed > 0`. Two consecutive runs are byte-identical for non-latency keys.
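
One possible way to check the "identical for non-latency keys" condition is sketched below. It assumes that latency fields can be recognised by `latency` or `_ms` appearing in the key name, which is a guess about the result schema and not something this issue confirms. The helper names `strip_latency` and `canonical` are hypothetical, and the two result dicts are made-up placeholders, not real benchmark output.

```python
import json


def strip_latency(node, markers=("latency", "_ms")):
    """Recursively drop dict keys whose name contains a latency marker.

    The marker list is an assumption; adjust it to the real key names.
    """
    if isinstance(node, dict):
        return {
            key: strip_latency(value, markers)
            for key, value in node.items()
            if not any(marker in key for marker in markers)
        }
    if isinstance(node, list):
        return [strip_latency(item, markers) for item in node]
    return node


def canonical(results):
    """Serialize with sorted keys so equal content yields equal text."""
    return json.dumps(strip_latency(results), sort_keys=True)


# Placeholder values for illustration only; they differ in latency alone.
run_a = {"context": {"stress_conversation": {
    "items_dropped": 3, "dedup_removed": 2, "latency_ms": 11.2}}}
run_b = {"context": {"stress_conversation": {
    "items_dropped": 3, "dedup_removed": 2, "latency_ms": 12.9}}}

print(canonical(run_a) == canonical(run_b))
```

In practice the two dicts would be loaded from copies of `benchmarks/results/latest.json` saved after each run.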

## Scope

One JSONL data file + scenario registration in `benchmark.py` + `benchmarks/README.md` baseline table update.

## Non-goals

- Don't replace the existing `long_conversation.jsonl`.
- Don't add fixture utilities or generator scripts.
- Don't change scoring weights or dedup threshold to force the effect — the scenario should hit those stages with the *default* configuration.

## Design constraints

- Determinism preserved: two consecutive `make benchmark` runs yield identical values for every non-latency key.
- Near-duplicates should have Jaccard similarity ≥ 0.85 on tokens (matches the default dedup threshold).
- Each large tool result ≥ 2000 chars.
- File format: JSONL, one object per line, matching the existing scenario schema.
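
To make the Jaccard constraint concrete, here is a minimal self-contained sketch of the check on one candidate pair. The `tokenize` and `jaccard` functions below are stand-ins written for illustration (lowercased alphanumeric word sets); they are not the `contextweaver._utils` implementations, whose tokenization may differ, so the real helpers should be used for the actual verification. The two messages are invented examples.

```python
import re

DEDUP_THRESHOLD = 0.85  # default dedup threshold as stated in this issue


def tokenize(text):
    """Stand-in tokenizer (assumption): set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Invented near-duplicate pair: only the final word differs.
first = ("I checked the invoice search results and found three overdue "
         "invoices for the Acme account this quarter")
second = ("I checked the invoice search results and found three overdue "
          "invoices for the Acme account this month")

score = jaccard(tokenize(first), tokenize(second))
print(f"jaccard={score:.3f} near_duplicate={score >= DEDUP_THRESHOLD}")
```

With this stand-in tokenizer the pair shares 15 of 17 distinct tokens, so a single changed word in a message of about 16 distinct words stays above 0.85, while shorter messages leave less margin.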

## Depends on

None.

## Suggested order

Independent — can land in parallel with the matrix issue and the gold-set expansion.

## Acceptance Criteria (STRICT)

- [ ] `benchmarks/scenarios/stress_conversation.jsonl` exists, ≥50 turns
- [ ] At least 2 `tool_result` entries with `text` length ≥ 2000 chars
- [ ] At least 2 pairs of near-duplicate `agent_msg` entries (Jaccard ≥ 0.85 verified with `contextweaver._utils.jaccard` + `tokenize`)
- [ ] `latest.json.context.stress_conversation` shows `items_dropped > 0` AND `dedup_removed > 0` AND `avg_compaction_ratio > 1.0`
- [ ] `benchmarks/README.md` baseline table updated with the new scenario row
- [ ] `make ci` passes; existing scenarios' baseline numbers unchanged
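
The structural criteria (turn count and large tool results) could be checked with something like the sketch below. It assumes each JSONL object carries its kind in a `type` field; the actual field name in the scenario schema is not given here, so `kind_field` is a parameter. `check_fixture` is a hypothetical helper, not part of the repository, and the sample lines are synthetic.

```python
import json

MIN_TURNS = 50
MIN_LARGE_RESULTS = 2
MIN_TEXT_CHARS = 2000


def check_fixture(lines, kind_field="type"):
    """Return a list of failure messages for the structural criteria.

    `kind_field` is an assumed name for the item-kind key.
    """
    items = [json.loads(line) for line in lines if line.strip()]
    large = [
        item for item in items
        if item.get(kind_field) == "tool_result"
        and len(item.get("text", "")) >= MIN_TEXT_CHARS
    ]
    failures = []
    if len(items) < MIN_TURNS:
        failures.append(f"only {len(items)} turns, need {MIN_TURNS}")
    if len(large) < MIN_LARGE_RESULTS:
        failures.append(
            f"only {len(large)} large tool results, need {MIN_LARGE_RESULTS}"
        )
    return failures


# Synthetic sample: 50 turns but only one large tool result.
sample = [json.dumps({"type": "agent_msg", "text": "short"})] * 49
sample.append(json.dumps({"type": "tool_result", "text": "x" * 2000}))
print(check_fixture(sample))
```

For the real fixture, the lines would come from opening `benchmarks/scenarios/stress_conversation.jsonl` and passing the file object to `check_fixture`.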

## Validation

```bash
make benchmark
python -c "
import json
d = json.load(open('benchmarks/results/latest.json'))['context']['stress_conversation']
assert d['items_dropped'] > 0, d
assert d['dedup_removed'] > 0, d
assert d['avg_compaction_ratio'] > 1.0, d
print('OK:', d)
"
```

## Files likely touched

`benchmarks/scenarios/stress_conversation.jsonl` (new), `benchmarks/benchmark.py` (add scenario to scenarios list), `benchmarks/README.md` (table row).

## Risk / rollback

None — pure additive data. Revert by removing the file + scenario registration.

## Security/privacy notes

No real-user content; synthetic data only. Tool result payloads should be plausibly-shaped but fully fabricated.

## Agent-ready notes

- **Starting points:** `benchmarks/scenarios/long_conversation.jsonl` for style + structure; the existing invoice-search blob is the prototype for the large tool result.
- **Constraints:** verify the near-duplicate pairs hit Jaccard ≥ 0.85 with the real tokenizer before committing (don't eyeball it). Verify determinism (two `make benchmark` runs identical for the new scenario).
- **Review focus for Copilot code review:** does the scenario actually trigger `items_dropped > 0` with the *default* `ScoringConfig` (not a tuned one), and do the near-duplicate pairs pass the actual `deduplicate_candidates` threshold?

## Cross-references

- **Closes #181** — directly implements the stress-test scenario specified in #181's goal and acceptance criteria.
- **Unblocks** the weight-sweep tool — drop/dedup pressure is what makes scoring-weight differences observable.

## Size

S
