Open
Labels
area: review-pipeline (Review pipeline, context, prompts), eval-fixture (Eval fixture / benchmark scenario)
Description
Problem
Issue #30 (Adaptive patch compression for large PRs) is open but has no eval fixtures to validate compression behavior. Current fixtures are all small, focused diffs. Real-world PRs can span 50+ files and thousands of lines, so the reviewer needs to degrade gracefully rather than silently drop findings.
Proposal
Create stress test fixtures that validate review quality under compression:
Fixtures
| # | name | size | expected behavior |
|---|---|---|---|
| 1 | `large-pr-50-files-mixed` | 50 files, ~3000 lines | Must still catch the 1 security issue buried in file #38 |
| 2 | `large-pr-refactor-plus-bug` | 30 files (28 renames + 2 real changes) | Must not waste context on renames; must review the 2 substantive files |
| 3 | `large-pr-generated-code` | 10 files, 1 of which is a 2000-line generated proto file | Must skip the generated file and review the rest |
| 4 | `large-pr-deletion-heavy` | 20 files, 15 of which are pure deletions | Must review the 5 non-deletion files; deletion-only files may be skipped |
| 5 | `context-budget-exceeded` | Single file, 5000-line diff | Must use chunking/compression, not truncate randomly |
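To illustrate what "chunking, not random truncation" could mean for fixture 5, here is a minimal sketch that packs whole unified-diff hunks into chunks under a line limit. The function names (`split_into_hunks`, `chunk_hunks`) and the `max_lines` parameter are hypothetical and not taken from the reviewer's code; the actual strategy for issue #30 may differ, for example by budgeting in tokens instead of lines.

```python
from typing import List


def split_into_hunks(diff_lines: List[str]) -> List[List[str]]:
    """Group unified-diff lines into hunks, each starting at an '@@' header.

    Any lines before the first '@@' header (e.g. '---' / '+++' file headers)
    form their own leading group, so no line is lost.
    """
    hunks: List[List[str]] = []
    for line in diff_lines:
        if line.startswith("@@ ") or not hunks:
            hunks.append([])
        hunks[-1].append(line)
    return hunks


def chunk_hunks(hunks: List[List[str]], max_lines: int) -> List[List[str]]:
    """Greedily pack whole hunks into chunks of at most max_lines lines.

    A hunk is never cut in the middle. A single hunk longer than max_lines
    gets a chunk of its own (hypothetical policy: oversize beats truncation).
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    for hunk in hunks:
        # Close the current chunk if adding this hunk would exceed the limit.
        if current and len(current) + len(hunk) > max_lines:
            chunks.append(current)
            current = []
        current.extend(hunk)
    if current:
        chunks.append(current)
    return chunks
```

Splitting on hunk boundaries keeps each chunk a syntactically valid piece of the diff, and the concatenation of all chunks equals the original diff, so a fixture can assert that nothing was dropped.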
Metrics
For each fixture, track:
- Files reviewed vs files skipped (and why)
- Compression strategy used (full / compressed / clipped / multi-call)
- Finding recall compared to a "small diff" version of the same bug
- Total tokens used vs context budget
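The metrics above could be captured in one record per fixture run. The following is a sketch only: the `FixtureMetrics` class, its field names, and the `finding_recall` helper are hypothetical, and the four strategy labels are the ones listed in this issue.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Strategy labels as listed in this issue.
STRATEGIES = {"full", "compressed", "clipped", "multi-call"}


@dataclass
class FixtureMetrics:
    """Hypothetical per-fixture record of the metrics proposed above."""
    fixture: str
    strategy: str
    reviewed_files: Set[str]
    skipped_files: Dict[str, str]  # path -> reason it was skipped
    tokens_used: int
    context_budget: int

    def __post_init__(self) -> None:
        if self.strategy not in STRATEGIES:
            raise ValueError(f"unknown compression strategy: {self.strategy!r}")
        if self.context_budget <= 0:
            raise ValueError("context_budget must be positive")

    def budget_utilization(self) -> float:
        """Fraction of the context budget consumed by this run."""
        return self.tokens_used / self.context_budget


def finding_recall(baseline: Set[str], found: Set[str]) -> float:
    """Share of findings from the small-diff version also found in the large PR.

    Extra findings in `found` do not affect recall. An empty baseline
    counts as full recall (an assumption made for this sketch).
    """
    if not baseline:
        return 1.0
    return len(baseline & found) / len(baseline)
```

Comparing finding identifiers as sets assumes each seeded bug has a stable ID shared between the small-diff and large-PR versions of a fixture.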
Acceptance
- 5 large-PR fixtures in `eval/fixtures/stress/`
- Compression strategy logged per fixture
- Security findings not dropped under compression
- Triage correctly skips generated/deletion-only files
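The triage criterion could be checked against a rule like the one sketched below. `FileChange`, `skip_reason`, and `triage` are hypothetical names, and the sketch assumes the fixture already marks which files are renamed or generated; how the real pipeline detects generated code is not specified in this issue.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FileChange:
    """Hypothetical per-file summary of a diff."""
    path: str
    added: int            # number of added lines
    deleted: int          # number of deleted lines
    renamed: bool = False
    generated: bool = False  # assumed to be set by the fixture or a detector


def skip_reason(change: FileChange) -> Optional[str]:
    """Return why a file can be skipped, or None if it must be reviewed."""
    if change.generated:
        return "generated"
    if change.renamed and change.added == 0 and change.deleted == 0:
        return "rename-only"
    if change.added == 0 and change.deleted > 0:
        return "deletion-only"
    return None


def triage(changes: List[FileChange]) -> Tuple[List[str], Dict[str, str]]:
    """Split files into paths to review and skipped paths with reasons."""
    reviewed: List[str] = []
    skipped: Dict[str, str] = {}
    for change in changes:
        reason = skip_reason(change)
        if reason is None:
            reviewed.append(change.path)
        else:
            skipped[change.path] = reason
    return reviewed, skipped
```

Recording a reason for every skipped file lines up with the "files skipped (and why)" metric, and a renamed file that also has content changes still falls through to review.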
🤖 Generated with Claude Code