Revise benchmark evaluation metrics and runners by bihius · Pull Request #230 · bihius/guard-proxy

bihius · 2026-06-06T14:16:59Z

Summary

Remove universal TPR/FPR targets from the evaluation plan and keep only RPS degradation as a soft lab guardrail
Add a tagged labeled corpus runner for defensible TP/FN/TN/FP counting
Rework go-ftw, ZAP, and Nuclei reporting so only the corpus publishes clean classification metrics
Update collector logic, lab wiring, and docs to match the revised methodology
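To make the revised methodology concrete, here is a minimal sketch of how a tagged labeled corpus can yield TP/FN/TN/FP counts, together with an RPS degradation figure. It is not the code from this PR. The names `Confusion`, `count_outcomes`, and `rps_degradation`, the `"attack"`/`"benign"` tags, and the `(label, blocked)` record shape are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Confusion:
    """Hypothetical container for the four classification counts."""
    tp: int = 0  # attack request that was blocked
    fn: int = 0  # attack request that was allowed
    tn: int = 0  # benign request that was allowed
    fp: int = 0  # benign request that was blocked

    def tpr(self) -> Optional[float]:
        """True positive rate, or None when the corpus has no attack samples."""
        positives = self.tp + self.fn
        return self.tp / positives if positives else None

    def fpr(self) -> Optional[float]:
        """False positive rate, or None when the corpus has no benign samples."""
        negatives = self.fp + self.tn
        return self.fp / negatives if negatives else None


def count_outcomes(results: Iterable[Tuple[str, bool]]) -> Confusion:
    """Count outcomes from (label, blocked) pairs.

    Assumed record shape: label is "attack" or "benign", and blocked is
    True when the proxy rejected the request.
    """
    counts = Confusion()
    for label, blocked in results:
        if label == "attack":
            if blocked:
                counts.tp += 1
            else:
                counts.fn += 1
        elif label == "benign":
            if blocked:
                counts.fp += 1
            else:
                counts.tn += 1
        else:
            # Untagged samples cannot be classified defensibly, so reject them.
            raise ValueError(f"unknown label: {label!r}")
    return counts


def rps_degradation(baseline_rps: float, proxied_rps: float) -> float:
    """Fractional throughput loss relative to the unproxied baseline."""
    if baseline_rps <= 0:
        raise ValueError("baseline_rps must be positive")
    return (baseline_rps - proxied_rps) / baseline_rps


if __name__ == "__main__":
    sample = [("attack", True), ("attack", False), ("benign", False)]
    counts = count_outcomes(sample)
    print(counts, counts.tpr(), counts.fpr())
    print(rps_degradation(1000.0, 900.0))
```

Returning `None` for an undefined rate, rather than `0.0`, keeps an empty class from being mistaken for a perfect or a failing score. That matches the PR's intent of publishing classification metrics only where every sample carries a ground-truth label.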

Testing

Added unit tests for the new metrics helpers and FTW classification logic; all pass
Verified shell syntax for the lab runners and setup scripts
Verified the new metrics module compiles and the benchmark Make targets expand correctly

bihius and others added 2 commits June 6, 2026 16:15

Revise evaluation benchmarks and metrics

a1e0924

Merge branch 'main' into benchmark-refactor

39b67cd

bihius merged commit 0d25d48 into main Jun 6, 2026
3 checks passed

bihius added a commit that referenced this pull request Jun 6, 2026

docs: advance roadmap status to M6 active

2398bee

M3 and M5 are complete; M6 benchmark work is underway (PR #230 merged). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

bihius merged 2 commits into main from benchmark-refactor
