feat: add composite Robustness Evaluator tool #18
Conversation
Agent-Logs-Url: https://github.com/daedalus/ImpactGuard/sessions/5a202457-7c67-4611-a805-55045edfc8e3
Co-authored-by: daedalus <115175+daedalus@users.noreply.github.com>
## Reviewer's Guide

Introduce a new Robustness Evaluator tool that computes composite robustness and fragility metrics from test-suite statistics via a Python API and CLI, enforces a minimum adversarial test ratio for CI gating, and documents its use in the README.

### Sequence diagram for CI gating with the Robustness Evaluator CLI

```mermaid
sequenceDiagram
    actor Developer
    participant CI as CI Pipeline
    participant RE as Robustness Evaluator CLI
    participant Eval as evaluate_robustness
    Developer->>CI: Push code or open PR
    CI->>CI: Run tests and collect metrics
    CI->>RE: Invoke with arguments (n_total, n_adversarial, passing_adv, passing_norm, coverage, alpha, categories)
    RE->>Eval: evaluate_robustness(n_total, n_adversarial, passing_adv, passing_norm, coverage, alpha, categories)
    Eval-->>RE: RobustnessResult
    alt JSON output requested
        RE-->>CI: JSON metrics on stdout
    else Human-readable report
        RE-->>Developer: Text report on stdout
    end
    RE-->>CI: Exit code (0 or 1)
    alt Adversarial ratio >= 0.25
        CI->>CI: Mark robustness check passed
    else Adversarial ratio < 0.25
        CI->>CI: Fail pipeline due to low adversarial coverage
    end
```
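To make the gating step concrete, here is a minimal sketch of a CI wrapper around the evaluator. It assumes only the `evaluate_robustness` signature and the `RobustnessResult` fields shown in the diagrams; the import path, the sample numbers, and the assumption that `meets_adversarial_minimum` reflects the 0.25 threshold (`ADVERSARIAL_COVERAGE_MIN`) are illustrative, not the shipped CLI.

```python
# Sketch of a CI gating step; see the hedges in the paragraph above.
import json
import sys

from tools.robustness_evaluator import evaluate_robustness  # assumed import path

result = evaluate_robustness(
    n_total=120,        # all tests executed in the pipeline
    n_adversarial=36,   # adversarial subset (ratio 0.3 >= 0.25)
    passing_adv=30,
    passing_norm=80,
    coverage=0.85,
    alpha=0.6,
)

# Machine-readable metrics for downstream pipeline steps.
print(json.dumps(result.to_dict(), indent=2))

# Fail the pipeline when the adversarial test ratio falls below the minimum.
sys.exit(0 if result.meets_adversarial_minimum else 1)
```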
### Updated class diagram for the Robustness Evaluator module

```mermaid
classDiagram
class CategoryStats {
+str name
+int total
+int passing
+float pass_rate()
}
class RobustnessResult {
+int n_total
+int n_adversarial
+int n_normal
+int passing_adv
+int passing_norm
+float coverage
+float alpha
+float p_adversarial
+float p_normal
+float robustness_score
+float robustness_score_with_diversity
+float fragility_index
+float adversarial_ratio
+bool meets_adversarial_minimum
+float diversity_score
+list~CategoryStats~ categories
+str robustness_label()
+str fragility_label()
+dict to_dict()
}
class robustness_evaluator_module {
+float ADVERSARIAL_COVERAGE_MIN
+float DEFAULT_ALPHA
+dict~str,float~ ADVERSARIAL_BUDGET
+RobustnessResult evaluate_robustness(n_total,n_adversarial,passing_adv,passing_norm,coverage,alpha,categories)
+str _format_report(result)
+int main(argv)
}
robustness_evaluator_module "1" o-- "*" CategoryStats
robustness_evaluator_module "1" o-- "1" RobustnessResult
RobustnessResult "1" o-- "*" CategoryStats
```
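As a quick orientation, here is a hypothetical usage sketch of the Python API implied by this diagram. The keyword arguments and `CategoryStats` fields mirror the diagram, and the category names follow the per-category breakdown listed in the Changes section below; the import path and the concrete numbers are assumptions made for illustration.

```python
# Hypothetical usage of the evaluator API; values are invented for illustration.
from tools.robustness_evaluator import CategoryStats, evaluate_robustness

categories = [
    CategoryStats(name="boundary", total=10, passing=9),
    CategoryStats(name="semantic", total=8, passing=6),
    CategoryStats(name="evasion", total=7, passing=0),
    CategoryStats(name="compositional", total=5, passing=4),
]

result = evaluate_robustness(
    n_total=100,
    n_adversarial=30,    # matches the per-category totals above
    passing_adv=19,      # matches the per-category passing counts above
    passing_norm=65,
    coverage=0.9,
    alpha=0.6,
    categories=categories,
)

print(result.robustness_score, result.robustness_score_with_diversity)
print(result.fragility_index, result.robustness_label(), result.fragility_label())
```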
### Codacy review

**Not up to standards** ⛔

🔴 Issues

| Category | Results |
|---|---|
| Best Practice | 1 medium |
| Documentation | 5 minor |
| Complexity | 3 medium |

🟢 Metrics

| Metric | Results |
|---|---|
| Complexity | 44 |
| Duplication | 0 |
Hey - I've found 2 issues, and left some high-level feedback:

- Consider either using or removing the `ADVERSARIAL_BUDGET` constant, as it is currently defined but never referenced in the evaluator or CLI logic.
- The fragility index currently resolves to `1.0` when there are zero adversarial tests (because `p_adv` is `0` and `p_norm > 0`), which might be misleading; you may want to treat `n_adversarial == 0` as `fragility_index=None` to clearly signal that fragility cannot be assessed.
- For the `--categories` argument, wrapping `json.loads` and the `CategoryStats(**item)` construction in a try/except with a clear error message (and exit code) would make the CLI more robust against malformed JSON or missing fields instead of raising uncaught exceptions.
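A minimal sketch of the fragility suggestion in the second point above, assuming the evaluator derives `p_adversarial` and `p_normal` as shown in the class diagram; the helper name is hypothetical:

```python
from typing import Optional

def fragility_index(p_adversarial: float, p_normal: float,
                    n_adversarial: int) -> Optional[float]:
    """Adversarial Fragility Index F = 1 - (P_a / P_n).

    Returns None when fragility cannot be assessed, i.e. when there are no
    adversarial tests (or no passing normal tests to compare against).
    """
    if n_adversarial == 0 or p_normal == 0:
        return None
    return 1.0 - (p_adversarial / p_normal)
```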
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider either using or removing the `ADVERSARIAL_BUDGET` constant, as it is currently defined but never referenced in the evaluator or CLI logic.
- The fragility index currently resolves to `1.0` when there are zero adversarial tests (because `p_adv` is `0` and `p_norm > 0`), which might be misleading; you may want to treat `n_adversarial == 0` as `fragility_index=None` to clearly signal that fragility cannot be assessed.
- For the `--categories` argument, wrapping `json.loads` and the `CategoryStats(**item)` construction in a try/except with a clear error message (and exit code) would make the CLI more robust against malformed JSON or missing fields instead of raising uncaught exceptions.
## Individual Comments
### Comment 1
<location path="tools/robustness_evaluator.py" line_range="197-201" />
<code_context>
+
+ n_normal = n_total - n_adversarial
+
+ if n_adversarial > 0 and not (0 <= passing_adv <= n_adversarial):
+ raise ValueError(
+ f"passing_adv must be in [0, n_adversarial], got {passing_adv}"
+ )
+ if n_normal > 0 and not (0 <= passing_norm <= n_normal):
+ raise ValueError(
+ f"passing_norm must be in [0, n_normal], got {passing_norm}"
</code_context>
<issue_to_address>
**issue:** Passing counts are not validated when the corresponding test count is zero, allowing inconsistent inputs like passing_adv > 0 when n_adversarial == 0.
Because the checks are guarded by `n_adversarial > 0` / `n_normal > 0`, inputs like `n_adversarial=0, passing_adv=10` pass validation and yield `p_adversarial=0.0`, hiding invalid data. Please validate `passing_adv` and `passing_norm` against their counts unconditionally, e.g. `if not (0 <= passing_adv <= n_adversarial): ...`, which still behaves correctly when counts are zero.
</issue_to_address>
### Comment 2
<location path="tools/robustness_evaluator.py" line_range="375-377" />
<code_context>
+ args = parser.parse_args(argv)
+
+ cats: Optional[list[CategoryStats]] = None
+ if args.categories:
+ raw = json.loads(args.categories)
+ cats = [CategoryStats(**item) for item in raw]
+
+ try:
</code_context>
<issue_to_address>
**issue (bug_risk):** JSON and per-category parsing errors from --categories will currently crash the CLI instead of returning a clear error exit code.
Both `json.loads(args.categories)` and `CategoryStats(**item)` can raise (malformed JSON, missing/extra fields, wrong types), which currently bubbles up as a traceback and exits ungracefully. Consider wrapping this in `try/except (json.JSONDecodeError, TypeError, ValueError)` and printing a clear error to stderr with a non-zero exit code, consistent with how `evaluate_robustness` validation errors are handled.
</issue_to_address>
```python
if n_adversarial > 0 and not (0 <= passing_adv <= n_adversarial):
    raise ValueError(
        f"passing_adv must be in [0, n_adversarial], got {passing_adv}"
    )
if n_normal > 0 and not (0 <= passing_norm <= n_normal):
```
**issue:** Passing counts are not validated when the corresponding test count is zero, allowing inconsistent inputs like `passing_adv > 0` when `n_adversarial == 0`.

Because the checks are guarded by `n_adversarial > 0` / `n_normal > 0`, inputs like `n_adversarial=0, passing_adv=10` pass validation and yield `p_adversarial=0.0`, hiding invalid data. Please validate `passing_adv` and `passing_norm` against their counts unconditionally, e.g. `if not (0 <= passing_adv <= n_adversarial): ...`, which still behaves correctly when counts are zero.
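A sketch of the unconditional validation the comment asks for, reusing the names from the snippet above:

```python
# Validate passing counts against their totals even when a total is zero,
# so inputs such as n_adversarial=0 with passing_adv=10 are rejected.
if not (0 <= passing_adv <= n_adversarial):
    raise ValueError(
        f"passing_adv must be in [0, n_adversarial], got {passing_adv}"
    )
if not (0 <= passing_norm <= n_normal):
    raise ValueError(
        f"passing_norm must be in [0, n_normal], got {passing_norm}"
    )
```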
```python
if args.categories:
    raw = json.loads(args.categories)
    cats = [CategoryStats(**item) for item in raw]
```
**issue (bug_risk):** JSON and per-category parsing errors from `--categories` will currently crash the CLI instead of returning a clear error exit code.

Both `json.loads(args.categories)` and `CategoryStats(**item)` can raise (malformed JSON, missing/extra fields, wrong types), which currently bubbles up as a traceback and exits ungracefully. Consider wrapping this in `try/except (json.JSONDecodeError, TypeError, ValueError)` and printing a clear error to stderr with a non-zero exit code, consistent with how `evaluate_robustness` validation errors are handled.
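A minimal sketch of the suggested guard, assuming it sits inside `main(argv)` right after argument parsing; the error message wording and exit code are illustrative:

```python
import json
import sys
from typing import Optional

# Inside main(argv), after argparse has produced `args`:
cats: Optional[list[CategoryStats]] = None
if args.categories:
    try:
        raw = json.loads(args.categories)
        cats = [CategoryStats(**item) for item in raw]
    except (json.JSONDecodeError, TypeError, ValueError) as exc:
        # Covers malformed JSON, missing/extra fields, and wrong types.
        print(f"error: invalid --categories value: {exc}", file=sys.stderr)
        return 1
```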
Projects need a single quantitative metric that captures adversarial test health, enforces a minimum adversarial test ratio, and rewards broad category coverage — none of which the existing S×E×C×λ risk model addresses.
## Changes

- `tools/robustness_evaluator.py` (new)
  - `evaluate_robustness()` — core function returning a `RobustnessResult` dataclass with:
    - `R = C × (α × P_a + (1−α) × P_n)` — Composite Robustness Score
    - `R_d = C × D × (α × P_a + (1−α) × P_n)` — diversity-penalised variant
    - `F = 1 − (P_a / P_n)` — Adversarial Fragility Index
    - `D = categories_with_≥1_pass / total_categories` — category diversity ratio
  - `CategoryStats` dataclass for per-category (boundary / semantic / evasion / compositional) pass/fail breakdown
  - CLI exits with code `1` when the minimum adversarial test ratio is unmet, making it usable as a CI gate
  - `--json` output for pipeline integration
- `README.md`: usage documentation for the new tool

A worked numeric example of these formulas is sketched below.
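A short numeric check of the formulas above; the statistics are made up purely for illustration:

```python
# Hypothetical test-suite statistics, chosen only to exercise the formulas.
n_total, n_adversarial = 100, 30
passing_adv, passing_norm = 24, 63
coverage, alpha = 0.9, 0.6                      # C and alpha
categories_with_pass, total_categories = 3, 4

p_a = passing_adv / n_adversarial               # P_a = 0.8
p_n = passing_norm / (n_total - n_adversarial)  # P_n = 0.9
d = categories_with_pass / total_categories     # D = 0.75

r = coverage * (alpha * p_a + (1 - alpha) * p_n)        # R   = 0.9 * 0.84 = 0.756
r_d = coverage * d * (alpha * p_a + (1 - alpha) * p_n)  # R_d = 0.75 * R  = 0.567
f = 1 - (p_a / p_n)                                     # F   ≈ 0.111
print(r, r_d, f)
```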
## Summary by Sourcery

Add a robustness evaluation tool that computes composite robustness and fragility metrics from test results and exposes them via both Python API and CLI.
New Features:
Enhancements: