feat: add composite Robustness Evaluator tool #18

Merged

daedalus merged 1 commit into master from copilot/add-robustness-score-metric on May 6, 2026
Conversation

Copilot AI (Contributor) commented May 6, 2026

Projects need a single quantitative metric that captures adversarial test health, enforces a minimum adversarial test ratio, and rewards broad category coverage — none of which the existing S×E×C×λ risk model addresses.

Changes

tools/robustness_evaluator.py (new)

  • evaluate_robustness() — core function returning a RobustnessResult dataclass with:
    • R = C × (α × P_a + (1−α) × P_n) — Composite Robustness Score
    • R_d = C × D × (α × P_a + (1−α) × P_n) — diversity-penalised variant
    • F = 1 − (P_a / P_n) — Adversarial Fragility Index
    • D = categories_with_≥1_pass / total_categories — category diversity ratio
  • CategoryStats dataclass for per-category (boundary / semantic / evasion / compositional) pass/fail breakdown
  • Enforces 25% adversarial minimum — exits 1 when unmet, making it usable as a CI gate
  • CLI with human-readable report and --json output for pipeline integration
from tools.robustness_evaluator import evaluate_robustness, CategoryStats

result = evaluate_robustness(
    n_total=100, n_adversarial=30,
    passing_adv=18, passing_norm=63,
    coverage=0.80, alpha=0.65,       # 0.5 general · 0.65 security · 0.75 red-team
    categories=[
        CategoryStats("boundary",      9, 6),
        CategoryStats("semantic",      8, 5),
        CategoryStats("evasion",       8, 4),
        CategoryStats("compositional", 5, 3),
    ],
)
# result.robustness_score            → 0.564  [FAIR]
# result.robustness_score_with_diversity → 0.564
# result.fragility_index             → 0.333  [BRITTLE]
# result.meets_adversarial_minimum   → True
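
As a quick sanity check, the published formulas reproduce the example's numbers directly. The following standalone sketch applies the formulas by hand; it is not code from the module:

```python
# Re-derive the example metrics from the formulas alone (standalone sketch).
alpha, coverage = 0.65, 0.80
p_adv = 18 / 30    # P_a: adversarial pass rate (passing_adv / n_adversarial)
p_norm = 63 / 70   # P_n: normal pass rate (passing_norm / (n_total - n_adversarial))

weighted = alpha * p_adv + (1 - alpha) * p_norm
r = coverage * weighted          # R   = C x (alpha*P_a + (1-alpha)*P_n)
d = 4 / 4                        # D: all four categories have >= 1 passing test
r_d = coverage * d * weighted    # R_d = C x D x (alpha*P_a + (1-alpha)*P_n)
f = 1 - (p_adv / p_norm)         # F   = 1 - (P_a / P_n)

print(f"R={r:.3f}  R_d={r_d:.3f}  F={f:.3f}")  # R=0.564  R_d=0.564  F=0.333
```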

README.md

  • New Robustness Evaluator section covering metric formulas, adversarial budget allocation table, Python API, and CLI usage with example output
  • Updated System Components table to include the new tool

Summary by Sourcery

Add a robustness evaluation tool that computes composite robustness and fragility metrics from test results and exposes them via both Python API and CLI.

New Features:

  • Introduce a Robustness Evaluator module that calculates composite robustness scores, fragility index, and diversity-aware metrics from test-suite statistics.
  • Provide a CLI interface and JSON output for integrating robustness evaluation into CI pipelines with an enforced minimum adversarial test ratio.

Enhancements:

  • Document the Robustness Evaluator in the README, including metric definitions, recommended adversarial budgeting, and usage examples, and add it to the system components table.


sourcery-ai Bot commented May 6, 2026

Reviewer's Guide

Introduce a new Robustness Evaluator tool that computes composite robustness and fragility metrics from test-suite statistics via a Python API and CLI, enforces a minimum adversarial test ratio for CI gating, and documents its use in the README.

Sequence diagram for CI gating with the Robustness Evaluator CLI

sequenceDiagram
  actor Developer
  participant CI as CI Pipeline
  participant RE as Robustness Evaluator CLI
  participant Eval as evaluate_robustness

  Developer->>CI: Push code or open PR
  CI->>CI: Run tests and collect metrics
  CI->>RE: Invoke with arguments (n_total, n_adversarial, passing_adv, passing_norm, coverage, alpha, categories)
  RE->>Eval: evaluate_robustness(n_total, n_adversarial, passing_adv, passing_norm, coverage, alpha, categories)
  Eval-->>RE: RobustnessResult

  alt JSON output requested
    RE-->>CI: JSON metrics on stdout
  else Human-readable report
    RE-->>Developer: Text report on stdout
  end

  RE-->>CI: Exit code (0 or 1)
  alt Adversarial ratio >= 0.25
    CI->>CI: Mark robustness check passed
  else Adversarial ratio < 0.25
    CI->>CI: Fail pipeline due to low adversarial coverage
  end

Updated class diagram for the Robustness Evaluator module

classDiagram
  class CategoryStats {
    +str name
    +int total
    +int passing
    +float pass_rate()
  }

  class RobustnessResult {
    +int n_total
    +int n_adversarial
    +int n_normal
    +int passing_adv
    +int passing_norm
    +float coverage
    +float alpha
    +float p_adversarial
    +float p_normal
    +float robustness_score
    +float robustness_score_with_diversity
    +float fragility_index
    +float adversarial_ratio
    +bool meets_adversarial_minimum
    +float diversity_score
    +list~CategoryStats~ categories
    +str robustness_label()
    +str fragility_label()
    +dict to_dict()
  }

  class robustness_evaluator_module {
    +float ADVERSARIAL_COVERAGE_MIN
    +float DEFAULT_ALPHA
    +dict~str,float~ ADVERSARIAL_BUDGET
    +RobustnessResult evaluate_robustness(n_total,n_adversarial,passing_adv,passing_norm,coverage,alpha,categories)
    +str _format_report(result)
    +int main(argv)
  }

  robustness_evaluator_module "1" o-- "*" CategoryStats
  robustness_evaluator_module "1" o-- "1" RobustnessResult
  RobustnessResult "1" o-- "*" CategoryStats

File-Level Changes

1. Add a Robustness Evaluator module that computes composite robustness scores, fragility indices, and diversity-aware variants from test metrics, with support for per-category adversarial statistics. (tools/robustness_evaluator.py)
   • Define CategoryStats and RobustnessResult dataclasses to model per-category stats and aggregated robustness outcomes, including helper label properties and JSON-serialisable output.
   • Implement evaluate_robustness() to validate inputs, derive pass rates, and compute the robustness score R, diversity-penalised score R_d, fragility index F, adversarial ratio, diversity score, and minimum adversarial coverage flag.
   • Introduce constants for the minimum adversarial coverage, default adversarial weight, and recommended adversarial budget per category.

2. Expose the Robustness Evaluator via a CLI suitable for human-readable reports, JSON output, and CI enforcement of the adversarial minimum. (tools/robustness_evaluator.py)
   • Add an argparse-based CLI that collects global metrics and optional per-category stats, parsing categories from a JSON string into CategoryStats objects.
   • Render a formatted text report including composition, pass rates, primary metrics, and a per-category breakdown using unicode bar indicators, or emit JSON via RobustnessResult.to_dict().
   • Set the process exit code to 0 only when the minimum adversarial coverage threshold is met, otherwise exit with code 1; use exit code 2 for invalid input errors.

3. Document the Robustness Evaluator and surface it in the project's main README. (README.md)
   • Update the System Components table to list the new Robustness Evaluation tool and its purpose.
   • Add a dedicated Robustness Evaluator section describing formulas for R, R_d, F, and D; symbol definitions; qualitative labels; and the enforced 25% adversarial coverage rule.
   • Provide Python and CLI usage examples, including JSON-based category inputs and sample human-readable output for the evaluator.
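
The exit-code contract above is what makes the tool usable as a CI gate. For illustration only, a gating step could look roughly like the sketch below; --json and --categories are confirmed by this PR, but the other flag spellings are assumptions about the CLI, not its documented interface:

```python
import json
import subprocess
import sys

# Flag names other than --categories and --json are hypothetical.
categories = json.dumps([
    {"name": "boundary", "total": 9, "passing": 6},
    {"name": "semantic", "total": 8, "passing": 5},
])
proc = subprocess.run(
    [sys.executable, "tools/robustness_evaluator.py",
     "--n-total", "100", "--n-adversarial", "30",
     "--passing-adv", "18", "--passing-norm", "63",
     "--coverage", "0.80", "--alpha", "0.65",
     "--categories", categories, "--json"],
    capture_output=True, text=True,
)
if proc.returncode != 0:
    # 1: adversarial minimum unmet; 2: invalid input (per the PR description)
    sys.exit(proc.returncode)
# JSON field name assumed to match the RobustnessResult attribute.
print(json.loads(proc.stdout)["robustness_score"])
```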


daedalus merged commit 8bc0758 into master on May 6, 2026
1 check was pending
daedalus deleted the copilot/add-robustness-score-metric branch on May 6, 2026 at 19:15
@codacy-production

Not up to standards ⛔

🔴 Issues: 4 medium · 5 minor

Alerts:
⚠ 9 issues found, exceeding the quality gate of ≤ 0 issues of at least minor severity

Results:
9 new issues

| Category      | Results  |
|---------------|----------|
| BestPractice  | 1 medium |
| Documentation | 5 minor  |
| Complexity    | 3 medium |

View in Codacy

🟢 Metrics: 44 complexity · 0 duplication

| Metric      | Results |
|-------------|---------|
| Complexity  | 44      |
| Duplication | 0       |

View in Codacy



sourcery-ai (bot) left a comment


Hey - I've found 2 issues and left some high-level feedback:

  • Consider either using or removing the ADVERSARIAL_BUDGET constant, as it is currently defined but never referenced in the evaluator or CLI logic.
  • The fragility index currently resolves to 1.0 when there are zero adversarial tests (because p_adv is 0 and p_norm > 0), which might be misleading; you may want to treat n_adversarial == 0 as fragility_index=None to clearly signal that fragility cannot be assessed (a guard for this is sketched after this list).
  • For the --categories argument, wrapping json.loads and the CategoryStats(**item) construction in a try/except with a clear error message (and exit code) would make the CLI more robust against malformed JSON or missing fields instead of raising uncaught exceptions.
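
A minimal sketch of the guard suggested in the second point; this is hypothetical code, not what the PR currently implements:

```python
from typing import Optional

def fragility_index(passing_adv: int, n_adversarial: int,
                    passing_norm: int, n_normal: int) -> Optional[float]:
    """F = 1 - (P_a / P_n), or None when fragility cannot be assessed."""
    if n_adversarial == 0:
        return None  # no adversarial tests: fragility is undefined, not 1.0
    p_adv = passing_adv / n_adversarial
    p_norm = passing_norm / n_normal if n_normal else 0.0
    if p_norm == 0:
        return None  # no passing normal tests: the ratio is undefined too
    return 1 - (p_adv / p_norm)
```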


Comment on lines +197 to +201

    if n_adversarial > 0 and not (0 <= passing_adv <= n_adversarial):
        raise ValueError(
            f"passing_adv must be in [0, n_adversarial], got {passing_adv}"
        )
    if n_normal > 0 and not (0 <= passing_norm <= n_normal):


issue: Passing counts are not validated when the corresponding test count is zero, allowing inconsistent inputs like passing_adv > 0 when n_adversarial == 0.

Because the checks are guarded by n_adversarial > 0 / n_normal > 0, inputs like n_adversarial=0, passing_adv=10 pass validation and yield p_adversarial=0.0, hiding invalid data. Please validate passing_adv and passing_norm against their counts unconditionally, e.g. if not (0 <= passing_adv <= n_adversarial): ..., which still behaves correctly when counts are zero.
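
The fix the comment asks for is a small change; a sketch against the names in the diff, replacing the guarded checks above:

```python
# Unconditional bounds checks: when n_adversarial == 0, the interval [0, 0]
# only admits passing_adv == 0, so inconsistent inputs are now rejected.
if not (0 <= passing_adv <= n_adversarial):
    raise ValueError(
        f"passing_adv must be in [0, n_adversarial], got {passing_adv}"
    )
if not (0 <= passing_norm <= n_normal):
    raise ValueError(
        f"passing_norm must be in [0, n_normal], got {passing_norm}"
    )
```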

Comment on lines +375 to +377

    if args.categories:
        raw = json.loads(args.categories)
        cats = [CategoryStats(**item) for item in raw]


issue (bug_risk): JSON and per-category parsing errors from --categories will currently crash the CLI instead of returning a clear error exit code.

Both json.loads(args.categories) and CategoryStats(**item) can raise (malformed JSON, missing/extra fields, wrong types), which currently bubbles up as a traceback and exits ungracefully. Consider wrapping this in try/except (json.JSONDecodeError, TypeError, ValueError) and printing a clear error to stderr with a non-zero exit code, consistent with how evaluate_robustness validation errors are handled.
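
A sketch of the suggested handling, using exit code 2 to match the convention this PR states for invalid input (the surrounding names come from the diff):

```python
cats = None
if args.categories:
    try:
        raw = json.loads(args.categories)
        cats = [CategoryStats(**item) for item in raw]
    except (json.JSONDecodeError, TypeError, ValueError) as exc:
        print(f"error: invalid --categories JSON: {exc}", file=sys.stderr)
        return 2  # same exit code the CLI uses for other invalid inputs
```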

