Major: remove is_safe and confidence from evaluation models #201

@lan17

Description

Summary

  • In a future major version, remove EvaluationResponse.is_safe, top-level EvaluationResponse.confidence, and per-control EvaluatorResult.confidence.
  • Evaluation responses should expose the actual evaluation artifacts: matches, errors, non_matches, and reason.
  • Enforcement code should derive allow/block/steer behavior directly from matches, errors, control actions, and on_evaluation_error, rather than reading a precomputed boolean or confidence score.

Motivation

  • Agent Control evaluation is fundamentally about control outcomes:
    • which controls matched
    • which controls failed
    • which controls did not match
    • which actions are attached to matched controls (deny, steer, observe)
  • is_safe is derived state, not source-of-truth state. It compresses control actions, evaluator errors, fail-open/fail-closed policy, and SDK/server merge behavior into one boolean.
  • Top-level confidence is not a coherent confidence concept. The current value (sketched after this list) is a mixture of:
    • full confidence for deny matches
    • zero confidence for deny errors
    • proportional successful/evaluated controls otherwise
    • min(local, server) when merging SDK-local and server evaluations
  • Per-control EvaluatorResult.confidence is also not consistently meaningful. Most deterministic evaluators return fixed 1.0 / 0.0 values, while classifier-style evaluators may use it as a score. That creates a shared field whose semantics vary by evaluator.
  • Keeping these confidence fields risks misleading API consumers into treating them as calibrated probabilities.
  • Removing these fields forces clients to reason from explicit evaluation results instead of ambiguous summary numbers.
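
To make the ambiguity concrete, here is a condensed paraphrase of the derivations described above. This is illustrative pseudocode, not the engine source; all names, object shapes, and edge cases are hypothetical.

```python
# Illustrative paraphrase of the derived fields described above -- not the
# actual engine code; object shapes and names are hypothetical.

def derive_is_safe(matches, errors) -> bool:
    # One boolean compressing control actions, evaluator errors, and policy.
    blocking_match = any(m.action in ("deny", "steer") for m in matches)
    blocking_error = any(e.action == "deny" for e in errors)
    return not (blocking_match or blocking_error)

def derive_confidence(matches, errors, evaluated: int, successful: int) -> float:
    if any(m.action == "deny" for m in matches):
        return 1.0                       # full confidence for deny matches
    if any(e.action == "deny" for e in errors):
        return 0.0                       # zero confidence for deny errors
    if evaluated == 0:
        return 1.0                       # degenerate case, assumed here
    return successful / evaluated        # a success ratio, not a probability

def merge_confidence(local: float, server: float) -> float:
    return min(local, server)            # SDK-local/server merge behavior
```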

Current behavior

  • models/src/agent_control_models/evaluation.py defines:
    • EvaluationResponse.is_safe: bool
    • EvaluationResponse.confidence: float
  • models/src/agent_control_models/controls.py defines:
    • EvaluatorResult.confidence: float
  • The engine computes is_safe from matched deny/steer controls and deny-control errors in engine/src/agent_control_engine/core.py.
  • The engine computes top-level confidence in engine/src/agent_control_engine/core.py using successful/evaluated counts and special deny/error cases.
  • Composite condition confidence is derived from child evaluator confidence in engine/src/agent_control_engine/core.py, even though many evaluator confidence values are fixed or non-calibrated.
  • The Python SDK currently uses is_safe in several enforcement paths:
    • local/server merge: sdks/python/src/agent_control/evaluation.py
    • local-first short-circuiting: sdks/python/src/agent_control/evaluation.py
    • framework integration enforcement: sdks/python/src/agent_control/integrations/_core.py
    • decorator dict response handling: sdks/python/src/agent_control/control_decorators.py
  • The Python SDK largely passes confidence through: it exposes EvaluationResult.is_confident(...), logs confidence, and merges local/server confidence with min(...); a condensed illustration follows this list.
  • Observability stores control-event confidence and aggregates avg_confidence, so this change affects event schemas and stats too.
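
For contrast with the proposal below, a condensed illustration of the current SDK-side gating. Names and signatures are hypothetical; the real code lives in the files listed above.

```python
# Hypothetical condensation of the current enforcement gate -- the SDK reads
# a precomputed boolean and an ambiguous score instead of the artifacts.

def enforce_current(local, server) -> None:
    is_safe = local.is_safe and server.is_safe               # merge behavior (illustrative)
    confidence = min(local.confidence, server.confidence)    # min(...) merge
    if not is_safe:
        raise RuntimeError(f"blocked (confidence={confidence})")
    # Callers may additionally gate on the ambiguous score via
    # local.is_confident(...), whose semantics vary by evaluator.
```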

Expected behavior

  • Evaluation models should no longer include:
    • EvaluationResponse.is_safe
    • EvaluationResponse.confidence
    • EvaluatorResult.confidence
  • The response should contain explicit evaluation artifacts only, for example:
```json
{
  "reason": null,
  "matches": [],
  "errors": [],
  "non_matches": []
}
```
  • Each ControlMatch.result should describe whether that specific evaluator matched or failed and what metadata/message it returned, without a required confidence score.
  • Consumers that need an allow/block decision should derive it from matched control actions (deny, steer, observe), evaluation errors, and the applicable on_evaluation_error policy.
  • Consumers that need evaluator-specific scores should carry those scores in evaluator-specific metadata with evaluator-specific naming, not in a universal confidence field (see the model sketch after this list).
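
A minimal sketch of what the slimmed-down models could look like, assuming Pydantic-style models. The real base classes, ControlMatch internals, and field types beyond those named in this issue may differ.

```python
# Sketch only: field names beyond those in this issue are hypothetical.
from pydantic import BaseModel, Field

class EvaluatorResult(BaseModel):
    matched: bool
    error: str | None = None
    message: str | None = None
    # Evaluator-specific scores live here under explicit, evaluator-owned
    # names (e.g. "classifier_score"), not in a shared confidence field.
    metadata: dict[str, object] = Field(default_factory=dict)

class ControlMatch(BaseModel):
    control_id: str
    action: str                          # "deny" | "steer" | "observe"
    result: EvaluatorResult

class EvaluationResponse(BaseModel):
    reason: str | None = None
    matches: list[ControlMatch] = Field(default_factory=list)
    errors: list[ControlMatch] = Field(default_factory=list)
    non_matches: list[ControlMatch] = Field(default_factory=list)

# Example: a classifier evaluator reporting its raw score in metadata.
result = EvaluatorResult(matched=True, metadata={"classifier_score": 0.91})
```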

Reproduction (if bug)

  1. Inspect engine/src/agent_control_engine/core.py confidence calculation.
  2. Observe that top-level confidence is not model confidence; it is mostly an evaluation-success ratio with special cases.
  3. Inspect built-in deterministic evaluators and observe that most EvaluatorResult.confidence values are fixed values rather than calibrated scores.
  4. Inspect SDK enforcement paths and observe that is_safe is used as a convenience decision gate even though the underlying response already contains matches, errors, and actions.

Proposed solution (optional)

  • Remove is_safe and confidence from EvaluationResponse and EvaluationResult in the next major version.
  • Remove confidence from EvaluatorResult and all ControlMatch.result payloads.
  • Remove composite-confidence calculation from the engine.
  • Remove EvaluationResult.__bool__ or redefine it only after a separate explicit design decision; do not silently preserve bool(result) as an alias for removed is_safe.
  • Remove EvaluationResult.is_confident(...).
  • Update all evaluator implementations to stop returning confidence.
  • For evaluator-specific scores that remain useful, move them into evaluator metadata with explicit names, such as:
    • score
    • risk_score
    • classifier_score
    • provider-specific raw score fields
  • Update Python SDK enforcement to derive behavior directly (sketched after this list):
    • blocking deny match -> raise ControlViolationError
    • blocking steer match -> raise ControlSteerError
    • blocking evaluation error -> raise generic control-evaluation failure
    • non-blocking fail-open error -> keep visible in errors, do not block
    • observe matches -> log/emit observability only
  • Update local/server merge to merge matches, errors, and non_matches, then derive enforcement from the merged artifacts rather than computing local.is_safe and server.is_safe.
  • Update SDK decorator/integration paths to stop depending on response-level is_safe or confidence.
  • Update observability event models and stats to remove or replace confidence / avg_confidence.
  • Update TypeScript SDK generated models after the OpenAPI schema changes.
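
A sketch of the artifact-driven enforcement and merge described above, reusing the model shapes sketched under "Expected behavior". ControlViolationError and ControlSteerError are named in this issue; the generic evaluation-failure error, the helper names, and the fail_closed flag standing in for on_evaluation_error are illustrative.

```python
# Illustrative only; real exception signatures and policy plumbing may differ.

class ControlViolationError(Exception): ...      # named in this issue
class ControlSteerError(Exception): ...          # named in this issue
class ControlEvaluationError(Exception): ...     # hypothetical generic failure

def enforce(response, fail_closed: bool) -> None:
    """Raise on blocking outcomes; otherwise allow, keeping errors visible."""
    for match in response.matches:
        if match.action == "deny":
            raise ControlViolationError(match.control_id, response.reason)
        if match.action == "steer":
            raise ControlSteerError(match.control_id, response.reason)
        # action == "observe": emit observability events only, never block.

    if response.errors and fail_closed:
        # Stand-in for on_evaluation_error == fail-closed semantics.
        raise ControlEvaluationError([e.control_id for e in response.errors])
    # Fail-open: errors remain visible in response.errors but do not block.

def merge(local, server):
    """Merge local/server artifacts; enforcement runs on the merged result."""
    return EvaluationResponse(
        reason=local.reason or server.reason,
        matches=local.matches + server.matches,
        errors=local.errors + server.errors,
        non_matches=local.non_matches + server.non_matches,
    )
```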

Additional context

  • Related control-level evaluator failure policy issue: Define control-level semantics for evaluator failures and timeouts #199
  • Related control model refactor issue: Refactor control definition models to share runtime fields #200
  • Relevant files:
    • models/src/agent_control_models/evaluation.py
    • models/src/agent_control_models/controls.py
    • models/src/agent_control_models/observability.py
    • engine/src/agent_control_engine/core.py
    • sdks/python/src/agent_control/evaluation.py
    • sdks/python/src/agent_control/integrations/_core.py
    • sdks/python/src/agent_control/control_decorators.py
    • sdks/python/src/agent_control/evaluation_events.py
    • server/src/agent_control_server/observability/store/postgres.py
  • This is a major-version API cleanup and should not be folded into smaller evaluator PRs.
