agent-assure v0.2.0
This release adds protocol-bound live evaluation while preserving the
deterministic fixture-mode assurance surface introduced in v0.1.0. Fixture-mode
commands remain offline and reproducible; live commands require an explicit
adapter configuration and a frozen live-protocol-record.
New live artifacts include live-protocol-record, live-evaluation-report,
live-comparison-report, live-drift-report, live-trajectory-report, and
emergency-process-record, with exported JSON Schemas under schemas/v0.2.0.
The v0.1 schema set remains available under schemas/v0.1.0 for historical
replay.
Live evaluation now supports repeated provider observations, cluster-aware
pass/outcome/reason-code/exclusion rates, pooled and cluster-mean interval
metadata, paired or fixed-reference comparisons, cost and latency summaries,
provider-version capture, budget and rate-limit accounting, incomplete-run
status, and static JSONL tests that keep the live path reproducible without
network access. Optional OpenAI-compatible live execution still requires
explicit network opt-in.
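
The difference between a pooled rate and a cluster-mean rate can be sketched as follows. This is a minimal illustration, not agent-assure's implementation: the function name `pass_rates`, the `(cluster_id, passed)` input shape, and the reading of a cluster as repeated observations of one case are all assumptions made for the example.

```python
from collections import defaultdict


def pass_rates(observations):
    """Return (pooled_rate, cluster_mean_rate) for (cluster_id, passed) pairs.

    Hypothetical helper: the input shape is an assumption, not a
    documented agent-assure structure.
    """
    by_cluster = defaultdict(list)
    for cluster_id, passed in observations:
        by_cluster[cluster_id].append(bool(passed))

    total = sum(len(results) for results in by_cluster.values())
    if total == 0:
        raise ValueError("no observations")

    # Pooled: every observation counts equally, so large clusters dominate.
    pooled = sum(sum(results) for results in by_cluster.values()) / total

    # Cluster mean: every cluster counts equally, regardless of its size.
    cluster_mean = sum(
        sum(results) / len(results) for results in by_cluster.values()
    ) / len(by_cluster)
    return pooled, cluster_mean


observations = [
    ("case-a", True), ("case-a", True), ("case-a", True), ("case-a", False),
    ("case-b", False), ("case-b", False),
]
pooled, cluster_mean = pass_rates(observations)
print(pooled, cluster_mean)
```

The two rates diverge whenever cluster sizes are unequal, which is one reason to report both alongside their interval metadata.
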
Advanced protocol-bound analysis can report rare-event upper bounds, observed
cluster-correlation summaries, Bonferroni-controlled endpoint families, and
paired exact or Monte Carlo randomization tests when the frozen protocol and
observed design meet the declared prerequisites. Cross-window drift monitoring
adds comparability checks, ordered trend and adjacent-step summaries,
dependence diagnostics, and EWMA governance-health or control-reliability
review signals. Trajectory reports derive privacy-filtered observable state
paths, transition profiles, sequence-invariant findings, history-dependent
checks, and operational event-process summaries from structured live artifacts.
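
As an illustration of the paired exact randomization test mentioned above, the sketch below enumerates every sign flip of the paired differences and reports a two-sided p-value. It is the textbook sign-flip procedure, not necessarily the one agent-assure applies; the function name and the encoding of differences as integers (+1, 0, -1 for pass indicators) are assumptions.

```python
from itertools import product


def paired_exact_randomization_p(diffs):
    """Two-sided exact p-value for the summed paired difference.

    Under the null hypothesis, each paired difference is equally likely
    to carry either sign, so all 2**n sign assignments are enumerated.
    Hypothetical helper, practical only for small n.
    """
    if not diffs:
        raise ValueError("at least one paired difference is required")
    observed = abs(sum(diffs))
    extreme = 0
    for signs in product((1, -1), repeat=len(diffs)):
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        if statistic >= observed:
            extreme += 1
    return extreme / 2 ** len(diffs)


# Candidate pass indicator minus reference pass indicator, per paired case.
diffs = [1, 1, 1, 0, 1, -1]
p_value = paired_exact_randomization_p(diffs)
print(p_value)
```

Full enumeration grows as 2^n, which is why a Monte Carlo variant that samples sign assignments is the usual fallback for larger designs.
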
Runtime support now includes an external-script live adapter backed by a
no-shell subprocess harness, declared environment passing, redacted emergency
process records, W3C trace-context propagation, privacy-filtered span plans,
and optional OpenTelemetry SDK or OTLP HTTP export.
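
A no-shell subprocess call with declared environment passing might look like the sketch below. The function name `run_external_script`, its parameters, and the variable names are hypothetical, and a POSIX-like platform is assumed; the actual harness may differ.

```python
import os
import subprocess
import sys


def run_external_script(argv, declared_env, timeout_s=10.0):
    """Run argv without a shell, passing only declared environment variables.

    Hypothetical sketch: names and defaults are assumptions.
    """
    # Only variables that are both declared and set in the parent are passed.
    env = {name: os.environ[name] for name in declared_env if name in os.environ}
    return subprocess.run(
        list(argv),           # argument vector, never a shell command string
        shell=False,          # no shell interpolation of arguments
        env=env,              # replaces the inherited environment entirely
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,          # the caller inspects returncode itself
    )


os.environ["DEMO_DECLARED"] = "visible"
os.environ["DEMO_UNDECLARED"] = "hidden"
child = (
    "import os; "
    "print(os.environ.get('DEMO_DECLARED'), os.environ.get('DEMO_UNDECLARED'))"
)
result = run_external_script([sys.executable, "-c", child], ["DEMO_DECLARED"])
print(result.stdout.strip())
```

Passing an explicit `env` mapping means undeclared parent variables, including credentials, never reach the child process.
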
Final pre-tag security hardening makes failing policy results supplied by live
producers verdict-bearing, confines live prompt, JSONL, script, and cwd paths to
the live config directory, requires HTTPS plus explicit host allowlisting for
non-default OpenAI-compatible endpoints, bounds external-script stdout and
stderr capture, and expands recursive redaction of persisted artifacts to cover
common secret-token patterns while preserving schema-owned structural
identifiers. The bundled fixture HMAC key is now accepted only for repository
synthetic examples; other fixture runs must provide an explicit key.
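
Path confinement of the kind described above can be sketched with `pathlib`. The helper name `confine_to_config_dir` is hypothetical, and the release notes do not say how agent-assure implements the check; this shows one common approach that resolves symlinks and `..` segments before comparing.

```python
from pathlib import Path


def confine_to_config_dir(config_dir, candidate):
    """Resolve candidate against config_dir and reject anything outside it.

    Hypothetical helper illustrating the confinement idea only.
    """
    base = Path(config_dir).resolve()
    # Joining with an absolute candidate yields that absolute path, which
    # then fails the containment check unless it lies inside base.
    target = (base / candidate).resolve()
    try:
        target.relative_to(base)
    except ValueError:
        raise ValueError(
            f"path escapes the live config directory: {candidate!r}"
        ) from None
    return target
```

For example, `confine_to_config_dir(cfg, "prompts/cases.jsonl")` returns a resolved path under `cfg`, while `"../secrets.txt"` or an absolute path elsewhere raises `ValueError`.
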
Pre-release hardening unified decimal rendering across protocol and report
calculations, preserved unclamped cost and latency comparison deltas, enforced
cumulative total and generated token budgets after live responses, exposed
rare-event Poisson upper bounds as one-sided artifacts, and constrained paired
randomization comparison protocols to the expectation pass-rate endpoint they
actually test.
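
For reference, a one-sided exact Poisson upper bound can be computed as in the sketch below. It uses the standard construction (the largest mean under which seeing at most the observed count still has probability alpha), solved by bisection. Whether agent-assure uses this construction is not stated here, and the function names and the `exposure` parameter are assumptions.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def poisson_upper_bound(events, exposure, alpha=0.05):
    """One-sided (1 - alpha) upper bound on the event rate per unit exposure.

    Hypothetical helper: finds the mean lam with P(X <= events; lam) = alpha.
    """
    if events < 0 or exposure <= 0 or not 0 < alpha < 1:
        raise ValueError("invalid events, exposure, or alpha")
    lo, hi = 0.0, 1.0
    # The CDF decreases as lam grows, so widen hi until it brackets the root.
    while poisson_cdf(events, hi) > alpha:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(events, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi / exposure


# Zero failures in 300 observations, 95% one-sided bound.
bound = poisson_upper_bound(events=0, exposure=300)
print(bound)
```

With zero observed events the bound reduces to -ln(alpha) / exposure, which is roughly the familiar "rule of three" (3 / n) at alpha = 0.05.
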
Release assets include the evidence packet, release artifact manifest, digest
replay file, SBOM, source distribution, wheel, and keyless cosign bundles when
built by the release workflow. Replay cross-checks manifest-listed artifact
digests, including SBOM and distribution bytes, while stable projections keep
environment-bearing review artifacts reproducible.
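
The digest cross-check can be pictured with the sketch below. The manifest shape (a mapping from relative path to expected SHA-256 hex digest), the choice of SHA-256, and both function names are assumptions for illustration; the real manifest and replay file formats are defined by the project.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replay_digests(manifest, root):
    """Return (path, expected, actual) for each artifact that does not match.

    Hypothetical manifest shape: {relative_path: expected_sha256_hex}.
    A missing file is reported with actual set to None.
    """
    mismatches = []
    for rel_path, expected in sorted(manifest.items()):
        path = Path(root) / rel_path
        actual = sha256_file(path) if path.is_file() else None
        if actual != expected:
            mismatches.append((rel_path, expected, actual))
    return mismatches
```

An empty result means every listed artifact is present and byte-identical to what the manifest recorded; any entry in the list identifies the artifact that drifted.
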
Synthetic calibration and regression coverage is summarized in
docs/live_calibration.md. The v0.2 live reports are time-bound operational
evidence for declared protocols, data boundaries, provider/model
configurations, and execution windows. This release does not establish safety
assurance, validate clinical use, prove regulatory compliance, provide general
provider-quality evidence, or claim OpenTelemetry adoption.