v0.22.0 Release Summary

What Changed

This release ships the offline agent evaluation framework. EvalScenario, EvalScenarios, EvalRunner, and EvalOrchestrator let you define named test cases, run them against a live queue and tracer, and get structured pass/fail metrics with per-scenario breakdowns. The release also integrates AgentAssertionTask into the full offline eval pipeline and adds in-process span capture so scenarios can assert on traces without a running server.

Breaking Changes

Database migration required. A new column is added to scouter.genai_eval_record:

ALTER TABLE scouter.genai_eval_record
    ADD COLUMN IF NOT EXISTS tags TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[];

The migration runs automatically on server startup via sqlx.

Changes

Offline Evaluation: EvalScenario / EvalScenarios / EvalRunner

Three new types form the core of the offline scenario framework:

EvalScenario — a single named test case. Holds a query string, optional context dict, tasks (assertion, LLM judge, trace, or agent), and a pass_threshold (float 0–1, default 1.0). Each scenario gets a stable UUID7 ID on construction.

from scouter.evaluate import EvalScenario, AgentAssertionTask, AgentAssertion

scenario = EvalScenario(
    name="tool_use_check",
    query="Search for recent AI papers",
    tasks=[
        AgentAssertionTask(
            id="search_called",
            assertion=AgentAssertion.tool_called("web_search"),
        )
    ],
    pass_threshold=1.0,
)
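
These notes do not spell out how pass_threshold is applied. The following is a minimal sketch, assuming a scenario passes when the fraction of passing tasks meets or exceeds the threshold. The helper name scenario_passes and the rule itself are illustrative assumptions, not confirmed Scouter behavior.

```python
def scenario_passes(task_results: list[bool], pass_threshold: float = 1.0) -> bool:
    """Assumed rule: pass when the fraction of passing tasks >= pass_threshold."""
    if not 0.0 <= pass_threshold <= 1.0:
        raise ValueError("pass_threshold must be between 0 and 1")
    if not task_results:
        # Assumption: a scenario with no task results does not pass.
        return False
    pass_rate = sum(task_results) / len(task_results)
    return pass_rate >= pass_threshold


# Three of four tasks pass: fails at the default 1.0, passes at 0.75.
print(scenario_passes([True, True, True, False]))
print(scenario_passes([True, True, True, False], 0.75))
```

Under this reading, the default of 1.0 means every task must pass, and lowering the threshold tolerates a proportion of failing tasks.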

EvalScenarios — a collection of EvalScenario objects with internal state (datasets, contexts) populated by EvalRunner.collect_scenario_data() and results populated by EvalRunner.evaluate(). Not intended to be constructed manually; produced by EvalRunner.

EvalRunner — stateful engine that owns scenario definitions and GenAIEvalProfile references (as Arcs, same pattern as ScouterQueue). Two-phase API:

1. collect_scenario_data(queue, tracer): drains records and spans captured during scenario execution and associates them with scenarios by scenario_id tag.
2. evaluate(): runs all tasks and returns ScenarioEvalResults.

Optionally call compare(baseline: ScenarioEvalResults) to produce a ScenarioComparisonResults with per-scenario deltas.

New result types:

Type	Purpose
`EvalMetrics`	Aggregate pass rates: `overall_pass_rate`, `scenario_pass_rate`, `dataset_pass_rates` (per-alias)
`ScenarioResult`	Pass/fail + task results for one scenario
`ScenarioEvalResults`	All `ScenarioResult`s + aggregate `EvalMetrics`
`ScenarioDelta`	Δ pass rate between two runs for one scenario
`ScenarioComparisonResults`	Full comparison across all scenarios with regression/improvement classification
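
To make the comparison types concrete, here is a rough sketch of how a per-scenario delta and a regression/improvement label could be derived from two runs. The Delta class, its field names, and the classification rule (any drop is a regression, any rise an improvement) are hypothetical stand-ins, not the actual ScenarioDelta or ScenarioComparisonResults definitions.

```python
from dataclasses import dataclass


@dataclass
class Delta:
    """Hypothetical stand-in for ScenarioDelta; field names are assumptions."""
    scenario: str
    baseline_pass_rate: float
    current_pass_rate: float

    @property
    def delta(self) -> float:
        # Positive means the current run passes more often than the baseline.
        return self.current_pass_rate - self.baseline_pass_rate

    @property
    def status(self) -> str:
        # Assumed classification; the real thresholds are not documented here.
        if self.delta < 0:
            return "regressed"
        if self.delta > 0:
            return "improved"
        return "unchanged"


def compare(baseline: dict[str, float], current: dict[str, float]) -> list[Delta]:
    """Pair up scenarios present in both runs, sorted by scenario name."""
    shared = sorted(baseline.keys() & current.keys())
    return [Delta(name, baseline[name], current[name]) for name in shared]


baseline = {"tool_use_check": 1.0, "summarize": 0.5}
current = {"tool_use_check": 0.5, "summarize": 1.0}
for d in compare(baseline, current):
    print(d.scenario, d.delta, d.status)
```

The sketch only compares scenarios that appear in both runs; how the real compare() treats scenarios that exist in one run only is not covered by these notes.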

Offline Evaluation: EvalOrchestrator (Python)

EvalOrchestrator is a high-level Python wrapper that manages the full capture lifecycle so callers don't have to sequence enable_capture / disable_capture / collect_scenario_data manually.

from scouter.evaluate import EvalOrchestrator, EvalScenario, EvalScenarios, ScenarioEvalResults

orchestrator = EvalOrchestrator(
    scenarios=EvalScenarios(scenarios=[...]),
    queue=queue,
    tracer=tracer,
    profiles={"agent": profile},
)

results: ScenarioEvalResults = orchestrator.run(agent_fn=my_agent)

agent_fn is Callable[[str], str] — takes a query, returns a response string. The orchestrator:

1. Enables queue capture and local span capture.
2. Iterates over the scenarios, sets scouter.eval.scenario_id in OTel baggage, and calls agent_fn(scenario.query) for each one.
3. Disables capture, then calls EvalRunner.collect_scenario_data() followed by evaluate().
4. Returns ScenarioEvalResults.

Subclass EvalOrchestrator and override execute_scenario() to handle non-string responses or add lifecycle hooks.
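
The lifecycle described above can be sketched without Scouter installed. Everything in this block is a stand-in: FakeCapture and run_scenarios are hypothetical names, and the scenario ID is passed explicitly rather than through OTel baggage as the real orchestrator does. It is meant only to show the enable, run, disable, collect ordering.

```python
from typing import Callable


class FakeCapture:
    """Hypothetical stand-in for the queue and tracer capture switches."""

    def __init__(self) -> None:
        self.enabled = False
        self.records: list[tuple[str, str]] = []

    def record(self, scenario_id: str, response: str) -> None:
        # Only buffer while capture is enabled, mirroring the capture toggle.
        if self.enabled:
            self.records.append((scenario_id, response))


def run_scenarios(
    scenarios: dict[str, str],  # scenario_id -> query
    agent_fn: Callable[[str], str],
    capture: FakeCapture,
) -> dict[str, list[str]]:
    capture.enabled = True  # step 1: enable capture
    try:
        for scenario_id, query in scenarios.items():
            response = agent_fn(query)  # step 2: call the agent per scenario
            capture.record(scenario_id, response)
    finally:
        capture.enabled = False  # step 3: disable capture even if the agent raises

    # Step 3 (continued): associate captured data with scenarios by ID.
    grouped: dict[str, list[str]] = {}
    for scenario_id, response in capture.records:
        grouped.setdefault(scenario_id, []).append(response)
    return grouped  # step 4: per-scenario data, ready for evaluation


print(run_scenarios({"s1": "hello", "s2": "bye"}, str.upper, FakeCapture()))
```

Wrapping the loop in try/finally is a design choice of this sketch, so a failing agent call does not leave capture switched on; whether EvalOrchestrator does the same is not stated in these notes.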

AgentAssertionTask: full pipeline integration

AgentAssertionTask was previously standalone (via execute_agent_assertion_tasks()). It is now fully wired into the EvalDataset pipeline:

EvaluationTask::AgentAssertion variant added.
EvaluationTaskType::AgentAssertion variant added (serializes as "AgentAssertion").
TaskConfig::AgentAssertion deserializes from stored task JSON.
AssertionTasks.agent: Vec<AgentAssertionTask> — tasks are routed to this bucket when building datasets from TasksFile.
EvaluationTask::AgentAssertion participates in depends_on resolution.

New supporting types:

TokenUsage — structured token count from LLM responses. Fields: input_tokens, output_tokens, total_tokens (all Optional[int]). Exposed to Python as a #[pyclass].

AgentContextBuilder (Rust-internal) — normalizes vendor-specific LLM response formats into a standard structure before assertion evaluation. Auto-detects format:

Pre-normalized (Scouter standard shape)
OpenAI — choices[].message.tool_calls, usage, model
Anthropic — content[] with ToolUseBlock, usage, model
Google/Gemini — candidates[].content.parts[] with function_call
Fallback tree walk

Path limits enforced: max 512 chars per path, max 32 segments.
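
AgentContextBuilder is Rust-internal, so the following Python sketch only illustrates the idea of shape-based detection using the keys listed above. The function names, the check order, and the dotted-path syntax are assumptions; detection of the pre-normalized Scouter shape is left out because its layout is not described here.

```python
from typing import Any

MAX_PATH_CHARS = 512     # limits quoted in these release notes
MAX_PATH_SEGMENTS = 32


def _first_dict(value: Any) -> dict:
    """Return the first element if value is a non-empty list of dicts, else {}."""
    if isinstance(value, list) and value and isinstance(value[0], dict):
        return value[0]
    return {}


def detect_format(response: dict[str, Any]) -> str:
    """Guess the vendor from the response shape (illustrative heuristics only)."""
    if "message" in _first_dict(response.get("choices")):
        return "openai"      # choices[].message.tool_calls
    if "content" in _first_dict(response.get("candidates")):
        return "gemini"      # candidates[].content.parts[]
    if "type" in _first_dict(response.get("content")):
        return "anthropic"   # content[] blocks such as tool_use
    return "fallback"        # would trigger the generic tree walk


def path_within_limits(path: str) -> bool:
    """Check a dotted path (assumed syntax) against the length and segment caps."""
    return len(path) <= MAX_PATH_CHARS and len(path.split(".")) <= MAX_PATH_SEGMENTS


print(detect_format({"choices": [{"message": {"tool_calls": []}}]}))
print(path_within_limits("choices.0.message.tool_calls"))
```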

In-process span capture

A new local capture mode lets tests assert on trace spans without a running Scouter server or Delta Lake backend. Spans are buffered in memory instead of forwarded to the transport.

Buffer capacity: 20,000 spans (CAPTURE_BUFFER_MAX). Writes beyond this limit are dropped with a warning.

Rust API (Tracer):

tracer.enable_local_capture()?;
// ... instrumented code ...
let spans = tracer.drain_local_spans()?;
let by_trace = tracer.get_local_spans_by_trace_ids(vec!["abc123...".into()])?;
tracer.disable_local_capture()?; // discards buffer

Python API (ScouterInstrumentor):

instrumentor.enable_local_capture()
# ... instrumented code ...
spans: list[TraceSpanRecord] = instrumentor.drain_local_spans()
spans_filtered = instrumentor.get_local_spans_by_trace_ids(["abc123..."])
instrumentor.disable_local_capture()

Module-level aliases are also available: enable_local_span_capture, disable_local_span_capture, drain_local_span_capture.

disable_local_capture logs a warning and discards buffered spans if any remain.
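
The buffering behavior described in this section (a fixed capacity, drops with a warning once full, drain empties the buffer) can be modeled roughly as follows. BoundedSpanBuffer is a hypothetical class, not Scouter's implementation, and the assumption that filtering by trace ID leaves the buffer intact is a guess.

```python
import logging

logger = logging.getLogger("capture_sketch")


class BoundedSpanBuffer:
    """Hypothetical in-memory span buffer; not Scouter's implementation."""

    def __init__(self, capacity: int = 20_000) -> None:
        self.capacity = capacity
        self._spans: list[dict] = []
        self.dropped = 0

    def push(self, span: dict) -> bool:
        """Store a span, or drop it with a warning once the buffer is full."""
        if len(self._spans) >= self.capacity:
            self.dropped += 1
            logger.warning("capture buffer full (%d); dropping span", self.capacity)
            return False
        self._spans.append(span)
        return True

    def drain(self) -> list[dict]:
        """Return all buffered spans and leave the buffer empty."""
        spans, self._spans = self._spans, []
        return spans

    def by_trace_ids(self, trace_ids: list[str]) -> list[dict]:
        """Return spans for the given trace IDs without draining (assumed)."""
        wanted = set(trace_ids)
        return [s for s in self._spans if s.get("trace_id") in wanted]


buf = BoundedSpanBuffer(capacity=2)
for trace_id in ("a", "b", "c"):
    buf.push({"trace_id": trace_id})
print(len(buf.drain()), buf.dropped)
```

A practical consequence is that long-running tests should drain periodically, since spans written after the cap is reached are lost rather than queued.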

EvalRecord: tags and trace_id stamping

Tags — EvalRecord now carries a tags: list[str] field in key=value format. Tags are persisted to PostgreSQL and returned in all query paths (get, paginated, archive).

record = EvalRecord(context={"response": "..."})
record.add_tag("environment", "staging")
record.add_tag("model", "gpt-4o")
# record.tags == ["environment=staging", "model=gpt-4o"]

trace_id at construction — EvalRecord(trace_id="<hex>") now accepted. Previously trace_id could only be set after construction.

Automatic stamping in QueueBus — when an EvalRecord is inserted via ScouterQueue and has no trace_id, the bus checks the active OTel span context. If a valid span is active, its trace ID is stamped onto the record (both the Rust struct and the Python-side object). The Python object is updated via a mutable borrow; a warning is logged if the cast fails.

Scenario tag auto-injection — if OTel baggage contains scouter.eval.scenario_id, the bus appends "scouter.eval.scenario_id=<value>" to record.tags automatically. Tag values are validated: alphanumeric, hyphens, underscores, max 128 chars. Invalid values are dropped with a warning.
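
The validation rule for injected tag values can be expressed as a single regular expression. This is a sketch of the stated rule only; the helper name scenario_tag is hypothetical, and treating an empty value as invalid is an assumption.

```python
import re
from typing import Optional

# Alphanumerics, hyphens, and underscores, 1 to 128 characters.
_TAG_VALUE = re.compile(r"[A-Za-z0-9_-]{1,128}")
SCENARIO_KEY = "scouter.eval.scenario_id"


def scenario_tag(value: str) -> Optional[str]:
    """Return the key=value tag, or None when the value fails validation."""
    if _TAG_VALUE.fullmatch(value) is None:
        return None  # the bus drops such values with a warning
    return f"{SCENARIO_KEY}={value}"


print(scenario_tag("0195-abc_DEF"))
print(scenario_tag("not valid!"))
```

Hyphenated UUID strings satisfy this rule, which matches the UUID7 scenario IDs described earlier.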

ScouterQueue: offline record capture

New methods for offline use (mirroring the local span capture API):

queue.enable_capture()   # buffer EvalRecords in memory in addition to sending
queue.disable_capture()  # stop buffering and discard buffered records
queue.drain_records("alias")    # drain records from one queue by alias
queue.drain_all_records()       # drain from all queues, keyed by alias

Capture is off by default. Enabling it has negligible overhead; records are still forwarded to the normal transport.

GenAIEvalProfile references are now stored as Arc<GenAIEvalProfile> inside ScouterQueue.profiles, so EvalScenarios can share ownership without cloning.

shutdown() now releases the GIL during the 250ms wait periods, preventing deadlocks in multi-threaded Python programs.

Server: debug trace endpoint

A new diagnostic route returns the 10 most recent traces from the past 24 hours:

GET /scouter/trace/debug/recent

Returns the same TracePaginationResponse as the paginated trace query. Intended for local debugging and health verification. It uses the same authentication as the other trace routes.
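
As a quick way to exercise the route from a script, the request can be built with the standard library. The base URL and port below are placeholders, the Accept header assumes a JSON response, and any authentication headers your deployment requires are omitted.

```python
from urllib.parse import urljoin
from urllib.request import Request

DEBUG_ROUTE = "scouter/trace/debug/recent"


def recent_traces_request(base_url: str) -> Request:
    """Build a GET request for the debug route under the given server URL."""
    url = urljoin(base_url.rstrip("/") + "/", DEBUG_ROUTE)
    return Request(url, method="GET", headers={"Accept": "application/json"})


# Placeholder host and port; replace with your server's address.
req = recent_traces_request("http://localhost:8000")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen sends it to a running server.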

Upgrading from v0.21.1

Apply the database migration. It runs automatically on server startup. If you run migrations manually, execute:

ALTER TABLE scouter.genai_eval_record
    ADD COLUMN IF NOT EXISTS tags TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[];

No other action required. All API changes are additive. Existing EvalRecord, ScouterQueue, and tracing usage continues to work without modification.
