v0.22.0
v0.22.0 Release Summary
What Changed
This release ships the offline agent evaluation framework: EvalScenario, EvalScenarios, EvalRunner, and EvalOrchestrator let you define named test cases, run them against a live queue + tracer, and get structured pass/fail metrics with per-scenario breakdowns. The release also integrates AgentAssertionTask into the full offline eval pipeline and adds in-process span capture so scenarios can assert on traces without a running server.
Breaking Changes
Database migration required. A new column is added to scouter.genai_eval_record:
ALTER TABLE scouter.genai_eval_record
ADD COLUMN IF NOT EXISTS tags TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[];
The migration runs automatically on server startup via sqlx.
Changes
Offline Evaluation: EvalScenario / EvalScenarios / EvalRunner
Three new types form the core of the offline scenario framework:
EvalScenario — a single named test case. Holds a query string, optional context dict, tasks (assertion, LLM judge, trace, or agent), and a pass_threshold (float 0–1, default 1.0). Each scenario gets a stable UUID7 ID on construction.
from scouter.evaluate import EvalScenario, AgentAssertionTask, AgentAssertion
scenario = EvalScenario(
    name="tool_use_check",
    query="Search for recent AI papers",
    tasks=[
        AgentAssertionTask(
            id="search_called",
            assertion=AgentAssertion.tool_called("web_search"),
        )
    ],
    pass_threshold=1.0,
)
EvalScenarios: a collection of EvalScenario objects with internal state (datasets, contexts) populated by EvalRunner.collect_scenario_data() and results populated by EvalRunner.evaluate(). Not intended to be constructed manually; produced by EvalRunner.
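To make the pass_threshold setting on EvalScenario more concrete, here is a minimal sketch of one way a 0 to 1 threshold could be applied. It assumes a scenario passes when the fraction of its passing tasks meets the threshold. That rule and the helper name `scenario_passes` are illustrative assumptions, not confirmed library behavior.

```python
def scenario_passes(task_results: list, pass_threshold: float = 1.0) -> bool:
    """Return True when the share of passing tasks meets the threshold.

    Assumption: the threshold is compared against passed / total tasks.
    """
    if not 0.0 <= pass_threshold <= 1.0:
        raise ValueError("pass_threshold must be between 0 and 1")
    if not task_results:
        # Hypothetical choice: a scenario with no tasks has nothing to fail.
        return True
    pass_rate = sum(bool(r) for r in task_results) / len(task_results)
    return pass_rate >= pass_threshold


# Two of three tasks pass: enough for a 0.6 threshold, not for the default 1.0.
lenient = scenario_passes([True, True, False], pass_threshold=0.6)
strict = scenario_passes([True, True, False])
print(lenient, strict)
```

With the default of 1.0, a single failing task fails the whole scenario under this reading.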
EvalRunner — stateful engine that owns scenario definitions and GenAIEvalProfile references (as Arcs, same pattern as ScouterQueue). Two-phase API:
- collect_scenario_data(queue, tracer): drains records and spans captured during scenario execution and associates them with scenarios by scenario_id tag.
- evaluate(): runs all tasks, returns ScenarioEvalResults.
Optionally call compare(baseline: ScenarioEvalResults) to produce a ScenarioComparisonResults with per-scenario deltas.
New result types:
| Type | Purpose |
|---|---|
| EvalMetrics | Aggregate pass rates: overall_pass_rate, scenario_pass_rate, dataset_pass_rates (per-alias) |
| ScenarioResult | Pass/fail + task results for one scenario |
| ScenarioEvalResults | All ScenarioResults + aggregate EvalMetrics |
| ScenarioDelta | Δ pass rate between two runs for one scenario |
| ScenarioComparisonResults | Full comparison across all scenarios with regression/improvement classification |
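The per-scenario delta and regression/improvement classification can be pictured with a small standalone sketch. The `Delta` dataclass, the `compare_runs` helper, and the tolerance value are hypothetical stand-ins written for this example. They do not reflect the actual fields of ScenarioDelta or ScenarioComparisonResults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    """Hypothetical stand-in for a per-scenario comparison entry."""

    scenario: str
    baseline_pass_rate: float
    current_pass_rate: float

    @property
    def change(self) -> float:
        # Positive means the current run passes more often than the baseline.
        return self.current_pass_rate - self.baseline_pass_rate

    def classify(self, tolerance: float = 1e-9) -> str:
        if self.change < -tolerance:
            return "regression"
        if self.change > tolerance:
            return "improvement"
        return "unchanged"


def compare_runs(baseline: dict, current: dict) -> list:
    """Pair up scenarios present in both runs and compute their deltas."""
    shared = sorted(baseline.keys() & current.keys())
    return [Delta(name, baseline[name], current[name]) for name in shared]


baseline = {"tool_use_check": 1.0, "summary_quality": 0.5}
current = {"tool_use_check": 0.5, "summary_quality": 0.75}
deltas = compare_runs(baseline, current)
for delta in deltas:
    print(delta.scenario, delta.change, delta.classify())
```

Scenarios that appear in only one run are skipped here. A real comparison would need its own policy for added or removed scenarios.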
Offline Evaluation: EvalOrchestrator (Python)
EvalOrchestrator is a high-level Python wrapper that manages the full capture lifecycle so callers don't have to sequence enable_capture / disable_capture / collect_scenario_data manually.
from scouter.evaluate import EvalOrchestrator, EvalScenario, EvalScenarios
orchestrator = EvalOrchestrator(
    scenarios=EvalScenarios(scenarios=[...]),
    queue=queue,
    tracer=tracer,
    profiles={"agent": profile},
)
results: ScenarioEvalResults = orchestrator.run(agent_fn=my_agent)
agent_fn is Callable[[str], str]: it takes a query and returns a response string. The orchestrator:
- Enables queue capture + local span capture.
- Iterates scenarios, sets scouter.eval.scenario_id in OTel baggage, calls agent_fn(scenario.query).
- Disables capture, calls EvalRunner.collect_scenario_data(), then evaluate().
- Returns ScenarioEvalResults.
Subclass EvalOrchestrator and override execute_scenario() to handle non-string responses or add lifecycle hooks.
AgentAssertionTask: full pipeline integration
AgentAssertionTask was previously standalone (via execute_agent_assertion_tasks()). It is now fully wired into the EvalDataset pipeline:
- EvaluationTask::AgentAssertion variant added.
- EvaluationTaskType::AgentAssertion variant added (serializes as "AgentAssertion").
- TaskConfig::AgentAssertion deserializes from stored task JSON.
- AssertionTasks.agent: Vec<AgentAssertionTask>. Tasks are routed to this bucket when building datasets from TasksFile.
- EvaluationTask::AgentAssertion participates in depends_on resolution.
New supporting types:
TokenUsage — structured token count from LLM responses. Fields: input_tokens, output_tokens, total_tokens (all Optional[int]). Exposed to Python as a #[pyclass].
AgentContextBuilder (Rust-internal) — normalizes vendor-specific LLM response formats into a standard structure before assertion evaluation. Auto-detects format:
- Pre-normalized (Scouter standard shape)
- OpenAI: choices[].message.tool_calls, usage, model
- Anthropic: content[] with ToolUseBlock, usage, model
- Google/Gemini: candidates[].content.parts[] with function_call
- Fallback tree walk
Path limits enforced: max 512 chars per path, max 32 segments.
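A rough Python sketch of what key-based format detection and the path limits might look like follows. AgentContextBuilder itself is Rust-internal, so the heuristics here are guesses based only on the field names listed above. The function names, the dotted path syntax, and the omission of the pre-normalized shape are assumptions made for illustration.

```python
MAX_PATH_CHARS = 512     # documented per-path character limit
MAX_PATH_SEGMENTS = 32   # documented segment limit


def detect_format(response: dict) -> str:
    """Guess the vendor from top-level keys (hypothetical heuristic)."""
    if isinstance(response.get("choices"), list):
        return "openai"
    if isinstance(response.get("candidates"), list):
        return "gemini"
    if isinstance(response.get("content"), list):
        return "anthropic"
    # Nothing recognized: a real implementation would walk the tree here.
    return "fallback"


def path_within_limits(path: str) -> bool:
    """Check a dotted path against the length and depth limits.

    Assumption: segments are separated by dots.
    """
    if len(path) > MAX_PATH_CHARS:
        return False
    return len(path.split(".")) <= MAX_PATH_SEGMENTS


openai_like = {"choices": [{"message": {"tool_calls": []}}], "model": "m"}
anthropic_like = {"content": [{"type": "tool_use", "name": "web_search"}]}
gemini_like = {"candidates": [{"content": {"parts": []}}]}

print(detect_format(openai_like), detect_format(anthropic_like), detect_format(gemini_like))
print(path_within_limits("choices.0.message.tool_calls"))
```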
In-process span capture
A new local capture mode lets tests assert on trace spans without a running Scouter server or Delta Lake backend. Spans are buffered in memory instead of forwarded to the transport.
Buffer capacity: 20,000 spans (CAPTURE_BUFFER_MAX). Writes beyond this limit are dropped with a warning.
Rust API (Tracer):
tracer.enable_local_capture()?;
// ... instrumented code ...
let spans = tracer.drain_local_spans()?;
let by_trace = tracer.get_local_spans_by_trace_ids(vec!["abc123...".into()])?;
tracer.disable_local_capture()?; // discards buffer
Python API (ScouterInstrumentor):
instrumentor.enable_local_capture()
# ... instrumented code ...
spans: list[TraceSpanRecord] = instrumentor.drain_local_spans()
spans_filtered = instrumentor.get_local_spans_by_trace_ids(["abc123..."])
instrumentor.disable_local_capture()
Module-level aliases also available: enable_local_span_capture, disable_local_span_capture, drain_local_span_capture.
disable_local_capture logs a warning and discards buffered spans if any remain.
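The buffering semantics described in this section (a bounded buffer, drops past capacity, drain, and discard-on-disable) can be modeled in a few lines. `CaptureBuffer` below is a hypothetical pure-Python model, not the tracer's implementation. Only the 20,000 default mirrors the documented CAPTURE_BUFFER_MAX.

```python
import logging

logger = logging.getLogger("capture_sketch")


class CaptureBuffer:
    """Hypothetical model of a bounded in-memory span buffer."""

    def __init__(self, capacity: int = 20_000) -> None:
        self.capacity = capacity
        self.enabled = False
        self._spans = []

    def enable(self) -> None:
        self.enabled = True

    def push(self, span) -> bool:
        """Buffer a span; return False when it is ignored or dropped."""
        if not self.enabled:
            return False
        if len(self._spans) >= self.capacity:
            logger.warning("capture buffer full; dropping span")
            return False
        self._spans.append(span)
        return True

    def drain(self) -> list:
        """Return all buffered spans and leave the buffer empty."""
        spans, self._spans = self._spans, []
        return spans

    def disable(self) -> None:
        if self._spans:
            logger.warning("discarding %d buffered spans", len(self._spans))
        self._spans = []
        self.enabled = False


buffer = CaptureBuffer(capacity=2)
buffer.enable()
accepted = [buffer.push(name) for name in ("a", "b", "c")]  # third push is dropped
drained = buffer.drain()
print(accepted, drained)
```

Draining before disabling is what keeps captured spans from being thrown away.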
EvalRecord: tags and trace_id stamping
Tags — EvalRecord now carries a tags: list[str] field in key=value format. Tags are persisted to PostgreSQL and returned in all query paths (get, paginated, archive).
record = EvalRecord(context={"response": "..."})
record.add_tag("environment", "staging")
record.add_tag("model", "gpt-4o")
# record.tags == ["environment=staging", "model=gpt-4o"]
trace_id at construction: EvalRecord(trace_id="<hex>") is now accepted. Previously trace_id could only be set after construction.
Automatic stamping in QueueBus — when an EvalRecord is inserted via ScouterQueue and has no trace_id, the bus checks the active OTel span context. If a valid span is active, its trace ID is stamped onto the record (both the Rust struct and the Python-side object). The Python object is updated via a mutable borrow; a warning is logged if the cast fails.
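The stamping rule can be sketched without any tracing dependency. OpenTelemetry trace IDs are 128-bit values, conventionally rendered as 32 lowercase hex characters, and the all-zero ID is invalid. The `stamp_trace_id` helper and the dict-shaped record are hypothetical simplifications of what the bus does.

```python
def stamp_trace_id(record: dict, active_trace_id: int) -> dict:
    """Fill in record["trace_id"] from the active span when it is missing.

    An existing trace_id is never overwritten, and an all-zero
    (invalid) active trace ID is ignored.
    """
    if record.get("trace_id") is None and active_trace_id != 0:
        record["trace_id"] = format(active_trace_id, "032x")
    return record


stamped = stamp_trace_id({"trace_id": None}, 0x1234)
kept = stamp_trace_id({"trace_id": "abc"}, 0x1234)
skipped = stamp_trace_id({"trace_id": None}, 0)
print(stamped["trace_id"], kept["trace_id"], skipped["trace_id"])
```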
Scenario tag auto-injection — if OTel baggage contains scouter.eval.scenario_id, the bus appends "scouter.eval.scenario_id=<value>" to record.tags automatically. Tag values are validated: alphanumeric, hyphens, underscores, max 128 chars. Invalid values are dropped with a warning.
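The validation rule above translates naturally into a regular expression. The sketch below builds the scenario tag only for valid values and shows how key=value tags could be parsed back into a dict. The helper names are hypothetical, and treating an empty value as invalid is an assumption.

```python
import re
from typing import Optional

SCENARIO_KEY = "scouter.eval.scenario_id"
# Alphanumerics, hyphens, underscores; 1 to 128 characters (empty assumed invalid).
_TAG_VALUE = re.compile(r"[A-Za-z0-9_-]{1,128}")


def scenario_tag(value: str) -> Optional[str]:
    """Return the key=value tag, or None when the value fails validation."""
    if _TAG_VALUE.fullmatch(value) is None:
        return None
    return f"{SCENARIO_KEY}={value}"


def parse_tags(tags: list) -> dict:
    """Split key=value tags on the first '=' and ignore malformed entries."""
    return dict(tag.split("=", 1) for tag in tags if "=" in tag)


tags = ["environment=staging", "model=gpt-4o"]
tag = scenario_tag("0190a3c2-7b1e-7f00-8000-abcdef012345")
if tag is not None:
    tags.append(tag)
parsed = parse_tags(tags)
print(parsed[SCENARIO_KEY])
```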
ScouterQueue: offline record capture
New methods for offline use (mirroring the local span capture API):
queue.enable_capture() # buffer EvalRecords in memory in addition to sending
queue.disable_capture() # stop buffering and discard buffered records
queue.drain_records("alias") # drain records from one queue by alias
queue.drain_all_records() # drain from all queues, keyed by alias
Capture is off by default. Enabling it has negligible overhead; records are still forwarded to the normal transport.
GenAIEvalProfile references are now stored as Arc<GenAIEvalProfile> inside ScouterQueue.profiles, so EvalScenarios can share ownership without cloning.
shutdown() now releases the GIL during the 250ms wait periods, preventing deadlocks in multi-threaded Python programs.
Server: debug trace endpoint
A new diagnostic route returns the 10 most recent traces from the past 24 hours:
GET /scouter/trace/debug/recent
Returns the same TracePaginationResponse as the paginated trace query. Intended for local debugging and health verification; it uses the same authentication as the other trace routes.
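For a quick local check, the route can be called with the standard library alone. This is a sketch under assumptions: the base URL in the comment is hypothetical, the response is assumed to be JSON, and any authentication headers your deployment requires are omitted.

```python
import json
import urllib.request

DEBUG_ROUTE = "/scouter/trace/debug/recent"


def recent_traces_url(base_url: str) -> str:
    """Join the server base URL with the debug route."""
    return base_url.rstrip("/") + DEBUG_ROUTE


def fetch_recent_traces(base_url: str, timeout: float = 5.0) -> dict:
    """GET the debug route and decode the body (assumed to be JSON)."""
    request = urllib.request.Request(
        recent_traces_url(base_url),
        method="GET",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call against a locally running server (hypothetical address):
# traces = fetch_recent_traces("http://localhost:8000")
url = recent_traces_url("http://localhost:8000/")
print(url)
```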
Upgrading from v0.21.1
- Apply the database migration. It runs automatically on server startup. If you run migrations manually, execute:
  ALTER TABLE scouter.genai_eval_record ADD COLUMN IF NOT EXISTS tags TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[];
- No other action required. All API changes are additive. Existing EvalRecord, ScouterQueue, and tracing usage continues to work without modification.