Skip to content

v0.22.0

Choose a tag to compare

@thorrester thorrester released this 18 Mar 10:34
· 317 commits to main since this release
a5a0db3

v0.22.0 Release Summary

What Changed

This release ships the offline agent evaluation framework: EvalScenario, EvalScenarios, EvalRunner, and EvalOrchestrator let you define named test cases, run them against a live queue + tracer, and get structured pass/fail metrics with per-scenario breakdowns. The release also integrates AgentAssertionTask into the full offline eval pipeline and adds in-process span capture so scenarios can assert on traces without a running server.


Breaking Changes

Database migration required. A new column is added to scouter.genai_eval_record:

ALTER TABLE scouter.genai_eval_record
    ADD COLUMN IF NOT EXISTS tags TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[];

The migration runs automatically on server startup via sqlx.


Changes

Offline Evaluation: EvalScenario / EvalScenarios / EvalRunner

Three new types form the core of the offline scenario framework:

EvalScenario — a single named test case. Holds a query string, optional context dict, tasks (assertion, LLM judge, trace, or agent), and a pass_threshold (float 0–1, default 1.0). Each scenario gets a stable UUID7 ID on construction.

from scouter.evaluate import EvalScenario, AgentAssertionTask, AgentAssertion

scenario = EvalScenario(
    name="tool_use_check",
    query="Search for recent AI papers",
    tasks=[
        AgentAssertionTask(
            id="search_called",
            assertion=AgentAssertion.tool_called("web_search"),
        )
    ],
    pass_threshold=1.0,
)

EvalScenarios — a collection of EvalScenario objects with internal state (datasets, contexts) populated by EvalRunner.collect_scenario_data() and results populated by EvalRunner.evaluate(). Not intended to be constructed manually; produced by EvalRunner.

EvalRunner — stateful engine that owns scenario definitions and GenAIEvalProfile references (as Arcs, same pattern as ScouterQueue). Two-phase API:

  1. collect_scenario_data(queue, tracer) — drains records and spans captured during scenario execution and associates them with scenarios by scenario_id tag.
  2. evaluate() — runs all tasks, returns ScenarioEvalResults.

Optionally call compare(baseline: ScenarioEvalResults) to produce a ScenarioComparisonResults with per-scenario deltas.

New result types:

Type Purpose
EvalMetrics Aggregate pass rates: overall_pass_rate, scenario_pass_rate, dataset_pass_rates (per-alias)
ScenarioResult Pass/fail + task results for one scenario
ScenarioEvalResults All ScenarioResults + aggregate EvalMetrics
ScenarioDelta Δ pass rate between two runs for one scenario
ScenarioComparisonResults Full comparison across all scenarios with regression/improvement classification

Offline Evaluation: EvalOrchestrator (Python)

EvalOrchestrator is a high-level Python wrapper that manages the full capture lifecycle so callers don't have to sequence enable_capture / disable_capture / collect_scenario_data manually.

from scouter.evaluate import EvalOrchestrator, EvalScenario, EvalScenarios

orchestrator = EvalOrchestrator(
    scenarios=EvalScenarios(scenarios=[...]),
    queue=queue,
    tracer=tracer,
    profiles={"agent": profile},
)

results: ScenarioEvalResults = orchestrator.run(agent_fn=my_agent)

agent_fn is Callable[[str], str] — takes a query, returns a response string. The orchestrator:

  1. Enables queue capture + local span capture.
  2. Iterates scenarios, sets scouter.eval.scenario_id in OTel baggage, calls agent_fn(scenario.query).
  3. Disables capture, calls EvalRunner.collect_scenario_data(), then evaluate().
  4. Returns ScenarioEvalResults.

Subclass EvalOrchestrator and override execute_scenario() to handle non-string responses or add lifecycle hooks.


AgentAssertionTask: full pipeline integration

AgentAssertionTask was previously standalone (via execute_agent_assertion_tasks()). It is now fully wired into the EvalDataset pipeline:

  • EvaluationTask::AgentAssertion variant added.
  • EvaluationTaskType::AgentAssertion variant added (serializes as "AgentAssertion").
  • TaskConfig::AgentAssertion deserializes from stored task JSON.
  • AssertionTasks.agent: Vec<AgentAssertionTask> — tasks are routed to this bucket when building datasets from TasksFile.
  • EvaluationTask::AgentAssertion participates in depends_on resolution.

New supporting types:

TokenUsage — structured token count from LLM responses. Fields: input_tokens, output_tokens, total_tokens (all Optional[int]). Exposed to Python as a #[pyclass].

AgentContextBuilder (Rust-internal) — normalizes vendor-specific LLM response formats into a standard structure before assertion evaluation. Auto-detects format:

  1. Pre-normalized (Scouter standard shape)
  2. OpenAI — choices[].message.tool_calls, usage, model
  3. Anthropic — content[] with ToolUseBlock, usage, model
  4. Google/Gemini — candidates[].content.parts[] with function_call
  5. Fallback tree walk

Path limits enforced: max 512 chars per path, max 32 segments.


In-process span capture

A new local capture mode lets tests assert on trace spans without a running Scouter server or Delta Lake backend. Spans are buffered in memory instead of forwarded to the transport.

Buffer capacity: 20,000 spans (CAPTURE_BUFFER_MAX). Writes beyond this limit are dropped with a warning.

Rust API (Tracer):

tracer.enable_local_capture()?;
// ... instrumented code ...
let spans = tracer.drain_local_spans()?;
let by_trace = tracer.get_local_spans_by_trace_ids(vec!["abc123...".into()])?;
tracer.disable_local_capture()?; // discards buffer

Python API (ScouterInstrumentor):

instrumentor.enable_local_capture()
# ... instrumented code ...
spans: list[TraceSpanRecord] = instrumentor.drain_local_spans()
spans_filtered = instrumentor.get_local_spans_by_trace_ids(["abc123..."])
instrumentor.disable_local_capture()

Module-level aliases also available: enable_local_span_capture, disable_local_span_capture, drain_local_span_capture.

disable_local_capture logs a warning and discards buffered spans if any remain.


EvalRecord: tags and trace_id stamping

TagsEvalRecord now carries a tags: list[str] field in key=value format. Tags are persisted to PostgreSQL and returned in all query paths (get, paginated, archive).

record = EvalRecord(context={"response": "..."})
record.add_tag("environment", "staging")
record.add_tag("model", "gpt-4o")
# record.tags == ["environment=staging", "model=gpt-4o"]

trace_id at constructionEvalRecord(trace_id="<hex>") now accepted. Previously trace_id could only be set after construction.

Automatic stamping in QueueBus — when an EvalRecord is inserted via ScouterQueue and has no trace_id, the bus checks the active OTel span context. If a valid span is active, its trace ID is stamped onto the record (both the Rust struct and the Python-side object). The Python object is updated via a mutable borrow; a warning is logged if the cast fails.

Scenario tag auto-injection — if OTel baggage contains scouter.eval.scenario_id, the bus appends "scouter.eval.scenario_id=<value>" to record.tags automatically. Tag values are validated: alphanumeric, hyphens, underscores, max 128 chars. Invalid values are dropped with a warning.


ScouterQueue: offline record capture

New methods for offline use (mirroring the local span capture API):

queue.enable_capture()   # buffer EvalRecords in memory in addition to sending
queue.disable_capture()  # stop buffering and discard buffered records
queue.drain_records("alias")    # drain records from one queue by alias
queue.drain_all_records()       # drain from all queues, keyed by alias

Capture is off by default. Enabling it has negligible overhead; records are still forwarded to the normal transport.

GenAIEvalProfile references are now stored as Arc<GenAIEvalProfile> inside ScouterQueue.profiles, so EvalScenarios can share ownership without cloning.

shutdown() now releases the GIL during the 250ms wait periods, preventing deadlocks in multi-threaded Python programs.


Server: debug trace endpoint

A new diagnostic route returns the 10 most recent traces from the past 24 hours:

GET /scouter/trace/debug/recent

Returns the same TracePaginationResponse as the paginated trace query. Intended for local debugging and health verification; not authenticated differently from other trace routes.


Upgrading from v0.21.1

  1. Apply the database migration. It runs automatically on server startup. If you run migrations manually, execute:

    ALTER TABLE scouter.genai_eval_record
        ADD COLUMN IF NOT EXISTS tags TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[];
  2. No other action required. All API changes are additive. Existing EvalRecord, ScouterQueue, and tracing usage continues to work without modification.