feat: add ground truth reference inputs for on-demand evaluation #732

Merged
jariy17 merged 3 commits into main from feat/GT_Support
Mar 30, 2026

jariy17 (Collaborator) commented Mar 30, 2026

Summary

Adds ground truth reference inputs to on-demand evaluation (run eval), allowing users to specify expected agent behavior for more precise evaluation.

New CLI flags

  • -A, --assertion <text...> — Assertion the agent should satisfy (repeatable)
  • --expected-trajectory <names> — Expected tool calls in order (comma-separated)
  • --expected-response <text> — Expected agent response text
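As a rough illustration of how the three flags could be normalized into one reference-inputs object before the API call, here is a minimal TypeScript sketch. The names `GroundTruthFlags`, `ReferenceInputs`, and `buildReferenceInputs` are hypothetical and are not taken from this PR; the field names in the real SDK model may differ.

```typescript
// Hypothetical shapes for illustration only; not the SDK's actual types.
interface ReferenceInputs {
  assertions?: string[];
  expectedTrajectory?: string[];
  expectedResponse?: string;
}

// Raw values as a CLI parser might hand them over (assumed layout).
interface GroundTruthFlags {
  assertion?: string[];        // -A, --assertion (repeatable)
  expectedTrajectory?: string; // --expected-trajectory (comma-separated)
  expectedResponse?: string;   // --expected-response
}

// Normalizes the flags; returns undefined when no ground truth was supplied.
function buildReferenceInputs(flags: GroundTruthFlags): ReferenceInputs | undefined {
  const inputs: ReferenceInputs = {};

  const assertions = (flags.assertion ?? [])
    .map((a) => a.trim())
    .filter((a) => a.length > 0);
  if (assertions.length > 0) inputs.assertions = assertions;

  // Split the comma-separated tool names, preserving their order.
  const trajectory = (flags.expectedTrajectory ?? "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  if (trajectory.length > 0) inputs.expectedTrajectory = trajectory;

  const response = flags.expectedResponse?.trim();
  if (response) inputs.expectedResponse = response;

  return Object.keys(inputs).length > 0 ? inputs : undefined;
}
```

Returning `undefined` when nothing was supplied keeps a baseline run (no GT inputs) distinguishable from a run with empty inputs.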

TUI changes

  • New Ground Truth wizard step in run eval flow (appears when exactly 1 session is selected)
  • Evaluator creation screen now shows reference input placeholders ({assertions}, {expected_tool_trajectory}, {actual_tool_trajectory}, {expected_response})
  • Updated default evaluator instructions to include GT placeholders
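To make the role of the placeholders concrete, the following TypeScript sketch shows one way an instruction template containing them could be filled in. This PR does not show where or how substitution actually happens, so `fillPlaceholders` and `usedPlaceholders` are hypothetical helpers; only the four placeholder names come from the description above.

```typescript
// Placeholder names listed in the PR description.
const GT_PLACEHOLDERS = [
  "assertions",
  "expected_tool_trajectory",
  "actual_tool_trajectory",
  "expected_response",
] as const;

type PlaceholderName = (typeof GT_PLACEHOLDERS)[number];

// Hypothetical helper: replaces known {placeholder} tokens that have a value
// and leaves unknown or unset tokens untouched.
function fillPlaceholders(
  template: string,
  values: Partial<Record<PlaceholderName, string>>,
): string {
  return template.replace(/\{([a-z_]+)\}/g, (token: string, name: string) => {
    const known = (GT_PLACEHOLDERS as readonly string[]).includes(name);
    const value = known ? values[name as PlaceholderName] : undefined;
    return value !== undefined ? value : token;
  });
}

// Hypothetical helper: lists which GT placeholders a template references,
// in the order they appear in GT_PLACEHOLDERS.
function usedPlaceholders(template: string): PlaceholderName[] {
  return GT_PLACEHOLDERS.filter((name) => template.includes(`{${name}}`));
}
```

A helper like `usedPlaceholders` could, for example, let a UI hint at which ground truth inputs a custom evaluator expects.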

Implementation

  • evaluationReferenceInputs passed through to the evaluate() API call
  • Results include referenceInputs summary in both CLI and TUI output
  • Note: The second commit (temp: use middleware...) is a temporary workaround — the SDK does not yet include evaluationReferenceInputs in its model, so we inject it via Smithy middleware. Remove when SDK is updated.
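The pass-through and the result summary described above might look roughly like the sketch below. It assumes a simplified request shape: `EvaluateRequest`, `buildEvaluateRequest`, and `summarizeReferenceInputs` are hypothetical names, and only the `evaluationReferenceInputs` key is taken from the PR text. The real SDK input type is not reproduced here.

```typescript
// Hypothetical shapes for illustration only; not the SDK's actual types.
interface ReferenceInputs {
  assertions?: string[];
  expectedTrajectory?: string[];
  expectedResponse?: string;
}

interface EvaluateRequest {
  evaluatorId: string;
  sessionId: string;
  evaluationReferenceInputs?: ReferenceInputs;
}

// Adds evaluationReferenceInputs only when ground truth was supplied,
// so a baseline request carries no such key at all.
function buildEvaluateRequest(
  evaluatorId: string,
  sessionId: string,
  referenceInputs?: ReferenceInputs,
): EvaluateRequest {
  return {
    evaluatorId,
    sessionId,
    ...(referenceInputs ? { evaluationReferenceInputs: referenceInputs } : {}),
  };
}

// Produces short human-readable lines for CLI or TUI output.
function summarizeReferenceInputs(inputs?: ReferenceInputs): string[] {
  if (!inputs) return [];
  const lines: string[] = [];
  if (inputs.assertions?.length) {
    lines.push(`assertions: ${inputs.assertions.length}`);
  }
  if (inputs.expectedTrajectory?.length) {
    lines.push(`expected trajectory: ${inputs.expectedTrajectory.join(" -> ")}`);
  }
  if (inputs.expectedResponse) {
    lines.push(`expected response: ${inputs.expectedResponse}`);
  }
  return lines;
}
```

Using a conditional spread rather than assigning `undefined` means the key is truly absent from the serialized request when no ground truth is given.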

Test plan

  • npx tsc --noEmit — clean
  • npm run lint — 0 errors
  • 5 new unit tests for GT reference inputs (all pass)
  • Existing 33 run-eval tests continue to pass

@jariy17 jariy17 requested a review from a team March 30, 2026 19:48
@github-actions github-actions bot added the size/l PR size: L label Mar 30, 2026
github-actions bot (Contributor) commented

Package Tarball

aws-agentcore-0.4.0.tgz

How to install

npm install https://github.com/aws/agentcore-cli/releases/download/pr-732-tarball/aws-agentcore-0.4.0.tgz

github-actions bot commented Mar 30, 2026

Coverage Report

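| Status | Category | Percentage | Covered / Total |
|--------|------------|------------|-----------------|
| 🔵 | Lines | 45.98% | 6575 / 14297 |
| 🔵 | Statements | 45.56% | 6986 / 15333 |
| 🔵 | Functions | 44.68% | 1177 / 2634 |
| 🔵 | Branches | 46.18% | 4362 / 9444 |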
Generated in workflow #1538 for commit d5e84ae by the Vitest Coverage Report Action

Bump @aws-sdk/client-bedrock-agentcore and
@aws-sdk/client-bedrock-agentcore-control from ^3.893.0 to ^3.1020.0.

The SDK now includes EvaluationReferenceInput in its model,
so we pass it directly to EvaluateCommand instead of injecting
via middleware.

jariy17 commented Mar 30, 2026

Testing Results

Verified all new GT features against a deployed runtime (TempAgent in EvalCheck project):

CLI flags tested

  • -A, --assertion — single and multiple assertions with Builtin.GoalSuccessRate
  • --expected-trajectory — comma-separated tool names with Builtin.TrajectoryExactOrderMatch
  • --expected-response — with and without explicit -t trace ID, using Builtin.Correctness
  • All three combined in a single run across multiple evaluators
  • Baseline run (no GT inputs) — confirms referenceInputs is absent from output

Custom evaluators tested

  • MyEvaluatorTraceGT (TRACE level) — {expected_response} placeholder populated correctly
  • CustomEvaluatorSessionGT (SESSION level) — {assertions}, {expected_tool_trajectory}, {actual_tool_trajectory} placeholders populated correctly

SDK upgrade verified

  • @aws-sdk/client-bedrock-agentcore@^3.1020.0: evaluationReferenceInputs passed natively via EvaluateCommand (no middleware workaround needed)

Other checks

  • npx tsc --noEmit — clean
  • npm run lint — 0 errors
  • All 38 unit tests pass (including 5 new GT tests)
  • TUI run eval wizard shows Ground Truth step when exactly 1 session is selected

@Hweinstock Hweinstock left a comment

Some nits and a question on service behavior. Otherwise LGTM.

Hweinstock previously approved these changes Mar 30, 2026
@Hweinstock Hweinstock left a comment

Slightly confused about the customer experience, but that's likely because I'm new to evals. The changes themselves LGTM.

@aws aws deleted a comment from Hweinstock Mar 30, 2026
@jariy17 jariy17 merged commit 01623ff into main Mar 30, 2026
22 checks passed
@jariy17 jariy17 deleted the feat/GT_Support branch March 30, 2026 23:03