feat: add ground truth reference inputs for on-demand evaluation by jariy17 · Pull Request #732 · aws/agentcore-cli

jariy17 · 2026-03-30T19:48:17Z

Summary

Adds ground truth reference inputs to on-demand evaluation (run eval), allowing users to specify expected agent behavior for more precise evaluation.

New CLI flags

-A, --assertion <text...> — Assertion the agent should satisfy (repeatable)
--expected-trajectory <names> — Expected tool calls in order (comma-separated)
--expected-response <text> — Expected agent response text
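A minimal sketch of how the three flags above could be normalized into one reference-inputs object. The `GroundTruthFlags` and `ReferenceInputs` shapes and the `buildReferenceInputs` helper are hypothetical names for illustration, not the CLI's actual code or the SDK's model.

```typescript
// Hypothetical shapes for illustration; the real CLI and SDK types may differ.
interface GroundTruthFlags {
  assertion?: string[]; // collected from repeated -A / --assertion
  expectedTrajectory?: string; // raw --expected-trajectory value, e.g. "toolA,toolB"
  expectedResponse?: string; // raw --expected-response value
}

interface ReferenceInputs {
  assertions?: string[];
  expectedTrajectory?: string[];
  expectedResponse?: string;
}

// Returns undefined when no ground truth flag was supplied.
function buildReferenceInputs(flags: GroundTruthFlags): ReferenceInputs | undefined {
  const inputs: ReferenceInputs = {};

  // Drop empty assertions so blank values are not forwarded.
  const assertions = (flags.assertion ?? [])
    .map((a) => a.trim())
    .filter((a) => a.length > 0);
  if (assertions.length > 0) inputs.assertions = assertions;

  // Split the comma-separated tool names, preserving their order.
  const trajectory = (flags.expectedTrajectory ?? "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  if (trajectory.length > 0) inputs.expectedTrajectory = trajectory;

  const response = flags.expectedResponse?.trim();
  if (response) inputs.expectedResponse = response;

  return Object.keys(inputs).length > 0 ? inputs : undefined;
}

const example = buildReferenceInputs({
  assertion: ["Agent greets the user", "Agent calls the weather tool"],
  expectedTrajectory: "get_location, get_weather",
});
console.log(JSON.stringify(example));
```

Returning `undefined` when nothing was supplied keeps the no-ground-truth path identical to a plain run eval call.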

TUI changes

New Ground Truth wizard step in run eval flow (appears when exactly 1 session is selected)
Evaluator creation screen now shows reference input placeholders ({assertions}, {expected_tool_trajectory}, {actual_tool_trajectory}, {expected_response})
Updated default evaluator instructions to include GT placeholders
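To illustrate what a populated evaluator instruction might look like, here is a small sketch that fills the four placeholders listed above. The placeholder names come from this PR. The `fillPlaceholders` helper is hypothetical, and the real substitution presumably happens in the evaluation service rather than in the CLI.

```typescript
// The four placeholder names are taken from the PR description.
const GT_PLACEHOLDERS = [
  "assertions",
  "expected_tool_trajectory",
  "actual_tool_trajectory",
  "expected_response",
] as const;
type GtPlaceholder = (typeof GT_PLACEHOLDERS)[number];

// Hypothetical helper: replaces each known {placeholder} that has a value.
// Unknown names and placeholders without a value are left untouched.
function fillPlaceholders(
  template: string,
  values: Partial<Record<GtPlaceholder, string>>,
): string {
  return template.replace(/\{(\w+)\}/g, (match: string, name: string) => {
    const isKnown = (GT_PLACEHOLDERS as readonly string[]).includes(name);
    const value = isKnown ? values[name as GtPlaceholder] : undefined;
    return value ?? match;
  });
}

const instructions =
  "Compare the answer with {expected_response} and check {assertions}.";
console.log(fillPlaceholders(instructions, { expected_response: "Paris" }));
```

Leaving unresolved placeholders in place makes a missing reference input visible in the rendered instructions instead of silently producing an empty string.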

Implementation

evaluationReferenceInputs passed through to the evaluate() API call
Results include referenceInputs summary in both CLI and TUI output
Note: The second commit (temp: use middleware...) is a temporary workaround — the SDK does not yet include evaluationReferenceInputs in its model, so we inject it via Smithy middleware. Remove when SDK is updated.
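The temporary workaround described above can be sketched roughly as follows. This is not the commit's code. The types are simplified local stand-ins for Smithy's middleware types, and `injectReferenceInputs` is a hypothetical name. In a real client, such a function would be registered through `client.middlewareStack.add(...)` and would also need to keep headers such as the content length consistent with the rewritten body.

```typescript
// Simplified local stand-ins for Smithy's middleware types (assumption:
// the real ones, from @smithy/types, carry more fields than shown here).
interface HttpLikeRequest {
  body?: string;
  headers: Record<string, string>;
}
interface BuildArgs {
  request: HttpLikeRequest;
}
type BuildHandler<Output> = (args: BuildArgs) => Promise<Output>;
type BuildMiddleware<Output> = (next: BuildHandler<Output>) => BuildHandler<Output>;

// Hypothetical middleware factory: adds evaluationReferenceInputs to the
// already-serialized JSON body before the request is sent.
function injectReferenceInputs<Output>(referenceInputs: unknown): BuildMiddleware<Output> {
  return (next) => async (args) => {
    const { request } = args;
    if (referenceInputs !== undefined && typeof request.body === "string") {
      const payload = JSON.parse(request.body) as Record<string, unknown>;
      payload.evaluationReferenceInputs = referenceInputs;
      request.body = JSON.stringify(payload);
    }
    return next(args);
  };
}

// Demo with a fake "next" handler that simply echoes the final body.
// The payload fields here are placeholders, not the service's real schema.
const echoBody: BuildHandler<string> = async (args) => args.request.body ?? "";
const handler = injectReferenceInputs<string>({ note: "placeholder" })(echoBody);
handler({ request: { body: '{"evaluatorId":"example"}', headers: {} } }).then((body) =>
  console.log(body),
);
```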

Test plan

npx tsc --noEmit — clean
npm run lint — 0 errors
5 new unit tests for GT reference inputs (all pass)
Existing 33 run-eval tests continue to pass

github-actions · 2026-03-30T19:49:02Z

Package Tarball

aws-agentcore-0.4.0.tgz

How to install

npm install https://github.com/aws/agentcore-cli/releases/download/pr-732-tarball/aws-agentcore-0.4.0.tgz

github-actions · 2026-03-30T19:54:57Z

Coverage Report

Status	Category	Percentage	Covered / Total
🔵	Lines	45.98%	6575 / 14297
🔵	Statements	45.56%	6986 / 15333
🔵	Functions	44.68%	1177 / 2634
🔵	Branches	46.18%	4362 / 9444

Generated in workflow #1538 for commit d5e84ae by the Vitest Coverage Report Action

Bump @aws-sdk/client-bedrock-agentcore and @aws-sdk/client-bedrock-agentcore-control from ^3.893.0 to ^3.1020.0. The SDK now includes EvaluationReferenceInput in its model, so we pass it directly to EvaluateCommand instead of injecting via middleware.

src/cli/aws/agentcore.ts

jariy17 · 2026-03-30T20:42:49Z

Testing Results

Verified all new GT features against a deployed runtime (TempAgent in EvalCheck project):

CLI flags tested

-A, --assertion — single and multiple assertions with Builtin.GoalSuccessRate
--expected-trajectory — comma-separated tool names with Builtin.TrajectoryExactOrderMatch
--expected-response — with and without explicit -t trace ID, using Builtin.Correctness
All three combined in a single run across multiple evaluators
Baseline run (no GT inputs) — confirms referenceInputs is absent from output
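The baseline behavior in the last item (no `referenceInputs` key when no ground truth is given) could be implemented along these lines. This is a sketch under assumptions: `EvalResultSummary` and `summarizeResult` are hypothetical, and only the `referenceInputs` field name comes from the PR.

```typescript
// Hypothetical result shapes; only the referenceInputs name is from the PR.
interface ReferenceInputsSummary {
  assertions?: string[];
  expectedTrajectory?: string[];
  expectedResponse?: string;
}

interface EvalResultSummary {
  evaluator: string;
  referenceInputs?: ReferenceInputsSummary;
}

// Attaches referenceInputs only when at least one ground truth value exists,
// so a baseline run serializes without the key at all.
function summarizeResult(
  evaluator: string,
  refs?: ReferenceInputsSummary,
): EvalResultSummary {
  const hasRefs =
    refs !== undefined && Object.values(refs).some((v) => v !== undefined);
  return hasRefs ? { evaluator, referenceInputs: refs } : { evaluator };
}

const baseline = summarizeResult("Builtin.Correctness");
const withGt = summarizeResult("Builtin.Correctness", {
  expectedResponse: "Paris",
});
console.log(JSON.stringify(baseline));
console.log(JSON.stringify(withGt));
```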

Custom evaluators tested

MyEvaluatorTraceGT (TRACE level) — {expected_response} placeholder populated correctly
CustomEvaluatorSessionGT (SESSION level) — {assertions}, {expected_tool_trajectory}, {actual_tool_trajectory} placeholders populated correctly

SDK upgrade verified

@aws-sdk/client-bedrock-agentcore@^3.1020.0 — evaluationReferenceInputs passed natively via EvaluateCommand (no middleware workaround needed)

Other checks

npx tsc --noEmit — clean
npm run lint — 0 errors
All 38 unit tests pass (including 5 new GT tests)
TUI run eval wizard shows Ground Truth step when exactly 1 session is selected
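The single-session condition for the Ground Truth step can be expressed as a small guard. The step names and the `wizardSteps` function below are hypothetical and only illustrate the rule stated in this PR, not the TUI's actual structure.

```typescript
// Hypothetical step names for the run eval wizard.
type WizardStep = "select-sessions" | "ground-truth" | "confirm";

// Includes the Ground Truth step only when exactly one session is selected.
function wizardSteps(selectedSessionIds: string[]): WizardStep[] {
  const steps: WizardStep[] = ["select-sessions"];
  if (selectedSessionIds.length === 1) {
    steps.push("ground-truth");
  }
  steps.push("confirm");
  return steps;
}

console.log(wizardSteps(["session-a"]).join(" > "));
console.log(wizardSteps(["session-a", "session-b"]).join(" > "));
```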

Hweinstock

Some nits and a question on service behavior. Otherwise lgtm.

src/cli/commands/run/command.tsx

src/cli/operations/eval/run-eval.ts

src/cli/tui/screens/evaluator/types.ts

Hweinstock

Slightly confused on the customer experience, but that's likely because I'm new to evals. The changes themselves lgtm.

feat: add ground truth reference inputs for on-demand evaluation

0252612

jariy17 requested a review from a team March 30, 2026 19:48

github-actions bot added the size/l PR size: L label Mar 30, 2026

jariy17 temporarily deployed to e2e-testing March 30, 2026 19:48 — with GitHub Actions Inactive

jariy17 force-pushed the feat/GT_Support branch from 1000390 to ebbe38a March 30, 2026 20:11

github-actions bot added size/l PR size: L and removed size/l PR size: L labels Mar 30, 2026

jariy17 temporarily deployed to e2e-testing March 30, 2026 20:11 — with GitHub Actions Inactive

jariy17 commented Mar 30, 2026

src/cli/aws/agentcore.ts (outdated, resolved)

awsswarnim approved these changes Mar 30, 2026

Hweinstock reviewed Mar 30, 2026

Hweinstock previously approved these changes Mar 30, 2026

fix: address PR review — combine import, add single-session GT guard

d5e84ae

jariy17 dismissed Hweinstock’s stale review via d5e84ae March 30, 2026 22:48

github-actions bot removed the size/l PR size: L label Mar 30, 2026

jariy17 temporarily deployed to e2e-testing March 30, 2026 22:48 — with GitHub Actions Inactive

github-actions bot added the size/l PR size: L label Mar 30, 2026

Hweinstock approved these changes Mar 30, 2026

aws deleted a comment from Hweinstock Mar 30, 2026

jariy17 merged commit 01623ff into main Mar 30, 2026
22 checks passed

jariy17 deleted the feat/GT_Support branch March 30, 2026 23:03

jariy17 merged 3 commits into main from feat/GT_Support
