feat: add ground truth reference inputs for on-demand evaluation #732
Merged
Contributor
Package Tarball
How to install
npm install https://github.com/aws/agentcore-cli/releases/download/pr-732-tarball/aws-agentcore-0.4.0.tgz
Contributor
Coverage Report
Bump @aws-sdk/client-bedrock-agentcore and @aws-sdk/client-bedrock-agentcore-control from ^3.893.0 to ^3.1020.0. The SDK now includes EvaluationReferenceInput in its model, so we pass it directly to EvaluateCommand instead of injecting via middleware.
1000390 to
ebbe38a
jariy17
commented
Mar 30, 2026
Collaborator
Author
Testing Results
Verified all new GT features against a deployed runtime.
CLI flags tested
Custom evaluators tested
SDK upgrade verified
Other checks
awsswarnim
approved these changes
Mar 30, 2026
Hweinstock
reviewed
Mar 30, 2026
Contributor
Hweinstock
left a comment
Some nits and a question on service behavior. Otherwise lgtm.
Hweinstock
previously approved these changes
Mar 30, 2026
Contributor
Hweinstock
left a comment
Slightly confused on the customer experience, but that's likely because I'm new to evals. The changes themselves lgtm.
Hweinstock
approved these changes
Mar 30, 2026
Summary
Adds ground truth reference inputs to on-demand evaluation (run eval), allowing users to specify expected agent behavior for more precise evaluation.
New CLI flags
-A, --assertion <text...>: Assertion the agent should satisfy (repeatable)
--expected-trajectory <names>: Expected tool calls in order (comma-separated)
--expected-response <text>: Expected agent response text
TUI changes
run eval flow (appears when exactly 1 session is selected)
({assertions}, {expected_tool_trajectory}, {actual_tool_trajectory}, {expected_response})
Implementation
evaluationReferenceInputs passed through to the evaluate() API call
referenceInputs summary in both CLI and TUI output
temp: use middleware...) is a temporary workaround: the SDK does not yet include evaluationReferenceInputs in its model, so we inject it via Smithy middleware. Remove when SDK is updated.
Test plan
npx tsc --noEmit: clean
npm run lint: 0 errors
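For readers new to the feature, here is a rough sketch of how the three flags above could be collected into one reference-input object before the evaluate() call. This is not the code from this PR: the `EvalFlags` and `ReferenceInputs` types, their field names, and `buildReferenceInputs` are hypothetical, and the real `evaluationReferenceInputs` shape in the SDK may differ.

```typescript
// Hypothetical flag values as a CLI parser might hand them over.
// Field names are illustrative; they are not taken from the PR's source.
interface EvalFlags {
  assertion?: string[];        // -A, --assertion <text...> (repeatable)
  expectedTrajectory?: string; // --expected-trajectory <names> (comma-separated)
  expectedResponse?: string;   // --expected-response <text>
}

// Hypothetical normalized shape; the SDK's actual model may look different.
interface ReferenceInputs {
  assertions?: string[];
  expectedTrajectory?: string[];
  expectedResponse?: string;
}

// Returns undefined when no ground truth was supplied, so a caller can
// leave the reference inputs out of the request entirely.
function buildReferenceInputs(flags: EvalFlags): ReferenceInputs | undefined {
  const out: ReferenceInputs = {};

  // Drop empty assertions; keep the order the user gave them in.
  const assertions = (flags.assertion ?? [])
    .map((a) => a.trim())
    .filter((a) => a.length > 0);
  if (assertions.length > 0) out.assertions = assertions;

  // Split the comma-separated tool names, preserving call order.
  const names = (flags.expectedTrajectory ?? '')
    .split(',')
    .map((n) => n.trim())
    .filter((n) => n.length > 0);
  if (names.length > 0) out.expectedTrajectory = names;

  const response = flags.expectedResponse?.trim();
  if (response) out.expectedResponse = response;

  return Object.keys(out).length > 0 ? out : undefined;
}
```

Returning `undefined` for the empty case is a design choice in this sketch: it keeps a run without ground truth flags identical to a run made before the flags existed.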