
feat: evo preview features — config bundles, batch evaluation, recommendations, AB testing#1068

Merged
notgitika merged 106 commits into main from feat/evo-implementation on Apr 30, 2026

Conversation

@notgitika
Contributor

Summary

Adds preview support for the Evo feature set: config bundles, batch evaluation, recommendations, and AB testing.

Config Bundles [preview]

  • add config-bundle — add versioned runtime configuration bundles
  • cb versions — list version history for a bundle
  • cb diff — diff two versions of a bundle
  • cb create-branch — create a new branch on an existing bundle
  • --with-config-bundle flag on agent creation auto-wires config bundle support
  • Config bundle baggage passed on invoke for runtime config injection

Batch Evaluation [preview]

  • run batch-evaluation — run evaluators across all agent sessions in CloudWatch
  • stop batch-evaluation — stop a running batch evaluation
  • Ground truth support (assertions, expected trajectory, turns)
  • Name validation against API pattern [a-zA-Z][a-zA-Z0-9_]{0,47}
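The name rule above can be sketched as a small validator. This is a minimal illustration, not the CLI's actual code: the helper name `validateBatchEvaluationName`, the error text, and the decision to anchor the pattern to the whole string are assumptions.

```typescript
// Pattern quoted in the PR description: one letter, then up to 47 letters, digits, or underscores.
// Anchoring with ^ and $ (whole-string match) is an assumption of this sketch.
const BATCH_EVAL_NAME_PATTERN = /^[a-zA-Z][a-zA-Z0-9_]{0,47}$/;

// Hypothetical helper: returns null when the name is valid, otherwise an error message.
function validateBatchEvaluationName(name: string): string | null {
  if (BATCH_EVAL_NAME_PATTERN.test(name)) {
    return null;
  }
  return (
    'Invalid name "' + name + '": must start with a letter and contain only ' +
    "letters, digits, or underscores (48 characters maximum)."
  );
}
```

With this rule, a name such as `nightly_eval_1` passes while `nightly-eval` is rejected, which matches the "hyphens rejected" case in the test plan.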

Recommendations [preview]

  • run recommendation — optimize system prompts or tool descriptions using agent traces
  • Supports inline, file, and config bundle input sources
  • Config bundle integration: reads current prompt, writes optimized version back
  • JSONPath resolution from --runtime flag for multi-component bundles

AB Testing [preview]

  • Target-based AB test routing
  • AB test detail screen with p-value significance display

Other

  • TUI routing fixes for agentcore add config-bundle and agentcore add ab-test
  • Documentation for all preview features (docs/config-bundles.md, docs/batch-evaluation.md, docs/recommendations.md)
  • README updated with preview commands and doc links

Companion PR

  • CDK constructs: aws/agentcore-l3-cdk-constructs (separate PR)

Test plan

  • Unit tests passing
  • Manual CLI testing: batch eval (valid name, hyphens rejected, fake evaluator error, ground truth, multiple evaluators, lookback days, stop)
  • Manual CLI testing: recommendations (inline, file, config bundle, tool descriptions, nonexistent agent)
  • Manual CLI testing: config bundles (versions, diff, create-branch, add)
  • Manual CLI testing: status shows config bundles
  • Validate command passes on current schema

avi-alpert and others added 30 commits March 5, 2026 13:20
Add ConfigBundle as a new resource type with full lifecycle:
- Schema: ConfigBundleSchema with name validation, component configurations
- Primitive: ConfigBundlePrimitive for add/remove operations
- API client: SigV4-signed HTTP requests for config bundle CRUD operations
- Deploy: post-deploy hook to sync config bundles with control plane
- Status: config-bundle resource type in status command
- TUI: add wizard (name, description, components, branch, commit message),
  remove flow, ResourceGraph integration
- State: carry forward configBundles across redeploys in buildDeployedState
The signing service must be 'bedrock-agentcore' for all stages, not
'bedrock-agentcore-control' for prod. The endpoint hostname differs
from the signing service name.
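The distinction between endpoint hostname and signing service can be sketched as follows. The helper `resolveRequestTarget` and its types are hypothetical; the hostname prefixes come from the review comments later in this PR, and the default DNS suffix assumes the commercial AWS partition.

```typescript
type Plane = "control" | "data";

interface RequestTarget {
  hostname: string;       // where the HTTP request is sent
  signingService: string; // service name used in the SigV4 credential scope
}

// Hypothetical helper, not the CLI's real function.
function resolveRequestTarget(
  plane: Plane,
  region: string,
  dnsSuffix: string = "amazonaws.com" // assumption: commercial partition by default
): RequestTarget {
  const prefix = plane === "control" ? "bedrock-agentcore-control" : "bedrock-agentcore";
  return {
    hostname: prefix + "." + region + "." + dnsSuffix,
    // Per the commit message above: sign as 'bedrock-agentcore' for every stage,
    // even when the request goes to the control-plane hostname.
    signingService: "bedrock-agentcore",
  };
}
```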
- Add config bundle post-deploy setup to TUI deploy flow (useDeployFlow)
- Add clientToken to config bundle update API call
- Add parentVersionIds on update (required by API)
- Default branchName to "main" and commitMessage when not specified
- Add placeholders for branch/message in TUI wizard
- Fallback to find-by-name or create when update fails (stale IDs)
- Remove debug logging from actions.ts
- Add `agentcore edit config-bundle` CLI command with --bundle, --components,
  --components-file, --description, --branch, --message, --json flags
- Add interactive TUI wizard for editing config bundles (select bundle,
  input method, components, commit message, branch name, confirm)
- Add diff check to post-deploy: skip API update when components and
  description are unchanged, avoiding unnecessary version creation
- Use getConfigurationBundleVersion instead of getConfigurationBundle to
  avoid branch-not-found errors on bundles created with different branches
- Align default branch name to 'mainline' (API default) instead of 'main'
- For updates, inherit branch from current API state when not specified
- post-deploy-config-bundles: 13 tests covering create, update, skip
  (diff check), delete, branch inheritance, fallback paths, errors
- ConfigBundlePrimitive.edit: 7 tests covering component updates,
  optional field handling, missing bundle errors, field preservation
- useEditConfigBundleWizard: 16 tests covering step navigation,
  setters, goBack, reset, currentIndex tracking, step labels
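The post-deploy diff check mentioned in the commits above (skip the API update when components and description are unchanged) might look roughly like this. It is a sketch under assumptions: `BundleContent`, `stableStringify`, and `needsUpdate` are illustrative names, and the real hook may compare differently.

```typescript
interface BundleContent {
  description?: string;
  components: { [name: string]: unknown };
}

// Serialize with sorted object keys so that key order does not affect the comparison.
function stableStringify(value: unknown): string {
  if (value === undefined) {
    return "undefined";
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => stableStringify(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as { [key: string]: unknown };
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + stableStringify(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

// True when the local definition differs from what the API currently holds,
// i.e. when an update (and therefore a new version) is actually warranted.
function needsUpdate(local: BundleContent, remote: BundleContent): boolean {
  return (
    (local.description || "") !== (remote.description || "") ||
    stableStringify(local.components) !== stableStringify(remote.components)
  );
}
```

Comparing a key-order-insensitive serialization keeps a reordered but otherwise identical `agentcore.json` from creating a new bundle version.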
feat: add configuration bundle support
* chore: remove edit config-bundle command

Users should edit agentcore.json directly to update config bundles.
Removes the edit CLI command, TUI screens, wizard hooks, and tests.

* feat: add config-bundle CLI commands for version history

Adds `agentcore config-bundle` with three subcommands:
- `versions` — list version history grouped by branch
- `get-version` — view specific version details and components
- `diff` — client-side deep diff between two versions

Also adds filter support (branchName, latestPerBranch, createdBy)
to the listConfigurationBundleVersions API client.
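A client-side deep diff of two version payloads can be sketched as below. The `Json` and `DiffEntry` types, the `$`-rooted path format, and the choice to compare arrays as whole values are assumptions of this example, not the CLI's actual output format.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface DiffEntry {
  path: string;
  kind: "added" | "removed" | "changed";
  before?: Json | undefined;
  after?: Json | undefined;
}

function isObject(v: Json | undefined): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Recursively compares two values and reports added, removed, and changed leaves.
// Arrays are compared as whole values (an assumption made to keep the sketch short).
function deepDiff(before: Json | undefined, after: Json | undefined, path: string = "$"): DiffEntry[] {
  if (before === undefined && after === undefined) {
    return [];
  }
  if (isObject(before) && isObject(after)) {
    const b = before;
    const a = after;
    const keys = Object.keys(b).concat(Object.keys(a).filter((k) => !(k in b))).sort();
    let out: DiffEntry[] = [];
    for (const key of keys) {
      out = out.concat(deepDiff(b[key], a[key], path + "." + key));
    }
    return out;
  }
  if (before === undefined) return [{ path, kind: "added", after }];
  if (after === undefined) return [{ path, kind: "removed", before }];
  if (JSON.stringify(before) !== JSON.stringify(after)) {
    return [{ path, kind: "changed", before, after }];
  }
  return [];
}
```

For example, diffing `{ a: 1, b: { c: "x" } }` against `{ a: 1, b: { c: "y" }, d: true }` yields a `changed` entry at `$.b.c` and an `added` entry at `$.d`.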

* feat: add config bundle hub TUI screens

Add TUI screens for browsing config bundles, viewing version history
with branch grouping, version detail drill-down, and diff comparison
between versions.

* fix: resolve config bundle versionId when falling back to list API (#49)

The Recommendation API requires versionId to be non-null when using
configurationBundle input. When resolveBundleByName fell back to the
list API (bundle not in deployed state), it returned no versionId,
causing a 400 validation error.

Now calls getConfigurationBundle after list to fetch the latest
versionId. Also adds versionId to the ResolvedBundle interface and
returns it from the deployed-state fast path.
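The fix described above can be sketched as a two-step resolution. Everything here is illustrative: the `BundleApi` interface uses synchronous stand-ins for the real (asynchronous) API calls, and the field names on `ResolvedBundle` other than `versionId` are assumptions.

```typescript
interface ResolvedBundle {
  bundleId: string;
  versionId: string; // must be non-null when used as configurationBundle input
}

// Synchronous stand-ins for the real async API calls (assumption, to keep the sketch short).
interface BundleApi {
  listBundles(): Array<{ bundleId: string; name: string }>;
  getLatestVersionId(bundleId: string): string; // stands in for getConfigurationBundle
}

function resolveBundleByName(
  name: string,
  deployed: { [name: string]: ResolvedBundle },
  api: BundleApi
): ResolvedBundle {
  // Fast path: deployed state already records both the bundle and its versionId.
  const known = deployed[name];
  if (known) {
    return known;
  }
  // Fallback: find the bundle via the list API, then fetch it to obtain the latest versionId.
  const match = api.listBundles().filter((b) => b.name === name)[0];
  if (!match) {
    throw new Error("Config bundle not found: " + name);
  }
  return { bundleId: match.bundleId, versionId: api.getLatestVersionId(match.bundleId) };
}
```

The point of the second call is that the fallback path never returns a bundle without a `versionId`, which is what previously triggered the 400 validation error.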

* chore: remove get-version subcommand from config-bundle CLI

The versions --json and diff commands cover all practical use cases.
Keeps the command surface lean: versions + diff only.
* feat: add Recommendation API wrappers, CLI commands, and operations layer

Implement the Recommendations/Optimization feature for AgentCore CLI:
- SigV4-signed HTTP client for Start/Get/List/Delete Recommendation (DP)
- Operations layer with orchestration, polling, and local storage
- CLI commands: evals recommend, evals recommendation history/delete, run promote
- 27 unit tests covering API, storage, and orchestration logic
- Live-validated field names and ARN formats against prod API

* feat: add recommendation TUI wizard with session discovery and multi-evaluator support

- Add full recommendation wizard TUI (type, agent, evaluators, input, trace source, sessions, confirm)
- Add session discovery flow: discover sessions from CloudWatch, multi-select specific sessions
- Support both CloudWatch logs and session ID trace sources
- Pass selected sessionIds to recommendation API cloudwatchLogs config
- Add request ID capture and error detail extraction for debugging FAILED recommendations
- Fix recommendation API test mocks (add headers for requestId capture)
- Add scrollable list support (maxVisibleItems) to MultiSelectList, SelectList, WizardSelect
- Wire recommendation screen into App.tsx and EvalHubScreen navigation

* feat: add session span fetching, recommendation tests, and TUI integration

- Add fetch-session-spans module for retrieving OTEL spans from aws/spans
  and log records from runtime log groups with session ID filtering
- Add comprehensive tests for fetch-session-spans (9 tests) and extend
  run-recommendation tests (12 new tests covering file input, spans-file
  trace source, tool-desc auto-fetch, error handling, ARN passthrough)
- Wire recommendation hub, history screen, and list/delete CLI commands
- Update TUI routing for recommendation flows from eval and run hubs
- Add recommendation constants (poll intervals, terminal statuses)

* chore: remove list commands and promote stub, fix agents→runtimes rename

Remove `agentcore list recommendations` and `agentcore list recommendation --id`
commands (top-level `list` command deleted entirely). Remove `run promote` stub.
Fix typecheck errors from agents→runtimes schema rename in recommendation files.
#26)

* feat: add EvaluationJob resource — schema, primitive, deploy hook, TUI, and tests

Phase 1 of EvalJobRunner: CRUD + deploy integration for the EvaluationJob
control plane resource.

- Schema: EvaluationJobSchema in agentcore.json, deployed state tracking
- Primitive: EvaluationJobPrimitive with add/remove lifecycle
- AWS client: SigV4-signed HTTP wrappers for EvalJob CP operations
- Deploy: post-deploy hook creates/updates/deletes eval jobs imperatively
- CFN outputs: parse eval job execution role ARN from stack outputs
- TUI: add evaluation-job wizard flow + remove flow integration
- Tests: 53 tests across schema, primitive, AWS client, deploy hook, and TUI

* feat: add `run evaluation-job` command with DP API wrappers and orchestration

- Data plane API wrappers (RunEvaluationJob, GetEvaluationJobRun, ListEvaluationJobRuns)
  with SigV4 signing against bedrock-agentcore service
- Orchestration: resolve job from deployed state, generate runId, start run,
  poll for completion, fetch results from CW Logs output group
- CLI command: `agentcore run evaluation-job --job <name> --session-id <ids...>`
  with --json output and progress callbacks
- Tests: 17 new tests covering DP wrappers, runId generation, orchestration
  (error handling, polling, CW Logs result parsing)

* feat: complete US1/US2 quick wins — run name, cancel, update, stage-aware endpoints

- Add --run flag to `run evaluation-job` for custom run name prefixes
- Add `run cancel-evaluation-job` command with StopEvaluationJobRun DP API
- Add `update evaluation-job` primitive method and CLI subcommands
- Add `agentcore update experiment` parent command (backward-compatible)
- Make CP/DP endpoints stage-aware via AGENTCORE_STAGE env var (beta/gamma/prod)
- Fix beta SigV4 service name (bedrock-agentcore vs bedrock-agentcore-control)
- Update AddEvaluationJobFlow success screen with next-steps guidance

* feat: add TUI run wizard, progress steps, and local result storage for eval jobs

- Add RunEvalJobFlow TUI: select job → enter sessions → name run → confirm → execute
- Add StepProgress display during eval job polling (starting → polling → fetching → saving)
- Add elapsed time counter during run execution
- Add eval-job-storage module: save/load/list run results per job in .cli/eval-job-results/
- Auto-save results on both CLI and TUI paths
- Add "Evaluation Job" option to TUI Run screen
- Add 9 unit tests for eval-job-storage

* feat: add CloudWatch session discovery to eval job TUI wizard

- Add source type picker: "Discover from CloudWatch" vs "Enter manually"
- Add lookback days input (1-90 days) for CloudWatch discovery
- Discover sessions via CW Insights query using agent's runtimeId
- Multi-select from discovered sessions with span count + timestamps
- Auto-fallback to manual entry when agent not deployed (no runtimeId)
- Improve error display: show failed step in StepProgress before transitioning
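The 1-90 day bound on the lookback input could be enforced with a small parser like the one below. The function name and error messages are hypothetical; only the range itself comes from the commit message.

```typescript
const MIN_LOOKBACK_DAYS = 1;
const MAX_LOOKBACK_DAYS = 90;

// Hypothetical helper: parses wizard input and enforces the 1-90 day range.
function parseLookbackDays(input: string): number {
  const trimmed = input.trim();
  // Accept whole numbers only; "1.5", "-3", and "abc" are rejected here.
  if (!/^\d+$/.test(trimmed)) {
    throw new Error("Lookback days must be a whole number.");
  }
  const days = parseInt(trimmed, 10);
  if (days < MIN_LOOKBACK_DAYS || days > MAX_LOOKBACK_DAYS) {
    throw new Error(
      "Lookback days must be between " + MIN_LOOKBACK_DAYS + " and " + MAX_LOOKBACK_DAYS + "."
    );
  }
  return days;
}
```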

* feat: migrate evaluation from resource CRUD to stateless batch evaluation

Replace the old EvaluationJob resource model (create/update/delete via
agentcore.json + deploy hooks) with a flat BatchEvaluation API model:

- Add `run batch-evaluation` and `run stop-batch-evaluation` CLI commands
- Add batch evaluation TUI wizard under the Run menu
- Add SigV4 API client for batch eval endpoints (start/get/list/stop)
- Add CloudWatch results fetching from outputDataConfig
- Remove all old evaluation-job infrastructure: primitive, deploy hook,
  schema, TUI add/remove screens, CP CRUD operations
- Remove evaluationJobs from agentcore.json schema

Tested end-to-end on gamma (account 998846730471) with Builtin.Faithfulness
evaluator against 3 agent sessions — all returning correct scores.

* chore: remove executionRoleArn now that FAS creds are live on gamma

The batch evaluation API no longer requires an execution role ARN.
Remove the --execution-role CLI option and all executionRoleArn
plumbing from the API client and orchestration layer.

* Revert "chore: remove executionRoleArn now that FAS creds are live on gamma"

This reverts commit f1706ff7ea4b7695d1466e609cde29e38cb00afb.

* refactor: move stop-batch-evaluation to top-level stop command

Move `agentcore run stop-batch-evaluation` to `agentcore stop batch-evaluation`
as a higher-level verb, consistent with pause/resume pattern.
- Restore --days flag on `run eval` (was renamed to --lookback, breaking
  existing scripts)
- Restore onListCloudWatchTraces/onGetCloudWatchTrace handlers in
  browser-mode.ts from public/main
@github-actions github-actions Bot added size/xl PR size: XL and removed size/xl PR size: XL labels Apr 30, 2026
@github-actions
Contributor

github-actions Bot commented Apr 30, 2026

Coverage Report

| Status | Category | Percentage | Covered / Total |
|---|---|---|---|
| 🔵 | Lines | 42.91% | 8934 / 20817 |
| 🔵 | Statements | 42.18% | 9480 / 22475 |
| 🔵 | Functions | 39.66% | 1537 / 3875 |
| 🔵 | Branches | 39.89% | 5744 / 14397 |
Generated in workflow #2250 for commit 90939c2 by the Vitest Coverage Report Action

The AB test CLI flag was renamed from --gateway-arn to --gateway and
made optional. Tests now use --runtime instead, matching config-bundle
mode defaults.
Contributor

@jariy17 jariy17 left a comment


Re-reviewed at HEAD 6e085f4. All previously flagged regressions are resolved:

  • --days flag restored on run eval
  • onListCloudWatchTraces / onGetCloudWatchTrace restored in agentcore dev
  • RESOURCE_SUFFIX isolation restored in e2e import tests
  • ✅ Version and agent-inspector dep back to 0.12.2 / 0.3.0
  • PRIVATE_DEV_DISTRO config reverted

No regressions against the private repo baseline. The 4 issues flagged by agentcore-cli-automation (hardcoded amazonaws.com in recommendation/config-bundle wrappers, stale JSON schema, silent agentcore.json mutation on deploy, config bundle/AB test teardown leak) are separate functional issues worth addressing but not regressions from this PR's changes.

Contributor

@jariy17 jariy17 left a comment


Updating review — the 4 issues flagged by agentcore-cli-automation are blocking and need to be addressed before merge.


1. Hardcoded amazonaws.com breaks non-commercial partitions

Files:

  • src/cli/aws/agentcore-recommendation.ts:228
  • src/cli/aws/agentcore-config-bundles.ts:181

Both hardcode https://bedrock-agentcore.{region}.amazonaws.com / https://bedrock-agentcore-control.{region}.amazonaws.com. The sibling wrappers in this same PR (agentcore-ab-tests.ts, agentcore-batch-evaluation.ts, agentcore-http-gateways.ts) correctly use dnsSuffix(region) from ./partition. Recommendations and config bundles will silently fail in GovCloud and China partitions.

Fix: import dnsSuffix from ./partition and replace the hardcoded literal in both files.
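As a rough illustration of the fix, the endpoint can be built from a partition-aware suffix instead of a literal. The `dnsSuffix` body below is a hypothetical stand-in, since the real helper in ./partition is not shown in this PR; it only distinguishes China regions from everything else.

```typescript
// Hypothetical stand-in for dnsSuffix from ./partition.
function dnsSuffix(region: string): string {
  if (/^cn-/.test(region)) {
    return "amazonaws.com.cn"; // China (aws-cn) partition
  }
  return "amazonaws.com"; // default suffix for all other regions in this sketch
}

// Builds the control-plane endpoint without hardcoding the DNS suffix.
function controlPlaneEndpoint(region: string): string {
  return "https://bedrock-agentcore-control." + region + "." + dnsSuffix(region);
}

// Same idea for the data-plane endpoint.
function dataPlaneEndpoint(region: string): string {
  return "https://bedrock-agentcore." + region + "." + dnsSuffix(region);
}
```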


2. schemas/agentcore.schema.v1.json is stale

The Zod schemas now include configBundles, abTests, and httpGateways as top-level fields on AgentCoreProjectSpecSchema, but the committed JSON schema was not regenerated. Users whose editors validate agentcore.json against the published schema (VS Code, etc.) will see false "property not allowed" errors on every new preview field.

Fix: run npm run build:lib && npm run build:schema and commit the regenerated schemas/agentcore.schema.v1.json.


3. validateProject() silently rewrites agentcore.json on every deploy

File: src/cli/operations/deploy/preflight.ts

The deploy preflight injects type: "ConfigurationBundle" into config bundle entries and writes the file back with JSON.stringify(rawJson, null, 2). This runs on every agentcore deploy, producing surprise git diffs for users and clobbering their file's original formatting (tabs, trailing newlines, key order). The Zod ConfigBundleSchema already applies this default in-memory, so the write-back is unnecessary.

Fix options:

  1. Fix the CDK side to consume the Zod-parsed spec with defaults applied, and drop this patching code.
  2. Inject type only into the in-memory object the CDK reads at synth time, without persisting to disk.
  3. If persisting is truly required: only write when patched === true (already done), preserve trailing newline, and print a warning to the user that their file was modified.
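Option 2 above (apply the default in memory only) might be sketched like this. The `ProjectSpec` and `ConfigBundleEntry` shapes and the function name are assumptions for illustration; the key property is that the input object is not mutated and nothing is written to disk.

```typescript
interface ConfigBundleEntry {
  name: string;
  type?: string;
  [key: string]: unknown;
}

interface ProjectSpec {
  configBundles?: ConfigBundleEntry[];
  [key: string]: unknown;
}

// Returns a copy of the spec with `type` defaulted on each config bundle entry.
// The caller's object (and therefore agentcore.json on disk) is left untouched.
function withConfigBundleDefaults(spec: ProjectSpec): ProjectSpec {
  if (!spec.configBundles) {
    return { ...spec };
  }
  return {
    ...spec,
    configBundles: spec.configBundles.map((bundle) => ({
      ...bundle,
      type: bundle.type || "ConfigurationBundle",
    })),
  };
}
```

The patched copy would be handed to the synth step, while the parsed file contents stay exactly as the user wrote them.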

4. Config bundles and AB tests are leaked on stack teardown

performStackTeardown explicitly calls deleteHttpGatewayWithTargets for HTTP gateways before destroying the CFN stack, but there is no equivalent cleanup for config bundles or AB tests. When a user runs agentcore remove all + deploy, the CFN stack is destroyed but any config bundles and AB tests in deployed-state remain orphaned in AWS — silently accumulating charges with no CLI surface to clean them up.

Fix: extend performStackTeardown to iterate deployedState.targets[target].resources.configBundles and .abTests and delete them, mirroring what it already does for httpGateways.
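One way to picture the proposed fix is a pure planning step that lists what must be deleted before the stack is destroyed. The deployed-state field names beyond `targets[target].resources.configBundles` and `.abTests` (for example `bundleId` and `abTestId`) are assumptions, as is listing AB tests before config bundles on the guess that a test may reference a bundle.

```typescript
interface DeployedResources {
  configBundles?: { [name: string]: { bundleId: string } };
  abTests?: { [name: string]: { abTestId: string } };
}

interface DeployedState {
  targets: { [target: string]: { resources?: DeployedResources } };
}

interface CleanupTask {
  kind: "abTest" | "configBundle";
  name: string;
  id: string;
}

// Hypothetical planning helper: returns the deletions to perform before destroying the stack.
function planPreTeardownCleanup(state: DeployedState, target: string): CleanupTask[] {
  const entry = state.targets[target];
  const resources: DeployedResources = (entry && entry.resources) || {};
  const tasks: CleanupTask[] = [];

  // Assumption: delete AB tests first, in case they reference config bundles.
  const abTests = resources.abTests || {};
  for (const name of Object.keys(abTests)) {
    const test = abTests[name];
    if (test) tasks.push({ kind: "abTest", name, id: test.abTestId });
  }

  const bundles = resources.configBundles || {};
  for (const name of Object.keys(bundles)) {
    const bundle = bundles[name];
    if (bundle) tasks.push({ kind: "configBundle", name, id: bundle.bundleId });
  }
  return tasks;
}
```

A teardown routine could then execute each task with the matching delete call before destroying the stack, in the same way it already handles HTTP gateways.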

The auto-generated gateway references --runtime, which must exist in
the project. Remove noAgent:true and use project.agentName dynamically.
@notgitika
Contributor Author

Follow-up PR for addressing:

  1. validateProject() silently rewrites agentcore.json on every deploy

  2. Config bundles and AB tests are leaked on stack teardown


Labels

size/xl PR size: XL


6 participants