Feat: Evaluation preview features - Batch evaluation and config bundles #446

Merged
padmak30 merged 56 commits into main from feat/evo-main
Apr 30, 2026

@padmak30
Contributor

Description of changes:

Batch Evaluation

  • Run batch evaluation on a dataset
  • Ground truth support
  • Simulated dataset support

ConfigBundles

  • Config bundle support - read config bundle values from baggage
  • BaggageSpanProcessor and routing experiment context helpers
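
As a rough illustration of what "read config bundle values from baggage" involves, the sketch below parses a W3C `baggage` header value using only the standard library. It is not the SDK's implementation: the function names and the `config.bundle.id` key are hypothetical, chosen only for this example.

```python
from typing import Dict, Optional
from urllib.parse import unquote

# Hypothetical baggage key, for illustration only; the real key used by
# the SDK is not stated in this PR description.
CONFIG_BUNDLE_KEY = "config.bundle.id"


def parse_baggage(header: str) -> Dict[str, str]:
    """Parse a W3C baggage header value into a key/value dict.

    Optional properties after ';' are dropped and malformed members
    (no '=' or an empty key) are skipped.
    """
    entries: Dict[str, str] = {}
    for member in header.split(","):
        # Keep only the "key=value" part, dropping properties like ";ttl=30".
        key_value = member.split(";", 1)[0]
        if "=" not in key_value:
            continue
        key, value = key_value.split("=", 1)
        key = key.strip()
        if key:
            # Baggage values are percent-encoded.
            entries[key] = unquote(value.strip())
    return entries


def config_bundle_id(headers: Dict[str, str]) -> Optional[str]:
    """Return the (hypothetical) config bundle id from request headers."""
    return parse_baggage(headers.get("baggage", "")).get(CONFIG_BUNDLE_KEY)
```

A handler could call `config_bundle_id(request_headers)` on each invocation and fall back to defaults when it returns `None`.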

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

avi-alpert and others added 30 commits March 24, 2026 13:34
Re-adding workflow that was lost in a force push to main.
Original PR: aws/bedrock-agentcore-sdk-python-private#60

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add config bundle support to runtime

Introduces a new config_bundle module that allows agent handlers to read configuration from BedrockAgentCore configuration bundles, delivered via W3C baggage headers on each invocation.
* feat: add batch evaluation
Introduces BatchEvaluationRunner, which orchestrates end-to-end batch evaluation against the AgentCore Evaluation Service.
GITHUB_TOKEN lacks the workflows permission, so syncing .github/workflows/
from the public repo causes push failures. After merging, restore our
workflow files from HEAD before committing in both the clean and conflict
paths.
Bumps [cryptography](https://github.com/pyca/cryptography) from 46.0.5 to 46.0.7.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](pyca/cryptography@46.0.5...46.0.7)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-version: 46.0.7
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [pillow](https://github.com/python-pillow/Pillow) from 12.1.1 to 12.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](python-pillow/Pillow@12.1.1...12.2.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 12.2.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [python-multipart](https://github.com/Kludex/python-multipart) from 0.0.22 to 0.0.26.
- [Release notes](https://github.com/Kludex/python-multipart/releases)
- [Changelog](https://github.com/Kludex/python-multipart/blob/master/CHANGELOG.md)
- [Commits](Kludex/python-multipart@0.0.22...0.0.26)

---
updated-dependencies:
- dependency-name: python-multipart
  dependency-version: 0.0.26
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat(runtime): stamp OTel spans with routing experiment baggage
padmak30 and others added 5 commits April 30, 2026 15:47
FAILED, STOPPED, and DELETING were raising RuntimeError; COMPLETED_WITH_ERRORS
was hitting the unknown-status RuntimeError. All terminal states now return the
response so callers can inspect result.status and result.error_details.
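
The behavior this commit message describes can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `EvalResult`, `resolve_status`, the `errorDetails` response key, and the `COMPLETED`, `PENDING`, and `IN_PROGRESS` status names are assumptions made for the example. Only FAILED, STOPPED, DELETING, and COMPLETED_WITH_ERRORS are named in the commit.

```python
from dataclasses import dataclass
from typing import Optional

# FAILED, STOPPED, DELETING and COMPLETED_WITH_ERRORS come from the commit
# message; COMPLETED and the in-progress names below are assumptions.
TERMINAL_STATUSES = frozenset(
    {"COMPLETED", "COMPLETED_WITH_ERRORS", "FAILED", "STOPPED", "DELETING"}
)
IN_PROGRESS_STATUSES = frozenset({"PENDING", "IN_PROGRESS"})


@dataclass
class EvalResult:
    """Hypothetical result type exposing status and error_details."""

    status: str
    error_details: Optional[str] = None


def resolve_status(response: dict) -> Optional[EvalResult]:
    """Return a result for any terminal status, None while still running.

    Only a status outside both sets raises, so callers decide how to
    treat FAILED or COMPLETED_WITH_ERRORS by inspecting the result.
    """
    status = response.get("status")
    if status in TERMINAL_STATUSES:
        # "errorDetails" is an assumed response key.
        return EvalResult(status=status, error_details=response.get("errorDetails"))
    if status in IN_PROGRESS_STATUSES:
        return None  # caller keeps polling
    raise RuntimeError(f"Unknown batch evaluation status: {status!r}")
```

Returning the response for every terminal state keeps the error path in the caller's hands: a failed run surfaces through `result.status` and `result.error_details` rather than through an exception.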
docs: add preview warning docstrings to all new evo methods and classes
@padmak30 padmak30 requested a review from a team April 30, 2026 22:04
@github-actions
Contributor

github-actions Bot commented Apr 30, 2026

✅ No Breaking Changes Detected

No public API breaking changes found in this PR.

@notgitika
Contributor

Couple things I noticed:

  1. BatchEvaluationResult doesn't have a description field, but run_dataset_evaluation() passes description=response.get("description") when building it. Pydantic v2 will throw a ValidationError for the unexpected field — this would crash on every successful batch eval. Needs description: Optional[str] = None added to the model.

  2. The timeout test (test_poll_for_results_timeout_raises_timeout_error) patches time.time to control the clock, but _poll_for_results actually uses time.monotonic(). The test only passes because the real monotonic clock advances past the timeout between iterations — it's not actually testing the timeout path. The patch target should be time.monotonic instead.
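
The second point can be illustrated with a small stand-alone sketch. The `poll_for_results` function below is a simplified stand-in written for this example, not the SDK's `_poll_for_results`. It shows that a loop built on `time.monotonic()` only takes its timeout path deterministically when the test patches `time.monotonic`, since patching `time.time` leaves the loop's clock untouched.

```python
import time
from unittest import mock


def poll_for_results(fetch, timeout_s: float, interval_s: float = 1.0):
    """Call fetch() until it returns a non-None value or the deadline passes.

    Simplified stand-in for illustration; uses the monotonic clock so
    wall-clock adjustments cannot shorten or extend the timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout_s}s")
        time.sleep(interval_s)


def check_timeout_path() -> int:
    """Drive the loop with a fake monotonic clock; return the sleep count."""
    # Clock readings: deadline computation, first check, second check.
    fake_clock = iter([0.0, 5.0, 11.0])
    with mock.patch("time.monotonic", side_effect=lambda: next(fake_clock)), \
            mock.patch("time.sleep") as fake_sleep:
        try:
            poll_for_results(lambda: None, timeout_s=10.0)
        except TimeoutError:
            return fake_sleep.call_count
    raise AssertionError("expected TimeoutError")
```

With the fake clock the loop sleeps once (at 5.0 s, before the 10 s deadline) and raises on the next check (11.0 s), so the test exercises the timeout branch without waiting on the real clock.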

notgitika previously approved these changes Apr 30, 2026
jariy17 previously approved these changes Apr 30, 2026
Contributor

@notgitika notgitika left a comment

LGTM

@padmak30 padmak30 merged commit 719907b into main Apr 30, 2026
35 checks passed
6 participants