MTRL Launch PR#5919
Merged
…2011) * Add master-mtrl-release branch to PR checks * feat: add RMP (Restricted Model Package) support for ModelBuilder - Add shared rmp_utils.py with is_restricted_model_package() and get_container_s3_uri() utilities - Fix _fetch_and_cache_recipe_config() crash when s3_uri is None - Fix _build_single_modelbuilder() non-LORA path to use model_package_name for RMP (CP resolves escrow server-side) - Fix _convert_model_data_source_to_local() to return None for RMP - Add unit tests and regression tests (14 passing) Builds on top of PR #2010 (shapes.py Optional fix + BedrockModelBuilder RMP). Rebase needed after #2010 merges. * fix: address PR review feedback - Rename rmp_utils.py to model_package_utils.py - Rename get_container_s3_uri to get_s3_uri_from_inference_spec with null checks - Add early RMP exit in _build_single_modelbuilder before containers[0] access - Add RMP guard in _deploy_model_customization and fetch_endpoint_names_for_base_model - Remove internal acronyms from comments and error messages * fix: simplify RMP detection, add Nova env vars, improve tests - Remove fallback detection — use managed_storage_type == Restricted only - Add Nova hosting config (image + env vars) for Nova RMP in build path - Conditionally pass image only when provided by user - Update unit tests: 22 tests covering all edge cases * fix: add enable_network_isolation for Nova restricted model packages --------- Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com> Co-authored-by: Jonathan makunga <makung@amazon.com>
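The restricted-package detection and the null-checked S3 lookup described in this commit could look roughly like the sketch below. The function names come from the commit message; the attribute paths (`managed_storage_type` on the package, `inference_specification.containers[0].model_data_source.s3_data_source.s3_uri`) are assumptions made for illustration and may not match the real resource shapes.

```python
from types import SimpleNamespace
from typing import Any, Optional

RESTRICTED = "Restricted"


def is_restricted_model_package(model_package: Any) -> bool:
    """Return True only when managed_storage_type equals 'Restricted' (no fallback detection)."""
    # Assumption: the attribute lives directly on the package object.
    return getattr(model_package, "managed_storage_type", None) == RESTRICTED


def get_s3_uri_from_inference_spec(model_package: Any) -> Optional[str]:
    """Walk the inference spec with null checks; return None when any level is missing."""
    spec = getattr(model_package, "inference_specification", None)
    containers = getattr(spec, "containers", None) or []
    if not containers:
        return None
    source = getattr(containers[0], "model_data_source", None)
    s3_source = getattr(source, "s3_data_source", None)
    return getattr(s3_source, "s3_uri", None)


# Hypothetical stand-ins for real model package resources.
restricted_pkg = SimpleNamespace(
    managed_storage_type="Restricted", inference_specification=None
)
regular_pkg = SimpleNamespace(
    managed_storage_type=None,
    inference_specification=SimpleNamespace(
        containers=[
            SimpleNamespace(
                model_data_source=SimpleNamespace(
                    s3_data_source=SimpleNamespace(s3_uri="s3://example-bucket/model/")
                )
            )
        ]
    ),
)
```

Checking `is_restricted_model_package()` before touching `containers[0]` mirrors the "early RMP exit" mentioned in the review-feedback commit.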
…release (#2015) * feat: Add sagemaker-rft SDK for AgenticRFT integration (#1977) * feat: Add sagemaker-rft subpackage for multi-turn RFT customer integration * fix: update sagemaker-rft for AgenticRFTRuntimeService integration * feat: Add aws_rft_sdk source package - RolloutFeedbackClient with SigV4-signed CompleteTrajectory and UpdateReward - @rft_handler decorator: auto-reports completion+reward on success, errors on failure - RFTContext using contextvars (not threading.local) for Strands thread compatibility - Strands wrap_model adapter: injects X-Rft-* headers via client_args default_headers - Maps both snake_case and camelCase metadata keys (jobId/rolloutId from TLM) * chore: Remove superseded aws-rft-sdk package The aws-rft-sdk/ directory was the original prototype, now fully superseded by sagemaker-rft/ which provides the same functionality under the sagemaker.rft namespace with proper packaging, Pydantic models, and additional adapters (LangChain). Keeping both causes confusion about which package to import. * chore: Add pre-built wheel for sagemaker-rft 0.1.0 Include distributable wheel and sdist so other teams can install directly without building from source. * Consolidate individual RFT headers into single X-RFT-Metadata header Replace three separate headers (X-Rft-Job-Arn, X-Trajectory-Id, X-Span-Id) with a single X-RFT-Metadata header containing a JSON object with: - job_id: training job ARN - experiment_id: groups turns into a single trajectory - rollout_id: unique ID for each rollout (replaces span-id) * Fix RolloutFeedbackClient field name mismatch with TLM metadata The TLM sends metadata with jobId/rolloutId but RolloutFeedbackClient expected job_arn/trajectory_id, causing complete_trajectory and update_reward to silently skip. Accept both naming conventions (snake_case and camelCase). 
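Accepting both naming conventions can be done with a small alias table, as in this sketch. The key pairs (`jobId`/`job_arn`, `rolloutId`/`trajectory_id`) are taken from the commit message; `normalize_metadata` and the canonical output keys are hypothetical names, not the SDK's API.

```python
from typing import Any, Dict, Optional

# Canonical key -> accepted spellings, checked in order.
_KEY_ALIASES = {
    "job_id": ("jobId", "job_arn"),
    "rollout_id": ("rolloutId", "trajectory_id"),
}


def normalize_metadata(metadata: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Map camelCase or snake_case metadata keys onto one canonical set."""
    normalized: Dict[str, Optional[Any]] = {}
    for canonical, aliases in _KEY_ALIASES.items():
        # Take the first alias that is present with a non-None value.
        normalized[canonical] = next(
            (metadata[key] for key in aliases if metadata.get(key) is not None),
            None,
        )
    return normalized
```

A caller can then skip feedback only when a canonical key is actually `None`, instead of silently skipping because the sender used the other casing.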
* Fix inference params field name mismatch with TLM payload - decorators.py: Accept both inferenceParams (camelCase from TLM) and inference_params (snake_case) in payload - strands.py: Accept both maxTokens/topP (camelCase) and max_tokens/top_p (snake_case) when applying to model params * Fix SDK payload casing, rollout ID, and inference params injection 1. headers.py: Accept both camelCase (jobId, rolloutId, experimentId) and snake_case (job_arn, trajectory_id) from TLM payload. Use the passed rollout ID instead of generating a new UUID. 2. strands.py: Use model.update_config() instead of setting params dict directly — params is None on OpenAIModel, update_config is the supported API for dynamic parameter changes. 3. decorators.py: Already accepts both inferenceParams and inference_params (fixed earlier). * Fix update_config: wrap inference params in params={} for Strands API * Add status param to complete_trajectory and report_error/report_complete helpers - complete_trajectory() accepts status param ("ready" or "failed") - Added report_error() to mark trajectory as failed with optional reward - Added report_complete() convenience method for success path - 404 errors already handled gracefully via _signed_post exception handling * feat: Updated with correct headers for rft * feat: Added variable endpoint for rft runtime and feedback to complete trajectory and reward * feat: added temp auth.py module and updated feedback.py * feat: added temp auth.py module and updated feedback.py * Support list rewards in rft_handler and report_complete - rft_handler: when reward is a list, calls complete_trajectory() + update_reward(list) separately instead of report_complete(float). This supports multi-turn trajectories with per-turn rewards. - report_complete: accepts float | list[float]. * fix: sagemaker-rft SDK bearer token auth, auto lifecycle, region env var - feedback.py: Switch from SigV4 to bearer token auth via aws_sagemaker_token_generator.provide_token(). 
Add region fallback from AWS_REGION env var. Support both camelCase (jobId, rolloutId) and snake_case (job_arn, trajectory_id) metadata keys. Handle 404 gracefully. Add report_complete() and report_error() convenience methods. - decorators.py: rft_handler now auto-calls CompleteTrajectory + UpdateReward when result dict contains "reward" key. Auto-calls report_error on exceptions. Support inferenceParams (camelCase). - models.py: Region default reads from AWS_REGION env var. * fix: Handle error results and terminal trajectory status in SDK - decorators.py: _handle_result checks result["status"] == "error" and calls report_error() instead of report_complete(). Previously, agent returning {"status": "error", "reward": 0.0} was treated as success, calling CompleteTrajectory(status=ready) which conflicted with Runtime's failTrajectory(). Now errors are properly reported so TLM retries immediately instead of waiting 10 min timeout. - feedback.py: CompleteTrajectory and UpdateReward gracefully handle 400 "not in valid status" (trajectory already failed by Runtime). Logs warning and skips instead of raising exception. * fix(rft): make feedback reporting non-fatal and fix endpoint/timeout - Move _handle_result to try/except in else clause so feedback failures do not prevent returning the rollout result - Fix _build_endpoint to not include "prod" in the URL prefix - Increase feedback HTTP timeout from 30s to 120s * Handle trajectory-already-processed errors in rft_handler decorator Centralize detection of 'not in valid status' and 'Cannot transition trajectory' errors into a shared helper. The decorator now catches these from the agent function and returns {status: skipped} instead of re-raising, so agents no longer need per-project workarounds. 
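The two behaviors described above (non-fatal feedback reporting, and returning a skipped status for already-processed trajectories) can be sketched as a decorator. The two error substrings come from the commit message; `rollout_handler`, `is_already_processed_error`, and the `report_result` callback are hypothetical names, and the real decorator also reports errors for other exceptions, which this sketch omits.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

_ALREADY_PROCESSED_MARKERS = ("not in valid status", "Cannot transition trajectory")


def is_already_processed_error(exc: Exception) -> bool:
    """Shared check for errors meaning the trajectory was already finalized."""
    message = str(exc)
    return any(marker in message for marker in _ALREADY_PROCESSED_MARKERS)


def rollout_handler(report_result: Callable[[Any], None]) -> Callable:
    """Wrap an agent function; report its result without letting feedback failures escape."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                if is_already_processed_error(exc):
                    logger.warning("Trajectory already processed, skipping: %s", exc)
                    return {"status": "skipped"}
                raise
            else:
                # Feedback failures are logged but never prevent returning the result.
                try:
                    report_result(result)
                except Exception:
                    logger.warning("Feedback reporting failed", exc_info=True)
                return result

        return wrapper

    return decorator
```

Putting the reporting call in the `else` clause keeps it outside the `try` that guards the agent function, so a reporting error is never mistaken for an agent error.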
* fix: use Optional[] syntax in Pydantic models for Python 3.9 compatibility * refactor(rft): use sagemaker-core token generator instead of inline implementation Replace the inline SigV4 token signing logic and aws-sagemaker-token-generator fallback with sagemaker.core.token_generator.generate_token from sagemaker-core. Add sagemaker-core as a dependency in pyproject.toml. * refactor: move rft module from sagemaker-rft to sagemaker-train Move the rft subpackage from the standalone sagemaker-rft package into sagemaker-train as sagemaker.train.rft. Update all internal imports accordingly. Add requests and pydantic to sagemaker-train dependencies. Remove the sagemaker-rft package directory. * refactor(rft): remove auth.py wrapper, use generate_token directly The get_rft_api_key wrapper in auth.py was just passing args through to generate_token. Call generate_token directly in feedback.py instead. --------- Co-authored-by: Tritin Truong <tttritin@amazon.com> Co-authored-by: Barret Pickett <mrpic@amazon.com> Co-authored-by: James Yu <jamesfyu@amazon.com> * refactor: Rename headers, URIs, and decorator (#1990) * Rename finetuning-job-runtime endpoint to job-runtime (#1999) * Update strands.py (#2017) --------- Co-authored-by: Mike Shen <109769013+xiaoxshe@users.noreply.github.com> Co-authored-by: Tritin Truong <tttritin@amazon.com> Co-authored-by: Barret Pickett <mrpic@amazon.com> Co-authored-by: James Yu <jamesfyu@amazon.com>
…#2012) MTRL training outputs checkpoints to Restricted Model Packages where S3 URIs are hidden (ManagedStorageType: Restricted). This change: 1. sagemaker-core: Make s3_uri, s3_data_type, compression_type optional in S3ModelDataSource so ModelPackage.get() can deserialize RMPs without crashing on missing s3_uri field. 2. sagemaker-serve: BedrockModelBuilder now uses customModelDataSource.modelPackageArnDataSource when the model artifact is a model package ARN (RMP), instead of the unsupported modelSourceConfig.s3DataSource path. Falls back to model package ARN in _get_s3_artifacts when s3_uri is hidden for Nova RMP models. Tested end-to-end: ModelPackage.get() on RMP -> BedrockModelBuilder.deploy() -> Bedrock OD endpoint Active -> inference invocation successful. Co-authored-by: Mahima Chaudhary <mahchy@amazon.com>
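To illustrate why the optional fields matter, here is a minimal stand-in using a plain dataclass rather than the generated sagemaker-core shape. The three field names come from the commit message; `parse_s3_model_data_source` is a hypothetical helper, and the PascalCase payload keys are assumed to follow the usual SageMaker API naming.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class S3ModelDataSource:
    """Stand-in shape: every field is optional so a hidden S3 URI does not break parsing."""

    s3_uri: Optional[str] = None
    s3_data_type: Optional[str] = None
    compression_type: Optional[str] = None


def parse_s3_model_data_source(payload: Dict[str, Any]) -> S3ModelDataSource:
    """Build the shape from an API-style dict, tolerating missing keys."""
    return S3ModelDataSource(
        s3_uri=payload.get("S3Uri"),
        s3_data_type=payload.get("S3DataType"),
        compression_type=payload.get("CompressionType"),
    )
```

With required fields, an empty payload for a restricted package would fail at construction time; with optional fields the caller gets `s3_uri=None` and can fall back to the model package ARN.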
* feat: Add MultiTurnRLTrainer for Agentic RFT jobs (#1988) * MTRL Evaluator (#1989) * feat: Add SageMaker token generator to sagemaker-core (#1983) * Feature processor v3 (#5565) * Feature store v3 (#5490) * feat: Add Feature Store Support to V3 * Add feature store tests --------- Co-authored-by: adishaa <adishaa@amazon.com> * feat: feature_processor v3 * integ tests * fix * chore(docs): Add API docs * fix: Fix flaky integ tests * fix diff * chore: rename parameter + cleanup comments * Feature store v3 (#5490) * feat: Add Feature Store Support to V3 * Add feature store tests --------- Co-authored-by: adishaa <adishaa@amazon.com> * add pyspark to test deps * add test deps * fix unit test deps * pin setuptools<82 for feature-processor and unit tests * Set JAVA_HOME for integ tests which requires java * fix spark session bug * fix(feature-processor): Fix Spark session config and Ivy cache race condition Isolate Ivy cache per Spark session via spark.jars.ivy to prevent concurrent pytest-xdist workers from corrupting shared /root/.ivy2/cache during Maven dependency resolution in CI. * revert previous change + create different ivy cache per test to fix concurrent writes in CI * revert changes to sagemaker-core * refactor(feature-processor): Migrate to FeatureGroup resource API - Replace sagemaker_session.describe_feature_group() calls with FeatureGroup.get() - Update _input_loader.py to use FeatureGroup resource attributes instead of dictionary access - Update feature_scheduler.py to use FeatureGroup.get() and access creation_time as attribute - Update _feature_group_lineage_entity_handler.py to return FeatureGroup resource instead of Dict - Remove unused imports (Dict, Any, FEATURE_GROUP, CREATION_TIME constants) - Replace dictionary key access with typed resource properties (offline_store_config, data_catalog_config, event_time_feature_name, etc.) 
- Update unit tests to reflect new FeatureGroup resource API usage - Improves type safety and reduces reliance on dictionary-based API responses * add `build` to test_requirements * add upper bounds for test dependencies * move feature-processor config to sagemaker-mlops optional deps --------- Co-authored-by: Aditi Sharma <165942273+Aditi2424@users.noreply.github.com> Co-authored-by: adishaa <adishaa@amazon.com> Co-authored-by: Basssem Halim <bhhalim@amazon.com> Co-authored-by: Molly He <mollyhe@amazon.com> * Added iso regions to dji-lmi (#5595) * Add docker-compose path to allow local training (#5598) * Add docker-compose path * Check for MacOS * Remove Unused method (#5593) * V3 Bug Fixes (#5601) * V3 Bug Fixes * fix(model_builder): Only set s3_upload_path for S3 URIs in passthrough In _build_for_passthrough(), model_path could be a local /tmp path. Setting s3_upload_path to a local path caused CreateModel API to reject the modelDataUrl with a validation error since it requires s3:// or https:// URIs. Now only S3 URIs are assigned to s3_upload_path; local paths are handled separately by _prepare_for_mode() in LOCAL_CONTAINER mode. * Test fixes * Bug fix 3 and 4 * fix: Add PipelineVariable support to ModelTrainer fields (fixes #5524) (#5608) * fix: Add PipelineVariable support to ModelTrainer fields (fixes #5524) Extend StrPipeVar type to ModelTrainer's direct fields: - training_image: Optional[str] -> Optional[StrPipeVar] - algorithm_name: Optional[str] -> Optional[StrPipeVar] - training_input_mode: Optional[str] -> Optional[StrPipeVar] - environment: Dict[str, str] -> Dict[str, StrPipeVar] This follows the existing V3 pattern already used by SourceCode, OutputDataConfig, and Compute (for instance_type). The StrPipeVar type alias and PipelineVariable.__get_pydantic_core_schema__() already exist in the codebase. This unblocks V2->V3 migration for SageMaker Pipelines users who need to pass ParameterString to ModelTrainer fields. 
Fixes #5524 * test: Add unit tests for PipelineVariable support + fix PipelineVariable-safe logging - Add test_model_trainer_pipeline_variable.py with 9 tests: - 4 PipelineVariable acceptance tests (training_image, algorithm_name, training_input_mode, environment) - 4 regression tests (real string values still work) - 1 invalid type rejection test - Fix PipelineVariable-safe logging in model_post_init (avoid __str__ on PipelineVariable which raises TypeError) All 57 tests pass (48 existing + 9 new, 0 regressions). --------- Co-authored-by: Amit Modi <modiamit@amazon.com> * Fix model registration with a model card (#5611) * Add docker-compose path * Check for MacOS * Fix model registration with a model card * Account for both ModelCard and ModelPackageModelCard objects * Add unit tests for model card during model registration * updated the SDK to use latest LMI image for sdk v3.x (#5616) * add EUCS to Jumpstart region config (#5615) Co-authored-by: Molly He <mollyhe@amazon.com> * Fix handling of training step dependencies to allow successful pipeline creation (#5618) * Add docker-compose path * Check for MacOS * Fix model registration with a model card * Account for both ModelCard and ModelPackageModelCard objects * Add unit tests for model card during model registration * Fix handling of dependencies in get_training_code_hash workflow utility * Update docstring * Add unit tests * sagemaker-core rich upper bound relax back to 15.0.0 (#5620) * Release sagemaker-core 2.5.1 (#5623) * Update changelog for sagemaker-core 2.5.1 (#5624) * Release sagemaker-core 2.5.1 * Update changelog for sagemaker-core 2.5.1 * docs: Add migration tool (MCP server) section to migration guide (#5628) * docs: Add migration tool (MCP server) section to migration guide Add instructions for installing and configuring the SageMaker SDK migration MCP server tool. Includes setup for Kiro, Kiro CLI, VS Code (Cline), Claude Desktop, and Cursor. 
Documents available tools (analyze_code, transform_code, validate_code, ask_question), example usage, and troubleshooting steps. * docs: Update Feature Store status to supported in migration guide Feature Store is now supported in V3 via sagemaker.core.resources.FeatureGroup and FeatureStore. Update the status from REMOVED to SUPPORTED. * fix: resolve PermissionError during local mode cleanup of root-owned Docker files (#5629) * fix: use docker fallback to clean up root-owned files in local mode * Remove alpine * Use network flag * Add -mindepth * Use chmod -R 777 via Docker * Add unit test to sagemaker-core for permissionError docker fix * Migration guide update (#5633) * docs: Add migration tool (MCP server) section to migration guide Add instructions for installing and configuring the SageMaker SDK migration MCP server tool. Includes setup for Kiro, Kiro CLI, VS Code (Cline), Claude Desktop, and Cursor. Documents available tools (analyze_code, transform_code, validate_code, ask_question), example usage, and troubleshooting steps. * docs: Update Feature Store status to supported in migration guide Feature Store is now supported in V3 via sagemaker.core.resources.FeatureGroup and FeatureStore. Update the status from REMOVED to SUPPORTED. * docs(migration): Add Codex CLI, VS Code Copilot, and Roo Code to MCP server IDE setup table Add configuration locations for additional IDEs that support the SageMaker migration MCP server: VS Code with Copilot, VS Code with Roo Code extension, and Codex CLI. * Migration guide update (#5636) * docs: Add migration tool (MCP server) section to migration guide Add instructions for installing and configuring the SageMaker SDK migration MCP server tool. Includes setup for Kiro, Kiro CLI, VS Code (Cline), Claude Desktop, and Cursor. Documents available tools (analyze_code, transform_code, validate_code, ask_question), example usage, and troubleshooting steps. 
* docs: Update Feature Store status to supported in migration guide Feature Store is now supported in V3 via sagemaker.core.resources.FeatureGroup and FeatureStore. Update the status from REMOVED to SUPPORTED. * docs(migration): Add Codex CLI, VS Code Copilot, and Roo Code to MCP server IDE setup table Add configuration locations for additional IDEs that support the SageMaker migration MCP server: VS Code with Copilot, VS Code with Roo Code extension, and Codex CLI. * docs: Update MCP server name from sagemaker-migration-mcp to sagemaker-sdk-helper Replace all references to the deprecated sagemaker-migration-mcp binary with the correct sagemaker-sdk-helper command and server name across installation, configuration, and troubleshooting sections. * fix(tuner): Include sm_drivers channel in HyperparameterTuner jobs (#5634) * fix(tuner): Include sm_drivers channel in HyperparameterTuner jobs When ModelTrainer has distributed=Torchrun(), the sm_drivers channel contains torchrun_driver.py and sm_train.sh which are required for multi-GPU execution. The tuner was not building this channel, causing the framework container to fall back to the legacy single-GPU entry point (python train.py) instead of torchrun. This caused a tensor size mismatch (batch_size vs accumulated_batch) in TRL's compute_loss when gradient_accumulation_steps > 1, because the single-process path doesn't partition batches across ranks. Fix: Replace _upload_source_code_and_configure_hyperparameters with _build_driver_and_code_channels that replicates ModelTrainer's channel building logic (sm_drivers, code, distributed.json, sourcecode.json, sm_train.sh). Also pass through environment and VPC config. 
* fix(tuner): Harden _build_training_job_definition against missing attributes - Use getattr with fallback for static_hyperparameters (fixes test_build_training_job_definition_includes_internal_channels) - Guard _prepare_model_trainer_for_tuning with isinstance check on entry_script to avoid calling _build_driver_and_code_channels on MagicMock model trainers - Guard environment passthrough with isinstance(env, dict) check - Guard VPC config passthrough with try/except for mock safety * fix(test): Rewrite tuner distributed integ test to match CI patterns - Use sagemaker_session fixture from conftest (auto-resolves role/region) - Use ml.m5.xlarge CPU instance (cheaper, available in CI) - Remove hardcoded role ARN and training_mode - Remove @pytest.mark.slow (not registered in CI config) - Use module-level function instead of class (matches other integ tests) - Use DEFAULT_CPU_IMAGE consistent with test_model_trainer.py * fix(tuner): Upload sourcedir.tar.gz for framework container compatibility The HPT API uses the legacy framework container path which expects sagemaker_submit_directory (a tar.gz on S3) to be downloaded and extracted to /opt/ml/code/. The previous approach of using a 'code' input channel mounted the code at /opt/ml/input/data/code/ instead, causing 'No such file or directory' errors. Fix: Create and upload sourcedir.tar.gz to S3, set both sagemaker_program and sagemaker_submit_directory hyperparameters. Remove the separate 'code' input channel since the framework container handles code extraction via sagemaker_submit_directory. 
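The packaging step from the sourcedir.tar.gz fix can be sketched with the standard library alone. The hyperparameter names `sagemaker_program` and `sagemaker_submit_directory` come from the commit message; the helper names are hypothetical, and the S3 upload itself is left out, so the URI passed in is assumed to point at wherever the archive was uploaded.

```python
import os
import tarfile
from typing import Dict


def build_sourcedir_tarball(source_dir: str, output_dir: str) -> str:
    """Archive the contents of source_dir as sourcedir.tar.gz inside output_dir."""
    # output_dir should be outside source_dir so the archive does not include itself.
    tar_path = os.path.join(output_dir, "sourcedir.tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        for name in sorted(os.listdir(source_dir)):
            # arcname keeps entries at the archive root, so they extract directly
            # into the target directory (for example /opt/ml/code/).
            tar.add(os.path.join(source_dir, name), arcname=name)
    return tar_path


def submit_hyperparameters(entry_script: str, submit_directory_s3_uri: str) -> Dict[str, str]:
    """Hyperparameters the framework container reads to locate and run the code."""
    return {
        "sagemaker_program": entry_script,
        "sagemaker_submit_directory": submit_directory_s3_uri,
    }
```

Because the container extracts the archive itself, no separate `code` input channel is needed in this flow.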
* test(tuner): Add unit tests for driver/code channel building Add 25 unit tests covering the tuner changes from PR #5634: - _prepare_model_trainer_for_tuning guard logic - _build_driver_and_code_channels sm_drivers channel creation - _build_training_job_definition _tuner_channels inclusion - Environment and VPC config passthrough - sourcedir.tar.gz upload and sagemaker_submit_directory HP - static_hyperparameters getattr fallback * feat(ci): Add Fortress Code Reviewer security scan workflow (#5639) Add GitHub Actions workflow to run Fortress Code Reviewer security scan on every PR against the master branch. The workflow: - Triggers on pull_request_target against master - Performs collaborator check (auto-approve for collaborators, manual approval for external contributors) - Configures AWS credentials via OIDC - Triggers the sagemaker-python-sdk-ci-fortress-scan CodeBuild project The CodeBuild project installs Fortress at runtime from S3-hosted wheels and uses Bedrock (Claude) to analyze code for security vulnerabilities. --- X-AI-Prompt: Add Fortress security scan GitHub workflow for PR scanning X-AI-Tool: Kiro * fixes for model builder (#5631) * fixes for model builder * add nova model support * fix env_vars merge, update integ test for LORA two-step deployment, fix unit tests for nova model support - env_vars: append recipe/nova config to existing env_vars instead of skipping - integ test: verify both base IC and adapter IC creation for LORA models - unit tests: add _is_nova_model mock to accommodate nova model support changes * update codegen to mark MinMemoryRequiredInMb as optional DescribeInferenceComponent returns empty ComputeResourceRequirements for adapter ICs (created with BaseInferenceComponentName), but the service model still marks MinMemoryRequiredInMb as required. Add a REQUIRED_TO_OPTIONAL_OVERRIDES config in the codegen so re-running shapes generation produces the correct Optional field. 
* add retry for adapter IC creation on transient endpoint-not-found * model builder fixes * Skip test_deploy_from_training_job: parallel cleanup race condition under investigation --------- Co-authored-by: Joshua Towner <jjtowner@amazon.com> Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> * Bedrock fix (#5642) * fix(bedrock): Poll for model Active status before creating deployment Add _wait_for_model_active() to poll get_custom_model until the model reaches Active status before calling create_custom_model_deployment. This fixes ValidationException when the custom model is not yet ready for deployment after create_custom_model returns. * feat(bedrock): Harden BedrockModelBuilder for production readiness Extract _is_nova_model() helper to eliminate duplicated Nova detection logic across deploy() and _get_s3_artifacts(). Uses getattr with safe defaults instead of fragile hasattr chains. Add input validation to deploy() and create_deployment(): - Raise ValueError when model_package is not set - Raise ValueError when custom_model_name or role_arn missing for Nova deployments - Raise ValueError when model_arn is empty in create_deployment Move json and urlparse imports to module level (were previously imported inside _get_checkpoint_uri_from_manifest). Replace f-string logging with lazy %s formatting throughout. Initialize status=None before the polling loop in _wait_for_model_active to avoid UnboundLocalError if the loop body never executes. 
Rewrite unit tests (43 tests) with full coverage: - _is_nova_model: recipe_name, hub_content_name, case insensitivity, missing base_model, None fields - __init__: None model, TrainingJob, ModelPackage - Client singletons: caching, injection - _fetch_model_package: ModelPackage, TrainingJob, ModelTrainer, unknown type - _get_s3_artifacts: None package, non-Nova, Nova delegation, Nova fallback - _get_checkpoint_uri_from_manifest: success, missing key, NoSuchKey, not TrainingJob, no artifacts, invalid JSON - _wait_for_model_active: immediate, polling, Failed, timeout - create_deployment: polling chain, extra kwargs, empty/None ARN - deploy: non-Nova, Nova full chain, hub_content_name detection, default deployment name, tags, missing params, None stripping Add integration tests for Nova E2E deployment: - Training job existence and status verification - Builder creation and Nova detection via _is_nova_model - S3 artifacts checkpoint validation - Full deploy-with-polling flow (marked @pytest.mark.slow) - Timeout behavior on bogus ARN - Validation error paths (no model_package, empty model_arn) - Resource cleanup fixture for deployments and custom models * feat(bedrock): Add deployment status polling after CreateCustomModelDeployment Previously create_deployment() only polled for the custom model to reach Active status before calling CreateCustomModelDeployment, but did not wait for the deployment itself to become Active. This caused callers to receive a deployment ARN that was still in Creating state, requiring manual polling in user code. Add _wait_for_deployment_active() that polls get_custom_model_deployment until status reaches Active, raises RuntimeError on Failed, and times out after max_wait seconds (default 3600s, poll interval 30s). Wire it into create_deployment() so the full flow is now: 1. _wait_for_model_active (poll model creation) 2. create_custom_model_deployment (API call) 3. 
_wait_for_deployment_active (poll deployment creation) Gracefully skips deployment polling if the API response does not contain a customModelDeploymentArn. Unit tests (48 passing): - _wait_for_deployment_active: immediate Active, polling, Failed status, timeout - create_deployment: full model+deployment polling chain, skip polling when no ARN in response - deploy Nova chain: updated to verify deployment polling * fix(integ): Fix region handling and add get-or-create Nova training job The TestModelCustomizationDeployment integ tests were failing with DescribeTrainingJob 'Requested resource not found' because the SageMaker SDK caches the first session's region internally. The session-scoped cleanup_e2e_endpoints fixture (autouse) was creating a session in us-east-1 (default) before the class fixtures could set us-west-2, causing all subsequent TrainingJob.get calls to hit the wrong region. Fix by setting AWS_DEFAULT_REGION=us-west-2 in the cleanup_e2e_endpoints fixture before any SageMaker session is created. Add tests/integ/conftest.py with a session-scoped nova_training_job_name fixture that implements get-or-create: - Checks if sdk-integ-nova-micro-sft exists and is Completed - If InProgress, waits for completion - If not found, uploads minimal training data to S3 and launches a Nova Micro SFT training job via SFTTrainer - Reused across test_bedrock_nova_e2e.py and TestBedrockNovaDeployment in test_model_customization_deployment Update both Nova test files to use the shared fixture instead of hardcoded training job names. * fix(integ): Use SAGEMAKER_REGION for cross-region training job lookup The SageMaker SDK's SageMakerClient reads SAGEMAKER_REGION env var at init time and caches the region for all subsequent API calls. The cleanup_e2e_endpoints session fixture was the first to create a SageMakerClient (in the default region), which then poisoned all subsequent TrainingJob.get calls regardless of the region parameter. 
Fix by setting SAGEMAKER_REGION=us-west-2 in cleanup_e2e_endpoints before any SDK session is created, since all resources in this test file live in us-west-2. The env var is restored after cleanup. In CodeBuild (us-west-2) this is a no-op since the default region already matches. The other test files (triton, tei, tgi) are not affected since they have their own fixtures and don't import from this file. * refactor(integ): Replace Nova integ tests with example notebook Remove us-east-1 Nova integration tests that cannot run in the us-west-2 CodeBuild environment: - Delete tests/integ/test_bedrock_nova_e2e.py - Delete tests/integ/conftest.py (Nova get-or-create fixture) - Remove TestBedrockNovaDeployment class from test_model_customization_deployment.py Add example_notebooks/bedrock_nova_deployment.ipynb covering the full Nova workflow: SFTTrainer fine-tuning, BedrockModelBuilder deploy with model+deployment polling, inference, and cleanup. The BedrockModelBuilder source code and unit tests (48 passing) are unchanged. The us-west-2 integ tests for non-Nova Bedrock deployment (TestModelCustomizationDeployment) remain. 
* docs: Add Bedrock model builder example notebooks Add notebooks demonstrating Bedrock deployment workflows: - bedrock-modelbuilder-deployment-nova.ipynb: Nova model deployment via BedrockModelBuilder with SFTTrainer fine-tuning - boto3_deployment_notebook.ipynb: Direct boto3 Bedrock deployment - model_builder_deployment_notebook(1).ipynb: ModelBuilder deployment - 07-ml-model-development(1).ipynb: ML model development workflow - sagemaker-serve/example_notebooks/bedrock_nova_deployment.ipynb: Clean Nova deployment example with polling, inference, and cleanup * docs: Add Bedrock model builder example notebooks Add notebooks demonstrating Bedrock deployment workflows: - bedrock-modelbuilder-deployment-nova.ipynb: Nova model deployment via BedrockModelBuilder with SFTTrainer fine-tuning - boto3_deployment_notebook.ipynb: Direct boto3 Bedrock deployment - model_builder_deployment_notebook(1).ipynb: ModelBuilder deployment - 07-ml-model-development(1).ipynb: ML model development workflow - sagemaker-serve/example_notebooks/bedrock_nova_deployment.ipynb: Clean Nova deployment example with polling, inference, and cleanup * fix(serve): Update Nova Bedrock deployment notebook with working e2e flow Simplify notebook to use existing completed training job with BedrockModelBuilder deploy flow. Fix Nova inference content format to use array of {text: ...} objects. Remove broken SFTTrainer cells that fail due to botocore service model mismatch. 
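The polling behavior described for `_wait_for_model_active` and `_wait_for_deployment_active` in this PR can be sketched as one generic loop. The defaults (3600 s maximum wait, 30 s interval), the `RuntimeError` on `Failed`, and the `status = None` initialization come from the commit messages; `wait_for_active`, the `get_status` callback, and the use of `TimeoutError` are assumptions, not the builder's actual code.

```python
import time
from typing import Callable, Optional


def wait_for_active(
    get_status: Callable[[], str],
    max_wait: float = 3600,
    poll_interval: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it returns 'Active', fails, or max_wait elapses."""
    # Initialized up front so the timeout message works even if the loop body never runs.
    status: Optional[str] = None
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = get_status()
        if status == "Active":
            return status
        if status == "Failed":
            raise RuntimeError("Resource entered Failed status")
        sleep(poll_interval)
    raise TimeoutError(f"Timed out after {max_wait}s; last status: {status}")
```

In the flow listed above, a loop like this would run once against the custom model before the deployment call and once against the deployment afterwards, with `get_status` wrapping the corresponding Bedrock `get_*` request.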
* Update CHANGELOG 3.6.0 (#5649) * Update CHANGELOG.md sagemaker-core * Update VERSION sagemaker-core * Update CHANGELOG.md sagemaker-train * Update VERSION sagemaker-train * Update pyproject.toml sagemaker-train * Update CHANGELOG.md sagemaker-serve * Update VERSION sagemaker-serve * Update pyproject.toml sagemaker-serve * Update CHANGELOG.md sagemaker-mlops * Update VERSION sagemaker-mlops * Update pyproject.toml sagemaker-mlops * Update VERSION meta * Update CHANGELOG.md meta * Update pyproject.toml meta * Eval Support Update (#5658) * fix(evaluate): Remove GPT OSS model evaluation restriction Remove the check that blocked evaluation for openai-reasoning-gpt-oss-20b and openai-reasoning-gpt-oss-120b base models. * test(evaluate): Update GPT OSS tests to verify models are allowed Update TestGPTOSSModelValidation to assert that openai-reasoning-gpt-oss-20b and openai-reasoning-gpt-oss-120b models can be used for evaluation, matching the removal of the restriction in base_evaluator. * feature: Add Support for AWS Batch Quota Management Job Submission and Job Priority Update (#5659) * feature: [SDKv3]Add Support for QM Job Submission and Job Priority Update (#1970) * Trigger checks in changed modules and dependent modules (#1958) * Update pr workflow (#1963) * Trigger checks in changed modules and dependent modules * Removing github token dependency * Add back GH_PAT token to detect changes (#1965) * feature: Add Support for QM Job Submission and Job Priority Update --------- Co-authored-by: aviruthen <91846056+aviruthen@users.noreply.github.com> * feature: Updating aws_batch TrainingQueue integration test to support quota management. (#1978) * feature: Added an example notebook for QuotaManagement job submission on AWS Batch TrainingQueues. 
(#1980) * fix: aws_batch/test_training_queue QM unit test fix --------- Co-authored-by: mnganesh-amzn <mnganesh@amazon.com> Co-authored-by: aviruthen <91846056+aviruthen@users.noreply.github.com> * Migration MD Update (#5655) * updated the SDK to use latest LMIv22 image for sdk v3.x (#5640) * fix: Sync Nova hosting configs with AGISageMakerInference (#5664) Align _NOVA_HOSTING_CONFIGS CONTEXT_LENGTH and MAX_CONCURRENCY values with ALLOWLISTED_CONFIGURATIONS from AGISageMakerInference constants.py. Key changes: - micro: correct context/concurrency for g5, g6 instances; add g6e types - lite: add g6.12xlarge, g6.24xlarge; fix p5 to 128000 context - pro: remove unsupported g6.48xlarge; fix p5 to 24000/1 - lite-v2: add g6.48xlarge; fix p5 to 128000 context * feat: MLflow metrics visualization, enhanced wait UI, and eval job links (#5662) * Intermediary checkpoint * Evaluation job update * Fix studio domain mismatch for url, update text color, add link of evaluation job * Add underscore to fine-tune and eval job links * Update link to console, conditionally display studio link, update link color to blue * Always show console link, conditional show studio link * Minor update to execution link names * Fix region issue for studio url * Revert notebook change to original * Address PR readiness * Fix sagemaker-train unit tst * Update resources_codegen based on sagemaker-core change * feature: add telemetry attribution module for SDK usage provenance (#5661) * feature: add telemetry attribution module for SDK usage provenance * feature: add TrainingJob ARN to telemetry for training jobs and fixed bug with telemetry not being sent for *Trainer.train() if sagemaker_session is not provided * adding createdBy metadata to user agent string if attribution env var has been set to aid in resource attribution * fix: removed unused patch on builtins.open in test_create_with_byoc which was not being used and causing unintended patches to open calls elsewhere --------- Co-authored-by: 
Ryan Tanaka <rrtanaka@amazon.com> Co-authored-by: Molly He <mollyhe@amazon.com> * feature: extend list_jobs_by_share for quota_share_name (#5669) Co-authored-by: houtampl <houtampl@amazon.com> * fix: aws_batch integ test resources are now uniquely named by test run. (#5666) * Support IAM role for BaseEvaluator (#5671) * Updating changelog, version, and pyproject files (#5673) * fix(evaluate): Remove ModelPackageConfig from EvaluateBaseModel steps (#5635) When evaluate_base_model=True, the EvaluateBaseModel step in both DETERMINISTIC_TEMPLATE and CUSTOM_SCORER_TEMPLATE incorrectly included ModelPackageConfig with SourceModelPackageArn, causing the base model evaluation to load fine-tuned model weights instead of using only the base model from the public hub. This made both evaluations identical, leading users to believe fine-tuning had no effect. Remove ModelPackageConfig from the EvaluateBaseModel step in both templates so it only uses BaseModelArn from ServerlessJobConfig. The EvaluateCustomModel step retains ModelPackageConfig to correctly load fine-tuned weights. This is consistent with the fix already applied to the LLMAJ_TEMPLATE. --- X-AI-Prompt: Fix BenchMarkEvaluator evaluate_base_model bug from D406780217 X-AI-Tool: Kiro sim: https://t.corp.amazon.com/D406780217 * Fix: hardcode handler_name = "lambda_function.lambda_handler" to match the zip entry name. 
(#5692) * Fix lambda function handler name * Add integ test * Update integ test to wait for lambda call * feat: add telemetry emitter to ScriptProcessor and FrameworkProcessor run methods (#5697) Co-authored-by: Ryan Tanaka <rrtanaka@amazon.com> * fix: respect accept_eula in ModelBuilder LoRA deployment path (#5705) * Update accept_eula to respect user setup * Enable EULA acceptance in model customization tests Set accept_eula to True in model builder to fix tests * fix: add missing model_path attr in TestLoraAcceptEula To fix failing unit tests in PR: #5696 * fix(tests): fix TestLoraAcceptEula missing dataclass attrs and patches --------- Co-authored-by: Molly He <mollyhe@amazon.com> Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com> * chore: Updated changelog, version and pyproject.toml for release (#5706) * feat: Add SageMaker token generator to sagemaker-core Embed the aws-sagemaker-token-generator library into sagemaker.core so users can generate SageMaker bearer tokens without installing a separate wheel. 
Usage: from sagemaker.core.aws_sagemaker_token_generator import provide_token token = provide_token(region='us-east-1') --------- Co-authored-by: Bassem Halim <bassemamir459@gmail.com> Co-authored-by: Aditi Sharma <165942273+Aditi2424@users.noreply.github.com> Co-authored-by: adishaa <adishaa@amazon.com> Co-authored-by: Basssem Halim <bhhalim@amazon.com> Co-authored-by: Molly He <mollyhe@amazon.com> Co-authored-by: Zachary David Saunders <zsaund@amazon.com> Co-authored-by: Bobby Lindsey <bobbywlindsey@users.noreply.github.com> Co-authored-by: Gokul Anantha Narayanan <166456257+nargokul@users.noreply.github.com> Co-authored-by: Amit <modi.osu@gmail.com> Co-authored-by: Amit Modi <modiamit@amazon.com> Co-authored-by: Rohit Kumar Srivastava <141.srivastava@gmail.com> Co-authored-by: IshaChid76 <49986634+IshaChid76@users.noreply.github.com> Co-authored-by: jam-jee <jamjee@amazon.com> Co-authored-by: rsareddy0329 <rsareddy0329@gmail.com> Co-authored-by: Joshua Towner <jjtowner@amazon.com> Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> Co-authored-by: David Lindskog <davlind@amazon.com> Co-authored-by: mnganesh-amzn <mnganesh@amazon.com> Co-authored-by: aviruthen <91846056+aviruthen@users.noreply.github.com> Co-authored-by: Ryan <ryantanaka.y@gmail.com> Co-authored-by: Ryan Tanaka <rrtanaka@amazon.com> Co-authored-by: ampleh <22372465+ampleh@users.noreply.github.com> Co-authored-by: houtampl <houtampl@amazon.com> Co-authored-by: Syed Mujtaba <42322958+mujtaba1747@users.noreply.github.com> * feat: Add sagemaker-rft SDK for AgenticRFT integration (#1977) * feat: Add sagemaker-rft subpackage for multi-turn RFT customer integration * fix: update sagemaker-rft for AgenticRFTRuntimeService integration * feat: Add aws_rft_sdk source package - RolloutFeedbackClient with SigV4-signed CompleteTrajectory and UpdateReward - @rft_handler decorator: auto-reports completion+reward on success, errors on failure - RFTContext using contextvars (not threading.local) for Strands 
thread compatibility - Strands wrap_model adapter: injects X-Rft-* headers via client_args default_headers - Maps both snake_case and camelCase metadata keys (jobId/rolloutId from TLM) * chore: Remove superseded aws-rft-sdk package The aws-rft-sdk/ directory was the original prototype, now fully superseded by sagemaker-rft/ which provides the same functionality under the sagemaker.rft namespace with proper packaging, Pydantic models, and additional adapters (LangChain). Keeping both causes confusion about which package to import. * chore: Add pre-built wheel for sagemaker-rft 0.1.0 Include distributable wheel and sdist so other teams can install directly without building from source. * Consolidate individual RFT headers into single X-RFT-Metadata header Replace three separate headers (X-Rft-Job-Arn, X-Trajectory-Id, X-Span-Id) with a single X-RFT-Metadata header containing a JSON object with: - job_id: training job ARN - experiment_id: groups turns into a single trajectory - rollout_id: unique ID for each rollout (replaces span-id) * Fix RolloutFeedbackClient field name mismatch with TLM metadata The TLM sends metadata with jobId/rolloutId but RolloutFeedbackClient expected job_arn/trajectory_id, causing complete_trajectory and update_reward to silently skip. Accept both naming conventions (snake_case and camelCase). * Fix inference params field name mismatch with TLM payload - decorators.py: Accept both inferenceParams (camelCase from TLM) and inference_params (snake_case) in payload - strands.py: Accept both maxTokens/topP (camelCase) and max_tokens/top_p (snake_case) when applying to model params * Fix SDK payload casing, rollout ID, and inference params injection 1. headers.py: Accept both camelCase (jobId, rolloutId, experimentId) and snake_case (job_arn, trajectory_id) from TLM payload. Use the passed rollout ID instead of generating a new UUID. 2. 
strands.py: Use model.update_config() instead of setting params dict directly — params is None on OpenAIModel, update_config is the supported API for dynamic parameter changes. 3. decorators.py: Already accepts both inferenceParams and inference_params (fixed earlier). * Fix update_config: wrap inference params in params={} for Strands API * Add status param to complete_trajectory and report_error/report_complete helpers - complete_trajectory() accepts status param ("ready" or "failed") - Added report_error() to mark trajectory as failed with optional reward - Added report_complete() convenience method for success path - 404 errors already handled gracefully via _signed_post exception handling * feat: Updated with correct headers for rft * feat: Added variable endpoint for rft runtime and feedback to complete trajectory and reward * feat: added temp auth.py module and updated feedback.py * feat: added temp auth.py module and updated feedback.py * Support list rewards in rft_handler and report_complete - rft_handler: when reward is a list, calls complete_trajectory() + update_reward(list) separately instead of report_complete(float). This supports multi-turn trajectories with per-turn rewards. - report_complete: accepts float | list[float]. * fix: sagemaker-rft SDK bearer token auth, auto lifecycle, region env var - feedback.py: Switch from SigV4 to bearer token auth via aws_sagemaker_token_generator.provide_token(). Add region fallback from AWS_REGION env var. Support both camelCase (jobId, rolloutId) and snake_case (job_arn, trajectory_id) metadata keys. Handle 404 gracefully. Add report_complete() and report_error() convenience methods. - decorators.py: rft_handler now auto-calls CompleteTrajectory + UpdateReward when result dict contains "reward" key. Auto-calls report_error on exceptions. Support inferenceParams (camelCase). - models.py: Region default reads from AWS_REGION env var. 
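The commits above repeatedly mention accepting both camelCase (jobId, rolloutId) and snake_case (job_arn, trajectory_id) metadata keys. A minimal sketch of that lookup pattern follows; the alias table and the `read_field` helper are hypothetical names chosen for illustration and are not taken from the SDK source.

```python
from typing import Any, Mapping, Optional

# Hypothetical alias table: each logical field lists the key spellings
# named in the commit notes (camelCase from TLM first, then snake_case).
_KEY_ALIASES = {
    "job": ("jobId", "job_arn"),
    "rollout": ("rolloutId", "trajectory_id"),
}


def read_field(metadata: Mapping[str, Any], field: str) -> Optional[Any]:
    """Return the value of the first alias of `field` that is set, else None."""
    for key in _KEY_ALIASES[field]:
        value = metadata.get(key)
        if value is not None:
            return value
    return None
```

With a lookup like this, a payload using either spelling resolves to the same value, so a caller would not silently skip reporting when the casing differs.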
* fix: Handle error results and terminal trajectory status in SDK - decorators.py: _handle_result checks result["status"] == "error" and calls report_error() instead of report_complete(). Previously, agent returning {"status": "error", "reward": 0.0} was treated as success, calling CompleteTrajectory(status=ready) which conflicted with Runtime's failTrajectory(). Now errors are properly reported so TLM retries immediately instead of waiting 10 min timeout. - feedback.py: CompleteTrajectory and UpdateReward gracefully handle 400 "not in valid status" (trajectory already failed by Runtime). Logs warning and skips instead of raising exception. * fix(rft): make feedback reporting non-fatal and fix endpoint/timeout - Move _handle_result to try/except in else clause so feedback failures do not prevent returning the rollout result - Fix _build_endpoint to not include "prod" in the URL prefix - Increase feedback HTTP timeout from 30s to 120s * Handle trajectory-already-processed errors in rft_handler decorator Centralize detection of 'not in valid status' and 'Cannot transition trajectory' errors into a shared helper. The decorator now catches these from the agent function and returns {status: skipped} instead of re-raising, so agents no longer need per-project workarounds. * fix: use Optional[] syntax in Pydantic models for Python 3.9 compatibility * refactor(rft): use sagemaker-core token generator instead of inline implementation Replace the inline SigV4 token signing logic and aws-sagemaker-token-generator fallback with sagemaker.core.token_generator.generate_token from sagemaker-core. Add sagemaker-core as a dependency in pyproject.toml. * refactor: move rft module from sagemaker-rft to sagemaker-train Move the rft subpackage from the standalone sagemaker-rft package into sagemaker-train as sagemaker.train.rft. Update all internal imports accordingly. Add requests and pydantic to sagemaker-train dependencies. Remove the sagemaker-rft package directory. 
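The "trajectory-already-processed" commit above describes a shared helper that detects the 'not in valid status' and 'Cannot transition trajectory' errors so the decorator can return {status: skipped} instead of re-raising. The following is a rough sketch of that idea only; the function and decorator names are hypothetical, and the real `rft_handler` does more (feedback reporting, reward handling) than is shown here.

```python
import functools

# Message fragments quoted in the commit note above.
_ALREADY_PROCESSED_MARKERS = (
    "not in valid status",
    "Cannot transition trajectory",
)


def is_trajectory_already_processed(exc: BaseException) -> bool:
    """True when the exception text matches one of the known markers."""
    message = str(exc)
    return any(marker in message for marker in _ALREADY_PROCESSED_MARKERS)


def skip_if_already_processed(func):
    """Hypothetical decorator: turn matching errors into a 'skipped' result."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            if is_trajectory_already_processed(exc):
                return {"status": "skipped"}
            raise  # unrelated errors still propagate

    return wrapper
```

Matching on message substrings is brittle if the backend wording changes, which is presumably why the commit centralizes the check in one helper.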
* refactor(rft): remove auth.py wrapper, use generate_token directly The get_rft_api_key wrapper in auth.py was just passing args through to generate_token. Call generate_token directly in feedback.py instead. --------- Co-authored-by: Tritin Truong <tttritin@amazon.com> Co-authored-by: Barret Pickett <mrpic@amazon.com> Co-authored-by: James Yu <jamesfyu@amazon.com> * MTRL Evaluator * fixes * Integ test changes * fix: Remove unimplemented trajectory stub from MTRL evaluator Remove _MTRLTurn, _MTRLTrajectory classes and _fetch_mtrl_trajectory function that raised NotImplementedError. These were stubbed for a future phase and should not ship to customers. Also remove the get_trajectory reference from the evaluate() docstring. * refactor: Align MTRL evaluator with standard evaluator UX contract - Type evaluate() return as MTRLEvaluationExecution - Replace custom _start_execution_boto3 with shared _start_execution path (adds pipeline tagging for discovery via get_all) - Remove _find_existing_pipeline (handled by shared infrastructure) - Remove custom wait()/refresh()/_print_trailing_logs() overrides from MTRLEvaluationExecution (uses sagemaker-core based parent methods) - Remove _get_results() and _show_mtrl_results (show_results removed) - Add proper get_all classmethod with telemetry and yield semantics - Add TYPE_CHECKING import for return type annotation * feat: Add 3P agent (Lambda) integration test for MTRL evaluator - Add test_mtrl_evaluator_3p_agent.py with 4 test cases covering Lambda ARN string, AgentLambda object, wait-for-completion, and get_all discoverability - Fix _resolve_agent_arn to handle AgentLambda.lambda_arn attribute - Add _start_mtrl_execution with proper pipeline tagging for get_all discovery (uses boto3 directly since Job step type requires beta endpoint for CreatePipeline validation) - Reuse get_presigned_mlflow_experiment_url in trainer_wait.py and multi_turn_rl_evaluator.py (DRY) - Delete unused multi_turn_rl_evaluator_utils.py (dead code 
after show_results removal) - Add TODO comment on custom botocore loader in utils.py * Change revert * docs: Add 3P agent (Lambda) evaluation section to notebook Add Case 4 demonstrating Lambda-based agent evaluation with: - Lambda ARN string as agent_config - AgentLambda object as agent_config - AgentLambda.create() inline code example All examples include wait() for completion. * fix: Always include MlflowExperimentName in JobConfigDocument The backend requires MlflowExperimentName when ModelPackageConfig is not provided (base model only evaluation). Default to 'mtrl-eval-{model_name}' when not explicitly set by the user. * docs: Add MTRL dogfooding notebook with Train/Eval/Deploy flows Covers three scenarios: 1. Bedrock AgentCore: train → evaluate → deploy via ModelBuilder 2. 3P Lambda agent: train → evaluate → deploy via ModelBuilder 3. Base model evaluation (no training, AgentCore + Lambda) Includes discovery utilities and cleanup section. * docs: Add Bedrock deployment section to dogfooding notebook Adds BedrockModelBuilder deploy examples for both AgentCore and Lambda training outputs, plus Bedrock runtime invocation example. * docs: Set minimal hyperparameters in dogfooding notebook Use num_epochs=1, global_batch_size=2, max_steps=5 for fast dogfooding runs instead of defaults that take hours. * fix: Use max_epochs instead of num_epochs in dogfooding notebook * chore: Disable telemetry S3 request during dogfooding The sm-pysdk-t S3 bucket is unreachable from beta accounts, causing noisy RequestException logs. Disabled until post-launch. * Revert "chore: Disable telemetry S3 request during dogfooding" This reverts commit 689c05b5fbd1aa3e8cf40b1a1b17d4f843dfcec9. * fix(serve): Fall back to hub_content_name when recipe_name is empty MTRL-trained model packages have hub_content_name set but recipe_name empty. ModelBuilder now falls back to hub_content_name for recipe lookup in the hub document, enabling deployment of MTRL fine-tuned models via ModelBuilder. 
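Two of the fixes above are simple fallbacks: MlflowExperimentName defaults to 'mtrl-eval-{model_name}' when the user does not set it, and the recipe lookup falls back to hub_content_name when recipe_name is empty. A small sketch of both rules, using hypothetical helper names rather than the actual ModelBuilder or evaluator internals:

```python
from typing import Optional


def resolve_recipe_key(
    recipe_name: Optional[str], hub_content_name: Optional[str]
) -> Optional[str]:
    """Prefer recipe_name; fall back to hub_content_name when it is empty."""
    return recipe_name or hub_content_name or None


def resolve_experiment_name(model_name: str, explicit: Optional[str] = None) -> str:
    """Use the user's experiment name if given, else the documented default."""
    return explicit or f"mtrl-eval-{model_name}"
```

Using `or` treats both `None` and the empty string as "not set", which matches the "recipe_name empty" case described in the commit.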
* docs: Use openai-reasoning-gpt-oss-20b and add list_supported_models section Switch dogfooding notebook to openai-reasoning-gpt-oss-20b model and add a prominent Supported Models section showing both training and evaluation model discovery. * Address PR comments * Fixes --------- Co-authored-by: jamesfyu <jamesfyu@amazon.com> Co-authored-by: Bassem Halim <bassemamir459@gmail.com> Co-authored-by: Aditi Sharma <165942273+Aditi2424@users.noreply.github.com> Co-authored-by: adishaa <adishaa@amazon.com> Co-authored-by: Basssem Halim <bhhalim@amazon.com> Co-authored-by: Molly He <mollyhe@amazon.com> Co-authored-by: Zachary David Saunders <zsaund@amazon.com> Co-authored-by: Bobby Lindsey <bobbywlindsey@users.noreply.github.com> Co-authored-by: Amit <modi.osu@gmail.com> Co-authored-by: Amit Modi <modiamit@amazon.com> Co-authored-by: Rohit Kumar Srivastava <141.srivastava@gmail.com> Co-authored-by: IshaChid76 <49986634+IshaChid76@users.noreply.github.com> Co-authored-by: jam-jee <jamjee@amazon.com> Co-authored-by: rsareddy0329 <rsareddy0329@gmail.com> Co-authored-by: Joshua Towner <jjtowner@amazon.com> Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> Co-authored-by: David Lindskog <davlind@amazon.com> Co-authored-by: mnganesh-amzn <mnganesh@amazon.com> Co-authored-by: aviruthen <91846056+aviruthen@users.noreply.github.com> Co-authored-by: Ryan <ryantanaka.y@gmail.com> Co-authored-by: Ryan Tanaka <rrtanaka@amazon.com> Co-authored-by: ampleh <22372465+ampleh@users.noreply.github.com> Co-authored-by: houtampl <houtampl@amazon.com> Co-authored-by: Syed Mujtaba <42322958+mujtaba1747@users.noreply.github.com> Co-authored-by: Mike Shen <109769013+xiaoxshe@users.noreply.github.com> Co-authored-by: Tritin Truong <tttritin@amazon.com> Co-authored-by: Barret Pickett <mrpic@amazon.com> * Update pr-checks-master.yml (#1995) * Test fixes (#1996) * feat: upgrade MultiTurnRLTrainer experience for mlflow, model package, and add integ test and SDK Docs (#1993) * feat: update 
MultiTurnRLTrainer and example notebook (#2002) * feat: update MultiTurnRLTrainer and example notebook * Add dataset format docs and support .json/.csv extensions for RFT - Add dataset format requirements summary to notebook section 3 - Expand DATASET_SUPPORTED_EXTENSIONS to include .json and .csv - Add _validate_csv and _validate_json to DatasetFormatDetector - Update unit tests to reflect new supported extensions * fix: subtract 1 from current step in training progress bar to avoid premature 100% (#2008) The progress bar showed 100% while the training step was still in progress because CurrentStep reports the step being worked on, not yet completed. Subtract 1 (clamped to 0) from the numerator so the bar only reaches 100% after the step actually finishes. * Mtrl readiness (#1998) * Test fixes * Test Fixes * trigger CI * fix: MTRL integ test import typo and mlflow ARN format - Fix CustomCustomAgentLambda → CustomAgentLambda import in 3p agent test - Update mlflow-tracking-server ARNs to mlflow-app format to match new validator * fix: update unit tests to match refactored MLflow boto3 API Tests in test_finetune_utils.py were referencing removed functions (_mlflow_version_meets_minimum, _wait_for_mlflow_app_ready) and old MlflowApp resource-object patterns. Updated to use the new dict-based boto3 client approach (_mlflow_version_meets_minimum_dict, _wait_for_mlflow_app_ready_boto, _get_prod_sm_client with paginator). * fix: pass explicit boto_session with region to Session() in evaluator The BaseEvaluator._create_default_session validator was creating a boto3.client with region but passing only sagemaker_client to Session(). Session.__init__ creates its own boto3.Session() without region, which fails in CI where no ~/.aws/config exists. Fix by creating a boto3.Session(region_name=region) first and passing it as boto_session. 
Also pass explicit sagemaker_session to MultiTurnRLTrainer in the test_mtrl_evaluator.py fixture to avoid the same issue in the trainer constructor. * revert: remove version bumps from this branch Reverts VERSION files back to 2.10.0 (core) and 1.10.0 (train, serve), and removes the data/sample package-data glob from pyproject.toml. * fix: use dynamic account ID in integ tests instead of hardcoded 742774200982 Resolves current account via STS at module load time so tests work in any CI account without cross-account S3 permission errors. * fix: make MTRL evaluator integ tests fully account-agnostic - Model resolution now uses pre-resolved _model_arn/_model_name from trainer when available, avoiding DescribeModelPackage calls for trainers that already have this info cached. - Test fixture creates model package group on the fly instead of relying on pre-existing resources in a specific account. - All hardcoded account IDs replaced with dynamic STS resolution. * fix: remove end-to-end wait tests that require account-specific resources These tests called execution.wait() and asserted Succeeded, which requires real Bedrock AgentCore runtimes and trained model artifacts. The pipeline submission flow is already covered by test_evaluate_comparison_mode and test_pipeline_reuse. * fix: replace pipeline submission tests with construction tests Account 391266019386 does not support the Job step type in SageMaker Pipelines, so pipeline creation/execution tests cannot run. Replaced with evaluator construction tests that validate the SDK code path (model resolution, session creation, validator logic) without submitting pipelines. * test: add unit tests for model builder is_checkpoint and IC model_name changes Cover the new is_checkpoint logic in _resolve_model_artifact_uri, _fetch_peft, build(), and inference component creation using model_name. 
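The progress-bar fix in #2008, listed earlier in this log, subtracts 1 from the current step (clamped to 0) because CurrentStep reports the step in progress, not the last completed one. A minimal sketch of that arithmetic, with a hypothetical function name and an added guard for a non-positive total that the commit does not mention:

```python
def progress_fraction(current_step: int, total_steps: int) -> float:
    """Fraction of completed steps, treating current_step as still in progress."""
    if total_steps <= 0:
        # Assumption for this sketch: report no progress rather than divide by zero.
        return 0.0
    completed = max(current_step - 1, 0)  # clamp so step 0 does not go negative
    return min(completed / total_steps, 1.0)
```

With 5 total steps, working on step 5 gives 4/5 rather than 100%, so the bar reaches 100% only once a later update shows the final step has finished.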
* feat: create restricted model package group for Nova models (#2013) * chore: update svc model (#2005) * feat: create restricted model package group for Nova models When MultiTurnRLTrainer auto-creates a model package group for Nova models, pass ManagedConfiguration(managed_storage_type="Restricted") to create a restricted MPG. Applies to both output and intermediate checkpoint MPGs. --------- Co-authored-by: Syed Mujtaba <42322958+mujtaba1747@users.noreply.github.com> * Mtrl readiness (#2019) * Test fixes * Test Fixes * trigger CI * fix: MTRL integ test import typo and mlflow ARN format - Fix CustomCustomAgentLambda → CustomAgentLambda import in 3p agent test - Update mlflow-tracking-server ARNs to mlflow-app format to match new validator * fix: update unit tests to match refactored MLflow boto3 API Tests in test_finetune_utils.py were referencing removed functions (_mlflow_version_meets_minimum, _wait_for_mlflow_app_ready) and old MlflowApp resource-object patterns. Updated to use the new dict-based boto3 client approach (_mlflow_version_meets_minimum_dict, _wait_for_mlflow_app_ready_boto, _get_prod_sm_client with paginator). * fix: pass explicit boto_session with region to Session() in evaluator The BaseEvaluator._create_default_session validator was creating a boto3.client with region but passing only sagemaker_client to Session(). Session.__init__ creates its own boto3.Session() without region, which fails in CI where no ~/.aws/config exists. Fix by creating a boto3.Session(region_name=region) first and passing it as boto_session. Also pass explicit sagemaker_session to MultiTurnRLTrainer in the test_mtrl_evaluator.py fixture to avoid the same issue in the trainer constructor. * revert: remove version bumps from this branch Reverts VERSION files back to 2.10.0 (core) and 1.10.0 (train, serve), and removes the data/sample package-data glob from pyproject.toml. 
* fix: use dynamic account ID in integ tests instead of hardcoded 742774200982 Resolves current account via STS at module load time so tests work in any CI account without cross-account S3 permission errors. * fix: make MTRL evaluator integ tests fully account-agnostic - Model resolution now uses pre-resolved _model_arn/_model_name from trainer when available, avoiding DescribeModelPackage calls for trainers that already have this info cached. - Test fixture creates model package group on the fly instead of relying on pre-existing resources in a specific account. - All hardcoded account IDs replaced with dynamic STS resolution. * fix: remove end-to-end wait tests that require account-specific resources These tests called execution.wait() and asserted Succeeded, which requires real Bedrock AgentCore runtimes and trained model artifacts. The pipeline submission flow is already covered by test_evaluate_comparison_mode and test_pipeline_reuse. * fix: replace pipeline submission tests with construction tests Account 391266019386 does not support the Job step type in SageMaker Pipelines, so pipeline creation/execution tests cannot run. Replaced with evaluator construction tests that validate the SDK code path (model resolution, session creation, validator logic) without submitting pipelines. * test: add unit tests for model builder is_checkpoint and IC model_name changes Cover the new is_checkpoint logic in _resolve_model_artifact_uri, _fetch_peft, build(), and inference component creation using model_name. 
* fix: merged model deployment and MLflow deep-linking for eval ModelBuilder (SMI): - Add is_checkpoint check in _fetch_peft() to skip LORA path for merged models - Resolve merged model artifacts to checkpoints/hf_merged/ in build and deploy - Use model_name instead of artifact_url in non-LORA IC spec MLflow URL deep-linking: - Use deepLink query param (matching SageMaker UI pattern) instead of URL fragments - Authenticate via presigned URL session to resolve experiment name to ID - Store mlflow_resource_arn and mlflow_experiment_name on eval execution - Fall back to experiment name search filter when ID resolution fails BedrockModelBuilder: - Add fallback to fetch output_model_package_arn from job config when not set * Resolving unit test failures * Fix test_merged_model_deployment isinstance compatibility across Python versions --------- Co-authored-by: Ming Luo <24469267+mingluo0108@users.noreply.github.com> Co-authored-by: jamesfyu <jamesfyu@amazon.com> Co-authored-by: Bassem Halim <bassemamir459@gmail.com> Co-authored-by: Aditi Sharma <165942273+Aditi2424@users.noreply.github.com> Co-authored-by: adishaa <adishaa@amazon.com> Co-authored-by: Basssem Halim <bhhalim@amazon.com> Co-authored-by: Molly He <mollyhe@amazon.com> Co-authored-by: Zachary David Saunders <zsaund@amazon.com> Co-authored-by: Bobby Lindsey <bobbywlindsey@users.noreply.github.com> Co-authored-by: Amit <modi.osu@gmail.com> Co-authored-by: Amit Modi <modiamit@amazon.com> Co-authored-by: Rohit Kumar Srivastava <141.srivastava@gmail.com> Co-authored-by: IshaChid76 <49986634+IshaChid76@users.noreply.github.com> Co-authored-by: jam-jee <jamjee@amazon.com> Co-authored-by: rsareddy0329 <rsareddy0329@gmail.com> Co-authored-by: Joshua Towner <jjtowner@amazon.com> Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> Co-authored-by: David Lindskog <davlind@amazon.com> Co-authored-by: mnganesh-amzn <mnganesh@amazon.com> Co-authored-by: aviruthen <91846056+aviruthen@users.noreply.github.com> 
Co-authored-by: Ryan <ryantanaka.y@gmail.com> Co-authored-by: Ryan Tanaka <rrtanaka@amazon.com> Co-authored-by: ampleh <22372465+ampleh@users.noreply.github.com> Co-authored-by: houtampl <houtampl@amazon.com> Co-authored-by: Syed Mujtaba <42322958+mujtaba1747@users.noreply.github.com> Co-authored-by: Mike Shen <109769013+xiaoxshe@users.noreply.github.com> Co-authored-by: Tritin Truong <tttritin@amazon.com> Co-authored-by: Barret Pickett <mrpic@amazon.com>
- Add model_package_config field to ModelTrainer for RMP consumption - Route MP ARN from recipe to ModelPackageConfig.SourceModelPackageArn - Add S3Uri optional override for escrow-managed artifacts (RMP) - Fix duplicate min_memory_required_in_mb in codegen output - Add studio_web_portal_settings to codegen output (missed in #2018) - Direct model_package_config param overrides recipe (CTJ priority) Reverted: ModelPackageGroupArn remains required (pending Smithy approval) Test results: - sagemaker-train unit tests: 81/81 passed - sagemaker-core unit tests: 3217/3225 passed - sagemaker-core tools tests: 39/39 passed - 8 expected failures (Docker Compose not installed locally, no code changed in tests/unit/local/): tests/unit/local/test_image.py::TestSageMakerContainerAdvanced (all 8) Co-authored-by: xibei chen <xibeich@amazon.com>
* Add inference component discovery and bedrock deploy polling - Add cell to list inference components for an endpoint before invoke - Add polling cells to wait for bedrock deployment to reach InService status before invoking (both Scenario 1 and Scenario 2) * Replace internal dogfooding notebook with open-source MTRL example - Remove account-specific ARNs, gamma endpoints, and internal references - Consolidate duplicate scenarios into a single clean flow (train → eval → deploy) - Add both SageMaker endpoint and Bedrock deployment paths - Include inference component discovery and deploy polling - Add descriptive markdown cells explaining each step - Move to model-customization-examples directory * Add evaluation and deployment sections to MTRL prod notebook - Add Section 11: Evaluate fine-tuned model and base model comparison - Add Section 12: Deploy to SageMaker endpoint with inference component discovery and invocation - Add Section 13: Deploy to Bedrock with polling and invocation - Add Section 14: Cleanup - Remove account-specific info from setup cell (ada credentials, internal paths) - Remove standalone mtrl_finetuning_example_notebook (consolidated here) - Update table of contents to reflect new sections
* feat: add run-level MLflow deep-linking for MTRL eval execution The eval MLflow URL now deep-links to the specific run (experiment_id + run_id), matching the training job behavior. Previously it only linked to the experiment level. Changes: - Add get_mlflow_url() and get_mlflow_details() to MTRLEvaluationExecution - Add get_presigned_mlflow_url() and _resolve_run_id() to mlflow_url_utils - Fix URL format to use fragment-based routing (#/experiments/id/runs/id) with ?workspace=default for SageMaker MLflow app compatibility - Refresh presigned URL every 30s during wait() (matching trainer behavior) - Add demo notebook for the feature * revert: remove demo notebooks from PR The notebooks are for local testing only, not needed in the PR.
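The run-level deep link above combines fragment-based routing (#/experiments/id/runs/id) with a ?workspace=default query parameter. The sketch below shows one way such a URL could be assembled. The function name is hypothetical, the placement of the query string before the fragment is an assumption based on standard URL syntax, and presigned-URL token handling is left out entirely.

```python
from urllib.parse import urlencode


def build_run_deep_link(base_url: str, experiment_id: str, run_id: str) -> str:
    """Assemble a run-level deep link (illustrative, not the SDK's implementation)."""
    query = urlencode({"workspace": "default"})
    # Fragment route as described in the commit note.
    fragment = f"/experiments/{experiment_id}/runs/{run_id}"
    # Assumption: query string precedes the fragment, per generic URL syntax.
    return f"{base_url.rstrip('/')}/?{query}#{fragment}"
```

For example, `build_run_deep_link("https://example.invalid/mlflow", "12", "abc")` yields a link ending in `?workspace=default#/experiments/12/runs/abc`; the host here is a placeholder.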
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5919       +/-   ##
===========================================
- Coverage   89.97%   84.87%    -5.10%
===========================================
  Files         286      194       -92
  Lines       39219    26265    -12954
===========================================
- Hits        35286    22292    -12994
- Misses       3933     3973       +40
nargokul previously approved these changes Jun 3, 2026
nargokul approved these changes Jun 3, 2026
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.