PySDK Version
Describe the bug
When a ModelTrainer is executed under a PipelineSession (i.e. produces a TrainingStep), ModelTrainer._create_training_job_args explicitly removes training_job_name from the request before serializing it to PascalCase:
```python
# sagemaker/train/model_trainer.py (sagemaker-train 1.8.0)
if boto3 or isinstance(self.sagemaker_session, PipelineSession):
    if isinstance(self.sagemaker_session, PipelineSession):
        training_request.pop("training_job_name", None)
    # Convert snake_case to PascalCase for AWS API
    pipeline_request = {to_pascal_case(k): v for k, v in training_request.items()}
    serialized_request = serialize(pipeline_request)
    return serialized_request
```
Because the key is popped, the resulting request dict has no TrainingJobName. Downstream, TrainingStep.arguments (with PipelineDefinitionConfig(use_custom_job_prefix=True)) relies on TrainingJobName being present in the request so the prefix is preserved (and trim_request_dict removes it when use_custom_job_prefix=False).
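To make the interaction concrete, here is a minimal self-contained sketch of the behavior described above. `trim_request_dict` is reduced to its documented effect; the real implementation in the workflow layer lives elsewhere and takes different parameters (a pipeline definition config rather than a bare flag), so treat this as an illustration, not the SDK's code:

```python
def trim_request_dict(request, job_key, use_custom_job_prefix):
    """Sketch of the workflow-layer trimming: when no custom prefix is
    requested, the job-name key is stripped so the service auto-generates
    a name; when a custom prefix IS requested, the key must be preserved."""
    if not use_custom_job_prefix:
        request.pop(job_key, None)
    return request

request = {"TrainingJobName": "my-prefix-2024-01-01", "RoleArn": "arn:aws:iam::123456789012:role/demo"}

# use_custom_job_prefix=True keeps the key; False strips it.
assert "TrainingJobName" in trim_request_dict(dict(request), "TrainingJobName", True)
assert "TrainingJobName" not in trim_request_dict(dict(request), "TrainingJobName", False)

# But if ModelTrainer already popped the key upstream, there is nothing
# left for use_custom_job_prefix=True to preserve -- the prefix is lost.
popped = {"RoleArn": "arn:aws:iam::123456789012:role/demo"}
assert "TrainingJobName" not in trim_request_dict(dict(popped), "TrainingJobName", True)
```

This is why the misbehavior is silent: the trimming step succeeds either way, it just never sees the key it was meant to preserve.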
The net effect is that use_custom_job_prefix=True is silently ignored for TrainingStep when the step is built from a ModelTrainer: every pipeline execution produces a random auto-generated training job name instead of the configured base_job_name prefix.
This is the same class of bug as #3991 and #4590 (which were about TransformStep), but for the new V3 ModelTrainer → TrainingStep path.
To reproduce
```python
from sagemaker.core.workflow.pipeline import Pipeline
from sagemaker.core.workflow.pipeline_context import PipelineSession
from sagemaker.core.workflow.pipeline_definition_config import PipelineDefinitionConfig
from sagemaker.train.model_trainer import ModelTrainer

# ... build a ModelTrainer `trainer` with base_job_name="my-prefix" ...
pipeline_session = PipelineSession()
trainer.sagemaker_session = pipeline_session

step_args = trainer._create_training_job_args()
assert "TrainingJobName" in step_args, step_args  # FAILS — key was popped

pipeline = Pipeline(
    name="repro",
    steps=[...],  # TrainingStep built from trainer
    sagemaker_session=pipeline_session,
    pipeline_definition_config=PipelineDefinitionConfig(use_custom_job_prefix=True),
)
# Pipeline executions will NOT use "my-prefix-..." as the training job name.
```
Expected behavior
TrainingJobName should remain in the request dict so that PipelineDefinitionConfig(use_custom_job_prefix=True) produces training jobs named with the configured prefix. When use_custom_job_prefix=False, TrainingStep.arguments/trim_request_dict will strip the key as usual.
A minimal fix is to stop popping the key:
```python
if boto3 or isinstance(self.sagemaker_session, PipelineSession):
    pipeline_request = {to_pascal_case(k): v for k, v in training_request.items()}
    serialized_request = serialize(pipeline_request)
    return serialized_request
```
As a workaround we currently monkey-patch _create_training_job_args to re-insert TrainingJobName = _get_unique_name(self.base_job_name) when the session is a PipelineSession.
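For reference, a self-contained sketch of that monkey-patch pattern is below. The real patch targets `sagemaker.train.model_trainer.ModelTrainer`; here the trainer and the `_get_unique_name` helper are stand-in stubs, since the helper's exact name and location in the SDK are taken from our environment and may differ across versions:

```python
import uuid

def _get_unique_name(base):
    """Stand-in for the SDK's unique-name helper: prefix plus a random suffix."""
    return f"{base}-{uuid.uuid4().hex[:8]}"

class ModelTrainer:
    """Stub mimicking the V3 trainer for illustration only."""
    def __init__(self, base_job_name):
        self.base_job_name = base_job_name

    def _create_training_job_args(self):
        # Simulates the buggy behavior: TrainingJobName is absent.
        return {"AlgorithmSpecification": {}, "RoleArn": "arn:aws:iam::123456789012:role/demo"}

_original = ModelTrainer._create_training_job_args

def _patched(self, *args, **kwargs):
    request = _original(self, *args, **kwargs)
    # Re-insert the key the SDK popped so use_custom_job_prefix=True
    # has a name (with the configured prefix) to preserve.
    request.setdefault("TrainingJobName", _get_unique_name(self.base_job_name))
    return request

ModelTrainer._create_training_job_args = _patched

args = ModelTrainer("my-prefix")._create_training_job_args()
assert args["TrainingJobName"].startswith("my-prefix-")
```

We keep the patch narrowly scoped to sessions that are a `PipelineSession`, so non-pipeline code paths are unaffected.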
Screenshots or logs
N/A — silent misbehavior; the pipeline executes but job names use the default random name instead of the configured prefix.
System information
- SageMaker Python SDK version: sagemaker-train 1.8.0, sagemaker-core 2.8.0, sagemaker-mlops 1.8.0, sagemaker-serve 1.8.0 (also reproduces on 1.7.1 / 2.7.1)
- Framework name or algorithm: custom (source_code via ModelTrainer)
- Framework version: N/A
- Python version: 3.13
- CPU or GPU: CPU (not relevant; the bug is SDK-side)
- Custom Docker image (Y/N): Y
Additional context
Related closed issues for other step types: #3991 (TransformStep), #4590 (TransformStep/ProcessingStep).