
Tune Step with Script Mode Estimator won't start due to missing MetricDefinitions in pipeline definition #4326

Closed

lorenzwalthert opened this issue Dec 14, 2023 · 1 comment

lorenzwalthert commented Dec 14, 2023

Describe the bug

Running a SageMaker Pipeline with a tune step whose estimator uses script mode fails because MetricDefinitions is missing from the pipeline definition, even though metric definitions were passed to the estimator when the pipeline was created with the SDK.

This means a pipeline created and upserted with the Python SDK can't run if it contains a tuning step with a script mode estimator.

I opened a support case with AWS premium support (170258349801168). The code snippets below are a simplified version.

To reproduce

I have a SageMaker pipeline with a hyper-parameter tuning step based on the SKLearn script mode estimator. The relevant snippet:

import sagemaker

# role, instance_type, image_uri, hyper_params, manifest_uri, and
# hyperparameter_ranges are defined earlier in the (omitted) full script.
metrics_mapping = [
    {
        "Name": "test:lift-momentum",
        "Regex": "Lift \\(candidate all classes\\) over momentum: (.*?);",
    },
]

model = sagemaker.sklearn.SKLearn(
    entry_point="step_train.py",
    role=role,
    instance_type=instance_type,
    instance_count=1,
    image_uri=image_uri,
    image_uri_region="eu-central-1",
    metric_definitions=metrics_mapping,
    hyperparameters=hyper_params,
    enable_sagemaker_metrics=True,
    sagemaker_session=sagemaker.workflow.pipeline_context.PipelineSession(),
)

inputs = {
    "all": sagemaker.TrainingInput(manifest_uri, s3_data_type="ManifestFile"),
}
tune = sagemaker.tuner.HyperparameterTuner(
    estimator=model,
    # The objective metric name must match one of the metric definitions.
    objective_metric_name="test:lift-momentum",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=2,
    max_parallel_jobs=1,
)
tune_args = tune.fit(inputs=inputs, wait=True)

tune_step = sagemaker.workflow.steps.TuningStep(
    name="TrainModel", step_args=tune_args
)

I clearly pass a metric definition into the estimator.

The pipeline can be upserted and started, but the hyper-parameter tuning step fails at runtime (even before the hyper-parameter tuning job is created in the SageMaker console):

ClientError: Failed to invoke sagemaker:CreateHyperParameterTuningJob. Error Details: A metric is required for this hyperparameter tuning job objective. Provide a metric in the metric definitions.

As expected, to create a hyper-parameter tuning job with CreateHyperParameterTuningJob, the key TrainingJobDefinitions must contain MetricDefinitions. The following snippet from the docs shows the expected structure:

"TrainingJobDefinitions": [ 
      { 
         "AlgorithmSpecification": { 
            "AlgorithmName": "string",
            "MetricDefinitions": [ 
               { 
                  "Name": "string",
                  "Regex": "string"
               }
            ],
            "TrainingImage": "string",
            "TrainingInputMode": "string"
         },
         ...

However, my pipeline definition file did not contain it, even though I passed metric_definitions into the estimator, so I believe it gets lost along the way. In a previous version of the pipeline, I used a train step instead of a tune step, and the metrics made it into the pipeline definition file. Here is the relevant excerpt:

{
    "Name": "train_model",
    "Type": "Training",
    "Arguments": {
        "AlgorithmSpecification": {
            "TrainingInputMode": "File",
            "TrainingImage": "982361546614.dkr.ecr.eu-central-1.amazonaws.com/signals.processor:staging",
            "MetricDefinitions": [
                {
                    "Name": "train:loss",
                    "Regex": "train loss: (.+?),"
              ...
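
One way to verify that the tuning step's definition lacks the metrics is to inspect the rendered JSON. A small sketch, assuming pipeline is the sagemaker.workflow.pipeline.Pipeline object that wraps the steps above:

import json

# `pipeline` is assumed to be the sagemaker.workflow.pipeline.Pipeline
# object containing the TuningStep (constructed elsewhere).
definition = json.loads(pipeline.definition())
for step in definition["Steps"]:
    if step["Type"] == "Tuning":
        args = step["Arguments"]
        # CreateHyperParameterTuningJob takes either a single
        # TrainingJobDefinition or a TrainingJobDefinitions list.
        job_defs = args.get("TrainingJobDefinitions") or [args.get("TrainingJobDefinition", {})]
        for job_def in job_defs:
            spec = job_def.get("AlgorithmSpecification", {})
            # Prints None here, despite metric_definitions on the estimator.
            print(spec.get("MetricDefinitions"))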

Expected behavior

The Python SDK takes the metric_definitions argument supplied to the SKLearn estimator into account when creating the tune step of the pipeline.

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.197.1.dev0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans):
  • Framework version:
  • Python version: 3.11
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y
lorenzwalthert (Author) commented Dec 16, 2023

It seems that the metric definitions have to be supplied to sagemaker.tuner.HyperparameterTuner instead of to the estimator. Thanks to AWS support for providing the solution.
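
A minimal sketch of the fix, reusing the names from the reproduction snippet above (metrics_mapping, model, and hyperparameter_ranges are assumed to be defined as shown there):

tune = sagemaker.tuner.HyperparameterTuner(
    estimator=model,
    objective_metric_name="test:lift-momentum",
    # Passing the metric definitions to the tuner (rather than only to the
    # estimator) makes them appear in the pipeline definition.
    metric_definitions=metrics_mapping,
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=2,
    max_parallel_jobs=1,
)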
