
Tune Step with Script Mode Estimator won't start due to missing MetricDefinitions in pipeline definition #4326

Closed

lorenzwalthert opened this issue Dec 14, 2023 · 1 comment

lorenzwalthert commented Dec 14, 2023

Describe the bug

Running a SageMaker Pipeline with a tune step whose estimator uses script mode fails because MetricDefinitions is missing from the pipeline definition, even though metric definitions were passed to the estimator when the pipeline was created with the SDK.

This means a pipeline created and upserted with the Python SDK can't run if it contains a tuning step with a script mode estimator.

I opened a support case with AWS premium support (170258349801168). The code snippets below are a simplified version.

To reproduce

I have a SageMaker pipeline with a hyper-parameter tuning step based on the SKLearn script mode estimator. The relevant snippet:

import sagemaker

# role, instance_type, image_uri, hyper_params, manifest_uri, and
# hyperparameter_ranges are defined earlier in the (omitted) full script.
metrics_mapping = [
    {
        "Name": "test:lift-momentum",
        "Regex": "Lift \\(candidate all classes\\) over momentum: (.*?);",
    },
]

model = sagemaker.sklearn.SKLearn(
    entry_point="step_train.py",
    role=role,
    instance_type=instance_type,
    instance_count=1,
    image_uri=image_uri,
    image_uri_region="eu-central-1",
    metric_definitions=metrics_mapping,
    hyperparameters=hyper_params,
    enable_sagemaker_metrics=True,
    sagemaker_session=sagemaker.workflow.pipeline_context.PipelineSession(),
)

inputs = {
    "all": sagemaker.TrainingInput(manifest_uri, s3_data_type="ManifestFile"),
}
tune = sagemaker.tuner.HyperparameterTuner(
    estimator=model,
    # The objective metric name must match one of the metric definitions.
    objective_metric_name="test:lift-momentum",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=2,
    max_parallel_jobs=1,
)
tune_args = tune.fit(inputs=inputs, wait=True)

tune_step = sagemaker.workflow.steps.TuningStep(
    name="TrainModel", step_args=tune_args
)

I clearly pass a metric definition into the estimator.

The pipeline can be upserted and started, but the hyper-parameter tuning step fails at runtime (even before the hyper-parameter tuning job is created in the SageMaker console):

ClientError: Failed to invoke sagemaker:CreateHyperParameterTuningJob. Error Details: A metric is required for this hyperparameter tuning job objective. Provide a metric in the metric definitions.

As expected, to create a hyper-parameter tuning job with CreateHyperParameterTuningJob, the key TrainingJobDefinitions must contain MetricDefinitions. The following snippet from the docs shows the expected structure:

"TrainingJobDefinitions": [ 
      { 
         "AlgorithmSpecification": { 
            "AlgorithmName": "string",
            "MetricDefinitions": [ 
               { 
                  "Name": "string",
                  "Regex": "string"
               }
            ],
            "TrainingImage": "string",
            "TrainingInputMode": "string"
         },
         ...

However, my pipeline definition file did not contain it, even though I passed metric_definitions into the estimator, so I believe it gets lost along the way. In a previous version of the pipeline, I used a train step instead of a tune step, and the metrics made it into the pipeline definition file. Here is the relevant excerpt:

{
    "Name": "train_model",
    "Type": "Training",
    "Arguments": {
        "AlgorithmSpecification": {
            "TrainingInputMode": "File",
            "TrainingImage": "982361546614.dkr.ecr.eu-central-1.amazonaws.com/signals.processor:staging",
            "MetricDefinitions": [
                {
                    "Name": "train:loss",
                    "Regex": "train loss: (.+?),"
              ...
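
One way to verify that the tuning step's definition lacks the metrics is to inspect the rendered JSON. A small sketch, assuming pipeline is the sagemaker.workflow.pipeline.Pipeline object that wraps the steps above:

import json

# `pipeline` is assumed to be the sagemaker.workflow.pipeline.Pipeline
# object containing the TuningStep (constructed elsewhere).
definition = json.loads(pipeline.definition())
for step in definition["Steps"]:
    if step["Type"] == "Tuning":
        args = step["Arguments"]
        # CreateHyperParameterTuningJob takes either a single
        # TrainingJobDefinition or a TrainingJobDefinitions list.
        job_defs = args.get("TrainingJobDefinitions") or [args.get("TrainingJobDefinition", {})]
        for job_def in job_defs:
            spec = job_def.get("AlgorithmSpecification", {})
            # Prints None here, despite metric_definitions on the estimator.
            print(spec.get("MetricDefinitions"))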

Expected behavior

The Python SDK takes the metric_definitions argument supplied to the SKLearn estimator into account when creating the tune step of the pipeline.

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.197.1.dev0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans):
  • Framework version:
  • Python version: 3.11
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y
lorenzwalthert (Author) commented Dec 16, 2023

It seems that the metric definitions have to be supplied to sagemaker.tuner.HyperparameterTuner instead of to the estimator. Thanks to AWS support for providing the solution.
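
A minimal sketch of the fix, reusing the names from the reproduction snippet above (metrics_mapping, model, and hyperparameter_ranges are assumed to be defined as shown there):

tune = sagemaker.tuner.HyperparameterTuner(
    estimator=model,
    objective_metric_name="test:lift-momentum",
    # Passing the metric definitions to the tuner (rather than only to the
    # estimator) makes them appear in the pipeline definition.
    metric_definitions=metrics_mapping,
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=2,
    max_parallel_jobs=1,
)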
