Feature request: allowing for Retry to work with SageMaker steps #140

Open
lucaboulard opened this issue Jun 7, 2021 · 5 comments
Comments

@lucaboulard

Currently, the Retry mechanism does not work with TrainingStep and ProcessingStep, because the full job name must be passed to the step constructor. If the step fails after the job has already been created, every retry fails to submit the job because the job name is already in use.
This happens for almost any error (including capacity errors), with the exception of throttling errors.

A possible solution might be to add an alternative parameter to specify a job name prefix, instead of a full name, and let SageMaker add some random suffix.

@wong-a
Contributor

wong-a commented Jun 17, 2021

Interesting, I think that's a feature the Step Functions or SageMaker service needs to support. Step Functions will retry with the same parameters.

A workaround that could be done today is to catch errors, go to another step that creates a new job name, then go back to the TrainingStep which reads the JobName from StepInput. Crude ASCII diagram:

                 -----> [Actual next state if successful]
                /
[TrainingStep] - Catch -> [Step That Generates New Job Name]
       ^                                        /
        \____________________________________ /
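As a rough sketch of the loop in the diagram, here is what the state machine could look like in raw Amazon States Language, built as a plain Python dict. The state names, the `train-` prefix, and the use of `States.UUID()` to produce the new name are assumptions for illustration; the remaining CreateTrainingJob parameters are omitted, so this is not a deployable definition.

```python
import json

# Hypothetical state names. Other CreateTrainingJob parameters
# (algorithm, role, data channels, ...) are left out for brevity.
definition = {
    "StartAt": "GenerateJobName",
    "States": {
        # Produces a fresh job name each time it is entered.
        "GenerateJobName": {
            "Type": "Pass",
            "Parameters": {
                "JobName.$": "States.Format('train-{}', States.UUID())"
            },
            "Next": "Train",
        },
        # Reads the job name from its input; on any error, loops back
        # to generate a new name instead of retrying with the old one.
        "Train": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {"TrainingJobName.$": "$.JobName"},
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "GenerateJobName"}],
            "Next": "NextState",
        },
        "NextState": {"Type": "Succeed"},
    },
}

print(json.dumps(definition, indent=2))
```

As written, the loop has no upper bound: a job that always fails would be resubmitted forever, so a real workflow would likely need an attempt counter and a Choice state to give up after a few tries.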

Or perhaps RetryCount from the Context Object could be used with States.Format to create a new job name on each retry:

@rodrick10

Any update?

@lasdem

lasdem commented Sep 1, 2022

I found a workaround for this.
You can override the job name through the parameters argument and build it from context object fields, including the retry count, so that the name stays unique even across retries.

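import sagemaker
from stepfunctions import steps
from stepfunctions.inputs import ExecutionInput

# xgb, train_s3_file, validation_s3_file and default_retryer are defined elsewhere.
training_step = steps.TrainingStep(
    "Train Step",
    estimator=xgb,
    data={
        "train": sagemaker.TrainingInput(train_s3_file, content_type="application/x-parquet"),
        "validation": sagemaker.TrainingInput(validation_s3_file, content_type="application/x-parquet"),
    },
    # job_name is mandatory, so pass a placeholder; it is overridden below.
    job_name=ExecutionInput()["dummy"],
    parameters={
        # Build a unique name from context object fields, including the retry count.
        "TrainingJobName.$": "States.Format('{}-{}-{}', $$.StateMachine.Name, $$.Execution.Name, $$.State.RetryCount)",
    },
    retry=default_retryer,
)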

Please note that I set job_name to ExecutionInput()["dummy"] because it is a mandatory field. It is then overwritten by the TrainingJobName from the parameters.

@keithleungwork

Is this feature being implemented?

I am facing the same issue, although not in connection with retries.
If job_name is set to a fixed string, the created state machine can only be executed once: if you open the workflow in the AWS Step Functions console and start it again, the training step reuses the same job name and fails.

This means the only option at the moment is ExecutionInput, which is not user-friendly because the user should not have to enter the job name manually.

@lasdem

lasdem commented Oct 19, 2023

> Is this feature being implemented?
>
> I am facing the same issue, although not related to retry. If we set a string to the job_name, the created SFN can only be executed once. i.e. If you access AWS SFN UI, you cannot execute the created workflow again. Because the training step here will use the same job name every time.
>
> Which means we can only use ExecutionInput at the moment, but it is not user-friendly because it is not necessary for the user to input the job name manually.

In my example above, ExecutionInput is not actually used; it is literally a dummy.
The important part is the section below:

parameters={
    "TrainingJobName.$": "States.Format('{}-{}-{}', $$.StateMachine.Name, $$.Execution.Name, $$.State.RetryCount)",
},

The TrainingJobName is overwritten with the value provided here, which combines the name of the state machine, the execution name, and the retry count. This generates a new name for every execution and every retry.
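One thing worth checking with this approach is that the composed name still satisfies SageMaker's job name rules (at most 63 characters; letters, digits, and hyphens only; no leading or trailing hyphen). The sketch below imitates the States.Format expression locally in plain Python; the helper names and sample values are hypothetical and only meant to illustrate the check.

```python
import re

# SageMaker job name rule: alphanumerics and hyphens, starting and ending
# with an alphanumeric character, at most 63 characters in total.
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")
MAX_JOB_NAME_LENGTH = 63


def format_job_name(state_machine: str, execution: str, retry_count: int) -> str:
    """Local stand-in for States.Format('{}-{}-{}', ...) (hypothetical helper)."""
    return f"{state_machine}-{execution}-{retry_count}"


def is_valid_job_name(name: str) -> bool:
    """Check the length limit and the allowed character pattern."""
    return (
        len(name) <= MAX_JOB_NAME_LENGTH
        and JOB_NAME_PATTERN.fullmatch(name) is not None
    )


# Sample values: a UUID-style execution name is 36 characters long.
execution = "3f2b8c1e-9d4a-4e6f-8a7b-2c5d1e0f9a6b"

# Each retry attempt yields a different name for the same execution.
names = [format_job_name("xgb-train", execution, n) for n in range(3)]
for name in names:
    print(name, is_valid_job_name(name))

# A long state machine name pushes the result past 63 characters.
too_long = format_job_name("my-very-long-training-state-machine", execution, 0)
print(len(too_long), is_valid_job_name(too_long))
```

If the state machine name is long or contains characters that are not allowed in job names (underscores, for example), dropping it from the format string and keeping only the execution name and retry count is one way to stay within the limit.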

5 participants