Retry jobs stuck on starting #11761

ArnoudItility · 2023-01-18T16:03:30Z

What's the use case?

When firing off a large number of jobs to AWS ECS, some jobs can get stuck on starting. Using run_monitoring you can catch these runs, after which they are set to failed. Instead of catching them and setting them to failed, it would be valuable to retry them automatically (the run_retries feature restarts all jobs that are classified as failed, which is not desired in the vast majority of cases).

Ideas of implementation

Add a retry option to run_monitoring.
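To make the request concrete, here is a rough sketch of the decision such an option might make for a run that has been sitting in the starting state too long. This is not Dagster code: the `Run` class, `monitor_action`, and the `max_start_retries` parameter are hypothetical names used only for illustration, and the timeout parameter simply stands in for whatever start timeout run_monitoring is configured with.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical stand-in for a run record; not Dagster's actual model."""
    run_id: str
    status: str               # e.g. "STARTING", "STARTED", "FAILURE"
    seconds_in_status: float  # how long the run has been in this status
    start_retries: int = 0    # how many times it was already relaunched


def monitor_action(run: Run, start_timeout_seconds: float, max_start_retries: int) -> str:
    """Return the action a monitor could take: "ignore", "wait", "retry" or "fail"."""
    if run.status != "STARTING":
        # Only runs stuck on starting are of interest here.
        return "ignore"
    if run.seconds_in_status < start_timeout_seconds:
        # Still within the allowed startup window.
        return "wait"
    if run.start_retries < max_start_retries:
        # Timed out while starting: relaunch instead of marking as failed.
        return "retry"
    # Retry budget exhausted: fall back to today's behavior.
    return "fail"


stuck = Run(run_id="abc", status="STARTING", seconds_in_status=600.0)
print(monitor_action(stuck, start_timeout_seconds=180.0, max_start_retries=2))
```

The point of the sketch is that only runs that timed out while starting are retried, so runs that failed for other reasons are left alone.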

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

dpeng817 · 2023-01-19T21:15:02Z

Maybe an easy solution here would be an option to turn auto-retrying off for certain jobs via tags, so that you can opt those jobs out of the retry behavior. cc @johannkm, would something like that be reasonable?
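As a rough illustration of the tag-based opt-out idea, the check could look something like the sketch below. The tag key `example/auto_retry` and the function `should_auto_retry` are made up for this example and are not an existing Dagster tag or API.

```python
from typing import Mapping

# Hypothetical tag key, chosen only for this sketch.
OPT_OUT_TAG = "example/auto_retry"


def should_auto_retry(run_status: str, tags: Mapping[str, str]) -> bool:
    """Retry failed runs unless the job opted out via the (hypothetical) tag."""
    if run_status != "FAILURE":
        # Only failed runs are candidates for an automatic retry.
        return False
    # Retrying stays the default; a job opts out by setting the tag to "false".
    return tags.get(OPT_OUT_TAG, "true").strip().lower() != "false"


print(should_auto_retry("FAILURE", {}))                     # no tag: retried
print(should_auto_retry("FAILURE", {OPT_OUT_TAG: "false"}))  # opted out
```

With something like this, run_retries could stay enabled globally while individual jobs that should not be restarted after an ordinary failure carry the opt-out tag.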

biancarosa · 2023-01-26T14:52:46Z

That'd be highly valuable for us too!

ArnoudItility added the type: feature-request label Jan 18, 2023
