Retry jobs stuck on starting #11761

Open
ArnoudItility opened this issue Jan 18, 2023 · 2 comments

Comments

@ArnoudItility

What's the use case?

When launching a large number of jobs on AWS ECS, some runs can get stuck in the starting state. With run_monitoring you can catch these runs, after which they are set to failed. Instead of only catching them and setting them to failed, it would be valuable to retry them automatically. The run_retries feature does not cover this case well, because it restarts all jobs that are classified as failed, which is not desired in the vast majority of cases.

Ideas of implementation

Add a retry option to run_monitoring, so that only runs caught as stuck on starting are retried.
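
To make the request concrete, here is a minimal sketch of the decision such an option would have to make. It is plain Python, not Dagster's API: `FailureReason`, `FailedRun`, `should_retry`, and the `max_retries` default are all hypothetical names chosen for illustration. The idea is that a run is retried only when it failed because it never left the starting state, and only while it still has retries left.

```python
from dataclasses import dataclass
from enum import Enum


class FailureReason(Enum):
    """Hypothetical classification of why a run ended up failed."""

    START_TIMEOUT = "start_timeout"  # run never left the starting state
    STEP_FAILURE = "step_failure"    # run started, then user code failed


@dataclass
class FailedRun:
    """Hypothetical record of a failed run (not a Dagster type)."""

    run_id: str
    reason: FailureReason
    retries_used: int = 0  # how many times this run was already retried


def should_retry(run: FailedRun, max_retries: int = 3) -> bool:
    """Retry only runs that timed out while starting, up to max_retries."""
    if run.reason is not FailureReason.START_TIMEOUT:
        # Ordinary failures are left alone, unlike a blanket retry policy.
        return False
    return run.retries_used < max_retries
```

With this rule, a run that timed out on starting is retried, while a run that failed inside user code stays failed.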

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@dpeng817
Contributor

Maybe an easy solution here would be an option to turn auto-retrying off for certain jobs via tags, so that you can opt those jobs out of the retry behavior. cc @johannkm would something like that be reasonable?
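
A rough sketch of how a tag-based opt-out could be evaluated, assuming job tags are available as a string-to-string mapping. The tag key `example/auto_retry` and the helper `auto_retry_enabled` are hypothetical and are not confirmed Dagster names.

```python
from typing import Mapping

# Hypothetical tag key, chosen for illustration only.
RETRY_OPT_OUT_TAG = "example/auto_retry"


def auto_retry_enabled(job_tags: Mapping[str, str]) -> bool:
    """Jobs are retried by default; a tag value of "false" opts a job out."""
    value = job_tags.get(RETRY_OPT_OUT_TAG, "true")
    return value.strip().lower() != "false"
```

Under this sketch, a job with no tag keeps the default retry behavior, and a job tagged `{"example/auto_retry": "false"}` is skipped by the retry logic.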

@biancarosa

That'd be highly valuable for us too!

3 participants