Terminating AWS Batch jobs broken #107
Comments
This could be related to moby/moby#34213.
This just happened for a job (job-id=9739ed8b-8125-4e75-9efb-fcf1d89f5c44) I ran today: it was still running 1.5 hours after I initiated termination via the console. The command was
and the container was
The easiest workaround for now is to kill the container running on the host.
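A minimal sketch of that workaround, assuming you have shell access to the host instance and Docker is on the PATH. The helper names (`find_container_ids`, `kill_containers`) and the image-name filter are hypothetical and not part of this project. It only uses the standard `docker ps --format` and `docker kill` commands, and defaults to a dry run so nothing is killed by accident.

```python
import subprocess
from typing import List


def find_container_ids(ps_output: str, image_substring: str) -> List[str]:
    """Return IDs of containers whose image name contains image_substring.

    ps_output is expected to look like the output of
    `docker ps --format '{{.ID}} {{.Image}}'`: one "ID IMAGE" pair per line.
    """
    ids = []
    for line in ps_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2 and image_substring in parts[1]:
            ids.append(parts[0])
    return ids


def kill_containers(image_substring: str, dry_run: bool = True) -> List[List[str]]:
    """Build (and optionally run) `docker kill` for each matching container.

    Hypothetical helper: the image substring is whatever identifies the
    job's container on this host.
    """
    ps = subprocess.run(
        ["docker", "ps", "--format", "{{.ID}} {{.Image}}"],
        capture_output=True, text=True, check=True,
    )
    commands = [["docker", "kill", cid]
                for cid in find_container_ids(ps.stdout, image_substring)]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `kill_containers("my-image")` prints nothing and returns the commands it would run; pass `dry_run=False` to actually kill the containers.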
If you terminate a job using the above approach, it will get retried if
I think the reason that we aren't able to stop Batch Jobs is partly because when the Job receives a
It seems like managing both job control and model training with
Since we've identified the root cause for the job failures (and
Terminating an AWS Batch job via the console or the AWS CLI doesn't work when the job is running the train_ec2.sh script. The only way to kill the job is to kill the underlying spot instance. We think this is due to a bug in Batch, and we should submit a bug report to AWS.
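For anyone trying to reproduce this outside the console, here is a rough sketch using boto3's Batch client (`terminate_job` and `describe_jobs`). The function name `terminate_and_check` is hypothetical, and the client is passed in so the logic can be exercised without AWS credentials. It only requests termination and reports the status Batch returns; it does not fix the underlying problem described above.

```python
import sys


def terminate_and_check(client, job_id, reason="Terminated by user"):
    """Request termination of a Batch job and return its reported status.

    `client` is assumed to be a boto3 Batch client (or anything exposing
    the same terminate_job / describe_jobs methods).
    Returns None if Batch reports no job with that ID.
    """
    client.terminate_job(jobId=job_id, reason=reason)
    response = client.describe_jobs(jobs=[job_id])
    jobs = response.get("jobs", [])
    return jobs[0]["status"] if jobs else None


if __name__ == "__main__":
    import boto3  # imported here so the function above has no hard dependency

    # Usage: python terminate_job.py <job-id>
    print(terminate_and_check(boto3.client("batch"), sys.argv[1]))
```

If the status stays at RUNNING long after the call, you are likely seeing the behavior reported in this issue.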