Terminating AWS Batch jobs broken #107

Closed
lewfish opened this issue Aug 29, 2017 · 6 comments

lewfish (Contributor) commented Aug 29, 2017

If you terminate an AWS Batch job via the console or the AWS CLI, it doesn't actually stop. This happens when running the train_ec2.sh script. The only way to kill the job is to terminate the underlying spot instance. We think this is due to a bug in Batch, and we should submit a bug report to AWS.
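For reference, this is roughly the CLI call that fails to actually stop the job (a sketch; $JOB_ID stands in for a real Batch job ID):

# Request termination of a running Batch job.
aws batch terminate-job --job-id "$JOB_ID" --reason "Stopping training run"
# After this, the container on the host keeps running.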

hectcastro added this to the Operations Sprint: 9/1-9/14 milestone Aug 30, 2017
hectcastro commented:

This could be related to moby/moby#34213.

lewfish (Contributor, Author) commented Aug 30, 2017

This just happened for a job (job-id=9739ed8b-8125-4e75-9efb-fcf1d89f5c44) I ran today, which was still running 1.5 hours after I initiated termination via the console. The command was

run_script.sh, lf/train-ships, /opt/src/detection/scripts/train_ec2.sh --config-path /opt/src/detection/configs/ships/ssd_mobilenet_v1.config --train-id ships1 --dataset-id singapore_ships_chips_tiny --model-id ssd_mobilenet_v1_coco_11_06_2017

and the container was 279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest running in the raster-vision-gpu queue.
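To double-check that a job is still active after requesting termination, its status can be queried like this (a sketch; assumes the AWS CLI is configured for the right account and region):

aws batch describe-jobs --jobs 9739ed8b-8125-4e75-9efb-fcf1d89f5c44 \
    --query 'jobs[0].status' --output text
# In this case the job was still running 1.5 hours after termination was initiated.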

lewfish (Contributor, Author) commented Aug 31, 2017

The easiest workaround for now is to kill the container running on the host.

[ec2-user@ip-172-31-53-167 ~]$ sudo docker ps
CONTAINER ID        IMAGE                                                                   COMMAND                  CREATED             STATUS              PORTS               NAMES
c98db2c40493        279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest   "bash run_script.sh l"   18 minutes ago      Up 18 minutes                           ecs-raster-vision-gpu-3-default-90f9cfa59cb5d09bbe01
4d7350e07e59        amazon/amazon-ecs-agent:latest                                          "/agent"                 34 minutes ago      Up 34 minutes                           ecs-agent
[ec2-user@ip-172-31-53-167 ~]$ sudo docker kill c98db2c40493
c98db2c40493
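If there are other containers on the host, the job's container can also be picked out by image instead of copying the ID by hand (a small sketch, not part of the original workaround):

# Kill whichever container was started from the raster-vision-gpu image.
sudo docker ps -q \
    --filter ancestor=279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest \
    | xargs sudo docker kill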

tnation14 self-assigned this Aug 31, 2017
hectcastro removed this from the Operations Sprint: 9/1-9/14 milestone Aug 31, 2017
lewfish added the bug label Sep 1, 2017
lewfish (Contributor, Author) commented Sep 1, 2017

If you terminate a job using the above approach, it will get retried if its number of attempts is greater than 1. Since it's already a pain to terminate things, and resuming training is broken as described in #106, we should remember to submit jobs with --attempts 1 so there is no retry.
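With the AWS CLI, that corresponds to setting the retry strategy at submission time, roughly like this (a sketch; the job name and $JOB_DEF are placeholders, and our own submit wrapper may expose this as --attempts instead):

aws batch submit-job \
    --job-name train-ships \
    --job-queue raster-vision-gpu \
    --job-definition "$JOB_DEF" \
    --retry-strategy attempts=1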

tnation14 added this to the Operations Sprint: 9/1-9/14 milestone Sep 13, 2017
hectcastro modified the milestones: Operations Sprint: 9/1-9/14, Operations Sprint: 9/15-9/28 Sep 14, 2017
tnation14 (Contributor) commented:

I think part of the reason we aren't able to stop Batch jobs is that when the job receives a SIGINT from the Batch console, the signal isn't propagated to the background processes running in the container. Those processes keep running, so bash can't exit and end the job. I was able to kill Batch jobs from the console using the changes I made to train_ec2.sh on my branch feature/tnation/batch-job-termination. However, once the job termination issue was fixed, I noticed that if any of the background processes (i.e. train.py and eval.py) exits without being killed, the Batch job still hangs until it's terminated manually. I think that's because the EXIT signal from the background process isn't being passed back up to the script.
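For reference, the essential change is to trap the termination signal in train_ec2.sh and forward it to the background PIDs, roughly like this (a simplified sketch, not the exact contents of the branch; script paths and arguments are illustrative):

#!/bin/bash
# Start training and evaluation in the background and remember their PIDs.
python /opt/src/detection/scripts/train.py "$@" &
TRAIN_PID=$!
python /opt/src/detection/scripts/eval.py "$@" &
EVAL_PID=$!

# Forward SIGTERM/SIGINT from Batch to the background processes so bash can exit.
cleanup() {
    kill -TERM "$TRAIN_PID" "$EVAL_PID" 2>/dev/null
    wait "$TRAIN_PID" "$EVAL_PID" 2>/dev/null
}
trap cleanup SIGTERM SIGINT

# Without the trap above, a signal delivered to this shell never reaches the
# background processes, and the job hangs until the instance is killed.
wait "$TRAIN_PID" "$EVAL_PID"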

It seems like managing both job control and model training with train_ec2.sh will require a lot of nonstandard scripting to make it work the way it should; it may be worth breaking this task up into multiple, dependent Batch jobs. One job could run train.py, and the other could run eval.py. If we use either EFS or S3 to store the necessary shared files, we'll be able to create a more robust process that keeps us from having to do process management and error checking in a shell script.
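One way to express that kind of dependency with the AWS CLI (a sketch; job names are placeholders and $TRAIN_JOB_DEF / $EVAL_JOB_DEF stand in for real job definitions):

# Submit the training job and capture its ID.
TRAIN_JOB_ID=$(aws batch submit-job \
    --job-name train-ships \
    --job-queue raster-vision-gpu \
    --job-definition "$TRAIN_JOB_DEF" \
    --query 'jobId' --output text)

# The evaluation job only starts once the training job has finished.
aws batch submit-job \
    --job-name eval-ships \
    --job-queue raster-vision-gpu \
    --job-definition "$EVAL_JOB_DEF" \
    --depends-on jobId=$TRAIN_JOB_ID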

tnation14 (Contributor) commented:

Since we've identified the root cause of the job failures (and train_ec2.sh is being rewritten soon), I'm going to close this.
