Terminating AWS Batch jobs broken #107

Closed
lewfish opened this issue Aug 29, 2017 · 6 comments

lewfish (Contributor) commented Aug 29, 2017

If you terminate an AWS Batch job via the console or the AWS CLI, it doesn't actually stop. This happens when running the train_ec2.sh script. The only way to kill the job is to terminate the underlying spot instance. We think this is due to a bug in Batch, and we should submit a bug report to AWS.
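For reference, this is roughly the CLI call that fails to actually stop the job (a sketch; $JOB_ID stands in for a real Batch job ID):

# Request termination of a running Batch job.
aws batch terminate-job --job-id "$JOB_ID" --reason "Stopping training run"
# After this, the container on the host keeps running.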

hectcastro added this to the Operations Sprint: 9/1-9/14 milestone Aug 30, 2017
hectcastro commented:

This could be related to moby/moby#34213.

lewfish (Contributor, Author) commented Aug 30, 2017

This just happened for a job (job-id=9739ed8b-8125-4e75-9efb-fcf1d89f5c44) I ran today, which was still running 1.5 hours after I initiated termination via the console. The command was

run_script.sh, lf/train-ships, /opt/src/detection/scripts/train_ec2.sh --config-path /opt/src/detection/configs/ships/ssd_mobilenet_v1.config --train-id ships1 --dataset-id singapore_ships_chips_tiny --model-id ssd_mobilenet_v1_coco_11_06_2017

and the container was 279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest running in the raster-vision-gpu queue.
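To double-check that a job is still active after requesting termination, its status can be queried like this (a sketch; assumes the AWS CLI is configured for the right account and region):

aws batch describe-jobs --jobs 9739ed8b-8125-4e75-9efb-fcf1d89f5c44 \
    --query 'jobs[0].status' --output text
# In this case the job was still running 1.5 hours after termination was initiated.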

lewfish (Contributor, Author) commented Aug 31, 2017

The easiest workaround for now is to kill the container running on the host.

[ec2-user@ip-172-31-53-167 ~]$ sudo docker ps
CONTAINER ID        IMAGE                                                                   COMMAND                  CREATED             STATUS              PORTS               NAMES
c98db2c40493        279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest   "bash run_script.sh l"   18 minutes ago      Up 18 minutes                           ecs-raster-vision-gpu-3-default-90f9cfa59cb5d09bbe01
4d7350e07e59        amazon/amazon-ecs-agent:latest                                          "/agent"                 34 minutes ago      Up 34 minutes                           ecs-agent
[ec2-user@ip-172-31-53-167 ~]$ sudo docker kill c98db2c40493
c98db2c40493
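If there are other containers on the host, the job's container can also be picked out by image instead of copying the ID by hand (a small sketch, not part of the original workaround):

# Kill whichever container was started from the raster-vision-gpu image.
sudo docker ps -q \
    --filter ancestor=279682201306.dkr.ecr.us-east-1.amazonaws.com/raster-vision-gpu:latest \
    | xargs sudo docker kill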

tnation14 self-assigned this Aug 31, 2017
hectcastro removed this from the Operations Sprint: 9/1-9/14 milestone Aug 31, 2017
lewfish added the bug label Sep 1, 2017
lewfish (Contributor, Author) commented Sep 1, 2017

If you terminate a job using the above approach, it will get retried if its number of attempts is greater than 1. Since it's already a pain to terminate things, and resuming training is broken as described in #106, we should remember to submit jobs with --attempts 1 so there is no retry.
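With the AWS CLI, that corresponds to setting the retry strategy at submission time, roughly like this (a sketch; the job name and $JOB_DEF are placeholders, and our own submit wrapper may expose this as --attempts instead):

aws batch submit-job \
    --job-name train-ships \
    --job-queue raster-vision-gpu \
    --job-definition "$JOB_DEF" \
    --retry-strategy attempts=1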

tnation14 added this to the Operations Sprint: 9/1-9/14 milestone Sep 13, 2017
hectcastro modified the milestones: Operations Sprint: 9/1-9/14, Operations Sprint: 9/15-9/28 Sep 14, 2017
tnation14 (Contributor) commented:

I think part of the reason we aren't able to stop Batch jobs is that when the job receives a SIGINT from the Batch console, the signal isn't propagated to the background processes running in the container. Those processes keep running, so bash can't exit and end the job. I was able to kill Batch jobs from the console using the changes I made to train_ec2.sh on my branch feature/tnation/batch-job-termination. However, once the job termination issue was fixed, I noticed that if any of the background processes (i.e. train.py and eval.py) exits without being killed, the Batch job still hangs until it's terminated manually. I think that's because the EXIT signal from the background process isn't being passed back up to the script.
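For reference, the essential change is to trap the termination signal in train_ec2.sh and forward it to the background PIDs, roughly like this (a simplified sketch, not the exact contents of the branch; script paths and arguments are illustrative):

#!/bin/bash
# Start training and evaluation in the background and remember their PIDs.
python /opt/src/detection/scripts/train.py "$@" &
TRAIN_PID=$!
python /opt/src/detection/scripts/eval.py "$@" &
EVAL_PID=$!

# Forward SIGTERM/SIGINT from Batch to the background processes so bash can exit.
cleanup() {
    kill -TERM "$TRAIN_PID" "$EVAL_PID" 2>/dev/null
    wait "$TRAIN_PID" "$EVAL_PID" 2>/dev/null
}
trap cleanup SIGTERM SIGINT

# Without the trap above, a signal delivered to this shell never reaches the
# background processes, and the job hangs until the instance is killed.
wait "$TRAIN_PID" "$EVAL_PID"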

It seems like managing both job control and model training with train_ec2.sh will require a lot of nonstandard scripting to make it work the way it should; it may be worth breaking this task up into multiple, dependent Batch jobs. One job could run train.py, and the other could run eval.py. If we use either EFS or S3 to store the necessary shared files, we'll be able to create a more robust process that keeps us from having to do process management and error checking in a shell script.
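One way to express that kind of dependency with the AWS CLI (a sketch; job names are placeholders and $TRAIN_JOB_DEF / $EVAL_JOB_DEF stand in for real job definitions):

# Submit the training job and capture its ID.
TRAIN_JOB_ID=$(aws batch submit-job \
    --job-name train-ships \
    --job-queue raster-vision-gpu \
    --job-definition "$TRAIN_JOB_DEF" \
    --query 'jobId' --output text)

# The evaluation job only starts once the training job has finished.
aws batch submit-job \
    --job-name eval-ships \
    --job-queue raster-vision-gpu \
    --job-definition "$EVAL_JOB_DEF" \
    --depends-on jobId=$TRAIN_JOB_ID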

tnation14 (Contributor) commented:

Since we've identified the root cause of the job failures (and train_ec2.sh is being rewritten soon), I'm going to close this.
