Out of memory when restarting training process #106

lewfish · 2017-08-29T22:14:13Z

In theory, if you run train_ec2.sh, exit before training completes, and then restart the job, it should pick up where it left off. In practice this doesn't work: on the second run, TensorFlow (TF) emits an out-of-memory error. We should isolate the exact conditions under which this occurs and file an issue in the TF Object Detection repo, after first checking whether one already exists there.

lewfish · 2018-07-26T16:43:58Z

This appears to be working locally now. We should check this on EC2.

lewfish added the detection label Aug 29, 2017

lewfish changed the title ~~Fix problem with restarting training process~~ Out of memory when restarting training process Aug 29, 2017

lewfish added the bug label Aug 30, 2017

lewfish mentioned this issue Sep 1, 2017

Terminating AWS Batch jobs broken #107

Closed

lewfish removed the object-detection label May 23, 2018

lewfish added backlog and removed backlog labels Jun 5, 2018

lewfish added the priority label Jul 13, 2018

lossyrob closed this as completed Sep 26, 2018

lossyrob removed the priority label Sep 26, 2018

lewfish mentioned this issue Sep 26, 2018

Make sure training resumes for restarted jobs #422

Closed
