Out of memory at later stage of training #150
Hi @netw0rkf10w, memory usage variation mainly comes from padding. If by chance you get a wide horizontal image and a narrow vertical image in the same batch, the padded images will be huge (and waste quite a bit of memory). The trainings are seeded, so presumably you will always encounter this "bad batch" at the same point in your 13th epoch. That being said, your command seems to be using the default batch size, which is 2 per card; it should amply fit on a 16GB card. Could you double-check that no other process is using GPU memory on the node? Also, the logs provide the current "max mem" that has been required, so you can check whether it's constantly flirting with 16GB (it shouldn't with batch_size=2). Best of luck.
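For intuition, here is a minimal sketch (not DETR's actual batching code) of how padding every image in a batch to the batch's maximum height and width can blow up memory when a wide and a tall image land together:

```python
import torch

def pad_to_max(images):
    # Pad every image to the max height/width of the batch, similar in spirit
    # to how variable-sized images are batched together for DETR.
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    batch = torch.zeros(len(images), 3, max_h, max_w)
    for i, img in enumerate(images):
        batch[i, :, :img.shape[1], :img.shape[2]] = img
    return batch

# A wide image and a tall image end up in the same batch:
wide = torch.rand(3, 480, 1333)
tall = torch.rand(3, 1333, 480)
batch = pad_to_max([wide, tall])
print(batch.shape)  # torch.Size([2, 3, 1333, 1333]) -- roughly 2.8x the pixels actually used
```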
If it helps, I've monitored memory quite a bit while training and you can see a bit of a climb during the first epoch; after that it's largely stable within ±0.1 GB (i.e. around 10.1 GB).
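If you want to reproduce this kind of monitoring, a small helper like the following (illustrative only, not part of the repo) can be called at the end of each epoch to track the per-epoch peak:

```python
import torch

def log_peak_memory(epoch):
    # Peak GPU memory allocated by tensors since the last reset, in GB.
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"epoch {epoch}: max mem {peak_gb:.2f} GB")
    # Reset so the next epoch reports its own peak rather than the global one.
    torch.cuda.reset_peak_memory_stats()
```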
@alcinos That's indeed the issue. I checked the code and realized this too. Clearly this is not optimal. Is there a good reason you guys didn't use some kind of aspect-ratio grouping?

@lessw2020 My training went well during the first 12 epochs... I think one should have total control of the training's memory footprint (ideally constant across epochs), so I will try to achieve that.
@netw0rkf10w It's a bit harder to implement in our case, since the random cropping may change the aspect ratio of the image completely. The torchvision/detectron2 approach of pre-binning the aspect ratios of all the images to create groups thus doesn't work, since it assumes the aspect ratio of each image won't change.
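For reference, the pre-binning idea being referred to looks roughly like the sketch below (a simplified stand-in for torchvision's grouped batch sampler, binning only by orientation). As noted above, it only makes sense if each image's aspect ratio is fixed during training:

```python
from collections import defaultdict

def group_by_orientation(aspect_ratios, batch_size):
    # Bin image indices into two groups (landscape vs. portrait) and build
    # batches within each group, so images in a batch have similar padded shapes.
    groups = defaultdict(list)
    for idx, ar in enumerate(aspect_ratios):
        groups[ar >= 1.0].append(idx)  # True: landscape, False: portrait
    batches = []
    for indices in groups.values():
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])
    return batches

# Example: aspect ratios (w/h) of four images, batch size 2.
print(group_by_orientation([1.5, 0.7, 1.3, 0.6], batch_size=2))
# [[0, 2], [1, 3]] -- landscape images batched together, portrait images together
```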
@alcinos You are right, with the implemented random cropping it doesn't make sense to do grouping based on the aspect ratio. However, one can modify `RandomSizeCrop` so that the crop keeps the orientation of the input image, e.g.:

```python
# Drop-in modification of RandomSizeCrop in datasets/transforms.py
# (uses the module's existing `random`, `PIL`, `T`, and `crop`).
class RandomSizeCrop(object):
    def __init__(self, min_size: int, max_size: int):
        self.min_size = min_size
        self.max_size = max_size

    def __call__(self, img: PIL.Image.Image, target: dict):
        iw, ih = img.size
        if iw < ih:
            # Portrait image: keep the crop at least as tall as it is wide.
            w = random.randint(self.min_size, min(img.width, self.max_size))
            h = random.randint(w, min(img.height, self.max_size))
        else:
            # Landscape image: keep the crop at least as wide as it is tall.
            h = random.randint(self.min_size, min(img.height, self.max_size))
            w = random.randint(h, min(img.width, self.max_size))
        region = T.RandomCrop.get_params(img, [h, w])
        return crop(img, target, region)
```

Using this in combination with aspect-ratio grouping could help keep the memory footprint under control.

By the way, could you tell me how you came up with the following implemented data augmentation?

```python
T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomSelect(
        T.RandomResize(scales, max_size=1333),
        T.Compose([
            T.RandomResize([400, 500, 600]),
            T.RandomSizeCrop(384, 600),
            T.RandomResize(scales, max_size=1333),
        ])
    ),
    normalize,
])
```

It doesn't seem that standard to me... Thank you!
@netw0rkf10w By doing so, you'll also make the cropping augmentation weaker; it remains to be seen how much that affects final performance. For DETR, we know that adding the cropping as we do accounts for 1-2 AP, which is not really negligible. Cropping is not used very often by Faster R-CNN and the like because it doesn't affect them that much (due to the inductive biases used in those methods); that may be why it looks non-standard to you. Basically, our augmentation is 50% of the time the "normal" object-detection augmentation, and the rest of the time we apply a resize to a sensible size, then crop, then resize as if there was no crop; it's fairly straightforward. Tuning the probability of cropping to something other than 50% is experimentally detrimental on COCO, and trying to change the cropping scheme a bit didn't seem to have any impact.
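For anyone who still wants to experiment with that probability anyway: assuming `T.RandomSelect` accepts a probability argument `p` for its first branch (worth double-checking against `datasets/transforms.py` in your version of the repo), the knob would be set roughly like this; per the comment above, values other than 0.5 were found detrimental on COCO:

```python
# Hypothetical variant: pick the plain multi-scale resize 70% of the time
# and the resize -> crop -> resize branch 30% of the time.
T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomSelect(
        T.RandomResize(scales, max_size=1333),
        T.Compose([
            T.RandomResize([400, 500, 600]),
            T.RandomSizeCrop(384, 600),
            T.RandomResize(scales, max_size=1333),
        ]),
        p=0.7,  # assumed keyword; 0.5 is the default used by DETR
    ),
    normalize,
])
```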
@alcinos Great. Thanks for the information. I may be able to try it out.
Hello,
I observed some strange behavior when launching a training on a server of 4 16GB P100 GPUs using:
```
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /path/to/coco
```
The training went well for 12 epochs and then, in the middle of the 13th epoch, it hit an OOM error. Usually memory usage shouldn't change between epochs, but for DETR I don't know whether this is the case.
According to the paper, you trained your models using "16 V100 GPUs, with 4 images per GPU (hence a total batch size of 64)". Could you tell me if your GPUs have 16GB or 32GB of memory?
Thanks a lot!