
Out of memory at later stage of training #150

Open

netw0rkf10w opened this issue Jul 14, 2020 · 7 comments

@netw0rkf10w

Hello,

I observed some strange behavior when launching a training run on a server with 4 × 16GB P100 GPUs using:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /path/to/coco

The training went well for 12 epochs, and then in the middle of the 13th epoch it hit an OOM error. Usually memory usage shouldn't change between epochs, but I don't know whether that holds for DETR.

According to the paper, you trained your models using "16 V100 GPUs, with 4 images per GPU (hence a total batch size of 64)". Could you tell me if your GPUs have 16GB or 32GB of memory?

Thanks a lot!

@alcinos
Contributor

alcinos commented Jul 17, 2020

Hi @netw0rkf10w
Thank you for your interest in DETR.

Memory usage variation mainly comes from padding. If, by chance, you get in the same batch a wide horizontal image and a narrow vertical image, then the resulting padded images will be huge (and waste quite a bit of memory). The trainings are seeded, so presumably you will always encounter this "bad batch" at the same point in your 13th epoch.

That being said, your command seems to be using the default batch size, which is 2 per card; that should fit comfortably on a 16GB card. Could you double-check that no other process is using GPU memory on the node? Also, the logs report the peak "max mem" required so far, so you can check whether it's constantly flirting with 16GB (it shouldn't be with batch-size=2).
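
To make this concrete, here is a toy sketch of the padding effect (not the repo's actual collate code, which builds a NestedTensor, but the arithmetic is the same): the batch is padded to the largest height and width it contains, so a landscape plus a portrait image wastes a lot of memory.

import torch

# Hypothetical worst case after resizing with max_size=1333:
imgs = [torch.zeros(3, 800, 1333),   # wide landscape image
        torch.zeros(3, 1333, 800)]   # tall portrait image

# Pad every image to the per-batch maximum height and width.
max_h = max(im.shape[1] for im in imgs)
max_w = max(im.shape[2] for im in imgs)
padded = torch.zeros(len(imgs), 3, max_h, max_w)
for i, im in enumerate(imgs):
    padded[i, :, :im.shape[1], :im.shape[2]] = im

used = sum(im.numel() for im in imgs)
print(padded.shape)                                          # torch.Size([2, 3, 1333, 1333])
print(f"wasted fraction: {1 - used / padded.numel():.2f}")   # ~0.40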

Best of luck.

@lessw2020
Contributor

If it helps, I've monitored memory quite a bit while training: there's a bit of a climb during the first epoch, and after that it's largely stable to within +/- 0.1 GB (i.e. 10.1 GB).
For the R50 model it's around 10.x GB with batch size 2, and 11.x GB with batch size 4 in my training.
Anyway, no memory issues in my experience.
Note that you might want to run nvidia-smi (in a notebook: !nvidia-smi) before starting, to confirm how much GPU memory is actually free.
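
If you'd rather track this from inside the training loop than eyeball nvidia-smi, here's a minimal sketch with plain PyTorch (nothing DETR-specific; the repo's logger already reports "max mem" as noted above):

import torch

def log_peak_memory(epoch: int, step: int) -> None:
    # Peak GPU memory allocated by tensors since the last reset, in MB.
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    print(f"epoch {epoch} step {step}: max mem {peak_mb:.0f} MB")
    # Reset so the next call reports the peak of the next interval only.
    torch.cuda.reset_peak_memory_stats()

Calling this every N iterations makes it easy to spot the occasional "bad batch" spike rather than a gradual leak.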

@netw0rkf10w
Author

Memory usage variation mainly comes from padding. If, by chance, you get in the same batch a wide horizontal image and a narrow vertical image, then the resulting padded images will be huge (and waste quite a bit of memory).

@alcinos That's indeed the issue. I checked the code and realized this too. Clearly this is not optimal. Is there a good reason you guys didn't use some kind of GroupedBatchSampler, as in Detectron, to sample only landscape or only portrait images within a batch?

@lessw2020 My training went well during the first 12 epochs...

I think one should have full control over the training's memory footprint (ideally constant across epochs), so I will try a GroupedBatchSampler for my use case.

@alcinos
Contributor

alcinos commented Jul 26, 2020

@netw0rkf10w it's a bit harder to implement in our case, since the random cropping may change the aspect ratio of the image completely. The torchvision/detectron2 approach of pre-binning the aspect ratios of all the images to create groups thus doesn't work, since it assumes the aspect ratio of each image won't change.
That's why we didn't invest more time in it, and as I said in my previous message, on 16GB cards with the base model and bs<=4 you shouldn't have any issue either.
If you find a way to implement it cleanly, we'd be happy to review and merge a PR though 😄

@netw0rkf10w
Author

netw0rkf10w commented Jul 26, 2020

@alcinos You are right: with the implemented random cropping it doesn't make sense to do grouping based on aspect ratio. However, one can modify e.g. RandomSizeCrop to make it preserve the original orientation (I mean the "binary" aspect ratio, i.e. "portrait" or "landscape"), something like this:

import random

import PIL.Image
import torchvision.transforms as T

from datasets.transforms import crop  # DETR's box-aware crop helper


class RandomSizeCrop(object):
    def __init__(self, min_size: int, max_size: int):
        self.min_size = min_size
        self.max_size = max_size

    def __call__(self, img: PIL.Image.Image, target: dict):
        iw, ih = img.size
        if iw < ih:
            # Portrait input: sample the width first, then a height >= width,
            # so the crop stays portrait.
            w = random.randint(self.min_size, min(iw, self.max_size))
            h = random.randint(w, min(ih, self.max_size))
        else:
            # Landscape (or square) input: sample the height first, then a
            # width >= height, so the crop stays landscape.
            h = random.randint(self.min_size, min(ih, self.max_size))
            w = random.randint(h, min(iw, self.max_size))

        region = T.RandomCrop.get_params(img, [h, w])
        return crop(img, target, region)

Using this in combination with a GroupedBatchSampler, the maximum size of a batch tensor would be batch_size * 3 * 800 * 1333 (or batch_size * 3 * 1333 * 800). In the current implementation it is batch_size * 3 * 1333 * 1333 in the worst case, i.e. roughly 1.67x larger.
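
For reference, here is a rough sketch of the kind of grouped sampler I have in mind (written from scratch, not torchvision's reference GroupedBatchSampler; group_ids would be 0 for landscape and 1 for portrait, computed from the dataset's image sizes):

import random
from collections import Counter, defaultdict

from torch.utils.data import Sampler


class GroupedBatchSampler(Sampler):
    # Yields batches of indices whose samples all share the same group id
    # (e.g. 0 = landscape, 1 = portrait), so padding stays within a group.

    def __init__(self, group_ids, batch_size, shuffle=True):
        self.group_ids = list(group_ids)
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        indices = list(range(len(self.group_ids)))
        if self.shuffle:
            random.shuffle(indices)
        buffers = defaultdict(list)
        for idx in indices:
            buf = buffers[self.group_ids[idx]]
            buf.append(idx)
            if len(buf) == self.batch_size:
                yield buf
                buffers[self.group_ids[idx]] = []
        # Yield the leftover, possibly incomplete batches instead of dropping them.
        for buf in buffers.values():
            if buf:
                yield buf

    def __len__(self):
        counts = Counter(self.group_ids)
        return sum((n + self.batch_size - 1) // self.batch_size
                   for n in counts.values())

It would plug into the DataLoader through the batch_sampler argument; in the distributed setting the indices would have to come from the DistributedSampler rather than range(len(...)), as torchvision's reference implementation does.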

By the way, could you tell me how you came up with the following data augmentation?

T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomSelect(
        T.RandomResize(scales, max_size=1333),
        T.Compose([
            T.RandomResize([400, 500, 600]),
            T.RandomSizeCrop(384, 600),
            T.RandomResize(scales, max_size=1333),
        ])
    ),
    normalize,
])

It doesn't seem that standard to me...

Thank you!

@alcinos
Contributor

alcinos commented Jul 26, 2020

@netw0rkf10w By doing so, you'll also make the cropping augmentation weaker; it remains to be seen how much that affects final performance. For DETR, we know that adding the cropping as we do accounts for 1-2 AP, which is not really negligible.

Cropping is not often used with Faster R-CNN and the like because it doesn't affect them that much (due to the inductive biases built into those methods); that may be why it looks non-standard to you. Basically, 50% of the time our augmentation is the "normal" object-detection augmentation, and the rest of the time we resize to a sensible size, then crop, then resize as if there were no crop; it's fairly straightforward. Tuning the cropping probability to something other than 50% was experimentally detrimental on COCO, and slightly changing the cropping scheme didn't seem to have any impact.
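
Concretely, the 50% branching is just a selector transform that picks one of the two pipelines per sample; a rough sketch along these lines (see datasets/transforms.py for the real RandomSelect):

import random

class RandomSelect:
    # Apply transforms1 with probability p, otherwise transforms2.
    # With p=0.5 this gives the 50/50 split described above.

    def __init__(self, transforms1, transforms2, p=0.5):
        self.transforms1 = transforms1
        self.transforms2 = transforms2
        self.p = p

    def __call__(self, img, target):
        if random.random() < self.p:
            return self.transforms1(img, target)
        return self.transforms2(img, target)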

@netw0rkf10w
Author

netw0rkf10w commented Jul 26, 2020

@alcinos Great, thanks for the information. I may be able to try a GroupedBatchSampler sometime during the summer; I'll keep you updated on the results ;)
