
Out of memory at later stage of training #150

Open

netw0rkf10w opened this issue Jul 14, 2020 · 7 comments

@netw0rkf10w

Hello,

I observed some strange behavior when launching a training run on a server with 4 × 16GB P100 GPUs using:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /path/to/coco

The training went well for 12 epochs, and then in the middle of the 13th epoch it hit an OOM error. Usually memory usage shouldn't change between epochs, but I don't know whether that holds for DETR.

According to the paper, you trained your models using "16 V100 GPUs, with 4 images per GPU (hence a total batch size of 64)". Could you tell me if your GPUs have 16GB or 32GB of memory?

Thanks a lot!

@alcinos
Contributor

alcinos commented Jul 17, 2020

Hi @netw0rkf10w
Thank you for your interest in DETR.

Memory usage variation mainly comes from padding. If, by chance, you get in the same batch a wide horizontal image and a narrow vertical image, then the resulting padded images will be huge (and waste quite a bit of memory). The trainings are seeded, so presumably you will always encounter this "bad batch" at the same point in your 13th epoch.

That being said, your command seems to be using the default batch size, which is 2 per card; that should fit comfortably on a 16GB card. Could you double-check that no other process is using GPU memory on the node? Also, the logs report the peak "max mem" required so far, so you can check whether it's constantly flirting with 16GB (it shouldn't be with batch-size=2).
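
To make this concrete, here is a toy sketch of the padding effect (not the repo's actual collate code, which builds a NestedTensor, but the arithmetic is the same): the batch is padded to the largest height and width it contains, so a landscape plus a portrait image wastes a lot of memory.

import torch

# Hypothetical worst case after resizing with max_size=1333:
imgs = [torch.zeros(3, 800, 1333),   # wide landscape image
        torch.zeros(3, 1333, 800)]   # tall portrait image

# Pad every image to the per-batch maximum height and width.
max_h = max(im.shape[1] for im in imgs)
max_w = max(im.shape[2] for im in imgs)
padded = torch.zeros(len(imgs), 3, max_h, max_w)
for i, im in enumerate(imgs):
    padded[i, :, :im.shape[1], :im.shape[2]] = im

used = sum(im.numel() for im in imgs)
print(padded.shape)                                          # torch.Size([2, 3, 1333, 1333])
print(f"wasted fraction: {1 - used / padded.numel():.2f}")   # ~0.40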

Best of luck.

@lessw2020
Contributor

If it helps, I've monitored memory quite a bit while training: there's a bit of a climb during the first epoch, and after that it's largely stable to within +/- 0.1 GB (i.e. 10.1 GB).
For the R50 model it's around 10.x GB with batch size 2, and 11.x GB with batch size 4 in my training.
Anyway, no memory issues in my experience.
Note that you might want to run nvidia-smi (in a notebook: !nvidia-smi) before starting, to confirm how much GPU memory is actually free.
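
If you'd rather track this from inside the training loop than eyeball nvidia-smi, here's a minimal sketch with plain PyTorch (nothing DETR-specific; the repo's logger already reports "max mem" as noted above):

import torch

def log_peak_memory(epoch: int, step: int) -> None:
    # Peak GPU memory allocated by tensors since the last reset, in MB.
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    print(f"epoch {epoch} step {step}: max mem {peak_mb:.0f} MB")
    # Reset so the next call reports the peak of the next interval only.
    torch.cuda.reset_peak_memory_stats()

Calling this every N iterations makes it easy to spot the occasional "bad batch" spike rather than a gradual leak.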

@netw0rkf10w
Author

Memory usage variation mainly comes from padding. If, by chance, you get in the same batch a wide horizontal image and a narrow vertical image, then the resulting padded images will be huge (and waste quite a bit of memory).

@alcinos That's indeed the issue. I checked the code and realized this too. Clearly this is not optimal. Is there a good reason you guys didn't use some kind of GroupedBatchSampler, as in Detectron, to sample only landscape or only portrait images within a batch?

@lessw2020 My training went well during the first 12 epochs...

I think one should have full control over the training's memory footprint (ideally constant across epochs), so I will try a GroupedBatchSampler for my use case.

@alcinos
Contributor

alcinos commented Jul 26, 2020

@netw0rkf10w it's a bit harder to implement in our case, since the random cropping may change the aspect ratio of the image completely. The torchvision/detectron2 approach of pre-binning the aspect ratios of all the images to create groups thus doesn't work, since it assumes the aspect ratio of each image won't change.
That's why we didn't invest more time in it, and as I said in my previous message, on 16GB cards with the base model and bs<=4 you shouldn't have any issue either.
If you find a way to implement it cleanly, we'd be happy to review and merge a PR though 😄

@netw0rkf10w
Author

netw0rkf10w commented Jul 26, 2020

@alcinos You are right: with the implemented random cropping it doesn't make sense to do grouping based on aspect ratio. However, one can modify e.g. RandomSizeCrop to make it preserve the original orientation (I mean the "binary" aspect ratio, i.e. "portrait" or "landscape"), something like this:

import random

import PIL.Image
import torchvision.transforms as T

from datasets.transforms import crop  # DETR's box-aware crop helper


class RandomSizeCrop(object):
    def __init__(self, min_size: int, max_size: int):
        self.min_size = min_size
        self.max_size = max_size

    def __call__(self, img: PIL.Image.Image, target: dict):
        iw, ih = img.size
        if iw < ih:
            # Portrait input: sample the width first, then a height >= width,
            # so the crop stays portrait.
            w = random.randint(self.min_size, min(iw, self.max_size))
            h = random.randint(w, min(ih, self.max_size))
        else:
            # Landscape (or square) input: sample the height first, then a
            # width >= height, so the crop stays landscape.
            h = random.randint(self.min_size, min(ih, self.max_size))
            w = random.randint(h, min(iw, self.max_size))

        region = T.RandomCrop.get_params(img, [h, w])
        return crop(img, target, region)

Using this in combination with a GroupedBatchSampler, the maximum size of a batch tensor would be batch_size * 3 * 800 * 1333 (or batch_size * 3 * 1333 * 800). In the current implementation it is batch_size * 3 * 1333 * 1333 in the worst case, i.e. roughly 1.67x larger.
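
For reference, here is a rough sketch of the kind of grouped sampler I have in mind (written from scratch, not torchvision's reference GroupedBatchSampler; group_ids would be 0 for landscape and 1 for portrait, computed from the dataset's image sizes):

import random
from collections import Counter, defaultdict

from torch.utils.data import Sampler


class GroupedBatchSampler(Sampler):
    # Yields batches of indices whose samples all share the same group id
    # (e.g. 0 = landscape, 1 = portrait), so padding stays within a group.

    def __init__(self, group_ids, batch_size, shuffle=True):
        self.group_ids = list(group_ids)
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        indices = list(range(len(self.group_ids)))
        if self.shuffle:
            random.shuffle(indices)
        buffers = defaultdict(list)
        for idx in indices:
            buf = buffers[self.group_ids[idx]]
            buf.append(idx)
            if len(buf) == self.batch_size:
                yield buf
                buffers[self.group_ids[idx]] = []
        # Yield the leftover, possibly incomplete batches instead of dropping them.
        for buf in buffers.values():
            if buf:
                yield buf

    def __len__(self):
        counts = Counter(self.group_ids)
        return sum((n + self.batch_size - 1) // self.batch_size
                   for n in counts.values())

It would plug into the DataLoader through the batch_sampler argument; in the distributed setting the indices would have to come from the DistributedSampler rather than range(len(...)), as torchvision's reference implementation does.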

By the way, could you tell me how you came up with the following data augmentation?

T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomSelect(
        T.RandomResize(scales, max_size=1333),
        T.Compose([
            T.RandomResize([400, 500, 600]),
            T.RandomSizeCrop(384, 600),
            T.RandomResize(scales, max_size=1333),
        ])
    ),
    normalize,
])

It doesn't seem that standard to me...

Thank you!

@alcinos
Contributor

alcinos commented Jul 26, 2020

@netw0rkf10w By doing so, you'll also make the cropping augmentation weaker; it remains to be seen how much that affects final performance. For DETR, we know that adding the cropping as we do accounts for 1-2 AP, which is not really negligible.

Cropping is not often used with Faster R-CNN and the like because it doesn't affect them that much (due to the inductive biases built into those methods); that may be why it looks non-standard to you. Basically, 50% of the time our augmentation is the "normal" object-detection augmentation, and the rest of the time we resize to a sensible size, then crop, then resize as if there were no crop; it's fairly straightforward. Tuning the cropping probability to something other than 50% was experimentally detrimental on COCO, and slightly changing the cropping scheme didn't seem to have any impact.
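
Concretely, the 50% branching is just a selector transform that picks one of the two pipelines per sample; a rough sketch along these lines (see datasets/transforms.py for the real RandomSelect):

import random

class RandomSelect:
    # Apply transforms1 with probability p, otherwise transforms2.
    # With p=0.5 this gives the 50/50 split described above.

    def __init__(self, transforms1, transforms2, p=0.5):
        self.transforms1 = transforms1
        self.transforms2 = transforms2
        self.p = p

    def __call__(self, img, target):
        if random.random() < self.p:
            return self.transforms1(img, target)
        return self.transforms2(img, target)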

@netw0rkf10w
Author

netw0rkf10w commented Jul 26, 2020

@alcinos Great, thanks for the information. I may be able to try a GroupedBatchSampler sometime during the summer; I'll keep you updated on the results ;)
