
Chip classification training and prediction on multiple GPU's #861

Closed
wants to merge 14 commits into from

Conversation

lmbak
Contributor

@lmbak lmbak commented Nov 19, 2019

Overview

Adds support for using multi-GPU systems during training and prediction for chip classification. The parallelism is applied along the batch dimension, so the batch size needs to be equal to or larger than the number of GPUs you want to use. No user input is required: the maximum number of available GPUs is used automatically.
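For context, here is a minimal sketch (not the backend's actual training code) of this kind of batch-dimension parallelism using torch.nn.DataParallel; the model and tensor shapes are just placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

# Placeholder model; the actual backend builds its own architecture.
model = models.resnet50(pretrained=False)

if torch.cuda.device_count() > 1:
    # DataParallel splits each batch along dimension 0 across the available
    # GPUs, so the batch size must be >= the number of GPUs to use them all.
    model = nn.DataParallel(model)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# A forward pass scatters the batch to the GPUs and gathers the outputs.
x = torch.randn(8, 3, 224, 224).to(device)
out = model(x)  # shape: (8, 1000)
```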

Checklist

  • Updated docs/changelog.rst
  • Added needs-backport label if PR is bug fix that applies to previous minor release
  • Ran scripts/format_code and committed any changes
  • Documentation updated if needed
  • PR has a name that won't get you publicly shamed for vagueness

Notes

Does not yet work with multiple experiments, only with a single one.

Testing Instructions

  • Tested it on a multi-GPU system; the batch size needs to be larger than 1 before the second GPU is used.
  • Training on multiple experiments results in a memory error. Not sure yet how this can be resolved.

Closes #XXX

@lmbak
Contributor Author

lmbak commented Nov 19, 2019

Edit: Solved
@lewfish I'll resolve the conflicts tomorrow (or later tonight). Do you have any ideas about the multi-experiment error? When running more than one experiment, it seems like the allocation of GPUs does not go correctly. I'll dive into it more deeply tomorrow.

@lmbak lmbak changed the title from "WIP: Chip classification training and prediction on multiple GPU's" to "Chip classification training and prediction on multiple GPU's" on Nov 20, 2019
@lmbak
Contributor Author

lmbak commented Nov 20, 2019

Even though Travis succeeded, there was an error when using a pretrained model, or when restoring a model checkpoint, after switching from a multi-GPU to a single-GPU machine or vice versa. Since models are likely to be ported between machines from time to time, this is now supported. See also
https://stackoverflow.com/questions/44230907/keyerror-unexpected-key-module-encoder-embedding-weight-in-state-dict
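For reference, a minimal sketch of the general fix (renaming state-dict keys by adding or stripping the "module." prefix that nn.DataParallel introduces); the helper names here are illustrative, not the exact functions added in this PR:

```python
from collections import OrderedDict

def to_dataparallel_keys(state_dict):
    # Checkpoint saved from a plain model -> keys an nn.DataParallel
    # wrapper expects ('module.' prefix added).
    return OrderedDict(('module.' + k, v) for k, v in state_dict.items())

def to_single_device_keys(state_dict):
    # Checkpoint saved from an nn.DataParallel model -> keys a plain
    # (single-GPU/CPU) model expects ('module.' prefix stripped).
    return OrderedDict(
        (k[len('module.'):] if k.startswith('module.') else k, v)
        for k, v in state_dict.items())
```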

@lmbak
Contributor Author

lmbak commented Nov 20, 2019

@lewfish I only now realize that I did not check with you whether this functionality is needed. Is it needed in this form?

@lewfish
Contributor

lewfish commented Nov 20, 2019

@lewfish I only now realize that I did not check with you whether this functionality is needed. Is it needed in this form?

We haven't had a need for multi-GPU support, especially for chip classification, since training is relatively fast. I suppose it'll be good to have, though, and it could speed up workflows.

@lewfish
Contributor

lewfish commented Nov 20, 2019

When you tested this, what speedup did you observe from using multiple GPUs?

model = model.to(self.device)
model_path = join(train_dir, 'model')

# Load weights from a pretrained model.
def convert_state_dict_to_dataparallel(state_dict):
Contributor

I would move these two functions out of the train function since it's already very long.

@@ -43,7 +44,11 @@ def __len__(self):


def build_databunch(data_dir, img_sz, batch_sz, class_names, augmentors):
num_workers = 4
# To avoid memory errors
Contributor

What happened without this check? I can see that using multiple GPUs on one instance means that there could be fewer CPUs available per GPU, so num_workers might need to be lower, but I don't think we want to set this to 0 unless we have to. I've had situations where things froze when using num_workers > 0, but even so, it's best to avoid setting it to zero.

Contributor Author

Without this check I got memory errors, even for super small batch sizes (batch_size=16, Nvidia P40 with 22GB of memory, using resnet50). My guess is that the multiple processes try to allocate the same bit of memory on the GPU. On the documentation page of torch.utils.data I came across the red warning box (about a third of the way down, the same one as in the comment below), so I tried num_workers=0, and that seemed to remove the memory error.

@@ -217,17 +218,59 @@ def train(self, tmp_dir):
num_labels = len(databunch.label_names)
model = get_model(
self.train_opts.model_arch, num_labels, pretrained=True)
if torch.cuda.device_count() > 1:
model = nn.DataParallel(model)
Contributor

I think we should add a multi_gpu option, and its default value should be True, but it seems like it would be good to give people the option to turn this off.

Contributor Author

Would you agree that the best place for this flag would be in the backend configuration? I.e., rv.BackendConfig.builder(rv.PYTORCH_CHIP_CLASSIFICATION).with_multi_gpu(True) would enable multi-GPU. In that case I would pass the flag on in:

  • PyTorchChipClassificationConfigBuilder from pytorch_chip_classification_config.py via a with_multi_gpu() method (see the sketch below), correct?
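For illustration, a rough, hypothetical sketch of how such a builder flag could work; this is not the real PyTorchChipClassificationConfigBuilder, just the general chained-builder pattern being proposed:

```python
from copy import deepcopy

class ExampleBackendConfigBuilder:
    """Hypothetical stand-in for a backend config builder."""

    def __init__(self, config=None):
        self.config = dict(config or {})

    def with_multi_gpu(self, multi_gpu=True):
        # Return a copy so calls can be chained without mutating the
        # original builder, mirroring the with_*() builder style.
        b = deepcopy(self)
        b.config['multi_gpu'] = multi_gpu
        return b

    def build(self):
        return self.config

# Usage mirroring the proposal above:
config = ExampleBackendConfigBuilder().with_multi_gpu(True).build()
assert config['multi_gpu'] is True
```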

@lewfish
Contributor

lewfish commented Nov 20, 2019

Did you already sign and send us the CLA (see bottom of https://github.com/azavea/raster-vision)? I checked my email and couldn't find it.

@lmbak
Contributor Author

lmbak commented Nov 21, 2019

When you tested this, what speedup did you observe from using multiple GPUs?

Roughly a 40-50% speedup: it went from 36-40 seconds per epoch to 21-22 seconds. That was going from one Nvidia P40 to four P40s, though, so at best a 50% speedup for 400% more capacity. The gains come from being able to increase the batch size; in our case we went from a batch size of 16 to 750. These numbers also seem odd: you would expect the batch size to increase by a factor of 4 in this case. My guess is that this has to do with the num_workers used for data loading; see also the red warning box on the torch.utils.data documentation page (about a third of the way down). My guess is that for a single card with num_workers=0 the batch size might be larger, but the overall loading might be slower; that's just a hunch, though. I'll be testing that later. (Tested it; see below.) How does this fit with your experience of the behavior of batch_size, multiple GPUs, and num_workers?

@lmbak
Contributor Author

lmbak commented Nov 21, 2019

I have been playing with num_workers and batch_size. I found that when num_workers > 0, I have to use a lower batch_size.
Here is what I tried (all on a single GPU):

  • Find a high batch_size with num_workers=0, just below the point where the error RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 7.93 GiB total capacity; 6.85 GiB already allocated; 17.88 MiB free; 28.04 MiB cached) occurs.
  • With this high batch_size, which we just confirmed fits in GPU memory, set num_workers > 0; now another error occurs: Training [------------------------------------] 0%ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
  • Note that both are memory errors, but they are not exactly the same.
  • When decreasing the batch_size, the error eventually disappears, but the behavior is inconsistent: for a fixed batch_size, the error may or may not occur. Only when the batch_size is really small does the error seem to never occur.

This is consistent with the warning on the torch.utils.data documentation page (about a third of the way down). Probably the multi-process data loading runs into trouble when each process claims the same part of GPU memory, but that is just a hunch.

So I think that when using a GPU (either a single one or multiple), num_workers should be set to 0.

@lmbak
Contributor Author

lmbak commented Nov 21, 2019

With this knowledge I will have to test single- vs. multi-GPU again, both with num_workers=0. My guess is that the gains will decrease, since the single-GPU scenario can now hold a bigger batch_size without giving a memory error.
TEST RESULTS

  • Again, with num_workers>0, errors are inconsistent: sometimes they occur and sometimes they don't, for identical experiments.

  • With num_workers=0, I compared two situations (same GPU model):

  1. 4 GPUs, batch_size=780 → time per epoch ~0:45
  2. 1 GPU, batch_size=200 → time per epoch ~1:23

Note that the batch size roughly quadruples from experiment 2 to experiment 1, which seems consistent with what you would expect. Also, in the case of 4 GPUs, they are idling most of the time. This makes me think that most of the time is spent transferring the images to the GPUs.

@lewfish
Contributor

lewfish commented Dec 4, 2019

Note that the batch size roughly quadruples from experiment 2 to experiment 1, which seems consistent with what you would expect. Also, in the case of 4 GPUs, they are idling most of the time. This makes me think that most of the time is spent transferring the images to the GPUs.

Yeah, there's a good chance that using num_workers = 0 is why you're not getting the 4x speedup you would hope for. You mentioned that you got a bus error; I've seen that before when I didn't allocate enough shared memory in the Docker container. The ./docker/run script in RV doesn't explicitly allocate it (but should, I think). I've done this in another repo using the following option: docker run --shm-size 8G. If you try this, maybe you can use num_workers > 0.

Contributor

@lewfish lewfish left a comment

Other than the num_workers issue and the use_multi_gpu option, this looks good. If you can't fix the problem by increasing the shared memory, could you make it so it only sets num_workers to 0 if use_multi_gpu is True?
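A minimal sketch of what that conditional could look like when building the data loaders (build_loader and use_multi_gpu are illustrative names, not the actual build_databunch signature):

```python
import torch
from torch.utils.data import DataLoader

def build_loader(dataset, batch_sz, use_multi_gpu=False):
    # Only fall back to single-process loading (num_workers=0) in the
    # multi-GPU case, where the memory/bus errors were observed;
    # otherwise keep multi-process loading for speed.
    if use_multi_gpu and torch.cuda.device_count() > 1:
        num_workers = 0
    else:
        num_workers = 4
    return DataLoader(
        dataset,
        batch_size=batch_sz,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True)
```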

@lewfish
Contributor

lewfish commented Jan 13, 2020

I was looking at this again, and given the added complexity, some unresolved issues (like the multiple experiments), the meager speed gains (0:45 for 4 GPUs vs. 1:23 for 1 GPU), and the fact that there isn't a pressing need for this functionality, I think we should probably close this without merging and revisit it in the future if needed. What do you think?

@lmbak
Contributor Author

lmbak commented Jan 13, 2020

Agreed. I have not had the time to look at this PR this year. Closing it actually suits me better, as I'm no longer using the GPU machine, so testing would be difficult.

@lmbak lmbak closed this Jan 13, 2020