
Chip classification training and prediction on multiple GPU's #861

Closed
wants to merge 14 commits into from

Conversation

lmbak
Contributor

@lmbak lmbak commented Nov 19, 2019

Overview

Adds support for using multi-GPU systems during training and prediction for chip classification. The parallelism is applied along the batch dimension, so the batch size needs to be equal to or larger than the number of GPUs you want to use. No user input is required: the maximum number of available GPUs is used automatically.
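For context, here is a minimal sketch (not the backend's actual training code) of this kind of batch-dimension parallelism using torch.nn.DataParallel; the model and tensor shapes are just placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

# Placeholder model; the actual backend builds its own architecture.
model = models.resnet50(pretrained=False)

if torch.cuda.device_count() > 1:
    # DataParallel splits each batch along dimension 0 across the available
    # GPUs, so the batch size must be >= the number of GPUs to use them all.
    model = nn.DataParallel(model)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# A forward pass scatters the batch to the GPUs and gathers the outputs.
x = torch.randn(8, 3, 224, 224).to(device)
out = model(x)  # shape: (8, 1000)
```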

Checklist

  • Updated docs/changelog.rst
  • Added needs-backport label if PR is bug fix that applies to previous minor release
  • Ran scripts/format_code and committed any changes
  • Documentation updated if needed
  • PR has a name that won't get you publicly shamed for vagueness

Notes

Does not yet work with multiple experiments, only with a single one.

Testing Instructions

  • Tested it on a multi-GPU system; the batch size needs to be larger than 1 before the second GPU is used.
  • Training on multiple experiments results in a memory error. Not sure yet how this can be resolved.

Closes #XXX

@lmbak
Contributor Author

lmbak commented Nov 19, 2019

Edit: Solved
@lewfish I'll resolve the conflicts tomorrow (or later tonight). Do you have any ideas about the multi-experiment error? When running more than one experiment, it seems like the allocation of GPUs does not go correctly. I'll dive into it more deeply tomorrow.

@lmbak lmbak changed the title from "WIP: Chip classification training and prediction on multiple GPU's" to "Chip classification training and prediction on multiple GPU's" on Nov 20, 2019
@lmbak
Contributor Author

lmbak commented Nov 20, 2019

Even though Travis succeeded, there was an error when using a pretrained model, or when restoring a model checkpoint, after switching from a multi-GPU to a single-GPU machine or vice versa. Since models are likely to be ported between machines from time to time, this is now supported. See also
https://stackoverflow.com/questions/44230907/keyerror-unexpected-key-module-encoder-embedding-weight-in-state-dict
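For reference, a minimal sketch of the general fix (renaming state-dict keys by adding or stripping the "module." prefix that nn.DataParallel introduces); the helper names here are illustrative, not the exact functions added in this PR:

```python
from collections import OrderedDict

def to_dataparallel_keys(state_dict):
    # Checkpoint saved from a plain model -> keys an nn.DataParallel
    # wrapper expects ('module.' prefix added).
    return OrderedDict(('module.' + k, v) for k, v in state_dict.items())

def to_single_device_keys(state_dict):
    # Checkpoint saved from an nn.DataParallel model -> keys a plain
    # (single-GPU/CPU) model expects ('module.' prefix stripped).
    return OrderedDict(
        (k[len('module.'):] if k.startswith('module.') else k, v)
        for k, v in state_dict.items())
```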

@lmbak
Contributor Author

lmbak commented Nov 20, 2019

@lewfish I only now realize that I did not check with you whether this functionality is needed. Is it needed in this form?

@lewfish
Contributor

lewfish commented Nov 20, 2019

@lewfish I only now realize that I did not check with you whether this functionality is needed. Is it needed in this form?

We haven't had a need for multi-GPU support, especially for chip classification, since training is relatively fast. I suppose it'll be good to have, though, and it could speed up workflows.

@lewfish
Contributor

lewfish commented Nov 20, 2019

When you tested this, what speedup did you observe from using multiple GPUs?

model = model.to(self.device)
model_path = join(train_dir, 'model')

# Load weights from a pretrained model.
def convert_state_dict_to_dataparallel(state_dict):
Contributor

I would move these two functions out of the train function since it's already very long.

@@ -43,7 +44,11 @@ def __len__(self):


def build_databunch(data_dir, img_sz, batch_sz, class_names, augmentors):
num_workers = 4
# To avoid memory errors
Contributor

What happened without this check? I can see that using multiple GPUs on one instance means that there could be fewer CPUs available per GPU, so num_workers might need to be lower, but I don't think we want to set this to 0 unless we have to. I've had situations where things froze when using num_workers > 0, but even so, it's best to avoid setting it to zero.

Contributor Author

Without this check I got memory errors, even for super small batch sizes (batch_size=16, Nvidia P40 with 22GB of memory, using resnet50). My guess is that the multiple processes try to allocate the same bit of memory on the GPU. On the documentation page of torch.utils.data I came across the red warning box (about a third of the way down, the same one as in the comment below), so I tried num_workers=0, and that seemed to remove the memory error.

@@ -217,17 +218,59 @@ def train(self, tmp_dir):
num_labels = len(databunch.label_names)
model = get_model(
self.train_opts.model_arch, num_labels, pretrained=True)
if torch.cuda.device_count() > 1:
model = nn.DataParallel(model)
Contributor

I think we should add a multi_gpu option, and its default value should be True, but it seems like it would be good to give people the option to turn this off.

Contributor Author

Would you agree that the best place for this flag would be in the backend configuration? I.e., rv.BackendConfig.builder(rv.PYTORCH_CHIP_CLASSIFICATION).with_multi_gpu(True) would enable multi-GPU. In that case I would pass the flag on in:

  • PyTorchChipClassificationConfigBuilder from pytorch_chip_classification_config.py via a with_multi_gpu() method (see the sketch below), correct?
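For illustration, a rough, hypothetical sketch of how such a builder flag could work; this is not the real PyTorchChipClassificationConfigBuilder, just the general chained-builder pattern being proposed:

```python
from copy import deepcopy

class ExampleBackendConfigBuilder:
    """Hypothetical stand-in for a backend config builder."""

    def __init__(self, config=None):
        self.config = dict(config or {})

    def with_multi_gpu(self, multi_gpu=True):
        # Return a copy so calls can be chained without mutating the
        # original builder, mirroring the with_*() builder style.
        b = deepcopy(self)
        b.config['multi_gpu'] = multi_gpu
        return b

    def build(self):
        return self.config

# Usage mirroring the proposal above:
config = ExampleBackendConfigBuilder().with_multi_gpu(True).build()
assert config['multi_gpu'] is True
```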

@lewfish
Contributor

lewfish commented Nov 20, 2019

Did you already sign and send us the CLA (see bottom of https://github.com/azavea/raster-vision)? I checked my email and couldn't find it.

@lmbak
Contributor Author

lmbak commented Nov 21, 2019

When you tested this, what speedup did you observe from using multiple GPUs?

Roughly a 40-50% speedup: it went from 36-40 seconds per epoch to 21-22 seconds. That was going from one Nvidia P40 to four P40s, though, so at best a 50% speedup for 400% more capacity. The gains come from being able to increase the batch size; in our case we went from a batch size of 16 to 750. These numbers also seem odd: you would expect the batch size to increase by a factor of 4 in this case. My guess is that this has to do with the num_workers used for data loading; see also the red warning box on the torch.utils.data documentation page (about a third of the way down). My guess is that for a single card with num_workers=0 the batch size might be larger, but the overall loading might be slower; that's just a hunch, though. I'll be testing that later. (Tested it; see below.) How does this fit with your experience of the behavior of batch_size, multiple GPUs, and num_workers?

@lmbak
Contributor Author

lmbak commented Nov 21, 2019

I have been playing with num_workers and batch_size. I found that when num_workers > 0, I have to use a lower batch_size.
Here is what I tried (all on a single GPU):

  • Find a high batch_size with num_workers=0, just below the point where the error RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 7.93 GiB total capacity; 6.85 GiB already allocated; 17.88 MiB free; 28.04 MiB cached) occurs.
  • With this high batch_size, which we just confirmed fits in GPU memory, set num_workers > 0; now another error occurs: Training [------------------------------------] 0%ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
  • Note that both are memory errors, but they are not exactly the same.
  • When decreasing the batch_size, the error eventually disappears, but the behavior is inconsistent: for a fixed batch_size, the error may or may not occur. Only when the batch_size is really small does the error seem to never occur.

This is consistent with the warning on the torch.utils.data documentation page (about a third of the way down). Probably the multi-process data loading runs into trouble when each process claims the same part of GPU memory, but that is just a hunch.

So I think that when using a GPU (either a single one or multiple), num_workers should be set to 0.

@lmbak
Contributor Author

lmbak commented Nov 21, 2019

With this knowledge I will have to test single- vs. multi-GPU again, both with num_workers=0. My guess is that the gains will decrease, since the single-GPU scenario can now hold a bigger batch_size without giving a memory error.
TEST RESULTS

  • Again, with num_workers>0, errors are inconsistent: sometimes they occur and sometimes they don't, for identical experiments.

  • With num_workers=0, I compared two situations (same GPU model):

  1. 4 GPUs, batch_size=780 → time per epoch ~0:45
  2. 1 GPU, batch_size=200 → time per epoch ~1:23

Note that the batch size roughly quadruples from experiment 2 to experiment 1, which seems consistent with what you would expect. Also, in the case of 4 GPUs, they are idling most of the time. This makes me think that most of the time is spent transferring the images to the GPUs.

@lewfish
Contributor

lewfish commented Dec 4, 2019

Note that the batch size roughly quadruples from experiment 2 to experiment 1, which seems consistent with what you would expect. Also, in the case of 4 GPUs, they are idling most of the time. This makes me think that most of the time is spent transferring the images to the GPUs.

Yeah, there's a good chance that using num_workers = 0 is why you're not getting the 4x speedup you would hope for. You mentioned that you got a bus error; I've seen that before when I didn't allocate enough shared memory in the Docker container. The ./docker/run script in RV doesn't explicitly allocate it (but should, I think). I've done this in another repo using the following option: docker run --shm-size 8G. If you try this, maybe you can use num_workers > 0.

Contributor

@lewfish lewfish left a comment

Other than the num_workers issue and the use_multi_gpu option, this looks good. If you can't fix the problem by increasing the shared memory, could you make it so it only sets num_workers to 0 if use_multi_gpu is True?
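A minimal sketch of what that conditional could look like when building the data loaders (build_loader and use_multi_gpu are illustrative names, not the actual build_databunch signature):

```python
import torch
from torch.utils.data import DataLoader

def build_loader(dataset, batch_sz, use_multi_gpu=False):
    # Only fall back to single-process loading (num_workers=0) in the
    # multi-GPU case, where the memory/bus errors were observed;
    # otherwise keep multi-process loading for speed.
    if use_multi_gpu and torch.cuda.device_count() > 1:
        num_workers = 0
    else:
        num_workers = 4
    return DataLoader(
        dataset,
        batch_size=batch_sz,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True)
```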

@lewfish
Contributor

lewfish commented Jan 13, 2020

I was looking at this again, and given the added complexity, some unresolved issues (like the multiple experiments), the meager speed gains (0:45 for 4 GPUs vs. 1:23 for 1 GPU), and the fact that there isn't a pressing need for this functionality, I think we should probably close this without merging and revisit it in the future if needed. What do you think?

@lmbak
Contributor Author

lmbak commented Jan 13, 2020

Agreed. I have not had the time to look at this PR this year. Closing it actually suits me better, as I'm no longer using the GPU machine, so testing would be difficult.

@lmbak lmbak closed this Jan 13, 2020