Chip classification training and prediction on multiple GPUs #861
Conversation
…ation. Only works on one experiment for now.
Edit: Solved
sync with master
sync gpu with master
Even though the Travis build succeeded, there was an error when using a pretrained model, or when restoring a model checkpoint after switching from a multi-GPU to a single-GPU machine or vice versa. Since models are likely to be ported between machines from time to time, this is now supported. See also
@lewfish I only now realize that I did not check with you whether this functionality is needed. Is it needed in this form?
We haven't had a need for multi-GPU support, especially for chip classification, since training is relatively fast. I suppose it will be good to have, though, and could speed up workflows.
When you tested this, what speedup did you observe from using multiple GPUs?
        model = model.to(self.device)
        model_path = join(train_dir, 'model')

        # Load weights from a pretrained model.
        def convert_state_dict_to_dataparallel(state_dict):
I would move these two functions out of the train function, since it's already very long.
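For context, a minimal sketch of what such key-conversion helpers typically look like: nn.DataParallel prefixes every parameter key with module., so checkpoint keys need to be rewritten when moving between wrapped and unwrapped models. The second helper's name is an assumption; the PR's actual implementation may differ.

```python
from collections import OrderedDict


def convert_state_dict_to_dataparallel(state_dict):
    """Add the 'module.' prefix that an nn.DataParallel-wrapped model expects."""
    return OrderedDict(
        (k if k.startswith('module.') else 'module.' + k, v)
        for k, v in state_dict.items())


def convert_state_dict_from_dataparallel(state_dict):  # name assumed
    """Strip the 'module.' prefix so an unwrapped model can load the weights."""
    return OrderedDict(
        (k[len('module.'):] if k.startswith('module.') else k, v)
        for k, v in state_dict.items())
```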
@@ -43,7 +44,11 @@ def __len__(self):


def build_databunch(data_dir, img_sz, batch_sz, class_names, augmentors):
    num_workers = 4
    # To avoid memory errors
What happened without this check? I can see that using multiple GPUs on one instance means that there could be fewer CPUs available per GPU, so num_workers might need to be lower, but I don't think we want to set this to 0 unless we have to. I've had situations where things froze when using num_workers > 0, but it's best to avoid setting it to zero.
Without this check I got memory errors, even for very small batch sizes (batch_size=16, NVIDIA P40 with 22 GB of memory, using resnet50). My guess is that the multiple processes try to allocate the same bit of memory on the GPU. On the documentation page of torch.utils.data I came across the red warning box (about a third of the way down the page, same as in the comment below), so I tried num_workers=0, and that seemed to remove the memory error.
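A minimal sketch of the kind of check being discussed, assuming the memory error only occurs when CUDA is in use; the helper name and default value are illustrative, not necessarily what the PR uses.

```python
import torch


def get_num_workers(default=4):
    # Multi-worker DataLoader processes triggered CUDA memory errors in
    # this setup, so fall back to single-process loading on GPU machines.
    if torch.cuda.is_available():
        return 0
    return default
```

The reviewer's later suggestion would instead gate this on the proposed use_multi_gpu flag rather than on CUDA availability alone.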
@@ -217,17 +218,59 @@ def train(self, tmp_dir):
        num_labels = len(databunch.label_names)
        model = get_model(
            self.train_opts.model_arch, num_labels, pretrained=True)
        if torch.cuda.device_count() > 1:
            model = nn.DataParallel(model)
I think we should add a multi_gpu option. Its default value should be True, but it seems like it would be good to give people the option to turn this off.
Would you agree that the best place for this flag would be in the backend configuration? I.e. rv.BackendConfig.builder(rv.PYTORCH_CHIP_CLASSIFICATION).with_multi_gpu(True) would enable multi-GPU? In that case I would pass the flag on in PyTorchChipClassificationConfigBuilder from pytorch_chip_classification_config.py via a with_multi_gpu() method, correct?
I would rather add an option to with_train_options here: https://github.com/azavea/raster-vision/blob/master/rastervision/backend/pytorch_chip_classification_config.py#L53
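To illustrate the suggestion, a rough sketch of how such a flag could gate the DataParallel wrapping inside train. The helper name and the way the option would be read from the train options are assumptions, not the PR's actual code.

```python
import torch
import torch.nn as nn


def maybe_wrap_multi_gpu(model, use_multi_gpu=True):
    # Wrap only when the user opted in and more than one GPU is visible;
    # DataParallel then splits each batch across the available devices.
    if use_multi_gpu and torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model
```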
Did you already sign and send us the CLA (see the bottom of https://github.com/azavea/raster-vision)? I checked my email and couldn't find it.
Roughly a 40-50% speedup: epochs went from 36-40 seconds to 21-22 seconds. That was going from one NVIDIA P40 to four P40s, though, so at best a 50% speedup for 400% more capacity. The main gain is that the batch size can be increased; in our case we went from a batch size of 16 to 750. These numbers also seem odd, since you would expect the batch size to increase by only a factor of 4 in this case. My guess is that this has to do with the
I have been playing with the
This is consistent with the warning on the torch.utils.data documentation page (about a third of the way down). Probably the multi-process data loading runs into errors when each process claims the same part of GPU memory, but that is just a hunch. So I think that when using the GPU (either a single one or multiple) the
With this knowledge I will have to test single vs. multi GPU again, both with
Note that the batch size roughly quadruples from experiment 2 to 1, which seems consistent with what you would expect. Also, in the case of 4 GPUs, they are idling most of the time. This makes me think that most of the time is spent transferring the images to the GPU.
Yeah, there's a good chance that using
Other than the num_workers issue and the use_multi_gpu option, this looks good. If you can't fix the problem by increasing the shared memory, could you make it so it only sets num_workers to 0 if use_multi_gpu is True?
I was looking at this again, and given the added complexity, some unresolved issues (like the multiple experiments), the meager speed gains (0:45 for 4 GPUs vs. 1:23 for 1 GPU), and the fact that there isn't a pressing need for this functionality, I think we should probably close this without merging and revisit it in the future if needed. What do you think?
Agreed, I have not had time to look at this PR this year. It actually suits me better, as I'm not using the GPU machine any more, so testing would be difficult.
Overview
Makes use of multi-GPU systems during training and prediction for chip classification. The parallelism is done over the batch dimension, so the batch size needs to be equal to or larger than the number of GPUs you want to use. No user input is required; the maximum number of available GPUs is used automatically.
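As a rough illustration of the batch-dimension parallelism described above; the model and tensor sizes are placeholders, not the PR's actual chip classification model.

```python
import torch
import torch.nn as nn

# Stand-in model; the PR actually uses a torchvision classifier such as resnet50.
model = nn.Linear(128, 10)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the model on every visible GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

# A batch of 16 chips is scattered across the GPUs (e.g. 4 per GPU on a
# 4-GPU machine), run through the replicas, and the outputs are gathered
# back on the default device.
x = torch.randn(16, 128).to(device)
out = model(x)  # shape: (16, 10)
```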
Checklist
- docs/changelog.rst
- needs-backport label if PR is a bug fix that applies to a previous minor release

Notes
Does not yet work with multiple experiments, only with a single one.
Testing Instructions
Training on multiple experiments results in a memory error. Not sure how this can be resolved yet.

Closes #XXX