
Slow training in machines with multiple GPUs #3265

Closed
pvcastro opened this issue Sep 19, 2019 · 10 comments

@pvcastro
Contributor

System (please complete the following information):

  • OS: Linux Ubuntu 18.04.3 LTS
  • Python version: 3.6.8
  • AllenNLP version: v0.8.5
  • PyTorch version: 1.2.0

Question
I have run AllenNLP NER training on several machines with multiple GPUs (DGX1, IBM AC922 and a custom DIGITS DevBox) and I have observed the following behavior on all of them:

  1. There is no difference in training time whether you use one GPU or several for the same training. If you use 2 GPUs instead of 1, the number of batches is halved, but each batch takes twice as long to process, so there is no net gain.

  2. When you run multiple trainings, each on a separate GPU (using CUDA_VISIBLE_DEVICES), the first training starts fine, but the following ones seem to run very slowly, even though they are on separate GPUs, hidden from each other via CUDA_VISIBLE_DEVICES (launched roughly as in the sketch after this list). The same happens with inference as well.
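
For concreteness, the parallel trainings are launched along these lines (a minimal sketch, assuming the standard `allennlp train` CLI; the config and output paths are placeholders, and each config sets cuda_device to 0 since its process only sees one GPU):

```python
# Minimal sketch of the launch setup (paths are placeholders): each training
# process only sees one GPU via CUDA_VISIBLE_DEVICES, so every config sets
# cuda_device to 0.
import os
import subprocess

jobs = [
    ("0", "configs/ner_run_a.jsonnet", "output/run_a"),
    ("1", "configs/ner_run_b.jsonnet", "output/run_b"),
]

processes = []
for gpu, config, serialization_dir in jobs:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    processes.append(subprocess.Popen(
        ["allennlp", "train", config, "-s", serialization_dir],
        env=env,
    ))

for p in processes:
    p.wait()
```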

Is this normal behavior? Is there anything I can do to improve it? I'm having a hard time taking advantage of robust machines with multiple GPUs. This has been happening since AllenNLP 0.7.2 running PyTorch 0.4.1, on all the machines I've used.

One note: on the AC922 and DGX1 the GPUs are connected via NVLink, while the DIGITS DevBox has no such connection, yet the behavior is the same either way.

Thanks!

@schmmd
Member

schmmd commented Sep 19, 2019

@pvcastro I would suggest upgrading. It's pretty hard for us to investigate issues on PyTorch 0.4.1.

Possibly @amarasovic or @brendan-ai2 might be able to relate their experiences running a different model on multiple GPUs. I know some people in the group do that regularly.

@pvcastro
Contributor Author

Hi @schmmd !
It's happening with the current versions, AllenNLP 0.8.5 and PyTorch 1.1, too. I only meant that this has always happened, going back to AllenNLP 0.7.2 with PyTorch 0.4.1.
Having apex installed doesn't help with this either.

@schmmd
Member

schmmd commented Sep 19, 2019

@pvcastro apologies--I misread--thank you for clarifying. I don't think it's normal behavior, but let's wait until someone who actually runs multi-GPU workloads is able to give some perspective.

@pvcastro
Contributor Author

Great, thanks!

@pvcastro
Contributor Author

FYI, we're running a SQuAD training using pytorch-transformers on 2 GPUs while an AllenNLP NER training runs on a third GPU, and the SQuAD training has had no impact on the AllenNLP NER training.

@brendan-ai2
Contributor

Hi @pvcastro,

This is an area that we're actively working on. Our multi-GPU setup is definitely not ideal and we're attempting to migrate to DistributedDataParallel to address some of these issues. It's very difficult to diagnose performance issues remotely, but some (very) high level thoughts for your specific concerns:

  1. One thing that's important here would be to instrument AllenNLP to determine precisely where the bottleneck is. In some applications the tensorization (i.e., iterating over your data with DataIterator) is the bottleneck. If this were the case, then using multiple GPUs would have no effect. This is where I would start looking.

  2. I'm less certain about this point. Presumably one issue is saturation of the bus. I'm not particularly familiar with NVLink, but if you can measure its usage and that of the PCIe bus, that could be interesting. It would also be worth double-checking whether anything else is saturated, e.g. the CPU; see the sketch after this list for a rough starting point.
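
As a rough starting point for that kind of measurement, something like the following could be left running alongside the trainings (a sketch only, assuming `psutil` is installed and `nvidia-smi` is on the PATH; it won't show PCIe or NVLink traffic, which needs dedicated tools):

```python
# Rough sketch: poll CPU and GPU utilization once a second while the trainings
# run. Assumes psutil is installed and nvidia-smi is on the PATH. This does not
# show PCIe or NVLink traffic (dedicated tools are needed for that), but it will
# catch an obviously saturated CPU or an unexpectedly idle GPU.
import subprocess
import time

import psutil


def gpu_utilization():
    """Per-GPU utilization percentages as reported by nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu",
        "--format=csv,noheader,nounits",
    ])
    return [int(value) for value in out.decode().split()]


while True:
    cpu = psutil.cpu_percent(interval=None)
    gpus = gpu_utilization()
    print(f"cpu={cpu:5.1f}%  gpu={gpus}")
    time.sleep(1.0)
```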

Looking forward to hearing more. Regards.

@pvcastro
Contributor Author

Hi @brendan-ai2, thanks for the reply!

For 1, what do you propose I do for instrumenting the code? Any thoughts? Have you done this before and could share the procedure?

For 2 (and this is the one causing me more pain, because I'm unable to run even 2 trainings in parallel), here is the output of running 'nvidia-smi topo -m' on the DIGITS DevBox setup with 3 RTX 2080 Ti cards:

[screenshot: nvidia-smi topo -m output]

This machine doesn't have NVLink, but the DGX1 and AC922 I used before did, and the same issue happened there too. Anyway, I'll try measuring all these resources and will post the results here. If you have any suggestions on the best way to approach this, please let me know.

Thanks!

@brendan-ai2
Contributor

For 1, you could try the script at https://github.com/allenai/allennlp/blob/master/scripts/benchmark_iter.py. It will tell you how long it takes to read, tensorize and batch your dataset. Then you'll need to compare this against how long it takes to train a batch. AllenNLP's train command should display the training time per epoch, so divide by the number of batches per epoch and you'll have your number. If the data processing dominates the overall training time, using more GPUs won't help.
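
If you'd rather do the measurement inline than via the script, the gist is something like this (a rough sketch against the 0.8.x API; the dataset reader, file path and batch size are placeholders, so substitute the ones from your config):

```python
# Rough sketch of the comparison: time how long reading + tensorizing + batching
# the training data takes, independent of the model (AllenNLP 0.8.x API). The
# reader, file path and batch size are placeholders; use the ones from your
# training config.
import time

from allennlp.data import Vocabulary
from allennlp.data.dataset_readers import Conll2003DatasetReader
from allennlp.data.iterators import BucketIterator

reader = Conll2003DatasetReader()
instances = reader.read("/path/to/train.conll")

vocab = Vocabulary.from_instances(instances)
iterator = BucketIterator(batch_size=32, sorting_keys=[("tokens", "num_tokens")])
iterator.index_with(vocab)

start = time.time()
num_batches = 0
for batch in iterator(instances, num_epochs=1, shuffle=False):
    num_batches += 1
elapsed = time.time() - start

print(f"{num_batches} batches tensorized in {elapsed:.1f}s "
      f"({elapsed / num_batches:.3f}s per batch)")
# Compare the per-batch figure with (epoch training time / batches per epoch)
# from the train command's logs: if the two are close, data processing is the
# bottleneck and extra GPUs won't help.
```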

For 2, I'm less certain. I don't think nvidia-smi topo -m gives us what we need, as it just shows the topology of the bus, not its current usage. Perhaps you could try pcm-pcie from https://github.com/opcm/pcm? I got it working, but I'm not too familiar with it. One gotcha: I wasn't able to get it working on a VM, only on local hardware.

@pvcastro
Contributor Author

pvcastro commented Oct 6, 2019

Hi @brendan-ai2, just to give you some feedback: I haven't had a chance to follow your suggestions yet. Once I have some information, I'll post it here.

@DeNeutoy
Contributor

This should now be resolved by #3529
