
Slow training in machines with multiple GPUs #3265

Closed
pvcastro opened this issue Sep 19, 2019 · 10 comments

@pvcastro
Contributor

System (please complete the following information):

  • OS: Linux Ubuntu 18.04.3 LTS
  • Python version: 3.6.8
  • AllenNLP version: v0.8.5
  • PyTorch version: 1.2.0

Question
I have run AllenNLP NER training on several machines with multiple GPUs (DGX1, IBM AC922 and a custom DIGITS DevBox) and I have observed the following behavior on all of them:

  1. There is no difference in training time whether you use one GPU or several for the same training. If you use 2 GPUs instead of 1, the number of batches is halved, but each batch takes twice as long to process, so there is no net gain.

  2. When you run multiple trainings, each on a separate GPU (using CUDA_VISIBLE_DEVICES), the first training starts fine, but the following ones seem to run very slowly, even though they are on separate GPUs, hidden from each other via CUDA_VISIBLE_DEVICES (launched roughly as in the sketch after this list). The same happens with inference as well.
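
For concreteness, the parallel trainings are launched along these lines (a minimal sketch, assuming the standard `allennlp train` CLI; the config and output paths are placeholders, and each config sets cuda_device to 0 since its process only sees one GPU):

```python
# Minimal sketch of the launch setup (paths are placeholders): each training
# process only sees one GPU via CUDA_VISIBLE_DEVICES, so every config sets
# cuda_device to 0.
import os
import subprocess

jobs = [
    ("0", "configs/ner_run_a.jsonnet", "output/run_a"),
    ("1", "configs/ner_run_b.jsonnet", "output/run_b"),
]

processes = []
for gpu, config, serialization_dir in jobs:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    processes.append(subprocess.Popen(
        ["allennlp", "train", config, "-s", serialization_dir],
        env=env,
    ))

for p in processes:
    p.wait()
```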

Is this normal behavior? Is there anything I can do to improve it? I'm having a hard time taking advantage of robust machines with multiple GPUs. This has been happening since AllenNLP 0.7.2 running PyTorch 0.4.1, on all the machines I've used.

One note: on the AC922 and DGX1 the GPUs are connected via NVLink, while the DIGITS DevBox has no such connection, yet the behavior is the same either way.

Thanks!

@schmmd
Member

schmmd commented Sep 19, 2019

@pvcastro I would suggest upgrading. It's pretty hard for us to investigate issues on PyTorch 0.4.1.

Possibly @amarasovic or @brendan-ai2 might be able to relate their experiences running a different model on multiple GPUs. I know some people in the group do that regularly.

@pvcastro
Contributor Author

Hi @schmmd !
It's happening with the current versions, AllenNLP 0.8.5 and PyTorch 1.1, too. I only meant that this has always happened, going back to AllenNLP 0.7.2 with PyTorch 0.4.1.
Having apex installed doesn't help with this either.

@schmmd
Member

schmmd commented Sep 19, 2019

@pvcastro apologies--I misread--thank you for clarifying. I don't think it's normal behavior, but let's wait until someone who actually runs multi-GPU workloads is able to give some perspective.

@pvcastro
Contributor Author

Great, thanks!

@pvcastro
Contributor Author

FYI, we're running a SQuAD training using pytorch-transformers on 2 GPUs while an AllenNLP NER training runs on a third GPU, and the SQuAD training has had no impact on the AllenNLP NER training.

@brendan-ai2
Contributor

Hi @pvcastro,

This is an area that we're actively working on. Our multi-GPU setup is definitely not ideal and we're attempting to migrate to DistributedDataParallel to address some of these issues. It's very difficult to diagnose performance issues remotely, but some (very) high level thoughts for your specific concerns:

  1. One thing that's important here would be to instrument AllenNLP to determine precisely where the bottleneck is. In some applications the tensorization (i.e., iterating over your data with DataIterator) is the bottleneck. If this were the case, then using multiple GPUs would have no effect. This is where I would start looking.

  2. I'm less certain about this point. Presumably one issue is saturation of the bus. I'm not particularly familiar with NVLink, but if you can measure its usage and that of the PCIe bus, that could be interesting. It would also be worth double-checking whether anything else is saturated, e.g. the CPU; see the sketch after this list for a rough starting point.
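
As a rough starting point for that kind of measurement, something like the following could be left running alongside the trainings (a sketch only, assuming `psutil` is installed and `nvidia-smi` is on the PATH; it won't show PCIe or NVLink traffic, which needs dedicated tools):

```python
# Rough sketch: poll CPU and GPU utilization once a second while the trainings
# run. Assumes psutil is installed and nvidia-smi is on the PATH. This does not
# show PCIe or NVLink traffic (dedicated tools are needed for that), but it will
# catch an obviously saturated CPU or an unexpectedly idle GPU.
import subprocess
import time

import psutil


def gpu_utilization():
    """Per-GPU utilization percentages as reported by nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu",
        "--format=csv,noheader,nounits",
    ])
    return [int(value) for value in out.decode().split()]


while True:
    cpu = psutil.cpu_percent(interval=None)
    gpus = gpu_utilization()
    print(f"cpu={cpu:5.1f}%  gpu={gpus}")
    time.sleep(1.0)
```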

Looking forward to hearing more. Regards.

@pvcastro
Contributor Author

Hi @brendan-ai2, thanks for the reply!

For 1, what do you propose I do for instrumenting the code? Any thoughts? Have you done this before and could share the procedure?

For 2 (and this is the one causing me more pain, because I'm unable to run even 2 trainings in parallel), here is the output of running 'nvidia-smi topo -m' on the DIGITS DevBox setup with 3 RTX 2080 Ti cards:

[screenshot: nvidia-smi topo -m output]

This machine doesn't have NVLink, but the DGX1 and AC922 I used before did, and the same issue happened there too. Anyway, I'll try measuring all these resources and will post the results here. If you have any suggestions on the best way to approach this, please let me know.

Thanks!

@brendan-ai2
Contributor

For 1, you could try the script at https://github.com/allenai/allennlp/blob/master/scripts/benchmark_iter.py. It will tell you how long it takes to read, tensorize and batch your dataset. Then you'll need to compare this against how long it takes to train a batch. AllenNLP's train command should display the training time per epoch, so divide by the number of batches per epoch and you'll have your number. If the data processing dominates the overall training time, using more GPUs won't help.
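
If you'd rather do the measurement inline than via the script, the gist is something like this (a rough sketch against the 0.8.x API; the dataset reader, file path and batch size are placeholders, so substitute the ones from your config):

```python
# Rough sketch of the comparison: time how long reading + tensorizing + batching
# the training data takes, independent of the model (AllenNLP 0.8.x API). The
# reader, file path and batch size are placeholders; use the ones from your
# training config.
import time

from allennlp.data import Vocabulary
from allennlp.data.dataset_readers import Conll2003DatasetReader
from allennlp.data.iterators import BucketIterator

reader = Conll2003DatasetReader()
instances = reader.read("/path/to/train.conll")

vocab = Vocabulary.from_instances(instances)
iterator = BucketIterator(batch_size=32, sorting_keys=[("tokens", "num_tokens")])
iterator.index_with(vocab)

start = time.time()
num_batches = 0
for batch in iterator(instances, num_epochs=1, shuffle=False):
    num_batches += 1
elapsed = time.time() - start

print(f"{num_batches} batches tensorized in {elapsed:.1f}s "
      f"({elapsed / num_batches:.3f}s per batch)")
# Compare the per-batch figure with (epoch training time / batches per epoch)
# from the train command's logs: if the two are close, data processing is the
# bottleneck and extra GPUs won't help.
```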

For 2, I'm less certain. I don't think nvidia-smi topo -m gives us what we need, as it just shows the topology of the bus, not its current usage. Perhaps you could try pcm-pcie from https://github.com/opcm/pcm? I got it working, but I'm not too familiar with it. One gotcha: I wasn't able to get it working on a VM, only on local hardware.

@pvcastro
Contributor Author

pvcastro commented Oct 6, 2019

Hi @brendan-ai2, just to give you some feedback: I haven't had a chance to follow your suggestions yet. Once I have some information, I'll post it here.

@DeNeutoy
Contributor

This should now be resolved by #3529
