Replies: 1 comment 1 reply
While multi-GPU training should work in theory, no testing has been done to validate it. If you can make it work yourself (lucky you, having multiple GPUs! :-)), then a PR to either the TorchSharpExamples repo or to this repo demonstrating it would be very much appreciated!
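For anyone who wants to try: TorchSharp does expose the per-device placement primitives that multi-GPU training would build on. A minimal sketch follows; it is untested on an actual multi-GPU machine and assumes a recent TorchSharp where `torch.cuda` and `torch.device` carry the libtorch names:

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// enumerate the CUDA devices libtorch can see
Console.WriteLine($"CUDA available: {cuda.is_available()}, device count: {cuda.device_count()}");

// modules and tensors can be pinned to a specific GPU by index
var net = nn.Linear(10, 1);
net.to(device("cuda:1"));

// inputs must live on the same device as the module's parameters
var x = randn(4, 10).to(device("cuda:1"));
var y = net.forward(x);
```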
1 reply
I see a few examples of data-parallel and model-parallel training with PyTorch, but it doesn't seem to be supported by TorchSharp out of the box (correct me if I'm wrong). Even if it isn't supported out of the box, do you have examples of how to do it with the data transfers between GPUs handled explicitly?

As I understand it, data parallelism boils down to training on N GPUs with identical copies of the model, and there is a step in training where the gradients from all the copies need to be combined, right? I don't quite understand how and when to do that. A step-by-step list of actions, with examples of which API calls to make, would help.

My model has about 400K parameters, and I have about 1.5x10^9 training samples. One pass over all the samples takes about 5 days on my system, but I have 2 GPUs and 1 powerful CPU, so I would like to try to use all the resources in the hope of cutting the training time at least in half.
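As a rough starting point for such a recipe, here is a sketch of one manual data-parallel training step in TorchSharp. Everything in it is illustrative rather than official: `nn.Linear(100, 1)` stands in for the real 400K-parameter model, the `randn` batch stands in for real data, SGD and MSE loss are arbitrary choices, and depending on the TorchSharp version, gradient access is the `grad` property or a `grad()` method. TorchSharp has no built-in DataParallel, so the replication, gradient averaging, and weight broadcast are all done by hand:

```csharp
using System;
using System.Linq;
using TorchSharp;
using static TorchSharp.torch;

var dev0 = device("cuda:0");
var dev1 = device("cuda:1");

// two replicas of the same model, one per GPU
var model0 = nn.Linear(100, 1);
var model1 = nn.Linear(100, 1);
model0.to(dev0);
model1.to(dev1);

// copy replica 0's parameters into replica 1 (Tensor.copy_ performs the
// cross-device transfer); also used to re-synchronize after each update
void SyncReplicas()
{
    using var _ = no_grad();
    foreach (var (p0, p1) in model0.parameters().Zip(model1.parameters()))
        p1.copy_(p0);
}
SyncReplicas();

// a single optimizer drives replica 0; replica 1 only contributes gradients
var opt = optim.SGD(model0.parameters(), 0.01);

// --- one data-parallel training step ---
var input  = randn(64, 100);   // stand-in batch, still on the CPU
var target = randn(64, 1);

// 1. shard the batch, one piece per GPU
var xs = input.chunk(2);
var ys = target.chunk(2);

// 2. independent forward/backward on each replica
opt.zero_grad();
foreach (var p in model1.parameters()) p.grad?.zero_();

nn.functional.mse_loss(model0.forward(xs[0].to(dev0)), ys[0].to(dev0)).backward();
nn.functional.mse_loss(model1.forward(xs[1].to(dev1)), ys[1].to(dev1)).backward();

// 3. combine: move replica 1's gradients to GPU 0 and average them in place
using (no_grad())
{
    foreach (var (p0, p1) in model0.parameters().Zip(model1.parameters()))
        if (p0.grad is not null && p1.grad is not null)
            p0.grad.add_(p1.grad.to(dev0)).div_(2);
}

// 4. update replica 0, then broadcast its new weights back to replica 1
opt.step();
SyncReplicas();
```

With a ~400K-parameter model the gradients total roughly 1.6 MB in float32, so the per-step device-to-device traffic is cheap; the real question is whether each per-GPU shard is large enough to keep both devices busy. Driving a single optimizer and broadcasting weights avoids having to keep two optimizer states (e.g. momentum buffers) synchronized.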