
Multi-gpu support in training loop #657

Closed · wants to merge 7 commits

Conversation

@ehsanmok (Contributor) commented Feb 21, 2020

Issue #162

Description of changes:

Adding multi-GPU support. Besides local tests, all examples are reproducible and have been tested on a p3.8xlarge with 4 GPUs.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
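For readers of this thread, a minimal usage sketch (not taken from the PR itself) of how multi-GPU training is meant to be invoked with this change, assuming the Trainer accepts a list of contexts through its existing ctx argument; the estimator, frequency, and batch-size values below are placeholders:

import mxnet as mx
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer

# Placeholder setup: 4 GPUs, as on a p3.8xlarge.
ctx = [mx.gpu(i) for i in range(4)]

estimator = DeepAREstimator(
    freq="H",
    prediction_length=24,
    trainer=Trainer(
        ctx=ctx,                   # list of contexts instead of a single one
        epochs=10,
        batch_size=32 * len(ctx),  # scale the batch size with the GPU count
    ),
)
# predictor = estimator.train(training_data)  # training_data: any GluonTS Dataset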

@jaheba (Contributor) left a comment

Thanks!

I think if we set ctx to be a list, we might want to rename it to ctxs or something similar.

Review comments (outdated, resolved) on: test/model/test_models_synthetic.py, src/gluonts/core/component.py (×2), src/gluonts/dataset/loader.py (×2)
@ehsanmok (Author) commented Feb 22, 2020

@jaheba thanks for the review! All the comments have been addressed. I think it's better to avoid introducing names incompatible with MXNet, so I'm keeping the ctx name.

@alexw91 (Member) commented Feb 26, 2020

Codecov Report

Merging #657 into master will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #657   +/-   ##
=======================================
  Coverage   84.63%   84.64%           
=======================================
  Files         178      178           
  Lines       10401    10425   +24     
=======================================
+ Hits         8803     8824   +21     
- Misses       1598     1601    +3     
Impacted Files Coverage Δ
src/gluonts/core/component.py 86.22% <0.00%> (-0.96%) ⬇️

@codecov-io

Codecov Report

Merging #657 into master will increase coverage by <.01%.
The diff coverage is 94%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #657      +/-   ##
==========================================
+ Coverage   84.63%   84.64%   +<.01%     
==========================================
  Files         178      178              
  Lines       10401    10425      +24     
==========================================
+ Hits         8803     8824      +21     
- Misses       1598     1601       +3
Impacted Files Coverage Δ
src/gluonts/trainer/_base.py 93.79% <100%> (+0.14%) ⬆️
src/gluonts/model/wavenet/_estimator.py 95.95% <100%> (ø) ⬆️
src/gluonts/model/deepar/_network.py 98.47% <100%> (+0.06%) ⬆️
src/gluonts/dataset/loader.py 95.31% <100%> (ø) ⬆️
src/gluonts/model/gp_forecaster/_network.py 100% <100%> (ø) ⬆️
...rc/gluonts/model/gp_forecaster/gaussian_process.py 73.41% <100%> (ø) ⬆️
src/gluonts/model/estimator.py 89.55% <100%> (ø) ⬆️
src/gluonts/support/util.py 90.68% <100%> (+0.29%) ⬆️
src/gluonts/model/predictor.py 81.94% <33.33%> (ø) ⬆️
src/gluonts/core/component.py 86.22% <91.66%> (-0.96%) ⬇️
... and 1 more


@ehsanmok ehsanmok requested a review from jaheba February 26, 2020 23:54
@istorch commented Apr 25, 2020

@ehsanmok @jaheba Sorry to just randomly comment here (this is my first comment on this project), but when can we expect to see this released? I am a new user of GluonTS and it would be nice to have multi-GPU support (my specific application uses DeepAR). Thanks!

@lostella (Contributor):

@istorch the changes look good to me, so this could be merged into master soon and be included in the next release (either this or next week)

@lostella (Contributor) left a comment

Hey @ehsanmok, see my specific comments inline. The main concern I have is that we currently lack proper regression tests (running extensive training on some datasets), and I'm also not familiar with multiple contexts and split_and_load in MXNet, so I'll need to look into this more deeply or have someone else review this as well.

Could you maybe add an MWE (minimal working example) to the original post, describing the expected behavior before/after the PR?

Comment on lines +163 to +172
params = net_source.collect_params()
for p in params:
    ctx = params[p].list_ctx()
    break
# force init otherwise may not load params
# since not all params have the same ctx in net_source
net_dest.initialize(ctx=ctx, force_reinit=True)
net_dest.load_parameters(
    model_dir_path,
-   ctx=mx.current_context(),
+   ctx=ctx,
@lostella (Contributor) commented Apr 28, 2020

I think this limits the way this function can act: correct me if I'm wrong, this way the code puts all the parameters of net_dest in the same context, and this context is the one of the first parameter of net_source, right?

I think the point of using mx.current_context() here is that of being able to (potentially, not sure this is done anywhere) perform training on GPU and have the predictor network use the CPU instead.

Edit: can you explain why this is needed?

@ehsanmok (Author) replied:

If I remember correctly, during testing, when not all params were in a single context, forcing a reinit with a list of contexts solved the issue. Please see my general comment at the end.

@@ -96,7 +96,7 @@ def __init__(
             cyclic=self.cyclic,
             is_train=self.is_train,
             batch_size=self.batch_size,
-            ctx=self.ctx,
+            ctx=self.ctx[0],
@lostella (Contributor) commented Apr 28, 2020

I'm not sure I understand this: one can specify a list of contexts, but then the data is loaded in the first one? But I'm really ignorant about the argument, so maybe I'm missing something.

Edit: is the idea to put everything in the first context, and then have split_and_load in the training loop take care of that?

@ehsanmok (Author) replied:

> is the idea to put everything in the first context, and then have split_and_load in the training loop take care of that?

Yes, indeed!
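For context, a rough sketch (not the PR's actual code) of that pattern: the loader keeps each batch on ctx[0], and the training loop then splits it across all contexts with gluon.utils.split_and_load; the net and trainer names below are placeholders (a Gluon block returning the loss, and a gluon.Trainer).

import mxnet as mx
from mxnet import autograd, gluon

def train_on_batch(net, trainer, batch, ctx_list):
    # Split the batch along the batch axis, one shard per context.
    shards = gluon.utils.split_and_load(batch, ctx_list, batch_axis=0)
    with autograd.record():
        losses = [net(shard) for shard in shards]  # assumes net returns the loss
    for loss in losses:
        loss.backward()
    trainer.step(batch.shape[0])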

Comment on lines +269 to +282
if F is mx.ndarray:
    ctx = (
        inputs.context
        if isinstance(inputs, mx.gluon.tensor_types)
        else inputs[0].context
    )
    with ctx:
        begin_state = self.rnn.begin_state(
            func=F.zeros, dtype=self.dtype, batch_size=inputs.shape[0]
        )
else:
    begin_state = self.rnn.begin_state(
        func=F.zeros, dtype=self.dtype, batch_size=0
    )
@lostella (Contributor) commented Apr 28, 2020

It looks to me like this context logic should not be in the network code. Maybe it would be better to control the context once and for all from the outside somehow? Otherwise I can see how the code of every model could be filled with similar context logic, and lose readability. What do you think?

Edit: maybe one idea would be to invoke the network within a with ctx: block for the right ctx? I guess that would be in the trainer.

@ehsanmok (Author) replied:

Agreed, and in fact this logic is repeated. Though I was expecting this to land in upstream MXNet, so we wouldn't have to hack it in here.
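For reference, a minimal sketch of the with ctx idea discussed above (not actual GluonTS code): mx.Context objects are context managers, so entering one in the trainer makes it the default context for array creation (e.g. F.zeros in begin_state), which would remove the need for per-model context logic.

import mxnet as mx

def forward_in_context(net, batch, ctx):
    # ctx is an mx.Context, e.g. mx.gpu(0); arrays created inside the
    # block default to this context.
    with ctx:
        return net(batch.as_in_context(ctx))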

@istorch commented May 13, 2020

@ehsanmok @lostella Sorry to bug you guys again, but has there been any progress on this PR?

@ehsanmok (Author):

@lostella Thanks for your comments! In general, I think that since GluonTS hasn't taken multi-GPU into consideration and was designed for a single GPU (at least it felt that way at the time this PR was created), this PR offers a sub-optimal initial solution, with room for improvement of course.

As for the MWE, I've run all the existing examples and observed improved, though still sub-optimal, speed-ups, which I can report here later.

@ehsanmok (Author):

Hi @istorch, thanks for your patience! As @lostella pointed out earlier, this PR offers a solution, and it'd be helpful to have some external validation too, so if you can, please run your application to test things further. You can install this patch via

python -m pip install git+git://github.com/ehsanmok/gluon-ts@multi-gpu

@istorch commented May 14, 2020

@ehsanmok Thanks for picking this up again! I have installed your branch and passed the context in as a list, like ctx = [mx.gpu(i) for i in range(8)]. Watching nvidia-smi, I see that all 8 GPUs are utilized during training. However, when I go to generate predictions, it still only uses one GPU. Is there anything special I need to do to make it use all GPUs for prediction as well?

@ehsanmok (Author):

@istorch thanks a lot for the report! I'll look into the prediction part.

@istorch commented May 15, 2020

@ehsanmok also, I did a little benchmarking and I'm not seeing any improvement in using multiple GPUs. With the same settings, my training ran in 7min 40s on a single GPU, and 8min 51s on 8 GPUs. I set epochs to 50, batch_size to 512, and num_batches_per_epoch to 16 (this is on a p2.8xlarge instance).

@ehsanmok (Author) commented May 15, 2020

@istorch The speedup also depends on how parallelizable your model is. To make a correct comparison, did you multiply the batch size by the number of GPUs (e.g. if you used 512 for a single GPU, you'd need 512 * 8 for your 8-GPU context)? Also, please keep num_batches_per_epoch the same when changing the context. It'd be helpful if you could share more about what your code is doing and what model you are training.
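To illustrate the comparison being suggested (hypothetical settings, mirroring the numbers reported above): scale batch_size with the number of GPUs and keep num_batches_per_epoch fixed, so each epoch processes proportionally more data in the multi-GPU run.

import mxnet as mx
from gluonts.trainer import Trainer

single_gpu = Trainer(
    ctx=[mx.gpu(0)], epochs=50, batch_size=512, num_batches_per_epoch=16
)
multi_gpu = Trainer(
    ctx=[mx.gpu(i) for i in range(8)],
    epochs=50,
    batch_size=512 * 8,        # 8x the single-GPU batch size
    num_batches_per_epoch=16,  # unchanged
)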

@ehsanmok (Author):

@lostella @istorch After investigating, it turns out that adding complete multi-GPU support is harder than I initially thought. This PR adds multi-GPU support for the training loop only. The current data loader seems to be a major bottleneck, as it was written for a single context only, and the other parts of the GluonTS estimator are hard to decouple and extend. This is one more reason to consider #829.

@ehsanmok ehsanmok changed the title Multi-gpu support Multi-gpu support in training loop May 15, 2020
@ehsanmok ehsanmok closed this Jul 16, 2021