Multi-gpu support in training loop #657
Conversation
Thanks! I think if we set `ctx` to be a list, we might want to rename it to `ctxs` or something similar.
@jaheba thanks for the review! All the comments have been addressed. I think it's better to avoid name changes that are inconsistent with mxnet, so I'm keeping the `ctx` name.
Codecov Report
```
@@            Coverage Diff             @@
##           master     #657      +/-   ##
==========================================
+ Coverage   84.63%   84.64%   +<.01%
==========================================
  Files         178      178
  Lines       10401    10425      +24
==========================================
+ Hits         8803     8824      +21
- Misses       1598     1601       +3
```
@istorch the changes look good to me, so this could be merged into master soon and be included in the next release (either this or next week).
Hey @ehsanmok, see my specific comments inline. The main concern I have is that we currently lack proper regression tests (running extensive training on some datasets), and also I'm not familiar with multiple contexts and `split_and_load` in mxnet, so I'll need to look into this deeper or have someone else also review this.
Could you maybe add a MWE to the original post, describing expected behavior before/after the PR?
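For reference, a minimal example of the intended before/after behavior might look like the following sketch; the dataset choice, import paths, and estimator/Trainer parameters are assumptions based on the gluonts API of that time, not code from this PR:
```python
import mxnet as mx
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.simple_feedforward import SimpleFeedForwardEstimator
from gluonts.trainer import Trainer

dataset = get_dataset("m4_hourly")

def make_estimator(ctx):
    # assumed estimator/Trainer parameters, for illustration only
    return SimpleFeedForwardEstimator(
        freq=dataset.metadata.freq,
        prediction_length=dataset.metadata.prediction_length,
        trainer=Trainer(ctx=ctx, epochs=5, num_batches_per_epoch=16),
    )

# Before this PR: only a single context.
predictor_single = make_estimator(mx.gpu(0)).train(dataset.train)

# After this PR (expected behavior): a list of contexts trains on multiple GPUs.
predictor_multi = make_estimator([mx.gpu(0), mx.gpu(1)]).train(dataset.train)
```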
```diff
 params = net_source.collect_params()
 for p in params:
     ctx = params[p].list_ctx()
     break
 # force init otherwise may not load params
 # since not all params have the same ctx in net_source
 net_dest.initialize(ctx=ctx, force_reinit=True)
 net_dest.load_parameters(
     model_dir_path,
-    ctx=mx.current_context(),
+    ctx=ctx,
```
I think this limits the way this function can act: correct me if I'm wrong, but this way the code puts all the parameters of `net_dest` in the same context, and this context is that of the first parameter of `net_source`, right?
I think the point of using `mx.current_context()` here is being able to (potentially, not sure this is done anywhere) perform training on GPU and have the predictor network use the CPU instead.
Edit: can you explain why this is needed?
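A tiny sketch of the mechanism being described (illustrative only; the toy network and file path are not from this PR): whoever loads the parameters controls the device through the active default context.
```python
import mxnet as mx
from mxnet.gluon import nn

# toy network standing in for the predictor; the path is made up
net = nn.Dense(4)
net.initialize(ctx=mx.cpu())
net(mx.nd.zeros((1, 8)))  # trigger shape inference so parameters materialize
net.save_parameters("/tmp/toy.params")

# The caller decides where the predictor lives: inside this block,
# mx.current_context() is the CPU, regardless of the training device.
with mx.cpu():
    net.load_parameters("/tmp/toy.params", ctx=mx.current_context())
```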
If I remember correctly, during testing, when not all params were in a single context, forcing a reinit with a list of contexts solved the issue. Please see my general comment at the end.
```diff
@@ -96,7 +96,7 @@ def __init__(
     cyclic=self.cyclic,
     is_train=self.is_train,
     batch_size=self.batch_size,
-    ctx=self.ctx,
+    ctx=self.ctx[0],
```
I'm not sure I understand this: one can specify a list of contexts, but then the data is loaded into the first one? I'm really not familiar with this topic, though, so maybe I'm missing something.
Edit: is the idea to put everything in the first context, and then have `split_and_load` in the training loop take care of that?
> is the idea to put everything in the first context, and then have split_and_load in the training loop take care of that?

Yes, indeed!
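For context, a rough self-contained sketch of that pattern (the toy network and hyperparameters are assumptions, not code from this PR): the whole batch lives on one device and `gluon.utils.split_and_load` scatters it across the context list inside the training loop.
```python
import mxnet as mx
from mxnet import autograd, gluon, nd

# assumed setup: fall back to the CPU when fewer than two GPUs are available
ctxs = [mx.gpu(0), mx.gpu(1)] if mx.context.num_gpus() >= 2 else [mx.cpu()]

net = gluon.nn.Dense(1)
net.initialize(ctx=ctxs)
loss_fn = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), "adam")

batch_size = 32
x = nd.random.normal(shape=(batch_size, 8))  # whole batch on the default context
y = nd.random.normal(shape=(batch_size, 1))

# scatter the batch across the contexts, then one forward/backward per slice
xs = gluon.utils.split_and_load(x, ctx_list=ctxs)
ys = gluon.utils.split_and_load(y, ctx_list=ctxs)
with autograd.record():
    losses = [loss_fn(net(xi), yi) for xi, yi in zip(xs, ys)]
for l in losses:
    l.backward()
trainer.step(batch_size)
```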
```python
if F is mx.ndarray:
    ctx = (
        inputs.context
        if isinstance(inputs, mx.gluon.tensor_types)
        else inputs[0].context
    )
    with ctx:
        begin_state = self.rnn.begin_state(
            func=F.zeros, dtype=self.dtype, batch_size=inputs.shape[0]
        )
else:
    begin_state = self.rnn.begin_state(
        func=F.zeros, dtype=self.dtype, batch_size=0
    )
```
It looks to me like this context logic should not be in the network code. Maybe it would be better to control the context once and for all from the outside somehow? Otherwise the code of every model could end up filled with similar context logic and lose readability. What do you think?
Edit: maybe one idea would be to invoke the network within `with ctx` for the right `ctx`? I guess that would be in the trainer.
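A sketch of what that suggestion could look like (an assumption, not the PR's code): the trainer picks the context and wraps the network invocation in it, so factories that default to the current context need no per-model plumbing.
```python
import mxnet as mx
from mxnet.gluon import nn

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()  # trainer's choice

net = nn.Dense(4)
net.initialize(ctx=ctx)

with ctx:  # a Context is a context manager that sets the default device
    # arrays created without an explicit ctx (e.g. F.zeros in begin_state)
    # now land on `ctx` automatically
    x = mx.nd.random.normal(shape=(2, 8))
    out = net(x)
    assert out.context == ctx
```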
Agreed, and in fact this logic is repeated elsewhere. Though I was expecting to have this handled in upstream mxnet rather than hacking it in here.
@lostella Thanks for your comments! In general, I think, since gluonts hadn't taken multi-gpu into consideration and was designed for a single gpu (at least it felt that way at the time of creating this PR), this PR initiates a sub-optimal solution, with room for improvement of course. As for the MWE, I've run all the existing examples and observed improved (though still sub-optimal) speed-ups, which I can report here later.
Hi @istorch, thanks for your patience! As @lostella pointed out earlier, this PR offers a solution and it'd be helpful to have some external validation too, so if you can, please run your application to test things further. You can install this patch directly from the PR branch, e.g. as sketched below.
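A typical way to install a PR branch directly with pip would be something like the line below; the fork and branch names are placeholders, not the actual ones from this PR:
```
pip install git+https://github.com/<fork>/gluon-ts.git@<multi-gpu-branch>
```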
@ehsanmok Thanks for picking this up again! I have installed your branch and passed the context in as a list, along the lines of the sketch below.
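A plausible shape for such a list (an assumption for illustration, using eight devices to match the p2.8xlarge benchmark below):
```python
import mxnet as mx

# hypothetical: one context per GPU on an 8-GPU machine
ctx = [mx.gpu(i) for i in range(8)]
```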
@istorch thanks a lot for the report! I'll look into the prediction part.
@ehsanmok also, I did a little benchmarking and I'm not seeing any improvement from using multiple GPUs. With the same settings, my training ran in 7min 40s on a single GPU, and 8min 51s on 8 GPUs. I set epochs to 50, batch_size to 512, and num_batches_per_epoch to 16 (this is on a p2.8xlarge instance).
@istorch The speedup also depends on how parallelizable your model is. To make a correct comparison, did you multiply the batch size by the number of GPUs (e.g. if you used 512 for a single GPU, you'd need to use 512 * 8 for your 8-GPU context)? Also, please keep
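For reference, a scaled-up configuration along those lines might look as follows (a sketch only; the Trainer parameters and import path are assumed from the gluonts API of that time):
```python
import mxnet as mx
from gluonts.trainer import Trainer

num_gpus = 8
base_batch_size = 512  # the single-GPU batch size used in the benchmark above

# multiply the global batch size by the number of GPUs so each device still
# processes 512 samples per step; other settings as in the benchmark
trainer = Trainer(
    ctx=[mx.gpu(i) for i in range(num_gpus)],
    epochs=50,
    batch_size=base_batch_size * num_gpus,  # 4096
    num_batches_per_epoch=16,
)
```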
@lostella @istorch after some investigation it turns out that adding the complete multi-gpu feature is harder than I initially thought. This PR adds multi-gpu support for the training loop only. The current data loader seems to be a major bottleneck, as it was written for a single context only, and the other parts of the GluonTS estimator are hard to decouple and extend. This is one more reason to consider #829.
Issue #162
Description of changes:
Adding multi-gpu support. Besides local tests, all examples are replicable and have been tested on p3.8xlarge with 4 GPUs.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.