
Optimize PureNcclCommunicator to accelerate training with double buffering #216

Merged: 8 commits into master, Apr 5, 2018

Conversation

@shu65 (Member) commented Mar 2, 2018

I optimized PureNcclCommunicator to accelerate training with double buffering.

@kuenishi added this to the v1.3.0 milestone on Mar 26, 2018
@shu65 changed the title from "[WIP] Optimize PureNcclCommunicator to accelerate training with double buffering" to "Optimize PureNcclCommunicator to accelerate training with double buffering" on Mar 29, 2018
@kuenishi (Member) left a comment:

Let me check my understanding by summarizing changes:

  • Remove the thread that asynchronously ran allreduce_grad in the optimizer; it is replaced by the communicator's new interface allreduce_grad_async(), which simply runs the allreduce in the other stream that has been used for double buffering.
  • The direct division of all grads right after the allreduce operation has been replaced by a cupy.ElementwiseKernel implementation.
  • The float-width casting (model to allreduce and allreduce to model, respectively) has also been replaced with a cupy.ElementwiseKernel implementation.

Which of these do you think was the most effective? And how much did you gain from all these optimizations, if you have any numbers?

I think the first item is not that effective but rather contributes to code simplification, while the latter two contribute more to the performance. I also think the latter two benefit non-double-buffered (non-dbuf) execution; is this correct?
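
For reference, a minimal, hypothetical sketch of the kind of fused cupy.ElementwiseKernel described in the last two items; the kernel and variable names are illustrative, not the PR's actual code:

import cupy as cp

# Hypothetical fused kernel: divide the float32 allreduce result by the
# number of workers and cast it back to the model's float16 grads in one
# pass, instead of a separate division and a separate cast.
mean_and_cast = cp.ElementwiseKernel(
    'float32 x, float32 n_workers',   # summed grads and communicator size
    'float16 y',                      # destination in the model's dtype
    'y = x / n_workers',
    'mean_and_cast_grad')

summed = cp.arange(8, dtype=cp.float32)      # stand-in for the allreduced sum
grads = cp.empty(8, dtype=cp.float16)
mean_and_cast(summed, cp.float32(4), grads)  # mean over 4 workers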

@@ -144,7 +184,7 @@ def _get_nccl_type_id(dtype):
     elif dtype == np.float32:
         return nccl.NCCL_FLOAT32
     elif dtype == np.float64:
-        return nccl.NCCL_FLOAT64
+        return nccl.NCC_FLOAT64
@kuenishi (Member):

Is this intended? Otherwise please fix this typo.

                              n_elems)
-        if stream != chainer.cuda.Stream.null:
+        needs_sync = self._assign(grad_dtype, allreduce_grad_dtype, n_elems)
+        if stream != chainer.cuda.Stream.null and needs_sync:
             chainer.cuda.Stream.null.synchronize()
@kuenishi (Member):

Why is synchronization required when double buffering is not enabled? I thought everything happens in the null stream in that case... I want to know why, and it would be nice if the reason were documented here in a comment.

@shu65 (Member Author):

This synchronize() runs when double buffering is enabled and CUDA memory is allocated. So it is not required when double buffering is not enabled.
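
A simplified, hypothetical sketch of the pattern being discussed: a buffer (re)allocated and filled on the null stream is synchronized once before the stream used for double buffering touches it.

import cupy as cp

side_stream = cp.cuda.Stream(non_blocking=True)  # stream used for double buffering

buf = cp.zeros(1 << 20, dtype=cp.float32)  # allocation/fill runs on the null stream
cp.cuda.Stream.null.synchronize()          # wait so the new memory is ready
with side_stream:
    buf *= 2.0                             # now safe to use on the side stream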

         self.nccl_comm.allReduce(self.gpu_allreduce_buffer_a.ptr(),
                                  self.gpu_allreduce_buffer_b.ptr(), n_elems,
                                  _get_nccl_type_id(allreduce_grad_dtype),
                                  nccl.NCCL_SUM, stream.ptr)
-        if stream != chainer.cuda.Stream.null:
-            stream.synchronize()
@kuenishi (Member):

Ditto: why can this be removed?

@shu65 (Member Author):

Because I added synchronize() at the end of allreduce_grad in _DoubleBufferingOptimizer:
https://github.com/chainer/chainermn/pull/216/files#diff-5ac36de863ea63673cdc464693ae1accR131

@@ -62,14 +61,10 @@ def __init__(self, actual_optimizer, communicator):
             'needs_update', False)
         super(_DoubleBufferingOptimizer, self).__setattr__(
             'device', None)
@kuenishi (Member):

As I don't think this is used any more, can it be removed?

@shu65 (Member Author):

Thanks. I will remove it.


def allreduce_grad_async(self, model, stream):
@kuenishi (Member):

This cleanup is great, but since this method is only for _DoubleBufferingOptimizer it should be a private method, e.g. named def _allreduce_grad_async(...).

@shu65 (Member Author):

Ok. I will rename it.

@shu65 (Member Author) commented Apr 5, 2018

Thank you for the comments.

> I think the first item is not that effective but rather contributes to code simplification, while the latter two contribute more to the performance. I also think the latter two benefit non-double-buffered (non-dbuf) execution; is this correct?

These are correct!

@shu65 (Member Author) commented Apr 5, 2018

In the previous version, the All-Reduce processing did not overlap with the other processing in some cases.

Ideally, All-Reduce and the other processing overlap as shown in the following figure.
[figure: timeline in which All-Reduce overlaps with forward, backward, and optimize]

However, All-Reduce and the other processing did not overlap, as shown in the following figure.
[figure: timeline in which All-Reduce runs without overlapping the other processing]

The cause is that the D2D copy and the computation of the mean of the grads are performed on the null stream.

So I changed it to use the same stream as All-Reduce for the D2D copy and the mean-of-grads computation, because these processes are then fully overlapped with forward, backward, and optimize.
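
A hedged sketch of that change (illustrative names, not the communicator's real attributes): the D2D copy and the mean computation are issued on the same dedicated stream as the All-Reduce, leaving the null stream free for forward, backward, and optimize.

import cupy as cp

comm_stream = cp.cuda.Stream(non_blocking=True)  # stream shared with All-Reduce
n_workers = 4.0

def reduce_on_comm_stream(packed_grads, allreduce_buf):
    with comm_stream:
        cp.copyto(allreduce_buf, packed_grads)  # D2D copy, launched on comm_stream
        # the NCCL allReduce(sum) of allreduce_buf would be launched here,
        # also on comm_stream
        allreduce_buf /= n_workers              # mean of grads, also on comm_stream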

@kuenishi (Member) commented Apr 5, 2018

After an offline discussion, I think I understand the situation. After this pull request, a rough timeline would look like this:

      null) ---> forward ---> backward ---> d2d ---> optimize ---> f --> b --> d2d ---> opt --> ...
                                             X                                  X 
background) ---> allreduce(sum) --> /n ---> d2d ---> allreduce(sum) ---> /n -> d2d ---> allreduce --> ...

This is done with just a single thread and has far fewer synchronization points: only just before the allreduce and right after the division by size that computes the model mean.

The previous implementation had several more synchronization points; in particular, the data transfer inside the GPU from the model to the allreduce buffer (called D2D above) conflicted with the transfer of training data from host to device before forward (IIRC), or with optimize, forward, or backward. The best case would be for the transfer to start right after the swap, but the worst case would be right after backward finishes. In the worst case the allreduce starts only after backward has finished, which gives very little overlap of computation and communication. Also, since the division has moved to the background stream, the amount of computation in the null stream is reduced somewhat. Thank you for the great work!
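
An illustrative skeleton of that rough timeline (not ChainerMN's actual optimizer code): the null stream keeps running forward, backward, and optimize while a background stream sums, averages, and copies back the previous iteration's packed gradients; the two buffers swap roles each step.

import cupy as cp

bg_stream = cp.cuda.Stream(non_blocking=True)
grads_training = cp.zeros(1 << 20, dtype=cp.float32)  # filled by backward
grads_reducing = cp.zeros(1 << 20, dtype=cp.float32)  # being allreduced

for step in range(100):
    with bg_stream:
        # the NCCL allreduce(sum) of grads_reducing would be launched here,
        # followed by the division by the communicator size and the D2D copy
        grads_reducing /= 4.0
    # ... forward and backward on the null stream fill grads_training ...
    bg_stream.synchronize()  # the single sync point, just before the swap
    grads_training, grads_reducing = grads_reducing, grads_training
    # ... optimize on the null stream applies the averaged grads ...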

@kuenishi merged commit 65bf577 into master on Apr 5, 2018
@kuenishi deleted the opt_pure_nccl_comunicator branch on April 5, 2018 at 10:29