
Optimize PureNcclCommunicator to accelerate training with double buffering #216

Merged: 8 commits into master, Apr 5, 2018

Conversation

@shu65 (Member) commented Mar 2, 2018

I optimized PureNcclCommunicator to accelerate training with double buffering.

@kuenishi added this to the v1.3.0 milestone on Mar 26, 2018
@shu65 changed the title from "[WIP] Optimize PureNcclCommunicator to accelerate training with double buffering" to "Optimize PureNcclCommunicator to accelerate training with double buffering" on Mar 29, 2018
@kuenishi (Member) left a comment:

Let me check my understanding by summarizing changes:

  • Remove the thread that asynchronously ran allreduce_grad in the optimizer; it is replaced by the communicator's new interface allreduce_grad_async(), which simply runs the allreduce in the other stream that has been used for double buffering.
  • The direct division of all grads right after the allreduce operation has been replaced by a cupy.ElementwiseKernel implementation.
  • The float-width casting (model to allreduce and allreduce to model, respectively) has also been replaced with a cupy.ElementwiseKernel implementation.

Which of these do you think was the most effective? And how much did you gain from all these optimizations, if you have any numbers?

I think the first item is not that effective but rather contributes to code simplification, while the latter two contribute more to the performance. I also think the latter two benefit non-double-buffered (non-dbuf) execution; is this correct?
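
For reference, a minimal, hypothetical sketch of the kind of fused cupy.ElementwiseKernel described in the last two items; the kernel and variable names are illustrative, not the PR's actual code:

import cupy as cp

# Hypothetical fused kernel: divide the float32 allreduce result by the
# number of workers and cast it back to the model's float16 grads in one
# pass, instead of a separate division and a separate cast.
mean_and_cast = cp.ElementwiseKernel(
    'float32 x, float32 n_workers',   # summed grads and communicator size
    'float16 y',                      # destination in the model's dtype
    'y = x / n_workers',
    'mean_and_cast_grad')

summed = cp.arange(8, dtype=cp.float32)      # stand-in for the allreduced sum
grads = cp.empty(8, dtype=cp.float16)
mean_and_cast(summed, cp.float32(4), grads)  # mean over 4 workers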

@@ -144,7 +184,7 @@ def _get_nccl_type_id(dtype):
     elif dtype == np.float32:
         return nccl.NCCL_FLOAT32
     elif dtype == np.float64:
-        return nccl.NCCL_FLOAT64
+        return nccl.NCC_FLOAT64
@kuenishi (Member):

Is this intended? Otherwise please fix this typo.

                              n_elems)
-        if stream != chainer.cuda.Stream.null:
+        needs_sync = self._assign(grad_dtype, allreduce_grad_dtype, n_elems)
+        if stream != chainer.cuda.Stream.null and needs_sync:
             chainer.cuda.Stream.null.synchronize()
@kuenishi (Member):

Why is synchronization required when double buffering is not enabled? I thought everything happens in the null stream in that case... I want to know why, and it would be nice if the reason were documented here in a comment.

@shu65 (Member Author):

This synchronize() runs when double buffering is enabled and CUDA memory is allocated. So it is not required when double buffering is not enabled.
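
A simplified, hypothetical sketch of the pattern being discussed: a buffer (re)allocated and filled on the null stream is synchronized once before the stream used for double buffering touches it.

import cupy as cp

side_stream = cp.cuda.Stream(non_blocking=True)  # stream used for double buffering

buf = cp.zeros(1 << 20, dtype=cp.float32)  # allocation/fill runs on the null stream
cp.cuda.Stream.null.synchronize()          # wait so the new memory is ready
with side_stream:
    buf *= 2.0                             # now safe to use on the side stream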

         self.nccl_comm.allReduce(self.gpu_allreduce_buffer_a.ptr(),
                                  self.gpu_allreduce_buffer_b.ptr(), n_elems,
                                  _get_nccl_type_id(allreduce_grad_dtype),
                                  nccl.NCCL_SUM, stream.ptr)
-        if stream != chainer.cuda.Stream.null:
-            stream.synchronize()
@kuenishi (Member):

Ditto: why can this be removed?

@shu65 (Member Author):

Because I added synchronize() at the end of allreduce_grad in _DoubleBufferingOptimizer:
https://github.com/chainer/chainermn/pull/216/files#diff-5ac36de863ea63673cdc464693ae1accR131

@@ -62,14 +61,10 @@ def __init__(self, actual_optimizer, communicator):
             'needs_update', False)
         super(_DoubleBufferingOptimizer, self).__setattr__(
             'device', None)
@kuenishi (Member):

As I don't think this is used any more, can it be removed?

@shu65 (Member Author):

Thanks. I will remove it.


def allreduce_grad_async(self, model, stream):
@kuenishi (Member):

This cleanup is great, but since this method is only for _DoubleBufferingOptimizer it should be a private method, e.g. named def _allreduce_grad_async(...).

@shu65 (Member Author):

Ok. I will rename it.

@shu65 (Member Author) commented Apr 5, 2018

Thank you for the comments.

> I think the first item is not that effective but rather contributes to code simplification, while the latter two contribute more to the performance. I also think the latter two benefit non-double-buffered (non-dbuf) execution; is this correct?

These are correct!

@shu65 (Member Author) commented Apr 5, 2018

In the previous version, the All-Reduce processing did not overlap with the other processing in some cases.

Ideally, All-Reduce and the other processing overlap as shown in the following figure.
[figure: timeline in which All-Reduce overlaps with forward, backward, and optimize]

However, All-Reduce and the other processing did not overlap, as shown in the following figure.
[figure: timeline in which All-Reduce runs without overlapping the other processing]

The cause is that the D2D copy and the computation of the mean of the grads are performed on the null stream.

So I changed it to use the same stream as All-Reduce for the D2D copy and the mean-of-grads computation, because these processes are then fully overlapped with forward, backward, and optimize.
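
A hedged sketch of that change (illustrative names, not the communicator's real attributes): the D2D copy and the mean computation are issued on the same dedicated stream as the All-Reduce, leaving the null stream free for forward, backward, and optimize.

import cupy as cp

comm_stream = cp.cuda.Stream(non_blocking=True)  # stream shared with All-Reduce
n_workers = 4.0

def reduce_on_comm_stream(packed_grads, allreduce_buf):
    with comm_stream:
        cp.copyto(allreduce_buf, packed_grads)  # D2D copy, launched on comm_stream
        # the NCCL allReduce(sum) of allreduce_buf would be launched here,
        # also on comm_stream
        allreduce_buf /= n_workers              # mean of grads, also on comm_stream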

@kuenishi (Member) commented Apr 5, 2018

After an offline discussion, I think I understand the situation. After this pull request, a rough timeline would look like this:

      null) ---> forward ---> backward ---> d2d ---> optimize ---> f --> b --> d2d ---> opt --> ...
                                             X                                  X 
background) ---> allreduce(sum) --> /n ---> d2d ---> allreduce(sum) ---> /n -> d2d ---> allreduce --> ...

This is done with just a single thread and has far fewer synchronization points: only just before the allreduce and right after the division by size that computes the model mean.

The previous implementation had several more synchronization points; in particular, the data transfer inside the GPU from the model to the allreduce buffer (called D2D above) conflicted with the transfer of training data from host to device before forward (IIRC), or with optimize, forward, or backward. The best case would be for the transfer to start right after the swap, but the worst case would be right after backward finishes. In the worst case the allreduce starts only after backward has finished, which gives very little overlap of computation and communication. Also, since the division has moved to the background stream, the amount of computation in the null stream is reduced somewhat. Thank you for the great work!
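
An illustrative skeleton of that rough timeline (not ChainerMN's actual optimizer code): the null stream keeps running forward, backward, and optimize while a background stream sums, averages, and copies back the previous iteration's packed gradients; the two buffers swap roles each step.

import cupy as cp

bg_stream = cp.cuda.Stream(non_blocking=True)
grads_training = cp.zeros(1 << 20, dtype=cp.float32)  # filled by backward
grads_reducing = cp.zeros(1 << 20, dtype=cp.float32)  # being allreduced

for step in range(100):
    with bg_stream:
        # the NCCL allreduce(sum) of grads_reducing would be launched here,
        # followed by the division by the communicator size and the D2D copy
        grads_reducing /= 4.0
    # ... forward and backward on the null stream fill grads_training ...
    bg_stream.synchronize()  # the single sync point, just before the swap
    grads_training, grads_reducing = grads_reducing, grads_training
    # ... optimize on the null stream applies the averaged grads ...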

@kuenishi merged commit 65bf577 into master on Apr 5, 2018
@kuenishi deleted the opt_pure_nccl_comunicator branch on April 5, 2018 at 10:29