Asynchronous allreduce? #3

Closed
hiyijian opened this issue Feb 28, 2017 · 7 comments

Comments
@hiyijian

Hi Baidu Research team,
Is it possible to build an asynchronous allreduce on top of this project? I think it is quite important when integrating allreduce into a deep learning framework such as Caffe. Would you shed some light on it?

Thanks

@shubho
Collaborator

shubho commented Feb 28, 2017

What exactly is an asynchronous allreduce?

@hiyijian
Author

Sorry, let me make it clearer.

A synchronous allreduce is like MPI_Allreduce(...): it does not return until the actual communication has finished.

An asynchronous allreduce is like MPI_Iallreduce(..., MPI_Request *request): it returns immediately, and we can wait for completion wherever we want. In a deep learning framework, this gives an opportunity to overlap computation and communication, as the Baidu SVAIL blog has already mentioned.
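A minimal sketch of the difference, using plain MPI (assuming an MPI-3 implementation; buffer names are placeholders):

```cpp
#include <mpi.h>
#include <vector>

void reduce_gradients(std::vector<float>& grads, std::vector<float>& out) {
    int count = static_cast<int>(grads.size());

    // Synchronous: blocks until the reduction has completed on all ranks.
    MPI_Allreduce(grads.data(), out.data(), count,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // Asynchronous: returns immediately with a request handle.
    MPI_Request request;
    MPI_Iallreduce(grads.data(), out.data(), count,
                   MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &request);

    // ... overlap computation (e.g. backprop for earlier layers) here ...

    // Block only when the reduced result is actually needed.
    MPI_Wait(&request, MPI_STATUS_IGNORE);
}
```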

Thanks

@shubho
Collaborator

shubho commented Feb 28, 2017

The best way to do it is at the framework level - so allreduce is done on a separate thread and the thread that requires the output of allreduce can wait on the allreduce thread while all other threads can continue executing. This is how we do it in our internal framework.

This mechanism varies from framework to framework, so it should be done by the framework authors.

Alternatively you can wrap this up in a std::async and then wait on the std::future to get the same functionality.
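For example, a rough sketch of the std::async route (assuming MPI was initialized with MPI_THREAD_MULTIPLE, and using MPI_Allreduce as a stand-in for this library's allreduce call):

```cpp
#include <future>
#include <mpi.h>

// Hypothetical wrapper: run the blocking allreduce on a worker thread
// and return a future the compute thread can wait on later.
std::future<void> async_allreduce(float* send, float* recv, int count) {
    return std::async(std::launch::async, [=] {
        MPI_Allreduce(send, recv, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    });
}

// Usage from the training loop:
//   auto fut = async_allreduce(grads, reduced, n);
//   ... keep computing ...
//   fut.get();  // blocks here until the reduction is done
```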

@hiyijian
Author

Thank you for the advice. I will give it a try.

@hiyijian
Author

hiyijian commented Mar 8, 2017

Hi, I ran into some trouble doing the allreduce in a separate thread in GPU mode.

We do the computation in the main thread and the allreduce in another thread during the backward pass. It sounds perfect. However, in practice all device memory is allocated in the CUDA context of the main thread, and that context is bound to that thread only. Of course we can easily pass a data pointer to the separate thread, but the pointer makes no sense there, since that thread's CUDA context is DIFFERENT from the main thread's. It causes the following error:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x130a620608
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------

Do you have any idea how to avoid this?

Thanks a lot

@hiyijian hiyijian reopened this Mar 8, 2017
@gibiansky
Collaborator

In our tensorflow-allreduce patch, we end up having a single background MPI thread. That MPI thread establishes the CUDA context and then runs a loop that checks a queue for allreduce requests. In order to do an allreduce, a thread can put something into the "allreduce queue" and wait for the background MPI thread to get to it. You then need some mechanism to continue -- in our patch, we have each "allreduce request" come with the thing to reduce and also with a callback to call after the reduction is complete.

Our logic is coupled with TensorFlow a fair amount, but you can look at the MPI Background Thread loop here.

Is there a similar approach that could work for your application? The key is to make sure that a single thread establishes the GPU context, calls MPI_Init, and does all future MPI communication going forward.
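As a rough sketch (not our actual code; the request struct, the names, and the plain MPI_Allreduce call are placeholders), the pattern looks something like this:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

#include <mpi.h>

// One allreduce request: the buffers to reduce plus a completion callback.
struct AllreduceRequest {
    float* data;
    float* output;
    int count;
    std::function<void()> done;  // invoked once the reduction completes
};

std::queue<AllreduceRequest> requests;
std::mutex mu;
std::condition_variable cv;

// Background thread: the only thread that owns the GPU context and issues
// MPI calls. Assumes it has already set up the device and called MPI_Init.
void mpi_thread_loop() {
    for (;;) {
        std::unique_lock<std::mutex> lock(mu);
        cv.wait(lock, [] { return !requests.empty(); });
        AllreduceRequest req = std::move(requests.front());
        requests.pop();
        lock.unlock();

        MPI_Allreduce(req.data, req.output, req.count,
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        req.done();  // let the requesting thread continue
    }
}

// Compute threads enqueue a request and return immediately.
void submit_allreduce(AllreduceRequest req) {
    {
        std::lock_guard<std::mutex> lock(mu);
        requests.push(std::move(req));
    }
    cv.notify_one();
}
```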

@hiyijian
Author

hiyijian commented Mar 9, 2017

Thanks @gibiansky. We did the job in a very similar way to yours. I finally solved the CUDA context problem via CUDA v8.0's cuCtxGetCurrent/cuCtxSetCurrent. Specifically, I call cuCtxGetCurrent to get the context handle in the main computing thread, and then call cuCtxSetCurrent to set the same context handle in the communication thread.
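For anyone hitting the same problem, the fix is essentially this (a sketch only, with error checking omitted; the function names are just for illustration):

```cpp
#include <cuda.h>

CUcontext ctx;  // shared between the two threads

// Main computing thread: capture the context the CUDA runtime created.
void capture_context() {
    cuCtxGetCurrent(&ctx);
}

// Communication thread: adopt the same context before touching any
// device pointers allocated by the main thread.
void adopt_context() {
    cuCtxSetCurrent(ctx);
}
```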

It works like a charm ~

Thanks a lot

@hiyijian hiyijian closed this as completed Mar 9, 2017