Asynchronous allreduce? #3

Closed
hiyijian opened this issue Feb 28, 2017 · 7 comments

Comments
@hiyijian

Hi Baidu Research team,
Is it possible to build an asynchronous allreduce on top of this project? I think it is quite important when integrating allreduce into a deep learning framework such as Caffe. Would you shed some light on it?

Thanks

@shubho
Collaborator

shubho commented Feb 28, 2017

What exactly is an asynchronous allreduce?

@hiyijian
Author

Sorry, let me make it clearer.

A synchronous allreduce is like MPI_Allreduce(...): it does not return until the actual communication has finished.

An asynchronous allreduce is like MPI_Iallreduce(..., MPI_Request *request): it returns immediately, and we can wait for completion wherever we want. In a deep learning framework, this gives an opportunity to overlap computation and communication, as the Baidu SVAIL blog has already mentioned.
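A minimal sketch of the difference, using plain MPI (assuming an MPI-3 implementation; buffer names are placeholders):

```cpp
#include <mpi.h>
#include <vector>

void reduce_gradients(std::vector<float>& grads, std::vector<float>& out) {
    int count = static_cast<int>(grads.size());

    // Synchronous: blocks until the reduction has completed on all ranks.
    MPI_Allreduce(grads.data(), out.data(), count,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // Asynchronous: returns immediately with a request handle.
    MPI_Request request;
    MPI_Iallreduce(grads.data(), out.data(), count,
                   MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &request);

    // ... overlap computation (e.g. backprop for earlier layers) here ...

    // Block only when the reduced result is actually needed.
    MPI_Wait(&request, MPI_STATUS_IGNORE);
}
```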

Thanks

@shubho
Collaborator

shubho commented Feb 28, 2017

The best way to do it is at the framework level - so allreduce is done on a separate thread and the thread that requires the output of allreduce can wait on the allreduce thread while all other threads can continue executing. This is how we do it in our internal framework.

This mechanism varies from framework to framework, so it should be done by the framework authors.

Alternatively you can wrap this up in a std::async and then wait on the std::future to get the same functionality.
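For example, a rough sketch of the std::async route (assuming MPI was initialized with MPI_THREAD_MULTIPLE, and using MPI_Allreduce as a stand-in for this library's allreduce call):

```cpp
#include <future>
#include <mpi.h>

// Hypothetical wrapper: run the blocking allreduce on a worker thread
// and return a future the compute thread can wait on later.
std::future<void> async_allreduce(float* send, float* recv, int count) {
    return std::async(std::launch::async, [=] {
        MPI_Allreduce(send, recv, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    });
}

// Usage from the training loop:
//   auto fut = async_allreduce(grads, reduced, n);
//   ... keep computing ...
//   fut.get();  // blocks here until the reduction is done
```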

@hiyijian
Author

Thank you for the advice. I will give it a try.

@hiyijian
Author

hiyijian commented Mar 8, 2017

Hi, I ran into some trouble doing the allreduce in a separate thread in GPU mode.

We do the computation in the main thread and the allreduce in another thread during the backward pass. It sounds perfect. However, in practice all device memory is allocated in the CUDA context of the main thread, and that context is bound to that thread only. Of course we can easily pass a data pointer to the separate thread, but the pointer makes no sense there, since that thread's CUDA context is DIFFERENT from the main thread's. It causes the following error:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x130a620608
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------

Do you have any idea how to avoid this?

Thanks a lot

@hiyijian hiyijian reopened this Mar 8, 2017
@gibiansky
Collaborator

In our tensorflow-allreduce patch, we end up having a single background MPI thread. That MPI thread establishes the CUDA context and then runs a loop that checks a queue for allreduce requests. In order to do an allreduce, a thread can put something into the "allreduce queue" and wait for the background MPI thread to get to it. You then need some mechanism to continue -- in our patch, we have each "allreduce request" come with the thing to reduce and also with a callback to call after the reduction is complete.

Our logic is coupled with TensorFlow a fair amount, but you can look at the MPI Background Thread loop here.

Is there a similar approach that could work for your application? The key is to make sure that a single thread establishes the GPU context, calls MPI_Init, and does all future MPI communication going forward.
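As a rough sketch (not our actual code; the request struct, the names, and the plain MPI_Allreduce call are placeholders), the pattern looks something like this:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

#include <mpi.h>

// One allreduce request: the buffers to reduce plus a completion callback.
struct AllreduceRequest {
    float* data;
    float* output;
    int count;
    std::function<void()> done;  // invoked once the reduction completes
};

std::queue<AllreduceRequest> requests;
std::mutex mu;
std::condition_variable cv;

// Background thread: the only thread that owns the GPU context and issues
// MPI calls. Assumes it has already set up the device and called MPI_Init.
void mpi_thread_loop() {
    for (;;) {
        std::unique_lock<std::mutex> lock(mu);
        cv.wait(lock, [] { return !requests.empty(); });
        AllreduceRequest req = std::move(requests.front());
        requests.pop();
        lock.unlock();

        MPI_Allreduce(req.data, req.output, req.count,
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        req.done();  // let the requesting thread continue
    }
}

// Compute threads enqueue a request and return immediately.
void submit_allreduce(AllreduceRequest req) {
    {
        std::lock_guard<std::mutex> lock(mu);
        requests.push(std::move(req));
    }
    cv.notify_one();
}
```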

@hiyijian
Author

hiyijian commented Mar 9, 2017

Thanks @gibiansky. We did the job in a very similar way to yours. I finally solved the CUDA context problem via CUDA v8.0's cuCtxGetCurrent/cuCtxSetCurrent. Specifically, I call cuCtxGetCurrent to get the context handle in the main computing thread, and then call cuCtxSetCurrent to set the same context handle in the communication thread.
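For anyone hitting the same problem, the fix is essentially this (a sketch only, with error checking omitted; the function names are just for illustration):

```cpp
#include <cuda.h>

CUcontext ctx;  // shared between the two threads

// Main computing thread: capture the context the CUDA runtime created.
void capture_context() {
    cuCtxGetCurrent(&ctx);
}

// Communication thread: adopt the same context before touching any
// device pointers allocated by the main thread.
void adopt_context() {
    cuCtxSetCurrent(ctx);
}
```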

It works like a charm ~

Thanks a lot

@hiyijian hiyijian closed this as completed Mar 9, 2017