Asynchronous allreduce? #3
Comments
What exactly is an asynchronous allreduce?
Sorry, let me make it clearer. A synchronous allreduce is like MPI_Allreduce(...): it does not return until the actual communication has finished. An asynchronous allreduce is like MPI_Iallreduce(..., MPI_Request *request): it returns immediately, and we can wait for it to finish wherever we want. In a deep learning framework, this gives an opportunity to overlap computation and communication, as the Baidu SVAIL blog has already mentioned. Thanks
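For reference, a minimal sketch of the difference using plain MPI (buffer names are illustrative, not this repo's API):

```cpp
#include <mpi.h>

void reduce_example(float* send, float* recv, int count) {
    // Synchronous: returns only after the reduction has completed.
    MPI_Allreduce(send, recv, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // Asynchronous: returns immediately; computation can overlap here.
    MPI_Request request;
    MPI_Iallreduce(send, recv, count, MPI_FLOAT, MPI_SUM,
                   MPI_COMM_WORLD, &request);
    // ... overlap other work, e.g. backprop for earlier layers ...
    MPI_Wait(&request, MPI_STATUS_IGNORE);  // block only when the result is needed
}
```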
The best way to do it is at the framework level, so the allreduce is done on a separate thread and the thread that requires the output of the allreduce can wait on the allreduce thread while all other threads continue executing. This is how we do it in our internal framework. This mechanism varies from framework to framework, so it should be done by the framework authors. Alternatively you can wrap this up in a ...
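A rough sketch of that per-allreduce thread pattern (not the internal framework code referenced above; it assumes MPI was initialized with MPI_THREAD_MULTIPLE, or that all MPI calls are otherwise confined to the worker thread):

```cpp
#include <mpi.h>
#include <thread>
#include <vector>

// Start the reduction of one gradient buffer on its own thread.
std::thread start_allreduce(std::vector<float>& grads) {
    return std::thread([&grads] {
        // In-place sum across all ranks; blocks only this worker thread.
        MPI_Allreduce(MPI_IN_PLACE, grads.data(),
                      static_cast<int>(grads.size()),
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    });
}

// The consumer of the reduced gradients joins the worker; other threads
// keep computing in the meantime.
void wait_for_allreduce(std::thread& worker) {
    worker.join();
}
```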
Thank you for the advice. I will give it a try.
Hi, I ran into some trouble when doing the allreduce in a separate thread in GPU mode. We do the computation in the main thread, while doing the allreduce in another thread during the backward pass. It sounds perfect. However, in practice all device memory is allocated in the CUDA context of the main thread, and that context is bound to that thread only. Of course we can pass the data pointer to the separate thread easily, but the pointer makes no sense in that thread, since the CUDA context of that thread is DIFFERENT from the main thread's. It caused the following error
Do you have any idea how to avoid this? Thanks a lot
Our logic is coupled with TensorFlow a fair amount, but you can look at the MPI background thread loop here. Is there a similar approach that could work for your application? The key is to make sure that a single thread establishes the GPU context, calls MPI_Init, and does all future MPI communication going forward.
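As a rough illustration of that idea (not the actual code linked above; all names here are made up), a dedicated background thread that owns MPI and services reduce requests from a queue might look like this:

```cpp
#include <mpi.h>
#include <condition_variable>
#include <mutex>
#include <queue>

struct ReduceRequest {
    float* data;   // buffer to reduce in place (host, or device with CUDA-aware MPI)
    int count;
    bool done = false;
};

std::queue<ReduceRequest*> pending;
std::mutex mu;
std::condition_variable cv;
bool shutdown_requested = false;

// Runs for the lifetime of the program on one dedicated thread. That thread
// sets up the GPU context (not shown), calls MPI_Init, and is the only
// thread that ever makes MPI calls.
void background_loop() {
    MPI_Init(nullptr, nullptr);
    for (;;) {
        std::unique_lock<std::mutex> lock(mu);
        cv.wait(lock, [] { return shutdown_requested || !pending.empty(); });
        if (shutdown_requested && pending.empty()) break;
        ReduceRequest* req = pending.front();
        pending.pop();
        lock.unlock();

        MPI_Allreduce(MPI_IN_PLACE, req->data, req->count,
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        lock.lock();
        req->done = true;      // compute threads wait on this flag
        cv.notify_all();
    }
    MPI_Finalize();
}
```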
Thanks @gibiansky . We did the job in a very similar way to yours. I finally solved the CUDA context problem via the CUDA v8.0 cuCtxGetCurrent/cuCtxSetCurrent calls. Specifically, I call cuCtxGetCurrent to get a context handle in the main computing thread, and then call cuCtxSetCurrent to set the same context handle in the communication thread. It works like a charm ~ Thanks a lot
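For anyone hitting the same issue, a minimal sketch of that fix with the CUDA driver API (the wrapper function names below are illustrative, not from this repo):

```cpp
#include <cuda.h>

CUcontext shared_ctx;  // context handle captured from the compute thread

// Call once on the main (compute) thread, after the CUDA context exists
// (e.g. after the first cudaMalloc or kernel launch).
void capture_context_on_main_thread() {
    cuCtxGetCurrent(&shared_ctx);
}

// Call once on the communication thread before any CUDA work there, so that
// device pointers allocated by the main thread stay valid in this thread.
void adopt_context_on_comm_thread() {
    cuCtxSetCurrent(shared_ctx);
}
```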
Hi Baidu research team,
Is it possible to make an asynchronous allreduce based on this project? I think it is quite important when we integrate allreduce into a deep learning framework such as Caffe. Would you like to shed some light on it?
Thanks