
Are the collective APIs provided by gloo thread-safe? #35

Closed · hiyijian opened this issue May 11, 2017 · 3 comments

hiyijian commented May 11, 2017

Dear gloo team,

Deep learning frameworks like Caffe backprop in a layer-wise fashion, so it is better to run an allreduce immediately after the gradients of each layer are computed. To overlap not only communication with computation but also the communication of different layers, these allreduces should run in separate threads, rather than being queued on a single communication thread.
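
To make the pattern concrete, here is a minimal sketch of the launch scheme I have in mind (allreduceLayer is a hypothetical stand-in, not tied to any real framework):

```cpp
// Minimal sketch (hypothetical, framework-agnostic): run one allreduce per
// layer on its own thread, so communication for layer k can overlap with
// backprop of layer k-1. allreduceLayer is a stand-in for the real call.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

void allreduceLayer(int layer) {
  // Stand-in for a real collective; pretend it spends time on the wire.
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  std::printf("allreduce for layer %d done\n", layer);
}

int main() {
  const int numLayers = 4;
  std::vector<std::thread> inflight;
  for (int layer = numLayers - 1; layer >= 0; --layer) {
    // ... gradients for `layer` would be computed here ...
    inflight.emplace_back(allreduceLayer, layer);  // overlap with next layer
  }
  for (auto& t : inflight) {
    t.join();  // barrier before the weight update
  }
  return 0;
}
```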

Unfortunately, most MPI-like software, such as OpenMPI and baidu-allreduce, does not guarantee multi-thread safety.

Is this also a concern for you? How can we achieve this using Gloo?

Thanks

pietern (Contributor) commented May 11, 2017

Hi!

This is not a concern with Gloo. To do exactly what you describe, Caffe2 creates multiple Gloo contexts and uses a different context for each successive allreduce operation. To keep the number of contexts lower than the number of layers (which might be a lot), it creates a fixed number (e.g. 16) and cycles through them for reuse. This allows a maximum of 16 (or however many you choose) allreduce operations to run in parallel. Since every context is completely independent of every other context, you can create as many as you need and use them from any number of threads.

Heads up: for each new context, Gloo will create new pairs and the transport-layer equivalent (TCP sockets or ibverbs queue pairs), so if you use a large number of contexts across a large number of nodes, you may run into OS-level limits on the number of sockets.
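
A rough sketch of that pattern, assuming Gloo's rendezvous API (FileStore/PrefixStore plus connectFullMesh); the pool and cycling helpers below are illustrative, not code lifted from Caffe2:

```cpp
#include <atomic>
#include <memory>
#include <string>
#include <vector>

#include "gloo/allreduce_ring.h"
#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/prefix_store.h"
#include "gloo/rendezvous/store.h"
#include "gloo/transport/tcp/device.h"

const int kNumContexts = 16;  // upper bound on concurrent allreduces

// Build a fixed pool of independent contexts. Each context rendezvouses
// under its own store prefix, so its pairs (TCP sockets / ibverbs queue
// pairs) are separate from those of every other context.
std::vector<std::shared_ptr<gloo::rendezvous::Context>> createContextPool(
    int rank, int size,
    std::shared_ptr<gloo::transport::Device>& dev,
    gloo::rendezvous::Store& store) {
  std::vector<std::shared_ptr<gloo::rendezvous::Context>> pool;
  for (int i = 0; i < kNumContexts; i++) {
    gloo::rendezvous::PrefixStore prefixStore(
        "allreduce/" + std::to_string(i), store);
    auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
    context->connectFullMesh(prefixStore, dev);
    pool.push_back(std::move(context));
  }
  return pool;
}

// Cycle through the pool: at most kNumContexts allreduces in flight at
// once, callable from any number of threads, as long as no single context
// is used by two calls at the same time.
std::atomic<int> nextCtx{0};

void allreduceGradients(
    std::vector<std::shared_ptr<gloo::rendezvous::Context>>& pool,
    float* grad, int count) {
  auto& context = pool[nextCtx++ % kNumContexts];
  gloo::AllreduceRing<float> allreduce(context, {grad}, count);
  allreduce.run();
}
```

A real implementation would likely also cache one algorithm instance per context instead of constructing an AllreduceRing on every call, and would serialize access to each context rather than relying on pure round-robin.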

pietern closed this as completed May 11, 2017
pietern (Contributor) commented May 11, 2017

Also see data_parallel_model.py for an example of how this is done in Caffe2: https://github.com/caffe2/caffe2/blob/1456fa794b8441e4c9fa727cdd2f36e1f044f184/caffe2/python/data_parallel_model.py#L432-L478

hiyijian (Author) commented

Hi @pietern,
Thank you very much.

But can we really benefit from this kind of multi-threading trick? Have you ever compared single-threaded allreduce with multi-threaded allreduce on the same amount of data? For example: 10 jobs, each allreducing 8M floats, running on 10 threads vs. sequentially on a single thread.
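
For concreteness, a sketch of the comparison I mean; fakeAllreduce is a local stand-in, and a real test would call into Gloo or baidu-allreduce, where all threads share one NIC, so a bandwidth-bound collective could make the two timings come out roughly equal:

```cpp
// Sketch of the benchmark: time N allreduces of 8M floats run sequentially
// vs. on N threads. fakeAllreduce is a hypothetical stand-in for the real
// collective; it only touches the buffer so the work is not optimized away.
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

const int kJobs = 10;
const size_t kCount = 8 * 1024 * 1024;  // 8M floats per job

void fakeAllreduce(std::vector<float>& buf) {
  for (auto& v : buf) v *= 0.5f;  // stand-in for the real collective
}

double runSequential(std::vector<std::vector<float>>& bufs) {
  auto start = std::chrono::steady_clock::now();
  for (auto& buf : bufs) fakeAllreduce(buf);
  return std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();
}

double runThreaded(std::vector<std::vector<float>>& bufs) {
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (auto& buf : bufs) threads.emplace_back(fakeAllreduce, std::ref(buf));
  for (auto& t : threads) t.join();
  return std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();
}

int main() {
  std::vector<std::vector<float>> bufs(kJobs, std::vector<float>(kCount, 1.0f));
  std::printf("sequential: %.3fs\n", runSequential(bufs));
  std::printf("threaded:   %.3fs\n", runThreaded(bufs));
  return 0;
}
```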

My experiment with baidu-allreduce shows that the total time costs are roughly the same. baidu-allreduce uses the same chunked ring algorithm as Gloo, except that it is built on top of OpenMPI while Gloo is MPI-free.

It seems we cannot benefit from this trick at all, perhaps because the real limitation is hardware such as the network card. Would Gloo do better?
