
Are the collective APIs provided by gloo thread-safe? #35

Closed · hiyijian opened this issue May 11, 2017 · 3 comments

hiyijian commented May 11, 2017

Dear gloo team,

Deep learning frameworks like Caffe backprop in a layer-wise fashion, so it is better to run an allreduce immediately after the gradients of each layer are computed. To overlap not only communication with computation but also the communication of different layers, these allreduces should run in separate threads, rather than being queued on a single communication thread.
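
To make the pattern concrete, here is a minimal sketch of the launch scheme I have in mind (allreduceLayer is a hypothetical stand-in, not tied to any real framework):

```cpp
// Minimal sketch (hypothetical, framework-agnostic): run one allreduce per
// layer on its own thread, so communication for layer k can overlap with
// backprop of layer k-1. allreduceLayer is a stand-in for the real call.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

void allreduceLayer(int layer) {
  // Stand-in for a real collective; pretend it spends time on the wire.
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  std::printf("allreduce for layer %d done\n", layer);
}

int main() {
  const int numLayers = 4;
  std::vector<std::thread> inflight;
  for (int layer = numLayers - 1; layer >= 0; --layer) {
    // ... gradients for `layer` would be computed here ...
    inflight.emplace_back(allreduceLayer, layer);  // overlap with next layer
  }
  for (auto& t : inflight) {
    t.join();  // barrier before the weight update
  }
  return 0;
}
```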

Unfortunately, most MPI-like software, such as OpenMPI and baidu-allreduce, does not guarantee multi-thread safety.

Is this also a concern for you? How can we achieve this using Gloo?

Thanks

pietern (Contributor) commented May 11, 2017

Hi!

This is not a concern with Gloo. To do exactly what you describe, Caffe2 creates multiple Gloo contexts and uses a different context for each successive allreduce operation. To keep the number of contexts lower than the number of layers (which might be a lot), it creates a fixed number (e.g. 16) and cycles through them for reuse. This allows a maximum of 16 (or however many you choose) allreduce operations to run in parallel. Since every context is completely independent of every other context, you can create as many as you need and use them from any number of threads.

Heads up: for each new context, Gloo will create new pairs and the transport-layer equivalent (TCP sockets or ibverbs queue pairs), so if you use a large number of contexts across a large number of nodes, you may run into OS-level limits on the number of sockets.
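
A rough sketch of that pattern, assuming Gloo's rendezvous API (FileStore/PrefixStore plus connectFullMesh); the pool and cycling helpers below are illustrative, not code lifted from Caffe2:

```cpp
#include <atomic>
#include <memory>
#include <string>
#include <vector>

#include "gloo/allreduce_ring.h"
#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/prefix_store.h"
#include "gloo/rendezvous/store.h"
#include "gloo/transport/tcp/device.h"

const int kNumContexts = 16;  // upper bound on concurrent allreduces

// Build a fixed pool of independent contexts. Each context rendezvouses
// under its own store prefix, so its pairs (TCP sockets / ibverbs queue
// pairs) are separate from those of every other context.
std::vector<std::shared_ptr<gloo::rendezvous::Context>> createContextPool(
    int rank, int size,
    std::shared_ptr<gloo::transport::Device>& dev,
    gloo::rendezvous::Store& store) {
  std::vector<std::shared_ptr<gloo::rendezvous::Context>> pool;
  for (int i = 0; i < kNumContexts; i++) {
    gloo::rendezvous::PrefixStore prefixStore(
        "allreduce/" + std::to_string(i), store);
    auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
    context->connectFullMesh(prefixStore, dev);
    pool.push_back(std::move(context));
  }
  return pool;
}

// Cycle through the pool: at most kNumContexts allreduces in flight at
// once, callable from any number of threads, as long as no single context
// is used by two calls at the same time.
std::atomic<int> nextCtx{0};

void allreduceGradients(
    std::vector<std::shared_ptr<gloo::rendezvous::Context>>& pool,
    float* grad, int count) {
  auto& context = pool[nextCtx++ % kNumContexts];
  gloo::AllreduceRing<float> allreduce(context, {grad}, count);
  allreduce.run();
}
```

A real implementation would likely also cache one algorithm instance per context instead of constructing an AllreduceRing on every call, and would serialize access to each context rather than relying on pure round-robin.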

pietern closed this as completed May 11, 2017
pietern (Contributor) commented May 11, 2017

Also see data_parallel_model.py for an example of how this is done in Caffe2: https://github.com/caffe2/caffe2/blob/1456fa794b8441e4c9fa727cdd2f36e1f044f184/caffe2/python/data_parallel_model.py#L432-L478

hiyijian (Author) commented

Hi @pietern,
Thank you very much.

But can we really benefit from this kind of multi-threading trick? Have you ever compared single-threaded allreduce with multi-threaded allreduce on the same amount of data? For example: 10 jobs, each allreducing 8M floats, running on 10 threads vs. sequentially on a single thread.
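
For concreteness, a sketch of the comparison I mean; fakeAllreduce is a local stand-in, and a real test would call into Gloo or baidu-allreduce, where all threads share one NIC, so a bandwidth-bound collective could make the two timings come out roughly equal:

```cpp
// Sketch of the benchmark: time N allreduces of 8M floats run sequentially
// vs. on N threads. fakeAllreduce is a hypothetical stand-in for the real
// collective; it only touches the buffer so the work is not optimized away.
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

const int kJobs = 10;
const size_t kCount = 8 * 1024 * 1024;  // 8M floats per job

void fakeAllreduce(std::vector<float>& buf) {
  for (auto& v : buf) v *= 0.5f;  // stand-in for the real collective
}

double runSequential(std::vector<std::vector<float>>& bufs) {
  auto start = std::chrono::steady_clock::now();
  for (auto& buf : bufs) fakeAllreduce(buf);
  return std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();
}

double runThreaded(std::vector<std::vector<float>>& bufs) {
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (auto& buf : bufs) threads.emplace_back(fakeAllreduce, std::ref(buf));
  for (auto& t : threads) t.join();
  return std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();
}

int main() {
  std::vector<std::vector<float>> bufs(kJobs, std::vector<float>(kCount, 1.0f));
  std::printf("sequential: %.3fs\n", runSequential(bufs));
  std::printf("threaded:   %.3fs\n", runThreaded(bufs));
  return 0;
}
```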

My experiment with baidu-allreduce shows that the total time costs are roughly the same. baidu-allreduce uses the same chunked ring algorithm as Gloo, except that it is built on top of OpenMPI while Gloo is MPI-free.

It seems we cannot benefit from this trick at all, perhaps because the real limitation is hardware such as the network card. Would Gloo do better?
