Require more details about building rabit with MPI support #23

Closed
weijianwen opened this issue Jul 29, 2015 · 6 comments

Comments

@weijianwen

Hi,

I'm trying to build wormhole from scratch, and rabit is one of its dependencies. My target platform is MPI-enabled and managed by a batch job management system. I wonder if it is possible to add more details about building rabit with MPI support to the README. For example:

  1. In order to run wormhole with MPI, is it necessary to build rabit with MPI support? How?
  2. Regarding the MPI libraries out there (OpenMPI, Intel MPI, MPICH2, MVAPICH...), what is your recommendation, and which ones have you tested?

Best,

@tqchen
Member

tqchen commented Jul 29, 2015

It is not necessary to build rabit with MPI support if you want to use rabit's own implementation as the communication lib. If you build it with MPI support, the communication lib is switched to MPI.

To build with MPI as the backend, simply link against librabit_mpi.a.

  • You can type make mpi to build it.

@weijianwen
Author

Interesting. It sounds like rabit is a "message operation" library that supports various backend engines. ZeroMQ works in a similar way.

MPI uses broadcast, collect, and reduce as verbs, so it is a nice candidate for a rabit backend engine. Extra benefits of MPI are:

  1. Good integration with high-end network fabrics such as InfiniBand.
  2. Good integration with job scheduling systems such as SLURM, LSF, OpenLava, and SGE. The scheduling system will take care of the MPI jobs for us.

But as mentioned by Tianqi, the tradeoff is: no auto recovery in MPI.

Anyway, a write-up with benchmarks comparing "rabit-socket vs. rabit-MPI vs. ZeroMQ" may be interesting. ZeroMQ is performance oriented, so no reliability mechanism is designed into it.

Please correct me if I am wrong.
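
For readers unfamiliar with the collective verbs mentioned above, here is a minimal sketch of what broadcast and allreduce mean. It is a single-process Python simulation, not rabit or MPI code: the function names `broadcast` and `allreduce` and the list-of-ranks representation are assumptions made purely for illustration.

```python
from functools import reduce
from typing import Callable, List, Sequence


def broadcast(values: List[object], root: int) -> List[object]:
    """Every simulated rank ends up holding the root rank's value."""
    return [values[root] for _ in values]


def allreduce(
    buffers: Sequence[Sequence[float]],
    op: Callable[[float, float], float],
) -> List[List[float]]:
    """Combine the per-rank buffers element-wise with `op`.

    Every simulated rank receives its own copy of the combined buffer.
    """
    combined = [reduce(op, column) for column in zip(*buffers)]
    return [list(combined) for _ in buffers]


# Three simulated workers, each holding a local buffer of two numbers.
local = [[1.0, 5.0], [4.0, 2.0], [3.0, 3.0]]

summed = allreduce(local, lambda a, b: a + b)  # element-wise sum across ranks
maxed = allreduce(local, max)                  # element-wise max across ranks
shared = broadcast(["cfg-from-rank-0", None, None], root=0)
```

A real backend (sockets or MPI) performs the same logical operation across machines; only the transport underneath differs.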

@hjk41
Member

hjk41 commented Jul 30, 2015

MPI 2.0 does allow you to dynamically spawn a new process in case you want to
restart a dead one, though I would say it is not as easy as it seems. And
currently rabit-MPI does not leverage that feature.

ZeroMQ is more optimized for small messages and is not necessarily a good
choice for machine learning workloads, since most messages in ML are large
messages. ZeroMQ is reliable, in that it automatically re-transmits
messages and it also has some kind of load balancing mechanism built into
it.

Besides communication, Rabit provides checkpointing; I think that is the
most important distinction.

HONG Chuntao
System Research Group
Microsoft Research Asia
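
To illustrate why checkpointing matters for recovery, here is a minimal sketch of a worker that saves its state after every iteration and resumes from the last saved version after a crash. It is a plain Python simulation under stated assumptions: `CheckpointStore`, `train`, and the `fail_at` parameter are hypothetical names, and rabit's actual C++ interface is not reproduced here.

```python
import copy


class CheckpointStore:
    """Hypothetical in-memory stand-in for a fault-tolerant checkpoint service."""

    def __init__(self):
        self.version = 0   # number of checkpoints saved so far
        self.model = None  # most recently saved model state

    def save(self, model):
        self.model = copy.deepcopy(model)
        self.version += 1

    def load(self):
        return self.version, copy.deepcopy(self.model)


def train(store, num_iters, fail_at=None):
    """Run up to num_iters iterations, resuming from the latest checkpoint."""
    version, model = store.load()
    if version == 0:
        model = {"weight": 0.0}  # fresh start: nothing has been saved yet
    for it in range(version, num_iters):
        if fail_at is not None and it == fail_at:
            raise RuntimeError("simulated worker crash")
        model["weight"] += 1.0   # stand-in for one round of real work
        store.save(model)        # checkpoint after each completed iteration
    return model


store = CheckpointStore()
try:
    train(store, num_iters=5, fail_at=3)  # crashes before iteration 3 runs
except RuntimeError:
    pass
resumed = train(store, num_iters=5)       # picks up again at iteration 3
```

The restarted worker repeats none of the completed iterations, which is the property a plain MPI job without checkpointing does not give you.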

@weijianwen
Author

Thank you, @hjk41. I think these nice features should be highlighted in the README and tutorials.

For simplicity, I'll try rabit with the default settings first. I'm closing this issue.

@tqchen
Member

tqchen commented Jul 30, 2015

@weijianwen It would be great if you could open a PR and contribute your understanding to the tutorial. Thanks!

@weijianwen
Author

@tqchen Sure, glad to help. I'll send feedback about how to install the dmlc stack on a moderate-sized cluster. As I wasn't involved in the design process, my feedback will reflect what a library user hopes to know at the very beginning. That would be a good starting point for reorganizing the README, tutorials, and other docs.

One more thing: I would appreciate it if someone could merge my PR in dmlc/wormhole. It fixes typos and adds no features. As ps-lite replaces ps in wormhole's dependencies, I wonder if we should also replace the reference link to ps in "Depending DMLC Libraries".

dmlc/wormhole#18

Best,
