Improving support for systems using Omni-Path interconnect #4560

Open
jfgrimm opened this issue Jun 12, 2024 · 0 comments

jfgrimm commented Jun 12, 2024

Currently, the default way we build OpenMPI, MPICH, etc. works well on InfiniBand systems but performs very poorly on Omni-Path (in benchmarks on our system, bandwidth and latency were between 2x and 10x worse).
It would be good to find a way to improve Omni-Path support in EasyBuild (perhaps through a configuration option?); at a minimum, we should improve the documentation.

relevant PRs to date:

  • [#20501] PSM2 dependency added to recent libfabric easyconfigs
  • [#20585] PSM2 dependency made conditional on having x86_64
  • [#20794] previous changes effectively undone: the PSM2 dependency was commented out again because PSM2 has CUDA as a build dependency
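
To make the intent of these PRs concrete, here is a rough sketch of the decision they encode: only pull in PSM2 on x86_64, and allow it to be switched off entirely. This is plain Python for illustration, not the actual easyconfig syntax used in those PRs; the function name `libfabric_dependencies`, the `want_psm2` flag and the version string are all made up for this example.

```python
import platform

# Hypothetical version string, for illustration only.
PSM2_VERSION = "12.0.1"


def libfabric_dependencies(machine=None, want_psm2=True):
    """Return the extra dependencies libfabric would get on this machine.

    PSM2 is only added on x86_64, and only if it was requested.
    """
    # Fall back to the architecture of the current host.
    machine = platform.machine() if machine is None else machine
    deps = []
    if want_psm2 and machine == "x86_64":
        deps.append(("PSM2", PSM2_VERSION))
    return deps
```

With `want_psm2=False` this reproduces the current state after #20794 (no PSM2 dependency anywhere); with the default it reproduces the state after #20585.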

further info/ideas:

  • Omni-Path systems should use either PSM2 or opx
    • PSM2 can be either stand-alone, or via libfabric
    • opx is a libfabric provider; drop-in replacement for PSM2
    • Cornelis' plan is to move away from PSM2 (the upcoming 400G adapters will only support opx)
    • no benefit (only additional overhead) from using UCX with Omni-Path
  • Cornelis' documentation currently recommends using PSM2:
    • For best performance, Cornelis recommends that you use the PSM2, the high performance
      interface to the OPX Fabric. This is accomplished using the Open Fabrics Interface (OFI) MPI
      fabric setting -genv I_MPI_FABRICS=ofi and ensure that FI_PROVIDER=psm2.

    • source: Cornelis_OPX_Performance_Tuning_UG_H93143_v25_0.pdf (March 2024)
  • [#20794 comment] suggestion by @bartoldeman to patch PSM2 in order to drop the CUDA build dependency
    • matches current approach for OpenMPI, by including some minimal CUDA prototypes (since PSM2 will dlopen('libcuda.so.1') at runtime)
  • [#20794 comment] suggestion by @boegel to perhaps implement "opt-in hooks that are part of EasyBuild framework, which can be enabled selectively?"
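
As a starting point for discussion, the opt-in hook idea could look roughly like the sketch below. It uses the existing `parse_hook` entry point from EasyBuild's hooks mechanism, but everything else is an assumption: the environment variable `SITE_EB_OMNIPATH` is a made-up opt-in switch (not an existing EasyBuild setting), the PSM2 version is a placeholder, and the code assumes dependencies can still be handled as plain tuples at the point where the hook runs, which would need checking against the framework.

```python
import os

# Hypothetical opt-in switch; EasyBuild does not define this variable.
OPT_IN_ENV_VAR = "SITE_EB_OMNIPATH"

# Hypothetical dependency spec; the version is a placeholder.
PSM2_DEP = ("PSM2", "12.0.1")


def omnipath_opted_in(environ=None):
    """Return True when the site has explicitly enabled Omni-Path tweaks."""
    environ = os.environ if environ is None else environ
    return environ.get(OPT_IN_ENV_VAR, "").strip().lower() in ("1", "true", "yes")


def parse_hook(ec, *args, **kwargs):
    """Add PSM2 to libfabric's dependencies, but only when opted in."""
    if ec.name != "libfabric" or not omnipath_opted_in():
        return
    deps = list(ec["dependencies"])
    # Assumption: dependencies are still plain tuples when this hook runs.
    if not any(dep[0] == "PSM2" for dep in deps):
        deps.append(PSM2_DEP)
        ec["dependencies"] = deps
```

Sites without Omni-Path would see no change, since the hook returns early unless the switch is set; the CUDA build dependency of PSM2 would still need to be dealt with separately.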