Conversation

@msimberg
Collaborator

@msimberg msimberg commented Apr 3, 2025

Adds a bit more text to the libfabric, OpenMPI, NCCL, and RCCL pages. Still far from complete, but another step forward hopefully.

@msimberg msimberg requested review from RMeli and bcumming as code owners April 3, 2025 11:43
@github-actions

github-actions bot commented Apr 3, 2025

preview available: https://docs.tds.cscs.ch/75

The instructions found on this page may be inaccurate, but are a good starting point to using OpenMPI on Alps.

!!! todo
Deploy experimental uenv.
Collaborator Author

@biddisco, @bcumming I think it'd be good if we could deploy the not fully tested uenv from @biddisco as prgenv-gnu-ompi/25.4:alpha1 or something like that to have something stable to pull from jfrog. What do you think?

Member

Will do - we don't need to do this before we deploy these docs

@msimberg msimberg requested review from Madeeks and biddisco April 3, 2025 11:46
Comment on lines 36 to 37
export OMPI_MCA_pml="^ucx" # (4)
export OMPI_MCA_mtl="ofi" # (5)
Member

In my experience these are optional when using just OFI CXI. I mostly used them when experimenting with LINKx, where I tried to replicate the command line from the LINKx paper as closely as possible (i.e. the one from LANL/ORNL about OpenMPI on Cray EX).

Collaborator Author

So what you're saying is: if OpenMPI is "correctly" built for Alps (only the necessary providers/backends/etc.) then these two should not be needed? If yes, I agree, we can leave this out.

Member

@Madeeks Madeeks Apr 3, 2025

I did not need them to get the numbers I was expecting when tinkering with my OFI+CXI+OMPI container.
My guess is that if everything is set up "correctly", OpenMPI evaluates that OFI is the best layer to use and things go well.
But on second thought, having these vars around would be helpful to "anchor" OMPI into the behavior we want. Otherwise we have little clue if/when something changes and e.g. OMPI chooses different components/backends for whatever reason.
All of this obviously assumes that we recognize what we get with these vars as the "accepted and expected" behavior/performance.
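
A minimal sketch of the "anchoring" idea above: the two variables from the lines under review are exported explicitly and echoed into the job output, so that a later change in OpenMPI's automatic component selection does not go unnoticed. The values are the ones from the reviewed snippet; whether they are needed at all on Alps is the open question in this thread, so treat this as an illustration rather than a tested recommendation.

```shell
# Pin OpenMPI's component selection instead of relying on auto-detection.
# Values are taken from the lines under review, not from a tested recommendation.
export OMPI_MCA_pml="^ucx"  # "^" excludes the listed component: do not use the UCX PML
export OMPI_MCA_mtl="ofi"   # use the OFI (libfabric) MTL

# Record the pinned settings in the job output for later comparison.
env | grep '^OMPI_MCA_' | sort
```

Keeping the `env` line in the job script gives a record of what was requested, which helps when comparing performance across runs or software updates.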

Comment on lines +90 to +92
See [the container engine documentation][ref-ce-aws-ofi-hook] for information on using NCCL in containers.
The [NCCL][ref-communication-nccl] page contains general information on configuring NCCL.
This information is especially important when using uenvs, as the environment variables are not set automatically.
Collaborator Author

NB @bcumming, I'm linking from the GB docs to the updated NCCL docs now.

msimberg and others added 2 commits April 4, 2025 12:46
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>

export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_RX_MATCH_MODE=software
export FI_MR_CACHE_MONITOR=userfaultfd
export MPICH_GPU_SUPPORT_ENABLED=0 # (4)!
Collaborator Author

Suggested change
- export MPICH_GPU_SUPPORT_ENABLED=0 # (4)!
+ export MPICH_GPU_SUPPORT_ENABLED=0 # (4)!
+ export NCCL_SOCKET_IFNAME="hsn"
+ export NCCL_CROSS_NIC="1"
+ export FI_CXI_COMPAT="0"
+ export MPIR_CVAR_CH4_OFI_MULTI_NIC_STRIPING_THRESHOLD=100000000

@boeschf, @teojgo, @fawzi (or anyone else that knows): any comments on NCCL_SOCKET_IFNAME, NCCL_CROSS_NIC, FI_CXI_COMPAT, and MPIR_CVAR_CH4_OFI_MULTI_NIC_STRIPING_THRESHOLD? They are set by enroot in addition to the environment variables that we already have documented here. Can we safely recommend them to users?

Member

MPIR_CVAR_CH4_OFI_MULTI_NIC_STRIPING_THRESHOLD only affects some MPICH-based MPIs; MPICH and MVAPICH respond to it, but beyond that I'm not sure.

Member

For reference, the default vars set by the AWS NCCL plugin hook (which reasonably apply only when NCCL is used) are defined here: https://git.cscs.ch/alps-platforms/vservices/vs-enroot/-/blob/main/enroot-variables.tf?ref_type=heads#L314

Individual vclusters might override these values, of course.

Collaborator Author

> MPIR_CVAR_CH4_OFI_MULTI_NIC_STRIPING_THRESHOLD only affects some MPICH-based MPIs; MPICH and MVAPICH respond to it, but beyond that I'm not sure.

Ah, of course 🤦 makes sense. So definitely not for the NCCL page then. Might still be useful to add to the Cray MPICH page, but I'd like to first understand how universally useful it is.

> For reference, the default vars set by the AWS NCCL plugin hook (which reasonably apply only when NCCL is used) are defined here: https://git.cscs.ch/alps-platforms/vservices/vs-enroot/-/blob/main/enroot-variables.tf?ref_type=heads#L314

Indeed, thanks for the link. That's what I was using as a reference to understand whether we need to list all of them on the NCCL page here in the docs. Based on the docs, FI_CXI_COMPAT and NCCL_SOCKET_IFNAME seem to be more for safety, and should be ok to set. NCCL_CROSS_NIC is less clear. It definitely has an effect on performance, but it's not clear whether that effect is always good or always bad.
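
One tentative way to read the discussion above, sketched as a shell fragment rather than a recommendation: the two variables that look like safety settings are set for NCCL jobs, `NCCL_CROSS_NIC` is left commented out until its performance impact is understood, and the MPICH-only variable is kept out of the NCCL settings. All values are the ones from the suggested change; the grouping itself is an assumption.

```shell
# Settings that appear safe to set for NCCL jobs (values from the suggested change).
export NCCL_SOCKET_IFNAME="hsn"  # restrict NCCL socket traffic to interfaces whose names start with "hsn"
export FI_CXI_COMPAT="0"

# Effect on performance is unclear in this thread, so it is left unset here.
# export NCCL_CROSS_NIC="1"

# Only read by MPICH-based MPIs; belongs with MPICH settings, not NCCL ones.
# export MPIR_CVAR_CH4_OFI_MULTI_NIC_STRIPING_THRESHOLD=100000000
```

Individual vclusters may override these values, so the fragment only shows how the variables could be grouped.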

Member

@bcumming bcumming left a comment

This is a good start on these docs.

It covers a range of topics, so in the interest of avoiding multiple bike sheds, let's merge and then work on the separate topics individually.

@bcumming bcumming merged commit 2f74baf into eth-cscs:main Apr 15, 2025
1 check passed
@msimberg msimberg deleted the expand-communication branch April 15, 2025 14:09
5 participants