{mpi}[GCC/13.2.0] OpenMPI v5.0.3, PMIx v5.0.2 #17561

Status: Open, 10 commits to be merged into base: develop

boegel
Member

@boegel boegel commented Mar 18, 2023

(created using eb --new-pr)

WIP since we're using release candidates here, not final releases.

I had to strip out the CUDA-related patches we are using for OpenMPI 4.1.5 to get the build working; we'll need to figure out how to move forward there (cc @Micket, @bartoldeman).

@boegel boegel added the update label Mar 18, 2023
@boegel boegel marked this pull request as draft March 18, 2023 11:51
@boegel boegel added this to the release after 4.7.1 milestone Mar 18, 2023
@Micket
Contributor

Micket commented Mar 18, 2023

I don't think there is really anything new to do with regard to CUDA. We can just continue to patch in support for the internal header.

@shahzebsiddiqui
Contributor

Is this PR going to be merged soon? I would be interested in using this version of OpenMPI.

@boegel boegel changed the title {mpi}[GCC/12.2.0] OpenMPI v5.0.0rc10, PMIx v5.0.0rc1 {mpi}[GCC/13.2.0] OpenMPI v5.0.1, PMIx v5.0.1 Jan 22, 2024
@boegel boegel marked this pull request as ready for review January 22, 2024 07:46
@SebastianAchilles
Member

My remaining question here is whether we want to add the CUDA-related patches first, or merge this PR as is and add them in a follow-up PR.

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen3

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3446

Test results coming soon (I hope)...

- notification for comment with ID 1904058315 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/3d70d3547c216b3078c551c4d30c96b1 for a full test report.

@bartoldeman
Contributor

I can have a look this week to see how hard it is to port over the internal CUDA patches...

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3447

Test results coming soon (I hope)...

- notification for comment with ID 1904168663 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).
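
For readers comparing the two test requests above: the first ran with `--ntasks=8`, while the one that added `CORE_CNT=16` ran with `--ntasks="16"`. The following is only a rough sketch of how such a comment could be parsed; it is not the bot's actual code. The function names `parse_test_request` and `ntasks_for`, and the default of 8, are assumptions made for illustration.

```python
import re

# Assumption: 8 is the value seen in the jsc-zen3 commands when CORE_CNT is absent.
DEFAULT_NTASKS = 8


def parse_test_request(comment):
    """Split a '@boegelbot please test @ <target>' comment into target and KEY=VALUE options."""
    match = re.search(r"please test @ (\S+)", comment)
    if match is None:
        return None
    # Options are expected on their own lines, e.g. "CORE_CNT=16".
    options = dict(re.findall(r"^(\w+)=(\S+)$", comment, flags=re.MULTILINE))
    return {"target": match.group(1), "options": options}


def ntasks_for(request):
    """Pick the task count: CORE_CNT if given, otherwise the assumed default."""
    return int(request["options"].get("CORE_CNT", DEFAULT_NTASKS))


request = parse_test_request("@boegelbot please test @ jsc-zen3\nCORE_CNT=16")
print(request["target"], ntasks_for(request))  # jsc-zen3 16
```
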

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/46572ea7ce6477a7eeb12017f74d3963 for a full test report.

This patch has changed since libcuda is no longer dlopen()'ed by Open
MPI. Instead we can generate a stub library, and at runtime the
CUDA-dependent DSOs (but not the main libmpi.so library) load
libcuda.so. This is then consistent with
https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html
(although --enable-mca-dso=<comma-delimited-list-of-cuda-components> is
already done by default).
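
One way to sanity-check the behaviour described in this commit message is to look at the `NEEDED` entries that `readelf -d` reports for the main library and for a CUDA-dependent component. The sketch below only parses such output. The helper names and the sample text are hypothetical and not taken from a real build.

```python
import re
from typing import List

# Matches the NEEDED entries printed by `readelf -d <library>`.
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[([^\]]+)\]")


def needed_libraries(readelf_output: str) -> List[str]:
    """Return the shared libraries listed as NEEDED in `readelf -d` output."""
    return NEEDED_RE.findall(readelf_output)


def links_libcuda(readelf_output: str) -> bool:
    """True if any NEEDED entry is libcuda.so (with or without a version suffix)."""
    return any(lib.startswith("libcuda.so") for lib in needed_libraries(readelf_output))


# Hypothetical sample output, for illustration only (not from a real installation).
SAMPLE_LIBMPI = """
 0x0000000000000001 (NEEDED)             Shared library: [libopen-pal.so.80]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
"""
SAMPLE_CUDA_DSO = """
 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
"""

print(links_libcuda(SAMPLE_LIBMPI))    # main library sample: no libcuda dependency
print(links_libcuda(SAMPLE_CUDA_DSO))  # CUDA-dependent component sample: depends on libcuda
```

In practice the input would come from running `readelf -d` on the installed files, for example via `subprocess.run`.
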
@boegel
Member Author

boegel commented Feb 15, 2024

@boegelbot please test @ jsc-zen3

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3621

Test results coming soon (I hope)...

- notification for comment with ID 1946547876 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/d523bd60f2048da761357a3a8e2188ce for a full test report.

@bartoldeman
Contributor

I also had a look at OpenMPI-4.1.1_opal-datatype-cuda-performance.patch but there is no more conditional CUDA compilation in opal/datatype, so it's obsolete (there doesn't seem to be any performance penalty anymore from enabling CUDA).

Comment on lines 37 to 39
# disable MPI1 compatibility for now, see what breaks...
# configopts += '--enable-mpi1-compatibility '

Member

Suggested change
# disable MPI1 compatibility for now, see what breaks...
# configopts += '--enable-mpi1-compatibility '

This is commented out in all easyconfigs in EB5. I suggest we drop the comment for OpenMPI 5.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0207u28a.bear.cluster - Linux RHEL 8.6, x86_64, AMD EPYC 9554 64-Core Processor (zen4), Python 3.6.8
See https://gist.github.com/branfosj/f57cf432627d1e2cc7c88724990caa00 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0105u03a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/f3fd6149e6fbd3f51637870485cf8ac7 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0207u20a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8480CL (sapphirerapids), Python 3.6.8
See https://gist.github.com/branfosj/39fd24d948a99d6e47b6715eda45b2a7 for a full test report.

@bedroge
Contributor

bedroge commented Feb 20, 2024

@boegel I guess we should also include the smcuda patch from open-mpi/ompi#12338 here, like I've done for 4.1.x in #19940?

Edit: done in boegel#94

@casparvl
Contributor

@bartoldeman how did you decide which functions to add stubs for in your patch? I guess those may need to be updated for newer OpenMPI versions, which might call additional functions?

@bartoldeman
Contributor

@casparvl this is just the result of grepping for them; also, if you compile and a function is not in the header file, you get an error message. Indeed, newer Open MPI versions may use more or different CUDA functions, which would necessitate changing the header file.
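
The grepping approach could be sketched roughly as follows, assuming that CUDA driver API functions can be recognised by the `cu<CapitalizedName>(` naming pattern. The snippets and helper names are hypothetical; they are not the actual Open MPI sources or the internal header from the patch.

```python
import re

# CUDA driver API functions are named cu<CapitalizedName>, e.g. cuMemcpy or cuInit.
# Only names followed by an opening parenthesis (calls or declarations) are matched.
CU_CALL_RE = re.compile(r"\b(cu[A-Z][A-Za-z0-9_]*)\s*\(")


def cuda_calls(text):
    """Return the set of CUDA driver function names that appear in the text."""
    return set(CU_CALL_RE.findall(text))


def missing_from_header(source_text, header_text):
    """Return functions used in the source but not declared in the header, sorted."""
    return sorted(cuda_calls(source_text) - cuda_calls(header_text))


# Hypothetical snippets for illustration only.
SOURCE = """
    CUresult rc = cuMemcpy(dst, src, n);
    rc = cuStreamSynchronize(stream);
    rc = cuPointerGetAttribute(&mem_type, attr, ptr);
"""
HEADER = """
CUresult cuMemcpy(CUdeviceptr dst, CUdeviceptr src, size_t n);
CUresult cuStreamSynchronize(CUstream stream);
"""

print(missing_from_header(SOURCE, HEADER))  # ['cuPointerGetAttribute']
```

A non-empty result would point at functions that need to be added to the header when moving to a newer Open MPI version.
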

@boegel
Member Author

boegel commented Mar 13, 2024

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13097

Test results coming soon (I hope)...

- notification for comment with ID 1995061997 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
cns2 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/2cc1f31f080f59f48c555d04a9e4e82b for a full test report.

@bartoldeman
Contributor

@boegelbot please test @ jsc-zen3

@bartoldeman
Contributor

ofi with psm3 caused issues on Generoso. I think we've seen something like this before...

@boegelbot
Collaborator

@bartoldeman: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3770

Test results coming soon (I hope)...

- notification for comment with ID 1996249416 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/10a5496aa4c3eae19a86d230299fb273 for a full test report.

@boegel boegel modified the milestones: 4.9.1, release after 4.9.1 Apr 3, 2024
`OpenMPI-5.0.x_add_atomic_wmb.patch` is obsolete now
@bartoldeman
Contributor

@boegel another bump: boegel#95

Bump OpenMPI to 5.0.3, PMIx to 5.0.2
@bartoldeman bartoldeman changed the title {mpi}[GCC/13.2.0] OpenMPI v5.0.2, PMIx v5.0.1 {mpi}[GCC/13.2.0] OpenMPI v5.0.3, PMIx v5.0.2 Apr 30, 2024
@bartoldeman
Contributor

@boegelbot please test @ jsc-zen3

@boegelbot
Collaborator

@bartoldeman: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4051

Test results coming soon (I hope)...

- notification for comment with ID 2086458131 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/3b4a761af1920f6cb25fb3a9c8eb72be for a full test report.

9 participants