
add patch for OpenMPI 4.1.1 to support building using --with-cuda=internal #15528

Merged

Conversation

bartoldeman
Contributor

Allow building Open MPI with --with-cuda=internal, by providing an
internal minimal cuda.h header file. This eliminates the CUDA
(build) dependency; as long as the runtime CUDA version is 8.0+,
libcuda.so will be dlopen'ed and used successfully.
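As a rough sketch of what this enables at the Open MPI level (the prefix and job size below are just example values; the ompi_info check is the usual way to verify that CUDA awareness was compiled in):

# Build against the internal cuda.h stub, with no CUDA toolkit installed at build time;
# libcuda.so is dlopen'ed at run time if present.
./configure --with-cuda=internal --prefix=$HOME/openmpi-4.1.1-cuda-internal
make -j 16 && make install

# Afterwards, verify that CUDA awareness was compiled in:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value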
@bartoldeman
Contributor Author

See the discussion in #14919

Note that @Micket suggests making --with-cuda=internal the default, but that would need an easyblock change

@bartoldeman
Contributor Author

I can also add a speedup patch, on Tuesday at the latest.

@branfosj added this to the next release (4.5.5?) milestone May 22, 2022
Contributor

@Micket left a comment

LGTM. Should we wait for the performance patch or do that in a separate PR?

jfgrimm added a commit to jfgrimm/easybuild-easyconfigs that referenced this pull request May 23, 2022
@akesandgren
Contributor

That's a seriously large patch (although simple to follow); have you suggested it upstream too?

@Micket
Contributor

Micket commented May 25, 2022

@akesandgren open-mpi/ompi#10364 (I assume the plan is to update the PR with the latest cleaned-up patch?)

@Micket
Contributor

Micket commented May 25, 2022

Test report by @Micket
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
alvis-s1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, Python 3.6.8
See https://gist.github.com/a02966aa45ac98b2bdadec5f006df089 for a full test report.

@bartoldeman
Contributor Author

@akesandgren the upstream patch is only item 1. I wanted some feedback first before committing the rest.
Some CUDA files were moved in 5.x and ROCm support was added, so there are a ton of fairly cosmetic changes in between.

I can of course make this patch smaller (or split it in 3), but it's a trade-off.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bask-pg0309u12a.cluster.baskerville.ac.uk - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 3.6.8
See https://gist.github.com/252c18c3e9aec873e856eddf9bd17bf3 for a full test report.

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen2
CORE_CNT=16
EB_ARGS="--buildpath=/dev/shm/$USER --installpath=/tmp/$USER/pr15528"

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=15528 EB_ARGS="--buildpath=/dev/shm/$USER --installpath=/tmp/$USER/pr15528" /opt/software/slurm/bin/sbatch --job-name test_PR_15528 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 1234

Test results coming soon (I hope)...

- notification for comment with ID 1137859624 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/5597dd9294f51eb4e6e35556735a77f7 for a full test report.

Contributor

@Micket left a comment

lgtm

I ran through all expected OSU tests:

#!/usr/bin/env bash
#SBATCH -n 2 -N 2
#SBATCH --gpus-per-node=A40:4
#SBATCH -t 1:00:00

ml OSU-Micro-Benchmarks/5.7.1-gompi-2021a-CUDA-11.3.1

for t in osu_bibw osu_bw osu_latency osu_mbw_mr osu_multi_lat osu_allgather osu_allgatherv osu_allreduce osu_alltoall osu_alltoallv osu_bcast osu_gather osu_gatherv osu_reduce osu_reduce_scatter osu_scatter osu_scatterv osu_iallgather osu_ialltoall osu_ibcast osu_igather osu_iscatter osu_alltoall osu_allreduce osu_reduce osu_alltoall
do
    echo "Running ${t}"
    mpirun ${t} -d cuda D D
    mpirun ${t} -d cuda H D
    mpirun ${t} -d cuda D H
    mpirun ${t} -d cuda H H
done

and it all worked without errors.

Anyone else wants to have a second check? I'm good with this getting merged.

@branfosj
Member

Anyone else wants to have a second check? I'm good with this getting merged.

I tested with OSU 5.9 (#15343 and #15344). In addition to the above, this also allows running the NCCL-based tests:

for t in osu_nccl_bibw osu_nccl_bw osu_nccl_latency osu_nccl_allgather osu_nccl_allreduce osu_nccl_bcast osu_nccl_reduce osu_nccl_reduce_scatter osu_nccl_reduce osu_nccl_allreduce
do
  echo ${t}
  mpirun -np 2 ${t} -d cuda D D
done

These were all fine. So LGTM :)

@branfosj
Member

Going in, thanks @bartoldeman!

@branfosj merged commit 99223bf into easybuilders:develop May 26, 2022
@boegel changed the title from "OpenMPI 4.1.1: patch and build --with-cuda=internal" to "add patch for OpenMPI 4.1.1 to support building using --with-cuda=internal" May 26, 2022
@boegel mentioned this pull request May 27, 2022
@migueldiascosta
Member

I was running a benchmark on a CPU node with foss/2022a and getting bad performance, and noticed that Open MPI was using smcuda as the shared-memory BTL. Forcing it to use vader improved performance significantly, by at least 30% (this was on a 2x96-core Genoa node, so right now it may be an extreme case of shared-memory communication, but that's the trend...).

I think I'll just set OMPI_MCA_btl=^smcuda on non-GPU nodes, but I just wanted to mention here that our builds may be trading off too much performance for generality?
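As a concrete sketch of that workaround (standard Open MPI MCA selection syntax; the benchmark binary name below is just a placeholder):

# Exclude the smcuda BTL so vader/sm handles shared-memory traffic instead:
export OMPI_MCA_btl=^smcuda

# or equivalently, per run:
mpirun --mca btl ^smcuda ./my_benchmark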

@Micket
Contributor

Micket commented May 3, 2023

Did this CPU node have the CUDA runtime installed? Otherwise I don't think smcuda would have even been enabled at all.

@migueldiascosta
Member

migueldiascosta commented May 3, 2023

Not that I can tell - this was on a benchmarking system provided by a vendor, running Ubuntu, and the only NVIDIA-related thing I can find is /usr/bin/nvidia-detector from ubuntu-drivers-common.

I just tried it on our own (CentOS) nodes and indeed smcuda is not used, so I'm not sure what is triggering it on the benchmarking system... actually, I had forgotten that I had already set OMPI_MCA_btl=^smcuda; without it I still see smcuda being used.

@boegel
Member

boegel commented May 3, 2023

If this should be discussed further, let's open an issue on it; discussing things in a merged PR is difficult to keep track of...

@migueldiascosta
Member

done, #17854

@migueldiascosta
Member

Did this CPU node have the CUDA runtime installed? Otherwise I don't think smcuda would have even been enabled at all.

@Micket if I understand https://github.com/open-mpi/ompi/blob/9216ad4c49a16b3134d0dc47d8aa623f569d45ae/opal/mca/btl/smcuda/README.md correctly, smcuda was implemented as a copy of the old sm BTL; without the CUDA runtime it obviously doesn't find any device or do anything CUDA-related, but it still uses the old sm code instead of vader. The comment about smcuda having a higher priority than sm probably also applies to vader, which is why it's being selected(?)
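For anyone checking their own builds, a rough sketch of how to see which BTLs are present and inspect their selection parameters (exact parameter names may differ between Open MPI versions, so treat the grep patterns as assumptions to verify):

# List the BTL components compiled into this Open MPI installation:
ompi_info | grep "MCA btl"

# Dump the MCA parameters of the shared-memory candidates and look for
# priority/exclusivity values used during selection:
ompi_info --param btl smcuda --level 9 | grep -Ei "priority|exclusivity"
ompi_info --param btl vader --level 9 | grep -Ei "priority|exclusivity"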
