
add patch for OpenMPI 4.1.1 to support building using --with-cuda=internal #15528

Merged

Conversation

bartoldeman
Contributor

Allow building Open MPI with --with-cuda=internal, by providing an
internal minimal cuda.h header file. This eliminates the CUDA
(build) dependency; as long as the runtime CUDA version is 8.0+,
libcuda.so will be dlopen'ed and used successfully.
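As a rough sketch of what this enables at the Open MPI level (the prefix and job size below are just example values; the ompi_info check is the usual way to verify that CUDA awareness was compiled in):

# Build against the internal cuda.h stub, with no CUDA toolkit installed at build time;
# libcuda.so is dlopen'ed at run time if present.
./configure --with-cuda=internal --prefix=$HOME/openmpi-4.1.1-cuda-internal
make -j 16 && make install

# Afterwards, verify that CUDA awareness was compiled in:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value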
@bartoldeman
Contributor Author

See the discussion in #14919

Note that @Micket suggests making --with-cuda=internal the default, but that would need an easyblock change

@bartoldeman
Contributor Author

I can also add a speedup patch, on Tuesday at the latest.

@branfosj added this to the next release (4.5.5?) milestone May 22, 2022
Contributor

@Micket left a comment

LGTM. Should we wait for the performance patch or do that in a separate PR?

jfgrimm added a commit to jfgrimm/easybuild-easyconfigs that referenced this pull request May 23, 2022
@akesandgren
Contributor

That's a seriously large patch (although simple to follow); have you suggested it upstream too?

@Micket
Contributor

Micket commented May 25, 2022

@akesandgren open-mpi/ompi#10364 (I assume the plan is to update the PR with the latest cleaned-up patch?)

@Micket
Contributor

Micket commented May 25, 2022

Test report by @Micket
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
alvis-s1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, Python 3.6.8
See https://gist.github.com/a02966aa45ac98b2bdadec5f006df089 for a full test report.

@bartoldeman
Contributor Author

@akesandgren the upstream patch is only item 1. I wanted some feedback first before committing the rest.
Some CUDA files were moved in 5.x and ROCm support was added, so there are a ton of fairly cosmetic changes in between.

I can of course make this patch smaller (or split it in 3), but it's a trade-off.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bask-pg0309u12a.cluster.baskerville.ac.uk - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 3.6.8
See https://gist.github.com/252c18c3e9aec873e856eddf9bd17bf3 for a full test report.

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen2
CORE_CNT=16
EB_ARGS="--buildpath=/dev/shm/$USER --installpath=/tmp/$USER/pr15528"

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=15528 EB_ARGS="--buildpath=/dev/shm/$USER --installpath=/tmp/$USER/pr15528" /opt/software/slurm/bin/sbatch --job-name test_PR_15528 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 1234

Test results coming soon (I hope)...

- notification for comment with ID 1137859624 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/5597dd9294f51eb4e6e35556735a77f7 for a full test report.

Contributor

@Micket left a comment

lgtm

I ran through all expected OSU tests:

#!/usr/bin/env bash
#SBATCH -n 2 -N 2
#SBATCH --gpus-per-node=A40:4
#SBATCH -t 1:00:00

ml OSU-Micro-Benchmarks/5.7.1-gompi-2021a-CUDA-11.3.1

for t in osu_bibw osu_bw osu_latency osu_mbw_mr osu_multi_lat osu_allgather osu_allgatherv osu_allreduce osu_alltoall osu_alltoallv osu_bcast osu_gather osu_gatherv osu_reduce osu_reduce_scatter osu_scatter osu_scatterv osu_iallgather osu_ialltoall osu_ibcast osu_igather osu_iscatter osu_alltoall osu_allreduce osu_reduce osu_alltoall
do
    echo "Running ${t}"
    mpirun ${t} -d cuda D D
    mpirun ${t} -d cuda H D
    mpirun ${t} -d cuda D H
    mpirun ${t} -d cuda H H
done

and it all worked without errors.

Anyone else wants to have a second check? I'm good with this getting merged.

@branfosj
Member

Anyone else wants to have a second check? I'm good with this getting merged.

I tested with OSU 5.9 (#15343 and #15344). In addition to the above, this also allows running the NCCL-based tests:

for t in osu_nccl_bibw osu_nccl_bw osu_nccl_latency osu_nccl_allgather osu_nccl_allreduce osu_nccl_bcast osu_nccl_reduce osu_nccl_reduce_scatter osu_nccl_reduce osu_nccl_allreduce
do
  echo ${t}
  mpirun -np 2 ${t} -d cuda D D
done

These were all fine. So LGTM :)

@branfosj
Member

Going in, thanks @bartoldeman!

@branfosj merged commit 99223bf into easybuilders:develop May 26, 2022
@boegel changed the title from "OpenMPI 4.1.1: patch and build --with-cuda=internal" to "add patch for OpenMPI 4.1.1 to support building using --with-cuda=internal" May 26, 2022
@boegel mentioned this pull request May 27, 2022
@migueldiascosta
Member

I was running a benchmark on a CPU node with foss/2022a and getting bad performance, and noticed that Open MPI was using smcuda as the shared-memory BTL. Forcing it to use vader improved performance significantly, by at least 30% (this was on a 2x96-core Genoa node, so right now it may be an extreme case of shared-memory communication, but that's the trend...).

I think I'll just set OMPI_MCA_btl=^smcuda on non-GPU nodes, but I just wanted to mention here that our builds may be trading off too much performance for generality?
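As a concrete sketch of that workaround (standard Open MPI MCA selection syntax; the benchmark binary name below is just a placeholder):

# Exclude the smcuda BTL so vader/sm handles shared-memory traffic instead:
export OMPI_MCA_btl=^smcuda

# or equivalently, per run:
mpirun --mca btl ^smcuda ./my_benchmark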

@Micket
Contributor

Micket commented May 3, 2023

Did this CPU node have the CUDA runtime installed? Otherwise I don't think smcuda would have even been enabled at all.

@migueldiascosta
Member

migueldiascosta commented May 3, 2023

Not that I can tell - this was on a benchmarking system provided by a vendor, running Ubuntu, and the only NVIDIA-related thing I can find is /usr/bin/nvidia-detector from ubuntu-drivers-common.

I just tried it on our own (CentOS) nodes and indeed smcuda is not used, so I'm not sure what is triggering it on the benchmarking system... actually, I had forgotten that I had already set OMPI_MCA_btl=^smcuda; without it I still see smcuda being used.

@boegel
Member

boegel commented May 3, 2023

If this should be discussed further, let's open an issue on it; discussing things in a merged PR is difficult to keep track of...

@migueldiascosta
Member

done, #17854

@migueldiascosta
Member

Did this CPU node have the CUDA runtime installed? Otherwise I don't think smcuda would have even been enabled at all.

@Micket if I understand https://github.com/open-mpi/ompi/blob/9216ad4c49a16b3134d0dc47d8aa623f569d45ae/opal/mca/btl/smcuda/README.md correctly, smcuda was implemented as a copy of the old sm BTL; without the CUDA runtime it obviously doesn't find any device or do anything CUDA-related, but it still uses the old sm code instead of vader. The comment about smcuda having a higher priority than sm probably also applies to vader, which is why it's being selected(?)
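For anyone checking their own builds, a rough sketch of how to see which BTLs are present and inspect their selection parameters (exact parameter names may differ between Open MPI versions, so treat the grep patterns as assumptions to verify):

# List the BTL components compiled into this Open MPI installation:
ompi_info | grep "MCA btl"

# Dump the MCA parameters of the shared-memory candidates and look for
# priority/exclusivity values used during selection:
ompi_info --param btl smcuda --level 9 | grep -Ei "priority|exclusivity"
ompi_info --param btl vader --level 9 | grep -Ei "priority|exclusivity"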
