mvapich2-tce application build fails when slurm is not installed #4455

Closed
Tracked by #4453
garlick opened this issue Aug 1, 2022 · 8 comments

Comments

@garlick
Member

garlick commented Aug 1, 2022

On fluke, where only Flux is installed, trying to build a simple MPI hello world program fails with:


[garlick@fluke108:mpi-test]$ module list

Currently Loaded Modules:
  1) intel-classic-tce/2021.6.0   2) mvapich2-tce/2.3.6   3) StdEnv (S)

  Where:
   S:  Module is Sticky, requires --force to unload or purge

[garlick@fluke108:mpi-test]$ make
mpicc     hello.c   -o hello
ld: warning: libpmi2.so.0, needed by /usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so, not found (try using -rpath or -rpath-link)
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Finalize'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_KVS_Fence'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Abort'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Nameserv_unpublish'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Initialized'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetNodeAttr'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Job_Spawn'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Init'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_KVS_Put'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Nameserv_lookup'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Nameserv_publish'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetJobAttrIntArray'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_PutNodeAttr'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetJobAttr'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Job_GetId'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetNodeAttrIntArray'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_KVS_Get'
make: *** [<builtin>: hello] Error 1
[garlick@fluke108:mpi-test]$ 

Edit: see also https://lc.llnl.gov/jira/browse/TCE-29 (not public)
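
One way to confirm what the linker is complaining about is to list the shared-library dependencies recorded in libmpi.so. The sketch below assumes binutils `readelf` is available; `needed_libs` is a hypothetical helper name, not part of any tool mentioned in this issue.

```shell
# needed_libs: hypothetical helper that reads `readelf -d` output on stdin
# and prints the name of each DT_NEEDED shared library, one per line.
needed_libs() {
  sed -n 's/.*(NEEDED).*\[\([^]]*\)\].*/\1/p'
}
```

For example, `readelf -d /usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so | needed_libs` would be expected to list libpmi2.so.0 if the library was linked against an external PMI2, which would match the warning above.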

@garlick
Member Author

garlick commented Aug 1, 2022

Well this works:

[garlick@fluke108:mpi-test]$ make
mpicc    -c -o hello.o hello.c
mpicc -o hello hello.o -L/usr/lib64/flux -lpmi2

and then when flux runs the executable, it sets LD_LIBRARY_PATH appropriately.
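
The workaround above hard-codes the Flux library directory. A slightly more portable variant, sketched here with a hypothetical `pmi2_ldflags` helper, picks the first candidate directory that actually contains libpmi2.so. The candidate list is an assumption based on the paths mentioned in this issue.

```shell
# pmi2_ldflags: hypothetical helper. Prints "-L<dir> -lpmi2" for the first
# directory argument that contains libpmi2.so; returns 1 if none does.
pmi2_ldflags() {
  for dir in "$@"; do
    if [ -e "$dir/libpmi2.so" ]; then
      printf '%s\n' "-L$dir -lpmi2"
      return 0
    fi
  done
  return 1
}
```

It could be used as `mpicc -o hello hello.o $(pmi2_ldflags /usr/lib64 /usr/lib64/flux)`. If no directory matches, nothing is printed and the link line is left unchanged.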

@garlick changed the title from "cannot build mpi hello world against TCE mvpaich2 on machine with no slurm installed" to "mvapich2-tce build fails when slurm is not installed" on Aug 1, 2022
@garlick changed the title from "mvapich2-tce build fails when slurm is not installed" to "mvapich2-tce application build fails when slurm is not installed" on Aug 1, 2022
@trws
Member

trws commented Aug 4, 2022

How does slurm handle libpmi2, is it just always installed in /usr/lib64? I'm not sure how we usually do this, but my first thought would be to treat it as an alternative, in the update-alternatives sense, and symlink one or the other into place depending on which is set up.

@grondo
Contributor

grondo commented Aug 4, 2022

> How does slurm handle libpmi2, is it just always installed in /usr/lib64?

Yes.

> my first thought would be to treat it as an alternative, in the update-alternatives sense, and symlink one or the other into place depending on which is set up.

Not a bad thought! I was thinking we could package a symlink in an RPM that is optionally installed on flux-only clusters. Sysadmins could also maintain the symlink with ansible, which is closer to the alternatives approach.
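
The symlink idea could look roughly like the following. This is a sketch only: the `link_pmi2` name is hypothetical, and it assumes Flux ships a `libpmi2.so.0` in its library directory (this issue only confirms that `-L/usr/lib64/flux -lpmi2` links).

```shell
# link_pmi2: hypothetical helper. Symlinks libpmi2.so.0 from src_dir into
# dest_dir, but refuses to touch an existing file or symlink (for example
# one already installed there by slurm).
link_pmi2() {
  src_dir=$1
  dest_dir=$2
  lib=libpmi2.so.0   # assumed soname, taken from the linker warning
  if [ -e "$dest_dir/$lib" ] || [ -L "$dest_dir/$lib" ]; then
    echo "$dest_dir/$lib already exists; not replacing it" >&2
    return 1
  fi
  ln -s "$src_dir/$lib" "$dest_dir/$lib"
}
```

On a Flux-only node this might be run as root as `link_pmi2 /usr/lib64/flux /usr/lib64`. An RPM or an ansible task would do the equivalent.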

@garlick
Member Author

garlick commented Aug 4, 2022

I think that, like mpich, mvapich does not need to link directly with this library. In fact, it should have the PMI-1 wire protocol built in, so it should not even need to dlopen any PMI DSO.

In other words, this is an mvapich2 config issue.
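
For context, with the PMI-1 "simple" wire protocol used by MPICH-derived libraries, the launcher hands each process its connection through environment variables rather than a shared library. A rough way to see whether a launcher provides that environment is sketched below; `pmi1_env_ok` is a hypothetical name, and the check assumes the conventional `PMI_FD`, `PMI_RANK`, and `PMI_SIZE` variables.

```shell
# pmi1_env_ok: hypothetical helper. Succeeds only if the three environment
# variables conventionally set for the PMI-1 wire protocol are non-empty.
pmi1_env_ok() {
  [ -n "${PMI_FD:-}" ] && [ -n "${PMI_RANK:-}" ] && [ -n "${PMI_SIZE:-}" ]
}
```

Running a small script that calls this helper as a job task, instead of the MPI program, would show whether the launcher exports these variables.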

@garlick
Member Author

garlick commented Aug 10, 2022

As noted in the jira ticket mentioned above (not public), the following config options result in an mvapich that works both on a flux-only system and on a system with both slurm and flux installed:

module --force purge
./configure \
  --enable-shared \
  --enable-romio \
  --disable-silent-rules \
  --disable-new-dtags \
  --enable-threads=multiple \
  --with-ch3-rank-bits=32 \
  --enable-wrapper-rpath=yes \
  --disable-alloc \
  --enable-fast=all \
  --disable-cuda \
  --enable-registration-cache \
  --with-device=ch3:mrail \
  --with-rdma=gen2 \
  --disable-mcast \
  --with-file-system=lustre+nfs+ufs \
  --enable-llnl-site-specific-options \
  --enable-debuginfo \
  --with-pm=hydra \
  --prefix=/g/g0/garlick/opt/mvapich2-2.3.7-1-hydra
#  --enable-fortran=all \
#  --with-pmi=pmi2 --with-pm=slurm --with-slurm=/usr

@trws
Member

trws commented Aug 10, 2022 via email

@garlick
Member Author

garlick commented Aug 10, 2022

I actually just omitted fortran to save time on the build since I wasn't going to test it. So I don't know whether I would have hit that or not. I'll go ahead and try.

@garlick
Member Author

garlick commented Aug 26, 2022

Adding --with-pm=hydra and not adding --with-pm=slurm resolved this problem.
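
To check which process manager an existing install was configured with, the configure line it reports can be filtered. The sketch assumes the `mpiname -a` utility shipped with MVAPICH2 prints the configure options; `pm_setting` is a hypothetical helper.

```shell
# pm_setting: hypothetical helper. Reads configure options on stdin and
# prints the value of each --with-pm=... option (optionally single-quoted).
# It deliberately does not match --with-pmi=...
pm_setting() {
  tr ' ' '\n' | sed -n "s/^'\{0,1\}--with-pm=\([^']*\).*/\1/p"
}
```

Under those assumptions, `mpiname -a | pm_setting` would be expected to print `hydra` for a build configured as above and `slurm` for one built with `--with-pm=slurm`.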

@garlick garlick closed this as completed Aug 26, 2022