TOSS 4 non-TCE openmpi: Failed to open drm root directory /sys/class/drm.: No such file or directory #5743

Closed
garlick opened this issue Feb 15, 2024 · 3 comments

garlick commented Feb 15, 2024

Problem: a simple MPI hello world compiled with the TOSS 4 (non-TCE) openmpi 4.1 build prints this annoying but seemingly harmless message:

$ module purge
$ module use /opt/toss/modules/modulefiles
$ module load openmpi-gnu

$ flux run  ./hello
Failed to open drm root directory /sys/class/drm.: No such file or directory
fdPyPwZT2RV: completed MPI_Init in 1.112s.  There are 1 tasks
fdPyPwZT2RV: completed first barrier in 0.000s
fdPyPwZT2RV: completed MPI_Finalize in 0.002s

OpenMPI calls hwloc, and hwloc loads a plugin that calls rsmi, which prints to stderr.

Solution: The HWLOC_COMPONENTS environment variable controls which hwloc components are loaded. Prefixing a component name with "-" excludes it:

$ flux run --env=HWLOC_COMPONENTS=-rsmi ./hello
fdPyTZiwDSf: completed MPI_Init in 1.657s.  There are 1 tasks
fdPyTZiwDSf: completed first barrier in 0.000s
fdPyTZiwDSf: completed MPI_Finalize in 0.002s

Other potentially useful runes for that MPI build are:

# Avoid broken openib btl (use tcp/shmem)
--env=OMPI_MCA_btl=^openib

# In case UCX is used - avoid deadlock in MPI_Init()
-opmi=pmix

# This build is compiled for Slurm, so make sure it finds our `libpmi2.so` before theirs (not needed in flux-core 0.59.0 and beyond)
--env=LD_LIBRARY_PATH=$(dirname $(flux config builtin pmi_library_path)):$LD_LIBRARY_PATH
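
For reuse in scripts, the options above could be collected in a small helper. This is only a sketch: the function name `flux_ompi_opts` is hypothetical (not part of flux-core), and which options are actually needed depends on the flux-core version and the MPI build.

```shell
# Hypothetical helper (not part of flux-core): print the flux-run options
# discussed above, one per line.
flux_ompi_opts() {
    # Keep hwloc from loading the rsmi plugin that prints the drm message
    printf '%s\n' '--env=HWLOC_COMPONENTS=-rsmi'
    # Avoid the broken openib btl (use tcp/shmem)
    printf '%s\n' '--env=OMPI_MCA_btl=^openib'
    # In case UCX is used, avoid the deadlock in MPI_Init()
    printf '%s\n' '-opmi=pmix'
}

# Usage (needs a running Flux instance and the ./hello binary):
#   flux run $(flux_ompi_opts) ./hello
```

The LD_LIBRARY_PATH setting is left out here since it is not needed in flux-core 0.59.0 and beyond.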

garlick commented Feb 15, 2024

This is resolved. I just wanted to get it into the issue database rather than have it disappear in the annals of Mattermost.

garlick closed this as completed Feb 15, 2024

grondo commented Feb 15, 2024

Using the new -o hwloc.xmlfile shell option would resolve the rsmi errors, since Flux doesn't use hwloc_topology_load(3) for jobs (it fetches XML from the enclosing instance so that the topology is not re-discovered unnecessarily). It would also likely make MPI_Init much faster. (I guess the same effect could be had with -o pmi=pmix with recent flux-pmix as well.)

garlick commented Feb 15, 2024

Great points!
