
TACC Singularity container in RHEL getting Fatal error in PMPI_Init_thread: Other MPI error, error stack #5137

Open
SomePersonSomeWhereInTheWorld opened this issue Apr 17, 2023 · 8 comments


@SomePersonSomeWhereInTheWorld

Using geodynamics/aspect:latest-tacc with Singularity version 3.7.1 on RHEL 8 and OpenMPI 4.1.5a1:

$ singularity -v run aspect_latest-tacc.sif aspect-release slab_detachment.prm 
VERBOSE: Not forwarding SINGULARITY_TMPDIR environment variable
VERBOSE: Not forwarding SINGULARITY_BINDPATH environment variable
VERBOSE: Setting HOME=/path/to/me
VERBOSE: Setting PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
VERBOSE: Set messagelevel to: 4
VERBOSE: Starter initialization
VERBOSE: Check if we are running as setuid
VERBOSE: Drop root privileges
VERBOSE: Drop root privileges permanently
VERBOSE: Spawn stage 1
VERBOSE: Execute stage 1
VERBOSE: stage 1 exited with status 0
VERBOSE: Get root privileges
VERBOSE: Change filesystem uid to 547289
VERBOSE: Spawn master process
VERBOSE: Create mount namespace
VERBOSE: Entering in mount namespace
VERBOSE: Create mount namespace
VERBOSE: Spawn RPC server
VERBOSE: Execute master process
VERBOSE: Serve RPC requests
VERBOSE: Default mount: /proc:/proc
VERBOSE: Default mount: /sys:/sys
VERBOSE: Default mount: /dev:/dev
VERBOSE: Found 'bind path' = /etc/localtime, /etc/localtime
VERBOSE: Found 'bind path' = /etc/hosts, /etc/hosts
VERBOSE: Default mount: /tmp:/tmp
VERBOSE: Default mount: /var/tmp:/var/tmp
VERBOSE: Default mount: /etc/resolv.conf:/etc/resolv.conf
VERBOSE: Checking for template passwd file: /burg/opt/singularity-3.7/var/singularity/mnt/session/rootfs/etc/passwd
VERBOSE: Creating passwd content
VERBOSE: Creating template passwd file and appending user data: /burg/opt/singularity-3.7/var/singularity/mnt/session/rootfs/etc/passwd
VERBOSE: Default mount: /etc/passwd:/etc/passwd
VERBOSE: Checking for template group file: /path/to/singularity-3.7/var/singularity/mnt/session/rootfs/etc/group
VERBOSE: Creating group content
VERBOSE: Default mount: /etc/group:/etc/group
VERBOSE: /path/to/me found within container
VERBOSE: rpc server exited with status 0
VERBOSE: Execute stage 2
Abort(2664079) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1421): 
MPIDU_bc_table_create(311)...: 

I also took a shot at using the Docker image and pulling it into Singularity:

singularity run aspect.sif aspect-release slab_detachment.prm 
[g241:3607246] OPAL ERROR: Unreachable in file ext3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[g241:3607246] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

That's likely because the host and the container don't have the same version of OpenMPI, am I correct?
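
One quick way to check that guess (a sketch, assuming the image file is named aspect.sif as above) is to compare the MPI version on the host with the one inside the container:

$ mpirun --version                              # MPI version on the host
$ singularity exec aspect.sif mpirun --version  # MPI version inside the container

If the container's MPI is a different implementation or version than the host's, the usual hybrid launch model (host mpirun starting container processes) will generally not work.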

@gassmoeller (Member)

Hi @RobbieTheK, sorry for being slow to respond. Yes, both issues seem to be related to the interaction between the images and the system you are running the images on. In order to say more I would need to know more about the system you are using, but here is some general information about the images:

  • The first error, with geodynamics/aspect:latest-tacc, looks like an MPI error. It looks like you are not running this on one of the TACC systems, is that correct? In that case the specialized MPI inside this container is likely just not working with your system (e.g. it might try to communicate in a way that is not supported by your cluster).
  • The second image, geodynamics/aspect:latest, uses an unmodified OpenMPI that should work on most normal systems; it is not optimized for Infiniband or other high-speed interconnects. The error in the second part of your message suggests you are running this on a cluster that uses SLURM. Are you sure your job script is set up so that you can start a parallel program successfully? Check whether you can run something simple like echo in parallel, e.g. mpirun -np 4 echo hello (see the sketch below). If that does not succeed (i.e. you don't see 4 output lines displaying hello), then your problem is not related to ASPECT or its Docker image, but to how you set up your job on the cluster.
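
A minimal sketch of that check, and of the usual way to launch a Singularity image under MPI (assuming the image pulled from geodynamics/aspect:latest is stored as aspect.sif, as in the report above):

# 1. Verify that plain MPI start-up works on the cluster
$ mpirun -np 4 echo hello
# expected: four lines saying "hello"

# 2. If that works, start one container instance per rank via the host's mpirun
$ mpirun -np 4 singularity run aspect.sif aspect-release slab_detachment.prm

Note that this hybrid host-mpirun/container approach generally requires the MPI inside the container to be compatible with the host MPI.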

@SomePersonSomeWhereInTheWorld (Author)

Re: the first error, correct, this is not a TACC system; I was just trying to see if it would work.

This is a Bright Computing 9.1 cluster running RHEL 8 with Slurm 20, openmpi/gcc/64/4.1.5a1

I'm using an interactive srun job with -c4 -n4 as options.

mpirun -np 4 echo hello
hello
hello
hello
hello

Same error:

singularity run aspect.sif aspect-release slab_detachment.prm
[g225:3682368] OPAL ERROR: Unreachable in file ext3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[g225:3682368] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
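
Since the message points at missing PMI support for direct srun launches, one diagnostic worth trying (a sketch, not a confirmed fix for this image) is to list the PMI plugins the installed Slurm provides and, if pmix or pmi2 shows up, request it explicitly:

$ srun --mpi=list
$ srun --mpi=pmix -n 4 singularity run aspect.sif aspect-release slab_detachment.prm

This only helps if the Open MPI inside the container was built with matching PMIx support, which may not be the case for this image.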

@gassmoeller (Member)

I'm using an interactive srun job with -c4 -n4 as options.

I have not tried using srun directly on this image. Can you instead set up a batch script that you run with sbatch, or run the command interactively on a development node? The message seems to say that MPI needs to be compiled in a specific way to support srun, and our Docker image was not built that way.
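
A minimal sketch of such a batch script, assuming the Docker-based image is stored as aspect.sif and using the module name from the report above (adjust names and resources for the actual cluster):

#!/bin/bash
#SBATCH --job-name=aspect-test
#SBATCH --ntasks=4
#SBATCH --time=00:30:00

# load the host MPI stack (module name taken from the report above)
module load openmpi/gcc/64/4.1.5a1

# start one container instance per MPI rank through the host's mpirun
mpirun -np 4 singularity run aspect.sif aspect-release slab_detachment.prm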

@bangerth (Contributor)

@RobbieTheK Can I assume that using a batch script solved the problem?

@SomePersonSomeWhereInTheWorld (Author)

No, I ended up building deal.II from source with candi instead. I'd be happy to try again with sbatch, but as I mentioned, I already tried with srun to no avail.
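
For reference, the candi route mentioned here is roughly the following (a sketch of the usual candi/ASPECT workflow, not the exact commands used in this case; paths are placeholders):

git clone https://github.com/dealii/candi.git
cd candi
./candi.sh -j 8                      # builds deal.II and its dependencies

# then configure and build ASPECT against the resulting deal.II
cmake -DDEAL_II_DIR=/path/to/deal.II /path/to/aspect/source
make -j 8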

@bangerth (Contributor)

What I meant to ask is whether you found a way to make it work for you?

@bangerth (Contributor)

I don't know that I have anything to offer. I don't know much about Singularity (or containers in general) and I don't work on the TACC machines. I'm also not sure we have the resources as a project to really figure this out.

@gassmoeller @tjhei Do you have anything to offer? Or should we just say "We'd love to provide this, but we can't" and close the issue?
