
Problems with using a CVMFS MPI application inside a Singularity container #2606

Closed
ocaisa opened this issue Sep 18, 2020 · 17 comments

@ocaisa

ocaisa commented Sep 18, 2020

In EESSI/filesystem-layer#37 I've created a script for an HPC system to create an alien cache in a shared space that can be used by execution nodes that are not connected to the internet (all done inside a Singularity container).

Everything works fine until I try to execute an application in parallel; the problems arise when I run an application using MPI (see EESSI/filesystem-layer#38 for details). The basic error is

Failed to initialize loader socket

which leads to a Fatal error. Is there anything I can do to get around this? I suspect it might be due to this envvar

export SINGULARITY_BIND="$SINGULARITY_CVMFS_RUN:/var/run/cvmfs,$SINGULARITY_CVMFS_LIB:/var/lib/cvmfs,$SINGULARITY_CVMFS_ALIEN:/alien,$SINGULARITY_HOMEDIR/default.local:/etc/cvmfs/default.local"

and some kind of race condition (it did work for 2 MPI processes, worked sometimes for 4 and never for 6).

@jblomer
Member

jblomer commented Sep 21, 2020

I think you're spot on. Both the SINGULARITY_CVMFS_RUN and the SINGULARITY_CVMFS_LIB directories should be local to the compute node because they store UNIX domain sockets and pipes. Only the alien cache directory should be on the shared space.
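As a minimal sketch (the node-local paths below are only illustrative), that would mean pointing the first two binds at node-local directories and keeping only the alien cache on shared storage:

# node-local directories for the client's runtime state (illustrative paths)
export SINGULARITY_CVMFS_RUN=/tmp/$USER/cvmfs_run
export SINGULARITY_CVMFS_LIB=/tmp/$USER/cvmfs_lib
mkdir -p "$SINGULARITY_CVMFS_RUN" "$SINGULARITY_CVMFS_LIB"
# only $SINGULARITY_CVMFS_ALIEN stays on the shared filesystem
export SINGULARITY_BIND="$SINGULARITY_CVMFS_RUN:/var/run/cvmfs,$SINGULARITY_CVMFS_LIB:/var/lib/cvmfs,$SINGULARITY_CVMFS_ALIEN:/alien,$SINGULARITY_HOMEDIR/default.local:/etc/cvmfs/default.local"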

In order to make the separation between the local space (/var/lib/cvmfs) and the shared space (the alien cache directory) clearer, I'd suggest using the newer configuration syntax in /etc/cvmfs/default.local, as described in the advanced cache configuration section:

CVMFS_WORKSPACE=/local/path
CVMFS_CACHE_PRIMARY=alien
CVMFS_CACHE_alien_TYPE=posix
CVMFS_CACHE_alien_SHARED=no  # confusing, but shared in this case means "shared quota management", which is not available for the alien cache
CVMFS_CACHE_alien_QUOTA_LIMIT=-1
CVMFS_CACHE_alien_ALIEN=/shared/storage

@ocaisa
Author

ocaisa commented Sep 21, 2020

I tried that but unfortunately it doesn't help. I think that when I use something like

srun --time=00:05:00 --nodes=1 --ntasks-per-node=4 singularity exec --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 1000 -g logfile

the issue is that, rather than mounting CVMFS once on a node, I'm telling it to do 4 different mounts, one for each MPI task on the node. Each of these mounts shares the var and lib directories because I don't (currently) have a way to tell them to use unique ones. I'll see if I can figure out a way around that.

@jblomer
Member

jblomer commented Sep 21, 2020

Ah, I see... yes, they need to be local and unique per mount. The cvmfs configuration files can make (limited) use of bash scripting, so in the default.local file you can specify subdirectories, e.g. per PID, per MPI rank, or based on other useful environment variables. (Such subdirectories are not automatically removed, though.)
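For instance, a sketch of what I mean in default.local (assuming the file is parsed by a shell so that parameter expansion works; the variable to key on and the path are just an example, and you may need to create the subdirectory up front):

# per-mount local workspace: use the SLURM task ID if available, otherwise the PID of the parsing shell
CVMFS_WORKSPACE=/var/lib/cvmfs/${SLURM_PROCID:-$$}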

@ocaisa
Author

ocaisa commented Sep 21, 2020

Ok, great, that's what I was hoping. If I can use $$ in there I should be able to solve this.

@jblomer
Member

jblomer commented Sep 21, 2020

That should work in practice. Note, though, that you'd be taking the PID of the temporary shell process that parses the config file, not of the cvmfs2 fuse module. So there is a slight chance that a PID gets reused, although that is probably quite a theoretical possibility.

@ocaisa
Author

ocaisa commented Sep 22, 2020

Ok, I got a little further. The idea of using $$ does what it is supposed to (for a job in SLURM I actually use $SLURM_PROCID, which is unique per MPI task on a node, so it should work in theory and in practice). This seems to be working fine for /var/lib/cvmfs (I see the files being created in the correct subdirectories), but I still run into problems when trying to run in MPI mode (although I did get a little further than before, it is also not reproducible). Do I need a unique /var/run/cvmfs per mount as well? Is that possible somehow?

@ocaisa
Author

ocaisa commented Sep 22, 2020

I also tried using the --scratch option to Singularity for /var/run/cvmfs, but that doesn't seem to make a difference.

I should say that I am not actually getting an error any more; the process just seems to hang:

[ocais1@juwels01 test]$ SLURM_MPI_TYPE=pspmix OMP_NUM_THREADS=6 srun -p devel --time=00:05:00 --nodes=1 --ntasks-per-node=4 --cpus-per-task=6 singularity exec --scratch /var/run/cvmfs --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 1000 -g logfile
srun: job 2637874 queued and waiting for resources
srun: job 2637874 has been allocated resources
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.

(note that CernVM-FS: loading Fuse module... done appears 6 times instead of 8)

@ocaisa
Author

ocaisa commented Sep 22, 2020

This is solved by using the --scratch option to Singularity for both /var/run/cvmfs and /var/lib/cvmfs:

[ocais1@juwels01 test]$ SLURM_MPI_TYPE=pspmix OMP_NUM_THREADS=2 srun -p devel --time=00:05:00 --nodes=1 --ntasks-per-node=24 --cpus-per-task=2 singularity exec --scratch /var/lib/cvmfs --scratch /var/run/cvmfs --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 1000 -g logfile

ocaisa closed this as completed Sep 22, 2020
@DrDaveD
Contributor

DrDaveD commented Sep 23, 2020

Ah, I wish I had recognized what the issue was. My singcvmfs command also adds -S /var/run/cvmfs at the end of its script and maps /var/lib/cvmfs to a given directory (including a default). The problem with using -S on /var/lib/cvmfs is that the cache gets thrown away after each run.

I wonder if using singcvmfs would make what you're trying to do easier, perhaps with modifications if needed.

@ocaisa
Author

ocaisa commented Sep 23, 2020

Yeah, for my case that's ok because I'm using a (pre-populated) alien cache, so there's really nothing to throw away in /var/lib/cvmfs, but I expect we will have to get more clever for other scenarios.

@DrDaveD
Contributor

DrDaveD commented Sep 23, 2020

Depending on the application and filesystem, I worry that using a shared alien cache can cause havoc on a filesystem's metadata server due to large numbers of requests for small files. Have you tried it at large scales yet? If so, I'd be interested in hearing how many files the application accesses at startup time, how many nodes you used, and which filesystem the alien cache is on.

@ocaisa
Author

ocaisa commented Sep 23, 2020

No, I haven't got that far yet, but I should say that the way things are set up right now is for a per-user (or per-group) alien cache, not a system-wide one, so I wouldn't expect the metadata workload to be any worse than starting up a self-compiled set of software.

I was considering using the tiered cache mentioned in the docs, which should help, but the first task was to get things working at all.

@ocaisa
Author

ocaisa commented Sep 23, 2020

Actually I just tried this out. I realised I probably had to do a custom tiered configuration:

# Custom settings
CVMFS_WORKSPACE=/var/lib/cvmfs
CVMFS_CACHE_PRIMARY=hpc

CVMFS_CACHE_hpc_TYPE=tiered
CVMFS_CACHE_hpc_UPPER=memory
CVMFS_CACHE_hpc_LOWER=alien
CVMFS_CACHE_hpc_LOWER_READONLY=yes

CVMFS_CACHE_memory_TYPE=posix
CVMFS_CACHE_memory_SHARED=no
CVMFS_CACHE_memory_QUOTA_LIMIT=-1
CVMFS_CACHE_memory_ALIEN="/ram_alien"

CVMFS_CACHE_alien_TYPE=posix
CVMFS_CACHE_alien_SHARED=no
CVMFS_CACHE_alien_QUOTA_LIMIT=-1
CVMFS_CACHE_alien_ALIEN="/alien"
CVMFS_HTTP_PROXY="INVALID-PROXY"

where /ram_alien is bind-mounted to /dev/shm (see the sketch below). My reasoning was that if I used the RAM cache plugin it would create a 2GB cache space for each mount (one per MPI task), whereas if I use /dev/shm and bind-mount it, it will be shared by all MPI tasks. I don't really know how to test this though (apart from the fact that it seemed to work).
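The bind mount I mean is something along these lines (the host-side directory name is just an example):

# host-side directory on the RAM disk, bind-mounted into the container as /ram_alien
mkdir -p /dev/shm/$USER/cvmfs_ram
export SINGULARITY_BIND="$SINGULARITY_BIND,/dev/shm/$USER/cvmfs_ram:/ram_alien"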

@jblomer
Member

jblomer commented Sep 23, 2020

In order to test that the setup is effective, you can check that /ram_alien gets populated with regular files (find /ram_alien -type f).

The fact that the cache on the RAM disk is unbounded is likely to become a problem, though. I'd rather advise using the RAM cache plugin. You can configure the plugin such that it listens on a TCP socket and connect several mount points to it, so the cache will be shared. You might need some extra logic to start the plugin process before the containers start on the nodes.

@jblomer
Member

jblomer commented Sep 23, 2020

I guess that a UNIX domain socket that gets bind mounted into the containers should work just as well.
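A rough sketch of what such a setup could look like, loosely following the cache plugin example in the docs (socket path, plugin config location, and cache size are illustrative; the exact invocation may need adjusting):

# plugin side: started once per node before the containers, roughly (as I understand it)
#   /usr/libexec/cvmfs/cache/cvmfs_cache_ram /etc/cvmfs/cache-mem.conf
# with /etc/cvmfs/cache-mem.conf containing something like
CVMFS_CACHE_PLUGIN_LOCATOR=unix=/var/lib/cvmfs/cache.socket   # or tcp=<host>:<port>
CVMFS_CACHE_PLUGIN_SIZE=2000   # in MB

# client side, in default.local: every mount connects to the same external cache
CVMFS_CACHE_PRIMARY=myram
CVMFS_CACHE_myram_TYPE=external
CVMFS_CACHE_myram_LOCATOR=unix=/var/lib/cvmfs/cache.socket

The directory holding the socket would then need to be bind-mounted identically into every container.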

@ocaisa
Author

ocaisa commented Sep 23, 2020

It does indeed get populated. I'll take a deeper look into the RAM cache plugin another day.

@ocaisa
Author

ocaisa commented Sep 24, 2020

I don't think I have a good understanding of how the cache plugin works in this scenario; for me it is failing with:

[ocais1@juwels02 ~]$ singularity shell --scratch /var/run/cvmfs --scratch /var/lib/cvmfs --fusemount "container:cvmfs2 cvmfs-config.eessi-hpc.org /cvmfs/cvmfs-config.eessi-hpc.org" --fusemount "container:cvmfs2 pilot.eessi-hpc.org /cvmfs/pilot.eessi-hpc.org" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
Singularity> Failed to set requested number of open files, using maximum number 16384
CernVM-FS: loading Fuse module... Failed to connect to external cache manager (9 - cache directory/plugin problem)
Failed to set requested number of open files, using maximum number 16384
CernVM-FS: loading Fuse module... Failed to connect to external cache manager (9 - cache directory/plugin problem)

I tried just mounting a directory, assuming it would create the socket itself. I also tried creating a socket before launching Singularity and binding it, but that led to the error above. I really don't know much about sockets, so I'm shooting in the dark at this point.
