
Problems with using a CVMFS MPI application inside a Singularity container #2606

Closed
ocaisa opened this issue Sep 18, 2020 · 17 comments

@ocaisa

ocaisa commented Sep 18, 2020

In EESSI/filesystem-layer#37 I've created a script for an HPC system to create an alien cache in a shared space that can be used by execution nodes that are not connected to the internet (all done inside a Singularity container).

Everything works fine until I try to execute an application in parallel; the problems arise when I run an application using MPI (see EESSI/filesystem-layer#38 for details). The basic error is

Failed to initialize loader socket

which leads to a Fatal error. Is there anything I can do to get around this? I suspect it might be due to this envvar

export SINGULARITY_BIND="$SINGULARITY_CVMFS_RUN:/var/run/cvmfs,$SINGULARITY_CVMFS_LIB:/var/lib/cvmfs,$SINGULARITY_CVMFS_ALIEN:/alien,$SINGULARITY_HOMEDIR/default.local:/etc/cvmfs/default.local"

and some kind of race condition (it did work for 2 MPI processes, worked sometimes for 4 and never for 6).

@jblomer
Member

jblomer commented Sep 21, 2020

I think you're spot on. Both the SINGULARITY_CVMFS_RUN and the SINGULARITY_CVMFS_LIB directories should be local to the compute node because they store UNIX domain sockets and pipes. Only the alien cache directory should be on the shared space.
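As a minimal sketch (the node-local paths below are only illustrative), that would mean pointing the first two binds at node-local directories and keeping only the alien cache on shared storage:

# node-local directories for the client's runtime state (illustrative paths)
export SINGULARITY_CVMFS_RUN=/tmp/$USER/cvmfs_run
export SINGULARITY_CVMFS_LIB=/tmp/$USER/cvmfs_lib
mkdir -p "$SINGULARITY_CVMFS_RUN" "$SINGULARITY_CVMFS_LIB"
# only $SINGULARITY_CVMFS_ALIEN stays on the shared filesystem
export SINGULARITY_BIND="$SINGULARITY_CVMFS_RUN:/var/run/cvmfs,$SINGULARITY_CVMFS_LIB:/var/lib/cvmfs,$SINGULARITY_CVMFS_ALIEN:/alien,$SINGULARITY_HOMEDIR/default.local:/etc/cvmfs/default.local"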

In order to make the separation between the local space (/var/lib/cvmfs) and the shared space (the alien cache directory) clearer, I'd suggest using the newer configuration syntax in /etc/cvmfs/default.local, as described in the advanced cache configuration section:

CVMFS_WORKSPACE=/local/path
CVMFS_CACHE_PRIMARY=alien
CVMFS_CACHE_alien_TYPE=posix
CVMFS_CACHE_alien_SHARED=no  # confusing, but shared in this case means "shared quota management", which is not available for the alien cache
CVMFS_CACHE_alien_QUOTA_LIMIT=-1
CVMFS_CACHE_alien_ALIEN=/shared/storage

@ocaisa
Author

ocaisa commented Sep 21, 2020

I tried that but unfortunately it doesn't help. I think that when I use something like

srun --time=00:05:00 --nodes=1 --ntasks-per-node=4 singularity exec --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 1000 -g logfile

the issue is that, rather than mounting CVMFS once on a node, I'm telling it to do 4 different mounts, one for each MPI task on the node. Each of these mounts shares the var and lib directories because I don't (currently) have a way to tell them to use unique ones. I'll see if I can figure out a way around that.

@jblomer
Member

jblomer commented Sep 21, 2020

Ah, I see... yes, they need to be local and unique per mount. The cvmfs configuration files can make (limited) use of bash scripting, so in the default.local file you can specify subdirectories, e.g. per PID, per MPI rank, or based on other useful environment variables. (Such subdirectories are not automatically removed, though.)
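For instance, a sketch of what I mean in default.local (assuming the file is parsed by a shell so that parameter expansion works; the variable to key on and the path are just an example, and you may need to create the subdirectory up front):

# per-mount local workspace: use the SLURM task ID if available, otherwise the PID of the parsing shell
CVMFS_WORKSPACE=/var/lib/cvmfs/${SLURM_PROCID:-$$}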

@ocaisa
Author

ocaisa commented Sep 21, 2020

Ok, great, that's what I was hoping. If I can use $$ in there I should be able to solve this.

@jblomer
Member

jblomer commented Sep 21, 2020

That should work in practice. Note, though, that you'd be taking the PID of the temporary shell process that parses the config file, not of the cvmfs2 fuse module. So there is a slight chance that a PID gets reused, although that is probably quite a theoretical possibility.

@ocaisa
Author

ocaisa commented Sep 22, 2020

Ok, I got a little further. The idea of using $$ does what it is supposed to (for a job in SLURM I actually use $SLURM_PROCID, which is unique per MPI task on a node, so it should work in theory and in practice). This seems to be working fine for /var/lib/cvmfs (I see the files being created in the correct subdirectories), but I still run into problems when trying to run in MPI mode (although I did get a little further than before, it is also not reproducible). Do I need a unique /var/run/cvmfs per mount as well? Is that possible somehow?

@ocaisa
Author

ocaisa commented Sep 22, 2020

I also tried using the --scratch option to Singularity for /var/run/cvmfs, but that doesn't seem to make a difference.

I should say that I am not actually getting an error any more; the process just seems to hang:

[ocais1@juwels01 test]$ SLURM_MPI_TYPE=pspmix OMP_NUM_THREADS=6 srun -p devel --time=00:05:00 --nodes=1 --ntasks-per-node=4 --cpus-per-task=6 singularity exec --scratch /var/run/cvmfs --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 1000 -g logfile
srun: job 2637874 queued and waiting for resources
srun: job 2637874 has been allocated resources
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.
CernVM-FS: loading Fuse module... done
fuse: failed to clone device fd: Inappropriate ioctl for device
fuse: trying to continue without -o clone_fd.

(note that CernVM-FS: loading Fuse module... done appears 6 times instead of 8)

@ocaisa
Author

ocaisa commented Sep 22, 2020

This is solved by using the --scratch option to Singularity for both /var/run/cvmfs and /var/lib/cvmfs:

[ocais1@juwels01 test]$ SLURM_MPI_TYPE=pspmix OMP_NUM_THREADS=2 srun -p devel --time=00:05:00 --nodes=1 --ntasks-per-node=24 --cpus-per-task=2 singularity exec --scratch /var/lib/cvmfs --scratch /var/run/cvmfs --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 1000 -g logfile

ocaisa closed this as completed Sep 22, 2020
@DrDaveD
Contributor

DrDaveD commented Sep 23, 2020

Ah, I wish I had recognized what the issue was. My singcvmfs command also adds -S /var/run/cvmfs at the end of its script and maps /var/lib/cvmfs to a given directory (including a default). The problem with using -S on /var/lib/cvmfs is that the cache gets thrown away after each run.

I wonder if using singcvmfs would make what you're trying to do easier, perhaps with modifications if needed.

@ocaisa
Author

ocaisa commented Sep 23, 2020

Yeah, for my case that's ok because I'm using a (pre-populated) alien cache, so there's really nothing to throw away in /var/lib/cvmfs, but I expect we will have to get more clever for other scenarios.

@DrDaveD
Contributor

DrDaveD commented Sep 23, 2020

Depending on the application and filesystem, I worry that using a shared alien cache can cause havoc on a filesystem's metadata server due to large numbers of requests for small files. Have you tried it at large scales yet? If so, I'd be interested in hearing how many files the application accesses at startup time, how many nodes you used, and which filesystem the alien cache is on.

@ocaisa
Author

ocaisa commented Sep 23, 2020

No, I haven't got that far yet, but I should say that the way things are set up right now is for a per-user (or per-group) alien cache, not a system-wide one, so I wouldn't expect the metadata workload to be any worse than starting up a self-compiled set of software.

I was considering using the tiered cache mentioned in the docs, which should help, but the first task was to get things working at all.

@ocaisa
Author

ocaisa commented Sep 23, 2020

Actually I just tried this out. I realised I probably had to do a custom tiered configuration:

# Custom settings
CVMFS_WORKSPACE=/var/lib/cvmfs
CVMFS_CACHE_PRIMARY=hpc

CVMFS_CACHE_hpc_TYPE=tiered
CVMFS_CACHE_hpc_UPPER=memory
CVMFS_CACHE_hpc_LOWER=alien
CVMFS_CACHE_hpc_LOWER_READONLY=yes

CVMFS_CACHE_memory_TYPE=posix
CVMFS_CACHE_memory_SHARED=no
CVMFS_CACHE_memory_QUOTA_LIMIT=-1
CVMFS_CACHE_memory_ALIEN="/ram_alien"

CVMFS_CACHE_alien_TYPE=posix
CVMFS_CACHE_alien_SHARED=no
CVMFS_CACHE_alien_QUOTA_LIMIT=-1
CVMFS_CACHE_alien_ALIEN="/alien"
CVMFS_HTTP_PROXY="INVALID-PROXY"

where /ram_alien is bind-mounted to /dev/shm (see the sketch below). My reasoning was that if I used the RAM cache plugin it would create a 2GB cache space for each mount (one per MPI task), whereas if I use /dev/shm and bind-mount it, it will be shared by all MPI tasks. I don't really know how to test this though (apart from the fact that it seemed to work).
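The bind mount I mean is something along these lines (the host-side directory name is just an example):

# host-side directory on the RAM disk, bind-mounted into the container as /ram_alien
mkdir -p /dev/shm/$USER/cvmfs_ram
export SINGULARITY_BIND="$SINGULARITY_BIND,/dev/shm/$USER/cvmfs_ram:/ram_alien"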

@jblomer
Member

jblomer commented Sep 23, 2020

In order to test that the setup is effective, you can check that /ram_alien gets populated with regular files (find /ram_alien -type f).

The fact that the cache on the RAM disk is unbounded is likely to become a problem, though. I'd rather advise using the RAM cache plugin. You can configure the plugin such that it listens on a TCP socket and connect several mount points to it, so the cache will be shared. You might need some extra logic to start the plugin process before the containers start on the nodes.

@jblomer
Member

jblomer commented Sep 23, 2020

I guess that a UNIX domain socket that gets bind mounted into the containers should work just as well.
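A rough sketch of what such a setup could look like, loosely following the cache plugin example in the docs (socket path, plugin config location, and cache size are illustrative; the exact invocation may need adjusting):

# plugin side: started once per node before the containers, roughly (as I understand it)
#   /usr/libexec/cvmfs/cache/cvmfs_cache_ram /etc/cvmfs/cache-mem.conf
# with /etc/cvmfs/cache-mem.conf containing something like
CVMFS_CACHE_PLUGIN_LOCATOR=unix=/var/lib/cvmfs/cache.socket   # or tcp=<host>:<port>
CVMFS_CACHE_PLUGIN_SIZE=2000   # in MB

# client side, in default.local: every mount connects to the same external cache
CVMFS_CACHE_PRIMARY=myram
CVMFS_CACHE_myram_TYPE=external
CVMFS_CACHE_myram_LOCATOR=unix=/var/lib/cvmfs/cache.socket

The directory holding the socket would then need to be bind-mounted identically into every container.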

@ocaisa
Author

ocaisa commented Sep 23, 2020

It does indeed get populated. I'll take a deeper look into the RAM cache plugin another day.

@ocaisa
Author

ocaisa commented Sep 24, 2020

I don't think I have a good understanding of how the cache plugin works in this scenario; for me it is failing with:

[ocais1@juwels02 ~]$ singularity shell --scratch /var/run/cvmfs --scratch /var/lib/cvmfs --fusemount "container:cvmfs2 cvmfs-config.eessi-hpc.org /cvmfs/cvmfs-config.eessi-hpc.org" --fusemount "container:cvmfs2 pilot.eessi-hpc.org /cvmfs/pilot.eessi-hpc.org" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
Singularity> Failed to set requested number of open files, using maximum number 16384
CernVM-FS: loading Fuse module... Failed to connect to external cache manager (9 - cache directory/plugin problem)
Failed to set requested number of open files, using maximum number 16384
CernVM-FS: loading Fuse module... Failed to connect to external cache manager (9 - cache directory/plugin problem)

I tried just mounting a directory, assuming it would create the socket itself. I also tried creating a socket before launching Singularity and binding it, but that led to the error above. I really don't know much about sockets, so I'm shooting in the dark at this point.
