--nv flag does not work in singularity nested inside singularity #5759

Closed
rcaspart opened this issue Dec 16, 2020 · 12 comments
@rcaspart

We are trying to run jobs that require access to Nvidia GPUs inside singularity containers. For organisational and technical reasons we need to be able to run these jobs within a cascade of containers, i.e. an additional singularity container started inside the first singularity container.

For a single singularity container, using the --nv flag works fine and the required libraries and binaries are bind-mounted into the container. However, when starting an additional container inside the first one, none of the Nvidia libraries get bind-mounted into the second container (only the binaries are available).

At first glance, my assumption is that this is caused by nvidia-container-cli not being available within the first container, so that singularity falls back to nvliblist.conf, where the names of the libraries (and binaries) are specified. Singularity then relies on the ld cache to find the paths of these libraries. However, the ld cache does not include the libraries bind-mounted by singularity into /.singularity.d/libs (and, given the read-only nature of our containers, to the best of my knowledge it cannot include them). As a result, singularity fails to find the required libraries and does not bind-mount them into the second container.

My naive suggestion would be to have singularity, in addition to relying on the ld cache, also check for libraries bind-mounted by singularity into the /.singularity.d/libs directory as a fallback. Or is there a point I am missing here?
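
To illustrate what I have in mind, here is a rough shell sketch of the lookup order (purely hypothetical pseudologic, not actual singularity code; the function name is made up):

# hypothetical sketch: resolve one library name from nvliblist.conf,
# e.g. libnvidia-ml.so.1
find_nv_lib() {
    lib="$1"
    # current behaviour: resolve the library via the ld cache
    path=$(ldconfig -p | awk -v lib="$lib" '$1 == lib {print $NF; exit}')
    # proposed fallback: check the directory the outer --nv bind-mounted
    if [ -z "$path" ] && [ -e "/.singularity.d/libs/$lib" ]; then
        path="/.singularity.d/libs/$lib"
    fi
    echo "$path"
}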

Version of Singularity:

3.7.0-1.el7

Expected behavior

Nvidia libraries and binaries are bind-mounted into, and available in, both containers.

Actual behavior

Only the Nvidia binaries are bind-mounted into and available in both containers. The libraries are only available in the first container.

$ singularity shell --nv docker://matterminers/wlcg-wn\:latest
INFO:    Using cached SIF image
Singularity> singularity shell --nv docker://matterminers/wlcg-wn\:latest
INFO:    Using cached SIF image
INFO:    Convert SIF file to sandbox...
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (924) bind mounts
Singularity> nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Steps to reproduce this behavior

1. Start a singularity container with the --nv flag.
2. Start a second singularity container inside the first one, again with the --nv flag.

What OS/distro are you running

Scientific Linux 7.9 (Nitrogen)

How did you install Singularity

From EPEL repository

@dtrudg
Contributor

dtrudg commented Dec 16, 2020

Hi @rcaspart - this is a bit of an uncommon scenario. It'd be good to know what situation you have where you need to run singularity nested inside itself?

We generally don't want to encourage it if not necessary, as you lose the advantages of the SIF format (the single file has to be extracted to a sandbox) among other things.

We might consider a PR for this, but it's unlikely to be a high priority. As a workaround you can probably use -B/--bind to bind the libs into their standard locations in the outer container, so that the inner Singularity can pick them up... but this isn't something I've tried.
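
Untested, but I'd imagine something along these lines, using the image from the report above; the library names and host paths are only examples and will differ per driver installation:

# hypothetical sketch: bind the host driver libraries into standard library
# locations of the outer container so the inner singularity's ldconfig-based
# lookup can resolve them
singularity shell --nv \
  -B /usr/lib64/libnvidia-ml.so.1:/usr/lib64/libnvidia-ml.so.1 \
  -B /usr/lib64/libcuda.so.1:/usr/lib64/libcuda.so.1 \
  docker://matterminers/wlcg-wn\:latest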

@giffels

giffels commented Dec 17, 2020

Hi @dctrud,

I think this is actually not that uncommon a scenario. If you want to integrate HPC resources into WLCG computing for multiple experiments, you need the first layer to provide the traditional WLCG-like software environment (Grid middleware), and the experiments themselves partly start their own environment in a second singularity layer.

@rcaspart
Author

Hi @dctrud,
thanks a lot for your reply, and thanks to Manuel for outlining the situation. I second Manuel's opinion: while this most certainly is not the most common scenario, I think it is not the most uncommon one either.
While I agree that it is not an ideal solution and brings some disadvantages with it, it unfortunately is the only way we could envisage for these kinds of situations.

Thanks for your suggested workaround; I have given it a try and it works. However, I suspect this is only the case due to the feature introduced in #5670, which in turn breaks some use cases for us (see #5766).

Regarding a PR: if you (or others) agree that also checking /.singularity.d/libs for the libraries is a reasonable and viable way to proceed, I am happy to look into this, give it a try, and eventually contribute it as a PR to singularity.

@dtrudg
Contributor

dtrudg commented Dec 17, 2020

Hi @rcaspart @giffels

I think this is actually not that uncommon a scenario. If you want to integrate HPC resources into WLCG computing for multiple experiments, you need the first layer to provide the traditional WLCG-like software environment (Grid middleware), and the experiments themselves partly start their own environment in a second singularity layer.

I'm afraid that this is almost certainly an uncommon scenario if we consider our entire user base. I'm not aware of the full details of the WLCG computing configuration, but the vast majority of users are running Singularity directly on an HPC host. Where containerized middleware and nesting are used, I'm afraid that middle layer (in the container) may occasionally have to implement some workarounds. The complex nested setups we know some sites like WLCG use are varied, and we just don't have enough detail about how they are implemented to know how changes may affect them.

As an example: this is the first time I've come across anyone using NVIDIA GPUs within a nested container setup, and I don't think we've ever even considered testing that. With limited resources I'm afraid we can't anticipate or test every possible scenario, and we do need to change behavior to move forward.

Regarding a PR: if you (or others) agree that also checking /.singularity.d/libs for the libraries is a reasonable and viable way to proceed, I am happy to look into this, give it a try, and eventually contribute it as a PR to singularity.

I'd definitely be happy to consider a PR like this. In general, I'd encourage sites who have nested / complex setups to look at ways in which they can contribute test cases to the code, so that the behavior you need is something that we consider automatically. We'd be very glad to accept these unless they are in conflict with the needs of the broader user base.

Thanks.

@jafaruddinlie

Hi @dtrudg
Thanks for the suggestion in the Slack channel!
I am providing another use case here: we are trying to build a containerised desktop environment for our HPC users and would like to provide access to our already containerised singularity applications, for example Relion or ChimeraX.
Starting the desktop container works, and starting containerised apps without the Nvidia libraries works, but without access to the GPU most of the applications are not going to operate properly.

@carterpeel
Contributor

Hello,

This is a templated response that is being sent out to all open issues. We are working hard on 'rebuilding' the Singularity community, and a major task on the agenda is finding out what issues are still outstanding.

Please consider the following:

  1. Is this issue a duplicate, or has it been fixed/implemented since being added?
  2. Is the issue still relevant to the current state of Singularity's functionality?
  3. Would you like to continue discussing this issue or feature request?

Thanks,
Carter

@olifre
Contributor

olifre commented May 15, 2021

@carterpeel As outlined in detail in the earlier comments, this issue affects various different use cases. It's not solved yet, so this templated comment is not really helpful: it does not specify how to respond, nor does it ease reading through the issue; it just interrupts the existing discussion.

@stale

stale bot commented Jul 14, 2021

This issue has been automatically marked as stale because it has not had activity in over 60 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 14, 2021
@olifre
Contributor

olifre commented Jul 14, 2021

Dear stale-bot, this regression is still relevant to multiple users.

@stale stale bot removed the stale label Jul 14, 2021
@DrDaveD DrDaveD changed the title Nvidia libraries not bind-mounted into singulaity container within a singularity container with --nv flag --nv flag does not work in singularity nested inside singularity Sep 14, 2021
@DrDaveD
Collaborator

DrDaveD commented Sep 14, 2021

We had another use case for this today, and I found that doing these commands inside the first singularity container before invoking the nested singularity exec --nv ... works around the issue:

# create a writable location for a private ld.so cache
TMPD=`mktemp -d`
# wrap ldconfig so it writes to and reads from that cache instead of /etc/ld.so.cache
(echo '#!/bin/bash'; echo 'exec /usr/sbin/ldconfig -C '"$TMPD"'/ld.so.cache "$@"') >$TMPD/ldconfig
chmod +x $TMPD/ldconfig
# make the wrapper shadow the system ldconfig
PATH=$TMPD:$PATH
# rebuild the cache, adding the directories from LD_LIBRARY_PATH
# (which include /.singularity.d/libs, where --nv placed the GPU libraries)
ldconfig $LD_LIBRARY_PATH

This works because the second singularity invokes ldconfig -p to locate the nvidia libraries. I'm not sure what would be a good solution for changing singularity to make this work automatically.
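
As a quick sanity check (my own suggestion, not strictly part of the recipe above), you can confirm before launching the nested container that the wrapper is being picked up and the GPU libraries now appear in the private cache:

# the wrapper redirects ldconfig to the cache in $TMPD, so the libraries
# bind-mounted under /.singularity.d/libs should now be listed
ldconfig -p | grep -i nvidia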

brianhlin added a commit to brianhlin/osgvo-docker-pilot that referenced this issue Sep 14, 2021
Work around for apptainer/singularity#5759 (SOFTWARE-4807)

We do this to ensure that GPU libs can be accessed by user containers.
In particular, the backfill container mounts the libs from the host to
'/.singularity.d/libs' (available in LD_LIBRARY_PATH).

ldconfig tries to write its cache to /etc/, which isn't possible if
the outer backfill container is started up with Singularity.
@luator

luator commented May 19, 2022

Any update on this? I'm currently trying to get a nested setup to work with Apptainer (1.0.2), and it seems that the workaround from @DrDaveD does not work there anymore.

(I am also a bit unsure if it makes sense to continue the discussion here or if a new issue should be opened in the Apptainer repo.)

@DrDaveD
Collaborator

DrDaveD commented May 19, 2022

The workaround still works for me when I use apptainer both for shell --nv docker://matterminers/wlcg-wn\:latest on the outside and for another --nv on the inside. Since the container doesn't have apptainer in it, I bind-mounted /cvmfs and ran it from /cvmfs/oasis.opensciencegrid.org/mis/apptainer/bin/apptainer.
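
For reference, the sequence described above looks roughly like this (same image and CVMFS path as mentioned, with the ldconfig workaround from my earlier comment applied inside the outer container first):

# outer container: bind /cvmfs so the apptainer binary is reachable inside
apptainer shell --nv -B /cvmfs docker://matterminers/wlcg-wn\:latest

# inside the outer container: apply the ldconfig workaround from above,
# then start the nested container
/cvmfs/oasis.opensciencegrid.org/mis/apptainer/bin/apptainer shell --nv docker://matterminers/wlcg-wn\:latest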

I created apptainer/apptainer#464 for followup, please continue the discussion there.
