Error mount-binding path from time to time #1313
Comments
|
@thiell Is there an autofs setup involved with bind directory ? |
|
@cclerget yes, we're using autofs on Also we have |
Maybe you can try by disabling hostfs and use Between did you have a full debug log when bind failed ?
Nice catch ! Thank you. I will check code base to fix that |
|
Ah thanks for the suggestion @cclerget, I checked and unfortunately we can't do that as we're also mounting some group NFS home space using autofs, so this has to be somehow dynamic. I can't reproduce the issue myself, users are reporting this issue only from time to time, on nodes that have /oak mounted on the host... |
|
Sorry @thiell, it will be difficult to help without some debug lines. After looking into singularity_priv_drop, there is no function which could set errno to ENOENT, so we can assume errno was set by mount. Why does that happen only sometimes? A Lustre issue? A mount namespace issue? You can try to use to force Singularity to open /oak before entering the mount namespace; if Singularity can't open /oak, the user will be notified with the following warning message: |
|
Hi @cclerget, So I guess the error reported by Singularity is wrong... I believe all users that have reported this don't have access to this mount point |
|
Hey @cclerget ! We have some debug output for you! We are still having this issue on Sherlock so your wisdom is greatly appreciated. @thiell can verify - but it should be the case that /oak is mounted on all nodes. @thiell is this the case for the login node as well, without any extra groups / actions by the user? |
|
@vsoch indeed oak is permanently mounted on all nodes including login nodes on this cluster. I checked this one, and /oak was mounted on March 16 (and the node hasn't rebooted since): and then I also see: Right now: |
|
Also, I would like to highlight that in the debug output above, Singularity does see |
|
@thiell @vsoch the following lines make me really think the problem is on the Lustre side :
|
|
@cclerget thx but what is the errno from stat()? it just can't be ENOENT, so that would be helpful to know. |
|
We were just tracking one of these errors Would it be easy to do a change that would allow our users to ignore a set of host mount points? |
|
Sorry for the delay @thiell, I just committed some changes on this branch https://github.com/cclerget/singularity/tree/autofs-debug to display the error string. Can you give it a try to see what happens exactly? |
|
Hi @cclerget, Despite the fact that this Lustre thing might be confusing, we still think that Singularity has a bug and isn't returning the proper errno (should have been EPERM and not ENOENT). So I'm asking, is it OK to always make the assumption that Thanks a lot for your patch too. I'm not sure how we can try your autofs-debug patch right now, as we don't want to disrupt the production anymore (the use of Singularity has been painful enough for our users lately due to this issue), but I'll see what I can do and keep you posted if I find something. |
|
Hey @thiell, glad you found a solution! You're right, Singularity doesn't return the proper errno; that was fixed in the release candidate for mount errors. The patch submitted was only to see exact About your question, if |
Version of Singularity: 2.4
We're facing some weird bind mount errors on CentOS 7 and Singularity 2.4. The path and filesystem do exist, but Singularity reports ENOENT when bind mounting it (it only happens from time to time):
Expected behavior
Bind mount works
Actual behavior
Bind mount fails from time to time.
Steps to reproduce behavior
We tried to reproduce the error without luck. Only our users are reporting the problem. I'm opening this ticket in the hope somebody else is seeing this.
Also, I quickly checked the code, and I noticed parts like that:
Because errno is not saved, and singularity_priv_drop() may call other glibc functions, the error string extracted from errno and passed to singularity_message could be wrong, masking the real issue.
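
The errno concern described above can be illustrated with a small, self-contained sketch. It does not reproduce Singularity's code: `fake_mount` and `fake_priv_drop` are hypothetical stand-ins (assumptions made for illustration) for a failing mount(2) call and for a cleanup helper that happens to overwrite errno. The two reporting functions contrast reading errno after the cleanup call with saving it immediately after the failure.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a failing mount(2) call: always fails with EPERM. */
static int fake_mount(void) {
    errno = EPERM;
    return -1;
}

/* Hypothetical stand-in for a privilege-drop helper whose internals
 * happen to overwrite errno (here with ENOENT). */
static void fake_priv_drop(void) {
    errno = ENOENT;
}

/* Fragile pattern: errno is read only after the cleanup call has run.
 * Returns the errno value that ends up in the message. */
static int report_without_saving(char *msg, size_t len) {
    if (fake_mount() < 0) {
        fake_priv_drop();              /* may clobber errno */
        int reported = errno;          /* no longer the mount error */
        snprintf(msg, len, "bind mount failed: %s", strerror(reported));
        return reported;
    }
    return 0;
}

/* Safer pattern: capture errno immediately after the failing call,
 * before anything else has a chance to modify it. */
static int report_with_saved_errno(char *msg, size_t len) {
    if (fake_mount() < 0) {
        int saved_errno = errno;       /* the real cause of the failure */
        fake_priv_drop();
        snprintf(msg, len, "bind mount failed: %s", strerror(saved_errno));
        return saved_errno;
    }
    return 0;
}
```

In this sketch the first function ends up reporting ENOENT even though the simulated failure was EPERM, while the second reports EPERM. Whether the real helper overwrites errno in practice depends on the calls it makes internally, so saving errno right after the failing call is the defensive choice either way.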