Bug: deadlock in cvmfs2 process #3402

Open
harenber opened this issue Sep 29, 2023 · 21 comments

@harenber

Dear all,

we've been suffering for days from a potential bug where the cvmfs2 process serving sft.cern.ch is deadlocking. (We only saw this for sft.cern.ch!)

With the 268 nodes we have, we see a rate of >= 1 node per hour showing this behaviour. We (Wuppertal University) are an ATLAS Tier-2, so we have a lot of ATLAS workload. Our local users (other than those from the HEP group) barely use CVMFS. We haven't yet been able to identify whether it is related to some particular ATLAS job type.

Symptom:

[root@wn21268 ~]# ls -l /cvmfs/sft.cern.ch
ls: cannot access /cvmfs/sft.cern.ch: No such file or directory

killing the cvmfs2 process and restarting autofs helps.

Most interesting observation:

this returns nothing on the sft.cern.ch cvmfs2 process

[root@wn21159 fd]# perf trace -p 26181 --duration 10
^C[root@wn21159 fd]#

while it shows normal activity on any other cvmfs2 process.

I attached with gdb to one of those deadlocking processes:

(gdb) bt
#0  0x00007f9e0428db3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f9e0428dbcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f9e0428dc6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f9e0323e5c8 in fuse_session_loop_mt () from /lib64/libfuse.so.2
#4  0x00007f9e03473b0f in FuseMain (argc=<optimized out>, argv=<optimized out>)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/loader.cc:1086
#5  0x0000000000404c3f in main (argc=5, argv=0x7ffddaca7248)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/fuse_main.cc:144
(gdb)

versions are

cvmfs-2.11.0-1.el7.x86_64
cvmfs-config-default-2.0-1.noarch
cvmfs-libs-2.11.0-1.el7.x86_64

and kernel is

[root@wn21186 ~]# uname -a
Linux wn21186.pleiades.uni-wuppertal.de 3.10.0-1160.99.1.el7.x86_64 #1 SMP Wed Sep 13 14:19:20 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

We already collected some DEBUG logs here, but as Dave suggested on cvmfs-talk, we just reconfigured the cluster to only write DEBUG logs for sft.cern.ch. As soon as we catch a node showing this behaviour again, we will add the relevant log to this issue.

We cross-checked the CVMFS client config with DESY and the Squid config with LRZ-LMU and we couldn't find any difference other than local cache sizes (we have 12 GB CVMFS cache on our nodes).

Cheers

Martin and Torsten

@HereThereBeDragons
Contributor

HereThereBeDragons commented Oct 3, 2023

Hello Martin and Torsten,

Sadly, with the current information you provided it's really difficult to pin down the actual problem.
As suggested by Dave, it would help to have dedicated debug logs for the different repos.
You can do this by putting something like the following in default.local (a reload of the repo is needed; a remount is probably better if it's already a bit fishy):

CVMFS_DEBUGLOG=/tmp/debuglog-@fqrn@.log

@fqrn@ will be dynamically replaced by the repo name.
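
For reference, a minimal sketch of how this could be applied on a node (assuming the standard client config layout; a full remount only if the mount already misbehaves):

# append the debug-log setting to the local client configuration
echo 'CVMFS_DEBUGLOG=/tmp/debuglog-@fqrn@.log' >> /etc/cvmfs/default.local

# pick up the change without unmounting
cvmfs_config reload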

A few things I have seen so far:

  • I don't think it is related to this, but it is important to know that CVMFS_QUOTA_LIMIT is a soft limit and not a hard limit, and over time the cache can grow slightly larger than it. For your 12 GB CVMFS cache: is the partition bigger or smaller than the 12 GB?

And I found this error message (though right now I cannot see how it is related):

(cache) checking if sft-nightlies.cern.ch revision 115303 is blacklisted    [09-28-2023 13:33:55 CEST]
(cvmfs) remounting root catalog    [09-28-2023 13:33:55 CEST]
...
(download) validate DNS entry for squid2.pleiades.uni-wuppertal.de    [09-28-2023 13:33:55 CEST]
(dns) add address squid2.pleiades.uni-wuppertal.de -> 132.195.124.10    [09-28-2023 13:33:55 CEST]
....
(download) Verify downloaded url /.cvmfspublished, proxy http://132.195.124.10:3128 (curl error 0)    [09-28-2023 13:33:55 CEST]
(cvmfs) repository name does not match (found sft.cern.ch, expected sft-nightlies.cern.ch)    [09-28-2023 13:33:55 CEST]
(cache) failed to fetch manifest (3 - repository name mismatch)    [09-28-2023 13:33:55 CEST]

This happened 4 times in total, always expecting sft-nightlies.cern.ch but finding sft.cern.ch:

(cache) failed to fetch manifest (3 - repository name mismatch)    [09-28-2023 13:33:55 CEST]
(cache) failed to fetch manifest (3 - repository name mismatch)    [09-28-2023 13:34:07 CEST]
(cache) failed to fetch manifest (3 - repository name mismatch)    [09-28-2023 13:34:10 CEST]
(cache) failed to fetch manifest (3 - repository name mismatch)    [09-28-2023 13:37:10 CEST]

@harenber
Author

harenber commented Oct 4, 2023

Thanks for looking into this. We have the impression that the problem is triggered by some ATLAS jobs. The problem nearly vanished as quickly as it appeared, without any change to the system on our part.

However, we wrote a small script which restarts CVMFS in case this happens:

#!/bin/bash

if ! [ -f /cvmfs/sft.cern.ch/lcg/lastUpdate ]; then
    echo "$(date -Im)" >> /var/log/cvmfslog/kill.log ;
    kill $(cvmfs_talk -i sft.cern.ch pid);
    cp /var/log/cvmfslog/cvmfs_sft.log "/var/log/cvmfslog/cvmfs_sft.log_$(date -Im)";
    systemctl restart autofs;
fi

and we finally got a log which we uploaded to

https://harenber.web.cern.ch/harenber/wn21164/

We're not sure how to interpret these (3 - repository name mismatch) error messages. Is there some misconfiguration on our side?

Cheers from Wuppertal

Martin and Torsten

@DrDaveD
Contributor

DrDaveD commented Oct 4, 2023

With that latest log, there appears to me to be a long portion of normal behavior at the end of the log. Can you provide a timestamp or something to search for in the log around the time that the hanging behavior was observed?

Regarding the name mismatch, indeed that indicates a misconfiguration on either a client or stratum 1 for the sft-nightlies.cern.ch repository. It's very difficult to identify though because I searched the log for CVMFS_SERVER_URL settings including sft-nightlies.cern.ch and they all look normal, coming from /etc/cvmfs/domain.d/cern.ch.conf. I checked the .cvmfswhitelist on each of those with

for S1 in $(echo "http://cvmfs-stratum-one.cern.ch:8000/cvmfs/sft-nightlies.cern.ch;http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/sft-nightlies.cern.ch;http://cvmfs-s1bnl.opensciencegrid.org:8000/cvmfs/sft-nightlies.cern.ch;http://cvmfs-s1fnal.opensciencegrid.org:8000/cvmfs/sft-nightlies.cern.ch;http://cvmfsrep.grid.sinica.edu.tw:8000/cvmfs/sft-nightlies.cern.ch;http://cvmfs-stratum-one.ihep.ac.cn:8000/cvmfs/sft-nightlies.cern.ch;http://cvmfs-s1.hpc.swin.edu.au:8000/cvmfs/sft-nightlies.cern.ch;http://sampacs01.if.usp.br:8000/cvmfs/sft-nightlies.cern.ch"|tr ';' ' '); do
echo :$S1
curl -s "$S1/.cvmfswhitelist"|cat -v|grep ^N
done

and none of them indicated they were serving sft.cern.ch. It makes me wonder if the squid cache is somehow poisoned but that seems rather unlikely.

@stderr-enst

With that latest log, there appears to me to be a long portion of normal behavior at the end of the log. Can you provide a timestamp or something to search for in the log around the time that the hanging behavior was observed?

Well, our script that dumps the log runs every 5 minutes and only acts if it can't access a testfile in the sft repo anymore.
So I'm afraid we can only pin the lock-up to somewhere in the last 5 minutes of that log -- sometime between 21:50 and 21:55?

@DrDaveD
Contributor

DrDaveD commented Oct 4, 2023

During the last 5 minutes there was very little activity and it looked entirely normal. I do see an access to lastUpdate at 21:50:01, and one difference is that there is no access to anything shown at 21:55 before the shutdown at 21:55:03. There is a normal check of .cvmfspublished at 21:54:04. So as far as cvmfs is concerned, it looks like it never even saw the request [ -f /cvmfs/sft.cern.ch/lcg/lastUpdate ] from your script.

Regarding the name mismatch, the combined log does show it attempting to read from http://cvmfs-stratum-one.cern.ch:8000/cvmfs/sft.cern.ch/.cvmfspublished just before the mismatch error, but because the logs are mixed it's very difficult to match that back with where it came from.

@harenber
Author

harenber commented Oct 4, 2023

During the last 5 minutes there was very little activity and it looked entirely normal. I do see an access to lastUpdate at 21:50:01, and one difference is that there is no access to anything shown at 21:55 before the shutdown at 21:55:03. There is a normal check of .cvmfspublished at 21:54:04. So as far as cvmfs is concerned, it looks like it never even saw the request [ -f /cvmfs/sft.cern.ch/lcg/lastUpdate ] from your script.

Yeah, that fits our observation that the cvmfs2 process is completely dead. perf trace shows no activity at all anymore once this happens. The backtrace shows it's waiting on a semaphore.

We could try to enhance our "watchdog" script with some gdb magic, for example something like the sketch below -- would that be helpful? Or do you have some other idea of what could be done?
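
An untested sketch of what that addition could look like, reusing the paths from our watchdog script above:

# hypothetical watchdog addition: dump a backtrace of all threads of the
# hanging cvmfs2 process before killing it
PID=$(cvmfs_talk -i sft.cern.ch pid)
gdb -p "$PID" -batch -ex "thread apply all bt" \
    > "/var/log/cvmfslog/backtrace_$(date -Im).log" 2>&1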

Cheers

Martin and Torsten

@HereThereBeDragons
Contributor

Do you maybe have the logs for sft-nightlies? While it seems to be an independent issue, it would still be good to also figure out what is going on there.

@stderr-enst

Right now we have only configured debug logs for sft.cern.ch throughout the cluster.
The sft-nightlies debug logs are only available in the (mixed, global) debug logs from our first post (wn21159 and wn21089).

Is there a quick and easy way to switch on debug logs for a specific repo during operation?
We are hesitant to roll out another client config change, since we always (have to?) wait until all jobs on a node have finished before remounting the CVMFS repos.
The simplest approach, with the least book-keeping for us, was to throw a Slurm reboot at all nodes, which is a bit too drastic to repeat often and always results in worse utilization.

@HereThereBeDragons
Contributor

hi,

As I said, you can use CVMFS_DEBUGLOG=/tmp/debuglog-@fqrn@.log in your default.local and it will create debug logs for every repo mounted on that node. (It might also be interesting to see what the atlas repo is doing, since you say these issues are mainly seen when ATLAS runs jobs.)

With the 2.11 client, a simple reload should be enough to switch debug logging on and off; no remount needed.

@stderr-enst

hi,

As I said, you can use CVMFS_DEBUGLOG=/tmp/debuglog-@fqrn@.log in your default.local and it will create debug logs for every repo mounted on that node. (It might also be interesting to see what the atlas repo is doing, since you say these issues are mainly seen when ATLAS runs jobs.)

With the 2.11 client, a simple reload should be enough to switch debug logging on and off; no remount needed.
Hey,

Good to know that no reboot is required!
We are reconfiguring our nodes to write all the logs and have added some cleanup for when everything seems to be going fine.

I think we also didn't answer your question about the CVMFS cache size from that post.
The partition is always much larger than the 12GB, so the soft limit shouldn't be an issue.

We'll get back to you once we have a new set of logs.

A couple of points though:

  • Did you know that sft and sft-nightlies symlink to each other? Maybe this is causing the mix-up in the logs:
$ ls -lah /cvmfs/sft.cern.ch/lcg/latest
lrwxrwxrwx 1 cvmfs cvmfs 39 18. Mai 2020  /cvmfs/sft.cern.ch/lcg/latest -> /cvmfs/sft-nightlies.cern.ch/lcg/latest
$ ls -lah /cvmfs/sft-nightlies.cern.ch/lcg/contrib
lrwxrwxrwx 1 cvmfs cvmfs 30 15. Dez 2016  /cvmfs/sft-nightlies.cern.ch/lcg/contrib -> /cvmfs/sft.cern.ch/lcg/contrib
  • Just a reminder: the observed symptom is a complete lock-up on the mentioned semaphore, so no log output or any process activity occurs anymore once this has happened
  • Since the error isn't occurring as frequently anymore, we changed our script to catch the error but just drain the node, instead of killing the process and remounting the sft repo
  • If you want, we could get in touch via Zoom and look at the locked process together on one of those nodes as soon as it appears again

@DrDaveD
Contributor

DrDaveD commented Oct 5, 2023

I don't see how the crossed symbolic links could cause that crossover in the cvmfs requests.

The problem is really bizarre because the debug log shows that the sft.cern.ch cvmfs2 process is essentially idle when the lockup happens. I wonder if it could be something in the shared cache that it is getting stuck on, but if it were I would expect the log messages to show it beginning to communicate with the shared cache process and never returning.

I think the gdb stack trace you provided is misleading because it only showed one thread. Please try to get another one from a hanging cvmfs2 process using thread apply all bt in gdb.

@harenber
Author

Took a while, but here we have one:

[root@wn21229 cvmfslog]# ls /cvmfs/sft-nightlies.cern.ch/
lcg
[root@wn21229 cvmfslog]# ls /cvmfs/sft.cern.ch
ls: cannot access /cvmfs/sft.cern.ch: No such file or directory
[root@wn21229 cvmfslog]# less kill.log
[root@wn21229 cvmfslog]# ls /cvmfs/sft.cern.ch
ls: cannot access /cvmfs/sft.cern.ch: No such file or directory
[root@wn21229 cvmfslog]#


[root@wn21229 ~]# cvmfs_talk -i sft.cern.ch pid
16840
[root@wn21229 ~]# gdb -p 16840
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 16840
Reading symbols from /usr/bin/cvmfs2...done.
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 8906]
[New LWP 8902]
[New LWP 8901]
[New LWP 2609]
[New LWP 246723]
[New LWP 93847]
[New LWP 93673]
[New LWP 93672]
[New LWP 93667]
[New LWP 68525]
[New LWP 16850]
[New LWP 16849]
[New LWP 16848]
[New LWP 16847]
[New LWP 16846]
[New LWP 16845]
[New LWP 16844]
[New LWP 16843]
[New LWP 16842]
[New LWP 16841]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libstdc++.so.6
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libcvmfs_fuse_stub.so...done.
Loaded symbols for /usr/lib64/libcvmfs_fuse_stub.so
Reading symbols from /lib64/libfuse.so.2...Reading symbols from /lib64/libfuse.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libfuse.so.2
Reading symbols from /usr/lib64/libcvmfs_fuse_debug.so...done.
Loaded symbols for /usr/lib64/libcvmfs_fuse_debug.so
Reading symbols from /lib64/libcvmfs_crypto_debug.so.2.11.0...done.
Loaded symbols for /lib64/libcvmfs_crypto_debug.so.2.11.0
Reading symbols from /lib64/libcvmfs_util_debug.so.2.11.0...done.
Loaded symbols for /lib64/libcvmfs_util_debug.so.2.11.0
Reading symbols from /lib64/libssl.so.10...Reading symbols from /lib64/libssl.so.10...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libssl.so.10
Reading symbols from /lib64/libcrypto.so.10...Reading symbols from /lib64/libcrypto.so.10...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypto.so.10
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libuuid.so.1...Reading symbols from /lib64/libuuid.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libuuid.so.1
Reading symbols from /lib64/libgssapi_krb5.so.2...Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libk5crypto.so.3...Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libz.so.1...Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1

Reading symbols from /lib64/libkrb5support.so.0...Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libselinux.so.1...Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libpcre.so.1...Reading symbols from /lib64/libpcre.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libpcre.so.1
0x00007fc148c72b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install cvmfs-2.11.0-1.el7.x86_64
(gdb)
(gdb) thread apply all bt

Thread 21 (Thread 0x7fc142fd0700 (LWP 16841)):
#0  0x00007fc14816addd in poll () from /lib64/libc.so.6
#1  0x00007fc1476b28d1 in Watchdog::MainWatchdogListener (data=0x2483840)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/monitor.cc:568
#2  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 20 (Thread 0x7fc1427cf700 (LWP 16842)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc146fad315 in ReadPipe (fd=21, buf=0x7fc1427ceeb7, nbyte=1)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/util/posix.cc:516
#2  0x00007fc147747c6f in FuseInvalidator::MainInvalidator (data=0x25dc4e0)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/fuse_evict.cc:140
#3  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 19 (Thread 0x7fc141fce700 (LWP 16843)):
#0  0x00007fc14816addd in poll () from /lib64/libc.so.6
#1  0x00007fc1477498aa in FuseRemounter::MainRemountTrigger (data=0x25dc7f0)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/fuse_remount.cc:209
#2  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7fc1417cd700 (LWP 16844)):
#0  0x00007fc14816addd in poll () from /lib64/libc.so.6
#1  0x00007fc14767d06a in glue::DentryTracker::MainCleaner (data=0x25e2190)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/glue_buffer.cc:191
#2  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fc140fcc700 (LWP 16845)):
#0  0x00007fc14816addd in poll () from /lib64/libc.so.6
#1  0x00007fc1476e182c in download::DownloadManager::MainDownload (data=0x24ae5e0)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/network/download.cc:562
#2  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7fc1407cb700 (LWP 16846)):
#0  0x00007fc14816addd in poll () from /lib64/libc.so.6
#1  0x00007fc1476e182c in download::DownloadManager::MainDownload (data=0x2535d40)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/network/download.cc:562
#2  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7fc13ffca700 (LWP 16847)):
#0  0x00007fc14816addd in poll () from /lib64/libc.so.6
#1  0x00007fc147752031 in quota::MainWatchdogListener (data=0x2477870)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/quota_listener.cc:90
#2  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fc13f7c9700 (LWP 16848)):
#0  0x00007fc14816addd in poll () from /lib64/libc.so.6
#1  0x00007fc147751eb0 in quota::MainUnpinListener (data=0x25e89e0)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/quota_listener.cc:48
#2  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fc13efc8700 (LWP 16849)):
#0  0x00007fc148c739dd in accept () from /lib64/libpthread.so.0
#1  0x00007fc147754524 in TalkManager::MainResponder (data=0x25dc7b0)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/talk.cc:176
#2  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fc13e7c7700 (LWP 16850)):
#0  0x00007fc148c739dd in accept () from /lib64/libpthread.so.0
#1  0x00007fc147e596af in loader::loader_talk::MainTalk (data=<optimized out>)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/loader_talk.cc:60
#2  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fc11d7fc700 (LWP 68525)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fc1257fb700 (LWP 93667)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fc1277ff700 (LWP 93672)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fc13cfd0700 (LWP 93673)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc12dffb700 (LWP 93847)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc126ffe700 (LWP 246723)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fc12effd700 (LWP 2609)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fc12f7fe700 (LWP 8901)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fc08afff700 (LWP 8902)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc125ffc700 (LWP 8906)):
#0  0x00007fc148c7375d in read () from /lib64/libpthread.so.0
#1  0x00007fc147c22d62 in fuse_kern_chan_receive () from /lib64/libfuse.so.2
#2  0x00007fc147c23d59 in fuse_ll_receive_buf () from /lib64/libfuse.so.2
#3  0x00007fc147c2337e in fuse_do_work () from /lib64/libfuse.so.2
#4  0x00007fc148c6cea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc148175b0d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fc149291740 (LWP 16840)):
#0  0x00007fc148c72b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007fc148c72bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fc148c72c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fc147c235c8 in fuse_session_loop_mt () from /lib64/libfuse.so.2
#4  0x00007fc147e58b0f in FuseMain (argc=<optimized out>, argv=<optimized out>)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/loader.cc:1086
#5  0x0000000000404c3f in main (argc=5, argv=0x7ffe4e4c8568)
    at /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc7/build/BUILD/cvmfs-2.11.0/cvmfs/fuse_main.cc:144

@jblomer
Member

jblomer commented Oct 13, 2023

Thank you for the backtrace!

It does not really seem to be a deadlock; rather, this cvmfs2 process is idling. The problem may be related to the use of containers. On this node, could you please:

  • Get the recent cvmfs syslog messages for sft-nightlies.cern.ch
  • Try to mount manually:
sudo mkdir -p /mnt/test
sudo mount -t cvmfs sft-nightlies.cern.ch /mnt/test

That should fail but we are interested in the output on the terminal.

@harenber
Author

Hi Jakob,

here is the output; sft-nightlies works but sft.cern.ch does not:

[root@wn21229 ~]# mount -t cvmfs sft-nightlies.cern.ch /mnt/test
Debug: using library /usr/lib64/libcvmfs_fuse_stub.so
CernVM-FS: running with credentials 899:899
CernVM-FS: running in debug mode
CernVM-FS: loading Fuse module... (cvmfs) Parsing config file /etc/cvmfs/default.conf    [10-13-2023 13:19:32 CEST]
(cvmfs) execve'd /bin/sh (PID: 118606)    [10-13-2023 13:19:32 CEST]
(cvmfs) Parsing config file /etc/cvmfs/default.d/50-cern.conf    [10-13-2023 13:19:32 CEST]
(cvmfs) execve'd /bin/sh (PID: 118608)    [10-13-2023 13:19:32 CEST]
(cvmfs) Parsing config file /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/default.conf    [10-13-2023 13:19:32 CEST]
(cvmfs) execve'd /bin/sh (PID: 118611)    [10-13-2023 13:19:32 CEST]
(cvmfs) Parsing config file /etc/cvmfs/default.local    [10-13-2023 13:19:32 CEST]
(cvmfs) execve'd /bin/sh (PID: 118614)    [10-13-2023 13:19:32 CEST]
(cvmfs) Parsing config file /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/domain.d/cern.ch.conf    [10-13-2023 13:19:32 CEST]
(cvmfs) execve'd /bin/sh (PID: 118617)    [10-13-2023 13:19:32 CEST]
(cvmfs) Parsing config file /etc/cvmfs/domain.d/cern.ch.conf    [10-13-2023 13:19:32 CEST]
(cvmfs) execve'd /bin/sh (PID: 118619)    [10-13-2023 13:19:32 CEST]
(cvmfs) Parsing config file /etc/cvmfs/domain.d/cern.ch.local    [10-13-2023 13:19:32 CEST]
(cvmfs) Parsing config file /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/config.d/sft-nightlies.cern.ch.conf    [10-13-2023 13:19:32 CEST]
(cvmfs) Parsing config file /etc/cvmfs/config.d/sft-nightlies.cern.ch.conf    [10-13-2023 13:19:32 CEST]
(cvmfs) Parsing config file /etc/cvmfs/config.d/sft-nightlies.cern.ch.local    [10-13-2023 13:19:32 CEST]
(cvmfs) execve'd /bin/sh (PID: 118622)    [10-13-2023 13:19:32 CEST]
done
CernVM-FS: mounted cvmfs on /mnt/test
[root@wn21229 ~]# cat /mnt/test/^C
[root@wn21229 ~]# mkdir -p /mnt/test-sft
[root@wn21229 ~]# mount -t cvmfs sft.cern.ch /mnt/test-sft
Repository sft.cern.ch is already mounted on /cvmfs/sft.cern.ch
[root@wn21229 ~]# cat /cvmfs/sft.cern.ch/lcg/lastUpdate
cat: /cvmfs/sft.cern.ch/lcg/lastUpdate: No such file or directory
[root@wn21229 ~]# 

Logs:

[root@wn21229 ~]# tail -1200 /var/log/messages | grep sft
Oct 13 09:55:18 wn21229 cvmfs2: (sft-nightlies.cern.ch) switched to catalog revision 116458
Oct 13 10:11:46 wn21229 cvmfs2: (sft-nightlies.cern.ch) switched to catalog revision 116460
Oct 13 10:19:25 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: unmounted /cvmfs/sft-nightlies.cern.ch (sft-nightlies.cern.ch)
Oct 13 10:21:09 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: running in debug mode
Oct 13 10:21:09 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.11:3128 (set proxies)
Oct 13 10:21:09 wn21229 cvmfs2: (sft-nightlies.cern.ch) geographic order of servers retrieved from cernvmfs.gridpp.rl.ac.uk
Oct 13 10:21:09 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.11:3128 to http://132.195.124.10:3128 (geosort)
Oct 13 10:21:09 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.11:3128 (cloned)
Oct 13 10:21:09 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.11:3128 to DIRECT (set proxies)
Oct 13 10:21:09 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: linking /cvmfs/sft-nightlies.cern.ch to repository sft-nightlies.cern.ch
Oct 13 10:25:43 wn21229 cvmfs2: (sft-nightlies.cern.ch) switched to catalog revision 116461
Oct 13 10:49:25 wn21229 cvmfs2: (sft-nightlies.cern.ch) switched to catalog revision 116462
Oct 13 10:49:25 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: unmounted /cvmfs/sft-nightlies.cern.ch (sft-nightlies.cern.ch)
Oct 13 10:51:43 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: running in debug mode
Oct 13 10:51:44 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.5:3128 (set proxies)
Oct 13 10:51:44 wn21229 cvmfs2: (sft-nightlies.cern.ch) geographic order of servers retrieved from cvmfs-stratum-one.cern.ch
Oct 13 10:51:44 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.5:3128 to http://132.195.124.10:3128 (geosort)
Oct 13 10:51:44 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.10:3128 (cloned)
Oct 13 10:51:44 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.10:3128 to DIRECT (set proxies)
Oct 13 10:51:44 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: linking /cvmfs/sft-nightlies.cern.ch to repository sft-nightlies.cern.ch
Oct 13 11:01:23 wn21229 cvmfs2: (sft-nightlies.cern.ch) switched to catalog revision 116463
Oct 13 11:13:57 wn21229 cvmfs2: (sft-nightlies.cern.ch) switched to catalog revision 116464
Oct 13 11:21:14 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: unmounted /cvmfs/sft-nightlies.cern.ch (sft-nightlies.cern.ch)
Oct 13 11:28:51 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: running in debug mode
Oct 13 11:28:51 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.11:3128 (set proxies)
Oct 13 11:28:51 wn21229 cvmfs2: (sft-nightlies.cern.ch) geographic order of servers retrieved from cvmfs-stratum-one.ihep.ac.cn
Oct 13 11:28:51 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.11:3128 to http://132.195.124.10:3128 (geosort)
Oct 13 11:28:51 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.5:3128 (cloned)
Oct 13 11:28:51 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.5:3128 to DIRECT (set proxies)
Oct 13 11:28:52 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: linking /cvmfs/sft-nightlies.cern.ch to repository sft-nightlies.cern.ch
Oct 13 11:38:46 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: unmounted /cvmfs/sft-nightlies.cern.ch (sft-nightlies.cern.ch)
Oct 13 11:41:10 wn21229 cvmfs2: (sft.cern.ch) Attempt to send fuse connection info to new mount (via /var/cache/cvmfs/shared/fusefd.orIY7t/sock)
Oct 13 11:41:10 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: running in debug mode
Oct 13 11:41:11 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.11:3128 (set proxies)
Oct 13 11:41:11 wn21229 cvmfs2: (sft-nightlies.cern.ch) geographic order of servers retrieved from cvmfs-s1.hpc.swin.edu.au
Oct 13 11:41:11 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.11:3128 to http://132.195.124.5:3128 (geosort)
Oct 13 11:41:11 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.11:3128 (cloned)
Oct 13 11:41:11 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.11:3128 to DIRECT (set proxies)
Oct 13 11:41:11 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: linking /cvmfs/sft-nightlies.cern.ch to repository sft-nightlies.cern.ch
Oct 13 11:44:01 wn21229 cvmfs2: (sft.cern.ch) Attempt to send fuse connection info to new mount (via /var/cache/cvmfs/shared/fusefd.d7Yc3r/sock)
Oct 13 11:45:49 wn21229 cvmfs2: (sft.cern.ch) Attempt to send fuse connection info to new mount (via /var/cache/cvmfs/shared/fusefd.Smrpw9/sock)
Oct 13 11:49:29 wn21229 cvmfs2: (sft.cern.ch) Attempt to send fuse connection info to new mount (via /var/cache/cvmfs/shared/fusefd.1LM5uj/sock)
Oct 13 11:53:27 wn21229 cvmfs2: (sft.cern.ch) Attempt to send fuse connection info to new mount (via /var/cache/cvmfs/shared/fusefd.lHcnwF/sock)
Oct 13 11:56:14 wn21229 cvmfs2: (sft.cern.ch) Attempt to send fuse connection info to new mount (via /var/cache/cvmfs/shared/fusefd.wSbOko/sock)
Oct 13 11:59:56 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: unmounted /cvmfs/sft-nightlies.cern.ch (sft-nightlies.cern.ch)
Oct 13 13:18:57 wn21229 cvmfs2: (sft.cern.ch) Attempt to send fuse connection info to new mount (via /var/cache/cvmfs/shared/fusefd.GTZzTW/sock)
Oct 13 13:19:32 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: running in debug mode
Oct 13 13:19:47 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: running in debug mode
Oct 13 13:20:24 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.10:3128 (set proxies)
Oct 13 13:20:24 wn21229 cvmfs2: (sft-nightlies.cern.ch) geographic order of servers retrieved from cvmfs-s1fnal.opensciencegrid.org
Oct 13 13:20:24 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.10:3128 (cloned)
Oct 13 13:20:24 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.10:3128 to DIRECT (set proxies)
Oct 13 13:20:24 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: linking /mnt/test to repository sft-nightlies.cern.ch
Oct 13 13:20:42 wn21229 cvmfs2: (sft-nightlies.cern.ch) looks like cvmfs has been crashed previously
Oct 13 13:20:42 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.5:3128 (set proxies)
Oct 13 13:20:42 wn21229 cvmfs2: (sft-nightlies.cern.ch) geographic order of servers retrieved from sampacs01.if.usp.br
Oct 13 13:20:42 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.5:3128 to http://132.195.124.10:3128 (geosort)
Oct 13 13:20:42 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from (none) to http://132.195.124.11:3128 (cloned)
Oct 13 13:20:42 wn21229 cvmfs2: (sft-nightlies.cern.ch) switching proxy from http://132.195.124.11:3128 to DIRECT (set proxies)
Oct 13 13:20:45 wn21229 cvmfs2: (sft.cern.ch) Attempt to send fuse connection info to new mount (via /var/cache/cvmfs/shared/fusefd.MA45oL/sock)
Oct 13 13:21:10 wn21229 cvmfs2: (sft-nightlies.cern.ch) CernVM-FS: linking /cvmfs/sft-nightlies.cern.ch to repository sft-nightlies.cern.ch
Oct 13 13:22:30 wn21229 cvmfs2: (sft.cern.ch) Attempt to send fuse connection info to new mount (via /var/cache/cvmfs/shared/fusefd.23y7Pi/sock)

Cheers

Torsten

@jblomer
Member

jblomer commented Oct 13, 2023

Ok, then I think it is exactly the problem of CVM-2004.

There must be a containerized job with a container runtime that does not keep an open file descriptor to the directory of the container root file system. After the job/container has moved into its own mount namespace, the cvmfs repository looks unused from the point of view of autofs. autofs will then unmount the file system, but the fuse process stays alive, serving the containerized job, and it prevents another cvmfs process from remounting the repository.

You can check for such "zombie processes" with the cvmfs_zombie utility. Just compile it with g++ -o find_zombie find_zombie.cc.
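
To give a rough idea of what the check does, a simplified shell sketch of the same idea (not the actual find_zombie.cc): compare each process' mount namespace against PID 1 and look for cvmfs fuse mounts inside foreign namespaces.

# rough sketch of the zombie check, assuming a /proc-based comparison
ref=$(readlink /proc/1/ns/mnt)
for p in /proc/[0-9]*; do
  pid=${p#/proc/}
  ns=$(readlink "$p/ns/mnt" 2>/dev/null) || continue
  [ "$ns" = "$ref" ] && continue            # same namespace as init: not a candidate
  if grep -qs '^cvmfs2 ' "$p/mounts"; then  # cvmfs fuse mount still visible there
    echo "@$pid $ns $(readlink "$p/exe" 2>/dev/null)"
  fi
done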

Some versions of apptainer showed this problem but recent versions should work around it (perhaps @DrDaveD can comment).

This problem has been properly fixed as of the EL9 kernel / fuse module. As a workaround, you can try static fstab cvmfs mounts or very long autofs timeouts.
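
To illustrate the workaround (a sketch; exact options should be checked against the client documentation for your setup):

# static mount instead of autofs (autofs should then no longer manage this path)
mkdir -p /cvmfs/sft.cern.ch
echo 'sft.cern.ch /cvmfs/sft.cern.ch cvmfs defaults,_netdev,nodev 0 0' >> /etc/fstab
mount /cvmfs/sft.cern.ch

# or keep autofs but with a very long timeout, e.g. on EL7 in /etc/sysconfig/autofs:
#   OPTIONS="--timeout=86400"
# followed by: systemctl restart autofs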

@jblomer jblomer self-assigned this Oct 13, 2023
@harenber
Author

Hi Jakob,

thanks a lot for the analysis! That gives us the opportunity to implement a workaround until we upgrade to Alma 9 (which will be soon; we are waiting for our administration to place the order with the cluster vendor).

Just to verify the theory, here is the output of the cvmfs_zombie tool.

[root@wn21229 tmp]# g++ -o find_zombie find_zombie.cc 
[root@wn21229 tmp]# ./find_zombie 
Current mount namespace is mnt:[4026531840]
@2172 / mnt:[4026533758] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@11037 / mnt:[4026533833] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@15324 / mnt:[4026533922] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@20790 / mnt:[4026533874] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@21704 / mnt:[4026533902] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@22259 / mnt:[4026533952] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@22913 / mnt:[4026533959] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@26963 / mnt:[4026534121] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@27671 / mnt:[4026533965] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@38869 / mnt:[4026533768] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@39299 / mnt:[4026533778] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@41700 / mnt:[4026534022] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@44115 / mnt:[4026533999] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@44126 / mnt:[4026533940] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@56684 / mnt:[4026533842] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): unpacked.cern.ch
@59803 / mnt:[4026533963] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@60250 / mnt:[4026533734] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@60802 / mnt:[4026534034] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@62713 / mnt:[4026534007] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@66216 / mnt:[4026533845] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@79756 / mnt:[4026533979] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@79839 / mnt:[4026534044] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@92633 / mnt:[4026533927] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@95633 / mnt:[4026533850] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@98253 / mnt:[4026533820] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): unpacked.cern.ch
@106098 / mnt:[4026533863] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@109577 / mnt:[4026534027] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@113296 / mnt:[4026533835] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@113723 / mnt:[4026533984] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@132989 / mnt:[4026533795] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): unpacked.cern.ch
@133064 / mnt:[4026534098] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@135704 / mnt:[4026534114] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@141083 / mnt:[4026534037] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@143574 / mnt:[4026533884] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@149313 / mnt:[4026534077] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@149643 / mnt:[4026533930] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@149725 / mnt:[4026533869] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@156640 / mnt:[4026533900] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@156980 / mnt:[4026533989] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@159176 / mnt:[4026533859] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@161912 / mnt:[4026533914] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@162379 / mnt:[4026534085] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@166602 / mnt:[4026534081] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@169191 / mnt:[4026534011] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@175819 / mnt:[4026533971] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@178814 / mnt:[4026534053] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@180833 / mnt:[4026534016] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@187205 / mnt:[4026534127] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@188636 / mnt:[4026533993] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@192910 / mnt:[4026534090] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@212156 / mnt:[4026534069] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@213235 / mnt:[4026534064] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@213479 / mnt:[4026534123] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@219468 / mnt:[4026533802] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@225192 / mnt:[4026534112] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@226543 / mnt:[4026534125] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@240122 / mnt:[4026534101] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@240926 / mnt:[4026533970] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@245109 / mnt:[4026533887] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@246095 / mnt:[4026534106] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@246737 / mnt:[4026534119] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@251004 / mnt:[4026533877] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): unpacked.cern.ch
@258980 / mnt:[4026534078] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch
@259349 / mnt:[4026533980] (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter): atlas.cern.ch

And we currently have

[root@wn21229 tmp]# rpm -qi apptainer
Name        : apptainer
Version     : 1.2.2
Release     : 1.el7
Architecture: x86_64
Install Date: Fri 22 Sep 2023 09:36:10 AM CEST
Group       : Unspecified
Size        : 120195813
License     : BSD and LBNL BSD and ASL 2.0
Signature   : RSA/SHA256, Thu 27 Jul 2023 08:26:23 PM CEST, Key ID 6a2faea2352c64e5
Source RPM  : apptainer-1.2.2-1.el7.src.rpm
Build Date  : Thu 27 Jul 2023 07:52:15 PM CEST
Build Host  : buildvm-x86-12.iad2.fedoraproject.org
Relocations : (not relocatable)
Packager    : Fedora Project
Vendor      : Fedora Project
URL         : https://apptainer.org
Bug URL     : https://bugz.fedoraproject.org/apptainer
Summary     : Application and environment virtualization formerly known as Singularity
Description :
Apptainer provides functionality to make portable
containers that can be used across host environments.
[root@wn21229 tmp]# 

Just saw that 1.2.4 was released two days ago by @DrDaveD.

Thanks for the help

Torsten and the Wuppertal crew

@thdesy

thdesy commented Oct 16, 2023

Hi all,

we just made a quick check with find_zombie on some of our nodes at DESY-HH and also noticed quite a number of processes/file handles. These have not been an issue so far, as we block the "large" repos like atlas.cern.ch and cms.cern.ch from unmounting and keep them continuously alive.

With the current job mix, it primarily affects a Python bin path for ATLAS and the OSG apptainer exec for CMS jobs - probably just a matter of statistics with the paths used by each ATLAS/CMS job?

[root@batch1401 tmp]# /tmp/find_zombie | cut -d " " -f 4 | sort | uniq -c
      9 (/bin/bash):
      9 (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter):
    171 (/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/Python/2.7.10-806f9/x86_64-slc6-gcc49-opt/bin/python2.7):
      1 (/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_2_16_UL/bin/slc7_amd64_gcc700/cmsRun):
      3 (/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_17/bin/slc7_amd64_gcc700/cmsRun):
      8 (/cvmfs/oasis.opensciencegrid.org/mis/apptainer/1.2.3/x86_64/libexec/apptainer/libexec/starter):
      1 is
[root@batch1402 ~]# /tmp/find_zombie | cut -d " " -f 4 | sort | uniq -c
      8 (/bin/bash):
      8 (/cvmfs/atlas.cern.ch/repo/containers/sw/apptainer/x86_64-el7/1.2.2/libexec/apptainer/bin/starter):
    152 (/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/Python/2.7.10-806f9/x86_64-slc6-gcc49-opt/bin/python2.7):
      6 (/cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_13/bin/el8_amd64_gcc11/cmsRun):
      2 (/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_17/bin/slc7_amd64_gcc700/cmsRun):
      1 (/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_30/bin/slc7_amd64_gcc700/cmsRun):
      1 (/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_37/bin/slc7_amd64_gcc700/cmsRun):
      1 (/cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre10/bin/slc7_amd64_gcc820/cmsRun):
     22 (/cvmfs/oasis.opensciencegrid.org/mis/apptainer/1.2.3/x86_64/libexec/apptainer/libexec/starter):
      1 is
      3 (/usr/bin/bash):
      1 (/usr/bin/coreutils):

Cheers,
Thomas

@DrDaveD
Contributor

DrDaveD commented Oct 16, 2023

I am surprised that the zombie command only shows atlas.cern.ch and unpacked.cern.ch repositories. Are those having the same problem? I thought it was sft.cern.ch. The cvmfs_config fuser sft.cern.ch command should point you to the processes that are preventing the repository from completely unmounting. You can recover by killing those processes.
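
For example (a sketch; review the listed processes before terminating anything, and the PID is of course a placeholder):

# show the processes keeping sft.cern.ch busy in hidden mount namespaces
cvmfs_config fuser sft.cern.ch
# then terminate the offending processes, e.g.
# kill <pid>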

There's no known problem with this in apptainer-1.2.2. However any container application can easily cause the problem by exiting without making sure all child processes have exited. So if possible please bring it to the attention of the application owner.

@vvolkl
Contributor

vvolkl commented Mar 5, 2024

Hi @harenber, did you switch to EL9 in the meantime? Can you confirm whether this fixed the issue?

@harenber
Author

harenber commented Mar 5, 2024

Hi @vvolkl, unfortunately not yet; we're still waiting for our vendor to come up with a schedule. I'll update here as soon as we have made the switch.

@harenber
Author

We now have an appointment with our vendor. We'll upgrade from May 22nd to June 5th.
