Stuck cvmfs processes with osgstorage.org #3378

Open
vvolkl opened this issue Aug 30, 2023 · 8 comments
vvolkl commented Aug 30, 2023

We've seen a few different, but seemingly related reports of issues with osgstorage:

I'm using this issue just to keep track. The authorization helper is probably the first point of investigation.

vvolkl commented Jan 25, 2024

There's been another instance of this problem at RAL that I could take a look at, thanks to Tom Birkett. The affected repository was sbn.osgstorage.org, and the symptoms were the same as previously reported: the mount process was unkillable and unresponsive, and attaching gdb or doing any filesystem operation on /cvmfs/sbn.osgstorage.org would hang indefinitely.

The system log shows this message around the time the problem occurs:

23 22:05:18 10064 cvmfs2[2953117]:  (sbn.osgstorage.org) ) switching host from http://xcachevirgo.pic.es:8000/ to http://cf-ac-uk-cache.nationalreserchplatform.org:8000/ (host serving data too slowly)

What's new is that we found that we could get the node unstuck by manually aborting the fuse connection. This can be done easily with

sudo sh -c 'echo 1 > /sys/fs/fuse/connections/<device id>/abort'

The tricky part is that in this state neither stat -c %d /cvmfs/sbn.osgstorage.org nor cvmfs_talk -i sbn.osgstorage.org device id can be used to find <device id>. However, the stuck filesystem is likely to have a nonzero value in /sys/fs/fuse/connections/<device id>/waiting, so the procedure is to read /sys/fs/fuse/connections/*/waiting and abort the one which is non-zero.
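
For reference, a minimal sketch of that procedure as a shell loop (the non-zero waiting counter is a heuristic, so on a node with several fuse mounts double-check which connection you are about to abort):

for conn in /sys/fs/fuse/connections/*; do
    waiting=$(cat "$conn/waiting")
    if [ "$waiting" -ne 0 ]; then
        echo "aborting fuse connection $conn (waiting=$waiting)"
        echo 1 | sudo tee "$conn/abort" > /dev/null
    fi
done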

Doing a directory listing on /cvmfs/sbn.osgstorage.org should then report a "Transport endpoint is not connected" error, and the repository can be unmounted (and remounted).
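
For completeness, one way the unmount/remount can be done afterwards, assuming autofs-managed mounts as in our setups (repository name is just the example from above):

sudo umount /cvmfs/sbn.osgstorage.org
cvmfs_config probe sbn.osgstorage.org   # triggers a fresh automount and checks that it works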

DrDaveD commented Jan 25, 2024

Very interesting that this was a case that did not involve authentication, and that you found a way to get it unstuck. Of course it would be best to figure out how to prevent it from happening, but meanwhile there should probably be a way to do this either automatically in the watchdog process or via a cvmfs_config command.

vvolkl commented Jan 30, 2024

Yes, agreed that we might want to add an option to our scripts to do this. An elegant way to remember the device number, suggested by Laura, is to store it in an environment variable of the process, so that it can be retrieved robustly even in the stuck state. We'll update cvmfs_config with an option to do this.
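
The attraction of the environment variable is that it can be read back without touching the (possibly hung) mountpoint, e.g. via /proc. A rough sketch; the variable name CVMFS_FUSE_DEVICE_ID is hypothetical, not an existing client parameter:

# find the cvmfs2 pid without stat()ing anything under /cvmfs
PID=$(pgrep -f 'cvmfs2 .*sbn.osgstorage.org' | head -1)
# the process environment is readable even while the filesystem hangs
sudo cat /proc/$PID/environ | tr '\0' '\n' | grep '^CVMFS_FUSE_DEVICE_ID='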

DrDaveD commented Feb 6, 2024

I investigated another case today at CIT. This one was authenticated, ligo.storage.igwn.org (which is configured exactly like ligo.osgstorage.org). I didn't think to try your trick to abort the FUSE connection, but I managed to get it unstuck a different way (described below).

I collected a bugreport tarball (as usual I had to kill a couple of processes that got stuck) and found relevant messages in old /var/log/messages files from two weeks ago and one week ago. The cvmfs2 processes were still running, so I tried to attach gdb, but found that a gdb -p was already running as a child of the watchdog process, attached to the main cvmfs2 process and itself stuck. I killed that gdb, and both cvmfs2 processes exited. It also resulted in this log message:

Feb  6 07:28:47 node2131.cluster.ldas.cit cvmfs2[571028]: (ligo.storage.igwn.org) --#012Signal: 11, errno: 0, version: 2.11.2, PID: 571041#012Executable path: /usr/bin/cvmfs2#012failed to start gdb/lldb (-1 bytes written, errno 32)#012#012Timestamp: Tue Feb  6 07:28:47 2024#012#012 Crash logged also on file: stacktrace.ligo.storage.igwn.org

But that stacktrace file, /var/lib/cvmfs/osgstorage/shared/stacktrace.ligo.storage.igwn.org, contained only the same message.

At this point the mount was still stuck. I removed /var/lib/cvmfs/osgstorage/shared/*ligo.storage* and did systemctl restart autofs. The restart took a long time, but after that the mount worked again. There were a ton of stuck processes at first, but most of them got cleaned up automatically, and I was then able to kill the couple that remained in the background.
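
For reference, those recovery steps in shell form (paths match this particular deployment, where the cache directory is /var/lib/cvmfs/osgstorage):

sudo rm -f /var/lib/cvmfs/osgstorage/shared/*ligo.storage*
sudo systemctl restart autofs       # took a long time here
ls /cvmfs/ligo.storage.igwn.org     # verify the repository mounts again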

jblomer commented Feb 6, 2024

That's an interesting lead! I think that would add up: the main process talks to the watchdog in its signal handler, so if it hangs there, it may well hang in the observed way. The watchdog tries to gather the stack trace and fails (waits forever). The main process has a 30s limit in its signal handler to prevent the watchdog from completely blocking the abort process, but for some reason this may not have worked. Signal 11 is SIGSEGV (segmentation fault), so perhaps the memory of the main process got corrupted.

jblomer commented Feb 6, 2024

A few checks

  • Is there an indication that the main process is busy waiting (consuming 100% or more CPU time)?
  • Can we try to just send a SIGSEGV to a cvmfs fuse pid on one of the nodes, to see whether stack trace generation works in general? (See the sketch below.)
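
A possible way to run that test on a node where a brief interruption of one repository is acceptable (repository name and stacktrace location are just examples taken from this thread):

REPO=sbn.osgstorage.org                      # example repository
PID=$(pgrep -f "cvmfs2 .*$REPO" | head -1)   # fuse client process for that repository
sudo kill -SEGV "$PID"
# afterwards, look for the stacktrace file in the repository's shared cache directory, e.g.
ls /var/lib/cvmfs/*/shared/stacktrace.$REPO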

DrDaveD commented Feb 7, 2024

No, it wasn't using up CPU. I have now investigated 4 more nodes in the same cluster stuck on the same repository. I was able to clean up all of them using the fuse connection abort above, after collecting info. On two of them cvmfs2 wasn't running, and the other two had the same stuck gdb process. I also collected the /proc/NNN/stack of the stuck cvmfs2 process and did an ls -lR on /proc/NNN. I put all the files, including the bugreport tarball, in this tarball.
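
For reference, the per-node collection was along these lines (the pid below is a hypothetical stand-in for the stuck cvmfs2 process):

PID=571041                                  # substitute the pid of the stuck cvmfs2 process
sudo cat /proc/$PID/stack > cvmfs2-$PID.kernel-stack
sudo ls -lR /proc/$PID > cvmfs2-$PID.proc-listing 2>&1
sudo cvmfs_config bugreport                 # produces the bugreport tarball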

DrDaveD commented Feb 8, 2024

Can we try to just send a SIGSEGV to a cvmfs fuse pid on one of the nodes, to see whether stack trace generation works in general?

I have seen stack traces in that cluster from nodes that got signal 6 from the out of memory killer. Is that enough of a test? Is the stacktrace in the latest tarball helpful?
