Stuck cvmfs processes with osgstorage.org #3378

Open
vvolkl opened this issue Aug 30, 2023 · 8 comments
vvolkl commented Aug 30, 2023

We've seen a few different, but seemingly related reports of issues with osgstorage:

I'm using this issue just to keep track. The authorization helper is probably the first point of investigation.

vvolkl commented Jan 25, 2024

There's been another instance of this problem at RAL that I could take a look at, thanks to Tom Birkett. The affected repository was sbn.osgstorage.org, and the symptoms were the same as previously reported: the mount process was unkillable and unresponsive, and attaching gdb or doing any filesystem operation on /cvmfs/sbn.osgstorage.org would hang indefinitely.

The system log shows this message around the time the problem occurs:

23 22:05:18 10064 cvmfs2[2953117]:  (sbn.osgstorage.org) ) switching host from http://xcachevirgo.pic.es:8000/ to http://cf-ac-uk-cache.nationalreserchplatform.org:8000/ (host serving data too slowly)

What's new is that we found that we could get the node unstuck by manually aborting the fuse connection. This can be done easily with

sudo sh -c 'echo 1 > /sys/fs/fuse/connections/<device id>/abort'

The tricky part is that in this state neither stat -c %d /cvmfs/sbn.osgstorage.org nor cvmfs_talk -i sbn.osgstorage.org device id can be used to find <device id>. However, the stuck filesystem is likely to have a nonzero value in /sys/fs/fuse/connections/<device id>/waiting, so the procedure is to read /sys/fs/fuse/connections/*/waiting and abort the one which is non-zero.
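
For reference, a minimal sketch of that procedure as a shell loop (the non-zero waiting counter is a heuristic, so on a node with several fuse mounts double-check which connection you are about to abort):

for conn in /sys/fs/fuse/connections/*; do
    waiting=$(cat "$conn/waiting")
    if [ "$waiting" -ne 0 ]; then
        echo "aborting fuse connection $conn (waiting=$waiting)"
        echo 1 | sudo tee "$conn/abort" > /dev/null
    fi
done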

Doing a directory listing on /cvmfs/sbn.osgstorage.org should then report a "Transport endpoint is not connected" error, and the repository can be unmounted (and remounted).
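
For completeness, one way the unmount/remount can be done afterwards, assuming autofs-managed mounts as in our setups (repository name is just the example from above):

sudo umount /cvmfs/sbn.osgstorage.org
cvmfs_config probe sbn.osgstorage.org   # triggers a fresh automount and checks that it works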

DrDaveD commented Jan 25, 2024

Very interesting that this was a case that did not involve authentication, and that you found a way to get it unstuck. Of course it would be best to figure out how to prevent it from happening, but meanwhile there should probably be a way to do this either automatically in the watchdog process or via a cvmfs_config command.

vvolkl commented Jan 30, 2024

Yes, agreed that we might want to add an option to our scripts to do this. An elegant way to remember the device number, suggested by Laura, is to store it in an environment variable of the process, so that it can be retrieved robustly even in the stuck state. We'll update cvmfs_config with an option to do this.
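
The attraction of the environment variable is that it can be read back without touching the (possibly hung) mountpoint, e.g. via /proc. A rough sketch; the variable name CVMFS_FUSE_DEVICE_ID is hypothetical, not an existing client parameter:

# find the cvmfs2 pid without stat()ing anything under /cvmfs
PID=$(pgrep -f 'cvmfs2 .*sbn.osgstorage.org' | head -1)
# the process environment is readable even while the filesystem hangs
sudo cat /proc/$PID/environ | tr '\0' '\n' | grep '^CVMFS_FUSE_DEVICE_ID='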

DrDaveD commented Feb 6, 2024

I investigated another case today at CIT. This one was authenticated, ligo.storage.igwn.org (which is configured exactly like ligo.osgstorage.org). I didn't think to try your trick to abort the FUSE connection, but I managed to get it unstuck a different way (described below).

I collected a bugreport tarball (as usual I had to kill a couple of processes that got stuck) and found relevant messages in old /var/log/messages files from two weeks ago and one week ago. The cvmfs2 processes were still running, so I tried to attach gdb, but found that a gdb -p was already running as a child of the watchdog process, attached to the main cvmfs2 process and itself stuck. I killed that gdb, and both cvmfs2 processes exited. It also resulted in this log message:

Feb  6 07:28:47 node2131.cluster.ldas.cit cvmfs2[571028]: (ligo.storage.igwn.org) --#012Signal: 11, errno: 0, version: 2.11.2, PID: 571041#012Executable path: /usr/bin/cvmfs2#012failed to start gdb/lldb (-1 bytes written, errno 32)#012#012Timestamp: Tue Feb  6 07:28:47 2024#012#012 Crash logged also on file: stacktrace.ligo.storage.igwn.org

But that stacktrace file, /var/lib/cvmfs/osgstorage/shared/stacktrace.ligo.storage.igwn.org, contained only the same message.

At this point the mount was still stuck. I removed /var/lib/cvmfs/osgstorage/shared/*ligo.storage* and did systemctl restart autofs. The restart took a long time, but after that the mount worked again. There were a ton of stuck processes at first, but most of them got cleaned up automatically, and I was then able to kill the couple that remained in the background.
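
For reference, those recovery steps in shell form (paths match this particular deployment, where the cache directory is /var/lib/cvmfs/osgstorage):

sudo rm -f /var/lib/cvmfs/osgstorage/shared/*ligo.storage*
sudo systemctl restart autofs       # took a long time here
ls /cvmfs/ligo.storage.igwn.org     # verify the repository mounts again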

jblomer commented Feb 6, 2024

That's an interesting lead! I think that would add up: the main process talks to the watchdog in its signal handler, so if it hangs there, it may well hang in the observed way. The watchdog tries to gather the stack trace and fails (waits forever). The main process has a 30s limit in its signal handler to prevent the watchdog from completely blocking the abort process, but for some reason this may not have worked. Signal 11 is SIGSEGV (segmentation fault), so perhaps the memory of the main process got corrupted.

jblomer commented Feb 6, 2024

A few checks

  • Is there an indication that the main process is busy waiting (consuming 100% or more CPU time)?
  • Can we try to just send a SIGSEGV to a cvmfs fuse pid on one of the nodes, to see whether stack trace generation works in general? (See the sketch below.)
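
A possible way to run that test on a node where a brief interruption of one repository is acceptable (repository name and stacktrace location are just examples taken from this thread):

REPO=sbn.osgstorage.org                      # example repository
PID=$(pgrep -f "cvmfs2 .*$REPO" | head -1)   # fuse client process for that repository
sudo kill -SEGV "$PID"
# afterwards, look for the stacktrace file in the repository's shared cache directory, e.g.
ls /var/lib/cvmfs/*/shared/stacktrace.$REPO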

DrDaveD commented Feb 7, 2024

No, it wasn't using up CPU. I have now investigated 4 more nodes in the same cluster stuck on the same repository. I was able to clean up all of them using the fuse connection abort above, after collecting info. On two of them cvmfs2 wasn't running, and the other two had the same stuck gdb process. I also collected the /proc/NNN/stack of the stuck cvmfs2 process and did an ls -lR on /proc/NNN. I put all the files, including the bugreport tarball, in this tarball.
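
For reference, the per-node collection was along these lines (the pid below is a hypothetical stand-in for the stuck cvmfs2 process):

PID=571041                                  # substitute the pid of the stuck cvmfs2 process
sudo cat /proc/$PID/stack > cvmfs2-$PID.kernel-stack
sudo ls -lR /proc/$PID > cvmfs2-$PID.proc-listing 2>&1
sudo cvmfs_config bugreport                 # produces the bugreport tarball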

DrDaveD commented Feb 8, 2024

Can we try to just send a SIGSEGV to a cvmfs fuse pid on one of the nodes, to see whether stack trace generation works in general?

I have seen stack traces in that cluster from nodes that got signal 6 from the out of memory killer. Is that enough of a test? Is the stacktrace in the latest tarball helpful?
