Stuck cvmfs processes with osgstorage.org #3378
Comments
There's been another instance of this problem at RAL that I could take a look at, thanks to Tom Birkett. The affected repository was sbn.osgstorage.org, and the symptoms were the same as previously reported: the mount process was unkillable and unresponsive, and attaching gdb or performing any filesystem operation on /cvmfs/sbn.osgstorage.org would hang indefinitely. The system log shows this message around the time the problem occurs:
What's new is that we found we could get the node unstuck by manually aborting the FUSE connection. This can be done easily with `sudo sh -c 'echo 1 > /sys/fs/fuse/connections/<device id>/abort'`. The tricky part is that in this state neither … Doing a directory listing on /cvmfs/sbn.osgstorage.org should then report a …
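The abort write above can be scripted. The sketch below assumes (this is not stated in the thread) that the directory name under `/sys/fs/fuse/connections/` equals the minor device number of the FUSE mount, which appears in the `major:minor` field of `/proc/self/mountinfo`:

```python
# Sketch: abort the FUSE connection backing a stuck mountpoint.
# Assumption: the directory name under /sys/fs/fuse/connections/ is the
# minor device number of the mount, taken from /proc/self/mountinfo.

def fuse_minor(mountinfo_text, mountpoint):
    """Return the minor device number for `mountpoint`, or None."""
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # mountinfo fields: id, parent id, major:minor, root, mount point, ...
        if len(fields) >= 5 and fields[4] == mountpoint:
            return fields[2].split(":")[1]
    return None

def abort_fuse_connection(mountpoint):
    """Forcibly disconnect the FUSE daemon for `mountpoint` (needs root)."""
    with open("/proc/self/mountinfo") as f:
        minor = fuse_minor(f.read(), mountpoint)
    if minor is None:
        raise RuntimeError("mountpoint not found: " + mountpoint)
    with open("/sys/fs/fuse/connections/%s/abort" % minor, "w") as f:
        f.write("1")
```

Running `abort_fuse_connection("/cvmfs/sbn.osgstorage.org")` as root should then be equivalent to the manual `echo 1` above.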
Very interesting that this was a case that did not involve authentication, and that you found a way to get it unstuck. Of course it would be best to figure out how to prevent it from happening, but meanwhile there should probably be a way to do this either automatically in the watchdog process or via a cvmfs_config command.
Yes, agreed that we might want to add an option to our scripts to do this. An elegant way to remember the device number, suggested by Laura, is to set it as an environment variable of the process, so that it can be retrieved robustly even in the stuck state. We'll update cvmfs_config with an option to do this.
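Laura's trick works because `/proc/<pid>/environ` is served by the kernel and records the process's initial environment, so it stays readable even when the process itself no longer responds. A minimal sketch, with a hypothetical variable name `CVMFS_FUSE_DEVICE` (the real name, if any, is up to the cvmfs_config change):

```python
# Sketch: recover a device id stashed in a stuck process's environment.
# The variable name CVMFS_FUSE_DEVICE is hypothetical.

def env_from_environ(environ_bytes):
    """Parse /proc/<pid>/environ content (NUL-separated KEY=VALUE pairs)."""
    env = {}
    for entry in environ_bytes.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode()] = value.decode()
    return env

def stuck_device_id(pid):
    """Read the device id back from a (possibly stuck) cvmfs2 process."""
    with open("/proc/%d/environ" % pid, "rb") as f:
        return env_from_environ(f.read()).get("CVMFS_FUSE_DEVICE")
```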
I investigated another case today at CIT. This one was authenticated: ligo.storage.igwn.org (which is configured exactly like ligo.osgstorage.org). I didn't think to try your trick of aborting the FUSE connection, but I managed to get it unstuck a different way (described below). I collected a bugreport tarball (as usual I had to kill a couple of processes that got stuck) and found relevant messages in rotated /var/log/messages files from two weeks and one week earlier. The cvmfs2 processes were still running, so I tried to run gdb on the main one but found that there was already a `gdb -p`, spawned as a child of the watchdog process and attached to the main cvmfs2 process, itself stuck. I killed that gdb, and both cvmfs2 processes exited. It also resulted in this log message:
But only the same message was in … At this point the mount was still stuck. I removed …
That's an interesting lead! I think that would add up: the main process talks to the watchdog in its signal handler, so if it hangs there, it may well hang in the observed way. The watchdog tries to gather the stack trace and fails (waits forever). The main process has a 30s limit in its signal handler to prevent the watchdog from completely blocking the abort process, but for some reason that may not have worked. Signal 11 is SIGSEGV (segmentation fault), so perhaps the memory of the main process got corrupted.
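The 30s limit described here can be illustrated (this is a sketch, not the actual cvmfs implementation) as a bounded wait on the watchdog's acknowledgement pipe, so a hung watchdog cannot block the abort path forever:

```python
# Sketch (not the real cvmfs code): the crashing process waits at most
# `timeout_seconds` for the watchdog to acknowledge that the stack trace
# was gathered; on timeout the caller proceeds with the abort anyway.
import os
import select

def wait_for_watchdog(ack_fd, timeout_seconds=30.0):
    """Wait for one acknowledgement byte on the watchdog pipe.

    Returns True if the watchdog answered in time, False on timeout."""
    ready, _, _ = select.select([ack_fd], [], [], timeout_seconds)
    if not ready:
        return False
    return len(os.read(ack_fd, 1)) == 1
```

If the handler's own memory is corrupted (as the SIGSEGV suggests), even such a bounded wait can misbehave, which would match the hang observed here.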
A few checks
No, it wasn't using up CPU. I have now investigated 4 more nodes in the same cluster stuck on the same repository. I was able to clean up all of them using the FUSE connection abort above, after collecting info. On two of them cvmfs2 wasn't running, and the other two had the same stuck gdb process. I also collected /proc/NNN/stack from the stuck cvmfs2 processes and did an ls -lR on /proc/NNN. I put all the files, including the bugreport tarball, in this tarball.
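The per-process data gathered here is all standard Linux procfs; a small sketch of that collection step (file selection is my own, not the cvmfs bugreport tool's):

```python
# Sketch: gather what is still readable about a stuck process from
# procfs. /proc/<pid>/stack usually needs root; failures are recorded
# rather than raised, since a wedged process is exactly the case where
# some files may be unreadable.
import os

def collect_proc_info(pid):
    """Return a dict of procfs file name -> contents (or error note)."""
    base = "/proc/%d" % pid
    info = {}
    for name in ("stack", "status", "wchan"):
        path = os.path.join(base, name)
        try:
            with open(path) as f:
                info[name] = f.read()
        except OSError as err:
            info[name] = "unreadable: %s" % err
    return info
```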
I have seen stack traces in that cluster from nodes that got signal 6 from the out-of-memory killer. Is that enough of a test? Is the stack trace in the latest tarball helpful?
We've seen a few different but seemingly related reports of issues with osgstorage:
I'm using this issue just to keep track. The authorization helper is probably the first point of investigation.