Bug: deadlock in cvmfs2 process #3402
Hello Martin and Torsten, sadly with the current information you provided it's really difficult to extract the actual problem.
A few things I have seen so far:
And I found this error message (though right now I cannot see how it is related):
This happened 4 times in total. Always
Thanks for looking into this. We have the impression that the problem is triggered by some ATLAS jobs. The problem nearly vanished as quickly as it appeared, without any change to the system on our side. However, we wrote a small script which restarts CVMFS in case this happens:

```bash
#!/bin/bash
# If the test file in the sft repo is no longer readable, log the event,
# kill the cvmfs2 process, preserve its log, and restart autofs.
if ! [ -f /cvmfs/sft.cern.ch/lcg/lastUpdate ]; then
    echo "$(date -Im)" >> /var/log/cvmfslog/kill.log
    kill "$(cvmfs_talk -i sft.cern.ch pid)"
    cp /var/log/cvmfslog/cvmfs_sft.log "/var/log/cvmfslog/cvmfs_sft.log_$(date -Im)"
    systemctl restart autofs
fi
```

We finally got a log, which we uploaded to https://harenber.web.cern.ch/harenber/wn21164/. We're not sure how to interpret these. Cheers from Wuppertal, Martin and Torsten
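A follow-up below notes the check runs every 5 minutes; a minimal matching cron entry, assuming the script above is installed as /usr/local/sbin/cvmfs_watchdog.sh (a hypothetical path, and cron rather than a systemd timer is an assumption), would be:

```bash
# /etc/cron.d/cvmfs_watchdog (hypothetical file): run the check every 5 minutes.
*/5 * * * * root /usr/local/sbin/cvmfs_watchdog.sh
```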
With that latest log, there appears to be a long stretch of normal behavior at the end of the log. Can you provide a timestamp, or something to search for in the log, around the time the hanging behavior was observed?
Regarding the name mismatch: indeed, that indicates a misconfiguration on either a client or a stratum 1 for the sft-nightlies.cern.ch repository. It's very difficult to identify, though, because I searched the log for CVMFS_SERVER_URL settings including sft-nightlies.cern.ch and they all look normal, coming from /etc/cvmfs/domain.d/cern.ch.conf. I checked the .cvmfswhitelist on each of those with
and none of them indicated they were serving sft.cern.ch. It makes me wonder if the squid cache is somehow poisoned, but that seems rather unlikely.
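The exact command is missing above; a sketch of one way to check which repository a stratum 1 whitelist was signed for is to fetch .cvmfswhitelist over HTTP and print its repository-name line (the stratum 1 host below is only an example):

```bash
# The plain-text header of the whitelist contains a line "N<repository name>";
# -a forces grep to treat the partially binary whitelist as text.
curl -s http://cvmfs-stratum-one.cern.ch/cvmfs/sft-nightlies.cern.ch/.cvmfswhitelist \
    | grep -a '^N'
```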
Well, our script that dumps the log runs every 5 minutes if it can't access a test file in the sft repo anymore.
During the last 5 minutes there was very little activity and it looked entirely normal. I do see an access to
Regarding the name mismatch, the combined log does show it attempting to read from
Yeah, that fits our observation that the cvmfs2 process is completely dead: ptrace shows no activity at all anymore once this happens, and the backtrace shows it's waiting on a semaphore. We could try to enhance our "watchdog" script with some gdb magic or so; would that be helpful? Or do you have some other idea of what could be done? Cheers, Martin and Torsten
Do you maybe have the logs of the
Right now we only configured debug logs for
Is there a quick and easy way to switch on debug logs for a specific repo during operation?
Hi, as I said you can use
In the 2.11 client a simple reload should be enough to switch debug logging on and off; no remount needed.
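For reference, a minimal sketch of such a per-repository toggle, using the documented CVMFS_DEBUGLOG client parameter in a repository-specific config file (the log path below is illustrative):

```bash
# Turn on the debug log for sft.cern.ch only.
echo 'CVMFS_DEBUGLOG=/var/log/cvmfslog/cvmfs_sft_debug.log' \
    >> /etc/cvmfs/config.d/sft.cern.ch.local

# Apply to the running mount; no remount required.
cvmfs_config reload sft.cern.ch

# To turn it off, delete the CVMFS_DEBUGLOG line again and reload once more.
```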
Good to know that no reboot is required! I think we also didn't answer your question about the CVMFS cache size from that post; we'll get back to you once we have a new set of logs. A couple of points though:
I don't see how the cross-symbolic link could cause that crossover in the cvmfs requests. The problem is really bizarre, because the debug log shows that the sft.cern.ch cvmfs2 process is essentially idle when the lockup happens. I wonder if it could be something in the shared cache that it is getting stuck on, but in that case I would expect the log messages to show it beginning to communicate with the shared cache process and never returning. I think the gdb stack trace you provided is misleading because it only showed one thread. Please try to get another one from a hanging cvmfs2 process using
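The requested command did not survive extraction; one common non-interactive way to capture backtraces of all threads, reusing the cvmfs_talk PID lookup from the watchdog script above, is:

```bash
# Dump backtraces of every thread of the hanging cvmfs2 process to a file.
gdb -batch -ex 'thread apply all bt' -p "$(cvmfs_talk -i sft.cern.ch pid)" \
    > /tmp/cvmfs_sft_backtrace.txt 2>&1
```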
Took a while, but here we have one:
Thank you for the backtrace! It does not really seem to be a deadlock; rather, this cvmfs2 process is idling. The problem may be related to the use of containers. On this node, could you please
That should fail, but we are interested in the output on the terminal.
Hi Jakob, here is the output:
Logs:
Cheers, Torsten
Ok, then I think it is exactly the problem of CVM-2004. There must be a containerized job with a container runtime that does not keep an open file descriptor to the directory of the container's root file system. After the job/container moves into its own namespace, the cvmfs repository looks unused from the point of view of autofs. autofs will then unmount the file system, but the fuse process stays alive, serving the containerized job, and it prevents another cvmfs2 process from remounting the repository. You can check for such "zombie processes" with the cvmfs_zombie utility; just compile it with
Some versions of apptainer showed this problem, but recent versions should work around it (perhaps @DrDaveD can comment). This problem has been properly fixed as of the EL9 kernel / fuse module. As a workaround, you can try static fstab cvmfs mounts or very long autofs timeouts.
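For illustration, the static-mount workaround follows the standard cvmfs fstab syntax (repository and options should be adapted per site):

```bash
# Add to /etc/fstab to mount sft.cern.ch statically instead of via autofs:
#   sft.cern.ch /cvmfs/sft.cern.ch cvmfs defaults,_netdev,nodev 0 0
mkdir -p /cvmfs/sft.cern.ch
mount /cvmfs/sft.cern.ch
```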
Hi Jakob, thanks a lot for the analysis! That gives us the opportunity to implement a workaround until we upgrade to Alma 9 (which should be soon; we are waiting for our administration to place the order with the cluster vendor). Just to verify the theory, here is the output of the
And we currently have
Just saw that 1.2.4 was released two days ago by @DrDaveD. Thanks for the help! Torsten and the Wuppertal crew
Hi all, we just made a quick check with find_zombies on some of our nodes at DESY-HH and also noticed quite a number of processes/file handles. These have not been an issue for us so far, as we block the "large" repos like atlas.cern.ch and cms.cern.ch from unmounting and keep them continuously alive. With the current job mix, it affects primarily a Python bin path for ATLAS and the OSG apptainer exec for CMS jobs; probably just statistics with the paths used by each ATLAS/CMS job?
Cheers, |
I am surprised that the zombie command only shows the atlas.cern.ch and unpacked.cern.ch repositories. Are those having the same problem? I thought it was sft.cern.ch. The
There's no known problem with this in apptainer-1.2.2. However, any container application can easily cause the problem by exiting without making sure all of its child processes have exited. So if possible, please bring it to the attention of the application owner.
Hi @harenber, did you switch to EL9 in the meantime? Can you confirm that it fixes this issue?
Hi @vvolkl, unfortunately not yet; we're still waiting for our vendor to come up with a schedule. I'll update here as soon as we've made the switch.
We now have an appointment with our vendor. We'll upgrade between May 22nd and June 5th.
Dear all,
for several days we have been suffering from a potential bug where the cvmfs2 process serving sft.cern.ch deadlocks. (We have only seen this for sft.cern.ch!)
Across our 268 nodes, we see a rate of at least one node per hour showing this behaviour. We (Wuppertal University) are an ATLAS Tier-2, so we have a lot of ATLAS workload; our local users (other than the HEP group) barely use CVMFS. We couldn't yet identify whether it is related to some particular ATLAS job type.
Symptom:
Killing the cvmfs2 process and restarting autofs helps.
Most interesting observation:
this returns nothing on the sft.cern.ch cvmfs2 process
while it shows normal activity on any other cvmfs2 process.
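The command got lost in extraction; given the later mention of ptrace, a check of this kind is typically done with strace (the PID lookup via cvmfs_talk is illustrative):

```bash
# Trace system calls of the cvmfs2 process serving sft.cern.ch and all of
# its threads; on a hung process this prints nothing at all.
strace -f -p "$(cvmfs_talk -i sft.cern.ch pid)"
```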
I attached gdb to one of those deadlocking processes:
versions are
and kernel is
We already collected some DEBUG logs here, but as Dave suggested on cvmfs-talk, we just reconfigured the cluster to only write DEBUG logs for sft.cern.ch. As soon as we catch a node showing this behaviour again, we will add the relevant log to this issue.
We cross-checked the CVMFS client config with DESY and the Squid config with LRZ-LMU, and we couldn't find any difference other than the local cache sizes (we have a 12 GB CVMFS cache on our nodes).
Cheers
Martin and Torsten