[MS] MSUnmerged has been restarting in the t2t3 replica #10713
As we enabled T1_US_FNAL_Disk to be taken into consideration by the MSUnmerged service, we also started seeing the same service restarts there.
One can also observe that all these crashes happened when we had >= 15k files to be deleted under the same directory. I'm going to skip …
We could probably add a debug printout of the number of PFNs at:
Likely there is some buffer in gfal2 that is overflowing and causing a segfault when the list is sufficiently large. Once running with the printout, we can see what the largest non-crashing value is and perhaps pick a chunk size of half that or so. For context, Rucio does them one at a time: https://github.com/rucio/rucio/blob/a45f685af1963c8f1bd60bbbfe3e207e7df7e903/lib/rucio/rse/protocols/gfal.py#L473
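A minimal sketch of what such slicing could look like, assuming the gfal2 Python bindings' bulk `unlink_list` call and a hypothetical `CHUNK_SIZE` picked from the debug printout; retries and logging are left out:

```python
import gfal2

# Hypothetical chunk size; to be tuned once the debug printout of PFN counts
# tells us the largest value that does not crash
CHUNK_SIZE = 1000

def delete_in_slices(pfns, chunk_size=CHUNK_SIZE):
    """Delete a (possibly very long) list of PFNs in fixed-size slices."""
    ctx = gfal2.creat_context()
    failed = []
    for start in range(0, len(pfns), chunk_size):
        chunk = pfns[start:start + chunk_size]
        # Assumption: the bulk call returns one entry per PFN, None on success
        # or an error object on failure
        results = ctx.unlink_list(chunk)
        for pfn, res in zip(chunk, results):
            if res is not None:
                failed.append((pfn, str(res)))
    return failed
```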
Now that we have the deletions in slices, we are sure that the …
Fairly big ones do seem to go through without any problems, only with a bit of delay:
If the global Python logging level is DEBUG, I know gfal prints some more detailed information about its internal operations. It might hint at what is going wrong.
Actually, back then it was printing way too much and I have separated the Python log level from the … I will set it back to DEBUG and see what is going to happen.
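A rough sketch of how one could raise only the gfal verbosity while keeping the rest of the service quieter; both the `gfal2.set_verbose` call and the `"gfal2"` logger name are assumptions about the bindings in use:

```python
import logging
import gfal2

# Raise the gfal2 library verbosity (assumes the gfal2-python set_verbose API)
gfal2.set_verbose(gfal2.verbose_level.debug)

# Keep the service's own root logger at INFO so the rest of the log is not
# flooded; only the gfal2 logger is dropped to DEBUG
logging.basicConfig(level=logging.INFO)
logging.getLogger("gfal2").setLevel(logging.DEBUG)
```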
Increasing the … [1]
@todor-ivanov: Signal 11 is usually indicative of a segmentation violation, which occurs when a program attempts to access an invalid bit of memory. I would be surprised if this is something that is being generated from outside the MSUnmerged/gfal process. It would be consistent with an overflow of either memory or an array index in a list. Can you retrieve the core dump and view the stack trace?
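A hedged sketch of what could be added at service startup to capture more information when the process dies on SIGSEGV; both calls are from the Python standard library:

```python
import faulthandler
import resource

# Print a Python-level traceback to stderr if the interpreter receives a fatal
# signal such as SIGSEGV (standard library, no extra dependencies)
faulthandler.enable()

# Allow the kernel to write a core file for this process so the native gfal2
# stack trace can be inspected with gdb afterwards
resource.setrlimit(resource.RLIMIT_CORE,
                   (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
```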
Hi @klannon, that's absolutely correct. I did not elaborate too much, but yes, by:
I meant by the kernel, exactly because of the reasons you explained very well. That is also why I was expecting the core dump to be in the system logs instead of the service logs. This is the big disappointment I was talking about when I requested the logs from the machine running the pods, which were willingly provided by Imran, and were absolutely clear. The only thing I can see there are the periodic records from …

The core dump is again a misleading message, because it is basically a message which WMCore logs from [2], based on some bitwise operations with the object returned from the system to the …

Some more hints I find in the old bugs related to …

Things are getting messier and messier unfortunately :( [3]

[2] WMCore/src/python/WMCore/REST/Main.py, lines 470 to 475 in cccb247
[1]
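For illustration only (not the actual Main.py code), this is roughly how such a wait status is typically decoded with the standard `os` helpers, and why a "(core dumped)" style message can be logged even when no core file is actually kept on disk:

```python
import os

def describe_exit(status):
    """Decode a raw wait status the way a process monitor typically does."""
    if os.WIFSIGNALED(status):
        msg = "child terminated by signal %d" % os.WTERMSIG(status)
        if os.WCOREDUMP(status):
            # This is the branch that yields a "(core dumped)" style log line,
            # even if no core file is actually retained on the node
            msg += " (core dumped)"
        return msg
    if os.WIFEXITED(status):
        return "child exited with code %d" % os.WEXITSTATUS(status)
    return "child stopped or status unknown"

# Usage: after os.waitpid(pid, 0) returns (pid, status)
# print(describe_exit(status))
```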
@muhammadimranfarooqi has deployed MSUnmerged with WMCore …
I scanned the ms-unmerged logs and it appears we still got two restarts of the CherryPy service.
With the error above, I think our next priority within the MSUnmerged project would be to streamline the certificates used by the service and make sure the certificate/proxy used is properly authorized to access the storage and delete unneeded files. Reported in this issue: #10680
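A minimal probe sketch, assuming the gfal2 Python bindings and that the service's proxy is pointed to via `X509_USER_PROXY`; `pfn` and `proxy_path` are placeholders:

```python
import os
import gfal2

def proxy_can_access(pfn, proxy_path):
    """Probe whether the configured proxy can at least stat a storage path
    before any deletion is attempted; pfn and proxy_path are placeholders."""
    os.environ["X509_USER_PROXY"] = proxy_path
    ctx = gfal2.creat_context()
    try:
        ctx.stat(pfn)
        return True
    except gfal2.GError as exc:
        print("Proxy not authorized or path unreachable: %s" % exc)
        return False
```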
I just had another look at the ms-unmerged logs and I did not see any CherryPy restarts for the last 5 days (1.5.6.pre1). @todor-ivanov, from my side I would say we can close this issue. What do you think?
Thanks for taking a look @amaltaro! Let's close it. We will for sure keep that in mind during the last check, when swapping the credentials for the service. In case we notice the errors coming back, we can reopen it at any time.
Impact of the bug
MSUnmerged
Describe the bug
While checking the MSUnmerged logs, Todor noticed that the t2t3 replica has been having some random service restarts. Note that it is not the POD getting recreated, but the service restarting inside the same POD. The reason is still unclear, and the only thing we can extract from the service logs is [1].
NOTE: this causes that MSUnmerged replica to keep cleaning up the same RSEs over and over, since the service cache gets wiped out during the restart process.
UPDATE: the t1 POD service also gets restarted when FNAL is enabled.
How to reproduce it
Not clear yet! Note that 2 replicas work with no issues, only this t2t3 is having these problems.
Expected behavior
Investigate the reason for these service/CherryPy restarts and fix the MSUnmerged code; or whatever is making that replica unstable.
One comment that I made to Todor, though, is that it would be interesting to log the number of files retrieved from the Rucio ConMon, just so we know how many files need to be parsed in advance.
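A small sketch of that kind of logging; `conmon_client` and `get_unmerged_files` are hypothetical placeholders for whatever interface the service already uses to talk to Rucio ConMon:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_unmerged_files(rse_name, conmon_client):
    """Fetch the unmerged file list for an RSE and log how many entries were
    returned before any parsing starts (conmon_client is a placeholder)."""
    files = conmon_client.get_unmerged_files(rse_name)
    logger.info("Rucio ConMon returned %d files for RSE %s", len(files), rse_name)
    return files
```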
Additional context and error message
[1]