Deal with duplicate DID exception when injecting multiple file replicas into Rucio #10026
Hi @amaltaro, I am not 100% sure I fully understand your concern, mostly because the exception [1] [2]
I had a chat with Eric and Yuyi today, and indeed this code looks suspicious and could potentially be causing issues with data produced in central production. This actually performs two actions:
So it could be that our datasets are ending up without a replica; we are not sure. The suggested approach for such exceptions would be to:
If a file passes these two checks, then indeed we can consider this data insertion operation complete. BTW, here is a pointer to the code executed on the server side: https://github.com/rucio/rucio/blob/5e92ee1e6d323d7e2f9eeb6e97ecb4f6fc4c3abe/lib/rucio/core/replica.py#L1457 @yuyiguo @ericvaandering will provide us with some guidance on which Python APIs we could use for this error handling/check. PS: and here is a GH issue reporting this (possible) problem in October 2020 :(
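The two checks suggested above could be sketched roughly as follows. This is only an illustration: the helper name `replica_already_complete` and the `StubClient` are hypothetical stand-ins for the real Rucio client (which exposes similar `list_content`/`list_replicas` calls), not WMCore code.

```python
def replica_already_complete(client, scope, block, lfn, rse):
    """Return True if `lfn` is attached to `block` AND has a replica at `rse`."""
    # Check 1: is the file attached to the block (dataset in Rucio terms)?
    attached = any(did["name"] == lfn for did in client.list_content(scope, block))
    if not attached:
        return False
    # Check 2: does a replica of the file exist at the target RSE?
    for rep in client.list_replicas([{"scope": scope, "name": lfn}]):
        if rse in rep.get("rses", {}):
            return True
    return False


class StubClient:
    """Minimal stand-in for a Rucio client, for demonstration only."""

    def list_content(self, scope, name):
        yield {"scope": scope, "name": "/store/file1.root", "type": "FILE"}

    def list_replicas(self, dids):
        for did in dids:
            yield {"name": did["name"], "rses": {"T2_XX_Test": ["proto://..."]}}
```

Only if both checks pass would the duplicate-DID exception be safely ignorable for that file.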
And here is a fresh JIRA ticket reporting many inconsistencies between DBS and Rucio, apparently with files (or maybe only replicas!) missing in Rucio: https://its.cern.ch/jira/browse/CMSCOMPPR-25873 If we think that the Rucio client
I'm not sure whether it would solve the problem, though. It also adds extra complication on our side, with additional bookkeeping and error handling. @todor-ivanov please discuss this with the Rucio experts to see what the best option to solve this is. PS: the JIRA above reports problems with single-file blocks as well, so it is likely not related to the bulk write operations performed by WMAgent.
OK, I don't know if it will help, but I'm parsing the logs and looking for anything other than a 201 response to POST /replicas from WMCore. So far nothing.
Thanks @ericvaandering. I am also trying to find some records in the wmagent logs related to any of those files. I'll post my findings here as well.
So I don't know WHAT happened here, but:

I have no idea what has generated the above two records. [1] [2]

And here is one ERROR which might be more related to the duplicate DID exception.
Repeating the debugging information I've posted in the JIRA ticket (https://its.cern.ch/jira/browse/CMSCOMPPR-25873) here as well: today I took one of the two workflows Hasan Ozturk pointed to as the latest ones suffering from the issue and did the following debugging:
So we need to understand:
So to answer my own previous two questions:
If someone from Data Management could check the reason for these deletions, it would be of great help. Thanks! Could it be because of a missing rule protecting them, or because the insertion never completed, or because they were not properly associated with a higher-level data container...? The latter seems to have a negative answer, though, since at some point the block (dataset in Rucio terms) did know about those files.
I looked into the undertaker, which I think is the thing deleting files. It's only been running since June 7 and only removed one DID:

We have our Rucio meeting now and I will try to understand under what circumstances DIDs get deleted.
Thanks @ericvaandering
The undertaker is supposed to delete DIDs which have an expiration date. Any chance files or datasets are being created with those? DIDs should never have expiration dates in CMS unless they are set afterwards to invalidate things. Only rules should have expiration dates. |
Hi @ericvaandering, we did double check with @amaltaro and the answer is: no, we do not set DID replicas with an expiration date, neither for files nor for datasets. And actually, if we had those set on the dataset (block in CMS terms), the expiration date would also have affected the remaining 51 files in the same dataset. There is something different about those 5 files that were deleted, and I cannot figure out what.
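One way to rule out the expiration-date hypothesis for a whole block would be a scan like the one below. This is a hedged sketch: it assumes a Rucio-like client where per-DID metadata exposes an `expired_at` field, and the `StubClient` and function name are illustrative, not WMCore or Rucio code.

```python
def dids_with_expiration(client, scope, block):
    """Return names of files in `block` whose metadata carries an expiration date."""
    flagged = []
    for did in client.list_content(scope, block):
        meta = client.get_did(scope, did["name"])
        if meta.get("expired_at") is not None:
            flagged.append(did["name"])
    return flagged


class StubClient:
    """Minimal stand-in for a Rucio client, for demonstration only."""

    def list_content(self, scope, name):
        yield {"name": "/store/file1.root"}
        yield {"name": "/store/file2.root"}

    def get_did(self, scope, name):
        # Pretend one file was (unexpectedly) given an expiration date.
        if name == "/store/file2.root":
            return {"expired_at": "2021-06-07 00:00:00"}
        return {"expired_at": None}
```

An empty result for the affected block would support the conclusion that the undertaker was not responsible for deleting those 5 files.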
@todor-ivanov @ericvaandering have we concluded that there is nothing to be fixed in WMCore in this regard, and that files actually do get inserted into Rucio, but they (files/replicas) somehow get deleted by something in the Rucio consistency-monitoring ecosystem?
Hi @amaltaro I do not think there is anything we can do here from the WMCore side. Closing this issue now. |
Impact of the bug
WMAgent
Describe the bug
We don't seem to have hit this exception yet in production, but there will certainly be a moment where we try to add a list of file replicas (via the `add_replicas` Rucio client API) into Rucio, and some of them will already exist in the database, thus hitting a `DataIdentifierAlreadyExists` exception. With the current code, we would simply retry the same call (or the same files plus freshly created ones from the previous cycle) again and again, which would likely keep failing on the server side.
How to reproduce it
Inject a file replica against a block, then try to inject the same replica again (or the same replica plus a new file).
Expected behavior
If we tried to add a single file replica, a duplicate file replica exception should not be treated as an error; the component should deal with it as if the replica injection had succeeded.
However, if there are multiple file replicas in the input, our wrapper API `createReplicas` should signal (or raise an exception) back to the caller and let the caller break that list of file replicas into single replicas, making a new `createReplicas` call for every replica file.
Additional context and error message
Initial handling (single file replica) in: #10024
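The fallback described under "Expected behavior" could be sketched as below. This is an illustration only: `DuplicateDID` stands in for `rucio.common.exception.DataIdentifierAlreadyExists`, and `create_replicas`/`StubClient` are hypothetical stand-ins for the WMCore `createReplicas` wrapper and the Rucio client.

```python
class DuplicateDID(Exception):
    """Stand-in for rucio.common.exception.DataIdentifierAlreadyExists."""


def create_replicas(client, rse, files):
    """Insert `files` at `rse`; return the list of files considered injected."""
    try:
        client.add_replicas(rse, files)
        return list(files)
    except DuplicateDID:
        if len(files) == 1:
            # A single duplicate means this replica already exists: treat as success.
            return list(files)
        # Bulk call failed: fall back to one call per file so that only the
        # genuine duplicates get absorbed and every new file still gets injected.
        injected = []
        for f in files:
            injected.extend(create_replicas(client, rse, [f]))
        return injected


class StubClient:
    """Minimal stand-in for a Rucio client, for demonstration only."""

    def __init__(self, existing):
        self.existing = set(existing)

    def add_replicas(self, rse, files):
        if any(f["name"] in self.existing for f in files):
            raise DuplicateDID()
        self.existing.update(f["name"] for f in files)
```

With this approach a mixed batch (one duplicate plus one new file) ends with both files accounted for, instead of the whole bulk call failing repeatedly.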