gfal-copy failing but returning status code 0 - leaving a bunch of file mismatch in the system #11556
Comments
@vkuznet thanks for creating this issue. @haozturk FYI. Here is my suggestion on how to debug it:
Another option is to use the relational database, using your preferred SQL client (sqlplus, SQLDeveloper, etc.). If you use sqlplus, you can type this command in the agent:
and I think the table that we want to look at is
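For completeness, the same kind of check could also be scripted from the agent node. The sketch below is an illustration only: it assumes the cx_Oracle module and the agent's Oracle credentials, and the DBSBuffer table/column names are an educated guess, not something confirmed in this thread:

```python
import cx_Oracle

# Placeholder credentials: the real ones live in the agent's secrets file.
conn = cx_Oracle.connect("wmagent_user", "wmagent_pass", "cms_oracle_dsn")
cur = conn.cursor()

# Guessed schema: count files per block status in the local DBSBuffer tables.
cur.execute("""
    SELECT b.status, COUNT(f.id)
      FROM dbsbuffer_block b
      JOIN dbsbuffer_file f ON f.block_id = b.id
     GROUP BY b.status
""")
for status, nfiles in cur:
    print(f"block status={status}: {nfiles} files")

cur.close()
conn.close()
```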
Here is a very simple procedure to check DBS vs Rucio files:
and here I run it:
So, if you run both and count the number of files you will see the difference, e.g.
The missing LFN with hash
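For reference, the kind of cross-check described above can be scripted roughly as follows. This is a sketch only: it assumes a valid grid proxy for the DBS call, a configured Rucio client with the "cms" scope, and a placeholder block name:

```python
import os
import requests
from rucio.client import Client

# Placeholder DID: in practice, take the block name from the file-mismatch report.
block = "/Primary/Processed-v1/MINIAODSIM#00000000-0000-0000-0000-000000000000"
proxy = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u0")  # grid proxy for DBS auth

# Valid files that DBS knows for this block
resp = requests.get("https://cmsweb.cern.ch/dbs/prod/global/DBSReader/files",
                    params={"block_name": block, "validFileOnly": 1},
                    cert=proxy, verify=False)
dbs_files = {row["logical_file_name"] for row in resp.json()}

# Files attached to the same block DID in Rucio
rucio_files = {f["name"] for f in Client().list_files("cms", block)}

print(f"DBS: {len(dbs_files)} files, Rucio: {len(rucio_files)} files")
for lfn in sorted(dbs_files - rucio_files):
    print("in DBS but not attached in Rucio:", lfn)
```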
I communicated with @ericvaandering who gave me the clue. According to Rucio, the file which is reported as not present in Rucio does not have an associated block (i.e. it is not attached to the block).
Then, I checked the LFN status in DBS and found the same confirmation, i.e. the LFN is invalid there. My conclusion is that invalid files do not appear in Rucio.
I further investigated the Unified code base and found a particular flaw in their function usage which depends on an internal cache. Please see full details over here: https://gitlab.cern.ch/CMSToolsIntegration/operations/-/issues/41#note_6636304. In summary, it seems to me that Unified stores DBS results into a cache used in multiple places, and when it finally compares the DBS vs Rucio files, some files may have been invalidated by the underlying workflow or data-ops, leading to a file mismatch. I'll be happy to check another block when PnR provides one, but at the moment I do not see any issues with WMCore/agent/RucioInjector.
@vkuznet Valentin, regardless of a stale cache or not in Unified, didn't you confirm that actually there is a file in Rucio which has not been attached to its parent block?
BTW, here is our Rucio implementation for adding a replica to the Rucio server plus attaching it to a block: We might have to chase this file/block in the agent logs to see if any error/exception was raised when dealing with it in RucioInjector.
Unless "invalidation" (who/where does that) is not deleting the Rucio DID but only detaching it. |
Alan, how can I find which agent was used for an archived workflow? I looked up the one provided in the description and found it here: https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_Run2022D_Muon_10Dec2022_221209_230856_3545 but I can't find it in wmstats using this WF name.
Meanwhile, according to the WMCore RucioInjector code base https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L404-L414 and using logs from one of the CERN agents (vocms0255), I do see a few errors like this:
which suggests that RucioInjector can fail and the reason is unknown (or not propagated well from the Rucio client). That said, we may perform an inverse look-up and check if PnR has a file mismatch in this block
@amaltaro, another suggestion: by inspecting the reqmgr2 LogDB I found that for a given WF we do have pretty extensive JSON meta-data, but it lacks one particularly important element, the name of the agent which processed this WF. Should we open an issue for that to make debugging easier?
The only way to know which agents worked on a given request is through wmstats; at the same time, wmstats only contains data for active workflows. The workflow you are looking at has already been archived (normal-archived), so we cannot easily know which agents processed it. I think Hasan mentioned many workflows in the same situation, so I'd suggest picking one or two that have run in the past weeks (and that are not yet archived). The message that you pasted, "Error: An unknown exception occurred", suggests that no error details were returned from the Rucio server. Still, WMAgent is stateful and it will keep retrying to inject that replica+attach in the coming component cycles. Regarding reqmgr2 LogDB, I feel like the way it's used is sub-optimal, and from time to time the database gets slower and queries might time out. I believe it's also meant to be a temporary database only for active workflows. We might have to revisit its usage and how data is organized in it before expanding with further use-cases. But I agree that the agent information is very much needed!
@amaltaro, while awaiting input from PnR about active workflows, I inspected the list of workflows in their ticket and found none available in wmstats; either they do not exist or they are completed. That said, I also inspected the WMCore Rucio.py code base and would like to provide general feedback about this issue:
Thanks Valentin. Regarding the Rucio.py implementation, we need to check what is actually returned from the server - through the Rucio client - to see which information we can pass upstream. Another option would be to actually separate the "rucio replica injection" from the "rucio replica attachment", making them two different methods such that we can properly see what the outcome of each is. Note that
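As an illustration of that idea, here is a minimal sketch of keeping injection and attachment as two separate steps using the underlying Rucio client calls (add_replicas and attach_dids). The account, scope, and error handling are simplified assumptions, not the actual WMCore code:

```python
from rucio.client import Client


class RucioInjectorSketch:
    """Illustrative only: split injection and attachment so each outcome is visible."""

    def __init__(self, account="wmcore_transferor"):
        # Account name is a placeholder; the agent would use its configured account.
        self.client = Client(account=account)

    def inject_replicas(self, rse, files, scope="cms"):
        """Register file replicas at an RSE; surface the error instead of hiding it."""
        dids = [{"scope": scope, "name": f["name"], "bytes": f["bytes"],
                 "adler32": f["adler32"]} for f in files]
        try:
            self.client.add_replicas(rse=rse, files=dids)
            return True
        except Exception as exc:
            print(f"Replica injection failed at {rse}: {exc}")
            return False

    def attach_to_block(self, block, files, scope="cms"):
        """Attach already-registered file DIDs to their block DID."""
        dids = [{"scope": scope, "name": f["name"]} for f in files]
        try:
            self.client.attach_dids(scope, block, dids)
            return True
        except Exception as exc:
            print(f"Attachment to {block} failed: {exc}")
            return False
```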
Since there is a restriction on the P&R ticket, I am pasting what we have from the other ticket. Valentin checked this workflow. An example of a file that is in DBS but not in Rucio: The corresponding block is:
Alan did check the workflow which produced the MINIAODSIM above, which was: pdmvserv_task_BTV-Run3Summer22EEGS-00009__v1_T_221223_125301_7115. Looking into the output MINIAODSIM dataset, here is what one of the WMAgent utility tools says:
so one of the blocks is said not to have any files in Rucio, while DBS has 41 files, confirmed with: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filesummaries?block_name=/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM%232ac6771a-ae1e-4666-9c3d-f2e3d8daccbf If I grep the submit8 RucioInjector logs, I can find that this block was reported as being successfully injected into Rucio back in January (AND it reports the very same 41 files):
Alan still needs to look into the Rucio server to see if it actually knows any of the DIDs (files) in this block. To be done next.
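A quick way to perform that check with the Rucio client could look like the sketch below (the block name is taken from the DBS URL above; the LFN list is a placeholder and would come from the DBS file listing):

```python
from rucio.client import Client
from rucio.common.exception import DataIdentifierNotFound

client = Client()
scope = "cms"
block = ("/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/"
         "Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/"
         "MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf")
lfns = ["/store/mc/Run3Summer22EEMiniAODv3/placeholder.root"]  # from the DBS file list

# Files currently attached to the block DID (empty if the block itself is unknown)
try:
    attached = {f["name"] for f in client.list_files(scope, block)}
except DataIdentifierNotFound:
    attached = set()
    print("Block DID itself is unknown to Rucio")

for lfn in lfns:
    try:
        client.get_did(scope, lfn)   # does the file DID exist at all?
        exists = True
    except DataIdentifierNotFound:
        exists = False
    print(f"{lfn}: file DID exists={exists}, attached to block={lfn in attached}")
```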
From Alan: FYI. I asked for feedback in the #DM Ops Mattermost channel and Jhonatan seemed to have found all those files in Rucio: https://mattermost.web.cern.ch/cms-o-and-c/pl/4mg45jykbbn8jdbqxydn64f58y I still have follow-up questions and we still need to understand what that means and why that happened. Anyhow, feel free to follow this up over there as well.
@z4027163 thank you for updating this ticket! Sorry that I had to miss the DM meeting; is there anything that we in WM are expected to do next? Or what is the next step to properly understand what happened here?
@amaltaro They are aware of the issue, but we couldn't figure out the exact cause in the meeting. Eric/Diego will check the ticket for more details. We are waiting for their feedback; maybe there are follow-ups to do from the WMCore side after that.
I think we said we'd monitor the ticket, not that there was anything I could look into. Jhonathan's posts in MM show that these file DIDs are getting created in Rucio, right? And they are being attached to the block DIDs. So the mystery is how they are being unattached.
@ericvaandering Ah yes, I misunderstood. I thought you guys would be the right people to check it. Who should we ask to look into it? I had the impression it was outside wmcore's reach.
Well, I did just take a look at the Rucio code to see WHEN files are detached from blocks. It looks like it can happen if a file is declared bad AND it is the last copy of that file: https://github.com/rucio/rucio/blob/e595093ef688422a9a3d8823570d7826369ab22e/lib/rucio/core/rule.py#L1932 @ivmfnal would it be possible to search the consistency checking logs for some of these files (is there a sample list somewhere?) to see if this is what's happening?
@ericvaandering, I certainly do not know the details of Rucio, but looking at the code I find the following block suspicious:
Here we have two parameters,
Alan, I confirm that I do see such errors too:
in agents for
I scanned a few agents, and I do see
@vkuznet Valentin, having stage-out errors is normal and expected. What is not expected is to have multiple status codes out of the stage-out command. From the example I provided above, this:
is bad and needs to be understood. I would suggest you look into the command constructed for the gfal-based stage-out and try to understand how it can happen. We might even have to pull the relevant singularity image and try to reproduce it, but I hope we don't need to go that far.
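One way to inspect that constructed command without running a job is sketched below. It assumes a WMCore checkout on PYTHONPATH and that GFAL2Impl follows the standard StageOutImpl createStageOutCommand(sourcePFN, targetPFN, options, checksums) interface, so treat it as a starting point rather than a recipe:

```python
# Sketch: print the stage-out script that GFAL2Impl would generate, so the echoed
# status codes and the real gfal-copy invocation can be inspected side by side.
from WMCore.Storage.Backends.GFAL2Impl import GFAL2Impl

impl = GFAL2Impl()
script = impl.createStageOutCommand(
    "davs://source-site.example.org:2880/store/unmerged/test/file.root",  # placeholder PFNs
    "davs://target-site.example.org:2880/store/data/test/file.root")
print(script)
```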
Well, according to https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/Backends/GFAL2Impl.py#L122 the
In other words, both come from a single shell script but, in my view, the first exit code is not properly captured by the shell script and is simply the output of an echo command, while the second one is the real failure code of gfal-copy.
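To make the distinction concrete, here is a minimal sketch of capturing the real gfal-copy exit status directly from the process, instead of relying on a status string echoed by a wrapper script (the command line is simplified and not the exact one built by GFAL2Impl):

```python
import subprocess


def run_gfal_copy(source_pfn, target_pfn, timeout=3600):
    """Run gfal-copy and return its true process exit code plus captured output."""
    cmd = ["gfal-copy", "-t", str(timeout), "-K", "adler32", source_pfn, target_pfn]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # proc.returncode is the real gfal-copy status; no echo parsing involved.
    if proc.returncode != 0:
        print(f"gfal-copy failed with exit code {proc.returncode}: {proc.stderr.strip()}")
    return proc.returncode, proc.stdout, proc.stderr
```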
I reconstructed the shell script using the gfal command and it looks like this one:
where
Does anyone know how we can emulate the file transfer on a site using the script I posted earlier?
Finally, if we can't reproduce the issue with the aforementioned script, I suggest adding to its structure a call to verify the transfer, e.g.
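One possible shape for such a verification step, assuming the gfal2 Python bindings are available in the runtime environment (gfal-sum on the command line would serve the same purpose):

```python
import gfal2


def verify_transfer(source_pfn, target_pfn, checksum_type="adler32"):
    """Compare source and target checksums after the copy; a mismatch means a bad transfer."""
    ctx = gfal2.creat_context()
    src_sum = ctx.checksum(source_pfn, checksum_type)
    dst_sum = ctx.checksum(target_pfn, checksum_type)
    if src_sum != dst_sum:
        raise RuntimeError(
            f"Checksum mismatch: source={src_sum} target={dst_sum} for {target_pfn}")
    return dst_sum
```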
In this PR #11601 I provided an additional check of the checksums of the source and target PFN via
Valentin, from this I understand that the option I am also starting to consider that it might be helpful to make a bug report in their (CERN) repository. On how we could test and/or try to reproduce it, I think we would need to:
@todor-ivanov @khurtado perhaps you have another suggestion on how we can test it?
BTW, we should definitely add a
I created GGUS ticket https://ggus.eu/?mode=ticket_info&ticket_id=162181&come_from=submit and requested a manual file transfer using our script on the T2_IT_Rome site.
@dynamic-entropy Hi Rahul, Valentin is on vacation until mid next week, so let me reply to the best of my knowledge. We are still trying to understand why the gfal-copy command returns an exit code 0 while internally raising an error. I believe this is the main root cause of these file mismatches and ideally we should resolve it there. Nonetheless, I still think A service that keeps data invalidation in sync between Rucio and DBS will be important, but right now it's unclear what scope that belongs to.
Right. I believe Kevin's statement was more or less "We have limited development effort, so the first focus should be on fixing the problem if we understand it" |
Just for the record, the current list of T1/T2 sites using
I also took the opportunity to contact the CERN gfal-util experts via this GitLab issue:
@amaltaro we also just got a report from a CRAB user of
i.e. what you discussed above in #11556 (comment) and below. The issue in the gfal repo https://gitlab.cern.ch/dmc/gfal2-util/-/issues/2 is dormant (no reply from Mihai in 4 months). Do you still see this error? Is the code catching it or doing something about it? I looked at the PRs mentioned above but they do not look like a solution. I'd like to add Stephan Lammel as FYI, but I cannot find his GH handle.
Alan pointed me to the gfal issue a while ago and I then investigated/found the issue.
@belforte Hi Stefano, I haven't heard any reports from the P&R team on this for at least a couple of months. Nonetheless, that does not mean it no longer happens in production; it could be that they are simply going for data invalidation without reporting it. @haozturk (and Hassan, I don't remember his GH user) are you still seeing Rucio vs DBS inconsistencies? Is there any automation to "overcome" those on the Unified side? @stlammel I had to re-read the GFAL2 ticket and my understanding is that the change/suspicion is on the gfal2-util side. If so, do you think you (or Stefano?) could get a hold of Mihai?
Hallo Alan,
this error is difficult to catch, unless you grep for it in the logs. Indeed P&R could spot it as "rules to tape stay stuck/suspended", but if they look and find the file missing on disk they have no way to know why. And tracking back to WMA logs may not be easy. Even if Mihai fixes the code, I propose to stay on the safe side and add a
@belforte checksum calculation has been enabled by default:
but it looks like enabling it in the
I guess that
Impact of the bug
Identify issue with stuck PnR workflows and allow them to progress.
Describe the bug
As described in PnR#41, we have a certain number of workflows which have a file mismatch between DBS and Rucio, and they never get announced for months. The file mismatch usually happens when there are output files in DBS that are not recorded by Rucio; here is one example:
for WF: pdmvserv_Run2022D_Muon_10Dec2022_221209_230856_3545. A recent list of workflows having such an issue can be found here.
How to reproduce it
Most likely we need to debug the agent logs and the agent's local databases to identify the issue with stuck workflows.
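For example, a hedged sketch of scanning an agent's RucioInjector ComponentLog for a problematic block or LFN (the log location is a guess at a typical WMAgent layout and may differ between deployments):

```python
import glob


def grep_component_log(pattern,
                       log_glob="/data/srv/wmagent/current/install/*/RucioInjector/ComponentLog*"):
    """Print every log line mentioning the given block name or LFN."""
    for path in glob.glob(log_glob):
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if pattern in line:
                    print(f"{path}:{lineno}: {line.rstrip()}")


# Example: look for the block that shows a DBS/Rucio file count mismatch.
grep_component_log("MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf")
```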
Expected behavior
We should have a file match between DBS and Rucio to allow WF to progress.
Additional context and error message