A tool to change the GUID of a file manually for 8034-FileNameInconsistentWithGUID issues #37742

Closed
haozturk opened this issue Apr 29, 2022 · 21 comments

Comments

@haozturk
Contributor

We came across the following issue in production a few times:

Fatal Exception (Exit code: 8034)
An exception of category 'FileNameInconsistentWithGUID' occurred while
[0] Calling InputSource::readFile_
Exception Message:
GUID DE727AEB-44B0-E811-AE5A-FA163E294873 extracted from file name dcap://cmsdcap-kit.gridka.de:22125/pnfs/gridka.de/cms/disk-only/store/data/Run2018D/SingleMuon/RAW/v1/000/322/179/00001/DE727AEB-44B0-E811-AE5A-FA163E294873.root is inconsistent with the GUID read from the file 327FE446-F9B7-E811-9C31-FA163E5394F5 

It was suggested to process such files by bypassing this check, since the file content looked fine. However, that is not feasible with the existing workload management system, as described in [1]. Matti suggested opening this issue to discuss and develop a tool for manually updating the GUID of a file when necessary, so that such files can be processed.

[1] https://its.cern.ch/jira/browse/CMSCOMPPR-24122
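For illustration, the consistency check behind exit code 8034 can be sketched in plain Python. This is a minimal sketch, not the CMSSW implementation: the helper names `guid_from_file_name` and `is_consistent` are hypothetical, the 8-4-4-4-12 hexadecimal layout is inferred from the GUIDs in the exception message above, and treating a file name without a GUID as consistent is an assumption made for the sketch.

```python
import re
from typing import Optional

# GUID layout as seen in the exception message: 8-4-4-4-12 hex digits,
# immediately followed by the ".root" extension at the end of the name.
GUID_RE = re.compile(
    r"([0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}"
    r"-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12})\.root$"
)


def guid_from_file_name(file_name: str) -> Optional[str]:
    """Return the GUID embedded in the file name (upper case), or None."""
    match = GUID_RE.search(file_name)
    return match.group(1).upper() if match else None


def is_consistent(file_name: str, guid_in_file: str) -> bool:
    """Hypothetical helper: compare the name GUID with the GUID stored in the file.

    Assumption for this sketch: a name that carries no GUID is not checked.
    """
    name_guid = guid_from_file_name(file_name)
    return name_guid is None or name_guid == guid_in_file.upper()


if __name__ == "__main__":
    name = (
        "dcap://cmsdcap-kit.gridka.de:22125/pnfs/gridka.de/cms/disk-only/store/data/"
        "Run2018D/SingleMuon/RAW/v1/000/322/179/00001/"
        "DE727AEB-44B0-E811-AE5A-FA163E294873.root"
    )
    # GUID read from the file, as reported in the exception message.
    print(is_consistent(name, "327FE446-F9B7-E811-9C31-FA163E5394F5"))
```

With the two GUIDs from the exception message, the name GUID and the stored GUID differ, which is the condition reported as `FileNameInconsistentWithGUID`.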

@cmsbuild
Contributor

A new Issue was created by @haozturk Hasan Öztürk.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core

@makortel
Contributor

@Dr15Jones @dan131riley Any thoughts?

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Contributor

@pcanal thoughts?

@dan131riley

The CMS file GUID is the only entry on the FileIdentifier branch, so in principle it should be possible to update it, or at worst make a copy with the old GUID. Doing so will change the file hash, which might have other implications for DBS entries, maybe the transfer system?

For the future, I would much rather have a parameter to the PoolOutputModule to set the GUID so the replacement file gets created with the GUID of the original. I think this would be a lot cleaner for future file replacement, but I don't know how much of a manual process it would entail.

@makortel
Contributor

> For the future, I would much rather have a parameter to the PoolOutputModule to set the GUID so the replacement file gets created with the GUID of the original. I think this would be a lot cleaner for future file replacement, but I don't know how much of a manual process it would entail.

I agree that approach would be cleaner. Copying @nsmith-'s messages from mattermost https://mattermost.web.cern.ch/cms-o-and-c/pl/5ssbgjxc4bnkdxipof3mhcasth

> It's possible these are files from a T0 manual recovery.
> The one exception to the GUID matching the file name is when T0 has a RAW output loss: if they act quickly, they can reprocess the streamer files still in the input buffer and recreate the file. But the new file will have a different GUID even though it is renamed to match the lost file.

@drkovalskyi Can you comment if setting an additional parameter for PoolOutputModule would be feasible in such recovery reprocessing?

@nsmith-
Contributor

nsmith- commented Apr 29, 2022

An alternative solution to this problem is for T0 to invalidate the lost file in DBS and inject a new replacement file into the dataset, with the GUID being whatever the replacement file creates.

@germanfgv
Contributor

@nsmith- our current procedure is to do exactly that (e.g. CMSTRANSF-344). I'm not sure what was done for this run in 2018. I was not able to find any report of issues concerning run 322179.

@drkovalskyi
Contributor

We can certainly add an option to set the GUID in PoolOutputModule, but I don't see why we need it at all. What's the use case for the GUID check? Keep in mind that it's "normal" practice for Tier0 to re-create files like that, and we have a number of such RAW files. This number is certainly small, and with enough manual work it's possible to replace all such files, updating everything everywhere, but do we need it?

@makortel
Contributor

makortel commented May 3, 2022

@drkovalskyi The option to enforce that the GUID in the file name is consistent with the GUID inside the file was requested by CompOps in dmwm/WMCore#9432, to detect the cases where xrootd was silently serving wrong files on some sites (at the time).

@drkovalskyi
Contributor

Ok, that's an important use case. @germanfgv could you please add an option to set the GUID in the PR that you are working on for the compression?

@makortel
Contributor

makortel commented May 3, 2022

> could you please add an option to set the GUID in the PR that you are working on for the compression?

We would first need to add such an option to PoolOutputModule (so far it wasn't clear whether such an option would really be needed; now the need seems clear).

@drkovalskyi
Contributor

Yes, we need this. We had enough cases where we had to do a manual recovery like that.

@makortel
Contributor

makortel commented May 4, 2022

#37806 adds an overrideGUID parameter to PoolOutputModule.
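As a rough illustration of how the new parameter might be used when re-creating a lost file, the following configuration fragment sets `overrideGUID` to the GUID of the original file. This is a sketch under assumptions, not a tested Tier0 configuration: the parameter is assumed here to be an untracked string, and the process name, the source, and the input file name are hypothetical placeholders. It needs a CMSSW environment to run.

```python
import FWCore.ParameterSet.Config as cms

# Hypothetical process name for a manual recovery job.
process = cms.Process("RECOVER")

# Placeholder source; a real recovery would read the data still available to Tier0.
process.source = cms.Source(
    "PoolSource",
    fileNames=cms.untracked.vstring("file:input.root"),  # hypothetical input
)

process.out = cms.OutputModule(
    "PoolOutputModule",
    # Output named after the lost file, as in the example from this issue.
    fileName=cms.untracked.string("DE727AEB-44B0-E811-AE5A-FA163E294873.root"),
    # Assumption: overrideGUID is an untracked string holding the original GUID,
    # so that the GUID stored in the file matches the GUID in the file name.
    overrideGUID=cms.untracked.string("DE727AEB-44B0-E811-AE5A-FA163E294873"),
)

process.ep = cms.EndPath(process.out)
```

The intent is that a replacement file written this way passes the file name versus GUID check without any later manual editing of the file.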

@makortel
Contributor

makortel commented May 4, 2022

@drkovalskyi Would you need a backport for 12_3_X? (I presume earlier releases can be left out by now)

@drkovalskyi
Contributor

We do need a backport to 12_3_X, since it's our current data-taking release and such issues may arise at any time. Thanks.

@makortel
Contributor

@drkovalskyi #37806 was merged, so the new parameter will be in 12_4_0_pre4. @drkovalskyi, can you test the parameter in the pre4 pre-release in any meaningful way, or should we just proceed with the 12_3_X backport?

@drkovalskyi
Contributor

We should be able to run a replay with 12_4_0_pre4 and try to manually recreate a file. @germanfgv could you please help Jhonatan set up a test like that?

@makortel
Contributor

+1

I suppose a 12_3_X backport is not needed anymore

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@qliphy qliphy closed this as completed Aug 4, 2022