Investigate automatic suspicious replica recovery #403
Comments
@ericvaandering , |
@yuyiguo @ericvaandering I guess I am the one who started this. Shall we try to have a quick chat on zoom ? |
Yes, I am available today. Let me know when we can chat. |
well... today I had things to do. Let's plan it a little bit so that Eric may also join. 10 min should suffice |
what about tomorrow in your 8-10am window ? |
Adding some description after a chat with Eric and Yuyi (Katy was also there), where we start from:
- what we want to do
- things we know
- things we do not know, but want to know
- things we should do (i.e. The Plan)
|
Did this issue get followed up somewhere else? Or is it just stale and we still need to validate @belforte's proposal with the various tasks? |
It is still on my to-do list and I am not tracking it elsewhere. I should. Then this can be put on hold until I have a proposal. |
currently this breaks up as
|
This is on my to-do list too, but at low priority. |
The first thing is to flag jobs which hit corrupted files, and monitor them, so that we can quantify the problem. |
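As a rough illustration of what such flagging/monitoring could look like, here is a minimal Python sketch that tallies cmsRun fatal-exception categories found in job stderr. The category names (FileReadError, FileOpenError, FallbackFileOpenError), the message pattern, and the per-job stderr file layout are assumptions to be checked against real CMSSW output; this is not an existing CRAB or WMAgent feature.

```python
import re
import sys
from collections import Counter

# Exception categories that usually point at a file-access problem.
# The names are assumptions; verify against actual cmsRun fatal-exception blocks.
FILE_ACCESS_CATEGORIES = {"FileReadError", "FileOpenError", "FallbackFileOpenError"}

CATEGORY_RE = re.compile(r"An exception of category '(\w+)' occurred")

def quantify(stderr_paths):
    """Count fatal-exception categories across a set of cmsRun stderr files."""
    counts = Counter()
    flagged = []
    for path in stderr_paths:
        with open(path, errors="replace") as fh:
            categories = CATEGORY_RE.findall(fh.read())
        counts.update(categories)
        if FILE_ACCESS_CATEGORIES.intersection(categories):
            flagged.append(path)
    return counts, flagged

if __name__ == "__main__":
    counts, flagged = quantify(sys.argv[1:])
    print("exception categories:", dict(counts))
    print("jobs that possibly hit a corrupted file:", len(flagged))
```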
What is the definition of "suspicious replicas"? If a transfer fails, FTS will try to retransfer it. If the failure is permanent, how can the Replica Recoverer daemon fix it? And why would CMSSW or processing jobs read a suspicious replica? |
Hi @yuyiguo. Let me try to list here what I (think that I) understand; hopefully it answers some of your questions.
Hope this helps! [1]
|
btw, the file in my example above has been fixed by Felipe and download is now OK. |
Some time ago I wrote this document with a proposal for how we can handle this. |
Basically here is my proposal:
I would like to get some feedback on this proposal. Perhaps we can add more details to it. |
Thanks Igor, and apologies for not having replied earlier. I would like to see this at work from automatic/automated tools for a while before we think about enabling users; at that point we may have to introduce some way to "trust machines more than humans", IMHO. One thing that I expect we can talk about later, but let me mention now: we can surely resume this once I have code which parses CMSSW stderr! |
Stefano @belforte, thanks for the reply and the feedback. I appreciate it. I was thinking that it would make sense to have another meeting (I think we had one already some time ago) among the involved people to re-sync, discuss use cases and maybe come up with an action plan. I think we need to get at least @yuyiguo @dynamic-entropy (Rahul) @klannon there. I would invite @ericvaandering too, but he is on vacation. Who am I missing? |
@belforte said: "we can detect both missing and possibly corrupted files, and tell one from the other." I think it is important to differentiate between several types of failures. I would add another dimension to this:
My understanding of the problem is that we want to use the "suspicious" replica state in the first case for a while before we declare the replica "bad" if things do not improve, whereas if we believe the error is not recoverable, we go straight to the "bad" replica state. |
I do not think we need a (longish) meeting now. I'd like to have code which parses stdout/stderr and does a few "mark as suspicious" calls first. There may be questions arising during that, which we can address as needed. In a way, I have my action plan. As to recoverable vs. non-recoverable: yes, I know, we already discussed it in the meeting where you first presented this. The problem here is how to be sure that the specific error is really a bad file, not a transient problem in the storage server: is the file really truncated, or was a connection dropped somewhere? So again, I'd like to get experience with the simpler path first. All in all, CRAB already retries file read failures 3 times (and WMA 10, IIUC), so if we e.g. say "3 times suspicious = bad", it may be good enough. |
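As a sketch of what the "mark as suspicious" step could look like (not the final CRAB code), the Rucio Python client exposes ReplicaClient.declare_suspicious_file_replicas. The PFN and reason below are placeholders, and the "3 times suspicious = bad" threshold would be enforced later on the recovery side, not by this call.

```python
from rucio.client.replicaclient import ReplicaClient

def mark_suspicious(pfns, reason):
    """Declare the given replica PFNs suspicious in Rucio.

    Repeated declarations accumulate; promoting a replica from suspicious to
    bad is a separate step (replica recoverer daemon or an operator).
    """
    client = ReplicaClient()
    return client.declare_suspicious_file_replicas(pfns, reason)

# Hypothetical usage from a stdout/stderr parser:
# mark_suspicious(
#     ["root://some-site.example//store/data/Run2023X/.../file.root"],
#     "cmsRun FileReadError reported by CRAB job retries",
# )
```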
Just as FYI, here is the suspicious replica recoverer config for ATLAS:
|
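Purely as an illustration of the shape such a policy can take (the actual ATLAS values are not shown here), Rucio ships an example policy file, suspicious_replica_recoverer.json in the Rucio repository as far as I recall, where each entry maps datatypes to an action for the recoverer daemon. The sketch below writes a policy of that assumed form from Python; the datatypes and action strings are placeholders rather than the ATLAS (or a proposed CMS) policy.

```python
import json

# Assumed entry format, modelled on the example suspicious_replica_recoverer.json
# shipped with Rucio: one action per datatype group. Values here are placeholders.
policy = [
    {"action": "ignore", "datatype": ["log"], "scope": []},
    {"action": "declare bad", "datatype": [], "scope": []},
]

with open("suspicious_replica_recoverer.json", "w") as fh:
    json.dump(policy, fh, indent=4)
```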
or do a rucio get, which checks the checksum
…On 11/09/2023 16:03, Eric Vaandering wrote:
The “daemon” you suppose would try reading every file directly over
xrootd to find the corrupted one(s)?
|
Would it not be easier to have CRAB be specific about which replica it failed to read? |
CRAB needs CMSSW to tell it whether the error happened opening a local file or a fallback one. |
I know. |
which means: I can run that daemon! Yes, I know. I may have to do something of that sort anyhow. |
I wasn't suggesting that you need to write or run it. Just the process you are suggesting. WMAgent will probably run into the same issue. So I understand the problem: CRAB gets this error from CMSSW but CMSSW does not give enough information to know if the file is read locally or remotely? Then, even if we knew "remotely" I could imagine problems knowing which remote file was read. CMSSW may not have any way of knowing that. |
Correct. Usually when xrootd succeeds in opening a remote file, the new PFN is printed in the cmsRun log, but much to my disappointment, when the file open fails, nothing gets printed. Of course many times the file will be opened locally, and the site where the job runs is sufficient information. But I am not sure how to reliably tell. I think we need some "exploratory work", so I am going to simply report "suspected corruptions" as files on CERN EOS, to be consumed by some script (e.g. a crontab) which can cross-check and possibly make a Rucio call, so that the script can check the multiple replicas, if any. IIUC Dima's plans, NANO* aside, we are going to have a single disk copy of files. |
(Somehow) later on we can think of incorporating all that code into something that parses cmsRun stdout and makes a call to Rucio on the fly. But I am not sure that we want to run the Rucio client on worker nodes; it should be available, though, since it is on CVMFS. |
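A sketch of the kind of cron-driven consumer described above: read reported LFNs from a drop area, ask Rucio where the replicas live, and declare them suspicious. The drop-area path, the report file format, and the cross-check stub are all hypothetical; only the ReplicaClient calls are existing Rucio client methods.

```python
import glob
from rucio.client.replicaclient import ReplicaClient

REPORT_GLOB = "/eos/cms/store/temp/crab_read_failures/*.txt"  # hypothetical drop area

def looks_corrupted(pfn):
    """Placeholder for a real cross-check (re-read the file, compare checksums, ...)."""
    return True

def process_reports(scope="cms"):
    client = ReplicaClient()
    for report in glob.glob(REPORT_GLOB):
        with open(report) as fh:
            lfns = [line.strip() for line in fh if line.strip()]
        dids = [{"scope": scope, "name": lfn} for lfn in lfns]
        for replica in client.list_replicas(dids):
            # 'rses' maps each RSE holding the file to the PFNs there.
            for rse, pfns in replica["rses"].items():
                suspects = [pfn for pfn in pfns if looks_corrupted(pfn)]
                if suspects:
                    client.declare_suspicious_file_replicas(
                        suspects, "CRAB read-failure report %s" % report)

if __name__ == "__main__":
    process_reports()
```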
With my site admin hat on, I would also advocate for clearer information on when a job fails (whether production or analysis) due to local or remote reads. |
I finally found my notes from a discussion with @ivmfnal and Cedric regarding this, so I paste them here. This would help us declare suspicious replicas. What we'd need to start investigating again is how to move things from suspicious to bad. There was a way to do that too and I've forgotten.
Here is the part of the code responsible for the declaration in the conveyor: https://github.com/rucio/rucio/blob/master/lib/rucio/daemons/conveyor/finisher.py#L301-L331 |
I will report in here as well what I wrote in the MM channel. I suggest collecting more experience before doing more work. The fear (at least from my side) was that there are way more problems than the few that users report. But if we stay at a handful a year, manual action is less effort than we have already spent. I have plans to write a script to facilitate suspicious-file checking, so that users (or the CRAB team) can report trustable information to DataManagement operators, who have the needed permissions to mark replicas bad. |
@ericvaandering based on my conversation with Cedric, some time ago I made this proposal. The idea is to use the replica recoverer to move replicas from suspicious to bad. Here is my summary on how to configure the replica recoverer: #403 (comment) |
I had the wrong permission in the directory where read failure error reports are collected. |
I have now set up an automated reporting pipeline, and I am waiting for suggestions from DM ops and experts in here on how to progress. I think I am stuck since:
Overall I think that, with respect to ATLAS, we have:
So we need either a smarter Replica Recoverer daemon or something in the middle like what I scripted here, but I have gotten very cold about what to expect from current Rucio. The Replica Recoverer could have a great role for replicas found bad or missing during transfers; I am quite puzzled that there is no automatism there yet. In any case, I have reached the limit of what CRAB support can do. Please step in. |
Since my time and patience are limited, I have set up crontabs to update all of that daily. IMHO it would make a lot of sense for DM operators to have a look daily and, if they agree, declare the replicas BAD as suggested, after any additional checks that they consider relevant. I hope with this I can stop answering user reports of corrupted files. |
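For the operator side, the final step is a single client call. A minimal sketch, assuming the operator account has the permission to declare replicas bad; PFNs and reason are placeholders.

```python
from rucio.client.replicaclient import ReplicaClient

def declare_bad(pfns, reason):
    """Declare replicas bad so Rucio can recover them from another copy, if one exists."""
    client = ReplicaClient()
    return client.declare_bad_file_replicas(pfns, reason)

# Hypothetical usage by a DM operator after the daily review:
# declare_bad(
#     ["davs://storage.example:443//store/data/.../file.root"],
#     "Corruption confirmed from CRAB read-failure reports",
# )
```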
I'll close this one as we have a roadmap. Will track this effort at #805
Would need something done with traces (declaring things suspicious, maybe some logic to deal with xrootd and specific exit codes?)
Then need to run the replica recoverer daemon.
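For the exit-code logic hinted at above, a minimal decision sketch follows; the exit codes and their interpretation are assumptions to be validated against CMSSW/WMCore documentation, not an agreed policy.

```python
# Assumed CMSSW exit codes related to file access (to be verified against CMSSW docs):
#   8020 / 8028 - file open errors (local / fallback), 8021 - file read error.
READ_ERROR_CODES = {8021}
OPEN_ERROR_CODES = {8020, 8028}

def should_declare_suspicious(exit_code, remote_read):
    """Decide whether a failed job warrants a suspicious-replica declaration."""
    if exit_code in READ_ERROR_CODES:
        # The file was opened but could not be read: likely truncation or corruption.
        return True
    if exit_code in OPEN_ERROR_CODES and not remote_read:
        # A local open failure points more at the replica than at the network.
        return True
    return False
```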