Investigate automatic suspicious replica recovery #403
Comments
@ericvaandering , |
@yuyiguo @ericvaandering I guess I am the one who started this. Shall we try to have a quick chat on zoom ? |
Yes, I am available today. Let me know when we can chat. |
well... today I had things to do. Let's plan it a little bit so that Eric may also join. 10 min should suffice |
what about tomorrow in your 8-10am window ? |
Adding some description after a chat with Eric and Yuyi (Katy was also there), where we start from:
- what we want to do
- things we know
- things we do not know, but want to know
- things we should do (i.e. The Plan)
|
Did this issue get followed up somewhere else? Or is it just stale and we still need to validate @belforte's proposal with the various tasks? |
It is still on my to-do list and I am not tracking it elsewhere. I should. Then this can be put on hold until I have a proposal. |
currently this breaks up as
|
This is on my to-do list too, but at low priority. |
The first thing is to flag jobs which hit corrupted files, and monitor them, so that we can quantify the problem. |
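As a rough illustration of what such flagging/monitoring could look like, here is a minimal Python sketch that tallies cmsRun fatal-exception categories found in job stderr. The category names (FileReadError, FileOpenError, FallbackFileOpenError), the message pattern, and the per-job stderr file layout are assumptions to be checked against real CMSSW output; this is not an existing CRAB or WMAgent feature.

```python
import re
import sys
from collections import Counter

# Exception categories that usually point at a file-access problem.
# The names are assumptions; verify against actual cmsRun fatal-exception blocks.
FILE_ACCESS_CATEGORIES = {"FileReadError", "FileOpenError", "FallbackFileOpenError"}

CATEGORY_RE = re.compile(r"An exception of category '(\w+)' occurred")

def quantify(stderr_paths):
    """Count fatal-exception categories across a set of cmsRun stderr files."""
    counts = Counter()
    flagged = []
    for path in stderr_paths:
        with open(path, errors="replace") as fh:
            categories = CATEGORY_RE.findall(fh.read())
        counts.update(categories)
        if FILE_ACCESS_CATEGORIES.intersection(categories):
            flagged.append(path)
    return counts, flagged

if __name__ == "__main__":
    counts, flagged = quantify(sys.argv[1:])
    print("exception categories:", dict(counts))
    print("jobs that possibly hit a corrupted file:", len(flagged))
```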
What is the definition of "suspicious replicas"? If a transfer fails, FTS will try to retransfer it. If the failure is permanent, how can the Replica Recoverer daemon fix it? And why would CMSSW or processing jobs read a suspicious replica? |
Hi @yuyiguo. Let me try to list here what I (think that I) understand; hopefully it answers some of your questions.
Hope this helps! [1]
|
btw, the file in my example above has been fixed by Felipe and download is now OK. |
Some time ago I wrote this document with a proposal for how we can handle this. |
Basically here is my proposal:
I would like to get some feedback on this proposal. Perhaps we can add more details to it. |
Thanks Igor, and apologies for not having replied earlier. I would like to see this at work from automatic/automated tools for a while before we think about enabling users; at that point we may have to introduce some way to "trust machines more than humans", IMHO. One thing that I expect we can talk about later, but let me mention now: we can surely resume this once I have code which parses CMSSW stderr! |
Stefano @belforte, thanks for the reply and the feedback. I appreciate it. I was thinking that it would make sense to have another meeting (I think we had one already some time ago) among the involved people to re-sync, discuss use cases and maybe come up with an action plan. I think we need to get at least @yuyiguo @dynamic-entropy (Rahul) @klannon there. I would invite @ericvaandering too, but he is on vacation. Who am I missing? |
@belforte said: "we can detect both missing and possibly corrupted files, and tell one from the other." I think it is important to differentiate between several types of failures. I would add another dimension to this:
My understanding of the problem is that we want to use the "suspicious" replica state in the first case for a while before we declare the replica "bad" if things do not improve, whereas if we believe the error is not recoverable, we go straight to the "bad" replica state. |
I do not think we need a (longish) meeting now. I'd like to have code which parses stdout/stderr and does a few "mark as suspicious" calls first. There may be questions arising during that, which we can address as needed. In a way, I have my action plan. As to recoverable vs. non-recoverable: yes, I know, we already discussed it in the meeting where you first presented this. The problem here is how to be sure that the specific error is really a bad file, not a transient problem in the storage server: is the file really truncated, or was a connection dropped somewhere? So again, I'd like to get experience with the simpler path first. All in all, CRAB already retries file read failures 3 times (and WMA 10, IIUC), so if we e.g. say "3 times suspicious = bad", it may be good enough. |
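As a sketch of what the "mark as suspicious" step could look like (not the final CRAB code), the Rucio Python client exposes ReplicaClient.declare_suspicious_file_replicas. The PFN and reason below are placeholders, and the "3 times suspicious = bad" threshold would be enforced later on the recovery side, not by this call.

```python
from rucio.client.replicaclient import ReplicaClient

def mark_suspicious(pfns, reason):
    """Declare the given replica PFNs suspicious in Rucio.

    Repeated declarations accumulate; promoting a replica from suspicious to
    bad is a separate step (replica recoverer daemon or an operator).
    """
    client = ReplicaClient()
    return client.declare_suspicious_file_replicas(pfns, reason)

# Hypothetical usage from a stdout/stderr parser:
# mark_suspicious(
#     ["root://some-site.example//store/data/Run2023X/.../file.root"],
#     "cmsRun FileReadError reported by CRAB job retries",
# )
```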
Just as FYI, here is the suspicious replica recoverer config for ATLAS:
|
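Purely as an illustration of the shape such a policy can take (the actual ATLAS values are not shown here), Rucio ships an example policy file, suspicious_replica_recoverer.json in the Rucio repository as far as I recall, where each entry maps datatypes to an action for the recoverer daemon. The sketch below writes a policy of that assumed form from Python; the datatypes and action strings are placeholders rather than the ATLAS (or a proposed CMS) policy.

```python
import json

# Assumed entry format, modelled on the example suspicious_replica_recoverer.json
# shipped with Rucio: one action per datatype group. Values here are placeholders.
policy = [
    {"action": "ignore", "datatype": ["log"], "scope": []},
    {"action": "declare bad", "datatype": [], "scope": []},
]

with open("suspicious_replica_recoverer.json", "w") as fh:
    json.dump(policy, fh, indent=4)
```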
or do a rucio get, which checks the checksum
…On 11/09/2023 16:03, Eric Vaandering wrote:
The “daemon” you suppose would try reading every file directly over
xrootd to find the corrupted one(s)?
|
Would it not be easier to have CRAB be specific about which replica it failed to read? |
CRAB needs CMSSW to tell it whether the error happened opening a local file or a fallback one. |
I know. |
which means: I can run that daemon! Yes, I know. I may have to do something of that sort anyhow. |
I wasn't suggesting that you need to write or run it. Just the process you are suggesting. WMAgent will probably run into the same issue. So I understand the problem: CRAB gets this error from CMSSW but CMSSW does not give enough information to know if the file is read locally or remotely? Then, even if we knew "remotely" I could imagine problems knowing which remote file was read. CMSSW may not have any way of knowing that. |
Correct. Usually when xrootd succeeds in opening a remote file, the new PFN is printed in the cmsRun log, but much to my disappointment, when the file open fails, nothing gets printed. Of course many times the file will be opened locally, and the site where the job runs is sufficient information. But I am not sure how to reliably tell. I think we need some "exploratory work", so I am going to simply report "suspected corruptions" as files on CERN EOS, to be consumed by some script (e.g. a crontab) which can cross-check and possibly make a Rucio call, so that the script can check the multiple replicas, if any. IIUC Dima's plans, NANO* aside, we are going to have a single disk copy of files. |
(Somehow) later on we can think of incorporating all that code into something that parses cmsRun stdout and makes a call to Rucio on the fly. But I am not sure that we want to run the Rucio client on worker nodes; it should be available, though, since it is on CVMFS. |
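A sketch of the kind of cron-driven consumer described above: read reported LFNs from a drop area, ask Rucio where the replicas live, and declare them suspicious. The drop-area path, the report file format, and the cross-check stub are all hypothetical; only the ReplicaClient calls are existing Rucio client methods.

```python
import glob
from rucio.client.replicaclient import ReplicaClient

REPORT_GLOB = "/eos/cms/store/temp/crab_read_failures/*.txt"  # hypothetical drop area

def looks_corrupted(pfn):
    """Placeholder for a real cross-check (re-read the file, compare checksums, ...)."""
    return True

def process_reports(scope="cms"):
    client = ReplicaClient()
    for report in glob.glob(REPORT_GLOB):
        with open(report) as fh:
            lfns = [line.strip() for line in fh if line.strip()]
        dids = [{"scope": scope, "name": lfn} for lfn in lfns]
        for replica in client.list_replicas(dids):
            # 'rses' maps each RSE holding the file to the PFNs there.
            for rse, pfns in replica["rses"].items():
                suspects = [pfn for pfn in pfns if looks_corrupted(pfn)]
                if suspects:
                    client.declare_suspicious_file_replicas(
                        suspects, "CRAB read-failure report %s" % report)

if __name__ == "__main__":
    process_reports()
```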
With my site admin hat on, I would also advocate for clearer information on when a job fails (whether production or analysis) due to local or remote reads. |
I finally found my notes from a discussion with @ivmfnal and Cedric regarding this, so I paste them here. This would help us declare suspicious replicas. What we'd need to start investigating again is how to move things from suspicious to bad. There was a way to do that too and I've forgotten.
Here is the part of the code responsible for the declaration in the conveyor: https://github.com/rucio/rucio/blob/master/lib/rucio/daemons/conveyor/finisher.py#L301-L331 |
I will report in here as well what I wrote in the MM channel. I suggest collecting more experience before doing more work. The fear (at least from my side) was that there are way more problems than the few that users report. But if we stay at a handful a year, manual action is less effort than we have already spent. I have plans to write a script to facilitate suspicious-file checking, so that users (or the CRAB team) can report trustable information to DataManagement operators, who have the needed permissions to mark replicas bad. |
@ericvaandering based on my conversation with Cedric, some time ago I made this proposal. The idea is to use the replica recoverer to move replicas from suspicious to bad. Here is my summary on how to configure the replica recoverer: #403 (comment) |
I had the wrong permission in the directory where read failure error reports are collected. |
I have now set up an automated reporting pipeline, and I am waiting for suggestions from DM ops and experts in here on how to progress. I think I am stuck since:
Overall I think that, with respect to ATLAS, we have:
So we need either a smarter Replica Recoverer daemon or something in the middle like what I scripted here, but I have gotten very cold about what to expect from current Rucio. The Replica Recoverer could have a great role for replicas found bad or missing during transfers; I am quite puzzled that there is no automatism there yet. In any case, I have reached the limit of what CRAB support can do. Please step in. |
Since my time and patience are limited, I have set up crontabs to update all of that daily. IMHO it would make a lot of sense for DM operators to have a look daily and, if they agree, declare the replicas BAD as suggested, after any additional checks that they consider relevant. I hope with this I can stop answering user reports of corrupted files. |
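For the operator side, the final step is a single client call. A minimal sketch, assuming the operator account has the permission to declare replicas bad; PFNs and reason are placeholders.

```python
from rucio.client.replicaclient import ReplicaClient

def declare_bad(pfns, reason):
    """Declare replicas bad so Rucio can recover them from another copy, if one exists."""
    client = ReplicaClient()
    return client.declare_bad_file_replicas(pfns, reason)

# Hypothetical usage by a DM operator after the daily review:
# declare_bad(
#     ["davs://storage.example:443//store/data/.../file.root"],
#     "Corruption confirmed from CRAB read-failure reports",
# )
```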
I'll close this one as we have a roadmap. Will track this effort at #805
Would need something done with traces (declaring things suspicious, maybe some logic to deal with xrootd and specific exit codes?)
Then need to run the replica recoverer daemon.
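For the exit-code logic hinted at above, a minimal decision sketch follows; the exit codes and their interpretation are assumptions to be validated against CMSSW/WMCore documentation, not an agreed policy.

```python
# Assumed CMSSW exit codes related to file access (to be verified against CMSSW docs):
#   8020 / 8028 - file open errors (local / fallback), 8021 - file read error.
READ_ERROR_CODES = {8021}
OPEN_ERROR_CODES = {8020, 8028}

def should_declare_suspicious(exit_code, remote_read):
    """Decide whether a failed job warrants a suspicious-replica declaration."""
    if exit_code in READ_ERROR_CODES:
        # The file was opened but could not be read: likely truncation or corruption.
        return True
    if exit_code in OPEN_ERROR_CODES and not remote_read:
        # A local open failure points more at the replica than at the network.
        return True
    return False
```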