-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle gfal exceptions while listing baseDirEntry && Avoid extra stat operations during recursion #11794
Handle gfal exceptions while listing baseDirEntry && Avoid extra stat operations during recursion #11794
Conversation
Jenkins results:
|
Hi @vkuznet, thanks for this review. I have answered your comments inline. If you think anything else is worth attention please let me know. |
Jenkins results:
|
Hi @amaltaro @vkuznet with my latest commit I actually found a way to reduce recursive operations during the RSE cleaning process even more. With the changes suggested here: #1178 we are to delete the whole branch of empty directories under a specific entry in the site's namespace. But for some of the sites we are actually able to use In my latest commit I am doing this already and now this (for the sites, which allow this operation to happen) is an extreme speedup. This one is a real game changer (again under the constraint: for the sites where it works). I tested it with T2_IT_Legnaro: https://cmsweb.cern.ch/ms-unmerged/data/info?rse=T2_IT_Legnaro&detail=True And the results were really good. The runtime was reduced to just few min. It also reduces the logs [1], because we do not dump into the logs the deletion result of every file present in the directory in the cases when we succeed to remove it from the top. Unfortunately there will be no change for those who enter the recursive operations, though. [1]
Uploading the RSE contents to MongoDB:
Second attempt to clean it:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, please find a couple of enhancement suggestions along the code.
And while working on this bugix I can now log the full sequence of events which happen when a random error from In [1] we enter a long sequence of:
Maybe we should consider gfal retries. What do you think @amaltaro @vkuznet [1]
[2]
|
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, I know you have been testing this code in your development environment.
So I would say to go ahead and merge it whenever you feel confortable with the changes. Before deploying in production, we need to backport it to the correct branch and cut a new patch release, basically what I explained here:
#11781 (review)
Thanks @amaltaro I'll do it tomorrow morning |
Hi @amaltaro, I am squashing and merging this PR at this stage. There are few more improvements for which I already have a prototype in my dev environment, like handling some more FYI: @vkuznet |
… operations during recursion. Fix rewriting timestamps for all documents in the REST interface. Trying to remove nonEmpty directories as early as possible. Adding _rmDir auxiliary method Fix broken log message && Add gfall error code to the error log messages Avoid duplicate attempts to enter and already missing directory. Update filesDeletedFail counters.
034d744
to
e59ac60
Compare
Jenkins results:
|
Fixes #11793
Fixes #11701
Status
READY
Description
As a natural consequence of the way how the operations are reordered for the purpose of the current bug fixing, we also improve the MSUnmerged service in at least 3 different aspects:
File Not Found
exception explicitly during thebaseDirEntry
listing and not relying on thestat
operation to catch a missing file.stat
operations during recursive calls when we are already sure the entry point is a directoryIn Addition:
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
This PR is actually also contributing to the fix of the following issue: #11701 which was partially fixed by: #11781
External dependencies / deployment changes
None