-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Abandon storage.xml in favor of storage.json for stage-in and stage-out (after reverting merging #11816 to master) #11869
Abandon storage.xml in favor of storage.json for stage-in and stage-out (after reverting merging #11816 to master) #11869
Conversation
… (storage.json): chained rules, add more tests
…are missing attributes in stage out (happened in this loop: for stageOut in self.stageOuts)
Jenkins results:
|
Fail WMCore_t.Services_t.Rucio_t.Rucio_t.RucioTest:testGetPFN with error |
You mean from your development branch? Or where do you think this error comes from? @nhduongvn Duong, @khurtado (Kenyi) is taking over these tests and review for the moment. Please let us know if you need any assistance and/or if this PR is ready for another review based on the questions/requests made in #11816 |
Yes, Jenkin tests run on my development branch but it is the same as dmwm/WMCore master branch. I did not touch these codes at all.
|
The Rucio_t.py tests are not related to my codes but if you want I can go ahead and change it |
@nhduongvn Regarding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nhduongvn
Could you please add the references to the methods that are transferred from the WMCore master branch in the description?
I know that they are referenced in the previous PR already, but it would be good to have them here as well rather than in a reference, since this will be the actual final PR.
@@ -253,68 +255,11 @@ def deletePFN(self, pfn, lfn, command): | |||
impl.retryPause = self.retryPauseTime | |||
|
|||
try: | |||
impl.removeFile(pfn) | |||
if not self.bypassImpl: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up from here:
https://github.com/dmwm/WMCore/pull/11816/files/4268fd37c60b9d1f14aa40790ef07d4839143b7f#r1448938754
One way to change the implementation of this method just for the unit test, without altering the actual code is to use emulators.
Here is an example:
We have the following function "loadCouchID" that we do not want to use in the unit tests.
WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py
Lines 183 to 202 in d14d0d9
def loadCouchID(self, configDoc=None, configCacheUrl=None, couchDBName=None): | |
""" | |
Load a config document from couch db and return the object | |
:param configDoc: the config ID document | |
:param configCacheUrl: couch url for the config | |
:param couchDBName: couch database name | |
:return: config document object | |
""" | |
# TODO: Evaluate if we should call validateConfigCacheExists here | |
configCache = None | |
if configDoc is not None and configDoc != "": | |
if (configCacheUrl, couchDBName) in self.config_cache: | |
configCache = self.config_cache[(configCacheUrl, couchDBName)] | |
else: | |
configCache = ConfigCache(configCacheUrl, couchDBName, True) | |
self.config_cache[(configCacheUrl, couchDBName)] = configCache | |
configCache.loadByID(configDoc) | |
return configCache |
So, a new class is created for the unit tests, which inherits the whole class but overrides some functions like loadCouchID.
So then in the unit test, we call that instead of the original TaskChainWorkloadFactory() object, e.g.:
from WMQuality.Emulators.WMSpecGenerator.Samples.BasicProductionWorkload import getProdArgs, taskChainWorkload |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, a child class is added to unit test script Delete_t.py. The bypassImpl
flag is removed from parent class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I like your idea to decouple inference of unit test in the main classes by removing bypassImpl
and using inherited classes for testing, this requires to copy the relevant codes from main classes to inherited ones every time we want to make a modification. I mean new changes in the main classes will not be automatically tested by running the unit test. I write here so that new developers know what to do and avoid confusions.
|
||
def matchPFN(self, protocol, pfn): | ||
""" | ||
_matchPFN_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up from:
https://github.com/dmwm/WMCore/pull/11816/files/4268fd37c60b9d1f14aa40790ef07d4839143b7f#r1446308838
The docstring in this function is using the outdated format. Could you please update it to the new one?
https://github.com/dmwm/WMCore/blob/master/CONTRIBUTING.rst#project-docstrings-best-practices
i.e.:
- Removing matchPFN
- Adding
:param protocol: description, :param pfn: description
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
self.stageOuts = self.siteCfg.stageOuts | ||
|
||
msg = "" | ||
msg += "There are %s stage out definitions.\n" % len(self.stageOuts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need for the empty string msg at the beginning? Why not simply:
msg = "There are %s stage out definitions.\n" % len(self.stageOuts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
def addMapping(self, protocol, match, result, | ||
chain=None, mapping_type='lfn-to-pfn'): | ||
""" | ||
_addMapping_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please implement the new dosctring format from there?
https://github.com/dmwm/WMCore/blob/master/CONTRIBUTING.rst#project-docstrings-best-practices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
def matchLFN(self, protocol, lfn): | ||
""" | ||
_matchLFN_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here regarding docstring format
https://github.com/dmwm/WMCore/blob/master/CONTRIBUTING.rst#project-docstrings-best-practices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
def readRFC(filename, storageSite, volume, protocol): | ||
""" | ||
_readRFC_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for new docstring format
https://github.com/dmwm/WMCore/blob/master/CONTRIBUTING.rst#project-docstrings-best-practices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
for subnode in node.children: | ||
subSiteName = report['subSiteName'] if 'subSiteName' in report.keys() else None | ||
aStorageSite = subnode.attrs.get('site', None) | ||
if aStorageSite is None: aStorageSite = report['siteName'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please break this if statement in two lines to follow standard pep8 syntax?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
localReport['storageSite'] = aStorageSite | ||
localReport['command'] = subnode.attrs.get('command', None) | ||
#use default command='gfal2' when 'command' is not specified | ||
if localReport['command'] is None: localReport['command'] = 'gfal2' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with this if statement (two lines)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Hi @khurtado , I have implemented your suggestions. Are you done with reviewing? |
@nhduongvn I am, yes. However, I could not see your changes in the PR. There are no new commits to the PR and I couldn't see any changes in your |
@khurtado , I pushed my updates |
Jenkins results:
|
Test this please |
Jenkins results:
|
@chwissing I am trying test this PR against KIT-HOREKA. Can you help me with this? After selecting KIT as a site, how can I match against this subsite? EDIT: I was able to run at KIT-T3 which is also a subsite, so that should be enough for the test. |
@amaltaro I ran into an issue here that I am not 100% sure if it is related to this PR or not. It has to do with site:
If I do it on an agent without this patch and version 2.3.0.1, I do not get any errors. However, the wmagentJob.log logfiles look pretty similar, the patched version seems to show the correct mappings and the jobs end in a similar fashion, with all stageout commands successfully. The only errors I see are related to logArch1 but both jobs (with patch and no patch) show the same errors in that sense. Do you have any idea of why or how to look more info on this error? For reference, here a wmagentJob.log for the patched agent: ANd here is one for the agent without the patch: |
Okay, here is one difference: With the patch, I see:
And without the patch:
Somehow, the Patch:
No patch:
I don't know if that explains the error from the previous thread, but it's something strange. Looking into the
Could this be a bug on the storage.json file rather than the patch? Hence the non-patched version working because it's not using the same mapping pattern? |
Do you know if local protocols are supposed to start with E.g.: For T3_ES_PIC_BSC
I do see other sites have a different configuration, so maybe this is a site bug in the configuration?
|
Hi @khurtado , |
It looks to me the configuration is not correct, but I do not know for sure that whether the |
Hallo Alan, Duong, |
In BSC case, they do a local copy, so I suppose using
But would you say then, that the config should use |
Yes, if the filesystem is mounted on all worker nodes and permissions are setup
|
So, both "file:/..." and "file:///..." are correct. First syntax omits the endpoint part, the later leaves it blank.
|
Does it mean that the |
I suspect the "cp" implementation isn't right. Why are we not using gfal also for file protocol which i suspect will handle the slashes properly. (The cp might be a pre-grid implementation.)
|
@stlammel This T3 is special in the sense that BSC is not actually part of the Global Pool and they launch local pilots there. For stageout, they first only copy the files to a local disk and then they have a custom made component that intercepts the stageout signal and proceeds to scp the files to T1_ES_PIC_Disk. For some reason, the "cp" part does not trigger any errors but I suspect the files are not properly copied (they are probably copied in the same current dir as 'file:' give this from the logfiles:
So then, when the scp tries to proceeds, it can't find the files and sends an error message. So, just to confirm, would you say that if the stage out command they use in site-local-config.xml is like this
Then, the output protocol prefix should remove
|
Hallo Kenyi, |
@stlammel Thank you! I do not know at which point they do the scp to be honest, but I have updated the GGUS ticket with your suggestion (moving to gfal instead or removing the file:///), since I think that should solve the issue. @amaltaro @todor-ivanov Given that the BSC case seems indeed like a site configuration problem and not a problem with this PR and given that I'm happy enough with my other tests, I am requesting one more review to this PR before it can be merged. Here is a summary of the tests
|
Just for the record, removing I'm marking #11703 as in waiting, since it's only pending an additiona PR review to be merged. I'm done with the functionality tests. |
test this please |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nhduongvn @khurtado I left a few comments along the code that needs to be considered.
I would also invite you to look at those errors reported in the Jenkins report, for instance:
test/python/WMCore_t/Storage_t/StageOutMgr_t.py
E0602, line 39 in StageOutMgrTest.stageOut: Undefined variable 'StageOutFailure'
and reports like
W0611, line 17: Unused SiteLocalConfig imported from WMCore.Storage.SiteLocalConfig
can be considered as well. Or we can clean those up in another iteration once it gets merged.
I wanted to note though that these are false notifications:
src/python/WMCore/Storage/StageOutMgr.py
W0611, line 16: Unused import WMCore.Storage.Backends
W0611, line 17: Unused import WMCore.Storage.Plugins
Please do NOT remove these import because they are actually needed.
|
||
self.stageOuts = self.siteCfg.stageOuts | ||
|
||
msg = "There are %s stage out definitions.\n" % len(self.stageOuts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can see, this log line, line 102 and maybe a few more will only be logged if we have an error or exception. Can you please revisit it and also print things out in info mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Use self.logger.info
if 'option' in overrideConf: | ||
if len(overrideConf['option']) > 0: | ||
overrideParams['option'] = overrideConf['option'] | ||
if 'option' in self.overrideConf and self.overrideConf['option'] is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify this line like: if self.overrideConf.get('option') is not None:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is much better
raise StageOutInitError( msg ) | ||
self.stageOuts = self.siteCfg.stageOuts | ||
|
||
msg = "\nThere are %s stage out definitions." % len(self.stageOuts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment on this msg variable, which is only printed in case there is an exception. Can you please revisit it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
amsg += '\t'+stageOutStr(stageOut) + '\n' | ||
amsg += str(ex) | ||
msg += amsg | ||
print(amsg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace print by logging library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
options = self.siteCfg.localStageOut.get('option', None) | ||
pfn = self.searchTFC(lfn) | ||
protocol = self.tfc.preferredProtocol | ||
if stageOut_rfc is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could change this line to if not stageOut_rfc:
, such that it can capture a potential empty list as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
msg += str(ex) | ||
self.stageOuts = self.siteCfg.stageOuts | ||
|
||
msg = "\nThere are %s stage out definitions." % len(self.stageOuts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment on this msg variable, which will only happen to get printed if there is an exception in the code. Please revisit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@nhduongvn Duong, I don't mean to put any pressure on you :), but if we can converge on this before the weekend, it will go in in the March upcoming release candidate for WMAgent. Let us know if you need any assistance on the remaining work. |
It is actually a good news! I can definitely finish addressing your reviews today (planned to work on these this afternoon then the power went out 😞) |
@nhduongvn Duong, if you feel confortable squash the commits in this PR, please have a look at the "Step 10" in this section: https://github.com/dmwm/WMCore/blob/master/CONTRIBUTING.rst#contributing , it has a nice reference for squashing commits, and go ahead. |
Jenkins results:
|
Done! |
Please let me know if you are happy with new updates. I can try to squash commits but honestly I am not very comfortable doing this. Last time I did with Todor, it took me 2 days. |
No problem @nhduongvn , instead of "Merge pull request", I am going to "Squash and merge", as test/* and src/* changes are already all together in the same commits. I will then follow this up with further aesthetic changes reported by jenkins. @khurtado can you please get the relevant pieces of this stage out/in mechanism based on the JSON storage description documented in https://cms-wmcore.docs.cern.ch/? You probably want to use the gdoc that Duong has circulated so far. |
Fixes #11703
Status
In development
Description
These are changes to continue reviewing and testing for code changes to adapt stage-in and stage-out for new storage description in storage.json. The first merging to master of PR #11816 was reverted (#11857) to allow further reviewing and testing.
Is it backward compatible (if not, which system it affects?)
NO
Related PRs
Debug PR with failed workflow test #11790
First merged PR #11816
Reverted PR #11857