
Abandon storage.xml in favor of storage.json for stage-in and stage-out (after reverting the merge of #11816 into master) #11869

Merged
merged 7 commits into dmwm:master
Feb 29, 2024

Conversation

nhduongvn
Collaborator

Fixes #11703

Status

In development

Description

These changes continue the review and testing of the code that adapts stage-in and stage-out to the new storage description in storage.json. The first merge of PR #11816 into master was reverted (#11857) to allow further review and testing.

Is it backward compatible (if not, which system does it affect)?

NO

Related PRs

Debug PR with failed workflow test #11790
First merged PR #11816
Reverted PR #11857

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests deleted
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 40 warnings and errors that must be fixed
    • 7 warnings
    • 215 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 321 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14775/artifact/artifacts/PullRequestReport.html

@nhduongvn
Collaborator Author

The unit test WMCore_t.Services_t.Rucio_t.Rucio_t.RucioTest:testGetPFN fails with the error
'davs://eoscms.cern.ch:443/eos/cms/store' != 'gsiftp://eoscmsftp.cern.ch:2811/eos/cms/store/test/rucio/int/store'
It looks like the prefix here has not been updated:
https://github.com/nhduongvn/WMCore/blob/ccfc286d7c81f8b47beb26ed97096a8e442adbda/test/python/WMCore_t/Services_t/Rucio_t/Rucio_t.py#L239

@amaltaro
Contributor

amaltaro commented Jan 22, 2024

It looks like the prefix here has not been updated:
https://github.com/nhduongvn/WMCore/blob/ccfc286d7c81f8b47beb26ed97096a8e442adbda/test/python/WMCore_t/Services_t/Rucio_t/Rucio_t.py#L239

You mean from your development branch? Or where do you think this error comes from?

@nhduongvn Duong, @khurtado (Kenyi) is taking over these tests and review for the moment. Please let us know if you need any assistance and/or if this PR is ready for another review based on the questions/requests made in #11816

@nhduongvn
Collaborator Author

nhduongvn commented Jan 22, 2024

It looks like the prefix here has not been updated:
https://github.com/nhduongvn/WMCore/blob/ccfc286d7c81f8b47beb26ed97096a8e442adbda/test/python/WMCore_t/Services_t/Rucio_t/Rucio_t.py#L239

You mean from your development branch? Or where do you think this error comes from?

@nhduongvn Duong, @khurtado (Kenyi) is taking over these tests and review for the moment. Please let us know if you need any assistance and/or if this PR is ready for another review based on the questions/requests made in #11816

Yes, the Jenkins tests run on my development branch, but it is the same as the dmwm/WMCore master branch. I did not touch this code at all.
my development: https://github.com/nhduongvn/WMCore/blob/23337348d509cfd3039b02fdfa0c9ee96177fbbf/test/python/WMCore_t/Services_t/Rucio_t/Rucio_t.py#L239
master

cernTestbedPrefix = 'gsiftp://eoscmsftp.cern.ch:2811/eos/cms/store/test/rucio/int'

@nhduongvn
Collaborator Author

I implemented all of your review comments from #11816 in these new commits. The code is ready for further testing or another review round if you want. After talking to you and Stephan last week, my understanding is that we will proceed with tests once I have implemented all of your comments from #11816.

@nhduongvn
Collaborator Author

The Rucio_t.py tests are not related to my code, but if you want I can go ahead and change them.

@khurtado
Contributor

khurtado commented Jan 26, 2024

The Rucio_t.py tests are not related to my code, but if you want I can go ahead and change them.

@nhduongvn Regarding the WMCore_t.Services_t.Rucio_t.Rucio_t.RucioTest:testGetPFN failure, I see the exact same error in Jenkins for other PRs unrelated to this one, so please ignore this unit test.

Contributor

@khurtado khurtado left a comment


Hi @nhduongvn

Could you please add the references to the methods that are transferred from the WMCore master branch in the description?
I know that they are referenced in the previous PR already, but it would be good to have them here as well rather than in a reference, since this will be the actual final PR.

#11816

src/python/WMCore/Storage/DeleteMgr.py
@@ -253,68 +255,11 @@ def deletePFN(self, pfn, lfn, command):
impl.retryPause = self.retryPauseTime

try:
impl.removeFile(pfn)
if not self.bypassImpl:
Contributor

Following up from here:
https://github.com/dmwm/WMCore/pull/11816/files/4268fd37c60b9d1f14aa40790ef07d4839143b7f#r1448938754

One way to change the implementation of this method just for the unit test, without altering the actual code, is to use emulators.

Here is an example:

We have the following function "loadCouchID" that we do not want to use in the unit tests.

def loadCouchID(self, configDoc=None, configCacheUrl=None, couchDBName=None):
    """
    Load a config document from couch db and return the object
    :param configDoc: the config ID document
    :param configCacheUrl: couch url for the config
    :param couchDBName: couch database name
    :return: config document object
    """
    # TODO: Evaluate if we should call validateConfigCacheExists here
    configCache = None
    if configDoc is not None and configDoc != "":
        if (configCacheUrl, couchDBName) in self.config_cache:
            configCache = self.config_cache[(configCacheUrl, couchDBName)]
        else:
            configCache = ConfigCache(configCacheUrl, couchDBName, True)
            self.config_cache[(configCacheUrl, couchDBName)] = configCache
        configCache.loadByID(configDoc)
    return configCache

So, a new class is created for the unit tests, which inherits the whole class but overrides some functions like loadCouchID.

https://github.com/khurtado/WMCore/blob/8c43760377a072847395087e7b9397181d0a0db7/src/python/WMQuality/Emulators/WMSpecGenerator/Samples/BasicProductionWorkload.py#L99-L122

So then in the unit test, we call that instead of the original TaskChainWorkloadFactory() object, e.g.:

from WMQuality.Emulators.WMSpecGenerator.Samples.BasicProductionWorkload import getProdArgs, taskChainWorkload

Collaborator Author

Done. A child class is added to the unit test script Delete_t.py, and the bypassImpl flag is removed from the parent class.
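
A minimal sketch of that pattern, with illustrative names only (the real test class lives in Delete_t.py and may differ):

from WMCore.Storage.DeleteMgr import DeleteMgr

class DeleteMgrForTest(DeleteMgr):
    """Test-only variant: overrides the method that calls the real storage implementation."""

    def deletePFN(self, pfn, lfn, command):
        # Record the request instead of invoking impl.removeFile(pfn)
        self.deletedPFNs = getattr(self, "deletedPFNs", [])
        self.deletedPFNs.append((pfn, lfn, command))
        return pfn

The unit test then instantiates DeleteMgrForTest instead of DeleteMgr, so the production class no longer needs a bypassImpl flag.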

Collaborator Author

While I like your idea of decoupling the unit-test interference from the main classes by removing bypassImpl and using inherited classes for testing, it requires copying the relevant code from the main classes to the inherited ones every time we make a modification. In other words, new changes in the main classes will not be automatically tested by running the unit test. I am noting this here so that new developers know what to do and avoid confusion.


def matchPFN(self, protocol, pfn):
    """
    _matchPFN_
Contributor

Following up from:

https://github.com/dmwm/WMCore/pull/11816/files/4268fd37c60b9d1f14aa40790ef07d4839143b7f#r1446308838

The docstring in this function is using the outdated format. Could you please update it to the new one?
https://github.com/dmwm/WMCore/blob/master/CONTRIBUTING.rst#project-docstrings-best-practices

i.e.:

  • Removing the _matchPFN_ title line
  • Adding :param protocol: description, :param pfn: description, etc., as sketched below
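
A possible version in the suggested format (the descriptions below are only illustrative):

def matchPFN(self, protocol, pfn):
    """
    Check whether a PFN matches the rules defined for the given protocol.

    :param protocol: name of the protocol to match against
    :param pfn: physical file name to be matched
    :return: the matching result, or None if nothing matches
    """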

Collaborator Author

done

self.stageOuts = self.siteCfg.stageOuts

msg = ""
msg += "There are %s stage out definitions.\n" % len(self.stageOuts)
Contributor

Is there a need for the empty string msg at the beginning? Why not simply:

msg = "There are %s stage out definitions.\n" % len(self.stageOuts)

Collaborator Author

Done

def addMapping(self, protocol, match, result,
               chain=None, mapping_type='lfn-to-pfn'):
    """
    _addMapping_
Contributor

Could you please implement the new docstring format described here?

https://github.com/dmwm/WMCore/blob/master/CONTRIBUTING.rst#project-docstrings-best-practices

Collaborator Author

done


def matchLFN(self, protocol, lfn):
    """
    _matchLFN_
Contributor

Collaborator Author

done


def readRFC(filename, storageSite, volume, protocol):
    """
    _readRFC_
Contributor

Collaborator Author

done

for subnode in node.children:
    subSiteName = report['subSiteName'] if 'subSiteName' in report.keys() else None
    aStorageSite = subnode.attrs.get('site', None)
    if aStorageSite is None: aStorageSite = report['siteName']
Contributor

Could you please break this if statement into two lines to follow standard PEP 8 style?
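
For example, the same statement split over two lines:

aStorageSite = subnode.attrs.get('site', None)
if aStorageSite is None:
    aStorageSite = report['siteName']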

Collaborator Author

Done

localReport['storageSite'] = aStorageSite
localReport['command'] = subnode.attrs.get('command', None)
#use default command='gfal2' when 'command' is not specified
if localReport['command'] is None: localReport['command'] = 'gfal2'
Contributor

Same with this if statement (two lines)
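
i.e., the same statement split over two lines:

# use default command='gfal2' when 'command' is not specified
if localReport['command'] is None:
    localReport['command'] = 'gfal2'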

Collaborator Author

Done

@nhduongvn
Collaborator Author

Hi @khurtado, I have implemented your suggestions. Are you done with the review?

@khurtado
Contributor

khurtado commented Jan 31, 2024

@nhduongvn I am, yes. However, I could not see your changes in the PR. There are no new commits to the PR and I couldn't see any changes in your stageOutUsingStorageJson_test_b927 local branch. Where did you make these changes? Note if you commit and push these changes to your local branch, the PR will automatically update as well.

@nhduongvn
Collaborator Author

@nhduongvn I am, yes. However, I could not see your changes in the PR. There are no new commits to the PR and I couldn't see any changes in your stageOutUsingStorageJson_test_b927 local branch. Where did you make these changes? Note if you commit and push these changes to your local branch, the PR will automatically update as well.

@khurtado, I pushed my updates.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests deleted
    • 1 tests no longer failing
    • 3 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 75 warnings and errors that must be fixed
    • 8 warnings
    • 261 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 363 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14805/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor

khurtado commented Feb 5, 2024

Test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests deleted
    • 1 tests no longer failing
    • 3 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 75 warnings and errors that must be fixed
    • 8 warnings
    • 261 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 363 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14816/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor

khurtado commented Feb 8, 2024

@chwissing I am trying to test this PR against KIT-HOREKA. Can you help me with this? After selecting KIT as a site, how can I match against this subsite?

EDIT: I was able to run at KIT-T3, which is also a subsite, so that should be enough for the test.

@khurtado
Contributor

khurtado commented Feb 21, 2024

@amaltaro I ran into an issue here that I am not 100% sure is related to this PR or not.

It has to do with the site T3_ES_PIC_BSC.
If I run a small 10-job MC (no pileup) workflow at this site with an agent (v2.3.0) carrying this PR, I get the following error:
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TaskChain_SingleTask20K_khurtado_xmlv1_scaletest20k_240220_213118_6881

StageOutError (Exit Code: 60311)
Error transferring file from BSC to PIC

If I do it on an agent without this patch, running version 2.3.0.1, I do not get any errors.
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TaskChain_SingleTask20K_khurtado_xmlv1_scaletest_nopatch_240220_214046_5592

However, the wmagentJob.log files look pretty similar: the patched version seems to show the correct mappings, and the jobs end in a similar fashion, with all stage-out commands succeeding. The only errors I see are related to logArch1, but both jobs (with and without the patch) show the same errors there.

Do you have any idea why this happens, or how to dig up more information on this error?

For reference, here is a wmagentJob.log for the patched agent:
https://gist.github.com/khurtado/aa147d968275f31d9a5b150614fea010

And here is one for the agent without the patch:
https://gist.github.com/khurtado/ddc80a59d3f4a40a01203f6e10b90ef5

@khurtado
Contributor

khurtado commented Feb 21, 2024

Okay, here is one difference:

With the patch, I see:

Rucio File Catalog has been loaded:
	lfn-to-pfn: protocol=input path-match-re=/(.*) result=root://localhost:1094//$1
	lfn-to-pfn: protocol=output path-match-re=/(.*) result=file:///gpfs/projects/cie12/outputs/$1
	pfn-to-lfn: protocol=input path-match-re=root://localhost:1094//(.*) result=/$1
	pfn-to-lfn: protocol=output path-match-re=file:///gpfs/projects/cie12/outputs/(.*) result=/$1

And without the patch:

Stage out to : T1_ES_PIC_Disk using: cp
Local Stage Out PNN to be used is T1_ES_PIC_Disk
Local Stage Out Catalog to be used is trivialcatalog_file:/gpfs/projects/cie12/cms.conf/SITECONF/local/storage.xml?protocol=output
Trivial File Catalog has been loaded:
	lfn-to-pfn: protocol=input path-match-re=/+store/(.*) result=root://localhost:1094//store/$1
	lfn-to-pfn: protocol=output path-match-re=/+store/(.*) result=/gpfs/projects/cie12/outputs/store/$1
There are 0 fallback stage out definitions.

Somehow, the file:// prefix seems to influence the results, because when I look at the condor output I see an extra file: directory that is not present in the non-patched version:

Patch:

WMTaskSpace/cmsRun1:
total 7412
drwxr-xr-x 17 cie12994 cie12    4096 Feb 21 03:43 CMSSW_10_6_18
-rw-r--r--  1 cie12994 cie12   19484 Feb 21 03:54 FrameworkJobReport.xml
-rw-r--r--  1 cie12994 cie12  113453 Feb 21 03:54 LHEoutput.root
-rw-r--r--  1 cie12994 cie12  652962 Feb 21 03:44 PSet.pkl
-rw-r--r--  1 cie12994 cie12     128 Feb 21 03:44 PSet.py
-rw-r--r--  1 cie12994 cie12     159 Feb 21 03:44 PSetTweak.json
-rw-r--r--  1 cie12994 cie12 6556847 Feb 21 03:54 RAWSIMoutput.root
-rw-r--r--  1 cie12994 cie12   23790 Feb 21 03:54 Report.pkl
-rw-r--r--  1 cie12994 cie12     336 Feb 21 03:43 __init__.py
drwxr-xr-x  2 cie12994 cie12    4096 Feb 21 03:43 __pycache__
-rw-r--r--  1 cie12994 cie12    1682 Feb 21 03:44 cmsRun1-main.sh
-rw-r--r--  1 cie12994 cie12     230 Feb 21 03:54 cmsRun1-stderr.log
-rw-r--r--  1 cie12994 cie12  177537 Feb 21 03:54 cmsRun1-stdout.log
-rw-r--r--  1 cie12994 cie12       4 Feb 21 03:44 process.id

WMTaskSpace/logArch1:
total 224
-rw-r--r-- 1 cie12994 cie12   2279 Feb 21 03:54 Report.pkl
-rw-r--r-- 1 cie12994 cie12    337 Feb 21 03:43 __init__.py
drwxr-xr-x 2 cie12994 cie12   4096 Feb 21 03:54 __pycache__
drwxr-xr-x 3 cie12994 cie12   4096 Feb 21 03:54 file:
-rw-r--r-- 1 cie12994 cie12 210069 Feb 21 03:54 logArchive.tar.gz

WMTaskSpace/stageOut1:
total 16
-rw-r--r-- 1 cie12994 cie12 1437 Feb 21 03:54 Report.pkl
-rw-r--r-- 1 cie12994 cie12  338 Feb 21 03:43 __init__.py
drwxr-xr-x 2 cie12994 cie12 4096 Feb 21 03:54 __pycache__
drwxr-xr-x 3 cie12994 cie12 4096 Feb 21 03:54 file:
======== WMAgent bootstrap FINISH at Wed Feb 21 02:54:42 GMT 2024 ========

No patch:

WMTaskSpace/cmsRun1:
total 7940
drwxr-xr-x 17 cie12994 cie12    4096 Feb 21 03:47 CMSSW_10_6_18
-rw-r--r--  1 cie12994 cie12   19085 Feb 21 03:58 FrameworkJobReport.xml
-rw-r--r--  1 cie12994 cie12  113691 Feb 21 03:58 LHEoutput.root
-rw-r--r--  1 cie12994 cie12  652974 Feb 21 03:48 PSet.pkl
-rw-r--r--  1 cie12994 cie12     128 Feb 21 03:48 PSet.py
-rw-r--r--  1 cie12994 cie12     159 Feb 21 03:48 PSetTweak.json
-rw-r--r--  1 cie12994 cie12 6861530 Feb 21 03:58 RAWSIMoutput.root
-rw-r--r--  1 cie12994 cie12   23362 Feb 21 03:58 Report.pkl
-rw-r--r--  1 cie12994 cie12     336 Feb 21 03:47 __init__.py
drwxr-xr-x  2 cie12994 cie12    4096 Feb 21 03:47 __pycache__
-rw-r--r--  1 cie12994 cie12    1682 Feb 21 03:48 cmsRun1-main.sh
-rw-r--r--  1 cie12994 cie12     230 Feb 21 03:58 cmsRun1-stderr.log
-rw-r--r--  1 cie12994 cie12  408657 Feb 21 03:58 cmsRun1-stdout.log
-rw-r--r--  1 cie12994 cie12       4 Feb 21 03:48 process.id

WMTaskSpace/logArch1:
total 264
-rw-r--r-- 1 cie12994 cie12   2287 Feb 21 03:58 Report.pkl
-rw-r--r-- 1 cie12994 cie12    337 Feb 21 03:47 __init__.py
drwxr-xr-x 2 cie12994 cie12   4096 Feb 21 03:58 __pycache__
-rw-r--r-- 1 cie12994 cie12 254356 Feb 21 03:58 logArchive.tar.gz

WMTaskSpace/stageOut1:
total 12
-rw-r--r-- 1 cie12994 cie12 1442 Feb 21 03:58 Report.pkl
-rw-r--r-- 1 cie12994 cie12  338 Feb 21 03:47 __init__.py
drwxr-xr-x 2 cie12994 cie12 4096 Feb 21 03:58 __pycache__
======== WMAgent bootstrap FINISH at Wed Feb 21 02:58:47 GMT 2024 ========

======== WMAgent final job runtime checks STARTING at Wed Feb 21 02:58:47 GMT 2024 ========
Wed Feb 21 02:58:47 GMT 2024: Job Runtime in seconds:  659
Wed Feb 21 02:58:47 GMT 2024: Job bootstrap script exited:  0
Wed Feb 21 02:58:47 GMT 2024: Job execution exited:  0
======== WMAgent final job runtime checks FINISHED at Wed Feb 21 02:58:47 GMT 2024 ========

I don't know if that explains the error from the previous thread, but it is strange.

Looking into the storage.json, I see the file:/// prefix, so it seems the PR is doing the right thing:

[
   {  "site": "T3_ES_PIC_BSC",
      "volume": "BSC_GPFS",
      "protocols": [
        {  "protocol": "input",
           "access": "site-ro",
           "prefix": "file:///gpfs/projects/cie12/inputs"
        },
        {  "protocol": "output",
           "access": "site-rw",
           "prefix": "file:///gpfs/projects/cie12/outputs"
        }
      ],
      "type": "DISK",
      "rse": "T1_ES_PIC_Disk"
   }
]

Could this be a bug in the storage.json file rather than in the patch, with the non-patched version working because it does not use the same mapping pattern?

@khurtado
Contributor

khurtado commented Feb 21, 2024

@nhduongvn @amaltaro

Do you know if local protocols are supposed to start with file:// ?

E.g.: For T3_ES_PIC_BSC

[
   {  "site": "T3_ES_PIC_BSC",
      "volume": "BSC_GPFS",
      "protocols": [
        {  "protocol": "input",
           "access": "site-ro",
           "prefix": "file:///gpfs/projects/cie12/inputs"
        },
        {  "protocol": "output",
           "access": "site-rw",
           "prefix": "file:///gpfs/projects/cie12/outputs"
        }
      ],
      "type": "DISK",
      "rse": "T1_ES_PIC_Disk"
   }
]

I do see that other sites have a different configuration, so maybe this is a bug in the site configuration?

$ find . -iname "storage.json" -exec grep -H file {} \; | grep prefix
_US_Rutgers/storage.json:                "prefix": "file:/cms/se/phedex"
./T3_US_UMD/storage.json:                "prefix": "file:/mnt/hadoop/cms"
./T3_US_UMiss/storage.json:                "prefix": "file:/osgremote/osg_data/cms"
./T3_US_VC3_NotreDame/storage.json:                "prefix": "file:/hadoop"
./T3_US_CMU/storage.json:            "prefix": "file:/export/data"
./T3_IT_MIB/storage.json:                "prefix": "file:/gwteras/cms"
./T3_ES_PIC_BSC/storage.json:           "prefix": "file:///gpfs/projects/cie12/inputs"
./T3_ES_PIC_BSC/storage.json:           "prefix": "file:///gpfs/projects/cie12/outputs"
./T3_US_ORNL/storage.json:                "prefix": "file:/gpfs/alpine/hep134/proj-shared"

@nhduongvn
Collaborator Author

Hi @khurtado,
Yes, the patch works as it is supposed to. Here the output protocol uses the prefix file:///gpfs/projects/cie12/outputs, so this prefix is placed in front of the LFN, resulting in a PFN such as
file:///gpfs/projects/cie12/outputs/aFile.txt, for example.
I looked at the storage.xml that the non-patched version uses to translate the PFN, and it is different:
https://gitlab.cern.ch/SITECONF/T3_ES_PIC_BSC/-/blob/master/storage.xml?ref_type=heads
<lfn-to-pfn protocol="output" path-match="/+store/(.*)" result="/gpfs/projects/cie12/outputs/store/$1"/>
Here the resulting PFN will be /gpfs/projects/cie12/outputs/store/aFile.txt.
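
To make the two translations concrete, a small sketch (the LFN below is just a placeholder):

import re

lfn = '/store/aFile.txt'  # placeholder LFN

# storage.json (patched agent): plain prefix concatenation
prefix = 'file:///gpfs/projects/cie12/outputs'
pfn_json = prefix + lfn
# -> 'file:///gpfs/projects/cie12/outputs/store/aFile.txt'

# storage.xml TFC rule (non-patched agent): regex match and substitution
pfn_xml = re.sub(r'/+store/(.*)', r'/gpfs/projects/cie12/outputs/store/\1', lfn)
# -> '/gpfs/projects/cie12/outputs/store/aFile.txt'

The key difference is the file:// scheme that only the storage.json prefix carries.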

@nhduongvn
Collaborator Author

nhduongvn commented Feb 21, 2024

@nhduongvn @amaltaro

Do you know if local protocols are supposed to start with file:// ?

E.g.: For T3_ES_PIC_BSC

[
   {  "site": "T3_ES_PIC_BSC",
      "volume": "BSC_GPFS",
      "protocols": [
        {  "protocol": "input",
           "access": "site-ro",
           "prefix": "file:///gpfs/projects/cie12/inputs"
        },
        {  "protocol": "output",
           "access": "site-rw",
           "prefix": "file:///gpfs/projects/cie12/outputs"
        }
      ],
      "type": "DISK",
      "rse": "T1_ES_PIC_Disk"
   }
]

I do see other sites have a different configuration, so maybe this is a site bug in the configuration?

$ find . -iname "storage.json" -exec grep -H file {} \; | grep prefix
_US_Rutgers/storage.json:                "prefix": "file:/cms/se/phedex"
./T3_US_UMD/storage.json:                "prefix": "file:/mnt/hadoop/cms"
./T3_US_UMiss/storage.json:                "prefix": "file:/osgremote/osg_data/cms"
./T3_US_VC3_NotreDame/storage.json:                "prefix": "file:/hadoop"
./T3_US_CMU/storage.json:            "prefix": "file:/export/data"
./T3_IT_MIB/storage.json:                "prefix": "file:/gwteras/cms"
./T3_ES_PIC_BSC/storage.json:           "prefix": "file:///gpfs/projects/cie12/inputs"
./T3_ES_PIC_BSC/storage.json:           "prefix": "file:///gpfs/projects/cie12/outputs"
./T3_US_ORNL/storage.json:                "prefix": "file:/gpfs/alpine/hep134/proj-shared"

It looks to me like the configuration is not correct, but I do not know for sure whether file: is conventionally used for local files. @stlammel, could you comment?

@stlammel

Hallo Alan, Duong,
if by "local protocols" you mean protocols not used by Rucio, then no, they will not all start with "file:". The "file:" denotes Posix/file-protocol access. Those protocols will likely not be used by Rucio but there will be other "davs:", "root:" protocols that Rucio will not use.
Thanks,
cheers, Stephan

@khurtado
Contributor

@stlammel

In the BSC case, they do a local copy, so I suppose using file: is correct. For example, T2_US_Nebraska also does it:

[
   {  "site": "T2_US_Nebraska",
      "volume": "Nebraska_CEPH",
      "protocols": [
<snip>
         {  "protocol": "file",
            "access": "site-ro",
            "prefix": "file:/cms"
         }
<snip>

But would you say, then, that the config should use file: rather than file://? I asked the same question in this GGUS ticket:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=165390

@stlammel

Yes, if the filesystem is mounted on all worker nodes and permissions are set up properly, there is nothing wrong with using the Posix/file protocol for stage-out (or data access).

  • Stephan

@stlammel

So, both "file:/..." and "file:///..." are correct. First syntax omits the endpoint part, the later leaves it blank.

  • Stephan
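
For illustration, both of the following point to the same local path and differ only in whether the (empty) authority component is written out:

file:/gpfs/projects/cie12/outputs     (authority omitted)
file:///gpfs/projects/cie12/outputs   (authority present but empty)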

@nhduongvn
Collaborator Author

nhduongvn commented Feb 21, 2024

So, both "file:/..." and "file:///..." are correct. First syntax omits the endpoint part, the later leaves it blank.

  • Stephan

Does it mean that command="cp" in stage-out should proceed differently for storage.json, since there are extra file:/ or file:/// prefixes? Or should the command be omitted in stage-out so that gfal2 is used?

@stlammel

I suspect the "cp" implementation isn't right. Why are we not using gfal also for file protocol which i suspect will handle the slashes properly. (The cp might be a pre-grid implementation.)

  • Stephan

@khurtado
Contributor

khurtado commented Feb 22, 2024

@stlammel This T3 is special in the sense that BSC is not actually part of the Global Pool, and they launch local pilots there. For stage-out, they first copy the files only to a local disk, and then a custom-made component intercepts the stage-out signal and proceeds to scp the files to T1_ES_PIC_Disk.

For some reason, the "cp" part does not trigger any errors, but I suspect the files are not properly copied (they probably end up under a literal 'file:' directory in the current working directory), given this from the log files:

WMTaskSpace/stageOut1:
total 16
-rw-r--r-- 1 cie12994 cie12 1437 Feb 21 03:54 Report.pkl
-rw-r--r-- 1 cie12994 cie12  338 Feb 21 03:43 __init__.py
drwxr-xr-x 2 cie12994 cie12 4096 Feb 21 03:54 __pycache__
drwxr-xr-x 3 cie12994 cie12 4096 Feb 21 03:54 file:
======== WMAgent bootstrap FINISH at Wed Feb 21 02:54:42 GMT 2024 ========

So then, when the scp tries to proceed, it can't find the files and reports an error.

So, just to confirm, would you say that if the stage out command they use in site-local-config.xml is like this

   <stage-out>
      <method volume="BSC_GPFS" protocol="output" command="cp"/>
   </stage-out>

Then the output protocol prefix should drop file:/ or file:/// and just contain the plain path, since cp can't handle the extra file: the way gfal2 can, for example?

[
   {  "site": "T3_ES_PIC_BSC",
      "volume": "BSC_GPFS",
      "protocols": [
        {  "protocol": "output",
           "access": "site-rw",
           "prefix": "/gpfs/projects/cie12/outputs"
        }
      ],
      "type": "DISK",
      "rse": "T1_ES_PIC_Disk"
   }
]

@stlammel

Hallo Kenyi,
so, yes, it looks like the command="cp" implementation isn't working properly. It would be interesting to know whether switching to the gfal implementation just solves things. Where does PIC do the "scp": is that internal/not exposed, or done as part of a custom "cp" implementation?
Just removing the "file://" in storage.json will probably fix things too. (Since it's a local protocol with the stage-out implementation specified, the prefix can be anything.)
Thanks,
cheers, Stephan

@khurtado
Contributor

khurtado commented Feb 22, 2024

@stlammel Thank you! To be honest, I do not know at which point they do the scp, but I have updated the GGUS ticket with your suggestion (moving to gfal or removing the file:///), since I think that should solve the issue.

@amaltaro @todor-ivanov Given that the BSC case indeed seems to be a site configuration problem and not a problem with this PR, and given that I'm happy enough with my other tests, I am requesting one more review of this PR before it can be merged.

Here is a summary of the tests:

  • a) Chained rule workflow test
    Tested with T1_DE_KIT to the WebDav protocol, chained with pnfs.
  • b) Workflow with sub-site
    Tested with subsite KIT-T3. All good.
  • c) StepChain + TaskChain + ReReco
    Done, submitting to a few sites; all good.
  • d) Check Cleanup + LogCollect
    Logs looking good as well.
  • e) We might check DBS and Rucio (make a Rucio rule)
    DBS and xrdcp tests done (rather than Rucio) on the files produced from the workflow below:
    https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TaskChain_SingleTask20K_khurtado_xmlv1_scaletest20k_240216_000923_9224
  • f) Large scale test (targeting all production sites): 20k jobs (single core, 30 min)
    Done and completed. Errors were mostly related to the agent getting overloaded by the 20K jobs. Stage-out related errors seemed transient, as the same sites with issues were staging out successfully in other tests or jobs. BSC was the only confusing one, and it looks like a site configuration issue.

@khurtado
Contributor

khurtado commented Feb 26, 2024

Just for the record, removing file:/// from the json fixed the issue, so this is confirmed to be a site issue and to be fixed at the siteconf level. More details:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=165390

I'm marking #11703 as waiting, since it's only pending an additional PR review before it can be merged. I'm done with the functionality tests.

@amaltaro
Contributor

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests deleted
    • 3 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 75 warnings and errors that must be fixed
    • 8 warnings
    • 261 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 363 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14917/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment


@nhduongvn @khurtado I left a few comments along the code that need to be considered.

I would also invite you to look at those errors reported in the Jenkins report, for instance:

test/python/WMCore_t/Storage_t/StageOutMgr_t.py
    E0602, line 39 in StageOutMgrTest.stageOut: Undefined variable 'StageOutFailure' 

and reports like

 W0611, line 17: Unused SiteLocalConfig imported from WMCore.Storage.SiteLocalConfig 

can be considered as well. Or we can clean those up in another iteration once it gets merged.

I wanted to note though that these are false notifications:

src/python/WMCore/Storage/StageOutMgr.py
    W0611, line 16: Unused import WMCore.Storage.Backends
    W0611, line 17: Unused import WMCore.Storage.Plugins 

Please do NOT remove these imports, because they are actually needed.


self.stageOuts = self.siteCfg.stageOuts

msg = "There are %s stage out definitions.\n" % len(self.stageOuts)
Contributor

As far as I can see, this log line, line 102, and maybe a few more will only be logged if we have an error or exception. Can you please revisit it and also print things out in info mode?

Collaborator Author

Done. Use self.logger.info

if 'option' in overrideConf:
    if len(overrideConf['option']) > 0:
        overrideParams['option'] = overrideConf['option']
if 'option' in self.overrideConf and self.overrideConf['option'] is not None:
Contributor

Can we simplify this line like: if self.overrideConf.get('option') is not None:?

Collaborator Author

Yes, it is much better

raise StageOutInitError( msg )
self.stageOuts = self.siteCfg.stageOuts

msg = "\nThere are %s stage out definitions." % len(self.stageOuts)
Contributor

Same comment on this msg variable, which is only printed in case there is an exception. Can you please revisit it?

Collaborator Author

Done

amsg += '\t'+stageOutStr(stageOut) + '\n'
amsg += str(ex)
msg += amsg
print(amsg)
Contributor

Please replace the print call with the logging library.
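
A minimal sketch of what that could look like, assuming the class carries a self.logger as used elsewhere in this code (the log level is only illustrative):

amsg += '\t' + stageOutStr(stageOut) + '\n'
amsg += str(ex)
msg += amsg
self.logger.warning(amsg)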

Collaborator Author

Done

options = self.siteCfg.localStageOut.get('option', None)
pfn = self.searchTFC(lfn)
protocol = self.tfc.preferredProtocol
if stageOut_rfc is None:
Contributor

Perhaps we could change this line to if not stageOut_rfc:, such that it can capture a potential empty list as well?

Collaborator Author

Done

msg += str(ex)
self.stageOuts = self.siteCfg.stageOuts

msg = "\nThere are %s stage out definitions." % len(self.stageOuts)
Contributor

Same comment on this msg variable, which will only happen to get printed if there is an exception in the code. Please revisit.

Collaborator Author

Done

@amaltaro
Contributor

@nhduongvn Duong, I don't mean to put any pressure on you :), but if we can converge on this before the weekend, it will go into the upcoming March release candidate for WMAgent. Let us know if you need any assistance with the remaining work.

@nhduongvn
Collaborator Author

That is actually good news! I can definitely finish addressing your review today (I had planned to work on it this afternoon, but then the power went out 😞).

@amaltaro
Contributor

@nhduongvn Duong, if you feel comfortable squashing the commits in this PR, please have a look at "Step 10" in this section: https://github.com/dmwm/WMCore/blob/master/CONTRIBUTING.rst#contributing (it has a nice reference for squashing commits) and go ahead.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests deleted
    • 3 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 50 warnings and errors that must be fixed
    • 8 warnings
    • 255 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 357 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14932/artifact/artifacts/PullRequestReport.html

@nhduongvn
Collaborator Author

@nhduongvn @khurtado I left a few comments along the code that needs to be considered.

I would also invite you to look at those errors reported in the Jenkins report, for instance:

test/python/WMCore_t/Storage_t/StageOutMgr_t.py
    E0602, line 39 in StageOutMgrTest.stageOut: Undefined variable 'StageOutFailure' 

and reports like

 W0611, line 17: Unused SiteLocalConfig imported from WMCore.Storage.SiteLocalConfig 

can be considered as well. Or we can clean those up in another iteration once it gets merged.

I wanted to note though that these are false notifications:

src/python/WMCore/Storage/StageOutMgr.py
    W0611, line 16: Unused import WMCore.Storage.Backends
    W0611, line 17: Unused import WMCore.Storage.Plugins 

Please do NOT remove these import because they are actually needed.

Done!

@nhduongvn
Collaborator Author

nhduongvn commented Feb 29, 2024

Please let me know if you are happy with the new updates. I can try to squash the commits, but honestly I am not very comfortable doing this; last time I did it with Todor, it took me two days.

@amaltaro
Contributor

No problem @nhduongvn. Instead of "Merge pull request", I am going to "Squash and merge", as the test/* and src/* changes are already together in the same commits. I will then follow this up with further aesthetic changes reported by Jenkins.

@khurtado can you please get the relevant pieces of this stage-out/stage-in mechanism, based on the JSON storage description, documented in https://cms-wmcore.docs.cern.ch/? You probably want to use the gdoc that Duong has circulated so far.

@amaltaro amaltaro merged commit 5827a91 into dmwm:master Feb 29, 2024
2 of 4 checks passed