Force pickle protocol=0 for the spec creation to be compatible with Py2/Py3 #10628

Merged (1 commit) on Jun 29, 2021

@amaltaro (Contributor) commented Jun 28, 2021

Supersedes #10619 (please see the whole discussion/investigation in that PR)
Fixes #10609

Status

Ready

Description

As discussed with Dario this morning, trying a variation of the current implementation in #10619
If it passes the unit tests, then we can run some real tests within WMCore.
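The idea behind the change can be sketched as follows. This is only a minimal illustration, not the actual WMCore code: the `spec` dictionary is a hypothetical stand-in for the real workload object. It pins the pickle protocol to 0 when writing, and relies on `pickle.loads` detecting the protocol by itself when reading.

```python
import pickle

# Hypothetical stand-in for a workload spec; the real object is a WMCore class.
spec = {"name": "ExampleWorkflow", "tasks": ["ProdMinBias"], "cores": 1}

# Protocol 0 is the original, text-oriented pickle format. It is understood by
# both Python 2 and Python 3 unpicklers. For protocols below 3, the default
# fix_imports=True also maps Python 3 module names to their Python 2 names.
payload = pickle.dumps(spec, protocol=0)

# No protocol argument is needed on the reading side: it is detected from
# the stream itself.
restored = pickle.loads(payload)
print(restored == spec)
```

Pinning the protocol on the writer side keeps the reader code unchanged, at the cost of a larger and slower serialization than the newer binary protocols.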

NOTE: this patch is required for the Python2 1.4.7.patchX agents:

cmst1@vocms0262:/data/srv/wmagent/current $ diff -rup /data/srv/wmagent/v1.4.7.patch3/sw/slc7_amd64_gcc630/cms/wmagent/1.4.7.patch3/lib/python2.7/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py{.bkp,}
--- /data/srv/wmagent/v1.4.7.patch3/sw/slc7_amd64_gcc630/cms/wmagent/1.4.7.patch3/lib/python2.7/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py.bkp	2021-06-24 19:30:23.860759021 +0200
+++ /data/srv/wmagent/v1.4.7.patch3/sw/slc7_amd64_gcc630/cms/wmagent/1.4.7.patch3/lib/python2.7/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py	2021-06-24 19:31:50.394372378 +0200
@@ -605,7 +605,7 @@ class SimpleCondorPlugin(BasePlugin):
             requiredOSes = self.scramArchtoRequiredOS(job.get('scramArch'))
             ad['My.REQUIRED_OS'] = classad.quote(requiredOSes)
             cmsswVersions = ','.join(job.get('swVersion'))
-            ad['My.CMSSW_Versions'] = classad.quote(cmsswVersions)
+            ad['My.CMSSW_Versions'] = classad.quote(str(cmsswVersions))

Is it backward compatible (if not, which system does it affect)?

yes, hopefully!

Related PRs

none

External dependencies / deployment changes

none

@cmsdmwmbot

Jenkins results:

  • Python2 Unit tests: failed
    • 45 new failures
    • 8 changes in unstable tests
  • Python3 Unit tests: failed
    • 43 new failures
    • 13 changes in unstable tests
  • Python2 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 1 warnings
    • 6 comments to review
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 1 warnings
    • 24 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12038/artifact/artifacts/PullRequestReport.html

…y2/Py3

pickle load auto detects the protocol
@cmsdmwmbot

Jenkins results:

  • Python2 Unit tests: succeeded
  • Python3 Unit tests: succeeded
    • 4 changes in unstable tests
  • Python2 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 1 warnings
    • 6 comments to review
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 1 warnings
    • 24 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12040/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

I still have to check the outcome of my tests (might have to create another fix or two before I can get to that though), but a pre-review would be welcome ;-)

Member

@mapellidario mapellidario left a comment

Well yes, this is what we agreed on and based on our current knowledge it is the best that we can do.

The results of your tests will greatly determine the fate of these changes, though 😏 We will see, but I am not too worried either: it should not take long to cherry-pick changes from #10619 in case your tests fail because of pickle.

@amaltaro
Contributor Author

Quick update on the tests I have been running for Python3 central services with these changes in.

  • PY3 central services are able to properly create the request spec pickle file
  • PY2 and PY3 WMAgent are able to acquire such workflows and load the spec pickle file (created with protocol 0)
  • PY2 SimpleCondorPlugin needs to be patched (see initial description)
  • PY3 WMAgent - based on 1.5.0.pre3 - required many fixes in order to get jobs submitted.
  • PY2 WMAgent had 100% failure rate. Failure happens during runtime with error [py2_wma]
  • PY3 WMAgent had 100% failure rate. Still to be debugged, but runtime error can be seen in [py3_wma]

Note that the Py2 WMAgent is still running 1.4.7.patch3, so it does not have the python decoupling developments.
A quick check around that CondorStatusService code suggests that self.step.data._internal_name is actually unicode, which breaks the validation in the CMSSW code itself. Debugging...
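One way to see where the unicode value may come from is to inspect the opcodes of a protocol 0 pickle written by Python 3. The sketch below uses the standard `pickletools` module, and the string value is just an example taken from the step name in the log. A Python 3 `str` is written with the UNICODE opcode, which a Python 2 unpickler turns into a `unicode` object rather than a native `str`.

```python
import pickle
import pickletools

# A text string as Python 3 would serialize it with protocol 0.
payload = pickle.dumps("cmsRun1", protocol=0)

# genops yields (opcode, argument, position) for each instruction in the stream.
opcode_names = [opcode.name for opcode, _arg, _pos in pickletools.genops(payload)]
print(opcode_names)

# The text is stored with the UNICODE opcode, so Python 2 loads it as `unicode`.
print("UNICODE" in opcode_names)
```

Code that strictly checks for the Python 2 `str` type would therefore need the value converted after loading, which is what the `str(...)` call in the patch from the description does for ASCII content.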

[py2_wma]

INFO:root:    Invoking command: /cvmfs/cms.cern.ch/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_0/external/slc6_amd64_gcc530/bin/python2.7 -m WMCore.WMRuntime.ScriptInvoke WMTaskSpace.cmsRun1 SetupCMSSWPset 

INFO:root:Subprocess stdout was:
2021-06-28 23:14:52,519:INFO:ScriptInvoke:Invoking scripts in current directory: /afs/cern.ch/user/a/amaltaro/psetTweak2/job/WMTaskSpace/cmsRun1/CMSSW_8_1_0
2021-06-28 23:14:52,576:INFO:Bootstrap:Job Index = 10
Job Instance = {'ownerRole': u'unknown', 'agentName': 'vocms0261.cern.ch', 'jobgroup': 1, 'taskType': 'integration', 'owner': u'amaltaro', 'fwjr': None, 'id': 10, 'jobType': 'Production', 'estimatedJobTime': 28800, 'estimatedDiskUsage': 1000000, 'state_time': 1624886965, 'state': 'new', 'location': None, 'allowOpportunistic': False, 'agentNumber': 0, 'spec': '/data/srv/wmagent/v1.4.7.patch3/install/wmagent/WorkQueueManager/cache/amaltaro_TaskChain_ProdMinBias_June2021_Val_210628_150829_5725/WMSandbox/WMWorkload.pkl', 'workflow': 'amaltaro_TaskChain_ProdMinBias_June2021_Val_210628_150829_5725', 'ownerGroup': u'unknown', 'cache_dir': u'/data/srv/wmagent/v1.4.7.patch3/install/wmagent/JobCreator/JobCache/amaltaro_TaskChain_ProdMinBias_June2021_Val_210628_150829_5725/ProdMinBias/JobCollection_1_0/job_10', 'estimatedMemoryUsage': 1200.0, 'fwjr_path': None, 'input_files': [{'runs': set([]), 'last_event': 0, 'lfn': 'MCFakeFile-amaltaro_TaskChain_ProdMinBias_June2021_Val_210628_150829_5725-ProdMinBias-1fc1a33e4582105549b2d043a1939366', 'locations': set([]), 'first_event': 0, 'parents': set([]), 'checksums': {}, 'events': 9000, 'merged': True, 'size': 0}], 'task': u'/amaltaro_TaskChain_ProdMinBias_June2021_Val_210628_150829_5725/ProdMinBias', 'name': u'86fcac5d-2781-4c4f-99cd-879cd7c410fe-9', 'counter': 10, 'mask': {'LastRun': 1, 'LastLumi': 10, 'FirstRun': 1, 'inclusivemask': True, 'runAndLumis': {}, 'LastEvent': 410, 'FirstEvent': 370, 'jobID': 10, 'FirstLumi': 10}, 'numberOfCores': 1, 'ownerDN': u'/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues', 'retry_count': 0, 'sandbox': '/data/srv/wmagent/v1.4.7.patch3/install/wmagent/WorkQueueManager/cache/amaltaro_TaskChain_ProdMinBias_June2021_Val_210628_150829_5725/amaltaro_TaskChain_ProdMinBias_June2021_Val_210628_150829_5725-Sandbox.tar.bz2', 'outcome': 'failure'}

2021-06-28 23:14:52,724:INFO:SetupCMSSWPset:Executing SetupCMSSWPSet...
2021-06-28 23:14:55,127:INFO:SetupCMSSWPset:Tag chirp updates from CMSSW with step cmsRun1
Traceback (most recent call last):
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc530/external/python/2.7.11-giojec4/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc530/external/python/2.7.11-giojec4/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/afs/cern.ch/user/a/amaltaro/psetTweak2/job/WMCore.zip/WMCore/WMRuntime/ScriptInvoke.py", line 118, in <module>
RuntimeError: Error invoking script for step WMTaskSpace.cmsRun1
withe Script module: SetupCMSSWPset
_cmsRun1_ is not a valid <class 'FWCore.ParameterSet.Types.string'>Details:
  File "/afs/cern.ch/user/a/amaltaro/psetTweak2/job/WMCore.zip/WMCore/WMRuntime/ScriptInvoke.py", line 110, in <module>
  File "/afs/cern.ch/user/a/amaltaro/psetTweak2/job/WMCore.zip/WMCore/WMRuntime/ScriptInvoke.py", line 81, in invoke
  File "/build/dmwmbld/srv/state/dmwmbld/builds/comp_gcc630/w/tmp/BUILDROOT/a4c7a6fe65310a171692a1c6cb882348/build/dmwmbld/srv/state/dmwmbld/builds/comp_gcc630/w/slc7_amd64_gcc630/cms/wmagent/1.4.7.patch3/lib/python2.7/site-packages/WMCore/WMRuntime/Scripts/SetupCMSSWPset.py", line 746, in __call__
    self.handleCondorStatusService()
  File "/build/dmwmbld/srv/state/dmwmbld/builds/comp_gcc630/w/tmp/BUILDROOT/a4c7a6fe65310a171692a1c6cb882348/build/dmwmbld/srv/state/dmwmbld/builds/comp_gcc630/w/slc7_amd64_gcc630/cms/wmagent/1.4.7.patch3/lib/python2.7/site-packages/WMCore/WMRuntime/Scripts/SetupCMSSWPset.py", line 641, in handleCondorStatusService
    tag=cms.untracked.string("_%s_" % self.step.data._internal_name)))
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_0/python/FWCore/ParameterSet/Types.py", line 23, in __call__
    param = globals()[self.name](*value,**params)
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_0/python/FWCore/ParameterSet/Types.py", line 145, in __init__
    super(string,self).__init__(value)
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_0/python/FWCore/ParameterSet/Mixins.py", line 56, in __init__
    raise ValueError(str(value)+" is not a valid "+str(type(self)))

[py3_wma]

    JobPackageError (Exit Code: 11001)
        Failed to load JobPackage:/srv/job/WMSandbox/JobPackage.pcl
        unsupported pickle protocol: 5

@amaltaro
Contributor Author

amaltaro commented Jun 29, 2021

We have some progress here. After patching SetupCMSSWPset.py, I managed to get jobs running and successfully completing in our "old" Python 2 WMAgent 1.4.7.patch3. I'm going to create a patch soon.

Regarding the Python 3 agent, unfortunately I cannot test the whole chain, because jobs fail during runtime (there are other CVMFS dependencies that we need to sort out first). However, the most critical part has been tested and this protocol seems to be fully compatible.

IMO, we should give it another try in a clean environment. So I'm going to merge and put this in a new tag.

Here is a new GH issue to upgrade the pickle protocol used: #10643
And here is the required patch for 1.4.7 agents: #10644

Again, thank you very much for your investigation and the discussion on this matter, Dario!

@mapellidario
Member

@amaltaro I have missed something: did you manage to fix the problems described in [py2_wma] and [py3_wma]? Do we need to open new issues?

But more importantly, why is wmagent in py3 complaining that it cannot load protocol 5? Aren't we running py 3.8.2? It should be available there 😱

@amaltaro
Contributor Author

Hi @mapellidario , sorry for the very late reply.

> @amaltaro I have missed something: did you manage to fix the problems described in [py2_wma] and [py3_wma]? Do we need to open new issues?

Yes, those two issues have been resolved, at least in the Py2 WMAgent.

> But more importantly, why is wmagent in py3 complaining that it cannot load protocol 5? Aren't we running py 3.8.2? It should be available there 😱

I mentioned this issue in our last chat, before I left on vacation, but I could not explain the reason back then. The issue is that WMAgent creates the job pickle file using the latest protocol (5), but in the worker node runtime we are still sourcing a Python 2 environment, whose highest supported protocol is 2. This is why things are not "totally" compatible, even though our WMCore code is dual stack. To be addressed with this issue: #10641
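The mismatch described above can be illustrated with a small sketch. The `declared_protocol` helper and the `job` record are hypothetical and not part of WMCore. The helper reads the protocol number that a pickle stream announces in its header, which is what an older unpickler checks before raising "unsupported pickle protocol".

```python
import pickle


def declared_protocol(payload):
    """Return the protocol number announced by a pickle stream, or None when
    the stream has no PROTO header (protocols 0 and 1 do not write one)."""
    # Protocol 2 and later start with the PROTO opcode (0x80), followed by a
    # single byte holding the protocol number.
    if len(payload) > 1 and payload[:1] == b"\x80":
        return ord(payload[1:2])
    return None


job = {"id": 10, "task": "ProdMinBias"}  # hypothetical job record

# Written with the newest protocol the producing interpreter knows about.
newest_payload = pickle.dumps(job, protocol=pickle.HIGHEST_PROTOCOL)
# Written with protocol 2, the highest one a Python 2.7 consumer can read.
py2_safe_payload = pickle.dumps(job, protocol=2)

print(declared_protocol(newest_payload), declared_protocol(py2_safe_payload))
```

Choosing the protocol according to the oldest interpreter that has to read the file, rather than the one writing it, avoids this class of runtime failure.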

@mapellidario
Member

Aha, now I get it. Thanks Alan!

Successfully merging this pull request may close these issues.

WMAgent fails to load spec pickle file created in Python3