Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GPU] 11634.X RelVal failures during processing of Python configuration #35624

Closed
iarspider opened this issue Oct 12, 2021 · 30 comments · Fixed by #35630 or #35722
Closed

[GPU] 11634.X RelVal failures during processing of Python configuration #35624

iarspider opened this issue Oct 12, 2021 · 30 comments · Fixed by #35630 or #35722

Comments

@iarspider
Copy link
Contributor

Example log file

Starting python3 /pool/condor/dir_10183/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 9000  cmsRun -j JobReport2.xml  step2_DIGI_L1_DIGI2RAW_HLT.py
----- Begin Fatal Exception 12-Oct-2021 09:04:49 UTC-----------------------
An exception of category 'ConfigFileReadError' occurred while
   [0] Processing the python configuration file named step2_DIGI_L1_DIGI2RAW_HLT.py
Exception Message:
 unknown python problem occurred.
RuntimeError: Unable to cast Python instance of type <class 'NoneType'> to C++ type 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'

At:
  /cvmfs/cms-ib.cern.ch/nweek-02702/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_1_GPU_X_2021-10-11-2300/python/FWCore/ParameterSet/Config.py(1163): _insertPaths
  /cvmfs/cms-ib.cern.ch/nweek-02702/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_1_GPU_X_2021-10-11-2300/python/FWCore/ParameterSet/Config.py(1323): fillProcessDesc
  <string>(2): <module>

----- End Fatal Exception -------------------------------------------------

Only occurs in GPU IBs

@cmsbuild
Copy link
Contributor

A new Issue was created by @iarspider .

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign heterogeneous

@cmsbuild
Copy link
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

I took a quick look of 11634.502, and it seems to me that process.schedule contains two Nones, in

...
HLT_ZeroBias_FirstBXAfterTrain_v3
AlCa_RPCMuonNormalisation_v13
None
None
MC_ReducedIterativeTracking_v12
MC_PFMET_v17
...

@fwyzard
Copy link
Contributor

fwyzard commented Oct 12, 2021

can this be caused by #35566 ?

@makortel
Copy link
Contributor

Just by eye #35566 looks a good candidate. That kind of operations in customize functions can lead to a "sequence type" to contain None if objects with the same label are replaced with new objects.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 12, 2021

Yes, that's the idea.
What's a more friendly way to replace the content of a Path ?

@makortel
Copy link
Contributor

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

makortel commented Oct 12, 2021

Given that e.g.

process = cms.Process("Test")
process.a = cms.EDProducer("A")
process.b = cms.EDProducer("B")
process.s = cms.Sequence(process.a)
process.p = cms.Path(process.s)
process.schedule = cms.Schedule(process.p)
process.s = cms.Sequence(process.b)

and

process = cms.Process("Test")
process.a = cms.EDProducer("A")
process.b = cms.EDProducer("B")

process.t = cms.Task(process.a)
process.schedule = cms.Schedule(tasks=cms.Task(process.t))
process.t = cms.Task(process.b)

lead to process.b being run, but versions without the intermediate process.s = cms.Sequence() or cms.Task do not, I think this is actually a bug (oversight) in how cms.Process.__setattr__() works with cms.Schedule. I'm already looking for a fix.

@makortel
Copy link
Contributor

I should have a fix in #35630.

@missirol
Copy link
Contributor

@fwyzard I was wondering if there was a way to catch this in the (GPU-enabled) PR tests. Looking at the GPU outputs of the tests of #35566 , I see those run only the Fake HLT. If you have suggestions, let me know and I can try to take care of it (cc: @Martin-Grunewald).

@fwyzard
Copy link
Contributor

fwyzard commented Oct 13, 2021

Good point.

Looks like the GPU tests are still running the Run 2 workflows (10824.5xx), we should change them to use the Run 3 workflows (11634.5xx) that include the GRun menu.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 13, 2021

@smuzaffar how can we change the workflows that the bot runs when one asks "enable gpu" from the Run 2 ones

10824.502    2018_Patatrack_PixelOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT 
10824.512    2018_Patatrack_ECALOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT 
10824.522    2018_Patatrack_HCALOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT 

to the Run-3 ones

11634.506    2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST 
11634.512    2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST 
11634.522    2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST 

?

This includes switching from .502 to .506, which is more realistic in terms of what we will run next year.

@smuzaffar
Copy link
Contributor

@fwyzard , bot currently does not know any thing about Run2 or Run3. Currently PR workflwos are defined in https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config we can convert this to a python/shell script which can dynamically generate different workflows based on the run. But still somehow we need to feel the run information to bot. Is there any thing in runTheMatrix which bot can use to find out the Run information?

@fwyzard
Copy link
Contributor

fwyzard commented Oct 13, 2021

No, runTheMatrix.py defines all the workflows.
I think we can simply make the change by hand:

 PR_TEST_MATRIX_EXTRAS=1306.0,101.0,9.0,25202.0,10224.0,250202.181
-PR_TEST_MATRIX_EXTRAS_GPU=10824.502,10824.512,10824.522
+PR_TEST_MATRIX_EXTRAS_GPU=11634.506,11634.512,11634.522
 PR_TEST_MATRIX_EXTRAS_PROFILING=23434.21,11834.21

@fwyzard
Copy link
Contributor

fwyzard commented Oct 13, 2021

Done in cms-sw/cms-bot#1648 .

@missirol
Copy link
Contributor

Done in cms-sw/cms-bot#1648 .

Thanks, Andrea.

I should have a fix in #35630.

@makortel , just noting here that the fix might need to be backported to 12.0.x, since #35566 was backported to that release.

@makortel
Copy link
Contributor

Seems that #35630 was not enough. @smuzaffar, could you reopen this issue?

@makortel
Copy link
Contributor

I investigated the issue a bit further. For a reason that is not yet clear to me, the 11634.502 step2 config file has

process.digi2raw_step = cms.Path(process.DigiToRaw)
process.AlCa_LumiPixelsCounts_Random_v1 = cms.Path(process.HLTBeginSequenceRandom+process.hltScalersRawToDigi+process.hltPreAlCaLumiPixelsCountsRandom+process.hltPixelTrackerHVOn+process.HLTDoLocalPixelSequence+process.hltAlcaPixelClusterCounts+process.HLTEndSequence)
process.AlCa_LumiPixelsCounts_ZeroBias_v1 = cms.Path(process.HLTBeginSequence+process.hltScalersRawToDigi+process.hltL1sZeroBias+process.hltPreAlCaLumiPixelsCountsZeroBias+process.hltPixelTrackerHVOn+process.HLTDoLocalPixelSequence+process.hltAlcaPixelClusterCounts+process.HLTEndSequence)
process.endjob_step = cms.EndPath(process.endOfProcess)

The process.AlCa_LumiPixelsCounts_Random_v1 and process.AlCa_LumiPixelsCounts_ZeroBias_v1 cause the entries of the previous Path objects with the same name in the process.HLTSchedule object to be replaced with None. When process.schedule is extended with process.HLTSchedule, the None's get propagated to process.schedule.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 15, 2021

process.AlCa_LumiPixelsCounts_Random_v1 and process.AlCa_LumiPixelsCounts_ZeroBias_v1 are created by

if 'AlCa_LumiPixelsCounts_Random_v1' in process.__dict__:
# redefine the path to use the HLTDoLocalPixelSequence
process.AlCa_LumiPixelsCounts_Random_v1 = cms.Path(
process.HLTBeginSequenceRandom +
process.hltScalersRawToDigi +
process.hltPreAlCaLumiPixelsCountsRandom +
process.hltPixelTrackerHVOn +
process.HLTDoLocalPixelSequence +
process.hltAlcaPixelClusterCounts +
process.HLTEndSequence )
if 'AlCa_LumiPixelsCounts_ZeroBias_v1' in process.__dict__:
# redefine the path to use the HLTDoLocalPixelSequence
process.AlCa_LumiPixelsCounts_ZeroBias_v1 = cms.Path(
process.HLTBeginSequence +
process.hltScalersRawToDigi +
process.hltL1sZeroBias +
process.hltPreAlCaLumiPixelsCountsZeroBias +
process.hltPixelTrackerHVOn +
process.HLTDoLocalPixelSequence +
process.hltAlcaPixelClusterCounts +
process.HLTEndSequence )

to work around a problem in the definition of the paths (that do not use the HLTDoLocalPixelSequence) and the Patatrack customisation (that assumes the HLTDoLocalPixelSequence).

But they should be created after the HLT menu has been imported, not before ?

@makortel
Copy link
Contributor

Sorry, I was not fully clear. IIUC the customizeHLTforPatatrack() gets called only by customizeHLTforMC() that is called at the very end of the step2_DIGI_L1_DIGI2RAW_HLT.py. In #35624 (comment) I meant that that fragment is literally in the cmsDriver-generated step2_DIGI_L1_DIGI2RAW_HLT.py. I recall ConfigBuilder tends to do that for some python constructs, but I was still surprised to see those Path definitions there.

@makortel
Copy link
Contributor

With @Dr15Jones we traced the cause of the error. The customizeHLTforPatatrack() is called by customizeHLTforCMSSW

# CMSSW version specific customizations
def customizeHLTforCMSSW(process, menuType="GRun"):
# if the gpu modifier is enabled, make the Pixel, ECAL and HCAL reconstruction offloadable to a GPU
from HLTrigger.Configuration.customizeHLTforPatatrack import customizeHLTforPatatrack
gpu.makeProcessModifier(customizeHLTforPatatrack).apply(process)

that itself is called at the end of e.g. HLT_GRun_cff.py

from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW
fragment = customizeHLTforCMSSW(fragment,"GRun")

There the fragment is cms.ProcessFragment that has no special replacement functionality, implying that any customize function operating on a ProcessFragment must do all necessary replacement operations. #35722 proposes a fix for this.

The two Paths get added to the generated configuration file because of this logic in ConfigBuilder

self.schedule.append(self.process.HLTSchedule)
[self.blacklist_paths.append(path) for path in self.process.HLTSchedule if isinstance(path,(cms.Path,cms.EndPath))]

# dump all paths
self.pythonCfgCode += "\n# Path and EndPath definitions\n"
for path in self.process.paths:
if getattr(self.process,path) not in self.blacklist_paths:
self.pythonCfgCode += dumpPython(self.process,path)
for endpath in self.process.endpaths:
if getattr(self.process,endpath) not in self.blacklist_paths:
self.pythonCfgCode += dumpPython(self.process,endpath)

The Paths from the HLT menu get added to self.process.paths, but the two Paths that appear as None in the HLTSchedule are not added to self.blacklist_paths and therefore lines to define them explicitly are generated. Also the two Path definitions disappear from `step2*.py" with #35722.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 19, 2021

@makortel thanks for the proposed fix.

There the fragment is cms.ProcessFragment that has no special replacement functionality, implying that any customize function operating on a ProcessFragment must do all necessary replacement operations.

Mhm, the whole idea of the cms.ProcessFragment was for it to have the same behaviour as a cms.Process , at least in terms of customisation functions. What would need to be added for it to have the same special replacement functionality ?

@makortel
Copy link
Contributor

Mhm, the whole idea of the cms.ProcessFragment was for it to have the same behaviour as a cms.Process , at least in terms of customisation functions. What would need to be added for it to have the same special replacement functionality ?

In order to support the same functionality as cms.Process it would need a lot of rather complicated code from cms.Process, at least everything that is in or called by cms.Process.__setattr__ (so copy-paste is not an option).

@makortel
Copy link
Contributor

just noting here that the fix might need to be backported to 12.0.x, since #35566 was backported to that release.

12_0_X backport is in #35759

@makortel
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 21, 2021

Mhm, the whole idea of the cms.ProcessFragment was for it to have the same behaviour as a cms.Process , at least in terms of customisation functions. What would need to be added for it to have the same special replacement functionality ?

In order to support the same functionality as cms.Process it would need a lot of rather complicated code from cms.Process, at least everything that is in or called by cms.Process.__setattr__ (so copy-paste is not an option).

cms.ProcessFragment is just a thin wrapper around a cms.Process, and should forward all setattr/getattr/delattr calls to the underlying object:

class ProcessFragment(object):
    def __init__(self, process):
        if isinstance(process, Process):
            self.__process = process
        elif isinstance(process, str):
            self.__process = Process(process)
            #make sure we do not override the defaults
            del self.__process.options
            del self.__process.maxEvents
            del self.__process.maxLuminosityBlocks
        else:
            raise TypeError('a ProcessFragment can only be constructed from an existig Process or from process name')
    def __dir__(self):
        return [ x for x in dir(self.__process) if isinstance(getattr(self.__process, x), _ConfigureComponent) ]
    def __getattribute__(self, name):
        if name == '_ProcessFragment__process':
            return object.__getattribute__(self, '_ProcessFragment__process')
        else:
            return getattr(self.__process, name)
    def __setattr__(self, name, value):
        if name == '_ProcessFragment__process':
            object.__setattr__(self, name, value)
        else:
            setattr(self.__process, name, value)
    def __delattr__(self, name):
        if name == '_ProcessFragment__process':
            pass
        else:
            return delattr(self.__process, name)

Do you have any idea why the fix made for the cms.Process did not work for the cms.ProcessFragment ?

@makortel
Copy link
Contributor

Do you have any idea why the fix made for the cms.Process did not work for the cms.ProcessFragment ?

Hmm, that is a good point. I think what happened is that the cms.Process held by cms.ProcessFragment had the Paths in a cms.Schedule object process.HLTSchedule, and the replacement-in-cms.Schedule logic in cms.Process only applies to process.schedule. The (optional) cms.Schedule object in the cms.Process should have the label process.schedule, and while additional cms.Schedule objects appear to work for some things, framework gives no guarantees for such.

I opened a new issue #35842 to figure out a way how we could eventually enforce this rule on cms.Schedule objects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment