Add runtime json #11812

Merged
merged 2 commits into from
Dec 12, 2023

Conversation

Contributor

@khurtado khurtado commented Dec 6, 2023

Fixes #11743

Status

Ready to be tested

Description

Creates json with runtime information for future customization usage

Is it backward compatible (if not, which system it affects?)

YES

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 1 warnings
    • 30 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14698/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor Author

khurtado commented Dec 7, 2023

@amaltaro @todor-ivanov I think this is ready for review
This PR produces the json structures below for TaskChain and StepChain. I opted to keep both "keep_output" and "transient" because of #9013.

  • TaskChain
{
    "workflow_type": "TaskChain",
    "number_of_cmsRuns": 1,
    "worker_arch": "X86_64",
    "worker_os": "rhel8",
    "cmsRun_params": [
        {
            "step": "cmsRun1",
            "keep_output": true,
            "input_files": [
                "/store/data/Run2023C/JetMET1/RAW/v1/000/367/131/00000/f7943baa-66b2-402d-a9f2-4f6fc52cde13.root"
            ],
            "output_files": [
                "FEVTDEBUGHLToutput.root"
            ],
            "output_files_datatiers": [
                "FEVTDEBUGHLT"
            ],
            "output_files_transient": [
                true
            ],
            "output_files_lfnBase": [
                "/store/unmerged/CMSSW_13_3_0_pre4/JetMET1/FEVTDEBUGHLT/133X_dataRun3_HLT_frozen_v1_CNAFARM_RelVal_2023C-v1"
            ],
            "output_files_mergedLFNBase": [
                "/store/relval/CMSSW_13_3_0_pre4/JetMET1/FEVTDEBUGHLT/133X_dataRun3_HLT_frozen_v1_CNAFARM_RelVal_2023C-v1"
            ]
        }
    ]
}
  • StepChain
{
    "workflow_type": "StepChain",
    "number_of_cmsRuns": 5,
    "worker_arch": "X86_64",
    "worker_os": "rhel8",
    "cmsRun_params": [
        {
            "step": "cmsRun1",
            "keep_output": false,
            "input_files": [
                null
            ],
            "output_files": [
                "RAWSIMoutput.root"
            ],
            "output_files_datatiers": [
                "GEN-SIM"
            ],
            "output_files_transient": [
                true
            ],
            "output_files_lfnBase": [
                "/store/unmerged/Run3Summer23BPixGS/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/GEN-SIM/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ],
            "output_files_mergedLFNBase": [
                "/store/mc/Run3Summer23BPixGS/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/GEN-SIM/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ]
        },
        {
            "step": "cmsRun2",
            "keep_output": true,
            "input_files": [
                "../cmsRun1/RAWSIMoutput.root"
            ],
            "output_files": [
                "PREMIXRAWoutput.root"
            ],
            "output_files_datatiers": [
                "GEN-SIM-RAW"
            ],
            "output_files_transient": [
                true
            ],
            "output_files_lfnBase": [
                "/store/unmerged/Run3Summer23BPixDRPremix/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/GEN-SIM-RAW/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ],
            "output_files_mergedLFNBase": [
                "/store/mc/Run3Summer23BPixDRPremix/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/GEN-SIM-RAW/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ]
        },
        {
            "step": "cmsRun3",
            "keep_output": true,
            "input_files": [
                "../cmsRun2/PREMIXRAWoutput.root"
            ],
            "output_files": [
                "AODSIMoutput.root"
            ],
            "output_files_datatiers": [
                "AODSIM"
            ],
            "output_files_transient": [
                true
            ],
            "output_files_lfnBase": [
                "/store/unmerged/Run3Summer23BPixDRPremix/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/AODSIM/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ],
            "output_files_mergedLFNBase": [
                "/store/mc/Run3Summer23BPixDRPremix/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/AODSIM/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ]
        },
        {
            "step": "cmsRun4",
            "keep_output": true,
            "input_files": [
                "../cmsRun3/AODSIMoutput.root"
            ],
            "output_files": [
                "MINIAODSIMoutput.root"
            ],
            "output_files_datatiers": [
                "MINIAODSIM"
            ],
            "output_files_transient": [
                true
            ],
            "output_files_lfnBase": [
                "/store/unmerged/Run3Summer23BPixMiniAODv4/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/MINIAODSIM/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ],
            "output_files_mergedLFNBase": [
                "/store/mc/Run3Summer23BPixMiniAODv4/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/MINIAODSIM/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ]
        },
        {
            "step": "cmsRun5",
            "keep_output": true,
            "input_files": [
                "../cmsRun4/MINIAODSIMoutput.root"
            ],
            "output_files": [
                "NANOEDMAODSIMoutput.root"
            ],
            "output_files_datatiers": [
                "NANOAODSIM"
            ],
            "output_files_transient": [
                true
            ],
            "output_files_lfnBase": [
                "/store/unmerged/Run3Summer23BPixNanoAODv12/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/NANOAODSIM/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ],
            "output_files_mergedLFNBase": [
                "/store/mc/Run3Summer23BPixNanoAODv12/HTo2LongLivedTo4b_MH-125_MFF-25_CTau-15000mm_TuneCP5_13p6TeV-pythia8/NANOAODSIM/130X_mcRun3_2023_realistic_postBPix_v2-v2"
            ]
        }
    ]
}
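
For readers who want to consume these dumps, here is a minimal sketch of walking the per-step lists. It assumes the output_* lists are index-aligned (entry i of each list describes the same output file), which is what the dumps above suggest. The helper name summarize_outputs and the shortened input path are illustrative only and are not part of this PR.

```python
import json

# Trimmed copy of the TaskChain dump above; the input path is a placeholder.
RUNTIME_JSON = """
{
    "workflow_type": "TaskChain",
    "number_of_cmsRuns": 1,
    "cmsRun_params": [
        {
            "step": "cmsRun1",
            "keep_output": true,
            "input_files": ["/store/data/example.root"],
            "output_files": ["FEVTDEBUGHLToutput.root"],
            "output_files_datatiers": ["FEVTDEBUGHLT"],
            "output_files_transient": [true]
        }
    ]
}
"""


def summarize_outputs(runtime):
    """Return (step, output file, datatier, transient) tuples, one per output file."""
    rows = []
    for params in runtime["cmsRun_params"]:
        # Assumption: the output_* lists are parallel, so zip() pairs them up.
        for name, tier, transient in zip(params["output_files"],
                                         params["output_files_datatiers"],
                                         params["output_files_transient"]):
            rows.append((params["step"], name, tier, transient))
    return rows


runtime = json.loads(RUNTIME_JSON)
rows = summarize_outputs(runtime)
```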

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 1 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14702/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 1 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14703/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 1 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14704/artifact/artifacts/PullRequestReport.html

Contributor

@todor-ivanov todor-ivanov left a comment

It looks good to me.

@khurtado
Contributor Author

khurtado commented Dec 8, 2023

@todor-ivanov Thank you!
@amaltaro Just for the record, I injected all the DMWM/Integration test jobs and so far so good

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TaskChain_ProdMinBias_khurtado_taskchain1_231208_153841_5641

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_DQMHarvest_RunWhitelist_khurtado_DQMHarvest_RunWhitelist_231208_174950_2269
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_ReReco_failCase_khurtado_ReReco_failCase_231208_175002_4878
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_ReReco_failCase_Nvidia_khurtado_ReReco_failCase_Nvidia_231208_175010_4707
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_ReReco_RunBlockWhite_khurtado_ReReco_RunBlockWhite_231208_175018_2828
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_ReReco_RunBlockWhite_Nvidia_khurtado_ReReco_RunBlockWhite_Nvidia_231208_175025_3365
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_SC_6Steps_PU_khurtado_SC_6Steps_PU_231208_175032_898
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_SC_EL8_khurtado_SC_EL8_231208_175041_3991
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_SC_EL8_khurtado_SC_EL8_231208_175057_3868
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_SC_EL8_khurtado_SC_EL8.json.bak_231208_175106_8435
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TaskChain_ProdMinBias_Nvidia_khurtado_TaskChain_ProdMinBias_Nvidia_231208_175132_1239
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TC_6Tasks_PU_khurtado_TC_6Tasks_PU_231208_175138_9749
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TC_6Tasks_Scratch_khurtado_TC_6Tasks_Scratch_231208_175147_1850
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TC_EL8_khurtado_TC_EL8_231208_175155_7592
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TC_LHE_PFN_khurtado_TC_LHE_PFN_231208_175203_2589
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TC_PY3_ProdPsi_khurtado_TC_PY3_ProdPsi_231208_175209_8919
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_DQMHarvesting_khurtado_newsubmit_DQMHarvesting_231208_175940_395
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_DQMHarvesting_LumiMask_khurtado_newsubmit_DQMHarvesting_LumiMask_231208_175946_6198
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_DQMHarvesting_MultiRun_khurtado_newsubmit_DQMHarvesting_MultiRun_231208_175953_1008
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_ReReco_LumiMask_khurtado_newsubmit_ReReco_LumiMask_231208_175959_8238
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_ReReco_Parents_khurtado_newsubmit_ReReco_Parents_231208_180006_3979
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_SC_LumiMask_Rules_khurtado_newsubmit_SC_LumiMask_Rules_231208_180025_6678
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_SC_MultiPU_khurtado_newsubmit_SC_MultiPU_231208_180035_4055
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_SC_ReDigi_Harvest_Prod_khurtado_newsubmit_SC_ReDigi_Harvest_Prod_231208_180055_2981
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TaskChain_MC_khurtado_newsubmit_TaskChain_MC_231208_180119_6899
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TaskChain_Prod_khurtado_newsubmit_TaskChain_Prod_231208_180131_1608
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TaskChain_PUMCRecyc_khurtado_newsubmit_TaskChain_PUMCRecyc_231208_180141_4168
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TC_Drop_PhEDEx_Ext_khurtado_newsubmit_TC_Drop_PhEDEx_Ext_231208_180149_2057
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TC_PY3_Data_LumiList_khurtado_newsubmit_TC_PY3_Data_LumiList_231208_180156_70
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TC_PY3_TTbarPU_khurtado_newsubmit_TC_PY3_TTbarPU_231208_180206_7048

Contributor

@amaltaro amaltaro left a comment

@khurtado thank you for providing those json dumps, Kenyi. Based on that, I have the following suggestions:

  • what do you think about renaming output_files_transient to transient_output?
  • perhaps I would rename output_files_lfnBase to something like output_files_unmerged_base (or output_files_unmerged_lfn_base). Similarly for output_files_mergedLFNBase
  • when a step has no input files, should we set input_files=[] instead of input_files=[null]?

In addition, perhaps it would be useful to have another attribute for the job type as well (Production/Processing/Merge/etc)? If we have that, then we could consider logging the final json structure for Production/Processing jobs. The other job types can potentially have a very large list of input files and I'd rather not dump those in the logs.

Could you please try to create a unit test for createWMRuntimeJson as well?

Note that I have not yet looked into the source code changes, but I will come back to this later or whenever you provide further updates to this PR.
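
To make the suggestions above concrete, here is a rough sketch that converts one cmsRun_params entry from the first draft into the suggested shape. It is an illustration only, not code from this PR: the helper name apply_review_suggestions is hypothetical, and it picks the first of the two names offered for the unmerged base.

```python
# Old key -> suggested key, following the review comments above.
RENAMES = {
    "output_files_transient": "transient_output",
    "output_files_lfnBase": "output_files_unmerged_base",
    "output_files_mergedLFNBase": "output_files_merged_base",
}


def apply_review_suggestions(step_params):
    """Return a copy of one cmsRun_params entry using the suggested naming."""
    updated = {RENAMES.get(key, key): value for key, value in step_params.items()}
    # A step without input files is reported as [] rather than [null].
    updated["input_files"] = [f for f in updated.get("input_files", []) if f is not None]
    return updated


old_step = {
    "step": "cmsRun1",
    "keep_output": False,
    "input_files": [None],
    "output_files": ["RAWSIMoutput.root"],
    "output_files_transient": [True],
}
new_step = apply_review_suggestions(old_step)
```

With an empty list, a consumer can iterate over input_files or test its length without special-casing a null entry.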

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 1 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14706/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor Author

khurtado commented Dec 11, 2023

@amaltaro

I made the following changes:

  • I renamed the attributes with your suggestions
  • I agree with the input_files as well, so it is now doing [] rather than [null]
  • I added job_type which can be Production, Processing, Merge, LogCollect, etc
  • I also changed Startup such that if job_type is Production/Processing, we dump the json in the log
  • I have also squashed the commits

What is missing is the unit test, which I can come back to later when I'm back. In the meantime, I think this is ready for review again.

The log will look like this:

<snip>

INFO:root:runtime json = {
    "workflow_type": "TaskChain",
    "number_of_cmsRuns": 1,
    "worker_arch": "X86_64",
    "worker_os": "rhel8",
    "job_type": "Processing",
    "cmsRun_params": [
        {
            "step": "cmsRun1",
            "keep_output": true,
            "input_files": [
                "/store/data/Run2023C/JetMET1/RAW/v1/000/367/131/00000/f7943baa-66b2-402d-a9f2-4f6fc52cde13.root"
            ],
            "output_files": [
                "FEVTDEBUGHLToutput.root"
            ],
            "output_files_datatiers": [
                "FEVTDEBUGHLT"
            ],
            "transient_output": [
                true
            ],
            "output_files_unmerged_base": [
                "/store/unmerged/CMSSW_13_3_0_pre4/JetMET1/FEVTDEBUGHLT/133X_dataRun3_HLT_frozen_v1_CNAFARM_RelVal_2023C-v1"
            ],
            "output_files_merged_base": [
                "/store/relval/CMSSW_13_3_0_pre4/JetMET1/FEVTDEBUGHLT/133X_dataRun3_HLT_frozen_v1_CNAFARM_RelVal_2023C-v1"
            ]
        }
    ]
}
INFO:root:Building task at directory: /tmpscratch/users/khurtado/work/wmcore/11743/taskchain/job
INFO:root:CMSSW.coreBuild invoked
INFO:root:create(/tmpscratch/users/khurtado/work/wmcore/11743/taskchain/job/WMTaskSpace/cmsRun1)
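
The logging rule described in this comment (dump the json only for Production/Processing jobs) can be sketched as follows. This is not the code in Startup.py; the function name log_runtime_json and the constant LOGGED_JOB_TYPES are hypothetical. Python's default logging.basicConfig format is "LEVEL:logger:message", which is consistent with the "INFO:root:" prefix in the excerpt above.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

# Job types whose runtime json is dumped to the log, per the comment above.
LOGGED_JOB_TYPES = ("Production", "Processing")


def log_runtime_json(runtime):
    """Log the runtime json for Production/Processing jobs; return True if logged."""
    if runtime.get("job_type") not in LOGGED_JOB_TYPES:
        # Other job types may carry very large input file lists, so skip them.
        return False
    logging.info("runtime json = %s", json.dumps(runtime, indent=4))
    return True


logged = log_runtime_json({"workflow_type": "TaskChain", "job_type": "Processing"})
skipped = log_runtime_json({"workflow_type": "TaskChain", "job_type": "Merge"})
```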

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 1 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14710/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

Thank you for the prompt action, Kenyi.
I left a few more comments along the code, and I would suggest the following as well:

src/python/WMCore/WMRuntime/Bootstrap.py Outdated
if chained:
    inputModule = getattr(step.data.input, 'inputOutputModule', None)
    inputStepName = getattr(step.data.input, 'inputStepName', None)
    inputFileNames.append("../{}/{}.root".format(inputStepName, inputModule))
Contributor

Is it possible to set it to an absolute path instead of relative?

Contributor Author

@amaltaro Maybe, but we treat chained processing with relative paths in the PSet tweaking, which is why I opted to keep it that way. Considering that, do you prefer an absolute path, or keeping it consistent with the code below?

inputFile = ("file:../%s/%s.root" % (self.step.data.input.inputStepName,
                                     self.step.data.input.inputOutputModule))
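
To illustrate the two options under discussion, the sketch below builds the same chained input path in relative form (as in the snippets above) and in an absolute form. The function name chained_input_path, the step_dir parameter, and the /job/WMTaskSpace layout are assumptions made for this example and do not come from the PR.

```python
import posixpath


def chained_input_path(input_step_name, input_output_module, step_dir=None):
    """Build the input file path for a chained step.

    With step_dir=None the result is relative to the current step's working
    directory, i.e. the "../<step>/<module>.root" form used above.
    """
    relative = "../{}/{}.root".format(input_step_name, input_output_module)
    if step_dir is None:
        return relative
    # Hypothetical absolute variant: resolve against the current step directory.
    return posixpath.normpath(posixpath.join(step_dir, relative))


rel_path = chained_input_path("cmsRun1", "RAWSIMoutput")
abs_path = chained_input_path("cmsRun1", "RAWSIMoutput",
                              step_dir="/job/WMTaskSpace/cmsRun2")
```

The relative form stays valid wherever the job sandbox is unpacked, while the absolute form only makes sense once the worker-node directory is known.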

@khurtado
Contributor Author

khurtado commented Dec 11, 2023

@amaltaro I addressed all requests except the absolute path for chained processing. I believe it should be treated the same way as in the PSet tweaking.
I left the changes in a different commit so they are easier to spot.

CodeTimer will print something like this for the json stuff now:

INFO:root:Creating WM runtime information json took 0.218 seconds to complete
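
For context, a timing message of this shape can be produced with a small context manager. The sketch below is a stand-in written for illustration; it is not WMCore's CodeTimer, whose actual interface is not shown in this thread, and the name code_timer is hypothetical.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)


@contextmanager
def code_timer(label):
    """Log how long the wrapped block took (stand-in, not WMCore's CodeTimer)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so the timing is always logged.
        elapsed = time.perf_counter() - start
        logging.info("%s took %.3f seconds to complete", label, elapsed)


with code_timer("Creating WM runtime information json"):
    payload = {"workflow_type": "TaskChain", "number_of_cmsRuns": 1}
```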

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 1 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 7 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14715/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 1 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 7 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14716/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

Thank you Kenyi, let's get this in.

@amaltaro amaltaro merged commit fd288bb into dmwm:master Dec 12, 2023
3 of 4 checks passed
"job_type": "Production", # string with job type information: Production/Processing, Merge, LogCollect, etc.
"cmsRun_params": [{
"step": "cmsRun1", # string with the relevant step
"keek_output": boolean saying whether we keep output or not,

Looks like a typo, should the key be keep_output?

Contributor

Yes Matti, it should be keep_output. I will soon pick a couple of examples from workflows that ran this week and share those here with you.

Thanks Alan.

@amaltaro
Contributor

@makortel Matti, in case this is relevant to you: I picked a couple of workflows and looked at what was reported in the wmagentJob.log log file.

This is the runtime json that was created for a StepChain job which belongs to workflow amaltaro_SC_EL8_Agent227_Val_240110_213036_6665:

{
    "cmsRun_params": [
        {
            "input_files": [],
            "keep_output": true,
            "output_files": [
                "FEVTDEBUGoutput.root"
            ],
            "output_files_datatiers": [
                "GEN-SIM"
            ],
            "output_files_merged_base": [
                "/store/backfill/1/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/GEN-SIM/GenSimFull_SC_EL8_Agent227_Val_Alanv1-v11"
            ],
            "output_files_unmerged_base": [
                "/store/unmerged/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/GEN-SIM/GenSimFull_SC_EL8_Agent227_Val_Alanv1-v11"
            ],
            "step": "cmsRun1",
            "transient_output": [
                true
            ]
        },
        {
            "input_files": [
                "../cmsRun1/FEVTDEBUGoutput.root"
            ],
            "keep_output": true,
            "output_files": [
                "FEVTDEBUGHLToutput.root"
            ],
            "output_files_datatiers": [
                "GEN-SIM-DIGI-RAW"
            ],
            "output_files_merged_base": [
                "/store/backfill/1/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/GEN-SIM-DIGI-RAW/Digi_2021_SC_EL8_Agent227_Val_Alanv1-v11"
            ],
            "output_files_unmerged_base": [
                "/store/unmerged/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/GEN-SIM-DIGI-RAW/Digi_2021_SC_EL8_Agent227_Val_Alanv1-v11"
            ],
            "step": "cmsRun2",
            "transient_output": [
                true
            ]
        },
        {
            "input_files": [
                "../cmsRun2/FEVTDEBUGHLToutput.root"
            ],
            "keep_output": true,
            "output_files": [
                "NANOEDMAODSIMoutput.root",
                "MINIAODSIMoutput.root",
                "DQMoutput.root",
                "RECOSIMoutput.root"
            ],
            "output_files_datatiers": [
                "NANOAODSIM",
                "MINIAODSIM",
                "DQMIO",
                "GEN-SIM-RECO"
            ],
            "output_files_merged_base": [
                "/store/backfill/1/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/NANOAODSIM/RecoNano_2021_SC_EL8_Agent227_Val_Alanv1-v11",
                "/store/backfill/1/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/MINIAODSIM/RecoNano_2021_SC_EL8_Agent227_Val_Alanv1-v11",
                "/store/backfill/1/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/DQMIO/RecoNano_2021_SC_EL8_Agent227_Val_Alanv1-v11",
                "/store/backfill/1/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/GEN-SIM-RECO/RecoNano_2021_SC_EL8_Agent227_Val_Alanv1-v11"
            ],
            "output_files_unmerged_base": [
                "/store/unmerged/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/NANOAODSIM/RecoNano_2021_SC_EL8_Agent227_Val_Alanv1-v11",
                "/store/unmerged/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/MINIAODSIM/RecoNano_2021_SC_EL8_Agent227_Val_Alanv1-v11",
                "/store/unmerged/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/DQMIO/RecoNano_2021_SC_EL8_Agent227_Val_Alanv1-v11",
                "/store/unmerged/CMSSW_12_4_0_pre2/RelValWToLNu_14TeV/GEN-SIM-RECO/RecoNano_2021_SC_EL8_Agent227_Val_Alanv1-v11"
            ],
            "step": "cmsRun3",
            "transient_output": [
                true,
                true,
                true,
                true
            ]
        }
    ],
    "job_type": "Production",
    "number_of_cmsRuns": 3,
    "worker_arch": "X86_64",
    "worker_os": "rhel8",
    "workflow_type": "StepChain"
}

while this TaskChain job for https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_TC_Nano_Agent227_Val_240110_213025_3077 produced:

{
    "cmsRun_params": [
        {
            "input_files": [
                "/store/mc/RunIISummer20UL17MiniAODv2/WplusToJJZToLNuJJ_mjj100_pTj10_QCD_LO_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v2/2560000/FC70BF0E-01D1-D04C-8C9F-C0DEFA040142.root",
                "/store/mc/RunIISummer20UL17MiniAODv2/WplusToJJZToLNuJJ_mjj100_pTj10_QCD_LO_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v2/2560000/CC9DE505-6F64-034F-9F1D-3CEB014DCC2F.root",
                "/store/mc/RunIISummer20UL17MiniAODv2/WplusToJJZToLNuJJ_mjj100_pTj10_QCD_LO_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v2/2560000/C0F2C2AB-D82F-D344-917C-CFD67051CEBF.root",
                "/store/mc/RunIISummer20UL17MiniAODv2/WplusToJJZToLNuJJ_mjj100_pTj10_QCD_LO_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v2/2560000/F5B1A081-7A16-F744-BE87-5166500D959E.root",
                "/store/mc/RunIISummer20UL17MiniAODv2/WplusToJJZToLNuJJ_mjj100_pTj10_QCD_LO_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v2/2560000/F2A73B75-CFBA-8E41-AFAC-396401CB4067.root",
                "/store/mc/RunIISummer20UL17MiniAODv2/WplusToJJZToLNuJJ_mjj100_pTj10_QCD_LO_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v2/2560000/4BCFFC5F-8139-B74E-AFD3-6BD520111CE1.root"
            ],
            "keep_output": true,
            "output_files": [
                "NANOEDMAODSIMoutput.root"
            ],
            "output_files_datatiers": [
                "NANOAODSIM"
            ],
            "output_files_merged_base": [
                "/store/backfill/1/RunIISummer20UL17NanoAODv9/WplusToJJZToLNuJJ_mjj100_pTj10_QCD_LO_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/Task1_NanoAODv9_TC_Nano_Agent227_Val_Alanv1-v11"
            ],
            "output_files_unmerged_base": [
                "/store/unmerged/RunIISummer20UL17NanoAODv9/WplusToJJZToLNuJJ_mjj100_pTj10_QCD_LO_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/Task1_NanoAODv9_TC_Nano_Agent227_Val_Alanv1-v11"
            ],
            "step": "cmsRun1",
            "transient_output": [
                true
            ]
        }
    ],
    "job_type": "Processing",
    "number_of_cmsRuns": 1,
    "worker_arch": "X86_64",
    "worker_os": "rhel7",
    "workflow_type": "TaskChain"
}

Those are just for illustration, but I hope they give you a better sense of what we are currently providing from the workflow/job runtime environment.
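
A consumer of these dumps may want a quick consistency check before relying on them. The sketch below assumes two properties that hold in the examples above but are not stated as guarantees anywhere in this thread: number_of_cmsRuns equals the number of cmsRun_params entries, and every per-file list has one entry per output file. The function name check_runtime_json and the /store/example paths are hypothetical.

```python
def check_runtime_json(runtime):
    """Return a list of problems found in a runtime json dict (empty if none)."""
    problems = []
    steps = runtime.get("cmsRun_params", [])
    if runtime.get("number_of_cmsRuns") != len(steps):
        problems.append("number_of_cmsRuns does not match cmsRun_params")
    per_file_keys = ("output_files_datatiers", "output_files_merged_base",
                     "output_files_unmerged_base", "transient_output")
    for params in steps:
        expected = len(params.get("output_files", []))
        for key in per_file_keys:
            found = len(params.get(key, []))
            if found != expected:
                problems.append("{}: {} has {} entries, expected {}".format(
                    params.get("step"), key, found, expected))
    return problems


good = {
    "number_of_cmsRuns": 1,
    "cmsRun_params": [{
        "step": "cmsRun1",
        "output_files": ["NANOEDMAODSIMoutput.root"],
        "output_files_datatiers": ["NANOAODSIM"],
        "output_files_merged_base": ["/store/example/merged"],
        "output_files_unmerged_base": ["/store/example/unmerged"],
        "transient_output": [True],
    }],
}
bad = {"number_of_cmsRuns": 2, "cmsRun_params": good["cmsRun_params"]}
good_problems = check_runtime_json(good)
bad_problems = check_runtime_json(bad)
```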

@makortel

Thanks @amaltaro for the examples. Here is some initial feedback from me and @Dr15Jones (based on just staring at the examples, and reminding ourselves what we had asked for in #10819):

  • What is the meaning of keep_output and transient_output?
  • What is the meaning of output_files_merged_base and output_files_unmerged_base?
  • Seems we would need some documentation on all the fields
  • How would a workflow configured to use the two-file solution (i.e. CMSSW process.source.secondaryFileNames) look like?
  • How would a pileup file look like?
  • How can we tell if an intermediate file is not used outside of the StepChain?
  • We'd want the current step value (cmsRun1 etc) to be passed in some way to the process.customizeWorkflow() function. Could be e.g.
    • new top-level untracked.PSet (we'd need to agree the name)
    • parameter to the customizeWorkflow()

@amaltaro
Contributor

@khurtado Kenyi, as we discussed in the weekly meeting, can you please document these fields and share the documentation here? If you prefer, create a merge request against cms-wmcore and we can follow it from there.

@khurtado
Contributor Author

khurtado commented Feb 22, 2024

@makortel

  • What is the meaning of keep_output and transient_output?

    "keep_output": marks whether or not we should keep the output from this CMSSW step.

Reference:

def keepOutput(self, keepOutput):
    """
    _keepOutput_
    Mark whether or not we should keep the output from this step. We don't
    want to keep the output from certain chained steps.
    """
    self.data.output.keep = keepOutput
    return

"transient_output": ordered list of booleans, one per output file, saying whether the output file is announced in the workflow or not. E.g.: StepChain intermediate step outputs are considered transient and hence do not need to be staged out and registered in DBS/Rucio.

Reference:

It also assumes all the intermediate steps output are transient and do not need
to be staged out and registered in DBS/PhEDEx. Only the last step output will be
made available.

However, there is a bug on transient output that maybe @amaltaro can comment more on: #9013

  • What is the meaning of output_files_merged_base and output_files_unmerged_base?
    "output_files_unmerged_base": [ordered strings with unmerged lfnBase info]
    Used, for example, to build the process logicalFileName:

    result.addParameter("process.%s.logicalFileName" % modName, lfn)

    "output_files_merged_base": [ordered strings with merged LFNBase info]

@amaltaro: Would it be fair to say if keep_output is True, the merged LFN base is used?

  • Seems we would need some documentation on all the fields
    As far as I know, there is at present no documentation for these fields other than the comments in the code. We might need to create an issue to create some documentation.

  • How would a workflow configured to use the two-file solution (i.e. CMSSW process.source.secondaryFileNames) look like?

    • Mhhh, I don’t think we are covering this case, actually
    • Looking into the WM tweak, inputFile['lfn'] is used for the primary files and inputFile['parents'] for the secondaryFileNames. We would need to include this field in the json.
      for secondaryFile in inputFile["parents"]:
          secondaryFiles.append(secondaryFile["lfn"])

@amaltaro Do you agree? If so, I can create a ticket to follow up on this

  • How would a pileup file look like?
  • How can we tell if an intermediate file is not used outside of the StepChain?
    • Technically, with the transient value.
  • We'd want the current step value (cmsRun1 etc) to be passed in some way to the process.customizeWorkflow() function. Could be e.g.

@amaltaro
Contributor

However, there is a bug on transient output that maybe @amaltaro can comment more on: #9013

The transient_output basically specifies whether a given output module + datatier is supposed to be registered in DBS/Rucio or not. In other words, objects marked as transient are usually produced/consumed/destroyed inside a given job and they are not meant to even be transferred to the storage unmerged area (except for differences between TaskChain and StepChain).

@amaltaro: Would it be fair to say if keep_output is True, the merged LFN base is used?

Totally! If False, then those output files are never merged (and never show up in DBS/Rucio).
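
The rule confirmed here can be sketched as a small lookup: when keep_output is true the merged LFN base applies to each output file, otherwise the file is never merged. The helper name destination_bases and the /store/example paths are hypothetical, and the sketch deliberately ignores the transient_output flag and the open issue #9013.

```python
def destination_bases(step_params):
    """Map each output file name to its merged LFN base, or None if never merged."""
    merged = step_params["output_files_merged_base"]
    result = {}
    for index, name in enumerate(step_params["output_files"]):
        # keep_output=True: merged base applies; False: output is never merged.
        result[name] = merged[index] if step_params["keep_output"] else None
    return result


kept = destination_bases({
    "keep_output": True,
    "output_files": ["AODSIMoutput.root"],
    "output_files_merged_base": ["/store/example/merged/AODSIM"],
})
dropped = destination_bases({
    "keep_output": False,
    "output_files": ["RAWSIMoutput.root"],
    "output_files_merged_base": ["/store/example/merged/GEN-SIM"],
})
```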

How would a workflow configured to use the two-file solution (i.e. CMSSW process.source.secondaryFileNames) look like?

Yes, AFAIR secondaryFileNames is used for pileup files. For the record, here is the script used to tweak the job PSet: https://github.com/cms-sw/cmssw-wm-tools/blob/30b626d030b7b11f83fe4e0385ce303e83d3fcdf/bin/cmssw_handle_pileup.py

We do not provide a list of secondary files though, as some of those datasets (e.g. Neutrino) can have > 100k files. Unless there is a strong motivation for this, I don't think it should be part of the runtime dump - there are other ways to get a hold of it in the worker node.

@amaltaro do you have an example of this?

I had to scan my AFS area, and I did find an example from 2017 (AFAIK it has not changed). You can also access it here: https://amaltaro.web.cern.ch/amaltaro/forFrancesco/myPSet.py

How can we tell if an intermediate file is not used outside of the StepChain?
Technically, with the transient value.

or keep_output=false.

HTH!
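
As a rough illustration of the point above, a consumer of the runtime JSON could list the steps whose output stays inside the job by checking keep_output. This is only a sketch: the field names step, keep_output and output_files come from the example JSON in this PR, while the helper name intermediate_outputs, the sample values, and the choice to treat a missing keep_output as "kept" are assumptions made for illustration.

```python
import json


def intermediate_outputs(runtime):
    """Return {step: output_files} for steps whose output is not kept.

    Hypothetical helper, not part of WMCore.
    """
    result = {}
    for params in runtime.get("cmsRun_params", []):
        # Per the discussion above, keep_output=false means the files are
        # never merged and never show up in DBS/Rucio.
        # Assumption: a missing keep_output is treated as "kept".
        if not params.get("keep_output", True):
            result[params["step"]] = list(params.get("output_files", []))
    return result


# Hypothetical two-step StepChain dump; the values are illustrative only.
example = json.loads("""
{
    "workflow_type": "StepChain",
    "number_of_cmsRuns": 2,
    "cmsRun_params": [
        {"step": "cmsRun1", "keep_output": false,
         "output_files": ["RAWSIMoutput.root"]},
        {"step": "cmsRun2", "keep_output": true,
         "output_files": ["AODSIMoutput.root"]}
    ]
}
""")

print(intermediate_outputs(example))
```

The same loop could check the transient value instead, once its behaviour described in #9013 is settled.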

@khurtado
Contributor Author

How would a workflow configured to use the two-file solution (i.e. CMSSW process.source.secondaryFileNames) look like?

Yes, AFAIR secondaryFileNames is used for pileup files. For the record, here is the script used to tweak the job PSet: https://github.com/cms-sw/cmssw-wm-tools/blob/30b626d030b7b11f83fe4e0385ce303e83d3fcdf/bin/cmssw_handle_pileup.py

We do not provide a list of secondary files though, as some of those datasets (e.g. Neutrino) can have > 100k files. Unless there is a strong motivation for this, I don't think it should be part of the runtime dump - there are other ways to get a hold of it in the worker node.

I am confused. Isn't inputFile["parents"] the list of secondaryFiles?

for secondaryFile in inputFile["parents"]:
    secondaryFiles.append(secondaryFile["lfn"])

@makortel

makortel commented Feb 22, 2024

Thanks for the clarifications. Here are some quick follow-up questions

However, there is a bug on transient output that maybe @amaltaro can comment more on: #9013

The transient_output basically specifies whether a given output module + datatier is supposed to be registered in DBS/Rucio or not. In other words, objects marked as transient are usually produced/consumed/destroyed inside a given job and they are not meant to even be transferred to the storage unmerged area (except for differences between TaskChain and StepChain).

Could you elaborate how the bug manifests? Like is true when it should be false, or false when it should be true? The examples in #11812 (comment) show it as true in all cases.

  • What is the meaning of output_files_merged_base and output_files_unmerged_base?
    "output_files_unmerged_base": [ordered string with unmerged lfnBase info],
    Used for example, to build the process logicalFileName:

Ok, so these are about how the files are handled by the WM after the chain has completed. I guess for the purposes of modifying cmsRun behavior we could ignore these.

  • Seems we would need some documentation on all the fields
    As far as I know, there is at present no documentation for these fields other than the comments in the code. We might need to create an issue to create some documentation.

That would be great. We need to be able to produce the JSON file for CMSSW tests, and without a clear definition of the fields that would be challenging.
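
Until such documentation exists, a test producer on the CMSSW side could at least check a generated file against the fields visible in the example from this PR. The sketch below does that; the field names are copied from that example, but the expected types are inferred from it and are an assumption, not a specification. The function name find_problems is made up for illustration.

```python
# Field names come from the example JSON in this PR. The expected types are
# inferred from that single example and are an assumption.
TOP_LEVEL_FIELDS = {
    "workflow_type": str,
    "number_of_cmsRuns": int,
    "worker_arch": str,
    "worker_os": str,
    "cmsRun_params": list,
}
STEP_FIELDS = {
    "step": str,
    "keep_output": bool,
    "input_files": list,
    "output_files": list,
}


def find_problems(runtime):
    """Return a list of readable problems; an empty list means no issue found."""
    problems = []
    for name, expected in TOP_LEVEL_FIELDS.items():
        if not isinstance(runtime.get(name), expected):
            problems.append(f"{name}: expected {expected.__name__}")
    steps = runtime.get("cmsRun_params")
    if not isinstance(steps, list):
        return problems  # already reported above
    for index, params in enumerate(steps):
        if not isinstance(params, dict):
            problems.append(f"cmsRun_params[{index}]: expected dict")
            continue
        for name, expected in STEP_FIELDS.items():
            if not isinstance(params.get(name), expected):
                problems.append(
                    f"cmsRun_params[{index}].{name}: expected {expected.__name__}"
                )
    return problems


# Values taken from the TaskChain example shown earlier in this PR.
example = {
    "workflow_type": "TaskChain",
    "number_of_cmsRuns": 1,
    "worker_arch": "X86_64",
    "worker_os": "rhel8",
    "cmsRun_params": [
        {
            "step": "cmsRun1",
            "keep_output": True,
            "input_files": [
                "/store/data/Run2023C/JetMET1/RAW/v1/000/367/131/00000/"
                "f7943baa-66b2-402d-a9f2-4f6fc52cde13.root"
            ],
            "output_files": ["FEVTDEBUGHLToutput.root"],
        }
    ],
}

print(find_problems(example))
print(find_problems(dict(example, number_of_cmsRuns="1")))
```

Fields such as output_files_merged_base and output_files_unmerged_base are left out on purpose, since they may not matter for modifying cmsRun behavior.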

How would a workflow configured to use the two-file solution (i.e. CMSSW process.source.secondaryFileNames) look like?

Yes, AFAIR secondaryFileNames is used for pileup files. For the record, here is the script used to tweak the job PSet: https://github.com/cms-sw/cmssw-wm-tools/blob/30b626d030b7b11f83fe4e0385ce303e83d3fcdf/bin/cmssw_handle_pileup.py

Let's try to not confuse two-file solution and pileup files. @khurtado's link

for secondaryFile in inputFile["parents"]:
    secondaryFiles.append(secondaryFile["lfn"])

looked relevant to what I was after (process.source.secondaryFileNames).

We do not provide a list of secondary files though, as some of those datasets (e.g. Neutrino) can have > 100k files. Unless there is a strong motivation for this, I don't think it should be part of the runtime dump - there are other ways to get a hold of it in the worker node.

On a quick thought I don't think we would have a use case that would require the knowledge of the pileup files, and the question was more out of curiosity. But we need to think a bit more.

Right, it can be related to #11744. I brought it up because from the earlier discussion in #10819 (comment) onwards it wasn't clear if the WM preference for this information would be to have it as part of the JSON or some other way.

@khurtado
Contributor Author

@amaltaro Could you confirm whether we would need to add this block (inputFile["parents"]) to support the secondaryFileNames field? From what I see, that seems to be the case, but your previous comment on it confused me, so I want to double check. If so, we would need to create a couple of issues, one to provide this extra field and one for documentation.

for secondaryFile in inputFile["parents"]:
    secondaryFiles.append(secondaryFile["lfn"])

@amaltaro
Contributor

Kenyi, IF this secondary file feature refers to workflows processing parent datasets (which in the workflow description is represented by IncludeParents=True), then it looks like we need this information as well.
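
A minimal sketch of what that could look like, assuming each input file record carries an "lfn" and an optional "parents" list as in the loop quoted above. The helper name split_input_files, the sample LFNs, and the secondary_input_files key are hypothetical; no such field exists in the runtime JSON today.

```python
def split_input_files(input_files):
    """Split file records into primary LFNs and parent (secondary) LFNs.

    Hypothetical helper mirroring the loop quoted from the WM tweak code.
    """
    primary, secondary = [], []
    for input_file in input_files:
        primary.append(input_file["lfn"])
        # Assumption: records without parents simply omit the key.
        for parent in input_file.get("parents", []):
            secondary.append(parent["lfn"])
    return primary, secondary


# Made-up records for a workflow with IncludeParents=True.
records = [
    {
        "lfn": "/store/example/child_1.root",
        "parents": [
            {"lfn": "/store/example/parent_1a.root"},
            {"lfn": "/store/example/parent_1b.root"},
        ],
    },
    {"lfn": "/store/example/child_2.root"},
]

primary, secondary = split_input_files(records)
step_entry = {
    "step": "cmsRun1",
    "input_files": primary,
    # Hypothetical key name for the extra field discussed here.
    "secondary_input_files": secondary,
}
print(step_entry)
```

This covers only parent files; pileup files are a separate matter and stay out of the dump, as noted earlier.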

As for the GH tickets, I see these things as an atomic operation and IMO documentation - whenever required - should be provided with the actual development. You already have most of the information here anyway, so migrating to gitlab should be simple (classical words ;))

@khurtado
Contributor Author

Do I understand correctly that the 2 required steps are the following?

  1. Create a new issue to include inputFile["parents"], in order to support secondaryFileNames
  2. Document these changes in wmcore-docs

Development

Successfully merging this pull request may close these issues.

Create dump of the WM runtime parameters for Process customization