Failures in Run 3 data reprocessing #40437

Open
kskovpen opened this issue Jan 6, 2023 · 186 comments

@kskovpen (Contributor) commented Jan 6, 2023

We are seeing failures in the ongoing Run 3 data reprocessing, presumably related to the DeepTau implementation. Here is just one example of the failure: https://cms-unified.web.cern.ch/cms-unified/report/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693

The crash message is:

Exception Message: invalid prediction = nan for tau_index = 0, pred_index = 0

PdmV

@cmsbuild (Contributor) commented Jan 6, 2023

A new Issue was created by @kskovpen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones (Contributor)

Assign reconstruction

@cmsbuild (Contributor) commented Jan 6, 2023

New categories assigned: reconstruction

@mandrenguyen, @clacaputo you have been requested to review this Pull request/Issue and eventually sign. Thanks

@makortel (Contributor) commented Jan 6, 2023

A bit more context from the link

Fatal Exception (Exit code: 8001)
An exception of category 'DeepTauId' occurred while
[0] Processing Event run: 357696 lumi: 141 event: 162424424 stream: 0
[1] Running path 'MINIAODoutput_step'
[2] Prefetching for module PoolOutputModule/'MINIAODoutput'
[3] Prefetching for module PATTauIDEmbedder/'slimmedTaus'
[4] Calling method for module DeepTauId/'deepTau2017v2p1ForMini'
Exception Message:
invalid prediction = nan for tau_index = 0, pred_index = 0

The links to logs do not seem to work.

@danielwinterbottom

Just to add that this might be related to this issue: #28358
In that case, the problems were found to be non-reproducible crashes due to problems on the site.

@srimanob (Contributor)

Can this issue be closed? Or is the issue related to something else?

@VinInn (Contributor) commented Feb 9, 2023

Would it be possible to obtain details on the machine where the crash occurred?

@kskovpen (Contributor, Author)

It looks like this issue is a bottleneck in finishing the 2023 data reprocessing. The fraction of failures is not negligible, as can be seen here (look for 8001 error codes). Is there still a way to implement protection against these failures?

@VinInn we are trying to get this info; will let you know if we manage to dig it out.

@kskovpen (Contributor, Author)

Also, looking at the discussion that happened in #40733: is our understanding correct that this issue is potentially fixed in 12_6_0_pre5?

@makortel (Contributor)

Also, looking at the discussion that happened in #40733: is our understanding correct that this issue is potentially fixed in 12_6_0_pre5?

No, the code still throws the exception:

throw cms::Exception("DeepTauId")
<< "invalid prediction = " << pred << " for tau_index = " << tau_index << ", pred_index = " << k;

@makortel (Contributor)

@kskovpen Do you have any pointers to the logs of the 8001 failures?

@kskovpen (Contributor, Author)

@kskovpen Do you have any pointers to the logs of the 8001 failures?

Here it is.

@makortel (Contributor)

@kskovpen Do you have any pointers to the logs of the 8001 failures?

Here it is.

Thanks. This failure occurred on an Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, which is of the Westmere microarchitecture, i.e. SSE-only.
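
Since several of these failures cluster on SSE-only (pre-AVX) hosts, it can be handy to classify a worker node quickly. Below is a minimal sketch in Python that only inspects the flags line of /proc/cpuinfo on Linux; the specific flag names checked are an illustrative assumption, not an exhaustive microarchitecture test.

# Sketch: report whether the local node advertises any AVX-class instruction
# set, i.e. whether it is "SSE-only" in the sense used in this thread.
# Assumes a Linux /proc filesystem; the flag names are illustrative.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def main():
    flags = cpu_flags()
    avx_like = [flag for flag in ("avx", "avx2", "avx512f") if flag in flags]
    print("AVX-class flags:", avx_like if avx_like else "none (SSE-only node)")

if __name__ == "__main__":
    main()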

@kskovpen (Contributor, Author) commented Jul 2, 2023

This issue appears to be the main bottleneck in the current Run 3 data reprocessing. The initial cause of the issue could be the excessive memory usage of the deepTau-related modules. The issue happens all over the place and there are many examples, e.g. https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022E-ZeroBias-27Jun2023-00001#DataProcessing:50660. Was memory profiling done for the latest deepTau implementation in CMSSW?

@makortel (Contributor) commented Jul 3, 2023

https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022E-ZeroBias-27Jun2023-00001#DataProcessing:50660

Trying to look at the logs for 50660, I see

  • https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022E_ZeroBias_27Jun2023_230627_121019_4766/50660/DataProcessing/0b0b9315-4516-4a95-b2a1-63dcdaa879c9-24-0-logArchive/
    • I see only the invalid prediction = -nan for tau_index = 0, pred_index = 0 exception, on Intel(R) Xeon(R) CPU X5650, which is Westmere microarchitecture, i.e. SSE-only
    • The wmagentJob.log has
       2023-06-28 05:13:16,497:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_4_14_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
       2023-06-28 05:14:12,850:INFO:PerformanceMonitor:PSS: 1015060; RSS: 759448; PCPU: 50.5; PMEM: 1.1
       2023-06-28 05:19:13,174:INFO:PerformanceMonitor:PSS: 12010830; RSS: 6378532; PCPU: 123; PMEM: 9.7
       2023-06-28 05:19:13,175:ERROR:PerformanceMonitor:Error in CMSSW step cmsRun1
       Number of Cores: 4
       Job has exceeded maxPSS: 10000 MB
       Job has PSS: 12010 MB
       
       2023-06-28 05:19:13,176:ERROR:PerformanceMonitor:Attempting to kill step using SIGUSR2
      
      which is interesting because it claims PSS larger than RSS; I don't understand how that could be. The RSS is reasonable for a 4-core job, and quite consistent with the RSS reported in the CMSSW log. If the timestamps of wmagentJob.log and cmsRun1-stdout.log can be correlated (i.e. their clocks are close enough), the error by WM above is noticed at a time when CMSSW has already terminated the data-processing loop and is shutting down.
    • Later on the wmagentJob.log shows the CMSSW terminated with exit code 8001.
  • https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022E_ZeroBias_27Jun2023_230627_121019_4766/50660/DataProcessing/03b0b56e-f7b7-44ff-a29e-e6033be3b4d6-9-0-logArchive/
    • The wmagentJob.log has
      2023-06-28 06:02:34,738:INFO:PerformanceMonitor:PSS: 549635; RSS: 668964; PCPU: 38.0; PMEM: 0.2
      2023-06-28 06:07:34,962:INFO:PerformanceMonitor:PSS: 9424486; RSS: 7618620; PCPU: 233; PMEM: 2.8
      2023-06-28 06:12:35,200:INFO:PerformanceMonitor:PSS: 10024095; RSS: 7686548; PCPU: 308; PMEM: 2.9
      2023-06-28 06:12:35,200:ERROR:PerformanceMonitor:Error in CMSSW step cmsRun1
      Number of Cores: 4
      Job has exceeded maxPSS: 10000 MB
      Job has PSS: 10024 MB
      
      2023-06-28 06:12:35,201:ERROR:PerformanceMonitor:Attempting to kill step using SIGUSR2
      
      again showing PSS larger than RSS, and RSS being somewhat compatible with the RSS reported in the CMSSW log.
    • Correlating the timestamps between wmagentJob.log and cmsRun1-stdout.log, it seems that CMSSW shut itself down after the SIGUSR2 signal. The CMSSW log itself shows no issues; RSS fluctuates between 7.0 and 7.5 GiB.

With these two logs alone I don't see any evidence that deepTau would cause memory issues. The main weirdness to me seems to be PSS becoming larger than RSS, which leads WM to ask CMSSW to stop processing (in addition to the exception from deepTau).
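
For anyone scanning more of these wmagentJob.log files, here is a minimal sketch of the PSS-vs-RSS check that is being done by eye above. The PerformanceMonitor line format is inferred from the excerpts quoted in this comment, and the values are taken as kB as printed in the log.

import re
import sys

# Sketch: scan a wmagentJob.log for PerformanceMonitor samples and flag the
# "PSS larger than RSS" anomaly discussed above. Line format inferred from
# the excerpts quoted in this thread, e.g.
#   2023-06-28 05:19:13,174:INFO:PerformanceMonitor:PSS: 12010830; RSS: 6378532; ...
SAMPLE = re.compile(r"^(\S+ \S+?):INFO:PerformanceMonitor:PSS: (\d+); RSS: (\d+);")

def flag_pss_above_rss(log_path):
    with open(log_path) as log:
        for line in log:
            match = SAMPLE.match(line)
            if not match:
                continue
            stamp, pss_kb, rss_kb = match.group(1), int(match.group(2)), int(match.group(3))
            if pss_kb > rss_kb:
                print(f"{log_path}: {stamp}: PSS {pss_kb} kB > RSS {rss_kb} kB")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        flag_pss_above_rss(path)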

@makortel (Contributor) commented Jul 3, 2023

The 8001 log (https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022E_ZeroBias_27Jun2023_230627_121019_4766/8001/DataProcessing/04ae17ee-b07e-4ca6-b32e-4e51ee944afe-36-0-logArchive/job/) shows that the job throwing the exception was also run on an Intel(R) Xeon(R) CPU X5650, i.e. Westmere and SSE-only.

@davidlange6 (Contributor)

Do we have any such reported failures at CERN? (E.g., why does the T0 not see this same issue?)

@kskovpen (Contributor, Author) commented Jul 3, 2023

The failures are strongly correlated with the sites. The highest failure rates are observed at T2_BE_IIHE, T2_US_Nebraska, and T2_US_Caltech.

@amaltaro commented Jul 3, 2023

@makortel Hi Matti, for the test workflow that I was running and for which I reported issues like:

invalid prediction = -nan for tau_index = 0, pred_index = 0

it happened at only 2 sites (13 failures at MIT and 3 at Nebraska).

A couple of logs for MIT are available in CERN EOS:

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-2055-0-log.tar.gz
and
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-206-0-log.tar.gz

while for Nebraska they are:

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1662-0-log.tar.gz
and
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1668-0-log.tar.gz

@makortel (Contributor) commented Jul 3, 2023

Thanks @amaltaro for the logs.

A couple of logs for MIT are available in CERN EOS:

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-2055-0-log.tar.gz

This job failed because the input data from file:/mnt/hadoop/cms/store/data/Run2022D/MuonEG/RAW/v1/000/357/688/00000/819bbdb2-43c0-4393-aa84-4f0a81ad5f9e.root was corrupted; the CMSSW log has:

R__unzipLZMA: error 9 in lzma_code
----- Begin Fatal Exception 01-Jul-2023 05:39:45 EDT-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 357688 lumi: 48 event: 85850129 stream: 2
   [1] Running path 'AODoutput_step'
   [2] Prefetching for module PoolOutputModule/'AODoutput'
   [3] While reading from source GlobalObjectMapRecord hltGtStage2ObjectMap '' HLT
   [4] Rethrowing an exception that happened on a different read request.
   [5] Processing  Event run: 357688 lumi: 48 event: 86242374 stream: 0
   [6] Running path 'dqmoffline_17_step'
   [7] Prefetching for module CaloTowersAnalyzer/'AllCaloTowersDQMOffline'
   [8] Prefetching for module CaloTowersCreator/'towerMaker'
   [9] Prefetching for module HBHEPhase1Reconstructor/'hbhereco@cpu'
   [10] Prefetching for module HcalRawToDigi/'hcalDigis'
   [11] While reading from source FEDRawDataCollection rawDataCollector '' LHC
   [12] Reading branch FEDRawDataCollection_rawDataCollector__LHC.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 2802592, fKeylen = 115, fObjlen = 4817986, noutot = 0, nout=0, nin=2802477, nbuf=4817986

----- End Fatal Exception -------------------------------------------------

I'm puzzled why the FileReadError resulted in exit code 8001 instead of 8021, but I'll open a separate issue for that (#42179).

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-206-0-log.tar.gz

This file doesn't seem to exist.

while for Nebraska they are:

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1662-0-log.tar.gz
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1668-0-log.tar.gz

These jobs failed with the invalid prediction = nan for tau_index = 0, pred_index = 0 exception. The node had an Intel(R) Xeon(R) CPU X5650, i.e. SSE-only (and thus consistent with the discussion above).

@makortel (Contributor) commented Jul 3, 2023

Another example of highly failing workflow: https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305

Here

https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-0-0-logArchive/
shows the combination of the invalid prediction = nan for tau_index = 0, pred_index = 0 exception (the CPU is an Intel(R) Xeon(R) CPU X5650, i.e. SSE-only) and WM seeing PSS going over the limit, while RSS is much smaller and reasonable for a 4-core job.

https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-2-0-logArchive/
WM sees PSS going over the limit, while RSS is smaller and reasonable for a 4-core job.

@makortel (Contributor) commented Jul 3, 2023

The main weirdness to me seems to be PSS becoming larger than RSS, which leads WM to ask CMSSW to stop processing (in addition to the exception from deepTau).

Poking into the WM code, I see that PSS is read from /proc/<PID>/smaps, and RSS from ps: https://github.com/dmwm/WMCore/blob/762bae943528241f67625016fd019ebcd0014af1/src/python/WMCore/WMRuntime/Monitors/PerformanceMonitor.py#L242. IIUC ps uses /proc/PID/stat (which is also what CMSSW's SimpleMemoryCheck printouts use), and apparently stat and smaps are known to report different numbers (e.g. https://unix.stackexchange.com/questions/56469/rssresident-set-size-is-differ-when-use-pmap-and-ps-command).

But is this large (~3 GB, ~30 %) difference expected? (Granted, we don't know what the RSS as reported by smaps would be.)
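
To make the comparison concrete, here is a small sketch that, assuming a standard Linux /proc layout, reads PSS by summing the Pss: fields of /proc/<pid>/smaps (the source WM uses) and RSS from field 24 of /proc/<pid>/stat (what ps and SimpleMemoryCheck report). It is only a diagnostic aid, not the WMCore code.

import os
import sys

# Sketch: read the two memory accountings compared above for a given PID.
# PSS = sum of the Pss: fields in /proc/<pid>/smaps (kB);
# RSS = field 24 of /proc/<pid>/stat (in pages), converted to kB.
def pss_kb(pid):
    total = 0
    with open(f"/proc/{pid}/smaps") as smaps:
        for line in smaps:
            if line.startswith("Pss:"):
                total += int(line.split()[1])
    return total

def rss_kb(pid):
    with open(f"/proc/{pid}/stat") as stat:
        content = stat.read()
    # Split after the ')' closing the comm field, since comm may contain spaces.
    fields = content.rsplit(")", 1)[1].split()
    rss_pages = int(fields[21])  # overall field 24 = rss, counted from the state field
    return rss_pages * os.sysconf("SC_PAGE_SIZE") // 1024

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    print(f"PID {pid}: PSS {pss_kb(pid)} kB, RSS {rss_kb(pid)} kB")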

@davidlange6 (Contributor) commented Jul 3, 2023 via email

@Dr15Jones (Contributor)

(there are long stretches when MemoryCheck is silent, as VSS does not change while RSS is actually changing due to swapping)

Newer releases also monitor changes in RSS in MemoryCheck.

@VinInn (Contributor) commented Jul 26, 2023

It happens (second time) even if the job is pinned (I killed the previous job, and bang).

[innocent@lxplus801 rereco]$ cat /proc/3951792/smaps_rollup
00400000-ffffc2780000 ---p 00000000 00:00 0                              [rollup]
Rss:             7653376 kB
Anonymous:       7463360 kB
AnonHugePages:   1048576 kB
Swap:             507712 kB
[innocent@lxplus801 rereco]$ cat /proc/3951792/smaps_rollup
Rss:             9195776 kB
Anonymous:       9005760 kB
AnonHugePages:   2621440 kB
Swap:             507712 kB
[innocent@lxplus801 rereco]$ cat /proc/3951792/smaps_rollup
Rss:            10354880 kB
Anonymous:      10164480 kB
AnonHugePages:   3670016 kB
Swap:             410112 kB
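
A small polling version of the manual cat commands above; it assumes only the standard /proc/<pid>/smaps_rollup fields shown in the excerpts and makes a sudden growth of Rss or AnonHugePages easy to spot over time.

import sys
import time

# Sketch: poll /proc/<pid>/smaps_rollup (as done by hand with `cat` above) and
# print Rss / Anonymous / AnonHugePages / Swap every minute, to watch for
# sudden growth. All values are in kB, as reported by the kernel.
FIELDS = ("Rss", "Anonymous", "AnonHugePages", "Swap")

def sample(pid):
    values = {}
    with open(f"/proc/{pid}/smaps_rollup") as rollup:
        for line in rollup:
            parts = line.split()
            key = parts[0].rstrip(":")
            if key in FIELDS:
                values[key] = int(parts[1])
    return values

if __name__ == "__main__":
    pid = sys.argv[1]
    while True:
        print(time.strftime("%H:%M:%S"), sample(pid))
        time.sleep(60)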

@VinInn (Contributor) commented Jul 26, 2023

The memory explosion can be reproduced!

@VinInn (Contributor) commented Jul 28, 2023

Let's cross-post. In the rereco in question, these are the models we run in each stream:

[innocent@lxplus801 rereco]$ grep onnx config.py
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/full/resnet.onnx'),
    model_path = cms.FileInPath('RecoParticleFlow/PFProducer/data/mlpf/mlpf_2021_11_16__no_einsum__all_data_cms-best-of-asha-scikit_20211026_042043_178263.workergpu010.onnx'),
    mvaIDTrainingFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx'),
    mvaIDTrainingFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx'),
    mvaIDTrainingFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/full/resnet.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/full/resnet.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepVertex/phase1_deepvertexcombined.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDB.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDCvB.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDC.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV03_10X_training/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV03_10X_training/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV03_10X_training/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepVertex/phase1_deepvertex.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/HiggsInteractionNet/V00/IN.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/HiggsInteractionNet/V00/IN.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/decorrelated/resnet.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/decorrelated/resnet.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/MD-2prong/V01/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/MD-2prong/V01/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDB_mass_independent.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/BvL.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/BvL.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDCvB_mass_independent.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvB.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvB.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDC_mass_independent.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvL.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvL.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV03_10X_training/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/General/V01/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/General/V01/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/MassRegression/V01/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/MassRegression/V01/particle-net.onnx'),
    mvaIDTrainingFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx'),

So multiple modules use the very same model.
It is absolutely not trivial to try to collapse them, as they MAY run in parallel.
Still, they could share the same model, as it is held in an edm::GlobalCache<ONNXRuntime>.
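
To quantify the duplication, one can count how many module instances point at each ONNX file in the dumped configuration; below is a minimal sketch over the same config.py that was grep'ed above (the FileInPath parameter pattern is taken from that output).

import re
from collections import Counter

# Sketch: count how many module instances reference each ONNX model file in the
# dumped configuration (the same config.py grep'ed above), to see how much
# could in principle be shared via a common edm::GlobalCache<ONNXRuntime>.
PATTERN = re.compile(r"FileInPath\('([^']+\.onnx)'\)")

with open("config.py") as cfg:
    counts = Counter(PATTERN.findall(cfg.read()))

for model, n in counts.most_common():
    print(f"{n:3d}  {model}")
print(f"{sum(counts.values())} references to {len(counts)} distinct models")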

@VinInn (Contributor) commented Aug 1, 2023

My summary:
There were four, most probably independent, issues:

  1. TauId throws in a loop that is not supposed to be executed, as the size of the collection is supposed to be zero.
    Not solved, not reproduced. Most probably memory corruption.
  2. PSS > RSS
    Identified, solution proposed at WMCore level.
  3. Suddenly RSS grows "out of control"
    Reproduced. Seems related to a non-collaborative interplay between JeMalloc and THP in the presence of scarce resources and memory fragmentation.
    Solution proposed: use TCMalloc, which is explicitly designed to collaborate with THP (a small THP diagnostic sketch follows below).
  4. RelVal needs more than 2 GB per stream (see the other issue).
    It seems related to SimHit replay.
    Solution is to run with Nstreams < 0.5*Nthreads (pending a reassessment of the need for SimHit replay, in particular in the Tracker).
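
For item 3, it can help to check whether transparent huge pages are actually in play on a given node and process. Below is a minimal diagnostic sketch reading the standard Linux sysfs/procfs locations; it only reports the situation and is not a fix.

import sys

# Sketch for item 3: report the kernel THP policy and how much of a process's
# anonymous memory is currently backed by transparent huge pages (in kB).
def thp_policy():
    with open("/sys/kernel/mm/transparent_hugepage/enabled") as policy:
        return policy.read().strip()  # e.g. "always [madvise] never"

def anon_huge_kb(pid):
    total = 0
    with open(f"/proc/{pid}/smaps") as smaps:
        for line in smaps:
            if line.startswith("AnonHugePages:"):
                total += int(line.split()[1])
    return total

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    print("THP policy:", thp_policy())
    print(f"AnonHugePages for PID {pid}: {anon_huge_kb(pid)} kB")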

@drkovalskyi (Contributor)

Thanks Vincenzo for the investigation and the summary.

@srimanob (Contributor) commented Aug 1, 2023

4. RelVal needs more than 2 GB per stream (see the other issue).
It seems related to SimHit replay.
Solution is to run with Nstreams < 0.5*Nthreads (pending a reassessment of the need for SimHit replay, in particular in the Tracker).

Why does data reprocessing (the topic of this issue) have an issue with SimHit replay? Or do you mean general MC RelVals? For MC RelVals, it currently runs with fewer streams (4 threads, 2 streams) in the default setup.

@VinInn (Contributor) commented Aug 1, 2023

@srimanob, let me clarify: data reprocessing has no issue with SimHit replay. RelVal does.

@jamesletts commented Aug 2, 2023

My summary: There were four, most probably independent, issues:

  1. TauId throws in a loop that is not supposed to be executed, as the size of the collection is supposed to be zero.
    Not solved, not reproduced. Most probably memory corruption.
  2. PSS > RSS
    Identified, solution proposed at WMCore level.
  3. Suddenly RSS grows "out of control"
    Reproduced. Seems related to a non-collaborative interplay between JeMalloc and THP in the presence of scarce resources and memory fragmentation.
    Solution proposed: use TCMalloc, which is explicitly designed to collaborate with THP.
  4. RelVal needs more than 2 GB per stream (see the other issue).
    It seems related to SimHit replay.
    Solution is to run with Nstreams < 0.5*Nthreads (pending a reassessment of the need for SimHit replay, in particular in the Tracker).

Regarding this excellent summary of Vincenzo's (thank you), the XEB is asking for periodic updates and time estimates for the solutions. I have a link to the WMCore fix (2). Are there GitHub issues opened for 1 (TauID) and 3 (TCMalloc)?

@makortel (Contributor) commented Aug 2, 2023

Are there GitHub issues opened for 1 (TauID)

#42444

and 3 (TCMalloc)?

#42387

@makortel (Contributor) commented Aug 7, 2023

Profiling the job mentioned in #40437 (comment) again with the usual MEM_LIVE (not peak) showed that in this job the top-level ParameterSet took about 40 MB:
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue40437/reco_07.5_live/496

The top-level ParameterSet is currently kept alive throughout the job, but that is not strictly necessary. I made a PR to master to release the ParameterSet soon after the modules have been constructed (before beginJob()): #42503.

@makortel (Contributor) commented Aug 7, 2023

I spun off the DQM memory usage discussion into #42504

@makortel (Contributor) commented Aug 7, 2023

This ~70-90 MB / job
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue40437/reco_07.5_live/255
has already been fixed in 13_1_X #41465.

@VinInn (Contributor) commented Aug 8, 2023

Is #41465 worth a backport?

@makortel (Contributor) commented Aug 8, 2023

Is #41465 worth a backport?

I'd say it depends on the @cms-sw/pdmv-l2's plan for this re-reco in 12_4_X (well, also for plans for a re-reco in 13_0_X). If there is a chance that the backport would really be used, then the backport could be worth it.

@makortel (Contributor) commented Aug 9, 2023

Here is another almost 100 MB / stream (in L1TMuonEndCapTrackProducer) #42526

@makortel (Contributor)

Another spin-off, this time on the number of allocations (i.e. memory churn) #42672
