High memory usage in DQM harvesting job for Express #38976

Open
rvenditti opened this issue Aug 5, 2022 · 42 comments

Comments

@rvenditti
Contributor

DQMHarvesting is exceeding maxPSS in Express reconstruction at T0 in runs 356381, 356615, 356719 (link to cmsTalk).

  • We re-ran the jobs on lxplus and they completed successfully, although with some warnings (see below).

  • Looking inside the tarball and at the locally run jobs, it seems that the output of cmsRun shows some warnings in HLTConfigProvider for the HLT-EGamma client and at the merging step for CTPPS.
    In order to investigate these warnings, two GitHub issues have been opened:
    CTPPS ME merging problems in Harvesting step #38969
    Problem in DQM Harvesting step with EgHLTOfflineClient #38970

  • However, looking at a similar issue observed in May, it seems that the warnings have been there for a long time and are not the root of the problem.

  • Running the IgProf tool for memory profiling, it seems that the main memory consumer is the DQMFileSaver::saveForOffline function (see /afs/cern.ch/work/r/rosma/public/Run356381/igreport_total.res and /afs/cern.ch/work/r/rosma/public/Run356615/), but we don't know how reliable this check is, given that harvesting does not run on events. Can a software expert have a look and give us some suggestions? @makortel

@cmsbuild
Contributor

cmsbuild commented Aug 5, 2022

A new Issue was created by @rvenditti .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@davidlange6
Contributor

could you post the igprof results please?

@davidlange6
Contributor

(e.g., the HTML version is more easily understandable)

@makortel
Contributor

makortel commented Aug 5, 2022

assign dqm

@cmsbuild
Contributor

cmsbuild commented Aug 5, 2022

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@rvenditti
Contributor Author

Hi @davidlange6, unfortunately I am not able to forward the browsable version to my CERN site. However, the related sql3 file is here: /eos/user/r/rosma/www/cgi-bin/data/igreport_perf.sql3

@makortel
Contributor

makortel commented Aug 5, 2022

I took a quick look at /afs/cern.ch/work/r/rosma/public/Run356381/igreport_total.res and I see the profile reports MEM_TOTAL, i.e. total memory allocations. MEM_LIVE (which shows the memory usage at a given time) would be more useful. It seems to me that IgProfService does not have dump points for the framework callbacks that would be most useful for a harvesting job. I'll add some and profile myself too.

On the other hand, I'm not sure how useful the IgProf profile is for a harvesting job. We know that the job processes histograms, but the IgProf profile won't tell much about exactly which histograms take a lot of memory (beyond the type). Have you checked the histogram sizes from a DQMIO file? I think that would be a useful number, even if DQMRootSource should merge the histograms on each file open, and e.g. in https://cms-talk.web.cern.ch/t/re-2018-replay-for-pps-pcl-test-dqm-expresmergewrite-memory-too-high/6114/7 the memory problem was clearly in the output.
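
As a starting point, here is a minimal PyROOT sketch (not an official DQM tool) for getting such a number. It only lists the per-type TTrees found in the DQMIO file and uses the uncompressed tree size as a rough proxy for the in-memory footprint of the histograms:

# Rough per-type size summary of a DQMIO file (sketch; sizes are only a proxy).
import sys
import ROOT

fname = sys.argv[1] if len(sys.argv) > 1 else "DQMIO.root"
f = ROOT.TFile.Open(fname)

rows = []
for key in f.GetListOfKeys():
    obj = key.ReadObj()
    if isinstance(obj, ROOT.TTree):
        rows.append((obj.GetName(), obj.GetEntries(), obj.GetTotBytes(), obj.GetZipBytes()))

rows.sort(key=lambda r: r[2], reverse=True)
print("%-15s %12s %15s %15s" % ("tree", "entries", "tot bytes", "zip bytes"))
for name, nentries, tot, zipped in rows:
    print("%-15s %12d %15d %15d" % (name, nentries, tot, zipped))
f.Close()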

@makortel
Contributor

makortel commented Aug 6, 2022

One thing I noticed is that the job is awfully slow under igprof -mp ... cmsRunGlibC. While restarting my memory profile after copying the input files for run 356615 (for run 356381 the files were already gone), I profiled the CPU time consumption. The full profile is here:
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf
Of the 1623 seconds in total, the TProfile merging and GEMDQMHarvester could be something worth taking a look at (even if strictly speaking outside the scope of this issue).

@makortel
Contributor

makortel commented Aug 8, 2022

A MEM_LIVE profile (dumped at postGlobalEndRun) points to GEMDQMHarvester::dqmEndLuminosityBlock() taking 610 MB https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/preEndRun.mem.356615/25 . Nearly all of that appears to be in std::maps in GEMDQMHarvester::createTableWatchingSummary() https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/preEndRun.mem.356615/27.

The maps are initialized in

void GEMDQMHarvester::createTableWatchingSummary() {
  if (bIsStatusChambersInit_)
    return;
  for (const auto &[nIdxLayer, nNumCh] : mapNumChPerChamber_) {
    for (Int_t i = 1; i <= nNumCh; i++) {
      mapStatusChambersSummary_[{nIdxLayer, i}] = std::vector<StatusInfo>();
      mapNumStatusChambersSummary_[{nIdxLayer, i}] = NumStatus();
      for (Int_t j = 1; j <= nNumVFATs_; j++) {
        mapStatusVFATsSummary_[{nIdxLayer, i, j}] = std::vector<StatusInfo>();
        mapNumStatusVFATsSummary_[{nIdxLayer, i, j}] = NumStatus();
      }
    }
  }
  bIsStatusChambersInit_ = true;
}

The profile indicates

  • mapStatusChambersSummary_ and mapNumStatusChambersSummary_ have 160,380 elements
  • mapStatusVFATsSummary_ and mapNumStatusVFATsSummary_ have 3,849,120 elements

I'm afraid the GEMDQMHarvester::drawSummaryHistogram() needs to be redesigned.
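
As a back-of-envelope illustration (assumed overheads, not a measurement) of why those map sizes translate into hundreds of MB:

# Assumed per-entry cost on a 64-bit build: ~48 bytes of std::map node bookkeeping
# plus ~32 bytes for the key tuple and the (empty) vector / NumStatus payload.
node_overhead = 48
key_and_payload = 32
entries = 2 * 3849120 + 2 * 160380   # the two VFAT maps plus the two chamber maps, from the profile
print(entries * (node_overhead + key_and_payload) / 1e6, "MB")   # ~640 MB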

@makortel
Contributor

makortel commented Aug 8, 2022

FYI @cms-sw/gem-dpg-l2

@jshlee
Contributor

jshlee commented Aug 9, 2022

@slowmoyang @quark2 can you take a look at this?

@mmusich
Contributor

mmusich commented Aug 12, 2022

@jshlee @slowmoyang @quark2 any update on this issue?

@quark2
Contributor

quark2 commented Aug 12, 2022

Hi @mmusich,

Sorry for my late reply, I'm working on it, but I was sick, so the task was delayed... I'll make a fix soon.

Best regards,
Byeonghak Ko

@mmusich
Contributor

mmusich commented Aug 15, 2022

Hi @quark2,
there have already been 5 instances of the issue so far (if I count correctly); can you clarify the timeline for the fix?
In case it cannot arrive by today, such that a patch release can be built tomorrow after the ORP meeting (FYI @perrotta @qliphy), can we consider disabling the offending module in production until a proper fix is found?
I think we can do that by commenting out these lines:

from Configuration.Eras.Modifier_run3_GEM_cff import run3_GEM
_run3_GEM_DQMOffline_SecondStepMuonDPG = DQMOffline_SecondStepMuonDPG.copy()
_run3_GEM_DQMOffline_SecondStepMuonDPG += gemClients
run3_GEM.toReplaceWith(DQMOffline_SecondStepMuonDPG, _run3_GEM_DQMOffline_SecondStepMuonDPG)

please clarify.
Thanks,

Marco (ORM)

@quark2
Contributor

quark2 commented Aug 15, 2022

Hi @mmusich,

I made a fix, but I have no idea how to measure the memory consumption with the fix. Could you tell me how?
Once the fix works fine, I'll make a PR asap.

Best regards,
Byeonghak Ko

@mmusich
Contributor

mmusich commented Aug 15, 2022

@quark2 the best way would be to run the IgProf profiler; there are some instructions here: https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof.
Another (quicker) way is to simply monitor the RSS used by cmsRun when executing the harvesting and check that it is substantially lower than before.
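
For the quick check, a sketch of a config fragment using the SimpleMemoryCheck service (parameter names as I remember them; please double-check in FWCore/Services), which prints the resident memory as it grows and a per-module summary at the end of the job:

process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),           # skip the first report(s)
    moduleMemorySummary = cms.untracked.bool(True)  # per-module summary at end of job
)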

@quark2
Contributor

quark2 commented Aug 15, 2022

Thanks, I'll use it!

@quark2
Contributor

quark2 commented Aug 15, 2022

Hi,

I've made a fix and its backport to 12_4_X. I've seen a reduction in the RSS (although it was still 1.9 GB; is that still too large?).

@mmusich
Contributor

mmusich commented Aug 15, 2022

I've seen a reduction in the RSS.

By how much is it reduced?
Matti pointed out above that GEMDQMHarvester::dqmEndLuminosityBlock() is taking 610 MB. Do you have profiles from before and after the change? Is the 1.9 GB due to GEM alone or to the harvesting job overall?

@quark2
Contributor

quark2 commented Aug 15, 2022

The 1.9 GB is for the harvesting job overall. The RSS before the fix was 2.8 GB, so about 900 MB is saved.

@mmusich
Contributor

mmusich commented Aug 15, 2022

The RSS before the fix was 2.8 GB, so about 900 MB is saved.

Thanks, can you also provide the corresponding number for a harvesting job executed without GEMDQMHarvester at all?
This would help set the scale of what is reasonable.

@quark2
Contributor

quark2 commented Aug 15, 2022

I executed it without GEMDQMHarvester, and the maximum RSS was also 1.9 GB. I'm not sure if that is reasonable, but at least I can say that GEMDQMHarvester itself consumes relatively little memory (after the issue is fixed).

@quark2
Contributor

quark2 commented Aug 15, 2022

Btw, does anyone know how to reproduce DQMIO files locally, especially root://cms-xrd-global.cern.ch//store/express/Run2022C/StreamExpress/DQMIO/Express-v1/000/356/381/00000/0114A6C8-C901-4386-94D4-F0386E636CA0.root?

@mmusich
Contributor

mmusich commented Aug 15, 2022

Btw, does anyone know how to reproduce DQMIO files locally, especially root://cms-xrd-global.cern.ch//store/express/Run2022C/StreamExpress/DQMIO/Express-v1/000/356/381/00000/0114A6C8-C901-4386-94D4-F0386E636CA0.root?

you would need to run the express reco from the streamer files of 356381, which I strongly suspect are already gone.

@quark2
Contributor

quark2 commented Aug 15, 2022

I see... I need to look at the geometry part more, and reproducing that file should give a good hint.

@mmusich
Contributor

mmusich commented Aug 15, 2022

I see... I need to look at the geometry part more, and reproducing that file should give a good hint.

There have recently been failures of the same type, e.g. for run 357479.

I think the streamers for that file are still available:

$ eos ls /store/t0streamer/Data/Express/000/357/479 | grep .dat | wc -l
1044

you would need to generate the configuration via:

python RunExpressProcessing.py --scenario ppEra_Run3 --lfn /store/t0streamer/Data/Express/000/357/479/run357479_ls1012_streamExpress_StorageManager.dat --global-tag 101X_dataRun2_Express_v7 --fevt --dqmio 

manually modifying the output configuration

process.maxEvents = cms.untracked.PSet(
    input = cms.untracked.int32(-1)
)
process.source = cms.Source("NewEventStreamFileReader",
    fileNames = cms.untracked.vstring('<put here all your input streamer files>')
)
process.options = cms.untracked.PSet(
    numberOfStreams = cms.untracked.uint32(0),
    numberOfThreads = cms.untracked.uint32(8)
)

to get the correct source.
Alternatively, I think you can try to use a cmsDriver command mimicking what Tier-0 does for express, starting with RAW data as input, but that would require more time for me to cook up.
Perhaps @cms-sw/dqm-l2 can help here as well.
HTH
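
For the '<put here all your input streamer files>' placeholder, a possible sketch to build the list (this assumes the streamer files are readable through the /eos/cms fuse mount at CERN; adjust the path or the access method if not):

import glob

streamer_dir = '/eos/cms/store/t0streamer/Data/Express/000/357/479'  # assumed fuse-mounted path
streamers = sorted(glob.glob(streamer_dir + '/*.dat'))
process.source.fileNames = cms.untracked.vstring(['file:' + f for f in streamers])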

@mmusich
Contributor

mmusich commented Aug 16, 2022

@sarafiorendi
Contributor

@makortel @mmusich
sorry, my bad. I updated the results with the PR integrated (same link as above); they show (as far as I understand) some effective reduction.

@mmusich
Contributor

mmusich commented Aug 16, 2022

sorry, my bad. I updated the results with the PR integrated (same link as above); they show (as far as I understand) some effective reduction.

Thanks. Still, I am not sure I understand why there is no contribution from GEMDQMHarvester in MEM_LIVE.

@makortel
Contributor

Are we missing something in the recipe? We followed https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof

@mmusich @sarafiorendi Was this the part of the recipe you followed for the MEM_LIVE profile?

igprof -d -t cmsRunGlibC -mp cmsRunGlibC a.py >& a.log

igprof-analyse --sqlite -v --demangle --gdb -r MEM_LIVE IgProf.1.gz > ig.1.txt

@sarafiorendi
Contributor

these are the commands I ran:

igprof -d -t cmsRunGlibC -mp cmsRunGlibC dump_cfg.py >& PR_30files.log
igprof-analyse --sqlite -d -v -g -r MEM_LIVE igprof.cmsRunGlibC.5556.1660650465.673794.gz | sqlite3 igreport_live.sql3

as from
https://igprof.org/analysis.html#:~:text=Memory%20profiling%20reports

so maybe the second one is not correct (?)

@makortel
Contributor

makortel commented Aug 16, 2022

Thanks. I'm a bit confused about which dump exactly igprof.cmsRunGlibC.5556.1660650465.673794.gz corresponds to, since the Validation.Performance.IgProfInfo.customise

process.IgProfService = cms.Service("IgProfService",
    reportFirstEvent = cms.untracked.int32(1),  # dump the first event for baseline studies
    reportEventInterval = cms.untracked.int32(((process.maxEvents.input.value() - 1) // 2)),  # dump in the middle of the run
    reportToFileAtPostEvent = cms.untracked.string("| gzip -c > IgProf.%I.gz")
)

causes a profile to be dumped after some events with the event entry number in the file name. Maybe it's the "end of process" dump, given that the recipe does not specify an output file (-o). In that case the MEM_LIVE would report memory leaks (which is not particularly relevant for this problem).

To see the memory usage of GEMDQMHarvester as in #38976 (comment), you can add a profile dump point at the globalEndRun transition with

process.IgProfService.reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz")

(where %R gets substituted with the run number) and convert that into the sqlite3 file.

@mmusich
Contributor

mmusich commented May 19, 2023

Just for the record, this keeps happening in 2023 as well, despite an increase of the memory allowance for harvesting jobs at Tier-0.
A recent paused job occurred in Express_Run367696_StreamExpress (logs available here).

@makortel
Contributor

Is there a recipe to reproduce? (the link wasn't clear for me)

@gpetruc
Contributor

gpetruc commented May 20, 2023

Hi,

tar xzvf /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023C/MemoryExpress/7c897736-9d50-4d3a-8e7b-da1e3c8d796b-Periodic-Harvest-1-3-logArchive.tar.gz
cd job/WMTaskSpace/cmsRun1
cmsRun PSet.py

The job has 598 files to process. When running, the memory slowly increases as the files are opened, and then increases much faster at the end. It starts off at ~1.4 GB and is at 1.7 GB after 500 files, but then jumps up. Logging every few seconds the number of opened files and the memory usage in kB (a sketch of such a polling script is given at the end of this comment), I got

files 572 mem 1718344
files 575 mem 1718344
files 577 mem 1718344
files 580 mem 1718344
files 580 mem 1718344
files 582 mem 1833032
files 585 mem 1833032
files 587 mem 1833032
files 589 mem 1833032
files 592 mem 1833032
files 594 mem 1833032
files 597 mem 1833032
files 598 mem 1968200
files 598 mem 2586696
files 598 mem 2586696
...

And then there are lots of LogErrors of the form

%MSG-e MergeFailure:  source 20-May-2023 12:58:16 CEST PostBeginProcessBlock
Found histograms with different axis limits or different labels 'ROCs hits multiplicity per event vs LS' not merged.
%MSG

and the memory usage grows further, at least up to 2859080 kB (the job is still running...)

I'm also testing igprof with the command lines and the reportToFileAtPostEndRun suggested above.
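
For reference, a minimal sketch of the kind of polling script used for the log above (it assumes a Linux /proc filesystem and takes the PID of the running cmsRun as argument):

# Poll the number of open .root input files and the resident memory (VmRSS, in kB).
import os, sys, time

pid = int(sys.argv[1])
while os.path.exists("/proc/%d" % pid):
    fddir = "/proc/%d/fd" % pid
    nfiles = sum(1 for fd in os.listdir(fddir)
                 if os.path.realpath(os.path.join(fddir, fd)).endswith(".root"))
    with open("/proc/%d/status" % pid) as status:
        rss_kb = next(int(line.split()[1]) for line in status if line.startswith("VmRSS"))
    print("files", nfiles, "mem", rss_kb)
    time.sleep(5)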

@gpetruc
Contributor

gpetruc commented May 20, 2023

If the job is left to run, the memory usage climbs all the way to 5 GB.

I didn't manage to get anything out of igprof, despite trying to set in process.IgProfService

     reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz"),
     reportToFileAtPostOpenFile = cms.untracked.string("| gzip -c > IgProf.fileOpen.%F.gz"),
     reportToFileAtPostCloseFile = cms.untracked.string("| gzip -c > IgProf.fileClose.%C.gz")

I don't get any of those dumps.
I also tried to get some dumps from jemalloc, but with little success.

I got some dumps from valgrind massif, which seem to imply that it's mostly histograms and associated ROOT objects (TLists, THashLists, ...); maybe, as a consequence of the failed merging, the job keeps a lot of duplicate histograms?

If anyone has more up-to-date instructions to profile the heap I can try them.

For those interested in reproducing the problem more quickly: limiting the job to 5-10 files makes it run in a few minutes while still exhibiting a large memory growth at the end (see the sketch below).
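
A sketch of the kind of modification I mean, appended at the end of the unpacked PSet.py (this assumes the harvesting source exposes the usual fileNames vstring):

# Keep only the first 10 input files to get a faster reproducer.
process.source.fileNames = cms.untracked.vstring(list(process.source.fileNames)[:10])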

@makortel
Contributor

The IgProf memory profiling setup has, unfortunately, become broken.

@gartung Could you remind us of the recipe to use jemalloc's profiling that works in 13_0_X (i.e. without #40899)?

I'll try out VTune.

In the meantime, could @cms-sw/dqm-l2 check that the set of histograms in the merged files looks reasonable?

@davidlange6
Contributor

From a log it seems the input files are all held open until the end. Is that expected?

@makortel
Contributor

makortel commented May 22, 2023

From a log it seems the input files are all held open until the end. Is that expected?

That is how DQMRootSource has been coded

DQMRootSource::~DQMRootSource() {
  for (auto& file : m_openFiles) {
    if (file.m_file && file.m_file->IsOpen()) {
      logFileAction("Closed file", "");
    }
  }
}

(since #28622, IIUC; another PR related to DQMRootSource memory usage #30889)

@gartung
Member

gartung commented May 22, 2023

The IgProf memory profiling setup has, unfortunately, become broken.

@gartung Could you remind us the recipe to use jemalloc's profiling that works in 13_0_X (i.e. without #40899)?

I'll try out VTune.

In the meantime, could @cms-sw/dqm-l2 check that the set of histograms in the merged files looks reasonable?

Besides setting up the version of jemalloc with profiling enabled:
scram setup jemalloc-prof
scram b ToolUpdated
you also need to set an environment variable:
MALLOC_CONF=prof_leak:true,lg_prof_sample:10,prof_final:true cmsRun config.py

@davidlange6
Contributor

davidlange6 commented May 22, 2023 via email

@makortel
Contributor

VTune wasn't very useful, but I managed to get the following total sizes by type:

  • TH1F: 134.2 MB
  • TH2F: 369.1 MB
  • TProfile2D: 234.9 MB

In addition, the TBuffer constructor (via the TBufferFile constructor) accounts for 377.5 MB. These don't sound outrageously high to me (even if the TH2F and TProfile2D numbers are on the high side).
