High memory usage in DQM harvesting job for Express #38976

Open
rvenditti opened this issue Aug 5, 2022 · 42 comments

Comments

@rvenditti
Contributor

DQMHarvesting is exceeding maxPSS in Express reconstruction at T0 in runs 356381, 356615, 356719 (link to cmsTalk).

  • We re-ran the jobs on lxplus and they completed successfully, although with some warnings (see below).

  • Looking inside the tarball and at the locally run jobs, it seems that the output of cmsRun shows some warnings in HLTConfigProvider for the HLT-EGamma client and at the merging step for CTPPS.
    In order to investigate these warnings, two GitHub issues have been opened:
    CTPPS ME merging problems in Harvesting step #38969
    Problem in DQM Harvesting step with EgHLTOfflineClient #38970

  • However, looking at a similar issue observed in May, it seems that the warnings have been there for a long time and are not the root of the problem.

  • Running the IgProf tool for memory profiling, it seems that the main memory consumer is the DQMFileSaver::saveForOffline function (see /afs/cern.ch/work/r/rosma/public/Run356381/igreport_total.res and /afs/cern.ch/work/r/rosma/public/Run356615/), but we don't know how reliable this check is, given that harvesting does not run on events. Can a software expert have a look and give us some suggestions? @makortel

@cmsbuild
Contributor

cmsbuild commented Aug 5, 2022

A new Issue was created by @rvenditti .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@davidlange6
Contributor

could you post the igprof results please?

@davidlange6
Contributor

(e.g., the HTML version is more easily understandable)

@makortel
Contributor

makortel commented Aug 5, 2022

assign dqm

@cmsbuild
Contributor

cmsbuild commented Aug 5, 2022

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@rvenditti
Contributor Author

Hi @davidlange6, unfortunately I am not able to forward the browsable version to my CERN site. However, the related sql3 file is here: /eos/user/r/rosma/www/cgi-bin/data/igreport_perf.sql3

@makortel
Contributor

makortel commented Aug 5, 2022

I took a quick look at /afs/cern.ch/work/r/rosma/public/Run356381/igreport_total.res and I see the profile reports MEM_TOTAL, i.e. total memory allocations. MEM_LIVE (which shows the memory usage at a given time) would be more useful. It seems to me that IgProfService does not have dump points for the framework callbacks that would be most useful for a harvesting job. I'll add some and profile myself too.

On the other hand, I'm not sure how useful the IgProf profile is for a harvesting job. We know that the job processes histograms, but the IgProf profile won't tell much about exactly which histograms take a lot of memory (beyond the type). Have you checked the histogram sizes from a DQMIO file? I think that would be a useful number, even if DQMRootSource should merge the histograms on each file open, and e.g. in https://cms-talk.web.cern.ch/t/re-2018-replay-for-pps-pcl-test-dqm-expresmergewrite-memory-too-high/6114/7 the memory problem was clearly in the output.
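
As a starting point, here is a minimal PyROOT sketch (not an official DQM tool) for getting such a number. It only lists the per-type TTrees found in the DQMIO file and uses the uncompressed tree size as a rough proxy for the in-memory footprint of the histograms:

# Rough per-type size summary of a DQMIO file (sketch; sizes are only a proxy).
import sys
import ROOT

fname = sys.argv[1] if len(sys.argv) > 1 else "DQMIO.root"
f = ROOT.TFile.Open(fname)

rows = []
for key in f.GetListOfKeys():
    obj = key.ReadObj()
    if isinstance(obj, ROOT.TTree):
        rows.append((obj.GetName(), obj.GetEntries(), obj.GetTotBytes(), obj.GetZipBytes()))

rows.sort(key=lambda r: r[2], reverse=True)
print("%-15s %12s %15s %15s" % ("tree", "entries", "tot bytes", "zip bytes"))
for name, nentries, tot, zipped in rows:
    print("%-15s %12d %15d %15d" % (name, nentries, tot, zipped))
f.Close()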

@makortel
Contributor

makortel commented Aug 6, 2022

One thing I noticed is that the job is awfully slow under igprof -mp ... cmsRunGlibC. While restarting my memory profile after copying the input files for run 356615 (for run 356381 the files were already gone), I profiled the CPU time consumption. The full profile is here:
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf
Of the 1623 seconds in total, the TProfile merging and GEMDQMHarvester could be something worth taking a look at (even if strictly speaking outside the scope of this issue).

@makortel
Contributor

makortel commented Aug 8, 2022

A MEM_LIVE profile (dumped at postGlobalEndRun) points to GEMDQMHarvester::dqmEndLuminosityBlock() taking 610 MB https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/preEndRun.mem.356615/25 . Nearly all of that appears to be in std::maps in GEMDQMHarvester::createTableWatchingSummary() https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/preEndRun.mem.356615/27.

The maps are initialized in

void GEMDQMHarvester::createTableWatchingSummary() {
  if (bIsStatusChambersInit_)
    return;
  for (const auto &[nIdxLayer, nNumCh] : mapNumChPerChamber_) {
    for (Int_t i = 1; i <= nNumCh; i++) {
      mapStatusChambersSummary_[{nIdxLayer, i}] = std::vector<StatusInfo>();
      mapNumStatusChambersSummary_[{nIdxLayer, i}] = NumStatus();
      for (Int_t j = 1; j <= nNumVFATs_; j++) {
        mapStatusVFATsSummary_[{nIdxLayer, i, j}] = std::vector<StatusInfo>();
        mapNumStatusVFATsSummary_[{nIdxLayer, i, j}] = NumStatus();
      }
    }
  }
  bIsStatusChambersInit_ = true;
}

The profile indicates

  • mapStatusChambersSummary_ and mapNumStatusChambersSummary_ have 160,380 elements
  • mapStatusVFATsSummary_ and mapNumStatusVFATsSummary_ have 3,849,120 elements

I'm afraid the GEMDQMHarvester::drawSummaryHistogram() needs to be redesigned.
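
As a back-of-envelope illustration (assumed overheads, not a measurement) of why those map sizes translate into hundreds of MB:

# Assumed per-entry cost on a 64-bit build: ~48 bytes of std::map node bookkeeping
# plus ~32 bytes for the key tuple and the (empty) vector / NumStatus payload.
node_overhead = 48
key_and_payload = 32
entries = 2 * 3849120 + 2 * 160380   # the two VFAT maps plus the two chamber maps, from the profile
print(entries * (node_overhead + key_and_payload) / 1e6, "MB")   # ~640 MB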

@makortel
Contributor

makortel commented Aug 8, 2022

FYI @cms-sw/gem-dpg-l2

@jshlee
Contributor

jshlee commented Aug 9, 2022

@slowmoyang @quark2 can you take a look at this?

@mmusich
Contributor

mmusich commented Aug 12, 2022

@jshlee @slowmoyang @quark2 any update on this issue?

@quark2
Contributor

quark2 commented Aug 12, 2022

Hi @mmusich,

Sorry for my late reply, I'm working on it, but I was sick, so the task was delayed... I'll make a fix soon.

Best regards,
Byeonghak Ko

@mmusich
Contributor

mmusich commented Aug 15, 2022

Hi @quark2,
there have already been 5 instances of the issue so far (if I count correctly); can you clarify the timeline for the fix?
In case it cannot arrive by today, such that a patch release can be built tomorrow after the ORP meeting (FYI @perrotta @qliphy), can we consider disabling the offending module in production until a proper fix is found?
I think we can do that by commenting out these lines:

from Configuration.Eras.Modifier_run3_GEM_cff import run3_GEM
_run3_GEM_DQMOffline_SecondStepMuonDPG = DQMOffline_SecondStepMuonDPG.copy()
_run3_GEM_DQMOffline_SecondStepMuonDPG += gemClients
run3_GEM.toReplaceWith(DQMOffline_SecondStepMuonDPG, _run3_GEM_DQMOffline_SecondStepMuonDPG)

please clarify.
Thanks,

Marco (ORM)

@quark2
Contributor

quark2 commented Aug 15, 2022

Hi @mmusich,

I made a fix, but I have no idea how to measure the memory consumption with the fix. Could you tell me how?
Once the fix works fine, I'll make a PR asap.

Best regards,
Byeonghak Ko

@mmusich
Contributor

mmusich commented Aug 15, 2022

@quark2 the best way would be to run the IgProf profiler; there are some instructions here: https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof.
Another (quicker) way is to simply monitor the RSS used by cmsRun when executing the harvesting and check that it is substantially lower than before.
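
For the quick check, a sketch of a config fragment using the SimpleMemoryCheck service (parameter names as I remember them; please double-check in FWCore/Services), which prints the resident memory as it grows and a per-module summary at the end of the job:

process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),           # skip the first report(s)
    moduleMemorySummary = cms.untracked.bool(True)  # per-module summary at end of job
)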

@quark2
Contributor

quark2 commented Aug 15, 2022

Thanks, I'll use it!

@quark2
Contributor

quark2 commented Aug 15, 2022

Hi,

I've made a fix and its backport to 12_4_X. I've seen a reduction in the RSS (although it was still 1.9 GB; is that still too large?).

@mmusich
Contributor

mmusich commented Aug 15, 2022

I've seen a reduction in the RSS.

By how much is it reduced?
Matti pointed out above that GEMDQMHarvester::dqmEndLuminosityBlock() is taking 610 MB. Do you have profiles from before and after the change? Is the 1.9 GB due to GEM alone or to the harvesting job overall?

@quark2
Contributor

quark2 commented Aug 15, 2022

The 1.9 GB is for the harvesting job overall. The RSS before the fix was 2.8 GB, so about 900 MB is saved.

@mmusich
Contributor

mmusich commented Aug 15, 2022

The RSS before the fix was 2.8 GB, so about 900 MB is saved.

Thanks, can you also provide the corresponding number for a harvesting job executed without GEMDQMHarvester at all?
This would help set the scale of what is reasonable.

@quark2
Contributor

quark2 commented Aug 15, 2022

I executed it without GEMDQMHarvester, and the maximum RSS was also 1.9 GB. I'm not sure if that is reasonable, but at least I can say that GEMDQMHarvester itself consumes relatively little memory (after the issue is fixed).

@quark2
Contributor

quark2 commented Aug 15, 2022

Btw, does anyone know how to reproduce DQMIO files locally, especially root://cms-xrd-global.cern.ch//store/express/Run2022C/StreamExpress/DQMIO/Express-v1/000/356/381/00000/0114A6C8-C901-4386-94D4-F0386E636CA0.root?

@mmusich
Contributor

mmusich commented Aug 15, 2022

Btw, does anyone know how to reproduce DQMIO files locally, especially root://cms-xrd-global.cern.ch//store/express/Run2022C/StreamExpress/DQMIO/Express-v1/000/356/381/00000/0114A6C8-C901-4386-94D4-F0386E636CA0.root?

you would need to run the express reco from the streamer files of 356381, which I strongly suspect are already gone.

@quark2
Contributor

quark2 commented Aug 15, 2022

I see... I need to look at the geometry part more, and reproducing that file should give a good hint.

@mmusich
Contributor

mmusich commented Aug 15, 2022

I see... I need to look at the geometry part more, and reproducing that file should give a good hint.

There have recently been failures of the same type, e.g. for run 357479.

I think the streamers for that file are still available:

$ eos ls /store/t0streamer/Data/Express/000/357/479 | grep .dat | wc -l
1044

you would need to generate the configuration via:

python RunExpressProcessing.py --scenario ppEra_Run3 --lfn /store/t0streamer/Data/Express/000/357/479/run357479_ls1012_streamExpress_StorageManager.dat --global-tag 101X_dataRun2_Express_v7 --fevt --dqmio 

manually modifying the output configuration

process.maxEvents = cms.untracked.PSet(
    input = cms.untracked.int32(-1)
)
process.source = cms.Source("NewEventStreamFileReader",
    fileNames = cms.untracked.vstring('<put here all your input streamer files>')
)
process.options = cms.untracked.PSet(
    numberOfStreams = cms.untracked.uint32(0),
    numberOfThreads = cms.untracked.uint32(8)
)

to get the correct source.
Alternatively, I think you can try to use a cmsDriver command mimicking what Tier-0 does for express, starting with RAW data as input, but that would require more time for me to cook up.
Perhaps @cms-sw/dqm-l2 can help here as well.
HTH
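
For the '<put here all your input streamer files>' placeholder, a possible sketch to build the list (this assumes the streamer files are readable through the /eos/cms fuse mount at CERN; adjust the path or the access method if not):

import glob

streamer_dir = '/eos/cms/store/t0streamer/Data/Express/000/357/479'  # assumed fuse-mounted path
streamers = sorted(glob.glob(streamer_dir + '/*.dat'))
process.source.fileNames = cms.untracked.vstring(['file:' + f for f in streamers])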

@mmusich
Contributor

mmusich commented Aug 16, 2022

@sarafiorendi
Contributor

@makortel @mmusich
sorry, my bad. I updated the results with the PR integrated (same link as above); they show (as far as I understand) some effective reduction.

@mmusich
Contributor

mmusich commented Aug 16, 2022

sorry, my bad. I updated the results with the PR integrated (same link as above); they show (as far as I understand) some effective reduction.

Thanks. Still, I am not sure I understand why there is no contribution from GEMDQMHarvester in MEM_LIVE.

@makortel
Contributor

Are we missing something in the recipe? We followed https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof

@mmusich @sarafiorendi Was this the part of the recipe you followed for the MEM_LIVE profile?

igprof -d -t cmsRunGlibC -mp cmsRunGlibC a.py >& a.log

igprof-analyse --sqlite -v --demangle --gdb -r MEM_LIVE IgProf.1.gz > ig.1.txt

@sarafiorendi
Contributor

these are the commands I ran:

igprof -d -t cmsRunGlibC -mp cmsRunGlibC dump_cfg.py >& PR_30files.log
igprof-analyse --sqlite -d -v -g -r MEM_LIVE igprof.cmsRunGlibC.5556.1660650465.673794.gz | sqlite3 igreport_live.sql3

as from
https://igprof.org/analysis.html#:~:text=Memory%20profiling%20reports

so maybe the second one is not correct (?)

@makortel
Contributor

makortel commented Aug 16, 2022

Thanks. I'm a bit confused about which dump exactly igprof.cmsRunGlibC.5556.1660650465.673794.gz corresponds to, since the Validation.Performance.IgProfInfo.customise

process.IgProfService = cms.Service("IgProfService",
    reportFirstEvent = cms.untracked.int32(1),  # dump the first event for baseline studies
    reportEventInterval = cms.untracked.int32(((process.maxEvents.input.value() - 1) // 2)),  # dump in the middle of the run
    reportToFileAtPostEvent = cms.untracked.string("| gzip -c > IgProf.%I.gz")
)

causes a profile to be dumped after some events with the event entry number in the file name. Maybe it's the "end of process" dump, given that the recipe does not specify an output file (-o). In that case the MEM_LIVE would report memory leaks (which is not particularly relevant for this problem).

To see the memory usage of GEMDQMHarvester as in #38976 (comment), you can add a profile dump point at the globalEndRun transition with

process.IgProfService.reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz")

(where %R gets substituted with the run number) and convert that into the sqlite3 file.

@mmusich
Contributor

mmusich commented May 19, 2023

Just for the record, this keeps happening in 2023 as well, despite an increase of the memory allowance for harvesting jobs at Tier-0.
A recent paused job occurred in Express_Run367696_StreamExpress (logs available here).

@makortel
Contributor

Is there a recipe to reproduce? (the link wasn't clear for me)

@gpetruc
Contributor

gpetruc commented May 20, 2023

Hi,

tar xzvf /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023C/MemoryExpress/7c897736-9d50-4d3a-8e7b-da1e3c8d796b-Periodic-Harvest-1-3-logArchive.tar.gz
cd job/WMTaskSpace/cmsRun1
cmsRun PSet.py

The job has 598 files to process. When running, the memory slowly increases as the files are opened, and then increases much faster at the end. It starts off at ~1.4 GB and is at 1.7 GB after 500 files, but then jumps up. Logging every few seconds the number of opened files and the memory usage in kB (a sketch of such a polling script is given at the end of this comment), I got

files 572 mem 1718344
files 575 mem 1718344
files 577 mem 1718344
files 580 mem 1718344
files 580 mem 1718344
files 582 mem 1833032
files 585 mem 1833032
files 587 mem 1833032
files 589 mem 1833032
files 592 mem 1833032
files 594 mem 1833032
files 597 mem 1833032
files 598 mem 1968200
files 598 mem 2586696
files 598 mem 2586696
...

And then there are lots of LogErrors of the form

%MSG-e MergeFailure:  source 20-May-2023 12:58:16 CEST PostBeginProcessBlock
Found histograms with different axis limits or different labels 'ROCs hits multiplicity per event vs LS' not merged.
%MSG

and the memory usage grows further, at least up to 2859080 kB (the job is still running...)

I'm also testing igprof with the command lines and the reportToFileAtPostEndRun suggested above.
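
For reference, a minimal sketch of the kind of polling script used for the log above (it assumes a Linux /proc filesystem and takes the PID of the running cmsRun as argument):

# Poll the number of open .root input files and the resident memory (VmRSS, in kB).
import os, sys, time

pid = int(sys.argv[1])
while os.path.exists("/proc/%d" % pid):
    fddir = "/proc/%d/fd" % pid
    nfiles = sum(1 for fd in os.listdir(fddir)
                 if os.path.realpath(os.path.join(fddir, fd)).endswith(".root"))
    with open("/proc/%d/status" % pid) as status:
        rss_kb = next(int(line.split()[1]) for line in status if line.startswith("VmRSS"))
    print("files", nfiles, "mem", rss_kb)
    time.sleep(5)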

@gpetruc
Contributor

gpetruc commented May 20, 2023

If the job is left to run, the memory usage climbs all the way to 5 GB.

I didn't manage to get anything out of igprof, despite trying to set in process.IgProfService

     reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz"),
     reportToFileAtPostOpenFile = cms.untracked.string("| gzip -c > IgProf.fileOpen.%F.gz"),
     reportToFileAtPostCloseFile = cms.untracked.string("| gzip -c > IgProf.fileClose.%C.gz")

I don't get any of those dumps.
I also tried to get some dumps from jemalloc, but with little success.

I got some dumps from valgrind massif, which seem to imply that it's mostly histograms and associated ROOT objects (TLists, THashLists, ...); maybe, as a consequence of the failed merging, the job keeps a lot of duplicate histograms?

If anyone has more up-to-date instructions to profile the heap I can try them.

For those interested in reproducing the problem more quickly: limiting the job to 5-10 files makes it run in a few minutes while still exhibiting a large memory growth at the end (see the sketch below).
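
A sketch of the kind of modification I mean, appended at the end of the unpacked PSet.py (this assumes the harvesting source exposes the usual fileNames vstring):

# Keep only the first 10 input files to get a faster reproducer.
process.source.fileNames = cms.untracked.vstring(list(process.source.fileNames)[:10])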

@makortel
Contributor

The IgProf memory profiling setup has, unfortunately, become broken.

@gartung Could you remind us of the recipe to use jemalloc's profiling that works in 13_0_X (i.e. without #40899)?

I'll try out VTune.

In the meantime, could @cms-sw/dqm-l2 check that the set of histograms in the merged files looks reasonable?

@davidlange6
Contributor

From a log it seems the input files are all held open until the end. Is that expected?

@makortel
Contributor

makortel commented May 22, 2023

From a log it seems the input files are all held open until the end. Is that expected?

That is how DQMRootSource has been coded

DQMRootSource::~DQMRootSource() {
  for (auto& file : m_openFiles) {
    if (file.m_file && file.m_file->IsOpen()) {
      logFileAction("Closed file", "");
    }
  }
}

(since #28622, IIUC; another PR related to DQMRootSource memory usage #30889)

@gartung
Member

gartung commented May 22, 2023

The IgProf memory profiling setup has, unfortunately, become broken.

@gartung Could you remind us the recipe to use jemalloc's profiling that works in 13_0_X (i.e. without #40899)?

I'll try out VTune.

In the meantime, could @cms-sw/dqm-l2 check that the set of histograms in the merged files looks reasonable?

Besides setting up the version of jemalloc with profiling enabled:
scram setup jemalloc-prof
scram b ToolUpdated
you also need to set an environment variable:
MALLOC_CONF=prof_leak:true,lg_prof_sample:10,prof_final:true cmsRun config.py

@davidlange6
Contributor

davidlange6 commented May 22, 2023 via email

@makortel
Contributor

VTune wasn't very useful, but I managed to get the following total sizes by type:

  • TH1F: 134.2 MB
  • TH2F: 369.1 MB
  • TProfile2D: 234.9 MB

In addition, the TBuffer constructor (via the TBufferFile constructor) accounts for 377.5 MB. These don't sound outrageously high to me (even if the TH2F and TProfile2D numbers are on the high side).
