CTPPS ME merging problems in Harvesting step #38969

Open
rvenditti opened this issue Aug 4, 2022 · 12 comments

rvenditti commented Aug 4, 2022

As a follow-up to the Express job killed at T0 for memory issues at the harvesting step in run 356381 (link), we found that the log file shows a problem merging an ME:
"%MSG-e MergeFailure: source 04-Aug-2022 15:58:00 CEST PostBeginProcessBlock
Found histograms with different axis limits or different labels 'ROCs hits multiplicity per event vs LS' not merged"
The related ME seems to belong to CTPPS:

hp2HitsMultROC_LS[indexP] = ibooker.bookProfile2D("ROCs hits multiplicity per event vs LS",

Could the CTPPS DQM experts please have a look?

cmsbuild commented Aug 4, 2022

A new Issue was created by @rvenditti .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented Aug 4, 2022

assign dqm

cmsbuild commented Aug 4, 2022

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented Aug 4, 2022

FYI @cms-sw/ctpps-dpg-l2

vavati commented Aug 5, 2022

I don't understand what the issue could be: the equivalent plots in DQMonline seem OK. These plots have not been modified for a long time, so I guess the problem is elsewhere, not in the plot itself.

mmusich commented Sep 2, 2024

@vavati, while investigating recent failures in Tier0 prompt processing, I see that this issue persists as of today (in CMSSW_14_0_14).

For the record the message is:

%MSG-e MergeFailure:  source 02-Sep-2024 16:23:44 CEST PostBeginProcessBlock
Found histograms with different axis limits or different labels 'ROCs hits multiplicity per event vs LS' not merged.
%MSG

and the underlying ME is defined as:

hp2HitsMultROC_LS[indexP] = ibooker.bookProfile2D("ROCs hits multiplicity per event vs LS",
                                                  rpTitle + ";LumiSection;Plane#___ROC#",
                                                  1000,
                                                  0.,
                                                  1000.,
                                                  NplaneMAX * NROCsMAX,
                                                  0.,
                                                  double(NplaneMAX * NROCsMAX),
                                                  0.,
                                                  rpixValues::ROCSizeInX * rpixValues::ROCSizeInY,
                                                  "");
hp2HitsMultROC_LS[indexP]->getTProfile2D()->SetOption("colz");
hp2HitsMultROC_LS[indexP]->getTProfile2D()->SetMinimum(1.0e-10);
hp2HitsMultROC_LS[indexP]->getTProfile2D()->SetCanExtend(TProfile2D::kXaxis);
TAxis *yahp2 = hp2HitsMultROC_LS[indexP]->getTProfile2D()->GetYaxis();
for (int p = 0; p < NplaneMAX; p++) {
  sprintf(s, "plane%d_0", p);
  yahp2->SetBinLabel(p * NplaneMAX + 1, s);
  for (int r = 1; r < NROCsMAX; r++) {
    sprintf(s, " %d_%d", p, r);
    yahp2->SetBinLabel(p * NplaneMAX + r + 1, s);
  }
}
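
To make the Y-axis labelling easier to follow, here is a minimal stand-alone sketch (plain C++, no ROOT or CMSSW) of the labels that loop produces. The function name `makeRocLabels` and the values of `kNplaneMAX` and `kNROCsMAX` (both taken as 6) are assumptions for illustration only. The sketch strides by the number of ROCs per plane, whereas the quoted code strides by `NplaneMAX`; the two give the same bins only if the two constants are equal, which is assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Assumed values, for illustration only; the real constants are defined in the CTPPS pixel DQM code.
constexpr int kNplaneMAX = 6;
constexpr int kNROCsMAX = 6;

// Hypothetical helper: returns the label for each Y bin (element 0 corresponds to bin 1).
std::vector<std::string> makeRocLabels(int nPlanes, int nRocs) {
  assert(nPlanes > 0 && nRocs > 0);
  std::vector<std::string> labels(static_cast<std::size_t>(nPlanes) * nRocs);
  char s[32];
  for (int p = 0; p < nPlanes; ++p) {
    // First ROC of each plane carries the plane name.
    std::snprintf(s, sizeof(s), "plane%d_0", p);
    labels[p * nRocs] = s;
    // Remaining ROCs of the plane get a shorter " plane_roc" label.
    for (int r = 1; r < nRocs; ++r) {
      std::snprintf(s, sizeof(s), " %d_%d", p, r);
      labels[p * nRocs + r] = s;
    }
  }
  return labels;
}
```

Calling `makeRocLabels(kNplaneMAX, kNROCsMAX)` gives one label per Y bin, so every job books the same Y labels as long as the constants do not change between releases.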

cmsbuild commented Sep 2, 2024

cms-bot internal usage

vavati commented Sep 2, 2024

thanks, forwarded to experts....

grzanka commented Sep 3, 2024

Do you have any suggestions on how to reproduce the problem?
With my grzankal account (in the zh group) I cannot access the files mentioned in https://cms-talk.web.cern.ch/t/high-memory-usage-in-harvesting-job-for-workflow-express-run356381-streamexpress/13419/16

The profile2D binning is constant, at first glance it looks like some memory leak.

mmusich commented Sep 3, 2024

@grzanka

Do you have any suggestion how to reproduce that problem ?

you should be able to access this tarball:

eos cp /eos/cms/store/logs/prod/recent/Express/Express_Run385168_StreamExpress/ExpressMergewrite_StreamExpress_DQMIOPeriodicDQMHarvestMerged/vocms0314.cern.ch-3498656-3-log.tar.gz

(at least that's what I used).

The profile2D binning is constant, at first glance it looks like some memory leak.

but the ranges are not. I see:

 hp2HitsMultROC_LS[indexP]->getTProfile2D()->SetCanExtend(TProfile2D::kXaxis); 

I think this can potentially create histograms with mismatched axis limits (I am not sure that is the origin of the problem: other instances in cmssw do the same and apparently do not cause issues).
As for reproducing the error: the message occurs during the harvesting step, but the input files are produced in the earlier step, so you would first have to create DQMIO files from different LSs and then try to merge them in the harvesting step.
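
As an illustration of the kind of consistency condition the error message refers to ("different axis limits or different labels"), here is a minimal stand-alone sketch in plain C++. It is not the actual DQM merging code: `AxisSpec` and `sameAxis` are hypothetical names, and the real check may compare more than this.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified description of one histogram axis.
struct AxisSpec {
  int nbins;
  double xmin;
  double xmax;
  std::vector<std::string> labels;  // empty if the axis has no bin labels
};

// True only if both axes agree on the number of bins, the limits, and the bin labels.
bool sameAxis(const AxisSpec& a, const AxisSpec& b) {
  assert(a.nbins > 0 && b.nbins > 0);
  return a.nbins == b.nbins && a.xmin == b.xmin && a.xmax == b.xmax && a.labels == b.labels;
}
```

With `AxisSpec booked{1000, 0., 1000., {}}` and `AxisSpec extended{1000, 0., 2000., {}}`, `sameAxis(booked, extended)` returns false even though the number of bins is identical, which is the situation an extended X axis would produce.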

grzanka commented Sep 3, 2024

@grzanka

Do you have any suggestion how to reproduce that problem ?

you should be able to access this tarball:

eos cp /eos/cms/store/logs/prod/recent/Express/Express_Run385168_StreamExpress/ExpressMergewrite_StreamExpress_DQMIOPeriodicDQMHarvestMerged/vocms0314.cern.ch-3498656-3-log.tar.gz

(at least that's what I used).

Thanks! This works!

The profile2D binning is constant, at first glance it looks like some memory leak.

but the ranges are not. I see:

 hp2HitsMultROC_LS[indexP]->getTProfile2D()->SetCanExtend(TProfile2D::kXaxis); 

I think this is potentially creating histograms with mismatched axes limits (not sure if that's the origin of the problem, there are other instances in cmssw that do the same and apparently do not create issues). About reproducing the error, the message occurs during the harvesting step, but the input files are produced in the earlier step, so you would have to create first the DQMIO files from different LS-es and then try to merge them in the harvesting step.

The Profile2D X axis is limited by default to 1000 LS. For a longer run, getTProfile2D()->SetCanExtend(TProfile2D::kXaxis); is used to extend the axis past 1000 LS (by doubling the axis range, if I read the ROOT documentation properly).

It could be that a merge of extended and non-extended histograms happens somewhere. To check that, I would need to reproduce the problem with a long enough run.
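
To make the doubling concrete, here is a minimal stand-alone model in plain C++. It does not call ROOT; `AxisRange` and `extendToInclude` are hypothetical names, and the doubling rule is an assumption based on my reading of the ROOT documentation rather than a copy of its implementation.

```cpp
#include <cassert>

// Hypothetical stand-in for the limits of an extendable axis.
struct AxisRange {
  double xmin;
  double xmax;
};

// Grow the upper edge until x lies inside [xmin, xmax), doubling the covered
// range at each step. Only upward extension is modelled (assumes x >= xmin).
AxisRange extendToInclude(AxisRange axis, double x) {
  assert(axis.xmax > axis.xmin && x >= axis.xmin);
  double range = axis.xmax - axis.xmin;
  while (x >= axis.xmax) {
    axis.xmax += range;  // new range is twice the old one
    range *= 2.0;
  }
  return axis;
}
```

Under this model a job that only sees LS below 1000 keeps the booked limits [0, 1000), a job that fills LS 1500 ends up with [0, 2000), and one that fills LS 2500 ends up with [0, 4000). Files produced by such jobs would then carry different X-axis limits for the same ME.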

By the way, what is the recommended method of extending histograms in MEs?
The comment here says it is somewhat deprecated:

// We should avoid extending histograms in general, and if the behaviour

mmusich commented Sep 3, 2024

By the way - what is the recommended method of extending histograms in MEs ?
The comment here says its somehow deprecated....

I am not sure. I'll let @cms-sw/dqm-l2 comment ;)
