Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues in concurrent Lumi/IOV safety in DQMOneEDAnalyzer<LuminosityBlockCache<...>> #37163

Open
15 of 29 tasks
makortel opened this issue Mar 8, 2022 · 10 comments
Open
15 of 29 tasks

Comments

@makortel
Copy link
Contributor

makortel commented Mar 8, 2022

This issue follows up my comment in #25090 (comment). Below is a list of DQMOneEDAnalyzer modules that make use of edm::LuminosityCache

  • DQM/BeamMonitor/plugins/AlcaBeamMonitor.h fixed in Fix LumiCache in AlcaBeamMonitor and OnlineBeamMonitor #41700
  • DQM/BeamMonitor/plugins/OnlineBeamMonitor.h fixed in Fix LumiCache in AlcaBeamMonitor and OnlineBeamMonitor #41700
  • DQM/CTPPS/plugins/CTPPSCommonDQMSource.cc
  • DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc
  • DQM/CTPPS/plugins/DiamondSampicDQMSource.cc
  • DQM/CTPPS/plugins/TotemTimingDQMSource.cc
  • DQM/DTMonitorModule/interface/DTDataIntegrityTask.h fixed in Move events/LS counter to LuminosityBlockCache in DTDataIntegrityTask #41874
  • DQM/EcalMonitorTasks/interface/EcalDQMonitorTask.h, DQM/EcalMonitorTasks/plugins/EcalDQMonitorTask.cc
  • DQM/EcalPreshowerMonitorModule/interface/ESIntegrityTask.h
  • DQM/HcalCommon/interface/DQTask.h is a base class, deriving classes are
    • DQM/HcalTasks/interface/HFRaddamTask.h
    • DQM/HcalTasks/interface/LEDTask.h
    • DQM/HcalTasks/interface/LaserTask.h
    • DQM/HcalTasks/interface/NoCQTask.h
    • DQM/HcalTasks/interface/PedestalTask.h
    • DQM/HcalTasks/interface/QIE10Task.h
    • DQM/HcalTasks/interface/QIE11Task.h
    • DQM/HcalTasks/interface/RawTask.h
    • DQM/HcalTasks/interface/RecHitTask.h
    • DQM/HcalTasks/interface/TPTask.h
    • DQM/HcalTasks/interface/UMNioTask.h
    • DQM/HcalTasks/plugins/HcalGPUComparisonTask.cc
  • DQM/L1TMonitor/interface/L1TdeRCT.h
  • DQM/Physics/src/QcdLowPtDQM.h
  • DQM/SiPixelMonitorDigi/interface/SiPixelDigiSource.h
  • DQM/SiStripMonitorDigi/interface/SiStripMonitorDigi.h
  • DQM/SiStripMonitorHardware/src/SiStripFEDMonitor.cc
  • DQM/TrigXMonitor/interface/L1Scalers.h
  • DQMOffline/L1Trigger/interface/L1TSync_Offline.h

On quick look some of them do not seem to be safe wrt concurrent lumis or IOVs. In the comments I'm going to write some notes about them.

@makortel
Copy link
Contributor Author

makortel commented Mar 8, 2022

assign core,dqm

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 8, 2022

New categories assigned: core,dqm

@jfernan2,@ahmad3213,@rvenditti,@Dr15Jones,@smuzaffar,@emanueleusai,@makortel,@pbo0,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 8, 2022

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

makortel commented Mar 8, 2022

In DQM/BeamMonitor/plugins/AlcaBeamMonitor.h the problem is that per-LuminosityBlock data is stored in class member data instead of the LuminosityBlockCache (which actually is empty!). For example, the globalBeginLuminosityBlock()

std::shared_ptr<alcabeammonitor::NoCache> AlcaBeamMonitor::globalBeginLuminosityBlock(const LuminosityBlock& iLumi,
const EventSetup& iSetup) const {
// Always create a beamspot group for each lumi weather we have results or not! Each Beamspot will be of unknown type!
vertices_.clear();
beamSpotsMap_.clear();

clears an std::vector<reco::VertexCollection> vertices_ and BeamSpotContainer beamSpotsMap_ (which is std::map<std::string, reco::BeamSpot> ).

In short, all lumi-dependent data should be moved into the LuminosityBlockCache.


The vertices_ is filled in analyze()

Handle<VertexCollection> PVCollection;
if (iEvent.getByToken(primaryVertexLabel_, PVCollection)) {
vertices_.push_back(*PVCollection.product());
}

and used in globalEndLuminosityBlock()
for (vector<VertexCollection>::iterator itPV = vertices_.begin(); itPV != vertices_.end(); itPV++) {

The problem with this code is perhaps easiest demonstrated with e.g. a following sequence

beginLumi 1
Event 1:1
Event 1:2
beginLumi 2 # resets the `vertices_`
Event 2:1 
Event 2:2
Event 1:3
endLumi 1  # vertices_ contains now data from Lumis 1 and 2
beginLumi 3
...

The beamspotsMap_ is filled in globalBeginLuminosityBlock(), analyze(), and in globalEndLuminosityBlock(), and used in analyze() and globalEndLuminosityBlock() such that the value filled in globalBeginLuminosityBlock() is used in analyze(). The following sequence would lead to incorrect results

beginLumi 1
Event 1:1
Event 1:2
beginLumi 2, IOV of BeamSpotObjectsRcd changes
Event 1:3 # beamspotsMap_["DB"] has BeamSpot from Lumi 2
Event 2:1
endLumi 1 # beamspotsMap_["SC"] contains BeamSpot from Lumi 2; theBeamFitter and thePVFitter contain data from both Lumis(?)

Not related, but what is the aim of this fragment?

Handle<BeamSpot> recoBeamSpotHandle;
try {
iEvent.getByToken(scalerLabel_, recoBeamSpotHandle);
} catch (cms::Exception& exception) {
LogInfo("AlcaBeamMonitor") << exception.what();
return;
}

Event::getByToken() does not throw any exception that would make sense for user code to catch. If the aim is to check if the input product exists, that could be done e.g. along

if (Handle<BeamSpot> recoBeamSpotHandle = iEvent.getHandle(scalerLabel_)) {
  // product exists
} else {
  // product doesn't exist
}

@makortel
Copy link
Contributor Author

makortel commented Mar 8, 2022

In DQM/BeamMonitor/plugins/OnlineBeamMonitor.h the issue is only with beamSpotsMap_, and is similar to the issue with the similarly named member variable in #37163 (comment).

In addition, about the other mutable member variables

mutable int numberOfProcessedLumis_;
mutable std::vector<int> processedLumis_;

the mutable could be removed from processedLumis_ by filling it in globalEndLuminosityBlock(), and numberOfProcessedLumis_ appears to be unused.

@gennai
Copy link
Contributor

gennai commented May 15, 2023

Dear @makortel @francescobrivio @mmusich , I am working now on this issue (sorry to be taken so long).
I have followed the suggestion by @makortel about the lumiCache and I have prepared a new branch with the updated code:
https://github.com/MilanoBicocca-pix/cmssw/tree/lumiCacheForBeamSpot
I hope to test it asap. I will then also fix the other things discussed in the issue

@makortel
Copy link
Contributor Author

I should probably resume reviewing the modules listed in the issue description...

@makortel
Copy link
Contributor Author

makortel commented Jun 5, 2023

The DQM/CTPPS/plugins/CTPPSDiamondDQMSource.cc stores some counters in member data in PotPlots

unsigned int HitCounter, MHCounter, LeadingOnlyCounter, TrailingOnlyCounter, CompleteCounter;

and ChannelPlots
unsigned int HitCounter, MHCounter, LeadingOnlyCounter, TrailingOnlyCounter, CompleteCounter;

nested structs. These counters are incremented in the analyze()
if (digi.leadingEdge() != 0 || digi.trailingEdge() != 0) {
++potPlots_[detId_pot].HitCounter;
if (digi.leadingEdge() != 0) {
potPlots_[detId_pot].leadingEdgeCumulative_all->Fill(HPTDC_BIN_WIDTH_NS * digi.leadingEdge());
}
if (digi.leadingEdge() != 0 && digi.trailingEdge() == 0) {
++potPlots_[detId_pot].LeadingOnlyCounter;
potPlots_[detId_pot].leadingEdgeCumulative_le->Fill(HPTDC_BIN_WIDTH_NS * digi.leadingEdge());
}
if (digi.leadingEdge() == 0 && digi.trailingEdge() != 0) {
++potPlots_[detId_pot].TrailingOnlyCounter;
potPlots_[detId_pot].trailingEdgeCumulative_te->Fill(HPTDC_BIN_WIDTH_NS * digi.trailingEdge());
}
if (digi.leadingEdge() != 0 && digi.trailingEdge() != 0) {
++potPlots_[detId_pot].CompleteCounter;
// potPlots_[detId_pot].leadingTrailingCorrelationPot->Fill(HPTDC_BIN_WIDTH_NS * digi.leadingEdge(),
// HPTDC_BIN_WIDTH_NS * digi.trailingEdge());
}

if (digi.leadingEdge() != 0 || digi.trailingEdge() != 0) {
++channelPlots_[detId].HitCounter;
if (digi.trailingEdge() == 0) {
++channelPlots_[detId].LeadingOnlyCounter;
channelPlots_[detId].leadingEdgeCumulative_le->Fill(HPTDC_BIN_WIDTH_NS * digi.leadingEdge());
} else if (digi.leadingEdge() == 0) {
++channelPlots_[detId].TrailingOnlyCounter;
channelPlots_[detId].trailingEdgeCumulative_te->Fill(HPTDC_BIN_WIDTH_NS * digi.trailingEdge());
} else {
++channelPlots_[detId].CompleteCounter;
// channelPlots_[detId].LeadingTrailingCorrelationPerChannel->Fill(HPTDC_BIN_WIDTH_NS * digi.leadingEdge(),
// HPTDC_BIN_WIDTH_NS * digi.trailingEdge());
}
}

and are then used in globalEndLuminosityBlock() to fill some histograms
double HundredOverHitCounter = .0;
if (plot.second.HitCounter != 0)
HundredOverHitCounter = 100. / plot.second.HitCounter;
plot.second.HPTDCErrorFlags->setBinContent(16, HundredOverHitCounter * plot.second.MHCounter);
plot.second.leadingWithoutTrailing->setBinContent(1, HundredOverHitCounter * plot.second.LeadingOnlyCounter);
plot.second.leadingWithoutTrailing->setBinContent(2, HundredOverHitCounter * plot.second.TrailingOnlyCounter);
plot.second.leadingWithoutTrailing->setBinContent(3, HundredOverHitCounter * plot.second.CompleteCounter);

double HundredOverHitCounterPot = 0.;
if (plot.second.HitCounter != 0)
HundredOverHitCounterPot = 100. / plot.second.HitCounter;
plot.second.leadingWithoutTrailingCumulativePot->setBinContent(
1, HundredOverHitCounterPot * plot.second.LeadingOnlyCounter);
plot.second.leadingWithoutTrailingCumulativePot->setBinContent(
2, HundredOverHitCounterPot * plot.second.TrailingOnlyCounter);
plot.second.leadingWithoutTrailingCumulativePot->setBinContent(
3, HundredOverHitCounterPot * plot.second.CompleteCounter);

In the presence of concurrent lumis these counters may have counts from other lumis than the one the globalEndLuminosityBlock() is called for. Assuming this is important, the counters should be moved to the LuminosityBlockCache. Digging in history it looks like these counters were missed in #25188.

@makortel
Copy link
Contributor Author

makortel commented Jun 5, 2023

DQM/DTMonitorModule/interface/DTDataIntegrityTask.h had an issue that I felt was easier to just fix, see #41874

@makortel
Copy link
Contributor Author

makortel commented Jun 6, 2023

The DQM/HcalTasks/interface/RecHitTask.h class has

bool _unknownIdsPresent;

which is set to false in bookHistograms()
_unknownIdsPresent = false;

and
/* virtual */ void RecHitTask::_resetMonitors(hcaldqm::UpdateFreq uf) {
switch (uf) {
case hcaldqm::f1LS:
_unknownIdsPresent = false;

(AFAICT both are effectively at beginRun)

The variable is set to true in _process() (i.e. in event processing) in

if (rawid == 0) {
meUnknownIds1LS->Fill(1);
_unknownIdsPresent = true;
continue;
}

and
if (rawid == 0) {
meUnknownIds1LS->Fill(1);
_unknownIdsPresent = true;
continue;
}

It is then used in globalEndLuminosityBlock()

if (_unknownIdsPresent)
_vflags[fUnknownIds]._state = flag::fBAD;
else
_vflags[fUnknownIds]._state = flag::fGOOD;

With concurrent lumis it can happen that for a lumi 1 the condition to set _unknownIdsPresent=true does not happen, but in the following lumi 2 it happens, and the _unknownIdsPresent is set to true before the globalEndLuminosityBlock() is called for lumi 1, leading to flag::fBAD being reported for lumi 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants