Fatal Exception in Prompt Reco of Run 367232, dataset JetMET0 #41645

Open
mmusich opened this issue May 12, 2023 · 49 comments
Comments
@mmusich
Contributor

mmusich commented May 12, 2023

Dear all,
there is one job failing Prompt Reco for run 367232, dataset JetMET0 with a Fatal Exception, as described in https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996

The crash seems to originate from the module L1TObjectsTiming:

----- Begin Fatal Exception 12-May-2023 03:17:41 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 367232 lumi: 190 event: 378449946 stream: 6
   [1] Running path 'dqmoffline_1_step'
   [2] Calling method for module L1TObjectsTiming/'l1tObjectsTiming'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 9) >= this->size() (which is 5)
----- End Fatal Exception -------------------------------------------------

The exception is reproducible on a lxplus8 node under CMSSW_13_0_5_patch2 (el8_amd64_gcc11).
Full logs and PSet.py can be found at https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run367232_JetMET0/Reco/vocms014.cern.ch-415905-3-log.tar.gz
With this modified PSet.py file the crash occurs immediately:

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.options.numberOfThreads = 1
    process.source.skipEvents=cms.untracked.uint32(2683)
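Presumably (the comment does not say so explicitly) this is then run in the standard way, from the directory containing the tarball's PSet.pkl:

cmsRun PSet.py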

It should be noted that the crash is preceded by these warnings (perhaps related):

%MSG-w L1TStage2uGTTiming:   L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 12-May-2023 08:33:34 CEST  Run: 367232 Stream: 0
Algo "L1_SingleJet60er2p5" not found in the trigger menu L1Menu_Collisions2023_v1_0_0. Could not retrieve algo bit number.
%MSG
%MSG-w L1TStage2uGTTiming:   L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 12-May-2023 08:33:34 CEST  Run: 367232 Stream: 0
Algo "L1_SingleJet60_FWD3p0" not found in the trigger menu L1Menu_Collisions2023_v1_0_0. Could not retrieve algo bit number.
%MSG
@mmusich
Contributor Author

mmusich commented May 12, 2023

assign dqm, l1

@cmsbuild
Contributor

New categories assigned: dqm,l1

@tjavaid,@epalencia,@micsucmed,@nothingface0,@rvenditti,@emanueleusai,@syuvivida,@aloeliger,@cecilecaillol,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @mmusich Marco Musich.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented May 12, 2023

With a bit of offline diagnosis, the crash occurs here:

muons_eta_phi.at(index)->Fill(Muon->eta(), Muon->phi());

index equals 9, but muons_eta_phi.size() is 5.

@mmusich
Contributor Author

mmusich commented May 12, 2023

This apparently happens because in that event (Run 367232, Event 378449946, LumiSection 190) MuonBxCollection->getFirstBX() is -10 and MuonBxCollection->getLastBX() is 11, while in other events it seems to be constrained between -2 and 2.

@mmusich
Contributor Author

mmusich commented May 13, 2023

There is now a second Prompt Reco paused job with the same symptoms

@mmusich
Contributor Author

mmusich commented May 13, 2023

urgent

@mmusich
Contributor Author

mmusich commented May 13, 2023

There is now a second Prompt Reco paused job with the same symptoms

Also the second crash is reproducible under CMSSW_13_0_5_patch1 on lxplus8 with the following PSet.py

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.options.numberOfThreads = 1
    process.source.skipEvents=cms.untracked.uint32(680)

(taking the rest of the configuration here)

The exception message is (note that 18446744073709551605 is the 64-bit unsigned wrap-around of -11, i.e. this time the computed index was negative):

----- Begin Fatal Exception 13-May-2023 18:36:50 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 367315 lumi: 40 event: 94344670 stream: 0
   [1] Running path 'dqmoffline_1_step'
   [2] Calling method for module L1TObjectsTiming/'l1tObjectsTiming'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 18446744073709551605) >= this->size() (which is 5)
----- End Fatal Exception -------------------------------------------------

@mmusich
Contributor Author

mmusich commented May 14, 2023

Technically this change circumvents the crashes, though it would be good if the L1T / DQM experts could find the root cause.

diff --git a/DQM/L1TMonitor/interface/L1TObjectsTiming.h b/DQM/L1TMonitor/interface/L1TObjectsTiming.h
index d39f6a4fe96..42b3ff4616e 100644
--- a/DQM/L1TMonitor/interface/L1TObjectsTiming.h
+++ b/DQM/L1TMonitor/interface/L1TObjectsTiming.h
@@ -38,6 +38,8 @@ protected:
   void dqmBeginRun(const edm::Run&, const edm::EventSetup&) override;
   void bookHistograms(DQMStore::IBooker&, const edm::Run&, const edm::EventSetup&) override;
   void analyze(const edm::Event&, const edm::EventSetup&) override;
+  template <typename T>
+  bool checkBXCollection(const T*&);
 
 private:
   edm::EDGetTokenT<l1t::MuonBxCollection> ugmtMuonToken_;
diff --git a/DQM/L1TMonitor/src/L1TObjectsTiming.cc b/DQM/L1TMonitor/src/L1TObjectsTiming.cc
index c8edfe65685..bb6cfefa67f 100644
--- a/DQM/L1TMonitor/src/L1TObjectsTiming.cc
+++ b/DQM/L1TMonitor/src/L1TObjectsTiming.cc
@@ -1,4 +1,5 @@
 #include "DQM/L1TMonitor/interface/L1TObjectsTiming.h"
+#include "FWCore/Utilities/interface/TypeDemangler.h"
 
 L1TObjectsTiming::L1TObjectsTiming(const edm::ParameterSet& ps)
     : ugmtMuonToken_(consumes<l1t::MuonBxCollection>(ps.getParameter<edm::InputTag>("muonProducer"))),
@@ -783,25 +784,44 @@ void L1TObjectsTiming::bookHistograms(DQMStore::IBooker& ibooker, const edm::Run
   }
 }
 
+template <typename T>
+bool L1TObjectsTiming::checkBXCollection(const T*& BXCollection) {
+  // check that all the BX collections do not exceed the expected range
+  if (BXCollection->getLastBX() - BXCollection->getFirstBX() > int(bxrange_ - 1)) {
+    edm::LogError("L1TObjectsTiming") << " Unexpected bunch crossing range in "
+                                      << edm::typeDemangle(typeid(BXCollection).name())
+                                      << " lastBX() = " << BXCollection->getLastBX()
+                                      << " - firstBX() = " << BXCollection->getFirstBX();
+    return false;
+  } else {
+    return true;
+  }
+}
+
 void L1TObjectsTiming::analyze(const edm::Event& e, const edm::EventSetup& c) {
   if (verbose_)
     edm::LogInfo("L1TObjectsTiming") << "L1TObjectsTiming: analyze..." << std::endl;
 
   // Muon Collection
-  edm::Handle<l1t::MuonBxCollection> MuonBxCollection;
-  e.getByToken(ugmtMuonToken_, MuonBxCollection);
+  const l1t::MuonBxCollection* MuonBxCollection = &e.get(ugmtMuonToken_);
+  if (!checkBXCollection(MuonBxCollection))
+    return;
   // Jet Collection
-  edm::Handle<l1t::JetBxCollection> JetBxCollection;
-  e.getByToken(stage2CaloLayer2JetToken_, JetBxCollection);
+  const l1t::JetBxCollection* JetBxCollection = &e.get(stage2CaloLayer2JetToken_);
+  if (!checkBXCollection(JetBxCollection))
+    return;
   // EGamma Collection
-  edm::Handle<l1t::EGammaBxCollection> EGammaBxCollection;
-  e.getByToken(stage2CaloLayer2EGammaToken_, EGammaBxCollection);
+  const l1t::EGammaBxCollection* EGammaBxCollection = &e.get(stage2CaloLayer2EGammaToken_);
+  if (!checkBXCollection(EGammaBxCollection))
+    return;
   // Tau Collection
-  edm::Handle<l1t::TauBxCollection> TauBxCollection;
-  e.getByToken(stage2CaloLayer2TauToken_, TauBxCollection);
+  const l1t::TauBxCollection* TauBxCollection = &e.get(stage2CaloLayer2TauToken_);
+  if (!checkBXCollection(TauBxCollection))
+    return;
   // EtSum Collection
-  edm::Handle<l1t::EtSumBxCollection> EtSumBxCollection;
-  e.getByToken(stage2CaloLayer2EtSumToken_, EtSumBxCollection);
+  const l1t::EtSumBxCollection* EtSumBxCollection = &e.get(stage2CaloLayer2EtSumToken_);
+  if (!checkBXCollection(EtSumBxCollection))
+    return;
 
   // Open uGT readout record
   edm::Handle<GlobalAlgBlkBxCollection> uGtAlgs;

@aloeliger
Contributor

@mmusich I'll treat this as a priority tomorrow.

@aloeliger
Contributor

Okay, a quick 15-minute dissection shows that the crashing BXVector here is (at least initially) the muon BXVector. It has 22 BXs in it (-10 to 11, instead of the expected 5), and is quite (suspiciously?) full in general:

Size per BX:
-10: 1
-9: 2
-8: 2
-7: 2
-6: 2
-5: 2
-4: 2
-3: 2
-2: 2
-1: 2
0: 4
1: 2
2: 2
3: 2
4: 2
5: 2
6: 1
7: 2
8: 2
9: 2
10: 1
11: 1

It is also worth noting that under this set-up we do get a previous event, which shows the proper 5 BXs:

Size per BX:
-2: 0
-1: 2
0: 0
1: 0
2: 0

My knee-jerk reaction on seeing this is that it is data corruption, perhaps related to the ongoing uGT packer/unpacker problem, but I have no definitive proof of that.

The exact crash, of course, occurs here:

int index = (int)itBX - std::min(0, 1 - (int)bxrange_ % 2 - (int)std::floor(bxrange_ / 2.));
muons_eta_phi.at(index)->Fill(Muon->eta(), Muon->phi());
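Working that index formula through for bxrange_ = 5 makes the failure mode concrete (a sketch of the arithmetic, not code from the module beyond the quoted line):

// 1 - (int)bxrange_ % 2 - (int)std::floor(bxrange_ / 2.)  =  1 - 1 - 2  =  -2
// std::min(0, -2) = -2, so effectively: index = itBX + 2
// BX -2..2 therefore maps onto indices 0..4, matching the 5 booked histograms;
// a muon in BX 7 gives index 9, and a BX of -13 would give -11, which is the
// huge unsigned value (18446744073709551605) seen in the second crash.
int index = (int)itBX - std::min(0, 1 - (int)bxrange_ % 2 - (int)std::floor(bxrange_ / 2.));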

It finds some theoretically valid muon in BX 7:

Muon in collection that passes qual cuts
Muon->pt(): 61
Muon->hwQual(): 12

and then comes up with an "index" of 9 for it in the vector of histograms via that arcane equation. It then tries to access muons_eta_phi at this index; that vector was previously filled in this booking loop:

for (unsigned int i = 0; i < bxrange_; ++i) {
  muons_eta_phi.push_back(ibooker.book2D("muons_eta_phi_bx_" + bx_obj[i],
                                         "L1T Muon p_{T}#geq" + muonPtCutStr + " GeV qual#geq" + muonQualCutStr +
                                             " #eta vs #phi BX=" + bx_obj[i] + ";#eta;#phi",
                                         25,
                                         -2.5,
                                         2.5,
                                         25,
                                         -3.2,
                                         3.2));

Coincidentally, bxrange_ is a const hard-coded to 5 in the member initializer list here:

L1TObjectsTiming::L1TObjectsTiming(const edm::ParameterSet& ps)
    : ugmtMuonToken_(consumes<l1t::MuonBxCollection>(ps.getParameter<edm::InputTag>("muonProducer"))),
      stage2CaloLayer2JetToken_(
          consumes<l1t::JetBxCollection>(ps.getParameter<edm::InputTag>("stage2CaloLayer2JetProducer"))),
      stage2CaloLayer2EGammaToken_(
          consumes<l1t::EGammaBxCollection>(ps.getParameter<edm::InputTag>("stage2CaloLayer2EGammaProducer"))),
      stage2CaloLayer2TauToken_(
          consumes<l1t::TauBxCollection>(ps.getParameter<edm::InputTag>("stage2CaloLayer2TauProducer"))),
      stage2CaloLayer2EtSumToken_(
          consumes<l1t::EtSumBxCollection>(ps.getParameter<edm::InputTag>("stage2CaloLayer2EtSumProducer"))),
      l1tStage2uGtProducer_(consumes<GlobalAlgBlkBxCollection>(ps.getParameter<edm::InputTag>("ugtProducer"))),
      monitorDir_(ps.getUntrackedParameter<std::string>("monitorDir")),
      verbose_(ps.getUntrackedParameter<bool>("verbose")),
      gtUtil_(new l1t::L1TGlobalUtil(ps,
                                     consumesCollector(),
                                     *this,
                                     ps.getParameter<edm::InputTag>("ugtProducer"),
                                     ps.getParameter<edm::InputTag>("ugtProducer"))),
      algoBitFirstBxInTrain_(-1),
      algoBitLastBxInTrain_(-1),
      algoBitIsoBx_(-1),
      algoNameFirstBxInTrain_(ps.getUntrackedParameter<std::string>("firstBXInTrainAlgo", "")),
      algoNameLastBxInTrain_(ps.getUntrackedParameter<std::string>("lastBXInTrainAlgo", "")),
      algoNameIsoBx_(ps.getUntrackedParameter<std::string>("isoBXAlgo", "")),
      bxrange_(5),
      egammaPtCuts_(ps.getUntrackedParameter<std::vector<double>>("egammaPtCuts")),
      jetPtCut_(ps.getUntrackedParameter<double>("jetPtCut")),
      egammaPtCut_(0.),
      tauPtCut_(ps.getUntrackedParameter<double>("tauPtCut")),
      etsumPtCut_(ps.getUntrackedParameter<double>("etsumPtCut")),
      muonPtCut_(ps.getUntrackedParameter<double>("muonPtCut")),
      muonQualCut_(ps.getUntrackedParameter<int>("muonQualCut")) {

instead of being some constexpr or something.
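For illustration, the alternative being alluded to would be a compile-time constant in the class declaration rather than a constructor-initialized member (a sketch, not the actual code):

static constexpr unsigned int bxrange_ = 5;  // fixed at compile time, no initializer-list entry needed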

And when it tries to access the location defined by a bxrange of 5 at index 9, you get the usual C++ vector range error.

This is therefore going to crash (and crash uninformatively) on anything with BXVectors containing BXs outside of the expected 5.
As an intermediate step, I could test each of the input BXVectors to make sure they are in the expected range, and throw a better cms::Exception making these assumptions clear, but I can't yet explain where the messed-up BXVector came from. @mmusich would it be a good idea to insert such a test/exception into this?

The collection these muons are loaded out of is defined by muonProducer here:

from DQMServices.Core.DQMEDAnalyzer import DQMEDAnalyzer
l1tObjectsTiming = DQMEDAnalyzer(
    "L1TObjectsTiming",
    muonProducer = cms.InputTag("gtStage2Digis", "Muon"),

which is pretty much directly the gtStage2 digis. I could try to trace it back further, but that is pretty much just the raw-to-digi step at that point:

gtStage2Digis = cms.EDProducer(
    "L1TRawToDigi",
    InputLabel = cms.InputTag("rawDataCollector"),
    Setup = cms.string("stage2::GTSetup"),
    FedIds = cms.vint32(1404),
)

(which is some evidence for corrupted data)

@mmusich
Contributor Author

mmusich commented May 15, 2023

@aloeliger

would it be a good idea to insert such a test/exception into this?

I think so. Incidentally, that's (almost) what I was proposing in #41645 (comment). Though I (personally) wouldn't crash the entire Prompt Reco because of a mismatch with the expected data format in a DQM module; emitting a LogError might be better suited for this case.

@aloeliger
Contributor

@mmusich forgive my inexperience on this, but are cms::Exceptions catchable/ignorable, or can you force the event to skip at the configuration level?

I'm not sure we want to continue computation on some event we know has corrupted uGT data. We're just asking to crash somewhere else at that point, or to continue processing junk.

@mmusich
Contributor Author

mmusich commented May 15, 2023

or can you force the event to skip at the configuration level?

this is not what happens at Tier-0, and in general it's not a good policy, for the reasons explained in the thread at #41512.
Either the data is so corrupted that an exception is thrown (with the consequent complete disruption of the job), or you keep processing the junk (in your words). The question is whether the data corruption discussed here is severe enough to declare the event completely unusable for physics. If not (from the tests I did circumventing the crash, it seems the processing just continues without consequences), it would perhaps be better to handle the corrupt data at the unpacker level, create empty collections, and let the clients deal with it naturally.

@fwyzard
Contributor

fwyzard commented May 15, 2023 via email

@aloeliger
Contributor

aloeliger commented May 15, 2023

Well, the L1 packer/unpacker pretty clearly needs some robustness upgrades in any case, but that is a longer-term proposition.

I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate.

@mmusich
Contributor Author

mmusich commented May 15, 2023

@cms-sw/core-l2

I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate.

please advise, also in view of #41645 (comment)

@fwyzard
Contributor

fwyzard commented May 15, 2023

Either the data is so corrupted that an exception is thrown (with the consequent complete disruption of the job), or you keep processing the junk (in your words). The question is whether the data corruption discussed here is severe enough to declare the event completely unusable for physics. If not (from the tests I did circumventing the crash, it seems the processing just continues without consequences), it would perhaps be better to handle the corrupt data at the unpacker level, create empty collections, and let the clients deal with it naturally.

The problem with this approach is that the main consumer of the L1T information is the HLT, and the HLT will not be re-run during or after offline processing.

So the information that the HLT used to route the event through the various paths is likely wrong, and the event should be dropped, because it cannot be properly accounted for in the trigger purity/efficiency measurements.

@mmusich
Contributor Author

mmusich commented May 15, 2023

So the information that the HLT used to route the event through the various paths is likely wrong, and the event should be dropped, because it cannot be properly accounted for in the trigger purity/efficiency measurements.

We agree, but then planning an exception in a DQM module that runs in Prompt (2 days after the data is taken) is less than useless.

@makortel
Contributor

@cms-sw/core-l2

I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate.

please advise, also in view of #41645 (comment)

Our approach so far has been to not throw exceptions if the processing can otherwise continue. To me the best option would be for the unpacker to produce a data product that conveys the information that the raw data was corrupt in some way. Then everything downstream can easily ignore it if necessary, and e.g. HLT could even have a filter rejecting the event.

While we do have a facility to "skip events for which specific exceptions are thrown", as discussed in #41512, it has not been battle-tested, and generally exceptions should not be used for control flow. In principle

process.options.SkipEvent = cms.untracked.vstring("<exception category>")

should do the job. Based on some experimentation, if the OutputModule consumes the data product from the module that threw the exception, the OutputModule seems to automatically skip the event. If there is no data dependence chain from the throwing module to the OutputModule (which I believe is the case here given the exception coming from a DQM module), the OutputModule needs to be somehow instructed to act based on the Path where the module is. This could be OutputModule.SelectEvents.SelectEvents or inserting TriggerResultsFilter in front of the OutputModule in the EndPath.
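For example, the former could look something like this (a sketch; the output module name 'output' is a placeholder, while 'dqmoffline_1_step' is the Path name from the logs above):

process.output.SelectEvents = cms.untracked.PSet(
    SelectEvents = cms.vstring('dqmoffline_1_step')
)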

@aloeliger
Contributor

Well, it isn't a fast solution, but I can take a look at the unpacker and see if there's a way to provide proper null sets out of the unpacker if it doesn't pass a few basic quality tests.

@gpetruc
Contributor

gpetruc commented May 16, 2023

Since this is corrupt data of just the muons in the GT record, I would agree that the best strategy would be to have the GT unpacker detect the error, report it (in a place where a human will eventually notice it - is anyone really checking the LogErrors in the certification workflow?), and produce an empty GT muon collection (no need to fail the other non-muon triggers because of this)

@aloeliger
Contributor

Since this is corrupt data of just the muons in the GT record, I would agree that the best strategy would be to have the GT unpacker detect the error, report it (in a place where a human will eventually notice it - is anyone really checking the LogErrors in the certification workflow?), and produce an empty GT muon collection (no need to fail the other non-muon triggers because of this)

It may be just the muons. The muons were simply the first thing that crashed. I can check other collections for corruption too.

@aloeliger
Contributor

@cms-sw/core-l2

I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate.

please advise, also in view of #41645 (comment)

Our approach so far has been to not throw exceptions if the processing can otherwise continue. To me the best option would be for the unpacker to produce a data product that conveys the information that the raw data was corrupt in some way. Then everything downstream can easily ignore it if necessary, and e.g. HLT could even have a filter rejecting the event.

While we do have a facility to "skip events for which specific exceptions are thrown", as discussed in #41512, it has not been battle-tested, and generally exceptions should not be used for control flow. In principle

process.options.SkipEvent = cms.untracked.vstring("<exception category>")

should do the job. Based on some experimentation, if the OutputModule consumes the data product from the module that threw the exception, the OutputModule seems to automatically skip the event. If there is no data dependence chain from the throwing module to the OutputModule (which I believe is the case here given the exception coming from a DQM module), the OutputModule needs to be somehow instructed to act based on the Path where the module is. This could be OutputModule.SelectEvents.SelectEvents or inserting TriggerResultsFilter in front of the OutputModule in the EndPath.

@makortel I didn't have a lot of time to think about this yesterday, but could I borrow your judgement on how to go about unpacker modifications for this?

If I go through the various GT unpackers and add a step where each checks its own output for, say, having 5 BXs as a test of whether the data is corrupted, I could conceivably log an error and force it to instead output some empty BXVector that is unlikely to disturb anything downstream of it. But I'm not sure, without looking at the log errors by eye, how one catches this and differentiates this event as a problem rather than an event with some genuinely empty vector. I agree with @gpetruc that someone really does need to catch this without disturbing other elements of the reconstruction, but I am concerned that simply outputting empty collections may help to bury the problem rather than catching it, alerting, and gracefully continuing.

I entertained the idea of some sort of module-in-the-middle-style solution that would run after the unpackers in the RawToDigi sequence to inspect and catch any potential data corruption, and insert a bit or something into the output that could be used to flag corruption so that downstream clients could catch it, or HLT could filter on it. But I imagine this is more trouble than it is worth: it involves sensitive modifications to everything reliant on the RawToDigi step, and it is also not truly a solution to the issue of some downstream client attempting to process bad data. I would also be terrified of introducing a mistaken 100% removal filter into the process through a bad configuration of this.

The existing muon unpacker responsible for GT muons currently contains some corruption checks that are designed to throw exceptions:

bool MuonUnpacker::unpack(const Block& block, UnpackerCollections* coll) {
  LogDebug("L1T") << "Block ID = " << block.header().getID() << " size = " << block.header().getSize();

  // process only if there is a payload
  // If all BX block were zero suppressed the block header size is 0.
  if (block.header().getSize() < 1) {
    return true;
  }

  auto payload = block.payload();

  int nBX, firstBX, lastBX;
  // Check if per BX zero suppression was enabled
  bool bxZsEnabled = ((block.header().getFlags() >> bxzs_enable_shift_) & 0x1) == 1;
  // Calculate the total number of BXs
  if (bxZsEnabled) {
    BxBlockHeader bxHeader(payload.at(0));
    nBX = bxHeader.getTotalBx();
  } else {
    nBX = int(ceil(block.header().getSize() / nWords_));
  }
  getBXRange(nBX, firstBX, lastBX);

  // Set the muon collection and the BX range
  muonCollection_ = static_cast<L1TObjectCollections*>(coll)->getMuons(muonCopy_);
  muonCollection_->setBXRange(firstBX, lastBX);
  // Set the muon shower collection and the BX range
  muonShowerCollection_ = static_cast<L1TObjectCollections*>(coll)->getMuonShowers(muonCopy_);
  muonShowerCollection_->setBXRange(firstBX, lastBX);

  LogDebug("L1T") << "nBX = " << nBX << " first BX = " << firstBX << " lastBX = " << lastBX;

  // Get the BX blocks and unpack them
  auto bxBlocks = block.getBxBlocks(nWords_, bxZsEnabled);
  for (const auto& bxBlock : bxBlocks) {
    // Throw an exception if finding a corrupt BX header with out of range BX numbers
    const auto bx = bxBlock.header().getBx();
    if (bx < firstBX || bx > lastBX) {
      throw cms::Exception("CorruptData")
          << "Corrupt RAW data from FED " << fed_ << ", AMC " << block.amc().getAMCNumber() << ". BX number " << bx
          << " in BX header is outside of the BX range [" << firstBX << "," << lastBX
          << "] defined in the block header.";
    }
    unpackBx(bx, bxBlock.payload(), block.header().getID());
  }
  return true;
}

... but from the sounds of it, this is pretty off the table at this point, having made it to this stage?

@makortel
Contributor

How is data corruption handled in unpackers of other sub-detectors? (bringing in @cms-sw/reconstruction-l2)

@makortel
Contributor

After discussing with @Dr15Jones, we came to the conclusion that, from the framework point of view, using "empty containers" is not a good way to communicate processing failures, because there is no way to distinguish that from the container being empty for "legitimate reasons". It would be better if the consuming modules were able to ask questions like "is the input data product valid" and "if the input data is not valid, why is it not". In principle the edm::Handle already provides an interface for a consuming module to ask such questions (isValid(), whyFailed()), but we currently use those to communicate only that the data product is missing.
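For concreteness, this is roughly what that interface lets a consumer ask today (a sketch in the style of L1TObjectsTiming::analyze(); as noted, whyFailed() is currently populated only when the product is missing entirely):

edm::Handle<l1t::MuonBxCollection> muons;
e.getByToken(ugmtMuonToken_, muons);
if (!muons.isValid()) {
  // ask "why is the input not valid" instead of dereferencing and crashing
  std::shared_ptr<cms::Exception const> reason = muons.whyFailed();
  edm::LogWarning("L1TObjectsTiming") << "Muon collection unavailable: "
                                      << (reason ? reason->explainSelf() : std::string("unknown reason"));
  return;
}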

If there is interest in the following kind of behavior, we could look into evolving the SkipEvent exception handling behavior in the framework accordingly:

  • If module A throws an exception of a specific, configurable category, then
    • All modules depending on A's data products (directly or transitively) would not be run, except the following:
      • There would be a way for a module to tell the framework that it should be run even if any of it input data product is missing because of an earlier "acceptable processing failure", and the module deals with such failure case without throwing exceptions.
        • It can be useful e.g. to have an OutputModule storing the raw data of such events, or DQM module monitoring such errors
    • Each Path containing the module A (or any module depending on A) is marked as failed (as if A were a filter that rejected the event) and no further modules in the Path will be scheduled to run. Modules that have already been scheduled and are independent of A will still be run.
    • The exception would be available for any consuming module (that declared it can handle the missing data product) via edm::Handle::whyFailed()
    • The failure of the module and Path(s) is recorded in TriggerResults (need to think which of "fail" or "error" would be better)
    • Modules and Paths independent of A (or any module depending on A) would be run normally.
  • OutputModule behavior
    • An OutputModule that is configured to store any data product of A would not store any data for the event where the exception was thrown (unless configured to store such events; although while writing this now I'm actually not sure which behavior would be best as default)
    • An OutputModule configured to store events based on the decisions of Paths containing A would not store any data for the event where the exception was thrown
    • There would be a way to configure an OutputModule to store only Events for which such exception is thrown

In the meantime, similar behavior can to a large degree be accomplished by the following (a rough sketch of the accompanying filter is given after the list):

  • The producer A that wants to report the processing error does not actually produce its data products for the "failing" event (i.e. does not call Event::put() or Event::emplace())
  • Such producer is accompanied with a filter that
    • Accepts the event if the data product(s) of A exists, otherwise rejects the event
    • Copies the data product(s) of A if they exist, otherwise inserts default-constructed data products.
      • This workaround is specifically to deal with modules that consume A, and whose scheduling is not controlled by the filter (i.e. they are in an independent Path or unscheduled). Such consuming modules are assumed to be able to deal with a default-constructed data product without throwing exceptions, but might throw ProductNotFound exception if the product would be missing. (I don't personally like much of this part, but it does give a way to avoid changing the consuming modules to check if an input product is missing in a situation where the modules are scheduled irrespective of the filter)
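A rough sketch of such an accompanying filter, for the muon collection only (the module name GtDigisValidityFilter and the parameter name muonSrc are illustrative, not an actual CMSSW module; a real version would cover all GT data products):

#include "FWCore/Framework/interface/stream/EDFilter.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "DataFormats/L1Trigger/interface/Muon.h"

class GtDigisValidityFilter : public edm::stream::EDFilter<> {
public:
  explicit GtDigisValidityFilter(const edm::ParameterSet& ps)
      : muonToken_(consumes<l1t::MuonBxCollection>(ps.getParameter<edm::InputTag>("muonSrc"))),
        muonPutToken_(produces<l1t::MuonBxCollection>()) {}

  bool filter(edm::Event& e, const edm::EventSetup&) override {
    edm::Handle<l1t::MuonBxCollection> muons;
    e.getByToken(muonToken_, muons);
    if (muons.isValid()) {
      e.emplace(muonPutToken_, *muons);  // copy the product through and accept
      return true;
    }
    // product absent (unpacker flagged corruption and did not put it):
    // provide a default-constructed collection for consumers in EndPaths or
    // Tasks, and reject the event on this Path
    e.emplace(muonPutToken_, l1t::MuonBxCollection());
    return false;
  }

private:
  edm::EDGetTokenT<l1t::MuonBxCollection> muonToken_;
  edm::EDPutTokenT<l1t::MuonBxCollection> muonPutToken_;
};

DEFINE_FWK_MODULE(GtDigisValidityFilter);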

@aloeliger
Contributor

Okay. Then, if I am understanding correctly, L1 needs to:

  1. Modify the GT unpacker to check for corruption; if it finds any, do not put or emplace anything into the event
  2. Provide a filter that can be run with the unpacker and that checks that the data product is there. If it is not found, the filter fails, filtering the rest of the path
  3. Provide a producer/filter designed to copy the products of the GT unpacker if they exist, and otherwise create default ones.
  4. Insert these into GT-unpacker-dependent sequences

In the long run the first set of things does seem ideal to me for this sort of situation, but I would like to help with the immediate solution first.

@aloeliger
Contributor

@mmusich @makortel Following what I think was the desired short-term solution, I have created a filter for the GT digis that will check for corruption (right now this is defined only as output BX vectors with a size different from a configured size), and will also produce either an empty BXVector if corruption is detected, or an identical BXVector if it is not. The EDFilter will return false when any corruption of this kind is detected, and true whenever there is no corruption.

@makortel my understanding is that this filter will then need to be inserted after the GT unpacking step, and anything reliant on the GT digis will need to be told to get their information from this instead?

I will attempt to test this on the current problem recipe in any case.

@fwyzard
Contributor

fwyzard commented May 23, 2023

Hi @aloeliger,
my impression is that, if you wrote an EDFilter, making a copy of the collections is redundant:

  • if the filter passes, we assume that data is good, and the consumers can read the original unpacked data as before;
  • if the filter fails, nothing else after it will run, so there is no need to produce an empty collection.

So, wouldn't it be enough to have an EDFilter that returns true or false and does not produce anything?

@makortel
Contributor

So, wouldn't it be enough to have an EDFilter that returns true or false and does not produce anything?

The use case for the "copied-over-or-dummy" product is modules consuming the data product that are in Tasks (i.e. unscheduled), or that are in Paths but whose execution is (for whatever reason) not controlled by the said filter; such modules cannot handle a missing input data product without throwing an exception, and the desired behavior is for them not to throw the missing-data-product exception (i.e. emulating the behavior outlined in #41645 (comment)).

@makortel
Contributor

my understanding is that this filter will then need to be inserted after the GT unpacking step, and anything reliant on the GT digis will need to be told to get their information from this instead?

Correct. Although from the framework perspective it would be better if the actual producer of the GT digis did not produce the data product in case of the error, because then the error condition is clearer for all consumers of the data product.

Do I assume correctly that the GT digis are not stored as such on any data tier?

@missirol
Contributor

Do I assume correctly that the GT digis are not stored as such on any data tier?

Fwiw, HLT currently stores them in the output of certain data streams (looking for keep *_hltGtStage2Digis_*_* in OnLine_HLT_GRun.py).

@makortel
Contributor

Do I assume correctly that the GT digis are not stored as such on any data tier?

Fwiw, HLT currently stores them in the output of certain data streams (looking for keep *_hltGtStage2Digis_*_* in OnLine_HLT_GRun.py).

Thanks. I suppose in the HLT context the affected OutputModules are already configured such that when the aforementioned EDFilter is added to the Sequence containing hltGtStage2Digis, the OutputModules will not store the events that the EDFilter rejects. Is this assumption correct?

I also see them being stored as part of HLTDebugRAW


and HLTDebugFEVT

I was asking because for any non-temporary storage of these data products it would be cleanest to record the failure by storing the "not-produced" product, and was wondering how much of this use case needs to be covered.

@makortel
Contributor

If there is interest in the following kind of behavior, we could look into evolving the SkipEvent exception handling behavior in the framework accordingly:
(details in #41645 (comment))

Does anyone have any feedback on this potential evolution of SkipEvent?

@aloeliger
Contributor

If there is interest in the following kind of behavior, we could look into evolving the SkipEvent exception handling behavior in the framework accordingly:
(details in #41645 (comment))

Does anyone have any feedback on this potential evolution of SkipEvent?

For whatever my relatively inexpert opinion is worth, I think it's a good idea for exactly these cases. This event crashes, and crashes for a good reason. However, we should be able to crash events without crashing entire processes.

@missirol
Contributor

Thanks. I suppose in the HLT context the affected OutputModules are already configured such that when the aforementioned EDFilter is added to the Sequence containing hltGtStage2Digis, the OutputModules will not store the events that the EDFilter rejects. Is this assumption correct?

In my understanding, it is.

But I'll add that I still have to study this issue, and it's not clear to me what the exact changes to the HLT menu would be because of it (esp. when it comes to the "copied-over-or-dummy product"). The "aforementioned EDFilter" sounds similar in purpose to the workaround tested in [1].

On top of the general issue, it would be useful (to me) to understand better some aspects that are specific to the L1T unpacker.

  • If the index of the prescale column can be unpacked correctly, it might be better to produce digis with that info (prescale index), plus empty BX vectors (or similar) wherever there is corrupted data. The product would still fail the EDFilter that checks the validity of the digis, but having the prescale index would still be useful to redirect the corrupted events to a dedicated stream (this refers to the complications with HLTPrescaler described towards the end of [1]).
  • The crash that led to guard HLTRecHitInAllL1RegionsProducer<T> against empty collection of L1T candidates [13_0_X] #41467 should be checked in more detail (I haven't done it, but we have the corrupted events). That problem suggests that the corrupted event passed certain L1T algos (e.g. L1_SingleEG*; otherwise that crashing module would not be executed in the Path), but there were no L1T e/g candidates (thus the crash). Assuming this is the case, does it make sense for the digis to contain such inconsistencies, or is there a way to deal with them in the unpacker (e.g. if there are no e/g candidates, change L1_SingleEG* to false)?

[1] CMSHLT-2793 (comment)

I also see them being stored as part of HLTDebugRAW

and HLTDebugFEVT

Right, thanks for pointing this out.

@aloeliger
Contributor

@mmusich is the file with this error still available to test the new filter on?

@makortel
Contributor

So, wouldn't it be enough to have an EDFilter that returns true or false and does not produce anything ?

The use case for the "copied-over-or-dummy" product is modules consuming the data product that are in Tasks (i.e. unscheduled), or that are in Paths but whose execution is (for whatever reason) not controlled by the said filter; such modules cannot handle a missing input data product without throwing an exception, and the desired behavior is for them not to throw the missing-data-product exception (i.e. emulating the behavior outlined in #41645 (comment)).

I should add that the set of consuming modules one needs to worry about includes all modules in EndPaths (whose execution is not controlled by a TriggerResultsFilter that would reject the event with the missing data product).

@aloeliger
Contributor

@mmusich I just want to ping you again. The original method for recreating the error seems to no longer work, because it can no longer find the error file on EOS. Is this still available for testing the short-term solution?

@mmusich
Contributor Author

mmusich commented May 26, 2023

@aloeliger

The original method for recreating the error seems to no longer work, because it can no longer find the error file on EOS. Is this still available for testing the short-term solution?

It seems you are not following the cmsTalk thread linked in the original issue description.
Please see https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996/6

@aloeliger
Contributor

@aloeliger

The original method for recreating the error seems to no longer work, because it can no longer find the error file on EOS. Is this still available for testing the short-term solution?

It seems you are not following the cmsTalk thread linked in the original issue description. Please see https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996/6

My apologies for my absence; I was at US CMS last week. Thanks for providing this. I will start trying to test with the stored tarballs ASAP.

@aloeliger
Contributor

@mmusich Apologies, but I am still running into issues trying to recreate the error so I can test my solution. The tarball does not include the file that the original parameter set/process runs on. Is the file root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2023C/JetMET0/RAW/v1/000/367/232/00000/9144de83-470d-499d-8973-700975c88eba.root still available somewhere I should be aware of?

@mmusich
Contributor Author

mmusich commented Jun 5, 2023

@aloeliger you'll have to try to reproduce with what is there, or ask @germanfgv for clarification.

@mmusich
Contributor Author

mmusich commented Jun 5, 2023

Apparently it's only on tape:

$ dasgoclient -query='site file=/store/data/Run2023C/JetMET0/RAW/v1/000/367/232/00000/9144de83-470d-499d-8973-700975c88eba.root'
T0_CH_CERN_Tape
T1_US_FNAL_Tape

It should be trivial to make a Rucio rule to get it back, though.
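For reference, a file-level Rucio rule along these lines should do it (the target RSE and lifetime here are placeholders, and one needs the appropriate Rucio account and quota):

$ rucio add-rule cms:/store/data/Run2023C/JetMET0/RAW/v1/000/367/232/00000/9144de83-470d-499d-8973-700975c88eba.root 1 T2_CH_CERN --lifetime 1209600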

@aloeliger
Contributor

The Rucio rule should be trivial... but whether I have enough goodwill left at Wisconsin to get them to put this onto their disk after the stunts I've pulled is another story altogether. I'll see what I can do.

@aloeliger
Contributor

I take it back... the block that contains this is /JetMET0/Run2023C-v1/RAW#b852026e-a127-4075-9b0a-90c0bdb1d5e4, which is 2.5 TB, which is a little much any way you slice it to pull a single file off of tape for debugging. I don't suppose there is some secret debugging method of getting single files off of tape? Alternatively, I could just open a draft PR for the GT unpacker inspection module if we want to work without an explicit test on it.

@perrotta
Contributor

perrotta commented Jul 7, 2023

Is anybody able to test if the fix in #42214 can solve this issue?

@mmusich
Contributor Author

mmusich commented Jul 7, 2023

Is anybody able to test if the fix in #42214 can solve this issue?

Apparently it's not straightforward to get back the RAW data for the Tier0 job that failed, see #41645 (comment). If we are OK with getting a 2.5 TB block at CERN, it should be straightforward to check.

@aloeliger
Contributor

It may solve the immediate problem, but the point remains that the vector shouldn't have existed in the first place. I think there is some evidence at this point that something is going wrong in one of the muon unpackers. I would be hesitant to insert this solution and call anything "fixed".
