guard HLTRecHitInAllL1RegionsProducer<T> against empty collection of L1T candidates [13_0_X] #41467


missirol
Contributor

@missirol missirol commented Apr 30, 2023

backport of #41466

PR description:

This PR adds a check to HLTRecHitInAllL1RegionsProducer<T> so that it gracefully handles events in which the input collection of L1T candidates is empty (either completely empty, or empty only for BX=0).

Before this PR, the plugin can crash here for such events. This type of crash was observed multiple times online in the last few days. The root cause of the problem (i.e. missing L1T candidates) is likely related to issues with the L1T menu deployed a couple of weeks ago (CMSLITOPS-411).
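
A minimal sketch of the kind of guard described above, written as a toy Python model rather than the plugin's actual C++. `ToyBXCollection`, `has_bx`, `candidates` and `select_seeds` are hypothetical names introduced for illustration; the real `BXVector` interface and the exact form of the check in this PR may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBXCollection:
    """Hypothetical stand-in for a BX-indexed collection of L1T candidates."""

    # Maps a bunch-crossing index to the Et values of its candidates.
    by_bx: dict = field(default_factory=dict)

    def has_bx(self, bx):
        return bx in self.by_bx

    def candidates(self, bx):
        # Unchecked access: raises KeyError if the BX is not stored,
        # standing in for the unguarded access that crashes.
        return self.by_bx[bx]


def select_seeds(coll, min_et, max_et, bx=0):
    """Return the Et of candidates in `bx` that pass the Et window."""
    # Guard: return early if the collection has no entry for this BX,
    # or if that BX holds no candidates.
    if not coll.has_bx(bx) or not coll.candidates(bx):
        return []
    return [et for et in coll.candidates(bx) if min_et <= et < max_et]
```

With this guard, a completely empty collection and a collection that only lacks BX=0 both yield no seeds instead of raising an error.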

A reproducer is in [1], and it produces the stack trace attached here (which matches the error log seen online).

FYI: @Sam-Harper @swagata87 (as this module is used by E/gamma triggers)

[1]

#!/bin/bash

# cmsrel CMSSW_13_0_3
# cd CMSSW_13_0_3/src
# cmsenv

hltGetConfiguration run:366727 \
  --globaltag 130X_dataRun3_HLT_v2 --data \
  --no-prescale --no-output \
  --max-events -1 \
  --input dummy.root \
  > hlt.py

cat <<@EOF >> hlt.py
del process.DQMOutput

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)

del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
process.MessageLogger.cerr.FwkReport.reportEvery = 1
process.MessageLogger.cerr.enableStatistics = False
process.MessageLogger.cerr.threshold = 'INFO'

from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    eventChunkSize = 200,
    eventChunkBlock = 200,
    numBuffers = 4,
    maxBufferedFiles = 4,
    fileListMode = True,
    fileNames = [
      "/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run366727/run366727_ls0136_index000137_fu-c2b04-34-01_pid3305175.raw",
    ]
)

from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(buBaseDir = '.', runNumber = 0)
@EOF

cmsRun hlt.py &> hlt.log

PR validation:

With the changes in this PR, the reproducer does not crash.

If this PR is a backport, please specify the original PR and why you need to backport that PR. If this PR will be backported, please specify to which release cycle the backport is meant for:

#41466

Bugfix to plugin used at HLT in release cycle used for data-taking in 2023.

@cmsbuild
Contributor

cmsbuild commented Apr 30, 2023

A new Pull Request was created by @missirol (Marino Missiroli) for CMSSW_13_0_X.

It involves the following packages:

  • RecoEgamma/EgammaHLTProducers (hlt)

@cmsbuild, @missirol, @Martin-Grunewald can you please review it and eventually sign? Thanks.
@Sam-Harper, @HuguesBrun, @silviodonato, @jainshilpi, @sameasy, @valsdav, @Fedespring, @lgray, @sobhatta, @afiqaize, @wrtabb, @a-kapoor, @Prasant1993, @varuns23, @cericeci, @ram1123 this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@missirol
Contributor Author

type bugfix

@missirol
Contributor Author

urgent

This PR can reduce the number of HLT crashes seen online these days, so it should be integrated quickly.

@missirol
Contributor Author

please test

@dinyar
Contributor

dinyar commented Apr 30, 2023

Hi @missirol,

I wonder if only outputting a debug message here wouldn't hide serious issues? I may misunderstand, but my assumption is that the code that you're changing is only called if uGT claims that it triggered on an object of the given type. So if it claims that, but didn't correctly fill the input collection that points at a serious issue and we (at least L1T) should be alerted to it.

My current feeling is that the BXVector shouldn't allow itself to be configured with zero entries. (It prohibits that in its initialiser, but one can reduce it to size zero afterwards by changing the BX range.)

Cheers,
Dinyar
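
The invariant suggested in the comment above could look roughly like the following toy Python model. This is not the `BXVector` code: `ToyBXRange`, `set_range` and `n_bx` are hypothetical names, and the sketch only illustrates refusing a BX range that would leave the container with zero entries, both at construction and afterwards.

```python
class ToyBXRange:
    """Hypothetical container that always spans at least one bunch crossing."""

    def __init__(self, first_bx=0, last_bx=0):
        self._check(first_bx, last_bx)
        self.first_bx = first_bx
        self.last_bx = last_bx

    @staticmethod
    def _check(first_bx, last_bx):
        # An inverted range would correspond to zero BX entries.
        if last_bx < first_bx:
            raise ValueError(f"empty BX range [{first_bx}, {last_bx}]")

    def set_range(self, first_bx, last_bx):
        # Apply the same check as the constructor, so the range
        # cannot be shrunk to nothing after construction.
        self._check(first_bx, last_bx)
        self.first_bx = first_bx
        self.last_bx = last_bx

    def n_bx(self):
        return self.last_bx - self.first_bx + 1
```

In this model a rejected `set_range` call leaves the previous range untouched, so downstream code can always assume at least one BX is present.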

@missirol
Contributor Author

missirol commented Apr 30, 2023

Hi Dinyar,

I think it's a fair question, and I don't have a strong opinion. Maybe E/gamma experts can give a more informed comment, since they know this plugin better than me. I decided to go for LogDebug because

  • In the case of Stage-1 L1T objects (e.g. here), this plugin does not return warnings if the collections are empty (to me, this goes in the direction of not adding a warning for Stage-2 L1T objects, which is the part touched by this PR).
  • The plugin considers multiple types of Stage-2 L1T objects (l1t::EGamma, l1t::Tau, l1t::Jet); at HLT, it is configured as in [1]. This module runs for many HLT Paths, which obviously can have different L1T seeds (some L1-EG, some L1-Tau, etc). So, I was imagining that, in principle, in a given event for an E/gamma trigger l1t::EGamma would be non-empty, but l1t::Tau could be empty. In that case, it would be unnecessary to issue a warning because l1t::Tau is empty. But maybe I misunderstand what L1T sends to HLT: I guess, normally, hltGtStage2Digis:{EGamma,Jet,Tau} are all non-empty for BX=0, otherwise we would have seen this crash before (?).

Regarding

I may misunderstand, but my assumption is that the code that you're changing is only called if uGT claims that it triggered on an object of the given type.

In practice (in a real HLT), I think this is true. In general (this plugin alone), I don't think it is necessarily the case, as the plugin itself makes no explicit requirements on the L1T-uGT decisions, it just tries to access the L1T objects.

So if it claims that, but didn't correctly fill the input collection that points at a serious issue and we (at least L1T) should be alerted to it.

While we can certainly change this LogDebug to LogWarning or LogError, I would argue that this kind of L1T issue should trigger all sorts of alarms in the L1T unpacker or other 'upstream' modules (e.g. HLTL1TSeed), not necessarily in any given HLT 'client' (like this plugin).

[1]

process.hltRechitInRegionsECAL = cms.EDProducer( "HLTEcalRecHitInAllL1RegionsProducer",
    productLabels = cms.vstring( 'EcalRecHitsEB',
      'EcalRecHitsEE' ),
    recHitLabels = cms.VInputTag( 'hltEcalRecHit:EcalRecHitsEB','hltEcalRecHit:EcalRecHitsEE' ),
    l1InputRegions = cms.VPSet( 
      cms.PSet(  inputColl = cms.InputTag( 'hltGtStage2Digis','EGamma' ),
        regionEtaMargin = cms.double( 0.9 ),
        type = cms.string( "EGamma" ),
        minEt = cms.double( 5.0 ),
        regionPhiMargin = cms.double( 1.2 ),
        maxEt = cms.double( 999999.0 )
      ),
      cms.PSet(  inputColl = cms.InputTag( 'hltGtStage2Digis','Jet' ),
        regionEtaMargin = cms.double( 0.9 ),
        type = cms.string( "Jet" ),
        minEt = cms.double( 170.0 ),
        regionPhiMargin = cms.double( 1.2 ),
        maxEt = cms.double( 999999.0 )
      ),
      cms.PSet(  inputColl = cms.InputTag( 'hltGtStage2Digis','Tau' ),
        regionEtaMargin = cms.double( 0.9 ),
        type = cms.string( "Tau" ),
        minEt = cms.double( 100.0 ),
        regionPhiMargin = cms.double( 1.2 ),
        maxEt = cms.double( 999999.0 )
      )
    )
)
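
To illustrate the point about multiple Stage-2 object types, here is a toy Python sketch (not CMSSW code) that loops over the three input types using the `minEt` and `maxEt` values from the configuration above. The event content and the names `REGIONS` and `seeds_per_type` are hypothetical; the sketch only shows that an empty collection for one type need not affect the others.

```python
# (type, minEt, maxEt) taken from the l1InputRegions configuration.
REGIONS = [
    ("EGamma", 5.0, 999999.0),
    ("Jet", 170.0, 999999.0),
    ("Tau", 100.0, 999999.0),
]


def seeds_per_type(event, regions=REGIONS):
    """Map each object type to the Et values passing its window.

    `event` is a hypothetical dict: type name -> list of Et values at BX=0.
    A missing or empty entry is skipped quietly instead of raising.
    """
    result = {}
    for obj_type, min_et, max_et in regions:
        cands = event.get(obj_type, [])
        if not cands:
            # Empty collection for this type: no regions, no error.
            result[obj_type] = []
            continue
        result[obj_type] = [et for et in cands if min_et <= et < max_et]
    return result
```

For example, an event with only EGamma and Jet candidates still produces EGamma seeds, while the absent Tau collection simply contributes nothing.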

@missirol
Contributor Author

@perrotta, I see we are building CMSSW_13_0_5 (#41468).

I just want to warn that, once this PR is in 13_0_X, HLT might ask for a patch release as soon as tomorrow, to mitigate online crashes.

This wasn't discussed before because the issue and the fix came up only very recently.

@perrotta
Contributor

Thank you @missirol
We have several crashes to cure in the T0 replays. We'll certainly need to implement further fixes on top of currently building 13_0_5. This one can be implemented together with them, or even in advance of them if it is really so urgent.

@missirol
Contributor Author

missirol commented Apr 30, 2023

Sounds good, thanks @perrotta. I will check with TSG how urgently they want to deploy this change.

or even in advance of them if it is really so urgent.

Looking at the logs, yesterday in run 366821 (collisions) we had 986 crashes in the HLT farm. This PR should fix at least roughly 50% of them, so in my opinion it is rather urgent.

@cmsbuild
Contributor

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-035662/32259/summary.html
COMMIT: ed4373d
CMSSW: CMSSW_13_0_X_2023-04-30-0000/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/41467/32259/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

  • 1002.4: 1002.4_RunDoubleMuon2022B/step2_RunDoubleMuon2022B.log

Comparison Summary

Summary:

  • You potentially added 9 lines to the logs
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3554572
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3554541
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@missirol
Contributor Author

+hlt

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_13_0_X IBs (but tests are reportedly failing) and once validation in the development release cycle CMSSW_13_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

perrotta commented May 1, 2023

+1

@perrotta
Contributor

perrotta commented May 1, 2023

merge

@cmsbuild cmsbuild merged commit 26e515b into cms-sw:CMSSW_13_0_X May 1, 2023
@missirol missirol deleted the devel_sanityCheckfixHLTRecHitInAllL1RegionsProducer_130X branch May 1, 2023 08:36
cmsbuild added a commit that referenced this pull request May 1, 2023
Integration of #41467 and #41470 in patch release for HLT on top of `CMSSW_13_0_3`