Error in IB test wf 13034.0 #29251

silviodonato · 2020-03-20T09:40:29Z

As discussed during the last Core Software meeting, we are seeing an intermittent error in workflow 13034.0:

%MSG-w SiPixelPhase1TrackClusters:   SiPixelPhase1TrackClusters:hltSiPixelPhase1TrackClustersAnalyzer  19-Mar-2020 17:46:40 CET Run: 1 Event: 1201
PixelClusterShapeCache collection is not valid
%MSG
19-Mar-2020 17:46:40 CET  Closed file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_1/RelValMinBias_14TeV/GEN-SIM/106X_mcRun3_2021_realistic_v1_rsb-v1/10000/97553220-ED91-4E48-B59F-ED44053F8621.root
19-Mar-2020 17:46:40 CET  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_1/RelValMinBias_14TeV/GEN-SIM/106X_mcRun3_2021_realistic_v1_rsb-v1/10000/70672451-102A-8243-ACF0-C4F545C049AB.root
%MSG-e TrackProducer:  TrackProducer:initialStepTracksPreSplitting  19-Mar-2020 17:46:41 CET Run: 1 Event: 1211
cms::Exception caught during theAlgo.runWithCandidate.
An exception of category 'DataCorrupt' occurred.
Exception Message:
SiPixelTemplateReco::Vavilov parameters mpv/sigmaQ/kappa = 7280.9/691.7/0.01


%MSG
[...]
%MSG-e SiPixelTemplateReco:  TrackRefitter:refittedForPixelDQM  19-Mar-2020 17:46:46 CET Run: 1 Event: 1205
illegal chi2xmin normalization (1) = -0.0747717 
%MSG
%MSG-e SiPixelTemplateReco:  TrackRefitter:refittedForPixelDQM  19-Mar-2020 17:46:46 CET Run: 1 Event: 1205
illegal chi2xmin normalization (1) = -0.0672276 
%MSG
%MSG-e SiPixelTemplateReco:  TrackRefitter:refittedForPixelDQM  19-Mar-2020 17:46:46 CET Run: 1 Event: 1205
illegal chi2xmin normalization (1) = -0.0316581 
%MSG
----- Begin Fatal Exception 19-Mar-2020 17:46:46 CET-----------------------
An exception of category 'DataCorrupt' occurred while
   [0] Processing  Event run: 1 lumi: 13 event: 1211 stream: 1
   [1] Running path 'Flag_METFilters'
   [2] Prefetching for module BooleanFlagFilter/'HBHENoiseFilter'
   [3] Prefetching for module HBHENoiseFilterResultProducer/'HBHENoiseFilterResultProducer'
   [4] Prefetching for module HcalNoiseInfoProducer/'hcalnoise'
   [5] Prefetching for module FastjetJetProducer/'ak4PFJets'
   [6] Prefetching for module PFLinker/'particleFlow'
   [7] Prefetching for module PFProducer/'particleFlowTmp'
   [8] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [9] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [10] Prefetching for module GsfTrackProducer/'electronGsfTracks'
   [11] Prefetching for module MeasurementTrackerEventProducer/'MeasurementTrackerEvent'
   [12] Prefetching for module JetCoreClusterSplitter/'siPixelClusters'
   [13] Prefetching for module PrimaryVertexProducer/'firstStepPrimaryVerticesPreSplitting'
   [14] Calling method for module TrackProducer/'initialStepTracksPreSplitting'
Exception Message:
SiPixelTemplateReco::Vavilov parameters mpv/sigmaQ/kappa = 7280.9/691.7/0.01
----- End Fatal Exception -------------------------------------------------

These are the log files of the latest IB tests showing the error:

CMSSW_11_1_X_2020-03-13-2300.log
CMSSW_11_1_X_2020-03-14-1100.log
CMSSW_11_1_X_2020-03-16-2300.log
CMSSW_11_1_X_2020-03-19-1100.log
CMSSW_11_1_X_2020-03-21-1100
CMSSW_11_1_X_2020-03-30-1100


cmsbuild · 2020-03-20T09:40:50Z

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel · 2020-03-20T12:39:21Z

assign alca, reconstruction

cmsbuild · 2020-03-20T12:39:40Z

New categories assigned: reconstruction,alca

@slava77,@christopheralanwest,@franzoni,@tlampen,@pohsun,@perrotta,@tocheng you have been requested to review this Pull request/Issue and eventually sign? Thanks

perrotta · 2020-03-20T13:54:17Z

In https://cmssdt.cern.ch/lxr/source/RecoLocalTracker/SiPixelRecHits/src/SiPixelTemplateReco.cc#1178 I notice that the value kappa=0.01, which throws in cmssw, wouldn't throw in standalone running of the template (compare with L1174)
This is exactly the value which generates the exception here.

In the literature I find that kappa=0.01 corresponds to the Landau limit (while kappa=10 would correspond to the Gaussian one). Because of that, I would expect the same condition for throwing that is set for the standalone run to be implemented in CMSSW as well, i.e. line 1178 should become

    assert((sigmaQ >= 0.) && (mpv >= 0.) && (kappa >= 0.01) && (kappa <= 10.));

That would be enough to remove the exception observed in the attached log.

Of course, the calculations in CondFormats/SiPixelTransient/src/SiPixelTemplate.cc are far too complicated to follow in detail for this limited evaluation of mine (at least for me).

@cmantill @pmaksim1 @tvami @mmusich @tsusa @OzAmram : could any of you please have a look and confirm whether the proposed quick fix above actually corresponds to what is expected from that algorithm, or whether some different fix should be applied instead?

OzAmram · 2020-03-20T14:26:20Z

The short answer is that I think this fix makes sense, but I am checking with Morris to be sure. I'll let you know when he confirms.

The longer explanation is:
Kappa being = 0.01 exactly can only happen in the extreme case that both of the templates being interpolated between have kappa = 0.01 (which is the limit on the parameter imposed by the fit done in the template making code). Even though this indicates there is probably something not so good about the fit to the charge distributions done for these templates because kappa is right at its boundary, I don't think we want to throw an exception in CMSSW for it (if anything we should try to catch this in the template making). So it makes sense to allow kappa = 0.01 as included in the 'allowed' range and not throw an exception.

OzAmram · 2020-03-20T20:34:39Z

Ok so after a little closer inspection I think what Andrea wrote above is a little backwards. The CMSSW exception is thrown on line 1174, so it is the condition on 1173 that needs to be changed to fix the problem (note the ifndef on 1172). 1178 is what is run in the standalone code.

So it is a little confusing that the exception is being thrown with kappa = 0.01 then, because that should not satisfy the (kappa < 0.01) condition that is being checked for. It might be due to some floating point rounding error or something. So we could try giving it a little buffer, by changing 1173 to be something like:

    if ((sigmaQ <= 0.) || (mpv <= 0.) || (kappa < 0.0099) || (kappa > 10.001))

We can also make a similar change to 1178 to keep the two checks consistent.

(Also I talked to Morris and he OK'ed this change)
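For readers following the line numbers, the structure under discussion can be sketched roughly as below. This is an illustration only, not the actual SiPixelTemplateReco.cc code: `vavilovParsOutOfRange` and `checkVavilovPars` are hypothetical names, `std::runtime_error` stands in for `cms::Exception`, and the widened bounds are simply the ones suggested in this comment.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a CMSSW function): true when the Vavilov
// parameters fall outside the slightly widened range proposed above.
inline bool vavilovParsOutOfRange(double mpv, double sigmaQ, double kappa) {
  return (sigmaQ <= 0.) || (mpv <= 0.) || (kappa < 0.0099) || (kappa > 10.001);
}

// Hypothetical wrapper showing which branch is compiled in which build.
inline void checkVavilovPars(double mpv, double sigmaQ, double kappa) {
#ifndef SI_PIXEL_TEMPLATE_STANDALONE
  // Per the thread, this is the branch compiled inside CMSSW: it throws.
  if (vavilovParsOutOfRange(mpv, sigmaQ, kappa)) {
    throw std::runtime_error("Vavilov parameters mpv/sigmaQ/kappa = " + std::to_string(mpv) + "/" +
                             std::to_string(sigmaQ) + "/" + std::to_string(kappa));
  }
#else
  // Standalone builds define the macro and rely on an assert instead.
  assert(!vavilovParsOutOfRange(mpv, sigmaQ, kappa));
#endif
}
```

With these bounds a kappa sitting a hair below 0.01 passes the check, while clearly unphysical values are still rejected.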

perrotta · 2020-03-21T07:43:44Z

> Ok so after a little closer inspection I think what Andrea wrote above is a little backwards. The CMSSW exception is thrown on line 1174, so it is the condition on 1173 that needs to be changed to fix the problem (note the ifndef on 1172). 1178 is what is run in the standalone code.
>
> So it is a little confusing that the exception is being thrown with kappa = 0.01 then, because that should not satisfy the (kappa < 0.01) condition that is being checked for. It might be due to some floating point rounding error or something. So we could try giving it a little buffer, by changing 1173 to be something like:
>
>     if ((sigmaQ <= 0.) || (mpv <= 0.) || (kappa < 0.0099) || (kappa > 10.001))
>
> We can also make a similar change to 1178 to keep consistent.

Lines 1173 to 1176 are accessed only if #ifndef SI_PIXEL_TEMPLATE_STANDALONE
CMSSW accesses line 1178 instead

Of course, one could certainly relax both cases (CMSSW and STANDALONE) to take rounding errors into account. But the error reported in the log that started this issue wouldn't have been triggered in STANDALONE mode, while it did in CMSSW.

perrotta · 2020-03-21T08:24:28Z

Sorry @OzAmram : yes, you are right. In my post above I confused #ifdef and #ifndef

Yes: rounding approximations must be taken into account when comparing non-integer numbers.

Instead of writing values as such, I'd rather define a const value epsilon (small) to be added to the actual bounds in the evaluation.
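A minimal sketch of the epsilon idea, under stated assumptions: the constant names and the value `1e-6` are chosen here purely for illustration and are not taken from the thread or from the merged fix.

```cpp
#include <cassert>

// Assumed names and values, for illustration only.
constexpr double kKappaMin = 0.01;  // lower bound quoted in the thread
constexpr double kKappaMax = 10.;   // upper bound quoted in the thread
constexpr double kEpsilon = 1e-6;   // hypothetical tolerance

// True when kappa lies within the allowed range, widened by kEpsilon on
// both sides so that values sitting on a bound are not rejected by rounding.
inline bool kappaInRange(double kappa) {
  return (kappa >= kKappaMin - kEpsilon) && (kappa <= kKappaMax + kEpsilon);
}
```

Keeping the tolerance in one named constant avoids scattering hand-adjusted literals such as 0.0099 and 10.001 through the checks.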

OzAmram · 2020-03-21T14:08:33Z

Ok that sounds good to me.

slava77 · 2020-03-26T22:32:35Z

> So it is a little confusing that the exception is being thrown with kappa = 0.01 then, because that should not satisfy the (kappa < 0.01) condition that is being checked for.

The assert will happen for kappa <= 0.01, which is apparently a bug based on the description above. It sounds to me that the assert condition should be changed to kappa >= 0.01, and perhaps similarly for other variables where an exact value at the limit is still physically acceptable.

(In some other code in CMSSW we had problems in the past where binning was decided by strict inequalities, so a floating-point value falling exactly on a bin boundary was missed, which led to problems. The appropriate solution is simply to do the logically right thing and allow the equal value.)

OzAmram · 2020-03-27T18:39:05Z

@slava77 The assert on line 1178 is only executed in standalone mode. The CMSSW exception is being thrown based on line 1173 (Andrea confused them above). The CMSSW condition checks if kappa < 0.01 so kappa of 0.01 should not trigger it (if there were no rounding errors).

In either case I think Andrea's idea of using an epsilon should work. Do you want me to test it and make PR?

slava77 · 2020-03-27T19:59:27Z

> @slava77 The assert on line 1178 is only executed in standalone mode.

I thought I read it right, but then apparently I did it backwards as well

> The CMSSW condition checks if kappa < 0.01 so kappa of 0.01 should not trigger it (if there were no rounding errors).

Looking at the code more carefully now, all values are double precision where the check is made.
However, it looks like all the math in vavilov_pars is in float.
0.01 in float is close to 9.99999977648258209228515625E-3, while in double it's 1.00000000000000002081668171172E-2, according to https://www.binaryconvert.com/result_float.html?decimal=048046048049
So, this likely explains the edge effects.
If the limits need to be a bit strict, using 0.01f (also, using "f" for all other values) may be more appropriate in this case than introducing a generic epsilon.

Was there a good motivation to switch between float and double here?
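The edge effect described above can be reproduced in isolation. The two helpers below are hypothetical and only mimic the comparison. They assume kappa arrives as a `float` and is compared either against the `double` literal `0.01` or against the `float` literal `0.01f`.

```cpp
#include <cassert>

// Hypothetical helper: kappa is promoted to double before the comparison,
// so a float holding 0.01f (about 0.0099999998) compares below the double 0.01.
inline bool belowDoubleBound(float kappa) {
  return kappa < 0.01;
}

// Hypothetical helper: both sides are float, so 0.01f is not below the bound.
inline bool belowFloatBound(float kappa) {
  return kappa < 0.01f;
}
```

Under these assumptions `belowDoubleBound(0.01f)` is true while `belowFloatBound(0.01f)` is false, which matches an exception message that prints kappa as 0.01 even though the check is `kappa < 0.01`.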

OzAmram · 2020-03-27T20:36:47Z

Ah ok that makes sense. The SiPixelTemplate object stores the kappa and other vavilov parameters as floats (which are what is used in the interpolation), so we would need to change that rather than just the computation in vavilov_pars in order to get the 'double' value of 0.01.

Indeed, using the float value could lead to an error down the line, because the ROOT function that kappa is passed to performs the same kappa < 0.01 check (see https://github.com/root-project/root/blob/331efa4c00fefc38980eaaf7b41b8e95fcd1a23b/math/mathcore/src/TMath.cxx#L2941-L2943).

slava77 · 2020-03-27T21:16:03Z

Actually, my proposal to use 0.01f in the check isn't that much better without changing the precision of kappa itself.

I see that the only places where kappa is used are:

VVIObjF, where it enters as a float: just applying kappa < 0.01f in the value check would work.
TMath::VavilovI, where it enters as a double: this would need an acceptable reset of the value to the double 0.01. How about also adding if (kappa < 0.01) kappa = 0.01; before calling this double-precision function?

But then, perhaps bumping it up to 0.010000001 in vavilov_pars if the float value is ==0.01f (both double and single precision above 0.01) is acceptable
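The reset proposed for the double-precision call can be sketched as follows. `kappaForDoubleCall` is a hypothetical helper, not part of CMSSW or ROOT, and it assumes the range check on the float value has already passed, so the only values it lifts are those a rounding error below the bound.

```cpp
#include <cassert>

// Hypothetical helper: promote the float kappa to double and lift it to the
// double-precision lower bound, so a later "kappa < 0.01" check in
// double-precision code no longer rejects a value that was 0.01f.
inline double kappaForDoubleCall(float kappaF) {
  double kappa = kappaF;  // 0.01f becomes roughly 0.0099999998 here
  if (kappa < 0.01) {
    kappa = 0.01;
  }
  return kappa;
}
```

Values already above the bound are passed through unchanged, so the adjustment only touches the boundary case.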

OzAmram · 2020-03-31T21:00:33Z

Does someone have a recipe to reproduce the error, so I can check whether the above fix does indeed fix it? When I ran runTheMatrix.py -l 13034.0 in a fresh release of CMSSW_11_1_0_pre5 I had no errors.

srimanob · 2020-04-02T08:41:22Z

@OzAmram

It happens rarely. I can reproduce the issue after several jobs. The RAW is kept, so you should be able to reproduce the issue. Please try the RAW from
file:/eos/cms/store/user/srimanob/relvals/13034/DIGI_17.root

cmsDriver (same as wf 13034):
cmsDriver.py step3 --conditions auto:phase1_2024_realistic --pileup_input 'das:/RelValMinBias_14TeV/CMSSW_10_6_1-106X_mcRun3_2021_realistic_v1_rsb-v1/GEN-SIM' -n -1 --era Run3 --eventcontent RECOSIM,MINIAODSIM,DQM --runUnscheduled -s RAW2DIGI,L1Reco,RECO,RECOSIM,EI,PAT,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM --datatier GEN-SIM-RECO,MINIAODSIM,DQMIO --pileup AVE_35_BX_25ns --geometry DB:Extended --python reco.py --no_exec --filein 'file:/eos/cms/store/user/srimanob/relvals/13034/DIGI_17.root' --fileout file:step3.root --nThreads 8

Issue can be found in Event run: 1 lumi: 1 event: 59

----- Begin Fatal Exception 02-Apr-2020 10:34:23 CEST-----------------------
An exception of category 'DataCorrupt' occurred while
[0] Processing Event run: 1 lumi: 1 event: 59 stream: 4
[1] Prefetching for module MonitorTrackResiduals/'MonitorTrackResiduals'
[2] Calling method for module TrackRefitter/'refittedForPixelDQM'
Exception Message:
SiPixelTemplateReco::Vavilov parameters mpv/sigmaQ/kappa = 7280.9/691.7/0.01
----- End Fatal Exception -------------------------------------------------

HTH.

OzAmram · 2020-04-06T20:46:58Z

Thanks @srimanob! That worked for reproducing the error. I made a small PR with the fix. #29399

perrotta · 2020-04-24T07:36:04Z

+1

vavilov parameters kappa fix #29399 is now merged, and apparently it was effective in removing the error

christopheralanwest · 2020-04-24T13:12:12Z

+1

cmsbuild · 2020-04-24T13:12:32Z

This issue is fully signed and ready to be closed.

cmsbuild added the pending-assignment label Mar 20, 2020

cmsbuild added alca-pending pending-signatures reconstruction-pending and removed pending-assignment labels Mar 20, 2020

OzAmram mentioned this issue Apr 6, 2020

vavilov parameters kappa fix #29399

Merged

cmsbuild added reconstruction-approved and removed reconstruction-pending labels Apr 24, 2020

cmsbuild removed alca-pending pending-signatures labels Apr 24, 2020

cmsbuild added alca-approved fully-signed labels Apr 24, 2020

silviodonato closed this as completed Apr 30, 2020
