Error in IB test wf 13034.0 #29251

Closed
silviodonato opened this issue Mar 20, 2020 · 20 comments

Comments

@silviodonato

silviodonato commented Mar 20, 2020

As discussed during the last Core Software meeting, we are occasionally getting an error in 13034.0:

%MSG-w SiPixelPhase1TrackClusters:   SiPixelPhase1TrackClusters:hltSiPixelPhase1TrackClustersAnalyzer  19-Mar-2020 17:46:40 CET Run: 1 Event: 1201
PixelClusterShapeCache collection is not valid
%MSG
19-Mar-2020 17:46:40 CET  Closed file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_1/RelValMinBias_14TeV/GEN-SIM/106X_mcRun3_2021_realistic_v1_rsb-v1/10000/97553220-ED91-4E48-B59F-ED44053F8621.root
19-Mar-2020 17:46:40 CET  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/relval/CMSSW_10_6_1/RelValMinBias_14TeV/GEN-SIM/106X_mcRun3_2021_realistic_v1_rsb-v1/10000/70672451-102A-8243-ACF0-C4F545C049AB.root
%MSG-e TrackProducer:  TrackProducer:initialStepTracksPreSplitting  19-Mar-2020 17:46:41 CET Run: 1 Event: 1211
cms::Exception caught during theAlgo.runWithCandidate.
An exception of category 'DataCorrupt' occurred.
Exception Message:
SiPixelTemplateReco::Vavilov parameters mpv/sigmaQ/kappa = 7280.9/691.7/0.01


%MSG
[...]
%MSG-e SiPixelTemplateReco:  TrackRefitter:refittedForPixelDQM  19-Mar-2020 17:46:46 CET Run: 1 Event: 1205
illegal chi2xmin normalization (1) = -0.0747717 
%MSG
%MSG-e SiPixelTemplateReco:  TrackRefitter:refittedForPixelDQM  19-Mar-2020 17:46:46 CET Run: 1 Event: 1205
illegal chi2xmin normalization (1) = -0.0672276 
%MSG
%MSG-e SiPixelTemplateReco:  TrackRefitter:refittedForPixelDQM  19-Mar-2020 17:46:46 CET Run: 1 Event: 1205
illegal chi2xmin normalization (1) = -0.0316581 
%MSG
----- Begin Fatal Exception 19-Mar-2020 17:46:46 CET-----------------------
An exception of category 'DataCorrupt' occurred while
   [0] Processing  Event run: 1 lumi: 13 event: 1211 stream: 1
   [1] Running path 'Flag_METFilters'
   [2] Prefetching for module BooleanFlagFilter/'HBHENoiseFilter'
   [3] Prefetching for module HBHENoiseFilterResultProducer/'HBHENoiseFilterResultProducer'
   [4] Prefetching for module HcalNoiseInfoProducer/'hcalnoise'
   [5] Prefetching for module FastjetJetProducer/'ak4PFJets'
   [6] Prefetching for module PFLinker/'particleFlow'
   [7] Prefetching for module PFProducer/'particleFlowTmp'
   [8] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [9] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [10] Prefetching for module GsfTrackProducer/'electronGsfTracks'
   [11] Prefetching for module MeasurementTrackerEventProducer/'MeasurementTrackerEvent'
   [12] Prefetching for module JetCoreClusterSplitter/'siPixelClusters'
   [13] Prefetching for module PrimaryVertexProducer/'firstStepPrimaryVerticesPreSplitting'
   [14] Calling method for module TrackProducer/'initialStepTracksPreSplitting'
Exception Message:
SiPixelTemplateReco::Vavilov parameters mpv/sigmaQ/kappa = 7280.9/691.7/0.01
----- End Fatal Exception -------------------------------------------------

These are the log files of the latest IB tests with the error

CMSSW_11_1_X_2020-03-13-2300.log
CMSSW_11_1_X_2020-03-14-1100.log
CMSSW_11_1_X_2020-03-16-2300.log
CMSSW_11_1_X_2020-03-19-1100.log
CMSSW_11_1_X_2020-03-21-1100
CMSSW_11_1_X_2020-03-30-1100

@cmsbuild

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign alca, reconstruction

@cmsbuild

New categories assigned: reconstruction,alca

@slava77,@christopheralanwest,@franzoni,@tlampen,@pohsun,@perrotta,@tocheng you have been requested to review this Pull request/Issue and eventually sign? Thanks

@perrotta

perrotta commented Mar 20, 2020

In https://cmssdt.cern.ch/lxr/source/RecoLocalTracker/SiPixelRecHits/src/SiPixelTemplateReco.cc#1178 I notice that the value kappa=0.01, which throws in cmssw, wouldn't throw in standalone running of the template (compare with L1174)
This is exactly the value which generates the exception here.

In the literature I find that kappa=0.01 corresponds to the Landau limit (while kappa=10 would correspond to the Gaussian one). Because of that, I would expect that the same requirement for throwing that is set for the standalone run should also be implemented in cmssw, i.e. line 1178 should become

    assert((sigmaQ >= 0.) && (mpv >= 0.) && (kappa >= 0.01) && (kappa <= 10.));

That would be enough to remove the exception observed in the attached log.

Of course, the calculations in CondFormats/SiPixelTransient/src/SiPixelTemplate.cc are far too complicated to be followed in detail for this limited evaluation of mine (at least for me).

@cmantill @pmaksim1 @tvami @mmusich @tsusa @OzAmram : could any of you please have a look and confirm whether the proposed quick fix above actually corresponds to what is expected from that algo, or whether some different fix should be applied instead?

@OzAmram

OzAmram commented Mar 20, 2020

The short answer is that I think this fix makes sense, but I am checking with Morris to be sure. I'll let you know when he confirms.

The longer explanation is:
Kappa being = 0.01 exactly can only happen in the extreme case that both of the templates being interpolated between have kappa = 0.01 (which is the limit on the parameter imposed by the fit done in the template making code). Even though this indicates there is probably something not so good about the fit to the charge distributions done for these templates because kappa is right at its boundary, I don't think we want to throw an exception in CMSSW for it (if anything we should try to catch this in the template making). So it makes sense to allow kappa = 0.01 as included in the 'allowed' range and not throw an exception.

@OzAmram

OzAmram commented Mar 20, 2020

Ok so after a little closer inspection I think what Andrea wrote above is a little backwards. The CMSSW exception is thrown on line 1174, so it is the condition on 1173 that needs to be changed to fix the problem (note the ifndef on 1172). 1178 is what is run in the standalone code.

So it is a little confusing that the exception is being thrown with kappa = 0.01 then, because that should not satisfy the (kappa < 0.01) condition that is being checked for. It might be due to some floating point rounding error or something. So we could try giving it a little buffer, by changing 1173 to be something like:

if ((sigmaQ <= 0.) || (mpv <= 0.) || (kappa < 0.0099) || (kappa > 10.001))

We can also make a similar change to 1178 to keep consistent.

(Also I talked to Morris and he OK'ed this change)
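
For readers without the source open, the layout described in this comment can be sketched roughly as follows. This is a simplified illustration and not the actual `SiPixelTemplateReco.cc` code: the function name `checkVavilovPars` is hypothetical, `std::runtime_error` stands in for `cms::Exception`, and the upper bound of 10 is assumed from the conditions quoted in this thread.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>

// Hypothetical stand-in for the range check discussed in this thread.
// Building with -DSI_PIXEL_TEMPLATE_STANDALONE selects the assert branch.
void checkVavilovPars(double mpv, double sigmaQ, double kappa) {
#ifndef SI_PIXEL_TEMPLATE_STANDALONE
  // CMSSW-like branch: out-of-range parameters raise an exception.
  if ((sigmaQ <= 0.) || (mpv <= 0.) || (kappa < 0.01) || (kappa > 10.)) {
    std::ostringstream msg;
    msg << "Vavilov parameters mpv/sigmaQ/kappa = " << mpv << "/" << sigmaQ << "/" << kappa;
    throw std::runtime_error(msg.str());  // stand-in for a 'DataCorrupt' cms::Exception
  }
#else
  // Standalone branch: the same kind of range check expressed as an assert.
  assert((sigmaQ > 0.) && (mpv > 0.) && (kappa >= 0.01) && (kappa <= 10.));
#endif
}
```

With the macro undefined, a call such as `checkVavilovPars(7280.9, 691.7, 0.005)` throws, while a `kappa` of exactly `0.01` in double precision passes the `kappa < 0.01` test.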

@perrotta

> Ok so after a little closer inspection I think what Andrea wrote above is a little backwards. The CMSSW exception is thrown on line 1174, so it is the condition on 1173 that needs to be changed to fix the problem (note the ifndef on 1172). 1178 is what is run in the standalone code.
>
> So it is a little confusing that the exception is being thrown with kappa = 0.01 then, because that should not satisfy the (kappa < 0.01) condition that is being checked for. It might be due to some floating point rounding error or something. So we could try giving it a little buffer, by changing 1173 to be something like:
>
> if ((sigmaQ <= 0.) || (mpv <= 0.) || (kappa < 0.0099) || (kappa > 10.001))
>
> We can also make a similar change to 1178 to keep consistent.

Lines 1173 to 1176 are accessed only if #ifndef SI_PIXEL_TEMPLATE_STANDALONE
CMSSW accesses line 1178 instead

Of course, one could certainly relax both cases (CMSSW and STANDALONE) to take rounding errors into account. But the error reported in the log that started this issue wouldn't have been triggered in STANDALONE mode, while it did in CMSSW.

@perrotta

Sorry @OzAmram : yes, you are right. In my post above I confused #ifdef and #ifndef

Yes: rounding approximations must be taken into account whenever non-integer numbers are compared.

Instead of hard-coding adjusted values, I'd rather define a small const epsilon to be applied to the actual bounds in the evaluation.
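
A minimal sketch of that epsilon idea, under stated assumptions: the names `kKappaMin`, `kKappaMax`, `kEpsilon` and `vavilovParsValid` are hypothetical, the bounds 0.01 and 10 are the ones quoted in this thread, and the tolerance value is picked only for illustration.

```cpp
// Hypothetical constants, not taken from the CMSSW sources.
constexpr double kKappaMin = 0.01;  // lower bound quoted in this thread (Landau limit)
constexpr double kKappaMax = 10.0;  // upper bound quoted in this thread (Gaussian limit)
constexpr double kEpsilon = 1e-6;   // assumed tolerance, chosen only for illustration

// Returns true when the parameters are acceptable, allowing values that sit
// on a kappa boundary up to the tolerance.
bool vavilovParsValid(double mpv, double sigmaQ, double kappa) {
  return (sigmaQ > 0.) && (mpv > 0.) &&
         (kappa >= kKappaMin - kEpsilon) && (kappa <= kKappaMax + kEpsilon);
}
```

Keeping the nominal bounds and the tolerance as separate named constants makes it clear which numbers are physical limits and which are only rounding slack.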

@OzAmram

OzAmram commented Mar 21, 2020

Ok that sounds good to me.

@slava77

slava77 commented Mar 26, 2020

> So it is a little confusing that the exception is being thrown with kappa = 0.01 then, because that should not satisfy the (kappa < 0.01) condition that is being checked for.

the assert will happen for kappa <= 0.01, which is apparently a bug based on the description above. It sounds to me that the assert condition should be changed to kappa >= 0.01 and perhaps similarly for other variables where an exact value at the limit is still physically acceptable.

(in some other code in CMSSW we had problems in the past where binning was decided by exact inequalities and missing a floating point value falling at the bin boundary, which would lead to problems. The appropriate solution is to simply do logically the right thing and allow equal value)

@OzAmram

OzAmram commented Mar 27, 2020

@slava77 The assert on line 1178 is only executed in standalone mode. The CMSSW exception is being thrown based on line 1173 (Andrea confused them above). The CMSSW condition checks if kappa < 0.01 so kappa of 0.01 should not trigger it (if there were no rounding errors).

In either case I think Andrea's idea of using an epsilon should work. Do you want me to test it and make PR?

@slava77

slava77 commented Mar 27, 2020

> @slava77 The assert on line 1178 is only executed in standalone mode.

I thought I read it right, but then apparently I did it backwards as well

> The CMSSW condition checks if kappa < 0.01 so kappa of 0.01 should not trigger it (if there were no rounding errors).

Looking at the code more carefully now, all values are double precision where the check is made.
However, it looks like all the math in vavilov_pars is in float.
0.01 in float is close to 9.99999977648258209228515625E-3, while in double it's 1.00000000000000002081668171172E-2 according to https://www.binaryconvert.com/result_float.html?decimal=048046048049.
So, this likely explains the edge effects.
If the limits need to be a bit strict, using 0.01f (also, using "f" for all other values) may be more appropriate in this case than introducing a generic epsilon.

Was there a good motivation to switch between float and double here?
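
The edge effect described above can be checked with a small standalone sketch. The function names are hypothetical and nothing here is taken from the CMSSW sources. It only shows that the float nearest to 0.01, once promoted to double, compares below the double literal `0.01`.

```cpp
// Hypothetical helpers illustrating the float/double mismatch at kappa = 0.01.

// A float holding 0.01f is promoted to double before the comparison; the
// promoted value is slightly below the double literal 0.01, so a
// double-precision "kappa < 0.01" check fires.
bool floatKappaFailsDoubleCheck() {
  float kappa = 0.01f;
  return static_cast<double>(kappa) < 0.01;
}

// Comparing against the float literal keeps both sides in single precision,
// so a value stored as 0.01f is not below the bound.
bool floatKappaPassesFloatCheck() {
  float kappa = 0.01f;
  return !(kappa < 0.01f);
}
```

Both functions return `true`, which is consistent with the exception message printing `kappa` as `0.01` while the `kappa < 0.01` condition is still satisfied.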

@OzAmram

OzAmram commented Mar 27, 2020

Ah ok that makes sense. The SiPixelTemplate object stores the kappa and other vavilov parameters as floats (which are what is used in the interpolation), so we would need to change that rather than just the computation in vavilov_pars in order to get the 'double' value of 0.01.

Indeed using the float value could lead to an error down the line based on the ROOT function kappa is being used for which does the same check of kappa < 0.01 (see https://github.com/root-project/root/blob/331efa4c00fefc38980eaaf7b41b8e95fcd1a23b/math/mathcore/src/TMath.cxx#L2941-L2943).

@slava77

slava77 commented Mar 27, 2020

Actually, my proposal to use 0.01f in the check isn't that much better without changing the precision of kappa itself.

I see that the only places where kappa is used are:

  • VVIObjF, where it enters as a float : just applying kappa < 0.01f in the value check would work
  • TMath::VavilovI, where it enters as a double: this would need an acceptable reset of the value to 0.01 double. How about also adding if (kappa < 0.01) kappa = 0.01; before calling this double-precision function?

But then, perhaps bumping it up to 0.010000001 in vavilov_pars if the float value is ==0.01f (both double and single precision above 0.01) is acceptable
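
The "reset to 0.01 double" option from the list above could look roughly like the sketch below. The helper name `kappaForDoubleCall` is hypothetical, and this is only an illustration of the idea floated in this comment, not necessarily the fix that was eventually merged.

```cpp
#include <algorithm>

// Hypothetical helper: promote a single-precision kappa to double and clamp
// it to the double-precision lower bound before it is handed to a
// double-precision function that applies its own "kappa < 0.01" check.
double kappaForDoubleCall(float kappa) {
  constexpr double kKappaMin = 0.01;  // lower bound quoted in this thread
  return std::max(static_cast<double>(kappa), kKappaMin);
}
```

A value stored as `0.01f` comes out as exactly `0.01` in double precision, while larger values are passed through unchanged.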

@OzAmram

OzAmram commented Mar 31, 2020

Does someone have a recipe to reproduce the error so I can check if the above fix does indeed fix it? When I ran runTheMatrix.py -l 13034.0 in a fresh release of CMSSW_11_1_0_pre5 I had no errors

@srimanob

srimanob commented Apr 2, 2020

@OzAmram

It happens only rarely. I can reproduce the issue after several jobs. RAW is kept, so you should be able to reproduce it. Please try the raw file from
file:/eos/cms/store/user/srimanob/relvals/13034/DIGI_17.root

cmsDriver (same as wf 13034):
cmsDriver.py step3 --conditions auto:phase1_2024_realistic --pileup_input 'das:/RelValMinBias_14TeV/CMSSW_10_6_1-106X_mcRun3_2021_realistic_v1_rsb-v1/GEN-SIM' -n -1 --era Run3 --eventcontent RECOSIM,MINIAODSIM,DQM --runUnscheduled -s RAW2DIGI,L1Reco,RECO,RECOSIM,EI,PAT,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM --datatier GEN-SIM-RECO,MINIAODSIM,DQMIO --pileup AVE_35_BX_25ns --geometry DB:Extended --python reco.py --no_exec --filein 'file:/eos/cms/store/user/srimanob/relvals/13034/DIGI_17.root' --fileout file:step3.root --nThreads 8

Issue can be found in Event run: 1 lumi: 1 event: 59

----- Begin Fatal Exception 02-Apr-2020 10:34:23 CEST-----------------------
An exception of category 'DataCorrupt' occurred while
[0] Processing Event run: 1 lumi: 1 event: 59 stream: 4
[1] Prefetching for module MonitorTrackResiduals/'MonitorTrackResiduals'
[2] Calling method for module TrackRefitter/'refittedForPixelDQM'
Exception Message:
SiPixelTemplateReco::Vavilov parameters mpv/sigmaQ/kappa = 7280.9/691.7/0.01
----- End Fatal Exception -------------------------------------------------

HTH.

@OzAmram

OzAmram commented Apr 6, 2020

Thanks @srimanob! That worked for reproducing the error. I made a small PR with the fix. #29399

@perrotta

+1

@christopheralanwest

+1

@cmsbuild

This issue is fully signed and ready to be closed.

8 participants