
DeepTauId module throws exception returning nan #28358

Closed
peruzzim opened this issue Nov 7, 2019 · 26 comments


@peruzzim
Contributor

peruzzim commented Nov 7, 2019

The DeepTauId producer has thrown this exception:

throw cms::Exception("DeepTauId")
<< "invalid prediction = " << pred << " for tau_index = " << tau_index << ", pred_index = " << k;

in a NanoAODv6 production workflow:
invalid prediction = -nan for tau_index = 0, pred_index = 0

Please see https://its.cern.ch/jira/browse/CMSCOMPPR-10135 for details.
@rmanzoni @roger-wolf can you please follow up?

@pgunnell

@cmsbuild
Contributor

cmsbuild commented Nov 7, 2019

A new Issue was created by @peruzzim .

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@peruzzim
Contributor Author

peruzzim commented Nov 7, 2019

assign reconstruction

Note that this module now runs by default in MiniAOD; it was simply exposed by the NanoAOD production campaign.

@cmsbuild
Contributor

cmsbuild commented Nov 7, 2019

New categories assigned: reconstruction

@slava77, @perrotta you have been requested to review this Pull request/Issue and eventually sign. Thanks

@slava77
Contributor

slava77 commented Nov 7, 2019

IMHO, there should be no throw; we are generally not throwing on FPEs.
The data should be flagged appropriately (LogWarning or LogError depending on perceived importance) and the job should continue.

@peruzzim
Contributor Author

peruzzim commented Nov 7, 2019

A small request in case you touch this: it would be preferable to avoid having NaN in Mini/NanoAOD for people using this in trainings or other selections. If you deem it OK, it would be better to store a background-like value such as -1 when there is no value to put.

@rmanzoni
Contributor

rmanzoni commented Nov 7, 2019

@mbluj @swozniewski @kandrosov
please have a look at this issue.

I don't see anything wrong with @peruzzim's proposal to return -1 rather than NaN.
And turning this into a LogWarning (certainly not an Error imo) as @slava77 suggests wouldn't be bad either.

@rmanzoni
Contributor

rmanzoni commented Nov 7, 2019

There are a few more throws in DeepTauId.cc; in principle, I think they could use the same treatment.

@kandrosov
Contributor

The DeepTau implementation has protections against having NaN as the output; it should never happen. For that reason, the exception is thrown when the check fails. If it does happen, I would consider it a strong indication of some significant bug, and in the presence of such a bug I would not trust even the results where the output is not NaN. In the private productions done by the tau POG and other analysis groups, this exception was never raised... With which CMSSW release are you seeing this error? Could it be related to PR #28128?

@peruzzim
Contributor Author

peruzzim commented Nov 7, 2019

@kandrosov please check the link above for all the information on the release, the cmsRun command, the input file, etc.

@kandrosov
Contributor

@peruzzim Thanks! Hm... I have not succeeded in reproducing the error. I ran the following commands:

cmsrel CMSSW_10_2_18
cd CMSSW_10_2_18/src
cmsenv
wget https://cms-unified.web.cern.ch/cms-unified//joblogs/pgunnell_Run2017C-31Mar2018-v1-HighMultiplicityEOF3-Nano25Oct2019_10218_191031_170638_7952/8001/DataProcessing/e7606528-7e28-4359-a2aa-a4cb5c6b546d-46-0-logArchive/job/WMTaskSpace/cmsRun1/PSet.py
wget https://cms-unified.web.cern.ch/cms-unified//joblogs/pgunnell_Run2017C-31Mar2018-v1-HighMultiplicityEOF3-Nano25Oct2019_10218_191031_170638_7952/8001/DataProcessing/e7606528-7e28-4359-a2aa-a4cb5c6b546d-46-0-logArchive/job/WMTaskSpace/cmsRun1/PSet.pkl
cmsRun  -j FrameworkJobReport.xml PSet.py

Could you please confirm that these steps should lead to the same output as the crashed job [1]?
If so, the non-reproducibility could point to some problem with the setup on the machine where the job ran, or to some thread-safety issue in the DeepTau producer.

[1] http://cern.ch/go/7grD

@peruzzim
Contributor Author

peruzzim commented Nov 13, 2019

@kandrosov can you please follow up with PdmV in the JIRA ticket mentioned above?

@pgunnell
Contributor

Dear all, I confirm that this might be due to a site problem (MIT) in production. We have asked computing to resubmit the failing jobs. Thanks a lot for all your effort; there was no hint that this error could come from a site issue rather than from the code.

@kandrosov @peruzzim

@davidlange6
Contributor

davidlange6 commented Nov 14, 2019 via email

@pgunnell
Contributor

It's a guess, but we had 5 workflow failures in computing, all running at MIT. The error is the one you see in the JIRA ticket (https://its.cern.ch/jira/browse/CMSCOMPPR-10135). I suspect it was a glitch in reading the DeepTau input file, but I am not sure how to check that.

@srimanob
Contributor

Maybe this issue can be closed, as the conclusion is that it was a site problem; it can be re-opened if needed.

@perrotta
Contributor

Were the failing jobs resubmitted, as requested/anticipated in #28358 (comment)?
Did the re-submissions complete without errors?

@peruzzim
Contributor Author

peruzzim commented Dec 9, 2019

@pgunnell do we have news about the resubmitted jobs? It would be good to be sure that it was indeed not a code problem, in view of future productions.

@pgunnell
Contributor

pgunnell commented Dec 9, 2019

I confirm, @peruzzim. The resubmitted jobs ran with no problem.

@peruzzim
Contributor Author

Thanks, then I think this can be closed.

@slava77
Contributor

slava77 commented Sep 5, 2020

+1

Based on the earlier comments, this issue was apparently transient, and no problems were found with the code logic.

@cmsbuild
Contributor

cmsbuild commented Sep 5, 2020

This issue is fully signed and ready to be closed.

@qliphy qliphy closed this as completed Sep 5, 2020
@VinInn
Contributor

VinInn commented Feb 9, 2023

It would be useful to have full details of the hardware where the error occurred.
If there is a real hardware issue, and nothing specifically related to the code used by DeepTau, it could affect many other parts of any kind of job running there.

We may have to develop and deploy some sort of ad-hoc test, or simply ask sites to regularly run memory checks.

@VinInn
Contributor

VinInn commented Feb 9, 2023

I am not at all convinced that there is proven evidence of a hardware failure.
I think we need to run controlled experiments on the machines where the problem occurs and find the cause (not just the symptoms).
Most of the time it is NOT memory corruption: it is just reading uninitialized memory
(I mean, why always NaN and never a corrupted pointer?)
I suggest:
try cmsRunTC and cmsRunGlibc
run under valgrind

@kandrosov
Contributor

I agree that there is no evidence so far of whether the memory corruption occurs due to a software or a hardware issue. We have not yet succeeded in reproducing it in a controlled environment that would allow detailed debugging.

Most of the time it is NOT memory corruption: it is just reading uninitialized memory

I'm not sure that reading uninitialized memory is more plausible than memory corruption. In the cases that I am aware of, this condition occurs only after some running period, i.e. for the first events the DeepTau score is computed correctly. All inputs are screened against "unexpected NaNs". So we have an NN evaluation on finite inputs that starts to give NaNs after some "breaking point".

@VinInn
Copy link
Contributor

VinInn commented Feb 10, 2023

I have a logfile where it happens on the first event of a thread.

@VinInn
Contributor

VinInn commented Feb 11, 2023

Cross-posting:
My understanding is that TensorFlow is used in DeepTauId.
TensorFlow is a fat library: it uses different code on different hardware. I wonder whether there has been any test comparing its results on an SSE-only machine, AVX, AVX2, and AVX-512.
If yes, which SSE-only machine was used?
