
DeepTauId module throws exception returning nan #28358

Closed
peruzzim opened this issue Nov 7, 2019 · 26 comments


@peruzzim
Contributor

peruzzim commented Nov 7, 2019

The DeepTauId producer has thrown this exception:

throw cms::Exception("DeepTauId")
<< "invalid prediction = " << pred << " for tau_index = " << tau_index << ", pred_index = " << k;

in a NanoAODv6 production workflow:
invalid prediction = -nan for tau_index = 0, pred_index = 0

Please see https://its.cern.ch/jira/browse/CMSCOMPPR-10135 for details.
@rmanzoni @roger-wolf can you please follow up?

@pgunnell

@cmsbuild
Contributor

cmsbuild commented Nov 7, 2019

A new Issue was created by @peruzzim .

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@peruzzim
Contributor Author

peruzzim commented Nov 7, 2019

assign reconstruction

Note that this module now runs by default in MiniAOD; it was simply exposed by the NanoAOD production campaign.

@cmsbuild
Contributor

cmsbuild commented Nov 7, 2019

New categories assigned: reconstruction

@slava77, @perrotta you have been requested to review this Pull request/Issue and eventually sign. Thanks

@slava77
Contributor

slava77 commented Nov 7, 2019

IMHO, there should be no throw; we are generally not throwing on FPEs.
The data should be flagged appropriately (LogWarning or LogError depending on perceived importance) and the job should continue.

@peruzzim
Contributor Author

peruzzim commented Nov 7, 2019

A small request in case you touch this: it would be preferable to avoid having NaN in Mini/NanoAOD for people using this in trainings or other selections. If you deem it OK, it would be better to store a background-like value such as -1 when there is no value to put.

@rmanzoni
Contributor

rmanzoni commented Nov 7, 2019

@mbluj @swozniewski @kandrosov
please have a look at this issue.

I don't see anything wrong with @peruzzim's proposal to return -1 rather than NaN.
And turning this into a LogWarning (certainly not an Error imo) as @slava77 suggests wouldn't be bad either.

@rmanzoni
Contributor

rmanzoni commented Nov 7, 2019

There are a few more throws in DeepTauId.cc; in principle, I think they could use the same treatment.

@kandrosov
Contributor

The DeepTau implementation has protections against having NaN as the output; it should never happen. For that reason, the exception is thrown when the check fails. If it does happen, I would consider it a strong indication of some significant bug, and in the presence of such a bug I would not trust even the results where the output is not NaN. In the private productions done by the tau POG and other analysis groups, this exception was never raised... With which CMSSW release are you seeing this error? Could it be related to PR #28128?

@peruzzim
Contributor Author

peruzzim commented Nov 7, 2019

@kandrosov please check the link above for all the information on the release, the cmsRun command, the input file, etc.

@kandrosov
Contributor

@peruzzim Thanks! Hm... I have not succeeded in reproducing the error. I ran the following commands:

cmsrel CMSSW_10_2_18
cd CMSSW_10_2_18/src
cmsenv
wget https://cms-unified.web.cern.ch/cms-unified//joblogs/pgunnell_Run2017C-31Mar2018-v1-HighMultiplicityEOF3-Nano25Oct2019_10218_191031_170638_7952/8001/DataProcessing/e7606528-7e28-4359-a2aa-a4cb5c6b546d-46-0-logArchive/job/WMTaskSpace/cmsRun1/PSet.py
wget https://cms-unified.web.cern.ch/cms-unified//joblogs/pgunnell_Run2017C-31Mar2018-v1-HighMultiplicityEOF3-Nano25Oct2019_10218_191031_170638_7952/8001/DataProcessing/e7606528-7e28-4359-a2aa-a4cb5c6b546d-46-0-logArchive/job/WMTaskSpace/cmsRun1/PSet.pkl
cmsRun  -j FrameworkJobReport.xml PSet.py

Could you please confirm that these steps should lead to the same output as the crashed job [1]?
If so, the non-reproducibility could point to some problem with the setup on the machine where the job ran, or to some thread-safety issue in the DeepTau producer.

[1] http://cern.ch/go/7grD

@peruzzim
Contributor Author

peruzzim commented Nov 13, 2019

@kandrosov can you please follow up with PdmV in the JIRA ticket mentioned above?

@pgunnell
Contributor

Dear all, I confirm that this might be due to a site problem (MIT) in production. We have asked computing to resubmit the failing jobs. Thanks a lot for all your effort; there was no hint that this error could come from a site issue rather than from the code.

@kandrosov @peruzzim

@davidlange6
Contributor

davidlange6 commented Nov 14, 2019 via email

@pgunnell
Contributor

It's a guess, but we had 5 workflow failures in computing, all running at MIT. The error is the one you see in the JIRA ticket (https://its.cern.ch/jira/browse/CMSCOMPPR-10135). I suspect it was a glitch in reading the DeepTau input file, but I am not sure how to check that.

@srimanob
Contributor

Maybe this issue can be closed, as the conclusion is that it was a site problem; it can be re-opened if needed.

@perrotta
Contributor

Were the failing jobs resubmitted, as requested/anticipated in #28358 (comment)?
Did the re-submissions complete without errors?

@peruzzim
Contributor Author

peruzzim commented Dec 9, 2019

@pgunnell do we have news about the resubmitted jobs? It would be good to be sure that it was indeed not a code problem, in view of future productions.

@pgunnell
Contributor

pgunnell commented Dec 9, 2019

I confirm, @peruzzim. The resubmitted jobs ran with no problem.

@peruzzim
Contributor Author

Thanks, then I think this can be closed.

@slava77
Contributor

slava77 commented Sep 5, 2020

+1

Based on the earlier comments, this issue was apparently transient, and no problems were found with the code logic.

@cmsbuild
Contributor

cmsbuild commented Sep 5, 2020

This issue is fully signed and ready to be closed.

@qliphy qliphy closed this as completed Sep 5, 2020
@VinInn
Contributor

VinInn commented Feb 9, 2023

It would be useful to have full details of the hardware where the error occurred.
If there is a real hardware issue, and nothing specifically related to the code used by DeepTau, it could affect many other parts of any kind of job running there.

We may have to develop and deploy some sort of ad-hoc test, or simply ask sites to regularly run memory checks.

@VinInn
Contributor

VinInn commented Feb 9, 2023

I am not at all convinced that there is proven evidence of a hardware failure.
I think we need to run controlled experiments on the machines where the problem occurs and find the cause (not just the symptoms).
Most of the time it is NOT memory corruption: it is just reading uninitialized memory
(I mean, why always NaN and never a corrupted pointer?)
I suggest:
try cmsRunTC and cmsRunGlibc
run under valgrind

@kandrosov
Contributor

I agree that there is no evidence so far of whether the memory corruption occurs due to a software or a hardware issue. We have not yet succeeded in reproducing it in a controlled environment that would allow detailed debugging.

Most of the time it is NOT memory corruption: it is just reading uninitialized memory

I'm not sure that reading uninitialized memory is more plausible than memory corruption. In the cases that I am aware of, this condition occurs only after some running period, i.e. for the first events the DeepTau score is computed correctly. All inputs are screened against "unexpected NaNs". So we have an NN evaluation on finite inputs that starts to give NaNs after some "breaking point".

@VinInn
Copy link
Contributor

VinInn commented Feb 10, 2023

I have a logfile where it happens on the first event of a thread.

@VinInn
Contributor

VinInn commented Feb 11, 2023

Cross-posting:
My understanding is that TensorFlow is used in DeepTauId.
TensorFlow is a fat library: it uses different code on different hardware. I wonder whether there has been any test comparing its results on an SSE-only machine, AVX, AVX2, and AVX-512.
If yes, which SSE-only machine was used?
