DeepTauId module throws exception returning nan #28358
Comments
A new Issue was created by @peruzzim . @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction Note this module runs now by default in MiniAOD - it was simply exposed by NanoAOD production campaign. |
IMHO, there should be no throw; we are generally not throwing on FPEs. |
A small request in case you touch this: it would be preferable to avoid having NaN in Mini/NanoAOD, for people using this in trainings or other selections. If you deem it OK, it would be better to store a background-like value such as -1 when there is no value to put. |
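The request above could be sketched roughly as follows. This is only an illustration: `sanitizeScore` and its `fallback` parameter are hypothetical names and are not part of the DeepTauId code.

```cpp
#include <cmath>

// Hypothetical helper, not part of DeepTauId: map a non-finite network
// output (NaN or +/-inf) to a background-like sentinel value, so that no
// NaN would be stored downstream.
inline float sanitizeScore(float pred, float fallback = -1.0f) {
  return std::isfinite(pred) ? pred : fallback;
}
```

A caller would wrap each stored score, e.g. `sanitizeScore(raw)`. Whether silently replacing a NaN is acceptable is exactly what the rest of the thread debates.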
@mbluj @swozniewski @kandrosov I don't see anything wrong with @peruzzim's proposal to return -1 rather than NaN. |
there's a few more |
In the DeepTau implementation there are protections against having NaN as the output. It should never happen. For that reason, an exception is thrown if the check is not passed. If it does happen, I would consider it a strong indication of a significant bug, and in the presence of such a bug I would also not trust the results where the output is not NaN. In the private productions done by the tau POG and other analysis groups, this exception was never raised... With which CMSSW release are you having this error? Could it be related to this PR #28128 ? |
@kandrosov please check the link above for all information on the release, the cmsRun command, the input file etc. |
@peruzzim Thanks! Hm... I have not managed to reproduce the error. I ran the following commands:
cmsrel CMSSW_10_2_18
cd CMSSW_10_2_18/src
cmsenv
wget https://cms-unified.web.cern.ch/cms-unified//joblogs/pgunnell_Run2017C-31Mar2018-v1-HighMultiplicityEOF3-Nano25Oct2019_10218_191031_170638_7952/8001/DataProcessing/e7606528-7e28-4359-a2aa-a4cb5c6b546d-46-0-logArchive/job/WMTaskSpace/cmsRun1/PSet.py
wget https://cms-unified.web.cern.ch/cms-unified//joblogs/pgunnell_Run2017C-31Mar2018-v1-HighMultiplicityEOF3-Nano25Oct2019_10218_191031_170638_7952/8001/DataProcessing/e7606528-7e28-4359-a2aa-a4cb5c6b546d-46-0-logArchive/job/WMTaskSpace/cmsRun1/PSet.pkl
cmsRun -j FrameworkJobReport.xml PSet.py
Could you please confirm that these steps should lead to the same output as the crashed job [1]? |
@kandrosov can you please follow up with PdmV in the JIRA ticket mentioned above? |
Dear all, I confirm that this might be due to a site problem (MIT) in production. We ask computing to resubmit the failing jobs. Thanks a lot for all your effort, there was no hint that this error could come from a site issue and not from the code. |
What sort of “site problem”? Corrupt cvmfs cache? Or is this more of a guess?
|
It's a guess, but we had 5 workflow failures in computing, all running at MIT. The error is what you see in the JIRA ticket (https://its.cern.ch/jira/browse/CMSCOMPPR-10135). I suspect it was a glitch in reading the input file of DeepTau, but I am not sure how to check that. |
Maybe this issue can be closed, as the conclusion points to the site, and reopened if needed. |
Were the failing jobs resubmitted, as requested/anticipated in #28358 (comment)? |
@pgunnell do we have news about the resubmitted jobs? Would be good to be sure that it was indeed not a code problem, in view of future productions. |
I confirm @peruzzim . For the resubmitted jobs, there was no problem. |
thanks, then I think this can be closed |
+1 based on earlier comments, this issue was apparently transient and there were no issues found with the code logic |
This issue is fully signed and ready to be closed. |
It would be useful to have full details of the hardware where the error occurred. We may have to develop and deploy some sort of ad-hoc test, or just ask sites to run memory checks regularly. |
I am not at all convinced that there is proven evidence of hardware failure. |
I agree that there is no evidence so far as to whether the memory corruption is due to a software or a hardware issue. We have not yet succeeded in reproducing it in a controlled environment that would allow detailed debugging.
I'm not sure that reading uninitialized memory is more plausible than memory corruption. In the cases that I am aware of, the condition occurs only after some running period, i.e. for the first events the DeepTau score is computed correctly. All inputs are screened against "unexpected NaNs". So we have a NN evaluation on finite inputs that starts to give NaNs after some "breaking point". |
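As a rough illustration of what screening inputs against unexpected NaNs can look like, the sketch below scans a flat input vector before it is handed to the network. The name `firstNonFinite` and the flat `std::vector<float>` layout are assumptions made for the example and do not describe how DeepTauId actually organises its inputs.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the index of the first non-finite input
// (NaN or +/-inf), or -1 if every value is finite.
inline long firstNonFinite(const std::vector<float>& inputs) {
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    if (!std::isfinite(inputs[i]))
      return static_cast<long>(i);
  }
  return -1;
}
```

If such a scan reports -1 for every event and the output is still NaN, the inputs can be ruled out as the source, which is the situation described in the comment above.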
I have a logfile where it happens in the first event of a thread |
cross posting: |
The DeepTauId producer has thrown this exception:
cmssw/RecoTauTag/RecoTau/plugins/DeepTauId.cc
Lines 1008 to 1009 in 619b1aa
in a NanoAODv6 production workflow:
invalid prediction = -nan for tau_index = 0, pred_index = 0
Please see https://its.cern.ch/jira/browse/CMSCOMPPR-10135 for details.
@rmanzoni @roger-wolf can you please follow up?
@pgunnell