Exception on HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1 Phase-2 workflow #42862

Closed
srimanob opened this issue Sep 25, 2023 · 54 comments

Comments

@srimanob
Contributor

srimanob commented Sep 25, 2023

I just saw a relval failure [1] in [2]. I am not sure whether this is linked to the DeepTauId issue reported before in #40437 and #40733, as the error report looks different. I looked at the 13_1_0_preX relval reports but did not see this issue there, so it is not clear to me what condition triggers this exception.

[1]

An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 1 lumi: 13 event: 648 stream: 0
   [1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1'
   [2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,86] vs. [207]
	 [[{{node inner_egamma_norm_1/FusedBatchNorm_1/Mul}}]]

[2]
https://cmsweb.cern.ch/couchdb/workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/pdmvserv_RVCMSSW_13_3_0_pre2ZpToMM_m6000_14TeV__2026D98noPU_RV213_230913_122713_9881
https://cmsweb.cern.ch/couchdb/workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/pdmvserv_RVCMSSW_13_3_0_pre2TenTau_15_500_Eta3p1__2026D98noPU_RV213_230913_122651_78

@cmsbuild
Contributor

cmsbuild commented Sep 25, 2023

A new Issue was created by @srimanob Phat Srimanobhas.

@Dr15Jones, @rappoccio, @smuzaffar, @makortel, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@srimanob
Contributor Author

FYI @mmusich @rovere @SohamBhattacharya
Not sure if this is a known issue on your side, but I only see it happening in the 13_3 relvals, not in 13_1, which you are focusing on.

@srimanob
Contributor Author

assign upgrade,hlt

@cmsbuild
Contributor

New categories assigned: upgrade,hlt

@AdrianoDee,@mmusich,@missirol,@srimanob,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor

mmusich commented Sep 26, 2023

@srimanob thanks for reporting.

Fwiw, the path HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1 was introduced in master (CMSSW_13_3_X) in PR #42562 and backported in CMSSW_13_1_X in PR #42649.

@hsert FYI

@swagata87
Contributor

Shouldn't one use year=2026 instead of 2017?

and also deepTau_2026v2p5_core.pb etc files in here?

graph_file = cms.vstring( 'core:RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2017v2p6_e6_core.pb',
'inner:RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2017v2p6_e6_inner.pb',
'outer:RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2017v2p6_e6_outer.pb' ),

@SohamBhattacharya
Contributor

Shouldn't one use year=2026 instead of 2017?

and also deepTau_2026v2p5_core.pb etc files in here?

graph_file = cms.vstring( 'core:RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2017v2p6_e6_core.pb',
'inner:RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2017v2p6_e6_inner.pb',
'outer:RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2017v2p6_e6_outer.pb' ),

I see this is also there in the 13_1 backport.
Maybe @hsert can clarify.

@mmusich
Contributor

mmusich commented Sep 26, 2023

I see this is also there in the 13_1 backport.

if this path fires seldom enough it might be randomly failing in a 9k events relval.
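
A rough way to quantify this: if each event independently has a small probability p of triggering the exception, the chance that a job of N events hits it at least once is 1 - (1 - p)^N. The sketch below assumes independent events; the per-event probabilities are illustrative, not measured.

```python
def prob_at_least_one_failure(p_per_event: float, n_events: int) -> float:
    """Chance that at least one of n_events independent events triggers the crash."""
    return 1.0 - (1.0 - p_per_event) ** n_events


if __name__ == "__main__":
    n_events = 9000  # relval size quoted in this thread
    for p in (1e-5, 1e-4, 1e-3):  # hypothetical per-event crash probabilities
        prob = prob_at_least_one_failure(p, n_events)
        print(f"p = {p:g}: P(job fails) = {prob:.3f}")
```

With probabilities in this range, some 9k-event jobs would crash and others would pass, which is what an intermittent relval failure looks like.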

@hsert
Contributor

hsert commented Sep 26, 2023 via email

@mmusich
Contributor

mmusich commented Sep 26, 2023

@hsert

Once we understand the performance issue that we observed, we will work on updating to a new/phase2 DeepTau training.

No. We can't let this randomly fail relvals (used by everyone else). Please either fix or disable the path.

@srimanob
Contributor Author

I see this is also there in the 13_1 backport.

if this path fires seldom enough it might be randomly failing in a 9k events relval.

We don't see the failure in 13_1 because the last 13_1 relvals were made with CMSSW_13_1_0_pre4, and the backport PR was merged only in August. I agree with @mmusich: if we can't fix it at this moment, we should disable the path.

@hsert
Contributor

hsert commented Sep 26, 2023

OK, how urgent is it? I was working on updating it because fixes were needed, then focused on performance improvements, as suggested in the Phase2 meeting. It may take some days to fix it. If it is urgent, then we can disable it.

@srimanob
Contributor Author

To me, the next pre-release, CMSSW_13_3_0_pre4, is targeted for 2023/10/17. So if you can't fix it by, let's say, a week before (10 Oct), then please make a PR to disable it. Others may have different comments.

@SohamBhattacharya
Contributor

To me, the next pre-release, CMSSW_13_3_0_pre4, is targeted for 2023/10/17. So if you can't fix it by, let's say, a week before (10 Oct), then please make a PR to disable it. Others may have different comments.

Agreed, the path should be disabled for the time being if it crashes RelVals.
@hsert If there's no quick fix for this, can you please disable the DeepTau path?

@mmusich
Contributor

mmusich commented Sep 26, 2023

If there's no quick fix for this, can you please disable the DeepTau path?

before disabling the path, I'd just like to make sure that the fix proposed at #42862 (comment) is not enough.

@hsert
Contributor

hsert commented Sep 26, 2023

As shown on slide 18 of https://indico.cern.ch/event/1322372/#sc-2-5-taus, that change gives a matrix incompatibility error. I have another deadline this week, but I can focus on solving the issue next week if it is ok. If that is too late, we can disable it. Is there anything that I should do for disabling the path?

@mmusich
Contributor

mmusich commented Sep 26, 2023

Is there anything that I should do for disabling the path?

removing any reference from the menu (while keeping the configuration fragment) should be enough.

EDIT: to be fully explicit, remove it from here:

fragment.HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1,
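
A minimal sketch of that removal, using a plain Python list of path names rather than the real menu file. Only HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1 comes from this thread; the other path names and the helper function are placeholders for illustration.

```python
DISABLED_PATH = "HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1"


def drop_path(schedule_paths, path_name=DISABLED_PATH):
    """Return a new list without path_name; the input list is left untouched."""
    return [p for p in schedule_paths if p != path_name]


# Placeholder menu: only the DeepTau path name is real.
menu = ["HLT_PlaceholderPathA", DISABLED_PATH, "HLT_PlaceholderPathB"]
print(drop_path(menu))
```

The path definition itself stays in the configuration fragment, so re-enabling it later only means adding the entry back to the menu.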

@mmusich
Contributor

mmusich commented Sep 26, 2023

I have another deadline this week, but I can focus on solving the issue next week if it is ok

Next week should be fine. I'll check back here in one week's time.

@SohamBhattacharya
Contributor

Is there anything that I should do for disabling the path?

removing any reference from the menu (while keeping the configuration fragment) should be enough.

EDIT: to be fully explicit, remove it from here:

fragment.HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1,

I confirm that it's enough to remove it from HLT_75e33_cff.py. The Tau paths are intentionally absent from HLT_75e33_cff_timing.py, as that is meant to measure a timing that can be compared to the TDR measurement, which does not include the tau paths.

@rovere
Contributor

rovere commented Sep 26, 2023

Is there anything that I should do for disabling the path?

removing any reference from the menu (while keeping the configuration fragment) should be enough.
EDIT: to be fully explicit, remove it from here:

fragment.HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1,

I confirm that it's enough to remove it from HLT_75e33_cff.py. The Tau paths are intentionally absent from HLT_75e33_cff_timing.py, as that is meant to measure a timing that can be compared to the TDR measurement, which does not include the tau paths.

Once the performance of the Tau Paths is better understood, we can hook those in the Timing version to have an improved baseline.

@mmusich
Contributor

mmusich commented Oct 4, 2023

@hsert please update on the status of the fix. Thank you.

@hsert
Contributor

hsert commented Oct 4, 2023

I have tried several things, but couldn't figure it out yet. Probably it would be safer to disable the path. Sorry for the inconvenience.

@mmusich
Contributor

mmusich commented Oct 5, 2023

I have tried several things, but couldn't figure it out yet. Probably it would be safer to disable the path. Sorry for the inconvenience.

I have created #42955, just in case.
I am trying (unsuccessfully so far) to reproduce the failure offline (@srimanob it would be good if you could get the specs of the nodes on which the relval jobs failed, to see if there's any architecture dependence).
If something better doesn't appear before next pre-release deadline, we can go ahead with disabling.

@srimanob
Contributor Author

srimanob commented Oct 5, 2023

Hi @cms-sw/pdmv-l2
Could you please help find the specs of the node that ran this relval:
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_13_3_0_pre3DisplacedSUSY_stopToB_M_800_500mm_13__2026D98noPU_230918_064200_5200

Thx.

@mmusich
Contributor

mmusich commented Feb 4, 2024

do you have suggestions on how to test the proposed solution at #43855 (i.e. how can I access a node that will guarantee me to get AVX512F AVX512_VNNI) ?

answering myself: I found an lxplus-gpu node (lxplus921) that has:

<Metric Name="CPUModels" Value="Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz"/>

which supports the AVX512F and AVX512_VNNI instructions.
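
For checking other nodes, here is a small sketch that looks for these two instruction sets in the `flags` line of `/proc/cpuinfo` on Linux, where they appear as `avx512f` and `avx512_vnni`. The helper name is made up for illustration.

```python
REQUIRED_FLAGS = {"avx512f", "avx512_vnni"}


def missing_cpu_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags that are absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(required) - set(value.split())
    return set(required)  # no 'flags' line found: treat everything as missing


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            missing = missing_cpu_flags(f.read())
        print("missing flags:", sorted(missing) or "none")
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```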

@mmusich
Contributor

mmusich commented Feb 7, 2024

+1

@mmusich
Contributor

mmusich commented Feb 7, 2024

@cmsbuild, ping? seems stuck

@mmusich
Contributor

mmusich commented Feb 8, 2024

@smuzaffar I signed this issue for HLT above #42862 (comment) but the bot doesn't recognize the signature. Can you please have a look? Thank you.

@smuzaffar
Contributor

smuzaffar commented Feb 8, 2024

@mmusich, there was a bug in the bot that kept it from processing comments on issues properly. It is fixed now, and your last comment has already fixed the labels for this issue.

@valsdav
Contributor

valsdav commented Mar 13, 2024

This happens when num_valid_cells() == 0, i.e. the input tensor is empty, in

(long long int)grid.num_valid_cells(), 1, 1, dnn_inputs_2017_v2::EgammaBlockInputs::NumberOfInputs});

See discussion in #44333 (comment)
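
To make the failure mode concrete: the leading dimension of that input tensor is the number of valid cells, so zero valid cells gives a tensor with no elements. The sketch below assumes 86 inputs per cell, read off the `[0,1,1,86]` shape in the exception message; the helper name is hypothetical.

```python
from math import prod

N_EGAMMA_INPUTS = 86  # assumed from the [0,1,1,86] shape in the exception message


def egamma_tensor_shape(num_valid_cells):
    """Shape mirroring {num_valid_cells, 1, 1, NumberOfInputs} from the C++ line."""
    return (num_valid_cells, 1, 1, N_EGAMMA_INPUTS)


for n_cells in (3, 0):
    shape = egamma_tensor_shape(n_cells)
    print(shape, "->", prod(shape), "elements")
```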

@valsdav
Contributor

valsdav commented Mar 13, 2024

assign ml

(to get updates)

@cmsbuild
Contributor

New categories assigned: ml

@valsdav,@wpmccormack you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor

mmusich commented Mar 13, 2024

(to get updates)

technically this issue is solved by #43855.

@cms-sw/upgrade-l2 do you have objections to closing?

@valsdav
Contributor

valsdav commented Mar 13, 2024

Hi @mmusich, I don't think #43855 is protecting us against empty tensors.

I'm not an expert on the DeepTau code, but the problem is related to evaluating the model over an empty CellGrid. I don't know why the grid used in the producer is empty in the events that crash the run. Maybe @hsert has an idea?

<DeepTauId::createConvFeatures (is_inner = 1)>:
number of valid cells = 0  --> output added by me locally

processing ( eta = -5, phi = -5 )
 skipping creation of inputs, because ( eta = -5, phi = -5 ) is not in the grid !!
processing ( eta = -5, phi = -4 )
 skipping creation of inputs, because ( eta = -5, phi = -4 ) is not in the grid !!
processing ( eta = -5, phi = -3 )
[....all same output not in the grid...]
 skipping creation of inputs, because ( eta = 5, phi = 3 ) is not in the grid !!
processing ( eta = 5, phi = 4 )
 skipping creation of inputs, because ( eta = 5, phi = 4 ) is not in the grid !!
processing ( eta = 5, phi = 5 )
 skipping creation of inputs, because ( eta = 5, phi = 5 ) is not in the grid !!

 no valid cells found, empty tensors to inference engine  --> output added by me locally

@mmusich
Contributor

mmusich commented Mar 13, 2024

@valsdav

I don't think #43855 is protecting us against empty tensors.

as far as I could test, the exceptions reported here are gone with that PR, see #43855 (comment)

@valsdav
Contributor

valsdav commented Mar 13, 2024

Your test in #43855 (comment) is quite clear.

I think this depends on the model version. #43855 indeed uses a different model than the other issue, #44333. Maybe the new model does not produce empty cells? It would be interesting to understand what changed.

I think implementing a guard against empty inputs would at least make the issue clearer
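
One possible shape for such a guard, sketched in Python rather than in the DeepTauId C++ code. Here `run_session` stands in for the real inference call, and the exception type and message are assumptions, not the producer's actual behaviour.

```python
def evaluate_block(cell_inputs, run_session, block_name="inner_egamma"):
    """Run inference only when there is at least one valid cell."""
    if len(cell_inputs) == 0:
        # Fail early with a readable message instead of a shape error deep in the graph.
        raise ValueError(f"{block_name}: no valid cells, refusing to run on an empty tensor")
    return run_session(cell_inputs)


# Usage with a dummy inference function that just counts the cells it was given.
print(evaluate_block([[0.0] * 86], run_session=len))
try:
    evaluate_block([], run_session=len)
except ValueError as err:
    print("guard triggered:", err)
```

Whether the guard should raise, or skip inference and return a default output, is a design choice this sketch leaves open.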

@mmusich
Contributor

mmusich commented Mar 13, 2024

I think implementing a guard against empty inputs would at least make the issue clearer

I agree, but I would not mix different issues (it is fine to link #44333 to this one, but this one is technically resolved), I would continue the discussion in the main one.

@valsdav
Contributor

valsdav commented Mar 13, 2024

+1

@mmusich
Contributor

mmusich commented Mar 20, 2024

for the record, these add an extra layer of protection:

@srimanob
Contributor Author

+Upgrade

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close
