
wf 8.0 failing in all Jenkins tests #34890

Closed
tvami opened this issue Aug 16, 2021 · 18 comments · Fixed by #34900

@tvami
Contributor

tvami commented Aug 16, 2021

wf 8.0 fails even with the simplest PRs in Jenkins, for example:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_1_X_2021-08-13-1700+7d20f9/44780/validateJR.html

I understand that this is a long-standing issue, and I'm wondering how it should be resolved. I cannot reproduce the problem in local testing:

cmsrel CMSSW_12_1_X_2021-08-13-1700
cd CMSSW_12_1_X_2021-08-13-1700/src/
cmsenv
runTheMatrix.py -l 8.0 -j 8

leads to
1 1 1 1 1 tests passed, 0 0 0 0 0 failed

Should we maybe just remove this from the Jenkins tests? It seems to be a BeamHalo run1_mc cosmics workflow.

@cmsbuild
Contributor

A new Issue was created by @tvami Tamas Vami.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

Just to clarify, the "failure" in PR tests

https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_1_X_2021-08-13-1700+7d20f9/44780/validateJR.html

is in the DQM output comparisons, while the runTheMatrix output

1 1 1 1 1 tests passed, 0 0 0 0 0 failed

checks only if the workflow technically runs or not.

@makortel
Contributor

makortel commented Aug 16, 2021

Following the link to the failing comparison leads to https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_1_X_2021-08-13-1700+7d20f9/44780/8.0_BeamHalo+BeamHalo+DIGICOS+RECOCOS+ALCABH+HARVESTCOS/Pixel_AdditionalPixelErrors.html

that only says "Skipped: 0.2% (1)" without pointing what exactly was skipped.

@makortel
Contributor

assign dqm

@cmsbuild
Contributor

New categories assigned: dqm

@jfernan2,@kmaeshima,@rvenditti,@andrius-k,@ErnestaP,@ahmad3213 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@jfernan2
Contributor

Is there any way to see from RelMon the log/error of those skipped/failing comparisons? The DQM bin-by-bin tool does not show any problem in those Pixel folders.

@smuzaffar
Contributor

@jfernan2, you can go to the parent directory (just remove the file part of the URL in your browser, e.g. https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_1_X_2021-08-13-1700+7d20f9/44780/) and then look for the 8.0 log file.

@jfernan2
Contributor

jfernan2 commented Aug 16, 2021

I have manually inspected the two root files (baseline and from a PR) for wf 8.0:

  • for the skipped one, averageDigiOccupancy: both the baseline and the PR file have 80 entries with mean 0 and stddev = 0 (overflow and underflow are zero too)

  • for the failing one, avgfedDigiOccvsLumi: the first LS is filled with NaN in both the baseline and the PR DQM root files
    [screenshot of the histogram]

Given the nature of this wf 8.0 (BeamHalo), this plot probably makes no sense.

@tvami
Contributor Author

tvami commented Aug 16, 2021

Hi @jfernan2, thanks for looking into this. I'm wondering why the comparison of NaN vs NaN doesn't give an agreement.
Anyway, would you then recommend to remove wf 8.0 from the Jenkins tests?

@davidlange6
Contributor

davidlange6 commented Aug 16, 2021 via email

@jfernan2
Contributor

The code can probably not give valid results in this case since there is no FED info for Pixel.
What about blacklisting them in RelMon? Indeed, I see they are in a blacklist:

Pixel/averageDigiOccupancy
Pixel/AdditionalPixelErrors/byLumiErrors

I am adding the Tracker DQM contacts here in case they can shed some light:
@sroychow @mmusich

@davidlange6
Contributor

davidlange6 commented Aug 16, 2021 via email

@jfernan2
Contributor

avgfedDigiOccvsLumi->setBinContent(thisls, i + 1, nDigisPerFed[i]);  // same plot vs lumi section
  }
  if (modOn) {
    if (thisls % 10 == 0)
      averageDigiOccupancy->Fill(i, averageOcc);  // "modOn" basically means Online DQM; in this case fill histos with the actual digi fraction per FED for each ten lumisections
    if (avgfedDigiOccvsLumi && thisls % 5 == 0)
      avgfedDigiOccvsLumi->setBinContent(int(thisls / 5), i + 1, averageOcc);  // fill with the mean over 5 lumisections; the previous code filled this histo only with the last event of each 10th lumisection
  }
}

@tvami
Contributor Author

tvami commented Aug 16, 2021

What about blacklisting them in RelMon? Indeed I see they are in a blacklist:

Pixel/averageDigiOccupancy
Pixel/AdditionalPixelErrors/byLumiErrors

averageDigiOccupancy is actually blacklisted twice; the second one is here:

Pixel/averageDigiOccupancy

Maybe that could be changed to avgfedDigiOccvsLumi. Should I go ahead with this fix?

@mmusich
Contributor

mmusich commented Aug 16, 2021

It's typically straightforward to protect against generating NaNs... what's the code that is doing this?

see #34900.

@jfernan2
Contributor

+1
Solved by #34900

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
