
wf 8.0 failing in all Jenkins tests #34890

Closed
tvami opened this issue Aug 16, 2021 · 18 comments · Fixed by #34900

@tvami
Contributor

tvami commented Aug 16, 2021

wf 8.0 fails even with the simplest PRs in Jenkins, for example:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_1_X_2021-08-13-1700+7d20f9/44780/validateJR.html

I understand that this is a long-standing issue, and I'm wondering how it should be resolved. I cannot reproduce the problem in local testing:

cmsrel CMSSW_12_1_X_2021-08-13-1700
cd CMSSW_12_1_X_2021-08-13-1700/src/
cmsenv
runTheMatrix.py -l 8.0 -j 8

leads to
1 1 1 1 1 tests passed, 0 0 0 0 0 failed

Should we maybe just remove this from the Jenkins tests? It seems to be a BeamHalo run1_mc cosmics workflow.

@cmsbuild
Contributor

A new Issue was created by @tvami Tamas Vami.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

Just to clarify, the "failure" in PR tests

https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_1_X_2021-08-13-1700+7d20f9/44780/validateJR.html

is in the DQM output comparisons, while the runTheMatrix output

1 1 1 1 1 tests passed, 0 0 0 0 0 failed

checks only if the workflow technically runs or not.

@makortel
Contributor

makortel commented Aug 16, 2021

Following the link to the failing comparison leads to https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_1_X_2021-08-13-1700+7d20f9/44780/8.0_BeamHalo+BeamHalo+DIGICOS+RECOCOS+ALCABH+HARVESTCOS/Pixel_AdditionalPixelErrors.html

that only says "Skipped: 0.2% (1)" without pointing what exactly was skipped.

@makortel
Contributor

assign dqm

@cmsbuild
Contributor

New categories assigned: dqm

@jfernan2,@kmaeshima,@rvenditti,@andrius-k,@ErnestaP,@ahmad3213 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@jfernan2
Contributor

Is there any way to see from RelMon the log/error of those skipped/failing comparisons? The DQM bin-by-bin tool does not show any problem in those Pixel folders.

@smuzaffar
Contributor

@jfernan2, you can go to the parent directory (just remove the file part of the URL in your browser, e.g. https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_1_X_2021-08-13-1700+7d20f9/44780/) and then look for the 8.0 log file.

@jfernan2
Contributor

jfernan2 commented Aug 16, 2021

I have manually inspected the two root files (baseline and from a PR) for wf 8.0:

  • for the skipped one, averageDigiOccupancy: both the baseline and the PR file have 80 entries with mean 0 and stddev = 0 (overflow and underflow are zero too)

  • for the failing one, avgfedDigiOccvsLumi: the first LS is filled with NaN in both the baseline and the PR DQM root files
    [screenshot of the histogram]

Given the nature of this wf 8.0 (BeamHalo), this plot probably makes no sense.

@tvami
Contributor Author

tvami commented Aug 16, 2021

Hi @jfernan2, thanks for looking into this. I'm wondering why the comparison of NaN vs NaN doesn't give an agreement.
Anyway, would you then recommend to remove wf 8.0 from the Jenkins tests?

@davidlange6
Contributor

davidlange6 commented Aug 16, 2021 via email

@jfernan2
Contributor

The code can probably not give valid results in this case since there is no FED info for Pixel.
What about blacklisting them in RelMon? Indeed, I see they are in a blacklist:

Pixel/averageDigiOccupancy
Pixel/AdditionalPixelErrors/byLumiErrors

I am adding the Tracker DQM contacts here in case they can shed some light:
@sroychow @mmusich

@davidlange6
Contributor

davidlange6 commented Aug 16, 2021 via email

@jfernan2
Contributor

avgfedDigiOccvsLumi->setBinContent(thisls, i + 1, nDigisPerFed[i]);  // same plot vs lumi section
  }
  if (modOn) {
    if (thisls % 10 == 0)
      averageDigiOccupancy->Fill(i, averageOcc);  // "modOn" basically means Online DQM; in this case fill histos with the actual digi fraction per FED for each ten lumisections
    if (avgfedDigiOccvsLumi && thisls % 5 == 0)
      avgfedDigiOccvsLumi->setBinContent(int(thisls / 5), i + 1, averageOcc);  // fill with the mean over 5 lumisections; the previous code filled this histo only with the last event of each 10th lumisection
  }
}

@tvami
Contributor Author

tvami commented Aug 16, 2021

What about blacklisting them in RelMon? Indeed I see they are in a blacklist:

Pixel/averageDigiOccupancy
Pixel/AdditionalPixelErrors/byLumiErrors

averageDigiOccupancy is actually blacklisted twice; the second one is here:

Pixel/averageDigiOccupancy

Maybe that could be changed to avgfedDigiOccvsLumi. Should I go ahead with this fix?

@mmusich
Contributor

mmusich commented Aug 16, 2021

It's typically straightforward to protect against generating NaNs... what's the code that is doing this?

see #34900.

@jfernan2
Contributor

+1
Solved by #34900

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
