Gracefully parse cpu performance metrics in SummaryDB #11590
@germanfgv German, as we discussed over Zoom, you have a "recipe" to reproduce these errors. Could you please run a meaningful replay, with this patch in, to see whether we actually get to publish the job failures in (T0) WMStats?
I started a T0 replay with this PR. You can follow its progress here, and look for the workflows in Testbed T0 WMStats by filtering with the following ID: This replay will have a segmentation fault in PromptReco due to a problem with CMSSW_13_0_3. In previous instances of this error, JobAccountant was unable to upload the job's document to Couch.
We were able to recreate the paused job with the segmentation fault. With this patch, the agent was able to upload the information to WMStats.
Great! Thanks for testing it, @germanfgv!
I suggest adding extra safety measures by assigning default values for the `cpu` parameter of the `perf` metrics.
@@ -83,12 +83,12 @@ def fwjr_parser(doc):
                 cmsRunCPUPerformance=pdict)
     if key.startswith('cmsRun'):
         perf = val['performance']
-        pdict['totalJobCPU'] += float(perf['cpu']['TotalJobCPU'])
-        pdict['totalJobTime'] += float(perf['cpu']['TotalJobTime'])
+        pdict['totalJobCPU'] += float(perf['cpu'].get('TotalJobCPU', 0))
Alan, would it be worth making it safer by assigning a dict as the default value for `cpu`, e.g.
pdict['totalJobCPU'] += float(perf.get('cpu', {}).get('TotalJobCPU', 0))
It could be done, but then we should likely do the same for the `performance` key above. For this case, I am inclined to keep it more readable and assume that this structure is already properly defined upstream. I think it is; otherwise we would have seen this KeyError in the past.
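To make the trade-off in this thread concrete, here is a minimal sketch comparing the three access styles discussed. The payloads `perf_empty_cpu` and `perf_no_cpu` and the helper names are hypothetical, made up for illustration; they are not taken from the WMCore code base.

```python
# Hypothetical payloads (assumptions, not real FWJR documents):
# 'cpu' section present but empty, e.g. cmsRun crashed before reporting.
perf_empty_cpu = {"cpu": {}}
# 'cpu' section missing altogether.
perf_no_cpu = {}


def strict(perf):
    """Original style: raises KeyError if either key is missing."""
    return float(perf["cpu"]["TotalJobCPU"])


def safe_inner(perf):
    """Style in this PR's diff: tolerates a missing metric, not a missing 'cpu'."""
    return float(perf["cpu"].get("TotalJobCPU", 0))


def safe_both(perf):
    """Suggested style: tolerates a missing 'cpu' section as well."""
    return float(perf.get("cpu", {}).get("TotalJobCPU", 0))


def outcome(func, perf):
    """Return the parsed value, or the string 'KeyError' if the lookup fails."""
    try:
        return func(perf)
    except KeyError:
        return "KeyError"


for name, payload in (("empty cpu", perf_empty_cpu), ("no cpu", perf_no_cpu)):
    print(name, [outcome(f, payload) for f in (strict, safe_inner, safe_both)])
```

Only the last variant survives both payloads; the middle one still assumes that the `cpu` key itself is always set upstream.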
test this please
Securely access cpu key
@vkuznet Valentin, I added the following line (already squashed in) to safely access the
Fixes #11275
Status
not-tested
Description
After re-discussing this issue with German, we concluded that there are 2 scenarios where we miss job information in WMStats:
This PR is meant to fix issue 2) by accessing the dictionaries in a safe way.
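As a rough illustration of what "a safe way" could mean for the aggregation as a whole, the sketch below sums CPU metrics over the steps of a job report while tolerating missing sections. The function `sum_cpu_metrics` and the layout of `steps` are assumptions for this example only; the actual parser in this PR is `fwjr_parser`, and its real input structure may differ.

```python
def sum_cpu_metrics(steps):
    """Sum CPU metrics over cmsRun steps, treating missing values as 0.

    `steps` is a hypothetical mapping of step name -> step document.
    """
    totals = {"totalJobCPU": 0.0, "totalJobTime": 0.0}
    for key, val in steps.items():
        if not key.startswith("cmsRun"):
            continue
        # Fall back to empty dicts so a crashed step does not raise KeyError.
        cpu = val.get("performance", {}).get("cpu", {})
        totals["totalJobCPU"] += float(cpu.get("TotalJobCPU", 0))
        totals["totalJobTime"] += float(cpu.get("TotalJobTime", 0))
    return totals


# Hypothetical job report: one healthy step, one step with no CPU metrics,
# and one non-cmsRun step that is skipped.
steps = {
    "cmsRun1": {"performance": {"cpu": {"TotalJobCPU": "12.5", "TotalJobTime": "20.0"}}},
    "cmsRun2": {"performance": {"cpu": {}}},
    "stageOut1": {"performance": {}},
}
totals = sum_cpu_metrics(steps)
print(totals)
```

Defaulting to 0 keeps the summary uploadable for failed jobs, at the cost of not distinguishing "no CPU used" from "metric not reported".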
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
None