Gracefully parse cpu performance metrics in SummaryDB #11590

Merged
amaltaro merged 1 commit into dmwm:master on May 17, 2023

Conversation

amaltaro
Contributor

Fixes #11275

Status

not-tested

Description

After re-discussing this issue with German, we concluded that there are 2 scenarios where we miss job information in WMStats:

  1. for FJR documents that are too large (larger than the configured CouchDB limit)
  2. when parsing FJR metrics for jobs that did not generate an FJR (for instance, jobs hitting a segmentation fault)

This PR addresses scenario 2 by accessing the metric dictionaries safely.
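
To illustrate scenario 2, here is a minimal sketch of the failure mode and of the safe access pattern. The function name `sum_cpu_metrics` and the two sample documents are hypothetical and heavily simplified; the real FJR documents carry many more fields, and this is not the actual SummaryDB code.

```python
# Hypothetical, simplified step documents; not the real FJR layout.
def sum_cpu_metrics(steps):
    """Accumulate CPU metrics over cmsRun steps, tolerating missing values."""
    totals = {"totalJobCPU": 0.0, "totalJobTime": 0.0}
    for name, step in steps.items():
        if not name.startswith("cmsRun"):
            continue
        cpu = step["performance"]["cpu"]
        # .get() falls back to 0 instead of raising KeyError when the
        # job produced no metrics (e.g. it crashed before writing them).
        totals["totalJobCPU"] += float(cpu.get("TotalJobCPU", 0))
        totals["totalJobTime"] += float(cpu.get("TotalJobTime", 0))
    return totals


# A job that finished normally (assumed values, stored as strings).
healthy = {"cmsRun1": {"performance": {"cpu": {"TotalJobCPU": "12.5",
                                               "TotalJobTime": "20"}}}}
# A job that crashed and left the cpu section empty.
segfault = {"cmsRun1": {"performance": {"cpu": {}}}}

print(sum_cpu_metrics(healthy))
print(sum_cpu_metrics(segfault))
```

With direct indexing (`cpu["TotalJobCPU"]`), the second document would raise a `KeyError`, which is one way the whole job document could fail to be published.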

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@amaltaro
Contributor Author

@germanfgv German, as we discussed over Zoom, you have a "recipe" to reproduce these errors. So, could you please run a meaningful replay - with this patch in - to see whether we actually get to publish the job failures in (T0) WMStats?

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 1 warnings
    • 22 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14258/artifact/artifacts/PullRequestReport.html

@germanfgv
Contributor

germanfgv commented May 12, 2023

I started a T0 replay with this PR. You can follow its progress here, and look for the workflows in Testbed T0 WMStats by filtering with the following ID: 230512164908

This replay will hit a segmentation fault in PromptReco due to a problem with CMSSW_13_0_3. In previous instances of this error, JobAccountant was unable to upload the job's document to Couch.

@germanfgv
Contributor

We were able to reproduce the paused job with a segmentation fault. With this patch, the agent was able to upload the information to WMStats.

https://cmsweb-testbed.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run366794_SpecialHLTPhysics11_Tier0_REPLAY_2023_ID230512164908_v12164908

@amaltaro
Contributor Author

Great! Thanks for testing it, @germanfgv !

Contributor

@vkuznet vkuznet left a comment

I suggest adding an extra safety measure by assigning a default value for the cpu key of the performance metrics.

@@ -83,12 +83,12 @@ def fwjr_parser(doc):
                 cmsRunCPUPerformance=pdict)
     if key.startswith('cmsRun'):
         perf = val['performance']
-        pdict['totalJobCPU'] += float(perf['cpu']['TotalJobCPU'])
-        pdict['totalJobTime'] += float(perf['cpu']['TotalJobTime'])
+        pdict['totalJobCPU'] += float(perf['cpu'].get('TotalJobCPU', 0))
Contributor

Alan, would it be worth making it safer by using an empty dict as the default value for cpu, e.g.:

pdict['totalJobCPU'] += float(perf.get('cpu', {}).get('TotalJobCPU', 0))
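
For comparison, a small sketch of how the three access styles discussed here behave when keys are missing. The helper names and the sample `perf` dictionaries are hypothetical and only serve to illustrate the difference.

```python
def read_cpu_strict(perf):
    # Original code: raises KeyError if 'cpu' or 'TotalJobCPU' is missing.
    return float(perf['cpu']['TotalJobCPU'])


def read_cpu_partial(perf):
    # Change under review: tolerates a missing metric, but not a missing 'cpu' key.
    return float(perf['cpu'].get('TotalJobCPU', 0))


def read_cpu_chained(perf):
    # Suggested variant: tolerates both a missing 'cpu' key and a missing metric.
    return float(perf.get('cpu', {}).get('TotalJobCPU', 0))


def raises_key_error(func, perf):
    """Return True if calling func(perf) raises KeyError."""
    try:
        func(perf)
    except KeyError:
        return True
    return False


no_cpu = {}              # hypothetical: 'cpu' section absent
empty_cpu = {'cpu': {}}  # hypothetical: 'cpu' present but empty

for reader in (read_cpu_strict, read_cpu_partial, read_cpu_chained):
    print(reader.__name__,
          raises_key_error(reader, empty_cpu),
          raises_key_error(reader, no_cpu))
```

Only the chained form survives both inputs; the partial form still depends on the `cpu` key being present.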

Contributor Author

It could be done, but then we should likely do the same for the performance key above. In this case, I am inclined to keep the code more readable and assume that this structure is already properly defined upstream, which I think it is; otherwise we would have seen this KeyError in the past.

@amaltaro
Contributor Author

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 1 warnings
    • 22 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14260/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

@vkuznet Valentin, I added the following line (already squashed in) to safely access the cpu key:

perf.setdefault('cpu', {})
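
A minimal sketch of how that line combines with the `.get()` calls, assuming a hypothetical helper `total_job_cpu` and simplified `perf` dictionaries (this is not the merged code itself):

```python
def total_job_cpu(perf):
    """Read TotalJobCPU from a performance dict, defaulting to 0.0."""
    # Insert an empty 'cpu' dict only if the key is absent; an existing
    # value is left untouched. Note that this mutates perf in place.
    perf.setdefault('cpu', {})
    return float(perf['cpu'].get('TotalJobCPU', 0))


missing = {}                                 # hypothetical: no 'cpu' key at all
present = {'cpu': {'TotalJobCPU': '42.0'}}   # hypothetical: metric stored as a string

print(total_job_cpu(missing), missing)
print(total_job_cpu(present), present)
```

Compared with chaining `perf.get('cpu', {})`, `setdefault` needs to be called only once before several reads, at the cost of modifying the input dictionary.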

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 1 warnings
    • 22 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14261/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro merged commit d64a483 into dmwm:master May 17, 2023
Successfully merging this pull request may close these issues.

Missing error message in wmstats
4 participants