Add CMSSW performance metrics to WMArchive document #11696

Merged
amaltaro merged 2 commits into dmwm:master on Sep 28, 2023

Conversation

Contributor

@vkuznet vkuznet commented Aug 28, 2023

Fixes #11538

Status

ready

Description

Add CMSSW performance metrics to WMArchive document

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

#11663

External dependencies / deployment changes

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 1 warnings
    • 70 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 141 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14448/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 70 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 134 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14449/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro August 28, 2023 18:54
Contributor

@amaltaro amaltaro left a comment

I just saw there was a pending review request for this PR, even though it's labelled as "PR: Work in progress".

I know we have to resume working on this once the flow of these metrics is fully understood. I don't understand why this PR provides the CMSSWMetrics.py module, though. It is an example and should be placed in the appropriate location (either test/* or src/python/WMQuality).

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 6 warnings and errors that must be fixed
    • 3 warnings
    • 155 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 136 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14524/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Sep 27, 2023

test this please

@vkuznet
Contributor Author

vkuznet commented Sep 27, 2023

Alan, the CMSSWMetrics.py module is required by WMCore/Services/WMArchive/DataMap.py. It is not just a data sample; rather, it provides the correct data types for all CMSSW performance metrics. Instead of adding them manually, I took concrete CMSSW metrics and put them in this module. DataMap expects proper data types in PERFORMANCE_TYPE, which is used across the codebase, and I added them via the CMSSWMetrics function, which comes from the CMSSWMetrics.py module. Since all of this is part of the DataMap workflow, I doubt that this module should be placed under test or WMQuality.
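
To make the description above concrete, here is a minimal sketch of how a dictionary of sample metric values could be turned into a dictionary of data types. The helper name `metrics_to_types` and the trimmed `SAMPLE_METRICS` dict are illustrative assumptions, not the code from this PR; the sample values are the ones quoted elsewhere in this conversation.

```python
# Hypothetical, trimmed-down sample of CMSSW metrics (two sections only).
SAMPLE_METRICS = {
    "SystemMemory": {"Active(file)": 4004848, "KernelStack": 13760},
    "Timing": {"AvgEventTime": 34.1318},
}


def metrics_to_types(metrics):
    """
    Replace every leaf value of a two-level metrics dict by its Python type.
    This is an illustrative helper, not the CMSSWMetrics function of the PR.
    """
    rdict = {}
    for section, values in metrics.items():
        ndict = {}
        for name, value in values.items():
            ndict[name] = type(value)
        rdict[section] = ndict
    return rdict


TYPE_MAP = metrics_to_types(SAMPLE_METRICS)
```

With this sample, `TYPE_MAP["SystemMemory"]` maps both keys to `int` and `TYPE_MAP["Timing"]` maps `AvgEventTime` to `float`, which is the kind of structure a type map such as PERFORMANCE_TYPE can hold.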

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 6 warnings and errors that must be fixed
    • 3 warnings
    • 155 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 136 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14525/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Sep 27, 2023

Here is a link to a document which now contains cmssw metrics in WMArchive

@vkuznet
Contributor Author

vkuznet commented Sep 27, 2023

Alan, I applied this PR along with #11663 to the vocms0291 WMAgent and ran one workflow, wmagent_small_MC_Agent0291_wma_cmssw3_230927_133515_4641. I verified that the local CouchDB has the appropriate cmssw metrics, see

curl -s xxx:xxx@localhost:5984/wmagent_jobdump%2Ffwjrs/1470-0 | jq

and then I found that these metrics are propagated into WMArchive, see this document

Therefore, I removed the work-in-progress label, and this PR along with #11663 is now ready for review. Please note that this PR addresses how to propagate metrics to WMArchive, while #11663 provides CMSSW metrics in the FJR JSON by parsing the appropriate parts of the CMSSW XML report. Both PRs are required to complete #11538.

Just in case, here is a direct link to the FJR JSON in the OpenSearch/MONIT infrastructure (previously I added a link to a filtered page).

Contributor

@todor-ivanov todor-ivanov left a comment

Thanks for the PR, @vkuznet.
I do not see anything dramatic that could be a showstopper. Nevertheless, I made 2 or 3 inline comments. You may want to take a quick look (no need to change anything, though, if you think it is already good enough).

src/python/WMCore/FwkJobReport/Report.py (outdated, resolved)
src/python/WMCore/Services/WMArchive/CMSSWMetrics.py (outdated, resolved)
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 70 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 133 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14526/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

@vkuznet Valentin, I left a few comments along the code.

In addition to those comments:

  • please check the report from Jenkins and take the necessary actions. New modules are supposed to be clean of any suggestions, unless there is no simple way to do that.
  • one thing that is not desirable IMO is that we will have many metrics being provided twice, under different categories. Perhaps we should make a commitment for 1 or 2 years ahead such that we can remove all the other metrics and provide only cmssw.

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py (outdated, resolved)
def CMSSWMetrics():
"""
Helper function to convert CMSSW_METRICS dict values to their data-types
:return: dictionary
Contributor

We can infer from the code, but it would be helpful if you provided 2 or 3 lines of an example dictionary that would be returned.

rdict[key] = ndict
return rdict

def XrdMetrics(mdict):
Contributor

We need a unit test for this function

@@ -52,7 +54,8 @@
 'readTotalMB': float,
 'readTotalSecs': float,
 'writeTotalMB': float,
-'writeTotalSecs': float}}
+'writeTotalSecs': float},
+'cmssw': CMSSWMetrics()}
Contributor

Do I understand correctly that this function would be executed only once during the lifetime of ArchiveDataReporter, the component responsible for parsing the FJR and creating the WMArchive document?

Contributor Author

I think it will be done once, but it is hard to say based on my current understanding of all the components involved. The parsing is done by WMCore/FwkJobReport/XMLParser.py when it parses the provided CMSSW XML FJR file, i.e. it happens in the WMCore/WMSpec/Steps/Executors/CMSSW.py module and in whichever component calls it. It does not happen in ArchiveDataReporter.

Contributor

I was afraid it could happen for every single job, but grepping the code:

(py3.8) OR-FVFH37J7Q6LR:WMCore amaltar2$ grep -rI DataMap src/* | grep import
src/python/WMComponent/ArchiveDataReporter/ArchiveDataPoller.py:from WMCore.Services.WMArchive.DataMap import createArchiverDoc

My understanding is that the map will be created once, at the moment the ArchiveDataReporter component starts.
In that case, everything is fine.
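
The reasoning here rests on a general Python property: an expression used in a module-level assignment runs once, when the module is first imported, and not on every later call. A minimal self-contained sketch of that behaviour follows; the names `build_types`, `CALLS`, and `type_cast` are illustrative assumptions and do not come from DataMap.py.

```python
# Records each invocation of the builder so we can count them afterwards.
CALLS = []


def build_types():
    """Stand-in for a function that builds a type map (hypothetical)."""
    CALLS.append("called")
    return {"AvgEventTime": float}


# Evaluated exactly once, when this module-level statement is executed.
PERFORMANCE_TYPE = {"cmssw": build_types()}


def type_cast(perf_dict):
    """Keep only sections known to PERFORMANCE_TYPE; does not rebuild the map."""
    return {key: val for key, val in perf_dict.items() if key in PERFORMANCE_TYPE}


# Simulate processing several jobs: the builder is not called again.
for _ in range(3):
    type_cast({"cmssw": {"AvgEventTime": 34.1318}, "other": {}})

NUM_CALLS = len(CALLS)
```

After the loop `NUM_CALLS` is 1, regardless of how many documents go through `type_cast`.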

@@ -207,6 +211,10 @@ def typeCastPerformance(performDict):
for key in PERFORMANCE_TYPE:
if key in performDict:
for param in PERFORMANCE_TYPE[key]:
if key == 'cmssw':
# so far we skip validation of cmssw metrics values
newPerfDict[key] = performDict[key]
Contributor

If these are new metrics to be provided to the monitoring system, why do we skip this validation?

One of my fears is that, once we start validating it, we get different metric data types for different CMSSW releases. If that happens, this will crash until someone looks at it and resolves the issue. Perhaps the way to deal with this would be to catch the exception and simply provide the value as is, thus avoiding human intervention.

Contributor Author

This is why I skipped validation, and in fact I do not know how to properly do it.
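
One possible shape of the "catch an exception and provide the value as is" idea from this thread is sketched below. It is only an illustration under stated assumptions: the helpers `cast_or_keep` and `cast_metrics` are hypothetical names, and the merged code does not validate cmssw values at all.

```python
def cast_or_keep(value, expected_type):
    """Try to cast value to expected_type; on failure return it unchanged."""
    try:
        return expected_type(value)
    except (TypeError, ValueError):
        # Unexpected type for this CMSSW release: pass the value through
        # instead of crashing the component.
        return value


def cast_metrics(metrics, type_map):
    """Cast a two-level metrics dict using a two-level type map (hypothetical)."""
    result = {}
    for section, values in metrics.items():
        section_types = type_map.get(section, {})
        casted = {}
        for name, value in values.items():
            if name in section_types:
                casted[name] = cast_or_keep(value, section_types[name])
            else:
                # Metric unknown to the type map: keep it as provided.
                casted[name] = value
        result[section] = casted
    return result


EXAMPLE = cast_metrics(
    {"Timing": {"AvgEventTime": "34.1318", "Broken": "n/a", "Extra": "x"}},
    {"Timing": {"AvgEventTime": float, "Broken": float}},
)
```

Here `AvgEventTime` becomes the float 34.1318, while `Broken` and `Extra` keep their original string values. The trade-off is that a document may then carry mixed types for the same metric, which downstream consumers would need to tolerate.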

@vkuznet
Contributor Author

vkuznet commented Sep 28, 2023

Alan, regarding clean code in the new module: the pep8 errors come from the non-compliant indentation of the CMSSW_METRICS dict I put into the code:

CMSSW_METRICS = {
        "SystemMemory": {
            "Active(file)": 4004848,
            "KernelStack": 13760,
....

Of course I can try to align all parts manually (460 lines in total), but I got it from a JSON file and parsed it with the jq tool to create a nicely indented structure. This is a typical example showing that Python does not have a single fixed standard for code indentation, and each tool can dictate it differently. If you want me to align 460 lines, I'll do it, but it will make little sense if we change this dict again in the future. Maybe we should put it into a file and load it somehow, but in that case the file must be present at run time. I can put it into a separate file and add an explicit comment about the pep8 complaints if you think that would be useful. That said, I'm open to suggestions, but as I said, in this particular case the pep8 comments make little sense to me.
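
For reference, the "put it into a file and load it" alternative mentioned above could look roughly like the sketch below. The file name `cmssw_metrics.json` and the function `load_metrics` are assumptions made for illustration; the PR itself keeps the dict inside the Python module.

```python
import json
import os
import tempfile


def load_metrics(path):
    """Load a metrics dictionary from a JSON file (hypothetical helper)."""
    with open(path, encoding="utf-8") as istream:
        return json.load(istream)


# Demo: write a tiny sample file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "cmssw_metrics.json")
    sample = {"SystemMemory": {"Active(file)": 4004848, "KernelStack": 13760}}
    with open(fname, "w", encoding="utf-8") as ostream:
        json.dump(sample, ostream, indent=4)
    metrics = load_metrics(fname)
```

This avoids pep8 complaints about the layout of a large literal, at the cost noted in the comment: the JSON file has to be shipped and found at run time.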

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests added
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 71 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 131 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14529/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

@vkuznet Valentin, my IDE is not completely in sync with the rules used by the Jenkins pylint check.
But I pulled the file provided in this PR, reformatted it in my IDE and made it available again at:
https://amaltaro.web.cern.ch/amaltaro/forValentin/CMSSWMetrics.py

Can you fetch it and make a new commit with all its changes to see how it goes?

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 71 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 95 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14531/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Sep 28, 2023

Alan, here are the pep8 failures for the file reformatted by your IDE:

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py
Line 55, E133 closing bracket is missing indentation
Line 61, E133 closing bracket is missing indentation
Line 66, E133 closing bracket is missing indentation
Line 72, E133 closing bracket is missing indentation
....

@amaltaro
Contributor

Haa, I found which option I was missing in the IDE. Valentin, if you don't mind, can you please fetch an updated file from
https://amaltaro.web.cern.ch/amaltaro/forValentin/CMSSWMetrics.py
and try it again? Thanks

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests no longer failing
    • 2 tests added
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 71 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 52 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14532/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 2 tests no longer failing
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 193 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14533/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests no longer failing
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 71 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 48 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14534/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests no longer failing
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 1 warnings
    • 71 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 47 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14536/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Sep 28, 2023

Alan, after applying your latest IDE-formatted file and a few more iterations, I got rid of most of the pep8 errors (except those in the pre-existing DataMap code). That said, I think this PR is ready for final review.

Contributor

@amaltaro amaltaro left a comment

Valentin, the code looks good in general. However, I fear that we will be providing inconsistent data types in the output document.

The method named typeCastPerformance casts data for everything but the cmssw metrics. In addition, AFAIR the FJR uses a string type for boolean data, which means that anything like "false" or "False" would evaluate to True.

I think it would be really helpful to take an FJR that would be parsed by ArchiveDataReporter and run it through the DataMap logic to see what the outcome would be. Are you able to provide both the input and output documents?
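
The boolean concern is easy to reproduce in plain Python: any non-empty string is truthy, so `bool("false")` is `True`. A minimal sketch of a stricter conversion follows; the helper `parse_bool` and its accepted spellings are assumptions for illustration and are not part of WMCore.

```python
def parse_bool(value):
    """
    Convert a boolean-like value to bool.
    Accepts real booleans and a few common string spellings (assumed set);
    raises ValueError for anything else instead of guessing.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "1", "yes"):
            return True
        if text in ("false", "0", "no"):
            return False
    raise ValueError("not a boolean-like value: %r" % (value,))


# The pitfall: a naive cast treats any non-empty string as True.
NAIVE = bool("false")
# The stricter helper interprets the text instead.
STRICT = parse_bool("False")
```

`NAIVE` is `True` while `STRICT` is `False`, which illustrates why a type map entry of `bool` alone is not enough for string-encoded flags.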

@vkuznet
Contributor Author

vkuznet commented Sep 28, 2023

Alan, yes, I can easily generate a JSON report from XML using this piece of code:

#!/usr/bin/env python

import pprint

from WMCore.FwkJobReport.Report import Report

xmlXrd = "./test/python/WMCore_t/FwkJobReport_t/CMSSWJobReportXrdSiteStatistics.xml"
report = Report('cmsRun1')
report.parse(xmlXrd)
cmsRun = getattr(report.data, 'cmsRun1', {})
performance = getattr(cmsRun, 'performance', {})
cmssw = getattr(performance, 'cmssw', {})
rdict = cmssw.dictionary_whole_tree_()

pprint.pprint(rdict)

You can find the JSON report over here; as you can see, it uses the test/python/WMCore_t/FwkJobReport_t/CMSSWJobReportXrdSiteStatistics.xml file, which you can look up in the repository.

@amaltaro
Contributor

Given that you have all this code in your repository, can you please generate the output file for ALL of those performance sections? They are: ["storage", "memory", "cpu", "multicore", "cmssw"]. Only then can we have a real comparison of the duplicate attributes.

@vkuznet
Contributor Author

vkuznet commented Sep 28, 2023

Alan, I updated the report and it now contains all performance metrics.

@amaltaro
Contributor

Thanks, Valentin. I was trying to figure out why the cmssw metrics were cast while the others weren't, but I just noticed that this was implemented in the other PR: #11663

Here is one example metric

  "cmssw": {
           "Timing": {"AvgEventTime": 34.1318,

versus

 "cpu": {"AvgEventTime": "34.1318",

That said, the cmssw metrics have been provided correctly from the beginning, and I think this PR is good to go!
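
A comparison like the one above can be automated. The sketch below walks a performance dictionary and reports metric names that appear both under `cmssw` and under another section, together with both values. The function names and the tiny input document are illustrative assumptions; only the two `AvgEventTime` values are taken from this conversation.

```python
def flatten(section):
    """Yield (name, value) pairs for all leaves of a nested dict."""
    for name, value in section.items():
        if isinstance(value, dict):
            yield from flatten(value)
        else:
            yield name, value


def duplicate_metrics(performance, reference="cmssw"):
    """
    Return {(section, name): (value, reference_value)} for every metric name
    that exists both in the reference section and in another section.
    """
    ref = dict(flatten(performance.get(reference, {})))
    dups = {}
    for section, values in performance.items():
        if section == reference:
            continue
        for name, value in flatten(values):
            if name in ref:
                dups[(section, name)] = (value, ref[name])
    return dups


PERFORMANCE = {
    "cmssw": {"Timing": {"AvgEventTime": 34.1318}},
    "cpu": {"AvgEventTime": "34.1318"},
}
DUPLICATES = duplicate_metrics(PERFORMANCE)
```

For this input, `DUPLICATES` has a single entry keyed by `("cpu", "AvgEventTime")` holding the string and float variants side by side. Note that matching by leaf name alone is a simplification: two unrelated metrics with the same name in different sections would also be reported.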

@amaltaro
Contributor

@vkuznet Valentin, in the coming days, can you please create a new GH issue to remove the now-duplicate sections from the FJR and WMArchive documents? Please provide an attachment/link/reference to the input FJR:
https://raw.githubusercontent.com/dmwm/WMCore/36ee6d26cf2dd25e2125ee56ab163369d466b642/test/python/WMCore_t/FwkJobReport_t/CMSSWJobReportXrdSiteStatistics.xml
and to the output WMArchive document
https://gist.github.com/vkuznet/bb1a2bbd9e2cbd5dbc534c51c7ccd58e

@amaltaro amaltaro merged commit 992a2f5 into dmwm:master Sep 28, 2023
3 of 4 checks passed
Successfully merging this pull request may close these issues.

Review and improve CMSSW I/O metric reporting
4 participants