Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CMSSW metrics for FWJR #11663

Merged
merged 2 commits into from
Sep 29, 2023
Merged

CMSSW metrics for FWJR #11663

merged 2 commits into from
Sep 29, 2023

Conversation

vkuznet
Copy link
Contributor

@vkuznet vkuznet commented Jul 17, 2023

Fixes #11538
Superseeds #11575

Status

ready

Description

Implement CMSSW metrics for WMCore FWJR report, and perform data-type casting of all existing metrics, i.e. int, float, boolean metrics values if they are represented in string form are converted to int, float, boolean data-type values, respectively, e.g. "1" string value is converted to 1 integer one, or "1.1" string value is converted to 1.1 float value. In addition, extra spaces surrounding strings are discarded, e.g. " foo " is converted to "foo".

This PR also provides proper way to handle XrdSiteStatistics metrics which are presented via the following XML form:

    <PerformanceReport>
      <PerformanceModule Metric="cern.ch"  Module="XrdSiteStatistics" >
        <Metric Name="read-numOperations" Value="2"/>
        ...
      </PerformanceModule>
    </PerformanceReport>

and transformed into the following representation in JSON:

{...,
        "XrdSiteStatistics": {
            "readv-numOperations": [
              {
                "site": "cern.ch",
                "value": 2
              }, ... ],
            ....
           }
}

This PR also provides a fresh CMSSWMergeReport2.xml file from one of the recent CMSSW jobs and made unit testing against it.

TODO: it is required that we should make a decision about preservation of current performance metrics since it may affect backward compatibility. I think we should preserve them for time being until new docs with full set of CMSSW metrics will be propagated to WMArchive/MONIT, and then we'll adjust existing dashboard to switch from current performance metrics (cpu, storage, memory) to a full set of CMSSW ones. Therefore, further work on WMArchive/MONIT side may be required.

Is it backward compatible (if not, which system it affects?)

YES, but when we merge it we may drop existing performance cpu, storage, memory metrics as they will be covered by full set of CMSSW one provided in this PR.

Related PRs

#11696

This PR is superset of #11575 which may be dropped if we'll decide to preserve all CMSSW metrics

External dependencies / deployment changes

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 3 warnings
    • 158 comments to review
  • Pylint py3k check: failed
    • 1 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14349/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 3 warnings
    • 160 comments to review
  • Pylint py3k check: failed
    • 1 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14350/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 3 warnings
    • 160 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14351/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkuznet Valentin, these changes look good to me.

I am in favor of keeping the metrics summary for the next year or so, until we have data mostly consistent and can implement metric aggregation at a different layer, as you mentioned in the initial PR description.

Given that this PR changes (enforces) a better data schema in terms of data type, we might have to follow up with CMS Monit once it gets deployed to ensure that dashboards remain functional.

@vkuznet
Copy link
Contributor Author

vkuznet commented Jul 19, 2023

ok, then let's get feedback from @makortel about decision of which set of metrics to keep at the end. If Matti will agree to keep everything (this PR) then we can proceed with this PR and close #11575 one. I also checked unit test and the one which fail WorkQueueTest:testThrottling does not seem to be related to this PR.

@amaltaro
Copy link
Contributor

Sounds good! If possible, could you please provide the FJR that would be generated by the agent? Just so we have a realistic example of how it will end up with all those performance sections.

@vkuznet
Copy link
Contributor Author

vkuznet commented Jul 25, 2023

I finally got FJR report from vocms0291 WMAgent where I run the following workflow

python3 inject-test-wfs.py -u "https://cmsweb-testbed.cern.ch" -m Integration -f TaskChain_MC.json -c DMWM_Test -r Agent0291_forceComplete -t testbed-vocms0291 -a Integ_Test -p forceComplete_VK_patch11663

It was submitted to cmsweb-testbed, where I obtained the report from one of sub jobs. The Report.0.pkl has been decoded in python and correctly shows cmssw section in FrameworkJobReport.cmsRun1.performance object.

@amaltaro , please review provided report and let me know if we need anything else.

@vkuznet
Copy link
Contributor Author

vkuznet commented Jul 26, 2023

The appropriate part for WMArchive integration can be found in this PR: dmwm/WMArchive#361

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the requested unit tests, these changes look good to me.

Based on the CMSW FJR and the Report file, I do want to point out the following:

  • The "performance.cmssw" contain all of the PerformanceSummary raw metrics provided by CMSSW
  • The "performance.cpu" section has the same information as "performance.cmssw.Timing", with the only difference that the latter has the new value type casting feature.
  • The "performance.memory" is a subset of the "performance.cmssw.ApplicationMemory", where the latter has value type casting.
  • The "performance.summaries" is an exact copy of "performance.cmssw.SystemCPU".
  • The "performance.storage" section does seem to be unique, because in the source code it is extracted from "Timing-*-numOperations" and some extra manipulation is performed.

In short, once we convert all of the potential monitoring tools to start consuming "performance.cmssw" metrics, we can then completely remove the following handlers/sections:

  • cpu, memory, summaries

@@ -322,7 +328,6 @@ def perfRepHandler(targets):
targets['PerformanceSummary'].send((perfRep.summaries,
subnode))


@coroutine
def perfSummaryHandler():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please make a specific unit test for this function

@makortel
Copy link

makortel commented Aug 1, 2023

ok, then let's get feedback from @makortel about decision of which set of metrics to keep at the end. If Matti will agree to keep everything (this PR) then we can proceed with this PR and close #11575 one.

I'm fine with keeping all the raw metrics (as I noted in #11538 (comment)).

Note my reply in #11575 (comment) that the xrootd per-site statistics are probably included currently in a sub-optimal way. Would it be possible to see an example job report JSON that gets passed forward to WMArchive?

@amaltaro
Copy link
Contributor

amaltaro commented Aug 2, 2023

@makortel Matti, Valentin didn't provide a WMArchive like document, but it will be completely based on this WMAgent job report that Valentin shared in his previous comment and very similar in terms of organization and metrics:

It was submitted to cmsweb-testbed, where I obtained the report from one of sub jobs. The Report.0.pkl has ...

If this is not enough, then I am afraid we need to wait for Valentin to get back from Vacation (around Aug 11).

@makortel
Copy link

makortel commented Aug 2, 2023

Matti, Valentin didn't provide a WMArchive like document, but it will be completely based on this WMAgent job report that Valentin shared in his previous comment and very similar in terms of organization and metrics:

It was submitted to cmsweb-testbed, where I obtained the report from one of sub jobs. The Report.0.pkl has ...

If this is not enough, then I am afraid we need to wait for Valentin to get back from Vacation (around Aug 11).

Thanks, that format works too. Apparently that example doesn't contain the xrootd statistics. For the next round of example would it be possible to see both the XML and the JSON or WMAgent job report?

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 3 warnings
    • 161 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 40 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14408/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Aug 14, 2023

I doubt reported failure are related to unit test I added since they are for Rucio and WQ. Therefore, I request a review again.

@vkuznet vkuznet requested a review from amaltaro August 14, 2023 18:38
@vkuznet
Copy link
Contributor Author

vkuznet commented Aug 14, 2023

@makortel , to avoid arguing which XML file to use and which sections should be present please provide the one you would like to see and I can generate its WMCore JSON representation.

@amaltaro
Copy link
Contributor

@vkuznet Valentin, to avoid confusion, can you please provide the following (either at the initial PR description or in a single comment):

  1. CMSSW FJR (xml document)
  2. WMAgent FJR (as far as I can see, it can be either a xml document or the dot-notation document)
  3. JSON document passed over to WMArchive

@vkuznet
Copy link
Contributor Author

vkuznet commented Aug 15, 2023

@amaltaro , @makortel please provide proper CMSSW FJR (xml document) document to start with, and then I can provide pkl, WMCore FJR configuration format, and WMArchive JSON.

@makortel
Copy link

please provide proper CMSSW FJR (xml document) document to start with

I'll look into this some time soonish.

@makortel
Copy link

makortel commented Aug 15, 2023

Attached is a framework job report XML produced with CMSSW_13_2_0_pre3 that should contain performance information from all components in CMSSW that currently produce such, and has information from several xrootd servers.
fwjr_performance.xml.txt
(edit: updated the file to list less accessed branches to make the file smaller, the exact set of branches is not relevant for this work)

@vkuznet
Copy link
Contributor Author

vkuznet commented Aug 15, 2023

Here are the links to WMAgent FJR in dot notation and JSON formats based on provided XML

All of them were produced using the following code (feel free to use this snippet with this PR to parse any other CMSSW XML file if you like):

#!/usr/bin/env python

# system modules
import json

# WMCore modules
from WMCore.FwkJobReport.Report import Report
from WMCore.FwkJobReport.XMLParser import xmlToJobReport

xmlfile = '/Users/vk/Downloads/fwjr_performance.xml'
report = Report('cmsRun1')
xmlToJobReport(report, xmlfile)
with open('report.txt', 'w') as ostream:
    ostream.write(report.__str__())
thunker = None
rdict = report.__to_json__(thunker)
with open('report.json', 'w') as ostream:
    ostream.write(json.dumps(rdict))

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valentin, I left one minor comment for your consideration.

In addition to that, we still need to implement the conversion from FJR to WMArchive document. Let us try to first complete the development and merge this one: #11692
then you can update this PR with the relevant changes.

src/python/WMCore/FwkJobReport/Report.py Outdated Show resolved Hide resolved
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 3 warnings
    • 161 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 41 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14438/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 27, 2023

This PR is read for review and should be viewed together with #11696

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one looks good to me

vkuznet added a commit to vkuznet/WMCore that referenced this pull request Sep 27, 2023
@amaltaro
Copy link
Contributor

@vkuznet Valentin, before I review it, can you please rebase it?

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 27, 2023

ok, now it is rebased.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 3 warnings
    • 162 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 42 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14527/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkuznet Valentin, these changes are looking good to me. Nonetheless, I left a couple of comments along the code.

I need to emphasize though that with this PR, you are casting every single performance metric in addition to adding the new cmssw metrics. Please make it extra clear in the pull request description.

src/python/WMCore/FwkJobReport/XMLParser.py Show resolved Hide resolved
src/python/WMCore/FwkJobReport/XMLParser.py Outdated Show resolved Hide resolved
@amaltaro
Copy link
Contributor

Please also document in the pull request description on our decision for the XrdSite metrics, such that it is easy to find it.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 3 warnings
    • 162 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 44 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14530/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 28, 2023

Description is updated. Please review again.

@amaltaro
Copy link
Contributor

@vkuznet can you please squash these commits accordingly?

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests no longer failing
    • 4 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 3 warnings
    • 162 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 44 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14541/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 29, 2023

Alan, all commits are squashed now. Please proceed with review and merge.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Valentin!

@amaltaro amaltaro merged commit 2a5117f into dmwm:master Sep 29, 2023
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Review and improve CMSSW I/O metric reporting
5 participants