Add CMSSW performance metrics to WMArchive document #11696

vkuznet · 2023-08-28T18:10:30Z

Fixes #11538

Status

ready

Description

Add CMSSW performance metrics to WMArchive document

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

#11663

External dependencies / deployment changes

cmsdmwmbot · 2023-08-28T18:24:45Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests added
- 2 changes in unstable tests
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 1 warnings
- 70 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 141 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14448/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-08-28T18:46:44Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests added
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 1 warnings
- 70 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 134 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14449/artifact/artifacts/PullRequestReport.html

amaltaro

I just saw there was a pending review request for this PR, even though it's labelled as "PR: Work in progress".

I know we have to resume working on this once the flow of these metrics is fully understood. I don't understand why this PR provides the CMSSWMetrics.py module though. That is an example and it should be placed under the correct location (either test/* or src/python/WMQuality).

cmsdmwmbot · 2023-09-27T11:55:24Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests no longer failing
- 1 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 6 warnings and errors that must be fixed
- 3 warnings
- 155 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 136 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14524/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-09-27T14:19:29Z

test this please

vkuznet · 2023-09-27T14:24:35Z

Alan, the CMSSWMetrics.py module is required by WMCore/Services/WMArchive/DataMap.py. It is not just a data sample; rather, it provides the correct data types for all CMSSW performance metrics. Instead of adding them manually, I took concrete cmssw metrics and put them in this module. The DataMap expects proper data types in PERFORMANCE_TYPE, which is used across the codebase, and I added them via the CMSSWMetrics function, which comes from the CMSSWMetrics.py module. Since all of them are part of the DataMap workflow, I doubt that this module should be placed under test or WMQuality.

cmsdmwmbot · 2023-09-27T14:28:34Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 6 warnings and errors that must be fixed
- 3 warnings
- 155 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 136 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14525/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-09-27T14:28:40Z

Here is a link to a document which now contains cmssw metrics in WMArchive

vkuznet · 2023-09-27T14:32:22Z

Alan, I applied this PR along with #11663 to the vocms0291 WMAgent and ran one workflow, wmagent_small_MC_Agent0291_wma_cmssw3_230927_133515_4641. I verified that in the local CouchDB we have the appropriate cmssw metrics, see

curl -s xxx:xxx@localhost:5984/wmagent_jobdump%2Ffwjrs/1470-0 | jq

and then I found that these metrics are propagated into WMArchive, see this document

Therefore, I removed the work in progress label, and this PR along with #11663 is now ready for review. Please note that this PR addresses how to propagate metrics to WMArchive, while #11663 provides CMSSW metrics in the FJR JSON by parsing the appropriate parts of the CMSSW XML report. Both PRs are required to complete #11538.

Just in case, here is a direct link to the FJR JSON in the OpenSearch/MONIT infrastructure (previously I added a link to a filtered page).

todor-ivanov

thanks for the PR @vkuznet
I do not see anything dramatic that could be a showstopper. Nevertheless I made 2 or 3 comments inline. You may want to take a quick look. (no need to change anything though if you think it is already good enough)

src/python/WMCore/FwkJobReport/Report.py

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py

cmsdmwmbot · 2023-09-27T16:56:16Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests added
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 1 warnings
- 70 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 133 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14526/artifact/artifacts/PullRequestReport.html

amaltaro

@vkuznet Valentin, I left a few comments along the code.

In addition to those comments:

- please check the report from Jenkins and take the necessary actions. New modules are supposed to be clean of any suggestions, unless there is no simple way to do that.
- one thing that is not desirable IMO is that we will have many metrics being provided twice, under different categories. Perhaps we should make a commitment for 1 or 2 years ahead such that we can remove all the other metrics and provide only cmssw.

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py

amaltaro · 2023-09-27T20:14:35Z

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py

+def CMSSWMetrics():
+    """
+    Helper function to convert CMSSW_METRICS dict values to their data-types
+    :return: dictionary


We can infer from the code, but it would be helpful if you provided 2 or 3 lines of an example dictionary that would be returned.
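
As an illustration of the kind of example requested here, a minimal sketch of how a values-to-types conversion could look and what it would return. The sample keys are only the ones quoted elsewhere in this thread; the helper name `metrics_to_types` and the exact nesting are assumptions, not the code in CMSSWMetrics.py.

```python
# Hypothetical sample; only a few keys quoted in this thread are shown.
SAMPLE_METRICS = {
    "SystemMemory": {"Active(file)": 4004848, "KernelStack": 13760},
    "Timing": {"AvgEventTime": 34.1318},
}


def metrics_to_types(metrics):
    """Replace every leaf value with its Python type, keeping the nesting."""
    result = {}
    for key, value in metrics.items():
        if isinstance(value, dict):
            result[key] = metrics_to_types(value)
        else:
            result[key] = type(value)
    return result


TYPE_MAP = metrics_to_types(SAMPLE_METRICS)
# TYPE_MAP == {"SystemMemory": {"Active(file)": int, "KernelStack": int},
#              "Timing": {"AvgEventTime": float}}
```

A two- or three-line excerpt like the trailing comment is probably enough for the docstring.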

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py

amaltaro · 2023-09-27T20:15:01Z

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py

+        rdict[key] = ndict
+    return rdict
+
+def XrdMetrics(mdict):


We need a unit test for this function

amaltaro · 2023-09-27T20:16:30Z

src/python/WMCore/Services/WMArchive/DataMap.py

@@ -52,7 +54,8 @@
                                'readTotalMB': float,
                                'readTotalSecs': float,
                                'writeTotalMB': float,
-                                'writeTotalSecs': float}}
+                                'writeTotalSecs': float},
+                    'cmssw': CMSSWMetrics()}


Do I understand it right that this function would be executed only once during the lifetime of the ArchiveDataReporter, which is the component responsible for parsing the FJR and creating the WMArchive document?

I think it will be done once, but it is hard to say based on my current understanding of all the involved components. The parsing will be done by WMCore/FwkJobReport/XMLParser.py when it parses the provided CMSSW XML FJR file, i.e. it will be done in the WMCore/WMSpec/Steps/Executors/CMSSW.py module and whichever component calls it. But it does not happen in ArchiveDataReporter.

I was afraid it could happen for every single job, but grepping the code:

(py3.8) OR-FVFH37J7Q6LR:WMCore amaltar2$ grep -rI DataMap src/* | grep import
src/python/WMComponent/ArchiveDataReporter/ArchiveDataPoller.py:from WMCore.Services.WMArchive.DataMap import createArchiverDoc

my understanding is that the map will be created once, the moment the ArchiveDataReporter component gets started.
In that case, everything is fine.
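
The "created once" reasoning relies on standard Python behavior: a module-level statement runs when the module is first imported, not on every call into it. A small self-contained sketch of that behavior follows; `build_metric_types`, `known_sections` and the counter are hypothetical stand-ins, not WMCore code.

```python
BUILD_CALLS = {"count": 0}  # hypothetical counter, for illustration only


def build_metric_types():
    """Stand-in for a helper such as CMSSWMetrics(); returns a type map."""
    BUILD_CALLS["count"] += 1
    return {"Timing": {"AvgEventTime": float}}


# Module-level statement: runs once, when the module is first imported.
PERFORMANCE_TYPE = {"cpu": {"AvgEventTime": float},
                    "cmssw": build_metric_types()}


def known_sections(perf_dict):
    """Per-job call: only reads PERFORMANCE_TYPE, never rebuilds it."""
    return sorted(key for key in PERFORMANCE_TYPE if key in perf_dict)


# Simulate many jobs going through the same long-lived process.
for _ in range(1000):
    sections = known_sections({"cmssw": {}, "storage": {}})

print(BUILD_CALLS["count"], sections)
```

The builder is invoked a single time no matter how many jobs are processed, as long as the importing process stays alive.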

amaltaro · 2023-09-27T20:18:38Z

src/python/WMCore/Services/WMArchive/DataMap.py

@@ -207,6 +211,10 @@ def typeCastPerformance(performDict):
    for key in PERFORMANCE_TYPE:
        if key in performDict:
            for param in PERFORMANCE_TYPE[key]:
+                if key == 'cmssw':
+                    # so far we skip validation of cmssw metrics values
+                    newPerfDict[key] = performDict[key]


If these are new metrics to be provided to the monitoring system, why do we skip this validation?

One of my fears is that, once we start validating it, we get different metric data types for different CMSSW releases. If that happens, this will crash until someone looks and resolves the issue. Perhaps the way to deal with this would be to catch an exception and simply provide the value as is, thus avoiding human action.

This is why I skipped validation, and in fact I do not know how to properly do it.
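
One possible shape for the "catch an exception and provide the value as is" idea, as a sketch only. The function names `cast_or_keep` and `cast_metrics` are hypothetical and this is not what DataMap.py does today; it assumes the type map mirrors the nesting of the metrics dictionary.

```python
def cast_or_keep(value, expected_type):
    """Try to cast; on failure return the original value untouched."""
    try:
        return expected_type(value)
    except (TypeError, ValueError):
        return value


def cast_metrics(values, types):
    """Recursively cast values using a (possibly incomplete) type map."""
    result = {}
    for key, value in values.items():
        expected = types.get(key)
        if isinstance(value, dict) and isinstance(expected, dict):
            # Both sides are nested: descend one level.
            result[key] = cast_metrics(value, expected)
        elif expected is None or isinstance(expected, dict):
            # Unknown metric or shape mismatch: pass through as is.
            result[key] = value
        else:
            result[key] = cast_or_keep(value, expected)
    return result


TYPES = {"Timing": {"AvgEventTime": float, "NumberOfThreads": int}}
RAW = {"Timing": {"AvgEventTime": "34.1318", "NumberOfThreads": "n/a"},
       "NewSection": {"Anything": "1"}}
CLEAN = cast_metrics(RAW, TYPES)
```

With this approach a metric that changes type in a new CMSSW release would not crash the component, at the cost of possibly mixed types in the output document.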

vkuznet · 2023-09-28T12:29:31Z

Alan, regarding clean code in the new module: the pep8 errors come from the non-compliant indentation of the CMSSW_METRICS dict I put into the code:

CMSSW_METRICS = {
        "SystemMemory": {
            "Active(file)": 4004848,
            "KernelStack": 13760,
....

Of course I can manually try to align all parts (460 lines in total), but I got it from a JSON file and parsed it with the jq tool to create a nicely indented structure. This is a typical example of Python not having any specific (fixed) standard for code indentation, so every tool can dictate it differently. If you want me to align 460 lines I'll do it, but it will make no sense if we change this dict again in the future. Maybe we should put it into a file and load it somehow, but in that case the file must be present at run time. I can put it into a separate file and add an explicit comment about the pep8 complaints if you think it will be useful. That said, I'm open to suggestions, but as I said, in this particular case the pep8 comments make little sense to me.

cmsdmwmbot · 2023-09-28T12:42:02Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 tests added
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 1 warnings
- 71 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 131 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14529/artifact/artifacts/PullRequestReport.html

amaltaro · 2023-09-28T12:50:55Z

@vkuznet Valentin, my IDE is not completely in sync with the rules used by jenkins pylint.
But I pulled the file provided in this PR, reformatted it in my IDE and made it back available in:
https://amaltaro.web.cern.ch/amaltaro/forValentin/CMSSWMetrics.py

can you fetch it and make a new commit with all its changes to see how it goes?

cmsdmwmbot · 2023-09-28T13:13:10Z

Jenkins results:

Python3 Unit tests: failed
- 5 new failures
- 2 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 1 warnings
- 71 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 95 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14531/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-09-28T13:18:39Z

Alan, here are the pep8 failures for your IDE-formatted file:

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py
Line 55, E133 closing bracket is missing indentation
Line 61, E133 closing bracket is missing indentation
Line 66, E133 closing bracket is missing indentation
Line 72, E133 closing bracket is missing indentation
....

amaltaro · 2023-09-28T13:34:05Z

Haa, I found which option I was missing in the IDE. Valentin, if you don't mind, can you please fetch an updated file from
https://amaltaro.web.cern.ch/amaltaro/forValentin/CMSSWMetrics.py
and try it again? Thanks

cmsdmwmbot · 2023-09-28T14:45:06Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 tests no longer failing
- 2 tests added
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 1 warnings
- 71 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 52 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14532/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-09-28T15:17:04Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests deleted
- 2 tests no longer failing
- 2 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 5 warnings and errors that must be fixed
- 33 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 193 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14533/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-09-28T15:41:13Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 tests no longer failing
- 2 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 1 warnings
- 71 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 48 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14534/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-09-28T16:35:31Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 tests no longer failing
- 2 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 1 warnings
- 71 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 47 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14536/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-09-28T16:47:14Z

Alan, after applying your latest IDE-formatted file and a few more iterations, I got rid of most of the pep8 errors (except those in the pre-existing DataMap code). That said, I think this PR is ready for final review.

amaltaro

Valentin, the code looks good in general. However, I fear that we will be providing inconsistent data types in the output document.

The method named typeCastPerformance does data casting for everything but cmssw metrics. In addition, AFAIR the FJR has a string type for boolean data, which means that anything like "false", "False" would be evaluated to True.

I think it would be really helpful to get an FJR - one that would be parsed by ArchiveDataReporter - and have it go through the DataMap logic to see what the outcome would be. Are you able to provide both the input and output documents?
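
The boolean concern comes from Python's truth-value rules: any non-empty string is truthy, so `bool("False")` is `True`. A minimal sketch of a stricter conversion follows; `parse_bool` and its list of accepted spellings are assumptions for illustration, not an existing WMCore helper.

```python
def parse_bool(value):
    """Convert common string spellings of a boolean into a real bool."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "1", "yes"):
            return True
        if text in ("false", "0", "no", ""):
            return False
    raise ValueError("cannot interpret %r as a boolean" % (value,))


# Naive casting treats every non-empty string as True.
NAIVE = bool("False")
# The explicit parser maps the string to the intended value.
STRICT = parse_bool("False")
```

Here `NAIVE` ends up `True` while `STRICT` is `False`, which is the inconsistency a plain `bool` entry in a type map would introduce.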

vkuznet · 2023-09-28T19:06:45Z

Alan, yes I can easily generate JSON report from XML using this piece of code:

#!/usr/bin/env python

import pprint

from WMCore.FwkJobReport.Report import Report

xmlXrd = "./test/python/WMCore_t/FwkJobReport_t/CMSSWJobReportXrdSiteStatistics.xml"
report = Report('cmsRun1')
report.parse(xmlXrd)
cmsRun = getattr(report.data, 'cmsRun1', {})
performance = getattr(cmsRun, 'performance', {})
cmssw = getattr(performance, 'cmssw', {})
rdict = cmssw.dictionary_whole_tree_()

pprint.pprint(rdict)

You can find the JSON report over here; as you can see, it uses the test/python/WMCore_t/FwkJobReport_t/CMSSWJobReportXrdSiteStatistics.xml file, which you can look up in the repository.

amaltaro · 2023-09-28T20:28:42Z

Given that you have all this code in your repository, can you please generate the output file for ALL of those performance sections? They are: ["storage", "memory", "cpu", "multicore", "cmssw"]. Only then can we have a real comparison of the duplicate attributes.

vkuznet · 2023-09-28T21:06:25Z

Alan, I updated the report and now it contains all performance metrics.

amaltaro · 2023-09-28T21:41:15Z

Thanks Valentin. I was trying to figure out why the cmssw metrics have been cast while the others weren't, but I just noticed that this has been implemented in the other PR: #11663

Here is one example metric

  "cmssw": {
           "Timing": {"AvgEventTime": 34.1318,

versus

 "cpu": {"AvgEventTime": "34.1318",

That said, the cmssw metrics have been correctly provided from the beginning and I think this PR is good to go!

amaltaro · 2023-09-28T21:45:50Z

@vkuznet Valentin, in the coming days, can you please create a new GH issue to remove the now duplicate sections from the FJR and WMArchive documents? Please provide an attachment/link/reference to the input FJR:
https://raw.githubusercontent.com/dmwm/WMCore/36ee6d26cf2dd25e2125ee56ab163369d466b642/test/python/WMCore_t/FwkJobReport_t/CMSSWJobReportXrdSiteStatistics.xml
and to the output WMArchive document
https://gist.github.com/vkuznet/bb1a2bbd9e2cbd5dbc534c51c7ccd58e

vkuznet added the PR: Work in progress label Aug 28, 2023

vkuznet self-assigned this Aug 28, 2023

vkuznet requested a review from amaltaro August 28, 2023 18:54

amaltaro reviewed Sep 18, 2023

vkuznet force-pushed the fix-issue-11538-wma-v2 branch from 7542509 to 47e4de2 Compare September 27, 2023 11:44

vkuznet removed the PR: Work in progress label Sep 27, 2023

vkuznet requested review from amaltaro and todor-ivanov September 27, 2023 14:32

vkuznet mentioned this pull request Sep 27, 2023

CMSSW metrics for FWJR #11663

Merged

todor-ivanov approved these changes Sep 27, 2023

src/python/WMCore/FwkJobReport/Report.py (outdated)

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py

src/python/WMCore/Services/WMArchive/CMSSWMetrics.py (outdated)

amaltaro requested changes Sep 27, 2023

vkuznet requested a review from amaltaro September 28, 2023 13:13

vkuznet force-pushed the fix-issue-11538-wma-v2 branch from e834435 to 1e9a3ed Compare September 28, 2023 15:06

vkuznet force-pushed the fix-issue-11538-wma-v2 branch from 1e9a3ed to ec85ae7 Compare September 28, 2023 15:30

vkuznet added 2 commits September 28, 2023 12:27

Add CMSSW performance metrics to WMArchvie document

73a5055

New unit test for CMSSW peformance metrics in WMArchive document

08f293c

vkuznet force-pushed the fix-issue-11538-wma-v2 branch from ec85ae7 to 08f293c Compare September 28, 2023 16:27

amaltaro reviewed Sep 28, 2023

amaltaro approved these changes Sep 28, 2023

amaltaro merged commit 992a2f5 into dmwm:master Sep 28, 2023
3 of 4 checks passed

amaltaro mentioned this pull request Sep 29, 2023

Review and improve CMSSW I/O metric reporting #11538

Closed
