Provide timestamps metrics about WM job #11656

Merged (2 commits) on Aug 23, 2023

Conversation

vkuznet
Contributor

@vkuznet vkuznet commented Jul 12, 2023

Fixes #11604

Status

In development

Description

Adds timestamp metrics (job wall-clock time, etc., in both GMT and UNIX seconds-since-epoch format) to the WMCore FJR (Report.pkl) file while the job execution script runs.
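
As a minimal sketch of the two formats mentioned in the description, the snippet below renders one instant both as UNIX epoch seconds and as a GMT string. The function name `timestamp_pair`, the dictionary keys, and the ISO-like string layout are assumptions made for illustration; the PR's actual field names and format are not shown in this conversation.

```python
import time
from datetime import datetime, timezone


def timestamp_pair(epoch_seconds=None):
    """Return one instant as UNIX epoch seconds and as a GMT string.

    Hypothetical helper: the key names and string layout are illustrative only.
    """
    if epoch_seconds is None:
        epoch_seconds = time.time()
    # Build a timezone-aware datetime in UTC (GMT) from the epoch value.
    gmt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {
        "epoch": int(epoch_seconds),
        "gmt": gmt.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


print(timestamp_pair(0))  # {'epoch': 0, 'gmt': '1970-01-01T00:00:00Z'}
```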

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

#11575 , #11659

External dependencies / deployment changes

@vkuznet vkuznet self-assigned this Jul 12, 2023
@vkuznet vkuznet changed the title from "Fix issue 11604" to "Provide timestamps metrics about WM job" Jul 12, 2023
@vkuznet
Contributor Author

vkuznet commented Jul 12, 2023

test this please

1 similar comment
@amaltaro
Contributor

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 4 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14325/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Jul 13, 2023

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14326/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14327/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro July 13, 2023 13:03
@vkuznet
Contributor Author

vkuznet commented Jul 13, 2023

@amaltaro this PR is ready for review. One thing: please check the line https://github.com/dmwm/WMCore/pull/11656/files#diff-a9065684e3919aaf09e74699c1005bd636968852babd4031429723a206788771R218 and tell me whether I am correct to use $outputFile in the submit script, or whether the report file is defined differently. I ask because this script only performs a touch of this file, while later on it copies the report using an explicit name:

cp WMTaskSpace/Report*.pkl ../

Therefore, I'm curious whether $outputFile is ever used.

@vkuznet
Contributor Author

vkuznet commented Jul 13, 2023

@amaltaro , additional (independent) WMArchive changes can be found in #11659 and dmwm/WMArchive#360

Contributor

@amaltaro amaltaro left a comment

@vkuznet Valentin, from a quick look, these changes look fine to me.
However, this is not what was requested in the ticket, and IMO it adds unnecessary complication. We should not record start and end times, but only provide the wall-clock time for the overall job, i.e. the elapsed time.

From the CMSSW FJR, they provide a very simple metric:

    <Metric Name="TotalJobTime" Value="565.481"/>

and IMO this is what we should do from the WMAgent job wrapper as well.
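
For illustration, a metric of the form quoted above can be read with the standard library XML parser. This is only a sketch against the single-element snippet shown in this comment: the enclosing structure of a real CMSSW FJR is not shown here, and `metric_value` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The one-line metric quoted in the comment above.
SAMPLE = '<Metric Name="TotalJobTime" Value="565.481"/>'


def metric_value(xml_text, name):
    """Return the float Value of the Metric element called `name`, or None."""
    root = ET.fromstring(xml_text)
    # Element.iter() also yields the root itself when its tag matches,
    # so a bare <Metric/> snippet and a nested document both work.
    for elem in root.iter("Metric"):
        if elem.get("Name") == name:
            return float(elem.get("Value"))
    return None


print(metric_value(SAMPLE, "TotalJobTime"))  # 565.481
```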

In addition to that, I think it would be good to pull a production document uploaded to WMArchive and see what is already there; some timestamps may already be provided.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14332/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Jul 14, 2023

@amaltaro, I adjusted the code to provide only TotalWallClockTime under Timing. But this raises a question for me. Originally I created the Timing section to hold 4 metrics; now that we have only one, does it make sense to have a Timing section with a single metric? Maybe we should put TotalWallClockTime at the top level of the document. Please provide your feedback so that I can correct the codebase.

Contributor

@amaltaro amaltaro left a comment

@vkuznet thanks for providing those changes. I left a few more comments along the code.

About the timing section, I think we can use that section for all the timings that we are going to provide, so it's likely better to keep it around.
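
The design choice of keeping a dedicated section can be sketched with a plain dictionary standing in for the report. This is an assumption-laden illustration: the real FJR is a WMCore report object stored in Report.pkl, whose API is not shown in this conversation; `add_timing` and `HypotheticalOtherTime` are made-up names, and only `Timing` and `TotalWallClockTime` come from the discussion.

```python
def add_timing(report, **metrics):
    """Merge timing metrics into a 'Timing' section of a dict-like report.

    Hypothetical stand-in: `report` is a plain dict, not the WMCore Report class.
    """
    # Create the section on first use, then add or overwrite metrics in it.
    timing = report.setdefault("Timing", {})
    timing.update(metrics)
    return report


report = {}
add_timing(report, TotalWallClockTime=565)
# A later metric lands in the same section without changing the layout.
add_timing(report, HypotheticalOtherTime=1.5)
print(report)
```

Keeping the section means consumers look in one place for every timing value, whether there is one metric or several.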

I missed your previous question about $outputFile and I am not 100% certain of the answer. But I think that each report inside the cmsRun area is in the end parsed and the relevant information makes it to this job report (still in WN runtime code). But we will have to test these changes once we are happy with them.

etc/submit_py3.sh (2 outdated review threads, resolved)
src/python/WMCore/WMRuntime/Timestamps.py (2 outdated review threads, resolved)
@vkuznet vkuznet force-pushed the fix-issue-11604 branch 2 times, most recently from 2079626 to 45f509e Compare July 14, 2023 14:48
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14335/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14336/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro July 14, 2023 15:03
Contributor

@amaltaro amaltaro left a comment

@vkuznet please have a look at the Jenkins results, you have errors in the code that need to be fixed. As you are providing new modules, ideally all their pylint/pycodestyle should come back clean as well, as noted in the contribution guidelines.

Regarding the new module, I am still not sure whether it should go under the Utils package or not, but maybe placing it under this package would be a good compromise:
https://github.com/dmwm/WMCore/tree/master/src/python/WMCore/WMRuntime/Scripts

Don't forget to replicate the relevant changes in the unit tests location as well.

@amaltaro
Contributor

I forgot to mention, I believe the next steps are:

  • test these changes at runtime
  • look into ArchiveDataReporter to start publishing this new metric, otherwise it will be contained only in the agent.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 2 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14340/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14341/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14409/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro August 14, 2023 18:52
@amaltaro
Contributor

If disk is full, then you cannot create the new pickle file either. If permissions are wrong, then it would have failed in previous steps that write that report. In addition, it is created by the user running the job, so there are no expected permission issues.

Regarding the location of the script: as we have specific sub-packages under WMRuntime, my preference is to use the Scripts sub-package. Please see this comment for specifics:
#11656 (comment)

@vkuznet
Contributor Author

vkuznet commented Aug 14, 2023

Alan, the disk can become full after you write the first pickle file, leaving no room to write another one, since the new file will be bigger than the original. In other words, the Unix mv command first writes the new file and then deletes the original one. I'm following normal UNIX practice and logic here.

@vkuznet
Contributor Author

vkuznet commented Aug 14, 2023

Also, I relocated the Timestamps.py module to the Scripts area per your request.

@amaltaro
Contributor

Yes, and if disk becomes full after you write the first pickle file, it would of course fail to write the pickle.new that you implemented in the script. In other words, there is no extra protection offered by creating a new file + move compared to updating the current pickle.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14411/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Aug 14, 2023

Alan, I don't want to argue further, but on a system that supports multiple users or jobs it is possible that, after I write the first and second files, I will not have the disk space to overwrite the original file; there is no guarantee that, while I am overwriting the existing file, the disk will not be filled with other data by another user or process. In other words, the original file may be lost, and this is what I try to avoid by writing a completely new file and only then (when both are written and safe) overwriting the original. If you still think it is worth removing a few lines of code to make the code simpler, while leaving open the possibility of losing the original file, I'll be happy to remove them.
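
The write-new-file-then-move approach debated in this thread can be sketched as follows. This is not the code from the PR, only a minimal illustration under stated assumptions: the report is treated as an ordinary picklable object, and `update_pickle_safely` is a hypothetical helper name. It relies on `os.replace`, which performs an atomic rename on POSIX when the temporary file and the target are on the same filesystem, so the original file is left untouched if writing the new copy fails.

```python
import os
import pickle
import tempfile


def update_pickle_safely(path, update):
    """Load a pickle, apply `update(data)` in place, and swap the file in.

    Hypothetical helper: the original file is only replaced after the new
    copy has been fully written and flushed to disk.
    """
    with open(path, "rb") as handle:
        data = pickle.load(handle)
    update(data)

    # Create the temporary file next to the target so the final rename
    # stays within one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".new")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Atomic on POSIX: readers see either the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        # A failed write (for example a full disk) leaves the original intact.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A caller would pass the report path and a function that mutates the loaded object, for example one that sets a timing entry. Both views in the thread remain valid under this sketch: a full disk still makes the update fail, but the failure cannot truncate the existing report.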

Contributor

@amaltaro amaltaro left a comment

@vkuznet in addition to the comments along the code, please also consider the pycodestyle report.

#!/usr/bin/env python
"""
_Timestamps_ module designed to insert proper WM timing into WMCore FJR.
Usage: python3 Timestamps.py --reportFile=$outputFile --wmJobTime=$wmJobTime
Contributor

Please update the usage string with the new start/end parameters.
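
A command-line interface matching the quoted usage line might be parsed as in the sketch below. Only `--reportFile` and `--wmJobTime` appear in this conversation; the option names `--wmJobStart` and `--wmJobEnd` are placeholders, since the actual names of the new start/end parameters are not shown here.

```python
import argparse


def parse_args(argv=None):
    """Parse options for a Timestamps-like script (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Insert WM timing information into a job report file")
    parser.add_argument("--reportFile", required=True,
                        help="path to the report pickle file")
    parser.add_argument("--wmJobTime", type=float, default=None,
                        help="total job wall-clock time in seconds")
    # Hypothetical names for the start/end parameters mentioned in the review.
    parser.add_argument("--wmJobStart", type=float, default=None,
                        help="job start time, UNIX epoch seconds (assumed name)")
    parser.add_argument("--wmJobEnd", type=float, default=None,
                        help="job end time, UNIX epoch seconds (assumed name)")
    return parser.parse_args(argv)


opts = parse_args(["--reportFile=Report.pkl",
                   "--wmJobStart=100", "--wmJobEnd=665.5"])
print(opts.reportFile, opts.wmJobEnd - opts.wmJobStart)  # Report.pkl 565.5
```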

test/python/WMCore_t/WMRuntime_t/Scripts_t/Timestamps_t.py (outdated review thread, resolved)
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14415/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Aug 15, 2023

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14417/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Aug 15, 2023

Alan, I resolved the issues you pointed out and adjusted the code following the pep8 suggestions. Now it is clean. Please do a final review of this PR.

@vkuznet vkuznet requested a review from amaltaro August 15, 2023 13:51
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14416/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

Valentin, these changes are looking good as well.

I have to repeat though that we are still missing the conversion from FJR to WMArchive document, such that we can publish it to WMArchive. Please add the relevant changes to this PR. If you prefer, feel free to leave those in a separate commit.

@amaltaro
Contributor

The commits need to be squashed and the initial description updated.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14440/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

Jenkins reported the following failure:

WMCore_t.WMSpec_t.Steps_t.Executors_t.CMSSW_t.CMSSW_t:testSubprocessTime was added with status failure. Must be fixed

which should not fail IMO. You might want to reproduce it tomorrow with the docker unit test image.

@amaltaro amaltaro merged commit b8905db into dmwm:master Aug 23, 2023
2 of 4 checks passed
@vkuznet
Contributor Author

vkuznet commented Aug 24, 2023

Alan, I looked at the Jenkins log output, and it complains about the test provided in #11665, in particular at this line:
https://github.com/dmwm/WMCore/pull/11665/files#diff-b66e4eefedb43252aea569121f9e099cf9166a8c732af9923b4568b89409e73dR264
It means that the subprocess didn't sleep (in the test I provided a sleep 1 command). I'm not sure how this happens, but I can adjust the test to use a larger interval or a different command (like ls, df or du), and/or relax the test condition to self.assertTrue(userTime1 >= userTime0). Let me know which of these solutions you prefer, and I can provide a separate PR for the test in question.
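
One possible reading of this failure, offered here as an assumption based only on the variable names userTime0 and userTime1: if the test compares CPU user time accumulated by child processes, a sleeping subprocess consumes almost no CPU, so the counter may not visibly advance even though wall-clock time does. The sketch below, which is not the test from #11665, shows the difference and why a non-strict comparison is the safer condition.

```python
import os
import subprocess
import sys
import time


def child_user_time():
    """CPU user time (seconds) consumed by waited-for child processes."""
    return os.times().children_user


wall0, user0 = time.monotonic(), child_user_time()
# The child mostly sleeps, so it burns very little CPU user time.
subprocess.run([sys.executable, "-c", "import time; time.sleep(0.2)"],
               check=True)
wall1, user1 = time.monotonic(), child_user_time()

print("wall-clock delta:", wall1 - wall0)
print("child user-time delta:", user1 - user0)

# Wall-clock time always advances by at least the sleep interval; child CPU
# time is only guaranteed not to decrease, hence >= rather than >.
assert wall1 - wall0 >= 0.2
assert user1 >= user0
```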

Development

Successfully merging this pull request may close these issues.

Timestamps for monitoring activities in WM job
3 participants