-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide timestamps metrics about WM job #11656
Conversation
test this please |
1 similar comment
test this please |
Jenkins results:
|
test this please |
Jenkins results:
|
Jenkins results:
|
@amaltaro this PR is ready for review. One thing, please checkout line https://github.com/dmwm/WMCore/pull/11656/files#diff-a9065684e3919aaf09e74699c1005bd636968852babd4031429723a206788771R218 and tell me if I'm correct to use Line 232 in 208a644
$outputFile is ever used.
|
@amaltaro , additional (independent) WMArchive changes can be found in #11659 and dmwm/WMArchive#360 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkuznet Valentin, from a quick look, these changes look fine to me.
However, this is not what has been requested in the ticket and IMO it adds unnecessary complication. We should not record start and end time, but only provide the wallclock time for the overall job, aka the elapse time.
From the CMSSW FJR, they provide a very simple metric:
<Metric Name="TotalJobTime" Value="565.481"/>
and IMO this is what we should do from the WMAgent job wrapper as well.
In addition to that, I think it would be good to pull out a production document uploaded to WMArchive to see what is already there, potentially some timestamps are already provided.
Jenkins results:
|
@amaltaro , I adjusted the code to provide only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkuznet thanks for providing those changes. I left a few more comments along the code.
About the timing section, I think we can use that section for all the timings that we are going to provide, so it's likely better to keep it around.
I missed your previous question about $outputFile
and I am not 100% certain of the answer. But I think that each report inside the cmsRun area is in the end parsed and the relevant information makes it to this job report (still in WN runtime code). But we will have to test these changes once we are happy with them.
2079626
to
45f509e
Compare
Jenkins results:
|
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkuznet please have a look at the Jenkins results, you have errors in the code that need to be fixed. As you are providing new modules, ideally all their pylint/pycodestyle should come back clean as well, as noted in the contribution guidelines.
Regarding the new module, I am still not sure whether it should go under the Utils package or not, but maybe under this package would be a good commitment:
https://github.com/dmwm/WMCore/tree/master/src/python/WMCore/WMRuntime/Scripts
don't forget to replicate the relevant changes to the unit tests location as well.
I forgot to mention, I believe the next steps are:
|
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
If disk is full, then you cannot create the new pickle file either. If permissions are wrong, then it would have failed in previous steps that write that report. In addition, it is created by the user running the job, so there are no expected permission issues. On what concerns location of the script, as we have specific sub-packages under WMRuntime, my preference is to use the |
Alan, the disk can become full once you write first pickle file such that there is no room to write another one since its size will be bigger than original file. In other words, the |
And, I relocated |
Yes, and if disk becomes full after you write the first pickle file, it would of course fail to write the pickle.new that you implemented in the script. In other words, there is no extra protection offered by creating a new file + move compared to updating the current pickle. |
Jenkins results:
|
Alan, I don't want to argue more, but it is a possibility on system which support multiple users or jobs that when I write first file and second file then I'll not have disk space to overwrite original file, and there is a possibility that during a time when I'll overwrite existing file the disk will not be filled with other stuff by another user or process. In other words, the original file may be lost and this is what I try to avoid by writing completely new file and only then (when both are written and safe) to overwrite original file. If you still think that few lines of code is worth to remove to make code simpler but still open a possibility to lost original file I'll be happy to remove them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkuznet in addition to the comments along the code, please also consider the pycodestyle report.
#!/usr/bin/env python | ||
""" | ||
_Timestamps_ module designed to insert proper WM timing into WMCore FJR. | ||
Usage: python3 Timestamps.py --reportFile=$outputFile --wmJobTime=$wmJobTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update usage with the start/end new parameters.
Jenkins results:
|
test this please |
Jenkins results:
|
Alan, I resolved issues you pointed out and adjusted code with pep8 suggestions. Now it is clean. Please have a final review of this PR. |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valentin, these changes are looking good as well.
I have to repeat though that we are still missing the conversion from FJR to WMArchive document, such that we can publish it to WMArchive. Please add the relevant changes to this PR. If you prefer, feel free to leave those in a separate commit.
It needs to be squashed and initial description updated. |
f27210f
to
7c29887
Compare
Jenkins results:
|
Jenkins reported the following failure:
which should not fail IMO. You might want to reproduce it tomorrow with the docker unit test image. |
Alan, I looked at jenkins log output and it complains about test provided in #11665 in particular at this line: |
Fixes #11604
Status
In development
Description
Provides timestamps metrics, job wall clock time, etc. (in GMT and UNIX since epochs data-format), to WMCore FJR (Report.pkl) file during job execution script
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
#11575 , #11659
External dependencies / deployment changes