Update workqueue stats as new blocks are added to the workflow #11135

Merged
3 commits merged on May 12, 2022

Conversation

amaltaro
Contributor

@amaltaro amaltaro commented May 6, 2022

Fixes #11129

Status

testing

Description

Short description of the changes provided within this PR:

  • the growing global workqueue stats didn't need any modifications; that code is already supposed to compute incremental stats
  • added support for OpenRunningTimeout to StepChain and TaskChain workflow specs
  • fixed a mistake from the previous PR: OpenRunningTimeout is now properly retrieved from the StartPolicy dictionary
  • updated a few test workflow JSON templates to remain open for 2h (even though their input dataset is not growing).
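
To illustrate the third bullet, here is a minimal sketch of reading the timeout from a start-policy dictionary. The helper name `getOpenRunningTimeout`, the `policyName` key and the dictionary layout are assumptions made for this example; they are not taken from the WMCore code changed in this PR.

```python
def getOpenRunningTimeout(startPolicyArgs, default=0):
    """
    Hypothetical helper: read OpenRunningTimeout (in seconds) from a
    start-policy dictionary, falling back to `default` when the key is
    missing or the value cannot be converted to an integer.
    """
    value = (startPolicyArgs or {}).get("OpenRunningTimeout", default)
    try:
        timeout = int(value)
    except (TypeError, ValueError):
        return default
    # Negative timeouts make no sense; treat them as "not open"
    return max(timeout, 0)


# Assumed dictionary layout, for illustration only
policy = {"policyName": "Block", "OpenRunningTimeout": 7200}
print(getOpenRunningTimeout(policy))
print(getOpenRunningTimeout({}))
```

Returning a default of 0 when the key is absent keeps workflows without the parameter closed, which is the conservative choice for a sketch like this one.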

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 7 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13164/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 9 warnings
    • 97 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13181/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 28 new failures
    • 7 tests added
    • 10 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 9 warnings
    • 97 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13182/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 7 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 28 warnings and errors that must be fixed
    • 12 warnings
    • 182 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13183/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 28 warnings and errors that must be fixed
    • 12 warnings
    • 182 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13186/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 28 warnings and errors that must be fixed
    • 12 warnings
    • 182 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13187/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

Valentin, Todor, even though I am still trying to run a real test with a growing dataset, I'd appreciate any feedback that you might have on this PR. The goal is to get it deployed in testbed tomorrow morning CERN time (meaning, we need to merge it before I go to bed :))

Contributor

@vkuznet vkuznet left a comment

Changes to the code are fine, but I have a question about choice of OpenRunningTimeout. Why 7200 (2h)? Also, I don't know what a good place to document it would be, since you make this the default in JSON. Maybe you can add default values to src/python/WMCore/ReqMgr/Tools/cms.py, but it does not describe other defaults either.

@amaltaro
Contributor Author

Changes to the code are fine, but I have a question about choice of OpenRunningTimeout. Why 7200 (2h)?

That's just a random number that came to mind. It has to be small enough that:

  • it does not delay test workflow execution
  • and it still allows us to test the global workqueue logic of adding work to workflows that have already been split
    Note that it goes into the test templates, so it's definitely not used anywhere in our production system.
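
As a rough sketch of what a 2h window means in practice, the check below decides whether a workflow would still accept new input. It assumes the timeout is counted in seconds from a reference timestamp (for instance, the last time work was added); the function name `isStillOpen` and that reference point are assumptions for illustration, not the actual workqueue logic.

```python
import time


def isStillOpen(referenceTime, openRunningTimeout, now=None):
    """
    Hypothetical check: True while fewer than `openRunningTimeout`
    seconds have elapsed since `referenceTime` (epoch seconds).
    A timeout of 0 (or less) means the workflow never stays open.
    """
    if openRunningTimeout <= 0:
        return False
    now = time.time() if now is None else now
    return (now - referenceTime) < openRunningTimeout


# With the 7200s value used in the test templates:
start = 1_000_000
print(isStillOpen(start, 7200, now=start + 3600))   # one hour in
print(isStillOpen(start, 7200, now=start + 7200))   # window just elapsed
```

Passing `now` explicitly keeps the function easy to test without waiting for real time to pass.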

Maybe you can add default values to src/python/WMCore/ReqMgr/Tools/cms.py, but it does not describe other defaults either.

Regarding this module, I think it's used for the ReqMgr2 Web UI. There are other things that we should modify in there, including the addition/removal of workflow parameters from the web form. Given that the Web UI is barely used, I think we can address that in the future, when we also work on other workflow fields that are no longer relevant. The important thing is that we stop mentioning that this parameter has been deprecated :-D

@amaltaro
Contributor Author

I am still unable to see progress with my real test. The problem is likely coming from the stuck JobAccountant components spotted today. I will keep testing it during the day tomorrow (Thursday), and hopefully new blocks will be appended to the input dataset I am using in my tests.

I do see, though, the global workqueue trying to add new work for such open running requests, now regardless of which spec type it is.
As for the global stats update, I explained in the GH issue that it should already be covered by the current code.

@amaltaro amaltaro merged commit be2300f into dmwm:master May 12, 2022
@amaltaro
Contributor Author

Now that I managed to get a test with a growing dataset running, I can confirm that:

  1. new blocks get added as they are made available in DBS/Rucio
  2. and that the workflow statistics are properly incremented

An enhancement here would be not to trigger a CouchDB document update when there is nothing to be updated. From the ReqMgr2 logs, I see PUT requests like this:

[14/May/2022:02:03:14]  Updated workqueue statistics of "amaltaro_TC_Growing_May2022_Val_220513_033129_3991", with:  {'total_jobs': 0, 'input_events': 0, 'input_lumis': 0, 'input_num_files': 0}

every few minutes (5min, but it really depends on the load in the system). I'm going to make another GH issue later today to keep track of this.
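
One way such an enhancement could look is sketched below: the incremental stats are only applied, and a document update only signalled, when at least one counter is non-zero. The function names `hasStatsToUpdate` and `incrementStats` are hypothetical and the dictionaries merely mirror the keys seen in the log line above; this is not the ReqMgr2 implementation.

```python
def hasStatsToUpdate(delta):
    """Hypothetical guard: True if any incremental counter is non-zero."""
    return any(value for value in delta.values())


def incrementStats(current, delta):
    """
    Hypothetical merge: add `delta` counters to `current` and report
    whether anything changed, so the caller can skip the database write.
    Returns a (stats, changed) tuple and never mutates its inputs.
    """
    if not hasStatsToUpdate(delta):
        return dict(current), False
    updated = dict(current)
    for key, value in delta.items():
        updated[key] = updated.get(key, 0) + value
    return updated, True


# The all-zero payload from the log would be skipped:
noop = {'total_jobs': 0, 'input_events': 0, 'input_lumis': 0, 'input_num_files': 0}
stats, changed = incrementStats({'total_jobs': 10}, noop)
print(changed)
```

The caller would issue the PUT only when `changed` is true, which avoids the repeated no-op document updates.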

Successfully merging this pull request may close these issues.

Support incrementing workqueue stats in the request spec
3 participants