Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure aborted/force-complete workflows leave no elements behind #11113

Merged
merged 3 commits into from
Apr 27, 2022

Conversation

amaltaro
Copy link
Contributor

@amaltaro amaltaro commented Apr 25, 2022

Fixes #11112

Status

not-tested

Description

First commit contains a script to be used to cancel WQEs at the global workqueue level. It can be used like:

python3 kill-workflow-in-global <your_workflow_name>

Second commit contains changes to how aborted/force-complete workflows are dealt with in the ReqMgr2 CherryPy thread responsible for status transition. Including:

  • if such workflows contains available/active WQEs, then a cancelWorkflow request is triggered once again
  • global workqueue state can now return an aborted status, which in this context means that all the WQEs are sitting in CancelRequested state
  • don't (re-)set WQEs in CancelRequested to the same state

NOTE: the logic to avoid pulling work down for aborted workflows would demand a substantial amount of changes, so I will skip it and rely only on the cherrypy thread mechanism.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13072/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro changed the title Create script to cancel GQEs Ensure aborted/force-complete workflows leave no elements behind Apr 25, 2022
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 7 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 25 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13073/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 2 warnings
    • 38 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13074/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
  • Python3 Pylint check: succeeded
    • 2 warnings
    • 38 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13076/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@vkuznet vkuznet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find very confusing which naming convention style should be used for states. On one side you have all lower case states like, aborted, running-open, on another within the code you use Camel case, like Done, Running. I think I understand that lower case used in WQ DB records, while Camel case within the code, but such convention can lead to simple mistake when somewhere in a code someone will use different style.

Apart from that the changes are fine.

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amaltaro for providing those changes. I have left two minor comments inline. You may want to take a look. Other than that the code looks good to me.


# Workflow state transition automatically controlled by ReqMgr2
# Specific to workflows either aborted or force-completed
CANCEL_AUTO_TRANSITION = {"aborted": "aborted-completed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't those "allowed" status transitions (here aborted-completed) be enclosed in a list instead of pure strings. I'd say a list here would be better at least for staying consistent with the rest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could, but there is no real need to have a list because it's a single transition (instead of chained transition for some of the previous ones). This makes the source code slightly more clear as well, so I think it's a reasonable tradeoff.

@@ -37,11 +37,15 @@ def convertWQElementsStatusToWFStatus(elementsStatusSet):
running = set(["Running"])
runningOpen = set(["Available", "Negotiating", "Acquired"])
runningClosed = set(["Running", "Done", "Canceled"])
aborted = set(["CancelRequested", "Done", "Canceled", "Failed"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if putting the CancelRequested status under the set of statuses covered by the aborted does not make any difference in the note from above:

NOTE: CancelRequested status is a transient status and it should not trigger
any request status transition (thus, None gets returned).

I mean it could now trigger a request status transition (IINM) because of line 48 from bellow, which says:

    if elementsStatusSet == forceCompleted:  # all WQEs in CancelRequested
        return "aborted"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I definitely need to update the docstring.

This aborted state here should not trigger any status transition, as implemented:
https://github.com/dmwm/WMCore/pull/11113/files#diff-b42637227c688d15bfc68058c1005999390006897e0abc07b1a8b6989039afb1R18
and
https://github.com/dmwm/WMCore/pull/11113/files#diff-b42637227c688d15bfc68058c1005999390006897e0abc07b1a8b6989039afb1R62

but I had to implement it to be able to differentiate options that were returning a None value.


# get configuration file path
if "WMAGENT_CONFIG" not in os.environ:
os.environ['WMAGENT_CONFIG'] = '/data/srv/wmagent/current/config/wmagent/config.py'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well this is just a comment here, no need to be changed with this PR (because it is in bin and supposedly the script should be used for manual workflow cancellation). But in any case, I think we should avoid hard coding a global path inside those scripts. We should rather stick to relative paths starting from the root of the deployment destination, which may change in the future (e.g if/once we move the wmagent deployment procedure under a virtual environment with the freedom to deploy anywhere but not just at /data/srv)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WMAgent ROOT directory would have to be defined within the WMAgent/WMCore setup as well, and I think if WMAGENT_CONFIG isn't set, changes that the ROOT_DIR (or whatever name that is) would not be set as well.

This is a fallback though, so it should be good enough until we come back and update all of these bin/ scripts (I am pretty sure we have a few with such hard-coded path). Please keep it in mind as we get closer to the virtual environment setup ;)

Copy link
Contributor

@todor-ivanov todor-ivanov Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep it in mind as we get closer to the virtual environment setup ;)

Indeed, this was the reason I mentioned it. No change needed right now.

@amaltaro
Copy link
Contributor Author

@vkuznet yes, I completely second your comment. As you noted, we have many structs/objects in WMCore and they have a different set of status (sometimes called states) and some of those were implemented following a different convention.

Yes, WorkQueue element properties and status are all upper CamelCase. This code is basically deciding what status should the request be set, according to its aggregation of workqueue element statuses.

Todor, Valentin, the aborted word there is probably not the best choice to be honest, since it covers both aborted and force-completed workflows. I'm going to rename it to canceled instead (even though that will require changes everywhere in my changes, sorry!).

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 2 warnings
    • 38 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13084/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow up @amaltaro
the PR looks good to me.

… workflows

CancelRequested plus final state should be considered as aborted

renamed constant for work cancelation

rename workqueue aggregate status from aborted to canceled
update unit tests
@amaltaro
Copy link
Contributor Author

Thanks for the re-review Todor. This needs to be further tested, but instead of running the whole deployment/workflow in my VMs, I will hand this over to Imran for a testbed deployment tomorrow.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 2 warnings
    • 38 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13090/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro merged commit f8478de into dmwm:master Apr 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workflows moved from staged to aborted could leave elements behind
4 participants