add utility for posting job events manually if required #5848
Oh good this will be helpful. Could the event just be an argument to the jobtap plugin so you just run
Eh, then you'd have to unload the plugin each time you want to use it, which would be annoying when you have those 10 stuck jobs to deal with. Never mind.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #5848 +/- ##
==========================================
+ Coverage 83.26% 83.28% +0.01%
==========================================
Files 511 513 +2
Lines 82694 82739 +45
==========================================
+ Hits 68858 68908 +50
+ Misses 13836 13831 -5
LGTM!
@@ -120,7 +120,8 @@ dist_fluxcmd_SCRIPTS = \
 flux-update.py \
 flux-imp-exec-helper \
 py-runner.py \
-flux-hostlist.py
+flux-hostlist.py \
+flux-post-job-event.py
typo in commit "finidh"
@@ -10,7 +10,7 @@ skip = [
 "src/bindings/python/flux/resource/__init__.py"]

 [tool.mypy]
-python_version = 3.6
+python_version = "3.6"
commit message typo "aarond"
Problem: Sometimes a job becomes stuck in a given state, requiring intervention by forcing an event to be posted to the job manager. Currently, this requires building a custom jobtap plugin, which is annoying and not sysadmin friendly. Add a builtin jobtap plugin that will simply post a custom job event defined in the RPC.
Problem: There's no convenient way to post job events manually when it's necessary. Add a new `flux post-job-event` command which sends an RPC to the job-manager.post-event.post service when called, e.g.: `flux post-job-event JOBID NAME [KEY=VAL, KEY=VAL..]`. This command is meant to be a temporary solution to clean up cases where a job gets stuck in CLEANUP because an epilog-start event was posted without a corresponding epilog-finish event.
Problem: There are no tests for the `flux post-job-event` utility. Add a new sharness test t2815-post-job-event.t.
Problem: mypy complains: `pyproject.toml: [mypy]: python_version: Python 3.6 is not supported (must be 3.8 or higher). You may need to put quotes around your Python version`. Put quotes around the Python version.
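For illustration, here is a minimal sketch of what the front-end command from the `flux post-job-event` commit above might do: parse the `KEY=VAL` arguments and send them to the `job-manager.post-event.post` service. Only the topic string and the argument shape come from the commit message. The payload field names (`id`, `name`, `context`), the helper names, the integer job ID, and the choice to leave values as strings are assumptions, not taken from the PR.

```python
from typing import Any, Dict, List


def parse_context(pairs: List[str]) -> Dict[str, str]:
    """Turn ["KEY=VAL", ...] into a dict. Values stay strings in this sketch."""
    context = {}
    for pair in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VAL, got {pair!r}")
        context[key] = value
    return context


def build_request(jobid: int, name: str, pairs: List[str]) -> Dict[str, Any]:
    """Assemble the RPC payload. Field names here are assumptions."""
    request: Dict[str, Any] = {"id": jobid, "name": name}
    if pairs:
        request["context"] = parse_context(pairs)
    return request


def post_event(jobid: int, name: str, pairs: List[str]) -> None:
    """Send the request to the service named in the commit message."""
    # Imported lazily so the helpers above can be used without Flux installed.
    import flux

    payload = build_request(jobid, name, pairs)
    flux.Flux().rpc("job-manager.post-event.post", payload).get()
```

`post_event` needs the Flux Python bindings and a running instance; the other two helpers are plain Python and can be checked on their own.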
This is actually the strategy we're using now, except it does work for multiple jobids because there's a We could keep this method if preferred. It is just slightly awkward, so it is harder to walk an admin through than a purpose-built command. Therefore, until now either @jameshcorbett or I have been doing this for the admins. Like I said, though, I'm hopeful we won't even have to use this.
Gotcha. This is fine, thanks!
Thanks!
Jobs can end up stuck in CLEANUP state if, for example, an `epilog-start` event is posted with no corresponding `epilog-finish` event. This can occur if the rank 0 broker restarts when a job epilog is running. Currently, to clean up this state a special, one-off jobtap plugin must be loaded to manually post the missing `epilog-finish` event.
Actually, my guess is this issue will be rare if we've fixed the broker crash that's been the most common cause of this condition. However, it might be useful to have on hand a more usable utility to post custom job events for this case or any other similar problems that might arise.
To that end, this PR adds a simple builtin jobtap plugin that can accept events for jobs as an RPC, and posts them to the requested jobid. A front-end command is then added to use this service: `flux post-job-event JOBID NAME [KEY=VAL...]`. This is just meant to have on hand to be used when necessary. I imagine it will someday be removed.
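As a rough sketch of how an admin might handle several stuck jobs at once, the helper below builds one `flux post-job-event` command line per job ID and prints them for review instead of running them. The job IDs and function names are hypothetical. Whether `epilog-finish` needs any context keys is not stated in this PR, so none are passed here.

```python
import shlex
from typing import Iterable, List


def build_commands(
    jobids: Iterable[str], name: str, context: Iterable[str] = ()
) -> List[List[str]]:
    """Return one argv list per job ID for `flux post-job-event`."""
    extra = list(context)  # optional KEY=VAL arguments, shared by all jobs
    return [["flux", "post-job-event", jobid, name, *extra] for jobid in jobids]


def format_command(argv: List[str]) -> str:
    """Render an argv list as a shell-safe string for display."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    stuck_jobs = ["f1234", "f5678"]  # hypothetical job IDs
    for argv in build_commands(stuck_jobs, "epilog-finish"):
        # Print only; pass argv to subprocess.run() once the list looks right.
        print(format_command(argv))
```

Printing first keeps the step reviewable, which seems appropriate for a tool meant to be used rarely and by hand.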