add utility for posting job events manually if required #5848

Merged
merged 4 commits into from Apr 2, 2024

Contributor

@grondo grondo commented Apr 1, 2024

Jobs can end up stuck in CLEANUP state if, for example, an epilog-start event is posted with no corresponding epilog-finish event. This can occur if the rank 0 broker restarts when a job epilog is running. Currently, to clean up this state a special, one-off jobtap plugin must be loaded to manually post the missing epilog-finish event.
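
To make the stuck condition concrete, here is a minimal Python sketch that scans a job's eventlog for epilog-start events with no matching epilog-finish. It is only an illustration and not code from this PR: it assumes each eventlog entry is a dict with a `name` and an optional `context`, and that start and finish events are paired by a `description` key in the context. The helper name `unmatched_epilogs` and the sample entries are hypothetical.

```python
from collections import Counter


def unmatched_epilogs(eventlog):
    """Return descriptions of epilog-start events lacking an epilog-finish.

    Assumption: start and finish events are paired by context["description"].
    """
    pending = Counter()
    for entry in eventlog:
        desc = entry.get("context", {}).get("description", "")
        if entry["name"] == "epilog-start":
            pending[desc] += 1
        elif entry["name"] == "epilog-finish":
            pending[desc] -= 1
    # Anything still counted above zero never received its finish event
    return sorted(d for d, n in pending.items() if n > 0)


# Hypothetical eventlog for a job stuck in CLEANUP
stuck = [
    {"timestamp": 1.0, "name": "finish", "context": {"status": 0}},
    {"timestamp": 1.1, "name": "epilog-start",
     "context": {"description": "job-manager.epilog"}},
]
print(unmatched_epilogs(stuck))
```

A non-empty result would identify the job as a candidate for a manually posted epilog-finish event.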

Actually, my guess is this issue will be rare if we've fixed the broker crash that's been the most common cause of this condition. However, it might be useful to have on hand a more usable utility to post custom job events for this case or any other similar problems that might arise.

To that end, this PR adds a simple builtin jobtap plugin that accepts events for jobs as an RPC and posts them to the requested jobid. A front-end command is then added to use this service: flux post-job-event JOBID NAME [KEY=VAL...]. This is just meant to be kept on hand and used when necessary. I imagine it will someday be removed.
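
As a rough sketch of what such a front end might do, the Python below turns `KEY=VAL` arguments into an event context and wraps it in a request. The topic string `job-manager.post-event.post` comes from the commit message in this PR; everything else is an assumption for illustration: the payload field names (`id`, `name`, `context`), the JSON-or-string value decoding, and the helper names `parse_keyval`, `build_request`, and `post_event`. It is not the actual contents of flux-post-job-event.py.

```python
import json


def parse_keyval(arg):
    """Split KEY=VAL; decode VAL as JSON when possible, else keep the string."""
    key, sep, val = arg.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VAL, got {arg!r}")
    try:
        return key, json.loads(val)
    except json.JSONDecodeError:
        return key, val


def build_request(jobid, name, keyvals=()):
    """Build the request payload (field names are assumptions, not the real schema)."""
    req = {"id": jobid, "name": name}
    context = dict(parse_keyval(kv) for kv in keyvals)
    if context:
        req["context"] = context
    return req


def post_event(jobid, name, keyvals=()):
    """Send the request to the service added by this PR (needs a running Flux instance)."""
    import flux  # imported lazily so the helpers above work without Flux installed

    request = build_request(jobid, name, keyvals)
    flux.Flux().rpc("job-manager.post-event.post", request).get()


print(build_request(1234, "epilog-finish", ["description=job-manager.epilog", "status=0"]))
```

Keeping the payload construction separate from the RPC call lets the argument handling be checked without a broker; only `post_event` touches a live instance.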

Member

garlick commented Apr 1, 2024

Oh good this will be helpful.

Could the event just be an argument to the jobtap plugin so you just run

$ flux jobtap load post-event jobid name [key=val] [key=val]...

Eh, then you'd have to unload the plugin each time you want to use it, which would be annoying when you have those 10 stuck jobs to deal with. Never mind.

codecov bot commented Apr 1, 2024

Codecov Report

Merging #5848 (3ecf139) into master (ed07466) will increase coverage by 0.01%.
The diff coverage is 84.44%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5848      +/-   ##
==========================================
+ Coverage   83.26%   83.28%   +0.01%     
==========================================
  Files         511      513       +2     
  Lines       82694    82739      +45     
==========================================
+ Hits        68858    68908      +50     
+ Misses      13836    13831       -5     
Files                                           Coverage Δ
src/modules/job-manager/jobtap.c                84.85% <ø> (ø)
src/cmd/flux-post-job-event.py                  92.00% <92.00%> (ø)
src/modules/job-manager/plugins/post-event.c    75.00% <75.00%> (ø)

... and 11 files with indirect coverage changes

Member

@garlick garlick left a comment

LGTM!

@@ -120,7 +120,8 @@ dist_fluxcmd_SCRIPTS = \
 flux-update.py \
 flux-imp-exec-helper \
 py-runner.py \
-flux-hostlist.py
+flux-hostlist.py \
+flux-post-job-event.py
Member

typo in commit "finidh"

@@ -10,7 +10,7 @@ skip = [
 "src/bindings/python/flux/resource/__init__.py"]

 [tool.mypy]
-python_version = 3.6
+python_version = "3.6"
Member

commit message typo "aarond"

Problem: Sometimes a job becomes stuck in a given state, requiring
intervention by forcing an event to be posted to the job manager.
Currently, this requires building a custom jobtap plugin, which is
annoying and not sysadmin friendly.

Add a builtin jobtap plugin that will simply post a custom job
event defined in the RPC.

Problem: There's no convenient way to post job events manually when
it's necessary.

Add a new `flux post-job-event` command which sends an RPC to the
job-manager.post-event.post service when called, e.g.:

  flux post-job-event JOBID NAME [KEY=VAL, KEY=VAL..]

This command is meant to be a temporary solution to clean up cases
where a job gets stuck in CLEANUP because an epilog-start event was
posted without a corresponding epilog-finish event.

Problem: There are no tests for the `flux post-job-event` utility.

Add a new sharness test t2815-post-job-event.t.

Problem: mypy complains

 pyproject.toml: [mypy]: python_version: Python 3.6 is not supported
   (must be 3.8 or higher). You may need to put quotes around your
   Python version

Put quotes around the python version.

Contributor Author

grondo commented Apr 2, 2024

> Could the event just be an argument to the jobtap plugin so you just run:
>
> $ flux jobtap load post-event jobid name [key=val] [key=val]...

This is actually the strategy we're using now, except it does work for multiple jobids because there's a jobids=[ID1, ID2, ...] config keyword (there's no way to pass just jobid and name to a jobtap plugin since the "config" is packed into a JSON object).

We could keep this method if preferred. It is just slightly awkward, so it is harder to walk an admin through than a purpose-built command would be. As a result, until now either @jameshcorbett or I have been doing this for the admins.

Like I said, though, I'm hopeful we won't even have to use this.

Member

garlick commented Apr 2, 2024

Gotcha. This is fine, thanks!

@mergify mergify bot merged commit 2e4a13d into flux-framework:master Apr 2, 2024
33 checks passed
@grondo grondo deleted the post-event branch April 2, 2024 14:19
Contributor Author

grondo commented Apr 2, 2024

Thanks!
