Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: drop sched.free response requirement #5786

Merged
merged 2 commits into from Mar 11, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented Mar 11, 2024

Problem: the sched.free RPC requires a response but the response is unnecessary for correct job management.

Moreover, the coupling between the RPC response and the job state machine makes some anticipated changes harder to design.

Make sched.free a "fire and forget" RPC. Simply post the free event immediately instead of deferring it until the response is received..

Update flux queue status -v output to not show pending free rpcs, since they can no longer be pending.

This change should be accompanied by an update to RFC 27.

Problem: `flux queue status -v` shows a count of pending
sched.free requests, but once sched.free no longer requires a
response, they will never be "pending".

Drop pending free requests from the output.

Update sharness tests that expected that in the output.

Drop `free_pending` from the job-manager.alloc-query response
payload.
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I guess it isn't an issue if the scheduler has an error handling a free request since we couldn't/didn't do anything with the error before?

garlick added a commit to garlick/flux-rfc that referenced this pull request Mar 11, 2024
Problem: sched.free requires a response, but this adds complexity
to the job manager and has little benefit.

Code that drops the sched.free response from flux-core is proposed
in flux-framework/flux-core#5786.

Drop the free response.
@garlick
Copy link
Member Author

garlick commented Mar 11, 2024

Oof just noticed a typo - an extra log message that shouldn't be there. Will force push.

Before this, a free error would cause the scheduler interface to be torn down (scheduler stopped). Now we just leave it to schedulers to log something. I think that is probably fine - the scheduler can choose to abort if it wants to, or log. (They both currently do log at LOG_ERR level anyway).

Problem: the sched.free RPC requires a response but the response
is unnecessary for correct job management.

Moreover, the coupling between the RPC response and the job state
machine makes some anticipated changes harder to design.

Make sched.free a "fire and forget" RPC.  Simply post the free event
immediately instead of deferring it until the response is received.

Update libschedutil to make schedutil_free_respond() a no-op.
Update sched-simple to not directly respond when there is a free error.

This change should be accompanied by an update to RFC 27.
@garlick
Copy link
Member Author

garlick commented Mar 11, 2024

Thanks - I'll set MWP.

Copy link

codecov bot commented Mar 11, 2024

Codecov Report

Merging #5786 (a1abce4) into master (3f26e6e) will increase coverage by 28.47%.
The diff coverage is 85.71%.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5786       +/-   ##
===========================================
+ Coverage   54.85%   83.32%   +28.47%     
===========================================
  Files         461      509       +48     
  Lines       74808    82479     +7671     
===========================================
+ Hits        41033    68724    +27691     
+ Misses      33775    13755    -20020     
Files Coverage Δ
src/cmd/flux-queue.py 94.25% <ø> (+56.97%) ⬆️
src/common/libschedutil/free.c 100.00% <100.00%> (ø)
src/modules/job-manager/alloc.c 77.36% <100.00%> (+28.65%) ⬆️
src/modules/job-manager/event.c 79.04% <100.00%> (+17.05%) ⬆️
src/modules/job-manager/jobtap.c 84.47% <100.00%> (+50.13%) ⬆️
src/modules/sched-simple/sched.c 76.99% <0.00%> (+46.19%) ⬆️

... and 427 files with indirect coverage changes

@mergify mergify bot merged commit 9373f55 into flux-framework:master Mar 11, 2024
35 checks passed
@garlick garlick deleted the sched_free_noresponse branch March 11, 2024 22:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants