-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
job-manager: misc cleanup in scheduler interface #5969
Conversation
Problem: the alloc subsystem contains several variables that were abandoned when queues were added. Drop unused variables.
Problem: the code in alloc.c for maintaining the two priority queues is duplicated in several places, and therefore not as robust as it could be with regard to maintaining its invariants. Specifically with regard to job->handle - it must be set on insertion - it must be cleared on deletion - insertion/deletion should fail if it is not clear/set - it must be revalidated after a zlistx_sort() Add some new functions to job.c that should simplify this: zlistx_t *job_priority_queue_create (void); int job_priority_queue_insert (zlistx_t *l, struct job *job); int job_priority_queue_delete (zlistx_t *l, struct job *job); void job_priority_queue_reorder (zlistx_t *l, struct job *job); void job_priority_queue_sort (zlistx_t *l); Update alloc.c
Problem: the variable name of alloc->pending_jobs is confusing since a pending job is defined as a job in DEPEND|PRIORITY|SCHED states, and this list contains only some jobs in SCHED state that have unanswered alloc requests pending. Rename to alloc->sent.
Problem: alloc->ready is a bit vague. Rename to alloc->scheduler_is_online.
Problem: alloc->alloc_pending_count uses the name "pending" which is confusing, and has an "alloc" prefix when it's already in an alloc struct. Rename to alloc->sent_count, which more accurately describes what it's counting.
Problem: the comments next to some context variables are not super helpful. Try to improve them a bit.
Problem: there are a few assertions in the alloc code that may cause an outsized reaction if they ever trigger. Instead of crashing the rank 0 broker, do something less scary like return an error to the caller or skip the operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
The bookworm/distcheck builder failed at:
Restarted (this test could be racy) |
Thanks! I'll set MWP here. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5969 +/- ##
==========================================
- Coverage 83.33% 83.29% -0.05%
==========================================
Files 515 515
Lines 83398 83403 +5
==========================================
- Hits 69502 69472 -30
- Misses 13896 13931 +35
|
While trying to find the cause of issue #5964, I finally got fed up and did some cleanup in alloc.c.
Nothing too risky here.
I didn't get to the bottom of the actual issue but figured this cleanup could stand alone.