Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: misc cleanup in scheduler interface #5969

Merged
merged 7 commits into from
May 15, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented May 14, 2024

While trying to find the cause of issue #5964, I finally got fed up and did some cleanup in alloc.c.

Nothing too risky here.

I didn't get to the bottom of the actual issue but figured this cleanup could stand alone.

Problem: the alloc subsystem contains several variables that
were abandoned when queues were added.

Drop unused variables.
Problem: the code in alloc.c for maintaining the two priority
queues is duplicated in several places, and therefore not as
robust as it could be with regard to maintaining its invariants.

Specifically with regard to job->handle
- it must be set on insertion
- it must be cleared on deletion
- insertion/deletion should fail if it is not clear/set
- it must be revalidated after a zlistx_sort()

Add some new functions to job.c that should simplify this:

zlistx_t *job_priority_queue_create (void);
int job_priority_queue_insert (zlistx_t *l, struct job *job);
int job_priority_queue_delete (zlistx_t *l, struct job *job);
void job_priority_queue_reorder (zlistx_t *l, struct job *job);
void job_priority_queue_sort (zlistx_t *l);

Update alloc.c
Problem: the variable name of alloc->pending_jobs is confusing
since a pending job is defined as a job in DEPEND|PRIORITY|SCHED
states, and this list contains only some jobs in SCHED state that
have unanswered alloc requests pending.

Rename to alloc->sent.
Problem: alloc->ready is a bit vague.

Rename to alloc->scheduler_is_online.
Problem: alloc->alloc_pending_count uses the name "pending"
which is confusing, and has an "alloc" prefix when it's already
in an alloc struct.

Rename to alloc->sent_count, which more accurately describes what
it's counting.
Problem: the comments next to some context variables are
not super helpful.

Try to improve them a bit.
Problem: there are a few assertions in the alloc code
that may cause an outsized reaction if they ever trigger.

Instead of crashing the rank 0 broker, do something less scary
like return an error to the caller or skip the operation.
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@grondo
Copy link
Contributor

grondo commented May 15, 2024

The bookworm/distcheck builder failed at:

 not ok 17 - attach: reports job shell Killed if job shell is killed

Restarted (this test could be racy)

@garlick
Copy link
Member Author

garlick commented May 15, 2024

Thanks! I'll set MWP here.

Copy link

codecov bot commented May 15, 2024

Codecov Report

Attention: Patch coverage is 87.67123% with 9 lines in your changes are missing coverage. Please review.

Project coverage is 83.29%. Comparing base (9f5ad97) to head (819f0f1).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5969      +/-   ##
==========================================
- Coverage   83.33%   83.29%   -0.05%     
==========================================
  Files         515      515              
  Lines       83398    83403       +5     
==========================================
- Hits        69502    69472      -30     
- Misses      13896    13931      +35     
Files Coverage Δ
src/modules/job-manager/alloc.c 75.91% <97.36%> (-1.27%) ⬇️
src/modules/job-manager/job.c 87.94% <77.14%> (-1.15%) ⬇️

... and 12 files with indirect coverage changes

@mergify mergify bot merged commit be2fcf4 into flux-framework:master May 15, 2024
35 checks passed
@garlick garlick deleted the issue#5964 branch May 15, 2024 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants