flux-top: limit jobs and summary to specific queue #4847

Merged: 4 commits merged into master from issue4606_flux_top_queue on Jan 25, 2023

Conversation

@chu11 (Member) commented Dec 20, 2022

Per #4606. I'm really glad I added that testing in #4833; it caught a couple of issues :P

  • I assume --queue doesn't recursively descend into sub-instances. Sub-instances may have the same queue names set up, but I don't think that's what's desired by the --queue option.

  • Right now flux-top decides whether the "QUEUE" column should be displayed based on whether any jobs have been submitted to any queue. Maybe this should be changed to just checking whether any queues have been configured, i.e. whether queue.* is actually configured? While what is implemented is perhaps more "correct", it's heavier weight and I'm not sure it's really needed outside of unusual scenarios (i.e. the queue configuration changed between job submissions). But for the time being I left it in.

  • will probably have to add support in flux-sched for the resource-status RPC.

  • I somewhat lazily chose not to add "queues" (plural) support. Didn't think there was a big use case at the moment.

    • The big issue is that job-list only filters jobs by a single queue, so we would have to update the job-list module or build up the display from multiple responses. Edit: OR get all jobs and just filter them in flux-top, though I'm not sure I'd want to do that.

    • Getting the resources of multiple queues should be easy; just add to the constraints list.

    • It's a tiny bit annoying to support in job-stats, because the per-queue stats have to be added together.

    • I'm thinking I'll put this into an issue as a longer-term todo.

@grondo (Contributor) commented Dec 20, 2022

Haven't taken a look yet, but I did want to comment on this:

will probably have to add support in flux-sched for the resource-status RPC.

Hm, this might be non-trivial and also means this PR can't be merged until that support has landed in flux-sched.

Did you consider doing the filtering in flux-top instead? This could greatly simplify the implementation, and would work without changing the scheduler status RPC.

@grondo (Contributor) commented Dec 20, 2022

I somewhat lazily chose not to add "queues" (plural) support. Didn't think there was a big use case at the moment.

Yeah, I don't think there is a use case for multiple queues besides the obvious one of displaying all queues.

OR get all jobs and just filter them in flux-top, though I'm not sure I'd want to do that.

I think it would actually be very advantageous to filter jobs in flux top directly. We may want to add support for a Q key (or I was thinking left/right arrow keys) to "page" between queue displays, so that a user doesn't have to quit and restart top to view different queues. It would be much easier to do this if the set of displayed jobs is filtered rather than mucking about with sending a new RPC...

@chu11 (Member, Author) commented Dec 20, 2022

Hm, this might be non-trivial and also means this PR can't be merged until that support has landed in flux-sched.

Did you consider doing the filtering in flux-top instead? This could greatly simplify the implementation, and would work without changing the scheduler status RPC.

I hadn't considered it, but after skimming flux-sched this morning ... this does seem like a good idea.

@chu11 (Member, Author) commented Dec 20, 2022

I think it would actually be very advantageous to filter jobs in flux top directly. We may want to add support for a Q key (or I was thinking left/right arrow keys) to "page" between queue displays, so that a user doesn't have to quit and restart top to view different queues. It would be much easier to do this if the set of displayed jobs is filtered rather than mucking about with sending a new RPC...

That's a good idea, and it would actually allow us to "flip" through queues on sub-instances that are selected. Let me look into doing it this way.

@chu11 (Member, Author) commented Dec 20, 2022

Re-pushed, implementing the "filtering" within flux-top instead of in sched-simple or job-list. The code ended up simpler in some cases.

I don't think I'll do the 'Q' (jump between queues) support in this PR; I'll punt that to a follow-up PR.

@grondo (Contributor) commented Dec 21, 2022

We could probably punt on runtime queue selection for a while. I was thinking of allocating left-arrow/right-arrow (maybe h/l) to flip through queues, but it's not critical at all.

@chu11 force-pushed the issue4606_flux_top_queue branch 2 times, most recently from 464ad37 to 4064136 on January 9, 2023 20:23
@garlick (Member) left a comment


Just tested on my system with two queues and seems to work well!
A few comments inline.

@@ -146,6 +146,11 @@ void joblist_pane_draw (struct joblist_pane *joblist)
"user",
"uri", &uri) < 0)
fatal (0, "error decoding a job record from job-list RPC");

if (joblist->top->queue
&& strcmp (joblist->top->queue, queue) != 0)
@garlick (Member) commented Jan 9, 2023


Might want to convert to streq() here (top.c already uses it).
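
For reference, a minimal sketch of the converted check (assuming the streq() helper that top.c already pulls in):

    if (joblist->top->queue
        && !streq (joblist->top->queue, queue))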

Comment on lines 383 to 387
    rl_constraint = rlist_copy_constraint (rl_all,
                                           sum->top->queue_constraint,
                                           &error);
    if (!rl_constraint)
        fatal (errno, "failed to create constrained rlist");
@garlick (Member) commented:


Consider including error.text as part of the failure message?
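
Something like the following sketch, assuming error is the flux_error_t passed to rlist_copy_constraint() and that fatal() accepts printf-style formatting:

    if (!rl_constraint)
        fatal (errno, "failed to create constrained rlist: %s", error.text);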

Comment on lines 361 to 366
static int resource_count (json_t *o,
static int resource_count (struct summary_pane *sum,
                           json_t *o,
                           const char *name,
                           int *nnodes,
                           int *ncores,
                           int *ngpus)
@garlick (Member) commented:


Minor, but since this function is counting resources returned from sched.resource-status, maybe the function declaration would be clearer with the response payload first (as before), adding a new json_t *constraint argument rather than having the caller pass in all of struct summary_pane when it's only used to find the constraint?
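
In other words, a declaration roughly like this (a sketch only; the exact placement of the constraint argument is illustrative):

    static int resource_count (json_t *o,
                               const char *name,
                               json_t *constraint,
                               int *nnodes,
                               int *ncores,
                               int *ngpus);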

Comment on lines 450 to 467
    if (sum->top->queue) {
        json_t *queues = NULL;
        json_t *q = NULL;
        size_t index;
        json_t *value;
        if (json_unpack (o, "{s:o}", "queues", &queues) < 0)
            fatal (0, "error getting job-list.job-stats queues");
        if (json_is_array (queues) == 0)
            fatal (EPROTO, "job-list.job-stats queue not an array");
        json_array_foreach (queues, index, value) {
            const char *name;
            if (json_unpack (value, "{s:s}", "name", &name) < 0)
                fatal (EPROTO, "job-list.job-stats queue name not found");
            if (!strcmp (name, sum->top->queue)) {
                q = value;
                break;
            }
        }
@garlick (Member) commented:


Suggestion: move this block to its own function, something like

json_t *parse_queue_stats (json_t *o, const char *name)

and have it return NULL on failure, then handle the failure in one place as a protocol error rather than having three separate error messages for what is basically the same unlikely problem.
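
A rough sketch of the helper, based on the block above (no new APIs assumed):

    static json_t *parse_queue_stats (json_t *o, const char *name)
    {
        json_t *queues;
        size_t index;
        json_t *value;

        /* Return the stats object for 'name' from the "queues" array,
         * or NULL if the payload is malformed or the queue isn't found.
         */
        if (json_unpack (o, "{s:o}", "queues", &queues) < 0
            || !json_is_array (queues))
            return NULL;
        json_array_foreach (queues, index, value) {
            const char *n;
            if (json_unpack (value, "{s:s}", "name", &n) < 0)
                return NULL;
            if (!strcmp (n, name))
                return value;
        }
        return NULL;
    }

The caller could then reduce to a single fatal (EPROTO, ...) when the helper returns NULL.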

Comment on lines +248 to +250
    if (!(f = flux_rpc (top->h, "config.get", NULL, FLUX_NODEID_ANY, 0))
        || flux_rpc_get_unpack (f, "o", &o) < 0)
        fatal (errno, "Error fetching flux config");

@garlick (Member) commented:


Suggestion: fetch the config object and store it within struct top before calling setup_constraint()? Seems like it might come in handy later, e.g. when multiple queues/constraints are supported.
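
One possible sketch (the config member on struct top is an assumed addition, as is freeing it at teardown):

    if (!(f = flux_rpc (top->h, "config.get", NULL, FLUX_NODEID_ANY, 0))
        || flux_rpc_get_unpack (f, "o", &o) < 0)
        fatal (errno, "Error fetching flux config");
    top->config = json_incref (o);   /* hypothetical new member; take a reference
                                      * since the future owns the unpacked object */
    flux_future_destroy (f);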

Comment on lines 264 to 276
    json_t *array;
    size_t index;
    json_t *value;
    if (!json_is_array (requires))
        fatal (0, "queue requires configuration invalid");
    if (!(array = json_array ()))
        fatal (0, "Error creating json array");
    json_array_foreach (requires, index, value) {
        if (json_array_append (array, value) < 0)
            fatal (0, "Error appending to json array");
    }
    if (!(top->queue_constraint = json_pack ("{s:o}", "properties", array)))
        fatal (0, "Error creating queue constraints");
@garlick (Member) commented:


Suggestion: instead of copying requires, just take a reference on it e.g.

constraint = json_pack ("{s:O}", "properties", requires);
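
(With jansson, the capital O conversion takes a new reference on requires rather than stealing one, so the configuration object itself is left untouched.)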

@chu11 (Member, Author) commented Jan 11, 2023

Rebased and re-pushed with fixups.

@garlick (Member) commented Jan 20, 2023

I've lost my context on this one. Want to squash the fixups and I'll give it another pass?

@chu11 (Member, Author) commented Jan 21, 2023

Squashed and re-pushed.

@codecov (bot) commented Jan 23, 2023

Codecov Report

Merging #4847 (3e5ceca) into master (011d764) will increase coverage by 28.57%.
The diff coverage is n/a.

❗ Current head 3e5ceca differs from pull request most recent head 4903b9a. Consider uploading reports for the commit 4903b9a to get more accurate results

@@             Coverage Diff             @@
##           master    #4847       +/-   ##
===========================================
+ Coverage   54.25%   82.82%   +28.57%     
===========================================
  Files         401      425       +24     
  Lines       69596    74707     +5111     
===========================================
+ Hits        37756    61873    +24117     
+ Misses      31840    12834    -19006     
Impacted Files Coverage Δ
src/common/libpmi/simple_client.c 68.53% <0.00%> (-14.41%) ⬇️
src/common/libflux/response.c 79.56% <0.00%> (-10.34%) ⬇️
src/common/libflux/control.c 68.75% <0.00%> (-9.83%) ⬇️
src/common/libflux/request.c 84.76% <0.00%> (-8.65%) ⬇️
src/modules/kvs/kvs_wait_version.c 89.85% <0.00%> (-6.92%) ⬇️
src/common/libioencode/ioencode.c 89.61% <0.00%> (-6.23%) ⬇️
src/common/libsubprocess/command.c 69.42% <0.00%> (-5.83%) ⬇️
src/common/libsubprocess/util.c 94.64% <0.00%> (-5.36%) ⬇️
src/common/libsubprocess/posix_spawn.c 88.40% <0.00%> (-5.25%) ⬇️
src/common/libutil/tomltk.c 78.39% <0.00%> (-4.15%) ⬇️
... and 350 more

@garlick (Member) left a comment


This looks good. Thanks!

Problem: In the near future we may need to parse optional fields
from a job-list.job-stats response.  This is difficult given that
fields are directly parsed in flux_rpc_get_unpack().

Solution: Refactor to use flux_rpc_get_unpack() to get the whole
payload of the job-list.job-stats response.  Then use json_unpack()
to parse that object.
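
A minimal sketch of that two-step pattern (field handling illustrative; the "s?" unpack conversion is how an optional member could be tolerated):

    json_t *o;
    json_t *queues = NULL;
    if (flux_rpc_get_unpack (f, "o", &o) < 0)
        fatal (errno, "error getting job-list.job-stats payload");
    if (json_unpack (o, "{s?o}", "queues", &queues) < 0)
        fatal (EPROTO, "error parsing job-list.job-stats response");
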
Problem: There is no way to filter jobs and summary data in flux-top
by queue.

Solution: Support a --queue option.

Fixes flux-framework#4606
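
With this change, jobs and the summary pane can be limited to a single queue with something like flux top --queue=batch (queue name illustrative).
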
Problem: flux-top's --queue option isn't documented.

Add documentation in the flux-top(1) manpage.

Problem: There is no coverage for the flux-top --queue option.

Add some test coverage in t2801-top-cmd.t.
@chu11 deleted the issue4606_flux_top_queue branch on January 30, 2023 18:33