Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-list: allow list-id to wait for job state #5213

Merged
merged 8 commits into from Jun 6, 2023

Conversation

chu11
Copy link
Member

@chu11 chu11 commented May 27, 2023

Problem: It'd be convenient if the job-list.list-id service not only waited for a job id to be legal but to also waited for the job to pass
a certain job state. Then it can also be used to "wait on" job-list for a number of eventual consistency checks in the testsuite.

Fixes #2864

Also stuck on a testsuite coverage for a prior PR that I just forgot to do.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup! Just a few minor comments.

@@ -39,7 +40,9 @@ def set_and_check_context(self, key, value, type):
kd2 = flux.kvs.KVSDir(self.f)
nv = kd2[key]
if isinstance(value, bytes) and type is str:
self.assertEqual(value.decode('utf-8'), nv)
self.assertEqual(value.decode("utf-8"), nv)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think normally these files are not subject to formatting by black. That might be why the formatting changed Did you run black directly on the file?

Actually, it could be a mistake that the python test files aren't being formatted. The .pre-commit-config.yaml has:

files: '^(src/bindings/python/flux|src/cmd|t/.*\\.py)'
exclude: "^(src/bindings/python/_flux/|src/bindings/python/flux/utils/)"

So it does appear that inclusion of python files under t/ was intended. I think the double backslash is preventing this regex from matching though. I suppose we should fix this in a separate PR. I leave whether you keep the unrelated formatting change in this PR up to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI - I have a PR ready that fixes .pre-commit-config.yaml and all the fallout...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I guess I ran black on it, leading to that change.

Comment on lines 44 to 45
elif isinstance(value, bytes) and type is dict:
self.assertEqual(ast.literal_eval(value.decode("utf-8")), nv)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a comment -- at least, I don't understand exactly why ast.literal_eval is needed here.
Also instead of self.assertEqual should you use self.assertDictEqual?

goto enomem;

return 0;

enomem:
zlistx_destroy (&list_isd);
idsync_wait_list_destroy ((void **)&iwl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cast to (void **) is unnecessary here I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i get a error/warning with it :-(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, ok, my bad then!

@@ -54,6 +54,7 @@ static struct idsync_data *idsync_data_create (flux_t *h,
flux_jobid_t id,
const flux_msg_t *msg,
json_t *attrs,
flux_job_state_t state,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message: it might be fine, but I was stumbling over reading of:

Solution: Support an option job state in the job-list.list-id service.

Is it supposed to be either "Support an optional job state in the..." or maybe "Support a job state option in the..."?

@@ -145,6 +145,13 @@ static struct optparse_option list_inactive_opts[] = {
OPTPARSE_TABLE_END
};

static struct optparse_option list_ids_opts[] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message:

However, the flux job list-ids does not support anything to use it.

Suggestion: "the flux job list-ids command does not support any way to use it" ?

@@ -1471,12 +1480,27 @@ int cmd_list_ids (optparse_t *p, int argc, char **argv)
if (!(h = flux_open (NULL, 0)))
log_err_exit ("flux_open");

/* if no job state specified by user, pick first job state of
* depend, which means will return immediately
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more correct to say "will return as soon as the job-list module is aware of the job" instead of "immediately"?

@chu11
Copy link
Member Author

chu11 commented May 31, 2023

rebased picking up the black fixes on t/python and fixed up per comments above

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! LGTM!

@grondo
Copy link
Contributor

grondo commented Jun 6, 2023

This could actually be really handy even outside of testing, nice!

chu11 added 8 commits June 6, 2023 15:54
Problem: In python/t0005-kvs.py an illegally stored json object
was stored as binary and tested.  A legally stored json object
stored as binary was not tested.

Add a legally stored json object in binary for coverage.
Problem: In t2800-jobs-cmd.t some convenience functions are implemented
that are identical to some helper functions in helper files.

Solution: Use the helper functions provided in job-list-helper.sh and
remove the unnecessary function declarations.
Problem: In the idsync_add_waiter() function, the cleanup path cleaned
up the wrong memory.

Solution: Clean up the zlistx_t that was possibly created within the function,
not the data structure passed in as a parameter.
Problem: In the idsync API a list is stored in a hash and
an externally stored jobid in a list entry is used as the key.
This currently is fine because the entire list is read back and
destroyed in idsync_check_waiting_id().

However, if we ever want to read/delete some list entries but not
others, the externally stored jobid makes the hash lookup unsafe,
as the storage of the key can no longer be guaranteed.

Solution: Store the list in a new struct that also stores the
jobid.  Also, adjust the name of a cleanup function as a result.
Problem: It'd be convenient if the job-list.list-id service not only
waited for a job id to be legal but to also waited for the job to
pass a certain job state.

Solution: Support an optional job state in the job-list.list-id service.
Return when a job passes this job state or reaches the inactive state.

Fixes flux-framework#2864
Problem: The job-list.list-ids RPC now supports the ability to return
after a specific job state has passed.  However, the flux job list-ids
command does not support any way to use it.

Solution: Add a --wait-state option which allows the caller to tell
list-ids to only return after a specific job state has been hit.
Problem: There is no coverage of job-list.list-ids and its
ability to return only after a specific job state has been reached.

Add coverage in t2260-job-list.t.
Problem: The job-list module has eventual consistency for the jobs
that are submitted to the job-manager queue.  A number of helper
functions have been written to check for this consistency before
running tests.  These functions are no longer necessary due to the
new `flux job list-ids --wait-state` option.

Use the flux job list-ids --wait-state option when appropriate.  Remove
no longer necessary code.
@codecov
Copy link

codecov bot commented Jun 6, 2023

Codecov Report

Merging #5213 (3c0c043) into master (666fb9c) will decrease coverage by 10.67%.
The diff coverage is 88.46%.

❗ Current head 3c0c043 differs from pull request most recent head 75a64be. Consider uploading reports for the commit 75a64be to get more accurate results

@@             Coverage Diff             @@
##           master    #5213       +/-   ##
===========================================
- Coverage   94.35%   83.68%   -10.67%     
===========================================
  Files          84      444      +360     
  Lines        7560    75766    +68206     
===========================================
+ Hits         7133    63405    +56272     
- Misses        427    12361    +11934     
Impacted Files Coverage Δ
src/modules/job-list/list.c 76.88% <70.00%> (ø)
src/cmd/flux-job.c 87.97% <85.71%> (ø)
src/modules/job-list/idsync.c 72.18% <94.11%> (ø)
src/modules/job-list/job_state.c 74.34% <100.00%> (ø)

... and 358 files with indirect coverage changes

@mergify mergify bot merged commit 80c8a69 into flux-framework:master Jun 6, 2023
31 of 32 checks passed
@chu11 chu11 deleted the issue2864_idsync_state branch June 6, 2023 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

job-info: list-id service wait for specific state
2 participants