resource: emit a more specific error when rlist_rerank() fails #4126

grondo · 2022-02-11T17:43:09Z

This PR adds an rlist_error_t parameter to rlist_rerank() which offers a more specific error when the rerank operation fails.

The parameter is then used in the resource module to print a specific error along with the general error reranking R.

Finally, to avoid the confusing error message when the resource module exits due to rlist_rerank's use of errno like ENOSPC and EOVERFLOW, reset errno to EINVAL when rlist_rerank() fails.

chu11

LGTM, just this comment nit

chu11 · 2022-02-11T20:36:02Z

src/cmd/flux-R.c

@@ -562,7 +562,7 @@ int cmd_rerank (optparse_t *p, int argc, char **argv)
        log_err_exit ("Failed to transform R objects on stdin");


Seem to be missing a end paren in commit message, probably goes after "rerank operation"?

(too many or too few hosts specified for the rerank operation, but the resulting error message from users can therefore be vague and confusing, e.g. "rlist_rerank: No space left of device".

Thanks! Fixed the commit message and pushed the result.

garlick

Looks good! I misconfigured my system instance to see how the logs turned out and this is what I get from journalctl -u flux after the broker fails to start on rank 0:

Feb 12 18:38:21 picl0 bash[24554]: resource.err[0]: error reranking R: Number of hosts (4) is less than node count (8)
Feb 12 18:38:21 picl0 bash[24554]: resource.crit[0]: fatal error: Invalid argument
Feb 12 18:38:21 picl0 bash[24554]: broker.err[0]: rc1.0: flux-module: broker.insmod: Invalid argument
Feb 12 18:38:21 picl0 bash[24554]: broker.err[0]: rc1.0: /bin/bash -c /usr/local/etc/flux/rc1 Exited (rc=1) 0.2s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: rc1-fail: init->shutdown 0.200315s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: children-none: shutdown->finalize 0.00039453s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: rc3.0: /bin/bash -c /usr/local/etc/flux/rc3 Exited (rc=0) 0.2s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: rc3-success: finalize->exit 0.173281s
Feb 12 18:38:21 picl0 systemd[1]: flux.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 18:38:21 picl0 systemd[1]: flux.service: Failed with result 'exit-code'.

grondo · 2022-02-14T16:06:00Z

resource.crit[0]: fatal error: Invalid argument

I wonder if we should also simplify this generic error on abnormal module termination. It might not be helping to include a generic fatal error message along with strerror (errno). In most cases the module will have (hopefully) already printed a more descriptive error of what actually went wrong, and, if not, then it is unlikely the errno will help pinpoint the problem (except perhaps in the exception case of ENOMEM).

Perhaps just module exiting abnormally or similar?

garlick · 2022-02-14T17:13:30Z

Good point. That works for me.

grondo · 2022-02-14T18:34:31Z

That works for me.

I went ahead and tacked that on here.

garlick · 2022-02-14T19:14:41Z

Looks like there may be a fluxion test that depends on the current behavior:

2022-02-14T18:42:08.1548425Z PASS: t4000-match-params.t 1 - loading resource module with a tiny machine GRUG works
2022-02-14T18:42:08.1552903Z PASS: t4000-match-params.t 2 - loading resource module with an XML works
2022-02-14T18:42:08.1556859Z PASS: t4000-match-params.t 3 - loading resource module with no option works
2022-02-14T18:42:08.1560752Z PASS: t4000-match-params.t 4 - loading resource module with a nonexistent GRUG fails
2022-02-14T18:42:08.1564600Z PASS: t4000-match-params.t 5 - loading resource module with a nonexistent XML fails
2022-02-14T18:42:08.1568812Z FAIL: t4000-match-params.t 6 - loading resource module with incorrect reader fails
2022-02-14T18:42:08.1572710Z PASS: t4000-match-params.t 7 - loading resource module with known policies works
2022-02-14T18:42:08.1576557Z PASS: t4000-match-params.t 8 - loading resource module with unknown policies is tolerated
2022-02-14T18:42:08.1580677Z PASS: t4000-match-params.t 9 - removing resource works
2022-02-14T18:42:08.1584412Z ERROR: t4000-match-params.t - exited with status 1

Edit: looks like maybe we could just change the test to grep for the new message:

test_expect_success 'loading resource module with incorrect reader fails' '
    unload_resource &&
    flux dmesg -C &&
    load_resource load-file=${xml} load-format=grug \
prune-filters=ALL:core policy=high &&
    test_must_fail flux module stats sched-fluxion-resource &&
    flux dmesg > error3 &&
    grep -i "Invalid argument" error3
'

grondo · 2022-02-14T20:29:43Z

Interesting. I'll remove the last commit and save that for a future PR.

grondo · 2022-02-14T20:55:12Z

Edit: looks like maybe we could just change the test to grep for the new message:

Yeah, that test isn't so good anyway since Invalid argument could appear anywhere in the logs and it would pass. I already have a PR open on the flux-sched testsuite, so I'll fix it there and make the test less sensitive to the actual content of the flux-core logging. Once that is fixed I can re-propose the one liner fix I just removed here.

Problem: rlist_rerank() repurposes errnos like EOVERFLOW and ENOSPC to communicate the type of error (e.g. too many or too few hosts specified for the rerank operation), but the resulting error message from users can therefore be vague and confusing, e.g. "rlist_rerank: No space left of device". Add an optional rlist_error_t parameter to the rlist_rerank() function which, when provided, will be filled in with a more useful error. Callers can then print this error instead of using strerror. Since the function prototype changed, update all tests and callers. Fixes flux-framework#3830

Problem: rlist_rerank() repurposes system errnos like ENOSPC and EOVERFLOW to indicate the specific type of error encountered. However, this can cause confusion when a subsequent call to flux_log_error() occurs, which could result in something like: resource.crit[0]: fatal error: No space left on device which is clearly misleading. Since the specific error is now printed at the rlist_rerank() call site, reset errno to the more general (and perhaps correct) EINVAL before returning with error from convert_R_conf(). Since this applies specifically to rlist_rerank(), split the call to flux_attr_get(3) out from the combined conditional and give it its own error message. Fixes flux-framework#4122

Problem: Inclusion of `strerror (errno)` in the error message logged when a module exits mod_main() with a nonzero return code is probably not helpful to administrators perusing the logs. In most cases, the module will have already printed a more descriptive error before exiting with error, and the errno at best is not helpful and worst may be confusing. Clarify the log message at abnormal module exit to be clear about what is happening, and exclude the result of strerror() to avoid any confusion with the real error which should be further up in the logs.

Codecov · 2022-02-15T02:12:48Z

Codecov Report

Merging #4126 (028549c) into master (cac0153) will decrease coverage by 3.18%.
The diff coverage is 85.71%.

❗ Current head 028549c differs from pull request most recent head 57d7b28. Consider uploading reports for the commit 57d7b28 to get more accurate results

@@            Coverage Diff             @@
##           master    #4126      +/-   ##
==========================================
- Coverage   83.32%   80.13%   -3.19%     
==========================================
  Files         376      376              
  Lines       63016    62547     -469     
==========================================
- Hits        52509    50123    -2386     
- Misses      10507    12424    +1917

Impacted Files	Coverage Δ
src/modules/resource/inventory.c	`78.01% <50.00%> (-2.80%)`	⬇️
src/broker/module.c	`74.88% <100.00%> (-2.53%)`	⬇️
src/cmd/flux-R.c	`67.91% <100.00%> (-12.47%)`	⬇️
src/common/librlist/rlist.c	`74.56% <100.00%> (-10.42%)`	⬇️
src/modules/job-info/allow.c	`28.57% <0.00%> (-42.86%)`	⬇️
src/cmd/top/ucache.c	`58.62% <0.00%> (-28.05%)`	⬇️
src/modules/job-manager/getattr.c	`37.03% <0.00%> (-22.23%)`	⬇️
src/common/librlist/rnode.c	`65.10% <0.00%> (-19.71%)`	⬇️
src/common/libsubprocess/remote.c	`53.21% <0.00%> (-17.65%)`	⬇️
src/shell/affinity.c	`62.22% <0.00%> (-17.19%)`	⬇️
... and 204 more

grondo · 2022-02-15T02:13:21Z

Actually, since the fix for flux-sched was merged (thanks @garlick), I pushed the commit again to simplify the module abnormal exit error.

grondo · 2022-02-15T14:56:20Z

Setting MWP.

grondo force-pushed the issue#4122 branch from de808dd to 4b68c49 Compare February 11, 2022 18:21

chu11 approved these changes Feb 11, 2022

View reviewed changes

grondo force-pushed the issue#4122 branch from 4b68c49 to a64524a Compare February 11, 2022 21:46

garlick approved these changes Feb 13, 2022

View reviewed changes

grondo force-pushed the issue#4122 branch from a64524a to ab5c85b Compare February 14, 2022 18:32

grondo force-pushed the issue#4122 branch from ab5c85b to 2250c5b Compare February 14, 2022 20:52

grondo added 3 commits February 14, 2022 13:14

librlist/test: fix whitespace

05815af

grondo mentioned this pull request Feb 14, 2022

testsuite: update flux-sharness.sh and add fedora34 flux-framework/flux-sched#905

Merged

grondo force-pushed the issue#4122 branch from 2250c5b to 94976fd Compare February 14, 2022 21:27

grondo added the merge-when-passing label Feb 15, 2022

Mergify bot merged commit 6f0165b into flux-framework:master Feb 15, 2022

grondo deleted the issue#4122 branch September 12, 2022 17:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

resource: emit a more specific error when rlist_rerank() fails #4126

resource: emit a more specific error when rlist_rerank() fails #4126

grondo commented Feb 11, 2022

chu11 left a comment

chu11 Feb 11, 2022

grondo Feb 11, 2022

garlick left a comment

grondo commented Feb 14, 2022

garlick commented Feb 14, 2022

grondo commented Feb 14, 2022

garlick commented Feb 14, 2022 •

edited

Loading

grondo commented Feb 14, 2022

grondo commented Feb 14, 2022

Codecov bot commented Feb 15, 2022

grondo commented Feb 15, 2022

grondo commented Feb 15, 2022

		@@ -562,7 +562,7 @@ int cmd_rerank (optparse_t p, int argc, char *argv)
		log_err_exit ("Failed to transform R objects on stdin");

resource: emit a more specific error when rlist_rerank() fails #4126

resource: emit a more specific error when rlist_rerank() fails #4126

Conversation

grondo commented Feb 11, 2022

chu11 left a comment

Choose a reason for hiding this comment

chu11 Feb 11, 2022

Choose a reason for hiding this comment

grondo Feb 11, 2022

Choose a reason for hiding this comment

garlick left a comment

Choose a reason for hiding this comment

grondo commented Feb 14, 2022

garlick commented Feb 14, 2022

grondo commented Feb 14, 2022

garlick commented Feb 14, 2022 • edited Loading

grondo commented Feb 14, 2022

grondo commented Feb 14, 2022

Codecov bot commented Feb 15, 2022

Codecov Report

grondo commented Feb 15, 2022

grondo commented Feb 15, 2022

garlick commented Feb 14, 2022 •

edited

Loading