Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resource: emit a more specific error when rlist_rerank() fails #4126

Merged
merged 4 commits into from
Feb 15, 2022

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Feb 11, 2022

This PR adds an rlist_error_t parameter to rlist_rerank() which offers a more specific error when the rerank operation fails.

The parameter is then used in the resource module to print a specific error along with the general error reranking R.

Finally, to avoid the confusing error message when the resource module exits due to rlist_rerank's use of errno like ENOSPC and EOVERFLOW, reset errno to EINVAL when rlist_rerank() fails.

Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just this comment nit

@@ -562,7 +562,7 @@ int cmd_rerank (optparse_t *p, int argc, char **argv)
log_err_exit ("Failed to transform R objects on stdin");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem to be missing a end paren in commit message, probably goes after "rerank operation"?

(too many or too few hosts specified for the rerank operation, but the resulting error message from users
can therefore be vague and confusing, e.g. "rlist_rerank: No space left of device".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed the commit message and pushed the result.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I misconfigured my system instance to see how the logs turned out and this is what I get from journalctl -u flux after the broker fails to start on rank 0:

Feb 12 18:38:21 picl0 bash[24554]: resource.err[0]: error reranking R: Number of hosts (4) is less than node count (8)
Feb 12 18:38:21 picl0 bash[24554]: resource.crit[0]: fatal error: Invalid argument
Feb 12 18:38:21 picl0 bash[24554]: broker.err[0]: rc1.0: flux-module: broker.insmod: Invalid argument
Feb 12 18:38:21 picl0 bash[24554]: broker.err[0]: rc1.0: /bin/bash -c /usr/local/etc/flux/rc1 Exited (rc=1) 0.2s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: rc1-fail: init->shutdown 0.200315s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: children-none: shutdown->finalize 0.00039453s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: rc3.0: /bin/bash -c /usr/local/etc/flux/rc3 Exited (rc=0) 0.2s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: rc3-success: finalize->exit 0.173281s
Feb 12 18:38:21 picl0 systemd[1]: flux.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 18:38:21 picl0 systemd[1]: flux.service: Failed with result 'exit-code'.

@grondo
Copy link
Contributor Author

grondo commented Feb 14, 2022

resource.crit[0]: fatal error: Invalid argument

I wonder if we should also simplify this generic error on abnormal module termination. It might not be helping to include a generic fatal error message along with strerror (errno). In most cases the module will have (hopefully) already printed a more descriptive error of what actually went wrong, and, if not, then it is unlikely the errno will help pinpoint the problem (except perhaps in the exception case of ENOMEM).

Perhaps just module exiting abnormally or similar?

@garlick
Copy link
Member

garlick commented Feb 14, 2022

Good point. That works for me.

@grondo
Copy link
Contributor Author

grondo commented Feb 14, 2022

That works for me.

I went ahead and tacked that on here.

@garlick
Copy link
Member

garlick commented Feb 14, 2022

Looks like there may be a fluxion test that depends on the current behavior:

2022-02-14T18:42:08.1548425Z PASS: t4000-match-params.t 1 - loading resource module with a tiny machine GRUG works
2022-02-14T18:42:08.1552903Z PASS: t4000-match-params.t 2 - loading resource module with an XML works
2022-02-14T18:42:08.1556859Z PASS: t4000-match-params.t 3 - loading resource module with no option works
2022-02-14T18:42:08.1560752Z PASS: t4000-match-params.t 4 - loading resource module with a nonexistent GRUG fails
2022-02-14T18:42:08.1564600Z PASS: t4000-match-params.t 5 - loading resource module with a nonexistent XML fails
2022-02-14T18:42:08.1568812Z FAIL: t4000-match-params.t 6 - loading resource module with incorrect reader fails
2022-02-14T18:42:08.1572710Z PASS: t4000-match-params.t 7 - loading resource module with known policies works
2022-02-14T18:42:08.1576557Z PASS: t4000-match-params.t 8 - loading resource module with unknown policies is tolerated
2022-02-14T18:42:08.1580677Z PASS: t4000-match-params.t 9 - removing resource works
2022-02-14T18:42:08.1584412Z ERROR: t4000-match-params.t - exited with status 1

Edit: looks like maybe we could just change the test to grep for the new message:

test_expect_success 'loading resource module with incorrect reader fails' '
    unload_resource &&
    flux dmesg -C &&
    load_resource load-file=${xml} load-format=grug \
prune-filters=ALL:core policy=high &&
    test_must_fail flux module stats sched-fluxion-resource &&
    flux dmesg > error3 &&
    grep -i "Invalid argument" error3
'

@grondo
Copy link
Contributor Author

grondo commented Feb 14, 2022

Interesting. I'll remove the last commit and save that for a future PR.

@grondo
Copy link
Contributor Author

grondo commented Feb 14, 2022

Edit: looks like maybe we could just change the test to grep for the new message:

Yeah, that test isn't so good anyway since Invalid argument could appear anywhere in the logs and it would pass. I already have a PR open on the flux-sched testsuite, so I'll fix it there and make the test less sensitive to the actual content of the flux-core logging. Once that is fixed I can re-propose the one liner fix I just removed here.

Problem: rlist_rerank() repurposes errnos like EOVERFLOW and ENOSPC
to communicate the type of error (e.g. too many or too few hosts
specified for the rerank operation), but the resulting error message
from users can therefore be vague and confusing, e.g. "rlist_rerank:
No space left of device".

Add an optional rlist_error_t parameter to the rlist_rerank() function
which, when provided, will be filled in with a more useful error.
Callers can then print this error instead of using strerror.

Since the function prototype changed, update all tests and callers.

Fixes flux-framework#3830
Problem: rlist_rerank() repurposes system errnos like ENOSPC and
EOVERFLOW to indicate the specific type of error encountered. However,
this can cause confusion when a subsequent call to flux_log_error()
occurs, which could result in something like:

 resource.crit[0]: fatal error: No space left on device

which is clearly misleading.

Since the specific error is now printed at the rlist_rerank() call
site, reset errno to the more general (and perhaps correct) EINVAL
before returning with error from convert_R_conf().

Since this applies specifically to rlist_rerank(), split the call to
flux_attr_get(3) out from the combined conditional and give it its
own error message.

Fixes flux-framework#4122
Problem: Inclusion of `strerror (errno)` in the error message logged
when a module exits mod_main() with a nonzero return code is probably
not helpful to administrators perusing the logs. In most cases,
the module will have already printed a more descriptive error before
exiting with error, and the errno at best is not helpful and worst
may be confusing.

Clarify the log message at abnormal module exit to be clear about
what is happening, and exclude the result of strerror() to avoid any
confusion with the real error which should be further up in the logs.
@codecov
Copy link

Codecov bot commented Feb 15, 2022

Codecov Report

Merging #4126 (028549c) into master (cac0153) will decrease coverage by 3.18%.
The diff coverage is 85.71%.

❗ Current head 028549c differs from pull request most recent head 57d7b28. Consider uploading reports for the commit 57d7b28 to get more accurate results

@@            Coverage Diff             @@
##           master    #4126      +/-   ##
==========================================
- Coverage   83.32%   80.13%   -3.19%     
==========================================
  Files         376      376              
  Lines       63016    62547     -469     
==========================================
- Hits        52509    50123    -2386     
- Misses      10507    12424    +1917     
Impacted Files Coverage Δ
src/modules/resource/inventory.c 78.01% <50.00%> (-2.80%) ⬇️
src/broker/module.c 74.88% <100.00%> (-2.53%) ⬇️
src/cmd/flux-R.c 67.91% <100.00%> (-12.47%) ⬇️
src/common/librlist/rlist.c 74.56% <100.00%> (-10.42%) ⬇️
src/modules/job-info/allow.c 28.57% <0.00%> (-42.86%) ⬇️
src/cmd/top/ucache.c 58.62% <0.00%> (-28.05%) ⬇️
src/modules/job-manager/getattr.c 37.03% <0.00%> (-22.23%) ⬇️
src/common/librlist/rnode.c 65.10% <0.00%> (-19.71%) ⬇️
src/common/libsubprocess/remote.c 53.21% <0.00%> (-17.65%) ⬇️
src/shell/affinity.c 62.22% <0.00%> (-17.19%) ⬇️
... and 204 more

@grondo
Copy link
Contributor Author

grondo commented Feb 15, 2022

Actually, since the fix for flux-sched was merged (thanks @garlick), I pushed the commit again to simplify the module abnormal exit error.

@grondo
Copy link
Contributor Author

grondo commented Feb 15, 2022

Setting MWP.

@mergify Mergify bot merged commit 6f0165b into flux-framework:master Feb 15, 2022
@grondo grondo deleted the issue#4122 branch September 12, 2022 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants