resource: emit a more specific error when rlist_rerank() fails #4126
LGTM, just this comment nit
@@ -562,7 +562,7 @@ int cmd_rerank (optparse_t *p, int argc, char **argv)
log_err_exit ("Failed to transform R objects on stdin");
Seems to be missing an end paren in the commit message; it probably goes after "rerank operation"?
(too many or too few hosts specified for the rerank operation, but the resulting error message from users
can therefore be vague and confusing, e.g. "rlist_rerank: No space left of device".
Thanks! Fixed the commit message and pushed the result.
Looks good! I misconfigured my system instance to see how the logs turned out, and this is what I get from `journalctl -u flux` after the broker fails to start on rank 0:
Feb 12 18:38:21 picl0 bash[24554]: resource.err[0]: error reranking R: Number of hosts (4) is less than node count (8)
Feb 12 18:38:21 picl0 bash[24554]: resource.crit[0]: fatal error: Invalid argument
Feb 12 18:38:21 picl0 bash[24554]: broker.err[0]: rc1.0: flux-module: broker.insmod: Invalid argument
Feb 12 18:38:21 picl0 bash[24554]: broker.err[0]: rc1.0: /bin/bash -c /usr/local/etc/flux/rc1 Exited (rc=1) 0.2s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: rc1-fail: init->shutdown 0.200315s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: children-none: shutdown->finalize 0.00039453s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: rc3.0: /bin/bash -c /usr/local/etc/flux/rc3 Exited (rc=0) 0.2s
Feb 12 18:38:21 picl0 bash[24554]: broker.info[0]: rc3-success: finalize->exit 0.173281s
Feb 12 18:38:21 picl0 systemd[1]: flux.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 18:38:21 picl0 systemd[1]: flux.service: Failed with result 'exit-code'.
I wonder if we should also simplify this generic error on abnormal module termination. It might not be helping to include a generic Perhaps just
Good point. That works for me.
I went ahead and tacked that on here.
Looks like there may be a fluxion test that depends on the current behavior:
Edit: looks like maybe we could just change the test to grep for the new message:
Interesting. I'll remove the last commit and save that for a future PR.
Yeah, that test isn't so good anyway since
Problem: rlist_rerank() repurposes errnos like EOVERFLOW and ENOSPC to communicate the type of error (e.g. too many or too few hosts specified for the rerank operation), but the resulting error message shown to users can therefore be vague and confusing, e.g. "rlist_rerank: No space left on device".
Add an optional rlist_error_t parameter to the rlist_rerank() function which, when provided, will be filled in with a more useful error. Callers can then print this error instead of using strerror.
Since the function prototype changed, update all tests and callers.
Fixes flux-framework#3830
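For readers unfamiliar with the pattern, here is a minimal sketch of an optional error out-parameter as described in the commit message above. It is not the flux-core implementation: `demo_error_t`, `demo_error_set()`, and `demo_rerank()` are hypothetical stand-ins, the mapping of ENOSPC to "too few hosts" and EOVERFLOW to "too many hosts" is an assumption, and the "greater than" message wording is invented for illustration (only the "less than" wording appears in the logs in this thread).

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for rlist_error_t: a fixed-size text buffer. */
typedef struct {
    char text[128];
} demo_error_t;

/* Format a message into errp, or do nothing if the caller passed NULL. */
static void demo_error_set (demo_error_t *errp, const char *fmt, ...)
{
    va_list ap;

    if (!errp)
        return;
    va_start (ap, fmt);
    vsnprintf (errp->text, sizeof (errp->text), fmt, ap);
    va_end (ap);
}

/* Hypothetical rerank check: only compares the host count to the node
 * count. errno is still set (assumed mapping), but the human-readable
 * reason goes into errp when one is provided.
 */
int demo_rerank (int nhosts, int nnodes, demo_error_t *errp)
{
    if (nhosts < nnodes) {
        demo_error_set (errp,
                        "Number of hosts (%d) is less than node count (%d)",
                        nhosts, nnodes);
        errno = ENOSPC;
        return -1;
    }
    if (nhosts > nnodes) {
        demo_error_set (errp,
                        "Number of hosts (%d) is greater than node count (%d)",
                        nhosts, nnodes);
        errno = EOVERFLOW;
        return -1;
    }
    return 0;
}
```

A caller that wants the specific reason passes a `demo_error_t` and prints its `text` on failure; a caller that does not care passes NULL and the behavior is otherwise unchanged.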
Problem: rlist_rerank() repurposes system errnos like ENOSPC and EOVERFLOW to indicate the specific type of error encountered. However, this can cause confusion when a subsequent call to flux_log_error() occurs, which could result in something like:
resource.crit[0]: fatal error: No space left on device
which is clearly misleading.
Since the specific error is now printed at the rlist_rerank() call site, reset errno to the more general (and perhaps correct) EINVAL before returning with error from convert_R_conf().
Since this applies specifically to rlist_rerank(), split the call to flux_attr_get(3) out from the combined conditional and give it its own error message.
Fixes flux-framework#4122
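The errno reset described above can be sketched roughly as follows. This is an illustration only, not the actual convert_R_conf() code: `demo_failing_rerank()` and `demo_convert()` are hypothetical names, and the failing call is hard-wired to fail so the example stays self-contained.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for a rerank call that fails with a repurposed
 * errno (ENOSPC) and writes a specific reason into errbuf.
 */
static int demo_failing_rerank (char *errbuf, size_t len)
{
    snprintf (errbuf, len, "Number of hosts (4) is less than node count (8)");
    errno = ENOSPC;
    return -1;
}

/* Hypothetical caller: print the specific error at the call site, then
 * reset errno to EINVAL so that any later strerror()-style logging does
 * not report "No space left on device".
 */
int demo_convert (void)
{
    char errbuf[128];

    if (demo_failing_rerank (errbuf, sizeof (errbuf)) < 0) {
        fprintf (stderr, "error reranking R: %s\n", errbuf);
        errno = EINVAL; /* set last, after any call that may touch errno */
        return -1;
    }
    return 0;
}
```

Setting errno as the final step before returning matters, because intervening library calls such as fprintf() are allowed to modify it.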
Problem: Inclusion of `strerror (errno)` in the error message logged when a module exits mod_main() with a nonzero return code is probably not helpful to administrators perusing the logs. In most cases, the module will have already printed a more descriptive error before exiting with error, and the errno at best is not helpful and at worst may be confusing.
Clarify the log message at abnormal module exit to be clear about what is happening, and exclude the result of strerror() to avoid any confusion with the real error, which should be further up in the logs.
Codecov Report
@@ Coverage Diff @@
## master #4126 +/- ##
==========================================
- Coverage 83.32% 80.13% -3.19%
==========================================
Files 376 376
Lines 63016 62547 -469
==========================================
- Hits 52509 50123 -2386
- Misses 10507 12424 +1917
Actually, since the fix for flux-sched was merged (thanks @garlick), I pushed the commit again to simplify the module abnormal exit error.
Setting MWP.
This PR adds an `rlist_error_t` parameter to `rlist_rerank()` which offers a more specific error when the rerank operation fails.
The parameter is then used in the `resource` module to print a specific error along with the general `error reranking R`.
Finally, to avoid the confusing error message when the `resource` module exits due to `rlist_rerank`'s use of `errno` like `ENOSPC` and `EOVERFLOW`, reset `errno` to `EINVAL` when `rlist_rerank()` fails.