-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
job-ingest: handle worker channel overflow #4948
Conversation
de8496b
to
420fad0
Compare
I pared this back a bit to focus on the issue that was seen in production. |
1ce5d4d
to
decb0f2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like great cleanup to me! Really nice job on the new set of unit tests.
Just noticed one thing in one of the unit tests that may or may not be an issue.
diag ("%s: destroying subprocess", name); | ||
flux_subprocess_destroy (ctx.p); | ||
flux_cmd_destroy (cmd); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: is ctx.timer
leaked here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Still playing with the tests a bit. They are a little tricky to get right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry! I neglected to notice this PR was still WIP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem! I should be wrapping up as soon as I get a clean CI run.
0fa38a6
to
9a8e3c3
Compare
Per offline comment from @grondo, repushed with SIGPIPE blocked in the test subprocess server, and a note suggesting to do that in the server.h header. |
Problem: when something goes wrong with the open loop rexec.write RPC in the subprocess server, the log messages could be more helpful. Promote short write on stream to ENOSPC not EOVERFLOW to match the flux_buffer_t error when buffer is completely full. Ensure errno is set before calling flux_log_error(). Include pid, byte count, and stream in other rexec.write errors where appropriate. Collapse unnecessary local helper functions.
Problem: when the remote subprocess server returns output that cannot be buffered, the error messages are not too helpful. Promote a short write to the buffer to a fatal error and use ENOSPC to match the error flux_buffer returns when completely full and consolidate logging for short write and other buffer errors. Provide more context in the error messages so it's clear which buffer is overflowing. Consolidate unlikely message decode errors into one log message.
Problem: if a worker (validator/frobnicator) enters the FLUX_SUBPROCESS_FAILED state, job-ingest assumes that it will get an on_completion() callback, but this is not the case. Share the same worker cleanup code between on_completion() and on_state_change() for the FAILED case. Fixes flux-framework#4920
Problem: there are no unit tests for remote subprocesses. Create a subprocess server on a back to back flux_t handle using libtestutil, then poke at it various ways. The "remote" unit test program just ensures the subprocess API works as advertised for remote processes. The "iostress" unit test program overflows I/O buffers to cover the error code.
Codecov Report
@@ Coverage Diff @@
## master #4948 +/- ##
===========================================
+ Coverage 54.34% 83.11% +28.76%
===========================================
Files 405 428 +23
Lines 70468 75544 +5076
===========================================
+ Hits 38297 62789 +24492
+ Misses 32171 12755 -19416
|
Problem: we noticed that flux stopped accepting job requests on one of the production login nodes after some "channel" errors on the validator subprocess (#4920)
This fixes issue #4920 and improves the error handling and logging around subprocess channel overflow errors. Now when this occurs you get
and the ingest worker is restarted.