job-ingest: handle worker channel overflow #4948

garlick · 2023-02-17T00:54:33Z

Problem: flux stopped accepting job requests on one of the production login nodes after some "channel" errors on the validator subprocess (#4920).

This fixes issue #4920 and improves the error handling and logging around subprocess channel overflow errors. When an overflow occurs, you now get:

broker.err[0]: Error writing 4707 bytes to subprocess pid 4077290 stdin: No space left on device
job-ingest.err[0]: job-validator[0]: Failed: No space left on device

and the ingest worker is restarted.

garlick · 2023-02-20T01:12:44Z

I pared this back a bit to focus on the issue that was seen in production.

grondo

This looks like great cleanup to me! Really nice job on the new set of unit tests.
Just noticed one thing in one of the unit tests that may or may not be an issue.

grondo · 2023-02-20T21:03:50Z

src/common/libsubprocess/test/iostress.c

+    diag ("%s: destroying subprocess", name);
+    flux_subprocess_destroy (ctx.p);
+    flux_cmd_destroy (cmd);
+


Minor: is ctx.timer leaked here?

garlick · 2023-02-20

Good catch! Still playing with the tests a bit. They are a little tricky to get right.

grondo · 2023-02-20

Sorry! I neglected to notice this PR was still WIP.

garlick · 2023-02-20

No problem! I should be wrapping up as soon as I get a clean CI run.

garlick · 2023-02-21T17:33:48Z

Per offline comment from @grondo, repushed with SIGPIPE blocked in the test subprocess server, and a note suggesting to do that in the server.h header.

Problem: when something goes wrong with the open loop rexec.write RPC in the subprocess server, the log messages could be more helpful. Report a short write on a stream as ENOSPC rather than EOVERFLOW, to match the flux_buffer_t error when the buffer is completely full. Ensure errno is set before calling flux_log_error(). Include pid, byte count, and stream in other rexec.write errors where appropriate. Collapse unnecessary local helper functions.

Problem: when the remote subprocess server returns output that cannot be buffered, the error messages are not very helpful. Promote a short write to the buffer to a fatal error, and use ENOSPC to match the error flux_buffer returns when completely full. Consolidate logging for short writes and other buffer errors. Provide more context in the error messages so it's clear which buffer is overflowing. Consolidate unlikely message decode errors into one log message.
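The short-write handling described in the two commit messages above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in server.c or remote.c: `struct fixed_buf`, `fixed_buf_write()`, and `write_or_enospc()` are hypothetical stand-ins, and the 16-byte capacity is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a bounded stream buffer (not flux_buffer_t). */
struct fixed_buf {
    char data[16];
    size_t used;
};

/* Copy as much of src as fits and return the number of bytes stored,
 * which may be fewer than len (a short write). */
static size_t fixed_buf_write (struct fixed_buf *buf,
                               const char *src,
                               size_t len)
{
    size_t space = sizeof (buf->data) - buf->used;
    size_t n = len < space ? len : space;

    memcpy (buf->data + buf->used, src, n);
    buf->used += n;
    return n;
}

/* Write all len bytes or fail.  A short write is reported as ENOSPC,
 * the same errno a completely full buffer would produce.
 * Returns 0 on success, -1 on failure with errno set. */
int write_or_enospc (struct fixed_buf *buf, const char *src, size_t len)
{
    assert (buf != NULL && src != NULL);
    if (fixed_buf_write (buf, src, len) < len) {
        errno = ENOSPC;  /* set before any logging call that reads errno */
        return -1;
    }
    return 0;
}
```

Setting errno explicitly at the point of failure matters because a logging helper that formats errno would otherwise print whatever value an earlier call left behind.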

Problem: if a worker (validator/frobnicator) enters the FLUX_SUBPROCESS_FAILED state, job-ingest assumes that it will get an on_completion() callback, but this is not the case. Share the same worker cleanup code between on_completion() and on_state_change() for the FAILED case. Fixes flux-framework#4920
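The shape of that fix, sharing one cleanup path between the two callbacks, might look something like the sketch below. Everything here is hypothetical: `enum proc_state`, `struct worker`, and the callback signatures are simplified stand-ins and not the libsubprocess or job-ingest definitions. The guard that makes cleanup idempotent is an assumption of this sketch, added so that it does not matter which callback fires or whether both do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states, loosely modeled on the ones named in the text. */
enum proc_state { PROC_RUNNING, PROC_EXITED, PROC_FAILED };

/* Hypothetical worker record. */
struct worker {
    bool active;        /* true while a subprocess is attached */
    int cleanup_count;  /* number of times cleanup actually ran */
};

/* Single cleanup path; calling it again after cleanup is a no-op. */
static void worker_cleanup (struct worker *w)
{
    if (!w->active)
        return;
    w->active = false;
    w->cleanup_count++;
}

/* Normal exit path. */
void on_completion (struct worker *w)
{
    assert (w != NULL);
    worker_cleanup (w);
}

/* Failure path: FAILED may never be followed by a completion callback,
 * so clean up here as well. */
void on_state_change (struct worker *w, enum proc_state state)
{
    assert (w != NULL);
    if (state == PROC_FAILED)
        worker_cleanup (w);
}
```

Once the worker is marked inactive, the owner is free to start a replacement, which corresponds to the "ingest worker is restarted" behavior described at the top of this PR.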

Problem: there are no unit tests for remote subprocesses. Create a subprocess server on a back-to-back flux_t handle using libtestutil, then poke at it in various ways. The "remote" unit test program just ensures the subprocess API works as advertised for remote processes. The "iostress" unit test program overflows I/O buffers to cover the error-handling code.

codecov · 2023-02-21T18:14:57Z

Codecov Report

Merging #4948 (024ad6a) into master (3fee5ce) will increase coverage by 28.76%.
The diff coverage is 72.22%.

❗ Current head 024ad6a differs from pull request most recent head e26c0e8. Consider uploading reports for the commit e26c0e8 to get more accurate results

@@             Coverage Diff             @@
##           master    #4948       +/-   ##
===========================================
+ Coverage   54.34%   83.11%   +28.76%     
===========================================
  Files         405      428       +23     
  Lines       70468    75544     +5076     
===========================================
+ Hits        38297    62789    +24492     
+ Misses      32171    12755    -19416

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/common/libsubprocess/remote.c | `73.41% <50.00%> (+32.27%)` | ⬆️ |
| src/common/libsubprocess/server.c | `78.97% <80.00%> (+35.85%)` | ⬆️ |
| src/modules/job-ingest/worker.c | `76.66% <81.81%> (+14.54%)` | ⬆️ |
| src/common/libflux/response.c | `79.56% <0.00%> (-10.34%)` | ⬇️ |
| src/common/libpmi/simple_client.c | `73.03% <0.00%> (-9.92%)` | ⬇️ |
| src/common/libflux/control.c | `68.75% <0.00%> (-9.83%)` | ⬇️ |
| src/common/libflux/request.c | `84.76% <0.00%> (-8.65%)` | ⬇️ |
| src/modules/kvs/kvs_wait_version.c | `89.85% <0.00%> (-6.92%)` | ⬇️ |
| src/common/libioencode/ioencode.c | `89.61% <0.00%> (-6.23%)` | ⬇️ |
| src/common/libsubprocess/command.c | `69.33% <0.00%> (-5.60%)` | ⬇️ |
... and 356 more

garlick force-pushed the issue#4920 branch 3 times, most recently from de8496b to 420fad0 February 20, 2023 01:10

garlick changed the title ~~WIP: job-ingest: handle validator/frobnicator communication failures~~ WIP: job-ingest: handle worker channel overflow Feb 20, 2023

garlick force-pushed the issue#4920 branch 2 times, most recently from 1ce5d4d to decb0f2 February 20, 2023 20:47

grondo approved these changes Feb 20, 2023

garlick force-pushed the issue#4920 branch 2 times, most recently from 0fa38a6 to 9a8e3c3 February 20, 2023 21:16

garlick changed the title ~~WIP: job-ingest: handle worker channel overflow~~ job-ingest: handle worker channel overflow Feb 21, 2023

garlick force-pushed the issue#4920 branch from 9a8e3c3 to 024ad6a February 21, 2023 17:31

garlick added the merge-when-passing label Feb 21, 2023

garlick added 4 commits February 21, 2023 18:12

garlick force-pushed the issue#4920 branch from 024ad6a to e26c0e8 February 21, 2023 18:12

mergify bot merged commit 59db42c into flux-framework:master Feb 21, 2023

garlick deleted the issue#4920 branch February 21, 2023 22:00
