job-ingest: handle worker channel overflow #4948

Merged (4 commits) on Feb 21, 2023

@garlick garlick commented Feb 17, 2023

Problem: we noticed that flux stopped accepting job requests on one of the production login nodes after some "channel" errors on the validator subprocess (#4920).

This fixes issue #4920 and improves error handling and logging around subprocess channel overflow errors. When an overflow occurs, the logs now show:

broker.err[0]: Error writing 4707 bytes to subprocess pid 4077290 stdin: No space left on device
job-ingest.err[0]: job-validator[0]: Failed: No space left on device

and the ingest worker is restarted.

@garlick garlick force-pushed the issue#4920 branch 3 times, most recently from de8496b to 420fad0 Compare February 20, 2023 01:10
@garlick garlick changed the title from "WIP: job-ingest: handle validator/frobnicator communication failures" to "WIP: job-ingest: handle worker channel overflow" Feb 20, 2023
Member Author

garlick commented Feb 20, 2023

I pared this back a bit to focus on the issue that was seen in production.

@garlick garlick force-pushed the issue#4920 branch 2 times, most recently from 1ce5d4d to decb0f2 Compare February 20, 2023 20:47
Contributor

@grondo grondo left a comment

This looks like great cleanup to me! Really nice job on the new set of unit tests.
Just noticed one thing in one of the unit tests that may or may not be an issue.

diag ("%s: destroying subprocess", name);
flux_subprocess_destroy (ctx.p);
flux_cmd_destroy (cmd);

Contributor

Minor: is ctx.timer leaked here?

Member Author

Good catch! Still playing with the tests a bit. They are a little tricky to get right.

Contributor

Sorry! I neglected to notice this PR was still WIP.

Member Author

No problem! I should be wrapping up as soon as I get a clean CI run.

@garlick garlick force-pushed the issue#4920 branch 2 times, most recently from 0fa38a6 to 9a8e3c3 Compare February 20, 2023 21:16
@garlick garlick changed the title from "WIP: job-ingest: handle worker channel overflow" to "job-ingest: handle worker channel overflow" Feb 21, 2023
Member Author

garlick commented Feb 21, 2023

Per offline comment from @grondo, repushed with SIGPIPE blocked in the test subprocess server, and a note suggesting to do that in the server.h header.

Problem: when something goes wrong with the open loop rexec.write
RPC in the subprocess server, the log messages could be more helpful.

Promote a short write on a stream to ENOSPC rather than EOVERFLOW, to
match the flux_buffer_t error returned when the buffer is completely
full.  Ensure errno is set before calling flux_log_error().
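
A minimal sketch of the short-write promotion described above, under stated assumptions: struct chan_buf, chan_buf_write(), and chan_write_all() are hypothetical stand-ins, not the libsubprocess or flux_buffer_t API. The point is only that a partial write and a completely full buffer surface as the same ENOSPC error, and that errno is set before any logging call that reads it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-capacity buffer standing in for a subprocess channel. */
struct chan_buf {
    char data[16];
    size_t used;
};

/* Copy as much as fits and return the number of bytes accepted.
 * Fails with ENOSPC only when the buffer is already completely full.
 */
int chan_buf_write (struct chan_buf *b, const char *src, size_t len)
{
    size_t space;

    assert (b != NULL && src != NULL);
    space = sizeof (b->data) - b->used;
    if (space == 0) {
        errno = ENOSPC;
        return -1;
    }
    if (len > space)
        len = space;
    memcpy (b->data + b->used, src, len);
    b->used += len;
    return (int)len;
}

/* Write all of src or fail.  A short write is promoted to ENOSPC so the
 * caller sees the same error for "partly full" and "completely full".
 * errno is set before returning, so a logger that reads errno reports it.
 * Note that bytes accepted before the overflow stay in the buffer.
 */
int chan_write_all (struct chan_buf *b, const char *src, size_t len)
{
    int n = chan_buf_write (b, src, len);

    if (n < 0)
        return -1;              /* errno already set by chan_buf_write */
    if ((size_t)n < len) {
        errno = ENOSPC;         /* short write: treat as overflow */
        return -1;
    }
    return 0;
}
```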

Include pid, byte count, and stream in other rexec.write errors where
appropriate.

Collapse unnecessary local helper functions.

Problem: when the remote subprocess server returns output that
cannot be buffered, the error messages are not very helpful.

Promote a short write to the buffer to a fatal error, using ENOSPC
to match the error flux_buffer returns when completely full.
Consolidate logging for short writes and other buffer errors.

Provide more context in the error messages so it's clear which buffer
is overflowing.

Consolidate unlikely message decode errors into one log message.

Problem: if a worker (validator/frobnicator) enters the
FLUX_SUBPROCESS_FAILED state, job-ingest assumes that it will
get an on_completion() callback, but this is not the case.

Share the same worker cleanup code between on_completion() and
on_state_change() for the FAILED case.
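
The shared cleanup path can be sketched as follows. This is an illustration under assumptions, not the job-ingest code: enum proc_state and struct worker are hypothetical stand-ins, the callbacks take simplified arguments, and making the cleanup idempotent is a design choice of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the subprocess states relevant here. */
enum proc_state { PROC_RUNNING, PROC_EXITED, PROC_FAILED };

/* Hypothetical worker record; the real job-ingest worker holds more. */
struct worker {
    bool active;        /* true while a subprocess is attached */
    int cleanup_count;  /* number of times cleanup actually ran */
};

/* Single cleanup path.  It is idempotent (an assumption of this sketch),
 * so it is harmless if more than one callback reaches it.
 */
void worker_cleanup (struct worker *w)
{
    assert (w != NULL);
    if (!w->active)
        return;
    w->active = false;
    w->cleanup_count++;
}

/* Normal exit: the completion callback runs the cleanup. */
void on_completion (struct worker *w)
{
    worker_cleanup (w);
}

/* A FAILED subprocess does not get a completion callback, so the
 * state-change callback must run the same cleanup for that state.
 */
void on_state_change (struct worker *w, enum proc_state state)
{
    if (state == PROC_FAILED)
        worker_cleanup (w);
}
```

With this shape, a worker that fails is released exactly once and can be restarted, whichever callback fires.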

Fixes flux-framework#4920

Problem: there are no unit tests for remote subprocesses.

Create a subprocess server on a back-to-back flux_t handle using
libtestutil, then poke at it in various ways.

The "remote" unit test program just ensures the subprocess API
works as advertised for remote processes.

The "iostress" unit test program overflows I/O buffers to cover
the error-handling code.

codecov bot commented Feb 21, 2023

Codecov Report

Merging #4948 (024ad6a) into master (3fee5ce) will increase coverage by 28.76%.
The diff coverage is 72.22%.

❗ Current head 024ad6a differs from pull request most recent head e26c0e8. Consider uploading reports for the commit e26c0e8 to get more accurate results

@@             Coverage Diff             @@
##           master    #4948       +/-   ##
===========================================
+ Coverage   54.34%   83.11%   +28.76%     
===========================================
  Files         405      428       +23     
  Lines       70468    75544     +5076     
===========================================
+ Hits        38297    62789    +24492     
+ Misses      32171    12755    -19416     
Impacted Files                      Coverage Δ
src/common/libsubprocess/remote.c   73.41% <50.00%> (+32.27%) ⬆️
src/common/libsubprocess/server.c   78.97% <80.00%> (+35.85%) ⬆️
src/modules/job-ingest/worker.c     76.66% <81.81%> (+14.54%) ⬆️
src/common/libflux/response.c       79.56% <0.00%> (-10.34%) ⬇️
src/common/libpmi/simple_client.c   73.03% <0.00%> (-9.92%) ⬇️
src/common/libflux/control.c        68.75% <0.00%> (-9.83%) ⬇️
src/common/libflux/request.c        84.76% <0.00%> (-8.65%) ⬇️
src/modules/kvs/kvs_wait_version.c  89.85% <0.00%> (-6.92%) ⬇️
src/common/libioencode/ioencode.c   89.61% <0.00%> (-6.23%) ⬇️
src/common/libsubprocess/command.c  69.33% <0.00%> (-5.60%) ⬇️
... and 356 more

@mergify mergify bot merged commit 59db42c into flux-framework:master Feb 21, 2023
@garlick garlick deleted the issue#4920 branch February 21, 2023 22:00