Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce delays in detecting child exit #172

Merged
merged 5 commits into from
Mar 31, 2024
Merged

Conversation

nmisch
Copy link
Collaborator

@nmisch nmisch commented Mar 10, 2024

In brief, this fixes most instances of #166 through the combination of
[SIGCHLD=sub{}], [low-select-timeout-w32], and [waitpid-blocking]


Issue #166 reported delays of up to 0.5s in detecting child exit. When IPC::Run
is interacting with the child via a file descriptor (FD), the delay likely
doesn't happen. Child exit will make FDs ready, which select() will notice.
With zero such FDs, _select_loop() alternates between waitpid($pid, WNOHANG) and
timeout-only select(undef, undef, undef, $timeout). That limits detection
latency to $timeout. A number of tactics can help, and we can combine tactics
to reduce delays in more cases. I'm choosing each tactic in bold:

Tactic Race-free Helps if SIGCHLD blocked Helps Windows Helps if timeout active CPU cost Complexity
SIGCHLD=sub{} + +
self-pipe + +++
low-select-timeout-w32 +++ +
waitpid-blocking - ++
Windows-wait - ++++
helper-pid ++++ ++++

[SIGCHLD=sub{}] makes Perl terminate select() in response to that signal.

[self-pipe] uses https://cr.yp.to/docs/selfpipe.html to fix a race condition in
[SIGCHLD=sub{}], when SIGCHLD arrives just before select() begins.

[low-select-timeout-w32] caps the select() timeout at something less than
today's 0.5s, in Windows zero-FD cases. This gains responsiveness, but it loses
CPU efficiency.

[waitpid-blocking] uses waitpid($pid, 0) when we have no timeouts or FDs to
interact with a child. This works with SIGCHLD unavailable (Windows) or
blocked, but not working with timeouts is a major limitation.

[helper-pid] uses a separate process or thread of the current process to handle
one of the tasks. For example, have the main process select(), and have a
second process write to a pipe when it detects a child exit. The CPU overhead
makes this a loss outside academic scenarios. It could win if SIGCHLD is
blocked, but an application blocking SIGCHLD announces its disinterest in fast
child exit detection. It could win on Windows with sufficiently-long child
runtimes and timeouts, making the continuous cost of select() wake-ups exceed
the fixed cost of creating the helper.

[Windows-wait] uses a C-language module to issue Windows API calls
OpenProcess(), WSAEventSelect(), and WaitForMultipleObjects(). Together, they
enable one thread to wait for child exit, FD readiness, and timeout. This may
allow removing the existing Win32Helper.pm process, reducing today's overhead on
Windows; search the source tree for WaitForMultipleObjects(). No CPAN module
covers all those functions. This might pay off if Win32::API suffices or if
bypassing Win32Helper.pm gains considerable efficiency. Otherwise, it might pay
off on a long time horizon, if we can get its C code in the default installation
of Windows Perl.

Are there other tactics that might outperform these? Are there other key
considerations for evaluating the tactics listed?


I feel the combination of [SIGCHLD=sub{}], [low-select-timeout-w32], and
[waitpid-blocking] has only minor drawbacks, hence that stopping point. The
delay remains when SIGCHLD is blocked or arrives just before select(). Windows
incurs CPU overhead via shorter select() timeouts, only when there's a harness
timeout and zero FDs. The most-plausible alternative was to replace
[SIGCHLD=sub{}] with [self-pipe], accepting a bit of CPU overhead and
developer-facing complexity to remove the "just before select()" race condition.
[Windows-wait] is a more-speculative alternative.

Since tests can't assume much about the passage of real time, I'm not adding
test cases. Existing tests caught bugs in earlier versions of the
waitpid-related commits. You can see the responsiveness of several cases via
the output of temporary tests I had in t/eintr.t here:
https://github.com/nmisch/IPC-Run/actions/runs/8219780209/job/22478157192.

Windows lacks SIGCHLD.  When _select_loop() called select() with zero
file descriptors, IPC::Run would not respond to child termination until
the select() timeout elapsed.  For that case, use the 0.01s timeout
every time, as an exception to the usual doubling of the timeout until
we reach 0.5s.  This spends CPU to achieve responsiveness.
In the absence of an application-defined signal handler, SIGCHLD did not
improve _select_loop() responsiveness.  Install a transient handler.
This refactoring has no functional effects.  It makes the function
available to subsequent commits.
This code didn't know about GetExitCode(), and it cleaned up filters too
early.  This code was mostly unreachable, because callers kill_kill()
and finish() reap all kids first.  The next commit will change that.
This might already be reachable when start() failure calls _cleanup(),
but broken $? values or I/O results are unimportant then.
… else.

This helps when the application blocks SIGCHLD or the platform does not
provide SIGCHLD.  Expect it to be more efficient, too.
@nmisch
Copy link
Collaborator Author

nmisch commented Mar 17, 2024

waitpid-blocking will change $? in some cases, e.g. the first one in the
following program. I am inclined to accept that. As the other two examples
show, $? already has been sensitive to implementation details. When a
pipeline has multiple kids, use result(), results(), full_result(), or
full_results() for stable behavior. The main alternative would be to drop
waitpid-blocking, keeping the other two fixes. That would lose a
probably-negligible bit of efficiency, more noticeable on Windows than
elsewhere. Would that be better?

use IPC::Run 'run';

my $in;
my $out;

# master : print 2, because _select_loop -> pumpable -> pump_nb reacts in order of actual termination
# patched: print 1, because finish -> _cleanup -> waitpid follows $self->{KIDS} order
run([$^X, '-e', 'sleep 2; exit 2'], '&',
    [$^X, '-e', 'sleep 1; exit 1']);
printf "%d\n", $? >> 8;

# master : print 1, because the >$out channel remains open until both kids have
#   terminated, and _select_loop -> pumpable doesn't call pump_nb due to the
#   open channel
# patched : same
run([$^X, '-e', 'sleep 2; exit 2'], '<', \$in, '>', \$out, '&',
    [$^X, '-e', 'sleep 1; exit 1']);
printf "%d\n", $? >> 8;

# master : print 2, because _select_loop -> pumpable -> pump_nb reacts in order of actual termination
# patched: same
run([$^X, '-e', 'close STDOUT; sleep 2; exit 2'], '>', \$out, '&',
    [$^X, '-e', 'close STDOUT; sleep 1; exit 1']);
printf "%d\n", $? >> 8;

@nmisch nmisch merged commit 362f28f into cpan-authors:master Mar 31, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant