Reduce delays in detecting child exit #172

nmisch · 2024-03-10T07:32:03Z

In brief, this fixes most instances of #166 through the combination of
[SIGCHLD=sub{}], [low-select-timeout-w32], and [waitpid-blocking]

Issue #166 reported delays of up to 0.5s in detecting child exit. When IPC::Run
is interacting with the child via a file descriptor (FD), the delay likely
doesn't happen. Child exit will make FDs ready, which select() will notice.
With zero such FDs, _select_loop() alternates between waitpid($pid, WNOHANG) and
timeout-only select(undef, undef, undef, $timeout). That limits detection
latency to $timeout. A number of tactics can help, and we can combine tactics
to reduce delays in more cases. I'm choosing each tactic in bold:

Tactic	Race-free	Helps if SIGCHLD blocked	Helps Windows	Helps if timeout active	CPU cost	Complexity
SIGCHLD=sub{}	❌	❌	❌	✅	+	+
self-pipe	✅	❌	❌	✅	+	+++
low-select-timeout-w32	✅	✅	✅	✅	+++	+
waitpid-blocking	✅	✅	✅	❌	-	++
Windows-wait	✅	✅	✅	✅	-	++++
helper-pid	✅	✅	✅	✅	++++	++++

[SIGCHLD=sub{}] makes Perl terminate select() in response to that signal.

[self-pipe] uses https://cr.yp.to/docs/selfpipe.html to fix a race condition in
[SIGCHLD=sub{}], when SIGCHLD arrives just before select() begins.

[low-select-timeout-w32] caps the select() timeout at something less than
today's 0.5s, in Windows zero-FD cases. This gains responsiveness, but it loses
CPU efficiency.

[waitpid-blocking] uses waitpid($pid, 0) when we have no timeouts or FDs to
interact with a child. This works with SIGCHLD unavailable (Windows) or
blocked, but not working with timeouts is a major limitation.

[helper-pid] uses a separate process or thread of the current process to handle
one of the tasks. For example, have the main process select(), and have a
second process write to a pipe when it detects a child exit. The CPU overhead
makes this a loss outside academic scenarios. It could win if SIGCHLD is
blocked, but an application blocking SIGCHLD announces its disinterest in fast
child exit detection. It could win on Windows with sufficiently-long child
runtimes and timeouts, making the continuous cost of select() wake-ups exceed
the fixed cost of creating the helper.

[Windows-wait] uses a C-language module to issue Windows API calls
OpenProcess(), WSAEventSelect(), and WaitForMultipleObjects(). Together, they
enable one thread to wait for child exit, FD readiness, and timeout. This may
allow removing the existing Win32Helper.pm process, reducing today's overhead on
Windows; search the source tree for WaitForMultipleObjects(). No CPAN module
covers all those functions. This might pay off if Win32::API suffices or if
bypassing Win32Helper.pm gains considerable efficiency. Otherwise, it might pay
off on a long time horizon, if we can get its C code in the default installation
of Windows Perl.

Are there other tactics that might outperform these? Are there other key
considerations for evaluating the tactics listed?

I feel the combination of [SIGCHLD=sub{}], [low-select-timeout-w32], and
[waitpid-blocking] has only minor drawbacks, hence that stopping point. The
delay remains when SIGCHLD is blocked or arrives just before select(). Windows
incurs CPU overhead via shorter select() timeouts, only when there's a harness
timeout and zero FDs. The most-plausible alternative was to replace
[SIGCHLD=sub{}] with [self-pipe], accepting a bit of CPU overhead and
developer-facing complexity to remove the "just before select()" race condition.
[Windows-wait] is a more-speculative alternative.

Since tests can't assume much about the passage of real time, I'm not adding
test cases. Existing tests caught bugs in earlier versions of the
waitpid-related commits. You can see the responsiveness of several cases via
the output of temporary tests I had in t/eintr.t here:
https://github.com/nmisch/IPC-Run/actions/runs/8219780209/job/22478157192.

Windows lacks SIGCHLD. When _select_loop() called select() with zero file descriptors, IPC::Run would not respond to child termination until the select() timeout elapsed. For that case, use the 0.01s timeout every time, as an exception to the usual doubling of the timeout until we reach 0.5s. This spends CPU to achieve responsiveness.

In the absence of an application-defined signal handler, SIGCHLD did not improve _select_loop() responsiveness. Install a transient handler.

This refactoring has no functional effects. It makes the function available to subsequent commits.

This code didn't know about GetExitCode(), and it cleaned up filters too early. This code was mostly unreachable, because callers kill_kill() and finish() reap all kids first. The next commit will change that. This might already be reachable when start() failure calls _cleanup(), but broken $? values or I/O results are unimportant then.

… else. This helps when the application blocks SIGCHLD or the platform does not provide SIGCHLD. Expect it to be more efficient, too.

nmisch · 2024-03-17T16:35:31Z

waitpid-blocking will change $? in some cases, e.g. the first one in the
following program. I am inclined to accept that. As the other two examples
show, $? already has been sensitive to implementation details. When a
pipeline has multiple kids, use result(), results(), full_result(), or
full_results() for stable behavior. The main alternative would be to drop
waitpid-blocking, keeping the other two fixes. That would lose a
probably-negligible bit of efficiency, more noticeable on Windows than
elsewhere. Would that be better?

use IPC::Run 'run';

my $in;
my $out;

# master : print 2, because _select_loop -> pumpable -> pump_nb reacts in order of actual termination
# patched: print 1, because finish -> _cleanup -> waitpid follows $self->{KIDS} order
run([$^X, '-e', 'sleep 2; exit 2'], '&',
    [$^X, '-e', 'sleep 1; exit 1']);
printf "%d\n", $? >> 8;

# master : print 1, because the >$out channel remains open until both kids have
#   terminated, and _select_loop -> pumpable doesn't call pump_nb due to the
#   open channel
# patched : same
run([$^X, '-e', 'sleep 2; exit 2'], '<', \$in, '>', \$out, '&',
    [$^X, '-e', 'sleep 1; exit 1']);
printf "%d\n", $? >> 8;

# master : print 2, because _select_loop -> pumpable -> pump_nb reacts in order of actual termination
# patched: same
run([$^X, '-e', 'close STDOUT; sleep 2; exit 2'], '>', \$out, '&',
    [$^X, '-e', 'close STDOUT; sleep 1; exit 1']);
printf "%d\n", $? >> 8;

nmisch added 5 commits March 9, 2024 23:00

Detect child exit faster, via SIGCHLD.

fc9288c

In the absence of an application-defined signal handler, SIGCHLD did not improve _select_loop() responsiveness. Install a transient handler.

Extract _waitpid() subroutine from reap_nb().

1069736

This refactoring has no functional effects. It makes the function available to subsequent commits.

Detect child exit faster, via blocking waitpid(), if awaiting nothing…

896429f

… else. This helps when the application blocks SIGCHLD or the platform does not provide SIGCHLD. Expect it to be more efficient, too.

nmisch merged commit 362f28f into cpan-authors:master Mar 31, 2024
20 checks passed

nmisch mentioned this pull request Mar 31, 2024

libipc-run-perl: delay after child exit #166

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce delays in detecting child exit #172

Reduce delays in detecting child exit #172

nmisch commented Mar 10, 2024

nmisch commented Mar 17, 2024

Reduce delays in detecting child exit #172

Reduce delays in detecting child exit #172

Conversation

nmisch commented Mar 10, 2024

nmisch commented Mar 17, 2024