Reduce delays in detecting child exit #172
Merged
In brief, this fixes most instances of #166 through the combination of
[SIGCHLD=sub{}], [low-select-timeout-w32], and [waitpid-blocking].
Issue #166 reported delays of up to 0.5s in detecting child exit. When IPC::Run
is interacting with the child via a file descriptor (FD), the delay likely
doesn't happen. Child exit will make FDs ready, which select() will notice.
With zero such FDs, _select_loop() alternates between waitpid($pid, WNOHANG) and
timeout-only select(undef, undef, undef, $timeout). That limits detection
latency to $timeout. A number of tactics can help, and we can combine tactics
to reduce delays in more cases. I'm choosing each tactic in bold:
- **[SIGCHLD=sub{}]** makes Perl terminate select() in response to that signal.
- [self-pipe] uses https://cr.yp.to/docs/selfpipe.html to fix a race condition in
  [SIGCHLD=sub{}], when SIGCHLD arrives just before select() begins.
- **[low-select-timeout-w32]** caps the select() timeout at something less than
  today's 0.5s, in Windows zero-FD cases. This gains responsiveness, but it loses
  CPU efficiency.
- **[waitpid-blocking]** uses waitpid($pid, 0) when we have no timeouts or FDs to
  interact with a child. This works when SIGCHLD is unavailable (Windows) or
  blocked, but it can't be used alongside timeouts, which is a major limitation.
- [helper-pid] uses a separate process or thread of the current process to handle
  one of the tasks. For example, have the main process select(), and have a
  second process write to a pipe when it detects a child exit. The CPU overhead
  makes this a loss outside academic scenarios. It could win if SIGCHLD is
  blocked, but an application blocking SIGCHLD announces its lack of interest in
  fast child exit detection. It could win on Windows with sufficiently-long child
  runtimes and timeouts, making the continuous cost of select() wake-ups exceed
  the fixed cost of creating the helper.
- [Windows-wait] uses a C-language module to issue the Windows API calls
  OpenProcess(), WSAEventSelect(), and WaitForMultipleObjects(). Together, they
  enable one thread to wait for child exit, FD readiness, and timeout. This may
  allow removing the existing Win32Helper.pm process, reducing today's overhead on
  Windows; search the source tree for WaitForMultipleObjects(). No CPAN module
  covers all those functions. This might pay off if Win32::API suffices or if
  bypassing Win32Helper.pm gains considerable efficiency. Otherwise, it might pay
  off on a long time horizon, if we can get its C code in the default installation
  of Windows Perl.
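
To make the chosen combination concrete, here is a minimal sketch of a zero-FD wait on a POSIX system. It is not the code in this PR: `wait_for_child`, its arguments, and the `$cap` default are hypothetical names for illustration, and the real logic lives in _select_loop(). It only shows how a blocking waitpid(), a no-op SIGCHLD handler, and a capped timeout-only select() fit together.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);
use Time::HiRes qw(time);

# Hypothetical helper (not IPC::Run's API): wait for $pid with zero FDs.
# $timeout is in seconds, or undef for "no timeout". $cap is the longest
# single select() nap; 0.5 mirrors today's value and is an assumption here.
# Returns the wait status, or undef if the timeout expired first.
sub wait_for_child {
    my ($pid, $timeout, $cap) = @_;
    $cap = 0.5 unless defined $cap;

    # [waitpid-blocking]: with no timeout, let the kernel wake us on exit.
    if (!defined $timeout) {
        return waitpid($pid, 0) == $pid ? $? : undef;
    }

    # [SIGCHLD=sub{}]: with a handler installed, SIGCHLD interrupts select().
    # With the default disposition, select() would sleep the full nap.
    local $SIG{CHLD} = sub { };

    my $deadline = time() + $timeout;
    while (1) {
        return $? if waitpid($pid, WNOHANG) == $pid;
        my $remaining = $deadline - time();
        return undef if $remaining <= 0;
        # Timeout-only select(). A lower $cap is the [low-select-timeout-w32]
        # idea: faster detection at the price of more wake-ups.
        my $nap = $remaining < $cap ? $remaining : $cap;
        select(undef, undef, undef, $nap);
    }
}

# Usage: a child that exits with code 3 after a short sleep.
my $pid = fork();
die "fork: $!" unless defined $pid;
if ($pid == 0) {
    select(undef, undef, undef, 0.1);
    POSIX::_exit(3);
}
my $status = wait_for_child($pid, 5);
printf "exit code: %d\n", $status >> 8;
```

If SIGCHLD lands between the waitpid() and the select() call, this sketch still sleeps for up to `$cap` seconds, which is the race that [self-pipe] targets.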
Are there other tactics that might outperform these? Are there other key
considerations for evaluating the tactics listed?
I feel the combination of [SIGCHLD=sub{}], [low-select-timeout-w32], and
[waitpid-blocking] has only minor drawbacks, hence that stopping point. The
delay remains when SIGCHLD is blocked or arrives just before select(). On
Windows, the shorter select() timeouts cost CPU only when there's a harness
timeout and zero FDs. The most-plausible alternative was to replace
[SIGCHLD=sub{}] with [self-pipe], accepting a bit of CPU overhead and
developer-facing complexity to remove the "just before select()" race condition.
[Windows-wait] is a more-speculative alternative.
Since tests can't assume much about the passage of real time, I'm not adding
test cases. Existing tests caught bugs in earlier versions of the
waitpid-related commits. You can see the responsiveness of several cases via
the output of temporary tests I had in t/eintr.t here:
https://github.com/nmisch/IPC-Run/actions/runs/8219780209/job/22478157192.
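
For a rough local look at the same effect without IPC::Run, a standalone script along these lines can compare detection time with and without a SIGCHLD handler. It is a sketch assuming a POSIX system; `measure_detection` and its parameters are hypothetical and are not the temporary tests from t/eintr.t. Actual timings vary by machine and load, so the script prints what it measures rather than asserting anything.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);
use Time::HiRes qw(time);

# Hypothetical measurement helper. Forks a child that lives for about
# $child_secs, polls for its exit with a timeout-only select() of $cap
# seconds, and returns the seconds elapsed until the exit was detected.
# $handler becomes the SIGCHLD disposition for the duration of the call.
sub measure_detection {
    my ($child_secs, $cap, $handler) = @_;
    local $SIG{CHLD} = $handler;

    my $start = time();
    my $pid   = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {
        select(undef, undef, undef, $child_secs);   # child: sleep, then exit
        POSIX::_exit(0);
    }

    while (1) {
        my $reaped = waitpid($pid, WNOHANG);
        last if $reaped == $pid || $reaped == -1;   # exited, or nothing to reap
        select(undef, undef, undef, $cap);          # interrupted only if handled
    }
    return time() - $start;
}

printf "default SIGCHLD: %.3fs\n", measure_detection(0.1, 0.5, 'DEFAULT');
printf "SIGCHLD=sub{}:   %.3fs\n", measure_detection(0.1, 0.5, sub { });
```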