Add retry to sockets on EINTR error #6953

rkimball · 2020-11-23T17:55:11Z

Long-running system calls like socket recv or send may be interrupted and should be retried. If a call is interrupted errno is set to EINTR. Added retry_call which retries a call in hopes of reducing the occurrence of this error.

tqchen

Some general comments. I am also not sure if EINTR should be handle in this busy retry way

tqchen · 2020-11-23T21:05:35Z

src/support/socket.h

+ * \return The return code returned by function f or error_value on retry failure.
+ */
+template <typename T>
+T retry_call(std::function<T()> f, T error_value) {


inline this function, as it is in header file.

Use CamelCase

tqchen · 2020-11-23T21:07:07Z

src/support/socket.h

+ * \return The return code returned by function f or error_value on retry failure.
+ */
+template <typename T>
+T retry_call(std::function<T()> f, T error_value) {


Prefer not using std::function, because that means construction of std::function in cases where it can be inlined. Instead, use the following signature so f can be inlined.

template <typename F> T RetryCallWhenEINTR(F f, T error_value);

tqchen · 2020-11-23T21:09:18Z

src/support/socket.h

+T retry_call(std::function<T()> f, T error_value) {
+  errno = 0;
+  T rc = error_value;
+  for (size_t retry = 0; retry < 8; ++retry) {


Consider start with a first call to f and return, then the less freqeuent path of retry. The compiler can pick that up as opposed to enter a loop in most of the time.

tqchen · 2020-11-23T21:09:30Z

src/support/socket.h

+    if (errno == EINTR) {
+      rc = error_value;
+    } else {
+      break;


directly return error_value here

tqchen · 2020-11-23T21:09:43Z

src/support/socket.h

+T retry_call(std::function<T()> f, T error_value) {
+  errno = 0;
+  T rc = error_value;
+  for (size_t retry = 0; retry < 8; ++retry) {


8 seems to be a magic number, need a constant

tqchen · 2020-11-23T21:16:32Z

src/support/socket.h

+ */
+template <typename T>
+T retry_call(std::function<T()> f, T error_value) {
+  errno = 0;


not sure if it is the best pratice to set errno, instead, check rc first, then check errno

tqchen · 2020-11-23T21:16:51Z

src/support/socket.h

+template <typename T>
+T retry_call(std::function<T()> f, T error_value) {
+  errno = 0;
+  T rc = error_value;


error_value is not needed if you get the rc in the first call

tqchen · 2020-11-23T21:22:15Z

It might be helpful to discuss a scenario when this happens. Since in many cases we would indeed want to abort, as opposed to retry when an interrupt happens

tqchen · 2020-11-23T21:26:12Z

cc @areusch

areusch · 2020-11-23T21:33:11Z

hey @rkimball one case where this may occur is a Ctrl+C-triggered SIGINT. are you encountering that case? if so, this is the mechanism by which we decide to terminate TVM on posix. i previously tried to see if there is a way to work around this by calling back into Python on EINTR to check whether we should actually terminate or just retry, but that is fraught because libtvm.so doesn't depend on Python and we don't want it to. could you elaborate on when you're seeing this undisturbed?

rkimball · 2020-11-24T23:32:59Z

@areusch I saw this happen periodically when I was running cpp_rpc on either Linux or Windows (I can't remember which now, but I am pretty sure it was linux). If you run cpp_rpc from the command line then it will sometimes print out Socket SockChannel::Recv Error:Interrupted system call in the middle of tuning with no interaction. Everything seems to continue.

areusch · 2020-11-24T23:39:50Z

@rkimball from my investigations into handling Ctrl+C I remember the windows model for forwarding Ctrl+C from terminal to program is much more invasive than the linux (but also a little more intuitive): it spins up a new thread inside the process and dispatches to that thread. on linux, any system call is interrupted, returns EINTR (I believe) and then jumps to the signal handler.

In CPython, the signal handler merely sets a flag reminding Python to run the SIGINT handler defined with the signal module next time the interpreter is looking for a new instruction. Unfortunately, "next time" while a C extension is running is after the C extension returns. So I think committing this would mean you have to press Ctrl+C 8 times to bail out of TVM on not-windows. I don't think we can do that--anyhow, at least someone should test this before submitting.

rkimball · 2020-11-30T18:43:59Z

I have tested this on linux mostly but windows as well and I don't see any strange behavior where I need to hit ctrl-c multiple times to break. This change does not break ctrl-c support.
EINTR support is already in the repo here https://github.com/apache/tvm/blob/main/apps/cpp_rpc/rpc_server.cc#L57-L63
There is also support for EINTR in dmlc-core except there it will infinitely retry.
@areusch If there are specific tests you would like me to try then let me know. I have tested stopping the cpp rpc server with ctrl-c on windows and linux and everything works fine.

tqchen · 2020-12-01T14:03:33Z

It could be hard to reproduce what @areusch said.

Because such interruption need to happen during a socket call, nevertheless, it would be useful to confirm the behavior(what will happen when ctril c get pressed during a socket call) before check in the change. One idea is perhaps to construct a blocking socket call.

Notably in the case of cpp rpc server it is fine to have such error on the server side(except for the error message itself), as the rpc server connection forks a new process, the fault only terminate that specific server session. As @rkimball mentioned, everything continues. The auto-tuner can be made to retry next session.

rkimball · 2020-12-01T17:12:29Z

I have confirmed that nothing bad or unexpected happens if ctrl-c is entered when we are waiting on a long-running network call. I made a test repo here https://github.com/rkimball/network_test.
The test waits on accept and when a connection is accepted it spawns an echo client, so there are two blocking calls, one in the main thread waiting on accept and one in the echo client waiting on recv. While this is running I type ctrl-c and the program exits and does not print out EINTR so neither call receives an EINTR from ctrl-c.
This test was performed on Ubuntu 20.04
Here is the output of the test run

(venv3) rkimball@Callisto:~/dev/network_test/build$ ./test
main thread accepting connection now
waiting on call accept
call done
main thread got a connection
main thread accepting connection now
waiting on call accept
waiting on call recv
call done
got abc

waiting on call recv
^C
(venv3) rkimball@Callisto:~/dev/network_test/build$

areusch · 2020-12-01T17:37:45Z

@rkimball thanks for testing this. can you modify your test as follows, which would replicate the way python does Ctrl+C handling:

override the SIGINT handle using signal.
return from the signal handler without exiting the program (you can set some global flag if you want to, which is what the python signal handler does)
now see if you get EINTR from the network call.

if so, you need to check the global flag between retries. unfortunately, in the typical TVM use case, that global flag is in python land, and merely says "please jump the python interpreter to the python-land signal handler at the next instruction"

rkimball · 2020-12-01T17:50:33Z

@areusch I tried it and updated my test repo if you want to have a look. My signal handler does nothing until the third time it gets SIGINT when it aborts, just to allow me to kill the program.
EINTR is never triggered. The blocked network calls remain blocked until we exit.

areusch · 2020-12-01T18:37:59Z

@rkimball weird. it's not that I don't believe you, just I am fairly sure that's what i saw before and it is listed in the man page for accept(). i'll see if I can try it out later this afternoon.

tqchen

Please fix the additional lint error. After reading the discussion in python and @rkimball 's latest set of experiment I think we can consider merge this.

tqchen · 2020-12-04T15:57:44Z

src/support/socket.h

+
+/*!
+ * \brief Call a function and retry if an EINTR error is encountered.
+ * \param f The function to retry.


Add additional rationales here

tqchen · 2020-12-04T15:58:12Z

src/support/socket.h

+ * \param error_value The error value returned by the call on retry failure.
+ * \return The return code returned by function f or error_value on retry failure.
+ */
+template <typename FUNC>


FUNC => Function (CamelCase)

areusch · 2020-12-04T17:35:32Z

@tqchen okay, i've iterated on this with @rkimball and it's a little more complex. I was previously working on this problem when i was trying to make TVM RPC server run GDB. To do that, I installed a custom SIGINT handler in Python which only raised KeyboardInterrupt if GDB had died.

it turns out this all boils down to siginterrupt(), which Python implicitly calls when you set a custom signal handler and also exposes to Python programs as signal.siginterrupt. I was able to reproduce the behavior I saw in @rkimball's demo by adding this code to main():

if (siginterrupt(SIGINT, 1) == -1) {
  perror("siginterrupt");
}

finally, the root of the problem: why does Python care about EINTR and siginterrupt? See PEP 475, which explains that basically handling is specific to even the particular Python code that's running on top of TVM. Python provides a callback function, PyErr_CheckSignals, and C extensions are supposed to invoke that function if EINTR is seen to decide whether to return. This is how Ctrl+C is handled in Python when a C extension is blocked in a syscall. To handle this properly within libtvm.so then, we need to provide some generic e.g. PackedFunc callback that frontends can set to define the EINTR behavior. This didn't seem worth it when I was just trying to run GDB on an RPC server, but it does now. Thoughts?

rkimball · 2020-12-04T18:54:33Z

@areusch Thank you for the research and help
There are two uses of socket.h within tvm:

In tvm_runtime as the python RPC server via runtime/rpc/rpc_socket_impl.cc
In cpp_rpc as the cpp RPC server via apps/cpp_rpc/rpc_server.cc

We could add a flag which is passed from the users of the user rpc_socket_impl to socket.h indicating how it is supposed to handle EINTR. Perhaps we could pass an optional interrupt handler function to the socket which would be called on EINTR and use a returned bool to either retry or abort.
I have not seen EINTR while using the cpp RPC server but have seen it with python.

areusch · 2020-12-04T19:26:43Z

@rkimball a callback function probably makes sense, but we have to plumb it all the way from the libtvm.so consumer, so it has to be a PackedFunc.

rkimball · 2020-12-04T19:27:09Z

After lots of reading:
For c++ we don't need to do anything since EINTR won't happen and retry is automatic (as long as siginterrupt is not set).

For python (siginterrupt is automatically set) we need to call a callback on EINTR which ultimately calls PyErr_CheckSignals() and returns a value indicating if an exception (ctrl-c, etc.) has occurred. If no exception has occurred then we retry the function.

With an optional callback we can easly handle both c++ and python.

tqchen · 2020-12-07T18:52:30Z

I see, in that case @areusch is right that we might need a PackedFunc callback in the handler

tqchen · 2020-12-17T19:22:38Z

I put some time to think about this, thanks to @rkimball @areusch for great discussions. Here is one possible solution:

Add a TypedPackedFunc field eintr_handle to the Socket which can be called to check whether we should retry when facing eintr. default to nullptr and no retry.
Setup a global PackedFunc registry runtime.socket.SignalHandler that returns a PackedFunc to set into the socket, call it during socket construction in RPC server.
Register when we are in python env during initialization.

Let me know if that makes sense

areusch · 2020-12-17T19:27:45Z

the only thing i would change is "default to nullptr and retry." by default, I think we do want to retry interrupted syscalls. It is only when a non-traditional signal handler such as Python-based SIGINT handler wants us to exit. Traditionally, the C SIGINT handler would call abort() in such cases, which means we don't need to by exit by default on an interrupted syscall.

the rest makes sense to me!

rkimball · 2020-12-17T19:29:09Z

I think if we really want a python RPC server then we should write it using python socket calls which already properly handle signals. The problem we have is that we are calling c++ socket calls via PackedFunc and those c++ calls are not specifically designed to handle python. If we call python socket calls like os.recv then we should have no issues.

tqchen · 2020-12-17T19:31:36Z

@rkimball I agree that switching to python socket is another way (e.g. creating a customize channel that goes through python socket), the main reason to push the callback approach is that we may want the same libtvm runtime to function with and without the python. So ideally a handler function is slightly easier.

rkimball · 2020-12-17T19:48:48Z

The problem with python is that it handles signals internally. With PackedFunc we can't properly wrap the tvm socket calls with signal handlers to make them operate the way that native python sockets operate.

Another alternative is to wrap the c++ rpc server in python, not using FFI. Within the wrapper we could handle signals and make the python wrapped RPC server operate just like other python network functions.

What do other languages like Rust do with signals? Is there an issue with Rust?

tqchen · 2020-12-17T19:52:06Z

I believe @areusch 's point is we can register a special function e.g. check python error PyErr_CheckSignals(via ctypes or cython), @areusch perhaps you can elaborate? So it behaves along the same lines as python socket

areusch · 2020-12-17T20:21:03Z

that is correct. this problem is not specific to sockets--it is just the only place we see it now. it may appear in other contexts, such as in cases where a GPU driver hangs and you cannot ctrl+c it. it is better to solve this broadly using the EINTR hook rather than by requiring all languages to implement sockets on the frontend. I do agree that is an approach, but it just solves the one case rather than this issue more generally.

@rkimball is right in saying that Python handles retrying internally--here is how it does it: when you call e.g. os.read, an internal CPython function (written in C) enters a loop. Each time, it calls read, and if it gets back EINTR, it calls PyErr_CheckSignals in a normal (I.e. not signal handler) context. It may get EINTR for many reasons, but one could be that a signal handler was invoked. When this happens and CPython is listening for said signal handler, it will have registered a C signal handler by calling signal. The OS now aborts read and calls this C signal handler. Notably, inside the C signal handler, the Python interpreter can't be resumed to handle the signal in Python, because the signal handler could be invoked at any time and that would be a multithreading catastrophe. So instead CPython sets an internal "signal handler pending" flag.

Now two things could happen depending on the value of siginterrupt. If siginterrupt is set, then the syscall will return EINTR. If not, the OS will resume the syscall internally as if nothing happened. In the EINTR case, C functions called from Python should call PyErr_CheckSignals. Though it is called from a "c function call instruction," PyErr_CheckSignals checks if there are any pending Python-side signal handlers and jumps the Python interpreter to those Python-side signal handlers. If one of those raises an exception, the signal handler terminates and PyErr_CheckSignals returns 1. It is a bit of a hack to call this function from cython/ctypes, but it worked in my prototype.

Note that in the case that siginterrupt is 0, a C extension probably should call PyErr_CheckSignals if it is not merely wrapping an OS syscall, because if SIGINT happened and was swallowed by siginterrupt, the C extension does need to handle the side effect of the syscall (i.e. the data returned from os.read should go somewhere), but it should not issue another syscall which may block and make Ctrl+C appear to do nothing. I believe this is why siginterrupt is automatically set when you set a custom Python signal handler.

from Python's perspective, calling TVM's C functions through cython or ctypes is the same as calling one of its internal syscall wrapper functions. Python is expecting us to also handle EINTR ourselves. so if we add this hook, we are holding up our end of the contract, and solving this not just for socket ops, but providing an easy way to handle this for e.g. GPU syscalls. Finally, new language frontends need only implement this hook, rather than solve this in each separate e.g. socket, GPU, etc case.

tqchen · 2021-04-25T21:14:07Z

Thanks for great discussions, I take a deeper look at the python signal handling mechanism and opened #7919 which reuses the code and discussions in this thread

areusch · 2021-05-20T19:19:39Z

superseded by #7919

Add retry to sockets on EINTR error

7e29678

tqchen requested changes Nov 23, 2020

View reviewed changes

Update per review comments

faa4236

cleanup retry loop

317db2c

rkimball requested a review from tqchen November 30, 2020 18:48

tqchen reviewed Dec 4, 2020

View reviewed changes

rkimball marked this pull request as draft December 4, 2020 17:28

areusch closed this May 20, 2021

Add retry to sockets on EINTR error #6953

Add retry to sockets on EINTR error #6953

Conversation

rkimball commented Nov 23, 2020

tqchen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tqchen commented Nov 23, 2020

tqchen commented Nov 23, 2020

areusch commented Nov 23, 2020 • edited

rkimball commented Nov 24, 2020

areusch commented Nov 24, 2020

rkimball commented Nov 30, 2020 • edited

tqchen commented Dec 1, 2020 • edited

rkimball commented Dec 1, 2020

areusch commented Dec 1, 2020

rkimball commented Dec 1, 2020

areusch commented Dec 1, 2020

tqchen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

areusch commented Dec 4, 2020 • edited

rkimball commented Dec 4, 2020 • edited

areusch commented Dec 4, 2020

rkimball commented Dec 4, 2020

tqchen commented Dec 7, 2020

tqchen commented Dec 17, 2020

areusch commented Dec 17, 2020

rkimball commented Dec 17, 2020

tqchen commented Dec 17, 2020 • edited

rkimball commented Dec 17, 2020

tqchen commented Dec 17, 2020

areusch commented Dec 17, 2020 • edited

tqchen commented Apr 25, 2021

areusch commented May 20, 2021

areusch commented Nov 23, 2020 •

edited

rkimball commented Nov 30, 2020 •

edited

tqchen commented Dec 1, 2020 •

edited

areusch commented Dec 4, 2020 •

edited

rkimball commented Dec 4, 2020 •

edited

tqchen commented Dec 17, 2020 •

edited

areusch commented Dec 17, 2020 •

edited