Add retry to sockets on EINTR error #6953
Some general comments. I am also not sure whether EINTR should be handled in this busy-retry way.
src/support/socket.h
Outdated
 * \return The return code returned by function f or error_value on retry failure.
 */
template <typename T>
T retry_call(std::function<T()> f, T error_value) {
Inline this function, as it is in a header file.
Use CamelCase.
src/support/socket.h
Outdated
 * \return The return code returned by function f or error_value on retry failure.
 */
template <typename T>
T retry_call(std::function<T()> f, T error_value) {
Prefer not using std::function, because that means construction of std::function in cases where it can be inlined. Instead, use the following signature so f can be inlined.
template <typename T, typename F>
T RetryCallWhenEINTR(F f, T error_value);
src/support/socket.h
Outdated
T retry_call(std::function<T()> f, T error_value) {
  errno = 0;
  T rc = error_value;
  for (size_t retry = 0; retry < 8; ++retry) {
Consider starting with a first call to f and returning on success, then handling the less frequent retry path. The compiler can optimize for that, as opposed to entering a loop most of the time.
src/support/socket.h
Outdated
    if (errno == EINTR) {
      rc = error_value;
    } else {
      break;
directly return error_value here
src/support/socket.h
Outdated
T retry_call(std::function<T()> f, T error_value) {
  errno = 0;
  T rc = error_value;
  for (size_t retry = 0; retry < 8; ++retry) {
8 seems to be a magic number, need a constant
src/support/socket.h
Outdated
 */
template <typename T>
T retry_call(std::function<T()> f, T error_value) {
  errno = 0;
Not sure if it is the best practice to set errno. Instead, check rc first, then check errno.
src/support/socket.h
Outdated
template <typename T>
T retry_call(std::function<T()> f, T error_value) {
  errno = 0;
  T rc = error_value;
error_value is not needed if you get the rc in the first call
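Taken together, the review comments above point at roughly the following shape. This is a minimal sketch, not the code from this PR or from its successor: the names RetryCallOnEINTR and kMaxEINTRRetries, and the convention that a return value of -1 signals failure (as POSIX socket calls do), are assumptions made for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical named constant replacing the hard-coded 8 flagged in review.
constexpr std::size_t kMaxEINTRRetries = 8;

/*!
 * \brief Call f and retry while it fails with EINTR (illustrative sketch).
 * \param f Callable returning an integral status; -1 is assumed to mean failure.
 * \return The last value returned by f.
 */
template <typename Function>
inline auto RetryCallOnEINTR(Function f) -> decltype(f()) {
  // Fast path: most calls are not interrupted, so try once and return.
  auto rc = f();
  if (rc != -1 || errno != EINTR) return rc;
  // Slow path: bounded retries. rc is checked before errno, so errno is
  // never reset by hand and no separate error_value is needed.
  for (std::size_t retry = 0; retry < kMaxEINTRRetries; ++retry) {
    rc = f();
    if (rc != -1 || errno != EINTR) break;
  }
  return rc;
}

// Usage sketch: a stand-in for a socket call that is interrupted twice.
inline int ExampleUsage() {
  int interruptions = 2;
  auto fake_recv = [&interruptions]() -> int {
    if (interruptions > 0) {
      --interruptions;
      errno = EINTR;
      return -1;
    }
    return 42;
  };
  int rc = RetryCallOnEINTR(fake_recv);
  assert(rc == 42);
  return rc;
}
```

Taking the callable as a template parameter instead of std::function lets the compiler inline a lambda at the call site, which is the point of the signature suggested above.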
It might be helpful to discuss a scenario in which this happens, since in many cases we would indeed want to abort, as opposed to retry, when an interrupt happens.
cc @areusch
hey @rkimball one case where this may occur is a Ctrl+C-triggered SIGINT. are you encountering that case? if so, this is the mechanism by which we decide to terminate TVM on posix. i previously tried to see if there is a way to work around this by calling back into Python on EINTR to check whether we should actually terminate or just retry, but that is fraught because libtvm.so doesn't depend on Python and we don't want it to. could you elaborate on when you're seeing this undisturbed?
@areusch I saw this happen periodically when I was running cpp_rpc on either Linux or Windows (I can't remember which now, but I am pretty sure it was linux). If you run cpp_rpc from the command line then it will sometimes print out |
@rkimball from my investigations into handling Ctrl+C I remember the windows model for forwarding Ctrl+C from terminal to program is much more invasive than the linux (but also a little more intuitive): it spins up a new thread inside the process and dispatches to that thread. on linux, any system call is interrupted, returns EINTR (I believe) and then jumps to the signal handler. In CPython, the signal handler merely sets a flag reminding Python to run the SIGINT handler defined with the |
I have tested this on linux mostly but windows as well and I don't see any strange behavior where I need to hit ctrl-c multiple times to break. This change does not break ctrl-c support.
It could be hard to reproduce what @areusch said, because such an interruption needs to happen during a socket call. Nevertheless, it would be useful to confirm the behavior (what happens when ctrl-c is pressed during a socket call) before checking in the change. One idea is perhaps to construct a blocking socket call. Notably, in the case of the cpp rpc server it is fine to have such an error on the server side (except for the error message itself): as the rpc server connection forks a new process, the fault only terminates that specific server session. As @rkimball mentioned, everything continues. The auto-tuner can be made to retry in the next session.
I have confirmed that nothing bad or unexpected happens if ctrl-c is entered when we are waiting on a long-running network call. I made a test repo here https://github.com/rkimball/network_test.
@rkimball thanks for testing this. can you modify your test as follows, which would replicate the way python does Ctrl+C handling:
if so, you need to check the global flag between retries. unfortunately, in the typical TVM use case, that global flag is in python land, and merely says "please jump the python interpreter to the python-land signal handler at the next instruction"
@areusch I tried it and updated my test repo if you want to have a look. My signal handler does nothing until the third time it gets SIGINT when it aborts, just to allow me to kill the program.
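A handler along the lines described here could look roughly like the sketch below. It is not taken from the linked test repo: the names g_sigint_count, CountingSigintHandler, and InstallCountingSigintHandler are hypothetical, and the code assumes a POSIX system where sigaction is available. Leaving SA_RESTART out of sa_flags is what makes an interrupted blocking call fail with EINTR instead of being restarted by the kernel.

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <signal.h>

// Global flag a retry loop could inspect between attempts (hypothetical name).
static volatile std::sig_atomic_t g_sigint_count = 0;

// Ignores the first two SIGINTs and aborts on the third, so the program
// can still be killed from the terminal.
extern "C" void CountingSigintHandler(int) {
  g_sigint_count = g_sigint_count + 1;
  if (g_sigint_count >= 3) std::abort();
}

// Installs the handler without SA_RESTART, so that blocking system calls
// interrupted by SIGINT return -1 with errno set to EINTR.
inline bool InstallCountingSigintHandler() {
  struct sigaction sa = {};
  sa.sa_handler = CountingSigintHandler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;  // deliberately no SA_RESTART
  return sigaction(SIGINT, &sa, nullptr) == 0;
}

// Usage sketch: two SIGINTs are absorbed; a third would abort the process.
inline int DeliverTwoSigints() {
  bool installed = InstallCountingSigintHandler();
  assert(installed);
  (void)installed;
  std::raise(SIGINT);
  std::raise(SIGINT);
  return g_sigint_count;
}
```

With a handler like this in place, a retry loop around recv or send sees EINTR on every Ctrl+C and has to consult the counter itself to decide whether to keep going.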
Please fix the additional lint error. After reading the discussion about Python and @rkimball's latest set of experiments, I think we can consider merging this.
/*!
 * \brief Call a function and retry if an EINTR error is encountered.
 * \param f The function to retry.
Add additional rationales here
 * \param error_value The error value returned by the call on retry failure.
 * \return The return code returned by function f or error_value on retry failure.
 */
template <typename FUNC>
FUNC => Function (CamelCase)
@tqchen okay, i've iterated on this with @rkimball and it's a little more complex. I was previously working on this problem when i was trying to make TVM RPC server run GDB. To do that, I installed a custom SIGINT handler in Python which only raised KeyboardInterrupt if GDB had died. it turns out this all boils down to
finally, the root of the problem: why does Python care about |
@areusch Thank you for the research and help
We could add a flag which is passed from the users of the user rpc_socket_impl to socket.h indicating how it is supposed to handle EINTR. Perhaps we could pass an optional interrupt handler function to the socket which would be called on EINTR and use a returned bool to either retry or abort.
@rkimball a callback function probably makes sense, but we have to plumb it all the way from the libtvm.so consumer, so it has to be a PackedFunc.
After lots of reading: For python (siginterrupt is automatically set) we need to call a callback on EINTR which ultimately calls PyErr_CheckSignals() and returns a value indicating if an exception (ctrl-c, etc.) has occurred. If no exception has occurred then we retry the function. With an optional callback we can easily handle both c++ and python.
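The optional-callback idea can be sketched as follows. This is only an illustration under stated assumptions: EINTRHandler and RetryCallWithEINTRHandler are hypothetical names, a plain std::function stands in for the PackedFunc mentioned above, and -1 is assumed to mean failure. In a Python frontend the callback would be the place to call PyErr_CheckSignals() and return false when an exception is pending.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical callback type: returns true to retry the interrupted call,
// false to give up and surface the EINTR failure to the caller.
using EINTRHandler = std::function<bool()>;

/*!
 * \brief Call f, consulting an optional handler whenever f fails with EINTR.
 * \param f Callable returning an integral status; -1 is assumed to mean failure.
 * \param should_retry Optional handler; when empty, EINTR is always retried.
 * \return The last value returned by f.
 */
template <typename Function>
inline auto RetryCallWithEINTRHandler(Function f, const EINTRHandler& should_retry = nullptr)
    -> decltype(f()) {
  auto rc = f();
  while (rc == -1 && errno == EINTR) {
    if (should_retry && !should_retry()) {
      errno = EINTR;  // the handler may have clobbered errno; restore it
      break;
    }
    rc = f();
  }
  return rc;
}

// Usage sketch: with no handler, an interrupted call is simply retried.
inline int ExampleDefaultRetry() {
  int interruptions = 2;
  auto fake_send = [&interruptions]() -> int {
    if (interruptions > 0) {
      --interruptions;
      errno = EINTR;
      return -1;
    }
    return 7;
  };
  int rc = RetryCallWithEINTRHandler(fake_send);
  assert(rc == 7);
  return rc;
}
```

A pure C++ consumer passes nothing and gets unconditional retries, while a language frontend that owns signal handling can supply a handler to stop the loop.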
I see, in that case @areusch is right that we might need a PackedFunc callback in the handler
I put some time to think about this, thanks to @rkimball @areusch for great discussions. Here is one possible solution:
Let me know if that makes sense
the only thing i would change is "default to nullptr and retry." by default, I think we do want to retry interrupted syscalls. It is only when a non-traditional signal handler, such as a Python-based SIGINT handler, wants us to exit that we should stop retrying. Traditionally, the C SIGINT handler would call abort() in such cases, which means we don't need to exit by default on an interrupted syscall. the rest makes sense to me!
I think if we really want a python RPC server then we should write it using python socket calls which already properly handle signals. The problem we have is that we are calling c++ socket calls via PackedFunc and those c++ calls are not specifically designed to handle python. If we call python socket calls like os.recv then we should have no issues.
@rkimball I agree that switching to python sockets is another way (e.g. creating a customized channel that goes through a python socket). The main reason to push the callback approach is that we may want the same libtvm runtime to function with and without Python. So ideally a handler function is slightly easier.
The problem with python is that it handles signals internally. With PackedFunc we can't properly wrap the tvm socket calls with signal handlers to make them operate the way that native python sockets operate. Another alternative is to wrap the c++ rpc server in python, not using FFI. Within the wrapper we could handle signals and make the python wrapped RPC server operate just like other python network functions. What do other languages like Rust do with signals? Is there an issue with Rust?
that is correct. this problem is not specific to sockets--it is just the only place we see it now. it may appear in other contexts, such as in cases where a GPU driver hangs and you cannot ctrl+c it. it is better to solve this broadly using the EINTR hook rather than by requiring all languages to implement sockets on the frontend. I do agree that is an approach, but it just solves the one case rather than this issue more generally. @rkimball is right in saying that Python handles retrying internally--here is how it does it: when you call e.g. Now two things could happen depending on the value of Note that in the case that from Python's perspective, calling TVM's C functions through cython or ctypes is the same as calling one of its internal syscall wrapper functions. Python is expecting us to also handle |
Thanks for the great discussions. I took a deeper look at the python signal handling mechanism and opened #7919, which reuses the code and discussions in this thread.
superseded by #7919
Long-running system calls like socket recv or send may be interrupted and should be retried. If a call is interrupted errno is set to EINTR. Added retry_call which retries a call in hopes of reducing the occurrence of this error.