ERL-405: rpc:call/5 takes longer than specified timeout #3178

Closed
OTP-Maintainer opened this issue Apr 20, 2017 · 5 comments
Labels
bug Issue is reported as a bug priority:medium
Comments

@OTP-Maintainer

Original reporter: jbf
Affected versions: OTP-17.5, OTP-19.3
Fixed in version: OTP-21.0
Component: kernel
Migrated from: https://bugs.erlang.org/browse/ERL-405


When the recipient node of an rpc:call/5 is not connected and cannot be connected, the call can take longer than the specified timeout. We discovered this during a partial network outage, but it is easily reproduced locally in a shell (bash assumed here):

Start one node, wait for Erlang to boot, then suspend it with Ctrl+Z:

{code}
erl -sname bar
...
Ctrl+z
{code}

In a new shell, start a new node and execute an rpc with a short timeout:
{code}
erl -sname foo
Erlang/OTP 19 [erts-8.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V8.3  (abort with ^G)
(foo@MY-MACHINE)1> timer:tc(rpc, call, ['bar@MY-MACHINE', io, format, ["Hej~n",[]], 1000]).
{7024780,{badrpc,nodedown}}
(foo@MY-MACHINE)2>
{code}

I believe the issue is in gen:do_call/4 (https://github.com/erlang/otp/blob/master/lib/stdlib/src/gen.erl#L156), where the erlang:monitor call has no timeout and, in this case, takes 7 seconds to complete on my machine.
@OTP-Maintainer

siri said:

Hello, it is correct that it is the call to erlang:monitor that hangs - or rather, it is the connection attempt towards the suspended node. The time spent here is limited by the kernel configuration parameter {{net_setuptime}}, which by default is 7 seconds. From the documentation:

{quote}net_setuptime = SetupTime

SetupTime must be a positive integer or floating point number, and is interpreted as the
maximum allowed time for each network operation during connection setup to another Erlang
node. The maximum allowed value is 120. If higher values are specified, 120 is used. Default is
7 seconds if the variable is not specified, or if the value is incorrect (for example, not a number).

Notice that this value does not limit the total connection setup time, but rather each individual
network operation during the connection setup and handshake.
{quote}
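
As an illustration of the parameter quoted above, a kernel configuration fragment along these lines would lower the per-operation limit from the default 7 seconds to 2 seconds. The file name `sys.config` and the value `2` are assumptions chosen for the example. As the quoted documentation notes, the value bounds each individual network operation during setup, not the total connection time, so this only shortens the delay and does not enforce the rpc timeout.

```erlang
%% sys.config (hypothetical file name), loaded with: erl -sname foo -config sys
%% net_setuptime is given in seconds; 2 is an arbitrary example value.
[{kernel, [{net_setuptime, 2}]}].
```

The same parameter can also be given on the command line, for example `erl -sname foo -kernel net_setuptime 2`.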

I will discuss the following with the team:
* should rpc:call (and other types of synchronous calls) be forcefully aborted if the connection setup takes longer than the given timeout value? (could there be unintended side effects?)
* should we keep the behaviour as it is - and possibly add some documentation?

Please feel free to comment or suggest other solutions.

@OTP-Maintainer

chandru said:

Hi Siri,

{quote}
should rpc:call (and other types of synchronous calls) be forcefully aborted if the connection setup takes longer than the given timeout value? (could there be unintended side effects?)
{quote}

IMHO, rpc:call/5 should strictly enforce the timeout value specified by the caller. It makes it easier to reason about code without having to worry that any rpc call might block for an indeterminate amount of time. Fixing this in OTP will relieve the burden on every Erlang programmer to come up with strategies to fix this issue.

@OTP-Maintainer

jbf said:

Given that you already exit the caller in case of a timeout (assuming the connection setup succeeds or is a noop), arguably clients already know how to deal with this.

To me it looks less surprising to always fail after the timeout.

Thanks for looking into this.

@OTP-Maintainer

siri said:

We have discussed this a bit more, and we agree that the solution must be to return after the timeout, no matter what the reason for the delay is. It will be solved in the underlying distribution mechanism and not in rpc itself. At the moment, it is not possible to say exactly which release this can be targeted for, so if it is urgent, please apply your own workaround by forcing the timeout, for example something like this:

{code}
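%% The helper exits with reason 'timeout' after T ms, otherwise with the rpc result R.
{P,Ref} = spawn_monitor(fun() -> timer:exit_after(T,timeout), R=rpc:call(N,M,F,A,T), exit(R) end),
%% The 'DOWN' reason is therefore either the rpc result or the atom 'timeout'.
receive {'DOWN',Ref,process,P,R} -> R end.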
{code}
Sorry for the inconvenience!
I will update this issue when the target release is set.
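
For reuse, the workaround above could be wrapped in a small module. The following is a minimal sketch, not part of OTP: the module name `rpc_deadline` and its `call/5` function are hypothetical, and it assumes an integer timeout in milliseconds. Instead of `timer:exit_after/2` it uses a `receive ... after` clause in the caller, so the deadline holds even while the helper process is stuck in connection setup, and the result is tagged so it cannot be confused with an ordinary exit reason.

```erlang
-module(rpc_deadline).
-export([call/5]).

%% Hypothetical wrapper (not part of OTP): runs rpc:call/5 in a helper
%% process and returns {badrpc, timeout} once Timeout milliseconds have
%% passed, even if the helper is still blocked in connection setup.
%% Unlike rpc:call/5, this sketch accepts only integer timeouts.
call(Node, Mod, Fun, Args, Timeout) when is_integer(Timeout), Timeout >= 0 ->
    {Pid, Ref} = spawn_monitor(
                   fun() ->
                           %% Tag the result so it is distinguishable
                           %% from any other exit reason.
                           exit({rpc_result,
                                 rpc:call(Node, Mod, Fun, Args, Timeout)})
                   end),
    receive
        {'DOWN', Ref, process, Pid, {rpc_result, Result}} ->
            Result;
        {'DOWN', Ref, process, Pid, Reason} ->
            %% The helper died for some other reason.
            {badrpc, {'EXIT', Reason}}
    after Timeout ->
            %% Deadline reached: stop the helper and drop any late 'DOWN'.
            exit(Pid, kill),
            erlang:demonitor(Ref, [flush]),
            {badrpc, timeout}
    end.
```

With the suspended node from the report, `rpc_deadline:call('bar@MY-MACHINE', io, format, ["Hej~n", []], 1000)` would be expected to give up after roughly one second rather than waiting for the connection attempt to finish. Killing the helper does not cancel a call that has already started on the remote node.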

@OTP-Maintainer

siri said:

Unfortunately, this will not make it into the OTP-20 release.

@OTP-Maintainer OTP-Maintainer added bug Issue is reported as a bug priority:medium labels Feb 10, 2021
@OTP-Maintainer OTP-Maintainer added this to the OTP-21.0 milestone Feb 10, 2021