ERL-405: rpc:call/5 takes longer than specified timeout #3178

OTP-Maintainer · 2017-04-20T11:33:21Z

Original reporter: jbf
Affected versions: OTP-17.5, OTP-19.3
Fixed in version: OTP-21.0
Component: kernel
Migrated from: https://bugs.erlang.org/browse/ERL-405

When the recipient node of an rpc:call/5 is not connected and cannot be connected, the call can take longer than the specified timeout. We discovered this during a partial network outage, but it is easily reproduced locally in a shell (bash assumed here):

Start one node, wait for Erlang to boot, then suspend it using Ctrl+Z:

{code}
erl -sname bar
...
Ctrl+z
{code}

In a new shell, start a new node and execute an rpc with a short timeout:
{code}
erl -sname foo
Erlang/OTP 19 [erts-8.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V8.3  (abort with ^G)
(foo@MY-MACHINE)1> timer:tc(rpc, call, ['bar@MY-MACHINE', io, format, ["Hej~n",[]], 1000]).
{7024780,{badrpc,nodedown}}
(foo@MY-MACHINE)2>
{code}

I believe the issue is in gen:do_call/4 (https://github.com/erlang/otp/blob/master/lib/stdlib/src/gen.erl#L156), where the erlang:monitor call does not have a timeout and in this case takes 7 seconds to complete on my machine.


OTP-Maintainer · 2017-04-25T08:03:36Z

siri said:

Hello, it is correct that it is the call to erlang:monitor that hangs - or rather, it is the connection attempt towards the suspended node. The time spent here is limited by the kernel configuration parameter {{net_setuptime}}, which by default is 7 seconds. From the documentation:

{quote}net_setuptime = SetupTime

SetupTime must be a positive integer or floating point number, and is interpreted as the
maximum allowed time for each network operation during connection setup to another Erlang
node. The maximum allowed value is 120. If higher values are specified, 120 is used. Default is
7 seconds if the variable is not specified, or if the value is incorrect (for example, not a number).

Notice that this value does not limit the total connection setup time, but rather each individual
network operation during the connection setup and handshake.
{quote}
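
As a rough illustration of the parameter described above, it can be set in a sys.config file or on the command line. The value 2 is an arbitrary example, not a recommendation, and as the quoted documentation notes, it bounds each network operation during setup rather than the total connection setup time:

```erlang
%% sys.config (illustrative; the value 2 is an assumption chosen for the example)
%% Equivalent command-line form: erl -sname foo -kernel net_setuptime 2
[
 {kernel, [
   %% Maximum time, in seconds, for each network operation
   %% during connection setup to another Erlang node.
   {net_setuptime, 2}
 ]}
].
```

Lowering this value would shorten the 7-second delay seen in the report, but it does not make rpc:call/5 honour its own timeout.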

I will discuss the following with the team:
* should rpc:call (and other types of synchronous calls) be forcefully aborted if the connection setup takes longer than the given timeout value? (could there be unintended side effects?)
* should we keep the behaviour as it is - and possibly add some documentation?

Please feel free to comment or suggest other solutions.

OTP-Maintainer · 2017-04-25T10:53:05Z

chandru said:

Hi Siri,

{quote}
should rpc:call (and other types of synchronous calls) be forcefully aborted if the connection setup takes longer than the given timeout value? (could there be unintended side effects?)
{quote}

IMHO, rpc:call/5 should strictly enforce the timeout value specified by the caller. It makes it easier to reason about code without having to worry that any rpc call might block for an indeterminate amount of time. Fixing this in OTP will relieve the burden on every Erlang programmer to come up with strategies to fix this issue.

OTP-Maintainer · 2017-04-25T11:12:13Z

jbf said:

Given that you already exit the caller in case of a timeout (assuming the connection setup succeeds or is a noop), arguably clients already know how to deal with this.

To me it looks less surprising to always fail after timeout.

Thanks for looking into this

OTP-Maintainer · 2017-04-26T09:29:56Z

siri said:

We have discussed this a bit more, and we agree that the solution must be to return after the timeout, no matter what the reason for the delay is. It will be solved in the underlying distribution mechanism and not in rpc itself. At the moment, it is not possible to say exactly which release this can be targeted for, so if this is urgent, please implement your own workaround by forcing the timeout, for example something like this:

{code}
{P,Ref} = spawn_monitor(fun() -> timer:exit_after(T,timeout),R=rpc:call(N,M,F,A,T), exit(R) end),
receive {'DOWN',Ref,process,P,R} -> R end.
{code}
Sorry for the inconvenience!
I will update this issue when the target release is set.
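
The workaround above could be wrapped in a small module along these lines. This is only a sketch: the module name `rpc_timeout`, the `rpc_result` tag, and the mapping of a forced exit to `{badrpc, timeout}` are assumptions made for illustration and are not part of OTP.

```erlang
-module(rpc_timeout).
-export([call/5]).

%% Hypothetical helper: runs rpc:call/5 in a monitored worker process that
%% is forcibly exited after Timeout milliseconds, so the caller is not
%% blocked by a slow connection setup.
call(Node, Mod, Fun, Args, Timeout) ->
    {Pid, Ref} = spawn_monitor(fun() ->
        %% Ask the timer server to exit this worker with reason 'timeout'
        %% if it is still alive after Timeout milliseconds.
        {ok, _TRef} = timer:exit_after(Timeout, timeout),
        %% Tag the result so it cannot be confused with another exit reason.
        exit({rpc_result, rpc:call(Node, Mod, Fun, Args, Timeout)})
    end),
    receive
        {'DOWN', Ref, process, Pid, {rpc_result, Result}} ->
            Result;
        {'DOWN', Ref, process, Pid, timeout} ->
            %% The worker was exited by the timer before rpc:call/5 returned.
            {badrpc, timeout};
        {'DOWN', Ref, process, Pid, Reason} ->
            {badrpc, {'EXIT', Reason}}
    end.
```

With this in place, `rpc_timeout:call('bar@MY-MACHINE', io, format, ["Hej~n", []], 1000)` should return after roughly the requested timeout even when the connection attempt itself hangs. Tagging the exit reason is a small departure from the one-liner above, chosen so that an rpc result equal to the atom `timeout` is not mistaken for a forced exit.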

OTP-Maintainer · 2017-05-17T10:36:12Z

siri said:

Unfortunately, this will not make it into the OTP-20 release.

OTP-Maintainer added the bug (Issue is reported as a bug) and priority:medium labels Feb 10, 2021

OTP-Maintainer added this to the OTP-21.0 milestone Feb 10, 2021

OTP-Maintainer closed this as completed Feb 10, 2021
