
ERL-1343: Dist becomes much slower when sending complex terms #4348

Closed · OTP-Maintainer opened this issue Sep 9, 2020 · 8 comments
Labels: bug (Issue is reported as a bug), priority:medium, team:VM (Assigned to OTP team VM)
Milestone: OTP-23.1.2
Original reporter: JIRAUSER16602
Affected version: OTP-23.0.3
Fixed in version: OTP-23.1.2
Component: kernel
Migrated from: https://bugs.erlang.org/browse/ERL-1343


There is a significant throughput difference between sending the following two messages through dist:

1. `[crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)]`
2. `term_to_binary([crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)])`

(2) is much faster.

**Reproduce:**

`erlc test.erl`

On terminal 1:

`erl -setcookie test -name 1@127.0.0.1 -s test server -noinput`

On terminal 2:

`erl -setcookie test -name 2@127.0.0.1 -s test client binary -noinput`

Result: 861.9077 MB/s

Again on terminal 2:

`erl -setcookie test -name 2@127.0.0.1 -s test client list -noinput`

Result: 131.8675 MB/s
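
The referenced `test.erl` is not reproduced in this thread. A minimal sketch of what such a module might look like (the message count, the `sink` registered name, and the throughput accounting are assumptions, not the actual attachment):

```erlang
%% Hypothetical reconstruction of test.erl; not the actual attachment.
-module(test).
-export([server/0, client/1]).

-define(MSGS, 10000).  %% assumed number of messages per run

server() ->
    register(sink, self()),
    loop().

loop() ->
    receive
        {done, From} -> From ! done, loop();
        _Payload     -> loop()
    end.

%% Started as: erl ... -s test client binary|list
client([Mode]) ->
    Peer = '1@127.0.0.1',
    pong = net_adm:ping(Peer),
    Term = [crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)],
    Msg = case Mode of
              binary -> term_to_binary(Term);
              list   -> Term
          end,
    {Micros, ok} = timer:tc(fun() -> send_all(?MSGS, {sink, Peer}, Msg) end),
    MBytes = ?MSGS * erlang:external_size(Msg) / (1024 * 1024),
    io:format("~.4f MB/s~n", [MBytes / (Micros / 1.0e6)]),
    erlang:halt().

send_all(0, Sink, _Msg) ->
    Sink ! {done, self()},   %% round-trip so the run includes delivery
    receive done -> ok end;
send_all(N, Sink, Msg) ->
    Sink ! Msg,
    send_all(N - 1, Sink, Msg).
```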

**Remark**

It is much better in OTP 22.3.2:

`erl -setcookie test -name 2@127.0.0.1 -s test client binary -noinput`
Result: 823.7916 MB/s

`erl -setcookie test -name 2@127.0.0.1 -s test client list -noinput`
Result: 418.5475 MB/s

 

peerst@gmail.com said:

Q: are you running this on one physical device or two?

(Addendum: seeing your 127.0.0.1 addresses, I could have inferred this.)

If it's on one device, I could imagine operating-system scheduling influencing things.


JIRAUSER16602 said:

On one device, but the same thing happens if we communicate between two hosts.

I've done a more complex message-sending test, with 16 processes and 32 in-flight messages per process (sliding window):

message (1) achieves 43 MB/s, while message (2) achieves 480 MB/s.
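
A hedged sketch of that kind of sliding-window load generator (the module name, ack protocol, and message counts are assumptions; the receiver is assumed to reply `ack` to the sender of every `{msg, From, Payload}` it gets):

```erlang
%% Hypothetical sliding-window sender: each process keeps Window
%% messages in flight, sending one new message per ack received.
-module(dist_window).
-export([run/4]).

%% e.g. run({sink, '1@127.0.0.1'}, 16, 32, 1000)
run(Sink, NumProcs, Window, Total) ->
    Payload = [crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)],
    Parent = self(),
    Pids = [spawn_link(fun() -> sender(Sink, Payload, Window, Total, Parent) end)
            || _ <- lists:seq(1, NumProcs)],
    [receive {done, P} -> ok end || P <- Pids],
    ok.

sender(Sink, Payload, Window, Total, Parent) ->
    Prime = min(Window, Total),
    [Sink ! {msg, self(), Payload} || _ <- lists:seq(1, Prime)],  %% fill window
    ack_loop(Sink, Payload, Total - Prime, Prime, Parent).

ack_loop(_Sink, _Payload, 0, 0, Parent) ->
    Parent ! {done, self()};
ack_loop(Sink, Payload, Left, Outstanding, Parent) when Left > 0 ->
    receive ack -> Sink ! {msg, self(), Payload} end,   %% slide the window
    ack_loop(Sink, Payload, Left - 1, Outstanding, Parent);
ack_loop(Sink, Payload, 0, Outstanding, Parent) ->
    receive ack -> ok end,                              %% drain remaining acks
    ack_loop(Sink, Payload, 0, Outstanding - 1, Parent).
```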


dane said:

I narrowed the search down to this patch: https://github.com/erlang/otp/commit/f3a6f63b4ff87ab21dc3af015f3b970f5c096678#diff-d256ca218de9be9017823817bd9953a7f3000a190a44dd34956e446b8d68b478

and captured a perf flame graph (attachment: perf-kernel.svg).

 

It appears that `writev` is very expensive with this patch.

An actual perf sample shows that `tcp_inet_output` is responsible for the regression:

```
- 33.50% 33.50%  3_scheduler  [kernel.kallsyms]  [k] 0xffffffff9a264f76
     __GI___clone (inlined)
     start_thread
     thr_wrapper
     sched_thread_func
     process_main
     erts_schedule
   - erts_port_task_execute
      - 20.86% tcp_inet_output (inlined)
```


rickard said:

Thanks for the bug report, zzydxm, and thanks for the bisecting and profiling, dane!

The main cause of this issue is the fixed maximum io-vector length of 16 in the inet driver (`MAX_VSIZE` in `inet_drv.c`). This caused an unnecessarily large number of calls to `writev()` (64 per message in your benchmark). When changing it to 1024, which equals `IOV_MAX` on my Linux machine, I get the throughput back to what it was before OTP 23.
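
Back-of-the-envelope, in an Erlang shell, for where those 64 calls come from (the 1024 binaries and the cap of 16 are the figures stated above):

```erlang
%% 1024 off-heap binaries, each referenced as its own io-vector entry,
%% at most 16 entries per writev() call under the old MAX_VSIZE:
1> (1024 + 16 - 1) div 16.
64
%% With the cap raised to IOV_MAX = 1024, the same message needs one call:
2> (1024 + 1024 - 1) div 1024.
1
```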

We should, however, probably not refer to binaries that small as separate elements in the io-vector. I'll look into this some more during the week and get back to you.


rickard said:

We will release a patch later this week or next week that uses `IOV_MAX` (1024 on most systems) instead of 16 as the maximum io-vector length. For now we will keep referring to all off-heap binaries as separate elements, since it is hard to determine what minimum size would be best, and you need terms with a very large number of binaries before you run into trouble on systems with an `IOV_MAX` of 1024.

Regarding the throughput drop when comparing sending the term as is and doing `term_to_binary()` on the term before sending it: I can reproduce this behaviour over the loopback on my machine for OTP 20-23 (I haven't tested earlier releases). Encoding the data and sending it to the distribution port is faster when sending the data as is than when doing `term_to_binary()` first. This is expected, since you encode the data twice when doing `term_to_binary()` on the term before sending it. The operation as a whole, however, takes longer, since the `writev()` operation blocks and the emulator has to wait much longer before it can complete the sending of all the data when not doing `term_to_binary()`. I see this behaviour on all releases from OTP 20 to OTP 23, with different I/O poll implementations and various changes in the distribution implementation. Changing the data that you send also changes the behaviour. I see this as timing/TCP-IP-stack related and won't dig further into it.
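
To make the "encode twice" point concrete, a small shell sketch (no outputs shown; `erlang:external_size/1` reports the external-format size the distribution would produce):

```erlang
%% Sending Term as is: the distribution encodes it once, on the fly.
%% Sending term_to_binary(Term): the term is encoded by the explicit call,
%% and the resulting single flat binary is then encoded again (cheaply,
%% as one binary) by the distribution.
1> Term = [crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)].
2> erlang:external_size(Term).                  %% wire size, sent directly
3> erlang:external_size(term_to_binary(Term)).  %% wire size, pre-encoded
```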


rickard said:

Fix just released in patch [OTP 23.1.2](http://erlang.org/download/OTP-23.1.2.README).


dane said:

Thank you for the fix!

I'm trying to understand this:

> This is also what is expected since you encode the data twice when doing `term_to_binary()` on the term before sending it. The operation as a whole however takes longer time since the `writev` operation would block and the emulator has to wait for a much longer time before it can complete the sending of all the data when not doing `term_to_binary()`.

What I was aiming for is end-to-end performance for databases (e.g. mnesia) that send terms over distribution. What we saw is that if we use `term_to_binary()` on the sender and `binary_to_term()` on the receiver side, throughput is 3x-5x higher compared to sending the raw terms. This means that initial database replication takes considerably longer when the terms are not pre-encoded.
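
For clarity, the two variants being compared look roughly like this (the `Sink` destination and the function names are just for illustration):

```erlang
%% (a) raw terms: the distribution encodes Term when the signal is sent
send_raw(Sink, Term) ->
    Sink ! Term.

%% (b) pre-encoded: encode on the sender, decode on the receiver
send_wrapped(Sink, Term) ->
    Sink ! term_to_binary(Term).

recv_wrapped() ->
    receive
        Bin when is_binary(Bin) -> binary_to_term(Bin)
    end.
```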

I wonder if there is a benchmark for distribution, so it would be easy to use Linux perf to understand where a regression comes from.


rickard said:

> I'm trying to understand this:
>
> > This is also what is expected since you encode the data twice when doing `term_to_binary()` on the term before sending it. The operation as a whole however takes longer time since the `writev` operation would block and the emulator has to wait for a much longer time before it can complete the sending of all the data when not doing `term_to_binary()`.

When running the attached test code, encoding and moving the encoded data into the driver is faster when sending the term directly than when first encoding the term using `term_to_binary()`. This is expected, since when you do an explicit `term_to_binary()` first, you encode the term twice; the distribution always encodes terms using the same functionality as `term_to_binary()` when passing them over the connection. The second encoding is much cheaper, since it is only one binary to encode, but it is still extra work. Throughput was nevertheless lower when sending the term directly (compared to doing an explicit `term_to_binary()`), since the inet driver was not able to write the data as fast in that case. The driver became busy more frequently, and for longer times, when sending the term directly. The behaviour is more or less the same on OTP 20 up to OTP 23 (with the latest patch). Assuming that there isn't a bug in our I/O poll implementation, the extra time it takes to deliver the data is spent outside of the Erlang VM.

It is of course not impossible that there is a bug in our I/O poll implementation causing delays in the wakeup of the distribution port executing in the inet driver. Between OTP 20 and OTP 23 the I/O poll implementation was largely rewritten, so such a bug would need to exist in both the old and the new implementation. If such a bug exists, it has also never caused lost wakeups in my testing, only delayed wakeups. For these reasons I do not suspect a bug in our I/O poll implementation.

When changing the amount of data written in each message, the behaviour also changed: sometimes the throughput loss was there and sometimes not, depending on the amount of data.

> What I was aiming for is end-to-end performance for databases (e.g. mnesia) that send terms over distribution. What we saw is that if we use `term_to_binary()` on the sender and `binary_to_term()` on the receiver side, throughput is 3x-5x higher compared to sending the raw terms. This means that initial database replication takes considerably longer when the terms are not pre-encoded.

Is this with or without the fix in OTP 23.1.2? If not, what is the result with OTP 23.1.2?

Is this a real usage scenario or a micro benchmark?

> I wonder if there is a benchmark for distribution, so it would be easy to use Linux perf to understand where a regression comes from.

Unfortunately we do not have any good benchmarks for the distribution. People have mainly made their own hacks when testing out changes and then tossed them away. When introducing the IOV change, my tests did not include huge amounts of binaries, which is how this could slip through. I would be surprised if anyone has compared sending with and without a preceding `term_to_binary()` on the terms; I at least never have. We've put writing better distribution benchmarks on our todo list.
