
ERL-1343: Dist becomes much slower when sending complex terms #4348

Closed · OTP-Maintainer opened this issue Sep 9, 2020 · 8 comments
Labels: bug (Issue is reported as a bug), priority:medium, team:VM (Assigned to OTP team VM)
Milestone: OTP-23.1.2
Original reporter: JIRAUSER16602
Affected version: OTP-23.0.3
Fixed in version: OTP-23.1.2
Component: kernel
Migrated from: https://bugs.erlang.org/browse/ERL-1343


There is a significant throughput difference between sending the following two messages through dist:

1. `[crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)]`
2. `term_to_binary([crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)])`

(2) is much faster.

**Reproduce:**

`erlc test.erl`

On terminal 1:

`erl -setcookie test -name 1@127.0.0.1 -s test server -noinput`

On terminal 2:

`erl -setcookie test -name 2@127.0.0.1 -s test client binary -noinput`

Result: 861.9077 MB/s

Again on terminal 2:

`erl -setcookie test -name 2@127.0.0.1 -s test client list -noinput`

Result: 131.8675 MB/s
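
The referenced `test.erl` is not reproduced in this thread. A minimal sketch of what such a module might look like (the message count, the `sink` registered name, and the throughput accounting are assumptions, not the actual attachment):

```erlang
%% Hypothetical reconstruction of test.erl; not the actual attachment.
-module(test).
-export([server/0, client/1]).

-define(MSGS, 10000).  %% assumed number of messages per run

server() ->
    register(sink, self()),
    loop().

loop() ->
    receive
        {done, From} -> From ! done, loop();
        _Payload     -> loop()
    end.

%% Started as: erl ... -s test client binary|list
client([Mode]) ->
    Peer = '1@127.0.0.1',
    pong = net_adm:ping(Peer),
    Term = [crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)],
    Msg = case Mode of
              binary -> term_to_binary(Term);
              list   -> Term
          end,
    {Micros, ok} = timer:tc(fun() -> send_all(?MSGS, {sink, Peer}, Msg) end),
    MBytes = ?MSGS * erlang:external_size(Msg) / (1024 * 1024),
    io:format("~.4f MB/s~n", [MBytes / (Micros / 1.0e6)]),
    erlang:halt().

send_all(0, Sink, _Msg) ->
    Sink ! {done, self()},   %% round-trip so the run includes delivery
    receive done -> ok end;
send_all(N, Sink, Msg) ->
    Sink ! Msg,
    send_all(N - 1, Sink, Msg).
```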

**Remark**

It is much better in OTP 22.3.2:

`erl -setcookie test -name 2@127.0.0.1 -s test client binary -noinput`
Result: 823.7916 MB/s

`erl -setcookie test -name 2@127.0.0.1 -s test client list -noinput`
Result: 418.5475 MB/s

 

peerst@gmail.com said:

Q: are you running this on one physical device or two?

(Addendum: seeing your 127.0.0.1 addresses, I could have inferred this.)

If it's on one device, I could imagine operating-system scheduling influencing things.


JIRAUSER16602 said:

On one device, but the same thing happens if we communicate between two hosts.

I've done a more complex message-sending test, with 16 processes and 32 in-flight messages per process (sliding window):

message (1) achieves 43 MB/s, while message (2) achieves 480 MB/s.
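
A hedged sketch of that kind of sliding-window load generator (the module name, ack protocol, and message counts are assumptions; the receiver is assumed to reply `ack` to the sender of every `{msg, From, Payload}` it gets):

```erlang
%% Hypothetical sliding-window sender: each process keeps Window
%% messages in flight, sending one new message per ack received.
-module(dist_window).
-export([run/4]).

%% e.g. run({sink, '1@127.0.0.1'}, 16, 32, 1000)
run(Sink, NumProcs, Window, Total) ->
    Payload = [crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)],
    Parent = self(),
    Pids = [spawn_link(fun() -> sender(Sink, Payload, Window, Total, Parent) end)
            || _ <- lists:seq(1, NumProcs)],
    [receive {done, P} -> ok end || P <- Pids],
    ok.

sender(Sink, Payload, Window, Total, Parent) ->
    Prime = min(Window, Total),
    [Sink ! {msg, self(), Payload} || _ <- lists:seq(1, Prime)],  %% fill window
    ack_loop(Sink, Payload, Total - Prime, Prime, Parent).

ack_loop(_Sink, _Payload, 0, 0, Parent) ->
    Parent ! {done, self()};
ack_loop(Sink, Payload, Left, Outstanding, Parent) when Left > 0 ->
    receive ack -> Sink ! {msg, self(), Payload} end,   %% slide the window
    ack_loop(Sink, Payload, Left - 1, Outstanding, Parent);
ack_loop(Sink, Payload, 0, Outstanding, Parent) ->
    receive ack -> ok end,                              %% drain remaining acks
    ack_loop(Sink, Payload, 0, Outstanding - 1, Parent).
```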


dane said:

I narrowed the search down to this patch: https://github.com/erlang/otp/commit/f3a6f63b4ff87ab21dc3af015f3b970f5c096678#diff-d256ca218de9be9017823817bd9953a7f3000a190a44dd34956e446b8d68b478

and captured a perf flame graph (attachment: perf-kernel.svg).

 

It appears that `writev` is very expensive with this patch.

An actual perf sample shows that `tcp_inet_output` is responsible for the regression:

```
- 33.50% 33.50%  3_scheduler  [kernel.kallsyms]  [k] 0xffffffff9a264f76
     __GI___clone (inlined)
     start_thread
     thr_wrapper
     sched_thread_func
     process_main
     erts_schedule
   - erts_port_task_execute
      - 20.86% tcp_inet_output (inlined)
```


rickard said:

Thanks for the bug report, zzydxm, and thanks for the bisecting and profiling, dane!

The main cause of this issue is the fixed maximum io-vector length of 16 in the inet driver (`MAX_VSIZE` in `inet_drv.c`). This caused an unnecessarily large number of calls to `writev()` (64 per message in your benchmark). When changing it to 1024, which equals `IOV_MAX` on my Linux machine, I get the throughput back to what it was before OTP 23.
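
Back-of-the-envelope, in an Erlang shell, for where those 64 calls come from (the 1024 binaries and the cap of 16 are the figures stated above):

```erlang
%% 1024 off-heap binaries, each referenced as its own io-vector entry,
%% at most 16 entries per writev() call under the old MAX_VSIZE:
1> (1024 + 16 - 1) div 16.
64
%% With the cap raised to IOV_MAX = 1024, the same message needs one call:
2> (1024 + 1024 - 1) div 1024.
1
```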

We should, however, probably not refer to binaries that small as separate elements in the io-vector. I'll look into this some more during the week and get back to you.


rickard said:

We will release a patch later this week or next week that uses `IOV_MAX` (1024 on most systems) instead of 16 as the maximum io-vector length. For now we will keep referring to all off-heap binaries as separate elements, since it is hard to determine what minimum size would be best, and you need terms with a very large number of binaries before you run into trouble on systems with an `IOV_MAX` of 1024.

Regarding the throughput drop when comparing sending the term as is and doing `term_to_binary()` on the term before sending it: I can reproduce this behaviour over the loopback on my machine for OTP 20-23 (I haven't tested earlier releases). Encoding the data and sending it to the distribution port is faster when sending the data as is than when doing `term_to_binary()` first. This is expected, since you encode the data twice when doing `term_to_binary()` on the term before sending it. The operation as a whole, however, takes longer, since the `writev()` operation blocks and the emulator has to wait much longer before it can complete the sending of all the data when not doing `term_to_binary()`. I see this behaviour on all releases from OTP 20 to OTP 23, with different I/O poll implementations and various changes in the distribution implementation. Changing the data that you send also changes the behaviour. I see this as timing/TCP-IP-stack related and won't dig further into it.
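
To make the "encode twice" point concrete, a small shell sketch (no outputs shown; `erlang:external_size/1` reports the external-format size the distribution would produce):

```erlang
%% Sending Term as is: the distribution encodes it once, on the fly.
%% Sending term_to_binary(Term): the term is encoded by the explicit call,
%% and the resulting single flat binary is then encoded again (cheaply,
%% as one binary) by the distribution.
1> Term = [crypto:strong_rand_bytes(128) || _ <- lists:seq(1, 1 bsl 10)].
2> erlang:external_size(Term).                  %% wire size, sent directly
3> erlang:external_size(term_to_binary(Term)).  %% wire size, pre-encoded
```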


rickard said:

Fix just released in patch [OTP 23.1.2](http://erlang.org/download/OTP-23.1.2.README).


dane said:

Thank you for the fix!

I'm trying to understand this:

> This is also what is expected since you encode the data twice when doing `term_to_binary()` on the term before sending it. The operation as a whole however takes longer time since the `writev` operation would block and the emulator has to wait for a much longer time before it can complete the sending of all the data when not doing `term_to_binary()`.

What I was aiming for is end-to-end performance for databases (e.g. mnesia) that send terms over distribution. What we saw is that if we use `term_to_binary()` on the sender and `binary_to_term()` on the receiver side, throughput is 3x-5x higher compared to sending the raw terms. This means that initial database replication takes considerably longer when the terms are not pre-encoded.
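
For clarity, the two variants being compared look roughly like this (the `Sink` destination and the function names are just for illustration):

```erlang
%% (a) raw terms: the distribution encodes Term when the signal is sent
send_raw(Sink, Term) ->
    Sink ! Term.

%% (b) pre-encoded: encode on the sender, decode on the receiver
send_wrapped(Sink, Term) ->
    Sink ! term_to_binary(Term).

recv_wrapped() ->
    receive
        Bin when is_binary(Bin) -> binary_to_term(Bin)
    end.
```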

I wonder if there is a benchmark for distribution, so it would be easy to use Linux perf to understand where a regression comes from.


rickard said:

> I'm trying to understand this:
>
> > This is also what is expected since you encode the data twice when doing `term_to_binary()` on the term before sending it. The operation as a whole however takes longer time since the `writev` operation would block and the emulator has to wait for a much longer time before it can complete the sending of all the data when not doing `term_to_binary()`.

When running the attached test code, encoding and moving the encoded data into the driver is faster when sending the term directly than when first encoding the term using `term_to_binary()`. This is expected, since when you do an explicit `term_to_binary()` first, you encode the term twice; the distribution always encodes terms using the same functionality as `term_to_binary()` when passing them over the connection. The second encoding is much cheaper, since it is only one binary to encode, but it is still extra work. Throughput was nevertheless lower when sending the term directly (compared to doing an explicit `term_to_binary()`), since the inet driver was not able to write the data as fast in that case. The driver became busy more frequently, and for longer times, when sending the term directly. The behaviour is more or less the same on OTP 20 up to OTP 23 (with the latest patch). Assuming that there isn't a bug in our I/O poll implementation, the extra time it takes to deliver the data is spent outside of the Erlang VM.

It is of course not impossible that there is a bug in our I/O poll implementation causing delays in the wakeup of the distribution port executing in the inet driver. Between OTP 20 and OTP 23 the I/O poll implementation was largely rewritten, so such a bug would need to exist in both the old and the new implementation. If such a bug exists, it has also never caused lost wakeups in my testing, only delayed wakeups. For these reasons I do not suspect a bug in our I/O poll implementation.

When changing the amount of data written in each message, the behaviour also changed: sometimes the throughput loss was there and sometimes not, depending on the amount of data.

> What I was aiming for is end-to-end performance for databases (e.g. mnesia) that send terms over distribution. What we saw is that if we use `term_to_binary()` on the sender and `binary_to_term()` on the receiver side, throughput is 3x-5x higher compared to sending the raw terms. This means that initial database replication takes considerably longer when the terms are not pre-encoded.

Is this with or without the fix in OTP 23.1.2? If not, what is the result with OTP 23.1.2?

Is this a real usage scenario or a micro benchmark?

> I wonder if there is a benchmark for distribution, so it would be easy to use Linux perf to understand where a regression comes from.

Unfortunately we do not have any good benchmarks for the distribution. People have mainly made their own hacks when testing out changes and then tossed them away. When introducing the IOV change, my tests did not include huge amounts of binaries, which is how this could slip through. I would be surprised if anyone has compared sending with and without a preceding `term_to_binary()` on the terms; I at least never have. We've put writing better distribution benchmarks on our todo list.
