
Rewrite RecvByteBufferAllocator. #1850

Merged: 1 commit merged into apple:main on May 6, 2021

Conversation

Lukasa
Contributor

@Lukasa Lukasa commented May 6, 2021

Motivation:

Platforms with 32-bit integers continue to exist. On those platforms,
the way we calculate the size table for AdaptiveRecvByteBufferAllocator
will trap, as we attempt to compare an Int to UInt32.max as a loop
termination condition.

While I was amending this code, I noticed that
AdaptiveRecvByteBufferAllocator had a number of very strange behaviours,
and was arguably excessively complex. To that end, this patch
constitutes a substantial rewrite of the allocator. To understand what
it does, we need to describe the previous allocator algorithm.

The goal of AdaptiveRecvByteBufferAllocator is to dynamically track the
throughput of a TCP connection and to minimise the memory usage required
for it to achieve maximum throughput, within user-defined constraints.
To that end, it implements a fairly simple resizing algorithm.

At a high level, the algorithm is as follows. The allocator keeps track
of the size of the buffer it is offering the user. When the user reports
how much of that buffer they actually used, the allocator determines
whether better throughput could be achieved with larger allocation
sizes, or whether throughput would not be harmed with smaller allocation
sizes. It then adjusts accordingly.

When does the allocator believe better throughput could be achieved
with larger allocation sizes? When the allocation was entirely filled.
If a network recv() entirely fills the buffer, that means there was
likely more data to read that could not be added to the buffer. Improved
throughput would be gained by using larger buffers, as fewer system
calls would be necessary.

Similarly, the allocator believes it could shrink the buffer size
without harming throughput when the recv() uses less memory than the
next buffer size down. However, the algorithm attempts to detect the
possibility that we have "read until the end". The reason this is
relevant is that after a read completes we will often serve other work
on the loop for a while, causing data to pile up. As a result, we don't
want to shrink the buffer if the average read size would be higher, just
because one read happened to be short.
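The grow-on-full-read, shrink-after-consecutive-short-reads policy described above can be sketched roughly as follows. This is an illustrative type with hypothetical names, not NIO's actual implementation:

```swift
// Illustrative sketch of the adaptive sizing policy (hypothetical names).
struct AdaptiveSizer {
    private(set) var nextSize: Int
    let minimum: Int
    let maximum: Int
    private var shortReads = 0

    init(initial: Int, minimum: Int, maximum: Int) {
        self.nextSize = initial
        self.minimum = minimum
        self.maximum = maximum
    }

    mutating func record(actualReadBytes: Int) {
        if actualReadBytes >= self.nextSize {
            // Buffer was entirely filled: there was probably more data
            // waiting, so grow to reduce the number of system calls.
            self.nextSize = min(self.nextSize * 2, self.maximum)
            self.shortReads = 0
        } else if actualReadBytes < self.nextSize / 2 {
            // The read would have fit in the next size down. Only shrink
            // after two such reads in a row, to avoid overreacting to a
            // single "read to the end".
            self.shortReads += 1
            if self.shortReads >= 2 {
                self.nextSize = max(self.nextSize / 2, self.minimum)
                self.shortReads = 0
            }
        } else {
            self.shortReads = 0
        }
    }
}
```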

The original implementation's algorithm was based on a bucketed
scheme. For allocation sizes below 512 bytes, the allocation buckets
moved 16 bytes at a time. In principle this gave the allocator 16 byte
granularity. For allocation sizes above 512 bytes, the allocation
buckets would double: 512, 1024, 2048, 4096, and so on, up to
UInt32.max. This, incidentally, was the source of the crash on 32-bit
systems.

Unfortunately, this scheme had a few failings.

Firstly, the implementation complexity was fairly high. The bounds
provided by the user needed to be turned into bucket indices,
necessitating an awkward and complex binary search of the bucket array.
The bucket array itself needed to be generated, forcing a dispatch_once
to guard it and dirtying memory.

Secondly, the scheme was weighted very heavily toward growing memory and
not giving it back. Whenever the buffer was filled and a larger size
needed to be chosen, the allocator would jump 4 size buckets. As the
default initial size was 2048 bytes, and the default maximum was 65kB,
the first read of 2048 bytes would cause the allocator to immediately
jump up to allocating 32kB of memory for the next read: a huge leap! It
would then only release memory after two short reads, at which point it
would drop back only 1 bucket, meaning that there was a very aggressive
sawtooth pattern favouring higher memory usage. This pattern favours
benchmarks, where high-throughput localhost connections will naturally
want larger buffer sizes, but it is unnecessarily aggressive for
real-world networks.

Thirdly, the scheme had a bug: it did not require two consecutive
short reads to shrink the buffer, just two short reads since the last
increase. This meant that it had a tendency not to stabilise:
two "read to the end"s in two different event loop ticks would cause
the buffer to shrink, even if there had been 10 almost-full-reads
in between.

Fourthly, the system would occasionally report that the Channel should
reallocate the buffer because it was larger, when in fact it wasn't:
we'd hit the max, and were never going to get larger. This forced
high-throughput Channels to excessively allocate, albeit only in systems
that customized this allocator (and basically no-one does).

Fifthly, the bucketing system was defeated by ByteBuffer. While having
16-byte granularity was nice in principle, ByteBuffer only allows
power-of-two allocation sizes. This meant that, at the low end, there
were several wasted buckets that were effectively identical. Between 256
bytes and 512 bytes there were 15 redundant buckets!

This last point is the main reason to justify a rewrite. The complexity
of the scheme was in principle justified by fine, granular control of
allocation sizes. Given that such control does not actually exist, there
is no reason not to simply resort to power-of-two sizing. Once we do that, we
can replace the complex bucketing system by simple shift-and-multiply
math. The effect of this is to produce smaller code (we save about 30%
of the instructions, even as we add some extra safety preconditions)
with no additional branching, and no need to load from a random table in
memory. This avoids the need to keep that table in cache, reducing cache
pressure in the hot read loop. Additionally, we can drop the index into
the table, which lets us save some per-Channel memory, further reducing
cache pressure.
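The shift-and-multiply math can be sketched like this. The helper names are illustrative assumptions, not NIO's exact internals; doubling uses an overflow-reporting multiply so the size is capped rather than trapping:

```swift
// Hedged sketch of power-of-two resizing without a bucket table.
func grownSize(_ size: Int, cappedAt maximum: Int) -> Int {
    // multipliedReportingOverflow lets us detect overflow instead of trapping.
    let (doubled, overflowed) = size.multipliedReportingOverflow(by: 2)
    return overflowed ? maximum : min(doubled, maximum)
}

func shrunkSize(_ size: Int, flooredAt minimum: Int) -> Int {
    // A right shift halves a power-of-two size; no table lookup required.
    return max(size >> 1, minimum)
}
```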

Finally, constructing one of these is now vastly cheaper. That
matters less (it's not really hot code), but it does improve Channel
creation time somewhat.

Modifications:

  • Remove the size table.
  • Round all values to powers of two.
  • Implement new "previous power of two" function.
  • Cap the allocation size at the largest power of 2 representable in a
    32-bit Int.
  • Add more tests.

Result:

The result is less code, simpler code, and faster code. No trapping on
32-bit integer platforms.

Resolves #1848
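The "previous power of two" helper listed in the modifications could look roughly like this (the method name is taken from the diff context; the body is an illustrative sketch, not necessarily the merged code):

```swift
// Hedged sketch of rounding a positive Int down to a power of two.
extension Int {
    func previousPowerOf2() -> Int {
        precondition(self > 0, "value must be positive")
        // 1 shifted to the position of the most significant set bit rounds
        // down to a power of two; exact powers of two map to themselves.
        return 1 << (Int.bitWidth - 1 - self.leadingZeroBitCount)
    }
}
```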

@Lukasa Lukasa added the semver/patch No public API change. label May 6, 2021
@Lukasa
Contributor Author

Lukasa commented May 6, 2021

@swift-nio-bot test perf please


@glbrntt glbrntt left a comment


Nice change; one question but looks good otherwise.

Comment on lines 66 to 67
precondition(initial > minimum, "initial: \(initial)")
precondition(maximum > initial, "maximum: \(maximum)")
Contributor


Why can't initial be equal to minimum, and maximum be equal to initial?

Contributor Author


No particular reason, but I wasn't planning to lift that constraint today.

self.maximum = maximum
// We need to round all of these numbers to a power of 2. Initial will be rounded down,
// minimum down, and maximum up.
self.minimum = minimum.previousPowerOf2()
Contributor


Do we need to min this with the max allocation size here as well? Looks like we could end up with minimum > initial if minimum and initial were sufficiently large

Contributor Author


Yes, we do. Good catch.

@Lukasa
Contributor Author

Lukasa commented May 6, 2021

While we're here, this generates lovely straight-line code:

NIOHTTP1Client`AdaptiveRecvByteBufferAllocator.record(actualReadBytes:):
NIOHTTP1Client[0xcb6b0] <+0>:   pushq  %r13
NIOHTTP1Client[0xcb6b2] <+2>:   movq   0x18(%r13), %rax
NIOHTTP1Client[0xcb6b6] <+6>:   testb  $0x1, %al
NIOHTTP1Client[0xcb6b8] <+8>:   jne    0xcb728                   ; <+120> [inlined] Swift runtime failure: precondition failure at RecvByteBufferAllocator.swift:85
NIOHTTP1Client[0xcb6ba] <+10>:  movq   (%r13), %rdx
NIOHTTP1Client[0xcb6be] <+14>:  cmpq   %rdx, %rax
NIOHTTP1Client[0xcb6c1] <+17>:  jl     0xcb728                   ; <+120> [inlined] Swift runtime failure: precondition failure at RecvByteBufferAllocator.swift:85
NIOHTTP1Client[0xcb6c3] <+19>:  movq   0x8(%r13), %r8
NIOHTTP1Client[0xcb6c7] <+23>:  cmpq   %rax, %r8
NIOHTTP1Client[0xcb6ca] <+26>:  jl     0xcb72a                   ; <+122> [inlined] Swift runtime failure: precondition failure at RecvByteBufferAllocator.swift:86
NIOHTTP1Client[0xcb6cc] <+28>:  movq   %rax, %rcx
NIOHTTP1Client[0xcb6cf] <+31>:  sarq   %rcx
NIOHTTP1Client[0xcb6d2] <+34>:  movq   %rax, %rsi
NIOHTTP1Client[0xcb6d5] <+37>:  addq   %rax, %rsi
NIOHTTP1Client[0xcb6d8] <+40>:  cmovoq %rax, %rsi
NIOHTTP1Client[0xcb6dc] <+44>:  cmpq   %rdi, %rcx
NIOHTTP1Client[0xcb6df] <+47>:  jl     0xcb6f9                   ; <+73> at RecvByteBufferAllocator.swift:105:35
NIOHTTP1Client[0xcb6e1] <+49>:  cmpq   %rdx, %rcx
NIOHTTP1Client[0xcb6e4] <+52>:  jl     0xcb6f9                   ; <+73> at RecvByteBufferAllocator.swift:105:35
NIOHTTP1Client[0xcb6e6] <+54>:  leaq   0x20(%r13), %rsi
NIOHTTP1Client[0xcb6ea] <+58>:  cmpb   $0x1, 0x20(%r13)
NIOHTTP1Client[0xcb6ef] <+63>:  jne    0xcb71f                   ; <+111> at RecvByteBufferAllocator.swift
NIOHTTP1Client[0xcb6f1] <+65>:  movq   %rcx, 0x18(%r13)
NIOHTTP1Client[0xcb6f5] <+69>:  xorl   %edx, %edx
NIOHTTP1Client[0xcb6f7] <+71>:  jmp    0xcb721                   ; <+113> at RecvByteBufferAllocator.swift
NIOHTTP1Client[0xcb6f9] <+73>:  cmpq   %rdi, %rax
NIOHTTP1Client[0xcb6fc] <+76>:  jg     0xcb714                   ; <+100> at RecvByteBufferAllocator.swift:110:30
NIOHTTP1Client[0xcb6fe] <+78>:  cmpq   %rsi, %r8
NIOHTTP1Client[0xcb701] <+81>:  jl     0xcb714                   ; <+100> at RecvByteBufferAllocator.swift:110:30
NIOHTTP1Client[0xcb703] <+83>:  movq   %rsi, 0x18(%r13)
NIOHTTP1Client[0xcb707] <+87>:  addq   $0x20, %r13
NIOHTTP1Client[0xcb70b] <+91>:  movb   $0x1, %al
NIOHTTP1Client[0xcb70d] <+93>:  xorl   %edx, %edx
NIOHTTP1Client[0xcb70f] <+95>:  movq   %r13, %rsi
NIOHTTP1Client[0xcb712] <+98>:  jmp    0xcb723                   ; <+115> at RecvByteBufferAllocator.swift
NIOHTTP1Client[0xcb714] <+100>: addq   $0x20, %r13
NIOHTTP1Client[0xcb718] <+104>: xorl   %edx, %edx
NIOHTTP1Client[0xcb71a] <+106>: movq   %r13, %rsi
NIOHTTP1Client[0xcb71d] <+109>: jmp    0xcb721                   ; <+113> at RecvByteBufferAllocator.swift
NIOHTTP1Client[0xcb71f] <+111>: movb   $0x1, %dl
NIOHTTP1Client[0xcb721] <+113>: xorl   %eax, %eax
NIOHTTP1Client[0xcb723] <+115>: movb   %dl, (%rsi)
NIOHTTP1Client[0xcb725] <+117>: popq   %r13
NIOHTTP1Client[0xcb727] <+119>: retq   
NIOHTTP1Client[0xcb728] <+120>: ud2    
NIOHTTP1Client[0xcb72a] <+122>: ud2   

@swift-server-bot

performance report

build id: 69

timestamp: Thu May 6 13:34:12 UTC 2021

results

name min max mean std
write_http_headers 0.004284182 0.004589692 0.0043268965 9.269719266784732e-05
bytebuffer_write_12MB_short_string_literals 0.527023638 0.534262668 0.5284211628000001 0.0021713186623970085
bytebuffer_write_12MB_short_calculated_strings 0.5280216 0.530982851 0.52907248 0.0008168328665326853
bytebuffer_write_12MB_medium_string_literals 0.187812129 0.190523372 0.1887237579 0.0008660906122528667
bytebuffer_write_12MB_medium_calculated_strings 0.232751835 0.234440106 0.23347936660000004 0.0004425693872958025
bytebuffer_write_12MB_large_calculated_strings 0.199999938 0.200966259 0.20057125589999997 0.0002450965789450079
bytebuffer_lots_of_rw 0.565500495 0.567163326 0.566125406 0.0004743165873886326
bytebuffer_write_http_response_ascii_only_as_string 0.042033026 0.046081372 0.0426466997 0.0012339299653253385
bytebuffer_write_http_response_ascii_only_as_staticstring 0.032571562 0.033094734 0.03271949 0.00015134086690059062
bytebuffer_write_http_response_some_nonascii_as_string 0.042317613 0.042837743 0.0424809215 0.00018655087824519972
bytebuffer_write_http_response_some_nonascii_as_staticstring 0.032559971 0.03274433 0.0326568767 6.286983678654426e-05
no-net_http1_10k_reqs_1_conn 0.141763182 0.145294153 0.1437500209 0.0011416561114886742
http1_10k_reqs_1_conn 0.617356179 0.627004289 0.6229928584 0.0031050756466489064
http1_10k_reqs_100_conns 0.624422159 0.627204666 0.6256351557 0.0008921174939977934
future_whenallsucceed_100k_immediately_succeeded_off_loop 0.09281187 0.093306944 0.0930193905 0.00016504826230008397
future_whenallsucceed_100k_immediately_succeeded_on_loop 0.092780679 0.099773494 0.0937509692 0.002124706063357774
future_whenallsucceed_100k_deferred_off_loop 0.365960467 0.372562664 0.3676152249 0.002017967481820941
future_whenallsucceed_100k_deferred_on_loop 0.140428096 0.144649803 0.1419851586 0.0013147732857826083
future_whenallcomplete_100k_immediately_succeeded_off_loop 0.035422919 0.035976778 0.035633603 0.00014791270645590754
future_whenallcomplete_100k_immediately_succeeded_on_loop 0.036394337 0.037726155 0.03699124090000001 0.0004689163855937663
future_whenallcomplete_100k_deferred_off_loop 0.276975763 0.284069867 0.2796452053 0.002449498826110167
future_whenallcomplete_100k_deferred_on_loop 0.073111514 0.076406456 0.07399294120000001 0.0009240140655034297
future_reduce_10k_futures 0.036673802 0.037924895 0.0373218946 0.0003900262892231184
future_reduce_into_10k_futures 0.036292155 0.037287031 0.036675626899999994 0.0003016802121689819
channel_pipeline_1m_events 0.172178609 0.172325874 0.1722852265 4.750975160076454e-05
websocket_encode_50b_space_at_front_1m_frames_cow 0.826327505 0.826745015 0.8264788264 0.00014297125592478173
websocket_encode_50b_space_at_front_1m_frames_cow_masking 0.091233489 0.091874626 0.09150898360000001 0.0002593489419673803
websocket_encode_1kb_space_at_front_100k_frames_cow 0.085258252 0.085674592 0.0854702536 0.00020060304475488905
websocket_encode_50b_no_space_at_front_1m_frames_cow 0.83403281 0.83443184 0.8341742205000001 0.0001438717091719424
websocket_encode_1kb_no_space_at_front_100k_frames_cow 0.085234419 0.085680175 0.0854073041 0.00020580988801018933
websocket_encode_50b_space_at_front_10k_frames 0.011169544 0.011549482 0.0112186952 0.00011690407190208936
websocket_encode_50b_space_at_front_10k_frames_masking 0.120633584 0.121113624 0.12092254399999999 0.00022503033656227957
websocket_encode_1kb_space_at_front_1k_frames 0.001986451 0.002385271 0.0020317412 0.00012427759359666483
websocket_encode_50b_no_space_at_front_10k_frames 0.010901066 0.010938971 0.0109116675 1.136607590107046e-05
websocket_encode_1kb_no_space_at_front_1k_frames 0.00198893 0.002021951 0.0019937326999999996 1.0185753951802854e-05
websocket_decode_125b_100k_frames 0.142496783 0.143296415 0.14283051749999998 0.00025239900883951136
websocket_decode_125b_with_a_masking_key_100k_frames 0.148073031 0.148711718 0.14840129890000003 0.0002271908029958018
websocket_decode_64kb_100k_frames 0.147815171 0.148649616 0.14830683559999996 0.00030213759813414205
websocket_decode_64kb_with_a_masking_key_100k_frames 0.153807712 0.15483722 0.1544697348 0.00033716712114544863
websocket_decode_64kb_+1_100k_frames 0.147263396 0.148000116 0.14765755519999998 0.0002719794304517214
websocket_decode_64kb_+1_with_a_masking_key_100k_frames 0.153634526 0.154178599 0.1539771252 0.00023328727587952137
circular_buffer_into_byte_buffer_1kb 0.046950231 0.047430568 0.047057278999999994 0.000191016638945407
circular_buffer_into_byte_buffer_1mb 0.09402604 0.094513279 0.0942668616 0.00021581918018316655
byte_buffer_view_iterator_1mb 0.135506678 0.136332838 0.1360018399 0.00028191258907109414
byte_to_message_decoder_decode_many_small 0.235076814 0.235544159 0.2352183034 0.00017861348420430105

comparison

name current previous winner diff
write_http_headers 0.004284182 0.00423983 previous 1%
bytebuffer_write_12MB_short_string_literals 0.527023638 0.524823009 previous 0%
bytebuffer_write_12MB_short_calculated_strings 0.5280216 0.524930183 previous 0%
bytebuffer_write_12MB_medium_string_literals 0.187812129 0.186434691 previous 0%
bytebuffer_write_12MB_medium_calculated_strings 0.232751835 0.231389109 previous 0%
bytebuffer_write_12MB_large_calculated_strings 0.199999938 0.197412867 previous 1%
bytebuffer_lots_of_rw 0.565500495 0.560243487 previous 0%
bytebuffer_write_http_response_ascii_only_as_string 0.042033026 0.041796932 previous 0%
bytebuffer_write_http_response_ascii_only_as_staticstring 0.032571562 0.031829099 previous 2%
bytebuffer_write_http_response_some_nonascii_as_string 0.042317613 0.041882572 previous 1%
bytebuffer_write_http_response_some_nonascii_as_staticstring 0.032559971 0.031819275 previous 2%
no-net_http1_10k_reqs_1_conn 0.141763182 0.144234562 current -1%
http1_10k_reqs_1_conn 0.617356179 0.627708312 current -1%
http1_10k_reqs_100_conns 0.624422159 0.626224356 current 0%
future_whenallsucceed_100k_immediately_succeeded_off_loop 0.09281187 0.090842007 previous 2%
future_whenallsucceed_100k_immediately_succeeded_on_loop 0.092780679 0.091243954 previous 1%
future_whenallsucceed_100k_deferred_off_loop 0.365960467 0.364492078 previous 0%
future_whenallsucceed_100k_deferred_on_loop 0.140428096 0.141365116 current 0%
future_whenallcomplete_100k_immediately_succeeded_off_loop 0.035422919 0.035272176 previous 0%
future_whenallcomplete_100k_immediately_succeeded_on_loop 0.036394337 0.036366598 previous 0%
future_whenallcomplete_100k_deferred_off_loop 0.276975763 0.276370435 previous 0%
future_whenallcomplete_100k_deferred_on_loop 0.073111514 0.072524211 previous 0%
future_reduce_10k_futures 0.036673802 0.038679783 current -5%
future_reduce_into_10k_futures 0.036292155 0.038190771 current -4%
channel_pipeline_1m_events 0.172178609 0.172229848 current 0%
websocket_encode_50b_space_at_front_1m_frames_cow 0.826327505 0.825399235 previous 0%
websocket_encode_50b_space_at_front_1m_frames_cow_masking 0.091233489 0.09279181 current -1%
websocket_encode_1kb_space_at_front_100k_frames_cow 0.085258252 0.085626221 current 0%
websocket_encode_50b_no_space_at_front_1m_frames_cow 0.83403281 0.828088881 previous 0%
websocket_encode_1kb_no_space_at_front_100k_frames_cow 0.085234419 0.085354986 current 0%
websocket_encode_50b_space_at_front_10k_frames 0.011169544 0.011362554 current -1%
websocket_encode_50b_space_at_front_10k_frames_masking 0.120633584 0.121844198 current 0%
websocket_encode_1kb_space_at_front_1k_frames 0.001986451 0.002004478 current 0%
websocket_encode_50b_no_space_at_front_10k_frames 0.010901066 0.011024638 current -1%
websocket_encode_1kb_no_space_at_front_1k_frames 0.00198893 0.00197553 previous 0%
websocket_decode_125b_100k_frames 0.142496783 0.141501726 previous 0%
websocket_decode_125b_with_a_masking_key_100k_frames 0.148073031 0.147995478 previous 0%
websocket_decode_64kb_100k_frames 0.147815171 0.148208093 current 0%
websocket_decode_64kb_with_a_masking_key_100k_frames 0.153807712 0.154509227 current 0%
websocket_decode_64kb_+1_100k_frames 0.147263396 0.147028491 previous 0%
websocket_decode_64kb_+1_with_a_masking_key_100k_frames 0.153634526 0.15363602 current 0%
circular_buffer_into_byte_buffer_1kb 0.046950231 0.046947449 previous 0%
circular_buffer_into_byte_buffer_1mb 0.09402604 0.094050442 current 0%
byte_buffer_view_iterator_1mb 0.135506678 0.134265703 previous 0%
byte_to_message_decoder_decode_many_small 0.235076814 0.230608466 previous 1%

significant differences found

Member

@weissi weissi left a comment


Thank you! Definitely an improvement but at least one comment isn't quite right

@@ -15,47 +15,85 @@
import XCTest
import NIO

public final class AdaptiveRecvByteBufferAllocatorTest : XCTestCase {
final class AdaptiveRecvByteBufferAllocatorTest : XCTestCase {
Member


Suggested change
final class AdaptiveRecvByteBufferAllocatorTest : XCTestCase {
final class AdaptiveRecvByteBufferAllocatorTest: XCTestCase {

}
} while true
}
// Here we need to be careful with 32-bit systems: if maximum is Int32.max then any shift or multiply will overflow, which
Member


something's not right here. You initialise self.maximum to always be a power of 2. So it can't be Int32.max because that's not a power of two. Do you mean Int32(1) << 30 == 1 GiB?

Contributor Author


Yeah, this comment is from an older version of this code.

Sources/NIO/RecvByteBufferAllocator.swift (outdated; resolved)
@Lukasa Lukasa force-pushed the cb-32-bit-ints-remain-unfortunately-real branch from 3d9b4c2 to c0bd944 Compare May 6, 2021 14:27
@Lukasa Lukasa requested review from weissi and glbrntt May 6, 2021 14:27
@Lukasa Lukasa force-pushed the cb-32-bit-ints-remain-unfortunately-real branch from c0bd944 to 361272a Compare May 6, 2021 15:25
Member

@weissi weissi left a comment


Awesome, looks good to me!

@Lukasa Lukasa merged commit e0e289a into apple:main May 6, 2021
@Lukasa Lukasa deleted the cb-32-bit-ints-remain-unfortunately-real branch May 6, 2021 15:44
Successfully merging this pull request may close these issues.

Change adaptive allocator size table to work on platforms with 32-bit Int