
Rewrite RecvByteBufferAllocator. #1850

Merged: 1 commit merged into apple:main on May 6, 2021

Conversation

Lukasa
Contributor

@Lukasa Lukasa commented May 6, 2021

Motivation:

Platforms with 32-bit integers continue to exist. On those platforms,
the way we calculate the size table for AdaptiveRecvByteBufferAllocator
will trap, as we attempt to compare an Int to UInt32.max as a loop
termination condition.

While I was amending this code, I noticed that
AdaptiveRecvByteBufferAllocator had a number of very strange behaviours,
and was arguably excessively complex. To that end, this patch
constitutes a substantial rewrite of the allocator. To understand what
it does, we need to describe the previous allocator algorithm.

The goal of AdaptiveRecvByteBufferAllocator is to dynamically track the
throughput of a TCP connection and to minimise the memory usage required
for it to achieve maximum throughput, within user-defined constraints.
To that end, it implements a fairly simple resizing algorithm.

At a high level, the algorithm is as follows. The allocator keeps track
of the size of the buffer it is offering the user. When the user reports
how much of that buffer they actually used, the allocator determines
whether better throughput could be achieved with larger allocation
sizes, or whether throughput would not be harmed with smaller allocation
sizes. It then adjusts accordingly.

When does the allocator believe better throughput could be achieved
with larger allocation sizes? When the allocation was entirely filled.
If a network recv() entirely fills the buffer, that means there was
likely more data to read that could not be added to the buffer. Improved
throughput would be gained by using larger buffers, as fewer system
calls would be necessary.

Similarly, the allocator believes it could shrink the buffer size
without harming throughput when the recv() uses less memory than the
next buffer size down. However, the algorithm attempts to detect the
possibility that we have "read until the end". The reason this is
relevant is that after a read completes we will often serve other work
on the loop for a while, causing data to pile up. As a result, we don't
want to shrink the buffer if the average read size would be higher, just
because one read happened to be short.
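The grow-on-full-read, shrink-after-consecutive-short-reads policy described above can be sketched roughly as follows. This is an illustrative type with hypothetical names, not NIO's actual implementation:

```swift
// Illustrative sketch of the adaptive sizing policy (hypothetical names).
struct AdaptiveSizer {
    private(set) var nextSize: Int
    let minimum: Int
    let maximum: Int
    private var shortReads = 0

    init(initial: Int, minimum: Int, maximum: Int) {
        self.nextSize = initial
        self.minimum = minimum
        self.maximum = maximum
    }

    mutating func record(actualReadBytes: Int) {
        if actualReadBytes >= self.nextSize {
            // Buffer was entirely filled: there was probably more data
            // waiting, so grow to reduce the number of system calls.
            self.nextSize = min(self.nextSize * 2, self.maximum)
            self.shortReads = 0
        } else if actualReadBytes < self.nextSize / 2 {
            // The read would have fit in the next size down. Only shrink
            // after two such reads in a row, to avoid overreacting to a
            // single "read to the end".
            self.shortReads += 1
            if self.shortReads >= 2 {
                self.nextSize = max(self.nextSize / 2, self.minimum)
                self.shortReads = 0
            }
        } else {
            self.shortReads = 0
        }
    }
}
```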

The original implementation's algorithm was based on a bucketed
scheme. For allocation sizes below 512 bytes, the allocation buckets
moved 16 bytes at a time. In principle this gave the allocator 16 byte
granularity. For allocation sizes above 512 bytes, the allocation
buckets would double: 512, 1024, 2048, 4096, and so on, up to
UInt32.max. This, incidentally, was the source of the crash on 32-bit
systems.

Unfortunately, this scheme had a few failings.

Firstly, the implementation complexity was fairly high. The bounds
provided by the user needed to be turned into bucket indices,
necessitating an awkward and complex binary search of the bucket array.
The bucket array itself needed to be generated, forcing a dispatch_once
to guard it and dirtying memory.

Secondly, the scheme was weighted very heavily toward growing memory and
not giving it back. Whenever the buffer was filled and a larger size
needed to be chosen, the allocator would jump 4 size buckets. As the
default initial size was 2048 bytes, and the default maximum was 65kB,
the first read of 2048 bytes would cause the allocator to immediately
jump up to allocating 32kB of memory for the next read: a huge leap! It
would then only release memory after two short reads, at which point it
would drop back only 1 bucket, meaning that there was a very aggressive
sawtooth pattern favouring higher memory usage. This pattern favours
benchmarks, where high-throughput localhost connections will naturally
want larger buffer sizes, but it is unnecessarily aggressive for
real-world networks.

Thirdly, the scheme had a bug: it did not require two consecutive
short reads to shrink the buffer, just two short reads since the last
increase. This meant that it had a tendency not to stabilise:
two "read to the end"s in two different event loop ticks would cause
the buffer to shrink, even if there had been 10 almost-full-reads
in between.

Fourthly, the system would occasionally report that the Channel should
reallocate the buffer because it was larger, when in fact it wasn't:
we'd hit the max, and were never going to get larger. This forced
high-throughput Channels to excessively allocate, albeit only in systems
that customized this allocator (and basically no-one does).

Fifthly, the bucketing system was defeated by ByteBuffer. While having
16-byte granularity was nice in principle, ByteBuffer only allows
power-of-two allocation sizes. This meant that, at the low end, there
were several wasted buckets that were effectively identical. Between 256
bytes and 512 bytes there were 15 redundant buckets!

This last point is the main reason to justify a rewrite. The complexity
of the scheme was in principle justified by fine, granular control of
allocation sizes. Given that such control does not actually exist, there
is no reason not to simply resort to power-of-two sizing. Once we do that, we
can replace the complex bucketing system by simple shift-and-multiply
math. The effect of this is to produce smaller code (we save about 30%
of the instructions, even as we add some extra safety preconditions)
with no additional branching, and no need to load from a random table in
memory. This avoids the need to keep that table in cache, reducing cache
pressure in the hot read loop. Additionally, we can drop the index into
the table, which lets us save some per-Channel memory, further reducing
cache pressure.
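The shift-and-multiply math can be sketched like this. The helper names are illustrative assumptions, not NIO's exact internals; doubling uses an overflow-reporting multiply so the size is capped rather than trapping:

```swift
// Hedged sketch of power-of-two resizing without a bucket table.
func grownSize(_ size: Int, cappedAt maximum: Int) -> Int {
    // multipliedReportingOverflow lets us detect overflow instead of trapping.
    let (doubled, overflowed) = size.multipliedReportingOverflow(by: 2)
    return overflowed ? maximum : min(doubled, maximum)
}

func shrunkSize(_ size: Int, flooredAt minimum: Int) -> Int {
    // A right shift halves a power-of-two size; no table lookup required.
    return max(size >> 1, minimum)
}
```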

Finally, constructing one of these is now vastly cheaper. That
matters less (it's not really hot code), but it does improve Channel
creation time somewhat.

Modifications:

  • Remove the size table.
  • Round all values to powers of two.
  • Implement new "previous power of two" function.
  • Cap the allocation size at the largest power of 2 representable in a
    32-bit Int.
  • Add more tests.

Result:

The result is less code, simpler code, and faster code. No trapping on
32-bit integer platforms.

Resolves #1848
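The "previous power of two" helper listed in the modifications could look roughly like this (the method name is taken from the diff context; the body is an illustrative sketch, not necessarily the merged code):

```swift
// Hedged sketch of rounding a positive Int down to a power of two.
extension Int {
    func previousPowerOf2() -> Int {
        precondition(self > 0, "value must be positive")
        // 1 shifted to the position of the most significant set bit rounds
        // down to a power of two; exact powers of two map to themselves.
        return 1 << (Int.bitWidth - 1 - self.leadingZeroBitCount)
    }
}
```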

@Lukasa Lukasa added the semver/patch No public API change. label May 6, 2021
@Lukasa
Contributor Author

Lukasa commented May 6, 2021

@swift-nio-bot test perf please


@glbrntt glbrntt left a comment


Nice change; one question but looks good otherwise.

Comment on lines 66 to 67
precondition(initial > minimum, "initial: \(initial)")
precondition(maximum > initial, "maximum: \(maximum)")
Contributor


Why can't initial be equal to minimum, and maximum be equal to initial?

Contributor Author


No particular reason, but I wasn't planning to lift that constraint today.

self.maximum = maximum
// We need to round all of these numbers to a power of 2. Initial will be rounded down,
// minimum down, and maximum up.
self.minimum = minimum.previousPowerOf2()
Contributor


Do we need to min this with the max allocation size here as well? Looks like we could end up with minimum > initial if minimum and initial were sufficiently large

Contributor Author


Yes, we do. Good catch.

@Lukasa
Contributor Author

Lukasa commented May 6, 2021

While we're here, this generates lovely straight-line code:

NIOHTTP1Client`AdaptiveRecvByteBufferAllocator.record(actualReadBytes:):
NIOHTTP1Client[0xcb6b0] <+0>:   pushq  %r13
NIOHTTP1Client[0xcb6b2] <+2>:   movq   0x18(%r13), %rax
NIOHTTP1Client[0xcb6b6] <+6>:   testb  $0x1, %al
NIOHTTP1Client[0xcb6b8] <+8>:   jne    0xcb728                   ; <+120> [inlined] Swift runtime failure: precondition failure at RecvByteBufferAllocator.swift:85
NIOHTTP1Client[0xcb6ba] <+10>:  movq   (%r13), %rdx
NIOHTTP1Client[0xcb6be] <+14>:  cmpq   %rdx, %rax
NIOHTTP1Client[0xcb6c1] <+17>:  jl     0xcb728                   ; <+120> [inlined] Swift runtime failure: precondition failure at RecvByteBufferAllocator.swift:85
NIOHTTP1Client[0xcb6c3] <+19>:  movq   0x8(%r13), %r8
NIOHTTP1Client[0xcb6c7] <+23>:  cmpq   %rax, %r8
NIOHTTP1Client[0xcb6ca] <+26>:  jl     0xcb72a                   ; <+122> [inlined] Swift runtime failure: precondition failure at RecvByteBufferAllocator.swift:86
NIOHTTP1Client[0xcb6cc] <+28>:  movq   %rax, %rcx
NIOHTTP1Client[0xcb6cf] <+31>:  sarq   %rcx
NIOHTTP1Client[0xcb6d2] <+34>:  movq   %rax, %rsi
NIOHTTP1Client[0xcb6d5] <+37>:  addq   %rax, %rsi
NIOHTTP1Client[0xcb6d8] <+40>:  cmovoq %rax, %rsi
NIOHTTP1Client[0xcb6dc] <+44>:  cmpq   %rdi, %rcx
NIOHTTP1Client[0xcb6df] <+47>:  jl     0xcb6f9                   ; <+73> at RecvByteBufferAllocator.swift:105:35
NIOHTTP1Client[0xcb6e1] <+49>:  cmpq   %rdx, %rcx
NIOHTTP1Client[0xcb6e4] <+52>:  jl     0xcb6f9                   ; <+73> at RecvByteBufferAllocator.swift:105:35
NIOHTTP1Client[0xcb6e6] <+54>:  leaq   0x20(%r13), %rsi
NIOHTTP1Client[0xcb6ea] <+58>:  cmpb   $0x1, 0x20(%r13)
NIOHTTP1Client[0xcb6ef] <+63>:  jne    0xcb71f                   ; <+111> at RecvByteBufferAllocator.swift
NIOHTTP1Client[0xcb6f1] <+65>:  movq   %rcx, 0x18(%r13)
NIOHTTP1Client[0xcb6f5] <+69>:  xorl   %edx, %edx
NIOHTTP1Client[0xcb6f7] <+71>:  jmp    0xcb721                   ; <+113> at RecvByteBufferAllocator.swift
NIOHTTP1Client[0xcb6f9] <+73>:  cmpq   %rdi, %rax
NIOHTTP1Client[0xcb6fc] <+76>:  jg     0xcb714                   ; <+100> at RecvByteBufferAllocator.swift:110:30
NIOHTTP1Client[0xcb6fe] <+78>:  cmpq   %rsi, %r8
NIOHTTP1Client[0xcb701] <+81>:  jl     0xcb714                   ; <+100> at RecvByteBufferAllocator.swift:110:30
NIOHTTP1Client[0xcb703] <+83>:  movq   %rsi, 0x18(%r13)
NIOHTTP1Client[0xcb707] <+87>:  addq   $0x20, %r13
NIOHTTP1Client[0xcb70b] <+91>:  movb   $0x1, %al
NIOHTTP1Client[0xcb70d] <+93>:  xorl   %edx, %edx
NIOHTTP1Client[0xcb70f] <+95>:  movq   %r13, %rsi
NIOHTTP1Client[0xcb712] <+98>:  jmp    0xcb723                   ; <+115> at RecvByteBufferAllocator.swift
NIOHTTP1Client[0xcb714] <+100>: addq   $0x20, %r13
NIOHTTP1Client[0xcb718] <+104>: xorl   %edx, %edx
NIOHTTP1Client[0xcb71a] <+106>: movq   %r13, %rsi
NIOHTTP1Client[0xcb71d] <+109>: jmp    0xcb721                   ; <+113> at RecvByteBufferAllocator.swift
NIOHTTP1Client[0xcb71f] <+111>: movb   $0x1, %dl
NIOHTTP1Client[0xcb721] <+113>: xorl   %eax, %eax
NIOHTTP1Client[0xcb723] <+115>: movb   %dl, (%rsi)
NIOHTTP1Client[0xcb725] <+117>: popq   %r13
NIOHTTP1Client[0xcb727] <+119>: retq   
NIOHTTP1Client[0xcb728] <+120>: ud2    
NIOHTTP1Client[0xcb72a] <+122>: ud2   

@swift-server-bot

performance report

build id: 69

timestamp: Thu May 6 13:34:12 UTC 2021

results

name min max mean std
write_http_headers 0.004284182 0.004589692 0.0043268965 9.269719266784732e-05
bytebuffer_write_12MB_short_string_literals 0.527023638 0.534262668 0.5284211628000001 0.0021713186623970085
bytebuffer_write_12MB_short_calculated_strings 0.5280216 0.530982851 0.52907248 0.0008168328665326853
bytebuffer_write_12MB_medium_string_literals 0.187812129 0.190523372 0.1887237579 0.0008660906122528667
bytebuffer_write_12MB_medium_calculated_strings 0.232751835 0.234440106 0.23347936660000004 0.0004425693872958025
bytebuffer_write_12MB_large_calculated_strings 0.199999938 0.200966259 0.20057125589999997 0.0002450965789450079
bytebuffer_lots_of_rw 0.565500495 0.567163326 0.566125406 0.0004743165873886326
bytebuffer_write_http_response_ascii_only_as_string 0.042033026 0.046081372 0.0426466997 0.0012339299653253385
bytebuffer_write_http_response_ascii_only_as_staticstring 0.032571562 0.033094734 0.03271949 0.00015134086690059062
bytebuffer_write_http_response_some_nonascii_as_string 0.042317613 0.042837743 0.0424809215 0.00018655087824519972
bytebuffer_write_http_response_some_nonascii_as_staticstring 0.032559971 0.03274433 0.0326568767 6.286983678654426e-05
no-net_http1_10k_reqs_1_conn 0.141763182 0.145294153 0.1437500209 0.0011416561114886742
http1_10k_reqs_1_conn 0.617356179 0.627004289 0.6229928584 0.0031050756466489064
http1_10k_reqs_100_conns 0.624422159 0.627204666 0.6256351557 0.0008921174939977934
future_whenallsucceed_100k_immediately_succeeded_off_loop 0.09281187 0.093306944 0.0930193905 0.00016504826230008397
future_whenallsucceed_100k_immediately_succeeded_on_loop 0.092780679 0.099773494 0.0937509692 0.002124706063357774
future_whenallsucceed_100k_deferred_off_loop 0.365960467 0.372562664 0.3676152249 0.002017967481820941
future_whenallsucceed_100k_deferred_on_loop 0.140428096 0.144649803 0.1419851586 0.0013147732857826083
future_whenallcomplete_100k_immediately_succeeded_off_loop 0.035422919 0.035976778 0.035633603 0.00014791270645590754
future_whenallcomplete_100k_immediately_succeeded_on_loop 0.036394337 0.037726155 0.03699124090000001 0.0004689163855937663
future_whenallcomplete_100k_deferred_off_loop 0.276975763 0.284069867 0.2796452053 0.002449498826110167
future_whenallcomplete_100k_deferred_on_loop 0.073111514 0.076406456 0.07399294120000001 0.0009240140655034297
future_reduce_10k_futures 0.036673802 0.037924895 0.0373218946 0.0003900262892231184
future_reduce_into_10k_futures 0.036292155 0.037287031 0.036675626899999994 0.0003016802121689819
channel_pipeline_1m_events 0.172178609 0.172325874 0.1722852265 4.750975160076454e-05
websocket_encode_50b_space_at_front_1m_frames_cow 0.826327505 0.826745015 0.8264788264 0.00014297125592478173
websocket_encode_50b_space_at_front_1m_frames_cow_masking 0.091233489 0.091874626 0.09150898360000001 0.0002593489419673803
websocket_encode_1kb_space_at_front_100k_frames_cow 0.085258252 0.085674592 0.0854702536 0.00020060304475488905
websocket_encode_50b_no_space_at_front_1m_frames_cow 0.83403281 0.83443184 0.8341742205000001 0.0001438717091719424
websocket_encode_1kb_no_space_at_front_100k_frames_cow 0.085234419 0.085680175 0.0854073041 0.00020580988801018933
websocket_encode_50b_space_at_front_10k_frames 0.011169544 0.011549482 0.0112186952 0.00011690407190208936
websocket_encode_50b_space_at_front_10k_frames_masking 0.120633584 0.121113624 0.12092254399999999 0.00022503033656227957
websocket_encode_1kb_space_at_front_1k_frames 0.001986451 0.002385271 0.0020317412 0.00012427759359666483
websocket_encode_50b_no_space_at_front_10k_frames 0.010901066 0.010938971 0.0109116675 1.136607590107046e-05
websocket_encode_1kb_no_space_at_front_1k_frames 0.00198893 0.002021951 0.0019937326999999996 1.0185753951802854e-05
websocket_decode_125b_100k_frames 0.142496783 0.143296415 0.14283051749999998 0.00025239900883951136
websocket_decode_125b_with_a_masking_key_100k_frames 0.148073031 0.148711718 0.14840129890000003 0.0002271908029958018
websocket_decode_64kb_100k_frames 0.147815171 0.148649616 0.14830683559999996 0.00030213759813414205
websocket_decode_64kb_with_a_masking_key_100k_frames 0.153807712 0.15483722 0.1544697348 0.00033716712114544863
websocket_decode_64kb_+1_100k_frames 0.147263396 0.148000116 0.14765755519999998 0.0002719794304517214
websocket_decode_64kb_+1_with_a_masking_key_100k_frames 0.153634526 0.154178599 0.1539771252 0.00023328727587952137
circular_buffer_into_byte_buffer_1kb 0.046950231 0.047430568 0.047057278999999994 0.000191016638945407
circular_buffer_into_byte_buffer_1mb 0.09402604 0.094513279 0.0942668616 0.00021581918018316655
byte_buffer_view_iterator_1mb 0.135506678 0.136332838 0.1360018399 0.00028191258907109414
byte_to_message_decoder_decode_many_small 0.235076814 0.235544159 0.2352183034 0.00017861348420430105

comparison

name current previous winner diff
write_http_headers 0.004284182 0.00423983 previous 1%
bytebuffer_write_12MB_short_string_literals 0.527023638 0.524823009 previous 0%
bytebuffer_write_12MB_short_calculated_strings 0.5280216 0.524930183 previous 0%
bytebuffer_write_12MB_medium_string_literals 0.187812129 0.186434691 previous 0%
bytebuffer_write_12MB_medium_calculated_strings 0.232751835 0.231389109 previous 0%
bytebuffer_write_12MB_large_calculated_strings 0.199999938 0.197412867 previous 1%
bytebuffer_lots_of_rw 0.565500495 0.560243487 previous 0%
bytebuffer_write_http_response_ascii_only_as_string 0.042033026 0.041796932 previous 0%
bytebuffer_write_http_response_ascii_only_as_staticstring 0.032571562 0.031829099 previous 2%
bytebuffer_write_http_response_some_nonascii_as_string 0.042317613 0.041882572 previous 1%
bytebuffer_write_http_response_some_nonascii_as_staticstring 0.032559971 0.031819275 previous 2%
no-net_http1_10k_reqs_1_conn 0.141763182 0.144234562 current -1%
http1_10k_reqs_1_conn 0.617356179 0.627708312 current -1%
http1_10k_reqs_100_conns 0.624422159 0.626224356 current 0%
future_whenallsucceed_100k_immediately_succeeded_off_loop 0.09281187 0.090842007 previous 2%
future_whenallsucceed_100k_immediately_succeeded_on_loop 0.092780679 0.091243954 previous 1%
future_whenallsucceed_100k_deferred_off_loop 0.365960467 0.364492078 previous 0%
future_whenallsucceed_100k_deferred_on_loop 0.140428096 0.141365116 current 0%
future_whenallcomplete_100k_immediately_succeeded_off_loop 0.035422919 0.035272176 previous 0%
future_whenallcomplete_100k_immediately_succeeded_on_loop 0.036394337 0.036366598 previous 0%
future_whenallcomplete_100k_deferred_off_loop 0.276975763 0.276370435 previous 0%
future_whenallcomplete_100k_deferred_on_loop 0.073111514 0.072524211 previous 0%
future_reduce_10k_futures 0.036673802 0.038679783 current -5%
future_reduce_into_10k_futures 0.036292155 0.038190771 current -4%
channel_pipeline_1m_events 0.172178609 0.172229848 current 0%
websocket_encode_50b_space_at_front_1m_frames_cow 0.826327505 0.825399235 previous 0%
websocket_encode_50b_space_at_front_1m_frames_cow_masking 0.091233489 0.09279181 current -1%
websocket_encode_1kb_space_at_front_100k_frames_cow 0.085258252 0.085626221 current 0%
websocket_encode_50b_no_space_at_front_1m_frames_cow 0.83403281 0.828088881 previous 0%
websocket_encode_1kb_no_space_at_front_100k_frames_cow 0.085234419 0.085354986 current 0%
websocket_encode_50b_space_at_front_10k_frames 0.011169544 0.011362554 current -1%
websocket_encode_50b_space_at_front_10k_frames_masking 0.120633584 0.121844198 current 0%
websocket_encode_1kb_space_at_front_1k_frames 0.001986451 0.002004478 current 0%
websocket_encode_50b_no_space_at_front_10k_frames 0.010901066 0.011024638 current -1%
websocket_encode_1kb_no_space_at_front_1k_frames 0.00198893 0.00197553 previous 0%
websocket_decode_125b_100k_frames 0.142496783 0.141501726 previous 0%
websocket_decode_125b_with_a_masking_key_100k_frames 0.148073031 0.147995478 previous 0%
websocket_decode_64kb_100k_frames 0.147815171 0.148208093 current 0%
websocket_decode_64kb_with_a_masking_key_100k_frames 0.153807712 0.154509227 current 0%
websocket_decode_64kb_+1_100k_frames 0.147263396 0.147028491 previous 0%
websocket_decode_64kb_+1_with_a_masking_key_100k_frames 0.153634526 0.15363602 current 0%
circular_buffer_into_byte_buffer_1kb 0.046950231 0.046947449 previous 0%
circular_buffer_into_byte_buffer_1mb 0.09402604 0.094050442 current 0%
byte_buffer_view_iterator_1mb 0.135506678 0.134265703 previous 0%
byte_to_message_decoder_decode_many_small 0.235076814 0.230608466 previous 1%

significant differences found

Member

@weissi weissi left a comment


Thank you! Definitely an improvement but at least one comment isn't quite right

@@ -15,47 +15,85 @@
import XCTest
import NIO

public final class AdaptiveRecvByteBufferAllocatorTest : XCTestCase {
final class AdaptiveRecvByteBufferAllocatorTest : XCTestCase {
Member


Suggested change
final class AdaptiveRecvByteBufferAllocatorTest : XCTestCase {
final class AdaptiveRecvByteBufferAllocatorTest: XCTestCase {

}
} while true
}
// Here we need to be careful with 32-bit systems: if maximum is Int32.max then any shift or multiply will overflow, which
Member


something's not right here. You initialise self.maximum to always be a power of 2. So it can't be Int32.max because that's not a power of two. Do you mean Int32(1) << 30 == 1 GiB?

Contributor Author


Yeah, this comment is from an older version of this code.

Sources/NIO/RecvByteBufferAllocator.swift (outdated; resolved)
@Lukasa Lukasa force-pushed the cb-32-bit-ints-remain-unfortunately-real branch from 3d9b4c2 to c0bd944 Compare May 6, 2021 14:27
@Lukasa Lukasa requested review from weissi and glbrntt May 6, 2021 14:27
@Lukasa Lukasa force-pushed the cb-32-bit-ints-remain-unfortunately-real branch from c0bd944 to 361272a Compare May 6, 2021 15:25
Member

@weissi weissi left a comment


Awesome, looks good to me!

@Lukasa Lukasa merged commit e0e289a into apple:main May 6, 2021
@Lukasa Lukasa deleted the cb-32-bit-ints-remain-unfortunately-real branch May 6, 2021 15:44
Successfully merging this pull request may close these issues.

Change adaptive allocator size table to work on platforms with 32-bit Int