Rewrite RecvByteBufferAllocator. #1850
Conversation
@swift-nio-bot test perf please
Nice change; one question but looks good otherwise.
precondition(initial > minimum, "initial: \(initial)")
precondition(maximum > initial, "maximum: \(maximum)")
Why can't `initial` be equal to `minimum`, and `maximum` be equal to `initial`?
No particular reason, but I wasn't planning to lift that constraint today.
self.maximum = maximum
// We need to round all of these numbers to a power of 2. Initial will be rounded down,
// minimum down, and maximum up.
self.minimum = minimum.previousPowerOf2()
Do we need to `min` this with the max allocation size here as well? Looks like we could end up with `minimum > initial` if `minimum` and `initial` were sufficiently large.
Yes, we do. Good catch.
While we're here, this generates lovely straight-line code:
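(For context, a minimal sketch of a round-down helper matching the `previousPowerOf2()` name used above; this is illustrative, not necessarily the PR's exact code:)

```swift
extension Int {
    // Round down to the nearest power of two. Assumes self > 0, which the
    // allocator's preconditions guarantee. The highest set bit alone
    // determines the result: no loops, no lookup tables.
    func previousPowerOf2() -> Int {
        precondition(self > 0)
        return 1 << (Int.bitWidth - 1 - self.leadingZeroBitCount)
    }
}
```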
performance report
build id: 69
timestamp: Thu May 6 13:34:12 UTC 2021
comparison: significant differences found
Thank you! Definitely an improvement, but at least one comment isn't quite right.
@@ -15,47 +15,85 @@
 import XCTest
 import NIO

-public final class AdaptiveRecvByteBufferAllocatorTest : XCTestCase {
+final class AdaptiveRecvByteBufferAllocatorTest : XCTestCase {
Suggested change:
-final class AdaptiveRecvByteBufferAllocatorTest : XCTestCase {
+final class AdaptiveRecvByteBufferAllocatorTest: XCTestCase {
        }
    } while true
}
// Here we need to be careful with 32-bit systems: if maximum is Int32.max then any shift or multiply will overflow, which
Something's not right here. You initialise `self.maximum` to always be a power of 2, so it can't be `Int32.max` because that's not a power of two. Do you mean `Int32(1) << 30` == 1 GiB?
Yeah, this comment is from an older version of this code.
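(For reference, the cap in question is the largest power of two representable in a signed 32-bit Int, since `Int32.max` itself is 2^31 − 1 and not a power of two. A small illustration; the constant name below is assumed:)

```swift
// Largest power of two that fits in a signed 32-bit integer: 2^30 == 1 GiB.
let maxAllocationCap = Int(1) << 30
assert(maxAllocationCap == 1_073_741_824)
assert(maxAllocationCap <= Int(Int32.max))  // Int32.max is 2^31 - 1, not a power of two
```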
3d9b4c2 to c0bd944 (Compare)
c0bd944 to 361272a (Compare)
Awesome, looks good to me!
Motivation:
Platforms with 32-bit integers continue to exist. On those platforms,
the way we calculate the size table for AdaptiveRecvByteBufferAllocator
will trap, as we attempt to compare an Int to UInt32.max as a loop
termination condition.
While I was amending this code, I noticed that
AdaptiveRecvByteBufferAllocator had a number of very strange behaviours,
and was arguably excessively complex. To that end, this patch
constitutes a substantial rewrite of the allocator. To understand what
it does, we need to describe the previous allocator algorithm.
The goal of AdaptiveRecvByteBufferAllocator is to attempt to dynamically
track the throughput of a TCP connection and to minimise the memory
usage required to get it to achieve maximum throughput, within
user-defined constraints. To that end, it implements a fairly simple
resizing algorithm.
At a high level, the algorithm is as follows. The allocator keeps track
of the size of the buffer it is offering the user. When the user reports
how much of that buffer they actually used, the allocator determines
whether better throughput could be achieved with larger allocation
sizes, or whether throughput would not be harmed with smaller allocation
sizes. It then adjusts accordingly.
When does the allocator believe better throughput could be achieved
with larger allocation sizes? When the allocation was entirely filled.
If a network recv() entirely fills the buffer, that means there was
likely more data to read that could not be added to the buffer. Improved
throughput would be gained by using larger buffers, as fewer system
calls would be necessary.
Similarly, the allocator believes it could shrink the buffer size
without harming throughput when the recv() uses less memory than the
next buffer size down. However, the algorithm attempts to detect the
possibility that we have "read until the end". The reason this is
relevant is that after a read completes we will often serve other work
on the loop for a while, causing data to pile up. As a result, we don't
want to shrink the buffer if the average read size would be higher, just
because one read happened to be short.
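(As an illustrative sketch of this grow/shrink policy under the power-of-two sizing this patch adopts; the type and member names here are hypothetical, not NIO's actual code:)

```swift
// Hypothetical sketch of the adaptive policy described above.
struct AdaptiveSizeSketch {
    let minimum: Int           // power-of-two lower bound
    let maximum: Int           // power-of-two upper bound
    private(set) var nextSize: Int
    private var sawOneShortRead = false

    init(minimum: Int, initial: Int, maximum: Int) {
        self.minimum = minimum
        self.maximum = maximum
        self.nextSize = initial
    }

    // Called with how many bytes the last recv() actually produced.
    mutating func record(bytesRead: Int) {
        if bytesRead >= self.nextSize {
            // Filled the buffer: more data probably remained, so grow.
            self.nextSize = min(self.nextSize << 1, self.maximum)
            self.sawOneShortRead = false
        } else if bytesRead < self.nextSize >> 1 {
            // Read would have fit in the next size down. Require two
            // *consecutive* short reads before shrinking, to tolerate a
            // single "read to the end".
            if self.sawOneShortRead {
                self.nextSize = max(self.nextSize >> 1, self.minimum)
                self.sawOneShortRead = false
            } else {
                self.sawOneShortRead = true
            }
        } else {
            self.sawOneShortRead = false
        }
    }
}
```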
The original implementation's algorithm was based on a bucketed
scheme. For allocation sizes below 512 bytes, the allocation buckets
moved 16 bytes at a time. In principle this gave the allocator 16 byte
granularity. For allocation sizes above 512 bytes, the allocation
buckets would double: 512, 1024, 2048, 4096, and so on, up to
UInt32.max. This, incidentally, was the source of the crash on 32-bit
systems.
Unfortunately, this scheme had a few failings.
Firstly, the implementation complexity was fairly high. The bounds
provided by the user needed to be turned into bucket indices,
necessitating an awkward and complex binary search of the bucket array.
The bucket array itself needed to be generated, forcing a dispatch_once
to guard it and dirtying memory.
Secondly, the scheme was weighted very heavily toward growing memory and
not giving it back. Whenever the buffer was filled and a new size up
wanted to be chosen, the allocator would jump 4 size buckets. As the
default initial size was 2048 bytes, and the default maximum was 65kB,
the first read of 2048 bytes would cause the allocator to immediately
jump up to allocating 32kB of memory for the next read: a huge leap! It
would then only release memory after two short reads, at which point it
would drop back only 1 bucket, meaning that there was a very aggressive
sawtooth pattern favouring higher memory usage. This pattern favours
benchmarks, where high-throughput localhost connections will naturally
want larger buffer sizes, but it is unnecessarily aggressive for
real-world networks.
Thirdly, the scheme had a bug that would mean it did not require two
consecutive short reads to shrink the buffer, just two short reads
between an increase. This means that it had a tendency not to stabilise:
two "read to the end"s in two different event loop ticks would cause
the buffer to shrink, even if there had been 10 almost-full-reads
in between.
Fourthly, the system would occasionally report that the Channel should
reallocate the buffer because it was larger, when in fact it wasn't:
we'd hit the max, and were never going to get larger. This forced
high-throughput Channels to excessively allocate, albeit only in systems
that customized this allocator (and basically no-one does).
Fifthly, the bucketing system was defeated by ByteBuffer. While having
16-byte granularity was nice in principle, ByteBuffer only allows
power-of-two allocation sizes. This meant that, at the low end, there
were several wasted buckets that were effectively identical. Between 256
bytes and 512 bytes there were 15 redundant buckets (272, 288, …, 496)!
This last point is the main reason to justify a rewrite. The complexity
of the scheme was in principle justified by having fine, granular
control of allocation sizes. Given that those don't exist, there is no
reason not to simply resort to power-of-two sizing. Once we do that, we
can replace the complex bucketing system by simple shift-and-multiply
math. The effect of this is to produce smaller code (we save about 30%
of the instructions, even as we add some extra safety preconditions)
with no additional branching, and no need to load from a random table in
memory. This avoids the need to keep that table in cache, reducing cache
pressure in the hot read loop. Additionally, we can drop the index into
the table, which lets us save some per-Channel memory, further reducing
cache pressure.
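(A short illustration of that math; `nextPowerOf2()` is an assumed name mirroring the round-down `previousPowerOf2()` sketched earlier:)

```swift
extension Int {
    // Round up to the nearest power of two. Sketch only: assumes self > 0
    // and small enough that the result does not overflow.
    func nextPowerOf2() -> Int {
        precondition(self > 0)
        if self & (self - 1) == 0 { return self }  // already a power of two
        return 1 << (Int.bitWidth - self.leadingZeroBitCount)
    }
}

// Per the constructor comment quoted earlier: minimum and initial round
// down, maximum rounds up. For example:
// 300.previousPowerOf2() == 256
// 70_000.nextPowerOf2() == 131_072
```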
Additionally, constructing one of these is now vastly cheaper. That
matters less (it's not really hot-code), but it does improve Channel
creation time somewhat.
Modifications:
- Remove the size table.
- Round all values to powers of two.
- Implement new "previous power of two" function.
- Cap the allocation size at the largest power of 2 representable in a
  32-bit Int.
- Add more tests.
Result:
The result is less code, simpler code, and faster code. No trapping on
32-bit integer platforms.
Resolves #1848