Implement StreamMap and replace Dictionary #258
Conversation
Force-pushed from 4978b29 to f3c8909.
Force-pushed from f3c8909 to a6bc0d3.
@@ -125,7 +112,6 @@ struct ConnectionStreamState {
     /// - modifier: A block that will be invoked to modify the stream state, if present.
     /// - throws: Any errors thrown from the creator.
     /// - returns: The result of the state modification, as well as any state change that occurred to the stream.
-    @inline(__always)
We have to remove this because otherwise the 5.0 compiler crashes. 🤷 We don't much care about perf on 5.0 anymore, as it's not the best target, so we'll accept this change. Newer compilers didn't need this anyway.
Force-pushed from 0be41e1 to 9c0dc7f.
Note the huge reduction in allocations: 13 of 73 allocations removed in some benchmarks, an 18% reduction.
Generally looks good.
@@ -271,13 +256,13 @@ struct ConnectionStreamState {
     mutating func dropAllStreamsWithIDHigherThan(_ streamID: HTTP2StreamID,
                                                  droppedLocally: Bool,
                                                  initiatedBy initiator: HTTP2ConnectionStateMachine.ConnectionRole) -> [HTTP2StreamID]? {
-        let idsToDrop = self.activeStreams.keys.filter { $0.mayBeInitiatedBy(initiator) && $0 > streamID }
+        let idsToDrop = self.activeStreams.elements(initiatedBy: initiator).drop(while: { $0.streamID <= streamID }).map { $0.streamID }
Shame you couldn't get an optimal leap to start in here, e.g. binary search or scan depending on size.
Yeah, I can address this in a separate patch. I want to add a benchmark that actually hits this code-path, because right now we can't measure any improvement or change here.
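The "optimal leap" suggested here would amount to binary-searching for the partition point instead of scanning with `drop(while:)`. A minimal sketch over plain `Int`s, assuming a sorted array (`firstIndexGreaterThan` is an invented helper for illustration, not NIOHTTP2 API):

```swift
// Binary search for the index of the first element strictly greater than
// `target` in a sorted array; everything from that index onward is dropped.
func firstIndexGreaterThan(_ target: Int, in sorted: [Int]) -> Int {
    var low = 0
    var high = sorted.count
    while low < high {
        let mid = low + (high - low) / 2
        if sorted[mid] <= target {
            low = mid + 1
        } else {
            high = mid
        }
    }
    return low
}

let ids = [1, 3, 5, 7, 9, 11]  // sorted stream IDs for one namespace
let cut = firstIndexGreaterThan(5, in: ids)
let idsToDrop = Array(ids[cut...])
assert(idsToDrop == [7, 9, 11])
```

This turns the O(n) prefix scan into an O(log n) jump to the cut point, though as the comment notes, without a benchmark that hits this path the difference can't be measured.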
for number in 1..<1000 {
    let contains = map.contains(streamID: HTTP2StreamID(number))
    if number % 3 == 1 {
        XCTAssertTrue(contains)
It doesn't really matter, but I'd have been tempted to use XCTAssertEqual with the two bools to save a few lines.
}

// Binary search is somewhat complex code compared to a linear scan, so we don't want to inline this code if we can avoid it.
@inline(never)
I would have just let the compiler do whatever it wants until we discover a problem.
In general I agree. Here, because we're explicitly writing fast-path/slow-path code, I think the choice is justifiable. If we're going to do a binary search we know we have at least 200 elements to search, and so the overhead of jumping to the new method (potentially missing cache) is acceptable.
For the linear scan we may have zero or a small number of data objects. In those cases, it'd be beneficial if we inlined the relatively small amount of code necessary to run the linear search into the caller, improving locality.
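The fast-path/slow-path split described here can be sketched in isolation. This is a hedged illustration, not the actual NIOHTTP2 code: the threshold constant, function names, and the use of a plain sorted `[Int]` are all assumptions.

```swift
// Assumed cutoff, mirroring the 200-stream figure used in StreamMap.
let linearScanThreshold = 200

// Slow path: binary search over a sorted array. @inline(never) keeps the
// comparatively large body out of every call site; by the time we take
// this path we have at least `linearScanThreshold` elements, so the call
// overhead is amortized.
@inline(never)
func binarySearchIndex(of id: Int, in sorted: [Int]) -> Int? {
    var low = 0
    var high = sorted.count
    while low < high {
        let mid = low + (high - low) / 2
        if sorted[mid] == id {
            return mid
        } else if sorted[mid] < id {
            low = mid + 1
        } else {
            high = mid
        }
    }
    return nil
}

// Fast path: a linear scan small enough that the compiler can inline it
// into the caller, improving locality for small stream counts.
func index(of id: Int, in sorted: [Int]) -> Int? {
    if sorted.count <= linearScanThreshold {
        return sorted.firstIndex(of: id)
    }
    return binarySearchIndex(of: id, in: sorted)
}

let ids = Array(stride(from: 1, through: 999, by: 2))  // sorted, like client-initiated stream IDs
assert(index(of: 77, in: ids) == 38)   // 500 elements, so the binary path runs
assert(index(of: 78, in: ids) == nil)  // even IDs never appear in this namespace
```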
@swift-nio-bot test perf please

performance report
build id: 4
timestamp: Tue Nov 24 09:25:37 UTC 2020
(results and comparison tables not captured)
significant differences found

Hmm, it's weird that we don't have baseline perf numbers for the newer benchmarks.
A nice win! I left a couple of comments inline.
/// are cache-friendly and branch-predictor-friendly, which means it can in some cases be cheaper to do a linear scan of
/// 100 items than to do a binary search or hash table lookup.
///
/// Our strategy here is therefore hybrid. Up to 200 streams we will do linear searches to find our stream. Beyond that,
How did you come up with 200? (It seems completely reasonable, I'm just curious)
Ballpark benchmarking. It's hard to do it confidently but it's ballpark in the right place.
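The kind of ballpark micro-benchmark that could yield a crossover figure like 200 might look as follows. This is a methodology sketch only: the timing helper, the sizes, and the search functions are assumptions, the absolute numbers and crossover point are machine-dependent, and this is not the project's actual benchmark harness.

```swift
import Foundation

// Reference binary search so both strategies can be compared.
func binaryIndex(of id: Int, in sorted: [Int]) -> Int? {
    var low = 0
    var high = sorted.count
    while low < high {
        let mid = low + (high - low) / 2
        if sorted[mid] == id { return mid }
        if sorted[mid] < id { low = mid + 1 } else { high = mid }
    }
    return nil
}

// Crude wall-clock timer; good enough for a ballpark comparison.
func seconds(_ body: () -> Void) -> Double {
    let start = Date()
    body()
    return Date().timeIntervalSince(start)
}

for size in [50, 100, 200, 400, 800] {
    let ids = Array(0..<size)
    // Sanity check: both strategies must agree before being timed.
    for id in ids {
        assert(ids.firstIndex(of: id) == binaryIndex(of: id, in: ids))
    }
    let linear = seconds { for id in ids { _ = ids.firstIndex(of: id) } }
    let binary = seconds { for id in ids { _ = binaryIndex(of: id, in: ids) } }
    print("size \(size): linear \(linear)s, binary \(binary)s")
}
```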
/// Creates an "empty" stream map. This should be used only to create static singletons
/// whose purpose is to be swapped to avoid CoWs. Otherwise use regular init().
static func empty() -> StreamMap<Element> {
    let sortaEmptyCircularBuffer = CircularBuffer<Element>()
This init quietly uses 16 as the initial capacity; it might be worth using `initialCapacity: 0` here instead.
There is no actually empty circular buffer: it always has a size of at least one. So I figured it didn't really matter, as this is a singleton anyway.
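The swap-to-avoid-CoW trick that such a singleton supports can be sketched like this. It is a hedged illustration only: `Array` stands in for NIO's `CircularBuffer`, and `Box` and its method are invented names, not the real StreamMap API.

```swift
// When a struct must mutate its storage inside a closure, the closure's
// `inout` argument and `self` would otherwise both reference the storage,
// forcing a copy-on-write. Temporarily swapping in a shared empty value
// drops the extra reference, so the mutation stays in place. Because the
// empty value is a static singleton, the swap itself never allocates.
struct Box {
    static let emptyStorage: [Int] = []
    var storage: [Int] = []

    mutating func withUniquelyReferencedStorage<T>(_ body: (inout [Int]) -> T) -> T {
        // Swap in the singleton so `self` no longer holds a second
        // reference to the real storage while `body` mutates it.
        var temp = Box.emptyStorage
        swap(&self.storage, &temp)
        defer { swap(&self.storage, &temp) }  // restore on the way out
        return body(&temp)
    }
}

var box = Box()
box.withUniquelyReferencedStorage { $0.append(5) }  // appends without a CoW copy
assert(box.storage == [5])
```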
Sources/NIOHTTP2/StreamMap.swift
Outdated
/// - modifier: A block that will modify the contained value in the
///     optional, if there is one present.
/// - returns: The return value of the block or `nil` if the optional was `nil`.
This documentation is a little off here
Sources/NIOHTTP2/StreamMap.swift
Outdated
/// - modifier: A block that will modify the contained value in the
///     optional, if there is one present.
/// - returns: The return value of the block or `nil` if the optional was `nil`.
Documentation also a little off here
FYI, I just tested this change using the gRPC unary benchmark (which uses 100 streams concurrently); it yielded a 4% improvement in QPS.
Motivation:

When working with HTTP/2 we frequently have a need to store "per-stream" state. This is state that is independently reproduced once per stream. We often need to look up this state by stream ID, and so we are motivated to store this state as efficiently as possible.

Naively, Dictionary seems like a useful data type for this. However, Dictionaries are not free. The cost of hashing in order to index into a Dictionary can be substantial. This is both because the hashing function itself is branchy and complex, but also because the dictionary access is very unfriendly to the branch predictor and the memory caching systems.

We can take advantage of the fact that if you split streams into client and server namespaces, stream IDs are strictly ordered. That is, streams are created in order of increasing stream ID. While streams are _retired_ out of order, at no point will a CircularBuffer of stream IDs ever be unordered.

This unlocks a very powerful data structure for us: the ordered "Array" (in our case actually CircularBuffer). An ordered Array supports two powerful lookup tools. Firstly, it supports the mighty binary search. This means the absolute worst-case searching performance for an ordered Array is log(n) in the size of the Array. Finding a stream by stream ID in an Array of one million streams would take around 20 steps: extremely cheap.

For smaller sizes, however, we unlock the one true searching function: linear scan. On modern CPUs, performing simple computations (such as checking if one number equals another) is orders of magnitude faster than looking data up in memory. More importantly, modern CPUs can often effectively compute multiple lookups at once thanks to their out-of-order superscalar pipelines. As a result, on any modern CPU (including all mobile CPUs) a linear search outperforms almost any other searching algorithm for surprisingly large Array sizes!

Given that stream IDs are intrinsically sorted, and thus all per-stream data is intrinsically sorted by stream ID, we have the ability to use either of the above strategies. More importantly for us, we have the ability to be _adaptive_ about those strategies, and swap between them as needed.

Thus we deploy the StreamMap. This data structure stores per-stream data in a pair of circular buffers, keyed by stream ID. For smallish numbers of streams we just search these buffers linearly. For larger numbers, we flip over to binary search.

Why circular buffers? Because we avoid a compaction problem. While in general removal of stream IDs is done in arbitrary order, the most likely streams to end at any given time are the oldest and the youngest. Circular buffers make removing streams at those ends very cheap, as compaction can be done easily. This, along with avoiding new allocations, makes circular buffers a perfect data structure for this strategy.

Modifications:

- Implemented StreamMap
- Replaced Dictionary

Results:

Meaningful performance gains on all benchmarks. 1k requests with 1 concurrent stream sees about a 0.1% performance improvement. 1k requests with 100 concurrent streams sees more like a 15% performance improvement.
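The scheme above can be condensed into a toy sketch: HTTP/2 client-initiated streams carry odd IDs and server-initiated streams carry even IDs, so keeping one sorted buffer per namespace makes insertion order equal to sort order. This is illustrative only: `Array` stands in for NIO's `CircularBuffer`, `TinyStreamMap` is an invented name, and the real StreamMap stores full per-stream state, not bare IDs.

```swift
struct TinyStreamMap {
    private var clientStreams: [Int] = []  // odd IDs, ascending
    private var serverStreams: [Int] = []  // even IDs, ascending

    mutating func insert(streamID: Int) {
        // A newly created stream always has the highest ID yet seen in its
        // namespace, so appending keeps each buffer sorted.
        if streamID % 2 == 1 {
            clientStreams.append(streamID)
        } else {
            serverStreams.append(streamID)
        }
    }

    func contains(streamID: Int) -> Bool {
        // Pick the namespace by ID parity, then search just that buffer.
        // A linear scan is fine at small sizes; the real StreamMap flips
        // to binary search past its threshold.
        let buffer = streamID % 2 == 1 ? clientStreams : serverStreams
        return buffer.contains(streamID)
    }
}

var map = TinyStreamMap()
map.insert(streamID: 1)
map.insert(streamID: 2)
map.insert(streamID: 3)
assert(map.contains(streamID: 3))
assert(!map.contains(streamID: 5))
```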
Force-pushed from 9c0dc7f to d943aa7.
Co-authored-by: George Barnett <gbarnett@apple.com>