
Implement StreamMap and replace Dictionary #258

Merged: 2 commits, Nov 24, 2020
Conversation

Lukasa
Contributor

@Lukasa Lukasa commented Nov 23, 2020

Motivation:

When working with HTTP/2 we frequently have a need to store "per-stream"
state. This is state that is independently reproduced once per stream.
We often need to look up this state by stream ID, and so we are
motivated to store this state as efficiently as possible.

Naively, Dictionary seems like a useful data type for this. However,
Dictionaries are not free. The cost of hashing in order to index into a
Dictionary can be substantial: the hashing function itself is branchy
and complex, and the dictionary access pattern is very unfriendly to the
branch predictor and the memory caches.

We can take advantage of the fact that if you split streams into client
and server namespaces, stream IDs are strictly ordered. That is, streams
are created in order of increasing stream ID. While streams are
retired out of order, at no point will a CircularBuffer of stream IDs
ever be unordered.

This unlocks a very powerful data structure for us: the ordered "Array"
(in our case actually CircularBuffer). An ordered Array supports two
powerful lookup tools. Firstly, it supports the mighty binary search.
This means the absolute worst-case searching performance for an ordered
Array is log(n) in the size of the Array. Finding a stream by stream ID
in an Array of one million streams would take around 20 steps: extremely
cheap.

For smaller sizes, however, we unlock the one true searching function:
linear scan. On modern CPUs, performing simple computations (such as
checking if one number equals another) is orders of magnitude faster
than looking data up in memory. More importantly, modern CPUs can often
effectively compute multiple lookups at once thanks to their
out-of-order superscalar pipelines. As a result, on any modern CPU
(including all mobile CPUs) a linear search outperforms almost any other
searching algorithm for surprisingly large Array sizes!

Given that stream IDs are intrinsically sorted, and thus all
per-stream-data is intrinsically sorted by stream ID, we have the
ability to use either of the above strategies. More importantly for us,
we have the ability to be adaptive to those strategies, and swap
between them as needed.

Thus we deploy the StreamMap. This data structure stores per-stream data
in a pair of circular buffers, keyed by stream ID. For smallish numbers
of streams we just search these buffers linearly. For larger numbers, we
flip over to binary search.
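The hybrid strategy above can be sketched as follows. This is a simplified illustration, not the actual swift-nio-http2 implementation: the names (`PerStreamData`, `linearScanLimit`, `findIndex`) are invented for the example, and a plain `Array` stands in for `CircularBuffer`.

```swift
// Per-stream state, kept sorted by stream ID (streams are created in
// increasing ID order, so the buffer is always sorted).
struct PerStreamData {
    var streamID: Int
    var payload: String
}

/// Below this element count we scan linearly; above it we binary
/// search. (The PR uses a cutoff of roughly 200.)
let linearScanLimit = 200

func findIndex(of streamID: Int, in elements: [PerStreamData]) -> Int? {
    if elements.count < linearScanLimit {
        // Linear scan: cache- and branch-predictor-friendly, and very
        // fast for small element counts on modern CPUs.
        return elements.firstIndex { $0.streamID == streamID }
    } else {
        // Binary search: worst case log(n) comparisons. Valid only
        // because the elements are always sorted by stream ID.
        var low = 0
        var high = elements.count
        while low < high {
            let mid = low + (high - low) / 2
            if elements[mid].streamID == streamID {
                return mid
            } else if elements[mid].streamID < streamID {
                low = mid + 1
            } else {
                high = mid
            }
        }
        return nil
    }
}
```

With a cutoff in the region of 200 elements, small maps stay on the cache-friendly linear path while large maps keep the log(n) worst-case guarantee.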

Why circular buffers? Because we avoid a compaction problem. While
stream IDs may in general be removed in arbitrary order, the most
likely streams to end at any given time are the oldest and the
youngest. Circular buffers make removing streams at either end very
cheap, as compaction can be done easily. This, along with avoiding new
allocations, makes circular buffers a perfect data structure for this
strategy.

Modifications:

  • Implemented StreamMap
  • Replaced Dictionary

Results:

Meaningful performance gains on all benchmarks. 1k requests with 1
concurrent stream sees about a 0.1% performance improvement. 1k requests
with 100 concurrent streams sees more like a 15% performance
improvement.

@@ -125,7 +112,6 @@ struct ConnectionStreamState {
/// - modifier: A block that will be invoked to modify the stream state, if present.
/// - throws: Any errors thrown from the creator.
/// - returns: The result of the state modification, as well as any state change that occurred to the stream.
-@inline(__always)
Contributor Author

We have to remove this because otherwise the 5.0 compiler crashes. 🤷 We don't much care about perf on 5.0 anymore; it's not the best target, so we'll accept this change. Newer compilers didn't need this anyway.

@Lukasa force-pushed the cb-stream-map branch 2 times, most recently from 0be41e1 to 9c0dc7f on November 23, 2020 at 18:51
@Lukasa
Contributor Author

Lukasa commented Nov 23, 2020

Note the huge reduction in allocations: 13 allocations out of 73 removed in some benchmarks, an 18% reduction in allocations.

Contributor

@PeterAdams-A PeterAdams-A left a comment


Generally looks good.

@@ -271,13 +256,13 @@ struct ConnectionStreamState {
mutating func dropAllStreamsWithIDHigherThan(_ streamID: HTTP2StreamID,
droppedLocally: Bool,
initiatedBy initiator: HTTP2ConnectionStateMachine.ConnectionRole) -> [HTTP2StreamID]? {
-        let idsToDrop = self.activeStreams.keys.filter { $0.mayBeInitiatedBy(initiator) && $0 > streamID }
+        let idsToDrop = self.activeStreams.elements(initiatedBy: initiator).drop(while: { $0.streamID <= streamID }).map { $0.streamID }
Contributor

Shame you couldn't get an optimal leap to start in here. eg Binary search or scan depending on size.

Contributor Author

Yeah, I can address this in a separate patch. I want to add a benchmark that actually hits this code-path, because right now we can't measure any improvement or change here.

for number in 1..<1000 {
    let contains = map.contains(streamID: HTTP2StreamID(number))
    if number % 3 == 1 {
        XCTAssertTrue(contains)
Contributor

It doesn't really matter, but I'd have been tempted to use XCTAssertEqual with the two bools to save a few lines.

}
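For illustration, the reviewer's suggestion might look like the following self-contained sketch, with a plain sorted `Array` and `assert` standing in for the real `StreamMap` and XCTest's `XCTAssertEqual`:

```swift
// Stream IDs 1, 4, 7, ... mimic the test's "number % 3 == 1" shape.
let ids = (1..<1000).filter { $0 % 3 == 1 }

for number in 1..<1000 {
    let contains = ids.contains(number)
    // One equality check replaces the if/else with separate
    // XCTAssertTrue/XCTAssertFalse branches.
    assert(contains == (number % 3 == 1))
}
```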

// Binary search is somewhat complex code compared to a linear scan, so we don't want to inline this code if we can avoid it.
@inline(never)
Contributor

I would have just let the compiler do whatever it wants until we discover a problem.

Contributor Author

In general I agree. Here, because we're explicitly writing fast-path/slow-path code, I think the choice is justifiable. If we're going to do a binary search we know we have at least 200 elements to search, and so the overhead of jumping to the new method (potentially missing cache) is acceptable.

For the linear scan we may have zero or a small number of data objects. In those cases, it'd be beneficial if we inlined the relatively small amount of code necessary to run the linear search into the caller, improving locality.
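That fast-path/slow-path split could be sketched as below. The names are illustrative only (`TinyMap` is not the real type, and the real code searches a `CircularBuffer` rather than an `Array`); the point is the pairing of an always-inlined small fast path with a never-inlined larger slow path.

```swift
struct TinyMap {
    // Always kept sorted, mirroring stream-ID ordering.
    var sortedIDs: [Int] = []

    // Fast path: a tiny body that benefits from being inlined into the
    // caller, improving locality for small maps.
    @inline(__always)
    func containsLinearly(_ id: Int) -> Bool {
        // Sorted input lets us stop at the first element >= id.
        for element in sortedIDs where element >= id {
            return element == id
        }
        return false
    }

    // Slow path: only reached with a large element count, so the call
    // overhead is amortised and the caller's code stays small.
    @inline(never)
    func containsByBinarySearch(_ id: Int) -> Bool {
        // Classic lower-bound search over the sorted array.
        var low = 0
        var high = sortedIDs.count
        while low < high {
            let mid = low + (high - low) / 2
            if sortedIDs[mid] < id {
                low = mid + 1
            } else {
                high = mid
            }
        }
        return low < sortedIDs.count && sortedIDs[low] == id
    }
}
```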

@Lukasa
Contributor Author

Lukasa commented Nov 24, 2020

@swift-nio-bot test perf please

@swift-server-bot

performance report

build id: 4

timestamp: Tue Nov 24 09:25:37 UTC 2020

results

| name | min | max | mean | std |
| --- | --- | --- | --- | --- |
| 1_conn_10k_reqs | 2.1609249114990234 | 2.1683980226516724 | 2.165195906162262 | 0.002166591213947045 |
| encode_100k_header_blocks_indexable | 0.33807992935180664 | 0.3480340242385864 | 0.3396589994430542 | 0.0029604356327186093 |
| encode_100k_header_blocks_nonindexable | 0.28553497791290283 | 0.29137396812438965 | 0.2870213031768799 | 0.0019812307135574914 |
| encode_100k_header_blocks_neverIndexed | 0.29058992862701416 | 0.2935370206832886 | 0.2912513017654419 | 0.0008888522555732765 |
| decode_100k_header_blocks_indexable | 0.20630097389221191 | 0.206853985786438 | 0.20660439729690552 | 0.00019352288409948322 |
| decode_100k_header_blocks_nonindexable | 0.27768003940582275 | 0.278190016746521 | 0.27790811061859133 | 0.00021482515670429315 |
| decode_100k_header_blocks_neverIndexed | 0.2791029214859009 | 0.2796449661254883 | 0.2793076992034912 | 0.00022617209891339173 |
| huffman_encode_basic | 0.311972975730896 | 0.31306707859039307 | 0.3126974940299988 | 0.0003230447210063599 |
| huffman_encode_complex | 0.24950003623962402 | 0.2502020597457886 | 0.24975290298461914 | 0.00023678273908477079 |
| huffman_decode_basic | 0.06633400917053223 | 0.06687092781066895 | 0.06656659841537475 | 0.0002254017622956362 |
| huffman_decode_complex | 0.0512620210647583 | 0.051841020584106445 | 0.05145429372787476 | 0.00024408797016918396 |
| server_only_10k_requests_1_concurrent | 0.5370550155639648 | 0.5375570058822632 | 0.5372635960578919 | 0.00016729266370120124 |
| server_only_10k_requests_100_concurrent | 0.47634899616241455 | 0.47715699672698975 | 0.47675328254699706 | 0.00028971541440381125 |

comparison

| name | current | previous | winner | diff |
| --- | --- | --- | --- | --- |
| 1_conn_10k_reqs | 2.165195906162262 | 2.221388208866119 | current | -2% |
| encode_100k_header_blocks_indexable | 0.3396589994430542 | 0.3241667032241821 | previous | 4% |
| encode_100k_header_blocks_nonindexable | 0.2870213031768799 | 0.290792191028595 | current | -1% |
| encode_100k_header_blocks_neverIndexed | 0.2912513017654419 | 0.28608070611953734 | previous | 1% |
| decode_100k_header_blocks_indexable | 0.20660439729690552 | 0.20769050121307372 | previous | 0% |
| decode_100k_header_blocks_nonindexable | 0.27790811061859133 | 0.28266890048980714 | current | -1% |
| decode_100k_header_blocks_neverIndexed | 0.2793076992034912 | 0.28234760761260985 | current | -1% |
| huffman_encode_basic | 0.3126974940299988 | 0.3034381031990051 | previous | 2% |
| huffman_encode_complex | 0.24975290298461914 | 0.2507891058921814 | current | 0% |
| huffman_decode_basic | 0.06656659841537475 | 0.0703089952468872 | current | -5% |
| huffman_decode_complex | 0.05145429372787476 | 0.05351079702377319 | current | -3% |
| server_only_10k_requests_1_concurrent | 0.5372635960578919 | n/a | n/a | n/a |
| server_only_10k_requests_100_concurrent | 0.47675328254699706 | n/a | n/a | n/a |

significant differences found

@Lukasa
Contributor Author

Lukasa commented Nov 24, 2020

Hmm, it's weird that we don't have baseline perf numbers for the newer benchmarks.

Contributor

@glbrntt glbrntt left a comment


A nice win! I left a couple of comments inline.

/// are cache-friendly and branch-predictor-friendly, which means it can in some cases be cheaper to do a linear scan of
/// 100 items than to do a binary search or hash table lookup.
///
/// Our strategy here is therefore hybrid. Up to 200 streams we will do linear searches to find our stream. Beyond that,
Contributor

How did you come up with 200? (It seems completely reasonable, I'm just curious)

Contributor Author

Ballpark benchmarking. It's hard to do it confidently but it's ballpark in the right place.

/// Creates an "empty" stream map. This should be used only to create static singletons
/// whose purpose is to be swapped to avoid CoWs. Otherwise use regular init().
static func empty() -> StreamMap<Element> {
let sortaEmptyCircularBuffer = CircularBuffer<Element>()
Contributor

This init quietly uses 16 as the initial capacity, might be worth using initialCapacity: 0 here instead.

Contributor Author

There is no actually empty circular buffer: it always has a size of at least one. So I figured it didn't really matter, as this is a singleton anyway.

Comment on lines 142 to 144
/// - modifier: A block that will modify the contained value in the
/// optional, if there is one present.
/// - returns: The return value of the block or `nil` if the optional was `nil`.
Contributor

This documentation is a little off here

Comment on lines 232 to 234
/// - modifier: A block that will modify the contained value in the
/// optional, if there is one present.
/// - returns: The return value of the block or `nil` if the optional was `nil`.
Contributor

Documentation also a little off here

@glbrntt
Contributor

glbrntt commented Nov 24, 2020

FYI I just tested this change using the gRPC unary benchmark (which uses 100 streams concurrently), it yielded a 4% improvement in QPS.

Co-authored-by: George Barnett <gbarnett@apple.com>
@Lukasa Lukasa merged commit a08dce8 into apple:main Nov 24, 2020
@Lukasa Lukasa deleted the cb-stream-map branch November 24, 2020 13:00
Labels: semver/patch (No public API change.)