
Implement StreamMap and replace Dictionary #258

Merged: 2 commits, Nov 24, 2020
Conversation

Lukasa
Contributor

@Lukasa Lukasa commented Nov 23, 2020

Motivation:

When working with HTTP/2 we frequently have a need to store "per-stream"
state. This is state that is independently reproduced once per stream.
We often need to look up this state by stream ID, and so we are
motivated to store this state as efficiently as possible.

Naively, Dictionary seems like a useful data type for this. However,
Dictionaries are not free. The cost of hashing in order to index into a
Dictionary can be substantial: the hashing function itself is branchy
and complex, and the dictionary access pattern is very unfriendly to the
branch predictor and the memory caches.

We can take advantage of the fact that if you split streams into client
and server namespaces, stream IDs are strictly ordered. That is, streams
are created in order of increasing stream ID. While streams are
retired out of order, at no point will a CircularBuffer of stream IDs
ever be unordered.

This unlocks a very powerful data structure for us: the ordered "Array"
(in our case actually CircularBuffer). An ordered Array supports two
powerful lookup tools. Firstly, it supports the mighty binary search.
This means the absolute worst-case searching performance for an ordered
Array is log(n) in the size of the Array. Finding a stream by stream ID
in an Array of one million streams would take around 20 steps: extremely
cheap.

For smaller sizes, however, we unlock the one true searching function:
linear scan. On modern CPUs, performing simple computations (such as
checking if one number equals another) is orders of magnitude faster
than looking data up in memory. More importantly, modern CPUs can often
effectively compute multiple lookups at once thanks to their
out-of-order superscalar pipelines. As a result, on any modern CPU
(including all mobile CPUs) a linear search outperforms almost any other
searching algorithm for surprisingly large Array sizes!

Given that stream IDs are intrinsically sorted, and thus all
per-stream-data is intrinsically sorted by stream ID, we have the
ability to use either of the above strategies. More importantly for us,
we have the ability to be adaptive to those strategies, and swap
between them as needed.

Thus we deploy the StreamMap. This data structure stores per-stream data
in a pair of circular buffers, keyed by stream ID. For smallish numbers
of streams we just search these buffers linearly. For larger numbers, we
flip over to binary search.
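The hybrid strategy above can be sketched as follows. This is a simplified illustration, not the actual swift-nio-http2 implementation: the names (`PerStreamData`, `linearScanLimit`, `findIndex`) are invented for the example, and a plain `Array` stands in for `CircularBuffer`.

```swift
// Per-stream state, kept sorted by stream ID (streams are created in
// increasing ID order, so the buffer is always sorted).
struct PerStreamData {
    var streamID: Int
    var payload: String
}

/// Below this element count we scan linearly; above it we binary
/// search. (The PR uses a cutoff of roughly 200.)
let linearScanLimit = 200

func findIndex(of streamID: Int, in elements: [PerStreamData]) -> Int? {
    if elements.count < linearScanLimit {
        // Linear scan: cache- and branch-predictor-friendly, and very
        // fast for small element counts on modern CPUs.
        return elements.firstIndex { $0.streamID == streamID }
    } else {
        // Binary search: worst case log(n) comparisons. Valid only
        // because the elements are always sorted by stream ID.
        var low = 0
        var high = elements.count
        while low < high {
            let mid = low + (high - low) / 2
            if elements[mid].streamID == streamID {
                return mid
            } else if elements[mid].streamID < streamID {
                low = mid + 1
            } else {
                high = mid
            }
        }
        return nil
    }
}
```

With a cutoff in the region of 200 elements, small maps stay on the cache-friendly linear path while large maps keep the log(n) worst-case guarantee.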

Why circular buffers? Because we avoid a compaction problem. While
stream IDs may in general be removed in arbitrary order, the most
likely streams to end at any given time are the oldest and the
youngest. Circular buffers make removing streams at either end very
cheap, as compaction can be done easily. This, along with avoiding new
allocations, makes circular buffers a perfect data structure for this
strategy.

Modifications:

  • Implemented StreamMap
  • Replaced Dictionary

Results:

Meaningful performance gains on all benchmarks. 1k requests with 1
concurrent stream sees about a 0.1% performance improvement. 1k requests
with 100 concurrent streams sees more like a 15% performance
improvement.

@@ -125,7 +112,6 @@ struct ConnectionStreamState {
/// - modifier: A block that will be invoked to modify the stream state, if present.
/// - throws: Any errors thrown from the creator.
/// - returns: The result of the state modification, as well as any state change that occurred to the stream.
-@inline(__always)
Contributor Author

We have to remove this because otherwise the 5.0 compiler crashes. 🤷 We don't much care about perf on 5.0 anymore; it's not the best target, so we'll accept this change. Newer compilers didn't need this anyway.

@Lukasa force-pushed the cb-stream-map branch 2 times, most recently from 0be41e1 to 9c0dc7f on November 23, 2020 at 18:51
@Lukasa
Contributor Author

Lukasa commented Nov 23, 2020

Note the huge reduction in allocations: 13 allocations out of 73 removed in some benchmarks, an 18% reduction in allocations.

Contributor

@PeterAdams-A PeterAdams-A left a comment


Generally looks good.

@@ -271,13 +256,13 @@ struct ConnectionStreamState {
mutating func dropAllStreamsWithIDHigherThan(_ streamID: HTTP2StreamID,
droppedLocally: Bool,
initiatedBy initiator: HTTP2ConnectionStateMachine.ConnectionRole) -> [HTTP2StreamID]? {
-        let idsToDrop = self.activeStreams.keys.filter { $0.mayBeInitiatedBy(initiator) && $0 > streamID }
+        let idsToDrop = self.activeStreams.elements(initiatedBy: initiator).drop(while: { $0.streamID <= streamID }).map { $0.streamID }
Contributor

Shame you couldn't get an optimal leap to start in here. eg Binary search or scan depending on size.

Contributor Author

Yeah, I can address this in a separate patch. I want to add a benchmark that actually hits this code-path, because right now we can't measure any improvement or change here.

for number in 1..<1000 {
    let contains = map.contains(streamID: HTTP2StreamID(number))
    if number % 3 == 1 {
        XCTAssertTrue(contains)
Contributor

It doesn't really matter, but I'd have been tempted to use XCTAssertEqual with the two bools to save a few lines.

}
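For illustration, the reviewer's suggestion might look like the following self-contained sketch, with a plain sorted `Array` and `assert` standing in for the real `StreamMap` and XCTest's `XCTAssertEqual`:

```swift
// Stream IDs 1, 4, 7, ... mimic the test's "number % 3 == 1" shape.
let ids = (1..<1000).filter { $0 % 3 == 1 }

for number in 1..<1000 {
    let contains = ids.contains(number)
    // One equality check replaces the if/else with separate
    // XCTAssertTrue/XCTAssertFalse branches.
    assert(contains == (number % 3 == 1))
}
```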

// Binary search is somewhat complex code compared to a linear scan, so we don't want to inline this code if we can avoid it.
@inline(never)
Contributor

I would have just let the compiler do whatever it wants until we discover a problem.

Contributor Author

In general I agree. Here, because we're explicitly writing fast-path/slow-path code, I think the choice is justifiable. If we're going to do a binary search we know we have at least 200 elements to search, and so the overhead of jumping to the new method (potentially missing cache) is acceptable.

For the linear scan we may have zero or a small number of data objects. In those cases, it'd be beneficial if we inlined the relatively small amount of code necessary to run the linear search into the caller, improving locality.
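That fast-path/slow-path split could be sketched as below. The names are illustrative only (`TinyMap` is not the real type, and the real code searches a `CircularBuffer` rather than an `Array`); the point is the pairing of an always-inlined small fast path with a never-inlined larger slow path.

```swift
struct TinyMap {
    // Always kept sorted, mirroring stream-ID ordering.
    var sortedIDs: [Int] = []

    // Fast path: a tiny body that benefits from being inlined into the
    // caller, improving locality for small maps.
    @inline(__always)
    func containsLinearly(_ id: Int) -> Bool {
        // Sorted input lets us stop at the first element >= id.
        for element in sortedIDs where element >= id {
            return element == id
        }
        return false
    }

    // Slow path: only reached with a large element count, so the call
    // overhead is amortised and the caller's code stays small.
    @inline(never)
    func containsByBinarySearch(_ id: Int) -> Bool {
        // Classic lower-bound search over the sorted array.
        var low = 0
        var high = sortedIDs.count
        while low < high {
            let mid = low + (high - low) / 2
            if sortedIDs[mid] < id {
                low = mid + 1
            } else {
                high = mid
            }
        }
        return low < sortedIDs.count && sortedIDs[low] == id
    }
}
```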

@Lukasa
Contributor Author

Lukasa commented Nov 24, 2020

@swift-nio-bot test perf please

@swift-server-bot

performance report

build id: 4

timestamp: Tue Nov 24 09:25:37 UTC 2020

results

| name | min | max | mean | std |
| --- | --- | --- | --- | --- |
| 1_conn_10k_reqs | 2.1609249114990234 | 2.1683980226516724 | 2.165195906162262 | 0.002166591213947045 |
| encode_100k_header_blocks_indexable | 0.33807992935180664 | 0.3480340242385864 | 0.3396589994430542 | 0.0029604356327186093 |
| encode_100k_header_blocks_nonindexable | 0.28553497791290283 | 0.29137396812438965 | 0.2870213031768799 | 0.0019812307135574914 |
| encode_100k_header_blocks_neverIndexed | 0.29058992862701416 | 0.2935370206832886 | 0.2912513017654419 | 0.0008888522555732765 |
| decode_100k_header_blocks_indexable | 0.20630097389221191 | 0.206853985786438 | 0.20660439729690552 | 0.00019352288409948322 |
| decode_100k_header_blocks_nonindexable | 0.27768003940582275 | 0.278190016746521 | 0.27790811061859133 | 0.00021482515670429315 |
| decode_100k_header_blocks_neverIndexed | 0.2791029214859009 | 0.2796449661254883 | 0.2793076992034912 | 0.00022617209891339173 |
| huffman_encode_basic | 0.311972975730896 | 0.31306707859039307 | 0.3126974940299988 | 0.0003230447210063599 |
| huffman_encode_complex | 0.24950003623962402 | 0.2502020597457886 | 0.24975290298461914 | 0.00023678273908477079 |
| huffman_decode_basic | 0.06633400917053223 | 0.06687092781066895 | 0.06656659841537475 | 0.0002254017622956362 |
| huffman_decode_complex | 0.0512620210647583 | 0.051841020584106445 | 0.05145429372787476 | 0.00024408797016918396 |
| server_only_10k_requests_1_concurrent | 0.5370550155639648 | 0.5375570058822632 | 0.5372635960578919 | 0.00016729266370120124 |
| server_only_10k_requests_100_concurrent | 0.47634899616241455 | 0.47715699672698975 | 0.47675328254699706 | 0.00028971541440381125 |

comparison

| name | current | previous | winner | diff |
| --- | --- | --- | --- | --- |
| 1_conn_10k_reqs | 2.165195906162262 | 2.221388208866119 | current | -2% |
| encode_100k_header_blocks_indexable | 0.3396589994430542 | 0.3241667032241821 | previous | 4% |
| encode_100k_header_blocks_nonindexable | 0.2870213031768799 | 0.290792191028595 | current | -1% |
| encode_100k_header_blocks_neverIndexed | 0.2912513017654419 | 0.28608070611953734 | previous | 1% |
| decode_100k_header_blocks_indexable | 0.20660439729690552 | 0.20769050121307372 | previous | 0% |
| decode_100k_header_blocks_nonindexable | 0.27790811061859133 | 0.28266890048980714 | current | -1% |
| decode_100k_header_blocks_neverIndexed | 0.2793076992034912 | 0.28234760761260985 | current | -1% |
| huffman_encode_basic | 0.3126974940299988 | 0.3034381031990051 | previous | 2% |
| huffman_encode_complex | 0.24975290298461914 | 0.2507891058921814 | current | 0% |
| huffman_decode_basic | 0.06656659841537475 | 0.0703089952468872 | current | -5% |
| huffman_decode_complex | 0.05145429372787476 | 0.05351079702377319 | current | -3% |
| server_only_10k_requests_1_concurrent | 0.5372635960578919 | n/a | n/a | n/a |
| server_only_10k_requests_100_concurrent | 0.47675328254699706 | n/a | n/a | n/a |

significant differences found

@Lukasa
Contributor Author

Lukasa commented Nov 24, 2020

Hmm, it's weird that we don't have baseline perf numbers for the newer benchmarks.

Contributor

@glbrntt glbrntt left a comment


A nice win! I left a couple of comments inline.

/// are cache-friendly and branch-predictor-friendly, which means it can in some cases be cheaper to do a linear scan of
/// 100 items than to do a binary search or hash table lookup.
///
/// Our strategy here is therefore hybrid. Up to 200 streams we will do linear searches to find our stream. Beyond that,
Contributor

How did you come up with 200? (It seems completely reasonable, I'm just curious)

Contributor Author

Ballpark benchmarking. It's hard to do it confidently but it's ballpark in the right place.

/// Creates an "empty" stream map. This should be used only to create static singletons
/// whose purpose is to be swapped to avoid CoWs. Otherwise use regular init().
static func empty() -> StreamMap<Element> {
let sortaEmptyCircularBuffer = CircularBuffer<Element>()
Contributor

This init quietly uses 16 as the initial capacity, might be worth using initialCapacity: 0 here instead.

Contributor Author

There is no actually empty circular buffer: it always has a size of at least one. So I figured it didn't really matter, as this is a singleton anyway.

Comment on lines 142 to 144
/// - modifier: A block that will modify the contained value in the
/// optional, if there is one present.
/// - returns: The return value of the block or `nil` if the optional was `nil`.
Contributor

This documentation is a little off here

Comment on lines 232 to 234
/// - modifier: A block that will modify the contained value in the
/// optional, if there is one present.
/// - returns: The return value of the block or `nil` if the optional was `nil`.
Contributor

Documentation also a little off here

@glbrntt
Contributor

glbrntt commented Nov 24, 2020

FYI I just tested this change using the gRPC unary benchmark (which uses 100 streams concurrently), it yielded a 4% improvement in QPS.

Co-authored-by: George Barnett <gbarnett@apple.com>
@Lukasa Lukasa merged commit a08dce8 into apple:main Nov 24, 2020
@Lukasa Lukasa deleted the cb-stream-map branch November 24, 2020 13:00
Labels: semver/patch (No public API change.)