"Compiled" static header maps instead of big trie #33932
Conversation
Signed-off-by: Raven Black <ravenblack@dropbox.com>
/retest
Signed-off-by: Raven Black <ravenblack@dropbox.com>
/wait (for CI green)
/retest
this looks amazing though I don't fully understand it yet.
It turns out I had some comments from a while back that I hadn't flushed, so I'm combining those with what I found just now and flushing them.
```cpp
 * @param key the key to look up.
 */
Value find(const absl::string_view& key) const {
  if (key.size() >= table_.size() || table_[key.size()] == nullptr) {
```
can we avoid the duplicate lookup of `table_[key.size()]`? It's probably optimized away, but I am never certain.
I think for the compiler to optimize it away, it would have to see inside table_.size() to ensure it could not mutate key. That seems obvious to humans, but I don't know how smart compilers are about that; if the compiler can't see inside the table_.size() implementation (i.e. it's not inlined), it can't know.
I tried taking it to a local variable, but that has to happen after the check that key.size() isn't too large, which makes it two if-branches; from benchmarking, that appears to make both hit paths ~5% slower.
It's not like these are actual table lookups, they're just offsets, so even if it isn't optimized away it might still be faster than taking an additional copy.
why does it have to be after the first check? Can't you have the first line of the function be the assignment to a temp?
`table_[key.size()]` is an out-of-bounds crash if `key.size() >= table_.size()`.
```cpp
const size_t key_size = key.size();
if (key_size >= table_.size() || table_[key_size] == nullptr) {
  ...
```
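For completeness, the full find() with the hoisted size might look something like this. This is a minimal sketch, not the PR's actual code: `LengthTable` is an invented name, `std::string_view` stands in for `absl::string_view`, and the table-of-functions shape is assumed from the surrounding discussion.

```cpp
#include <cstddef>
#include <functional>
#include <string_view>
#include <vector>

// Sketch of the length-indexed dispatch with the suggested hoist applied.
template <class Value> class LengthTable {
public:
  using FindFn = std::function<Value(std::string_view)>;
  explicit LengthTable(std::vector<FindFn> table) : table_(std::move(table)) {}

  Value find(std::string_view key) const {
    const size_t key_size = key.size();
    // || short-circuits, so table_[key_size] is never evaluated when
    // key_size is out of range.
    if (key_size >= table_.size() || table_[key_size] == nullptr) {
      return {};
    }
    return table_[key_size](key);
  }

private:
  std::vector<FindFn> table_;
};
```

The single local means the bounds check and the per-length dispatch both read `key_size` once, with no second call into `key.size()`.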
```cpp
  using FindFn = std::function<Value(const absl::string_view&)>;

public:
  using KV = std::pair<std::string, Value>;
```
Scanning through the code I'm wondering if there's some string copying going on here that we might be able to avoid.
Can we have a string-store held in the class, and have the KV that we manipulate in various sub-vectors hold either a string_view or a string&?
I made the input a vector of string_views which eliminates the string copying during compile() except the copy taken for a leaf node (which is kept in case at some point the table wants to be initialized from a transient source rather than global singleton-constant-likes).
```cpp
 * for the string node.
 */
template <class Value> class CompiledStringMap {
  using FindFn = std::function<Value(const absl::string_view&)>;
```
standard best practice is to pass string_view by value rather than by reference. I think it's really close, but this was the decision made by smart people some time ago :)
Added a comment about references being faster for this use-case.
```cpp
std::vector<KV> next_contents;
next_contents.reserve(it - start);
std::copy(start, it, std::back_inserter(next_contents));
nodes[i - best.min] = createEqualLengthNode(next_contents);
```
this recursive lambda generation looks really cool, but I'm wondering if it's the most efficient thing we can do. As a thought experiment, I'm wondering what the code would look like if you had to write this in C. I assume it would be possible (if a little more verbose). Would it be faster?
If I were doing it in C it would be almost the same. I don't know how much overhead there is in std::function, but the C structure of this would be something like a struct containing a function pointer and a union of the captured data that each of the node types takes, and that function pointer would take a pointer to that same structure... which seems like it's probably at least about the same as what a lambda is doing.
It might be possible to optimize further with more manual constructs, but I think a 60% speedup is probably good enough, and it could have another pass later if someone wants to try to do better.
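For illustration, the C-style shape described above might look roughly like this. This is a hedged sketch, not the PR's code: `Node`, `Leaf`, `Branch`, and the `Value = int` / `0 = miss` convention are all invented for the example.

```cpp
#include <cstddef>
#include <cstring>

// A plain function pointer plus a union of per-node-kind captured data,
// standing in for std::function. All names here are illustrative.
using Value = int; // 0 stands in for "not found"

struct Node;
using NodeFindFn = Value (*)(const Node*, const char* key, size_t len);

struct Leaf {
  const char* str;
  size_t len;
  Value value;
};

struct Branch {
  size_t index;      // character position this node dispatches on
  unsigned char min; // smallest character that has a child
  const Node* const* children;
  size_t n_children; // children covers [min, min + n_children)
};

struct Node {
  NodeFindFn find; // "virtual call" through a plain function pointer
  union {
    Leaf leaf;
    Branch branch;
  } u;
};

inline Value leafFind(const Node* node, const char* key, size_t len) {
  if (len == node->u.leaf.len && std::memcmp(key, node->u.leaf.str, len) == 0) {
    return node->u.leaf.value;
  }
  return 0;
}

inline Value branchFind(const Node* node, const char* key, size_t len) {
  const Branch& b = node->u.branch;
  const unsigned char c = static_cast<unsigned char>(key[b.index]);
  if (c < b.min || c >= b.min + b.n_children) {
    return 0; // no child for this character: miss
  }
  const Node* child = b.children[c - b.min];
  return child ? child->find(child, key, len) : 0;
}
```

As the comment above notes, this is structurally close to what a lambda held in a std::function does, just with the captures spelled out by hand.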
```cpp
};
/**
 * Construct the lookup table. This is a somewhat slow multi-pass
 * operation - using this structure is not recommended unless the
```
should this note be in the class comment?
For a moment I thought you were saying that you could use this class, but not call 'compile' and instead fall back to an interpret-mode, but I don't think that's what you are saying :)
Moved.
```cpp
auto it = initial.begin();
for (size_t i = 0; i <= longest; i++) {
  auto start = it;
  while (it != initial.end() && it->first.size() == i) {
```
I am guessing we don't consider compile speed important to optimize. But this nested loop seems like it might be slower than needed.
How about this for an algo:
```cpp
absl::flat_hash_map<size_t, std::vector<std::reference_wrapper<const KV>>> size_to_values_map;
for (const KV& pair : initial) {
  size_to_values_map[pair.first.size()].push_back(std::cref(pair));
}
```
I realize this creates a data structure which doesn't match the ones you are using, but it seems like a good way to get the elements organized by size. After that I'm not sure how this would fit into your algo.
I'm having a little trouble following the algorithm also. What causes it to re-examine the entries that don't match size 'i'?
Maybe some more comments would help, or breaking your compile flow down into a few helper methods that might be more self-documenting.
It was already not really a nested loop, in that the two indexes were moving "in parallel" (i.e. it wasn't n*m, it was n+m), but I've changed it to make that a bit clearer. Though it's still not much clearer, since the std::find_if still looks like it might be an m-length scan. But it's not!
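The O(n+m) point above can be sketched as follows. This is illustrative code, not the PR's compile(): `groupByLength` is an invented helper, and it assumes the input vector is already sorted by key length, as the discussion implies.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Value = int;
using KV = std::pair<std::string, Value>;

// Because `initial` is sorted by key length, each std::find_if resumes
// exactly where the previous group ended, so the whole grouping visits
// every element once (O(n)), even though the loop looks nested.
inline std::vector<std::vector<KV>> groupByLength(const std::vector<KV>& initial) {
  std::vector<std::vector<KV>> groups;
  auto range_start = initial.begin();
  while (range_start != initial.end()) {
    // Find the first key whose length differs from the current key's;
    // everything in between has the same length.
    auto range_end = std::find_if(
        range_start, initial.end(),
        [len = range_start->first.size()](const KV& e) { return e.first.size() != len; });
    groups.emplace_back(range_start, range_end);
    range_start = range_end; // resume, never re-scan
  }
  return groups;
}
```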
/wait
Signed-off-by: Raven Black <ravenblack@dropbox.com>
…loop Signed-off-by: Raven Black <ravenblack@dropbox.com>
I'm still reading the detail on the code, but I think @adisuissa should review also.
```cpp
// Find the first key whose length differs from the current key length.
// Everything in between is keys with the same length.
auto range_end =
    std::find_if(range_start, initial.end(), [l = range_start->first.size()](const KV& e) {
```
I'm not a fan of 'l' as a variable name :) Can we have 'last'?
len. :)
Signed-off-by: Raven Black <ravenblack@dropbox.com>
this is getting a lot more readable though I admit I still don't deeply understand the algorithm.
Left a few more superficial notes.
```cpp
 * (typically nullptr) if the key was not present.
 * @param key the key to look up.
 */
Value find(absl::string_view key) const {
```
you found passing by const ref was faster in another case, but not here?
Mostly left this one for the sake of leaving the external API unchanged. It feels more reasonable to buck the style guide for internal implementation than for the API.
```cpp
struct IndexSplitInfo {
  uint8_t index, min, max, count;
} best{0, 0, 0, 0};
for (size_t i = 0; i < node_contents[0].first.size(); i++) {
```
would this be faster?
```cpp
for (size_t i = 0, n = node_contents[0].first.size(); i < n; i++) {
```
I don't think it would (it's just reading a value from an address), and we don't care about the microoptimization performance here anyway.
to make this code more understandable, consider naming:
```cpp
const size_t each_key_length = node_contents[0].first.size();
```
(or something similar)
Also consider renaming: `i` -> `key_char_idx`, `j` -> `key_idx`, `v` -> `ch`/`val`.
```cpp
if (!hits[v]) {
  hits[v] = true;
  info.count++;
  info.min = std::min(v, info.min);
```
I'm wondering if it would be faster to write this as:
```cpp
if (v < info.min) {
  info.min = v;
} else if (v > info.max) {
  info.max = v;
}
```
It feels to me like what you have is a little more readable, but it may have an extra branch, and it will always execute two writes into `info` when only 0 or 1 are needed. Not sure if the compiler would do that optimization.
We shouldn't really care about microoptimizing the speed of any function that isn't find().
sure. I was having trouble figuring out what's find() and what's not.
Hopefully that's a lot clearer now that there are no lambdas. Now only find() is find(). :)
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Thanks for working on this! Very interesting data-structure!
```cpp
 * This structure first jumps to matching length, eliminating in this
 * example case apple and pineapple.
 * Then the "best split" node is on
 * `x-prefix-banana`
 * ^
 * so the first node has 3 non-miss branches, n, b and r for that position.
 * Down that n branch, the "best split" is on
 * `x-prefix-banana`
 * ^
 * which has two branches, n or k.
```
IIUC the "best split" can be anywhere in the string. I think modifying the example to the following will clarify how this should work:
* `x-prefix-banana`
* `x-prefix-banara`
* `x-prefix-apple`
* `x-prefix-pineapple`
* `x-prefix-barana`
* `x-prefix-banaka`
Then the best split would have been on the second-to-last character.
It can't be anywhere in the string, only at an index where there's a difference. The example is specifically showing that 3 branches would be "best" over 2 branches. I don't understand what your example clarifies that the existing example doesn't.
My comment was looking at it from a "trie" perspective.
When I first read the comment, I thought that in each branch, the structure only keeps the "suffix" (after the branch point). I think that updating the example to a case where the branching happens at a "best split" that is not the first difference will reduce that confusion for future readers.
Hopefully the md file explanation covers this sufficiently (though it doesn't do the second split earlier in the string).
If you'd like me to rearrange the strings in the new example so that it does the "last" index split first and the earlier one second, I can do that, but I think the diagram having numbers in it probably makes it clear enough. (And late-first would be just as potentially confusing as early-first, and having three sequential splits would get so long as to be harder to read...)
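For readers following the thread, the "best split" selection being discussed can be sketched like this. It is an illustrative reimplementation under assumed names (`SplitInfo`, `findBestSplit`), not the PR's exact code: among same-length keys, pick the character position with the most distinct characters, so one branch eliminates as many candidates as possible.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

struct SplitInfo {
  size_t index;       // character position to branch on
  uint8_t min, max;   // smallest/largest distinct character seen there
  size_t count;       // number of distinct characters at that position
};

// keys must be non-empty and all the same length.
inline SplitInfo findBestSplit(const std::vector<std::string_view>& keys) {
  SplitInfo best{0, 0, 0, 0};
  for (size_t i = 0; i < keys[0].size(); i++) {
    std::array<bool, 256> hits{};
    SplitInfo info{i, 255, 0, 0};
    for (std::string_view key : keys) {
      const uint8_t v = static_cast<uint8_t>(key[i]);
      if (!hits[v]) { // count each distinct character once
        hits[v] = true;
        info.count++;
        info.min = std::min(v, info.min);
        info.max = std::max(v, info.max);
      }
    }
    if (info.count > best.count) { // most branches wins
      best = info;
    }
  }
  return best;
}
```

With the six `x-prefix-…` strings above, the same-length subset {banana, banara, banaka} splits best at the second-to-last character, where three distinct characters (n, r, k) appear.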
wdyt of adding a diagram showing the algorithm in action, as an md file or image or whatever?
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Done.
/retest
Thanks for adding the doc, it explains the data-structure very well!
The code LGTM. The only thing I suggest considering is whether the use of this data-structure should be behind a startup-flag (bootstrap option). I'll leave it to other maintainers to decide.
in the leaf nodes, which could potentially make it larger due to, e.g., containing
`x-envoy-` multiple times where a regular trie only has one 'branch' for that.
However, it also contains fewer nodes due to not having a node for every character,
and compared to the prior fixed-size-node 256-branch trie, each node is a lot smaller;
just a minor comment (no AI):
I think the original trie can be converted to use the same "shrunk-range" (only the needed range of characters) technique, so in terms of memory I think this data-structure is more expensive.
That said, from the CPU perf point of view, I tend to agree that this data-structure should perform better, although it is not easy for me to say which of the changes (using per-length buckets, splitting not at the prefix, condensing the trie) provides the most benefit.
Doing just shrunk-range was actually the original PR from which this one forked off as a prerequisite; it made performance a bit worse (more so than this version) for the miss case and made little difference for the hits (worse for short strings, better for long strings). I think this version is also still better memory-wise, because every pointer is 64 bits, and this trie has far fewer pointers.
Consider the case of even just one entry, `x-envoy-banana`: in this model it's the root length node and a leaf node, so it's a vector with 13 nullptrs and one pointer (because the length node doesn't do the minimizing thing), a length, and one leaf node containing a value and a string.
In a regular "shrunk" trie, this is 14 nodes each containing a default value (a pointer) and a pointer to the next node, plus a vector of chars one character long and a length - so that's 48 pointers, which is significantly larger than the "short path" trie. Even if you cut out the common "x-envoy-" prefix entirely and assume it's shared among so many headers that we can amortize it entirely, it's still 18 pointers just for the 'banana' part of the sequence, which is about the same as the "compiled" trie for the whole thing (which also theoretically shares "common prefixes"!)
Splitting at a more optimal branch point is definitely the main performance enhancer for hits. (It's similar to what gperf does.)
Yes, you are correct.
I keep forgetting that this is not a compressed-prefix-trie, but one that has per-character branching.
Ah yeah, a compressed-prefix trie (assuming this is a name for "a branch is per-character but also matches the longest string common to all its children") would probably be nearly as fast as the "branch at best position" trie, and a similar size, maybe slightly smaller, since it would also only be a few hops to completion. But it would involve more string-compares.
+alyssawilk for senior maintainer pass.
LGTM. My largest request was going to be jmarantz review but thankfully you both already have that covered :-)
actually I'd love to take one more quick pass; just been slammed. Give me till EOD today? I agree this is fantastic.
lgtm!
This is amazing work. I am looking forward to seeing how fast this is in production!
Commit Message: "Compiled" static header maps instead of big trie
Additional Description: This is slightly slower for "misses" but significantly faster for "hits". Given that the majority of headers are expected to be matches for the static table, this should be a big win. (Each faster hit pays for 30 slower misses!) This structure also uses significantly less memory than the big trie.
Performance benchmarks:
```shell
bazel run -c opt test/common/http:header_map_impl_speed_test -- --benchmark_filter="bm*" --benchmark_repetitions=25 --benchmark_display_aggregates_only
```
Risk Level: Could be high since headers are parsed millions of times per second. But also there's a lot of existing test cases.
Testing: Added tests, existing tests involving headers will also provide coverage.
Docs Changes: n/a
Release Notes: Not yet
Platform Specific Features: n/a