"Compiled" static header maps instead of big trie #33932
Conversation
Signed-off-by: Raven Black <ravenblack@dropbox.com>
/retest
Signed-off-by: Raven Black <ravenblack@dropbox.com>
/wait (for CI green)
/retest
this looks amazing though I don't fully understand it yet.
It turns out I had some comments from a while back that I hadn't flushed, so I'm combining those with what I found just now and flushing them.
```cpp
 * @param key the key to look up.
 */
Value find(const absl::string_view& key) const {
  if (key.size() >= table_.size() || table_[key.size()] == nullptr) {
```
can we avoid the duplicate lookup of `table_[key.size()]`? It's probably optimized away, but I am never certain.
I think for the compiler to optimize it away, it would have to see inside table_.size() to ensure it could not mutate key. That seems obvious to humans, but I don't know how smart compilers are about that; if the compiler can't see inside the table_.size() implementation (i.e. it's not inlined), it can't know.
I tried taking it to a local variable, but that has to happen after the check that key.size() isn't too large, which makes it two if-branches; from benchmarking, that appears to make both hit paths ~5% slower.
It's not like these are actual table lookups, they're just offsets, so even if it isn't optimized away it might still be faster than taking an additional copy.
why does it have to be after the first check? Can't you have the first line of the function be the assignment to a temp?
`table_[key.size()]` is an out-of-bounds crash if `key.size() >= table_.size()`.
```cpp
const size_t key_size = key.size();
if (key_size >= table_.size() || table_[key_size] == nullptr) {
  ...
```
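For completeness, the full find() with the hoisted size might look something like this. This is a minimal sketch, not the PR's actual code: `LengthTable` is an invented name, `std::string_view` stands in for `absl::string_view`, and the table-of-functions shape is assumed from the surrounding discussion.

```cpp
#include <cstddef>
#include <functional>
#include <string_view>
#include <vector>

// Sketch of the length-indexed dispatch with the suggested hoist applied.
template <class Value> class LengthTable {
public:
  using FindFn = std::function<Value(std::string_view)>;
  explicit LengthTable(std::vector<FindFn> table) : table_(std::move(table)) {}

  Value find(std::string_view key) const {
    const size_t key_size = key.size();
    // || short-circuits, so table_[key_size] is never evaluated when
    // key_size is out of range.
    if (key_size >= table_.size() || table_[key_size] == nullptr) {
      return {};
    }
    return table_[key_size](key);
  }

private:
  std::vector<FindFn> table_;
};
```

The single local means the bounds check and the per-length dispatch both read `key_size` once, with no second call into `key.size()`.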
```cpp
  using FindFn = std::function<Value(const absl::string_view&)>;

public:
  using KV = std::pair<std::string, Value>;
```
Scanning through the code I'm wondering if there's some string copying going on here that we might be able to avoid.
Can we have a string-store held in the class, and have the KV that we manipulate in various sub-vectors hold either a string_view or a string&?
I made the input a vector of string_views which eliminates the string copying during compile() except the copy taken for a leaf node (which is kept in case at some point the table wants to be initialized from a transient source rather than global singleton-constant-likes).
```cpp
 * for the string node.
 */
template <class Value> class CompiledStringMap {
  using FindFn = std::function<Value(const absl::string_view&)>;
```
standard best practice is to pass string_view by value rather than by reference. I think it's really close, but this was the decision made by smart people some time ago :)
Added a comment about references being faster for this use-case.
```cpp
std::vector<KV> next_contents;
next_contents.reserve(it - start);
std::copy(start, it, std::back_inserter(next_contents));
nodes[i - best.min] = createEqualLengthNode(next_contents);
```
this recursive lambda generation looks really cool, but I'm wondering if it's the most efficient thing we can do. As a thought experiment, I'm wondering what the code would look like if you had to write this in C. I assume it would be possible (if a little more verbose). Would it be faster?
If I were doing it in C it would be almost the same. I don't know how much overhead there is in std::function, but the C structure of this would be something like a struct containing a function pointer and a union of the captured data that each of the node types takes, and that function pointer would take a pointer to that same structure... which seems like it's probably at least about the same as what a lambda is doing.
It might be possible to optimize further with more manual constructs, but I think a 60% speedup is probably good enough, and it could have another pass later if someone wants to try to do better.
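For illustration, the C-style shape described above might look roughly like this. This is a hedged sketch, not the PR's code: `Node`, `Leaf`, `Branch`, and the `Value = int` / `0 = miss` convention are all invented for the example.

```cpp
#include <cstddef>
#include <cstring>

// A plain function pointer plus a union of per-node-kind captured data,
// standing in for std::function. All names here are illustrative.
using Value = int; // 0 stands in for "not found"

struct Node;
using NodeFindFn = Value (*)(const Node*, const char* key, size_t len);

struct Leaf {
  const char* str;
  size_t len;
  Value value;
};

struct Branch {
  size_t index;      // character position this node dispatches on
  unsigned char min; // smallest character that has a child
  const Node* const* children;
  size_t n_children; // children covers [min, min + n_children)
};

struct Node {
  NodeFindFn find; // "virtual call" through a plain function pointer
  union {
    Leaf leaf;
    Branch branch;
  } u;
};

inline Value leafFind(const Node* node, const char* key, size_t len) {
  if (len == node->u.leaf.len && std::memcmp(key, node->u.leaf.str, len) == 0) {
    return node->u.leaf.value;
  }
  return 0;
}

inline Value branchFind(const Node* node, const char* key, size_t len) {
  const Branch& b = node->u.branch;
  const unsigned char c = static_cast<unsigned char>(key[b.index]);
  if (c < b.min || c >= b.min + b.n_children) {
    return 0; // no child for this character: miss
  }
  const Node* child = b.children[c - b.min];
  return child ? child->find(child, key, len) : 0;
}
```

As the comment above notes, this is structurally close to what a lambda held in a std::function does, just with the captures spelled out by hand.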
```cpp
};
/**
 * Construct the lookup table. This is a somewhat slow multi-pass
 * operation - using this structure is not recommended unless the
```
should this note be in the class comment?
For a moment I thought you were saying that you could use this class, but not call 'compile' and instead fall back to an interpret-mode, but I don't think that's what you are saying :)
Moved.
```cpp
auto it = initial.begin();
for (size_t i = 0; i <= longest; i++) {
  auto start = it;
  while (it != initial.end() && it->first.size() == i) {
```
I am guessing we don't consider compile speed important to optimize. But this nested loop seems like it might be slower than needed.
How about this for an algo:
```cpp
absl::flat_hash_map<size_t, std::vector<std::reference_wrapper<const KV>>> size_to_values_map;
for (const KV& pair : initial) {
  size_to_values_map[pair.first.size()].push_back(std::cref(pair));
}
```
I realize this creates a data structure which doesn't match the ones you are using, but it seems like a good way to get the elements organized by size. After that I'm not sure how this would fit into your algo.
I'm having a little trouble following the algorithm also. What causes it to re-examine the entries that don't match size 'i'?
Maybe some more comments would help, or breaking your compile flow down into a few helper methods that might be more self-documenting.
It was already not really a nested loop, in that the two indexes were moving "in parallel" (i.e. it wasn't n*m, it was n+m), but I've changed it to make that a bit clearer. Though it's still not much clearer, since the std::find_if still looks like it might be an m-length scan. But it's not!
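The O(n+m) point above can be sketched as follows. This is illustrative code, not the PR's compile(): `groupByLength` is an invented helper, and it assumes the input vector is already sorted by key length, as the discussion implies.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Value = int;
using KV = std::pair<std::string, Value>;

// Because `initial` is sorted by key length, each std::find_if resumes
// exactly where the previous group ended, so the whole grouping visits
// every element once (O(n)), even though the loop looks nested.
inline std::vector<std::vector<KV>> groupByLength(const std::vector<KV>& initial) {
  std::vector<std::vector<KV>> groups;
  auto range_start = initial.begin();
  while (range_start != initial.end()) {
    // Find the first key whose length differs from the current key's;
    // everything in between has the same length.
    auto range_end = std::find_if(
        range_start, initial.end(),
        [len = range_start->first.size()](const KV& e) { return e.first.size() != len; });
    groups.emplace_back(range_start, range_end);
    range_start = range_end; // resume, never re-scan
  }
  return groups;
}
```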
/wait
Signed-off-by: Raven Black <ravenblack@dropbox.com>
…loop Signed-off-by: Raven Black <ravenblack@dropbox.com>
I'm still reading the detail on the code, but I think @adisuissa should review also.
```cpp
// Find the first key whose length differs from the current key length.
// Everything in between is keys with the same length.
auto range_end =
    std::find_if(range_start, initial.end(), [l = range_start->first.size()](const KV& e) {
```
I'm not a fan of 'l' as a variable name :) Can we have 'last'?
len. :)
Signed-off-by: Raven Black <ravenblack@dropbox.com>
this is getting a lot more readable though I admit I still don't deeply understand the algorithm.
Left a few more superficial notes.
```cpp
 * (typically nullptr) if the key was not present.
 * @param key the key to look up.
 */
Value find(absl::string_view key) const {
```
you found passing by const ref was faster in another case, but not here?
Mostly left this one for the sake of leaving the external API unchanged. It feels more reasonable to buck the style guide for internal implementation than for the API.
```cpp
struct IndexSplitInfo {
  uint8_t index, min, max, count;
} best{0, 0, 0, 0};
for (size_t i = 0; i < node_contents[0].first.size(); i++) {
```
would this be faster?
```cpp
for (size_t i = 0, n = node_contents[0].first.size(); i < n; i++) {
```
I don't think it would (it's just reading a value from an address), and we don't care about the microoptimization performance here anyway.
to make this code more understandable, consider naming:
```cpp
const size_t each_key_length = node_contents[0].first.size();
```
(or something similar)
Also consider renaming: `i` -> `key_char_idx`, `j` -> `key_idx`, `v` -> `ch`/`val`.
```cpp
if (!hits[v]) {
  hits[v] = true;
  info.count++;
  info.min = std::min(v, info.min);
```
I'm wondering if it would be faster to write this as:
```cpp
if (v < info.min) {
  info.min = v;
} else if (v > info.max) {
  info.max = v;
}
```
It feels to me like what you have is a little more readable, but it may have an extra branch, and it will always execute two writes into `info` when only 0 or 1 are needed. Not sure if the compiler would do that optimization.
We shouldn't really care about microoptimizing the speed of any function that isn't find().
sure. I was having trouble figuring out what's find() and what's not.
Hopefully that's a lot clearer now that there are no lambdas. Now only find() is find(). :)
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Thanks for working on this! Very interesting data-structure!
```cpp
 * This structure first jumps to matching length, eliminating in this
 * example case apple and pineapple.
 * Then the "best split" node is on
 * `x-prefix-banana`
 * ^
 * so the first node has 3 non-miss branches, n, b and r for that position.
 * Down that n branch, the "best split" is on
 * `x-prefix-banana`
 * ^
 * which has two branches, n or k.
```
IIUC the "best split" can be anywhere in the string. I think modifying the example to the following will clarify how this should work:
* `x-prefix-banana`
* `x-prefix-banara`
* `x-prefix-apple`
* `x-prefix-pineapple`
* `x-prefix-barana`
* `x-prefix-banaka`
Then the best split would have been on the second-to-last character.
It can't be anywhere in the string, only at an index where there's a difference. The example is specifically showing that 3 branches would be "best" over 2 branches. I don't understand what your example clarifies that the existing example doesn't.
My comment was looking at it from a "trie" perspective.
When I first read the comment, I thought that in each branch, the structure only keeps the "suffix" (after the branch point). I think that updating the example to a case where the branching happens at a "best split" that is not the first difference will reduce that confusion for future readers.
Hopefully the md file explanation covers this sufficiently (though it doesn't do the second split earlier in the string).
If you'd like me to rearrange the strings in the new example so that it does the "last" index split first and the earlier one second, I can do that, but I think the diagram having numbers in it probably makes it clear enough. (And late-first would be just as potentially confusing as early-first, and having three sequential splits would get so long as to be harder to read...)
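For readers following the thread, the "best split" selection being discussed can be sketched like this. It is an illustrative reimplementation under assumed names (`SplitInfo`, `findBestSplit`), not the PR's exact code: among same-length keys, pick the character position with the most distinct characters, so one branch eliminates as many candidates as possible.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

struct SplitInfo {
  size_t index;       // character position to branch on
  uint8_t min, max;   // smallest/largest distinct character seen there
  size_t count;       // number of distinct characters at that position
};

// keys must be non-empty and all the same length.
inline SplitInfo findBestSplit(const std::vector<std::string_view>& keys) {
  SplitInfo best{0, 0, 0, 0};
  for (size_t i = 0; i < keys[0].size(); i++) {
    std::array<bool, 256> hits{};
    SplitInfo info{i, 255, 0, 0};
    for (std::string_view key : keys) {
      const uint8_t v = static_cast<uint8_t>(key[i]);
      if (!hits[v]) { // count each distinct character once
        hits[v] = true;
        info.count++;
        info.min = std::min(v, info.min);
        info.max = std::max(v, info.max);
      }
    }
    if (info.count > best.count) { // most branches wins
      best = info;
    }
  }
  return best;
}
```

With the six `x-prefix-…` strings above, the same-length subset {banana, banara, banaka} splits best at the second-to-last character, where three distinct characters (n, r, k) appear.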
wdyt of adding a diagram showing the algorithm in action, as an md file or image or whatever?
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Done.
/retest
Thanks for adding the doc, it explains the data-structure very well!
The code LGTM. The only thing I suggest considering is whether the use of this data-structure should be behind a startup-flag (bootstrap option). I'll leave it to other maintainers to decide.
in the leaf nodes, which could potentially make it larger due to, e.g., containing
`x-envoy-` multiple times where a regular trie only has one 'branch' for that.
However, it also contains fewer nodes due to not having a node for every character,
and compared to the prior fixed-size-node 256-branch trie, each node is a lot smaller;
just a minor comment (no AI):
I think the original trie can be converted to use the same "shrunk-range" (only the needed range of characters) technique, so in terms of memory I think this data-structure is more expensive.
That said, from the CPU perf point of view, I tend to agree that this data-structure should perform better, although it is not easy for me to say which of the changes (using per-length buckets, splitting not at the prefix, condensing the trie) provides the most benefit.
Doing just shrunk-range was actually the original PR from which this one forked off as a prerequisite; it made performance a bit worse (more so than this version) for the miss case and made little difference for the hits (worse for short strings, better for long strings). I think this version is also still better memory-wise, because every pointer is 64 bits, and this trie has far fewer pointers.
Consider the case of even just one entry, `x-envoy-banana`: in this model it's the root length node and a leaf node, so it's a vector with 13 nullptrs and one pointer (because the length node doesn't do the minimizing thing), a length, and one leaf node containing a value and a string.
In a regular "shrunk" trie, this is 14 nodes each containing a default value (a pointer) and a pointer to the next node, plus a vector of chars one character long and a length - so that's 48 pointers, which is significantly larger than the "short path" trie. Even if you cut out the common "x-envoy-" prefix entirely and assume it's shared among so many headers that we can amortize it entirely, it's still 18 pointers just for the 'banana' part of the sequence, which is about the same as the "compiled" trie for the whole thing (which also theoretically shares "common prefixes"!)
Splitting at a more optimal branch point is definitely the main performance enhancer for hits. (It's similar to what gperf does.)
Yes, you are correct.
I keep forgetting that this is not a compressed-prefix-trie, but one that has per-character branching.
Ah yeah, a compressed-prefix trie (assuming this is a name for "a branch is per-character but also matches the longest string common to all its children") would probably be nearly as fast as the "branch at best position" trie, and a similar size, maybe slightly smaller, since it would also only be a few hops to completion. But it would involve more string-compares.
+alyssawilk for senior maintainer pass.
LGTM. My largest request was going to be jmarantz review but thankfully you both already have that covered :-)
actually I'd love to take one more quick pass; just been slammed. Give me till EOD today? I agree this is fantastic.
lgtm!
This is amazing work. I am looking forward to seeing how fast this is in production!
Commit Message: "Compiled" static header maps instead of big trie
Additional Description: This is slightly slower for "misses" but significantly faster for "hits". Given that the majority of headers are expected to be matches for the static table, this should be a big win. (Each faster hit pays for 30 slower misses!) This structure also uses significantly less memory than the big trie.
Performance benchmarks:
```shell
bazel run -c opt test/common/http:header_map_impl_speed_test -- --benchmark_filter="bm*" --benchmark_repetitions=25 --benchmark_display_aggregates_only
```
Risk Level: Could be high since headers are parsed millions of times per second. But also there's a lot of existing test cases.
Testing: Added tests, existing tests involving headers will also provide coverage.
Docs Changes: n/a
Release Notes: Not yet
Platform Specific Features: n/a