
Reflection-based deterministic message hashing #30761

Merged
merged 44 commits into from Dec 13, 2023

Conversation

ravenblackx
Contributor

@ravenblackx ravenblackx commented Nov 7, 2023

Commit Message: Reflection-based deterministic message hashing
Additional Description: Comparing config used to be done with "deterministic" serialization and a hash of that, but it turned out deterministic serialization was not in fact deterministic enough (Any messages could contain reordered maps or even other fields). The recommendation in protobuf docs is "if you want a deterministic order do it yourself". (Note that the linked comment also suggests that SetSerializationDeterministic should work for our purposes here, but protocolbuffers/protobuf#5731 is the relevant unaddressed bug about Any fields.)

The prior method of achieving determinism by expanding Any via TextFormat serialization is slow enough that it shows up on performance graphs as a significant cost (about 1/4 of our total startup time).

This change is quite a lot of code, but should give us a deterministic hash with a much faster runtime than transforming to and then hashing a human-readable string.

Risk Level: Some; if it's broken it might cause config updates to not apply, or cache entries to collide.
Testing: Many unit tests, and a bit extra in the existing hash test. Benchmark results from bazel run -c opt test/common/protobuf:utility_speed_test:

-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
bmHashByDeterministicHash/map                 35663 ns        35662 ns        20288
bmHashByTextFormat/map                       376608 ns       376503 ns         1892
bmHashByDeterministicHash/recursion           13682 ns        13681 ns        51381
bmHashByTextFormat/recursion                  30239 ns        30239 ns        23463
bmHashByDeterministicHash/repeatedFields      21024 ns        21023 ns        33263
bmHashByTextFormat/repeatedFields            150444 ns       150443 ns         4645

Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Runtime Guard: envoy.restart_features.use_fast_protobuf_hash

Signed-off-by: Raven Black <ravenblack@dropbox.com>
…rently

Signed-off-by: Raven Black <ravenblack@dropbox.com>
hash_type(reflection->Get##get_type(message, field)); \
}

#define MAP_SORT_BY(get_type) \
Contributor

Can you pass in the variable name rather than capturing the free variable, i.e. MAP_SORT_BY(map, get_type)?

Also, the map variable at call-sites is really a vector. Call it that? I'm actually not sure we really need to macro-ize the call to std::sort. I think you could do this readably without macros by instead just making a template function that returns the comparator:

std::sort(map.begin(), map.end(), makeCompareFn<get_type>);

I think that will be more readable at call-sites and only slightly more verbose.

Contributor Author

I don't think you can template e.g. makeCompareFn<int> to dispatch to reflection->GetInt(...), can you? Other than by explicitly writing a specialization for each of the possible value types. If we want to go that way, I might as well just manually expand the macro in full throughout.
(It is frustrating that the reflection API doesn't have a Get template with specializations that would make this easy!)

I've now done it with a middle-ground template which to me has the worst readability of all possible options. Do you prefer this over the full expansion?

Contributor

I see how that got awkward fast :) Let me reflect on that (no pun intended).

Contributor Author

Restructured it without that template; it's fewer lines of code.

Then separated the sorting out into its own helper class, then made the comparison function its own helper class so there are no std::function or even lambdas involved any more. Then got rid of the original helper class because, with the comparer extracted, the remainder is more suitable as a plain function again. :)

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

/retest
(mysql integration test passes locally and does not appear to be related)

@ravenblackx
Contributor Author

/retest

@ravenblackx
Contributor Author

/coverage


Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-pr/30761/coverage/index.html

The coverage results are (re-)rendered each time the CI envoy-presubmit (check linux_x64 coverage) job completes.


Signed-off-by: Raven Black <ravenblack@dropbox.com>
The reduction is necessary because the new uncovered lines are
unreachable enum paths that cannot be tested, and the existing
covered function had its LoC reduced which also contributed to
a coverage percentage downswing.

Signed-off-by: Raven Black <ravenblack@dropbox.com>

CC @envoyproxy/coverage-shephards: FYI only for changes made to (test/per_file_coverage.sh).
envoyproxy/coverage-shephards assignee is @RyanTheOptimist


@@ -15,7 +15,7 @@ declare -a KNOWN_LOW_COVERAGE=(
"source/common/matcher:94.6"
"source/common/network:94.4" # Flaky, `activateFileEvents`, `startSecureTransport` and `ioctl`, listener_socket do not always report LCOV
"source/common/network/dns_resolver:91.4" # A few lines of MacOS code not tested in linux scripts. Tested in MacOS scripts
"source/common/protobuf:96.5"
"source/common/protobuf:96.3"
Contributor

Can we find a way to keep the existing coverage threshold for this directory?

Contributor Author

Not a relevant one. Literally every new line of code that is reachable is covered. The only feasible ways to re-increase the coverage percentage would be to add nonsense padding lines in covered areas, or to add a test to something completely unrelated to this change.

OR, an alternative I considered: give the helper class a separate constructor that allows constructing it with invalid input fields, expose that constructor only for testing (though it would still be built for production), and use it to add test coverage of the unreachable lines by making them reachable. Pretty sure that's even worse than the other options.

Contributor

Fair enough. Coverage LGTM

Contributor Author

Ended up actually improving the coverage: the final version of deterministic_hash.cc now has 100% coverage (the switch with unreachable paths was excised, thanks to jmarantz pointing out that we don't actually have to sort a list in order to hash it ignoring order), and the intermediate/current version of utility.cc still has the old lines covered for the lifespan of the runtime guard.

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

Updated to use xxHash64Value, added an explicit scalar constraint on hashScalarField (even though it would be implicit from xxHash64Value, it makes sense to be explicit too), and refreshed the benchmarks in the PR description after the change.

@ravenblackx
Contributor Author

/retest

The performance of the hash operation is improved by 2-10x depending on the structure of the message,
which is expected to reduce config update time or startup time by 10-25%. The new algorithm is also
used for http_cache_filter hashing, which will effectively cause a one-time cache flush on update
for users with a persistent cache. To revert this behavior set ``envoy.restart_features.use_fast_protobuf_hash`` to false.
Contributor

How would you feel about making this default to the current behavior?

I would like, when we import this into our mono-repo, to have products run the new mode in staging for a few weeks first before we turn it on.

Contributor Author

I would prefer not to. My understanding is that runtime guards for performance improvements are generally supposed to be emergency shutoffs, not optional activates - if you make it opt-in then it doesn't get tested.

Contributor

Not forever. Just for some amount of time (a quarter?)

@alyssawilk may have more context on an appropriate amount of time a large change should bake before flipping it to the default. It will improve things; it will get tested and turned on.

I just don't want the eventual productionizing of this to catch us by surprise, and we wouldn't be able to disable it prior to importing it.

Contributor Author

False by default it is.

Contributor

Generally when we default to false we assign an owner to hassle about flipping it to true. Josh, as you're looking at testing this, are you OK driving the default flip? Alternately, if this is no higher risk than most flags, should we leave it true by default upstream and import-false? I'm fine either way; I just want to make sure we have a plan on how to move forward.

Contributor

I think I don't need to take that on -- I like this change a lot, but it's solving a problem our service is not currently suffering from, afaict. Even if we have like a delay till end of January to make sure we have a chance to flip the flag off, I'd be OK.

I think for the service my team runs, we are being particularly paranoid right now, and we'd just want a chance to have this flag imported into our monorepo first, and then turn it off, then it could be turned on in Envoy main. After N months we could flip it on ourselves; we just can't take on that risk right now.

test/common/protobuf/utility_speed_test.cc
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Contributor

@jmarantz jmarantz left a comment

LGTM but as this is pretty deep stuff I think I'd like to see Adi and Yan both review.

@@ -296,7 +296,7 @@ TEST_F(LookupRequestTest, PragmaNoFallback) {
TEST(HttpCacheTest, StableHashKey) {
Key key;
key.set_host("example.com");
ASSERT_EQ(stableHashKey(key), 9582653837550152292u);
ASSERT_EQ(stableHashKey(key), 6153940628716543519u);
Contributor

Why ASSERT_EQ rather than EXPECT_EQ? Usually I use ASSERT_EQ only when a failure might result in a crash later in the function, like

ASSERT_EQ(5, array.size());
EXPECT_EQ("foo", array[4]);

no big deal though

Contributor

@jmarantz jmarantz left a comment

Oh I see Ryan is already the senior maintainer on it but I'd still like Adi to review first.

Contributor

@adisuissa adisuissa left a comment

Overall LGTM, thanks!
Left a few minor comments.

changelogs/current.yaml
source/common/protobuf/deterministic_hash.cc
source/common/runtime/runtime_features.cc
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Contributor

@adisuissa adisuissa left a comment

Thanks for fixing this!
One small minor comment, otherwise LGTM

source/common/protobuf/deterministic_hash.cc
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

/retest

@jmarantz
Contributor

I think this just needs Senior Maintainer approval. Ryan?

Contributor

@RyanTheOptimist RyanTheOptimist left a comment

Looks great!

value.set_index(2);
a2.mutable_any()->PackFrom(value);
EXPECT_NE(hash(a1), hash(a2));
}
Contributor

Excellent tests!

@ravenblackx
Contributor Author

/retest

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

/retest

