
Adding streaming response to AdminClusters endpoint #33879

Closed · 77 commits

Conversation

@miroswan (Contributor) commented Apr 30, 2024

Commit Message: Adding streaming response to AdminClusters endpoint
Additional Description: Leveraging Envoy::Server::Request to implement a streaming response for AdminClusters
Risk Level: medium
Testing: unit / integration / manual
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
[Optional Fixes #Issue]: #33879

@adisuissa (Contributor) commented:

This PR seems to include many file changes. As it doesn't have reviews yet, can you please rebase it?
/wait

@adisuissa self-assigned this May 1, 2024
@adisuissa (Contributor) commented:

> This PR seems to include many file changes. As it doesn't have reviews yet, can you please rebase it?

Apologies, I misread the PR contents.
This is quite a big change. Can you please provide some background on the motivation behind this change, and consider breaking it into smaller PRs?

@miroswan (Contributor, Author) commented May 1, 2024

> This PR seems to include many file changes. As it doesn't have reviews yet, can you please rebase it?
>
> Apologies, I misread the PR contents. This is quite a big change. Can you please provide some background on the motivation behind this change, and consider breaking it into smaller PRs?

  1. I've updated the PR description to explain which issue this fixes. It was assigned to me by @jmarantz. The motivation for the change can be found in the issue, but the TL;DR is that a request that must build a response covering many clusters can create unnecessary memory pressure; streaming the response reduces it.
  2. I don't think I'll be rebasing, given the guidance here: https://github.com/envoyproxy/envoy/blob/main/CONTRIBUTING.md#submitting-a-pr. If my reviewer (@tonya11en) requests a rebase, I'd be happy to do that.
  3. I won't be breaking this up right away, since I was asked not to do so until the code could be reviewed holistically.

@adisuissa removed the waiting label May 1, 2024
@adisuissa (Contributor) commented:

Thanks for the info.
The rebase request was an error on my part: given the size of the PR, I assumed the large number of file changes wasn't intentional.

Assigning @jmarantz who may have the most context.
/assign @jmarantz

@adisuissa removed their assignment May 1, 2024
@miroswan (Contributor, Author) commented May 1, 2024

There are some aspects of this PR that reviewers should be aware of:

  1. There is an opportunity to reduce string formatting in the text processor by replacing calls to Buffer::Instance::add with Buffer::Instance::addFragments (see the sketch after this list). We can do this in this PR or defer it to a follow-up PR.
  2. I noticed that the JSON encoder used prior to this PR has omit-empty behavior: values that are false, 0, or empty objects are omitted. I've taken care to replicate this behavior in this PR.
  3. I'd like to know whether there are any existing e2e tests or benchmarks against which I can verify the changes.
  4. The unit tests herein verify the changes with more than one cluster, to ensure that multiple calls to Envoy::Server::Request::nextChunk work as expected.
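
For point 1, a minimal sketch of the `addFragments` idea (the helper function and its names are hypothetical, not code from this PR; `Buffer::Instance::addFragments` takes a span of string views and appends them in one buffer operation):

```cpp
#include "envoy/buffer/buffer.h"

#include "absl/strings/string_view.h"

// Hypothetical helper illustrating the optimization: one addFragments() call
// replaces several add() calls, letting the buffer size a single slice for
// the combined length instead of appending piece by piece.
void addClusterLine(Envoy::Buffer::Instance& response, absl::string_view cluster_name,
                    absl::string_view stat_name, absl::string_view value) {
  // Before: separate appends, each a distinct buffer operation.
  // response.add(cluster_name);
  // response.add("::");
  // response.add(stat_name);
  // ...

  // After: one call that copies all fragments contiguously.
  response.addFragments({cluster_name, "::", stat_name, "::", value, "\n"});
}
```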

@jmarantz (Contributor) commented May 1, 2024

You can do addFragments in this PR or a follow-up; if it adds risk, it's fine to do a follow-up.

I think you should add a new benchmark, in the style of test/server/admin/stats_handler_speed_test.cc.

I think there are probably some unit & integration tests already. A good way to find them is to inject garbage into your output, run all the tests and see which ones fail :)

However, you can also look at test/integration/integration_admin_test.cc and the tests in test/server/admin/...
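
As an illustration only, a hypothetical skeleton of such a benchmark (the function name and workload are placeholders; a real version would populate a cluster manager with `state.range(0)` clusters and drive the /clusters handler the way stats_handler_speed_test.cc drives /stats):

```cpp
#include <string>

#include "benchmark/benchmark.h"

// Placeholder benchmark skeleton in google-benchmark style. The body stands
// in for "issue a /clusters request and drain every chunk of the response".
static void bmAdminClustersText(benchmark::State& state) {
  const size_t num_clusters = state.range(0);
  for (auto _ : state) {
    // Placeholder workload proportional to the cluster count.
    std::string response(num_clusters, 'x');
    benchmark::DoNotOptimize(response);
  }
}
BENCHMARK(bmAdminClustersText)->Arg(10)->Arg(1000)->Arg(10000);
```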

@jmarantz (Contributor) commented May 1, 2024

Oh, and you can add the new benchmark in a separate PR so we can see the before/after.

@miroswan (Contributor, Author) commented May 1, 2024

@jmarantz I'll create a separate issue/PR to add a benchmark for the AdminClusters handler. We can get that merged first; I can then rebase those changes into this branch and make sure the performance is improved.

@miroswan (Contributor, Author) commented May 1, 2024

@jmarantz Please assign #33918 to me when you have a moment. Thanks!

@miroswan force-pushed the admin-clusters-streamed branch 3 times, most recently from f5e12db to c2c81e9 (May 3, 2024)
@jmarantz (Contributor) left a review:

flushing comments for now.

Review comments on: source/server/admin/clusters_chunk_processor.cc (4), source/server/admin/admin.cc (1)
@miroswan (Contributor, Author) commented May 8, 2024

> flushing comments for now.

Addressed your comments. Thanks.

@jmarantz (Contributor) left a review:

looking great!

Flushing some more comments. Will continue to review in parallel.

Review comments on: source/server/admin/clusters_chunk_processor.h (3), source/server/admin/clusters_handler.cc (1), source/server/admin/clusters_params.cc (1), source/server/admin/clusters_request.cc (1), test/server/admin/clusters_params_test.cc (2)
@jmarantz (Contributor) left a review:

I'm so sorry this has taken so long. I still have to read the big implementation & test files.

This looks great; these are all minor style nits.

Review comments on: source/server/admin/clusters_handler.cc (1), source/server/admin/clusters_params.cc (1), source/server/admin/clusters_params.h (2), source/server/admin/clusters_request.h (2), test/server/admin/clusters_params_test.cc (4)
@adisuissa (Contributor) commented:

/wait

@jmarantz (Contributor) left a review:

Great progress, and really sorry for the slow feedback. Last week was rough; hopefully next week is better!

```cpp
  Json,
};

Http::Code parse(absl::string_view url, Buffer::Instance& response);
```
@jmarantz (Contributor) commented:

Doxygen here, please; e.g., it might not be obvious that response is only written if there's an error.
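
For illustration, the kind of comment being requested might look like this (wording assumed, not from the PR):

```cpp
/**
 * Parses the query parameters from the request URL.
 *
 * @param url the request URL, including query parameters.
 * @param response buffer for an error message; written only if parsing fails.
 * @return Http::Code::OK on success, or an error code (with response
 *         populated) on failure.
 */
Http::Code parse(absl::string_view url, Buffer::Instance& response);
```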

```cpp
    : chunk_limit_(chunk_limit), server_(server), params_(params) {
  for (const auto& [str, reference_wrapped_const_cluster] :
       server_.clusterManager().clusters().active_clusters_) {
    clusters_.push_back(reference_wrapped_const_cluster);
```
@jmarantz (Contributor) commented:

Subtle potential bug: a use-after-free here. I think the lifetime of an admin request can cross xDS updates in this model. The HTTP client might be slow, and flow control might (someday) pause fetching next chunks arbitrarily long.

So I think what we need to do is make clusters_ hold a shared_ptr rather than a const ref, so we don't crash if there's an xDS update midway through.

When I did this for /stats, a prerequisite for the admin work was making Stats::Scope be managed by shared_ptr, which was itself an adventure.

Another alternative, which might be easier and better in some ways, is to copy the names of the clusters during construction here. Then, when getting the next chunk, use the name to look up the cluster. We'd have to figure out how to make that fast if there's not already a map.

@miroswan (Contributor, Author) commented Jun 17, 2024:

The issue here is that this is how it's actually stored originally:

```cpp
using ClusterInfoMap = absl::flat_hash_map<std::string, std::reference_wrapper<const Cluster>>;
```

I simply collect all of these elements into a vector so that stateful iteration is stable (which isn't the case with a flat hash map).

I'm not sure how easy it would be to collect the clusters, in their original form (reference-wrapped const Cluster), into a vector of shared_ptr without issues.

As for the alternative... I believe this could work, because there is already a flat hash map that stores the clusters by name. If we simply stored the names of the clusters in a vector and used them to look up the corresponding Cluster in the map, I believe that would work just fine. But if a cluster is removed during request serving and is no longer present in the map, I think we'll want to establish some policy for how we handle this across streaming requests within Envoy.

Also, if the map can be updated midway through request serving, is it a critical region that must be protected?

Do we guarantee isolation, such that the response is a snapshot of the clusters' state at the time of the request, or is it best effort? In other words, if a cluster is no longer present in the map, do we simply omit it from the response?

@jmarantz (Contributor) commented:

Let's say I am streaming out /clusters, and there are initially clusters a, b, c, d. After streaming out a and b, there's an xDS update deleting c but leaving d.

When streaming out the remaining clusters, we will try to look up "c" by name and not find it, so we just skip it and stream out "d". We could stream out "c" as empty if that makes a difference.

So when streaming out /clusters, you would not be guaranteed a coherent view at any one point in time. That is the way I thought about streaming out /stats: your view shifts a bit over time. In fact, that was the case for the values even before streaming was added; you wouldn't see an instantaneous snapshot of the stats values, as the data plane runs in parallel with admin.

At the moment, I don't think the admin infrastructure allows for incoherency in source/exe:envoy-static; the admin request runs on the main thread and would not relinquish control, so xDS would not run at all until /clusters completed. However, there is an exported and tested async C++ API which does allow xDS to run during an admin request. So we need to define the behavior and test for it. And I have a dormant PR that tries to relinquish control when the output buffer reaches its watermark, to avoid buffering the whole response.
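
To make the alternative concrete, here is a minimal sketch of the name-copy approach under discussion (the ClusterInfoMap alias matches the one quoted above; everything else, including ClustersStreamer, is illustrative and not code from the PR):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

#include "absl/container/flat_hash_map.h"

struct Cluster {}; // stand-in for the real Upstream::Cluster

using ClusterInfoMap =
    absl::flat_hash_map<std::string, std::reference_wrapper<const Cluster>>;

// Copies cluster names at construction; each chunk re-resolves names against
// the live map and silently skips clusters deleted by an intervening xDS
// update, so no dangling references are held across chunks.
class ClustersStreamer {
public:
  explicit ClustersStreamer(const ClusterInfoMap& clusters) {
    names_.reserve(clusters.size());
    for (const auto& entry : clusters) {
      names_.push_back(entry.first); // copy names: iteration order is now stable
    }
  }

  // Renders up to chunk_limit clusters; returns true while more remain.
  bool nextChunk(const ClusterInfoMap& clusters, size_t chunk_limit) {
    size_t rendered = 0;
    while (index_ < names_.size() && rendered < chunk_limit) {
      auto it = clusters.find(names_[index_++]);
      if (it == clusters.end()) {
        continue; // cluster removed mid-stream: skip it rather than crash
      }
      render(it->second.get());
      ++rendered;
    }
    return index_ < names_.size();
  }

private:
  void render(const Cluster&) { /* placeholder: serialize one cluster */ }

  std::vector<std::string> names_;
  size_t index_{0};
};
```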

```cpp
}

bool ClustersRequest::nextChunk(Buffer::Instance& response) {
  UNREFERENCED_PARAMETER(response);
```
@jmarantz (Contributor) commented:

Can we omit these, and use the definitions from the sub-classes?

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <regex>
```
@jmarantz (Contributor) commented:

I haven't read through this in detail yet, but can you make sure we test the case where we start a /clusters request, emulate an xDS update that completely changes the set of clusters, and then continue the /clusters request?

Another thing it would be good to do manually is just look at the JSON output in a browser; Firefox has a built-in JSON rendering mode.
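
The shape of the requested test might look like this sketch (gtest-style, reusing the hypothetical ClustersStreamer above; the real test would drive ClustersRequest against cluster-manager mocks):

```cpp
#include <functional>

#include "gtest/gtest.h"

TEST(ClustersStreaming, SurvivesXdsUpdateMidRequest) {
  Cluster a, b, c, d;
  ClusterInfoMap clusters = {{"a", std::cref(a)},
                             {"b", std::cref(b)},
                             {"c", std::cref(c)},
                             {"d", std::cref(d)}};
  ClustersStreamer streamer(clusters);

  // First chunk streams two clusters, then an emulated xDS update deletes
  // "c" before the next chunk is requested.
  EXPECT_TRUE(streamer.nextChunk(clusters, /*chunk_limit=*/2));
  clusters.erase("c");

  // The remaining chunk must skip the deleted cluster rather than crash,
  // and the request must terminate normally.
  EXPECT_FALSE(streamer.nextChunk(clusters, /*chunk_limit=*/2));
}
```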

@jmarantz (Contributor) commented Jun 8, 2024

/wait

github-actions (bot) commented:
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions bot added the stale label Jul 18, 2024
github-actions (bot) commented:
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions bot closed this Jul 25, 2024
Labels: stale, waiting

4 participants