Adding streaming response to AdminClusters endpoint #33879
Conversation
Seems that this PR includes many file changes. As it doesn't have reviews yet, can you please rebase it?
Apologies, I misread the PR contents.
There are some aspects to this PR that I believe should be known to folks reviewing:
You can do addFragments in this PR or a follow-up; if it adds risk, it's fine to do a follow-up. I think you should add a new benchmark, in the style of test/server/admin/stats_handler_speed_test.cc. I think there are probably some unit & integration tests already. A good way to find them is to inject garbage into your output, run all the tests, and see which ones fail :) However, you can also look at test/integration/integration_admin_test.cc and the tests in test/server/admin/...
Oh, and you can add the new benchmark in a separate PR so we can see the before/after. |
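As a starting point for that follow-up, a rough shape could look like the sketch below. This is a minimal self-contained timing harness with hypothetical names (`renderClusters`, `timeRenderMicros`), not the `benchmark::State` fixture style that test/server/admin/stats_handler_speed_test.cc actually uses.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical stand-in for rendering one /clusters response over fake clusters.
std::string renderClusters(const std::vector<std::string>& clusters) {
  std::string out;
  for (const auto& name : clusters) {
    out += name + "::default_priority::max_connections::1024\n";
  }
  return out;
}

// Times a single render over `n` fake clusters; returns elapsed microseconds,
// or -1 if the render unexpectedly produced no output.
long long timeRenderMicros(int n) {
  std::vector<std::string> clusters;
  clusters.reserve(n);
  for (int i = 0; i < n; ++i) {
    clusters.push_back("cluster_" + std::to_string(i));
  }
  const auto start = std::chrono::steady_clock::now();
  const std::string out = renderClusters(clusters);
  const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
      std::chrono::steady_clock::now() - start);
  return out.empty() ? -1 : elapsed.count();
}
```

A real benchmark would let the framework control iteration counts and report throughput, which is what the separate before/after PR would show.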
@jmarantz I'll create a separate Issue/PR to create a benchmark for the AdminClusters handler. We can get that merged first. I can rebase those changes into this branch and make sure that the performance is improved.
flushing comments for now.
Addressed your comments. Thanks.
looking great!
Flushing some more comments. Will continue to review in parallel.
I'm so sorry this has taken so long. I still have to read the big implementation & test files.
This looks great; these are all minor style nits.
/wait
Great progress, and really sorry for the slow feedback. Last week was rough; hopefully next week is better!
      Json,
    };

    Http::Code parse(absl::string_view url, Buffer::Instance& response);
doxygen here please, e.g. it might not be obvious that response is only written if there's an error.
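For reference, the kind of comment being asked for might look like the sketch below. The `Buffer`/`Code` types here are simplified stand-ins so it compiles standalone; the real signature uses `Http::Code` and `Buffer::Instance`, and the validation logic is hypothetical.

```cpp
#include <string>
#include <string_view>

// Simplified stand-ins for Http::Code and Buffer::Instance.
enum class Code { OK = 200, BadRequest = 400 };
struct Buffer {
  std::string data;
  void add(std::string_view s) { data += s; }
};

/**
 * Parses query parameters from the admin URL.
 *
 * @param url the request URL, including any query string.
 * @param response written ONLY on failure, in which case it holds a
 *        human-readable error message to return to the client; left
 *        untouched on success.
 * @return Code::OK on success, or an error code (with `response`
 *         populated) if a query parameter is invalid.
 */
Code parse(std::string_view url, Buffer& response) {
  // Hypothetical validation: reject an unknown format value.
  if (url.find("format=bogus") != std::string_view::npos) {
    response.add("usage: unknown format");
    return Code::BadRequest;
  }
  return Code::OK;
}
```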
        : chunk_limit_(chunk_limit), server_(server), params_(params) {
      for (const auto& [str, reference_wrapped_const_cluster] :
           server_.clusterManager().clusters().active_clusters_) {
        clusters_.push_back(reference_wrapped_const_cluster);
Subtle potential bug (use-after-free) here. I think the lifetime of an admin request can cross xDS updates in this model. The HTTP client might be slow, and the flow control might (someday) pause getting next chunks arbitrarily long.
So I think what we need to do is make clusters_ hold a shared_ptr rather than a const ref, so we don't crash if there's an xDS update midway through.
When I did this for /stats, a prerequisite for doing the admin work was to make Stats::Scope be managed by shared_ptr, which was itself an adventure.
Another alternative, which might be easier and better in some way, is to copy the names of the clusters during construction here. Then, when getting the next chunk, use the name to look up the cluster. We'd have to figure out how to make that fast if there's not already a map.
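A sketch of that second alternative, under stated assumptions: `std::map` stands in for the real name-keyed cluster map, and the class and member names are hypothetical. The request captures only cluster names at construction and re-resolves each name when its chunk is produced, so a cluster deleted mid-request is simply skipped.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Cluster {
  std::string status;
};
// std::map stands in for the real hash map; ordered here so the example's
// output is deterministic.
using ClusterMap = std::map<std::string, Cluster>;

class ClustersRequestSketch {
public:
  explicit ClustersRequestSketch(const ClusterMap& clusters) : clusters_(clusters) {
    // Capture names only: stable iteration, no references that can dangle.
    for (const auto& entry : clusters_) {
      names_.push_back(entry.first);
    }
  }

  // Emits one cluster per chunk; names that no longer resolve (deleted by an
  // xDS update between chunks) are skipped. Returns false when exhausted.
  bool nextChunk(std::string& response) {
    while (next_ < names_.size()) {
      const std::string& name = names_[next_++];
      const auto it = clusters_.find(name);
      if (it == clusters_.end()) {
        continue;  // removed mid-request: omit from the response
      }
      response += name + ": " + it->second.status + "\n";
      return true;
    }
    return false;
  }

private:
  const ClusterMap& clusters_;  // the live map; may change between chunks
  std::vector<std::string> names_;
  std::size_t next_{0};
};
```

The lookup cost per chunk is one map find per name, which is cheap if the map is already keyed by name.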
The issue here is that this is how it's actually stored originally (envoy/envoy/upstream/cluster_manager.h, line 263 at 2402102):

    using ClusterInfoMap = absl::flat_hash_map<std::string, std::reference_wrapper<const Cluster>>;
I simply collect all of these elements in a vector so that stateful iteration is stable (not so with a flat hash map).
I'm not sure how easy it would be to collect the clusters, in their original form (reference-wrapped const Cluster), into a vector of shared_ptr without issues.
As for the alternative... I believe this could work because there is already a flat hash map that stores the clusters by name. If we simply stored the names of the clusters in a vector and used them to look up the corresponding Cluster in the map, I believe that would work just fine. But if there is a cluster that was removed during request serving and is no longer present in the map, I think we'll want to establish some policy on how we'd handle this across streaming requests within Envoy.
Also, if the map can get updated mid-way through request serving, is it a critical region that must be protected?
Do we guarantee isolation such that the response is a snapshot of the state of the clusters at the time of the request or is it best effort? In other words, if the cluster is no longer present in the map, do we simply omit it from the response?
Let's say I am streaming out /clusters, and there are initially clusters a,b,c,d. After streaming out a and b, there's an xDS update deleting c but leaving d.
When streaming out the remaining clusters, we will try to name-lookup "c" and not find it, so we just skip it, and stream out "d". We could stream out "c" as empty if that makes a difference.
So when streaming out /clusters you would not be guaranteed a coherent view at any one point in time. That is the way I thought about streaming out /stats -- your view does shift a bit over time. In fact that was the case for the values even before streaming was added -- you wouldn't see an instantaneous snapshot of the stats values, as the data plane runs in parallel with admin.
At the moment, I don't think the admin infrastructure allows for incoherency in source/exe:envoy-static; the admin request running on the main thread would not relinquish control so xDS would not run at all until /clusters completed. However there is an exported and tested async C++ API which does allow xDS to run during an admin request. So we need to define the behavior and test for it. And I have a dormant PR to try to relinquish control when the output buffer reaches its watermark to avoid buffering the whole response.
    }

    bool ClustersRequest::nextChunk(Buffer::Instance& response) {
      UNREFERENCED_PARAMETER(response);
Can we omit these, and use the definitions from the sub-classes?
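If I'm reading the suggestion right, that would mean declaring nextChunk pure virtual on the base request class so each sub-class must provide its own definition, rather than keeping a base-class stub with UNREFERENCED_PARAMETER. A minimal sketch with stand-in types (`std::string` in place of `Buffer::Instance`, hypothetical class names):

```cpp
#include <memory>
#include <string>

class Request {
public:
  virtual ~Request() = default;
  // Pure virtual: no base-class stub, no UNREFERENCED_PARAMETER needed.
  virtual bool nextChunk(std::string& response) = 0;
};

class TextClustersRequest : public Request {
public:
  bool nextChunk(std::string& response) override {
    if (done_) {
      return false;
    }
    response += "cluster_a::observability_name::cluster_a\n";
    done_ = true;
    return true;
  }

private:
  bool done_{false};
};
```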
    #include <cstdint>
    #include <functional>
    #include <memory>
    #include <regex>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't read through this in detail yet, but can you make sure we test the case where we start a /clusters request, emulate an xds update completely changing the set of clusters, and then continue the /clusters request?
Another thing it would be good to do manually is just look at the JSON output in a browser; Firefox has a built-in JSON rendering mode.
/wait
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Commit Message: Adding streaming response to AdminClusters endpoint
Additional Description: Leveraging Envoy::Server::Request to implement a streaming response for AdminClusters
Risk Level: medium
Testing: unit / integration / manual
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
[Optional Fixes #Issue]: #33879