Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LEDS: Adding LEDS protos and initial implementation [WIP] #16487

Closed

Conversation

adisuissa
Copy link
Contributor

Commit Message: LEDS: Adding LEDS protos and initial implementation
Additional Description:
Adding the API protos, and plugging into EDS configurations.
Added an integration test to test valid workflow.
Need to add unit tests and more integration tests.
Currently only works in delta-xDS mode, and using ADS.

Risk Level: Medium - but only if being configured.
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:

Signed-off-by: Adi Suissa-Peleg adip@google.com

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @htuch
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #16487 was opened by adisuissa.

see: more, trace.

@adisuissa
Copy link
Contributor Author

/cc @htuch
The PR can be broken into API and some plumbing PRs if needed.

core.v3.ConfigSource leds_config = 1;

// The xDS transport protocol list or glob collection resource name.
// SotW mode supports only list collections, and delta xDS supports both.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we said that initially, we were not going to support SotW.

// Locality-Endpoint discovery :ref:`architecture overview
// <arch_overview_service_discovery_types_leds>`

service LocalityEndpointDiscoveryService {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, is there any case where we actually need a separate LEDS service, or will all known use-cases of LEDS actually use ADS?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My experience has been that endpoint data is different enough from regular config to merit special data plane treatment. The reasons include that endpoint data often changes faster that the rest of the config and loading endpoint data presents different risks than those seen while loading standard config. The most common failure modes when loading endpoint config is not validation errors but potential for excessive memory/cpu usage due to update size and frequency.

Endpoint distribution may also be one of the first possible targets for federation or sharding of control plane components in order to handle scaling better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think as part of the general push towards supporting disaggregation, a separate service makes sense.

Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great to see progress on LEDS.

// Locality-Endpoint discovery :ref:`architecture overview
// <arch_overview_service_discovery_types_leds>`

service LocalityEndpointDiscoveryService {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My experience has been that endpoint data is different enough from regular config to merit special data plane treatment. The reasons include that endpoint data often changes faster that the rest of the config and loading endpoint data presents different risks than those seen while loading standard config. The most common failure modes when loading endpoint config is not validation errors but potential for excessive memory/cpu usage due to update size and frequency.

Endpoint distribution may also be one of the first possible targets for federation or sharding of control plane components in order to handle scaling better.

envoy_cc_library(
name = "leds_lib",
srcs = ["leds.cc"],
hdrs = ["leds.h"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit odd that the dependencies of this new leds library match the dependencies of what used to be the eds library. Is there potential for trimming down this dependency list?

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great, I think this is a nice architecture and this will support LEDS requirements.

Can you start splitting into smaller PR for review? I think this is pretty huge and some parts can be done individually, e.g. the LEDS subscription.
/wait

// Locality-Endpoint discovery :ref:`architecture overview
// <arch_overview_service_discovery_types_leds>`

service LocalityEndpointDiscoveryService {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think as part of the general push towards supporting disaggregation, a separate service makes sense.


rpc FetchLocalityEndpoints(discovery.v3.DiscoveryRequest)
returns (discovery.v3.DiscoveryResponse) {
option (google.api.http).post = "/v3/discovery:locality_endpoints";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to get away with not supporting REST and SotW here, whatever simplifies and is efficient.

// TODO(adisuissa): here we know the exact locality that is now active, so we can
// optimize this process by moving the locality from warming set ot active set, and call
// the update if the warming set is empty.
if (validateAllLedsActive()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the design here support transitive warming flows?

E.g. when a new cluster is added, it will warm on its EDS update. We need to delay the warming of the EDS update until all the dependent LEDS updates have occurred.

// Create a new LEDS subscription and add it to the subscriptions map.
LedsSubscriptionPtr leds_locality_subscription = std::make_unique<LedsSubscription>(
leds_config, cluster_name_, factory_context_, /*nullptr,*/ [&]() {
// Called upon an update to the locality
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a timeout flow supported, where after some configurable period, e.g. 15s default, we can switch all LEDS to active and unblock initialization?

@@ -1666,6 +1666,75 @@ TEST_P(XdsTpAdsIntegrationTest, Basic) {
makeSingleRequest();
}

// TODO(adisuissa): Add a test with static cluster and load assignment, but dynamic endpoints.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we'll also need a bunch of unit tests exploring the corner case behaviors around warming and sharing of LEDS subscriptions across EDS subs.


// Add all the LEDS localities that are new.
for (const auto& leds_config : cla_leds_configs) {
if (leds_localities_.find(leds_config) == leds_localities_.end()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider making LEDS localities a cluster manager property to share across EDS subscriptions? I think it's probably overkill given that generally clusters won't share.

@jmarantz
Copy link
Contributor

@jmarantz subscribing

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jun 17, 2021
@github-actions
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Jun 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api stale stalebot believes this issue/PR has not been touched recently waiting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants