Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds-failover: plumbing the API and adding integration test #34761

Merged
merged 10 commits into from
Jun 28, 2024

Conversation

adisuissa
Copy link
Contributor

Commit Message: xds-failover: plumbing the API and adding integration test
Additional Description:
Adding the ability to use xDS-Failover into the API (ADS config).
Introducing the xds-failover integration tests.
This is all guarded behind the envoy.restart_features.xds_failover_support guard.
In the future, once we test xDS-Failover out and ensure it works as expected, the runtime guard will be removed.

Risk Level: low if not configured, medium if it is.
Testing: Added unit tests and integration tests
Docs Changes: N/A (once xDS-failover behavior will be finalized).
Release Notes: N/A (once xDS-failover behavior will be finalized).
Platform Specific Features: N/A.

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #34761 was opened by adisuissa.

see: more, trace.

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@adisuissa adisuissa marked this pull request as ready for review June 20, 2024 03:30
@adisuissa
Copy link
Contributor Author

Also assigning Kateryna to expedite the review cycle.
/assign @nezdolik

RETURN_IF_NOT_OK(checkApiConfigSourceNames(api_config_source));
RETURN_IF_NOT_OK(checkApiConfigSourceNames(
api_config_source,
Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support") ? 2 : 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be computed in the `checkApiConfigSourceNames function instead of being passed as function param?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will require the checkApiConfigSourceNames() function to become a method (rather than a function), or the runtime passed to it, and I'm not sure if it's worth it as this runtime-flag will disappear in the future.
If you feel strongly about this, then I'll pass the runtime as a second param.

bool skip_cluster_check, int grpc_service_idx) {
RETURN_IF_NOT_OK(checkApiConfigSourceNames(
api_config_source,
Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support") ? 2 : 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above.

grpc_mux_context.scope_,
// TODO(adisuissa): the backoff strategy for the failover should
// be the same as the primary source.
std::make_unique<FixedBackOffStrategy>(500),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit could factor out this magic number 500 into constant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, although IMHO it makes the code less readable (because of the template).

Comment on lines 459 to 474
auto factory_primary_or_error = Config::Utility::factoryForGrpcApiConfigSource(
*async_client_manager_, dyn_resources.ads_config(), *stats_.rootScope(), false, 0);
THROW_IF_STATUS_NOT_OK(factory_primary_or_error, throw);
Grpc::AsyncClientFactoryPtr factory_failover = nullptr;
if (Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support")) {
auto factory_failover_or_error = Config::Utility::factoryForGrpcApiConfigSource(
*async_client_manager_, dyn_resources.ads_config(), *stats_.rootScope(), false, 1);
THROW_IF_STATUS_NOT_OK(factory_failover_or_error, throw);
factory_failover = std::move(factory_failover_or_error.value());
}
ads_mux_ = factory->create(
factory_primary_or_error.value()->createUncachedRawAsyncClient(),
factory_failover ? factory_failover->createUncachedRawAsyncClient() : nullptr,
dispatcher_, random_, *stats_.rootScope(), dyn_resources.ads_config(), local_info_,
std::move(custom_config_validators), std::move(backoff_strategy),
makeOptRefFromPtr(xds_config_tracker_.get()), {}, use_eds_cache);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this large code block is is identical to one below, could be factored out into a dedicated function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few differences between the initializations of SotW and delta, and in this PR I don't want to make major changes - I'm keeping the same pattern as was prior to this PR.
We are planning to work more on this code base, so we will definitely change it later.

}

// Validate that once failover answers, primary will not be used, even after disconnecting.
TEST_P(XdsFailoverAdsIntegrationTest, NoPrimaryUseAfterFailoverResponse) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would be happening if primary is not responding and failover as well, would it try to again connect to primary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Envoy will alternate between the primary and failover over and over until one of them respondes. We've got a unit test that covers this in

TEST_F(GrpcMuxFailoverTest, AlternatingBetweenFailoverAndPrimary) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, just wanted make sure this is covered by tests.

@@ -167,7 +167,7 @@ class BaseIntegrationTest : protected Logger::Loggable<Logger::Id::testing> {
envoy::config::core::v3::Node last_node_;

// Functions for testing reloadable config (xDS)
void createXdsUpstream();
virtual void createXdsUpstream();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this break other tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Until now other tests just called this method. The new test is the only one overriding it. In c++ when a method is not declared virtual it cannot be overridden.

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Copy link
Contributor Author

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments @nezdolik. This is ready for another round.

RETURN_IF_NOT_OK(checkApiConfigSourceNames(api_config_source));
RETURN_IF_NOT_OK(checkApiConfigSourceNames(
api_config_source,
Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support") ? 2 : 1));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will require the checkApiConfigSourceNames() function to become a method (rather than a function), or the runtime passed to it, and I'm not sure if it's worth it as this runtime-flag will disappear in the future.
If you feel strongly about this, then I'll pass the runtime as a second param.

bool skip_cluster_check, int grpc_service_idx) {
RETURN_IF_NOT_OK(checkApiConfigSourceNames(
api_config_source,
Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support") ? 2 : 1));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above.

Comment on lines 459 to 474
auto factory_primary_or_error = Config::Utility::factoryForGrpcApiConfigSource(
*async_client_manager_, dyn_resources.ads_config(), *stats_.rootScope(), false, 0);
THROW_IF_STATUS_NOT_OK(factory_primary_or_error, throw);
Grpc::AsyncClientFactoryPtr factory_failover = nullptr;
if (Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support")) {
auto factory_failover_or_error = Config::Utility::factoryForGrpcApiConfigSource(
*async_client_manager_, dyn_resources.ads_config(), *stats_.rootScope(), false, 1);
THROW_IF_STATUS_NOT_OK(factory_failover_or_error, throw);
factory_failover = std::move(factory_failover_or_error.value());
}
ads_mux_ = factory->create(
factory_primary_or_error.value()->createUncachedRawAsyncClient(),
factory_failover ? factory_failover->createUncachedRawAsyncClient() : nullptr,
dispatcher_, random_, *stats_.rootScope(), dyn_resources.ads_config(), local_info_,
std::move(custom_config_validators), std::move(backoff_strategy),
makeOptRefFromPtr(xds_config_tracker_.get()), {}, use_eds_cache);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few differences between the initializations of SotW and delta, and in this PR I don't want to make major changes - I'm keeping the same pattern as was prior to this PR.
We are planning to work more on this code base, so we will definitely change it later.

}

// Validate that once failover answers, primary will not be used, even after disconnecting.
TEST_P(XdsFailoverAdsIntegrationTest, NoPrimaryUseAfterFailoverResponse) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Envoy will alternate between the primary and failover over and over until one of them respondes. We've got a unit test that covers this in

TEST_F(GrpcMuxFailoverTest, AlternatingBetweenFailoverAndPrimary) {

@@ -167,7 +167,7 @@ class BaseIntegrationTest : protected Logger::Loggable<Logger::Id::testing> {
envoy::config::core::v3::Node last_node_;

// Functions for testing reloadable config (xDS)
void createXdsUpstream();
virtual void createXdsUpstream();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Until now other tests just called this method. The new test is the only one overriding it. In c++ when a method is not declared virtual it cannot be overridden.

@adisuissa
Copy link
Contributor Author

@htuch this is ready for another round of reviews.
RE our discussion about simulated-time, I've attempted using it, but some connections that are expected are not being attempted by Envoy. I'll continue debugging this, but I trust the real-time over the simulated-time for now.

nezdolik
nezdolik previously approved these changes Jun 26, 2024
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
auto* failover_google_grpc = failover_grpc_service->mutable_google_grpc();
auto* failover_ssl_creds =
failover_google_grpc->mutable_channel_credentials()->mutable_ssl_credentials();
failover_ssl_creds->mutable_root_certs()->set_filename(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think there is a fair bit of setup boiler plate here that could be nicer to refactor a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, I agree.
I'll follow up on this in a subsequent PR.

@htuch
Copy link
Member

htuch commented Jun 27, 2024

Would be good to see a coverage report, for whatever reason it's not in the checks (maybe because of failures?)

LGTM otherwise!

@adisuissa
Copy link
Contributor Author

/retest

@adisuissa
Copy link
Contributor Author

Would be good to see a coverage report, for whatever reason it's not in the checks (maybe because of failures?)

LGTM otherwise!

It was a non-related flaky test.
Coverage should be available at: https://storage.googleapis.com/envoy-pr/295784d/coverage/index.html

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@htuch htuch enabled auto-merge (squash) June 27, 2024 21:48
@htuch htuch merged commit 3feff04 into envoyproxy:main Jun 28, 2024
49 of 51 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants