-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xds-failover: plumbing the API and adding integration test #34761
Conversation
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Also assigning Kateryna to expedite the review cycle. |
RETURN_IF_NOT_OK(checkApiConfigSourceNames(api_config_source)); | ||
RETURN_IF_NOT_OK(checkApiConfigSourceNames( | ||
api_config_source, | ||
Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support") ? 2 : 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be computed in the `checkApiConfigSourceNames function instead of being passed as function param?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will require the checkApiConfigSourceNames()
function to become a method (rather than a function), or the runtime passed to it, and I'm not sure if it's worth it as this runtime-flag will disappear in the future.
If you feel strongly about this, then I'll pass the runtime as a second param.
bool skip_cluster_check, int grpc_service_idx) { | ||
RETURN_IF_NOT_OK(checkApiConfigSourceNames( | ||
api_config_source, | ||
Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support") ? 2 : 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above.
grpc_mux_context.scope_, | ||
// TODO(adisuissa): the backoff strategy for the failover should | ||
// be the same as the primary source. | ||
std::make_unique<FixedBackOffStrategy>(500), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit could factor out this magic number 500
into constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, although IMHO it makes the code less readable (because of the template).
auto factory_primary_or_error = Config::Utility::factoryForGrpcApiConfigSource( | ||
*async_client_manager_, dyn_resources.ads_config(), *stats_.rootScope(), false, 0); | ||
THROW_IF_STATUS_NOT_OK(factory_primary_or_error, throw); | ||
Grpc::AsyncClientFactoryPtr factory_failover = nullptr; | ||
if (Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support")) { | ||
auto factory_failover_or_error = Config::Utility::factoryForGrpcApiConfigSource( | ||
*async_client_manager_, dyn_resources.ads_config(), *stats_.rootScope(), false, 1); | ||
THROW_IF_STATUS_NOT_OK(factory_failover_or_error, throw); | ||
factory_failover = std::move(factory_failover_or_error.value()); | ||
} | ||
ads_mux_ = factory->create( | ||
factory_primary_or_error.value()->createUncachedRawAsyncClient(), | ||
factory_failover ? factory_failover->createUncachedRawAsyncClient() : nullptr, | ||
dispatcher_, random_, *stats_.rootScope(), dyn_resources.ads_config(), local_info_, | ||
std::move(custom_config_validators), std::move(backoff_strategy), | ||
makeOptRefFromPtr(xds_config_tracker_.get()), {}, use_eds_cache); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this large code block is is identical to one below, could be factored out into a dedicated function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few differences between the initializations of SotW and delta, and in this PR I don't want to make major changes - I'm keeping the same pattern as was prior to this PR.
We are planning to work more on this code base, so we will definitely change it later.
test/extensions/config_subscription/grpc/xds_failover_integration_test.cc
Outdated
Show resolved
Hide resolved
} | ||
|
||
// Validate that once failover answers, primary will not be used, even after disconnecting. | ||
TEST_P(XdsFailoverAdsIntegrationTest, NoPrimaryUseAfterFailoverResponse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would be happening if primary is not responding and failover as well, would it try to again connect to primary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Envoy will alternate between the primary and failover over and over until one of them respondes. We've got a unit test that covers this in
TEST_F(GrpcMuxFailoverTest, AlternatingBetweenFailoverAndPrimary) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, just wanted make sure this is covered by tests.
@@ -167,7 +167,7 @@ class BaseIntegrationTest : protected Logger::Loggable<Logger::Id::testing> { | |||
envoy::config::core::v3::Node last_node_; | |||
|
|||
// Functions for testing reloadable config (xDS) | |||
void createXdsUpstream(); | |||
virtual void createXdsUpstream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this break other tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Until now other tests just called this method. The new test is the only one overriding it. In c++ when a method is not declared virtual it cannot be overridden.
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments @nezdolik. This is ready for another round.
RETURN_IF_NOT_OK(checkApiConfigSourceNames(api_config_source)); | ||
RETURN_IF_NOT_OK(checkApiConfigSourceNames( | ||
api_config_source, | ||
Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support") ? 2 : 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will require the checkApiConfigSourceNames()
function to become a method (rather than a function), or the runtime passed to it, and I'm not sure if it's worth it as this runtime-flag will disappear in the future.
If you feel strongly about this, then I'll pass the runtime as a second param.
bool skip_cluster_check, int grpc_service_idx) { | ||
RETURN_IF_NOT_OK(checkApiConfigSourceNames( | ||
api_config_source, | ||
Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support") ? 2 : 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above.
auto factory_primary_or_error = Config::Utility::factoryForGrpcApiConfigSource( | ||
*async_client_manager_, dyn_resources.ads_config(), *stats_.rootScope(), false, 0); | ||
THROW_IF_STATUS_NOT_OK(factory_primary_or_error, throw); | ||
Grpc::AsyncClientFactoryPtr factory_failover = nullptr; | ||
if (Runtime::runtimeFeatureEnabled("envoy.restart_features.xds_failover_support")) { | ||
auto factory_failover_or_error = Config::Utility::factoryForGrpcApiConfigSource( | ||
*async_client_manager_, dyn_resources.ads_config(), *stats_.rootScope(), false, 1); | ||
THROW_IF_STATUS_NOT_OK(factory_failover_or_error, throw); | ||
factory_failover = std::move(factory_failover_or_error.value()); | ||
} | ||
ads_mux_ = factory->create( | ||
factory_primary_or_error.value()->createUncachedRawAsyncClient(), | ||
factory_failover ? factory_failover->createUncachedRawAsyncClient() : nullptr, | ||
dispatcher_, random_, *stats_.rootScope(), dyn_resources.ads_config(), local_info_, | ||
std::move(custom_config_validators), std::move(backoff_strategy), | ||
makeOptRefFromPtr(xds_config_tracker_.get()), {}, use_eds_cache); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few differences between the initializations of SotW and delta, and in this PR I don't want to make major changes - I'm keeping the same pattern as was prior to this PR.
We are planning to work more on this code base, so we will definitely change it later.
} | ||
|
||
// Validate that once failover answers, primary will not be used, even after disconnecting. | ||
TEST_P(XdsFailoverAdsIntegrationTest, NoPrimaryUseAfterFailoverResponse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Envoy will alternate between the primary and failover over and over until one of them respondes. We've got a unit test that covers this in
TEST_F(GrpcMuxFailoverTest, AlternatingBetweenFailoverAndPrimary) { |
@@ -167,7 +167,7 @@ class BaseIntegrationTest : protected Logger::Loggable<Logger::Id::testing> { | |||
envoy::config::core::v3::Node last_node_; | |||
|
|||
// Functions for testing reloadable config (xDS) | |||
void createXdsUpstream(); | |||
virtual void createXdsUpstream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Until now other tests just called this method. The new test is the only one overriding it. In c++ when a method is not declared virtual it cannot be overridden.
@htuch this is ready for another round of reviews. |
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
auto* failover_google_grpc = failover_grpc_service->mutable_google_grpc(); | ||
auto* failover_ssl_creds = | ||
failover_google_grpc->mutable_channel_credentials()->mutable_ssl_credentials(); | ||
failover_ssl_creds->mutable_root_certs()->set_filename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think there is a fair bit of setup boiler plate here that could be nicer to refactor a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, I agree.
I'll follow up on this in a subsequent PR.
Would be good to see a coverage report, for whatever reason it's not in the checks (maybe because of failures?) LGTM otherwise! |
/retest |
It was a non-related flaky test. |
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Commit Message: xds-failover: plumbing the API and adding integration test
Additional Description:
Adding the ability to use xDS-Failover into the API (ADS config).
Introducing the xds-failover integration tests.
This is all guarded behind the
envoy.restart_features.xds_failover_support
guard.In the future, once we test xDS-Failover out and ensure it works as expected, the runtime guard will be removed.
Risk Level: low if not configured, medium if it is.
Testing: Added unit tests and integration tests
Docs Changes: N/A (once xDS-failover behavior will be finalized).
Release Notes: N/A (once xDS-failover behavior will be finalized).
Platform Specific Features: N/A.