During reconnection of Delta XDS subscription, some secrets are never requested #33607
Comments
Thanks for reporting the issue, it does sound like a problem, and I wonder if we can get an AdsIntegrationTest that reproduces this bug.
How long is the warming timeout configured in your case?
Are you using ADS? If so the order should be fixed.
Sending a request for a known EDS is the current (somewhat wrong if you look at the xDS protocol) behavior for SotW, but not for Delta. If you think of it as a PubSub protocol, Envoy has already subscribed to the resource, so there is no need to send the same subscription again. Specifically for EDS, we now have a cache (guarded by this) that "solves" #13009.
We do use ADS. I was a bit surprised by this as well, but in theory it's fine for us. See #13009 (comment) for prior discussion. Note the difference between now and then is that Envoy is not sending a followup EDS request after CDS (it was EDS->CDS->EDS, now it is EDS->CDS without EDS). Before vs. after is likely SotW vs. Delta.
We did have this on for a while, but ended up reverting in istio/istio#49801. I can maybe try to see if it impacts this issue at all.
This is WAI, as it is up to the server to send that EDS without the client requesting it.
But I do wonder about the order of these requests after reconnection, because I thought there was code that intentionally required these to be used. That said, it shouldn't matter (in theory). If you have an infinite timeout for EDS, then the cache won't work for you, and then you will probably have a problem here.
So with SotW, it would normally be CDS->EDS. This works fine of course. When we do spontaneous pushes, we always do CDS+EDS together. The edge case was when it would oddly send EDS first. Originally, it would send EDS -> CDS -> EDS, and we would classify the 2nd EDS as an ACK (no change in the request, we had already sent everything). Our workaround was to always respond to an EDS request that follows a CDS request. With the move to Delta, we no longer see that. We just get EDS -> CDS. So maybe now we should just spontaneously push EDS again after a CDS request if we already saw an EDS request. Or change Envoy to always send CDS first, I suppose.
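The "push EDS again after a CDS request if we already saw an EDS request" idea could look roughly like the sketch below. This is only an illustration of the decision logic, not Istio's implementation: `ConnectionState` and `types_to_push` are hypothetical names, and real servers track far more per-stream state.

```python
from dataclasses import dataclass, field

# Standard xDS v3 type URLs for clusters and endpoints.
CDS = "type.googleapis.com/envoy.config.cluster.v3.Cluster"
EDS = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"


@dataclass
class ConnectionState:
    """Per-stream bookkeeping (hypothetical; not Istio's actual data model)."""
    seen_types: set = field(default_factory=set)


def types_to_push(state: ConnectionState, requested_type: str) -> list:
    """Return the type URLs the server should respond with for one request."""
    push = [requested_type]
    # Under Delta xDS the client does not re-request EDS after CDS, so if EDS
    # was already requested on this stream, push it again alongside CDS.
    if requested_type == CDS and EDS in state.seen_types:
        push.append(EDS)
    state.seen_types.add(requested_type)
    return push
```

With the reconnection order reported in this issue (EDS before CDS), the CDS request yields both CDS and EDS; with the usual CDS-first order, nothing extra is pushed.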
It's not a problem if you use the EDS cache and some timeout.
The idea was to wait for the timeout, and then use the cached version.
IMHO it would be better to keep the original order if possible. On reconnection, Envoy sees the order as EDS then CDS, so it will be reverted...
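To make the cache-plus-timeout idea concrete, here is a toy model of the decision described above: use a fresh EDS response if one arrived, otherwise fall back to the cached assignment once the timeout has elapsed. The function name and parameters are hypothetical and this is not Envoy's code; it only illustrates why an infinite timeout never reaches the cache.

```python
import math
from typing import Optional


def resolve_endpoints(fresh: Optional[str], cached: Optional[str],
                      waited_s: float, timeout_s: float) -> Optional[str]:
    """Pick the endpoint assignment for a warming cluster (toy model).

    Returns None while the cluster should keep warming.
    """
    if fresh is not None:
        # The server answered, so the cache is not needed.
        return fresh
    if waited_s >= timeout_s:
        # Timeout elapsed without a response: fall back to the cached
        # assignment (which may itself be None if nothing was cached).
        return cached
    # Still inside the timeout window: keep waiting.
    return None


# With an infinite timeout the fallback branch is never taken.
still_warming = resolve_endpoints(None, "cached-assignment", 3600.0, math.inf)
```

In this model, `still_warming` stays `None` no matter how long the wait is, which matches the concern that the cache does not help when the EDS timeout is infinite.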
I don't think we will ever change the timeout. I don't think the behavior with a timeout is acceptable. It's either too high and causes updates to take too long, or too low and doesn't allow the server to be temporarily slow. For us, forever warming is the correct behavior if we are just waiting on a slower xDS server. We don't use EDS static clusters, fwiw.
Title: During reconnection of Delta XDS subscription, some secrets are never requested
Description:
We are seeing a pretty subtle condition that occurs during our tests.
We have had the same test running without issues for years. We recently switched to using Delta XDS, and now see it somewhat frequently.
The test sets up an Envoy cluster with an SDS secret reference and sends requests. When the test fails, we see the cluster and secret "warming" in the config dump. In the control plane, we never get an SDS request for the resource.
This seems to happen when there is a disconnection from the control plane at just the right time. We have our XDS connections terminate semi-regularly.
Upon reconnection, Envoy is requesting resources in an unusual order: EDS, LDS, RDS, SDS, CDS. I would typically see CDS first.
After the CDS request, there is no followup EDS request, which makes me suspicious that #13009 is coming into play. However, as far as I know, Envoy is supposed to send an EDS request here (see #13009 (comment) for prior analysis of the same issue).
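When scanning control plane logs for this condition, a small check like the one below can flag streams where EDS was requested before CDS and never again afterwards. It is only a sketch: the short type labels and the function name are assumptions, and extracting the request order from real logs is left out.

```python
def eds_stranded_after_reconnect(order):
    """True if EDS was requested before CDS and not re-requested after it.

    `order` is the sequence of short type labels (e.g. "EDS", "CDS") in the
    order the requests arrived on one stream.
    """
    if "CDS" not in order:
        return False
    cds_at = order.index("CDS")
    eds_before = "EDS" in order[:cds_at]
    eds_after = "EDS" in order[cds_at + 1:]
    return eds_before and not eds_after


# The order reported in this issue.
reported = ["EDS", "LDS", "RDS", "SDS", "CDS"]
flagged = eds_stranded_after_reconnect(reported)
```

The older SotW pattern (EDS, CDS, EDS) and the usual CDS-first order are not flagged by this check.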
Repro steps:
Unfortunately I do not have an isolated reproducer. I am reproducing it myself against Istio's test suite:
Change --keepaliveMaxServerConnectionAge=5s. Turn on debug logs.
./tests/integration/security -run TestMutualTlsOrigination -count 50 --failfast --istio.test.skipWorkloads=tproxy,vm
Config:
Here is an example config dump: https://storage.googleapis.com/istio-prow/logs/integ-ds_istio_postsubmit/1780583422448635904/artifacts/security-df00d96e40154f4688185d/TestMutualTlsOrigination/generic/a.echo1/to_external.external/_test_context/istio-state2391627172/primary-0/istio-egressgateway-75f47d8bd9-7bjfh_proxy-config.json. See the outbound|443||external.external-2-21010.svc.cluster.local warming cluster.
Logs:
Log 1, just showing control plane view:
Full debug logs with both control plane and data plane.
Timeline:
envoy.txt
control.tar.gz