Listener drops all connections when per filter config updated with RDS #17109
cc @htuch @adisuissa
I don't think this should be happening; the per-route/vhost overrides are generally resolved by filter chains at runtime, we don't rebuild them. @lambdai @mattklein123 do you know any reason this might happen? @belyalov are you sure there isn't anything else in the listener changing in this scenario?
I can't think of any reason this would be happening. If you are able to repro this easily can you provide some logs during an update?
Yep, there are no changes in the listener, only route updates via xDS.
Yep, we have a stable repro in our production env; so far we have been able to isolate the problem to the xDS -> Envoy route config push. I'll update you with debug logs. Thanks
So I was able to get some logs in debug mode. New routes config pushed:
Right after the config is pushed there are a lot of warning messages about a deprecated field:
Then, after a while, there are a lot of entries about adding connections to the cleanup list:
Finally the new config gets applied and connections are re-established:
Note that the new config takes around 12 seconds to be processed. Could it be that, because of the many warnings being written to the log file, connections expire due to some timeout? Thanks!
Hmm, I could see the main thread hanging for a bit, but I don't think any worker threads should be getting hung. Do you have any other logs or stats on why connections are being closed? Are they local closes or remote closes? I would look at both listener/downstream HTTP stats as well as cluster stats. cc @antoniovicente also.
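(Editorial aside, not from the thread: one way to answer the "local closes or remote closes" question is to grep the close counters out of an Envoy admin /stats dump. The sketch below runs against sample text; the stat group names `ingress_http` and `backend` are hypothetical, and in practice the data would come from the admin endpoint, e.g. `curl -s http://127.0.0.1:9901/stats`.)

```shell
# Sample lines in the shape of an Envoy /stats dump (values are made up).
stats='http.ingress_http.downstream_cx_destroy_local: 12
http.ingress_http.downstream_cx_destroy_remote: 3401
cluster.backend.upstream_cx_destroy_local: 7
cluster.backend.upstream_cx_destroy_remote: 2890'

# The _local vs _remote destroy counters show which side initiated the close.
echo "$stats" | grep 'cx_destroy_local'
echo "$stats" | grep 'cx_destroy_remote'
```

A large jump in the `_remote` counters across the update would point at the peer (here, another Envoy) closing the connections rather than this instance.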
@mattklein123 I can get those stats (and probably more logs) by tomorrow.
An update:
Is it possible that the huge config size takes too long to process (which is why the main thread "hangs"?), so all connections are forced to expire?
I wouldn't expect main thread processing time to affect the workers, but I'm not exactly sure what is going on. More logs would be helpful.
Getting more logs now
Stats: Before DROP
After DROP
They seem to be remote closes (grepped a few of them):
The upstream / downstream for this setup is also Envoy.
I'm generally wondering if we overcommit our CPU cores when we have concurrency == NumCores, as that leaves no room for compute-intensive work on the main thread (config updates, stats dumping), and there may be noisy-neighbor issues. But this is not really based on data; just a conjecture. One short-term remedy might be to set the concurrency below the core count. There are also lots of things we can do (and are working on, in fact) to make these main-thread activities more efficient, but that will take time.
It would be helpful to capture a profile to see where the time is spent. Can you share some info about the structure of the large config being loaded? How many clusters, vhosts, routes, etc.?
If the main thread drags down the workers, I'd also be concerned about the memory situation. Do you see spikes in page faults?
Interesting. I don't have any great theories right now as to why this config spinning on the main thread (which is known) is causing worker issues. Coming back to your original report:
What do you mean by this exactly? Are you positive that the use of this causes the issue but the removal does not? Do you mean per-filter config overrides? Can you be really clear on what delta causes the issue? |
That was our original assumption about the root cause, but it seems to be unrelated.
Some insight into what our "huge" config contains:
Title: Listener drops all connections when per filter config updated with RDS
Description:
We have a simple working setup:
Everything works well, including updates sent by xDS to Envoy, as long as there are no "per-vhost filter config overrides": Envoy successfully updates the route configuration without any impact to traffic.
However, if the route configuration contains per-virtual-host overrides, like
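(The reporter's snippet was not captured in this copy of the thread; a representative per-virtual-host override might look like the sketch below. The vhost, domain, cluster, and filter choice are illustrative, not taken from the reporter's config.)

```yaml
virtual_hosts:
- name: example_vhost            # hypothetical name
  domains: ["example.com"]
  typed_per_filter_config:       # per-vhost filter config override
    envoy.filters.http.buffer:   # illustrative filter choice
      "@type": type.googleapis.com/envoy.extensions.filters.http.buffer.v3.BufferPerRoute
      disabled: true
  routes:
  - match: { prefix: "/" }
    route: { cluster: example_cluster }
```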
the listener re-establishes all connections (visible on the downstream_cx_active gauge). To me it seems like the listener re-creates the HttpConnectionManager, and that somehow causes the old connections to close.
Is this expected behavior?
Is there some parameter that we can adjust?
Any other ideas?
I'd be happy to provide more details, if needed.
Thanks