Performance issues around 15k routes/clusters #19946
Is Envoy set up on each node/app? Do you bind Envoy to specific CPUs?
Do you have 16k routes total, or is that the size of the delta?
cc @jmarantz also
I appreciate the help!
There is only a single instance of Envoy to distribute traffic across all apps. No, we don't apply any CPU restrictions to Envoy. The binary is simply run as a systemd service. I'm hesitant to run
The deltas are usually just a small handful of changes. Here's an example from our Envoy logs (trimmed for brevity):
[2022-02-14 22:58:19.838] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2022-02-14 22:58:19.837] cds: add 1 cluster(s), remove 0 cluster(s)
[2022-02-14 22:58:19.817] lds: add/update listener 'https_listener'
[2022-02-14 22:57:37.527] cds: added/updated 0 cluster(s), skipped 0 unmodified cluster(s)
[2022-02-14 22:57:37.527] cds: add 0 cluster(s), remove 2 cluster(s)
[2022-02-14 22:57:37.487] lds: add/update listener 'https_listener'
[2022-02-14 22:56:50.107] lds: add/update listener 'https_listener'
[2022-02-14 22:56:36.228] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2022-02-14 22:56:36.226] cds: add 1 cluster(s), remove 0 cluster(s)
[2022-02-14 22:55:44.435] lds: add/update listener 'https_listener'
Gathering a perf trace would help a lot here. If you are using delta xDS and only changing a few resources at a time, I wouldn't expect this to take much CPU, so something must be broken here; having the perf trace would be super useful.
+1 to starting the deep-dive with a flame-graph. If your system has daily peaks/troughs like some of ours do, maybe you could aim for sometime during either the rising or falling edge, so that you have some headroom for collecting the data without overloading your server, but are not at the trough, which might not be representative. Also -- I'm curious why you don't add more Envoy tasks to take on extra load; at some point you will probably run out of gas in your single task. But let's see what's going on with a flame-graph in your one Envoy first.
I'm actually wondering if this is a dup of #19774 and related to TLS context churning? cc @lambdai @howardjohn @kyessenov
Another random theory: we saw issues with sending a bunch of routes to Envoy. This was non-delta, so probably worse than here. Update latency became huge. What we found was that we had no backpressure, so Envoy was taking (for example) 15s to process routes while we sent new ones every 2s. This compounded and made things repeatedly worse. Simply scaling back to only send new routes once Envoy had ACKed the previous ones made the performance much better. Something to look into if you do not already have such a mechanism in your control plane.
My mental model of the config is that you add/delete these resources as a suite. The theory is that the time to add those 4 as a suite is linear in the scale of the big things; I suspect there is some O(N^2) thing introduced here, but I don't have the insight. You can try watching the metric listener_modified, which is bumped by 1 on each https_listener update: https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/stats Let me know if it matches your pattern.
So I was able to capture a perf profile of the Envoy process. Here's the result from 120 seconds of perf data: https://schierco.nyc3.digitaloceanspaces.com/other/perf.svg
@jmarantz By Envoy tasks, do you mean sharding the clusters/routes across multiple Envoy instances? If so, we'll be moving toward that shortly. I just wanted to see if there were any quick wins for now.
Thanks for the extra info. Looks like you are spending most of the time on stat flushing, which is not surprising to me if no tweaks have been done there. This is @jmarantz's favorite topic, so I will let him take it from here. :) Either way @jmarantz, I feel like we should have some docs on best practices for dealing with stats with giant configs?
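As a general note rather than a diagnosis of this particular flame graph: one bootstrap knob that often comes up with giant configs is stats_flush_interval, which controls how often Envoy flushes stats to any configured sinks (the default is 5s), so at ~16k clusters each flush walks a very large set of stats. A minimal sketch (the 60s value is purely illustrative):

# Bootstrap sketch: flush stats to sinks less frequently than the 5s default.
stats_flush_interval: 60s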
After seeing the stats stuff in the flamegraph, I remembered that we had to disable DataDog's Envoy integration because queries were too slow. I just ran a
Any advice on how to tweak things to make stats collection viable and efficient would be greatly appreciated. And thanks for the quick response!
Can you run an experiment by excluding most stats (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-field-config-metrics-v3-statsmatcher-reject-all)? I suspect you might be creating new symbols for the names of the resources, which somehow causes large overhead in the stats subsystem.
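For reference, that experiment is a small stats_config stanza in the Envoy bootstrap, roughly:

stats_config:
  stats_matcher:
    # Diagnostic only: stop instantiating stats so we can see how much of the
    # update cost comes from the stats subsystem.
    reject_all: true

If rejecting everything turns out to be too aggressive, the same stats_matcher field also accepts an exclusion_list or inclusion_list of name patterns.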
Unfortunately, we never configured hot restart when first setting things up, so any config change triggers ~5-10 minutes of downtime. The docs are actually very confusing in this area: they say that hot restart is enabled by default, but it seems like you have to be running it with the example
Looking at your flame graph,
If it helps, here's the latency chart for updates post restart. Seems to climb pretty rapidly, which leads me to believe @howardjohn's compounding refresh situation might be on the money. Going to attempt:
We'll monitor each rollout and report the changes between them as we go. If anybody has any other suggestions, we'd love to hear them. Thank you for your help thus far! Separately, does the baseline ~30s per route update seem correct for our scale?
Okay, just giving an update here. We were finally able to transition our Envoy instances over to a config with stats excluded for all of our created resources (disabling all stats caused issues). Our time-to-live went from ~180s (blue) to ~20s (purple). Thanks for all the useful help and feedback! 🤗
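Roughly, the final config is shaped like the following (the prefixes below are illustrative, not our literal list):

stats_config:
  stats_matcher:
    exclusion_list:
      patterns:
      # Skip the per-resource stats that dominate at ~16k clusters/routes.
      - prefix: "cluster."
      - prefix: "vhost."

Scoping the exclusion to per-resource prefixes keeps Envoy's own server and listener stats available for monitoring.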
Good news!
We're looking for some performance advice on Envoy and weren't sure where to ask. I couldn't find a paid support channel but, if there was one, we'd be happy to use it.
We use Envoy+go-control-plane at Railway as a forward proxy to distribute incoming requests to our users' deployments. When we switched to Envoy (from Traefik) at around 10k upstreams, config upserts took just a few seconds. Now, at around 16k routes, upserts can take 30s or more.
Each of our users' deployments has:
🔍 snippet from /config_dump
We upsert the above resources over DELTA_GRPC v3 with ADS on every user deployment (roughly every ~5 seconds), which seems to keep Envoy consistently pinned between 100-200% CPU usage (running on a 32 vCPU / 120GB VM).

🙋♂️ Questions:
- Does what we're doing (frequent updates on many clusters) seem within the realm of Envoy's capabilities?
- Does anything stand out as wrong with how we're configuring these?
- What would be the recommended practice for reaching 100k+ upstreams?
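For concreteness, the DELTA_GRPC + ADS wiring on the Envoy side is roughly the following shape (trimmed sketch; xds_cluster is just a placeholder name for the control-plane cluster):

dynamic_resources:
  ads_config:
    # Incremental (delta) ADS over gRPC, using the v3 transport.
    api_type: DELTA_GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: xds_cluster
  lds_config:
    resource_api_version: V3
    ads: {}
  cds_config:
    resource_api_version: V3
    ads: {}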
Thanks, we appreciate any resources/suggestions you have!