
Performance issues around 15k routes/clusters #19946

Closed
gschier opened this issue Feb 14, 2022 · 18 comments
Labels: area/perf, area/xds, help wanted (Needs help!), investigate (Potential bug that needs verification)

Comments

gschier commented Feb 14, 2022

We're looking for some performance advice on Envoy and weren't sure where to ask. I couldn't find a paid support channel, but if there were one, we'd be happy to use it.

We use Envoy+go-control-plane at Railway as a forward proxy to distribute incoming requests to our users' deployments. When we switched to Envoy (from Traefik) at around 10k upstreams, config upserts took just a few seconds. Now, at around 16k routes, upserts can take 30s or more.

Each of our users' deployments has:

  • STATIC cluster pointing to IP + PORT
  • Virtual host mapping domain(s) to cluster
  • Filter chain w/ inline SSL certificates
🔍 snippet from /config_dump
[
  {
    "version_info": "3bef5398bedcf6e33c5666d200a3e731b68f9461e2688eca2bc7ad974b9b97b8",
    "last_updated": "2022-02-12T20:50:46.362Z",
    "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "clst_10_138_0_83_6637",
      "type": "STATIC",
      "load_assignment": {
        "cluster_name": "clst_10_138_0_83_6637",
        "endpoints": [ {
          "lb_endpoints": [ {
            "endpoint": {
              "address": { "socket_address": { "address": "10.138.0.83", "port_value": 6637 } }
            }
          }]
        }]
      }
    }
  },
  {
    "name": "flch_example_com",
    "filter_chain_match": { "server_names": [ "example.com" ] },
    "filters": [
      {
        "name": "envoy.filters.network.http_connection_manager",
        "typed_config": {
          "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
          "stat_prefix": "https",
          "rds": {
            "config_source": { "ads": {}, "resource_api_version": "V3" },
            "route_config_name": "https_route_config"
          },
          "http_filters": [ { "name": "envoy.filters.http.router" } ],
          "http2_protocol_options": {},
          "use_remote_address": true,
          "upgrade_configs": [ { "upgrade_type": "websocket" } ]
        }
      }
    ],
    "transport_socket": {
      "name": "envoy.transport_sockets.tls",
      "typed_config": {
        "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext",
        "common_tls_context": {
          "alpn_protocols": [ "h2", "http/1.1" ],
          "tls_certificate_sds_secret_configs": [{
            "name": "scrt_pan_example_com",
            "sds_config": { "ads": {}, "resource_api_version": "V3" }
          }]
        }
      }
    }
  },
  {
    "name": "vhst_example_com_10_138_0_83_6637",
    "domains": [ "example.com", "example.com:443" ],
    "routes": [{
      "match": { "prefix": "/" },
      "route": {
        "cluster": "clst_10_138_0_83_6637",
        "timeout": "0s",
        "upgrade_configs": [ { "upgrade_type": "websocket" } ]
      },
      "name": "rout_example_com_10_138_0_83_6637"
    }]
  },
  {
    "name": "scrt_example_com",
    "version_info": "4c2416216592e1bfdd4f9c5eb0e9dc4da324a18e0d12a340866afe87cd6d911c",
    "last_updated": "2022-02-14T22:23:42.130Z",
    "secret": {
      "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret",
      "name": "scrt_example_com",
      "tls_certificate": {
        "certificate_chain": { "inline_bytes": "..." },
        "private_key": { "inline_bytes": "..." }
      }
    }
  }
]

We upsert the above resources over DELTA_GRPC v3 with ADS on every user deployment (roughly every ~5 seconds), which seems to keep Envoy consistently pinned at 100-200% CPU usage (running on a 32 vCPU / 120 GB VM).
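
For context, the bootstrap wiring for this is roughly the following (a simplified sketch; xds_cluster stands in for the static cluster that points at our go-control-plane server):

dynamic_resources:
  ads_config:
    api_type: DELTA_GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc: { cluster_name: xds_cluster }  # static cluster for the control plane
  cds_config:
    ads: {}
    resource_api_version: V3
  lds_config:
    ads: {}
    resource_api_version: V3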

🙋‍♂️ Questions: Does what we're doing (frequent updates on many clusters) seem within the realm of Envoy's capabilities? Does anything stand out as wrong with how we're configuring these? What would be the recommended practice for reaching 100k+ upstreams?

Thanks, we appreciate any resources/suggestions you have!

gschier added the triage (Issue requires triage) label Feb 14, 2022
daixiang0 (Member) commented:

Is Envoy set up on each node/app? Do you bind Envoy to specific CPUs?
Could you collect pprof info per https://github.com/envoyproxy/envoy/blob/main/bazel/PPROF.md?

rojkov (Member) commented Feb 15, 2022

Do you have 16k routes total or is it the size of delta?

rojkov added the area/perf, area/xds, and investigate (Potential bug that needs verification) labels and removed the triage (Issue requires triage) label Feb 15, 2022
mattklein123 (Member) commented:

cc @jmarantz also

gschier (Author) commented Feb 15, 2022

I appreciate the help!

Is Envoy set up on each node/app? Do you bind Envoy to specific CPUs? Could you collect pprof info per https://github.com/envoyproxy/envoy/blob/main/bazel/PPROF.md?

There is only a single instance of Envoy to distribute traffic across all apps. No, we don't apply any CPU restrictions to Envoy; the binary is simply run as a systemd service. I'm hesitant to run pprof on our production instance, as CPU is already constrained, but I may be able to reproduce this in a staging environment.

Do you have 16k routes total or is it the size of delta?

The deltas are usually just a small handful of changes. Here's an example from our Envoy logs.

# The following logs have been trimmed for brevity
[2022-02-14 22:58:19.838] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2022-02-14 22:58:19.837] cds: add 1 cluster(s), remove 0 cluster(s)
[2022-02-14 22:58:19.817] lds: add/update listener 'https_listener'
[2022-02-14 22:57:37.527] cds: added/updated 0 cluster(s), skipped 0 unmodified cluster(s)
[2022-02-14 22:57:37.527] cds: add 0 cluster(s), remove 2 cluster(s)
[2022-02-14 22:57:37.487] lds: add/update listener 'https_listener'
[2022-02-14 22:56:50.107] lds: add/update listener 'https_listener'
[2022-02-14 22:56:36.228] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2022-02-14 22:56:36.226] cds: add 1 cluster(s), remove 0 cluster(s)
[2022-02-14 22:55:44.435] lds: add/update listener 'https_listener'

mattklein123 (Member) commented:

I'm hesitant to run pprof on our production instance, as CPU is already constrained, but I may be able to reproduce this in a staging environment.

Gathering a perf trace should be relatively low overhead and would provide a lot of info. If you could do this and provide a flame graph that would be very useful.

If you are using delta xDS and only changing a few resources at a time, I wouldn't expect this to take much CPU, so there must be something broken here; having the perf trace would be super useful.
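
Roughly something like this should work (assuming perf and Brendan Gregg's FlameGraph scripts are available on the box; adjust the PID lookup and duration as needed):

# Sample the running Envoy at 99 Hz with call graphs for ~60s.
perf record -F 99 -g -p "$(pgrep -x envoy)" -- sleep 60

# Fold the stacks and render an SVG flame graph.
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > envoy-flame.svg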

jmarantz (Contributor) commented:

+1 to start the deep-dive with a flame-graph.

If your system has daily peaks/troughs like some of ours do, maybe you could aim for sometime during either the rising or falling edge, so that you have some headroom for collecting the data without overloading your server, but are not at the trough, which might not be representative.

Also -- I'm curious why you don't add more Envoy tasks to take on extra load -- at some point you will probably run out of gas in your single task. But let's see what's going on with a flame-graph in your one Envoy first.

mattklein123 (Member) commented:

I'm actually wondering if this is a dup of #19774 and related to TLS context churning? cc @lambdai @howardjohn @kyessenov

howardjohn (Contributor) commented:

Another random theory: we saw issues with sending a bunch of routes to Envoy. This was non-delta, so probably worse than here. Update latency became huge. What we found was that we had no backpressure, so Envoy was taking (for example) 15s to process routes and we sent new ones every 2s. This compounded and made things worse repeatedly. Simply scaling back to only send new routes once Envoy had ACKed the previous ones made the performance much better. Something to look into if you do not already have such a mechanism in your control plane.
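
A rough sketch of that gating idea in Go (pushSnapshot/waitForAck are hypothetical stand-ins, not go-control-plane's actual API): deployment events only mark the config dirty, and a single loop publishes the next coalesced update only after the previous one has been ACKed.

package main

import (
	"strconv"
	"time"
)

// Hypothetical hooks for your control plane's publish and ACK-watch logic.
func pushSnapshot(version string) { /* write the new snapshot/delta to the xDS cache */ }
func waitForAck(version string)   { /* block until Envoy ACKs (or NACKs) this version */ }

func main() {
	dirty := make(chan struct{}, 1)

	// Deployment events just mark the config dirty; bursts coalesce into one pending update.
	go func() {
		for range time.Tick(5 * time.Second) { // stand-in for real deployment events
			select {
			case dirty <- struct{}{}:
			default: // already dirty, nothing to do
			}
		}
	}()

	for version := 1; ; version++ {
		<-dirty // wait for at least one pending change
		v := strconv.Itoa(version)
		pushSnapshot(v)
		waitForAck(v) // backpressure: never push faster than Envoy can apply
	}
}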

lambdai (Contributor) commented Feb 16, 2022

My mental model of each config update is that it adds/deletes:

  • a filter chain in the big listener,
  • a domain in the big route config,
  • a cluster,
  • a secret.

The theory is that the time to apply these four as a suite is linear in the size of those big resources. That is, I'd expect such an operation at the 16k scale to be roughly 1.6x slower than at the 10k scale, even with #19774 unfixed. The O(N^2) behavior there is due to the secret in each of the N clusters being updated, which is not the case here.

I suspect some O(N^2) behavior has been introduced here, but I don't have insight into where.

You can try watching these metrics:

listener_modified: bumps by 1 on each https_listener update
listener_in_place_updated: bumps by 1 on each https_listener update, alongside listener_modified
total_filter_chains_draining: goes to 1 after a listener update (maybe several more if you update the listener frequently) and eventually drops back to 0

https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/stats

Let me know if it matches your pattern.
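
Something like this should pull just those counters via the admin endpoint's regex filter (assuming the admin port is 19000):

# Spot-check only the listener update counters instead of dumping all stats.
curl -s 'localhost:19000/stats?filter=listener_modified|listener_in_place_updated|total_filter_chains_draining'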

mattklein123 added the help wanted (Needs help!) label Feb 16, 2022
gschier (Author) commented Feb 16, 2022

Gathering a perf trace should be relatively low overhead and would provide a lot of info. If you could do this and provide a flame graph that would be very useful.

If you are using delta xDS and only changing a few resources at a time, I wouldn't expect this to take much CPU, so there must be something broken here; having the perf trace would be super useful.

So I was able to use perf to generate perf data but wasn't able to get perf_data_converter to compile. I used github.com/brendangregg/FlameGraph for now, which hopefully is sufficient.

Here's the result from 120 seconds of perf data: https://schierco.nyc3.digitaloceanspaces.com/other/perf.svg

Also -- I'm curious why you don't add more Envoy tasks to take on extra load -- at some point you will probably run out of gas in your single task. But let's see what's going on with a flame-graph in your one Envoy first.

@jmarantz By Envoy tasks, do you mean sharding the clusters/routes across multiple Envoy instances? If so, we'll be moving toward that shortly. I just wanted to see if there were any quick wins for now.

mattklein123 (Member) commented:

Thanks for the extra info. It looks like you are spending most of the time on stat flushing, which is not surprising to me if no tweaks have been done there. This is @jmarantz's favorite topic, so I will let him take it from here. :)

Either way, @jmarantz, I feel like we should have some docs on best practices for dealing with stats for giant configs?

gschier (Author) commented Feb 16, 2022

After seeing the stats stuff in the flamegraph, I remembered that we had to disable DataDog's Envoy integration because queries were too slow. I just ran a /stats query again to see what it's like now:

$ time curl localhost:19000/stats > stats.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  121M    0  121M    0     0  1199k      0 --:--:--  0:01:44 --:--:-- 2098k

real	1m44.079s
user	0m0.156s
sys	0m0.460s

Any advice on how to tweak things to make stats collection viable and efficient would be greatly appreciated.

And thanks for the quick response!

kyessenov (Contributor) commented:

Can you run an experiment by excluding most stats (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-field-config-metrics-v3-statsmatcher-reject-all)? I suspect you might be creating new symbols for the names of the resources, which somehow causes large overhead in the stats subsystem.
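
For the experiment, a minimal bootstrap stats_config along these lines should do it (a sketch; field names per the linked stats.proto):

stats_config:
  stats_matcher:
    reject_all: true   # reject (do not instantiate) all stats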

gschier (Author) commented Feb 16, 2022

Can you run an experiment by excluding most stats (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-field-config-metrics-v3-statsmatcher-reject-all)? I suspect you might be creating new symbols for the names of the resources, which somehow causes large overhead in the stats subsystem.

Unfortunately, we never configured hot restart when first setting things up, so any config change will trigger ~5-10 minutes of downtime. Actually, the docs are very confusing in this area: they say that hot restart is enabled by default, but it seems like you have to run Envoy under the example hot-restarter.py script or similar?
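
From what I can tell, the intended setup is roughly this (a sketch based on the example restarter in the Envoy repo, not something we've run):

# start_envoy.sh -- invoked by the wrapper, which sets $RESTART_EPOCH for each generation
envoy -c /etc/envoy/envoy.yaml --restart-epoch "$RESTART_EPOCH"

# Run Envoy under the wrapper; sending SIGHUP to the wrapper launches a new epoch
# that takes over the listeners and drains the old process instead of dropping connections.
python3 hot-restarter.py start_envoy.sh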

lambdai (Contributor) commented Feb 16, 2022

Looking at your flame graph, I believe the control plane is doing great and the listener is going through in-place updates. You can ignore #19946 (comment).

JakeCooper commented Feb 17, 2022

If it helps, here's the latency chart for updates post restart. Seems to climb pretty rapidly, which leads me to believe @howardjohn's compounding refresh situation might be on the money.

Going to attempt:

  • Adding hot restart
  • Stats disabling
  • Batching of updates to a larger window (15-30s)

We'll monitor each rollout and report the changes between them as we go. If anybody has any other suggestions, we'd love to hear them. Thank you for your help thus far!

Separately, does the baseline of ~30s per route update seem correct for our scale?

[image: latency chart for route updates after the restart]

gschier (Author) commented Mar 17, 2022

Okay, just giving an update here. We were finally able to transition our Envoy instances over to a config with stats excluded for all of our generated resources (disabling all stats caused issues). Our time-to-live went from ~180s (blue) to ~20s (purple).
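
For anyone who lands here later, the shape of the config we ended up with is roughly this (a sketch; the exact patterns depend on which per-resource stats show up in your /stats output):

stats_config:
  stats_matcher:
    exclusion_list:
      patterns:
        # Skip per-resource stats for the generated clusters and virtual hosts,
        # while keeping the global/server stats we actually scrape.
        - prefix: "cluster.clst_"
        - prefix: "vhost.vhst_"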

[image: time-to-live chart dropping from ~180s (blue) to ~20s (purple)]

Thanks for all the useful help and feedback! 🤗

daixiang0 (Member) commented:

Good news!
