
Performance issues around 15k routes/clusters #19946

Closed
gschier opened this issue Feb 14, 2022 · 18 comments
Labels: area/perf, area/xds, help wanted (Needs help!), investigate (Potential bug that needs verification)

Comments

gschier commented Feb 14, 2022

We're looking for some performance advice on Envoy and weren't sure where to ask. I couldn't find a paid support channel, but if there were one, we'd be happy to use it.

We use Envoy+go-control-plane at Railway as a forward proxy to distribute incoming requests to our users' deployments. When we switched to Envoy (from Traefik) at around 10k upstreams, config upserts took just a few seconds. Now, at around 16k routes, upserts can take 30s or more.

Each of our users' deployments has:

  • STATIC cluster pointing to IP + PORT
  • Virtual host mapping domain(s) to cluster
  • Filter chain w/ inline SSL certificates
🔍 snippet from /config_dump
[
  {
    "version_info": "3bef5398bedcf6e33c5666d200a3e731b68f9461e2688eca2bc7ad974b9b97b8",
    "last_updated": "2022-02-12T20:50:46.362Z",
    "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "clst_10_138_0_83_6637",
      "type": "STATIC",
      "load_assignment": {
        "cluster_name": "clst_10_138_0_83_6637",
        "endpoints": [ {
          "lb_endpoints": [ {
            "endpoint": {
              "address": { "socket_address": { "address": "10.138.0.83", "port_value": 6637 } }
            }
          }]
        }]
      }
    }
  },
  {
    "name": "flch_example_com",
    "filter_chain_match": { "server_names": [ "example.com" ] },
    "filters": [
      {
        "name": "envoy.filters.network.http_connection_manager",
        "typed_config": {
          "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
          "stat_prefix": "https",
          "rds": {
            "config_source": { "ads": {}, "resource_api_version": "V3" },
            "route_config_name": "https_route_config"
          },
          "http_filters": [ { "name": "envoy.filters.http.router" } ],
          "http2_protocol_options": {},
          "use_remote_address": true,
          "upgrade_configs": [ { "upgrade_type": "websocket" } ]
        }
      }
    ],
    "transport_socket": {
      "name": "envoy.transport_sockets.tls",
      "typed_config": {
        "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext",
        "common_tls_context": {
          "alpn_protocols": [ "h2", "http/1.1" ],
          "tls_certificate_sds_secret_configs": [{
            "name": "scrt_pan_example_com",
            "sds_config": { "ads": {}, "resource_api_version": "V3" }
          }]
        }
      }
    }
  },
  {
    "name": "vhst_example_com_10_138_0_83_6637",
    "domains": [ "example.com", "example.com:443" ],
    "routes": [{
      "match": { "prefix": "/" },
      "route": {
        "cluster": "clst_10_138_0_83_6637",
        "timeout": "0s",
        "upgrade_configs": [ { "upgrade_type": "websocket" } ]
      },
      "name": "rout_example_com_10_138_0_83_6637"
    }]
  },
  {
    "name": "scrt_example_com",
    "version_info": "4c2416216592e1bfdd4f9c5eb0e9dc4da324a18e0d12a340866afe87cd6d911c",
    "last_updated": "2022-02-14T22:23:42.130Z",
    "secret": {
      "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret",
      "name": "scrt_example_com",
      "tls_certificate": {
        "certificate_chain": { "inline_bytes": "..." },
        "private_key": { "inline_bytes": "..." }
      }
    }
  }
]

We upsert the above resources over DELTA_GRPC v3 with ADS on every user deployment (roughly every ~5 seconds), which seems to keep Envoy consistently pinned at 100-200% CPU usage (running on a 32 vCPU / 120 GB VM).
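
For context, the bootstrap wiring for this is roughly the following (a simplified sketch; xds_cluster stands in for the static cluster that points at our go-control-plane server):

dynamic_resources:
  ads_config:
    api_type: DELTA_GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc: { cluster_name: xds_cluster }  # static cluster for the control plane
  cds_config:
    ads: {}
    resource_api_version: V3
  lds_config:
    ads: {}
    resource_api_version: V3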

🙋‍♂️ Questions: Does what we're doing (frequent updates on many clusters) seem within the realm of Envoy's capabilities? Does anything stand out as wrong with how we're configuring these? What would be the recommended practice for reaching 100k+ upstreams?

Thanks, we appreciate any resources/suggestions you have!

gschier added the triage (Issue requires triage) label Feb 14, 2022
daixiang0 (Member) commented:

Is Envoy set up on each node/app? Do you bind Envoy to specific CPUs?
Could you collect pprof info per https://github.com/envoyproxy/envoy/blob/main/bazel/PPROF.md?

rojkov (Member) commented Feb 15, 2022

Do you have 16k routes total or is it the size of delta?

rojkov added the area/perf, area/xds, and investigate (Potential bug that needs verification) labels and removed the triage (Issue requires triage) label Feb 15, 2022
mattklein123 (Member) commented:

cc @jmarantz also

gschier (Author) commented Feb 15, 2022

I appreciate the help!

Is Envoy set up on each node/app? Do you bind Envoy to specific CPUs? Could you collect pprof info per https://github.com/envoyproxy/envoy/blob/main/bazel/PPROF.md?

There is only a single instance of Envoy to distribute traffic across all apps. No, we don't apply any CPU restrictions to Envoy; the binary is simply run as a systemd service. I'm hesitant to run pprof on our production instance, as CPU is already constrained, but I may be able to reproduce this in a staging environment.

Do you have 16k routes total or is it the size of delta?

The deltas are usually just a small handful of changes. Here's an example from our Envoy logs.

# The following logs have been trimmed for brevity
[2022-02-14 22:58:19.838] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2022-02-14 22:58:19.837] cds: add 1 cluster(s), remove 0 cluster(s)
[2022-02-14 22:58:19.817] lds: add/update listener 'https_listener'
[2022-02-14 22:57:37.527] cds: added/updated 0 cluster(s), skipped 0 unmodified cluster(s)
[2022-02-14 22:57:37.527] cds: add 0 cluster(s), remove 2 cluster(s)
[2022-02-14 22:57:37.487] lds: add/update listener 'https_listener'
[2022-02-14 22:56:50.107] lds: add/update listener 'https_listener'
[2022-02-14 22:56:36.228] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2022-02-14 22:56:36.226] cds: add 1 cluster(s), remove 0 cluster(s)
[2022-02-14 22:55:44.435] lds: add/update listener 'https_listener'

mattklein123 (Member) commented:

I'm hesitant to run pprof on our production instance, as CPU is already constrained, but I may be able to reproduce this in a staging environment.

Gathering a perf trace should be relatively low overhead and would provide a lot of info. If you could do this and provide a flame graph that would be very useful.

If you are using delta xDS and only changing a few resources at a time, I wouldn't expect this to take much CPU, so there must be something broken here; having the perf trace would be super useful.
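
Roughly something like this should work (assuming perf and Brendan Gregg's FlameGraph scripts are available on the box; adjust the PID lookup and duration as needed):

# Sample the running Envoy at 99 Hz with call graphs for ~60s.
perf record -F 99 -g -p "$(pgrep -x envoy)" -- sleep 60

# Fold the stacks and render an SVG flame graph.
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > envoy-flame.svg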

jmarantz (Contributor) commented:

+1 to start the deep-dive with a flame-graph.

If your system has daily peaks/troughs like some of ours do, maybe you could aim for sometime during either the rising or falling edge, so that you have some headroom for collecting the data without overloading your server, but are not at the trough, which might not be representative.

Also -- I'm curious why you don't add more Envoy tasks to take on extra load -- at some point you will probably run out of gas in your single task. But let's see what's going on with a flame-graph in your one Envoy first.

mattklein123 (Member) commented:

I'm actually wondering if this is a dup of #19774 and related to TLS context churning? cc @lambdai @howardjohn @kyessenov

howardjohn (Contributor) commented:

Another random theory: we saw issues with sending a bunch of routes to Envoy. This was non-delta, so probably worse than here. Update latency became huge. What we found was that we had no backpressure, so Envoy was taking (for example) 15s to process routes and we sent new ones every 2s. This compounded and made things worse repeatedly. Simply scaling back to only send new routes once Envoy had ACKed the previous ones made the performance much better. Something to look into if you do not already have such a mechanism in your control plane.
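
A rough sketch of that gating idea in Go (pushSnapshot/waitForAck are hypothetical stand-ins, not go-control-plane's actual API): deployment events only mark the config dirty, and a single loop publishes the next coalesced update only after the previous one has been ACKed.

package main

import (
	"strconv"
	"time"
)

// Hypothetical hooks for your control plane's publish and ACK-watch logic.
func pushSnapshot(version string) { /* write the new snapshot/delta to the xDS cache */ }
func waitForAck(version string)   { /* block until Envoy ACKs (or NACKs) this version */ }

func main() {
	dirty := make(chan struct{}, 1)

	// Deployment events just mark the config dirty; bursts coalesce into one pending update.
	go func() {
		for range time.Tick(5 * time.Second) { // stand-in for real deployment events
			select {
			case dirty <- struct{}{}:
			default: // already dirty, nothing to do
			}
		}
	}()

	for version := 1; ; version++ {
		<-dirty // wait for at least one pending change
		v := strconv.Itoa(version)
		pushSnapshot(v)
		waitForAck(v) // backpressure: never push faster than Envoy can apply
	}
}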

lambdai (Contributor) commented Feb 16, 2022

My mental model of each config update is that it adds/deletes:

  • a filter chain in the big listener,
  • a domain in the big route config,
  • a cluster,
  • a secret.

The theory is that the time to apply these four as a suite is linear in the size of those big resources. That is, I'd expect such an operation at the 16k scale to be roughly 1.6x slower than at the 10k scale, even with #19774 unfixed. The O(N^2) behavior there is due to the secret in each of the N clusters being updated, which is not the case here.

I suspect some O(N^2) behavior has been introduced here, but I don't have insight into where.

You can try watching these metrics:

listener_modified: bumps by 1 on each https_listener update
listener_in_place_updated: bumps by 1 on each https_listener update, alongside listener_modified
total_filter_chains_draining: goes to 1 after a listener update (maybe several more if you update the listener frequently) and eventually drops back to 0

https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/stats

Let me know if it matches your pattern.
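
Something like this should pull just those counters via the admin endpoint's regex filter (assuming the admin port is 19000):

# Spot-check only the listener update counters instead of dumping all stats.
curl -s 'localhost:19000/stats?filter=listener_modified|listener_in_place_updated|total_filter_chains_draining'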

mattklein123 added the help wanted (Needs help!) label Feb 16, 2022
gschier (Author) commented Feb 16, 2022

Gathering a perf trace should be relatively low overhead and would provide a lot of info. If you could do this and provide a flame graph that would be very useful.

If you are using delta xDS and only changing a few resources at a time, I wouldn't expect this to take much CPU, so there must be something broken here; having the perf trace would be super useful.

So I was able to use perf to generate perf data but wasn't able to get perf_data_converter to compile. I used github.com/brendangregg/FlameGraph for now, which hopefully is sufficient.

Here's the result from 120 seconds of perf data: https://schierco.nyc3.digitaloceanspaces.com/other/perf.svg

Also -- I'm curious why you don't add more Envoy tasks to take on extra load -- at some point you will probably run out of gas in your single task. But let's see what's going on with a flame-graph in your one Envoy first.

@jmarantz By Envoy tasks, do you mean sharding the clusters/routes across multiple Envoy instances? If so, we'll be moving toward that shortly. I just wanted to see if there were any quick wins for now.

mattklein123 (Member) commented:

Thanks for the extra info. It looks like you are spending most of the time on stat flushing, which is not surprising to me if no tweaks have been done there. This is @jmarantz's favorite topic, so I will let him take it from here. :)

Either way, @jmarantz, I feel like we should have some docs on best practices for dealing with stats for giant configs?

gschier (Author) commented Feb 16, 2022

After seeing the stats stuff in the flamegraph, I remembered that we had to disable DataDog's Envoy integration because queries were too slow. I just ran a /stats query again to see what it's like now:

$ time curl localhost:19000/stats > stats.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  121M    0  121M    0     0  1199k      0 --:--:--  0:01:44 --:--:-- 2098k

real	1m44.079s
user	0m0.156s
sys	0m0.460s

Any advice on how to tweak things to make stats collection viable and efficient would be greatly appreciated.

And thanks for the quick response!

kyessenov (Contributor) commented:

Can you run an experiment by excluding most stats (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-field-config-metrics-v3-statsmatcher-reject-all)? I suspect you might be creating new symbols for the names of the resources, which somehow causes large overhead in the stats subsystem.
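
For the experiment, a minimal bootstrap stats_config along these lines should do it (a sketch; field names per the linked stats.proto):

stats_config:
  stats_matcher:
    reject_all: true   # reject (do not instantiate) all stats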

gschier (Author) commented Feb 16, 2022

Can you run an experiment by excluding most stats (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#envoy-v3-api-field-config-metrics-v3-statsmatcher-reject-all)? I suspect you might be creating new symbols for the names of the resources, which somehow causes large overhead in the stats subsystem.

Unfortunately, we never configured hot restart when first setting things up, so any config change will trigger ~5-10 minutes of downtime. Actually, the docs are very confusing in this area: they say that hot restart is enabled by default, but it seems like you have to run Envoy under the example hot-restarter.py script or similar?
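
From what I can tell, the intended setup is roughly this (a sketch based on the example restarter in the Envoy repo, not something we've run):

# start_envoy.sh -- invoked by the wrapper, which sets $RESTART_EPOCH for each generation
envoy -c /etc/envoy/envoy.yaml --restart-epoch "$RESTART_EPOCH"

# Run Envoy under the wrapper; sending SIGHUP to the wrapper launches a new epoch
# that takes over the listeners and drains the old process instead of dropping connections.
python3 hot-restarter.py start_envoy.sh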

lambdai (Contributor) commented Feb 16, 2022

Looking at your flame graph, I believe the control plane is doing great and the listener is going through in-place updates. You can ignore #19946 (comment).

JakeCooper commented Feb 17, 2022

If it helps, here's the latency chart for updates post restart. Seems to climb pretty rapidly, which leads me to believe @howardjohn's compounding refresh situation might be on the money.

Going to attempt:

  • Adding hot restart
  • Stats disabling
  • Batching of updates to a larger window (15-30s)

We'll monitor each rollout and report the changes between them as we go. If anybody has any other suggestions, we'd love to hear them. Thank you for your help thus far!

Separately, does the baseline of ~30s per route update seem correct for our scale?

[image: latency chart for route updates after the restart]

gschier (Author) commented Mar 17, 2022

Okay, just giving an update here. We were finally able to transition our Envoy instances over to a config with stats excluded for all of our generated resources (disabling all stats caused issues). Our time-to-live went from ~180s (blue) to ~20s (purple).
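
For anyone who lands here later, the shape of the config we ended up with is roughly this (a sketch; the exact patterns depend on which per-resource stats show up in your /stats output):

stats_config:
  stats_matcher:
    exclusion_list:
      patterns:
        # Skip per-resource stats for the generated clusters and virtual hosts,
        # while keeping the global/server stats we actually scrape.
        - prefix: "cluster.clst_"
        - prefix: "vhost.vhst_"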

[image: time-to-live chart dropping from ~180s (blue) to ~20s (purple)]

Thanks for all the useful help and feedback! 🤗

daixiang0 (Member) commented:

Good news!
