Memory Leak After 19->21 Upgrade #20800
Comments
Can you list what extensions you are using? Are you using tracing? There have been other reports of memory leaks using tracing and it would be nice to understand if that might be involved here. Also, can you get a memory profile? We are still working on getting memory profile dumps working using the new tcmalloc so you may have to recompile using gperftools until we get that working. cc @jmarantz @rojkov @wbpcode |
As Matt said, can you provide some more details about your configuration and setup? |
Thanks for the quick response! In a previous issue around performance, you can find more info on our setup as well as a snippet of our configuration. We can start working on getting a memory profile next week. |
re: "Heap size doesn't seem to be correlated to num active connections, num draining listeners, or anything else I could find in the stats": how about the number of stats? There are some dynamically allocated stats that may never be destroyed. |
We disable all stats for our custom resources because Envoy can't handle that many. Here's a full dump of /stats
We disable stats for our resources with:

```yaml
stats_config:
  stats_matcher:
    exclusion_list:
      patterns:
      - prefix: cluster.clst_
      - prefix: sds.scrt_
      - prefix: vhost.vhst_
```
|
The perf tools should also be OK for getting some memory profiles; then we can try to analyse them. |
Gonna grab a heap profile/flame graph for this. EDIT: Apologies, got sidetracked. Have it on my calendar tomorrow to compile from scratch to get a heap profile/pprof. |
1/3 of the Envoy replicas are now running the heap profile. Going to let it go through a cycle of climbing heap allocations and will come back with the pprof. Sidenote: does Envoy normally take something like 7 hours to build from source? I haven't built a massive C++ package from source in eons. |
It depends. You can edit do_ci.sh to skip the unit tests to shorten this time, and the second build pass will be much quicker. |
I've also let it run for about 30 minutes+ but I'll let it go for longer. Usually takes about 10-15m to start up, I assume given the number of routes/clusters we've got |
You can check the stats of the 443 listener to see whether the listener is working and the cert is correct. The 404 looks like a route-not-found error. |
This one? |
Should be same as #20919. You can set |
Perfect. Now returning empty responses. Once the routes are loaded, I'll force some traffic through that instance and circle back with the heap profiles Thank you so much for the lightning quick responses! |
Got the pprof heap files! Just went off this but please let me know if there's anything else I can get to make this easier to debug. |
Can you potentially do some pre-processing of the data following the instructions here? https://gperftools.github.io/gperftools/heapprofile.html. You should be able to generate graphs of the heap as well as differential diffs of the heap which should make it pretty clear where the extra memory is coming from. It will also confirm that symbols are correct. |
@JakeCooper Hi, thanks very much for your hard work. As Matt said, it would be better to do some pre-processing of this data, or can you provide the Envoy binary that was used to generate it? Then we can try to process the data ourselves. (If you compiled Envoy from the open-source code base, there should be no privacy problem.) |
I'm happy to run anything you'd like! Additionally, here's the binary. Let me know what command you'd want run, or any other files you need. I'm here to make this as simple as possible and really appreciate all your help! |
@JakeCooper unfortunately the version that you built doesn't have symbols (fastbuild). You will need to build with something like: |
Oh man. Welp, guess I'll try again. Build should be faster now that we've done it once though. Will come back with another binary and set of stuff |
Unfortunately it won't be faster since you have to rebuild everything to get optimized code with symbols. I would try to use a bigger machine if possible. |
It's already operating on a 96-core, 256 GB box :S Also, just to clarify: I built it with |
Yes that is correct. Something is really wrong if it's taking 7 hours to build on that machine. It should build in probably 10-20 minutes I would guess. |
Got the below. I'm attempting to rerun it with CFLAGS=-Wno-error. Please correct me if this is incorrect; again, it's been eons since I used C++. New Command: Old Error: |
If you are using a cloud instance, the problem is probably that you are building against something like EBS. Make sure you use a local SSD and build on that. The version of GCC you are using is too old. You would probably be better off using the Docker container to build, as it will replicate CI. See the instructions in the ci directory. |
Alrighty, circling back here as this has grown a LOT and we are now prioritizing resolving it. We continue to notice memory climbing, so we've run a heap profile. Here are a couple of them: one from our Edge Envoys (v1.25.1) and one from our Cluster Envoys (v1.25.1).

For the edge proxies, we have a single listener with ~12k filter chains; each filter chain entry has a domain match, one downstream TLS context, and an HCM filter which points to a routing table via RDS. We reference a single route config (via RDS) for all filter chain entries. This routing table has ~70k vhosts (each matching 1-2 domains), each with a route-to-cluster action. Traffic routes to ~4-ish clusters depending on which middle proxy set we route to. One of the filter chains is a wildcard match (eg: *.up.railway.app) and maps to around 60k vhosts; the other ~10k filter chains map 1:1 to 10k vhosts for individual domains. Our current RouteConfig size is ~18MB and is updated around once every 2-5 seconds. CC @char8 |
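As a rough illustration of the setup described above (a hedged sketch assuming the Envoy v3 API; the domain, secret, route config, and stat prefix names are placeholders, not values from this thread), one of the ~12k filter chain entries might look like:

```yaml
# Hypothetical single filter chain entry: SNI match, downstream TLS via SDS,
# and an HCM that points at the one shared route config via RDS.
filter_chains:
- filter_chain_match:
    server_names: ["customer-domain.example.com"]
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
      common_tls_context:
        tls_certificate_sds_secret_configs:
        - name: scrt_customer-domain          # placeholder SDS secret name
          sds_config:
            ads: {}
            resource_api_version: V3
  filters:
  - name: envoy.filters.network.http_connection_manager
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
      stat_prefix: edge
      rds:
        route_config_name: rcfg_edge          # the single ~70k-vhost route config shared by all chains
        config_source:
          ads: {}
          resource_api_version: V3
      http_filters:
      - name: envoy.filters.http.router
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```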
Are there lots of long-lived HTTP streams or requests? |
Would long-lived HTTP streams show up in the stats? We definitely do have websocket connections; not sure about the average duration of these, but looking at 1xx return codes, there are about 3/sec compared to about 600/sec 2xx responses on each edge proxy (assuming all 1xx responses are 101 upgrades for websockets). |
As far as I know, Envoy keeps a copy of the shared pointer to the route table in every active HTTP request (normal HTTP/1.1 request, HTTP/2 request, HTTP/2 stream, websocket request) to ensure the route config stays alive for the whole lifetime of the request. That means if your route table is refreshed by RDS, the old route table will not be released immediately. If your route table is refreshed frequently and there are lots of long-lived HTTP requests, then it's possible to keep lots of old versions of the table in memory and create the illusion of a memory leak. Is this a possible cause of this problem? cc @mattklein123 Could you reproduce this problem in your test env? If you can, maybe you can disable all the upgrades and check whether the problem goes away. |
If you are allowing listeners to drain for 10 minutes and you are updating every 3-8 minutes I think by definition you are going to have infinitely climbing memory. Can you drop the drain time to something like 1 minute so it's conservatively less than the update frequency? Granted, the updater should be optimizing to only drain the impacted filter chains, but I could still see this being a problem. (Are you changing every filter chain or only some of them every 3-8 minutes?) From a quick look at the profile it does look like either the route tables are leaking somehow or there are just too many of them sitting around. Can you actually generate some graph view cyclic charts from the data so we can see more clearly the flow of data in terms of allocation? It will be easier to understand. See https://github.com/envoyproxy/envoy/blob/main/bazel/PPROF.md#analyzing-with-pprof and check out the pprof CLI options to see how to spit out a visual view of the profile. Regarding the listener updates, take a look at the stats here: https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/stats#listener-manager. In particular check the |
Thanks both for the ideas!
We're currently working on testing the route config theory by a) trying to make our route configs smaller by about 10x, and b) doing a ramp of long-lived HTTP connections over an hour to see if we can reproduce similar memory growth. We did test with 5k websockets, but it was a very short ramp to 10k, so the active connections likely only referenced a handful of route configs that changed during the ramp.
How do these work: would each request hold onto a ref for just the matching vhost/route, or the entire route config? re: Listener updates, I forgot, but the cluster hosts (the stacker-at-30g profile) don't do any listener updates after startup, yet they still show this large memory growth. Looks like the LDS in-place update counts roughly follow the LDS update attempts count (off by 1) for the edge proxies; see plots below. We suspected listener updates might have been the issue, especially after watching this talk from EnvoyCon, but we haven't been able to see any changes despite changing LDS update frequency, etc. We only modify a very small subset of filter chains (mostly to change the SDS secret name after cert renewal); most of the changes are likely adding or removing low numbers (< 5) of filter chains per update. We try to keep the order of the filter chains list stable between updates as well. |
The entire route config. |
Thanks, those graphs are very helpful. Yes, it looks like the issue is many route tables being held in memory.
Yeah, I think this is probably the issue and I think @wbpcode's analysis is correct. |
I had to do some code spelunking, but I think the fundamental issue is what you outline: we are snapping the entire route table vs. just the route. envoy/source/common/router/config_impl.h Lines 338 to 339 in d9d257a
envoy/source/common/http/conn_manager_impl.h Line 399 in d9d257a
Untangling this is going to be difficult but it should probably involve not snapping the entire config and making the config more granular so only the parts that are actually needed would be saved for the request. @wbpcode is this something that you would be willing to tackle? It seems right up your alley. :) In the interim I don't see any solution other than to make your route configs smaller and/or make them more granular where you return different RDS responses for each filter chain. |
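As an illustration of that last suggestion (a hedged sketch assuming the Envoy v3 API, not the configuration actually used here; route config names and stat prefixes are placeholders, and the TLS transport sockets are omitted for brevity), returning separate, smaller RDS resources per filter chain group might look like:

```yaml
# Hypothetical: each group of filter chains points at its own route config
# instead of the single shared ~18MB one.
filter_chains:
- filter_chain_match:
    server_names: ["*.up.railway.app"]
  filters:
  - name: envoy.filters.network.http_connection_manager
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
      stat_prefix: edge_wildcard
      rds:
        route_config_name: rcfg_wildcard            # only the wildcard vhosts
        config_source: {ads: {}, resource_api_version: V3}
      http_filters:
      - name: envoy.filters.http.router
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
- filter_chain_match:
    server_names: ["customer-domain.example.com"]
  filters:
  - name: envoy.filters.network.http_connection_manager
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
      stat_prefix: edge_custom
      rds:
        route_config_name: rcfg_customer_domains    # only this chain's vhost(s)
        config_source: {ads: {}, resource_api_version: V3}
      http_filters:
      - name: envoy.filters.http.router
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

The trade-off is more RDS resources for the control plane to manage, but each in-flight request would then only pin a much smaller route config in memory.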
Will you create websocket requests gradually while keeping up the frequency of route config refreshes? I mean, create 5 websockets every second (and ensure these websocket requests have a long lifetime, such as 10 or 15 minutes) and update the route config every 3 seconds, for example. That way, lots of old versions of the route config should end up cached. |
I will have a try. Whether this is the root cause or not, we could still do some optimizations. I think we can wrap the shared part of the vhost in an object and keep a pointer to that object in every route. That way, the request will only reference a single route and the shared part of the vhost, and the shared part of the whole route config will be referenced by the shared part of the vhost. |
Yep! Plan on testing this today. I also saw #26045 🙏, so if I can repro the mem growth, I'll fire off a local build of it and see if I can compare results. |
Confirmed via a test; rough figures with an ~18MB route config and ramping up to 8k websocket connections. Base mem use at idle was about ~1.5GB. |
* router: only cache a single route for the active request Signed-off-by: wbpcode <wangbaiping@corp.netease.com> In the previous impl of the route table, every active request kept a reference to the whole route table. If the route table is updated frequently and there are long-lived HTTP requests, the old versions of the route table cannot be released in a timely manner and increase the memory overhead. This PR tries to fix this problem by caching a single route for the active request. See #20800 for more detailed background. Risk Level: High. Core code patch. Changes the lifetime of the route table. Testing: Added/fixed. Docs Changes: n/a. Release Notes: n/a. Platform Specific Features: n/a. Optional Runtime guard: envoy.reloadable_features.prohibit_route_refresh_after_response_headers_sent --------- Signed-off-by: wbpcode <wangbaiping@corp.netease.com> Signed-off-by: code <wangbaiping@corp.netease.com> Co-authored-by: Matt Klein <mattklein123@gmail.com>
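For reference, reloadable feature flags like the runtime guard quoted in the commit message above can typically be pinned through the bootstrap's layered runtime. A minimal sketch follows (the layer name is a placeholder, and this thread does not state which value selects the old behavior, so check the release notes before relying on it):

```yaml
# Hypothetical bootstrap fragment: pins the runtime guard named above to an
# explicit value. The flag name is quoted from the commit message in this thread.
layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      envoy.reloadable_features.prohibit_route_refresh_after_response_headers_sent: true
```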
We're using Envoy as our edge proxy at Railway (20k routes, virtual hosts, and clusters; 5k secrets) and started running into an issue where the heap grows uncontrollably. There was already some leaking, but it increased dramatically after switching from Dockerized Envoy 1.19 to Linux Brew Envoy 1.21. We're currently restarting each of our nodes every few hours to get around this (as visible in the graphs posted below).
My question is, does anything jump out immediately as to why this might be happening, and what is the best way to go about debugging this issue if not?
af50070ee60866874b0a9383daf9364e884ded22/1.21.1/Modified/RELEASE/BoringSSL
on Debian

Let me know if you need any additional info.
📊 Here's also a dump of /stats from one of the nodes:
Also tagging @JakeCooper and @soggycactus from the team on this one.