[stats]: lazy init stats to save RAM (and CPU) #23575
Comments
cc @jmarantz
These are the subgroups I plan to have for clusters (see stats_macros.h):

- All cluster config update related stats.
- All cluster endpoint related stats.
- All cluster load balancing related stats.
- All other cluster stats.
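As a rough, hypothetical sketch of what the subgroup split could look like: stats_macros.h uses X-macro lists to generate stats structs, so giving each subgroup its own list lets each struct be instantiated (or deferred) independently. The macro and counter names below are illustrative only, not Envoy's actual definitions.

```cpp
#include <cassert>

// Illustrative X-macro lists, one per stats subgroup. (Hypothetical names;
// Envoy's real lists use COUNTER/GAUGE/HISTOGRAM generators against a
// Stats::Scope rather than plain integers.)
#define ALL_CLUSTER_CONFIG_UPDATE_STATS(COUNTER)                               \
  COUNTER(update_attempt)                                                      \
  COUNTER(update_success)

#define ALL_CLUSTER_TRAFFIC_STATS(COUNTER)                                     \
  COUNTER(upstream_rq_total)                                                   \
  COUNTER(upstream_cx_total)

// A trivial generator that turns each name into a plain counter field.
#define GENERATE_PLAIN_COUNTER(name) unsigned long name = 0;

// Because each subgroup has its own struct, the frequently-touched
// config-update group can stay eagerly allocated while the traffic
// group becomes a candidate for lazy initialization.
struct ClusterConfigUpdateStats {
  ALL_CLUSTER_CONFIG_UPDATE_STATS(GENERATE_PLAIN_COUNTER)
};

struct ClusterTrafficStats {
  ALL_CLUSTER_TRAFFIC_STATS(GENERATE_PLAIN_COUNTER)
};
```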
Commit Message: Subgroup cluster stats() into lb-stats, endpoint-stats, config-update-stats and "the rest" (will be renamed to upstream-stats) (#23907)
Additional Description: See more description and a live example in #23575. With lots of clusters and route tables in a cloud proxy, we are seeing tons of RAM spent on stats, while most of the stats are never incremented due to the traffic pattern (long tail). We think we can lazily initialize cluster stats() so that the RAM is only allocated when it is required. To achieve that, we need finer-grained stats groups: e.g. configUpdateStats() are frequently updated by the config management server, while the upstream_xxx stats are only needed when there is traffic for the cluster, so for that subgroup we can save RAM by lazily initializing it.
Risk Level: LOW, should be a no-behavior-change CL.
Testing: N/A, existing stats tests should prove that there is no behavior change.
Docs Changes:
Release Notes:
Platform Specific Features:
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Commit Message: LazyInit ClusterInfo::trafficStats().
Additional Description: Third PR for #23575, stacked on the 2nd PR, #23907. With 100K clusters, we are seeing ~1.5 GB less RAM usage with this PR. Kudos to @jmarantz for the data:

| | Clean client | in-MB | Deferred | Diff | Diff % | Deferred-Fulfilled | Diff-in-MB (fulfilled - Clean) |
|----------------------|------------|-------------|------------|------------|----------------|------------|--------------|
| allocated | 4561550208 | 4350.233276 | 2886860656 | 1674689552 | 36.71316714 | 4565167632 | 3.44984436 |
| heap_size | 5303697408 | 656 | 3443523584 | 1860173824 | 35.07315144 | 5146411008 | -150 |
| pageheap_unmapped | 687865856 | | 501219328 | 186646528 | 27.13414634 | 524288000 | -156 |
| pageheap_free | 22921216 | 21.859375 | 23109632 | -188416 | -0.8220157255 | 22257664 | -0.6328125 |
| total_thread_cache | 1718288 | 1.638687134 | 4197032 | -2478744 | -144.2566089 | 4833576 | 2.970970154 |
| total_physical_bytes | 4647192158 | 4431.907804 | 2965242430 | 1681949728 | 36.19281645 | 4653479518 | 5.99609375 |

To reproduce, use this script to add 100K clusters to the bootstrap template:

```
$ cat large_bootstrap_maker.sh
#!/bin/bash
set -u
set -e
limit="$1"
for i in $(seq 1 $limit); do
  service="zzzzzzz-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_$i"
  echo "  - name: $service"
  echo "    connect_timeout: 0.25s"
  echo "    type: LOGICAL_DNS"
  echo "    dns_lookup_family: \"v6_only\""
  echo "    lb_policy: ROUND_ROBIN"
  echo "    load_assignment:"
  echo "      cluster_name: $service"
  echo "      endpoints:"
  echo "      - lb_endpoints:"
  echo "        - endpoint:"
  echo "            address: {socket_address: {address: google.com, port_value: 443}}"
done
```

base.tmpl:

```
$ cat base.yaml
admin:
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: "/dev/null"
  address:
    socket_address:
      address: "::"
      port_value: 8080
layered_runtime:
  layers:
  - name: admin
    admin_layer: {}
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: "::"
        port_value: 0
    filter_chains:
    - filters:
      - name: http
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: {prefix: "/"}
                route: {host_rewrite_literal: 127.0.0.1, cluster: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww}
  clusters:
```

Run the following commands to generate the bootstrap:

```
bash large_bootstrap_maker.sh 100000 >> base.yaml
mv base.yaml test/config/integration/100k_clusters.yaml
```

Then modify test/config/main_common_test.cc to load from test/config/integration/100k_clusters.yaml instead; you can then run the test and observe the RAM usage.
Risk Level: Medium (changes how ClusterInfo::trafficStats() are created).
Testing: existing unit tests.
Docs Changes: done
Release Notes: included
Platform Specific Features: n/a
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Title: stats: lazy init stats to save RAM (and CPU)
Description:
With lots of clusters and route tables in a cloud proxy, we are seeing tons of RAM spent on stats, while most of the stats are never incremented due to the traffic pattern (long tail). We think we can lazily initialize cluster stats() so that the RAM is only allocated when it is required.
To achieve that, we need finer-grained stats groups: e.g. configUpdateStats() are frequently updated by the config management server, while the upstream_xxx stats are only needed when there is traffic for the cluster, so for that subgroup we can save RAM by lazily initializing it.
Here is an example of cluster stats: even when there is no traffic, Envoy still pays the RAM cost of holding all the zero-valued stats, as well as the CPU burnt on collecting them.
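The lazy-init idea can be sketched roughly as follows. This is a minimal, hypothetical illustration of the technique, not Envoy's actual deferred-creation API: a holder wraps the stats struct and only allocates it on first access, so clusters that never see traffic never pay the RAM for their traffic stats.

```cpp
#include <functional>
#include <memory>

// Stand-in for a real stats struct; in Envoy these would be
// Stats::Counter members bound to a scope.
struct TrafficStats {
  unsigned long upstream_rq_total = 0;
  unsigned long upstream_cx_total = 0;
};

// Lazy holder: stores a factory instead of the stats themselves and
// allocates on first dereference. (Sketch only; a production version
// would also need thread-safe initialization.)
template <class StatsType> class LazyStats {
public:
  explicit LazyStats(std::function<std::unique_ptr<StatsType>()> factory)
      : factory_(std::move(factory)) {}

  // Allocates the underlying stats the first time they are touched.
  StatsType& operator*() {
    if (stats_ == nullptr) {
      stats_ = factory_();
    }
    return *stats_;
  }

  // True only after some code path has actually used the stats.
  bool initialized() const { return stats_ != nullptr; }

private:
  std::function<std::unique_ptr<StatsType>()> factory_;
  std::unique_ptr<StatsType> stats_;
};
```

A cluster with no traffic never calls `operator*`, so `initialized()` stays false and no stats memory is allocated for it.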