[stats]: lazy init stats to save RAM (and CPU) #23575
Comments
cc @jmarantz
These are the subgroups I plan to have for clusters (see stats_macros.h):

- All cluster config update related stats.
- All cluster endpoint related stats.
- All cluster load balancing related stats.
- All other cluster stats.
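As a rough, hypothetical sketch of what the subgroup split could look like: stats_macros.h uses X-macro lists to generate stats structs, so giving each subgroup its own list lets each struct be instantiated (or deferred) independently. The macro and counter names below are illustrative only, not Envoy's actual definitions.

```cpp
#include <cassert>

// Illustrative X-macro lists, one per stats subgroup. (Hypothetical names;
// Envoy's real lists use COUNTER/GAUGE/HISTOGRAM generators against a
// Stats::Scope rather than plain integers.)
#define ALL_CLUSTER_CONFIG_UPDATE_STATS(COUNTER)                               \
  COUNTER(update_attempt)                                                      \
  COUNTER(update_success)

#define ALL_CLUSTER_TRAFFIC_STATS(COUNTER)                                     \
  COUNTER(upstream_rq_total)                                                   \
  COUNTER(upstream_cx_total)

// A trivial generator that turns each name into a plain counter field.
#define GENERATE_PLAIN_COUNTER(name) unsigned long name = 0;

// Because each subgroup has its own struct, the frequently-touched
// config-update group can stay eagerly allocated while the traffic
// group becomes a candidate for lazy initialization.
struct ClusterConfigUpdateStats {
  ALL_CLUSTER_CONFIG_UPDATE_STATS(GENERATE_PLAIN_COUNTER)
};

struct ClusterTrafficStats {
  ALL_CLUSTER_TRAFFIC_STATS(GENERATE_PLAIN_COUNTER)
};
```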
Commit Message: Subgroup cluster stats() into lb-stats, endpoint-stats, config-update-stats and "the rest" (will be renamed to upstream-stats) (#23907)
Additional Description: See more description and a live example in #23575. With lots of clusters and route tables in a cloud proxy, we are seeing tons of RAM spent on stats, while most of the stats are never incremented due to the traffic pattern (long tail). We think we can lazily initialize cluster stats() so that the RAM is only allocated when it is required. To achieve that, we need finer-grained stats groups: e.g. configUpdateStats() are frequently updated by the config management server, while the upstream_xxx stats are only needed when there is traffic for the cluster, so for that subgroup we can save RAM by lazily initializing it.
Risk Level: LOW, should be a no-behavior-change CL.
Testing: N/A, existing stats tests should prove that there is no behavior change.
Docs Changes:
Release Notes:
Platform Specific Features:
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Commit Message: LazyInit ClusterInfo::trafficStats().
Additional Description: Third PR for #23575, stacked on the 2nd PR, #23907. With 100K clusters, we are seeing ~1.5 GB less RAM usage with this PR. Kudos to @jmarantz for the data:

| | Clean client | in-MB | Deferred | Diff | Diff % | Deferred-Fulfilled | Diff-in-MB (fulfilled - Clean) |
|----------------------|------------|-------------|------------|------------|----------------|------------|--------------|
| allocated | 4561550208 | 4350.233276 | 2886860656 | 1674689552 | 36.71316714 | 4565167632 | 3.44984436 |
| heap_size | 5303697408 | 656 | 3443523584 | 1860173824 | 35.07315144 | 5146411008 | -150 |
| pageheap_unmapped | 687865856 | | 501219328 | 186646528 | 27.13414634 | 524288000 | -156 |
| pageheap_free | 22921216 | 21.859375 | 23109632 | -188416 | -0.8220157255 | 22257664 | -0.6328125 |
| total_thread_cache | 1718288 | 1.638687134 | 4197032 | -2478744 | -144.2566089 | 4833576 | 2.970970154 |
| total_physical_bytes | 4647192158 | 4431.907804 | 2965242430 | 1681949728 | 36.19281645 | 4653479518 | 5.99609375 |

To reproduce, use this script to add 100K clusters to the bootstrap template:

```
$ cat large_bootstrap_maker.sh
#!/bin/bash
set -u
set -e
limit="$1"
for i in $(seq 1 $limit); do
  service="zzzzzzz-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_$i"
  echo "  - name: $service"
  echo "    connect_timeout: 0.25s"
  echo "    type: LOGICAL_DNS"
  echo "    dns_lookup_family: \"v6_only\""
  echo "    lb_policy: ROUND_ROBIN"
  echo "    load_assignment:"
  echo "      cluster_name: $service"
  echo "      endpoints:"
  echo "      - lb_endpoints:"
  echo "        - endpoint:"
  echo "            address: {socket_address: {address: google.com, port_value: 443}}"
done
```

base.tmpl:

```
$ cat base.yaml
admin:
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: "/dev/null"
  address:
    socket_address:
      address: "::"
      port_value: 8080
layered_runtime:
  layers:
  - name: admin
    admin_layer: {}
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: "::"
        port_value: 0
    filter_chains:
    - filters:
      - name: http
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: {prefix: "/"}
                route: {host_rewrite_literal: 127.0.0.1, cluster: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww}
  clusters:
```

Run the following commands to generate the bootstrap:

```
bash large_bootstrap_maker.sh 100000 >> base.yaml
mv base.yaml test/config/integration/100k_clusters.yaml
```

Then modify test/config/main_common_test.cc to load from test/config/integration/100k_clusters.yaml instead; you can then run the test and observe the RAM usage.
Risk Level: Medium (changes how ClusterInfo::trafficStats() are created).
Testing: existing unit tests.
Docs Changes: done
Release Notes: included
Platform Specific Features: n/a
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Title: stats: lazy init stats to save RAM (and CPU)
Description:
With lots of clusters and route tables in a cloud proxy, we are seeing tons of RAM spent on stats, while most of the stats are never incremented due to the traffic pattern (long tail). We think we can lazily initialize cluster stats() so that the RAM is only allocated when it is required.
To achieve that, we need finer-grained stats groups: e.g. configUpdateStats() are frequently updated by the config management server, while the upstream_xxx stats are only needed when there is traffic for the cluster, so for that subgroup we can save RAM by lazily initializing it.
Here is an example of cluster stats: even when there is no traffic, Envoy still pays the RAM cost of holding all the zero-valued stats, as well as the CPU burnt on collecting them.
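The lazy-init idea can be sketched roughly as follows. This is a minimal, hypothetical illustration of the technique, not Envoy's actual deferred-creation API: a holder wraps the stats struct and only allocates it on first access, so clusters that never see traffic never pay the RAM for their traffic stats.

```cpp
#include <functional>
#include <memory>

// Stand-in for a real stats struct; in Envoy these would be
// Stats::Counter members bound to a scope.
struct TrafficStats {
  unsigned long upstream_rq_total = 0;
  unsigned long upstream_cx_total = 0;
};

// Lazy holder: stores a factory instead of the stats themselves and
// allocates on first dereference. (Sketch only; a production version
// would also need thread-safe initialization.)
template <class StatsType> class LazyStats {
public:
  explicit LazyStats(std::function<std::unique_ptr<StatsType>()> factory)
      : factory_(std::move(factory)) {}

  // Allocates the underlying stats the first time they are touched.
  StatsType& operator*() {
    if (stats_ == nullptr) {
      stats_ = factory_();
    }
    return *stats_;
  }

  // True only after some code path has actually used the stats.
  bool initialized() const { return stats_ != nullptr; }

private:
  std::function<std::unique_ptr<StatsType>()> factory_;
  std::unique_ptr<StatsType> stats_;
};
```

A cluster with no traffic never calls `operator*`, so `initialized()` stays false and no stats memory is allocated for it.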