
Listener drops all connections when per filter config updated with RDS #17109

Open
belyalov opened this issue Jun 23, 2021 · 20 comments
Labels
area/perf area/xds help wanted Needs help! investigate Potential bug that needs verification

Comments

@belyalov
Contributor

Title: Listener drops all connections when per filter config updated with RDS

Description:
We have a simple working setup:

static_resources {
  listeners {
    name: "listener"
    address {
      socket_address {
        address: "0.0.0.0"
        port_value: 8080
      }
    }
    filter_chains {
      filters {
        name: "envoy.http_connection_manager"
        typed_config {
          [type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager] {
            stat_prefix: "ingress_http"
            rds {
              config_source {
                ads {
                }
                initial_fetch_timeout {
                  seconds: 3
                }
                resource_api_version: V3
              }
              route_config_name: "xds-config-backend"
            }
            http_filters {
              name: "dropbox.envoy.waf"
              typed_config {
                [type.googleapis.com/dropbox.envoy_dbx.WafConfig] {
                }
              }
            }
            http_filters {
              name: "envoy.filters.http.router"
            }
// ... less important... //

Everything works well, including updates sent via xDS to Envoy, as long as there are no per-vhost filter config overrides: Envoy successfully updates the route configuration without any impact on traffic.

However, if the route configuration contains per-virtual-host overrides, like:

resources: <
  [type.googleapis.com/envoy.config.route.v3.RouteConfiguration]: <
    name: "xds-config-backend"
    virtual_hosts: <
      name: "test1"
      domains: "test1"
      routes: <
        match: <
          prefix: "/test1"
        >
        route: <
          cluster: "test_cluster"
        >
      >
      typed_per_filter_config: <
        key: "dropbox.envoy.waf"
        value: <
          [type.googleapis.com/dropbox.envoy_dbx.WafConfig]: <
          >
        >
      >
    >

The listener re-establishes all connections (downstream_cx_active gauge):
[graph of the downstream_cx_active gauge dropping sharply when the route config is pushed]

To me it seems like the listener re-creates the HttpConnectionManager, and that somehow causes the old connections to close.

Is this expected behavior?
Is there some parameter that we can adjust?
Any other ideas?

I'd be happy to provide more details, if needed.

Thanks

@belyalov belyalov added the triage Issue requires triage label Jun 23, 2021
@alyssawilk
Contributor

cc @htuch @adisuissa

@alyssawilk alyssawilk added area/xds question Questions that are neither investigations, bugs, nor enhancements and removed triage Issue requires triage labels Jun 24, 2021
@htuch
Member

htuch commented Jun 27, 2021

I don't think this should be happening; the per-route/vhost overrides are generally resolved by filter chains at runtime, we don't rebuild them. @lambdai @mattklein123 do you know any reason this might happen?

@belyalov are you sure there isn't anything else in the listener changing in this scenario?

@mattklein123
Member

I don't think this should be happening; the per-route/vhost overrides are generally resolved by filter chains at runtime, we don't rebuild them. @lambdai @mattklein123 do you know any reason this might happen?

I can't think of any reason this would be happening. If you are able to repro this easily can you provide some logs during an update?

@belyalov
Contributor Author

belyalov commented Jun 28, 2021

@belyalov are you sure there isn't anything else in the listener changing in this scenario?

Yep, there are no changes to the listener, only a routes update via xDS.

If you are able to repro this easily...

Yep, we have a stable repro in our production environment; so far we have been able to isolate the problem to just the xDS -> Envoy routes config push.

I'll update you with debug logs.

Thanks

@belyalov
Contributor Author

So I was able to get some logs in debug mode.

New routes config pushed:

I0628 18:45:25.357789 00007 routes_config_watcher.go:64] Sending new routes configuration
I0628 18:45:25.568516 00007 aggregated_watcher.go:126] wait for: type.googleapis.com/envoy.config.route.v3.RouteConfiguration
I0628 18:45:37.752858 00007 routes_config_watcher.go:66] Routes configuration sent

Right after the config is pushed, there are a lot of warning messages about a deprecated field:

W0628 18:45:26.599912 7 misc message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher.GoogleRE2 Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.GoogleRE2.max_program_size' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
W0628 18:45:26.600018 7 misc message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher.GoogleRE2 Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.GoogleRE2.max_program_size' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
W0628 18:45:26.600151 7 misc message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher.GoogleRE2 Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.GoogleRE2.max_program_size' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
W0628 18:45:26.600180 7 misc message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher.GoogleRE2 Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.GoogleRE2.max_program_size' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
W0628 18:45:26.600218 7 misc message_validator_impl.cc:21] Deprecated field: type envoy.type.matcher.v3.RegexMatcher.GoogleRE2 Using deprecated option 'envoy.type.matcher.v3.RegexMatcher.GoogleRE2.max_program_size' from file regex.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
....
$ cat envoy_proxy_0.log | grep "18:45:" | grep "Deprecated field: type envoy.type.matcher.v3.RegexMatcher.GoogleRE2" | wc -l
136025

Then, after a while, there are a lot of entries about connections being added to the cleanup list:

D0628 18:45:34.150229 27 conn_handler active_tcp_listener.cc:76] [C3636248] adding to cleanup list
D0628 18:45:34.149782 54 conn_handler active_tcp_listener.cc:76] [C3634492] adding to cleanup list
D0628 18:45:34.150142 39 conn_handler active_tcp_listener.cc:76] [C3637873] adding to cleanup list
D0628 18:45:34.150396 83 conn_handler active_tcp_listener.cc:76] [C3636918] adding to cleanup list
D0628 18:45:34.150224 50 conn_handler active_tcp_listener.cc:76] [C3637973] adding to cleanup list
D0628 18:45:34.150504 51 conn_handler active_tcp_listener.cc:76] [C3635807] adding to cleanup list
D0628 18:45:34.150609 52 conn_handler active_tcp_listener.cc:76] [C3635005] adding to cleanup list
.....
$ cat envoy_proxy_0.log | grep "18:45:33." | grep "adding to cleanup list" | wc -l
3190
$ cat envoy_proxy_0.log | grep "18:45:34." | grep "adding to cleanup list" | wc -l
792

Finally, the new config gets applied and connections are re-established:

D0628 18:45:37.512406 7 router rds_impl.cc:143] rds: loading new configuration: config_name=xds-config-backend hash=8658319582561063231
D0628 18:45:37.512775 7 config grpc_subscription_impl.cc:84] gRPC config for type.googleapis.com/envoy.config.route.v3.RouteConfiguration accepted with 1 resources with version
D0628 18:45:37.512786 7 config grpc_subscription_impl.cc:88] gRPC config update took 7018 ms! Resources names: xds-config-backend
D0628 18:45:37.734420 7 config grpc_mux_impl.cc:129] Resuming discovery requests for type.googleapis.com/envoy.config.route.v3.RouteConfiguration (previous count 1)
D0628 18:45:37.751557 7 conn_handler active_tcp_listener.cc:328] [C3638391] new connection
D0628 18:45:37.751586 7 conn_handler active_tcp_listener.cc:328] [C3638392] new connection
D0628 18:45:37.751607 7 conn_handler active_tcp_listener.cc:328] [C3638393] new connection
D0628 18:45:37.751673 7 conn_handler active_tcp_listener.cc:328] [C3638394] new connection
D0628 18:45:37.751721 7 conn_handler active_tcp_listener.cc:328] [C3638395] new connection
....
$ cat envoy_proxy_0.log | grep "18:45:" | grep " new connection" | wc -l
1738

Note that the update

I0628 18:45:25.568516 00007 aggregated_watcher.go:126] wait for: type.googleapis.com/envoy.config.route.v3.RouteConfiguration
I0628 18:45:37.752858 00007 routes_config_watcher.go:66] Routes configuration sent

takes around 12 seconds to be processed.

Could it be that the large number of warnings being written to the log file is causing connections to expire due to some timeout?
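
One way to test this hypothesis (a sketch; it assumes the admin interface is enabled, and the port 9901 is a placeholder) would be to silence the "misc" logger before the next push and see whether the drops still happen:

# Hypothetical check: raise the "misc" logger (which emits the deprecated-field
# warnings above) to error level via the admin interface, then push the config again.
curl -X POST "http://127.0.0.1:9901/logging?misc=error"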

Thanks!

@mattklein123
Member

Hmm, I could see the main thread hanging for a bit, but I don't think any worker threads should be getting hung. Do you have any other logs or stats on why connections are being closed? Are they local closes or remote closes? I would look at both listener/downstream HTTP stats as well as cluster stats.
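
For example (a sketch; the "ingress_http" stat prefix comes from the config above, and the admin port 9901 is a placeholder), comparing local vs. remote destroy counters before and after an update:

# Hypothetical commands, assuming the admin interface listens on 127.0.0.1:9901.
# HCM-level local vs. remote close counters for downstream connections:
curl -s http://127.0.0.1:9901/stats | grep -E 'http\.ingress_http\.downstream_cx_destroy_(local|remote)'
# Listener-level and cluster-level connection destroy counters:
curl -s http://127.0.0.1:9901/stats | grep -E '(downstream|upstream)_cx_destroy'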

cc @antoniovicente also.

@belyalov
Contributor Author

@mattklein123 I can get those stats (and probably more logs) by tomorrow.

@belyalov
Contributor Author

belyalov commented Jul 1, 2021

An update:
it seems to be related to the routes config size.
We have 2 almost identical deployments with different config sizes (~30 MB vs. ~16 MB),
and we only see drops on the deployment with the huge config.

# 16mb routes config (binary proto)
I0701 17:39:30.101950 00008 routes_config_watcher.go:64] Sending new routes configuration
I0701 17:39:32.161910 00008 routes_config_watcher.go:66] Routes configuration sent
# --> 2 seconds to apply

# 30 mb config
I0701 17:40:57.003594 00007 routes_config_watcher.go:64] Sending new routes configuration
I0701 17:41:04.460476 00007 routes_config_watcher.go:66] Routes configuration sent
# --> 7.5 seconds to apply

Is it possible that the huge config takes too long to be processed (is this why the main thread "hangs"?), so all connections are forced to expire?

@mattklein123
Member

Is it possible that the huge config takes too long to be processed (is this why the main thread "hangs"?), so all connections are forced to expire?

I wouldn't expect main thread processing time to affect the workers, but I'm not exactly sure what is going on. More logs would be helpful.

@belyalov
Contributor Author

belyalov commented Jul 1, 2021

Getting more logs now

@belyalov
Contributor Author

belyalov commented Jul 1, 2021

Stats:

Before DROP

listener.XXXXXXX.downstream_cx_active: 5856
listener.XXXXXXX.downstream_cx_destroy: 267415
listener.XXXXXXX.downstream_cx_overflow: 0
listener.XXXXXXX.downstream_cx_overload_reject: 0
listener.XXXXXXX.downstream_cx_total: 273271
listener.XXXXXXX.downstream_global_cx_overflow: 0
listener.XXXXXXX.downstream_pre_cx_active: 0
listener.XXXXXXX.downstream_pre_cx_timeout: 0
listener.XXXXXXX.http.dc.downstream_rq_1xx: 0
listener.XXXXXXX.http.dc.downstream_rq_2xx: 6931197
listener.XXXXXXX.http.dc.downstream_rq_3xx: 9089
listener.XXXXXXX.http.dc.downstream_rq_4xx: 257724
listener.XXXXXXX.http.dc.downstream_rq_5xx: 31490
listener.XXXXXXX.http.dc.downstream_rq_completed: 7229500
listener.XXXXXXX.no_filter_chain_match: 0
listener.XXXXXXX.server_ssl_socket_factory.downstream_context_secrets_not_ready: 0
listener.XXXXXXX.server_ssl_socket_factory.ssl_context_update_by_sds: 8
listener.XXXXXXX.server_ssl_socket_factory.upstream_context_secrets_not_ready: 0
listener.XXXXXXX.ssl.ciphers.ECDHE-ECDSA-AES128-GCM-SHA256: 273267
listener.XXXXXXX.ssl.ciphers.TLS_AES_128_GCM_SHA256: 3
listener.XXXXXXX.ssl.connection_error: 0
listener.XXXXXXX.ssl.curves.X25519: 273270
listener.XXXXXXX.ssl.fail_verify_cert_hash: 0
listener.XXXXXXX.ssl.fail_verify_error: 0
listener.XXXXXXX.ssl.fail_verify_no_cert: 0
listener.XXXXXXX.ssl.fail_verify_san: 0
listener.XXXXXXX.ssl.handshake: 273270
listener.XXXXXXX.ssl.no_certificate: 0
listener.XXXXXXX.ssl.ocsp_staple_failed: 0
listener.XXXXXXX.ssl.ocsp_staple_omitted: 3
listener.XXXXXXX.ssl.ocsp_staple_requests: 3
listener.XXXXXXX.ssl.ocsp_staple_responses: 0
listener.XXXXXXX.ssl.session_reused: 64476
listener.XXXXXXX.ssl.sigalgs.ecdsa_secp256r1_sha256: 273270
listener.XXXXXXX.ssl.versions.TLSv1.2: 273267
listener.XXXXXXX.ssl.versions.TLSv1.3: 3
listener.XXXXXXX.worker_0.downstream_cx_active: 189
listener.XXXXXXX.worker_0.downstream_cx_total: 8554
listener.XXXXXXX.worker_1.downstream_cx_active: 170
listener.XXXXXXX.worker_1.downstream_cx_total: 8528
listener.XXXXXXX.worker_10.downstream_cx_active: 164
listener.XXXXXXX.worker_10.downstream_cx_total: 8435
listener.XXXXXXX.worker_11.downstream_cx_active: 182
listener.XXXXXXX.worker_11.downstream_cx_total: 8508
listener.XXXXXXX.worker_12.downstream_cx_active: 193
listener.XXXXXXX.worker_12.downstream_cx_total: 8662
listener.XXXXXXX.worker_13.downstream_cx_active: 171
listener.XXXXXXX.worker_13.downstream_cx_total: 8452
listener.XXXXXXX.worker_14.downstream_cx_active: 188
listener.XXXXXXX.worker_14.downstream_cx_total: 8548
listener.XXXXXXX.worker_15.downstream_cx_active: 176
listener.XXXXXXX.worker_15.downstream_cx_total: 8388
listener.XXXXXXX.worker_16.downstream_cx_active: 191
listener.XXXXXXX.worker_16.downstream_cx_total: 8641
listener.XXXXXXX.worker_17.downstream_cx_active: 188
listener.XXXXXXX.worker_17.downstream_cx_total: 8588
listener.XXXXXXX.worker_18.downstream_cx_active: 200
listener.XXXXXXX.worker_18.downstream_cx_total: 8464
listener.XXXXXXX.worker_19.downstream_cx_active: 213
listener.XXXXXXX.worker_19.downstream_cx_total: 8551
listener.XXXXXXX.worker_2.downstream_cx_active: 184
listener.XXXXXXX.worker_2.downstream_cx_total: 8560
listener.XXXXXXX.worker_20.downstream_cx_active: 193
listener.XXXXXXX.worker_20.downstream_cx_total: 8618
listener.XXXXXXX.worker_21.downstream_cx_active: 172
listener.XXXXXXX.worker_21.downstream_cx_total: 8672
listener.XXXXXXX.worker_22.downstream_cx_active: 179
listener.XXXXXXX.worker_22.downstream_cx_total: 8361
listener.XXXXXXX.worker_23.downstream_cx_active: 157
listener.XXXXXXX.worker_23.downstream_cx_total: 8432
listener.XXXXXXX.worker_24.downstream_cx_active: 190
listener.XXXXXXX.worker_24.downstream_cx_total: 8537
listener.XXXXXXX.worker_25.downstream_cx_active: 172
listener.XXXXXXX.worker_25.downstream_cx_total: 8549
listener.XXXXXXX.worker_26.downstream_cx_active: 188
listener.XXXXXXX.worker_26.downstream_cx_total: 8341
listener.XXXXXXX.worker_27.downstream_cx_active: 163
listener.XXXXXXX.worker_27.downstream_cx_total: 8700
listener.XXXXXXX.worker_28.downstream_cx_active: 195
listener.XXXXXXX.worker_28.downstream_cx_total: 8618
listener.XXXXXXX.worker_29.downstream_cx_active: 174
listener.XXXXXXX.worker_29.downstream_cx_total: 8676
listener.XXXXXXX.worker_3.downstream_cx_active: 173
listener.XXXXXXX.worker_3.downstream_cx_total: 8529
listener.XXXXXXX.worker_30.downstream_cx_active: 186
listener.XXXXXXX.worker_30.downstream_cx_total: 8547
listener.XXXXXXX.worker_31.downstream_cx_active: 196
listener.XXXXXXX.worker_31.downstream_cx_total: 8579
listener.XXXXXXX.worker_4.downstream_cx_active: 177
listener.XXXXXXX.worker_4.downstream_cx_total: 8365
listener.XXXXXXX.worker_5.downstream_cx_active: 170
listener.XXXXXXX.worker_5.downstream_cx_total: 8653
listener.XXXXXXX.worker_6.downstream_cx_active: 204
listener.XXXXXXX.worker_6.downstream_cx_total: 8623
listener.XXXXXXX.worker_7.downstream_cx_active: 190
listener.XXXXXXX.worker_7.downstream_cx_total: 8636
listener.XXXXXXX.worker_8.downstream_cx_active: 197
listener.XXXXXXX.worker_8.downstream_cx_total: 8583
listener.XXXXXXX.worker_9.downstream_cx_active: 171
listener.XXXXXXX.worker_9.downstream_cx_total: 8373
listener.XXXXXXX.downstream_cx_length_ms: P0(nan,58.0) P25(nan,3147036.0024829297) P50(nan,3677753.4919134635) P75(nan,6003795.180722891) P90(nan,10046677.91141097) P95(nan,11231741.573033707) P99(nan,19312560.827250592) P99.5(nan,20237299.311926592) P99.9(nan,21603858.208955254) P100(nan,41000000.0)

After DROP

listener.XXXXXXX.downstream_cx_active: 196
listener.XXXXXXX.downstream_cx_destroy: 273388
listener.XXXXXXX.downstream_cx_overflow: 0
listener.XXXXXXX.downstream_cx_overload_reject: 0
listener.XXXXXXX.downstream_cx_total: 273584
listener.XXXXXXX.downstream_global_cx_overflow: 0
listener.XXXXXXX.downstream_pre_cx_active: 0
listener.XXXXXXX.downstream_pre_cx_timeout: 0
listener.XXXXXXX.http.dc.downstream_rq_1xx: 0
listener.XXXXXXX.http.dc.downstream_rq_2xx: 6933885
listener.XXXXXXX.http.dc.downstream_rq_3xx: 9099
listener.XXXXXXX.http.dc.downstream_rq_4xx: 257844
listener.XXXXXXX.http.dc.downstream_rq_5xx: 31505
listener.XXXXXXX.http.dc.downstream_rq_completed: 7232333
listener.XXXXXXX.no_filter_chain_match: 0
listener.XXXXXXX.server_ssl_socket_factory.downstream_context_secrets_not_ready: 0
listener.XXXXXXX.server_ssl_socket_factory.ssl_context_update_by_sds: 8
listener.XXXXXXX.server_ssl_socket_factory.upstream_context_secrets_not_ready: 0
listener.XXXXXXX.ssl.ciphers.ECDHE-ECDSA-AES128-GCM-SHA256: 273578
listener.XXXXXXX.ssl.ciphers.TLS_AES_128_GCM_SHA256: 3
listener.XXXXXXX.ssl.connection_error: 0
listener.XXXXXXX.ssl.curves.X25519: 273581
listener.XXXXXXX.ssl.fail_verify_cert_hash: 0
listener.XXXXXXX.ssl.fail_verify_error: 0
listener.XXXXXXX.ssl.fail_verify_no_cert: 0
listener.XXXXXXX.ssl.fail_verify_san: 0
listener.XXXXXXX.ssl.handshake: 273581
listener.XXXXXXX.ssl.no_certificate: 0
listener.XXXXXXX.ssl.ocsp_staple_failed: 0
listener.XXXXXXX.ssl.ocsp_staple_omitted: 3
listener.XXXXXXX.ssl.ocsp_staple_requests: 3
listener.XXXXXXX.ssl.ocsp_staple_responses: 0
listener.XXXXXXX.ssl.session_reused: 64704
listener.XXXXXXX.ssl.sigalgs.ecdsa_secp256r1_sha256: 273581
listener.XXXXXXX.ssl.versions.TLSv1.2: 273578
listener.XXXXXXX.ssl.versions.TLSv1.3: 3
listener.XXXXXXX.worker_0.downstream_cx_active: 11
listener.XXXXXXX.worker_0.downstream_cx_total: 8570
listener.XXXXXXX.worker_1.downstream_cx_active: 10
listener.XXXXXXX.worker_1.downstream_cx_total: 8541
listener.XXXXXXX.worker_10.downstream_cx_active: 6
listener.XXXXXXX.worker_10.downstream_cx_total: 8442
listener.XXXXXXX.worker_11.downstream_cx_active: 5
listener.XXXXXXX.worker_11.downstream_cx_total: 8517
listener.XXXXXXX.worker_12.downstream_cx_active: 6
listener.XXXXXXX.worker_12.downstream_cx_total: 8671
listener.XXXXXXX.worker_13.downstream_cx_active: 4
listener.XXXXXXX.worker_13.downstream_cx_total: 8459
listener.XXXXXXX.worker_14.downstream_cx_active: 5
listener.XXXXXXX.worker_14.downstream_cx_total: 8555
listener.XXXXXXX.worker_15.downstream_cx_active: 2
listener.XXXXXXX.worker_15.downstream_cx_total: 8391
listener.XXXXXXX.worker_16.downstream_cx_active: 5
listener.XXXXXXX.worker_16.downstream_cx_total: 8651
listener.XXXXXXX.worker_17.downstream_cx_active: 3
listener.XXXXXXX.worker_17.downstream_cx_total: 8596
listener.XXXXXXX.worker_18.downstream_cx_active: 2
listener.XXXXXXX.worker_18.downstream_cx_total: 8470
listener.XXXXXXX.worker_19.downstream_cx_active: 11
listener.XXXXXXX.worker_19.downstream_cx_total: 8566
listener.XXXXXXX.worker_2.downstream_cx_active: 8
listener.XXXXXXX.worker_2.downstream_cx_total: 8569
listener.XXXXXXX.worker_20.downstream_cx_active: 12
listener.XXXXXXX.worker_20.downstream_cx_total: 8636
listener.XXXXXXX.worker_21.downstream_cx_active: 4
listener.XXXXXXX.worker_21.downstream_cx_total: 8677
listener.XXXXXXX.worker_22.downstream_cx_active: 6
listener.XXXXXXX.worker_22.downstream_cx_total: 8372
listener.XXXXXXX.worker_23.downstream_cx_active: 0
listener.XXXXXXX.worker_23.downstream_cx_total: 8434
listener.XXXXXXX.worker_24.downstream_cx_active: 4
listener.XXXXXXX.worker_24.downstream_cx_total: 8548
listener.XXXXXXX.worker_25.downstream_cx_active: 6
listener.XXXXXXX.worker_25.downstream_cx_total: 8566
listener.XXXXXXX.worker_26.downstream_cx_active: 2
listener.XXXXXXX.worker_26.downstream_cx_total: 8345
listener.XXXXXXX.worker_27.downstream_cx_active: 5
listener.XXXXXXX.worker_27.downstream_cx_total: 8711
listener.XXXXXXX.worker_28.downstream_cx_active: 10
listener.XXXXXXX.worker_28.downstream_cx_total: 8629
listener.XXXXXXX.worker_29.downstream_cx_active: 4
listener.XXXXXXX.worker_29.downstream_cx_total: 8683
listener.XXXXXXX.worker_3.downstream_cx_active: 4
listener.XXXXXXX.worker_3.downstream_cx_total: 8537
listener.XXXXXXX.worker_30.downstream_cx_active: 7
listener.XXXXXXX.worker_30.downstream_cx_total: 8561
listener.XXXXXXX.worker_31.downstream_cx_active: 8
listener.XXXXXXX.worker_31.downstream_cx_total: 8589
listener.XXXXXXX.worker_4.downstream_cx_active: 8
listener.XXXXXXX.worker_4.downstream_cx_total: 8377
listener.XXXXXXX.worker_5.downstream_cx_active: 9
listener.XXXXXXX.worker_5.downstream_cx_total: 8664
listener.XXXXXXX.worker_6.downstream_cx_active: 5
listener.XXXXXXX.worker_6.downstream_cx_total: 8631
listener.XXXXXXX.worker_7.downstream_cx_active: 7
listener.XXXXXXX.worker_7.downstream_cx_total: 8644
listener.XXXXXXX.worker_8.downstream_cx_active: 9
listener.XXXXXXX.worker_8.downstream_cx_total: 8597
listener.XXXXXXX.worker_9.downstream_cx_active: 8
listener.XXXXXXX.worker_9.downstream_cx_total: 8385
listener.XXXXXXX.downstream_cx_length_ms: P0(160.0,58.0) P25(400582.19178082194,2856296.875) P50(642075.4716981133,3673833.6221382064) P75(848188.4057971014,5894208.1604426) P90(919000.0,10002882.076855384) P95(934169.2307692309,11045349.563046189) P99(946072.8813559322,19276234.79318735) P99.5(948036.4406779661,20203056.19266055) P99.9(949607.2881355932,21594944.776119422) P100(950000.0,41000000.0)

@belyalov
Contributor Author

belyalov commented Jul 1, 2021

They seem to be remote closes (grepped a few of them):

D0701 17:07:27.488015 7 http conn_manager_impl.cc:1479] [C602431][S6971722700133307378] encoding headers via codec (end_stream=false):
D0701 17:07:27.488380 7 connection connection_impl.cc:633] [C602431] remote close
D0701 17:07:27.488394 7 connection connection_impl.cc:243] [C602431] closing socket: 0
D0701 17:07:27.488438 7 conn_handler active_tcp_listener.cc:76] [C602431] adding to cleanup list

D0701 18:02:02.784808 28 connection connection_impl.cc:633] [C630033] remote close
D0701 18:02:02.784821 28 connection connection_impl.cc:243] [C630033] closing socket: 0
D0701 18:02:02.784840 28 connection ssl_socket.cc:306] [C630033] SSL shutdown: rc=1
D0701 18:02:02.784890 28 conn_handler active_tcp_listener.cc:76] [C630033] adding to cleanup list

D0701 18:02:03.088431 81 connection connection_impl.cc:633] [C628537] remote close
D0701 18:02:03.088444 81 connection connection_impl.cc:243] [C628537] closing socket: 0
D0701 18:02:03.088463 81 connection ssl_socket.cc:306] [C628537] SSL shutdown: rc=1
D0701 18:02:03.088486 81 conn_handler active_tcp_listener.cc:76] [C628537] adding to cleanup list

The upstream and downstream for this setup are also Envoy.
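
Since the peer is also Envoy, one way to confirm which side initiated the closes (a sketch; the admin port 9901 and the cluster name "edge_proxy" are placeholders) would be to look at the upstream-side destroy counters on the downstream Envoy:

# Hypothetical check on the downstream (client-side) Envoy: if upstream_cx_destroy_local
# rises during the update, that Envoy is closing the connections itself (e.g. via an
# idle or max-duration timeout) rather than this one dropping them.
curl -s http://127.0.0.1:9901/stats | grep -E 'cluster\.edge_proxy\.upstream_cx_destroy_(local|remote)'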

@mattklein123 mattklein123 added area/perf help wanted Needs help! investigate Potential bug that needs verification and removed question Questions that are neither investigations, bugs, nor enhancements labels Jul 1, 2021
@mattklein123
Member

My assumption here is that somehow the large config update is pegging the machine and/or workers and causing connection drops. I think this is probably a generic performance issue that we need to look into. cc @jmarantz @rojkov

@jmarantz
Contributor

jmarantz commented Jul 1, 2021

I'm generally wondering if we overcommit our CPU cores when we have concurrency == NumCores, as that leaves no room for compute-intensive work on the main thread (config updates, stats dumping), and there may be noisy-neighbor issues. But this is not really based on data; it's just a conjecture.

One short-term remedy might be to set --concurrency $((NumCores - 1)) or similar.
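
For instance (a sketch; the config path is a placeholder):

# Hypothetical invocation: leave one core free for the main thread.
# --concurrency sets the number of worker threads; nproc reports the available cores.
envoy --config-path /etc/envoy/bootstrap.pb_text --concurrency $(( $(nproc) - 1 ))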

There are also lots of things we can do (and are in fact working on) to make these main-thread activities more efficient, but that will take time.

@antoniovicente
Contributor

It would be helpful to capture a profile to see where the time is spent.

Can you share some info about the structure of the large config being loaded? How many clusters, vhosts, routes, etc.
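
For example (a sketch using Linux perf; the PID is a placeholder), recording the process for 30 seconds spanning a config push:

# Hypothetical profiling session: capture call stacks during the update window,
# then look at where the CPU time goes.
perf record -g -p <envoy_pid> -- sleep 30
perf report --stdio | head -n 50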

@rojkov
Member

rojkov commented Jul 2, 2021

If the main thread drags down the workers, I'd also be concerned about the memory situation. Do you see spikes in page faults?
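
For example (a sketch; the PID is a placeholder), watching per-second fault rates while the update is applied:

# Hypothetical check with pidstat (sysstat package): minor/major page faults per
# second for the Envoy process, sampled once a second during the push.
pidstat -r -p <envoy_pid> 1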

@belyalov
Contributor Author

belyalov commented Jul 6, 2021

The envoy_proxy main thread runs at around 100% CPU for ~6 seconds while processing the config update; the workers are fine:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 46285 xxxxxxx   20   0 6989008   3.2g  27816 R  99.9   5.1  94:04.05 envoy_proxy
 46415 xxxxxxx   20   0 6989008   3.2g  27816 S  11.0   5.1 163:38.47 wrk:worker_18
 46419 xxxxxxx   20   0 6989008   3.2g  27816 S  10.0   5.1 151:56.21 wrk:worker_20
 46399 xxxxxxx   20   0 6989008   3.2g  27816 S   9.0   5.1 154:30.24 wrk:worker_10
 46400 xxxxxxx   20   0 6989008   3.2g  27816 S   9.0   5.1 161:14.43 wrk:worker_11
 46401 xxxxxxx   20   0 6989008   3.2g  27816 S   9.0   5.1 153:35.57 wrk:worker_12
 46413 xxxxxxx   20   0 6989008   3.2g  27816 S   5.0   5.1 150:27.32 wrk:worker_17
 46395 xxxxxxx   20   0 6989008   3.2g  27816 S   4.0   5.1 153:24.26 wrk:worker_8
 46409 xxxxxxx   20   0 6989008   3.2g  27816 S   4.0   5.1 144:16.69 wrk:worker_15
 46422 xxxxxxx   20   0 6989008   3.2g  27816 S   4.0   5.1 160:13.89 wrk:worker_22
......

@mattklein123
Member

Interesting. I don't have any great theories right now as to why this config spinning on the main thread (which is known) is causing worker issues. Coming back to your original report:

However, if the route configuration contains per-virtual-host overrides

What do you mean by this exactly? Are you positive that the use of this causes the issue but the removal does not? Do you mean per-filter config overrides? Can you be really clear on what delta causes the issue?

@belyalov
Contributor Author

belyalov commented Jul 7, 2021

However, if the route configuration contains per-virtual-host overrides

What do you mean by this exactly? Are you positive that the use of this causes the issue but the removal does not? Do you mean per-filter config overrides? Can you be really clear on what delta causes the issue?

That was the original assumption about the root cause, but it seems to be unrelated.
Currently we're experimenting with config sizes (trying hard to reduce them!) and per-filter config overrides, so I'll update you with the results.

@belyalov
Contributor Author

Some insight into what our "huge" config consists of (a rough sketch of one such route follows the list):

  • 9 clusters with EDS enabled
  • ~40 virtual hosts, with around 80k routes total
  • 13k routes with runtime fraction enabled
  • Most routes use RE2 regex matching
  • Almost all routes have filter_metadata attached
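
For reference, a rough and purely illustrative sketch of what one such route might look like (the regex, runtime key, and metadata values below are made up):

# Illustrative only: one route out of ~80k, combining a regex match,
# a runtime fraction, and filter_metadata.
routes {
  match {
    safe_regex {
      google_re2 {
      }
      regex: "/api/v[0-9]+/example/.*"
    }
    runtime_fraction {
      default_value {
        numerator: 50
        denominator: HUNDRED
      }
      runtime_key: "routing.example_rollout"
    }
  }
  route {
    cluster: "test_cluster"
  }
  metadata {
    filter_metadata {
      key: "dropbox.envoy.waf"
      value {
        fields {
          key: "ruleset"
          value {
            string_value: "default"
          }
        }
      }
    }
  }
}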
