Removing and adding a cluster will cause EDS not sending to Envoy #583

Closed
tony612 opened this issue Aug 23, 2022 · 4 comments
@tony612
Contributor

tony612 commented Aug 23, 2022

I found that my go-control-plane server sometimes doesn't send an EDS response when I remove a Cluster and then add it back via CDS. This causes 503 responses with "no healthy upstream".

My operations are like:

  1. Add the needed resources (LDS, CDS, EDS). LDS and CDS use the Snapshot cache and EDS uses the Linear cache. At this point an API call through Envoy returns 200 OK.
  2. Remove Cluster A via CDS. The API call now returns 404.
  3. Add Cluster A back. The API call now returns 503 with no healthy upstream.

I looked into it a little and my guess at the cause is:

  1. When Cluster A is removed, Envoy also removes its ClusterLoadAssignment.
  2. When Cluster A is added again, Envoy sends an EDS request that still carries the previous version.
  3. The Linear cache then considers the client up to date, so it opens a watch but doesn't send a response.
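
The three steps above can be reproduced in isolation with a minimal sketch. The `linearLike` type and `shouldRespond` function below are hypothetical stand-ins written for this illustration, not the go-control-plane implementation; versions are simplified to plain integers, and the only thing modeled is a version comparison of the kind described in this issue.

```go
package main

import "fmt"

// linearLike is a hypothetical, stripped-down stand-in for a
// version-based cache. It is NOT the go-control-plane LinearCache.
type linearLike struct {
	// versionVector records the version at which each resource last changed.
	versionVector map[string]uint64
}

// shouldRespond answers immediately only if at least one requested
// resource changed after the version the client reports.
func (c *linearLike) shouldRespond(clientVersion uint64, names []string) bool {
	for _, name := range names {
		if clientVersion < c.versionVector[name] {
			return true
		}
	}
	return false
}

func main() {
	cache := &linearLike{versionVector: map[string]uint64{"my-cluster": 1}}

	// Initial subscription: the client has no version yet (0 here),
	// so the cache responds.
	fmt.Println("initial request answered:", cache.shouldRespond(0, []string{"my-cluster"}))

	// The cluster is removed and re-added via CDS. Envoy has dropped its
	// endpoints but re-subscribes while still reporting version 1.
	// Nothing looks stale, so only a watch would be opened.
	fmt.Println("re-subscribe answered:", cache.shouldRespond(1, []string{"my-cluster"}))
}
```

Under these assumptions the second request is never answered until the endpoints change again, which would match the "initial fetch timed out" warning in the Envoy logs.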

The Envoy logs (some lines omitted) are:

# access log. 202
[2022-08-22T11:23:31.488Z] "GET /ctx1/echo/abc HTTP/1.1" 200 - 0 60 5 3 "-" "curl/7.77.0" "6223f5b7-86d5-4e50-bd20-a25b4d45224c" "my-ip:32068" "10.0.0.128:18081"

# remove cluster
[2022-08-22 11:23:39.195][14][debug][config] [external/envoy/source/common/config/grpc_mux_impl.cc:165] Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 650059771229369675
[2022-08-22 11:23:39.196][14][debug][config] [external/envoy/source/common/config/grpc_mux_impl.cc:143] Pausing discovery requests for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment (previous count 0)
[2022-08-22 11:23:39.196][14][info][upstream] [external/envoy/source/common/upstream/cds_api_helper.cc:35] cds: add 0 cluster(s), remove 2 cluster(s)
[2022-08-22 11:23:39.196][14][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:769] removing cluster my-cluster
[2022-08-22 11:23:39.196][14][debug][upstream] [external/envoy/source/common/upstream/cds_api_helper.cc:67] cds: remove cluster 'my-cluster'
[2022-08-22 11:23:39.196][21][debug][connection] [external/envoy/source/common/network/connection_impl.cc:139] [C7] closing data_to_write=0 type=1
[2022-08-22 11:23:39.196][14][info][upstream] [external/envoy/source/common/upstream/cds_api_helper.cc:72] cds: added/updated 0 cluster(s), skipped 0 unmodified cluster(s)
[2022-08-22 11:23:39.196][21][debug][connection] [external/envoy/source/common/network/connection_impl.cc:250] [C7] closing socket: 1
[2022-08-22 11:23:39.196][14][debug][config] [external/envoy/source/common/config/grpc_mux_impl.cc:150] Resuming discovery requests for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment (previous count 1)

# now got 404
[2022-08-22T11:23:40.126Z] "GET /ctx1/echo/abc HTTP/1.1" 404 NR 0 0 0 - "-" "curl/7.77.0" "b4023be8-36fb-48a5-b452-4b7b02a3b097" "my-ip:32068" "-"

# I also checked Envoy's config_dump and didn't find the EDS config at this point

# add cluster
[2022-08-22 11:23:45.206][14][debug][config] [external/envoy/source/common/config/grpc_mux_impl.cc:165] Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 5473953689400181114
[2022-08-22 11:23:45.207][14][debug][config] [external/envoy/source/common/config/grpc_mux_impl.cc:143] Pausing discovery requests for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment (previous count 0)
[2022-08-22 11:23:45.207][14][info][upstream] [external/envoy/source/common/upstream/cds_api_helper.cc:35] cds: add 1 cluster(s), remove 1 cluster(s)
[2022-08-22 11:23:45.218][14][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:737] add/update cluster my-cluster starting warming
[2022-08-22 11:23:45.218][14][debug][config] [external/envoy/source/common/config/grpc_mux_impl.cc:113] gRPC mux addWatch for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment
[2022-08-22 11:23:45.218][14][debug][upstream] [external/envoy/source/common/upstream/cds_api_helper.cc:52] cds: add/update cluster 'my-cluster'
[2022-08-22 11:23:45.218][14][info][upstream] [external/envoy/source/common/upstream/cds_api_helper.cc:72] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2022-08-22 11:23:45.218][14][debug][config] [external/envoy/source/common/config/grpc_mux_impl.cc:150] Resuming discovery requests for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment (previous count 1)

[2022-08-22 11:24:00.207][14][warning][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:118] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment

# got 503
[2022-08-22T11:24:02.477Z] "GET /ctx1/echo/abc HTTP/1.1" 503 UH 0 19 0 - "-" "curl/7.77.0" "57ba4a44-7daa-479a-874e-c56df2f69cdc" "my-ip:32068" "-"

And the current config_dump of EDS is:

"dynamic_endpoint_configs": [
    {
     "endpoint_config": {
      "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
      "cluster_name": "my-cluster",
      "policy": {
       "overprovisioning_factor": 140
      }
     }
    }
   ]
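
To spot this state without reading the dump by hand, a small check along these lines may help. It is a sketch that assumes the fragment above has been wrapped in a JSON object; the `endpointsDump` type and `emptyAssignments` function are names invented for this example, and only the fields visible in the dump above are modeled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpointsDump models only the fields shown in the config_dump
// fragment above; everything else in the dump is ignored.
type endpointsDump struct {
	DynamicEndpointConfigs []struct {
		EndpointConfig struct {
			ClusterName string            `json:"cluster_name"`
			Endpoints   []json.RawMessage `json:"endpoints"`
		} `json:"endpoint_config"`
	} `json:"dynamic_endpoint_configs"`
}

// emptyAssignments returns the cluster names whose ClusterLoadAssignment
// carries no endpoints at all.
func emptyAssignments(raw []byte) ([]string, error) {
	var dump endpointsDump
	if err := json.Unmarshal(raw, &dump); err != nil {
		return nil, err
	}
	var empty []string
	for _, cfg := range dump.DynamicEndpointConfigs {
		if len(cfg.EndpointConfig.Endpoints) == 0 {
			empty = append(empty, cfg.EndpointConfig.ClusterName)
		}
	}
	return empty, nil
}

func main() {
	// The fragment from this issue, wrapped in braces to form valid JSON.
	raw := []byte(`{
	  "dynamic_endpoint_configs": [
	    {
	      "endpoint_config": {
	        "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
	        "cluster_name": "my-cluster",
	        "policy": {"overprovisioning_factor": 140}
	      }
	    }
	  ]
	}`)

	names, err := emptyAssignments(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // prints [my-cluster]
}
```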

My go-control-plane implementation logs:

# First send EDS 
2022-08-22T11:23:01.235Z        INFO    xds/callback.go:73      on stream response      {"stream_id": 3, "type_url": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"}
2022-08-22T11:23:01.235Z        DEBUG xds/callback.go:74      stream response detail  {"payload": "version_info:\"1\" resources:{ ignored here } type_url:\"type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment\" nonce:\"5\""}

# got EDS request (version: 1, nonce 5)
2022-08-22T11:23:39.208Z        INFO    xds/callback.go:49      on stream request       {"stream_id": 3, "type_url": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment", "resources": []}
2022-08-22T11:23:39.208Z        DEBUG   xds/callback.go:51      stream request detail(simple)   {"stream_id": 3, "payload": "{\"version_info\":\"1\",\"node\":{...}},\"type_url\":\"type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment\",\"response_nonce\":\"5\"}"}
# Linear cache open watch
2022-08-22T11:23:39.208Z        INFO    v3/linear.go:352        [linear cache] open watch for all type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment resources, system version "1"

# got EDS request (version: 1, nonce 5)
2022-08-22T11:23:45.227Z        INFO    xds/callback.go:49      on stream request       {"stream_id": 3, "type_url": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment", "resources": ["my-cluster"]}
2022-08-22T11:23:45.228Z        DEBUG   xds/callback.go:51      stream request detail(simple)   {"stream_id": 3, "payload": "{\"version_info\":\"1\",\"node\":{...}}},\"resource_names\":[\"my-cluster\"],\"type_url\":\"type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment\",\"response_nonce\":\"5\"}"}
# Linear cache open watch
2022-08-22T11:23:45.228Z        INFO    v3/linear.go:370        [linear cache] open watch for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment Resources:[c#evelyn-gwcp_default#provider-demo#default], system version "1"

From the logs, I think two things combine to cause the problem:

  1. Envoy cleared the ClusterLoadAssignment, but still sends the previous version in its next request after the cluster is added back.
  2. go-control-plane's Linear cache decides staleness by comparing versions:
    if lastVersion < cache.versionVector[name] {
        stale = true
        staleResources = append(staleResources, name)
    }
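
One possible direction, sketched here only as an illustration, is to also treat a resource as stale when the server has no record of the client currently holding it. The `staleResources` function and the per-stream `known` set below are hypothetical and not part of go-control-plane. Treating the request with an empty resource list as an unsubscribe is likewise an assumption made for this sketch; the logs above show the Linear cache opening a watch "for all" resources on that request.

```go
package main

import "fmt"

// staleResources returns the requested names that need to be (re)sent.
// known is a hypothetical per-stream set of resource names the client is
// assumed to still hold; it is not a go-control-plane type.
func staleResources(lastVersion uint64, versionVector map[string]uint64,
	known map[string]bool, requested []string) []string {
	var stale []string
	for _, name := range requested {
		changed := lastVersion < versionVector[name]
		// A name the client no longer holds is stale even if its
		// version has not moved since the last response.
		if changed || !known[name] {
			stale = append(stale, name)
		}
	}
	return stale
}

func main() {
	versions := map[string]uint64{"my-cluster": 1}
	known := map[string]bool{"my-cluster": true}

	// Steady state: the client holds the resource at version 1.
	fmt.Println(staleResources(1, versions, known, []string{"my-cluster"})) // prints []

	// Assumption: the request with an empty resource list is treated
	// as an unsubscribe, so the name is forgotten for this stream.
	delete(known, "my-cluster")

	// Re-subscribe with the old version: the resource is sent again.
	fmt.Println(staleResources(1, versions, known, []string{"my-cluster"})) // prints [my-cluster]
}
```

The trade-off in this sketch is extra per-stream state on the server in exchange for not relying on the version the client reports.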

Maybe I should open an issue in envoyproxy/envoy?

By the way, it seems I'm also hitting envoyproxy/envoy#7529.

My versions:
Envoy: envoyproxy/envoy@f04e10f (1.24)
go-control-plane: v0.10.3-0.20220711195203-227f5af5bbe6

@tony612
Contributor Author

tony612 commented Aug 24, 2022

This is my Envoy config:

admin:
  address:
    socketAddress:
      address: 0.0.0.0
      portValue: 9901
dynamicResources:
  adsConfig:
    apiType: GRPC
    grpcServices:
    - envoyGrpc:
        clusterName: ads-cluster
    transportApiVersion: V3
  cdsConfig:
    ads: {}
    resourceApiVersion: V3
  ldsConfig:
    ads: {}
    resourceApiVersion: V3
node:
# ignored
staticResources:
  clusters:
  - connectTimeout: 5s
    loadAssignment:
      clusterName: ads-cluster
      endpoints:
      - lbEndpoints:
        - endpoint:
            address:
              socketAddress:
                address: envoy-control
                portValue: 15500
    name: ads-cluster
    type: STRICT_DNS
    typedExtensionProtocolOptions:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        '@type': type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicitHttpConfig:
          http2ProtocolOptions:
            connectionKeepalive:
              interval: 30s
              timeout: 5s

valerian-roche added a commit to valerian-roche/go-control-plane that referenced this issue Aug 25, 2022
…nvoy unsubscribes then subscribes again to a resource
valerian-roche added a commit to valerian-roche/go-control-plane that referenced this issue Aug 25, 2022
…voy unsubscribes then subscribes again to a resource

Fix potential deadlock in sotw-ads related to improper cleanup of watches in Linear cache when using delta in non-wildcard
Fix improper request set on sotw responses in Linear cache
Replaced lastResponse in sotw server by staged resources pending ACK

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
valerian-roche added a commit to valerian-roche/go-control-plane that referenced this issue Aug 25, 2022
…nvoy unsubscribes then subscribes again to a resource

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 23, 2022
@github-actions

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@haoruolei

Hi all, this issue was closed, but it seems the above fix was abandoned. Any updates?

valerian-roche added a commit to valerian-roche/go-control-plane that referenced this issue Jan 5, 2024
Rename KnownResources to ACKedResources to better reflect the change

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>