Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETCD unusual long request when proxying from the grpc gateway #14065

Closed
Battlefield-Pony opened this issue May 23, 2022 · 7 comments
Closed

ETCD unusual long request when proxying from the grpc gateway #14065

Battlefield-Pony opened this issue May 23, 2022 · 7 comments
Labels

Comments

@Battlefield-Pony
Copy link

What happened?

during setting up 3 node etcd cluster, it is found that if there are more than 5 apisix nodes exists, the etcd may takes longer than 20 seconds to process the request.
As picture indicated below
image

What did you expect to happen?

Usually with in this amount of data and connections, it usually wont be wait more than 1 second to process.

How can we reproduce it (as minimally and precisely as possible)?

setting up 3 node etcd cluster, and 6 apisix nodes
try to modify a route, and it produces.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64


$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

paste your configuration here

flags:
--cert-file=/root/etcd-cert/server.crt
--key-file=/root/etcd-cert/server.key
--peer-auto-tls --metrics 'extensive'
--enable-pprof
--log-level 'debug'

ENV_PARAMS:
ETCD_INITIAL_CLUSTER="node1=https://node1:2380,node2=https://node2:2380,node3=https://node3:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_NAME="node1"
ETCD_DATA_DIR="/data/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://node1:2380"
ETCD_LISTEN_CLIENT_URLS="https://node1:2379,https://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://node1:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://node1:2379"

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
root@pekshcsitd45770:~# etcdctl --cacert certs/ca/ca.crt --endpoints=https://10.69.184.105:2379 --user apisix01-cp member list -w  table  --dial-timeout=5s --debug
ETCDCTL_CACERT=certs/ca/ca.crt
ETCDCTL_CERT=
ETCDCTL_COMMAND_TIMEOUT=5s
ETCDCTL_DEBUG=true
ETCDCTL_DIAL_TIMEOUT=5s
ETCDCTL_DISCOVERY_SRV=
ETCDCTL_DISCOVERY_SRV_NAME=
ETCDCTL_ENDPOINTS=[https://10.69.184.105:2379]
ETCDCTL_HEX=false
ETCDCTL_INSECURE_DISCOVERY=true
ETCDCTL_INSECURE_SKIP_TLS_VERIFY=false
ETCDCTL_INSECURE_TRANSPORT=true
ETCDCTL_KEEPALIVE_TIME=2s
ETCDCTL_KEEPALIVE_TIMEOUT=6s
ETCDCTL_KEY=
ETCDCTL_PASSWORD=
ETCDCTL_USER=apisix01-cp:apisix01-cp
ETCDCTL_WRITE_OUT=table
WARNING: 2022/05/23 16:20:26 [core] Adjusting keepalive ping interval to minimum period of 10s
WARNING: 2022/05/23 16:20:26 [core] Adjusting keepalive ping interval to minimum period of 10s
INFO: 2022/05/23 16:20:26 [core] parsed scheme: "etcd-endpoints"
INFO: 2022/05/23 16:20:26 [core] ccResolverWrapper: sending update to cc: {[{10.69.184.105:2379 10.69.184.105 <nil> 0 <nil>}] 0xc00014f100 <nil>}
INFO: 2022/05/23 16:20:26 [core] ClientConn switching balancer to "round_robin"
INFO: 2022/05/23 16:20:26 [core] Channel switches to new LB policy "round_robin"
INFO: 2022/05/23 16:20:26 [balancer] base.baseBalancer: got new ClientConn state:  {{[{10.69.184.105:2379 10.69.184.105 <nil> 0 <nil>}] 0xc00014f100 <nil>} <nil>}
INFO: 2022/05/23 16:20:26 [core] Subchannel Connectivity change to CONNECTING
INFO: 2022/05/23 16:20:26 [balancer] base.baseBalancer: handle SubConn state change: 0xc000783080, CONNECTING
INFO: 2022/05/23 16:20:26 [core] Channel Connectivity change to CONNECTING
INFO: 2022/05/23 16:20:26 [core] Subchannel picks a new address "10.69.184.105:2379" to connect
INFO: 2022/05/23 16:20:26 [core] Subchannel Connectivity change to READY
INFO: 2022/05/23 16:20:26 [balancer] base.baseBalancer: handle SubConn state change: 0xc000783080, READY
INFO: 2022/05/23 16:20:26 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc000783080:{{10.69.184.105:2379 10.69.184.105 <nil> 0 <nil>}}]}
INFO: 2022/05/23 16:20:26 [core] Channel Connectivity change to READY
+------------------+---------+-----------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |      NAME       |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+----------------------------+----------------------------+------------+
| 42fdca43dd793624 | started | pekshcsitd46736 | https://10.69.187.193:2380 | https://10.69.187.193:2379 |      false |
| b3fd80a70329c965 | started | pekshcsitd46740 | https://10.69.184.105:2380 | https://10.69.184.105:2379 |      false |
| f36bef6988f6bc37 | started | pekshcsitd46735 |  https://10.69.188.23:2380 |  https://10.69.188.23:2379 |      false |
+------------------+---------+-----------------+----------------------------+----------------------------+------------+

image

Relevant log output

No response

@Battlefield-Pony
Copy link
Author

Actually here are some additional info.

  1. the cpu profilling did not found any issues, and the pressure seems not from the CPU.
  2. the requests directly using benchmark range is responding fast, but the request won't be displayed in the /debug/request url.
  3. What usually will doduring the steps shown on image below
    image

@Battlefield-Pony
Copy link
Author

Battlefield-Pony commented May 23, 2022

Update: Using gRPC Gateway http methods performing Putting Operations also takes 20-ish seconds
image

@ptabor
Copy link
Contributor

ptabor commented May 23, 2022

Please collect metrics (e.g. http://127.0.0.1:2379/metrics) from the server, such that it's visible that the delay happens on the server side.

@Battlefield-Pony
Copy link
Author

Battlefield-Pony commented May 24, 2022

Gday @ptabor

As Requested, this is collected metrics, I still cant see what is the issue from the metrics (sigh....) I really appreciate any helps

Update: I have tried to downgrade etcd to 3.4.16 and it still happens, and the metrics file is based on 3.4
metrics.txt

@Battlefield-Pony Battlefield-Pony changed the title ETCD unusual long request ETCD unusual long request when proxying from the grpc gateway May 24, 2022
@Battlefield-Pony
Copy link
Author

New Update:

When we disabled server-client communication TLS, the situation disappeared, everything goes well, but how? (thinking face)

image

No more more than 10s requests

@ahrtr
Copy link
Member

ahrtr commented May 27, 2022

See issuecomment-1140057917

@ahrtr
Copy link
Member

ahrtr commented Jul 13, 2022

Resolved by #14169 and #14219

@ahrtr ahrtr closed this as completed Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

3 participants