No Inbound peers after upgrade #29312

Closed · J1a-wei opened this issue Mar 22, 2024 · 16 comments

J1a-wei commented Mar 22, 2024

System information

Geth version: v1.13.14
Prysm version: v5.0.1

geth command

        - --datadir=/data
        - --http
        - --http.addr=0.0.0.0
        - --http.api=eth,net,web3,engine,admin
        - --http.vhosts=*
        - --http.corsdomain=*
        - --ws
        - --ws.origins=*
        - --ws.addr=0.0.0.0
        - --ws.api=eth,net,web3
        - --graphql
        - --graphql.corsdomain=*
        - --graphql.vhosts=*
        - --authrpc.addr=0.0.0.0
        - --authrpc.vhosts=*
        - --authrpc.jwtsecret=/secret/jwtsecret
        - --authrpc.port=8551
        - --verbosity=5
        - --maxpeers=200
        - --mainnet
        - --syncmode=snap
        - --db.engine=pebble
        - --state.scheme=path
        - --metrics
        - --pprof
        - --pprof.addr=0.0.0.0
        - --pprof.port=6060
        - --port=31882
        - --discovery.port=31883
        - --nat=extip:$MY_PUBLIC_IP 

prysm command

        - --datadir=/data
        - --rpc-host=0.0.0.0
        - --rpc-port=4000
        - --accept-terms-of-use
        - --execution-endpoint=http://ethereum-geth-staking-i4i-4xlarge-b:8551
        - --jwt-secret=/secret/jwtsecret
        - --enable-debug-rpc-endpoints
        - --suggested-fee-recipient=0xCC416b7f92EDd2dC5A5a0aB6D1e060d8298971e3
        - --mainnet
        - --accept-terms-of-use
        - --p2p-max-peers=160
        - --enable-debug-rpc-endpoints
        - --checkpoint-sync-url=https://beaconstate.ethstaker.cc
        - --subscribe-all-subnets
        - --monitoring-port=9090
        - --monitoring-host=0.0.0.0
        - --http-mev-relay=http://flashbots-ethereum-mainnet-mev-boost.web3-mev:18550
        - --grpc-gateway-host=0.0.0.0
        - --grpc-gateway-port=8080
        - --p2p-host-ip=$MY_PUBLIC_IP
        - --p2p-tcp-port=31872
        - --p2p-udp-port=31873

Expected behaviour

Lots of inbound peers

Actual behaviour

No inbound peers

Steps to reproduce the behaviour

For the Cancun upgrade, I also upgraded geth, from v1.13.8 to v1.13.14.

However, I noticed that the number of inbound peers dropped to 0, which is very bad for staking nodes. Our statistics show that missed attestations have doubled in the past week.

I suspect there has been a change in the p2p code, or perhaps this is a bug.

We run the nodes in containers and NAT our p2p ports; the startup commands are listed above.

I have also done some testing and troubleshooting. When I test from inside the container against localhost, using telnet or the devp2p tool, the connection is established correctly and stays open for a while.

But when connecting from outside, for example via the host's network interface address, geth returns EOF almost immediately. Packet captures show the same thing.
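
For reference, the checks were along these lines (a sketch; the node ID is a placeholder and 31882 is the --port value from the config above):

# from inside the container (works):
devp2p rlpx ping "enode://<node-id>@127.0.0.1:31882"
# from outside, via the host interface (returns EOF):
devp2p rlpx ping "enode://<node-id>@$MY_PUBLIC_IP:31882"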

This makes me very curious.

Trace

When I connect to the p2p port via 127.0.0.1, it consistently responds. However, when I connect via the host's public interface, a large share of the 10 connection attempts are dropped.

[screenshot: devp2p rlpx ping test results]

I used tcpdump to capture traffic. The capture below shows geth actively closing the connection, sending a FIN to the client about 5 seconds after the TCP handshake.

09:53:32.692354 IP 172-1-18-49.rancher-monitoring-prometheus-node-exporter.cattle-monitoring-system.svc.cluster.local.38404 > ethereum-geth-staking-i4i-4xlarge-b-0.ethereum-geth-staking-i4i-4xlarge-b.web3-helm.svc.cluster.local.31882: Flags [S], seq 3657559020, win 62727, options [mss 8961,sackOK,TS val 1421458438 ecr 0,nop,wscale 7], length 0
09:53:32.692361 IP ethereum-geth-staking-i4i-4xlarge-b-0.ethereum-geth-staking-i4i-4xlarge-b.web3-helm.svc.cluster.local.31882 > 172-1-18-49.rancher-monitoring-prometheus-node-exporter.cattle-monitoring-system.svc.cluster.local.38404: Flags [S.], seq 1266159035, ack 3657559021, win 62643, options [mss 8961,sackOK,TS val 1782476317 ecr 1421458438,nop,wscale 7], length 0
09:53:32.692448 IP 172-1-18-49.rancher-monitoring-prometheus-node-exporter.cattle-monitoring-system.svc.cluster.local.38404 > ethereum-geth-staking-i4i-4xlarge-b-0.ethereum-geth-staking-i4i-4xlarge-b.web3-helm.svc.cluster.local.31882: Flags [.], ack 1, win 491, options [nop,nop,TS val 1421458438 ecr 1782476317], length 0
09:53:37.693343 IP ethereum-geth-staking-i4i-4xlarge-b-0.ethereum-geth-staking-i4i-4xlarge-b.web3-helm.svc.cluster.local.31882 > 172-1-18-49.rancher-monitoring-prometheus-node-exporter.cattle-monitoring-system.svc.cluster.local.38404: Flags [F.], seq 1, ack 1, win 490, options [nop,nop,TS val 1782481318 ecr 1421458438], length 0
09:53:37.693464 IP 172-1-18-49.rancher-monitoring-prometheus-node-exporter.cattle-monitoring-system.svc.cluster.local.38404 > ethereum-geth-staking-i4i-4xlarge-b-0.ethereum-geth-staking-i4i-4xlarge-b.web3-helm.svc.cluster.local.31882: Flags [F.], seq 1, ack 2, win 491, options [nop,nop,TS val 1421463439 ecr 1782481318], length 0
09:53:37.693474 IP ethereum-geth-staking-i4i-4xlarge-b-0.ethereum-geth-staking-i4i-4xlarge-b.web3-helm.svc.cluster.local.31882 > 172-1-18-49.rancher-monitoring-prometheus-node-exporter.cattle-monitoring-system.svc.cluster.local.38404: Flags [.], ack 2, win 490, options [nop,nop,TS val 1782481318 ecr 1421463439], length 0
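
For reference, a capture like the above can be reproduced with something along these lines (a sketch; the interface choice is an assumption, and 31882 is the p2p TCP port from the config):

tcpdump -i any -nn 'tcp port 31882'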

Any help is greatly appreciated.

jflo commented Mar 25, 2024

The Besu team is tracking a similar issue; at the moment, Besu only seems to peer with non-geth peers. Watching.

fjl (Contributor) commented Mar 25, 2024

which is very bad for staking nodes. Our statistics show that missed attestations have doubled in the past week.

Geth peering issues should not make a difference in attestations, because attestations are handled by the beacon chain client.

fjl (Contributor) commented Mar 25, 2024

@J1a-wei can you please double-check your firewall configuration? There haven't been any changes in geth's p2p code between the releases you mentioned.

J1a-wei (Author) commented Mar 26, 2024

Hi @fjl, we contacted an AWS engineer the day before yesterday (our nodes are deployed on AWS) and confirmed that this is not a firewall or security group issue.
Preliminary packet-capture analysis suggests an anomaly in geth.

The beacon chain client is experiencing the same issue; it has no inbound peers.

We ran comparative experiments and found that with inbound peers the number of missed attestations is significantly lower, roughly by half.

lightclient (Member) commented:

The beacon chain client is experiencing the same issue; it has no inbound peers.

Considering that peering on the EL is unrelated to peering on the CL, this seems to imply some kind of configuration issue.

J1a-wei (Author) commented Mar 26, 2024

Hi @lightclient,
I don't think so. I didn't change any configuration files, just swapped the image. Our staking nodes are deployed on Kubernetes. We run a total of 1600 validators, using combinations of geth + prysm and geth + lighthouse + SSV. Penalties have increased by more than 2.5x since the mainnet upgrade. We've investigated, and the only anomaly seems to be the peers.

Additionally, earlier on we didn't have NAT mapping enabled, and our monitoring showed a relatively high rate of missed signatures. Once NAT mapping was enabled and we had inbound peers, the situation improved by about a factor of two. This is based on our practical experience.

Ref: https://www.symphonious.net/2021/08/14/exploring-eth2-why-open-ports-matter/

J1a-wei (Author) commented Mar 26, 2024

I need to correct myself: the inbound peer count is not 0, but all inbound peers are our own intranet addresses. There are no external peers. Our test commands are:

Geth

curl localhost:8545 -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' | jq '.result[] | select(.network.inbound == true) | select(.network.remoteAddress | startswith("172.1.")| not)'

Prysm or Lighthouse

curl "localhost:8080/eth/v1/node/peers?state=connected&direction=inbound" -H 'Content-Type: application/json' | jq '.data[] | select(.last_seen_p2p_address  | startswith("/ip4/172.1.")| not)'

Additionally, it's worth noting that 172.1.xx.xx is our K8s CIDR, and within this subnet we have also deployed some testnet nodes (Holesky, Sepolia, and so on). However, only the mainnet nodes have NAT mapping.

lightclient (Member) commented:

We've investigated, and the only anomaly seems to be the peers.

My point is that how well connected your EL is doesn't really matter for attestation performance.

What I find interesting is that two completely different pieces of software / p2p stacks are having similar peering issues. Do all of your CL clients have peering issues or just prysm?

the inbound peer count is not 0, but all inbound peers are our own intranet addresses

How many of your inbound peers are your own intranet addresses? Given maxpeers=200 and geth's default dial ratio of 3, your node will accept roughly 133 inbound connections (maxpeers minus the ~maxpeers/3 slots reserved for dialed peers). If your intranet has more nodes than that, they may fully saturate your ability to accept other inbound connections.
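
For reference, a quick way to count the current inbound peers via the admin API (a sketch in the same style as the commands above, assuming the HTTP RPC from the geth config is reachable on localhost:8545):

curl -s localhost:8545 -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' | jq '[.result[] | select(.network.inbound == true)] | length'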

J1a-wei (Author) commented Mar 26, 2024

Hi @lightclient

We currently have only 5 inbound peers from our own intranet, so I don't think that's the issue.

In addition, I have re-run geth + lighthouse on a new machine with the same parameters, and it is currently running well with plenty of inbound peers.

So I am starting to suspect that we might be under attack?

fjl (Contributor) commented Mar 26, 2024

In your screenshot, I can see you have tried to connect to the node using devp2p rlpx ping. Can you please also try devp2p discv4 ping on the same enode:// URL with internet IP?

I'm asking you to do this to confirm whether your firewall permits inbound UDP traffic correctly.
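
For example (a sketch; the node ID is a placeholder, and 31883 is the --discovery.port value from the config above):

devp2p discv4 ping "enode://<node-id>@$MY_PUBLIC_IP:31883"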

skylenet (Member) commented Mar 26, 2024

@J1a-wei on the Kubernetes side, I'm assuming that $MY_PUBLIC_IP is the IP of the node where the pod is running, is that right? Based on the port numbers you used in the example, I assume you're using a Service of type NodePort, am I correct? If that's the case, did you enable externalTrafficPolicy: Local (example here) to avoid an additional routing hop?
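
For reference, a minimal sketch of enabling this on an existing Service (the Service name geth-p2p is a placeholder, not taken from this thread):

kubectl patch svc geth-p2p -p '{"spec":{"externalTrafficPolicy":"Local"}}'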

amplicity commented:

OP - can you run admin.nodeInfo from geth attach?

What do you see under discovery?

  ports: {
    discovery: 1038,
    listener: 30303
  },
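
For reference, one way to read this non-interactively (a sketch; the IPC path assumes the default location under the --datadir=/data from the config above):

geth attach --exec 'admin.nodeInfo.ports' /data/geth.ipc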

My discovery port consistently changes from my setting (30303) to a random value shortly after startup. I can see it change in the logs, but I'm not sure what triggers it.

This caused a low inbound peer count for me, especially right after booting geth, because I have all ports blocked on my machine except what's required (i.e. 30303 tcp/udp). The longer my discovery port stayed at the expected value (30303), the more inbound peers I got.

Curious if this is your issue? This is also new to me, and I've searched everywhere and haven't found anything; it just started happening. Running Ubuntu 23.04.

J1a-wei (Author) commented Mar 27, 2024

@fjl
Yes, I tried devp2p discv4 ping; it behaves normally via both the public address and the internal address.

INFO [03-27|12:04:56.499] New local node record                    seq=1,711,512,296,499 id=abc49c9338f23b4f ip=127.0.0.1 udp=55489 tcp=0
node responded to ping (RTT 365.13475ms).

@skylenet
Partially correct: we expose the p2p ports through an NLB load balancer. I tried enabling externalTrafficPolicy: Local and adjusted the backend listeners in AWS. It seems to be working fine now, but I don't understand the underlying cause.

When we first deployed, we didn't enable externalTrafficPolicy: Local, and p2p still worked normally.

From what I've read, setting it to Cluster adds an extra network hop. What impact would this have?

Also, I'm quite curious about how Ethereum p2p works. After a node starts up, it generates an ENR, but I've noticed that the ENR doesn't seem to be regenerated when the node restarts. How can I make it regenerate, and where is it stored?
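
For reference, the node's current record can be inspected via the admin API (a sketch, assuming the HTTP RPC from the config above is reachable on localhost:8545):

curl -s localhost:8545 -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","method":"admin_nodeInfo","params":[],"id":1}' | jq '.result.enr'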

@amplicity
I checked our situation, and the ports haven't changed:

  ports: {
    discovery: 31883,
    listener: 31882
  }

J1a-wei closed this as completed Apr 9, 2024
holiman (Contributor) commented Apr 9, 2024

@J1a-wei did this resolve itself 'organically'? Or is peering still a problem? Or did you find out some other cause of this?

web3dev2023 commented:

This seems to be a bug involving both Prysm and geth.
Please also refer to prysmaticlabs/prysm#13431 and prysmaticlabs/prysm#13936.

jflo commented May 8, 2024

The Besu team is tracking a similar issue; at the moment, Besu only seems to peer with non-geth peers. Watching.

The Besu team is no longer concerned that this may be related.
