
p2p: simultaneous connect between two nodes fails on both sides #17988

Open
WinstonPrivacy opened this issue Oct 26, 2018 · 13 comments

@WinstonPrivacy

WinstonPrivacy commented Oct 26, 2018

System information

Geth version: N/A
OS & Version: Linux
Commit hash: 16e4d0e

Expected behaviour

The bug is reproduced in unit tests. One peer node is started with the bootstrap ID of an already-running node. Both nodes define the same RLPx protocol. The nodes are supposed to (and usually do) connect and agree on the RLPx protocol.
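
For illustration only, here is a minimal sketch of this kind of setup (not the actual test code: the helper names, the placeholder protocol, and the use of AddPeer instead of discovery bootstrapping are assumptions made to keep the sketch self-contained):

// A minimal reproduction sketch, not the actual test code.
package p2ptest

import (
	"testing"
	"time"

	"github.com/ethereum/go-ethereum/crypto"
	"github.com/ethereum/go-ethereum/p2p"
)

// newServer starts a p2p.Server with one protocol on a random local port.
func newServer(t *testing.T, name string, proto p2p.Protocol) *p2p.Server {
	key, err := crypto.GenerateKey()
	if err != nil {
		t.Fatal(err)
	}
	srv := &p2p.Server{Config: p2p.Config{
		PrivateKey:  key,
		Name:        name,
		MaxPeers:    10,
		ListenAddr:  "127.0.0.1:0",
		NoDiscovery: true, // the real tests bootstrap via discovery instead
		Protocols:   []p2p.Protocol{proto},
	}}
	if err := srv.Start(); err != nil {
		t.Fatal(err)
	}
	return srv
}

func TestSimultaneousConnect(t *testing.T) {
	proto := p2p.Protocol{
		Name:    "shadow", // placeholder protocol name
		Version: 1,
		Length:  1,
		Run: func(p *p2p.Peer, rw p2p.MsgReadWriter) error {
			// Keep reading until the connection drops.
			for {
				msg, err := rw.ReadMsg()
				if err != nil {
					return err
				}
				msg.Discard()
			}
		},
	}
	remote := newServer(t, "remote", proto)
	local := newServer(t, "local", proto)
	defer remote.Stop()
	defer local.Stop()

	// Dialing in both directions approximates the simultaneous connect
	// that discovery bootstrapping triggers intermittently.
	remote.AddPeer(local.Self())
	local.AddPeer(remote.Self())

	time.Sleep(2 * time.Second)
	if remote.PeerCount() != 1 || local.PeerCount() != 1 {
		t.Fatalf("peer counts after connect: remote=%d local=%d, want 1 and 1",
			remote.PeerCount(), local.PeerCount())
	}
}

In the real tests the second node bootstraps to the first via discovery, which is what makes the simultaneous dial (and therefore the failure) intermittent.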

Actual behaviour

Unit tests randomly fail (~10% of the time) when running within a single process (as is usually the case in unit tests). The nodes report a "Rejected peer" error and drop the connection with each other. The actual error returned within server.go setupConn() is "already connected".

The issue has been traced to server.go:724:

			err := srv.protoHandshakeChecks(peers, inboundCount, c)
			if err == nil {
				// The handshakes are done and it passed all checks.
				p := newPeer(c, srv.Protocols)
				// If message events are enabled, pass the peerFeed
				// to the peer
				if srv.EnableMsgEvents {
					p.events = &srv.peerFeed
				}
				name := truncateName(c.name)
				srv.log.Debug("Adding p2p peer", "name", name, "addr", c.fd.RemoteAddr(), "peers", len(peers)+1)
				go srv.runPeer(p)
				peers[c.node.ID()] = p

This is a classic race condition. What is happening is that a peer (from the first connection) is in the process of being added to the peers map just as the second connection is being established. If the peers map has been updated before setupConn() reaches the addpeer checkpoint, the connection is quietly dropped and everything works fine. Otherwise, the code above is reached, a new peer is started, and this overwrites the existing entry in the peers map.

If an RLPx protocol's Run function has already been started for the first peer, it may already be listening for messages on that connection. That read will always return an EOF.
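
To show where that EOF surfaces, here is a hypothetical protocol handler that loops on ReadMsg (not our actual Run implementation):

// runShadow is a hypothetical protocol handler, shown only to illustrate
// where the EOF surfaces; it is not the actual protocol code.
func runShadow(p *p2p.Peer, rw p2p.MsgReadWriter) error {
	for {
		msg, err := rw.ReadMsg()
		if err != nil {
			// When the duplicate connection overwrites the peers entry and the
			// original connection is torn down, this read returns io.EOF.
			return err
		}
		msg.Discard()
	}
}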

Possible solution: Check the map again just prior to inserting a new peer (a rough sketch is below). The map should be protected with a mutex to prevent concurrency problems.

(Edit: realized the peers map is not global, but is already scoped to a single server instance.)
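
A sketch of that suggestion, using a hypothetical mutex-guarded wrapper type (the real server.go keeps peers in a plain map inside Server.run):

import (
	"sync"

	"github.com/ethereum/go-ethereum/p2p"
	"github.com/ethereum/go-ethereum/p2p/enode"
)

// peerSet is a hypothetical guarded replacement for the plain peers map.
type peerSet struct {
	mu    sync.Mutex
	peers map[enode.ID]*p2p.Peer
}

// addIfAbsent inserts p only when no peer with the same ID is already
// tracked, so a late duplicate connection cannot overwrite a live entry.
func (ps *peerSet) addIfAbsent(id enode.ID, p *p2p.Peer) bool {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	if _, ok := ps.peers[id]; ok {
		return false // caller should drop the duplicate connection instead
	}
	ps.peers[id] = p
	return true
}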

@WinstonPrivacy
Author

Now that I've realized the peers map is not global, it seems likely that this bug could affect production Ethereum nodes.

@WinstonPrivacy
Author

I've implemented some code to check the peers map just before updating the existing peer. Unfortunately, this has not fixed the problem. Any suggestions welcome.

@WinstonPrivacy
Author

Checking in to see if anyone else has encountered this. We are still seeing the same problem.

@fjl
Contributor

fjl commented Jan 31, 2019

Can you share the unit test code to reproduce this issue? I've heard other reports of this issue but haven't been able to reproduce it myself.

@WinstonPrivacy
Author

That's a big ask, as I'm going to have to isolate not only the code but also the unit tests. It only occurs when multiple servers are started in multiple tests... if those same unit tests are run in isolation, the error never occurs. Let me see what I can do.

@WinstonPrivacy
Author

Short of trying to isolate this, I thought I would share some logging data to see if it might help pinpoint things. In particular, I am seeing some weird behavior which does not happen on every run. It looks like the nodes are trying to connect to each other twice. The second connection fails because the first was already made and this drops the peers.

To illustrate this, I've annotated some logging code.

We start by spinning up two nodes (call them "remote" and "local"). They start just fine and local bootstraps to "remote" without any problems. Note that the log below contains some extra output that we've added:

// Start the first node: "Remote". All good here.

t=2019-01-31T23:24:02+0000 lvl=info msg="ShadowServer starting." Node=Remote port=56831
t=2019-01-31T23:24:02+0000 lvl=info msg=ShadowServer.Listen()    Node=Remote msg=" Launching shadow privacy service" addr=:56831
t=2019-01-31T23:24:02+0000 lvl=info msg=startListening                      Node=Remote listenAddr=:56830
t=2019-01-31T23:24:02+0000 lvl=dbug msg="UDP listener up"                   Node=Remote addr=[::]:56830
t=2019-01-31T23:24:02+0000 lvl=info msg="ShadowNetwork Hello World"         Node=Remote enode=6c1f40d09a listen port=:56830
t=2019-01-31T23:24:02+0000 lvl=info msg="ShadowNetwork started."            Node=Remote Local=6c1f40d09a port=56831

// Start the second node: "Local". All good here as well.

..t=2019-01-31T23:24:02+0000 lvl=info msg="ShadowServer starting."            Node=Local  port=56591

// While local is starting, remote starts its TCP and RLPx listeners...

t=2019-01-31T23:24:02+0000 lvl=dbug msg="TCP listener up"                   Node=Remote addr=[::]:56830
t=2019-01-31T23:24:02+0000 lvl=dbug msg="RLPx listener"                     Node=Remote maxPendingPeers=50
t=2019-01-31T23:24:02+0000 lvl=dbug msg="RLPx listener is waiting"          Node=Remote

// Now local starts its listeners. So far, so good...

t=2019-01-31T23:24:02+0000 lvl=info msg=startListening                      Node=Local  listenAddr=:56590
t=2019-01-31T23:24:02+0000 lvl=dbug msg="UDP listener up"                   Node=Local  addr=[::]:56590
t=2019-01-31T23:24:02+0000 lvl=dbug msg="TCP listener up"                   Node=Local  addr=[::]:56590
t=2019-01-31T23:24:02+0000 lvl=dbug msg="RLPx listener"                     Node=Local  maxPendingPeers=50
t=2019-01-31T23:24:02+0000 lvl=dbug msg="RLPx listener is waiting"          Node=Local

// Remote's P2P server is finally up and running and dial tasks start ticking off.

t=2019-01-31T23:24:02+0000 lvl=info msg="Started P2P Server"                Node=Remote self=0894731ea016f4a6
t=2019-01-31T23:24:02+0000 lvl=trce msg="New dial task"                     Node=Remote task="discovery lookup (no results)"

// Local catches up and starts its P2P server.

t=2019-01-31T23:24:02+0000 lvl=info msg=ShadowServer.Listen()               Node=Local  msg=" Launching shadow privacy service" addr=:56591
t=2019-01-31T23:24:02+0000 lvl=info msg="ShadowNetwork Hello World"         Node=Local  enode=0201687c0a listen port=:56590
t=2019-01-31T23:24:02+0000 lvl=info msg="ShadowNetwork started."            Node=Local  Local=0201687c0a port=56591
t=2019-01-31T23:24:02+0000 lvl=info msg="Started P2P Server"                Node=Local  self=cf766f7cf871db7b

// Local bootstraps to Remote...

t=2019-01-31T23:24:02+0000 lvl=trce msg="New dial task"                     Node=Local  task="discovery lookup (no results)"
t=2019-01-31T23:24:03+0000 lvl=trce msg="Dial task done"                    Node=Local  task="discovery lookup (2 results)"
t=2019-01-31T23:24:03+0000 lvl=trce msg="New dial task"                     Node=Local  task="dyndial 0894731ea016f4a6 127.0.0.1:56830"
t=2019-01-31T23:24:03+0000 lvl=trce msg="New dial task"                     Node=Local  task="discovery lookup (no results)"

// Remote accepts the connection.

t=2019-01-31T23:24:03+0000 lvl=trce msg="Accepted connection"               Node=Remote addr=127.0.0.1:60480
t=2019-01-31T23:24:03+0000 lvl=dbug msg="RLPx listener is waiting"          Node=Remote

// I added this debug code to try to get some insight into the handshake. We see the posthandshake occurring in server.go:setupConn(). This happens for both the remote and local nodes, and they seem to complete successfully.


[DEBUG] Waiting for posthandshake...
[DEBUG] Server.run:PostHandshake handler
[DEBUG] Server.run:encHandshakeChecks()

[DEBUG] Waiting for posthandshake...
[DEBUG] Server.run:PostHandshake handler
[DEBUG] Server.run:encHandshakeChecks()

[DEBUG] posthandshake completed.

// Now things get a little weird (or maybe I don't understand what should be happening). Remote dials out to local, even though they both already have a connection. This happens before the final posthandshake completes.

t=2019-01-31T23:24:03+0000 lvl=trce msg="New dial task"                     Node=Remote task="dyndial cf766f7cf871db7b 127.0.0.1:56590"

[DEBUG] posthandshake completed.

[DEBUG] Server.run:encHandshakeChecks()

// Remote now adds the peer to the peer[] map. This happens in server.go:run(), within the code "case c := <-srv.addpeer:"

t=2019-01-31T23:24:03+0000 lvl=dbug msg="Adding p2p peer"                   Node=Remote name= addr=127.0.0.1:60480 peers=1
[DEBUG] peers[] after adding new node map[cf766f7cf871db7bee6fabc8d9a0d811ba63dcfac3bbdbe78c7154806afbc3eb:Peer cf766f7cf871db7b 127.0.0.1

// Here we see that Remote has completed the setupConn() routine.

t=2019-01-31T23:24:03+0000 lvl=trce msg="connection set up"                 Node=Remote id=cf766f7cf871db7bee6fabc8d9a0d811ba63dcfac3bbdbe78c7154806afbc3eb add


// Now local starts this same process.

t=2019-01-31T23:24:03+0000 lvl=trce msg="Accepted connection"               Node=Local  addr=127.0.0.1:43424
t=2019-01-31T23:24:03+0000 lvl=dbug msg="RLPx listener is waiting"          Node=Local
[DEBUG] Server.run:encHandshakeChecks()
t=2019-01-31T23:24:03+0000 lvl=dbug msg="Adding p2p peer"                   Node=Local  name= addr=127.0.0.1:56830 peers=1
t=2019-01-31T23:24:03+0000 lvl=dbug msg=ShadowNetwork.Run()                 Node=Local  msg="Starting protocol handler"         Local=0201687c0a Remote=0894731
[DEBUG] peers after adding new node map[0894731ea016f4a67a5a22bf8d5988d7939d21393363da0cced2d490f08058c2:Peer 0894731ea016f4a6 127.0.0.1
t=2019-01-31T23:24:03+0000 lvl=trce msg="connection set up"                 Node=Local  id=0894731ea016f4a67a5a22bf8d5988d7939d21393363da0cced2d490f08058c2 add
t=2019-01-31T23:24:03+0000 lvl=trce msg="Dial task done"                    Node=Local  task="dyndial 0894731ea016f4a6 127.0.0.1:56830"

// Both remote and local now have set up their connections. Good right?

// But wait... here's where the problems seem to occur. Local starts the posthandshake process again. This fails because remote was already added to the peer[] map up above.

[DEBUG] Waiting for posthandshake...
[DEBUG] Server.run:PostHandshake handler
[DEBUG] Server.run:encHandshakeChecks()
[DEBUG] Server.run:encHandshakeChecks() error: Already connected. peers map[0894731ea016f4a67a5a22bf8d5988d7939d21393363da0cced2d490f08058c2:Peer 0894731ea016f
[DEBUG] posthandshake completed.
t=2019-01-31T23:24:03+0000 lvl=trce msg="Rejected peer before protocol handshake" Node=Local  id=0894731ea016f4a67a5a22bf8d5988d7939d21393363da0cced2d490f08058c2 err="already connected"
t=2019-01-31T23:24:03+0000 lvl=trce msg="Setting up connection failed"            Node=Local  addr=127.0.0.1:43424 err="already connected"


// Remote does the same thing!
[DEBUG] Waiting for posthandshake...
[DEBUG] Server.run:PostHandshake handler
[DEBUG] Server.run:encHandshakeChecks()
[DEBUG] Server.run:encHandshakeChecks() error: Already connected. peers map[cf766f7cf871db7bee6fabc8d9a0d811ba63dcfac3bbdbe78c7154806afbc3eb:Peer cf766f7cf871d
[DEBUG] posthandshake completed.
t=2019-01-31T23:24:03+0000 lvl=trce msg="Rejected peer before protocol handshake" Node=Remote id=cf766f7cf871db7bee6fabc8d9a0d811ba63dcfac3bbdbe78c7154806afbc3eb err="already connected"
t=2019-01-31T23:24:03+0000 lvl=trce msg="Setting up connection failed"            Node=Remote addr=127.0.0.1:56590 err="already connected"


The end result of this is that we have two nodes which should be connected just fine (and most of the time they are), but one or both of them occasionally drop the connection. This causes our unit tests to break because we expect valid connections.

@WinstonPrivacy
Author

We just noticed something else that may be helpful. We're seeing that in the cases where the connections fail, both nodes have initiated an outbound connection with each other at the same time. The second rejected connection for both is inbound. When this occurs, any subsequent RLPx messages are flushed with an EOF error.

When one node accepts an inbound and the other an outbound connection first, the subsequent RLPx calls go through without being dropped (this is in spite of the fact that there are a pair of connection attempts which are dropped, similar to above).

@WinstonPrivacy
Author

We found a workaround: setting "NoDial" to true on the bootstrap node's server config prevents the race condition, and our unit tests pass.
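
For reference, a sketch of that workaround as a server config (everything other than NoDial is a placeholder):

bootstrapSrv := &p2p.Server{Config: p2p.Config{
	PrivateKey: key, // placeholder key; remaining values are placeholders too
	MaxPeers:   10,
	ListenAddr: ":56830",
	Protocols:  []p2p.Protocol{proto},
	NoDial:     true, // only accept inbound connections; never dial out
}}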

@fjl fjl reopened this Feb 2, 2019
@fjl
Contributor

fjl commented Feb 2, 2019

I think this deserves fixing even if you found a workaround.

@fjl fjl changed the title from "Race condition in p2p/server.go / Unprotected map" to "p2p: simultaneous connect between two nodes fails on both sides" on Feb 2, 2019
@WinstonPrivacy
Author

WinstonPrivacy commented Feb 2, 2019 via email

@holiman
Contributor

holiman commented Dec 12, 2019

@fjl the peers map is created in func (srv *Server) run(dialstate dialer) in p2p/server.go. It seems to have originally been meant for use only by the goroutine dedicated to serving the loop, but it's also shared with the dialState in dial.go. Afaict, there are multiple goroutines accessing the peers map without any form of race prevention.

@fjl
Contributor

fjl commented Dec 12, 2019

There are no concurrent accesses to the peers map because the dialState code is only invoked from Server.run.

@adamschmideg
Contributor

@fjl is working on a new Dialer which may resolve this issue.
