p2p: simultaneous connect between two nodes fails on both sides #17988
Comments
Now that I've realized the peers map is not global, it seems likely that this bug could affect production Ethereum nodes.
I've implemented some code to check the peers map just before updating the existing peer. Unfortunately, this has not fixed the problem. Any suggestions welcome.
Checking in to see if anyone else has encountered this. We are still seeing the same problem.
Can you share the unit test code to reproduce this issue? I've heard other reports of this issue but haven't been able to reproduce it myself.
That's a big ask, as I'd have to isolate not only the code but also the unit tests. It only occurs when multiple servers are started in multiple tests; if those same unit tests are run in isolation, the error never occurs. Let me see what I can do.
Short of trying to isolate this, I thought I would share some logging data to see if it might help pinpoint things. In particular, I am seeing some weird behavior which does not happen on every run. It looks like the nodes are trying to connect to each other twice. The second connection fails because the first was already made, and this drops the peers. To illustrate this, I've annotated some logging code. We start by spinning up two nodes (call them "remote" and "local"). They start just fine, and "local" bootstraps to "remote" without any problems. Note that the log below includes some extra output that we've added:
The end result of this is that we have two nodes which should be connected just fine (and most of the time they are), but one or both of them occasionally drop the connection. This causes our unit tests to break because we expect valid connections.
We just noticed something else that may be helpful. In the cases where the connections fail, both nodes have initiated an outbound connection to each other at the same time; for both of them, the rejected second connection is the inbound one. When this occurs, any subsequent RLPx messages are flushed with an EOF error. When one node accepts an inbound connection first and the other an outbound connection first, the subsequent RLPx calls go through without being dropped (even though a pair of connection attempts is still dropped, similar to the above).
We found a workaround: setting NoDial to true on the bootstrap node's server config prevents the race condition, and our unit tests pass.
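For reference, here is a hedged sketch of that configuration (NoDial is a real field in go-ethereum's p2p.Config; the surrounding setup is illustrative, not our actual test code):

```go
package bootstrap

import (
	"crypto/ecdsa"

	"github.com/ethereum/go-ethereum/p2p"
)

// newBootstrapServer builds a server that only accepts inbound connections.
// With NoDial set, the bootstrap node never initiates a dial, so the two
// sides cannot open connections to each other simultaneously.
func newBootstrapServer(key *ecdsa.PrivateKey, protos []p2p.Protocol) *p2p.Server {
	return &p2p.Server{Config: p2p.Config{
		PrivateKey: key,
		MaxPeers:   16,
		ListenAddr: "127.0.0.1:0",
		Protocols:  protos,
		NoDial:     true, // accept inbound only; never dial out
	}}
}
```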
I think this deserves fixing even if you found a workaround. |
I agree. I've tried several times but haven't been able to get to the bottom of it. It seems pretty unlikely to occur in production, but as it does cause havoc with unit testing, it's worthwhile.
@fjl the …
There are no concurrent accesses to the peers map because the …
@fjl is working on a new Dialer which may resolve this issue.
System information
Geth version: N/A
OS & Version: Linux
Commit hash: 16e4d0e
Expected behaviour
The bug is reproduced in unit tests. One peer node is started with the bootstrap ID of an already-running node. Both nodes have the same RLPx protocol defined. The nodes are supposed to (and usually do) connect and agree on the RLPx protocol.
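As a rough illustration (not our actual test code), the setup looks something like this; names follow go-ethereum's p2p package, though details such as the Self() return type vary by version:

```go
package main

import (
	"github.com/ethereum/go-ethereum/crypto"
	"github.com/ethereum/go-ethereum/p2p"
	"github.com/ethereum/go-ethereum/p2p/enode"
)

func main() {
	// A trivial shared protocol; Run just holds the connection open.
	proto := p2p.Protocol{
		Name:    "test",
		Version: 1,
		Length:  1,
		Run: func(p *p2p.Peer, rw p2p.MsgReadWriter) error {
			select {}
		},
	}

	key1, _ := crypto.GenerateKey()
	remote := &p2p.Server{Config: p2p.Config{
		PrivateKey: key1,
		MaxPeers:   8,
		ListenAddr: "127.0.0.1:0",
		Protocols:  []p2p.Protocol{proto},
	}}
	_ = remote.Start()

	key2, _ := crypto.GenerateKey()
	local := &p2p.Server{Config: p2p.Config{
		PrivateKey: key2,
		MaxPeers:   8,
		ListenAddr: "127.0.0.1:0",
		Protocols:  []p2p.Protocol{proto},
		// Seed "local" with the already-running "remote" node.
		BootstrapNodes: []*enode.Node{remote.Self()},
	}}
	_ = local.Start()

	select {} // test assertions on PeerCount() would go here instead
}
```

In a hermetic test, StaticNodes can be used instead of BootstrapNodes to force a direct dial without running discovery.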
Actual behaviour
Unit tests randomly fail (roughly 10% of the time) when running within a single process (as is usually the case in unit tests). The nodes report a "Rejected peer" error and drop the connection with each other. The actual error returned within server.go's setupConn() is "already connected".
The issue has been traced down to server.go:724:
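A self-contained model of the pattern at that location (hypothetical names, not the actual geth code):

```go
// Hypothetical model of the described race; names are illustrative,
// not go-ethereum's actual code.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errAlreadyConnected = errors.New("already connected")

type Peer struct{ id string }

type Server struct {
	mu    sync.Mutex
	peers map[string]*Peer
}

// setupConn models the early "already connected" check.
func (s *Server) setupConn(id string) error {
	s.mu.Lock()
	dup := s.peers[id] != nil
	s.mu.Unlock()
	if dup {
		return errAlreadyConnected // duplicate quietly dropped: the good case
	}
	// Handshakes run here; a second connection to the same node can pass
	// the check above during this window.
	time.Sleep(time.Millisecond)
	s.addPeer(id)
	return nil
}

// addPeer models the insert at the end of the pipeline. Without re-checking
// the map, a racing duplicate overwrites the existing peer entry.
func (s *Server) addPeer(id string) {
	s.mu.Lock()
	s.peers[id] = &Peer{id: id} // the second writer replaces the first peer
	s.mu.Unlock()
}

func main() {
	s := &Server{peers: make(map[string]*Peer)}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two simultaneous connections from one node
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("setupConn:", s.setupConn("node-a"))
		}()
	}
	wg.Wait() // often both calls print <nil>: both passed the early check
}
```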
This is a classic race condition. What is occurring is that a peer (from the first connection) is in the process of being added to the peers map just as the second connection is being established. If the peers map has been updated before setupConn() reaches the addpeer checkpoint, the connection is quietly dropped and everything works fine. Otherwise, the code above is reached: a new peer is started, and this overwrites the existing entry in the peers map.
If an RLPx protocol's Run function has already been started on the first peer, it may be listening for messages on the first connection. Those reads will then always return an EOF.
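Concretely, a read loop of the shape below sees that EOF directly (the Protocol and Run signatures are go-ethereum's; the loop body is illustrative):

```go
// Illustrative protocol read loop; once the first peer's connection is
// torn down, rw.ReadMsg returns io.EOF here.
var proto = p2p.Protocol{
	Name:    "test",
	Version: 1,
	Length:  1,
	Run: func(p *p2p.Peer, rw p2p.MsgReadWriter) error {
		for {
			msg, err := rw.ReadMsg()
			if err != nil {
				return err // io.EOF when the connection is dropped
			}
			msg.Discard()
		}
	},
}
```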
Possible solution: Check the map again just prior to inserting a new peer. The map should be protected with a mutex to prevent concurrency problems.
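Continuing the hypothetical model above, the guard would look something like this (a sketch of the suggestion, not an actual geth patch):

```go
// Re-check the peers map under the same mutex immediately before the
// insert, so a late duplicate is rejected instead of overwriting.
func (s *Server) addPeerChecked(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.peers[id] != nil {
		return errAlreadyConnected
	}
	s.peers[id] = &Peer{id: id}
	return nil
}
```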
(Edit: realized the peers map is not global, but already scoped to a single server instance.)