Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cherry pick new dial scheduler from upstream #1192

Merged
merged 10 commits into from
Oct 27, 2020

Conversation

oneeman
Copy link
Contributor

@oneeman oneeman commented Oct 21, 2020

Description

Cherry pick the new dial scheduler from upstream. Also cherry-picks ethereum/go-ethereum#20688 because it is a closely-related follow-up to the dial scheduler.

Marked with "do not merge" until I add another commit, as explained below.

Merge notes:

  • Upstream made static dials respect the max dialed connections limit. This means that if there are a lot of dialed connections (static or dynamic), we are no longer guaranteed that it would try to connect to all the nodes with explicit static purpose set. This is not desirable for us, since we rely on such a guarantee for the connections between validators/proxies and other validators/proxies. In other words, for us static dial connections being exempt from the max dialed connection limit is more of a feature than a bug. I will add a commit to change that behavior back (while still fixing Validators/proxies initiate new dyndial connections even when they're at the max peer limit #1180), so I am adding the "do not merge" label to this PR until then.
  • Upstream removes the static map from server.go. Since we use it for purposes rather than just boolean, decided to keep it there rather than push the purposes map down into dial.go
  • Upstream made srv.RemovePeer() block until the peer is completely disconnected, using doPeerOp(). Since we need to access static in order to check and modify the node's purposes and decide whether to disconnect it or not, we can't do with doPeerOp and instead have to do it in the main thread. So I added a channel called removestaticDone to let us know when the peer has been disconnected fully (using the same method that they used, but in a different place now)
  • Upstream changed one of the conditions for the max peers check, so that instead of being bypassed if the peer is trusted, it's bypassed if the peer is trusted or it's a static dialed connection (i.e. it now allows non-trusted static dial connections, but not non-trusted incoming connections from nodes which happen to also be listed in static). We do the checks elsewhere, and we check using info.Network.Trusted and info.Network.Static, so we're already doing it the way they changed it to be.

Tested

  • Unit tests passing
  • e2e sync test passes
  • Running a node to sync with mainnet, it connects to peers and starts syncing, whether in full mode or in lightest mode.

Related issues

Backwards compatibility

This only modifies how the node starts new connections to other nodes, and doesn't affect the protocols used between nodes, so there aren't backwards compatibility issues.

…20592)

Conflicts:
  p2p/dial.go
  p2p/server.go
  p2p/server_test.go

* p2p: new dial scheduler

This change replaces the peer-to-peer dial scheduler with a new and
improved implementation. The new code is better than the previous
implementation in two key aspects:

- The time between discovery of a node and dialing that node is
  significantly lower in the new version. The old dialState kept
  a buffer of nodes and launched a task to refill it whenever the buffer
  became empty. This worked well with the discovery interface we used to
  have, but doesn't really work with the new iterator-based discovery
  API.

- Selection of static dial candidates (created by Server.AddPeer or
  through static-nodes.json) performs much better for large amounts of
  static peers. Connections to static nodes are now limited like dynanic
  dials and can no longer overstep MaxPeers or the dial ratio.

* p2p/simulations/adapters: adapt to new NodeDialer interface

* p2p: re-add check for self in checkDial

* p2p: remove peersetCh

* p2p: allow static dials when discovery is disabled

* p2p: add test for dialScheduler.removeStatic

* p2p: remove blank line

* p2p: fix documentation of maxDialPeers

* p2p: change "ok" to "added" in static node log

* p2p: improve dialTask docs

Also increase log level for "Can't resolve node"

* p2p: ensure dial resolver is truly nil without discovery

* p2p: add "looking for peers" log message

* p2p: clean up Server.run comments

* p2p: fix maxDialedConns for maxpeers < dialRatio

Always allocate at least one dial slot unless dialing is disabled using
NoDial or MaxPeers == 0. Most importantly, this fixes MaxPeers == 1 to
dedicate the sole slot to dialing instead of listening.

* p2p: fix RemovePeer to disconnect the peer again

Also make RemovePeer synchronous and add a test.

* p2p: remove "Connection set up" log message

* p2p: clean up connection logging

We previously logged outgoing connection failures up to three times.

- in SetupConn() as "Setting up connection failed addr=..."
- in setupConn() with an error-specific message and "id=... addr=..."
- in dial() as "Dial error task=..."

This commit ensures a single log message is emitted per failure and adds
"id=... addr=... conn=..." everywhere (id= omitted when the ID isn't
known yet).

Also avoid printing a log message when a static dial fails but can't be
resolved because discv4 is disabled. The light client hit this case all
the time, increasing the message count to four lines per failed
connection.

* p2p: document that RemovePeer blocks
Or Neeman added 2 commits October 21, 2020 12:00
Static peers are ones we explicitly want to connect to, and so should not be subject to this limit. This is especially important since connections between validators/proxies and other validators/proxies rely on this being so, and because that's how it was prior to the new dial scheduler from upstream.
@oneeman
Copy link
Contributor Author

oneeman commented Oct 22, 2020

Removing the "do not merge" tag, now that I've added the commit to exempt static dials from the limits. Unfortunately it involved considerable changes to the unit tests, but at the same time that's a good thing in that it indicates the unit tests were testing what they should be.

p2p/server.go Outdated
@@ -833,6 +851,10 @@ func (srv *Server) run(dialstate dialer) {
p.RemovePurpose(purpose)
}
}
if ch == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know all this changes come from master, but:

nit: i think ch == nil is a convoluted way to asking: HasNoPurpose && not in peers[id]. I would try to make it more explicit of what ch == nil actually means (change it to something else).

In fact, i would make ch and sub local variables to the goroutine, and thus reduce their scope. No need to have them as global to the whole function.

Also, i now realize that when removing a static we subscribe to all events from all peers; probably there's not easier way of doing this, but seems like an overkill.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I'll move them and add a boolean called disconnecting that can then be used in the if

As far as the subscription, it does look like there is no way to subscribe to only certain events, so there's not much else we could do.

@oneeman oneeman merged commit 903d1b0 into master Oct 27, 2020
@oneeman oneeman deleted the oneeman/cherry-pick-new-dial-scheduler branch October 27, 2020 22:12
oneeman pushed a commit that referenced this pull request Nov 3, 2020
This increment was moved lower down in the function, but not removed here in the conflict resolution in #1192.
trianglesphere pushed a commit that referenced this pull request Nov 3, 2020
This increment was moved lower down in the function, but not removed here in the conflict resolution in #1192.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validators/proxies initiate new dyndial connections even when they're at the max peer limit
3 participants