les: implement client connection logic #16899

Merged (10 commits) on Aug 14, 2018

Conversation

zsfelfoldi
Contributor

This PR implements les.freeClientPool. It also adds common/mclock.SimulatedClock, which enables time-sensitive tests to run quickly while still producing accurate results, and the les/prque package, a generalised variant of gopkg.in/karalabe/cookiejar.v2/collections/prque that allows different comparison functions and also enables removing elements other than the top one from the queue.

les.freeClientPool implements a client database that limits the connection time of each client and manages accepting/rejecting incoming connections and even kicking out some connected clients. The pool calculates a recent usage time for each known client (a value that increases linearly while the client is connected and decreases exponentially while it is not). Clients with lower recent usage are preferred; unknown nodes have the highest priority. Already connected nodes receive a small bias in their favor in order to avoid accepting and instantly kicking out clients.

Note: the pool can use any string for client identification. Using signature keys for that purpose would not make sense when being known has a negative value for the client. Currently the LES protocol manager uses IP addresses (without the port) to identify clients.
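
For readers unfamiliar with the scheme, here is a minimal sketch of how such a "recent usage" value can be tracked; the type, field names and decay constant are illustrative assumptions, not the actual les.freeClientPool representation (which stores the value logarithmically, see the logOffset discussion below):

```go
package pool

import (
	"math"
	"time"
)

// decayConstant is an assumed time constant for the exponential decay.
const decayConstant = time.Hour

// clientUsage tracks a usage value that grows linearly while the client is
// connected and decays exponentially while it is disconnected.
type clientUsage struct {
	usage      float64 // accumulated usage in seconds
	connected  bool
	lastUpdate time.Time
}

// update brings the usage value up to date at time now.
func (c *clientUsage) update(now time.Time) {
	dt := now.Sub(c.lastUpdate)
	if c.connected {
		c.usage += dt.Seconds() // linear growth while connected
	} else {
		c.usage *= math.Exp(-float64(dt) / float64(decayConstant)) // exponential decay while disconnected
	}
	c.lastUpdate = now
}

// setConnected records a connect or disconnect event for the client.
func (c *clientUsage) setConnected(now time.Time, connected bool) {
	c.update(now)
	c.connected = connected
}
```

Connection candidates would then be compared by this value, with a small bias added for clients that are already connected.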

@zsfelfoldi zsfelfoldi added this to the 1.8.11 milestone Jun 5, 2018
@zsfelfoldi zsfelfoldi modified the milestones: 1.8.11, 1.8.12 Jun 11, 2018
@fjl fjl self-requested a review June 25, 2018 12:23
lastScheduled := 0
for {
	for i := 0; i < goSchedCount; i++ {
		runtime.Gosched()
Contributor

@fjl fjl Jul 3, 2018

This loop is weird. You should never make assumptions about the scheduler, particularly when it comes to sleeping goroutines. You need very good justification to use runtime.Gosched.

An acceptable use of Gosched would be when your goroutine is running a tight loop that cannot be preempted because it contains no function calls or channel synchronization. In such cases, the call to Gosched can ensure that other goroutines will run.
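
A hypothetical illustration of that acceptable case (not code from this PR): CPU-only work in a loop with no function calls or channel operations, yielding explicitly every so often so other goroutines on the same thread can run.

```go
package spin

import "runtime"

// checksum does CPU-only work in a tight loop. On Go runtimes without
// asynchronous preemption such a loop cannot be interrupted, so it yields
// explicitly every ~1M iterations.
func checksum(data []byte) (sum uint64) {
	for i, b := range data {
		sum = sum*31 + uint64(b)
		if i%(1<<20) == 0 {
			runtime.Gosched() // let other goroutines run during long computations
		}
	}
	return sum
}
```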

Contributor Author

I admit this is a hacky solution, but it is only used for testing and it has proved really useful. It is also used in #16490, which would be practically impossible to test properly without a virtual timescale. This is the most reliable and least ugly solution I could come up with, but I am open to suggestions.

Contributor

I understand that you need a virtual clock, my only issue is the call to Gosched. Does the test work without this call?  

Contributor Author

No. We need something to let goroutines run until they stop waiting on something again. A Sleep works too, but it produces less accurate and less reliable/reproducible results, and I'm not sure it is any less ugly.

Contributor

I've verified this now and the tests do pass without the Gosched call, they're just slower.

@@ -281,6 +286,19 @@ func (pm *ProtocolManager) handle(p *peer) error {
		p.Log().Debug("Light Ethereum handshake failed", "err", err)
		return err
	}

	if !pm.lightSync && !p.Peer.Info().Network.Trusted {
		addr, ok := p.RemoteAddr().(*net.TCPAddr)
Contributor

@fjl fjl Jul 3, 2018

Note: net.Addr supports String (it's one of the methods in the interface), so you don't need to check for TCPAddr here.

Contributor Author

That includes the port too. I just want to use the IP address here because it is very easy for the client to change ports.
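
A small hypothetical helper showing the difference (clientID is not from this PR; only net.Addr.String and *net.TCPAddr are standard library behaviour):

```go
package example

import "net"

// clientID returns the key used to identify a client: the bare IP for TCP
// connections, falling back to the full String() form (which includes the
// port) for other address types.
func clientID(addr net.Addr) string {
	if tcp, ok := addr.(*net.TCPAddr); ok {
		return tcp.IP.String() // IP only; the port is easy for a client to change
	}
	return addr.String()
}
```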

Contributor

@fjl fjl left a comment

I'm still trying to wrap my head around this change, sorry for the long review delay.

An observation: You forked prque for two reasons: to have a custom comparison function, and to be able to remove a specific element that might not be at the top. I agree with the latter new feature, but I don't understand the need for a custom comparison function. Here's why: if I'm not mistaken, the order of items in the queue never changes. Client 'usage' is recalculated only on entry to and exit from the queue (anything else would be slightly dangerous because it might break the priority order), and new clients are added only if they have less usage than any connected client. So the actual numbers don't matter much; all you need is a queue.

Please correct me if I'm wrong. Is there any condition under which a client entry might be inserted in the middle of the queue?

@zsfelfoldi
Contributor Author

@fjl " Is there any condition under which a client entry might be inserted in the middle of the queue?" sure, absolutely. There are two states and accordingly, two queues here, one where the "usage" value is linearly increasing and one where it is exponentially decreasing. Clients may go back and forth any time. A client can go to "connected" state when it is slightly better than the worst currently connected one so it can indeed be inserted in the middle of the queue. It can also disconnect at any time and get back in the disconneced queue in any position.

@fjl
Contributor

fjl commented Jul 4, 2018

Thank you for the more detailed explanation. That still leaves me wondering why we need the comparison function. If we kept the priority parameter for Push, you'd just use linUsage or logUsage as priority. How would that be different?

@zsfelfoldi
Contributor Author

@fjl I removed the comparison function and added a comment about the wrap-around priority comparison. I also made logOffset a bit more readable. I tried removing Gosched from the simulated clock and the tests run 20x slower. Also, it will not work for my other use cases, so I have not removed it just yet. I think I have an idea of how to handle this problem better, but I'll have to think about it a bit. I have to leave now for a few hours; I'll try to make simclock nicer later tonight.
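
For context, a sketch of what a wrap-around-safe priority comparison generally looks like; this is illustrative only, not the exact comment or code added in the commit:

```go
package prque

// before reports whether priority a should be served before priority b.
// If priorities are monotonically increasing int64 counters that may
// eventually overflow, comparing the signed difference still orders two
// values correctly as long as they are less than half the int64 range apart.
func before(a, b int64) bool {
	return a-b < 0
}
```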

@zsfelfoldi
Contributor Author

@fjl I added a commit that removes Gosched and uses a "ping channel" instead. It achieves the same effect, but other goroutines unrelated to the mechanism under test do not interfere with the process, so it should be more deterministic.
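
A hedged sketch of the "ping channel" idea (the names below are assumptions for illustration, not the actual common/mclock API): the goroutine under test signals on a channel whenever it is about to block, and the test only advances virtual time once that signal arrives, so unrelated goroutines cannot influence the result.

```go
package simclock

import "time"

// SimulatedClock keeps a virtual time that only advances when the code under
// test reports that it is idle.
type SimulatedClock struct {
	now  time.Duration
	ping chan struct{}
}

func NewSimulatedClock() *SimulatedClock {
	return &SimulatedClock{ping: make(chan struct{}, 1)}
}

// Idle is called by the tested goroutine right before it blocks on something.
func (c *SimulatedClock) Idle() {
	select {
	case c.ping <- struct{}{}:
	default: // a ping is already pending
	}
}

// Run waits for the tested goroutine to report idle, then advances time by d.
func (c *SimulatedClock) Run(d time.Duration) {
	<-c.ping
	c.now += d
}

// Now returns the current virtual time.
func (c *SimulatedClock) Now() time.Duration { return c.now }
```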

@fjl fjl modified the milestones: 1.8.12, 1.8.13 Jul 5, 2018
@zsfelfoldi
Contributor Author

@fjl I am writing a much simpler unit test for the flow control PR and your simclock is completely sufficient for it. It actually only needs Run and Now (like this PR). I am planning another unit test for the new request serving mechanism, which needs Sleep/After too, and I think I can do that as well with your WaitForTimers solution.

@fjl fjl merged commit b2ddb1f into ethereum:master Aug 14, 2018
firmianavan pushed a commit to firmianavan/go-ethereum that referenced this pull request Aug 28, 2018
This PR implements les.freeClientPool. It also adds a simulated clock
in common/mclock, which enables time-sensitive tests to run quickly
and still produce accurate results, and package common/prque which is
a generalised variant of prque that enables removing elements other
than the top one from the queue.

les.freeClientPool implements a client database that limits the
connection time of each client and manages accepting/rejecting
incoming connections and even kicking out some connected clients. The
pool calculates recent usage time for each known client (a value that
increases linearly when the client is connected and decreases
exponentially when not connected). Clients with lower recent usage are
preferred, unknown nodes have the highest priority. Already connected
nodes receive a small bias in their favor in order to avoid accepting
and instantly kicking out clients.

Note: the pool can use any string for client identification. Using
signature keys for that purpose would not make sense when being known
has a negative value for the client. Currently the LES protocol
manager uses IP addresses (without port address) to identify clients.
@nambrot

nambrot commented Sep 4, 2018

@zsfelfoldi I apologize if a merged PR is not the right avenue for this question, but I wasn't sure where else to ask:

I understand the rationale for wanting to use the IP address of the connection to identify clients and reward ones that use fewer resources. However, an advertised use case that we are relying on is the use of the light protocol on devices like smartphones. This change effectively blocks two users on the same network (and thus sharing the same IP address) from connecting to the same full node. I understand that they could always go to a different node, but effectively the number of clients in a network is bounded by the number of full nodes. I was wondering whether this is an explicit trade-off that has been decided on, or whether we can expect a future change here that would spare us from running custom nodes with forked behavior.

nambrot added a commit to celo-org/celo-blockchain that referenced this pull request Sep 11, 2018
Upstream geth uses a client pool to determine which peers to accept
and stay connected to and which to reject. It is supposed to reward
good peers that only stay connected as long as necessary and punish
peers that hog the connection. However, since there is nothing that
can truly uniquely identify clients (enode addresses are trivially
changeable), upstream geth uses IP addresses as a proxy. However, this
breaks light clients with the same IP address, as well as setups like
Kubernetes that appear to geth under the same IP address due to their
networking.

Thus this change is a short-term hack to stabilize connections for us
by doing the lookup by the enode ID. Longer term we need to figure out
a solution to this. Check out celo-org/celo-monorepo#522
and ethereum/go-ethereum#16899 (comment)
nambrot added a commit to celo-org/celo-blockchain that referenced this pull request Sep 11, 2018
* Lookup peers in the client pool by ID instead of by IP address

Upstream geth uses a client pool to determine which peers to accept
and stay connected to and which to reject. It is supposed to reward
good peers that only stay connected as long as necessary and punish
peers that hog the connection. However, since there is nothing that
can truly uniquely identify clients (enode addresses are trivially
changeable), upstream geth uses IP addresses as a proxy. However, this
breaks light clients with the same IP address, as well as setups like
Kubernetes that appear to geth under the same IP address due to their
networking.

Thus this change is a short-term hack to stabilize connections for us
by doing the lookup by the enode ID. Longer term we need to figure out
a solution to this. Check out celo-org/celo-monorepo#522
and ethereum/go-ethereum#16899 (comment)

* Remove old code and use p.id directly