
les: implement request distributor, fix blocking issues #3660

Merged
merged 2 commits into ethereum:master from zsfelfoldi:sync-blocking-fix on Mar 22, 2017

Conversation

8 participants
@zsfelfoldi
Contributor

commented Feb 10, 2017

This PR implements a new, cleaner and more efficient LES request distribution mechanism that also avoids starvation issues by prioritizing requests that have been created earlier. Also, it fixes a minor issue in light.TxPool.
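The "oldest request first" starvation-avoidance rule described above can be sketched as follows. This is a minimal illustration only; `distReq` and `nextRequest` here are simplified stand-ins, not the actual types in les/distributor.go:

```go
package main

import (
	"fmt"
	"sort"
)

// distReq stands in for a queued LES request; reqOrder is its creation order.
type distReq struct {
	reqOrder uint64
	canSend  bool // whether some peer currently has capacity for it
}

// nextRequest returns the oldest request that can currently be sent,
// mirroring the rule that earlier-created requests get priority.
func nextRequest(queue []*distReq) *distReq {
	sort.Slice(queue, func(i, j int) bool { return queue[i].reqOrder < queue[j].reqOrder })
	for _, r := range queue {
		if r.canSend {
			return r
		}
	}
	return nil // nothing is sendable right now
}

func main() {
	queue := []*distReq{
		{reqOrder: 3, canSend: true},
		{reqOrder: 1, canSend: false},
		{reqOrder: 2, canSend: true},
	}
	// The oldest sendable request wins, even though an older one is queued.
	fmt.Println(nextRequest(queue).reqOrder)
}
```

The point of the ordering is that no request waits indefinitely while newer ones keep being served, which is the starvation issue this PR addresses.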

@mention-bot


commented Feb 10, 2017

@zsfelfoldi, thanks for your PR! By analyzing the history of the files in this pull request, we identified @obscuren, @fjl and @bas-vk to be potential reviewers.

@mchusovlianov

Contributor

commented Feb 10, 2017

I've tested this PR a bit (I just ran ./build/bin/geth --light --testnet console and tried to dump the eth object by typing eth ENTER):
First issue:
(screenshot)
Second issue:
(screenshot)

@zsfelfoldi

Contributor Author

commented Feb 10, 2017

thanks for the bug report, I will fix it soon.

@zsfelfoldi zsfelfoldi force-pushed the zsfelfoldi:sync-blocking-fix branch Feb 10, 2017

@zsfelfoldi

Contributor Author

commented Feb 10, 2017

fixed

@mchusovlianov

Contributor

commented Feb 10, 2017

It looks like there is a deadlock now:
(screenshot)

(screenshot)

@zsfelfoldi

Contributor Author

commented Feb 10, 2017

Not a deadlock, the server was disconnected. Did it happen because of your 'eth' command?

@farazdagi

Contributor

commented Feb 10, 2017

  • I can confirm the same: if you type 'eth' repeatedly ('eth, eth, .., eth' N times), at some (not very high) N everything blocks and nothing happens; it smells a lot like a deadlock
  • the same rollback is reported in my console too:
    (screenshot)
@zsfelfoldi

Contributor Author

commented Feb 10, 2017

Does it say "Rolled back 2048 headers..." like in the picture above? That means a server disconnection during syncing. Which is still bad if an 'eth' command causes it, but it is a different issue.

@farazdagi

Contributor

commented Feb 10, 2017

  • yep, it always (at least in my anecdotal experience) happens in this sequence: "eth" strings start to pile up, then the "rolled back" message appears, and while I can input anything, no response from the server ever comes
  • Additionally, when I had another console attached in a separate window, it also locked up when the "eth DOS" happened in the first window. I wanted that separate window in order to watch the peers (to investigate why the server disconnect occurs).
@zsfelfoldi

Contributor Author

commented Feb 10, 2017

Please try it again... there is still some weird issue when resuming a cancelled sync (after a header rollback), I will debug that later, but the really nasty bug that caused the blocking has been found.

@farazdagi

Contributor

commented Feb 10, 2017

Thank you! The issue seems to be resolved - at least I couldn't overwhelm the console with "eth" calls. Rollbacks still happen, yet the sync resumes and a full sync eventually completes.

So, all in all, this nasty blocking bug has been resolved.

@farazdagi

Contributor

commented Feb 11, 2017

I've done more tests; unfortunately I got the following panic:
(screenshot)

@zsfelfoldi

Contributor Author

commented Feb 11, 2017

Thanks for catching it, fixed. Now I am carefully checking all my locks, callbacks and goroutines :)

@farazdagi farazdagi referenced this pull request Feb 11, 2017

Closed

Fix LES blocking on sync #107

@zsfelfoldi zsfelfoldi force-pushed the zsfelfoldi:sync-blocking-fix branch to d61ed3f Feb 13, 2017

@zsfelfoldi zsfelfoldi added pr:review and removed in progress labels Feb 13, 2017

@karalabe karalabe modified the milestones: 1.5.10, 1.5.9 Feb 13, 2017

@zsfelfoldi zsfelfoldi force-pushed the zsfelfoldi:sync-blocking-fix branch from 57e550e Feb 14, 2017

@zsfelfoldi zsfelfoldi removed the in progress label Feb 14, 2017

@zsfelfoldi

Contributor Author

commented Feb 14, 2017

also fixed a bug that caused flow control errors and random peer disconnections.

@farazdagi

Contributor

commented Feb 15, 2017

While the overall blocking is indeed gone, there's one minor remaining issue: a delay, rather than a block.

I've done the following experiment:

  • wiped out chaindata (so that the test starts from the trusted CHT and needs to sync all the way from it)
  • sent a couple of transactions
    • sometimes those transactions validate just fine even if the actual nonce (in the last block) is higher; it can happen that when the tx is sent, the chain state doesn't yet know about those higher nonces
    • curiously enough, the first transaction proceeds fine (of course it never gets included in the blockchain, as peers deny it)
    • however, the second takes much longer to complete (minutes instead of seconds)

I've tracked the problem to https://github.com/ethereum/go-ethereum/blob/master/light/txpool.go#L303:

func (pool *TxPool) eventLoop() {
	for ev := range pool.events.Chan() {
		switch ev.Data.(type) {
		case core.ChainHeadEvent:
			pool.mu.Lock()
			ctx, _ := context.WithTimeout(context.Background(), blockCheckTimeout)
			head := pool.chain.CurrentHeader()
			txc, _ := pool.setNewHead(ctx, head)
			m, r := txc.getLists()
			pool.relay.NewHead(pool.head, m, r)
			pool.homestead = pool.config.IsHomestead(head.Number)
			pool.signer = types.MakeSigner(pool.config, head.Number)
			pool.mu.Unlock()
			// delay is solved if I add time.Sleep(0) or runtime.Gosched()
		}
	}
}

So eventLoop acquires the lock, and preemption seems to get greatly delayed, i.e. another goroutine was starving while trying to acquire the same lock in txpool.Add(), hence the delay.

Once I added runtime.Gosched() to allow manual preemption and context switching, no more delays occurred.
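The starvation pattern described here can be sketched in isolation. This is an illustrative toy only (`runEventLoop` merely mimics the shape of the loop above; it is not geth code):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runEventLoop mimics the tight ChainHeadEvent loop above: it takes the
// mutex `iters` times in a row, and yields after each unlock so that a
// competing goroutine (standing in for txpool.Add) is not starved.
func runEventLoop(iters int) int {
	var mu sync.Mutex
	done := make(chan struct{})
	added := 0

	// This goroutine plays the role of a blocked txpool.Add caller.
	go func() {
		mu.Lock() // waits until the loop lets go of the lock
		added++
		mu.Unlock()
		close(done)
	}()

	for i := 0; i < iters; i++ {
		mu.Lock()
		// ... process a ChainHeadEvent while holding the pool mutex ...
		mu.Unlock()
		runtime.Gosched() // give waiting goroutines a chance to run
	}
	<-done
	return added
}

func main() {
	fmt.Println("added:", runEventLoop(1000))
}
```

Without the yield, a hot lock/unlock loop can repeatedly reacquire the mutex before the waiter is scheduled; `runtime.Gosched()` hands the processor back so the waiter can get in.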

@zsfelfoldi zsfelfoldi removed the pr:review label Feb 20, 2017

@zsfelfoldi

Contributor Author

commented Feb 20, 2017

@farazdagi thanks for reporting these issues, I've switched the status of the PR back to "in progress" until I address them.

@zsfelfoldi zsfelfoldi force-pushed the zsfelfoldi:sync-blocking-fix branch Mar 3, 2017

@zsfelfoldi zsfelfoldi force-pushed the zsfelfoldi:sync-blocking-fix branch to 3d221cc Mar 4, 2017

@zsfelfoldi zsfelfoldi added pr:review and removed in progress labels Mar 4, 2017

@zsfelfoldi

Contributor Author

commented Mar 4, 2017

@farazdagi I have investigated the mentioned issues and found the following:

  • if you send a transaction before syncing up, the light transaction pool kills performance by downloading each block until it finds the sent transaction. It also wastes some CPU while syncing, even if no tx has been sent. This problem will be fixed by a subsequent PR.
  • serving data from the local db before syncing has started DOES work if everything is there, or at least I could not reproduce a scenario where it does not. The comments above the Retrieve function were outdated; db checking is done in the access functions found in light/odr_util.go.
  • the commit c57c54c you mentioned only affected server mode and has since been rolled back, so it cannot be related to these problems.

Hopefully the remaining performance and responsiveness issues will be fixed by the new (and actually simpler) light tx pool which will (again, hopefully) be included in the 1.6 release.

@farazdagi

Contributor

commented Mar 5, 2017

@zsfelfoldi thank you for providing more details. I've mentioned this on Gitter, copying it here:

in my tests #3660 still doesn't fix the blocking. I did the following: 1. reset chaindata 2. start the node 3. attach to that node in another window 4. try eth.getBalance() and eth.estimateGas() 5. blocking/lags occur.

So, unfortunately, for us these remaining performance and responsiveness issues were show stoppers (as on mobile LES is already understandably slower than on desktop). What we did was simply exclude any updates from c57c54c onwards: I literally cherry-picked every commit except the ones related to LES, and we now have a non-blocking and almost up-to-date Geth (everything is up to date except for LES).

I am not sure how c57c54c is unrelated to the issue: I see no blocking before that commit, and I see no blocking if I update everything else but LES. It is possible that the commit has been reverted and then something else re-introduced the issue; I am not sure.

At the moment we have our hackish way of keeping an updated Geth with an outdated LES 😞. Once I have more time, I will look further into the issue, as the original issue I reported (#3639) is definitely not fixed by this PR, so hopefully it gets addressed down the road.

@fjl
Contributor

left a comment

We did an interactive review.

les/distributor.go Outdated
waitBefore(uint64) (time.Duration, float64)
canSendOrdered() bool
orderedSend(f func())
}

fjl Mar 18, 2017

Contributor

As discussed, please remove the need for 'ordered sends' by updating flow control counters when the request is actually sent.

zsfelfoldi Mar 18, 2017

Author Contributor

I realized that I'd really like to keep the ordered send, because the order actually affects performance: the optimal thing is to keep the queue priority and always send the oldest requests first. I know this is a minor nuance, but with the recent additions to the protocol there are cases where the order of reply messages matters too, so I believe in the long run this is the cleanest solution. So I made the ordered send a general thing and put it into the common package.
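An ordered execution queue of the kind discussed here can be sketched as follows. This is a hypothetical minimal version; the type and constructor names are illustrative and do not match geth's actual API:

```go
package main

import "fmt"

// execQueue serializes submitted functions: a single worker goroutine runs
// them one at a time, in the order they were queued, which is exactly the
// property an "ordered send" needs.
type execQueue chan func()

func newExecQueue(capacity int) execQueue {
	q := make(execQueue, capacity)
	go func() {
		for f := range q {
			f() // one at a time, FIFO
		}
	}()
	return q
}

func main() {
	q := newExecQueue(16)
	done := make(chan int, 3)
	for i := 1; i <= 3; i++ {
		i := i
		q <- func() { done <- i }
	}
	for i := 1; i <= 3; i++ {
		fmt.Println(<-done) // submission order preserved
	}
	close(q) // stops the worker goroutine
}
```

Because all sends funnel through one goroutine, replies naturally go out in queue-priority order without extra locking at the call sites.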

les/distributor.go Outdated
case retry := <-d.loopChn:
// the main loop distinguishes two kinds of signals:
// - "retry" is sent to re-check whether previously unsendable requests can now be sent to one of the peers
// - "next" is sent when there are more requests to be processed/sent

fjl Mar 18, 2017

Contributor

Please remove the retry mechanism.

les/distributor.go Outdated
r.queuePrev.queueNext = r
}
}
}

fjl Mar 18, 2017

Contributor

You might be able to remove this part if you use container/list.

les/distributor.go Outdated
getCost: getCost,
canSend: canSend,
request: request,
}

fjl Mar 18, 2017

Contributor

Please change callers to use &distReq{getCost: func(...) uint64 { ... }, ...} instead of calling newDistReq. Doing this will add "labels" to all the func literals.
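A minimal illustration of the suggested style (types simplified; the real distReq fields take peer arguments, and this is not the actual les code):

```go
package main

import "fmt"

// Illustrative stand-in for distReq with func-valued fields.
type distReq struct {
	getCost func(x uint64) uint64
	canSend func() bool
}

func main() {
	// A struct literal with inline func values, instead of a newDistReq
	// constructor: the compiler assigns each literal a named position
	// (e.g. main.main.func1), so stack traces and profiles become readable.
	r := &distReq{
		getCost: func(x uint64) uint64 { return x * 2 },
		canSend: func() bool { return true },
	}
	fmt.Println(r.getCost(21), r.canSend())
}
```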

les/distributor.go Outdated
}
if bestReq == nil {
} else {
}

fjl Mar 18, 2017

Contributor

You don't need this empty condition anymore ;)

les/distributor.go Outdated
if send != nil {
peer.orderedSend(send)
}
chn <- peer

fjl Mar 18, 2017

Contributor
req.sentChn <- peer
close(req.sentChn)

zsfelfoldi Mar 18, 2017

Author Contributor

No; I realized that remove sets sentChn to nil, so there was actually a reason for this.

les/distributor.go Outdated
}
fmt.Println(" order", r.reqOrder, "prev", p)
r = r.queueNext
}

fjl Mar 18, 2017

Contributor

Please remove this select clause.

@GitCop


commented Mar 18, 2017

Thank you for your contribution! Your commits seem to not adhere to the repository coding standards

  • Commit: 4f3a9eea6aff8d15aa7ce328af343e32c6bc653e

  • Commits must be prefixed with the package(s) they modify

  • Commit: 73f68d563c1fa1d0e01cedcebbe91726b2ace6de

  • Commits must be prefixed with the package(s) they modify

  • Commit: 04067f3fffb9a0ef4bf36604ec197be79cb8c337

  • Commits must be prefixed with the package(s) they modify

  • Commit: 9a82a89c338754e1b5bd0d3542ca9499ba473d7b

  • Commits must be prefixed with the package(s) they modify

Please check the contribution guidelines for more details.


This message was auto-generated by https://gitcop.com

@zsfelfoldi

Contributor Author

commented Mar 18, 2017

@fjl I did the requested changes (except two of them, see my comments).

common/execqueue.go Outdated
// Stop stops the exec queue.
func (q *ExecQueue) Stop() {
atomic.StoreInt32(&q.stop, 1)
}

fjl Mar 22, 2017

Contributor

Please add this as an unexported utility in package les instead. We don't want to add new features to package common anymore.

zsfelfoldi Mar 22, 2017

Author Contributor

done

@zsfelfoldi zsfelfoldi force-pushed the zsfelfoldi:sync-blocking-fix branch Mar 22, 2017

@zsfelfoldi zsfelfoldi force-pushed the zsfelfoldi:sync-blocking-fix branch to 9d363f3 Mar 22, 2017

defer self.chainmu.Unlock()
defer func() {
self.chainmu.Unlock()
time.Sleep(time.Millisecond * 10) // ugly hack; do not hog chain lock in case syncing is CPU-limited by validation

fjl Mar 22, 2017

Contributor

This is indeed ugly and we should remove it. Can you explain why this helps with the lock?

}
go func() {
time.Sleep(wait)
d.loopChn <- struct{}{}

fjl Mar 22, 2017

Contributor

I have a better idea for this, will submit after the merge.

@fjl

fjl approved these changes Mar 22, 2017

@fjl fjl merged commit 525116d into ethereum:master Mar 22, 2017

1 of 3 checks passed

continuous-integration/appveyor/pr AppVeyor build failed
Details
continuous-integration/travis-ci/pr The Travis CI build failed
Details
commit-message-check/gitcop All commit messages are valid
Details