Reduce latency of ThreadMessageHandler. (3.2x speedup for IBD to 200k) #9415
There's so much currently wrong with this loop that I'm hesitant to make changes before rewriting it entirely. I've been discussing my plans with @TheBlueMatt for doing just that after #9289 is merged. Some other fun issues with this loop (I'm sure I'm leaving several out):
I have no objections to slapping some band-aids on here, but I'm really really hoping to address all of those at once in my next PR which moves this handler into net_processing. As a bonus, it should simplify @TheBlueMatt's multi-threaded processing.
> messageHandlerCondition wakes spuriously

who cares? :) It's not like this loop, or the processing that results from random wakes, ever shows up in profiles. If it were a real concern it would be easy to go back to sleep if the boolean hasn't been flipped and not enough time has passed.

> but I'm really really hoping to address all of those at once in my next PR which moves this handler into net_processing.

My guess is that we may be a bit close to the end of 0.14 to achieve much more in the way of big changes, especially ones with difficult-to-test concurrency implications. I just don't want efforts to make big sweeping changes to stand in the way of meaningful improvement-- we probably should have made the changes in this PR a year ago. I'm reviewing #9289-- hopefully we can get that in soon.
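The "go back to sleep unless the boolean was flipped" guard described above is the standard predicate form of a condition-variable wait, which also makes spurious wakeups harmless. A minimal sketch, with hypothetical names standing in for the handler's actual members:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-ins for the message handler's wake flag and lock.
std::condition_variable messageHandlerCondition;
std::mutex mutexMsgProc;
bool fMsgProcWake = false;

// Returns true if work was signalled, false on the 100ms timeout.
// The predicate means a spurious wake just re-checks the flag and
// goes straight back to waiting; no work runs by accident.
bool WaitForWork()
{
    std::unique_lock<std::mutex> lock(mutexMsgProc);
    bool fWoken = messageHandlerCondition.wait_for(
        lock, std::chrono::milliseconds(100), [] { return fMsgProcWake; });
    fMsgProcWake = false;
    return fWoken;
}

// Called from another thread, e.g. when a new message or block arrives.
void WakeMessageHandler()
{
    {
        std::lock_guard<std::mutex> lock(mutexMsgProc);
        fMsgProcWake = true;
    }
    messageHandlerCondition.notify_one();
}
```

Waking via the flag rather than a bare `notify_one()` also means a wake that happens while the handler is mid-loop is not lost.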
I'm hoping the next PR isn't huge in the way of complicated locking implications-- more like simplified locking implications compared to master. Let's wait and take a look next week (and if we decide it can't make it we can come back to this one-- I think the potential advantage in initial from-network sync time from "doing it right" might be worth the extra review/testing implications).
My concern was more so that some platforms (win32) may vary significantly in this wakeup behavior. Completely agree with you, though. I certainly wasn't suggesting that we not take the improvements here for the sake of something coming later; in fact, the wake from the miner is necessary either way. I was just pointing out that I'm hoping to be able to delete most of this code soon anyway :)
Mystery: this slows the IBD of the first 200k blocks from a local gigabit-ethernet-connected peer by about 40%. I've run the test 4 times each way, same result. Maybe if we solve that, we make IBD faster.
I bisected my patch down to the cs_vSend/cs_vRecvMsg contention (which causes fsleep=false) being the cause of the IBD performance regression.
Ugh, yeah, those desperately need splitting. I know @theuni has been working on a patch to do that; see also c214d12.
I tested reducing the message handler's hold of the cs_vRecvMsg lock to a window just narrow enough to std::move the message out, and sped up IBD to 200k from a single peer by more than a factor of two over master.
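The narrowed critical section can be sketched like this (simplified stand-ins for the real CNetMessage and per-node queue; not the PR's actual code):

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Simplified stand-in for the per-node receive queue entry.
struct NetMessage { std::string data; };

std::mutex cs_vRecvMsg;            // guards vRecvMsg
std::deque<NetMessage> vRecvMsg;   // filled by the socket thread

// Hold cs_vRecvMsg only long enough to move one message out. The
// (potentially slow) processing of the message then runs with no lock
// held, so the socket thread is never blocked behind validation.
bool PopMessage(NetMessage& msg)
{
    std::lock_guard<std::mutex> lock(cs_vRecvMsg);
    if (vRecvMsg.empty()) return false;
    msg = std::move(vRecvMsg.front());  // cheap move, not a copy
    vRecvMsg.pop_front();
    return true;
}
```

Holding the lock across the whole ProcessMessages call, by contrast, leaves the socket thread contending on cs_vRecvMsg for the full duration of message processing.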
Instead of using a fixed 100ms sleep in the condition variable, this changes it to sleep until 100ms after the loop iteration started, so if the loop takes a long time (e.g. due to waiting on the vnodes lock at the end) it will restart sooner.
Avoids a potential 100ms delay before announcing a new block that arrived via submitblock.
…e work. This prevents delays in processing messages for early peers when they arrive part way through the message handler loop. This also reruns the message handler when some peers don't finish sendmessages due to contention for cs_main.
Sync to 200k with master: 243.7 blocks per second. (Chainstate reindex is about 1000/s, so there is still perhaps some room for improvement.)
This also eliminates a race on nSendSize by making it atomic. It speeds up IBD to 200k from a single peer over GbE by about 200% compared to master, by eliminating the cs_vRecvMsg contention (which the prior commits happened to make worse).
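Making nSendSize atomic lets the two threads that touch it update and read it without sharing a lock. A sketch of the pattern (the member name is from the PR; the helper functions are illustrative, not the real code):

```cpp
#include <atomic>
#include <cstddef>

// Incremented by the message handler when a message is queued for
// send, decremented by the socket thread as bytes go out, and read by
// both without a lock -- so it must be atomic to avoid a data race.
std::atomic<size_t> nSendSize{0};

void OnMessageQueued(size_t nBytes)
{
    nSendSize.fetch_add(nBytes, std::memory_order_relaxed);
}

void OnBytesSent(size_t nBytes)
{
    nSendSize.fetch_sub(nBytes, std::memory_order_relaxed);
}
```

Relaxed ordering suffices for a plain counter like this; any decision taken on the value (e.g. whether to keep filling the send buffer) is only a heuristic, not a synchronization point.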
Looks like we’re stepping on each other :) We’re definitely working to solve the same issue, though it seems we disagree on the approach.

This is less flexible for a clean separation of layers, as it requires constant two-way communication between the net layer and the message processor: the net must notify the processor that new messages are available, then the processor must grab the messages from the net. The more typical approach (what I’ve seen, anyway) is a one-way model where the net throws its messages to the processor, which is forced to consume them. That greatly reduces the risk of the processor throttling the network. It also makes things much easier as processing becomes multi-threaded, because the processor maintains the message queue, as opposed to having each new thread contend with the net lock. So the net side doesn’t care what happens once it’s handed off a message; it only has to have a means of being corked if the processor is at capacity. The last abstraction step (moving processing out of CConnman, as seen in a quick/dirty attempt here) relies on this one-way model.

All that to say: I prefer the approach in #9441, which yields the same speedups :)
I don't follow your two-way comments. There is communication only one way here: from ThreadSocketHandler to ThreadMessageHandler for receive, and the reverse for sending. The processing cannot itself throttle the network (except in terms of the input rate exceeding the processing rate): assuming we haven't hit the GetReceiveFloodSize, the only time message processing can stall reception is while holding the per-peer lock during a simple move. The network does let it know new things are available-- allowing it to sleep when there is nothing to do.

If messages were to be handled concurrently, it would be no finer grained than per-peer+per-direction in any case, to preserve ordering; which is the granularity of the locks in this configuration. Though message handling takes cs_main for effectively everything, so at the moment there wouldn't really be any gain in making it concurrent.

Perhaps you're referring to the WakeMessageHandler calls in ProcessMessages? Those are communication from ThreadMessageHandler to itself-- just reusing the mechanism so that the handler doesn't sleep when there is more work to be done. It could more cleanly be done with a return argument, but the signaling mechanism already exists.
Sure takes a lot more code to do it that way. The purpose of its architecture is unclear to me; I don't see what we're gaining for the extra complexity.
> ThreadSocketHandler notifies ThreadMessageHandler that it has a new message.
Eh? It's essentially two changes. The first changes it to use a return value rather than signaling (as you suggested above): 1d232f4. Maybe you're referring to the boost commits that it's built on? That's only because they touch so much of the same code that I didn't want to have to keep up with rebasing.
I don't think it is-- e.g. we need to stop reading data off the socket if we have too much queued data, otherwise the memory usage is unbounded. I don't think going around and lazily asking it to stop is sufficient, as AFAIK we have no guarantee about how much data we can receive between the times the message handler makes its way back around to notice that the queue is too large. E.g. with cold caches it can take 5 seconds to validate a block, and on a gigabit link you could receive >500MB of data in that time.

It looks to me like your code handles this-- by having the socket side stay aware of the queue usage. So it's functionally fine, though you end up duplicating the size-management code in both message handling and socket handling.

I don't think that "belongs" is even the right way to think about this. The queue "belongs" to the node-- as it's the destruction of the node that ultimately destroys the queue. It is accessed by the socket thread and the message handling thread, for their respective operations.
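The bound described above has to be enforced on the socket side, before more data is read. A sketch of that check, assuming a hypothetical per-peer byte counter maintained at enqueue/dequeue time (not the PR's code):

```cpp
#include <cstddef>

// Roughly what -maxreceivebuffer configures: the most unprocessed
// receive data we are willing to queue for a peer before we stop
// reading from its socket.
const size_t nReceiveFloodSize = 5 * 1000 * 1000;

// Hypothetical counter: bumped by the socket thread as complete
// messages are queued, reduced by the message handler as they are
// consumed. Shown as a plain size_t for brevity; real code would need
// it atomic or lock-protected.
size_t nRecvQueuedBytes = 0;

// Consulted by the socket thread before selecting the socket for
// read, so memory stays bounded even if validation stalls for seconds.
bool ShouldPauseRecv()
{
    return nRecvQueuedBytes >= nReceiveFloodSize;
}
```

Checking eagerly at read time, rather than having the message handler "lazily ask" the socket thread to stop, is what makes the bound hard rather than advisory.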
Nor do I. With each message, the message handler immediately notifies the socket handler whether it should cork. There's no lag there, it's done atomically with the message hand-off: https://github.com/bitcoin/bitcoin/pull/9441/files#diff-eff7adeaec73a769788bb78858815c91R2463 In the refactored code, the cork is the return value of the "take my message" function: theuni@1a6b10a#diff-9a82240fe7dfe86564178691cc57f2f1R1255
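The one-way hand-off with the cork carried in the return value can be sketched as follows (a hypothetical function illustrating the pattern in the linked commits, not a copy of them):

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

struct NetMessage { std::string data; };

const size_t nFloodLimit = 5 * 1000 * 1000;
std::deque<NetMessage> processQueue;  // owned by the processor side
size_t nQueuedBytes = 0;

// The net layer calls this to hand a message off. The return value is
// the cork signal: false tells the caller to stop reading from this
// peer's socket until the processor has drained the queue.
bool ReceiveMessage(NetMessage&& msg)
{
    nQueuedBytes += msg.data.size();
    processQueue.push_back(std::move(msg));
    return nQueuedBytes < nFloodLimit;
}
```

Because the cork decision is made atomically with the hand-off, there is no window in which the net layer keeps reading after the processor has hit capacity.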
Disagreed. These are two separate layers. The queue could easily outlive the node. For example, during IBD a node may send 10 blocks and immediately disconnect; there's no requirement that we destroy the queue before deleting the node.
Instead of using a 100ms sleep in the condition variable, this changes
it to sleep for 100ms after the loop started, so if the loop takes a
long time (e.g. due to waiting on the vnodes lock at the end) it will
restart sooner.
Also avoids a potential 100ms delay before announcing a new block that
arrived via submitblock, prevents delays in processing messages for
early peers when they arrive part way through the message handler loop,
and reruns the message handler when some peers don't finish sendmessages
due to contention for cs_main.
The final commit changes the message handling to only hold cs_vRecvMsg long
enough to move out the message. This speeds up IBD to 200k from a single
peer over GbE compared to master by about 200%, by eliminating the cs_vRecvMsg
contention (which the prior commits happened to also make worse). With this
change on both sides the speedup is ~3.2x.
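The first commit's deadline change amounts to computing the wakeup time at the top of the loop rather than at the wait. A sketch with hypothetical names (the real loop iterates vNodes calling ProcessMessages/SendMessages):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

std::condition_variable cond;
std::mutex m;
bool fWake = false;

// Compute the deadline when the iteration starts, so time spent doing
// work counts against the 100ms instead of being added on top of it.
// A slow iteration (e.g. blocked on the vnodes lock) is then followed
// by an immediate restart rather than a further 100ms sleep.
void MessageHandlerLoopOnce()
{
    auto deadline = std::chrono::steady_clock::now()
                  + std::chrono::milliseconds(100);

    // ... ProcessMessages()/SendMessages() for each node ...

    std::unique_lock<std::mutex> lock(m);
    cond.wait_until(lock, deadline, [] { return fWake; });
    fWake = false;
}
```

With a plain `wait_for(100ms)` at the bottom, by contrast, every iteration pays its full processing time plus a full 100ms sleep.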