Reduce latency of ThreadMessageHandler. (3.2x speedup for IBD to 200k) #9415

Closed
wants to merge 4 commits into from

Conversation

gmaxwell
Contributor

gmaxwell commented Dec 23, 2016

Instead of using a 100ms sleep in the condition variable, this changes
it to sleep for 100ms after the loop started, so if the loop takes a
long time (e.g. due to waiting on the vnodes lock at the end) it will
restart sooner.
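
To illustrate the timing change, a minimal sketch (the names here are illustrative, not the PR's actual identifiers): the deadline is computed when the iteration starts, so a slow pass through the peers eats into the 100ms rather than adding to it.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex mutexMsgProc;
std::condition_variable condMsgProc;
bool fMsgProcWake = false;  // set by whoever wants the handler to run again

void ThreadMessageHandlerSketch()
{
    while (true) {
        // Compute the deadline when the iteration starts, so a slow pass
        // through the peers shortens the sleep instead of adding to it.
        const auto deadline = std::chrono::steady_clock::now()
                              + std::chrono::milliseconds(100);

        // ... run ProcessMessages()/SendMessages() for each peer here ...

        std::unique_lock<std::mutex> lock(mutexMsgProc);
        condMsgProc.wait_until(lock, deadline, [] { return fMsgProcWake; });
        fMsgProcWake = false;
    }
}
```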

Also avoids a potential 100ms delay before announcing a new block that
arrived via submitblock, prevents delays in processing messages for
early peers when they arrive part way through the message handler loop,
and reruns the message handler when some peers don't finish sendmessages
due to contention for cs_main.

The final commit changes the message handling to only hold cs_vRecvMsg long
enough to move out the message. This speeds up IBD up to 200k from a single
peer over GbE compared to master by about 200%, by eliminating cs_vRecvMsg
contention (that the prior commits happened to also make worse). With this change
on both sides the speedup is ~3.2x.
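
A sketch of the narrowed critical section described in the final commit (simplified types and names, assuming a per-peer receive queue guarded by cs_vRecvMsg): the lock is held only while splicing one message out, and the message is then processed with the lock released.

```cpp
#include <list>
#include <mutex>
#include <vector>

struct CNetMessageSketch { std::vector<unsigned char> vRecv; /* payload */ };

std::mutex cs_vRecvMsg;                 // guards vRecvMsg
std::list<CNetMessageSketch> vRecvMsg;  // per-peer receive queue

bool ProcessOneMessageSketch()
{
    std::list<CNetMessageSketch> msgs;
    {
        // Hold cs_vRecvMsg only long enough to move one message out,
        // rather than across all of message processing.
        std::lock_guard<std::mutex> lock(cs_vRecvMsg);
        if (vRecvMsg.empty()) return false;
        msgs.splice(msgs.begin(), vRecvMsg, vRecvMsg.begin());
    }
    CNetMessageSketch& msg = msgs.front();
    // ... deserialize and handle msg here without holding cs_vRecvMsg,
    // so the socket thread can keep appending newly received data ...
    (void)msg;
    return true;
}
```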

fanquake added the P2P label Dec 23, 2016
@theuni
Member

theuni commented Dec 24, 2016

There's so much currently wrong with this loop that I'm hesitant to make changes before rewriting it entirely. I've been discussing my plans with @TheBlueMatt for doing just that after #9289 is merged.

Some other fun issues with this loop (I'm sure I'm leaving several out):

  • messageHandlerCondition wakes spuriously
  • The depth of vRecvGetData and messages to send aren't taken into consideration
  • cs_vRecvMsg is held for the duration of message processing, leading to insane lock contention

I have no objections to slapping some band-aids on here, but I'm really really hoping to address all of those at once in my next PR which moves this handler into net_processing. As a bonus, it should simplify @TheBlueMatt's multi-threaded processing.

@gmaxwell
Contributor Author

messageHandlerCondition wakes spuriously

who cares? :) It's not like this loop or processing that results from random wakes ever shows up in profiles.

If it were a real concern it would be easy to go back to sleep if the boolean hasn't been flipped and not enough time has passed.
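
For illustration, that "go back to sleep" behavior is what a timed wait loop gives you (equivalent to wait_until with a predicate); names in this sketch are illustrative:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool fWake = false;  // flipped by whoever wants to wake the handler

// Sleep until the flag is flipped or 100ms have passed, going back to
// sleep on spurious wakeups.
void SleepUnlessWokenSketch()
{
    const auto deadline = std::chrono::steady_clock::now()
                          + std::chrono::milliseconds(100);
    std::unique_lock<std::mutex> lock(m);
    while (!fWake) {
        if (cv.wait_until(lock, deadline) == std::cv_status::timeout) break;
    }
    fWake = false;
}
```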

but I'm really really hoping to address all of those at once in my next PR which moves this handler into net_processing.

My guess is that we may be a bit too close to the end of 0.14 to achieve much more in the way of big changes, especially ones with difficult-to-test concurrency implications. I just don't want efforts to make big sweeping changes to stand in the way of meaningful improvement-- we probably should have made the changes in this PR a year ago.

I'm reviewing #9289-- hopefully we can get that in soon.

@TheBlueMatt
Contributor

TheBlueMatt commented Dec 24, 2016 via email

@theuni
Member

theuni commented Dec 25, 2016

who cares? :) It's not like this loop or processing that results from random wakes ever shows up in profiles.

My concern was more that some platforms (win32) may vary significantly in this wakeup behavior.

Completely agree with you though. I certainly wasn't suggesting that we not take the improvements here for the sake of something coming later. In fact, the wake from the miner is necessary either way. I was just pointing out that I'm hoping to be able to delete most of this code soon anyway :)

@gmaxwell
Contributor Author

Mystery: this slows the IBD of the first 200k blocks from a local gigabit ethernet connected peer by about 40%. I've run the test 4 times each way, same result. Maybe if we solve that, we make IBD faster.

@gmaxwell
Contributor Author

I bisected my patch and found that the cs_vSend/cs_vRecvMsg contention causing fSleep=false is the cause of the IBD performance regression.

@TheBlueMatt
Contributor

TheBlueMatt commented Dec 28, 2016 via email

@gmaxwell
Contributor Author

gmaxwell commented Dec 29, 2016

I tested reducing the message handler's hold of the cs_vRecvMsg lock to a window just narrow enough to std::move the message out, and this sped up IBD to 200k from a single peer by more than a factor of two over master.

Instead of using a 100ms sleep in the condition variable, this changes
it to sleep for 100ms after the loop started, so if the loop takes a
long time (e.g. due to waiting on the vnodes lock at the end) it will
restart sooner.

Avoids a potential 100ms delay before announcing a new block that
arrived via submitblock.

…e work.

This prevents delays in processing messages for early peers when they arrive
part way through the message handler loop.

This also reruns the message handler when some peers don't finish sendmessages
due to contention for cs_main.
@gmaxwell
Contributor Author

Sync to 200k with master: 243.7 blocks per second.
With this PR on the receiver: 465.431 blocks per second.
With the PR on both sender and receiver: 782.287 blocks per second (3.2x master's speed).

(Chainstate reindex is about 1000/s so there is still perhaps some room for improvement.)

gmaxwell changed the title from "Reduce latency of ThreadMessageHandler." to "Reduce latency of ThreadMessageHandler. (3.2x speedup for IBD to 200k)" Dec 29, 2016
This also eliminates a race on nSendSize by making it atomic.

This speeds up IBD up to 200k from a single peer over GbE
compared to master by about 200%, by eliminating cs_vRecvMsg
contention (that the prior commits happened to also make worse).
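
A minimal illustration of the race fix described above, with a simplified stand-in for the real CNode: once the counter is std::atomic, one thread can add to it and another can read it without holding a lock.

```cpp
#include <atomic>
#include <cstddef>

// Simplified stand-in for the relevant part of CNode.
struct CNodeSketch {
    // Bytes queued for sending: written when messages are queued and read
    // by the socket thread, so it must be atomic to avoid a data race.
    std::atomic<size_t> nSendSize{0};
};

void AccountForQueuedBytes(CNodeSketch& node, size_t nBytes)
{
    node.nSendSize += nBytes;  // atomic read-modify-write, no lock needed
}

bool SendBufferFull(const CNodeSketch& node, size_t nSendBufferMaxSize)
{
    return node.nSendSize >= nSendBufferMaxSize;  // atomic load
}
```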
@theuni
Member

theuni commented Dec 29, 2016

Looks like we’re stepping on each other :)

We’re definitely working to solve the same issue, though it seems we disagree on the approach. This is less flexible for a clean separation of layers, as it requires constant two-way communication between the net layer and the message processor. Done this way, the net must notify the processor that new messages are available, and then the processor must grab the messages from the net.

The more typical approach (what I’ve seen, anyway) is a one-way model where the net throws its messages to the processor, which is forced to consume them. That greatly reduces the risk of the processor throttling the network. It also makes things much easier as processing becomes multi-threaded, because the processor maintains the message queue, as opposed to having each new thread contend with the net lock. So the net side doesn’t care what happens once it’s handed off a message, it only has to have a means of being corked if the processor is at capacity.
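
A rough sketch of the one-way model described above (all names hypothetical, not #9441's actual API): the net side pushes a finished message into a queue owned by the processor, and the return value tells it whether to cork.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

// Hypothetical processor-owned queue: the net layer hands messages off
// one way and is told, via the return value, whether to pause reading.
class MessageProcessorSketch {
public:
    // Called by the net layer. Returns true if the caller should cork
    // reads from this peer because the queue is at capacity.
    bool PushMessage(std::string msg)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_queued_bytes += msg.size();
        m_queue.push_back(std::move(msg));
        m_cv.notify_one();
        return m_queued_bytes >= m_max_queued_bytes;
    }

    // Called by the processing thread; blocks until a message is available.
    std::string PopMessage()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_queue.empty(); });
        std::string msg = std::move(m_queue.front());
        m_queue.pop_front();
        m_queued_bytes -= msg.size();
        return msg;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::string> m_queue;
    size_t m_queued_bytes{0};
    const size_t m_max_queued_bytes{5 * 1000 * 1000};  // arbitrary cap for the sketch
};
```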

The last abstraction step (moving processing out of CConnman, as seen in a quick/dirty attempt here) relies on this one-way model.

All that to say: I prefer the approach in #9441, which yields the same speedups :)

@gmaxwell
Contributor Author

I don't follow your two-way comments. There is communication only one way here: from the ThreadSocketHandler to ThreadMessageHandler for receive, and the reverse for sending.

The processing cannot itself throttle the network (except in terms of the input rate exceeding the processing rate): assuming we haven't hit the GetReceiveFloodSize limit, the only time message processing can stall reception is while holding the per-peer lock during a simple move. The network does let it know when new things are available-- allowing it to sleep when there is nothing to do.

If messages were to be handled concurrently, it would be no finer-grained than per-peer+per-direction in any case to preserve ordering, which is the granularity of the locks in this configuration. Though message handling takes cs_main for effectively everything, so at the moment there wouldn't really be any gain in making it concurrent.

Perhaps you're referring to the WakeMessageHandler calls in ProcessMessages? Those are communication from ThreadMessageHandler to itself-- just reusing the mechanism so that the handler doesn't sleep when there is more work to be done. It could more cleanly be done with a return argument, but the signaling mechanism already exists.

which yields the same speedups

Sure takes a lot more code to do it. The purpose of its architecture is unclear to me; I don't see what we're gaining for the extra complexity.

@theuni
Member

theuni commented Dec 29, 2016

I don't follow your two-way comments. There is communication only one way here: from the ThreadSocketHandler to ThreadMessageHandler for receive, and the reverse for sending.

ThreadSocketHandler notifies ThreadMessageHandler that it has a new message.
ThreadMessageHandler then reaches into the socket handler's buffer to get and remove it. Logically the queue belongs in the processor so that it can delegate intelligently. After all, the net is done with a message as soon as it's come in.

Sure takes a lot more code to do it. The purpose of its architecture is unclear to me, I don't see what we're gaining for the extra complexity.

Eh? It's essentially 2 changes. The first changes it to use a return value rather than signaling (as you suggested above): 1d232f4
And the second looks much like yours: 65e916c

Maybe you're referring to the boost commits that it's built on? That's only because they touch so much of the same code that I didn't want to have to keep up with rebasing.

@gmaxwell
Contributor Author

gmaxwell commented Dec 29, 2016

After all, the net is done with a message as soon as it's come in.

I don't think it is-- e.g. we need to stop reading data off the socket if we have too much queued data. Otherwise the memory usage is unbounded: I don't think going around and lazily asking it to stop is sufficient, as AFAIK we have no guarantee about how much data we can receive between the times the message handler makes its way back around to notice that it's too large. E.g. with cold caches it can take 5 seconds to validate a block, and on a gigabit link you could receive >500MB of data in that time. It looks to me like your code handles this-- by having the socket side stay aware of the queue usage. So it's functionally fine, though you end up duplicating the code that does this size management in both message handling and socket handling.
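
A sketch of the bound being described (GetReceiveFloodSize is the existing limit mentioned above; the surrounding names are made up): the socket side checks the queued byte count before reading more, so the queue cannot grow unbounded while validation is slow.

```cpp
#include <cstddef>

// Hypothetical per-peer receive accounting.
struct PeerRecvStateSketch {
    size_t nRecvQueuedBytes{0};  // bytes queued for the message handler
};

// The socket handler calls this before selecting a peer's socket for
// receive: once too much is queued it stops reading, so memory stays
// bounded even if a single block takes seconds to validate on a fast link.
bool ShouldPauseRecv(const PeerRecvStateSketch& peer, size_t nReceiveFloodSize)
{
    return peer.nRecvQueuedBytes >= nReceiveFloodSize;
}
```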

I don't think that "belongs" is even the right way to think about this. The queue "belongs" to the node-- as it's the destruction of the node that ultimately destroys the queue. It is accessed by the socket thread and the message handling thread, for their respective operations.

@theuni
Member

theuni commented Dec 29, 2016

I don't think it is-- e.g. we need to stop reading data off the socket if we have too much queued data. Otherwise the memory usage is unbounded: I don't think going around and lazily asking it to stop is sufficient, as AFAIK we have no guarantee about how much data we can receive between the times the message handler makes its way back around to notice that it's too large.

Nor do I.

With each message, the message handler immediately notifies the socket handler whether it should cork. There's no lag there, it's done atomically with the message hand-off: https://github.com/bitcoin/bitcoin/pull/9441/files#diff-eff7adeaec73a769788bb78858815c91R2463

In the refactored code, the cork is the return value of the "take my message" function: theuni@1a6b10a#diff-9a82240fe7dfe86564178691cc57f2f1R1255

I don't think that "belongs" is even the right way to think about this. The queue "belongs" to the node-- as its the destruction of the node that ultimately destroys the queue. It is accessed by the socket thread and the message handling thread, for their respective operations.

Disagreed. These are two separate layers. The queue could easily outlive the node. For example, during IBD a node may send 10 blocks and immediately disconnect. There's no requirement that we destroy the queue before deleting the node.

gmaxwell closed this Dec 29, 2016
bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021