Reduce latency of ThreadMessageHandler. (3.2x speedup for IBD to 200k) #9415

Closed
wants to merge 4 commits into from

Conversation

gmaxwell
Contributor

gmaxwell commented Dec 23, 2016

Instead of using a 100ms sleep in the condition variable, this changes
it to sleep for 100ms after the loop started, so if the loop takes a
long time (e.g. due to waiting on the vnodes lock at the end) it will
restart sooner.
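
To illustrate the timing change, a minimal sketch (the names here are illustrative, not the PR's actual identifiers): the deadline is computed when the iteration starts, so a slow pass through the peers eats into the 100ms rather than adding to it.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex mutexMsgProc;
std::condition_variable condMsgProc;
bool fMsgProcWake = false;  // set by whoever wants the handler to run again

void ThreadMessageHandlerSketch()
{
    while (true) {
        // Compute the deadline when the iteration starts, so a slow pass
        // through the peers shortens the sleep instead of adding to it.
        const auto deadline = std::chrono::steady_clock::now()
                              + std::chrono::milliseconds(100);

        // ... run ProcessMessages()/SendMessages() for each peer here ...

        std::unique_lock<std::mutex> lock(mutexMsgProc);
        condMsgProc.wait_until(lock, deadline, [] { return fMsgProcWake; });
        fMsgProcWake = false;
    }
}
```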

Also avoids a potential 100ms delay before announcing a new block that
arrived via submitblock, prevents delays in processing messages for
early peers when they arrive part way through the message handler loop,
and reruns the message handler when some peers don't finish sendmessages
due to contention for cs_main.

The final commit changes the message handling to only hold cs_vRecvMsg long
enough to move out the message. This speeds up IBD up to 200k from a single
peer over GbE compared to master by about 200%, by eliminating cs_vRecvMsg
contention (that the prior commits happened to also make worse). With this change
on both sides the speedup is ~3.2x.
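
A sketch of the narrowed critical section described in the final commit (simplified types and names, assuming a per-peer receive queue guarded by cs_vRecvMsg): the lock is held only while splicing one message out, and the message is then processed with the lock released.

```cpp
#include <list>
#include <mutex>
#include <vector>

struct CNetMessageSketch { std::vector<unsigned char> vRecv; /* payload */ };

std::mutex cs_vRecvMsg;                 // guards vRecvMsg
std::list<CNetMessageSketch> vRecvMsg;  // per-peer receive queue

bool ProcessOneMessageSketch()
{
    std::list<CNetMessageSketch> msgs;
    {
        // Hold cs_vRecvMsg only long enough to move one message out,
        // rather than across all of message processing.
        std::lock_guard<std::mutex> lock(cs_vRecvMsg);
        if (vRecvMsg.empty()) return false;
        msgs.splice(msgs.begin(), vRecvMsg, vRecvMsg.begin());
    }
    CNetMessageSketch& msg = msgs.front();
    // ... deserialize and handle msg here without holding cs_vRecvMsg,
    // so the socket thread can keep appending newly received data ...
    (void)msg;
    return true;
}
```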

fanquake added the P2P label Dec 23, 2016
@theuni
Member

theuni commented Dec 24, 2016

There's so much currently wrong with this loop that I'm hesitant to make changes before rewriting it entirely. I've been discussing my plans with @TheBlueMatt for doing just that after #9289 is merged.

Some other fun issues with this loop (I'm sure I'm leaving several out):

  • messageHandlerCondition wakes spuriously
  • The depth of vRecvGetData and messages to send aren't taken into consideration
  • cs_vRecvMsg is held for the duration of message processing, leading to insane lock contention

I have no objections to slapping some band-aids on here, but I'm really really hoping to address all of those at once in my next PR which moves this handler into net_processing. As a bonus, it should simplify @TheBlueMatt's multi-threaded processing.

@gmaxwell
Contributor Author

messageHandlerCondition wakes spuriously

who cares? :) It's not like this loop or processing that results from random wakes ever shows up in profiles.

If it were a real concern it would be easy to go back to sleep if the boolean hasn't been flipped and not enough time has passed.
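
For illustration, that "go back to sleep" behavior is what a timed wait loop gives you (equivalent to wait_until with a predicate); names in this sketch are illustrative:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool fWake = false;  // flipped by whoever wants to wake the handler

// Sleep until the flag is flipped or 100ms have passed, going back to
// sleep on spurious wakeups.
void SleepUnlessWokenSketch()
{
    const auto deadline = std::chrono::steady_clock::now()
                          + std::chrono::milliseconds(100);
    std::unique_lock<std::mutex> lock(m);
    while (!fWake) {
        if (cv.wait_until(lock, deadline) == std::cv_status::timeout) break;
    }
    fWake = false;
}
```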

but I'm really really hoping to address all of those at once in my next PR which moves this handler into net_processing.

My guess is that we may be a bit too close to the end of 0.14 to achieve much more in the way of big changes, especially ones with difficult-to-test concurrency implications. I just don't want efforts to make big sweeping changes to stand in the way of meaningful improvement-- we probably should have made the changes in this PR a year ago.

I'm reviewing #9289-- hopefully we can get that in soon.

@TheBlueMatt
Contributor

TheBlueMatt commented Dec 24, 2016 via email

@theuni
Member

theuni commented Dec 25, 2016

who cares? :) It's not like this loop or processing that results from random wakes ever shows up in profiles.

My concern was more that some platforms (win32) may vary significantly in this wakeup behavior.

Completely agree with you though. I certainly wasn't suggesting that we not take the improvements here for the sake of something coming later. In fact, the wake from the miner is necessary either way. I was just pointing out that I'm hoping to be able to delete most of this code soon anyway :)

@gmaxwell
Contributor Author

Mystery: this slows the IBD of the first 200k blocks from a local gigabit ethernet connected peer by about 40%. I've run the test 4 times each way, same result. Maybe if we solve that, we make IBD faster.

@gmaxwell
Contributor Author

I bisected my patch and found that the cs_vSend/cs_vRecvMsg contention causing fSleep=false is the cause of the IBD performance regression.

@TheBlueMatt
Contributor

TheBlueMatt commented Dec 28, 2016 via email

@gmaxwell
Contributor Author

gmaxwell commented Dec 29, 2016

I tested reducing the message handler's hold of the cs_vRecvMsg lock to a window just narrow enough to std::move the message out, and this sped up IBD to 200k from a single peer by more than a factor of two over master.

Instead of using a 100ms sleep in the condition variable, this changes
it to sleep for 100ms after the loop started, so if the loop takes a
long time (e.g. due to waiting on the vnodes lock at the end) it will
restart sooner.

Avoids a potential 100ms delay before announcing a new block that
arrived via submitblock.

…e work.

This prevents delays in processing messages for early peers when they arrive
part way through the message handler loop.

This also reruns the message handler when some peers don't finish sendmessages
due to contention for cs_main.
@gmaxwell
Contributor Author

Sync to 200k with master: 243.7 blocks per second.
With this PR on the receiver: 465.431 blocks per second.
With the PR on both sender and receiver: 782.287 blocks per second (3.2x master's speed).

(Chainstate reindex is about 1000/s so there is still perhaps some room for improvement.)

gmaxwell changed the title from "Reduce latency of ThreadMessageHandler." to "Reduce latency of ThreadMessageHandler. (3.2x speedup for IBD to 200k)" Dec 29, 2016
This also eliminates a race on nSendSize by making it atomic.

This speeds up IBD up to 200k from a single peer over GbE
compared to master by about 200%, by eliminating cs_vRecvMsg
contention (that the prior commits happened to also make worse).
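
A minimal illustration of the race fix described above, with a simplified stand-in for the real CNode: once the counter is std::atomic, one thread can add to it and another can read it without holding a lock.

```cpp
#include <atomic>
#include <cstddef>

// Simplified stand-in for the relevant part of CNode.
struct CNodeSketch {
    // Bytes queued for sending: written when messages are queued and read
    // by the socket thread, so it must be atomic to avoid a data race.
    std::atomic<size_t> nSendSize{0};
};

void AccountForQueuedBytes(CNodeSketch& node, size_t nBytes)
{
    node.nSendSize += nBytes;  // atomic read-modify-write, no lock needed
}

bool SendBufferFull(const CNodeSketch& node, size_t nSendBufferMaxSize)
{
    return node.nSendSize >= nSendBufferMaxSize;  // atomic load
}
```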
@theuni
Member

theuni commented Dec 29, 2016

Looks like we’re stepping on each other :)

We’re definitely working to solve the same issue, though it seems we disagree on the approach. This is less flexible for a clean separation of layers, as it requires constant two-way communication between the net layer and the message processor. Done this way, the net must notify the processor that new messages are available, and then the processor must grab the messages from the net.

The more typical approach (what I’ve seen, anyway) is a one-way model where the net throws its messages to the processor, which is forced to consume them. That greatly reduces the risk of the processor throttling the network. It also makes things much easier as processing becomes multi-threaded, because the processor maintains the message queue, as opposed to having each new thread contend with the net lock. So the net side doesn’t care what happens once it’s handed off a message, it only has to have a means of being corked if the processor is at capacity.
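
A rough sketch of the one-way model described above (all names hypothetical, not #9441's actual API): the net side pushes a finished message into a queue owned by the processor, and the return value tells it whether to cork.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

// Hypothetical processor-owned queue: the net layer hands messages off
// one way and is told, via the return value, whether to pause reading.
class MessageProcessorSketch {
public:
    // Called by the net layer. Returns true if the caller should cork
    // reads from this peer because the queue is at capacity.
    bool PushMessage(std::string msg)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_queued_bytes += msg.size();
        m_queue.push_back(std::move(msg));
        m_cv.notify_one();
        return m_queued_bytes >= m_max_queued_bytes;
    }

    // Called by the processing thread; blocks until a message is available.
    std::string PopMessage()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_queue.empty(); });
        std::string msg = std::move(m_queue.front());
        m_queue.pop_front();
        m_queued_bytes -= msg.size();
        return msg;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::string> m_queue;
    size_t m_queued_bytes{0};
    const size_t m_max_queued_bytes{5 * 1000 * 1000};  // arbitrary cap for the sketch
};
```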

The last abstraction step (moving processing out of CConnman, as seen in a quick/dirty attempt here) relies on this one-way model.

All that to say: I prefer the approach in #9441, which yields the same speedups :)

@gmaxwell
Contributor Author

I don't follow your two-way comments. There is communication only one way here: from the ThreadSocketHandler to ThreadMessageHandler for receive, and the reverse for sending.

The processing cannot itself throttle the network (except in terms of the input rate exceeding the processing rate): assuming we haven't hit the GetReceiveFloodSize limit, the only time message processing can stall reception is while holding the per-peer lock during a simple move. The network does let it know when new things are available-- allowing it to sleep when there is nothing to do.

If messages were to be handled concurrently, it would be no finer-grained than per-peer+per-direction in any case to preserve ordering, which is the granularity of the locks in this configuration. Though message handling takes cs_main for effectively everything, so at the moment there wouldn't really be any gain in making it concurrent.

Perhaps you're referring to the WakeMessageHandler calls in ProcessMessages? Those are communication from ThreadMessageHandler to itself-- just reusing the mechanism so that the handler doesn't sleep when there is more work to be done. It could more cleanly be done with a return argument, but the signaling mechanism already exists.

which yields the same speedups

Sure takes a lot more code to do it. The purpose of its architecture is unclear to me; I don't see what we're gaining for the extra complexity.

@theuni
Member

theuni commented Dec 29, 2016

I don't follow your two-way comments. There is communication only one way here: from the ThreadSocketHandler to ThreadMessageHandler for receive, and the reverse for sending.

ThreadSocketHandler notifies ThreadMessageHandler that it has a new message.
ThreadMessageHandler then reaches into the socket handler's buffer to get and remove it. Logically the queue belongs in the processor so that it can delegate intelligently. After all, the net is done with a message as soon as it's come in.

Sure takes a lot more code to do it. The purpose of its architecture is unclear to me, I don't see what we're gaining for the extra complexity.

Eh? It's essentially 2 changes. The first changes it to use a return value rather than signaling (as you suggested above): 1d232f4
And the second looks much like yours: 65e916c

Maybe you're referring to the boost commits that it's built on? That's only because they touch so much of the same code that I didn't want to have to keep up with rebasing.

@gmaxwell
Contributor Author

gmaxwell commented Dec 29, 2016

After all, the net is done with a message as soon as it's come in.

I don't think it is-- e.g. we need to stop reading data off the socket if we have too much queued data. Otherwise the memory usage is unbounded: I don't think going around and lazily asking it to stop is sufficient, as AFAIK we have no guarantee about how much data we can receive between the times the message handler makes its way back around to notice that it's too large. E.g. with cold caches it can take 5 seconds to validate a block, and on a gigabit link you could receive >500MB of data in that time. It looks to me like your code handles this-- by having the socket side stay aware of the queue usage. So it's functionally fine, though you end up duplicating the code that does this size management in both message handling and socket handling.
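
A sketch of the bound being described (GetReceiveFloodSize is the existing limit mentioned above; the surrounding names are made up): the socket side checks the queued byte count before reading more, so the queue cannot grow unbounded while validation is slow.

```cpp
#include <cstddef>

// Hypothetical per-peer receive accounting.
struct PeerRecvStateSketch {
    size_t nRecvQueuedBytes{0};  // bytes queued for the message handler
};

// The socket handler calls this before selecting a peer's socket for
// receive: once too much is queued it stops reading, so memory stays
// bounded even if a single block takes seconds to validate on a fast link.
bool ShouldPauseRecv(const PeerRecvStateSketch& peer, size_t nReceiveFloodSize)
{
    return peer.nRecvQueuedBytes >= nReceiveFloodSize;
}
```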

I don't think that "belongs" is even the right way to think about this. The queue "belongs" to the node-- as it's the destruction of the node that ultimately destroys the queue. It is accessed by the socket thread and the message handling thread, for their respective operations.

@theuni
Member

theuni commented Dec 29, 2016

I don't think it is-- e.g. we need to stop reading data off the socket if we have too much queued data. Otherwise the memory usage is unbounded: I don't think going around and lazily asking it to stop is sufficient, as AFAIK we have no guarantee about how much data we can receive between the times the message handler makes its way back around to notice that it's too large.

Nor do I.

With each message, the message handler immediately notifies the socket handler whether it should cork. There's no lag there, it's done atomically with the message hand-off: https://github.com/bitcoin/bitcoin/pull/9441/files#diff-eff7adeaec73a769788bb78858815c91R2463

In the refactored code, the cork is the return value of the "take my message" function: theuni@1a6b10a#diff-9a82240fe7dfe86564178691cc57f2f1R1255

I don't think that "belongs" is even the right way to think about this. The queue "belongs" to the node-- as its the destruction of the node that ultimately destroys the queue. It is accessed by the socket thread and the message handling thread, for their respective operations.

Disagreed. These are two separate layers. The queue could easily outlive the node. For example, during IBD a node may send 10 blocks and immediately disconnect. There's no requirement that we destroy the queue before deleting the node.

gmaxwell closed this Dec 29, 2016
bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021