perf(consensus): Remove reactor rlocks on Consensus #3211
Conversation
@czarcas7ic profiled the number of mutex contentions before/after this change on Osmosis. This is the number of contentions on the recvRoutine over a 500s profile (200 blocks). We had 137000 contentions over 500s! This became 0 contentions in the mconnection.RecvRoutine after this change :)
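For readers who want to reproduce this kind of measurement: the thread doesn't show the actual Osmosis profiling setup, so the following is only a minimal sketch of how mutex contention can be gathered with Go's built-in profiler.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Record every mutex contention event (a rate of 1 samples all of them).
	runtime.SetMutexProfileFraction(1)

	// Expose the mutex profile at http://localhost:6060/debug/pprof/mutex.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```

The counts can then be inspected with `go tool pprof http://localhost:6060/debug/pprof/mutex`, which attributes contention events to the call sites holding the lock.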
Thanks @ValarDragon ❤️
I need to go through in more detail, but why not send this shallow copy with the synchronous message sent from state to the reactor? This PR works (without locks) because the calls are synchronous, so nothing happens in the protocol while it is called.
I thought it was easier to handle copying/pointers to a struct in an independent … Or do you mean something different? Maybe I don't understand.
I meant that adding the shallow copy of the consensus state to the …
Moreover, we still have the …
So, for context here. The …

At the beginning of the times, each time we needed … This of course was not efficient, so at some point we replaced this by creating a …

However, for some reason (also not 100% clear to me), in some points you have touched in this PR, in the …

From what I understood, you are proposing to still use a (potentially) outdated …

My general question here is: if we are pretty sure we don't need a fresh copy of …
In summary, I suggest we open an issue to understand better this synchronization/shared-memory design, and link the issues and PRs (like this one) that improve performance by avoiding useless lock acquisition and data copying.
Talked about this with Bucky IRL as well. Left it in within this PR, since it's a nice fallback that may not be costing us. It would let us get this PR in without risking any events updating the round state that we missed. Agreed we should be able to remove it. However, I feel like we can remove that in a second PR, after getting this one tested in production more extensively. It seems fine to leave the updateStatsRoutine as a backup, on a perhaps slower timer than 100 microseconds. This PR should only help, it took us to 0 lock conflicts!
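To make the pattern under discussion concrete, here is a minimal sketch of a push-updated view: the state pushes a shallow copy on every relevant change, and Receive only takes a short read lock on the view's own mutex, never on the consensus mutex. All names (`RoundState` fields, `roundStateView`, `Update`, `Load`) are illustrative, not the PR's actual identifiers.

```go
package consensus

import "sync"

// RoundState stands in for cstypes.RoundState; a shallow copy suffices
// because the reactor only reads fields from it.
type RoundState struct {
	Height int64
	Round  int32
	Step   uint8
}

// roundStateView holds the reactor's last known copy of the round state.
type roundStateView struct {
	mtx sync.RWMutex
	rs  RoundState
}

// Update is invoked synchronously from the consensus state's update
// routine each time the round state changes.
func (v *roundStateView) Update(rs RoundState) {
	v.mtx.Lock()
	v.rs = rs
	v.mtx.Unlock()
}

// Load is invoked from Receive; it never touches the consensus mutex,
// only this short-lived read lock, so peer message queues are never
// blocked on consensus making progress.
func (v *roundStateView) Load() RoundState {
	v.mtx.RLock()
	defer v.mtx.RUnlock()
	return v.rs
}
```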
Left some general comments here.
I think this change is in the right direction. The description of the PR and, especially, the changelog entry are imprecise and misleading.
Once they are fixed or reworded, we are good to go.
```diff
@@ -0,0 +1,4 @@
+- `[consensus]` Make the consensus reactor no longer have packets on receive take the consensus lock.
+  Consensus will now update the reactor's view after every relevant change through the existing event bus.
```
This is not really an event bus.
I would say that we use the synchronous events sent from the state to the reactor to transmit a fresh/updated copy of the consensus state to the reactor.
Ok! I thought this event system was called the event bus, that's my bad. Thank you!
No problem, it is described like that:
cometbft/internal/consensus/state.go, lines 131 to 133 in 2dd5209:

```go
// synchronous pubsub between consensus state and reactor.
// state only emits EventNewRoundStep and EventVote
evsw cmtevents.EventSwitch
```
And the comment is outdated, as we have a third event, eheheheheh
We have an asynchronous bus in state.go too, just to make it simple to any reader. :)
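As a reading aid for the "synchronous pubsub" comment quoted above: the key property is that firing an event runs every listener inline, on the caller's goroutine, before the call returns, which is why the reactor can take the pushed copy without extra coordination. Below is a toy switch written just for this discussion; it is not the real cmtevents.EventSwitch API.

```go
package events

import "sync"

// toyEventSwitch is a stripped-down stand-in for cmtevents.EventSwitch.
type toyEventSwitch struct {
	mtx       sync.RWMutex
	listeners map[string][]func(data interface{})
}

func newToyEventSwitch() *toyEventSwitch {
	return &toyEventSwitch{listeners: make(map[string][]func(data interface{}))}
}

// AddListener registers a callback for an event name.
func (sw *toyEventSwitch) AddListener(event string, cb func(data interface{})) {
	sw.mtx.Lock()
	defer sw.mtx.Unlock()
	sw.listeners[event] = append(sw.listeners[event], cb)
}

// FireEvent runs every callback inline: nothing else in the protocol
// advances on this goroutine until all listeners have returned, which
// is the "synchronous" part that makes the lock-free handoff safe.
func (sw *toyEventSwitch) FireEvent(event string, data interface{}) {
	sw.mtx.RLock()
	cbs := sw.listeners[event]
	sw.mtx.RUnlock()
	for _, cb := range cbs {
		cb(data)
	}
}
```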
```diff
-	rs            *cstypes.RoundState
+	rsMtx         cmtsync.RWMutex
+	rs            *cstypes.RoundState
+	initialHeight int64 // under rsMtx
```
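One hedged reading of this diff: initialHeight now lives next to rs and is guarded by the same rsMtx, so the receive path can take one short read lock and get a consistent pair instead of locking the consensus state. The sketch below mirrors the shape of the diff, not the exact PR code, and reuses the placeholder RoundState from the earlier sketch.

```go
// reactorView mirrors the fields in the diff above (placeholder names).
type reactorView struct {
	rsMtx         sync.RWMutex
	rs            *RoundState // last shallow copy pushed by the state
	initialHeight int64       // under rsMtx, per the diff's comment
}

// snapshot returns a consistent (round state, initial height) pair with
// a single short read lock on the reactor's own mutex.
func (v *reactorView) snapshot() (RoundState, int64) {
	v.rsMtx.RLock()
	defer v.rsMtx.RUnlock()
	return *v.rs, v.initialHeight
}
```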
Why do we need this value for every received message? Not questioning the change, but the rationale.
We need it to be updated on every prune, since that can change our state.initialHeight. I didn't see a synchronous point at which to update it for prunes.
I think this is the first height of the blockchain; it never changes. In any case, the first block (height) stored is something we can get from the block store, not from the state.
That makes sense to me!
Anything needed for this PR to progress?
Good.
The discussion, to have in another issue/PR, is whether we still need, after these changes, the updateRoundStateRoutine(), which does the same as this PR proposes.
* Backport cometbft#3211
* Fix Race
* bp cometbft#3157
* Speedup tests that were hitting timeouts
* bp cometbft#3161
* Fix data race
* Mempool async processing
* Forgot to commit important part
* Add changelog
…3335) This PR reverts #3211 since it is making the e2e nightly fail in reproducible ways. This PR was identified as the reason for the recent e2e nightly failure on `main`: a bisect test of all recent commits determined that it introduced a behavior that makes the tests fail and indicates a bug or unknown behavior. Even though we are reverting this logic for now, we'd be happy to consider it again in the future once more tests are performed and we can ensure it passes all tests.

---

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [ ] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
…#3341) Pretty simple bug fix for the e2e failure on #3211. There was a race condition at initialization for initial height, because we didn't initialize it early enough. The error in the E2E logs was:

```
validator03 | E[2024-06-21|21:13:20.744] Stopping peer for error module=p2p peer="Peer{MConn{10.186.73.2:34810} 4fe295e4cfad69f1247ad85975c6fd87757195db in}" err="invalid field LastCommitRound can only be negative for initial height 0"
validator03 | I[2024-06-21|21:13:20.744] service stop module=p2p peer=4fe295e4cfad69f1247ad85975c6fd87757195db@10.186.73.2:34810 msg="Stopping Peer service" impl="Peer{MConn{10.186.73.2:34810} 4fe295e4cfad69f1247ad85975c6fd87757195db in}"
```

hinting at initial height not being set properly.

---

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [ ] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
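In shape, the fix described above likely amounts to populating the reactor's initial height at construction, before any routine that serves peer messages is started, so the check in the log can never see a zero value. The sketch below uses hypothetical names throughout; it is not the actual #3341 diff.

```go
package consensus

// Hypothetical sketch, not the actual #3341 code.
type consensusState struct {
	initialHeight int64
}

type reactor struct {
	initialHeight int64 // guarded by rsMtx once routines are running
}

func newReactor(cs *consensusState) *reactor {
	return &reactor{
		// Set eagerly here: no peer goroutine exists yet, so no lock is
		// needed, and Receive can never observe the zero value.
		initialHeight: cs.initialHeight,
	}
}
```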
PR with @ebuchman !!!
Remove (most) of the reactor RLocks on consensus! (The remaining one is in the gossip routine; it should be easy to fix, but as a separate PR.)
On ingesting messages, the consensus reactor takes RLocks on the consensus mutex, preventing us from adding things to the queue (and therefore blocking behavior).

This can cause the peer message queue to be blocked. Now it won't be blocked, because we update the round state directly from the cs state update routine (via the synchronous event switch) whenever the state emits one of its events (EventNewRoundStep, EventVote, and the third event noted in the review).

This shouldn't change reactor message validity, because Reactor.Receive could always view a cs.RoundState that would already be old by the time the packet reaches the cs.HandleMsg queue. Every update to consensus' round state pushes the update into the reactor's view, so we avoid locks.
I don't think any new tests are required!
#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use unclog to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments