raft: don't emit unstable CommittedEntries #14413

tbg · 2022-09-01T15:18:48Z

When run in a single-voter configuration, prior to this PR
raft would emit HardStates that would emit a proposed Entry
simultaneously in CommittedEntries and Entries.

To be correct, this requires users of the raft library to enforce an
ordering between appending to the log and notifying the client about
CommittedEntries also present in Entries. This was easy to miss.

Walk back this behavior to arrive at a simpler contract: what's
emitted in CommittedEntries is truly committed, i.e. present
in stable storage on a quorum of voters.

This in turn pessimizes the single-voter case: rather than fully
handling an Entry in just one Ready, now two are required,
and in particular one has to do extra work to save on allocations.

We accept this as a good tradeoff, since raft primarily serves
multi-voter configurations.

Suggested review plan:

Look at raft: don't emit unstable CommittedEntries first to get an idea of the gist of the change
Look at raft: directly update leader in advance for an alternative solution that "inlines" the relevant bits of r.Step instead.
The test suites pass with either solution, so the two solutions are functionally equivalent to the degree that we're exercising them.
Look at raft: add BenchmarkRawNode and at the benchmark results in raft: benchmark results
If there are substantial concerns and/or comments, stop here and sort them out with me.
Then slog through the remaining commits, which are all going to be squashed before the merge, and fix up the tests one-by-one.
Note: raft: always mark leader as RecentActive was included here since otherwise the outputs of the two possible solutions differ in sometimes failing to track the leader as recently active.

serathius · 2022-09-07T11:28:36Z

Rerun the e2e flake. This PR is still failing DCO, please run git rebase origin/main --signoff and push head with force git push origin HEAD:raft-single-voter -f.

nvanbenschoten

This change makes sense to me. It feels like the "proper" solution because it simplifies the Ready contract and avoids the need for special-casing of single-peer deployments. After this change, entries in the CommittedEntries slice are always committed.

I actually made very similar changes in a prototype (nvanbenschoten@1d1fa32) intended to address #12257 / cockroachdb/cockroach#17500. These kinds of extensions will be easier to make if the Raft leader doesn't immediately consider an entry to be durable upon appending to its own unstable log (in raft.appendEntry). Things are cleaner if, instead, the durability status is maintained either implicitly in raft.advance or explicitly through a different (opt-in) API if clients want the flexibility to sync their log asynchronously.

From the perspective of CRDB's use of Raft, we weren't susceptible to the bug in #14370 because log application always came after log appending in the Ready processing state machine loop. However, we were aware of this idiosyncrasy and had to work around it when acking proposals ahead of application, see: https://github.com/cockroachdb/cockroach/blob/ce55e1b46604401f6085a6ece1355970da94600a/pkg/kv/kvserver/replica_raft.go#L834-L844

The comment there is interesting because it references a second case where Entries and CommittedEntries can overlap — that being when a follower in a multi-node replication group is catching up after falling behind. In those cases, the overlapping entries are already committed, so there's no durability risk from applying CommittedEntries before appending Entries (like etcdserver does). However, that still means that it's possible to have applied an entry that is not durably stored locally in the Raft log. I don't know if that could cause issues for etcd or whether entry application is idempotent.

raft/raft.go

raft/node_util_test.go

raft/rawnode_test.go

raft/diff_test.go

raft/node_util_test.go

raft/node_test.go

ahrtr · 2022-09-09T05:26:30Z

The production code change looks simple & good, but the test code change is too big. This PR almost follows the same logic as #14411, but this PR is much bigger, even excluding the new benchmark test case.

Another comment... Have you compared the performance as compared to the main branch using the benchmark tool? I suggest to compare both single-node cluster and multi-node cluster. FYI. #14394 (comment) and #14394 (comment)

tbg · 2022-09-09T10:01:06Z

Thanks for the review, @ahrtr. I'm not responding to the inline comments yet since you seem to have general concerns about the size of the change.

I think it comes down to the discussion in #14370 (comment).

You're fixing *node, I'm fixing raft. So I have to update more tests.
Also, you keep the old behavior for empty entries for reasons that aren't clear to me. I think this prevents you from having to adjust more tests.

Usually empty entries will only be generated in multi-node cluster. Since multi-node clusters do not have this issue, so I think it makes sense to keep the optimization (update the pr.Match immediately after appending it).

In my book, a special case needs a compelling reason. Special cases add complexity, and this complexity has to be made up for by some benefit. I cannot see that benefit here.

Some of the tests were pretty confused about whether they wanted to make statements about entry append or entry commit or entry apply. This is because the problem we're fixing made these things all the same in a single-node cluster in the same cycle.

I made sure to update the tests to be more robust (especially the TestNode.* tests are pretty annoying when they fail, prior to my improvements) and to not get too crufty. That is some extra work, but this PR should be judged by its changes to production code, not the corresponding updates in tests.

If you remove the optimization (mentioned above) so as to keep the behavior consistent for both empty and non-empty entry, then likely there are more subtle edge cases you need to handle, accordingly it might further complicate the raft package; and they are exactly what I have been trying to avoid.

Could you elaborate on this? Two cases = more complexity. In this PR, the empty entry is just an entry; why make it more complex? There should be fewer subtle edge cases now - namely none at all - and that is what we are after. What am I missing?

Actually [Fourth solution] Fix the potential data loss for clusters with only one member (raft layer change) #14411 doesn't almost change anything on the raft protocol, which I think is better. But I agree it has some impact on the direct users on RawNode.

We can agree to disagree here - *raft is the thing that produces Ready and the semantics of Ready are what need to be fixed. We could fix it in *RawNode but RawNode is intentionally a thin interface wrapper around *raft (similar to *node is just a goroutine around *RawNode). Spreading implementation details between them is not going to make things clearer. We shouldn't start to bolt on logic outside of *raft because we're scared of changing *raft.

ahrtr · 2022-09-09T23:00:38Z

Thanks for the feedback.

Could you elaborate on this? Two cases = more complexity. In this PR, the empty entry is just an entry; why make it more complex? There should be fewer subtle edge cases now - namely none at all - and that is what we are after. What am I missing?

I just kept the legacy behavior on empty entry so as to keep the PR(14411) as simple&safe as possible, and it's the reason why the size of 14411 is much smaller than this one. I agree that regarding the empty entry as a special case isn't good, although it was consistent with the legacy behavior.

Some of the tests were pretty confused about whether they wanted to make statements about entry append or entry commit or entry apply. This is because the problem we're fixing made these things all the same in a single-node cluster in the same cycle.

Thanks for the clarification.

That is some extra work, but this PR should be judged by its changes to production code, not the corresponding updates in tests.

Partially agreed. We also need to make sure the quality of test code. The reason why I raised the concern on the size on the test code is just to understand the reason of the bigger size. Good to confirmed that one of the major reasons is due to the removal of the special case on empty entry.

Overall looks good to me. Please,

Consider to resolve the inline minor comments;
Consider to squash the commits into a couple of commits if possible.

tbg · 2022-09-12T11:49:46Z

Thanks @ahrtr. I'm travelling this week, so there won't be activity on this PR until next week but I plan to return to it the week after. I am also planning to validate the perf impact of this change with the etcd suite as you suggested, and also against CRDB.

This picks up etcd-io/etcd#14413. Closes cockroachdb#87264. Release note: None

``` go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66 go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead. go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0 go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66 ``` This picks up - etcd-io/etcd#14413 - etcd-io/etcd#14538 Closes cockroachdb#87264. Release note: None

This commit introduces an intermediate state that delays the acknowledgement of a node's self-vote during an election until that vote has been durably persisted (i.e. on the next call to Advance). This change can be viewed as the election counterpart to etcd-io#14413. This is an intermediate state that limits code movement for the rest of the async storage writes change. Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

``` go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66 go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead. go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0 go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66 ``` This picks up - etcd-io/etcd#14413 - etcd-io/etcd#14538 Closes cockroachdb#87264. Release note: None

89632: go.mod: bump raft r=nvanbenschoten a=tbg ``` go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66 go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead. go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0 go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66 ``` This picks up - etcd-io/etcd#14413 - etcd-io/etcd#14538 Compared single-node performance on gceworker via ```bash #!/bin/bash set -euxo pipefail pkill -9 cockroach || true rm -rf cockroach-data cr=./cockroach-$1 $cr start-single-node --background --insecure $cr workload init kv $cr workload run kv --splits 100 --max-rate 2000 --duration 10m --read-percent 0 --min-block-bytes 10 --max-block-bytes 10 | tee $1.txt ``` ``` _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 600.0s 0 1199604 1999.3 6.7 7.1 10.0 11.0 75.5 write #master 600.0s 0 1199614 1999.4 6.8 7.1 10.0 11.0 79.7 write #PR ``` Closes #87264. - [x] [make it build](#88985 (comment)) - [x] remove the maxIndex param and handling from Task.AckCommittedEntriesBeforeApplication - [x] check that single node write latencies don't regress Release note: None 91117: sql: reduce the overhead of EXPLAIN ANALYZE r=yuzefovich a=yuzefovich In order to propagate the execution stats across the distributed query plan we use the tracing infrastructure, where each stats object is added as "structured metadata" to the trace. Thus, whenever we're collecting the exec stats for a statement, we must enable tracing. Previously, in many cases we would enable it at the highest verbosity level which has non-trivial overhead. In some cases this was an overkill (e.g. in `EXPLAIN ANALYZE` we don't really care about the trace containing all of the gory details - we won't expose it anyway), so this is now fixed by using the less verbose "structured" verbosity level. As a concrete example of the difference: for a stmt that without `EXPLAIN ANALYZE` takes around 190ms, with `EXPLAIN ANALYZE` it would previously run for about 1.8s and now it takes around 210ms. This required some minor changes to the row-by-row outbox and router setups to collect thats even if the recording is not verbose. Addresses: #90739. Epic: None Release note (performance improvement): The overhead of running `EXPLAIN ANALYZE` and `EXPLAIN ANALYZE (DISTSQL)` has been significantly reduced. The overhead of `EXPLAIN ANALYZE (DEBUG)` didn't change. 91119: roachprod: improve error in ParallelE r=smg260 a=tbg Prior to this commit, the error's stack trace did not link back to the caller of `ParallelE`. Now it does. Epic: none Release note: None 91126: dev: allow whitespace separated regexps for testlogic files r=ajwerner a=ajwerner This was a feature of `make testlogic` and it was liked. Fixes #91125 Release note: None Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Andrew Werner <awerner32@gmail.com>

This commit introduces an intermediate state that delays the acknowledgement of a node's self-vote during an election until that vote has been durably persisted (i.e. on the next call to Advance). This change can be viewed as the election counterpart to etcd-io#14413. This is an intermediate state that limits code movement for the rest of the async storage writes change. Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

tbg mentioned this pull request Sep 1, 2022

Durability API guarantee broken in single node cluster #14370

Closed

tbg force-pushed the raft-single-voter branch from a00cc98 to a13a33f Compare September 1, 2022 15:23

tbg mentioned this pull request Sep 1, 2022

raft: re-work leader self-ack mechanism cockroachdb/cockroach#87264

Closed

ahrtr mentioned this pull request Sep 5, 2022

[Fourth solution] Fix the potential data loss for clusters with only one member (raft layer change) #14411

Closed

tbg force-pushed the raft-single-voter branch from a13a33f to e905e94 Compare September 6, 2022 12:07

This comment was marked as outdated.

Sign in to view

tbg force-pushed the raft-single-voter branch 2 times, most recently from 028ac1d to acb7e1d Compare September 7, 2022 09:09

tbg force-pushed the raft-single-voter branch 3 times, most recently from 1e6d038 to ba04f6c Compare September 8, 2022 10:28

tbg requested a review from ptabor September 8, 2022 10:43

tbg assigned ahrtr Sep 8, 2022

tbg requested a review from ahrtr September 8, 2022 10:43

tbg unassigned ahrtr Sep 8, 2022

tbg marked this pull request as ready for review September 8, 2022 10:46

nvanbenschoten reviewed Sep 8, 2022

View reviewed changes

raft/raft.go Outdated Show resolved Hide resolved