
storage: make RaftTruncatedState unreplicated #34660

Merged
merged 5 commits into cockroachdb:master on Feb 11, 2019

Conversation

@tbg (Member) commented on Feb 6, 2019

This isn't 100% polished yet, but generally ready for review.


See #34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that is,
log entries which are already reflected in the state of the snapshot.
Fundamentally, this is because we have historically used a replicated
TruncatedState.

TruncatedState essentially tells us what the first index in the log is
(though it also includes a Term). If the TruncatedState cannot diverge
across replicas, we must send the whole log in snapshots, as the first
log index must match what the TruncatedState claims it is.

The Raft log is typically, but not necessarily, small. Log truncations are
driven by a queue and use a complex decision process. That decision process
can be faulty, and even if it isn't, the queue could be held up. Besides,
even when the Raft log contains only very few entries, these entries may be
quite large (see SSTable ingestion during RESTORE).

All this motivates that we don't want to (be forced to) send the Raft log
as part of snapshots, and in turn we need the TruncatedState to be
unreplicated.

This change migrates the TruncatedState into unreplicated keyspace. It does
not yet allow snapshots to avoid sending the past Raft log, but that is a
relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
state into unreplicated keyspace on log truncations.

The migration works as follows:

  1. At any log position, the replicas of a Range either use the new
    (unreplicated) key or the old one, and exactly one of them exists.

  2. When a log truncation evaluates under the new cluster version, it
    initiates the migration by deleting the old key. Under the old cluster
    version, it behaves like today, updating the replicated truncated state.

  3. The deletion signals new code downstream of Raft and triggers a write to
    the new, unreplicated, key (atomic with the deletion of the old key).

  4. Future log truncations don't write any replicated data anymore, but
    (like before) send along the TruncatedState, which is written downstream of
    Raft atomically with the deletion of the log entries. This actually uses
    the same code as step 3. What's new is that the truncated state needs to be
    verified before replacing a previous one. If replicas disagree about their
    truncated state, it's possible for replica X at FirstIndex=100 to apply a
    truncated state update that sets FirstIndex to, say, 50 (proposed by a
    replica with a "longer" historical log). In that case, the truncated state
    update must be ignored (this is straightforward downstream-of-Raft code;
    see the sketch after this list).

  5. When a split trigger evaluates, it seeds the RHS with the legacy key iff
    the LHS uses the legacy key, and the unreplicated key otherwise. This makes
    sure that the invariant that all replicas agree on the state of the
    migration is upheld.

  6. When a snapshot is applied, the receiver is told whether the snapshot
    contains a legacy key. If not, it writes the truncated state (which is part
    of the snapshot metadata) in its unreplicated version. Otherwise it doesn't
    have to do anything (the range will migrate later).
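
To make step 4 concrete, here's a minimal, self-contained sketch of that downstream-of-Raft check (simplified names and types, not the actual CockroachDB code):

```
package main

import "fmt"

// truncatedState mirrors the two fields discussed above; the real proto
// carries more context.
type truncatedState struct {
	Index uint64 // last index that has been truncated away
	Term  uint64
}

// maybeApply replaces the local truncated state only if the proposed one
// advances it; a proposal from a replica with a "longer" historical log
// (e.g. FirstIndex 50 vs. our 100) is ignored, per step 4 above.
func maybeApply(local *truncatedState, proposed truncatedState) bool {
	if proposed.Index <= local.Index {
		return false // stale proposal, ignore
	}
	*local = proposed
	return true
}

func main() {
	local := truncatedState{Index: 99, Term: 5} // i.e. FirstIndex = 100
	fmt.Println(maybeApply(&local, truncatedState{Index: 49, Term: 4}))  // false: ignored
	fmt.Println(maybeApply(&local, truncatedState{Index: 119, Term: 6})) // true: applied
}
```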

The following diagram visualizes the above. Note that it abuses sequence
diagrams to get a nice layout; the vertical lines belonging to NewState and
OldState don't imply any particular ordering of operations.

```
┌────────┐                            ┌────────┐
│OldState│                            │NewState│
└───┬────┘                            └───┬────┘
    │                        Bootstrap under old version
    │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │                                     │     Bootstrap under new version
    │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │─ ─ ┐
    │    | Log truncation under old version
    │< ─ ┘
    │                                     │
    │─ ─ ┐                                │
    │    | Snapshot                       │
    │< ─ ┘                                │
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Snapshot
    │                                     │< ─ ┘
    │                                     │
    │   Log truncation under new version  │
    │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under new version
    │                                     │< ─ ┘
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under old version
    │                                     │< ─ ┘ (necessarily running new binary)
```
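
The same transitions, expressed as a tiny sketch of the state machine the diagram describes (a hypothetical helper with assumed names, not code from this change):

```
package main

import "fmt"

type truncStateKey int

const (
	oldReplicatedKey   truncStateKey = iota // legacy, replicated key
	newUnreplicatedKey                      // migrated, unreplicated key
)

// afterLogTruncation models the arrows in the diagram: a truncation proposed
// under the new cluster version migrates a range to the unreplicated key;
// once migrated, it stays migrated even if a later truncation evaluates under
// the old cluster version (which necessarily runs the new binary). Snapshots
// and bootstrap don't change the state.
func afterLogTruncation(cur truncStateKey, newClusterVersionActive bool) truncStateKey {
	if cur == newUnreplicatedKey || newClusterVersionActive {
		return newUnreplicatedKey
	}
	return oldReplicatedKey
}

func main() {
	fmt.Println(afterLogTruncation(oldReplicatedKey, false))   // 0: stays on the legacy key
	fmt.Println(afterLogTruncation(oldReplicatedKey, true))    // 1: migrates
	fmt.Println(afterLogTruncation(newUnreplicatedKey, false)) // 1: never goes back
}
```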

Release note: None

@tbg requested a review from a team as a code owner on February 6, 2019 13:04
@tbg requested review from a team on February 6, 2019 13:04

@cockroach-teamcity (Member) commented: This change is Reviewable

@tbg requested a review from bdarnell on February 6, 2019 13:04
@bdarnell (Contributor) left a comment

Generally looks good aside from the unfinished test.

Reviewed 11 of 11 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, 29 of 30 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)


pkg/server/version_cluster_test.go, line 207 at r5 (raw file):

	oldVersionS := oldVersion.String()
	newVersionS := cluster.VersionByKey(cluster.VersionUnreplicatedRaftTruncatedState).String()
	_ = newVersionS

Obsolete?


pkg/server/version_cluster_test.go, line 215 at r5 (raw file):

		{oldVersionS, newVersionS},
		{oldVersionS, newVersionS},

Remove this blank line. This array doesn't seem to match the comment (unless I'm confused about how this test harness works).


pkg/server/version_cluster_test.go, line 241 at r5 (raw file):

	scatter := func() {
		return // HACK

Don't forget this. I'll stop reviewing this test here since it seems unfinished.


pkg/settings/cluster/cockroach_versions.go, line 74 at r5 (raw file):

	VersionLazyTxnRecord
	VersionSequencedReads
	// VersionUnreplicatedRaftTruncatedState, when active, moves the truncated

I think this list should just be the list of constant declarations; it's not the place for comments about migrations. I'd probably put this in the storage package on handleTruncatedStateBelowRaft.


pkg/storage/replica_raftstorage.go, line 498 at r5 (raw file):

	//
	// See the comment on VersionUnreplicatedRaftTruncatedState for details.
	UsesLegacyTruncatedState bool

It's confusing that you use "legacy" for this bool but "unreplicated" for the bool in the proto. That creates a link in my mind that unreplicated is the legacy mode when it's the reverse. Since we need default=false for the proto, I'd stick with "UsesUnreplicatedTruncatedState" here.


pkg/storage/batcheval/cmd_truncate_log.go, line 88 at r5 (raw file):

	//
	// TODO(tbg): think about synthesizing a valid term. Can we use the next
	// existing entry's term?

The term and index must match; raft won't be happy if you use index i but the term of index i+1. You might be able to get clever (I think the first entry after a term change is always empty, so if i+1 is non-empty it's safe to project its term backwards one entry), but that's probably not a great idea. I'm fine for now with just skipping truncations in this case unless we make further changes to truncation that make it more common.

Or if the leader has already truncated to index i, maybe we just truncate the whole cluster at that point (regardless of the index in the request)? I haven't paged in where this index is coming from; why would the leader be processing a truncation that is behind its own index?
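
For illustration only, a minimal sketch of the "project the term backwards" idea from the parenthetical above (stand-in types, not code from this PR):

```
package main

import "fmt"

// entry is a stand-in for a raft log entry (the real type lives in raftpb).
type entry struct {
	Index uint64
	Term  uint64
	Data  []byte
}

// synthesizeTerm derives a term for index i from the entry at i+1: the first
// entry of a new term is empty, so a non-empty entry at i+1 must share its
// term with index i. An empty entry at i+1 gives no safe answer.
func synthesizeTerm(next entry) (uint64, bool) {
	if len(next.Data) == 0 {
		return 0, false
	}
	return next.Term, true
}

func main() {
	fmt.Println(synthesizeTerm(entry{Index: 101, Term: 7, Data: []byte("put")})) // 7 true
	fmt.Println(synthesizeTerm(entry{Index: 101, Term: 7}))                      // 0 false
}
```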


pkg/storage/batcheval/cmd_truncate_log.go, line 108 at r5 (raw file):

	// still using the legacy truncated state key). On followers, this can in
	// principle be off either way, though in practice we expect followers to
	// have shorter logs than the leaseholder (see #34287).

s/shorter logs/logs that are no longer than/

The common case remains that everything is equal.


pkg/storage/stateloader/initial.go, line 58 at r5 (raw file):

	txnSpanGCThreshold hlc.Timestamp,
	activeVersion roachpb.Version,
	useLegacyTruncatedState bool, // see VersionUnreplicatedRaftTruncatedState

Same comment as before regarding legacy vs unreplicated.

Maybe make an enum instead of a bool so you don't have to comment at all the call sites?


pkg/storage/stateloader/stateloader.go, line 132 at r5 (raw file):

	eng engine.ReadWriter,
	state storagepb.ReplicaState,
	useLegacyTruncatedState bool, // see VersionUnreplicatedRaftTruncatedState

Same comment as in initial.go.

@tbg (Member, Author) left a comment

TFTR, PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)


pkg/server/version_cluster_test.go, line 207 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Obsolete?

Done.


pkg/server/version_cluster_test.go, line 215 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remove this blank line. This array doesn't seem to match the comment (unless I'm confused about how this test harness works).

Done. The comments were out of sync with what the test is now doing.


pkg/server/version_cluster_test.go, line 241 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Don't forget this. I'll stop reviewing this test here since it seems unfinished.

Done. Amazingly, this didn't turn up a new failure mode (that test helped me find a bunch of other errors before the PR, though).

1082 runs so far, 0 failures, over 10m40s


pkg/settings/cluster/cockroach_versions.go, line 74 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think this list should just be the list of constant declarations; it's not the place for comments about migrations. I'd probably put this in the storage package on handleTruncatedStateBelowRaft.

Are you sure that's better? I personally would love a high level explanation with all of these migrations, especially since this one is linked to from so many places.


pkg/storage/replica_raftstorage.go, line 498 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's confusing that you use "legacy" for this bool but "unreplicated" for the bool in the proto. That creates a link in my mind that unreplicated is the legacy mode when it's the reverse. Since we need default=false for the proto, I'd stick with "UsesUnreplicatedTruncatedState" here.

Done.


pkg/storage/batcheval/cmd_truncate_log.go, line 108 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/shorter logs/logs that are no longer than/

The common case remains that everything is equal.

Done.


pkg/storage/stateloader/initial.go, line 58 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Same comment as before regarding legacy vs unreplicated.

Maybe make an enum instead of a bool so you don't have to comment at all the call sites?

Good call on the enum, done.


pkg/storage/stateloader/stateloader.go, line 132 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Same comment as in initial.go.

Done.

@tbg (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)


pkg/storage/batcheval/cmd_truncate_log.go, line 88 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The term and index must match; raft won't be happy if you use index i but the term of index i+1. You might be able to get clever (I think the first entry after a term change is always empty, so if i+1 is non-empty it's safe to project its term backwards one entry), but that's probably not a great idea. I'm fine for now with just skipping truncations in this case unless we make further changes to truncation that make it more common.

Or if the leader has already truncated to index i, maybe we just truncate the whole cluster at that point (regardless of the index in the request)? I haven't paged in where this index is coming from; why would the leader be processing a truncation that is behind its own index?

The index is coming from the raft log queue, i.e. almost always from the leaseholder's own Raft instance. I agree that this is mostly a non-issue; we're not planning any changes that would make this a common occurrence (or an occurrence at all, really). Typically, with what we're planning, followers that have just received a snapshot will have a shorter log than the leaseholder, but this check doesn't care.

@bdarnell (Contributor) left a comment

Reviewed 30 of 30 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)


pkg/settings/cluster/cockroach_versions.go, line 74 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Are you sure that's better? I personally would love a high level explanation with all of these migrations, especially since this one is linked to from so many places.

This list is where I look for "which release included VersionBitArrayColumns", so I like having it compact enough that I can see that the constant is between Version2_0 and Version2_1. I agree it might be nice to use something in this package for comments like this. Maybe put it on the entry in versionsSingleton? (which already has a one-line comment with a link for each migration).


pkg/storage/stateloader/stateloader.go, line 123 at r10 (raw file):

const (
	// UseReplicatedTruncatedState means use the legacy (replicated) key.
	UseReplicatedTruncatedState TruncatedStateType = iota

Nit: TruncatedStateReplicated and TruncatedStateUnreplicated? Maybe even TruncatedStateLegacyReplicated to make it clear which one is older.
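
For reference, a minimal sketch of the suggested naming (hypothetical constants; the actual declarations live in pkg/storage/stateloader):

```
package main

import "fmt"

// TruncatedStateType replaces the bare bool so call sites read clearly.
type TruncatedStateType int

const (
	// TruncatedStateLegacyReplicated is the pre-migration, replicated key.
	TruncatedStateLegacyReplicated TruncatedStateType = iota
	// TruncatedStateUnreplicated is the new, unreplicated key.
	TruncatedStateUnreplicated
)

func main() {
	fmt.Println(TruncatedStateLegacyReplicated, TruncatedStateUnreplicated) // 0 1
}
```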

No functional changes, just preparing to introduce the shiny new
unreplicated raft truncated state.

Release note: None
When replicas can have divergent truncated states, we want to carry
out truncations even if they seem pointless to the leaseholder, since
the leaseholder might have a shorter log than other replicas.

Release note: None
The encoded range-local keys were mostly incorrect in that they were
missing the replicated/unreplicated infix. Rather than trying to keep
this comment up to date, readers should be directed to TestPrettyPrint
which now conveniently logs all types of keys and their encoding.

Release note: None
See cockroachdb#34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that
is, log entries which are already reflected in the state of the
snapshot. Fundamentally, this is because we have historically used
a replicated TruncatedState.

TruncatedState essentially tells us what the first index in the log is
(though it also includes a Term).
If the TruncatedState cannot diverge across replicas, we *must* send the
whole log in snapshots, as the first log index must match what the
TruncatedState claims it is.

The Raft log is typically, but not necessarily small. Log truncations
are driven by a queue and use a complex decision process. That decision
process can be faulty and even if it isn't, the queue could be held up.
Besides, even when the Raft log contains only very few entries, these
entries may be quite large (see SSTable ingestion during RESTORE).

All this motivates that we don't want to (be forced to) send the Raft
log as part of snapshots, and in turn we need the TruncatedState to
be unreplicated.

This change migrates the TruncatedState into unreplicated keyspace.
It does not yet allow snapshots to avoid sending the past Raft log,
but that is a relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
state into unreplicated keyspace on log truncations.

The migration works as follows:

1. at any log position, the replicas of a Range either use the new
(unreplicated) key or the old one, and exactly one of them exists.

2. When a log truncation evaluates under the new cluster version,
it initiates the migration by deleting the old key. Under the old cluster
version, it behaves like today, updating the replicated truncated state.

3. The deletion signals new code downstream of Raft and triggers a write
to the new, unreplicated, key (atomic with the deletion of the old key).

4. Future log truncations don't write any replicated data any more, but
(like before) send along the TruncatedState which is written downstream
of Raft atomically with the deletion of the log entries. This actually
uses the same code as 3.
What's new is that the truncated state needs to be verified before
replacing a previous one. If replicas disagree about their truncated
state, it's possible for replica X at FirstIndex=100 to apply a
truncated state update that sets FirstIndex to, say, 50 (proposed by a
replica with a "longer" historical log). In that case, the truncated
state update must be ignored (this is straightforward downstream-of-Raft
code).

5. When a split trigger evaluates, it seeds the RHS with the legacy
key iff the LHS uses the legacy key, and the unreplicated key otherwise.
This makes sure that the invariant that all replicas agree on the
state of the migration is upheld.

6. When a snapshot is applied, the receiver is told whether the snapshot
contains a legacy key. If not, it writes the truncated state (which is
part of the snapshot metadata) in its unreplicated version. Otherwise
it doesn't have to do anything (the range will migrate later).

The following diagram visualizes the above. Note that it abuses sequence
diagrams to get a nice layout; the vertical lines belonging to NewState
and OldState don't imply any particular ordering of operations.

```
┌────────┐                            ┌────────┐
│OldState│                            │NewState│
└───┬────┘                            └───┬────┘
    │                        Bootstrap under old version
    │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │                                     │     Bootstrap under new version
    │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │─ ─ ┐
    │    | Log truncation under old version
    │< ─ ┘
    │                                     │
    │─ ─ ┐                                │
    │    | Snapshot                       │
    │< ─ ┘                                │
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Snapshot
    │                                     │< ─ ┘
    │                                     │
    │   Log truncation under new version  │
    │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under new version
    │                                     │< ─ ┘
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under old version
    │                                     │< ─ ┘ (necessarily running new binary)
```

Source: http://www.plantuml.com/plantuml/uml/ and the following input:

@startuml
scale 600 width

OldState <--] : Bootstrap under old version
NewState <--] : Bootstrap under new version
OldState --> OldState : Log truncation under old version
OldState --> OldState : Snapshot
NewState --> NewState : Snapshot
OldState --> NewState : Log truncation under new version
NewState --> NewState : Log truncation under new version
NewState --> NewState : Log truncation under old version\n(necessarily running new binary)
@enduml

Release note: None
@tbg (Member, Author) left a comment

TFTR, comments addressed!

Thanks to megacheck/staticcheck I found a bug in the test -- it wasn't actually verifying that the old truncated state disappeared, but simply exited with a nil error. In fact, the old truncated state didn't disappear reliably because that required log truncations. I changed the test to force a log truncation instead. This has survived 5 minutes under stress without a failure, so I'm going to go ahead and merge.

bors r=bdarnell

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)


pkg/settings/cluster/cockroach_versions.go, line 74 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This list is where I look for "which release included VersionBitArrayColumns", so I like having it compact enough that I can see that the constant is between Version2_0 and Version2_1. I agree it might be nice to use something in this package for comments like this. Maybe put it on the entry in versionsSingleton? (which already has a one-line comment with a link for each migration).

Good idea, done.

@tbg (Member, Author) left a comment

bors r-

Branch shuffling wasn't done yet.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)

craig bot (Contributor) commented on Feb 11, 2019

Canceled

@tbg (Member, Author) left a comment

bors r=bdarnell

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)

craig bot pushed a commit that referenced this pull request Feb 11, 2019
34296: storage: improve message on slow Raft proposal r=petermattis a=tbg

Touches #33007.

Release note: None

34589: importccl: fix flaky test TestImportCSVStmt r=rytaft a=rytaft

`TestImportCSVStmt` tests that `IMPORT` jobs appear in a certain order
in the `system.jobs` table. Automatic statistics were causing this
test to be flaky since `CreateStats` jobs were present in the jobs
table as well, in an unpredictable order. This commit fixes the problem
by only selecting `IMPORT` jobs from the jobs table.

Fixes #34568

Release note: None

34660: storage: make RaftTruncatedState unreplicated r=bdarnell a=tbg

This isn't 100% polished yet, but generally ready for review.

----

See #34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that is,
log entries which are already reflected in the state of the snapshot.
Fundamentally, this is because we have historically used a replicated
TruncatedState.

TruncatedState essentially tells us what the first index in the log is
(though it also includes a Term). If the TruncatedState cannot diverge
across replicas, we *must* send the whole log in snapshots, as the first
log index must match what the TruncatedState claims it is.

The Raft log is typically, but not necessarily small. Log truncations are
driven by a queue and use a complex decision process. That decision process
can be faulty and even if it isn't, the queue could be held up. Besides,
even when the Raft log contains only very few entries, these entries may be
quite large (see SSTable ingestion during RESTORE).

All this motivates that we don't want to (be forced to) send the Raft log
as part of snapshots, and in turn we need the TruncatedState to be
unreplicated.

This change migrates the TruncatedState into unreplicated keyspace. It does
not yet allow snapshots to avoid sending the past Raft log, but that is a
relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
state into unreplicated keyspace on log truncations.

The migration works as follows:

1. at any log position, the replicas of a Range either use the new
(unreplicated) key or the old one, and exactly one of them exists.

2. When a log truncation evaluates under the new cluster version, it
initiates the migration by deleting the old key. Under the old cluster
version, it behaves like today, updating the replicated truncated state.

3. The deletion signals new code downstream of Raft and triggers a write to
the new, unreplicated, key (atomic with the deletion of the old key).

4. Future log truncations don't write any replicated data any more, but
(like before) send along the TruncatedState which is written downstream of
Raft atomically with the deletion of the log entries. This actually uses
the same code as 3. What's new is that the truncated state needs to be
verified before replacing a previous one. If replicas disagree about their
truncated state, it's possible for replica X at FirstIndex=100 to apply a
truncated state update that sets FirstIndex to, say, 50 (proposed by a
replica with a "longer" historical log). In that case, the truncated state
update must be ignored (this is straightforward downstream-of-Raft code).

5. When a split trigger evaluates, it seeds the RHS with the legacy key iff
the LHS uses the legacy key, and the unreplicated key otherwise. This makes
sure that the invariant that all replicas agree on the state of the
migration is upheld.

6. When a snapshot is applied, the receiver is told whether the snapshot
contains a legacy key. If not, it writes the truncated state (which is part
of the snapshot metadata) in its unreplicated version. Otherwise it doesn't
have to do anything (the range will migrate later).

The following diagram visualizes the above. Note that it abuses sequence
diagrams to get a nice layout; the vertical lines belonging to NewState and
OldState don't imply any particular ordering of operations.

```
┌────────┐                            ┌────────┐
│OldState│                            │NewState│
└───┬────┘                            └───┬────┘
    │                        Bootstrap under old version
    │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │                                     │     Bootstrap under new version
    │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │─ ─ ┐
    │    | Log truncation under old version
    │< ─ ┘
    │                                     │
    │─ ─ ┐                                │
    │    | Snapshot                       │
    │< ─ ┘                                │
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Snapshot
    │                                     │< ─ ┘
    │                                     │
    │   Log truncation under new version  │
    │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under new version
    │                                     │< ─ ┘
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under old version
    │                                     │< ─ ┘ (necessarily running new binary)
```

Release note: None

34762: distsqlplan: fix error in union planning r=jordanlewis a=jordanlewis

Previously, if 2 inputs to a UNION ALL had identical post processing
except for renders, further post processing on top of that union all
could invalidate the plan and cause errors or crashes.

Fixes #34437.

Release note (bug fix): fix a planning crash during UNION ALL operations
that had projections, filters or renders directly on top of the UNION
ALL in some cases.

34767: sql: reuse already allocated memory for the cache in a row container r=yuzefovich a=yuzefovich

Previously, we would always allocate new memory for every row that
we put in the cache of DiskBackedIndexedRowContainer and simply
discard the memory underlying the row that we remove from the cache.
Now, we're reusing that memory.

Release note: None

34779: opt: add stats to tpch xform test r=justinj a=justinj

Since we have stats by default now, this should be the default testing
mechanism. I left in tpch-no-stats since we also have that for tpcc,
just for safety.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Justin Jaffray <justin@cockroachlabs.com>
craig bot (Contributor) commented on Feb 11, 2019

Build succeeded

craig bot merged commit d0aa09e into cockroachdb:master on Feb 11, 2019
tbg added a commit to tbg/cockroach that referenced this pull request Mar 13, 2019
Assume a leaseholder wants to send a (Raft or preemptive) snapshot to a follower.
Say the leaseholder's log ranges from 100 to 200, and we assume that the size
(in bytes) of this log is 200mb. All of the log is successfully
committed and applied, and is thus reflected in the snapshot data.

Prior to this change, we would still send the 200mb of log entries along
with the snapshot, even though the snapshot itself already reflected
them.

After this change, we won't send any log entries along with the
snapshot, as sanity would suggest we would.

We were unable to make this change because up until recently, the Raft
truncated state (which dictates the first log index) was replicated and
consistency checked; this was changed in cockroachdb#34660. The migration
introduced there makes it straightforward to omit a prefix of the
log in snapshots, as done in this commit.

Somewhere down the road (19.2?) we should localize all the log
truncation decisions and simplify all this further. I suspect
that in doing so we can avoid tracking the size of the Raft log
in the first place; all we really need for this is some mechanism
that makes sure that an "idle" replica truncates its logs. With
unreplicated truncation, this becomes cheap enough to "just do".

Release note (bug fix): Remove historical log entries from Raft snapshots.
These log entries could lead to failed snapshots with a message such as:

    snapshot failed: aborting snapshot because raft log is too large
    (25911051 bytes after processing 7 of 37 entries)
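
To make the before/after concrete, a minimal sketch with assumed names (not the actual snapshot code) of which log entries a snapshot would carry:

```
package main

import "fmt"

// entriesForSnapshot: with a replicated truncated state (legacy), the
// snapshot must carry the whole log [firstIndex, appliedIndex] so the first
// index matches the TruncatedState on the receiver; with an unreplicated
// truncated state, no historical entries need to be sent at all.
func entriesForSnapshot(firstIndex, appliedIndex uint64, truncatedStateReplicated bool) (lo, hi uint64, send bool) {
	if !truncatedStateReplicated {
		return 0, 0, false
	}
	return firstIndex, appliedIndex, true
}

func main() {
	fmt.Println(entriesForSnapshot(100, 200, true))  // 100 200 true: ~200mb of entries shipped
	fmt.Println(entriesForSnapshot(100, 200, false)) // 0 0 false: nothing shipped
}
```
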
tbg added a commit to tbg/cockroach that referenced this pull request Mar 14, 2019
tbg added a commit to tbg/cockroach that referenced this pull request Mar 18, 2019
craig bot pushed a commit that referenced this pull request Mar 18, 2019
35701: storage: don't send historical Raft log with snapshots r=andreimatei,bdarnell a=tbg

Assume a leaseholder wants to send a (Raft or preemptive) snapshot to a
follower. Say the leaseholder's log ranges from 100 to 200, and we assume
that the size
(in bytes) of this log is 200mb. All of the log is successfully committed
and applied, and is thus reflected in the snapshot data.

Prior to this change, we would still send the 200mb of log entries along
with the snapshot, even though the snapshot itself already reflected them.

After this change, we won't send any log entries along with the snapshot,
as sanity would suggest we would.

We were unable to make this change because up until recently, the Raft
truncated state (which dictates the first log index) was replicated and
consistency checked; this was changed in #34660.

Somewhere down the road (19.2?) we should localize all the log truncation
decisions and simplify all this further. I suspect that in doing so we can
avoid tracking the size of the Raft log in the first place; all we really
need for this is some mechanism that makes sure that an "idle" replica
truncates its logs. With unreplicated truncation, this becomes cheap enough
to "just do".

Release note (bug fix): Remove historical log entries from Raft snapshots.
These log entries could lead to failed snapshots with a message such as:

    snapshot failed: aborting snapshot because raft log is too large
    (25911051 bytes after processing 7 of 37 entries)

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>