storage: snapshots should not include Raft log #34287

Closed
tbg opened this issue Jan 28, 2019 · 4 comments

tbg (Member) commented Jan 28, 2019

In #34269, we had an issue with snapshots being rejected because they contained an overly large Raft log. The Raft log can get large if there is a problem with the log truncation queue. In effect, by having to reject snapshots based on log size, we create an unfortunate dependency between the two queues, where a problem in one causes problems in the other.

A Raft snapshot contains the replicated data as of some log index N. In addition to some more metadata, it can contain past and "future" log entries. Future log entries don't have to be sent, though it makes sense to send them along, at least up to some size, so that the follower catches up faster post-snapshot. The past log entries are harder to justify: the snapshot already reflects them. Their only utility is that, should the node receiving the snapshot step up to be the leader very soon, it could catch up other followers who still need recent log entries. This isn't a good justification, as that is a very unusual scenario.
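For intuition, the size cap on "future" entries could look like the following budget loop (a sketch with a stand-in `Entry` type, not CockroachDB's actual snapshot code):

```go
package main

import "fmt"

// Entry is a stand-in for a Raft log entry payload.
type Entry struct {
	Index uint64
	Data  []byte
}

// entriesWithinBudget returns the prefix of "future" entries that fits in
// maxBytes, so the follower catches up faster post-snapshot without the
// sender shipping an unbounded log.
func entriesWithinBudget(entries []Entry, maxBytes uint64) []Entry {
	var total uint64
	for i, e := range entries {
		total += uint64(len(e.Data))
		if total > maxBytes {
			return entries[:i]
		}
	}
	return entries
}

func main() {
	es := []Entry{{1, make([]byte, 400)}, {2, make([]byte, 400)}, {3, make([]byte, 400)}}
	fmt.Println(len(entriesWithinBudget(es, 1000))) // 2: the third entry busts the budget
}
```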

I think we include the past Raft log entries for technical reasons only. The RaftTruncatedState (which essentially remembers the first index in the log) is a replicated key for historical reasons:

cockroach/pkg/keys/keys.go

Lines 908 to 911 in a05ee7b

```go
// RaftTruncatedStateKey returns a system-local key for a RaftTruncatedState.
func (b RangeIDPrefixBuf) RaftTruncatedStateKey() roachpb.Key {
	return append(b.replicatedPrefix(), LocalRaftTruncatedStateSuffix...)
}
```

and this forces us to include it in the snapshot and also to send all log entries that the truncated state promises are still around.

But this limitation can be refactored away: we make the truncated state key unreplicated, at which point replicas are free to truncate their logs to whatever past index they deem possible. Ordinarily, truncations would still be triggered by the Raft log queue, but we'd now additionally be free to send out snapshots that contain no past log entries, and to synthesize a corresponding truncated state.
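For illustration, the unreplicated variant could mirror the constructor above, swapping the replicated prefix for an unreplicated one. This is only a sketch: the `unreplicatedPrefix` helper is assumed by symmetry with `replicatedPrefix`, and the existing replicated constructor would need a distinct (legacy) name to avoid a collision.

```go
// RaftTruncatedStateKey returns an unreplicated, system-local key for the
// RaftTruncatedState. Because this key is not part of the replicated state
// machine, replicas may hold different values for it without tripping the
// consistency checker. (Sketch; helper names are assumptions.)
func (b RangeIDPrefixBuf) RaftTruncatedStateKey() roachpb.Key {
	return append(b.unreplicatedPrefix(), LocalRaftTruncatedStateSuffix...)
}
```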

To migrate the key out of the replicated keyspace, we start using the new key once a corresponding cluster version is reached (atomically deleting the old key during queue-triggered log truncations, which go through the Raft log). Reading the truncated state queries both locations and uses the maximum. This doesn't guarantee that the key won't persist indefinitely on some ranges that never see a log truncation, but that doesn't matter; the code that reads it can go away one release later.
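A minimal, self-contained sketch of that read rule (the `TruncatedState` type here is a simplified stand-in, not the actual roachpb type):

```go
package main

import "fmt"

// TruncatedState is a simplified stand-in: the first index the log still
// promises to contain, plus that entry's term.
type TruncatedState struct {
	Index, Term uint64
}

// loadTruncatedState models the migration read path: consult both the
// legacy (replicated) and new (unreplicated) locations and use whichever
// claims the higher index. A nil argument models an absent key.
func loadTruncatedState(legacy, unreplicated *TruncatedState) TruncatedState {
	var result TruncatedState
	if legacy != nil {
		result = *legacy
	}
	if unreplicated != nil && unreplicated.Index > result.Index {
		result = *unreplicated
	}
	return result
}

func main() {
	legacy := &TruncatedState{Index: 10, Term: 3}
	migrated := &TruncatedState{Index: 50, Term: 5}
	fmt.Println(loadTruncatedState(legacy, migrated)) // {50 5}: the new key wins
}
```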

Touches #31947

@tbg tbg added this to Incoming in KV via automation Jan 28, 2019
tbg (Member, Author) commented Jan 28, 2019

cc @bdarnell in case I'm missing something that forces the truncated state to be replicated.

bdarnell (Contributor) commented

The truncated state was replicated on the theory that

  1. It would be good to limit the number of ways in which replicas of a range can differ from each other. Questionable in this case, since the tail of the log is already inconsistent across replicas, but it permits the consistency checker to validate that truncation is happening consistently.
  2. Truncation should generally be driven by the leader, because it has knowledge of followers' log positions, so it might as well be replicated.

I think this was one of those decisions that we had made "temporarily" and intended to revisit, but never did until the current behavior was pretty well entrenched. I think making the change now is probably a good idea if it's easier than making the mechanisms that limit the size of the raft log more robust.

andreimatei (Contributor) commented

cc @nvanbenschoten with whom I just happened to be discussing the replicated nature of the log truncation.

tbg added a commit to tbg/cockroach that referenced this issue Feb 6, 2019
See cockroachdb#34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that
is, log entries which are already reflected in the state of the
snapshot. Fundamentally, this is because we have historically used
a replicated TruncatedState.

TruncatedState essentially tells us what the first index in the log is
(though it also includes a Term).
If the TruncatedState cannot diverge across replicas, we *must* send the
whole log in snapshots, as the first log index must match what the
TruncatedState claims it is.

The Raft log is typically, but not necessarily, small. Log truncations
are driven by a queue and use a complex decision process. That decision
process can be faulty and even if it isn't, the queue could be held up.
Besides, even when the Raft log contains only a few entries, these
entries may be quite large (see SSTable ingestion during RESTORE).

All of this motivates not being forced to send the Raft log as part
of snapshots, which in turn requires the TruncatedState to be
unreplicated.

This change migrates the TruncatedState into unreplicated keyspace.
It does not yet allow snapshots to avoid sending the past Raft log,
but that is a relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
state into unreplicated keyspace on log truncations.

The migration works as follows:

1. At any log position, the replicas of a Range either use the new
(unreplicated) key or the old one, and exactly one of them exists.

2. When a log truncation evaluates under the new cluster version,
it initiates the migration by deleting the old key. Under the old cluster
version, it behaves like today, updating the replicated truncated state.

3. The deletion signals new code downstream of Raft and triggers a write
to the new, unreplicated key (atomic with the deletion of the old key).

4. Future log truncations don't write any replicated data any more, but
(like before) send along the TruncatedState which is written downstream
of Raft atomically with the deletion of the log entries. This actually
uses the same code as 3.
What's new is that the truncated state needs to be verified before
replacing a previous one. If replicas disagree about their truncated
state, it's possible for replica X at FirstIndex=100 to apply a
truncated state update that sets FirstIndex to, say, 50 (proposed by a
replica with a "longer" historical log). In that case, the truncated
state update must be ignored (this is straightforward downstream-of-Raft
code; a sketch follows this list).

5. When a split trigger evaluates, it seeds the RHS with the legacy
key iff the LHS uses the legacy key, and the unreplicated key otherwise.
This makes sure that the invariant that all replicas agree on the
state of the migration is upheld.

6. When a snapshot is applied, the receiver is told whether the snapshot
contains a legacy key. If not, it writes the truncated state (which is
part of the snapshot metadata) in its unreplicated version. Otherwise
it doesn't have to do anything (the range will migrate later).
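A minimal sketch of the guard from step 4 (simplified stand-in types, not the actual CockroachDB code):

```go
package main

import "fmt"

// TruncatedState is a simplified stand-in: the first index the log still
// contains, and that entry's term.
type TruncatedState struct {
	Index, Term uint64
}

// maybeApplyTruncatedState accepts a proposed truncated state only if it
// moves FirstIndex forward; a regression (proposed by a replica with a
// "longer" historical log) is ignored, per step 4 above.
func maybeApplyTruncatedState(current, proposed TruncatedState) (TruncatedState, bool) {
	if proposed.Index <= current.Index {
		return current, false // e.g. FirstIndex=100 must not regress to 50
	}
	return proposed, true
}

func main() {
	cur := TruncatedState{Index: 100, Term: 7}
	_, applied := maybeApplyTruncatedState(cur, TruncatedState{Index: 50, Term: 6})
	fmt.Println(applied) // false: the regressing update is ignored
}
```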

The following diagram visualizes the above. Note that it abuses sequence
diagrams to get a nice layout; the vertical lines belonging to NewState
and OldState don't imply any particular ordering of operations.

```
┌────────┐                            ┌────────┐
│OldState│                            │NewState│
└───┬────┘                            └───┬────┘
    │                        Bootstrap under old version
    │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │                                     │     Bootstrap under new version
    │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │─ ─ ┐
    │    | Log truncation under old version
    │< ─ ┘
    │                                     │
    │─ ─ ┐                                │
    │    | Snapshot                       │
    │< ─ ┘                                │
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Snapshot
    │                                     │< ─ ┘
    │                                     │
    │   Log truncation under new version  │
    │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under new version
    │                                     │< ─ ┘
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under old version
    │< ─ ┘ (necessarily running new binary)
```

Source: http://www.plantuml.com/plantuml/uml/ and the following input:

@startuml
scale 600 width

OldState <--] : Bootstrap under old version
NewState <--] : Bootstrap under new version
OldState --> OldState : Log truncation under old version
OldState --> OldState : Snapshot
NewState --> NewState : Snapshot
OldState --> NewState : Log truncation under new version
NewState --> NewState : Log truncation under new version
NewState --> NewState : Log truncation under old version\n(necessarily running new binary)
@enduml

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Feb 11, 2019 (same commit message as above)
tbg added a commit to tbg/cockroach that referenced this issue Feb 11, 2019 (same commit message as above)
tbg added a commit to tbg/cockroach that referenced this issue Feb 11, 2019 (same commit message as above)
craig bot pushed a commit that referenced this issue Feb 11, 2019
34296: storage: improve message on slow Raft proposal r=petermattis a=tbg

Touches #33007.

Release note: None

34589: importccl: fix flaky test TestImportCSVStmt r=rytaft a=rytaft

`TestImportCSVStmt` tests that `IMPORT` jobs appear in a certain order
in the `system.jobs` table. Automatic statistics were causing this
test to be flaky since `CreateStats` jobs were present in the jobs
table as well, in an unpredictable order. This commit fixes the problem
by only selecting `IMPORT` jobs from the jobs table.

Fixes #34568

Release note: None

34660: storage: make RaftTruncatedState unreplicated r=bdarnell a=tbg

This isn't 100% polished yet, but generally ready for review.

----

See #34287. (The PR description repeats the commit message quoted above.)

Release note: None

34762: distsqlplan: fix error in union planning r=jordanlewis a=jordanlewis

Previously, if 2 inputs to a UNION ALL had identical post processing
except for renders, further post processing on top of that union all
could invalidate the plan and cause errors or crashes.

Fixes #34437.

Release note (bug fix): fix a planning crash during UNION ALL operations
that had projections, filters or renders directly on top of the UNION
ALL in some cases.

34767: sql: reuse already allocated memory for the cache in a row container r=yuzefovich a=yuzefovich

Previously, we would always allocate new memory for every row that
we put in the cache of DiskBackedIndexedRowContainer and simply
discard the memory underlying the row that we remove from the cache.
Now, we're reusing that memory.

Release note: None

34779: opt: add stats to tpch xform test r=justinj a=justinj

Since we have stats by default now, this should be the default testing
mechanism. I left in tpch-no-stats since we also have that for tpcc,
just for safety.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Justin Jaffray <justin@cockroachlabs.com>
@tbg tbg self-assigned this Feb 13, 2019
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Feb 13, 2019
@tbg tbg added this to the 2.2 milestone Feb 13, 2019
tbg (Member, Author) commented Mar 19, 2019

Fixed in #35701

@tbg tbg closed this as completed Mar 19, 2019
KV automation moved this from Incoming to Closed Mar 19, 2019