storage: make RaftTruncatedState unreplicated #34660

Merged 5 commits on Feb 11, 2019

Commits on Feb 11, 2019

  1. storage: rename RaftTruncatedState -> LegacyRaftTruncatedState

    No functional changes, just preparing to introduce the shiny new
    unreplicated raft truncated state.
    
    Release note: None
    tbg committed Feb 11, 2019
    d3aae3e
  2. batcheval: add reminder when implementing divergent truncated states

    When replicas can have divergent truncated states, we want to carry
    out truncations even if they seem pointless to the leaseholder, since
    the leaseholder might have a shorter log than other replicas.
    
    Release note: None
    tbg committed Feb 11, 2019
    4ed9e5b
  3. keys: remove misleading overview over encoded keys

    The encoded range-local keys were mostly incorrect in that they were
    missing the replicated/unreplicated infix. Rather than trying to keep
    this comment up to date, readers should be directed to TestPrettyPrint
    which now conveniently logs all types of keys and their encoding.
    
    Release note: None
    tbg committed Feb 11, 2019
    2990b96
  4. storage: add Store.VisitReplicas

    Release note: None
    tbg committed Feb 11, 2019
    3878b12
  5. storage: make RaftTruncatedState unreplicated

    See cockroachdb#34287.
    
    Today, Raft (or preemptive) snapshots include the past Raft log, that
    is, log entries which are already reflected in the state of the
    snapshot. Fundamentally, this is because we have historically used
    a replicated TruncatedState.
    
    TruncatedState essentially tells us what the first index in the log is
    (though it also includes a Term).
    If the TruncatedState cannot diverge across replicas, we *must* send the
    whole log in snapshots, as the first log index must match what the
    TruncatedState claims it is.
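
To make the relationship between the truncated state and the first log index concrete, here is a minimal sketch. It is not the actual CockroachDB code: the `truncatedState` type, `firstIndex`, and `entriesForSnapshot` are hypothetical names, and the sketch assumes that `Index` names the last entry removed from the log, so that the first remaining index is `Index + 1`.

```go
package main

import "fmt"

// truncatedState is a simplified, hypothetical stand-in for the
// TruncatedState described above.
type truncatedState struct {
	Index uint64 // assumption: index of the last entry removed from the log
	Term  uint64 // term of that entry
}

// firstIndex returns the first index still present in the log, assuming
// Index names the last truncated entry.
func firstIndex(ts truncatedState) uint64 {
	return ts.Index + 1
}

// entriesForSnapshot returns the half-open index range [lo, hi) of log
// entries that would have to accompany a snapshot taken at appliedIndex.
func entriesForSnapshot(ts truncatedState, appliedIndex uint64, replicated bool) (lo, hi uint64) {
	hi = appliedIndex + 1
	if replicated {
		// The receiver's first index must match the shared truncated state,
		// so everything from firstIndex(ts) onward has to be sent.
		return firstIndex(ts), hi
	}
	// With a per-replica truncated state the receiver could start its log
	// right after the snapshot, so no past entries would be needed.
	return hi, hi
}

func main() {
	ts := truncatedState{Index: 99, Term: 5}
	lo, hi := entriesForSnapshot(ts, 120, true)
	fmt.Printf("replicated: send %d entries\n", hi-lo)
	lo, hi = entriesForSnapshot(ts, 120, false)
	fmt.Printf("unreplicated: send %d entries\n", hi-lo)
}
```

The `replicated == false` branch only illustrates what an unreplicated truncated state would make possible; it is not something the commit itself implements.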
    
    The Raft log is typically, but not necessarily, small. Log truncations
    are driven by a queue and use a complex decision process. That decision
    process can be faulty, and even if it isn't, the queue could be held up.
    Besides, even when the Raft log contains only very few entries, these
    entries may be quite large (see SSTable ingestion during RESTORE).
    
    All of this means we don't want to be forced to send the Raft log as
    part of snapshots, which in turn requires the TruncatedState to be
    unreplicated.
    
    This change migrates the TruncatedState into unreplicated keyspace.
    It does not yet allow snapshots to avoid sending the past Raft log,
    but that is a relatively straightforward follow-up change.
    
    VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
    state into unreplicated keyspace on log truncations.
    
    The migration works as follows:
    
    1. At any log position, the replicas of a Range either use the new
    (unreplicated) key or the old one, and exactly one of them exists.
    
    2. When a log truncation evaluates under the new cluster version,
    it initiates the migration by deleting the old key. Under the old cluster
    version, it behaves like today, updating the replicated truncated state.
    
    3. The deletion signals new code downstream of Raft and triggers a write
    to the new, unreplicated, key (atomic with the deletion of the old key).
    
    4. Future log truncations don't write any replicated data any more, but
    (like before) send along the TruncatedState which is written downstream
    of Raft atomically with the deletion of the log entries. This actually
    uses the same code as step 3.
    What's new is that the truncated state needs to be verified before
    replacing a previous one. If replicas disagree about their truncated
    state, it's possible for replica X at FirstIndex=100 to apply a
    truncated state update that sets FirstIndex to, say, 50 (proposed by a
    replica with a "longer" historical log). In that case, the truncated
    state update must be ignored (this is straightforward downstream-of-Raft
    code).
    
    5. When a split trigger evaluates, it seeds the RHS with the legacy
    key iff the LHS uses the legacy key, and the unreplicated key otherwise.
    This makes sure that the invariant that all replicas agree on the
    state of the migration is upheld.
    
    6. When a snapshot is applied, the receiver is told whether the snapshot
    contains a legacy key. If not, it writes the truncated state (which is
    part of the snapshot metadata) in its unreplicated version. Otherwise
    it doesn't have to do anything (the range will migrate later).
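
Steps 2 through 5 above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the code from this change: `replica`, `truncation`, `evaluate`, `apply`, and `seedRHS` are hypothetical names, the two keys are modeled as nilable pointers, and a proposed state counts as stale when its index is not strictly ahead of the current one.

```go
package main

import "fmt"

type truncatedState struct{ Index, Term uint64 }

// replica models the two possible homes of the truncated state.
// Invariant (step 1): exactly one of the two fields is non-nil.
type replica struct {
	legacy       *truncatedState // old, replicated key
	unreplicated *truncatedState // new, unreplicated key
}

// truncation is what a log truncation hands to the code downstream of Raft.
type truncation struct {
	proposed     truncatedState
	writeLegacy  bool // old behavior: update the replicated key
	deleteLegacy bool // step 2: the deletion signals the migration
}

// evaluate models step 2. Once a range has migrated it stays migrated, even
// if the truncation evaluates under the old cluster version.
func evaluate(r *replica, newVersion bool, proposed truncatedState) truncation {
	if r.legacy != nil && !newVersion {
		return truncation{proposed: proposed, writeLegacy: true}
	}
	return truncation{proposed: proposed, deleteLegacy: r.legacy != nil}
}

// apply models steps 3 and 4. It reports whether the proposed state was
// adopted; a state that is not ahead of the current one is ignored.
func apply(r *replica, t truncation) bool {
	if t.writeLegacy {
		p := t.proposed
		r.legacy = &p
		return true
	}
	cur := r.unreplicated
	if t.deleteLegacy {
		cur, r.legacy = r.legacy, nil
	}
	next, adopted := t.proposed, true
	if cur != nil && next.Index <= cur.Index {
		next, adopted = *cur, false // keep what we have
	}
	r.unreplicated = &next
	return adopted
}

// seedRHS models step 5: the right-hand side of a split uses the legacy key
// iff the left-hand side does.
func seedRHS(lhs *replica, ts truncatedState) *replica {
	if lhs.legacy != nil {
		return &replica{legacy: &ts}
	}
	return &replica{unreplicated: &ts}
}

func main() {
	r := &replica{legacy: &truncatedState{Index: 10, Term: 1}}
	apply(r, evaluate(r, true, truncatedState{Index: 30, Term: 1}))
	fmt.Println("migrated:", r.legacy == nil, "index:", r.unreplicated.Index)
	adopted := apply(r, evaluate(r, true, truncatedState{Index: 25, Term: 1}))
	fmt.Println("stale update adopted:", adopted)
}
```

The model keeps the old value when a stale update arrives during the migration itself, so that exactly one key exists afterwards; whether the real code handles that corner the same way is an assumption.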
    
    The following diagram visualizes the above. Note that it abuses sequence
    diagrams to get a nice layout; the vertical lines belonging to NewState
    and OldState don't imply any particular ordering of operations.
    
    ```
    ┌────────┐                            ┌────────┐
    │OldState│                            │NewState│
    └───┬────┘                            └───┬────┘
        │                        Bootstrap under old version
        │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
        │                                     │
        │                                     │     Bootstrap under new version
        │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
        │                                     │
        │─ ─ ┐
        │    | Log truncation under old version
        │< ─ ┘
        │                                     │
        │─ ─ ┐                                │
        │    | Snapshot                       │
        │< ─ ┘                                │
        │                                     │
        │                                     │─ ─ ┐
        │                                     │    | Snapshot
        │                                     │< ─ ┘
        │                                     │
        │   Log truncation under new version  │
        │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
        │                                     │
        │                                     │─ ─ ┐
        │                                     │    | Log truncation under new version
        │                                     │< ─ ┘
        │                                     │
        │                                     │─ ─ ┐
        │                                     │    | Log truncation under old version
        │                                     │< ─ ┘ (necessarily running new binary)
    ```
    
    Source: http://www.plantuml.com/plantuml/uml/ and the following input:
    
    @startuml
    scale 600 width
    
    OldState <--] : Bootstrap under old version
    NewState <--] : Bootstrap under new version
    OldState --> OldState : Log truncation under old version
    OldState --> OldState : Snapshot
    NewState --> NewState : Snapshot
    OldState --> NewState : Log truncation under new version
    NewState --> NewState : Log truncation under new version
    NewState --> NewState : Log truncation under old version\n(necessarily running new binary)
    @enduml
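
The transitions in the diagram can also be read as a two-state machine. The sketch below encodes exactly the arrows drawn above and nothing more; the type and constant names are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

type state int

const (
	oldState state = iota // truncated state lives under the legacy key
	newState              // truncated state lives under the unreplicated key
)

type event int

const (
	logTruncationOldVersion event = iota
	logTruncationNewVersion
	snapshot
)

// bootstrap returns the state a range starts in, depending on the cluster
// version active at bootstrap time.
func bootstrap(newVersion bool) state {
	if newVersion {
		return newState
	}
	return oldState
}

// next returns the state after an event. The only arrow that changes state
// is a log truncation under the new version on a range still in oldState.
func next(s state, e event) state {
	if s == oldState && e == logTruncationNewVersion {
		return newState
	}
	return s
}

func main() {
	s := bootstrap(false)
	for _, e := range []event{logTruncationOldVersion, snapshot, logTruncationNewVersion, logTruncationOldVersion} {
		s = next(s, e)
	}
	fmt.Println("ended in newState:", s == newState)
}
```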
    
    Release note: None
    tbg committed Feb 11, 2019
    d0aa09e