kvserver: sep-raft-log: make log application stand-alone #75729

Open · tbg opened this issue Jan 31, 2022 · 7 comments
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)), T-kv (KV Team)

tbg (Member) commented Jan 31, 2022

Is your feature request related to a problem? Please describe.

Today, there is no "library-style" way of crafting a CockroachDB log of commands.

We can in principle craft individual commands by evaluating requests (the
batcheval package), but below that level you need a *Replica, which strings
together the raft log and the apply loop.

We have long felt that, in the abstract, we could benefit from better
modularization here, but now we are starting to see this come up in projects we
are working on, for example:

// ReplicasStorage does interpret the raft state (all the unreplicated
// range-ID local key names prefixed by Raft), and the RangeTombstoneKey. This
// is necessary for it to be able to maintain invariants spanning the raft log
// and the state machine (related to raft log truncation, replica lifetime
// etc.), including reapplying raft log entries on restart, to the state
// machine. All accesses (read or write) to the raft log and RangeTombstoneKey
// must happen via ReplicasStorage. ReplicasStorage does not by itself apply
// committed raft log entries to the state machine in a running system -- this
// is because state machine application in a running system has complex
// data-structure side-effects that are outside the scope of ReplicasStorage.

and similarly the Loss of Quorum Recovery tooling could benefit from a way to interpret the raft log at the right level of abstraction.

Describe the solution you'd like

Continue the work begun by #39254
and further extract the logic pertaining to entry application so that it can be
used to apply log entries outside of the context of a fully instantiated
Replica. I think most of the work will be to wean the apply code off the
many direct accesses to replica.mu and the associated fields.

As an acceptance criterion, we should be able to write a unit test that sets up
a test cluster, appends some commands to the log, stops the cluster and
subsequently re-applies the log entries to an empty state, arriving at an
identical state machine.
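
For concreteness, such a test might look roughly like the sketch below. Everything marked hypothetical does not exist today; the cluster and require helpers are the usual test utilities.

```go
// Sketch only: ReopenStore, ReplayLog, and Checksum are hypothetical names for
// the standalone-apply API this issue asks for.
func TestReapplyRaftLogStandalone(t *testing.T) {
	ctx := context.Background()

	// Run a cluster (it would need an on-disk store for this to work) and
	// append a few commands to the scratch range's raft log.
	tc := testcluster.StartTestCluster(t, 1, base.TestClusterArgs{})
	k := tc.ScratchRange(t)
	require.NoError(t, tc.Server(0).DB().Put(ctx, k, "some value"))
	desc, err := tc.LookupRange(k)
	require.NoError(t, err)
	tc.Stopper().Stop(ctx) // stop the cluster, keeping the on-disk store

	// Re-open the stopped store's raft log, re-apply its entries to an empty
	// state machine, and verify the result matches the original state.
	logEng := ReopenStore(t)                        // hypothetical
	emptyEng := storage.NewDefaultInMemForTesting() // fresh, empty state
	defer emptyEng.Close()
	require.NoError(t, ReplayLog(ctx, logEng, emptyEng, desc.RangeID)) // hypothetical
	require.Equal(t,
		Checksum(t, logEng, desc.RangeID),   // hypothetical
		Checksum(t, emptyEng, desc.RangeID), // hypothetical
	)
}
```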

Ideally, we also get a testing win by allowing tests to be set up with bespoke
raft logs that exercise hand-crafted sequences of commands; this is very difficult
today and essentially doesn't happen for that very reason.

Note: inclusion of splits/merges will add complexity. We may not need this in
the first pass but should think about how this could work in the future.

Epic: CRDB-220
Jira issue: CRDB-12807

@tbg tbg added the C-enhancement label Jan 31, 2022
@tbg tbg added this to Incoming in Replication via automation Jan 31, 2022
tbg (Member, Author) commented Feb 4, 2022

[Whiteboard photo: 20220204_140139]

tbg (Member, Author) commented Feb 4, 2022

I'm not sure this photo of my whiteboard is helpful for anyone but me, but I spent a little bit of time looking into what's needed here. I think there are a few "themes":

  • the "log encoding" portion on the right-hand side is done manually in multiple places today. A little abstraction over this would be useful, and it should be straightforward to write as well.
  • the apply side (left-hand side) is already nicely modular thanks to prior work by @nvanbenschoten (all the way back in '19!), but many of the pieces of code that would need to be reusable aren't quite there yet, so we need to pull out things like checkForcedError to make them re-usable. I expect most of these changes to be surgical.
  • I don't think we're too far off on the evaluate side either. Assuming everything the commands we want to manufacture programmatically need is populated in the batcheval.EvalContext, we will get a RaftCommand back, and making the lease/closed timestamp code reusable shouldn't be hard either, especially since this is all much easier when operating offline without concurrency or in-memory state (see the sketch after this list). We don't care much about the eval side outside of testing, so supporting just a subset of commands when we need them is good enough.
  • to support splits and merges, we would similarly need to make all the disk-touching components of these operations' side effects reusable, and probably restructure the code a bit to make sure there aren't two parallel control flows that can diverge easily. It's unclear whether we need to support those anytime soon; just putting it out there.
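
To make the evaluate-side bullet concrete, the offline flow would look roughly like the sketch below. Signatures are approximate, and the final encoding helpers are hypothetical; this is not working code today, just the shape of it.

```go
// manufactureEntryPayload sketches "offline" command creation: evaluate one
// request against a throwaway batch (no *Replica), wrap the result in a
// RaftCommand, and encode it the way the proposal path would.
func manufactureEntryPayload(ctx context.Context, eng storage.Engine) ([]byte, error) {
	req := &roachpb.PutRequest{
		RequestHeader: roachpb.RequestHeader{Key: roachpb.Key("a")},
		Value:         roachpb.MakeValueFromString("v"),
	}
	batch := eng.NewBatch()
	defer batch.Close()
	resp := &roachpb.PutResponse{}
	if _, err := batcheval.Put(ctx, batch, batcheval.CommandArgs{
		EvalCtx: (&batcheval.MockEvalCtx{}).EvalContext(), // would need realistic state
		Header:  roachpb.Header{Timestamp: hlc.Timestamp{WallTime: 1}},
		Args:    req,
	}, resp); err != nil {
		return nil, err
	}
	cmd := kvserverpb.RaftCommand{
		WriteBatch: &kvserverpb.WriteBatch{Data: batch.Repr()},
	}
	data, err := protoutil.Marshal(&cmd)
	if err != nil {
		return nil, err
	}
	// encodeRaftCommandPayload and makeIDKey are hypothetical: version byte +
	// command ID + marshaled RaftCommand, as done today inside (*Replica).propose.
	return encodeRaftCommandPayload(makeIDKey(), data), nil
}
```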

tbg added a commit to tbg/cockroach that referenced this issue Feb 5, 2022
I'm starting to type code for cockroachdb#75729 and it's clear immediately
that we shouldn't have these protos in the `kvserver` package
as this will make dependency cycles hard to avoid.

We've long introduced the `kvserverpb` package, but simply
didn't pull all the protos into it.

This commit moves `raft.proto` to `kvserverpb`.

There's still a bit of protobuf left in `api.proto`, but
that can be handled separately.

Release note: None
craig bot pushed a commit that referenced this issue Feb 5, 2022
76113: kvserver: move raft.proto to kvserverpb r=erikgrinaker a=tbg

kvserver: move raft.proto to kvserverpb

I'm starting to type code for #75729 and it's clear immediately
that we shouldn't have these protos in the `kvserver` package
as this will make dependency cycles hard to avoid.

We've long introduced the `kvserverpb` package, but simply
didn't pull all the protos into it.

This commit moves `raft.proto` to `kvserverpb`.

There's still a bit of protobuf left in `api.proto`, but
that can be handled separately (#76114)

Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Feb 6, 2022
This is similar to cockroachdb#76113 in that it prevents import cycles in cockroachdb#75729.
We don't want the code moved here to necessarily stay in `kvserverbase`,
but that is a good place for now.

Release note: None
craig bot pushed a commit that referenced this issue Feb 6, 2022
76120: kvserver: move raft version encoding to `kvserverbase` r=erikgrinaker a=tbg

This is similar to #76113 in that it prevents import cycles in #75729.
We don't want the code moved here to necessarily stay in `kvserverbase`,
but that is a good place for now.

Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Feb 6, 2022
This commit introduces a new package `raftlog`. Aspirationally, this
will at some point become the de facto way of encapsulating the raft log
encoding and may play a role in programmatically constructing and
inspecting raft logs (e.g.  for testing).

For now, we introduce two concepts:

- `raftlog.Entry`, which wraps a `raftpb.Entry` and all of the
  information derived from it, such as the command ID, the
`kvserverpb.RaftCommand`, the configuration change (if any), etc.

- `raftlog.Iterator`, a way to iterate over the raft log in terms of
  `raftlog.Entry` (as opposed to `raftpb.Entry` which requires lots of
manual processing).

Both are then applied across the codebase, concretely:

- `loqrecovery` is simplified via `raftlog.Iterator` to pull commit
  triggers out of the raft log.
- debug pretty-printing is simpler thanks to use of `raftlog.Entry`.
- `decodedRaftEntry` is now structurally a `raftpb.Entry`, and again
  lots of manual custom unmarshaling code evaporates.

It's currently difficult to create "interesting" raft log entries if
trying to stay away from manual population of large data structures
(which is prone to rotting), so there's zero unit testing of
`raftlog.Iterator`. However, the code is not new; it was deduplicated
from a few places and is now tested through all of them, so I don't
feel too bad about it. I still think it is a priority to be
able to "comfortably" create at least "simple" raft logs, meaning we
need to be able to string together `batcheval` and entry creation at
least in a rudimentary fashion. I intend to look into this next and
add comprehensive unit tests for `raftlog.{Entry,Iterator}`.

Touches cockroachdb#75729.

Release note: None
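
Roughly, the intended usage shape is the sketch below; the method names and signatures here are approximate and may differ from what ultimately landed in the `raftlog` package.

```go
// Walk the raft log of a range and print each entry's command ID and lease
// index, letting raftlog.Entry do the decoding that used to be manual.
it, err := raftlog.NewIterator(rangeID, eng) // signature approximate
if err != nil {
	return err
}
defer it.Close()
for ok, err := it.SeekGE(lo); ; ok, err = it.Next() {
	if err != nil {
		return err
	}
	if !ok {
		break
	}
	e := it.Entry() // raftpb.Entry plus the decoded kvserverpb.RaftCommand,
	// command ID, conf change (if any), etc.
	fmt.Println(e.ID, e.Cmd.MaxLeaseIndex)
}
```
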
tbg (Member, Author) commented Feb 6, 2022

The right side of the picture ("Log encoding") is librarified, at least on the read path, in #76126. We still "manually" create these proposals when we run config changes:

```go
// Coordinate proposing the command to etcd/raft.
if crt := p.command.ReplicatedEvalResult.ChangeReplicas; crt != nil {
	// Flush any previously batched (non-conf change) proposals to
	// preserve the correct ordering of proposals. Later proposals
	// will start a new batch.
	if err := proposeBatch(raftGroup, b.p.replicaID(), ents); err != nil {
		firstErr = err
		continue
	}
	ents = ents[len(ents):]
	confChangeCtx := kvserverpb.ConfChangeContext{
		CommandID: string(p.idKey),
		Payload:   p.encodedCommand,
	}
	encodedCtx, err := protoutil.Marshal(&confChangeCtx)
	if err != nil {
		firstErr = err
		continue
	}
	cc, err := crt.ConfChange(encodedCtx)
	if err != nil {
		firstErr = err
		continue
	}
	if err := raftGroup.ProposeConfChange(
		cc,
	); err != nil && !errors.Is(err, raft.ErrProposalDropped) {
		// Silently ignore dropped proposals (they were always silently
```
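
A librarified version might be as simple as pulling those few steps into a helper along the lines below. This is purely hypothetical, nothing like it exists today, but it shows how small the "log encoding" abstraction for conf changes could be.

```go
// encodeConfChange is a hypothetical helper wrapping the manual steps above:
// bundle the command ID and the already-encoded command into a
// ConfChangeContext and turn the trigger into a raft conf change.
func encodeConfChange(
	idKey kvserverbase.CmdIDKey,
	encodedCommand []byte,
	crt *roachpb.ChangeReplicasTrigger,
) (raftpb.ConfChangeI, error) {
	ccCtx := kvserverpb.ConfChangeContext{
		CommandID: string(idKey),
		Payload:   encodedCommand,
	}
	encodedCtx, err := protoutil.Marshal(&ccCtx)
	if err != nil {
		return nil, err
	}
	return crt.ConfChange(encodedCtx)
}
```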

tbg added a commit to tbg/cockroach that referenced this issue Feb 6, 2022
This commit introduces `TestCreateManualProposal` which attempts to
carry out an "offline" (i.e. `Replica`-less) version of evaluation &
entry creation (i.e. the middle and right-hand side [here]). We start
with a `*roachpb.PutRequest` and end up with a payload fit for a
`(raftpb.Entry).Data`.

This is helpful since it provides an easily digestible "life of a
proposal" read-along, but more importantly represents a step towards
establishing clearer interfaces for the various stages of a request -
evaluation, replication, and application - and in particular separation
of these stages from the surrounding `Replica`. The test helps pinpoint
pieces of code that will need to be worked on in order to reach one of
the goals of cockroachdb#75729, being able to create a meaningful raft log
programmatically in a unit test without also instantiating a `Replica`
and, later, also applying raft logs in a similar way.

As a first step, this commit extracts a method `raftCmdToPayload` from
`(*Replica).propose` which translates a `kvserverpb.RaftCommand` (which
is roughly the result of evaluating a `BatchRequest`) into a payload
that can be handed to raft for replication. This extracted method is far
from clean - it takes a few crufty parameters, lives in `kvserver`
(where it is hard to import) and also uses the logging singleton which
is undesirable for a library method. However, it's enough to keep the
ball rolling, and can be improved on later.

[here]: cockroachdb#75729 (comment)

Touches cockroachdb#75729.

Release note: None
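
For orientation, the framing `raftCmdToPayload` produces boils down to roughly the following. The exact helper and constant names are assumptions (the raft command encoding moved to `kvserverbase` in #76120); treat them as illustrative.

```go
// raftCmdToPayload (approximate shape): marshal the RaftCommand and frame it
// as a one-byte encoding version, the fixed-size command ID, and the proto
// bytes -- the format raft entries carry through replication.
func raftCmdToPayload(
	idKey kvserverbase.CmdIDKey, cmd *kvserverpb.RaftCommand,
) ([]byte, error) {
	data, err := protoutil.Marshal(cmd)
	if err != nil {
		return nil, err
	}
	// EncodeRaftCommand/RaftVersionStandard are assumed names for the
	// kvserverbase encoding helpers.
	return kvserverbase.EncodeRaftCommand(kvserverbase.RaftVersionStandard, idKey, data), nil
}
```
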
sumeerbhola (Collaborator) commented:

> I don't think we're too far off on the evaluate side either. Assuming everything the commands we want to manufacture programmatically need is populated in the batcheval.EvalContext, we will get a RaftCommand back, and making the lease/closed timestamp code reusable shouldn't be hard either, especially since this is all much easier when operating offline without concurrency or in-memory state. We don't care much about the eval side outside of testing, so supporting just a subset of commands when we need them is good enough.

We don't need this for ReplicasStorage, and I'm guessing we don't need it for loss of quorum either. Is that correct? Why do we need to write tests that need to start with a BatchRequest and don't have a Replica (just trying to understand)? Isn't populating batcheval.EvalContext a tall order given the very broad interface?

tbg (Member, Author) commented Feb 9, 2022

This isn't tracking exactly the work needed for ReplicasStorage (which will only need to apply some kinds of entries, excluding splits and merges, iirc). It's generally something we want to exist for replication in a few places, and while there isn't a super pressing need, being able to do this shows us that we're doing something right.

tbg (Member, Author) commented Feb 9, 2022

The concrete example where you might want it both for LOQ and for ReplicasStorage is that for testing you'll want to populate custom raft logs. Having to spin up servers and manufacture the logs in a live system is not how this should work. Yet, that's what we do today:

```go
// TestFindUpdateDescriptor verifies that we can detect changes to range
// descriptor in the raft log.
// To do this we split and merge the range which updates descriptor prior to
// spawning or subsuming RHS.
func TestFindUpdateDescriptor(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	ctx := context.Background()
	const testNode = 0
	var testRangeID roachpb.RangeID
	var rHS roachpb.RangeDescriptor
	var lHSBefore roachpb.RangeDescriptor
	var lHSAfter roachpb.RangeDescriptor
	checkRaftLog(t, ctx, testNode,
		func(ctx context.Context, tc *testcluster.TestCluster) roachpb.RKey {
			scratchKey, err := tc.Server(0).ScratchRange()
			require.NoError(t, err, "failed to get scratch range")
			srk, err := keys.Addr(scratchKey)
			require.NoError(t, err, "failed to resolve scratch key")
			rd, err := tc.LookupRange(scratchKey)
			testRangeID = rd.RangeID
			require.NoError(t, err, "failed to get descriptor for scratch range")
			splitKey := testutils.MakeKey(scratchKey, []byte("z"))
			lHSBefore, rHS, err = tc.SplitRange(splitKey)
			require.NoError(t, err, "failed to split scratch range")
			lHSAfter, err = tc.Servers[0].MergeRanges(scratchKey)
			require.NoError(t, err, "failed to merge scratch range")
			require.NoError(t,
				tc.Server(testNode).DB().Put(ctx, testutils.MakeKey(scratchKey, []byte("|first")),
					"some data"),
				"failed to put test value in LHS")
			return srk
		},
		func(t *testing.T, ctx context.Context, reader storage.Reader) {
			seq, err := loqrecovery.GetDescriptorChangesFromRaftLog(testRangeID, 0, math.MaxInt64, reader)
			require.NoError(t, err, "failed to read raft log data")
			requireContainsDescriptor(t, loqrecoverypb.DescriptorChangeInfo{
				ChangeType: loqrecoverypb.DescriptorChangeType_Split,
				Desc:       &lHSBefore,
				OtherDesc:  &rHS,
			}, seq)
			requireContainsDescriptor(t, loqrecoverypb.DescriptorChangeInfo{
				ChangeType: loqrecoverypb.DescriptorChangeType_Merge,
				Desc:       &lHSAfter,
				OtherDesc:  &rHS,
			}, seq)
		})
}
```
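
With the kind of stand-alone log construction this issue asks for, the same coverage could come from something far smaller, along the lines of the sketch below. `logbuilder` and its `Split`/`Merge` methods are purely hypothetical.

```go
// Hypothetical: craft a raft log containing a split followed by a merge
// directly against an in-memory engine, with no servers involved.
func TestFindUpdateDescriptorStandalone(t *testing.T) {
	defer leaktest.AfterTest(t)()
	ctx := context.Background()
	eng := storage.NewDefaultInMemForTesting()
	defer eng.Close()

	const rangeID = roachpb.RangeID(99)
	b := logbuilder.New(eng, rangeID)                           // hypothetical
	require.NoError(t, b.Split(ctx, roachpb.RKey("scratch/z"))) // hypothetical
	require.NoError(t, b.Merge(ctx))                            // hypothetical

	seq, err := loqrecovery.GetDescriptorChangesFromRaftLog(rangeID, 0, math.MaxInt64, eng)
	require.NoError(t, err, "failed to read raft log data")
	require.Len(t, seq, 2) // one split, one merge
}
```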

@mwang1026 mwang1026 moved this from Incoming to Backlog in Replication Feb 10, 2022
tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2022
`(*replicaAppBatch).Stage` roughly

1. checks whether to apply the `Command` as a no-op
2. runs a bunch of triggers (touching `*Replica` a lot)
3. stages the command into a `storage.Batch`
4. runs a bunch of triggers (touching `*Replica` a lot)
5. returns the cmd as a `CheckedCommand`

Standalone log application needs to do 1, 3, 5 but it only needs to do
parts of 2 and 4.

We're trying to morph `replicaAppBatch` into something that can apply
log entries both in regular and standalone mode, where all of the
interactions with the surrounding *Replica (if any) are pushed into an
interface handed to `replicaAppBatch`. The implementation of the
interface is either a standalone-specific handler, or a
*Replica-specific one (that relies heavily on the standalone handler to
avoid logic duplication).

This will take many commits to achieve but the goal is something like
this:

```go
// appBatch implements apply.Batch and Handler (for standalone
// application).
type appBatch struct {
  // All the fields that we need during standalone application.
  state ReplicaState
  batch storage.Batch
}

// Handler is used by replicaAppBatch to execute side effects of entry
// application. The implementation depends on regular vs standalone mode.
type Handler interface { // methods are called from Stage
  PreHandler // addSSTPreApply, maybeAcquireSplitLock, etc
  PostHandler // runPreApplyTriggersAfterStagingWriteBatch, itemized
}

// replicaAppBatch implements apply.Batch. It also implements Handler,
// by first calling appBatch's Handler implementation and augmenting it
// with additional actions that don't apply in a standalone context.
type replicaAppBatch struct {
  appBatch
  r *Replica
  // More fields only required in *Replica context, like
  // `followerStoreWriteBytes` which is used for admission control
}
```

That exact plan is unlikely to entirely survive contact with reality but
something along those lines ought to work out.

Other than paving a particularly bumpy part of the road to stand-alone
log application, I expect this to also make the code more accessible, as
it is currently somewhat ad-hoc.

This first commit moves the lease applied index checks and the closed
timestamp assertions[^1] to this new scheme. Follow-up commits will do the
same for the other references to `*Replica` in `Stage` until `*Replica`
is only referenced from `*replicaAppBatchHandler`. At that point, we
can move `*Replica` from `replicaAppBatch` to `replicaAppBatchHandler`
and can declare victory.

This PR tries to minimize code movement. I left TODOs on everything that
I think should move in a future commit.

Touches cockroachdb#75729.

[^1]: these assertions are currently only run in "regular" mode but
should also be ported to standalone mode in the future. For now it's
convenient to have them out of the way; better to have a standalone mode
earlier, even if a few assertions are missing from it.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Dec 8, 2022
For cockroachdb#75729 we will want to work on replicaAppBatch and decompose it into
a new type appBatch around which replicaAppBatch can be layered.

As a first baby step, move appbatch-related code to its own
file to physically separate it from the `apply.StateMachine`.

This is 100% mechanical (using Goland's refactoring tools).

Release note: None
@tbg tbg changed the title from "kvserver: make log application stand-alone" to "kvserver: sep-raft-log make log application stand-alone" Dec 8, 2022
@tbg tbg changed the title from "kvserver: sep-raft-log make log application stand-alone" to "kvserver: sep-raft-log: make log application stand-alone" Dec 8, 2022
craig bot pushed a commit that referenced this issue Dec 8, 2022
93237: kvserver: move replicaAppBatch to replica_app_batch.go r=pavelkalinnikov a=tbg

For #75729 we will want to work on replicaAppBatch and decompose it into
a new type appBatch around which replicaAppBatch can be layered.

As a first baby step, move appbatch-related code to its own
file to physically separate it from the `apply.StateMachine`.

This is 100% mechanical (using Goland's refactoring tools).

Epic: CRDB-220
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
craig bot pushed a commit that referenced this issue Dec 21, 2022
93266: kvserver: refactor replicaAppBatch for standalone log application r=pavelkalinnikov a=tbg

This long (but individually small) sequence of commits moves `(*replicaAppBatch).Stage` close to the structure that was prototyped in #93265, where it has the following steps:

- command checks (standalone)
- testing interceptors (replica)
- pre-add triggers (standalone)
- pre-add triggers (replica)
- add (to pebble batch, standalone)
- post-add triggers (standalone)
- post-add triggers (replica)

In standalone application (e.g. for #93244) we'd use an `apply.Batch` that is an `appBatch`, instead of a `replicaAppBatch`, i.e. skip all of the (replica) steps above.

This PR doesn't get us all the way there - we still need to tease apart the `post-add triggers (replica)` step, which currently contains code that should be in `post-add triggers (standalone)`; this is best tackled in a separate PR since it's going to be quite a bit of work.

Touches #75729.

Epic: CRDB-220
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg (Member, Author) commented Feb 20, 2023

In #97347 we're seeing a tricky interaction between reproposals that we can't unit test without a significant rework. Having more stand-alone components for log application would help here.

craig bot pushed a commit that referenced this issue Jun 20, 2023
104401: kvserver: add TestCreateManyUnappliedProbes r=pavelkalinnikov a=tbg

This is the test used for #102953.

This PR also makes modest progress on #75729, not by making log application
stand-alone, but by making it somewhat less convoluted to manufacture log
entries programmatically. It would be desirable to be able to test the
replication layer in CRDB in the same way that Raft allows via the
InteractionEnv[^1].

[^1]: https://github.com/etcd-io/raft/blob/6bf4f7fe3122b064e0a0d76314298dca6f379fc7/interaction_test.go

Epic: none
Release note: none


105083: kvcoord: Release catchup reservation before re-acquire attempt r=miretskiy a=miretskiy

Release catchup scan reservation prior to attempting to re-acquire it. Failure to do so could result in a stuck mux rangefeed when enough ranges encounter an error (such as a range split) prior to receiving the first checkpoint event, which would cause additional attempts to acquire catchup scan quota.

Fixes #105058

Release note (bug fix): Fix a bug in the mux rangefeed implementation that could cause the mux rangefeed to become stuck if enough ranges encountered certain errors concurrently.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@blathers-crl blathers-crl bot added this to Incoming in KV Jun 28, 2024