Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: sep-raft-log: add crash-testing datadriven test harness #93247

Open
7 tasks
tbg opened this issue Dec 8, 2022 · 2 comments
Open
7 tasks

kvserver: sep-raft-log: add crash-testing datadriven test harness #93247

tbg opened this issue Dec 8, 2022 · 2 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication KV Replication Team

Comments

@tbg
Copy link
Member

tbg commented Dec 8, 2022

We need to exhaustively test the invariants for the engine states in separate raft logs.
We'll add assertions but should also explore these using datadriven tests.

Some of the code in #88606 might be helpful.

TODO:

  • crash in the middle of snapshot ingestion
  • crash in the middle of replica subsumption (snapshot and merge trigger case)
  • crash in the middle of replica creation
  • crash in the middle of split
  • crash in the middle of merge
  • crash in the middle of replica destruction
  • general engine-level testing of all of the above.

I'll add that our testing of the snapshot SSTs is currently weak. We rely on spinning up a real cluster and then intercept the SSTs, which is not very maintainable. We should complement this with kvstorage-level tests that create a snapshot and examine in detail their contents and the engine each one targets.

Epic: CRDB-220

Jira issue: CRDB-22240

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication KV Replication Team labels Dec 8, 2022
@blathers-crl
Copy link

blathers-crl bot commented Dec 8, 2022

cc @cockroachdb/replication

tbg added a commit to tbg/cockroach that referenced this issue Dec 19, 2022
Mechanical using GoLand.

Next steps are starting to add a datadriven testing framework
so that functionality can be exercised when it moves or is added here,
i.e. cockroachdb#93247.

Epic: none
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Dec 22, 2022
Mechanical using GoLand.

Next steps are starting to add a datadriven testing framework
so that functionality can be exercised when it moves or is added here,
i.e. cockroachdb#93247.

Epic: none
Release note: None
craig bot pushed a commit that referenced this issue Dec 22, 2022
94131: kvserver: move LoadAndReconcileReplicas to kvstorage r=pavelkalinnikov a=tbg

Mechanical using GoLand.

Next steps are starting to add a datadriven testing framework
so that functionality can be exercised when it moves or is added here,
i.e. #93247.

Epic: none
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Jan 23, 2023
See cockroachdb#93310. This is
also the beginning of
cockroachdb#93247.

Epic: CRDB-220
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Feb 2, 2023
See cockroachdb#93310. This is
also the beginning of
cockroachdb#93247.

Epic: CRDB-220
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Feb 3, 2023
See cockroachdb#93310. This is
also the beginning of
cockroachdb#93247.

Epic: CRDB-220
Release note: None
craig bot pushed a commit that referenced this issue Feb 6, 2023
95513: kvstorage: complete RaftReplicaID migration r=pavelkalinnikov a=tbg

As of v22.1[^1], we always write the RaftReplicaID when creating a
Replica or updating a snapshot. However, since this is
persisted state that could've originated in older versions and not
updated yet, we couldn't rely on a persisted ReplicaID yet.

This commit adds code to the `(*Store).Start` boot sequence that

- persists a RaftReplicaID for all initialized replicas (using the
  ReplicaID from the descriptor)
- deletes all uninitialized replicas lacking RaftReplicaID (since we don't know their
  ReplicaID at this point).

The second item in theory violates Raft invariants, as uninitialized
Replicas are allowed to vote (though they then cannot accept log
entries). So in theory:

- an uninitialized replica casts a decisive vote for a leader
- it restarts
- code in this commit removes the uninited replica (and its vote)
- delayed MsgVote from another leader arrives
- it casts another vote for the same term for a dueling leader
- now there are two leaders in the same term.

The above in addition presupposes that the two leaders cannot
communicate with each other. Also, even if that is the case, since the
two leaders cannot append to the uninitialized replica (it doesn't
accept entries), we also need additional voters to return at the exact
right time.

Since an uninitialized replica without RaftReplicaID in is necessarily
at least one release old, this is exceedingly unlikely and we will
live with this theoretical risk.

This PR also adds a first stab at a datadriven test harness for
`kvstorage` which is likely to be of use for #93247.

[^1]: #75761

Epic: [CRDB-220](https://cockroachlabs.atlassian.net/browse/CRDB-220)
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@tbg
Copy link
Member Author

tbg commented Feb 24, 2023

We have a basic datadriven framework at https://github.com/cockroachdb/cockroach/blob/672e8b1e628ef0a4c920cf42355c44cbc7a625b6/pkg/kv/kvserver/kvstorage/datadriven_test.go that should, with extra work, be able to accommodate our testing needs.

The testing it is used for is limited. I'm adding some bullets to the comment above about missing areas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication KV Replication Team
Projects
None yet
Development

No branches or pull requests

1 participant