
storage: build SSTs from KV_BATCH snapshot #38932

Merged

merged 4 commits on Aug 9, 2019

Conversation

jeffrey-xiao
Contributor

@jeffrey-xiao jeffrey-xiao commented Jul 17, 2019

Implements the SST snapshot strategy discussed in #16954 and partially implemented in #25134 and #38873, but only includes the logic on the receiver side for ease of testing and compatibility. This PR also handles the complications of subsumed replicas that are not fully contained by the current replica.

The maximum number of SSTs created using this strategy is 4 + SR + 2 where SR is the number of subsumed replicas.

  • Three SSTs get streamed from the sender (range local keys, replicated range-id local keys, and data keys)
  • One SST is constructed for the unreplicated range-id local keys.
  • One SST is constructed for every subsumed replica to clear the range-id local keys. These SSTs consist of one range deletion tombstone and one RaftTombstone key.
  • A maximum of two SSTs for all subsumed replicas to account for the case of not fully contained subsumed replicas. Note that currently, subsumed replicas can have keys right of the current replica, but not left of, so there will be a maximum of one SST created for the range-local keys and one for the data keys. These SSTs consist of one range deletion tombstone.

This number can be further reduced to 3 + SR if we pass the file handles and SST writers from the receiving step to the application step. We can combine the SSTs for the unreplicated range-id and replicated range-id keys, and the range-local and data SSTs of the subsumed replicas. We probably don't want to do this optimization, though, since we'd have to undo it if we start constructing the SSTs on the sender or start chunking large SSTs into smaller SSTs.

Blocked by facebook/rocksdb#5649.

Test Plan

  • Testing knob to inspect SSTs before ingestion. Ensure that expected SSTs for subsumed replicas are ingested.
  • Unit tests for SSTSnapshotStorage.

Metrics and Evaluation

One way to evaluate this change is to follow these steps:

  1. Set up a 3-node cluster.
  2. Set default Raft log truncation threshold to some low constant:
defaultRaftLogTruncationThreshold = envutil.EnvOrDefaultInt64(
    "COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD", 128<<10 /* 128 KB */)
  3. Set range_min_bytes to 0 and range_max_bytes to some large number (steps 3 to 5 are sketched in code after this list).
  4. Increase kv.snapshot_recovery.max_rate and kv.snapshot_rebalance.max_rate to some large number.
  5. Disable load-based splitting.
  6. Stop node 2.
  7. Run an insert-heavy workload (kv0) on the cluster.
  8. Start node 2.
  9. Time how long it takes for node 2 to have all the ranges.
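For reference, steps 3 to 5 can be applied with SQL against any node. Below is a minimal sketch; the connection string and the load-based-splitting setting name kv.range_split.by_load_enabled are assumptions, while the other names come directly from the steps above.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Connect to any node of the test cluster (the address is an assumption).
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Step 3: drop the lower bound on range size and raise the upper bound.
		`ALTER RANGE default CONFIGURE ZONE USING range_min_bytes = 0, range_max_bytes = 17179869184`,
		// Step 4: lift the snapshot rate limits.
		`SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '1 GiB'`,
		`SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '1 GiB'`,
		// Step 5: disable load-based splitting (setting name is an assumption).
		`SET CLUSTER SETTING kv.range_split.by_load_enabled = false`,
	}
	for _, stmt := range stmts {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("%s: %v", stmt, err)
		}
	}
}
```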

Roachtest: https://gist.github.com/jeffrey-xiao/e69fcad04968822d603f6807ca77ef3b

We can vary two independent variables:

  1. Fixed total data size (4000000 ops; ~3.81 GiB), variable number of splits
  • 1024 splits (~3.9 MiB ranges)
  • 512 splits (~7.9 MiB ranges)
  • 256 splits (~15.7 MiB ranges)
  • 128 splits (~31.2 MiB ranges)
  • 64 splits (~61.0 MiB ranges)
  • 32 splits (~121 MiB ranges)
name         old secs   new secs   delta
AvgBytes3    14.0 ±24%  12.7 ±21%     ~     (p=0.279 n=8+8)
AvgBytes7    11.3 ± 1%  13.4 ±25%     ~     (p=0.283 n=4+8)
AvgBytes15   11.8 ±17%  12.6 ±27%     ~     (p=0.755 n=6+8)
AvgBytes30   23.5 ±11%  14.9 ±45%  -36.74%  (p=0.001 n=8+8)
AvgBytes60   32.3 ±13%  23.4 ± 9%  -27.49%  (p=0.000 n=8+8)
AvgBytes121  53.1 ± 6%  38.8 ±19%  -26.86%  (p=0.002 n=5+8)
  2. Fixed number of splits (32), variable total data size
  • 125000 (~ 3.7 MiB ranges)
  • 250000 (~7.5 MiB ranges)
  • 500000 (~15 MiB ranges)
  • 1000000 (~30 MiB ranges)
  • 2000000 (60 MiB ranges)
  • 4000000 (121 MiB ranges)
name         old secs   new secs  delta
AvgBytes3     740 ±22%   883 ± 8%     ~     (p=0.143 n=5+3)
AvgBytes7     681 ±14%   728 ± 9%     ~     (p=0.310 n=5+5)
AvgBytes15    418 ±10%   441 ±11%     ~     (p=0.310 n=5+5)
AvgBytes31   54.3 ± 6%  43.6 ± 5%  -19.72%  (p=0.008 n=5+5)
AvgBytes61   51.8 ± 3%  42.4 ± 6%  -18.16%  (p=0.008 n=5+5)
AvgBytes121  53.1 ± 6%  38.8 ±19%  -26.86%  (p=0.002 n=5+8)

Fsync Chunk Size

The size of the SST chunk that we write before fsync-ing impacts how fast node 2 has all the ranges. I've experimented with 32 splits and a median range size of 121 MB: no fsync-ing (~27s recovery), fsync-ing in 8 MB chunks (~30s recovery), fsync-ing in 2 MB chunks (~40s recovery), and fsync-ing in 256 KB chunks (~42s recovery). The default bulk SST sync rate is 2 MB and #20352 sets bytes_per_sync to 512 KB, so something between those options is probably good. The reason we would want to fsync is to prevent the OS from accumulating such a large buffer that it blocks unrelated small/fast writes for a long time when it flushes.
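For illustration, the chunked-fsync pattern being compared here looks roughly like the following. This is a standalone sketch with a placeholder file path and chunk size, not the PR's actual SST writer.

```go
package main

import (
	"log"
	"os"
)

// writeChunked appends data to f and fsyncs after roughly chunkSize bytes,
// so the OS never accumulates much more than chunkSize of dirty data for
// this file before being forced to flush it.
func writeChunked(f *os.File, data []byte, chunkSize int) error {
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		if _, err := f.Write(data[:n]); err != nil {
			return err
		}
		if err := f.Sync(); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}

func main() {
	// Placeholder path and sizes: write 8 MB in 2 MB chunks, matching the
	// default bulk SST sync rate mentioned above.
	f, err := os.Create("/tmp/sst-chunk-demo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := writeChunked(f, make([]byte, 8<<20), 2<<20); err != nil {
		log.Fatal(err)
	}
}
```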

Impact on Foreground Traffic

For testing the impact on foreground traffic, I ran kv0 on a four node cluster with the merge queue and split queue disabled and starting with a constant number of splits. After 5 minutes, I decommissioned node 1 so its replicas would drain to other nodes using snapshots.

Roachtest: https://gist.github.com/jeffrey-xiao/5d9443a37b0929884aca927f9c320b6c

Average Range Size of 3 MiB

  • Before: https://user-images.githubusercontent.com/8853434/62398633-41a2bb00-b547-11e9-9e3d-747ee724943b.png
  • After: https://user-images.githubusercontent.com/8853434/62398634-41a2bb00-b547-11e9-85e7-445b7989d173.png

Average Range Size of 32 MiB

  • Before: https://user-images.githubusercontent.com/8853434/62398631-410a2480-b547-11e9-9019-86d3bd2e6f73.png
  • After: https://user-images.githubusercontent.com/8853434/62398632-410a2480-b547-11e9-9513-8763e132e76b.png

Average Range Size of 128 MiB

  • Before: https://user-images.githubusercontent.com/8853434/62398558-15873a00-b547-11e9-8ab6-2e8e9bae658c.png
  • After: https://user-images.githubusercontent.com/8853434/62398559-15873a00-b547-11e9-9c72-b3e90fce1acc.png

We see p99 latency wins for larger range sizes and comparable performance for smaller range sizes.

Release note (performance improvement): Snapshots sent between replicas are now applied more performantly and use less memory.

@jeffrey-xiao jeffrey-xiao requested review from tbg, nvanbenschoten and a team July 17, 2019 15:55

Member

@nvanbenschoten nvanbenschoten left a comment

This is still a WIP, but I gave it a pass to get a feel for the general approach. I think this is turning out cleaner than the other attempt, even with the need to worry about partially applied snapshots and recovery from this state.

One of the bigger open questions right now is the testing plan for this change. Could you enumerate the tests that you're thinking of adding to exercise the new surface area of this approach?

Reviewed 14 of 14 files at r1, 4 of 4 files at r2, 7 of 11 files at r3, 2 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffrey-xiao, @nvanbenschoten, and @tbg)


pkg/keys/constants.go, line 156 at r5 (raw file):

	// if any. If this key is set, the replica must finish processing the
	// snapshot by ingesting the SSTs of the snapshot before performing any other
	// action.

"on startup"


pkg/roachpb/internal_raft.proto, line 55 at r5 (raw file):

}

// SSTSnapshotInProgressData is the data needed to process an in-progress snapshot.

This comment is a little misleading. This isn't the data needed to process the snapshot, it's the persisted record that a snapshot is in progress, durably written to coordinate recovery from an untimely crash.


pkg/roachpb/internal_raft.proto, line 58 at r5 (raw file):

message SSTSnapshotInProgressData {
  // The uuid of the in-progress snapshot.
  optional bytes uuid = 1 [(gogoproto.customname) = "UUID"];

We can improve the generated code using:

(gogoproto.customtype) = "github.com/cockroachdb/cockroach/pkg/util/uuid.UUID", (gogoproto.nullable) = false

At which point, ID might be a better name.
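To make the effect on callers concrete, here is a rough sketch of the two shapes of generated field; the struct names are hypothetical stand-ins for the generated types.

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/util/uuid"
)

// Hypothetical stand-ins for the two shapes of generated code.
type withoutCustomtype struct {
	UUID []byte // raw bytes; every caller converts by hand
}

type withCustomtype struct {
	ID uuid.UUID // typed field; no conversion needed
}

func main() {
	id := uuid.MakeV4()

	// Without customtype: manual conversion and an error path at each use.
	a := withoutCustomtype{UUID: id.GetBytes()}
	parsed, err := uuid.FromBytes(a.UUID)
	if err != nil {
		panic(err)
	}
	fmt.Println(parsed)

	// With customtype and nullable=false: the value is usable directly.
	b := withCustomtype{ID: id}
	fmt.Println(b.ID)
}
```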


pkg/storage/client_raft_test.go, line 1074 at r5 (raw file):

		State:      storagepb.ReplicaState{Desc: rep.Desc()},
	}
	header.RaftMessageRequest.Message.Snapshot.Data = uuid.UUID{}.GetBytes()

Is this really needed?


pkg/storage/replica_init.go, line 163 at r5 (raw file):

	sstSnapshotInProgressKey := keys.RangeSSTSnapshotInProgress(desc.RangeID)
	var sstSnapshotInProgressData roachpb.SSTSnapshotInProgressData
	if ok, err := engine.MVCCGetProto(

I think we'll want to build the keys.RangeSSTSnapshotInProgress manipulation code in as methods on the r.mu.stateLoader. It provides some degree of abstraction here.

Also, we might as well pull this into a method that lives near the rest of snapshot ingestion logic.
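A sketch of what such stateloader methods could look like, assuming they live in the stateloader package alongside the existing StateLoader type; the method names are hypothetical, and the MVCC helpers mirror how other per-range state keys are written.

```go
package stateloader

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage/engine"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// Hypothetical accessors for the in-progress snapshot record; StateLoader and
// RangeSSTSnapshotInProgress are assumed to already exist in this package.
func (rsl StateLoader) SetSSTSnapshotInProgress(
	ctx context.Context, eng engine.ReadWriter, data roachpb.SSTSnapshotInProgressData,
) error {
	return engine.MVCCPutProto(
		ctx, eng, nil /* ms */, rsl.RangeSSTSnapshotInProgress(),
		hlc.Timestamp{}, nil /* txn */, &data,
	)
}

func (rsl StateLoader) ClearSSTSnapshotInProgress(
	ctx context.Context, eng engine.ReadWriter,
) error {
	return engine.MVCCDelete(
		ctx, eng, nil /* ms */, rsl.RangeSSTSnapshotInProgress(),
		hlc.Timestamp{}, nil, /* txn */
	)
}
```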


pkg/storage/replica_init.go, line 168 at r5 (raw file):

		return err
	} else if ok {
		sss, err := newSstSnapshotStorage(r.store.cfg.Settings, desc.RangeID, uuid.Must(uuid.FromBytes(sstSnapshotInProgressData.UUID)),

s/newSstSnapshotStorage/newSSTSnapshotStorage/


pkg/storage/replica_init.go, line 173 at r5 (raw file):

			return err
		}
		if err := r.store.Engine().IngestExternalFiles(ctx, sss.ssts, true /* skipWritingSeqNo */, true /* modify */); err != nil {

For now, do you mind leaving // TODO(jeffreyxiao): Test an untimely crash here. comments at each place that you'd like to test a crash?


pkg/storage/replica_raftstorage.go, line 484 at r5 (raw file):

	// The RocksDB BatchReprs that make up this snapshot.
	Batches [][]byte
	// SSTables that make put this snapshot.

s/put/up/


pkg/storage/replica_raftstorage.go, line 485 at r5 (raw file):

	Batches [][]byte
	// SSTables that make put this snapshot.
	SSTs []string

What are these? Filenames? Do we need to carry these around here if we can already look up all of the SSTs using an SstSnapshotStorage?


pkg/storage/replica_raftstorage.go, line 485 at r5 (raw file):

	Batches [][]byte
	// SSTables that make put this snapshot.
	SSTs []string

nit: put in same order as the variable declaration in kvBatchSnapshotStrategy.Receive


pkg/storage/replica_raftstorage.go, line 852 at r5 (raw file):

	// We need to delete any old Raft log entries here because any log entries
	// that predate the snapshot will be orphaned and never truncated or GC'd.
	if err := clearRangeData(ctx, s.Desc, r.store.Engine(), batch, true /* destroyData */); err != nil {

I'm realizing that this might have a downside of causing the SST we ingest to always get stuck high in the LSM instead of making its way lower down. We're going to want to play with both approaches to deleting this data (in the batch, in the SSTs) to determine which way performs better.


pkg/storage/replica_raftstorage.go, line 885 at r5 (raw file):

	}

	sstSnapshotInProgressKey := stateloader.Make(s.Desc.RangeID).RangeSSTSnapshotInProgress()

Here's another place where we'd use the stateloader instead of constructing this directly.


pkg/storage/replica_raftstorage.go, line 948 at r5 (raw file):

	// The on-disk state is now committed, but the corresponding in-memory state
	// has not yet been updated and the data SST has not been ingested. Any

nit: break the SST ingestion and the in-memory state into separate logical steps instead of putting them together in this block.

Once you do that, please add a good comment about the entire strategy here, what happens if things fail in different places, and how this is all correct.


pkg/storage/replica_sst_snapshot_storage.go, line 28 at r5 (raw file):

// SstSnapshotStorage keeps track of the SST files created when receiving a
// snapshot with the SST strategy.
type SstSnapshotStorage struct {

s/SstSnapshotStorage/SSTSnapshotStorage/

I think it will be worth pulling out an interface here like we do with SideloadStorage so that we can create a real and an in-memory implementation.


pkg/storage/store_snapshot.go, line 134 at r5 (raw file):

		return noSnap, sendSnapshotError(stream, err)
	}
	sstFile, err := kvSS.sss.CreateFile()

Let's do this lazily in case the SST has no non-local keys.


pkg/storage/store_snapshot.go, line 134 at r5 (raw file):

		return noSnap, sendSnapshotError(stream, err)
	}
	sstFile, err := kvSS.sss.CreateFile()

Please add a TODO somewhere in here to limit the size of a single SST (sstMaxFileSize). For now, we probably won't ever hit this limit because the size of ranges isn't large enough to need it, but we'll want to have the code structure in place to support it. Doing so will probably inspire us to pull the SST creation code in this method into something a little more structured.


pkg/storage/store_snapshot.go, line 157 at r5 (raw file):

				return noSnap, sendSnapshotError(stream, err)
			}
			// All operations in the batch are guaranteed to be puts.

Should we assert that using BatchType()?


pkg/storage/store_snapshot.go, line 166 at r5 (raw file):

				// Add the key to the sst table if it is not a part of the local key
				// range.
				if key.Key.Compare(keys.LocalMax) > 0 {

It looks like keys.LocalMax is considered a non-local key because this is exclusive. See isSysLocal.
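In other words, the local key space ends just before keys.LocalMax, so LocalMax itself is the first global key and the check likely wants to be inclusive. A tiny standalone sketch of the difference:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/keys"
)

func main() {
	k := keys.LocalMax

	// The exclusive check quoted above skips k == keys.LocalMax, so that key
	// would never make it into the data SST.
	fmt.Println("exclusive keeps LocalMax:", k.Compare(keys.LocalMax) > 0) // false

	// An inclusive check matches isSysLocal's view that everything at or
	// above LocalMax is non-local.
	fmt.Println("inclusive keeps LocalMax:", k.Compare(keys.LocalMax) >= 0) // true
}
```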


pkg/storage/store_snapshot.go, line 188 at r5 (raw file):

					return noSnap, sendSnapshotError(stream, err)
				}
			}

Should we flush a batch we have accumulated here?

@jeffrey-xiao jeffrey-xiao changed the title from "[WIP] storage: build SSTable from KV_BATCH snapshots" to "[WIP] storage: build SSTable from KV_BATCH snapshot" on Jul 22, 2019
@jeffrey-xiao jeffrey-xiao force-pushed the kv-batch-stream branch 2 times, most recently from 9eca813 to be43855 on July 22, 2019 at 21:37
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/storage/client_raft_test.go, line 1074 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this really needed?

The test fails without this line. The new code expects the UUID of the snapshot to be known from the Data field on the receiver side, so instead of failing with the expected error, it fails with uuid: UUID must be exactly 16 bytes long, got 0 bytes.


pkg/storage/replica_init.go, line 163 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we'll want to build the keys.RangeSSTSnapshotInProgress manipulation code in as methods on the r.mu.stateLoader. It provides some degree of abstraction here.

Also, we might as well pull this into a method that lives near the rest of snapshot ingestion logic.

I've added code to manipulate SSTSnapshotInProgress to the stateloader. I think the new logic after this change is not that reusable and is simple enough that it doesn't warrant a separate method.


pkg/storage/replica_raftstorage.go, line 485 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What are these? Filenames? Do we need to carry these around here if we can already look up all of the SSTs using an SstSnapshotStorage?

Constructing an SSTSnapshotStorage enables us to look up which SSTs we created, but it uses a glob to determine the SSTs, which is probably not ideal.


pkg/storage/replica_raftstorage.go, line 852 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm realizing that this might have a downside of causing the SST we ingest to always get stuck high in the LSM instead of making its way lower down. We're going to want to play with both approaches to deleting this data (in the batch, in the SSTs) to determine which way performs better.

Left a TODO for this.


pkg/storage/replica_sst_snapshot_storage.go, line 28 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/SstSnapshotStorage/SSTSnapshotStorage/

I think it will be worth pulling out an interface here like we do with SideloadStorage so that we can create a real and an in-memory implementation.

Made the substitution change. I refactored SSTSnapshotStorage so that it's more easily made into an interface. I'm just wondering what an appropriate in-memory implementation would look like. SSTSnapshotStorage would have to output SSTs to disk for ingestion, so I'm finding it hard to imagine how an in-memory implementation would be useful.


pkg/storage/store_snapshot.go, line 188 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we flush a batch we have accumulated here?

Discussed offline. Previously the batches were chunked to be streamed. Since we are constructing a batch on the receiver side, there isn't a need to chunk it. It's probably good to change the KV strategy to only have a single batch instead of an array of batches.

Member

@tbg tbg left a comment

This came out well! Thanks for splitting out the first two commits; that made reviewing a lot easier (the third commit needs a fat comment, but it's a WIP so no complaints). I left plenty of comments, but none of them are substantial; if I had to summarize them I'd go with "Make it a pleasure to understand precisely how it all works from the code". I'm pretty excited about getting this in, cranking the range size to 10gb and seeing all sorts of fires start to burn. I know of a few places where they will already (looking at you, consistency checker), but pretty sure not all.

Reviewed 10 of 14 files at r1, 15 of 15 files at r6, 4 of 4 files at r7, 13 of 13 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffrey-xiao and @nvanbenschoten)


c-deps/libroach/include/libroach.h, line 471 at r7 (raw file):

// Truncates the writer and stores the constructed file's contents in *data.
// May be called multiple times. The function may not truncate and return all

This is saying that the returned data won't necessarily reflect the latest writes (only those in completed blocks, though that's an impl detail), right? Maybe you can use that wording, I was thrown off by "may not truncate and return all keys" trying to figure out what exactly was meant.
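For context, the usage pattern Truncate enables on the Go side is roughly the following; the writer surface assumed here (Put/Truncate/Finish) is taken from this PR's description rather than copied from the code.

```go
package sstsketch

// sstChunkWriter captures only the surface this sketch assumes the SST file
// writer exposes after this PR.
type sstChunkWriter interface {
	Put(key, value []byte) error
	Truncate() ([]byte, error) // bytes of blocks completed since the last call
	Finish() ([]byte, error)   // remaining bytes; all chunks concatenated form a valid SST
}

// streamSST builds an SST and streams it in chunks as blocks complete,
// instead of buffering the whole file in memory.
func streamSST(w sstChunkWriter, kvs [][2][]byte, send func([]byte) error) error {
	for _, kv := range kvs {
		if err := w.Put(kv[0], kv[1]); err != nil {
			return err
		}
		// Truncate may return nothing if no block has completed since the
		// last call; only finished blocks are handed back, which is the
		// "won't necessarily reflect the latest writes" behavior above.
		chunk, err := w.Truncate()
		if err != nil {
			return err
		}
		if len(chunk) > 0 {
			if err := send(chunk); err != nil {
				return err
			}
		}
	}
	tail, err := w.Finish()
	if err != nil {
		return err
	}
	return send(tail)
}
```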


pkg/keys/keys.go, line 979 at r8 (raw file):

// RangeSSTSnapshotInProgress returns a range-local key for the snapshot in
// progress, in any.

"in any"


pkg/storage/client_raft_test.go, line 1074 at r5 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

The test fails without this line. The new code expects the UUID of the snapshot to be known from the Data field on the receiver side, so instead of failing with the expected error, it fails with uuid: UUID must be exactly 16 bytes long, got 0 bytes.

This isn't a migration concern because outside of tests, the Data field is always properly populated, right?


pkg/storage/replica_init.go, line 168 at r8 (raw file):

	// be idempotent.
	if found {
		sss, err := newSSTSnapshotStorage(r.store.cfg.Settings, desc.RangeID, sstSnapshotInProgressData.ID,

Surprised that you need to make a whole new thing here where the "other place" that does the ingestion gets the SSTs spoonfed. It'd be nice to have more symmetry.


pkg/storage/replica_init.go, line 173 at r8 (raw file):

			return err
		}
		if err := r.store.Engine().IngestExternalFiles(ctx, sss.ssts, true /* skipWritingSeqNo */, true /* modify */); err != nil {

This also deserves targeted crash testing, even though it looks very similar to the other code.


pkg/storage/replica_raftstorage.go, line 485 at r5 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

Constructing a SSTSnapshotStorage enables us to look up which SSTs we created, but it uses a glob to determine the SSTs which is probably not ideal.

Improve the comment, including whether the filenames are relative or absolute and whether this field is always populated. Also clarify what's in the Batch field.


pkg/storage/replica_raftstorage.go, line 904 at r8 (raw file):
Logic in this sentence is off. Suggestion

If there are no SSTs to ingest, don't write the snapshot progress key.


pkg/storage/replica_raftstorage.go, line 963 at r8 (raw file):
There are many more reasons we need to sync the WAL (some known, some probably unknown). I don't think it's worth listing them now, but just write

Sync to make the snapshot durably applied.

to make this less misleading.


pkg/storage/replica_raftstorage.go, line 970 at r8 (raw file):

SSTs have


pkg/storage/replica_raftstorage.go, line 972 at r8 (raw file):

	// has not yet been updated and the data SST has not been ingested. Any
	// errors past this point must therefore be treated as fatal. If the node
	// crashes before the data SST is ingested, the unreplicated range-ID local

ditto


pkg/storage/replica_raftstorage.go, line 977 at r8 (raw file):

// If there are no SSTs, there's nothing to ingest and we didn't write the snapshot in progress key earlier.


pkg/storage/replica_raftstorage.go, line 983 at r8 (raw file):

		// crash here is safe because the SSTs should be ingested on replica
		// startup.
		if err := r.store.engine.IngestExternalFiles(ctx, inSnap.SSTs, true /* skipWritingSeqNo */, true /* modify */); err != nil {

Is modify the option that hard-links into Rocks' dir and removes the original file? That seems fine then. I see you're passing skipWritingSeqNo; not sure how this works in the case in which it needs to be assigned a global seqno, but you've probably looked into this and it just works.


pkg/storage/replica_sst_snapshot_storage.go, line 130 at r8 (raw file):

The current file must have been written to before being closed.


pkg/storage/store_snapshot.go, line 103 at r8 (raw file):

	// Fields used when sending snapshots.
	batchSize    int64
	sstChunkSize int64

comment


pkg/storage/store_snapshot.go, line 119 at r8 (raw file):

	var err error

	emptySST := true

comment (there isn't an obvious notion of empty since Truncate() can return empty even if something has been written)


pkg/storage/store_snapshot.go, line 120 at r8 (raw file):

	emptySST := true
	b := kvSS.newBatch()

Is it still necessary to provide this as a method? Just curious, it doesn't seem like it, but I'm also a fan of only creating a batch when it's being used immediately


pkg/storage/store_snapshot.go, line 123 at r8 (raw file):

// At the moment, we'll write at most one SST.
//
// TODO(jeffreyxiao): re-evaluate as the default range size grows.


pkg/storage/store_snapshot.go, line 127 at r8 (raw file):

	}

	lastSizeCheck := int64(0)

comment


pkg/storage/store_snapshot.go, line 164 at r8 (raw file):

					}
					emptySST = false
					if sst.DataSize-lastSizeCheck > kvSS.sstChunkSize {

Does sst.DataSize > 0 imply that Truncate returns a nonempty chunk? Call that out, and make sure it's actually true in the code below (i.e. assert).

Add a comment on kvSS.sstChunkSize that indicates that this is roughly how much data will be buffered in memory at any given time.


pkg/storage/store_snapshot.go, line 711 at r8 (raw file):

	switch header.Strategy {
	case SnapshotRequest_KV_BATCH:
		snapUUID, err := uuid.FromBytes(header.RaftMessageRequest.Message.Snapshot.Data)

Just confirmed to myself that this UUID has been sent forever, so this just moved earlier but will always be set in relevant old versions of CRDB.


pkg/storage/store_test.go, line 3374 at r8 (raw file):

	defer leaktest.AfterTest(t)()

	testKey := testutils.MakeKey(keys.MakeTablePrefix(50), roachpb.RKey("a"))

This test setup seems unnecessarily pedestrian. Can't you start a TestServer with on-disk storage, split the range, stop the server, then create an engine from the dir, set up the in-progress-key, close the engine, start the TestServer again and then run your store.MVCCGet via the TestServer (plus check that the snapshot dir you wrote in is gone, along with the marker key)?


pkg/storage/store_test.go, line 3430 at r8 (raw file):

		t.Fatal(err)
	}
	if err := sss.NewFile(); err != nil {

Your call, but require.NoError would've really helped you in this test.


pkg/storage/engine/rocksdb.go, line 2954 at r6 (raw file):

// ClearIterRange implements the Writer interface
func (fw *RocksDBSstFileWriter) ClearIterRange(iter Iterator, start, end MVCCKey) error {
	panic("unimplemented")

I don't feel great about making something an interface when it doesn't actually support lots of the operations (granted, the interfaces here have lots of room for improvement, too). You probably have a reason, but I don't (yet) know what it is. Add that to the commit message. Or, if it isn't needed, just don't implement the interface and remove the panics. If you'd like to check statically that the receiver implements the "relevant" part of the interface, you can do

type rocksDBSstFileWriterAdapter struct{*RocksDBSstFileWriter}
var _ Writer = (*rocksDBSstFileWriterAdapter)(nil)
func(fw *rocksDBSstFileWriterAdapter) Merge(...) error { panic("unimplemented") }
...

pkg/storage/engine/rocksdb.go, line 2980 at r6 (raw file):

// LogLogicalOp implements the Writer interface.
func (fw *RocksDBSstFileWriter) LogLogicalOp(op MVCCLogicalOpType, details MVCCLogicalOpDetails) {}

panic?


pkg/storage/engine/rocksdb_test.go, line 740 at r7 (raw file):

	const keyLen = 10
	const valLen = 1000
	ts := hlc.Timestamp{WallTime: timeutil.Now().UnixNano()}

Use a fixed timestamp here, unless you're going for randomness, in which case t.Log the timestamp so it's available when the test fails.


pkg/storage/engine/rocksdb_test.go, line 801 at r7 (raw file):

			"was not (len=%d)", len(sst1FinishBuf), len(resBuf2))
	}
}

Nice test.


pkg/storage/stateloader/stateloader.go, line 634 at r8 (raw file):
remove , if any, add:

If no record is found, returns true.

@jeffrey-xiao jeffrey-xiao force-pushed the kv-batch-stream branch 4 times, most recently from 5da9171 to 3f8b876 on July 24, 2019 at 19:39
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffrey-xiao, @nvanbenschoten, and @tbg)


pkg/storage/client_raft_test.go, line 1074 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This isn't a migration concern because outside of tests, the Data field is always properly populated, right?

Yes, I don't think we need to do any migrations for this new expectation that the UUID is populated on the receiver side.


pkg/storage/replica_init.go, line 168 at r8 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Surprised that you need to make a whole new thing here where the "other place" that does the ingestion gets the SSTs spoonfed. It'd be nice to have more symmetry.

in replica_raftstorage.go the SSTs are known, whereas here the SSTs must be "rediscovered" based on the UUID that we stored. I wanted to delegate newSSTSnapshotStorage the responsibility of recovering using a UUID which is why I constructed a new one. Let me know if there's a better alternative though.


pkg/storage/replica_raftstorage.go, line 983 at r8 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is modify the option that hard-links into Rocks' dir and removes the original file? That seems fine then. I see you're skipWritingSeqNo, not sure how this works in the case in which it needs to be assigned a global seqno but you've probably looked into this and it just works.

I wasn't exactly sure about these options, but it works for the tests I wrote and the workloads I ran. From https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files#what-happens-when-you-ingest-a-file, it seems like setting it to true is a backwards compatibility concern. All the other uses of IngestExternalFiles in our codebase have modify = true and skipWritingSeqNo = true.


pkg/storage/store_snapshot.go, line 120 at r8 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is it still necessary to provide this as a method? Just curious, it doesn't seem like it, but I'm also a fan of only creating a batch when it's being used immediately

I think newBatch is being overridden in tests.


pkg/storage/store_snapshot.go, line 164 at r8 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Does sst.DataSize > 0 imply that Truncate returns a nonempty chunk? Call that out, and make sure it's actually true in the code below (i.e. assert).

Add a comment on kvSS.sstChunkSize that indicates that this is roughly how much data will be buffered in memory at any given time.

I think you mean sst.DataSize-lastSizeCheck > 0 right? I don't think that sst.DataSize-lastSizeCheck > 0 implies that Truncate returns a non-empty chunk. It really depends on when the writer decides to flush. DataSize is something we maintain and will not necessarily reflect what we would get from truncate.


pkg/storage/store_snapshot.go, line 711 at r8 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just confirmed to myself that this UUID has been sent forever, so this just moved earlier but will always be set in relevant old versions of CRDB.

Thanks for confirming!


pkg/storage/store_test.go, line 3374 at r8 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This test setup seems unnecessarily pedestrian. Can't you start a TestServer with on-disk storage, split the range, stop the server, then create an engine from the dir, set up the in-progress-key, close the engine, start the TestServer again and then run your store.MVCCGet via the TestServer (plus check that the snapshot dir you wrote in is gone, along with the marker key)?

Thanks for the suggestion. This turned out to be a much better way to test it than what I did.


pkg/storage/engine/rocksdb.go, line 2954 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I don't feel great about making something an interface when it doesn't actually support lots of the operations (granted, the interfaces here have lots of room for improvement, too). You probably have a reason, but I don't (yet) know what it is. Add that to the commit message. Or, if it isn't needed, just don't implement the interface and remove the panics. If you'd like to check statically that the receiver implements the "relevant" part of the interface, you can do

type rocksDBSstFileWriterAdapter struct{*RocksDBSstFileWriter}
var _ Writer = (*rocksDBSstFileWriterAdapter)(nil)
func(fw *rocksDBSstFileWriterAdapter) Merge(...) error { panic("unimplemented") }
...

I forgot why this should be a Writer. I've removed the panics.

@jeffrey-xiao
Contributor Author

After an offline discussion with @nvanbenschoten and @bdarnell, it seems like it would be ideal for the entire operation to be atomic (and avoid writing SSTSnapshotInProgress for recovery). The problem with the previous approach was that the SSTs were created on the sender side, making it difficult to create SSTs to clear the subsumed replicas that are non-overlapping. Since we are building the SSTs on the receiver side, this becomes a non-issue. We would have three SSTs (range local keys, replicated range-id local keys, and data keys) that include range deletion tombstones for the subsumed replicas. The range deletion tombstones in the SSTs might overlap, but the SSTs won't overlap. How does this sound @tbg?

@tbg
Member

tbg commented Jul 25, 2019

Oh, interesting. I thought that was the approach originally taken but that there were obstacles with it; I had no idea this was because we originally thought to construct SSTs at the sender. Apologies for not checking this assumption.
Yes, if we can make this work that would be preferable. Let's forget about subsumed replicas for a second and just consider the case in which the existing replica gets "shortened", i.e. it split and we're catching up the recipient replica across the split - I think that already covers 100% of the complexity in the sense that the remaining key range will act just the same as a replica you need to subsume.

You are essentially suggesting that we get the SSTs that we need by just taking the SSTs that assume no subsumption and extending them via suitable range deletion tombstones so that everything we need to subsume is covered, correct?

I think you need four snapshots to cover a range in general:

  1. unreplicated rangeID-based keys
  2. replicated rangeID-based keys
  3. replicated key-based parallel records (range descriptor, txn records, etc)
  4. user data

You are receiving a large snapshot, which potentially takes minutes. At this point, you don't want to lock the replicas you subsume, so while you can "peek" and get an idea of what they will be, you won't know for sure yet. Now you've written a bunch of SSTs (or you've received all the data, maybe you're not finishing the writers just yet but you've synced all to disk that you can); now you go into the apply phase where you actually learn what you need to subsume. Can you still throw appropriate range deletion tombstones into the SSTs, then finish them, and immediately move to ingestion? From what I understand, you can.
Since subsumption is not a common case (or so we hope) it might be easier if you could actually finish the SSTs whenever it's convenient (i.e. well before applying) and add new SSTs to your set if you need to cover additional keyspace (the SSTs contain only a range deletion or, if we need an actual datum to avoid it being "empty", it could contain a bogus deletion also implied by the range deletion tombstone).

Now back to subsuming replicas, does this add anything? As far as I can tell you'd just have to add more tombstones to the SSTs for some additional key ranges. I don't even think the tombstones would need to be overlapping since with each step you know what you already have (the original range plus anything subsumed so far).
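A minimal sketch of what extending the SSTs with suitable range deletion tombstones could look like, using the ClearRange operation that the first commit below adds to the SST file writer; the helper and the interface shape here are hypothetical.

```go
package storagesketch

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage/engine"
)

// addSubsumedClears lays a range deletion tombstone over each span of
// subsumed keyspace that the snapshot's own data does not already cover, so
// that ingesting the finished SSTs atomically clears the subsumed replicas.
func addSubsumedClears(
	sst interface {
		ClearRange(start, end engine.MVCCKey) error
	},
	subsumedSpans []roachpb.Span,
) error {
	for _, sp := range subsumedSpans {
		start := engine.MakeMVCCMetadataKey(sp.Key)
		end := engine.MakeMVCCMetadataKey(sp.EndKey)
		if err := sst.ClearRange(start, end); err != nil {
			return err
		}
	}
	return nil
}
```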

@tbg
Member

tbg commented Jul 25, 2019

Regarding your split-merge example above, I think your precise example can't happen as written, but a similar one can. Here's how the keyspace is partitioned in the steps in your example:

a      b      c      d
|--1---|---2---------|    S1
|--1---|---2--|---3--|    S2
|--1----------|---3--|    S3

(Pipe denotes range boundary, numbers are rangeIDs, S1 is step 1, etc). The thing that can't happen is that r2 merges into r1 while n2 is down. This is because the merge will not commit unless all replicas catch up to the "beginning" of the merge transaction. You're saying that a snapshot of r1 from S3 can meet S1 on n2, but this means that r2 has never caught up to the beginning of the merge - if it had, it would've split off r3. So this particular example is bust, but there is another one that we should talk about next.

This next example is interesting because it shows us that there are very likely bugs. The example is actually in a comment in today's code (there's a typo at the beginning, should be [c,e) not [b,e):

// TODO(benesch): we may be unnecessarily forcing another Raft snapshot here
// by subsuming too much. Consider the case where [a, b) and [c, e) first
// merged into [a, e), then split into [a, d) and [d, e), and we're applying a
// snapshot that spans this merge and split. The bounds of this snapshot will
// be [a, d), so we'll subsume [c, e). But we're still a member of [d, e)!
// We'll currently be forced to get a Raft snapshot to catch up. Ideally, we'd
// subsume only half of [c, e) and synthesize a new RHS [d, e), effectively
// applying both the split and merge during snapshot application. This isn't a
// huge deal, though: we're probably behind enough that the RHS would need to
// get caught up with a Raft snapshot anyway, even if we synthesized it
// properly.

and it goes like this:

a      b      c      d     e
|--1----------|--2---------|  S1
|--1-----------------------|  S2
|--1-----------------|--3--|  S3

So the ranges merge, then split again at key that was formerly in the RHS. Since the merge is the first operation that happens, a follower could be down when it completes, which means reasonably a r1-snapshot from S3 can hit r1 from S1. It will overlap r2 (from S1) which will become a subsuming replica.

Now the comment in the code goes on to claim that this is a problem because [c,e) from S1 is basically just an older version of [d,e) from S3 that would catch up (i.e. we're gc'ing a live replica, which would actually constitute a fat correctness problem, which the comment doesn't recognize), but I think that's wrong actually - they're different ranges. In S1, [c,e) is r2; in S3 [d,e) is r3. Subsuming is the right thing to do. In fact, r2@S1 also blocks incoming snapshots from r3@S3, and these snapshots cannot just remove r2@S1 because they have to prove that r1@S1 cannot catch up to the merge trigger any more (if the merge trigger fires and the subsumed replica is missing, things go boom). In practice this is enforced by the replica GC queue, which will know to check whether the LHS of the merge is still on that store and prevents GC in that case.

But remember why we were looking at this example originally: yes, the subsuming replicas can definitely extend past the snapshot - even though a comment a bit further up claims the exact opposite and even gives an argument:

// Any replicas that overlap with the bounds of the incoming snapshot are ours
// to subsume; further, the end of the last overlapping replica will exactly
// align with the end of the snapshot. How are we guaranteed this? Each merge

The comment fails to recognize that the snapshot can reflect a split (i.e. S2->S3), in the absence of splits it would check out. However, it seems that the code that actually collects the subsumed replicas

for endKey.Less(inSnap.State.Desc.EndKey) {
	sRepl := r.store.LookupReplica(endKey)
	if sRepl == nil || !endKey.Equal(sRepl.Desc().StartKey) {
		log.Fatalf(ctx, "snapshot widens existing replica, but no replica exists for subsumed key %s", endKey)
	}
	sRepl.raftMu.Lock()
	subsumedRepls = append(subsumedRepls, sRepl)
	endKey = sRepl.Desc().EndKey
}

does pick up an "extending" last replica without problems. We'll have to check whether this "lines up exactly" comment is actually relied upon anywhere.

@tbg
Member

tbg commented Jul 25, 2019

PS #36611 (comment) has some relevant example about a bug that exists today where we allow a snapshot in that shouldn't. This doesn't imply anything on this PR (we just have to fix the checks) but might be interesting to look at anyway.

Rename Add to Put and Delete to Clear. Additionally implement ClearRange
using DBSstFileWriterDeleteRange and ClearRangeIter using a Clear on all
iterated keys on the Go side.

Release note: None
This method truncates the SST file being written and returns the data
that was truncated. It can be called multiple times when writing an SST
file and can be used to chunk an SST file into pieces. Since SSTs are
built in an append-only manner, the concatenated chunks are equivalent
to an SST built with Finish alone, without using Truncate.

Release note: None
SSTSnapshotStorage is associated with a store and can be used to create
SSTSnapshotStorageScratches. Each SSTSnapshotStorageScratch is
associated with a snapshot and keeps track of the SSTs incrementally
created when receiving a snapshot.

Release note: None
Incrementally build SSTs from the batches sent in a KV_BATCH snapshot.
This logic is only on the receiver side for ease of testing and
compatibility.

The complications of subsumed replicas that are not fully contained by
the current replica are also handled. The following is an example of
this case happening.

a       b       c       d
|---1---|-------2-------|  S1
|---1-------------------|  S2
|---1-----------|---3---|  S3

Since the merge is the first operation to happen, a follower could be
down before it completes. It is reasonable for r1-snapshot from S3 to
subsume both r1 and r2 in S1. Note that it's impossible for a replica to
subsume anything to its left.

The maximum number of SSTs created using the strategy is 4 + SR + 2
where SR is the number of subsumed replicas.

- Three SSTs get created when the snapshot is being received (range
  local keys, replicated range-id local keys, and user keys).
- One SST is constructed for the unreplicated range-id local keys when
  the snapshot is being applied.
- One SST is constructed for every subsumed replica to clear the
  range-id local keys. These SSTs consist of one range deletion
  tombstone and one RaftTombstoneKey.
- A maximum of two SSTs for all subsumed replicas are constructed to
  account for the case of not fully contained subsumed replicas. We need to
  delete the key space of the subsumed replicas that we did not delete
  in the previous SSTs. We need one for the range-local keys and one for
  the user keys. These SSTs consist of normal tombstones, one range
  deletion tombstone, or they could be empty.

This commit also introduced a cluster setting
"kv.snapshot_sst.sync_size" which defines the maximum SST chunk size
before fsync-ing. Fsync-ing is necessary to prevent the OS from
accumulating such a large buffer that it blocks unrelated small/fast
writes for a long time when it flushes.

Release note (performance improvement): Snapshots sent between replicas
are now applied more performantly and use less memory.
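For reference, registering a byte-size cluster setting like the one named above looks roughly like this; the description and default are placeholders rather than what the PR actually ships.

```go
package storagesketch

import "github.com/cockroachdb/cockroach/pkg/settings"

// Sketch of the setting named in the commit message; the description and
// default here are placeholders.
var snapshotSSTSyncSize = settings.RegisterByteSizeSetting(
	"kv.snapshot_sst.sync_size",
	"threshold after which an SST being built for a snapshot is synced to disk",
	2<<20, // 2 MB placeholder
)
```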
@jeffrey-xiao
Contributor Author

bors try

craig bot pushed a commit that referenced this pull request Aug 9, 2019
@craig
Contributor

craig bot commented Aug 9, 2019

try

Build succeeded

@jeffrey-xiao
Contributor Author

Thanks for all your help reviewing this PR!

bors r+

craig bot pushed a commit that referenced this pull request Aug 9, 2019
38932: storage: build SSTs from KV_BATCH snapshot r=jeffrey-xiao a=jeffrey-xiao

Implements the SST snapshot strategy discussed in #16954 and partially implemented in #25134 and #38873, but only includes the logic on the receiver side for ease of testing and compatibility. This PR also handles the complications of subsumed replicas that are not fully contained by the current replica.

The maximum number of SSTs created using this strategy is 4 + SR + 2 where SR is the number of subsumed replicas.

- Three SSTs get streamed from the sender (range local keys, replicated range-id local keys, and data keys)
- One SST is constructed for the unreplicated range-id local keys.
- One SST is constructed for every subsumed replica to clear the range-id local keys. These SSTs consist of one range deletion tombstone and one `RaftTombstone` key.
- A maximum of two SSTs for all subsumed replicas to account for the case of not fully contained subsumed replicas. Note that currently, subsumed replicas can have keys right of the current replica, but not left of, so there will be a maximum of one SST created for the range-local keys and one for the data keys. These SSTs consist of one range deletion tombstone.

This number can be further reduced to 3 + SR if we pass the file handles and sst writers from the receiving step to the application step. We can combine the SSTs of the unreplicated range id and replicated id, and the range local of the subsumed replicas and data SSTs of the subsumed replicas. We probably don't want to do this optimization since we'll have to undo this optimization if we start constructing the SSTs from the sender or start chunking large SSTs into smaller SSTs.

Blocked by facebook/rocksdb#5649.

# Test Plan

- [x] Testing knob to inspect SSTs before ingestion. Ensure that expected SSTs for subsumed replicas are ingested.
- [x] Unit tests for `SSTSnapshotStorage`.
 
# Metrics and Evaluation

One way to evaluate this change is the following steps:

1. Setup 3 node cluster
2. Set default Raft log truncation threshold to some low constant:
```go
defaultRaftLogTruncationThreshold = envutil.EnvOrDefaultInt64(
    "COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD", 128<<10 /* 128 KB */)
```
3. Set `range_min_bytes` to 0 and `range_max_bytes` to some large number.
4. Increase `kv.snapshot_recovery.max_rate` and `kv.snapshot_rebalance.max_rate` to some large number.
5. Disable load-based splitting.
6. Stop node 2.
7. Run an insert heavy workload (kv0) on the cluster.
8. Start node 2.
9. Time how long it takes for node 2 to have all the ranges.

Roachtest: https://gist.github.com/jeffrey-xiao/e69fcad04968822d603f6807ca77ef3b

We can have two independent variables

1. Fixed total data size (4000000 ops; ~3.81 GiB), variable number of splits
- 32 splits (~121 MiB ranges)
- 64 splits (~61.0 MiB ranges)
- 128 splits (~31.2 MiB ranges)
- 256 splits (~15.7 MiB ranges)
- 512 splits (~7.9 MiB ranges)
- 1024 splits (~3.9 MiB ranges)
2. Fixed number of splits (32), variable total data size
- 125000 (~ 3.7 MiB ranges)
- 250000 (~7.5 MiB ranges)
- 500000 (~15 MiB ranges)
- 1000000 (~30 MiB ranges)
- 2000000 (60 MiB ranges)
- 4000000 (121 MiB ranges)

# Fsync Chunk Size

The size of the SST chunk that we write before fsync-ing impacts how fast node 2 has all the ranges. I've experimented with 32 splits and a median range size of 121 MB: no fsync-ing (~27s recovery), fsync-ing in 8 MB chunks (~30s recovery), fsync-ing in 2 MB chunks (~40s recovery), and fsync-ing in 256 KB chunks (~42s recovery). The default bulk SST sync rate is 2 MB and #20352 sets `bytes_per_sync` to 512 KB, so something between those options is probably good. The reason we would want to fsync is to prevent the OS from accumulating such a large buffer that it blocks unrelated small/fast writes for a long time when it flushes.

# Impact on Foreground Traffic

For testing the impact on foreground traffic, I ran kv0 on a four node cluster with the merge queue and split queue disabled and starting with a constant number of splits. After 5 minutes, I decommissioned node 1 so its replicas would drain to other nodes using snapshots.

Roachtest: https://gist.github.com/jeffrey-xiao/5d9443a37b0929884aca927f9c320b6c

**Average Range Size of 3 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398633-41a2bb00-b547-11e9-9e3d-747ee724943b.png)
- [After](https://user-images.githubusercontent.com/8853434/62398634-41a2bb00-b547-11e9-85e7-445b7989d173.png)

**Average Range Size of 32 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398631-410a2480-b547-11e9-9019-86d3bd2e6f73.png)
- [After](https://user-images.githubusercontent.com/8853434/62398632-410a2480-b547-11e9-9513-8763e132e76b.png)

**Average Range Size 128 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398558-15873a00-b547-11e9-8ab6-2e8e9bae658c.png)
- [After](https://user-images.githubusercontent.com/8853434/62398559-15873a00-b547-11e9-9c72-b3e90fce1acc.png)

We see p99 latency wins for larger range sizes and comparable performance for smaller range sizes.

Release note (performance improvement): Snapshots sent between replicas are now applied more performantly and use less memory.

Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>
@craig
Contributor

craig bot commented Aug 9, 2019

Build succeeded

@craig craig bot merged commit b320ff5 into cockroachdb:master Aug 9, 2019
@tbg
Member

tbg commented Aug 9, 2019 via email

)
unreplicatedStateLog := fmt.Sprintf(
"unreplicatedState=%0.0fms ",
stats.unreplicatedState.Sub(start).Seconds()*1000,

I thought we didn't want to print this unless the legacy state is present (i.e. never). I'm also OK nuking this unconditionally since it's such a small write and won't ever matter

tbg added a commit to tbg/cockroach that referenced this pull request Aug 12, 2019
This reverts commit 717c185.

Apparently we violate the assertions. This needs to be fixed, but until
then, let's keep the ball rolling.

One likely culprit is cockroachdb#38932, see:

cockroachdb#39034 (comment)

Release note: None
craig bot pushed a commit that referenced this pull request Aug 12, 2019
39562: Revert "c-deps: fix assertion-enabled builds" r=knz a=tbg

This reverts commit 717c185.

Apparently we violate the assertions. This needs to be fixed, but until
then, let's keep the ball rolling.

The assertion failures typically take the form

> L0 file with seqno 90 90 vs. file with global_seqno 90
SIGABRT: abort

See for example #39559

One likely culprit is #38932, see:

#39034 (comment)

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@nvanbenschoten
Member

When we revive this PR, we can update this comment:

// of keeping it all in memory at once. (Well, at least on the sender side. On
// the recipient side, we do still buffer it, but we'll fix that at some point).

craig bot pushed a commit that referenced this pull request Aug 15, 2019
39689: storage: reintroduce building SSTs from KV_BATCH snapshot r=jeffrey-xiao a=jeffrey-xiao

The final commit of #38932 was previously reverted to due an underlying bug in RocksDB with ingesting range deletion tombstones with a global seqno. See #39604 for discussion on the bug and cockroachdb/rocksdb#43 for the temporary short-term resolution of the bug.

Release note: None


Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>
@jeffrey-xiao jeffrey-xiao deleted the kv-batch-stream branch August 15, 2019 17:47