
Fix #3 with a xor diffing scheme #4

Merged: pkhuong merged 26 commits from pkhuong/base-chunk into main on Feb 8, 2022
Conversation

@pkhuong (Collaborator) commented Feb 7, 2022

When manifests grow large, we can now represent their constituent fingerprints as xor-diffs on top of a base chunk. As long as the bytes in the base chunk are similar to the list of fingerprints, the manifest will have a lot of zero bytes when serialised, and zstd will easily compress them away.

The main change is the optional xoring of a base chunk. However, a lot of the new code goes into making this change effective, in particular:

  • Adding compression for manifests, to actually save space when the manifest is full of zeros
  • Keeping a reference to the last base chunk we read for a database path, to guarantee it will be available in cache

These changes alleviate the issue with large manifests for large (e.g., 1-2 GB) databases that change slowly (maybe 2-3 pages per transaction), and were primarily tested by running the sqlite tests with invariant checks on the snapshots (staged, ready, and consuming) around every write.
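
As a rough illustration of the core transform (a sketch, not the crate's actual API): xor is its own inverse, so one helper both applies the diff before serialisation and undoes it after decoding.

```rust
// Minimal sketch of the xor-diff transform (illustrative, not the crate's
// actual API). Because xor is its own inverse, the same helper applies the
// diff on the write path and removes it on the read path.
fn xor_with_base(fingerprint_bytes: &mut [u8], base: &[u8]) {
    // Bytes past the end of the shorter slice are left untouched, so a
    // short or stale base chunk degrades gracefully instead of corrupting.
    for (byte, base_byte) in fingerprint_bytes.iter_mut().zip(base.iter()) {
        *byte ^= base_byte;
    }
}
```

Wherever the base chunk still matches, the xor-ed bytes become zero, and zstd compresses long runs of zeros almost for free.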

@pkhuong force-pushed the pkhuong/base-chunk branch 11 times, most recently from e9404b5 to 9e5cc67, February 8, 2022 14:46
When we also upload "base" arrays of fingerprints as chunks, we may
send larger chunks than 64 KB.  Verneuil already manages 3-4 GB
databases, which translates into base chunks of up to 512 KB, and
that's uncomfortably close to the 1 MB limit on chunk size.

Bump that up to a bit more than 256 MB, which would be enough even
for a 1 TB database (although we'd likely want a different approach
to metadata at that size).

In preparation for #3.
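
To make the sizing concrete, here's the back-of-the-envelope arithmetic behind the new limit (illustrative constants; the 64 KiB granularity and 16-byte fingerprints are assumptions, not the crate's real names):

```rust
// Assumed parameters: 64 KiB snapshot granularity, 16-byte fingerprints.
const PAGE_SIZE: u64 = 64 << 10; // 64 KiB per chunk
const FINGERPRINT_SIZE: u64 = 16; // bytes per chunk fingerprint
const DB_SIZE: u64 = 1 << 40; // a 1 TiB database

// 2^24 fingerprints at 16 bytes each: a 256 MiB base chunk.
const BASE_CHUNK_SIZE: u64 = (DB_SIZE / PAGE_SIZE) * FINGERPRINT_SIZE;
```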
…hing

It shouldn't matter whether we can parse the manifest: the copier
will upload bytes where they should go and clean up the ready directory.

In preparation for more complex manifest decoding in #3.

TESTED=sqlite tests.
It's harmless and will be part of the public interface to help
consumers keep chunks in cache.

In preparation for phase 1 of #3.
The array of chunk fingerprints in manifest protos can be long and is
rarely compressible.  However, we also know it doesn't change much from
one snapshot to the next, so we should be able to expose that redundancy
to compressors.

Simply apply an xor filter to that array, with a base sequence of bytes
stored in a regular content-addressed chunk, when such a base chunk
is defined in the manifest proto.

This is currently a no-op because we never populate the field and
we also never pass a list of `ReplicationTarget`s to fetch the
base chunk from.

Implements phase 1 of #3.

TESTED=sqlite tests.
…_chunks`

Which the only caller, copier.rs, can readily provide.

A no-op while we still don't use the base chunk array functionality.

Makes phase 1 of #3 correctly configurable.

TESTED=follow-up commits.
Pipe it back to the caller from `parse_manifest_chunks`, and add
it to the list of candidate chunks for patrol touching.

There is currently never any base chunk, so there's no behaviour
change yet.

Part of #3.

TESTED=read the code very closely, a bug here would be hard to find.
We'll need that to pass a cache builder when checking that
staged/ready/consuming manifests make sense.

TESTED=it builds.
That functionality is only used by the change tracker, `tracker.rs`,
and it always has such a list on hand.

Moreover, when reading the current staged manifest, make sure to try
and load from the staged chunk directory: we should find any chunk
needed to decode the manifest in there.

We also have to relax the invariant checking logic when the validator
might be racing with a snapshotter: when that happens, we could read a
manifest file, and not find the corresponding chunks.  We can only
rely on the staged manifest being readable when we hold a lock (read
or write) on the sqlite db.

Makes phase 1 of #3 correctly configurable.

TESTED=follow-up commits.
…ilure

If the manifest file at the path we just looked at disappeared or has
changed (points to a different inode), bubble that failure up as a more
benign missing file condition: we probably failed to decode the
manifest because its base chunk was missing, and we know that can
happen when the manifest changes from underneath us.

TESTED=deflakes very fast test runs.
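
On Unix, "same file" can be checked by comparing device and inode numbers; a hypothetical sketch of that check (not the crate's actual helper):

```rust
use std::fs::File;
use std::io::{ErrorKind, Result};
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Hypothetical helper: does `file`, which we opened from `path`, still
/// match whatever currently lives at `path`?
fn still_same_file(file: &File, path: &Path) -> Result<bool> {
    let opened = file.metadata()?;
    match std::fs::metadata(path) {
        // Same device and inode number means the same underlying file.
        Ok(current) => Ok(opened.dev() == current.dev() && opened.ino() == current.ino()),
        // The manifest disappeared: it definitely changed from underneath us.
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```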
The Tracker doesn't use the information yet, but it will be useful
when we want to reuse base chunks.

TESTED=it builds.
The function *always* builds a fresh base chunk (a copy of the current
list of chunk fingerprints), which, while not super useful in terms of
saving bandwidth or storage, does maximally exercise our diff
application logic.

Completes phase 2 of #3.

TESTED=sqlite tests.
If only because of backward compatibility, we have to try and detect
when data is definitely not compressed, and avoid reporting that as a
decoding error.  Callers can then decide if they want to decode the
raw or decompressed data.  In practice, our data is unlikely to
accidentally include a zstd header, so we can assume we want to
propagate decoding errors.

In preparation for zstd-compressed manifest blobs (step 3 of #3).

TESTED=sqlite tests.
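
Every zstd frame starts with the little-endian magic number 0xFD2FB528, so "definitely not compressed" is a four-byte check; a minimal sketch (hypothetical helper name):

```rust
/// A zstd frame always starts with the magic number 0xFD2FB528,
/// stored little-endian.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Hypothetical helper: anything that doesn't start with the magic
/// number is definitely not zstd-compressed.
fn looks_zstd_compressed(bytes: &[u8]) -> bool {
    bytes.starts_with(&ZSTD_MAGIC)
}
```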
Instead of pre-allocating for the worst case, we now use a dedicated
`Write` implementation that dumps into a vector, but only after checking
that the result length does not exceed the decompressed size limit.

Improves the memory efficiency of #3 in the common case.
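
A sketch of that bounded sink, assuming we decompress through anything that accepts a `std::io::Write` (the type and field names here are made up):

```rust
use std::io::{Error, ErrorKind, Result, Write};

/// Hypothetical bounded sink: grows a `Vec` on demand, but refuses to
/// hold more than `limit` bytes instead of pre-allocating the worst case.
struct BoundedVec {
    buf: Vec<u8>,
    limit: usize,
}

impl Write for BoundedVec {
    fn write(&mut self, data: &[u8]) -> Result<usize> {
        // Check before copying: the buffer never exceeds the limit.
        if self.buf.len().saturating_add(data.len()) > self.limit {
            return Err(Error::new(
                ErrorKind::InvalidData,
                "decompressed size limit exceeded",
            ));
        }

        self.buf.extend_from_slice(data);
        Ok(data.len())
    }

    fn flush(&mut self) -> Result<()> {
        Ok(())
    }
}
```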
If the bytes don't begin with the zstd header, directly parse them as
proto.

If the bytes begin with the zstd header and can be decompressed, only
attempt to parse the decompressed bytes: it seems highly improbable
that a valid proto message would also be a valid zstd stream,
including checksums.

Otherwise, the bytes begin with a zstd header but fail decompression.
Give decoding the raw bytes (as proto) a shot; if that fails, report
the decompression failure.

Making sure field 5 in `Manifest` isn't used as a varint guarantees
our proto messages don't start with the zstd header, which avoids the
last ambiguous case, where we try to zstd decompress, fail, and try to
parse the raw bytes as a proto.  Document that restriction on future
schemas.

Implements step 3 of #3.

TESTED=sqlite tests.
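
Putting the three cases together, a sketch of the decode order (assumes the `prost` and `zstd` crates; `M` stands in for the generated `Manifest` type):

```rust
use prost::Message;

// Disambiguation note: a zstd frame starts with byte 0x28, which protobuf
// would read as the tag for field 5 with varint wire type
// ((5 << 3) | 0 == 40 == 0x28), so keeping field 5 away from varints
// guarantees a raw manifest never starts with the zstd magic.
fn decode_blob<M: Message + Default>(bytes: &[u8]) -> std::io::Result<M> {
    let as_proto = |bytes: &[u8]| {
        M::decode(bytes).map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))
    };

    if !bytes.starts_with(&[0x28, 0xB5, 0x2F, 0xFD]) {
        // No zstd header: the bytes can only be a raw proto message.
        return as_proto(bytes);
    }

    match zstd::stream::decode_all(bytes) {
        // Decompression succeeded: only the decompressed bytes can be valid.
        Ok(decompressed) => as_proto(&decompressed),
        // zstd header, but decompression failed: give raw proto a shot,
        // and surface the decompression failure if that also fails.
        Err(zstd_error) => as_proto(bytes).map_err(|_| zstd_error),
    }
}
```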
We can now expect some compressibility when we use a base chunk for
the list of chunk fingerprints.  Try to take advantage of that with
quick compression.

We now unconditionally ask zstd to compress manifest blobs as quickly
as possible: in the worst case, we take a 12-byte hit for zstd
framing.  That seems negligible, and protobuf bytes tend to be a
little compressible, so the actual overhead is less than that.

Also rename the compression level constants to be clearer about
intent.

Implements step 4 of #3.

TESTED=sqlite tests, looked for compressed manifests in minio during tests.
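
With the `zstd` crate, the unconditional fast path is essentially a one-liner; level 1 here stands in for the renamed "fastest" constant:

```rust
/// Sketch of unconditional fast compression for manifest blobs, assuming
/// the `zstd` crate; level 1 stands in for the renamed "fastest" constant.
fn compress_manifest(serialised: &[u8]) -> std::io::Result<Vec<u8>> {
    // Worst case, zstd framing adds ~12 bytes, and protobuf bytes tend to
    // be a little compressible, so the real overhead is usually smaller.
    zstd::stream::encode_all(serialised, 1)
}
```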
When a content-addressed chunk is larger than the snapshot
granularity (chunk size), it's probably not snapshot data,
which means it's probably a base list of fingerprints.

Either way, it's large, and we probably don't want to waste too much
time compressing it.  When we're asked to compress at the default
level and find such a large chunk, instead compress at the minimum
quality (maximum speed).

Improves the upload speed of large base fingerprint chunks (for #3).

TESTED=sqlite tests?
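
A sketch of that heuristic (hypothetical names and levels, not the crate's real constants):

```rust
/// Hypothetical level-selection heuristic: chunks larger than the snapshot
/// granularity are probably base fingerprint lists, so compress them at
/// maximum speed instead of the default level.
fn chunk_compression_level(payload_len: usize, chunk_size: usize, requested: Option<i32>) -> i32 {
    const DEFAULT_LEVEL: i32 = 3; // zstd's usual default
    const FASTEST_LEVEL: i32 = 1;

    match requested {
        // An explicit request always wins.
        Some(level) => level,
        // Default requested for a large chunk: favour speed.
        None if payload_len > chunk_size => FASTEST_LEVEL,
        None => DEFAULT_LEVEL,
    }
}
```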
Retaining this `Arc<Chunk>` for each spool state guarantees the next
patrol call will be able to find that chunk in the global cache when
decoding the most recent manifest.

Part of step 8 of #3.

TESTED=it builds?
We don't want it to drop from cache, in case someone creates a new
updated snapshot and finds a reference to the same base chunk.

Part of step 8 of #3.

TESTED=it builds.
When we wish to use the local Chunk cache as a *write through* cache,
we can't use `Loader::fetch_chunk`.  This new function supports the
write-through use case.

Will be used for step 8 of #3.

TESTED=follow-up commit.
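
The difference from a plain fetch is that a miss populates the local cache before the chunk is handed out. A self-contained sketch with stand-in types (none of these are the crate's real `Loader`/`Chunk` API):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in types just to keep the sketch self-contained; the crate's real
// `Fingerprint`, `Chunk`, and cache types are more involved than this.
type Fingerprint = [u8; 16];
type Chunk = Vec<u8>;
struct Cache(Mutex<HashMap<Fingerprint, Arc<Chunk>>>);

impl Cache {
    /// Hypothetical write-through fetch: on a miss, insert the remotely
    /// fetched chunk into the local cache *before* returning it, so later
    /// readers find it locally.
    fn fetch_chunk_write_through(
        &self,
        fprint: Fingerprint,
        fetch_remote: impl FnOnce(&Fingerprint) -> std::io::Result<Chunk>,
    ) -> std::io::Result<Arc<Chunk>> {
        let mut map = self.0.lock().unwrap();
        if let Some(chunk) = map.get(&fprint) {
            return Ok(chunk.clone());
        }

        let chunk = Arc::new(fetch_remote(&fprint)?);
        // The write-through step: cache first, then hand the chunk out.
        map.insert(fprint, chunk.clone());
        Ok(chunk)
    }
}
```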
and stick it in `Tracker::recent_base_chunk`, to help future
lookups.

Completes step 8 of #3.

TESTED=sqlite tests.
We'll want to be smart about whether we keep the current base chunk or
create a new one.

In preparation for step 6 of #3.

TESTED=it builds?
and always decide to create a fresh chunk.

In preparation for a smarter policy (step 6 of #3).

TESTED=sqlite tests.
We essentially want to keep using a base fingerprint chunk while it's
still a good match for our actual list of chunk fingerprints.  Simply
reuse with probability proportional to the match rate.

On top of that, when we don't want to reuse the current chunk, look at
the size of the list of fingerprints: it's not useful to indirect
through a base chunk when the list is short anyway.  For now, set an
arbitrary cutoff at 1024: if a manifest serialises to ~16 KB or less,
don't try to make it compress better.

We only take the list size into account when we're about to create a
new base chunk, instead of attempting to add some hysteresis: we're
not worried about rapid alternation between having and not having a
base chunk when the database goes between 1023 and 1024 chunks because
we only "lose" a base chunk stochastically, when it's not useful.

Closes #3, by implementing the policy (steps 6 and 7).

The test-only random policy shows:
1. manifests with no base chunks
2. manifests with a fresh base chunk
3. manifests with a fully stale base chunk
4. manifests with a partially stale base chunk

TESTED=sqlite tests for lack of crash, manual inspection for impact during tests.
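
A sketch of the policy described above (hypothetical names; assumes the `rand` crate). The 1024 cutoff matches the arithmetic in the commit message: 1024 fingerprints at 16 bytes each serialise to roughly 16 KiB.

```rust
/// Hypothetical outcome of the policy: keep the current base chunk,
/// build a fresh one, or go without.
enum BaseChunkAction {
    Reuse,
    CreateFresh,
    None,
}

/// Sketch of the stochastic reuse policy, assuming the `rand` crate.
fn base_chunk_policy(current: &[u8], base: &[u8], num_fingerprints: usize) -> BaseChunkAction {
    use rand::Rng;

    // Fraction of bytes the base chunk still predicts correctly.
    let matching = current.iter().zip(base.iter()).filter(|(a, b)| a == b).count();
    let match_rate = matching as f64 / current.len().max(1) as f64;

    // Keep the current base chunk with probability equal to the match
    // rate: a stale chunk is only dropped stochastically.
    if rand::thread_rng().gen::<f64>() < match_rate {
        return BaseChunkAction::Reuse;
    }

    // Short fingerprint lists (~16 KiB serialised or less) aren't worth
    // the indirection; otherwise, build a fresh base chunk.
    if num_fingerprints >= 1024 {
        BaseChunkAction::CreateFresh
    } else {
        BaseChunkAction::None
    }
}
```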
We expect a lot more repetitions now that we have (and reuse) base
chunks.  Make the cache comfortably larger than the working set size,
to make it much more likely that we will catch redundant uploads
of the same base chunk.

TESTED=it builds.
Large enough manifests have their list of fingerprints xor-ed with a
"base [data] chunk."  When the list's bytes and the base chunk are
similar, `xor`ing them together creates a lot of zeros, which zstd
then compresses efficiently.

That's more robust than a logical diffing scheme, and probably more
efficient (space and time) than what a reasonable effort would yield.

Documents the solution to #3.

TESTED=it's all docs.
@pkhuong force-pushed the pkhuong/base-chunk branch 2 times, most recently from 7f8da70 to 896cde9, February 8, 2022 17:26
@pkhuong merged commit edacfeb into main on Feb 8, 2022
@pkhuong deleted the pkhuong/base-chunk branch on February 8, 2022 18:00