Fix #3 with a xor diffing scheme #4
Merged
Conversation
pkhuong force-pushed the pkhuong/base-chunk branch 11 times, most recently from e9404b5 to 9e5cc67 on February 8, 2022 at 14:46.
TESTED=ran the tests with a panic.
When we also upload "base" arrays of fingerprints as chunks, we may send larger chunks than 64 KB. Verneuil already manages 3-4 GB databases, which translates into base chunks of up to 512 KB, and that's uncomfortably close to the 1 MB limit on chunk size. Bump that up to a bit more than 256 MB, which would be enough even for a 1 TB database (although we'd likely want a different approach to metadata at that size). In preparation for #3.
…hing It shouldn't matter if we can or cannot parse the manifest: the copier will upload bytes where they should go and clean up the ready directory. In preparation for more complex manifest decoding in #3. TESTED=sqlite tests.
It's harmless and will be part of the public interface to help consumers keep chunks in cache. In preparation for phase 1 of #3.
The array of chunk fingerprints in manifest protos can be long and is rarely compressible. However, we also know they don't change a lot, so we should be able to expose redundancy for compressors. Simply apply a xor filter to that array, with a base sequence of bytes stored in a regular content-addressed chunk, when such a base chunk is defined in the manifest proto. This is currently a no-op because we never populate the field and we also never pass a list of `ReplicationTarget`s to fetch the base chunk from. Implements phase 1 of #3. TESTED=sqlite tests.
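The xor filter described above can be sketched as follows. This is a hypothetical helper, not Verneuil's actual implementation; it assumes the base chunk may differ in length from the serialised fingerprint array, and cycles through the base in that case.

```rust
/// Illustrative sketch of the xor filter (not Verneuil's actual code).
/// Xors `bytes` in place with `base`, cycling through `base` when it is
/// shorter than `bytes`.  Xor is an involution: applying the same base
/// twice restores the original bytes, so this one function both applies
/// and removes the filter.
fn xor_with_base(bytes: &mut [u8], base: &[u8]) {
    if base.is_empty() {
        return; // No base chunk defined: leave the bytes untouched.
    }

    for (i, byte) in bytes.iter_mut().enumerate() {
        *byte ^= base[i % base.len()];
    }
}
```

When the fingerprint array and the base chunk mostly agree, the filtered bytes are mostly zero, which a compressor like zstd shrinks to almost nothing.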
…_chunks` Which the only caller, copier.rs, can readily provide. A no-op while we still don't use the base chunk array functionality. Makes phase 1 of #3 correctly configurable. TESTED=follow-up commits.
Pipe it back to the caller from `parse_manifest_chunks`, and add it to the list of candidate chunks for patrol touching. There is currently never any base chunk, so there's no behaviour change yet. Part of #3. TESTED=read the code very closely, a bug here would be hard to find.
We'll need that to pass a cache builder when checking that staged/ready/consuming manifests make sense. TESTED=it builds.
That functionality is only used by the change tracker (`tracker.rs`), and it always has such a list on hand. Moreover, when reading the current staged manifest, make sure to try and load from the staged chunk directory: we should find any chunk needed to decode the manifest in there. We also have to relax the invariant checking logic when the validator might be racing with a snapshotter: when that happens, we could read a manifest file, and not find the corresponding chunks. We can only rely on the staged manifest being readable when we hold a lock (read or write) on the sqlite db. Makes phase 1 of #3 correctly configurable. TESTED=follow-up commits.
…ilure If the manifest file at the path we just looked at disappeared or has changed (points to a different inode), bubble that failure up as a more benign missing file condition: we probably failed to decode the manifest because its base chunk was missing, and we know that can happen when the manifest changes from underneath us. TESTED=deflakes very fast test runs.
The Tracker doesn't use the information yet, but it will be useful when we want to reuse base chunks. TESTED=it builds.
The function *always* builds a fresh base chunk (a copy of the current list of chunk fingerprints), which, while not super useful in terms of saving bandwidth or storage, does maximally exercise our diff application logic. Completes phase 2 of #3. TESTED=sqlite tests.
If only because of backward compatibility, we have to try and detect when data is definitely not compressed, and avoid reporting that as a decoding error. Callers can then decide if they want to decode the raw or decompressed data. In practice, our data is unlikely to accidentally include a zstd header, so we can assume we want to propagate decoding errors. In preparation for zstd-compressed manifest blobs (step 3 of #3). TESTED=sqlite tests.
Instead of pre-allocating for the worst case. We now use a dedicated `Write` implementation that dumps into a vector, but only after checking that the result length does not exceed the decompressed size limit. Improves the memory efficiency of #3 in the common case.
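A bounded writer along these lines could look like the following sketch. The type and field names are hypothetical, and it assumes the decompressor streams output through `std::io::Write`:

```rust
use std::io::{self, Write};

/// Illustrative bounded writer (not Verneuil's actual type): appends to a
/// vector, but fails any write that would push the total length past the
/// decompressed size limit, instead of pre-allocating for the worst case.
struct BoundedVec {
    buf: Vec<u8>,
    limit: usize,
}

impl Write for BoundedVec {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        if self.buf.len() + data.len() > self.limit {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "decompressed size limit exceeded",
            ));
        }

        self.buf.extend_from_slice(data);
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Rejecting the write up front keeps the buffer from ever growing past the limit, so memory use tracks the actual decompressed size rather than the worst case.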
If the bytes don't begin with the zstd header, directly parse them as proto. If the bytes begin with the zstd header and can be decompressed, only attempt to parse the decompressed bytes: it seems highly improbable that a valid proto message would also be a valid zstd stream, including checksums. Otherwise, the bytes begin with a zstd header but fail decompression. Give decoding the raw bytes (as proto) a shot; if that fails, report the decompression failure. Making sure field 5 in `Manifest` isn't used as a varint guarantees our proto messages don't start with the zstd header, which avoids the last ambiguous case, where we try to zstd decompress, fail, and try to parse the raw bytes as a proto. Document that restriction on future schemas. Implements step 3 of #3. TESTED=sqlite tests.
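The sniffing step boils down to comparing the first four bytes against zstd's frame magic number, 0xFD2FB528, which is stored little-endian on the wire. The helper name below is illustrative:

```rust
/// zstd frames start with the magic number 0xFD2FB528, stored
/// little-endian, so the first four bytes on the wire are
/// 0x28 0xB5 0x2F 0xFD.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Returns true when `bytes` could be the start of a zstd frame
/// (illustrative helper, not Verneuil's actual code).
fn starts_with_zstd_magic(bytes: &[u8]) -> bool {
    bytes.len() >= 4 && bytes[..4] == ZSTD_MAGIC
}
```

Keeping a proto field number whose tag byte could collide with `0x28` out of varint use is what guarantees a serialised manifest never starts with this magic, removing the ambiguous case.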
We can now expect some compressibility when we use a base chunk for the list of chunk fingerprints. Try to take advantage of that with quick compression. We now unconditionally ask zstd to compress manifest blobs as quickly as possible: in the worst case, we take a 12 byte hit for zstd framing. That seems negligible, and protobuf bytes tend to be a little compressible, so the actual overhead is less than that. Also rename the compression level constants to be clearer about intent. Implements step 4 of #3. TESTED=sqlite tests, looked for compressed manifests in minio during tests.
When a content-addressed chunk is larger than the snapshot granularity (chunk size), it's probably not snapshot data, which means it's probably a base list of fingerprints. Either way, it's large, and we probably don't want to waste too much time compressing it. When we're asked to compress at the default level and find such a large chunk, instead compress at the minimum quality (maximum speed). Improves the upload speed of large base fingerprint chunks (for #3). TESTED=sqlite tests?
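The size-based level selection might look like the sketch below. The constant names and level values are illustrative (zstd's level 1 is its fastest setting, level 3 its usual default), not Verneuil's actual constants:

```rust
/// Illustrative compression levels, not Verneuil's actual constants:
/// zstd level 1 is its fastest setting, level 3 its usual default.
const FASTEST_LEVEL: i32 = 1;
const DEFAULT_LEVEL: i32 = 3;

/// Picks a zstd level for a content-addressed chunk: anything larger
/// than the snapshot granularity is probably a base fingerprint list,
/// so favour speed over compression ratio for those.
fn level_for_chunk(chunk_len: usize, snapshot_granularity: usize) -> i32 {
    if chunk_len > snapshot_granularity {
        FASTEST_LEVEL
    } else {
        DEFAULT_LEVEL
    }
}
```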
Retaining this `Arc<Chunk>` for each spool state guarantees the next patrol call will be able to find that chunk in the global cache when decoding the most recent manifest. Part of step 8 of #3. TESTED=it builds?
We don't want it to drop from cache, in case someone creates a new updated snapshot and finds a reference to the same base chunk. Part of step 8 of #3. TESTED=it builds.
When we wish to use the local Chunk cache as a *write through* cache, we can't use `Loader::fetch_chunk`. This new function supports the write-through use case. Will be used for step 8 of #3. TESTED=follow-up commit.
and stick it in `Tracker::recent_base_chunk`, to help future lookups. Completes step 8 of #3. TESTED=sqlite tests.
We'll want to be smart about whether we keep the current base chunk or create a new one. In preparation for step 6 of #3. TESTED=it builds?
and always decide to create a fresh chunk. In preparation for a smarter policy (step 6 of #3). TESTED=sqlite tests.
We essentially want to keep using a base fingerprint chunk while it's still a good match for our actual list of chunk fingerprints. Simply reuse with probability proportional to the match rate. On top of that, when we don't want to reuse the current chunk, look at the size of the list of fingerprints: it's not useful to indirect through a base chunk when the list is short anyway. For now, set an arbitrary cutoff at 1024: if a manifest serialises to ~16 KB or less, don't try to make it compress better. We only take the list size into account when we're about to create a new base chunk, instead of attempting to add some hysteresis: we're not worried about rapid alternation between having and not having a base chunk when the database goes between 1023 and 1024 chunks, because we only "lose" a base chunk stochastically, when it's not useful. Closes #3, by implementing the policy (steps 6 and 7). The test-only random policy shows:

1. manifests with no base chunks
2. manifests with a fresh base chunk
3. manifests with a fully stale base chunk
4. manifests with a partially stale base chunk

TESTED=sqlite tests for lack of crash, manual inspection for impact during tests.
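The policy above could be sketched roughly as follows. The function, enum, and the way randomness is injected are all hypothetical; only the 1024-fingerprint cutoff and the match-rate reuse probability come from the description above:

```rust
/// Arbitrary cutoff from the policy description: below 1024
/// fingerprints (a manifest of roughly 16 KB or less), indirecting
/// through a base chunk isn't worth it.
const MIN_FPRINTS_FOR_BASE: usize = 1024;

#[derive(Debug, PartialEq)]
enum BaseChunkDecision {
    Reuse,       // Keep the current base chunk.
    CreateFresh, // Snapshot the current fingerprint list as a new base.
    None,        // Don't use a base chunk at all.
}

/// Hypothetical decision function: `matching` counts fingerprints that
/// still agree with the current base chunk, out of `total`; `coin` is a
/// uniform draw in [0, 1), injected for determinism in tests.
fn decide_base_chunk(matching: usize, total: usize, coin: f64) -> BaseChunkDecision {
    // Reuse with probability equal to the match rate.
    if total > 0 && coin < matching as f64 / total as f64 {
        return BaseChunkDecision::Reuse;
    }

    // Only consider building a fresh base chunk for long lists; the
    // size check is deliberately skipped on the reuse path, so a base
    // chunk is only "lost" stochastically, when it stops matching.
    if total >= MIN_FPRINTS_FOR_BASE {
        BaseChunkDecision::CreateFresh
    } else {
        BaseChunkDecision::None
    }
}
```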
We expect a lot more repetitions now that we have (and reuse) base chunks. Make the cache comfortably larger than the working set size, to make it much more likely that we will catch redundant uploads of the same base chunk. TESTED=it builds.
Large enough manifests have their list of fingerprints xor-ed with a "base [data] chunk." When the list's bytes and the base chunk are similar, `xor`ing them together creates a lot of zeros, which zstd then compresses efficiently. That's more robust than a logical diffing scheme, and probably more efficient (space and time) than what a reasonable effort would yield. Documents the solution to #3. TESTED=it's all docs.
pkhuong force-pushed the pkhuong/base-chunk branch 2 times, most recently from 7f8da70 to 896cde9 on February 8, 2022 at 17:26.
When manifests grow large, we can now represent their constituent fingerprints as xor-diffs on top of a base chunk. As long as the bytes in the base chunk are similar to the list of fingerprints, the manifest will have a lot of zero bytes when serialised, and zstd will easily compress them away.
The main change is the optional xoring of a base chunk. However, a lot of the new code goes into making this change effective.
These changes alleviate the issue with large manifests for large (e.g., 1-2 GB) databases that change slowly (maybe 2-3 pages per transaction), and were primarily tested by running the sqlite tests with invariant checks on the snapshots (staged, ready, and consuming) around every write.