
Fix #3 with a xor diffing scheme #4

Merged: pkhuong merged 26 commits from pkhuong/base-chunk into main on Feb 8, 2022
Conversation

@pkhuong (Collaborator) commented Feb 7, 2022

When manifests grow large, we can now represent their constituent fingerprints as xor-diffs on top of a base chunk. As long as the bytes in the base chunk are similar to the list of fingerprints, the manifest will have a lot of zero bytes when serialised, and zstd will easily compress them away.

The main change is the optional xoring of a base chunk. However, a lot of the new code goes into making this change effective, in particular:

  • Adding compression for manifests, to actually save space when the manifest is full of zeros
  • Keeping a reference to the last base chunk we read for a database path, to guarantee it will be available in cache

These changes alleviate the issue with large manifests for large (e.g., 1-2 GB) databases that change slowly (maybe 2-3 pages per transaction), and were primarily tested by running the sqlite tests with invariant checks on the snapshots (staged, ready, and consuming) around every write.
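
As a rough illustration of the core transform (a sketch, not the crate's actual API): xor is its own inverse, so one helper both applies the diff before serialisation and undoes it after decoding.

```rust
// Minimal sketch of the xor-diff transform (illustrative, not the crate's
// actual API). Because xor is its own inverse, the same helper applies the
// diff on the write path and removes it on the read path.
fn xor_with_base(fingerprint_bytes: &mut [u8], base: &[u8]) {
    // Bytes past the end of the shorter slice are left untouched, so a
    // short or stale base chunk degrades gracefully instead of corrupting.
    for (byte, base_byte) in fingerprint_bytes.iter_mut().zip(base.iter()) {
        *byte ^= base_byte;
    }
}
```

Wherever the base chunk still matches, the xor-ed bytes become zero, and zstd compresses long runs of zeros almost for free.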

@pkhuong force-pushed the pkhuong/base-chunk branch 11 times, most recently from e9404b5 to 9e5cc67, February 8, 2022 14:46
When we also upload "base" arrays of fingerprints as chunks, we may
send larger chunks than 64 KB.  Verneuil already manages 3-4 GB
databases, which translates into base chunks of up to 512 KB, and
that's uncomfortably close to the 1 MB limit on chunk size.

Bump that up to a bit more than 256 MB, which would be enough even
for a 1 TB database (although we'd likely want a different approach
to metadata at that size).

In preparation for #3.
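
To make the sizing concrete, here's the back-of-the-envelope arithmetic behind the new limit (illustrative constants; the 64 KiB granularity and 16-byte fingerprints are assumptions, not the crate's real names):

```rust
// Assumed parameters: 64 KiB snapshot granularity, 16-byte fingerprints.
const PAGE_SIZE: u64 = 64 << 10; // 64 KiB per chunk
const FINGERPRINT_SIZE: u64 = 16; // bytes per chunk fingerprint
const DB_SIZE: u64 = 1 << 40; // a 1 TiB database

// 2^24 fingerprints at 16 bytes each: a 256 MiB base chunk.
const BASE_CHUNK_SIZE: u64 = (DB_SIZE / PAGE_SIZE) * FINGERPRINT_SIZE;
```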
…hing

It shouldn't matter whether we can parse the manifest: the copier
will upload bytes where they should go and clean up the ready directory.

In preparation for more complex manifest decoding in #3.

TESTED=sqlite tests.
It's harmless and will be part of the public interface to help
consumers keep chunks in cache.

In preparation for phase 1 of #3.
The array of chunk fingerprints in manifest protos can be long and is
rarely compressible.  However, we also know it doesn't change much from
one snapshot to the next, so we should be able to expose that redundancy
to compressors.

Simply apply an xor filter to that array, with a base sequence of bytes
stored in a regular content-addressed chunk, when such a base chunk
is defined in the manifest proto.

This is currently a no-op because we never populate the field and
we also never pass a list of `ReplicationTarget`s to fetch the
base chunk from.

Implements phase 1 of #3.

TESTED=sqlite tests.
…_chunks`

Which the only caller, copier.rs, can readily provide.

A no-op while we still don't use the base chunk array functionality.

Makes phase 1 of #3 correctly configurable.

TESTED=follow-up commits.
Pipe it back to the caller from `parse_manifest_chunks`, and add
it to the list of candidate chunks for patrol touching.

There is currently never any base chunk, so there's no behaviour
change yet.

Part of #3.

TESTED=read the code very closely, a bug here would be hard to find.
We'll need that to pass a cache builder when checking that
staged/ready/consuming manifests make sense.

TESTED=it builds.
That functionality is only used by the change tracker, `tracker.rs`,
and it always has such a list on hand.

Moreover, when reading the current staged manifest, make sure to try
and load from the staged chunk directory: we should find any chunk
needed to decode the manifest in there.

We also have to relax the invariant checking logic when the validator
might be racing with a snapshotter: when that happens, we could read a
manifest file, and not find the corresponding chunks.  We can only
rely on the staged manifest being readable when we hold a lock (read
or write) on the sqlite db.

Makes phase 1 of #3 correctly configurable.

TESTED=follow-up commits.
…ilure

If the manifest file at the path we just looked at disappeared or has
changed (points to a different inode), bubble that failure up as a more
benign missing file condition: we probably failed to decode the
manifest because its base chunk was missing, and we know that can
happen when the manifest changes from underneath us.

TESTED=deflakes very fast test runs.
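
On Unix, "same file" can be checked by comparing device and inode numbers; a hypothetical sketch of that check (not the crate's actual helper):

```rust
use std::fs::File;
use std::io::{ErrorKind, Result};
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Hypothetical helper: does `file`, which we opened from `path`, still
/// match whatever currently lives at `path`?
fn still_same_file(file: &File, path: &Path) -> Result<bool> {
    let opened = file.metadata()?;
    match std::fs::metadata(path) {
        // Same device and inode number means the same underlying file.
        Ok(current) => Ok(opened.dev() == current.dev() && opened.ino() == current.ino()),
        // The manifest disappeared: it definitely changed from underneath us.
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```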
The Tracker doesn't use the information yet, but it will be useful
when we want to reuse base chunks.

TESTED=it builds.
The function *always* builds a fresh base chunk (a copy of the current
list of chunk fingerprints), which, while not super useful in terms of
saving bandwidth or storage, does maximally exercise our diff
application logic.

Completes phase 2 of #3.

TESTED=sqlite tests.
If only because of backward compatibility, we have to try and detect
when data is definitely not compressed, and avoid reporting that as a
decoding error.  Callers can then decide if they want to decode the
raw or decompressed data.  In practice, our data is unlikely to
accidentally include a zstd header, so we can assume we want to
propagate decoding errors.

In preparation for zstd-compressed manifest blobs (step 3 of #3).

TESTED=sqlite tests.
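
Every zstd frame starts with the little-endian magic number 0xFD2FB528, so "definitely not compressed" is a four-byte check; a minimal sketch (hypothetical helper name):

```rust
/// A zstd frame always starts with the magic number 0xFD2FB528,
/// stored little-endian.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Hypothetical helper: anything that doesn't start with the magic
/// number is definitely not zstd-compressed.
fn looks_zstd_compressed(bytes: &[u8]) -> bool {
    bytes.starts_with(&ZSTD_MAGIC)
}
```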
Instead of pre-allocating for the worst case, we now use a dedicated
`Write` implementation that dumps into a vector, but only after checking
that the result length does not exceed the decompressed size limit.

Improves the memory efficiency of #3 in the common case.
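
A sketch of that bounded sink, assuming we decompress through anything that accepts a `std::io::Write` (the type and field names here are made up):

```rust
use std::io::{Error, ErrorKind, Result, Write};

/// Hypothetical bounded sink: grows a `Vec` on demand, but refuses to
/// hold more than `limit` bytes instead of pre-allocating the worst case.
struct BoundedVec {
    buf: Vec<u8>,
    limit: usize,
}

impl Write for BoundedVec {
    fn write(&mut self, data: &[u8]) -> Result<usize> {
        // Check before copying: the buffer never exceeds the limit.
        if self.buf.len().saturating_add(data.len()) > self.limit {
            return Err(Error::new(
                ErrorKind::InvalidData,
                "decompressed size limit exceeded",
            ));
        }

        self.buf.extend_from_slice(data);
        Ok(data.len())
    }

    fn flush(&mut self) -> Result<()> {
        Ok(())
    }
}
```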
If the bytes don't begin with the zstd header, directly parse them as
proto.

If the bytes begin with the zstd header and can be decompressed, only
attempt to parse the decompressed bytes: it seems highly improbable
that a valid proto message would also be a valid zstd stream,
including checksums.

Otherwise, the bytes begin with a zstd header but fail decompression.
Give decoding the raw bytes (as proto) a shot; if that fails, report
the decompression failure.

Making sure field 5 in `Manifest` isn't used as a varint guarantees
our proto messages don't start with the zstd header, which avoids the
last ambiguous case, where we try to zstd decompress, fail, and try to
parse the raw bytes as a proto.  Document that restriction on future
schemas.

Implements step 3 of #3.

TESTED=sqlite tests.
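
Putting the three cases together, a sketch of the decode order (assumes the `prost` and `zstd` crates; `M` stands in for the generated `Manifest` type):

```rust
use prost::Message;

// Disambiguation note: a zstd frame starts with byte 0x28, which protobuf
// would read as the tag for field 5 with varint wire type
// ((5 << 3) | 0 == 40 == 0x28), so keeping field 5 away from varints
// guarantees a raw manifest never starts with the zstd magic.
fn decode_blob<M: Message + Default>(bytes: &[u8]) -> std::io::Result<M> {
    let as_proto = |bytes: &[u8]| {
        M::decode(bytes).map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))
    };

    if !bytes.starts_with(&[0x28, 0xB5, 0x2F, 0xFD]) {
        // No zstd header: the bytes can only be a raw proto message.
        return as_proto(bytes);
    }

    match zstd::stream::decode_all(bytes) {
        // Decompression succeeded: only the decompressed bytes can be valid.
        Ok(decompressed) => as_proto(&decompressed),
        // zstd header, but decompression failed: give raw proto a shot,
        // and surface the decompression failure if that also fails.
        Err(zstd_error) => as_proto(bytes).map_err(|_| zstd_error),
    }
}
```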
We can now expect some compressibility when we use a base chunk for
the list of chunk fingerprints.  Try to take advantage of that with
quick compression.

We now unconditionally ask zstd to compress manifest blobs as quickly
as possible: in the worst case, we take a 12-byte hit for zstd
framing.  That seems negligible, and protobuf bytes tend to be a
little compressible, so the actual overhead is less than that.

Also rename the compression level constants to be clearer about
intent.

Implements step 4 of #3.

TESTED=sqlite tests, looked for compressed manifests in minio during tests.
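
With the `zstd` crate, the unconditional fast path is essentially a one-liner; level 1 here stands in for the renamed "fastest" constant:

```rust
/// Sketch of unconditional fast compression for manifest blobs, assuming
/// the `zstd` crate; level 1 stands in for the renamed "fastest" constant.
fn compress_manifest(serialised: &[u8]) -> std::io::Result<Vec<u8>> {
    // Worst case, zstd framing adds ~12 bytes, and protobuf bytes tend to
    // be a little compressible, so the real overhead is usually smaller.
    zstd::stream::encode_all(serialised, 1)
}
```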
When a content-addressed chunk is larger than the snapshot
granularity (chunk size), it's probably not snapshot data,
which means it's probably a base list of fingerprints.

Either way, it's large, and we probably don't want to waste too much
time compressing it.  When we're asked to compress at the default
level and find such a large chunk, instead compress at the minimum
quality (maximum speed).

Improves the upload speed of large base fingerprint chunks (for #3).

TESTED=sqlite tests?
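
A sketch of that heuristic (hypothetical names and levels, not the crate's real constants):

```rust
/// Hypothetical level-selection heuristic: chunks larger than the snapshot
/// granularity are probably base fingerprint lists, so compress them at
/// maximum speed instead of the default level.
fn chunk_compression_level(payload_len: usize, chunk_size: usize, requested: Option<i32>) -> i32 {
    const DEFAULT_LEVEL: i32 = 3; // zstd's usual default
    const FASTEST_LEVEL: i32 = 1;

    match requested {
        // An explicit request always wins.
        Some(level) => level,
        // Default requested for a large chunk: favour speed.
        None if payload_len > chunk_size => FASTEST_LEVEL,
        None => DEFAULT_LEVEL,
    }
}
```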
Retaining this `Arc<Chunk>` for each spool state guarantees the next
patrol call will be able to find that chunk in the global cache when
decoding the most recent manifest.

Part of step 8 of #3.

TESTED=it builds?
We don't want it to drop from cache, in case someone creates a new
updated snapshot and finds a reference to the same base chunk.

Part of step 8 of #3.

TESTED=it builds.
When we wish to use the local Chunk cache as a *write through* cache,
we can't use `Loader::fetch_chunk`.  This new function supports the
write-through use case.

Will be used for step 8 of #3.

TESTED=follow-up commit.
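
The difference from a plain fetch is that a miss populates the local cache before the chunk is handed out. A self-contained sketch with stand-in types (none of these are the crate's real `Loader`/`Chunk` API):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in types just to keep the sketch self-contained; the crate's real
// `Fingerprint`, `Chunk`, and cache types are more involved than this.
type Fingerprint = [u8; 16];
type Chunk = Vec<u8>;
struct Cache(Mutex<HashMap<Fingerprint, Arc<Chunk>>>);

impl Cache {
    /// Hypothetical write-through fetch: on a miss, insert the remotely
    /// fetched chunk into the local cache *before* returning it, so later
    /// readers find it locally.
    fn fetch_chunk_write_through(
        &self,
        fprint: Fingerprint,
        fetch_remote: impl FnOnce(&Fingerprint) -> std::io::Result<Chunk>,
    ) -> std::io::Result<Arc<Chunk>> {
        let mut map = self.0.lock().unwrap();
        if let Some(chunk) = map.get(&fprint) {
            return Ok(chunk.clone());
        }

        let chunk = Arc::new(fetch_remote(&fprint)?);
        // The write-through step: cache first, then hand the chunk out.
        map.insert(fprint, chunk.clone());
        Ok(chunk)
    }
}
```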
and stick it in `Tracker::recent_base_chunk`, to help future
lookups.

Completes step 8 of #3.

TESTED=sqlite tests.
We'll want to be smart about whether we keep the current base chunk or
create a new one.

In preparation for step 6 of #3.

TESTED=it builds?
and always decide to create a fresh chunk.

In preparation for a smarter policy (step 6 of #3).

TESTED=sqlite tests.
We essentially want to keep using a base fingerprint chunk while it's
still a good match for our actual list of chunk fingerprints.  Simply
reuse with probability proportional to the match rate.

On top of that, when we don't want to reuse the current chunk, look at
the size of the list of fingerprints: it's not useful to indirect
through a base chunk when the list is short anyway.  For now, set an
arbitrary cutoff at 1024: if a manifest serialises to ~16 KB or less,
don't try to make it compress better.

We only take the list size into account when we're about to create a
new base chunk, instead of attempting to add some hysteresis: we're
not worried about rapid alternation between having and not having a
base chunk when the database goes between 1023 and 1024 chunks because
we only "lose" a base chunk stochastically, when it's not useful.

Closes #3, by implementing the policy (steps 6 and 7).

The test-only random policy shows:
1. manifests with no base chunks
2. manifests with a fresh base chunk
3. manifests with a fully stale base chunk
4. manifests with a partially stale base chunk

TESTED=sqlite tests for lack of crash, manual inspection for impact during tests.
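
A sketch of the policy described above (hypothetical names; assumes the `rand` crate). The 1024 cutoff matches the arithmetic in the commit message: 1024 fingerprints at 16 bytes each serialise to roughly 16 KiB.

```rust
/// Hypothetical outcome of the policy: keep the current base chunk,
/// build a fresh one, or go without.
enum BaseChunkAction {
    Reuse,
    CreateFresh,
    None,
}

/// Sketch of the stochastic reuse policy, assuming the `rand` crate.
fn base_chunk_policy(current: &[u8], base: &[u8], num_fingerprints: usize) -> BaseChunkAction {
    use rand::Rng;

    // Fraction of bytes the base chunk still predicts correctly.
    let matching = current.iter().zip(base.iter()).filter(|(a, b)| a == b).count();
    let match_rate = matching as f64 / current.len().max(1) as f64;

    // Keep the current base chunk with probability equal to the match
    // rate: a stale chunk is only dropped stochastically.
    if rand::thread_rng().gen::<f64>() < match_rate {
        return BaseChunkAction::Reuse;
    }

    // Short fingerprint lists (~16 KiB serialised or less) aren't worth
    // the indirection; otherwise, build a fresh base chunk.
    if num_fingerprints >= 1024 {
        BaseChunkAction::CreateFresh
    } else {
        BaseChunkAction::None
    }
}
```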
We expect a lot more repetitions now that we have (and reuse) base
chunks.  Make the cache comfortably larger than the working set size,
to make it much more likely that we will catch redundant uploads
of the same base chunk.

TESTED=it builds.
Large enough manifests have their list of fingerprints xor-ed with a
"base [data] chunk."  When the list's bytes and the base chunk are
similar, `xor`ing them together creates a lot of zeros, which zstd
then compresses efficiently.

That's more robust than a logical diffing scheme, and probably more
efficient (space and time) than what a reasonable effort would yield.

Documents the solution to #3.

TESTED=it's all docs.
@pkhuong force-pushed the pkhuong/base-chunk branch 2 times, most recently from 7f8da70 to 896cde9, February 8, 2022 17:26
@pkhuong merged commit edacfeb into main on Feb 8, 2022
@pkhuong deleted the pkhuong/base-chunk branch on February 8, 2022 18:00