Bundle high-churn chunks with the manifest #10

pkhuong · 2022-02-11T08:30:49Z

Implements the improvement sketched in #7.

We already made manifests a lot smaller with #3, at the potential expense of one extra I/O. We can recover some of that cost by inlining, in the manifest, the chunks we don't expect to benefit from deduplication (because they're rarely reused across transactions). The net effect will be a similar amount of data transferred, but fewer blobs. Combined with #3, we should see less transfer bandwidth, a smaller storage footprint, and fewer API calls than v0.1.

The first chunk, the one that includes sqlite's header page, is a prime candidate: the header contains a 32-bit integer that's incremented whenever the file's contents change. The rest of the page serves as the root B-tree page for the schema table. Root pages are often extra sparse (only they get to hit < 50% occupancy), and the schema table is mostly SQL DDL, i.e., nicely compressible. Once compressed by zstd, it shouldn't add too much to the manifest's size.
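
To make the "always changes" argument concrete, here is a minimal sketch that reads the change counter out of a database's first chunk. The field position (a 4-byte big-endian integer at byte offset 24 of the 100-byte header) comes from the SQLite file format documentation; the function name is hypothetical and not part of this PR.

```rust
/// Byte offset of the 4-byte big-endian file change counter in the
/// 100-byte SQLite database header.
const CHANGE_COUNTER_OFFSET: usize = 24;

/// Hypothetical helper: returns the file change counter, or `None` if
/// `first_chunk` does not start with a SQLite 3 header.
pub fn file_change_counter(first_chunk: &[u8]) -> Option<u32> {
    const MAGIC: &[u8; 16] = b"SQLite format 3\0";
    if first_chunk.len() < 100 || !first_chunk.starts_with(MAGIC) {
        return None;
    }
    let b = &first_chunk[CHANGE_COUNTER_OFFSET..CHANGE_COUNTER_OFFSET + 4];
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}
```

Whenever this value changes, the first chunk's bytes change too, so a content-addressed fingerprint of that chunk is unlikely to match one already in storage.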

We want to bundle the chunk at offset 0 (the one that contains sqlite's header page): that chunk includes a counter that's incremented after every write transaction, so we don't expect much deduplication from content addressing. This commit extends the schema and documents the new field. Further commits will prepare readers for bundled chunks and actually bundle the first chunk with the manifest. This schema change lets us implement #7. TESTED=it builds.

Bundled chunks probably won't even be found in cached or remote storage. We want to let the loader find them in its in-memory stash, before we even try to hit (and cross-check) storage. Implements the read half of #7. TESTED=follow-up commits.
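
The lookup order described here could be sketched as follows. This is only an illustration under assumed names: `Fingerprint`, `Loader`, and the `fetch` callback are hypothetical stand-ins, not the actual loader types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a chunk's content fingerprint.
type Fingerprint = u64;

/// Hypothetical loader: `stash` holds the chunks bundled with the manifest.
pub struct Loader {
    stash: HashMap<Fingerprint, Vec<u8>>,
}

impl Loader {
    pub fn new(bundled: Vec<(Fingerprint, Vec<u8>)>) -> Self {
        Loader { stash: bundled.into_iter().collect() }
    }

    /// Checks the in-memory stash first; only falls back to `fetch`
    /// (cached or remote storage) when the chunk was not bundled.
    pub fn load<F>(&self, fprint: Fingerprint, fetch: F) -> Option<Vec<u8>>
    where
        F: FnOnce(Fingerprint) -> Option<Vec<u8>>,
    {
        match self.stash.get(&fprint) {
            Some(bytes) => Some(bytes.clone()),
            None => fetch(fprint),
        }
    }
}
```

With this ordering, a bundled chunk never triggers a storage request, and chunks that were not bundled behave exactly as before.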

`CopierWorker`s are also responsible for keeping relevant chunks alive: given a manifest, they figure out the list of chunks the manifest references, and periodically "touch" them (by copying each blob over itself in S3). When something goes wrong in that "touch" process, we assume the manifest is now useless, and the source database's replication state is cleared to force a full snapshot from scratch. We don't want that to happen whenever we try to touch a chunk that was bundled with the manifest instead of uploaded to S3. Tweak the list of chunks in the manifest to treat bundled chunks like the well-known zero fingerprint, and never patrol-touch them. Avoids degrading performance once #7 introduces expected "missing" chunks. TESTED=follow-up commits?
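
A minimal sketch of that filtering, assuming a simplified manifest entry: `ChunkRef`, its `bundled` flag, and the placeholder `ZERO_CHUNK_FPRINT` value are all hypothetical, chosen only to illustrate the idea.

```rust
/// Hypothetical fingerprint type.
type Fingerprint = u64;

/// Placeholder for the well-known fingerprint of the all-zero chunk
/// (not the real value).
const ZERO_CHUNK_FPRINT: Fingerprint = 0;

/// Hypothetical manifest entry: one per chunk.
pub struct ChunkRef {
    pub fprint: Fingerprint,
    pub bundled: bool,
}

/// Returns the deduplicated fingerprints that should be patrol-touched:
/// everything except bundled chunks and the well-known zero chunk.
pub fn chunks_to_touch(chunks: &[ChunkRef]) -> Vec<Fingerprint> {
    let mut out: Vec<Fingerprint> = chunks
        .iter()
        .filter(|c| !c.bundled && c.fprint != ZERO_CHUNK_FPRINT)
        .map(|c| c.fprint)
        .collect();
    out.sort_unstable();
    out.dedup();
    out
}
```

Because bundled chunks never reach the touch list, a blob that is missing by design cannot be mistaken for a broken manifest.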

…unks When a base manifest bundles chunks, we can't assume they're available for our new manifest to refer to: the chunks were bundled instead of staged for upload to the chunk store. Mark the corresponding chunks as dirty, and avoid thinking our predecessor uploaded anything useful for these chunks. Hopefully future-proofs `snapshot_file_contents` against more flexible versions of #7. TESTED=not really? We will use a static list of bundled chunk offsets.
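
One way to picture the dirtiness rule is as a set union. This is a sketch with a hypothetical function name and signature; the real `snapshot_file_contents` logic is not shown in this PR text.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: the chunk offsets we must snapshot afresh are the
/// offsets the change tracker saw written, plus every offset the base
/// manifest bundled (those chunks never reached the chunk store).
pub fn dirty_offsets(
    tracked_dirty: &BTreeSet<u64>,
    base_bundled: &BTreeSet<u64>,
) -> BTreeSet<u64> {
    tracked_dirty.union(base_bundled).copied().collect()
}
```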

The change tracker has logic to stage chunks for upload as soon as they're written, when writes come in exactly one chunk at a time. Disable that for chunks at file offsets we want to bundle with the manifest rather than upload as content-addressed / deduplicated blobs. Avoids useless work in the write half of #7. TESTED=sqlite tests, and manual spot check for spuriously uploaded chunks.
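
The eager-staging decision might look roughly like this. The 64 KiB `CHUNK_SIZE` is an assumption inferred from the `{0, 65536}` offsets mentioned in the test notes, and `should_stage_eagerly` is a hypothetical name.

```rust
/// Assumed chunk size (inferred, not confirmed by this PR text).
const CHUNK_SIZE: u64 = 65536;

/// Offsets of chunks that are bundled with the manifest.
const BUNDLED_CHUNK_OFFSETS: &[u64] = &[0];

/// Hypothetical predicate: stage a chunk for upload right away only when
/// the write covers exactly one aligned chunk, and that chunk is not one
/// we intend to bundle with the manifest.
pub fn should_stage_eagerly(write_offset: u64, write_len: u64) -> bool {
    let exactly_one_chunk = write_len == CHUNK_SIZE && write_offset % CHUNK_SIZE == 0;
    exactly_one_chunk && !BUNDLED_CHUNK_OFFSETS.contains(&write_offset)
}
```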

The first chunk contains a header that nearly always changes. It also contains the `sqlite_schema` btree, which consists mostly of compressible SQL text. It makes sense to bundle it with its manifest. Make sure that first chunk is always considered dirty (which it usually would be), and stick it in the manifest proto instead of uploading it as a standalone chunk. Closes #7, now that the write side is fully implemented. TESTED=sqlite tests, with all 4 subsets of {0, 65536} for BUNDLED_CHUNK_OFFSETS.
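
As a rough illustration of the write side, dirty chunks can be split into those that go in the manifest and those that are uploaded as standalone blobs. `Chunk` and `split_for_snapshot` are hypothetical names; only `BUNDLED_CHUNK_OFFSETS` appears in the commit message.

```rust
/// Offsets of chunks that are bundled with the manifest.
const BUNDLED_CHUNK_OFFSETS: &[u64] = &[0];

/// Hypothetical dirty chunk: its file offset and contents.
pub struct Chunk {
    pub offset: u64,
    pub bytes: Vec<u8>,
}

/// Splits dirty chunks into `(bundled, to_upload)`: the first list is
/// embedded in the manifest, the second is uploaded as
/// content-addressed blobs.
pub fn split_for_snapshot(dirty: Vec<Chunk>) -> (Vec<Chunk>, Vec<Chunk>) {
    dirty
        .into_iter()
        .partition(|c| BUNDLED_CHUNK_OFFSETS.contains(&c.offset))
}
```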

pkhuong force-pushed the pkhuong/bundle-chunks-again branch 5 times, most recently from 070787f to a69901b Compare February 11, 2022 14:37

Base automatically changed from pkhuong/well-known-chunks to main February 11, 2022 17:22

pkhuong added 4 commits February 11, 2022 12:23

pkhuong force-pushed the pkhuong/bundle-chunks-again branch from a69901b to 382f70b Compare February 11, 2022 17:23

pkhuong added 2 commits February 11, 2022 12:27

pkhuong force-pushed the pkhuong/bundle-chunks-again branch from 382f70b to 4bdf044 Compare February 11, 2022 17:28

pkhuong merged commit 8e7fa62 into main Feb 11, 2022

pkhuong deleted the pkhuong/bundle-chunks-again branch February 11, 2022 18:28
