# Rework Metadata Storage #8513
Merged
This PR reworks the way metadata is written to the storage format.
### Current Situation

Previously, we used the `MetaBlockWriter` class to write metadata to the storage format. This class would serialize data into a linked list of blocks. The `MetaBlockWriter` would start out by allocating a block (256KB) and writing data into it. When the block was filled, it would allocate a new block and write the location of the new block into a "next" pointer in the first 8 bytes of the previous block. This would essentially create the following linked list on disk:
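Schematically (block ids, sizes, and ordering here are illustrative):

```
+------------------+       +------------------+       +------------------+
| next: block 12   | --->  | next: block 35   | --->  | next: INVALID    |
| data ...         |       | data ...         |       | data ...         |
+------------------+       +------------------+       +------------------+
  block 4 (256KB)            block 12 (256KB)           block 35 (256KB)
```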
The `MetaBlockReader` would reverse the process and read the blocks in order, following the pointers.

Three separate `MetaBlockWriter`s are currently used.

### Issues
This approach has a number of problems:

- Because `MetaBlockWriter`s write data to a continuous stream, re-use of blocks written within a single writer is not possible. Deleting a row group shifts the row group pointers of all subsequent row groups, for example.
- Because every `MetaBlockWriter` allocates at least one block, and we also need a block for table data, adding more `MetaBlockWriter`s is problematic. Already with 3 writers, the minimum size of a DuckDB database is around 1MB (4 blocks), even if those blocks are mostly empty.

As a result of these issues, we are forced to always re-write all metadata during a checkpoint, and to read all metadata blocks before performing a checkpoint. This is problematic because metadata can grow large. In particular, row group and column data pointers can take 0.1-10KB per row group, depending on how many columns a table has. For tables with many billions of rows (>1TB databases), this can lead to metadata on the order of 1GB: with DuckDB's default of 122,880 rows per row group, 10 billion rows is roughly 81,000 row groups, and at up to 10KB each that approaches 1GB. For every checkpoint, we would need to rewrite that 1GB, even if we were only making small changes to the actual file.
### New Approach

This PR reworks the way metadata is written by instead partitioning 256KB blocks into 64 × 4KB metadata blocks. The metadata blocks are tracked in a centralized `MetadataManager`, which keeps track of (1) which storage blocks are used to store metadata blocks, and (2) for every such block, which of the 64 metadata blocks are occupied or free (if any).

The full list of metadata blocks (and which of them are occupied) is stored alongside the list of free blocks as part of the top-level metadata. Each entry is serialized as a `block_id_t` (the block id) and an `idx_t` (a 64-bit bitmask indicating which metadata blocks are occupied):
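A minimal sketch of what such an entry could look like (the struct and field names are assumptions, not the PR's actual code):

```cpp
#include <cstdint>

typedef int64_t block_id_t; // DuckDB's block id type
typedef uint64_t idx_t;     // DuckDB's index type

// Hypothetical sketch: one top-level entry per storage block holding metadata.
struct MetadataBlockEntry {
	block_id_t block_id; // the 256KB storage block that holds the metadata blocks
	idx_t occupied_mask; // bit i set => 4KB metadata block i within it is occupied
};
```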
Pointers to the metadata blocks are stored as a single 64-bit integer that combines the block id and the index (0-63):
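One plausible way to pack such a pointer; the exact bit layout here is an assumption, since the index (0-63) only needs 6 bits:

```cpp
#include <cstdint>

// Hypothetical packing: low 58 bits hold the block id, high 6 bits hold the
// index of the 4KB metadata block within the 256KB storage block. Negative
// or INVALID block ids are not handled in this sketch.
static constexpr uint64_t INDEX_SHIFT = 58;
static constexpr uint64_t BLOCK_ID_MASK = (1ULL << INDEX_SHIFT) - 1;

uint64_t PackMetadataPointer(int64_t block_id, uint8_t index) {
	return (uint64_t(index) << INDEX_SHIFT) | (uint64_t(block_id) & BLOCK_ID_MASK);
}

int64_t UnpackBlockId(uint64_t pointer) {
	return int64_t(pointer & BLOCK_ID_MASK);
}

uint8_t UnpackIndex(uint64_t pointer) {
	return uint8_t(pointer >> INDEX_SHIFT);
}
```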
The `MetadataWriter` and `MetadataReader` replace the `MetaBlockWriter` and `MetaBlockReader`. They work in a similar way - constructing linked lists of blocks - but operate on the smaller 4KB blocks managed by the `MetadataManager` instead.
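A rough sketch of how such a writer could chain 4KB blocks under these assumptions (all names and the manager interface here are hypothetical, not the PR's actual classes):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the centralized MetadataManager.
struct MetadataManagerSketch {
	uint64_t AllocateBlock();       // hand out a free 4KB slot as a packed pointer
	uint8_t *Pin(uint64_t pointer); // writable buffer of the 4KB block behind it
};

static constexpr uint64_t METADATA_BLOCK_SIZE = 4096;
static constexpr uint64_t NEXT_POINTER_SIZE = sizeof(uint64_t);

// Hypothetical writer: same linked-list scheme as the old MetaBlockWriter,
// but over 4KB metadata blocks handed out by the manager.
struct MetadataWriterSketch {
	MetadataManagerSketch &manager;
	uint64_t current = 0;                  // packed pointer of the block being filled
	                                       // (0 used as "none" for simplicity)
	uint64_t offset = METADATA_BLOCK_SIZE; // "full", so the first write allocates

	void WriteData(const uint8_t *data, uint64_t size) {
		while (size > 0) {
			if (offset == METADATA_BLOCK_SIZE) {
				// current 4KB block is full: grab a new one and link it in
				uint64_t next = manager.AllocateBlock();
				if (current != 0) {
					std::memcpy(manager.Pin(current), &next, NEXT_POINTER_SIZE);
				}
				current = next;
				offset = NEXT_POINTER_SIZE; // first 8 bytes reserved for "next"
			}
			uint64_t count = std::min(size, METADATA_BLOCK_SIZE - offset);
			std::memcpy(manager.Pin(current) + offset, data, count);
			offset += count;
			data += count;
			size -= count;
		}
	}
};
```

The only structural difference from the old scheme is the granularity: the "next" pointer now refers to a 4KB slot handed out by the manager rather than a whole freshly allocated 256KB block.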
### Metadata Overhead

This new approach can greatly reduce the space taken by metadata, particularly for small databases: instead of occupying several almost-empty 256KB blocks, the metadata of different writers can fit on the same blocks. For example, creating a small database on v0.8.1 and on this PR and comparing the resulting file sizes shows the difference.
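As an illustrative sketch of such a comparison (hypothetical; the file name, table contents, and sizes involved are assumptions, not the PR's actual benchmark), using DuckDB's C++ API:

```cpp
#include "duckdb.hpp"

// Illustrative only: create a tiny database and checkpoint it, then compare
// the size of metadata_overhead.db produced by v0.8.1 and by this PR.
int main() {
	duckdb::DuckDB db("metadata_overhead.db");
	duckdb::Connection con(db);
	con.Query("CREATE TABLE integers AS SELECT * FROM range(1000) t(i)");
	con.Query("CHECKPOINT");
	return 0;
}
```

On this PR, the metadata for such a database can share 4KB metadata blocks instead of each writer claiming its own mostly-empty 256KB block, so the resulting file is substantially smaller.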
### Future Work

This PR moves to the new metadata-manager approach for managing metadata, but it does not change the actual layout of the metadata and does not do any work on metadata re-use. Instead, it lays the groundwork for future work in that area. Namely, by storing the metadata in smaller blocks, we can break the metadata up into smaller pieces. For example, we could use a new `MetadataWriter` for every few row groups, instead of one for all row group pointers of all tables in the database. This would allow us to re-use previously written blocks if they have not changed.

Another issue relates to the truncation of a database file (see #7824). We might still run into a problem where metadata is written after a big table, preventing truncation until another checkpoint is run. Now that we have central information on where the metadata blocks reside, we could decide to do a second checkpoint to automatically clear up the space and truncate the file.
### Index Serialization

CC @taniabogatsch

This PR also modifies index serialization, because that previously used the `MetaBlockWriter`. This PR moves it to the `MetadataWriter` but otherwise keeps the serialization the same. This is done to keep index serialization working while allowing us to remove the `MetaBlockWriter` entirely. When the index serialization revamp is done, all of the index changes here should be removed.

Note also that I've temporarily disabled `test/sql/index/art/vacuum/test_art_vacuum_strings.test_slow` in this PR, as this PR changes the way lazy loading of indexes works, which breaks that test. The test should be re-enabled along with the new index serialization.