
Rework Metadata Storage #8513

Merged · Mytherin merged 24 commits into duckdb:master on Aug 9, 2023
Conversation

@Mytherin (Collaborator) commented on Aug 8, 2023

This PR reworks the way metadata is written to the storage format.

Current Situation

Previously, we used the MetaBlockWriter class to write metadata to the storage format. This class would serialize data into a linked list of blocks. The MetaBlockWriter would start out by allocating a block (256KB) and writing data into it. When the block was filled, it would allocate a new block and write the location of that new block into a "next" pointer stored in the first 8 bytes of the previous block.

This would essentially create the following linked list on disk:

 Block 1      Block 2       Block 3
[#2.....] -> [#3.....]  -> [-1.....]

The MetaBlockReader would reverse the process, reading the blocks in order by following the next pointers.
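
To make the old scheme concrete, below is a minimal self-contained sketch of this linked-list layout. The class and member names (ToyMetaBlockWriter, Append, NewBlock) are illustrative rather than the actual DuckDB implementation; only the 256KB block size and the 8-byte next pointer follow the description above.

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

static constexpr uint64_t BLOCK_SIZE = 262144; // 256KB storage block

struct ToyMetaBlockWriter {
	// each entry is one 256KB block; the vector index doubles as the block id
	std::vector<std::vector<uint8_t>> blocks;
	uint64_t offset = BLOCK_SIZE; // forces allocation on the first write

	void Append(const uint8_t *data, uint64_t len) {
		while (len > 0) {
			if (offset == BLOCK_SIZE) {
				NewBlock();
			}
			uint64_t to_write = std::min(len, BLOCK_SIZE - offset);
			std::memcpy(blocks.back().data() + offset, data, to_write);
			offset += to_write;
			data += to_write;
			len -= to_write;
		}
	}

	void NewBlock() {
		int64_t new_id = (int64_t)blocks.size();
		if (!blocks.empty()) {
			// link the previous block to the new one via its first 8 bytes
			std::memcpy(blocks.back().data(), &new_id, sizeof(new_id));
		}
		blocks.emplace_back(BLOCK_SIZE, 0);
		int64_t terminator = -1; // -1 marks "no next block" until a successor exists
		std::memcpy(blocks.back().data(), &terminator, sizeof(terminator));
		offset = sizeof(int64_t); // payload starts after the next pointer
	}
};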

Three separate MetaBlockWriters are used currently:

  • One for the schema information (table definitions, column definitions, views, schemas, etc)
  • One for the row group and column data pointers (where the table data is located)
  • One for the list of free blocks

Issues

This approach has a number of problems:

  • Because each MetaBlockWriter writes data to a continuous stream, re-use of blocks written within a single writer is not possible. Deleting a row group, for example, shifts the row group pointers of all subsequent row groups.
  • Because each MetaBlockWriter allocates at least one block, and we need at least one block for table data, adding more MetaBlockWriters is problematic. With 3 writers, the minimum size of a DuckDB database is already around 1MB (4 blocks), even if those blocks are mostly empty.
  • The set of blocks used for metadata is not stored in a central location, but is instead spread over the linked lists. Because of this, all metadata blocks must be read before checkpointing in order to figure out which blocks have become free (as they were previously used for metadata).

As a result of these issues we are forced to always re-write all metadata during a checkpoint, and forced to read all metadata blocks before performing a checkpoint. This is problematic because metadata can grow to be large. In particular, row group and column data pointers can store 0.1-10KB per row group depending on how many columns a table has. For tables with many billions of rows (>1TB databases) this can lead to metadata that reaches sizes of around 1GB. For every checkpoint, we would need to rewrite that 1GB, even if we were only making small changes to the actual file.

New Approach

This PR reworks the way metadata is written by instead partitioning each 256KB storage block into 64 metadata blocks of 4KB each. The metadata blocks are tracked in a centralized MetadataManager, which keeps track of (1) which storage blocks are used to store metadata, and (2) for each of those storage blocks, which of its 64 metadata blocks are occupied and which are free.

The full list of metadata blocks (and which of them are occupied) is stored alongside the list of free blocks as part of the top-level metadata. Each entry is serialized as a block_id_t (the block id) and an idx_t free_blocks (a 64-bit bitmask with one bit per 4KB metadata block, indicating which of them are free):

struct MetadataBlock {
	block_id_t block_id;
	idx_t free_blocks;
};
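
As an illustration of how such a bitmask can be consumed, here is a hedged sketch of claiming one of the 64 4KB sub-blocks. ClaimSubBlock is a hypothetical helper, and the reading of free_blocks as a bitmask of free sub-blocks follows the field name rather than the actual MetadataManager code (C++20 is assumed for std::countr_zero).

#include <bit>
#include <cstdint>
#include <optional>

using block_id_t = int64_t;
using idx_t = uint64_t;

struct MetadataBlock {
	block_id_t block_id;
	idx_t free_blocks; // assumed: bit i set means 4KB sub-block i is free
};

// Claim the lowest free 4KB sub-block and return its index (0-63), if any.
std::optional<uint8_t> ClaimSubBlock(MetadataBlock &block) {
	if (block.free_blocks == 0) {
		return std::nullopt; // all 64 sub-blocks are occupied
	}
	uint8_t index = (uint8_t)std::countr_zero(block.free_blocks); // lowest set bit
	block.free_blocks &= ~(idx_t(1) << index);                    // mark it occupied
	return index;
}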

Pointers to the metadata blocks are stored as a 64-bit integer, a combination of the block id and the index (0-63):

struct MetadataPointer {
	idx_t block_index : 56;
	uint8_t index : 8;
};
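
For illustration, a sketch of packing and unpacking this pointer into the single 64-bit on-disk value mentioned above; the choice of which field occupies the high bits is an assumption here, not taken from the PR.

#include <cstdint>

using idx_t = uint64_t;

struct MetadataPointer {
	idx_t block_index : 56;
	uint8_t index : 8;
};

// Pack the pointer into one 64-bit value: index (0-63) in the top 8 bits,
// block_index in the lower 56 bits (layout assumed for this sketch).
idx_t Pack(MetadataPointer ptr) {
	return (idx_t(ptr.index) << 56) | idx_t(ptr.block_index);
}

MetadataPointer Unpack(idx_t value) {
	MetadataPointer ptr;
	ptr.block_index = value & ((idx_t(1) << 56) - 1);
	ptr.index = (uint8_t)(value >> 56);
	return ptr;
}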

The MetadataWriter and MetadataReader replace the MetaBlockWriter and MetaBlockReader. They work in a similar way - constructing linked lists of blocks - but operate on the smaller 4KB blocks managed by the MetadataManager instead.

Metadata Overhead

This new approach can greatly reduce the space taken up by metadata, particularly for small databases, as metadata that previously occupied several almost empty 256KB blocks can now share a single block. For example, running the following script on v0.8.1 and on this PR:

create table integers as select 42 i;
checkpoint;
create table integers2 as select 42 j;
checkpoint;
results in the following file sizes:

-rw-r--r--  1 myth  staff   1.3M Aug  8 19:21 almostempty-v081.db
-rw-r--r--  1 myth  staff   268K Aug  8 19:22 almostempty-new.db
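
As a rough sanity check on those sizes (assuming about 12KB of fixed file headers; the rest follows from the 256KB block size): 268KB is consistent with a single 256KB block plus the headers, while 1.3MB corresponds to roughly five 256KB blocks plus the headers, of which the three MetaBlockWriters alone account for three mostly empty blocks.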

Future Work

This PR moves to the new metadata manager approach for managing metadata, but it does not change the actual layout of the metadata and does not do any work on metadata re-use. Instead, it lays the groundwork for future work in that area. Namely, by storing the metadata in smaller blocks we can break the metadata up into smaller pieces. For example, we can use a new MetadataWriter for every few row groups, instead of one for all row group pointers of all tables in the database. This will allow us to re-use previously written blocks if they have not changed.

Another issue relates to the truncation of a database file (see #7824). We might still run into a problem where metadata is written after a big table, preventing the truncation until another checkpoint is run. Now that we have central information on where the metadata blocks reside, we could decide to do a second checkpoint to automatically clear up the space and truncate the file.

Index Serialization

CC @taniabogatsch

This PR also modifies index serialization, because it previously used the MetaBlockWriter. This PR moves that to the MetadataWriter but otherwise keeps the serialization the same. This is done to keep the serialization working while allowing us to remove the MetaBlockWriter entirely. Once the index serialization revamp is done, the changes here should be removed.

Note also that I've temporarily disabled test/sql/index/art/vacuum/test_art_vacuum_strings.test_slow in this PR, as this PR changes the way lazy loading of indexes works, which breaks that test. The test should be re-enabled along with the new index serialization.

@Mytherin merged commit acbbfe0 into duckdb:master on Aug 9, 2023 (51 checks passed).
@Mytherin deleted the metadatarework branch on December 4, 2023.