
Rework Metadata Storage #8513

Merged · Mytherin merged 24 commits into duckdb:master on Aug 9, 2023
Conversation

@Mytherin (Collaborator) commented on Aug 8, 2023

This PR reworks the way metadata is written to the storage format.

Current Situation

Previously, we used the MetaBlockWriter class to write metadata to the storage format. This class would serialize data into a linked list of blocks. The MetaBlockWriter would start out by allocating a block (256KB) and writing data into it. When the block was filled, it would allocate a new block and write the location of that new block into a "next" pointer stored in the first 8 bytes of the previous block.

This would essentially create the following linked list on disk:

 Block 1      Block 2       Block 3
[#2.....] -> [#3.....]  -> [-1.....]

The MetaBlockReader would reverse the process, reading the blocks in order by following the next pointers.
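
To make the old scheme concrete, below is a minimal self-contained sketch of this linked-list layout. The class and member names (ToyMetaBlockWriter, Append, NewBlock) are illustrative rather than the actual DuckDB implementation; only the 256KB block size and the 8-byte next pointer follow the description above.

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

static constexpr uint64_t BLOCK_SIZE = 262144; // 256KB storage block

struct ToyMetaBlockWriter {
	// each entry is one 256KB block; the vector index doubles as the block id
	std::vector<std::vector<uint8_t>> blocks;
	uint64_t offset = BLOCK_SIZE; // forces allocation on the first write

	void Append(const uint8_t *data, uint64_t len) {
		while (len > 0) {
			if (offset == BLOCK_SIZE) {
				NewBlock();
			}
			uint64_t to_write = std::min(len, BLOCK_SIZE - offset);
			std::memcpy(blocks.back().data() + offset, data, to_write);
			offset += to_write;
			data += to_write;
			len -= to_write;
		}
	}

	void NewBlock() {
		int64_t new_id = (int64_t)blocks.size();
		if (!blocks.empty()) {
			// link the previous block to the new one via its first 8 bytes
			std::memcpy(blocks.back().data(), &new_id, sizeof(new_id));
		}
		blocks.emplace_back(BLOCK_SIZE, 0);
		int64_t terminator = -1; // -1 marks "no next block" until a successor exists
		std::memcpy(blocks.back().data(), &terminator, sizeof(terminator));
		offset = sizeof(int64_t); // payload starts after the next pointer
	}
};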

Three separate MetaBlockWriters are used currently:

  • One for the schema information (table definitions, column definitions, views, schemas, etc)
  • One for the row group and column data pointers (where the table data is located)
  • One for the list of free blocks

Issues

This approach has a number of problems:

  • Because each MetaBlockWriter writes data to a continuous stream, re-use of blocks written within a single writer is not possible. Deleting a row group, for example, shifts the row group pointers of all subsequent row groups.
  • Because each MetaBlockWriter allocates at least one block, and we need at least one block for table data, adding more MetaBlockWriters is problematic. With 3 writers, the minimum size of a DuckDB database is already around 1MB (4 blocks), even if those blocks are mostly empty.
  • The set of blocks used for metadata is not stored in a central location, but is instead spread over the linked lists. Because of this, all metadata blocks must be read before checkpointing in order to figure out which blocks have become free (as they were previously used for metadata).

As a result of these issues we are forced to always re-write all metadata during a checkpoint, and forced to read all metadata blocks before performing a checkpoint. This is problematic because metadata can grow to be large. In particular, row group and column data pointers can store 0.1-10KB per row group depending on how many columns a table has. For tables with many billions of rows (>1TB databases) this can lead to metadata that reaches sizes of around 1GB. For every checkpoint, we would need to rewrite that 1GB, even if we were only making small changes to the actual file.

New Approach

This PR reworks the way metadata is written by instead partitioning each 256KB storage block into 64 metadata blocks of 4KB each. The metadata blocks are tracked in a centralized MetadataManager, which keeps track of (1) which storage blocks are used to store metadata, and (2) for each of those storage blocks, which of its 64 metadata blocks are occupied and which are free.

The full list of metadata blocks (and which of them are occupied) is stored alongside the list of free blocks as part of the top-level metadata. Each entry is serialized as a block_id_t (the block id) and an idx_t free_blocks (a 64-bit bitmask with one bit per 4KB metadata block, indicating which of them are free):

struct MetadataBlock {
	block_id_t block_id;
	idx_t free_blocks;
};
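
As an illustration of how such a bitmask can be consumed, here is a hedged sketch of claiming one of the 64 4KB sub-blocks. ClaimSubBlock is a hypothetical helper, and the reading of free_blocks as a bitmask of free sub-blocks follows the field name rather than the actual MetadataManager code (C++20 is assumed for std::countr_zero).

#include <bit>
#include <cstdint>
#include <optional>

using block_id_t = int64_t;
using idx_t = uint64_t;

struct MetadataBlock {
	block_id_t block_id;
	idx_t free_blocks; // assumed: bit i set means 4KB sub-block i is free
};

// Claim the lowest free 4KB sub-block and return its index (0-63), if any.
std::optional<uint8_t> ClaimSubBlock(MetadataBlock &block) {
	if (block.free_blocks == 0) {
		return std::nullopt; // all 64 sub-blocks are occupied
	}
	uint8_t index = (uint8_t)std::countr_zero(block.free_blocks); // lowest set bit
	block.free_blocks &= ~(idx_t(1) << index);                    // mark it occupied
	return index;
}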

Pointers to the metadata blocks are stored as a 64-bit integer, a combination of the block id and the index (0-63):

struct MetadataPointer {
	idx_t block_index : 56;
	uint8_t index : 8;
};
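
For illustration, a sketch of packing and unpacking this pointer into the single 64-bit on-disk value mentioned above; the choice of which field occupies the high bits is an assumption here, not taken from the PR.

#include <cstdint>

using idx_t = uint64_t;

struct MetadataPointer {
	idx_t block_index : 56;
	uint8_t index : 8;
};

// Pack the pointer into one 64-bit value: index (0-63) in the top 8 bits,
// block_index in the lower 56 bits (layout assumed for this sketch).
idx_t Pack(MetadataPointer ptr) {
	return (idx_t(ptr.index) << 56) | idx_t(ptr.block_index);
}

MetadataPointer Unpack(idx_t value) {
	MetadataPointer ptr;
	ptr.block_index = value & ((idx_t(1) << 56) - 1);
	ptr.index = (uint8_t)(value >> 56);
	return ptr;
}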

The MetadataWriter and MetadataReader replace the MetaBlockWriter and MetaBlockReader. They work in a similar way - constructing linked lists of blocks - but operate on the smaller 4KB blocks managed by the MetadataManager instead.

Metadata Overhead

This new approach can greatly reduce the space taken up by metadata, particularly for small databases, as metadata that previously occupied several almost empty 256KB blocks can now share a single block. For example, running the following script on v0.8.1 and on this PR:

create table integers as select 42 i;
checkpoint;
create table integers2 as select 42 j;
checkpoint;
results in the following file sizes:

-rw-r--r--  1 myth  staff   1.3M Aug  8 19:21 almostempty-v081.db
-rw-r--r--  1 myth  staff   268K Aug  8 19:22 almostempty-new.db
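
As a rough sanity check on those sizes (assuming about 12KB of fixed file headers; the rest follows from the 256KB block size): 268KB is consistent with a single 256KB block plus the headers, while 1.3MB corresponds to roughly five 256KB blocks plus the headers, of which the three MetaBlockWriters alone account for three mostly empty blocks.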

Future Work

This PR moves to the new metadata manager approach for managing metadata, but it does not change the actual layout of the metadata and does not do any work on metadata re-use. Instead, it lays the groundwork for future work in that area. Namely, by storing the metadata in smaller blocks we can break the metadata up into smaller pieces. For example, we can use a new MetadataWriter for every few row groups, instead of one for all row group pointers of all tables in the database. This will allow us to re-use previously written blocks if they have not changed.

Another issue relates to the truncation of a database file (see #7824). We might still run into a problem where metadata is written after a big table, preventing the truncation until another checkpoint is run. Now that we have central information on where the metadata blocks reside, we could decide to do a second checkpoint to automatically clear up the space and truncate the file.

Index Serialization

CC @taniabogatsch

This PR also modifies index serialization, because it previously used the MetaBlockWriter. This PR moves that to the MetadataWriter but otherwise keeps the serialization the same. This is done to keep the serialization working while allowing us to remove the MetaBlockWriter entirely. Once the index serialization revamp is done, the changes here should be removed.

Note also that I've temporarily disabled test/sql/index/art/vacuum/test_art_vacuum_strings.test_slow in this PR, as this PR changes the way lazy loading of indexes works, which breaks that test. The test should be re-enabled along with the new index serialization.

@Mytherin merged commit acbbfe0 into duckdb:master on Aug 9, 2023 (51 checks passed).
@Mytherin deleted the metadatarework branch on December 4, 2023.