Blog file format: unified WAL and blob file format (1/n) #14675
pdillinger wants to merge 4 commits into
Summary: Introduces the "blog" file format (portmanteau of "blob" + "log"), a new unified file format for WAL and blob files in RocksDB. This change makes the new format an opt-in option for blob files ONLY. An immediate follow-up will add WAL support. This format is intended to be the future default for both WAL and blob files, and likely also manifest files.

The impetus for this new file format was an apparent convergence in requirements for interesting and useful future directions for RocksDB, along with some tech debt:

* Support blob "direct write" (key-value separation in the memtable) with WAL enabled, with at least the option to have all the WAL+blob data go into one file to reduce overheads in some cases, such as WAL sync write with blob direct write. (In other cases, separating WAL WriteBatches and blobs into distinct files would likely be the better choice.) The "preamble start" marker record is intended to support this case so that WriteBatches can carry external values in a "preamble" in memory and the WriteBatch doesn't need to be rewritten on storage to a single blog file serving both WAL and blob functions. (Details in later work.)
* Preserve the continuity of each blob value for efficient reads (NOTE: WAL/Manifest format often breaks up payloads), and extend this continuity to WriteBatches so that keys/values with known checksums could be carried and extended to the WriteBatch and its contiguous encoding in the blog-as-WAL file. (The goal is to leverage checksums across layers as much as possible rather than computing new ones at each layer; only CRC checksums are "extendable.")
* Support some "linear log" workloads with monotonically increasing keys and FIFO pruning of old data. A CF could be configured to use its own blog-as-WAL files for writing this data, and those files could get indexing information written to them as each file is sealed. This would enable moderately efficient read queries that process WriteBatch records for results, and no WAL->SST write amplification.
* Modernize blob and WAL formats with features like explicit versioning and extensibility, configurable and context-aware checksums, debugging and statistical information, customizable compression (CompressionManager aware), and more.

New DB options: use_blog_format_for_blobs, blog_checksum.

Other public API changes:

* ChecksumType moved to include/rocksdb/checksum_type.h.
* kStreamingCompressionSentinel (0x7F) added to CompressionType enum.

Some included refactoring:

* BlobLogWriter::log_number_ removed (was unused).
* BlobLogWriter::AppendFooter renamed to LegacyAppendFooterAndClose.

Test Plan: TODO
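To illustrate the "extendable" property that the summary attributes to CRC checksums, here is a minimal, self-contained sketch. It is not code from this PR: `Crc32cExtend` and `Crc32c` are hypothetical helper names, and the bitwise loop is a deliberately simple (slow) CRC32C, chosen only to show that a known checksum of a value can be continued over appended bytes without re-reading the value. RocksDB's existing `crc32c::Extend` has a similar shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: bitwise CRC32C (Castagnoli, reflected polynomial
// 0x82F63B78). `crc` is the finalized CRC of the bytes seen so far
// (0 when nothing has been checksummed yet).
uint32_t Crc32cExtend(uint32_t crc, const char* data, size_t len) {
  assert(data != nullptr || len == 0);
  crc = ~crc;  // Undo the final inversion to resume the running state.
  for (size_t i = 0; i < len; ++i) {
    crc ^= static_cast<uint8_t>(data[i]);
    for (int bit = 0; bit < 8; ++bit) {
      // Shift right one bit; XOR in the polynomial if the dropped bit was 1.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;  // Re-apply the final inversion.
}

// Hypothetical helper: CRC32C of a whole buffer.
uint32_t Crc32c(const std::string& s) {
  return Crc32cExtend(0, s.data(), s.size());
}
```

With these helpers, `Crc32cExtend(Crc32c(value), suffix.data(), suffix.size())` equals `Crc32c(value + suffix)`, so a layer that already holds the checksum of `value` only needs to process the bytes it appends.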
@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D102718613.
| Check | Count |
|---|---|
| cppcoreguidelines-pro-type-member-init | 1 |
| Total | 1 |
Details
db/blog/blog_format.cc (1 warning(s))
db/blog/blog_format.cc:594:1: warning: constructor does not initialize these fields: bytes [cppcoreguidelines-pro-type-member-init]
✅ Claude Code Review
Auto-triggered after CI passed - reviewing commit 5ba6eb5

Code Review: Blog File Format - Unified WAL and Blob File Format (1/n)
PR: Blog file format: unified WAL and blob file format (1/n)

Critical Findings

F1. MultiGetBlob Not Updated for Blog Format (HIGH)
Impact: Any MultiGet hitting a blog-format blob file fails with corruption errors. SingleGet is properly updated; MultiGet is not.
Recommendation: Implement blog format support in […]

Medium Findings

F2. BlobIndex Size Semantics Change Affects GC Statistics (MEDIUM)
Legacy stores always-compressed size; blog stores actual on-disk size (may be uncompressed when compression is ineffective). This is correct for read/write but changes accounting for […]
Recommendation: Document the semantic change; verify GC ratio calculations aren't affected.

F3. Footer Locator Offset Division Without Alignment Assertion (MEDIUM)
In […]
Recommendation: Add […]

F4. Generic Decompressor for Blog Format (MEDIUM)
Blog format uses […]
Recommendation: Benchmark blob read throughput to quantify any difference.
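The details of finding F3 did not survive in the review text above, so the following is only a generic sketch of the pattern its title describes: asserting alignment before dividing a byte offset down to alignment units. The names `kAssumedAlignment`, `OffsetToUnits`, and `UnitsToOffset` are hypothetical, and the alignment value of 4 is an assumption made for illustration, not the blog format's actual constant.

```cpp
#include <cassert>
#include <cstdint>

// Assumed alignment unit, for illustration only; the real value used by the
// blog format is not stated in the review.
constexpr uint64_t kAssumedAlignment = 4;

// Hypothetical helper: convert a byte offset into alignment units.
// Without the assertion, an unaligned offset would be silently truncated
// by the integer division.
inline uint64_t OffsetToUnits(uint64_t byte_offset) {
  assert(byte_offset % kAssumedAlignment == 0);
  return byte_offset / kAssumedAlignment;
}

// Hypothetical helper: convert alignment units back into a byte offset.
inline uint64_t UnitsToOffset(uint64_t units) {
  return units * kAssumedAlignment;
}
```

Because `assert` compiles out when `NDEBUG` is defined, a release build that must reject unaligned offsets would need an explicit check that returns an error instead.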
Test Plan:
New unit tests validate the blog file format core (33 tests in blog_format_test) covering header encode/decode round-trips, property encoding, escape sequence generation and verification, padding scheme, irregular varints, context checksums, footer locator/properties, schema version rejection, and typed property accessors. Writer/reader round-trip tests (11 tests in blog_writer_test) cover single and multiple blob records, compact vs full format selection, mixed record types, preamble-start stub, footer records, checksum corruption detection, alignment invariants, and header properties.
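As a rough picture of what an encode/decode round-trip check looks like, the sketch below uses a standard LEB128-style varint (7 payload bits per byte, high bit set on all but the last byte). This is an assumption for illustration: `EncodeVarint64` and `DecodeVarint64` are hypothetical names, and neither the blog format's actual encoding nor what the test plan means by "irregular" varints is specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append `v` to `dst` as a LEB128-style varint.
void EncodeVarint64(std::string* dst, uint64_t v) {
  assert(dst != nullptr);
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));  // More bytes follow.
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));  // Last byte has the high bit clear.
}

// Hypothetical helper: decode a varint starting at `pos` in `src`.
// Returns the number of bytes consumed, or 0 if the input is truncated or
// longer than 10 bytes.
size_t DecodeVarint64(const std::string& src, size_t pos, uint64_t* out) {
  assert(out != nullptr);
  const size_t start = pos;
  uint64_t result = 0;
  for (unsigned shift = 0; shift <= 63 && pos < src.size(); shift += 7) {
    const uint64_t byte = static_cast<uint8_t>(src[pos++]);
    result |= (byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return pos - start;
    }
  }
  return 0;
}
```

A round-trip test then encodes a value, decodes it from the resulting bytes, and checks that both the value and the consumed length match, including for boundary values such as 127, 128, and the maximum 64-bit value.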
Existing blob file builder, reader, cache, and source tests (41 tests across 4 test binaries) pass unmodified, verifying legacy blob format is not broken. The options_settable_test validates that use_blog_format_for_blobs and blog_checksum are properly wired through the options system. The log_test (211 tests) confirms legacy WAL format is completely unaffected.
Blog-as-blob integration is exercised by db_crashtest.py with use_blog_format_for_blobs and blog_checksum randomized across iterations, stress-testing write/crash/recovery cycles with various checksum types (CRC32c, xxHash, xxHash64, XXH3), compression configurations, and fault injection.