Blog file format: unified WAL and blob file format (1/n) #14675

Open
pdillinger wants to merge 4 commits into facebook:main from pdillinger:blog_format_no_wal

@pdillinger
Contributor

Summary:
Introduces the "blog" file format (portmanteau of "blob" + "log"), a new unified file format for WAL and blob files in RocksDB. This change makes the new format an opt-in option for blob files ONLY. An immediate follow-up will add WAL support. This format is intended to be the future default for both WAL and blob files, and likely also manifest files.

The impetus for this new file format was an apparent convergence in requirements for interesting and useful future directions for RocksDB, along with some tech debt:

  • Supporting blob "direct write" (key-value separation in the memtable) with WAL enabled and at least the option to have all the WAL+blob data go into one file to reduce overheads in some cases like WAL sync write with blob direct write. (In other cases, separating WAL WriteBatches and blobs into distinct files would likely be the better choice.) The "preamble start" marker record is intended to support this case so that WriteBatches can carry external values in a "preamble" in memory and the WriteBatch doesn't need to be rewritten on storage to a single blog file serving both WAL and blob functions. (Details in later work.)
  • Preserve the continuity of each blob value for efficient reads (NOTE: WAL/Manifest format often breaks up payloads), and extend this continuity to WriteBatches so that keys/values with known checksums could be carried and extended to the WriteBatch and its contiguous encoding in the blog-as-WAL file. (The goal is to leverage checksums across layers as much as possible rather than computing new ones at each layer; only CRC checksums are "extendable.")
  • Support some "linear log" workloads with monotonically increasing keys and FIFO pruning of old data. A CF could be configured to use its own blog-as-WAL files writing this data, and those files could get indexing information written to them as each file is sealed. This would enable moderately efficient read queries that process WriteBatch records for results, and no WAL->SST write amplification.
  • Modernize blob and WAL formats with features like explicit versioning and extensibility, configurable and context-aware checksums, debugging and statistical information, customizable compression (CompressionManager aware), and more.
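
The remark that only CRC checksums are "extendable" can be read as follows: given the finished CRC of a prefix, the CRC of the prefix plus more bytes can be computed without touching the prefix again. Below is a minimal sketch of that property using a plain bitwise CRC-32C. The function names are hypothetical and this is not RocksDB's implementation; it only illustrates the property under that reading.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
// Feeds `n` more bytes into an already finalized checksum `crc`.
// Pass 0 as `crc` to start a fresh checksum.
// Hypothetical helper for illustration, not a RocksDB function.
uint32_t Crc32cExtend(uint32_t crc, const char* data, size_t n) {
  crc = ~crc;  // undo the final inversion to recover the running state
  for (size_t i = 0; i < n; ++i) {
    crc ^= static_cast<uint8_t>(data[i]);
    for (int bit = 0; bit < 8; ++bit) {
      // Shift right; XOR in the polynomial when the low bit was set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;  // re-apply the final inversion
}

// Convenience wrapper: checksum of a whole string.
uint32_t Crc32c(const std::string& s) {
  return Crc32cExtend(0, s.data(), s.size());
}
```

With this, `Crc32cExtend(Crc32c(value), suffix, suffix_len)` equals the checksum of the concatenation, which is what would let a checksum already known for a value be carried into a larger enclosing record instead of being recomputed from scratch.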

New DB options: use_blog_format_for_blobs, blog_checksum.

Other public API changes:

  • ChecksumType moved to include/rocksdb/checksum_type.h.
  • kStreamingCompressionSentinel (0x7F) added to CompressionType enum.

Some included refactoring:

  • BlobLogWriter::log_number_ removed (was unused).
  • BlobLogWriter::AppendFooter renamed to LegacyAppendFooterAndClose.
Test Plan:
New unit tests validate the blog file format core (33 tests in blog_format_test) covering header encode/decode round-trips, property encoding, escape sequence generation and verification, padding scheme, irregular varints, context checksums, footer locator/properties, schema version rejection, and typed property accessors. Writer/reader round-trip tests (11 tests in blog_writer_test) cover single and multiple blob records, compact vs full format selection, mixed record types, preamble-start stub, footer records, checksum corruption detection, alignment invariants, and header properties.

Existing blob file builder, reader, cache, and source tests (41 tests across 4 test binaries) pass unmodified, verifying legacy blob format is not broken. The options_settable_test validates that use_blog_format_for_blobs and blog_checksum are properly wired through the options system. The log_test (211 tests) confirms legacy WAL format is completely unaffected.

Blog-as-blob integration is exercised by db_crashtest.py with use_blog_format_for_blobs and blog_checksum randomized across iterations, stress-testing write/crash/recovery cycles with various checksum types (CRC32c, xxHash, xxHash64, XXH3), compression configurations, and fault injection.

pdillinger requested a review from xingbowang on April 28, 2026 02:14
meta-cla Bot added the CLA Signed label on Apr 28, 2026
@meta-codesync

meta-codesync Bot commented Apr 28, 2026

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D102718613.

@github-actions

github-actions Bot commented Apr 28, 2026

⚠️ clang-tidy: 1 warning(s) on changed lines

Completed in 1715.2s.

Summary by check

| Check | Count |
| --- | --- |
| cppcoreguidelines-pro-type-member-init | 1 |
| Total | 1 |

Details

db/blog/blog_format.cc (1 warning(s))
db/blog/blog_format.cc:594:1: warning: constructor does not initialize these fields: bytes [cppcoreguidelines-pro-type-member-init]
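
For context, this check usually fires when a type has a user-provided constructor that leaves a member (here, one named `bytes`) without an initializer. The struct at line 594 is not shown in this thread, so the following is only a sketch of one common fix with a hypothetical type and array size: a default member initializer that zero-fills the array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical type for illustration; NOT the struct from blog_format.cc.
struct FixedBytesSketch {
  static constexpr size_t kSize = 8;  // assumed size, chosen arbitrarily

  // User-provided constructor that does not mention `bytes`. Without the
  // default member initializer below, `bytes` would be left indeterminate.
  FixedBytesSketch() {}

  // Default member initializer: value-initializes (zero-fills) the array.
  uint8_t bytes[kSize] = {};
};
```

If the field is deliberately left uninitialized for performance, suppressing the warning with a comment explaining why may be the better choice.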

@github-actions

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 5ba6eb5


Code Review: Blog File Format — Unified WAL and Blob File Format (1/n)

PR: Blog file format: unified WAL and blob file format (1/n)
Author: pdillinger
Scope: 45 files changed, 3700 insertions, 191 deletions
Review method: Multi-agent parallel review (9 agents) with cross-agent debate and synthesis


Critical Findings

F1. MultiGetBlob Not Updated for Blog Format (HIGH)

BlobFileReader::MultiGetBlob() (db/blob/blob_file_reader.cc:416-570) was not modified and contains four legacy-format assumptions that break with blog format:

  1. Line 445: IsValidBlobOffset() uses legacy header/footer sizes — will incorrectly reject valid blog offsets.
  2. Line 450: req->compression != compression_type_ — for blog format, compression_type_ is kNoCompression but BlobIndex stores per-record actual types. Every compressed blog blob fails with "Compression type mismatch."
  3. Line 458: CalculateAdjustmentForRecordHeader(key_size) computes wrong offset — blog format has a 5-byte trailer after the payload, not a legacy header before it.
  4. Line 551: VerifyBlob() uses legacy CRC, not blog's VerifyBlogRecordTrailer().

Impact: Any MultiGet hitting a blog-format blob file fails with corruption errors. SingleGet is properly updated; MultiGet is not.

Recommendation: Implement blog format support in MultiGetBlob() or add an early guard routing blog-format files through single-blob reads.
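
The second option in the recommendation (an early guard) could look roughly like the sketch below. Every type and function name here is a hypothetical stand-in rather than the real BlobFileReader interface; the point is only the control flow: detect the format first, and fall back to per-blob reads before any legacy offset or checksum logic runs.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a blob read request; not the RocksDB type.
struct BlobRequest {
  uint64_t offset = 0;
  uint64_t size = 0;
  bool ok = false;  // set by whichever read path handles the request
};

// Hypothetical callbacks standing in for the two existing read paths.
using SingleRead = std::function<bool(uint64_t offset, uint64_t size)>;
using BatchedRead = std::function<void(std::vector<BlobRequest>&)>;

void MultiGetSketch(bool is_blog_format, std::vector<BlobRequest>& reqs,
                    const SingleRead& single_read,
                    const BatchedRead& legacy_batched_read) {
  if (is_blog_format) {
    // Early guard: the batched path assumes the legacy record layout, so
    // route each request through the already blog-aware single-blob read.
    for (auto& req : reqs) {
      req.ok = single_read(req.offset, req.size);
    }
    return;
  }
  // Legacy files keep the existing batched path unchanged.
  legacy_batched_read(reqs);
}
```

This trades batched I/O for correctness on blog-format files, so it would be a stopgap until MultiGetBlob() itself understands the new layout.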


Medium Findings

F2. BlobIndex Size Semantics Change Affects GC Statistics (MEDIUM)

Legacy stores always-compressed size; blog stores actual on-disk size (may be uncompressed when compression is ineffective). This is correct for read/write but changes accounting for compaction_iterator.cc:1480 (total_blob_bytes_relocated += blob_index.size()) and blob_garbage_meter.cc.

Recommendation: Document the semantic change; verify GC ratio calculations aren't affected.
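
To make the accounting difference concrete, here is a small sketch of the two size rules as described in this finding. Both helpers are hypothetical, and the exact threshold at which the blog writer falls back to storing a record uncompressed is an assumption (strictly smaller wins).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; not RocksDB functions.

// Legacy rule as described above: the recorded size is always the
// compressed size, even when compression expanded the data.
uint64_t LegacyRecordedSize(uint64_t /*raw_size*/, uint64_t compressed_size) {
  return compressed_size;
}

// Blog rule as described above: the recorded size is what lands on disk.
// ASSUMPTION: compressed output is kept only if strictly smaller than raw.
uint64_t BlogRecordedSize(uint64_t raw_size, uint64_t compressed_size) {
  return compressed_size < raw_size ? compressed_size : raw_size;
}
```

For an incompressible 100-byte value whose compressed form is 108 bytes, the legacy rule records 108 and the blog rule records 100, so byte counters that sum `blob_index.size()` would differ between formats for the same data.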

F3. Footer Locator Offset Division Without Alignment Assertion (MEDIUM)

In CloseBlobFile: static_cast<uint32_t>((locator_offset - props_offset) / 4) — integer division truncates if not 4-byte aligned. No assertion guards this.

Recommendation: Add assert((locator_offset - props_offset) % 4 == 0).
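
A sketch of the guarded computation, assuming the two offsets are plain byte offsets into the file. The helper name and its error-returning shape are hypothetical; the original code may prefer a bare assert as recommended above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: converts a byte distance into a count of 4-byte
// words, refusing inputs that plain integer division would silently mangle.
bool LocatorDistanceInWords(uint64_t locator_offset, uint64_t props_offset,
                            uint32_t* words_out) {
  if (locator_offset < props_offset) {
    return false;  // unsigned subtraction would wrap around
  }
  const uint64_t delta = locator_offset - props_offset;
  if (delta % 4 != 0) {
    return false;  // not 4-byte aligned; delta / 4 would truncate
  }
  const uint64_t words = delta / 4;
  if (words > std::numeric_limits<uint32_t>::max()) {
    return false;  // would not fit in the 32-bit field
  }
  *words_out = static_cast<uint32_t>(words);
  return true;
}
```

For example, offsets 66 and 16 give a distance of 50 bytes; unchecked division yields 12 words (48 bytes), so a reader would land 2 bytes short of the intended position.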

F4. Generic Decompressor for Blog Format (MEDIUM)

Blog format uses GetDecompressor() (generic) vs legacy's GetDecompressorOptimizeFor(). Functionally correct per the CompressionManager API contract, but may have performance implications.

Recommendation: Benchmark blob read throughput to quantify any difference.


Suggestions

  • S1. Move blog_format_spec.html from repo root to docs/components/blob/.
  • S2. Consider format-aware minimum file size validation in OpenFile.
  • S3. Verify CompressBlock handles max_compressed_size = blob.size() constraint correctly for all compression types.

Positive Observations

  1. Robust format detection via 12-byte magic in BlogFileHeader::IsBlogFormat() — transparent to consumers.
  2. Clean backward compatibility — legacy files work identically; dual-path is well-structured.
  3. Per-record compression eliminates the legacy WART of storing expanded compressed data.
  4. Defense-in-depth context checksums at every level using incarnation_id.
  5. Complete API migration — all callers of renamed/removed methods properly updated.
  6. Extensible footer reusing the record infrastructure for integrity protection.
  7. Comprehensive unit tests — 33 format tests + 11 writer tests.
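
As an illustration of the first observation, magic-based detection generally amounts to comparing a fixed-length file prefix against a known constant. The sketch below uses PLACEHOLDER bytes, since the actual 12-byte magic is not given in this thread, and it treats anything else as legacy, which is a simplification; the names are hypothetical and this is not the body of BlogFileHeader::IsBlogFormat().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

constexpr size_t kMagicSize = 12;
// PLACEHOLDER value only; the real blog magic bytes are not shown here.
constexpr char kPlaceholderMagic[kMagicSize + 1] = "PLACEHOLDER!";

enum class FileFormat { kLegacy, kBlog };

// Hypothetical detector: blog format if the file starts with the magic.
// ASSUMPTION: every other file is handled by the legacy path.
FileFormat DetectFormat(const std::string& file_prefix) {
  if (file_prefix.size() >= kMagicSize &&
      std::memcmp(file_prefix.data(), kPlaceholderMagic, kMagicSize) == 0) {
    return FileFormat::kBlog;
  }
  return FileFormat::kLegacy;
}
```

Because the check needs only the first bytes of the file, a reader can pick its code path at open time and callers never have to know which format they are reading.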

The full review is written to review-findings.md. The primary blocker is F1 (MultiGetBlob), which needs either implementation or explicit documentation as a known limitation.


ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@meta-codesync

meta-codesync Bot commented May 20, 2026

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D102718613.
