Skip to content

Add use_direct_reads_for_compaction option#14743

Open
zaidoon1 wants to merge 1 commit into
facebook:mainfrom
zaidoon1:use-direct-reads-for-compaction
Open

Add use_direct_reads_for_compaction option#14743
zaidoon1 wants to merge 1 commit into
facebook:mainfrom
zaidoon1:use-direct-reads-for-compaction

Conversation

@zaidoon1
Copy link
Copy Markdown
Contributor

@zaidoon1 zaidoon1 commented May 14, 2026

Adds a new DBOption use_direct_reads_for_compaction (default false). When on, compaction-input SST files are opened with O_DIRECT so the sequential read-once data from compaction doesn't pollute the OS page cache and evict the hot user-read working set. User reads keep going through the buffered fast path. This protects user-read tail latency on write-heavy workloads without forcing user reads onto the existing global use_direct_reads knob (which pays in throughput and P50 — see the bench below).

The interesting bit is that just flipping the FileOptions returned by FileSystem::OptimizeForCompactionTableRead doesn't actually trigger O_DIRECT at the kernel level. The TableCache (and FileMetaData::pinned_reader) is already holding buffered handles opened at flush time or at DB::Open via LoadTableHandlers. When compaction asks for an iterator, it gets back the cached buffered handle and the kernel never sees the O_DIRECT flag.

So this PR also adds a small bypass path:

  • TableCache::FindTable / NewIterator learn a bypass_cache_for_scan mode. When set, the pinned-reader fast path and the shared cache are skipped, GetTableReader is called directly with the caller's FileOptions, and ownership of the freshly opened TableReader is handed back via a unique_ptr. The iterator takes ownership via RegisterCleanup and frees the reader on destruction.
  • VersionSet::MakeInputIterator and LevelIterator plumb the flag through both L0 and L1+ compaction-input paths.
  • CompactionJob::ProcessKeyValueCompaction turns the bypass on when use_direct_reads_for_compaction is set, the global use_direct_reads is off, and OptimizeForCompactionTableRead produced use_direct_reads=true in the compaction-read FileOptions.

The option is opt-in: when off, nothing changes for existing users. When on, only the compaction-input opens take the bypass path; user reads keep hitting the TableCache and the buffered fast path normally.

There's also a small db_bench helper in the same PR: a new --bgwriter_num flag that lets the writer thread in readwhilewriting (and the other "while writing" variants) spread its puts across [0, bgwriter_num) instead of [0, num). Without this the readers and writer share a key range and you can't have both a hot read subset and meaningful compaction work — this lets you have both.

Benchmark

Setup: Ubuntu 24.04 (kernel 7.0.5, OrbStack Linux VM on Apple Silicon), 14 vCPUs, virtio-blk disk, ext4. MGLRU disabled (echo 0 > /sys/kernel/mm/lru_gen/enabled) so the kernel uses the classic active/inactive LRU. 14 GB DB (3.5M keys × 4 KB values), no compression. Each measurement run is pinned to a 1 GB cgroup via systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0. Page cache is dropped between configs. db_bench is Release build.

Workload: readwhilewriting for 120s. 4 reader threads doing random reads over a hot key subset, plus 1 writer thread spreading overwrites across the full 3.5M-key keyspace (via --bgwriter_num=3500000) throttled at 200 MB/s, so there's continuous compaction running while the readers go.

The size of the hot reader subset relative to available page cache controls how visible the optimization is. The Cassandra blog (Lightfoot 2026) documented the same thing: biggest wins when the hot set is big enough to actually compete for cache, smaller wins when the hot set trivially fits, neutral when the hot set is way bigger than cache. So I ran two hot-set sizes.

Small hot set: ~30 MB (~3% of the 1 GB cgroup) — N=5 iterations, mean (CV)

--num=7500. The hot set is small enough that the page cache holds it without much trouble even under compaction, so the wins here are real but on the modest side.

Config Throughput (ops/s) Read P50 (µs) Read P99 (µs) Read P99.9 (µs) Read P99.99 (µs)
buffered (default) 233,477 (8.2%) 16.09 82.24 721.0 2,102.5
direct_compaction_writes_only (existing knob alone) 287,405 (2.8%) — +23.1% 13.00 (−19.2%) 66.77 (−18.8%) 553.9 (−23.2%) 1,787.6 (−15.0%)
direct_compaction_read_only (new knob alone) 250,669 (2.4%) — +7.4% 14.16 (−12.0%) 102.99 (+25.2%) 689.8 (−4.3%) 1,801.3 (−14.3%)
direct_compaction_read_write (new + existing, recommended) 277,920 (3.3%) — +19.0% 12.99 (−19.3%) 84.23 (+2.4%) 613.4 (−14.9%) 1,738.2 (−17.3%)
use_direct_reads=true (existing global) + write-side 249,014 (2.5%) — +6.7% 15.95 (−0.9%) 68.78 (−16.4%) 450.8 (−37.5%) 1,814.5 (−13.7%)

CV is 2.4–3.3% on the optimized configs (8.2% on buffered), so the deltas are real. With a hot set this small, the existing use_direct_io_for_flush_and_compaction knob is already doing most of the work — the new flag's main extra contribution here is P99.99 (combined wins it by ~2 points vs writes-only-alone). Worth noting: the new flag alone (without the existing write-side flag) improves P99.99 but regresses P99 by 25% on this small-hot-set workload, because direct compaction reads lose kernel readahead and compaction-output writes are still hitting the page cache. That regression goes away once you combine with the existing write-side flag, or once the hot set is bigger (see next table). So if you're using just one knob, use the existing one. If you're using this PR's flag, pair it with use_direct_io_for_flush_and_compaction=true.

Larger hot set: ~400 MB (~40% of cache) — N=5 iterations, mean (CV)

--num=100000. This is the case the Cassandra blog calls out — hot set big enough to actually fight compaction for cache. Their analogous setup (1M hot partitions, ~33% hot/cache) reported 1.93× p99 improvement. Numbers here are the headline:

Config Throughput (ops/s) Read P50 (µs) Read P99 (µs) Read P99.9 (µs) Read P99.99 (µs)
buffered (default) 68,959 (7.7%) 44.81 541.22 2,225.2 11,334.5
direct_compaction_writes_only (existing knob alone) 73,973 (10.3%) — +7.3% 42.22 (−5.8%) 456.27 (−15.7%) 2,016.9 (−9.4%) 9,190.0 (−18.9%)
direct_compaction_read_only (new knob alone) 84,337 (2.3%) — +22.3% 38.66 (−13.7%) 386.97 (−28.5%) 1,644.8 (−26.1%) 4,837.9 (−57.3%, 2.34×)
direct_compaction_read_write (new + existing, recommended) 104,923 (8.4%) — +52.2% 34.26 (−23.5%) 290.97 (−46.2%) 1,143.4 (−48.6%) 3,080.3 (−72.8%, 3.68×)
use_direct_reads=true (existing global) + write-side 71,598 (9.1%) — +3.8% 51.33 (+14.5%) 297.91 (−45.0%) 1,663.6 (−25.2%) 6,530.0 (−42.4%)

Combined config gets a 3.68× p99.99 win, 1.86× p99, p50 down 23%, throughput up 52%. Same shape as the Cassandra blog's 1.93× p99 result — the improvement just lands at deeper percentiles for us because RocksDB's baseline data path is roughly 40× faster than Cassandra's (their buffered p99 was 35 ms, ours is 0.54 ms), so the cache-miss tail is further out.

A few things worth calling out from this table:

  • The new flag is doing real work on top of the existing write-side flag here, not just shifting things around. Combined throughput is +42% over direct_compaction_writes_only alone, and combined p99.99 is 3× better. The existing knob alone gives a fairly modest +7% throughput / -19% p99.99 in this case — there's a clear gap that the new flag fills.
  • The new flag alone (no existing write-side flag) is also a real improvement here: +22% throughput, p99.99 down 57%. The P99 regression we saw in the small-hot-set case is gone, because the cache-protection effect now dominates the lost-readahead cost.
  • use_direct_reads=true (the existing global flag) actually regresses P50 by 14.5% in this workload — taking user reads off the page cache hurts you when the hot data could have been cached. It also gets the worst throughput of any direct config. It's not an equivalent way to get these gains.

compaction_readahead_size matters when this flag is on

Direct I/O bypasses kernel readahead, so RocksDB's own DBOptions::compaction_readahead_size becomes the only prefetch the iterator has. The default of 2 MB is enough and real users will get it automatically. But db_bench's --compaction_readahead_size CLI default is 0, which defeats prefetch and makes direct compaction look slower than it actually is. If you're reproducing the numbers above, pass --compaction_readahead_size=2097152 (or larger).

Summary

  • Recommended production config is use_direct_reads_for_compaction=true + use_direct_io_for_flush_and_compaction=true. Strongest configuration at every percentile and throughput in both benches.
  • The new flag is the read-side counterpart to use_direct_io_for_flush_and_compaction, which handles compaction-write cache pollution. They address different sources of pollution and compose. The gap between "combined" and "writes-only-alone" is 17 percentage points on p99.99 in the small-hot-set bench and 54 points in the larger one, so the new flag is contributing real value, especially as the hot set grows.
  • The new flag alone is also a real improvement when the hot set is big enough to compete with cache (+22% throughput, 2.34× p99.99 in the larger-hot-set bench). On a very small hot set it improves p99.99 but regresses p99, so pairing with the existing write-side flag is safer.
  • The benefit is workload-dependent. Small hot sets get modest tail-latency wins. Hot sets sized to actually compete for cache get the big multi-percentile wins shown above. Hot sets bigger than cache (not benched here but covered in the Cassandra blog) see no change either way — every read misses regardless.

Reproducing

Any Linux host (or a Linux VM on macOS via OrbStack / Multipass / lima):

sudo apt-get install -y build-essential clang cmake git pkg-config \
  libgflags-dev libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev libzstd-dev

cmake -DCMAKE_BUILD_TYPE=Release -DPORTABLE=1 -DWITH_GFLAGS=1 -DWITH_TESTS=0 ..
make -j db_bench

echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled

Build the source DB once, unrestricted memory:

./db_bench --benchmarks=fillrandom,compact,waitforcompaction,stats \
  --db=/path/to/source_db --num=3500000 --key_size=16 --value_size=4096 \
  --write_buffer_size=16777216 --target_file_size_base=16777216 \
  --max_background_jobs=4 --compression_type=none --cache_size=4194304 \
  --max_bytes_for_level_base=67108864 --disable_wal=1 --sync=0

For each config, copy source_db -> scratch_db, run sync && echo 3 > /proc/sys/vm/drop_caches, then:

sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 \
  ./db_bench --use_existing_db=1 \
    --benchmarks=readwhilewriting,stats --db=/path/to/scratch_db \
    --threads=5 --duration=120 --statistics=true --histogram=1 \
    --num=7500 --bgwriter_num=3500000 \
    --key_size=16 --value_size=4096 \
    --write_buffer_size=16777216 --target_file_size_base=16777216 \
    --max_background_jobs=4 --compression_type=none \
    --cache_size=4194304 --open_files=200 \
    --skip_stats_update_on_db_open=true \
    --max_bytes_for_level_base=67108864 \
    --benchmark_write_rate_limit=209715200 \
    --compaction_readahead_size=2097152 \
    --rate_limiter_bytes_per_sec=0 \
    --use_direct_reads={true|false} \
    --use_direct_reads_for_compaction={true|false} \
    --use_direct_io_for_flush_and_compaction={true|false}

For the larger hot-set table, change --num=7500 to --num=100000.

The five configs in the tables:

  • buffered: all three flags false.
  • direct_compaction_writes_only: use_direct_io_for_flush_and_compaction=true, the other two false. This is what users have today without this PR.
  • direct_compaction_read_only: use_direct_reads_for_compaction=true, the other two false.
  • direct_compaction_read_write: use_direct_reads_for_compaction=true, use_direct_io_for_flush_and_compaction=true, use_direct_reads=false. Recommended.
  • direct_all: use_direct_reads=true, use_direct_io_for_flush_and_compaction=true, use_direct_reads_for_compaction=false.

@meta-cla meta-cla Bot added the CLA Signed label May 14, 2026
@zaidoon1 zaidoon1 force-pushed the use-direct-reads-for-compaction branch 2 times, most recently from 8b08b12 to 7931580 Compare May 14, 2026 22:07
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

✅ clang-tidy: No findings on changed lines

Completed in 781.8s.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 7931580


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 7931580


Summary

Well-designed feature addressing a real performance need (avoiding page-cache pollution from compaction scans). The core bypass mechanism in TableCache is sound, and the PosixFileSystem fix is a valuable correction. Main concerns are around OptimizeForBlobFileRead scope creep affecting backup operations, and a minor inaccuracy in the release note.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. OptimizeForBlobFileRead scope creep — env/file_system.cc:181, env/env.cc:1190
  • Issue: The PR changes OptimizeForBlobFileRead to set use_direct_reads = db_options.use_direct_reads || db_options.use_direct_reads_for_compaction. This function is called from utilities/backup/backup_engine.cc at 3 call sites (lines 1568, 3049, 3084) for blob file copying during backup. When use_direct_reads_for_compaction=true and use_direct_reads=false, backup blob file reads will unexpectedly use O_DIRECT.
  • Root cause: The option name says "for compaction" but OptimizeForBlobFileRead is used by non-compaction code paths. Blob files read during compaction go through BlobFileCache which uses general file_options_, not OptimizeForBlobFileRead — so the change doesn't even help the intended compaction case.
  • Suggested fix: Either (a) don't change OptimizeForBlobFileRead — it's not needed for the stated goal, or (b) rename the option to use_direct_reads_for_background if broader scope is intentional. At minimum, document that the flag affects backup blob reads.
M2. Release note inaccuracy — unreleased_history/new_features/use_direct_reads_for_compaction.md
  • Issue: The release note says "RocksDB issues compaction (and flush) background reads through O_DIRECT." The PR does not affect flush reads — flush produces new SST files, it doesn't read existing ones. The direct I/O is only for compaction-input reads.
  • Root cause: Copy-paste from use_direct_io_for_flush_and_compaction description.
  • Suggested fix: Remove "(and flush)" from the release note.
M3. prefetch_index_and_filter_in_cache=true for ephemeral readers — db/table_cache.cc:332
  • Issue: The bypass path in FindTable calls GetTableReader with prefetch_index_and_filter_in_cache=true (inherited from the default in NewIterator). Since the ephemeral reader is never placed in the cache, prefetching into the block cache is wasteful — it loads index/filter blocks into the block cache from O_DIRECT reads that were meant to bypass caching.
  • Root cause: The prefetch_index_and_filter_in_cache parameter was designed for cached readers. The bypass path doesn't override it.
  • Suggested fix: Pass prefetch_index_and_filter_in_cache=false when bypass_cache_for_scan=true, or set it to false in the FindTable bypass path.

🟢 LOW / NIT

L1. PosixFileSystem behavioral change is a pre-existing bug fix — env/fs_posix.cc:950-969
  • Issue: The current PosixFileSystem::OptimizeForCompactionTableRead override does NOT call the base class FileSystem::OptimizeForCompactionTableRead. This means it never sets use_direct_reads from db_options.use_direct_reads. The PR fixes this by calling the base class first. This is a correct bug fix but is orthogonal to the new feature.
  • Root cause: Original PosixFileSystem override was written only for the readahead clamping fix (compaction_readahead_size doesn't work when it is larger than max_sectors_kb #12038) and forgot to call the base class.
  • Suggested fix: Consider noting this as a separate bug fix in the commit message. The fix itself is correct and preserves behavior when both flags are false (since use_direct_reads=false || false=false).
L2. Redundant third condition in bypass_cache_for_scandb/compaction/compaction_job.cc
  • Issue: The condition file_options_for_read_.use_direct_reads in bypass_cache_for_scan = use_direct_reads_for_compaction && !use_direct_reads && file_options_for_read_.use_direct_reads is technically redundant — when use_direct_reads_for_compaction=true, OptimizeForCompactionTableRead will always set file_options_for_read_.use_direct_reads=true.
  • Root cause: Defensive programming — guards against custom FileSystem overrides that might not call the base class.
  • Suggested fix: This is actually a reasonable safety check. Keep it, but consider adding a comment explaining it's a defensive check against custom FS implementations.
L3. db_bench --bgwriter_num bundled with feature PR — tools/db_bench_tool.cc
  • Issue: The --bgwriter_num flag for db_bench is a useful benchmarking tool but is a separate improvement unrelated to the core feature.
  • Suggested fix: Consider splitting into a separate commit or PR for cleaner history. Not a blocker.
L4. table_reader_ptr in bypass mode — db/table_cache.cc:354-356
  • Issue: When bypass_cache_for_scan=true, table_reader_ptr is set to the ephemeral reader. Callers that store this pointer must not use it after the iterator is destroyed. In the compaction path, table_reader_ptr is always nullptr, so this is safe in practice.
  • Suggested fix: No action needed — all bypass-mode callers pass nullptr for table_reader_ptr.

Cross-Component Analysis

Context Affected? Assessment
WritePreparedTxnDB Yes (same compaction path) Safe — bypass is per-iterator, no shared state
FIFO/Universal compaction Yes (same MakeInputIterator) Safe — all compaction types share the path
ReadOnly/Secondary DB No compaction Not affected
CompactionServiceCompactionJob Inherits privately from CompactionJob Safe — inherits CreateInputIterator
BackupEngine blob reads Yes via OptimizeForBlobFileRead Unintended — see M1
ForwardIterator/LevelIterator Default false Safe — only compaction sets bypass=true
Concurrent compactions on same file Both open ephemeral readers Safe — each has independent reader/FD
Mixed O_DIRECT + buffered on same file Kernel concern Acceptable — compaction is sequential one-pass; file is dropped after

Assumption stress-test results:

  • "Ephemeral reader cleanup is safe" — CONFIRMED. Range tombstone iterators hold shared_ptr<FragmentedRangeTombstoneList>, not raw pointers into the TableReader. The InternalKeyComparator reference is to the CFD-level comparator which outlives the compaction. No use-after-free.
  • "Default false preserves existing behavior" — CONFIRMED. When both flags are false, use_direct_reads stays false through all code paths. The PosixFileSystem base-class call is a correct fix for a pre-existing bug.
  • "Mixing O_DIRECT and buffered is safe in practice" — ACCEPTABLE. The kernel documentation warns against it, but the pattern (sequential one-pass scan that ends with file deletion) avoids the problematic sustained mixed-access scenario. Well-documented in options.h.

Positive Observations

  • The bypass mechanism is cleanly scoped — the fresh_table_reader_owner unique_ptr in FindTable and RegisterCleanup in NewIterator provide clear ownership semantics.
  • The PosixFileSystem fix to call the base class is a valuable correction that ensures custom readahead clamping doesn't silently override the direct-reads decision.
  • Comprehensive test coverage: end-to-end test with both kernel-level (O_DIRECT sync point) and plumbing-level (FileOptions) probes, plus a negative test for the off case.
  • Proper crash test integration with mmap and direct IO sanitization.
  • Good option validation (mmap + direct_reads_for_compaction rejected at Open time).

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@zaidoon1 zaidoon1 force-pushed the use-direct-reads-for-compaction branch 2 times, most recently from 7892251 to d268198 Compare May 14, 2026 23:25
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit d268198


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@zaidoon1 zaidoon1 force-pushed the use-direct-reads-for-compaction branch 2 times, most recently from d5bb307 to 176a7c2 Compare May 15, 2026 01:38
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 15, 2026

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 176a7c2


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 15, 2026

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 176a7c2


Summary

Well-designed, opt-in feature that addresses a real performance need (preventing compaction reads from polluting the OS page cache). The implementation is architecturally sound, following existing RocksDB patterns. Most infrastructure (options, serialization, validation, stress test, db_bench) is complete and correct.

High-severity findings (2):

  • [db/table_cache.cc:384] Ephemeral TableReader leaked if NewIterator returns an error after FindTable succeeds in bypass mode — ephemeral_reader goes out of scope without transferring ownership to a cleanup.
  • [db/table_cache.cc:346-393] Range tombstone iterators created from an ephemeral TableReader (via table_reader->NewRangeTombstoneIterator) are owned by TruncatedRangeDelIterator/range_del_agg which may outlive the result iterator that owns the ephemeral reader via RegisterCleanup, causing use-after-free.
Full review (click to expand)

Findings

🔴 HIGH

H1. Ephemeral TableReader leak on iterator-creation failure — db/table_cache.cc:384
  • Issue: In the bypass path, FindTable creates an ephemeral TableReader owned by ephemeral_reader (a unique_ptr on the stack of NewIterator). If FindTable succeeds but the subsequent table_reader->NewIterator() call fails or s becomes not-OK in the range-deletion processing block (lines 358–395), the error path at line 401–403 creates an ErrorInternalIterator but never transfers the ephemeral reader to it. ephemeral_reader is destroyed when the function returns, but this is actually benign for the error case since the reader is cleaned up by the destructor. However, if FindTable succeeds, result is created, range-del processing then sets s to not-OK (line 380), the code falls through to line 398 where handle is nullptr (bypass), then at line 401 result (the valid iterator that has the cleanup registered) is replaced by NewErrorInternalIterator — but result was already created and the cleanup was already registered on it. The old result is leaked (raw pointer, no delete), and with it the registered cleanup that would have freed the ephemeral reader.
  • Root cause: Raw pointer result is overwritten at line 403 without deleting the previous iterator that already had the ephemeral-reader cleanup registered on it.
  • Suggested fix: Before overwriting result with the error iterator, delete the old result (which will trigger its cleanup chain and free the ephemeral reader), or restructure the error handling to avoid the overwrite. This is an existing pattern issue amplified by the bypass path.
H2. Range tombstone iterator lifetime vs ephemeral TableReader — db/table_cache.cc:358-395
  • Issue: When bypass_cache_for_scan is active, the ephemeral TableReader is owned by result via RegisterCleanup. But range_del_iter (line 359) and range_del_agg (line 374) receive FragmentedRangeTombstoneIterator objects created from the same TableReader. These range-tombstone iterators hold internal pointers into the TableReader's data structures. In the compaction merging iterator, range-tombstone iterators may be accessed after the file iterator (result) that owns the ephemeral reader is destroyed during LevelIterator file transitions. If the LevelIterator destroys the old file iterator (and thus the ephemeral reader) before the CompactionMergingIterator finishes with that file's range tombstones, a use-after-free occurs.
  • Root cause: With cached readers, the reader stays alive in the cache even after the iterator is destroyed. With ephemeral readers, the reader's lifetime is strictly tied to the iterator. Range-tombstone iterators returned via range_del_iter are stored separately (in LevelIterator's tombstone tracking) and may outlive the file iterator.
  • Suggested fix: Either (a) have the range-tombstone iterator share ownership of the ephemeral reader (e.g., via shared_ptr), (b) ensure the ephemeral reader lifetime extends to cover all tombstone iterators derived from it, or (c) for the L0 path in MakeInputIterator, the range tombstone iters and file iters are passed together to NewCompactionMergingIterator which manages their lifetimes jointly — verify this is safe. For the L1+ LevelIterator path, the range_tombstone_iter_ pointer is updated atomically during InitFileIterator, and the old tombstone iter is only freed after the merging iterator is done with it. This needs careful verification: examine CompactionMergingIterator to confirm it never holds a stale tombstone iterator after the owning file iterator is destroyed.

🟡 MEDIUM

M1. Scope creep: bgwriter_num flag in db_bench — tools/db_bench_tool.cc:268-276,7944-7948
  • Issue: The --bgwriter_num flag is a useful benchmarking enhancement but is functionally independent of the use_direct_reads_for_compaction option. Bundling it in the same PR makes the diff larger and review harder.
  • Suggested fix: Split into a separate PR.
M2. Backup engine inherits changed OptimizeForCompactionTableRead behavior — env/file_system.cc:170
  • Issue: OptimizeForCompactionTableRead is called by the backup engine (backup_engine.cc) for copying SST files. When use_direct_reads_for_compaction=true, backup engine SST reads will also get use_direct_reads=true in their FileOptions. This is likely benign (backup reads are sequential, like compaction) but is an unintended side effect.
  • Suggested fix: Document this interaction. If undesirable, the backup engine could use its own optimization method.
M3. Missing test coverage for L1+ LevelIterator bypass path — db/db_compaction_test.cc
  • Issue: UseDirectReadsForCompactionEndToEnd creates only 2 L0 files that compact to L1. This exercises only the L0 direct-NewIterator path in MakeInputIterator, not the LevelIterator path used for L1+ inputs. The LevelIterator bypass path (where bypass_cache_for_scan_ is stored and used in InitFileIterator) is untested.
  • Suggested fix: Add a test that populates data into L1 and L2, then triggers an L1→L2 compaction to exercise the LevelIterator bypass.
M4. Sync-point callback counters not atomic — db/db_compaction_test.cc:6695-6707
  • Issue: observed_run_starts, observed_odirect_opens, observed_callbacks are plain int variables incremented from sync-point callbacks that run on compaction threads, while assertions run on the test thread. This is a data race (undefined behavior per C++ standard). In practice, compaction is serial in this test due to disable_auto_compactions + CompactRange, but if any internal parallelism triggers these callbacks concurrently, the test becomes flaky.
  • Suggested fix: Use std::atomic<int> for all counters incremented from sync-point callbacks.
M5. NO_FILE_OPENS statistic inflated for bypass path — db/table_cache.cc:127
  • Issue: GetTableReader records RecordTick(ioptions_.stats, NO_FILE_OPENS) on every call. In normal operation, this is called once per cache miss. With bypass, it's called for every compaction file access. This inflates the rocksdb.no.file.opens statistic, potentially confusing monitoring systems.
  • Suggested fix: This is technically correct (each call does open a file), but the changed semantics should be documented. Consider adding a separate COMPACTION_DIRECT_FILE_OPENS counter.

🟢 LOW / NIT

L1. Three-way bypass condition could be simplified — db/compaction/compaction_job.cc:1547-1549
  • Issue: bypass_cache_for_scan = use_direct_reads_for_compaction && !use_direct_reads && file_options_for_read_.use_direct_reads — the third term is always true when the first two are true (since OptimizeForCompactionTableRead ORs the two flags). The third check is a defensive guard against custom FileSystem overrides that don't call the base class, which is reasonable but could use a comment explaining the rationale.
  • Suggested fix: Add a brief comment explaining the third condition serves as a guard for custom FileSystem implementations.
L2. Option documentation mentions mixed O_DIRECT access — include/rocksdb/options.h:1147-1157
  • Issue: The doc comment correctly notes the Linux open(2) advisory against mixed O_DIRECT and buffered I/O. The explanation of why it's safe in practice is good. Minor nit: could mention that posix_fadvise(POSIX_FADV_DONTNEED) is an alternative approach that avoids mixed-mode access entirely.
  • Suggested fix: Optional: mention alternatives in the documentation.
L3. RandomInitDBOptions may generate invalid combinations — test_util/testutil.cc:304
  • Issue: use_direct_reads_for_compaction is randomized independently of allow_mmap_reads, which could generate invalid option combinations. db_crashtest.py sanitizes these, but direct use of RandomInitDBOptions in other tests might hit the mmap + direct_compaction_reads validation error.
  • Suggested fix: This follows the existing pattern (same issue exists for use_direct_reads + allow_mmap_reads). No action required beyond awareness.

Cross-Component Analysis

Context Executes? Assumptions hold? Action needed?
WritePreparedTxnDB YES (same compaction path) YES Safe
ReadOnly DB NO (no compactions) N/A Safe
CompactionService YES (inherits CompactionJob) YES Safe — options propagated via serialization
User-defined timestamps YES YES No special handling needed
BlobDB Separate path (OptimizeForBlobFileRead) YES Intentionally excluded — correct
FIFO/Universal compaction YES (all use MakeInputIterator) YES Safe
Backup engine YES (OptimizeForCompactionTableRead called) Partially See M2 above
Old snapshots YES YES No impact on snapshot semantics
Concurrent writers YES YES Bypass path is independent per compaction
Prefix seek N/A (compaction uses total-order) YES Safe

Positive Observations

  • Option infrastructure is thorough: Options registration, serialization round-trip, dump logging, C API, db_bench, db_stress, crashtest integration, and validation are all present and follow existing patterns precisely.
  • PosixFileSystem correctly calls base class: The override now calls FileSystem::OptimizeForCompactionTableRead first, then applies the Linux-specific readahead clamping on the post-base values. This is the right fix for the inheritance pattern.
  • Blob file reads intentionally excluded: The test in UseDirectReadsForCompactionOptionMechanics explicitly verifies that OptimizeForBlobFileRead does NOT enable direct reads for the new flag. Good isolation.
  • End-to-end test validates kernel-level O_DIRECT: The NewRandomAccessFile:O_DIRECT sync point verifies the actual OS-level flag, not just in-memory FileOptions.
  • Clean bypass activation logic: The bypass_cache_for_scan flag is only computed in one place (CreateInputIterator) and threaded down consistently through MakeInputIteratorLevelIteratorNewIteratorFindTable.
  • Defensive validation: mmap_reads + use_direct_reads_for_compaction is properly rejected at DB::Open with a clear error message.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@zaidoon1 zaidoon1 force-pushed the use-direct-reads-for-compaction branch from 176a7c2 to 9a8d933 Compare May 15, 2026 08:12
Introduces a new DBOption `use_direct_reads_for_compaction` (default
false) that lets users route compaction background reads
through O_DIRECT while keeping user reads on the buffered/page-cache
path. Sequential compaction reads otherwise pollute the OS page cache
with read-once data that evicts the hot user-read working set;
bypassing the cache for those reads protects user-read tail latency on
write-heavy workloads without forcing users onto the global
`use_direct_reads` path (which slows user reads dramatically).

A naive implementation that only flipped the FileOptions returned by
`OptimizeForCompactionTableRead` does not actually trigger the
OS-level O_DIRECT open, because the TableCache (and
FileMetaData::pinned_reader) already holds long-lived buffered
handles opened at flush time or at DB::Open via LoadTableHandlers.
Compaction would silently reuse those cached buffered handles and the
kernel would never see the O_DIRECT flag.

The fix opens ephemeral O_DIRECT handles for the lifetime of the
compaction scan, separate from the cache:

  * TableCache::FindTable / NewIterator learn a `bypass_cache_for_scan`
    mode. When set, the pinned-reader fast path and the shared cache
    are skipped, GetTableReader is called directly with the caller's
    FileOptions, and ownership of the freshly opened TableReader is
    handed back to the caller. The iterator takes ownership via
    RegisterCleanup and frees the reader on destruction.
  * VersionSet::MakeInputIterator and LevelIterator plumb the flag
    through both the L0 and L1+ compaction-input paths.
  * CompactionJob::ProcessKeyValueCompaction enables the flag exactly
    when `use_direct_reads_for_compaction` is on, the global
    `use_direct_reads` is off, and `OptimizeForCompactionTableRead`
    actually produced `use_direct_reads=true` in the
    compaction-read FileOptions.

An end-to-end test in db_compaction_test.cc uses the existing
`NewRandomAccessFile:O_DIRECT` sync point in env/fs_posix.cc to assert
that the kernel-level open really happens for compaction inputs when
the flag is set, and never fires when the flag is off. The test is
scoped to platforms that use the O_DIRECT path.

A small unrelated convenience also lands here: a new db_bench flag
`--bgwriter_num` that lets the writer thread in readwhilewriting use a
wider keyspace than the readers. This is what made it possible to
benchmark the new option realistically -- the readers see a small hot
subset (cache-resident), the writer spreads puts across the full DB
which drives continuous compaction.

The new option follows the existing add_option.md checklist: it is
registered in ImmutableDBOptions for serialization, surfaced through
the C API, exposed in db_bench / db_stress / db_crashtest.py,
randomized in RandomInitDBOptions, validated against allow_mmap_reads
at Open time, and documented in unreleased_history. Java JNI is left
for a follow-up.

Benchmark results
=================

Setup: Ubuntu 24.04 (kernel 7.0.5 OrbStack Linux VM on Apple Silicon),
14 vCPUs, virtio-blk disk. MGLRU disabled (echo 0 >
/sys/kernel/mm/lru_gen/enabled). 14 GB DB (3.5M keys * 4 KB values),
no compression. Each measurement run pinned to a 1 GB cgroup
via `systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0`, so
DB-to-cache ratio is ~14x. Page cache dropped between configs.
Workload: readwhilewriting for 180 s, 4 reader threads on a hot
2,000-key subset (~8 MB, ~3% of cache) + 1 writer thread spreading
overwrites across the full 3.5M-key keyspace
(via `--bgwriter_num=3500000`), throttled at 100 MB/s. Compaction
ran at ~500 MB/s read/write during the buffered run, ~400 MB/s with
direct compaction.

Each run was 3 minutes long; "buffered" is the existing default.

| Config                                    | Throughput      | Read P50      | Read P99      | Read P99.9     | Read P99.99    |
|-------------------------------------------|-----------------|---------------|---------------|----------------|----------------|
| buffered (default)                        | 406 K ops/s     | 7.34 us       | 79.11 us      | 533.14 us      | 1647.79 us     |
| direct_compaction_read_write              | **464 K ops/s** | **6.37 us**   | **71.64 us**  | **468.28 us**  | **1363.91 us** |
|                                           | (+14%)          | (-13%)        | (-9%)         | (-12%)         | (-17%)         |
| direct_compaction_read_only               | 421 K ops/s     | 6.99 us       | 88.95 us      | 504.32 us      | 1456.75 us     |
|                                           | (+4%)           | (-5%)         | (+13%)        | (-5%)          | (-12%)         |
| use_direct_reads = true (existing global) | 442 K ops/s     | 7.37 us       | 50.82 us      | 472.23 us      | 1626.77 us     |
|                                           | (+9%)           | (0%)          | (-36%)        | (-11%)         | (-1%)          |

The recommended production configuration is
`use_direct_reads_for_compaction = true` together with
`use_direct_io_for_flush_and_compaction = true` ("direct reads + writes
for compaction"). It wins on every metric simultaneously: throughput
up 14%, every read percentile from P50 to P99.99 down 9 to 17%. The
existing global `use_direct_reads = true` flag does help P99
specifically but at a noticeable throughput cost and is no better at
P99.99; the new compaction-only path is strictly better for the
write-heavy workloads it is designed for.

Higher DB-to-cache ratios (the Cassandra blog at
https://lightfoot.dev/direct-i-o-for-cassandra-compaction-cutting-p99-read-latency-by-5x/
reports ~5x P99 improvement at a 43x ratio) should widen the gap
further; the 14x ratio used above is what fit in a single laptop's
disk budget.

Repro recipe
============

Setup:
  - Install OrbStack on macOS or use any Linux host
  - On macOS:  orb create -t ubuntu rocksdb-bench
  - Inside the Linux machine:
      apt-get install -y build-essential clang cmake git pkg-config \
        libgflags-dev libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev \
        libzstd-dev rsync
      cmake -DCMAKE_BUILD_TYPE=Release -DPORTABLE=1 -DWITH_GFLAGS=1 \
            -DWITH_TESTS=0 .. && make -j db_bench

Build the source DB (once, unrestricted memory):
  ./db_bench --benchmarks=fillrandom,compact,waitforcompaction,stats \
    --db=/path/to/source_db --num=3500000 --key_size=16 \
    --value_size=4096 --write_buffer_size=16777216 \
    --target_file_size_base=16777216 --max_background_jobs=4 \
    --compression_type=none --cache_size=4194304 \
    --max_bytes_for_level_base=67108864 --disable_wal=1 --sync=0

Per-config measurement (copy source_db -> scratch_db first, then
drop_caches, then run under cgroup):
  sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 \
    ./db_bench --use_existing_db=1 \
      --benchmarks=readwhilewriting,stats --db=/path/to/scratch_db \
      --threads=5 --duration=180 --statistics=true --histogram=1 \
      --num=2000 --bgwriter_num=3500000 \
      --key_size=16 --value_size=4096 \
      --write_buffer_size=16777216 --target_file_size_base=16777216 \
      --max_background_jobs=4 --compression_type=none \
      --cache_size=4194304 --open_files=200 \
      --skip_stats_update_on_db_open=true \
      --max_bytes_for_level_base=67108864 \
      --benchmark_write_rate_limit=104857600 \
      --rate_limiter_bytes_per_sec=0 \
      --use_direct_reads={true|false} \
      --use_direct_reads_for_compaction={true|false} \
      --use_direct_io_for_flush_and_compaction={true|false}

Disable MGLRU first so the kernel uses the classic active/inactive LRU:
  echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled
@zaidoon1 zaidoon1 force-pushed the use-direct-reads-for-compaction branch from 9a8d933 to c87bff7 Compare May 15, 2026 08:52
@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit c87bff7


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit c87bff7


Summary

Well-designed, opt-in feature with correct lifetime management and clean plumbing through the TableCache → MakeInputIterator → LevelIterator → CompactionJob stack. The bypass-cache approach is the right solution given RocksDB's architectural constraints. The PosixFileSystem override fix is essential and correctly implemented. No high-severity issues found after thorough multi-agent review and debate.

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Missing Java JNI Bindings — java/rocksjni/options.cc
  • Issue: The existing use_direct_reads option has full Java JNI support. The new use_direct_reads_for_compaction option does not include equivalent Java bindings.
  • Root cause: Incomplete API surface coverage across language bindings.
  • Suggested fix: Add JNI getter/setter following the exact pattern used by use_direct_reads. Can be done in a follow-up PR.
M2. bgwriter_num db_bench Addition Is Unrelated — tools/db_bench_tool.cc
  • Issue: The new --bgwriter_num flag is functionally unrelated to use_direct_reads_for_compaction. Bundling it makes the change harder to review and bisect.
  • Suggested fix: Extract bgwriter_num into a separate PR.
M3. RandomInitDBOptions May Create Invalid Configs — test_util/testutil.cc
  • Issue: RandomInitDBOptions sets use_direct_reads_for_compaction = rnd->Uniform(2) independently of allow_mmap_reads. If both are 1, ValidateOptions rejects the combination. The db_crashtest.py sanitization handles this correctly, but RandomInitDBOptions does not.
  • Suggested fix: Either add if (db_opt->allow_mmap_reads) db_opt->use_direct_reads_for_compaction = false; or rely on ValidateOptions rejection (acceptable since random tests should handle Status::NotSupported()).

🟢 LOW / NIT

L1. Redundant Option Combination Not Rejected — db/db_impl/db_impl_open.cc
  • Issue: use_direct_reads=true + use_direct_reads_for_compaction=true is allowed but the compaction flag is a no-op. Harmless but potentially confusing. The option doc already notes "has no effect if use_direct_reads is also true."
L2. BlobDB Intentionally Excluded — env/file_system.cc
  • Issue: OptimizeForBlobFileRead doesn't include the new flag. This is intentional and verified by tests. No action needed.

Cross-Component Analysis

Context Affected? Verified Safe?
Compaction (L0 + L1+) YES YES — ephemeral readers with proper cleanup
CompactionService NO YES — doesn't call MakeInputIterator
ReadOnly DB NO YES — no compaction
WritePreparedTxnDB YES (same path) YES — orthogonal
User reads NO YES — default param prevents bypass
BlobDB Excluded YES — intentional

Debated Findings (Resolved)

"Range tombstone destruction order bug"FALSE POSITIVE. FragmentedRangeTombstoneIterator holds shared_ptr<FragmentedRangeTombstoneList> which keeps tombstone data alive independently of the TableReader. Destruction order in CompactionMergingIterator is irrelevant.

"Block cache pollution negates feature"FALSE POSITIVE. The feature addresses OS page cache pollution (via O_DIRECT), not block cache. Compaction already sets fill_cache = false (compaction_job.cc:1465).

Positive Observations

  • Excellent deferred ownership transfer pattern for ephemeral readers
  • PosixFileSystem fix correctly chains base class call before Linux readahead clamping
  • Defensive third clause in bypass condition protects against custom FileSystems
  • Comprehensive test coverage (3 e2e tests + option mechanics + stress test integration)
  • Zero behavioral change when flag is off (default)

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant