Cassandra 21134 - Direct I/O Background Writes #4814

Closed
samueldlightfoot wants to merge 18 commits into apache:trunk from samueldlightfoot:CASSANDRA-21134-direct-compaction-writes

samueldlightfoot commented May 17, 2026

CASSANDRA-21134: Direct I/O for background SSTable writes

Summary

Adds an opt-in O_DIRECT write path for background SSTable producers, bypassing the OS
page cache for data that is unlikely to be re-read soon after being written. Memtable
flush data stays buffered (it is hot and benefits from the page cache).

Enabled via a new YAML knob:

background_write_disk_access_mode: direct    # default: standard
direct_write_buffer_size: 256KiB              # aligned up to FS block size; auto-grows to chunk_length
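
The sizing rule in the comment above (align up to the FS block size, grow to the chunk length) can be sketched as plain arithmetic. This is a minimal illustration, not the code in DatabaseDescriptor: the class and method names are hypothetical, the block size is assumed to be a power of two, and the order of operations (grow first, then align) is an assumption.

```java
/** Hypothetical helper illustrating the buffer-size rule; not the patch's actual code. */
final class DirectWriteBufferSizing
{
    /** Rounds size up to the next multiple of blockSize (assumed to be a power of two). */
    static int alignUp(int size, int blockSize)
    {
        if (blockSize <= 0 || Integer.bitCount(blockSize) != 1)
            throw new IllegalArgumentException("block size must be a positive power of two: " + blockSize);
        // -(long) blockSize is a mask that clears the low log2(blockSize) bits
        return Math.toIntExact(((long) size + blockSize - 1) & -(long) blockSize);
    }

    /** Assumed order: grow to at least one compression chunk, then align to the block size. */
    static int effectiveBufferSize(int configuredSize, int chunkLength, int fsBlockSize)
    {
        int grown = Math.max(configuredSize, chunkLength);
        return alignUp(grown, fsBlockSize);
    }
}
```

Under these assumptions, a configured 256KiB buffer with a 1MiB chunk length and a 4KiB block size yields a 1MiB buffer, while the default 256KiB is left unchanged for chunk lengths at or below 256KiB.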

The path is gated by:

  1. config (background_write_disk_access_mode == direct),
  2. table compression being enabled (required — uncompressed writers still use the buffered
    path), and
  3. an OperationType-keyed allowlist (DataComponent#DIRECT_WRITE_SUPPORT).

Selection happens centrally in DataComponent.buildWriter; producers are unchanged.
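
As a rough sketch of how the three gates combine, the snippet below models the decision as a single predicate. The enum and method names are hypothetical stand-ins chosen for illustration; the real selection lives in DataComponent.buildWriter and may be structured differently.

```java
/** Hypothetical model of the three-gate decision; names are illustrative only. */
final class DirectWriteGate
{
    enum DiskAccessMode { STANDARD, DIRECT }

    enum Support { SUPPORTED, UNSUPPORTED_CORRECTNESS, UNSUPPORTED_POLICY, NOT_APPLICABLE }

    /** Returns true only when config, compression, and the per-operation allowlist all agree. */
    static boolean useDirectWriter(DiskAccessMode mode, boolean compressionEnabled, Support support)
    {
        return mode == DiskAccessMode.DIRECT   // gate 1: config knob
            && compressionEnabled              // gate 2: table compression enabled
            && support == Support.SUPPORTED;   // gate 3: operation is on the allowlist
    }
}
```

Any gate failing falls back to the buffered writer, which is why producers need no changes.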

Operations covered (DIO eligible)

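| OperationType | Rationale |
|---|---|
| COMPACTION | append-only writer |
| MAJOR_COMPACTION | append-only writer |
| TOMBSTONE_COMPACTION | append-only writer |
| ANTICOMPACTION | append-only writer |
| GARBAGE_COLLECT | append-only writer |
| CLEANUP | append-only writer |
| UPGRADE_SSTABLES | append-only writer |
| WRITE | append-only writer |
| STREAM | chunked-receiver path (see ZCS exclusion) |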

The allowlist is exhaustive: any new OperationType with writesData == true that is not
classified will fail static initialization (AssertionError).
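
One way such an exhaustiveness check could look is sketched below. The Op enum is a small hypothetical stand-in for OperationType, and the map contents are illustrative; only the idea (every data-writing operation must be classified, or class initialization throws AssertionError) is taken from the description above.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch of an exhaustive, enum-keyed allowlist. */
final class AllowlistCheck
{
    enum Support { SUPPORTED, UNSUPPORTED_CORRECTNESS, UNSUPPORTED_POLICY }

    /** Stand-in for OperationType; the writesData flag mirrors the description. */
    enum Op
    {
        COMPACTION(true), FLUSH(true), SCRUB(true), VERIFY(false);

        final boolean writesData;

        Op(boolean writesData) { this.writesData = writesData; }
    }

    /** Throws AssertionError if any data-writing operation has no classification. */
    static void requireExhaustive(Map<Op, Support> allowlist)
    {
        for (Op op : Op.values())
            if (op.writesData && !allowlist.containsKey(op))
                throw new AssertionError("Unclassified data-writing operation: " + op);
    }

    static final Map<Op, Support> DIRECT_WRITE_SUPPORT = new EnumMap<>(Op.class);

    static
    {
        DIRECT_WRITE_SUPPORT.put(Op.COMPACTION, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(Op.FLUSH, Support.UNSUPPORTED_POLICY);
        DIRECT_WRITE_SUPPORT.put(Op.SCRUB, Support.UNSUPPORTED_CORRECTNESS);
        // Runs once at class load; a newly added writesData operation fails here.
        requireExhaustive(DIRECT_WRITE_SUPPORT);
    }
}
```

Running the check in a static initializer means a forgotten classification surfaces the first time the class is loaded rather than when the new operation first writes data.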

Operations NOT covered

| Path | Classification | Reason |
|---|---|---|
| FLUSH (memtable flush) | UNSUPPORTED_POLICY | Just-flushed data is hot, so keep it in the page cache. Memtable flushes always use buffered I/O. |
| SCRUB | UNSUPPORTED_CORRECTNESS | tryAppend needs mark() / resetAndTruncate(), which the DIO writer cannot satisfy. |
| Zero-Copy Streaming (ZCS) | n/a (path bypass) | Entire-SSTable streaming does not go through DataComponent.buildWriter; the DIO gate never runs. |
| Uncompressed writers | n/a (path bypass) | Only CompressedSequentialWriter has a DIO subclass in this change. |

Removing an UNSUPPORTED_CORRECTNESS entry requires code changes; removing an
UNSUPPORTED_POLICY entry is a configuration / policy decision.

Key code

  • io/DirectIoSupport.java — eligibility enum (SUPPORTED / UNSUPPORTED_CORRECTNESS /
    UNSUPPORTED_POLICY / NOT_APPLICABLE).
  • io/sstable/format/DataComponent.java — central selection + allowlist + exhaustiveness
    check; first activation per op is logged.
  • io/compress/DirectCompressedSequentialWriter.java — new writer; aligned buffers, no
    mark()/resetAndTruncate().
  • io/compress/CompressedSequentialWriter.java — refactored to allow the DIO subclass to
    override the write chunk path; writeChunk contract documented and asserted.
  • config/Config.java, config/DatabaseDescriptor.java — new knobs, validation, and
    startup wiring; buffer size aligned to FS block size and auto-grown to chunk length.
  • service/StartupChecks.java — fails fast if direct is requested on a platform/FS that
    does not support O_DIRECT.
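
For readers unfamiliar with the alignment requirement behind "aligned buffers": O_DIRECT transfers generally need the buffer address and length to be multiples of the block size. The sketch below shows one standard-JDK way to obtain such a buffer using ByteBuffer.alignedSlice (Java 9+). It is an illustration only; the class name is hypothetical, and the patch's writer may allocate its buffers differently.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper; shows block-aligned direct buffer allocation with the standard JDK. */
final class AlignedBuffers
{
    /**
     * Returns a direct buffer of exactly {@code size} bytes whose start address is a
     * multiple of {@code blockSize}. Both arguments are validated up front.
     */
    static ByteBuffer allocateAligned(int size, int blockSize)
    {
        if (blockSize <= 0 || Integer.bitCount(blockSize) != 1)
            throw new IllegalArgumentException("block size must be a positive power of two: " + blockSize);
        if (size <= 0 || size % blockSize != 0)
            throw new IllegalArgumentException("size must be a positive multiple of the block size: " + size);

        // Over-allocate by blockSize - 1 so an aligned window of 'size' bytes always fits.
        ByteBuffer raw = ByteBuffer.allocateDirect(size + blockSize - 1);
        // alignedSlice rounds the start up and the end down to blockSize boundaries.
        return raw.alignedSlice(blockSize);
    }
}
```

A buffer obtained this way could then be passed to a FileChannel opened with com.sun.nio.file.ExtendedOpenOption.DIRECT (JDK 10+), on filesystems that support it.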

Tests introduced

  • DirectCompressedSequentialWriterTest (unit, 818 lines) — covers the DIO writer in
    isolation: chunk-boundary alignment, buffer auto-expansion to chunk length, abort/close
    paths, checksum + compression-info component correctness, error handling.
  • DataComponentDirectWriteSelectionTest (unit) — verifies the selection matrix:
    per-OperationType eligibility, exhaustiveness assertion, compression-enabled gate,
    config-mode gate.
  • StreamingDirectWriteTest (in-JVM distributed) — proves chunked streaming
    (CassandraStreamReader / CassandraCompressedStreamReader →
    BigTableWriter.openDataWriter → OperationType.STREAM) selects the DIO writer when
    enabled; ZCS is disabled in the test since it bypasses the selection point.
  • DirectIoTestUtils — shared helpers (FS block size, alignment) for the suites above.
  • AntiCompactionTest, CompactionsTest — extended to exercise the DIO path end-to-end
    for the compaction operations in the allowlist.
  • DatabaseDescriptorTest — validation of the new knobs (mode parsing, buffer-size
    alignment, defaults).

Not in scope

  • Direct I/O on the read path.
  • Uncompressed SSTable writers.
  • ZCS streaming.
  • Memtable flush.

patch by Sam Lightfoot; reviewed by for CASSANDRA-21134

The Cassandra Jira

…act, harden abort tests

- Revert compressed and maxCompressedLength to private in CompressedSequentialWriter
- Document that writeChunk post-call buffer position is unspecified
- Fix test Javadoc class name typo
- Use explicit finishOnClose(false) in abort tests

Add abort/cleanup path tests for DirectCompressedSequentialWriter to guard against native memory leaks
Reusable chunk CRC32
…21134-direct-compaction-writes

# Conflicts:
#	src/java/org/apache/cassandra/io/compress/CompressedSequentialWriter.java