[SPARK-51988][SS] Do file checksum verification on read for RocksDB zip file #54493

Open
gnanda wants to merge 3 commits into apache:master from gnanda:stack/SPARK-51988


@gnanda gnanda commented Feb 25, 2026

What changes were proposed in this pull request?

When the RocksDB state store downloads a checkpoint zip file from DFS, it previously opened the file via fs.open() (a raw Hadoop FileSystem handle), bypassing the CheckpointFileManager abstraction. This meant that even when file checksum
verification was enabled (spark.sql.streaming.stateStore.rocksdb.fileChecksumEnabled), the zip file itself was never verified against its .crc sidecar on read.

This PR fixes that by:

  1. Extracting a new Utils.unzipFilesFromInputStream(inputStream, localDir) helper that accepts any InputStream, and making the existing unzipFilesFromFile delegate to it.
  2. Changing RocksDBFileManager to open checkpoint zip files via fm.open(path) instead of fs.open(path). When checksum verification is enabled, fm is a ChecksumCheckpointFileManager, whose open() returns a stream that verifies the file's size
    and CRC32C checksum on close(). If the checksum does not match, a CHECKPOINT_FILE_CHECKSUM_VERIFICATION_FAILED error is raised.
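
The two changes above can be illustrated with a minimal, self-contained Java sketch. It is not the Spark implementation: `ChecksumOnCloseSketch`, `VerifyingInputStream`, `unzipFromInputStream`, `zipOf`, and `crc32c` are hypothetical names, and the real `ChecksumCheckpointFileManager` and `Utils.unzipFilesFromInputStream` may differ in detail. The sketch assumes the wrapper drains any unread bytes before comparing, since a zip reader typically stops at the central directory rather than at end of file.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.CRC32C;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ChecksumOnCloseSketch {

  /** Hypothetical stand-in for a stream that verifies size and CRC32C on close(). */
  public static final class VerifyingInputStream extends FilterInputStream {
    private final CRC32C crc = new CRC32C();
    private final long expectedCrc;
    private final long expectedSize;
    private long bytesRead = 0;
    private boolean closed = false;

    public VerifyingInputStream(InputStream in, long expectedCrc, long expectedSize) {
      super(in);
      this.expectedCrc = expectedCrc;
      this.expectedSize = expectedSize;
    }

    @Override
    public int read() throws IOException {
      int b = super.read();
      if (b >= 0) { crc.update(b); bytesRead++; }
      return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      int n = super.read(buf, off, len);
      if (n > 0) { crc.update(buf, off, n); bytesRead += n; }
      return n;
    }

    @Override
    public long skip(long n) throws IOException {
      // Route skips through read() so skipped bytes are still checksummed.
      byte[] tmp = new byte[4096];
      long skipped = 0;
      while (skipped < n) {
        int r = read(tmp, 0, (int) Math.min(tmp.length, n - skipped));
        if (r < 0) break;
        skipped += r;
      }
      return skipped;
    }

    @Override
    public boolean markSupported() { return false; } // rewinding would double-count bytes

    @Override
    public void close() throws IOException {
      if (closed) return;
      closed = true;
      try {
        // Assumption: drain unread bytes so the checksum covers the whole file.
        byte[] tmp = new byte[4096];
        while (read(tmp, 0, tmp.length) >= 0) { /* discard */ }
      } finally {
        super.close();
      }
      if (bytesRead != expectedSize || crc.getValue() != expectedCrc) {
        throw new IOException("checksum verification failed (sketch)");
      }
    }
  }

  /** Extracts every entry of a zip read from any InputStream into localDir. */
  public static void unzipFromInputStream(InputStream in, Path localDir) throws IOException {
    Path root = localDir.toAbsolutePath().normalize();
    Files.createDirectories(root);
    // Closing the ZipInputStream closes `in`, which is where verification runs.
    try (ZipInputStream zin = new ZipInputStream(in)) {
      ZipEntry entry;
      while ((entry = zin.getNextEntry()) != null) {
        Path target = root.resolve(entry.getName()).normalize();
        if (!target.startsWith(root)) throw new IOException("zip entry escapes target dir");
        if (entry.isDirectory()) {
          Files.createDirectories(target);
        } else {
          Files.createDirectories(target.getParent());
          Files.copy(zin, target, StandardCopyOption.REPLACE_EXISTING);
        }
      }
    }
  }

  /** Test helper: builds a one-entry zip in memory. */
  public static byte[] zipOf(String name, String content) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ZipOutputStream zos = new ZipOutputStream(bos)) {
      zos.putNextEntry(new ZipEntry(name));
      zos.write(content.getBytes(StandardCharsets.UTF_8));
      zos.closeEntry();
    }
    return bos.toByteArray();
  }

  public static long crc32c(byte[] data) {
    CRC32C c = new CRC32C();
    c.update(data, 0, data.length);
    return c.getValue();
  }
}
```

Because verification in this sketch runs in `close()`, the entries are already extracted by the time a mismatch is reported, so a caller would need to treat the target directory as invalid when the exception is thrown.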

Backward compatibility is preserved: if no .crc sidecar exists (e.g. the checkpoint was written before this feature was enabled), the file is opened and read without verification.
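
The fallback described above can be sketched as follows. This is an illustration only: the sidecar name (`<file>.crc`) and its contents (a decimal CRC32C value and a size separated by a space) are assumptions made for the example, not the format Spark writes, and `SidecarFallbackSketch` with its methods is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32C;

public class SidecarFallbackSketch {

  public static long crc32c(byte[] data) {
    CRC32C c = new CRC32C();
    c.update(data, 0, data.length);
    return c.getValue();
  }

  /** Assumed sidecar location: "<file name>.crc" next to the file. */
  public static Path sidecarFor(Path file) {
    return file.resolveSibling(file.getFileName() + ".crc");
  }

  /** Writes a sidecar in the assumed "<crc32c> <size>" text format. */
  public static void writeSidecar(Path file) throws IOException {
    byte[] data = Files.readAllBytes(file);
    Files.writeString(sidecarFor(file), crc32c(data) + " " + data.length);
  }

  /** Reads the file, verifying it only when a sidecar is present. */
  public static byte[] readWithOptionalVerification(Path file) throws IOException {
    byte[] data = Files.readAllBytes(file);
    Path sidecar = sidecarFor(file);
    if (!Files.exists(sidecar)) {
      // Legacy file written before checksums were enabled: read without verification.
      return data;
    }
    String[] parts = Files.readString(sidecar).trim().split(" ");
    long expectedCrc = Long.parseLong(parts[0]);
    long expectedSize = Long.parseLong(parts[1]);
    if (data.length != expectedSize || crc32c(data) != expectedCrc) {
      throw new IOException("checksum verification failed for " + file + " (sketch)");
    }
    return data;
  }
}
```

The design point is that a missing sidecar is treated as "nothing to verify" rather than as an error, which keeps older checkpoints readable, while a present but mismatching sidecar fails loudly.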

Why are the changes needed?

Silent data corruption in checkpoint zip files would previously go undetected. A corrupted or partially-written zip could cause a state store to load incorrect data, leading to wrong query results or obscure failures. Verifying the checksum on
read closes this gap and makes corruption visible immediately at load time, with a clear error condition.

Does this PR introduce any user-facing change?

Yes. When spark.sql.streaming.stateStore.rocksdb.fileChecksumEnabled is true (the default), loading a RocksDB checkpoint zip file whose .crc sidecar does not match will now throw a SparkException with error condition
CHECKPOINT_FILE_CHECKSUM_VERIFICATION_FAILED instead of silently succeeding with corrupt data.

How was this patch tested?

  • RocksDBSuite: Three new unit tests covering (1) successful load with checksum verification, (2) graceful fallback when no .crc sidecar exists, and (3) detection of a corrupted .crc sidecar raising
    CHECKPOINT_FILE_CHECKSUM_VERIFICATION_FAILED.
  • UtilsSuite: New unit test for unzipFilesFromInputStream verifying correct extraction from an in-memory zip.
  • RocksDBCheckpointFailureInjectionSuite: Updated zombie-write test to split the checkpointIds enabled/disabled branches; the disabled branch now explicitly opts out of checksum verification to isolate the intended failure-injection behavior
    from the new checksum enforcement.
  • StateStoreSuite: Fixed the checksum file verification test to write two independent runs (checksums-on then checksums-off) rather than overwriting the same versions, keeping existing .crc sidecars valid across the full test.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code 2.1.58

@gnanda gnanda changed the title [SPARK-51988][Structured Streaming] Do file checksum verification on read for RocksDB zip file [SPARK-51988][SS] Do file checksum verification on read for RocksDB zip file Feb 25, 2026
Contributor

@ericm-db ericm-db left a comment

LGTM.

@gnanda gnanda force-pushed the stack/SPARK-51988 branch 4 times, most recently from e3435b8 to 747a426 Compare March 2, 2026 17:45
@gnanda gnanda force-pushed the stack/SPARK-51988 branch 3 times, most recently from b9e71ad to 9492b26 Compare March 20, 2026 00:10
@gnanda gnanda requested a review from micheal-o March 20, 2026 00:12
@gnanda gnanda force-pushed the stack/SPARK-51988 branch 2 times, most recently from 4b202f5 to f92fb2b Compare March 20, 2026 00:20
@gnanda gnanda force-pushed the stack/SPARK-51988 branch from f92fb2b to cb0cd95 Compare March 21, 2026 05:17