Skip to content

Conversation

@nateab
Copy link
Contributor

@nateab nateab commented Feb 11, 2026

What is the purpose of the change

During full snapshot restore, RocksDBFullRestoreOperation.restoreKVStateData() did not check for null after looking up a ColumnFamilyHandle by kvStateId. If the checkpoint stream contained an unknown kvStateId, data would be written to a null handle, causing silent data corruption where ListState data could land in timer column families, leading to EOFException during deserialization.

This adds a null check that throws IllegalStateException with diagnostic info (kvStateId, registered state count, state names/types), mirroring the pattern already used in RocksDBHeapTimersFullRestoreOperation. Also improves the error message in the sibling class to include the same diagnostic detail.

Additionally enhances RocksDBCachingPriorityQueueSet.deserializeElement() with payload length validation and detailed error messages (byte lengths, hex dump, corruption hint) to make deserialization failures easier to diagnose.

Brief change log

  • Added null check on ColumnFamilyHandle in RocksDBFullRestoreOperation.restoreKVStateData() that throws IllegalStateException with diagnostic context
  • Improved the existing error message in RocksDBHeapTimersFullRestoreOperation.restoreKVStateData() to include the same diagnostic detail (state id, count, registered state names/types)
  • Added payload length validation in RocksDBCachingPriorityQueueSet.deserializeElement() before deserialization
  • Enhanced deserialization error messages with byte lengths, hex dump (capped at 128 bytes), and a hint about possible cross-state corruption during restore

Verifying this change

This change added tests and can be verified as follows:

  • Added RocksDBFullRestoreOperationTest.testFullSnapshotRestorePreservesStateIsolation -- creates a backend with both ValueState and timer priority queue, takes a full snapshot, restores, and verifies each state's data is correctly isolated in its own column family
  • Added RocksDBFullRestoreOperationTest.testSnapshotRestoreSnapshotRoundTrip -- double snapshot-restore round-trip to catch latent corruption that may not surface until the second checkpoint cycle
  • Verified all 8 existing RocksDBRecoveryTest tests pass with no regressions

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes (checkpoint restore path)
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

…storeOperation

During full snapshot restore, RocksDBFullRestoreOperation.restoreKVStateData()
did not check for null after looking up a ColumnFamilyHandle by kvStateId.
If the checkpoint stream contained an unknown kvStateId, data would be written
to a null handle, causing silent data corruption where ListState data could
land in timer column families, leading to EOFException during deserialization.

This adds a null check that throws IllegalStateException with diagnostic info
(kvStateId, registered state count, state names/types), mirroring the pattern
already used in RocksDBHeapTimersFullRestoreOperation.

Also enhances RocksDBCachingPriorityQueueSet.deserializeElement() with:
- Payload length validation before deserialization
- Detailed error messages including byte lengths, hex dump, and a hint
  about possible cross-state corruption during restore
@nateab nateab changed the title [FLINK-23886][runtime] Fix null column family handle in RocksDBFullRe… [FLINK-23886][runtime] Fix null column family handle in RocksDBFullRestoreOperation Feb 11, 2026
@flinkbot
Copy link
Collaborator

flinkbot commented Feb 11, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants