[SPARK-54443][SS] Integrate PartitionKeyExtractor in Re-partition reader #53459
Conversation
```scala
    val stateVarInfo = stateVarInfoList.head
    transformWithStateVariableInfoOpt = Some(stateVarInfo)
  }
var stateVarInfoList = operatorProperties.stateVariables
```
This is the same as the previous version except for indentation. We can now assign transformWithStateVariableInfoOpt because stateVarName will always be a "valid" value after the line 323 change.
```scala
    storeMetadata: Array[StateMetadataTableEntry]): Option[Int] = {
  if (storeMetadata.nonEmpty &&
    storeMetadata.head.operatorName == StatefulOperatorsUtils.SYMMETRIC_HASH_JOIN_EXEC_OP_NAME) {
    Some(session.conf.get(SQLConf.STREAMING_JOIN_STATE_FORMAT_VERSION))
```
We should read this from the current batch's offset seq conf instead; buildStateStoreConf does something similar. The session here doesn't include the confs written to the checkpoint, so it can return a wrong value.
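A rough sketch of the suggested direction: prefer the conf value captured in the checkpoint's offset log over the live session default. The helper name and the shape of the metadata map here are illustrative, not the actual Spark internals:

```scala
// Sketch only: `offsetSeqMetadataConf` stands in for the conf map
// recovered from the checkpoint's offset log for the batch being read.
def joinStateFormatVersion(
    offsetSeqMetadataConf: Map[String, String],
    sessionDefault: Int): Int = {
  offsetSeqMetadataConf
    .get(SQLConf.STREAMING_JOIN_STATE_FORMAT_VERSION.key)
    .map(_.toInt)
    // Fall back to the session conf only if the checkpoint has no entry.
    .getOrElse(sessionDefault)
}
```

The point of the fallback order is that the value the running query actually used was pinned in the checkpoint, so the checkpointed entry must win when both are present.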
```scala
}
var stateVarInfoList = operatorProperties.stateVariables
  .filter(stateVar => stateVar.stateName == stateVarName)
if (stateVarInfoList.isEmpty &&
```
We don't need this anymore, right? Since it won't be empty.
It will be empty when it's a non-timer internal column. Updated the logic in the new version to make it more explicit
reduce duplicate code; all tests pass
micheal-o
left a comment
Stamped but please address the comments. Thanks
```scala
// infos instead of validating a specific stateVarName. This skips the normal validation
// logic because we're not reading a specific state variable - we're reading all of them.
if (sourceOptions.internalOnlyReadAllColumnFamilies) {
} else if (sourceOptions.internalOnlyReadAllColumnFamilies) {
```
why not a separate if condition? Right now you are doing:
if {}
else if {}
Hmm, either will work, since we don't allow setting both readRegisteredTimers and internalOnlyReadAllColumnFamilies at the same time. I can change it in the next version.
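For reference, the two shapes under discussion, sketched with shortened condition names (they behave identically only because options validation rejects setting both flags at once):

```scala
// Shape 1: chained — at most one branch can run, by construction.
if (readRegisteredTimers) {
  // timer read path
} else if (internalOnlyReadAllColumnFamilies) {
  // all-column-families read path
}

// Shape 2: independent ifs — correctness relies on the earlier
// validation guaranteeing the two flags are mutually exclusive.
if (readRegisteredTimers) {
  // timer read path
}
if (internalOnlyReadAllColumnFamilies) {
  // all-column-families read path
}
```

Shape 1 encodes the mutual exclusion locally, so a reader doesn't need to know about the validation elsewhere; that is the usual argument for preferring it.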
```scala
var stateVarInfoList = operatorProperties.stateVariables
  .filter(stateVar => stateVar.stateName == stateVarName)
if (!TimerStateUtils.isTimerCFName(stateVarName) &&
```
Are you sure this doesn't apply to timer CFs?
Yeah, when readRegisteredTimers is set, stateVarName is set to the timer column. From lines 333-334 above, we would have gotten the correct stateVarInfoList, thus won't need to assign it a dummy one like below.

```scala
var stateVarInfoList = operatorProperties.stateVariables
  .filter(stateVar => stateVar.stateName == stateVarName)
```
```scala
val stateFormatVersion = getStateFormatVersion(storeMetadata, sourceOptions.resolvedCpLocation)
val allColFamilyReaderInfoOpt: Option[AllColumnFamiliesReaderInfo] =
  if (sourceOptions.internalOnlyReadAllColumnFamilies) {
    Option(AllColumnFamiliesReaderInfo(
```
nit: Some
Curious: when do we prefer one over the other? I only know from the style guide that we use Option to guard against null, but I don't think that applies here.
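For context on the nit: `Option.apply` null-checks its argument, while the `Some` constructor wraps unconditionally. When the value is statically known to be non-null (as with a freshly constructed case class instance), the two are equivalent and `Some` states that intent directly:

```scala
val maybeNull: String = null
assert(Option(maybeNull) == None)      // Option.apply guards against null
assert(Some(maybeNull) == Some(null))  // Some wraps whatever it is given

val known = "value"
assert(Option(known) == Some("value")) // equivalent when non-null
```

So `Option(...)` around a constructor call suggests the result might be null when it cannot be, which is why reviewers usually ask for `Some` there.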
anishshri-db
left a comment
lgtm pending couple nits
### What changes were proposed in this pull request?

Integrate the PartitionKeyExtractor introduced in [this PR](https://github.com/apache/spark/pull/53355/files) into StatePartitionAllColumnFamiliesReader. Before this change, StatePartitionAllColumnFamiliesReader returned the entire key value in the partition_key field, and SchemaUtil returned `keySchema` as the partitionKey schema. After this change, StatePartitionAllColumnFamiliesReader returns the actual partition key, and SchemaUtil returns the actual partitionKey schema.

### Why are the changes needed?

When creating a StatePartitionAllColumnFamiliesReader, we need to pass along the stateFormatVersion and operator name. In SchemaUtil, we create a `getExtractor` helper function. It is used when getSourceSchema is called (for the default column family), as well as when StatePartitionAllColumnFamiliesReader is initialized, since the reader uses the extractor to get the partitionKey for different column families in `iterator`. We also added checks of the partitionKey in the reader suite.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

See the updated StatePartitionAllColumnFamiliesSuite. We have hard-coded functions that extract the partition key for different column families from normalDF, then compare the extracted partition key against the partition key read from bytesDF.

### Was this patch authored or co-authored using generative AI tooling?

Yes. claude-4.5-opus

Closes apache#53459 from zifeif2/integrate-key-extraction.

Authored-by: zifeif2 <zifeifeng11@gmail.com>
Signed-off-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
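To make the extractor idea concrete, here is a minimal sketch of per-column-family partition-key extraction. All names below are illustrative and do not match the real Spark internals; the real `getExtractor` in SchemaUtil dispatches on operator name and state format version:

```scala
// Illustrative sketch only — not the actual Spark API.
trait PartitionKeyExtractor {
  def extract(fullKeyBytes: Array[Byte]): Array[Byte]
}

// Default column family: the stored key already is the partition key.
object IdentityExtractor extends PartitionKeyExtractor {
  def extract(fullKeyBytes: Array[Byte]): Array[Byte] = fullKeyBytes
}

// Composite-key column families (e.g. grouping key plus a secondary
// index component): keep only the leading prefix that encodes the
// grouping key, so rows re-shuffle to the correct partition.
class PrefixExtractor(prefixLen: Int) extends PartitionKeyExtractor {
  def extract(fullKeyBytes: Array[Byte]): Array[Byte] =
    fullKeyBytes.take(prefixLen)
}
```

Under this framing, the reader picks an extractor per column family at initialization and applies it to each key in `iterator`, which is why the PR threads the operator name and stateFormatVersion through to the reader.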