[SPARK-54443][SS] Integrate PartitionKeyExtractor in Re-partition reader #53459
Conversation
```scala
    val stateVarInfo = stateVarInfoList.head
    transformWithStateVariableInfoOpt = Some(stateVarInfo)
  }
var stateVarInfoList = operatorProperties.stateVariables
```
This is the same as the previous version except for indentation. We can now assign transformWithStateVariableInfoOpt because stateVarName will always be a "valid" value after the line 323 change.
```scala
    storeMetadata: Array[StateMetadataTableEntry]): Option[Int] = {
  if (storeMetadata.nonEmpty &&
    storeMetadata.head.operatorName == StatefulOperatorsUtils.SYMMETRIC_HASH_JOIN_EXEC_OP_NAME) {
    Some(session.conf.get(SQLConf.STREAMING_JOIN_STATE_FORMAT_VERSION))
```
We should read this from the current batch's offset seq conf instead; buildStateStoreConf does something similar. The session here doesn't include the confs written to the checkpoint, so it can return a wrong value.
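A rough sketch of the suggested direction: prefer the conf value captured in the checkpoint's offset log over the live session default. The helper name and the shape of the metadata map here are illustrative, not the actual Spark internals:

```scala
// Sketch only: `offsetSeqMetadataConf` stands in for the conf map
// recovered from the checkpoint's offset log for the batch being read.
def joinStateFormatVersion(
    offsetSeqMetadataConf: Map[String, String],
    sessionDefault: Int): Int = {
  offsetSeqMetadataConf
    .get(SQLConf.STREAMING_JOIN_STATE_FORMAT_VERSION.key)
    .map(_.toInt)
    // Fall back to the session conf only if the checkpoint has no entry.
    .getOrElse(sessionDefault)
}
```

The point of the fallback order is that the value the running query actually used was pinned in the checkpoint, so the checkpointed entry must win when both are present.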
```scala
}
var stateVarInfoList = operatorProperties.stateVariables
  .filter(stateVar => stateVar.stateName == stateVarName)
if (stateVarInfoList.isEmpty &&
```
We don't need this anymore, right? Since it won't be empty.
It will be empty when it's a non-timer internal column. Updated the logic in the new version to make it more explicit
reduce duplicate code; all tests pass
micheal-o
left a comment
Stamped but please address the comments. Thanks
```scala
// infos instead of validating a specific stateVarName. This skips the normal validation
// logic because we're not reading a specific state variable - we're reading all of them.
if (sourceOptions.internalOnlyReadAllColumnFamilies) {
} else if (sourceOptions.internalOnlyReadAllColumnFamilies) {
```
why not a separate if condition? Right now you are doing:
if {}
else if {}
Hmm, either will work, since we don't allow setting both readRegisteredTimers and internalOnlyReadAllColumnFamilies at the same time. I can change it in the next version.
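For reference, the two shapes under discussion, sketched with shortened condition names (they behave identically only because options validation rejects setting both flags at once):

```scala
// Shape 1: chained — at most one branch can run, by construction.
if (readRegisteredTimers) {
  // timer read path
} else if (internalOnlyReadAllColumnFamilies) {
  // all-column-families read path
}

// Shape 2: independent ifs — correctness relies on the earlier
// validation guaranteeing the two flags are mutually exclusive.
if (readRegisteredTimers) {
  // timer read path
}
if (internalOnlyReadAllColumnFamilies) {
  // all-column-families read path
}
```

Shape 1 encodes the mutual exclusion locally, so a reader doesn't need to know about the validation elsewhere; that is the usual argument for preferring it.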
```scala
var stateVarInfoList = operatorProperties.stateVariables
  .filter(stateVar => stateVar.stateName == stateVarName)
if (!TimerStateUtils.isTimerCFName(stateVarName) &&
```
Are you sure this doesn't apply to timer CFs?
Yeah, when readRegisteredTimers is set, stateVarName is set to the timer column. From lines 333-334 above, we would have gotten the correct stateVarInfoList, thus won't need to assign it a dummy one like below.

```scala
var stateVarInfoList = operatorProperties.stateVariables
  .filter(stateVar => stateVar.stateName == stateVarName)
```
```scala
val stateFormatVersion = getStateFormatVersion(storeMetadata, sourceOptions.resolvedCpLocation)
val allColFamilyReaderInfoOpt: Option[AllColumnFamiliesReaderInfo] =
  if (sourceOptions.internalOnlyReadAllColumnFamilies) {
    Option(AllColumnFamiliesReaderInfo(
```
nit: Some
Curious: when do we prefer one over the other? I only know from the style guide that we use Option to guard against null, but I don't think that applies here.
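For context on the nit: `Option.apply` null-checks its argument, while the `Some` constructor wraps unconditionally. When the value is statically known to be non-null (as with a freshly constructed case class instance), the two are equivalent and `Some` states that intent directly:

```scala
val maybeNull: String = null
assert(Option(maybeNull) == None)      // Option.apply guards against null
assert(Some(maybeNull) == Some(null))  // Some wraps whatever it is given

val known = "value"
assert(Option(known) == Some("value")) // equivalent when non-null
```

So `Option(...)` around a constructor call suggests the result might be null when it cannot be, which is why reviewers usually ask for `Some` there.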
anishshri-db
left a comment
lgtm pending couple nits
### What changes were proposed in this pull request?

Integrate the PartitionKeyExtractor introduced in [this PR](https://github.com/apache/spark/pull/53355/files) into StatePartitionAllColumnFamiliesReader. Before this change, StatePartitionAllColumnFamiliesReader returned the entire key value in the partition_key field, and SchemaUtil returned `keySchema` as the partitionKey schema. After this change, StatePartitionAllColumnFamiliesReader returns the actual partition key, and SchemaUtil returns the actual partitionKey schema.

### Why are the changes needed?

When creating a StatePartitionAllColumnFamiliesReader, we need to pass along the stateFormatVersion and operator name. In SchemaUtil, we create a `getExtractor` helper function. It is used when getSourceSchema is called (for the default column family), as well as when StatePartitionAllColumnFamiliesReader is initialized, since the reader uses the extractor to get the partitionKey for different column families in `iterator`. We also added checks of the partitionKey in the reader suite.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

See the updated StatePartitionAllColumnFamiliesSuite. We have hard-coded functions that extract the partition key for different column families from normalDF, then compare the extracted partition key against the partition key read from bytesDF.

### Was this patch authored or co-authored using generative AI tooling?

Yes. claude-4.5-opus

Closes apache#53459 from zifeif2/integrate-key-extraction.

Authored-by: zifeif2 <zifeifeng11@gmail.com>
Signed-off-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
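To make the extractor idea concrete, here is a minimal sketch of per-column-family partition-key extraction. All names below are illustrative and do not match the real Spark internals; the real `getExtractor` in SchemaUtil dispatches on operator name and state format version:

```scala
// Illustrative sketch only — not the actual Spark API.
trait PartitionKeyExtractor {
  def extract(fullKeyBytes: Array[Byte]): Array[Byte]
}

// Default column family: the stored key already is the partition key.
object IdentityExtractor extends PartitionKeyExtractor {
  def extract(fullKeyBytes: Array[Byte]): Array[Byte] = fullKeyBytes
}

// Composite-key column families (e.g. grouping key plus a secondary
// index component): keep only the leading prefix that encodes the
// grouping key, so rows re-shuffle to the correct partition.
class PrefixExtractor(prefixLen: Int) extends PartitionKeyExtractor {
  def extract(fullKeyBytes: Array[Byte]): Array[Byte] =
    fullKeyBytes.take(prefixLen)
}
```

Under this framing, the reader picks an extractor per column family at initialization and applies it to each key in `iterator`, which is why the PR threads the operator name and stateFormatVersion through to the reader.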