[SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output #37886

huaxingao · 2022-09-14T20:25:49Z

What changes were proposed in this pull request?

Only set KeyGroupedPartitioning when the referenced column is in the output

Why are the changes needed?

bug fixing

Does this PR introduce any user-facing change?

no

How was this patch tested?

new test

huaxingao · 2022-09-14T23:15:52Z

cc @cloud-fan @sunchao

cloud-fan · 2022-09-15T01:04:03Z

...main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala

+            None
+          } else {
+            val ref = AttributeSet.fromAttributeSets(partitioning.get.map(_.references))
+            if (ref.subsetOf(AttributeSet(d.output))) {


Suggested change

-            if (ref.subsetOf(AttributeSet(d.output))) {
+            if (ref.subsetOf(d.outputSet)) {

cloud-fan · 2022-09-15T01:05:03Z

...main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala

+          if (partitioning.isEmpty) {
+            None
+          } else {
+            val ref = AttributeSet.fromAttributeSets(partitioning.get.map(_.references))


How about `partitioning.get.forall(p => p.references.subsetOf(...))`?

Sounds good! Changed.
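
For readers following the thread, here is a minimal sketch of the per-expression check suggested above. It is not the merged Spark code: `PartitionExpr`, `KeyGroupedCheck`, and `keyGroupedIfInOutput` are hypothetical stand-ins, and plain `Set[String]` column names take the place of Catalyst's `AttributeSet`, so that the snippet compiles without Spark on the classpath.

```scala
// Simplified stand-in for a partitioning expression (hypothetical, not a Spark class).
// `references` holds the column names the expression reads.
final case class PartitionExpr(name: String, references: Set[String])

object KeyGroupedCheck {
  // Keep the reported partitioning only if every partitioning expression
  // references columns that are all present in the scan output; otherwise
  // return None so that no key-grouped partitioning is reported.
  def keyGroupedIfInOutput(
      partitioning: Option[Seq[PartitionExpr]],
      outputSet: Set[String]): Option[Seq[PartitionExpr]] =
    partitioning.filter(_.forall(p => p.references.subsetOf(outputSet)))
}
```

Under these assumptions, a table partitioned by `id` keeps its partitioning when the scan output is `Set("id", "data")`, and loses it when the query prunes the output down to `Set("data")`.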

cloud-fan

good catch!

dongjoon-hyun

+1, LGTM. Thank you, @huaxingao and @cloud-fan .
Merged to master.

dongjoon-hyun · 2022-09-15T06:07:52Z

Could you make a backporting PR, @huaxingao ?

huaxingao · 2022-09-15T16:05:01Z

Thanks @cloud-fan @dongjoon-hyun

…renced column is in the output

### What changes were proposed in this pull request?
back porting [PR](#37886) to 3.3. Only set `KeyGroupedPartitioning` when the referenced column is in the output

### Why are the changes needed?
bug fixing

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New test

Closes #37901 from huaxingao/3.3.

Authored-by: huaxingao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…keys

### What changes were proposed in this pull request?
- Add new conf spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled
- Change key compatibility checks in EnsureRequirements. Remove checks where all partition keys must be in join keys to allow isKeyCompatible = true in this case.
- Change BatchScanExec/DataSourceV2Relation to group splits by join keys (previously grouped only by partition values)
- Implement partiallyClustered skew-handling.
  - Group only the replicate side (now by join key as well)
  - Add an additional sort at the end of partitions based on join key, as when we group the non-replicate side, partition ordering becomes out of order.

### Why are the changes needed?
- Support Storage Partition Join in cases where the join condition does not contain all the partition keys, but just some of them

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Added tests in KeyGroupedPartitioningSuite
- Found two problems, will address in separate PR:
  - apache#37886 made another change so that we have to select all join keys, otherwise DSV2 scan does not report KeyGroupedPartitioning and SPJ does not get triggered. Need to see how to relax this.
  - https://issues.apache.org/jira/browse/SPARK-44641 was found when testing this change. This PR refactors some of that code to add group-by-join-key, but doesn't change the underlying logic, so the issue continues to exist. Hopefully this will also get fixed in another way.

…keys

### What changes were proposed in this pull request?
- Add new conf spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled
- Change key compatibility checks in EnsureRequirements. Remove checks where all partition keys must be in join keys to allow isKeyCompatible = true in this case (if this flag is enabled)
- "Project" partitions by join keys in KeyGroupedPartitioning/KeyGroupedShuffleSpec
- Add join key grouping to the partition grouping in BatchScanExec

### Why are the changes needed?
- Support Storage Partition Join in cases where the join condition does not contain all the partition keys, but just some of them

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Added tests in KeyGroupedPartitioningSuite
- Because of apache#37886 we have to select all join keys to trigger SPJ in this case, otherwise DSV2 scan does not report KeyGroupedPartitioning and SPJ does not get triggered. Need to see how to relax this in separate PR.

…keys

### What changes were proposed in this pull request?
- Add new conf spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled
- Change key compatibility checks in EnsureRequirements. Remove checks where all partition keys must be in join keys to allow isKeyCompatible = true in this case (if this flag is enabled)
- Change BatchScanExec/DataSourceV2Relation to group splits by join keys if they differ from partition keys (previously grouped only by partition values). Do the same for all auxiliary data structures, like commonPartValues.
- Implement partiallyClustered skew-handling.
  - Group only the replicate side (now by join key as well), replicate by the total size of other-side partitions that share the join key.
  - Add an additional sort for partitions based on join key, as when we group the replicate side, partition ordering becomes out of order from the non-replicate side.

### Why are the changes needed?
- Support Storage Partition Join in cases where the join condition does not contain all the partition keys, but just some of them

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Added tests in KeyGroupedPartitioningSuite
- Found two existing problems, will address in separate PR:
  - Because of #37886 we have to select all join keys to trigger SPJ in this case, otherwise DSV2 scan does not report KeyGroupedPartitioning and SPJ does not get triggered. Need to see how to relax this.
  - https://issues.apache.org/jira/browse/SPARK-44641 was found when testing this change. This PR refactors some of that code to add group-by-join-key, but doesn't change the underlying logic, so the issue continues to exist. Hopefully this will also get fixed in another way.

Closes #42306 from szehon-ho/spj_attempt_master.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
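
To illustrate what "group splits by join keys" means in the commit message above, here is a rough, self-contained sketch. It is not taken from BatchScanExec: `InputPartition`, `JoinKeyGrouping`, and `groupByJoinKeys` are hypothetical names, and partition values are modelled as a plain `Seq[Any]`. The idea shown is only the projection step: each partition's key values are narrowed to the positions used by the join, and partitions sharing the projected value are grouped together.

```scala
// Hypothetical model of a scan partition: its partition-key values plus an id.
final case class InputPartition(keyValues: Seq[Any], splitId: Int)

object JoinKeyGrouping {
  // Project each partition's key values onto the join-key positions, then
  // group partitions whose projected values are equal. Within a group the
  // partitions keep their original relative order.
  def groupByJoinKeys(
      parts: Seq[InputPartition],
      joinKeyPositions: Seq[Int]): Map[Seq[Any], Seq[InputPartition]] =
    parts.groupBy(p => joinKeyPositions.map(i => p.keyValues(i)))
}
```

For a table partitioned by two keys and joined only on the first, `joinKeyPositions` would be `Seq(0)`, so partitions `(1, "x")` and `(1, "y")` land in the same group.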

huaxingao added 2 commits September 14, 2022 13:20

[SPARK-40429][SQL] Only set KeyGroupedPartitioning when the reference…

51c6160

…d column is in the output

fix style

ba9f4ca

github-actions bot added the SQL label Sep 14, 2022

cloud-fan reviewed Sep 15, 2022

cloud-fan approved these changes Sep 15, 2022

address comments

ab5ea9b

dongjoon-hyun approved these changes Sep 15, 2022

dongjoon-hyun closed this in 034e48f Sep 15, 2022

huaxingao deleted the keyGroupedPartitioning branch September 15, 2022 16:05

huaxingao mentioned this pull request Sep 15, 2022

[SPARK-40429][SQL][3.3] Only set KeyGroupedPartitioning when the referenced column is in the output #37901

Closed

szehon-ho mentioned this pull request Aug 2, 2023

[SPARK-44647][SQL] Support SPJ where join keys are less than cluster keys #42306

Closed
