Spark: Add DistributionAndOrderingUtils #2141
Conversation
    // add the configured sort to the partition spec prefix sort
    SortOrderVisitor.visit(sortOrder, new CopySortOrderFields(builder));
I removed another CopySortOrderFields which I think was a duplicate.
    val roundRobin = Repartition(numShufflePartitions, shuffle = true, childPlan)
    Sort(buildSortOrder(order), global = true, roundRobin)

    case iceberg: SparkTable =>
      val distribution = Spark3Util.buildRequiredDistribution(iceberg.table)
The intention is to reuse Spark3Util during inserts in Spark 3.2.
Can we expose the same interface as in Spark 3.2 from Table and use that here instead? Then Spark3Util calls remain in SparkTable.
Could you elaborate a bit more? The interface in Spark 3.2 is implemented by Write, not Table. Are you thinking of passing SparkTable as an arg?
You're right. I was thinking it was on Table instead. Let's go with this then.
    protected def buildSimpleScanPlan(
        relation: DataSourceV2Relation,
        cond: catalyst.expressions.Expression): LogicalPlan = {
spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteMergeInto.scala
    case _: OrderedDistribution =>
      // insert a round robin partitioning to avoid executing the join twice
      val numShufflePartitions = conf.numShufflePartitions
      Repartition(numShufflePartitions, shuffle = true, childPlan)
This is here and not in prepareQuery because we don't want to assume that a global ordering always requires an extra round-robin repartition? If so, it would be good to move the comment above newChildPlan and make it a bit more clear why this extra step is happening.
Yes, we don't want to add a round-robin repartition during inserts, for example. I'll add more info.
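The trade-off discussed in this thread can be sketched as a toy model. Plain Java strings stand in for catalyst plan nodes here; none of these names are the actual Iceberg or Spark API. The idea: a global sort triggers range partitioning, whose sampling pass would run the child plan (the merge join) twice unless a shuffle boundary is inserted first, which is why the round-robin repartition is added only for ordered distributions and not unconditionally in prepareQuery.

```java
// Toy sketch of the plan-building decision, not the actual Iceberg/Spark code.
public class PlanSketch {

  // A "plan" is just a string describing the operator tree, innermost last.
  static String buildWritePlan(String distributionKind, String childPlan, int numShufflePartitions) {
    switch (distributionKind) {
      case "ordered":
        // A global sort triggers range partitioning, which samples its child.
        // Without a shuffle boundary, sampling would execute the child twice,
        // so insert a round-robin repartition first.
        String roundRobin = "Repartition(" + numShufflePartitions + ", shuffle=true, " + childPlan + ")";
        return "Sort(global=true, " + roundRobin + ")";
      case "clustered":
        // Hash distribution: a single repartition by expression is enough.
        return "RepartitionByExpression(" + numShufflePartitions + ", " + childPlan + ")";
      default:
        // Unspecified distribution: leave the plan as-is (e.g. plain inserts).
        return childPlan;
    }
  }

  public static void main(String[] args) {
    System.out.println(buildWritePlan("ordered", "MergeInto", 2000));
    System.out.println(buildWritePlan("unspecified", "MergeInto", 2000));
  }
}
```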
...nsions/src/main/scala/org/apache/spark/sql/catalyst/utils/DistributionAndOrderingUtils.scala
    case NONE:
      return Distributions.unspecified();
    case HASH:
      return Distributions.clustered(toTransforms(table.spec()));
Is it correct to return a clustered distribution with no expressions if the spec is unpartitioned? I think I would rather return Distributions.unspecified just to be safe when passing this back to Spark.
Good point, let me handle this.
Added a check at the beginning of this method. Could you check, @rdblue?
A sorted table may not be partitioned, but it would pass the check you added. Then if the distribution mode is hash, it would return an empty clustered distribution. I think it would be more correct and easier to reason about if the check was done here.
You are right. Updated.
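The fix the thread converges on can be sketched as follows, with simplified, hypothetical types rather than the real Iceberg Distributions API: the unpartitioned guard lives inside the HASH branch itself, so a sorted but unpartitioned table still falls back to an unspecified distribution instead of producing a clustered distribution with zero expressions.

```java
// Illustrative sketch of the guard discussed above; names are simplified
// stand-ins, not the actual Iceberg API.
import java.util.List;

public class DistributionSketch {
  enum Mode { NONE, HASH, RANGE }

  static String buildRequiredDistribution(Mode mode, List<String> partitionTransforms) {
    switch (mode) {
      case NONE:
        return "unspecified";
      case HASH:
        // The check sits next to the clustered case: a sorted but
        // unpartitioned table must not yield an empty clustered distribution.
        if (partitionTransforms.isEmpty()) {
          return "unspecified";
        }
        return "clustered" + partitionTransforms;
      case RANGE:
        return "ordered";
      default:
        throw new IllegalArgumentException("Unknown mode: " + mode);
    }
  }
}
```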
    object IcebergImplicits {
      implicit class TableHelper(table: Table) {
        def asIcebergTable: org.apache.iceberg.Table = {
I think it looks cleaner with implicits.
Should this be named toIcebergTable or icebergTable? This is doing more than just a cast, it is accessing the underlying table.
I think this logic will be needed in a few places so I moved it to Spark3Util and got rid of implicits.
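Where the thread landed can be sketched in Java as a plain static helper that unwraps the underlying Iceberg table, rather than a Scala implicit conversion. The types and the helper name below are simplified stand-ins, not the actual Iceberg classes; the point is the naming question raised above: the method accesses wrapped state, so a "to" prefix reads more honestly than "as", which suggests a mere cast.

```java
// Toy model of a Spark3Util-style static helper; not the real Iceberg code.
public class Spark3UtilSketch {
  // Stand-in for org.apache.iceberg.Table.
  static class IcebergTable {}

  // Stand-in for the Spark-facing SparkTable wrapper.
  static class SparkTable {
    final IcebergTable underlying = new IcebergTable();

    IcebergTable table() {
      return underlying;
    }
  }

  // "to" rather than "as": this retrieves the wrapped table, not just a cast.
  static IcebergTable toIcebergTable(Object table) {
    if (table instanceof SparkTable) {
      return ((SparkTable) table).table();
    }
    throw new IllegalArgumentException("Not an Iceberg table: " + table);
  }
}
```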
Force-pushed from 7f52312 to 5baec72.
    private val TRUE_LITERAL = Literal(true, BooleanType)
    private val FALSE_LITERAL = Literal(false, BooleanType)

    import org.apache.spark.sql.execution.datasources.v2.ExtendedDataSourceV2Implicits._
Nit: this doesn't need to move the import above the constants.
I think it is more natural to have imports for implicits before variables and methods in a class. I'd be in favor of changing but I can do that in a separate PR. I'll revert it from here and submit a follow-up.
I agree about order. We should probably also move constants into a companion class instead of inline. Does Scala do that automatically or are these initialized for every instance?
I'd need to check the bytecode but I agree on moving constants to the companion object.
Will submit a follow-up.
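To the question above: a val declared in a Scala class body is an instance field initialized by every constructor call, while a val in the companion object compiles to a single statically initialized value, so moving the constants there avoids per-instance initialization. A plain Java sketch of the same distinction (illustrative only, not the PR's code):

```java
// static final fields correspond roughly to Scala companion-object vals;
// instance fields correspond to vals declared in a class body.
public class ConstantsSketch {
  // Initialized once when the class loads; shared by every instance.
  static final Object SHARED = new Object();

  // Initialized anew for each instance, like a Scala class-body val.
  final Object perInstance = new Object();

  public static void main(String[] args) {
    ConstantsSketch a = new ConstantsSketch();
    ConstantsSketch b = new ConstantsSketch();
    System.out.println(a.perInstance == b.perInstance); // false: one per instance
    System.out.println(ConstantsSketch.SHARED != null); // true: one per class
  }
}
```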
Looks correct to me. I'd prefer to fix the unnecessary changes, but I'll leave it up to you whether to merge or fix and then merge. I usually allow nits through on the last pass to avoid blocking.

Thanks for reviewing, @rdblue!
    // the conversion to catalyst expressions above produces SortOrder expressions
    // for OrderedDistribution and generic expressions for ClusteredDistribution
    // this allows RepartitionByExpression to pick either range or hash partitioning
    RepartitionByExpression(distribution, query, numShufflePartitions)
When WRITE_DISTRIBUTION_MODE = range, before this PR the logical plan is:

    Sort [dt#8 ASC NULLS FIRST, v#7 ASC NULLS FIRST], true
    +- Repartition 2000, true
       +- MergeInto

After this PR, the logical plan is:

    Sort [dt#8 ASC NULLS FIRST, v#7 ASC NULLS FIRST], false
    +- RepartitionByExpression [dt#8], 2000
       +- Repartition 2000, true
          +- MergeInto

In my opinion, the conversion of the global sort into a local sort plus range partitioning is correct, but we need to consider the CollapseRepartition rule in the Spark optimizer: in this case, that rule will eliminate the Repartition 2000, true node.
Please take a look, thanks @aokolnychyi @rdblue.
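To illustrate the concern, here is a toy string-based model of the rewrite a collapse rule like CollapseRepartition performs: an outer repartition swallows the shuffle repartition directly beneath it, removing the very boundary that was inserted to avoid re-executing the join. This is only the shape of the rewrite, not Spark's implementation.

```java
// Toy model of repartition collapsing; plans are plain strings, not catalyst nodes.
public class CollapseSketch {

  // RepartitionByExpression(Repartition(x)) => RepartitionByExpression(x)
  static String collapseRepartition(String plan) {
    String prefix = "RepartitionByExpression(Repartition(";
    if (plan.startsWith(prefix) && plan.endsWith("))")) {
      String inner = plan.substring(prefix.length(), plan.length() - 2);
      return "RepartitionByExpression(" + inner + ")";
    }
    return plan;
  }

  public static void main(String[] args) {
    // The round-robin shuffle under the range repartition is eliminated.
    System.out.println(collapseRepartition("RepartitionByExpression(Repartition(MergeInto))"));
  }
}
```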
This PR adds DistributionAndOrderingUtils that is being proposed in master and migrates RewriteMergeInto to use it.