
Spark: Add DistributionAndOrderingUtils #2141

Merged
aokolnychyi merged 4 commits into apache:master from aokolnychyi:refactor-required-distribution
Jan 26, 2021

Conversation

@aokolnychyi
Contributor

This PR adds DistributionAndOrderingUtils, which is being proposed for master, and migrates RewriteMergeInto to use it.
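
For context, here is a rough sketch of what such a utility does, based on the snippets discussed below; the method name, parameters, and overall shape are assumptions for illustration, not the PR's exact code. The idea is to turn the table's required distribution into a repartition node and its required ordering into a sort node on top of the incoming query.

import org.apache.spark.sql.catalyst.expressions.{Expression, SortOrder}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, RepartitionByExpression, Sort}

// sketch only: distribution is the table's required distribution already converted
// to catalyst expressions, ordering is its required sort order
def prepareQuery(
    distribution: Seq[Expression],
    ordering: Seq[SortOrder],
    numShufflePartitions: Int,
    query: LogicalPlan): LogicalPlan = {
  // RepartitionByExpression picks range partitioning for SortOrder expressions
  // and hash partitioning for generic expressions (see the comment in the diff below)
  val repartitioned = if (distribution.isEmpty) {
    query
  } else {
    RepartitionByExpression(distribution, query, numShufflePartitions)
  }
  // once the data is partitioned, a local sort is enough to satisfy the ordering
  if (ordering.isEmpty) repartitioned else Sort(ordering, global = false, child = repartitioned)
}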

@aokolnychyi
Contributor Author

cc @rdblue @dilipbiswal

}

// add the configured sort to the partition spec prefix sort
SortOrderVisitor.visit(sortOrder, new CopySortOrderFields(builder));
Contributor Author

I removed another CopySortOrderFields which I think was a duplicate.

val roundRobin = Repartition(numShufflePartitions, shuffle = true, childPlan)
Sort(buildSortOrder(order), global = true, roundRobin)
case iceberg: SparkTable =>
val distribution = Spark3Util.buildRequiredDistribution(iceberg.table)
Contributor Author

The intention is to reuse Spark3Util during inserts in Spark 3.2.

Contributor

Can we expose the same interface as in Spark 3.2 from Table and use that here instead? Then Spark3Util calls remain in SparkTable.

Contributor Author

Could you elaborate a bit more? The interface in Spark 3.2 is implemented by Write, not Table. Are you thinking of passing SparkTable as an arg?

Contributor

You're right. I was thinking it was on Table instead. Let's go with this then.
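
For reference, a hedged sketch of the Spark 3.2 hook being discussed, as I recall it; the interface, package, and method names below should be treated as assumptions rather than verified API. The point is that the distribution and ordering requirements are declared by the Write, not by the Table, which is why they cannot simply be exposed from Table here.

import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.SortOrder
import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}

// minimal sketch of a write that declares its required distribution and ordering
class ExampleWrite extends Write with RequiresDistributionAndOrdering {
  override def requiredDistribution(): Distribution = Distributions.unspecified()
  override def requiredOrdering(): Array[SortOrder] = Array.empty[SortOrder]
}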


protected def buildSimpleScanPlan(
    relation: DataSourceV2Relation,
    cond: catalyst.expressions.Expression): LogicalPlan = {
Contributor Author

Reverted back.

case _: OrderedDistribution =>
  // insert a round robin partitioning to avoid executing the join twice
  val numShufflePartitions = conf.numShufflePartitions
  Repartition(numShufflePartitions, shuffle = true, childPlan)
Contributor

@rdblue Jan 25, 2021


This is here and not in prepareQuery because we don't want to assume that a global ordering always requires an extra round-robin repartition? If so, it would be good to move the comment above newChildPlan and make it a bit more clear why this extra step is happening.

Contributor Author

Yes, we don't want to add a round-robin repartition during inserts, for example. I'll add more info.

Contributor Author

Done. Could you check it once, @rdblue?
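
To make the intent concrete, here is a rough sketch of the pattern this thread settles on, reusing the names from the diff hunks above (the exact shape is an assumption, not the PR's code): only the MERGE rewrite inserts the round-robin repartition, so plain inserts never pay for it.

// newChildPlan is the name referenced in the review comment; childPlan is the MERGE join output
val newChildPlan = distribution match {
  case _: OrderedDistribution =>
    // round-robin repartition first so the subsequent range partitioning and sort
    // do not execute the MERGE join twice
    Repartition(numShufflePartitions, shuffle = true, childPlan)
  case _ =>
    childPlan
}
// the shared utility then adds the required distribution and ordering nodes on top of newChildPlan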

case NONE:
  return Distributions.unspecified();
case HASH:
  return Distributions.clustered(toTransforms(table.spec()));
Contributor

Is it correct to return a clustered distribution with no expressions if the spec is unpartitioned? I think I would rather return Distributions.unspecified just to be safe when passing this back to Spark.

Contributor Author

Good point, let me handle this.

Contributor Author

Added a check at the beginning of this method. Could you check, @rdblue?

Contributor

A sorted table may not be partitioned, but it would pass the check you added. Then if the distribution mode is hash, it would return an empty clustered distribution. I think it would be more correct and easier to reason about if the check was done here.

Contributor Author

You are right. Updated.
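
A minimal sketch, in Scala for brevity, of the check agreed on here; Distributions, toTransforms, and table.spec() come from the snippet above, while the helper name and placement are assumptions. The point is to fall back to an unspecified distribution for unpartitioned tables so Spark never receives an empty clustered distribution.

// sketch of the HASH branch only; not the PR's exact code
def hashDistribution(table: org.apache.iceberg.Table): Distribution =
  if (table.spec().isUnpartitioned) {
    // an empty clustered distribution would be meaningless to Spark
    Distributions.unspecified()
  } else {
    Distributions.clustered(toTransforms(table.spec()))
  }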


object IcebergImplicits {
  implicit class TableHelper(table: Table) {
    def asIcebergTable: org.apache.iceberg.Table = {
Contributor Author

I think it looks cleaner with implicits.

Contributor

Should this be named toIcebergTable or icebergTable? This is doing more than just a cast, it is accessing the underlying table.

Contributor Author

I think this logic will be needed in a few places so I moved it to Spark3Util and got rid of implicits.
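
A hypothetical shape of the relocated helper, written in Scala here for brevity (the method name is an assumption; SparkTable and its table accessor appear in the earlier snippet): it unwraps the underlying Iceberg table rather than merely casting, which is why the asIcebergTable name read oddly.

import org.apache.iceberg.spark.source.SparkTable
import org.apache.spark.sql.connector.catalog.Table

// hypothetical helper shape; not the PR's exact signature
def toIcebergTable(table: Table): org.apache.iceberg.Table = table match {
  case sparkTable: SparkTable => sparkTable.table()
  case _ => throw new IllegalArgumentException(s"not an Iceberg table: $table")
}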

@aokolnychyi force-pushed the refactor-required-distribution branch from 7f52312 to 5baec72 on January 26, 2021 00:36
private val TRUE_LITERAL = Literal(true, BooleanType)
private val FALSE_LITERAL = Literal(false, BooleanType)

import org.apache.spark.sql.execution.datasources.v2.ExtendedDataSourceV2Implicits._
Contributor

Nit: this doesn't need to move the import above the constants.

Contributor Author

I think it is more natural to have imports for implicits before variables and methods in a class. I'd be in favor of changing that, but I can do it in a separate PR. I'll revert it here and submit a follow-up.

Contributor

I agree about order. We should probably also move the constants into a companion object instead of keeping them inline. Does Scala do that automatically or are these initialized for every instance?

Contributor Author

@aokolnychyi Jan 26, 2021


I'd need to check the bytecode but I agree on moving constants to the companion object.
Will submit a follow-up.
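
For reference on the question above: vals declared in the class body are instance fields and are initialized for every instance of the rule, while vals in the companion object are initialized once when the object is first loaded. A sketch of the follow-up shape (which class hosts these constants is an assumption):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.BooleanType

object RewriteMergeInto {
  // initialized once and shared by all instances of the rule
  private val TRUE_LITERAL = Literal(true, BooleanType)
  private val FALSE_LITERAL = Literal(false, BooleanType)
}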

@rdblue
Contributor

rdblue commented Jan 26, 2021

Looks correct to me. I'd prefer to fix the unnecessary changes, but I'll leave it up to you whether to merge or fix and then merge. I usually allow nits through on the last pass to avoid blocking.

@aokolnychyi merged commit e5e1c8a into apache:master Jan 26, 2021
@aokolnychyi
Contributor Author

Thanks for reviewing, @rdblue!

// the conversion to catalyst expressions above produces SortOrder expressions
// for OrderedDistribution and generic expressions for ClusteredDistribution
// this allows RepartitionByExpression to pick either range or hash partitioning
RepartitionByExpression(distribution, query, numShufflePartitions)
Contributor

When WRITE_DISTRIBUTION_MODE = range, the logical plan before this PR is:

Sort [dt#8 ASC NULLS FIRST, v#7 ASC NULLS FIRST], true
+- Repartition 2000, true
   +- MergeInto 

After this PR, the logical plan is:

Sort [dt#8 ASC NULLS FIRST, v#7 ASC NULLS FIRST], false
+- RepartitionByExpression [dt#8], 2000
   +- Repartition 2000, true
      +- MergeInto

In my opinion, the conversion of the global sort into a local sort plus range partitioning is correct, but we also need to consider the CollapseRepartition rule in the Spark optimizer. In this case, that rule will eliminate the Repartition 2000, true node.

Please take a look here, thanks @aokolnychyi @rdblue
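
If CollapseRepartition fires as described, the optimized plan would presumably lose the round-robin shuffle and look like this (a sketch inferred from the comment above, not captured output):

Sort [dt#8 ASC NULLS FIRST, v#7 ASC NULLS FIRST], false
+- RepartitionByExpression [dt#8], 2000
   +- MergeInto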
