Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark 3.3: Change default distribution modes #6828

Merged

Conversation

aokolnychyi
Copy link
Contributor

This PR changes the default distribution modes in Spark 3.3.

  • Default distribution mode for partitioned but unsorted tables in INSERT is HASH (instead of NONE).
  • Default distribution mode for partitioned but unsorted tables in CoW MERGE is HASH (instead of NONE).
  • Default distribution mode for partitioned and sorted tables in CoW MERGE is HASH (instead of RANGE).
  • Default distribution mode for MoR MERGE is always HASH (instead of relying on write distribution).

Fixes #6679.

@github-actions github-actions bot added the spark label Feb 13, 2023
@aokolnychyi
Copy link
Contributor Author

cc @RussellSpitzer @rdblue @jackye1995 @dramaticlly

} else if (table.spec().isPartitioned()) {
return HASH;
} else {
return NONE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we keep this as hash, we will avoid small files even with unpartitioned tables by forcing the rebalance right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to request a valid Distribution in order for AQE to do its job. If the table is unpartitioned, I don't think we have what to cluster by.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is purely for inserts, though. I should probably call it defaultInsertDistributionMode() or something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think we could just cluster by all columns

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how that would perform, to be honest. What about exploring this separately? Maybe, we can also handle this natively in Spark.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about exploring this separately?

+1, the benefit seems to be not so straightforward to cluster by all columns. We can first have this generic default and then make more changes for unpartitioned case.

@jackye1995 jackye1995 added this to In progress in [Release] Iceberg 1.2 via automation Feb 13, 2023
@@ -491,6 +491,7 @@ public void testReadPartitionColumn() throws Exception {
.option(SparkReadOptions.VECTORIZATION_ENABLED, String.valueOf(vectorized))
.load(baseLocation)
.select("struct.innerName")
.orderBy("struct.innerName")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required by the check below as the default distribution changes the order of elements.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me!

[Release] Iceberg 1.2 automation moved this from In progress to Reviewer approved Feb 14, 2023
@aokolnychyi aokolnychyi force-pushed the change-default-distribution-modes branch from e3f9394 to fa7965c Compare February 16, 2023 03:06
@aokolnychyi aokolnychyi merged commit a6ad1d1 into apache:master Feb 16, 2023
[Release] Iceberg 1.2 automation moved this from Reviewer approved to Done Feb 16, 2023
@aokolnychyi
Copy link
Contributor Author

Thanks for reviewing, @RussellSpitzer @jackye1995 @dramaticlly!
Let me cherry-pick this to 3.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Change Default Write Distribution Mode
4 participants