
[SPARK-31164][SQL] Inconsistent rdd and output partitioning for bucket table when output doesn't contain all bucket columns #27924

Closed

wzhfy wants to merge 3 commits into apache:master from wzhfy:inconsistent_rdd_partitioning

Conversation

wzhfy (Contributor) commented Mar 16, 2020

What changes were proposed in this pull request?

For a bucketed table, when deciding the output partitioning, if the output doesn't contain all bucket columns, the result is `UnknownPartitioning`. But when generating the RDD, Spark currently uses `createBucketedReadRDD` without checking whether the output contains all bucket columns, so the RDD and its output partitioning are inconsistent.
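To make the scenario concrete, here is a minimal repro-style sketch (spark-shell Scala; the table name `bucketed_t` and the helper `coversAllBucketColumns` are illustrative, not the PR's actual code):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucket-pruning-repro")
  .master("local[*]") // local master so the sketch runs standalone
  .getOrCreate()

// A table bucketed by (i, j). Selecting only (i, k) prunes bucket column j from
// the scan output, so the scan should report UnknownPartitioning and should not
// take the createBucketedReadRDD path.
spark.range(10).selectExpr("id AS i", "id % 5 AS j", "id AS k")
  .write.bucketBy(8, "i", "j").saveAsTable("bucketed_t")

spark.table("bucketed_t").select("i", "k").explain()

// The consistency condition, in essence: the same predicate should drive both
// the reported output partitioning and the choice of read-RDD builder.
def coversAllBucketColumns(output: Seq[String], bucketCols: Seq[String]): Boolean =
  bucketCols.forall(output.contains)

coversAllBucketColumns(Seq("i", "k"), Seq("i", "j")) // false
```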

Why are the changes needed?

To fix a bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Modified existing tests.

wzhfy requested a review from cloud-fan on March 16, 2020 09:42
cloud-fan (Contributor) left a comment

good catch!

SparkQA commented Mar 16, 2020

Test build #119861 has finished for PR 27924 at commit 6c7543b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 16, 2020

Test build #119864 has finished for PR 27924 at commit 69b484e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member) commented

cc @dbtsai since this is related to table bucketing.

dongjoon-hyun (Member) left a comment

+1, LGTM (except one typo issue).

SparkQA commented Mar 17, 2020

Test build #119898 has finished for PR 27924 at commit 85d5324.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wzhfy closed this in 1369a97 Mar 17, 2020
wzhfy added a commit that referenced this pull request Mar 17, 2020
…t table when output doesn't contain all bucket columns

Closes #27924 from wzhfy/inconsistent_rdd_partitioning.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Zhenhua Wang <wzh_zju@163.com>
(cherry picked from commit 1369a97)
Signed-off-by: Zhenhua Wang <wzh_zju@163.com>
wzhfy (Contributor, Author) commented Mar 17, 2020

thanks for reviewing, merged to master/3.0

wzhfy added a commit to wzhfy/spark that referenced this pull request Mar 17, 2020
…t table when output doesn't contain all bucket columns

Closes apache#27924 from wzhfy/inconsistent_rdd_partitioning.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Zhenhua Wang <wzh_zju@163.com>
dongjoon-hyun (Member) commented

Thank you, @wzhfy and @cloud-fan

cloud-fan pushed a commit that referenced this pull request Mar 17, 2020
…bucket table when output doesn't contain all bucket columns

This is a backport for [pr#27924](#27924).

Closes #27934 from wzhfy/inconsistent_rdd_partitioning_2.4.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…t table when output doesn't contain all bucket columns

Closes apache#27924 from wzhfy/inconsistent_rdd_partitioning.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Zhenhua Wang <wzh_zju@163.com>
manuzhang (Member) commented May 27, 2022

@cloud-fan @wzhfy
I'm wondering whether bucketed scan with `UnknownPartitioning` is a bug.

As I understand it, bucketed scan has two effects:

  1. it decides how input files are partitioned
  2. it benefits downstream operators (e.g. bucket joins)

Even when 2 is not effective, we may still make use of 1: for example, it gives us a bounded number of `FilePartition`s. Without bucketed scan, when the input volume is huge, a very large number of `FilePartition`s could blow up driver memory.

Is there a correctness issue I've missed?

cloud-fan (Contributor) commented

I don't think limiting the number of file partitions was a design goal of bucketed tables. We can set `spark.sql.files.maxPartitionBytes` to a large value like 1GB to reduce the number of partitions.
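For example (a minimal sketch; the `1g` value is purely illustrative and should be tuned to the actual data volume):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("max-partition-bytes-example")
  .master("local[*]") // local master so the sketch runs standalone
  .getOrCreate()

// A larger target size per file partition means fewer, larger FilePartitions,
// without relying on bucketed scan to bound the partition count.
spark.conf.set("spark.sql.files.maxPartitionBytes", "1g")
```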

manuzhang (Member) commented

Yes, but that effect has leaked into user space, and this change breaks it. As to my original question, is `HashPartitioning` a hard requirement for bucketed scan?

cloud-fan (Contributor) commented

I don't think so, and Spark will disable bucketed scan if it has no benefit for downstream operators; see the rule `DisableUnnecessaryBucketedScan`.
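For reference, a sketch of how that rule can be turned off per session; the config key is assumed from Spark 3.1+ SQLConf and should be verified against the version in use:

```scala
// Assumes an active SparkSession named `spark` (e.g. in spark-shell).
// With this set to false, DisableUnnecessaryBucketedScan no longer turns
// bucketed scans back into normal scans.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```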

manuzhang (Member) commented May 27, 2022

That rule has the same issue, but it can be disabled. However, this change can't be.

cloud-fan (Contributor) commented

That rule means this is by design. We believe bucketed scan is more expensive than a normal scan and only want to use it when it can avoid shuffles. Maybe this does not apply in your case, but it applies in many other cases. If you had found this issue before we released it, we could have reverted it to avoid a perf regression. But a revert is not an option today, as it may cause perf regressions for more people.

I'd suggest revisiting your requirement and considering `spark.sql.files.maxPartitionBytes`. We should not abuse bucketed tables here.

manuzhang (Member) commented

I'm not asking for a revert here, but to explore an option to disable this behavior, hence the question about partitioning and bucketed scan.
Bucketed tables are built to avoid shuffles, but a table is a table, and we cannot prevent downstream users from using it in other ways. As a platform, we'd rather keep backward compatibility via some general options than ask users to fine-tune their jobs.

cloud-fan (Contributor) commented

I'm fine with keeping backward compatibility for some "by-accident" features if the cost is small. Feel free to open a PR to bring back the old behavior, provided that:

  1. we do not re-introduce the correctness bug
  2. the code is simple (less maintenance cost)

manuzhang (Member) commented

@cloud-fan please help check #36733
