
[SPARK-31164][SQL] Inconsistent rdd and output partitioning for bucket table when output doesn't contain all bucket columns #27924

Closed

wzhfy wants to merge 3 commits into apache:master from wzhfy:inconsistent_rdd_partitioning

Conversation

wzhfy (Contributor) commented Mar 16, 2020

What changes were proposed in this pull request?

For a bucketed table, when deciding the output partitioning, if the output doesn't contain all bucket columns, the result is `UnknownPartitioning`. But when generating the RDD, Spark currently uses `createBucketedReadRDD` without checking whether the output contains all bucket columns, so the RDD and its output partitioning are inconsistent.
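To make the scenario concrete, here is a minimal repro-style sketch (spark-shell Scala; the table name `bucketed_t` and the helper `coversAllBucketColumns` are illustrative, not the PR's actual code):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucket-pruning-repro")
  .master("local[*]") // local master so the sketch runs standalone
  .getOrCreate()

// A table bucketed by (i, j). Selecting only (i, k) prunes bucket column j from
// the scan output, so the scan should report UnknownPartitioning and should not
// take the createBucketedReadRDD path.
spark.range(10).selectExpr("id AS i", "id % 5 AS j", "id AS k")
  .write.bucketBy(8, "i", "j").saveAsTable("bucketed_t")

spark.table("bucketed_t").select("i", "k").explain()

// The consistency condition, in essence: the same predicate should drive both
// the reported output partitioning and the choice of read-RDD builder.
def coversAllBucketColumns(output: Seq[String], bucketCols: Seq[String]): Boolean =
  bucketCols.forall(output.contains)

coversAllBucketColumns(Seq("i", "k"), Seq("i", "j")) // false
```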

Why are the changes needed?

To fix a bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Modified existing tests.

wzhfy requested a review from cloud-fan on March 16, 2020 09:42
cloud-fan (Contributor) left a comment

good catch!

SparkQA commented Mar 16, 2020

Test build #119861 has finished for PR 27924 at commit 6c7543b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 16, 2020

Test build #119864 has finished for PR 27924 at commit 69b484e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member) commented

cc @dbtsai since this is related to table bucketing.

dongjoon-hyun (Member) left a comment

+1, LGTM (except one typo issue).

SparkQA commented Mar 17, 2020

Test build #119898 has finished for PR 27924 at commit 85d5324.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wzhfy closed this in 1369a97 Mar 17, 2020
wzhfy added a commit that referenced this pull request Mar 17, 2020
…t table when output doesn't contain all bucket columns

Closes #27924 from wzhfy/inconsistent_rdd_partitioning.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Zhenhua Wang <wzh_zju@163.com>
(cherry picked from commit 1369a97)
Signed-off-by: Zhenhua Wang <wzh_zju@163.com>
wzhfy (Contributor, Author) commented Mar 17, 2020

thanks for reviewing, merged to master/3.0

wzhfy added a commit to wzhfy/spark that referenced this pull request Mar 17, 2020
…t table when output doesn't contain all bucket columns

Closes apache#27924 from wzhfy/inconsistent_rdd_partitioning.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Zhenhua Wang <wzh_zju@163.com>
dongjoon-hyun (Member) commented

Thank you, @wzhfy and @cloud-fan

cloud-fan pushed a commit that referenced this pull request Mar 17, 2020
…bucket table when output doesn't contain all bucket columns

This is a backport for [pr#27924](#27924).

Closes #27934 from wzhfy/inconsistent_rdd_partitioning_2.4.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…t table when output doesn't contain all bucket columns

Closes apache#27924 from wzhfy/inconsistent_rdd_partitioning.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Zhenhua Wang <wzh_zju@163.com>
manuzhang (Member) commented May 27, 2022

@cloud-fan @wzhfy
I'm wondering whether bucketed scan with `UnknownPartitioning` is a bug.

As I understand it, bucketed scan has two effects:

  1. it decides how input files are partitioned
  2. it benefits downstream operators (e.g. bucket joins)

Even when 2 is not effective, we may still make use of 1: for example, it gives us a bounded number of `FilePartition`s. Without bucketed scan, when the input volume is huge, a very large number of `FilePartition`s could blow up driver memory.

Is there a correctness issue I've missed?

cloud-fan (Contributor) commented

I don't think limiting the number of file partitions was a design goal of bucketed tables. We can set `spark.sql.files.maxPartitionBytes` to a large value like 1GB to reduce the number of partitions.
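For example (a minimal sketch; the `1g` value is purely illustrative and should be tuned to the actual data volume):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("max-partition-bytes-example")
  .master("local[*]") // local master so the sketch runs standalone
  .getOrCreate()

// A larger target size per file partition means fewer, larger FilePartitions,
// without relying on bucketed scan to bound the partition count.
spark.conf.set("spark.sql.files.maxPartitionBytes", "1g")
```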

manuzhang (Member) commented

Yes, but that effect has leaked into user space, and this change breaks it. As to my original question, is `HashPartitioning` a hard requirement for bucketed scan?

cloud-fan (Contributor) commented

I don't think so, and Spark will disable bucketed scan if it has no benefit for downstream operators; see the rule `DisableUnnecessaryBucketedScan`.
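For reference, a sketch of how that rule can be turned off per session; the config key is assumed from Spark 3.1+ SQLConf and should be verified against the version in use:

```scala
// Assumes an active SparkSession named `spark` (e.g. in spark-shell).
// With this set to false, DisableUnnecessaryBucketedScan no longer turns
// bucketed scans back into normal scans.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```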

manuzhang (Member) commented May 27, 2022

That rule has the same issue, but it can be disabled. However, this change can't be.

cloud-fan (Contributor) commented

That rule means this is by design. We believe bucketed scan is more expensive than a normal scan and only want to use it when it can avoid shuffles. Maybe this does not apply in your case, but it applies in many other cases. If you had found this issue before we released it, we could have reverted it to avoid a perf regression. But a revert is not an option today, as it may cause perf regressions for more people.

I'd suggest revisiting your requirement and considering `spark.sql.files.maxPartitionBytes`. We should not abuse bucketed tables here.

manuzhang (Member) commented

I'm not asking for a revert here, but to explore an option to disable this behavior, hence the question about partitioning and bucketed scan.
Bucketed tables are built to avoid shuffles, but a table is a table, and we cannot prevent downstream users from using it in other ways. As a platform, we'd rather keep backward compatibility via some general options than ask users to fine-tune their jobs.

cloud-fan (Contributor) commented

I'm fine with keeping backward compatibility for some "by-accident" features if the cost is small. Feel free to open a PR to bring back the old behavior, provided that:

  1. we do not re-introduce the correctness bug
  2. the code is simple (less maintenance cost)

manuzhang (Member) commented

@cloud-fan please help check #36733
