Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups #37280

Closed
wants to merge 11 commits into from
Closed

[SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups #37280

wants to merge 11 commits into from

Conversation

dtenedor
Copy link
Contributor

@dtenedor dtenedor commented Jul 25, 2022

What changes were proposed in this pull request?

Fix two bugs in existence DEFAULT value lookups during data source scans:

  1. The ResolveDefaultColumns.getExistenceDefaultValues method should inspect the string in the column metadata at the EXISTS_DEFAULT_COLUMN_METADATA_KEY. It was erroneously using the CURRENT_DEFAULT_COLUMN_METADATA_KEY instead; this PR fixes that bug.

  2. The Orc file scan operator should return NULL for any row/column where the Orc data itself indicates NULL. It was erroneously using the existence default value instead; this PR fixes that bug.

Why are the changes needed?

This fixes a bug in column DEFAULT analysis for V2 data source tables.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR changes existing unit tests and adds a new case covering this bug.

@github-actions github-actions bot added the SQL label Jul 25, 2022
@dtenedor
Copy link
Contributor Author

Hi @gengliangwang, this PR fixes a bug in V2 data source analysis support for column DEFAULT values in Apache Spark.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Copy link
Contributor Author

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @beliefer for your review! Responded to changes, please look again when you have a chance.

@beliefer
Copy link
Contributor

Thanks @beliefer for your review! Responded to changes, please look again when you have a chance.

Thank you for fix. Although my suggestions may contain my personal habits.

@gengliangwang
Copy link
Member

LGTM. Thanks for the fix!

@gengliangwang
Copy link
Member

@dtenedor could you also update the PR description about the ORC fix?

@dtenedor dtenedor changed the title [SPARK-39862][SQL] Fix bug in existence DEFAULT value lookups for V2 data sources [SPARK-39862][SQL] Fix bugs in existence DEFAULT value lookups for V2 data sources Jul 28, 2022
@dtenedor
Copy link
Contributor Author

@gengliangwang Sure, this is done.

@dtenedor dtenedor changed the title [SPARK-39862][SQL] Fix bugs in existence DEFAULT value lookups for V2 data sources [SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups Jul 28, 2022
@gengliangwang
Copy link
Member

Thanks, merging to master

@YetiCuzMountain
Copy link

I find a bug. When u show create table, there does not exist DEFAULT on some fields event if they were defined.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @dtenedor , @beliefer , @gengliangwang .
I found that this commit causes a huge regression on ORC native readers during Apache Spark 3.4.0 preparation via benchmark.

I verified that this commit causes the regression.

cc @HyukjinKwon

@beliefer
Copy link
Contributor

beliefer commented Jan 3, 2023

@dongjoon-hyun Thank you for you ping. @dtenedor Could you investigate the issue ?

@dongjoon-hyun
Copy link
Member

Thanks. I created a PR to recover the ORC reader performance first by a partial and logical revert.

For the new DEFAULT-value feature's ORC bugs, we should handle separately more careful review in terms of the performance.

dongjoon-hyun added a commit that referenced this pull request Jan 3, 2023
…e feature

### What changes were proposed in this pull request?

This PR is a partial and logical revert of SPARK-39862, #37280, to fix the huge ORC reader perf regression (3x slower).

SPARK-39862 should propose a fix without perf regression.

### Why are the changes needed?

During Apache Spark 3.4.0 preparation, SPARK-41782 identified a perf regression.
- #39301 (comment)

### Does this PR introduce _any_ user-facing change?

After this PR, the regression is removed. However, the bug of DEFAULT value feature will remain. This should be handled separately.

### How was this patch tested?

Pass the CI.

Closes #39362 from dongjoon-hyun/SPARK-41858.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Jan 4, 2023
…rc reader

### What changes were proposed in this pull request?

This PR fixes a correctness bug related to column DEFAULT values in Orc reader.

* #37280 introduced a performance regression in the Orc reader.
* #39362 fixed the performance regression, but stopped the column DEFAULT feature from working, causing a temporary correctness regression that we agreed for me to fix later.
* This PR restores column DEFAULT functionality for Orc scans and fixes the correctness regression while not reintroducing the performance regression.

### Why are the changes needed?

This PR fixes a correctness bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR updates a unit test to exercise that the Orc scan functionality is correct.

Closes #39370 from dtenedor/fix-perf-bug-orc-reader.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants