New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-39862][SQL] Fix two bugs in existence DEFAULT value lookups #37280
Conversation
Hi @gengliangwang, this PR fixes a bug in V2 data source analysis support for column DEFAULT values in Apache Spark. |
Can one of the admins verify this patch? |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @beliefer for your review! Responded to changes, please look again when you have a chance.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala
Outdated
Show resolved
Hide resolved
Thank you for fix. Although my suggestions may contain my personal habits. |
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
Show resolved
Hide resolved
LGTM. Thanks for the fix! |
@dtenedor could you also update the PR description about the ORC fix? |
@gengliangwang Sure, this is done. |
Thanks, merging to master |
I find a bug. When u show create table, there does not exist DEFAULT on some fields event if they were defined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @dtenedor , @beliefer , @gengliangwang .
I found that this commit causes a huge regression on ORC native readers during Apache Spark 3.4.0 preparation via benchmark.
I verified that this commit causes the regression.
cc @HyukjinKwon
@dongjoon-hyun Thank you for you ping. @dtenedor Could you investigate the issue ? |
Thanks. I created a PR to recover the ORC reader performance first by a partial and logical revert. For the new DEFAULT-value feature's ORC bugs, we should handle separately more careful review in terms of the performance. |
…e feature ### What changes were proposed in this pull request? This PR is a partial and logical revert of SPARK-39862, #37280, to fix the huge ORC reader perf regression (3x slower). SPARK-39862 should propose a fix without perf regression. ### Why are the changes needed? During Apache Spark 3.4.0 preparation, SPARK-41782 identified a perf regression. - #39301 (comment) ### Does this PR introduce _any_ user-facing change? After this PR, the regression is removed. However, the bug of DEFAULT value feature will remain. This should be handled separately. ### How was this patch tested? Pass the CI. Closes #39362 from dongjoon-hyun/SPARK-41858. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…rc reader ### What changes were proposed in this pull request? This PR fixes a correctness bug related to column DEFAULT values in Orc reader. * #37280 introduced a performance regression in the Orc reader. * #39362 fixed the performance regression, but stopped the column DEFAULT feature from working, causing a temporary correctness regression that we agreed for me to fix later. * This PR restores column DEFAULT functionality for Orc scans and fixes the correctness regression while not reintroducing the performance regression. ### Why are the changes needed? This PR fixes a correctness bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR updates a unit test to exercise that the Orc scan functionality is correct. Closes #39370 from dtenedor/fix-perf-bug-orc-reader. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Fix two bugs in existence DEFAULT value lookups during data source scans:
The
ResolveDefaultColumns.getExistenceDefaultValues
method should inspect the string in the column metadata at theEXISTS_DEFAULT_COLUMN_METADATA_KEY
. It was erroneously using theCURRENT_DEFAULT_COLUMN_METADATA_KEY
instead; this PR fixes that bug.The Orc file scan operator should return NULL for any row/column where the Orc data itself indicates NULL. It was erroneously using the existence default value instead; this PR fixes that bug.
Why are the changes needed?
This fixes a bug in column DEFAULT analysis for V2 data source tables.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
This PR changes existing unit tests and adds a new case covering this bug.