-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-43380][SQL] Fix Avro data type conversion issues to avoid producing incorrect results #41052
[SPARK-43380][SQL] Fix Avro data type conversion issues to avoid producing incorrect results #41052
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
Outdated
Show resolved
Hide resolved
@zeruibao could you update the PR description about what's the behavior will be? |
@zeruibao could you update the PR description about what the behavior will be? |
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
Outdated
Show resolved
Hide resolved
fbd85f4
to
8123398
Compare
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
Outdated
Show resolved
Hide resolved
LGTM if all tests pass |
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this. I didn't get a chance to review in a lot of detail. Can we also please investigate similar issues in Parquet and ORC as well? I believe Parquet may have some similar issues around interval types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this. I didn't get a chance to review in a lot of detail. Can we also please investigate similar issues in Parquet and ORC as well? I believe Parquet may have some similar issues around interval types
Thanks, merging to master |
…ucing incorrect results ### What changes were proposed in this pull request? We introduce the SQLConf `spark.sql.legacy.avro.allowReadingWithIncompatibleSchema` to prevent reading interval types as date or timestamp types to avoid getting corrupt dates as well as reading decimal types with incorrect precision. ### Why are the changes needed? We found the following issues with open source Avro: - Interval types can be read as date or timestamp types that would lead to wildly different results For example, `Duration.ofDays(1).plusSeconds(1)` will be read as `1972-09-27`, which is weird. - Decimal types can be read with lower precision, that leads to data being read as `null` instead of suggesting that a wider decimal format should be provided ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit test Closes apache#41052 from zeruibao/SPARK-43380-fix-avro-data-type-conversion. Lead-authored-by: zeruibao <zerui.bao@databricks.com> Co-authored-by: zeruibao <125398515+zeruibao@users.noreply.github.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request? Fix conversion of Avro logical timestamp type to Long ### Why are the changes needed? The fix in #41052 cause regression when trying to convert logical timestamp type to Long type and it's not covered by existing test suite. Fix it and add a test for it. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? New unit test Closes #41521 from zeruibao/SPARK-43380-fix-regression. Authored-by: zeruibao <zerui.bao@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? Fix conversion of Avro logical timestamp type to Long ### Why are the changes needed? The fix in apache#41052 cause regression when trying to convert logical timestamp type to Long type and it's not covered by existing test suite. Fix it and add a test for it. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? New unit test Closes apache#41521 from zeruibao/SPARK-43380-fix-regression. Authored-by: zeruibao <zerui.bao@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? Revert my last PR #41052 that causes AVRO read performance regression since I change the code structure. ### Why are the changes needed? Remove performance regression ### How was this patch tested? Unit test Closes #42458 from zeruibao/revert-avro-change. Authored-by: zeruibao <zerui.bao@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request? Revert my last PR #41052 that causes AVRO read performance regression since I change the code structure. ### Why are the changes needed? Remove performance regression ### How was this patch tested? Unit test Closes #42458 from zeruibao/revert-avro-change. Authored-by: zeruibao <zerui.bao@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 46580ab) Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request? Revert my last PR apache#41052 that causes AVRO read performance regression since I change the code structure. ### Why are the changes needed? Remove performance regression ### How was this patch tested? Unit test Closes apache#42458 from zeruibao/revert-avro-change. Authored-by: zeruibao <zerui.bao@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>
…ng performance regression ### What changes were proposed in this pull request? My last PR #41052 causes AVRO read performance regression since I change the code structure. I turn one match case into a nested match case. So I fix the Avro data type conversion issues in anther way to avoid this regression. Original Change: We introduce the SQLConf `spark.sql.legacy.avro.allowReadingWithIncompatibleSchema` to prevent reading interval types as date or timestamp types to avoid getting corrupt dates as well as reading decimal types with incorrect precision. ### Why are the changes needed? We found the following issues with open source Avro: - Interval types can be read as date or timestamp types that would lead to wildly different results For example, `Duration.ofDays(1).plusSeconds(1)` will be read as `1972-09-27`, which is weird. - Decimal types can be read with lower precision, that leads to data being read as `null` instead of suggesting that a wider decimal format should be provided ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Old unit test Closes #42503 from zeruibao/SPARK-4380-real-fix-regression. Lead-authored-by: zeruibao <zerui.bao@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ng performance regression ### What changes were proposed in this pull request? My last PR #41052 causes AVRO read performance regression since I change the code structure. I turn one match case into a nested match case. So I fix the Avro data type conversion issues in anther way to avoid this regression. Original Change: We introduce the SQLConf `spark.sql.legacy.avro.allowReadingWithIncompatibleSchema` to prevent reading interval types as date or timestamp types to avoid getting corrupt dates as well as reading decimal types with incorrect precision. ### Why are the changes needed? We found the following issues with open source Avro: - Interval types can be read as date or timestamp types that would lead to wildly different results For example, `Duration.ofDays(1).plusSeconds(1)` will be read as `1972-09-27`, which is weird. - Decimal types can be read with lower precision, that leads to data being read as `null` instead of suggesting that a wider decimal format should be provided ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Old unit test Closes #42503 from zeruibao/SPARK-4380-real-fix-regression. Lead-authored-by: zeruibao <zerui.bao@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f8c87f0) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? Revert my last PR apache#41052 that causes AVRO read performance regression since I change the code structure. ### Why are the changes needed? Remove performance regression ### How was this patch tested? Unit test Closes apache#42458 from zeruibao/revert-avro-change. Authored-by: zeruibao <zerui.bao@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?
We introduce the SQLConf
spark.sql.legacy.avro.allowReadingWithIncompatibleSchema
to prevent reading interval types as date or timestamp types to avoid getting corrupt dates as well as reading decimal types with incorrect precision.Why are the changes needed?
We found the following issues with open source Avro:
For example,
Duration.ofDays(1).plusSeconds(1)
will be read as1972-09-27
, which is weird.null
instead of suggesting that a wider decimal format should be providedDoes this PR introduce any user-facing change?
No
How was this patch tested?
New unit test