Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43380][SQL] Fix Avro data type conversion issues to avoid producing incorrect results #41052

Conversation

zeruibao
Copy link
Contributor

@zeruibao zeruibao commented May 5, 2023

What changes were proposed in this pull request?

We introduce the SQLConf spark.sql.legacy.avro.allowReadingWithIncompatibleSchema to prevent reading interval types as date or timestamp types to avoid getting corrupt dates as well as reading decimal types with incorrect precision.

Why are the changes needed?

We found the following issues with open source Avro:

  • Interval types can be read as date or timestamp types that would lead to wildly different results
    For example, Duration.ofDays(1).plusSeconds(1) will be read as 1972-09-27, which is weird.
  • Decimal types can be read with lower precision, that leads to data being read as null instead of suggesting that a wider decimal format should be provided

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit test

@zeruibao zeruibao requested a review from cloud-fan May 5, 2023 06:05
@HyukjinKwon HyukjinKwon changed the title [SPARK-43380] Fix Avro data type conversion issues to avoid producing incorrect results [SPARK-43380][SQL[ Fix Avro data type conversion issues to avoid producing incorrect results May 8, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-43380][SQL[ Fix Avro data type conversion issues to avoid producing incorrect results [SPARK-43380][SQL] Fix Avro data type conversion issues to avoid producing incorrect results May 8, 2023
@gengliangwang
Copy link
Member

Interval types can be read as date or timestamp types that would lead to wildly different results

@zeruibao could you update the PR description about what's the behavior will be?

@gengliangwang
Copy link
Member

Interval types can be read as date or timestamp types that would lead to wildly different results

@zeruibao could you update the PR description about what the behavior will be?

@zeruibao zeruibao requested a review from cloud-fan May 22, 2023 23:05
@zeruibao zeruibao force-pushed the SPARK-43380-fix-avro-data-type-conversion branch from fbd85f4 to 8123398 Compare May 23, 2023 16:27
@cloud-fan
Copy link
Contributor

LGTM if all tests pass

@zeruibao zeruibao requested a review from MaxGekk June 1, 2023 23:12
Copy link
Contributor

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this. I didn't get a chance to review in a lot of detail. Can we also please investigate similar issues in Parquet and ORC as well? I believe Parquet may have some similar issues around interval types

Copy link
Contributor

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this. I didn't get a chance to review in a lot of detail. Can we also please investigate similar issues in Parquet and ORC as well? I believe Parquet may have some similar issues around interval types

@gengliangwang
Copy link
Member

Thanks, merging to master

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
…ucing incorrect results

### What changes were proposed in this pull request?
We introduce the SQLConf `spark.sql.legacy.avro.allowReadingWithIncompatibleSchema` to prevent reading interval types as date or timestamp types to avoid getting corrupt dates as well as reading decimal types with incorrect precision.

### Why are the changes needed?
We found the following issues with open source Avro:

- Interval types can be read as date or timestamp types that would lead to wildly different results
   For example, `Duration.ofDays(1).plusSeconds(1)` will be read as `1972-09-27`, which is weird.
- Decimal types can be read with lower precision, that leads to data being read as `null` instead of suggesting that a wider decimal format should be provided

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New unit test

Closes apache#41052 from zeruibao/SPARK-43380-fix-avro-data-type-conversion.

Lead-authored-by: zeruibao <zerui.bao@databricks.com>
Co-authored-by: zeruibao <125398515+zeruibao@users.noreply.github.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Jun 13, 2023
### What changes were proposed in this pull request?
Fix conversion of Avro logical timestamp type to Long

### Why are the changes needed?
The fix in #41052 cause regression when trying to convert logical timestamp type to Long type and it's not covered by existing test suite. Fix it and add a test for it.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
New unit test

Closes #41521 from zeruibao/SPARK-43380-fix-regression.

Authored-by: zeruibao <zerui.bao@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
### What changes were proposed in this pull request?
Fix conversion of Avro logical timestamp type to Long

### Why are the changes needed?
The fix in apache#41052 cause regression when trying to convert logical timestamp type to Long type and it's not covered by existing test suite. Fix it and add a test for it.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
New unit test

Closes apache#41521 from zeruibao/SPARK-43380-fix-regression.

Authored-by: zeruibao <zerui.bao@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
gengliangwang pushed a commit that referenced this pull request Aug 15, 2023
### What changes were proposed in this pull request?
Revert my last PR #41052 that causes AVRO read performance regression since I change the code structure.

### Why are the changes needed?
Remove performance regression

### How was this patch tested?
Unit test

Closes #42458 from zeruibao/revert-avro-change.

Authored-by: zeruibao <zerui.bao@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang pushed a commit that referenced this pull request Aug 15, 2023
### What changes were proposed in this pull request?
Revert my last PR #41052 that causes AVRO read performance regression since I change the code structure.

### Why are the changes needed?
Remove performance regression

### How was this patch tested?
Unit test

Closes #42458 from zeruibao/revert-avro-change.

Authored-by: zeruibao <zerui.bao@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 46580ab)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
### What changes were proposed in this pull request?
Revert my last PR apache#41052 that causes AVRO read performance regression since I change the code structure.

### Why are the changes needed?
Remove performance regression

### How was this patch tested?
Unit test

Closes apache#42458 from zeruibao/revert-avro-change.

Authored-by: zeruibao <zerui.bao@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
cloud-fan added a commit that referenced this pull request Sep 27, 2023
…ng performance regression

### What changes were proposed in this pull request?
My last PR #41052 causes AVRO read performance regression since I change the code structure. I turn one match case into a nested match case. So I fix the Avro data type conversion issues in anther way to avoid this regression.

Original Change:
We introduce the SQLConf `spark.sql.legacy.avro.allowReadingWithIncompatibleSchema` to prevent reading interval types as date or timestamp types to avoid getting corrupt dates as well as reading decimal types with incorrect precision.

### Why are the changes needed?
We found the following issues with open source Avro:

- Interval types can be read as date or timestamp types that would lead to wildly different results
   For example, `Duration.ofDays(1).plusSeconds(1)` will be read as `1972-09-27`, which is weird.
- Decimal types can be read with lower precision, that leads to data being read as `null` instead of suggesting that a wider decimal format should be provided

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Old unit test

Closes #42503 from zeruibao/SPARK-4380-real-fix-regression.

Lead-authored-by: zeruibao <zerui.bao@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Sep 27, 2023
…ng performance regression

### What changes were proposed in this pull request?
My last PR #41052 causes AVRO read performance regression since I change the code structure. I turn one match case into a nested match case. So I fix the Avro data type conversion issues in anther way to avoid this regression.

Original Change:
We introduce the SQLConf `spark.sql.legacy.avro.allowReadingWithIncompatibleSchema` to prevent reading interval types as date or timestamp types to avoid getting corrupt dates as well as reading decimal types with incorrect precision.

### Why are the changes needed?
We found the following issues with open source Avro:

- Interval types can be read as date or timestamp types that would lead to wildly different results
   For example, `Duration.ofDays(1).plusSeconds(1)` will be read as `1972-09-27`, which is weird.
- Decimal types can be read with lower precision, that leads to data being read as `null` instead of suggesting that a wider decimal format should be provided

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Old unit test

Closes #42503 from zeruibao/SPARK-4380-real-fix-regression.

Lead-authored-by: zeruibao <zerui.bao@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f8c87f0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
### What changes were proposed in this pull request?
Revert my last PR apache#41052 that causes AVRO read performance regression since I change the code structure.

### Why are the changes needed?
Remove performance regression

### How was this patch tested?
Unit test

Closes apache#42458 from zeruibao/revert-avro-change.

Authored-by: zeruibao <zerui.bao@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants