Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-38018][SQL] Fix ColumnVectorUtils.populate to handle CalendarIntervalType correctly #35314

Closed
wants to merge 1 commit into from

Conversation

c21
Copy link
Contributor

@c21 c21 commented Jan 25, 2022

What changes were proposed in this pull request?

ColumnVectorUtils.populate() does not handle CalendarInterval type correctly. The CalendarInterval type is in the format of (months: int, days: int, microseconds: long). However, the function above misses days field, and sets microseconds field in wrong position.

ColumnVectorUtils.populate() is used by Parquet and ORC vectorized reader to read partition column. So technically Spark can potentially produce wrong result if reading table with CalendarInterval partition column. However I also notice Spark explicitly disallows writing data with CalendarInterval type, so it might not be a big deal for users. But it's worth to fix anyway.

Caveat: I found the bug when reading through the related code path, but I never encountered the issue in production for partition column with CalendarInterval type. I think it should be an obvious fix unless anyone more experienced could find some more historical context. The code was introduced a long time ago where I couldn't find any more info why it was implemented as it is (#11435)

Why are the changes needed?

To fix potential correctness issue.

Does this PR introduce any user-facing change?

No but fix the exiting correctness issue when reading partition column with CalendarInterval type.

How was this patch tested?

Added unit test in ColumnVectorSuite.scala.
Verified the unit test failed with exception below without this PR:

java.lang.NullPointerException was thrown.
java.lang.NullPointerException
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:345)
	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:94)
	at org.apache.spark.sql.execution.vectorized.ColumnVectorSuite.$anonfun$new$99(ColumnVectorSuite.scala:613)

@github-actions github-actions bot added the SQL label Jan 25, 2022
@c21
Copy link
Contributor Author

c21 commented Jan 25, 2022

@cloud-fan could you help take a look when you have time? Thanks.

@yikf
Copy link
Contributor

yikf commented Jan 25, 2022

@c21 We changed the number of CalendarInterval fields from 2(months and microseconds) to 3(months, days and microseconds) (extra day had been added) in #26134, so this change is reasonable for me

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thanks, @c21 .

@cloud-fan
Copy link
Contributor

cloud-fan commented Jan 25, 2022

thanks, merging to master!

@cloud-fan cloud-fan closed this in 1bda48b Jan 25, 2022
@c21
Copy link
Contributor Author

c21 commented Jan 25, 2022

Thank you @cloud-fan, @dongjoon-hyun and @yikf for review!

@c21 c21 deleted the vector-fix branch January 25, 2022 18:49
@c21
Copy link
Contributor Author

c21 commented Jan 25, 2022

btw @cloud-fan & @dongjoon-hyun - do we want to add this to older branches (3.2-) ?

@dongjoon-hyun
Copy link
Member

+1 for backporting, @c21 . Could you make a backporting PR to the applicable branches?

cloud-fan pushed a commit that referenced this pull request Jul 7, 2022
…ndarIntervalType correctly

### What changes were proposed in this pull request?

This is a backport of #35314 to branch 3.2. See that original PR for context.

### Why are the changes needed?

To fix potential correctness issue.

### Does this PR introduce _any_ user-facing change?

No but fix the exiting correctness issue when reading partition column with CalendarInterval type.

### How was this patch tested?

Added unit test in `ColumnVectorSuite.scala`.

Closes #37114 from c21/branch-3.2.

Authored-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…ndarIntervalType correctly

### What changes were proposed in this pull request?

This is a backport of apache#35314 to branch 3.2. See that original PR for context.

### Why are the changes needed?

To fix potential correctness issue.

### Does this PR introduce _any_ user-facing change?

No but fix the exiting correctness issue when reading partition column with CalendarInterval type.

### How was this patch tested?

Added unit test in `ColumnVectorSuite.scala`.

Closes apache#37114 from c21/branch-3.2.

Authored-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c5983c1)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants