[SPARK-39830][SQL][TESTS] Add a test case to read ORC table that requires type promotion #37800

Closed

@cxzl25 (Contributor) commented Sep 5, 2022

What changes were proposed in this pull request?

Increase ORC test coverage for the case addressed by ORC-1205 ("Size of batches in some ConvertTreeReaders should be ensured before using").

Why are the changes needed?

When Spark reads an ORC file that requires type promotion, an ArrayIndexOutOfBoundsException may be thrown. This has been fixed in ORC 1.7.6 and 1.8.0.

java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
        at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
        at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
        at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
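
As a rough illustration of this kind of failure, here is a minimal Java sketch of a converting reader that fills an intermediate (scratch) vector before converting each value to a string. It is not ORC's code: `EnsureSizeSketch`, `LongVector`, and `convert` are hypothetical names, and the scratch capacity of 1 is an assumption chosen only to make the problem visible. The sketch shows that indexing the scratch vector with a batch size larger than its capacity throws an ArrayIndexOutOfBoundsException, and that growing the vector before use avoids it.

```java
import java.util.Arrays;

public class EnsureSizeSketch {
    // Hypothetical stand-in for a column vector; not ORC's actual class.
    static final class LongVector {
        long[] values;
        boolean[] isNull;

        LongVector(int capacity) {
            values = new long[capacity];
            isNull = new boolean[capacity];
        }

        // Grow the backing arrays if they are smaller than the requested size.
        void ensureSize(int size) {
            if (values.length < size) {
                values = Arrays.copyOf(values, size);
                isNull = Arrays.copyOf(isNull, size);
            }
        }
    }

    // Reads batchSize longs into the scratch vector, then converts them to strings.
    // When ensure is false, the scratch vector is used at whatever capacity it has.
    static String[] convert(long[] source, LongVector scratch, int batchSize, boolean ensure) {
        if (ensure) {
            scratch.ensureSize(batchSize);
        }
        String[] out = new String[batchSize];
        for (int i = 0; i < batchSize; i++) {
            scratch.isNull[i] = false;      // out of bounds once i reaches the capacity
            scratch.values[i] = source[i];
            out[i] = Long.toString(scratch.values[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] source = {10L, 20L, 30L};
        try {
            convert(source, new LongVector(1), source.length, false);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded read failed: " + e.getMessage());
        }
        String[] converted = convert(source, new LongVector(1), source.length, true);
        System.out.println(String.join(",", converted));
    }
}
```

The guarded call differs from the unguarded one only in the `ensureSize` step, which is the general idea the ORC-1205 title describes; the actual change in ORC may differ in detail.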

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

@github-actions github-actions bot added the SQL label Sep 5, 2022
@cxzl25 (Contributor Author) commented Sep 5, 2022

Tested with ORC 1.7.5; the new test fails with the AIOOBE:

sbt:spark-sql> testOnly *.OrcV2QuerySuite -- -t "SPARK-39830: Reading ORC table that requires type promotion may throw AIOOBE"
14:12:00.408 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
[info] - SPARK-39830: Reading ORC table that requires type promotion may throw AIOOBE *** FAILED *** (7 seconds, 530 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (10.18.40.187 executor driver): java.lang.ArrayIndexOutOfBoundsException: 1
[info] 	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
[info] 	at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
[info] 	at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
[info] 	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
[info] 	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
[info] 	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
[info] 	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:207)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:100)
[info] 	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118)
[info] 	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:592)

@AmplabJenkins commented

Can one of the admins verify this patch?

@HyukjinKwon (Member) left a comment

LGTM. cc @dongjoon-hyun

@dongjoon-hyun (Member) left a comment

+1, LGTM.
Thank you, @cxzl25 and @HyukjinKwon!
Merged to master.

@dongjoon-hyun changed the title from "[SPARK-39830][SQL][TESTS] Reading ORC table that requires type promotion may throw AIOOBE" to "[SPARK-39830][SQL][TESTS] Add a test case to read ORC table that requires type promotion" on Sep 6, 2022
@dongjoon-hyun (Member) commented

Do you think you can make a backporting patch to branch-3.3, @cxzl25?

@cxzl25 (Contributor Author) commented Sep 6, 2022

> Do you think you can make a backporting patch to branch-3.3?

In branch-3.3, there is no way to control the batch size used by the ORC writer (spark.sql.orc.columnarWriterBatchSize), which makes this more difficult to test.

I will try it locally first. If it works, I will open a PR against branch-3.3.

Thanks, all!

@dongjoon-hyun (Member) commented

Yes, it does. We don't have SPARK-39381 on branch-3.3. Thank you for taking a look at that, @cxzl25.
