[SPARK-39830][SQL][TESTS] Add a test case to read ORC table that requires type promotion #37800

cxzl25 · 2022-09-05T06:27:53Z

What changes were proposed in this pull request?

Increase ORC test coverage by adding a test case for ORC-1205 ("Size of batches in some ConvertTreeReaders should be ensured before using").

Why are the changes needed?

When Spark reads an ORC file that requires type promotion, an ArrayIndexOutOfBoundsException may be thrown. This has been fixed in ORC 1.7.6 and 1.8.0.

java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
        at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
        at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
        at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
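
The failure mode named in the ORC-1205 title can be illustrated with a small, self-contained model. This is a simplified sketch, not ORC's actual code: `LongVector`, `StringFromLongReader`, and their methods are hypothetical names invented for illustration. It assumes a converting reader fills a scratch vector of the file's integer type before turning each value into a string; if that scratch vector is smaller than the requested batch and is not resized first, the fill runs past the end of the array.

```java
import java.util.Arrays;

public class ConvertReaderSketch {

    /** Hypothetical stand-in for a column vector of longs. */
    static final class LongVector {
        long[] vector;

        LongVector(int capacity) {
            vector = new long[capacity];
        }

        /** Grow the backing array if it cannot hold `size` values. */
        void ensureSize(int size) {
            if (vector.length < size) {
                vector = Arrays.copyOf(vector, size);
            }
        }
    }

    /** Hypothetical reader that promotes integer values to strings. */
    static final class StringFromLongReader {
        private final LongVector scratch;
        private final boolean ensureSizeFirst;
        private long nextValue = 1;

        StringFromLongReader(int scratchCapacity, boolean ensureSizeFirst) {
            this.scratch = new LongVector(scratchCapacity);
            this.ensureSizeFirst = ensureSizeFirst;
        }

        String[] nextVector(int batchSize) {
            if (ensureSizeFirst) {
                // The guard: size the scratch vector before using it.
                scratch.ensureSize(batchSize);
            }
            // Simulate the underlying integer reader filling the scratch vector.
            for (int i = 0; i < batchSize; i++) {
                scratch.vector[i] = nextValue++;
            }
            // Convert each integer to its string form.
            String[] out = new String[batchSize];
            for (int i = 0; i < batchSize; i++) {
                out[i] = Long.toString(scratch.vector[i]);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        try {
            // Scratch capacity 1, batch of 2, no guard: index 1 is out of bounds.
            new StringFromLongReader(1, false).nextVector(2);
            System.out.println("unguarded: no exception");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded: ArrayIndexOutOfBoundsException");
        }
        // Same sizes with the guard enabled: the read succeeds.
        String[] ok = new StringFromLongReader(1, true).nextVector(2);
        System.out.println("guarded: " + Arrays.toString(ok));
    }
}
```

In this model the unguarded reader fails on the second element and the guarded reader returns both values. Whether the real readers hit the same sizes is not something this sketch establishes.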

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

cxzl25 · 2022-09-05T06:29:26Z

Tested with ORC 1.7.5

sbt:spark-sql> testOnly *.OrcV2QuerySuite -- -t "SPARK-39830: Reading ORC table that requires type promotion may throw AIOOBE"

14:12:00.408 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
[info] - SPARK-39830: Reading ORC table that requires type promotion may throw AIOOBE *** FAILED *** (7 seconds, 530 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (10.18.40.187 executor driver): java.lang.ArrayIndexOutOfBoundsException: 1
[info] 	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
[info] 	at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
[info] 	at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
[info] 	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
[info] 	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
[info] 	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
[info] 	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:207)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:100)
[info] 	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118)
[info] 	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:592)

AmplabJenkins · 2022-09-06T01:02:38Z

Can one of the admins verify this patch?

HyukjinKwon

LGTM. cc @dongjoon-hyun

dongjoon-hyun

+1, LGTM.
Thank you, @cxzl25 and @HyukjinKwon !
Merged to master.

dongjoon-hyun · 2022-09-06T05:37:29Z

Do you think you can make a backporting patch to branch-3.3, @cxzl25 ?

cxzl25 · 2022-09-06T05:55:26Z

> Do you think you can make a backporting patch to branch-3.3 ?

In branch-3.3 there is no way to control the batch size used by the ORC writer (spark.sql.orc.columnarWriterBatchSize), which makes this harder to test.

I will try it locally first; if it works, I will open a PR against branch-3.3.

Thanks all !

dongjoon-hyun · 2022-09-06T05:58:06Z

Yes, it does. We don't have SPARK-39381 there. Thank you for taking a look at that, @cxzl25 .

add test

39dd832

github-actions bot added the SQL label Sep 5, 2022

HyukjinKwon approved these changes Sep 6, 2022

dongjoon-hyun approved these changes Sep 6, 2022

dongjoon-hyun changed the title ~~[SPARK-39830][SQL][TESTS] Reading ORC table that requires type promotion may throw AIOOBE~~ [SPARK-39830][SQL][TESTS] Add a test case to read ORC table that requires type promotion Sep 6, 2022

dongjoon-hyun added the TESTS label Sep 6, 2022

dongjoon-hyun closed this in 19b1780 Sep 6, 2022

cxzl25 mentioned this pull request Sep 6, 2022

[SPARK-39830][SQL][TESTS][3.3] Add a test case to read ORC table that requires type promotion #37808

Closed
