[SPARK-39830][SQL][TESTS] Add a test case to read ORC table that requires type promotion #37800
Conversation
Tested with ORC 1.7.5
14:12:00.408 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
[info] - SPARK-39830: Reading ORC table that requires type promotion may throw AIOOBE *** FAILED *** (7 seconds, 530 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (10.18.40.187 executor driver): java.lang.ArrayIndexOutOfBoundsException: 1
[info] at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387)
[info] at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740)
[info] at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069)
[info] at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
[info] at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
[info] at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
[info] at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
[info] at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:207)
[info] at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:100)
[info] at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118)
[info] at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:592)
Can one of the admins verify this patch?
LGTM. cc @dongjoon-hyun
+1, LGTM.
Thank you, @cxzl25 and @HyukjinKwon !
Merged to master.
Do you think you can make a backporting patch to branch-3.3, @cxzl25 ?
In 3.3, we have no way to control the batch size. I will try it locally first, and if possible, I will make a PR based on the 3.3 branch. Thanks all!
Yes, it does. We don't have SPARK-39381 there. Thank you for taking a look at that, @cxzl25 .
What changes were proposed in this pull request?
Increase ORC test coverage by adding a regression test for ORC-1205 ("Size of batches in some ConvertTreeReaders should be ensured before using").
Why are the changes needed?
When Spark reads an ORC table that requires type promotion, an ArrayIndexOutOfBoundsException may be thrown. This was fixed on the ORC side in versions 1.7.6 and 1.8.0.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a unit test.
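For illustration, a standalone sketch of the scenario the test covers: write an ORC file whose column is a bigint, then read it back with the column promoted to string, which exercises ORC's ConvertTreeReader path (StringGroupFromAnyIntegerTreeReader in the stack trace above). This is not the merged test itself; the path, row count, and table layout here are illustrative, and it assumes a local Spark session with an ORC version affected by ORC-1205 to actually reproduce the AIOOBE.

```scala
import org.apache.spark.sql.SparkSession

object OrcTypePromotionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("orc-type-promotion-repro")
      .getOrCreate()

    // Hypothetical output path for this sketch.
    val path = "/tmp/orc_type_promotion"

    // Write more rows than one ORC vector batch (default 1024) so the
    // reader must refill batches while converting.
    spark.range(0, 4096).toDF("id").write.mode("overwrite").orc(path)

    // Read the bigint column back as string. This forces the ORC
    // integer-to-string-group conversion reader, which is the code
    // path that threw ArrayIndexOutOfBoundsException before ORC 1.7.6.
    val df = spark.read.schema("id STRING").orc(path)
    assert(df.count() == 4096)

    spark.stop()
  }
}
```

On an ORC version with the ORC-1205 fix (1.7.6+/1.8.0+), the read completes; on affected versions the count can abort with the AIOOBE shown in the log above.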