[SPARK-37728][SQL][3.2] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException #35038
Conversation
…can cause ArrayIndexOutOfBoundsException

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch: `orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector's or MapColumnVector's offsets and lengths. However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, after which the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in an ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which resolves the issue.

Why are the changes needed? Bugfix.
Does this PR introduce any user-facing change? No.
How was this patch tested? Pass the CIs with the newly added test case.

Closes apache#35002 from yym1995/fix-nested.

Lead-authored-by: Yimin <yimin.y@outlook.com>
Co-authored-by: Yimin <26797163+yym1995@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
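The stale-reference failure described above can be sketched in a few lines of standalone Java. This is only an illustration with hypothetical class names (ListVector, CachingWrapper, SafeWrapper), not the actual Spark/ORC classes: a wrapper that snapshots the vector's offsets array at construction time keeps reading the old, smaller array after an ensureSize-style reallocation, while a wrapper that dereferences the vector on every call (the approach this PR takes in OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap) always sees the current array.

```java
// Hypothetical, simplified sketch of the bug; not the real ORC/Spark classes.
class ListVector {
    long[] offsets = new long[4];

    void ensureSize(int n) {
        if (n > offsets.length) {
            offsets = new long[n]; // new array object; old references go stale
        }
    }
}

class CachingWrapper {
    private final long[] cachedOffsets; // snapshot taken once, initBatch-style

    CachingWrapper(ListVector v) {
        this.cachedOffsets = v.offsets;
    }

    long offsetAt(int i) {
        return cachedOffsets[i]; // still the old 4-element array after ensureSize
    }
}

class SafeWrapper {
    private final ListVector vector; // keep the vector, not its arrays

    SafeWrapper(ListVector v) {
        this.vector = v;
    }

    long offsetAt(int i) {
        return vector.offsets[i]; // reads the current array on every call
    }
}

public class StaleReferenceDemo {
    public static void main(String[] args) {
        ListVector v = new ListVector();
        CachingWrapper caching = new CachingWrapper(v);
        SafeWrapper safe = new SafeWrapper(v);

        v.ensureSize(8); // reallocates offsets; the caching wrapper is now stale

        try {
            caching.offsetAt(6);
            System.out.println("caching: ok");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caching: ArrayIndexOutOfBoundsException");
        }
        System.out.println("safe: " + safe.offsetAt(6)); // prints "safe: 0"
    }
}
```

The caching wrapper throws exactly the ArrayIndexOutOfBoundsException described in the JIRA, while the safe wrapper reads index 6 of the freshly allocated 8-element array without issue.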
LGTM
Can one of the admins verify this patch?
@yym1995 . The backporting patch should have its own PR description because it will become the commit log. In general, each PR should be complete by itself instead of pointing to another PR.
This is the patch on branch-3.2 for #35002. See the description in the other PR.
Let me revise your PR description.
Why do you propose a different PR from the original, @yym1995 ?
The original PR tests both v1 and v2, while this PR seems not to. Please restore the test coverage like the original PR.
@@ -644,6 +645,28 @@ class OrcSourceSuite extends OrcSuite with SharedSparkSession {
    }
  }

  test("SPARK-37728: Reading nested columns with ORC vectorized reader should not " +
What I meant was that this should be in OrcQuerySuite instead of OrcSourceSuite to provide test coverage for v1 and v2.
> What I meant was that this should be in OrcQuerySuite instead of OrcSourceSuite to provide test coverage for v1 and v2.
This PR is a fix for [SPARK-34862]. The unit test "SPARK-34862: Support ORC vectorized reader for nested column" is in OrcSourceSuite.scala on branch-3.2. That's why I put my unit test in OrcSourceSuite.scala. Do you think I should put it in OrcQuerySuite.scala?
- First of all, this PR is a fix for SPARK-37728. Please don't mix it with the other JIRA, even though that one is the root cause of yours.
- Second, yes. It's per my previous comment, "Please restore the test coverage like the original PR". Do you have another reason to remove test coverage?
> - First of all, this PR is a fix for SPARK-37728. Please don't mix it with the other JIRA, even though that one is the root cause of yours.
> - Second, yes. It's per my previous comment, "Please restore the test coverage like the original PR". Do you have another reason to remove test coverage?
OK. Now I moved this unit test case to OrcQuerySuite.scala.
Yeah, I confirm we need to create the unit test in OrcQuerySuite.scala instead. Just for context, I moved the unit test from OrcSourceSuite to OrcQuerySuite when adding support for DSv2 in #33626. Thanks to @dongjoon-hyun for pointing it out.
Could you re-trigger the GitHub Actions job to make sure it passes?
cc @williamhyun , too
The failure looks relevant. Could you check if it passes in your environment, @yym1995 ?
*** 1 TEST FAILED ***
Failed: Total 9473, Failed 1, Errors 0, Passed 9472, Ignored 29
Failed tests:
org.apache.spark.sql.execution.datasources.orc.OrcV2QuerySuite
It fails in my local environment, too.
df.write.format("orc").save(path)

withSQLConf(SQLConf.ORC_VECTORIZED_READER_NESTED_COLUMN_ENABLED.key -> "true",
  SQLConf.WHOLESTAGE_MAX_NUM_FIELDS.key -> "10000") {
Indentation? We need two more spaces.
val vectorizationEnabled = readDf.queryExecution.executedPlan.find {
  case scan @ (_: FileSourceScanExec | _: BatchScanExec) => scan.supportsColumnar
  case _ => false
}.isDefined
Why did you remove assert(vectorizationEnabled) here, differently from the master branch? It looks like a removal of important test coverage again.
Since this test case is about "Reading nested columns with ORC vectorized reader ...", you should not remove it. Did I miss something here?
> Why did you remove assert(vectorizationEnabled) here, differently from the master branch? It looks like a removal of important test coverage again. Since this test case is about "Reading nested columns with ORC vectorized reader ...", you should not remove it. Did I miss something here?
Because when testing with OrcV2QuerySuite, method supportColumnarReads in https://github.com/apache/spark/blob/branch-3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala will be called, and resultSchema.forall(_.dataType.isInstanceOf[AtomicType]) will return false in that case. Therefore, I removed assert(vectorizationEnabled).

To fix this issue, I think #33626 should also be backported to branch-3.2.
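To make the gating concrete: on branch-3.2 the v2 ORC reader offers columnar reads only when every top-level field is an atomic type, so any array or map column falls back to the row-based reader and the vectorization assertion fails. Here is a hedged, plain-Java sketch of that predicate; the Kind enum and method below are illustrative stand-ins, not real Spark APIs (the actual check is the Scala expression resultSchema.forall(_.dataType.isInstanceOf[AtomicType]) quoted above).

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model of the branch-3.2 v2 gating logic: columnar reads are
// only offered when every top-level field is an atomic type.
public class ColumnarSupportDemo {
    enum Kind { ATOMIC, ARRAY, MAP, STRUCT }

    // Mirrors resultSchema.forall(_.dataType.isInstanceOf[AtomicType])
    static boolean supportColumnarReads(List<Kind> resultSchema) {
        return resultSchema.stream().allMatch(k -> k == Kind.ATOMIC);
    }

    public static void main(String[] args) {
        // Flat schema: vectorization allowed.
        System.out.println(supportColumnarReads(
            Arrays.asList(Kind.ATOMIC, Kind.ATOMIC))); // prints "true"
        // Nested schema (array column): vectorization disabled, which is
        // why assert(vectorizationEnabled) fails under OrcV2QuerySuite.
        System.out.println(supportColumnarReads(
            Arrays.asList(Kind.ATOMIC, Kind.ARRAY))); // prints "false"
    }
}
```

This is why the test passes under the v1 suite but not the v2 suite until #33626 is also backported.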
Got it. Thank you for the background, @yym1995 .
+1, LGTM. Thank you so much, @yym1995 and @c21 .
Merged to branch-3.2 for Apache Spark 3.2.1.
cc @huaxingao since she is the release manager of Apache Spark 3.2.1.
What changes were proposed in this pull request?
This is a backport of #35002 .
When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:
orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);
When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.
However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.
This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.
Why are the changes needed?
Bugfix
Does this PR introduce any user-facing change?
No
How was this patch tested?
Pass the CIs with the newly added test case.