[SPARK-37728][SQL][3.2] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException #35038

yimin-yang · 2021-12-28T08:02:50Z

What changes were proposed in this pull request?

This is a backport of #35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with references to the ListColumnVector's or MapColumnVector's offsets and lengths arrays.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called. After that, the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects, while the wrapper still holds the old ones. Indexing the old arrays could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which resolves this issue.
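
The failure mode described above can be illustrated with a minimal, self-contained sketch. It is not the Spark or ORC code: ListVector, CachingWrapper and DelegatingWrapper are hypothetical stand-ins, and the single-argument ensureSize here is a simplification for illustration only. It assumes that growing the vector replaces the offsets and lengths arrays with larger copies.

```java
import java.util.Arrays;

public class StaleOffsetsSketch {
    // Hypothetical stand-in for a list column vector whose arrays can be
    // replaced by larger ones when more rows are needed.
    static class ListVector {
        long[] offsets;
        long[] lengths;

        ListVector(int size) {
            offsets = new long[size];
            lengths = new long[size];
        }

        // Simplified: grow by allocating new arrays and copying the old data.
        void ensureSize(int size) {
            if (size > offsets.length) {
                offsets = Arrays.copyOf(offsets, size);
                lengths = Arrays.copyOf(lengths, size);
            }
        }
    }

    // Pattern before the fix: the array references are captured once.
    static class CachingWrapper {
        private final long[] offsets;

        CachingWrapper(ListVector v) {
            this.offsets = v.offsets;
        }

        long offsetAt(int row) {
            return offsets[row];
        }
    }

    // Pattern after the fix: the arrays are read from the vector on every call.
    static class DelegatingWrapper {
        private final ListVector vector;

        DelegatingWrapper(ListVector v) {
            this.vector = v;
        }

        long offsetAt(int row) {
            return vector.offsets[row];
        }
    }

    public static void main(String[] args) {
        ListVector v = new ListVector(2);
        CachingWrapper stale = new CachingWrapper(v);
        DelegatingWrapper fresh = new DelegatingWrapper(v);

        // A later batch needs more rows, so the vector swaps in new arrays.
        v.ensureSize(4);
        v.offsets[3] = 42L;

        System.out.println(fresh.offsetAt(3)); // prints 42
        try {
            stale.offsetAt(3); // still indexes the old array of length 2
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("stale wrapper failed: " + e.getMessage());
        }
    }
}
```

The delegating variant costs one extra field read per access, but it cannot go stale when the underlying arrays are reallocated.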

Why are the changes needed?

Bugfix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the CIs with the newly added test case.

c21

LGTM

yimin-yang · 2021-12-28T11:24:56Z

cc @dongjoon-hyun

AmplabJenkins · 2021-12-28T15:31:14Z

Can one of the admins verify this patch?

dongjoon-hyun

@yym1995 . The backporting patch should have its own PR description because it will become the commit log. In general, each PR should be complete by itself instead of pointing to another PR.

This is the patch on branch-3.2 for #35002. See the description in the other PR.

dongjoon-hyun · 2021-12-28T21:02:35Z

Let me revise your PR description.

dongjoon-hyun

Why do you propose a different PR from the original, @yym1995 ?

The original PR tests both v1 and v2, while this PR does not seem to. Please restore the test coverage to match the original PR.

dongjoon-hyun · 2021-12-29T05:40:45Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala

@@ -644,6 +645,28 @@ class OrcSourceSuite extends OrcSuite with SharedSparkSession {
    }
  }

+  test("SPARK-37728: Reading nested columns with ORC vectorized reader should not " +


What I meant was that this should be in OrcQuerySuite instead of OrcSourceSuite to provide test coverage for both v1 and v2.

This PR is a fix for [SPARK-34862]. The unit test "SPARK-34862: Support ORC vectorized reader for nested column" is in OrcSourceSuite.scala on branch-3.2. That's why I put my unit test in OrcSourceSuite.scala. Do you think I should put it in OrcQuerySuite.scala?

First of all, this PR is a fix for SPARK-37728. Please don't mix it with the other JIRA, even though that is the root cause of your JIRA.

Second, yes. It's for my previous comment, "Please recover the test coverage like the original PR". Do you have another reason to remove test coverage?

OK. I have now moved this unit test case to OrcQuerySuite.scala.

Yeah, I confirm we need to create the unit test in OrcQuerySuite.scala instead. Just for context, I moved the unit test from OrcSourceSuite to OrcQuerySuite when adding support for DSv2 - #33626 . Thanks to @dongjoon-hyun for pointing it out.

dongjoon-hyun · 2021-12-29T19:05:23Z

Could you re-trigger the GitHub Action job to make sure it passes?

dongjoon-hyun · 2021-12-30T00:25:39Z

cc @williamhyun , too

dongjoon-hyun

The failure looks relevant. Could you check if it passes in your environment, @yym1995 ?

*** 1 TEST FAILED ***
Failed: Total 9473, Failed 1, Errors 0, Passed 9472, Ignored 29
Failed tests:
org.apache.spark.sql.execution.datasources.orc.OrcV2QuerySuite

dongjoon-hyun · 2021-12-30T21:47:50Z

It fails in my local environment, too.

[info] - SPARK-37728: Reading nested columns with ORC vectorized reader should not cause ArrayIndexOutOfBoundsException *** FAILED *** (1 second, 522 milliseconds)
[info]   vectorizationEnabled was false (OrcQuerySuite.scala:734)
[info]   org.scalatest.exceptions.TestFailedException:

dongjoon-hyun · 2021-12-31T08:25:55Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala

+      df.write.format("orc").save(path)
+
+      withSQLConf(SQLConf.ORC_VECTORIZED_READER_NESTED_COLUMN_ENABLED.key -> "true",
+        SQLConf.WHOLESTAGE_MAX_NUM_FIELDS.key -> "10000") {


Indentation? We need two more spaces.

dongjoon-hyun · 2021-12-31T18:54:41Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala

+        val vectorizationEnabled = readDf.queryExecution.executedPlan.find {
+          case scan @ (_: FileSourceScanExec | _: BatchScanExec) => scan.supportsColumnar
+          case _ => false
+        }.isDefined


Why did you remove assert(vectorizationEnabled) here? That differs from the master branch, and it looks like another removal of important test coverage.

Since this test case is about Reading nested columns with ORC vectorized reader ..., you should not remove it. Did I miss something here?

Because when testing with OrcV2QuerySuite, method supportColumnarReads in https://github.com/apache/spark/blob/branch-3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala is called, and resultSchema.forall(_.dataType.isInstanceOf[AtomicType]) returns false in that case, so the v2 scan is not columnar for a nested schema. Therefore, I removed assert(vectorizationEnabled).

To fix this issue, I think #33626 should also be backported to branch-3.2.
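
To illustrate why the assertion cannot hold on the v2 path here, the following is a rough sketch of a gate of the form resultSchema.forall(isAtomic). The Kind enum and the method names are hypothetical and greatly simplified; they are not Spark's DataType hierarchy, and any other conditions the real method may check are left out.

```java
import java.util.List;

public class ColumnarGateSketch {
    // Hypothetical, simplified type model used only for this illustration.
    enum Kind { INT, STRING, ARRAY, MAP, STRUCT }

    // In this sketch only INT and STRING count as atomic (non-nested) types.
    static boolean isAtomic(Kind kind) {
        return kind == Kind.INT || kind == Kind.STRING;
    }

    // Models a check equivalent to schema.forall(isAtomic): a single nested
    // column is enough to turn columnar reads off for the whole scan.
    static boolean allColumnsAtomic(List<Kind> schema) {
        return schema.stream().allMatch(ColumnarGateSketch::isAtomic);
    }

    public static void main(String[] args) {
        System.out.println(allColumnsAtomic(List.of(Kind.INT, Kind.STRING))); // prints true
        System.out.println(allColumnsAtomic(List.of(Kind.INT, Kind.ARRAY)));  // prints false
    }
}
```

With a gate like this, a test that reads array or map columns would see a non-columnar scan on the v2 path regardless of the nested-column setting, which matches the failure reported for OrcV2QuerySuite.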

Got it. Thank you for the background, @yym1995 .

dongjoon-hyun

+1, LGTM. Thank you so much, @yym1995 and @c21 .
Merged to branch-3.2 for Apache Spark 3.2.1.

cc @huaxingao since she is the release manager of Apache Spark 3.2.1.

…ader can cause ArrayIndexOutOfBoundsException ### What changes were proposed in this pull request? This is a backport of #35002 . When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch: `orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);` When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths. However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException. This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue. ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs with the newly added test case. Closes #35038 from yym1995/branch-3.2. Lead-authored-by: Yimin <yimin.y@outlook.com> Co-authored-by: Yimin Yang <26797163+yym1995@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

(cherry picked from commit 5f9b92c)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

github-actions bot added the SQL label Dec 28, 2021

c21 approved these changes Dec 28, 2021

dongjoon-hyun reviewed Dec 28, 2021

dongjoon-hyun requested changes Dec 28, 2021

yimin-yang and others added 2 commits December 29, 2021 11:08
update (1708b68)
Merge branch 'apache:branch-3.2' into branch-3.2 (6dd4f46)

yimin-yang requested a review from dongjoon-hyun December 29, 2021 03:12

dongjoon-hyun reviewed Dec 29, 2021

yimin-yang added 2 commits December 29, 2021 19:40
update (de4fd40)
Merge branch 'branch-3.2' of github.com:yym1995/spark into branch-3.2 (62b98ce)

update (5f2e85d)

dongjoon-hyun reviewed Dec 30, 2021

update (be7a155)

dongjoon-hyun reviewed Dec 31, 2021

update (294b02b)

yimin-yang force-pushed the branch-3.2 branch from c7ff56e to 294b02b Compare December 31, 2021 13:32

dongjoon-hyun requested changes Dec 31, 2021

dongjoon-hyun approved these changes Jan 4, 2022

dongjoon-hyun closed this Jan 4, 2022
