Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37728][SQL][3.2] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException #35038

Closed
wants to merge 8 commits into from

Conversation

yimin-yang
Copy link
Contributor

@yimin-yang yimin-yang commented Dec 28, 2021

What changes were proposed in this pull request?

This is a backport of #35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

Why are the changes needed?

Bugfix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the CIs with the newly added test case.

…can cause ArrayIndexOutOfBoundsException

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

Bugfix

No

Pass the CIs with the newly added test case.

Closes apache#35002 from yym1995/fix-nested.

Lead-authored-by: Yimin <yimin.y@outlook.com>
Co-authored-by: Yimin <26797163+yym1995@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@github-actions github-actions bot added the SQL label Dec 28, 2021
Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yimin-yang
Copy link
Contributor Author

cc @dongjoon-hyun

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yym1995 . The backporting patch should have its own PR description because it will be a commit log. In general, each PR had better be complete by itself instead pointing another PR.

This is the patch on branch-3.2 for #35002. See the description in the other PR.

@dongjoon-hyun
Copy link
Member

Let me revise your PR description.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you propose a different PR from the original, @yym1995 ?

This original PR tests in both v1 and v2 while this PR seems not. Please recover the test coverage like the original PR.

@@ -644,6 +645,28 @@ class OrcSourceSuite extends OrcSuite with SharedSparkSession {
}
}

test("SPARK-37728: Reading nested columns with ORC vectorized reader should not " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was that this should be in OrcQuerySuite instead of OrcSourceSuite to provide a test coverage for v1 and v2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was that this should be in OrcQuerySuite instead of OrcSourceSuite to provide a test coverage for v1 and v2.

This PR is a fix for [SPARK-34862]. The unit test "SPARK-34862: Support ORC vectorized reader for nested column" is in OrcSourceSuite.scala on branch-3.2. That's why I put my unit test in OrcSourceSuite.scala. Do you think I should put it in OrcQuerySuite.scala?

Copy link
Member

@dongjoon-hyun dongjoon-hyun Dec 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • First of all, this PR is a fix for SPARK-37728. Please don't make it mix with the other JIRA although that is the root cause of your JIRA.
  • Second, yes. It's for my previous comment, "Please recover the test coverage like the original PR". Do you have another reason to remove a test coverage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • First of all, this PR is a fix for SPARK-37728. Please don't make it mix with the other JIRA although that is the root cause of your JIRA.
  • Second, yes. It's for my previous comment, "Please recover the test coverage like the original PR". Do you have another reason to remove a test coverage?

OK. Now I moved this unit test case to OrcQuerySuite.scala.

Copy link
Contributor

@c21 c21 Dec 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I confirm we need to create unit test in OrcQuerySuite.scala instead. Just for context, I moved the unit test from OrcSourceSuite to OrcQuerySuite when adding support for DSv2 - #33626 . Thanks for @dongjoon-hyun pointing it out.

@dongjoon-hyun
Copy link
Member

Could you re-trigger GitHub Action job to make it sure it pass?

@dongjoon-hyun
Copy link
Member

cc @williamhyun , too

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure looks relevant. Could you check if it passes in your environment, @yym1995 ?

*** 1 TEST FAILED ***�
Failed: Total 9473, Failed 1, Errors 0, Passed 9472, Ignored 29�
Failed tests:�
org.apache.spark.sql.execution.datasources.orc.OrcV2QuerySuite�

@dongjoon-hyun
Copy link
Member

It fails in my local environment, too.

[info] - SPARK-37728: Reading nested columns with ORC vectorized reader should not cause ArrayIndexOutOfBoundsException *** FAILED *** (1 second, 522 milliseconds)
[info]   vectorizationEnabled was false (OrcQuerySuite.scala:734)
[info]   org.scalatest.exceptions.TestFailedException:

df.write.format("orc").save(path)

withSQLConf(SQLConf.ORC_VECTORIZED_READER_NESTED_COLUMN_ENABLED.key -> "true",
SQLConf.WHOLESTAGE_MAX_NUM_FIELDS.key -> "10000") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation? We need two more spaces.

val vectorizationEnabled = readDf.queryExecution.executedPlan.find {
case scan @ (_: FileSourceScanExec | _: BatchScanExec) => scan.supportsColumnar
case _ => false
}.isDefined
Copy link
Member

@dongjoon-hyun dongjoon-hyun Dec 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove, assert(vectorizationEnabled), here, which is differently from the master branch? It looks like a removal of important test coverage again.

Since this test case is about Reading nested columns with ORC vectorized reader ..., you should not remove it. Did I miss something here?

Copy link
Contributor Author

@yimin-yang yimin-yang Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove, assert(vectorizationEnabled), here, which is differently from the master branch? It looks like a removal of important test coverage again.

Since this test case is about Reading nested columns with ORC vectorized reader ..., you should not remove it. Did I miss something here?

Because when testing with OrcV2QuerySuite, method supportColumnarReads in
https://github.com/apache/spark/blob/branch-3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala will be called. resultSchema.forall(_.dataType.isInstanceOf[AtomicType]) will return false in that case. Therefore, I removed assert(vectorizationEnabled).

To fix this issue, I think #33626 should also be backported to branch-3.2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thank you for the background, @yym1995 .

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you so much, @yym1995 and @c21 .
Merged to branch-3.2 for Apache Spark 3.2.1.

cc @huaxingao since she is the release manager of Apache Spark 3.2.1.

dongjoon-hyun pushed a commit that referenced this pull request Jan 4, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of #35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <yimin.y@outlook.com>
Co-authored-by: Yimin Yang <26797163+yym1995@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
catalinii pushed a commit to lyft/spark that referenced this pull request Feb 22, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of apache#35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <yimin.y@outlook.com>
Co-authored-by: Yimin Yang <26797163+yym1995@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
catalinii pushed a commit to lyft/spark that referenced this pull request Mar 4, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of apache#35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <yimin.y@outlook.com>
Co-authored-by: Yimin Yang <26797163+yym1995@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of apache#35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <yimin.y@outlook.com>
Co-authored-by: Yimin Yang <26797163+yym1995@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…ader can cause ArrayIndexOutOfBoundsException

### What changes were proposed in this pull request?

This is a backport of apache#35002 .

When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch:

`orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);`

When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths.

However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException.

This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#35038 from yym1995/branch-3.2.

Lead-authored-by: Yimin <yimin.y@outlook.com>
Co-authored-by: Yimin Yang <26797163+yym1995@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5f9b92c)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants