Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27100][SQL][2.4] Use Array instead of Seq in FilePartition to prevent StackOverflowError #24957

Closed
wants to merge 2 commits into from

Conversation

parthchandra
Copy link
Contributor

… prevent StackOverflowError

ShuffleMapTask's partition field is a FilePartition and FilePartition's 'files' field is a Stream$cons which is essentially a linked list. It is therefore serialized recursively.
If the number of files in each partition is, say, 10000 files, recursing into a linked list of length 10000 overflows the stack

The problem is only in Bucketed partitions. The corresponding implementation for non Bucketed partitions uses a StreamBuffer. The proposed change applies the same for Bucketed partitions.

Existing unit tests. Added new unit test. The unit test fails without the patch. Manual testing on dataset used to reproduce the problem.

… prevent `StackOverflowError `

ShuffleMapTask's partition field is a FilePartition and FilePartition's 'files' field is a Stream$cons which is essentially a linked list. It is therefore serialized recursively.
If the number of files in each partition is, say, 10000 files, recursing into a linked list of length 10000 overflows the stack

The problem is only in Bucketed partitions. The corresponding implementation for non Bucketed partitions uses a StreamBuffer. The proposed change applies the same for Bucketed partitions.

Existing unit tests. Added new unit test. The unit test fails without the patch. Manual testing on dataset used to reproduce the problem.

Closes apache#24865 from parthchandra/SPARK-27100.

Lead-authored-by: Parth Chandra <parthc@apple.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
@HyukjinKwon
Copy link
Member

@parthchandra can you open a PR against the current master?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jun 25, 2019

This is a backport of 5a7aa6f . :)

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27100][SQL] Use Array instead of Seq in FilePartition to… [SPARK-27100][SQL][2.4] Use Array instead of Seq in FilePartition to… Jun 25, 2019
* A collection of file blocks that should be read as a single task
* (possibly from multiple partitioned directories).
*/
case class FilePartition(index: Int, files: Array[PartitionedFile])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this PR unexpectedly has a part of [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path. Could you try to remove this change about FilePartition?

cc @dbtsai , @gatorsmile

@SparkQA
Copy link

SparkQA commented Jun 25, 2019

Test build #106859 has finished for PR 24957 at commit a89676b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class FilePartition(index: Int, files: Array[PartitionedFile])

* A collection of file blocks that should be read as a single task
* (possibly from multiple partitioned directories).
*/
case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends RDDPartition
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have those changes as @dongjoon-hyun suggested? Why not just change Seq to Array here?

@parthchandra
Copy link
Contributor Author

@dongjoon-hyun, @dbtsai, you guys are right. I included the FilePartition change as part of resolving the merge conflicts. But I can just make the change that DBTsai has suggested. I'll make the change, re-run the tests and update this PR.

@parthchandra
Copy link
Contributor Author

Updated

Copy link
Member

@dbtsai dbtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Waiting the build.

@gatorsmile gatorsmile changed the title [SPARK-27100][SQL][2.4] Use Array instead of Seq in FilePartition to… [SPARK-27100][SQL][2.4] Use Array instead of Seq in FilePartition to prevent StackOverflowError Jun 25, 2019
// large. tests for the condition where the serialization of such a task may result in a stack
// overflow if the files list is stored in a recursive data structure
// This test is ignored because it takes long to run (~3 min)
ignore("SPARK-27100 stack overflow: read data with large partitions") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trigger the test and see whether it passes in 2.4?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@parthchandra can you confirm this? Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The test passes in 2.4 for both bucketing with/without hive.

@SparkQA
Copy link

SparkQA commented Jun 26, 2019

Test build #106903 has finished for PR 24957 at commit 17e0e2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dbtsai pushed a commit that referenced this pull request Jun 26, 2019
…n` to prevent StackOverflowError

… prevent `StackOverflowError `

ShuffleMapTask's partition field is a FilePartition and FilePartition's 'files' field is a Stream$cons which is essentially a linked list. It is therefore serialized recursively.
If the number of files in each partition is, say, 10000 files, recursing into a linked list of length 10000 overflows the stack

The problem is only in Bucketed partitions. The corresponding implementation for non Bucketed partitions uses a StreamBuffer. The proposed change applies the same for Bucketed partitions.

Existing unit tests. Added new unit test. The unit test fails without the patch. Manual testing on dataset used to reproduce the problem.

Closes #24957 from parthchandra/branch-2.4.

Authored-by: Parth Chandra <parthc@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
@dbtsai
Copy link
Member

dbtsai commented Jun 26, 2019

Thanks. Merged into branch 2.4.

rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
…n` to prevent StackOverflowError

… prevent `StackOverflowError `

ShuffleMapTask's partition field is a FilePartition and FilePartition's 'files' field is a Stream$cons which is essentially a linked list. It is therefore serialized recursively.
If the number of files in each partition is, say, 10000 files, recursing into a linked list of length 10000 overflows the stack

The problem is only in Bucketed partitions. The corresponding implementation for non Bucketed partitions uses a StreamBuffer. The proposed change applies the same for Bucketed partitions.

Existing unit tests. Added new unit test. The unit test fails without the patch. Manual testing on dataset used to reproduce the problem.

Closes apache#24957 from parthchandra/branch-2.4.

Authored-by: Parth Chandra <parthc@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
…n` to prevent StackOverflowError

… prevent `StackOverflowError `

ShuffleMapTask's partition field is a FilePartition and FilePartition's 'files' field is a Stream$cons which is essentially a linked list. It is therefore serialized recursively.
If the number of files in each partition is, say, 10000 files, recursing into a linked list of length 10000 overflows the stack

The problem is only in Bucketed partitions. The corresponding implementation for non Bucketed partitions uses a StreamBuffer. The proposed change applies the same for Bucketed partitions.

Existing unit tests. Added new unit test. The unit test fails without the patch. Manual testing on dataset used to reproduce the problem.

Closes apache#24957 from parthchandra/branch-2.4.

Authored-by: Parth Chandra <parthc@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants