[SPARK-27100][SQL][2.4] Use `Array` instead of `Seq` in `FilePartition` to prevent StackOverflowError #24957
Conversation
… prevent `StackOverflowError`

ShuffleMapTask's partition field is a FilePartition, and FilePartition's `files` field is a Stream$cons, which is essentially a linked list and is therefore serialized recursively. If a partition contains, say, 10000 files, recursing into a linked list of length 10000 overflows the stack.

The problem occurs only for bucketed partitions; the corresponding implementation for non-bucketed partitions uses a StreamBuffer. The proposed change applies the same approach to bucketed partitions.

Existing unit tests. Added a new unit test; it fails without the patch. Manual testing on the dataset used to reproduce the problem.

Closes apache#24865 from parthchandra/SPARK-27100.

Lead-authored-by: Parth Chandra <parthc@apple.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
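For context, here is a minimal, self-contained sketch (not from the PR) of why a long `Stream` is risky to serialize: plain Java serialization walks the cons cells recursively, so stack depth grows with the number of elements, while an `Array` of the same size is written iteratively.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object StreamVsArraySerialization {
  // Serialize an object with plain Java serialization and return the byte count.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val n = 100000

    // An Array is written as a flat block; its size does not affect stack depth.
    println(s"Array:  ${serializedSize((1 to n).toArray)} bytes")

    // A fully evaluated Stream is a chain of Cons cells; ObjectOutputStream
    // recurses once per cell and can throw StackOverflowError for large n.
    val stream = (1 to n).toStream
    stream.force // evaluate every cell so only the linked structure is serialized
    try println(s"Stream: ${serializedSize(stream)} bytes")
    catch {
      case _: StackOverflowError =>
        println(s"Stream of $n elements overflowed the stack during serialization")
    }
  }
}
```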
@parthchandra can you open a PR against the current master?
This is a backport of 5a7aa6f. :)
 * A collection of file blocks that should be read as a single task
 * (possibly from multiple partitioned directories).
 */
case class FilePartition(index: Int, files: Array[PartitionedFile])
It seems that this PR unexpectedly includes part of [SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path. Could you try to remove this change to FilePartition?
cc @dbtsai, @gatorsmile
Test build #106859 has finished for PR 24957 at commit
 * A collection of file blocks that should be read as a single task
 * (possibly from multiple partitioned directories).
 */
case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends RDDPartition
Do we need those changes, as @dongjoon-hyun suggested? Why not just change `Seq` to `Array` here?
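For readers skimming the thread, a self-contained sketch of that narrower change, using simplified stand-ins for Spark's `PartitionedFile` and `RDDPartition`; the bucketed call site shown is illustrative, not the exact code in Spark:

```scala
// Simplified stand-ins so the sketch compiles on its own; the real types are
// Spark internals (PartitionedFile, and Partition aliased as RDDPartition).
final case class PartitionedFile(filePath: String, start: Long, length: Long)
trait RDDPartition extends Serializable { def index: Int }

// The suggested minimal change: hold the files in an Array (written as a flat
// block during serialization) instead of a Seq that may actually be a Stream.
case class FilePartition(index: Int, files: Array[PartitionedFile]) extends RDDPartition

object FilePartitionSketch {
  // Illustrative bucketed call site: materialize each bucket's file group with
  // .toArray before constructing the partition, so no Stream leaks into the task.
  def toPartitions(filesByBucket: Map[Int, Seq[PartitionedFile]],
                   numBuckets: Int): Seq[FilePartition] =
    Seq.tabulate(numBuckets) { bucketId =>
      FilePartition(bucketId, filesByBucket.getOrElse(bucketId, Nil).toArray)
    }
}
```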
@dongjoon-hyun, @dbtsai, you guys are right. I included the FilePartition change as part of resolving the merge conflicts, but I can just make the change that DBTsai suggested. I'll make the change, re-run the tests, and update this PR.
Updated.
LGTM. Waiting for the build.
// large. tests for the condition where the serialization of such a task may result in a stack
// overflow if the files list is stored in a recursive data structure
// This test is ignored because it takes long to run (~3 min)
ignore("SPARK-27100 stack overflow: read data with large partitions") {
trigger the test and see whether it passes in 2.4?
@parthchandra can you confirm this? Thanks.
Yes. The test passes in 2.4 for bucketing both with and without Hive.
Test build #106903 has finished for PR 24957 at commit
[SPARK-27100][SQL][2.4] Use `Array` instead of `Seq` in `FilePartition` to prevent StackOverflowError

ShuffleMapTask's partition field is a FilePartition, and FilePartition's `files` field is a Stream$cons, which is essentially a linked list and is therefore serialized recursively. If a partition contains, say, 10000 files, recursing into a linked list of length 10000 overflows the stack.

The problem occurs only for bucketed partitions; the corresponding implementation for non-bucketed partitions uses a StreamBuffer. The proposed change applies the same approach to bucketed partitions.

Existing unit tests. Added a new unit test; it fails without the patch. Manual testing on the dataset used to reproduce the problem.

Closes #24957 from parthchandra/branch-2.4.

Authored-by: Parth Chandra <parthc@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
Thanks. Merged into branch-2.4.