
[SPARK-25753][CORE]fix reading small files via BinaryFileRDD #22725

Closed
wants to merge 1 commit

Conversation

10110346
Contributor

What changes were proposed in this pull request?

This is a follow-up of #21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```
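The failure comes from Hadoop's `CombineFileInputFormat`, which rejects configurations where the per-node or per-rack minimum split size exceeds the maximum split size Spark computes from `minPartitions`. A minimal sketch of the clamping idea behind the fix (following the approach of #21601; Hadoop's setters are left out, and the object and method names here are illustrative, not the actual Spark code):

```scala
// Sketch: given the total input length, the requested minimum partition
// count, and the configured per-node/per-rack minimum split sizes, compute
// a (maxSplitSize, minSizeNode, minSizeRack) triple that satisfies Hadoop's
// CombineFileInputFormat invariant: neither minimum may exceed the maximum.
object SplitSizeClamp {
  def clamp(totalLen: Long, minPartitions: Int,
            minSizePerNode: Long, minSizePerRack: Long): (Long, Long, Long) = {
    // Target split size implied by the requested number of partitions.
    val maxSplitSize = math.ceil(totalLen * 1.0 / math.max(minPartitions, 1)).toLong
    // For small inputs maxSplitSize can fall below the configured minimums,
    // so clamp the per-node/per-rack minimums down to it.
    (maxSplitSize,
      math.min(minSizePerNode, maxSplitSize),
      math.min(minSizePerRack, maxSplitSize))
  }
}
```

With the values from the stack trace above (a 4194304-byte input read as one partition and a configured per-node minimum of 5123456), the per-node minimum is clamped to 4194304, so `getSplits` no longer throws.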

How was this patch tested?

Added a unit test

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97378 has finished for PR 22725 at commit 54ffcdb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@10110346
Contributor Author

cc @dhruve @tgravescs

@tgravescs
Contributor

SPARK-24610 is the original issue; please file a new JIRA for StreamFileInputFormat.

@10110346
Contributor Author

@tgravescs OK, I will do it, thanks.

@10110346 10110346 changed the title [SPARK-24610][[CORE][FOLLOW-UP]fix reading small files via BinaryFileRDD [SPARK-25753][[CORE][FOLLOW-UP]fix reading small files via BinaryFileRDD Oct 17, 2018
@tgravescs
Contributor

+1 Looks good, thanks @10110346

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @10110346 . Could you change the title?

- [SPARK-25753][[CORE][FOLLOW-UP]fix reading small files via BinaryFileRDD
+ [SPARK-25753][CORE] Fix reading small files via BinaryFileRDD

@10110346
Contributor Author

OK, thanks @dongjoon-hyun.

@10110346 10110346 changed the title [SPARK-25753][[CORE][FOLLOW-UP]fix reading small files via BinaryFileRDD [SPARK-25753][[CORE]fix reading small files via BinaryFileRDD Oct 20, 2018
@dongjoon-hyun
Member

It still has [[ before CORE. :)

@10110346 10110346 changed the title [SPARK-25753][[CORE]fix reading small files via BinaryFileRDD [SPARK-25753][CORE]fix reading small files via BinaryFileRDD Oct 20, 2018
@tgravescs
Contributor

merged to master

@asfgit asfgit closed this in 81a305d Oct 22, 2018
@SparkQA

SparkQA commented Oct 22, 2018

Test build #97854 has started for PR 22725 at commit 54ffcdb.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97862 has started for PR 22725 at commit 54ffcdb.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97874 has started for PR 22725 at commit 54ffcdb.

@AmplabJenkins

Build finished. Test FAILed.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

This is a follow-up of apache#21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```

## How was this patch tested?
Added a unit test

Closes apache#22725 from 10110346/maxSplitSize_node_rack.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Thomas Graves <tgraves@apache.org>
dhruve pushed a commit to dhruve/spark that referenced this pull request Oct 4, 2019
## What changes were proposed in this pull request?

This is a follow-up of apache#21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```

## How was this patch tested?
Added a unit test

Closes apache#22725 from 10110346/maxSplitSize_node_rack.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Thomas Graves <tgraves@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Oct 4, 2019
### What changes were proposed in this pull request?
This is a clean cherry-pick of #22725 from master to 2.4.

This is a follow-up of #21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```

### Why are the changes needed?
This is an existing bug which was fixed in master, but not back ported to 2.4.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
The original patch added a unit test.

Ran the unit test that was added in the original patch and manually verified the changes by creating a multiline csv and loading it in spark shell.
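A manual check along those lines might look like the following spark-shell session (the input path, the 5123456-byte minimums, and the two-partition value are illustrative assumptions, not taken from the PR):

```scala
// Illustrative spark-shell fragment (relies on the shell-provided `sc`):
// force a per-node/per-rack minimum split size larger than the maximum
// split size Spark will compute for a small input, then read it back.
// Before the fix, getPartitions raised the IOException shown above.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize.per.node", 5123456L)
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize.per.rack", 5123456L)

val rdd = sc.binaryFiles("/tmp/small-files", minPartitions = 2)
rdd.count()  // succeeds after the fix instead of failing in getSplits
```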

Closes #26026 from dhruve/fix/SPARK-25753/2.4.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>