
[SPARK-25753][CORE][2.4] Fix reading small files via BinaryFileRDD #26026

Closed
wants to merge 1 commit into branch-2.4 from dhruve/fix/SPARK-25753/2.4

Conversation

@dhruve (Contributor) commented Oct 4, 2019

What changes were proposed in this pull request?

This is a clean cherry-pick of #22725 from master to branch-2.4.

This is a follow-up of #21601: `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```
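The fix merged in #22725 caps the per-node and per-rack minimum split sizes at the computed maximum split size before Hadoop's `CombineFileInputFormat` computes splits. A minimal, self-contained sketch of that invariant (simplified names, not the literal Spark patch):

```scala
// Sketch of the invariant the fix enforces: the per-node and per-rack
// minimum split sizes handed to CombineFileInputFormat must not exceed
// the computed maximum split size, otherwise getSplits throws the
// IOException shown above.
object SplitSizeGuard {
  // Returns (minPerNode, minPerRack) clamped to maxSplitSize.
  def clamp(maxSplitSize: Long,
            minSplitSizePerNode: Long,
            minSplitSizePerRack: Long): (Long, Long) = {
    (math.min(minSplitSizePerNode, maxSplitSize),
     math.min(minSplitSizePerRack, maxSplitSize))
  }
}
```

With the values from the stack trace above, `SplitSizeGuard.clamp(4194304L, 5123456L, 0L)` yields `(4194304, 0)`, so split computation no longer fails.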

Why are the changes needed?

This is an existing bug which was fixed in master, but not backported to 2.4.
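For context, the failure mode can be reproduced arithmetically. Below is a hedged sketch of how the max split size is derived in `setMinPartitions` (simplified standalone function; the real Spark code reads `spark.files.maxPartitionBytes` and `spark.files.openCostInBytes` from the SparkConf rather than taking them as parameters). With only small input files, the result collapses to `openCostInBytes` (4 MB by default), which can fall below a user-configured per-node minimum split size such as 5123456:

```scala
// Hypothetical standalone version of the max-split-size computation,
// using Spark's default values for the two config knobs.
def computeMaxSplitSize(totalBytes: Long,
                        defaultParallelism: Int,
                        defaultMaxSplitBytes: Long = 128L * 1024 * 1024,
                        openCostInBytes: Long = 4L * 1024 * 1024): Long = {
  val bytesPerCore = totalBytes / defaultParallelism
  // Small inputs: bytesPerCore is tiny, so the max() floor wins and the
  // result is openCostInBytes = 4194304 bytes.
  math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
}

// Three files of ~10 KB on 8 cores: maxSplitSize is 4194304 (4 MB),
// smaller than a configured per-node minimum of 5123456.
val maxSplit = computeMaxSplitSize(totalBytes = 3 * 10240L, defaultParallelism = 8)
```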

Does this PR introduce any user-facing change?

No

How was this patch tested?

The original patch added a unit test.

Ran the unit test that was added in the original patch, and manually verified the changes by creating a multiline CSV file and loading it in spark-shell.

## What changes were proposed in this pull request?

This is a follow-up of apache#21601: `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```

## How was this patch tested?
Added a unit test

Closes apache#22725 from 10110346/maxSplitSize_node_rack.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Thomas Graves <tgraves@apache.org>
@dhruve (Author) commented Oct 4, 2019

@dongjoon-hyun @tgravescs

@SparkQA commented Oct 4, 2019

Test build #111782 has finished for PR 26026 at commit 27acb89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented

@dhruve can you update the description to say cherry-pick #22725, move the "How was this patch tested" section down to fit the format of the new PR template, and add any additional info about the testing that you did.

@dhruve dhruve changed the title [SPARK-25753][CORE] fix reading small files via BinaryFileRDD [SPARK-25753][CORE] Cherry pick #22725 - fix reading small files via BinaryFileRDD Oct 4, 2019
@srowen (Member) left a comment

Looks OK as a cherry-pick, but yes, update the description.

@dhruve (Author) commented Oct 4, 2019

Updated the description.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-25753][CORE] Cherry pick #22725 - fix reading small files via BinaryFileRDD [SPARK-25753][CORE][2.4] Fix reading small files via BinaryFileRDD Oct 4, 2019
@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @dhruve , @tgravescs , @srowen .
Merged to branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request Oct 4, 2019

Closes #26026 from dhruve/fix/SPARK-25753/2.4.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>