
[SPARK-25753][CORE]fix reading small files via BinaryFileRDD #22725

Closed
wants to merge 1 commit

Conversation

10110346
Contributor

What changes were proposed in this pull request?

This is a follow-up of #21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```
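The failure comes from Hadoop's `CombineFileInputFormat`, which rejects configurations where the per-node or per-rack minimum split size exceeds the maximum split size Spark computes from `minPartitions`. A minimal sketch of the clamping idea behind the fix (following the approach of #21601; Hadoop's setters are left out, and the object and method names here are illustrative, not the actual Spark code):

```scala
// Sketch: given the total input length, the requested minimum partition
// count, and the configured per-node/per-rack minimum split sizes, compute
// a (maxSplitSize, minSizeNode, minSizeRack) triple that satisfies Hadoop's
// CombineFileInputFormat invariant: neither minimum may exceed the maximum.
object SplitSizeClamp {
  def clamp(totalLen: Long, minPartitions: Int,
            minSizePerNode: Long, minSizePerRack: Long): (Long, Long, Long) = {
    // Target split size implied by the requested number of partitions.
    val maxSplitSize = math.ceil(totalLen * 1.0 / math.max(minPartitions, 1)).toLong
    // For small inputs maxSplitSize can fall below the configured minimums,
    // so clamp the per-node/per-rack minimums down to it.
    (maxSplitSize,
      math.min(minSizePerNode, maxSplitSize),
      math.min(minSizePerRack, maxSplitSize))
  }
}
```

With the values from the stack trace above (a 4194304-byte input read as one partition and a configured per-node minimum of 5123456), the per-node minimum is clamped to 4194304, so `getSplits` no longer throws.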

How was this patch tested?

Added a unit test

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97378 has finished for PR 22725 at commit 54ffcdb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@10110346
Contributor Author

cc @dhruve @tgravescs

@tgravescs
Contributor

SPARK-24610 is the original issue; please file a new JIRA for StreamFileInputFormat.

@10110346
Contributor Author

@tgravescs OK, I will do it, thanks.

@10110346 10110346 changed the title [SPARK-24610][[CORE][FOLLOW-UP]fix reading small files via BinaryFileRDD [SPARK-25753][[CORE][FOLLOW-UP]fix reading small files via BinaryFileRDD Oct 17, 2018
@tgravescs
Contributor

+1 Looks good, thanks @10110346

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @10110346 . Could you change the title?

- [SPARK-25753][[CORE][FOLLOW-UP]fix reading small files via BinaryFileRDD
+ [SPARK-25753][CORE] Fix reading small files via BinaryFileRDD

@10110346
Contributor Author

OK, thanks @dongjoon-hyun.

@10110346 10110346 changed the title [SPARK-25753][[CORE][FOLLOW-UP]fix reading small files via BinaryFileRDD [SPARK-25753][[CORE]fix reading small files via BinaryFileRDD Oct 20, 2018
@dongjoon-hyun
Member

It still has [[ before CORE. :)

@10110346 10110346 changed the title [SPARK-25753][[CORE]fix reading small files via BinaryFileRDD [SPARK-25753][CORE]fix reading small files via BinaryFileRDD Oct 20, 2018
@tgravescs
Contributor

merged to master

@asfgit asfgit closed this in 81a305d Oct 22, 2018
@SparkQA

SparkQA commented Oct 22, 2018

Test build #97854 has started for PR 22725 at commit 54ffcdb.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97862 has started for PR 22725 at commit 54ffcdb.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97874 has started for PR 22725 at commit 54ffcdb.

@AmplabJenkins

Build finished. Test FAILed.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

This is a follow-up of apache#21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```

## How was this patch tested?
Added a unit test

Closes apache#22725 from 10110346/maxSplitSize_node_rack.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Thomas Graves <tgraves@apache.org>
dhruve pushed a commit to dhruve/spark that referenced this pull request Oct 4, 2019
## What changes were proposed in this pull request?

This is a follow-up of apache#21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```

## How was this patch tested?
Added a unit test

Closes apache#22725 from 10110346/maxSplitSize_node_rack.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Thomas Graves <tgraves@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Oct 4, 2019
### What changes were proposed in this pull request?
This is a clean cherry-pick of #22725 from master to 2.4.

This is a follow-up of #21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem.

```
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```

### Why are the changes needed?
This is an existing bug which was fixed in master, but not back ported to 2.4.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
The original patch added a unit test.

Ran the unit test that was added in the original patch and manually verified the changes by creating a multiline csv and loading it in spark shell.
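A manual check along those lines might look like the following spark-shell session (the input path, the 5123456-byte minimums, and the two-partition value are illustrative assumptions, not taken from the PR):

```scala
// Illustrative spark-shell fragment (relies on the shell-provided `sc`):
// force a per-node/per-rack minimum split size larger than the maximum
// split size Spark will compute for a small input, then read it back.
// Before the fix, getPartitions raised the IOException shown above.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize.per.node", 5123456L)
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize.per.rack", 5123456L)

val rdd = sc.binaryFiles("/tmp/small-files", minPartitions = 2)
rdd.count()  // succeeds after the fix instead of failing in getSplits
```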

Closes #26026 from dhruve/fix/SPARK-25753/2.4.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>