
[SPARK-9189][CORE] Takes locality and the sum of partition length into account when partition is instance of HadoopPartition in operator coalesce #7536

Closed
wants to merge 2 commits

Conversation

watermen
Contributor

Before:
Takes locality and the number of partitions into account in the coalesce operator.

After:
Takes locality and the sum of partition lengths (part1.len + part2.len + ... + partN.len) into account when the partitions are instances of HadoopPartition in the coalesce operator.

The goal is to make the data sizes of the coalesced partitions more balanced.
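
For illustration, a minimal sketch of the idea in Scala (not the exact patch; it assumes code living inside Spark, where the package-private HadoopPartition class and its inputSplit field are visible):

  // Sketch only: measure a candidate group by the total bytes of its Hadoop input
  // splits instead of by how many partitions it contains.
  def groupBytes(group: Seq[Partition]): Long =
    group.map(_.asInstanceOf[HadoopPartition].inputSplit.value.getLength).sum

  // The coalesce logic can then prefer the group with fewer bytes when assigning the
  // next partition, rather than the group with fewer partitions.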

/cc @liancheng @scwf

@scwf
Contributor

scwf commented Jul 20, 2015

@watermen I think this should be [SPARK-9189][CORE]

@watermen changed the title from "[SPARK-9189][SQL] Takes locality and the sum of partition length into account when partition is instance of HadoopPartition in operator coalesce" to "[SPARK-9189][CORE] Takes locality and the sum of partition length into account when partition is instance of HadoopPartition in operator coalesce" on Jul 21, 2015
@andrewor14
Contributor

add to whitelist

@SparkQA

SparkQA commented Jul 21, 2015

Test build #37909 has finished for PR 7536 at commit cb72d0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@watermen
Contributor Author

@andrewor14 retest it?
[error] SERVER ERROR: Service Temporarily Unavailable url=http://maven.twttr.com/org/apache/hadoop/hadoop-yarn-server/2.2.0/hadoop-yarn-server-2.2.0.jar

@liancheng
Contributor

@watermen You can retest it yourself now as Andrew has put you into the whitelist.

@liancheng
Contributor

Jenkins is being shut down for maintenance purposes. You may retest this PR after it comes back.

@watermen
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 21, 2015

Test build #42 has finished for PR 7536 at commit cb72d0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2015

Test build #37951 has finished for PR 7536 at commit cb72d0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@watermen
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38004 has finished for PR 7536 at commit cb72d0f.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 22, 2015

Test build #47 has finished for PR 7536 at commit cb72d0f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@scwf
Contributor

scwf commented Jul 22, 2015

retest this please

@SparkQA

SparkQA commented Jul 22, 2015

Test build #57 has finished for PR 7536 at commit cb72d0f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38066 has finished for PR 7536 at commit cb72d0f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@watermen
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 23, 2015

Test build #38141 has finished for PR 7536 at commit cb72d0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 23, 2015

Test build #67 has finished for PR 7536 at commit cb72d0f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@watermen
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 26, 2015

Test build #108 has finished for PR 7536 at commit cb72d0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2015

Test build #38473 has finished for PR 7536 at commit cb72d0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val minPowerOfTwo = if (p.isInstanceOf[HadoopPartition]) {
  val groupLen1 = groupArr(r1).arr.map(part =>
    part.asInstanceOf[HadoopPartition].inputSplit.value.getLength).sum
  val groupLen2 = groupArr(r1).arr.map(part =>
    part.asInstanceOf[HadoopPartition].inputSplit.value.getLength).sum
@hellertime
Contributor

Shouldn't this be val groupLen2 = groupArr(r2)? Otherwise groupLen1 will always equal groupLen2.

@watermen
Contributor Author

@hellertime Yes, I'll fix this bug, thanks.
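
For reference, the fix discussed above amounts to summing over groupArr(r2) for the second group, roughly:

  val groupLen1 = groupArr(r1).arr.map(part =>
    part.asInstanceOf[HadoopPartition].inputSplit.value.getLength).sum
  val groupLen2 = groupArr(r2).arr.map(part =>
    part.asInstanceOf[HadoopPartition].inputSplit.value.getLength).sum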

@watermen
Contributor Author

watermen commented Aug 6, 2015

retest this please

@SparkQA

SparkQA commented Aug 6, 2015

Test build #244 has finished for PR 7536 at commit 5623370.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 6, 2015

Test build #39997 has finished for PR 7536 at commit 5623370.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@watermen
Contributor Author

@andrewor14 @srowen Any more comments on this?

@srowen
Member

srowen commented Aug 11, 2015

That seems reasonable to me, although I am not sure whether inputLength is necessarily set in all cases? That is just because I don't know the code. Also, are there other places that need a similar treatment?

@andrewor14
Contributor

@watermen can you add a unit test for this? The high level motivation sounds reasonable to me, but like @srowen I'm not familiar enough with the Hadoop code to merge this. Perhaps @tgravescs would have a better idea?

@tgravescs
Contributor

On the Hadoop side, getLength() is part of the required interface, so the function will be there. With all the input formats I've seen, it is always set to something reasonable, but anyone can write a custom input format. I could see cases where someone has an input format where they don't know the size, or where it's more expensive to compute the size than just to fetch it. You could simply add a check for 0 and fall back to the # of partitions.

@watermen have you run this on a real cluster with skewed data to see if it makes a difference? What input formats have you used?

If there are thousands (or tens of thousands) of partitions and you are coalescing into a small # of buckets, we are now potentially calculating the length of every group over and over again. Did you test to see how long that takes vs. just checking the size of the array? I'm guessing that isn't too bad, but it doesn't hurt to verify.
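
One way to realize the fallback suggested above (illustrative only, not part of this patch): treat a total reported length of 0 as "size unknown" and compare the two candidate groups by partition count instead.

  // Sketch: pick the smaller of two candidate groups. If either group's total split
  // length is 0 (e.g. a custom InputFormat that cannot report sizes), fall back to
  // comparing the # of partitions.
  def pickSmallerGroup(g1: Seq[Partition], g2: Seq[Partition]): Seq[Partition] = {
    def bytes(g: Seq[Partition]): Long =
      g.map(_.asInstanceOf[HadoopPartition].inputSplit.value.getLength).sum
    val (b1, b2) = (bytes(g1), bytes(g2))
    if (b1 == 0 || b2 == 0) { if (g1.size <= g2.size) g1 else g2 }
    else { if (b1 <= b2) g1 else g2 }
  }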

@srowen
Member

srowen commented Sep 3, 2015

(PS: Yeah, I meant that I wasn't sure whether getLength always returned a meaningful value, as I sort of distantly remember a problem with this, but I might be imagining it.)

@watermen closed this on Oct 14, 2015