
Conversation

@viirya
Member

@viirya viirya commented May 28, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-6307 and https://issues.apache.org/jira/browse/SPARK-6922

Alternative approach to this issue. Ref. #5572.

This PR tries to coalesce the RDDs before RDD.cartesian. It also implements idea 1 discussed in #5572, so in cartesian we fetch each partition of RDD2 only once for each partition of RDD1.
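In outline, the effect looks like this (a minimal sketch of the idea; the SparkContext setup and partition counts are illustrative, not taken from the patch):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup: two RDDs with many small partitions.
val sc = new SparkContext(new SparkConf().setAppName("cartesian-sketch"))
val rdd1 = sc.parallelize(1 to 100000, 160)
val rdd2 = sc.parallelize(1 to 100000, 160)

// Coalescing to fewer, larger partitions before cartesian means each partition
// of rdd2 is fetched far fewer times (once per partition of rdd1).
val product = rdd1.coalesce(10).cartesian(rdd2.coalesce(10))
println(product.count())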

For the performance test, using the same example code as in #5572, it takes about 4s.

cc @tbertelsen, @squito, @cloud-fan.

@SparkQA

SparkQA commented May 28, 2015

Test build #33656 has finished for PR 6454 at commit 72e199e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Have you tried not doing the coalesce before cartesian? It seems to me that we may slow things down because of these two extra shuffles.

Member Author

Without the coalesce it is slightly faster. However, coalescing to fewer partitions may be beneficial, as we discussed in #5572. Another approach is to remove the coalesce and just add a note to cartesian's documentation recommending that users coalesce first.


What type of cluster have you tested this on? I expect this will perform poorly if your cluster has few machines with a lot of cores, for example two machines with 32 cores each. Ideally we would like the product of the number of partitions in the two RDDs to equal the default parallelism (that is, the total number of cores). If we do a shuffle anyway, we should be able to calculate the number of partitions fairly precisely.

What will happen if we coalesce with shuffle set to false? For example, if an RDD has 16 partitions on 4 nodes and we coalesce to six partitions?
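For concreteness, the call in question would look like this (a sketch; sc and the data are placeholders):

// Hypothetical setup matching the example: 16 partitions, say across 4 nodes.
val rdd = sc.parallelize(1 to 1000, 16)
// With shuffle = false no shuffle stage runs; Spark merges parent partitions
// in a locality-aware way, so each of the 6 outputs unions a few parents.
val fewer = rdd.coalesce(6, shuffle = false)
println(fewer.partitions.length)  // 6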

Contributor

Sorry, I think I was vague when I originally brought up coalesce. I had intended for it to be only a suggestion to the user, included in the docs, as you note. I think it's really hard to guess how to do the coalescing. In fact, I just realized there is a downside to coalescing that I hadn't thought of earlier: the more you coalesce, the more likely it becomes that you cannot store all of rdd2.iterator in memory.


We want to coalesce to reduce the number of times a given partition is fetched. Each partition is fetched once for every partition in the other RDD. The Cartesian RDD has a number of partitions equal to the product of the numbers of partitions in the two source RDDs, and for each partition in the Cartesian RDD we do (asymptotically) one fetch.

RDDs typically have a number of partitions equal to the default parallelism (the number of cores), and ideally the Cartesian RDD would have the same. If we don't coalesce, we should expect it to have a number of partitions equal to the square of the default parallelism.

If, for example, we have 10 machines with 16 cores each, the default parallelism is 160. Without coalescing, the Cartesian RDD will have 25600 partitions. 2560 of them can be computed without a remote fetch, but the rest each involve one, which results in a lot of redundant fetches if we don't cache the partitions: to be exact, each partition will be fetched remotely 144 times. If we coalesce, we can bring that down to 10 or 16.
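The arithmetic, spelled out (plain Scala, nothing Spark-specific):

val machines = 10
val coresPerMachine = 16
val defaultParallelism = machines * coresPerMachine                   // 160
val cartesianPartitions = defaultParallelism * defaultParallelism     // 25600
val localPartitions = defaultParallelism * coresPerMachine            // 2560 need no remote fetch
val remoteFetchesPerPartition = defaultParallelism - coresPerMachine  // 144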

I agree it can be difficult to predict the number of partitions for all possible cases, but we can make a pretty good guess that would work well for most applications. I think the best solution is to make that guess and let users override it if they have domain knowledge that changes the default.

I would suggest something like:

val numPartitions1 = numExecutors
val numPartitions2 = max(numExecutors, defaultParallelism / numExecutors)
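A sketch of how those values might be obtained at runtime (deriving the executor count from getExecutorMemoryStatus, and the - 1 for the driver, are assumptions, not part of the proposal):

// Sketch: derive the suggested partition counts on the driver.
// getExecutorMemoryStatus includes the driver, hence the - 1 (an assumption).
val numExecutors = math.max(1, sc.getExecutorMemoryStatus.size - 1)
val numPartitions1 = numExecutors
val numPartitions2 = math.max(numExecutors, sc.defaultParallelism / numExecutors)
val product = rdd1.coalesce(numPartitions1).cartesian(rdd2.coalesce(numPartitions2))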

@shaneknapp
Contributor

jenkins, test this please


@SparkQA

SparkQA commented May 28, 2015

Test build #33657 has finished for PR 6454 at commit 72e199e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I'm confused about how this actually makes things any faster. It seems like even if you do unroll rdd2.iterator into an array, you just get an iterator back and don't hold on to the array. Doesn't this still lead to rdd2.iterator getting called for every element in rdd1.iterator? I'd think you would need to do something like:

val arrayValues: Option[Array[U]] = {
  // Try to unroll rdd2's partition into memory; give up if it does not fit.
  SparkEnv.get.blockManager.memoryStore.unrollSafely(key,
      rdd2.iterator(currSplit.s2, context), updatedBlocks) match {
    case Left(arr) => Some(arr)
    case Right(_) => None
  }
}

// Replay the unrolled array if we have it; otherwise recompute/refetch rdd2.
def getRdd2Values(): Iterator[U] =
  arrayValues.map(_.iterator).getOrElse(rdd2.iterator(currSplit.s2, context))
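Presumably compute would then consume the helper like this (a sketch; currSplit and context come from the surrounding CartesianRDD code):

// Sketch: getRdd2Values() is re-invoked per element of rdd1, but replaying the
// unrolled array is cheap, while the fallback recomputes/refetches rdd2.
for (x <- rdd1.iterator(currSplit.s1, context);
     y <- getRdd2Values()) yield (x, y)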

Contributor

I think we also need to call MemoryStore.releasePendingUnrollMemoryForThisThread after we're done with this task ... I guess we'd add that with TaskContext.get.addTaskCompletionListener.
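Something like the following, presumably (a sketch; where exactly to register the listener is left open here):

// Sketch: free the pending unroll memory once the task completes.
TaskContext.get.addTaskCompletionListener { (_: TaskContext) =>
  SparkEnv.get.blockManager.memoryStore.releasePendingUnrollMemoryForThisThread()
}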

Member Author

@squito Yes. I noticed that but have not committed updates yet. Interestingly, the performance gap between before and after applying the remote-caching approach from #5572 seems to have narrowed in the latest codebase. Since this issue has been open for a while, I wonder if other improvements have already made this change unimportant for performance. I will test again to see if that is the case. If so, these two PRs can be closed.

@tbertelsen Can you try the latest Spark codebase for RDD.cartesian performance too?

@tbertelsen

@viirya Sadly I don't have access to our compute cluster for the next couple of weeks, so I can't try it out right now.

@wingchen

wingchen commented Jul 9, 2015

Guys, any update on this one?

@viirya
Member Author

viirya commented Jul 15, 2015

Closing this PR for now, as I don't know whether it is still needed.

@viirya viirya closed this Jul 15, 2015
@viirya viirya deleted the speed_up_cartesian branch December 27, 2023 18:31