[SPARK-19189] Optimize CartesianRDD to avoid parent RDD's partitions re-computation and re-serialization #16574
WeichenXu123 wants to merge 2 commits into apache:master from
Conversation
14ba3b2 to e114eed
Test build #71320 has finished for PR 16574 at commit
Test build #71321 has finished for PR 16574 at commit
Test build #71322 has finished for PR 16574 at commit
Jenkins, test this please.
Test build #71328 has finished for PR 16574 at commit
This is a behavior change and will break the expectations of existing code that depends on cartesian not going through a shuffle (particularly when the data is already persisted).
@mridulm En...so that still keep
Couple of points: a) Can recomputation be expensive? Unfortunately, yes, if not used properly. For better or for worse, this has been the implementation in Spark since the early days, pre-0.5, and the costs are known. Particularly given Apache Spark's ability to cache/checkpoint data, the assumption is that a shuffle is more expensive. This might not hold anymore, given the improvements since 1.0, but only redoing the benchmarks will give a better picture. b) If we were to do a shuffle for cartesian, I would implement it differently; take a look at how Apache Pig has implemented it for a more efficient way to do it. (Btw, I don't think the implementation in this PR actually works, but I have not looked at it in detail.)
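The Pig-style idea alluded to above can be sketched as follows. This is plain Python, not Pig's or Spark's actual code, and `shuffle_cartesian` is a hypothetical name: each left partition i is replicated to reducer keys (i, 0..m-1) and each right partition j to (0..n-1, j), so every parent partition is serialized once per opposite-side partition but never recomputed.

```python
from collections import defaultdict
from itertools import product

def shuffle_cartesian(left_parts, right_parts):
    """Illustrative shuffle-based cartesian product over pre-partitioned data."""
    n, m = len(left_parts), len(right_parts)
    buckets = defaultdict(lambda: ([], []))
    # "Map" side: replicate each left partition to the m reducers that need it.
    for i, part in enumerate(left_parts):
        for j in range(m):
            buckets[(i, j)][0].extend(part)
    # Replicate each right partition to the n reducers that need it.
    for j, part in enumerate(right_parts):
        for i in range(n):
            buckets[(i, j)][1].extend(part)
    # "Reduce" side: each bucket holds exactly one (left, right) partition pair,
    # so the cross product of the pair is computed locally, with no recomputation.
    out = []
    for key in sorted(buckets):
        ls, rs = buckets[key]
        out.extend(product(ls, rs))
    return out
```

For example, `shuffle_cartesian([[1, 2], [3]], [["a"], ["b", "c"]])` yields all 9 pairs while touching each parent partition only as many times as there are opposite-side partitions.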
@mridulm But you mention that there is a more efficient way to implement cartesian using shuffling... I would like to research it and consider a better solution. Thanks!
I need to survey better Cartesian implementations, especially shuffle-based ones. I will close this PR for now and reopen it when the new solution is done.
What changes were proposed in this pull request?
In the current CartesianRDD implementation, suppose RDDA cartesian RDDB generates RDDC: each partition of RDDA will be read by multiple partitions of RDDC, and RDDB has the same problem.
As a result, when a partition of RDDC is computed, the corresponding partition data of RDDA and RDDB is serialized (and transferred over the network) repeatedly, and if RDDA or RDDB has not been persisted, its partitions are recomputed repeatedly.
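The repeated-computation cost described above can be simulated in plain Python (no Spark; `compute_partition` and `cartesian` are hypothetical stand-ins for Spark's per-partition compute path): with n left and m right partitions, each unpersisted left partition is computed m times and each right partition n times, while a cache (standing in for `persist`) brings every partition down to one computation.

```python
# Count how many times each parent "partition" is actually computed.
compute_counts = {}

def compute_partition(name, idx, cache=None):
    """Compute (or fetch from cache) one parent partition, counting real computations."""
    if cache is not None and (name, idx) in cache:
        return cache[(name, idx)]
    compute_counts[(name, idx)] = compute_counts.get((name, idx), 0) + 1
    data = [f"{name}{idx}-{k}" for k in range(2)]  # stand-in for real work
    if cache is not None:
        cache[(name, idx)] = data
    return data

def cartesian(n_a, n_b, cache=None):
    """Naive cartesian over n_a x n_b partitions, pulling both parents per output partition."""
    out = []
    for i in range(n_a):
        for j in range(n_b):
            pa = compute_partition("A", i, cache)
            pb = compute_partition("B", j, cache)
            out.extend((a, b) for a in pa for b in pb)
    return out

cartesian(3, 4)                   # unpersisted: each A partition computed 4x, each B 3x
uncached = dict(compute_counts)
compute_counts.clear()
cartesian(3, 4, cache={})         # "persisted": every partition computed exactly once
cached = dict(compute_counts)
```

The gap grows with the number of opposite-side partitions, which is why the cost matters for large cartesian products over unpersisted parents.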
In this PR, I change the dependency in CartesianRDD from NarrowDependency to ShuffleDependency, while keeping the way the parent RDDs are partitioned. The computation in CartesianRDD keeps the current implementation.
How was this patch tested?
Added a Cartesian test.