Skip to content

[SPARK-19189] Optimize CartesianRDD to avoid parent RDD's partitions re-computation and re-serialization#16574

Closed
WeichenXu123 wants to merge 2 commits intoapache:masterfrom
WeichenXu123:optimize_cartesian
Closed

[SPARK-19189] Optimize CartesianRDD to avoid parent RDD's partitions re-computation and re-serialization#16574
WeichenXu123 wants to merge 2 commits intoapache:masterfrom
WeichenXu123:optimize_cartesian

Conversation

@WeichenXu123
Copy link
Contributor

What changes were proposed in this pull request?

Current CartesianRDD implementation, suppose RDDA cartisian RDDB, generating RDDC,
each RDDA partition will be reading by multiple RDDC partition, and RDDB has similar problem.
This will cause, when RDDC partition computing, each partition's data in RDDA or RDDB will be repeatedly serialized (then transfer through network), if RDDA or RDDB haven't been persist, it will cause RDD recomputation repeatedly.

In this PR, I change the dependency in CartesianRDD from NarrowDependency into ShuffleDependency, but still keep the way how the parent RDD partitioned. And computing CartesianRDD keep current implementation.

How was this patch tested?

Add a Cartesian test.

@WeichenXu123 WeichenXu123 changed the title [SPARK-19189] Optimize CartesianRDD to avoid partition re-computation and re-serialization [SPARK-19189] Optimize CartesianRDD to avoid parent RDD's partitions re-computation and re-serialization Jan 13, 2017
@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71320 has finished for PR 16574 at commit 14ba3b2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71321 has finished for PR 16574 at commit e114eed.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71322 has finished for PR 16574 at commit 815063b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Copy link
Contributor Author

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71328 has finished for PR 16574 at commit 815063b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Copy link
Contributor

mridulm commented Jan 13, 2017

This is a behavior change and will break expectations from existing code depending on cartesian to not go through shuffle (particularly when data is already persisted).

@WeichenXu123
Copy link
Contributor Author

@mridulm En...so that still keep NarrowDependency seems better, but I think the recomputation is a serious problem when parents RDD not persisted, I think in this case we should try to print some warning message to remind developer to check their spark application code... how do you think about it ?

@mridulm
Copy link
Contributor

mridulm commented Jan 14, 2017

Couple of points :

a) Can recomputation be expensive ? Unfortunately, yes if not used properly. For better or for worse, this has been the implementation in spark since early days - pre-0.5; and the costs are known. Particularly given Apache spark's ability to cache/checkpoint data, the assumption is that shuffle is more expensive. This might not hold anymore actually, given improvements since 1.0 - but only redoing benchmarks will give a better picture.

b) If we were to do a shuffle for cartesian, I would implement it differently - take a look at how Apache Pig has implemented it for a more efficient way to do it. (Btw, I dont think the impl in the PR actually works, but I have not looked at it in detail).

@WeichenXu123
Copy link
Contributor Author

WeichenXu123 commented Jan 14, 2017

@mridulm
Year, I know you are worried about the shuffling cost here. Currently do the shuffle still use partition the parent RDDs in the same way, i.e the shuffle do nothing, each partition after shuffling will keep the same. And I do not modify the preferredLocatinos So in this PR implementation it will generate good data-locality so that its network transfer cost is similar to current NarrowDependency implementation, IMO.

BUT, you mention that Cartesian has more efficient way to implement using shuffling... I would like to research about it and consider better solution. Thanks!

@WeichenXu123
Copy link
Contributor Author

I need to make a survey for better Cartesian implementation, especially in shuffle way. Close this PR for now and when the new solution is done I will reopen it.

@WeichenXu123 WeichenXu123 deleted the optimize_cartesian branch April 24, 2019 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants