[SPARK-19189] Optimize CartesianRDD to avoid parent RDD's partitions re-computation and re-serialization #16574
WeichenXu123 wants to merge 2 commits into apache:master from
Conversation
14ba3b2 to e114eed
Test build #71320 has finished for PR 16574 at commit
Test build #71321 has finished for PR 16574 at commit
Test build #71322 has finished for PR 16574 at commit
Jenkins, test this please.
Test build #71328 has finished for PR 16574 at commit
This is a behavior change and will break the expectations of existing code that depends on cartesian not going through a shuffle (particularly when the data is already persisted).
@mridulm En...so that still keep
Couple of points: a) Can recomputation be expensive? Unfortunately, yes, if not used properly. For better or for worse, this has been the implementation in Spark since the early days, pre-0.5, and the costs are known. Particularly given Apache Spark's ability to cache/checkpoint data, the assumption is that a shuffle is more expensive. This might not hold anymore, given the improvements since 1.0, but only redoing the benchmarks will give a better picture. b) If we were to do a shuffle for cartesian, I would implement it differently; take a look at how Apache Pig has implemented it for a more efficient way to do it. (Btw, I don't think the implementation in this PR actually works, but I have not looked at it in detail.)
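The Pig-style idea alluded to above can be sketched as follows. This is plain Python, not Pig's or Spark's actual code, and `shuffle_cartesian` is a hypothetical name: each left partition i is replicated to reducer keys (i, 0..m-1) and each right partition j to (0..n-1, j), so every parent partition is serialized once per opposite-side partition but never recomputed.

```python
from collections import defaultdict
from itertools import product

def shuffle_cartesian(left_parts, right_parts):
    """Illustrative shuffle-based cartesian product over pre-partitioned data."""
    n, m = len(left_parts), len(right_parts)
    buckets = defaultdict(lambda: ([], []))
    # "Map" side: replicate each left partition to the m reducers that need it.
    for i, part in enumerate(left_parts):
        for j in range(m):
            buckets[(i, j)][0].extend(part)
    # Replicate each right partition to the n reducers that need it.
    for j, part in enumerate(right_parts):
        for i in range(n):
            buckets[(i, j)][1].extend(part)
    # "Reduce" side: each bucket holds exactly one (left, right) partition pair,
    # so the cross product of the pair is computed locally, with no recomputation.
    out = []
    for key in sorted(buckets):
        ls, rs = buckets[key]
        out.extend(product(ls, rs))
    return out
```

For example, `shuffle_cartesian([[1, 2], [3]], [["a"], ["b", "c"]])` yields all 9 pairs while touching each parent partition only as many times as there are opposite-side partitions.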
@mridulm But you mention that there is a more efficient way to implement cartesian using shuffling... I would like to research it and consider a better solution. Thanks!
I need to survey better Cartesian implementations, especially shuffle-based ones. I will close this PR for now and reopen it when the new solution is done.
What changes were proposed in this pull request?
In the current CartesianRDD implementation, suppose RDDA cartesian RDDB generates RDDC: each partition of RDDA will be read by multiple partitions of RDDC, and RDDB has the same problem.
As a result, when a partition of RDDC is computed, the corresponding partition data of RDDA and RDDB is serialized (and transferred over the network) repeatedly, and if RDDA or RDDB has not been persisted, its partitions are recomputed repeatedly.
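The repeated-computation cost described above can be simulated in plain Python (no Spark; `compute_partition` and `cartesian` are hypothetical stand-ins for Spark's per-partition compute path): with n left and m right partitions, each unpersisted left partition is computed m times and each right partition n times, while a cache (standing in for `persist`) brings every partition down to one computation.

```python
# Count how many times each parent "partition" is actually computed.
compute_counts = {}

def compute_partition(name, idx, cache=None):
    """Compute (or fetch from cache) one parent partition, counting real computations."""
    if cache is not None and (name, idx) in cache:
        return cache[(name, idx)]
    compute_counts[(name, idx)] = compute_counts.get((name, idx), 0) + 1
    data = [f"{name}{idx}-{k}" for k in range(2)]  # stand-in for real work
    if cache is not None:
        cache[(name, idx)] = data
    return data

def cartesian(n_a, n_b, cache=None):
    """Naive cartesian over n_a x n_b partitions, pulling both parents per output partition."""
    out = []
    for i in range(n_a):
        for j in range(n_b):
            pa = compute_partition("A", i, cache)
            pb = compute_partition("B", j, cache)
            out.extend((a, b) for a in pa for b in pb)
    return out

cartesian(3, 4)                   # unpersisted: each A partition computed 4x, each B 3x
uncached = dict(compute_counts)
compute_counts.clear()
cartesian(3, 4, cache={})         # "persisted": every partition computed exactly once
cached = dict(compute_counts)
```

The gap grows with the number of opposite-side partitions, which is why the cost matters for large cartesian products over unpersisted parents.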
In this PR, I change the dependency in CartesianRDD from NarrowDependency to ShuffleDependency, while keeping the way the parent RDDs are partitioned. The computation in CartesianRDD keeps the current implementation.
How was this patch tested?
Added a Cartesian test.