[SPARK-20638][Core]Optimize the CartesianRDD to reduce repeatedly data fetching #17936
Conversation
Hi @viirya, can you help to review this? I think you are familiar with it, because you tried to solve it before. Also pinging @srowen, @mridulm, @jerryshao.
I will give the results of a comparison on a cluster later.
Looks like there's a similar PR #17898 trying to address this issue; can you please elaborate on the difference compared to that one?
Hi @jerryshao, thanks for your review. In #17898 a buffer is used to cache the data, so the group size has to be controlled very carefully: if it is too small, more fetches are needed; if it is too large, there is a potential OOM. In this PR we use the BlockManager to cache the data, and the cached block can be used by multiple tasks in the same executor. Besides, that patch changed very little.
At first glance, I have several questions:
Cool, you see the
To answer your questions: also, this patch has some shortcomings; the biggest one is deleting the cached block after the
Test build #3708 has finished for PR 17936 at commit
The cluster test results. Environment: Spark on YARN with 9 executors (10 cores & 30 GB memory) on three nodes. Test case:
Test results:
I think @ConeyLiu should directly test the cartesian phase with the following patch: val user = model.userFeatures. A new PR (#17742) for recommendForAll in MLlib ALS has been merged, so your PR may not fit this case.
Yeah, I can test it. You see, the
Hi @jtengyp, the test results are as follows:
This is a complex change to a critical component, so I'm not sure I can review this. Is there no easier way to achieve this? I'm wondering whether it's too good to be true and whether there are downsides or cases where this doesn't work.
val iterator = rdd.iterator(partition, context)
// Keep the read lock, because we need to read it next. And don't tell the master.
blockManager.putIterator[U](blockId2, iterator, level, false, true) match {
  case true =>
BTW I think match is confusing overkill for booleans. Just use if-else
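For reference, a minimal sketch of the suggested rewrite of the snippet above with if/else. It reuses the patch's putIterator call as quoted (with its extra boolean arguments); readCachedBlock is a hypothetical helper standing in for whatever the original `case true` branch did.

```scala
// Sketch only: same logic as the quoted snippet, but branching with if/else
// instead of matching on a Boolean.
val iterator = rdd.iterator(partition, context)
val resultIter =
  if (blockManager.putIterator[U](blockId2, iterator, level, false, true)) {
    // The block was stored locally: read it back (the read lock is still held).
    readCachedBlock(blockId2)   // hypothetical helper for the original `case true` branch
  } else {
    // Storing failed: fall back to a freshly computed/fetched iterator.
    rdd.iterator(partition, context)
  }
```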
val tmpRdd = sc.textFile(tmpFilePath, numPartitions)
tmpRdd.cache()
tmpRdd.count()
Because we cache the RDD in the CartesianRDD compute method, we should count the bytes read from memory here.
sc.parallelize(cartVector).saveAsTextFile(cartFilePath)
val aRdd = sc.textFile(cartFilePath, numPartitions)
aRdd.cache()
aRdd.count()
There is a very strange mistake. If we cache both aRdd and tmpRdd, both this PR and the master branch pass the test. But if we only cache tmpRdd, both branches fail. So caching is temporarily enabled here. I will look into the details of the problem; it may be a bug. If I have understood it wrong, please point it out to me.
@srowen Sorry for the late reply. I updated the code. Because we want to reduce the number of remote fetches, the second partition should be cached locally. There are two ways: first, cache it in the Cartesian, only used in
So I think there should be no other impact.
Hi @squito, @cloud-fan, can you help review this code? Thanks a lot.
val iterator = rdd.iterator(partition, context)
if (rdd.getStorageLevel != StorageLevel.NONE || rdd.isCheckpointedAndMaterialized) {
  // If the block is cached locally, we shouldn't cache it again.
Even if getStorageLevel is not StorageLevel.NONE, we still can't guarantee the block can be successfully cached.
@viirya Thanks for the review. If the storage level of rdd2 is not StorageLevel.NONE, it will be cached in RDD.getOrCompute. So I think we shouldn't cache it again, because blockManager.getOrElseUpdate calls the same method as blockManager.putIterator.
getOrElseUpdate doesn't guarantee the block can be successfully cached. Caching can fail, and in that case it simply returns the iterator.
Yeah, but if it isn't cached locally, the next loop will call the iterator again, and then we will call getOrElseUpdate. You mean we should check whether it is cached and, if not, try to cache it again?
No. I mean that here, if getStorageLevel != StorageLevel.NONE, you assume the block is cached and return the iterator. However, the caching can fail, in which case you are just returning the computed iterator.
Ok, I'll change it, thanks very much.
@ConeyLiu, I would suggest adding a flag to CartesianRDD to specify whether the local cache should be enabled, so the user can choose to enable it or not. Besides, if caching into the BlockManager fails, can we fall back to the original cartesian computation, so that the task will not fail?
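A sketch of what such a flag could look like. The config key is made up for illustration (it is not an existing Spark setting); getOrElseCache is the patch's helper quoted later in the review, and partition2/level are assumed local names.

```scala
import org.apache.spark.SparkEnv

// Hypothetical opt-in flag; the key name is illustrative only.
val localCacheEnabled =
  SparkEnv.get.conf.getBoolean("spark.rdd.cartesian.localCache.enabled", defaultValue = false)

val rightIter =
  if (localCacheEnabled) {
    // Patch path: cache rdd2's partition locally, falling back if the put fails.
    getOrElseCache(rdd2, partition2, context, level)
  } else {
    // Original CartesianRDD behaviour: every task fetches/computes the partition itself.
    rdd2.iterator(partition2, context)
  }
```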
Hi @jerryshao. Good advice. Because here we choose
I see. I think at least we should make this cache mechanism controllable by a flag. I'm guessing that in some HPC clusters or single-node clusters this problem is not so severe.
OK, I'll add it. From the test results, the performance gain is still very obvious, mainly from the reduced network and disk overhead.
How much of a difference does this make compared with caching the two RDDs before doing the cartesian with the current codebase?
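For context, the baseline being asked about is roughly the following (a sketch with placeholder paths):

```scala
// Baseline for comparison: cache both inputs up front, then run the cartesian
// on the current codebase.
val rdd1 = sc.textFile("hdfs:///tmp/a.txt").cache()   // placeholder paths
val rdd2 = sc.textFile("hdfs:///tmp/b.txt").cache()
rdd1.count()   // materialize the caches
rdd2.count()
rdd1.cartesian(rdd2).count()   // each task may still fetch rdd2's cached blocks remotely
```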
var holdReadLock = false
// Try to get the data from the local block manager; otherwise cache it locally.
def getOrElseCache(
Btw, can we move those functions out of compute? There are too many nested functions here, which makes compute too big.
Ok, I will change it too.
I agree with @srowen. This adds quite a bit of complexity. If there is not much difference compared with caching the RDDs before doing the cartesian (or other approaches), it may not be worth such complexity.
I did not directly test this situation, but I have tested this PR compared with the latest
@viirya, this is slightly different from caching the RDD. It is more like broadcasting: the final state is that each executor will hold the whole data of RDD2; the difference is that this is executor-to-executor sync, not driver-to-executor sync. I also have a similar concern: the performance can vary by workload, so we'd better try some different workloads to see whether the improvement is general.
Sorry for the mistake, this test result should be for the cached situation:
Test case:
The RDDs (user and item) are cached.
@jerryshao Yeah, the reason I mentioned caching is to find out how much re-computing the RDD costs in terms of performance. It seems to me that if re-computing is much more costly than transferring the data, caching alone can already be helpful.
@jerryshao As you mentioned broadcasting, another question might be: can we just use broadcasting to achieve similar performance without such changes?
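For reference, the broadcast-based alternative would look roughly like this, assuming rdd2 is small enough to be collected to the driver and held in each executor's memory:

```scala
// Emulate rdd1.cartesian(rdd2) by broadcasting rdd2's data to every executor.
val rdd2Data = sc.broadcast(rdd2.collect())
val product = rdd1.flatMap(x => rdd2Data.value.iterator.map(y => (x, y)))
```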
Seems it should still be better than the original cartesian, since it saves re-computing the RDD and re-transferring the data?
Yeah, I think I can do the performance comparison.
Hi @rxin, would you mind taking a look?
In Spark SQL we have
As we are encouraging users to use Spark SQL as the main programming interface instead of RDD, it seems to me that this patch is not very useful for Spark. BTW I think it's hard to optimize
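For comparison, the Spark SQL route would be a plain cross join through the DataFrame API, where the SQL execution path handles the buffering; a minimal illustration (df1/df2 are made-up inputs, and Dataset.crossJoin is available since Spark 2.1):

```scala
// Cross join through the SQL/DataFrame API instead of RDD.cartesian.
val df1 = spark.range(1000).toDF("a")
val df2 = spark.range(1000).toDF("b")
df1.crossJoin(df2).count()
```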
I can understand that any code change in Spark core will be hard to review due to regression concerns; I think we can leave the PR open for discussion.
This is a good point and we should improve it, but I don't think relying on the block manager is a good idea:
Maybe we can use a hash map to reuse already-fetched partitions in
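A rough sketch of that hash-map idea (all names are hypothetical, not Spark code): an executor-local cache keyed by (rddId, partitionIndex) so that concurrent tasks in the same executor materialize a partition only once. Note it keeps whole partitions on the heap with no eviction or memory accounting, which is the obvious simplification here.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

object FetchedPartitionCache {
  // Executor-local cache: one materialized copy per (rddId, partitionIndex).
  private val cache = new ConcurrentHashMap[(Int, Int), Array[Any]]()

  def getOrFetch[T](rddId: Int, partitionIndex: Int)(fetch: => Iterator[T]): Iterator[T] = {
    val loader = new JFunction[(Int, Int), Array[Any]] {
      override def apply(key: (Int, Int)): Array[Any] = fetch.toArray[Any]
    }
    // computeIfAbsent runs `loader` at most once per key, so concurrent tasks
    // in the same executor share one materialized copy of the partition.
    cache.computeIfAbsent((rddId, partitionIndex), loader).iterator.map(_.asInstanceOf[T])
  }
}
```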
Maybe create a MemoryAndDiskArray, like ExternalAppendOnlyMap? A MemoryAndDiskArray could be used not only here but also in groupByKey, and its memory could be controlled by the MemoryManager.
You are saying to use
I was careless not to notice UnsafeCartesianRDD's ExternalAppendOnlyUnsafeRowArray, that's nice; I had not read all the discussion here... Unifying the solution with UnsafeCartesianRDD would already be a big improvement for CartesianRDD, and it seems simpler and easier to understand... (In our internal change, we adopted a memory-and-disk array to store the GraphX Array[EdgeAttr].) I'm not sure there is a strong optimization requirement to avoid the per-task fetch.
rdd1.cartesian(rdd2): for each task we need to pull all the data of rdd1 (or rdd2) from the cluster. If we have n tasks running in parallel in the same executor, that means we pull the same data n times into the same executor. This can bring serious GC problems and network I/O (and maybe disk I/O if the memory and disk
I'm going to close this PR because it has gone stale; please feel free to reopen it or open another PR if anyone has more thoughts on this issue.
OK, thanks a lot.
What changes were proposed in this pull request?
This patch aims to solve the poor performance of RDD.cartesian. In the original cartesian implementation, each task needs to repeatedly fetch data remotely or recompute it, so performance is poor. In this patch the second partition is cached locally with the BlockManager. There are two advantages:
1. Because we cache it with the BlockManager and set the storage level to MEMORY_AND_DISK, we don't need to worry about OOM caused by a buffer.
2. Many tasks may depend on the same block (the second partition) for their computation, so the block is not removed while other tasks still need it. This reduces the number of fetches or recomputations.
How was this patch tested?
Test environment: 4 executors (10 cores, 30 GB memory) on one node with 4 SSDs + 7 HDDs.
Test case:
Before: 353491.027379
After: 94516.680067