[SPARK-18608][ML] Fix double-caching in ML algorithms #17014
Conversation
Test build #73209 has finished for PR 17014 at commit
Test build #73213 has finished for PR 17014 at commit
Test build #73214 has finished for PR 17014 at commit
Test build #73215 has finished for PR 17014 at commit
Test build #73217 has finished for PR 17014 at commit
   * @return Fitted model
   */
  protected def train(dataset: Dataset[_]): M
  protected def train(dataset: Dataset[_], handlePersistence: Boolean): M
Isn't this going to break external implementers?
Test build #73219 has finished for PR 17014 at commit
It's better if we can fix this without breaking the API. Let's allow some time to see if there's a better solution.
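For illustration, here is a minimal sketch (not the PR's actual code; the class name and body are assumptions) of how a two-argument `train` overload could be introduced without breaking external implementers: the existing abstract `train(dataset)` stays as-is, and the new variant gets a concrete default that delegates to it.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Hypothetical illustration only; not the actual Predictor in spark.ml.
abstract class ExampleLearner[M] {
  // The existing abstract method that external subclasses already implement.
  protected def train(dataset: Dataset[_]): M

  // A concrete overload with a default body: existing subclasses keep compiling
  // unchanged, and library code can opt in to persistence handling.
  protected def train(dataset: Dataset[_], handlePersistence: Boolean): M = {
    if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      train(dataset)
    } finally {
      if (handlePersistence) dataset.unpersist()
    }
  }
}
```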
Force-pushed from 0298f0c to b81eeb7.
Test build #73252 has finished for PR 17014 at commit
Test build #73253 has finished for PR 17014 at commit
ping @hhbyyh ?

Hi @zhengruifeng, how can I help? I thought you planned to send an update based on your suggestion in JIRA?

@hhbyyh I think I misunderstood your comments in JIRA. I will update this PR with the new plan: directly add
Force-pushed from a3f3bb6 to fca364d.
Test build #74921 has finished for PR 17014 at commit
Test build #74925 has finished for PR 17014 at commit
Test build #74932 has finished for PR 17014 at commit
Test build #74935 has finished for PR 17014 at commit
ping @hhbyyh

I'm trying to refresh my memory and clarify the goals of this topic; basically we want to achieve:

Let me know if I missed anything. I'll take a look at the code now.
Thanks for the update. It's good to see no API breaks. Two comments from me:
- Perhaps we can find some way to reduce the code duplication and increase flexibility. It may also make sense to define a trait for this, if the scope is not limited to Predictor only.
- About the scope, I'm not sure we should extend it to so many algorithms. I'm leaning towards limiting it to the algorithms in `ml` which already have the handlePersistence logic.
@@ -110,12 +111,17 @@ class DecisionTreeClassifier @Since("1.4.0") (
    val oldDataset: RDD[LabeledPoint] = extractLabeledPoints(dataset, numClasses)
    val strategy = getOldStrategy(categoricalFeatures, numClasses)

    val instr = Instrumentation.create(this, oldDataset)
    val handlePersistence = storageLevel == StorageLevel.NONE
    if (handlePersistence) oldDataset.persist(StorageLevel.MEMORY_AND_DISK)
`storageLevel == StorageLevel.NONE` could be a function defined in `Predictor` to avoid code duplication. I'm not sure if we always want to use `StorageLevel.MEMORY_AND_DISK`, but it would be good to have some flexibility. How about adding a field to represent `StorageLevel.MEMORY_AND_DISK`?
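One way to read this suggestion is a small shared trait. Everything below (the trait and method names, the overridable `intermediateStorageLevel`) is a hypothetical sketch, not the PR's actual API:

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Hypothetical helper trait: a shared check for "the caller did not cache the
// input" plus an overridable storage level, instead of repeating the snippet
// and hard-coding MEMORY_AND_DISK in every algorithm.
trait PersistenceHelper {
  // Overridable default level for intermediate caching.
  protected def intermediateStorageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK

  // Shared replacement for the repeated `storageLevel == StorageLevel.NONE` checks.
  protected def shouldHandlePersistence(dataset: Dataset[_]): Boolean =
    dataset.storageLevel == StorageLevel.NONE

  // Persist only when the input is not already cached; returns whether we did.
  protected def persistIfNeeded(dataset: Dataset[_]): Boolean = {
    val handle = shouldHandlePersistence(dataset)
    if (handle) dataset.persist(intermediateStorageLevel)
    handle
  }
}
```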
Force-pushed from b76438c to 52a7b65.
@hhbyyh Thanks for your comments! And sorry for the late reply.

Test build #75414 has finished for PR 17014 at commit
ping @MLnick Can you help review this?

Jenkins, retest this please

Test build #76663 has finished for PR 17014 at commit
@smurching Yes, this should be added as a `Param`.

@WeichenXu123 Sounds good. And since adding And if we add

Thanks all for discussing this! I'm just catching up now. I'm OK with adding handlePersistence as a new Param, but please do so in a separate PR and JIRA issue. I'd like to backport the KMeans fix since it's a bug causing a performance regression, but we should not backport the new Param API.

@zhengruifeng @jkbradley I created PR #19107 as a quick fix.
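For reference, a sketch of what exposing handlePersistence as a Spark ML `Param` could look like; the trait name, doc string, and default value here are assumptions for illustration, not the design that was deferred to the follow-up JIRA:

```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

// Hypothetical shared-param trait; names and the default value are assumptions.
trait HasHandlePersistence extends Params {
  final val handlePersistence: BooleanParam = new BooleanParam(this, "handlePersistence",
    "whether the algorithm should cache the input dataset when it is not already cached")

  setDefault(handlePersistence -> true)

  final def getHandlePersistence: Boolean = $(handlePersistence)
}
```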
    if (handlePersistence) {
      instances.persist(StorageLevel.MEMORY_AND_DISK)
    }
    if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
There are many similar changes from a block to a one-liner. We can avoid such changes; they're not necessary.
OK, I will revert those small changes
@WeichenXu123 @jkbradley I am curious about why

@zhengruifeng
Force-pushed from 513a2ef to df4d263.
Test build #81372 has finished for PR 17014 at commit
Test build #81373 has finished for PR 17014 at commit
Test build #81376 has finished for PR 17014 at commit

Jenkins, retest this please

Test build #81393 has finished for PR 17014 at commit
Hi @zhengruifeng, thanks for your work on this! Now that we're introducing a new handlePersistence parameter (a new public API), it'd be good to track work in a separate JIRA/PR as @jkbradley suggested so others are aware of the proposed change. I've created a new JIRA ticket for adding the handlePersistence param here: SPARK-21972. Would you mind resubmitting your work as a new PR that addresses the new JIRA ticket (SPARK-21972)? Thanks & sorry for the inconvenience!

@smurching OK, I will close this PR and resubmit it to the new ticket.
## What changes were proposed in this pull request?

`df.rdd.getStorageLevel` => `df.storageLevel`

Using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algs involved in this issue are fixed.

Previous discussion in other PRs: #19107, #17014

## How was this patch tested?

Existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19197 from zhengruifeng/double_caching.

(cherry picked from commit c5f9b89)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
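To make the pitfall behind this change concrete: `df.rdd` materializes a new RDD from the query plan, so its storage level reports NONE even when the DataFrame itself is cached, which is what led algorithms to persist a second copy. A small standalone sketch (the object name and local-mode setup are just for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DoubleCachingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("double-caching-demo").getOrCreate()

    val df = spark.range(0, 1000).toDF("id")
    df.persist(StorageLevel.MEMORY_AND_DISK)   // the user already cached the input

    // Misleading check: df.rdd is a freshly created RDD, so this reports NONE,
    // and an algorithm relying on it would persist the data a second time.
    println(df.rdd.getStorageLevel)

    // The check this change switches to: reflects the persist call on the Dataset itself.
    println(df.storageLevel)

    spark.stop()
  }
}
```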
What changes were proposed in this pull request?

1. For Predictors, add `protected[spark] var storageLevel` to store the storage level of the original dataset.
2. Use `dataset.storageLevel` instead of `dataset.rdd.getStorageLevel` to avoid double caching.

How was this patch tested?

Existing tests
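A condensed sketch of how these two pieces fit together in a fit/train flow; the class and method names below are assumptions for illustration, not the PR's exact code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Hypothetical predictor skeleton illustrating the described approach.
abstract class ExamplePredictor[M] {
  // (1) Remember the original dataset's storage level.
  protected var storageLevel: StorageLevel = StorageLevel.NONE

  // Algorithm-specific pieces, left abstract for the sketch.
  protected def extractInstances(dataset: Dataset[_]): RDD[(Double, Array[Double])]
  protected def trainImpl(instances: RDD[(Double, Array[Double])]): M

  def fit(dataset: Dataset[_]): M = {
    // (2) Use dataset.storageLevel, not dataset.rdd.getStorageLevel.
    storageLevel = dataset.storageLevel
    val instances = extractInstances(dataset)

    // Persist the extracted instances only if the user did not cache the input.
    val handlePersistence = storageLevel == StorageLevel.NONE
    if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      trainImpl(instances)
    } finally {
      if (handlePersistence) instances.unpersist()
    }
  }
}
```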