
[SPARK-18608][ML] Fix double-caching in ML algorithms #17014

Closed
wants to merge 6 commits

Conversation

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Feb 21, 2017

What changes were proposed in this pull request?

1. For Predictors, add a `protected[spark] var storageLevel` to store the storage level of the original dataset.
2. Use `dataset.storageLevel` instead of `dataset.rdd.getStorageLevel` to avoid double caching.
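The fix boils down to checking the Dataset's own storage level rather than materializing its RDD. A minimal sketch of the pattern, assuming Spark's `Dataset` and `StorageLevel` APIs (`shouldPersist` is a hypothetical helper name; the real code inlines the check):

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// `dataset.rdd` produces a fresh RDD whose storage level is always NONE,
// so the old check `dataset.rdd.getStorageLevel == StorageLevel.NONE`
// was always true and re-persisted already-cached input.
// `dataset.storageLevel` reads the Dataset's actual storage level instead.
def shouldPersist(dataset: Dataset[_]): Boolean =
  dataset.storageLevel == StorageLevel.NONE
```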

How was this patch tested?

Existing tests

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73209 has finished for PR 17014 at commit 39210f5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73213 has finished for PR 17014 at commit 1457751.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73214 has finished for PR 17014 at commit 845512a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73215 has finished for PR 17014 at commit bc314b6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73217 has finished for PR 17014 at commit 0b26a1c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @return Fitted model
*/
protected def train(dataset: Dataset[_]): M
protected def train(dataset: Dataset[_], handlePersistence: Boolean): M
Member

Isn't this going to break external implementers?

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73219 has finished for PR 17014 at commit 0298f0c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Contributor

hhbyyh commented Feb 21, 2017

It's better if we can fix this without breaking API. Let's allow some time to see if there's a better solution.
Meanwhile, if we have to add the new parameter, can we set some default value?

@zhengruifeng
Contributor Author

@srowen @hhbyyh You are right. I will update this without breaking train. Thanks for pointing it out!

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73252 has finished for PR 17014 at commit b81eeb7.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73253 has finished for PR 17014 at commit a3f3bb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-18608][ML][WIP] Fix double-caching in ML algorithms [SPARK-18608][ML] Fix double-caching in ML algorithms Feb 23, 2017
@zhengruifeng
Contributor Author

ping @hhbyyh ?

@hhbyyh
Contributor

hhbyyh commented Mar 17, 2017

Hi @zhengruifeng, how can I help? I thought you planned to send an update according to your suggestion in the JIRA?

@zhengruifeng
Contributor Author

@hhbyyh I think I misunderstood your comments in the JIRA. I will update this PR with the new plan: directly add a protected var storageLevel in Predictor, without adding a setter and getter for it for now.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74921 has finished for PR 17014 at commit fca364d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74925 has finished for PR 17014 at commit 9c58cb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74932 has finished for PR 17014 at commit 8aefaf1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74935 has finished for PR 17014 at commit b76438c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

ping @hhbyyh
I updated the PR; could you please help review it? Thanks in advance.

@hhbyyh
Contributor

hhbyyh commented Mar 22, 2017

I'm trying to refresh my memory and clarify the goals on this topic; basically we want to achieve:

  1. Avoid double caching. If the input Dataset is already cached, we should not cache the internal RDD.

  2. If the input Dataset is not cached, some algorithms may need internal RDD caching to avoid the warning from MLlib and to avoid unnecessary re-computation. But I'm not sure about the scope. (Should we add this for all the algorithms? This is a behavior change for many of them.) I don't think we have an ideal way to detect whether a Dataset should be cached (its parent may already be cached), so I'm not sure we should act on that speculative condition.

  3. Avoid public API changes.

Let me know if I missed anything. I'll take a look at the code now.

Contributor

@hhbyyh hhbyyh left a comment

Thanks for the update. It's good to see no API breaks. Two comments from me:

  1. Perhaps we can find some way to reduce code duplication and increase flexibility. It may also make sense to define a trait for this, if the scope is not limited to Predictor.

  2. About the scope, I'm not sure we should extend it to so many algorithms. I'm leaning towards limiting it to the algorithms in ml that already have the handlePersistence logic.

@@ -110,12 +111,17 @@ class DecisionTreeClassifier @Since("1.4.0") (
val oldDataset: RDD[LabeledPoint] = extractLabeledPoints(dataset, numClasses)
val strategy = getOldStrategy(categoricalFeatures, numClasses)

val instr = Instrumentation.create(this, oldDataset)
val handlePersistence = storageLevel == StorageLevel.NONE
if (handlePersistence) oldDataset.persist(StorageLevel.MEMORY_AND_DISK)
Contributor

`storageLevel == StorageLevel.NONE` could be a function defined in Predictor to avoid code duplication.

I'm not sure we always want to use StorageLevel.MEMORY_AND_DISK, but it would be good to have some flexibility. How about adding a field to represent StorageLevel.MEMORY_AND_DISK?
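One possible shape for this suggestion, sketched under the assumption that both the check and the default level live in Predictor (the class name and all members except `storageLevel` are hypothetical, not the merged code):

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: shared state and helpers a Predictor subclass could reuse.
abstract class PredictorPersistenceSketch {
  // Storage level of the original input dataset, captured before training.
  protected var storageLevel: StorageLevel = StorageLevel.NONE

  // Shared check replacing the duplicated `storageLevel == StorageLevel.NONE`.
  protected def handlePersistence: Boolean = storageLevel == StorageLevel.NONE

  // Overridable field giving the flexibility asked for above, instead of
  // hard-coding StorageLevel.MEMORY_AND_DISK at every call site.
  protected def intermediateStorageLevel: StorageLevel =
    StorageLevel.MEMORY_AND_DISK
}
```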

@zhengruifeng
Contributor Author

@hhbyyh Thanks for your comments! And sorry for the late reply.
I updated this PR to: 1. limit the scope, only modifying the algorithms where double caching already exists;
2. add a handlePersistence function to avoid code duplication.

@SparkQA

SparkQA commented Mar 31, 2017

Test build #75414 has finished for PR 17014 at commit 52a7b65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

ping @MLnick Can you help reviewing this?

@zhengruifeng
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented May 9, 2017

Test build #76663 has finished for PR 17014 at commit 52a7b65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor

@smurching Yes, this should be added as an ml.Param; we should not add it as an argument.
@zhengruifeng Would you mind updating the PR according to our discussion above?
Make handlePersistence an ml.Param (added to these algorithms, with a default value of true).
And we don't need to modify Predictor or any other public interface for now.

@zhengruifeng
Contributor Author

zhengruifeng commented Sep 1, 2017

@WeichenXu123 Sounds good. And since adding handlePersistence as an ml.Param may influence many algorithms (more than those in this PR), I think we may need more discussion @MLnick @yanboliang @hhbyyh

And if we add handlePersistence, should we also add a param intermediateStorageLevel to let users choose the storage level (like ALS does)?
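If handlePersistence were exposed as an ml.Param as discussed, a shared param trait might look roughly like this (a hypothetical sketch mirroring the HasXxx traits in ml.param.shared; the actual param work was later split out into SPARK-21972):

```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

// Hypothetical shared trait; the name and doc string are illustrative.
trait HasHandlePersistence extends Params {
  final val handlePersistence: BooleanParam = new BooleanParam(this,
    "handlePersistence",
    "whether to persist the internal RDD when the input dataset is not cached")
  setDefault(handlePersistence -> true)

  final def getHandlePersistence: Boolean = $(handlePersistence)
}
```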

@jkbradley
Member

Thanks all for discussing this! I'm just catching up now.

I'm OK with adding handlePersistence as a new Param, but please do so in a separate PR and JIRA issue. I'd like to backport the KMeans fix since it's a bug causing a performance regression, but we should not backport the new Param API.

@WeichenXu123
Contributor

@zhengruifeng @jkbradley I created PR #19107 as a quick fix for the KMeans performance regression bug.
This PR can continue the work of adding the handlePersistence Param, which is not as urgent.
Thanks!

if (handlePersistence) {
instances.persist(StorageLevel.MEMORY_AND_DISK)
}
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
Member

There are many similar changes from a block to a single line. We can avoid such changes; they're not necessary.

Contributor Author

OK, I will revert those small changes.

@zhengruifeng
Contributor Author

zhengruifeng commented Sep 4, 2017

@WeichenXu123 @jkbradley I am curious why ml.KMeans is special enough to need a separate PR. It seems the performance regression also exists in other algorithms.

@WeichenXu123
Contributor

@zhengruifeng The KMeans fix is regarded as a bugfix (SPARK-21799) because the double-caching issue was introduced in 2.2 and caused a performance regression.
The other algorithms have the same issue, but it has existed in them for a long time and they are tied to Predictor, so it is not as easy to fix; we can leave them here for more discussion.

@SparkQA

SparkQA commented Sep 4, 2017

Test build #81372 has finished for PR 17014 at commit df4d263.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 4, 2017

Test build #81373 has finished for PR 17014 at commit 971e52c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 4, 2017

Test build #81376 has finished for PR 17014 at commit f8fa957.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Sep 5, 2017

Test build #81393 has finished for PR 17014 at commit f8fa957.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@smurching
Contributor

Hi @zhengruifeng, thanks for your work on this!

Now that we're introducing a new handlePersistence parameter (a new public API), it'd be good to track work in a separate JIRA/PR as @jkbradley suggested so others are aware of the proposed change.

I've created a new JIRA ticket for adding the handlePersistence param: SPARK-21972. Would you mind resubmitting your work as a new PR that addresses it?

Thanks & sorry for the inconvenience!

@zhengruifeng
Contributor Author

@smurching OK, I will close this PR and resubmit it to the new ticket.

asfgit pushed a commit that referenced this pull request Sep 12, 2017
## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`

using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algs involved in this issue are fixed.

Previous discussion in other PRs: #19107, #17014

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19197 from zhengruifeng/double_caching.

(cherry picked from commit c5f9b89)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 12, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018