
[SPARK-18608][ML] Fix double-caching in ML algorithms #17014

Closed
wants to merge 6 commits

Conversation

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Feb 21, 2017

What changes were proposed in this pull request?

1. For Predictors, add a `protected[spark] var storageLevel` to store the storage level of the original dataset.
2. Use `dataset.storageLevel` instead of `dataset.rdd.getStorageLevel` to avoid double caching.
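The fix boils down to checking the Dataset's own storage level rather than materializing its RDD. A minimal sketch of the pattern, assuming Spark's `Dataset` and `StorageLevel` APIs (`shouldPersist` is a hypothetical helper name; the real code inlines the check):

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// `dataset.rdd` produces a fresh RDD whose storage level is always NONE,
// so the old check `dataset.rdd.getStorageLevel == StorageLevel.NONE`
// was always true and re-persisted already-cached input.
// `dataset.storageLevel` reads the Dataset's actual storage level instead.
def shouldPersist(dataset: Dataset[_]): Boolean =
  dataset.storageLevel == StorageLevel.NONE
```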

How was this patch tested?

Existing tests

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73209 has finished for PR 17014 at commit 39210f5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73213 has finished for PR 17014 at commit 1457751.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73214 has finished for PR 17014 at commit 845512a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73215 has finished for PR 17014 at commit bc314b6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73217 has finished for PR 17014 at commit 0b26a1c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @return Fitted model
*/
protected def train(dataset: Dataset[_]): M
protected def train(dataset: Dataset[_], handlePersistence: Boolean): M
Member

Isn't this going to break external implementers?

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73219 has finished for PR 17014 at commit 0298f0c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Contributor

hhbyyh commented Feb 21, 2017

It's better if we can fix this without breaking API. Let's allow some time to see if there's a better solution.
Meanwhile, if we have to add the new parameter, can we set some default value?

@zhengruifeng
Contributor Author

@srowen @hhbyyh You are right. I will update this without breaking train. Thanks for pointing it out!

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73252 has finished for PR 17014 at commit b81eeb7.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73253 has finished for PR 17014 at commit a3f3bb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-18608][ML][WIP] Fix double-caching in ML algorithms [SPARK-18608][ML] Fix double-caching in ML algorithms Feb 23, 2017
@zhengruifeng
Contributor Author

ping @hhbyyh ?

@hhbyyh
Contributor

hhbyyh commented Mar 17, 2017

Hi @zhengruifeng, how can I help? I thought you planned to send an update according to your suggestion in the JIRA?

@zhengruifeng
Contributor Author

@hhbyyh I think I misunderstood your comments in the JIRA. I will update this PR with the new plan: directly add a protected var storageLevel in Predictor, without adding a setter and getter for it for now.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74921 has finished for PR 17014 at commit fca364d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74925 has finished for PR 17014 at commit 9c58cb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74932 has finished for PR 17014 at commit 8aefaf1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74935 has finished for PR 17014 at commit b76438c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

ping @hhbyyh
I updated the PR; could you please help review it? Thanks in advance.

@hhbyyh
Contributor

hhbyyh commented Mar 22, 2017

I'm trying to refresh my memory and clarify the goals on this topic; basically we want to achieve:

  1. Avoid double caching. If the input Dataset is already cached, we should not cache the internal RDD.

  2. If the input Dataset is not cached, some algorithms may need internal RDD caching to avoid the warning from MLlib and to avoid unnecessary re-computation. But I'm not sure about the scope. (Should we add this for all the algorithms? This is a behavior change for many of them.) I don't think we have an ideal way to detect whether a Dataset should be cached (its parent may already be cached), so I'm not sure we should act on that speculative condition.

  3. Avoid public API changes.

Let me know if I missed anything. I'll take a look at the code now.

Contributor

@hhbyyh hhbyyh left a comment

Thanks for the update. It's good to see no API breaks. Two comments from me:

  1. Perhaps we can find some way to reduce code duplication and increase flexibility. It may also make sense to define a trait for this, if the scope is not limited to Predictor.

  2. About the scope, I'm not sure we should extend it to so many algorithms. I'm leaning towards limiting it to the algorithms in ml that already have the handlePersistence logic.

@@ -110,12 +111,17 @@ class DecisionTreeClassifier @Since("1.4.0") (
val oldDataset: RDD[LabeledPoint] = extractLabeledPoints(dataset, numClasses)
val strategy = getOldStrategy(categoricalFeatures, numClasses)

val instr = Instrumentation.create(this, oldDataset)
val handlePersistence = storageLevel == StorageLevel.NONE
if (handlePersistence) oldDataset.persist(StorageLevel.MEMORY_AND_DISK)
Contributor

`storageLevel == StorageLevel.NONE` could be a function defined in Predictor to avoid code duplication.

I'm not sure we always want to use StorageLevel.MEMORY_AND_DISK, but it would be good to have some flexibility. How about adding a field to represent StorageLevel.MEMORY_AND_DISK?
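One possible shape for this suggestion, sketched under the assumption that both the check and the default level live in Predictor (the class name and all members except `storageLevel` are hypothetical, not the merged code):

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: shared state and helpers a Predictor subclass could reuse.
abstract class PredictorPersistenceSketch {
  // Storage level of the original input dataset, captured before training.
  protected var storageLevel: StorageLevel = StorageLevel.NONE

  // Shared check replacing the duplicated `storageLevel == StorageLevel.NONE`.
  protected def handlePersistence: Boolean = storageLevel == StorageLevel.NONE

  // Overridable field giving the flexibility asked for above, instead of
  // hard-coding StorageLevel.MEMORY_AND_DISK at every call site.
  protected def intermediateStorageLevel: StorageLevel =
    StorageLevel.MEMORY_AND_DISK
}
```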

@zhengruifeng
Contributor Author

@hhbyyh Thanks for your comments! And sorry for the late reply.
I updated this PR to: 1. limit the scope, only modifying the algorithms where double caching already exists;
2. add a handlePersistence function to avoid code duplication.

@SparkQA

SparkQA commented Mar 31, 2017

Test build #75414 has finished for PR 17014 at commit 52a7b65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

ping @MLnick Can you help reviewing this?

@zhengruifeng
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented May 9, 2017

Test build #76663 has finished for PR 17014 at commit 52a7b65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor

@smurching Yes, this should be added as an ml.Param; we should not add it as an argument.
@zhengruifeng Would you mind updating the PR according to our discussion above?
Make handlePersistence an ml.Param (added to these algorithms, with a default value of true).
And we don't need to modify Predictor or any other public interface for now.

@zhengruifeng
Contributor Author

zhengruifeng commented Sep 1, 2017

@WeichenXu123 Sounds good. And since adding handlePersistence as an ml.Param may influence many algorithms (more than those in this PR), I think we may need more discussion @MLnick @yanboliang @hhbyyh

And if we add handlePersistence, should we also add a param intermediateStorageLevel to let users choose the storage level (like ALS does)?
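If handlePersistence were exposed as an ml.Param as discussed, a shared param trait might look roughly like this (a hypothetical sketch mirroring the HasXxx traits in ml.param.shared; the actual param work was later split out into SPARK-21972):

```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

// Hypothetical shared trait; the name and doc string are illustrative.
trait HasHandlePersistence extends Params {
  final val handlePersistence: BooleanParam = new BooleanParam(this,
    "handlePersistence",
    "whether to persist the internal RDD when the input dataset is not cached")
  setDefault(handlePersistence -> true)

  final def getHandlePersistence: Boolean = $(handlePersistence)
}
```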

@jkbradley
Member

Thanks all for discussing this! I'm just catching up now.

I'm OK with adding handlePersistence as a new Param, but please do so in a separate PR and JIRA issue. I'd like to backport the KMeans fix since it's a bug causing a performance regression, but we should not backport the new Param API.

@WeichenXu123
Contributor

@zhengruifeng @jkbradley I created PR #19107 as a quick fix for the KMeans performance regression bug.
This PR can continue the work of adding the handlePersistence Param, which is not as urgent.
Thanks!

if (handlePersistence) {
instances.persist(StorageLevel.MEMORY_AND_DISK)
}
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
Member

There are many similar changes from a block to a single line. We can avoid such changes; they're not necessary.

Contributor Author

OK, I will revert those small changes.

@zhengruifeng
Contributor Author

zhengruifeng commented Sep 4, 2017

@WeichenXu123 @jkbradley I am curious why ml.KMeans is special enough to need a separate PR. It seems the performance regression also exists in other algorithms.

@WeichenXu123
Contributor

@zhengruifeng The KMeans fix is regarded as a bugfix (SPARK-21799) because the double-caching issue was introduced in 2.2 and caused a performance regression.
The other algorithms have the same issue, but it has existed in them for a long time and they are tied to Predictor, so it is not as easy to fix; we can leave them here for more discussion.

@SparkQA

SparkQA commented Sep 4, 2017

Test build #81372 has finished for PR 17014 at commit df4d263.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 4, 2017

Test build #81373 has finished for PR 17014 at commit 971e52c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 4, 2017

Test build #81376 has finished for PR 17014 at commit f8fa957.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Sep 5, 2017

Test build #81393 has finished for PR 17014 at commit f8fa957.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@smurching
Contributor

Hi @zhengruifeng, thanks for your work on this!

Now that we're introducing a new handlePersistence parameter (a new public API), it'd be good to track work in a separate JIRA/PR as @jkbradley suggested so others are aware of the proposed change.

I've created a new JIRA ticket for adding the handlePersistence param: SPARK-21972. Would you mind resubmitting your work as a new PR that addresses it?

Thanks & sorry for the inconvenience!

@zhengruifeng
Contributor Author

@smurching OK, I will close this PR and resubmit it to the new ticket.

asfgit pushed a commit that referenced this pull request Sep 12, 2017
## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`

using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algs involved in this issue are fixed.

Previous discussion in other PRs: #19107, #17014

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19197 from zhengruifeng/double_caching.

(cherry picked from commit c5f9b89)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 12, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018