[SPARK-29818][MLLIB] Missing persist on RDD #26454

amanomer · 2019-11-09T19:52:13Z

What changes were proposed in this pull request?

Call persist() on RDDs that are used by more than one action.

Why are the changes needed?

Persisting an RDD prevents it from being recomputed when more than one action is applied to it.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually

amanomer · 2019-11-09T19:53:19Z

@srowen @MaxGekk Kindly review this PR.

AmplabJenkins · 2019-11-09T19:57:06Z

Can one of the admins verify this patch?

MaxGekk · 2019-11-09T20:38:41Z

Persisting intermediate results is not always a good idea, because serialization has a cost, and some vendors can solve the performance issue in other ways, such as IO caching. /cc @gatorsmile @cloud-fan

@amanomer Just in case, do you observe performance improvements on some benchmarks or use cases?

amanomer · 2019-11-10T09:14:00Z

I have not tested this patch against any performance benchmark, but I think these functions are quite generic and most applications/vendors are probably using them. So wouldn't it be better to optimize them, as we already do in other places?

spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

Lines 155 to 158 in 57b954e

    
    val baggedInput = BaggedPoint
      .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
        (tp: TreePoint) => tp.weight, seed = seed)
      .persist(StorageLevel.MEMORY_AND_DISK)

spark/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

Line 227 in 57b954e

cumulativeCounts.persist()

@MaxGekk Kindly correct me if I am wrong. Thanks

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

srowen

Most of these don't look like obvious candidates to force a persist. Which ones would you argue are actually a performance gain?

mllib/src/main/scala/org/apache/spark/ml/evaluation/MultilabelClassificationEvaluator.scala

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

Icysandwich · 2019-11-11T02:54:07Z

Now that persist() is added on some RDDs, I think the related RDD.unpersist() should also be added.

amanomer · 2019-11-11T13:43:21Z

@srowen There are some JIRA tickets created for unnecessary use of persist and improper persist strategy (like 29844, 29832). Should they be handled in this PR as well?

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala

mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

amanomer · 2019-11-13T05:36:20Z

mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

      createCombiner = (labelAndWeight: (Double, Double)) =>
        new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
      mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
        c += (labelAndWeight._1, labelAndWeight._2),
      mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    )
+    binnedWeights.persist()


The binnedWeights RDD needs to be persisted: sortByKey() is applied at line 176 and collect() at line 216, and it is not already persisted.

This has to be unpersisted though. We also customarily only persist if the user input was persisted, as a sort of signal that the user is OK consuming memory/disk. For small inputs this may not be worth it so maybe the user doesn't want the overhead of persisting.

srowen · 2019-11-13T16:27:27Z

mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

-    ).sortByKey(ascending = false)
+    )
+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()


This still isn't unpersisted

Do you mean it should be unpersisted after use?

Yes, otherwise the caller has no way to unpersist it until it's GCed.
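For illustration, the convention described in this thread (persist only when the caller's input is already persisted, and unpersist after use) might look like the sketch below. `countAndSum` and the sample data are hypothetical names invented for this example; they are not part of this PR or of Spark. The calls on the RDD itself (`getStorageLevel`, `persist`, `unpersist`, `count`, `sum`) are Spark's public RDD API.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: runs two actions on one derived RDD.
def countAndSum(input: RDD[(Double, Double)]): (Long, Double) = {
  val weighted = input.map { case (score, weight) => score * weight }
  // Follow the caller's lead: cache only if the caller cached its input.
  val shouldPersist = input.getStorageLevel != StorageLevel.NONE
  if (shouldPersist) weighted.persist()
  try {
    // Two actions on the same RDD, so a cached copy can be reused.
    (weighted.count(), weighted.sum())
  } finally {
    // Release the cache here; the caller never sees `weighted`
    // and so could not unpersist it.
    if (shouldPersist) weighted.unpersist()
  }
}

// Usage on a local session (assumed setup, for illustration only).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("conditional-persist-sketch")
  .getOrCreate()
val scored = spark.sparkContext
  .parallelize(Seq((0.9, 1.0), (0.4, 2.0), (0.1, 1.0)))
  .persist()
val (n, total) = countAndSum(scored)
```

The try/finally keeps the cached blocks from outliving the method even if one of the actions throws.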

srowen · 2019-11-13T17:09:18Z

mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

+    if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
+      binnedWeights.persist()
+    }
+    val counts = binnedWeights.sortByKey(ascending = false)


Wait, hm, I don't understand this. You persist binnedWeights, but it is now only used once. Why? If anything it's binnedCounts that needs persisting. I'm still not clear if it makes enough difference to matter.

binnedCounts is a child RDD of binnedWeights. And here one action sortByKey is performed on binnedWeights.

Yes, but why bother persisting binnedWeights? You recompute everything between it and binnedCounts twice, when I think avoiding that would be the point.

I think binnedWeights is required to be persisted because more than one action is getting applied here.

binnedWeights
      |
      | sortByKey (action)
      V
counts
      |
      | count (action)
      V
binnedCounts (on which action collect is applied to compute agg)

I might be wrong here. Kindly correct me @srowen

Caching helps where more than one action is performed on the same RDD. That's not the case here. Each of the first two has one thing executed on it. sortByKey is not an action, anyway.
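A small experiment can make this concrete. The sketch below assumes a local Spark session and uses a hypothetical helper, `evaluations`, to count how often a map function runs when two actions are applied to the same RDD, with and without persist. Accumulator updates made inside transformations are not guaranteed to be exactly-once if tasks are re-executed, so the numbers are indicative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local setup, for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("persist-recompute-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical helper: returns how many times the map function ran
// across two actions on the same RDD.
def evaluations(cache: Boolean): Long = {
  val calls = sc.longAccumulator("calls")
  val mapped = sc.parallelize(1 to 100, 2).map { x => calls.add(1); x * 2 }
  if (cache) mapped.persist()
  mapped.count()    // first action
  mapped.collect()  // second action on the same RDD
  if (cache) mapped.unpersist()
  calls.value
}

val withoutPersist = evaluations(cache = false)
val withPersist = evaluations(cache = true)
println(s"map evaluations without persist: $withoutPersist, with persist: $withPersist")
```

In a local run without task retries, the uncached variant should evaluate the function once per element per action, while the cached variant should evaluate it once per element in total. Whether that saving outweighs the cost of caching is a separate question that needs measuring on a real workload.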

Oh, okay. One question here: would it be worth persisting counts, since the actions count and collect are applied directly to it?

Doesn't seem so. But that is the question I'd put to you in these cases - are you sure it makes a difference meaningful enough to overcome the overhead? I could imagine so here, just wondering if these are based on more investigation or benchmarking, vs just trying to persist lots of things.

are you sure it makes a difference meaningful enough to overcome the overhead?

I think not. Persisting counts doesn't make sense here; it would just be overhead. Now I am getting a clearer picture of where to use persist.
Key learnings from this PR about persist:

persist introduces memory and CPU overhead.

So only important inputs (such as intermediate results, user data that is already cached, etc.) or RDDs on which more than one action is performed should be persisted.

Avoid using persist in a loop.

Persisting should bring enough benefit to outweigh its overhead.

TYSM @srowen. Looking forward to more learning opportunities.

amanomer · 2019-11-13T18:47:58Z

@srowen Could you kindly help review PR #26317?

srowen · 2019-11-16T16:19:28Z

I don't see how that PR is related?

amanomer · 2019-11-16T17:29:33Z

I know it's not related, but I would like your review on it.

dongjoon-hyun added ML MLLIB labels Nov 9, 2019

amanomer commented Nov 10, 2019

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

srowen reviewed Nov 10, 2019

mllib/src/main/scala/org/apache/spark/ml/evaluation/MultilabelClassificationEvaluator.scala

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

amanomer added 2 commits November 11, 2019 20:57

Initial commit

4acb797

Add persist for more files SPARK-29828, SPARK-29826

fa920f7

amanomer commented Nov 12, 2019

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

amanomer commented Nov 12, 2019

mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala