[SPARK-29818][MLLIB] Missing persist on RDD #26454
Can one of the admins verify this patch?
Persisting intermediate results is not always beneficial: serialization has a cost, and some vendors can address the performance issue in other ways, such as IO caching. /cc @gatorsmile @cloud-fan
@amanomer Just in case, did you observe performance improvements on any benchmarks or use cases?
I have not tested this patch against any performance benchmark, but these functions are quite generic, so most applications and vendors are likely using them. Wouldn't it be better to optimize them the way we already do in other places?
spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala, lines 155 to 158 in 57b954e
spark/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala, line 227 in 57b954e
@MaxGekk Kindly correct me if I am wrong. Thanks.
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
Most of these don't look like obvious candidates to force a persist. Which ones would you argue are actually a performance gain?
mllib/src/main/scala/org/apache/spark/ml/evaluation/MultilabelClassificationEvaluator.scala
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
Now that persist() has been added on some RDDs, I think the corresponding RDD.unpersist() calls should also be added.
mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
  createCombiner = (labelAndWeight: (Double, Double)) =>
    new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
  mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
    c += (labelAndWeight._1, labelAndWeight._2),
  mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
).sortByKey(ascending = false)
)
binnedWeights.persist()
The binnedWeights RDD needs to be persisted: sortByKey() is applied to it at line 176 and collect() at line 216. Also, it is not already persisted.
This has to be unpersisted though. We also customarily only persist if the user input was persisted, as a sort of signal that the user is OK consuming memory/disk. For small inputs this may not be worth it so maybe the user doesn't want the overhead of persisting.
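The convention described in this comment (persist a derived RDD only when the user's input is already persisted, and release it after use) could be sketched roughly as follows. This is an illustration, not code from the patch: the object name, the helper `withPersistIfInputPersisted`, and the sample data are hypothetical, and only standard RDD calls (`getStorageLevel`, `persist`, `unpersist`) are assumed.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ConditionalPersistSketch {
  // Hypothetical helper (not part of Spark): persists `derived` only if the
  // user-supplied `input` is already persisted, runs `body`, and always
  // unpersists whatever it persisted itself.
  def withPersistIfInputPersisted[T, R](input: RDD[_], derived: RDD[T])(body: RDD[T] => R): R = {
    val shouldPersist =
      input.getStorageLevel != StorageLevel.NONE &&
        derived.getStorageLevel == StorageLevel.NONE
    if (shouldPersist) derived.persist()
    try body(derived)
    finally if (shouldPersist) derived.unpersist()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("persist-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      // The user cached the input, signalling that spending memory is acceptable.
      val input = sc.parallelize(1 to 1000).persist()
      val derived = input.map(x => (x % 10, x.toDouble))

      val (n, total) = withPersistIfInputPersisted(input, derived) { rdd =>
        (rdd.count(), rdd.values.sum()) // two actions on the same RDD
      }
      println(s"count=$n sum=$total")

      // The helper released the derived RDD, so the caller has nothing to clean up.
      assert(derived.getStorageLevel == StorageLevel.NONE)
      input.unpersist()
    } finally {
      spark.stop()
    }
  }
}
```

Wrapping the work in try/finally means the cached blocks are released even if an action fails, which addresses the concern that the caller otherwise cannot unpersist an internal RDD.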
c5b1a15 to daad006
2119a3f to daad006
).sortByKey(ascending = false)
)
if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
  binnedWeights.persist()
This still isn't unpersisted
Do you mean it should be unpersisted after use?
Yes, otherwise the caller has no way to unpersist it until it's GCed.
if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
  binnedWeights.persist()
}
val counts = binnedWeights.sortByKey(ascending = false)
Wait, hm, I don't understand this. You persist binnedWeights, but it is now only used once. Why? If anything it's binnedCounts that needs persisting. I'm still not clear if it makes enough difference to matter.
binnedCounts is a child RDD of binnedWeights. And here one action, sortByKey, is performed on binnedWeights.
Yes, but why bother persisting binnedWeights? You still recompute everything between it and binnedCounts twice, when I think avoiding that would be the point.
I think binnedWeights needs to be persisted because more than one action is applied here.

binnedWeights
  |
  | sortByKey (action)
  V
counts
  |
  | count (action)
  V
binnedCounts (on which the action collect is applied to compute agg)
I might be wrong here. Kindly correct me @srowen
Caching helps where more than one action is performed on the same RDD. That's not the case here: each of the first two has only one thing executed on it. sortByKey is not an action, anyway.
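The point about where caching helps can be checked with a small experiment. The sketch below is illustrative only: the object and RDD names are made up, and it uses a `LongAccumulator` as a rough counter of how often a downstream map function runs (accumulator updates inside transformations are not exact under task retries, so this is assumed to run in local mode without failures).

```scala
import org.apache.spark.sql.SparkSession

object RecomputeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("recompute-sketch").getOrCreate()
    val sc = spark.sparkContext
    val n = 100

    // Returns how many times the downstream map function ran across two actions.
    def downstreamCalls(persistUpstream: Boolean, persistDownstream: Boolean): Long = {
      val calls = sc.longAccumulator("downstreamCalls")
      val upstream = sc.parallelize(1 to n).map(_ + 1)
      val downstream = upstream.map { x => calls.add(1); x * 2 }
      if (persistUpstream) upstream.persist()
      if (persistDownstream) downstream.persist()
      try {
        downstream.count()   // action 1
        downstream.collect() // action 2
        val total: Long = calls.value
        total
      } finally {
        if (persistUpstream) upstream.unpersist()
        if (persistDownstream) downstream.unpersist()
      }
    }

    try {
      val none = downstreamCalls(persistUpstream = false, persistDownstream = false)
      val upstreamOnly = downstreamCalls(persistUpstream = true, persistDownstream = false)
      val downstreamOnly = downstreamCalls(persistUpstream = false, persistDownstream = true)
      println(s"none=$none upstreamOnly=$upstreamOnly downstreamOnly=$downstreamOnly")

      // Persisting the parent does not stop the child from being recomputed.
      assert(none == 2L * n)
      assert(upstreamOnly == 2L * n)
      // Persisting the RDD that both actions run on computes it only once.
      assert(downstreamOnly == n.toLong)
    } finally {
      spark.stop()
    }
  }
}
```

Persisting the upstream RDD leaves the downstream work at two full passes, while persisting the RDD that the two actions actually run on brings it down to one. Whether that saving outweighs the cost of caching still depends on the data size and on how expensive the recomputed steps are.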
Oh, okay. One question here: would it be worth persisting counts, since the actions count and collect are applied directly to it?
Doesn't seem so. But that is the question I'd put to you in these cases - are you sure it makes a difference meaningful enough to overcome the overhead? I could imagine so here, just wondering if these are based on more investigation or benchmarking, vs just trying to persist lots of things.
"are you sure it makes a difference meaningful enough to overcome the overhead?"
I think not. Persisting counts doesn't make sense here; it would just be overhead. Now I am getting a clear picture of where to use persist.
Key learnings from this PR about persist:
- persist introduces memory and CPU overhead.
- So only important inputs (such as intermediate results or user data that is already cached) or RDDs on which more than one action is performed should be persisted.
- Avoid using persist in a loop.
- A persist should be meaningful enough to overcome its overhead.
TYSM @srowen. Looking forward to more learning opportunities.
I don't see how that PR is related?
I know it's not related, but I would like your review on it.
What changes were proposed in this pull request?
Persist RDDs that are used by more than one action.
Why are the changes needed?
Persisting prevents recomputation of an RDD when more than one action is applied to it.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually