Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29818][MLLIB] Missing persist on RDD #26454

Closed
wants to merge 6 commits into from

Conversation

amanomer
Copy link
Contributor

@amanomer amanomer commented Nov 9, 2019

What changes were proposed in this pull request?

Use persist on RDD which is used for more than one action.

Why are the changes needed?

Persist prevents re-computation of rdd when more than one actions are applied.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually

@amanomer
Copy link
Contributor Author

amanomer commented Nov 9, 2019

@srowen @MaxGekk Kindly, review this PR.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@MaxGekk
Copy link
Member

MaxGekk commented Nov 9, 2019

Persisting intermediate results is not always good because serialization has some costs + some vendors can solve the performance issue by another ways like IO caching /cc @gatorsmile @cloud-fan

@amanomer Just in case, do you observe performance improvements on some benchmarks or use cases?

@amanomer
Copy link
Contributor Author

I have not tested this patch on any performance benchmark but I think these functions are quite generic, most of the applications/vendors must be using them. So it would be better if we optimize them like we are doing in other places?

val baggedInput = BaggedPoint
.convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
(tp: TreePoint) => tp.weight, seed = seed)
.persist(StorageLevel.MEMORY_AND_DISK)


@MaxGekk Kindly correct me if I am wrong. Thanks

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these don't look like obvious candidates to force a persist. Which ones would you argue are actually a performance gain?

@Icysandwich
Copy link
Contributor

Now that persist() is added on some RDDs, I think the related RDD.unpersist() should also be added.

@amanomer
Copy link
Contributor Author

amanomer commented Nov 11, 2019

@srowen There are some JIRA tickets created for unnecessary use of persist and improper persist strategy (like 29844, 29832). Do we need to handle them in this PR, if required?

createCombiner = (labelAndWeight: (Double, Double)) =>
new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2),
mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
c += (labelAndWeight._1, labelAndWeight._2),
mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
).sortByKey(ascending = false)
)
binnedWeights.persist()
Copy link
Contributor Author

@amanomer amanomer Nov 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

binnedWeights RDD needs to be persisted. At Line 176: sortByKey() and Line 216: collect() is applied. Also, it is not already persisted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to be unpersisted though. We also customarily only persist if the user input was persisted, as a sort of signal that the user is OK consuming memory/disk. For small inputs this may not be worth it so maybe the user doesn't want the overhead of persisting.

).sortByKey(ascending = false)
)
if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
binnedWeights.persist()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still isn't unpersisted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean it should be unpersisted after use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes otherwise the caller has no way to unpersist it until it's GCed

if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) {
binnedWeights.persist()
}
val counts = binnedWeights.sortByKey(ascending = false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, hm, I don't understand this. You persist binnedWeights, but it is now only used once. Why? If anything it's binnedCounts that needs persisting. I'm still not clear if it makes enough difference to matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

binnedCounts is a child RDD of binnedWeights. And here one action sortByKey is performed on binnedWeights.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but, why bother persisting binnedWeights? you recompute everything in between it and binnedCounts twice, when I think that would be the point, to avoid that.

Copy link
Contributor Author

@amanomer amanomer Nov 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think binnedWeights is required to be persisted because more than one action is getting applied here.

binnedWeights
      |
      | sortByKey (action)
     V
counts
      |
      | count (action)
     V
binnedCounts (on which action collect is applied to compute agg)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong here. Kindly correct me @srowen

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

caching helps where more than one action is performed on the same RDD. That's not the case here. Each of the first two has one thing executed on it. sortByKey is not an action, anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, okay. One question here, will it be worth persisting counts since actions count and collect is applied directly on it ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem so. But that is the question I'd put to you in these cases - are you sure it makes a difference meaningful enough to overcome the overhead? I could imagine so here, just wondering if these are based on more investigation or benchmarking, vs just trying to persist lots of things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure it makes a difference meaningful enough to overcome the overhead?

I think, no. Persisting count doesn't makes sense here. It will just be an overhead. Now I am getting clear picture of where to use persist.
Key learnings from this PR about persist.

  • persist introduce memory and CPU overheads.

  • So only important inputs (such as intermediate results, user data which is already cached, etc) should be persisted or RDD on which more than one action is performed.

  • Avoid using persist in loop.

  • Persist should be meaningful enough to overcome overheads.

Copy link
Contributor Author

@amanomer amanomer Nov 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TYSM @srowen . Looking forward for more learning opportunities.

@amanomer amanomer requested a review from srowen November 13, 2019 17:18
@amanomer amanomer closed this Nov 13, 2019
@amanomer
Copy link
Contributor Author

@srowen kindly help me to review this PR #26317

@srowen
Copy link
Member

srowen commented Nov 16, 2019

I don't see how that PR is related?

@amanomer
Copy link
Contributor Author

amanomer commented Nov 16, 2019

I know it's not related. But want your reviews on that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants