
[SPARK-10182] [MLlib] GeneralizedLinearModel doesn't unpersist cached data #8395

Closed
wants to merge 3 commits

Conversation

SlavikBaranov
Contributor

GeneralizedLinearModel creates a cached RDD when building a model. This is inconvenient: when building several models in a row, these RDDs flood the cache, so useful data might get evicted.

The proposed solution is to always cache the dataset and remove the warning. There's a caveat, though: the input dataset gets evaluated twice, first at line 270 when fitting the StandardScaler, and again when running the optimizer. So it might be worth restoring the removed warning.

Another possible solution is to disable caching entirely and restore the removed warning. I'm not sure which approach is better.
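A minimal sketch of the guarded-unpersist idea (assuming a live SparkContext; the method name `trainWithCleanup` and its body are illustrative, not the actual MLlib code):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical sketch: cache the transformed data for the optimizer's
// multiple passes, then release it once training is done -- but only
// if it is actually persisted.
def trainWithCleanup(input: RDD[LabeledPoint]): Unit = {
  val data = input.map(lp => (lp.label, lp.features)).cache()

  // ... fit the StandardScaler and run the iterative optimizer here ...

  // Unpersist the intermediate RDD so it does not crowd useful data
  // out of the cache when several models are built in a row.
  if (data.getStorageLevel != StorageLevel.NONE) {
    data.unpersist(blocking = false)
  }
}
```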

@srowen
Member

srowen commented Aug 24, 2015

I get it: it's probably because that one path is not cached. This looks good to me. The only thing I wonder is this: the code clearly expects the input to be cached, but it might not be cached in memory, and this change always creates an additional in-memory cache. However, since several code paths already behave this way, it seems better to be consistent, and then the warnings don't make as much sense.

@SparkQA

SparkQA commented Aug 24, 2015

Test build #1685 has finished for PR 8395 at commit 8a5a8e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SlavikBaranov
Contributor Author

@srowen I think the intermediate data is cached because it significantly improves performance when the number of features is high (since the appendBias method performs a System.arraycopy). On the other hand, the LogisticRegressionWithLBFGS class always performs feature scaling, so redundant evaluation of the input RDD might take significant time.

Thinking about it again, I'd prefer to restore the removed warning and limit the fix to something like this:

if (data.getStorageLevel != StorageLevel.NONE) {
  data.unpersist(false)
}

What do you think about it?

@@ -287,7 +282,7 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
     if (useFeatureScaling) {
       input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
     } else {
-      input.map(lp => (lp.label, lp.features))
+      input.map(lp => (lp.label, lp.features)).cache()
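As the review comment below points out, when `input` is already persisted there is little benefit in caching this small transform again. A hedged sketch of the alternative of only caching when the caller has not (illustrative, not the merged code):

```scala
import org.apache.spark.storage.StorageLevel

// Cache the mapped RDD only if the caller's input is not already persisted;
// otherwise rely on the existing cache and skip the extra copy.
val data = if (input.getStorageLevel == StorageLevel.NONE) {
  input.map(lp => (lp.label, lp.features)).cache()
} else {
  input.map(lp => (lp.label, lp.features))
}
```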

I think this omission was intentional: If input is cached, then there's no real need to cache this tiny transform. If you keep as is, then you can just check at the end to see if "data" is cached before trying to unpersist it.


That makes sense too. I suppose changing behavior less is good. Yes, then the safer thing is to just unpersist only if the data RDD is persisted.

@srowen
Member

srowen commented Aug 27, 2015

LGTM, I'm going to merge for 1.6 shortly

@srowen
Member

srowen commented Aug 27, 2015

(PS not sure why it doesn't seem to show up, but the tests passed again after the last commit: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1685/console )

@asfgit asfgit closed this in fdd466b Aug 27, 2015