[SPARK-10182] [MLlib] GeneralizedLinearModel doesn't unpersist cached data #8395
Conversation
I get it; maybe it's because one path is not cached. This looks good to me. The only thing I wonder about is this: clearly the code expects the input to be cached, yet it might not be cached in memory. This change creates an additional cache, always in memory. However, since several code paths already behave this way, it seems better to stay consistent, and then the warnings don't make as much sense.
Test build #1685 has finished for PR 8395 at commit
@srowen I think the intermediate data is cached because it significantly improves performance when the number of features is high (since Thinking of it again, I'd prefer to restore the removed warning and limit the fix to something like this:

What do you think about it?
@@ -287,7 +282,7 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
     if (useFeatureScaling) {
       input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
     } else {
-      input.map(lp => (lp.label, lp.features))
+      input.map(lp => (lp.label, lp.features)).cache()
I think this omission was intentional: If input is cached, then there's no real need to cache this tiny transform. If you keep as is, then you can just check at the end to see if "data" is cached before trying to unpersist it.
That makes sense too. I suppose changing behavior less is good. Yes, then the safer thing is to unpersist only if the data RDD is persisted.
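The approach settled on here, unpersisting only when the RDD was actually persisted, can be sketched against Spark's public RDD API. The helper name below is hypothetical and not part of the actual patch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch of the safer cleanup discussed above: only unpersist the
// intermediate RDD if we actually persisted it, so a caller-cached
// `input` is left untouched. `data` stands in for the (label, features)
// RDD built inside run().
def unpersistIfCached[T](data: RDD[T]): Unit = {
  if (data.getStorageLevel != StorageLevel.NONE) {
    data.unpersist(blocking = false)
  }
}
```

`RDD.getStorageLevel` returns `StorageLevel.NONE` for an unpersisted RDD, so this cleans up only a cache the algorithm created itself.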
LGTM, I'm going to merge for 1.6 shortly.

(PS: not sure why it doesn't seem to show up, but the tests passed again after the last commit: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1685/console )
GeneralizedLinearModel creates a cached RDD when building a model. This is inconvenient, since these RDDs flood the memory when several models are built in a row, so useful data might get evicted from the cache.

The proposed solution is to always cache the dataset and remove the warning. There's a caveat, though: the input dataset gets evaluated twice, first at line 270 when fitting StandardScaler, and a second time when running the optimizer. So it might be worth restoring the removed warning.

Another possible solution is to disable caching entirely and restore the removed warning. I don't really know which approach is better.
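The double-evaluation caveat can be illustrated with a hedged sketch. The names below are illustrative simplifications, not the actual GeneralizedLinearAlgorithm code:

```scala
import org.apache.spark.storage.StorageLevel

// Illustrative sketch: without cache(), this transformed RDD is
// recomputed from `input` on every action that touches it.
val data = input.map(lp => (lp.label, scaler.transform(lp.features)))

// Pass 1: fitting the StandardScaler triggers a full evaluation of `data`.
// Pass 2..n: every optimizer iteration triggers another evaluation.
// Caching trades executor memory for skipping those recomputations,
// which is the trade-off behind keeping or removing the warning:
if (data.getStorageLevel == StorageLevel.NONE) data.cache()
```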