[SPARK-10182] [MLlib] GeneralizedLinearModel doesn't unpersist cached data #8395
Conversation
I get it; maybe it's because one path is not cached. This looks good to me. The only thing I wonder about is this: clearly the code expects the input to be cached, yet it might not be cached in memory. This change creates an additional cache, always in memory. However, since several code paths already behave this way, it seems better to stay consistent, and then the warnings don't make as much sense.
Test build #1685 has finished for PR 8395 at commit
@srowen I think the intermediate data is cached because it significantly improves performance when the number of features is high (since Thinking of it again, I'd prefer to restore the removed warning and limit the fix to something like this:

What do you think about it?
@@ -287,7 +282,7 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
     if (useFeatureScaling) {
       input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
     } else {
-      input.map(lp => (lp.label, lp.features))
+      input.map(lp => (lp.label, lp.features)).cache()
I think this omission was intentional: If input is cached, then there's no real need to cache this tiny transform. If you keep as is, then you can just check at the end to see if "data" is cached before trying to unpersist it.
That makes sense too. I suppose changing behavior less is good. Yes, then the safer thing is to unpersist only if the data RDD is persisted.
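The approach settled on here, unpersisting only when the RDD was actually persisted, can be sketched against Spark's public RDD API. The helper name below is hypothetical and not part of the actual patch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch of the safer cleanup discussed above: only unpersist the
// intermediate RDD if we actually persisted it, so a caller-cached
// `input` is left untouched. `data` stands in for the (label, features)
// RDD built inside run().
def unpersistIfCached[T](data: RDD[T]): Unit = {
  if (data.getStorageLevel != StorageLevel.NONE) {
    data.unpersist(blocking = false)
  }
}
```

`RDD.getStorageLevel` returns `StorageLevel.NONE` for an unpersisted RDD, so this cleans up only a cache the algorithm created itself.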
LGTM, I'm going to merge for 1.6 shortly.

(PS: not sure why it doesn't seem to show up, but the tests passed again after the last commit: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1685/console )
GeneralizedLinearModel creates a cached RDD when building a model. This is inconvenient, since these RDDs flood the memory when several models are built in a row, so useful data might get evicted from the cache.

The proposed solution is to always cache the dataset and remove the warning. There's a caveat, though: the input dataset gets evaluated twice, first at line 270 when fitting StandardScaler, and a second time when running the optimizer. So it might be worth restoring the removed warning.

Another possible solution is to disable caching entirely and restore the removed warning. I don't really know which approach is better.
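The double-evaluation caveat can be illustrated with a hedged sketch. The names below are illustrative simplifications, not the actual GeneralizedLinearAlgorithm code:

```scala
import org.apache.spark.storage.StorageLevel

// Illustrative sketch: without cache(), this transformed RDD is
// recomputed from `input` on every action that touches it.
val data = input.map(lp => (lp.label, scaler.transform(lp.features)))

// Pass 1: fitting the StandardScaler triggers a full evaluation of `data`.
// Pass 2..n: every optimizer iteration triggers another evaluation.
// Caching trades executor memory for skipping those recomputations,
// which is the trade-off behind keeping or removing the warning:
if (data.getStorageLevel == StorageLevel.NONE) data.cache()
```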