
[SPARK-16404][ML] LeastSquaresAggregators serializes unnecessary data #14109

Closed · wants to merge 6 commits

Conversation

@sethah (Contributor) commented Jul 8, 2016

What changes were proposed in this pull request?

Similar to LogisticAggregator, the LeastSquaresAggregator used for linear regression ends up serializing the coefficients and the features' standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization.

In #13729 the approach was to pass these values directly to the add method. The approach used here, initially, is to mark these fields as transient instead, which gives the benefit of keeping the signature of the add method simple and interpretable. The downside is that it requires the use of @transient lazy vals, which are difficult to reason about if one is not familiar with serialization in Scala/Spark.
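As a rough illustration of the approach (class and member names here are simplified stand-ins, not the actual patch), only the cheap broadcast handles travel with the task closure, while the derived values are rebuilt lazily on each executor:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Sketch only: a simplified stand-in for the real LeastSquaresAggregator.
class AggregatorSketch(
    bcCoefficients: Broadcast[Vector],
    bcFeaturesStd: Broadcast[Array[Double]]) extends Serializable {

  // Serialized with the closure: a single Int, cheap.
  private val dim = bcCoefficients.value.size

  // @transient: dropped during serialization, recomputed from the broadcast
  // values on first access on each executor.
  @transient private lazy val featuresStd: Array[Double] = bcFeaturesStd.value
  @transient private lazy val effectiveCoefficients: Vector = {
    val arr = bcCoefficients.value.toArray.clone()
    var i = 0
    while (i < dim) {
      if (featuresStd(i) != 0.0) arr(i) /= featuresStd(i) else arr(i) = 0.0
      i += 1
    }
    Vectors.dense(arr)
  }

  def add(features: Vector, label: Double): this.type = {
    // The first call on an executor materializes effectiveCoefficients locally.
    val prediction = features.toArray.zip(effectiveCoefficients.toArray)
      .map { case (f, c) => f * c }.sum
    // Real gradient/loss bookkeeping is elided in this sketch.
    this
  }
}
```

Note that nothing in such a class signals to a future reader that removing @transient would silently reintroduce the serialization cost, which is the trade-off discussed below.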

How was this patch tested?

MLlib: [benchmark screenshot]

ML without patch: [benchmark screenshot]

ML with patch: [benchmark screenshot]

}

- private val effectiveCoefficientsVector = Vectors.dense(effectiveCoefficientsArray)
+ @transient private lazy val effectiveCoefficientsVector = coefAndOffset._1
sethah (Contributor Author) commented:

Before, these values were assigned simultaneously via a pattern match. It turns out that marking the pattern definition as a @transient lazy val doesn't work, because the compiler's unapply desugaring generates a hidden Tuple2 field which does not carry the transient tag. The individual vals are still transient, but the tuple is not, and thus gets serialized. This obscure/hidden consequence of pattern matching is one good argument not to use the @transient approach. E.g., the following does not work:

// The hidden Tuple2 field backing this pattern definition is NOT transient,
// so it is serialized once initialized, even though x and y themselves are.
@transient private lazy val (x, y) = {
  ...
  (x, y)
}
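A self-contained sketch of this pitfall, and of the workaround the patch ends up using (hold the tuple itself in one @transient lazy val and project its components through further @transient lazy vals). All names here are illustrative, and the sizes assume the Scala 2.x desugaring described above:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object TransientTupleDemo {
  // Pattern-definition form: the compiler stores the Tuple2 in a hidden
  // field that does NOT carry @transient, so once the lazy vals are
  // initialized the whole payload is written out by Java serialization.
  class PatternForm extends Serializable {
    @transient private lazy val (big, n) = (Array.fill(1 << 20)(0.0), 1)
    def force(): Int = n // touch a component to initialize the hidden tuple
  }

  // Workaround form: one @transient lazy val holds the tuple, and the
  // components are projected through @transient lazy vals, so no
  // non-transient synthetic field exists.
  class ProjectedForm extends Serializable {
    @transient private lazy val pair = (Array.fill(1 << 20)(0.0), 1)
    @transient private lazy val big = pair._1
    @transient private lazy val n = pair._2
    def force(): Int = n
  }

  private def serializedSize(o: AnyRef): Int = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(o)
    oos.close()
    bos.size
  }

  def main(args: Array[String]): Unit = {
    val a = new PatternForm; a.force()
    val b = new ProjectedForm; b.force()
    println(s"pattern form:   ${serializedSize(a)} bytes") // ~8 MB: tuple serialized
    println(s"projected form: ${serializedSize(b)} bytes") // small: all fields transient
  }
}
```

Running main prints roughly 8 MB for the pattern form and only a few hundred bytes for the projected form, which is the serialization difference the benchmarks above show.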

@sethah (Contributor Author) commented Jul 8, 2016

ping @dbtsai

I implemented this patch using @transient as you suggested. I ran into an obscure complication when using @transient with a pattern-match (unapply) definition. While I understand that it is nice to keep the add method signature unchanged, I think you can argue that it is more appropriate to pass the coefficients and featuresStd arrays directly to the method, since that is the only place in the class they are used. Add to that the fact that the @transient approach has a more confusing implementation and could potentially be unknowingly undone by future developers, and it may not be the best approach. I am open to feedback/suggestions. Thanks!

SparkQA commented Jul 9, 2016

Test build #62003 has finished for PR 14109 at commit 53c9192.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor Author) commented Jul 20, 2016

also cc @mengxr - this is the same problem as in #13729

@yanboliang (Contributor) commented:

Jenkins, test this please.

SparkQA commented Aug 2, 2016

Test build #63123 has finished for PR 14109 at commit 53c9192.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val coefficientsArray = coefficients.toArray.clone()
private val dim = bcCoefficients.value.size
@transient private lazy val featuresStd = bcFeaturesStd.value
@transient private lazy val coefAndOffset = {
Contributor commented:

Wouldn't effectiveCoefficientsAndOffset be a better name than coefAndOffset? It holds the effective coefficients rather than the raw coefficients.

@yanboliang (Contributor) commented:

@sethah I left two inline comments. Otherwise, LGTM. Thanks!

@sethah (Contributor Author) commented Aug 3, 2016

@yanboliang Do you have thoughts on my comments regarding the trade-offs with using @transient lazy val? I am not necessarily convinced this is the best way. If it is, we should update Logistic Regression to use this method as well.

I'll address your other comments shortly. Thanks!

SparkQA commented Aug 3, 2016

Test build #63175 has finished for PR 14109 at commit 10ba14e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val coefficientsArray = coefficients.toArray.clone()
private val dim = bcCoefficients.value.size
@transient private lazy val featuresStd = bcFeaturesStd.value
@transient private lazy val effectiveCoefAndOffset = {
@dbtsai (Member) commented Aug 5, 2016:

How about @transient private lazy val (effectiveCoefficientsVector: Vector, offset: Double) = ...

Contributor commented:

@sethah has explained this issue in a comment which has been folded: @transient private lazy val (effectiveCoefficientsVector: Vector, offset: Double) will generate a Tuple2 that does not carry the transient tag. The individual vals are still transient, but the tuple is not, and thus gets serialized.

Member commented:

Oh, this is indeed obscure. I like the fact that using @transient tells readers those fields are not being serialized. However, this can be difficult to debug. How about having the documentation written in the code? Or we could do def initializeEffectiveCoefficientsVectorAndOffset and call it from the add method on the first invocation? I don't have a strong opinion about this.
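For reference, a minimal sketch of that alternative (only initializeEffectiveCoefficientsVectorAndOffset is named in the comment above; everything else is hypothetical, and the merged patch kept the lazy-val approach with explanatory comments instead):

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Sketch of the explicit-initialization alternative; not the merged code.
class AggregatorWithInit(
    bcCoefficients: Broadcast[Vector],
    bcFeaturesStd: Broadcast[Array[Double]]) extends Serializable {

  // @transient vars are never serialized; the initializer rebuilds them.
  @transient private var effectiveCoefficientsVector: Vector = _
  @transient private var offset: Double = 0.0

  private def initializeEffectiveCoefficientsVectorAndOffset(): Unit = {
    val std = bcFeaturesStd.value
    val arr = bcCoefficients.value.toArray.clone()
    var i = 0
    while (i < arr.length) {
      if (std(i) != 0.0) arr(i) /= std(i) else arr(i) = 0.0
      i += 1
    }
    effectiveCoefficientsVector = Vectors.dense(arr)
    offset = 0.0 // the real offset computation is elided in this sketch
  }

  def add(features: Vector, label: Double): this.type = {
    // Explicit and easy to step through in a debugger, unlike a lazy val.
    if (effectiveCoefficientsVector == null) {
      initializeEffectiveCoefficientsVectorAndOffset()
    }
    // Gradient/loss update elided.
    this
  }
}
```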

@dbtsai (Member) commented Aug 5, 2016

@sethah In my opinion, using @transient lazy val is okay, since there are only two places dereferencing the lazy val and we don't use it in a tight loop. LGTM except for one small comment. Thanks.

@sethah (Contributor Author) commented Aug 5, 2016

@dbtsai @yanboliang I went ahead and added a couple of comments so someone will not mistakenly change this behavior in the future. Let me know if you see anything else, thanks!

}

val totalGradientArray = leastSquaresAggregator.gradient.toArray
bcCoeffs.destroy(blocking = false)

Member commented:

BTW, why do we not explicitly destroy bcFeaturesStd and bcFeaturesMean here? Thanks.

sethah (Contributor Author) commented:

We cannot destroy them here because they are used on every iteration. I just added a commit to fix this so that, after the algorithm is run, we destroy the broadcast variables.
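In outline, the lifecycle fix looks like the following sketch, where sc, featuresStd, featuresMean, and runToConvergence are placeholders rather than the actual LinearRegression internals:

```scala
// Sketch: broadcast once, reuse across every optimizer iteration, and
// destroy only after the last Spark job has run.
val bcFeaturesStd = sc.broadcast(featuresStd)
val bcFeaturesMean = sc.broadcast(featuresMean)

// Each iteration launches a job whose tasks read both broadcasts, so they
// must not be destroyed inside the loop.
val coefficients = runToConvergence(bcFeaturesStd, bcFeaturesMean)

// Safe now: no further task will dereference these handles.
bcFeaturesStd.destroy(blocking = false)
bcFeaturesMean.destroy(blocking = false)
```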

SparkQA commented Aug 5, 2016

Test build #63282 has finished for PR 14109 at commit 0d99795.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 5, 2016

Test build #63288 has finished for PR 14109 at commit 9c2bf47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) commented:

The current fix for destroying the broadcast variables is OK. LGTM. Thanks!

@dbtsai (Member) commented Aug 8, 2016

LGTM. Merged into master. Great work! Thanks.

asfgit closed this in 1db1c65 on Aug 8, 2016
@dbtsai (Member) commented Aug 8, 2016

It would be great to have LOR share a similar style and destroy the mean and variance broadcasts after use. Thanks.

asfgit pushed a commit that referenced this pull request Aug 9, 2016
…es unnecessary data.

## What changes were proposed in this pull request?
Similar to ```LeastSquaresAggregator``` in #14109, ```AFTAggregator``` used for ```AFTSurvivalRegression``` ends up serializing the ```parameters``` and ```featuresStd```, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by #14109.

## How was this patch tested?
I tested this locally and verified the serialization reduction.

Before patch
![image](https://cloud.githubusercontent.com/assets/1962026/17512035/abb93f04-5dda-11e6-97d3-8ae6b61a0dfd.png)

After patch
![image](https://cloud.githubusercontent.com/assets/1962026/17512024/9e0dc44c-5dda-11e6-93d0-6e130ba0d6aa.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14519 from yanboliang/spark-16933.
asfgit pushed a commit that referenced this pull request Aug 15, 2016
…code to make it consistent with LinearRegression

## What changes were proposed in this pull request?

Update LogisticCostAggregator serialization code to make it consistent with #14109

## How was this patch tested?
MLlib 2.0:
![image](https://cloud.githubusercontent.com/assets/19235986/17649601/5e2a79ac-61ee-11e6-833c-3bd8b5250470.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/19235986/17649599/52b002ae-61ee-11e6-9402-9feb3439880f.png)

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14520 from WeichenXu123/improve_logistic_regression_costfun.
kiku-jw pushed a commit to kiku-jw/spark that referenced this pull request Jun 26, 2019
…berAggregator

## What changes were proposed in this pull request?

Modifies the HuberAggregator class so that a copy of the coefficients vector isn't created every time an instance is added. Follows the approach of LeastSquaresAggregator and uses a transient lazy class variable to store the reused quantities. (See apache#14109 for an explanation of the use of transient lazy variables.)

On the test case in the linked JIRA, this change gives an order-of-magnitude performance improvement, reducing the time taken to fit the model from 540 to 47 seconds.

## How was this patch tested?

Existing unit tests.
See https://issues.apache.org/jira/browse/SPARK-28062 for results from running a benchmark script.

Closes apache#24880 from Andrew-Crosby/spark-28062.

Authored-by: Andrew-Crosby <andrew.crosby@autotrader.co.uk>
Signed-off-by: Sean Owen <sean.owen@databricks.com>