
[SPARK-16404][ML] LeastSquaresAggregators serializes unnecessary data #14109

Closed · wants to merge 6 commits

Conversation

@sethah (Contributor) commented Jul 8, 2016

What changes were proposed in this pull request?

Similar to LogisticAggregator, the LeastSquaresAggregator used for linear regression ends up serializing the coefficients and the features' standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization.

In #13729 the approach was to pass these values directly to the add method. The approach used here, initially, is to mark these fields as transient instead, which gives the benefit of keeping the signature of the add method simple and interpretable. The downside is that it requires the use of @transient lazy vals, which are difficult to reason about if one is not familiar with serialization in Scala/Spark.
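As a rough illustration of the approach (class and member names here are simplified stand-ins, not the actual patch), only the cheap broadcast handles travel with the task closure, while the derived values are rebuilt lazily on each executor:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Sketch only: a simplified stand-in for the real LeastSquaresAggregator.
class AggregatorSketch(
    bcCoefficients: Broadcast[Vector],
    bcFeaturesStd: Broadcast[Array[Double]]) extends Serializable {

  // Serialized with the closure: a single Int, cheap.
  private val dim = bcCoefficients.value.size

  // @transient: dropped during serialization, recomputed from the broadcast
  // values on first access on each executor.
  @transient private lazy val featuresStd: Array[Double] = bcFeaturesStd.value
  @transient private lazy val effectiveCoefficients: Vector = {
    val arr = bcCoefficients.value.toArray.clone()
    var i = 0
    while (i < dim) {
      if (featuresStd(i) != 0.0) arr(i) /= featuresStd(i) else arr(i) = 0.0
      i += 1
    }
    Vectors.dense(arr)
  }

  def add(features: Vector, label: Double): this.type = {
    // The first call on an executor materializes effectiveCoefficients locally.
    val prediction = features.toArray.zip(effectiveCoefficients.toArray)
      .map { case (f, c) => f * c }.sum
    // Real gradient/loss bookkeeping is elided in this sketch.
    this
  }
}
```

Note that nothing in such a class signals to a future reader that removing @transient would silently reintroduce the serialization cost, which is the trade-off discussed below.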

How was this patch tested?

MLlib: [benchmark screenshot]

ML without patch: [benchmark screenshot]

ML with patch: [benchmark screenshot]

}

- private val effectiveCoefficientsVector = Vectors.dense(effectiveCoefficientsArray)
+ @transient private lazy val effectiveCoefficientsVector = coefAndOffset._1
sethah (Contributor Author) commented:

Before, these values were assigned simultaneously via a pattern match. It turns out that marking the pattern definition as a @transient lazy val doesn't work, because the compiler's unapply desugaring generates a hidden Tuple2 field which does not carry the transient tag. The individual vals are still transient, but the tuple is not, and thus gets serialized. This obscure/hidden consequence of pattern matching is one good argument not to use the @transient approach. E.g., the following does not work:

// The hidden Tuple2 field backing this pattern definition is NOT transient,
// so it is serialized once initialized, even though x and y themselves are.
@transient private lazy val (x, y) = {
  ...
  (x, y)
}
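A self-contained sketch of this pitfall, and of the workaround the patch ends up using (hold the tuple itself in one @transient lazy val and project its components through further @transient lazy vals). All names here are illustrative, and the sizes assume the Scala 2.x desugaring described above:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object TransientTupleDemo {
  // Pattern-definition form: the compiler stores the Tuple2 in a hidden
  // field that does NOT carry @transient, so once the lazy vals are
  // initialized the whole payload is written out by Java serialization.
  class PatternForm extends Serializable {
    @transient private lazy val (big, n) = (Array.fill(1 << 20)(0.0), 1)
    def force(): Int = n // touch a component to initialize the hidden tuple
  }

  // Workaround form: one @transient lazy val holds the tuple, and the
  // components are projected through @transient lazy vals, so no
  // non-transient synthetic field exists.
  class ProjectedForm extends Serializable {
    @transient private lazy val pair = (Array.fill(1 << 20)(0.0), 1)
    @transient private lazy val big = pair._1
    @transient private lazy val n = pair._2
    def force(): Int = n
  }

  private def serializedSize(o: AnyRef): Int = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(o)
    oos.close()
    bos.size
  }

  def main(args: Array[String]): Unit = {
    val a = new PatternForm; a.force()
    val b = new ProjectedForm; b.force()
    println(s"pattern form:   ${serializedSize(a)} bytes") // ~8 MB: tuple serialized
    println(s"projected form: ${serializedSize(b)} bytes") // small: all fields transient
  }
}
```

Running main prints roughly 8 MB for the pattern form and only a few hundred bytes for the projected form, which is the serialization difference the benchmarks above show.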

@sethah (Contributor Author) commented Jul 8, 2016

ping @dbtsai

I implemented this patch using @transient as you suggested. I ran into an obscure complication when using @transient with a pattern-match (unapply) definition. While I understand that it is nice to keep the add method signature unchanged, I think you can argue that it is more appropriate to pass the coefficients and featuresStd arrays directly to the method, since that is the only place in the class they are used. Add to that the fact that the @transient approach has a more confusing implementation and could potentially be unknowingly undone by future developers, and it may not be the best approach. I am open to feedback/suggestions. Thanks!

SparkQA commented Jul 9, 2016

Test build #62003 has finished for PR 14109 at commit 53c9192.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor Author) commented Jul 20, 2016

also cc @mengxr - this is the same problem as in #13729

@yanboliang (Contributor) commented:

Jenkins, test this please.

SparkQA commented Aug 2, 2016

Test build #63123 has finished for PR 14109 at commit 53c9192.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val coefficientsArray = coefficients.toArray.clone()
private val dim = bcCoefficients.value.size
@transient private lazy val featuresStd = bcFeaturesStd.value
@transient private lazy val coefAndOffset = {
Contributor commented:

Wouldn't effectiveCoefficientsAndOffset be a better name than coefAndOffset? It holds the effective coefficients rather than the raw coefficients.

@yanboliang (Contributor) commented:

@sethah I left two inline comments. Otherwise, LGTM. Thanks!

@sethah (Contributor Author) commented Aug 3, 2016

@yanboliang Do you have thoughts on my comments regarding the trade-offs with using @transient lazy val? I am not necessarily convinced this is the best way. If it is, we should update Logistic Regression to use this method as well.

I'll address your other comments shortly. Thanks!

SparkQA commented Aug 3, 2016

Test build #63175 has finished for PR 14109 at commit 10ba14e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val coefficientsArray = coefficients.toArray.clone()
private val dim = bcCoefficients.value.size
@transient private lazy val featuresStd = bcFeaturesStd.value
@transient private lazy val effectiveCoefAndOffset = {
@dbtsai (Member) commented Aug 5, 2016:

How about @transient private lazy val (effectiveCoefficientsVector: Vector, offset: Double) = ...

Contributor commented:

@sethah has explained this issue in a comment which has been folded: @transient private lazy val (effectiveCoefficientsVector: Vector, offset: Double) will generate a Tuple2 that does not carry the transient tag. The individual vals are still transient, but the tuple is not, and thus gets serialized.

Member commented:

Oh, this is indeed obscure. I like the fact that using @transient tells readers those fields are not being serialized. However, this can be difficult to debug. How about having the documentation written in the code? Or we could do def initializeEffectiveCoefficientsVectorAndOffset and call it from the add method on the first invocation? I don't have a strong opinion about this.
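For reference, a minimal sketch of that alternative (only initializeEffectiveCoefficientsVectorAndOffset is named in the comment above; everything else is hypothetical, and the merged patch kept the lazy-val approach with explanatory comments instead):

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Sketch of the explicit-initialization alternative; not the merged code.
class AggregatorWithInit(
    bcCoefficients: Broadcast[Vector],
    bcFeaturesStd: Broadcast[Array[Double]]) extends Serializable {

  // @transient vars are never serialized; the initializer rebuilds them.
  @transient private var effectiveCoefficientsVector: Vector = _
  @transient private var offset: Double = 0.0

  private def initializeEffectiveCoefficientsVectorAndOffset(): Unit = {
    val std = bcFeaturesStd.value
    val arr = bcCoefficients.value.toArray.clone()
    var i = 0
    while (i < arr.length) {
      if (std(i) != 0.0) arr(i) /= std(i) else arr(i) = 0.0
      i += 1
    }
    effectiveCoefficientsVector = Vectors.dense(arr)
    offset = 0.0 // the real offset computation is elided in this sketch
  }

  def add(features: Vector, label: Double): this.type = {
    // Explicit and easy to step through in a debugger, unlike a lazy val.
    if (effectiveCoefficientsVector == null) {
      initializeEffectiveCoefficientsVectorAndOffset()
    }
    // Gradient/loss update elided.
    this
  }
}
```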

@dbtsai (Member) commented Aug 5, 2016

@sethah In my opinion, using @transient lazy val is okay, since there are only two places dereferencing the lazy val and we don't use it in a tight loop. LGTM except for one small comment. Thanks.

@sethah (Contributor Author) commented Aug 5, 2016

@dbtsai @yanboliang I went ahead and added a couple of comments so someone will not mistakenly change this behavior in the future. Let me know if you see anything else, thanks!

}

val totalGradientArray = leastSquaresAggregator.gradient.toArray
bcCoeffs.destroy(blocking = false)

Member commented:

BTW, why do we not explicitly destroy bcFeaturesStd and bcFeaturesMean here? Thanks.

sethah (Contributor Author) commented:

We cannot destroy them here because they are used on every iteration. I just added a commit to fix this so that, after the algorithm is run, we destroy the broadcast variables.
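In outline, the lifecycle fix looks like the following sketch, where sc, featuresStd, featuresMean, and runToConvergence are placeholders rather than the actual LinearRegression internals:

```scala
// Sketch: broadcast once, reuse across every optimizer iteration, and
// destroy only after the last Spark job has run.
val bcFeaturesStd = sc.broadcast(featuresStd)
val bcFeaturesMean = sc.broadcast(featuresMean)

// Each iteration launches a job whose tasks read both broadcasts, so they
// must not be destroyed inside the loop.
val coefficients = runToConvergence(bcFeaturesStd, bcFeaturesMean)

// Safe now: no further task will dereference these handles.
bcFeaturesStd.destroy(blocking = false)
bcFeaturesMean.destroy(blocking = false)
```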

SparkQA commented Aug 5, 2016

Test build #63282 has finished for PR 14109 at commit 0d99795.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 5, 2016

Test build #63288 has finished for PR 14109 at commit 9c2bf47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) commented:

The current fix for destroying the broadcast variables is OK. LGTM. Thanks!

@dbtsai (Member) commented Aug 8, 2016

LGTM. Merged into master. Great work! Thanks.

asfgit closed this in 1db1c65 on Aug 8, 2016
@dbtsai (Member) commented Aug 8, 2016

It would be great to have LOR share a similar style and destroy the mean and variance broadcasts after use. Thanks.

asfgit pushed a commit that referenced this pull request Aug 9, 2016
…es unnecessary data.

## What changes were proposed in this pull request?
Similar to ```LeastSquaresAggregator``` in #14109, ```AFTAggregator``` used for ```AFTSurvivalRegression``` ends up serializing the ```parameters``` and ```featuresStd```, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by #14109.

## How was this patch tested?
I tested this locally and verified the serialization reduction.

Before patch
![image](https://cloud.githubusercontent.com/assets/1962026/17512035/abb93f04-5dda-11e6-97d3-8ae6b61a0dfd.png)

After patch
![image](https://cloud.githubusercontent.com/assets/1962026/17512024/9e0dc44c-5dda-11e6-93d0-6e130ba0d6aa.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14519 from yanboliang/spark-16933.
asfgit pushed a commit that referenced this pull request Aug 15, 2016
…code to make it consistent with LinearRegression

## What changes were proposed in this pull request?

Update LogisticCostAggregator serialization code to make it consistent with #14109

## How was this patch tested?
MLlib 2.0:
![image](https://cloud.githubusercontent.com/assets/19235986/17649601/5e2a79ac-61ee-11e6-833c-3bd8b5250470.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/19235986/17649599/52b002ae-61ee-11e6-9402-9feb3439880f.png)

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14520 from WeichenXu123/improve_logistic_regression_costfun.
kiku-jw pushed a commit to kiku-jw/spark that referenced this pull request Jun 26, 2019
…berAggregator

## What changes were proposed in this pull request?

Modifies the HuberAggregator class so that a copy of the coefficients vector isn't created every time an instance is added. Follows the approach of LeastSquaresAggregator and uses a transient lazy class variable to store the reused quantities. (See apache#14109 for an explanation of the use of transient lazy variables.)

On the test case in the linked JIRA, this change gives an order-of-magnitude performance improvement, reducing the time taken to fit the model from 540 to 47 seconds.

## How was this patch tested?

Existing unit tests.
See https://issues.apache.org/jira/browse/SPARK-28062 for results from running a benchmark script.

Closes apache#24880 from Andrew-Crosby/spark-28062.

Authored-by: Andrew-Crosby <andrew.crosby@autotrader.co.uk>
Signed-off-by: Sean Owen <sean.owen@databricks.com>