[SPARK-16404][ML] LeastSquaresAggregators serializes unnecessary data #14109
Conversation
```diff
- private val effectiveCoefficientsVector = Vectors.dense(effectiveCoefficientsArray)
+ @transient private lazy val effectiveCoefficientsVector = coefAndOffset._1
```
Before, these values were assigned simultaneously using a pattern match. It turns out that marking the pattern definition as `@transient lazy val` doesn't work, because the unapply generates a `Tuple2` which does not carry the transient tag. The individual vals are still transient, but the tuple is not and thus gets serialized. This obscure, hidden consequence of pattern matching is one good argument not to use the `@transient` approach. E.g. the following doesn't work:

```scala
@transient private lazy val (x, y) = {
  ...
  (x, y)
}
```
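The pitfall can be demonstrated outside Spark with plain Java serialization. Below is a hypothetical, self-contained sketch (not code from this patch): `Patterned` uses the pattern-match form, while `Individual` assigns each val separately from a single transient tuple val. On Scala 2.x compilers exhibiting this behavior, serializing a `Patterned` instance after forcing its lazy vals carries the whole array, while `Individual` does not; exact sizes depend on the compiler version, so none are asserted here.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Pattern-match form: the compiler stores the intermediate Tuple2 in a
// synthetic field that does not carry the @transient annotation.
class Patterned extends Serializable {
  @transient private lazy val (big: Array[Double], offset: Double) =
    (Array.fill(100000)(1.0), 2.0)
  def force(): Double = big(0) + offset
}

// Individually assigned form: the tuple itself lives in a transient field,
// so none of the large data survives serialization.
class Individual extends Serializable {
  @transient private lazy val pair = (Array.fill(100000)(1.0), 2.0)
  @transient private lazy val big = pair._1
  @transient private lazy val offset = pair._2
  def force(): Double = big(0) + offset
}

object TransientTupleDemo {
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size
  }

  def main(args: Array[String]): Unit = {
    val p = new Patterned; p.force()
    val i = new Individual; i.force()
    println(s"pattern-matched: ${serializedSize(p)} bytes")
    println(s"individual vals: ${serializedSize(i)} bytes")
  }
}
```

Running the `main` method and comparing the two printed sizes makes the hidden tuple field easy to spot.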
ping @dbtsai I implemented this patch using `@transient lazy val`s.
Test build #62003 has finished for PR 14109 at commit
Jenkins, test this please.
Test build #63123 has finished for PR 14109 at commit
```scala
val coefficientsArray = coefficients.toArray.clone()
private val dim = bcCoefficients.value.size
@transient private lazy val featuresStd = bcFeaturesStd.value
@transient private lazy val coefAndOffset = {
```
Would `coefAndOffset` be better as `effectiveCoefficientsAndOffset`? It's the effective coefficients rather than the raw coefficients.
@sethah I left two inline comments. Otherwise, LGTM. Thanks!
@yanboliang Do you have thoughts on my comments regarding the trade-offs of using `@transient lazy val`? I'll address your other comments shortly. Thanks!
Test build #63175 has finished for PR 14109 at commit
val coefficientsArray = coefficients.toArray.clone() | ||
private val dim = bcCoefficients.value.size | ||
@transient private lazy val featuresStd = bcFeaturesStd.value | ||
@transient private lazy val effectiveCoefAndOffset = { |
How about `@transient private lazy val (effectiveCoefficientsVector: Vector, offset: Double) = ...`?
@sethah has explained this issue in a comment which has been folded. `@transient private lazy val (effectiveCoefficientsVector: Vector, offset: Double)` will generate a `Tuple2` which does not carry the transient tag. The individual vals are still transient, but the tuple is not and thus gets serialized.
Oh, this is indeed obscure. I like the fact that using `@transient` tells readers those fields are not being serialized. However, this can be difficult to debug. How about having the documentation written in the code? Or we can do `def initializeEffectiveCoefficientsVectorAndOffset` and call it in the `add` method the first time. I don't have a strong opinion about this.
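For reference, the initialize-in-`add` alternative mentioned here might look roughly like the following. This is a hypothetical sketch, not code from the patch; `bcCoefficients`, `Instance`, and the exact standardization arithmetic are assumed from the surrounding diff context.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

private var effectiveCoefficientsVector: Vector = _
private var offset: Double = 0.0
private var initialized = false

// Compute the effective (standardized) coefficients once, on the executor,
// the first time `add` is called, instead of relying on @transient lazy vals.
private def initializeEffectiveCoefficientsVectorAndOffset(): Unit = {
  val coefficientsArray = bcCoefficients.value.toArray.clone()
  // ... divide each coefficient by the feature's standard deviation and
  // accumulate the offset term here, as in the lazy-val version ...
  effectiveCoefficientsVector = Vectors.dense(coefficientsArray)
  initialized = true
}

def add(instance: Instance): this.type = {
  if (!initialized) initializeEffectiveCoefficientsVectorAndOffset()
  // ... use effectiveCoefficientsVector and offset as before ...
  this
}
```

The trade-off is a mutable flag and a less declarative class body, in exchange for serialization behavior that is explicit rather than dependent on how scalac compiles lazy vals.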
@sethah In my opinion, I think using `@transient lazy val` with documentation in the code is acceptable.
@dbtsai @yanboliang I went ahead and added a couple of comments so someone will not mistakenly change this behavior in the future. Let me know if you see anything else, thanks!
```scala
val totalGradientArray = leastSquaresAggregator.gradient.toArray
bcCoeffs.destroy(blocking = false)
```
BTW, why do we not explicitly destroy `bcFeaturesStd` and `bcFeaturesMean` here? Thanks.
We cannot destroy them here because they are used on every iteration. I just added a commit to fix this so that, after the algorithm is run, we destroy the broadcast variables.
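The resulting lifecycle (a simplified sketch of the fix described above, with names assumed from the surrounding discussion, not the exact merged code) keeps the broadcasts alive for the whole optimization and destroys them only afterwards:

```scala
// Broadcast once, before optimization begins.
val bcFeaturesStd = instances.context.broadcast(featuresStd)
val bcFeaturesMean = instances.context.broadcast(featuresMean)

// Every optimizer iteration re-reads bcFeaturesStd.value / bcFeaturesMean.value
// on the executors, so they must stay alive until the iterator is drained.
val states = optimizer.iterations(new CachedDiffFunction(costFun), initialCoefficients)
while (states.hasNext) {
  state = states.next()
}

// Only now is it safe to destroy them.
bcFeaturesStd.destroy(blocking = false)
bcFeaturesMean.destroy(blocking = false)
```

Destroying inside the cost function would break the next iteration; destroying after the loop releases executor memory without affecting correctness.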
Test build #63282 has finished for PR 14109 at commit
Test build #63288 has finished for PR 14109 at commit
The current fix for destroying the broadcast variables is OK. LGTM. Thanks!
LGTM. Merged into master. Great work! Thanks.
It would be great to have LOR share a similar style and destroy the mean and variance after usage. Thanks.
…es unnecessary data.

## What changes were proposed in this pull request?

Similar to `LeastSquaresAggregator` in #14109, `AFTAggregator` used for `AFTSurvivalRegression` ends up serializing the `parameters` and `featuresStd`, which is not necessary and can cause performance issues for high-dimensional data. This patch removes this serialization. This PR is highly inspired by #14109.

## How was this patch tested?

I tested this locally and verified the serialization reduction.

Before patch: ![image](https://cloud.githubusercontent.com/assets/1962026/17512035/abb93f04-5dda-11e6-97d3-8ae6b61a0dfd.png)

After patch: ![image](https://cloud.githubusercontent.com/assets/1962026/17512024/9e0dc44c-5dda-11e6-93d0-6e130ba0d6aa.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14519 from yanboliang/spark-16933.
…code to make it consistent with LinearRegression

## What changes were proposed in this pull request?

Update `LogisticCostAggregator` serialization code to make it consistent with #14109.

## How was this patch tested?

MLlib 2.0: ![image](https://cloud.githubusercontent.com/assets/19235986/17649601/5e2a79ac-61ee-11e6-833c-3bd8b5250470.png)

After this PR: ![image](https://cloud.githubusercontent.com/assets/19235986/17649599/52b002ae-61ee-11e6-9402-9feb3439880f.png)

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14520 from WeichenXu123/improve_logistic_regression_costfun.
…berAggregator

## What changes were proposed in this pull request?

Modifies the `HuberAggregator` class so that a copy of the coefficients vector isn't created every time an instance is added. Follows the approach of `LeastSquaresAggregator` and uses a transient lazy class variable to store the reused quantities. (See apache#14109 for an explanation of the use of transient lazy variables.)

On the test case in the linked JIRA, this change gives an order-of-magnitude performance improvement, reducing the time taken to fit the model from 540 to 47 seconds.

## How was this patch tested?

Existing unit tests. See https://issues.apache.org/jira/browse/SPARK-28062 for results from running a benchmark script.

Closes apache#24880 from Andrew-Crosby/spark-28062.

Authored-by: Andrew-Crosby <andrew.crosby@autotrader.co.uk>

Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?

Similar to `LogisticAggregator`, `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the features standard deviations, which is not necessary and can cause performance issues for high-dimensional data. This patch removes this serialization.

In #13729 the approach was to pass these values directly to the `add` method. The approach used here, initially, is to mark these fields as transient instead, which gives the benefit of keeping the signature of the `add` method simple and interpretable. The downside is that it requires the use of `@transient lazy val`s, which are difficult to reason about if one is not quite familiar with serialization in Scala/Spark.

## How was this patch tested?
Benchmark screenshots (in the original PR): MLlib, ML without patch, ML with patch.
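Putting the diff fragments above together, the serialization-safe shape of the aggregator looks roughly like this. It is a simplified sketch assuming the names from the diff; the standardization arithmetic is illustrative, not the exact merged code.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.{Vector, Vectors}

class LeastSquaresAggregatorSketch(
    bcCoefficients: Broadcast[Vector],
    bcFeaturesStd: Broadcast[Array[Double]]) extends Serializable {

  private val dim = bcCoefficients.value.size

  // Re-read lazily from the broadcast on each executor; never serialized
  // with the task closure.
  @transient private lazy val featuresStd = bcFeaturesStd.value

  // One lazy val holds the tuple, so the tuple field itself is transient.
  // Destructuring with `lazy val (v, o) = ...` would NOT be safe: the
  // compiler-generated Tuple2 field would lack the transient tag.
  @transient private lazy val effectiveCoefAndOffset = {
    val coefficientsArray = bcCoefficients.value.toArray.clone()
    var i = 0
    while (i < dim) {
      coefficientsArray(i) /= math.max(featuresStd(i), 1e-12) // illustrative scaling
      i += 1
    }
    (Vectors.dense(coefficientsArray), 0.0)
  }

  @transient private lazy val effectiveCoefficientsVector = effectiveCoefAndOffset._1
  @transient private lazy val offset = effectiveCoefAndOffset._2
}
```

Each task then deserializes only the small broadcast handles, and the effective coefficients are rebuilt once per executor on first use.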