[SPARK-11852] [ML] StandardScaler minor refactor #9839
def checkModelData(model1: StandardScalerModel, model2: StandardScalerModel): Unit = {
  assert(model1.mean === model2.mean)
  assert(model1.std === model2.std)
}
We only need to check `mean` and `std`, which are part of the model; `withStd` and `withMean` are params.
)
val df = sqlContext.createDataFrame(data.zip(resWithBoth)).toDF("features", "expected")
val standardScaler = new StandardScaler()
testEstimatorAndModelReadWrite(standardScaler, df, allParams, checkModelData)
}
`withStd` and `withMean` of `StandardScalerModel` must be inherited from `StandardScaler`, so we cannot construct `StandardScalerModel` directly by specifying the two variables. Here we combine the original test cases into one with `testEstimatorAndModelReadWrite`, which tests both the estimator and the model.
I think this is not an ideal unit test for read/write because the model fitting part shouldn't be part of it, which is already covered by other tests. Constructing estimator and model directly can save some test time.
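The point about skipping the fit step can be illustrated with a minimal sketch in plain Scala. Everything here is hypothetical: `ModelSketch`, its text file format, and `roundTrip` are stand-ins invented for illustration, not Spark's `StandardScalerModel` or its `MLWriter`/`MLReader` machinery. The sketch constructs the model data directly, saves it, loads it back, and compares only `std` and `mean`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical stand-in for a model; NOT Spark's StandardScalerModel.
final case class ModelSketch(std: Vector[Double], mean: Vector[Double]) {
  // Persist as two comma-separated lines: std first, then mean (assumed format).
  def save(path: Path): Unit = {
    val text = std.mkString(",") + "\n" + mean.mkString(",")
    Files.write(path, text.getBytes(StandardCharsets.UTF_8))
  }
}

object ModelSketch {
  def load(path: Path): ModelSketch = {
    val lines = new String(Files.readAllBytes(path), StandardCharsets.UTF_8).split("\n")
    def parse(line: String): Vector[Double] = line.split(",").map(_.toDouble).toVector
    ModelSketch(parse(lines(0)), parse(lines(1)))
  }

  // Plays the role of checkModelData: compare only the model data.
  def checkModelData(a: ModelSketch, b: ModelSketch): Unit = {
    assert(a.std == b.std)
    assert(a.mean == b.mean)
  }

  // Save to a temp file, load it back, and delete the file. No fitting involved.
  def roundTrip(model: ModelSketch): ModelSketch = {
    val path = Files.createTempFile("model-sketch", ".txt")
    try {
      model.save(path)
      load(path)
    } finally {
      Files.delete(path)
    }
  }
}
```

A test along these lines would build `ModelSketch(Vector(1.0, 2.5), Vector(0.0, -3.0))` by hand and pass it and its round-tripped copy to `checkModelData`, so no training data or fitting is needed.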
Test build #46330 has finished for PR 9839 at commit
Test build #46333 has finished for PR 9839 at commit
@@ -189,7 +189,6 @@ object StandardScalerModel extends MLReadable[StandardScalerModel] {
sqlContext.read.parquet(dataPath)
  .select("std", "mean", "withStd", "withMean")
  .head()
// This is very likely to change in the future because withStd and withMean should be params.
If `withMean` and `withStd` are parameters, we should save them in `metadata/` rather than under both `data/` and `metadata/`. Can we change the constructor of `ml.StandardScalerModel` to take only `std` and `mean`, and construct `scaler` only inside `transform`? Then `scaler` is no longer a member variable. We can fix performance issues in 1.7.
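A rough sketch of that shape, in plain Scala with no Spark dependency, might look like the following. The class name `ScalerModelSketch`, the `var` fields standing in for params, and the zero-variance handling are all assumptions made for illustration; the actual `ml.StandardScalerModel` uses Spark's `Param` machinery and delegates to the `mllib` scaler.

```scala
// Hypothetical sketch, NOT the Spark ML API: names and types are illustrative only.
final class ScalerModelSketch(val std: Array[Double], val mean: Array[Double]) {
  require(std.length == mean.length, "std and mean must have the same length")

  // Stand-ins for params; in Spark ML these would be Param[Boolean] values
  // shared with the estimator rather than plain mutable fields.
  var withStd: Boolean = true
  var withMean: Boolean = false

  def transform(features: Array[Double]): Array[Double] = {
    require(features.length == std.length, "feature vector has the wrong length")
    // The scaling function is built per call from the current param values,
    // so nothing derived from withStd/withMean is stored as a member.
    val scale: (Double, Int) => Double = (x, i) => {
      val centered = if (withMean) x - mean(i) else x
      // Assumption: a zero standard deviation maps the feature to 0.0.
      if (withStd) { if (std(i) != 0.0) centered / std(i) else 0.0 } else centered
    }
    features.zipWithIndex.map { case (x, i) => scale(x, i) }
  }
}
```

Because only `std` and `mean` are constructor arguments, they are the only values that would need to live under `data/`; the two flags can be persisted with the other params. Rebuilding the scaling function on every `transform` call is the performance cost the comment defers to a later release.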
Jenkins, test this please.
Test build #46413 has finished for PR 9839 at commit
Test build #46415 has finished for PR 9839 at commit
LGTM. Merged into master and branch-1.6. Thanks!
`withStd` and `withMean` should be params of `StandardScaler` and `StandardScalerModel`.