[SPARK-11852] [ML] StandardScaler minor refactor #9839

Closed
wants to merge 5 commits into from

yanboliang
Contributor

withStd and withMean should be params of StandardScaler and StandardScalerModel.

def checkModelData(model1: StandardScalerModel, model2: StandardScalerModel): Unit = {
  assert(model1.mean === model2.mean)
  assert(model1.std === model2.std)
}
Contributor Author

We only need to check mean and std, which are part of the model data; withStd and withMean are params.

)
val df = sqlContext.createDataFrame(data.zip(resWithBoth)).toDF("features", "expected")
val standardScaler = new StandardScaler()
testEstimatorAndModelReadWrite(standardScaler, df, allParams, checkModelData)
}
Contributor Author

withStd and withMean of StandardScalerModel must be inherited from StandardScaler, so we cannot construct a StandardScalerModel directly by specifying the two variables. Here we combine the original test cases into one that uses testEstimatorAndModelReadWrite, which tests both the estimator and the model.

Contributor

I don't think this is an ideal unit test for read/write: model fitting is already covered by other tests and shouldn't be part of it. Constructing the estimator and model directly would also save some test time.
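To illustrate the kind of test this comment asks for, here is a minimal plain-Scala sketch. It does not use Spark: `ToyScalerModel`, `save`, and `load` are hypothetical stand-ins invented for this example, and the single-string storage format is an assumption. The point is only that the model is built directly from known values, so the round trip is checked without any fitting step.

```scala
// Plain-Scala sketch; ToyScalerModel, save, and load are hypothetical stand-ins, not Spark classes.
final case class ToyScalerModel(
    std: Vector[Double],
    mean: Vector[Double],
    withStd: Boolean = true,
    withMean: Boolean = false)

object ToyReadWrite {
  // Serialize to one line: the two flags, then std, then mean, separated by ';'.
  def save(m: ToyScalerModel): String =
    Seq(m.withStd.toString, m.withMean.toString, m.std.mkString(","), m.mean.mkString(","))
      .mkString(";")

  // Parse the line written by save (assumes non-empty std and mean).
  def load(s: String): ToyScalerModel = {
    val Array(ws, wm, stdStr, meanStr) = s.split(";")
    ToyScalerModel(
      stdStr.split(",").map(_.toDouble).toVector,
      meanStr.split(",").map(_.toDouble).toVector,
      ws.toBoolean,
      wm.toBoolean)
  }

  def main(args: Array[String]): Unit = {
    // Built directly from known values: no fit() call, so only read/write is exercised.
    val original = ToyScalerModel(Vector(2.0, 4.0), Vector(1.0, 2.0), withStd = false, withMean = true)
    val restored = load(save(original))
    assert(restored == original)
  }
}
```

Non-default values are used for both flags so that a loader that silently falls back to defaults would fail the check.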

@SparkQA

SparkQA commented Nov 19, 2015

Test build #46330 has finished for PR 9839 at commit 37fe45d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 19, 2015

Test build #46333 has finished for PR 9839 at commit 76ef338.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -189,7 +189,6 @@ object StandardScalerModel extends MLReadable[StandardScalerModel] {
sqlContext.read.parquet(dataPath)
.select("std", "mean", "withStd", "withMean")
.head()
// This is very likely to change in the future because withStd and withMean should be params.
Contributor

If withMean and withStd are params, we should save them only under metadata/, not under both data/ and metadata/. Can we change the constructor of ml.StandardScalerModel to take only std and mean, and construct the scaler only inside transform, so that scaler is no longer a member variable? We can fix any performance issues in 1.7.
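A rough sketch of the shape being proposed, in plain Scala rather than against Spark's classes: `ScalerModelSketch` and its members are hypothetical names for illustration, and the handling of a zero standard deviation is an assumption of this sketch. The constructor takes only `std` and `mean`, the two flags are settable params, and the scaling logic is rebuilt on each `transform` call instead of being held as a member.

```scala
// Plain-Scala sketch; ScalerModelSketch and its members are hypothetical, not Spark's API.
final class ScalerModelSketch(val std: Array[Double], val mean: Array[Double]) {
  require(std.length == mean.length, "std and mean must have the same length")

  // Params live outside the constructor, mirroring values that would be stored as metadata.
  var withStd: Boolean = true
  var withMean: Boolean = false

  // The scaling logic is derived per call from the current params, not cached as a member.
  def transform(v: Array[Double]): Array[Double] = {
    require(v.length == std.length, "vector size mismatch")
    Array.tabulate(v.length) { i =>
      val centered = if (withMean) v(i) - mean(i) else v(i)
      // Assumption of this sketch: a zero std maps the feature to 0.0.
      if (!withStd) centered
      else if (std(i) != 0.0) centered / std(i)
      else 0.0
    }
  }
}

object ScalerModelSketchDemo {
  def main(args: Array[String]): Unit = {
    val model = new ScalerModelSketch(std = Array(2.0, 4.0), mean = Array(1.0, 2.0))
    model.withMean = true
    println(model.transform(Array(3.0, 10.0)).mkString(", "))
  }
}
```

Rebuilding the scaling logic per call is the simple version of this design; caching it would be the kind of performance follow-up the comment defers.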

@yanboliang
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Nov 20, 2015

Test build #46413 has finished for PR 9839 at commit c6b6d7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2015

Test build #46415 has finished for PR 9839 at commit c6b6d7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Nov 20, 2015
```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9839 from yanboliang/standardScaler-refactor.

(cherry picked from commit 9ace2e5)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Contributor

mengxr commented Nov 20, 2015

LGTM. Merged into master and branch-1.6. Thanks!

@asfgit asfgit closed this in 9ace2e5 Nov 20, 2015
@yanboliang yanboliang deleted the standardScaler-refactor branch November 22, 2015 10:31