SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data #1207

dbtsai · 2014-06-25T03:37:27Z

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called VectorTransformer is defined for generic transformations on a vector. It has one method to implement, transform, which applies the transformation to a vector.

There are currently two implementations of VectorTransformer, and both can easily be extended with PMML transformation support:

1) StandardScaler - Standardizes features by removing the mean and scaling to unit variance, using column summary statistics computed on the samples in the training set.
2) Normalizer - Normalizes samples individually to unit L^n norm.
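
As a rough illustration of the design described above (not the code in this patch), the trait and a standardizing implementation might look like the sketch below. It assumes plain Scala arrays in place of MLlib vectors, an explicit fit step, and the unbiased sample standard deviation; the names SketchVectorTransformer and SketchStandardScaler are hypothetical.

```scala
// Sketch only: plain arrays stand in for MLlib vectors, and every name
// prefixed with "Sketch" is hypothetical, not a class added by this PR.
trait SketchVectorTransformer extends Serializable {
  /** Applies the transformation to a single vector. */
  def transform(vector: Array[Double]): Array[Double]
}

/** Standardizes each column using statistics learned from training samples. */
class SketchStandardScaler(withMean: Boolean, withStd: Boolean)
    extends SketchVectorTransformer {

  private var mean: Array[Double] = Array.empty
  private var std: Array[Double] = Array.empty

  /** Computes per-column means and unbiased standard deviations. */
  def fit(data: Seq[Array[Double]]): this.type = {
    require(data.nonEmpty, "training data must not be empty")
    val n = data.length
    val dim = data.head.length
    mean = Array.tabulate(dim)(j => data.map(_(j)).sum / n)
    std = Array.tabulate(dim) { j =>
      val sumSq = data.map(row => math.pow(row(j) - mean(j), 2)).sum
      if (n > 1) math.sqrt(sumSq / (n - 1)) else 0.0
    }
    this
  }

  override def transform(vector: Array[Double]): Array[Double] = {
    require(vector.length == mean.length, "call fit with data of the same dimension first")
    Array.tabulate(vector.length) { j =>
      val centered = if (withMean) vector(j) - mean(j) else vector(j)
      // Assumption: a zero-variance column maps to 0.0 to avoid dividing by zero.
      if (!withStd) centered
      else if (std(j) != 0.0) centered / std(j)
      else 0.0
    }
  }
}
```

For example, fitting on the rows (1, 10) and (3, 10) with withMean = true and withStd = false and then transforming (3, 10) gives (1, 0), since the column means are (2, 10).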

AmplabJenkins · 2014-06-25T03:40:17Z

Merged build triggered.

AmplabJenkins · 2014-06-25T03:40:22Z

Merged build started.

AmplabJenkins · 2014-06-25T05:07:09Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-25T05:07:09Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16102/

mengxr · 2014-07-10T04:49:22Z

Is there a reference implementation that you followed, or is this all new? Does the PMML standard define something similar?

SparkQA · 2014-08-03T05:29:12Z

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull

SparkQA · 2014-08-03T05:29:20Z

QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull

SparkQA · 2014-08-03T05:34:29Z

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull

SparkQA · 2014-08-03T05:34:36Z

QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull

SparkQA · 2014-08-03T05:39:09Z

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull

mengxr · 2014-08-03T06:30:20Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala

+ * @param n  L^2 norm by default. Normalization in L^n space.
+ */
+@DeveloperApi
+class Normalizer(n: Int) extends VectorTransformer with Serializable {


n -> p; p is the symbol commonly used for norms.

SparkQA · 2014-08-03T06:34:33Z

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull

dbtsai · 2014-08-03T09:31:07Z

TODO:

- Allow p = Double.PositiveInfinity; cover p = 1, 2, and inf.
- Add withStd back.
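
The norms named in this TODO (p = 1, 2, and infinity) could be computed along the lines of the sketch below. It is an illustration on plain arrays, not the Normalizer code from this patch; SketchNormalizer, pNorm, and normalize are hypothetical names, and returning an all-zero vector unchanged is an assumption.

```scala
// Sketch only: hypothetical helpers on plain arrays, not the PR's Normalizer.
object SketchNormalizer {
  /** L^p norm for p >= 1; p = Double.PositiveInfinity gives the max norm. */
  def pNorm(v: Array[Double], p: Double): Double = {
    require(p >= 1.0, "p must be at least 1")
    if (p == 1.0) v.map(x => math.abs(x)).sum
    else if (p == 2.0) math.sqrt(v.map(x => x * x).sum)
    else if (p == Double.PositiveInfinity) v.foldLeft(0.0)((m, x) => math.max(m, math.abs(x)))
    else math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
  }

  /** Scales v to unit L^p norm; an all-zero vector is returned unchanged. */
  def normalize(v: Array[Double], p: Double = 2.0): Array[Double] = {
    val norm = pNorm(v, p)
    if (norm == 0.0) v.clone() else v.map(_ / norm)
  }
}
```

Taking p as a Double rather than an Int is what lets the infinity norm share the same parameter as the finite cases.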

SparkQA · 2014-08-03T09:34:05Z

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17813/consoleFull

SparkQA · 2014-08-03T09:34:47Z

QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean)
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17813/consoleFull

SparkQA · 2014-08-03T23:54:17Z

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17833/consoleFull

SparkQA · 2014-08-04T00:04:21Z

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17834/consoleFull

SparkQA · 2014-08-04T00:50:10Z

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Double) extends VectorTransformer {
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17833/consoleFull

mengxr · 2014-08-04T02:56:53Z

mllib/src/test/scala/org/apache/spark/mllib/feature/NormalizerSuite.scala

+    val dataInf = data.map(lInfNormalizer.transform(_))
+    val dataInfRDD = lInfNormalizer.transform(dataRDD)
+
+    assert((data.map(_.toBreeze), dataInf.map(_.toBreeze), dataInfRDD.collect().map(_.toBreeze))


ditto: same as the L1 test case

SparkQA · 2014-08-04T03:29:13Z

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17839/consoleFull

mengxr · 2014-08-04T03:56:17Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala

+
+/**
+ * :: DeveloperApi ::
+ * Normalizes samples individually to unit L^p^ norm


^p^ (fun to read ^o^)

SparkQA · 2014-08-04T04:24:23Z

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17841/consoleFull

SparkQA · 2014-08-04T04:28:30Z

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Double) extends VectorTransformer {
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17839/consoleFull

mengxr · 2014-08-04T04:40:06Z

LGTM. Merged into both master and branch-1.1. Thanks!!

…dependent variables or features of data

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.

There are two implementations of `VectorTransformer` now, and they all can be easily extended with PMML transformation support.

1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
2) `Normalizer` - Normalizes samples individually to unit L^n norm

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

78c15d3 [DB Tsai] Alpine Data Labs

(cherry picked from commit ae58aea)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

SparkQA · 2014-08-04T05:21:50Z

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Double) extends VectorTransformer {
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17841/consoleFull


mengxr mentioned this pull request Jul 31, 2014

Add normalizeByCol method to mllib.util.MLUtils. #1698

Closed

mengxr reviewed Aug 3, 2014

mengxr reviewed Aug 4, 2014

Alpine Data Labs

78c15d3

asfgit closed this in ae58aea Aug 4, 2014

wangyum pushed a commit that referenced this pull request May 26, 2023

[CARMEL-6513] Bug fix for reorder predicate (#1207)

8c7e20f
