SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data #1207

Closed
wants to merge 1 commit into from

dbtsai
Member

@dbtsai dbtsai commented Jun 25, 2014

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called VectorTransformer is defined for generic transformations on a vector. It declares a single method to implement, transform, which applies the transformation to a vector.

There are currently two implementations of VectorTransformer, and both can be easily extended with PMML transformation support.

  1. StandardScaler - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

  2. Normalizer - Normalizes samples individually to unit L^n norm
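
The design described above could look roughly like the following Spark-free sketch. It is not the code in this patch: vectors are plain `Array[Double]` instead of MLlib vectors, and the `fit` method, the `n - 1` variance denominator, and the handling of zero-variance columns and zero vectors are assumptions made for illustration.

```scala
// Hypothetical, Spark-free sketch of the design; vectors are plain Array[Double].
trait VectorTransformer extends Serializable {
  /** Applies the transformation to a single vector. */
  def transform(vector: Array[Double]): Array[Double]
}

/** Centers and/or scales each column using statistics learned from a training set. */
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
  private var mean: Array[Double] = Array.empty
  private var std: Array[Double] = Array.empty

  /** Computes the per-column mean and sample standard deviation (assumed n - 1 denominator). */
  def fit(data: Seq[Array[Double]]): this.type = {
    require(data.nonEmpty, "training set must not be empty")
    val n = data.length
    val dim = data.head.length
    mean = Array.tabulate(dim)(j => data.map(_(j)).sum / n)
    std = Array.tabulate(dim) { j =>
      val sumSq = data.map(v => math.pow(v(j) - mean(j), 2)).sum
      if (n > 1) math.sqrt(sumSq / (n - 1)) else 0.0
    }
    this
  }

  override def transform(vector: Array[Double]): Array[Double] = {
    require(vector.length == mean.length, "call fit first; dimensions must match")
    Array.tabulate(vector.length) { j =>
      val centered = if (withMean) vector(j) - mean(j) else vector(j)
      // Assumption: zero-variance columns map to 0.0 when scaling, to avoid dividing by zero.
      if (!withStd) centered else if (std(j) != 0.0) centered / std(j) else 0.0
    }
  }
}

/** Rescales each sample independently so that its L^n norm equals 1. */
class Normalizer(n: Int = 2) extends VectorTransformer {
  require(n >= 1, "n must be at least 1")

  override def transform(vector: Array[Double]): Array[Double] = {
    val norm = math.pow(vector.map(x => math.pow(math.abs(x), n)).sum, 1.0 / n)
    // Assumption: a zero vector is returned unchanged (as a copy).
    if (norm != 0.0) vector.map(_ / norm) else vector.clone()
  }
}

object FeatureScalingSketch {
  def main(args: Array[String]): Unit = {
    val train = Seq(Array(1.0, 10.0), Array(2.0, 20.0), Array(3.0, 30.0))
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(train)
    println(scaler.transform(Array(3.0, 30.0)).mkString(", "))
    println(new Normalizer(2).transform(Array(3.0, 4.0)).mkString(", "))
  }
}
```

The point of the shared trait is that StandardScaler depends on statistics from the training set while Normalizer is stateless, yet both expose the same `transform` call to downstream code.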

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16102/

@mengxr
Contributor

mengxr commented Jul 10, 2014

Is there a reference implementation that you followed, or is this all new? Does the PMML standard define something similar?

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull

* @param n L^2 norm by default. Normalization in L^n space.
*/
@DeveloperApi
class Normalizer(n: Int) extends VectorTransformer with Serializable {
Contributor

n -> p, which is commonly used for norms.

@SparkQA

SparkQA commented Aug 3, 2014

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull

@dbtsai
Member Author

dbtsai commented Aug 3, 2014

TODO

  1. p = Double.PositiveInfinity. 1, 2, and inf.
  2. Add withStd back.
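
For the first TODO item, one way to support `p = Double.PositiveInfinity` alongside dedicated paths for 1 and 2 is sketched below. This is a minimal illustration, not the code that was merged: the `PNorm` object and its `norm` and `normalize` helpers are hypothetical names, and vectors are plain `Array[Double]`.

```scala
// Hypothetical helper illustrating p-norm normalization with p given as a Double.
object PNorm {
  /** Computes the p-norm of v for p >= 1, with dedicated paths for p = 1, 2, and infinity. */
  def norm(v: Array[Double], p: Double): Double = {
    require(p >= 1.0, s"p must be >= 1, got $p")
    if (p == 1.0) {
      v.map(x => math.abs(x)).sum                       // L^1: sum of absolute values
    } else if (p == 2.0) {
      math.sqrt(v.map(x => x * x).sum)                  // L^2: Euclidean length
    } else if (p == Double.PositiveInfinity) {
      v.foldLeft(0.0)((m, x) => math.max(m, math.abs(x))) // L^inf: largest absolute value
    } else {
      math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
    }
  }

  /** Divides v by its p-norm; a zero vector is returned unchanged (an assumption of this sketch). */
  def normalize(v: Array[Double], p: Double = 2.0): Array[Double] = {
    val n = norm(v, p)
    if (n != 0.0) v.map(_ / n) else v.clone()
  }
}
```

Taking `p` as a `Double` rather than an `Int` is what makes the infinity norm expressible through the same parameter.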

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17813/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean)
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17813/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17833/consoleFull

@SparkQA

SparkQA commented Aug 4, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17834/consoleFull

@SparkQA

SparkQA commented Aug 4, 2014

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Double) extends VectorTransformer {
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17833/consoleFull

val dataInf = data.map(lInfNormalizer.transform(_))
val dataInfRDD = lInfNormalizer.transform(dataRDD)

assert((data.map(_.toBreeze), dataInf.map(_.toBreeze), dataInfRDD.collect().map(_.toBreeze))
Contributor

ditto: same as the L1 test case

@SparkQA

SparkQA commented Aug 4, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17839/consoleFull


/**
* :: DeveloperApi ::
* Normalizes samples individually to unit L^p^ norm
Contributor

^p^ (fun to read ^o^)

Member Author

lol...

@SparkQA

SparkQA commented Aug 4, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17841/consoleFull

@SparkQA

SparkQA commented Aug 4, 2014

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Double) extends VectorTransformer {
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17839/consoleFull

@mengxr
Contributor

mengxr commented Aug 4, 2014

LGTM. Merged into both master and branch-1.1. Thanks!!

@asfgit asfgit closed this in ae58aea Aug 4, 2014
asfgit pushed a commit that referenced this pull request Aug 4, 2014
…dependent variables or features of data

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic transformations on a vector. It declares a single method to implement, `transform`, which applies the transformation to a vector.

There are currently two implementations of `VectorTransformer`, and both can be easily extended with PMML transformation support.

1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

2) `Normalizer` - Normalizes samples individually to unit L^n norm

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

78c15d3 [DB Tsai] Alpine Data Labs

(cherry picked from commit ae58aea)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@SparkQA

SparkQA commented Aug 4, 2014

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Double) extends VectorTransformer {
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17841/consoleFull

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
wangyum pushed a commit that referenced this pull request May 26, 2023