SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data #1207
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
Is there a reference implementation that you followed, or is this all new? Does the PMML standard define something similar?
QA tests have started for PR 1207. This patch merges cleanly.
QA results for PR 1207:
QA tests have started for PR 1207. This patch merges cleanly.
QA results for PR 1207:
QA tests have started for PR 1207. This patch merges cleanly.
 * @param n L^2 norm by default. Normalization in L^n space.
 */
@DeveloperApi
class Normalizer(n: Int) extends VectorTransformer with Serializable {
`n` -> `p`, which is commonly used for norms.
QA results for PR 1207:
TODO
QA tests have started for PR 1207. This patch merges cleanly.
QA results for PR 1207:
QA tests have started for PR 1207. This patch merges cleanly.
QA tests have started for PR 1207. This patch merges cleanly.
QA results for PR 1207:
val dataInf = data.map(lInfNormalizer.transform(_))
val dataInfRDD = lInfNormalizer.transform(dataRDD)

assert((data.map(_.toBreeze), dataInf.map(_.toBreeze), dataInfRDD.collect().map(_.toBreeze))
ditto: same as the L1 test case
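For readers without the full diff: the fragment above transforms the same data twice, once vector by vector and once through the RDD overload, and then compares the results. Here is a minimal sketch of that consistency check in plain Scala with no Spark dependency. `normalizeInf` and `normalizeInfAll` are hypothetical stand-ins rather than the PR's methods, and returning all-zero vectors unchanged is an assumption made here.

```scala
// Illustrative sketch only; names are hypothetical, not taken from the PR.
object LInfConsistencySketch {

  // Scale one vector so that its largest absolute entry becomes 1.0.
  def normalizeInf(v: Array[Double]): Array[Double] = {
    val m = v.foldLeft(0.0)((acc, x) => math.max(acc, math.abs(x)))
    // Assumption: an all-zero vector is returned unchanged to avoid dividing by zero.
    if (m == 0.0) v.clone() else v.map(_ / m)
  }

  // Bulk variant, standing in for the RDD overload of `transform`.
  def normalizeInfAll(data: Seq[Array[Double]]): Seq[Array[Double]] =
    data.map(normalizeInf)

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Array(1.0, -4.0, 2.0),
      Array(0.0, 0.0, 0.0),
      Array(0.5, 0.25, -0.125)
    )
    val perVector = data.map(normalizeInf)
    val bulk = normalizeInfAll(data)

    // Both code paths should produce identical results.
    assert(perVector.zip(bulk).forall { case (a, b) => a.sameElements(b) })

    // Every non-zero sample should have L^inf norm exactly 1 after scaling.
    assert(
      perVector
        .filter(_.exists(_ != 0.0))
        .forall(v => v.map(x => math.abs(x)).max == 1.0)
    )
  }
}
```

The same pattern would apply to the L1 case by swapping the max for a sum of absolute values, though floating-point comparisons would then need a tolerance.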
QA tests have started for PR 1207. This patch merges cleanly.

/**
 * :: DeveloperApi ::
 * Normalizes samples individually to unit L^p^ norm
`^p^` (fun to read ^o^)
lol...
QA tests have started for PR 1207. This patch merges cleanly.
QA results for PR 1207:
LGTM. Merged into both master and branch-1.1. Thanks!!
…dependent variables or features of data

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.

There are two implementations of `VectorTransformer` now, and they all can be easily extended with PMML transformation support.

1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
2) `Normalizer` - Normalizes samples individually to unit L^n norm

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

78c15d3 [DB Tsai] Alpine Data Labs

(cherry picked from commit ae58aea)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
QA results for PR 1207:
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic transformations on a vector. It contains one method to be implemented, `transform`, which applies the transformation to a vector.

There are two implementations of `VectorTransformer` now, and both can be easily extended with PMML transformation support.

1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance, using column summary statistics on the samples in the training set.
2) `Normalizer` - Normalizes samples individually to unit L^n norm.
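
To make the two transformations concrete, here is a minimal sketch in plain Scala. It works on `Array[Double]` rather than MLlib vectors and is not the code from this PR. `SimpleTransformer`, `SimpleNormalizer`, and `SimpleStandardScaler` are hypothetical names. The use of the sample (n - 1) variance and the handling of zero norms and constant columns are assumptions made for illustration.

```scala
// Illustrative sketch only: plain Scala, no Spark dependency.
// All names are hypothetical stand-ins, not the PR's classes.
object FeatureScalingSketch {

  // Hypothetical analogue of a single-method vector transformer trait.
  trait SimpleTransformer extends Serializable {
    def transform(v: Array[Double]): Array[Double]
  }

  // Scales one sample to unit L^p norm; p = Double.PositiveInfinity gives the max norm.
  class SimpleNormalizer(p: Double = 2.0) extends SimpleTransformer {
    require(p >= 1.0, "p must be >= 1")

    private def norm(v: Array[Double]): Double =
      if (p.isPosInfinity) v.foldLeft(0.0)((m, x) => math.max(m, math.abs(x)))
      else math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)

    override def transform(v: Array[Double]): Array[Double] = {
      val n = norm(v)
      // Assumption: an all-zero sample is returned unchanged to avoid dividing by zero.
      if (n == 0.0) v.clone() else v.map(_ / n)
    }
  }

  // Standardizes each column using the mean and standard deviation of a training set.
  class SimpleStandardScaler(train: Seq[Array[Double]]) extends SimpleTransformer {
    require(train.size > 1, "need at least two training samples")

    private val dim = train.head.length

    private val mean: Array[Double] =
      Array.tabulate(dim)(j => train.map(_(j)).sum / train.size)

    // Assumption: unbiased sample standard deviation (divide by n - 1).
    private val std: Array[Double] =
      Array.tabulate(dim) { j =>
        math.sqrt(train.map(r => math.pow(r(j) - mean(j), 2)).sum / (train.size - 1))
      }

    override def transform(v: Array[Double]): Array[Double] =
      // Assumption: a constant column (std == 0) maps to 0.0.
      Array.tabulate(dim)(j => if (std(j) == 0.0) 0.0 else (v(j) - mean(j)) / std(j))
  }
}
```

With these definitions, `new SimpleNormalizer(2.0).transform(Array(3.0, 4.0))` scales the sample by its Euclidean length of 5, while a `SimpleStandardScaler` must first see a training set because it depends on column statistics. That difference is why the description says the normalizer works on samples individually and the scaler works from summary statistics.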