[SPARK-28399][ML][PYTHON] implement RobustScaler #25160

Closed
wants to merge 16 commits

Conversation

zhengruifeng
Contributor

What changes were proposed in this pull request?

Implement RobustScaler.
Since the transformation is quite similar to StandardScaler's, I refactored the transform function so that it can be reused by both scalers (see the simplified sketch after this description).

How was this patch tested?

Existing and added tests.

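To make the refactoring idea above concrete, here is a simplified, hypothetical sketch (plain Array[Double] instead of Spark ML Vectors, illustrative names only, not the actual Spark code) of a transform function parameterized by shift and scale, so that the same code path can serve StandardScaler (shift = mean, scale = 1 / stddev) and RobustScaler (shift = median, scale = 1 / quantile range):

// Simplified sketch only; the real helpers work on ML Vectors and live in StandardScalerModel.
object SharedScalerSketch {
  def getTransformFunc(
      shift: Array[Double],
      scale: Array[Double],
      withShift: Boolean,
      withScale: Boolean): Array[Double] => Array[Double] = {
    (withShift, withScale) match {
      case (true, true) =>
        values => Array.tabulate(values.length)(i => (values(i) - shift(i)) * scale(i))
      case (true, false) =>
        values => Array.tabulate(values.length)(i => values(i) - shift(i))
      case (false, true) =>
        values => Array.tabulate(values.length)(i => values(i) * scale(i))
      case (false, false) =>
        values => values.clone()
    }
  }
}
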
@SparkQA

SparkQA commented Jul 15, 2019

Test build #107682 has finished for PR 25160 at commit 5c800de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Scale features using statistics that are robust to outliers.
* This Scaler removes the median and scales the data according to the quantile range
Member

Nit: you probably want to clarify what 'scales' means here. You divide through by the IQR?
Also the IQR isn't necessarily 25%-75% because it's configurable.

Contributor Author

Yes, after optionally removing the median, the features are divided by the quantile range.
The IQR is a special case of the quantile range, from 25% to 75%; if the lower/upper quantiles are set to other values, the range is no longer the IQR.

Member

OK, can you say it's the IQR by default but can be configured, something like that?

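To illustrate the point being discussed, here is a tiny, self-contained sketch (a naive quantile helper, not Spark's QuantileSummaries; values and quantiles are illustrative) of what centering by the median and scaling by the quantile range does to a feature with an outlier, using the default lower = 0.25 and upper = 0.75 (the IQR):

// Illustrative only: robust scaling of a single feature column.
val xs = Array(1.0, 2.0, 3.0, 4.0, 100.0)              // 100.0 is an outlier
val sorted = xs.sorted
def quantile(p: Double): Double = sorted(((sorted.length - 1) * p).round.toInt)

val median = quantile(0.5)                              // 3.0
val range  = quantile(0.75) - quantile(0.25)            // 4.0 - 2.0 = 2.0

// withCentering subtracts the median; withScaling divides by the quantile range,
// so the outlier does not dominate the resulting scale the way it would with mean/stddev.
val scaled = xs.map(x => (x - median) / range)
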
}
}

if (localAgg != null) {
Member

This might be clearer as:

if (localAgg == null) {
  Iterator.empty
} else {
  ...
}

private[spark] def transformDenseWithScale(scale: Array[Double],
    values: Array[Double]): Array[Double] = {
  var i = 0
  while(i < values.length) {
Member

Nit: space after while

}


def assertResult: Row => Unit = {
Member

private?

@SparkQA

SparkQA commented Jul 17, 2019

Test build #107770 has finished for PR 25160 at commit bcc10cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2019

Test build #107811 has finished for PR 25160 at commit a196c09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

Looking OK, but how about adding to pyspark?

@zhengruifeng
Contributor Author

@srowen I am adding it to the pyspark side in this PR.

@SparkQA

SparkQA commented Jul 19, 2019

Test build #107893 has finished for PR 25160 at commit cf0c581.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RobustScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
  • class RobustScalerModel(JavaModel, JavaMLReadable, JavaMLWritable):

@SparkQA

SparkQA commented Jul 19, 2019

Test build #107894 has finished for PR 25160 at commit 5e3ffc5.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2019

Test build #107895 has finished for PR 25160 at commit 2391c3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

agg = Array.fill(vec.size)(
new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, 0.001))
}
require(vec.size == agg.length)
Contributor

can we add a meaningful error message here?
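
(For reference, Scala's require takes an optional second argument that becomes the exception message; the wording below is only an illustration applied to the quoted line, not the final Spark text.)

require(vec.size == agg.length,
  s"Number of features ${vec.size} does not match the number of summaries ${agg.length}.")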

if (agg == null) {
  Iterator.empty
} else {
  var i = 0
Contributor

agg.map(_.compress())?

Contributor Author

I'd prefer to keep the current impl, since it avoids creating a temporary array and should be a little faster.

Contributor

I think the perf gain is negligible compared to the operations and the objects created in compress...

require(agg1.length == agg2.length)
var i = 0
while (i < agg1.length) {
  agg1(i) = agg1(i).merge(agg2(i)).compress()
Contributor

why are you adding compress here? AFAIK it is not needed here

Contributor Author

Yes, I added this to ensure compression; I will remove it.
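
To make the two points above concrete, here is a small, self-contained sketch (a hypothetical Summary case class standing in for QuantileSummaries) of the forms being compared:

// Hypothetical stand-in for QuantileSummaries, just to show the shapes involved.
final case class Summary(n: Int) {
  def compress(): Summary = this
  def merge(other: Summary): Summary = Summary(n + other.n)
}

val agg = Array(Summary(1), Summary(2))

// Reviewer's suggestion: a one-liner, at the cost of allocating a new array.
val viaMap = agg.map(_.compress())

// The PR's version: compress in place, avoiding the temporary array.
var i = 0
while (i < agg.length) {
  agg(i) = agg(i).compress()
  i += 1
}

// Merge step after the review: the extra compress() is dropped.
val other = Array(Summary(3), Summary(4))
var j = 0
while (j < agg.length) {
  agg(j) = agg(j).merge(other(j))
  j += 1
}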

range.toArray.map { v => if (v == 0) 0.0 else 1.0 / v }
} else Array.emptyDoubleArray

val func = StandardScalerModel.getTransformFunc(shift, scale,
Contributor

nit:

val func = StandardScalerModel.getTransformFunc(
  shift, scale, $(withCentering), $(withScaling))

Contributor Author

I am not sure, is it a convention? It is easy to find a similar indent/style in other places.

Contributor

I have always seen the above-mentioned style. Maybe the SQL part is stricter about this.

Contributor Author

I am neutral on this and will follow your advice.

case d: DenseVector => d.values.clone()
case v: Vector => v.toArray
}
val newValues = NewStandardScalerModel
Contributor

nit: this can go on one line

values
}

private[spark] def transformWithShift(shift: Array[Double],
Contributor

ditto

values
}

private[spark] def transformDenseWithScale(scale: Array[Double],
Contributor

ditto

values
}

private[spark] def transformSparseWithScale(scale: Array[Double],
Contributor

ditto

values
}

private[ml] def getTransformFunc(shift: Array[Double],
Contributor

ditto

@SparkQA

SparkQA commented Jul 22, 2019

Test build #108007 has finished for PR 25160 at commit 364fb83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-28399][ML] implement RobustScaler [SPARK-28399][ML][PYTHON] implement RobustScaler Jul 24, 2019
@srowen
Member

srowen commented Jul 25, 2019

WDYT @mgaido91 ?

@mgaido91
Contributor

LGTM, thanks.

@srowen (Member) left a comment

One last thing: can we add this to ml-features.md? It should be documented.
Ideally it would involve copying another scaler's examples, to match what we have for the other implementations.

@zhengruifeng
Contributor Author

@srowen OK, I am going to add the documentation.
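
For reference, a minimal end-to-end sketch of the kind of example that would go into ml-features.md (the dataset path mirrors the other scaler examples; parameter values are illustrative and assume the usual Spark ML setter pattern):

import org.apache.spark.ml.feature.RobustScaler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RobustScalerExample").getOrCreate()

// Illustrative dataset, as used by the other feature-scaling examples.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val scaler = new RobustScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithCentering(false)   // do not subtract the median
  .setWithScaling(true)      // divide by the quantile range
  .setLower(0.25)            // defaults to the IQR
  .setUpper(0.75)

// Compute the medians and quantile ranges by fitting the estimator.
val scalerModel = scaler.fit(data)

// Scale each feature by its quantile range.
val scaledData = scalerModel.transform(data)
scaledData.show()

spark.stop()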

@SparkQA

SparkQA commented Jul 29, 2019

Test build #108293 has finished for PR 25160 at commit a5d7de4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaRobustScalerExample

@SparkQA

SparkQA commented Jul 29, 2019

Test build #108294 has finished for PR 25160 at commit 45aa724.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

[screenshot]

I locally tested the examples, and they run successfully.
I built the docs, and they look fine.

@srowen
Member

srowen commented Jul 30, 2019

Merged to master

@srowen srowen closed this in 44c28d7 Jul 30, 2019
@zhengruifeng zhengruifeng deleted the robust_scaler branch July 31, 2019 01:42
@zhengruifeng
Contributor Author

@srowen @mgaido91 Thanks a lot for reviewing!
