[SPARK-28399][ML][PYTHON] implement RobustScaler #25160
Conversation
Test build #107682 has finished for PR 25160 at commit
/**
 * Scale features using statistics that are robust to outliers.
 * This Scaler removes the median and scales the data according to the quantile range
Nit: you probably want to clarify what 'scales' means here. You divide through by the IQR?
Also the IQR isn't necessarily 25%-75% because it's configurable.
Yes, after optionally removing the median, the features are divided by the quantile range.
The IQR is a special case of the quantile range, from 25% to 75%. But if the lower/upper bounds are set to other values, then the range is no longer the IQR.
OK can you say it's IQR by default but can be configured, something like that?
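To illustrate the behavior being discussed, here is a rough plain-Python sketch (not Spark's implementation; the function name and defaults are illustrative): the median is optionally removed, then each value is divided by the quantile range, which is the IQR (25%–75%) by default but configurable.

```python
from statistics import median, quantiles

def robust_scale(xs, lower=0.25, upper=0.75,
                 with_centering=True, with_scaling=True):
    """Sketch of RobustScaler's per-feature transform (illustrative only).

    Optionally removes the median, then divides by the (lower, upper)
    quantile range -- the IQR when lower=0.25 and upper=0.75.
    """
    m = median(xs) if with_centering else 0.0
    if with_scaling:
        # quantiles(..., n=100) returns the 1st..99th percentiles
        qs = quantiles(xs, n=100)
        rng = qs[int(upper * 100) - 1] - qs[int(lower * 100) - 1]
        # mirror the PR's handling of a zero range: scale becomes 0
        scale = 1.0 / rng if rng != 0 else 0.0
    else:
        scale = 1.0
    return [(x - m) * scale for x in xs]
```

With the defaults this centers the data at zero and makes the IQR of the output equal to 1, which is what makes the transform robust to outliers compared to mean/stddev scaling.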
}
}
if (localAgg != null) {
This might be clearer as:
if (localAgg == null) {
  Iterator.empty
} else {
  ...
}
private[spark] def transformDenseWithScale(scale: Array[Double],
    values: Array[Double]): Array[Double] = {
  var i = 0
  while(i < values.length) {
Nit: space after while
}
def assertResult: Row => Unit = {
private?
Test build #107770 has finished for PR 25160 at commit
Test build #107811 has finished for PR 25160 at commit
Looking OK, but how about adding to pyspark?
@srowen I am adding it to the pyspark side in this PR.
Test build #107893 has finished for PR 25160 at commit
Test build #107894 has finished for PR 25160 at commit
Test build #107895 has finished for PR 25160 at commit
agg = Array.fill(vec.size)(
  new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, 0.001))
}
require(vec.size == agg.length)
can we add a meaningful error message here?
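For example, a meaningful message would name both dimensions. A hypothetical Python stand-in for the suggested `require(vec.size == agg.length, <message>)` (the helper name is illustrative, not Spark's API):

```python
def check_feature_dims(vec_size, agg_len):
    # Illustrative only: mirrors require(vec.size == agg.length, <message>)
    # with an error message that reports both the expected and actual sizes.
    if vec_size != agg_len:
        raise ValueError(
            f"Number of features must be {agg_len} but got {vec_size}")
```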
if (agg == null) {
  Iterator.empty
} else {
  var i = 0
agg.map(_.compress())
?
I tend to keep the current impl, since it avoids creating a temporary array and should be a little faster.
I think the perf gain is negligible compared to the operations and the objects created in compress...
require(agg1.length == agg2.length)
var i = 0
while (i < agg1.length) {
  agg1(i) = agg1(i).merge(agg2(i)).compress()
why are you adding compress here? AFAIK it is not needed here
Yes, I added this to ensure compression; I will remove it.
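For context, the merge step above pairs up per-feature summaries from two partitions. A toy Python stand-in (each "summary" here is just a sorted list of samples rather than Spark's bounded, approximate QuantileSummaries, so compress has no analogue):

```python
def merge_aggs(agg1, agg2):
    # Toy model: each per-feature "summary" is a sorted list of samples.
    # In Spark, QuantileSummaries.merge already produces a valid summary,
    # which is why an extra compress() after merge was flagged as redundant.
    assert len(agg1) == len(agg2), "both sides must cover the same features"
    return [sorted(a + b) for a, b in zip(agg1, agg2)]
```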
range.toArray.map { v => if (v == 0) 0.0 else 1.0 / v }
} else Array.emptyDoubleArray

val func = StandardScalerModel.getTransformFunc(shift, scale,
nit:
val func = StandardScalerModel.getTransformFunc(
shift, scale, $(withCentering), $(withScaling))
I am not sure, but is it a convention? It is easy to find a similar indent/style in other places.
I have always seen the above-mentioned style. Maybe the SQL part is stricter about this.
I am neutral on this, and will follow your advice.
case d: DenseVector => d.values.clone()
case v: Vector => v.toArray
}
val newValues = NewStandardScalerModel
nit: this can go on one line
values
}

private[spark] def transformWithShift(shift: Array[Double],
ditto
values
}

private[spark] def transformDenseWithScale(scale: Array[Double],
ditto
values
}

private[spark] def transformSparseWithScale(scale: Array[Double],
ditto
values
}

private[ml] def getTransformFunc(shift: Array[Double],
ditto
Test build #108007 has finished for PR 25160 at commit
WDYT @mgaido91 ?
LGTM, thanks.
One last thing: can we add this to ml-features.md? It should be documented.
Ideally it would involve copy-pasting another scaler's examples, to match what we have for the other implementations.
@srowen OK, I am going to add the documents.
Test build #108293 has finished for PR 25160 at commit
Test build #108294 has finished for PR 25160 at commit
Merged to master
What changes were proposed in this pull request?
Implement RobustScaler.
Since the transformation is quite similar to StandardScaler, I refactored the transform function so that it can be reused in both scalers.
How was this patch tested?
Existing and added tests.
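The refactor described above factors the per-row transform into a single function selected from the centering/scaling flags. A rough Python sketch of that dispatch idea (illustrative; the name and signature are not Spark's API):

```python
def get_transform_func(shift, scale, with_centering, with_scaling):
    """Return a per-vector transform, reusable by both StandardScaler-style
    and RobustScaler-style models (illustrative sketch)."""
    if with_centering and with_scaling:
        return lambda v: [(x - m) * s for x, m, s in zip(v, shift, scale)]
    if with_centering:
        return lambda v: [x - m for x, m in zip(v, shift)]
    if with_scaling:
        return lambda v: [x * s for x, s in zip(v, scale)]
    return lambda v: list(v)
```

The two scalers then differ only in how shift and scale are computed (mean/stddev versus median/quantile range), while sharing one transform path.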