[SPARK-19208][ML][WIP] MaxAbsScaler and MinMaxScaler are very inefficient #16571
Conversation
Test build #71310 has finished for PR 16571 at commit.
Test build #71315 has finished for PR 16571 at commit.
Test build #71318 has finished for PR 16571 at commit.
val input: RDD[OldVector] = dataset.select($(inputCol)).rdd.map {
  case Row(v: Vector) => OldVectors.fromML(v)
}
val summary = Statistics.colStats(input)
Is it the call to colStats and the vector conversion that was so inefficient? Do you have any performance numbers to justify the change, since it does make the code more complicated?
  case _ =>
}
max
}, combOp = {
I think it may make the code clearer to move the seqOp and combOp to separate methods.
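A minimal sketch of what this suggestion could look like, using plain Scala arrays rather than Spark types (the object and method names here are assumptions, not code from the PR). The named `seqOp` and `combOp` would then be passed to `treeAggregate` instead of inline closures:

```scala
// Hypothetical refactor: element-wise max-abs aggregation split into
// named seqOp/combOp methods, e.g. for data.treeAggregate(zero)(seqOp, combOp).
object MaxAggregation {
  // seqOp: fold one vector (as Array[Double]) into the running max-abs array
  def seqOp(max: Array[Double], vec: Array[Double]): Array[Double] = {
    var i = 0
    while (i < vec.length) {
      val a = math.abs(vec(i))
      if (a > max(i)) max(i) = a
      i += 1
    }
    max
  }

  // combOp: merge two partial max arrays from different partitions
  def combOp(a: Array[Double], b: Array[Double]): Array[Double] = {
    var i = 0
    while (i < a.length) {
      if (b(i) > a(i)) a(i) = b(i)
      i += 1
    }
    a
  }
}
```

Naming the two operations also makes each one unit-testable on its own, which inline closures inside an aggregate call are not.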
val maxAbs = Array.tabulate(n) { i => math.max(math.abs(minVals(i)), math.abs(maxVals(i))) }

val maxAbs = dataset.select($(inputCol)).rdd.map {
  row => row.getAs[Vector](0)
Would moving this inside the treeAggregate possibly make the computation faster? E.g., operate on rows of vectors? That way you wouldn't have to pass through the data twice.
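The single-pass idea can be illustrated with plain Scala collections (this is a sketch of the technique only; variable names and the `foldLeft` stand-in for Spark's aggregation are assumptions). Min and max are tracked together in one traversal instead of two:

```scala
// Hypothetical sketch: track min and max in a single pass over the data,
// so the dataset is traversed once rather than once per statistic.
val data: Seq[Array[Double]] = Seq(Array(1.0, -4.0), Array(-3.0, 2.0))
val n = data.head.length

val (minVals, maxVals) = data.foldLeft(
  (Array.fill(n)(Double.PositiveInfinity), Array.fill(n)(Double.NegativeInfinity))
) { case ((mins, maxs), vec) =>
  var i = 0
  while (i < n) {
    if (vec(i) < mins(i)) mins(i) = vec(i)  // update per-slot minimum
    if (vec(i) > maxs(i)) maxs(i) = vec(i)  // update per-slot maximum
    i += 1
  }
  (mins, maxs)
}
```

In Spark the same pairing of accumulators would go inside one `treeAggregate` call, which is what the comment is suggesting.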
val maxAbs = Array.tabulate(n) { i => math.max(math.abs(minVals(i)), math.abs(maxVals(i))) }

val maxAbs = dataset.select($(inputCol)).rdd.map {
  row => row.getAs[Vector](0)
Actually, I think it might make the code clearer to:
1.) map to Array[Double], similar to what you did with the vector, but take the abs
2.) instead of using treeAggregate, just do a simple reduce on the arrays, taking the max for each slot
That would simplify the code more. Would that be worse performance-wise?
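A small sketch of this abs-then-reduce suggestion, using a plain Scala `Seq` in place of the RDD (the `vectors` name and sample data are illustrative assumptions):

```scala
// Hypothetical sketch of the reviewer's suggestion:
// 1) map each vector to its element-wise absolute values,
// 2) reduce with an element-wise max, one slot at a time.
val vectors: Seq[Array[Double]] = Seq(
  Array(1.0, -4.0, 2.0),
  Array(-3.0, 2.0, -5.0)
)

val maxAbs: Array[Double] = vectors
  .map(_.map(math.abs))  // abs of every element
  .reduce((a, b) => Array.tabulate(a.length)(i => math.max(a(i), b(i))))
```

On an RDD the same shape would be `rdd.map(...).reduce(...)`; a `reduce` allocates a fresh array per merge where `treeAggregate` can mutate an accumulator in place, which is presumably the performance trade-off the comment asks about.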
require(min.length == vec.size,
  s"Dimensions mismatch when adding new sample: ${min.length} != ${vec.size}")
vec.foreachActive {
  case (i, v) if v != 0.0 =>
This looks the same as the code above; can you refactor it into a separate function and call it from both places?
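One possible shape for that shared helper, sketched on plain arrays rather than Spark vectors (the function name `updateMinMax` is an assumption; the zero-skipping mirrors the `foreachActive`-style handling of sparse entries quoted above):

```scala
// Hypothetical shared helper: the duplicated min/max update loop pulled
// into one function that both call sites could use.
def updateMinMax(min: Array[Double], max: Array[Double], vec: Array[Double]): Unit = {
  require(min.length == vec.length,
    s"Dimensions mismatch when adding new sample: ${min.length} != ${vec.length}")
  var i = 0
  while (i < vec.length) {
    val v = vec(i)
    if (v != 0.0) {        // only non-zero (active) entries, as in foreachActive
      if (v < min(i)) min(i) = v
      if (v > max(i)) max(i) = v
    }
    i += 1
  }
}
```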
Can we please discuss, on the JIRA, whether this is something we actually want to do? @srowen raises a point that I tend to agree with, so I'd prefer not to proceed with code review until we are sure about it.
@sethah thank you for your concern; I added my thoughts to the JIRA.
In the JIRA, we decided to optimize MultivariateOnlineSummarizer first, so this PR will be closed.
This PR is very similar to my earlier PR #14950. Is that right? @jkbradley
What changes were proposed in this pull request?
Eliminate the usage of MultivariateOnlineSummarizer.
How was this patch tested?
Existing tests, plus manual tests in spark-shell.