[SPARK-29751][ML] Scalers use Summarizer instead of MultivariateOnlineSummarizer #26393
Conversation
Test code:

import org.apache.spark.ml.feature._
scala> var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
19/11/05 13:47:02 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
scala> df.persist()
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
scala> df.count
res1: Long = 32561
scala> (0 until 8).foreach(_ => df = df.union(df))
scala> df.count
res3: Long = 8335616
val durations1 = (0 until 50).map { i =>
  val tic = System.currentTimeMillis
  val scaler = new MaxAbsScaler().setInputCol("features")
  val model = scaler.fit(df)
  val toc = System.currentTimeMillis
  toc - tic
}
durations1.takeRight(30).sum.toDouble / 30

val durations2 = (0 until 50).map { i =>
  val tic = System.currentTimeMillis
  val scaler = new MinMaxScaler().setInputCol("features")
  val model = scaler.fit(df)
  val toc = System.currentTimeMillis
  toc - tic
}
durations2.takeRight(30).sum.toDouble / 30

val durations3 = (0 until 50).map { i =>
  val tic = System.currentTimeMillis
  val scaler = new StandardScaler().setInputCol("features")
  val model = scaler.fit(df)
  val toc = System.currentTimeMillis
  toc - tic
}
durations3.takeRight(30).sum.toDouble / 30

Results (only the last 30 fits are taken into account):
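The repeated timing loop above can be factored into a small helper. This is a sketch, not part of the PR; the name `avgLastMillis` is made up. It runs a block `n` times and averages only the last `keep` durations, so that JIT compilation and cache warm-up runs are excluded, which is exactly what `takeRight(30)` does in the benchmark:

```scala
// Hypothetical helper (not part of the PR) mirroring the benchmark loop:
// run `block` n times, then average only the last `keep` wall-clock
// durations (in milliseconds) to discard warm-up iterations.
def avgLastMillis(n: Int, keep: Int)(block: => Unit): Double = {
  val durations = (0 until n).map { _ =>
    val tic = System.currentTimeMillis
    block
    val toc = System.currentTimeMillis
    (toc - tic).toDouble
  }
  durations.takeRight(keep).sum / keep
}
```

With this, each benchmark reduces to something like `avgLastMillis(50, 30) { new MaxAbsScaler().setInputCol("features").fit(df) }`.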
IIRC, when …
friendly ping @srowen @WeichenXu123
Test build #113244 has finished for PR 26393 at commit
Also looks OK as a refactoring and optimization - if it's faster.
val input = dataset.select($(inputCol)).rdd.map {
  case Row(v: Vector) => OldVectors.fromML(v)
}
val summary = Statistics.colStats(input)
Is Statistics.colStats still used after this? Just wondering if it goes away.
After this PR, Statistics.colStats is no longer directly used on the .ml side. It is still used in mllib.PCA, which is the implementation of ml.PCA.
Thanks @srowen for reviewing!
What changes were proposed in this pull request?
Use ml.Summarizer instead of mllib.MultivariateOnlineSummarizer.
Why are the changes needed?
1. Using ml.Summarizer is faster than the current implementation;
2. mllib.MultivariateOnlineSummarizer maintains all of its arrays, while ml.Summarizer maintains only the necessary ones;
3. Using ml.Summarizer avoids vector conversions to mllib.Vector.
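To illustrate point 2, below is a minimal, purely illustrative online summarizer. This is NOT Spark's implementation and the class name `MaxAbsSummarizer` is made up; it only shows the idea that a scaler such as MaxAbsScaler needs one array (the element-wise absolute maximum), whereas mllib.MultivariateOnlineSummarizer always carries the full set of mean/variance/count/max/min arrays:

```scala
// Purely illustrative (NOT Spark's implementation): an online summarizer
// that tracks only the element-wise absolute maximum, which is all that
// MaxAbsScaler needs, instead of also maintaining the mean/variance/count
// arrays that mllib.MultivariateOnlineSummarizer always carries.
class MaxAbsSummarizer(dim: Int) extends Serializable {
  private val maxAbs = new Array[Double](dim)

  // Fold one sample in (the per-partition step of a treeAggregate).
  def add(v: Array[Double]): this.type = {
    var i = 0
    while (i < dim) {
      val a = math.abs(v(i))
      if (a > maxAbs(i)) maxAbs(i) = a
      i += 1
    }
    this
  }

  // Combine two partial summaries (the merge step of a treeAggregate).
  def merge(other: MaxAbsSummarizer): this.type = {
    var i = 0
    while (i < dim) {
      if (other.maxAbs(i) > maxAbs(i)) maxAbs(i) = other.maxAbs(i)
      i += 1
    }
    this
  }

  def result: Array[Double] = maxAbs.clone()
}
```

Keeping only the requested statistic shrinks both the per-record update cost and the amount of state shuffled during aggregation, which is consistent with the speedups measured above.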
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing test suites.