Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29751][ML] Scalers use Summarizer instead of MultivariateOnlineSummarizer #26393

Closed
wants to merge 2 commits into from

Conversation

zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

use ml.Summarizer instead of mllib.MultivariateOnlineSummarizer

Why are the changes needed?

1, I found that using ml.Summarizer is faster than current impl;
2, mllib.MultivariateOnlineSummarizer maintain all arrays, while ml.Summarizer only maintain necessary arrays
3, using ml.Summarizer will avoid vector conversions to mlllib.Vector

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing testsuites

@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Nov 5, 2019

test code

import org.apache.spark.ml.feature._

scala> var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
19/11/05 13:47:02 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]          

scala> df.persist()
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]


scala> df.count
res1: Long = 32561

scala> (0 until 8).foreach(_ => df = df.union(df))

scala> df.count
res3: Long = 8335616 

val durations1 = (0 until 50).map{i => val tic = System.currentTimeMillis; val scaler = new MaxAbsScaler().setInputCol("features"); val model = scaler.fit(df); val toc = System.currentTimeMillis; toc - tic}

durations1.takeRight(30).sum.toDouble / 30



val durations2 = (0 until 50).map{i => val tic = System.currentTimeMillis; val scaler = new MinMaxScaler().setInputCol("features"); val model = scaler.fit(df); val toc = System.currentTimeMillis; toc - tic}

durations2.takeRight(30).sum.toDouble / 30



val durations3 = (0 until 50).map{i => val tic = System.currentTimeMillis; val scaler = new StandardScaler().setInputCol("features"); val model = scaler.fit(df); val toc = System.currentTimeMillis; toc - tic}

durations3.takeRight(30).sum.toDouble / 30

Results: (the last 30 fitting are taken into account)

MaxAbsScaler(Old) MinMaxScaler(Old) StandardScaler(Old) MaxAbsScaler(New) MinMaxScaler(New) StandardScaler(New)
6768.1 6875.2 6899.9 5862.1 5880.3 5889.7

@zhengruifeng zhengruifeng added the ML label Nov 5, 2019
@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Nov 5, 2019

IIRC, when Summarizer was created, I had done similar tests, and at that time Summarizer was much slower than MultivariateOnlineSummarizer.
I am surprised by the results and will look into whether it works in other places.

@zhengruifeng
Copy link
Contributor Author

friendly ping @srowen @WeichenXu123

@SparkQA
Copy link

SparkQA commented Nov 5, 2019

Test build #113244 has finished for PR 26393 at commit a2955d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looks OK as a refactoring and optimization - if it's faster.

val input = dataset.select($(inputCol)).rdd.map {
case Row(v: Vector) => OldVectors.fromML(v)
}
val summary = Statistics.colStats(input)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Statistics.colStats still used after this? just wondering if it goes away

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this PR, Statistics.colStats is no longer directly used in the .ml side.
It is still used in mllib.PCA, which is the impl of ml.PCA

@srowen srowen closed this in 854f30f Nov 6, 2019
@zhengruifeng zhengruifeng deleted the maxabs_opt branch November 7, 2019 02:00
@zhengruifeng
Copy link
Contributor Author

Thanks @srowen for reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants