[SPARK-29751][ML] Scalers use Summarizer instead of MultivariateOnlineSummarizer #26393
Conversation
Test code:

import org.apache.spark.ml.feature._
scala> var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
19/11/05 13:47:02 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
scala> df.persist()
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
scala> df.count
res1: Long = 32561
scala> (0 until 8).foreach(_ => df = df.union(df))
scala> df.count
res3: Long = 8335616
val durations1 = (0 until 50).map { i =>
  val tic = System.currentTimeMillis
  val scaler = new MaxAbsScaler().setInputCol("features")
  val model = scaler.fit(df)
  val toc = System.currentTimeMillis
  toc - tic
}
durations1.takeRight(30).sum.toDouble / 30

val durations2 = (0 until 50).map { i =>
  val tic = System.currentTimeMillis
  val scaler = new MinMaxScaler().setInputCol("features")
  val model = scaler.fit(df)
  val toc = System.currentTimeMillis
  toc - tic
}
durations2.takeRight(30).sum.toDouble / 30

val durations3 = (0 until 50).map { i =>
  val tic = System.currentTimeMillis
  val scaler = new StandardScaler().setInputCol("features")
  val model = scaler.fit(df)
  val toc = System.currentTimeMillis
  toc - tic
}
durations3.takeRight(30).sum.toDouble / 30

Results (only the last 30 fits are taken into account):
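The repeated timing loop above can be factored into a small helper. This is a sketch, not part of the PR; the name `avgLastMillis` is made up. It runs a block `n` times and averages only the last `keep` durations, so that JIT compilation and cache warm-up runs are excluded, which is exactly what `takeRight(30)` does in the benchmark:

```scala
// Hypothetical helper (not part of the PR) mirroring the benchmark loop:
// run `block` n times, then average only the last `keep` wall-clock
// durations (in milliseconds) to discard warm-up iterations.
def avgLastMillis(n: Int, keep: Int)(block: => Unit): Double = {
  val durations = (0 until n).map { _ =>
    val tic = System.currentTimeMillis
    block
    val toc = System.currentTimeMillis
    (toc - tic).toDouble
  }
  durations.takeRight(keep).sum / keep
}
```

With this, each benchmark reduces to something like `avgLastMillis(50, 30) { new MaxAbsScaler().setInputCol("features").fit(df) }`.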
IIRC, when …
friendly ping @srowen @WeichenXu123
Test build #113244 has finished for PR 26393 at commit
Also looks OK as a refactoring and optimization - if it's faster.
val input = dataset.select($(inputCol)).rdd.map {
  case Row(v: Vector) => OldVectors.fromML(v)
}
val summary = Statistics.colStats(input)
Is Statistics.colStats still used after this? Just wondering if it goes away.
After this PR, Statistics.colStats is no longer directly used on the .ml side. It is still used in mllib.PCA, which is the implementation of ml.PCA.
Thanks @srowen for reviewing!
What changes were proposed in this pull request?
Use ml.Summarizer instead of mllib.MultivariateOnlineSummarizer.
Why are the changes needed?
1. Using ml.Summarizer is faster than the current implementation;
2. mllib.MultivariateOnlineSummarizer maintains all of its arrays, while ml.Summarizer maintains only the necessary ones;
3. Using ml.Summarizer avoids vector conversions to mllib.Vector.
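To illustrate point 2, below is a minimal, purely illustrative online summarizer. This is NOT Spark's implementation and the class name `MaxAbsSummarizer` is made up; it only shows the idea that a scaler such as MaxAbsScaler needs one array (the element-wise absolute maximum), whereas mllib.MultivariateOnlineSummarizer always carries the full set of mean/variance/count/max/min arrays:

```scala
// Purely illustrative (NOT Spark's implementation): an online summarizer
// that tracks only the element-wise absolute maximum, which is all that
// MaxAbsScaler needs, instead of also maintaining the mean/variance/count
// arrays that mllib.MultivariateOnlineSummarizer always carries.
class MaxAbsSummarizer(dim: Int) extends Serializable {
  private val maxAbs = new Array[Double](dim)

  // Fold one sample in (the per-partition step of a treeAggregate).
  def add(v: Array[Double]): this.type = {
    var i = 0
    while (i < dim) {
      val a = math.abs(v(i))
      if (a > maxAbs(i)) maxAbs(i) = a
      i += 1
    }
    this
  }

  // Combine two partial summaries (the merge step of a treeAggregate).
  def merge(other: MaxAbsSummarizer): this.type = {
    var i = 0
    while (i < dim) {
      if (other.maxAbs(i) > maxAbs(i)) maxAbs(i) = other.maxAbs(i)
      i += 1
    }
    this
  }

  def result: Array[Double] = maxAbs.clone()
}
```

Keeping only the requested statistic shrinks both the per-record update cost and the amount of state shuffled during aggregation, which is consistent with the speedups measured above.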
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing test suites.