
[SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer #19156

Closed

Conversation

WeichenXu123 (Contributor) commented Sep 7, 2017

What changes were proposed in this pull request?

Make several improvements to the DataFrame vectorized summarizer.

  1. Make the summarizer return Vector type for all metrics (except "count"). Previously it returned WrappedArray, which was not very convenient. (A usage sketch of the resulting interface follows this list.)

  2. Make MetricsAggregate inherit the ImplicitCastInputTypes trait, so it can check and implicitly cast input values.

  3. Add a "weight" parameter to all single-metric methods.

  4. Update the documentation and improve the example code in it.

  5. Simplify the test cases.
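
A minimal usage sketch of the interface after these changes (not code from this PR; it assumes a local SparkSession and illustrative column names "features" and "weight"):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("summarizer-sketch").getOrCreate()

val df = spark.createDataFrame(Seq(
  (Vectors.dense(1.0, 2.0), 1.0),
  (Vectors.dense(3.0, 4.0), 2.0)
)).toDF("features", "weight")

// Several metrics at once, computed with weights; each metric field in the
// resulting struct (mean, variance, ...) is Vector-typed.
df.select(Summarizer.metrics("mean", "variance")
  .summary(col("features"), col("weight")).as("summary"))
  .show(truncate = false)

// Single-metric shortcut with the new weight parameter.
df.select(Summarizer.mean(col("features"), col("weight"))).show()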

How was this patch tested?

Tests added and simplified.

WeichenXu123 (Contributor, Author) commented Sep 7, 2017

cc @yanboliang @thunterdb Thanks!

SparkQA commented Sep 7, 2017

Test build #81517 has finished for PR 19156 at commit 7b9fbdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala Outdated
@@ -109,31 +108,47 @@ object Summarizer extends Logging {
}

@Since("2.3.0")
def mean(col: Column): Column = getSingleMetric(col, "mean")
def mean(col: Column, weightCol: Column = lit(1.0)): Column = {

thunterdb (Contributor) commented Sep 7, 2017

I am not a fan of default parameters; they tend to cause issues with binary compatibility. Unless you have a good reason, you should have two different functions:

def mean(col: Column): Column = mean(col, lit(1.0))
def mean(col: Column, weightCol: Column): Column = ...
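
To illustrate the binary-compatibility concern, a hedged sketch (the object name and placeholder body are mine, not the PR's code): a Scala default argument is compiled into an extra synthetic method, roughly mean$default$2, which Java callers cannot use as a default and which becomes part of the public byte-code surface; two explicit overloads keep the signatures plain.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit

object OverloadSketch {
  // Overload style recommended above: two ordinary methods, no synthetic
  // default accessor in the byte code.
  def mean(col: Column): Column = mean(col, lit(1.0))
  def mean(col: Column, weightCol: Column): Column = col // placeholder body; the real method builds the aggregate

  // The default-argument style `def mean(col: Column, weightCol: Column = lit(1.0))`
  // would instead emit mean(Column, Column) plus a synthetic mean$default$2()
  // that Java code cannot invoke as a default.
}
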
WeichenXu123 (Contributor, Author) commented Sep 8, 2017

Thanks @thunterdb, code updated.

mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala Outdated
* @return a builder.
* @throws IllegalArgumentException if one of the metric names is not understood.
*
* Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD
* interface.
*/
@Since("2.3.0")
def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric) ++ metrics)
def metrics(metrics: String*): SummaryBuilder = {

WeichenXu123 (Author, Contributor) commented Sep 8, 2017

Use def metrics(metrics: String*) instead of def metrics(firstMetric: String, metrics: String*).
It will make it easier for PySpark to call this interface. (I will add the Python API later.)

cloud-fan (Contributor) commented Nov 8, 2017

Have you tried it from Java? IIRC this style is for Java compatibility.

WeichenXu123 (Author, Contributor) commented Nov 9, 2017

I haven't tested this on Java. But I can find other places that use a similar style, such as Dataset.toDF and Dataset.drop. Does that mean they also have a Java compatibility issue?

WeichenXu123 (Author, Contributor) commented Nov 9, 2017

@cloud-fan Do you mean this bug? https://issues.apache.org/jira/browse/SPARK-5904
But that is only related to abstract methods.
I have now added a Java test suite to make sure it works fine.
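
For reference, a minimal sketch of the varargs pattern being tested (the object name and the require message are illustrative, not the PR's exact code). The @scala.annotation.varargs annotation makes the compiler emit an additional Java-style String... bridge method, so Java code can call metrics("mean", "max") directly instead of constructing a Scala Seq:

object VarargsSketch {
  @scala.annotation.varargs
  def metrics(metrics: String*): Seq[String] = {
    require(metrics.nonEmpty, "Should include at least one metric") // illustrative validation
    metrics
  }
}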

SparkQA commented Sep 8, 2017

Test build #81554 has finished for PR 19156 at commit f5b0b11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
WeichenXu123 (Contributor, Author) commented Sep 13, 2017

ping @yanboliang Any other comments?
We need to merge this before the 2.3 release.

WeichenXu123 changed the title from [SPARK-19634][FOLLOW-UP][ML] Improve interface of dataframe vectorized summarizer to [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer on Sep 18, 2017
yanboliang (Contributor) commented Sep 19, 2017

@WeichenXu123 Sorry for the late response, I have been really busy these days. I will take a look in a few days. Thanks for your patience.

WeichenXu123 (Contributor, Author) commented Sep 22, 2017

@cloud-fan Can you help review the part of the code related to the SQL interface?

yanboliang (Contributor) commented Nov 7, 2017

I'd like to make a pass soon.

cloud-fan (Contributor) commented Nov 7, 2017

the SQL part LGTM

SparkQA commented Nov 8, 2017

Test build #83574 has finished for PR 19156 at commit 480e80d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala Outdated
* val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features"))
* val Row(Row(min_, max_)) = allStats.first()
* import org.apache.spark.ml.linalg._
* import org.apache.spark.sql.Row

cloud-fan (Contributor) commented Nov 8, 2017

nit: indent


override def eval(state: SummarizerBuffer): InternalRow = {
override def eval(state: SummarizerBuffer): Any = {

cloud-fan (Contributor) commented Nov 8, 2017

why change the return type?

WeichenXu123 (Author, Contributor) commented Nov 9, 2017

Both of them work, but other similar aggregate functions also use Any. Will it cause any issues?
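
A framework-free sketch of why both signatures compile (to my understanding, Spark's aggregate eval is declared to return Any, and Scala also allows an override to narrow the result type); the class and method names below are illustrative only:

abstract class Agg[B] {
  def eval(state: B): Any
}

class NarrowedAgg extends Agg[Array[Double]] {
  // Narrowing the result type is a legal (covariant) override.
  override def eval(state: Array[Double]): Seq[Double] = state.toSeq
}

class MirroredAgg extends Agg[Array[Double]] {
  // Mirroring the parent signature, as the other aggregates do.
  override def eval(state: Array[Double]): Any = state.sum
}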

("min", Min, arrayDType, Seq(ComputeMin, ComputeNNZ)),
("normL2", NormL2, arrayDType, Seq(ComputeM2)),
("normL1", NormL1, arrayDType, Seq(ComputeL1))
("numNonZeros", NumNonZeros, vectorUDT, Seq(ComputeNNZ)),

yanboliang (Contributor) commented Nov 8, 2017

Could you let me know why you made this change? I think we should use a long array rather than a double array to store numNonZeros.

WeichenXu123 (Author, Contributor) commented Nov 9, 2017

org.apache.spark.mllib.stat.MultivariateOnlineSummarizer also returns Vector for numNonZeros, so I prefer to keep consistent with it.

yanboliang (Contributor) commented Dec 12, 2017

In the old mllib.stat.MultivariateOnlineSummarizer, the internal variable is of type Array[Long], but the return type is Vector. Do you know the impact of using Vector internally? Thanks.

WeichenXu123 (Author, Contributor) commented Dec 13, 2017

Internally it still uses Array[Long] to do the computation; the result is converted to a Vector only when it is returned.
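
A minimal sketch of that split (class and method names are illustrative, not the PR's code): counts accumulate in an Array[Long] and are converted to a Vector only at the boundary where the result is exposed.

import org.apache.spark.ml.linalg.{Vector, Vectors}

class NnzTracker(numFeatures: Int) {
  private val nnz = new Array[Long](numFeatures)

  // Accumulate counts internally as Long.
  def update(v: Vector): Unit =
    v.foreachActive { (i, value) => if (value != 0.0) nnz(i) += 1L }

  // Convert to a Vector only when returning the result.
  def numNonZeros: Vector = Vectors.dense(nnz.map(_.toDouble))
}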

WeichenXu123 force-pushed the WeichenXu123:improve_vec_summarizer branch 2 times, most recently on Nov 9, 2017
SparkQA commented Nov 9, 2017

Test build #83639 has finished for PR 19156 at commit 525692e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Nov 9, 2017

Test build #83640 has finished for PR 19156 at commit 2e4b232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric) ++ metrics)
@scala.annotation.varargs
def metrics(metrics: String*): SummaryBuilder = {

cloud-fan (Contributor) commented Nov 9, 2017

How about binary compatibility? E.g., can Spark jobs built with old Spark versions run on new Spark without recompiling?

WeichenXu123 (Author, Contributor) commented Nov 9, 2017

This class was added after 2.2, does it matter?

cloud-fan (Contributor) commented Nov 9, 2017

ah then it doesn't matter

))

testExample("dense vector input",
Seq(Seq(-1.0, 0.0, 6.0), Seq(3.0, -3.0, 0.0)),

yanboliang (Contributor) commented Dec 13, 2017

Why do you remove the test against the ground-truth values?

mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala Outdated
val (df, c) = wrappedInit()
compare(df.select(metrics("variance").summary(c), variance(c)),
Seq(Row(exp.variance), summarizer.variance))
val (df, c, weight) = wrappedInit()

yanboliang (Contributor) commented Dec 13, 2017

nit: weight can be abbreviated to w.

WeichenXu123 force-pushed the WeichenXu123:improve_vec_summarizer branch on Dec 13, 2017
SparkQA commented Dec 13, 2017

Test build #84845 has finished for PR 19156 at commit 5647a49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Dec 15, 2017

Test build #84954 has finished for PR 19156 at commit 4d6617e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
WeichenXu123 (Contributor, Author) commented Dec 15, 2017

Jenkins retest this please.

SparkQA commented Dec 15, 2017

Test build #84958 has finished for PR 19156 at commit 4d6617e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
WeichenXu123 force-pushed the WeichenXu123:improve_vec_summarizer branch to f34da1f on Dec 15, 2017
SparkQA commented Dec 15, 2017

Test build #84960 has finished for PR 19156 at commit f34da1f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
WeichenXu123 (Contributor, Author) commented Dec 19, 2017

Jenkins retest this please.

SparkQA commented Dec 19, 2017

Test build #85109 has finished for PR 19156 at commit f34da1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
yanboliang (Contributor) left a review comment

LGTM except the last comment. Thanks.

}
}

test("basic error handling") {

yanboliang (Contributor) commented Dec 20, 2017

Why do you remove these two tests?

SparkQA commented Dec 21, 2017

Test build #85229 has finished for PR 19156 at commit 24697f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
yanboliang (Contributor) commented Dec 21, 2017

Merged into master, thanks.

asfgit closed this in d3ae3e1 on Dec 21, 2017
WeichenXu123 deleted the WeichenXu123:improve_vec_summarizer branch on Apr 24, 2019