[SPARK-14352][SQL] approxQuantile should support multi columns #12135

zhengruifeng · 2016-04-03T08:19:17Z

What changes were proposed in this pull request?

1. Add multi-column support based on the current private API.
2. Add multi-column support to PySpark.

How was this patch tested?

unit tests

SparkQA · 2016-04-03T09:51:51Z

Test build #54800 has finished for PR 12135 at commit 67947af.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-04-05T20:40:16Z

So if we are going to add this - it might make sense to also add the Python API at the same time (or at least create a JIRA to track adding this to the Python API so it doesn't slip between the cracks).

zhengruifeng · 2016-04-06T04:01:43Z

@holdenk Ok, I will add the python API into this PR.

SparkQA · 2016-04-06T07:24:10Z

Test build #55100 has finished for PR 12135 at commit 6cf073d.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-06T08:46:57Z

Test build #55101 has finished for PR 12135 at commit 7348d49.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-08T04:51:44Z

Test build #55307 has finished for PR 12135 at commit c9ebfef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-08T07:54:10Z

Test build #55326 has finished for PR 12135 at commit dccd337.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-04-08T07:56:51Z

@holdenk The wrapper for pyspark is added.

MLnick · 2016-04-14T18:58:19Z

@zhengruifeng @viirya this is a duplicate of #12207

MLnick · 2016-04-15T13:02:22Z

@viirya could you review this PR and add any items from yours that might be missing?

zhengruifeng · 2016-04-15T14:01:46Z

@MLnick What should I do?

MLnick · 2016-04-18T08:06:17Z

@zhengruifeng we will review and aim to merge this PR.

ping @viirya @jkbradley @holdenk

viirya · 2016-04-18T09:33:36Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

+  /**
+   * Python-friendly version of [[approxQuantile()]]
+   */
+  private[spark] def approxQuantileMultiCols(


You don't need to have a new method name. Since this multi-column version has different input types, it can use the same method name approxQuantile.

zhengruifeng · 2016-04-18T11:16:45Z

@viirya Thanks, I have updated this PR according to your comments.

SparkQA · 2016-04-18T19:16:47Z

Test build #56091 has finished for PR 12135 at commit ce41411.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-04-20T09:21:20Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

+  }
+
+
+  /**


If we only call the multi-column version from PySpark, do we still need the single column version Python API?

Both versions are used now.

SparkQA · 2016-04-20T16:32:12Z

Test build #56368 has finished for PR 12135 at commit bae4053.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-05-02T09:31:31Z

cc @MLnick ?

zhengruifeng · 2016-08-02T03:50:08Z

Test this please

holdenk · 2016-08-03T21:20:07Z

python/pyspark/sql/dataframe.py

+        if not isinstance(col, (str, list, tuple)):
+            raise ValueError("col should be a string, list or tuple.")
+
+        isStr = isinstance(col, str)


Super minor and subjective - but you could also just wrap it as a list here instead of propagating isStr down to the return.

Thanks for helping to review this PR; it has been quite a while.
The type of col determines the return type.
If I set col = [col] here, I will not know whether to return a list or a list of lists.
Like this:

>>> dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
>>> dataset.stat.approxQuantile(['label'], [0.1,0.2], 0.1)
[[0.0, 1.0]]
>>> dataset.stat.approxQuantile('label', [0.1,0.2], 0.1)
[0.0, 1.0]

That's true, but if we can get rid of one of the unused private methods on the Scala side I'm all for that. Can we not simply return the first element of the result if it is length 1, otherwise return the result?

@MLnick Sorry, I am not quite sure what you mean. Do you mean that if the input col is a list with only one element, the output should be one element (not a list)?

@MLnick Ok, I will remove the unused Python-friendly version of [[approxQuantile()]] with single col.
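To make the return-type question above concrete, here is a minimal pure-Python sketch of the dispatch being discussed. It is not the PySpark implementation: `exact_quantile` is a hypothetical stand-in for the JVM call, `data` is assumed to be a plain dict of column name to values, and the `relativeError` argument is left out for brevity. The point is only that the wrapper has to remember whether `col` arrived as a string, because after wrapping it in a list that information is gone.

```python
import math


def exact_quantile(values, p):
    """Hypothetical stand-in for the JVM call: nearest-rank quantile."""
    ordered = sorted(values)
    idx = max(0, math.ceil(p * len(ordered)) - 1)
    return float(ordered[idx])


def approx_quantile(data, col, probabilities):
    """Return a list of floats for a str `col`, a list of lists otherwise."""
    if not isinstance(col, (str, list, tuple)):
        raise ValueError("col should be a string, list or tuple.")

    # Remember the input shape before normalising to a list of names.
    is_str = isinstance(col, str)
    cols = [col] if is_str else list(col)

    result = [[exact_quantile(data[c], p) for p in probabilities] for c in cols]

    # Unwrap only when the caller passed a single name as a string.
    return result[0] if is_str else result
```

With this shape, `approx_quantile(data, 'label', ...)` and `approx_quantile(data, ['label'], ...)` stay distinguishable, which "return the first element if the result has length 1" would not allow.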

SparkQA · 2016-08-04T06:40:08Z

Test build #63208 has finished for PR 12135 at commit 785a667.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-09-30T05:39:24Z

Can any admin verify this PR? It's been a long time and I really need this feature...

MLnick

Made a few small comments. @viirya @holdenk mind making a final pass too?

MLnick · 2016-10-04T08:05:56Z

python/pyspark/sql/dataframe.py

@@ -1256,18 +1256,33 @@ def approxQuantile(self, col, probabilities, relativeError):
        Space-efficient Online Computation of Quantile Summaries]]
        by Greenwald and Khanna.

-        :param col: the name of the numerical column
+        :param col: the name of the numerical column, or a list/tuple of


I'd prefer to incorporate expected types here, see @viirya's doc here

OK, I will update this.

SparkQA · 2017-01-27T11:41:04Z

Test build #72070 has finished for PR 12135 at commit 29a691f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2017-01-29T13:43:43Z

ping @holdenk ?

holdenk · 2017-01-30T23:15:00Z

Sorry, my weekend ended up super busy. I'll try and take a look tomorrow :) Also thanks for adding more tests <3 tests :)

holdenk

Two minor pluralizations I noticed while going back over it, will double check with Nick for any other issues. If you have a chance to fix the docstrings that would be great otherwise I can do that during merge.

holdenk · 2017-01-31T22:03:15Z

python/pyspark/sql/dataframe.py

        :param probabilities: a list of quantile probabilities
          Each number must belong to [0, 1].
          For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
        :param relativeError:  The relative target precision to achieve
          (>= 0). If set to zero, the exact quantiles are computed, which
          could be very expensive. Note that values greater than 1 are
          accepted but give the same result as 1.
-        :return:  the approximate quantiles at the given probabilities
+        :return:  the approximate quantiles at the given probabilities. If
+          the input `col` is a string, the output is a list of float. If the


float should be pluralized (e.g. is a list of floats)

holdenk · 2017-01-31T23:03:40Z

python/pyspark/sql/dataframe.py

+          the input `col` is a string, the output is a list of float. If the
+          input `col` is a list or tuple of strings, the output is also a
+          list, but each element in it is a list of float, i.e., the output
+          is a list of list of float.


Also float -> floats here

zhengruifeng · 2017-02-01T06:46:02Z

@holdenk Updated! Thanks for your careful checking.

SparkQA · 2017-02-01T09:02:27Z

Test build #72237 has finished for PR 12135 at commit ccf4d8d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-02-02T02:03:48Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

+   * @param probabilities a list of quantile probabilities
+   *   Each number must belong to [0, 1].
+   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
+   * @param relativeError The relative target precision to achieve (>= 0).


As a friendly note for future changes, and because I know it is very easy to break javadoc8: it seems javadoc8 complains about this as below:

[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error]    * @see {@link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for
[error]      ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>'
[error]    * @param relativeError The relative target precision to achieve (>= 0).
[error]

We could do this as

@param relativeError The relative target precision to achieve (greater or equal to 0).

and fix the link as below if there is no better choice:

@see `DataFrameStatsFunctions.approxQuantile` for detailed description.

Just FYI, there are several cases in #16013

Are these just warnings generated? It would be nice to know during Jenkins testing if javadoc8 (or scaladoc for that matter) breaks.

The 2nd case links nicely to the single-arg version of the method, which contains the detailed doc, in Scaladoc. Pity it won't work with javadoc - is there another way to link it correctly? I suspect that what will work for javadoc will break the link for scaladoc...

Yea.. so, kindly @jkbradley opened a JIRA here - http://issues.apache.org/jira/browse/SPARK-18692

Actually, they are errors that make the documentation build fail under javadoc8. As far as I know, many of us had a hard time figuring out a good way to handle this (honestly, I think I have tried every combination I could think of, and to make it worse it seems to be case-by-case in my observation and tests), and we more or less ended up with the one above, since to my knowledge we are going to drop Java 7 support in the near future anyway.

Maybe, I will ping you if I happen to find another good way to make some links for both.

(BTW, IMHO, at least for now, building javadoc every time might be good to do but is not required. We can do our best to avoid these errors in our PRs and then sweep up the rest when a release is close, or in other related PRs if there are any.)

Should we create an issue to build javadoc with Java 8 to Jenkins then?

Ah, that JIRA is actually here - https://issues.apache.org/jira/browse/SPARK-18692 if we are talking about the same thing :)

Ah yes, sorry, the comments imply building it separately from the main Jenkins build, but if we want to avoid breaking Java 8 unidoc I was thinking building it as part of the normal PR build process would be better. Regardless, let's move the discussion over to that JIRA :)

gatorsmile · 2017-02-02T06:19:31Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

-    approxQuantile(col, probabilities.toArray, relativeError).toList.asJava
+      relativeError: Double): java.util.List[java.util.List[Double]] = {
+    approxQuantile(cols.toArray, probabilities.toArray, relativeError)
+        .map(_.toList.asJava).toList.asJava


The indent is not right.

gatorsmile · 2017-02-02T06:22:37Z

@holdenk When you do the code merge, you need to leave a comment to explain which branch you merged.

zhengruifeng · 2017-02-02T06:32:23Z

@HyukjinKwon @gatorsmile Thanks for pointing out those issues. I will create a followup PR to fix them ASAP.

gatorsmile · 2017-02-02T06:34:01Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

+   * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
+   *     detailed description.
+   *
+   * Note that rows containing any null or NaN values values will be removed before


values values -> values

@zhengruifeng Could you submit a follow-up PR to add test cases for null values?
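As a rough illustration of what such null-value tests would need to pin down, here is a small pure-Python sketch of the filtering the doc comment describes (rows containing any null or NaN value are removed before the calculation). It is only a model of that sentence, not Spark's code: rows are assumed to be plain dicts, and `drop_rows_with_missing` is a hypothetical helper name.

```python
import math


def drop_rows_with_missing(rows, cols):
    """Keep only rows whose values in `cols` are neither None nor NaN."""

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [row for row in rows if not any(is_missing(row[c]) for c in cols)]
```

One thing a test could check with a model like this is that, in the multi-column case, a missing value in one requested column also removes that row from the quantiles of the other requested columns.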

gatorsmile · 2017-02-02T06:35:56Z

@zhengruifeng Actually, I still have a few comments about this PR. I will leave the comments soon. Thanks!

gatorsmile · 2017-02-02T06:40:08Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

+   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
+   * @param relativeError The relative target precision to achieve (>= 0).
+   *   If set to zero, the exact quantiles are computed, which could be very expensive.
+   *   Note that values greater than 1 are accepted but give the same result as 1.


It sounds like you did not add any test case to verify it.
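For context on what a test of relativeError could assert: as I recall the approxQuantile documentation, for N rows, probability p and error err, the returned sample x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below is a hypothetical checker for that bound in plain Python (the function name and the 1-based rank convention are assumptions made here, not Spark code).

```python
import bisect
import math


def rank_within_tolerance(sorted_values, x, p, err):
    """Check the rank bound for sample x against already-sorted data."""
    n = len(sorted_values)
    lowest_rank = math.floor((p - err) * n)
    highest_rank = math.ceil((p + err) * n)

    # x may be repeated, so it occupies a range of 1-based ranks.
    first = bisect.bisect_left(sorted_values, x) + 1
    last = bisect.bisect_right(sorted_values, x)
    if last < first:
        return False  # x is not a sample from the data

    # Accept if any rank held by x falls inside the allowed window.
    return first <= highest_rank and last >= lowest_rank
```

Under this reading, err = 0 only admits the exact rank, while any err >= 1 makes the window cover every rank from 1 to N, which would be consistent with the doc's remark that values greater than 1 behave like 1.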

gatorsmile · 2017-02-02T06:41:28Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

+   *   Each number must belong to [0, 1].
+   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
+   * @param relativeError The relative target precision to achieve (>= 0).
+   *   If set to zero, the exact quantiles are computed, which could be very expensive.


This case is also missing.

Actually, you also need to consider the illegal cases, like negative values.

gatorsmile · 2017-02-02T06:42:38Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

+   * calculation.
+   * @param cols the names of the numerical columns
+   * @param probabilities a list of quantile probabilities
+   *   Each number must belong to [0, 1].


What happens if users provide a number outside this range? Do we have a test case to verify the behavior?
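A sketch of the kind of argument checks such tests would exercise, written in plain Python. This is an assumption about reasonable behavior (reject out-of-range input with a ValueError), not a statement of what Spark did at the time of this PR; `validate_quantile_args` and its messages are hypothetical.

```python
def validate_quantile_args(probabilities, relative_error):
    """Return (probabilities, relative_error) as floats, or raise ValueError."""
    if not isinstance(probabilities, (list, tuple)):
        raise ValueError("probabilities should be a list or tuple")

    for p in probabilities:
        is_number = isinstance(p, (int, float)) and not isinstance(p, bool)
        # `not (0 <= p <= 1)` also rejects NaN, which `p < 0 or p > 1` would not.
        if not is_number or not (0 <= p <= 1):
            raise ValueError("probabilities should be numerical in [0, 1]")

    is_number = isinstance(relative_error, (int, float)) and not isinstance(
        relative_error, bool
    )
    if not is_number or not (relative_error >= 0):
        raise ValueError("relativeError should be numerical and >= 0")

    return [float(p) for p in probabilities], float(relative_error)
```

Tests built around a helper like this would cover the boundary values 0 and 1, a negative probability, a probability above 1, a negative relativeError, and a relativeError above 1 (which the doc says is accepted).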

gatorsmile · 2017-02-02T06:44:37Z

@zhengruifeng Please try to improve the test case coverage in the follow-up PRs. You might find some bugs when you add these test cases. Thanks for your work!

holdenk · 2017-02-02T08:32:46Z

Thanks for the reminder @gatorsmile (it wasn't in the list of things to do when merging so I'll follow up and update the http://spark.apache.org/committers.html docs to add that as a follow up step along with the JIRA update).

For the record: Merged to master in b098576

MLnick · 2017-02-02T10:55:20Z

@gatorsmile it's a good point about the tests. However, this JIRA & PR was for exposing the multi-column functionality of approxQuantile. The missing test cases really date back to the original implementation. I think we should create a new JIRA for it, just to separate out the concerns and make tracking things easier.

@zhengruifeng could you create a JIRA ticket and link your new PR to that one also?

zhengruifeng · 2017-02-02T11:28:51Z

@MLnick I created SPARK-19436 for it.

gatorsmile · 2017-02-02T16:39:41Z

I am fine with creating a separate one, but normally, in Spark SQL, we do not create a separate JIRA for improving the related test cases if the original ones are missing.

## What changes were proposed in this pull request?

1, add the multi-cols support based on current private api
2, add the multi-cols support to pyspark

## How was this patch tested?

unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes apache#12135 from zhengruifeng/quantile4multicols.

zhengruifeng force-pushed the quantile4multicols branch from 67947af to 6cf073d Compare April 6, 2016 07:21

zhengruifeng force-pushed the quantile4multicols branch from 7348d49 to c9ebfef Compare April 8, 2016 03:19

viirya reviewed Apr 18, 2016

viirya reviewed Apr 20, 2016

holdenk reviewed Aug 3, 2016

zhengruifeng force-pushed the quantile4multicols branch from bae4053 to 785a667 Compare August 4, 2016 04:53

MLnick suggested changes Oct 4, 2016


update py tests

29a691f

zhengruifeng force-pushed the quantile4multicols branch from 6517f21 to 29a691f Compare January 27, 2017 09:19

holdenk reviewed Jan 31, 2017


float->floats

ccf4d8d

asfgit closed this in b098576 Feb 1, 2017

zhengruifeng deleted the quantile4multicols branch February 2, 2017 01:28

HyukjinKwon reviewed Feb 2, 2017


gatorsmile reviewed Feb 2, 2017


gatorsmile mentioned this pull request Feb 8, 2017

[SPARK-19436][SQL] Add missing tests for approxQuantile #16776

Closed
