Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-14352][SQL] approxQuantile should support multi columns #12135

Closed
wants to merge 23 commits into from

Conversation

zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

1, add the multi-cols support based on current private api
2, add the multi-cols support to pyspark

How was this patch tested?

unit tests

@SparkQA
Copy link

SparkQA commented Apr 3, 2016

Test build #54800 has finished for PR 12135 at commit 67947af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Apr 5, 2016

So if we are going to add this - it might make sense to also add the Python API at the same time (or at least create a JIRA to track adding this to the Python API so it doesn't slip between the cracks).

@zhengruifeng
Copy link
Contributor Author

@holdenk Ok, I will add the python API into this PR.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55100 has finished for PR 12135 at commit 6cf073d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55101 has finished for PR 12135 at commit 7348d49.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55307 has finished for PR 12135 at commit c9ebfef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55326 has finished for PR 12135 at commit dccd337.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

@holdenk The wrapper for pyspark is added.

@MLnick
Copy link
Contributor

MLnick commented Apr 14, 2016

@zhengruifeng @viirya this is a duplicate of #12207

@MLnick
Copy link
Contributor

MLnick commented Apr 15, 2016

@viirya could you review this PR and add any items from yours that might be missing?

@zhengruifeng
Copy link
Contributor Author

@MLnick What should I do?

@MLnick
Copy link
Contributor

MLnick commented Apr 18, 2016

@zhengruifeng we will review and aim to merge this PR.

ping @viirya @jkbradley @holdenk

/**
* Python-friendly version of [[approxQuantile()]]
*/
private[spark] def approxQuantileMultiCols(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to have a new method name. Since this multi-column version has different input types, it can use the same method name approxQuantile.

@zhengruifeng
Copy link
Contributor Author

@viirya Thanks, I have updated this PR according your comments.

@SparkQA
Copy link

SparkQA commented Apr 18, 2016

Test build #56091 has finished for PR 12135 at commit ce41411.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}


/**
Copy link
Member

@viirya viirya Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we only call the multi-column version from PySpark, do we still need the single column version Python API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the versions are used now.

@SparkQA
Copy link

SparkQA commented Apr 20, 2016

Test build #56368 has finished for PR 12135 at commit bae4053.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

cc @MLnick ?

@zhengruifeng
Copy link
Contributor Author

Test this please

if not isinstance(col, (str, list, tuple)):
raise ValueError("col should be a string, list or tuple.")

isStr = isinstance(col, str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor and subjective - but you could also just wrap it as a list here instead of propagating isStr down to the return.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for helping to review this PR, it is quite a while.
The type of col detemine the type of return.
If I make col = [col] here, I will not know whether to return a list or a list of list.
Like this:

>>> dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
>>> dataset.stat.approxQuantile(['label'], [0.1,0.2], 0.1)
[[0.0, 1.0]]
>>> dataset.stat.approxQuantile('label', [0.1,0.2], 0.1)
[0.0, 1.0]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but if we can get rid of one of the unused private methods on the Scala side I'm all for that. Can we not simply return the first element of the result if it is length 1, otherwise return the result?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MLnick Sorry, I am not very sure about your opinion. Do you mean that if the input col is a list of only one element, the output should be one element (not a list) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MLnick Ok, I will remove the unused Python-friendly version of [[approxQuantile()]] with single col.

@SparkQA
Copy link

SparkQA commented Aug 4, 2016

Test build #63208 has finished for PR 12135 at commit 785a667.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

Can any admin verify this PR? It's been a long time and I really need this feature...

Copy link
Contributor

@MLnick MLnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a few small comments. @viirya @holdenk mind making a final pass too?

@@ -1256,18 +1256,33 @@ def approxQuantile(self, col, probabilities, relativeError):
Space-efficient Online Computation of Quantile Summaries]]
by Greenwald and Khanna.

:param col: the name of the numerical column
:param col: the name of the numerical column, or a list/tuple of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to incorporate expected types here, see @viirya's doc here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will update this.

@SparkQA
Copy link

SparkQA commented Jan 27, 2017

Test build #72070 has finished for PR 12135 at commit 29a691f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

ping @holdenk ?

@holdenk
Copy link
Contributor

holdenk commented Jan 30, 2017

Sorry, my weekend ended up super busy. I'll try and take a look tomorrow :) Also thanks for adding more tests <3 tests :)

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor pluralizations I noticed while going back over it, will double check with Nick for any other issues. If you have a chance to fix the docstrings that would be great otherwise I can do that during merge.

:param probabilities: a list of quantile probabilities
Each number must belong to [0, 1].
For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
:param relativeError: The relative target precision to achieve
(>= 0). If set to zero, the exact quantiles are computed, which
could be very expensive. Note that values greater than 1 are
accepted but give the same result as 1.
:return: the approximate quantiles at the given probabilities
:return: the approximate quantiles at the given probabilities. If
the input `col` is a string, the output is a list of float. If the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

float should be pluralized (e.g. is a list of floats)

the input `col` is a string, the output is a list of float. If the
input `col` is a list or tuple of strings, the output is also a
list, but each element in it is a list of float, i.e., the output
is a list of list of float.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also float -> floats here

@zhengruifeng
Copy link
Contributor Author

@holdenk Updated! Thanks for your careful checking.

@SparkQA
Copy link

SparkQA commented Feb 1, 2017

Test build #72237 has finished for PR 12135 at commit ccf4d8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in b098576 Feb 1, 2017
@zhengruifeng zhengruifeng deleted the quantile4multicols branch February 2, 2017 01:28
* @param probabilities a list of quantile probabilities
* Each number must belong to [0, 1].
* For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
* @param relativeError The relative target precision to achieve (>= 0).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a kind comment for the future changes and to inform as I know it is super easy for javadoc8 to be broken, It seems javadoc8 complains it as below:

[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error]    * @see {@link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for
[error]      ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>'
[error]    * @param relativeError The relative target precision to achieve (>= 0).
[error]

We could do this as

@param relativeError The relative target precision to achieve (greater or equal to 0).

and fix the link as below If there is no better choice:

@see `DataFrameStatsFunctions.approxQuantile` for detailed description.

Just FYI, there are several cases in #16013

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these just warnings generated? It would be nice to know during Jenkins testing if javadoc8 (or scaladoc for that matter) breaks.

The 2nd case links nicely to the single-arg version of the method, which contains the detailed doc, in Scaladoc. Pity it won't work with javadoc - is there another way to link it correctly? I suspect that what will work for javadoc will break the link for scaladoc...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea.. so, kindly @jkbradley opened a JIRA here - http://issues.apache.org/jira/browse/SPARK-18692

Actually, they are errors that make documentation building failed in javadoc8. I and many guys had a hard time to figure that out a good way AKAIK (honestly, I would like to say that I have tried all the combination I could think. To make it worse, it seems case-by-case up to my observation and tests) and it kind of ended up with the one above.. as we are anyway going to drop Java 7 support in near future up to my knowledge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, I will ping you if I happen to find another good way to make some links for both.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(BTW, IMHO, at least for now, building javadoc everytime might be good to do but not required. We can avoid them at our best in our PRs and then sweep them when the release is close or in other related PRs if there are.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create an issue to build javadoc with Java 8 to Jenkins then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that JIRA is actually here - https://issues.apache.org/jira/browse/SPARK-18692 if we are talking about the same thing :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, sorry the comments imply building it separately from the main jenkins build, but if we want to avoid breaking Java 8 unidoc I was thinking building it as part of the normal PR build process would be better. Regardless lets move discussion over to that JIRA :)

approxQuantile(col, probabilities.toArray, relativeError).toList.asJava
relativeError: Double): java.util.List[java.util.List[Double]] = {
approxQuantile(cols.toArray, probabilities.toArray, relativeError)
.map(_.toList.asJava).toList.asJava
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indent is not right.

@gatorsmile
Copy link
Member

@holdenk When you do the code merge, you need to leave a comment to explain which branch you merged.

@zhengruifeng
Copy link
Contributor Author

@HyukjinKwon @gatorsmile Thanks for pointing out those issues. I will create a followup PR to fix them ASAP.

* @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
* detailed description.
*
* Note that rows containing any null or NaN values values will be removed before
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values values -> values

@zhengruifeng Could you submit a follow-up PR to add test cases for null values?

@gatorsmile
Copy link
Member

@zhengruifeng Actually, I still have a few comments about this PR. I will leave the comments soon. Thanks!

* For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
* @param relativeError The relative target precision to achieve (>= 0).
* If set to zero, the exact quantiles are computed, which could be very expensive.
* Note that values greater than 1 are accepted but give the same result as 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like you did not add any test case to verify it.

* Each number must belong to [0, 1].
* For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
* @param relativeError The relative target precision to achieve (>= 0).
* If set to zero, the exact quantiles are computed, which could be very expensive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is also missing.

Actually, you also need to consider the illegal cases, like negative values.

* calculation.
* @param cols the names of the numerical columns
* @param probabilities a list of quantile probabilities
* Each number must belong to [0, 1].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened if the users provide the number that is not in this boundary? Do we have a test case to verify the behavior?

@gatorsmile
Copy link
Member

@zhengruifeng Please try to improve the test case coverage in the follow-up PRs. You might find some bugs when you added these test cases. Thanks for your work!

@holdenk
Copy link
Contributor

holdenk commented Feb 2, 2017

Thanks for the reminder @gatorsmile (it wasn't in the list of things to do when merging so I'll follow up and update the http://spark.apache.org/committers.html docs to add that as a follow up step along with the JIRA update).

For the record: Merged to master in b098576

@MLnick
Copy link
Contributor

MLnick commented Feb 2, 2017

@gatorsmile it's a good point about the tests. However this JIRA & PR was for exposing the multi-column functionality of approxQuantiles. The missing test cases date back to original impl really. I think we should create a new JIRA for it just to separate out the concerns and make tracking things easier.

@zhengruifeng could you create a JIRA ticket and link your new PR to that one also?

@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Feb 2, 2017

@MLnick I created SPARK-19436 for it.

@gatorsmile
Copy link
Member

I am fine to create a separate one, but, normally, in Spark SQL, we do not create a separate JIRA for improving the related test case, if the original ones are missing.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

1, add the multi-cols support based on current private api
2, add the multi-cols support to pyspark
## How was this patch tested?

unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes apache#12135 from zhengruifeng/quantile4multicols.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants