[SPARK-14432][SQL] Add API to calculate the approximate quantiles for multiple columns #12207
viirya wants to merge 7 commits into apache:master
Conversation
Test build #55116 has finished for PR 12207 at commit

retest this please.

Test build #55117 has finished for PR 12207 at commit

cc @jkbradley
StatFunctions.multipleApproxQuantiles(df, Seq(col), probabilities, relativeError).head.toArray
}

/**
If we don't have the full doc from the above method, we should perhaps provide an @see link to the full info about the algorithm?
Does the @see link work (as in links to the method with full doc)? Can you build the docs on your PR and check it? I'm not totally sure whether it will point to the doc of the other method or just to itself.
I've updated it with specified parameter types.
I'm not sure this will actually show up in the generated Scaladoc HTML.
@jkbradley @mengxr do you prefer to actually make the links show up in the HTML API doc? If so, it often doesn't look good in an IDE. But to do that, something like this is needed:
@see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for detailed description.
Thanks @viirya - any chance to update the PySpark API at the same time? :)

@MLnick Thanks for the review. I've updated the PySpark API.

Test build #55216 has finished for PR 12207 at commit

ping @jkbradley @MLnick any further comments on this?
self.assertEqual(len(aq), 3)
self.assertTrue(all(isinstance(q, float) for q in aq))

aqs = df.stat.approxQuantile(["a", "a"], [0.1, 0.5, 0.9], 0.1)
shall we add an assert that len(aqs) is 2?
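For illustration, a minimal sketch of the suggested check (plain asserts rather than the suite's unittest assertions; `df` is assumed to be the numeric test DataFrame used above):

```python
# Sketch of the suggested assertion: the multi-column call should return one
# list of quantiles per requested column (two lists here), each containing one
# float per requested probability.
aqs = df.stat.approxQuantile(["a", "a"], [0.1, 0.5, 0.9], 0.1)
assert len(aqs) == 2
assert all(len(qs) == 3 for qs in aqs)
assert all(isinstance(q, float) for qs in aqs for q in qs)
```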
Test build #55349 has finished for PR 12207 at commit

Test build #55350 has finished for PR 12207 at commit

@jkbradley Can you take a look too? Thanks!
ping @jkbradley @MLnick |
python/pyspark/sql/dataframe.py

:param col: the name of the numerical column
:param cols: str, list.
    The name(s) of the numerical column(s). Can be a string of the name
I think we can simplify this comment to: "Can be a single column name, or a list of names for multiple columns." It's clear from the specified types that it's a string name or a list of string names.
(We mention in the method doc that it operates on numerical columns, so we don't need to repeat that here.)
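A sketch of how the simplified parameter doc could read in dataframe.py (the surrounding docstring text is abbreviated and the wording of the other parameters is assumed, not taken from the PR):

```python
def approxQuantile(self, cols, probabilities, relativeError):
    """Calculates the approximate quantiles of numerical columns of a DataFrame.

    :param cols: str, list.
        Can be a single column name, or a list of names for multiple columns.
    :param probabilities: a list of quantile probabilities (wording assumed).
    :param relativeError: the relative target precision to achieve (wording assumed).
    """
```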
Test build #55512 has finished for PR 12207 at commit

LGTM, pending the discussion on the
if isinstance(cols, tuple):
    cols = list(cols)
if isinstance(cols, list):
    cols = _to_list(self._sc, cols)
We could consider verifying the contents of the list, as is done for probabilities right below (just a minor point and probably not as important, but it would be nice to have a useful error message if people pass in a list of expressions rather than strings).
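A minimal sketch of such a check, extending the snippet above (the error message wording and the plain string check are assumptions, not the PR's actual code):

```python
if isinstance(cols, tuple):
    cols = list(cols)
if isinstance(cols, list):
    # Suggested validation (sketch), mirroring the check done for probabilities:
    # fail early with a clear message if any element is not a column name string.
    if not all(isinstance(c, str) for c in cols):
        raise ValueError("cols should be column name strings, got %r" % (cols,))
    cols = _to_list(self._sc, cols)
```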
Test build #55568 has finished for PR 12207 at commit

ping @jkbradley @mengxr

Let me close this due to an earlier duplicate PR.
What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14432
Since we already have the underlying implementation to calculate the approximate quantiles for multiple columns, there is no reason to provide an API that works on only one column at a time. We should add an API for multiple columns too.
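For example, with this change a caller can compute quantiles for several columns in one call (a usage sketch; `df` is assumed to have numerical columns `a` and `b`):

```python
# Returns one list of quantiles per input column, e.g.
# [[a_p25, a_median, a_p75], [b_p25, b_median, b_p75]].
quantiles = df.stat.approxQuantile(["a", "b"], [0.25, 0.5, 0.75], 0.05)
```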
How was this patch tested?
Added tests to DataFrameStatSuite.