[SPARK-14432][SQL] Add API to calculate the approximate quantiles for multiple columns #12207
viirya wants to merge 7 commits into apache:master
Conversation
Test build #55116 has finished for PR 12207 at commit

retest this please.

Test build #55117 has finished for PR 12207 at commit

cc @jkbradley
StatFunctions.multipleApproxQuantiles(df, Seq(col), probabilities, relativeError).head.toArray
}

/**
If we don't have the full doc from the above method, we should perhaps provide an @see link to the full info about the algorithm?
Does the @see link work (as in links to the method with full doc)? Can you build the docs on your PR and check it? I'm not totally sure whether it will point to the doc of the other method or just to itself.
I've updated it with specified parameter types.
I'm not sure this will actually show up in the generated Scaladoc HTML.
@jkbradley @mengxr do you prefer to actually make the links show up in the HTML API doc? If so, it often doesn't look good in an IDE. But to do that, something like this is needed:
@see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for detailed description.
Thanks @viirya - any chance to update the PySpark API at the same time? :)

@MLnick Thanks for the review. I've updated the PySpark API.

Test build #55216 has finished for PR 12207 at commit

ping @jkbradley @MLnick any further comments on this?
self.assertEqual(len(aq), 3)
self.assertTrue(all(isinstance(q, float) for q in aq))

aqs = df.stat.approxQuantile(["a", "a"], [0.1, 0.5, 0.9], 0.1)
shall we add an assert that len(aqs) is 2?
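For illustration, a minimal sketch of the suggested check (plain asserts rather than the suite's unittest assertions; `df` is assumed to be the numeric test DataFrame used above):

```python
# Sketch of the suggested assertion: the multi-column call should return one
# list of quantiles per requested column (two lists here), each containing one
# float per requested probability.
aqs = df.stat.approxQuantile(["a", "a"], [0.1, 0.5, 0.9], 0.1)
assert len(aqs) == 2
assert all(len(qs) == 3 for qs in aqs)
assert all(isinstance(q, float) for qs in aqs for q in qs)
```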
Test build #55349 has finished for PR 12207 at commit

Test build #55350 has finished for PR 12207 at commit

@jkbradley Can you take a look too? Thanks!
ping @jkbradley @MLnick |
python/pyspark/sql/dataframe.py

:param col: the name of the numerical column
:param cols: str, list.
    The name(s) of the numerical column(s). Can be a string of the name
I think we can simplify this comment to: "Can be a single column name, or a list of names for multiple columns." It's clear from the specified types that it's a string name or a list of string names.
(We mention in the method doc that it operates on numerical columns, so we don't need to repeat that here.)
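A sketch of how the simplified parameter doc could read in dataframe.py (the surrounding docstring text is abbreviated and the wording of the other parameters is assumed, not taken from the PR):

```python
def approxQuantile(self, cols, probabilities, relativeError):
    """Calculates the approximate quantiles of numerical columns of a DataFrame.

    :param cols: str, list.
        Can be a single column name, or a list of names for multiple columns.
    :param probabilities: a list of quantile probabilities (wording assumed).
    :param relativeError: the relative target precision to achieve (wording assumed).
    """
```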
Test build #55512 has finished for PR 12207 at commit

LGTM, pending the discussion on the
if isinstance(cols, tuple):
    cols = list(cols)
if isinstance(cols, list):
    cols = _to_list(self._sc, cols)
We could consider verifying the contents of the list, as is done for probabilities right below (just a minor point and probably not as important, but it would be nice to have a useful error message if people pass in a list of expressions rather than strings).
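A minimal sketch of such a check, extending the snippet above (the error message wording and the plain string check are assumptions, not the PR's actual code):

```python
if isinstance(cols, tuple):
    cols = list(cols)
if isinstance(cols, list):
    # Suggested validation (sketch), mirroring the check done for probabilities:
    # fail early with a clear message if any element is not a column name string.
    if not all(isinstance(c, str) for c in cols):
        raise ValueError("cols should be column name strings, got %r" % (cols,))
    cols = _to_list(self._sc, cols)
```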
Test build #55568 has finished for PR 12207 at commit

ping @jkbradley @mengxr

Let me close this due to an earlier duplicate PR.
What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14432
Since we already have the underlying implementation to calculate the approximate quantiles for multiple columns, there is no reason to provide an API that works on only one column at a time. We should add an API for multiple columns too.
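For example, with this change a caller can compute quantiles for several columns in one call (a usage sketch; `df` is assumed to have numerical columns `a` and `b`):

```python
# Returns one list of quantiles per input column, e.g.
# [[a_p25, a_median, a_p75], [b_p25, b_median, b_p75]].
quantiles = df.stat.approxQuantile(["a", "b"], [0.25, 0.5, 0.75], 0.05)
```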
How was this patch tested?
Added tests to DataFrameStatSuite.