[SPARK-27425][SQL] Add count_if function #24335

cryeo · 2019-04-10T09:54:08Z

What changes were proposed in this pull request?

Add a count_if function, which returns the number of records satisfying a given condition.

There is no aggregate function like this in Spark, so we currently have to write either

COUNT(CASE WHEN some_condition THEN 1 END) or
SUM(CASE WHEN some_condition THEN 1 END),
which looks painful.
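As an illustration of the two workarounds above, here is a minimal sketch that runs them against an in-memory SQLite database through Python's standard `sqlite3` module rather than Spark. The `events` table, its rows, and the `conditional_counts` helper are hypothetical. Under these assumptions the two forms agree when at least one row matches, while the `SUM` form yields NULL instead of 0 when no row matches.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration (SQLite, not Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (status TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("ok",), ("error",), ("ok",), (None,)],
)


def conditional_counts(status):
    """Return (COUNT-based, SUM-based) counts of rows whose status matches."""
    query = """
        SELECT COUNT(CASE WHEN status = ? THEN 1 END),
               SUM(CASE WHEN status = ? THEN 1 END)
        FROM events
    """
    return conn.execute(query, (status, status)).fetchone()


matching = conditional_counts("ok")       # two rows have status 'ok'
no_match = conditional_counts("missing")  # no row has this status

print(matching)  # (2, 2)
print(no_match)  # (0, None): SUM over only NULLs is NULL, COUNT is 0
```

A dedicated conditional-count function avoids having to remember this difference between the two idioms.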

This kind of function is already supported in Presto, BigQuery and even Excel.

Presto: count_if
BigQuery: countif
Excel: COUNTIF (slightly different from the two above)

How was this patch tested?

This patch is tested by unit tests.

attilapiros · 2019-04-10T12:11:09Z

Thanks for the PR @cryeo (I ran the tests and they pass; I also ran scalastyle and there were no violations).

Ok to test.

attilapiros · 2019-04-10T12:59:44Z

retest this please

srowen · 2019-04-10T13:32:39Z

As with lots of these -- we wouldn't add a new function unless it were standard SQL. With Spark SQL, it's pretty trivial to express count-if with a filter and count.

cryeo · 2019-04-10T15:10:36Z

> we wouldn't add a new function unless it were standard SQL.

May I ask the reason?
Presto and BigQuery provide this even though it isn't part of the ISO/ANSI standard.

> With Spark SQL, it's pretty trivial to express count-if with a filter and count.

As you said, we can achieve this with existing functions such as the following:

COUNT(IF(very_complex_condition, 1, NULL))
COUNT(CASE WHEN very_complex_condition THEN 1 END)
SUM(IF(very_complex_condition, 1, NULL))
SUM(CASE WHEN very_complex_condition THEN 1 END)

However, I think that these are a little bit inconvenient and painful.

srowen · 2019-04-10T15:43:21Z

This is my opinion but I think it would be shared by others.
Presto and BigQuery are SQL-only and UDFs are relatively hard. Hence it makes sense to bake in a lot of SQL helper functions. In Spark it's easy to mix code in, so the value of SQL-only helpers isn't nearly as big.

df.filter("...complex condition").count() does this easily for example. It's not a great example, because I'm not even sure the current SQL equivalent is much more complex.

The SQL helper functions you see today are mostly to match Hive. If Hive supports something that's a more compelling argument to add it for interop.
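The filter-and-count approach mentioned above has a direct SQL counterpart. The sketch below compares it with the conditional-aggregate form, using Python's standard `sqlite3` module with an in-memory SQLite database as a stand-in for Spark. The `orders` table and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration (SQLite, not Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 50), ("east", 150), ("west", 200), ("west", 20), ("west", 300)],
)

# Filter first, then count the remaining rows.
filtered_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount > 100"
).fetchone()[0]

# Count conditionally inside the aggregate, without filtering the rows.
conditional_count = conn.execute(
    "SELECT COUNT(CASE WHEN amount > 100 THEN 1 END) FROM orders"
).fetchone()[0]

print(filtered_count, conditional_count)  # 3 3
```

For a single count over the whole table, both forms return the same number here.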

SparkQA · 2019-04-10T17:09:44Z

Test build #104483 has finished for PR 24335 at commit 9330d02.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-04-11T17:43:58Z

I am not sure how useful it is either, but at least it matches some other DBs. Maybe it's better to ask on the mailing list and see whether people like and need it. If there isn't much input about this, I wouldn't go for it for now.

cryeo · 2019-04-12T08:22:59Z

OK. I'll ask on the mailing list :)

rxin · 2019-04-15T23:38:22Z

To chime in here -- I feel this one is probably OK, given its ubiquity (also in Excel?)

Question is ... should it be count_if or countif?

HyukjinKwon · 2019-04-16T03:03:59Z

Another question, though: are there sum_if and avg_if too?

yeikel · 2019-04-19T00:52:48Z

I personally agree with what @srowen said. I don't believe we need to clutter the API when we have a simple solution like that.

dongjoon-hyun · 2019-05-20T04:46:40Z

Hi, @cryeo. Did you ask the community as @HyukjinKwon recommended? I'm just wondering whether a decision was made. If we are not going to proceed with this, we had better close this PR and the JIRA issue.

cryeo · 2019-05-20T06:28:00Z

> Another question, though: are there sum_if and avg_if too?

Sorry for the late reply.
I think that the use cases of count_if are quite different from those of sum_if or avg_if.
That's why Presto and BigQuery provide only count_if.

HyukjinKwon · 2019-05-20T08:48:14Z

Then I guess it's fine to add count_if alone. Which name is more prevalent, count_if or countif, per #24335 (comment)?

cryeo · 2019-05-20T17:48:19Z

I have just found four products which provide this function: Facebook Presto, Google BigQuery, IBM Informix, Microsoft Excel. Only Presto names it count_if; the others name it countif.

I think that count_if is easier to read than countif, but it seems that countif is more prevalent.

HyukjinKwon · 2019-05-21T04:26:14Z

okie. can you rebase?

cryeo · 2019-05-21T05:00:40Z

OK. I just did it :)

HyukjinKwon · 2019-05-21T05:45:31Z

retest this please

dongjoon-hyun · 2019-06-07T03:04:46Z

@cryeo, please update the PR description with more SQL references. You already gave us the Presto/BigQuery/Excel references. That will make this PR stronger.

dongjoon-hyun · 2019-06-07T03:05:56Z

I also support this feature, along with @HyukjinKwon.

cc @gatorsmile

cryeo · 2019-06-07T05:39:03Z

@dongjoon-hyun Thanks for your review. I just updated the code and the PR description. Could you check it?

SparkQA · 2019-06-07T07:05:02Z

Test build #106269 has finished for PR 24335 at commit 81ab7e6.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-06-07T16:23:42Z

Retest this please.

dongjoon-hyun · 2019-06-07T16:54:16Z

Thank you for updating, @cryeo. The PR description looks sufficient.

SparkQA · 2019-06-07T19:31:26Z

Test build #106275 has finished for PR 24335 at commit 81ab7e6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-06-07T20:47:04Z

Test build #106281 has finished for PR 24335 at commit 2f4d64e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-06-07T20:55:44Z

Retest this please.

SparkQA · 2019-06-07T23:58:43Z

Test build #106284 has finished for PR 24335 at commit 2f4d64e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-06-08T02:44:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala

+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr) - Returns the number of `TRUE` values for the expression.
+      This function is equivalent to `count(CASE WHEN expr THEN 1 END)`.


Initially, I recommended this to give users a hint, as other SQL engines do. I chose this expression instead of Count(NullIf(...)), which this PR uses with RuntimeReplaceable, because Count(NullIf(...)) doesn't behave like the new count_if due to type casting.

For the following case, Count(NullIf(...)) works while count_if doesn't.

spark-sql> select count(nullif(a, false)) from values (1) T(a);
1
spark-sql> select count_if(a) from values (1) T(a);
Error in query: cannot resolve 'count_if(T.a)' due to data type mismatch: function count_if requires boolean type,
spark-sql> select count(case when a then 1 end) from values (1) T(a);
Error in query: cannot resolve 'CASE WHEN T.`a` THEN 1 END' due to data type mismatch: WHEN expressions in CaseWhen

In short, the new count_if behaves the same as count(CASE WHEN expr THEN 1 END). However, while reviewing this PR again, I noticed that this might mislead developers, because technically we are using count(nullif(...)).

To sum up, we cannot give a simple fallback example here; both are inadequate. We had better remove this line. So, could you remove this line again, @cryeo? Sorry, my bad.

cryeo · 2019-06-10

Okay, thanks for your advice.

dongjoon-hyun

+1, LGTM. (Pending Jenkins).
@HyukjinKwon, could you do the final sign-off and merge, since you have helped @cryeo from the beginning?

SparkQA · 2019-06-10T07:05:02Z

Test build #106338 has finished for PR 24335 at commit c0a3289.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-06-10T07:09:02Z

Retest this please.

SparkQA · 2019-06-10T10:14:23Z

Test build #106344 has finished for PR 24335 at commit c0a3289.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-06-10T10:51:37Z

Merged to master.

HyukjinKwon · 2019-06-10T10:55:13Z

Thank you all and @cryeo. Welcome to contributors :).

## What changes were proposed in this pull request?

Add `count_if` function which returns the number of records satisfying a given condition.

There is no aggregation function like this in Spark, so we need to write like

- `COUNT(CASE WHEN some_condition THEN 1 END)` or
- `SUM(CASE WHEN some_condition THEN 1 END)`,

which looks painful.

This kind of function is already supported in Presto, BigQuery and even Excel.

- Presto: [`count_if`](https://prestodb.github.io/docs/current/functions/aggregate.html#count_if)
- BigQuery: [`countif`](https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions?hl=en#countif)
- Excel: [`COUNTIF`](https://support.office.com/en-us/article/countif-function-e0de10c6-f885-4e71-abb4-1f464816df34?omkt=en-US&ui=en-US&rs=en-US&ad=US) (It is a little different from above twos)

## How was this patch tested?

This patch is tested by unit test.

Closes apache#24335 from cryeo/SPARK-27425.

Authored-by: Chaerim Yeo <yeochaerim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

attilapiros reviewed Apr 10, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala

attilapiros reviewed Apr 10, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala

attilapiros reviewed Apr 10, 2019

...talyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIfSuite.scala

attilapiros reviewed Apr 10, 2019

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

cryeo force-pushed the SPARK-27425 branch from 912df3d to bc8adc1 Compare April 10, 2019 14:51

mgaido91 reviewed Apr 23, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala

dongjoon-hyun changed the title ~~[SPARK-27425] Add count_if functions~~ [SPARK-27425][SQL] Add count_if functions May 20, 2019

cryeo added 4 commits May 21, 2019 13:53

[SPARK-27425] Add count_if functions

e8f206f

[SPARK-27425] Fix

d7370ed

[SPARK-27425] Modify version, documentation and style

45215f9

[SPARK-27425] Modify unit test

e4f0465

cryeo force-pushed the SPARK-27425 branch from bc8adc1 to e4f0465 Compare May 21, 2019 05:00

dongjoon-hyun reviewed Jun 7, 2019

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala

dongjoon-hyun reviewed Jun 7, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala

dongjoon-hyun changed the title ~~[SPARK-27425][SQL] Add count_if functions~~ [SPARK-27425][SQL] Add count_if function Jun 7, 2019

dongjoon-hyun reviewed Jun 7, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala

[SPARK-27425] Reflect review

81ab7e6

dongjoon-hyun reviewed Jun 7, 2019

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala

dongjoon-hyun reviewed Jun 7, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala

[SPARK-27425] Reflect review

2f4d64e

dongjoon-hyun reviewed Jun 8, 2019

[SPARK-27425] Reflect review

c0a3289

dongjoon-hyun approved these changes Jun 10, 2019

HyukjinKwon approved these changes Jun 10, 2019

HyukjinKwon closed this in c1bb331 Jun 10, 2019
