-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27425][SQL] Add count_if function #24335
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala
Outdated
Show resolved
Hide resolved
...talyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIfSuite.scala
Outdated
Show resolved
Hide resolved
Thanks for the PR @cryeo (I have executed the tests and they are passing, run scalastyle and there was no violation). Ok to test. |
retest this please |
As with lots of these -- we wouldn't add a new function unless it were standard SQL. With Spark SQL, it's pretty trivial to express count-if with a filter and count. |
Would you mind if I ask you the reason?
As you said, we can archive this with existing functions like followings, which are a little bit inconvenient.
However, I think that these are a little bit inconvenient and painful. |
This is my opinion but I think it would be shared by others.
The SQL helper functions you see today are mostly to match Hive. If Hive supports something that's a more compelling argument to add it for interop. |
Test build #104483 has finished for PR 24335 at commit
|
To me, I am not sure how useful it is too. But at least it's matched with some other DBs. Maybe it's better to be asked to mailing list and see if people like and need it. If there are not so much input about this, I wouldn't go for it for now. |
OK. I'll ask to mailing list :) |
To chime in here -- I feel this one is probably OK, given its ubiquity (also in excel?) Question is ... should it be count_if or countif? |
Another question is tho, are there like sum_if, avg_if too? |
I personally agree with that @srowen said . I don't believe we need to clutter the API when we have a simple solution like that. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala
Show resolved
Hide resolved
Hi, @cryeo . Did you ask the questions to the community as @HyukjinKwon recommended? I'm just wondering if the decision was made. If we are not going to proceed with this, we had better close this PR and JIRA issue. |
Sorry for the late reply. |
Then I guess it's fine to add |
I have just found four products which provide this function: Facebook Presto, Google BigQuery, IBM Informix, Microsoft Excel. Only Presto supports as I think that |
okie. can you rebase? |
OK. I just did it :) |
retest this please |
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala
Outdated
Show resolved
Hide resolved
@cryeo . Please update the PR description with more SQL references. You already told us |
I also support this feature and @HyukjinKwon . cc @gatorsmile |
@dongjoon-hyun Thanks for your review. I just modified code and PR description. Could you confirm it? |
Test build #106269 has finished for PR 24335 at commit
|
Retest this please. |
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala
Outdated
Show resolved
Hide resolved
Thank you for updating, @cryeo . The PR description looks enough. |
Test build #106275 has finished for PR 24335 at commit
|
Test build #106281 has finished for PR 24335 at commit
|
Retest this please. |
Test build #106284 has finished for PR 24335 at commit
|
@ExpressionDescription( | ||
usage = """ | ||
_FUNC_(expr) - Returns the number of `TRUE` values for the expression. | ||
This function is equivalent to `count(CASE WHEN expr THEN 1 END)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, I recommended this to give a hint to the users like the other SQL engines. The reason why I chose this expression instead of Count(NullIf(...))
which is used in this PR with RuntimeReplaceable
is that Count(NullIf(...))
doesn't work like new count_if
due to the type casting.
For the following case, Count(NullIf(...))
works while count_if
doesn't.
spark-sql> select count(nullif(a, false)) from values (1) T(a);
1
spark-sql> select count_if(a) from values (1) T(a);
Error in query: cannot resolve 'count_if(T.a)' due to data type mismatch: function count_if requires boolean type,
spark-sql> select count(case when a then 1 end) from values (1) T(a);
Error in query: cannot resolve 'CASE WHEN T.`a` THEN 1 END' due to data type mismatch: WHEN expressions in CaseWhen
In short, new count_if
's behavior is the same with count(CASE WHEN expr THEN 1 END)
. However, while reviewing this PR again, I notice that this might mislead the developers because we are using count(nullif(...))
technically.
To sum up, we cannot give the simple fallback example here. Both ones are inadequate. We had better remove this line. So, could you remove this line again, @cryeo ? Sorry, it's my bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thanks for your advice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. (Pending Jenkins).
@HyukjinKwon . Could you do the final sign-off and merge since you help @cryeo from the beginning?
Test build #106338 has finished for PR 24335 at commit
|
Retest this please. |
Test build #106344 has finished for PR 24335 at commit
|
Merged to master. |
Thank you all and @cryeo. Welcome to contributors :). |
## What changes were proposed in this pull request? Add `count_if` function which returns the number of records satisfying a given condition. There is no aggregation function like this in Spark, so we need to write like - `COUNT(CASE WHEN some_condition THEN 1 END)` or - `SUM(CASE WHEN some_condition THEN 1 END)`, which looks painful. This kind of function is already supported in Presto, BigQuery and even Excel. - Presto: [`count_if`](https://prestodb.github.io/docs/current/functions/aggregate.html#count_if) - BigQuery: [`countif`](https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions?hl=en#countif) - Excel: [`COUNTIF`](https://support.office.com/en-us/article/countif-function-e0de10c6-f885-4e71-abb4-1f464816df34?omkt=en-US&ui=en-US&rs=en-US&ad=US) (It is a little different from above twos) ## How was this patch tested? This patch is tested by unit test. Closes apache#24335 from cryeo/SPARK-27425. Authored-by: Chaerim Yeo <yeochaerim@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Add
count_if
function which returns the number of records satisfying a given condition.There is no aggregation function like this in Spark, so we need to write like
COUNT(CASE WHEN some_condition THEN 1 END)
orSUM(CASE WHEN some_condition THEN 1 END)
,which looks painful.
This kind of function is already supported in Presto, BigQuery and even Excel.
count_if
countif
COUNTIF
(It is a little different from above twos)How was this patch tested?
This patch is tested by unit test.