Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27425][SQL] Add count_if function #24335

Closed
wants to merge 10 commits into from
Closed

Conversation

cryeo
Copy link
Contributor

@cryeo cryeo commented Apr 10, 2019

What changes were proposed in this pull request?

Add count_if function which returns the number of records satisfying a given condition.

There is no aggregation function like this in Spark, so we need to write like

  • COUNT(CASE WHEN some_condition THEN 1 END) or
  • SUM(CASE WHEN some_condition THEN 1 END)
    which looks painful.

This kind of function is already supported in Presto, BigQuery and even Excel.

How was this patch tested?

This patch is tested by unit test.

@attilapiros
Copy link
Contributor

Thanks for the PR @cryeo (I have executed the tests and they are passing, run scalastyle and there was no violation).

Ok to test.

@attilapiros
Copy link
Contributor

retest this please

@srowen
Copy link
Member

srowen commented Apr 10, 2019

As with lots of these -- we wouldn't add a new function unless it were standard SQL. With Spark SQL, it's pretty trivial to express count-if with a filter and count.

@cryeo
Copy link
Contributor Author

cryeo commented Apr 10, 2019

we wouldn't add a new function unless it were standard SQL.

Would you mind if I ask you the reason?
Presto and BigQuery provide this nevertheless it isn't ISO/ANSI standards.

With Spark SQL, it's pretty trivial to express count-if with a filter and count.

As you said, we can archive this with existing functions like followings, which are a little bit inconvenient.

  • COUNT(IF(very_complex_condition, 1, NULL))
  • COUNT(CASE WHEN very_complex_condition THEN 1 END)
  • SUM(IF(very_complex_condition, 1, NULL))
  • SUM(CASE WHEN very_complex_condition THEN 1 END)

However, I think that these are a little bit inconvenient and painful.

@srowen
Copy link
Member

srowen commented Apr 10, 2019

This is my opinion but I think it would be shared by others.
Presto and BigQuery are SQL-only and UDFs are relatively hard. Hence it makes sense to bake in a lot of SQL helper functions. In Spark it's easy to mix code in, so the value of SQL-only helpers isn't nearly as big.

df.filter("...complex condition").count() does this easily for example. It's not a great example, because I'm not even sure the current SQL equivalent is much more complex.

The SQL helper functions you see today are mostly to match Hive. If Hive supports something that's a more compelling argument to add it for interop.

@SparkQA
Copy link

SparkQA commented Apr 10, 2019

Test build #104483 has finished for PR 24335 at commit 9330d02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

To me, I am not sure how useful it is too. But at least it's matched with some other DBs. Maybe it's better to be asked to mailing list and see if people like and need it. If there are not so much input about this, I wouldn't go for it for now.

@cryeo
Copy link
Contributor Author

cryeo commented Apr 12, 2019

OK. I'll ask to mailing list :)

@rxin
Copy link
Contributor

rxin commented Apr 15, 2019

To chime in here -- I feel this one is probably OK, given its ubiquity (also in excel?)

Question is ... should it be count_if or countif?

@HyukjinKwon
Copy link
Member

Another question is tho, are there like sum_if, avg_if too?

@yeikel
Copy link

yeikel commented Apr 19, 2019

I personally agree with that @srowen said . I don't believe we need to clutter the API when we have a simple solution like that.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27425] Add count_if functions [SPARK-27425][SQL] Add count_if functions May 20, 2019
@dongjoon-hyun
Copy link
Member

Hi, @cryeo . Did you ask the questions to the community as @HyukjinKwon recommended? I'm just wondering if the decision was made. If we are not going to proceed with this, we had better close this PR and JIRA issue.

@cryeo
Copy link
Contributor Author

cryeo commented May 20, 2019

Another question is tho, are there like sum_if, avg_if too?

Sorry for the late reply.
I think that the use cases of count_if are quite different from that of sum_if or avg_if.
That's why Presto and BigQuery provide only count_if.

@HyukjinKwon
Copy link
Member

Then I guess it's fine to add count_if alone. Is the name count_if prevailing or countif per #24335 (comment)?

@cryeo
Copy link
Contributor Author

cryeo commented May 20, 2019

I have just found four products which provide this function: Facebook Presto, Google BigQuery, IBM Informix, Microsoft Excel. Only Presto supports as count_if, the others support ascountif.

I think that count_if is more easier to read than countif, but it seems that countif is more prevailing.

@HyukjinKwon
Copy link
Member

okie. can you rebase?

@cryeo
Copy link
Contributor Author

cryeo commented May 21, 2019

OK. I just did it :)

@HyukjinKwon
Copy link
Member

retest this please

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27425][SQL] Add count_if functions [SPARK-27425][SQL] Add count_if function Jun 7, 2019
@dongjoon-hyun
Copy link
Member

@cryeo . Please update the PR description with more SQL references. You already told us Presto/BigQuery/Excel references. That will make this PR stronger.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jun 7, 2019

I also support this feature and @HyukjinKwon .

cc @gatorsmile

@cryeo
Copy link
Contributor Author

cryeo commented Jun 7, 2019

@dongjoon-hyun Thanks for your review. I just modified code and PR description. Could you confirm it?

@SparkQA
Copy link

SparkQA commented Jun 7, 2019

Test build #106269 has finished for PR 24335 at commit 81ab7e6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@dongjoon-hyun
Copy link
Member

Thank you for updating, @cryeo . The PR description looks enough.

@SparkQA
Copy link

SparkQA commented Jun 7, 2019

Test build #106275 has finished for PR 24335 at commit 81ab7e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 7, 2019

Test build #106281 has finished for PR 24335 at commit 2f4d64e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Jun 7, 2019

Test build #106284 has finished for PR 24335 at commit 2f4d64e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ExpressionDescription(
usage = """
_FUNC_(expr) - Returns the number of `TRUE` values for the expression.
This function is equivalent to `count(CASE WHEN expr THEN 1 END)`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I recommended this to give a hint to the users like the other SQL engines. The reason why I chose this expression instead of Count(NullIf(...)) which is used in this PR with RuntimeReplaceable is that Count(NullIf(...)) doesn't work like new count_if due to the type casting.

For the following case, Count(NullIf(...)) works while count_if doesn't.

spark-sql> select count(nullif(a, false)) from values (1) T(a);
1

spark-sql> select count_if(a) from values (1) T(a);
Error in query: cannot resolve 'count_if(T.a)' due to data type mismatch: function count_if requires boolean type,

spark-sql> select count(case when a then 1 end) from values (1) T(a);
Error in query: cannot resolve 'CASE WHEN T.`a` THEN 1 END' due to data type mismatch: WHEN expressions in CaseWhen

In short, new count_if's behavior is the same with count(CASE WHEN expr THEN 1 END). However, while reviewing this PR again, I notice that this might mislead the developers because we are using count(nullif(...)) technically.

To sum up, we cannot give the simple fallback example here. Both ones are inadequate. We had better remove this line. So, could you remove this line again, @cryeo ? Sorry, it's my bad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for your advice.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. (Pending Jenkins).
@HyukjinKwon . Could you do the final sign-off and merge since you help @cryeo from the beginning?

@SparkQA
Copy link

SparkQA commented Jun 10, 2019

Test build #106338 has finished for PR 24335 at commit c0a3289.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Jun 10, 2019

Test build #106344 has finished for PR 24335 at commit c0a3289.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@HyukjinKwon
Copy link
Member

Thank you all and @cryeo. Welcome to contributors :).

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
## What changes were proposed in this pull request?

Add `count_if` function which returns the number of records satisfying a given condition.

There is no aggregation function like this in Spark, so we need to write like
- `COUNT(CASE WHEN some_condition THEN 1 END)` or
- `SUM(CASE WHEN some_condition THEN 1 END)`, 
which looks painful.

This kind of function is already supported in Presto, BigQuery and even Excel.
- Presto: [`count_if`](https://prestodb.github.io/docs/current/functions/aggregate.html#count_if)
- BigQuery: [`countif`](https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions?hl=en#countif)
- Excel: [`COUNTIF`](https://support.office.com/en-us/article/countif-function-e0de10c6-f885-4e71-abb4-1f464816df34?omkt=en-US&ui=en-US&rs=en-US&ad=US) (It is a little different from above twos)

## How was this patch tested?

This patch is tested by unit test.

Closes apache#24335 from cryeo/SPARK-27425.

Authored-by: Chaerim Yeo <yeochaerim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet