Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: fix COUNT DISTINCT filtering #37176

Merged
merged 6 commits into from Jan 8, 2019
Merged

SQL: fix COUNT DISTINCT filtering #37176

merged 6 commits into from Jan 8, 2019

Conversation

astefan
Copy link
Contributor

@astefan astefan commented Jan 7, 2019

This PR fixes #37086 bug where aggregation filtering (HAVING) on COUNT(DISTINCT) was behaving just like it does for non-DISTINCT version. Also, COUNT(DISTINCT) will try to use an exact version of the field whenever possible (the common scenario is the one of a text field), since it's using the cardinality aggregation which will not work on text fields.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@astefan
Copy link
Contributor Author

astefan commented Jan 7, 2019

run the gradle build tests 2

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I would ask to add some tests to the QueryTranslatorTests too, on top of the integ tests.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix. Can you please check that we have duplicates in the data and do a test with COUNT(column) and COUNT(DISTINCT column) to check they return different (and correct) results?
Potentially by using the gender field instead.
COUNT(*) vs COUNT(column) vs COLUMN (DISTINCT column) should all return different results especially since we NULL data.

@astefan
Copy link
Contributor Author

astefan commented Jan 7, 2019

@costin while testing COUNT in a separate PR (still WIP), COUNT(column) and COUNT(DISTINCT column) are treated as the same agg. One first issue is that equals and hashCode doesn't account for DISTINCT. I fixed this, but still, the issue persists.

So, at this moment, a test with COUNT(column), COUNT(DISTINCT column) will return incorrect results. But the issue is on my radar and working on it. My suggestion, if you agree, would be to push this PR and the follow up one will cover #34549 and any other issues I encounter (including the COUNT(column) - COUNT(DISTINCT column) you mentioned).

@astefan
Copy link
Contributor Author

astefan commented Jan 7, 2019

run the gradle build tests 2

@costin
Copy link
Member

costin commented Jan 7, 2019

@astefan +1

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@astefan
Copy link
Contributor Author

astefan commented Jan 7, 2019

run the gradle build tests 1

1 similar comment
@astefan
Copy link
Contributor Author

astefan commented Jan 7, 2019

run the gradle build tests 1

@astefan astefan merged commit 3fad9d2 into elastic:master Jan 8, 2019
@astefan astefan deleted the 37086fix branch January 8, 2019 06:47
astefan added a commit that referenced this pull request Jan 8, 2019
* Use `_count` aggregation value only for not-DISTINCT COUNT function calls
* COUNT DISTINCT will use the _exact_ version of a field (the `keyword` sub-field for example), if there is one

(cherry picked from commit 3fad9d2)
astefan added a commit that referenced this pull request Jan 8, 2019
* Use `_count` aggregation value only for not-DISTINCT COUNT function calls
* COUNT DISTINCT will use the _exact_ version of a field (the `keyword` sub-field for example), if there is one

(cherry picked from commit 3fad9d2)
@pcsanwald
Copy link
Contributor

closes #37087

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL: HAVING is using _count instead of the specified column
6 participants