
Introduce SQL interface for distinct count extension #13927

Conversation

sduffey-partnerize

Description

Introduce a SQL interface for the distinctcount extension, via a new function SEGMENT_DISTINCT.

Added calcite and druid-sql as dependencies of distinctcount, then introduced SegmentDistinctSqlAggregator, an implementation of Calcite's SqlAggregator.

Need some direction on documentation. For example, would we want to see the SQL equivalents of the examples that already exist here? Anything else?

Release note

New: You can now use distinct count in a SQL query with the SEGMENT_DISTINCT function.
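
For illustration, a query using the new function might look like the following (a sketch only; the datasource and column names are made up for this example and are not taken from the PR):

-- Hypothetical datasource "visits" with a "user_id" column.
SELECT
  FLOOR(__time TO DAY) AS "day",
  SEGMENT_DISTINCT(user_id) AS distinct_users
FROM visits
GROUP BY 1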


Key changed/added classes in this PR
  • org.apache.druid.query.aggregation.distinctcount.sql.SegmentDistinctSqlAggregator

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

}
if (lhs == null) {
  return ((Number) rhs).longValue();
}
return ((Number) lhs).longValue() + ((Number) rhs).longValue();
Contributor

This change makes combine no longer work on nulls; was that not needed for some reason?
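
For reference, a null-tolerant combine along the lines of the previous behaviour might look like this (a sketch only, not the exact code from the extension):

// Sketch: fall back to the non-null operand when one side is null,
// otherwise add the two partial counts together.
if (rhs == null) {
  return lhs;
}
if (lhs == null) {
  return rhs;
}
return ((Number) lhs).longValue() + ((Number) rhs).longValue();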

Author

Reverted

OperandTypes.ANY,
OperandTypes.and(
    OperandTypes.sequence(SIGNATURE, OperandTypes.ANY, OperandTypes.LITERAL),
    OperandTypes.family(SqlTypeFamily.ANY, SqlTypeFamily.STRING)
Contributor

I don't see the LITERAL STRING argument being used in the function body. Is that intentional?

Author

We had a look back at some other classes that extend SqlAggFunction, particularly ApproxCountDistinctSqlAggFunction, and noticed that it doesn't take the bitmap factory argument. So we decided to simplify SEGMENT_DISTINCT in the same way. Is that OK?

final ColumnType inputType = Calcites.getColumnTypeForRelDataType(dataType);

if (inputType == null) {
  throw new ISE(
Contributor

You should use org.apache.druid.sql.calcite.planner.UnsupportedSQLQueryException instead of ISE. Please refer to the class documentation for why the former is preferred.
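
For example, the throw could become something along these lines (a sketch; the message text is illustrative, and it assumes UnsupportedSQLQueryException accepts a format string plus arguments in the same style as ISE):

// Sketch: report the unsupported input type through the SQL planner's
// dedicated exception rather than a generic ISE.
throw new UnsupportedSQLQueryException(
    "Cannot translate input of type [%s] for SEGMENT_DISTINCT",
    dataType.getSqlTypeName()
);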

Contributor

Also, can you please explain why this inputType check is required? If we don't create the dimensionSpec below (as mentioned in another comment of mine), we probably won't run into an error with inputType being null in this code.
Would nullity of inputType cause any issue in the aggregation, and if so, can you please add a comment explaining it?

@@ -45,6 +45,7 @@ public void aggregate()
IndexedInts row = selector.getRow();
for (int i = 0, rowSize = row.size(); i < rowSize; i++) {
int index = row.get(i);

Contributor

nit: We can revert this change

  dimensionSpec = new DefaultDimensionSpec(virtualColumnName, null, inputType);
}

aggregatorFactory = new DistinctCountAggregatorFactory(name, dimensionSpec.getDimension(), null);
Contributor

Seems slightly counter-intuitive that we are creating a dimension spec in the above cases just to get dimensionSpec.getDimension() while creating the final aggregator.

Instead of Line#116, can we do dimensionName = columnArg.getSimpleExtraction().getColumn() (since it's a direct column access), and at Line#122 do dimensionName = virtualColumnName, then pass that to the aggregator factory?
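
A rough sketch of that suggestion (variable names follow the PR's snippets above; the exact surrounding code, including the direct-column-access check, is assumed):

// Sketch: resolve a dimension name directly instead of building a
// DefaultDimensionSpec only to call getDimension() on it.
final String dimensionName;
if (columnArg.isDirectColumnAccess()) {
  // direct reference to an existing column
  dimensionName = columnArg.getSimpleExtraction().getColumn();
} else {
  // otherwise use the virtual column registered for the expression
  dimensionName = virtualColumnName;
}
aggregatorFactory = new DistinctCountAggregatorFactory(name, dimensionName, null);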

Author

Thanks for the feedback, will try that out!

Contributor
@LakshSingla left a comment

After re-reviewing the PR, here are a few high-level comments:

  1. Since the extension only works when certain pre-conditions are met, there should be some form of validation in the aggregator or the SQL function that errors out when those pre-conditions are not met.
  2. A test run is failing, probably because the test cases don't take the behaviour for nulls into account. Can you fix those as well?

@LakshSingla
Contributor

Hi, @sduffey-partnerize! Did you make progress on the PR?
Feel free to reach out to me in case you have any doubts regarding the comments!


github-actions bot commented Feb 9, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions bot added the stale label Feb 9, 2024

github-actions bot commented Mar 8, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Mar 8, 2024