
Introduce SQL interface for distinct count extension #13927

Conversation

sduffey-partnerize

Description

Introduce a SQL interface for the distinctcount extension, via a new function SEGMENT_DISTINCT.

Added calcite and druid-sql as dependencies of distinctcount, then introduced SegmentDistinctSqlAggregator, an implementation of Calcite's SqlAggregator.

Need some direction on documentation. For example, would we want to see the SQL equivalents of the examples that already exist here? Anything else?

Release note

New: You can now use distinct count in a SQL query with the SEGMENT_DISTINCT function.
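
For illustration, a query using the new function might look like the following (a sketch only; the datasource and column names are made up for this example and are not taken from the PR):

-- Hypothetical datasource "visits" with a "user_id" column.
SELECT
  FLOOR(__time TO DAY) AS "day",
  SEGMENT_DISTINCT(user_id) AS distinct_users
FROM visits
GROUP BY 1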


Key changed/added classes in this PR
  • org.apache.druid.query.aggregation.distinctcount.sql.SegmentDistinctSqlAggregator

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

}
if (lhs == null) {
  return ((Number) rhs).longValue();
}
return ((Number) lhs).longValue() + ((Number) rhs).longValue();
Contributor

This change makes combine no longer work on nulls; was that not needed for some reason?
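
For reference, a null-tolerant combine along the lines of the previous behaviour might look like this (a sketch only, not the exact code from the extension):

// Sketch: fall back to the non-null operand when one side is null,
// otherwise add the two partial counts together.
if (rhs == null) {
  return lhs;
}
if (lhs == null) {
  return rhs;
}
return ((Number) lhs).longValue() + ((Number) rhs).longValue();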

Author

Reverted

OperandTypes.ANY,
OperandTypes.and(
    OperandTypes.sequence(SIGNATURE, OperandTypes.ANY, OperandTypes.LITERAL),
    OperandTypes.family(SqlTypeFamily.ANY, SqlTypeFamily.STRING)
Contributor

I don't see the LITERAL STRING argument being used in the function body. Is that intentional?

Author

We had a look back at some other classes that extend SqlAggFunction, particularly ApproxCountDistinctSqlAggFunction, and noticed that it doesn't take the bitmap factory argument. So we decided to simplify SEGMENT_DISTINCT in the same way. Is that OK?

final ColumnType inputType = Calcites.getColumnTypeForRelDataType(dataType);

if (inputType == null) {
  throw new ISE(
Contributor

You should use org.apache.druid.sql.calcite.planner.UnsupportedSQLQueryException instead of ISE. Please refer to the class documentation for why the former is preferred.
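
For example, the throw could become something along these lines (a sketch; the message text is illustrative, and it assumes UnsupportedSQLQueryException accepts a format string plus arguments in the same style as ISE):

// Sketch: report the unsupported input type through the SQL planner's
// dedicated exception rather than a generic ISE.
throw new UnsupportedSQLQueryException(
    "Cannot translate input of type [%s] for SEGMENT_DISTINCT",
    dataType.getSqlTypeName()
);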

Contributor

Also, can you please explain why this inputType check is required? If we don't create the dimensionSpec below (as mentioned in another comment of mine), we probably won't run into an error with inputType being null in this code.
Would nullity of inputType cause any issue in the aggregation, and if so, can you please add a comment explaining it?

@@ -45,6 +45,7 @@ public void aggregate()
IndexedInts row = selector.getRow();
for (int i = 0, rowSize = row.size(); i < rowSize; i++) {
int index = row.get(i);

Contributor

nit: We can revert this change

  dimensionSpec = new DefaultDimensionSpec(virtualColumnName, null, inputType);
}

aggregatorFactory = new DistinctCountAggregatorFactory(name, dimensionSpec.getDimension(), null);
Contributor

Seems slightly counter-intuitive that we are creating a dimension spec in the above cases just to get dimensionSpec.getDimension() while creating the final aggregator.

Instead of Line#116, can we do dimensionName = columnArg.getSimpleExtraction().getColumn() (since it's a direct column access), and at Line#122 do dimensionName = virtualColumnName, then pass that to the aggregator factory?
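
A rough sketch of that suggestion (variable names follow the PR's snippets above; the exact surrounding code, including the direct-column-access check, is assumed):

// Sketch: resolve a dimension name directly instead of building a
// DefaultDimensionSpec only to call getDimension() on it.
final String dimensionName;
if (columnArg.isDirectColumnAccess()) {
  // direct reference to an existing column
  dimensionName = columnArg.getSimpleExtraction().getColumn();
} else {
  // otherwise use the virtual column registered for the expression
  dimensionName = virtualColumnName;
}
aggregatorFactory = new DistinctCountAggregatorFactory(name, dimensionName, null);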

Author

Thanks for the feedback, will try that out!

Contributor
@LakshSingla left a comment

After re-reviewing the PR, here are a few high-level comments:

  1. Since the extension only works when certain pre-conditions are met, there should be some form of validation in the aggregator or the SQL function that errors out when those pre-conditions are not met.
  2. A test run is failing, probably because the test cases don't take the behaviour for nulls into account. Can you fix those as well?

@LakshSingla
Contributor

Hi, @sduffey-partnerize! Did you make progress on the PR?
Feel free to reach out to me in case you have any doubts regarding the comments!


github-actions bot commented Feb 9, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions bot added the stale label Feb 9, 2024

github-actions bot commented Mar 8, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Mar 8, 2024