SQL support for joins on subqueries. #9545
Conversation
Changes to SQL module:

- DruidJoinRule: Allow joins on subqueries (left/right are no longer required to be scans or mappings).
- DruidJoinRel: Add cost estimation code for joins on subqueries.
- DruidSemiJoinRule, DruidSemiJoinRel: Removed, since DruidJoinRule can handle this case now.
- DruidRel: Remove Nullable annotation from toDruidQuery, because it is no longer needed (it was used by DruidSemiJoinRel).
- Update Rules constants to reflect new rules available in our current version of Calcite. Some of these are useful for optimizing joins on subqueries.
- Rework cost estimation to be in terms of cost per row, and place all relevant constants in CostEstimates.

Other changes:

- RowBasedColumnSelectorFactory: Don't set hasMultipleValues. The lack of isComplete is enough to let callers know that columns might have multiple values, and explicitly setting it to true causes ExpressionSelectors to think it definitely has multiple values and treat the inputs as arrays. This behavior interfered with some of the new tests that involved queries on lookups.
- QueryContexts: Add maxSubqueryRows parameter, and use it in druid-sql tests.
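As a rough illustration of the new context parameter, here is a hypothetical sketch of reading `maxSubqueryRows` from a query context map with a fallback default. The class name, method signature, and default value are assumptions for illustration, not Druid's actual QueryContexts implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class SubqueryLimits
{
  // Hypothetical key name; mirrors the maxSubqueryRows context parameter
  // described in this PR, but is not the actual Druid QueryContexts code.
  private static final String MAX_SUBQUERY_ROWS_KEY = "maxSubqueryRows";

  // Read the limit from a query context map, falling back to a default
  // when the caller did not set it explicitly.
  static int getMaxSubqueryRows(final Map<String, Object> context, final int defaultValue)
  {
    final Object value = context.get(MAX_SUBQUERY_ROWS_KEY);
    return value == null ? defaultValue : ((Number) value).intValue();
  }

  public static void main(String[] args)
  {
    final Map<String, Object> context = new HashMap<>();
    System.out.println(getMaxSubqueryRows(context, 100000)); // default applies
    context.put(MAX_SUBQUERY_ROWS_KEY, 5);
    System.out.println(getMaxSubqueryRows(context, 100000)); // explicit value wins
  }
}
```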
@@ -1632,6 +1632,10 @@ The Druid SQL server is configured through the following properties on the Broker:

|`druid.sql.planner.sqlTimeZone`|Sets the default time zone for the server, which will affect how time functions and timestamp literals behave. Should be a time zone name like "America/Los_Angeles" or offset like "-08:00".|UTC|
|`druid.sql.planner.serializeComplexValues`|Whether to serialize "complex" output values, false will return the class name instead of the serialized value.|true|

> Previous versions of Druid had properties named `druid.sql.planner.maxQueryCount` and `druid.sql.planner.maxSemiJoinRowsInMemory`.
> These properties are no longer available. Since Druid 0.18.0, you can use `druid.server.http.maxSubqueryRows` to control the maximum
> number of rows permitted across all subqueries.
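For example, the limit could be set in the Broker's `runtime.properties`; the value shown here is purely illustrative, not a recommended default:

```properties
druid.server.http.maxSubqueryRows=100000
```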
Maybe worth mentioning that there is no replacement for `maxQueryCount`, and why.
Should `maxSubqueryRows` be added to the above table instead of `maxQueryCount` and `maxSemiJoinRowsInMemory`?
I don't think so, since it's not a SQL parameter anymore. It should be added to the general Broker table though. I'll do that.
@@ -949,6 +947,10 @@ The Druid SQL server is configured through the following properties on the Broker:

|`druid.sql.planner.metadataSegmentCacheEnable`|Whether to keep a cache of published segments in broker. If true, broker polls coordinator in background to get segments from metadata store and maintains a local cache. If false, coordinator's REST API will be invoked when broker needs published segments info.|false|
|`druid.sql.planner.metadataSegmentPollPeriod`|How often to poll coordinator for published segments list if `druid.sql.planner.metadataSegmentCacheEnable` is set to true. Poll period is in milliseconds.|60000|

> Previous versions of Druid had properties named `druid.sql.planner.maxQueryCount` and `druid.sql.planner.maxSemiJoinRowsInMemory`.
> These properties are no longer available. Since Druid 0.18.0, you can use `druid.server.http.maxSubqueryRows` to control the maximum
> number of rows permitted across all subqueries.
Same here. Should `maxSubqueryRows` be added to the above table?
Not here, but it should be added to the other table. I'll add it.
private static final double COST_GROUPING_MULTIPLIER = 0.5;
private static final double COST_LIMIT_MULTIPLIER = 0.5;
private static final double COST_HAVING_MULTIPLIER = 5.0;
static final double COST_BASE = 1.0;
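To make the per-row costing concrete, here is a minimal sketch of how multipliers like these could combine into a cost estimate. The method and the way the multipliers compose are assumptions for illustration, not the actual DruidJoinRel/CostEstimates logic:

```java
public class CostSketch
{
  static final double COST_BASE = 1.0;
  static final double COST_GROUPING_MULTIPLIER = 0.5;
  static final double COST_LIMIT_MULTIPLIER = 0.5;
  static final double COST_HAVING_MULTIPLIER = 5.0;

  // Illustrative per-row cost: start from a base cost, scale it by a
  // multiplier for each feature the query uses, then multiply by row count.
  static double estimateCost(
      final double rowCount,
      final boolean grouping,
      final boolean limit,
      final boolean having
  )
  {
    double multiplier = COST_BASE;
    if (grouping) {
      multiplier *= COST_GROUPING_MULTIPLIER;
    }
    if (limit) {
      multiplier *= COST_LIMIT_MULTIPLIER;
    }
    if (having) {
      multiplier *= COST_HAVING_MULTIPLIER;
    }
    return rowCount * multiplier;
  }
}
```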
Unused variable.
Thanks, I'll remove it.
@@ -71,17 +71,14 @@ public boolean isValidDruidQuery()

* Convert this DruidRel to a DruidQuery. This may be an expensive operation. For example, DruidSemiJoin needs to
This doesn't seem right, since we don't have `DruidSemiJoin` anymore.
I forgot to update this javadoc! Updated.
@@ -71,17 +71,14 @@ public boolean isValidDruidQuery()

* Convert this DruidRel to a DruidQuery. This may be an expensive operation. For example, DruidSemiJoin needs to
* execute the right-hand side query in order to complete this method.
*
* This method may return null if it knows that this rel will yield an empty result set.
* This method may not return null.
`This method does not return null`?
Changed it to "This method must not return null."
)
)
.aggregators(aggregators(new CountAggregatorFactory("a0")))
.context(TIMESERIES_CONTEXT_DEFAULT)
.build()
) : ImmutableList.of(),
useDefault ? ImmutableList.of() : ImmutableList.of(new Object[]{0L}),
ImmutableList.of(),
This doesn't seem right... I guess it should always be 0. Based on the expected results before this PR, it seems that a timeseries query was issued when `useDefault = true`, which returns an empty result. When `useDefault = false`, no query was issued, so the query computation was probably done by Calcite.

So I guess there are two potential issues here. One is the timeseries query returning an empty result, and the other is Calcite returning an empty result where it used to return a valid result. We may want to fix the Calcite issue in this PR. The timeseries issue seems like a bug, but since it wasn't introduced in this PR and we are about to cut the branch for 0.18, I'm OK with merging this PR and backporting the bug fix later.
Your analysis sounds right to me. I tracked this down to the removal of AggregateMergeRule, which I did because it was causing one of the new tests to fail (testDoubleNestedGroupBy2). Without that rule loaded, this query gets executed by a BindableAggregate, which doesn't seem to properly generate the expected row.
Since this is a situation where two bugs are conflicting, I think it might take some time to unwind. I'd rather look into that in a future patch. My preference would be to do this:
- Identify and fix the bug with AggregateMergeRule, and re-enable it. That should get us back to normal on this test.
- After that, stop using Bindable completely, and instead use native Druid queries for everything (even queries on literal values). We can do this by adding a few more capabilities to native Druid queries. Then we won't have to worry about it.
Sounds good to me.
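For context on why the `new Object[]{0L}` expectation looks right: in standard SQL, a global aggregate with no GROUP BY always produces exactly one output row, so COUNT(*) over zero matching rows yields a single row containing 0, not an empty result set. A minimal plain-Java illustration of that semantic (not Druid code):

```java
import java.util.Collections;
import java.util.List;

public class CountSemantics
{
  // Model a global COUNT(*) with no GROUP BY: it emits exactly one row,
  // even when the input is empty.
  static List<Object[]> globalCount(final List<Object[]> inputRows)
  {
    return Collections.singletonList(new Object[]{(long) inputRows.size()});
  }

  public static void main(String[] args)
  {
    final List<Object[]> result = globalCount(Collections.emptyList());
    System.out.println(result.size());    // one output row
    System.out.println(result.get(0)[0]); // the count, 0
  }
}
```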
testQuery(
    "SELECT COUNT(*) FROM druid.foo WHERE dim2 = 'a' and not (dim1 > 'a' OR dim1 < 'b')",
    ImmutableList.of(),
    ImmutableList.of(new Object[]{0L})
    ImmutableList.of()
Seems like the same Calcite issue.
Yeah, I think it is.
LGTM. +1 after CI
Changes to SQL module:

- DruidJoinRule: Allow joins on subqueries (left/right are no longer required to be scans or mappings).
- DruidJoinRel: Add cost estimation code for joins on subqueries.
- DruidSemiJoinRule, DruidSemiJoinRel: Removed, since DruidJoinRule can handle this case now.
- DruidRel: Remove Nullable annotation from toDruidQuery, because it is no longer needed (it was used by DruidSemiJoinRel).
- Update Rules constants to reflect new rules available in our current version of Calcite. Some of these are useful for optimizing joins on subqueries.
- Rework cost estimation to be in terms of cost per row, and place all relevant constants in CostEstimates.
- Remove maxQueryCount and maxSemiJoinRowsInMemory from the SQL layer, since the responsibility for handling subqueries has moved down into the native query layer. There's no replacement for maxQueryCount (I don't think it's necessary). The replacement for maxSemiJoinRowsInMemory is maxSubqueryRows.

Other changes:

- RowBasedColumnSelectorFactory: Don't set hasMultipleValues. The lack of isComplete is enough to let callers know that columns might have multiple values, and explicitly setting it to true causes ExpressionSelectors to think it definitely has multiple values, and treat the inputs as arrays. This behavior interfered with some of the new tests that involved queries on lookups.
- QueryContexts: Add maxSubqueryRows parameter, and use it in druid-sql tests.