SQL support for joins on subqueries. #9545

Merged (3 commits) on Mar 22, 2020
Conversation

@gianm (Contributor) commented Mar 20, 2020

Changes to SQL module:

  • DruidJoinRule: Allow joins on subqueries (left/right are no longer
    required to be scans or mappings).
  • DruidJoinRel: Add cost estimation code for joins on subqueries.
  • DruidSemiJoinRule, DruidSemiJoinRel: Removed, since DruidJoinRule can
    handle this case now.
  • DruidRel: Remove Nullable annotation from toDruidQuery, because
    it is no longer needed (it was used by DruidSemiJoinRel).
  • Update Rules constants to reflect new rules available in our current
    version of Calcite. Some of these are useful for optimizing joins on
    subqueries.
  • Rework cost estimation to be in terms of cost per row, and place all
    relevant constants in CostEstimates.
  • Remove maxQueryCount and maxSemiJoinRowsInMemory properties
    from the SQL layer, since the responsibility for handling subqueries has
    moved down into the native query layer. There's no replacement for
    maxQueryCount (I don't think it's necessary). The replacement for
    maxSemiJoinRowsInMemory is maxSubqueryRows.

Other changes:

  • RowBasedColumnSelectorFactory: Don't set hasMultipleValues. The lack
    of isComplete is enough to let callers know that columns might have
    multiple values, and explicitly setting it to true causes
    ExpressionSelectors to think it definitely has multiple values, and
    treat the inputs as arrays. This behavior interfered with some of the
    new tests that involved queries on lookups.
  • QueryContexts: Add maxSubqueryRows parameter, and use it in druid-sql
    tests.
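For instance, with DruidJoinRule no longer requiring each side of a join to be a plain scan or mapping, a query where both sides are subqueries can now be planned. An illustrative sketch (the `druid.foo` table and `dim1`/`dim2` columns are the ones used in this PR's tests):

```sql
-- Illustrative only: both join inputs are subqueries, which the
-- reworked DruidJoinRule can now handle.
SELECT a.dim1, b.cnt
FROM (SELECT dim1, dim2 FROM druid.foo WHERE dim2 = 'a') a
JOIN (SELECT dim1, COUNT(*) AS cnt FROM druid.foo GROUP BY dim1) b
  ON a.dim1 = b.dim1
```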

@@ -1632,6 +1632,10 @@ The Druid SQL server is configured through the following properties on the Broke
|`druid.sql.planner.sqlTimeZone`|Sets the default time zone for the server, which will affect how time functions and timestamp literals behave. Should be a time zone name like "America/Los_Angeles" or offset like "-08:00".|UTC|
|`druid.sql.planner.serializeComplexValues`|Whether to serialize "complex" output values, false will return the class name instead of the serialized value.|true|

> Previous versions of Druid had properties named `druid.sql.planner.maxQueryCount` and `druid.sql.planner.maxSemiJoinRowsInMemory`.
> These properties are no longer available. Since Druid 0.18.0, you can use `druid.server.http.maxSubqueryRows` to control the maximum
> number of rows permitted across all subqueries.
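Since the limit is now enforced in the native query layer, it is set as a Broker runtime property rather than a planner property. A sketch (the value below is illustrative, not a recommendation):

```properties
# Broker runtime.properties (illustrative value): cap the total number
# of rows materialized across all subqueries of a single query.
druid.server.http.maxSubqueryRows=100000
```

The PR also adds a `maxSubqueryRows` query context parameter (see the QueryContexts change above), which should allow adjusting this limit on a per-query basis.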
Contributor: Maybe it's worth mentioning that there is no replacement for maxQueryCount, and why.

Contributor: Should maxSubqueryRows be added to the above table instead of maxQueryCount and maxSemiJoinRowsInMemory?

Contributor (Author): I don't think so, since it's not a SQL parameter anymore. It should be added to the general Broker table, though. I'll do that.

@@ -949,6 +947,10 @@ The Druid SQL server is configured through the following properties on the Broke
|`druid.sql.planner.metadataSegmentCacheEnable`|Whether to keep a cache of published segments in broker. If true, broker polls coordinator in background to get segments from metadata store and maintains a local cache. If false, coordinator's REST API will be invoked when broker needs published segments info.|false|
|`druid.sql.planner.metadataSegmentPollPeriod`|How often to poll coordinator for published segments list if `druid.sql.planner.metadataSegmentCacheEnable` is set to true. Poll period is in milliseconds. |60000|

> Previous versions of Druid had properties named `druid.sql.planner.maxQueryCount` and `druid.sql.planner.maxSemiJoinRowsInMemory`.
> These properties are no longer available. Since Druid 0.18.0, you can use `druid.server.http.maxSubqueryRows` to control the maximum
> number of rows permitted across all subqueries.
Contributor: Same here. Should maxSubqueryRows be added to the above table?

Contributor (Author): Not here, but it should be added to the other table. I'll add it.

private static final double COST_GROUPING_MULTIPLIER = 0.5;
private static final double COST_LIMIT_MULTIPLIER = 0.5;
private static final double COST_HAVING_MULTIPLIER = 5.0;
static final double COST_BASE = 1.0;
Contributor: Unused variable.

Contributor (Author): Thanks, I'll remove it.
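The constants above feed the per-row cost model that this PR introduces. A minimal, self-contained sketch of how such multipliers might combine (the class and the combination logic here are illustrative, not Druid's actual CostEstimates code; only the constant values are taken from the diff):

```java
// Illustrative sketch of a per-row cost model in the style of the PR's
// CostEstimates constants. Not the actual Druid implementation.
public class CostSketch {
    static final double COST_BASE = 1.0;
    static final double COST_GROUPING_MULTIPLIER = 0.5;
    static final double COST_LIMIT_MULTIPLIER = 0.5;
    static final double COST_HAVING_MULTIPLIER = 5.0;

    /** Total cost = row count * base per-row cost * feature multipliers. */
    static double cost(double rows, boolean grouping, boolean limit, boolean having) {
        double perRow = COST_BASE;
        if (grouping) perRow *= COST_GROUPING_MULTIPLIER;
        if (limit)    perRow *= COST_LIMIT_MULTIPLIER;
        if (having)   perRow *= COST_HAVING_MULTIPLIER;
        return rows * perRow;
    }

    public static void main(String[] args) {
        // Grouping and limit each halve the per-row cost; having raises it.
        System.out.println(cost(1000, false, false, false)); // 1000.0
        System.out.println(cost(1000, true, true, false));   // 250.0
    }
}
```

Expressing cost per row this way lets the planner compare alternative plans by multiplying each operator's per-row factor by its estimated row count, rather than juggling unrelated absolute constants.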

@@ -71,17 +71,14 @@ public boolean isValidDruidQuery()
* Convert this DruidRel to a DruidQuery. This may be an expensive operation. For example, DruidSemiJoin needs to
Contributor: This doesn't seem right since we don't have DruidSemiJoin anymore.

Contributor (Author): I forgot to update this javadoc! Updated.

@@ -71,17 +71,14 @@ public boolean isValidDruidQuery()
* Convert this DruidRel to a DruidQuery. This may be an expensive operation. For example, DruidSemiJoin needs to
* execute the right-hand side query in order to complete this method.
*
* This method may return null if it knows that this rel will yield an empty result set.
* This method may not return null.
Contributor: This method does not return null?

Contributor (Author): Changed it to "This method must not return null."

)
)
.aggregators(aggregators(new CountAggregatorFactory("a0")))
.context(TIMESERIES_CONTEXT_DEFAULT)
.build()
) : ImmutableList.of(),
useDefault ? ImmutableList.of() : ImmutableList.of(new Object[]{0L}),
ImmutableList.of(),
Contributor: This doesn't seem right; I guess it should always be 0.

Based on the expected results before this PR, it seems that a timeseries query was issued when useDefault = true, which returns an empty result. When useDefault = false, no query was issued, so the computation was probably done by Calcite.

So, I guess there are two potential issues here. One is the timeseries query returning an empty result, and the other is Calcite returning an empty result where it used to return a valid one. We may want to fix the Calcite issue in this PR. The timeseries issue seems like a bug, but since it wasn't introduced in this PR and we are about to cut the branch for 0.18, I'm OK with merging this PR and backporting the bug fix later.

Contributor (Author): Your analysis sounds right to me. I tracked this down to the removal of AggregateMergeRule, which I did because it was causing one of the new tests to fail (testDoubleNestedGroupBy2). Without that rule loaded, this query gets executed by a BindableAggregate, which doesn't seem to properly generate the expected row.

Since this is a situation where two bugs are conflicting, I think it might take some time to unwind. I'd rather look into that in a future patch. My preference would be to:

  1. Identify and fix the bug with AggregateMergeRule, and re-enable it. That should get us back to normal on this test.
  2. After that, stop using Bindable completely, and instead use native Druid queries for everything (even queries on literal values). We can do this by adding a few more capabilities to native Druid queries. Then we won't have to worry about it.

Contributor: Sounds good to me.

testQuery(
"SELECT COUNT(*) FROM druid.foo WHERE dim2 = 'a' and not (dim1 > 'a' OR dim1 < 'b')",
ImmutableList.of(),
ImmutableList.of(new Object[]{0L})
ImmutableList.of()
Contributor: Seems like the same Calcite issue.

Contributor (Author): Yeah, I think it is.
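Why the expected result should be a single row containing 0: the test query's filter, NOT (dim1 > 'a' OR dim1 < 'b'), simplifies to dim1 <= 'a' AND dim1 >= 'b', which no string can satisfy, so COUNT(*) over zero matching rows should yield 0 rather than an empty result set. A standalone check of the predicate (illustrative code, not part of the PR; the sample values are arbitrary):

```java
import java.util.List;

// Illustrative check that NOT (dim1 > 'a' OR dim1 < 'b') is
// unsatisfiable: it requires dim1 <= 'a' and dim1 >= 'b' at once.
public class PredicateSketch {
    static boolean matches(String dim1) {
        return !(dim1.compareTo("a") > 0 || dim1.compareTo("b") < 0);
    }

    public static void main(String[] args) {
        List<String> sample = List.of("", "a", "abc", "b", "def", "10.1");
        long count = sample.stream().filter(PredicateSketch::matches).count();
        System.out.println(count); // 0 -- no value satisfies the predicate
    }
}
```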

@jihoonson (Contributor) left a review: LGTM. +1 after CI

@gianm gianm merged commit 54c9325 into apache:master Mar 22, 2020
@gianm gianm deleted the joins-sql-subqueries branch March 22, 2020 23:44
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020