-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move scan-query from a contrib extension into core. #4751
Conversation
Based on a proposal at: https://groups.google.com/d/topic/druid-development/ME_OatUDnbk/discussion This patch also adds support for virtual columns to the Scan query, and updates Druid SQL to use Scan instead of Select. This patch also makes some behavioral changes to handling of the __time column. In particular, it is now is returned as "__time" rather than "timestamp"; it is no longer included if you do not specifically ask for it in your "columns"; and it is returned as a long rather than a string. Users can revert time handling to the legacy extension behavior by setting "legacy" : true in their queries, or setting the property druid.query.scan.legacy = true. This is meant to provide a migration path for users that were formerly using the contrib extension.
@kaijianding, as the original author of #3307 would you like to review this? |
will review tommorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through the PR, everything fine except the removal of select query from calcite rule part. I think we should still keep select query for some specific case, like order by __time desc
and limit 10 offset 20
pagination
@@ -19,6 +20,12 @@ Select queries return raw Druid rows and support pagination. | |||
} | |||
``` | |||
|
|||
<div class="note info"> | |||
Consider using the [Scan query](scan-query.html) instead of the Select query if you don't need the strict time-ascending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about mention scan-query doesn't support pagination?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll mention it.
selectProjection != null ? VirtualColumns.create(selectProjection.getVirtualColumns()) : VirtualColumns.EMPTY, | ||
ScanQuery.RESULT_FORMAT_COMPACTED_LIST, | ||
0, | ||
limitSpec == null || limitSpec.getLimit() == Integer.MAX_VALUE ? 0 : limitSpec.getLimit(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this 'limit' in ScanQuery is long type, do we also need change limitSpec.getLimit() to long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't make this change in this patch, since there's code elsewhere that assumes Integer.MAX_VALUE
means "no limit". Also Calcite treats limit as int in a few places anyway. So some more care would be needed to migrate that. I'll add a comment though.
if (orderBys.isEmpty() || | ||
(orderBys.size() == 1 && orderBys.get(0).getDimension().equals(Column.TIME_COLUMN_NAME))) { | ||
// Scan query can handle limiting but not sorting, so avoid applying this rule if there is a sort. | ||
if (limitSpec.getColumns().isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how a sql to retrieve the latest 10 records is handled after druid select query is totally removed from rules?
select a,b,c from datasource order by __time desc limit 10
This sql can be translated to druid select query with descending=true and should be fast if limit is a small number.
If we don't handle this kind of sql here, what kind of druid query is used after we bypass this rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch would have made order by time impossible. I guess I could add the "select" query back, just so this kind of query is possible. It will take some more time.
Updated the patch based on your comments @kaijianding. Thank you for reviewing. |
This patch looks good to me. Please fix the conflicts. |
@jihoonson I fixed the conflicts. |
@@ -341,7 +344,7 @@ public RelDataType getRowType() | |||
|
|||
/** | |||
* Return this query as some kind of Druid query. The returned query will either be {@link TopNQuery}, | |||
* {@link TimeseriesQuery}, {@link GroupByQuery}, or {@link SelectQuery}. | |||
* {@link TimeseriesQuery}, {@link GroupByQuery}, or {@link ScanQuery}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still keep link to SelectQuery and add ScanQuery
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is good to me now
Based on a proposal at: https://groups.google.com/d/topic/druid-development/ME_OatUDnbk/discussion
This patch also adds support for virtual columns to the Scan query,
and updates Druid SQL to use Scan instead of Select.
This patch also makes some behavioral changes to handling of the __time
column. In particular, it is now is returned as "__time" rather than
"timestamp"; it is no longer included if you do not specifically ask for
it in your "columns"; and it is returned as a long rather than a string.
Users can revert time handling to the legacy extension behavior by
setting "legacy" : true in their queries, or setting the property
druid.query.scan.legacy = true. This is meant to provide a migration
path for users that were formerly using the contrib extension.