Add join-related DataSource types, and analysis functionality. #9235
Conversation
Builds on apache#9111 and implements the datasource analysis mentioned in apache#8728. Still can't handle join datasources, but we're a step closer.

Join-related DataSource types:
1) Add "join", "lookup", and "inline" datasources.
2) Add "getChildren" and "withChildren" methods to DataSource, which will be used in the future for query rewriting (e.g. inlining of subqueries).

DataSource analysis functionality:
1) Add DataSourceAnalysis class, which breaks down datasources into three components: outer queries, a base datasource (left-most of the highest level left-leaning join tree), and other joined-in leaf datasources (the right-hand branches of the left-leaning join tree).
2) Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to support analysis.

Other notes:
1) Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer. Also, made it a Set, so implementations don't need to worry about duplicates.
2) The addition of "isCacheable" should work around apache#8713, since UnionDataSource now returns false for cacheability.
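For orientation, the DataSource surface described above might look roughly like the sketch below. The method names come from the PR description; the simplified interface shape and the SimpleTableDataSource stand-in are illustrative, not Druid's actual code.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Sketch of the DataSource methods described in the PR description.
// Only the method names are from the PR; everything else is illustrative.
interface DataSource
{
  Set<String> getTableNames();                       // renamed from getNames; a Set, so no duplicates
  List<DataSource> getChildren();                    // for future query rewriting (e.g. subquery inlining)
  DataSource withChildren(List<DataSource> children);
  boolean isConcrete();                              // can be scanned directly by the query stack
  boolean isGlobal();                                // available on every server
  boolean isCacheable();                             // false for union datasources (apache#8713)
}

// Minimal hypothetical leaf implementation, loosely modeled on a table datasource.
class SimpleTableDataSource implements DataSource
{
  private final String name;

  SimpleTableDataSource(String name) { this.name = name; }

  @Override public Set<String> getTableNames() { return Collections.singleton(name); }
  @Override public List<DataSource> getChildren() { return Collections.emptyList(); }

  @Override public DataSource withChildren(List<DataSource> children)
  {
    if (!children.isEmpty()) {
      throw new IllegalArgumentException("Cannot accept children");
    }
    return this;
  }

  @Override public boolean isConcrete() { return true; }
  @Override public boolean isGlobal() { return false; }
  @Override public boolean isCacheable() { return true; }
}

public class DataSourceSketch
{
  public static void main(String[] args)
  {
    DataSource table = new SimpleTableDataSource("wikipedia");
    System.out.println(table.getTableNames());  // [wikipedia]
    System.out.println(table.isCacheable());    // true
  }
}
```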
   && strategy != null
   && cacheConfig.isUseResultLevelCache()
-  && cacheConfig.isQueryCacheable(query);
+  && cacheConfig.isQueryCacheable(query)
+  && query.getDataSource().isCacheable();
Similar comment to the one I left in CacheUtil.
   && strategy != null
   && cacheConfig.isUseCache()
-  && cacheConfig.isQueryCacheable(query);
+  && cacheConfig.isQueryCacheable(query)
+  && query.getDataSource().isCacheable();
Perhaps these four lines should be extracted into a helper method, since they're duplicated in populateResultLevelCache(). Is there a test that is affected by the addition of isCacheable(), or should one be added?
Hmm, after thinking about this a bit IMO it is best to combine ResultLevelCacheUtil and CacheUtil, then make the change you suggested (although I think we can only extract three lines), and then add tests for CacheUtil. I'll add that to this patch in a bit.
Or maybe not combining them in this patch, but at least doing the extraction.
@ccaominh I decided to combine them after all, and add a CacheUtil test that tests the newly extracted isQueryCacheable method.
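The extracted helper discussed above might look something like this sketch. FakeQuery and FakeCacheConfig are hypothetical stand-ins for Druid's real Query and CacheConfig classes; only the three combined conditions mirror the diff.

```java
// Illustrative stand-ins for Druid's query/cache classes (not the real API).
class FakeQuery
{
  final boolean dataSourceCacheable;

  FakeQuery(boolean dataSourceCacheable) { this.dataSourceCacheable = dataSourceCacheable; }
}

class FakeCacheConfig
{
  boolean isQueryCacheable(FakeQuery query) { return true; }
}

public class CacheUtilSketch
{
  // The three shared conditions from the diffs above, pulled into one helper so the
  // per-cache checks (isUseCache / isUseResultLevelCache) stay at the call sites.
  static boolean isQueryCacheable(FakeQuery query, Object strategy, FakeCacheConfig cacheConfig)
  {
    return strategy != null
           && cacheConfig.isQueryCacheable(query)
           && query.dataSourceCacheable;
  }

  public static void main(String[] args)
  {
    FakeCacheConfig cfg = new FakeCacheConfig();
    System.out.println(isQueryCacheable(new FakeQuery(true), new Object(), cfg));   // true
    System.out.println(isQueryCacheable(new FakeQuery(false), new Object(), cfg));  // false: datasource not cacheable
    System.out.println(isQueryCacheable(new FakeQuery(true), null, cfg));           // false: no cache strategy
  }
}
```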
 * non-Jackson callers should use {@link #fromIterable}.
 */
@JsonCreator
public static InlineDataSource fromJson(
Should this be private to match the javadoc?
Oops, yes, it should. I'll fix it.
@Override
public String toString()
{
  // Don't include 'rows' in stringificatione, because it might be long and/or lazy.
Typo: stringificatione -> stringification
Is it worth adding a test to enforce this?
Good idea, I'll add one (and fix the typo).
/**
 * Utility methods for working with {@link Joinable} related classes.
 */
public class Joinables
Do you want to add unit tests for these methods?
Good idea, I'll add some.
/**
 * Analysis of a datasource for purposes of deciding how to execute a particular query.
 *
 * The analysis breaks a datasource down in the following way:
 *
 * <pre>
 *
 *                            Q  <-- Possible outer query datasource(s) [may be multiple stacked]
 *                            |
 *                            J  <-- Possible join tree, expected to be left-leaning
 *                           / \
 *                          J  Dj <-- Other leaf datasources
 *   Base datasource       / \       which will be joined
 *  (bottom-leftmost) --> Db  Dj <---- into the base datasource
 *
 * </pre>
 *
 * The base datasource (Db) is returned by {@link #getBaseDataSource()}. The other leaf datasources are returned by
 * {@link #getPreJoinableClauses()}. The outer query datasources are available as part of {@link #getDataSource()},
 * which just returns the original datasource that was provided for analysis.
 *
 * The base datasource (Db) will never be a join, but it can be any other type of datasource (table, query, etc).
 * Note that join trees are only flattened if they occur at the top of the overall tree (or underneath an outer query),
 * and that join trees are only flattened to the degree that they are left-leaning. Due to these facts, it is possible
 * for the base or leaf datasources to include additional joins.
 *
 * The base datasource is the one that will be considered by the core Druid query stack for scanning via
 * {@link org.apache.druid.segment.Segment} and {@link org.apache.druid.segment.StorageAdapter}. The other leaf
 * datasources must be joinable onto the base data.
 *
 * The idea here is to keep things simple and dumb. So we focus only on identifying left-leaning join trees, which map
 * neatly onto a series of hash table lookups at query time. The user/system generating the queries, e.g. the druid-sql
 * layer (or the end user in the case of native queries), is responsible for containing the smarts to structure the
 * tree in a way that will lead to optimal execution.
 */
Nice javadoc!
ASCII art is what it's all about.
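The decomposition that the javadoc describes can be sketched as a walk down the left spine of the join tree, collecting the right-hand leaves until the base datasource is reached. The Node/Leaf/Join toy classes below are illustrative stand-ins, not Druid's JoinDataSource or DataSourceAnalysis.

```java
import java.util.ArrayList;
import java.util.List;

// Toy join-tree model for illustrating the analysis; not Druid's real classes.
abstract class Node {}

class Leaf extends Node
{
  final String name;

  Leaf(String name) { this.name = name; }
}

class Join extends Node
{
  final Node left;
  final Node right;

  Join(Node left, Node right) { this.left = left; this.right = right; }
}

public class AnalysisSketch
{
  // Returns the base (bottom-leftmost) leaf and fills 'clauses' with the
  // right-hand datasources that would be joined onto it, innermost first.
  static Leaf analyze(Node node, List<Node> clauses)
  {
    while (node instanceof Join) {
      Join join = (Join) node;
      clauses.add(0, join.right);  // prepend, so the innermost join's right side ends up first
      node = join.left;
    }
    return (Leaf) node;
  }

  public static void main(String[] args)
  {
    // ((Db JOIN Dj1) JOIN Dj2): left-leaning, as in the ASCII diagram above.
    Node tree = new Join(new Join(new Leaf("Db"), new Leaf("Dj1")), new Leaf("Dj2"));
    List<Node> clauses = new ArrayList<>();
    Leaf base = analyze(tree, clauses);
    System.out.println(base.name);  // Db
  }
}
```

A right-leaning tree like (Db JOIN (Dj1 JOIN Dj2)) would stop flattening at the first right-hand join, which is why the javadoc notes that leaf datasources may themselves contain joins.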
    analysis.getPreJoinableClauses()
  );
}
Do you want to add an EqualsVerifier test?
Yeah, I'll add one.
final QueryDataSource queryDataSource = new QueryDataSource(
    GroupByQuery.builder()
                .setDataSource(LOOKUP_LOOKYLOO)
                .setInterval(new MultipleIntervalSegmentSpec(Collections.singletonList(Intervals.of("2000/3000"))))
                .setGranularity(Granularities.ALL)
                .build()
);
Looks like you can use subquery(LOOKUP_LOOKYLOO) here instead?
You're right, I'll change it.
return new QueryDataSource(
    GroupByQuery.builder()
                .setDataSource(dataSource)
                .setInterval(new MultipleIntervalSegmentSpec(Collections.singletonList(Intervals.of("2000/3000"))))
Perhaps create a named constant for the interval since it's in the asserts for several of the new tests
Sure, sounds reasonable.
high level lgtm 👍, inline datasource looks like an avenue to make pretty much anything a druid query which sounds fun
public boolean isCacheable()
{
  // Disables result-level caching for 'union' datasources, which doesn't work currently.
  // See https://github.com/apache/druid/issues/8713 for reference.
👍 this seems like a good intermediary solution until we fix the issue for real
 * The rows are backed by an Iterable, which can be lazy or not. Lazy datasources will only be iterated if someone calls
 * {@link #getRows()} and iterates the result, or until someone calls {@link #getRowsAsList()}.
 */
public class InlineDataSource implements DataSource
🤘
         Objects.equals(granularity, baseQuery.granularity);
}

@Override
public int hashCode()
{
  return Objects.hash(dataSource, descending, context, querySegmentSpec, duration, granularity);
Curious about why this change. I know this isn't really part of your change, but I've also noticed that equals and hashCode for a lot of classes that extend BaseQuery have their own implementations and do not check the base class. Sounds like we should add some EqualsVerifier tests for them, maybe in another patch.
It's because duration is lazily computed, so it might be null. Calling getDuration() forces it to be computed. I'll add a comment.
Btw, a lot of the subclasses of BaseQuery do call super.equals at some point (but they do other stuff too).
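The pitfall being discussed can be sketched as follows: if a field is computed lazily, hashing the raw field can observe null before the computation runs, while hashing the getter forces the computation first, so equal objects always hash alike. LazyQuery is a toy class, not Druid's BaseQuery.

```java
import java.util.Objects;

// Toy illustration of hashing a lazily computed field via its getter.
public class LazyFieldSketch
{
  static class LazyQuery
  {
    private final long millis;
    private Long duration;  // computed on first access, may still be null

    LazyQuery(long millis) { this.millis = millis; }

    Long getDuration()
    {
      if (duration == null) {
        duration = millis;  // stand-in for a real, more expensive computation
      }
      return duration;
    }

    @Override
    public int hashCode()
    {
      // Hash getDuration(), not the raw 'duration' field: the getter forces the
      // value to exist, so a forced and an unforced instance hash the same.
      return Objects.hash(getDuration());
    }
  }

  public static void main(String[] args)
  {
    LazyQuery a = new LazyQuery(1000);
    LazyQuery b = new LazyQuery(1000);
    b.getDuration();  // force only one of them
    System.out.println(a.hashCode() == b.hashCode());  // true
  }
}
```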
Weird - I see the calls to super.equals() now, maybe I just shouldn't read code in the night 😝
EqualsVerifier is still complaining about some other stuff. But cleaning that up can be another issue.
looks cool! I'll look at DataSourceAnalysis more closely tomorrow when I'm a little less sleepy
@Override
public DataSource withChildren(List<DataSource> children)
{
  if (children.size() != dataSources.size()) {
why must these sizes match?
It's a precondition for the method, since its intent is for replacing inputs with inlined/etc versions, so the number of children shouldn't change. I'll add a comment in the interface.
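The precondition explained above can be sketched like this; SimpleUnion is a hypothetical stand-in for UnionDataSource, using strings in place of child datasources.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the withChildren precondition: children are replaced one-for-one
// (e.g. swapping a subquery for its inlined version), never added or removed.
public class WithChildrenSketch
{
  static class SimpleUnion
  {
    final List<String> dataSources;

    SimpleUnion(List<String> dataSources) { this.dataSources = dataSources; }

    SimpleUnion withChildren(List<String> children)
    {
      if (children.size() != dataSources.size()) {
        // Precondition: this method is for replacement, not restructuring.
        throw new IllegalArgumentException("Expected [" + dataSources.size() + "] children");
      }
      return new SimpleUnion(new ArrayList<>(children));
    }
  }

  public static void main(String[] args)
  {
    SimpleUnion union = new SimpleUnion(List.of("a", "b"));
    SimpleUnion replaced = union.withChildren(List.of("a2", "b2"));
    System.out.println(replaced.dataSources);  // [a2, b2]

    try {
      union.withChildren(Collections.singletonList("only-one"));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected");  // size mismatch is refused
    }
  }
}
```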
  }
}

public static boolean isPrefixedBy(final String columnName, final String prefix)
@NonNull annotations for the parameters?
I think I'd prefer @EverythingIsNonnullByDefault (which is how I prefer to write/read code usually anyway). We use it in a few other packages in Druid. Maybe we can add that here in a future patch?
Works for me! I haven't gotten in the habit of looking out for those annotations yet, but I love the idea of NonnullBy default!
{
  // Verify that toString does not iterate the rows.
  final String ignored = iterableDataSource.toString();
  Assert.assertEquals(0, iterationCounter.get());
I tried adding rows to toString and it doesn't iterate over rowsIterable (it prints an identifier for the lambda).
Yeah, that makes sense. I guess the behavior would depend on whatever the particular Iterable's toString method does. Do you think it makes sense to leave the test like this, or change it somehow?
I'm ok with leaving the test as it is for now.
Any suggestions for how to improve it in a follow-up?
+1
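The test pattern under discussion can be sketched with an Iterable that counts how many times it is iterated; Holder below is an illustrative stand-in for InlineDataSource, not the real class.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the "toString must not iterate rows" test: wrap the rows in a
// counting Iterable, call toString(), and check the counter stayed at zero.
public class LazyToStringSketch
{
  static class Holder
  {
    final Iterable<Object[]> rows;

    Holder(Iterable<Object[]> rows) { this.rows = rows; }

    @Override
    public String toString()
    {
      // Deliberately omit 'rows': it might be long and/or lazy.
      return "Holder{rows=<omitted>}";
    }
  }

  public static void main(String[] args)
  {
    AtomicInteger iterations = new AtomicInteger();
    Iterable<Object[]> counting = () -> {
      iterations.incrementAndGet();
      Iterator<Object[]> empty = Collections.<Object[]>emptyList().iterator();
      return empty;
    };

    String ignored = new Holder(counting).toString();
    System.out.println(iterations.get());  // 0: toString never touched the rows
  }
}
```

As noted above, this only guards against iteration; if the Iterable's own toString were ever invoked, the output would depend on that implementation (a lambda just prints an identifier).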
Builds on apache#9235, using the datasource analysis functionality to replace various ad-hoc approaches. The most interesting changes are in ClientQuerySegmentWalker (brokers), ServerManager (historicals), and SinkQuerySegmentWalker (indexing tasks).

Other changes related to improving how we analyze queries:
1) Changes TimelineServerView to return an Optional timeline, which I thought made the analysis changes cleaner to implement.
2) Added QueryToolChest#canPerformSubquery, which is now used by query entry points to determine whether it is safe to pass a subquery dataSource to the query toolchest. Fixes an issue introduced in apache#5471 where subqueries under non-groupBy-typed queries were silently ignored, since neither the query entry point nor the toolchest did anything special with them.
3) Removes the QueryPlus.withQuerySegmentSpec method, which was mostly being used in error-prone ways (ignoring any potential subqueries, and not verifying that the underlying data source is actually a table). Replaces with a new function, Queries.withSpecificSegments, that includes sanity checks.