Add join-related DataSource types, and analysis functionality. #9235

Merged
merged 5 commits into apache:master from joins-one-and-a-half on Jan 22, 2020

Conversation

@gianm (Contributor) commented Jan 21, 2020

Builds on #9111 and implements the datasource analysis mentioned in #8728. Still can't
handle join datasources, but we're a step closer.

Join-related DataSource types:

  1. Add "join", "lookup", and "inline" datasources.
  2. Add "getChildren" and "withChildren" methods to DataSource, which will be used
    in the future for query rewriting (e.g. inlining of subqueries).

DataSource analysis functionality:

  1. Add DataSourceAnalysis class, which breaks down datasources into three components:
    outer queries, a base datasource (left-most of the highest level left-leaning join
    tree), and other joined-in leaf datasources (the right-hand branches of the
    left-leaning join tree).
  2. Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to
    support analysis.

Other notes:

  1. Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer.
    Also, made it a Set, so implementations don't need to worry about duplicates.
  2. The addition of "isCacheable" should work around #8713 (improper result-level cache ETag
    handling for union datasources), since UnionDataSource now returns false for cacheability.
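
Taken together, the new DataSource surface described above might look roughly like this. This is a sketch assembled from the description; parameter and return types, and the semantics in the comments, are inferences rather than the exact signatures from the patch:

public interface DataSource
{
  // Table names referenced by this datasource; a Set, so implementations
  // don't need to deduplicate (renamed from getNames).
  Set<String> getTableNames();

  // Datasources this one directly depends on, e.g. the sides of a join or
  // the inner query of a query datasource. Leaf types return an empty list.
  List<DataSource> getChildren();

  // Returns a copy with the children replaced, for query rewriting such as
  // inlining subqueries. Implementations expect the same number of children.
  DataSource withChildren(List<DataSource> children);

  // Whether the core query stack can scan this datasource directly.
  boolean isConcrete();

  // Whether this datasource is available globally (e.g. lookups, inline data).
  boolean isGlobal();

  // Whether query results on this datasource may be cached.
  // UnionDataSource returns false here, working around #8713.
  boolean isCacheable();
}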

Comment on lines 74 to 77
    && strategy != null
    && cacheConfig.isUseResultLevelCache()
-   && cacheConfig.isQueryCacheable(query);
+   && cacheConfig.isQueryCacheable(query)
+   && query.getDataSource().isCacheable();
Contributor

Similar comment to the one I left in CacheUtil.

Comment on lines 100 to 103
    && strategy != null
    && cacheConfig.isUseCache()
-   && cacheConfig.isQueryCacheable(query);
+   && cacheConfig.isQueryCacheable(query)
+   && query.getDataSource().isCacheable();
Contributor

Perhaps these four lines should be extracted into a helper method, since they're duplicated in populateResultLevelCache().

Is there a test that is affected by the addition of isCacheable() or should one be added?

Contributor Author

Hmm, after thinking about this a bit IMO it is best to combine ResultLevelCacheUtil and CacheUtil, then make the change you suggested (although I think we can only extract three lines), and then add tests for CacheUtil. I'll add that to this patch in a bit.

Contributor Author

Or maybe not combining them in this patch, but at least doing the extraction.

Contributor Author

@ccaominh I decided to combine them after all, and add a CacheUtil test that tests the newly extracted isQueryCacheable method.
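
A minimal sketch of what the extracted check might look like, given the three shared lines in the diffs above (class placement, the generics on CacheStrategy, and the @Nullable annotation are assumptions; the caller keeps its own isUseCache()/isUseResultLevelCache() check):

// Hypothetical shape of the shared helper in CacheUtil.
public static <T> boolean isQueryCacheable(
    final Query<T> query,
    @Nullable final CacheStrategy<T, ?, Query<T>> strategy,
    final CacheConfig cacheConfig
)
{
  return strategy != null
         && cacheConfig.isQueryCacheable(query)
         && query.getDataSource().isCacheable();
}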

* non-Jackson callers should use {@link #fromIterable}.
*/
@JsonCreator
public static InlineDataSource fromJson(
Contributor

Should this be private to match the javadoc?

Contributor Author

Oops, yes, it should. I'll fix it.

@Override
public String toString()
{
// Don't include 'rows' in stringificatione, because it might be long and/or lazy.
Contributor

Typo: stringificatione -> stringification

Is it worth adding a test to enforce this?

Contributor Author

Good idea, I'll add one (and fix the typo).
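
Such a test could count iterations through a wrapping Iterable and assert that toString() never triggers one. A sketch, with the factory method name, its argument order, and the columnNames/columnTypes fixtures assumed:

final AtomicLong iterationCounter = new AtomicLong();
final Iterable<Object[]> rows = () -> {
  // Count every request for an iterator; toString() should never make one.
  iterationCounter.incrementAndGet();
  return Collections.<Object[]>emptyList().iterator();
};
final InlineDataSource dataSource = InlineDataSource.fromIterable(columnNames, columnTypes, rows);

final String ignored = dataSource.toString();
Assert.assertEquals(0, iterationCounter.get());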

/**
* Utility methods for working with {@link Joinable} related classes.
*/
public class Joinables
Contributor

Do you want to add unit tests for these methods?

Contributor Author

Good idea, I'll add some.
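
For example, a test for the prefix helper shown below in this review. Illustrative only: the semantics assumed here are that a column counts as prefixed only if it starts with the prefix and has something after it.

@Test
public void testIsPrefixedBy()
{
  Assert.assertTrue(Joinables.isPrefixedBy("foo.bar", "foo."));
  Assert.assertFalse(Joinables.isPrefixedBy("foobar", "foo."));
  Assert.assertFalse(Joinables.isPrefixedBy("foo.", "foo."));
}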

Comment on lines +40 to +74
/**
* Analysis of a datasource for purposes of deciding how to execute a particular query.
*
* The analysis breaks a datasource down in the following way:
*
* <pre>
*
* Q <-- Possible outer query datasource(s) [may be multiple stacked]
* |
* J <-- Possible join tree, expected to be left-leaning
* / \
* J Dj <-- Other leaf datasources
* Base datasource / \ which will be joined
* (bottom-leftmost) --> Db Dj <---- into the base datasource
*
* </pre>
*
* The base datasource (Db) is returned by {@link #getBaseDataSource()}. The other leaf datasources are returned by
* {@link #getPreJoinableClauses()}. The outer query datasources are available as part of {@link #getDataSource()},
* which just returns the original datasource that was provided for analysis.
*
* The base datasource (Db) will never be a join, but it can be any other type of datasource (table, query, etc).
* Note that join trees are only flattened if they occur at the top of the overall tree (or underneath an outer query),
* and that join trees are only flattened to the degree that they are left-leaning. Due to these facts, it is possible
* for the base or leaf datasources to include additional joins.
*
* The base datasource is the one that will be considered by the core Druid query stack for scanning via
* {@link org.apache.druid.segment.Segment} and {@link org.apache.druid.segment.StorageAdapter}. The other leaf
* datasources must be joinable onto the base data.
*
* The idea here is to keep things simple and dumb. So we focus only on identifying left-leaning join trees, which map
* neatly onto a series of hash table lookups at query time. The user/system generating the queries, e.g. the druid-sql
* layer (or the end user in the case of native queries), is responsible for containing the smarts to structure the
* tree in a way that will lead to optimal execution.
*/
Contributor

Nice javadoc!

Contributor Author

ASCII art is what it's all about.
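
For readers following along, a hypothetical usage sketch of the accessors named in the javadoc (the getters are the ones documented above; the static factory method name is an assumption):

final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());

final DataSource base = analysis.getBaseDataSource();                   // Db: scanned by the core query stack
final List<PreJoinableClause> joins = analysis.getPreJoinableClauses(); // Dj: joined onto the base
final DataSource original = analysis.getDataSource();                   // the full tree, outer queries included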

analysis.getPreJoinableClauses()
);
}

Contributor

Do you want to add an EqualsVerifier test?

Contributor Author

Yeah, I'll add one.
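
e.g. the usual EqualsVerifier one-liner (in practice, suppressed warnings or prefab values may be needed):

@Test
public void testEqualsAndHashCode()
{
  EqualsVerifier.forClass(DataSourceAnalysis.class).usingGetClass().verify();
}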

Comment on lines 149 to 155
final QueryDataSource queryDataSource = new QueryDataSource(
GroupByQuery.builder()
.setDataSource(LOOKUP_LOOKYLOO)
.setInterval(new MultipleIntervalSegmentSpec(Collections.singletonList(Intervals.of("2000/3000"))))
.setGranularity(Granularities.ALL)
.build()
);
Contributor

Looks like you can use subquery(LOOKUP_LOOKYLOO) here instead?

Contributor Author

You're right, I'll change it.

return new QueryDataSource(
GroupByQuery.builder()
.setDataSource(dataSource)
.setInterval(new MultipleIntervalSegmentSpec(Collections.singletonList(Intervals.of("2000/3000"))))
Contributor

Perhaps create a named constant for the interval since it's in the asserts for several of the new tests

Contributor Author

Sure, sounds reasonable.
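
e.g. (constant name illustrative):

private static final MultipleIntervalSegmentSpec INTERVAL_SPEC =
    new MultipleIntervalSegmentSpec(Collections.singletonList(Intervals.of("2000/3000")));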

@clintropolis (Member) left a comment

high level lgtm 👍, inline datasource looks like an avenue to make pretty much anything a druid query which sounds fun

public boolean isCacheable()
{
  // Disables result-level caching for 'union' datasources, for which it doesn't currently work.
  // See https://github.com/apache/druid/issues/8713 for reference.
  return false;
}
Member

👍 this seems like a good intermediary solution until we fix the issue for real

* The rows are backed by an Iterable, which can be lazy or not. A lazy datasource will only be iterated if someone
* calls {@link #getRows()} and iterates the result, or calls {@link #getRowsAsList()}.
*/
public class InlineDataSource implements DataSource
Member

🤘

Objects.equals(granularity, baseQuery.granularity);
}

@Override
public int hashCode()
{

return Objects.hash(dataSource, descending, context, querySegmentSpec, duration, granularity);
Contributor

curious about why this change. I know this isn't really part of your change, but I've also noticed equals and hashcode for a lot of classes that extend BaseQuery have their own implementation of equals and hashCode and do not check the base class. Sounds like we should add some equalsVerifier tests for them, maybe in another patch.

Contributor Author

It's because duration is lazily computed, so it might be null. Calling getDuration() forces it to be computed. I'll add a comment.

Btw, a lot of the subclasses of BaseQuery do call super.equals at some point (but they do other stuff too).
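
The pattern being described, as a sketch (not the exact BaseQuery code): the field starts out null and is memoized on first access, so equals/hashCode must go through the getter rather than read the field directly.

public Duration getDuration()
{
  if (duration == null) {
    // Compute once and memoize: total duration of all intervals in the spec.
    Duration totalDuration = new Duration(0);
    for (Interval interval : querySegmentSpec.getIntervals()) {
      totalDuration = totalDuration.plus(interval.toDuration());
    }
    duration = totalDuration;
  }
  return duration;
}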

Contributor

Weird - I see the calls to super.equals() now, maybe I just shouldn't read code in the night 😝

EqualsVerifier is still complaining about some other stuff. But cleaning that up can be another issue.

@suneet-s (Contributor) left a comment

looks cool! I'll look at DataSourceAnalysis more closely tomorrow when I'm a little less sleepy

@Override
public DataSource withChildren(List<DataSource> children)
{
if (children.size() != dataSources.size()) {
Contributor

why must these sizes match?

Contributor Author

It's a precondition for the method, since its intent is for replacing inputs with inlined/etc versions, so the number of children shouldn't change. I'll add a comment in the interface.
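
So the full method might read something like this. A sketch only: the exception type, the message, and the assumption that union members are tables may not match the patch exactly.

@Override
public DataSource withChildren(List<DataSource> children)
{
  if (children.size() != dataSources.size()) {
    // 1:1 replacement only: callers rewrite existing inputs, never add or remove them.
    throw new IllegalArgumentException(
        "Expected [" + dataSources.size() + "] children, got [" + children.size() + "]"
    );
  }
  return new UnionDataSource(
      children.stream().map(c -> (TableDataSource) c).collect(Collectors.toList())
  );
}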

}
}

public static boolean isPrefixedBy(final String columnName, final String prefix)
Contributor

@NonNull annotations for the parameters?

Contributor Author

I think I'd prefer @EverythingIsNonnullByDefault (which is how I prefer to write/read code usually anyway). We use it in a few other packages in Druid. Maybe we can add that here in a future patch?

Contributor

Works for me! I haven't gotten in the habit of looking out for those annotations yet, but I love the idea of NonnullBy default!
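
For reference, such package-level defaults follow the standard JSR-305 pattern, applied in package-info.java. A sketch using the standard annotation (Druid's @EverythingIsNonnullByDefault is applied the same way; the package name here is assumed):

@ParametersAreNonnullByDefault
package org.apache.druid.segment.join;

import javax.annotation.ParametersAreNonnullByDefault;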

@gianm (Contributor, Author) commented Jan 22, 2020

@ccaominh, @suneet-s — I pushed up a new commit reflecting your comments. Let me know what you think.

{
// Verify that toString does not iterate the rows.
final String ignored = iterableDataSource.toString();
Assert.assertEquals(0, iterationCounter.get());
Contributor

I tried adding rows to toString and it doesn't iterate over rowsIterable (it prints an identifier for the lambda).

Contributor Author

Yeah, that makes sense. I guess the behavior would depend on whatever the particular Iterable’s toString method does. Do you think it makes sense to leave the test like this or change it somehow?

Contributor

I'm ok with leaving the test as it is for now.

Contributor Author

Any suggestions for how to improve it in a follow-up?

@suneet-s (Contributor) left a comment

+1

@gianm gianm merged commit d886463 into apache:master Jan 22, 2020
@gianm gianm deleted the joins-one-and-a-half branch January 22, 2020 22:54
gianm added a commit to gianm/druid that referenced this pull request Jan 22, 2020
gianm added a commit that referenced this pull request Jan 23, 2020
Builds on #9235, using the datasource analysis functionality to replace various ad-hoc
approaches. The most interesting changes are in ClientQuerySegmentWalker (brokers),
ServerManager (historicals), and SinkQuerySegmentWalker (indexing tasks).

Other changes related to improving how we analyze queries:

1) Changes TimelineServerView to return an Optional timeline, which I thought made
   the analysis changes cleaner to implement.
2) Added QueryToolChest#canPerformSubquery, which is now used by query entry points to
   determine whether it is safe to pass a subquery dataSource to the query toolchest.
   Fixes an issue introduced in #5471 where subqueries under non-groupBy-typed queries
   were silently ignored, since neither the query entry point nor the toolchest did
   anything special with them.
3) Removes the QueryPlus.withQuerySegmentSpec method, which was mostly being used in
   error-prone ways (ignoring any potential subqueries, and not verifying that the
   underlying data source is actually a table). Replaces with a new function,
   Queries.withSpecificSegments, that includes sanity checks.
clintropolis added a commit to implydata/druid-public that referenced this pull request Jan 30, 2020
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020