Add join-related DataSource types, and analysis functionality. #9235

Merged
merged 5 commits into apache:master from joins-one-and-a-half on Jan 22, 2020

Conversation

@gianm (Contributor) commented Jan 21, 2020

Builds on #9111 and implements the datasource analysis mentioned in #8728. Still can't
handle join datasources, but we're a step closer.

Join-related DataSource types:

  1. Add "join", "lookup", and "inline" datasources.
  2. Add "getChildren" and "withChildren" methods to DataSource, which will be used
    in the future for query rewriting (e.g. inlining of subqueries).

DataSource analysis functionality:

  1. Add DataSourceAnalysis class, which breaks down datasources into three components:
    outer queries, a base datasource (left-most of the highest level left-leaning join
    tree), and other joined-in leaf datasources (the right-hand branches of the
    left-leaning join tree).
  2. Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to
    support analysis.

Other notes:

  1. Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer.
    Also, made it a Set, so implementations don't need to worry about duplicates.
  2. The addition of "isCacheable" should work around #8713 (improper result-level cache ETag
    handling for union datasources), since UnionDataSource now returns false for cacheability.
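
Taken together, the new DataSource surface described above might look roughly like this. This is a sketch assembled from the description; parameter and return types, and the semantics in the comments, are inferences rather than the exact signatures from the patch:

public interface DataSource
{
  // Table names referenced by this datasource; a Set, so implementations
  // don't need to deduplicate (renamed from getNames).
  Set<String> getTableNames();

  // Datasources this one directly depends on, e.g. the sides of a join or
  // the inner query of a query datasource. Leaf types return an empty list.
  List<DataSource> getChildren();

  // Returns a copy with the children replaced, for query rewriting such as
  // inlining subqueries. Implementations expect the same number of children.
  DataSource withChildren(List<DataSource> children);

  // Whether the core query stack can scan this datasource directly.
  boolean isConcrete();

  // Whether this datasource is available globally (e.g. lookups, inline data).
  boolean isGlobal();

  // Whether query results on this datasource may be cached.
  // UnionDataSource returns false here, working around #8713.
  boolean isCacheable();
}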

Comment on lines 74 to 77
    && strategy != null
    && cacheConfig.isUseResultLevelCache()
-   && cacheConfig.isQueryCacheable(query);
+   && cacheConfig.isQueryCacheable(query)
+   && query.getDataSource().isCacheable();
Contributor

Similar comment to the one I left in CacheUtil.

Comment on lines 100 to 103
    && strategy != null
    && cacheConfig.isUseCache()
-   && cacheConfig.isQueryCacheable(query);
+   && cacheConfig.isQueryCacheable(query)
+   && query.getDataSource().isCacheable();
Contributor

Perhaps these four lines should be extracted into a helper method, since they're duplicated in populateResultLevelCache().

Is there a test that is affected by the addition of isCacheable() or should one be added?

Contributor Author

Hmm, after thinking about this a bit IMO it is best to combine ResultLevelCacheUtil and CacheUtil, then make the change you suggested (although I think we can only extract three lines), and then add tests for CacheUtil. I'll add that to this patch in a bit.

Contributor Author

Or maybe not combining them in this patch, but at least doing the extraction.

Contributor Author

@ccaominh I decided to combine them after all, and add a CacheUtil test that tests the newly extracted isQueryCacheable method.
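
A minimal sketch of what the extracted check might look like, given the three shared lines in the diffs above (class placement, the generics on CacheStrategy, and the @Nullable annotation are assumptions; the caller keeps its own isUseCache()/isUseResultLevelCache() check):

// Hypothetical shape of the shared helper in CacheUtil.
public static <T> boolean isQueryCacheable(
    final Query<T> query,
    @Nullable final CacheStrategy<T, ?, Query<T>> strategy,
    final CacheConfig cacheConfig
)
{
  return strategy != null
         && cacheConfig.isQueryCacheable(query)
         && query.getDataSource().isCacheable();
}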

* non-Jackson callers should use {@link #fromIterable}.
*/
@JsonCreator
public static InlineDataSource fromJson(
Contributor

Should this be private to match the javadoc?

Contributor Author

Oops, yes, it should. I'll fix it.

@Override
public String toString()
{
// Don't include 'rows' in stringificatione, because it might be long and/or lazy.
Contributor

Typo: stringificatione -> stringification

Is it worth adding a test to enforce this?

Contributor Author

Good idea, I'll add one (and fix the typo).
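
Such a test could count iterations through a wrapping Iterable and assert that toString() never triggers one. A sketch, with the factory method name, its argument order, and the columnNames/columnTypes fixtures assumed:

final AtomicLong iterationCounter = new AtomicLong();
final Iterable<Object[]> rows = () -> {
  // Count every request for an iterator; toString() should never make one.
  iterationCounter.incrementAndGet();
  return Collections.<Object[]>emptyList().iterator();
};
final InlineDataSource dataSource = InlineDataSource.fromIterable(columnNames, columnTypes, rows);

final String ignored = dataSource.toString();
Assert.assertEquals(0, iterationCounter.get());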

/**
* Utility methods for working with {@link Joinable} related classes.
*/
public class Joinables
Contributor

Do you want to add unit tests for these methods?

Contributor Author

Good idea, I'll add some.
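
For example, a test for the prefix helper shown below in this review. Illustrative only: the semantics assumed here are that a column counts as prefixed only if it starts with the prefix and has something after it.

@Test
public void testIsPrefixedBy()
{
  Assert.assertTrue(Joinables.isPrefixedBy("foo.bar", "foo."));
  Assert.assertFalse(Joinables.isPrefixedBy("foobar", "foo."));
  Assert.assertFalse(Joinables.isPrefixedBy("foo.", "foo."));
}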

Comment on lines +40 to +74
/**
* Analysis of a datasource for purposes of deciding how to execute a particular query.
*
* The analysis breaks a datasource down in the following way:
*
* <pre>
*
* Q <-- Possible outer query datasource(s) [may be multiple stacked]
* |
* J <-- Possible join tree, expected to be left-leaning
* / \
* J Dj <-- Other leaf datasources
* Base datasource / \ which will be joined
* (bottom-leftmost) --> Db Dj <---- into the base datasource
*
* </pre>
*
* The base datasource (Db) is returned by {@link #getBaseDataSource()}. The other leaf datasources are returned by
* {@link #getPreJoinableClauses()}. The outer query datasources are available as part of {@link #getDataSource()},
* which just returns the original datasource that was provided for analysis.
*
* The base datasource (Db) will never be a join, but it can be any other type of datasource (table, query, etc).
* Note that join trees are only flattened if they occur at the top of the overall tree (or underneath an outer query),
* and that join trees are only flattened to the degree that they are left-leaning. Due to these facts, it is possible
* for the base or leaf datasources to include additional joins.
*
* The base datasource is the one that will be considered by the core Druid query stack for scanning via
* {@link org.apache.druid.segment.Segment} and {@link org.apache.druid.segment.StorageAdapter}. The other leaf
* datasources must be joinable onto the base data.
*
* The idea here is to keep things simple and dumb. So we focus only on identifying left-leaning join trees, which map
* neatly onto a series of hash table lookups at query time. The user/system generating the queries, e.g. the druid-sql
* layer (or the end user in the case of native queries), is responsible for containing the smarts to structure the
* tree in a way that will lead to optimal execution.
*/
Contributor

Nice javadoc!

Contributor Author

ASCII art is what it's all about.
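
For readers following along, a hypothetical usage sketch of the accessors named in the javadoc (the getters are the ones documented above; the static factory method name is an assumption):

final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());

final DataSource base = analysis.getBaseDataSource();                   // Db: scanned by the core query stack
final List<PreJoinableClause> joins = analysis.getPreJoinableClauses(); // Dj: joined onto the base
final DataSource original = analysis.getDataSource();                   // the full tree, outer queries included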

analysis.getPreJoinableClauses()
);
}

Contributor

Do you want to add an EqualsVerifier test?

Contributor Author

Yeah, I'll add one.
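
e.g. the usual EqualsVerifier one-liner (in practice, suppressed warnings or prefab values may be needed):

@Test
public void testEqualsAndHashCode()
{
  EqualsVerifier.forClass(DataSourceAnalysis.class).usingGetClass().verify();
}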

Comment on lines 149 to 155
final QueryDataSource queryDataSource = new QueryDataSource(
GroupByQuery.builder()
.setDataSource(LOOKUP_LOOKYLOO)
.setInterval(new MultipleIntervalSegmentSpec(Collections.singletonList(Intervals.of("2000/3000"))))
.setGranularity(Granularities.ALL)
.build()
);
Contributor

Looks like you can use subquery(LOOKUP_LOOKYLOO) here instead?

Contributor Author

You're right, I'll change it.

return new QueryDataSource(
GroupByQuery.builder()
.setDataSource(dataSource)
.setInterval(new MultipleIntervalSegmentSpec(Collections.singletonList(Intervals.of("2000/3000"))))
Contributor

Perhaps create a named constant for the interval since it's in the asserts for several of the new tests

Contributor Author

Sure, sounds reasonable.
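
e.g. (constant name illustrative):

private static final MultipleIntervalSegmentSpec INTERVAL_SPEC =
    new MultipleIntervalSegmentSpec(Collections.singletonList(Intervals.of("2000/3000")));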

@clintropolis (Member) left a comment

high level lgtm 👍, inline datasource looks like an avenue to make pretty much anything a druid query which sounds fun

public boolean isCacheable()
{
  // Disables result-level caching for 'union' datasources, for which it doesn't currently work.
  // See https://github.com/apache/druid/issues/8713 for reference.
  return false;
}
Member

👍 this seems like a good intermediary solution until we fix the issue for real

* The rows are backed by an Iterable, which can be lazy or not. A lazy datasource will only be iterated if someone
* calls {@link #getRows()} and iterates the result, or calls {@link #getRowsAsList()}.
*/
public class InlineDataSource implements DataSource
Member

🤘

Objects.equals(granularity, baseQuery.granularity);
}

@Override
public int hashCode()
{

return Objects.hash(dataSource, descending, context, querySegmentSpec, duration, granularity);
Contributor

curious about why this change. I know this isn't really part of your change, but I've also noticed equals and hashcode for a lot of classes that extend BaseQuery have their own implementation of equals and hashCode and do not check the base class. Sounds like we should add some equalsVerifier tests for them, maybe in another patch.

Contributor Author

It's because duration is lazily computed, so it might be null. Calling getDuration() forces it to be computed. I'll add a comment.

Btw, a lot of the subclasses of BaseQuery do call super.equals at some point (but they do other stuff too).
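
The pattern being described, as a sketch (not the exact BaseQuery code): the field starts out null and is memoized on first access, so equals/hashCode must go through the getter rather than read the field directly.

public Duration getDuration()
{
  if (duration == null) {
    // Compute once and memoize: total duration of all intervals in the spec.
    Duration totalDuration = new Duration(0);
    for (Interval interval : querySegmentSpec.getIntervals()) {
      totalDuration = totalDuration.plus(interval.toDuration());
    }
    duration = totalDuration;
  }
  return duration;
}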

Contributor

Weird - I see the calls to super.equals() now, maybe I just shouldn't read code in the night 😝

EqualsVerifier is still complaining about some other stuff. But cleaning that up can be another issue.

@suneet-s (Contributor) left a comment

looks cool! I'll look at DataSourceAnalysis more closely tomorrow when I'm a little less sleepy

@Override
public DataSource withChildren(List<DataSource> children)
{
if (children.size() != dataSources.size()) {
Contributor

why must these sizes match?

Contributor Author

It's a precondition for the method, since its intent is for replacing inputs with inlined/etc versions, so the number of children shouldn't change. I'll add a comment in the interface.
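
So the full method might read something like this. A sketch only: the exception type, the message, and the assumption that union members are tables may not match the patch exactly.

@Override
public DataSource withChildren(List<DataSource> children)
{
  if (children.size() != dataSources.size()) {
    // 1:1 replacement only: callers rewrite existing inputs, never add or remove them.
    throw new IllegalArgumentException(
        "Expected [" + dataSources.size() + "] children, got [" + children.size() + "]"
    );
  }
  return new UnionDataSource(
      children.stream().map(c -> (TableDataSource) c).collect(Collectors.toList())
  );
}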

}
}

public static boolean isPrefixedBy(final String columnName, final String prefix)
Contributor

@NonNull annotations for the parameters?

Contributor Author

I think I'd prefer @EverythingIsNonnullByDefault (which is how I prefer to write/read code usually anyway). We use it in a few other packages in Druid. Maybe we can add that here in a future patch?

Contributor

Works for me! I haven't gotten in the habit of looking out for those annotations yet, but I love the idea of NonnullBy default!
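
For reference, such package-level defaults follow the standard JSR-305 pattern, applied in package-info.java. A sketch using the standard annotation (Druid's @EverythingIsNonnullByDefault is applied the same way; the package name here is assumed):

@ParametersAreNonnullByDefault
package org.apache.druid.segment.join;

import javax.annotation.ParametersAreNonnullByDefault;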

@gianm (Contributor, Author) commented Jan 22, 2020

@ccaominh, @suneet-s — I pushed up a new commit reflecting your comments. Let me know what you think.

{
// Verify that toString does not iterate the rows.
final String ignored = iterableDataSource.toString();
Assert.assertEquals(0, iterationCounter.get());
Contributor

I tried adding rows to toString and it doesn't iterate over rowsIterable (it prints an identifier for the lambda).

Contributor Author

Yeah, that makes sense. I guess the behavior would depend on whatever the particular Iterable’s toString method does. Do you think it makes sense to leave the test like this or change it somehow?

Contributor

I'm ok with leaving the test as it is for now.

Contributor Author

Any suggestions for how to improve it in a follow-up?

@suneet-s (Contributor) left a comment

+1

@gianm gianm merged commit d886463 into apache:master Jan 22, 2020
@gianm gianm deleted the joins-one-and-a-half branch January 22, 2020 22:54
gianm added a commit to gianm/druid that referenced this pull request Jan 22, 2020
gianm added a commit that referenced this pull request Jan 23, 2020
Builds on #9235, using the datasource analysis functionality to replace various ad-hoc
approaches. The most interesting changes are in ClientQuerySegmentWalker (brokers),
ServerManager (historicals), and SinkQuerySegmentWalker (indexing tasks).

Other changes related to improving how we analyze queries:

1) Changes TimelineServerView to return an Optional timeline, which I thought made
   the analysis changes cleaner to implement.
2) Added QueryToolChest#canPerformSubquery, which is now used by query entry points to
   determine whether it is safe to pass a subquery dataSource to the query toolchest.
   Fixes an issue introduced in #5471 where subqueries under non-groupBy-typed queries
   were silently ignored, since neither the query entry point nor the toolchest did
   anything special with them.
3) Removes the QueryPlus.withQuerySegmentSpec method, which was mostly being used in
   error-prone ways (ignoring any potential subqueries, and not verifying that the
   underlying data source is actually a table). Replaces with a new function,
   Queries.withSpecificSegments, that includes sanity checks.
clintropolis added a commit to implydata/druid-public that referenced this pull request Jan 30, 2020
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020