Add HashJoinSegment, a virtual segment for joins. #9111

gianm · 2019-12-30T16:19:42Z

An initial step towards #8728. This patch adds enough functionality to implement a joining
cursor on top of a normal datasource. It does not include enough to actually do a query. For
that, future patches will need to wire this low-level functionality into the query language.

The main files in this patch:

HashJoinSegment: The virtual join Segment described in Initial join support #8728.
HashJoinSegmentStorageAdapter: Storage adapter for that segment; "makeCursors" is the
interesting part.
HashJoinEngine: Contains JoinColumnSelectorFactory and JoinCursor, which together implement
the row-by-row logic of a join.
LookupJoinable: Allows joining onto lookups.
IndexedTableJoinable: A more flexible Joinable that can have multiple columns in general,
including multiple key columns. I expect this will be used for joining onto subquery results
in the future. It may even be used as a sort of super-lookup.
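
To make the "row-by-row logic of a join" concrete, here is a minimal sketch of a hash join over in-memory rows. It is not the code from this patch: the class name HashJoinSketch, the map-based row representation, and the includeUnmatchedLeft flag are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the patch.
final class HashJoinSketch
{
  /**
   * Joins each left row to the right-hand rows indexed under the value of leftKeyColumn.
   * Right-hand columns are copied into the output under rightPrefix + columnName.
   * When includeUnmatchedLeft is true (LEFT-style join), left rows without a match are
   * still emitted, with no right-hand columns (so those columns read as null).
   */
  static List<Map<String, Object>> join(
      final List<Map<String, Object>> leftRows,
      final String leftKeyColumn,
      final Map<Object, List<Map<String, Object>>> rightIndex,
      final String rightPrefix,
      final boolean includeUnmatchedLeft
  )
  {
    final List<Map<String, Object>> out = new ArrayList<>();
    for (Map<String, Object> left : leftRows) {
      // The hash lookup replaces a scan of the right-hand side.
      final List<Map<String, Object>> matches =
          rightIndex.getOrDefault(left.get(leftKeyColumn), Collections.emptyList());
      if (matches.isEmpty()) {
        if (includeUnmatchedLeft) {
          out.add(new HashMap<>(left));
        }
        continue;
      }
      for (Map<String, Object> right : matches) {
        final Map<String, Object> joined = new HashMap<>(left);
        right.forEach((column, value) -> joined.put(rightPrefix + column, value));
        out.add(joined);
      }
    }
    return out;
  }
}
```

The sketch materializes its output eagerly, whereas a cursor would produce one joined row at a time; the per-row lookup is the same idea either way.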

See https://gist.github.com/gianm/39548daef74f0373b3c87056e3db4627 for details on how the above things work together.

Some supporting elements:

Added a "withDimension" method to DimensionSpec so prefixed dimensions can be rewritten to
remove their prefixes.
Added "canIterate" and "iterable" to LookupExtractor, necessary for right and full joins
on lookups. It will also be useful for direct queries on lookups in the future.
Removed "getSegmentIdentifier" method from StorageAdapter. It was not being used.
Moved RowBasedColumnSelectorFactory out of the groupBy engine, reflecting the fact that it
has been used by other, non-groupBy things. Also, split out the RowAdapter interface, which
is now used by RowBasedIndexedTable as well.
Renamed VectorColumnStrategizer to VectorColumnProcessorFactory (see below).
Added a "ColumnProcessors" utility class and "ColumnProcessorFactory" interface that is
currently only used to make join condition matchers in IndexedTableJoinMatcher. It wasn't
strictly necessary, but I think it's designed better than ColumnSelectorStrategyFactory,
and could replace it in the future. It's similar in design to VectorColumnProcessorFactory.

drcrallen · 2019-12-30T17:22:10Z

This is super cool, thank you for putting this together. I would love to see a README.md (or similar) going over the high-level mechanics of the join to help understand its limitations. I see a few comments scattered among classes but don't quite see the high-level data flow. Is it all captured in #8728, or are there more nuances in this PR since it is a foundation PR?

lgtm-com · 2019-12-30T18:22:11Z

This pull request introduces 1 alert when merging 2820e87 into dec619e - view on LGTM.com

new alerts:

1 for Missing format argument

gianm · 2019-12-30T18:25:03Z

This is super cool, thank you for putting this together. I would love to see a README.md (or similar) going over the high-level mechanics of the join to help understand its limitations. I see a few comments scattered among classes but don't quite see the high-level data flow. Is it all captured in #8728, or are there more nuances in this PR since it is a foundation PR?

@drcrallen --

#8728 is more of a program for starting to build join support than a description of mechanics. I agree that such a description of mechanics would be useful, although I haven't written it yet. I will start working on that and, maybe, it will outlive the proposal #8728 in usefulness.

What sorts of questions would you want this mechanical description / design doc to answer?

gianm · 2019-12-30T18:30:46Z

Reviewers: by the way, I've done two things in this patch that were suggested on the dev list and I found to be useful. Please let me know if you agree.

The new unit tests in this patch are named in the style of https://lists.apache.org/thread.html/381816301bda945b7bf19c05822d1d910282ceb2a96db86b3bfe72b8%40%3Cdev.druid.apache.org%3E. For example see HashJoinSegmentStorageAdapterTest.
Used Optional a bit more, as suggested in https://lists.apache.org/thread.html/5bb9c985bf0918dfbd2d247aa960bbaf8982654d95c648dba8ef2d67%40%3Cdev.druid.apache.org%3E. For example see HashJoinSegmentStorageAdapter#getClauseForColumn and Exprs.decomposeEquals. There are still a lot of @Nullable where needed to interact with existing interfaces. I didn't think going on a rampage changing them was a good idea.

gianm · 2019-12-30T19:33:29Z

I started a design document at https://gist.github.com/gianm/39548daef74f0373b3c87056e3db4627. @drcrallen let me know if this answers your questions.

drcrallen · 2019-12-30T21:18:16Z

Thank you @gianm, I'll comment in the doc as opposed to cluttering here.

jnaous · 2019-12-30T21:31:14Z

processing/src/test/java/org/apache/druid/segment/join/HashJoinSegmentStorageAdapterTest.java

+  }
+
+  @Test
+  public void test_getInterval_factToCountry()


factToCountry isn't really a condition for the test. Is this testing a normal or error case? What kind? I would also add an expected section to the name, like shouldBeReturned. Something like test_getInterval_intervalWithinJoinedSegment_shouldBeReturned().

jnaous · 2019-12-30T21:33:05Z

processing/src/test/java/org/apache/druid/segment/join/HashJoinSegmentStorageAdapterTest.java

+  }
+
+  @Test
+  public void test_getDimensionCardinality_factToCountryNonexistentFactColumn()


IMHO: ..._shouldHaveCardinality1

jnaous

I still have to go through the rest. 41/69 files done, but will continue later this weekend most likely.

jnaous · 2020-01-10T14:35:49Z

core/src/main/java/org/apache/druid/common/config/NullHandling.java

+    } else if (clazz == String.class) {
+      return (T) defaultStringValue();
+    } else {
+      return null;


This else is problematic. What other classes do you expect? I'd prefer that we be explicit in this method, returning null for the supported classes and otherwise throwing an exception in the else case. If in the future we add a new supported type and forget to add its default-value branch, we would hit the exception instead of silently returning null and causing issues. A unit test should additionally be created based on the set of supported types. Ideally, we should have our own typing system with an interface that has methods like getDefault() so we avoid these issues. I friggin hate type checks in Java.

Unfortunately all types are supported. This method is used by the PossiblyNullColumnValueSelector when it generates nulls that don't exist in the base selector. The base selector could be returning any type, even weird ones that don't exist in Druid's type system (there's a COMPLEX catch-all for those, & it's used for stuff like sketches).
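
As an illustration of the trade-off discussed here, the sketch below returns a default for a few known classes and null for everything else, which is the behavior the reply defends. The class NullDefaultsSketch, its replaceWithDefault flag, and the specific default values are assumptions for illustration, not the NullHandling code from the patch.

```java
// Hypothetical stand-in; not the NullHandling class from the patch.
final class NullDefaultsSketch
{
  // Assumption: true mimics a "replace nulls with defaults" mode, false mimics SQL-style nulls.
  private final boolean replaceWithDefault;

  NullDefaultsSketch(final boolean replaceWithDefault)
  {
    this.replaceWithDefault = replaceWithDefault;
  }

  @SuppressWarnings("unchecked")
  <T> T defaultValueForClass(final Class<T> clazz)
  {
    if (!replaceWithDefault) {
      return null;
    }
    if (clazz == Long.class) {
      return (T) Long.valueOf(0L);
    } else if (clazz == Float.class) {
      return (T) Float.valueOf(0f);
    } else if (clazz == Double.class) {
      return (T) Double.valueOf(0d);
    } else if (clazz == String.class) {
      return (T) "";
    } else {
      // Any other class (for example a complex/sketch type) has no default, so null is returned
      // rather than an exception being thrown.
      return null;
    }
  }
}
```

Throwing in the final branch instead would give the stricter behavior the reviewer asked for, at the cost of rejecting types outside the known set.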

jnaous · 2020-01-10T14:44:00Z

core/src/main/java/org/apache/druid/math/expr/Exprs.java

+        stack.push(((BinAndExpr) current).left);
+      } else {
+        retVal.add(current);
+      }


This potentially seems like a method that belongs on Expr rather than here.

What would the method look like?
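
For readers following along, the stack-based decomposition visible in the diff above can be sketched as follows. The Node, Leaf, and And types are hypothetical stand-ins for the real expression classes, and the traversal order is an assumption chosen so that operands come out left to right.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-ins for the expression tree; not the classes from the patch.
final class DecomposeAndSketch
{
  interface Node {}

  static final class Leaf implements Node
  {
    final String text;

    Leaf(final String text)
    {
      this.text = text;
    }
  }

  static final class And implements Node
  {
    final Node left;
    final Node right;

    And(final Node left, final Node right)
    {
      this.left = left;
      this.right = right;
    }
  }

  /** Flattens nested ANDs into the list of their non-AND operands, left to right. */
  static List<Node> decomposeAnd(final Node expr)
  {
    final List<Node> retVal = new ArrayList<>();
    final Deque<Node> stack = new ArrayDeque<>();
    stack.push(expr);

    while (!stack.isEmpty()) {
      final Node current = stack.pop();
      if (current instanceof And) {
        // Push right first so that the left operand is popped, and therefore emitted, first.
        stack.push(((And) current).right);
        stack.push(((And) current).left);
      } else {
        retVal.add(current);
      }
    }
    return retVal;
  }
}
```

An explicit stack avoids recursion, so deeply nested conditions cannot overflow the call stack.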

jnaous · 2020-01-10T14:45:05Z

core/src/main/java/org/apache/druid/math/expr/Exprs.java

+   */
+  public static List<Expr> decomposeAnd(final Expr expr)
+  {
+    final List<Expr> retVal = new ArrayList<>();


Might wanna specify initial size since these are likely smallish in the usual case? Or perhaps a LL is good enough?

What would you suggest for an initial size?

jnaous · 2020-01-10T14:46:02Z

core/src/main/java/org/apache/druid/math/expr/Exprs.java

+   *
+   * @return decomposed equality, or empty if the input expr was not an equality expr
+   */
+  public static Optional<Pair<Expr, Expr>> decomposeEquals(final Expr expr)


Also looks like a method on Expr rather than here.

What would the method look like?

jnaous · 2020-01-10T14:49:00Z

core/src/main/java/org/apache/druid/math/expr/Exprs.java

+import java.util.Optional;
+import java.util.Stack;
+
+public class Exprs


Missing unit tests for this class.

Yeah, these should have unit tests. I'll add them.

jnaous · 2020-01-10T16:31:03Z

processing/src/main/java/org/apache/druid/segment/join/HashJoinSegmentStorageAdapter.java

+            .getAvailableColumns()
+            .stream()
+            .map(c -> clause.getPrefix() + c)
+            .forEach(availableDimensions::add);


This probably should be a method on JoinableClause.

esp given the unPrefix method which assumes the logic in the map call above.

And you'd probably need a unit test to make sure that what the clause prefixes is unprefixed correctly.

These changes sound like good ideas. I'll make them.
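
A rough sketch of what keeping prefixing and unprefixing in one class could look like is below. The method names getAvailableColumnsPrefixed, includesColumn, and unprefix appear elsewhere in this thread, but the class PrefixedClauseSketch and all method bodies are assumptions for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch; bodies are assumed, not copied from JoinableClause.
final class PrefixedClauseSketch
{
  private final String prefix;
  private final List<String> availableColumns;

  PrefixedClauseSketch(final String prefix, final List<String> availableColumns)
  {
    this.prefix = prefix;
    this.availableColumns = availableColumns;
  }

  /** Returns the underlying column names with the prefix prepended. */
  List<String> getAvailableColumnsPrefixed()
  {
    return availableColumns.stream().map(c -> prefix + c).collect(Collectors.toList());
  }

  /** Returns true if the column name would be routed to this clause. */
  boolean includesColumn(final String columnName)
  {
    return columnName.startsWith(prefix);
  }

  /** Removes the prefix; only valid when includesColumn(columnName) is true. */
  String unprefix(final String columnName)
  {
    if (!includesColumn(columnName)) {
      throw new IllegalArgumentException(
          "Column [" + columnName + "] does not start with prefix [" + prefix + "]"
      );
    }
    return columnName.substring(prefix.length());
  }
}
```

With both directions in one place, a round-trip test (prefix every column, unprefix the result, compare with the original list) covers the invariant the reviewer is asking about.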

jnaous · 2020-01-10T16:34:18Z

processing/src/main/java/org/apache/druid/segment/join/JoinableClause.java

+   * Removes our prefix from "columnName". Must only be called if {@link #includesColumn} would have returned true
+   * on this column name.
+   */
+  public String unprefix(final String columnName)


This should probably have a counterpart addPrefix method in this class.

Prefix addition is only done in this class itself, in just one spot, so maybe not necessary? I guess it could be a private method?

jnaous · 2020-01-10T16:35:14Z

processing/src/main/java/org/apache/druid/segment/join/JoinableClause.java

+  public int hashCode()
+  {
+    return Objects.hash(prefix, joinable, joinType, condition);
+  }


hashCode and equals require tests for maintainability. They are easy to write using utility classes. I think Suneet has added some.

I added a test using EqualsVerifier.

jnaous · 2020-01-10T16:48:28Z

processing/src/main/java/org/apache/druid/segment/join/JoinConditionAnalysis.java

+
+        if (isLeftExprAndRightColumn(lhs, rhs, rightPrefix)) {
+          // rhs is a right-hand column; lhs is an expression solely of the left-hand side.
+          equiConditions.add(new Equality(lhs, rhs.getIdentifierIfIdentifier().substring(rightPrefix.length())));


Am I correct in understanding that this is doing the unprefix method? Seems like you should be calling that instead?

jnaous · 2020-01-10T16:49:22Z

processing/src/main/java/org/apache/druid/segment/join/JoinConditionAnalysis.java

+  public int hashCode()
+  {
+    return Objects.hash(originalExpression);
+  }


Need tests for equals and hashCode.

I'll add one with EqualsVerifier.

jnaous

If you're not going to add unit tests, I think we should have a coverage check to make sure that the inputs to the functional tests end up testing all branches of the code...

jnaous · 2020-01-10T18:10:59Z

processing/src/main/java/org/apache/druid/segment/join/JoinType.java

+   * "Righty" joins (RIGHT or FULL) always include the full right-hand side, and can generate nulls on the left.
+   */
+  abstract boolean isRighty();
+}


jnaous · 2020-01-10T18:12:26Z

processing/src/main/java/org/apache/druid/segment/join/Joinable.java

+   * @return capabilities, or null if the columnName is not one of this Joinable's columns
+   */
+  @Nullable
+  ColumnCapabilities getColumnCapabilities(String columnName);


The combination of nullable and optional use is a potential source of error. We wouldn't be able to get the full benefits of optional imho unless we decide that nothing can be null...

I made this one nullable since it mirrors a lot of other methods with identical signature in other interfaces. I didn't want to change them all and I thought it'd be nice for them to echo each other.

I still think changing them all in this PR is a bad idea, but maybe the echoing is also a bad idea, and the right thing would be to make this one an Optional.

What do you think?
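
One way to picture the middle ground being discussed is an Optional-returning method layered over a nullable one, so existing nullable signatures stay untouched. The class and method names below are hypothetical, and a plain String stands in for the capabilities object.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; a String stands in for the real capabilities type.
final class NullableToOptionalSketch
{
  private final Map<String, String> columnTypes;

  NullableToOptionalSketch(final Map<String, String> columnTypes)
  {
    this.columnTypes = columnTypes;
  }

  /** Nullable style: returns null when the column is unknown. */
  String getColumnTypeOrNull(final String columnName)
  {
    return columnTypes.get(columnName);
  }

  /** Optional style: the "unknown column" case is visible in the signature. */
  Optional<String> getColumnType(final String columnName)
  {
    return Optional.ofNullable(getColumnTypeOrNull(columnName));
  }
}
```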

jnaous · 2020-01-10T18:13:07Z

processing/src/main/java/org/apache/druid/segment/join/JoinableClause.java

+    this.prefix = prefix != null ? prefix : "";
+    this.joinable = Preconditions.checkNotNull(joinable, "joinable");
+    this.joinType = Preconditions.checkNotNull(joinType, "joinType");
+    this.condition = Preconditions.checkNotNull(condition, "condition");


Check out Lombok's @NotNull annotation :)

jnaous · 2020-01-10T18:15:06Z

processing/src/main/java/org/apache/druid/segment/join/PossiblyNullColumnValueSelector.java

+  @Override
+  public float getFloat()
+  {
+    return beNull.getAsBoolean() ? 0L : baseSelector.getFloat();


perhaps return float 0 to avoid runtime conversion?

Oops, this was a copy paste error. I'll fix it.

jnaous · 2020-01-10T18:15:57Z

processing/src/main/java/org/apache/druid/segment/join/PossiblyNullColumnValueSelector.java

+  @Override
+  public double getDouble()
+  {
+    return beNull.getAsBoolean() ? 0L : baseSelector.getDouble();


should we be returning the defaultValueForNull instead of 0?

Hmm, that's a boxed type, so probably not. But there's some ZERO_* constants that would be appropriate, so I'll use those.
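
The pattern under review, a selector that either delegates to a base selector or reports null on demand, can be sketched as below. PossiblyNullDoubleSketch and its use of DoubleSupplier are assumptions for illustration; only the beNull BooleanSupplier idea comes from the diff.

```java
import java.util.function.BooleanSupplier;
import java.util.function.DoubleSupplier;

// Hypothetical sketch of a "possibly null" primitive selector.
final class PossiblyNullDoubleSketch
{
  private final DoubleSupplier baseSelector;
  private final BooleanSupplier beNull;

  PossiblyNullDoubleSketch(final DoubleSupplier baseSelector, final BooleanSupplier beNull)
  {
    this.baseSelector = baseSelector;
    this.beNull = beNull;
  }

  boolean isNull()
  {
    return beNull.getAsBoolean();
  }

  double getDouble()
  {
    // A double literal avoids the implicit long-to-double widening that "0L" would need.
    return beNull.getAsBoolean() ? 0.0d : baseSelector.getAsDouble();
  }
}
```

Callers are expected to check isNull() first; the zero returned by getDouble() is only a placeholder for the primitive getter.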

jnaous · 2020-01-10T18:17:52Z

processing/src/main/java/org/apache/druid/segment/join/PossiblyNullDimensionSelector.java

+    }
+
+    this.nullAdjustedRow = new NullAdjustedIndexedInts(nullAdjustment);
+  }


Some comments would be helpful to understand + unit tests for maintainability

I've added a bunch of tests.

jnaous · 2020-01-10T18:25:16Z

I've found it tough reviewing this PR. I think it could have been broken down into a few other PRs. For example:

The refactor of class/method name changes
The implementation of all the factories for the selectors
The Join logic
The IndexedTable implementation and its related pieces

gianm · 2020-01-10T19:01:37Z

If you're not going to add unit tests, I think we should have a coverage check to make sure that the inputs to the functional tests end up testing all branches of the code...

Sounds like a good idea to me. IMO it isn't worth unit testing every single class, and can even be counter-productive (adds to the amount of work it takes to refactor things later). So I like the idea of pairing functional tests with coverage checks.

That being said, some of the examples you pointed out would be good to add unit tests for, so I'll go through and add a few more. I probably won't have a chance to do this today, but I'll try to get to it soon.

gianm · 2020-01-13T23:09:36Z

@jnaous — I've pushed up a commit with all the changes I said I'd make in your review.

jon-wei · 2020-01-14T00:58:29Z

processing/src/main/java/org/apache/druid/segment/join/JoinableClause.java

+
+  public JoinableClause(@Nullable String prefix, Joinable joinable, JoinType joinType, JoinConditionAnalysis condition)
+  {
+    this.prefix = prefix != null ? prefix : "";


Suggest adding a comment somewhere that avoiding name conflicts with the prefix is the responsibility of the caller

I added it in HashJoinEngine — that seemed like the most appropriate place.

jon-wei · 2020-01-14T02:00:14Z

processing/src/main/java/org/apache/druid/segment/join/JoinMatcher.java

+   *
+   * Will only work correctly if {@link Joinable#makeJoinMatcher} was called with {@code remainderNeeded == true}.
+   */
+  void matchRemainder();


Suggest adding some more info here about the relationship/ordering between matchCondition() and matchRemainder() calls, e.g., is it an error if matchCondition is called after matchRemainder? (looking at how it's used in HashJoinEngine.matchCurrentPosition)

Hmm, the way they're currently specced and implemented, they will work no matter what order you call them in. I think this is fine. It's more generic than necessary but I don't think it hurts.
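
To illustrate how condition matching and remainder matching can relate, here is a small sketch in which matchRemainder returns every right-hand row that no earlier matchCondition call has matched. This is an assumed design for illustration, using a linear scan instead of a hash index; it is not the JoinMatcher contract from the patch.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch; tracks which right-hand rows have been matched so far.
final class RemainderMatcherSketch
{
  private final List<String> rightKeys;
  private final BitSet matchedRows = new BitSet();

  RemainderMatcherSketch(final List<String> rightKeys)
  {
    this.rightKeys = rightKeys;
  }

  /** Returns right-hand row numbers whose key equals leftKey, and remembers them as matched. */
  List<Integer> matchCondition(final String leftKey)
  {
    final List<Integer> rows = new ArrayList<>();
    for (int i = 0; i < rightKeys.size(); i++) {
      if (rightKeys.get(i).equals(leftKey)) {
        rows.add(i);
        matchedRows.set(i);
      }
    }
    return rows;
  }

  /** Returns right-hand row numbers that no matchCondition call has matched so far. */
  List<Integer> matchRemainder()
  {
    final List<Integer> rows = new ArrayList<>();
    for (int i = matchedRows.nextClearBit(0); i < rightKeys.size(); i = matchedRows.nextClearBit(i + 1)) {
      rows.add(i);
    }
    return rows;
  }
}
```

In this sketch neither call invalidates the other, so the two can be interleaved in any order; a RIGHT or FULL join would typically call matchRemainder once the left-hand side is exhausted.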

jon-wei · 2020-01-14T02:23:58Z

processing/src/main/java/org/apache/druid/segment/join/lookup/LookupJoinMatcher.java

+                          .map(Equality::getLeftExpr)
+                          .collect(Collectors.toList());
+    } else {
+      throw new IAE("Cannot join lookup with condition: %s", condition);


Suggest splitting up the non-equiconditions check and the lookup key column check, and using a specific exception message for each case

Ah, good idea!!

jon-wei · 2020-01-14T10:01:21Z

I haven't really looked through the tests yet, but I don't have any more comments on the main code

jon-wei · 2020-01-14T22:01:41Z

Tests LGTM, there are some real TC errors about unused methods

gianm · 2020-01-14T23:51:03Z

Tests LGTM, there are some real TC errors about unused methods

I'm looking into these.

gianm · 2020-01-15T00:17:52Z

Tests LGTM, there are some real TC errors about unused methods

I'm looking into these.

I've pushed up a commit that I think will fix these. I've removed some code and added tests for other code.

jon-wei

LGTM

jihoonson · 2020-01-15T23:28:21Z

Removed "getSegmentIdentifier" method from StorageAdapter. It was not being used.

Added "Release Notes" label since it's marked as a public API.

clintropolis · 2020-01-16T02:26:18Z

processing/src/main/java/org/apache/druid/segment/join/PossiblyNullColumnValueSelector.java

+import javax.annotation.Nullable;
+import java.util.function.BooleanSupplier;
+
+public class PossiblyNullColumnValueSelector<T> implements ColumnValueSelector<T>


this would probably be worth distinguishing functionality/intent from BaseNullableColumnValueSelector through javadocs

I could add some javadocs in a follow up. Hopefully for now the fact that it's in the org.apache.druid.segment.join package makes it clear enough that it's specifically join-related and is not a generically useful thing like BaseNullableColumnValueSelector.

I'm thinking of adding this:

/**
 * A {@link ColumnValueSelector} that wraps a base selector but might also generate null values on demand. This
 * is used for "righty" joins (see {@link JoinType#isRighty()}), which may need to generate nulls on the left-hand side.
 */

clintropolis · 2020-01-16T02:52:52Z

...ssing/src/main/java/org/apache/druid/segment/join/table/IndexedTableColumnValueSelector.java

+
+    // Otherwise this shouldn't have been called (due to isNull returning true).
+    assert NullHandling.replaceWithDefault();
+    return NullHandling.defaultDoubleValue();


Will this and the other primitive get methods throw a NullPointerException if they do happen to get called in a production-ish environment with SQL-compatible null handling enabled? The assert will not be executed there, and NullHandling.defaultDoubleValue() will return null. Should this return the 0 value instead?

It's a logic error to call this method if isNull returns true for the row, so it seems ok to me to throw an NPE if someone actually does it in production with asserts disabled.
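
The failure mode being discussed is ordinary Java auto-unboxing: returning a null Double from a method declared to return double throws a NullPointerException. The sketch below reproduces that in isolation; the class and the replaceWithDefault parameter are hypothetical stand-ins for the null-handling mode.

```java
// Hypothetical sketch of the unboxing behavior; not the code from the patch.
final class UnboxingNullSketch
{
  /** Stand-in for a boxed default that is null when defaults are not being substituted. */
  static Double defaultDoubleValue(final boolean replaceWithDefault)
  {
    return replaceWithDefault ? Double.valueOf(0.0d) : null;
  }

  static double getDouble(final boolean replaceWithDefault)
  {
    // Auto-unboxing happens on return; a null Double here throws NullPointerException.
    return defaultDoubleValue(replaceWithDefault);
  }
}
```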

jihoonson

Thanks for the great patch! Halfway through and still reviewing.

jihoonson · 2020-01-15T22:51:31Z

processing/src/main/java/org/apache/druid/segment/ColumnProcessorFactory.java

+  /**
+   * This default type will be used when the underlying column has an unknown type.
+   */
+  ValueType defaultType();


nit: I'm not sure whether this method should be in this class or not. From the code, I guess the default type is used when ColumnCapabilities is missing. In this case, it makes sense to me to infer the type from the query as in IndexedTableJoinMatcher. For LookupJoinMatcher, I guess the default type is String because all columns in the lookup have the String type. I guess I'm confused with what defaultType means here even though ColumnProcessorFactory seems to support all value types. I understand the inferred type should be stored somewhere, but not sure why it should be this class.

It's meant to be the preferred type that the processor wants to deal with in situations where there is no type information for the underlying column. It should usually be related to whatever the processor wants to do with the data. The idea is that you would return STRING if you prefer to deal with strings, DOUBLE (or LONG) if you prefer to deal with numbers, etc.

Does that make sense / sound reasonable?

I'm thinking about adding this javadoc:

/**
 * This default type will be used when the underlying column has an unknown type.
 *
 * This allows a column processor factory to specify what type it prefers to deal with (the most 'natural' type for
 * whatever it is doing) when all else is equal.
 */
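
The role of defaultType can be sketched as a dispatch that falls back to the factory's preferred type only when the column's type is unknown. Everything below (the ValueType enum, the ProcessorFactory interface, and makeProcessor) is a simplified, hypothetical model rather than the ColumnProcessors API from the patch.

```java
// Hypothetical, simplified model of type-based processor dispatch.
final class ColumnProcessorSketch
{
  enum ValueType { STRING, LONG, DOUBLE }

  interface ProcessorFactory<T>
  {
    /** The type this factory prefers when the column's type is unknown. */
    ValueType defaultType();

    T makeStringProcessor(String column);

    T makeLongProcessor(String column);

    T makeDoubleProcessor(String column);
  }

  /** knownTypeOrNull is null when no type information exists for the column. */
  static <T> T makeProcessor(
      final String column,
      final ValueType knownTypeOrNull,
      final ProcessorFactory<T> factory
  )
  {
    final ValueType effectiveType = knownTypeOrNull != null ? knownTypeOrNull : factory.defaultType();
    switch (effectiveType) {
      case STRING:
        return factory.makeStringProcessor(column);
      case LONG:
        return factory.makeLongProcessor(column);
      case DOUBLE:
        return factory.makeDoubleProcessor(column);
      default:
        throw new IllegalStateException("Unsupported type: " + effectiveType);
    }
  }
}
```

A factory that mostly does arithmetic would return DOUBLE or LONG from defaultType, while one that compares keys as text would return STRING.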

jihoonson · 2020-01-15T23:30:31Z

processing/src/main/java/org/apache/druid/segment/VectorColumnProcessorFactory.java

+ * {@link DimensionHandlerUtils#makeVectorProcessor}.
+ *
+ * Unlike {@link ColumnProcessorFactory}, this interface does not have a "defaultType" method. The default type is
+ * always implicitly STRING. It also does not have a "makeComplexProcessor" method; instead, complex-typed columns


Would you elaborate more on why the default type is always string? Or is it a temporary thing?

I imagined it's a temporary thing. I would eventually like the two column processor factory interfaces to match up better.

jihoonson · 2020-01-15T23:37:42Z

processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java

+    if (condition.isAlwaysTrue()) {
+      this.conditionMatchers = Collections.singletonList(() -> IntIterators.fromTo(0, table.numRows()));
+    } else if (condition.isAlwaysFalse()) {
+      this.conditionMatchers = Collections.singletonList(() -> IntIterators.fromTo(0, 0));


better to use IntIterators.EMPTY_ITERATOR.

Yeah, good point.

jihoonson · 2020-01-16T06:52:17Z

processing/src/main/java/org/apache/druid/segment/join/JoinableClause.java

+ */
+public class JoinableClause
+{
+  private final String prefix;


What is prefix here? rightTableName?

It's whatever the caller wants it to be, really. The SQL layer is gonna use strings like _j0..

I'm planning to add this javadoc, which'll make it clearer:

/**
 * The prefix to apply to all columns from the Joinable. The idea is that during a join, any columns that start with
 * this prefix should be retrieved from our Joinable's {@link JoinMatcher#getColumnSelectorFactory()}. Any other
 * columns should be returned from the left-hand side of the join.
 *
 * The prefix can be any string, as long as it is nonempty and not itself a prefix of the reserved column name
 * {@code __time}.
 *
 * @see #getAvailableColumnsPrefixed() the list of columns from our {@link Joinable} with prefixes attached
 * @see #unprefix a method for removing prefixes
 */

jihoonson · 2020-01-16T06:58:45Z

processing/src/main/java/org/apache/druid/segment/join/JoinableClause.java

+   * Returns a list of columns from the underlying {@link Joinable#getAvailableColumns()} method, with our
+   * prefix ({@link #getPrefix()}) prepended.
+   */
+  public List<String> getAvailableColumnsPrefixed()


nit: getAvailableResolvedColumns() or getAvailableQualifiedColumns?

I actually like that the word "prefix" is in here since it makes the connection with getPrefix and unprefix more clear.

jihoonson · 2020-01-16T07:03:05Z

processing/src/main/java/org/apache/druid/segment/join/table/IndexedTable.java

+import java.util.List;
+import java.util.Map;
+
+public interface IndexedTable


Would you please add javadoc?

Sure, good call.

I'm thinking of adding this:

/**
 * An interface to a table where some columns (the 'key columns') have indexes that enable fast lookups.
 *
 * The main user of this class is {@link IndexedTableJoinable}, and its main purpose is to participate in joins.
 */
public interface IndexedTable
{
  /**
   * Returns the columns of this table that have indexes.
   */
  List<String> keyColumns();

  /**
   * Returns all columns of this table, including the key and non-key columns.
   */
  List<String> allColumns();

  /**
   * Returns the signature of this table: a map where each key is a column from {@link #allColumns()} and each value
   * is a type code.
   */
  Map<String, ValueType> rowSignature();

  /**
   * Returns the number of rows in this table. It must not change over time, since it is used for things like algorithm
   * selection and reporting of cardinality metadata.
   */
  int numRows();

  /**
   * Returns the index for a particular column. The provided column number must be that column's position in
   * {@link #allColumns()}.
   */
  Index columnIndex(int column);

  /**
   * Returns a reader for a particular column. The provided column number must be that column's position in
   * {@link #allColumns()}.
   */
  Reader columnReader(int column);

  /**
   * Indexes support fast lookups on key columns.
   */
  interface Index
  {
    /**
     * Returns the list of row numbers where the column this Reader is based on contains 'key'.
     */
    IntList find(Object key);
  }

  /**
   * Readers support reading values out of any column.
   */
  interface Reader
  {
    /**
     * Read the value at a particular row number. Throws an exception if the row is out of bounds (must be between zero
     * and {@link #numRows()}).
     */
    @Nullable
    Object read(int row);
  }
}

jihoonson · 2020-01-16T07:32:18Z

processing/src/main/java/org/apache/druid/segment/join/JoinConditionAnalysis.java

+  }
+
+  public static JoinConditionAnalysis forExpression(
+      final String condition,


Seems like assuming the column name from the right table is always resolved. If so, this should be documented.

Yes, and that's because the way to think about the prefixes is that they aren't table names (which might be present or not), they are column name prefixes. They are mandatory.

I'm thinking of adding this javadoc:

/**
 * Analyze a join condition.
 *
 * @param condition   the condition expression
 * @param rightPrefix prefix for the right-hand side of the join; will be used to determine which identifiers in
 *                    the condition come from the right-hand side and which come from the left-hand side
 * @param macroTable  macro table for parsing the condition expression
 */

jihoonson · 2020-01-16T07:38:59Z

processing/src/main/java/org/apache/druid/segment/join/HashJoinEngine.java

+   * joinable clause's prefix (see {@link JoinableClause#getPrefix()}) will come from the Joinable's column selector
+   * factory, and all other columns will come from the leftCursor's column selector factory.
+   *
+   * Ensuing that the joinable clause's prefix does not conflict with any columns from "leftCursor" is the


Ensuring?

Would you elaborate more on this, especially on what a conflict between the prefix of the joinable clause and the columns from the left cursor would look like? Is it like a conflict between the right table name and a left column name?

Oops, yeah, that's a typo. It should be "ensuring".

Is this clearer?

/**
 * Ensuring that the joinable clause's prefix does not conflict with any columns from "leftCursor" is the
 * responsibility of the caller. If there is such a conflict (for example, if the joinable clause's prefix is "j.",
 * and the leftCursor has a field named "j.j.abrams"), then the field from the leftCursor will be shadowed and will
 * not be queryable through the returned Cursor. This happens even if the right-hand joinable doesn't actually have a
 * column with this name.
 */
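
The shadowing described in this javadoc follows from routing purely on the column-name prefix. The sketch below shows that routing with map-based rows; PrefixRoutingSketch and readColumn are hypothetical names, not part of the patch.

```java
import java.util.Map;

// Hypothetical sketch of prefix-based column routing.
final class PrefixRoutingSketch
{
  /**
   * Reads a column from the joined row. Any name starting with rightPrefix is routed to the
   * right-hand row, whether or not the right-hand side has such a column.
   */
  static Object readColumn(
      final String columnName,
      final String rightPrefix,
      final Map<String, Object> leftRow,
      final Map<String, Object> rightRow
  )
  {
    if (columnName.startsWith(rightPrefix)) {
      return rightRow.get(columnName.substring(rightPrefix.length()));
    }
    return leftRow.get(columnName);
  }
}
```

With a prefix of "j.", a left-hand column named "j.j.abrams" is looked up on the right-hand side as "j.abrams" and reads as null, which is the shadowing the javadoc warns about.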

jihoonson

@gianm thanks. All updated docs sound good. Left one more trivial comment.

jihoonson · 2020-01-16T18:54:53Z

processing/src/main/java/org/apache/druid/segment/join/table/SortedIntIntersectionIterator.java

+import java.util.NoSuchElementException;
+
+/**
+ * Iterates over the intersection of an array of sorted int lists. Intended for situations where the number


nit: probably better to be sorted positive int lists.

Even though the next sentence says "The iterators must be composed of ascending, nonnegative ints."?

Oh, I missed that part. Sounds good.
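
As an illustration of intersecting sorted int lists, here is a small eager version that returns a list instead of an iterator. It assumes each input is strictly ascending and nonnegative, and it is a generic technique sketch rather than the SortedIntIntersectionIterator from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; assumes each input array is strictly ascending and nonnegative.
final class SortedIntIntersectionSketch
{
  static List<Integer> intersect(final int[][] sortedLists)
  {
    final List<Integer> out = new ArrayList<>();
    if (sortedLists.length == 0) {
      return out;
    }

    final int[] pos = new int[sortedLists.length];
    while (true) {
      // The largest current head is the smallest value that could still be in every list.
      int max = -1;
      for (int i = 0; i < sortedLists.length; i++) {
        if (pos[i] >= sortedLists[i].length) {
          return out;
        }
        max = Math.max(max, sortedLists[i][pos[i]]);
      }

      // Advance every list to its first value >= max and check whether all heads agree.
      boolean allEqual = true;
      for (int i = 0; i < sortedLists.length; i++) {
        while (pos[i] < sortedLists[i].length && sortedLists[i][pos[i]] < max) {
          pos[i]++;
        }
        if (pos[i] >= sortedLists[i].length) {
          return out;
        }
        if (sortedLists[i][pos[i]] != max) {
          allEqual = false;
        }
      }

      if (allEqual) {
        out.add(max);
        for (int i = 0; i < sortedLists.length; i++) {
          pos[i]++;
        }
      }
    }
  }
}
```

Because every list only moves forward, the total work is bounded by the combined length of the inputs, which suits a small number of lists.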

jihoonson · 2020-01-16T19:32:24Z

processing/src/main/java/org/apache/druid/segment/join/HashJoinEngine.java

+
+    final JoinColumnSelectorFactory joinColumnSelectorFactory = new JoinColumnSelectorFactory();
+
+    class JoinCursor implements Cursor


jihoonson

LGTM

A follow-up to apache#9111.

Builds on apache#9111 and implements the datasource analysis mentioned in apache#8728. Still can't handle join datasources, but we're a step closer.

Join-related DataSource types:

1) Add "join", "lookup", and "inline" datasources.
2) Add "getChildren" and "withChildren" methods to DataSource, which will be used in the future for query rewriting (e.g. inlining of subqueries).

DataSource analysis functionality:

1) Add DataSourceAnalysis class, which breaks down datasources into three components: outer queries, a base datasource (left-most of the highest level left-leaning join tree), and other joined-in leaf datasources (the right-hand branches of the left-leaning join tree).
2) Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to support analysis.
3) Use the DataSourceAnalysis methods throughout the query handling stack, replacing various ad-hoc approaches. Most of the interesting changes are in ClientQuerySegmentWalker (brokers), ServerManager (historicals), and SinkQuerySegmentWalker (indexing tasks).

Other notes:

1) Changed TimelineServerView to return an Optional timeline, which I thought made the analysis changes cleaner to implement.
2) Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer. Also, made it a Set, so implementations don't need to worry about duplicates.
3) Added QueryToolChest#canPerformSubquery, which is now used by query entry points to determine whether it is safe to pass a subquery dataSource to the query toolchest. Fixes an issue introduced in apache#5471 where subqueries under non-groupBy-typed queries were silently ignored, since neither the query entry point nor the toolchest did anything special with them.
4) The addition of "isCacheable" should work around apache#8713, since UnionDataSource now returns false for cacheability.

gianm added the Feature and Area - Querying labels Dec 30, 2019

Fixups.

2820e87

Fix missing format argument.

d8d7524

jnaous reviewed Dec 30, 2019

jnaous reviewed Jan 10, 2020

Various tests and minor improvements.

a0f7a97

Merge branch 'master' into joins-one

4fbc32a

jon-wei reviewed Jan 14, 2020

gianm added 2 commits January 14, 2020 10:05

Merge branch 'master' into joins-one

dfbaee4

Changes.

654fdb5

Remove or add tests for unused stuff.

9ba79e4

gianm added 2 commits January 15, 2020 08:28

Fix up package locations.

e78d208

Merge branch 'master' into joins-one

0bbb767

jon-wei approved these changes Jan 15, 2020

jihoonson added the Release Notes label Jan 15, 2020

clintropolis reviewed Jan 16, 2020

jihoonson reviewed Jan 16, 2020

jihoonson approved these changes Jan 16, 2020

clintropolis approved these changes Jan 16, 2020

gianm modified the milestone: 0.17.0 Jan 16, 2020

gianm merged commit a87db7f into apache:master Jan 16, 2020

gianm deleted the joins-one branch January 16, 2020 21:14

gianm added a commit to gianm/druid that referenced this pull request Jan 16, 2020

Add javadocs and small improvements to join code.

5246603

A follow-up to apache#9111.

gianm mentioned this pull request Jan 16, 2020

Add javadocs and small improvements to join code. #9196

Merged

gianm added a commit that referenced this pull request Jan 16, 2020

Add javadocs and small improvements to join code. (#9196)

bfcb30e

A follow-up to #9111.

gianm mentioned this pull request Jan 21, 2020

Add join-related DataSource types and analysis functionality. #9234

Closed

gianm mentioned this pull request Jan 21, 2020

Add join-related DataSource types, and analysis functionality. #9235

Merged

jihoonson added this to the 0.18.0 milestone Mar 26, 2020

jihoonson mentioned this pull request Apr 9, 2020

[Draft] 0.18.0 release notes #9652

Closed


		final JoinColumnSelectorFactory joinColumnSelectorFactory = new JoinColumnSelectorFactory();

		class JoinCursor implements Cursor

gianm commented Dec 30, 2019 (edited)

drcrallen commented Dec 30, 2019

lgtm-com bot commented Dec 30, 2019

gianm commented Dec 30, 2019

gianm commented Dec 30, 2019

gianm commented Dec 30, 2019

drcrallen commented Dec 30, 2019

jnaous left a comment

jnaous left a comment

jnaous commented Jan 10, 2020

gianm commented Jan 10, 2020

gianm commented Jan 13, 2020

jon-wei commented Jan 14, 2020

jon-wei commented Jan 14, 2020

gianm commented Jan 14, 2020

gianm commented Jan 15, 2020

jon-wei left a comment

jihoonson commented Jan 15, 2020

jihoonson left a comment

gianm Jan 16, 2020 (edited)
