Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HashJoinSegment, a virtual segment for joins. #9111

Merged
merged 10 commits into from
Jan 16, 2020
Merged

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Dec 30, 2019

An initial step towards #8728. This patch adds enough functionality to implement a joining
cursor on top of a normal datasource. It does not include enough to actually do a query. For
that, future patches will need to wire this low-level functionality into the query language.

The main files in this patch:

  • HashJoinSegment: The virtual join Segment described in Initial join support #8728.
  • HashJoinSegmentStorageAdapter: Storage adapter for that segment; "makeCursors" is the
    interesting part.
  • HashJoinEngine: Contains JoinColumnSelectorFactory, JoinCursor, which together implement
    the row-by-row logic of a join.
  • LookupJoinable: Allows joining onto lookups.
  • IndexedTableJoinable: A more flexible Joinable that can have multiple columns in general,
    including multiple key columns. I expect this will be used for joining onto subquery results
    in the future. It may even be used as a sort of super-lookup.

See https://gist.github.com/gianm/39548daef74f0373b3c87056e3db4627 for details on how the above things work together.

Some supporting elements:

  • Added a "withDimension" method to DimensionSpec so prefixed dimensions can be rewritten to
    remove their prefixes.
  • Added "canIterate" and "iterable" to LookupExtractor, necessary for right and full joins
    on lookups. It will also be useful for direct queries on lookups in the future.
  • Removed "getSegmentIdentifier" method from StorageAdapter. It was not being used.
  • Moved RowBasedColumnSelectorFactory out of the groupBy engine, reflecting the fact that it
    has been used by other, non-groupBy things. Also, split out the RowAdapter interface, which
    is now used by RowBasedIndexedTable as well.
  • Renamed VectorColumnStrategizer to VectorColumnProcessorFactory (see below).
  • Added a "ColumnProcessors" utility class and "ColumnProcessorFactory" interface that is
    currently only used to make join condition matchers in IndexedTableJoinMatcher. It wasn't
    strictly necessary, but I think it's designed better than ColumnSelectorStrategyFactory,
    and could replace it in the future. It's similar in design to VectorColumnProcessorFactory.

An initial step towards apache#8728. This patch adds enough functionality to implement a joining
cursor on top of a normal datasource. It does not include enough to actually do a query. For
that, future patches will need to wire this low-level functionality into the query language.
@drcrallen
Copy link
Contributor

This is super cool thank you for putting this together. I would love to see a README.md (or similar) going over the high level mechanics of the join to help understand its limitations. I see a few comments scattered among classes but don't quite see the high level data flow. Is it all captured in #8728 , or are there more nuances in this PR since it is a foundation PR?

@lgtm-com
Copy link

lgtm-com bot commented Dec 30, 2019

This pull request introduces 1 alert when merging 2820e87 into dec619e - view on LGTM.com

new alerts:

  • 1 for Missing format argument

@gianm
Copy link
Contributor Author

gianm commented Dec 30, 2019

This is super cool thank you for putting this together. I would love to see a README.md (or similar) going over the high level mechanics of the join to help understand its limitations. I see a few comments scattered among classes but don't quite see the high level data flow. Is it all captured in #8728 , or are there more nuances in this PR since it is a foundation PR?

@drcrallen --

#8728 is more of a program for starting to build join support than a description of mechanics. I agree that such a description of mechanics would be useful, although I haven't written it yet. I will start working on that and, maybe, it will outlive the proposal #8728 in usefulness.

What sorts of questions would you want this mechanical description / design doc to answer?

@gianm
Copy link
Contributor Author

gianm commented Dec 30, 2019

Reviewers: by the way, I've done two things in this patch that were suggested on the dev list and I found to be useful. Please let me know if you agree.

  1. The new unit tests in this patch are named in the style of https://lists.apache.org/thread.html/381816301bda945b7bf19c05822d1d910282ceb2a96db86b3bfe72b8%40%3Cdev.druid.apache.org%3E. For example see HashJoinSegmentStorageAdapterTest.
  2. Used Optional a bit more, as suggested in https://lists.apache.org/thread.html/5bb9c985bf0918dfbd2d247aa960bbaf8982654d95c648dba8ef2d67%40%3Cdev.druid.apache.org%3E. For example see HashJoinSegmentStorageAdapter#getClauseForColumn and Exprs.decomposeEquals. There are still a lot of @Nullable where needed to interact with existing interfaces. I didn't think going on a rampage changing them was a good idea.

@gianm
Copy link
Contributor Author

gianm commented Dec 30, 2019

I started a design document at https://gist.github.com/gianm/39548daef74f0373b3c87056e3db4627. @drcrallen let me know if this answers your questions.

@drcrallen
Copy link
Contributor

Thank you @gianm I'll comment in the doc as opposed to cluttering here.

}

@Test
public void test_getInterval_factToCountry()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

factToCountry isn't really a condition for the test. Is this testing a normal or error case? What kind? I would add also add an expected section to the name like shouldBeReturned. Something like test_getInterval_intervalWithinJoinedSegment_shouldBeReturned().

}

@Test
public void test_getDimensionCardinality_factToCountryNonexistentFactColumn()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO: ..._shouldHaveCardinality1

Copy link
Contributor

@jnaous jnaous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have to go through the rest. 41/69 files done, but will continue later this weekend most likely.

} else if (clazz == String.class) {
return (T) defaultStringValue();
} else {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This else is problematic. What other classes do you expect? I prefer we're explicit in this method, returning null for the supported classes, and otherwise throwing an exception in the else case. If in the future we add a new supported type, and forget to add the default value if case, we would hit the exception instead of silently returning a null and causing issues. A unit test should additionally be created based on the set of supported types. Ideally, we should have our own typing system with an interface that has methods like getDefault() so we avoid these issues. I friggin hate type checks in Java.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately all types are supported. This method is used by the PossiblyNullColumnValueSelector when it generates nulls that don't exist in the base selector. The base selector could be returning any type, even weird ones that don't exist in Druid's type system (there's a COMPLEX catch-all for those, & it's used for stuff like sketches).

stack.push(((BinAndExpr) current).left);
} else {
retVal.add(current);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This potentially seems like a method that belongs on Expr rather than here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the method look like?

*/
public static List<Expr> decomposeAnd(final Expr expr)
{
final List<Expr> retVal = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might wanna specify initial size since these are likely smallish in the usual case? Or perhaps a LL is good enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you suggest for an initial size?

*
* @return decomposed equality, or empty if the input expr was not an equality expr
*/
public static Optional<Pair<Expr, Expr>> decomposeEquals(final Expr expr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looks like a method on Expr rather than here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the method look like?

import java.util.Optional;
import java.util.Stack;

public class Exprs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing unit tests for this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, these should have unit tests. I'll add them.

.getAvailableColumns()
.stream()
.map(c -> clause.getPrefix() + c)
.forEach(availableDimensions::add);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably should be a method on JoinableClause.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

esp given the unPrefix method which assumes the logic in the map call above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you'd probably need a unit test to make sure that what the clause prefixes is unprefixed correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes sound like good ideas. I'll make them.

* Removes our prefix from "columnName". Must only be called if {@link #includesColumn} would have returned true
* on this column name.
*/
public String unprefix(final String columnName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably have a counter addPrefix method in this class..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefix addition is only done in this class itself, in just one spot, so maybe not necessary? I guess it could be a private method?

public int hashCode()
{
return Objects.hash(prefix, joinable, joinType, condition);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hashCode and equals require tests for maintainability. They are easy to write using utility classes. I think Suneet has added some.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test using EqualsVerifier.


if (isLeftExprAndRightColumn(lhs, rhs, rightPrefix)) {
// rhs is a right-hand column; lhs is an expression solely of the left-hand side.
equiConditions.add(new Equality(lhs, rhs.getIdentifierIfIdentifier().substring(rightPrefix.length())));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct in understanding that this is doing the unprefix method? Seems like you should be calling that instead?

public int hashCode()
{
return Objects.hash(originalExpression);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need tests for equals and hashCode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add one with EqualsVerifier.

Copy link
Contributor

@jnaous jnaous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're not going to add unit tests, I think we should have a coverage check to make sure that the inputs to the functional tests end up testing all branches of the code...

* "Righty" joins (RIGHT or FULL) always include the full right-hand side, and can generate nulls on the left.
*/
abstract boolean isRighty();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

* @return capabilities, or null if the columnName is not one of this Joinable's columns
*/
@Nullable
ColumnCapabilities getColumnCapabilities(String columnName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The combination of nullable and optional use is a potential source of error. We wouldn't be able to get the full benefits of optional imho unless we decide that nothing can be null...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this one nullable since it mirrors a lot of other methods with identical signature in other interfaces. I didn't want to change them all and I thought it'd be nice for them to echo each other.

I still think changing them all in this PR is a bad idea, but maybe the echoing is also a bad idea, and the right thing would be to make this one an Optional.

What do you think?

this.prefix = prefix != null ? prefix : "";
this.joinable = Preconditions.checkNotNull(joinable, "joinable");
this.joinType = Preconditions.checkNotNull(joinType, "joinType");
this.condition = Preconditions.checkNotNull(condition, "condition");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check out Lombok's @NotNull annotation :)

@Override
public float getFloat()
{
return beNull.getAsBoolean() ? 0L : baseSelector.getFloat();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps return float 0 to avoid runtime conversion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this was a copy paste error. I'll fix it.

@Override
public double getDouble()
{
return beNull.getAsBoolean() ? 0L : baseSelector.getDouble();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be returning the defaultValueForNull instead of 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's a boxed type, so probably not. But there's some ZERO_* constants that would be appropriate, so I'll use those.

}

this.nullAdjustedRow = new NullAdjustedIndexedInts(nullAdjustment);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments would be helpful to understand + unit tests for maintainability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a bunch of tests.

@jnaous
Copy link
Contributor

jnaous commented Jan 10, 2020

I've found it tough reviewing this PR. I think it could have been broken down into a few other PRs. For example:

  1. The refactor of class/method name changes
  2. The implementation of all the factories for the selectors
  3. The Join logic
  4. The IndexedTable implementation and its related pieces

@gianm
Copy link
Contributor Author

gianm commented Jan 10, 2020

If you're not going to add unit tests, I think we should have a coverage check to make sure that the inputs to the functional tests end up testing all branches of the code...

Sounds like a good idea to me. IMO it isn't worth unit testing every single class, and can even be counter-productive (adds to the amount of work it takes to refactor things later). So I like the idea of pairing functional tests with coverage checks.

That being said, some of the examples you pointed out would be good to add unit tests for, so I'll go through and add a few more. I probably won't have a chance to do this today, but I'll try to get to it soon.

@gianm
Copy link
Contributor Author

gianm commented Jan 13, 2020

@jnaous — I've pushed up a commit with all the changes I said I'd make in your review.


public JoinableClause(@Nullable String prefix, Joinable joinable, JoinType joinType, JoinConditionAnalysis condition)
{
this.prefix = prefix != null ? prefix : "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding a comment somewhere that avoiding name conflicts with the prefix is the responsibility of the caller

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it in HashJoinEngine — that seemed like the most appropriate place.

*
* Will only work correctly if {@link Joinable#makeJoinMatcher} was called with {@code remainderNeeded == true}.
*/
void matchRemainder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding some more info here about the relationship/ordering between matchCondition() and matchRemainder() calls, e.g., is it an error if matchCondition is called after matchRemainder? (looking at how it's used in HashJoinEngine.matchCurrentPosition)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the way they're currently specced and implemented, they will work no matter what order you call them in. I think this is fine. It's more generic than necessary but I don't think it hurts.

.map(Equality::getLeftExpr)
.collect(Collectors.toList());
} else {
throw new IAE("Cannot join lookup with condition: %s", condition);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest splitting up the non-equiconditions check and the lookup key column check, and using a specific exception message for each case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good idea!!

@jon-wei
Copy link
Contributor

jon-wei commented Jan 14, 2020

I haven't really looked through the tests yet, but I don't have any more comments on the main code

@jon-wei
Copy link
Contributor

jon-wei commented Jan 14, 2020

Tests LGTM, there are some real TC errors about unused methods

@gianm
Copy link
Contributor Author

gianm commented Jan 14, 2020

Tests LGTM, there are some real TC errors about unused methods

I'm looking into these.

@gianm
Copy link
Contributor Author

gianm commented Jan 15, 2020

Tests LGTM, there are some real TC errors about unused methods

I'm looking into these.

I've pushed up a commit that I think will fix these. I've removed some code and added tests for other code.

Copy link
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jihoonson
Copy link
Contributor

Removed "getSegmentIdentifier" method from StorageAdapter. It was not being used.

Added "Release Notes" label since it's marked as a public API.

import javax.annotation.Nullable;
import java.util.function.BooleanSupplier;

public class PossiblyNullColumnValueSelector<T> implements ColumnValueSelector<T>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would probably be worth distinguishing functionality/intent from BaseNullableColumnValueSelector through javadocs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could add some javadocs in a follow up. Hopefully for now the fact that it's in the org.apache.druid.segment.join package makes it clear enough that it's specifically join-related and is not a generically useful thing like BaseNullableColumnValueSelector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of adding this:

/**
 * A {@link ColumnValueSelector} that wraps a base selector but might also generate null values on demand. This
 * is used for "righty" joins (see {@link JoinType#isRighty()}), which may need to generate nulls on the left-hand side.
 */


// Otherwise this shouldn't have been called (due to isNull returning true).
assert NullHandling.replaceWithDefault();
return NullHandling.defaultDoubleValue();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this and other primitive get methods cause a null pointer exception if it does happen to get called in production-ish environment and sql compatible null handling is enabled because the assert will not be executed and NullHandling.defaultDoubleValue() will return null. Should this return the 0 value instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a logic error to call this method if isNull returns true for the row, so it seems ok to me to throw an NPE if someone actually does it in production with asserts disabled.

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great patch! Halfway through and still reviewing.

/**
* This default type will be used when the underlying column has an unknown type.
*/
ValueType defaultType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'm not sure whether this method should be in this class or not. From the code, I guess the default type is used when ColumnCapabilities is missing. In this case, it makes sense to me to infer the type from the query as in IndexedTableJoinMatcher. For LookupJoinMatcher, I guess the default type is String because all columns in the lookup have the String type. I guess I'm confused with what defaultType means here even though ColumnProcessorFactory seems to support all value types. I understand the inferred type should be stored somewhere, but not sure why it should be this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's meant to be the preferred type that the processor wants to deal with in situations where there is no type information for the underlying column. It should usually be related to whatever the processor wants to do with the data. The idea is that you would return STRING if you prefer to deal with strings, DOUBLE (or LONG) if you prefer to deal with numbers, etc.

Does that make sense / sound reasonable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about adding this javadoc:

  /**
   * This default type will be used when the underlying column has an unknown type.
   *
   * This allows a column processor factory to specify what type it prefers to deal with (the most 'natural' type for
   * whatever it is doing) when all else is equal.
   */

* {@link DimensionHandlerUtils#makeVectorProcessor}.
*
* Unlike {@link ColumnProcessorFactory}, this interface does not have a "defaultType" method. The default type is
* always implicitly STRING. It also does not have a "makeComplexProcessor" method; instead, complex-typed columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you elaborate more on why the default type is always string? Or is it a temporary thing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagined it's a temporary thing. I would eventually like the two column processor factory interfaces to match up better.

if (condition.isAlwaysTrue()) {
this.conditionMatchers = Collections.singletonList(() -> IntIterators.fromTo(0, table.numRows()));
} else if (condition.isAlwaysFalse()) {
this.conditionMatchers = Collections.singletonList(() -> IntIterators.fromTo(0, 0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use IntIterators.EMPTY_ITERATOR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point.

*/
public class JoinableClause
{
private final String prefix;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is prefix here? rightTableName?

Copy link
Contributor Author

@gianm gianm Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's whatever the caller wants it to be, really. The SQL layer is gonna use strings like _j0..

Copy link
Contributor Author

@gianm gianm Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm planning to add this javadoc, which'll make it clearer:

  /**
   * The prefix to apply to all columns from the Joinable. The idea is that during a join, any columns that start with
   * this prefix should be retrieved from our Joinable's {@link JoinMatcher#getColumnSelectorFactory()}. Any other
   * columns should be returned from the left-hand side of the join.
   *
   * The prefix can be any string, as long as it is nonempty and not itself a prefix of the reserved column name
   * {@code __time}.
   *
   * @see #getAvailableColumnsPrefixed() the list of columns from our {@link Joinable} with prefixes attached
   * @see #unprefix a method for removing prefixes
   */

* Returns a list of columns from the underlying {@link Joinable#getAvailableColumns()} method, with our
* prefix ({@link #getPrefix()}) prepended.
*/
public List<String> getAvailableColumnsPrefixed()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: getAvailableResolvedColumns() or getAvailableQualifiedColumns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually like that the word "prefix" is in here since it makes the connection with getPrefix and unprefix more clear.

import java.util.List;
import java.util.Map;

public interface IndexedTable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add javadoc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, good call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of adding this:

/**
 * An interface to a table where some columns (the 'key columns') have indexes that enable fast lookups.
 *
 * The main user of this class is {@link IndexedTableJoinable}, and its main purpose is to participate in joins.
 */
public interface IndexedTable
{
  /**
   * Returns the columns of this table that have indexes.
   */
  List<String> keyColumns();

  /**
   * Returns all columns of this table, including the key and non-key columns.
   */
  List<String> allColumns();

  /**
   * Returns the signature of this table: a map where each key is a column from {@link #allColumns()} and each value
   * is a type code.
   */
  Map<String, ValueType> rowSignature();

  /**
   * Returns the number of rows in this table. It must not change over time, since it is used for things like algorithm
   * selection and reporting of cardinality metadata.
   */
  int numRows();

  /**
   * Returns the index for a particular column. The provided column number must be that column's position in
   * {@link #allColumns()}.
   */
  Index columnIndex(int column);

  /**
   * Returns a reader for a particular column. The provided column number must be that column's position in
   * {@link #allColumns()}.
   */
  Reader columnReader(int column);

  /**
   * Indexes support fast lookups on key columns.
   */
  interface Index
  {
    /**
     * Returns the list of row numbers where the column this Reader is based on contains 'key'.
     */
    IntList find(Object key);
  }

  /**
   * Readers support reading values out of any column.
   */
  interface Reader
  {
    /**
     * Read the value at a particular row number. Throws an exception if the row is out of bounds (must be between zero
     * and {@link #numRows()}).
     */
    @Nullable
    Object read(int row);
  }

}

public static JoinConditionAnalysis forExpression(
final String condition,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like assuming the column name from the right table is always resolved. If so, this should be documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and that's because the way to think about the prefixes is that they aren't table names (which might be present or not), they are column name prefixes. They are mandatory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of adding this javadoc:

  /**
   * Analyze a join condition.
   *
   * @param condition   the condition expression
   * @param rightPrefix prefix for the right-hand side of the join; will be used to determine which identifiers in
   *                    the condition come from the right-hand side and which come from the left-hand side
   * @param macroTable  macro table for parsing the condition expression
   */

* joinable clause's prefix (see {@link JoinableClause#getPrefix()}) will come from the Joinable's column selector
* factory, and all other columns will come from the leftCursor's column selector factory.
*
* Ensuing that the joinable clause's prefix does not conflict with any columns from "leftCursor" is the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensuring?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you elaborate more on this, especially what conflict between the prefix of the joinable clause and columns from the left cursor? Is it like the conflict between the right table name and the left column name?

Copy link
Contributor Author

@gianm gianm Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yeah, that's a typo. It should be "ensuring".

Is this clearer?

  /**
   * Ensuring that the joinable clause's prefix does not conflict with any columns from "leftCursor" is the
   * responsibility of the caller. If there is such a conflict (for example, if the joinable clause's prefix is "j.",
   * and the leftCursor has a field named "j.j.abrams"), then the field from the leftCursor will be shadowed and will
   * not be queryable through the returned Cursor. This happens even if the right-hand joinable doesn't actually have a
   * column with this name.
   */

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm thanks. All updated docs sounds good. Left one more trivial comment.

import java.util.NoSuchElementException;

/**
* Iterates over the intersection of an array of sorted int lists. Intended for situations where the number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably better to be sorted positive int lists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though the next sentence says "The iterators must be composed of ascending, nonnegative ints."?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed that part. Sounds good.


final JoinColumnSelectorFactory joinColumnSelectorFactory = new JoinColumnSelectorFactory();

class JoinCursor implements Cursor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gianm gianm modified the milestone: 0.17.0 Jan 16, 2020
@gianm gianm merged commit a87db7f into apache:master Jan 16, 2020
@gianm gianm deleted the joins-one branch January 16, 2020 21:14
gianm added a commit to gianm/druid that referenced this pull request Jan 16, 2020
gianm added a commit that referenced this pull request Jan 16, 2020
gianm added a commit to gianm/druid that referenced this pull request Jan 21, 2020
Builds on apache#9111 and implements the datasource analysis mentioned in apache#8728. Still can't
handle join datasources, but we're a step closer.

Join-related DataSource types:

1) Add "join", "lookup", and "inline" datasources.
2) Add "getChildren" and "withChildren" methods to DataSource, which will be used
   in the future for query rewriting (e.g. inlining of subqueries).

DataSource analysis functionality:

1) Add DataSourceAnalysis class, which breaks down datasources into three components:
   outer queries, a base datasource (left-most of the highest level left-leaning join
   tree), and other joined-in leaf datasources (the right-hand branches of the
   left-leaning join tree).
2) Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to
   support analysis.
3) Use the DataSourceAnalysis methods throughout the query handling stack, replacing
   various ad-hoc approaches. Most of the interesting changes are in
   ClientQuerySegmentWalker (brokers), ServerManager (historicals), and
   SinkQuerySegmentWalker (indexing tasks).

Other notes:

1) Changed TimelineServerView to return an Optional timeline, which I thought made
   the analysis changes cleaner to implement.
2) Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer.
   Also, made it a Set, so implementations don't need to worry about duplicates.
3) Added QueryToolChest#canPerformSubquery, which is now used by query entry points to
   determine whether it is safe to pass a subquery dataSource to the query toolchest.
   Fixes an issue introduced in apache#5471 where subqueries under non-groupBy-typed queries
   were silently ignored, since neither the query entry point nor the toolchest did
   anything special with them.
4) The addition of "isCacheable" should work around apache#8713, since UnionDataSource now
   returns false for cacheability.
gianm added a commit to gianm/druid that referenced this pull request Jan 21, 2020
Builds on apache#9111 and implements the datasource analysis mentioned in apache#8728. Still can't
handle join datasources, but we're a step closer.

Join-related DataSource types:

1) Add "join", "lookup", and "inline" datasources.
2) Add "getChildren" and "withChildren" methods to DataSource, which will be used
   in the future for query rewriting (e.g. inlining of subqueries).

DataSource analysis functionality:

1) Add DataSourceAnalysis class, which breaks down datasources into three components:
   outer queries, a base datasource (left-most of the highest level left-leaning join
   tree), and other joined-in leaf datasources (the right-hand branches of the
   left-leaning join tree).
2) Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to
   support analysis.

Other notes:

1) Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer.
   Also, made it a Set, so implementations don't need to worry about duplicates.
2) The addition of "isCacheable" should work around apache#8713, since UnionDataSource now
   returns false for cacheability.
gianm added a commit that referenced this pull request Jan 22, 2020
* Add join-related DataSource types, and analysis functionality.

Builds on #9111 and implements the datasource analysis mentioned in #8728. Still can't
handle join datasources, but we're a step closer.

Join-related DataSource types:

1) Add "join", "lookup", and "inline" datasources.
2) Add "getChildren" and "withChildren" methods to DataSource, which will be used
   in the future for query rewriting (e.g. inlining of subqueries).

DataSource analysis functionality:

1) Add DataSourceAnalysis class, which breaks down datasources into three components:
   outer queries, a base datasource (left-most of the highest level left-leaning join
   tree), and other joined-in leaf datasources (the right-hand branches of the
   left-leaning join tree).
2) Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to
   support analysis.

Other notes:

1) Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer.
   Also, made it a Set, so implementations don't need to worry about duplicates.
2) The addition of "isCacheable" should work around #8713, since UnionDataSource now
   returns false for cacheability.

* Remove javadoc comment.

* Updates reflecting code review.

* Add comments.

* Add more comments.
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants