Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable rewriting certain inner joins as filters. #11068

Merged
merged 9 commits into from
Apr 14, 2021

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Apr 5, 2021

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

  • It must be an inner equi-join.
  • The right-hand columns referenced by the condition must not contain any
    duplicate values. (If they did, the inner join would not be guaranteed
    to return at most one row for each left-hand-side row.)
  • No columns from the right-hand side can be used by anything other than
    the join condition itself.

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

  • Add Query getRequiredColumns() method to help us figure out whether
    the right-hand side of a join datasource is being used or not.
  • Add JoinConditionAnalysis getRequiredColumns() method to help us
    figure out if the right-hand side of a join is being used by later
    join clauses acting on the same base.
  • Add Joinable getNonNullColumnValuesIfAllUnique method to enable
    retrieving the set of values that will form the "in" filter.
  • Add LookupExtractor canGetKeySet() and keySet() methods to support
    LookupJoinable in its efforts to implement the new Joinable method.
  • Add enableRewriteJoinToFilter feature flag to
    JoinFilterRewriteConfig. The default is disabled.

Testing strategy:

  • Add join-to-filter conversion tests to JoinableFactorWrapperTests.
  • Add getRequiredColumns tests to individual query engines.
  • Add getNonNullColumnValuesIfAllUnique tests to LookupJoinable and IndexedTableJoinable.
  • Extend BaseCalciteQueryTest's QueryContextForJoinProvider to also
    provide query contexts that enable this rewrite.
  • Add some new tests to CalciteQueryTest that are designed to exercise
    this rewrite. (And some existing tests did, too.)

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
  duplicate values. (If they did, the inner join would not be guaranteed
  to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
  the join condition itself.

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

- Add Query getRequiredColumns() method to help us figure out whether
  the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
  figure out if the right-hand side of a join is being used by later
  join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
  retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
  LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
  JoinFilterRewriteConfig. The default is disabled.
return parseBoolean(
query,
REWRITE_JOIN_TO_FILTER_ENABLE_KEY,
DEFAULT_ENABLE_JOIN_FILTER_REWRITE_VALUE_COLUMN_FILTERS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DEFAULT_ENABLE_JOIN_FILTER_REWRITE_VALUE_COLUMN_FILTERS
DEFAULT_ENABLE_REWRITE_JOIN_TO_FILTER

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱

Copy link
Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Out of curiosity, given that we exclude null values, how are null values in the left table matched?

/**
* Returns a Set of all keys in this lookup extractor. The returned Set will not change.
*
* @throws UnsupportedOperationException if {@link #canIterate()} returns false.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @throws UnsupportedOperationException if {@link #canIterate()} returns false.
* @throws UnsupportedOperationException if {@link #canGetKeySet()} returns false.

for (int i = 0; i < table.numRows(); i++) {
final String s = DimensionHandlerUtils.convertObjectToString(reader.read(i));

if (s != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need not exclude empty strings here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think we should. I replaced it with NullHandling.isNullOrEquivalent(s).

final Pair<List<Filter>, List<JoinableClause>> conversionResult = convertJoinsToFilters(
joinableClauses.getJoinableClauses(),
requiredColumns,
Ints.checkedCast(Math.min(filterRewriteConfig.getFilterRewriteMaxSize(), Integer.MAX_VALUE))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not here but should we cap the max size so that it can't result in an OOM?

Copy link
Contributor Author

@gianm gianm Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we should rely on the user setting this parameter "correctly" sort of like the subquery limit. I also think most people won't change it from the default, which is 10,000 and should be pretty safe. Unless the values are gigantic it's only going to be a few MB per query.

I thought a bit about measuring these limits in terms of bytes instead of rows, which has pros/cons:

  • Pro of bytes: less likely to be misconfigured & cause OOME, more likely to use memory efficiently & maximally
  • Con of bytes: harder for users to understand the limit. "10,000 rows" is easy to communicate & understand; "5MB" is harder because people won't be able to easily figure out if a particular data set fits in 5MB or not.

@gianm
Copy link
Contributor Author

gianm commented Apr 6, 2021

Looks good. Out of curiosity, given that we exclude null values, how are null values in the left table matched?

I was thinking that because it's an inner join, null values from the left table are supposed to be dropped anyway. (That's why I didn't allow this optimization to trigger for left joins.)

@abhishekagarwal87
Copy link
Contributor

Looks good. Out of curiosity, given that we exclude null values, how are null values in the left table matched?

I was thinking that because it's an inner join, null values from the left table are supposed to be dropped anyway. (That's why I didn't allow this optimization to trigger for left joins.)

Got it. Thanks for clarifying.

* @param additionalColumns additional columns to include. Each of these will be added to the returned set, unless it
* refers to a virtual column, in which case the virtual column inputs will be added instead.
*/
public static Set<String> computeRequiredColumns(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@gianm
Copy link
Contributor Author

gianm commented Apr 8, 2021

@abhishekagarwal87 any other thoughts?

@gianm gianm merged commit 202c78c into apache:master Apr 14, 2021
@gianm gianm deleted the query-inner-join-to-filter branch April 14, 2021 17:49
gianm added a commit to gianm/druid that referenced this pull request Apr 27, 2021
Normally, InDimFilters that come from JSON have HashSets for "values".
However, programmatically-generated filters (like the ones from apache#11068)
may use other set types. Some set types, like TreeSets with natural
ordering, will throw NPE on "contains(null)", which causes the
InDimFilter's ValueMatcher to throw NPE if it encounters a null value.

This patch adds code to detect if the values set can support
contains(null), and if not, wrap that in a null-checking lambda.

Also included:

- Remove unneeded NullHandling.needsEmptyToNull method.
- Update IndexedTableJoinable to generate a TreeSet that does not
  require lambda-wrapping. (This particular TreeSet is how I noticed
  the bug in the first place.)
gianm added a commit that referenced this pull request Apr 28, 2021
* InDimFilter: Fix NPE involving certain Set types.

Normally, InDimFilters that come from JSON have HashSets for "values".
However, programmatically-generated filters (like the ones from #11068)
may use other set types. Some set types, like TreeSets with natural
ordering, will throw NPE on "contains(null)", which causes the
InDimFilter's ValueMatcher to throw NPE if it encounters a null value.

This patch adds code to detect if the values set can support
contains(null), and if not, wrap that in a null-checking lambda.

Also included:

- Remove unneeded NullHandling.needsEmptyToNull method.
- Update IndexedTableJoinable to generate a TreeSet that does not
  require lambda-wrapping. (This particular TreeSet is how I noticed
  the bug in the first place.)

* Test fixes.

* Improve test coverage
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 22, 2021
* Update security overview with additional recommendations (apache#11016)

* updatee security overview with additional recommendations for improved security

* address first set of review questions

* Update docs/operations/security-overview.md

* Update docs/operations/security-overview.md

* apply changes from review

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update security-overview.md

fix additional comments & typos cc: @suneet-s, @jihoonsoon

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Enable rewriting certain inner joins as filters. (apache#11068)

* Enable rewriting certain inner joins as filters.

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
  duplicate values. (If they did, the inner join would not be guaranteed
  to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
  the join condition itself.

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

- Add Query getRequiredColumns() method to help us figure out whether
  the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
  figure out if the right-hand side of a join is being used by later
  join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
  retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
  LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
  JoinFilterRewriteConfig. The default is disabled.

* Test improvements.

* Test fixes.

* Avoid slow size() call.

* Remove invalid test.

* Fix style.

* Fix mistaken default.

* Small fixes.

* Fix logic error.

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
Co-authored-by: Suneet Saldanha <suneet@apache.org>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 22, 2021
* Update security overview with additional recommendations (apache#11016)

* updatee security overview with additional recommendations for improved security

* address first set of review questions

* Update docs/operations/security-overview.md

* Update docs/operations/security-overview.md

* apply changes from review

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update security-overview.md

fix additional comments & typos cc: @suneet-s, @jihoonsoon

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Enable rewriting certain inner joins as filters. (apache#11068)

* Enable rewriting certain inner joins as filters.

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
  duplicate values. (If they did, the inner join would not be guaranteed
  to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
  the join condition itself.

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

- Add Query getRequiredColumns() method to help us figure out whether
  the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
  figure out if the right-hand side of a join is being used by later
  join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
  retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
  LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
  JoinFilterRewriteConfig. The default is disabled.

* Test improvements.

* Test fixes.

* Avoid slow size() call.

* Remove invalid test.

* Fix style.

* Fix mistaken default.

* Small fixes.

* Fix logic error.

* Doc updates for union datasources. (apache#11103)

The main one is updating datasources.md to talk about SQL. (It still said
that table unions are not supported in SQL.) Also, this doc update adds
some clarifying details on limitations.

* [Security] Bump netty4.version from 4.1.48.Final to 4.1.63.Final (apache#11117)

* Vectorized versions of HllSketch aggregators. (apache#11115)

* Vectorized versions of HllSketch aggregators.

The patch uses the same "helper" approach as apache#10767 and apache#10304, and
extends the tests to run in both vectorized and non-vectorized modes.

Also includes some minor changes to the theta sketch vector aggregator:

- Cosmetic changes to make the hll and theta implementations look
  more similar.
- Extends the theta SQL tests to run in vectorized mode.

* Updates post-code-review.

* Fix javadoc.

* Web console: update dev dependencies (apache#11119)

* Update some dev dependencies, prettify, tslint-fix

* Sort tsconfig keys for easy comparison

* Set noImplicitThis

* Slightly more accurate types

* Bump Jest and related

* Bump react to latest on v16

* Bump node-sass, sass-loader for node14 support

* Remove node-sass-chokidar (unused)

* More unused dependencies

* Fix blueprint imports

* Webpack 5

* Update webpack config for 'process' usage

* Update playwright-chromium

* Emit esnext modules for tree shaking

* Enable source maps in development

* Dedupe

* Bump babel and things

* npm audit fix

* Add .editorconfig file to match prettier settings

* Update licenses (tslib is 0BSD as of 1.11.2)

microsoft/tslib#96

* Require node >= 10

* Use Node 10 to run e2e tests

* Use 'ws' transport mode for dev server (will be default in next version)

* Remove an 'any'

* No sourcemaps in prod

* Exclude .editorconfig from license checks

* Try nvm for setting node version

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
Co-authored-by: Suneet Saldanha <suneet@apache.org>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Co-authored-by: Sandeep <isandeep41@gmail.com>
Co-authored-by: John Gozde <john@gozde.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants