Enable rewriting certain inner joins as filters. #11068

gianm · 2021-04-05T19:15:19Z

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
  duplicate values. (If they did, the inner join would not be guaranteed
  to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
  the join condition itself.
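
To illustrate why these requirements make the rewrite safe, here is a minimal standalone sketch. It is not Druid code; the class and method names are hypothetical. It compares a naive inner equi-join that projects only left-hand values against an "in" filter built from the right-hand keys. With unique right-hand keys the two agree; with duplicates the join emits extra rows, so the filter is no longer equivalent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names exist in Druid.
class JoinToFilterSketch {
    // Naive inner equi-join on a single key, returning only left-hand values
    // (no right-hand columns are used beyond the join condition).
    static List<String> innerJoin(List<String> left, List<String> rightKeys) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : rightKeys) {
                // Null never equals anything in an inner equi-join.
                if (l != null && l.equals(r)) {
                    out.add(l);
                }
            }
        }
        return out;
    }

    // The same result expressed as an "in" filter over the left-hand side.
    static List<String> inFilter(List<String> left, Set<String> keys) {
        return left.stream()
                   .filter(l -> l != null && keys.contains(l))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b", null, "c", "a");

        List<String> uniqueRight = Arrays.asList("a", "c");
        System.out.println(
            innerJoin(left, uniqueRight).equals(inFilter(left, new HashSet<>(uniqueRight))));

        List<String> duplicateRight = Arrays.asList("a", "a", "c");
        System.out.println(
            innerJoin(left, duplicateRight).equals(inFilter(left, new HashSet<>(duplicateRight))));
    }
}
```

The first comparison holds and the second does not, which is the reason for the uniqueness requirement.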

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

- Add Query getRequiredColumns() method to help us figure out whether
  the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
  figure out if the right-hand side of a join is being used by later
  join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
  retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
  LookupJoinable in its efforts to implement the new Joinable method.
- Add enableRewriteJoinToFilter feature flag to
  JoinFilterRewriteConfig. The default is disabled.
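
As a rough sketch of what a method like getNonNullColumnValuesIfAllUnique has to do, the standalone example below collects the non-null values of a column and gives up as soon as it sees a duplicate or exceeds a size cap. The class name, method name, and signature are assumptions made for illustration; they are not the actual Joinable API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration only; not the Druid Joinable interface.
class UniqueValuesSketch {
    // Returns the non-null values of the column, or empty if any value repeats
    // or if there are more than maxNumValues distinct values.
    static Optional<Set<String>> nonNullValuesIfAllUnique(List<String> column, int maxNumValues) {
        Set<String> values = new HashSet<>();
        for (String v : column) {
            if (v == null) {
                continue; // Nulls never match in an inner equi-join, so skip them.
            }
            if (!values.add(v)) {
                return Optional.empty(); // Duplicate found: rewrite is not safe.
            }
            if (values.size() > maxNumValues) {
                return Optional.empty(); // Too many values for an "in" filter.
            }
        }
        return Optional.of(values);
    }

    public static void main(String[] args) {
        System.out.println(nonNullValuesIfAllUnique(Arrays.asList("x", null, "y"), 10).isPresent());
        System.out.println(nonNullValuesIfAllUnique(Arrays.asList("x", "x"), 10).isPresent());
    }
}
```

Returning an empty Optional rather than a partial set lets the caller fall back to a regular join whenever the rewrite cannot be proven safe.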

Testing strategy:

- Add join-to-filter conversion tests to JoinableFactoryWrapperTest.
- Add getRequiredColumns tests to individual query engines.
- Add getNonNullColumnValuesIfAllUnique tests to LookupJoinable and IndexedTableJoinable.
- Extend BaseCalciteQueryTest's QueryContextForJoinProvider to also
  provide query contexts that enable this rewrite.
- Add some new tests to CalciteQueryTest that are designed to exercise
  this rewrite. (Some existing tests already exercised it, too.)


abhishekagarwal87 · 2021-04-06T07:26:15Z

processing/src/main/java/org/apache/druid/query/QueryContexts.java

+    return parseBoolean(
+        query,
+        REWRITE_JOIN_TO_FILTER_ENABLE_KEY,
+        DEFAULT_ENABLE_JOIN_FILTER_REWRITE_VALUE_COLUMN_FILTERS


Suggested change

-        DEFAULT_ENABLE_JOIN_FILTER_REWRITE_VALUE_COLUMN_FILTERS
+        DEFAULT_ENABLE_REWRITE_JOIN_TO_FILTER
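
The constant matters because the default decides the behavior whenever the query context omits the key. A minimal sketch of that lookup follows. The context-key string, the map-based context, and this parseBoolean helper are assumptions for illustration; they are not the actual QueryContexts implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Druid QueryContexts class.
class ContextFlagSketch {
    // Assumed key name, taken from the flag name in the PR description.
    static final String REWRITE_JOIN_TO_FILTER_ENABLE_KEY = "enableRewriteJoinToFilter";
    // The PR description says the feature is disabled by default.
    static final boolean DEFAULT_ENABLE_REWRITE_JOIN_TO_FILTER = false;

    // Reads a boolean from the context, accepting Boolean or String values,
    // and falls back to the supplied default when the key is absent.
    static boolean parseBoolean(Map<String, Object> context, String key, boolean defaultValue) {
        Object value = context.get(key);
        if (value == null) {
            return defaultValue;
        }
        if (value instanceof Boolean) {
            return (Boolean) value;
        }
        if (value instanceof String) {
            return Boolean.parseBoolean((String) value);
        }
        throw new IllegalArgumentException("Expected boolean for key: " + key);
    }

    public static void main(String[] args) {
        Map<String, Object> context = new HashMap<>();
        System.out.println(parseBoolean(
            context, REWRITE_JOIN_TO_FILTER_ENABLE_KEY, DEFAULT_ENABLE_REWRITE_JOIN_TO_FILTER));
        context.put(REWRITE_JOIN_TO_FILTER_ENABLE_KEY, true);
        System.out.println(parseBoolean(
            context, REWRITE_JOIN_TO_FILTER_ENABLE_KEY, DEFAULT_ENABLE_REWRITE_JOIN_TO_FILTER));
    }
}
```

Passing an unrelated default constant here would silently tie this feature's default to another feature's setting.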

abhishekagarwal87

Looks good. Out of curiosity, given that we exclude null values, how are null values in the left table matched?

abhishekagarwal87 · 2021-04-06T07:27:51Z

processing/src/main/java/org/apache/druid/query/lookup/LookupExtractor.java

+  /**
+   * Returns a Set of all keys in this lookup extractor. The returned Set will not change.
+   *
+   * @throws UnsupportedOperationException if {@link #canIterate()} returns false.


Suggested change

-   * @throws UnsupportedOperationException if {@link #canIterate()} returns false.
+   * @throws UnsupportedOperationException if {@link #canGetKeySet()} returns false.

abhishekagarwal87 · 2021-04-06T07:46:57Z

processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinable.java

+      for (int i = 0; i < table.numRows(); i++) {
+        final String s = DimensionHandlerUtils.convertObjectToString(reader.read(i));
+
+        if (s != null) {


Do we not need to exclude empty strings here as well?

Ah, I think we should. I replaced it with NullHandling.isNullOrEquivalent(s).
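
For readers unfamiliar with the distinction, the sketch below shows the kind of check being discussed: a plain null test versus one that also treats the empty string as null. It assumes the helper behaves this way only when empty strings are configured as equivalent to null; the class, method, and boolean parameter are hypothetical stand-ins, not the NullHandling API.

```java
// Hypothetical illustration only; not the Druid NullHandling class.
class NullEquivalenceSketch {
    // emptyStringIsNull models a configuration in which "" and null are
    // considered the same value (an assumption for this sketch).
    static boolean isNullOrEquivalent(String s, boolean emptyStringIsNull) {
        return s == null || (emptyStringIsNull && s.isEmpty());
    }

    public static void main(String[] args) {
        // A plain "s != null" check would keep "" in the value set in both modes.
        System.out.println(isNullOrEquivalent("", true));
        System.out.println(isNullOrEquivalent("", false));
        System.out.println(isNullOrEquivalent(null, false));
    }
}
```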

abhishekagarwal87 · 2021-04-06T07:57:54Z

processing/src/main/java/org/apache/druid/segment/join/JoinableFactoryWrapper.java

+              final Pair<List<Filter>, List<JoinableClause>> conversionResult = convertJoinsToFilters(
+                  joinableClauses.getJoinableClauses(),
+                  requiredColumns,
+                  Ints.checkedCast(Math.min(filterRewriteConfig.getFilterRewriteMaxSize(), Integer.MAX_VALUE))


Maybe not here, but should we cap the max size so that it can't result in an OOM?

I was thinking we should rely on the user setting this parameter "correctly" sort of like the subquery limit. I also think most people won't change it from the default, which is 10,000 and should be pretty safe. Unless the values are gigantic it's only going to be a few MB per query.

I thought a bit about measuring these limits in terms of bytes instead of rows, which has pros/cons:

- Pro of bytes: less likely to be misconfigured & cause OOME, more likely to use memory efficiently & maximally
- Con of bytes: harder for users to understand the limit. "10,000 rows" is easy to communicate & understand; "5MB" is harder because people won't be able to easily figure out if a particular data set fits in 5MB or not.

gianm · 2021-04-06T16:37:14Z

> Looks good. Out of curiosity, given that we exclude null values, how are null values in the left table matched?

I was thinking that because it's an inner join, null values from the left table are supposed to be dropped anyway. (That's why I didn't allow this optimization to trigger for left joins.)

abhishekagarwal87 · 2021-04-06T17:44:35Z

> > Looks good. Out of curiosity, given that we exclude null values, how are null values in the left table matched?
>
> I was thinking that because it's an inner join, null values from the left table are supposed to be dropped anyway. (That's why I didn't allow this optimization to trigger for left joins.)

Got it. Thanks for clarifying.

clintropolis · 2021-04-07T02:58:23Z

processing/src/main/java/org/apache/druid/query/Queries.java

+   * @param additionalColumns additional columns to include. Each of these will be added to the returned set, unless it
+   *                          refers to a virtual column, in which case the virtual column inputs will be added instead.
+   */
+  public static Set<String> computeRequiredColumns(


gianm · 2021-04-08T17:20:41Z

@abhishekagarwal87 any other thoughts?

Normally, InDimFilters that come from JSON have HashSets for "values". However, programmatically-generated filters (like the ones from apache#11068) may use other set types. Some set types, like TreeSets with natural ordering, will throw NPE on "contains(null)", which causes the InDimFilter's ValueMatcher to throw NPE if it encounters a null value.

This patch adds code to detect if the values set can support contains(null), and if not, wrap that in a null-checking lambda.

Also included:

- Remove unneeded NullHandling.needsEmptyToNull method.
- Update IndexedTableJoinable to generate a TreeSet that does not require lambda-wrapping. (This particular TreeSet is how I noticed the bug in the first place.)
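
The failure mode and both remedies can be reproduced with plain java.util collections. The sketch below is not the InDimFilter code; the class and method names are hypothetical. It probes whether a set tolerates contains(null), wraps it in a null-checking predicate if it does not, and also shows a TreeSet built with a null-tolerant comparator that needs no wrapping.

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical illustration only; not the Druid InDimFilter class.
class NullSafeContainsSketch {
    // Returns a membership test that never throws on null input.
    static Predicate<String> containsPredicate(Set<String> values) {
        try {
            values.contains(null); // Probe: naturally ordered TreeSets throw NPE here.
            return values::contains;
        } catch (NullPointerException e) {
            return v -> v != null && values.contains(v);
        }
    }

    public static void main(String[] args) {
        Set<String> natural = new TreeSet<>();
        natural.add("a");
        Predicate<String> safe = containsPredicate(natural);
        System.out.println(safe.test(null));
        System.out.println(safe.test("a"));

        // Alternative: a TreeSet whose comparator accepts null needs no wrapper.
        Set<String> nullTolerant =
            new TreeSet<>(Comparator.nullsFirst(Comparator.<String>naturalOrder()));
        nullTolerant.add("a");
        System.out.println(nullTolerant.contains(null));
    }
}
```

Probing once when the predicate is built keeps the per-row check cheap for sets, such as HashSet, that already accept null.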


gianm added Performance Area - Querying labels Apr 5, 2021

gianm added 4 commits April 5, 2021 15:59

Test improvements.

64508f3

Test fixes.

c7710b1

Avoid slow size() call.

620129a

Remove invalid test.

62f88b7

abhishekagarwal87 reviewed Apr 6, 2021


gianm added 2 commits April 6, 2021 00:52

Fix style.

32ea480

Fix mistaken default.

c4feee8

abhishekagarwal87 reviewed Apr 6, 2021


gianm added 2 commits April 6, 2021 11:20

Small fixes.

c262a24

Fix logic error.

e7c06bf

clintropolis approved these changes Apr 7, 2021


abhishekagarwal87 approved these changes Apr 14, 2021


gianm merged commit 202c78c into apache:master Apr 14, 2021

gianm deleted the query-inner-join-to-filter branch April 14, 2021 17:49

gianm mentioned this pull request Apr 27, 2021

InDimFilter: Fix NPE involving certain Set types. #11169

Merged

clintropolis added this to the 0.22.0 milestone Aug 12, 2021

clintropolis mentioned this pull request Sep 3, 2021

[Draft] 0.22.0 Release Notes #11657

Closed
