Memoize InDimFilter hashCode calculation by suneet-s · Pull Request #10316 · apache/druid

suneet-s · 2020-08-24T15:31:37Z

Description

InDimFilter can operate on a large set of values. Computing the hashCode for
this large set of values can be expensive.

The hashCode calculation is also memoized so that it is done only once per
object, further reducing its cost when the filters are used in
Sets (e.g. in an AndFilter).
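
As an illustration of the memoization described above, here is a minimal sketch. `MemoizedFilter` and its fields are hypothetical stand-ins and not the actual InDimFilter code; the sketch assumes the fields never change after construction, which is what makes caching the hash safe.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a filter over a large value set; not the real InDimFilter.
final class MemoizedFilter
{
  private final String dimension;
  private final Set<String> values;
  // 0 doubles as "not computed yet". If the real hash happens to be 0 it is
  // simply recomputed on every call, which is correct but not cached.
  private int cachedHashCode;

  MemoizedFilter(String dimension, Set<String> values)
  {
    this.dimension = dimension;
    // Defensive read-only copy: a cached hash is only valid if the values never change.
    this.values = Collections.unmodifiableSet(new HashSet<>(values));
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (!(o instanceof MemoizedFilter)) {
      return false;
    }
    MemoizedFilter that = (MemoizedFilter) o;
    return Objects.equals(dimension, that.dimension) && values.equals(that.values);
  }

  @Override
  public int hashCode()
  {
    int h = cachedHashCode;
    if (h == 0) {
      // The expensive part: walks every value in the set.
      h = Objects.hash(values, dimension);
      cachedHashCode = h;
    }
    return h;
  }
}
```

Two threads may both compute the hash the first time, but they compute the same value and an int write is atomic, so the race is benign.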

This flamegraph shows a query that spends ~10% of its time calculating the hashCode for an InDimFilter that has a large number of values.

This PR has:

- been self-reviewed.
   - using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- added integration tests.
- been tested in a test Druid cluster.

InDimFilter can operate on a large set of values. Computing the hashCode for this large set of values can be expensive. Instead, Druid can use the number of values in the filter to compute the hashCode. This should speed up the computation, with the side effect of more collisions. The equals method will still check every value in the list, so two filters operating on the same dimension with the same number of values but different values will still not be considered equal.

gianm · 2020-08-25T06:04:57Z

processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java

  public int hashCode()
  {
-    return Objects.hash(values, dimension, extractionFn, filterTuning);
+    return Objects.hash(values.size(), dimension, extractionFn, filterTuning);


Maybe use the size and the first few values?

It's easy to imagine situations where the extra collisions from only checking size are a problem, and it's tough to imagine situations where the perf impact of adding the first few values is going to be big. So it seems like a good idea.

Please also include a comment about the rationale for the nonstandard hashCode impl. It'd be good to link to this PR.
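
A rough sketch of the "size plus the first few values" idea follows. Everything here is hypothetical: the class name, the cap of 4 values, and the use of a sorted set. The sorted set is an assumption made so that "first few" means the same thing for any two equal sets; with a plain HashSet, iteration order is not guaranteed to match between equal sets, and the hashCode contract could be broken.

```java
import java.util.Iterator;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration; not the actual InDimFilter implementation.
final class BoundedHashFilter
{
  // Assumed cap on how many values feed the hash; the thread does not fix a number.
  private static final int MAX_HASHED_VALUES = 4;

  private final String dimension;
  private final SortedSet<String> values;

  BoundedHashFilter(String dimension, SortedSet<String> values)
  {
    this.dimension = dimension;
    this.values = new TreeSet<>(values);
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (!(o instanceof BoundedHashFilter)) {
      return false;
    }
    BoundedHashFilter that = (BoundedHashFilter) o;
    // equals still compares every value, so hash collisions never cause false equality.
    return Objects.equals(dimension, that.dimension) && values.equals(that.values);
  }

  @Override
  public int hashCode()
  {
    // Nonstandard on purpose: only the size and a bounded prefix of the values are hashed.
    int h = 31 * Objects.hashCode(dimension) + values.size();
    Iterator<String> it = values.iterator();
    for (int i = 0; i < MAX_HASHED_VALUES && it.hasNext(); i++) {
      h = 31 * h + Objects.hashCode(it.next());
    }
    return h;
  }
}
```

Two filters that differ only in values beyond the hashed prefix still collide, which is the trade-off being discussed.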

It's interesting, though, that values itself is a HashSet being passed to InDimFilter, which means the hash code has already been evaluated for every element in the set. But that penalty for constructing values doesn't show up in the graph. Is the full flame graph available to look further?
I can see one place where multiple InDimFilters are created with the same values. Maybe that's the part responsible for the perf penalty. If there is a Set type that remembers its hashCode, using that type for values could be more beneficial.

@abhishekagarwal87 Good point... I'm not sure why the construction time doesn't show up. I'll check if there is a set that memoizes its hashCode as the set is being constructed.

ImmutableSet computes its hashcode as it is built and then caches it.
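
To make the idea of a set that remembers its hashCode concrete without pulling in Guava, here is a small JDK-only sketch. `HashCachingSet` is a hypothetical class written for illustration; it is not how ImmutableSet is implemented, it only shows the same observable behavior (immutable contents, hash computed once).

```java
import java.util.AbstractSet;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Hypothetical read-only set that computes its hashCode once, at construction time.
final class HashCachingSet<E> extends AbstractSet<E>
{
  private final Set<E> delegate;
  private final int hashCode;

  HashCachingSet(Collection<? extends E> source)
  {
    // Copy into a read-only set so the cached hash can never go stale.
    this.delegate = Collections.unmodifiableSet(new HashSet<>(source));
    this.hashCode = delegate.hashCode();
  }

  @Override
  public Iterator<E> iterator()
  {
    return delegate.iterator();
  }

  @Override
  public int size()
  {
    return delegate.size();
  }

  @Override
  public boolean contains(Object o)
  {
    return delegate.contains(o);
  }

  @Override
  public int hashCode()
  {
    // O(1) no matter how many elements the set holds.
    return hashCode;
  }
}
```

Any object that hashes such a set, for example via `Objects.hash(values, dimension)`, pays the per-element cost only once, when the set is built.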

abhishekagarwal87 · 2020-08-27T06:46:42Z

diff --git a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
index 67b1ee0b76..7da31f3811 100644
--- a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
+++ b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
@@ -435,6 +435,11 @@ public class JoinFilterAnalyzer
           );
         }
 
+        // Wrap filter values in immutable set so that classes consuming this set do not compute the hash code
+        // again and again. We are creating multiple InDimFilter filters down for the same set of filter values.
+        // Calculating hashCode will be less expensive for these InDimFilter as hashCode for filter values is pre-computed
+        newFilterValues = ImmutableSet.copyOf(newFilterValues);
+
         for (String correlatedBaseColumn : correlationAnalysis.getBaseColumns()) {
           Filter rewrittenFilter = new InDimFilter(
               correlatedBaseColumn,

This is what I had in mind.

abhishekagarwal87 · 2020-08-27T06:50:44Z

There are two other places where this InDimFilter is being created, and an ImmutableSet can be used there too. As Gian pointed out, the ImmutableSet caches the hashCode while it's building the set.

suneet-s · 2020-08-27T14:54:39Z

I initially didn't want to use an ImmutableSet because ImmutableSet.copyOf(...) would have to traverse through the entire set, and there may be cases where an InDimFilter doesn't need to traverse the entire set. However, this comment made me think that maybe the best way to do this is to have the correlatedValuesMap use an ImmutableSet.Builder instead. This way the hashCode is memoized as the set is being constructed! I might leave the other code paths as is so that this change only really impacts join clauses. What do you think @abhishekagarwal87 ?

suneet-s · 2020-08-27T15:25:47Z

I changed my mind... decided it's better to spend a little more time thinking about this

IndexedTableJoinable needs to know the number of unique values while constructing the Set so that we can limit the number of values that can be pushed down. ImmutableSet.Builder doesn't know the number of unique values until the Set is built.

IntList rowIndex = index.find(searchColumnValue);
for (int i = 0; i < rowIndex.size(); i++) {
  int rowNum = rowIndex.getInt(i);
  String correlatedDimVal = Objects.toString(reader.read(rowNum), null);
  correlatedValuesBuilder.add(correlatedDimVal);

  if (correlatedValuesBuilder.size() > maxCorrelationSetSize) {
    return Optional.empty();
  }
}
return Optional.of(correlatedValuesBuilder.build());
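
One possible way around the limitation shown in the excerpt, sketched here purely as an assumption and not something the PR implemented, is to keep a plain HashSet for the unique count and maintain a running hash next to it. `BoundedValueCollector` and its methods are hypothetical names. The running sum relies on the documented `Set.hashCode` contract: the hash of a set is the sum of its elements' hash codes, with null counted as 0.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical collector: tracks distinct values, their count, and their combined hash.
final class BoundedValueCollector
{
  private final Set<String> values = new HashSet<>();
  private final int maxSize;
  private int runningHash;

  BoundedValueCollector(int maxSize)
  {
    this.maxSize = maxSize;
  }

  /**
   * Adds a value and returns false once the number of distinct values exceeds maxSize.
   */
  boolean add(String value)
  {
    if (values.add(value)) {
      // Only first-time insertions contribute, mirroring Set.hashCode semantics.
      runningHash += (value == null ? 0 : value.hashCode());
    }
    return values.size() <= maxSize;
  }

  int size()
  {
    return values.size();
  }

  /** Equal to values().hashCode(), but available without another pass over the set. */
  int hashOfValues()
  {
    return runningHash;
  }

  Set<String> values()
  {
    return Collections.unmodifiableSet(values);
  }
}
```

The caller can stop as soon as `add` returns false, which keeps the early-exit behavior of the loop above while the hash is still accumulated during construction.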

suneet-s · 2020-08-29T01:23:36Z

I'm not happy with this approach. Going to think about this for a little more time and I'll re-open when I think of a better approach.

suneet-s added the Performance and Area - Querying labels Aug 24, 2020

fix tests

77feb63

gianm reviewed Aug 25, 2020

suneet-s added 5 commits August 25, 2020 14:12

Merge remote-tracking branch 'upstream/master' into flame2

c87666d

Memoize hashCode and limit number of values

c11b92f

Merge branch 'master' into flame2

f9fc641

Merge remote-tracking branch 'upstream/master' into flame2

05c7dc0

do not limit values in hashCode

db166e4

suneet-s changed the title ~~Optimize InDimFilter hashCode calculation~~ Memoize InDimFilter hashCode calculation Aug 27, 2020

maytasm approved these changes Aug 28, 2020

suneet-s added 2 commits August 28, 2020 17:35

Revert "do not limit values in hashCode"

cd0ab1f

This reverts commit db166e4.

Merge remote-tracking branch 'upstream/master' into flame2

44b367a

suneet-s closed this Aug 29, 2020

abhishekagarwal87 mentioned this pull request Jan 18, 2021

Retain order of AND, OR filter children. #10758

Merged

suneet-s wants to merge 9 commits into apache:master from suneet-s:flame2
