Skip to content

Memoize InDimFilter hashCode calculation#10316

Closed
suneet-s wants to merge 9 commits intoapache:masterfrom
suneet-s:flame2
Closed

Memoize InDimFilter hashCode calculation#10316
suneet-s wants to merge 9 commits intoapache:masterfrom
suneet-s:flame2

Conversation

@suneet-s
Copy link
Contributor

@suneet-s suneet-s commented Aug 24, 2020

Description

InDimFilter can operate on a large set of values. Computing the hashCode for
this large set of values can be expensive.

The hashCode calculation is also memoized so that it's only done once per
object further reducing the cost of this calculation when the filters are used in
Sets (eg. in an AndFilter).

Screen Shot 2020-08-24 at 8 29 54 AM

This flamegraph shows a query that spends ~10% of it's time calculating the hashCode for the InDimFilter which has a large number of values


This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

InDimFilter can operate on a large set of values. Computing the hashCode for
this large set of values can be expensive. Instead of this, Druid can use the
number of values in the filter to compute the hashCode. This should speed up
the computation with the side-effect of higher collisions.

The equals method will still check every value in the list, so 2 filters
operating on the same dimension with the same filter shape and values, will
not be considered equal.
public int hashCode()
{
return Objects.hash(values, dimension, extractionFn, filterTuning);
return Objects.hash(values.size(), dimension, extractionFn, filterTuning);
Copy link
Contributor

@gianm gianm Aug 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use the size and the first few values?

It's easy to imagine situations where the extra collisions from only checking size are a problem, and it's tough to imagine situations where the perf impact of adding the first few values is going to be big. So it seems like a good idea.

Please also include a comment about the rationale for the nonstandard hashCode impl. It'd be good to link to this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's interesting though that values itself is a HashSet being passed to InDimFilter which would mean hash code is evaluated for all the elements in the set. But that penalty for constructing values doesn't show up in the graph. is the full flame graph available to look further?
I can see in one place where multiple InDimFilter are created with the same values. Maybe that's the part responsible for perf penalty. If there is a Set type that remembers its hashCode, using such type for values could be more beneficial.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhishekagarwal87 Good point... I'm not sure why the construction time doesn't show up. I'll check if there is a set that memoizes it's hashcode as the set is being constructed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImmutableSet computes its hashcode as it is built and then caches it.

@suneet-s suneet-s changed the title Optimize InDimFilter hashCode calculation Memoize InDimFilter hashCode calculation Aug 27, 2020
@abhishekagarwal87
Copy link
Contributor

diff --git a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
index 67b1ee0b76..7da31f3811 100644
--- a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
+++ b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
@@ -435,6 +435,11 @@ public class JoinFilterAnalyzer
           );
         }
 
+        // Wrap filter values in immutable set so that classes consuming this set do not compute the hash code
+        // again and again. We are creating multiple InDimFilter filters down for the same set of filter values.
+        // Calculating hashCode will be less expensive for these InDimFilter as hashCode for filter values is pre-computed
+        newFilterValues = ImmutableSet.copyOf(newFilterValues);
+
         for (String correlatedBaseColumn : correlationAnalysis.getBaseColumns()) {
           Filter rewrittenFilter = new InDimFilter(
               correlatedBaseColumn,

This is what I had in mind.

@abhishekagarwal87
Copy link
Contributor

there are two other places where this InDimFilter is being created and there too an ImmutableSet can be used. As Gian pointed out, the ImmutableSet caches the hashCode while it's building the set.

@suneet-s
Copy link
Contributor Author

diff --git a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
index 67b1ee0b76..7da31f3811 100644
--- a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
+++ b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
@@ -435,6 +435,11 @@ public class JoinFilterAnalyzer
           );
         }
 
+        // Wrap filter values in immutable set so that classes consuming this set do not compute the hash code
+        // again and again. We are creating multiple InDimFilter filters down for the same set of filter values.
+        // Calculating hashCode will be less expensive for these InDimFilter as hashCode for filter values is pre-computed
+        newFilterValues = ImmutableSet.copyOf(newFilterValues);
+
         for (String correlatedBaseColumn : correlationAnalysis.getBaseColumns()) {
           Filter rewrittenFilter = new InDimFilter(
               correlatedBaseColumn,

This is what I had in mind.

I initially didn't want to use an ImmutableSet because ImmutableSet.copyOf(...) would have to traverse through the entire set, and there may be cases where an InDimFilter doesn't need to traverse the entire set. However, this comment made me think that maybe the best way to do this is to have the correlatedValuesMap use an ImmutableSet.Builder instead. This way the hashCode is memoized as the set is being constructed! I might leave the other code paths as is so that this change only really impacts join clauses. What do you think @abhishekagarwal87 ?

@suneet-s
Copy link
Contributor Author

diff --git a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
index 67b1ee0b76..7da31f3811 100644
--- a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
+++ b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
@@ -435,6 +435,11 @@ public class JoinFilterAnalyzer
           );
         }
 
+        // Wrap filter values in immutable set so that classes consuming this set do not compute the hash code
+        // again and again. We are creating multiple InDimFilter filters down for the same set of filter values.
+        // Calculating hashCode will be less expensive for these InDimFilter as hashCode for filter values is pre-computed
+        newFilterValues = ImmutableSet.copyOf(newFilterValues);
+
         for (String correlatedBaseColumn : correlationAnalysis.getBaseColumns()) {
           Filter rewrittenFilter = new InDimFilter(
               correlatedBaseColumn,

This is what I had in mind.

I initially didn't want to use an ImmutableSet because ImmutableSet.copyOf(...) would have to traverse through the entire set, and there may be cases where an InDimFilter doesn't need to traverse the entire set. However, this comment made me think that maybe the best way to do this is to have the correlatedValuesMap use an ImmutableSet.Builder instead. This way the hashCode is memoized as the set is being constructed! I might leave the other code paths as is so that this change only really impacts join clauses. What do you think @abhishekagarwal87 ?

I changed my mind... decided it's better to spend a little more time thinking about this

IndexedTableJoinable needs to know the number of uniques while constructing the Set so that we can limit the number of values that can be pushed down. The ImmutableSetBuilder doesn't know the number of uniques till the time the Set is constructed

IntList rowIndex = index.find(searchColumnValue);
        for (int i = 0; i < rowIndex.size(); i++) {
          int rowNum = rowIndex.getInt(i);
          String correlatedDimVal = Objects.toString(reader.read(rowNum), null);
          correlatedValuesBuilder.add(correlatedDimVal);

          if (correlatedValuesBuilder.size() > maxCorrelationSetSize) {
            return Optional.empty();
          }
        }
        return Optional.of(correlatedValuesBuilder.build());

@suneet-s
Copy link
Contributor Author

I'm not happy with this approach. Going to think about this for a little more time and I'll re-open when I think of a better approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants