Memoize InDimFilter hashCode calculation #10316
suneet-s wants to merge 9 commits into apache:master
Conversation
InDimFilter can operate on a large set of values. Computing the hashCode for this large set of values can be expensive. Instead, Druid can use the number of values in the filter to compute the hashCode. This should speed up the computation, with the side effect of more collisions. The equals method still checks every value in the list, so two filters operating on the same dimension with the same filter shape but different values will not be considered equal, even when their hash codes collide.
public int hashCode()
{
-  return Objects.hash(values, dimension, extractionFn, filterTuning);
+  return Objects.hash(values.size(), dimension, extractionFn, filterTuning);
}
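As a rough illustration (hypothetical class, not Druid's actual InDimFilter), the size-based hashCode preserves the equals/hashCode contract, because equal sets always have equal sizes; the cost is that same-size sets collide and fall through to the full equals check:

```java
import java.util.Objects;
import java.util.Set;

public class SizeHashFilter
{
  private final Set<String> values;
  private final String dimension;

  public SizeHashFilter(Set<String> values, String dimension)
  {
    this.values = values;
    this.dimension = dimension;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (!(o instanceof SizeHashFilter)) {
      return false;
    }
    SizeHashFilter that = (SizeHashFilter) o;
    // equals still compares every value, so hash collisions never cause false equality
    return values.equals(that.values) && dimension.equals(that.dimension);
  }

  @Override
  public int hashCode()
  {
    // values.size() is O(1); hashing every element of a large set is O(n)
    return Objects.hash(values.size(), dimension);
  }

  public static void main(String[] args)
  {
    SizeHashFilter a = new SizeHashFilter(Set.of("x", "y"), "dim");
    SizeHashFilter b = new SizeHashFilter(Set.of("x", "y"), "dim");
    SizeHashFilter c = new SizeHashFilter(Set.of("p", "q"), "dim");
    System.out.println(a.equals(b));                  // true: same values
    System.out.println(a.hashCode() == b.hashCode()); // true: contract holds
    System.out.println(a.hashCode() == c.hashCode()); // true: same-size collision
    System.out.println(a.equals(c));                  // false: equals resolves it
  }
}
```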
Maybe use the size and the first few values?
It's easy to imagine situations where the extra collisions from only checking size are a problem, and it's tough to imagine situations where the perf impact of adding the first few values is going to be big. So it seems like a good idea.
Please also include a comment about the rationale for the nonstandard hashCode impl. It'd be good to link to this PR.
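One way to sketch this suggestion (a hypothetical helper, not the code this PR landed on) is to mix the size with the first few values. To keep the equals/hashCode contract, the set must iterate in a deterministic order for equal contents, e.g. a SortedSet; a plain HashSet's iteration order can differ between two equal sets.

```java
import java.util.Iterator;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeSet;

public class FirstFewHash
{
  // Hash the size plus up to the first three values. A SortedSet guarantees
  // that equal sets iterate identically, so equal filters still hash equal.
  static int hashFirstFew(SortedSet<String> values, String dimension)
  {
    int result = Objects.hash(values.size(), dimension);
    Iterator<String> it = values.iterator();
    for (int i = 0; i < 3 && it.hasNext(); i++) {
      result = 31 * result + it.next().hashCode();
    }
    return result;
  }

  public static void main(String[] args)
  {
    SortedSet<String> a = new TreeSet<>(java.util.List.of("a", "b", "c", "zzz"));
    SortedSet<String> b = new TreeSet<>(java.util.List.of("zzz", "c", "b", "a"));
    SortedSet<String> d = new TreeSet<>(java.util.List.of("a", "b", "x", "zzz"));
    // equal contents hash equal regardless of insertion order
    System.out.println(hashFirstFew(a, "dim") == hashFirstFew(b, "dim")); // true
    // a difference within the first few values changes the hash
    System.out.println(hashFirstFew(a, "dim") == hashFirstFew(d, "dim")); // false
  }
}
```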
It's interesting, though, that values itself is a HashSet being passed to InDimFilter, which would mean the hash code is evaluated for all the elements in the set. But that penalty for constructing values doesn't show up in the graph. Is the full flame graph available to look further?
I can see one place where multiple InDimFilters are created with the same values. Maybe that's the part responsible for the perf penalty. If there is a Set type that remembers its hashCode, using such a type for values could be more beneficial.
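A sketch of such a type (hypothetical, built on plain java.util rather than anything in Druid or Guava): a thin Set wrapper that computes its hashCode once on first use and remembers it. Caching is only safe as long as the delegate set is never mutated afterwards.

```java
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.Set;

// Hypothetical hash-memoizing Set: delegates everything, caches hashCode.
public class HashMemoizingSet<T> extends AbstractSet<T>
{
  private final Set<T> delegate;
  private int cachedHash;
  private boolean hashComputed; // explicit flag, since 0 is a legal Set hashCode
  int hashComputations;         // visible for the demo below

  public HashMemoizingSet(Set<T> delegate)
  {
    this.delegate = delegate;
  }

  @Override
  public Iterator<T> iterator()
  {
    return delegate.iterator();
  }

  @Override
  public int size()
  {
    return delegate.size();
  }

  @Override
  public int hashCode()
  {
    if (!hashComputed) {
      cachedHash = delegate.hashCode(); // the O(n) walk happens only once
      hashComputed = true;
      hashComputations++;
    }
    return cachedHash;
  }

  public static void main(String[] args)
  {
    HashMemoizingSet<String> s = new HashMemoizingSet<>(Set.of("a", "b", "c"));
    int h1 = s.hashCode();
    int h2 = s.hashCode();
    System.out.println(h1 == h2);                               // true
    System.out.println(s.hashComputations);                     // 1: computed once
    System.out.println(h1 == Set.of("a", "b", "c").hashCode()); // true: same Set contract
  }
}
```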
@abhishekagarwal87 Good point... I'm not sure why the construction time doesn't show up. I'll check if there is a set that memoizes its hashCode as the set is being constructed.
ImmutableSet computes its hashcode as it is built and then caches it.
This is what I had in mind.
There are two other places where this InDimFilter is being created, and there too an ImmutableSet can be used. As Gian pointed out, ImmutableSet caches the hashCode while it's building the set.
I initially didn't want to use an ImmutableSet because IndexedTableJoinable needs to know the number of uniques while constructing the Set, so that we can limit the number of values that can be pushed down, and the ImmutableSet.Builder doesn't know the number of uniques until the Set is constructed. I changed my mind... decided it's better to spend a little more time thinking about this.
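The builder constraint can be sketched like this (a hypothetical stand-in using plain java.util, not Guava's actual ImmutableSet.Builder): elements are buffered as they are added and duplicates are only collapsed at build time, so the count of uniques isn't available mid-construction.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical builder illustrating the constraint described above.
public class BufferingSetBuilder
{
  private final List<String> pending = new ArrayList<>();

  public BufferingSetBuilder add(String value)
  {
    pending.add(value); // no dedup here; the number of uniques is unknown
    return this;
  }

  public int pendingCount()
  {
    return pending.size();
  }

  public Set<String> build()
  {
    return new LinkedHashSet<>(pending); // dedup happens only now
  }

  public static void main(String[] args)
  {
    BufferingSetBuilder builder = new BufferingSetBuilder()
        .add("a").add("b").add("a").add("c");
    System.out.println(builder.pendingCount()); // 4: includes the duplicate
    System.out.println(builder.build().size()); // 3: uniques known only after build()
  }
}
```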
I'm not happy with this approach. Going to think about this for a little more time, and I'll re-open when I think of a better approach.
Description
InDimFilter can operate on a large set of values. Computing the hashCode for
this large set of values can be expensive.
The hashCode calculation is also memoized so that it's only done once per
object, further reducing the cost of this calculation when the filters are used in
Sets (e.g. in an AndFilter).
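A minimal sketch of that memoization (hypothetical class; the real InDimFilter has more fields): the hash is computed lazily on the first call and cached on the object, so repeated lookups in a HashSet pay the cost only once.

```java
import java.util.Objects;
import java.util.Set;

public class MemoizedHashFilter
{
  private final Set<String> values;
  private final String dimension;
  // null means "not computed yet"; safe to cache because the fields are immutable
  private Integer cachedHashCode;
  int hashComputations; // visible for the demo below

  public MemoizedHashFilter(Set<String> values, String dimension)
  {
    this.values = values;
    this.dimension = dimension;
  }

  @Override
  public int hashCode()
  {
    if (cachedHashCode == null) {
      hashComputations++;
      cachedHashCode = Objects.hash(values.size(), dimension);
    }
    return cachedHashCode;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (!(o instanceof MemoizedHashFilter)) {
      return false;
    }
    MemoizedHashFilter that = (MemoizedHashFilter) o;
    return values.equals(that.values) && dimension.equals(that.dimension);
  }

  public static void main(String[] args)
  {
    MemoizedHashFilter f = new MemoizedHashFilter(Set.of("a", "b", "c"), "dim");
    Set<Object> andFilterChildren = new java.util.HashSet<>();
    andFilterChildren.add(f);      // hashCode computed here
    andFilterChildren.contains(f); // reuses the cached value
    System.out.println(f.hashCode() == f.hashCode()); // true
    System.out.println(f.hashComputations);           // 1: computed exactly once
  }
}
```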
This flamegraph shows a query that spends ~10% of its time calculating the hashCode for the InDimFilter, which has a large number of values.
This PR has: