Optimize dictionary lookup for IN clause by Jackie-Jiang · Pull Request #8891 · apache/pinot

Jackie-Jiang · 2022-06-15T01:11:55Z

Cache the parsed values in the IN/NOT_IN predicate to avoid re-parsing the string values for every segment
Add on-heap dictionaries for the BYTES and BIG_DECIMAL data types
For an IN predicate with many values, cap the initial dict id set size at 1000 to avoid over-allocating when many of the values are not in the dictionary
Implement Dictionary.indexOf() for all data types to avoid unnecessary string conversion
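
To make the third and fourth points concrete, here is a minimal sketch, not the code from this PR. It assumes a sorted int dictionary backed by a plain array, and the class, method, and constant names are hypothetical. The values are parsed once up front, looked up through a typed `indexOf`, and collected into a set whose initial size is capped at 1000.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; Pinot's real dictionaries and predicate evaluators differ.
final class InPredicateDictIdSketch {
  // Assumed cap on the initial dict id set size (the "1000" from the description above).
  static final int MAX_INITIAL_DICT_ID_SET_SIZE = 1000;

  // Typed lookup on a sorted int dictionary: returns the dict id, or -1 if the value is absent.
  // No string conversion is involved.
  static int indexOf(int[] sortedDictionary, int value) {
    int index = Arrays.binarySearch(sortedDictionary, value);
    return index >= 0 ? index : -1;
  }

  // Resolves already-parsed IN values to dict ids; values missing from the dictionary are skipped.
  static Set<Integer> matchingDictIds(int[] sortedDictionary, int[] parsedValues) {
    // Cap the initial size so a long IN list with few matches does not over-allocate.
    int initialSize = Math.min(parsedValues.length, MAX_INITIAL_DICT_ID_SET_SIZE);
    Set<Integer> dictIds = new HashSet<>(initialSize);
    for (int value : parsedValues) {
      int dictId = indexOf(sortedDictionary, value);
      if (dictId >= 0) {
        dictIds.add(dictId);
      }
    }
    return dictIds;
  }
}
```

The set can still grow past the cap when more than 1000 values match; the cap only limits what is allocated before any lookup has succeeded.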

codecov-commenter · 2022-06-15T05:55:39Z

Codecov Report

Merging #8891 (8c8bb67) into master (c802786) will decrease coverage by 0.06%.
The diff coverage is 63.98%.

@@             Coverage Diff              @@
##             master    #8891      +/-   ##
============================================
- Coverage     69.78%   69.72%   -0.07%     
- Complexity     4679     4880     +201     
============================================
  Files          1808     1811       +3     
  Lines         94235    94430     +195     
  Branches      14052    14085      +33     
============================================
+ Hits          65765    65844      +79     
- Misses        23908    24017     +109     
- Partials       4562     4569       +7

Flag	Coverage Δ
integration1	`26.65% <30.05%> (-0.06%)`	⬇️
integration2	`24.85% <31.25%> (-0.04%)`	⬇️
unittests1	`66.37% <59.52%> (-0.03%)`	⬇️
unittests2	`14.84% <0.00%> (-0.05%)`	⬇️

Impacted Files	Coverage Δ
...dictionary/BigDecimalOffHeapMutableDictionary.java	`38.38% <0.00%> (-0.40%)`	⬇️
.../dictionary/BigDecimalOnHeapMutableDictionary.java	`34.83% <0.00%> (-0.40%)`	⬇️
...impl/dictionary/BytesOffHeapMutableDictionary.java	`54.79% <0.00%> (-0.77%)`	⬇️
.../impl/dictionary/BytesOnHeapMutableDictionary.java	`51.56% <0.00%> (-0.82%)`	⬇️
...mpl/dictionary/DoubleOffHeapMutableDictionary.java	`36.45% <0.00%> (-0.39%)`	⬇️
...impl/dictionary/DoubleOnHeapMutableDictionary.java	`32.94% <0.00%> (-0.40%)`	⬇️
...impl/dictionary/FloatOffHeapMutableDictionary.java	`38.54% <0.00%> (-0.41%)`	⬇️
.../impl/dictionary/FloatOnHeapMutableDictionary.java	`34.11% <0.00%> (-0.41%)`	⬇️
...e/impl/dictionary/IntOffHeapMutableDictionary.java	`47.91% <0.00%> (-0.51%)`	⬇️
...me/impl/dictionary/IntOnHeapMutableDictionary.java	`42.35% <0.00%> (-0.51%)`	⬇️
... and 70 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c802786...8c8bb67.

richardstartin · 2022-06-15T10:18:54Z

...ain/java/org/apache/pinot/core/operator/filter/predicate/NotInPredicateEvaluatorFactory.java

+        // NOTE: Add value-by-value to avoid overhead
+        //noinspection ManualArrayToCollectionCopy


richardstartin: What is the overhead being avoided here? Have you compared with

Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(Arrays.asList(notInPredicate.getBytesValues()));

Jackie-Jiang: Directly constructing the set from a list won't honor the min hash set size (not sure how much it helps, but I don't want to couple that change into this one).

I decided to keep the value-by-value add to skip the redundant capacity check in ObjectOpenHashSet.addAll(), because we already set the proper capacity up front. I also want to keep the behavior the same for all data types so that it is easier to track.

richardstartin: I'm not going to test it, but I would be amazed if this were beneficial. Shouldn't the hash map size be controlled by the load factor anyway?

Jackie-Jiang: The min hash set size was introduced in #3009, and the claim there is that it reduces the latency of a query from 580ms to 430ms. We might want to revisit that number some time.

richardstartin: I'm sure the hash sets were too small by default, but this could have been resolved via the load factor, which would probably have made it even faster (it would have been nice if profiles before and after had been captured for posterity).
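
The two construction styles debated in this thread can be sketched as follows. This is an illustration under assumptions, not the PR's code: it uses `java.util.HashSet` in place of fastutil's `ObjectOpenHashSet`, and the class name, method names, and the `minSize` parameter are hypothetical. It shows how an expected size and a load factor together determine the initial capacity, and it makes no claim about which variant is faster.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; uses java.util.HashSet rather than fastutil.
final class PresizedSetSketch {
  // Smallest initial capacity that holds expectedSize elements without a resize
  // at the given load factor.
  static int capacityFor(int expectedSize, float loadFactor) {
    return (int) Math.ceil(expectedSize / (double) loadFactor);
  }

  // Pre-size the set (honoring an assumed minimum size), then add value by value.
  static Set<String> buildValueByValue(String[] values, int minSize, float loadFactor) {
    int expectedSize = Math.max(values.length, minSize);
    Set<String> set = new HashSet<>(capacityFor(expectedSize, loadFactor), loadFactor);
    for (String value : values) {
      set.add(value);
    }
    return set;
  }

  // Construct directly from a list view of the array; sizing is left to the collection.
  static Set<String> buildFromCollection(String[] values) {
    return new HashSet<>(Arrays.asList(values));
  }
}
```

Both methods produce equal sets; they differ only in how much table space is reserved up front, which is the trade-off a before-and-after profile would need to measure.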

xiangfu0 · 2022-06-16T09:51:10Z

pinot-spi/src/main/java/org/apache/pinot/spi/utils/ByteArray.java

+  // Hash for empty ByteArray is 1
+  private int _hash = 1;
+
  public ByteArray(byte[] bytes) {


xiangfu0: Is this ByteArray reusable? If so, we should reset _hash = 1 here.
Or just add one more boolean to record whether the hash has already been computed in hash().

richardstartin: Java's String added a flag recording whether the hash has been computed, so that the hash is not recomputed on every call when the hash code happens to be 0. The same should be done here in case the hash happens to be 1: finding {x_i} such that sum(x_i * 31^(n-i)) = 1 gives the byte arrays which collide. Such examples are easy to construct, and they do occur in reality.

Jackie-Jiang: @xiangfu0 Good point. We don't reuse the byte[] in the ByteArray right now, but there is no way to enforce that without cloning the byte array during construction, which would add overhead.
I added some comments to the javadoc.

Jackie-Jiang: @richardstartin I'm following the String implementation in adopt-openjdk-11, which has the following check:

    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            hash = h = isLatin1() ? StringLatin1.hashCode(value)
                                  : StringUTF16.hashCode(value);
        }
        return h;
    }

I assume collisions will be very rare and not worth the overhead of storing an extra boolean field. Do you know whether this implementation has changed in newer JDK versions?
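
The flag-based variant suggested in this thread could look like the sketch below. It is a hypothetical class, not Pinot's `ByteArray`, and it assumes the hash is computed the way `java.util.Arrays.hashCode(byte[])` computes it, which is consistent with the empty-array hash of 1 noted in the diff. The `_computeCount` field exists only to make the caching observable here.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the ByteArray class from this PR.
final class CachedHashBytes {
  private final byte[] _bytes;
  // 1 doubles as "not computed yet", mirroring the sentinel in the diff above.
  private int _hash = 1;
  // Set once the real hash turns out to be 1, so it is not recomputed on every call.
  private boolean _hashIsOne;
  // Number of full hash computations; present only for demonstration.
  int _computeCount;

  CachedHashBytes(byte[] bytes) {
    _bytes = bytes;
  }

  @Override
  public int hashCode() {
    int h = _hash;
    if (h == 1 && !_hashIsOne) {
      h = Arrays.hashCode(_bytes);
      _computeCount++;
      if (h == 1) {
        _hashIsOne = true;
      } else {
        _hash = h;
      }
    }
    return h;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof CachedHashBytes && Arrays.equals(_bytes, ((CachedHashBytes) o)._bytes);
  }
}
```

A one-byte collision is easy to find under that assumption: `Arrays.hashCode(new byte[]{-30})` is 31 * 1 + (-30) = 1, the same value as the sentinel. Without the flag, such an array would be rehashed on every `hashCode()` call; with it, the hash is computed once, at the cost of one extra boolean field per instance.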

Jackie-Jiang added the enhancement label Jun 15, 2022

Optimize dictionary lookup for IN clause

8c8bb67

Jackie-Jiang force-pushed the bytes_on_heap branch from 9df901f to 8c8bb67 Compare June 15, 2022 05:23

richardstartin reviewed Jun 15, 2022

xiangfu0 approved these changes Jun 16, 2022

Jackie-Jiang merged commit 4548783 into apache:master Jun 18, 2022

Jackie-Jiang deleted the bytes_on_heap branch June 18, 2022 23:19

4 participants
