Optimize dictionary lookup for IN clause #8891

Merged
Jackie-Jiang merged 1 commit into apache:master from Jackie-Jiang:bytes_on_heap
Jun 18, 2022

@Jackie-Jiang
Contributor

  • Cache the parsed values in the IN/NOT_IN predicate so the string values are not re-parsed for every segment
  • Add on-heap dictionaries for the BYTES and BIG_DECIMAL data types
  • For IN predicates with many values, cap the initial size of the dict id set at 1000 to avoid over-allocating when many of the values are not in the dictionary
  • Implement Dictionary.indexOf() for all data types to avoid unnecessary string conversion
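
The bullets above can be illustrated with a minimal sketch. It is not the code from this PR: the class names (`InPredicateSketch`, `SortedIntDictionarySketch`, `DictIdSetSketch`), the constant names, and the use of `java.util.HashSet` over a sorted `int[]` are all assumptions made for illustration. It shows the string literals being parsed once and cached on the predicate, a typed `indexOf(int)` lookup that needs no string conversion, and an initial set size capped at 1000.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for an IN predicate that parses its string literals only once. */
final class InPredicateSketch {
  private final String[] _stringValues;
  // Lazily parsed and then shared by every segment. Not synchronized: in this sketch,
  // concurrent callers could at worst parse the values more than once.
  private int[] _intValues;

  InPredicateSketch(String... stringValues) {
    _stringValues = stringValues;
  }

  int[] getIntValues() {
    int[] values = _intValues;
    if (values == null) {
      values = new int[_stringValues.length];
      for (int i = 0; i < values.length; i++) {
        values[i] = Integer.parseInt(_stringValues[i]);
      }
      _intValues = values;
    }
    return values;
  }
}

/** Hypothetical sorted INT dictionary with a typed lookup (no string conversion). */
final class SortedIntDictionarySketch {
  static final int NULL_VALUE_INDEX = -1; // assumed "not found" marker

  private final int[] _sortedValues;

  SortedIntDictionarySketch(int... sortedValues) {
    _sortedValues = sortedValues;
  }

  int indexOf(int value) {
    int index = Arrays.binarySearch(_sortedValues, value);
    return index >= 0 ? index : NULL_VALUE_INDEX;
  }
}

final class DictIdSetSketch {
  static final int MAX_INITIAL_DICT_ID_SET_SIZE = 1000; // the cap mentioned in the description

  static Set<Integer> matchingDictIds(InPredicateSketch predicate, SortedIntDictionarySketch dictionary) {
    int[] values = predicate.getIntValues();
    // Cap the expected size: many of the values may be absent from this dictionary.
    int expected = Math.min(values.length, MAX_INITIAL_DICT_ID_SET_SIZE);
    // HashSet takes a capacity, so divide by its default load factor (0.75).
    Set<Integer> dictIds = new HashSet<>((int) (expected / 0.75f) + 1);
    for (int value : values) {
      int dictId = dictionary.indexOf(value);
      if (dictId != SortedIntDictionarySketch.NULL_VALUE_INDEX) {
        dictIds.add(dictId);
      }
    }
    return dictIds;
  }
}
```

The set still grows on demand if more than 1000 values match; the cap only limits the up-front allocation.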

@codecov-commenter

codecov-commenter commented Jun 15, 2022

Codecov Report

Merging #8891 (8c8bb67) into master (c802786) will decrease coverage by 0.06%.
The diff coverage is 63.98%.

@@             Coverage Diff              @@
##             master    #8891      +/-   ##
============================================
- Coverage     69.78%   69.72%   -0.07%     
- Complexity     4679     4880     +201     
============================================
  Files          1808     1811       +3     
  Lines         94235    94430     +195     
  Branches      14052    14085      +33     
============================================
+ Hits          65765    65844      +79     
- Misses        23908    24017     +109     
- Partials       4562     4569       +7     
Flag Coverage Δ
integration1 26.65% <30.05%> (-0.06%) ⬇️
integration2 24.85% <31.25%> (-0.04%) ⬇️
unittests1 66.37% <59.52%> (-0.03%) ⬇️
unittests2 14.84% <0.00%> (-0.05%) ⬇️

Impacted Files Coverage Δ
...dictionary/BigDecimalOffHeapMutableDictionary.java 38.38% <0.00%> (-0.40%) ⬇️
.../dictionary/BigDecimalOnHeapMutableDictionary.java 34.83% <0.00%> (-0.40%) ⬇️
...impl/dictionary/BytesOffHeapMutableDictionary.java 54.79% <0.00%> (-0.77%) ⬇️
.../impl/dictionary/BytesOnHeapMutableDictionary.java 51.56% <0.00%> (-0.82%) ⬇️
...mpl/dictionary/DoubleOffHeapMutableDictionary.java 36.45% <0.00%> (-0.39%) ⬇️
...impl/dictionary/DoubleOnHeapMutableDictionary.java 32.94% <0.00%> (-0.40%) ⬇️
...impl/dictionary/FloatOffHeapMutableDictionary.java 38.54% <0.00%> (-0.41%) ⬇️
.../impl/dictionary/FloatOnHeapMutableDictionary.java 34.11% <0.00%> (-0.41%) ⬇️
...e/impl/dictionary/IntOffHeapMutableDictionary.java 47.91% <0.00%> (-0.51%) ⬇️
...me/impl/dictionary/IntOnHeapMutableDictionary.java 42.35% <0.00%> (-0.51%) ⬇️
... and 70 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines +140 to +141
// NOTE: Add value-by-value to avoid overhead
//noinspection ManualArrayToCollectionCopy
Member

What is the overhead being avoided here? Have you compared with

        Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(Arrays.asList(notInPredicate.getBytesValues()));

Contributor Author

Constructing the set directly from a list won't honor the min hash set size (I'm not sure how much it helps, but I don't want to couple that change with this one).

I decided to keep the value-by-value add to skip the redundant capacity check in ObjectOpenHashSet.addAll(), because we already set the proper capacity up front. I also want to keep the behavior the same for all data types so that it is easier to track.

Member

I’m not going to test it but would be amazed if this were beneficial. Shouldn’t hash map size be controlled by the load factor anyway?

Contributor Author

The min hash set size was introduced in #3009, where the claim is that it reduced the latency of a query from 580 ms to 430 ms. We might want to revisit that number at some point.

Member

@richardstartin Jun 16, 2022

I’m sure the hash sets were too small by default, but this could have been resolved via the load factor, which would probably have made this even faster (it would have been nice if profiles before and after had been captured for posterity).
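
The options discussed in this thread can be put side by side in a minimal sketch. It uses the JDK's `java.util.HashSet` as a stand-in for fastutil's `ObjectOpenHashSet`, and `MIN_HASH_SET_SIZE` is a hypothetical constant (the real floor is not shown in this thread). All three variants build the same set; which one is fastest is exactly what the thread leaves open, and only a profile would settle it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class HashSetSizingSketch {
  // Hypothetical floor on the expected size; the value is an assumption for illustration.
  static final int MIN_HASH_SET_SIZE = 25;

  /** Presize with a floor on the expected size, then add value-by-value. */
  static <T> Set<T> presizedWithFloor(T[] values) {
    int expected = Math.max(values.length, MIN_HASH_SET_SIZE);
    // HashSet takes a capacity, so divide the expected size by the default load factor (0.75).
    Set<T> set = new HashSet<>((int) (expected / 0.75f) + 1);
    for (T value : values) {
      set.add(value);
    }
    return set;
  }

  /** Size for the actual number of values and leave more headroom via a lower load factor. */
  static <T> Set<T> withLoadFactor(T[] values, float loadFactor) {
    Set<T> set = new HashSet<>((int) (values.length / loadFactor) + 1, loadFactor);
    for (T value : values) {
      set.add(value);
    }
    return set;
  }

  /** Construct directly from a list view of the array; the collection picks its own size. */
  static <T> Set<T> fromList(T[] values) {
    return new HashSet<>(Arrays.asList(values));
  }
}
```

A lower load factor keeps the table sparser at every size, while a size floor only helps small sets, so the two are not interchangeable in general.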

// Hash for empty ByteArray is 1
private int _hash = 1;

public ByteArray(byte[] bytes) {
Contributor

Is this ByteArray reusable? If so, we should reset _hash = 1 here.
Or just add one more boolean that records whether the hash has already been computed in the hash() method.

Member

Java’s String added a flag recording whether the hash has been computed, to avoid recomputing the hash on every call when the hash code happens to be 0. The same should be done here in case the hash happens to be 1: finding {x_i} such that sum(x_i * 31^(n-i)) = 1 gives the byte arrays which collide. It’s easy to construct examples, and they do occur in reality.
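
A minimal sketch of the concern, assuming the cached hash follows the `Arrays.hashCode(byte[])` recurrence (start at 1, then `31 * h + b` per byte), which matches the "hash for empty ByteArray is 1" comment in the diff. Both classes below are hypothetical and are not the project's `ByteArray`; the `computations` counter exists only to make the difference observable. Under that recurrence the single-byte array `{-30}` hashes to `31 * 1 + (-30) = 1`, so a sentinel-based cache recomputes it on every call, while a flag-based cache computes it once.

```java
import java.util.Arrays;

/** Hypothetical wrapper that uses 1 as the "not computed yet" sentinel. equals() omitted for brevity. */
final class SentinelHashBytes {
  private final byte[] _bytes;
  private int _hash = 1;
  int computations; // instrumentation for this sketch only

  SentinelHashBytes(byte[] bytes) {
    _bytes = bytes;
  }

  @Override
  public int hashCode() {
    int hash = _hash;
    if (hash == 1 && _bytes.length > 0) {
      computations++;
      _hash = hash = Arrays.hashCode(_bytes);
    }
    return hash;
  }
}

/** Hypothetical wrapper that tracks "computed" with a separate boolean. equals() omitted for brevity. */
final class FlaggedHashBytes {
  private final byte[] _bytes;
  private int _hash;
  private boolean _hashComputed;
  int computations; // instrumentation for this sketch only

  FlaggedHashBytes(byte[] bytes) {
    _bytes = bytes;
  }

  @Override
  public int hashCode() {
    if (!_hashComputed) {
      computations++;
      _hash = Arrays.hashCode(_bytes);
      _hashComputed = true;
    }
    return _hash;
  }
}
```

The returned hash is identical in both variants; the only difference is repeated work for arrays whose hash equals the sentinel, traded against one extra boolean field per instance.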

Contributor Author

@xiangfu0 Good point. We don't reuse the byte[] in the ByteArray right now, but there is no way to enforce that without cloning the byte array during construction, which would add overhead.
I added some comments to the javadoc.

Contributor Author

@richardstartin I'm following the String implementation in adopt-openjdk-11, which has the following check:

    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            hash = h = isLatin1() ? StringLatin1.hashCode(value)
                                  : StringUTF16.hashCode(value);
        }
        return h;
    }

I assume the collision will be super rare and is not worth the overhead of storing an extra boolean field. Do you know if this implementation has changed in newer JDK versions?

@Jackie-Jiang merged commit 4548783 into apache:master Jun 18, 2022
@Jackie-Jiang Jackie-Jiang deleted the bytes_on_heap branch June 18, 2022 23:19

4 participants