Optimize dictionary lookup for IN clause #8891
Conversation
Force-pushed from 9df901f to 8c8bb67
Codecov Report
@@ Coverage Diff @@
## master #8891 +/- ##
============================================
- Coverage 69.78% 69.72% -0.07%
- Complexity 4679 4880 +201
============================================
Files 1808 1811 +3
Lines 94235 94430 +195
Branches 14052 14085 +33
============================================
+ Hits 65765 65844 +79
- Misses 23908 24017 +109
- Partials 4562 4569 +7
```java
// NOTE: Add value-by-value to avoid overhead
//noinspection ManualArrayToCollectionCopy
```
What is the overhead being avoided here? Have you compared with

```java
Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(Arrays.asList(notInPredicate.getBytesValues()));
```
Directly constructing the set from a list won't honor the min hash set size (not sure how much it helps, but I don't want to couple that change into this one).
I decided to keep the value-by-value add to skip the redundant capacity check in ObjectOpenHashSet.addAll(), since we already set the proper capacity up front. I also want to keep the behavior the same for all data types so that it is easier to track.
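The pattern being discussed can be sketched with plain `java.util.HashSet` (fastutil's `ObjectOpenHashSet` follows the same idea); the minimum-size constant here is hypothetical, standing in for Pinot's min hash set size:

```java
import java.util.HashSet;
import java.util.Set;

public class PreSizedSetExample {
  // Hypothetical minimum capacity, standing in for Pinot's min hash set size
  private static final int MIN_HASH_SET_SIZE = 32;

  static Set<String> buildSet(String[] values) {
    // Pre-size the set so no resize happens while adding; HashSet resizes
    // when size exceeds capacity * loadFactor, so divide by the load factor.
    int capacity = (int) (Math.max(values.length, MIN_HASH_SET_SIZE) / 0.75f) + 1;
    Set<String> set = new HashSet<>(capacity);
    // Value-by-value add, mirroring the ManualArrayToCollectionCopy pattern
    for (String value : values) {
      set.add(value);
    }
    return set;
  }

  public static void main(String[] args) {
    Set<String> set = buildSet(new String[]{"a", "b", "a"});
    System.out.println(set.size()); // prints 2
  }
}
```

Constructing via `new HashSet<>(Arrays.asList(...))` skips the explicit capacity computation, which is the coupling the comment above wants to avoid.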
I’m not going to test it, but I would be amazed if this were beneficial. Shouldn’t the hash set size be controlled by the load factor anyway?
The min hash set size was introduced in #3009, and the claim is that it reduces the latency of a query from 580ms to 430ms. We might want to revisit that number at some point.
I’m sure the hash sets were too small by default, but this could have been resolved via the load factor, which would probably have made this even faster (it would have been nice if before/after profiles had been captured for posterity).
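The load-factor alternative mentioned above can be sketched as follows; the values here are illustrative, not what Pinot uses:

```java
import java.util.HashSet;
import java.util.Set;

public class LoadFactorExample {
  public static void main(String[] args) {
    int expectedSize = 10;
    // The default load factor of 0.75 triggers a resize when
    // size > capacity * loadFactor; passing a lower load factor
    // allocates a sparser table, trading memory for fewer collisions,
    // with no minimum-size constant needed.
    Set<String> sparse = new HashSet<>(expectedSize, 0.25f);
    for (int i = 0; i < expectedSize; i++) {
      sparse.add("value-" + i);
    }
    System.out.println(sparse.size()); // prints 10
  }
}
```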
```java
// Hash for empty ByteArray is 1
private int _hash = 1;

public ByteArray(byte[] bytes) {
```
Is this ByteArray reusable? If so, we should reset `_hash = 1` here.
Or add one more boolean field to track whether the hash has already been computed in the `hash()` method.
Java’s String added a flag tracking whether the hash had been computed, to avoid recomputing it on every call when the hash code happens to be 0. The same should be done here in case the hash happens to be 1 (finding {x_i} such that sum(x_i * 31^(n-i)) = 1 gives the byte arrays that collide; it’s easy to construct examples, and they do occur in reality).
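A concrete collision with the sentinel is easy to exhibit, since `Arrays.hashCode(byte[])` is the polynomial hash described above (starting from 1, multiplying by 31 per element):

```java
import java.util.Arrays;

public class SentinelCollision {
  public static void main(String[] args) {
    // For a single-byte array, Arrays.hashCode is 31 * 1 + b[0],
    // so b[0] = -30 yields exactly the sentinel value 1.
    byte[] collides = new byte[]{-30};
    System.out.println(Arrays.hashCode(collides)); // prints 1
    // The empty array also hashes to 1, which is why 1 is the sentinel.
    System.out.println(Arrays.hashCode(new byte[0])); // prints 1
  }
}
```

Any `ByteArray` wrapping such bytes would see its cached hash treated as "not yet computed" and rehash on every call.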
@xiangfu0 Good point. We don't reuse the byte[] in the ByteArray right now, but there is no way to enforce that without cloning the byte array during construction, which would add overhead.
Added some comments to the Javadoc.
@richardstartin I'm following the String implementation in AdoptOpenJDK 11, which has the following check:

```java
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        hash = h = isLatin1() ? StringLatin1.hashCode(value)
                              : StringUTF16.hashCode(value);
    }
    return h;
}
```
I assume such collisions will be super rare and not worth the overhead of storing an extra boolean field. Do you know if this implementation has changed in newer JDK versions?
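The boolean-flag alternative suggested above can be sketched as follows; the class and field names are illustrative, not Pinot's actual `ByteArray`:

```java
import java.util.Arrays;

// Cache the hash lazily behind a boolean flag (similar in spirit to the
// extra flag the reviewer describes for String), so arrays whose hash
// happens to equal the sentinel are not rehashed on every call.
public class LazyHashByteArray {
  private final byte[] _bytes;
  private int _hash;
  private boolean _hashComputed;

  public LazyHashByteArray(byte[] bytes) {
    _bytes = bytes;
  }

  @Override
  public int hashCode() {
    if (!_hashComputed) {
      _hash = Arrays.hashCode(_bytes);
      _hashComputed = true;
    }
    return _hash;
  }

  public static void main(String[] args) {
    // This input hashes to the sentinel value 1, but is still computed once.
    LazyHashByteArray a = new LazyHashByteArray(new byte[]{-30});
    System.out.println(a.hashCode()); // prints 1
  }
}
```

The trade-off is one extra field per instance against redundant hashing for sentinel-colliding inputs, which is exactly the cost/benefit question raised in the thread.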
`Dictionary.indexOf()` for all data types to avoid the unnecessary string conversion