New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-24245: Vectorized PTF with count and distinct over partition producing incorrect results. #1649
Conversation
7afafd4
to
e7b9863
Compare
e7b9863
to
bf24ff6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor comments, patch LGTM in general
protected void countValue(ColumnVector colVector, int i) { | ||
Object value = getValue(colVector, i); | ||
if (!uniqueObjects.contains(value)) { | ||
uniqueObjects.add(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do check if the element is present twice here -- add method already checks if the element is part of the Set so we dont need to call contains. I would just add a comment here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, thanks, forgot that Set takes care of uniqueness :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, I copied this wrong approach from GenericUDAFCount, I'm fixing it there also
boolean[] batchIsNull = colVector.isNull; | ||
for (int i = 0; i < size; i++) { | ||
if (!batchIsNull[i]) { | ||
countValue(colVector, i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of updating the inherited count variable we could override getLongGroupResult() with uniqueObjects.size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right! updating it
|
||
protected Object getValue(ColumnVector colVector, int i) { | ||
BytesColumnVector inV = (BytesColumnVector) colVector; | ||
return Murmur3.hash32(inV.vector[i], inV.start[i], inV.length[i], Murmur3.DEFAULT_SEED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the benefit of returning Murmur hash directly here? Adding a String to the HashSet will also hash the input (so essentially we are hashing twice)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used hashing for memory considerations, I was hoping that this way the unique set will consume less memory (storing hashes instead of strings of arbitrary length)...now I think that we could be fine with storing strings and prevent additional hashing, as we tend to optimize CPU cycles instead of memory in the very-first round...I'll change this to new String(byte[])
bf24ff6
to
86b53d3
Compare
uniqueObjs.add(obj); | ||
} else { | ||
boolean inserted = uniqueObjs.add(obj); | ||
if (!inserted){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking care of this
Thanks for the update @abstractdog ! |
86b53d3
to
8d9f894
Compare
…ducing incorrect results.
8d9f894
to
f9a7a9a
Compare
merged to master, thanks @pgaref for the review! |
What changes were proposed in this pull request?
Implements vectorized count(distinct txt1) over(partition by txt1), so a distinct count. Currently, this is not supported, and silently falls back to non-distinct which leads to wrong result.
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?
Qtests are included.