Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-24245: Vectorized PTF with count and distinct over partition producing incorrect results. #1649

Merged
merged 1 commit into from Dec 10, 2020

Conversation

abstractdog
Copy link
Contributor

@abstractdog abstractdog commented Nov 3, 2020

What changes were proposed in this pull request?

Implements vectorized count(distinct txt1) over(partition by txt1), so a distinct count. Currently, this is not supported, and silently falls back to non-distinct which leads to wrong result.

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Qtests are included.

Copy link
Contributor

@pgaref pgaref left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments, patch LGTM in general

protected void countValue(ColumnVector colVector, int i) {
Object value = getValue(colVector, i);
if (!uniqueObjects.contains(value)) {
uniqueObjects.add(value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do check if the element is present twice here -- add method already checks if the element is part of the Set so we dont need to call contains. I would just add a comment here instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, thanks, forgot that Set takes care of uniqueness :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I copied this wrong approach from GenericUDAFCount, I'm fixing it there also

boolean[] batchIsNull = colVector.isNull;
for (int i = 0; i < size; i++) {
if (!batchIsNull[i]) {
countValue(colVector, i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of updating the inherited count variable we could override getLongGroupResult() with uniqueObjects.size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right! updating it


protected Object getValue(ColumnVector colVector, int i) {
BytesColumnVector inV = (BytesColumnVector) colVector;
return Murmur3.hash32(inV.vector[i], inV.start[i], inV.length[i], Murmur3.DEFAULT_SEED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit of returning Murmur hash directly here? Adding a String to the HashSet will also hash the input (so essentially we are hashing twice)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used hashing for memory considerations, I was hoping that this way the unique set will consume less memory (storing hashes instead of strings of arbitrary length)...now I think that we could be fine with storing strings and prevent additional hashing, as we tend to optimize CPU cycles instead of memory in the very-first round...I'll change this to new String(byte[])

uniqueObjs.add(obj);
} else {
boolean inserted = uniqueObjs.add(obj);
if (!inserted){
Copy link
Contributor

@pgaref pgaref Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking care of this

@pgaref
Copy link
Contributor

pgaref commented Nov 25, 2020

Thanks for the update @abstractdog !
+1 tests pending

@abstractdog
Copy link
Contributor Author

merged to master, thanks @pgaref for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants