Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-21935: Hive Vectorization : degraded performance with vectorize UDF #2242

Closed
wants to merge 1 commit into from

Conversation

mustafaiman
Copy link
Contributor

Change-Id: I5fad847456d8c3319ea07cfe114007ed01b54bbe

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

deserializerBatch.selectedInUse = false;
deserializerBatch.size = 0;
deserializerBatch.endOfFile = false;
deserializerBatch.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reset, can you reinit isNull and isRepeating for columns which are not present in currentDataColumnCount?
(i.e similar to the case mentioned in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorMapOperator.java#L706)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I excluded partition column and rowidentifiercolumn from reset. So this is more like the original code now. It only resets the output columns in addition to original implementation.

@abstractdog
Copy link
Contributor

the change makes sense to me, could you confirm that this use-case is covered by a q.test?
I was looking for qtests, I found similar ones but only with struct type (StructColumnVector, which doesn’t inherit MultiValuedColumnVector)
e.g.:
schema_evol_text_vecrow_part_all_complex.q
I think q.test for this use should have the following properties:

  1. use text format file for reading
  2. force vector/row serde deserialize:
SET hive.vectorized.use.vectorized.input.format=false;
SET hive.vectorized.use.vector.serde.deserialize=true;
SET hive.vectorized.use.row.serde.deserialize=true;
  1. use map/list type
  2. work on small batches or a large amount of rows <1024 (to force batch reuse = reuse of deserializerBatch)

Change-Id: I45e89412a1f50f1ab6a2494bb692ef7f4fc58a7e
Copy link
Contributor

@pgaref pgaref left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes + test look good to me!

@mustafaiman
Copy link
Contributor Author

@abstractdog @rbalamohan are you ok with the latest patch?

@abstractdog abstractdog self-requested a review May 17, 2021 07:19
Copy link
Contributor

@abstractdog abstractdog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
debugged locally, resetVectorizedRowBatchForDeserialize only clears data columns, then skips partition columns if any, and then clears all the remaining columns, so this looks good to me
thanks @mustafaiman for the patch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants