Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance VarByteChunkSVForwardIndexReader to directly read from data buffer for uncompressed data #5816

Merged
merged 1 commit into from Aug 6, 2020

Conversation

Jackie-Jiang
Copy link
Contributor

Description

Currently for var-byte raw index, we always pre-allocate the chunk buffer even for uncompressed data (the buffer could be huge if the index contains large size value). When reading values, we first copy the data into the chunk buffer, then read from the buffer. This could cause unnecessary overhead on copying the data as well as allocating the direct memory, and can even cause OOM if the buffer size is too big. The chunk buffer is needed for compressed data in order to decompress it, but not necessary for uncompressed data as we can directly read from the data buffer.

This PR enhances the VarByteChunkSVForwardIndexReader to directly read from the data buffer for uncompressed data, and avoid the overhead of the chunk buffer.

Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we cover both compressed/uncompressed cases in the unit test?

LGTM otherwise.

@Jackie-Jiang
Copy link
Contributor Author

Do we cover both compressed/uncompressed cases in the unit test?

Yes, VarByteChunkSVForwardIndexTest covers both compressed and uncompressed indexes

int nextRowOffset;
private int getValueEndOffset(int rowId, ByteBuffer chunkBuffer) {
if (rowId == _numDocsPerChunk - 1) {
// Last row in the trunk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo; trunk -> chunk

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

/**
* Helper method to compute the end offset of the value in the data buffer.
*/
private long getValueEndOffset(int chunkId, int chunkRowId, long chunkStartOffset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the algorithm for getting endoffset or startOffset for next row is different for uncompressed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Algorithm is slightly different because with the chunkBuffer we can directly get the chunkEndOffset via chunkBuffer.limit(), which is not the case for the uncompressed one. That is why we have a branch on the last chunk.

@Jackie-Jiang Jackie-Jiang merged commit f68b82e into apache:master Aug 6, 2020
@Jackie-Jiang Jackie-Jiang deleted the var_byte_chunk_reader branch August 6, 2020 04:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants