Enhance VarByteChunkSVForwardIndexReader to directly read from data buffer for uncompressed data #5816

Jackie-Jiang · 2020-08-05T23:21:20Z

Description

Currently, for the var-byte raw index, we always pre-allocate the chunk buffer, even for uncompressed data (the buffer can be huge if the index contains large values). When reading a value, we first copy the data into the chunk buffer and then read from that buffer. This adds unnecessary overhead for copying the data and for allocating direct memory, and can even cause an OOM if the buffer is too big. The chunk buffer is needed to decompress compressed data, but it is not necessary for uncompressed data, because we can read directly from the data buffer.

This PR enhances VarByteChunkSVForwardIndexReader to read directly from the data buffer for uncompressed data, avoiding the overhead of the chunk buffer.
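
To illustrate the difference between the two read paths described above, here is a minimal sketch using plain java.nio.ByteBuffer. It is not the PR's code: the class DirectReadSketch, both method names, and all parameters are hypothetical, and the chunk and value offsets are passed in by hand rather than read from a chunk header.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; names and layout are assumptions, not Pinot's API.
class DirectReadSketch {

  /** Copy path: copy the whole chunk into a pre-allocated scratch buffer, then read the value from it. */
  static byte[] readViaChunkBuffer(ByteBuffer dataBuffer, int chunkStart, int chunkLength,
      int valueStartInChunk, int valueEndInChunk, ByteBuffer chunkBuffer) {
    // Work on a duplicate so the caller's position and limit are left untouched.
    ByteBuffer chunk = dataBuffer.duplicate();
    chunk.limit(chunkStart + chunkLength);
    chunk.position(chunkStart);

    chunkBuffer.clear();
    chunkBuffer.put(chunk); // the extra copy that is only needed when decompressing
    chunkBuffer.flip();

    byte[] value = new byte[valueEndInChunk - valueStartInChunk];
    chunkBuffer.position(valueStartInChunk);
    chunkBuffer.get(value);
    return value;
  }

  /** Direct path: read the value straight from the data buffer, no scratch buffer involved. */
  static byte[] readDirect(ByteBuffer dataBuffer, int chunkStart, int valueStartInChunk,
      int valueEndInChunk) {
    byte[] value = new byte[valueEndInChunk - valueStartInChunk];
    ByteBuffer view = dataBuffer.duplicate();
    view.position(chunkStart + valueStartInChunk);
    view.get(value);
    return value;
  }

  public static void main(String[] args) {
    // Assumed toy data: one chunk "helloworld" at [2, 12), holding the value "world" at [5, 10).
    ByteBuffer data = ByteBuffer.wrap("xxhelloworldyy".getBytes(StandardCharsets.UTF_8));
    ByteBuffer scratch = ByteBuffer.allocateDirect(10); // must be as large as the biggest chunk

    byte[] viaCopy = readViaChunkBuffer(data, 2, 10, 5, 10, scratch);
    byte[] direct = readDirect(data, 2, 5, 10);
    System.out.println(new String(viaCopy, StandardCharsets.UTF_8));
    System.out.println(new String(direct, StandardCharsets.UTF_8));
  }
}
```

Both methods return the same bytes; the direct path simply never needs a scratch buffer sized to the largest chunk, which is the allocation the description is concerned about.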

snleee

Do we cover both compressed/uncompressed cases in the unit test?

LGTM otherwise.

Jackie-Jiang · 2020-08-06T00:06:59Z

Do we cover both compressed/uncompressed cases in the unit test?

Yes, VarByteChunkSVForwardIndexTest covers both compressed and uncompressed indexes.

siddharthteotia · 2020-08-06T00:47:17Z

...va/org/apache/pinot/core/segment/index/readers/forward/VarByteChunkSVForwardIndexReader.java

-    int nextRowOffset;
+  private int getValueEndOffset(int rowId, ByteBuffer chunkBuffer) {
+    if (rowId == _numDocsPerChunk - 1) {
+      // Last row in the trunk


typo; trunk -> chunk

siddharthteotia · 2020-08-06T00:54:50Z

...va/org/apache/pinot/core/segment/index/readers/forward/VarByteChunkSVForwardIndexReader.java

+  /**
+   * Helper method to compute the end offset of the value in the data buffer.
+   */
+  private long getValueEndOffset(int chunkId, int chunkRowId, long chunkStartOffset) {


Why is the algorithm for getting the end offset (or the start offset of the next row) different for uncompressed data?

The algorithm is slightly different because with the chunkBuffer we can get the chunk end offset directly via chunkBuffer.limit(), which is not possible in the uncompressed case. That is why we branch on the last chunk.
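
The method body is not quoted in this thread, so the following is only a simplified sketch of the kind of branching described above, under an assumed layout: each chunk starts with a header of numDocsPerChunk int offsets (relative to the chunk start), absent rows in a partially filled last chunk are stored as 0, and chunk start positions are kept in a plain int array. The class ValueEndOffsetSketch, the chunkStartOffsets array, ROW_OFFSET_SIZE, and the build helper are all hypothetical, and the signature uses int offsets rather than the long offsets in the reviewed method.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the layout and all names here are assumptions, not the PR's implementation.
class ValueEndOffsetSketch {
  static final int ROW_OFFSET_SIZE = Integer.BYTES;

  final ByteBuffer dataBuffer;
  final int[] chunkStartOffsets; // assumed stand-in for a chunk position lookup
  final int numDocsPerChunk;

  ValueEndOffsetSketch(ByteBuffer dataBuffer, int[] chunkStartOffsets, int numDocsPerChunk) {
    this.dataBuffer = dataBuffer;
    this.chunkStartOffsets = chunkStartOffsets;
    this.numDocsPerChunk = numDocsPerChunk;
  }

  /** End offset of a value in the data buffer, without a chunk buffer whose limit() marks the chunk end. */
  int getValueEndOffset(int chunkId, int chunkRowId) {
    int chunkStart = chunkStartOffsets[chunkId];
    boolean lastChunk = chunkId == chunkStartOffsets.length - 1;
    if (chunkRowId == numDocsPerChunk - 1) {
      // Last row in the chunk: the value ends where the next chunk (or the data) ends.
      return lastChunk ? dataBuffer.limit() : chunkStartOffsets[chunkId + 1];
    }
    int nextRowOffset = dataBuffer.getInt(chunkStart + (chunkRowId + 1) * ROW_OFFSET_SIZE);
    if (lastChunk && nextRowOffset == 0) {
      // Assumption: a partially filled last chunk stores 0 for absent rows.
      return dataBuffer.limit();
    }
    return chunkStart + nextRowOffset;
  }

  String getString(int chunkId, int chunkRowId) {
    int chunkStart = chunkStartOffsets[chunkId];
    int start = chunkStart + dataBuffer.getInt(chunkStart + chunkRowId * ROW_OFFSET_SIZE);
    byte[] bytes = new byte[getValueEndOffset(chunkId, chunkRowId) - start];
    ByteBuffer view = dataBuffer.duplicate();
    view.position(start);
    view.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  /** Demo helper: writes the values in the assumed layout. */
  static ValueEndOffsetSketch build(int numDocsPerChunk, String... values) {
    int numChunks = (values.length + numDocsPerChunk - 1) / numDocsPerChunk;
    int headerSize = numDocsPerChunk * ROW_OFFSET_SIZE;
    int totalSize = numChunks * headerSize;
    for (String value : values) {
      totalSize += value.getBytes(StandardCharsets.UTF_8).length;
    }
    ByteBuffer buffer = ByteBuffer.allocate(totalSize); // zero-filled, so absent rows read as 0
    int[] chunkStarts = new int[numChunks];
    int chunkStart = 0;
    for (int chunkId = 0; chunkId < numChunks; chunkId++) {
      chunkStarts[chunkId] = chunkStart;
      int valuePos = chunkStart + headerSize;
      for (int row = 0; row < numDocsPerChunk; row++) {
        int docId = chunkId * numDocsPerChunk + row;
        if (docId >= values.length) {
          break;
        }
        byte[] bytes = values[docId].getBytes(StandardCharsets.UTF_8);
        buffer.putInt(chunkStart + row * ROW_OFFSET_SIZE, valuePos - chunkStart);
        for (int i = 0; i < bytes.length; i++) {
          buffer.put(valuePos + i, bytes[i]);
        }
        valuePos += bytes.length;
      }
      chunkStart = valuePos;
    }
    return new ValueEndOffsetSketch(buffer, chunkStarts, numDocsPerChunk);
  }
}
```

For example, under these assumptions `ValueEndOffsetSketch.build(2, "a", "bcd", "ef")` produces two chunks, the second only half full, and `getString(1, 0)` takes the "absent next row" branch to find the end of "ef".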

Jackie-Jiang requested review from siddharthteotia and mayankshriv August 5, 2020 23:21

snleee approved these changes Aug 5, 2020

siddharthteotia reviewed Aug 6, 2020

Enhance VarByteChunkSVForwardIndexReader to directly read from data b…

2d12ff6

…uffer for uncompressed data

Jackie-Jiang force-pushed the var_byte_chunk_reader branch from d6deb16 to 2d12ff6 Compare August 6, 2020 00:59

Jackie-Jiang merged commit f68b82e into apache:master Aug 6, 2020

Jackie-Jiang deleted the var_byte_chunk_reader branch August 6, 2020 04:25
