-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster vectorized bit unpacking (Part 1) #5409
Conversation
private volatile PinotDataBitSetV2 _dataBitSet; | ||
private final int _numBitsPerValue; | ||
|
||
public FixedBitIntReaderWriterV2(PinotDataBuffer dataBuffer, int numValues, int numBitsPerValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider writing a header for backward/forward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in the follow-up when this code is wired with reader and writer (FixedBitSingleValueReader and FixedBitSingleValueWriter) and the scan operator, I will consider if the format has to be changed and bump the version and write a header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you do this, please write a standalone header buffer that can be used in other places
if (docIdRange > DocIdSetPlanNode.MAX_DOC_PER_CALL) { | ||
return false; | ||
} | ||
return numDocsToRead >= ((double)docIdRange * 0.7); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reasoning for magic number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have provided details in javadoc. Let me know if that is helpful.
import org.apache.pinot.core.io.writer.impl.v1.FixedBitSingleValueWriter; | ||
import org.apache.pinot.core.segment.memory.PinotDataBuffer; | ||
|
||
public class ForwardIndexBenchmark { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to use JMH to make benchmark more accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
import org.apache.pinot.core.segment.memory.PinotDataBuffer; | ||
|
||
|
||
public abstract class PinotDataBitSetV2 implements Closeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class needs exhaustive unit tests to ensure all cases are covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added several covering all possible cases. Will do another round in a follow-up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add todo to remove the dependency on having dictId size of MAX_DOCS_PER_CALL
private boolean shouldBulkRead(int[] docIds, int startIndex, int endIndex) { | ||
int numDocsToRead = endIndex - startIndex + 1; | ||
int docIdRange = docIds[endIndex] - docIds[startIndex] + 1; | ||
if (docIdRange > DocIdSetPlanNode.MAX_DOC_PER_CALL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this check needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is coming from the previous commit. The latest version of the PE doesn't have this check.
private volatile PinotDataBitSetV2 _dataBitSet; | ||
private final int _numBitsPerValue; | ||
|
||
public FixedBitIntReaderWriterV2(PinotDataBuffer dataBuffer, int numValues, int numBitsPerValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you do this, please write a standalone header buffer that can be used in other places
|
||
|
||
public final class FixedBitIntReaderWriterV2 implements Closeable { | ||
private volatile PinotDataBitSetV2 _dataBitSet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to make this volatile and set it to null in close(). This issue has been addressed in #4764
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
protected int _numBitsPerValue; | ||
|
||
/** | ||
* Unpack single dictId at the given docId. This is efficient |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't limit it to dictId
and docId
in the javadoc. The BitSet
is a general reader/writer which can be used for different purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* @param out out array to store the unpacked dictIds | ||
* @param outpos starting index in the out array | ||
*/ | ||
public void readInt(int[] docIds, int docIdsStartIndex, int length, int[] out, int outpos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For performance concern, we can remove the docIdsStartIndex
and outpos
and always assume they are 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the outer API in FixedBitIntReaderWriterV2 that calls this. That API tries to judge sparseness before deciding to do bulk read. The decision is made on a chunk of values at a time and this bulk API is called for each chunk.
int endDocId = docIds[docIdsStartIndex + length - 1]; | ||
int[] dictIds = THREAD_LOCAL_DICT_IDS.get(); | ||
// do a contiguous bulk read | ||
readInt(startDocId, endDocId - startDocId + 1, dictIds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work because docIds
might not be contiguous and endDocId - startDocId + 1
could be much larger than DocIdSetPlanNode.MAX_DOC_PER_CALL
. Also, we should not always do such bulk read, docIds
can be very sparse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the outer API in FixedBitIntReaderWriterV2 that calls this. That API tries to judge sparseness before deciding to do bulk read. The decision is made on a chunk of values at a time
@Override | ||
public int readInt(int index) { | ||
long bitOffset = (long) index * _numBitsPerValue; | ||
int byteOffset = (int) (bitOffset / Byte.SIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use long to index the dataBuffer so that we can handle big buffer (> 2G)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
public static PinotDataBitSetV2 createBitSet(PinotDataBuffer pinotDataBuffer, int numBitsPerValue) { | ||
switch (numBitsPerValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also have 0 (all zeros) and 1 (0/1)
For the benchmark, you should also compare the worst case scenario such as 3, 5, 9, 17 bits |
@Override | ||
public void close() | ||
throws IOException { | ||
_dataBuffer.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not close the buffer (see #5400)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
I am still working on adding faster methods for non power of 2. Follow up will address this. Right now it is standalone code (yet to wire in). |
Address the review comments. Please take another look. This is standalone code at this point so want to get this in sooner and put up follow-ups in next couple of days. |
Here are the JMH results: cc @kishoreg @Jackie-Jiang Please see the spreadsheet for additional performance numbers obtained via manual instrumentation
|
Merging it. Follow-ups coming next:
|
Couple of improvements have been done for bit-unpacking.
Right now, the new code isn't yet wired into existing bit reader and writer. Couple of follow-ups will be coming soon:
Description of changes:
A new version of FixedBitIntReaderWriter is written that underneath uses a new version of fast bit unpack reader PinotDataBitSetV2.
There are 3 important APIs here:
public int readInt(int index)
Exists in the current code as well - Used by the scan operator to read through the forward index and dictId for each docId
public void readInt(int startDocId, int length, int[] buffer)
Exists in the current code as well - Used by the multi-value bit reader to get dictId for all MVs in a given cell.
public void readValues(int[] docIds, int docIdStartIndex, int docIdLength, int[] values, int valuesStartIndex)
Exists at the FixedBitSingleColumnSingleValueReader interface and used by the dictionary based group by executor to get dictIds for a set of docIds (monotonically increasing but not necessarily contiguous). But the API still issued single read calls underneath. This PR introduces this API at the FixedBitIntReaderWriterV2 level so that group by executor can leverage it using the bulk read semantics.
When this code is wired in, the scan operator will start using one of the second or third API.
Please see the spreadsheet for performance numbers.
Two kinds of tests were done:
getInt(index)
with faster bit unpacking code.readInt(int startDocId, int length, int[] buffer)
with faster bit unpacking code.Will be adding some units tests. The current PR has performance test.