
Faster vectorized bit unpacking (Part 1) #5409

Merged 9 commits into apache:master on May 29, 2020

Conversation

siddharthteotia (Contributor):

A couple of improvements have been made for bit unpacking:

  • Use hand-written unpack methods when the number of bits used to encode the dictionaryId is a power of 2 (1, 2, 4, 8, 16, 32). The hand-written methods are faster than the generic method due to simplified bit math.
  • Amortize the overhead of function calls.
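To illustrate the simplified bit math (a sketch, not the PR's actual code): for a power-of-2 bit width such as 4, a value never straddles a byte boundary, so the generic cross-byte shift/mask logic disappears entirely:

```java
// Hypothetical hand-written 4-bit unpacker: each byte holds exactly two
// 4-bit values, so no value ever crosses a byte boundary and the generic
// cross-byte bit math is not needed.
public final class Unpack4 {
  // Unpacks 'length' 4-bit values starting at value index 'startIndex'
  // (assumed even here for simplicity) into 'out'.
  public static void unpack(byte[] packed, int startIndex, int length, int[] out) {
    int byteOffset = startIndex >>> 1; // two values per byte
    for (int i = 0; i < length; i += 2) {
      int b = packed[byteOffset++] & 0xFF;
      out[i] = b >>> 4;          // high nibble first (big-endian bit order)
      if (i + 1 < length) {
        out[i + 1] = b & 0x0F;   // low nibble
      }
    }
  }
}
```

The generic path, by contrast, must compute a variable shift and possibly combine bits from two adjacent bytes for every value.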

Right now, the new code isn't yet wired into the existing bit reader and writer. A couple of follow-ups will be coming soon:

  • Evaluate this optimization for a non power of 2 number of bits. It is feasible, but the performance benefit of a special hand-written unpack function seems to get lost because the bit math itself grows branches for non power of 2 bit widths.
  • Consider a new format where a non power of 2 number of bits is rounded up to the nearest power of 2. This means that if more than 16 bits are needed, we use 32 bits (the raw value). We get diminishing returns when the overhead of unpacking outweighs the 10-12 bits saved.
  • Integrate the new changes with existing code.

Description of changes:

A new version of FixedBitIntReaderWriter has been written that internally uses a new fast bit-unpack reader, PinotDataBitSetV2.

There are 3 important APIs here:

public int readInt(int index)
Exists in the current code as well. Used by the scan operator to read through the forward index and get the dictId for each docId.

public void readInt(int startDocId, int length, int[] buffer)
Exists in the current code as well. Used by the multi-value bit reader to get the dictIds for all MVs in a given cell.

public void readValues(int[] docIds, int docIdStartIndex, int docIdLength, int[] values, int valuesStartIndex)
Exists at the FixedBitSingleColumnSingleValueReader interface and is used by the dictionary-based group-by executor to get dictIds for a set of docIds (monotonically increasing but not necessarily contiguous). However, that API still issues single read calls underneath. This PR introduces the API at the FixedBitIntReaderWriterV2 level so that the group-by executor can leverage bulk read semantics.

When this code is wired in, the scan operator will start using either the second or the third API.
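For context, the pre-existing readValues behavior can be sketched as a loop of single reads (a simplified stand-in: the bit-packed forward index is simulated here with a plain int[]). The bulk semantics introduced in this PR avoid paying the per-call overhead for every docId:

```java
// Sketch (not the PR's code) of the old readValues behavior: each docId is
// resolved with an independent single-value read.
public final class SingleReadValues {
  private final int[] _unpacked; // stand-in for the bit-packed forward index

  public SingleReadValues(int[] unpacked) {
    _unpacked = unpacked;
  }

  // Stand-in for the single-value readInt(int index) API.
  private int readInt(int index) {
    return _unpacked[index];
  }

  // One read call per docId: correct, but the per-call overhead adds up.
  public void readValues(int[] docIds, int docIdStartIndex, int docIdLength,
      int[] values, int valuesStartIndex) {
    for (int i = 0; i < docIdLength; i++) {
      values[valuesStartIndex + i] = readInt(docIds[docIdStartIndex + i]);
    }
  }
}
```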

Please see the spreadsheet for performance numbers.

Two kinds of tests were done:

  • Compare the performance of sequential consecutive reads using the single-read API getInt(index) against the faster bit-unpacking code.
  • Compare the performance of sequential consecutive reads using the array API readInt(int startDocId, int length, int[] buffer) against the faster bit-unpacking code.

Unit tests will be added; the current PR has a performance test.

private volatile PinotDataBitSetV2 _dataBitSet;
private final int _numBitsPerValue;

public FixedBitIntReaderWriterV2(PinotDataBuffer dataBuffer, int numValues, int numBitsPerValue) {
Contributor:

Consider writing a header for backward/forward compatibility.

siddharthteotia (May 22, 2020):

Yes, in the follow-up when this code is wired in with the reader and writer (FixedBitSingleValueReader and FixedBitSingleValueWriter) and the scan operator, I will consider whether the format has to be changed, bump the version, and write a header.

Member:

when you do this, please write a standalone header buffer that can be used in other places

if (docIdRange > DocIdSetPlanNode.MAX_DOC_PER_CALL) {
return false;
}
return numDocsToRead >= ((double)docIdRange * 0.7);
Contributor:

Reasoning for magic number?

Contributor (Author):

I have provided details in the javadoc. Let me know if that is helpful.
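The heuristic under discussion can be sketched as follows. The 0.7 density threshold comes from the snippet above; MAX_DOC_PER_CALL is a stand-in for DocIdSetPlanNode.MAX_DOC_PER_CALL (the value used here is an assumption for the sketch):

```java
// Sketch of the bulk-read heuristic: bulk-unpack the whole docId range only
// when the requested docIds cover enough of it that the wasted unpacking is
// outweighed by the sequential-read speedup.
public final class BulkReadHeuristic {
  // Stand-in for DocIdSetPlanNode.MAX_DOC_PER_CALL (value assumed for the sketch).
  static final int MAX_DOC_PER_CALL = 10_000;

  static boolean shouldBulkRead(int[] docIds, int startIndex, int endIndex) {
    int numDocsToRead = endIndex - startIndex + 1;
    int docIdRange = docIds[endIndex] - docIds[startIndex] + 1;
    if (docIdRange > MAX_DOC_PER_CALL) {
      return false; // range too large for the fixed-size scratch buffer
    }
    // Bulk read pays off only when the docIds are dense within their range:
    // here, when at least 70% of the range is actually requested.
    return numDocsToRead >= (double) docIdRange * 0.7;
  }
}
```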

import org.apache.pinot.core.io.writer.impl.v1.FixedBitSingleValueWriter;
import org.apache.pinot.core.segment.memory.PinotDataBuffer;

public class ForwardIndexBenchmark {
Contributor:

Would be good to use JMH to make benchmark more accurate.

Contributor (Author):

done

import org.apache.pinot.core.segment.memory.PinotDataBuffer;


public abstract class PinotDataBitSetV2 implements Closeable {
Contributor:

This class needs exhaustive unit tests to ensure all cases are covered.

Contributor (Author):

Added several tests covering all possible cases. Will do another round in a follow-up.

@kishoreg (Member) left a comment:

Add a TODO to remove the dependency on having a dictId buffer of size MAX_DOC_PER_CALL.

private boolean shouldBulkRead(int[] docIds, int startIndex, int endIndex) {
int numDocsToRead = endIndex - startIndex + 1;
int docIdRange = docIds[endIndex] - docIds[startIndex] + 1;
if (docIdRange > DocIdSetPlanNode.MAX_DOC_PER_CALL) {
Member:

why is this check needed?

Contributor (Author):

I think this is coming from the previous commit. The latest version of the PR doesn't have this check.

public final class FixedBitIntReaderWriterV2 implements Closeable {
private volatile PinotDataBitSetV2 _dataBitSet;
Contributor:

You don't need to make this volatile and set it to null in close(). This issue has been addressed in #4764

Contributor (Author):

done

protected int _numBitsPerValue;

/**
* Unpack single dictId at the given docId. This is efficient
Contributor:

Don't limit it to dictId and docId in the javadoc. The BitSet is a general reader/writer which can be used for different purposes.

Contributor (Author):

done

* @param out out array to store the unpacked dictIds
* @param outpos starting index in the out array
*/
public void readInt(int[] docIds, int docIdsStartIndex, int length, int[] out, int outpos) {
Contributor:

As a performance optimization, we could remove docIdsStartIndex and outpos and always assume they are 0.

Contributor (Author):

See the outer API in FixedBitIntReaderWriterV2 that calls this. That API tries to judge sparseness before deciding to do a bulk read. The decision is made on a chunk of values at a time, and this bulk API is called for each chunk.

int endDocId = docIds[docIdsStartIndex + length - 1];
int[] dictIds = THREAD_LOCAL_DICT_IDS.get();
// do a contiguous bulk read
readInt(startDocId, endDocId - startDocId + 1, dictIds);
Contributor:

This won't work because docIds might not be contiguous and endDocId - startDocId + 1 could be much larger than DocIdSetPlanNode.MAX_DOC_PER_CALL. Also, we should not always do such bulk read, docIds can be very sparse.

Contributor (Author):

See the outer API in FixedBitIntReaderWriterV2 that calls this. That API tries to judge sparseness before deciding to do a bulk read. The decision is made on a chunk of values at a time.
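The bulk-then-gather pattern being discussed can be sketched as follows (simplified: the forward index is simulated with a plain int[], and a freshly allocated array stands in for the thread-local scratch buffer from THREAD_LOCAL_DICT_IDS):

```java
// Sketch (not the PR's code) of bulk-then-gather: unpack the whole contiguous
// [startDocId, endDocId] range in one bulk read, then pick out just the
// requested docIds. This wastes some unpacking on unrequested docIds, which is
// why the caller first checks that the docIds are dense enough.
public final class BulkGather {
  private final int[] _forwardIndex; // stand-in for the bit-packed data

  public BulkGather(int[] forwardIndex) {
    _forwardIndex = forwardIndex;
  }

  // Simulates the contiguous bulk unpack readInt(startDocId, length, buffer).
  private void readInt(int startDocId, int length, int[] buffer) {
    System.arraycopy(_forwardIndex, startDocId, buffer, 0, length);
  }

  public void readInt(int[] docIds, int docIdsStartIndex, int length, int[] out, int outpos) {
    int startDocId = docIds[docIdsStartIndex];
    int endDocId = docIds[docIdsStartIndex + length - 1];
    int[] dictIds = new int[endDocId - startDocId + 1]; // thread-local in the real code
    readInt(startDocId, dictIds.length, dictIds);       // one contiguous bulk read
    for (int i = 0; i < length; i++) {
      // Gather only the requested docIds out of the bulk-unpacked range.
      out[outpos + i] = dictIds[docIds[docIdsStartIndex + i] - startDocId];
    }
  }
}
```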

@Override
public int readInt(int index) {
long bitOffset = (long) index * _numBitsPerValue;
int byteOffset = (int) (bitOffset / Byte.SIZE);
Contributor:

Use long to index the dataBuffer so that we can handle big buffers (> 2GB).

Contributor (Author):

done
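The fix amounts to keeping the offset arithmetic in long throughout. A minimal sketch (the buffer is simulated with a byte[] and an 8-bit width is assumed for the demo; the real code reads a PinotDataBuffer, which accepts long offsets):

```java
// Sketch of the long-offset fix: computing the byte offset in long keeps
// index * numBitsPerValue from overflowing int for buffers larger than 2GB.
public final class LongOffsetRead {
  private final byte[] _data;           // stand-in for PinotDataBuffer
  private final int _numBitsPerValue;   // assumed 8 for this demo

  public LongOffsetRead(byte[] data, int numBitsPerValue) {
    _data = data;
    _numBitsPerValue = numBitsPerValue;
  }

  public int readInt(int index) {
    long bitOffset = (long) index * _numBitsPerValue; // widen before multiply
    long byteOffset = bitOffset / Byte.SIZE;          // stays in long, not int
    return _data[(int) byteOffset] & 0xFF;            // byte[] demo still needs an int index
  }
}
```

Note that the cast happens on the widened multiply, not on the result of an int overflow; with `int byteOffset = (int) (bitOffset / Byte.SIZE)` the truncation merely moves to the assignment.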

}

public static PinotDataBitSetV2 createBitSet(PinotDataBuffer pinotDataBuffer, int numBitsPerValue) {
switch (numBitsPerValue) {
Contributor:

We might also have 0 bits (all zeros) and 1 bit (0/1).
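An illustrative shape for such a factory, including the 0-bit and 1-bit cases mentioned here (the class name, the IntUnaryOperator return type, and the byte[] input are for the sketch only; the real factory returns PinotDataBitSetV2 implementations over a PinotDataBuffer):

```java
import java.util.function.IntUnaryOperator;

// Hypothetical factory sketch: dispatch to a specialized single-value reader
// per power-of-2 bit width, with big-endian bit order within each byte.
public final class BitSetFactory {
  public static IntUnaryOperator createReader(byte[] packed, int numBitsPerValue) {
    switch (numBitsPerValue) {
      case 0:
        return index -> 0; // all values are zero; nothing is stored
      case 1:
        return index -> (packed[index >>> 3] >>> (7 - (index & 7))) & 1;
      case 2:
        return index -> (packed[index >>> 2] >>> ((3 - (index & 3)) << 1)) & 3;
      case 4:
        return index -> (packed[index >>> 1] >>> ((1 - (index & 1)) << 2)) & 0xF;
      case 8:
        return index -> packed[index] & 0xFF;
      default:
        throw new IllegalArgumentException("unsupported bit width: " + numBitsPerValue);
    }
  }
}
```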

@Jackie-Jiang (Contributor):

For the benchmark, you should also compare worst-case scenarios such as 3, 5, 9, 17 bits.

@Override
public void close()
throws IOException {
_dataBuffer.close();
Contributor:

Do not close the buffer (see #5400)

Contributor (Author):

done

@siddharthteotia commented May 27, 2020:

> For the benchmark, you should also compare the worst case scenario such as 3, 5, 9, 17 bits

I am still working on adding faster methods for non power of 2 bit widths. A follow-up will address this. Right now this is standalone code (yet to be wired in).

@siddharthteotia commented May 28, 2020:

Addressed the review comments. Please take another look. This is standalone code at this point, so I want to get it in sooner and put up follow-ups in the next couple of days.

@siddharthteotia commented May 28, 2020:

Here are the JMH results: cc @kishoreg @Jackie-Jiang

Please see the spreadsheet for additional performance numbers obtained via manual instrumentation.

| Benchmark | Score | Mode |
| --- | --- | --- |
| BenchmarkPinotDataBitSet.twoBitBulkWithGaps | 28.537 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.twoBitBulkWithGapsFast | 44.332 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.fourBitBulkWithGaps | 28.262 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.fourBitBulkWithGapsFast | 42.449 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.eightBitBulkWithGaps | 26.816 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.eightBitBulkWithGapsFast | 38.777 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.sixteenBitBulkWithGaps | 21.095 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.sixteenBitBulkWithGapsFast | 38.380 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.twoBitContiguous | 26.188 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitContiguousFast | 15.692 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitContiguous | 26.113 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitContiguousFast | 15.178 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitContiguous | 26.693 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitContiguousFast | 15.726 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitContiguous | 43.968 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitContiguousFast | 21.390 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitBulkContiguous | 32.504 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitBulkContiguousFast | 14.614 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitBulkContiguous | 30.794 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitBulkContiguousFast | 14.583 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitBulkContiguous | 16.525 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitBulkContiguousFast | 10.777 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitBulkContiguous | 54.731 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitBulkContiguousFast | 19.312 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitBulkContiguousUnaligned | 32.018 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitBulkContiguousUnalignedFast | 14.344 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitBulkContiguousUnaligned | 21.204 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitBulkContiguousUnalignedFast | 15.393 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitBulkContiguousUnaligned | 20.125 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitBulkContiguousUnalignedFast | 14.836 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitBulkContiguousUnaligned | 58.086 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitBulkContiguousUnalignedFast | 22.002 ms/op | Avgt |

@siddharthteotia:

Merging it.

Follow-ups coming next:

  • Wire in with the existing reader and writer. Currently, the fast methods here can be used by the existing reader and writer if the index is power of 2 bit encoded.
  • Consider changing the format to Little Endian
  • Consider aligning the bytes on the writer side. This will remove a few branches for 2/4 bit encoding to handle unaligned reads. It will also make it easier to add fast methods for non power of 2 bit encodings.

@siddharthteotia siddharthteotia merged commit b40dd99 into apache:master May 29, 2020
@siddharthteotia siddharthteotia changed the title Faster bit unpacking (Part 1) Faster vectorized bit unpacking (Part 1) Jun 5, 2020