
Faster vectorized bit unpacking (Part 1) #5409

Merged 9 commits into apache:master on May 29, 2020

Conversation

siddharthteotia (Contributor):

A couple of improvements have been made for bit unpacking:

  • Use hand-written unpack methods when the number of bits used to encode the dictionaryId is a power of 2 (1, 2, 4, 8, 16, 32). The hand-written methods are faster than the generic method due to simplified bit math.
  • Amortize the overhead of function calls.
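To illustrate the simplified bit math (a sketch, not the PR's actual code): for a power-of-2 bit width such as 4, a value never straddles a byte boundary, so the generic cross-byte shift/mask logic disappears entirely:

```java
// Hypothetical hand-written 4-bit unpacker: each byte holds exactly two
// 4-bit values, so no value ever crosses a byte boundary and the generic
// cross-byte bit math is not needed.
public final class Unpack4 {
  // Unpacks 'length' 4-bit values starting at value index 'startIndex'
  // (assumed even here for simplicity) into 'out'.
  public static void unpack(byte[] packed, int startIndex, int length, int[] out) {
    int byteOffset = startIndex >>> 1; // two values per byte
    for (int i = 0; i < length; i += 2) {
      int b = packed[byteOffset++] & 0xFF;
      out[i] = b >>> 4;          // high nibble first (big-endian bit order)
      if (i + 1 < length) {
        out[i + 1] = b & 0x0F;   // low nibble
      }
    }
  }
}
```

The generic path, by contrast, must compute a variable shift and possibly combine bits from two adjacent bytes for every value.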

Right now, the new code isn't yet wired into the existing bit reader and writer. A couple of follow-ups will be coming soon:

  • Evaluate this optimization for a non power of 2 number of bits. It is feasible, but the performance benefit of a special hand-written unpack function seems to get lost because the bit math itself grows branches for non power of 2 bit widths.
  • Consider a new format where a non power of 2 number of bits is rounded up to the nearest power of 2. This means that if more than 16 bits are needed, we use 32 bits (the raw value). We get diminishing returns when the overhead of unpacking outweighs the 10-12 bits saved.
  • Integrate the new changes with existing code.

Description of changes:

A new version of FixedBitIntReaderWriter has been written that internally uses a new fast bit-unpack reader, PinotDataBitSetV2.

There are 3 important APIs here:

public int readInt(int index)
Exists in the current code as well. Used by the scan operator to read through the forward index and get the dictId for each docId.

public void readInt(int startDocId, int length, int[] buffer)
Exists in the current code as well. Used by the multi-value bit reader to get the dictIds for all MVs in a given cell.

public void readValues(int[] docIds, int docIdStartIndex, int docIdLength, int[] values, int valuesStartIndex)
Exists at the FixedBitSingleColumnSingleValueReader interface and is used by the dictionary-based group-by executor to get dictIds for a set of docIds (monotonically increasing but not necessarily contiguous). However, that API still issues single read calls underneath. This PR introduces the API at the FixedBitIntReaderWriterV2 level so that the group-by executor can leverage bulk read semantics.

When this code is wired in, the scan operator will start using either the second or the third API.
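For context, the pre-existing readValues behavior can be sketched as a loop of single reads (a simplified stand-in: the bit-packed forward index is simulated here with a plain int[]). The bulk semantics introduced in this PR avoid paying the per-call overhead for every docId:

```java
// Sketch (not the PR's code) of the old readValues behavior: each docId is
// resolved with an independent single-value read.
public final class SingleReadValues {
  private final int[] _unpacked; // stand-in for the bit-packed forward index

  public SingleReadValues(int[] unpacked) {
    _unpacked = unpacked;
  }

  // Stand-in for the single-value readInt(int index) API.
  private int readInt(int index) {
    return _unpacked[index];
  }

  // One read call per docId: correct, but the per-call overhead adds up.
  public void readValues(int[] docIds, int docIdStartIndex, int docIdLength,
      int[] values, int valuesStartIndex) {
    for (int i = 0; i < docIdLength; i++) {
      values[valuesStartIndex + i] = readInt(docIds[docIdStartIndex + i]);
    }
  }
}
```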

Please see the spreadsheet for performance numbers.

Two kinds of tests were done:

  • Compare the performance of sequential consecutive reads using the single-read API getInt(index) against the faster bit-unpacking code.
  • Compare the performance of sequential consecutive reads using the array API readInt(int startDocId, int length, int[] buffer) against the faster bit-unpacking code.

Unit tests will be added; the current PR has a performance test.

private volatile PinotDataBitSetV2 _dataBitSet;
private final int _numBitsPerValue;

public FixedBitIntReaderWriterV2(PinotDataBuffer dataBuffer, int numValues, int numBitsPerValue) {
Contributor:

Consider writing a header for backward/forward compatibility.

siddharthteotia (May 22, 2020):

Yes, in the follow-up when this code is wired in with the reader and writer (FixedBitSingleValueReader and FixedBitSingleValueWriter) and the scan operator, I will consider whether the format has to be changed, bump the version, and write a header.

Member:

when you do this, please write a standalone header buffer that can be used in other places

if (docIdRange > DocIdSetPlanNode.MAX_DOC_PER_CALL) {
return false;
}
return numDocsToRead >= ((double)docIdRange * 0.7);
Contributor:

Reasoning for magic number?

Contributor (Author):

I have provided details in the javadoc. Let me know if that is helpful.
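The heuristic under discussion can be sketched as follows. The 0.7 density threshold comes from the snippet above; MAX_DOC_PER_CALL is a stand-in for DocIdSetPlanNode.MAX_DOC_PER_CALL (the value used here is an assumption for the sketch):

```java
// Sketch of the bulk-read heuristic: bulk-unpack the whole docId range only
// when the requested docIds cover enough of it that the wasted unpacking is
// outweighed by the sequential-read speedup.
public final class BulkReadHeuristic {
  // Stand-in for DocIdSetPlanNode.MAX_DOC_PER_CALL (value assumed for the sketch).
  static final int MAX_DOC_PER_CALL = 10_000;

  static boolean shouldBulkRead(int[] docIds, int startIndex, int endIndex) {
    int numDocsToRead = endIndex - startIndex + 1;
    int docIdRange = docIds[endIndex] - docIds[startIndex] + 1;
    if (docIdRange > MAX_DOC_PER_CALL) {
      return false; // range too large for the fixed-size scratch buffer
    }
    // Bulk read pays off only when the docIds are dense within their range:
    // here, when at least 70% of the range is actually requested.
    return numDocsToRead >= (double) docIdRange * 0.7;
  }
}
```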

import org.apache.pinot.core.io.writer.impl.v1.FixedBitSingleValueWriter;
import org.apache.pinot.core.segment.memory.PinotDataBuffer;

public class ForwardIndexBenchmark {
Contributor:

Would be good to use JMH to make benchmark more accurate.

Contributor (Author):

done

import org.apache.pinot.core.segment.memory.PinotDataBuffer;


public abstract class PinotDataBitSetV2 implements Closeable {
Contributor:

This class needs exhaustive unit tests to ensure all cases are covered.

Contributor (Author):

Added several tests covering all possible cases. Will do another round in a follow-up.

@kishoreg (Member) left a comment:

Add a TODO to remove the dependency on having a dictId buffer of size MAX_DOC_PER_CALL.

private boolean shouldBulkRead(int[] docIds, int startIndex, int endIndex) {
int numDocsToRead = endIndex - startIndex + 1;
int docIdRange = docIds[endIndex] - docIds[startIndex] + 1;
if (docIdRange > DocIdSetPlanNode.MAX_DOC_PER_CALL) {
Member:

why is this check needed?

Contributor (Author):

I think this is coming from the previous commit. The latest version of the PR doesn't have this check.

public final class FixedBitIntReaderWriterV2 implements Closeable {
private volatile PinotDataBitSetV2 _dataBitSet;
Contributor:

You don't need to make this volatile and set it to null in close(). This issue has been addressed in #4764

Contributor (Author):

done

protected int _numBitsPerValue;

/**
* Unpack single dictId at the given docId. This is efficient
Contributor:

Don't limit it to dictId and docId in the javadoc. The BitSet is a general reader/writer which can be used for different purposes.

Contributor (Author):

done

* @param out out array to store the unpacked dictIds
* @param outpos starting index in the out array
*/
public void readInt(int[] docIds, int docIdsStartIndex, int length, int[] out, int outpos) {
Contributor:

As a performance optimization, we could remove docIdsStartIndex and outpos and always assume they are 0.

Contributor (Author):

See the outer API in FixedBitIntReaderWriterV2 that calls this. That API tries to judge sparseness before deciding to do a bulk read. The decision is made on a chunk of values at a time, and this bulk API is called for each chunk.

int endDocId = docIds[docIdsStartIndex + length - 1];
int[] dictIds = THREAD_LOCAL_DICT_IDS.get();
// do a contiguous bulk read
readInt(startDocId, endDocId - startDocId + 1, dictIds);
Contributor:

This won't work because docIds might not be contiguous and endDocId - startDocId + 1 could be much larger than DocIdSetPlanNode.MAX_DOC_PER_CALL. Also, we should not always do such bulk read, docIds can be very sparse.

Contributor (Author):

See the outer API in FixedBitIntReaderWriterV2 that calls this. That API tries to judge sparseness before deciding to do a bulk read. The decision is made on a chunk of values at a time.
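The bulk-then-gather pattern being discussed can be sketched as follows (simplified: the forward index is simulated with a plain int[], and a freshly allocated array stands in for the thread-local scratch buffer from THREAD_LOCAL_DICT_IDS):

```java
// Sketch (not the PR's code) of bulk-then-gather: unpack the whole contiguous
// [startDocId, endDocId] range in one bulk read, then pick out just the
// requested docIds. This wastes some unpacking on unrequested docIds, which is
// why the caller first checks that the docIds are dense enough.
public final class BulkGather {
  private final int[] _forwardIndex; // stand-in for the bit-packed data

  public BulkGather(int[] forwardIndex) {
    _forwardIndex = forwardIndex;
  }

  // Simulates the contiguous bulk unpack readInt(startDocId, length, buffer).
  private void readInt(int startDocId, int length, int[] buffer) {
    System.arraycopy(_forwardIndex, startDocId, buffer, 0, length);
  }

  public void readInt(int[] docIds, int docIdsStartIndex, int length, int[] out, int outpos) {
    int startDocId = docIds[docIdsStartIndex];
    int endDocId = docIds[docIdsStartIndex + length - 1];
    int[] dictIds = new int[endDocId - startDocId + 1]; // thread-local in the real code
    readInt(startDocId, dictIds.length, dictIds);       // one contiguous bulk read
    for (int i = 0; i < length; i++) {
      // Gather only the requested docIds out of the bulk-unpacked range.
      out[outpos + i] = dictIds[docIds[docIdsStartIndex + i] - startDocId];
    }
  }
}
```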

@Override
public int readInt(int index) {
long bitOffset = (long) index * _numBitsPerValue;
int byteOffset = (int) (bitOffset / Byte.SIZE);
Contributor:

Use long to index the dataBuffer so that we can handle big buffers (> 2GB).

Contributor (Author):

done
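The fix amounts to keeping the offset arithmetic in long throughout. A minimal sketch (the buffer is simulated with a byte[] and an 8-bit width is assumed for the demo; the real code reads a PinotDataBuffer, which accepts long offsets):

```java
// Sketch of the long-offset fix: computing the byte offset in long keeps
// index * numBitsPerValue from overflowing int for buffers larger than 2GB.
public final class LongOffsetRead {
  private final byte[] _data;           // stand-in for PinotDataBuffer
  private final int _numBitsPerValue;   // assumed 8 for this demo

  public LongOffsetRead(byte[] data, int numBitsPerValue) {
    _data = data;
    _numBitsPerValue = numBitsPerValue;
  }

  public int readInt(int index) {
    long bitOffset = (long) index * _numBitsPerValue; // widen before multiply
    long byteOffset = bitOffset / Byte.SIZE;          // stays in long, not int
    return _data[(int) byteOffset] & 0xFF;            // byte[] demo still needs an int index
  }
}
```

Note that the cast happens on the widened multiply, not on the result of an int overflow; with `int byteOffset = (int) (bitOffset / Byte.SIZE)` the truncation merely moves to the assignment.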

}

public static PinotDataBitSetV2 createBitSet(PinotDataBuffer pinotDataBuffer, int numBitsPerValue) {
switch (numBitsPerValue) {
Contributor:

We might also have 0 bits (all zeros) and 1 bit (0/1).
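An illustrative shape for such a factory, including the 0-bit and 1-bit cases mentioned here (the class name, the IntUnaryOperator return type, and the byte[] input are for the sketch only; the real factory returns PinotDataBitSetV2 implementations over a PinotDataBuffer):

```java
import java.util.function.IntUnaryOperator;

// Hypothetical factory sketch: dispatch to a specialized single-value reader
// per power-of-2 bit width, with big-endian bit order within each byte.
public final class BitSetFactory {
  public static IntUnaryOperator createReader(byte[] packed, int numBitsPerValue) {
    switch (numBitsPerValue) {
      case 0:
        return index -> 0; // all values are zero; nothing is stored
      case 1:
        return index -> (packed[index >>> 3] >>> (7 - (index & 7))) & 1;
      case 2:
        return index -> (packed[index >>> 2] >>> ((3 - (index & 3)) << 1)) & 3;
      case 4:
        return index -> (packed[index >>> 1] >>> ((1 - (index & 1)) << 2)) & 0xF;
      case 8:
        return index -> packed[index] & 0xFF;
      default:
        throw new IllegalArgumentException("unsupported bit width: " + numBitsPerValue);
    }
  }
}
```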

@Jackie-Jiang (Contributor):

For the benchmark, you should also compare worst-case scenarios such as 3, 5, 9, 17 bits.

@Override
public void close()
throws IOException {
_dataBuffer.close();
Contributor:

Do not close the buffer (see #5400)

Contributor (Author):

done

@siddharthteotia commented May 27, 2020:

> For the benchmark, you should also compare the worst case scenario such as 3, 5, 9, 17 bits

I am still working on adding faster methods for non power of 2 bit widths. A follow-up will address this. Right now this is standalone code (yet to be wired in).

@siddharthteotia commented May 28, 2020:

Addressed the review comments. Please take another look. This is standalone code at this point, so I want to get it in sooner and put up follow-ups in the next couple of days.

@siddharthteotia commented May 28, 2020:

Here are the JMH results: cc @kishoreg @Jackie-Jiang

Please see the spreadsheet for additional performance numbers obtained via manual instrumentation.

| Benchmark | Score | Mode |
| --- | --- | --- |
| BenchmarkPinotDataBitSet.twoBitBulkWithGaps | 28.537 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.twoBitBulkWithGapsFast | 44.332 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.fourBitBulkWithGaps | 28.262 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.fourBitBulkWithGapsFast | 42.449 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.eightBitBulkWithGaps | 26.816 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.eightBitBulkWithGapsFast | 38.777 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.sixteenBitBulkWithGaps | 21.095 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.sixteenBitBulkWithGapsFast | 38.380 ops/ms | Throughput |
| BenchmarkPinotDataBitSet.twoBitContiguous | 26.188 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitContiguousFast | 15.692 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitContiguous | 26.113 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitContiguousFast | 15.178 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitContiguous | 26.693 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitContiguousFast | 15.726 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitContiguous | 43.968 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitContiguousFast | 21.390 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitBulkContiguous | 32.504 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitBulkContiguousFast | 14.614 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitBulkContiguous | 30.794 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitBulkContiguousFast | 14.583 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitBulkContiguous | 16.525 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitBulkContiguousFast | 10.777 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitBulkContiguous | 54.731 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitBulkContiguousFast | 19.312 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitBulkContiguousUnaligned | 32.018 ms/op | Avgt |
| BenchmarkPinotDataBitSet.twoBitBulkContiguousUnalignedFast | 14.344 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitBulkContiguousUnaligned | 21.204 ms/op | Avgt |
| BenchmarkPinotDataBitSet.eightBitBulkContiguousUnalignedFast | 15.393 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitBulkContiguousUnaligned | 20.125 ms/op | Avgt |
| BenchmarkPinotDataBitSet.fourBitBulkContiguousUnalignedFast | 14.836 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitBulkContiguousUnaligned | 58.086 ms/op | Avgt |
| BenchmarkPinotDataBitSet.sixteenBitBulkContiguousUnalignedFast | 22.002 ms/op | Avgt |

@siddharthteotia:

Merging it.

Follow-ups coming next:

  • Wire in with the existing reader and writer. Currently, the fast methods here can be used by the existing reader and writer if the index is power of 2 bit encoded.
  • Consider changing the format to Little Endian
  • Consider aligning the bytes on the writer side. This will remove a few branches for 2/4 bit encoding to handle unaligned reads. It will also make it easier to add fast methods for non power of 2 bit encodings.

@siddharthteotia siddharthteotia merged commit b40dd99 into apache:master May 29, 2020
@siddharthteotia siddharthteotia changed the title Faster bit unpacking (Part 1) Faster vectorized bit unpacking (Part 1) Jun 5, 2020