-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1342:Add bloom filter utility class #425
Conversation
42e6c77
to
1d3170f
Compare
CI check failed on thrift build. No idea why.. |
Building is failed. If you change the thrift file in Parquet-format, you have to wait for the commitment for that part at first to make CI happy. |
@winningsix In this patch I don't change any thrift file but just add Bloom and Murmur3 class. Strange. |
@cjjnjust you can use the command locally to do a build. |
@winningsix /home/travis/build/apache/parquet-mr/thrift-0.7.0/lib/php/src/ext/thrift_protocol/php_thrift_protocol.cpp:95:8: error: ‘function_entry’ does not name a type I search this and found a JIRA https://issues.apache.org/jira/browse/THRIFT-1602. I recall I met this error when building thrift also and I fix this by remove PHP support in thrift. |
@winningsix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just started. Given the number of comments I have left, I may not be able to give this the review I believe it deserves.
// Bytes in a bucket. | ||
public static final int BYTES_PER_BUCKET = 32; | ||
|
||
// Minimum bloom filter data size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the unit?
* underlying class of Bloom Filter which stores a bit set represents elements set, hash strategy and bloom filter | ||
* algorithm. | ||
* | ||
* Bloom Filter algorithm is implemented using block Bloom filters from Putze et al.'s "Cache-, Hash- and Space-Efficient Bloom |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is very long. Does this project have line length practices?
import org.apache.parquet.schema.PrimitiveType; | ||
|
||
/** | ||
* Bloom Filter is a compat structure to indicate whether an item is not in set or probably in set. Bloom class is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"compact"
// Algorithm applied of this bloom filter. | ||
public ALGORITHM bloomFilterAlgorithm = ALGORITHM.BLOCK; | ||
|
||
private HashSet<Long> elements = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put a comment here explaining why this exists. The same applies for the members below.
* Create a new bitset for bloom filter, at least 256 bits will be create. | ||
* @param numBytes number of bytes for bit set. | ||
*/ | ||
public void initBitset(int numBytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is incorrect: it must force the number of bytes allocated to be a multiple of 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will force to bucket size alignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - I meant power of 2. This must force the number of bytes allocated to be a power of 2, which I do not believe it does.
* Element is represented by hash in bloom filter. The hash function takes plain encoding | ||
* of element as input. | ||
*/ | ||
public Encoding getEncoding() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused
* @return Bytes buffered of bloom filter. | ||
*/ | ||
public long getBufferedSize() { | ||
return elements.size() * 8 + numBytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
What do you mean by buffered bytes?
-
Also, why multiply by 8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Buffered bytes is the total number of bytes we allocate in heap for bloom filter, please refer to similar definition in dictionaryValueWriter.
-
Elements are stored in Long which is 8 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Please add a comment that explains it as well as the method on https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-column/src/main/java/org/apache/parquet/column/values/ValuesWriter.java
-
A HashSet will use a good bit more than 8 heap bytes per value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean of "A HashSet will use a good bit more than 8 heap bytes per value."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to say it another way. A HashSet containing N Longs will use substantially more than N*8 bytes of memory. This is because of object overhead, empty space, and other overheads associated with hash tables occupy heap space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
understood!
/** | ||
* @return Bytes buffered of bloom filter. | ||
*/ | ||
public long getAllocatedSize() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean?
return new FloatBloom(size, hash, algorithm); | ||
case DOUBLE: | ||
return new DoubleBloom(size, hash, algorithm); | ||
case BINARY: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need lines 331 or 333, right?
} | ||
|
||
public static class BinaryBloom extends Bloom<Binary> { | ||
private CapacityByteArrayOutputStream arrayout = new CapacityByteArrayOutputStream(1024, 64 * 1024, new HeapByteBufferAllocator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the constants? Give them courtesy names, perhaps.
// Bytes in a bucket. | ||
public static final int BYTES_PER_BUCKET = 32; | ||
|
||
// Minimum bloom filter data size in byte. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should the minimum size be 256 bytes, rather than 32 bytes?
// The underlying byte array for bloom filter bitset. | ||
private byte[] bitset; | ||
|
||
// The size of bitset in byte. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's already stored in bitset.length, right? Why duplicate it?
// The size of bitset in byte. | ||
private int numBytes; | ||
|
||
// List of byte input to construct the bloom filter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments to the file: Under what circumstances is it used? What are the data structure invariants? Can both it and bitset be non-null?
Preconditions.checkArgument((p > 0.0 && p < 1.0), | ||
"FPP should be less than 1.0 and great than 0.0"); | ||
|
||
int bits = (int)(-n * Math.log(p) / (Math.log(2) * Math.log(2))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not believe this math is correct when k, the number of hash functions, is fixed at 8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is from https://en.wikipedia.org/wiki/Bloom_filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that wikipedia page supports your statement when k is fixed at 8.
* all elements in cache. If bitset was already created and set, it do nothing. | ||
*/ | ||
public void flush() { | ||
if (!elements.isEmpty() && bitset == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this silently do nothing if !elements.isEmpty() but bitset != null? Should it throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original bloomfilter value writer has two path, one is writing values to a hash set as cache, and finalize bloom filter when finalize the column chunk. Another is setting bits immediately when writing values which does not need a flush since the bitset is ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what the "original bloom filter value writer" is. This is the first one I have seen.
Please add a check here for this case, or, preferably, move the buffering BF construction out of the BF class.
} | ||
|
||
for (int i = 0; i < 8; ++i) { | ||
mask[i] = mask[i] >> 27; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to me like a correctness issue: it should be unsigned right shift, yes?
Preconditions.checkArgument((p > 0.0 && p < 1.0), | ||
"FPP should be less than 1.0 and great than 0.0"); | ||
|
||
int bits = (int)(-n * Math.log(p) / (Math.log(2) * Math.log(2))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if this overflows - can bits end up negative, or just truncated and too low? Please consider that and add some checks.
|
||
int bits = (int)(-n * Math.log(p) / (Math.log(2) * Math.log(2))); | ||
|
||
// Get next power of 2 if bits is not power of 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return getBufferedSize(); | ||
} | ||
|
||
public String memUsageString(String prefix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any callers
); | ||
} | ||
|
||
public static Bloom getBloomOnType(PrimitiveType.PrimitiveTypeName type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any callers.
bits++; | ||
|
||
return bits; | ||
return Integer.highestOneBit(bits) << 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check if bits is 0, negative, or a power of 2 first.
maxSlabSize, new HeapByteBufferAllocator()); | ||
private LittleEndianDataOutputStream out = new LittleEndianDataOutputStream(arrayout); | ||
|
||
final int maxSlabSize = 64 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final compile-time constants are often given names "LIKE_THIS"
byte[] encoded = BytesInput.from(arrayout).toByteArray(); | ||
arrayout.reset(); | ||
switch (bloomFilterHash) { | ||
PlainValuesWriter plainValuesWriter = new PlainValuesWriter(value.length() + 4, maxSlabSize, new HeapByteBufferAllocator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is complex enough to deserve a test.
bits = Integer.highestOneBit(bits) << 1; | ||
|
||
if (bits < 0) { | ||
bits = ParquetProperties.DEFAULT_DICTIONARY_PAGE_SIZE * 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain in the comments why you pick this number of bits?
22cc9af
to
1918260
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A number of other previous comments were not addressed. This is an example. you can see old comments in the "Conversation" tab. Older ones are collapsed, but you can open them with the button in the upper-right-hand corner.
* Default false positive probability value use to calculate optimal number of bits | ||
* used by bloom filter. | ||
*/ | ||
public final double FPP = 0.05; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed review comment: 0.01 is closer to the optimal range for this FPP.
Also, new comment: Maybe call it DEFAULT_FPP, since the fpp can be different?
* 32-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#94 | ||
* 128-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#255 | ||
* | ||
* This is a public domain code with no copyrights. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use the code directly and use JNI to call it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will introduce a c++ dependency, not sure whether current parquet-mr has any C++ dependency yet. Also I didn't see any formal release of Murmur3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What methodology have you used to check that your Java translation computes the same values as the C++ version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked it manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are implementations of Murmur3 available elsewhere. Google guava has one, for example, that handles ints and longs directly. I don't think we should include this.
} | ||
|
||
@Test | ||
public void testMurmur3() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see some more tests on some different lengths.
} | ||
|
||
// exist can be true in a very low probability. | ||
boolean exist = binaryBloom.find(binaryBloom.hash(Binary.fromString("not exist"))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should include many more calls to find on different input - enough to be confident of the false positive rate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given 0.01 FPP, 10 false positive tests can get a 0.99 ^ 10 about 0.9 probability pass. I will update to 10 false positive tests here.
import static org.junit.Assert.assertTrue; | ||
|
||
|
||
public class TestBloom { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see the C++ version available and integration tested before committing either the Java or C++ version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can anyone assist with an implementation for parquet-cpp? If anyone needs help with "where to put the code" either I, @xhochy, or @majetideepak can assist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What integration test you mean? Interact with cpp version? Such as write to file using java version and read with cpp version?
I'd like to try to implement the cpp version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is what I mean.
public final ALGORITHM algorithm; | ||
|
||
// The underlying byte array for bloom filter bitset. | ||
private byte[] bitset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just store int[] bitset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since when serializing and deserializing you will need to additional considering about the byte order and converting byte array to int array which should have some overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doubles the memory requirement, and I don't think it is worth it -- doubly so since you can change the endianness with the subtraction bit below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intBuffer just use bitset as backup array, it doesn't double the memory size.
|
||
// Get next power of 2 if it is not power of 2. | ||
if ((numBytes & (numBytes - 1)) != 0) { | ||
numBytes = Integer.highestOneBit(numBytes) << 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be negative if a bit gets shifted into the highest-order position, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I updated overflow handle code following.
* @param out output stream to write | ||
*/ | ||
public void writeTo(OutputStream out) throws IOException { | ||
Preconditions.checkArgument(bitset.length > 0, "Bloom filter bitset length should be larger than 0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this possible? Java promises never to return a negative for the length of an array, right?
} | ||
|
||
for (int i = 0; i < 8; ++i) { | ||
mask[i] = mask[i] >>> 27; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For switching between little-endian and big-endian, I think you can subtract mask[i] from 31 after this operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to switch endianness now.
* 32-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#94 | ||
* 128-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#255 | ||
* | ||
* This is a public domain code with no copyrights. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What methodology have you used to check that your Java translation computes the same values as the C++ version?
* @param data - input byte array | ||
* @return - hashcode | ||
*/ | ||
public static int hash32(byte[] data) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make this public if the Bloom filter implementation does not use this. Same question for other methods.
public class TestBloom { | ||
@Test | ||
public void testIntBloom () throws IOException { | ||
Bloom bloom = new Bloom(279, Bloom.HASH.MURMUR3_X64_128, Bloom.ALGORITHM.BLOCK); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment explaining the rationale for the number 279
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just a value I want to check what bitset length will be next power of 2. I added a test for that.
* @param hash The hash strategy bloom filter apply. | ||
* @param algorithm The algorithm of bloom filter. | ||
*/ | ||
public Bloom(byte[] bitset, HASH hash, ALGORITHM algorithm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have stated my opinion in the past on this review that it is my opinion that all public methods of a class should be tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added write/read from outputstream in test.
addElement(hash); | ||
} | ||
|
||
elements.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elements = null allows the GC to clean it up, I suspect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you have elements = null, you no longer need elements.clear().
@@ -50,6 +50,7 @@ | |||
public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_PAGE_SIZE_CHECK = true; | |||
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100; | |||
public static final int DEFAULT_MAXIMUM_RECORD_COUNT_FOR_CHECK = 10000; | |||
public static final int DEFAULT_MAXIMUM_BLOOM_FILTER_SIZE = 16 * 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"SIZE" is ambiguous; prefer "BYTES"
4c4ecab
to
cae4ea7
Compare
Bloom.ALGORITHM.BLOCK); | ||
|
||
List<String> strings = new ArrayList<>(); | ||
RandomStr randomStr = new RandomStr(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably want to provide a seed to make this test pseudo-random but deterministic, so it's not flaky.
addElement(hash); | ||
} | ||
|
||
elements.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you have elements = null, you no longer need elements.clear().
@@ -162,7 +162,7 @@ private void initBitset(int numBytes) { | |||
* @param out output stream to write | |||
*/ | |||
public void writeTo(OutputStream out) throws IOException { | |||
Preconditions.checkArgument(bitset.length > 0, "Bloom filter bitset length should be larger than 0"); | |||
Preconditions.checkArgument(bitset != null, "Bloom filter bitset has not create yet."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not call flush in this case?
@@ -50,6 +50,7 @@ | |||
public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_PAGE_SIZE_CHECK = true; | |||
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100; | |||
public static final int DEFAULT_MAXIMUM_RECORD_COUNT_FOR_CHECK = 10000; | |||
public static final int DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES = 16 * 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default bloom filter max is 1/8th of the default row group size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This need to discuss.
|
||
public class Bloom { | ||
// Hash strategy available for bloom filter. | ||
public enum HASH { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Should be HashAlgorithm
.
} | ||
|
||
// Bloom filter algorithm. | ||
public enum ALGORITHM { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Should be Algorithm
(not all caps). Is there a better name for this? It isn't really an algorithm. It is a variant of the data structure.
public static final int HEADER_SIZE = 12; | ||
|
||
// Bytes in a bucket. | ||
public static final int BYTES_PER_BUCKET = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a better name than "bucket"? In the paper, each filter is called a block. That's an overused term in Hadoop, so maybe we should call it something more specific, like FilterBlock
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* @param hash The hash strategy bloom filter apply. | ||
* @param algorithm The algorithm of bloom filter. | ||
*/ | ||
public Bloom(int numBytes, HASH hash, ALGORITHM algorithm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt we are going to use this class for other bloom filter variants, so I don't see the need to pass the algorithm in. Same thing with the hash function. Since there is only one, let's avoid passing it explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to private and provide another constructor.
ByteBuffer plain = ByteBuffer.allocate(Integer.SIZE/Byte.SIZE); | ||
plain.order(ByteOrder.LITTLE_ENDIAN).putInt(value); | ||
switch (hash) { | ||
case MURMUR3_X64_128: return Murmur3.hash64(plain.array()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should hash a ByteBuffer
instead of byte[]
. That will require less copying for types like Binary
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
guava support ByteBuffer argument from version 23. I just update pom.xml also .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like guava 23.0 changes quite some APIs which cause several build issues in parquet-cli. I reverted back to use guava 20.0 to use hash64(byte[]) API firstly.
plain.order(ByteOrder.LITTLE_ENDIAN).putInt(value.length()); | ||
ByteArrayOutputStream baos = new ByteArrayOutputStream(value.length() + 4); | ||
baos.write(plain.array(), 0, 4); | ||
value.writeTo(baos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't copy. It should retrieve a byte buffer for the binary using toByteBuffer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated to use toByteBuffer , not sure whether this be consistent with plain encoding definition of parquet.
* Insert element to set represented by bloom bitset. | ||
* @param value the value to insert into bloom filter.. | ||
*/ | ||
public void insert(long value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires the hash of a value, not a value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
*/ | ||
public boolean find(long hash) { | ||
// Elements are in cache, flush them firstly. | ||
if (elements != null && !elements.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a class that handles reads mixed with writes. Reads and writes will be separated, so I'm a little concerned that this over-complicates the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this cache logic to keep simple.
* 32-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#94 | ||
* 128-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#255 | ||
* | ||
* This is a public domain code with no copyrights. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are implementations of Murmur3 available elsewhere. Google guava has one, for example, that handles ints and longs directly. I don't think we should include this.
* @return hash result | ||
*/ | ||
public long hash(Binary value) { | ||
return hashFunction.hashBytes(value.toByteBuffer().array()).asLong(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still a open since we agreed to use plain encoding input. In parquet thrift definition, BYTE_ARRAY is encoded with length, and FIXED_LEN_BYTE_ARRAY is encoded without length. In PlainValuesWriter.java, it stored length for Binary.
Do we need to still use plain encoding input for Binary? @rdblue @jbapple-cloudera
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to include the length.
Also, this shouldn't use .array()
without checking whether the buffer is direct or on heap, and without supplying an offset because the ByteBuffer may point to the middle of its backing array. Here's what I'm using in another project:
// MURMUR3 is the configured hash function from guava; this uses the 32-bit variant
public int hash(ByteBuffer value) {
if (value.hasArray()) {
return MURMUR3.hashBytes(value.array(),
value.arrayOffset() + value.position(),
value.arrayOffset() + value.remaining()).asInt();
} else {
byte[] copy = new byte[value.remaining()];
value.get(copy);
return MURMUR3.hashBytes(copy).asInt();
}
}
There should also be test cases for the following:
- Direct byte buffer
- Direct byte buffer with non-zero
position
- Heap byte buffer
- Heap byte buffer with non-zero
arrayOffset
- Heap byte buffer with non-zero
position
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ByteBuffer is from Binary API which already consider things you mentioned.
The same as (getBytes/getUnsafeBytes[no copy]).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use the ByteBuffer and add tests for the cases that I listed. ByteBuffers will require fewer copies. Otherwise, this will need to be rewritten before actually using it to produce bloom filters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand ByteBuffer should avoid copies, but Murmur3 from guava 20.0 only support byte[] parameter. We can change to use ByteBuffer when update to guava23.0.
The difference from your reference code is here we use Binary as input parameter, and we should trust the APIs from Binary class, shouldn't we? If it accept a ByteBuffer parameter, we should add tests you list accordingly.
In addition, in you reference code, why you add arrayOffset to remaining? remaining should be (limit -position) and irrelevant with arraryOffset, otherwise the length changes, am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trusting an API for correctness and using the most efficient method from that API are two different things. If you use toByteArray
, your code will often copy data it doesn't need to, which is why it will need to be replaced before actually using this. If Guava doesn't support ByteBuffer, is there another implementation that does? Can we update Guava?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can update Guava to 23.0, it needs some minor changes to fix build issues. Please have a look 5f0eab7 commit.
I also update maven-shade-plugin version since it show ArrayIndexOutOfBoundsException when creating shaded jar for "com/google/common/cache/LocalCache$EntrySet.class".
Hi @wesm |
Hi @winningsix |
7ee8817
to
8e59ed8
Compare
Hi developers Sorry for no update long time for bloom filter topic due to transition, now I'm back to move this forward start from this rebase. No obviously change in this commit, I just remove unrelated changes. @rdblue @jbapple-cloudera Could you please kindly to have a look? Many thanks |
remove some obsolete changes
Hi @rdblue Thanks in advanced. |
Change title to conform to subtask JIRA. |
ping @rdblue |
Hi @rdblue, do you get time to have a look? |
cbd1032
to
53f22e0
Compare
Can this be closed in favor of #425? |
Close this due to duplicated one : #521 |
This is an initial patch just include bloom filter itself.