PARQUET-1342:Add bloom filter utility class #425

chenjunjiedada · 2017-09-05T08:52:23Z

This is an initial patch that includes just the bloom filter itself.

chenjunjiedada · 2017-09-06T06:45:51Z

The CI check failed on the thrift build. No idea why.

winningsix · 2017-09-06T14:12:43Z

The build failed. If you change the thrift file in parquet-format, you have to wait for that change to be committed first to make CI happy.

chenjunjiedada · 2017-09-07T01:43:22Z

@winningsix In this patch I don't change any thrift file; I just add the Bloom and Murmur3 classes. Strange.

winningsix · 2017-09-07T01:45:12Z

@cjjnjust you can use the command locally to do a build.

chenjunjiedada · 2017-09-07T05:20:45Z

@winningsix
From the log we can see that the error happens when building thrift-0.7.0, and the details are related to PHP support.

/home/travis/build/apache/parquet-mr/thrift-0.7.0/lib/php/src/ext/thrift_protocol/php_thrift_protocol.cpp:95:8: error: ‘function_entry’ does not name a type

I searched for this and found a JIRA: https://issues.apache.org/jira/browse/THRIFT-1602. I recall hitting this error when building thrift before, and I fixed it by removing PHP support from thrift.

chenjunjiedada · 2017-09-07T13:05:37Z

@winningsix
I fixed the build error by adding --without-php when configuring thrift.

jbapple-cloudera

I've just started. Given the number of comments I have left, I may not be able to give this the review I believe it deserves.

jbapple-cloudera · 2017-09-13T21:28:39Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+  // Bytes in a bucket.
+  public static final int BYTES_PER_BUCKET = 32;
+
+  // Minimum bloom filter data size.


What is the unit?

jbapple-cloudera · 2017-09-13T21:33:42Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+ * underlying class of Bloom Filter which stores a bit set represents elements set, hash strategy and bloom filter
+ * algorithm.
+ *
+ * Bloom Filter algorithm is implemented using block Bloom filters from Putze et al.'s "Cache-, Hash- and Space-Efficient Bloom


This line is very long. Does this project have line length practices?

jbapple-cloudera · 2017-09-13T21:34:01Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+import org.apache.parquet.schema.PrimitiveType;
+
+/**
+ * Bloom Filter is a compat structure to indicate whether an item is not in set or probably in set. Bloom class is


jbapple-cloudera · 2017-09-13T21:34:37Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+  // Algorithm applied of this bloom filter.
+  public ALGORITHM bloomFilterAlgorithm = ALGORITHM.BLOCK;
+
+  private HashSet<Long> elements = new HashSet<>();


Please put a comment here explaining why this exists. The same applies for the members below.

jbapple-cloudera · 2017-09-13T21:38:00Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * Create a new bitset for bloom filter, at least 256 bits will be create.
+   * @param numBytes number of bytes for bit set.
+   */
+  public void initBitset(int numBytes) {


This method is incorrect: it must force the number of bytes allocated to be a multiple of 2.

It forces alignment to the bucket size.

Sorry - I meant power of 2. This must force the number of bytes allocated to be a power of 2, which I do not believe it does.

jbapple-cloudera · 2017-09-14T00:32:47Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * Element is represented by hash in bloom filter. The hash function takes plain encoding
+   * of element as input.
+   */
+  public Encoding getEncoding() {


jbapple-cloudera · 2017-09-14T00:33:45Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * @return Bytes buffered of bloom filter.
+   */
+  public long getBufferedSize() {
+    return elements.size() * 8 + numBytes;


What do you mean by buffered bytes?

Also, why multiply by 8?

Buffered bytes is the total number of bytes we allocate on the heap for the bloom filter; please refer to the similar definition in dictionaryValueWriter.

Elements are stored as Long, which is 8 bytes.

Please add a comment that explains it as well as the method on https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-column/src/main/java/org/apache/parquet/column/values/ValuesWriter.java

A HashSet will use a good bit more than 8 heap bytes per value.

What do you mean by "A HashSet will use a good bit more than 8 heap bytes per value."?

I'm not sure how to say it another way. A HashSet containing N Longs will use substantially more than N*8 bytes of memory. This is because object overhead, empty space, and other overheads associated with hash tables all occupy heap space.

understood!

jbapple-cloudera · 2017-09-14T00:34:14Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+  /**
+   * @return Bytes buffered of bloom filter.
+   */
+  public long getAllocatedSize() {


What does this mean?

jbapple-cloudera · 2017-09-14T00:35:23Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+        return new FloatBloom(size, hash, algorithm);
+      case DOUBLE:
+        return new DoubleBloom(size, hash, algorithm);
+      case BINARY:


You don't need lines 331 or 333, right?

jbapple-cloudera · 2017-09-14T00:36:23Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+  }
+
+  public static class BinaryBloom extends Bloom<Binary> {
+    private CapacityByteArrayOutputStream arrayout = new CapacityByteArrayOutputStream(1024, 64 * 1024, new HeapByteBufferAllocator());


What are the constants? Give them courtesy names, perhaps.

jbapple-cloudera · 2017-09-14T05:23:10Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+  // Bytes in a bucket.
+  public static final int BYTES_PER_BUCKET = 32;
+
+  // Minimum bloom filter data size in byte.


Why should the minimum size be 256 bytes, rather than 32 bytes?

jbapple-cloudera · 2017-09-14T05:23:58Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+  // The underlying byte array for bloom filter bitset.
+  private byte[] bitset;
+
+  // The size of bitset in byte.


That's already stored in bitset.length, right? Why duplicate it?

jbapple-cloudera · 2017-09-14T05:24:47Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+  // The size of bitset in byte.
+  private int numBytes;
+
+  // List of byte input to construct the bloom filter.


Please add comments to the file: Under what circumstances is it used? What are the data structure invariants? Can both it and bitset be non-null?

jbapple-cloudera · 2017-09-14T05:27:44Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+    Preconditions.checkArgument((p > 0.0 && p < 1.0),
+      "FPP should be less than 1.0 and great than 0.0");
+
+    int bits = (int)(-n * Math.log(p) / (Math.log(2) * Math.log(2)));


I do not believe this math is correct when k, the number of hash functions, is fixed at 8.

It is from https://en.wikipedia.org/wiki/Bloom_filter.

I do not think that wikipedia page supports your statement when k is fixed at 8.
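For reference, a minimal sketch of the two sizing formulas, assuming the classic Bloom filter approximation p ≈ (1 − e^(−kn/m))^k. The expression in the patch, m = −n ln p / (ln 2)², is what that approximation gives when k may be chosen optimally. With k fixed, solving the same approximation for m gives m = −kn / ln(1 − p^(1/k)). The class and method names below are hypothetical and not part of the patch, and a block Bloom filter generally has a somewhat higher false positive rate than this model predicts, so the results are best read as a lower bound.

```java
final class BloomSizing {
  private BloomSizing() {}

  /** Bits needed when k can be chosen optimally: m = -n ln p / (ln 2)^2. */
  static double bitsForOptimalK(long n, double p) {
    return -n * Math.log(p) / (Math.log(2) * Math.log(2));
  }

  /** Bits needed when k is fixed: m = -k n / ln(1 - p^(1/k)). */
  static double bitsForFixedK(long n, double p, int k) {
    return -k * (double) n / Math.log(1.0 - Math.pow(p, 1.0 / k));
  }

  /** Approximate false positive probability of a classic filter: (1 - e^(-kn/m))^k. */
  static double fpp(long n, double m, int k) {
    return Math.pow(1.0 - Math.exp(-k * (double) n / m), k);
  }
}
```

For n = 1000 and p = 0.01, the fixed k = 8 formula asks for roughly 9,680 bits, compared with roughly 9,585 bits from the optimal-k formula.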

jbapple-cloudera · 2017-09-14T05:29:42Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * all elements in cache. If bitset was already created and set, it do nothing.
+   */
+  public void flush() {
+    if (!elements.isEmpty() && bitset == null) {


Why does this silently do nothing if !elements.isEmpty() but bitset != null? Should it throw an exception?

The original bloom filter value writer has two paths. One writes values to a hash set as a cache and finalizes the bloom filter when the column chunk is finalized. The other sets bits immediately when writing values, which does not need a flush since the bitset is already built.

I don't know what the "original bloom filter value writer" is. This is the first one I have seen.

Please add a check here for this case, or, preferably, move the buffering BF construction out of the BF class.

jbapple-cloudera · 2017-09-14T05:47:27Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+    }
+
+    for (int i = 0; i < 8; ++i) {
+      mask[i] = mask[i] >> 27;


This looks to me like a correctness issue: it should be unsigned right shift, yes?
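A small sketch of the difference, assuming the shifted value is meant to be a bit index in the range 0..31 within a 32-bit word. The class name is hypothetical. In Java, `>>` copies the sign bit into the vacated positions, so any input with the top bit set produces a negative result, while `>>>` fills with zeros.

```java
final class ShiftDemo {
  private ShiftDemo() {}

  /** Arithmetic shift: sign-extends, so negative inputs stay negative. */
  static int signedShift(int hash) {
    return hash >> 27;
  }

  /** Logical shift: zero-fills, so the result is always in 0..31. */
  static int unsignedShift(int hash) {
    return hash >>> 27;
  }
}
```

For inputs with the top bit clear the two operators agree, which is why this kind of bug can pass a test that happens to use only small positive values.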

jbapple-cloudera · 2017-09-14T05:49:29Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+    Preconditions.checkArgument((p > 0.0 && p < 1.0),
+      "FPP should be less than 1.0 and great than 0.0");
+
+    int bits = (int)(-n * Math.log(p) / (Math.log(2) * Math.log(2)));


What if this overflows - can bits end up negative, or just truncated and too low? Please consider that and add some checks.
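On the overflow question: Java's narrowing conversion from double to int saturates rather than wrapping, so a too-large positive value becomes Integer.MAX_VALUE and NaN becomes 0. The cast by itself therefore does not go negative for valid inputs, but it silently caps the result. A minimal sketch of one way to make the cap explicit follows; the class, the method names, and the `maxBits` parameter are assumptions for illustration, not part of the patch.

```java
final class CastOverflowDemo {
  private CastOverflowDemo() {}

  /** The expression from the patch: silently saturates at Integer.MAX_VALUE. */
  static int naiveBits(long n, double p) {
    return (int) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  /** Same formula, but validates inputs and clamps to an explicit upper bound. */
  static int checkedBits(long n, double p, int maxBits) {
    if (n <= 0 || !(p > 0.0 && p < 1.0) || maxBits <= 0) {
      throw new IllegalArgumentException("need n > 0, 0 < p < 1 and maxBits > 0");
    }
    double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
    // Compare as doubles before narrowing so the clamp is deliberate, not accidental.
    return bits >= maxBits ? maxBits : (int) Math.ceil(bits);
  }
}
```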

jbapple-cloudera · 2017-09-14T05:50:30Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+
+    int bits = (int)(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
+
+    // Get next power of 2 if bits is not power of 2.


Use https://docs.oracle.com/javase/8/docs/api/java/lang/Integer.html#highestOneBit-int-

jbapple-cloudera · 2017-09-14T05:50:51Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+    return getBufferedSize();
+  }
+
+  public String memUsageString(String prefix) {


I don't see any callers

jbapple-cloudera · 2017-09-14T05:51:15Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+    );
+  }
+
+  public static Bloom getBloomOnType(PrimitiveType.PrimitiveTypeName type,


I don't see any callers.

jbapple-cloudera · 2017-09-14T22:40:07Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

-    bits++;
-
-    return bits;
+    return Integer.highestOneBit(bits) << 1;


Please check if bits is 0, negative, or a power of 2 first.
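A sketch of the three checks, using a hypothetical helper that is not part of the patch. The unguarded expression `Integer.highestOneBit(bits) << 1` returns 0 for an input of 0, doubles a value that is already a power of 2, and becomes Integer.MIN_VALUE for any input above 2^30.

```java
final class PowerOfTwo {
  /** Largest power of two representable as a positive int. */
  static final int MAX_POWER_OF_TWO = 1 << 30;

  private PowerOfTwo() {}

  /** Returns the smallest power of two that is greater than or equal to value. */
  static int nextPowerOfTwo(int value) {
    if (value <= 0) {
      throw new IllegalArgumentException("value must be positive: " + value);
    }
    if (value > MAX_POWER_OF_TWO) {
      throw new IllegalArgumentException("value too large to round up: " + value);
    }
    int highest = Integer.highestOneBit(value);
    // An exact power of two is returned unchanged; anything else rounds up.
    return highest == value ? value : highest << 1;
  }
}
```

Whether an out-of-range input should throw or be clamped to a maximum is a design choice; throwing is used here only to keep the sketch short.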

jbapple-cloudera · 2017-09-14T22:41:45Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

-      maxSlabSize, new HeapByteBufferAllocator());
-    private LittleEndianDataOutputStream out = new LittleEndianDataOutputStream(arrayout);
-
+    final int maxSlabSize = 64 * 1024;


final compile-time constants are often given names "LIKE_THIS"

jbapple-cloudera · 2017-09-14T22:42:06Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

-        byte[] encoded = BytesInput.from(arrayout).toByteArray();
-        arrayout.reset();
-        switch (bloomFilterHash) {
+        PlainValuesWriter plainValuesWriter = new PlainValuesWriter(value.length() + 4, maxSlabSize, new HeapByteBufferAllocator());


This is complex enough to deserve a test.

jbapple-cloudera · 2017-09-14T22:43:27Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloom/Bloom.java

+    bits = Integer.highestOneBit(bits) << 1;
+
+    if (bits < 0) {
+      bits = ParquetProperties.DEFAULT_DICTIONARY_PAGE_SIZE * 8;


Can you explain in the comments why you pick this number of bits?

jbapple-cloudera

A number of other previous comments were not addressed. This is an example. You can see old comments in the "Conversation" tab. Older ones are collapsed, but you can open them with the button in the upper-right-hand corner.

jbapple-cloudera · 2017-09-22T13:51:25Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * Default false positive probability value use to calculate optimal number of bits
+   * used by bloom filter.
+   */
+  public final double FPP = 0.05;


Missed review comment: 0.01 is closer to the optimal range for this FPP.

Also, new comment: Maybe call it DEFAULT_FPP, since the fpp can be different?

jbapple-cloudera · 2017-09-22T18:34:30Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Murmur3.java

+ * 32-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#94
+ * 128-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#255
+ *
+ * This is a public domain code with no copyrights.


Why not use the code directly and use JNI to call it?

It would introduce a C++ dependency, and I'm not sure whether parquet-mr currently has any C++ dependency. Also, I didn't see any formal release of Murmur3.

What methodology have you used to check that your Java translation computes the same values as the C++ version?

I checked it manually.

There are implementations of Murmur3 available elsewhere. Google guava has one, for example, that handles ints and longs directly. I don't think we should include this.

jbapple-cloudera · 2017-09-22T18:34:36Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/TestBloom.java

+  }
+
+  @Test
+  public void testMurmur3() throws IOException {


I would like to see some more tests on some different lengths.

jbapple-cloudera · 2017-09-22T18:37:53Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/TestBloom.java

+    }
+
+    // exist can be true in a very low probability.
+    boolean exist = binaryBloom.find(binaryBloom.hash(Binary.fromString("not exist")));


I think this should include many more calls to find on different input - enough to be confident of the false positive rate.

Given an FPP of 0.01, 10 lookups of absent values all return false with probability 0.99^10, about 0.9. I will update the test to use 10 false-positive lookups here.
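The arithmetic behind this exchange can be sketched as follows, assuming lookups of absent keys behave as independent trials with the configured FPP. Asserting that every lookup returns false would then fail by chance about one run in ten for 10 lookups. Counting false positives over many lookups and comparing against a loose binomial bound is one alternative. The class and method names are hypothetical, and the choice of 5 standard deviations is an arbitrary illustration.

```java
final class FppTestMath {
  private FppTestMath() {}

  /** Probability that all lookups of absent keys return false. */
  static double probabilityNoFalsePositive(double fpp, int lookups) {
    return Math.pow(1.0 - fpp, lookups);
  }

  /** Loose upper bound on the false positive count: binomial mean plus sigmas standard deviations. */
  static long maxExpectedFalsePositives(double fpp, int lookups, double sigmas) {
    double mean = fpp * lookups;
    double stdDev = Math.sqrt(lookups * fpp * (1.0 - fpp));
    return (long) Math.ceil(mean + sigmas * stdDev);
  }
}
```

With an FPP of 0.01 and 100,000 lookups, the bound at 5 standard deviations is 1,158 false positives, so a test could assert that the observed count stays at or below that value.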

jbapple-cloudera · 2017-09-22T18:43:19Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/TestBloom.java

+import static org.junit.Assert.assertTrue;
+
+
+public class TestBloom {


I would like to see the C++ version available and integration tested before committing either the Java or C++ version

Can anyone assist with an implementation for parquet-cpp? If anyone needs help with "where to put the code" either I, @xhochy, or @majetideepak can assist

What integration test do you mean? Interaction with the cpp version, such as writing a file with the Java version and reading it with the cpp version?

I'd like to try to implement the cpp version.

yes, that is what I mean.

jbapple-cloudera · 2017-09-24T03:34:47Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+  public final ALGORITHM algorithm;
+
+  // The underlying byte array for bloom filter bitset.
+  private byte[] bitset;


Why not just store int[] bitset?

Because when serializing and deserializing you would additionally need to consider the byte order and convert the byte array to an int array, which would add some overhead.

This doubles the memory requirement, and I don't think it is worth it -- doubly so since you can change the endianness with the subtraction bit below.

The intBuffer just uses bitset as its backing array; it doesn't double the memory size.
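A short sketch of the view-buffer behaviour being described, assuming the int view is created with `ByteBuffer.wrap(...).asIntBuffer()`. The class and method names are hypothetical. A view created this way shares storage with the wrapped byte[] rather than copying it, and the byte order set on the ByteBuffer before calling `asIntBuffer()` determines how each int maps onto four bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

final class IntViewDemo {
  private IntViewDemo() {}

  /** Returns a little-endian int view that shares storage with the given byte array. */
  static IntBuffer littleEndianView(byte[] bitset) {
    return ByteBuffer.wrap(bitset).order(ByteOrder.LITTLE_ENDIAN).asIntBuffer();
  }
}
```

Writes through the view are visible in the byte array and vice versa, so the array can be serialized directly without a conversion pass.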

jbapple-cloudera · 2017-09-24T03:37:51Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+
+    // Get next power of 2 if it is not power of 2.
+    if ((numBytes & (numBytes - 1)) != 0) {
+      numBytes = Integer.highestOneBit(numBytes) << 1;


This could be negative if a bit gets shifted into the highest-order position, yes?

Yes, I updated the overflow-handling code that follows.

jbapple-cloudera · 2017-09-24T03:38:35Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * @param out output stream to write
+   */
+  public void writeTo(OutputStream out) throws IOException {
+    Preconditions.checkArgument(bitset.length > 0, "Bloom filter bitset length should be larger than 0");


Is this possible? Java promises never to return a negative for the length of an array, right?

jbapple-cloudera · 2017-09-24T03:39:36Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+    }
+
+    for (int i = 0; i < 8; ++i) {
+      mask[i] = mask[i] >>> 27;


For switching between little-endian and big-endian, I think you can subtract mask[i] from 31 after this operation.

No need to switch endianness now.

jbapple-cloudera · 2017-09-24T03:42:06Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Murmur3.java

+ * 32-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#94
+ * 128-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#255
+ *
+ * This is a public domain code with no copyrights.


What methodology have you used to check that your Java translation computes the same values as the C++ version?

jbapple-cloudera · 2017-09-24T03:45:36Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Murmur3.java

+   * @param data - input byte array
+   * @return - hashcode
+   */
+  public static int hash32(byte[] data) {


Why make this public if the Bloom filter implementation does not use it? Same question for other methods.

jbapple-cloudera · 2017-09-24T03:46:30Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/TestBloom.java

+public class TestBloom {
+  @Test
+  public void testIntBloom () throws IOException {
+    Bloom bloom = new Bloom(279, Bloom.HASH.MURMUR3_X64_128, Bloom.ALGORITHM.BLOCK);


Please add a comment explaining the rationale for the number 279

This is just a value I use to check that the bitset length becomes the next power of 2. I added a test for that.

jbapple-cloudera · 2017-09-24T03:47:39Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * @param hash The hash strategy bloom filter apply.
+   * @param algorithm The algorithm of bloom filter.
+   */
+  public Bloom(byte[] bitset, HASH hash, ALGORITHM algorithm) {


I have stated in the past on this review that, in my opinion, all public methods of a class should be tested.

I just added write/read via an OutputStream to the test.

jbapple-cloudera · 2017-09-24T03:52:47Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+        addElement(hash);
+      }
+
+      elements.clear();


elements = null allows the GC to clean it up, I suspect.

Now that you have elements = null, you no longer need elements.clear().

jbapple-cloudera · 2017-09-24T03:55:03Z

parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java

@@ -50,6 +50,7 @@
  public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_PAGE_SIZE_CHECK = true;
  public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100;
  public static final int DEFAULT_MAXIMUM_RECORD_COUNT_FOR_CHECK = 10000;
+  public static final int DEFAULT_MAXIMUM_BLOOM_FILTER_SIZE = 16 * 1024 * 1024;


"SIZE" is ambiguous; prefer "BYTES"

jbapple-cloudera · 2017-09-24T16:23:44Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/TestBloom.java

+      Bloom.ALGORITHM.BLOCK);
+
+    List<String> strings = new ArrayList<>();
+    RandomStr randomStr = new RandomStr();


You probably want to provide a seed to make this test pseudo-random but deterministic, so it's not flaky.
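A minimal sketch of seeded test data, assuming a stand-in helper rather than the `RandomStr` class from the patch, whose constructor signature is not shown here. `java.util.Random` constructed with a fixed seed produces the same sequence on every run, so a failure can be reproduced exactly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SeededStrings {
  private SeededStrings() {}

  /** Generates count lowercase strings of the given length, fully determined by the seed. */
  static List<String> randomStrings(long seed, int count, int length) {
    Random random = new Random(seed);
    List<String> out = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      StringBuilder sb = new StringBuilder(length);
      for (int j = 0; j < length; j++) {
        sb.append((char) ('a' + random.nextInt(26)));
      }
      out.add(sb.toString());
    }
    return out;
  }
}
```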

jbapple-cloudera · 2017-09-24T16:26:34Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+        addElement(hash);
+      }
+
+      elements.clear();


Now that you have elements = null, you no longer need elements.clear().

jbapple-cloudera · 2017-09-24T16:31:02Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

@@ -162,7 +162,7 @@ private void initBitset(int numBytes) {
   * @param out output stream to write
   */
  public void writeTo(OutputStream out) throws IOException {
-    Preconditions.checkArgument(bitset.length > 0, "Bloom filter bitset length should be larger than 0");
+    Preconditions.checkArgument(bitset != null, "Bloom filter bitset has not create yet.");


Why not call flush in this case?

rdblue · 2017-09-25T18:29:06Z

parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java

@@ -50,6 +50,7 @@
  public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_PAGE_SIZE_CHECK = true;
  public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100;
  public static final int DEFAULT_MAXIMUM_RECORD_COUNT_FOR_CHECK = 10000;
+  public static final int DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES = 16 * 1024 * 1024;


The default bloom filter max is 1/8th of the default row group size?

This needs to be discussed.

rdblue · 2017-09-25T18:29:35Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+
+public class Bloom {
+  // Hash strategy available for bloom filter.
+  public enum HASH {


Nit: Should be HashAlgorithm.

rdblue · 2017-09-25T18:30:17Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+  }
+
+  // Bloom filter algorithm.
+  public enum ALGORITHM {


Nit: Should be Algorithm (not all caps). Is there a better name for this? It isn't really an algorithm. It is a variant of the data structure.

rdblue · 2017-09-25T18:31:57Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+  public static final int HEADER_SIZE = 12;
+
+  // Bytes in a bucket.
+  public static final int BYTES_PER_BUCKET = 32;


Can we use a better name than "bucket"? In the paper, each filter is called a block. That's an overused term in Hadoop, so maybe we should call it something more specific, like FilterBlock.

rdblue · 2017-09-25T18:34:22Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * @param hash The hash strategy bloom filter apply.
+   * @param algorithm The algorithm of bloom filter.
+   */
+  public Bloom(int numBytes, HASH hash, ALGORITHM algorithm) {


I doubt we are going to use this class for other bloom filter variants, so I don't see the need to pass the algorithm in. Same thing with the hash function. Since there is only one, let's avoid passing it explicitly.

Changed it to private and provided another constructor.

rdblue · 2017-09-25T18:44:11Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+    ByteBuffer plain = ByteBuffer.allocate(Integer.SIZE/Byte.SIZE);
+    plain.order(ByteOrder.LITTLE_ENDIAN).putInt(value);
+    switch (hash) {
+      case MURMUR3_X64_128: return Murmur3.hash64(plain.array());


This should hash a ByteBuffer instead of byte[]. That will require less copying for types like Binary.

Guava supports a ByteBuffer argument from version 23. I just updated pom.xml as well.

It looks like guava 23.0 changes quite a few APIs, which causes several build issues in parquet-cli. I reverted to guava 20.0 and the hash64(byte[]) API for now.

rdblue · 2017-09-25T18:46:15Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+      plain.order(ByteOrder.LITTLE_ENDIAN).putInt(value.length());
+      ByteArrayOutputStream baos = new ByteArrayOutputStream(value.length() + 4);
+      baos.write(plain.array(), 0, 4);
+      value.writeTo(baos);


This shouldn't copy. It should retrieve a byte buffer for the binary using toByteBuffer.

I updated it to use toByteBuffer, but I'm not sure whether this is consistent with parquet's plain encoding definition.

rdblue · 2017-09-25T18:47:22Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * Insert element to set represented by bloom bitset.
+   * @param value the value to insert into bloom filter..
+   */
+  public void insert(long value) {


This requires the hash of a value, not a value.

rdblue · 2017-09-25T18:48:51Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+   */
+  public boolean find(long hash) {
+    // Elements are in cache, flush them firstly.
+    if (elements != null && !elements.isEmpty()) {


I don't think we need a class that handles reads mixed with writes. Reads and writes will be separated, so I'm a little concerned that this over-complicates the implementation.

Removed this cache logic to keep it simple.

rdblue · 2017-09-25T18:54:30Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Murmur3.java

+ * 32-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#94
+ * 128-bit Java port of https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp#255
+ *
+ * This is a public domain code with no copyrights.


There are implementations of Murmur3 available elsewhere. Google guava has one, for example, that handles ints and longs directly. I don't think we should include this.

chenjunjiedada · 2017-09-27T06:29:45Z

parquet-column/src/test/java/org/apache/parquet/column/values/bloom/Bloom.java

+   * @return hash result
+   */
+  public long hash(Binary value) {
+      return hashFunction.hashBytes(value.toByteBuffer().array()).asLong();


I think this is still open, since we agreed to use the plain-encoded input. In the parquet thrift definition, BYTE_ARRAY is encoded with its length and FIXED_LEN_BYTE_ARRAY is encoded without it. PlainValuesWriter.java stores the length for Binary.

Do we need to still use plain encoding input for Binary? @rdblue @jbapple-cloudera

We don't need to include the length.

Also, this shouldn't use .array() without checking whether the buffer is direct or on heap, and without supplying an offset because the ByteBuffer may point to the middle of its backing array. Here's what I'm using in another project:

    // MURMUR3 is the configured hash function from guava; this uses the 32-bit variant
    public int hash(ByteBuffer value) {
      if (value.hasArray()) {
        return MURMUR3.hashBytes(value.array(), value.arrayOffset() + value.position(),
            value.arrayOffset() + value.remaining()).asInt();
      } else {
        byte[] copy = new byte[value.remaining()];
        value.get(copy);
        return MURMUR3.hashBytes(copy).asInt();
      }
    }

There should also be test cases for the following:

Direct byte buffer

Direct byte buffer with non-zero position

Heap byte buffer

Heap byte buffer with non-zero arrayOffset

Heap byte buffer with non-zero position

The ByteBuffer comes from the Binary API, which already considers the things you mentioned.

It is the same as with getBytes/getUnsafeBytes [no copy].

This should use the ByteBuffer and add tests for the cases that I listed. ByteBuffers will require fewer copies. Otherwise, this will need to be rewritten before actually using it to produce bloom filters.

I understand that ByteBuffer should avoid copies, but Murmur3 from guava 20.0 only supports a byte[] parameter. We can change to ByteBuffer when we update to guava 23.0.

The difference from your reference code is that here we take a Binary as the input parameter, and we should trust the APIs of the Binary class, shouldn't we? If it accepted a ByteBuffer parameter, we should add the tests you listed accordingly.

In addition, in your reference code, why do you add arrayOffset to remaining? remaining should be (limit - position) and is independent of arrayOffset; otherwise the length changes. Am I wrong?
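On the offset question: Guava's `hashBytes(byte[], int, int)` takes an offset and a length, so the start is `arrayOffset() + position()` and the length is `remaining()` alone. A self-contained sketch of that handling follows, with a simple FNV-1a function standing in for Murmur3 so that it runs without Guava. The class and method names are hypothetical, and the stand-in hash is an assumption made only for illustration.

```java
import java.nio.ByteBuffer;

final class ByteBufferHashing {
  private ByteBufferHashing() {}

  /** Stand-in hash (32-bit FNV-1a) over data[offset .. offset + length). */
  static int fnv1a(byte[] data, int offset, int length) {
    int h = 0x811C9DC5;
    for (int i = offset; i < offset + length; i++) {
      h ^= data[i] & 0xFF;
      h *= 0x01000193;
    }
    return h;
  }

  /** Hashes the remaining bytes of value without changing its position. */
  static int hash(ByteBuffer value) {
    if (value.hasArray()) {
      // Heap buffer: read the backing array in place, honouring offset and position.
      return fnv1a(value.array(), value.arrayOffset() + value.position(), value.remaining());
    }
    // Direct or read-only buffer: copy through a duplicate so the caller's position is untouched.
    byte[] copy = new byte[value.remaining()];
    value.duplicate().get(copy);
    return fnv1a(copy, 0, copy.length);
  }
}
```

All five buffer shapes listed in the review (direct, direct with non-zero position, heap, heap with non-zero arrayOffset, heap with non-zero position) should hash to the same value as the plain byte array holding the same content.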

Trusting an API for correctness and using the most efficient method from that API are two different things. If you use toByteArray, your code will often copy data it doesn't need to, which is why it will need to be replaced before actually using this. If Guava doesn't support ByteBuffer, is there another implementation that does? Can we update Guava?

Yes, we can update Guava to 23.0; it needs some minor changes to fix build issues. Please have a look at commit 5f0eab7.

I also updated the maven-shade-plugin version, since it threw an ArrayIndexOutOfBoundsException when creating the shaded jar for "com/google/common/cache/LocalCache$EntrySet.class".

chenjunjiedada · 2017-11-06T00:58:31Z

Hi @wesm
@winningsix will work on parquet-cpp part.

chenjunjiedada · 2018-05-03T12:39:14Z

Hi @winningsix
any update on this?

chenjunjiedada · 2018-05-23T07:26:13Z

Hi developers

Sorry for the long silence on the bloom filter topic; it was due to a transition. I'm back now to move this forward, starting from this rebase.

There are no notable changes in this commit; I just removed unrelated changes.

@rdblue @jbapple-cloudera Could you please have a look?
@wesm As you pointed out, I also have a PR in parquet-cpp. Could you please have a look at that as well?

Many thanks

chenjunjiedada · 2018-06-21T08:18:44Z

Hi @rdblue
I have updated the code according to your earlier comments, and I recently rebased it as well. I see there are still change requests on this PR page; I checked them one by one, and I think almost all of the comments have a corresponding update. Could you please take another look?

Thanks in advance.

chenjunjiedada · 2018-06-22T02:25:27Z

Change title to conform to subtask JIRA.

chenjunjiedada · 2018-06-28T09:29:43Z

ping @rdblue

chenjunjiedada · 2018-08-15T17:16:06Z

Hi @rdblue, do you have time to take a look?

jbapple-cloudera · 2018-11-06T20:37:23Z

Can this be closed in favor of #521?

chenjunjiedada · 2018-11-07T01:44:23Z

Closing this as a duplicate of #521.

chenjunjiedada force-pushed the parquet-41 branch 3 times, most recently from 42e6c77 to 1d3170f · September 6, 2017 03:26

jbapple-cloudera suggested changes · Sep 14, 2017

jbapple-cloudera reviewed · Sep 14, 2017

chenjunjiedada force-pushed the parquet-41 branch 2 times, most recently from 22cc9af to 1918260 · September 15, 2017 09:05

jbapple-cloudera suggested changes · Sep 22, 2017

jbapple-cloudera suggested changes · Sep 24, 2017

chenjunjiedada force-pushed the parquet-41 branch 2 times, most recently from 4c4ecab to cae4ea7 · September 24, 2017 11:22

jbapple-cloudera suggested changes · Sep 25, 2017

rdblue requested changes · Sep 25, 2017

chenjunjiedada commented · Sep 27, 2017

chenjunjiedada force-pushed the parquet-41 branch from 7ee8817 to 8e59ed8 · May 23, 2018 07:23

PARQUET-41: rebase to latest master

8e59ed8

remove some obsolete changes

PARQUET-41: update intBuffer to use little endian for cross compatibility issue

9a1955d

chenjunjiedada changed the title ~~PARQUET-41:Add Bloom Filter for parquet~~ PARQUET-1332:Add bloom filter utility class · Jun 22, 2018

PARQUET-1332: update according to review comments from parquet-cpp

ec9cefd

chenjunjiedada changed the title ~~PARQUET-1332:Add bloom filter utility class~~ PARQUET-1342:Add bloom filter utility class · Jun 29, 2018

PARQUET-1342: update according to comments from parquet-cpp

6a43476

PARQUET-1342: update murmur3 seed value

53f22e0

chenjunjiedada force-pushed the parquet-41 branch from cbd1032 to 53f22e0 · August 17, 2018 13:22

chenjunjiedada closed this Nov 7, 2018

asfimport mentioned this pull request Jun 23, 2024

Add bloom filters to parquet statistics #1468

Closed

17 tasks

