PARQUET-1332: Add bloom filter for parquet #432

chenjunjiedada · 2018-01-19T08:26:51Z

This is first part of bloom filter patch set, which include a bloom filter utility and also some unit tests.
Note that this patch also includes murmur3Hash original code from Austin Appleby. The code isn't formatted as parquet-cpp format.

daedric · 2018-01-24T17:22:35Z

src/parquet/bloom.cc

+    this->hashFunc = &MurmurHash3_x64_128;
+    break;
+  default:
+    throw new parquet::ParquetException("Not supported hash strategy");


are you sure you want to new the exception ?

nice catch, thanks.

daedric · 2018-01-29T12:30:49Z

src/parquet/bloom.h

+  Algorithm algorithm;
+
+  // The underlying byte array for bloom filter bitset.
+  uint8_t* bitset;


The move ctor and operator are not deleted which could lead to a crash since the bitset is freed in the dtor.
Good practice would be to wrap it in a std::unique_ptr<uint8_t[]>

Good suggestion. applied.

daedric · 2018-01-29T12:31:18Z

src/parquet/bloom.cc

+
+Bloom::~Bloom() {
+  if (bitset) {
+    free(bitset);


use delete[] bitset; it was newed not malloced

chenjunjiedada · 2018-01-29T15:34:32Z

Not sure why CI build error, valgrind complain for "jump or move depends on uninitialised value".

daedric · 2018-01-29T16:11:46Z

You don't memset to 0 your int array so it contains garbage :)

chenjunjiedada · 2018-05-03T12:29:09Z

Hi @winningsix and @daedric
any update on this?

chenjunjiedada · 2018-05-23T07:45:27Z

@wesm @xhochy @daedric Since @winningsix has no bandwidth to handle this, I 'm back to take this again. This has been reviewed for long time, could you please kindly help to have a look again?

jbapple-cloudera

I'll have more comments later, but I didn't want you to have to wait too long to get started on some feedback.

jbapple-cloudera · 2018-06-19T17:04:35Z

src/parquet/bloom.h

+  static constexpr uint32_t SALT[8] = { 0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
+      0xa2b7289dU, 0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U };
+
+  typedef void (*HashFunc)(const void *, int, uint32_t, void*);


Can you give these parameters names? Also, the last parameter should/can be a uint64_t [2], yes?

Added full names, I'm not understand the last question, why we need to change it? any benefit?

jbapple-cloudera · 2018-06-19T17:09:36Z

src/parquet/bloom.h

+ public:
+  /// Constructor of bloom filter, if numBytes is zero, bloom filter bitset
+  /// will be created lazily and the number of bytes will be calculated through
+  /// distinct values in cache. It use murmur3_x64_128 as its default hash function


nit: "It will use"

jbapple-cloudera · 2018-06-19T17:11:04Z

src/parquet/bloom.h

+  /// Constructor of bloom filter, if numBytes is zero, bloom filter bitset
+  /// will be created lazily and the number of bytes will be calculated through
+  /// distinct values in cache. It use murmur3_x64_128 as its default hash function
+  /// and block based algorithm as default algorithm.


nit: "and the block based algorithm." No need to add "as default algorithm".

jbapple-cloudera · 2018-06-19T17:15:08Z

src/parquet/bloom.h

+  // Bytes in a tiny bloom filter block.
+  static constexpr int BYTES_PER_FILTER_BLOCK = 32;
+
+  // Default seed for hash function


Can you add in the comment on line 64 where you got this default seed from?

jbapple-cloudera · 2018-06-20T16:31:13Z

src/parquet/bloom.h

+  // Default maximum bloom filter size (need to discuss)
+  static constexpr int DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES = 16 * 1024 * 1024;
+
+  // The block based algorithm needs 8 odd SALT values to calculate eight index


nits: "block-based", "eight indexes", "one bit in each 32-bit word"

jbapple-cloudera · 2018-06-20T16:36:59Z

src/parquet/bloom.h

+  std::unique_ptr<uint8_t[]> bitset;
+
+  // Hash function applied.
+  HashFunc hashFunc;


Why do we need both hashFunc and hash_strategy?

hash_strategy is used to specify which hash function to use (MurmurHash3_x86_32, MurmurHash3_x86_128, MurmurHash3_x64_128). While hashFunc is a function pointer used to point to the function we chose by hash_strategy.

jbapple-cloudera · 2018-06-20T16:39:17Z

src/parquet/bloom.h

+  void writeTo(const std::shared_ptr<OutputStream>& sink);
+
+ private:
+  // Create a new bitset for bloom filter, at least 256 bits will be create.


nit: istead of "at least 256 bits will be create", use "the size will be at least 32 bytes". Also, please document any restrictions on num_bytes and the consequences for ignoring them.

jbapple-cloudera · 2018-06-20T16:39:38Z

src/parquet/bloom.h

+  // Create a new bitset for bloom filter, at least 256 bits will be create.
+  // @param numBytes number of bytes for bitset
+  void initBitset(uint32_t num_bytes);
+  void setMask(uint32_t key, uint32_t mask[8]);


Please add a comment for every method, even private ones.

jbapple-cloudera · 2018-06-21T04:42:18Z

src/parquet/bloom.cc

+  sink->Write(bitset.get(), num_bytes);
+}
+
+Bloom::~Bloom() {


You can put this in .h with and use " = default;"

jbapple-cloudera · 2018-06-21T04:43:18Z

src/parquet/bloom.h

+  Algorithm algorithm;
+
+  // The underlying byte array for bloom filter bitset.
+  std::unique_ptr<uint8_t[]> bitset;


This is almost always going to be used as a uint32_t[], yes? Can we use that instead of uint8_t[]?

chenjunjiedada · 2018-06-21T08:03:50Z

Thanks Jim, I have updated code accordingly. I also added an unit test for compatibility test.

Not sure why CI still failed, four build passed, two build failed.

chenjunjiedada · 2018-06-22T02:24:41Z

Change title to conform to subtask JIRA.

jbapple-cloudera

Something happened with your rebase and it is now huge, including files not directly related to BFs. You might be able to fix this if you changee your GH repo's master to match parquet-cpp's master.

Once that's set, I'll resume my review.

chenjunjiedada · 2018-06-23T02:58:02Z

Sorry, the rebase seems weird, it contains several same commits.

I just picked up all related changes and patched to master, it overrides commit history, but I think it should be much clean to cherry pick all commits. It needs you to check previously comments by manually click "Show outdated", but the updated code is not show in that conversation box, you can review the latest code by click the "File Changes" at top of this page.

jbapple-cloudera · 2018-06-25T04:23:47Z

src/parquet/bloom.h

+  static constexpr double DEFAULT_FPP = 0.01;
+
+  // Bloom filter data header, including number of bytes, hash strategy and algorithm.
+  static constexpr int HEADER_SIZE = 12;


This doesn't appear to be used anywhere.

I will remove this.

jbapple-cloudera · 2018-06-25T04:24:19Z

src/parquet/bloom.h

+   * Default false positive probability value use to calculate optimal number of bits
+   * used by bloom filter.
+   */
+  static constexpr double DEFAULT_FPP = 0.01;


This doesn't appear to be used anywhere.

Will remove in next commit.

jbapple-cloudera · 2018-06-25T04:25:16Z

src/parquet/bloom.h

+  static constexpr int BYTES_PER_FILTER_BLOCK = 32;
+
+  // Default seed for hash function, to keep cross compatibility the seed is same as
+  // value defined in parquet-mr.


Why was this seed picked in parquet-mr? What is the origin of this value for the seed?

The seed is a primer which comes from org.apache.hive.common.util.Murmur3.java. I haven't found why it use this primer, but this value is widely used in different versions of implementation.

Please always leave a comment in the code when you use a "magic number" like this regarding where you got that number.

OK, I will state this comes from other library or jar in next commit.

jbapple-cloudera · 2018-06-25T04:26:10Z

src/parquet/bloom.h

+  // value defined in parquet-mr.
+  static constexpr int DEFAULT_SEED = 104729;
+
+  // Default maximum bloom filter size (need to discuss)


Need to discuss with whom?

Originally I think this would need to discuss with parquet community, maybe we can set to 1/8 or parquet.block.size in case of one column of long type with its distinct values ratio equal to 1 and FPP set to 0.01. What do you think?

What would it mean to set this to 1/8? I don't understand. Doesn't this need to be an integer?

I would suggest starting the conversation on dev@ if you think it needs to be discussed by the community before committing.

Yes, I will start thread on this.

jbapple-cloudera · 2018-06-25T04:50:41Z

src/parquet/bloom.cc

+  addElement(hash);
+}
+
+uint64_t Bloom::hash(int value) {


Most of these can be replaced by a template with specializations for the types that are not POD.

Do you mean types are POD?

What I mean is that most of these could be replaced by a single template.

OK, will do this in next commit.

jbapple-cloudera · 2018-06-25T05:05:13Z

src/parquet/CMakeLists.txt

@@ -50,6 +52,7 @@ install(FILES
  "${CMAKE_CURRENT_BINARY_DIR}/parquet.pc"
  DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/")

+ADD_PARQUET_TEST(bloom-test)


Is this only used in the test? What must be done to actually integrate it with the rest of the repository and get it into some binary tools or libraries that can be used by consumers of parquet-cpp build artifacts?

Yes, this is only expected to be used in test. I don't expect this unit test to be contained in final binary file. I think this test is used to verify correctness and compatibility before parquet-cpp releasing. Do we need to include unit tests in release jar or .so/.a?

I don't think I understand. If this file isn't used anywhere else in the project, how can it be actually used? @rdblue , maybe you could shed some light on how new features are usually integrated into the non-test parts of this project.

this file is used when execute "make unittest" command, it run several unit test. But the unit test code is not linked into finally parquet-cpp binary.

I think that the parquet-cpp binary should link to this Bloom filter code in some way. Do you disagree, @cjjnjust ?

My bad, I thought it is not linked to, while I just see it is linked into final binary in compiling log. Let me investigate how other unit test achieve this.

Just reviewed this again, bloom-test.cc is not linked to either libparquet.a or libparquet.so, the logs I mentioned above is linking bloom-test.o to bloom-test binary in debug/bloom-test.

I am not comfortable that this patch is complete until it is clearly used somewhere other than a test. If BFs are part of the format, and this repo includes utilities for reading or writing the format, they should include this code in their compiled binaries.

I understand your concern while we should discussed in previous parquet sync-up meeting and agreed to implement BF utility in the test scope firstly since BFs are optional of metadata. BFs are planed to be used when either read or write side is implemented, we should include some usages right in next steps, it that OK?

OK. Please note as such in your commit message.

jbapple-cloudera · 2018-06-25T05:22:41Z

src/parquet/bloom.h

+ public:
+  /// Constructor of bloom filter. The range of num_bytes should be within
+  /// [BYTES_PER_FILTER_BLOCK, DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES], it will be
+  /// round up/down to lower/upper bound if num_bytes is out of range. It will use


nit: "rounded up/down"

Will update in next commit.

jbapple-cloudera · 2018-06-25T05:29:50Z

src/parquet/bloom.h

+  /// Constructor of bloom filter. The range of num_bytes should be within
+  /// [BYTES_PER_FILTER_BLOCK, DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES], it will be
+  /// round up/down to lower/upper bound if num_bytes is out of range. It will use
+  /// murmur3_x64_128 as its default hash function and the block based algorithm.


I'm not sure it makes sense for this class to include both an implementation and the interface of a filter. Would it make more sense to define a pure virtual Filter interface and then derive from that and override methods?

Shall we do the same for the hash function?

The idea for using a function pointer is to treat murmur3 hash implementation as library functions, such as library here. So that one day if we want to switch to other hash in a c++ library, we only need to just assign the function pointer to it instead of change internal implementation.

I'm sorry, but I'm not sure this addresses my concerns either about the Filter API or the Hash API. Pure abstract classes is a common idiom that more simply captures the intent that there is one API and multiple implementations.

The Derived classes, can, of course, call Murmur3 or whatever.

Understood your point, will update after next commit.

I added a abstract class Hasher, and let MurmurHash3 to inherit to it.

jbapple-cloudera · 2018-06-25T05:52:49Z

src/parquet/bloom.h

+  /// function and block-based algorithm as default algorithm.
+  /// @param bitset The given bitset to construct bloom filter.
+  /// @param len Length of bitset.
+  Bloom(const uint8_t* bitset, uint32_t len);


This is a little tricky: If we are going to treat bitset as if it were an array of uint32_ts, we run into http://eel.is/c++draft/basic.lval#11

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

the dynamic type of the object,
...

a char, unsigned char, or std::byte type.

So we need uint8_t to be char, unsigned char, or std::byte. We'll need to check for alignment, as well, because compilers are allowed to assume that pointers to uint32_t have alignment that is greater than that of uint8_t.

Let's think about how this will get hooked into other libraries an ask @rdblue , who is more experienced in this code base, if we can put alignment requirements on the data we'll end up needing to use this with.

In an ideal world, I'd like it to be 64-byte aligned.

Generally types alignment is equal to types size (Wiki, this may be different among compilers), so I think the last sentence you mentioned should be 64-bits aligned, right?

Therefore, uint32_t should have an four bytes alignment generally, consider we are restricting bitset length to [32bytes, 16Mbytes], and also length is power of 2, then the uint32_t* must not contains any padding, right?

BTW, as I known, uint8_t is typedef of unsigned char, am I wrong?

I do really mean 64-byte. That is the cache line size in most modern x86-64 CPUs. Alignment is especially useful for SIMD operations, in which misaligned access can cause a segfault.

OK, Now I understood your concern. SIMD operation should have some performance penalty when access unaligned data due to cross cache line access or cache miss (Some tests here: http://www.agner.org/optimize/blog/read.php?i=415#423), but I think it should not cause a segfault.

How about set minimum bloom filter size to 64 bytes to avoid this?

Some SIMD operations can segfault when crossing cache lines. Others won't.

How do you believe that setting the minimum BF size would fix this? What mechanism would you use to set the size?

I got your point from your email reply.

How about we check bitset address alignment and do copy if not? And of course document alignment requirement and performance impact. Can we do this firstly and consider BF client/consumer when doing read-side task?

Do you mean this kind segfalut? https://stackoverflow.com/questions/25596379/simd-intrinsics-segmentation-fault

What about for now making the parameter a uint32_t *? This puts the obligation on the caller. In the tests you wrote, you can do the alignment check and also check (with static_assert) that uint8_t is unsigned char.

I can do this. I don't understand the last sentence, do you mean I should add alignment check and static_assert code as well?

Yes, add these in the test code. The return value you get from an allocatino should be aligned at at least alignof(uint32_t) and also uint8_t should be std::is_same to unsigned char.

jbapple-cloudera · 2018-06-25T05:53:58Z

src/parquet/bloom.cc

+
+uint64_t Bloom::hash(const Int96 *value) {
+  uint64_t out[2];
+  (*hashFunc)(reinterpret_cast<void *>(const_cast<uint32_t*>(value->value)),


This const_cast is not necessarily standards-compliant. I expect the type signature of the hash function is a problem: it should not be modifying the data passed into it, so it should be OK with a const pointer.

yes, the const_cast is not necessarily and the signature is right also, I need to change reinterpret_cast<void*> to reinterpret_cast<const void *>, will update in next commit.

jbapple-cloudera · 2018-06-25T22:10:19Z

src/parquet/bloom.h

+#ifndef PARQUET_BLOOM_H
+#define PARQUET_BLOOM_H
+
+#include <set>


Why do you need the standard's set header? I don't see it used.

Will remove in next commit.

jbapple-cloudera · 2018-06-25T22:15:55Z

src/parquet/bloom.h

+namespace parquet {
+class OutputStream;
+
+// Bloom Filter is a compact structure to indicate whether an item is not in set or


nit: Either "Bloom filters are compact structures" or "The Bloom filter is a compact structure" or "A Bloom filter is a compact structure". That is, for singular improper nouns, you need an article like "a" or "the".

Note that "filter" is not usually capitalized in "Bloom filter".

This is also true in other comments in this PR.

Thanks, will update in next commit.

jbapple-cloudera · 2018-06-25T22:18:26Z

src/parquet/bloom.h

+  ~Bloom() = default;
+
+  // Calculate optimal size according to the number of distinct values and false
+  // positive probability.


Please leave a blank line when breaking paragraphs.

do you mean add a blank line between function description and param description?

not a blank line, a line with just the comment characters and a newline.

In general, for style comments like this that you're not clear on, please check the other C++ files in this repo and copy their style.

will update in next commit.

jbapple-cloudera · 2018-06-25T22:23:20Z

src/parquet/bloom.h

+  // @param ndv: The number of distinct values.
+  // @param fpp: The false positive probability.
+  // @return optimal number of bits of given n and p.
+  static uint32_t optimalNumOfBits(uint32_t ndv, double fpp);


The README.md says to use the Google style guide for C++, which specifies that methods start with a capital letter:

https://google.github.io/styleguide/cppguide.html#Function_Names

The README also describes how to run cpplint; please do so and correct the things it finds.

Will update function names in next commit.

parquet-cpp already have cpplint tool integrated, we can run make lint to check.

jbapple-cloudera · 2018-06-25T22:24:47Z

src/parquet/bloom.h

+  // Determine whether an element exist in set or not.
+  // @param hash the element to contain.
+  // @return false if value is definitely not in set, and true means PROBABLY in set.
+  bool find(uint64_t hash);


I'd recommend naming these FindHash and InsertHash to prevent users from accidentally thinking that they inserted an int16_t or whatever, not a hash

Agreed, will update in next commit.

jbapple-cloudera · 2018-06-26T02:17:58Z

src/parquet/bloom-test.cc

+    }
+
+    // The exist should be probably less than 1000 according default FPP 0.01.
+    ASSERT_TRUE(exist < 1000);


Please make the total count and fpp into variables and use some multiple of their product here.

will update in next commit.

jbapple-cloudera · 2018-06-26T02:19:15Z

src/parquet/bloom-test.cc

+
+    std::string testString[4] = {"hello", "parquet", "bloom", "filter"};
+	std::string data_dir(test::get_data_dir());
+    std::string bloomDataPath = data_dir + "/bloom_filter.bin";


Please add a README in the directory with the .bin file to explain what it is and how you made it.

Can we add doc at begin of this test? Add README file in source subdir looks a bit weird and may be ignored easily.

jbapple-cloudera · 2018-06-26T02:22:38Z

src/parquet/bloom-test.cc

+    for (int i = 0; i < 100000; i++) {
+      ByteArray byte_array1(10,
+          reinterpret_cast<const uint8_t*>(random_strings[i].c_str()));
+      ASSERT_TRUE(de_bloom->find(de_bloom->hash(&byte_array1)));


I think you want to use one of the CHECK macros when handling the main substance of the test, not ASSERT. See here and elsewhere in this file.

I didn't found other unit tests use CHECK macros yet. Do you mean DCHECK? Could you please list one code example use CHECK so that I can refer to.

You're right - I meant EXPECT: https://github.com/google/googletest/blob/master/googletest/docs/primer.md

OK, will update to use EXPECT_EQ in next commit.

jbapple-cloudera · 2018-06-26T02:23:53Z

src/parquet/bloom-test.cc

+    ASSERT_TRUE(exist < 1000);
+}
+
+TEST(CompatibilityTest, TestBloomFilter) {


Please put a comment on this test to describe what compatibility it tests.

will add in next commit.

jbapple-cloudera · 2018-06-26T02:24:31Z

src/parquet/bloom-test.cc

+    std::unique_ptr<uint8_t[]> bitset;
+
+    std::string testString[4] = {"hello", "parquet", "bloom", "filter"};
+	std::string data_dir(test::get_data_dir());


This is an unusual indentation.

chenjunjiedada · 2018-06-26T09:00:06Z

Thanks Jim for these really great comments!

I 'd like to update some simple changes firstly, some other comments need more investigation and discussion will be updated after discussing with community.

chenjunjiedada · 2018-06-28T23:50:31Z

Hi @jbapple-cloudera , I have updated code, could you please take a look?

jbapple-cloudera · 2018-06-29T00:03:25Z

I get a message for every update, so I was already notified. In the future, there is no need for an additional reminder.

I have some other responsibilities and will get back to this in a couple of days.

chenjunjiedada · 2018-06-29T00:35:06Z

@jbapple-cloudera Sure, thanks.

jbapple-cloudera · 2018-07-02T00:14:51Z

src/parquet/murmur3.cc

+
+uint64_t MurmurHash3::Hash(int64_t value) {
+  uint64_t out[2];
+  Hash_x64_128(reinterpret_cast<void*>(&value), sizeof(int), seed_, out);


Not sizeof(int), sizeof(int64_t).

As you have discovered (and I had overlooked), using a pure virtual base class requires separate methods, since there cannot be virtual template methods. However, you can call out to a helper template defined only in this translation unit. That method would avoid some of the copy-and-paste that created the error here.

jbapple-cloudera · 2018-07-02T00:15:36Z

src/parquet/murmur3.cc

+uint64_t MurmurHash3::Hash(const Int96* value) {
+  uint64_t out[2];
+  Hash_x64_128(reinterpret_cast<const void*>(value->value), sizeof(value->value),
+		  seed_, out);


nit: indentation is off

jbapple-cloudera · 2018-07-02T00:16:55Z

src/parquet/murmur3.h


-#include <stdint.h>
+  /// Original murmur3 hash implementation
+  void Hash_x86_32(const void* key, int len, uint32_t seed, void* out);


Why are these three methods in the class interface at all? Why not make them helper functions visible only in the .cc file?

jbapple-cloudera · 2018-07-02T00:17:16Z

src/parquet/murmur3.h


-#endif // !defined(_MSC_VER)
+  uint64_t Hash(int32_t value);


Please use https://en.cppreference.com/w/cpp/language/override

jbapple-cloudera · 2018-07-02T00:17:40Z

src/parquet/hasher.h

+  ///
+  /// @param value the value to hash.
+  /// @return hash result.
+  virtual uint64_t Hash(int32_t value) = 0;


Why not make these also const methods?

jbapple-cloudera · 2018-07-03T16:57:16Z

src/parquet/bloom.h

+  ///
+  /// @param num_bytes number of bytes for bitset. The range of num_bytes should be within
+  ///   [BYTES_PER_FILTER_BLOCK, DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES], it will be
+  ///   rounded up/down to lower/upper bound if num_bytes is out of range nd also


nit: "and also"

jbapple-cloudera · 2018-07-03T16:58:20Z

src/parquet/bloom.h

-  void setMask(uint32_t key, uint32_t mask[8]);
+  /// Set bits in mask array according to input key.
+  /// @param key the value to calculate mask values.
+  /// @param mask the mask array is used to set or clear bits inside a block


nit: this is never used to clear any bits.

jbapple-cloudera · 2018-07-03T17:06:47Z

src/parquet/CMakeLists.txt

@@ -50,6 +52,7 @@ install(FILES
  "${CMAKE_CURRENT_BINARY_DIR}/parquet.pc"
  DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/")

+ADD_PARQUET_TEST(bloom-test)


I am not comfortable that this patch is complete until it is clearly used somewhere other than a test. If BFs are part of the format, and this repo includes utilities for reading or writing the format, they should include this code in their compiled binaries.

jbapple-cloudera · 2018-07-03T17:12:38Z

src/parquet/bloom.cc

+}
+
+void Bloom::WriteTo(OutputStream* sink) {
+  sink->Write(reinterpret_cast<const uint8_t*>(&num_bytes_), sizeof(uint32_t));


The problem here is that you don't know that num_bytes_ has size sizeof(uint32_t). Please use an explicit underlying type: https://en.cppreference.com/w/cpp/language/enum

I mentioned this in a previous comment, to which you replied "will update in next commit." I assume you fixed the part of the comment you did understand but did not quite catch my meaning about the underlying type. That happens all the time, and it's nothing to be uncomfortable about. I strongly suggest asking for clarification if you get a comment from a code reviewer and you do not understand what it means.

My bad, I did miss this. I assume you are talking about add underlying type for HashStratey and Algorithm enum.

jbapple-cloudera · 2018-07-03T17:15:42Z

src/parquet/bloom-test.cc

+  EXPECT_TRUE(exist < total_count * fpp);
+}
+
+// The CompatibilityTest is use to test cross compatibility with parquet-mr, it read


nit: "it reads"

xhochy · 2018-07-28T15:29:47Z

src/parquet/bloom_filter_algorithm.cc

+
+  num_bytes_ = num_bytes;
+
+  ::arrow::Status status =


Instead of calling raw Allocate use https://github.com/apache/arrow/blob/c6e33d89d10d9aa324cbd5b6776cadbb367970a7/cpp/src/arrow/buffer.h#L232 to get a Buffer. This will then be automatically be deallocated on destruction of the reference.

Thanks, I will use it. Just one thing needs to be consider, converting from uint8_t* to uint32_t* with reinterpret_cast may lead to undefined behavior.

Hi @jbapple-cloudera , looks like reinterpret_cast from uint8_t* to other bigger size type pointer is widely used in this project and also arrow, such as TypedBufferBuilder. Is the undefined behavior not present when the pointer's address and the buffer size is a multiply of requirement?

#425 removed -fno-strict-aliasing from the build flags. This makes many common calls to reinterpret_cast invalid. If you see lots of other use, I'd suggest starting a thread on dev@ on the subject.

Note that the pointer returned from Allocate is guaranteed to be 64 byte aligned. Thus all these casts should be safe.

xhochy · 2018-07-28T15:31:52Z

src/parquet/bloom_filter_algorithm.cc

+
+  ::arrow::Status status =
+      pool_->Allocate(num_bytes, reinterpret_cast<uint8_t**>(&bitset32_));
+  if (!status.ok()) {


Use the PARQUET_THROW_NOT_OK macro to check the Status value

Yes, thanks.

xhochy · 2018-07-28T15:32:16Z

src/parquet/bloom_filter_algorithm.cc

+BlockBasedAlgorithm::BlockBasedAlgorithm(const uint8_t* bitset, uint32_t num_bytes)
+    : BlockBasedAlgorithm(num_bytes) {
+  if (!bitset) {
+    throw ParquetException("Given bitste is NULL");


xhochy · 2018-07-28T15:33:40Z

src/parquet/bloom_filter.cc

+      this->hasher_.reset(new MurmurHash3());
+      break;
+    default:
+      throw parquet::ParquetException("Unsupported hash strategy.");


While you're in the parquet namespace, there is no need to prefix this with parquet::

ok. Thanks.

jbapple-cloudera · 2018-07-29T19:49:38Z

src/parquet/bloom_filter-test.cc

@@ -41,41 +41,19 @@ TEST(Murmur3Test, TestBloomFilter) {
 TEST(ConstructorTest, TestBloomFilter) {
  EXPECT_NO_THROW(BloomFilter bloom_filter1(1000));

-  std::unique_ptr<uint32_t[]> bitset1(new uint32_t[1024]());
+  // It throws because the length does not match the bitset


That's not why it throws, exactly - the constructor cannot check the length of the bitset. It throws because the length cannot be 0.

jbapple-cloudera · 2018-07-29T19:55:03Z

src/parquet/bloom_filter.cc

+  if (static_cast<uint32_t>(bytes_available) != sizeof(uint32_t)) {
+    throw ParquetException("Failed to deserialize from input stream");
+  }
+  if (static_cast<HashStrategy>(hash) != HashStrategy::MURMUR3_X64_128) {


This is already in the constructor; no need to duplicate it here.

jbapple-cloudera · 2018-07-29T20:05:19Z

src/parquet/bloom_filter.cc

  }
+
+  return new BloomFilter(input->Read(len, &bytes_available), len);


Why not return by value? Why return a pointer?

jbapple-cloudera · 2018-07-29T20:23:04Z

src/parquet/bloom_filter.h

-    }
-    external_bitset_ = NULL;
-  }
+  /// Deserialize the Bloom filter from an input stream, it is used when reconstructing


nit: period, not a comma. Also, capitalize "It".

jbapple-cloudera · 2018-07-29T20:24:17Z

src/parquet/bloom_filter.h

+  enum class HashStrategy : uint32_t { MURMUR3_X64_128 = 0 };
+
+  // Bloom filter algorithm.
+  enum class Algorithm : uint32_t { BLOCK = 0 };


Why are these moved back to public?

It was used in deserialize function. It can be move back to private since I will remove algorithm and hash check in deserialize function and let constructor to check them.

jbapple-cloudera · 2018-07-30T02:30:09Z

src/parquet/bloom_filter_algorithm.cc

+
+  num_bytes_ = num_bytes;
+
+  ::arrow::Status status =


#425 removed -fno-strict-aliasing from the build flags. This makes many common calls to reinterpret_cast invalid. If you see lots of other use, I'd suggest starting a thread on dev@ on the subject.

jbapple-cloudera · 2018-07-30T02:35:35Z

src/parquet/bloom_filter_algorithm.cc

 #include "parquet/types.h"

 namespace parquet {
 constexpr uint32_t BlockBasedAlgorithm::SALT[BITS_SET_PER_BLOCK];

+BlockBasedAlgorithm::BlockBasedAlgorithm(uint32_t num_bytes)


This is getting complex, which is why I suggest, as I mentioned a few times before, having two classes:

Filter, which is has pure abstract virtual methods, and BlockSplitBloomFilter, but not BloomFilterAlgorithm class.

struct Filter { virtual void InsertHash(uint64_t) = 0; virtual void FindHash(uint64_t) const = 0; // Hash methods, destructor, all pure virtual }; struct BlockSplitBloomFilter : public Filter { // Constructor // Overrides };

I suggest you remove the BloomFilterAlgorithm class.

Agree with you, that looks like simple. Just updated code, please help to have a look.

jbapple-cloudera · 2018-07-30T02:38:17Z

src/parquet/bloom_filter_algorithm.h

+  explicit BlockBasedAlgorithm(uint32_t num_bytes);
+
+  /// The constructor of the block-based algorithm. It copies the bitset as underlying
+  /// bitset because the given bitset may not satisfy the 64-byte alignment requirement


The alignment requirement is not 64-bytes, it is either 4 bytes (as the code is now) or 32 bytes (AVX2).

Did we discuss the alignment should be 64 bytes since the SIMD maybe trigger segfault if it access cross cache line data?

jbapple-cloudera · 2018-07-30T02:42:06Z

src/parquet/bloom_filter.h

+  /// @param num_bytes  The number of bytes of given bitset.
+  BloomFilter(const uint32_t* bitset, uint32_t num_bytes);
+
+  BloomFilter(const BloomFilter& orig) = delete;


While that is technically possible, that's a somewhat outdated C++ idiom.

jbapple-cloudera · 2018-07-30T02:48:52Z

src/parquet/bloom_filter_algorithm.cc

+  num_bytes_ = num_bytes;
+
+  ::arrow::Status status =
+      pool_->Allocate(num_bytes, reinterpret_cast<uint8_t**>(&bitset32_));


This comment from last time was not addressed, I think:

Incorrect alignment can produce incorrect compilation even without SIMD. If we treat the result like a uint32_t * and it is not 4-byte aligned, then the compilation can produce surprising and incorrect code.

The documentation says that Allocate returns 8-byte aligned results. Please add a check that it is at least alignof(decltype(bitset_[0])).

The reply maybe covered, I replied this comment:

I 'm ok to add the alignment check. Just want to confirm if we are looking at same document, the document from my eclipse is:
/// Allocate a new memory region of at least size bytes.
///
/// The allocated region shall be 64-byte aligned.
virtual Status Allocate(int64_t size, uint8_t** out) = 0;

Do you mean I haven't add check in commit？

jbapple-cloudera · 2018-07-30T02:51:54Z

src/parquet/bloom_filter.cc

+  int64_t bytes_available;
+
+  uint32_t len;
+  memcpy(&len, input->Read(sizeof(uint32_t), &bytes_available), sizeof(uint32_t));


Can Read return a nullptr? If so, you'll want to checkthat the result is non-null before calling memcpy.

jbapple-cloudera · 2018-08-03T20:59:09Z

src/parquet/bloom_filter.h

@@ -77,51 +42,209 @@ class BloomFilter final {
  // in block-based algorithm.
  static constexpr uint32_t MINIMUM_BLOOM_FILTER_BYTES = 32;


This constant is specific to the BlockSplitBF and belongs in that class, not this one.

jbapple-cloudera · 2018-08-03T21:01:33Z

src/parquet/bloom_filter.h

@@ -77,51 +42,209 @@ class BloomFilter final {
  // in block-based algorithm.
  static constexpr uint32_t MINIMUM_BLOOM_FILTER_BYTES = 32;

-  BloomFilter(const BloomFilter& orig) = delete;
-
  /// Calculate optimal size according to the number of distinct values and false


This is also specific to BlockSplitBF

jbapple-cloudera · 2018-08-03T21:07:42Z

src/parquet/bloom_filter.h

+  BlockSplitBloomFilter(const uint8_t *bitset, uint32_t num_bytes);
+
+  // Bytes in a tiny Bloom filter block.
+  static constexpr int BYTES_PER_FILTER_BLOCK = 32;


Why is this public? same with BITS_SET_PER_BLOCK, BlockMask, and SALT.

jbapple-cloudera · 2018-08-03T21:11:23Z

src/parquet/bloom_filter.h

+  ///
+  /// @param hash the hash used to calculate bit index to test
+  /// @return true
+  bool TestBits(uint64_t hash) const;


Why have TestBits and SetBits? Why not just put the bodies of those functions in FindHash and InsertHash?

jbapple-cloudera · 2018-08-03T21:17:06Z

src/parquet/bloom_filter.h

+  ::arrow::MemoryPool *pool_;
+
+  // The underlying buffer of bitset.
+  std::shared_ptr<Buffer> data_;


Since you wisely pointed out that we can get the buffer to be 64-byte aligned, you can make this a std::unique_ptr<uint32_t>. Sorry for the churn!

AllocateBuffer needs shared_ptr as input param and looks like no easy way to convert shared_ptr to unique_ptr.

Can you help me remember: why not use MemoryPool::Allocate then?

@xhochy suggested to use AllocateBuffer so that we don't have to free the buffer manually.

If you use MemoryPool::Allocate and unique_ptr with a Deleter, you don't need a shared_ptr. Generally, you want to avoid shared_ptr when unique_ptr will suffice.

Nearly all APIs related to memory allocation in Arrow use shared_ptr<Buffer> (because memory / buffer sharing is fundamental to how the project works)

Is that the right choice here - what other references do you expect there will be to this buffer other than in a single Bloom filter?

Reimplementing arrow::AllocateBuffer and adding a custom deleter for the sake of using unique_ptr seems a bit elaborate to me. It would be more worth your time to add an AllocateBuffer API that returns unique_ptr. I opened https://issues.apache.org/jira/browse/ARROW-2998

Hi @jbapple-cloudera, I tried to use custom deleter but looks like no easy way to implement it. This is because MemoryPool needs two arguments, pointer and size, to free a buffer. While a custom Deleter needs only one pointer argument. I also tried to use following way:

void FreeBitset(uint8_t* data, uint32_t num_bytes) { if (data) { ::arrow::MemoryPool *pool = ::arrow::default_memory_pool(); pool->Free(data, num_bytes); } } auto deleter= [&](uint8_t* p) { FreeBitset(p, ???); }; std::unique_ptr<uint8_t[], decltype(deleter)> data_;

But don't know how to pass in the num_bytes argument.

I saw you have assigned the ARROW-2998 to yourself and already submit PR yet, thanks!

jbapple-cloudera · 2018-08-03T21:24:56Z

src/parquet/bloom_filter.h

+  /// [MINIMUM_BLOOM_FILTER_BYTES, MAXIMUM_BLOOM_FILTER_BYTES], it will be
+  /// rounded up/down to lower/upper bound if num_bytes is out of range and also
+  /// will be rounded up to a power of 2. It will use murmur3_x64_128 as its
+  /// default


nit: odd line break here

jbapple-cloudera · 2018-08-03T21:25:41Z

src/parquet/bloom_filter.h

+  /// rounded up/down to lower/upper bound if num_bytes is out of range and also
+  /// will be rounded up to a power of 2. It will use murmur3_x64_128 as its
+  /// default
+  /// hash function and the block-based algorithm.


Don't need to say the default algorithm, only the hash function. Here and below.

jbapple-cloudera · 2018-08-03T21:37:08Z

src/parquet/bloom_filter.cc

  int64_t bytes_available;
+  const uint8_t* read_buffer = NULL;


Delay this definition until the first assignment on line 82.

jbapple-cloudera · 2018-08-03T21:39:44Z

src/parquet/bloom_filter.cc

+  if (static_cast<uint32_t>(bytes_available) != sizeof(uint32_t) || !read_buffer) {
+    throw ParquetException("Failed to deserialize from input stream");
+  }
+  memcpy(&hash, read_buffer, sizeof(uint32_t));
  if (static_cast<uint32_t>(bytes_available) != sizeof(uint32_t)) {


This is duplicated with the code on line 90. Same for line 107 and 103, respectively.

jbapple-cloudera · 2018-08-03T21:45:22Z

src/parquet/bloom_filter.cc

  if (static_cast<uint32_t>(bytes_available) != sizeof(uint32_t)) {
    throw ParquetException("Failed to deserialize from input stream");
  }
  if (static_cast<Algorithm>(algorithm) != BloomFilter::Algorithm::BLOCK) {
    throw ParquetException("Unsupported Bloom filter algorithm");
  }

-  return new BloomFilter(input->Read(len, &bytes_available), len);
+  return new BlockSplitBloomFilter(input->Read(len, &bytes_available), len);


Please return a value, not a new-allocated pointer.

wesm · 2018-08-04T15:37:14Z

I will give this a review when I can. It would be valuable to have bloom filters available in Apache Arrow also, related to our mailing list discussion happening right now https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E

wesm

I made some comments. This is a lot of virtual APIs for a construct which is designed for hashing a large number of small values in the innermost loop of an encoding or decoding algorithm. I would prefer to see no use of virtual on the hot path. I would even go so far as to recommend passing the hashing algorithm as a template argument to the Bloom filter

wesm · 2018-08-04T20:03:04Z

CMakeLists.txt

  src/parquet/column_reader.cc
  src/parquet/column_scanner.cc
  src/parquet/column_writer.cc
  src/parquet/exception.cc
  src/parquet/file_reader.cc
  src/parquet/file_writer.cc
  src/parquet/metadata.cc
+  src/parquet/murmur3.cc
  src/parquet/parquet_constants.cpp
  src/parquet/parquet_types.cpp
  src/parquet/printer.cc


What is this binary data file and can it be generated rather than checked into the repo?

The binary file is used to ensure cross compatibility between parquet-mr and parquet-cpp.

I think we should store binary files someplace else, like in a parquet-compatibility repo. I can request ASF Infra to create such a repository if you think that's a good idea

Thanks, I agree that's a good idea. Other parquet implementation can benefit from that as well. How long it will take to create such a repo and any dependency? I have no experience on this.

wesm · 2018-08-04T20:03:39Z

src/parquet/README

+Step 1: Construct a Bloom filter with 1024 bytes of bitset.
+Step 2: Insert hashes of "hello", "parquet", "bloom", "filter" strings to Bloom filter
+by calling hash and insert APIs.
+Step 3: Call writeTo API to write to File.


I am not sure this is the right place for this. Put this comment in bloom_filter-test.cc?

We already have this doc in bloom_filter-test.cc.

wesm · 2018-08-04T20:05:04Z

src/parquet/bloom_filter.h

+// A Bloom filter is a compact structure to indicate whether an item is not in a set or
+// probably in a set. The Bloom filter usually consists of a bit set that represents a
+// set of elements, a hash strategy and a Bloom filter algorithm.
+class BloomFilter {


Pretty much all of our hashing and general algorithmic code for parquet-cpp is contained in the arrow/util directory in Apache Arrow. Maybe this should be contributed there also?

We can open a jira to track this when we complete this, is that OK?

wesm · 2018-08-04T20:06:09Z

src/parquet/bloom_filter.cc

+  }
+
+  num_bytes_ = num_bytes;
+  PARQUET_THROW_NOT_OK(::arrow::AllocateBuffer(pool_, num_bytes_, &data_));


I'm -1 on non-trivial constructors that can throw exceptions. Ideally a static alternate constructor returning arrow::Status

Maybe we can add a Init function to take over actual initialization work.

wesm · 2018-08-04T20:06:36Z

src/parquet/bloom_filter.cc

+      hash_strategy_(HashStrategy::MURMUR3_X64_128),
+      algorithm_(Algorithm::BLOCK) {
+  if (!bitset) {
+    throw ParquetException("Given bitset is NULL");


Is this a pre-condition? If so use DCHECK

wesm · 2018-08-04T20:07:51Z

src/parquet/bloom_filter.h

+ public:
+  // Maximum Bloom filter size, it sets to HDFS default block size 128MB
+  // This value will be reconsidered when implementing Bloom filter producer.
+  static constexpr uint32_t MAXIMUM_BLOOM_FILTER_BYTES = 128 * 1024 * 1024;


Use kMaximumBloomFilterBytes naming style for better consistency with Google style guide

wesm · 2018-08-04T20:08:34Z

src/parquet/bloom_filter.h

+  /// @param hash the element to contain.
+  /// @return false if value is definitely not in set, and true means PROBABLY
+  /// in set.
+  virtual bool FindHash(uint64_t hash) const = 0;


Having all of these virtual functions potentially on the hot path is concerning. What is the rationale for using virtual?

wesm · 2018-08-04T20:10:15Z

src/parquet/bloom_filter.h

+//
+// This implementation sets 8 bits in each tiny Bloom filter. Each tiny Bloom
+// filter is 32 bytes to take advantage of 32-byte SIMD instructions.
+class BlockSplitBloomFilter : public BloomFilter {


Per my comment above, if there is only a single initial implementation of BloomFilter, putting virtual paths on the hot path seems potentially concerning. There are a number of approaches possible (like CRTP) to avoid virtual while permitting different implementations

wesm · 2018-08-04T20:11:25Z

src/parquet/bloom_filter.h

+  BlockSplitBloomFilter(const uint8_t* bitset, uint32_t num_bytes);
+
+  // Minimum Bloom filter size, it sets to 32 bytes to fit a tiny Bloom filter.
+  static constexpr uint32_t MINIMUM_BLOOM_FILTER_BYTES = 32;


See comment re: name style for constants

wesm · 2018-08-04T20:12:03Z

src/parquet/hasher.h

+  ///
+  /// @param value the value to hash.
+  /// @return hash result.
+  virtual uint64_t Hash(int32_t value) const = 0;


Is there an approach possible that does not use virtual?

jbapple-cloudera · 2018-08-04T20:29:53Z

@wesm In answer to your question: yes, there are ways to address this without virtual. I originally suggested it as a way to address the open universe construct of knowing that we have a single filter API and multiple possible BF implementations and hash functions, even though we're starting with just one of each.

chenjunjiedada · 2018-08-10T01:55:19Z

Thanks @wesm and @jbapple-cloudera , I saw the ARROW-2998 already committed.

The latest commit cannot be applied directly since parquet-cpp is using arrow repo at Dec, 2017 and it should have many changes from then on. I can create a jira to track shared_ptr to unique_ptr under Bloom filter umbrella jira, is that OK?

jbapple-cloudera · 2018-08-10T14:39:08Z

Yes, @cjjnjust , it's fine with me if we leave shared_ptr for now.

I think the largest unsettled issue is the discussion @wesm started about the performance of virtual methods.

Wes, I'd like to enable an API where users can write code something like

void ProcessFilter(const char * serialized_filter) {
  unique_ptr<Filter> f = Filter::Deserialize(serialized_filter);
  // process here
}

We could accomplish something similar with templates, but all of the processing and storage would then need to lift the type to a template parameter. I think we'll get a better picture of how onerous this is once we have some non-test use of the filter in parquet-cpp. How would you feel about revisiting that question then?

wesm · 2018-08-10T15:46:18Z

I'm cool with tabling the virtual function discussion. I will be able to assist with refactoring for CRTP-ifying things when the time comes. I'd prefer to get the outstanding patches merged quickly and move as expediently as possible toward the Arrow-Parquet monorepo structure under discussion to make these things easier to work on

jbapple-cloudera · 2018-08-12T21:48:41Z

OK, LGTM. Wes, since you're a committer, do you want to check this in and then we can work on some follow-on patches, like templatizing, integration with the rest of the codebase, a parquet-binary repo, and so on?

wesm · 2018-08-15T12:57:06Z

+1. Merging this, sounds good. I created https://github.com/apache/parquet-testing; @cjjnjust let me know when you want to move the binary files there

daedric reviewed Jan 24, 2018

View reviewed changes

daedric reviewed Jan 29, 2018

View reviewed changes

src/parquet/bloom.cc Outdated

Bloom::~Bloom() {

if (bitset) {

free(bitset);

Copy link

Contributor

daedric Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use delete[] bitset; it was newed not malloced

jbapple-cloudera suggested changes Jun 21, 2018

View reviewed changes

chenjunjiedada changed the title ~~PARQUET-41: Add bloom filter for parquet~~ PARQUET-1332: Add bloom filter for parquet Jun 22, 2018

jbapple-cloudera suggested changes Jun 22, 2018

View reviewed changes

chenjunjiedada force-pushed the master branch from 3721c5c to c22e125 Compare June 23, 2018 02:29

jbapple-cloudera suggested changes Jun 25, 2018

View reviewed changes

chenjunjiedada force-pushed the master branch 2 times, most recently from abb6daa to 35ea6b8 Compare June 25, 2018 12:17

jbapple-cloudera suggested changes Jun 26, 2018

View reviewed changes

chenjunjiedada force-pushed the master branch 5 times, most recently from ace7239 to aa52d6f Compare June 28, 2018 23:47

jbapple-cloudera suggested changes Jul 3, 2018

View reviewed changes

chenjunjiedada force-pushed the master branch 2 times, most recently from 44cd115 to 0ff60fd Compare July 13, 2018 05:08

Chen, Junjie added 7 commits July 22, 2018 23:46

PARQUET-1332: update according to code review

1374665

PARQUET-1332: use abstract class for hash class

0bc8595

PARQUET-1332: rebase to latest to solve CI error and update comments

731191c

PARQUET-1332: update code according to Jim's comments

1a105fc

PARQUET-1332: update according to latest comments

0c1e95f

PARQUET-1332: fix build failure for windows platform

ec6a6e9

PARQUET-1332: refine bloom filter algorithm

019322a

chenjunjiedada force-pushed the master branch from caccf40 to 019322a Compare July 22, 2018 16:02

PARQUET-1332: fix build error for clang Xcode

fd3ba23

xhochy reviewed Jul 28, 2018

View reviewed changes

jbapple-cloudera reviewed Jul 30, 2018

View reviewed changes

jbapple-cloudera suggested changes Jul 30, 2018

View reviewed changes

PARQUET-1332: refine the complex classes

fe97b44

chenjunjiedada force-pushed the master branch from a210966 to fe97b44 Compare August 2, 2018 02:55

jbapple-cloudera suggested changes Aug 3, 2018

View reviewed changes

PARQUET-1332: update code for some coding style

b0f3f80

wesm reviewed Aug 4, 2018

View reviewed changes

PARQUET-1332: update constructor

d4d3018

wesm mentioned this pull request Aug 8, 2018

ARROW-2998: [C++] Add unique_ptr versions of Allocate[Resizable]Buffer apache/arrow#2395

Closed

wesm closed this in 7a09ade Aug 15, 2018

jbapple-cloudera mentioned this pull request Oct 7, 2018

PARQUET-1328: Add Bloom filter reader and writer apache/parquet-java#521

Closed

This was referenced Jun 23, 2024

[C++][Parquet] Add bloom filter utility class apache/arrow#42829

Closed

Add bloom filters to parquet statistics apache/parquet-java#1468

Closed

		}

		return new BloomFilter(input->Read(len, &bytes_available), len);

		@@ -77,51 +42,209 @@ class BloomFilter final {
		// in block-based algorithm.
		static constexpr uint32_t MINIMUM_BLOOM_FILTER_BYTES = 32;

PARQUET-1332: Add bloom filter for parquet #432

PARQUET-1332: Add bloom filter for parquet #432

Conversation

chenjunjiedada commented Jan 19, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenjunjiedada commented Jan 29, 2018

daedric commented Jan 29, 2018

chenjunjiedada commented May 3, 2018

chenjunjiedada commented May 23, 2018

jbapple-cloudera left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenjunjiedada Jun 21, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenjunjiedada commented Jun 21, 2018

chenjunjiedada commented Jun 22, 2018

jbapple-cloudera left a comment

Choose a reason for hiding this comment

chenjunjiedada commented Jun 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenjunjiedada Jun 25, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenjunjiedada Jun 21, 2018 •

edited

Loading

chenjunjiedada commented Jun 23, 2018 •

edited

Loading

chenjunjiedada Jun 25, 2018 •

edited

Loading