-
Notifications
You must be signed in to change notification settings - Fork 192
PARQUET-1332: Add bloom filter for parquet #432
Conversation
src/parquet/bloom.cc
Outdated
this->hashFunc = &MurmurHash3_x64_128; | ||
break; | ||
default: | ||
throw new parquet::ParquetException("Not supported hash strategy"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure you want to new the exception ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch, thanks.
src/parquet/bloom.h
Outdated
Algorithm algorithm; | ||
|
||
// The underlying byte array for bloom filter bitset. | ||
uint8_t* bitset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The move ctor and operator are not deleted which could lead to a crash since the bitset is freed in the dtor.
Good practice would be to wrap it in a std::unique_ptr<uint8_t[]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. applied.
src/parquet/bloom.cc
Outdated
|
||
Bloom::~Bloom() { | ||
if (bitset) { | ||
free(bitset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use delete[] bitset; it was new
ed not malloc
ed
Not sure why CI build error, valgrind complain for "jump or move depends on uninitialised value". |
You don't memset to 0 your int array so it contains garbage :) |
Hi @winningsix and @daedric |
@wesm @xhochy @daedric Since @winningsix has no bandwidth to handle this, I 'm back to take this again. This has been reviewed for long time, could you please kindly help to have a look again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have more comments later, but I didn't want you to have to wait too long to get started on some feedback.
src/parquet/bloom.h
Outdated
static constexpr uint32_t SALT[8] = { 0x47b6137bU, 0x44974d91U, 0x8824ad5bU, | ||
0xa2b7289dU, 0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U }; | ||
|
||
typedef void (*HashFunc)(const void *, int, uint32_t, void*); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give these parameters names? Also, the last parameter should/can be a uint64_t [2]
, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added full names, I'm not understand the last question, why we need to change it? any benefit?
src/parquet/bloom.h
Outdated
public: | ||
/// Constructor of bloom filter, if numBytes is zero, bloom filter bitset | ||
/// will be created lazily and the number of bytes will be calculated through | ||
/// distinct values in cache. It use murmur3_x64_128 as its default hash function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "It will use"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/parquet/bloom.h
Outdated
/// Constructor of bloom filter, if numBytes is zero, bloom filter bitset | ||
/// will be created lazily and the number of bytes will be calculated through | ||
/// distinct values in cache. It use murmur3_x64_128 as its default hash function | ||
/// and block based algorithm as default algorithm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "and the block based algorithm." No need to add "as default algorithm".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/parquet/bloom.h
Outdated
// Bytes in a tiny bloom filter block. | ||
static constexpr int BYTES_PER_FILTER_BLOCK = 32; | ||
|
||
// Default seed for hash function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add in the comment on line 64 where you got this default seed from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/parquet/bloom.h
Outdated
// Default maximum bloom filter size (need to discuss) | ||
static constexpr int DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES = 16 * 1024 * 1024; | ||
|
||
// The block based algorithm needs 8 odd SALT values to calculate eight index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits: "block-based", "eight indexes", "one bit in each 32-bit word"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/parquet/bloom.h
Outdated
std::unique_ptr<uint8_t[]> bitset; | ||
|
||
// Hash function applied. | ||
HashFunc hashFunc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need both hashFunc
and hash_strategy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hash_strategy is used to specify which hash function to use (MurmurHash3_x86_32, MurmurHash3_x86_128, MurmurHash3_x64_128). While hashFunc is a function pointer used to point to the function we chose by hash_strategy.
src/parquet/bloom.h
Outdated
void writeTo(const std::shared_ptr<OutputStream>& sink); | ||
|
||
private: | ||
// Create a new bitset for bloom filter, at least 256 bits will be create. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: istead of "at least 256 bits will be create", use "the size will be at least 32 bytes". Also, please document any restrictions on num_bytes
and the consequences for ignoring them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/parquet/bloom.h
Outdated
// Create a new bitset for bloom filter, at least 256 bits will be create. | ||
// @param numBytes number of bytes for bitset | ||
void initBitset(uint32_t num_bytes); | ||
void setMask(uint32_t key, uint32_t mask[8]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment for every method, even private ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/parquet/bloom.cc
Outdated
sink->Write(bitset.get(), num_bytes); | ||
} | ||
|
||
Bloom::~Bloom() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put this in .h with and use " = default;"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/parquet/bloom.h
Outdated
Algorithm algorithm; | ||
|
||
// The underlying byte array for bloom filter bitset. | ||
std::unique_ptr<uint8_t[]> bitset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost always going to be used as a uint32_t[]
, yes? Can we use that instead of uint8_t[]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Thanks Jim, I have updated code accordingly. I also added an unit test for compatibility test. Not sure why CI still failed, four build passed, two build failed. |
Change title to conform to subtask JIRA. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something happened with your rebase and it is now huge, including files not directly related to BFs. You might be able to fix this if you changee your GH repo's master to match parquet-cpp's master.
Once that's set, I'll resume my review.
Sorry, the rebase seems weird, it contains several same commits. I just picked up all related changes and patched to master, it overrides commit history, but I think it should be much clean to cherry pick all commits. It needs you to check previously comments by manually click "Show outdated", but the updated code is not show in that conversation box, you can review the latest code by click the "File Changes" at top of this page. |
src/parquet/bloom.h
Outdated
static constexpr double DEFAULT_FPP = 0.01; | ||
|
||
// Bloom filter data header, including number of bytes, hash strategy and algorithm. | ||
static constexpr int HEADER_SIZE = 12; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't appear to be used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove this.
src/parquet/bloom.h
Outdated
* Default false positive probability value use to calculate optimal number of bits | ||
* used by bloom filter. | ||
*/ | ||
static constexpr double DEFAULT_FPP = 0.01; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't appear to be used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove in next commit.
src/parquet/bloom.h
Outdated
static constexpr int BYTES_PER_FILTER_BLOCK = 32; | ||
|
||
// Default seed for hash function, to keep cross compatibility the seed is same as | ||
// value defined in parquet-mr. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this seed picked in parquet-mr? What is the origin of this value for the seed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The seed is a primer which comes from org.apache.hive.common.util.Murmur3.java. I haven't found why it use this primer, but this value is widely used in different versions of implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please always leave a comment in the code when you use a "magic number" like this regarding where you got that number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will state this comes from other library or jar in next commit.
src/parquet/bloom.h
Outdated
// value defined in parquet-mr. | ||
static constexpr int DEFAULT_SEED = 104729; | ||
|
||
// Default maximum bloom filter size (need to discuss) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to discuss with whom?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I think this would need to discuss with parquet community, maybe we can set to 1/8 or parquet.block.size in case of one column of long type with its distinct values ratio equal to 1 and FPP set to 0.01. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would it mean to set this to 1/8? I don't understand. Doesn't this need to be an integer?
I would suggest starting the conversation on dev@ if you think it needs to be discussed by the community before committing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will start thread on this.
src/parquet/bloom.cc
Outdated
addElement(hash); | ||
} | ||
|
||
uint64_t Bloom::hash(int value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of these can be replaced by a template with specializations for the types that are not POD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean types are POD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that most of these could be replaced by a single template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will do this in next commit.
src/parquet/CMakeLists.txt
Outdated
@@ -50,6 +52,7 @@ install(FILES | |||
"${CMAKE_CURRENT_BINARY_DIR}/parquet.pc" | |||
DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") | |||
|
|||
ADD_PARQUET_TEST(bloom-test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only used in the test? What must be done to actually integrate it with the rest of the repository and get it into some binary tools or libraries that can be used by consumers of parquet-cpp build artifacts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is only expected to be used in test. I don't expect this unit test to be contained in final binary file. I think this test is used to verify correctness and compatibility before parquet-cpp releasing. Do we need to include unit tests in release jar or .so/.a?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand. If this file isn't used anywhere else in the project, how can it be actually used? @rdblue , maybe you could shed some light on how new features are usually integrated into the non-test parts of this project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file is used when execute "make unittest" command, it run several unit test. But the unit test code is not linked into finally parquet-cpp binary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the parquet-cpp binary should link to this Bloom filter code in some way. Do you disagree, @cjjnjust ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I thought it is not linked to, while I just see it is linked into final binary in compiling log. Let me investigate how other unit test achieve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just reviewed this again, bloom-test.cc is not linked to either libparquet.a or libparquet.so, the logs I mentioned above is linking bloom-test.o to bloom-test binary in debug/bloom-test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not comfortable that this patch is complete until it is clearly used somewhere other than a test. If BFs are part of the format, and this repo includes utilities for reading or writing the format, they should include this code in their compiled binaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your concern while we should discussed in previous parquet sync-up meeting and agreed to implement BF utility in the test scope firstly since BFs are optional of metadata. BFs are planed to be used when either read or write side is implemented, we should include some usages right in next steps, it that OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Please note as such in your commit message.
src/parquet/bloom.h
Outdated
public: | ||
/// Constructor of bloom filter. The range of num_bytes should be within | ||
/// [BYTES_PER_FILTER_BLOCK, DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES], it will be | ||
/// round up/down to lower/upper bound if num_bytes is out of range. It will use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "rounded up/down"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update in next commit.
src/parquet/bloom.h
Outdated
/// Constructor of bloom filter. The range of num_bytes should be within | ||
/// [BYTES_PER_FILTER_BLOCK, DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES], it will be | ||
/// round up/down to lower/upper bound if num_bytes is out of range. It will use | ||
/// murmur3_x64_128 as its default hash function and the block based algorithm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it makes sense for this class to include both an implementation and the interface of a filter. Would it make more sense to define a pure virtual Filter interface and then derive from that and override methods?
Shall we do the same for the hash function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea for using a function pointer is to treat murmur3 hash implementation as library functions, such as library here. So that one day if we want to switch to other hash in a c++ library, we only need to just assign the function pointer to it instead of change internal implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, but I'm not sure this addresses my concerns either about the Filter API or the Hash API. Pure abstract classes is a common idiom that more simply captures the intent that there is one API and multiple implementations.
The Derived classes, can, of course, call Murmur3 or whatever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood your point, will update after next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a abstract class Hasher, and let MurmurHash3 to inherit to it.
src/parquet/bloom.h
Outdated
/// function and block-based algorithm as default algorithm. | ||
/// @param bitset The given bitset to construct bloom filter. | ||
/// @param len Length of bitset. | ||
Bloom(const uint8_t* bitset, uint32_t len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little tricky: If we are going to treat bitset
as if it were an array of uint32_t
s, we run into http://eel.is/c++draft/basic.lval#11
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
- the dynamic type of the object,
...- a char, unsigned char, or std::byte type.
So we need uint8_t
to be char
, unsigned char
, or std::byte
. We'll need to check for alignment, as well, because compilers are allowed to assume that pointers to uint32_t
have alignment that is greater than that of uint8_t
.
Let's think about how this will get hooked into other libraries an ask @rdblue , who is more experienced in this code base, if we can put alignment requirements on the data we'll end up needing to use this with.
In an ideal world, I'd like it to be 64-byte aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally types alignment is equal to types size (Wiki, this may be different among compilers), so I think the last sentence you mentioned should be 64-bits aligned, right?
Therefore, uint32_t should have an four bytes alignment generally, consider we are restricting bitset length to [32bytes, 16Mbytes], and also length is power of 2, then the uint32_t* must not contains any padding, right?
BTW, as I known, uint8_t is typedef of unsigned char, am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do really mean 64-byte. That is the cache line size in most modern x86-64 CPUs. Alignment is especially useful for SIMD operations, in which misaligned access can cause a segfault.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, Now I understood your concern. SIMD operation should have some performance penalty when access unaligned data due to cross cache line access or cache miss (Some tests here: http://www.agner.org/optimize/blog/read.php?i=415#423), but I think it should not cause a segfault.
How about set minimum bloom filter size to 64 bytes to avoid this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some SIMD operations can segfault when crossing cache lines. Others won't.
How do you believe that setting the minimum BF size would fix this? What mechanism would you use to set the size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got your point from your email reply.
How about we check bitset address alignment and do copy if not? And of course document alignment requirement and performance impact. Can we do this firstly and consider BF client/consumer when doing read-side task?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean this kind segfalut? https://stackoverflow.com/questions/25596379/simd-intrinsics-segmentation-fault
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about for now making the parameter a uint32_t *? This puts the obligation on the caller. In the tests you wrote, you can do the alignment check and also check (with static_assert) that uint8_t is unsigned char.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do this. I don't understand the last sentence, do you mean I should add alignment check and static_assert code as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, add these in the test code. The return value you get from an allocatino should be aligned at at least alignof(uint32_t) and also uint8_t should be std::is_same to unsigned char.
src/parquet/bloom.cc
Outdated
|
||
uint64_t Bloom::hash(const Int96 *value) { | ||
uint64_t out[2]; | ||
(*hashFunc)(reinterpret_cast<void *>(const_cast<uint32_t*>(value->value)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This const_cast
is not necessarily standards-compliant. I expect the type signature of the hash function is a problem: it should not be modifying the data passed into it, so it should be OK with a const
pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the const_cast is not necessarily and the signature is right also, I need to change reinterpret_cast<void*> to reinterpret_cast<const void *>, will update in next commit.
abb6daa
to
35ea6b8
Compare
src/parquet/bloom.h
Outdated
#ifndef PARQUET_BLOOM_H | ||
#define PARQUET_BLOOM_H | ||
|
||
#include <set> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need the standard's set
header? I don't see it used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove in next commit.
src/parquet/bloom.h
Outdated
namespace parquet { | ||
class OutputStream; | ||
|
||
// Bloom Filter is a compact structure to indicate whether an item is not in set or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Either "Bloom filters are compact structures" or "The Bloom filter is a compact structure" or "A Bloom filter is a compact structure". That is, for singular improper nouns, you need an article like "a" or "the".
Note that "filter" is not usually capitalized in "Bloom filter".
This is also true in other comments in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will update in next commit.
src/parquet/bloom.h
Outdated
~Bloom() = default; | ||
|
||
// Calculate optimal size according to the number of distinct values and false | ||
// positive probability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please leave a blank line when breaking paragraphs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean add a blank line between function description and param description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a blank line, a line with just the comment characters and a newline.
In general, for style comments like this that you're not clear on, please check the other C++ files in this repo and copy their style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update in next commit.
src/parquet/bloom.h
Outdated
// @param ndv: The number of distinct values. | ||
// @param fpp: The false positive probability. | ||
// @return optimal number of bits of given n and p. | ||
static uint32_t optimalNumOfBits(uint32_t ndv, double fpp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The README.md says to use the Google style guide for C++, which specifies that methods start with a capital letter:
https://google.github.io/styleguide/cppguide.html#Function_Names
The README also describes how to run cpplint; please do so and correct the things it finds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update function names in next commit.
parquet-cpp already have cpplint tool integrated, we can run make lint to check.
src/parquet/bloom.h
Outdated
// Determine whether an element exist in set or not. | ||
// @param hash the element to contain. | ||
// @return false if value is definitely not in set, and true means PROBABLY in set. | ||
bool find(uint64_t hash); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend naming these FindHash
and InsertHash
to prevent users from accidentally thinking that they inserted an int16_t
or whatever, not a hash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, will update in next commit.
src/parquet/bloom-test.cc
Outdated
} | ||
|
||
// The exist should be probably less than 1000 according default FPP 0.01. | ||
ASSERT_TRUE(exist < 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make the total count and fpp into variables and use some multiple of their product here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update in next commit.
src/parquet/bloom-test.cc
Outdated
|
||
std::string testString[4] = {"hello", "parquet", "bloom", "filter"}; | ||
std::string data_dir(test::get_data_dir()); | ||
std::string bloomDataPath = data_dir + "/bloom_filter.bin"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a README in the directory with the .bin file to explain what it is and how you made it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add doc at begin of this test? Add README file in source subdir looks a bit weird and may be ignored easily.
src/parquet/bloom-test.cc
Outdated
for (int i = 0; i < 100000; i++) { | ||
ByteArray byte_array1(10, | ||
reinterpret_cast<const uint8_t*>(random_strings[i].c_str())); | ||
ASSERT_TRUE(de_bloom->find(de_bloom->hash(&byte_array1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want to use one of the CHECK
macros when handling the main substance of the test, not ASSERT
. See here and elsewhere in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't found other unit tests use CHECK macros yet. Do you mean DCHECK? Could you please list one code example use CHECK so that I can refer to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right - I meant EXPECT
: https://github.com/google/googletest/blob/master/googletest/docs/primer.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will update to use EXPECT_EQ in next commit.
src/parquet/bloom-test.cc
Outdated
ASSERT_TRUE(exist < 1000); | ||
} | ||
|
||
TEST(CompatibilityTest, TestBloomFilter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put a comment on this test to describe what compatibility it tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will add in next commit.
src/parquet/bloom-test.cc
Outdated
std::unique_ptr<uint8_t[]> bitset; | ||
|
||
std::string testString[4] = {"hello", "parquet", "bloom", "filter"}; | ||
std::string data_dir(test::get_data_dir()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an unusual indentation.
Thanks Jim for these really great comments! I 'd like to update some simple changes firstly, some other comments need more investigation and discussion will be updated after discussing with community. |
ace7239
to
aa52d6f
Compare
Hi @jbapple-cloudera , I have updated code, could you please take a look? |
I get a message for every update, so I was already notified. In the future, there is no need for an additional reminder. I have some other responsibilities and will get back to this in a couple of days. |
@jbapple-cloudera Sure, thanks. |
src/parquet/murmur3.cc
Outdated
|
||
uint64_t MurmurHash3::Hash(int64_t value) { | ||
uint64_t out[2]; | ||
Hash_x64_128(reinterpret_cast<void*>(&value), sizeof(int), seed_, out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sizeof(int), sizeof(int64_t).
As you have discovered (and I had overlooked), using a pure virtual base class requires separate methods, since there cannot be virtual template methods. However, you can call out to a helper template defined only in this translation unit. That method would avoid some of the copy-and-paste that created the error here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/parquet/murmur3.cc
Outdated
uint64_t MurmurHash3::Hash(const Int96* value) { | ||
uint64_t out[2]; | ||
Hash_x64_128(reinterpret_cast<const void*>(value->value), sizeof(value->value), | ||
seed_, out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indentation is off
src/parquet/murmur3.h
Outdated
|
||
#include <stdint.h> | ||
/// Original murmur3 hash implementation | ||
void Hash_x86_32(const void* key, int len, uint32_t seed, void* out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these three methods in the class interface at all? Why not make them helper functions visible only in the .cc file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/parquet/murmur3.h
Outdated
|
||
#endif // !defined(_MSC_VER) | ||
uint64_t Hash(int32_t value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/parquet/hasher.h
Outdated
/// | ||
/// @param value the value to hash. | ||
/// @return hash result. | ||
virtual uint64_t Hash(int32_t value) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not make these also const methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/parquet/bloom.h
Outdated
/// | ||
/// @param num_bytes number of bytes for bitset. The range of num_bytes should be within | ||
/// [BYTES_PER_FILTER_BLOCK, DEFAULT_MAXIMUM_BLOOM_FILTER_BYTES], it will be | ||
/// rounded up/down to lower/upper bound if num_bytes is out of range nd also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "and also"
src/parquet/bloom.h
Outdated
void setMask(uint32_t key, uint32_t mask[8]); | ||
/// Set bits in mask array according to input key. | ||
/// @param key the value to calculate mask values. | ||
/// @param mask the mask array is used to set or clear bits inside a block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is never used to clear any bits.
src/parquet/CMakeLists.txt
Outdated
@@ -50,6 +52,7 @@ install(FILES | |||
"${CMAKE_CURRENT_BINARY_DIR}/parquet.pc" | |||
DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") | |||
|
|||
ADD_PARQUET_TEST(bloom-test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not comfortable that this patch is complete until it is clearly used somewhere other than a test. If BFs are part of the format, and this repo includes utilities for reading or writing the format, they should include this code in their compiled binaries.
src/parquet/bloom.cc
Outdated
} | ||
|
||
void Bloom::WriteTo(OutputStream* sink) { | ||
sink->Write(reinterpret_cast<const uint8_t*>(&num_bytes_), sizeof(uint32_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is that you don't know that num_bytes_ has size sizeof(uint32_t). Please use an explicit underlying type: https://en.cppreference.com/w/cpp/language/enum
I mentioned this in a previous comment, to which you replied "will update in next commit." I assume you fixed the part of the comment you did understand but did not quite catch my meaning about the underlying type. That happens all the time, and it's nothing to be uncomfortable about. I strongly suggest asking for clarification if you get a comment from a code reviewer and you do not understand what it means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I did miss this. I assume you are talking about add underlying type for HashStratey and Algorithm enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
src/parquet/bloom-test.cc
Outdated
EXPECT_TRUE(exist < total_count * fpp); | ||
} | ||
|
||
// The CompatibilityTest is use to test cross compatibility with parquet-mr, it read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "it reads"
44cd115
to
0ff60fd
Compare
|
||
num_bytes_ = num_bytes; | ||
|
||
::arrow::Status status = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of calling raw Allocate
use https://github.com/apache/arrow/blob/c6e33d89d10d9aa324cbd5b6776cadbb367970a7/cpp/src/arrow/buffer.h#L232 to get a Buffer. This will then be automatically be deallocated on destruction of the reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I will use it. Just one thing needs to be consider, converting from uint8_t* to uint32_t* with reinterpret_cast may lead to undefined behavior.
Hi @jbapple-cloudera , looks like reinterpret_cast from uint8_t* to other bigger size type pointer is widely used in this project and also arrow, such as TypedBufferBuilder. Is the undefined behavior not present when the pointer's address and the buffer size is a multiply of requirement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#425 removed -fno-strict-aliasing
from the build flags. This makes many common calls to reinterpret_cast
invalid. If you see lots of other use, I'd suggest starting a thread on dev@ on the subject.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the pointer returned from Allocate
is guaranteed to be 64 byte aligned. Thus all these casts should be safe.
|
||
::arrow::Status status = | ||
pool_->Allocate(num_bytes, reinterpret_cast<uint8_t**>(&bitset32_)); | ||
if (!status.ok()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the PARQUET_THROW_NOT_OK
macro to check the Status value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks.
BlockBasedAlgorithm::BlockBasedAlgorithm(const uint8_t* bitset, uint32_t num_bytes) | ||
: BlockBasedAlgorithm(num_bytes) { | ||
if (!bitset) { | ||
throw ParquetException("Given bitste is NULL"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type
src/parquet/bloom_filter.cc
Outdated
this->hasher_.reset(new MurmurHash3()); | ||
break; | ||
default: | ||
throw parquet::ParquetException("Unsupported hash strategy."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're in the parquet
namespace, there is no need to prefix this with parquet::
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. Thanks.
src/parquet/bloom_filter-test.cc
Outdated
@@ -41,41 +41,19 @@ TEST(Murmur3Test, TestBloomFilter) { | |||
TEST(ConstructorTest, TestBloomFilter) { | |||
EXPECT_NO_THROW(BloomFilter bloom_filter1(1000)); | |||
|
|||
std::unique_ptr<uint32_t[]> bitset1(new uint32_t[1024]()); | |||
// It throws because the length does not match the bitset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not why it throws, exactly - the constructor cannot check the length of the bitset. It throws because the length cannot be 0.
src/parquet/bloom_filter.cc
Outdated
if (static_cast<uint32_t>(bytes_available) != sizeof(uint32_t)) { | ||
throw ParquetException("Failed to deserialize from input stream"); | ||
} | ||
if (static_cast<HashStrategy>(hash) != HashStrategy::MURMUR3_X64_128) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already in the constructor; no need to duplicate it here.
src/parquet/bloom_filter.cc
Outdated
} | ||
|
||
return new BloomFilter(input->Read(len, &bytes_available), len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return by value? Why return a pointer?
src/parquet/bloom_filter.h
Outdated
} | ||
external_bitset_ = NULL; | ||
} | ||
/// Deserialize the Bloom filter from an input stream, it is used when reconstructing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: period, not a comma. Also, capitalize "It".
enum class HashStrategy : uint32_t { MURMUR3_X64_128 = 0 }; | ||
|
||
// Bloom filter algorithm. | ||
enum class Algorithm : uint32_t { BLOCK = 0 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these moved back to public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was used in deserialize function. It can be move back to private since I will remove algorithm and hash check in deserialize function and let constructor to check them.
|
||
num_bytes_ = num_bytes; | ||
|
||
::arrow::Status status = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#425 removed -fno-strict-aliasing
from the build flags. This makes many common calls to reinterpret_cast
invalid. If you see lots of other use, I'd suggest starting a thread on dev@ on the subject.
#include "parquet/types.h" | ||
|
||
namespace parquet { | ||
constexpr uint32_t BlockBasedAlgorithm::SALT[BITS_SET_PER_BLOCK]; | ||
|
||
BlockBasedAlgorithm::BlockBasedAlgorithm(uint32_t num_bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting complex, which is why I suggest, as I mentioned a few times before, having two classes:
Filter
, which is has pure abstract virtual methods, and BlockSplitBloomFilter
, but not BloomFilterAlgorithm
class.
struct Filter {
virtual void InsertHash(uint64_t) = 0;
virtual void FindHash(uint64_t) const = 0;
// Hash methods, destructor, all pure virtual
};
struct BlockSplitBloomFilter : public Filter {
// Constructor
// Overrides
};
I suggest you remove the BloomFilterAlgorithm
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with you, that looks like simple. Just updated code, please help to have a look.
src/parquet/bloom_filter_algorithm.h
Outdated
explicit BlockBasedAlgorithm(uint32_t num_bytes); | ||
|
||
/// The constructor of the block-based algorithm. It copies the bitset as underlying | ||
/// bitset because the given bitset may not satisfy the 64-byte alignment requirement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alignment requirement is not 64-bytes, it is either 4 bytes (as the code is now) or 32 bytes (AVX2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we discuss the alignment should be 64 bytes since the SIMD maybe trigger segfault if it access cross cache line data?
src/parquet/bloom_filter.h
Outdated
/// @param num_bytes The number of bytes of given bitset. | ||
BloomFilter(const uint32_t* bitset, uint32_t num_bytes); | ||
|
||
BloomFilter(const BloomFilter& orig) = delete; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While that is technically possible, that's a somewhat outdated C++ idiom.
num_bytes_ = num_bytes; | ||
|
||
::arrow::Status status = | ||
pool_->Allocate(num_bytes, reinterpret_cast<uint8_t**>(&bitset32_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment from last time was not addressed, I think:
Incorrect alignment can produce incorrect compilation even without SIMD. If we treat the result like a uint32_t *
and it is not 4-byte aligned, then the compilation can produce surprising and incorrect code.
The documentation says that Allocate returns 8-byte aligned results. Please add a check that it is at least alignof(decltype(bitset_[0]))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reply maybe covered, I replied this comment:
I 'm ok to add the alignment check. Just want to confirm if we are looking at same document, the document from my eclipse is:
/// Allocate a new memory region of at least size bytes.
///
/// The allocated region shall be 64-byte aligned.
virtual Status Allocate(int64_t size, uint8_t** out) = 0;
Do you mean I haven't add check in commit?
src/parquet/bloom_filter.cc
Outdated
int64_t bytes_available; | ||
|
||
uint32_t len; | ||
memcpy(&len, input->Read(sizeof(uint32_t), &bytes_available), sizeof(uint32_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can Read
return a nullptr
? If so, you'll want to checkthat the result is non-null before calling memcpy
.
src/parquet/bloom_filter.h
Outdated
@@ -77,51 +42,209 @@ class BloomFilter final { | |||
// in block-based algorithm. | |||
static constexpr uint32_t MINIMUM_BLOOM_FILTER_BYTES = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constant is specific to the BlockSplitBF and belongs in that class, not this one.
src/parquet/bloom_filter.h
Outdated
@@ -77,51 +42,209 @@ class BloomFilter final { | |||
// in block-based algorithm. | |||
static constexpr uint32_t MINIMUM_BLOOM_FILTER_BYTES = 32; | |||
|
|||
BloomFilter(const BloomFilter& orig) = delete; | |||
|
|||
/// Calculate optimal size according to the number of distinct values and false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also specific to BlockSplitBF
src/parquet/bloom_filter.h
Outdated
BlockSplitBloomFilter(const uint8_t *bitset, uint32_t num_bytes); | ||
|
||
// Bytes in a tiny Bloom filter block. | ||
static constexpr int BYTES_PER_FILTER_BLOCK = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this public? same with BITS_SET_PER_BLOCK, BlockMask, and SALT.
src/parquet/bloom_filter.h
Outdated
/// | ||
/// @param hash the hash used to calculate bit index to test | ||
/// @return true | ||
bool TestBits(uint64_t hash) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have TestBits and SetBits? Why not just put the bodies of those functions in FindHash and InsertHash?
src/parquet/bloom_filter.h
Outdated
::arrow::MemoryPool *pool_; | ||
|
||
// The underlying buffer of bitset. | ||
std::shared_ptr<Buffer> data_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you wisely pointed out that we can get the buffer to be 64-byte aligned, you can make this a std::unique_ptr<uint32_t>
. Sorry for the churn!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AllocateBuffer needs shared_ptr as input param and looks like no easy way to convert shared_ptr to unique_ptr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me remember: why not use MemoryPool::Allocate
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xhochy suggested to use AllocateBuffer so that we don't have to free the buffer manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use MemoryPool::Allocate
and unique_ptr
with a Deleter
, you don't need a shared_ptr
. Generally, you want to avoid shared_ptr
when unique_ptr
will suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nearly all APIs related to memory allocation in Arrow use shared_ptr<Buffer>
(because memory / buffer sharing is fundamental to how the project works)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that the right choice here - what other references do you expect there will be to this buffer other than in a single Bloom filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reimplementing arrow::AllocateBuffer
and adding a custom deleter for the sake of using unique_ptr
seems a bit elaborate to me. It would be more worth your time to add an AllocateBuffer API that returns unique_ptr
. I opened https://issues.apache.org/jira/browse/ARROW-2998
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jbapple-cloudera, I tried to use custom deleter but looks like no easy way to implement it. This is because MemoryPool needs two arguments, pointer and size, to free a buffer. While a custom Deleter needs only one pointer argument. I also tried to use following way:
void FreeBitset(uint8_t* data, uint32_t num_bytes) {
if (data) {
::arrow::MemoryPool *pool = ::arrow::default_memory_pool();
pool->Free(data, num_bytes);
}
}
auto deleter= [&](uint8_t* p) { FreeBitset(p, ???); };
std::unique_ptr<uint8_t[], decltype(deleter)> data_;
But don't know how to pass in the num_bytes argument.
I saw you have assigned the ARROW-2998 to yourself and already submit PR yet, thanks!
src/parquet/bloom_filter.h
Outdated
/// [MINIMUM_BLOOM_FILTER_BYTES, MAXIMUM_BLOOM_FILTER_BYTES], it will be | ||
/// rounded up/down to lower/upper bound if num_bytes is out of range and also | ||
/// will be rounded up to a power of 2. It will use murmur3_x64_128 as its | ||
/// default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: odd line break here
src/parquet/bloom_filter.h
Outdated
/// rounded up/down to lower/upper bound if num_bytes is out of range and also | ||
/// will be rounded up to a power of 2. It will use murmur3_x64_128 as its | ||
/// default | ||
/// hash function and the block-based algorithm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to say the default algorithm, only the hash function. Here and below.
src/parquet/bloom_filter.cc
Outdated
int64_t bytes_available; | ||
const uint8_t* read_buffer = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delay this definition until the first assignment on line 82.
src/parquet/bloom_filter.cc
Outdated
if (static_cast<uint32_t>(bytes_available) != sizeof(uint32_t) || !read_buffer) { | ||
throw ParquetException("Failed to deserialize from input stream"); | ||
} | ||
memcpy(&hash, read_buffer, sizeof(uint32_t)); | ||
if (static_cast<uint32_t>(bytes_available) != sizeof(uint32_t)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicated with the code on line 90. Same for line 107 and 103, respectively.
src/parquet/bloom_filter.cc
Outdated
if (static_cast<uint32_t>(bytes_available) != sizeof(uint32_t)) { | ||
throw ParquetException("Failed to deserialize from input stream"); | ||
} | ||
if (static_cast<Algorithm>(algorithm) != BloomFilter::Algorithm::BLOCK) { | ||
throw ParquetException("Unsupported Bloom filter algorithm"); | ||
} | ||
|
||
return new BloomFilter(input->Read(len, &bytes_available), len); | ||
return new BlockSplitBloomFilter(input->Read(len, &bytes_available), len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please return a value, not a new
-allocated pointer.
I will give this a review when I can. It would be valuable to have bloom filters available in Apache Arrow also, related to our mailing list discussion happening right now https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some comments. This is a lot of virtual APIs for a construct which is designed for hashing a large number of small values in the innermost loop of an encoding or decoding algorithm. I would prefer to see no use of virtual on the hot path. I would even go so far as to recommend passing the hashing algorithm as a template argument to the Bloom filter
src/parquet/column_reader.cc | ||
src/parquet/column_scanner.cc | ||
src/parquet/column_writer.cc | ||
src/parquet/exception.cc | ||
src/parquet/file_reader.cc | ||
src/parquet/file_writer.cc | ||
src/parquet/metadata.cc | ||
src/parquet/murmur3.cc | ||
src/parquet/parquet_constants.cpp | ||
src/parquet/parquet_types.cpp | ||
src/parquet/printer.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this binary data file and can it be generated rather than checked into the repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The binary file is used to ensure cross compatibility between parquet-mr and parquet-cpp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should store binary files someplace else, like in a parquet-compatibility repo. I can request ASF Infra to create such a repository if you think that's a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I agree that's a good idea. Other parquet implementation can benefit from that as well. How long it will take to create such a repo and any dependency? I have no experience on this.
Step 1: Construct a Bloom filter with 1024 bytes of bitset. | ||
Step 2: Insert hashes of "hello", "parquet", "bloom", "filter" strings to Bloom filter | ||
by calling hash and insert APIs. | ||
Step 3: Call writeTo API to write to File. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this is the right place for this. Put this comment in bloom_filter-test.cc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have this doc in bloom_filter-test.cc.
// A Bloom filter is a compact structure to indicate whether an item is not in a set or | ||
// probably in a set. The Bloom filter usually consists of a bit set that represents a | ||
// set of elements, a hash strategy and a Bloom filter algorithm. | ||
class BloomFilter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much all of our hashing and general algorithmic code for parquet-cpp is contained in the arrow/util
directory in Apache Arrow. Maybe this should be contributed there also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can open a jira to track this when we complete this, is that OK?
src/parquet/bloom_filter.cc
Outdated
} | ||
|
||
num_bytes_ = num_bytes; | ||
PARQUET_THROW_NOT_OK(::arrow::AllocateBuffer(pool_, num_bytes_, &data_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm -1 on non-trivial constructors that can throw exceptions. Ideally a static alternate constructor returning arrow::Status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add a Init function to take over actual initialization work.
src/parquet/bloom_filter.cc
Outdated
hash_strategy_(HashStrategy::MURMUR3_X64_128), | ||
algorithm_(Algorithm::BLOCK) { | ||
if (!bitset) { | ||
throw ParquetException("Given bitset is NULL"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a pre-condition? If so use DCHECK
src/parquet/bloom_filter.h
Outdated
public: | ||
// Maximum Bloom filter size, it sets to HDFS default block size 128MB | ||
// This value will be reconsidered when implementing Bloom filter producer. | ||
static constexpr uint32_t MAXIMUM_BLOOM_FILTER_BYTES = 128 * 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use kMaximumBloomFilterBytes
naming style for better consistency with Google style guide
/// @param hash the element to contain. | ||
/// @return false if value is definitely not in set, and true means PROBABLY | ||
/// in set. | ||
virtual bool FindHash(uint64_t hash) const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having all of these virtual functions potentially on the hot path is concerning. What is the rationale for using virtual?
// | ||
// This implementation sets 8 bits in each tiny Bloom filter. Each tiny Bloom | ||
// filter is 32 bytes to take advantage of 32-byte SIMD instructions. | ||
class BlockSplitBloomFilter : public BloomFilter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my comment above, if there is only a single initial implementation of BloomFilter
, putting virtual paths on the hot path seems potentially concerning. There are a number of approaches possible (like CRTP) to avoid virtual while permitting different implementations
src/parquet/bloom_filter.h
Outdated
BlockSplitBloomFilter(const uint8_t* bitset, uint32_t num_bytes); | ||
|
||
// Minimum Bloom filter size, it sets to 32 bytes to fit a tiny Bloom filter. | ||
static constexpr uint32_t MINIMUM_BLOOM_FILTER_BYTES = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment re: name style for constants
/// | ||
/// @param value the value to hash. | ||
/// @return hash result. | ||
virtual uint64_t Hash(int32_t value) const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an approach possible that does not use virtual?
@wesm In answer to your question: yes, there are ways to address this without virtual. I originally suggested it as a way to address the open universe construct of knowing that we have a single filter API and multiple possible BF implementations and hash functions, even though we're starting with just one of each. |
Thanks @wesm and @jbapple-cloudera , I saw the ARROW-2998 already committed. The latest commit cannot be applied directly since parquet-cpp is using arrow repo at Dec, 2017 and it should have many changes from then on. I can create a jira to track |
Yes, @cjjnjust , it's fine with me if we leave I think the largest unsettled issue is the discussion @wesm started about the performance of virtual methods. Wes, I'd like to enable an API where users can write code something like
We could accomplish something similar with templates, but all of the processing and storage would then need to lift the type to a template parameter. I think we'll get a better picture of how onerous this is once we have some non-test use of the filter in parquet-cpp. How would you feel about revisiting that question then? |
I'm cool with tabling the virtual function discussion. I will be able to assist with refactoring for CRTP-ifying things when the time comes. I'd prefer to get the outstanding patches merged quickly and move as expediently as possible toward the Arrow-Parquet monorepo structure under discussion to make these things easier to work on |
OK, LGTM. Wes, since you're a committer, do you want to check this in and then we can work on some follow-on patches, like templatizing, integration with the rest of the codebase, a parquet-binary repo, and so on? |
+1. Merging this, sounds good. I created https://github.com/apache/parquet-testing; @cjjnjust let me know when you want to move the binary files there |
This is first part of bloom filter patch set, which include a bloom filter utility and also some unit tests.
Note that this patch also includes murmur3Hash original code from Austin Appleby. The code isn't formatted as parquet-cpp format.