add bloom filter index #1738
Besides the implementation, I have several questions about the design:
- Why do we store one BF per column per row block (1024 rows)? This seems to have fairly high storage and search overhead. For example, say a column contains 10M rows and thus about 10K blocks. Under the default fpp, which is 5%, each BF is around 800 bytes, so the total size of the BF index is ~8MB, which is very large (for comparison, the uncompressed size of an int column with 10M rows is 40MB). To evaluate an "in" predicate with `n` values, 10K BFs need to be tested against `n` values, which could also take a long time.
- The size of a BF depends on `expected_entries`, which is the number of unique values that will be inserted. Currently `expected_entries` is set to num_rows_per_block, which is too large for a row block that contains many repeated values. How about allocating one BF per k distinct values instead of per k rows? We can use the BF we're building to de-duplicate inputs.
- The classic BF (the one Doris currently uses) is known to be cache-unfriendly. Have you considered BlockedBloomFilter, which is used by Parquet and RocksDB? For more information on BlockedBloomFilter, see https://github.com/apache/parquet-format/blob/master/BloomFilter.md
Good suggestion! Thanks.
b2eb947 to 853213e
if (_num_inserted > _block_size) {
    int flush_round = _num_inserted / _block_size;
    for (int i = 0; i < flush_round; ++i) {
        flush();
Could this flush an empty block?
Slice _data;
size_t _block_num;
uint32_t _expected_num;
std::vector<std::shared_ptr<BloomFilter>> _bloom_filters;
Why shared_ptr?
853213e to 167f9f1
Add bloom filter index to prune data when loading segment