
add bloom filter index #1738

Closed
wants to merge 7 commits into from

Conversation

kangpinghuang
Contributor

Add a bloom filter index to prune data when loading a segment.

Review threads:
be/src/olap/types.h
be/src/olap/bloom_filter.hpp
be/src/olap/rowset/segment_v2/bloom_filter_page.cpp (two outdated threads)
Contributor

@gaodayue gaodayue left a comment

Besides implementation, I have several questions about the design

  1. Why do we choose to store one BF per column per row block (1024 rows)? This seems to have fairly high storage and search overhead. For example, say a column contains 10M rows and thus 10K blocks. Under the default fpp of 5%, each BF is around 800 bytes, so the total size of this BF index is ~8MB, which is very large (for comparison, the uncompressed size of an int column with 10M rows is 40MB). To evaluate an "in" predicate with n values, 10K BFs need to be tested against n values, which could also take a long time.
  2. The size of a BF depends on expected_entries, the number of unique values that will be inserted. Currently expected_entries is set to num_rows_per_block, which is too large for a row block that contains many repeated values. How about allocating one BF per k distinct values instead of per k rows? We can use the BF we're building to de-duplicate inputs.
  3. The classic BF (the one Doris currently uses) is known to be cache-unfriendly. Have you considered the BlockedBloomFilter used by Parquet and RocksDB? For more information on BlockedBloomFilter, you can refer to https://github.com/apache/parquet-format/blob/master/BloomFilter.md

be/src/olap/rowset/segment_v2/binary_plain_page.h (outdated)
be/src/olap/rowset/segment_v2/segment_writer.cpp (outdated)
be/src/olap/rowset/segment_v2/column_writer.h (two outdated threads)
be/src/olap/rowset/segment_v2/bloom_filter_page.cpp (outdated)
gensrc/proto/segment_v2.proto
@kangpinghuang
Contributor Author

> Besides implementation, I have several questions about the design […]

Good suggestion! Thanks.

if (_num_inserted > _block_size) {
    int flush_round = _num_inserted / _block_size;
    for (int i = 0; i < flush_round; ++i) {
        flush();
Contributor


Won't this flush empty blocks?

Slice _data;
size_t _block_num;
uint32_t _expected_num;
std::vector<std::shared_ptr<BloomFilter>> _bloom_filters;
Contributor


Why shared_ptr?

SWJTU-ZhangLei pushed a commit to SWJTU-ZhangLei/incubator-doris that referenced this pull request Jul 25, 2023
3 participants