
add bloom filter index #1738

Closed
wants to merge 7 commits into from

Conversation

kangpinghuang
Contributor

Add a bloom filter index to prune data when loading a segment.

Review threads:
be/src/olap/types.h
be/src/olap/bloom_filter.hpp
be/src/olap/rowset/segment_v2/bloom_filter_page.cpp (two outdated threads)
Contributor

@gaodayue gaodayue left a comment

Besides implementation, I have several questions about the design

  1. Why do we choose to store one BF per column per row block (1024 rows)? This seems to have fairly high storage and search overhead. For example, say a column contains 10M rows and thus 10K blocks. Under the default fpp of 5%, each BF is around 800 bytes, so the total size of this BF index is ~8MB, which is very large (for comparison, the uncompressed size of an int column with 10M rows is 40MB). To evaluate an "in" predicate with n values, 10K BFs need to be tested against n values, which could also take a long time.
  2. The size of a BF depends on expected_entries, the number of unique values that will be inserted. Currently expected_entries is set to num_rows_per_block, which is too large for a row block that contains many repeated values. How about allocating one BF per k distinct values instead of per k rows? We can use the BF we're building to de-duplicate inputs.
  3. The classic BF (the one Doris currently uses) is known to be cache-unfriendly. Have you considered the BlockedBloomFilter used by Parquet and RocksDB? For more information on BlockedBloomFilter, you can refer to https://github.com/apache/parquet-format/blob/master/BloomFilter.md

be/src/olap/rowset/segment_v2/binary_plain_page.h (outdated)
be/src/olap/rowset/segment_v2/segment_writer.cpp (outdated)
be/src/olap/rowset/segment_v2/column_writer.h (two outdated threads)
be/src/olap/rowset/segment_v2/bloom_filter_page.cpp (outdated)
gensrc/proto/segment_v2.proto
@kangpinghuang
Contributor Author

> Besides implementation, I have several questions about the design […]

Good suggestion! Thanks.

if (_num_inserted > _block_size) {
    int flush_round = _num_inserted / _block_size;
    for (int i = 0; i < flush_round; ++i) {
        flush();
Contributor


Won't this flush empty blocks?

Slice _data;
size_t _block_num;
uint32_t _expected_num;
std::vector<std::shared_ptr<BloomFilter>> _bloom_filters;
Contributor


Why shared_ptr?

SWJTU-ZhangLei pushed a commit to SWJTU-ZhangLei/incubator-doris that referenced this pull request Jul 25, 2023
3 participants