feat: enable bloom filter index #6639

dantengsky · 2022-07-15T01:50:20Z

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Enable the bloom filter index at the block level (thanks @junli1026!).

- For fields of primitive types, a bloom filter index is built during insertion (or rebuilt during mutations).
- For each block, an index file is generated under the index path (prefixed with _i/).
- The bloom index is used when a point query is detected (currently only the binary operator "=").
  - Only the index data of the column used in the predicate is loaded (and cached if the table cache is enabled).
  - The default maximum size of cached index data is 1G.
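
For readers unfamiliar with the idea, the block-level pruning described above can be sketched roughly as follows. This is a toy illustration only, not the code in this PR: `BlockBloom`, `prune_blocks`, and the double-hashing scheme built on `DefaultHasher` are all assumptions made up for the example, and the real index is stored per column in the `_i/` files.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy per-block bloom filter over string values (hypothetical type, not Databend's).
pub struct BlockBloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl BlockBloom {
    /// Size the filter for `n` distinct values at false positive rate `p`,
    /// using the textbook formulas m = -n ln(p) / ln(2)^2 and k = (m / n) ln(2).
    pub fn with_capacity(n: usize, p: f64) -> Self {
        let ln2 = std::f64::consts::LN_2;
        let n = n.max(1) as f64;
        let raw_bits = (-n * p.ln()) / (ln2 * ln2);
        let num_bits = raw_bits.ceil().max(64.0) as u64;
        let num_hashes = ((num_bits as f64 / n) * ln2).round().max(1.0) as u32;
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Two base hashes; the k probe positions are derived as a + i * b.
    fn hash_pair(value: &str) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        value.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (value, 0x9e37_79b9_7f4a_7c15u64).hash(&mut h2);
        (h1.finish(), h2.finish() | 1)
    }

    /// Record a value seen in this block.
    pub fn insert(&mut self, value: &str) {
        let (a, b) = Self::hash_pair(value);
        for i in 0..self.num_hashes as u64 {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            self.bits[(bit / 64) as usize] |= 1u64 << (bit % 64);
        }
    }

    /// `false` means the value is definitely absent; `true` means "maybe present".
    pub fn might_contain(&self, value: &str) -> bool {
        let (a, b) = Self::hash_pair(value);
        (0..self.num_hashes as u64).all(|i| {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            self.bits[(bit / 64) as usize] & (1u64 << (bit % 64)) != 0
        })
    }
}

/// For a predicate `col = target`, return the indices of blocks that still
/// have to be read; every other block is skipped without touching its data.
pub fn prune_blocks(filters: &[BlockBloom], target: &str) -> Vec<usize> {
    filters
        .iter()
        .enumerate()
        .filter(|(_, f)| f.might_contain(target))
        .map(|(i, _)| i)
        .collect()
}
```

A bloom filter never produces false negatives, so a block whose filter rejects the value can be skipped safely; blocks that pass may still turn out not to contain it (a false positive) and are read as usual.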

Performance

- Read: performance improves as expected when the bloom filter index can be used.
- Write: no significant impact on write performance was found.
- Index size: for a block of 1M rows, about 1.5MB of index per column (on-disk file size), with
  - the false positive rate set to 1%
  - the number of distinct values taken to be the number of rows in the given block (could be optimised later)
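
As a rough cross-check of the index size, the textbook sizing formulas for a bloom filter (m = -n ln(p) / ln(2)^2 bits, k = (m / n) ln(2) hash functions) can be evaluated for n = 1M and p = 1%. The helper names below are made up for this sketch, and the filter implementation used in this PR may size things differently.

```rust
use std::f64::consts::LN_2;

/// Textbook optimal number of bits for `n` distinct values at false positive rate `p`.
pub fn optimal_bits(n: u64, p: f64) -> f64 {
    -(n as f64) * p.ln() / (LN_2 * LN_2)
}

/// Textbook optimal number of hash functions for a filter of `bits` bits holding `n` values.
pub fn optimal_hashes(bits: f64, n: u64) -> u32 {
    ((bits / n as f64) * LN_2).round().max(1.0) as u32
}

/// Filter size in bytes, rounded up.
pub fn optimal_bytes(n: u64, p: f64) -> u64 {
    (optimal_bits(n, p) / 8.0).ceil() as u64
}
```

For `optimal_bytes(1_000_000, 0.01)` this gives roughly 1.2MB (about 9.6 bits per value, 7 hash functions), which is in the same range as the ~1.5MB measured on disk. The remaining gap presumably comes from implementation and file-format details that are not covered here.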

Test scenario

Standalone deployment, with Local FS
Table of 10B rows:
create table t10b as select cast(number as string) as c1, cast(rand() as string) as c2 from numbers(10000000000)

Read:

No Table Meta and Index Cache

without bloom filter index

mysql> select * from t10b where c2 = "0.7826850382733147";
Empty set (1 min 42.89 sec)
Read 10000000000 rows, 411.26 GiB in 102.846 sec., 97.23 million rows/sec., 4.00 GiB/sec.

with bloom filter index

mysql> select * from t10b where c2 = "0.7826850382733147";
Empty set (5.50 sec)
Read 106000000 rows, 4.36 GiB in 1.201 sec., 88.28 million rows/sec., 3.63 GiB/sec.

Table Meta and Index fully cached (a 1B-row table is used in this case so that the index can be fully cached)

without bloom filter index

mysql> select * from t1b where c2 = "0.78268503827331471";
Empty set (10.04 sec)
Read 1000000000 rows, 40.19 GiB in 10.028 sec., 99.72 million rows/sec., 4.01 GiB/sec.

with bloom filter index

mysql> select * from t1b where c2 = "0.78268503827331471";
Empty set (1.11 sec)
Read 7666666 rows, 315.42 MiB in 0.090 sec., 85.24 million rows/sec., 3.42 GiB/sec.
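
The read numbers above can be turned into a scanned fraction and a speedup with a little arithmetic. The helpers below are hypothetical and only restate the figures from the query output; reading the scanned fraction as a false positive rate assumes that the bloom index was the only pruning at work, which this output alone does not confirm.

```rust
/// Fraction of the table's rows that were still scanned after pruning.
pub fn scanned_fraction(rows_read: u64, total_rows: u64) -> f64 {
    rows_read as f64 / total_rows as f64
}

/// Wall-clock speedup of the indexed run over the run without the index.
pub fn speedup(baseline_secs: f64, indexed_secs: f64) -> f64 {
    baseline_secs / indexed_secs
}
```

With the figures above: `scanned_fraction(106_000_000, 10_000_000_000)` is about 1.06% and `scanned_fraction(7_666_666, 1_000_000_000)` is about 0.77%, both close to the configured 1% false positive rate, as one would expect for a query that matches no rows. The speedups are about 18.7x (102.89 s vs 5.50 s) without cache and about 9x (10.04 s vs 1.11 s) with table meta and index cached.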

Write:

without bloom filter index

mysql> create table t10b_no_idx as select cast(number as string) as c1, cast(rand() as string) as c2 from numbers(10000000000);
Query OK, 0 rows affected (30 min 21.37 sec)

with bloom filter index

mysql> create table t10b as select cast(number as string) as c1, cast(rand() as string) as c2 from numbers(10000000000);
Query OK, 0 rows affected (31 min 35.24 sec)
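
To put a number on the write overhead, the two timings above can be compared directly. The function names are made up for this sketch, and a single run per configuration is of course only a rough indication.

```rust
/// Convert a "M min S sec" timing into seconds.
pub fn to_secs(minutes: u64, seconds: f64) -> f64 {
    minutes as f64 * 60.0 + seconds
}

/// Relative slowdown of `with_index` compared to `without_index` (0.05 means 5% slower).
pub fn relative_overhead(without_index: f64, with_index: f64) -> f64 {
    (with_index - without_index) / without_index
}
```

`relative_overhead(to_secs(30, 21.37), to_secs(31, 35.24))` comes out at about 0.04, i.e. building the index added roughly 4% to the total write time in this scenario.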

Storage:

For this test scenario, the index size is about 10% of the data size (index / data ~= 10%).

Fixes #issue

dantengsky · 2022-08-03T06:26:32Z

@youngsofun the bloom filter index is not populated for ResultTables, as it does not seem suitable there. If anything else should be covered, please let me know.

@flaneur2020 the index_size field of table system.tables will no longer be all NULLs: for tables of the fuse engine, the value will be equal to or larger than 0. Hope this will not break things.

Xuanwo · 2022-08-03T06:43:21Z

Looks great so far!

zhang2014 · 2022-08-03T07:04:58Z

How to set bloom filter false positive?

dantengsky · 2022-08-03T08:30:37Z

> How to set bloom filter false positive?

It is hard-coded as 1% for now.

BohuTANG · 2022-08-03T10:24:11Z

Expected: statement query must get result equal to expected
Message: 
 Expected:
enable_async_insert 0 0 SESSION Whether the client open async insert mode, default value: 0 UInt64
enable_bloom_filter_index 0 0 SESSION Enable bloom filter index (if applicable for the underlying table engine) by setting this variable to 1, default value: 0	UInt64
enable_new_processor_framework 1 1 SESSION Enable new processor framework if value != 0, default value: 1 UInt64
enable_planner_v2 1 0 SESSION Enable planner v2 by setting this variable to 1, default value: 0 UInt64
 Actual:
                                                enable_async_insert                                                                  0                                                                  0                                                            SESSION        Whether the client open async insert mode, default value: 0                                                             UInt64
                                     enable_new_processor_framework                                                                  1                                                                  1                                                            SESSION     Enable new processor framework if value != 0, default value: 1                                                             UInt64
                                                  enable_planner_v2                                                                  1                                                                  0                                                            SESSION  Enable planner v2 by setting this variable to 1, default value: 0                                                             UInt64
 Statement:
Parsed Statement
    at_line: 32,
    s_type: Statement: query, type: TTTTTT, query_type: TTTTTT, retry: False,
    suite_name: base/06_show/06_0003_show_settings_v2,
    text:
        SHOW SETTINGS LIKE 'enable%';
    results: [(<re.Match object; span=(0, 4), match='----'>, 39, 'enable_async_insert 0 0 SESSION Whether the client open async insert mode, default value: 0 UInt64\nenable_bloom_filter_index 0 0 SESSION Enable bloom filter index (if applicable for the underlying table engine) by setting this variable to 1, default value: 0\tUInt64\nenable_new_processor_framework 1 1 SESSION Enable new processor framework if value != 0, default value: 1 UInt64\nenable_planner_v2 1 0 SESSION Enable planner v2 by setting this variable to 1, default value: 0 UInt64')],
    runs_on: {'mysql'},
 Start Line: 39, Result Label:

dantengsky changed the title from "feat : enable bloom filter index" to "feat: enable bloom filter index" Jul 15, 2022

mergify bot added the pr-feature label (this PR introduces a new feature to the codebase) Jul 15, 2022

enable bloom filter index

0f346c0

dantengsky force-pushed the feat-bloom-switch-to-external-index branch from cd2b416 to 0f346c0 July 15, 2022 03:29

tweak ut

952e381

dantengsky mentioned this pull request Jul 19, 2022

fix(parquet): support read i96 timestamp from parquet file #6668

Merged

dantengsky added 12 commits July 19, 2022 18:07

rm unused code

86d7ccf

Merge remote-tracking branch 'origin/main' into feat-bloom-switch-to-…

d2a7cd2

…external-index

fix bloom filter compile err

3020d84

fix: replace deprecated parquet_source_builder

f98df53

minor refactor

aa97e9f

Merge remote-tracking branch 'origin/main' into feat-bloom-switch-to-…

e41d62a

…external-index

add instrument

25f2e36

Merge remote-tracking branch 'origin/main' into feat-bloom-switch-to-…

14f0051

…external-index

Merge remote-tracking branch 'origin/main' into feat-bloom-switch-to-…

482b56e

…external-index

Merge remote-tracking branch 'origin/main' into feat-bloom-switch-to-…

3182db5

…external-index

wip: block prunner shortcut

3257e7b

wip: refactoring pruner

72a43d7

dantengsky force-pushed the feat-bloom-switch-to-external-index branch from 561dc0f to 72a43d7 July 25, 2022 14:19

dantengsky added 10 commits July 25, 2022 22:29

Merge remote-tracking branch 'origin/main' into feat-bloom-switch-to-…

60fefd6

…external-index

Merge remote-tracking branch 'origin/main' into feat-bloom-switch-to-…

c5a134c

…external-index

remove needless tokio spawn

3952dc9

refactor: block prunner shortcuts

28e5b96

WIP: lifetime seems ok

fe18bc0

Separated filters

16f4685

move predicate cstors to individual mods

8d8f507

tidy up

02d1724

add setting for toggling bloom filter

11fa723

refacor

ead1fa0

BohuTANG reviewed Aug 2, 2022

common/settings/src/lib.rs (outdated)

dantengsky added 2 commits August 3, 2022 12:47

count bloom filter index cache by bytes

2a80550

add sqlogictest for bloom filter index

94e1696

BohuTANG mentioned this pull request Aug 3, 2022

Release proposal: Nightly v0.8 #4591

Closed

55 tasks

dantengsky added 3 commits August 3, 2022 13:42

tidy up

aaf4be8

remove setting "enable_bloom_filter_index"

f4c06c5

adjust test cases

c3415d7

dantengsky force-pushed the feat-bloom-switch-to-external-index branch from aa0c1fd to c3415d7 August 3, 2022 06:17

dantengsky marked this pull request as ready for review August 3, 2022 06:26

dantengsky requested review from zhyass and youngsofun August 3, 2022 06:27

BohuTANG requested review from zhang2014 and sundy-li August 3, 2022 06:40

zhang2014 reviewed Aug 3, 2022

query/src/storages/fuse/pruning/pruning_executor.rs (outdated)

zhyass reviewed Aug 3, 2022

query/src/storages/fuse/io/read/snapshot_history_reader.rs (outdated)

youngsofun reviewed Aug 3, 2022

tests/logictest/suites/gen/06_show/06_0003_show_settings_v2 (outdated)

dantengsky and others added 5 commits August 3, 2022 16:31

Update query/src/storages/fuse/io/read/snapshot_history_reader.rs

5f5c445

Co-authored-by: Zhyass <mytesla@live.com>

runtime arrangement tweaks

84d053f

make lint

0edf55f

Merge remote-tracking branch 'origin/main' into feat-bloom-switch-to-…

f2e0ecb

…external-index

bring back sqlogic test

89ad8ed

zhang2014 approved these changes Aug 3, 2022

Xuanwo approved these changes Aug 3, 2022

fix logictest cases

8477721

mergify bot merged commit 71b9327 into databendlabs:main Aug 3, 2022
