refactor: runtime filter #13842

Merged
merged 23 commits into from Dec 11, 2023

xudong963
Member

@xudong963 xudong963 commented Nov 29, 2023

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Intro:
Runtime filtering adaptively derives new predicates at runtime and applies them to the probe side of a join to improve performance.

The predicates generated at runtime are pushed down through the processor to the table scan on the probe side, where they are used for pruning, which improves performance significantly.
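To make the idea concrete, here is a minimal sketch of an inlist runtime filter, not the code from this PR. It assumes the probe-side scan can see per-block min/max statistics; `BlockStats`, `build_inlist`, and `prune_blocks` are hypothetical names introduced only for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical min/max statistics for one block of the probe-side table.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockStats {
    pub block_id: usize,
    pub min: i64,
    pub max: i64,
}

/// Collect the distinct join keys seen on the build side into an inlist.
pub fn build_inlist(build_keys: &[i64]) -> HashSet<i64> {
    build_keys.iter().copied().collect()
}

/// Keep only the blocks whose [min, max] range contains at least one
/// inlist value; every other block cannot produce a join match and
/// can be skipped before any data is read.
pub fn prune_blocks(blocks: &[BlockStats], inlist: &HashSet<i64>) -> Vec<BlockStats> {
    blocks
        .iter()
        .filter(|b| inlist.iter().any(|v| b.min <= *v && *v <= b.max))
        .cloned()
        .collect()
}
```

With build keys `[5, 5, 120]` and three blocks covering 0..=9, 10..=99, and 100..=199, only the first and third blocks survive pruning. The sketch scans the whole inlist for every block, which is only reasonable because the inlist is kept small.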

Simple benchmark
A simple example that is well suited to runtime filtering:

select * from t1 join t2 on t1.a = t2.b;
t1: 1_000_000_000
t2: 9999 (inlist runtime filter will not be generated for >10k)

cluster:

before: 0.04 s
runtime filter: 0.007 s

single node:

before: 0.025 s
runtime filter: 0.0046 s

Adaptive:

  1. Currently only an inlist filter is generated, and only if the total data size of the build side is less than 10k. A very large inlist is inefficient, so we may consider supporting a bloom filter for data larger than 10k in the future.
  2. For clusters, only broadcast join supports the runtime filter, because the data size gap between the build and probe sides of a broadcast join is relatively large, so better filtering may be achieved.
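The two rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not the PR's actual logic: `JoinDistribution`, `INLIST_THRESHOLD`, and `should_build_inlist` are hypothetical names, `Shuffle` stands in for any non-broadcast cluster join, and "10k" is read as 10_000 without committing to a unit.

```rust
/// Hypothetical description of how a join is executed.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JoinDistribution {
    SingleNode,
    Broadcast,
    /// Stand-in for any non-broadcast join in a cluster.
    Shuffle,
}

/// Assumed reading of the "10k" limit from the description above.
pub const INLIST_THRESHOLD: usize = 10_000;

/// Decide whether an inlist runtime filter should be generated.
pub fn should_build_inlist(dist: JoinDistribution, build_size: usize) -> bool {
    // In a cluster, only broadcast joins are eligible.
    let supported = matches!(
        dist,
        JoinDistribution::SingleNode | JoinDistribution::Broadcast
    );
    // The build side must stay strictly below the threshold.
    supported && build_size < INLIST_THRESHOLD
}
```

Under these assumptions, the benchmark's build side of 9999 qualifies, while a build side of exactly 10_000 does not.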

Others:
Runtime filters are saved in QueryCtx in a HashMap whose key is the table index and whose value is the list of filters for that table. Before a table starts reading data, it fetches its filters from the ctx by table index and uses them for pruning.
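A rough sketch of that lookup structure follows. It is not the real QueryCtx: `QueryCtxSketch`, `RuntimeFilter`, `TableIndex`, and both method names are hypothetical, the filter is a plain string standing in for an expression, and the locking that a context shared across pipeline threads would need is left out.

```rust
use std::collections::HashMap;

/// Hypothetical alias for the index that identifies a table in the plan.
pub type TableIndex = usize;

/// Stand-in for a filter expression; a real filter would be an expression tree.
#[derive(Debug, Clone, PartialEq)]
pub struct RuntimeFilter(pub String);

/// Simplified context holding the runtime filters of one query.
#[derive(Debug, Default)]
pub struct QueryCtxSketch {
    runtime_filters: HashMap<TableIndex, Vec<RuntimeFilter>>,
}

impl QueryCtxSketch {
    /// Called by the build side once a filter has been derived.
    pub fn add_runtime_filter(&mut self, table: TableIndex, filter: RuntimeFilter) {
        self.runtime_filters.entry(table).or_default().push(filter);
    }

    /// Called by the table scan before reading; empty if no filter exists.
    pub fn runtime_filters_for(&self, table: TableIndex) -> &[RuntimeFilter] {
        self.runtime_filters
            .get(&table)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Returning an empty slice for unknown tables lets the scan treat "no runtime filter" and "filter not generated" the same way, so pruning simply becomes a no-op.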

  • Closes #issue

@xudong963 xudong963 marked this pull request as draft November 29, 2023 04:40
@github-actions github-actions bot added the pr-refactor label (this PR changes the code base without new features or bugfix) Nov 29, 2023
@xudong963 xudong963 force-pushed the refactor_runtime_filter branch 2 times, most recently from f616257 to b23bf8a Compare December 1, 2023 16:02
@xudong963 xudong963 added the ci-benchmark label (Benchmark: run all test) Dec 5, 2023
Contributor

github-actions bot commented Dec 5, 2023

Docker Image for PR

  • tag: pr-13842-0b7053e

note: this image tag is only available for internal use,
please check the internal doc for more details.

@xudong963 xudong963 force-pushed the refactor_runtime_filter branch 7 times, most recently from 06ac33c to ae6e2ca Compare December 7, 2023 16:13
@xudong963 xudong963 marked this pull request as ready for review December 8, 2023 03:23
@Dousir9
Member

Dousir9 commented Dec 8, 2023

How was the test data for t1 and t2 generated? Was it numbers(1_000_000_000)?

@xudong963
Member Author

How was the test data for t1 and t2 generated? Was it numbers(1_000_000_000)?

yeah

@BohuTANG BohuTANG added the ci-benchmark label (Benchmark: run all test) and removed the ci-benchmark label (Benchmark: run all test) Dec 8, 2023
Contributor

github-actions bot commented Dec 8, 2023

Docker Image for PR

  • tag: pr-13842-faef3fd

note: this image tag is only available for internal use,
please check the internal doc for more details.

@xudong963 xudong963 marked this pull request as draft December 8, 2023 14:49
Contributor

github-actions bot commented Dec 9, 2023

Docker Image for PR

  • tag: pr-13842-61e0906

note: this image tag is only available for internal use,
please check the internal doc for more details.

part: &PartInfoPtr,
filters: &Vec<Expr<String>>,
func_ctx: &FunctionContext,
) -> Result<bool> {
Member

Do we need to add the runtime filter stats to explain? The stats now:

├── pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 755 to 755, bloom pruning: 0 to 0>]

Member Author

The current stats are collected before the pipeline runs, but runtime filter stats are generated while the pipeline is running. Maybe we can try to add runtime filter stats to the query profile later.

Member Author

How about adding runtime_filter-related stats to the query log? @BohuTANG

@xudong963 xudong963 added this pull request to the merge queue Dec 11, 2023
@Dousir9 Dousir9 removed this pull request from the merge queue due to a manual request Dec 11, 2023
@Dousir9
Member

Dousir9 commented Dec 11, 2023

The rest LGTM!

@Dousir9 Dousir9 added this pull request to the merge queue Dec 11, 2023
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Dec 11, 2023
@BohuTANG BohuTANG merged commit 4b94823 into datafuselabs:main Dec 11, 2023
68 checks passed
@xudong963 xudong963 deleted the refactor_runtime_filter branch December 11, 2023 05:39