-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Generate In-Clause filters from hash joins #14864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ning the full column
…ng OR filters in the dynamic filter pushdown
After looking at #14864, I realized we aren't handling optional filters in arrow scans very well. We filter them out before transform filter, when they should be filtered out in transform filter. Since InFilters are pushed as children of optional filters, we should not check them in Transform Filter. The switch should be a white-list for what is allowed, and the default case should throw at every other filter type. Also added tests to catch In filter pushdown. Both from `IN` and from joins.
|
Small question here! My ability to read C++ is still limited... ;-) Does this PR enable OR filter pushdown for non-integer columns like strings and floats? Or is it integers only at the moment? Thanks! This will be huge! |
|
Only integers, but there's no real reason it can't work for strings |
|
I think it would be great if the I have included the generic the EXPLAIN ANALYZE results for the same query I was running below. |
|
They are now generated for strings since #15219 |
Generate In-Clause filters from hash joins (duckdb/duckdb#14864)
Generate In-Clause filters from hash joins (duckdb/duckdb#14864) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>

Follow-up from #12908
This PR extends the join filters that get generated by hash joins to include an
INfilter when the hash table is small enough. Rather than generating only amin-maxfilter, we generate anINfilter with all values in the hash table. This can greatly improve performance over themin-maxfilter when there are few values that are far apart. For example, if the hash table contains the values1and100000, theINfilter might be much more effective as we can prune many more row groups.The threshold for which we generate an
INfilter is determined by thedynamic_or_filter_thresholdsetting, and defaults to 50 rows. TheINfilter is pushed as anOPTIONAL_FILTER, which is currently only evaluated for zone-map pruning. As a result, the performance impact of pushing this filter is minimal, while the performance improvement from extra zone-map pruning can be significant.Benchmark
Below is a benchmark that we run over TPC-H SF10 in which we run a join that is generated through an
INclause - we join on theminandmaxvalues ofl_orderkey(meaning the generated min/max filters will not be effective in pruning rows).