Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime #20566
Conversation
- Implement tests for push down filters in outer joins, ensuring filters are applied correctly based on join conditions.
- Introduce tests for push down filters with Parquet files, including scenarios with limits and dynamic filters.
- Add regression tests to address specific issues related to filter pushdown, ensuring stability and correctness.
- Include tests for unnest operations with filters, verifying that filters are pushed down appropriately based on the context.
(Screenshots: runtime without split vs. with split)
alamb left a comment
Thank you @kosiew -- this seems like a non-trivial improvement to me.
On my machine, this PR runs in 7 seconds
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests
Finished `ci` profile [unoptimized] target(s) in 0.20s
Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 411 test files in 7 seconds

And main runs in 8 seconds:
(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion2$ cargo test --profile=ci --test sqllogictests
Finished `ci` profile [unoptimized] target(s) in 0.20s
Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 408 test files in 8 seconds

I think part of the reason pushdown_filter_regression is taking so long is that it makes files with 1M rows and then the joins just take a while on debug builds.
I don't see any real way to speed that up other than changing what it is doing or moving it to the extended tests somehow
Something else I did that seems to work quite well was to update the scheduling to start long running tests first -- I'll make a follow on PR to improve that too
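The scheduling idea above (start long-running test files first) can be pictured with a small simulation. This is a minimal sketch, not DataFusion's actual runner; the file runtimes, worker count, and the `makespan` helper are all illustrative assumptions. It models a pool of workers pulling files in a given order and shows why putting the straggler first shortens the overall wall-clock time.

```python
# Sketch: longest-first scheduling vs. arbitrary order for a worker pool.
# Runtimes and worker count are illustrative, not real measurements.
import heapq


def makespan(durations, workers):
    """Simulate workers pulling files in the given order; return wall-clock time."""
    finish = [0.0] * workers  # min-heap of per-worker finish times
    heapq.heapify(finish)
    for d in durations:
        t = heapq.heappop(finish)  # next free worker takes the next file
        heapq.heappush(finish, t + d)
    return max(finish)


files = [1.0] * 30 + [8.0]  # 30 quick files plus one straggler

arbitrary = makespan(files, 4)                           # straggler last
longest_first = makespan(sorted(files, reverse=True), 4)  # straggler first
print(arbitrary, longest_first)
```

With the straggler scheduled last, one worker is still busy long after the others drain the queue; scheduling it first lets the short files fill in around it.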
# Sorry about the spam in this slt test...

query III rowsort
select *
The join doesn't take that long
> select *
from t1
join t2 on t1.k = t2.k
where v = 1 or v = 10000000
order by t1.k, t2.v;
+----------+----------+----------+
| k | k | v |
+----------+----------+----------+
| 1 | 1 | 1 |
| 10000000 | 10000000 | 10000000 |
+----------+----------+----------+
2 row(s) fetched.
Elapsed 0.165 seconds.
# Regression test for https://github.com/apache/datafusion/issues/17188
query I
COPY (select i as k from generate_series(1, 10000000) as t(i))
I ran this query on a debug build of datafusion-cli and creating the data files takes over 4 seconds. The rest of the queries are pretty fast
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/datafusion/sqllogictest$ /Users/andrewlamb/Software/datafusion/target/debug/datafusion-cli
DataFusion CLI v52.1.0
> COPY (select i as k from generate_series(1, 10000000) as t(i))
TO 'test_files/scratch/push_down_filter_regression/t1.parquet'
STORED AS PARQUET;
+----------+
| count |
+----------+
| 10000000 |
+----------+
1 row(s) fetched.
Elapsed 2.074 seconds.
> COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
STORED AS PARQUET;
+----------+
| count |
+----------+
| 10000000 |
+----------+
1 row(s) fetched.
Elapsed 2.124 seconds.
BTW @Tim-53 has been working on this as well here
}
}

// trigger ci test
FYI I got the test even faster by scheduling the runs a little more carefully:
This reverts commit e8369bb.
Thanks @alamb for the review.



Which issue does this PR close?
Rationale for this change
datafusion/sqllogictest/test_files/push_down_filter.slt had grown into a large sqllogictest file. Since the sqllogictest runner parallelizes at file granularity, a single heavyweight file can become a straggler and dominate wall-clock time. This PR performs a non-invasive split of that file into smaller, self-contained .slt files so the runner can distribute work more evenly across threads, improving overall suite balance without changing SQL semantics or test coverage.

What changes are included in this PR?
Removed the monolithic push_down_filter.slt.

Added new standalone sqllogictest files, each with the minimal setup/teardown required to run independently:
- push_down_filter_unnest.slt — unnest filter pushdown coverage (including struct/field cases).
- push_down_filter_parquet.slt — parquet filter pushdown + limit + cast predicate behavior + dynamic filter pushdown (swapped join inputs).
- push_down_filter_outer_joins.slt — LEFT/RIGHT join and anti-join logical filter pushdown checks.
- push_down_filter_regression.slt — regression coverage for "Dynamic Filter Pushdown causes JOIN to return incorrect results" #17188 and "Logical optimizer pushdown_filters rule fails with relatively simple query" #17512, plus aggregate dynamic filter pushdown checks.

Updated scratch output paths to be file-scoped (e.g. test_files/scratch/push_down_filter_parquet/...) to reduce the chance of conflicts when tests execute in parallel.

Preserved all original query expectations and explain-plan assertions; changes are organizational only.
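The mechanical part of such a split can be sketched in a few lines. This is a hypothetical illustration only — the marker convention (`# section: <name>`) and the `split_slt` helper are made up for the example, and the real PR also gives each new file its own setup/teardown, which this sketch does not attempt.

```python
# Hypothetical sketch: carve a monolithic .slt file into per-topic chunks
# based on marker comments. The marker convention is an assumption made up
# for this illustration; the actual split also adds per-file setup/teardown.
def split_slt(text, default="misc"):
    """Group lines under '# section: <name>' markers into separate chunks."""
    files = {}
    current = default
    for line in text.splitlines():
        if line.startswith("# section: "):
            current = line.removeprefix("# section: ").strip()
            continue  # marker lines are not emitted into any chunk
        files.setdefault(current, []).append(line)
    return {name: "\n".join(lines) + "\n" for name, lines in files.items()}


monolith = """\
# section: unnest
query I
select 1
# section: parquet
statement ok
set datafusion.explain.physical_plan_only = true;
"""
parts = split_slt(monolith)
print(sorted(parts))
```

Each resulting chunk would then be written out as its own `.slt` file so the runner can schedule them independently.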
Are these changes tested?
Yes, with a Python script that compares text blocks in the new slt files against the old single slt file.
Output:
The extra (1) is this statement block:
set datafusion.explain.physical_plan_only = true;
Why it shows as extra:
In split files, it appears 3 times:
push_down_filter_parquet.slt:21
push_down_filter_unnest.slt:21
push_down_filter_regression.slt:129
In the baseline monolithic file at e937cad^, it appears 2 times.
So comparison reports 3 - 2 = extra 1.
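The block-comparison approach described above can be sketched roughly as follows. This is a minimal reconstruction, not the actual script from the PR: the `blocks`/`compare` helpers and the sample inputs are assumptions, and it treats an slt file simply as a multiset of blank-line-separated blocks.

```python
# Hypothetical sketch of the verification: count blank-line-separated blocks
# in the old monolithic file and in the new split files, then diff the
# multisets. Helper names and sample inputs are assumptions for illustration.
from collections import Counter


def blocks(text):
    """Split slt content into blank-line-separated blocks."""
    return [b.strip() for b in text.split("\n\n") if b.strip()]


def compare(old_text, new_texts):
    old = Counter(blocks(old_text))
    new = Counter()
    for t in new_texts:
        new.update(blocks(t))
    # Counter subtraction keeps only positive counts on each side.
    return new - old, old - new  # (extra in split files, missing from them)


old = "statement ok\nset x = 1;\n\nquery I\nselect 1\n"
new = [
    "statement ok\nset x = 1;\n\nquery I\nselect 1\n",
    "statement ok\nset x = 1;\n",  # setup block repeated in a second file
]
extra, missing = compare(old, new)
print(dict(extra), dict(missing))
```

A duplicated per-file setup block shows up as "extra" with a positive count, exactly like the `set datafusion.explain.physical_plan_only = true;` case reported above, while an empty "missing" side confirms no test content was lost.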
Are there any user-facing changes?
No user-facing behavior changes. This is a test-suite organization/performance improvement only.
Note before merging
Revert e8369bb (a commit to trigger the CI extended tests for sqllogictest).
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.