Split `push_down_filter.slt` into standalone sqllogictest files to reduce long-tail runtime by kosiew · Pull Request #20566 · apache/datafusion

kosiew · 2026-02-26T06:57:16Z

Which issue does this PR close?

Part of Speedup execution of sqllogictests with more parallelization #20524

Rationale for this change

datafusion/sqllogictest/test_files/push_down_filter.slt had grown into a large sqllogictest file. Since the sqllogictest runner parallelizes at file granularity, a single heavyweight file can become a straggler and dominate wall-clock time.

This PR performs a non-invasive split of that file into smaller, self-contained .slt files so the runner can distribute work more evenly across threads, improving overall suite balance without changing SQL semantics or test coverage.

What changes are included in this PR?

Removed the monolithic push_down_filter.slt.
Added new standalone sqllogictest files, each with the minimal setup/teardown required to run independently:
- push_down_filter_unnest.slt — unnest filter pushdown coverage (including struct/field cases).
- push_down_filter_parquet.slt — parquet filter pushdown + limit + cast predicate behavior + dynamic filter pushdown (swapped join inputs).
- push_down_filter_outer_joins.slt — LEFT/RIGHT join and anti-join logical filter pushdown checks.
- push_down_filter_regression.slt — regression coverage for issues Dynamic Filter Pushdown causes JOIN to return incorrect results #17188 and Logical optimizer pushdown_filters rule fails with relatively simple query #17512, plus aggregate dynamic filter pushdown checks.
Updated scratch output paths to be file-scoped (e.g. test_files/scratch/push_down_filter_parquet/...) to reduce the chance of conflicts when tests execute in parallel.
Preserved all original query expectations and explain-plan assertions; changes are organizational only.

Are these changes tested?

Yes, with a python script to compare text blocks in the new slt files vs old single slt file.

python - <<'PY'
import subprocess, re
from collections import defaultdict, deque

repo = '.'
old_spec = '692a7cb67^:datafusion/sqllogictest/test_files/push_down_filter.slt'
new_specs = [
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_outer_joins.slt',
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_parquet.slt',
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_regression.slt',
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_unnest.slt',
]

def git_show(spec):
  return subprocess.check_output(['git', '-C', repo, 'show', spec], text=True)

def normalize_sql(sql):
  s = sql.strip().lower()
  s = re.sub(
    r"test_files/scratch/push_down_filter(?:_[^'\s;)/]+)?[^'\s;)]*",
    "test_files/scratch/__norm__",
    s,
  )
  s = re.sub(r'\s+', ' ', s)
  return s

def blocks(text):
  lines = text.splitlines()
  out = []
  i = 0
  while i < len(lines):
    m = lines[i].strip()
    if m.startswith('query '):
      i += 1
      b = []
      while i < len(lines) and lines[i].strip() != '----':
        if not lines[i].lstrip().startswith('#'):
          b.append(lines[i])
        i += 1
      sql = '\n'.join(b).strip()
      if sql:
        out.append(('query', normalize_sql(sql)))
    elif m.startswith('statement '):
      i += 1
      b = []
      while i < len(lines):
        s = lines[i].strip()
        if s == '':
          break
        if s.startswith('query ') or s.startswith('statement '):
          i -= 1
          break
        if not lines[i].lstrip().startswith('#'):
          b.append(lines[i])
        i += 1
      sql = '\n'.join(b).strip()
      if sql:
        out.append(('statement', normalize_sql(sql)))
    i += 1
  return out

old_blocks = blocks(git_show(old_spec))
new_blocks = []
for s in new_specs:
  new_blocks.extend(blocks(git_show(s)))

q = defaultdict(deque)
for item in new_blocks:
  q[item].append(item)

missing = 0
extra = 0
for item in old_blocks:
  if q[item]:
    q[item].popleft()
  else:
    missing += 1

for v in q.values():
  extra += len(v)

print(f'old_blocks={len(old_blocks)}')
print(f'new_blocks={len(new_blocks)}')
print(f'missing={missing}')
print(f'extra={extra}')
print(f'baseline={old_spec}')
if missing != 0:
  raise SystemExit(1)
PY

Output:

old_blocks=107
new_blocks=108
missing=0
extra=1

The extra(1) is this statement block:

set datafusion.explain.physical_plan_only = true;
Why it shows as extra:

In split files, it appears 3 times:
push_down_filter_parquet.slt:21
push_down_filter_unnest.slt:21
push_down_filter_regression.slt:129
In the baseline monolithic file at e937cad^, it appears 2 times.
So comparison reports 3 - 2 = extra 1.

Are there any user-facing changes?

No user-facing behavior changes. This is a test-suite organization/performance improvement only.

Note before merging

Revert e8369bb (it is a commit to trigger the CI extented tests for sqllogictest)

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

- Implement tests for push down filters in outer joins, ensuring filters are applied correctly based on join conditions. - Introduce tests for push down filters with Parquet files, including scenarios with limits and dynamic filters. - Add regression tests to address specific issues related to filter pushdown, ensuring stability and correctness. - Include tests for unnest operations with filters, verifying that filters are pushed down appropriately based on the context.

kosiew · 2026-02-26T08:15:30Z

Without split

source: https://github.com/apache/datafusion/actions/runs/22431741358/job/64951933427?pr=20567

with split

source:
https://github.com/apache/datafusion/actions/runs/22431461562/job/64951079150?pr=20566

kosiew · 2026-02-26T09:05:05Z

Running
cargo test --test sqllogictests -- --timing-summary top --timing-top-n 10
from #20569, explains the minimal improvement.

without split

Running with 10 test threads (available parallelism: 10)
Per-file elapsed summary (deterministic):                                                              
1.   18.804s  push_down_filter.slt                                                                   
2.    8.099s  joins.slt                                                                              
3.    7.504s  aggregate.slt

with split

Running with 10 test threads (available parallelism: 10)
Per-file elapsed summary (deterministic):                                                              
1.   10.355s  push_down_filter_regression.slt                                                        
2.    6.464s  aggregate.slt                                                                          
3.    6.445s  joins.slt

alamb

Thank you @kosiew -- this seems like a non trivial improvement to me

On my machine, this PR runs in 7 seconds

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests
    Finished `ci` profile [unoptimized] target(s) in 0.20s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 411 test files in 7 seconds

And main runs in 8 seconds

(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion2$ cargo test --profile=ci --test sqllogictests
    Finished `ci` profile [unoptimized] target(s) in 0.20s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 408 test files in 8 seconds

I think part of the reason pushdown_filter_regression is taking so long is that it makes files with 1M rows and then the joins just take a while on debug builds

I don't see any real way to speed that up other than changing what it is doing or moving it to the extended tests somehow

Something else I did that seems to work quite well was to update the scheduling to start long running tests first -- I'll make a follow on PR to improve that too

alamb · 2026-02-26T13:32:11Z

datafusion/sqllogictest/test_files/push_down_filter_regression.slt

+# Sorry about the spam in this slt test...
+
+query III rowsort
+select *


The join doesn't take that long

> select * from t1 join t2 on t1.k = t2.k where v = 1 or v = 10000000 order by t1.k, t2.v; +----------+----------+----------+ | k | k | v | +----------+----------+----------+ | 1 | 1 | 1 | | 10000000 | 10000000 | 10000000 | +----------+----------+----------+ 2 row(s) fetched. Elapsed 0.165 seconds.

alamb · 2026-02-26T13:32:17Z

datafusion/sqllogictest/test_files/push_down_filter_regression.slt

+
+# Regression test for https://github.com/apache/datafusion/issues/17188
+query I
+COPY (select i as k from generate_series(1, 10000000) as t(i))


I ran this query on a debug build of datafusion-cli and creating the data files takes over 4 seconds. The rest of the queries are pretty fast

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/datafusion/sqllogictest$ /Users/andrewlamb/Software/datafusion/target/debug/datafusion-cli DataFusion CLI v52.1.0 > COPY (select i as k from generate_series(1, 10000000) as t(i)) TO 'test_files/scratch/push_down_filter_regression/t1.parquet' STORED AS PARQUET; +----------+ | count | +----------+ | 10000000 | +----------+ 1 row(s) fetched. Elapsed 2.074 seconds. > COPY (select i as k, i as v from generate_series(1, 10000000) as t(i)) TO 'test_files/scratch/push_down_filter_regression/t2.parquet' STORED AS PARQUET; +----------+ | count | +----------+ | 10000000 | +----------+ 1 row(s) fetched. Elapsed 2.124 seconds.

BTW @Tim-53 has been working on this as well here

Improve sqllogicteset speed by creating only a single large file rather than 2 #20586

alamb · 2026-02-26T13:49:52Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

    }
 }

+// trigger ci test


this seems unecessary

Yep, it was on my before-merge todo list.

alamb · 2026-02-26T14:03:29Z

FYI I got the test even faster by scheduling the runs a little more carefully:

Speedup sqllogictests by running long running tests first #20576

This reverts commit e8369bb.

kosiew · 2026-02-27T08:20:05Z

Thanks @alamb for the review.

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 26, 2026

trigger sqllogictest

e8369bb

github-actions bot added the optimizer Optimizer rules label Feb 26, 2026

kosiew changed the title ~~Add SQL logic tests for filter pushdown scenarios~~ Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime Feb 26, 2026

kosiew mentioned this pull request Feb 26, 2026

TEST - for comparison vs #20566 #20567

Closed

kosiew marked this pull request as ready for review February 26, 2026 09:24

alamb approved these changes Feb 26, 2026

View reviewed changes

alamb reviewed Feb 26, 2026

View reviewed changes

alamb mentioned this pull request Feb 26, 2026

Speedup sqllogictests by running long running tests first #20576

Open

Revert "trigger sqllogictest"

7741d13

This reverts commit e8369bb.

github-actions bot removed the optimizer Optimizer rules label Feb 27, 2026

kosiew added this pull request to the merge queue Feb 27, 2026

Merged via the queue into apache:main with commit bc600b3 Feb 27, 2026
28 checks passed

alamb mentioned this pull request Feb 27, 2026

Improve sqllogicteset speed by creating only a single large file rather than 2 #20586

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split `push_down_filter.slt` into standalone sqllogictest files to reduce long-tail runtime#20566

Split `push_down_filter.slt` into standalone sqllogictest files to reduce long-tail runtime#20566
kosiew merged 3 commits intoapache:mainfrom
kosiew:split-push-down-filter-slt-20524

kosiew commented Feb 26, 2026 •

edited

Loading

Uh oh!

kosiew commented Feb 26, 2026 •

edited

Loading

Uh oh!

kosiew commented Feb 26, 2026 •

edited

Loading

Uh oh!

alamb left a comment

Uh oh!

alamb Feb 26, 2026

Uh oh!

alamb Feb 26, 2026

Uh oh!

alamb Feb 27, 2026

Uh oh!

alamb Feb 26, 2026

Uh oh!

kosiew Feb 27, 2026

Uh oh!

alamb commented Feb 26, 2026

Uh oh!

kosiew commented Feb 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kosiew commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Note before merging

LLM-generated code disclosure

Uh oh!

kosiew commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Without split

with split

Uh oh!

kosiew commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

without split

with split

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

alamb Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

alamb Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

alamb Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

kosiew Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

alamb commented Feb 26, 2026

Uh oh!

kosiew commented Feb 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kosiew commented Feb 26, 2026 •

edited

Loading

kosiew commented Feb 26, 2026 •

edited

Loading

kosiew commented Feb 26, 2026 •

edited

Loading