Skip to content

Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime#20566

Merged
kosiew merged 3 commits intoapache:mainfrom
kosiew:split-push-down-filter-slt-20524
Feb 27, 2026
Merged

Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime#20566
kosiew merged 3 commits intoapache:mainfrom
kosiew:split-push-down-filter-slt-20524

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Feb 26, 2026

Which issue does this PR close?

Rationale for this change

datafusion/sqllogictest/test_files/push_down_filter.slt had grown into a large sqllogictest file. Since the sqllogictest runner parallelizes at file granularity, a single heavyweight file can become a straggler and dominate wall-clock time.

This PR performs a non-invasive split of that file into smaller, self-contained .slt files so the runner can distribute work more evenly across threads, improving overall suite balance without changing SQL semantics or test coverage.

What changes are included in this PR?

  • Removed the monolithic push_down_filter.slt.

  • Added new standalone sqllogictest files, each with the minimal setup/teardown required to run independently:

  • Updated scratch output paths to be file-scoped (e.g. test_files/scratch/push_down_filter_parquet/...) to reduce the chance of conflicts when tests execute in parallel.

  • Preserved all original query expectations and explain-plan assertions; changes are organizational only.

Are these changes tested?

Yes, with a python script to compare text blocks in the new slt files vs old single slt file.

python - <<'PY'
import subprocess, re
from collections import defaultdict, deque

repo = '.'
old_spec = '692a7cb67^:datafusion/sqllogictest/test_files/push_down_filter.slt'
new_specs = [
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_outer_joins.slt',
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_parquet.slt',
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_regression.slt',
  'HEAD:datafusion/sqllogictest/test_files/push_down_filter_unnest.slt',
]

def git_show(spec):
  return subprocess.check_output(['git', '-C', repo, 'show', spec], text=True)

def normalize_sql(sql):
  s = sql.strip().lower()
  s = re.sub(
    r"test_files/scratch/push_down_filter(?:_[^'\s;)/]+)?[^'\s;)]*",
    "test_files/scratch/__norm__",
    s,
  )
  s = re.sub(r'\s+', ' ', s)
  return s

def blocks(text):
  lines = text.splitlines()
  out = []
  i = 0
  while i < len(lines):
    m = lines[i].strip()
    if m.startswith('query '):
      i += 1
      b = []
      while i < len(lines) and lines[i].strip() != '----':
        if not lines[i].lstrip().startswith('#'):
          b.append(lines[i])
        i += 1
      sql = '\n'.join(b).strip()
      if sql:
        out.append(('query', normalize_sql(sql)))
    elif m.startswith('statement '):
      i += 1
      b = []
      while i < len(lines):
        s = lines[i].strip()
        if s == '':
          break
        if s.startswith('query ') or s.startswith('statement '):
          i -= 1
          break
        if not lines[i].lstrip().startswith('#'):
          b.append(lines[i])
        i += 1
      sql = '\n'.join(b).strip()
      if sql:
        out.append(('statement', normalize_sql(sql)))
    i += 1
  return out

old_blocks = blocks(git_show(old_spec))
new_blocks = []
for s in new_specs:
  new_blocks.extend(blocks(git_show(s)))

q = defaultdict(deque)
for item in new_blocks:
  q[item].append(item)

missing = 0
extra = 0
for item in old_blocks:
  if q[item]:
    q[item].popleft()
  else:
    missing += 1

for v in q.values():
  extra += len(v)

print(f'old_blocks={len(old_blocks)}')
print(f'new_blocks={len(new_blocks)}')
print(f'missing={missing}')
print(f'extra={extra}')
print(f'baseline={old_spec}')
if missing != 0:
  raise SystemExit(1)
PY

Output:

old_blocks=107
new_blocks=108
missing=0
extra=1

The extra(1) is this statement block:

set datafusion.explain.physical_plan_only = true;
Why it shows as extra:

In split files, it appears 3 times:
push_down_filter_parquet.slt:21
push_down_filter_unnest.slt:21
push_down_filter_regression.slt:129
In the baseline monolithic file at e937cad^, it appears 2 times.
So comparison reports 3 - 2 = extra 1.

Are there any user-facing changes?

No user-facing behavior changes. This is a test-suite organization/performance improvement only.

Note before merging

Revert e8369bb (it is a commit to trigger the CI extented tests for sqllogictest)

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

- Implement tests for push down filters in outer joins, ensuring filters are applied correctly based on join conditions.
- Introduce tests for push down filters with Parquet files, including scenarios with limits and dynamic filters.
- Add regression tests to address specific issues related to filter pushdown, ensuring stability and correctness.
- Include tests for unnest operations with filters, verifying that filters are pushed down appropriately based on the context.
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 26, 2026
@github-actions github-actions bot added the optimizer Optimizer rules label Feb 26, 2026
@kosiew kosiew changed the title Add SQL logic tests for filter pushdown scenarios Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime Feb 26, 2026
@kosiew
Copy link
Contributor Author

kosiew commented Feb 26, 2026

Without split

image

source: https://github.com/apache/datafusion/actions/runs/22431741358/job/64951933427?pr=20567

with split

image

source:
https://github.com/apache/datafusion/actions/runs/22431461562/job/64951079150?pr=20566

@kosiew
Copy link
Contributor Author

kosiew commented Feb 26, 2026

Running
cargo test --test sqllogictests -- --timing-summary top --timing-top-n 10
from #20569, explains the minimal improvement.

without split

Running with 10 test threads (available parallelism: 10)
Per-file elapsed summary (deterministic):                                                              
1.   18.804s  push_down_filter.slt                                                                   
2.    8.099s  joins.slt                                                                              
3.    7.504s  aggregate.slt 

with split

Running with 10 test threads (available parallelism: 10)
Per-file elapsed summary (deterministic):                                                              
1.   10.355s  push_down_filter_regression.slt                                                        
2.    6.464s  aggregate.slt                                                                          
3.    6.445s  joins.slt 

@kosiew kosiew marked this pull request as ready for review February 26, 2026 09:24
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kosiew -- this seems like a non trivial improvement to me

On my machine, this PR runs in 7 seconds

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests
    Finished `ci` profile [unoptimized] target(s) in 0.20s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 411 test files in 7 seconds

And main runs in 8 seconds

(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion2$ cargo test --profile=ci --test sqllogictests
    Finished `ci` profile [unoptimized] target(s) in 0.20s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 408 test files in 8 seconds

I think part of the reason pushdown_filter_regression is taking so long is that it makes files with 1M rows and then the joins just take a while on debug builds

I don't see any real way to speed that up other than changing what it is doing or moving it to the extended tests somehow

Something else I did that seems to work quite well was to update the scheduling to start long running tests first -- I'll make a follow on PR to improve that too

# Sorry about the spam in this slt test...

query III rowsort
select *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The join doesn't take that long

> select *
from t1
join t2 on t1.k = t2.k
where v = 1 or v = 10000000
order by t1.k, t2.v;
+----------+----------+----------+
| k        | k        | v        |
+----------+----------+----------+
| 1        | 1        | 1        |
| 10000000 | 10000000 | 10000000 |
+----------+----------+----------+
2 row(s) fetched.
Elapsed 0.165 seconds.


# Regression test for https://github.com/apache/datafusion/issues/17188
query I
COPY (select i as k from generate_series(1, 10000000) as t(i))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this query on a debug build of datafusion-cli and creating the data files takes over 4 seconds. The rest of the queries are pretty fast

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/datafusion/sqllogictest$ /Users/andrewlamb/Software/datafusion/target/debug/datafusion-cli
DataFusion CLI v52.1.0
> COPY (select i as k from generate_series(1, 10000000) as t(i))
TO 'test_files/scratch/push_down_filter_regression/t1.parquet'
STORED AS PARQUET;
+----------+
| count    |
+----------+
| 10000000 |
+----------+
1 row(s) fetched.
Elapsed 2.074 seconds.
> COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
STORED AS PARQUET;
+----------+
| count    |
+----------+
| 10000000 |
+----------+
1 row(s) fetched.
Elapsed 2.124 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

// trigger ci test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems unecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it was on my before-merge todo list.

image

@alamb
Copy link
Contributor

alamb commented Feb 26, 2026

FYI I got the test even faster by scheduling the runs a little more carefully:

@kosiew
Copy link
Contributor Author

kosiew commented Feb 27, 2026

Thanks @alamb for the review.

@github-actions github-actions bot removed the optimizer Optimizer rules label Feb 27, 2026
@kosiew kosiew added this pull request to the merge queue Feb 27, 2026
Merged via the queue into apache:main with commit bc600b3 Feb 27, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants