perf: sort-merge join (SMJ) batch deferred filtering and move mark joins to specialized stream #21184

Open
mbutrovich wants to merge 18 commits into apache:main from mbutrovich:simplify_smj_full_opt

@mbutrovich
Contributor

mbutrovich commented Mar 26, 2026

Which issue does this PR close?

Partially addresses #20910.

Rationale for this change

Sort-merge join with a filter on outer joins (LEFT/RIGHT/FULL) runs process_filtered_batches() on every key transition in the Init state. With near-unique keys (1:1 cardinality), this means running the full deferred filtering pipeline (concat + get_corrected_filter_mask + filter_record_batch_by_join_type) once per row — making filtered LEFT/RIGHT/FULL 55x slower than INNER for 10M unique keys.

Additionally, mark join logic in SortMergeJoinStream materializes full (streamed, buffered) pairs only to discard most of them via get_corrected_filter_mask(). Mark joins are structurally identical to semi joins (one output row per outer row with a boolean result) and belong in SemiAntiMarkSortMergeJoinStream, which avoids pair materialization entirely using a per-outer-batch bitset.

What changes are included in this PR?

Three areas of improvement, building on the specialized semi/anti stream from #20806:

1. Move mark joins to SemiAntiMarkSortMergeJoinStream

  • Rename semi_anti_sort_merge_join → semi_anti_mark_sort_merge_join
  • Match on join type; emit_outer_batch() emits all rows with the match bitset as a boolean column (vs semi's filter / anti's invert-and-filter)
  • Route LeftMark/RightMark from SortMergeJoinExec::execute() to the renamed stream
  • Remove all mark-specific logic from SortMergeJoinStream (mark_row_as_match, is_not_null column generation, mark arms in filter correction)
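
To illustrate why mark joins fit the semi/anti stream, here is a minimal sketch of emitting an outer batch from one "matched" bit per outer row. It uses plain slices instead of Arrow arrays, and `JoinKind`, `OutRow`, and `emit_outer` are hypothetical names for illustration, not the PR's code.

```rust
/// Hypothetical stand-in for the join types handled by the specialized stream.
#[derive(Clone, Copy, Debug, PartialEq)]
enum JoinKind {
    Semi,
    Anti,
    Mark,
}

/// One emitted row: the outer value plus, for mark joins only,
/// the boolean "had a match" column.
#[derive(Debug, PartialEq)]
struct OutRow {
    value: i64,
    mark: Option<bool>,
}

/// Emit an outer batch given one matched bit per outer row.
/// Semi keeps matched rows, anti keeps unmatched rows, and mark keeps
/// every row and exposes the bit as a column.
fn emit_outer(kind: JoinKind, outer: &[i64], matched: &[bool]) -> Vec<OutRow> {
    assert_eq!(outer.len(), matched.len());
    let mut out = Vec::new();
    for (value, m) in outer.iter().zip(matched) {
        match kind {
            JoinKind::Semi if *m => out.push(OutRow { value: *value, mark: None }),
            JoinKind::Anti if !*m => out.push(OutRow { value: *value, mark: None }),
            JoinKind::Mark => out.push(OutRow { value: *value, mark: Some(*m) }),
            _ => {} // row is filtered out for semi/anti
        }
    }
    out
}
```

In all three cases no (streamed, buffered) pairs are built; only the bit per outer row is needed.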

2. Batch filter evaluation in freeze_streamed()

  • Split freeze_streamed() into null-joined classification + freeze_streamed_matched() for batched materialization
  • Collect indices across chunks, materialize left/right columns once using tiered Arrow kernels (slice → take → interleave)
  • Single RecordBatch construction and single expression.evaluate() per freeze instead of per chunk
  • Vectorize append_filter_metadata() using builder extend() instead of per-element loop
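
One way to read the "tiered" kernel choice is sketched below with plain vectors. The tier conditions are an assumption made for illustration (contiguous run in one batch → slice, arbitrary rows in one batch → take, rows spanning batches → interleave); `Tier`, `choose_tier`, and `materialize` are hypothetical names, and the PR's actual conditions may differ.

```rust
/// Which materialization strategy a list of (batch_idx, row_idx) pairs allows.
#[derive(Debug, PartialEq)]
enum Tier {
    /// Contiguous rows of a single batch: cheapest, just an offset and length.
    Slice { batch: usize, offset: usize, len: usize },
    /// Arbitrary rows of a single batch: gather by row index.
    Take { batch: usize },
    /// Rows spread over several batches: gather by (batch, row) pairs.
    Interleave,
}

/// Pick the cheapest tier that can express the index list (None if empty).
fn choose_tier(indices: &[(usize, usize)]) -> Option<Tier> {
    let (first_batch, first_row) = *indices.first()?;
    if indices.iter().any(|pair| pair.0 != first_batch) {
        return Some(Tier::Interleave);
    }
    let contiguous = indices
        .iter()
        .enumerate()
        .all(|(i, pair)| pair.1 == first_row + i);
    if contiguous {
        Some(Tier::Slice { batch: first_batch, offset: first_row, len: indices.len() })
    } else {
        Some(Tier::Take { batch: first_batch })
    }
}

/// Materialize one column once for the whole index list.
fn materialize(batches: &[Vec<i64>], indices: &[(usize, usize)]) -> Vec<i64> {
    match choose_tier(indices) {
        None => Vec::new(),
        Some(Tier::Slice { batch, offset, len }) => {
            batches[batch][offset..offset + len].to_vec()
        }
        Some(Tier::Take { batch }) => {
            indices.iter().map(|pair| batches[batch][pair.1]).collect()
        }
        Some(Tier::Interleave) => {
            indices.iter().map(|pair| batches[pair.0][pair.1]).collect()
        }
    }
}
```

The point of collecting indices first is that this decision, and the single filter evaluation that follows, happens once per freeze rather than once per chunk.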

3. Batch deferred filtering in Init state (this is the big win for Q22 and Q23)

  • Gate process_filtered_batches() on accumulated rows >= batch_size instead of running on every Init entry
  • Accumulated data bounded to ~2×batch_size (one from freeze_dequeuing_buffered, one accumulating toward next freeze), so this does not reintroduce the unbounded buffering fixed by #20482 (fix: SortMergeJoin don't wait for all input before emitting)
  • Exhausted state flushes any remainder
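
The gating idea can be sketched as a small accumulator. This is an illustration under simplifying assumptions, not the PR's code: `DeferredFilter` and its methods are hypothetical, rows are plain integers, and the deferred pipeline is replaced by a trivial "keep even values" filter so the number of pipeline runs can be counted.

```rust
/// Hypothetical accumulator for rows awaiting deferred filtering.
struct DeferredFilter {
    batch_size: usize,
    pending: Vec<i64>,
    /// How many times the (expensive) deferred pipeline ran.
    pipeline_runs: usize,
    output: Vec<i64>,
}

impl DeferredFilter {
    fn new(batch_size: usize) -> Self {
        Self { batch_size, pending: Vec::new(), pipeline_runs: 0, output: Vec::new() }
    }

    /// Called on each key transition with the rows frozen for the previous key.
    /// The pipeline only runs once enough rows have accumulated.
    fn on_key_transition(&mut self, rows: &[i64]) {
        self.pending.extend_from_slice(rows);
        if self.pending.len() >= self.batch_size {
            self.run_pipeline();
        }
    }

    /// Called once when the input is exhausted; flushes any remainder.
    fn finish(&mut self) {
        if !self.pending.is_empty() {
            self.run_pipeline();
        }
    }

    fn run_pipeline(&mut self) {
        self.pipeline_runs += 1;
        // Stand-in for concat + mask correction + filtering: keep even values.
        for v in self.pending.drain(..) {
            if v % 2 == 0 {
                self.output.push(v);
            }
        }
    }
}
```

With 10 unique keys and a batch size of 4, the gated version runs the pipeline 3 times (twice at the threshold, once for the remainder), whereas a threshold of 1 models the old behavior of running on every key transition (10 runs). Both produce the same output.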

Cleanup:

  • SortMergeJoinStream now handles only Inner/Left/Right/Full — all semi/anti/mark branching removed
  • get_corrected_filter_mask(): merge identical Left/Right/Full branches; add null-metadata passthrough for already-null-joined rows
  • filter_record_batch_by_join_type(): rewrite from filter(true) + filter(false) + concat to zip() for in-place null-joining — preserves row ordering and removes create_null_joined_batch() entirely
  • filter_record_batch_by_join_type(): use compute::filter() directly on BooleanArray instead of wrapping in temporary RecordBatch
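
The ordering difference between the two shapes of filter_record_batch_by_join_type() can be sketched with plain tuples. This assumes the mask has already been corrected (one decision per output row); `Row`, `filter_concat`, and `zip_select` are hypothetical names, and the real code operates on Arrow arrays rather than vectors.

```rust
/// (left value, right value); a None right side means "null-joined".
type Row = (i64, Option<i64>);

/// Old shape: split rows by the mask, null-join the failures, then concatenate.
/// Passing rows come first, so the original row order is lost.
fn filter_concat(mask: &[bool], joined: &[Row]) -> Vec<Row> {
    let mut passed = Vec::new();
    let mut failed = Vec::new();
    for (row, keep) in joined.iter().zip(mask) {
        if *keep {
            passed.push(*row);
        } else {
            failed.push((row.0, None));
        }
    }
    passed.extend(failed);
    passed
}

/// New shape: a per-row select between the joined right side and null,
/// which keeps every row at its original position.
fn zip_select(mask: &[bool], joined: &[Row]) -> Vec<Row> {
    joined
        .iter()
        .zip(mask)
        .map(|(row, keep)| (row.0, if *keep { row.1 } else { None }))
        .collect()
}
```

For a mask of [true, false, true], `zip_select` leaves the null-joined row in the middle, while `filter_concat` moves it to the end.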

Benchmarks

cargo run --release --bin dfbench -- smj

| Query | Join Type | Rows | Keys | Filter | Main (ms) | PR (ms) | Speedup |
|---|---|---|---|---|---|---|---|
| Q1 | INNER | 100K×100K | 1:1 | | 1.7 | 1.7 | 1.0x |
| Q2 | INNER | 100K×1M | 1:10 | | 12.2 | 11.6 | 1.0x |
| Q3 | INNER | 1M×1M | 1:100 | | 64.2 | 64.9 | 1.0x |
| Q4 | INNER | 100K×1M | 1:10 | 1% | 2.2 | 2.2 | 1.0x |
| Q5 | INNER | 1M×1M | 1:100 | 10% | 12.8 | 12.7 | 1.0x |
| Q6 | LEFT | 100K×1M | 1:10 | | 11.1 | 11.3 | 1.0x |
| Q7 | LEFT | 100K×1M | 1:10 | 50% | 13.4 | 14.1 | 1.0x |
| Q8 | FULL | 100K×100K | 1:10 | | 2.2 | 2.2 | 1.0x |
| Q9 | FULL | 100K×1M | 1:10 | 10% | 14.5 | 14.8 | 1.0x |
| Q10 | LEFT SEMI | 100K×1M | 1:10 | | 3.6 | 3.4 | 1.0x |
| Q11 | LEFT SEMI | 100K×1M | 1:10 | 1% | 2.0 | 2.3 | 1.0x |
| Q12 | LEFT SEMI | 100K×1M | 1:10 | 50% | 5.1 | 5.4 | 1.0x |
| Q13 | LEFT SEMI | 100K×1M | 1:10 | 90% | 9.9 | 10.1 | 1.0x |
| Q14 | LEFT ANTI | 100K×1M | 1:10 | | 3.5 | 3.7 | 1.0x |
| Q15 | LEFT ANTI | 100K×1M | 1:10 | partial | 3.7 | 3.5 | 1.0x |
| Q16 | LEFT ANTI | 100K×100K | 1:1 | | 1.6 | 1.7 | 1.0x |
| Q17 | INNER | 100K×5M | 1:50 | 5% | 7.4 | 7.8 | 1.0x |
| Q18 | LEFT SEMI | 100K×5M | 1:50 | 2% | 5.4 | 5.5 | 1.0x |
| Q19 | LEFT ANTI | 100K×5M | 1:50 | partial | 21.0 | 21.2 | 1.0x |
| Q20 | INNER | 1M×10M | 1:100 | GROUP BY | 759 | 761 | 1.0x |
| Q21 | INNER | 10M×10M | 1:1 | 50% | 181 | 173 | 1.0x |
| Q22 | LEFT | 10M×10M | 1:1 | 50% | 10,228 | 184 | 55x |
| Q23 | FULL | 10M×10M | 1:1 | 50% | 9,884 | 228 | 43x |

General workload (Q1-Q20, various join types/cardinalities/selectivities): no regressions.

Are these changes tested?

Yes:

  • 48 SMJ unit tests (cargo test -p datafusion-physical-plan --lib joins::sort_merge_join)
  • 10 join sqllogictest files (cargo test -p datafusion-sqllogictest --test sqllogictests -- join)
  • Semi/anti/mark stream tests (cargo test -p datafusion-physical-plan --lib joins::semi_anti_mark_sort_merge_join)
  • New unit test for mark join with filter via the renamed stream
  • Three new unit tests to exercise full join with filter that spills
  • New fuzz test to exercise full join with filter that spills
  • New benchmark queries Q21-Q23: 10M×10M unique keys with 50% join filter for INNER/LEFT/FULL — exercises the degenerate case this PR fixes
  • I ran 50 iterations of the fuzz tests (modified to test only against hash join as the baseline, because nested loop join takes too long): cargo test -p datafusion --features extended_tests --test fuzz -- join_fuzz

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Mar 26, 2026
@mbutrovich mbutrovich requested review from comphead and rluvaton March 26, 2026 18:26
@mbutrovich
Contributor Author

Tagging folks who had feedback on recent SMJ changes @comphead @rluvaton @stuhood. Thank you!

@rluvaton
Member

run benchmarks sort_merge_join

@mbutrovich
Contributor Author

run benchmarks sort_merge_join

Note that the 2 queries I expect a speedup on in the smj suite are new in this PR, so I don't think we'll see their performance against main. I had to hoist the benchmark to main and run it locally for the comparison in the PR description.

@mbutrovich mbutrovich marked this pull request as ready for review March 26, 2026 18:29
@adriangbot

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4137272954-565-stzm5 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing simplify_smj_full_opt (1c1bec5) to ba399a8 (merge-base) diff
BENCH_NAME=sort_merge_join
BENCH_COMMAND=cargo bench --features=parquet --bench sort_merge_join
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

Benchmark for this request failed.

Last 20 lines of output:

Cloning into '/workspace/datafusion-branch'...
simplify_smj_full_opt
From https://github.com/apache/datafusion
 * [new ref]         refs/pull/21184/head -> simplify_smj_full_opt
 * branch            main                 -> FETCH_HEAD
Switched to branch 'simplify_smj_full_opt'
ba399a80f9ffcb0563adf2b67add13d0476f6291
Cloning into '/workspace/datafusion-base'...
HEAD is now at ba399a8 docs: add KalamDB to known users (#21181)
rustc 1.94.0 (4a4ef493e 2026-03-02)
1c1bec5e7a217c366e704d1fd5bf8594a9e9540e
ba399a80f9ffcb0563adf2b67add13d0476f6291
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
error: target `sort_merge_join` in package `datafusion-physical-plan` requires the features: `test_utils`
Consider enabling them by passing, e.g., `--features="test_utils"`

File an issue against this benchmark runner

@mbutrovich
Contributor Author

mbutrovich commented Mar 26, 2026

Also, I'm now unsure where I should add benchmarks. #20464 added Criterion benchmarks for sort-merge join, but they are missing scenarios from dfbench's smj benchmarks, which I extend further here. Any help?

@Dandandan
Contributor

adriangb/datafusion-benchmarking#2

self.coalescer.push_batch(filtered)?;
let matched_buf = self.matched.finish();

match self.join_type {
Contributor Author

@comphead this one is for you. You suggested this in #20806 and this time I finally listened :)

@Dandandan
Contributor

run benchmark smj

@adriangbot

🤖 Benchmark running (GKE) | trigger
Linux bench-c4137871831-571-vqcm7 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing simplify_smj_full_opt (481753a) to ba399a8 (merge-base) diff using: smj
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

🤖 Benchmark completed (GKE) | trigger

Details

Comparing HEAD and simplify_smj_full_opt
--------------------
Benchmark smj.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query     ┃                                 HEAD ┃                 simplify_smj_full_opt ┃    Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1  │          3.82 / 3.90 ±0.09 / 4.08 ms │           3.72 / 3.79 ±0.09 / 3.96 ms │ no change │
│ QQuery 2  │       18.26 / 18.77 ±0.33 / 19.23 ms │        19.04 / 19.39 ±0.24 / 19.77 ms │ no change │
│ QQuery 3  │    102.84 / 104.12 ±0.83 / 105.10 ms │     103.73 / 105.01 ±1.04 / 106.35 ms │ no change │
│ QQuery 4  │          3.97 / 4.05 ±0.06 / 4.09 ms │           4.08 / 4.15 ±0.07 / 4.25 ms │ no change │
│ QQuery 5  │       22.22 / 22.69 ±0.30 / 23.05 ms │        22.06 / 22.43 ±0.26 / 22.79 ms │ no change │
│ QQuery 6  │       18.37 / 18.50 ±0.14 / 18.71 ms │        18.86 / 19.26 ±0.46 / 20.07 ms │ no change │
│ QQuery 7  │       22.00 / 22.65 ±0.55 / 23.40 ms │        22.81 / 22.98 ±0.11 / 23.10 ms │ no change │
│ QQuery 8  │          3.16 / 3.19 ±0.03 / 3.24 ms │           3.15 / 3.21 ±0.05 / 3.26 ms │ no change │
│ QQuery 9  │       22.72 / 23.07 ±0.36 / 23.66 ms │        23.23 / 23.61 ±0.45 / 24.48 ms │ no change │
│ QQuery 10 │          8.51 / 8.66 ±0.10 / 8.80 ms │           8.15 / 8.56 ±0.32 / 9.12 ms │ no change │
│ QQuery 11 │          4.01 / 4.03 ±0.02 / 4.06 ms │           3.98 / 4.00 ±0.01 / 4.02 ms │ no change │
│ QQuery 12 │          7.68 / 7.90 ±0.13 / 8.05 ms │           7.61 / 8.02 ±0.27 / 8.38 ms │ no change │
│ QQuery 13 │       11.75 / 12.16 ±0.30 / 12.68 ms │        11.24 / 12.15 ±0.49 / 12.66 ms │ no change │
│ QQuery 14 │          8.62 / 8.91 ±0.20 / 9.18 ms │           8.32 / 8.64 ±0.25 / 9.00 ms │ no change │
│ QQuery 15 │          8.46 / 8.94 ±0.37 / 9.34 ms │           8.49 / 8.94 ±0.29 / 9.29 ms │ no change │
│ QQuery 16 │          2.34 / 2.39 ±0.05 / 2.49 ms │           2.30 / 2.35 ±0.03 / 2.38 ms │ no change │
│ QQuery 17 │       15.37 / 15.58 ±0.13 / 15.77 ms │        15.53 / 15.70 ±0.12 / 15.86 ms │ no change │
│ QQuery 18 │       11.97 / 12.18 ±0.21 / 12.58 ms │        11.91 / 11.99 ±0.07 / 12.11 ms │ no change │
│ QQuery 19 │       36.95 / 40.07 ±1.73 / 41.82 ms │        38.85 / 40.18 ±1.32 / 42.60 ms │ no change │
│ QQuery 20 │ 1257.09 / 1263.71 ±4.97 / 1270.19 ms │ 1261.98 / 1272.40 ±13.15 / 1294.70 ms │ no change │
└───────────┴──────────────────────────────────────┴───────────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 1605.47ms │
│ Total Time (simplify_smj_full_opt)   │ 1616.74ms │
│ Average Time (HEAD)                  │   80.27ms │
│ Average Time (simplify_smj_full_opt) │   80.84ms │
│ Queries Faster                       │         0 │
│ Queries Slower                       │         0 │
│ Queries with No Change               │        20 │
│ Queries with Failure                 │         0 │
└──────────────────────────────────────┴───────────┘

Resource Usage

smj — base (merge-base)

Metric Value
Wall time 8.4s
Peak memory 3.1 GiB
Avg memory 3.1 GiB
CPU user 87.0s
CPU sys 0.9s
Disk read 0 B
Disk write 140.0 KiB

smj — branch

Metric Value
Wall time 14.5s
Peak memory 3.4 GiB
Avg memory 3.2 GiB
CPU user 114.4s
CPU sys 2.1s
Disk read 0 B
Disk write 130.0 MiB

File an issue against this benchmark runner

@mbutrovich
Contributor Author

mbutrovich commented Mar 26, 2026

> 🤖 Benchmark completed (GKE) | trigger

Yeah, these are expected results. The queries that demonstrate this issue are new in the PR so we don't get to compare against main.

@rluvaton
Member

rluvaton commented Mar 26, 2026

Can you please create a pr with only the queries so we can run the benchmark on

@mbutrovich
Contributor Author

mbutrovich commented Mar 26, 2026

I'm noticing we don't have a ton of test coverage with spilling. I'll try to shore that up.

mbutrovich added a commit to mbutrovich/datafusion that referenced this pull request Mar 26, 2026
@mbutrovich
Contributor Author

> Can you please create a pr with only the queries so we can run the benchmark on

#21188

github-merge-queue bot pushed a commit that referenced this pull request Mar 26, 2026
See #21184 for reason of this benchmark.
@github-actions github-actions bot added the core Core DataFusion crate label Mar 26, 2026
@adriangbot

Hi @mbutrovich, thanks for the request (#21184 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Jefffrey, Omega359, adriangb, alamb, comphead, etseidl, gabotechs, geoffreyclaude, klion26, kosiew, rluvaton, xudong963, zhuqi-lucas.


File an issue against this benchmark runner

@rluvaton
Member

run benchmark smj

@adriangbot

🤖 Benchmark running (GKE) | trigger
Linux bench-c4138390922-574-8th64 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing simplify_smj_full_opt (2d2758b) to 37978e3 (merge-base) diff using: smj
Results will be posted here when complete


File an issue against this benchmark runner

@mbutrovich
Contributor Author

mbutrovich commented Mar 26, 2026

I'm running another 50 iterations of fuzz tests now that I added one that spills, so that'll take ~90 minutes. So far I'm through 12 iterations, so I'll check back in once it's done.

@adriangbot

🤖 Benchmark completed (GKE) | trigger

Details

Comparing HEAD and simplify_smj_full_opt
--------------------
Benchmark smj.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Query     ┃                                     HEAD ┃                simplify_smj_full_opt ┃         Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ QQuery 1  │              3.68 / 3.75 ±0.10 / 3.94 ms │          3.69 / 3.79 ±0.08 / 3.94 ms │      no change │
│ QQuery 2  │           18.27 / 18.67 ±0.24 / 18.98 ms │       18.88 / 19.90 ±0.96 / 21.65 ms │   1.07x slower │
│ QQuery 3  │        101.94 / 104.49 ±1.45 / 106.02 ms │    103.50 / 105.42 ±1.10 / 106.53 ms │      no change │
│ QQuery 4  │              3.85 / 4.00 ±0.10 / 4.12 ms │          3.97 / 4.01 ±0.04 / 4.08 ms │      no change │
│ QQuery 5  │           22.16 / 22.65 ±0.39 / 23.34 ms │       22.11 / 22.45 ±0.19 / 22.66 ms │      no change │
│ QQuery 6  │           18.21 / 18.90 ±0.48 / 19.72 ms │       18.32 / 18.85 ±0.54 / 19.82 ms │      no change │
│ QQuery 7  │           22.65 / 23.01 ±0.45 / 23.87 ms │       22.69 / 23.41 ±0.74 / 24.80 ms │      no change │
│ QQuery 8  │              3.12 / 3.18 ±0.06 / 3.29 ms │          3.02 / 3.13 ±0.06 / 3.16 ms │      no change │
│ QQuery 9  │           23.06 / 23.33 ±0.20 / 23.55 ms │       23.17 / 23.50 ±0.23 / 23.78 ms │      no change │
│ QQuery 10 │              8.50 / 8.71 ±0.18 / 9.03 ms │          8.26 / 8.73 ±0.26 / 8.93 ms │      no change │
│ QQuery 11 │              3.92 / 3.96 ±0.02 / 3.98 ms │          3.91 / 3.98 ±0.05 / 4.04 ms │      no change │
│ QQuery 12 │              7.78 / 7.89 ±0.05 / 7.93 ms │          7.52 / 7.81 ±0.18 / 8.06 ms │      no change │
│ QQuery 13 │           11.86 / 12.28 ±0.28 / 12.60 ms │       11.45 / 12.30 ±0.51 / 12.87 ms │      no change │
│ QQuery 14 │              8.44 / 8.70 ±0.18 / 8.95 ms │          8.64 / 8.90 ±0.17 / 9.16 ms │      no change │
│ QQuery 15 │              8.52 / 8.67 ±0.17 / 8.98 ms │          8.50 / 8.79 ±0.27 / 9.21 ms │      no change │
│ QQuery 16 │              2.30 / 2.35 ±0.03 / 2.38 ms │          2.31 / 2.38 ±0.05 / 2.44 ms │      no change │
│ QQuery 17 │           15.03 / 15.28 ±0.14 / 15.44 ms │       15.27 / 15.49 ±0.17 / 15.78 ms │      no change │
│ QQuery 18 │           11.75 / 11.82 ±0.06 / 11.92 ms │       11.87 / 11.93 ±0.08 / 12.09 ms │      no change │
│ QQuery 19 │           36.82 / 39.51 ±1.54 / 41.45 ms │       37.43 / 39.20 ±1.43 / 41.65 ms │      no change │
│ QQuery 20 │     1261.33 / 1270.77 ±4.85 / 1274.47 ms │ 1250.60 / 1261.22 ±8.99 / 1275.63 ms │      no change │
│ QQuery 21 │       361.96 / 375.60 ±11.11 / 390.97 ms │   353.89 / 373.99 ±13.74 / 392.09 ms │      no change │
│ QQuery 22 │   7402.75 / 8517.52 ±770.78 / 9493.61 ms │   372.52 / 383.54 ±10.57 / 401.27 ms │ +22.21x faster │
│ QQuery 23 │ 7686.58 / 9418.98 ±1057.23 / 10559.65 ms │   428.50 / 450.15 ±21.38 / 478.34 ms │ +20.92x faster │
└───────────┴──────────────────────────────────────────┴──────────────────────────────────────┴────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 19924.01ms │
│ Total Time (simplify_smj_full_opt)   │  2812.87ms │
│ Average Time (HEAD)                  │   866.26ms │
│ Average Time (simplify_smj_full_opt) │   122.30ms │
│ Queries Faster                       │          2 │
│ Queries Slower                       │          1 │
│ Queries with No Change               │         20 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘

Resource Usage

smj — base (merge-base)

Metric Value
Wall time 99.9s
Peak memory 3.4 GiB
Avg memory 3.3 GiB
CPU user 1097.3s
CPU sys 4.1s
Disk read 0 B
Disk write 179.6 MiB

smj — branch

Metric Value
Wall time 14.4s
Peak memory 3.4 GiB
Avg memory 3.2 GiB
CPU user 114.2s
CPU sys 2.2s
Disk read 0 B
Disk write 88.0 KiB

File an issue against this benchmark runner

@mbutrovich
Contributor Author

Comparing HEAD and simplify_smj_full_opt
--------------------
Benchmark smj.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Query     ┃                                     HEAD ┃                simplify_smj_full_opt ┃         Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ QQuery 22 │   7402.75 / 8517.52 ±770.78 / 9493.61 ms │   372.52 / 383.54 ±10.57 / 401.27 ms │ +22.21x faster │
│ QQuery 23 │ 7686.58 / 9418.98 ±1057.23 / 10559.65 ms │   428.50 / 450.15 ±21.38 / 478.34 ms │ +20.92x faster │
└───────────┴──────────────────────────────────────────┴──────────────────────────────────────┴────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 19924.01ms │
│ Total Time (simplify_smj_full_opt)   │  2812.87ms │
│ Average Time (HEAD)                  │   866.26ms │
│ Average Time (simplify_smj_full_opt) │   122.30ms │
└──────────────────────────────────────┴────────────┘

You love to see it.

@rluvaton
Member

Damn, well done

@mbutrovich
Contributor Author

> I'm running another 50 iterations of fuzz tests now that I added one that spills, so that'll take ~90 minutes. So far I'm through 12 iterations, so I'll check back in once it's done.

This finished without issue.

@adriangb
Contributor

run benchmark smj

@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4140424132-575-ns2gb 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing simplify_smj_full_opt (2d2758b) to 37978e3 (merge-base) diff using: smj
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and simplify_smj_full_opt
--------------------
Benchmark smj.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                simplify_smj_full_opt ┃         Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ QQuery 1  │            3.93 / 4.41 ±0.26 / 4.68 ms │          3.84 / 3.90 ±0.06 / 3.99 ms │  +1.13x faster │
│ QQuery 2  │         18.55 / 18.91 ±0.26 / 19.26 ms │       18.35 / 19.08 ±0.50 / 19.90 ms │      no change │
│ QQuery 3  │      103.97 / 105.48 ±1.16 / 107.12 ms │    103.99 / 104.98 ±1.17 / 107.28 ms │      no change │
│ QQuery 4  │            4.03 / 4.06 ±0.03 / 4.12 ms │          4.07 / 4.13 ±0.04 / 4.18 ms │      no change │
│ QQuery 5  │         22.42 / 22.77 ±0.25 / 23.13 ms │       22.15 / 22.49 ±0.21 / 22.77 ms │      no change │
│ QQuery 6  │         18.09 / 18.43 ±0.23 / 18.73 ms │       18.24 / 18.80 ±0.34 / 19.13 ms │      no change │
│ QQuery 7  │         22.20 / 22.48 ±0.30 / 23.06 ms │       22.79 / 22.90 ±0.13 / 23.14 ms │      no change │
│ QQuery 8  │            3.11 / 3.21 ±0.11 / 3.41 ms │          3.05 / 3.15 ±0.06 / 3.23 ms │      no change │
│ QQuery 9  │         23.60 / 23.90 ±0.28 / 24.36 ms │       23.09 / 23.78 ±0.37 / 24.10 ms │      no change │
│ QQuery 10 │            8.29 / 8.77 ±0.25 / 8.97 ms │          8.91 / 9.08 ±0.12 / 9.25 ms │      no change │
│ QQuery 11 │            3.95 / 3.99 ±0.03 / 4.03 ms │          3.96 / 4.04 ±0.07 / 4.17 ms │      no change │
│ QQuery 12 │            7.56 / 7.70 ±0.09 / 7.81 ms │          7.61 / 7.91 ±0.24 / 8.34 ms │      no change │
│ QQuery 13 │         11.11 / 11.54 ±0.46 / 12.22 ms │       11.95 / 12.14 ±0.14 / 12.29 ms │   1.05x slower │
│ QQuery 14 │            8.52 / 8.69 ±0.12 / 8.85 ms │          8.85 / 9.04 ±0.12 / 9.23 ms │      no change │
│ QQuery 15 │            8.62 / 8.79 ±0.15 / 9.04 ms │          8.82 / 8.92 ±0.10 / 9.10 ms │      no change │
│ QQuery 16 │            2.27 / 2.32 ±0.03 / 2.36 ms │          2.28 / 2.38 ±0.07 / 2.51 ms │      no change │
│ QQuery 17 │         14.94 / 15.16 ±0.12 / 15.24 ms │       15.31 / 15.47 ±0.10 / 15.59 ms │      no change │
│ QQuery 18 │         11.67 / 11.80 ±0.08 / 11.88 ms │       11.80 / 11.96 ±0.09 / 12.07 ms │      no change │
│ QQuery 19 │         37.09 / 39.37 ±1.87 / 42.78 ms │       37.85 / 39.47 ±1.54 / 42.07 ms │      no change │
│ QQuery 20 │   1249.96 / 1256.53 ±5.36 / 1265.16 ms │ 1254.91 / 1267.28 ±7.59 / 1275.30 ms │      no change │
│ QQuery 21 │     372.04 / 389.34 ±17.66 / 420.56 ms │    359.17 / 364.27 ±3.39 / 368.48 ms │  +1.07x faster │
│ QQuery 22 │ 7209.91 / 8453.19 ±950.95 / 9574.83 ms │   374.00 / 395.98 ±14.01 / 410.69 ms │ +21.35x faster │
│ QQuery 23 │ 6691.68 / 7908.85 ±848.30 / 8974.33 ms │   422.23 / 450.58 ±42.13 / 534.04 ms │ +17.55x faster │
└───────────┴────────────────────────────────────────┴──────────────────────────────────────┴────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 18349.71ms │
│ Total Time (simplify_smj_full_opt)   │  2821.77ms │
│ Average Time (HEAD)                  │   797.81ms │
│ Average Time (simplify_smj_full_opt) │   122.69ms │
│ Queries Faster                       │          4 │
│ Queries Slower                       │          1 │
│ Queries with No Change               │         18 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘

Resource Usage

smj — base (merge-base)

Metric Value
Wall time 92.1s
Peak memory 3.4 GiB
Avg memory 3.3 GiB
CPU user 1032.9s
CPU sys 3.9s
Disk read 0 B
Disk write 179.6 MiB

smj — branch

Metric Value
Wall time 14.4s
Peak memory 3.4 GiB
Avg memory 3.2 GiB
CPU user 114.1s
CPU sys 2.1s
Disk read 0 B
Disk write 708.0 KiB

File an issue against this benchmark runner
