perf: Optimize ExternalSorter with chunked sort pipeline and radix sort kernel #21600

Closed
mbutrovich wants to merge 26 commits into apache:main from mbutrovich:sort_redesign

@mbutrovich mbutrovich commented Apr 13, 2026

Draft for benchmarking, builds on #21525.

Which issue does this PR close?

Partially addresses #21543.

Rationale for this change

ExternalSorter's merge path sorts each incoming batch individually (typically 8192 rows), then k-way merges all of them. This creates two problems:

  1. Too many small sorted runs. At scale (TPC-H SF10, ~60M rows in lineitem), ~7300 individually-sorted batches feed the k-way merge with high fan-in.
  2. Radix sort can't amortize encoding. MSD radix sort (arrow-rs#9683, "feat(arrow-row): add MSD radix sort kernel for row-encoded keys") is 2-3x faster than lexsort_to_indices at 32K+ rows for multi-column sorts, but at 8K rows the RowConverter encoding cost dominates. TPC-H SF10 benchmarks on #21525 ("perf: Bring over apache/arrow-rs/9683 radix sort, integrate into ExternalSorter") confirmed this: naively swapping in radix sort made 12/22 queries slower (up to 1.20x).
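
As a rough check on the fan-in numbers above, the run count is just the row count divided by the rows per sorted run, rounded up. The sketch below is illustrative arithmetic only: `sorted_runs` is a hypothetical helper, and it ignores partitioning (the real merge fan-in is per partition).

```rust
/// Number of sorted runs produced when `total_rows` rows are sorted in
/// independent chunks of `rows_per_run` rows (the last run may be shorter).
/// Hypothetical helper for illustration; not part of DataFusion.
fn sorted_runs(total_rows: u64, rows_per_run: u64) -> u64 {
    assert!(rows_per_run > 0, "rows_per_run must be non-zero");
    // Ceiling division without relying on newer std helpers.
    (total_rows + rows_per_run - 1) / rows_per_run
}
```

With roughly 60M rows, `sorted_runs(60_000_000, 8192)` gives about 7,300 runs, while coalescing to 32768 rows first gives about a quarter as many.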

What changes are included in this PR?

Chunked sort pipeline

Replaces ExternalSorter's buffer-then-sort architecture with a coalesce-then-sort pipeline:

  • Incoming batches accumulate in a BatchCoalescer until sort_coalesce_target_rows (default 32768) is reached
  • Each coalesced batch is sorted (radix or lexsort) and chunked back to batch_size
  • On memory pressure, sorted runs spill to disk (merged into one file when headroom is available, one file per run otherwise)
  • At query completion, runs are k-way merged via the existing StreamingMergeBuilder
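
The steps above can be sketched with plain vectors standing in for record batches. This is a toy model under stated assumptions, not the PR's implementation: `Batch`, `build_sorted_runs`, `sort_and_chunk`, and `merge_runs` are hypothetical names, the sort is `sort_unstable` rather than radix or lexsort, spilling is omitted, and the toy buffer may overshoot the target by part of a batch.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy stand-in for a record batch: a single column of i64 sort keys.
type Batch = Vec<i64>;

/// Sort a coalesced buffer and slice it back into chunks of at most `batch_size` rows.
fn sort_and_chunk(mut rows: Vec<i64>, batch_size: usize) -> Vec<Batch> {
    rows.sort_unstable();
    rows.chunks(batch_size).map(|c| c.to_vec()).collect()
}

/// Accumulate incoming batches until at least `target_rows` rows are buffered,
/// then emit one sorted run. Each inner `Vec<Batch>` is one sorted run.
fn build_sorted_runs(input: Vec<Batch>, target_rows: usize, batch_size: usize) -> Vec<Vec<Batch>> {
    assert!(target_rows > 0 && batch_size > 0);
    let mut runs = Vec::new();
    let mut buffer: Vec<i64> = Vec::new();
    for batch in input {
        buffer.extend(batch);
        if buffer.len() >= target_rows {
            runs.push(sort_and_chunk(std::mem::take(&mut buffer), batch_size));
        }
    }
    // Partial flush: leftover rows at end of input form a final, smaller run.
    if !buffer.is_empty() {
        runs.push(sort_and_chunk(buffer, batch_size));
    }
    runs
}

/// K-way merge of sorted runs using a min-heap keyed on (value, run index).
fn merge_runs(runs: Vec<Vec<Batch>>) -> Vec<i64> {
    let mut iters: Vec<_> = runs
        .into_iter()
        .map(|run| run.into_iter().flatten())
        .collect();
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(v) = it.next() {
            heap.push(Reverse((v, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, i))) = heap.pop() {
        out.push(v);
        // Refill the heap from the run that just yielded the minimum.
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}
```

The heap holds one entry per run, so fewer, larger runs mean a smaller heap and fewer comparisons per output row, which is the fan-in reduction the pipeline is after.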

Uniform coalescing, per-batch algorithm selection

All schemas coalesce to sort_coalesce_target_rows. This reduces merge fan-in for all queries, including single-column sorts like sort-merge join keys.

Per batch, radix sort is used when the schema is eligible (multi-column, primitives/strings) and the batch reached sort_coalesce_target_rows. Otherwise lexsort is used. A sort_use_radix config (default true) allows disabling radix entirely to isolate the pipeline's contribution.
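
A minimal sketch of the selection rule in the paragraph above, assuming "reached" means `>=` and modelling schema eligibility with two hypothetical fields. `SortAlgorithm`, `SortKeyInfo`, and `choose_algorithm` are illustrative names, not the PR's API.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SortAlgorithm {
    Radix,
    Lexsort,
}

/// Hypothetical summary of the sort keys, standing in for a real schema check.
struct SortKeyInfo {
    num_sort_columns: usize,
    all_primitive_or_string: bool,
}

/// Pick the algorithm for one coalesced batch.
/// Assumption: a batch "reached" the target when it has at least that many rows.
fn choose_algorithm(
    keys: &SortKeyInfo,
    batch_rows: usize,
    sort_coalesce_target_rows: usize,
    sort_use_radix: bool,
) -> SortAlgorithm {
    let schema_eligible = keys.num_sort_columns > 1 && keys.all_primitive_or_string;
    if sort_use_radix && schema_eligible && batch_rows >= sort_coalesce_target_rows {
        SortAlgorithm::Radix
    } else {
        SortAlgorithm::Lexsort
    }
}
```

Under this rule a short final batch from a partial flush falls back to lexsort even for an eligible schema, and setting `sort_use_radix` to false forces lexsort everywhere.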

Metrics

New radix_sorted_batches and lexsort_sorted_batches counters in ExternalSorterMetrics, visible in EXPLAIN ANALYZE.

Dead code removal

Sorted runs no longer require an in-memory merge before spilling. Removes in_mem_sort_stream, sort_batch_stream, consume_and_spill_append, spill_finish, organize_stringview_arrays, and in_progress_spill_file.

Config changes

  • New: sort_coalesce_target_rows (default 32768)
  • New: sort_use_radix (default true)
  • Deprecated: sort_in_place_threshold_bytes (no longer read, warn attribute per API health policy)

Are these changes tested?

  • 4 new unit tests (coalescing, partial flush, per-run spill, merged spill)
  • All 52 sort unit tests pass
  • All sort fuzz, sort query fuzz, and spilling fuzz tests pass
  • information_schema.slt updated for new configs

Are there any user-facing changes?

  • New config sort_coalesce_target_rows (default 32768) controls coalesce target
  • New config sort_use_radix (default true) enables/disables radix sort
  • New metrics radix_sorted_batches and lexsort_sorted_batches in EXPLAIN ANALYZE
  • sort_in_place_threshold_bytes is deprecated
  • The pipeline is more memory-efficient (shrinks reservations after sorting) so some workloads may spill less frequently

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate execution Related to the execution crate physical-plan Changes to the physical-plan crate labels Apr 13, 2026
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 13, 2026
Comment on lines +380 to +383
} else if self.sorted_runs_memory > reservation_size {
    self.reservation
        .grow(self.sorted_runs_memory - reservation_size);
}

The comment says it would only exceed the limit by a small amount, but wouldn't this compound across partitions when there are many of them?

Maybe we could still use try_grow here and handle the failure, or at least cap the grow amount?
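
To make the concern concrete, here is a toy model of an infallible grow versus a fallible one when several partitions each ask for a little extra memory. `ToyPool` and both helper functions are hypothetical and far simpler than DataFusion's memory pool; the numbers are made up.

```rust
/// Toy memory pool with a hard limit. Illustration only.
struct ToyPool {
    limit: usize,
    used: usize,
}

impl ToyPool {
    /// Infallible grow: always succeeds, so `used` can exceed `limit`.
    fn grow(&mut self, bytes: usize) {
        self.used += bytes;
    }

    /// Fallible grow: refuses any request that would exceed `limit`.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            Err(format!("cannot reserve {} bytes: {} of {} in use", bytes, self.used, self.limit))
        } else {
            self.used += bytes;
            Ok(())
        }
    }

    /// Bytes currently reserved beyond the limit.
    fn overshoot(&self) -> usize {
        self.used.saturating_sub(self.limit)
    }
}

/// Every partition grows by `extra` bytes unconditionally; returns the total overshoot.
fn overshoot_with_grow(limit: usize, used: usize, partitions: usize, extra: usize) -> usize {
    let mut pool = ToyPool { limit, used };
    for _ in 0..partitions {
        pool.grow(extra);
    }
    pool.overshoot()
}

/// Every partition tries to grow by `extra` bytes; returns how many were refused
/// and would need another remedy (for example, spilling).
fn refused_with_try_grow(limit: usize, used: usize, partitions: usize, extra: usize) -> usize {
    let mut pool = ToyPool { limit, used };
    let mut refused = 0;
    for _ in 0..partitions {
        if pool.try_grow(extra).is_err() {
            refused += 1;
        }
    }
    refused
}
```

In this model a small per-partition overshoot adds up linearly with the partition count, whereas the fallible variant keeps the pool at or under its limit and surfaces the failure to the caller.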

@mbutrovich mbutrovich Apr 14, 2026

I updated the comments to explain the scenario a bit more, but let me know if you still think we should do something more strict.

Comment on lines +349 to +351
let use_radix_for_this_batch =
    self.use_radix && batch.num_rows() > self.batch_size;

nit: since there is already a gate in sort_batch that handles radix vs lexsort, maybe we could rename this variable for readability to something like use_chunked_radix?

Also, just realizing: the "graceful degradation" section says the pipeline falls back to lexsort below batch_size rows, but wouldn't the else branch here still take the radix path? When use_radix_for_this_batch is false, this calls sort_batch, and sort_batch independently checks use_radix_sort when fetch.is_none().

For radix-eligible schemas it takes the radix path regardless of row count. Wouldn't that be slower?
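
The double gate described above can be reproduced with a small stand-in model. Everything here is hypothetical (`Algo`, `schema_is_radix_eligible`, `sort_batch_public`, `sort_batch_inner`, `caller_before`, `caller_after`), the eligibility check is reduced to "more than one sort column", and the explicit-flag inner function is only one possible remedy.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Algo {
    Radix,
    Lexsort,
}

/// Stand-in for the schema check the comment calls `use_radix_sort`.
fn schema_is_radix_eligible(num_sort_columns: usize) -> bool {
    num_sort_columns > 1
}

/// Stand-in for a public `sort_batch` that decides on its own and never looks at row count.
fn sort_batch_public(num_sort_columns: usize, fetch: Option<usize>) -> Algo {
    if fetch.is_none() && schema_is_radix_eligible(num_sort_columns) {
        Algo::Radix
    } else {
        Algo::Lexsort
    }
}

/// Hypothetical inner function: the caller passes the decision explicitly.
fn sort_batch_inner(use_radix: bool) -> Algo {
    if use_radix { Algo::Radix } else { Algo::Lexsort }
}

/// Caller as described in the review: the row-count gate is bypassed because
/// the else branch delegates to the public function, which re-decides.
fn caller_before(rows: usize, batch_size: usize, num_sort_columns: usize) -> Algo {
    let use_radix_for_this_batch = schema_is_radix_eligible(num_sort_columns) && rows > batch_size;
    if use_radix_for_this_batch {
        Algo::Radix
    } else {
        sort_batch_public(num_sort_columns, None)
    }
}

/// One possible remedy: make the decision once and hand it to the inner function.
fn caller_after(rows: usize, threshold: usize, num_sort_columns: usize) -> Algo {
    sort_batch_inner(schema_is_radix_eligible(num_sort_columns) && rows >= threshold)
}
```

In this model `caller_before(100, 8192, 2)` still ends up on the radix path for a 100-row batch, while `caller_after(100, 32768, 2)` falls back to lexsort as intended.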

Contributor Author


Thanks for the feedback @gratus00! I addressed both of these. Checking per-batch is wasteful too. I created an inner function that we can call directly because sort_batch is a public API and I don't want to change it.

- Always coalesce to `sort_coalesce_target_rows` regardless of schema (removed conditional that fell back to `batch_size` for non-radix)
- Both radix and lexsort paths now go through `sort_batch_chunked` (both chunk output to `batch_size`)
- Per-batch radix decision uses `sort_coalesce_target_rows` as threshold instead of `batch_size`
- Added `radix_sorted_batches` and `lexsort_sorted_batches` counters to `ExternalSorterMetrics`
- Added `sort_coalesce_target_rows` and `sort_use_radix` config fields to `ExternalSorter`
- New `sort_use_radix` parameter gates the `use_radix_sort()` schema check

## `datafusion/common/src/config.rs`
- New config: `sort_use_radix: bool, default = true`
- Updated `sort_coalesce_target_rows` doc

## `datafusion/execution/src/config.rs`
- New builder method: `with_sort_use_radix()`

## `datafusion/core/tests/fuzz_cases/sort_fuzz.rs`
- `(20000, false)` → `(50000, true)` to fix flaky test

## `datafusion/sqllogictest/test_files/information_schema.slt` + `docs/source/user-guide/configs.md`
- Added `sort_use_radix` entry, updated `sort_coalesce_target_rows` description

@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4245342572-1236-8dldw 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing sort_redesign (73bb06b) to 0143dfe (merge-base) diff using: tpch10
Results will be posted here when complete


@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Comparing HEAD and sort_redesign
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                         sort_redesign ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │      370.78 / 372.71 ±1.53 / 375.02 ms │     367.32 / 369.51 ±1.62 / 371.85 ms │     no change │
│ QQuery 2  │     479.03 / 498.26 ±12.67 / 512.09 ms │     441.93 / 449.16 ±4.75 / 456.30 ms │ +1.11x faster │
│ QQuery 3  │     550.68 / 651.96 ±51.73 / 692.64 ms │     504.72 / 514.53 ±5.69 / 521.96 ms │ +1.27x faster │
│ QQuery 4  │     382.30 / 478.86 ±50.61 / 522.66 ms │     341.16 / 343.59 ±2.52 / 346.66 ms │ +1.39x faster │
│ QQuery 5  │  1094.76 / 1119.98 ±14.85 / 1136.27 ms │  989.92 / 1035.74 ±30.16 / 1083.34 ms │ +1.08x faster │
│ QQuery 6  │      134.61 / 137.54 ±3.24 / 143.63 ms │     132.58 / 135.76 ±4.78 / 145.26 ms │     no change │
│ QQuery 7  │   1529.12 / 1544.02 ±8.52 / 1554.66 ms │ 1352.09 / 1364.35 ±13.84 / 1390.36 ms │ +1.13x faster │
│ QQuery 8  │ 1495.43 / 1983.26 ±252.68 / 2161.10 ms │ 1178.60 / 1195.20 ±16.42 / 1219.19 ms │ +1.66x faster │
│ QQuery 9  │ 1985.32 / 2251.76 ±135.70 / 2348.90 ms │ 1769.17 / 1861.72 ±83.52 / 1962.04 ms │ +1.21x faster │
│ QQuery 10 │      530.10 / 533.74 ±4.88 / 543.33 ms │    496.72 / 511.79 ±15.67 / 531.85 ms │     no change │
│ QQuery 11 │      455.90 / 464.13 ±5.22 / 470.06 ms │     416.63 / 426.94 ±9.64 / 440.31 ms │ +1.09x faster │
│ QQuery 12 │      288.98 / 292.38 ±2.53 / 295.72 ms │     277.24 / 280.50 ±3.22 / 285.47 ms │     no change │
│ QQuery 13 │      366.95 / 373.42 ±4.66 / 379.47 ms │     346.27 / 354.40 ±4.90 / 358.95 ms │ +1.05x faster │
│ QQuery 14 │      195.18 / 198.85 ±2.46 / 202.91 ms │     192.87 / 197.00 ±2.91 / 200.35 ms │     no change │
│ QQuery 15 │      323.95 / 331.40 ±6.53 / 342.87 ms │     319.56 / 326.97 ±6.54 / 339.16 ms │     no change │
│ QQuery 16 │      121.75 / 123.85 ±2.25 / 127.96 ms │     114.45 / 116.88 ±2.91 / 122.43 ms │ +1.06x faster │
│ QQuery 17 │ 1574.15 / 1819.60 ±123.13 / 1892.65 ms │ 1372.85 / 1388.43 ±10.80 / 1402.63 ms │ +1.31x faster │
│ QQuery 18 │  1535.10 / 1560.14 ±19.95 / 1594.45 ms │ 1407.54 / 1451.07 ±36.80 / 1513.73 ms │ +1.08x faster │
│ QQuery 19 │     276.90 / 290.88 ±17.67 / 325.49 ms │    277.65 / 291.86 ±25.11 / 342.04 ms │     no change │
│ QQuery 20 │      451.87 / 457.31 ±4.64 / 464.09 ms │    417.57 / 429.48 ±12.70 / 453.17 ms │ +1.06x faster │
│ QQuery 21 │ 2981.79 / 3226.93 ±156.76 / 3396.23 ms │ 2602.88 / 2639.31 ±24.65 / 2668.51 ms │ +1.22x faster │
│ QQuery 22 │      190.75 / 194.35 ±5.46 / 205.18 ms │     153.72 / 160.32 ±4.83 / 168.41 ms │ +1.21x faster │
└───────────┴────────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)            │ 18905.31ms │
│ Total Time (sort_redesign)   │ 15844.52ms │
│ Average Time (HEAD)          │   859.33ms │
│ Average Time (sort_redesign) │   720.21ms │
│ Queries Faster               │         15 │
│ Queries Slower               │          0 │
│ Queries with No Change       │          7 │
│ Queries with Failure         │          0 │
└──────────────────────────────┴────────────┘

Resource Usage

tpch10 — base (merge-base)

| Metric | Value |
|---|---|
| Wall time | 94.9s |
| Peak memory | 12.8 GiB |
| Avg memory | 8.6 GiB |
| CPU user | 868.4s |
| CPU sys | 74.1s |
| Peak spill | 0 B |

tpch10 — branch

| Metric | Value |
|---|---|
| Wall time | 79.5s |
| Peak memory | 10.8 GiB |
| Avg memory | 8.0 GiB |
| CPU user | 782.1s |
| CPU sys | 67.4s |
| Peak spill | 0 B |

@mbutrovich

So this run shows the improvement afforded by both the ExternalSorter rewrite (which helps lexsort by reducing fan-in) and radix sorting. I will push a commit that defaults radix sort off and rerun the benchmarks to get a baseline for the ExternalSorter changes on their own.

@apache apache deleted a comment from adriangbot Apr 14, 2026
@apache apache deleted a comment from adriangbot Apr 14, 2026
@apache apache deleted a comment from adriangbot Apr 14, 2026
@apache apache deleted a comment from adriangbot Apr 14, 2026
@apache apache deleted a comment from adriangbot Apr 14, 2026
@apache apache deleted a comment from adriangbot Apr 14, 2026
@apache apache deleted a comment from adriangbot Apr 14, 2026
@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4245617152-1237-trv8q 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing sort_redesign (482e72c) to 0143dfe (merge-base) diff using: tpch10
Results will be posted here when complete


@apache apache deleted a comment from adriangbot Apr 14, 2026
@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Comparing HEAD and sort_redesign
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                         sort_redesign ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │      369.16 / 371.50 ±1.50 / 373.82 ms │     368.07 / 370.91 ±1.62 / 372.36 ms │     no change │
│ QQuery 2  │      482.08 / 492.17 ±8.28 / 506.22 ms │     433.33 / 444.34 ±5.98 / 450.13 ms │ +1.11x faster │
│ QQuery 3  │     615.12 / 652.70 ±25.94 / 683.44 ms │     505.60 / 511.45 ±3.64 / 516.46 ms │ +1.28x faster │
│ QQuery 4  │     465.10 / 493.57 ±17.87 / 509.76 ms │     338.03 / 340.79 ±2.28 / 343.59 ms │ +1.45x faster │
│ QQuery 5  │  1064.46 / 1096.65 ±28.41 / 1137.47 ms │ 1004.03 / 1053.30 ±30.21 / 1086.10 ms │     no change │
│ QQuery 6  │      132.93 / 137.08 ±6.96 / 150.97 ms │     133.75 / 136.14 ±3.28 / 142.56 ms │     no change │
│ QQuery 7  │  1518.69 / 1545.85 ±32.91 / 1607.84 ms │ 1344.80 / 1370.15 ±25.64 / 1415.11 ms │ +1.13x faster │
│ QQuery 8  │ 1471.85 / 2012.19 ±270.24 / 2155.94 ms │ 1173.34 / 1225.25 ±60.85 / 1308.03 ms │ +1.64x faster │
│ QQuery 9  │ 2026.03 / 2167.95 ±127.83 / 2335.75 ms │ 1738.14 / 1801.47 ±57.88 / 1875.72 ms │ +1.20x faster │
│ QQuery 10 │      519.68 / 533.14 ±8.21 / 545.06 ms │     506.37 / 512.62 ±7.24 / 526.36 ms │     no change │
│ QQuery 11 │      447.97 / 455.81 ±7.10 / 467.81 ms │     415.12 / 426.13 ±6.22 / 433.26 ms │ +1.07x faster │
│ QQuery 12 │      284.55 / 288.58 ±3.24 / 294.03 ms │     274.70 / 281.79 ±4.01 / 286.60 ms │     no change │
│ QQuery 13 │      362.47 / 371.52 ±7.65 / 384.95 ms │     337.67 / 341.43 ±3.52 / 347.69 ms │ +1.09x faster │
│ QQuery 14 │      194.71 / 197.06 ±1.55 / 198.77 ms │     191.35 / 195.28 ±3.56 / 201.28 ms │     no change │
│ QQuery 15 │      324.22 / 331.51 ±5.40 / 341.04 ms │     320.89 / 323.67 ±3.29 / 329.87 ms │     no change │
│ QQuery 16 │      119.57 / 123.10 ±3.64 / 129.71 ms │     114.08 / 114.91 ±0.74 / 116.29 ms │ +1.07x faster │
│ QQuery 17 │ 1562.46 / 1637.28 ±123.05 / 1882.73 ms │  1362.70 / 1369.87 ±4.39 / 1375.48 ms │ +1.20x faster │
│ QQuery 18 │  1520.50 / 1555.74 ±33.25 / 1617.16 ms │ 1375.78 / 1415.36 ±26.77 / 1450.02 ms │ +1.10x faster │
│ QQuery 19 │     278.50 / 290.44 ±14.92 / 318.03 ms │    272.95 / 288.41 ±24.93 / 338.10 ms │     no change │
│ QQuery 20 │      444.79 / 451.07 ±3.45 / 453.90 ms │     436.88 / 443.21 ±5.19 / 448.41 ms │     no change │
│ QQuery 21 │ 2975.83 / 3236.75 ±151.25 / 3419.22 ms │ 2612.11 / 2652.91 ±34.67 / 2702.19 ms │ +1.22x faster │
│ QQuery 22 │      192.70 / 199.78 ±7.65 / 213.77 ms │     151.75 / 161.45 ±8.55 / 176.35 ms │ +1.24x faster │
└───────────┴────────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)            │ 18641.44ms │
│ Total Time (sort_redesign)   │ 15780.86ms │
│ Average Time (HEAD)          │   847.34ms │
│ Average Time (sort_redesign) │   717.31ms │
│ Queries Faster               │         13 │
│ Queries Slower               │          0 │
│ Queries with No Change       │          9 │
│ Queries with Failure         │          0 │
└──────────────────────────────┴────────────┘

Resource Usage

tpch10 — base (merge-base)

| Metric | Value |
|---|---|
| Wall time | 93.6s |
| Peak memory | 11.1 GiB |
| Avg memory | 8.6 GiB |
| CPU user | 865.9s |
| CPU sys | 71.6s |
| Peak spill | 0 B |

tpch10 — branch

| Metric | Value |
|---|---|
| Wall time | 79.2s |
| Peak memory | 10.9 GiB |
| Avg memory | 8.0 GiB |
| CPU user | 782.3s |
| CPU sys | 66.2s |
| Peak spill | 0 B |

@mbutrovich

So it seems like the big win here is the ExternalSorter refactor reducing merge fan-in, considering this run has radix sort off and the speedup is still pretty strong.
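
The comparison can be checked against the "Total Time" rows of the two benchmark summaries in this thread. The sketch below only redoes that arithmetic; `speedup` and the two constants are hypothetical names, and the totals are copied from the summaries (first run with radix on, second with radix off).

```rust
/// Speedup of the branch relative to the base; values above 1.0 mean the branch is faster.
fn speedup(base_ms: f64, branch_ms: f64) -> f64 {
    base_ms / branch_ms
}

/// (HEAD total, sort_redesign total) in milliseconds, from the run with radix sort enabled.
const RADIX_ON_TOTALS_MS: (f64, f64) = (18905.31, 15844.52);

/// (HEAD total, sort_redesign total) in milliseconds, from the run with radix sort disabled.
const RADIX_OFF_TOTALS_MS: (f64, f64) = (18641.44, 15780.86);
```

`speedup(RADIX_ON_TOTALS_MS.0, RADIX_ON_TOTALS_MS.1)` comes out at roughly 1.19x and the radix-off run at roughly 1.18x, so on these totals most of the gain is present without radix sort. The two runs used separate HEAD measurements, so the small gap between them is within run-to-run noise.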

This PR is mostly for experimenting anyway, but maybe this is motivation to structure the future work as:

  1. ExternalSorter refactor to use BatchCoalescer to reduce merge fan-in.
  2. After the radix sort kernel lands in arrow-rs and DF updates to that version of arrow-rs, add radix sort support.

@mbutrovich mbutrovich changed the title from "perf: Optimize ExternalSorter with chunked sort pipeline" to "perf: Optimize ExternalSorter with chunked sort pipeline and radix sort kernel" Apr 14, 2026
mbutrovich added a commit to mbutrovich/datafusion that referenced this pull request Apr 14, 2026
@mbutrovich mbutrovich closed this Apr 14, 2026
Labels

common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation execution Related to the execution crate performance Make DataFusion faster physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)
