perf: Optimize `array_min`, `array_max` for arrays of primitive types by neilconway · Pull Request #21101 · apache/datafusion

neilconway · 2026-03-22T16:10:49Z

Which issue does this PR close?

Closes #21100 (Optimize array_min, array_max for primitive types).

Rationale for this change

In the current implementation, we construct a PrimitiveArray for each row, feed it to the Arrow min / max kernel, and then collect the resulting ScalarValues in a Vec. We then construct a final PrimitiveArray for the result via ScalarValue::iter_to_array of the Vec.

We can do better for ListArrays of primitive types. First, we can iterate directly over the flat values buffer of the ListArray for the batch and compute the min/max from each row's slice directly. Second, Arrow's min / max kernels have a reasonable amount of per-call overhead; for small arrays, it is more efficient to compute the min/max ourselves via direct iteration.
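As a rough illustration of the first point, here is a minimal sketch of per-row max computed straight from a flat values buffer. It uses plain slices rather than Arrow types so that it stays self-contained; the function names, the `usize` offsets, and the `valid` bitmap represented as `&[bool]` are all assumptions made for the example, not the code in this PR.

```rust
/// Hypothetical helper: maximum of each list row, read directly from the
/// flat values buffer. `offsets` has one more entry than there are rows;
/// row `i` spans `values[offsets[i]..offsets[i + 1]]`.
/// An empty row yields `None`.
fn row_max_flat(offsets: &[usize], values: &[i64]) -> Vec<Option<i64>> {
    offsets
        .windows(2)
        .map(|w| values[w[0]..w[1]].iter().copied().max())
        .collect()
}

/// Same idea when elements may be null. `valid[i]` says whether
/// `values[i]` is non-null (an assumption standing in for a validity bitmap).
/// A row with no valid elements yields `None`.
fn row_max_flat_with_nulls(
    offsets: &[usize],
    values: &[i64],
    valid: &[bool],
) -> Vec<Option<i64>> {
    offsets
        .windows(2)
        .map(|w| {
            (w[0]..w[1])
                .filter(|&i| valid[i]) // skip null elements
                .map(|i| values[i])
                .max()
        })
        .collect()
}
```

No per-row array is allocated here: each row is just a slice of the shared buffer, which is the allocation the rationale above aims to avoid.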

Benchmarks (8192 rows, arrays of int64 values, M4 Max):

no_nulls / list_size=10: 309 µs → 26.6 µs (11.6x faster)
no_nulls / list_size=100: 392 µs → 150 µs (2.6x faster)
no_nulls / list_size=1000: 1.20 ms → 951 µs (1.26x faster)
nulls / list_size=10: 385 µs → 69.0 µs (5.6x faster)
nulls / list_size=100: 790 µs → 616 µs (1.28x faster)
nulls / list_size=1000: 5.34 ms → 5.21 ms (1.02x faster)

What changes are included in this PR?

Add benchmark for array_max
Expand SLT test coverage
Implement optimization

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

neilconway · 2026-03-22T16:11:25Z

We could add a similar fast path for arrays of strings, although it may not be worth it because array_min / array_max on arrays of strings is not particularly common?

neilconway · 2026-03-22T18:42:48Z

On an M4 Max, it looks like the crossover point between direct iteration and using the Arrow kernel is 32-40 list elements:

  ┌───────────┬──────────┬──────────┬─────────────────────┐
  │ List size │  Scalar  │  Kernel  │  Kernel vs Scalar   │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 8         │ 54.8 µs  │ 172.7 µs │ scalar 3.2x faster  │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 16        │ 105.3 µs │ 188.1 µs │ scalar 1.8x faster  │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 32        │ 232.5 µs │ 253.2 µs │ scalar 1.09x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 48        │ 362.6 µs │ 329.6 µs │ kernel 1.10x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 64        │ 492.8 µs │ 444.2 µs │ kernel 1.11x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 96        │ 761.7 µs │ 589.0 µs │ kernel 1.29x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 128       │ 1.032 ms │ 782.0 µs │ kernel 1.32x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 256       │ 2.076 ms │ 1.428 ms │ kernel 1.45x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 512       │ 4.138 ms │ 2.728 ms │ kernel 1.52x faster │
  └───────────┴──────────┴──────────┴─────────────────────┘

So I lowered the iteration -> kernel switchover threshold to 32.
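A minimal sketch of what such a length-based switchover could look like. The constant name `KERNEL_THRESHOLD`, the `Path` enum, and the `kernel_min` stand-in are hypothetical; whether the real comparison is inclusive at exactly 32 elements is also an assumption.

```rust
/// Hypothetical threshold mirroring the switchover value mentioned above.
const KERNEL_THRESHOLD: usize = 32;

/// Which code path handled a row (only here to make the dispatch observable).
#[derive(Debug, PartialEq)]
enum Path {
    Scalar,
    Kernel,
}

/// Direct iteration: cheap to start, so it suits short rows.
fn scalar_min(values: &[i64]) -> Option<i64> {
    let mut it = values.iter().copied();
    let first = it.next()?;
    Some(it.fold(first, |acc, v| if v < acc { v } else { acc }))
}

/// Stand-in for a vectorized kernel that has some fixed per-call overhead
/// but higher throughput on long inputs. Here it is just an iterator min.
fn kernel_min(values: &[i64]) -> Option<i64> {
    values.iter().copied().min()
}

/// Pick the path per row based on the row length.
fn row_min(values: &[i64]) -> (Option<i64>, Path) {
    if values.len() < KERNEL_THRESHOLD {
        (scalar_min(values), Path::Scalar)
    } else {
        (kernel_min(values), Path::Kernel)
    }
}
```

The threshold is hardware-dependent: the measurements above come from a single machine, so a different CPU could plausibly move the crossover.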

coderfender · 2026-03-22T21:33:49Z

These are great numbers, @neilconway! Could we perhaps also remove the `if` conditions and see whether that helps? For example:

1. A separate implementation for non-null arrays (to avoid `if` checks in the inner loop)
2. The `if` checks against ARROW_COMPUTE_THRESHOLD in the hot loop
3. The min/max check (separate max vs. min implementations)

neilconway · 2026-03-23T00:51:27Z

@coderfender Thanks for the feedback!

I quickly checked 1 and 3 and they don't yield any improvement; I'd suspect the compiler will hoist loop-invariant branches like this out of the loop. The threshold check should be similar: it should be branch-predicted effectively.

Lmk if you disagree!
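For reference, a sketch of the two shapes being compared for the min/max check. Both function names are hypothetical. Whether the optimizer actually hoists the branch in the first version depends on the compiler and optimization level, so this only shows that the two forms are equivalent in behavior, not that they perform the same.

```rust
/// One implementation with a runtime flag. `is_max` does not change inside
/// the loop, so an optimizer may hoist the branch out of it.
fn extreme_runtime(values: &[i64], is_max: bool) -> Option<i64> {
    let mut it = values.iter().copied();
    let mut acc = it.next()?;
    for v in it {
        if is_max {
            if v > acc {
                acc = v;
            }
        } else if v < acc {
            acc = v;
        }
    }
    Some(acc)
}

/// Explicitly separate min and max code via a const generic parameter:
/// each value of `IS_MAX` is compiled as its own copy of the function.
fn extreme_const<const IS_MAX: bool>(values: &[i64]) -> Option<i64> {
    let mut it = values.iter().copied();
    let mut acc = it.next()?;
    for v in it {
        if (IS_MAX && v > acc) || (!IS_MAX && v < acc) {
            acc = v;
        }
    }
    Some(acc)
}
```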

neilconway added 2 commits March 22, 2026 10:26

Add benchmark for array_max

f0ec26f

.

53ab1bd

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Mar 22, 2026

neilconway added 2 commits March 22, 2026 12:32

Benchmark on 8192 row batches, for better fidelity

2f68986

Lower iteration -> kernel threshold from 64 -> 32

4153eb1

Merge remote-tracking branch 'origin/main' into neilc/optimize-array-…

640ce56

…min-max

neilconway wants to merge 5 commits into apache:main from neilconway:neilc/optimize-array-min-max
