perf: Optimize array_min, array_max for arrays of primitive types #21101

Open
neilconway wants to merge 5 commits into apache:main from neilconway:neilc/optimize-array-min-max

@neilconway
Contributor

@neilconway neilconway commented Mar 22, 2026

Which issue does this PR close?

Rationale for this change

In the current implementation, we construct a PrimitiveArray for each row, feed it to the Arrow min / max kernel, and then collect the resulting ScalarValues in a Vec. We then construct a final PrimitiveArray for the result via ScalarValue::iter_to_array of the Vec.

We can do better for ListArrays of primitive types. First, we can iterate directly over the flat values buffer of the ListArray for the batch and compute the min/max from each row's slice directly. Second, Arrow's min / max kernels have a reasonable amount of per-call overhead; for small arrays, it is more efficient to compute the min/max ourselves via direct iteration.
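The first idea above, reading each row's slice straight out of the flat values buffer, can be sketched without any Arrow dependency. This is a minimal illustration, not the PR's code: `list_max`, the `usize` offsets, and the `&[bool]` validity slice are hypothetical stand-ins for a ListArray's offsets buffer, values buffer, and null bitmap, and returning `None` for an empty or all-null row is an assumption.

```rust
/// Per-row maximum over a list column stored in flattened form.
///
/// Hypothetical layout: row `i` occupies `values[offsets[i]..offsets[i + 1]]`,
/// and `validity[j] == false` marks element `j` as null.
fn list_max(
    offsets: &[usize],
    values: &[i64],
    validity: Option<&[bool]>,
) -> Vec<Option<i64>> {
    offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0], w[1]);
            match validity {
                // No nulls: scan the row's slice directly, no per-row array is built.
                None => values[start..end].iter().copied().max(),
                // Nulls present: skip elements whose validity flag is false.
                Some(valid) => values[start..end]
                    .iter()
                    .zip(&valid[start..end])
                    .filter(|(_, ok)| **ok)
                    .map(|(v, _)| *v)
                    .max(),
            }
        })
        .collect()
}
```

With offsets `[0, 3, 3, 5]` and values `[4, 9, 1, 7, 2]`, the three rows are `[4, 9, 1]`, `[]`, and `[7, 2]`. The point of the sketch is that no intermediate array or scalar is allocated per row; only the final output vector is built.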

Benchmarks (8192 rows, arrays of int64 values, M4 Max):

  • no_nulls / list_size=10: 309 µs → 26.6 µs (11.6x faster)
  • no_nulls / list_size=100: 392 µs → 150 µs (2.6x faster)
  • no_nulls / list_size=1000: 1.20 ms → 951 µs (1.26x faster)
  • nulls / list_size=10: 385 µs → 69.0 µs (5.6x faster)
  • nulls / list_size=100: 790 µs → 616 µs (1.28x faster)
  • nulls / list_size=1000: 5.34 ms → 5.21 ms (1.02x faster)

What changes are included in this PR?

  • Add benchmark for array_max
  • Expand SLT test coverage
  • Implement optimization

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels on Mar 22, 2026
@neilconway
Contributor Author

We could add a similar fast path for arrays of strings, although it may not be worth it because array_min / array_max on arrays of strings is not particularly common.

@neilconway
Contributor Author

On an M4 Max, it looks like the crossover point between direct iteration and using the Arrow kernel is 32-40 list elements:

  ┌───────────┬──────────┬──────────┬─────────────────────┐
  │ List size │  Scalar  │  Kernel  │  Kernel vs Scalar   │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 8         │ 54.8 µs  │ 172.7 µs │ scalar 3.2x faster  │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 16        │ 105.3 µs │ 188.1 µs │ scalar 1.8x faster  │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 32        │ 232.5 µs │ 253.2 µs │ scalar 1.09x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 48        │ 362.6 µs │ 329.6 µs │ kernel 1.10x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 64        │ 492.8 µs │ 444.2 µs │ kernel 1.11x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 96        │ 761.7 µs │ 589.0 µs │ kernel 1.29x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 128       │ 1.032 ms │ 782.0 µs │ kernel 1.32x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 256       │ 2.076 ms │ 1.428 ms │ kernel 1.45x faster │
  ├───────────┼──────────┼──────────┼─────────────────────┤
  │ 512       │ 4.138 ms │ 2.728 ms │ kernel 1.52x faster │
  └───────────┴──────────┴──────────┴─────────────────────┘

So I lowered the iteration -> kernel switchover threshold to 32.
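The switchover described above amounts to a per-row length check. The sketch below is an illustration under stated assumptions, not the PR's implementation: the constant name `ARROW_COMPUTE_THRESHOLD` comes from the review discussion and the value 32 from the comment above, but whether the real comparison is `<` or `<=` is an assumption, and `max_kernel_stand_in` is a hypothetical placeholder for the Arrow kernel (a lane-wise reduction written in plain Rust).

```rust
/// Rows at or below this length use direct iteration (assumed `<=`).
const ARROW_COMPUTE_THRESHOLD: usize = 32;

/// Direct iteration: almost no setup cost, so it wins on short rows.
fn max_scalar(row: &[i64]) -> Option<i64> {
    let mut iter = row.iter().copied();
    let first = iter.next()?;
    Some(iter.fold(first, |acc, v| if v > acc { v } else { acc }))
}

/// Hypothetical stand-in for the Arrow max kernel: an 8-lane reduction
/// that is friendlier to vectorization but has more fixed overhead.
fn max_kernel_stand_in(row: &[i64]) -> Option<i64> {
    if row.is_empty() {
        return None;
    }
    let mut lanes = [i64::MIN; 8];
    let chunks = row.chunks_exact(8);
    let rem = chunks.remainder();
    for chunk in chunks {
        for (lane, v) in lanes.iter_mut().zip(chunk) {
            *lane = (*lane).max(*v);
        }
    }
    // Combine the lanes, then fold in the elements that did not fill a chunk.
    let mut best = lanes.iter().copied().fold(i64::MIN, i64::max);
    for v in rem {
        best = best.max(*v);
    }
    Some(best)
}

/// Choose a strategy per row based on its length.
fn row_max(row: &[i64]) -> Option<i64> {
    if row.len() <= ARROW_COMPUTE_THRESHOLD {
        max_scalar(row)
    } else {
        max_kernel_stand_in(row)
    }
}
```

Both paths must return the same answer for every input; only their cost profile differs, which is why the threshold can be tuned from benchmark data alone.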

@coderfender
Contributor

These are great numbers, @neilconway! Could we also try removing the if conditions and see whether that helps? For example:

  1. A separate implementation for non-null arrays (to avoid if checks on every iteration of the inner loop)
  2. The ARROW_COMPUTE_THRESHOLD if check in the hot loop
  3. The min/max check (separate max and min implementations)

@neilconway
Contributor Author

@coderfender Thanks for the feedback!

I quickly checked 1 and 3 and they don't yield any improvement; I'd suspect the compiler will hoist loop-invariant branches like this out of the loop. The threshold check should be similar: it should be branch-predicted effectively.
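To make the loop-invariant branch argument concrete, here is a small sketch comparing a single function that checks a min/max flag inside the loop with two hand-specialized functions. The names `extreme`, `max_only`, and `min_only` are hypothetical and are not taken from the PR. Whether the optimizer actually hoists the branch (loop unswitching) depends on the compiler version and optimization level, so this shows only that the two forms are equivalent, not how they compile.

```rust
/// One implementation for both min and max. `is_max` never changes inside
/// the loop, so the branch on it is loop-invariant and is a candidate for
/// the optimizer to hoist out of the loop.
fn extreme(row: &[i64], is_max: bool) -> Option<i64> {
    let mut iter = row.iter().copied();
    let mut best = iter.next()?;
    for v in iter {
        let better = if is_max { v > best } else { v < best };
        if better {
            best = v;
        }
    }
    Some(best)
}

/// Hand-specialized max: what the loop looks like once the flag is fixed.
fn max_only(row: &[i64]) -> Option<i64> {
    let mut iter = row.iter().copied();
    let mut best = iter.next()?;
    for v in iter {
        if v > best {
            best = v;
        }
    }
    Some(best)
}

/// Hand-specialized min.
fn min_only(row: &[i64]) -> Option<i64> {
    let mut iter = row.iter().copied();
    let mut best = iter.next()?;
    for v in iter {
        if v < best {
            best = v;
        }
    }
    Some(best)
}
```

If the compiler unswitches the loop in `extreme`, the generated code is effectively `max_only` or `min_only`, which would be consistent with the observation that manual specialization gave no measurable gain.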

Lmk if you disagree!

Development

Successfully merging this pull request may close these issues.

Optimize array_min, array_max for primitive types

2 participants