@zhidongqu-db zhidongqu-db commented Feb 2, 2026

What changes were proposed in this pull request?

Implement a set of performance optimizations for the recently added vector aggregation functions (vector_avg and vector_sum).

  • Reuse the binary buffer in place: update the existing ByteBuffer instead of allocating a new one for each update/merge call
  • Hoist division out of the loop: compute invCount = 1.0f / newCount once before the loop instead of dividing per element
  • Hoist weight calculations out of the loop: compute leftWeight and rightWeight once before the loop instead of performing two divisions per element
  • Skip null checks when unnecessary: check ArrayType.containsNull at initialization and skip the per-element null check entirely when the array type cannot contain nulls
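
The four optimizations above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `VectorAvgSketch` and `FloatArrayView` are hypothetical names, the state is assumed to be a packed little-endian float buffer, and a vector containing any null element is assumed to be ignored.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical stand-in for an array value whose elements may be null. */
interface FloatArrayView {
    int size();
    boolean isNullAt(int i);
    float getFloat(int i);
}

/** Illustrative running-average state; names and layout are assumptions. */
final class VectorAvgSketch {
    private final ByteBuffer state;      // allocated once, then mutated in place
    private final int dim;
    private final boolean containsNull;  // decided once at initialization
    private long count = 0;

    VectorAvgSketch(int dim, boolean containsNull) {
        this.dim = dim;
        this.containsNull = containsNull;
        this.state = ByteBuffer.allocate(dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    }

    long count() { return count; }

    float get(int i) { return state.getFloat(i * Float.BYTES); }

    void update(FloatArrayView v) {
        if (v.size() != dim) throw new IllegalArgumentException("dimension mismatch");
        // Scan for nulls only when the element type allows them.
        // Assumption: a vector with any null element is skipped entirely.
        if (containsNull) {
            for (int i = 0; i < dim; i++) {
                if (v.isNullAt(i)) return;
            }
        }
        long newCount = count + 1;
        float invCount = 1.0f / newCount;  // single division, hoisted out of the loop
        for (int i = 0; i < dim; i++) {
            int off = i * Float.BYTES;
            float old = state.getFloat(off);
            // Incremental mean, written back into the same buffer.
            state.putFloat(off, old + (v.getFloat(i) - old) * invCount);
        }
        count = newCount;
    }

    void merge(VectorAvgSketch other) {
        if (other.dim != dim) throw new IllegalArgumentException("dimension mismatch");
        if (other.count == 0) return;
        long total = count + other.count;
        float leftWeight = (float) count / total;         // computed once per merge
        float rightWeight = (float) other.count / total;  // computed once per merge
        for (int i = 0; i < dim; i++) {
            int off = i * Float.BYTES;
            float merged = state.getFloat(off) * leftWeight
                         + other.state.getFloat(off) * rightWeight;
            state.putFloat(off, merged);
        }
        count = total;
    }

    /** Convenience wrapper for a float array with no null elements. */
    static FloatArrayView of(float... values) {
        return new FloatArrayView() {
            public int size() { return values.length; }
            public boolean isNullAt(int i) { return false; }
            public float getFloat(int i) { return values[i]; }
        };
    }
}
```

For instance, updating a two-dimensional state with (1, 3) and then (3, 5) leaves the average at (2, 4), and no buffer is allocated after construction. The absolute `getFloat`/`putFloat` accessors are used so the buffer's position never needs resetting between calls.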

Why are the changes needed?

The existing implementation can trigger excessive garbage collection because every update allocates a new binary buffer and discards the previous one. This is particularly costly when running large aggregations over high-dimensional vectors.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Covered by the existing tests.

Was this patch authored or co-authored using generative AI tooling?

Yes, code assistance with Claude Opus 4.5 in combination with manual editing by the author.

@zhidongqu-db changed the title from "Performance Optimizations for vector_avg/vector_sum" to "[SPARK-55318] Performance Optimizations for vector_avg/vector_sum" on Feb 2, 2026