[SPARK-56916][SQL] Simplify ElementAt array codegen under ANSI mode #55941
gengliangwang wants to merge 1 commit
### What changes were proposed in this pull request?

Introduce `ArrayUtils.java` with a single helper `elementAtIndexExact(int length, int index, QueryContext context)` and use it from `ElementAt`'s `ArrayType` branch in both `doGenCode` and `doElementAt` (eval).

The helper normalizes a 1-based `element_at` index against the array length and returns the 0-based position, throwing `invalidElementAtIndexError` for out-of-bound and `invalidIndexOfZeroError` for zero index. The caller still emits the type-specific `arr.get(pos, dataType)` (not the helper, since the return type depends on the array element type).

The non-ANSI branch is left inline because it can choose between `defaultValueOutOfBound` (an `Option[Expression]` that requires codegen access) or `null`.

### Why are the changes needed?

Part of SPARK-56908 (umbrella). The ANSI `ElementAt` codegen body was the largest single inline body in `collectionOperations.scala` -- the helper collapses ~12 lines to ~3 per call site.

### Does this PR introduce _any_ user-facing change?

No. The compiled behavior is identical; only the emitted Java source text changes.

### How was this patch tested?

```
build/sbt "catalyst/testOnly *CollectionExpressionsSuite"
```

59/59 pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x
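For readers who want to see the index rules concretely, the helper's contract as described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the class name `ElementAtIndexSketch` is hypothetical, the `QueryContext` parameter is omitted, and plain JDK exceptions stand in for Spark's `invalidElementAtIndexError` and `invalidIndexOfZeroError`.

```java
public final class ElementAtIndexSketch {
    private ElementAtIndexSketch() {}

    /**
     * Maps a 1-based element_at index (negative values count from the end)
     * to a 0-based array position, or throws if the index is invalid.
     * Assumption: JDK exceptions stand in for Spark's error helpers.
     */
    public static int elementAtIndexExact(int length, int index) {
        // Widen to long so Math.abs(Integer.MIN_VALUE) cannot overflow.
        if (length < Math.abs((long) index)) {
            // Stand-in for invalidElementAtIndexError.
            throw new ArrayIndexOutOfBoundsException(
                "index " + index + " is out of bounds for length " + length);
        }
        if (index == 0) {
            // Stand-in for invalidIndexOfZeroError.
            throw new IllegalArgumentException("index 0 is invalid; indices start at 1");
        }
        // Positive indices are 1-based; negative indices count back from the end.
        return index > 0 ? index - 1 : length + index;
    }

    public static void main(String[] args) {
        System.out.println(elementAtIndexExact(3, 1));   // prints 0
        System.out.println(elementAtIndexExact(3, -1));  // prints 2
    }
}
```

Because the sketch returns only an `int` position, it stays independent of the element type; the caller would still perform the type-specific read.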
Stack overview (SPARK-56908 umbrella)
This PR is part of a stack of 8 PRs against SPARK-56908. Order:
PRs 1-4 are linearly stacked on each other (each branch is based on the previous one). PR 5 (decimal arithmetic) is stacked on top of PR 3 (cast decimal) since it uses |
cloud-fan left a comment
Summary
Prior state and problem. `ElementAt.doGenCode` for ANSI mode contained ~12 lines of inline codegen for index validation (length check, zero-index check, sign normalization), with the same logic duplicated in Scala in `doElementAt` (eval). Per the SPARK-56908 umbrella, this was the largest single inline body in `collectionOperations.scala`.
Design approach. Extract the ANSI-mode validation into a Java static helper `ArrayUtils.elementAtIndexExact(int length, int index, QueryContext)` and call it from both eval and codegen. Each method now splits `case _: ArrayType` into a `failOnError` branch (uses the helper) and a non-`failOnError` branch (kept inline; only the ANSI branch is unified in this PR).
Key design decisions. The helper returns the validated 0-based `int`; the type-specific `arr.get(pos, dataType)` remains at the call site so the helper stays independent of element type.
Implementation sketch.
- New file `ArrayUtils.java` in `org.apache.spark.sql.catalyst.expressions` with a single static method.
- `ElementAt.doElementAt` and `ElementAt.doGenCode` each gain a new `case _: ArrayType if failOnError =>` branch.
Behavior verified case-by-case against pre-PR (OOB, zero index, negative index, empty array). Codegen scaffolding (nullCheck) is identical to pre-PR.
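As a rough illustration of the split described above, the sketch below puts an ANSI branch (delegating to a helper) next to a non-ANSI branch (kept inline) and leaves the typed read at the call site. Everything here is hypothetical: `ElementAtCallSiteSketch`, the use of `java.util.List` in place of Spark's array data, the `defaultOutOfBound` parameter, and the JDK exceptions are stand-ins. That the non-ANSI branch also rejects index 0 is an assumption not stated in this PR.

```java
import java.util.Arrays;
import java.util.List;

public final class ElementAtCallSiteSketch {
    private ElementAtCallSiteSketch() {}

    // Compact stand-in for the ANSI helper: 1-based index in, 0-based position out.
    public static int elementAtIndexExact(int length, int index) {
        if (length < Math.abs((long) index)) {
            throw new ArrayIndexOutOfBoundsException("index " + index + ", length " + length);
        }
        if (index == 0) {
            throw new IllegalArgumentException("index 0 is invalid; indices start at 1");
        }
        return index > 0 ? index - 1 : length + index;
    }

    // Hypothetical call site: the element type T is only known here, not in the helper.
    public static <T> T elementAt(List<T> arr, int index, boolean failOnError,
                                  T defaultOutOfBound) {
        if (failOnError) {
            // ANSI branch: validation and normalization collapse into one call.
            int pos = elementAtIndexExact(arr.size(), index);
            return arr.get(pos);
        }
        // Non-ANSI branch stays inline because it must pick a fallback value.
        if (arr.size() < Math.abs((long) index)) {
            return defaultOutOfBound; // may be null
        }
        if (index == 0) {
            // Assumption: zero is rejected in this branch as well.
            throw new IllegalArgumentException("index 0 is invalid; indices start at 1");
        }
        return arr.get(index > 0 ? index - 1 : arr.size() + index);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a", "b", "c");
        System.out.println(elementAt(names, -1, true, null));  // prints c
        System.out.println(elementAt(names, 5, false, null));  // prints null
    }
}
```

The four cases listed above (out of bounds, zero index, negative index, empty array) each map onto one branch of this sketch, which makes a case-by-case comparison straightforward.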
LGTM with two minor nits inline.
     * of inline length / zero / sign-normalization codegen with a return of
     * the normalized array position (0-based).
     */
    public final class ArrayUtils {
The stack uses per-operation naming (CastUtils, ArithmeticUtils, DateTimeConstructorUtils). ArrayUtils is broader than its single element_at-specific helper, and there's already an ArrayExpressionUtils.java in the same package that serves array-expression helpers. Risk: future readers won't know which utility class to look in, and ArrayUtils becomes a magnet for unrelated array helpers.
Consider renaming to ElementAtUtils (matches DateTimeConstructorUtils-style per-operation naming), or folding elementAtIndexExact into the existing ArrayExpressionUtils. WDYT?
     * {@link ElementAt} on {@code ArrayType}: a single call replaces ~12 lines
     * of inline length / zero / sign-normalization codegen with a return of
     * the normalized array position (0-based).
The "a single call replaces ~12 lines ..." clause describes the PR's effect rather than the helper's contract: once merged, the original 12-line inline form isn't visible to future readers. Peer `CastUtils.java` doesn't include similar line-count claims.
Suggested change:

    - * {@link ElementAt} on {@code ArrayType}: a single call replaces ~12 lines
    - * of inline length / zero / sign-normalization codegen with a return of
    - * the normalized array position (0-based).
    + * {@link ElementAt} on {@code ArrayType}.
Title: [SPARK-56916][SQL] Simplify ElementAt array codegen under ANSI mode
Base: master (independent)
Head: gengliangwang:SPARK-56916-element-at