feat(udf)!: switch ScalarFunction.evaluate to ColumnarValue API (closes #62) #64

Merged: andygrove merged 7 commits into apache:main from andygrove:feat/columnar-value-udf on May 18, 2026
Conversation

@andygrove
Member

Which issue does this PR close?

Closes #62.

Rationale for this change

DataFusion's Rust ScalarUDFImpl::invoke_with_args speaks ColumnarValue (Array or Scalar) rather than raw Arrow arrays. The Java binding previously materialised every scalar arg to a length-N array before crossing the JNI boundary, which lost the scalar-vs-array distinction and forced nullary UDFs to learn the batch row count by some out-of-band channel (the workaround proposed in PR #57).

Aligning the Java API with the Rust enum eliminates the workaround: a nullary UDF can return ColumnarValue.scalar(...) and the framework broadcasts it, and a UDF that takes literals sees them as Scalars without per-row duplication.

What changes are included in this PR?

  • New ColumnarValue sealed interface (Array/Scalar records, factory enforcing length-1 invariant on scalars).
  • New ScalarFunctionArgs record bundling List<ColumnarValue> and rowCount.
  • ScalarFunction.evaluate is now evaluate(BufferAllocator, ScalarFunctionArgs) -> ColumnarValue (source-breaking).
  • JniBridge.invokeScalarUdf rewritten to ship two struct arrays (length-N Array args + length-1 Scalar args) plus a byte[] argKinds positional mask, returning a byte indicating the result variant. JNI signature is now (Lorg/apache/datafusion/ScalarFunction;JJJJ[BJJI)B.
  • Native invoke_with_args no longer materialises scalars; it partitions args by ColumnarValue variant and reconstructs the result from the returned kind byte via ScalarValue::try_from_array.
  • AddOneExample and docs/source/user-guide/scalar-udf.md updated; new "Returning a Scalar" section added to the user guide.
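
The shape of the new types listed above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `Vec` is a hypothetical stand-in for an Arrow `FieldVector` so that the snippet compiles on its own, and the record layout is an assumption. Only the names `ColumnarValue.array`, `ColumnarValue.scalar`, `vector()`, `args()` and `rowCount` come from the description.

```java
import java.util.List;

// Hypothetical stand-in for an Arrow FieldVector: just a name and a length.
record Vec(String name, int length) {}

// Two-variant value mirroring the Array-or-Scalar distinction of the Rust enum.
sealed interface ColumnarValue {
    Vec vector();

    // A length-N column of values.
    record Array(Vec vector) implements ColumnarValue {}

    // A single value, carried as a length-1 vector.
    record Scalar(Vec vector) implements ColumnarValue {
        // Compact constructor enforcing the length-1 invariant.
        public Scalar {
            if (vector.length() != 1) {
                throw new IllegalArgumentException(
                    "scalar must wrap a length-1 vector, got length " + vector.length());
            }
        }
    }

    static ColumnarValue array(Vec v) {
        return new Array(v);
    }

    static ColumnarValue scalar(Vec v) {
        return new Scalar(v);
    }
}

// Bundles the positional arguments with the batch row count.
record ScalarFunctionArgs(List<ColumnarValue> args, int rowCount) {}
```

Putting the check in the record's compact constructor means that, in this sketch, a `Scalar` cannot be built around a longer vector by either the factory or a direct `new`.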

How are these changes tested?

make test — 135 tests pass (12 pre-existing skips). Existing ScalarUdfTest cases (AddOne, Concat, Square, error paths, volatility round-trip) adapted to the new signature, plus three new tests:

  • nullaryScalarReturnUdf_overMultiRowQuery_broadcasts — a nullary java_pi returns ColumnarValue.scalar(...) and the framework expands it across rows, replacing the rowCount workaround.
  • scalarLiteralArg_arrivesAsScalarColumnarValue — UDF asserts that a SQL literal arrives as ColumnarValue.Scalar (length 1), proving scalar-ness survives the FFI.
  • udfReturningScalar_isBroadcastByFramework — explicit scalar-return path test.

Also covered by cargo clippy --all-targets --workspace -- -D warnings (clean) and ./mvnw spotless:check (clean).

Are there any user-facing changes?

Yes: a source-breaking signature change to ScalarFunction.evaluate. Implementations must migrate from the old signature to the new one:

Before:

public FieldVector evaluate(BufferAllocator allocator, List<FieldVector> args) {
    IntVector in = (IntVector) args.get(0);
    // ...
    return out;
}

After:

public ColumnarValue evaluate(BufferAllocator allocator, ScalarFunctionArgs args) {
    IntVector in = (IntVector) args.args().get(0).vector();
    // ...
    return ColumnarValue.array(out);
}

Nullary or broadcast-style UDFs can return ColumnarValue.scalar(...) over a length-1 vector.
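
To illustrate the scalar-return path, here is a small self-contained sketch of a nullary UDF in the style of java_pi together with a broadcast step. Everything in it is a simplified model: `Value`, `Args`, `Udf` and `materialize` are hypothetical names, a `double` stands in for a length-1 Arrow vector, and the real framework's broadcast logic may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins: a double[] plays the role of an Arrow vector.
sealed interface Value {
    record Array(double[] data) implements Value {}
    record Scalar(double value) implements Value {}
}

// Positional arguments plus the batch row count.
record Args(List<Value> args, int rowCount) {}

interface Udf {
    Value evaluate(Args args);
}

final class BroadcastSketch {
    // Nullary UDF: returns a single value and never consults rowCount.
    static final Udf PI = args -> new Value.Scalar(Math.PI);

    // Hypothetical framework step: produce one entry per row from either variant.
    static double[] materialize(Value result, int rowCount) {
        if (result instanceof Value.Scalar s) {
            double[] out = new double[rowCount];
            Arrays.fill(out, s.value()); // broadcast the scalar across all rows
            return out;
        }
        double[] data = ((Value.Array) result).data();
        if (data.length != rowCount) {
            throw new IllegalStateException(
                "expected " + rowCount + " rows, got " + data.length);
        }
        return data;
    }
}
```

In this model the UDF itself stays independent of the batch size. Only the caller, which already knows the row count, expands the value.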

andygrove added 7 commits May 18, 2026 08:32
Replace evaluate(BufferAllocator, List<FieldVector>) -> FieldVector with
evaluate(BufferAllocator, ScalarFunctionArgs) -> ColumnarValue. Args now
carry per-arg Array-or-Scalar information plus the batch row count, and
the return distinguishes a length-N Array from a broadcast Scalar.

JniBridge.invokeScalarUdf now ships args as two struct arrays (length-N
for Array args, length-1 for Scalar args) plus a positional byte[]
argKinds, and returns a byte indicating the result variant. Existing
test UDFs (AddOne, Concat, Square, ReturnsNull, WrongRowCount,
WrongType, ThrowsIAE) and AddOneExample are updated to the new
signature. Native side will be updated in the next commit.

Refs apache#62.

JavaScalarUdf::invoke_with_args now partitions ScalarFunctionArgs::args
by ColumnarValue variant: array args travel through a length-N struct,
scalar args through a separate length-1 struct (one element each), and
a positional byte slice tells the Java side how to interleave them
back. The JNI signature of invokeScalarUdf becomes
(Lorg/apache/datafusion/ScalarFunction;JJJJ[BJJI)B; the returned byte
indicates Array (0) or Scalar (1) so the native side reconstructs the
right ColumnarValue variant via ScalarValue::try_from_array.
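
The partition-and-interleave scheme described in this commit message can be modelled in a few lines of Java. This is only a sketch under stated assumptions: `Arg`, `Partitioned`, `partition` and `interleave` are hypothetical names, the arguments are plain labels rather than Arrow struct children, and reusing the 0 = Array, 1 = Scalar encoding of the result byte for argKinds is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class ArgKindsSketch {
    // Assumed encoding, borrowed from the result byte: Array = 0, Scalar = 1.
    static final byte ARRAY = 0;
    static final byte SCALAR = 1;

    // Hypothetical argument: a label plus a flag saying whether it is a scalar.
    record Arg(String name, boolean isScalar) {}

    // Two groups (shipped separately) plus the positional mask.
    record Partitioned(List<Arg> arrays, List<Arg> scalars, byte[] argKinds) {}

    // Sender side: split the args by variant and record each position's kind.
    static Partitioned partition(List<Arg> args) {
        List<Arg> arrays = new ArrayList<>();
        List<Arg> scalars = new ArrayList<>();
        byte[] kinds = new byte[args.size()];
        for (int i = 0; i < args.size(); i++) {
            Arg a = args.get(i);
            if (a.isScalar()) {
                scalars.add(a);
                kinds[i] = SCALAR;
            } else {
                arrays.add(a);
                kinds[i] = ARRAY;
            }
        }
        return new Partitioned(arrays, scalars, kinds);
    }

    // Receiver side: walk the mask and pull from the matching group in order.
    static List<Arg> interleave(Partitioned p) {
        List<Arg> out = new ArrayList<>(p.argKinds().length);
        int nextArray = 0;
        int nextScalar = 0;
        for (byte kind : p.argKinds()) {
            if (kind == SCALAR) {
                out.add(p.scalars().get(nextScalar++));
            } else {
                out.add(p.arrays().get(nextArray++));
            }
        }
        return out;
    }
}
```

Because each group preserves the relative order of its members, the mask alone is enough to restore the original argument positions.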

Drops the prior scalar-materialisation step, which was the workaround
that PR apache#57 attempted to patch by passing rowCount; nullary UDFs that
broadcast a value now return ColumnarValue.Scalar instead.

Closes apache#62.

@andygrove
Member Author

@pgwhalen fyi


@pgwhalen pgwhalen left a comment

Makes sense to me! My bindings apparently solved/noticed this problem by passing the row count like the other PR. Unfortunately I missed that that was actually important in your PR and assumed that you just had a cleaner version.

All the more reason to emulate the rust API closely.

@andygrove
Member Author

> Makes sense to me! My bindings apparently solved/noticed this problem by passing the row count like the other PR. Unfortunately I missed that that was actually important in your PR and assumed that you just had a cleaner version.
>
> All the more reason to emulate the rust API closely.

Thanks for the review! After chatting with @mbutrovich about making a similar change in Comet, he pointed out that we may still need row count in some cases, like for uuid() which has no input arguments.

@andygrove andygrove merged commit 8e6ab54 into apache:main May 18, 2026
2 checks passed
@andygrove andygrove deleted the feat/columnar-value-udf branch May 18, 2026 16:32

Development

Successfully merging this pull request may close these issues.

Change UDF signature to use ColumnarValue rather than raw Arrow types
