feat: add Java scalar UDF support #46

Draft
andygrove wants to merge 13 commits into apache:main from andygrove:feat/scalar-udf

andygrove commented May 14, 2026

Which issue does this PR close?

N/A

Rationale for this change

Add support for implementing scalar UDFs in Java.

What changes are included in this PR?

Public Java API

  • ScalarUdf @FunctionalInterface with one method FieldVector evaluate(BufferAllocator allocator, List<FieldVector> args)
  • Volatility enum (IMMUTABLE / STABLE / VOLATILE)
  • SessionContext.registerUdf(name, udf, returnType, argTypes, volatility)
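The Volatility value has to reach the native side in some form, and the commit log mentions a volatility_from_byte helper. A minimal sketch of what such a helper might look like, assuming the Java enum is passed as its ordinal (0, 1, 2) in a JNI byte; the encoding, the local `Volatility` enum, and the `String` error type are all stand-ins for illustration, not the PR's actual code:

```rust
/// Local stand-in for DataFusion's volatility enum (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Volatility {
    Immutable,
    Stable,
    Volatile,
}

/// Decode a byte received over JNI into a volatility value.
/// JNI's `jbyte` is a signed 8-bit integer, hence `i8`.
/// ASSUMPTION: the Java side sends the enum ordinal (0, 1, 2).
fn volatility_from_byte(b: i8) -> Result<Volatility, String> {
    match b {
        0 => Ok(Volatility::Immutable),
        1 => Ok(Volatility::Stable),
        2 => Ok(Volatility::Volatile),
        // Reject anything else rather than silently picking a default.
        other => Err(format!("unknown volatility byte: {other}")),
    }
}
```

Returning an error for unknown bytes keeps a mismatch between the Java enum and the native decoder visible at registration time instead of at query time.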

Internals

  • org.apache.datafusion.internal.JniBridge — per-call static trampoline that imports the input columns, calls user code, validates (non-null + row count), and exports the result via the Arrow C Data Interface
  • native/src/udf.rs: JavaScalarUdf implementing DataFusion's ScalarUDFImpl. Holds GlobalRefs to the user instance + bridge class and a cached JStaticMethodID; constructs an FFI struct-array view of the args, attaches the current thread to JNI, calls the bridge, translates any pending Java exception into a DataFusionError::Execution, imports the result via arrow::ffi::from_ffi, and validates the type matches the declared return type
  • JNI_OnLoad in native/src/lib.rs caches the JavaVM* in a OnceLock so DataFusion's worker threads can attach
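The OnceLock caching pattern from the last bullet can be sketched with the standard library alone. This is an illustration under stated assumptions: `VmHandle`, `cache_vm`, and `cached_vm` are hypothetical names, and `VmHandle` wraps a plain integer instead of the real `JavaVM*` so the sketch compiles without the JNI bindings.

```rust
use std::sync::OnceLock;

/// Stand-in for the `JavaVM*` handle (hypothetical; the real type comes
/// from the JNI bindings and is not reproduced here).
#[derive(Debug)]
struct VmHandle(usize);

/// Process-wide slot, written once at library load and read by any thread.
static JAVA_VM: OnceLock<VmHandle> = OnceLock::new();

/// Plays the role of JNI_OnLoad: store the handle exactly once.
/// Returns false if a handle was already cached (the first value is kept).
fn cache_vm(raw: usize) -> bool {
    JAVA_VM.set(VmHandle(raw)).is_ok()
}

/// Called from a worker thread before it attaches to the JVM.
fn cached_vm() -> Result<&'static VmHandle, String> {
    JAVA_VM
        .get()
        .ok_or_else(|| "JavaVM not cached; was the library loaded via JNI?".to_string())
}
```

Because `OnceLock::get` returns a `&'static` reference for a static, worker threads can read the handle without locking after the first write.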

Refinements from the design discussion

  • Args ride a single FFI struct pair (FFI_ArrowArray + FFI_ArrowSchema for a struct-array view of the columns), not an ArrowArrayStream — v1 always sees one batch per invoke and the simpler ABI avoids streaming overhead.
  • The per-call allocator is a single shared static RootAllocator on JniBridge rather than a fresh-per-call one — closing a per-call allocator while an FFI release callback still holds buffer references would throw.
  • StructArray::try_new_with_length(... , number_rows) is used for the args export so a zero-argument UDF doesn't panic.
  • number_rows is converted from usize to i32 with a checked i32::try_from before crossing JNI; a batch larger than i32::MAX would otherwise silently truncate and miscompare with Java's row-count check.
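The checked row-count conversion in the last bullet might look roughly like the following. The function name and the `String` error are placeholders for this sketch (the PR itself reports failures as DataFusion errors); only `i32::try_from` is taken from the commit log.

```rust
/// Convert a batch row count into the i32 that is passed across JNI.
/// Returns an error instead of truncating when the batch is too large.
/// (Hypothetical helper; the error type is a stand-in.)
fn row_count_for_jni(number_rows: usize) -> Result<i32, String> {
    i32::try_from(number_rows).map_err(|_| {
        format!("batch of {number_rows} rows exceeds i32::MAX and cannot cross JNI")
    })
}

/// The unchecked alternative, shown only for contrast: `as` wraps around,
/// so an oversized count becomes a wrong (possibly negative) value.
fn row_count_truncating(number_rows: usize) -> i32 {
    number_rows as i32
}
```

With the `as` cast, a count one past `i32::MAX` wraps to `i32::MIN`, which is the silent miscompare the bullet describes; the checked version turns the same input into an error.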

Docs + example

  • docs/source/user-guide/scalar-udf.md (linked from the user-guide toctree)
  • examples/src/main/java/org/apache/datafusion/examples/AddOneExample.java

How are these changes tested?

make test from a clean checkout. The new ScalarUdfTest has 12 tests covering:

  • Happy paths: add_one(Int32), concat(Utf8, Utf8), square(Float64), repeated invocations in one session, a 100-row VALUES scan
  • Contract violations: UDF returning null, wrong row count, wrong type, and a UDF that throws IllegalArgumentException — all surface as RuntimeExceptions with the expected message substrings (class name + user message preserved)
  • Lifecycle: two UDFs registered in the same session, register-after-close in a new session
  • Volatility: all three Volatility values round-trip through registration and execute correctly
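The contract checks exercised by these tests (non-null result, matching row count, matching type) can be summarised in one place. In the PR the null and row-count checks run in Java and the type check runs in Rust; the sketch below folds them into a single Rust function purely for illustration. `ColumnType`, `ResultColumn`, `check_udf_result`, and the error messages are all hypothetical stand-ins, not the PR's types or wording.

```rust
/// Stand-in for an Arrow data type (hypothetical, not arrow's DataType).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnType {
    Int32,
    Float64,
    Utf8,
}

/// Stand-in for the column a UDF hands back.
#[derive(Debug)]
struct ResultColumn {
    column_type: ColumnType,
    rows: usize,
}

/// Apply the three contract checks in order: null, row count, type.
fn check_udf_result(
    udf_name: &str,
    result: Option<&ResultColumn>,
    declared: ColumnType,
    expected_rows: usize,
) -> Result<(), String> {
    // 1. The UDF must return a column at all.
    let column = result.ok_or_else(|| format!("UDF '{udf_name}' returned null"))?;
    // 2. One output row per input row.
    if column.rows != expected_rows {
        return Err(format!(
            "UDF '{udf_name}' returned {} rows, expected {expected_rows}",
            column.rows
        ));
    }
    // 3. The output type must match the type declared at registration.
    if column.column_type != declared {
        return Err(format!(
            "UDF '{udf_name}' returned {:?} but declared {:?}",
            column.column_type, declared
        ));
    }
    Ok(())
}
```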

andygrove added 13 commits May 13, 2026 17:06
Introduces JavaScalarUdf struct with its Signature, return_type, JNI
references, and a ScalarUDFImpl impl whose invoke_with_args returns
NotImplemented (placeholder for Task 6). Adds volatility_from_byte
helper. Dead-code lints suppressed with comments referencing Task 5/6.
Add end-to-end UDF registration: SessionContext.registerUdf serialises
the return/arg types as Arrow IPC, passes them to native via
registerScalarUdf JNI, which decodes the schema, constructs a
JavaScalarUdf with an exact Signature, and registers it on the
DataFusion SessionContext. SQL planning now resolves registered UDFs by
name; invocation still returns NotImplemented until Task 6.
Replace the NotImplemented stub with the real per-call data flow: materialise
ColumnarValue args to arrays, pack them into a StructArray, export via to_ffi,
call JniBridge.invokeScalarUdf over JNI with four FFI addresses, then import
the result via from_ffi and return it as a ColumnarValue::Array.

Update ScalarUdfTest to assert correct values from an AddOne UDF registered
against a three-row constant table.
… and zero-arg case

- Clear pending JNI exceptions in jthrowable_to_string fallback branches to
  avoid undefined behavior on thread detach
- Add SAFETY comment on the from_ffi unsafe block explaining FFI initialization guarantees
- Replace usize-to-i32 silent truncation with i32::try_from checked conversion for row count
- Replace StructArray::new with StructArray::try_new_with_length to handle zero-arg UDFs