feat: add Java scalar UDF support #46
Draft
andygrove wants to merge 13 commits into
Introduces JavaScalarUdf struct with its Signature, return_type, JNI references, and a ScalarUDFImpl impl whose invoke_with_args returns NotImplemented (placeholder for Task 6). Adds volatility_from_byte helper. Dead-code lints suppressed with comments referencing Task 5/6.
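A helper like `volatility_from_byte` could be sketched roughly as follows. This is a minimal illustration, not the PR's code: the local `Volatility` enum stands in for DataFusion's type so the snippet compiles on its own, and the 0/1/2 byte encoding and the `String` error type are assumptions. The `i8` parameter reflects that a JNI `jbyte` is a signed 8-bit value.

```rust
/// Local stand-in for DataFusion's `Volatility` enum, so the sketch
/// needs no external crates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Volatility {
    Immutable,
    Stable,
    Volatile,
}

/// Decode a byte passed over JNI into a `Volatility`.
/// The 0/1/2 encoding is assumed for illustration only.
fn volatility_from_byte(b: i8) -> Result<Volatility, String> {
    match b {
        0 => Ok(Volatility::Immutable),
        1 => Ok(Volatility::Stable),
        2 => Ok(Volatility::Volatile),
        // Reject anything else instead of silently picking a default.
        other => Err(format!("unknown volatility byte: {other}")),
    }
}
```

Returning an error for unknown bytes keeps a mismatch between the Java enum and the native side visible at registration time rather than at query time.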
Add end-to-end UDF registration: SessionContext.registerUdf serialises the return/arg types as Arrow IPC, passes them to native via registerScalarUdf JNI, which decodes the schema, constructs a JavaScalarUdf with an exact Signature, and registers it on the DataFusion SessionContext. SQL planning now resolves registered UDFs by name; invocation still returns NotImplemented until Task 6.
Replace the NotImplemented stub with the real per-call data flow: materialise ColumnarValue args to arrays, pack them into a StructArray, export via to_ffi, call JniBridge.invokeScalarUdf over JNI with four FFI addresses, then import the result via from_ffi and return it as a ColumnarValue::Array. Update ScalarUdfTest to assert correct values from an AddOne UDF registered against a three-row constant table.
… and zero-arg case
- Clear pending JNI exceptions in jthrowable_to_string fallback branches to avoid undefined behavior on thread detach
- Add SAFETY comment on the from_ffi unsafe block explaining FFI initialization guarantees
- Replace usize-to-i32 silent truncation with i32::try_from checked conversion for row count
- Replace StructArray::new with StructArray::try_new_with_length to handle zero-arg UDFs
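The checked row-count conversion mentioned above can be sketched like this. The function name `row_count_for_jni` and the `String` error type are hypothetical; the actual code presumably maps the failure into a DataFusion error instead.

```rust
/// Convert a batch row count to the `i32` that crosses JNI,
/// failing loudly instead of truncating.
/// `row_count_for_jni` is a hypothetical name used for illustration.
fn row_count_for_jni(number_rows: usize) -> Result<i32, String> {
    i32::try_from(number_rows).map_err(|_| {
        format!("batch has {number_rows} rows, which does not fit in an i32")
    })
}

/// What a plain `as` cast would do instead: the value wraps around,
/// so the first out-of-range count becomes `i32::MIN`.
fn truncating_cast(number_rows: usize) -> i32 {
    number_rows as i32
}
```

With an `as` cast the Java side would receive a negative or otherwise wrong count and report a confusing row-count mismatch, whereas the checked version stops before the call is made.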
Which issue does this PR close?
N/A
Rationale for this change
Add support for implementing scalar UDFs in Java.
What changes are included in this PR?
Public Java API
- `ScalarUdf`, a `@FunctionalInterface` with one method: `FieldVector evaluate(BufferAllocator allocator, List<FieldVector> args)`
- `Volatility` enum (`IMMUTABLE` / `STABLE` / `VOLATILE`)
- `SessionContext.registerUdf(name, udf, returnType, argTypes, volatility)`

Internals
- `org.apache.datafusion.internal.JniBridge`: per-call static trampoline that imports the input columns, calls user code, validates the result (non-null + row count), and exports it via the Arrow C Data Interface
- `native/src/udf.rs`: `JavaScalarUdf` implementing DataFusion's `ScalarUDFImpl`. Holds `GlobalRef`s to the user instance + bridge class and a cached `JStaticMethodID`; constructs an FFI struct-array view of the args, attaches the current thread to JNI, calls the bridge, translates any pending Java exception into a `DataFusionError::Execution`, imports the result via `arrow::ffi::from_ffi`, and validates that the type matches the declared return type
- `JNI_OnLoad` in `native/src/lib.rs` caches the `JavaVM*` in a `OnceLock` so DataFusion's worker threads can attach

Refinements from the design discussion
- `FFI_ArrowArray` + `FFI_ArrowSchema` (a struct-array view of the columns), not an `ArrowArrayStream`: v1 always sees one batch per invoke and the simpler ABI avoids streaming overhead.
- `RootAllocator` on `JniBridge` rather than a fresh per-call one: closing a per-call allocator while an FFI release callback still holds buffer references would throw.
- `StructArray::try_new_with_length(..., number_rows)` is used for the args export so a zero-argument UDF doesn't panic.
- `number_rows` goes through a checked `i32::try_from` conversion before crossing JNI: a batch larger than `i32::MAX` would otherwise silently truncate and miscompare against Java's row-count check.

Docs + example
- `docs/source/user-guide/scalar-udf.md` (linked from the user-guide toctree)
- `examples/src/main/java/org/apache/datafusion/examples/AddOneExample.java`

How are these changes tested?
`make test` from a clean checkout. The new `ScalarUdfTest` has 12 tests covering:
- `add_one(Int32)`, `concat(Utf8, Utf8)`, `square(Float64)`, repeated invocations in one session, a 100-row VALUES scan
- Failure cases, including `IllegalArgumentException`: all surface as `RuntimeException`s with the expected message substrings (class name + user message preserved)
- `Volatility` values round-trip through registration and execute correctly
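For readers unfamiliar with the `OnceLock` caching that the Internals section attributes to `JNI_OnLoad`, the pattern looks roughly like the sketch below. It uses only the standard library: `VmHandle` is a hypothetical stand-in for the real `JavaVM*`, and `cache_vm` / `cached_vm` are illustrative names, not functions from this PR.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the `JavaVM*` handed to `JNI_OnLoad`.
/// The address is held as a `usize` so the sketch needs no JNI crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct VmHandle(usize);

/// Process-wide slot, written once when the library is loaded.
static JAVA_VM: OnceLock<VmHandle> = OnceLock::new();

/// Store the handle. Returns `false` if a handle was already cached,
/// in which case the first value is kept.
fn cache_vm(vm: VmHandle) -> bool {
    JAVA_VM.set(vm).is_ok()
}

/// Fetch the cached handle from any thread, e.g. a worker thread
/// that needs to attach itself to the JVM before calling Java.
fn cached_vm() -> Result<VmHandle, String> {
    JAVA_VM
        .get()
        .copied()
        .ok_or_else(|| "JavaVM not cached; was JNI_OnLoad called?".to_string())
}
```

Because the slot is a `static`, threads that the JVM never created can still look up the handle, which is the property the worker-thread attach step relies on.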