fix(udf): declare scalar UDF signatures with Field, not ArrowType #59

Open
LantaoJin wants to merge 2 commits into apache:main from LantaoJin:fix/udf-nested-arrow-types

LantaoJin commented May 18, 2026

Which issue does this PR close?

Rationale for this change

ScalarFunction.argTypes() returned List<ArrowType> and returnType() returned ArrowType. In Java Arrow, ArrowType is a leaf marker for the type kind: it is self-describing for primitives like Int32 or Float64, but for nested types (List, Struct, Map, FixedSizeList) the element / member / key / value types live on the parent Field's children list, not inside ArrowType. ArrowType.List is literally a no-field marker class.

A Java UDF author therefore had no way to declare a typed nested signature. Trying argTypes() = List.of(new ArrowType.List()) blew up at registration time:

IllegalArgumentException: Lists have one child Field. Found: none
  at SessionContext.serializeSchemaIpc(SessionContext.java:398)
  at SessionContext.registerUdf(SessionContext.java:391)

This blocked the entire family of nested-type UDFs that exist as built-ins in DataFusion's datafusion-functions-nested crate (array_length, cardinality, array_has, array_position, flatten, map_keys, map_values, arrays_zip, ...). Anyone porting Spark UDFs over ArrayType / StructType / MapType columns to DataFusion-Java hit this on the first attempt.

The Rust API does not have this problem because DataType::List(Arc<Field>) carries the child field inline. Switching the Java interface from ArrowType to Field is the structural mirror: Field is the only type that can carry children, so it's the type the interface has always needed to use.
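
To make the structural point concrete, here is a toy model in plain Java with no Arrow dependency. `Kind`, `TypedField`, `requireListChild`, and `describe` are hypothetical names invented for this sketch, not Arrow or DataFusion classes; `Kind` only plays the role of ArrowType (a marker with no children) and `TypedField` the role of Field (the one place children can live). The check imitates the registration failure quoted above under that assumption.

```java
import java.util.List;

// Toy model only: none of these types are Arrow or DataFusion classes.
public class LeafMarkerSketch {
    // Plays the role of ArrowType: names the kind of type, carries no children.
    enum Kind { INT32, LIST }

    // Plays the role of Field: a name, a kind, and child fields for nested kinds.
    static final class TypedField {
        final String name;
        final Kind kind;
        final List<TypedField> children;

        TypedField(String name, Kind kind, List<TypedField> children) {
            this.name = name;
            this.kind = kind;
            this.children = children;
        }

        // Convenience factory for a field without children.
        static TypedField leaf(String name, Kind kind) {
            return new TypedField(name, kind, List.of());
        }
    }

    // Imitates the registration-time check: a list needs exactly one child field.
    static TypedField requireListChild(TypedField field) {
        if (field.kind == Kind.LIST && field.children.size() != 1) {
            String found = field.children.isEmpty()
                ? "none"
                : String.valueOf(field.children.size());
            throw new IllegalArgumentException(
                "Lists have one child Field. Found: " + found);
        }
        return field;
    }

    // Renders the full type, which is only possible when the child is present.
    static String describe(TypedField field) {
        if (field.kind == Kind.LIST) {
            return "LIST<" + describe(requireListChild(field).children.get(0)) + ">";
        }
        return field.kind.name();
    }

    public static void main(String[] args) {
        TypedField typed = new TypedField(
            "vals", Kind.LIST, List.of(TypedField.leaf("item", Kind.INT32)));
        System.out.println(describe(typed)); // prints LIST<INT32>

        try {
            // A bare LIST marker has nowhere to record the element type.
            describe(TypedField.leaf("vals", Kind.LIST));
        } catch (IllegalArgumentException e) {
            // prints Lists have one child Field. Found: none
            System.out.println(e.getMessage());
        }
    }
}
```

A signature made only of `Kind` values could never distinguish a list of Int32 from a list of anything else, which is the gap the Field-based interface closes.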

What changes are included in this PR?

See commit log

Are these changes tested?

Yes -- 5 new tests over and above the existing 12.

Are there any user-facing changes?

Yes -- a source-breaking signature change to the public ScalarFunction interface. Existing primitive UDFs become slightly more verbose:

// Before:
public List<ArrowType> argTypes() { return List.of(INT32); }
public ArrowType returnType() { return INT32; }

// After:
public List<Field> argFields() { return List.of(Field.nullable("arg0", INT32)); }
public Field returnField() { return Field.nullable("return", INT32); }

Nested-type UDFs that were previously impossible to declare can now be expressed by attaching child fields. For example, a UDF over List<Int32> returning the list length:

// List<Int32> argument: the element type lives in the Field's children list.
public List<Field> argFields() {
  return List.of(
      new Field(
          "vals",
          FieldType.nullable(new ArrowType.List()),
          List.of(Field.nullable("item", INT32))));
}

public Field returnField() {
  return Field.nullable("len", INT32);
}

The same shape works for Struct<a: Int32, b: Int32> (children are the member fields) and Map<Utf8, Int32> (the single child is the entries struct, whose own children are the key and value fields).
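
As a rough sketch of those two shapes using Arrow Java's `Field`, `FieldType`, and `ArrowType` classes: the class name `NestedFieldSketch` and the field names `point` and `attrs` are made up for illustration, and the `entries` / `key` / `value` child names follow Arrow's usual map layout rather than anything this PR specifies. These helpers only build the Fields; wiring them into `argFields()` is left as in the List example above.

```java
import java.util.List;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

// Illustrative only: class and field names here are assumptions, not part of the PR.
public class NestedFieldSketch {
    private static final ArrowType INT32 = new ArrowType.Int(32, true);

    // Struct<a: Int32, b: Int32>: the children are the member fields.
    static Field structField() {
        return new Field(
            "point",
            FieldType.nullable(ArrowType.Struct.INSTANCE),
            List.of(Field.nullable("a", INT32), Field.nullable("b", INT32)));
    }

    // Map<Utf8, Int32>: one non-nullable "entries" struct child, which in turn
    // holds a non-nullable key field and a nullable value field.
    static Field mapField() {
        Field key = new Field(
            "key", FieldType.notNullable(ArrowType.Utf8.INSTANCE), null);
        Field value = Field.nullable("value", INT32);
        Field entries = new Field(
            "entries",
            FieldType.notNullable(ArrowType.Struct.INSTANCE),
            List.of(key, value));
        // keysSorted = false: no ordering guarantee on the map keys.
        return new Field(
            "attrs",
            FieldType.nullable(new ArrowType.Map(false)),
            List.of(entries));
    }

    public static void main(String[] args) {
        System.out.println(structField());
        System.out.println(mapField());
    }
}
```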

The repo is pre-release, which makes this the right time to tighten the interface before downstream callers accumulate.

LantaoJin added 2 commits May 18, 2026 08:13
Change ScalarFunction.argTypes() / returnType() (List<ArrowType> /
ArrowType) to argFields() / returnField() (List<Field> / Field).
SessionContext.registerUdf forwards the Fields straight through.
JavaScalarUdf stores the full return FieldRef and overrides
ScalarUDFImpl::return_field_from_args, so declared nullability and
metadata round-trip into the result schema.

ArrowType is a leaf marker in Java Arrow: ArrowType.List has no
fields, and child element / member / key / value types live on the
parent Field's children list. The previous registration code
reconstructed the schema with `new Field(..., FieldType.nullable(
type), null)`, dropping nested-type metadata; the previous Rust impl
only stored a DataType, so the default return_field_from_args wrapped
results in a fresh always-nullable Field. Both are fixed by storing
and forwarding the user's Fields verbatim.
…w-types

# Conflicts:
#	core/src/main/java/org/apache/datafusion/ScalarFunction.java
#	examples/src/main/java/org/apache/datafusion/examples/AddOneExample.java
#	native/src/udf.rs
@LantaoJin

@andygrove I resolved the conflicts from #64. Can you review this?

Development

Successfully merging this pull request may close these issues.

bug(udf): scalar UDFs over nested Arrow types cannot be registered
