Skip to content

Add Variant support and DataFusion functions#458

Merged
JingsongLi merged 3 commits into
apache:mainfrom
JingsongLi:codex/variant-datafusion-functions
Jul 5, 2026
Merged

Add Variant support and DataFusion functions#458
JingsongLi merged 3 commits into
apache:mainfrom
JingsongLi:codex/variant-datafusion-functions

Conversation

@JingsongLi

Copy link
Copy Markdown
Contributor

Summary

Adds Java-compatible VARIANT support to the Rust implementation and registers Spark-style Variant scalar functions in the DataFusion integration. This covers the end-to-end path from schema type handling and row/Arrow encoding to SQL JSON parsing and path extraction.

Changes

  • Add Paimon VARIANT type support, including schema JSON fixtures, Datum support, BinaryRow/BTree encoding, Arrow mapping, and row-format read/write handling for value + metadata buffers.
  • Add a Java-compatible Variant core module for JSON parsing, metadata dictionary construction, path access, JSON null handling, and scalar extraction/casting.
  • Register DataFusion scalar UDFs: parse_json, try_parse_json, variant_get, try_variant_get, and is_variant_null.
  • Document Variant table creation, inserts, queries, path syntax, supported cast targets, null semantics, raw SessionContext registration, and current limitations.

Testing

  • cargo test -p paimon variant --lib
  • cargo test -p paimon-datafusion variant_functions --lib
  • cargo check -p paimon-datafusion
  • cargo check --workspace
  • git diff --check
  • mkdocs build --strict

Notes

This PR implements the common scalar-function surface first. schema_of_variant, schema_of_variant_agg, to_variant_object, variant_explode, variant_explode_outer, and direct variant_get casts to ARRAY / MAP / STRUCT are left for follow-up work.

@leaves12138 leaves12138 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the large Variant implementation. The core shape looks good and the targeted Variant tests pass locally, but I found a couple of correctness issues that should be fixed before merging.

}
}

pub fn validate_payload(value: &[u8], metadata: &[u8]) -> Result<()> {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation is used by the row/binary-row writers before accepting external Arrow / buffers, but it only checks the metadata version and total sizes. For example, and passes , so a non-null Variant with an empty value buffer can be persisted and only fail later when / reads it. Please validate at least the root value buffer here (and ideally recursively validate object/array offsets and metadata ids), or avoid using this helper at ingestion boundaries where malformed Variant bytes must be rejected.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0194ca4: validate_payload now validates the metadata dictionary, root value size, scalar payloads, object metadata ids/key order, and recursively checks object/array child ranges/offsets. I also updated row/binary-row tests to use real Variant payloads and added malformed payload coverage.

if args.arg_fields.len() != 2 && args.arg_fields.len() != 3 {
return plan_err(format!("{} expects 2 or 3 arguments", self.name()));
}
let output = variant_get_output_type(args.scalar_arguments.get(2).and_then(|v| *v))?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the third argument is not a scalar literal, is , so the planner silently chooses the default return type. I reproduced this with where : the result schema is still the Variant struct instead of an integer (or a planning error). Since DataFusion needs a fixed output type, please reject non-literal third arguments during planning, rather than treating them as omitted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 0194ca4: variant_get now treats the 2-arg form as the only omitted-type case; if the 3rd argument is present but not a scalar string literal, planning fails instead of silently returning Variant. Added a regression test for a column-backed type argument.

@leaves12138 leaves12138 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the large Variant implementation. The core shape looks good and the targeted Variant tests pass locally, but I found a couple of correctness issues that should be fixed before merging.

}
}

pub fn validate_payload(value: &[u8], metadata: &[u8]) -> Result<()> {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation is used by the row/binary-row writers before accepting external Arrow value/metadata buffers, but it only checks the metadata version and total sizes. For example, value = [] and metadata = [0x01] passes VariantType::validate_payload, so a non-null Variant with an empty value buffer can be persisted and only fail later when VariantRef/to_json reads it. Please validate at least the root value buffer here (and ideally recursively validate object/array offsets and metadata ids), or avoid using this helper at ingestion boundaries where malformed Variant bytes must be rejected.

if args.arg_fields.len() != 2 && args.arg_fields.len() != 3 {
return plan_err(format!("{} expects 2 or 3 arguments", self.name()));
}
let output = variant_get_output_type(args.scalar_arguments.get(2).and_then(|v| *v))?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the third argument is not a scalar literal, scalar_arguments.get(2) is None, so the planner silently chooses the default Variant return type. I reproduced this with SELECT variant_get(parse_json('{"age":26}'), '$.age', ty) FROM t where ty = 'int': the result schema is still the Variant struct instead of an integer (or a planning error). Since DataFusion needs a fixed output type, please reject non-literal third arguments during planning, rather than treating them as omitted.

@leaves12138 leaves12138 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The two blocking issues from my previous review are fixed: Variant payload validation now rejects malformed root/object payloads, and variant_get rejects non-literal type arguments during planning. I reran targeted checks locally and they passed.

@JingsongLi JingsongLi merged commit 839448a into apache:main Jul 5, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants