Add Variant support and DataFusion functions#458
Conversation
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the large Variant implementation. The core shape looks good and the targeted Variant tests pass locally, but I found a couple of correctness issues that should be fixed before merging.
| } | ||
| } | ||
|
|
||
| pub fn validate_payload(value: &[u8], metadata: &[u8]) -> Result<()> { |
There was a problem hiding this comment.
This validation is used by the row/binary-row writers before accepting external Arrow / buffers, but it only checks the metadata version and total sizes. For example, and passes , so a non-null Variant with an empty value buffer can be persisted and only fail later when / reads it. Please validate at least the root value buffer here (and ideally recursively validate object/array offsets and metadata ids), or avoid using this helper at ingestion boundaries where malformed Variant bytes must be rejected.
There was a problem hiding this comment.
Addressed in 0194ca4: validate_payload now validates the metadata dictionary, root value size, scalar payloads, object metadata ids/key order, and recursively checks object/array child ranges/offsets. I also updated row/binary-row tests to use real Variant payloads and added malformed payload coverage.
| if args.arg_fields.len() != 2 && args.arg_fields.len() != 3 { | ||
| return plan_err(format!("{} expects 2 or 3 arguments", self.name())); | ||
| } | ||
| let output = variant_get_output_type(args.scalar_arguments.get(2).and_then(|v| *v))?; |
There was a problem hiding this comment.
If the third argument is not a scalar literal, is , so the planner silently chooses the default return type. I reproduced this with where : the result schema is still the Variant struct instead of an integer (or a planning error). Since DataFusion needs a fixed output type, please reject non-literal third arguments during planning, rather than treating them as omitted.
There was a problem hiding this comment.
Addressed in 0194ca4: variant_get now treats the 2-arg form as the only omitted-type case; if the 3rd argument is present but not a scalar string literal, planning fails instead of silently returning Variant. Added a regression test for a column-backed type argument.
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the large Variant implementation. The core shape looks good and the targeted Variant tests pass locally, but I found a couple of correctness issues that should be fixed before merging.
| } | ||
| } | ||
|
|
||
| pub fn validate_payload(value: &[u8], metadata: &[u8]) -> Result<()> { |
There was a problem hiding this comment.
This validation is used by the row/binary-row writers before accepting external Arrow value/metadata buffers, but it only checks the metadata version and total sizes. For example, value = [] and metadata = [0x01] passes VariantType::validate_payload, so a non-null Variant with an empty value buffer can be persisted and only fail later when VariantRef/to_json reads it. Please validate at least the root value buffer here (and ideally recursively validate object/array offsets and metadata ids), or avoid using this helper at ingestion boundaries where malformed Variant bytes must be rejected.
| if args.arg_fields.len() != 2 && args.arg_fields.len() != 3 { | ||
| return plan_err(format!("{} expects 2 or 3 arguments", self.name())); | ||
| } | ||
| let output = variant_get_output_type(args.scalar_arguments.get(2).and_then(|v| *v))?; |
There was a problem hiding this comment.
If the third argument is not a scalar literal, scalar_arguments.get(2) is None, so the planner silently chooses the default Variant return type. I reproduced this with SELECT variant_get(parse_json('{"age":26}'), '$.age', ty) FROM t where ty = 'int': the result schema is still the Variant struct instead of an integer (or a planning error). Since DataFusion needs a fixed output type, please reject non-literal third arguments during planning, rather than treating them as omitted.
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the update. The two blocking issues from my previous review are fixed: Variant payload validation now rejects malformed root/object payloads, and variant_get rejects non-literal type arguments during planning. I reran targeted checks locally and they passed.
Summary
Adds Java-compatible
VARIANTsupport to the Rust implementation and registers Spark-style Variant scalar functions in the DataFusion integration. This covers the end-to-end path from schema type handling and row/Arrow encoding to SQL JSON parsing and path extraction.Changes
VARIANTtype support, including schema JSON fixtures, Datum support, BinaryRow/BTree encoding, Arrow mapping, and row-format read/write handling forvalue+metadatabuffers.parse_json,try_parse_json,variant_get,try_variant_get, andis_variant_null.SessionContextregistration, and current limitations.Testing
cargo test -p paimon variant --libcargo test -p paimon-datafusion variant_functions --libcargo check -p paimon-datafusioncargo check --workspacegit diff --checkmkdocs build --strictNotes
This PR implements the common scalar-function surface first.
schema_of_variant,schema_of_variant_agg,to_variant_object,variant_explode,variant_explode_outer, and directvariant_getcasts toARRAY/MAP/STRUCTare left for follow-up work.