Skip to content

Adding postgres json binary operators for nested data strcutures #18210

@sunng87

Description

@sunng87

Is your feature request related to a problem or challenge?

Hello community,

I have been thinking about adding Postgres style JSON operators for nested data structures, mostly for Struct and List. These operators include:

  • ->, ->>, #>>: Access data field by index/key
  • @>, <@, ?, ?|, ?&: Containment testing
  • ||, -, #-: Data structure manipulation
  • @?, @@: Predicate testing

Describe the solution you'd like

Just want to make sure I'm in the right direction.

  1. I assume we won't have built-in json type in datafusion, so these operators will be implemented directly on Struct, List and other json-like primitives directly, following postgres' semantics of them. I noticed we have VARIANT coming to arrow/datafusion, will we have a new DataType for VARIANT? If so, it will be good option for input and return type of these operators.
  2. At the moment, we don't have support for operators on nested data structure and primitives. If the left input is nested, we will assume the right array is nested too, and perform compare operators recursively:
    if left_data_type.is_nested() {
    if !left_data_type.equals_datatype(&right_data_type) {
    return internal_err!("Cannot evaluate binary expression because of type mismatch: left {}, right {} ", left_data_type, right_data_type);
    }
    return apply_cmp_for_nested(self.op, &lhs, &rhs);
    }
    I will need to change this behavior.
  3. Some of the operators may create dynamic results if the right input is Array. For example, if the right array of -> is ["a", "b", "c"], it is expected to return 3 different data types in result set which breaks our type system. So for these > operators, I'm going to support scalar version only.
  4. Also I expected less strict check in
    let result_type = self.data_type(input_schema)?;
    because these > will create dynamic return types.

Some of the kernels are going to be implemented in arrow-rs first, and integrate into datafusion.

Let me know if these changes will make sense, and align with our previous plan if any. And I will start to send pull requests on both repos.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions