Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add support for registering tricky functions with the Substrait consumer (or add a bunch of substrait meta functions) #31046

Closed
asfimport opened this issue Feb 5, 2022 · 9 comments

Comments

@asfimport
Copy link

Sometimes one Substrait function will map to multiple Arrow functions. For example, the Substrait add function might be referring to Arrow's add or add_checked. We need to figure out how to register this correctly (e.g. one possible approach would be a substrait_add meta function).

Other times a substrait function will encode something Arrow considers an "option" as a function argument. For example, the is_in Arrow function is unary with an option for the lookup set. The substrait function is binary but the second argument must be constant and be the lookup set. Neither of which is to be confused with a truly binary is_in function which takes in a different set at every row.

It's possible there is no work to do here other than adding a bunch of substrait_ meta functions in Arrow. In that case all the work will be done in other JIRAs. Or, it is possible that there is some kind of extension we can make to the function registry that bypasses the need for the meta functions. I'm leaving this JIRA open so future contributors can consider this second option.

Reporter: Weston Pace / @westonpace
Assignee: Weston Pace / @westonpace

PRs and other links:

Note: This issue was originally created as ARROW-15582. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Weston Pace / @westonpace:
There are <100 "standard" Substrait functions right now but this list will probably grow. In general I do not think it is safe to assume that Substrait functions & Arrow functions will share the same name. Even if two functions do exist with the same name I don't think it's safe to assume they will have the same behavior. I think some kind of "mapping" object is going to have to be maintained.

At a minimum one would think this mapping object would be a simple bidirectional string:string map which goes from Arrow function name to Substrait function name and back. Unfortunately, as the ticket describes, I do not think this is possible today.

The worst case scenario is that we require two functions for every entry in the mapping. One that goes from a Substrait "call" to an Arrow "call" and the reverse. I think, as a first attempt, we should tackle this with a very manual mapping, probably with some kind of convenience option for the functions that are simple aliases and then we can look at how we improve from there.

A substrait "call" is a name (string), a vector of arguments (expressions), and a vector of options (literal expressions). An arrow "call" is a name (string), a vector of arguments (expressions), and an options object (POCO).

So my suggestion for the mapping would be something like...


using ArrowToSubstrait = std::function<substrait::Expression::ScalarFunction(const arrow::compute::Expression::Call&, std::vector<substrait::Expression>)>;
using SubstraitToArrow = std::function<arrow::compute::Expression::Call(const substrait::Expression::ScalarFunction&)>;
class FunctionMapping {

  // Registration API
  AddArrowToSubstrait(std::string arrow_function_name, ArrowToSubstrait conversion_func);
  AddSubstraitToArrow(std::string substrait_function_name, SubstraitToArrow conversion_func);

  // Usage API
  substrait::Expression::ScalarFunction ToProto(const arrow::compute::Expression::Call& call);
  arrow::compute::Expression::Call FromProto(const substrait::Expression::ScalarFunction& call);
};

The add function is an interesting example (some pseudo-code / imaginary helper functions for brevity):


SubstraitToArrow substrait_add_to_arrow = [] (const substrait::Expression::ScalarFunction& call)  {
  // Note, Substrait scalar functions don't distinguish between options and arguments so the
  // index of this option is 2 because it comes after the operands (at index 0 and 1).
  // This is why we have to specify how many args there are in the GetArgs invocation.
  auto args = GetArgs(call, 2);
  EnumLiteral overflow_handling = GetOption<EnumLiteral>(call, 2);
  if (IsSpecified(overflow_handling)) {
    switch (GetEnumValue(overflow_handling)) {
      case "SILENT":
        return call("add", args);
      case "SATURATE":
        return Status::Invalid("Arrow does not have a saturating add");
      case "ERROR":
        return call("add_checked", args);
    }
  } else {
    // Default to unchecked add because SILENT => unchecked and SILENT
    // is the first option in the enum (and thus the highest priority when
    // not specified)
    return call("add", args);
  }
};
// Note, we can automatically do the conversion from arrow args to Substrait args because
// we distinguish between args and options in Arrow.
ArrowToSubstrait arrow_add_to_substrait = [] (const arrow::compute::Expression::Call& call, std::vector<substrait::Expression> args) {
  var overflow_behavior = MakeEnum("ERROR");
  var all_args = Concat(std::move(args), {overflow_behavior});
  return MakeSubstraitCall("add", std::move(all_args));
};
ArrowToSubstrait arrow_unchecked_add_to_substrait = [] (const arrow::compute::Expression::Call& call, std::vector<substrait::Expression> args) {
  var overflow_behavior = MakeEnum("SILENT");
  var all_args = Concat(std::move(args), {overflow_behavior});
  return MakeSubstraitCall("add", std::move(all_args));
};
function_mapping.AddSubstraitToArrow("add", substrait_add_to_arrow);
function_mapping.AddArrowToSubstrait("add", arrow_add_to_substrait);
function_mapping.AddArrowToSubstrait("add_unchecked", arrow_add_unchecked_to_substrait);

@asfimport
Copy link
Author

Weston Pace / @westonpace:
It's unclear how much testing should be done here. Conformance testing for Substrait is a large question and could/should be shared by multiple implementations. If there were a sophisticated conformance tester then testing of the mapping code could be fairly minimal.

Also, if there were conformance testing then I think one could prioritize / only implement the Substrait -> Arrow path as the Arrow -> Substrait path is currently only used for testing (although one can imagine non-testing applications).

@asfimport
Copy link
Author

Weston Pace / @westonpace:
Also, the above example of add/add_unchecked is a worst case scenario. Some functions should be much simpler:


// The "less than" function has no options in both arrow and Substrait
AddSimpleMapping(function_mapping, "less", "lt");

@asfimport
Copy link
Author

@asfimport
Copy link
Author

David Li / @lidavidm:
I agree we're going to need some sort of mapping. I think someone will just have to sit down and look through all the functions to determine how best to structure/maintain this, though; maybe most are fairly trivial, some patterns exist like for arithmetic, and a few are just special cases (if_else etc. perhaps).

@asfimport
Copy link
Author

Sanjiban Sengupta / @sanjibansg:
The following spreadsheet provides a table of functions in Substrait and their equivalent in Arrow along with the required options for both of them.

https://docs.google.com/spreadsheets/d/1Jm7vt-sTxsmB7HlLsdWPk6LFINcOGzZjU9SL4ZtiYoY/edit?usp=sharing

@asfimport
Copy link
Author

Yaron Gvili / @rtpsw:
This is an interesting discussion for me as I ran into this issue myself. In my specific use case, which only required a specific set of functions, I was able to manage by hard-coding a couple of simple Substrait-to-Arrow function-name mappings and by adding a special case for cast, which I'm not sure fits the above API - the return-type, which affects the cast operation, might be missing from the Arrow-to-Substrait API. At least for me, it would be useful to get a short-term solution that provides (perhaps configurable) simple mappings.

In the context of the general discussion, Substrait also has a ternary-function "clip" that does not currently appear in the list. Some possible solutions for it are:

  1. Map "clip(x, a, b)" to an Arrow expression like "min_element_wise(max_element_wise(x, a), b)". This solution would work with the above Substrait-to-Arrow API but would require some kind of expression-matching in the reverse direction.
  2. Add An Arrow "clip" function. AFAIK, Arrow has good support for unary and binary scalar kernels but not for ternary ones, so this solution might involve adding this support first.
  3. Translate "clip(x, a, b)" to "clip_upper(clip_lower(x, a), b)" in Substrait and then add simple mappings from "clip_upper" and "clip_lower" in Substrait to "min_element_wise" and "max_element_wise" in Arrow, respectively. This solution has an impact on the Substrait DSL specification.

@asfimport
Copy link
Author

Weston Pace / @westonpace:
I'd have a slight preference for #2 (Arrow has two ternary functions at the moment, if_else, and replace_with_mask, and it shouldn't be too bad to add more).

I think #1 is something that will happen a lot at some point but I feel like it lives in the realm of the query planner/optimizer. So I'd almost want to say "Arrow doesn't support that function" before we get into the realm of "equivalent but not identical plans".

Having something like #3 in Substrait would possible enable something like #1 to happen in a query planner. One could then imagine the following conversation between planner and consumer:

  • Planner: Do you support clip?

  • Consumer: No

  • Planner: Do you support clip_lower and clip_upper?

  • Consumer: Yes

  • Planner produces plan with clip_lower and clip_upper.

    I'm happy for any compromise / alternatives for the short term. There is also a related JIRA ( ARROW-15535 ) which covers automatic generation of YAML.

    The way I think about it is that the "standard Substrait namespace" will require a high degree of manual mapping. However, the "Arrow namespace" should be automatically generated. Having the Arrow namespace will likely be good enough for initial development tasks and quickly expose the entirety of Arrow's compute functionality while we wait for the longer standards-approved methodologies to roll in.

    For example, the automatic generation should be able to easily map min_element_wise and max_element_wise so that we can use those while testing other features and prototyping. Then any prototypes can switch over to using "clip" as they need to start supporting multi-consumer support, etc.

    I think we could also have a short term solution for the standard Substrait namespace too if someone wants to put together a PR.

@asfimport
Copy link
Author

Weston Pace / @westonpace:
Issue resolved by pull request 13613
#13613

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants