Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15582: [C++] Add support for registering tricky functions with the Substrait consumer (or add a bunch of substrait meta functions) #13285

Draft
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

sanjibansg
Copy link
Contributor

[WIP]
This PR adds function mappings for compute functions from substrait to arrow and vice-versa. This introduces a FunctionMapping class to register and store the mappings and supply when required. Registering a function includes encoding the various options and arguments in the respective mapping function's definition.

@github-actions
Copy link

github-actions bot commented Jun 1, 2022

@github-actions
Copy link

github-actions bot commented Jun 1, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make a pass at simplifying some things. A lot of these lambdas seem to follow a consistent pattern. This is a good start however! Excited to see it.

Status FunctionMapping::AddArrowToSubstrait(std::string arrow_function_name, ArrowToSubstrait conversion_func){
if (arrow_to_substrait.find(arrow_function_name) != arrow_to_substrait.end()){
arrow_to_substrait[arrow_function_name] = conversion_func;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return an invalid status as an else clause here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an AlreadyExist status.

Status FunctionMapping::AddSubstraitToArrow(std::string substrait_function_name, SubstraitToArrow conversion_func){
if (substrait_to_arrow.find(substrait_function_name) != substrait_to_arrow.end()){
substrait_to_arrow[substrait_function_name] = conversion_func;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, perhaps the else should be an invalid status.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning a AlreadyExist status here, will that be better?

cpp/src/arrow/engine/substrait/extension_set.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/extension_set.h Outdated Show resolved Hide resolved
@sanjibansg sanjibansg force-pushed the substrait/compute_functions branch from a56a556 to b28e5b1 Compare June 3, 2022 10:31
@sanjibansg sanjibansg force-pushed the substrait/compute_functions branch from 41fc8fe to 7e27a7f Compare June 9, 2022 21:13
@westonpace westonpace self-requested a review June 16, 2022 15:47
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions from a quick scan.

return compute::call(func_name, std::move(arguments), std::move(cast_options));
}
case substrait::Expression::kEnum: {
auto enum_expr = expr.enum_();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this convert to the string value of the enum? Can you add a small comment here explaining that.

@@ -204,6 +209,521 @@ const int* GetIndex(const KeyToIndex& key_to_index, const Key& key) {
return &it->second;
}

Status FunctionMapping::AddArrowToSubstrait(std::string arrow_function_name, ArrowToSubstrait conversion_func){
if (arrow_to_substrait.find(arrow_function_name) != arrow_to_substrait.end()){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic seems backwards to me...wouldn't umap.find(...) != umap.end() mean the item already existed?

}

Status FunctionMapping::AddSubstraitToArrow(std::string substrait_function_name, SubstraitToArrow conversion_func){
if (substrait_to_arrow.find(substrait_function_name) != substrait_to_arrow.end()){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. This seems backwards (but maybe I'm just not thinking right)

}
}

std::vector<arrow::compute::Expression> substrait_convert_arguments(const substrait::Expression::ScalarFunction& call){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<arrow::compute::Expression> substrait_convert_arguments(const substrait::Expression::ScalarFunction& call){
std::vector<arrow::compute::Expression> ConvertSubstraitArguments(const substrait::Expression::ScalarFunction& call){


std::vector<arrow::compute::Expression> substrait_convert_arguments(const substrait::Expression::ScalarFunction& call){
substrait::Expression value;
ExtensionSet ext_set_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ExtensionSet ext_set_;
ExtensionSet ext_set;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems strange. Wouldn't this function take in an extension set as an argument?


substrait::Expression::ScalarFunction arrow_convert_enum_arguments(const arrow::compute::Expression::Call& call, substrait::Expression::ScalarFunction& substrait_call, ExtensionSet* ext_set_, std::string overflow_handling){
substrait::Expression::Enum options;
options.set_specified(overflow_handling);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overflow_handling seems like an odd name given this is a generic function

return arrow::compute::call("abs", substrait_convert_arguments(call));
};

ArrowToSubstrait arrow_add_to_substrait = [] (const arrow::compute::Expression::Call& call, ExtensionSet* ext_set_) -> Result<substrait::Expression::ScalarFunction> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of places where you have ext_set_ and it should probably be ext_set. For the sake of brevity I'm not going to mark them all.

Comment on lines 668 to 670
substrait::Expression::ScalarFunction substrait_call;
ARROW_ASSIGN_OR_RAISE(auto function_reference, ext_set_->EncodeFunction("extract"));
substrait_call.set_function_reference(function_reference);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these calls to EncodeFunction seem pretty repetitive. Is there any way we can move this into the part that calls GetArrowToSubstrait? Also, I don't see anything today that calls GetArrowToSubstrait

}
};

ArrowToSubstrait arrow_year_to_arrow = [] (const arrow::compute::Expression::Call& call, ExtensionSet* ext_set_) -> Result<substrait::Expression::ScalarFunction> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrow_...to_arrow?

Comment on lines +812 to +813
DCHECK_OK(functions_map.AddSubstraitToArrow(id.name.to_string(), conversion_func));
return RegisterFunction(id, id.name.to_string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a little odd that we need two maps. What happens if two functions exist with the same name but different URIs? Thinking on this longer, maybe substrait_to_arrow should replace the map in the extension id registry (that gets updated by the call to RegisterFunction?)

westonpace added a commit that referenced this pull request Aug 10, 2022
…ctions (#13613)

This picks up where #13285 has left off.  It mostly focuses on the Substrait->Arrow direction at the moment.  In addition, basic support is added for named tables.  This makes it possible to create unit tests that read from in-memory tables instead of requiring unit tests to do a scan.

The PR creates some utilities in `test_plan_builder.h` which allow for the construction of simple Substrait plans programmatically.  This is used to create unit tests for the function mapping.

The PR extracts id "ownership" out of the `ExtensionIdRegistry` and into its own `IdStorage` class.

The PR gets rid of `NestedExtensionIdRegistryImpl` and instead makes `ExtensionIdRegistryImpl` nested if `parent_ != nullptr`.



Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants