-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-33899: [C++] Add NamedTapRel relation as a Substrait extension #33909
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick look. The Substrait side of this makes sense but I'm a little unclear on what the plan is for the C++ side of things.
struct ARROW_ENGINE_EXPORT NamedTapNodeOptions : public compute::ExecNodeOptions { | ||
NamedTapNodeOptions(const std::string& name, std::shared_ptr<Schema> schema) | ||
: name(name), schema(std::move(schema)) {} | ||
|
||
std::string name; | ||
std::shared_ptr<Schema> schema; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the plan is here. I thought originally the goal was to mimic the named table. In that case I'd expect to see something like NamedTableProvider
(e.g. NamedTapProvider
) that translates names into actual nodes (e.g. a TeeNode
).
In that case I wouldn't expect there to be any "named tap" equivalent in Acero. I'm not sure what NamedTapNode
and AddPassFactory
are providing in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a design part to discuss. The current PR implementation takes the kind
field (i.e., named_tap_rel.kind()
) as the factory name in the declaration. There is no need for a new NamedTapProvider
concept when the factory registry is the provider of taps. OTOH, I agree with your mapping idea - adding a mapping from the kind
field to the factory name would improve.
So, the plan here is:
NamedTapRel
: thekind
field selects the tap's kind, thename
field configures the tap's action, and thecolumns
field configures the tap's output schema.NamedTapNodeOptions
: thename
field configured the tap's action as inNamedTapRel
, and theschema
field is the output schema configured for the tap. The selection of the tap kind is not done via options, so it's not included in them.
I could add a description along these lines in docstrings.
The test code shows a vanilla example of this. The AddPassFactory
tap is registered in the factory registry. It's configuration is not the focus of the test. In a (future) test case of a specific tap, we would cast the options argument of the tap's Make
function to NamedTapNodeOptions
and use the fields there to configure it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you return exec node options (similar to named tap provider) then you could bypass the need to encode properties into the name (you wouldn't even really need kind
):
# in python
def tap_provider(name):
if name == 'one':
return TeeNodeOptions('/tmp/dataset_one')
elif name == 'two':
return TeeNodeOptions('/tmp/dataset_two')
else:
raise Error(...)
or you could move the name encoding / decoding into python
# in python
def tap_provider(name)
path = get_path_from_name(name)
return TeeNodeOptions(path)
or you could still use the kind mapping
# in python
def tap_provider(name)
kind = get_kind_from_name(name)
if kind == 'tee':
return TeeNodeOptions(path)...
However, this feature is still pretty experimental so I don't mind sticking with kind<->factory name mapping for now if that is what you would prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind we should do the following:
(1) Add "NamedTapRel" as a purely abstract relation in Acero. The Acero substrait consumer code itself doesn't care or know how to turn a NamedTapRel message into a Declaration.
(2) Add a C++ extension API that allows downstream users / implementer of the NamedTapRel to tell Acero "hey, if you see a NamedTapRel message, call into the custom substrait extension function that I provide here and I will return a declaration to u" (basically this: #33850)
This way I believe we don't need most the stuff here (including changes to pass conv_opts, named_tap_mapper, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace @rtpsw WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree a specific structure can be useful however in this case I don't feel it is much more useful than a general one in this case. For example, the specific structure for the NamedTapNode make function creates a NamedTapNode option, and doesn't not allow the user to return a declaration with a custom node option (e.g, in our case, a WriteSmoothNodeOption), so we need to handle that in our code after calling this NamedTapNode make function here (or totally by pass NamedTapNode function here). This an be done but I feel that is unnecessary indirection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @rtpsw that going too generic (protobuf.Any in and Declaration out) is not a good idea. This is the exact API for the ExtensionProvider
and so we already have this. You could use this for WriteSmoothNodeOption
and it would end up looking just like the extension that is in place for AsofJoinNode. If that works for you then there is nothing we need to do :)
So if we want to go less generic then we should look at what Acero provides at each step to simplify things:
Generic Node
The next step down in generality would be to remove protobuf from the picture using Substrait/Arrow literals:
message GenericOptions {
message Property {
string key = 0;
Literal value = 1;
}
repeated Property properties = 1;
}
message GenericNode {
string kind = 0;
GenericOptions options = 1;
}
The API in Arrow would be:
using Options = unordered_map<string, Scalar>;
void RegisterGenericExtensionHandler(string kind, function<Declaration(Options)> handler);
In other words, given properties as a map of arrow scalars, create a declaration
Pros:
- No need for extension author to write proto file
- Can be used for any extension type
Cons:
- Need to create an entire ExecNode for your custom behavior
Generic Mapper
An even more focused approach could be to remove the need to create an exec node at all. This would be focused on the 1 input in / 1 input out / no accumulation case (e.g. what MapNode handles). From what I know of your write need this would fit this case.
The Substrait would be pretty much identical to the above.
message MapNode {
string kind = 0;
GenericOptions options = 1;
}
However, the Arrow API would be:
using Options = unordered_map<string, Scalar>;
using Mapper = function<ExecBatch(ExecBatch)>;
void RegisterGenericMapHandler(string kind, function<Mapper(Options)> handler);
Pros:
- No need to write proto file
- No need to create an exec node
Cons:
- Only handles 1-in relations that don't accumulate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, sorry, I was starting a bit with a blank slate above. I think, in this PR, the current idea is that instead of using GenericOptions
we use string name
and encode the options into the name. This is probably less flexible but in many cases users might be more familiar with encoding options into strings (ala URL encoding) instead of dealing with Arrow scalars. I'm fine with either approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So sticking with name
I think the two approaches are:
Generic Node
void RegisterGenericExtensionHandler(string kind, function<Declaration(string)> handler);
Generic Mapper
using Mapper = function<ExecBatch(ExecBatch)>;
void RegisterGenericMapHandler(string kind, function<Mapper(string)> handler);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow missed this message - I am leaning towards "Generic Node" since it allows more flexibility to define node that is not just "map-like" and also we don't need to write protobuf file.
I have a question - if we go the "Generic Node" route it looks like we can achieve register custom handlers via either RegisterGenericExtensionHandler
or set_default_extension_provider
(7423f03#diff-8c5c127e8db8e52b40519262ee2b7e0179d4a39e5bcfe8b2165173b2b352d6a9R139) so technically we don't need set_default_extension_provider
if we have RegisterGenericExtensionHandler
?
@westonpace, in the recent commit, when coding up |
Note that the factory registry does not support scoping, which could be useful for cleaning up after and isolating testing (here, due to |
What would you initialize it to? There is no default named table provider. I'm pretty sure we are using
|
I'm not opposed to this but I don't think it is really necessary. The tests all rely on the fact that the initialization methods are setup as "call-once" so calling them repeatedly won't be a problem. When dealing with the function registry it seemed important to support nested registries because you might have query-specific functions (e.g. if a UDF is embedded this could be one way to handle it). However, the odds of having query-specific nodes seems pretty slim to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this further I do worry that kind
will be too restrictive. For example, if you are writing to a temporary output directory and the name
specifies the path then how do you configure the directory to write to?
Perhaps that might be what you intended with the nesting of node factories? E.g. you would register a node factory that had the output directory hard-coded as a part of the node factory?
I would have the default implementation return the invalid status, like the one you pointed to. |
The main purpose I had in mind for this is test isolation. I agree it's not strictly necessary. |
Something like
One could register some node factory to implement a very specific custom behavior, and it's good that this option is available to users, but the above example is simple enough that it won't be needed for it. I believe this would be the case with many examples. If we settle on this |
My understanding is that even with the most generic solution Weston described, we would need some message defined since the current design has the extension provider check for the message using |
@westonpace, are you leaning toward accepting the current version of the PR, or something close to it? @icexelloss would be OK with reducing the PR to just adding the message, in case you'd prefer this. |
IMO we should reduce this PR to be just the extension protobuf/message. The main reason is that most of the other change in this PR will not be used by us anyway because we will using our custom extension provider to handle the NamedTapMessage (since we will be creating a WriteSmoothNode from that NamedTapMessage) so it will not be hitting many of the code changes in this PR. Therefore introducing a bunch of code that will not be used doesn't seem necessarily. |
If we changed |
@westonpace would u be ok with having just the proto change in I think since the we can register custom substrait extension consumer, we won't really need to call into the implementation for "NamedTapRel" in the default substrait consumer. If there is a use case can justifies support NamedTapRel in the default consumer in the future I think we can add that later (but currently, our use case doesn't need it). WDTY? |
One thing I am thinking is for testing maybe we can do the following:
This way the test serves and both an example for using NamedTapRel and also mimicing how we will be using this internally. |
I'd rather not just add proto without any support for processing that proto. A mapper that returns a declaration seems simple enough (and this PR is 80% of the way there) and this makes it clear what the intent of the proto message is even if you don't end up using it. |
Fair enough. Let do that @rtpsw |
I realize there might be some confusion so I want to clarify here. There seems to be two (or three) proposed solutions: It's not clear to me if we have agreement to go with one vs the other? Both seem workable to me. (2) does seem a bit more generic / future proof and (1) seems pretty much done. Li |
If 2 is acceptable then I would ask that we please move to that. I think the only change is to have the mapper return a declaration instead of a string. If we feel there is still some ambiguity I'd be happy to make a PR / diff real quick to show what I am thinking. |
The recent commit has |
2 is acceptable to me. @westonpace are u asking for "Generic Node" or "Generic Mapper"? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I hadn't noticed that commit. I agree this looks close. I have a few more questions / minor suggestions
using NamedTapKindMapper = std::function<Result<compute::Declaration>( | ||
const std::string&, std::vector<compute::Declaration::Input>, | ||
std::shared_ptr<compute::ExecNodeOptions>)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. I think we can do one more small tweak.
using NamedTapKindMapper = std::function<Result<compute::Declaration>( | |
const std::string&, std::vector<compute::Declaration::Input>, | |
std::shared_ptr<compute::ExecNodeOptions>)>; | |
using NamedTapKindMapper = std::function<Result<compute::Declaration>( | |
const std::string&, std::vector<compute::Declaration::Input>, | |
std::shared_ptr<Schema>, std::string name)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
const ExtensionDetails& ext_details, | ||
const ExtensionSet& ext_set) = 0; | ||
}; | ||
|
||
ARROW_ENGINE_EXPORT std::shared_ptr<ExtensionProvider> default_extension_provider(); | ||
|
||
struct ARROW_ENGINE_EXPORT NamedTapNodeOptions : public compute::ExecNodeOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can pass the schema and name directly instead of using this class. If we still want to wrap the two of them in some kind of object (to reduce the # of args passed into the mapper?) then we should include kind
, do not extend ExecNodeOptions
, and call it something like NamedTap or NamedTapOptions.
In other words, we shouldn't try to form the node options from the protobuf. That will be the mapper's job. We just need to pass whatever was in the protobuf along to the mapper and let the mapper figure out what node options are appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/// \brief A custom strategy to be used for mapping a tap kind to a function name | ||
/// | ||
/// The default mapper returns a declaration whose factory name is equal to the tap kind | ||
NamedTapKindMapper named_tap_mapper = kDefaultNamedTapKindMapper; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about renaming this to NamedTapProvider
for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
using NamedTapKindMapper = std::function<Result<compute::Declaration>( | ||
const std::string&, std::vector<compute::Declaration::Input>, | ||
std::shared_ptr<compute::ExecNodeOptions>)>; | ||
static NamedTapKindMapper kDefaultNamedTapKindMapper = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we will want to make the default named tap mapper (or the default named table provider for that matter) configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it configurable.
[](const std::string& kind, std::vector<compute::Declaration::Input> inputs, | ||
std::shared_ptr<compute::ExecNodeOptions> options) | ||
-> Result<compute::Declaration> { | ||
return compute::Declaration(kind, inputs, options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove NamedTapNodeOptions
like some of my other comments suggest then I suppose there is no more meaningful default mapper. That being said, I don't know of any use case where this default would be correct. Perhaps we just return an error here like we do with named tables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works. Thank you for your persistence.
Benchmark runs are scheduled for baseline = 8fed97f and contender = 24e5a58. 24e5a58 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
…on (apache#33909) See apache#33899. This PR adds `NamedTapRel` and a simple test case with a no-op tap (i.e., just passing-through). * Closes: apache#33899 Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
…on (apache#33909) See apache#33899. This PR adds `NamedTapRel` and a simple test case with a no-op tap (i.e., just passing-through). * Closes: apache#33899 Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
See #33899. This PR adds
NamedTapRel
and a simple test case with a no-op tap (i.e., just passing-through).NamedTapRel
relation as a Substrait extension #33899